
ACTIVATION FUNCTION

An activation function in a neural network defines how the weighted sum of the inputs is
transformed into an output from a node or nodes in a layer of the network.
The basic goal of activation functions is to give the neural network non-linear qualities:
they convert a node's linear input signal into a non-linear output signal, which allows deep
networks to learn mappings far more complex than a single first-degree (linear) function.
Activation functions can broadly be divided into two types:
1. Linear Activation Function
2. Non-linear Activation Functions
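
As a quick illustration (a minimal NumPy sketch, not tied to any particular framework), here is how a linear activation and a few common non-linear activations transform a node's weighted input sum z:

```python
import numpy as np

def linear(z):
    return z                          # identity: output is proportional to the input

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # example weighted sums
for f in (linear, sigmoid, tanh, relu):
    print(f.__name__, f(z).round(3))
```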
Backpropagation
 Backpropagation is the method of fine-tuning the weights of a neural network based on
the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights
reduces error rates and makes the model more reliable by improving its generalization.
 In neural networks, backpropagation is short for “backward propagation of errors.”
It is a standard method of training artificial neural networks. It calculates the gradient
of a loss function with respect to all the weights in the network.
 Backpropagation (backward propagation) is a supervised learning algorithm for training
multi-layer artificial neural networks (ANNs). It is an important mathematical tool
for improving the accuracy of predictions in machine learning.
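
A minimal NumPy sketch of the idea (the layer sizes and data are made up for illustration): the forward pass computes the loss, and the backward pass applies the chain rule layer by layer to obtain the gradient of the loss with respect to every weight, which is then used to update the weights.

```python
import numpy as np

# Toy network: 2 inputs -> 3 hidden units (sigmoid) -> 1 linear output, MSE loss.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))            # 4 training examples, 2 features each
y = rng.normal(size=(4, 1))            # target outputs

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = x @ W1 + b1
a1 = sigmoid(z1)
pred = a1 @ W2 + b2
loss = np.mean((pred - y) ** 2)

# Backward pass: propagate the error backwards with the chain rule
dpred = 2 * (pred - y) / len(x)        # dLoss/dpred
dW2 = a1.T @ dpred                     # gradient for the output-layer weights
db2 = dpred.sum(axis=0)
da1 = dpred @ W2.T                     # gradient flowing back into the hidden layer
dz1 = da1 * a1 * (1 - a1)              # multiplied by the sigmoid derivative
dW1 = x.T @ dz1                        # gradient for the hidden-layer weights
db1 = dz1.sum(axis=0)

# One weight update (gradient-descent step)
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss before update:", round(loss, 4))
```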

Batch Normalization

 Batch normalization is an important feature we can add to a model: it acts as a
regularizer, normalizes the layer inputs, and can be adapted to most models to help them
converge better.
 Batch normalization is a technique that converts the interlayer outputs of a neural network
into a standard format, a process called normalizing. This effectively 'resets' the
distribution of the previous layer's output so that it is processed more efficiently by the
subsequent layer.
 Batch normalization is a feature that we add between the layers of the neural network: it
continuously takes the output of the previous layer and normalizes it before passing it to
the next layer. This has the effect of stabilizing the neural network. Batch normalization
is also used to maintain the distribution of the data.
 Since normalization guarantees that no activation value is too high or too low, and since it
enables each layer to learn somewhat independently of the others, this strategy leads to
faster learning.
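
A minimal NumPy sketch of what batch normalization does to one mini-batch (gamma and beta stand in for the learnable scale and shift parameters; in a real network they would be trained, and running statistics would be kept for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, features): normalize each feature over the batch,
    # then rescale with gamma and shift with beta (both learnable in a real network).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 10 + 5                         # a batch with large mean/scale
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1 per feature
```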

Gradient Descent
What is gradient descent?
 Gradient descent is the most popular optimization strategy used in machine learning and
deep learning at the moment. It is used when training data models, can be combined with
every algorithm and is easy to understand and implement.
 Gradient Descent is an optimization algorithm for finding a local minimum of a
differentiable function. Gradient descent is simply used in machine learning to find the
values of a function's parameters (coefficients) that minimize a cost function as far as
possible.
 Gradient descent is an optimization algorithm that's used when training a machine learning
model. It's based on a convex function. Training data helps these models learn over time,
and the cost function within gradient descent acts as a barometer, gauging accuracy with
each iteration of parameter updates. Until the function is close to or equal to zero, the
model will continue to adjust its parameters to yield the smallest possible error.

What is a Gradient?
 A gradient simply measures the change in all weights with regard to the change in error.
You can also think of a gradient as the slope of a function. The higher the gradient, the
steeper the slope and the faster a model can learn. But if the slope is zero, the model stops
learning. In mathematical terms, a gradient is a partial derivative with respect to its inputs.
Δw_i = η Σ_d (t_d − o_d) x_{i,d}

where:
Δw_i = change in weight w_i
η = learning rate
t_d = target output for training example d
o_d = actual output for training example d
x_{i,d} = the input associated with weight w_i for training example d

 In machine learning, a gradient is a derivative of a function that has more than one input
variable. Known as the slope of a function in mathematical terms, the gradient simply
measures the change in all weights with regard to the change in error.
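
A small sketch of the update rule above for a single linear unit (the data and learning rate are made up for illustration):

```python
import numpy as np

# Delta-rule update: delta_w_i = eta * sum over d of (t_d - o_d) * x_{i,d}
eta = 0.1                                            # learning rate
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])  # one row per training example d
t = np.array([1.0, 0.0, 1.0])                        # target outputs t_d
w = np.zeros(2)                                      # weights to be updated

o = X @ w                    # actual outputs o_d of a simple linear unit
delta_w = eta * (t - o) @ X  # summed over all training examples
w += delta_w
print("updated weights:", w)
```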

Types of Gradient Descent


There are three types of gradient descent learning algorithms: batch gradient descent,
stochastic gradient descent and mini-batch gradient descent.
i. Batch gradient descent
Batch gradient descent sums the error for each point in a training set, updating the model
only after all training examples have been evaluated. This process is referred to as a training
epoch. While this batching provides computational efficiency, it can still have a long
processing time for large training datasets as it still needs to store all of the data into
memory. Batch gradient descent also usually produces a stable error gradient and
convergence, but sometimes that convergence point isn’t the most ideal, finding the local
minimum versus the global one.

ii. Stochastic gradient descent


Stochastic gradient descent (SGD) runs a training epoch for each example within the dataset
and updates each training example's parameters one at a time. Since you only need to hold
one training example at a time, it is easier to store in memory. While these frequent updates
can offer more detail and speed, they can result in losses in computational efficiency when
compared to batch gradient descent.

iii. Mini-batch gradient descent


Mini-batch gradient descent combines concepts from both batch gradient descent and
stochastic gradient descent. It splits the training dataset into small batch sizes and performs
updates on each of those batches.
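
A small sketch of mini-batch gradient descent for linear regression (illustrative only): setting batch_size to the full dataset size gives batch gradient descent, and batch_size = 1 gives stochastic gradient descent.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=16, epochs=100):
    # Mini-batch gradient descent for linear regression with an MSE loss.
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]          # one small batch of examples
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad                             # update after every mini-batch
    return w

X = np.random.randn(200, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * np.random.randn(200)
print(minibatch_gd(X, y).round(2))                     # close to [1.5, -2.0, 0.5]
```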

Multilayer Network

Notation of Nodes
TUNING HYPERPARAMETERS

Hyperparameters
Hyperparameters are parameters whose values control the learning process and
determine the values of model parameters that a learning algorithm ends up learning.
The prefix ‘hyper-’ suggests that they are ‘top-level’ parameters that control the
learning process and the model parameters that result from it.

What is hyperparameter tuning?

Hyperparameter tuning is the process of determining the right combination of
hyperparameters that maximizes the model's performance. It works by running multiple trials
in a single training process. Each trial is a complete execution of your training application
with values for your chosen hyperparameters, set within the limits you specify.

Hyperparameter optimization / tuning methods 

These are some of the hyperparameter optimization methods that are popular today.

Random Search

In the random search method, we create a grid of possible values for hyperparameters. Each
iteration tries a random combination of hyperparameters from this grid, records the
performance, and lastly returns the combination of hyperparameters that provided the best
performance.

Grid Search

In the grid search method, we create a grid of possible values for hyperparameters. Each
iteration tries a combination of hyperparameters in a specific order. It fits the model on each
and every combination of hyperparameters possible and records the model performance.
Finally, it returns the best model with the best hyperparameters.
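
As a rough sketch of both grid search and random search, assuming scikit-learn is available (the estimator and the parameter grid below are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical search space: every name and value here is just for illustration.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: fits the model on every possible combination in the grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print("grid search best:", grid.best_params_)

# Random search: samples a fixed number of random combinations from the same grid.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_)
```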


Bayesian Optimization

Tuning and finding the right hyperparameters for your model is an optimization problem.
We want to minimize the loss function of our model by changing its hyperparameters.
Bayesian optimization helps us find the minimal point in the minimum number of
steps. Bayesian optimization also uses an acquisition function that directs sampling to
areas where an improvement is possible over the current best observation.
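
A minimal sketch, assuming the scikit-optimize (skopt) library: gp_minimize builds a surrogate model of the objective and uses an acquisition function to choose the next hyperparameter values to evaluate (the toy objective stands in for a real validation loss):

```python
from skopt import gp_minimize

def objective(params):
    lr, = params                 # a single hyperparameter, for illustration
    return (lr - 0.1) ** 2       # toy stand-in for a validation loss, minimum at 0.1

# gp_minimize fits a Gaussian-process surrogate and picks each next point
# with an acquisition function, aiming to find the minimum in few evaluations.
result = gp_minimize(objective, dimensions=[(0.001, 1.0)], n_calls=15,
                     random_state=0)
print("best lr:", result.x[0], "best objective:", result.fun)
```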

Tree-structured Parzen estimators (TPE)

The idea of tree-structured Parzen optimization is similar to Bayesian optimization. Instead of
finding the values of p(y|x), where y is the function to be minimized (e.g., validation loss)
and x is the value of the hyperparameters, TPE models P(x|y) and P(y). One of the great
drawbacks of tree-structured Parzen estimators is that they do not model interactions
between the hyperparameters. That said, TPE works extremely well in practice and has been
battle-tested across most domains.
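
A minimal sketch, assuming the Optuna library, whose TPESampler implements tree-structured Parzen estimation (the toy objective stands in for a real validation loss):

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)  # hyperparameter to tune
    return (lr - 0.01) ** 2                                # toy stand-in for a validation loss

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)
```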

Hyperparameter tuning algorithms


These are the algorithms developed specifically for doing hyperparameter tuning.

Hyperband

Hyperband is a variation of random search, but with some explore-exploit theory to find the
best time allocation for each of the configurations.
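
Hyperband is usually used through a tuning library; as one illustrative possibility (an assumption, not the only implementation), Optuna exposes it as a pruner that gives each configuration a small resource budget and lets only the promising ones continue:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    loss = 1.0
    for step in range(100):                   # stand-in for a training loop
        loss = 0.9 * loss + (lr - 0.01) ** 2  # toy "validation loss" that improves over time
        trial.report(loss, step)              # report the intermediate result
        if trial.should_prune():              # Hyperband stops unpromising trials early
            raise optuna.TrialPruned()
    return loss

study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())
study.optimize(objective, n_trials=30)
print(study.best_params)
```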

Population-based training (PBT)

This technique is a hybrid of the two most commonly used search techniques: Random
Search and manual tuning applied to Neural Network models.

PBT starts by training many neural networks in parallel with random hyperparameters. But
these networks aren’t fully independent of each other. 

It uses information from the rest of the population to refine the hyperparameters and
determine which hyperparameter values to try next.

BOHB

BOHB (Bayesian Optimization and HyperBand) mixes the Hyperband algorithm and
Bayesian optimization.

Unstable Gradient Problem

What is a Gradient?
The gradient is simply the derivative of the loss function with respect to the weights. It is
used to update the weights so as to minimize the loss function during backpropagation in
neural networks.
During backpropagation, two kinds of unstable gradient problems can occur:
 Vanishing Gradient
 Exploding Gradient
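
A back-of-the-envelope sketch of why this happens: the backpropagated gradient is roughly a product of one factor per layer, so with many layers it can either shrink toward zero or grow without bound (the numbers below are made up for illustration):

```python
# The backpropagated gradient is (roughly) a product of one factor per layer.
depth = 30
sigmoid_deriv_max = 0.25               # the sigmoid derivative never exceeds 0.25
small_weight, large_weight = 0.5, 3.0  # made-up per-layer weight magnitudes

vanishing = (small_weight * sigmoid_deriv_max) ** depth
exploding = large_weight ** depth

print(f"factor after {depth} layers (small weights + sigmoid): {vanishing:.1e}")  # about 8e-28
print(f"factor after {depth} layers (large weights): {exploding:.1e}")            # about 2e+14
```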

AUTOENCODER
What is an Autoencoder?
 An Autoencoder is a type of neural network that can learn to reconstruct images, text, and other data from
compressed versions of themselves.
 The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of
higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important
parts of the input image.

The architecture of Autoencoders:-


Autoencoders consist of 3 layers:
1. Encoder: A module that compresses the input data into an encoded representation that is typically several orders of
magnitude smaller than the input data. It encodes the input image as a compressed representation in a reduced
dimension. 
2. Code/Bottleneck: A module that contains the compressed knowledge representations and is therefore the most
important part of the network.
3. Decoder: A module that helps the network “decompress” the knowledge representations and reconstructs the data
back from its encoded form. The output is then compared with a ground truth.
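A minimal sketch of this encoder / bottleneck / decoder structure in Keras, assuming TensorFlow is available (the layer sizes are illustrative, e.g. for flattened 28×28 images):

```python
from tensorflow.keras import layers, models

input_dim, bottleneck_dim = 784, 32                      # e.g. flattened 28x28 images

encoder = models.Sequential([                            # Encoder: compress the input
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(bottleneck_dim, activation="relu"),     # Code / bottleneck
])

decoder = models.Sequential([                            # Decoder: reconstruct the input
    layers.Input(shape=(bottleneck_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")        # output is compared with the input
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # note: target = input
```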

5 types of autoencoders
The idea of Autoencoders for neural networks isn't new. In fact the first applications date to the 1980s.
Here are five popular autoencoders that we will discuss:
1. Undercomplete autoencoders
2. Sparse autoencoders
3. Contractive autoencoders
4. Denoising autoencoders
5. Variational Autoencoders (for generative modelling)
1. Undercomplete Autoencoders
An undercomplete autoencoder is one of the simplest types of autoencoders: it takes an image and tries to predict the
same image as output, reconstructing the image from the compressed bottleneck representation.

2. Sparse Autoencoders
Sparse autoencoders are controlled by changing the number of nodes at each hidden layer.
Sparse autoencoders offer an alternative method for introducing an information bottleneck without requiring a
reduction in the number of nodes at the hidden layers. Instead, the loss function is constructed so that it
penalizes activations within a layer.

3. Contractive Autoencoders
Similar to other autoencoders, contractive autoencoders perform the task of learning a representation of the image while
passing it through a bottleneck and reconstructing it in the decoder. In addition, their loss function penalizes the
sensitivity of the encoding to small changes in the input, which makes the learned representation more robust.

4. Denoising Autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an image.
Unlike the autoencoders covered so far, this is the first type whose ground truth is not the input itself: the network
receives a noisy image as input and is trained to reconstruct the clean image.

5. Variational Autoencoders
A variational autoencoder (VAE) provides a probabilistic manner for describing an observation in latent
space. Thus, rather than building an encoder which outputs a single value to describe each latent state
attribute, we'll formulate our encoder to describe a probability distribution for each latent attribute.
A variational autoencoder can be defined as an autoencoder whose training is regularised to avoid overfitting
and to ensure that the latent space has good properties that enable a generative process.
Standard and variational autoencoders learn to represent the input just in a compressed form called the latent space or
the bottleneck.

Applications of autoencoders
Now that you understand various types of autoencoders, let’s summarize some of their most common use cases.
1. Dimensionality reduction
2. Image denoising
3. Generation of image and time series data
4. Anomaly Detection
5. Watermark Removal from an Image
