
Machine Learning – Loss Function

• A computational method that improves performance on a task by using training data.
(Figure: the illustration shows a neural network, but other ML methods can be substituted.)
Loss Function
• loss function = cost function = objective function = error function
• The loss function is not meant to measure the entire performance of the network against a validation/test dataset.
• The loss function is used to guide the training process in order to find a set of parameters that reduces the value of the loss function.
Loss Function
• Loss functions depend on the type of task:
• Regression: the network predicts continuous, numeric variables
• Example: Length of fishes in images, temperature from latitude/longitude
• Typical losses: absolute value (L1), squared error (L2)
• Classification: the network predicts categorical variables (fixed number of
classes)
• Example: classify email as spam, predict student grades from essays.
• Typical losses: hinge loss, cross-entropy loss
Loss Function - Absolute value, L1-norm
• A very basic loss function
• Produces sparser solutions
• Good in high-dimensional spaces
• Fast to compute at prediction time
• Less sensitive to outliers
Loss Function - Square error, Euclidean loss, L2-norm
• Very common loss function
• Often fits small errors more precisely than the L1 norm
• Penalizes large errors more strongly
• Sensitive to outliers
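As a concrete illustration, here is a minimal NumPy sketch of both losses (the arrays are made-up values, not from the slides):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # hypothetical targets
y_pred = np.array([1.1, 1.8, 3.5])   # hypothetical predictions

l1_loss = np.mean(np.abs(y_pred - y_true))   # absolute value / L1
l2_loss = np.mean((y_pred - y_true) ** 2)    # squared error / L2

print(l1_loss, l2_loss)   # the L2 loss penalizes the large 0.5 error more
```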
Regularization
• Regularization refers to a set of different techniques that lower the complexity of a neural network model during training, and thus prevent overfitting.
• Three very popular and efficient regularization techniques are L1, L2, and dropout.
Regularization - L2 regularization
• The L2 regularization is the most common type of all
regularization techniques and is also commonly known as
weight decay or Ridge Regression.
• The mathematical derivation of this regularization, as well as the
mathematical explanation of why this method works at reducing
overfitting, is quite long and complex.
• During L2 regularization, the loss function of the neural network is extended by a so-called regularization term, which is called here Ω.
Regularization - L2 regularization

• The regularization term Ω is defined as the Euclidean norm (or L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix:

Ω = ||W||₂² = Σᵢ wᵢ²

• The regularization term is weighted by the scalar alpha divided by two and added to the regular loss function that is chosen for the current task. This leads to a new expression for the loss function:

L_new = L + (α/2) · ||W||₂²

• Alpha is sometimes called the regularization rate and is an additional hyperparameter we introduce into the neural network. Simply speaking, alpha determines how much we regularize our model.
• In the next step we can compute the gradient of the new loss function and put the gradient into the update rule for the weights:

∂L_new/∂w = ∂L/∂w + α·w,   w′ = w − η·(∂L/∂w + α·w)
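A minimal NumPy sketch of this update rule; the variable names and values are illustrative assumptions:

```python
import numpy as np

def l2_regularized_update(w, grad_loss, lr=0.01, alpha=0.1):
    """One gradient step on L_new = L + (alpha/2) * ||w||^2."""
    # the gradient of (alpha/2) * sum(w^2) is alpha * w
    return w - lr * (grad_loss + alpha * w)

w = np.array([0.5, -1.2, 2.0])
grad = np.array([0.1, -0.3, 0.2])    # hypothetical gradient of the old loss
w = l2_regularized_update(w, grad)   # weights "decay" toward zero
```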
Regularization – L1 regularization
• In the case of L1 regularization (also known as Lasso regression), we simply use another regularization term Ω. This term is the sum of the absolute values of the weight parameters in a weight matrix:

Ω = ||W||₁ = Σᵢ |wᵢ|

• As in the previous case, we multiply the regularization term by alpha and add the entire thing to the loss function:

L_new = L + α · ||W||₁

• The derivative of the new loss function leads to the following expression, which is the sum of the gradient of the old loss function and the sign of a weight value times alpha:

∂L_new/∂w = ∂L/∂w + α · sign(w)
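A matching sketch of the L1-regularized update (again with illustrative names and values):

```python
import numpy as np

def l1_regularized_update(w, grad_loss, lr=0.01, alpha=0.1):
    """One gradient step on L_new = L + alpha * sum(|w|)."""
    # the (sub)gradient of alpha * |w| is alpha * sign(w)
    return w - lr * (grad_loss + alpha * np.sign(w))

w = np.array([0.5, -1.2, 0.0])
grad = np.array([0.1, -0.3, 0.2])    # hypothetical gradient of the old loss
w = l1_regularized_update(w, grad)   # pushes small weights toward exactly zero
```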
Regularization – Dropout
• In addition to the L2 and L1 regularization, another famous and powerful regularization technique is called dropout regularization. The procedure behind dropout is quite simple.
• In a nutshell, dropout means that during training, with some probability P, a neuron of the neural network gets turned off. Let's look at a visual example.
(Figure: neural network with dropout (bottom) and without (top).)
Regularization – Dropout
• Assume on the top we have a feedforward neural network with no dropout. Using dropout with, say, a probability of P = 0.5 that a random neuron gets turned off during training would result in the neural network on the bottom.
• In this case, you can observe that approximately half of the neurons are not active and are not considered part of the neural network. As you can observe, the neural network becomes simpler.
• A simpler version of the neural network results in less complexity, which can reduce overfitting. The deactivation of neurons with a certain probability P is applied at each forward propagation and weight-update step. A code sketch follows below.
(Figure: neural network with dropout (bottom) and without (top).)
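A minimal sketch of dropout applied to one layer's activations; the rescaling by 1/(1 - P) ("inverted dropout") is a common convention assumed here, not something stated in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Turn each neuron off with probability p during training."""
    if not training:
        return activations
    keep_mask = rng.random(activations.shape) >= p
    # rescale kept neurons so the expected output stays unchanged
    return activations * keep_mask / (1.0 - p)

h = np.array([0.3, 1.2, -0.7, 0.9])   # hypothetical layer activations
print(dropout(h, p=0.5))              # roughly half the entries are zeroed
```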
Building Blocks of a Neural Network

Neurons
Machine Learning
Deep Learning
Traditional ML vs. DL
Artificial neuron vs Biological neuron
• The most fundamental unit of a deep neural network is called an artificial neuron.
• Why is it called a neuron? Where does the inspiration come from?
• The inspiration comes from biology (more specifically, from the brain).
• Biological neurons = neural cells = neural processing units.
• We will first see what a biological neuron looks like.
(Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, activation σ, and output y.)
Biological Neurons
• dendrite: receives signals from other neurons
• synapse: point of connection to other neurons
• soma: processes the information
• axon: transmits the output of this neuron
(Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg)
• Of course, in reality, it is not just a single neuron which does all this
• There is a massively parallel interconnected network of neurons
• The sense organs relay information to the lowest layer of neurons
• Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
• These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)
• This massively parallel network also ensures that there is division of work
• Each neuron may perform a certain role or respond to a certain stimulus

A simplified illustration
• The neurons in the brain are arranged in a hierarchy
• We illustrate this with the help of the visual cortex (the part of the brain which deals with processing visual information)
• Starting from the retina, the information is relayed to several layers (follow the arrows)
• We observe that the layers V1, V2 to AIT form a hierarchy (from identifying simple visual forms to high-level objects)
(Sample illustration of hierarchical processing)
Neurons
• ANNs are built from simple signal-processing elements (neurons) that are connected together into a large mesh.

What can neural networks do?
• Neural networks can:
• identify faces
• recognize speech
• read your handwriting (mine perhaps not)
• translate texts
• play games (typically board games or card games)
• control autonomous vehicles and robots
Artificial Neurons
Inside an artificial neuron
• You might be surprised to see how simple the calculations inside a neuron actually
are. We can identify three processing steps:
1. Each input gets scaled up or down
• When a signal comes in, it gets multiplied by a weight value that is assigned to this particular input.
That is, if a neuron has three inputs, then it has three weights that can be adjusted individually. During
the learning phase, the neural network can adjust the weights based on the error of the last test result.
2. All signals are summed up
• In the next step, the modified input signals are summed up to a single value. In this step, an offset is
also added to the sum. This offset is called bias. The neural network also adjusts the bias during the
learning phase.
• This is where the magic happens! At the start, all the neurons have random weights and random biases.
After each learning iteration, weights and biases are gradually shifted so that the next result is a bit
closer to the desired output. This way, the neural network gradually moves towards a state where the
desired patterns are “learned”.
3. Activation
• Finally, the result of the neuron’s calculation is turned into an output signal. This is done by feeding the
result to an activation function (also called transfer function).
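Putting the three steps together, a minimal sketch of a single artificial neuron in NumPy (the input, weight, and bias values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # three inputs (made-up values)
w = np.array([0.8, 0.2, -0.5])   # step 1: one adjustable weight per input
b = 0.1                          # step 2: bias added to the sum

z = np.dot(w, x) + b             # weighted sum of inputs plus bias
y = sigmoid(z)                   # step 3: activation function
print(y)
```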
Activation Functions
• The activation function acts as a decision-making body at the output of a neuron. The neuron learns linear or non-linear decision boundaries based on the activation function.
• It also has a normalizing effect on the neuron output, which prevents the outputs of neurons after several layers from becoming very large due to the cascading effect.
• The three most widely used activation functions are (see the sketch below):
• Sigmoid: maps the input (x axis) to values between 0 and 1.
• Tanh: similar to the sigmoid function, but maps the input to values between -1 and 1.
• Rectified Linear Unit (ReLU): allows only positive values to pass through it; negative values are mapped to zero.
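A minimal sketch of the three functions in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps input to (0, 1)

def tanh(z):
    return np.tanh(z)                 # maps input to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # negative values mapped to zero

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```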
Input Layer
• This is the first layer of a neural network. It is used to provide the input data or features to the network.
Output Layer
• This is the layer which gives out the predictions. The activation function used in this layer differs by problem. For a binary classification problem, we want the output to be either 0 or 1, so a sigmoid activation function is used. For a multiclass classification problem, a softmax (think of it as a generalization of sigmoid to multiple classes) is used. For a regression problem, where the output is not a predefined category, we can simply use a linear unit. A softmax sketch follows below.
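A minimal softmax sketch for the multiclass case (the max-subtraction is a standard numerical-stability trick, an assumption beyond the slides):

```python
import numpy as np

def softmax(z):
    """Map raw scores to a probability distribution over classes."""
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical class scores
print(softmax(scores))              # probabilities summing to 1
```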
Commonly used Terminologies in Neural networks

• The input vector: all the input values of a perceptron are collectively called the input vector of that perceptron.
• The weight vector: similarly, all the weight values of a perceptron are collectively called the weight vector of that perceptron.
McCulloch-Pitts Neurons
• McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
• g aggregates the inputs and the function f takes a decision based on this aggregation
• The inputs can be excitatory or inhibitory
• y = 0 if any xᵢ is inhibitory; otherwise:

g(x₁, x₂, ..., xₙ) = g(x) = Σᵢ₌₁ⁿ xᵢ
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ
where x₁, ..., xₙ ∈ {0, 1} and y ∈ {0, 1}

• θ is called the thresholding parameter
• This is called Thresholding Logic
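A minimal Python sketch of a McCulloch-Pitts unit, following the rule above (the inhibitory-input handling matches the circle notation described below):

```python
def mp_neuron(x, theta, inhibitory=()):
    """McCulloch-Pitts unit: binary inputs, threshold theta.
    If any inhibitory input fires, the output is forced to 0."""
    if any(x[i] == 1 for i in inhibitory):
        return 0
    return 1 if sum(x) >= theta else 0

print(mp_neuron([1, 1, 1], theta=3))                # AND of 3 inputs -> 1
print(mp_neuron([0, 1, 0], theta=1))                # OR of 3 inputs  -> 1
print(mp_neuron([1, 0], theta=1, inhibitory=(1,)))  # x1 AND !x2      -> 1
```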
McCulloch-Pitts Neurons with Boolean functions
(Figures: McCulloch-Pitts units, each with output y ∈ {0, 1}. Top row: a McCulloch-Pitts unit, an AND function (θ = 3), and an OR function (θ = 1), each over inputs x1, x2, x3. Bottom row: x1 AND !x2 (θ = 1), NOR (θ = 0), and NOT (θ = 0).)
• A circle at the end indicates an inhibitory input: if any inhibitory input is 1, the output will be 0.
Limitations Of M-P Neuron
• What about non-boolean (say, real) inputs?
• Do we always need to hand code the threshold?
• Are all inputs equal? What if we want to assign more importance to some inputs?
• What about functions which are not linearly separable, say the XOR function?
The Perceptron
• A perceptron is a very simple learning machine.
• It takes a few inputs, each of which has a weight to signify how important it is, and generates an output decision of "0" or "1".
• When combined with many other perceptrons, it forms an artificial neural network.
The perceptron
• The most basic form of an activation function is a simple binary function that has only
two possible results.
• This function returns 1 if the input is positive or zero, and 0 for any negative input. A
neuron whose activation function is a function like this is called a perceptron.

Threshold Logic Unit (TLU)
• In a Threshold Logic Unit (TLU) the output of the unit y in response to
a particular input pattern is calculated in two stages.
• First the activation is calculated.
• The activation a is the weighted sum of the inputs:
(Figure: a TLU with inputs x1, ..., xn, weights w1, ..., wn, a summation Σ, and a threshold θ producing output y.)

a = Σᵢ₌₁ⁿ wᵢ xᵢ

y = 1 if a ≥ θ
  = 0 if a < θ
Linear Unit
• The perceptron is a machine learning algorithm for supervised learning of various binary classification tasks.
• Further, the perceptron is also understood as an artificial neuron or neural network unit that helps to detect certain input data computations in business intelligence.
• The perceptron model is also treated as one of the best and simplest types of artificial neural networks. However, it is a supervised learning algorithm of binary classifiers.
• Hence, we can consider it as a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
(Figure: a linear unit with inputs x1, ..., xn, weights w1, ..., wn, summation Σ, activation, and output y.)

a = Σᵢ₌₁ⁿ wᵢ xᵢ
y = a = Σᵢ₌₁ⁿ wᵢ xᵢ
Training ANNs
• Training set S of examples {x,t}
• x is an input vector and
• t the desired target vector
• Example: Logical And
S = {((0,0),0), ((0,1),0), ((1,0),0), ((1,1),1)}
• Iterative process
• Present a training example x , compute network output y , compare
output y with target t, adjust weights and thresholds
• Learning rule
• Specifies how to change the weights w and thresholds θ of the
network as a function of the inputs x, output y and target t.
Perceptron Learning Rule
• w′ = w + α(t − y)x
• Or in components: w′ᵢ = wᵢ + Δwᵢ = wᵢ + α(t − y)xᵢ (i = 1..n+1), with wₙ₊₁ = θ and xₙ₊₁ = −1
• The parameter α is called the learning rate. It determines the
magnitude of weight updates Δwi .
• If the output is correct (t=y) the weights are not changed (Δwi
=0).
• If the output is incorrect (t ≠ y), the weights wᵢ are changed such that the new weight vector w′ moves closer to (if t = 1) or further from (if t = 0) the input x.
Perceptron Training Algorithm

Repeat
for each training vector pair (x,t)
evaluate the output y when x is the input
if y≠t then
form a new weight vector w’ according
to w’=w + α (t-y) x
else
do nothing
end if
end for
Until y=t for all training vector pairs
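A runnable sketch of this algorithm in NumPy, trained on the logical AND example from above; the threshold is folded in as an extra weight with input −1, as in the learning-rule slide:

```python
import numpy as np

def train_perceptron(S, alpha=0.1, max_epochs=100):
    """Repeat the perceptron update until all training pairs are correct."""
    w = np.zeros(3)                        # w1, w2, and w3 = theta
    for _ in range(max_epochs):
        all_correct = True
        for x, t in S:
            x_ext = np.append(x, -1.0)     # x_{n+1} = -1 absorbs theta
            y = 1 if np.dot(w, x_ext) >= 0 else 0
            if y != t:
                w += alpha * (t - y) * x_ext   # w' = w + alpha (t - y) x
                all_correct = False
        if all_correct:                    # "until y = t for all pairs"
            break
    return w

S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND
print(train_perceptron(S))
```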
Perceptron Learning Rule (worked example; step-by-step figures omitted)
Perceptron Convergence Theorem
The algorithm converges to the correct classification
•if the training data is linearly separable
•and α is sufficiently small
• If two classes of vectors X1 and X2 are linearly separable, the
application of the perceptron training algorithm will eventually result
in a weight vector w0, such that w0 defines a TLU whose decision
hyper-plane separates X1 and X2 (Rosenblatt 1962).
• The solution w0 is not unique, since if w0 · x = 0 defines a hyper-plane, so does w′0 = k·w0.
Regularization
• To regularize means to make things regular or acceptable
• Regularization refers to a set of different techniques that lower the complexity of a neural network model during training, and thus prevent overfitting
• Regularization penalizes the weight matrices of the nodes
What is Overfitting
• The training data contains information about the
regularities in the mapping from input to output. But it
also contains sampling error.
• There will be accidental regularities because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
• So it fits both kinds of regularity. If the model is very
flexible it can model the sampling error really well.
• This means the model will not generalize well to unseen
data
Diagnosing Overfitting

Regularization Techniques
• L2 Regularization / Ridge Regularization
• L1 Regularization / Lasso Regularization
• Dropout
• Early Stopping



Ridge Regularization
(Figure: a regression fit of salary vs. experience; figure omitted.)
L1 vs L2 Regularization Methods
• L1 Regularization, also called a Lasso regression, adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function.
• L2 Regularization, also called a Ridge regression, adds the “squared magnitude” of the
coefficient as the penalty term to the loss function.
• The key difference between these two is the penalty term.

(Figure: a steep-slope fit of salary vs. experience; figure omitted.)
• Assume lambda = 1
• Slope = 1.3
• Then cost = 0 + 1·(1.3)² = 1.69

• Assume lambda = 1
• Slope = 1.1
• Then cost = 0 + 1·(1.1)² = 1.21
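The same arithmetic as a tiny Python sketch; the zero residual term mirrors the slide's assumption of a perfect fit:

```python
def ridge_cost(residual_ss, slope, lam=1.0):
    """Ridge cost = sum of squared residuals + lambda * slope^2."""
    return residual_ss + lam * slope ** 2

print(ridge_cost(0.0, 1.3))   # 1.69
print(ridge_cost(0.0, 1.1))   # 1.21 -> the flatter slope is cheaper
```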

Lasso Regression
• This helps in feature selection too, since L1 can drive some weights exactly to zero
Dropout
• This is the one of the most interesting types of
regularization techniques.
• It also produces very good results and is consequently
the most frequently used regularization technique in
the field of deep learning
• To understand dropout, consider the following neural network structure (figure omitted)
• At every iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections (figure omitted)
• So each iteration has a different set of nodes and this
results in a different set of outputs. It can also be
thought of as an ensemble technique in machine
learning.
• Ensemble models usually perform better than a single
model as they capture more randomness. Similarly,
dropout also performs better than a normal neural
network model
• Due to these reasons, dropout is usually preferred
when we have a large neural network structure in
order to introduce more randomness.

Early stopping
• Early stopping is a kind of cross-validation strategy where we keep one part of the training set as the validation set.
• When we see that the performance on the validation set is getting worse, we immediately stop the training of the model. This is known as early stopping.
• In the (omitted) image, we would stop training at the dotted line, since after that point our model starts overfitting the training data. A runnable sketch follows below.
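A runnable toy sketch of this idea; the one-parameter model and its data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# a toy 1-parameter model: fit y = w*x, with a held-out validation set
X_tr = rng.normal(size=50)
t_tr = 2.0 * X_tr + rng.normal(0, 0.1, size=50)
X_val = rng.normal(size=20)
t_val = 2.0 * X_val + rng.normal(0, 0.1, size=20)

w, best_w, best_val = 0.0, 0.0, float("inf")
patience, bad_epochs = 3, 0
for epoch in range(100):
    grad = np.mean((w * X_tr - t_tr) * X_tr)   # squared-error gradient
    w -= 0.1 * grad                            # one training step
    val_loss = np.mean((w * X_val - t_val) ** 2)
    if val_loss < best_val:
        best_val, best_w, bad_epochs = val_loss, w, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # validation stopped improving:
            break                              # stop training early
w = best_w                                     # keep the best model seen
```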
Why Training a Neural Network Is Hard

• Fitting a neural network involves using a training dataset to update the model weights to create a good mapping of inputs to outputs.
• This training process is solved using an optimization algorithm that searches through a space of possible values for the neural network model weights for a set of weights that results in good performance on the training dataset.
• We will look at the challenge of training a neural network framed as an optimization problem.
Session Outcome
• Training a neural network involves using an optimization algorithm to
find a set of weights to best map inputs to outputs.
• The problem is hard, not least because the error surface is non-convex, contains local minima and flat spots, and is highly multidimensional.
• The stochastic gradient descent algorithm is the best general
algorithm to address this challenging problem.
Learning as Optimization
• Deep learning neural network models learn to map inputs to outputs given
a training dataset of examples.
• The training process involves finding a set of weights in the network that
proves to be good, or good enough, at solving the specific problem.
• This training process is iterative, meaning that it progresses step by step
with small updates to the model weights each iteration and, in turn, a
change in the performance of the model each iteration.
• The iterative training process of neural networks solves an optimization problem that searches for parameters (model weights) that result in a minimum error or loss when evaluating the examples in the training dataset.
• Optimization is a directed search procedure and the optimization problem
that we wish to solve when training a neural network model is very
challenging.
Optimization problems
• The optimization algorithm iteratively steps across
this landscape, updating the weights and seeking out
good or low elevation areas.
• For simple optimization problems, the shape of the landscape is a big bowl, and finding the bottom is easy; so easy that very efficient algorithms can be designed to find the best solution.
• These types of optimization problems are referred to
mathematically as convex.
Optimization problems
• The error surface we wish to navigate when
optimizing the weights of a neural network is not a
bowl shape.
• It is a landscape with many hills and valleys.
• These types of optimization problems are referred to
mathematically as non-convex.
Local Minima

• Local minima (or local optima) refer to the fact that the error landscape contains multiple regions where the loss is relatively low.
• These are valleys, where solutions in those
valleys look good relative to the slopes and
peaks around them.
• The problem is, in the broader view of the
entire landscape, the valley has a relatively
high elevation and better solutions may
exist.
Flat Regions (Saddle Points)

• A flat region or saddle point is a point on the landscape where the gradient is zero.
• These are flat regions at the bottom of valleys
or regions between peaks.
• The problem is that a zero gradient means that
the optimization algorithm does not know
which direction to move in order to improve
the model.
High-Dimensional
• The optimization problem solved when training a neural network is
high-dimensional.
• Each weight in the network represents another parameter or dimension of
the error surface.
• Deep neural networks often have millions of parameters, making the
landscape to be navigated by the algorithm extremely high-dimensional, as
compared to more traditional machine learning algorithms.
• The problem of navigating a high-dimensional space is that the addition of
each new dimension dramatically increases the distance between points in
the space, or hypervolume.
• This is often referred to as the “curse of dimensionality”.
Reasons for Difficulty in Deep Learning
• Possibly Questionable Solution Quality. The optimization process
may or may not find a good solution and solutions can only be
compared relatively, due to deceptive local minima.
• Possibly Long Training Time. The optimization process may take a
long time to find a satisfactory solution, due to the iterative nature of
the search.
• Possible Failure. The optimization process may fail to progress (get
stuck) or fail to locate a viable solution, due to the presence of flat
regions.
Greedy layer-wise training
Vanishing gradient problem
• Training deep neural networks was traditionally challenging due to the vanishing gradient problem.
• Vanishing gradient: weights in layers close to the input layer were not updated in response to errors calculated on the training dataset.

• An important milestone in the field of deep learning was greedy layer-wise pretraining, which allowed very deep neural networks to be successfully trained, achieving better performance.
• As the number of hidden layers increases, the amount of error information propagated back to earlier layers is dramatically reduced.
• Weights in hidden layers close to the output layer are
updated normally, whereas weights in hidden layers
close to the input layer are updated minimally or not
at all.
• Generally, this problem prevented the training of very deep neural networks and was referred to as the vanishing gradient problem.
• Pretraining :
• add a new hidden layer to a model.
• Allow the newly added model to learn the inputs from the existing hidden layer, keeping
the weights for the existing hidden layers fixed.
• This gives the technique the name “layer-wise” as the model is trained one layer at a
time.
• Greedy algorithm:
• Breaks the problem into many components, then solve for the optimal version of each
component in isolation
• Pretraining is based on the assumption that it is easier to train a shallow network than a deep network, and contrives a layer-wise training process in which we are only ever fitting a shallow model. A sketch follows below.
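A minimal sketch of the idea using linear single-layer autoencoders in NumPy; this is a toy illustration under simplified assumptions (linear layers, plain gradient descent), not the exact procedure from the original pretraining papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200):
    """Fit a one-hidden-layer linear autoencoder to X; return the encoder."""
    n_in = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, (n_in, n_hidden))
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_in))
    for _ in range(epochs):
        H = X @ W_enc                                  # encode
        err = H @ W_dec - X                            # reconstruction error
        grad_dec = H.T @ err / len(X)                  # gradient wrt decoder
        grad_enc = X.T @ (err @ W_dec.T) / len(X)      # gradient wrt encoder
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc

# Greedy layer-wise pretraining: train one layer at a time, feeding each
# new layer the codes produced by the previous (now frozen) layer.
X = rng.normal(size=(100, 8))
pretrained_layers, inputs = [], X
for n_hidden in (6, 4):
    W = train_autoencoder(inputs, n_hidden)
    pretrained_layers.append(W)     # this layer's weights are now fixed
    inputs = inputs @ W             # its codes become the next layer's input
```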
Pre-training and fine-tuning
● Using dataset A, train model M
● Pre-training:
● You have a dataset B
● Before training on B, initialize some of the parameters of M with the model trained on A
● Fine-tuning:
● You train M on B
● This is one form of transfer learning (a sketch follows below)
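A minimal sketch of pre-training and fine-tuning with a toy logistic-regression "model M"; datasets A and B here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_logreg(X, t, w=None, lr=0.5, epochs=500):
    """Logistic regression by gradient descent; w may be pre-initialized."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w = w - lr * X.T @ (y - t) / len(X)  # cross-entropy gradient step
    return w

X_A = rng.normal(size=(200, 5))              # dataset A (random stand-in)
t_A = (X_A[:, 0] > 0).astype(float)
X_B = rng.normal(size=(50, 5))               # dataset B (random stand-in)
t_B = (X_B[:, 0] > 0).astype(float)

w_pre = train_logreg(X_A, t_A)               # pre-train model M on A
w_fine = train_logreg(X_B, t_B, w=w_pre)     # fine-tune M's parameters on B
```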
• Training a deep structure is difficult due to high dependencies across layers' parameters, i.e. the relation between parts of pictures and pixels.
• To resolve this problem, two things are suggested:
• Adapting lower layers to feed good input to the upper layers
• Adjusting upper layers to make use of that setting of the lower layers
Greedy Algorithm
● Greedy algorithms break a problem into many
components, then solve for the optimal version of
each component in isolation

● Unfortunately, combining the individually optimal components is not guaranteed to yield an optimal complete solution
Single-layer representation learning
● We need a single-layer representation learning algorithm, such as:
● An RBM (a Markov network)
● A single-layer autoencoder
● A sparse coding model
● Or another model that learns latent representations
Training a 4-layer network
• Pairs of layers are active in each stage (figure omitted)
Greedy pretraining terminology
● Greedy layer-wise pretraining
● Greedy because
● It is a greedy algorithm that optimizes each piece of
the solution independently
● One piece at a time rather than jointly
● Layer-wise because
● Independent pieces are the layers of the network
● Training proceeds one layer at a time
● Training the kth layer while previous ones are fixed
● Pretraining because
● It is only a first step, before a joint training algorithm is applied to fine-tune all layers together
Unsupervised pretraining combines two ideas
1. Initial parameters have a regularizing effect
● i.e., they steer the search toward one local minimum over another
● But local minima are no longer considered a serious problem
2. Learning about input distribution can help to
learn about the mapping from inputs to
outputs
● Learns that cars and motorcycles have wheels
● The representation for wheels is useful for the
supervised learner
Optimization methods for Neural Networks: AdaGrad, Adam
• Optimizers are algorithms or methods used to change
the attributes of the neural network such
as weights and learning rate to reduce the losses.
• Optimizers are used to solve optimization problems
by minimizing the function
• It's impossible to know what the model's weights should be right from the start. But with some trial and error based on the loss function, we can end up getting there eventually.
Gradient Descent Algorithm
• A gradient measures how much the output of a function changes if you change the inputs a little bit.
• In mathematical terms, a gradient is a partial derivative with respect to its inputs.
• Gradient Descent is an optimization algorithm for
finding a local minimum of a differentiable function.
• Gradient descent is simply used to find the values of
a function's parameters (coefficients) that minimize a
cost function as far as possible.

• (Figure: a parabola whose lowest point occurs at x = 1.)
• The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum.
• "y" here is termed the objective function that the gradient descent algorithm operates upon, to descend to the lowest point.
• Find the slope of the objective function with respect to each parameter/feature. In other words, compute the gradient of the function.
• Pick a random initial value for the parameters.
• Update the gradient function by plugging in the parameter values.
• Calculate the step size for each feature as: step size = gradient * learning rate
  delta = -learning_rate * gradient
• Calculate the new parameters as: new params = old params - step size
  theta += delta
• Repeat steps 3 to 5 until the gradient is almost 0. (A runnable sketch follows below.)
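A minimal runnable sketch of these steps, minimizing the parabola y = (x - 1)^2 from the earlier slide:

```python
def gradient(x):
    return 2.0 * (x - 1.0)       # derivative of y = (x - 1)^2

theta = 5.0                      # step 2: initial parameter value
learning_rate = 0.1
while abs(gradient(theta)) > 1e-6:            # step 5: near-zero gradient
    delta = -learning_rate * gradient(theta)  # steps 3-4: step size
    theta += delta                            # update the parameter
print(theta)                     # converges to x = 1, the lowest point
```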
Importance of the Learning Rate
• Learning rate determines the size of the gradient
descent steps into the direction of the local
minimum.
• We should set the learning rate to an appropriate value, which is neither too low nor too high.
• If the steps are too big, gradient descent may never reach the local minimum, because it bounces back and forth across the convex function.
• If we set the learning rate to a very small value,
gradient descent will eventually reach the local
minimum but that may take a while

The learning rate should never be too high or too low for this
reason.
Downsides of the gradient descent algorithm
• Consider we have 10,000 data points and 10 features.
• We need to compute 10,000 * 10 = 100,000 derivatives per iteration.
• It is common to take 1000 iterations; in effect we have 100,000 * 1000 = 100,000,000 computations to complete the algorithm.
• Hence gradient descent is slow on huge data

Stochastic Gradient Descent (SGD)

• SGD randomly picks one data point from the whole data set at each iteration, reducing the computations enormously.
• Mini-batch gradient descent tries to strike a balance between the goodness of gradient descent and the speed of SGD (see the sketch below).

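A minimal sketch of the mini-batch variant on the 10,000-point example above (the linear-regression data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))           # 10,000 points, 10 features
t = X @ rng.normal(size=10)                 # hypothetical linear targets
w = np.zeros(10)

for step in range(1000):
    idx = rng.integers(0, len(X), size=32)  # sample a mini-batch of 32
    Xb, tb = X[idx], t[idx]                 # instead of all 10,000 rows
    grad = Xb.T @ (Xb @ w - tb) / len(Xb)   # squared-error gradient
    w -= 0.01 * grad                        # one cheap update per step
```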
Momentum
• Momentum performs some additional processing of the gradients to make descent faster and better behaved.
• In addition to the regular gradient, it also adds on the movement from the previous step.
• sum_of_gradient = gradient + previous_sum_of_gradient *
decay_rate
• delta = -learning_rate * sum_of_gradient
• theta += delta

• Momentum simply moves faster.
• Momentum has a shot at escaping local minima, because the momentum may propel it out of a local minimum. A runnable sketch follows below.
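A runnable sketch of the momentum pseudocode above, applied to the same parabola y = (x - 1)^2:

```python
def gradient(x):
    return 2.0 * (x - 1.0)       # derivative of y = (x - 1)^2

theta, sum_of_gradient = 5.0, 0.0
learning_rate, decay_rate = 0.1, 0.9
for step in range(200):
    sum_of_gradient = gradient(theta) + sum_of_gradient * decay_rate
    delta = -learning_rate * sum_of_gradient
    theta += delta
print(theta)                     # approaches 1.0, overshooting along the way
```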
Adaptive Gradient Descent (AdaGrad)
• One disadvantage of the optimizers so far is that the learning rate is constant for all parameters and for each cycle.
• AdaGrad changes the learning rate: it adapts the learning rate 'η' for each parameter and at every time step 't'.
• It works on the derivative of an error function.
• It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features.
• Instead of keeping track of the sum of gradients, AdaGrad keeps track of the sum of gradients squared and uses that to adapt the gradient in different directions.
• Sum_of_gradient_squared =
previous_sum_of_gradient_squared + gradient²
• delta = -learning_rate * gradient /
sqrt(sum_of_gradient_squared)
• theta += delta

• The AdaGrad update is: θ_{t+1} = θ_t − (η / √(G_t + εI)) · g_t, where
• θ is the parameter to be updated,
• η is the initial learning rate,
• ε is some small quantity used to avoid division by zero,
• I is the identity matrix,
• G_t is the sum of squared gradients up to time step t, and
• g_t is the gradient estimate at time step t, g_t = ∇_θ J(θ_t).
Root Mean Square Propagation
• AdaGrad is incredibly slow, because the sum of gradient squared only
grows and never shrinks.
• RMSProp adds a decay factor.
• sum_of_gradient_squared = previous_sum_of_gradient_squared *
decay_rate+ gradient² * (1- decay_rate)
• delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
• theta += delta

Adaptive Moment Estimation (Adam)
• Adam combines the advantages of two other extensions of stochastic gradient descent. Specifically:
• Adaptive Gradient Algorithm (AdaGrad), which maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
• Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing).
• Adam realizes the benefits of both AdaGrad and RMSProp.
• Instead of adapting the parameter learning rates
based on the average first moment (the mean) as in
RMSProp, Adam also makes use of the average of the
second moments of the gradients (the uncentered
variance).
• Specifically, the algorithm calculates an exponential
moving average of the gradient and the squared
gradient, and the parameters beta1 and beta2 control
the decay rates of these moving averages.

• sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1
- beta1) [Momentum]
• sum_of_gradient_squared = previous_sum_of_gradient_squared *
beta2 + gradient² * (1- beta2) [RMSProp]
• delta = -learning_rate * sum_of_gradient /
sqrt(sum_of_gradient_squared)
• theta += delta
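A runnable sketch combining the two moving averages exactly as in the pseudocode above; note that full Adam also applies bias correction, which is omitted here to match the slides:

```python
import math

def gradient(x):
    return 2.0 * (x - 1.0)       # derivative of y = (x - 1)^2

theta = 5.0
m, v = 0.0, 0.0                  # running averages of gradient and gradient^2
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for step in range(2000):
    g = gradient(theta)
    m = m * beta1 + g * (1 - beta1)        # momentum part
    v = v * beta2 + g ** 2 * (1 - beta2)   # RMSProp part
    theta += -learning_rate * m / (math.sqrt(v) + eps)
print(theta)                     # approaches 1.0
```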



References:
• https://machinelearningmastery.com/why-training-a-neural-network-is-hard/
• https://www.predictiveanalyticstoday.com/deep-learning-software-libraries/
• https://www.simplilearn.com/tutorials/deep-learning-tutorial/what-is-deep-learning
• https://machinelearningmastery.com/what-is-deep-learning/
