Lecture Slides Lec6 PDF

Neural Networks for Machine Learning

Lecture 6a
Overview of mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Reminder: The error surface for a linear neuron
•  The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
–  For a linear neuron with a squared error, it is a quadratic bowl.
–  Vertical cross-sections are parabolas.
–  Horizontal cross-sections are ellipses.
•  For multi-layer, non-linear nets the error surface is much more complicated.
–  But locally, a piece of a quadratic bowl is usually a very good approximation.
[Figure: the error E plotted over the weights w1 and w2.]
Convergence speed of full batch learning when the error surface is a quadratic bowl
•  Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
–  The gradient is big in the direction in which we only want to travel a small distance.
–  The gradient is small in the direction in which we want to travel a large distance.
Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.
How the learning goes wrong
•  If the learning rate is big, the weights slosh to and fro across the ravine.
–  If the learning rate is too big, this oscillation diverges.
•  What we would like to achieve:
–  Move quickly in directions with small but consistent gradients.
–  Move slowly in directions with big but inconsistent gradients.
[Figure: the error E plotted against a weight w.]
Stochastic gradient descent
•  If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
–  So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
–  The extreme version of this approach updates weights after each case. It’s called “online”.
•  Mini-batches are usually better than online.
–  Less computation is used updating the weights.
–  Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs.
•  Mini-batches need to be balanced for classes.

Two types of learning algorithm
If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
–  The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
–  Multilayer neural nets are not typical of the problems they study so their methods may need a lot of adaptation.
For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
–  The mini-batches may need to be quite big when adapting fancy methods.
–  Big mini-batches are more computationally efficient.
A basic mini-batch gradient descent algorithm
•  Guess an initial learning rate.
–  If the error keeps getting worse or oscillates wildly, reduce the learning rate.
–  If the error is falling fairly consistently but slowly, increase the learning rate.
•  Write a simple program to automate this way of adjusting the learning rate.
•  Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
–  This removes fluctuations in the final weights caused by the variations between mini-batches.
•  Turn down the learning rate when the error stops decreasing.
–  Use the error on a separate validation set.
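The "simple program" for adjusting the learning rate could look roughly like the sketch below. The function name `adjust_learning_rate`, the factors `shrink` and `grow`, the threshold `slow`, and the choice to treat any reversal in the last three errors as an oscillation are all illustrative assumptions, not values from the lecture.

```python
def adjust_learning_rate(lr, errors, shrink=0.5, grow=1.1, slow=0.01):
    """Return a new learning rate from recent per-epoch errors (oldest first).

    shrink, grow and slow are hypothetical settings chosen for illustration.
    """
    if len(errors) < 3:
        return lr  # not enough history to judge a trend
    e0, e1, e2 = errors[-3:]
    got_worse = e2 > e1 > e0
    # A sign flip between successive error changes is treated as oscillation.
    oscillating = (e1 - e0) * (e2 - e1) < 0
    if got_worse or oscillating:
        return lr * shrink
    # Falling consistently, but by less than `slow` (relative) over two epochs.
    falling_slowly = e2 < e1 < e0 and (e0 - e2) / (abs(e0) + 1e-12) < slow
    if falling_slowly:
        return lr * grow
    return lr
```

In practice the errors passed in would be measured on a separate validation set, as the slide recommends.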
Neural Networks for Machine Learning

Lecture 6b
A bag of tricks for mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Be careful about turning down the learning rate
•  Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.
–  So we get a quick win.
–  But then we get slower learning.
•  Don’t turn down the learning rate too soon!
[Figure: error plotted against epoch, with the point marked where the learning rate is reduced.]
Initializing the weights
•  If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
–  So they can never learn to be different features.
–  We break symmetry by initializing the weights to have small random values.
•  If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
–  We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to 1/sqrt(fan-in).
•  We can also scale the learning rate the same way.
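A minimal sketch of fan-in-scaled initialization follows. It assumes the intended scaling is 1/sqrt(fan-in), which is what "smaller incoming weights when the fan-in is big" asks for. The function name `init_weights`, the Gaussian distribution, and the layer sizes are assumptions made for illustration.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Small random weights with standard deviation 1/sqrt(fan_in).

    Random values break the symmetry between hidden units; the scaling keeps
    the summed input to a unit at a similar size whatever the fan-in.
    """
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

rng = np.random.default_rng(0)
W_small = init_weights(10, 50, rng)    # small fan-in -> larger weights
W_big = init_weights(1000, 50, rng)    # big fan-in -> smaller weights
```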

Shifting the inputs
•  When using steepest descent, shifting the input values makes a big difference.
–  It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
•  The hyperbolic tangent (which is 2*logistic(2x) - 1) produces hidden activations that are roughly zero mean.
–  In this respect it’s better than the logistic.
[Figure: error surfaces over the weights w1, w2 for the training cases (101, 101) → 2 and (101, 99) → 0, and for the shifted cases (1, 1) → 2 and (1, -1) → 0. Color indicates training case.]
Scaling the inputs
•  When using steepest descent, scaling the input values makes a big difference.
–  It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.
[Figure: error surfaces over the weights w1, w2 for the training cases (0.1, 10) → 2 and (0.1, -10) → 0, and for the rescaled cases (1, 1) → 2 and (1, -1) → 0. Color indicates weight axis.]
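Shifting and scaling can be done in one pass over the training set. The sketch below is illustrative: the function name `standardize`, the small constant `eps` that guards against division by zero, and the synthetic data are assumptions, not part of the slides.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Give each input component zero mean and unit variance over the training set."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# Hypothetical data: one component is offset by 100, the other has a large scale.
rng = np.random.default_rng(0)
X = rng.normal(loc=[100.0, 0.0], scale=[1.0, 10.0], size=(1000, 2))
Z, mean, std = standardize(X)
```

The returned `mean` and `std` should be reused to transform validation and test inputs, so that all data passes through the same transformation.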

A more thorough method: Decorrelate the input components
•  For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.
•  There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.
–  Drop the principal components with the smallest eigenvalues.
•  This achieves some dimensionality reduction.
–  Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis aligned elliptical error surface into a circular one.
•  For a circular error surface, the gradient points straight towards the minimum.
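The PCA recipe above can be sketched with NumPy's symmetric eigendecomposition. The function name `pca_whiten`, the mixing matrix `A`, and the number of retained components are assumptions for illustration.

```python
import numpy as np

def pca_whiten(X, n_components):
    """Project onto the top principal components and rescale each to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]   # drop the smallest eigenvalues
    projected = Xc @ eigvecs[:, keep]
    return projected / np.sqrt(eigvals[keep])         # divide by sqrt of eigenvalues

# Hypothetical correlated inputs: independent noise mixed by a fixed matrix.
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.5, 0.5, 0.1]])
X = rng.standard_normal((500, 3)) @ A
Z = pca_whiten(X, n_components=2)
```

After this transformation the covariance of `Z` is the identity, which for a linear neuron corresponds to the circular error surface described above.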
Common problems that occur in multilayer networks
•  If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
–  The error derivatives for the hidden units will all become tiny and the error will not decrease.
–  This is usually a plateau, but people often mistake it for a local minimum.
•  In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.
–  The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
–  This is another plateau that looks like a local minimum.
Four ways to speed up mini-batch learning
•  Use “momentum”
–  Instead of using the gradient to change the position of the weight “particle”, use it to change the velocity.
•  Use separate adaptive learning rates for each parameter
–  Slowly adjust the rate using the consistency of the gradient for that parameter.
•  rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
–  This is the mini-batch version of just using the sign of the gradient.
•  Take a fancy method from the optimization literature that makes use of curvature information (not this lecture)
–  Adapt it to work for neural nets
–  Adapt it to work for mini-batches.

Lecture 6c
The momentum method
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
The intuition behind the momentum method
Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
–  The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
–  Its momentum makes it keep going in the previous direction.
•  It damps oscillations in directions of high curvature by combining gradients with opposite signs.
•  It builds up speed in directions with a gentle but consistent gradient.
The equations of the momentum method
$$v(t) = \alpha\, v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$
The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$, which is slightly less than 1.
$$\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t) = \alpha\, \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$
The weight change is equal to the current velocity. The weight change can be expressed in terms of the previous weight change and the current gradient.
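As a rough illustration of these equations, the sketch below applies the momentum update to an elongated quadratic bowl. The function names, the bowl's curvatures, the learning rate, and the step count are all hypothetical choices, picked only so that plain gradient descent stays stable in the steep direction and crawls in the shallow one.

```python
import numpy as np

def momentum_step(w, v, grad, epsilon, alpha):
    """One update: v(t) = alpha * v(t-1) - epsilon * grad, then w changes by v(t)."""
    v = alpha * v - epsilon * grad
    return w + v, v

# Hypothetical elongated bowl: E(w) = 0.5 * (100 * w[0]**2 + w[1]**2).
CURVATURE = np.array([100.0, 1.0])

def final_error(alpha, epsilon=0.015, steps=200):
    """Error reached after a fixed number of steps from the same starting point."""
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        grad = CURVATURE * w
        w, v = momentum_step(w, v, grad, epsilon, alpha)
    return 0.5 * np.sum(CURVATURE * w ** 2)

error_plain = final_error(alpha=0.0)      # simple gradient descent
error_momentum = final_error(alpha=0.9)   # same learning rate, with momentum
```

With these settings the run with momentum ends at a lower error than the run without it, because velocity builds up along the shallow direction.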
The behavior of the momentum method
•  If the error surface is a tilted plane, the ball reaches a terminal velocity.
–  If the momentum is close to 1, this is much faster than simple gradient descent.
$$v(\infty) = \frac{1}{1-\alpha}\left(-\varepsilon \frac{\partial E}{\partial w}\right)$$
•  At the beginning of learning there may be very large gradients.
–  So it pays to use a small momentum (e.g. 0.5).
–  Once the large gradients have disappeared and the weights are stuck in a ravine the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99)
•  This allows us to learn at a rate that would cause divergent oscillations without the momentum.
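The terminal velocity formula can be checked numerically by iterating the velocity update with a constant gradient, which is what a tilted plane produces. The gradient, learning rate, and momentum values below are arbitrary illustrative choices.

```python
def terminal_velocity(grad, epsilon, alpha, steps=500):
    """Iterate v(t) = alpha * v(t-1) - epsilon * grad with a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = alpha * v - epsilon * grad
    return v

grad, epsilon, alpha = 2.0, 0.01, 0.9    # hypothetical values
simulated = terminal_velocity(grad, epsilon, alpha)
predicted = -epsilon * grad / (1.0 - alpha)   # v(infinity) from the slide
speedup = predicted / (-epsilon * grad)       # relative to a plain gradient step
```

With a momentum of 0.9 the terminal velocity is 1/(1 - 0.9) = 10 times the size of a plain gradient descent step.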
A better type of momentum (Nesterov 1983)
•  The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
•  Ilya Sutskever (2012 unpublished) suggested a new form of momentum that often works better.
–  Inspired by the Nesterov method for optimizing convex functions.
•  First make a big jump in the direction of the previous accumulated gradient.
•  Then measure the gradient where you end up and make a correction.
–  It’s better to correct a mistake after you have made it!
A picture of the Nesterov method
•  First make a big jump in the direction of the previous accumulated gradient.
•  Then measure the gradient where you end up and make a correction.

brown vector = jump, red vector = correction, green vector = accumulated gradient

blue vectors = standard momentum
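The jump-then-correct idea can be sketched as follows. This is one common way to write the update, not necessarily the exact formulation used in the lecture. The names `nesterov_step` and `quadratic_grad` and the test problem E(w) = 0.5 * w**2 are assumptions.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, epsilon, alpha):
    """Jump along the previous accumulated gradient, then correct."""
    jump = alpha * v                              # the big jump
    correction = -epsilon * grad_fn(w + jump)     # gradient measured where we end up
    v = jump + correction                         # new accumulated gradient
    return w + v, v

def quadratic_grad(w):
    """Gradient of the hypothetical error E(w) = 0.5 * w**2."""
    return w

w = np.array([1.0])
v = np.zeros(1)
trajectory = []
for _ in range(200):
    w, v = nesterov_step(w, v, quadratic_grad, epsilon=0.1, alpha=0.9)
    trajectory.append(w[0])
```

The only difference from standard momentum is the point at which the gradient is evaluated: `w + jump` rather than `w`.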

Lecture 6d
A separate, adaptive learning rate for each connection
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
The intuition behind separate adaptive learning rates
•  In a multilayer net, the appropriate learning rates can vary widely between weights:
–  The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
–  The fan-in of a unit determines the size of the “overshoot” effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.
•  So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
Gradients can get very small in the early layers of very deep nets.
The fan-in often varies widely between layers.
One way to determine the individual learning rates
•  Start with a local gain of 1 for every weight.
$$\Delta w_{ij} = -\varepsilon\, g_{ij} \frac{\partial E}{\partial w_{ij}}$$
•  Increase the local gain if the gradient for that weight does not change sign.
$$\text{if } \left(\frac{\partial E}{\partial w_{ij}}(t)\, \frac{\partial E}{\partial w_{ij}}(t-1)\right) > 0 \text{ then } g_{ij}(t) = g_{ij}(t-1) + 0.05 \text{ else } g_{ij}(t) = g_{ij}(t-1) \times 0.95$$
•  Use small additive increases and multiplicative decreases (for mini-batch)
–  This ensures that big gains decay rapidly when oscillations start.
–  If the gradient is totally random the gain will hover around 1 when we increase by plus δ half the time and decrease by times 1 − δ half the time.
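The gain rule can be sketched as an element-wise update. The function name `update_gains`, the clipping range, and the example gradients are assumptions for illustration; δ = 0.05 follows the slide.

```python
import numpy as np

def update_gains(gains, grad, prev_grad, delta=0.05, lo=0.1, hi=10.0):
    """Additive increase when the gradient keeps its sign, multiplicative decrease otherwise."""
    agree = grad * prev_grad > 0
    gains = np.where(agree, gains + delta, gains * (1.0 - delta))
    return np.clip(gains, lo, hi)   # keep the gains in a reasonable range

epsilon = 0.01                          # hypothetical global learning rate
gains = np.ones(2)
prev_grad = np.array([0.2, 0.3])
grad = np.array([0.5, -0.5])            # first weight agrees in sign, second flips
gains = update_gains(gains, grad, prev_grad)
delta_w = -epsilon * gains * grad       # the weight change from the slide
```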
Tricks for making adaptive learning rates work better
•  Limit the gains to lie in some reasonable range
–  e.g. [0.1, 10] or [.01, 100]
•  Use full batch learning or big mini-batches
–  This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
•  Adaptive learning rates can be combined with momentum.
–  Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
•  Adaptive learning rates only deal with axis-aligned effects.
–  Momentum does not care about the alignment of the axes.

Lecture 6e
rmsprop: Divide the gradient by a running average of its recent magnitude
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
rprop: Using only the sign of the gradient
•  The magnitude of the gradient can be very different for different weights and can change during learning.
–  This makes it hard to choose a single global learning rate.
•  For full batch learning, we can deal with this variation by only using the sign of the gradient.
–  The weight updates are all of the same magnitude.
–  This escapes from plateaus with tiny gradients quickly.
•  rprop: This combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
–  Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
–  Otherwise decrease the step size multiplicatively (e.g. times 0.5).
–  Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster’s advice).
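A simplified sketch of one full-batch rprop update, using the factors and limits quoted above, is shown below. The function name `rprop_step` and the example values are assumptions. Published rprop variants differ in how they handle a sign change, for example by skipping the update; this sketch does not attempt that.

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad, up=1.2, down=0.5,
               step_min=1e-6, step_max=50.0):
    """Adapt each weight's step size from sign agreement, then move by the sign only."""
    agree = grad * prev_grad > 0
    step = np.where(agree, step * up, step * down)
    step = np.clip(step, step_min, step_max)
    w = w - step * np.sign(grad)    # gradient magnitude is ignored
    return w, step

w = np.zeros(2)
step = np.full(2, 0.1)
prev_grad = np.array([2e-3, 5.0])
grad = np.array([1e-6, -3.0])       # tiny but consistent; large but sign-flipped
w, step = rprop_step(w, step, grad, prev_grad)
```

The first weight moves by 0.12 even though its gradient is tiny, which is how rprop escapes plateaus quickly.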
Why rprop does not work with mini-batches
•  The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.
–  Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
–  We want this weight to stay roughly where it is.
•  rprop would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
–  So the weight would grow a lot.
•  Is there a way to combine:
–  The robustness of rprop.
–  The efficiency of mini-batches.
–  The effective averaging of gradients over mini-batches.
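The arithmetic of this example can be checked directly. The learning rate and the fixed sign-based step size below are arbitrary illustrative values. The sketch applies the usual descent convention (move against the gradient), so the direction of the drift depends on that convention, but its size does not.

```python
import numpy as np

grads = np.array([0.1] * 9 + [-0.9])   # gradients on ten successive mini-batches
learning_rate = 0.01                   # hypothetical, for plain mini-batch descent
step_size = 0.01                       # hypothetical fixed sign-based step

# Plain descent sums the gradients, which cancel over the ten mini-batches.
sgd_change = -learning_rate * grads.sum()

# A sign-only update takes nine steps one way and one step back: a net of eight steps.
sign_change = -step_size * np.sign(grads).sum()
```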
rmsprop: A mini-batch version of rprop
•  rprop is equivalent to using the gradient but also dividing by the size of the gradient.
–  The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
•  rmsprop: Keep a moving average of the squared gradient for each weight
$$\text{MeanSquare}(w, t) = 0.9\, \text{MeanSquare}(w, t-1) + 0.1 \left(\frac{\partial E}{\partial w}(t)\right)^2$$
•  Dividing the gradient by $\sqrt{\text{MeanSquare}(w, t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).
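A minimal sketch of one rmsprop update follows, dividing by the square root of MeanSquare (the "rms" in the name). The 0.9 / 0.1 averaging follows the slide. The function name `rmsprop_step`, the learning rate, and the small constant `eps` added to avoid division by zero are assumptions.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, learning_rate=0.001, decay=0.9, eps=1e-8):
    """Update the running mean square of the gradient, then scale the step by its root."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - learning_rate * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

w = np.zeros(2)
mean_square = np.zeros(2)
grad = np.array([10.0, 0.001])      # very different gradient magnitudes
w, mean_square = rmsprop_step(w, mean_square, grad)
```

Both weights move by nearly the same amount despite gradients that differ by four orders of magnitude. That is the rprop-like behavior, but the divisor now changes slowly from one mini-batch to the next.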
Further developments of rmsprop
•  Combining rmsprop with standard momentum
–  Momentum does not help as much as it normally does. Needs more investigation.
•  Combining rmsprop with Nesterov momentum (Sutskever 2012)
–  It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.
•  Combining rmsprop with adaptive learning rates for each connection
–  Needs more investigation.
•  Other methods related to rmsprop
–  Yann LeCun’s group has a fancy version in “No more pesky learning rates”
Summary of learning methods for neural networks
•  For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
–  Conjugate gradient, LBFGS ...
–  adaptive learning rates, rprop ...
•  For big, redundant datasets use mini-batches.
–  Try gradient descent with momentum.
–  Try rmsprop (with momentum ?)
–  Try LeCun’s latest recipe.
•  Why there is no simple recipe:
Neural nets differ a lot:
–  Very deep nets (especially ones with narrow bottlenecks).
–  Recurrent nets.
–  Wide shallow nets.
Tasks differ a lot:
–  Some require very accurate weights, some don’t.
–  Some have many very rare cases (e.g. words).
