CMP784
DEEP LEARNING
Previously on CMP784
• multi-layer perceptrons
• activation functions
• chain rule
• backpropagation algorithm
• computational graph
• distributed word representations
Lecture overview
• data preprocessing and normalization
• weight initializations
• ways to improve generalization
• optimization
• babysitting the learning process
• hyperparameter selection
Disclaimer: Much of the material and slides for this lecture were borrowed from
—Fei-Fei Li, Andrej Karpathy and Justin Johnson’s CS231n class
—Roger Grosse’s CSC321 class
—Shubhendu Trivedi and Risi Kondor’s CMSC 35246 class
—Efstratios Gavves and Max Welling’s UvA deep learning class
—Hinton's Neural Networks for Machine Learning class
Activation Functions
• Sigmoid: σ(x) = 1/(1 + e^(−x))
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout
• ELU
Activation Functions
Sigmoid: σ(x) = 1/(1 + e^(−x)), with derivative σ′(x) = σ(x)(1 − σ(x))
• Squashes numbers to range [0,1]
• Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the gradients
Image credit: Jefkine Kafunah
Activation Functions
Sigmoid: σ(x) = 1/(1 + e^(−x))
2. Sigmoid outputs are not zero-centered
Consider what happens when the input to a neuron (x) is always positive:
    f( Σ_i w_i x_i + b )
What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)
Activation Functions
tanh: tanh(x)
ReLU (Rectified Linear Unit): max(0, x)
[Figure: a “dead” ReLU that falls outside the data cloud will never activate → never update; an active ReLU lies inside the data cloud.]
Activation Functions
Leaky ReLU: f(x) = max(0.01x, x)
• Does not saturate
• Computationally efficient
• Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
• will not “die”.
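A minimal NumPy sketch of the activations discussed above (the function names are ours, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0.0, x)              # does not saturate for x > 0; can "die" if its input stays negative

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)        # small negative slope keeps the unit from dying
```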
Data preprocessing
• Scale input variables to have similar diagonal covariances c_i = Σ_j (x_i^(j))²
⎯ Similar covariances → more balanced rate of learning for different weights
⎯ Rescaling to 1 is a good choice, unless some dimensions are less important
• Example: x = [x1, x2, x3]ᵀ, θ = [θ1, θ2, θ3]ᵀ, a = tanh(θᵀx)
⎯ If x1, x2, x3 have much different covariances, the generated gradients ∂L/∂θ |_{x1,x2,x3} are much different
⎯ Gradient update harder: θ^(t+1) = θ^(t) − η_t ∂L/∂θ
Data preprocessing
• Input variables should be as decorrelated as possible
⎯ Input variables are “more independent”
⎯ Network is forced to find non-trivial correlations between inputs
⎯ Decorrelated inputs → Better optimization
⎯ Obviously not the case when inputs are by definition correlated (sequences)
• Extreme case
− Extreme correlation (linear dependency) might cause problems [CAUTION]
Data preprocessing
In practice, you may also see PCA and Whitening of the data
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images
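A small NumPy sketch of the two options (small random arrays stand in for the real CIFAR-10 data):

```python
import numpy as np

# Stand-ins for CIFAR-10-shaped data: rows are flattened 32x32x3 images (3072 numbers).
X_train = np.random.rand(5000, 3072).astype(np.float32)
X_test = np.random.rand(1000, 3072).astype(np.float32)

# Center only (common for images): subtract the mean image computed on the training set.
mean_image = X_train.mean(axis=0)            # shape (3072,)
X_train_centered = X_train - mean_image
X_test_centered = X_test - mean_image        # always reuse training statistics for test data

# Full standardization (more common for non-image data): also divide by the per-feature std.
std = X_train.std(axis=0) + 1e-8             # epsilon avoids division by zero
X_train_norm = X_train_centered / std
```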
Weight Initialization
Q: what happens when W=0 init is used?

First idea: Small random numbers
(Gaussian with zero mean and 1e-2 standard deviation)
Let’s look at some activation statistics.
E.g. 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initializing as described on the last slide.

All activations become zero!
Q: think about the backward pass. What do the gradients look like?

*1.0 instead of *0.01: almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.
“Xavier initialization” [Glorot et al., 2010]
Reasonable initialization. (Mathematical derivation assumes linear activations)
• If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
⎯ We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be proportional to 1/sqrt(fan-in).
• We can also scale the learning rate the same way.
(from Hinton’s notes)
…but when using the ReLU nonlinearity it breaks.

He et al., 2015 (note the additional /2)
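A sketch of the three initializations in NumPy (layer sizes are illustrative):

```python
import numpy as np

fan_in, fan_out = 512, 256   # example layer sizes, ours for illustration

# Naive small random init from the earlier slide: activations shrink toward zero in deep nets.
W_small = 0.01 * np.random.randn(fan_in, fan_out)

# "Xavier" initialization (Glorot & Bengio, 2010), derived assuming linear/tanh units.
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization (He et al., 2015): the extra factor of 2 compensates for ReLU
# zeroing out roughly half of its inputs.
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```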
Proper initialization is an active area of
research…
• Understanding the difficulty of training deep feedforward neural networks Glorot and
Bengio, 2010
• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Saxe et al, 2013
• Random walk initialization for training very deep feedforward networks Sussillo and
Abbott, 2014
• Delving deep into rectifiers: Surpassing human-level performance on ImageNet
classification He et al., 2015
• Data-dependent Initializations of Convolutional Neural Networks Krähenbühl et al.,
2015
• All you need is a good init, Mishkin and Matas, 2015
…
Batch Normalization
“you want unit Gaussian activations? just make them so.”
BN: x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)])
Problem: do we necessarily want a unit Gaussian input to a tanh layer?
[Ioffe and Szegedy, 2015]
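A minimal training-time forward pass for batch normalization (learnable scale/shift γ, β let the network undo the normalization if a unit Gaussian input is not what it wants; running averages for test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch (training-time forward pass).

    x: (N, D) activations; gamma, beta: (D,) learnable scale and shift.
    A sketch of Ioffe & Szegedy (2015); test-time running statistics are left out.
    """
    mu = x.mean(axis=0)                      # per-feature mean E[x]
    var = x.var(axis=0)                      # per-feature variance Var[x]
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to ~unit Gaussian
    return gamma * x_hat + beta              # scale and shift

# Example: a batch of 128 activations with 64 features.
out = batchnorm_forward(np.random.randn(128, 64), np.ones(64), np.zeros(64))
```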
Batch Normalization
Related normalization schemes:
• Weight Normalization
Salimans, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks,
NIPS, 2016
• Normalization propagation
Arpit et al., Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep
Networks, ICML, 2016
• Divisive Normalization
Ren et al., Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes. ICLR 2016
• Batch Renormalization
Ioffe, Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017
Improving Generalization
Preventing Overfitting
• Approach 1: Get more data!
⎯ Almost always the best bet if you have enough compute power to train on more data.
• Approach 2: Use a model that has the right capacity:
⎯ enough to fit the true regularities.
⎯ not enough to also fit spurious regularities (if they are weaker).
• Approach 3: Average many different models.
⎯ Use models with different forms.
⎯ Or train the model on different subsets of the training data (this is called “bagging”).
• Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.
Some ways to limit the capacity of a
neural net
• The capacity can be controlled in many ways:
• Architecture: Limit the number of hidden layers and the number of units
per layer.
• Early stopping: Start with small weights and stop the learning before it
overfits.
• Weight-decay: Penalize large weights using penalties or constraints on
their squared values (L2 penalty) or absolute values (L1 penalty).
• Noise: Add noise to the weights or the activities.
Regularization
• Neural networks typically have thousands, if not millions of parameters
⎯ Usually, the dataset size is smaller than the number of parameters
• Overfitting is a grave danger
• Proper weight regularization is crucial to avoid overfitting

    θ* ← arg min_θ Σ_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + Ω(θ)

• The ℓ2-regularization can pass inside the gradient descent update rule
[Data augmentation examples: Original, Contrast, Tint]
Noise as a regularizer: L2 weight-decay via noisy inputs
• Suppose we add Gaussian noise to the inputs: x_i + N(0, σ_i²)
⎯ The variance of the noise is amplified by the squared weight before going into the next layer.
• In a simple net with a linear output unit directly connected to the inputs, the amplified noise gets added to the output: y_j + N(0, w_i² σ_i²)
• This makes an additive contribution to the squared error.
⎯ So minimizing the squared error tends to minimize the squared weights when the inputs are noisy.
• Not exactly equivalent to using an L2 weight penalty.
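Writing out why this holds (a standard derivation, not verbatim from the slide), for a linear unit y = Σ_i w_i x_i with independent input noise ε_i ~ N(0, σ_i²) and target t:

```latex
y^{\text{noisy}} = \sum_i w_i (x_i + \epsilon_i) = y + \sum_i w_i \epsilon_i,
\qquad
E\!\left[(y^{\text{noisy}} - t)^2\right] = (y - t)^2 + \sum_i w_i^2 \sigma_i^2
```

since the noise terms have zero mean and are independent; the second term is exactly an L2 penalty on the weights, weighted by the input-noise variances.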
Multi-task Learning
[Figure: multi-task learning, where the tasks share the lower layers of the network.]

[Figure: training error vs. generalization error as a function of capacity: underfitting zone, overfitting zone, generalization gap, optimal capacity.]

Model Ensembles
• Different models will usually not make exactly the same errors on the test examples.
[Figure: bagging, where each ensemble member is trained on a different resampled dataset.]
Dropout
• “randomly set some neurons to zero in the forward pass”
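A minimal sketch of that forward pass in NumPy (vanilla dropout at training time; the test-time side is discussed a few slides later):

```python
import numpy as np

p = 0.5   # probability of keeping a unit (hyperparameter)

def dropout_forward_train(h, p=0.5):
    """Randomly zero out units of the activations h during training."""
    mask = (np.random.rand(*h.shape) < p)   # keep each unit with probability p
    return h * mask

h = np.maximum(0, np.random.randn(128, 256))   # some hidden activations (ReLU output)
h_dropped = dropout_forward_train(h, p)
```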
Waaaait a second… How could this possibly be a good idea?
• Forces the network to have a redundant representation.
[Figure: cat score computed from redundant features: has an ear (dropped), has a tail, is furry (dropped), has claws, mischievous look (dropped).]
• Another interpretation: dropout trains a large ensemble of sub-networks that share parameters; each binary mask is one model, and it gets trained on only ~one datapoint.
At test time….
Ideally: want to integrate out all the noise.
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).
(this can be shown to be an approximation to evaluating the whole ensemble)
We can do something approximate analytically.
Dropout Summary
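A common way to implement the summary is "inverted" dropout, sketched below; the scaling by p is done at training time so the test-time pass needs no change (this is one implementation choice, not necessarily the exact code on the slides):

```python
import numpy as np

p = 0.5   # keep probability

def dropout_train(h, p):
    """Inverted dropout: divide by p during training..."""
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask

def dropout_test(h):
    return h   # ...so the test-time forward pass is untouched

# With plain (non-inverted) dropout you would instead multiply activations by p at test
# time, so each unit's expected output matches what it saw during training.
```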
Training a neural network, main loop:

When we write ∇_W L(W), we mean the vector of partial derivatives:

    ∇_W L(W) = ( ∂L/∂W_1, ∂L/∂W_2, …, ∂L/∂W_m )ᵀ

where ∂L/∂W_i measures how fast the loss changes vs. change in W_i.
In the figure: the loss surface is blue, gradient vectors are red.
• When ∇_W L(W) = 0, it means all the partials are zero, i.e. the loss is not changing in any direction. Thus we are at a local optimum (or at least a saddle point).
• The regions where gradient descent converges to a particular local minimum are called basins of attraction.
Local Minima
• Since the optimization problem is non-convex, it probably has local
minima.
• This kept people from using neural nets for a long time, because
they wanted guarantees they were getting the optimal solution.
• But are local minima really a problem?
− Common view among practitioners: yes, there are local minima, but
they’re probably still pretty good.
• Maybe your network wastes some hidden units, but then you can just make it larger.
− It’s very hard to demonstrate the existence of local minima in practice.
− In any case, other optimization-related issues are much more important.
Saddle Points
• At a saddle point, ∂L/∂W_i = 0 even though we are not at a minimum.
Plateaux
• An important example of a plateau is a saturated unit: when it is in the flat region of its activation function. Recall the backprop equations for the weight derivative:

    z̄_i = h̄_i φ′(z_i)
    w̄_ij = z̄_i x_j

• If φ′(z_i) is always close to zero, then the weights will get stuck.
• If there is a ReLU unit whose input z_i is always negative, the weight derivatives will be exactly 0. We call this a dead unit.
Batch Gradient Descent

Algorithm 1 Batch Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:   ĝ ← (1/N) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
4:   Apply Update: θ ← θ − ε ĝ
5: end while
Gradient Descent

Stochastic Batch Gradient Descent
Minibatching
• Potential Problem: Gradient estimates can be very noisy
• Obvious Solution: Use larger mini-batches
• Advantage: Computation time per update does not depend on
number of training examples N
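A minimal mini-batch SGD loop (the toy linear-regression loss and all names are ours, purely to show the sampling-and-update pattern):

```python
import numpy as np

def loss_and_grad(W, X_batch, y_batch):
    # Toy loss: 0.5 * mean squared error of a linear model.
    B = X_batch.shape[0]
    err = X_batch @ W - y_batch
    return 0.5 * np.mean(err ** 2), X_batch.T @ err / B

X, y = np.random.randn(1000, 20), np.random.randn(1000)
W = np.zeros(20)
lr, batch_size = 1e-2, 64

for step in range(200):
    idx = np.random.choice(len(X), batch_size, replace=False)   # sample a mini-batch
    loss, grad = loss_and_grad(W, X[idx], y[idx])                # noisy gradient estimate
    W -= lr * grad                                               # vanilla SGD update
```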
Stochastic Gradient Descent

Image credits: Alec Radford
Suppose loss function is steep vertically but shallow horizontally:
SGD + Momentum

SGD:
    x_{t+1} = x_t − α ∇f(x_t)

SGD + Momentum:
    v_{t+1} = ρ v_t + ∇f(x_t)
    x_{t+1} = x_t − α v_{t+1}

• Build up “velocity” as a running mean of gradients
• Rho gives “friction”; typically rho = 0.9 or 0.99

SGD vs Momentum: notice momentum overshooting the target, but overall getting to the minimum much faster.
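The update on the slide, as a small NumPy sketch (the quadratic toy gradient is ours, chosen to be steep in one direction and shallow in the other):

```python
import numpy as np

def grad_f(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])   # toy gradient: shallow in x[0], steep in x[1]

x = np.array([5.0, 5.0])
v = np.zeros_like(x)
rho, alpha = 0.9, 0.05       # friction and learning rate

for _ in range(100):
    v = rho * v + grad_f(x)  # running mean ("velocity") of gradients
    x = x - alpha * v        # step along the velocity instead of the raw gradient
```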
SGD + Momentum
[Figure: momentum update as vectors; the actual step is the sum of the momentum step and the gradient step.]
Nesterov Momentum
[Figure: the gradient is evaluated at the “lookahead” point x_t + ρ v_t, so the actual step is a bit different from the original momentum update.]

    v_{t+1} = ρ v_t − α ∇f(x_t + ρ v_t)
    x_{t+1} = x_t + v_{t+1}

Annoying: usually we want the update in terms of x_t and ∇f(x_t), so rearrange with a change of variables.
AdaGrad

AdaGrad update: added element-wise scaling of the gradient based on the historical sum of squares in each dimension.

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
RMSProp
[Slide: AdaGrad update vs. RMSProp update; see the sketch below.]
Adaptive Moment Estimation (Adam)

Adam (almost), incomplete but close: combine a momentum-like first moment with AdaGrad/RMSProp-style second-moment scaling.
Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Adam (full form): momentum + bias correction + AdaGrad/RMSProp.
The bias correction compensates for the fact that m, v are initialized at zero and need some time to “warm up”.
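The full Adam update with bias correction, as a short sketch (hyperparameters are the commonly used defaults):

```python
import numpy as np

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
x = np.random.randn(10)
m = np.zeros_like(x)   # first moment: momentum-like running mean of gradients
v = np.zeros_like(x)   # second moment: RMSProp-like running mean of squared gradients

def dx_fn(x):
    return 2 * x       # stand-in gradient

for t in range(1, 201):
    dx = dx_fn(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx * dx
    m_hat = m / (1 - beta1 ** t)   # bias correction: m and v start at zero and need to "warm up"
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)
```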
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!
• step decay: e.g. decay learning rate by half every few epochs.
• exponential decay
• 1/t decay
[Figure: loss vs. epoch under learning rate decay.]
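The three schedules as functions of the epoch t (the constants below are illustrative, not prescribed by the slides):

```python
import numpy as np

lr0 = 1e-1                 # initial learning rate
k, decay_every = 0.1, 10   # decay constants (illustrative)

def step_decay(t):
    return lr0 * (0.5 ** (t // decay_every))   # halve the rate every `decay_every` epochs

def exponential_decay(t):
    return lr0 * np.exp(-k * t)

def one_over_t_decay(t):
    return lr0 / (1.0 + k * t)
```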
First-Order Optimization
1) Use the gradient to form a linear approximation
2) Step to minimize the approximation
[Figure: loss as a function of w1, with the linear approximation at the current point.]
Second-Order Optimization
1) Use gradient and Hessian to form quadratic approximation
2) Step to the minima of the approximation
Second order optimization methods
• Second-order Taylor expansion of the loss
• Solving for the critical point we obtain the Newton parameter update
• notice: no hyperparameters! (e.g. learning rate)
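Writing out the expansion and the resulting Newton step referred to above (standard forms, not copied from the slide; H is the Hessian at θ₀):

```latex
J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{\top} \nabla_{\theta} J(\theta_0)
              + \tfrac{1}{2} (\theta - \theta_0)^{\top} H (\theta - \theta_0)
\qquad\Rightarrow\qquad
\theta^{*} = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0)
```

Setting the gradient of the quadratic approximation to zero gives the update, which is why no learning rate appears; the cost is forming and inverting the Hessian.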
Babysitting the Learning Process
Say we start with one hidden layer of 50 neurons:
[Figure: CIFAR-10 images (3072 numbers) → input layer → hidden layer of 50 neurons → output layer of 10 neurons, one per class.]
Double check that the loss is reasonable:
• disable regularization
• loss ~2.3, “correct” for 10 classes (−ln(1/10) ≈ 2.3 for a softmax over 10 classes)
• (the loss function returns the loss and the gradient for all parameters)
Double check that the loss is reasonable:
• crank up regularization: the loss should go up
Let’s try to train now…
• loss not going down: learning rate too low
• loss exploding: learning rate too high
• 3e-3 is still too high; cost explodes…
=> Rough range for learning rate we should be cross-validating is somewhere [1e-3 … 1e-5]
Hyperparameter Selection
Everything is a hyperparameter
• Network size/depth
• Small model variations
• Minibatch creation strategy
• Optimizer/learning rate
For example: run coarse search for 5 epochs
note: it’s best to optimize in log space!
Now run finer search... adjust range
Monitor and visualize the loss curve
[Figure: loss vs. time; a loss curve shaped like this makes bad initialization a prime suspect.]
Loss function specimen
lossfunctions.tumblr.com
Monitor and visualize the accuracy:
• no gap between training and validation accuracy => increase model capacity?
Visualization
• Check gradients numerically by finite differences
• Visualize features (features need to be uncorrelated) and have high variance
• Good training: hidden units are sparse across samples
[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
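A minimal finite-difference gradient check (a centered difference per coordinate; in practice you only spot-check a handful of parameters, since doing this for a whole network is far too slow):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite-difference estimate of df/dx, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fp = f(x)   # f(x + h e_i)
        x.flat[i] = old - h; fm = f(x)   # f(x - h e_i)
        x.flat[i] = old                  # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Example: check the analytic gradient of f(x) = sum(x^2), which is 2x.
x = np.random.randn(5)
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
rel_err = np.abs(num - 2 * x) / np.maximum(1e-8, np.abs(num) + np.abs(2 * x))
print(rel_err.max())   # should be tiny (~1e-8 or less)
```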
Visualization
• Check gradients numerically by finite differences
• Visualize features (features need to be uncorrelated) and have high variance
• Bad training: many hidden units ignore the input and/or exhibit strong correlations
[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
Visualization
• Check gradients numerically by finite differences
• Visualize features (features need to be uncorrelated) and have high variance
• Visualize parameters: learned features should exhibit structure and should be uncorrelated
[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
Take Home Messages
Optimization Tricks
• SGD with momentum, batch-normalization, and dropout usually
works very well
• Pick learning rate by running on a subset of the data
• Start with large learning rate & divide by 2 until loss does not diverge
• Decay learning rate by a factor of ~100 or more by the end of training
• Dropout
Next lecture:
Convolutional
Neural Networks