
The Alchemist, Mattheus von Helmont

CMP784
DEEP LEARNING

Lecture #04 – Training Deep Neural Networks

Aykut Erdem // Hacettepe University // Spring 2018


Image: Jose-Luis Olivares

Previously on CMP784
• multi-layer perceptrons
• activation functions
• chain rule
• backpropagation algorithm
• computational graph
• distributed word representations

2
Lecture overview
• data preprocessing and normalization
• weight initializations
• ways to improve generalization
• optimization
• babysitting the learning process
• hyperparameter selection

Disclaimer: Much of the material and slides for this lecture were borrowed from
—Fei-Fei Li, Andrej Karpathy and Justin Johnson’s CS231n class
—Roger Grosse’s CSC321 class
—Shubhendu Trivedi and Risi Kondor’s CMSC 35246 class
—Efstratios Gavves and Max Welling's UvA deep learning class
—Hinton's Neural Networks for Machine Learning class
3
Activation Functions

4
Activation Functions
Sigmoid: σ(x) = 1/(1 + e^{-x})
tanh: tanh(x)
ReLU: max(0, x)
Leaky ReLU: max(0.1x, x)
Maxout
ELU
5
Activation Functions
σ(x) = 1/(1 + e^{-x})

• Squashes numbers to range [0,1]


• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron

Sigmoid

6
Activation Functions
σ(x) = 1/(1 + e^{-x}), with derivative ∂σ/∂x = σ(x)(1 − σ(x))

[diagram: x → sigmoid gate → σ(x)]
• Squashes numbers to range [0,1]
• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron
3 problems:
1. Saturated neurons “kill” the gradients

Sigmoid
Image credit: Jefkine Kafunah
7
Activation Functions
σ(x) = 1/(1 + e^{-x})

• Squashes numbers to range [0,1]


• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron
3 problems:
Sigmoid
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

9
Consider what happens when the input to
a neuron (x) is always positive:
f(∑_i w_i x_i + b)

[diagram: inputs x_0 … x_n with weights w_0 … w_n and a bias b feed a sum, followed by a non-linearity f]

What can we say about the gradients on w?


10
Consider what happens when the input to
a neuron (x) is always positive:
f(∑_i w_i x_i + b)

[diagram: the two allowed gradient update directions (all-positive or all-negative) force a zig-zag path toward a hypothetical optimal w vector]

What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)
11
Activation Functions
σ(x) = 1/(1 + e^{-x})

• Squashes numbers to range [0,1]


• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron
3 problems:
Sigmoid
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

12
Activation Functions

• Squashes numbers to range [-1,1]


• zero centered (nice)
• still kills gradients when saturated :(

tanh(x)

[LeCun et al., 1991] 13


Activation Functions
• Computes f(x) = max(0,x)

• Does not saturate (in +region)


• Very computationally efficient
• Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

ReLU
(Rectified Linear Unit)

[Krizhevsky et al., 2012] 14



Activation Functions
∂f/∂x = 0, if x ≤ 0; 1, if x > 0

• Computes f(x) = max(0,x)

[diagram: x → ReLU gate → max(0,x)]

• Does not saturate (in +region)
• Very computationally efficient
• Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

• Not zero-centered output


• An annoyance:

Hint: what is the gradient when x < 0?


ReLU
[Krizhevsky et al., 2012] 15
[figure: DATA CLOUD with an active ReLU vs. a dead ReLU; a dead ReLU will never activate → never update]
17
[figure: DATA CLOUD with an active ReLU vs. a dead ReLU that will never activate → never update]

→ people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)

18
Activation Functions
• Does not saturate
• Computationally efficient
• Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
• will not “die”.

Leaky ReLU
f(x) = max(0.01x, x)

[Maas et al., 2013]


[He et al., 2015]
19
Activation Functions
• Does not saturate
• Computationally efficient
• Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
• will not “die”.

Parametric Rectifier (PReLU)
f(x) = max(αx, x)
backprop into α (parameter)

Leaky ReLU
f(x) = max(0.01x, x)

[Maas et al., 2013]


[He et al., 2015]
20
Activation Functions
Exponential Linear Units (ELU)

• All benefits of ReLU


• Does not die
• Closer to zero mean outputs

• Computation requires exp()


f(x) = x, if x > 0; α(exp(x) − 1), if x ≤ 0

[Clevert et al., 2015] 21


Activation Functions
Scaled Exponential Linear Units (SELU)
• Scaled version of ELU
• Stable and attracting fixed points
for the mean and variance
• No need for batch normalization
• ~100 pages long of pure math
"Using the Banach fixed-point theorem, we
(( prove that activations close to zero mean and
x x if x
if >
x>0 0 unit variance that are propagated through
f (x)
f (x)
== many network layers will converge towards
↵(exp(x)
↵(exp(x) 1) 1) if 
if x x0 0
<latexit sha1_base64="MMebiCeNZVnhZbR1erWNhJmz9j4=">AAAB7XicbVBNS8NAFHypX7V+VT16WSyCp5KIoN6KXjxWMLbQhrLZbNqlm03YfRFK6I/w4kHFq//Hm//GbZuDVgcWhpl57HsTZlIYdN0vp7Kyura+Ud2sbW3v7O7V9w8eTJprxn2WylR3Q2q4FIr7KFDybqY5TULJO+H4ZuZ3Hrk2IlX3OMl4kNChErFgFK3U6Usbjeig3nCb7hzkL/FK0oAS7UH9sx+lLE+4QiapMT3PzTAoqEbBJJ/W+rnhGWVjOuQ9SxVNuAmK+bpTcmKViMSptk8hmas/JwqaGDNJQptMKI7MsjcT//N6OcaXQSFUliNXbPFRnEuCKZndTiKhOUM5sYQyLeyuhI2opgxtQzVbgrd88l/inzWvmt7deaN1XbZRhSM4hlPw4AJacAtt8IHBGJ7gBV6dzHl23pz3RbTilDOH8AvOxzeo0Y9R</latexit>

<latexit sha1_base64="FGZqL1uA3elL0FNwMMZEjGvpaqk=">AAACaXicbVFNb9QwEHXSQstCYVskqOAyYgXaHlglvbQcQBVcOLZSl1ZaR6uJM9m16jip7VRZRXvgL3LjF3DhR+D9QIK2T7L0/OZ5xn5OKyWti6KfQbix+eDh1vajzuMnO0+fdXf3vtmyNoKGolSluUzRkpKahk46RZeVISxSRRfp1ZdF/eKGjJWlPnezipICJ1rmUqDz0rj7nY8g7zcH8LEDwFOaSN0K38/O/R6ggRXeAb+uMQPuqHGtzGEOzaeI86WJo6qm2OfUVL7T+/jgfj9XdA3RYgzp7O8Qnoy7vWgQLQF3SbwmPbbG6bj7g2elqAvSTii0dhRHlUtaNE4KRb5lbalCcYUTGnmqsSCbtMuo5vDWKxnkpfFLO1iq/55osbB2VqTeWaCb2tu1hXhfbVS7/Dhppa5qR1qsBuW1AlfCInfIpCHh1MwTFEb6u4KYokHh/O90fAjx7SffJcPDwYdBfHbYO/m8TmObvWZvWJ/F7IidsK/slA2ZYL+CneBF8DL4He6F++GrlTUM1mees/8Q9v4AaOOxpw==</latexit> <latexit sha1_base64="FGZqL1uA3elL0FNwMMZEjGvpaqk=">AAACaXicbVFNb9QwEHXSQstCYVskqOAyYgXaHlglvbQcQBVcOLZSl1ZaR6uJM9m16jip7VRZRXvgL3LjF3DhR+D9QIK2T7L0/OZ5xn5OKyWti6KfQbix+eDh1vajzuMnO0+fdXf3vtmyNoKGolSluUzRkpKahk46RZeVISxSRRfp1ZdF/eKGjJWlPnezipICJ1rmUqDz0rj7nY8g7zcH8LEDwFOaSN0K38/O/R6ggRXeAb+uMQPuqHGtzGEOzaeI86WJo6qm2OfUVL7T+/jgfj9XdA3RYgzp7O8Qnoy7vWgQLQF3SbwmPbbG6bj7g2elqAvSTii0dhRHlUtaNE4KRb5lbalCcYUTGnmqsSCbtMuo5vDWKxnkpfFLO1iq/55osbB2VqTeWaCb2tu1hXhfbVS7/Dhppa5qR1qsBuW1AlfCInfIpCHh1MwTFEb6u4KYokHh/O90fAjx7SffJcPDwYdBfHbYO/m8TmObvWZvWJ/F7IidsK/slA2ZYL+CneBF8DL4He6F++GrlTUM1mees/8Q9v4AaOOxpw==</latexit>


zero mean and unit variance — even under
! = 1.6732632423543772848170429916717 the presence of noise and perturbations."

" = 1.0507009873554804934193349852946 [Klambauer et al., 2017] 22


Data Preprocessing and
Normalization

24
Data preprocessing
• Scale input variables to have similar diagonal covariances c_i = ∑_j (x_i^{(j)})²
⎯ Similar covariances → more balanced rate of learning for different weights
⎯ Rescaling to 1 is a good choice, unless some dimensions are less important

• Example: x = [x₁, x₂, x₃]ᵀ, θ = [θ₁, θ₂, θ₃]ᵀ, a = tanh(θᵀx)
⎯ x₁, x₂, x₃ have much different covariances
⎯ Generated gradients ∂L/∂θ |_{x₁,x₂,x₃}: much different
⎯ Gradient update harder: θ^{(t+1)} = θ^{(t)} − η_t ∂L/∂θ

25
Data preprocessing
• Input variables should be as decorrelated as possible
⎯ Input variables are “more independent”
⎯ Network is forced to find non-trivial correlations between inputs
⎯ Decorrelated inputs → Better optimization
⎯ Obviously not the case when inputs are by definition correlated (sequences)

• Extreme case
− Extreme correlation (linear dependency) might cause problems [CAUTION]

26
Data preprocessing

(Assume X [NxD] is data matrix, each example in a row)

27
Data preprocessing
In practice, you may also see PCA and Whitening of the data

[left: decorrelated data (data has diagonal covariance matrix); right: whitened data (covariance matrix is the identity matrix)]

28
Data preprocessing
In practice, you may also see PCA and Whitening of the data

29
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images

- Subtract the mean image (e.g. AlexNet)


(mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
(mean along each channel = 3 numbers)

Not common to normalize variance, or to do PCA or whitening.

30
Weight Initialization

31
Q: what happens when W=0 init is used?

32
First idea: Small random numbers
(Gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but can lead to


non-homogeneous distributions of activations
across the layers of a network.

34
Let's look at some activation statistics

E.g. a 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initializing as described in the last slide.
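A minimal numpy sketch of this experiment (the layer sizes and the 0.01 init scale are from the slides; the batch size and the printed statistics are illustrative assumptions):

import numpy as np

hidden_size, num_layers = 500, 10
x = np.random.randn(1000, hidden_size)  # assumed unit-Gaussian input batch

for layer in range(num_layers):
    W = 0.01 * np.random.randn(hidden_size, hidden_size)  # small random init
    x = np.tanh(x.dot(W))
    # with this init the std of the activations shrinks layer by layer,
    # collapsing towards zero in the deeper layers
    print(layer, x.mean(), x.std())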

35
All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about the backward pass for a W*X gate.

37
*1.0 instead of *0.01: almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.

38
“Xavier initialization”
[Glorot et al., 2010]
Reasonable initialization.
(Mathematical derivation
assumes linear activations)
• If a hidden unit has a big fan-in,
small changes on many of its
incoming weights can cause the
learning to overshoot.
⎯ We generally want smaller
incoming weights when the
fan-in is big, so initialize the
weights to be proportional to
sqrt(fan-in).
• We can also scale the learning
rate the same way.
(from Hinton’s notes)
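In code, "initialize weights proportional to sqrt(fan-in)" is one line (a sketch, assuming numpy is imported as np and fan_in/fan_out are the layer's input/output sizes):

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)  # Xavier initialization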
39
but when using the ReLU nonlinearity
it breaks.

40
He et al., 2015
(note additional /2)
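A sketch of the corresponding init for ReLU layers, with the additional factor of 2 in the variance (same assumptions as the Xavier line above):

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)  # He et al. init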

42
Proper initialization is an active area of
research…
• Understanding the difficulty of training deep feedforward neural networks Glorot and
Bengio, 2010
• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Saxe et al, 2013
• Random walk initialization for training very deep feedforward networks Sussillo and
Abbott, 2014
• Delving deep into rectifiers: Surpassing human-level performance on ImageNet
classification He et al., 2015
• Data-dependent Initializations of Convolutional Neural Networks Krähenbühl et al.,
2015
• All you need is a good init, Mishkin and Matas, 2015

43
Batch Normalization
“you want unit Gaussian activations? just make them so.”

consider a batch of activations at some layer.


To make each dimension unit gaussian, apply:

x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

this is a vanilla differentiable function...

[Ioffe and Szegedy, 2015]


44
Batch Normalization
“you want unit gaussian activations? just make them so.”

1. compute the empirical mean and


variance independently for each
dimension.
2. Normalize:
x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

[figure: the batch of activations is an N×D matrix; statistics are computed per dimension, i.e. per column]

[Ioffe and Szegedy, 2015]


45
Batch Normalization

[diagram: … → FC → BN → tanh → FC → BN → tanh → …]

Usually inserted after Fully Connected (or Convolutional, as we'll see soon) layers, and before the tanh nonlinearity.

x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

Problem: do we necessarily want a unit Gaussian input to a tanh layer?

[Ioffe and Szegedy, 2015]
46
Batch Normalization
Normalize:

x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

And then allow the network to squash the range if it wants to:

y^{(k)} = γ^{(k)} x̂^{(k)} + β^{(k)}

Note, the network can learn:
γ^{(k)} = √(Var[x^{(k)}])
β^{(k)} = E[x^{(k)}]
to recover the identity mapping.
[Ioffe and Szegedy, 2015]
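A compact numpy sketch of the training-time transform (the eps term and the [N, D] layout are illustrative assumptions; gamma and beta are the learned scale and shift):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] batch of activations; statistics are per dimension
    mu = x.mean(axis=0)                      # empirical mean
    var = x.var(axis=0)                      # empirical variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # ~unit Gaussian activations
    return gamma * x_hat + beta              # learned squash / shift

# at test time one would instead plug in running averages of mu and var
# accumulated during training, e.g. running_mu = 0.9*running_mu + 0.1*mu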


47
Batch Normalization
• Improves gradient flow through
the network
• Allows higher learning rates
• Reduces the strong
dependence on initialization
• Acts as a form of regularization
in a funny way, and slightly
reduces the need for dropout,
maybe

[Ioffe and Szegedy, 2015]


48
Batch Normalization
Note: at test time BatchNorm
layer functions differently:

The mean/std are not computed


based on the batch. Instead, a
single fixed empirical mean of
activations during training is used.

(e.g. can be estimated during


training with running averages)

[Ioffe and Szegedy, 2015]


49
Other normalization schemes
• Layer Normalization
Ba et al., Layer Normalization, NIPS 2016 Deep Learning Symposium

• Weight Normalization
Salimans and Kingma, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks,
NIPS, 2016

• Normalization propagation
Arpit et al., Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep
Networks, ICML, 2016

• Divisive Normalization
Ren et al., Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes. ICLR 2016

• Batch Renormalization
Ioffe, Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017

50
Improving Generalization

51
Preventing Overfitting
• Approach 1: Get more data!
⎯ Almost always the best bet if you have enough compute power to train on more data.
• Approach 2: Use a model that has the right capacity:
⎯ enough to fit the true regularities.
⎯ not enough to also fit spurious regularities (if they are weaker).
• Approach 3: Average many different models.
⎯ Use models with different forms.
⎯ Or train the model on different subsets of the training data (this is called “bagging”).
• Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.

52
Some ways to limit the capacity of a
neural net
• The capacity can be controlled in many ways:
• Architecture: Limit the number of hidden layers and the number of units
per layer.
• Early stopping: Start with small weights and stop the learning before it
overfits.
• Weight-decay: Penalize large weights using penalties or constraints on
their squared values (L2 penalty) or absolute values (L1 penalty).
• Noise: Add noise to the weights or the activities.

• Typically, a combination of several of these methods is used.

53
Regularization
• Neural networks typically have thousands, if not millions, of parameters
⎯ Usually, the dataset size is smaller than the number of parameters
• Overfitting is a grave danger
• Proper weight regularization is crucial to avoid overfitting

θ* ← argmin_θ ∑_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + λΩ(θ)

• Possible regularization methods
⎯ l2-regularization
⎯ l1-regularization
⎯ Dropout
54
l2-regularization
• Most important (or most popular) regularization

θ* ← argmin_θ ∑_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + (λ/2) ∑_l ‖θ_l‖²

• The l2-regularization can pass inside the gradient descent update rule:

θ^{(t+1)} = θ^{(t)} − η_t(∇_θL + λθ^{(t)}) ⇒ θ^{(t+1)} = (1 − λη_t)θ^{(t)} − η_t∇_θL

“Weight decay”, because weights get smaller

• λ is usually about 10⁻¹, 10⁻²
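The update above as a one-line numpy-style sketch (lr and lam stand for the learning rate η_t and strength λ; theta and grad_theta are assumed arrays):

theta -= lr * (grad_theta + lam * theta)   # gradient step with the l2 term
# equivalently: theta = (1 - lr*lam)*theta - lr*grad_theta  ("weight decay")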
55
l1-regularization
• l1-regularization is one of the most important techniques

θ* ← argmin_θ ∑_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + λ ∑_l ‖θ_l‖₁

• Also l1-regularization passes inside the gradient descent update rule:

θ^{(t+1)} = θ^{(t)} − η_t λ θ^{(t)}/|θ^{(t)}| − η_t∇_θL

where θ^{(t)}/|θ^{(t)}| is the sign function

• l1-regularization → sparse weights
• λ ↑ → more weights become 0
56
Data augmentation [Krizhevsky2012]

[figure: an Original image plus augmented versions: Flip, Random crop, Contrast, Tint]
57
Noise as a regularizer: L2 weight-decay via noisy inputs
• Suppose we add Gaussian noise to the inputs.
⎯ The variance of the noise is amplified by the squared weight before going into the next layer: input x_i + N(0, σ_i²) contributes y_j + N(0, w_i²σ_i²) to the next layer.
• In a simple net with a linear output unit directly connected to the inputs, the amplified noise gets added to the output.
• This makes an additive contribution to the squared error.
⎯ So minimizing the squared error tends to minimize the squared weights when the inputs are noisy.

Not exactly equivalent to using an L2 weight penalty.
58
Multi-task Learning
• Improving generalization by pooling the examples arising out of several tasks.
• Different supervised tasks share the same input x, as well as some intermediate-level representation h^{(shared)}
⎯ Task-specific parameters
⎯ Generic parameters (shared across all the tasks)

[Figure 7.2 from Goodfellow et al.: a shared representation h^{(shared)} built on input x feeds task-specific layers h^{(1)}, h^{(2)}, h^{(3)} and outputs y^{(1)}, y^{(2)}]

59
Early stopping
• Start with small weights and stop the learning before it overfits.
• Think of early stopping as a very efficient hyperparameter selection.
⎯ The number of training steps is just another hyperparameter.

[figure (Goodfellow et al., Ch. 5): training error and generalization error vs. capacity, showing the underfitting zone, overfitting zone, generalization gap, and optimal capacity]
60
Model Ensembles
• Train several different models separately, then have all of the models vote on the output for test examples.
• Different models will usually not make all the same errors on the test set.
• Usually ~2% gain!

[Figure 7.5 from Goodfellow et al.: a cartoon depiction of how bagging works. An ‘8’ detector is trained on resampled versions of a dataset containing an ‘8’, a ‘6’ and a ‘9’; each resampled dataset (sampled with replacement) omits or repeats digits, so each ensemble member learns a different, brittle rule, but their average is robust, achieving maximal confidence only when both loops of the ‘8’ are present.]

62
Model Ensembles
• We can also get a small boost from averaging multiple
model checkpoints of a single model.
• keep track of (and use at test time) a running average
parameter vector:
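A sketch of such a running average (the 0.995 decay is an illustrative choice; x are the current weights, x_test the averaged copy used at test time):

x_test = 0.995 * x_test + 0.005 * x   # exponentially-decayed average of weights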

63
Dropout
• “randomly set some neurons to zero in the forward pass”

[Srivastava et al., 2014]


64
Waaaait a second…
How could this possibly be a good idea?

66
Waaaait a second…
How could this possibly be a good idea?
Forces the network to have a redundant representation.

[figure: a cat-score unit fed by features: has an ear (X), has a tail, is furry (X), has claws, mischievous look (X); dropout knocks out a different subset each forward pass]

67
Waaaait a second…
How could this possibly be a good idea?

Another interpretation:
• Dropout is training a large ensemble of models (that share parameters).
• Each binary mask is one model, gets trained on only ~one datapoint.

[Figure from Goodfellow et al., Ch. 7: the base network and the ensemble of sub-networks obtained by dropping different subsets of units]

Ensemble of Sub-Networks 68
At test time….
Ideally:
want to integrate out all the noise

Monte Carlo approximation:


do many forward passes with different
dropout masks, average all predictions

69
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

[diagram: a unit a with inputs x, y and weights w0, w1]

70
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

[diagram: a unit a with inputs x, y and weights w0, w1]

(this can be shown to be an approximation to evaluating the whole ensemble)

71
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

during test: a = w0*x + w1*y

during train (p = 0.5, four equally likely masks):
E[a] = ¼ * (w0*0 + w1*0)
     + ¼ * (w0*0 + w1*y)
     + ¼ * (w0*x + w1*0)
     + ¼ * (w0*x + w1*y)
     = ¼ * (2*w0*x + 2*w1*y)
     = ½ * (w0*x + w1*y)

With p=0.5, using all inputs in the forward pass would inflate the activations by 2x from what the network was “used to” during training! => Have to compensate by scaling the activations back down by ½.

74
We can do something approximate
analytically

At test time all neurons are active always


=> We must scale the activations so that for each neuron:
output at test time = expected output at training time

75
Dropout Summary

drop in forward pass

scale at test time
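A two-layer numpy sketch of exactly this recipe (the ReLU layers, p = 0.5, and the weight names are illustrative assumptions):

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(x, W1, W2):
    h = np.maximum(0, x.dot(W1))
    mask = np.random.rand(*h.shape) < p   # drop in forward pass
    h *= mask
    return h.dot(W2)

def predict(x, W1, W2):
    h = np.maximum(0, x.dot(W1)) * p      # scale at test time
    return h.dot(W2)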


76
Optimization

78
Training a neural network, main loop:
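The loop on the slide, as a sketch (sample_training_data, evaluate_gradient, and the batch size are hypothetical placeholders):

while True:
    data_batch = sample_training_data(data, batch_size=256)  # sample a batch
    loss, grad = evaluate_gradient(loss_fn, data_batch, weights)
    weights += -step_size * grad   # simple gradient descent update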


79
Training a neural network, main loop:

simple gradient descent update


now: complicate.
80
Gradients Again
• When we write ∇_W L(W), we mean the vector of partial derivatives wrt all coordinates of W:

∇_W L(W) = [∂L/∂W₁, ∂L/∂W₂, …, ∂L/∂W_m]ᵀ

where ∂L/∂W_i measures how fast the loss changes vs. change in W_i.
• In figure: loss surface is blue, gradient vectors are red.

[figure: a bowl-shaped loss surface with its gradient vector field]

• When ∇_W L(W) = 0, it means all the partials are zero, i.e. the loss is not changing in any direction. Thus we are at a local optimum (or at least a saddle point).
• Note: arrows point out from a minimum, in toward a maximum.

Slide adapted from John Canny 81
Optimization
• Visualizing gradient descent in one dimension: w ← w − ε dC/dw

[figure: a 1-D loss curve with successive descent steps]

• The regions where gradient descent converges to a particular local minimum are called basins of attraction.
82
Local Minima
• Since the optimization problem is non-convex, it probably has local
minima.
• This kept people from using neural nets for a long time, because
they wanted guarantees they were getting the optimal solution.
• But are local minima really a problem?
− Common view among practitioners: yes, there are local minima, but
they’re probably still pretty good.
• Maybe your network wastes some hidden units, but then you can just make it larger.
− It’s very hard to demonstrate the existence of local minima in practice.
− In any case, other optimization-related issues are much more important.

83
Saddle Points
• At a saddle point, ∂L/∂W_i = 0 even though we are not at a minimum. Some directions curve upwards, and others curve downwards.
• When would saddle points be a problem?
⎯ If we're exactly on the saddle point, then we're stuck.
⎯ If we're slightly to the side, then we can get unstuck.

[figure: a saddle-shaped loss surface]

Saddle points are much more common in high dimensions!
Y. Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS 2014
85
Plateaux
• A flat region is called a plateau. (Plural: plateaux)

[figure: a loss curve with a flat stretch]
86
Plateaux
• An important example of a plateau is a saturated unit: one in the flat region of its activation function. Recall the backprop equations for the weight derivative:

z̄_i = h̄_i φ′(z_i)
w̄_ij = z̄_i x_j

• If φ′(z_i) is always close to zero, then the weights will get stuck.
• If there is a ReLU unit whose input z_i is always negative, the weight derivatives will be exactly 0. We call this a dead unit.
87
Batch Gradient Descent

Algorithm 1 Batch Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:   ĝ ← (1/N) ∇_θ ∑_i L(f(x^{(i)}; θ), y^{(i)})
4:   Apply update: θ ← θ − ε ĝ
5: end while

• Positive: Gradient estimates are stable
• Negative: Need to compute gradients over the entire training set for one update
88
Gradient Descent

[slides 89–94: animation of successive gradient descent steps on a loss surface]
Stochastic Batch Gradient Descent

95
Minibatching
• Potential Problem: Gradient estimates can be very noisy
• Obvious Solution: Use larger mini-batches
• Advantage: Computation time per update does not depend on
number of training examples N

• This allows convergence on extremely large datasets


• See: Large Scale Learning with Stochastic Gradient Descent by
Leon Bottou
96
Stochastic Gradient Descent

[slides 97–105: animation of noisy stochastic gradient descent steps]
Image credits: Alec Radford 106
Suppose loss function is steep vertically but shallow horizontally:

[figure: elongated loss contours]

Q: What is the trajectory along which we converge towards the minimum with SGD?
Very slow progress along flat direction, jitter along steep one.

109
SGD + Momentum

SGD:
x_{t+1} = x_t − α∇f(x_t)

SGD+Momentum:
v_{t+1} = ρv_t + ∇f(x_t)
x_{t+1} = x_t − αv_{t+1}

- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho=0.9 or 0.99

110
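The momentum update as a sketch in code (compute_gradient is a hypothetical helper; x, rho, learning_rate are assumed given):

vx = np.zeros_like(x)
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx            # velocity: running mean of gradients
    x -= learning_rate * vx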
SGD vs Momentum

notice momentum
overshooting the target,
but overall getting to the
minimum much faster.

112
SGD + Momentum

[diagram: momentum update: the actual step is the vector sum of the momentum step and the gradient step]
113
Nesterov Momentum

[diagram: ordinary momentum combines the gradient step at the current point with the momentum step; Nesterov instead takes the momentum step first and uses a “lookahead” gradient there (a bit different than the original formulation)]

Nesterov: the only difference...
114
Nesterov Momentum
v_{t+1} = ρv_t − α∇f(x_t + ρv_t)
x_{t+1} = x_t + v_{t+1}
115
Nesterov Momentum
v_{t+1} = ρv_t − α∇f(x_t + ρv_t)
x_{t+1} = x_t + v_{t+1}

Annoying, usually we want the update in terms of x_t, ∇f(x_t).

Change of variables x̃_t = x_t + ρv_t and rearrange:

v_{t+1} = ρv_t − α∇f(x̃_t)
x̃_{t+1} = x̃_t − ρv_t + (1 + ρ)v_{t+1} = x̃_t + v_{t+1} + ρ(v_{t+1} − v_t)

116
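The rearranged form as code (a sketch; compute_gradient is hypothetical, and x here plays the role of x̃):

dx = compute_gradient(x)
old_v = v
v = rho * v - learning_rate * dx       # velocity with lookahead gradient
x += -rho * old_v + (1 + rho) * v      # the rearranged position update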
NAG = Nesterov Accelerated Gradient

117
AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011

118
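A sketch of the AdaGrad update (the 1e-7 term avoids division by zero; compute_gradient is hypothetical):

grad_squared = np.zeros_like(x)
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx            # historical sum of squares
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)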
AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

Q: What happens with AdaGrad?

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
119
AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

Q2: What happens to the step size over long time?

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
120
RMSProp

[slide: the AdaGrad update vs. the RMSProp update, which replaces the running sum of squared gradients with a decaying average]

[Tieleman and Hinton, 2012] 121
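A sketch of the RMSProp update (decay_rate is typically something like 0.9; compute_gradient is hypothetical):

grad_squared = np.zeros_like(x)
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)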
[animation: AdaGrad vs. RMSProp optimization trajectories]
122
Adaptive Moment Estimation (Adam)
(incomplete, but close)

[update: a momentum-like first moment combined with an AdaGrad/RMSProp-like second moment]

Looks a bit like RMSProp with momentum

Q: What happens at the first timestep?

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

123


Adam (full form)

[update: momentum (first moment), bias correction, and an AdaGrad/RMSProp-like second moment]

The bias correction compensates for the fact that m,v are
initialized at zero and need some time to “warm up”.

[Kingma and Ba, 2014] 124


Adam (full form)

[update: momentum (first moment), bias correction, and an AdaGrad/RMSProp-like second moment]

The bias correction compensates for the fact that m, v are initialized at zero and need some time to “warm up”.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!

[Kingma and Ba, 2014] 125
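A sketch of the full Adam update with bias correction (hyperparameter names follow the slide; compute_gradient and num_iterations are hypothetical):

first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp-like
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)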


SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!

step decay:
e.g. decay learning rate by half every few epochs.

exponential decay: α = α₀ e^{−kt}

1/t decay: α = α₀ / (1 + kt)
127
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!

[figure: loss vs. epoch, with a sharp drop each time the learning rate is decayed]
128
First-Order Optimization
1) Use gradient to form a linear approximation
2) Step to minimize the approximation

[figure: loss vs. w₁ with the linear approximation at the current point]
129
Second-Order Optimization
1) Use gradient and Hessian to form quadratic approximation
2) Step to the minima of the approximation

130
Second order optimization methods
second-order Taylor expansion:

J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ∇_θJ(θ₀) + ½(θ − θ₀)ᵀH(θ − θ₀)

Solving for the critical point we obtain the Newton parameter update:

θ* = θ₀ − H⁻¹∇_θJ(θ₀)

notice:
no hyperparameters! (e.g. learning rate)

Q: what is nice about this update?
131
Second order optimization methods
second-order Taylor expansion:

J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ∇_θJ(θ₀) + ½(θ − θ₀)ᵀH(θ − θ₀)

Solving for the critical point we obtain the Newton parameter update:

θ* = θ₀ − H⁻¹∇_θJ(θ₀)

notice:
no hyperparameters! (e.g. learning rate)

Q2: why is this impractical for training Deep Neural Nets?
132
Second order optimization methods

• Quasi-Newton methods (BFGS most popular):


instead of inverting the Hessian (O(n^3)), approximate inverse
Hessian with rank 1 updates over time (O(n^2) each).

• L-BFGS (Limited memory BFGS):


Does not form/store the full inverse Hessian.

133
Babysitting the Learning Process

134
Say we start with one hidden layer of 50 neurons:

[diagram: CIFAR-10 images (3072 numbers) → input layer → hidden layer (50 hidden neurons) → output layer (10 output neurons, one per class)]
135
Double check that the loss is reasonable:

disable regularization

loss ~2.3: “correct” for 10 classes
(the call returns the loss and the gradient for all parameters)
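The sanity value comes from the softmax loss under uniform predictions over 10 classes:

import numpy as np
print(-np.log(1.0 / 10))   # 2.302..., the expected initial loss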

136
Double check that the loss is reasonable:

crank up regularization

loss went up, good. (sanity check)

137
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

The code on the slide:
- takes the first 20 examples from CIFAR-10
- turns off regularization (reg = 0.0)
- uses simple vanilla ‘sgd’
(a sketch follows below)
138
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

Very small loss, train accuracy 1.00, nice!

139
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

140
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

Loss barely changing

141
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low

Loss barely changing: learning rate is probably too low.

Notice train/val accuracy goes to 20% though, what's up with that? (remember this is softmax)

143
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low

Okay now let's try learning rate 1e6. What could possibly go wrong?
144
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

cost: NaN almost always means high learning rate...
145
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

3e-3 is still too high. Cost explodes….

=> Rough range for learning rate we should be cross-validating is somewhere [1e-3 … 1e-5]
146
Hyperparameter Selection

147
Everything is a hyperparameter
• Network size/depth
• Small model variations
• Minibatch creation strategy
• Optimizer/learning rate

• Models are complicated and opaque, debugging can be difficult!

Adapted from Graham Neubig 148


Cross-validation strategy
First do coarse -> fine cross-validation in stages
First stage: only a few epochs to get rough idea of what params work
Second stage: longer running time, finer search
… (repeat as necessary)

Tip for detecting explosions in the solver:


If the cost is ever > 3 * original cost, break out early

149
For example: run coarse search for 5 epochs
note it’s best to optimize in
log space!
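Sampling in log space, as a sketch (assuming numpy as np; the ranges are illustrative):

learning_rate = 10 ** np.random.uniform(-6, -3)   # sample the exponent uniformly
reg_strength = 10 ** np.random.uniform(-5, 5)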

nice

150
Now run finer search...
adjust range

53% - relatively good


for a 2-layer neural net
with 50 hidden
neurons.

151
Now run finer search...
adjust range

53% - relatively good


for a 2-layer neural net
with 50 hidden
neurons.

But this best cross-validation result is worrying. Why?

152
Now run finer search...
adjust range

53% - relatively good


for a 2-layer neural net
with 50 hidden
neurons.

But this best cross-validation result is worrying.

Learning rate is close to the edge of the sampled range; need a wider search!
153
Random Search vs. Grid Search

Random Search for Hyper-Parameter Optimization


Bergstra and Bengio, 2012
154
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)

[image: a DJ at a mixing desk: the neural networks practitioner turning knobs, with music = loss function]
155
Monitor and visualize the loss curve

157
[figure: loss vs. time]
158
[figure: loss vs. time, flat at first then dropping: bad initialization a prime suspect]

159
Loss function specimen
lossfunctions.tumblr.com

160
lossfunctions.tumblr.com

161
lossfunctions.tumblr.com
162
Monitor and visualize the accuracy:

big gap = overfitting


=> increase regularization strength?

no gap
=> increase model capacity?

163
Visualization
• Check gradients numerically by finite differences
• Visualize features: features need to be uncorrelated and have high variance

• Good training: hidden units are sparse across samples

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial] 165
Visualization
• Check gradients numerically by finite differences
• Visualize features: features need to be uncorrelated and have high variance

• Bad training: many hidden units ignore the input and/or exhibit strong correlations

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial] 166
Visualization
• Check gradients numerically by finite differences
• Visualize features: features need to be uncorrelated and have high variance
• Visualize parameters: learned features should exhibit structure and should be uncorrelated

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial] 167
Take Home Messages

168
Optimization Tricks
• SGD with momentum, batch-normalization, and dropout usually
works very well
• Pick learning rate by running on a subset of the data
• Start with large learning rate & divide by 2 until loss does not diverge
• Decay learning rate by a factor of ~100 or more by the end of training

• Use ReLU nonlinearity


• Initialize parameters so that each feature across layers has similar
variance. Avoid units in saturation.

From Marc'Aurelio Ranzato, CVPR 2014 tutorial 169


Ways To Improve Generalization

• Weight sharing (greatly reduce the number of parameters)

• Dropout

• Weight decay (L2, L1)

• Sparsity in the hidden units

From Marc'Aurelio Ranzato, CVPR 2014 tutorial 170


Babysitting
• Check gradients numerically by finite differences
• Visualize features (features need to be uncorrelated) and have high
variance
• Visualize parameters: learned features should exhibit structure and should be uncorrelated
• Measure error on both training and validation set
• Test on a small subset of the data and check the error → 0.

From Marc'Aurelio Ranzato, CVPR 2014 tutorial 171


Next paper:
The Marginal Value of Adaptive Gradient
Methods in Machine Learning
Wilson et al. NIPS 2017.

Next lecture:
Convolutional
Neural Networks

172
