
The Alchemist, Mattheus von Helmont

CMP784
DEEP LEARNING

Lecture #04 – Training Deep Neural Networks

Aykut Erdem // Hacettepe University // Spring 2018


Image: Jose-Luis Olivares

Previously on CMP784
• multi-layer perceptrons
• activation functions
• chain rule
• backpropagation algorithm
• computational graph
• distributed word representations

2
Lecture overview
• data preprocessing and normalization
• weight initializations
• ways to improve generalization
• optimization
• babysitting the learning process
• hyperparameter selection

Disclaimer: Much of the material and slides for this lecture were borrowed from
—Fei-Fei Li, Andrej Karpathy and Justin Johnson’s CS231n class
—Roger Grosse’s CSC321 class
—Shubhendu Trivedi and Risi Kondor’s CMSC 35246 class
—Efstratios Gavves and Max Welling's UvA deep learning class
—Hinton's Neural Networks for Machine Learning class
3
Activation Functions

4
Activation Functions
Sigmoid: σ(x) = 1/(1 + e^{-x})
tanh: tanh(x)
ReLU: max(0, x)
Leaky ReLU: max(0.1x, x)
Maxout
ELU
5
Activation Functions
σ(x) = 1/(1 + e^{-x})

• Squashes numbers to range [0,1]


• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron

Sigmoid

6
Activation Functions
σ(x) = 1/(1 + e^{-x}), with derivative ∂σ/∂x = σ(x)(1 − σ(x))

[diagram: x → sigmoid gate → σ(x)]
• Squashes numbers to range [0,1]
• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron
3 problems:
1. Saturated neurons “kill” the gradients

Sigmoid
Image credit: Jefkine Kafunah
7
Activation Functions
σ(x) = 1/(1 + e^{-x})

• Squashes numbers to range [0,1]


• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron
3 problems:
Sigmoid
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered

9
Consider what happens when the input to
a neuron (x) is always positive:
f(∑_i w_i x_i + b)

[diagram: inputs x_0 … x_n with weights w_0 … w_n and a bias b feed a sum, followed by a non-linearity f]

What can we say about the gradients on w?


10
Consider what happens when the input to
a neuron (x) is always positive:
f(∑_i w_i x_i + b)

[diagram: the two allowed gradient update directions (all-positive or all-negative) force a zig-zag path toward a hypothetical optimal w vector]

What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)
11
Activation Functions
σ(x) = 1/(1 + e^{-x})

• Squashes numbers to range [0,1]


• Historically popular since they
have nice interpretation as a
saturating “firing rate” of a
neuron
3 problems:
Sigmoid
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

12
Activation Functions

• Squashes numbers to range [-1,1]


• zero centered (nice)
• still kills gradients when saturated :(

tanh(x)

[LeCun et al., 1991] 13


Activation Functions
• Computes f(x) = max(0,x)

• Does not saturate (in +region)


• Very computationally efficient
• Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

ReLU
(Rectified Linear Unit)

[Krizhevsky et al., 2012] 14



Activation Functions
∂f/∂x = 0, if x ≤ 0; 1, if x > 0

• Computes f(x) = max(0,x)

[diagram: x → ReLU gate → max(0,x)]

• Does not saturate (in +region)
• Very computationally efficient
• Converges much faster than
sigmoid/tanh in practice (e.g. 6x)

• Not zero-centered output


• An annoyance:

Hint: what is the gradient when x < 0?


ReLU
[Krizhevsky et al., 2012] 15
[figure: DATA CLOUD with an active ReLU vs. a dead ReLU; a dead ReLU will never activate → never update]
17
[figure: DATA CLOUD with an active ReLU vs. a dead ReLU that will never activate → never update]

→ people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)

18
Activation Functions
• Does not saturate
• Computationally efficient
• Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
• will not “die”.

Leaky ReLU
f(x) = max(0.01x, x)

[Maas et al., 2013]


[He et al., 2015]
19
Activation Functions
• Does not saturate
• Computationally efficient
• Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
• will not “die”.

Parametric Rectifier (PReLU)
f(x) = max(αx, x)
backprop into α (parameter)

Leaky ReLU
f(x) = max(0.01x, x)

[Maas et al., 2013]


[He et al., 2015]
20
Activation Functions
Exponential Linear Units (ELU)

• All benefits of ReLU


• Does not die
• Closer to zero mean outputs

• Computation requires exp()


f(x) = x, if x > 0; α(exp(x) − 1), if x ≤ 0

[Clevert et al., 2015] 21


Activation Functions
Scaled Exponential Linear Units (SELU)
• Scaled version of ELU
• Stable and attracting fixed points
for the mean and variance
• No need for batch normalization
• ~100 pages long of pure math
"Using the Banach fixed-point theorem, we
(( prove that activations close to zero mean and
x x if x
if >
x>0 0 unit variance that are propagated through
f (x)
f (x)
== many network layers will converge towards
↵(exp(x)
↵(exp(x) 1) 1) if 
if x x0 0
<latexit sha1_base64="MMebiCeNZVnhZbR1erWNhJmz9j4=">AAAB7XicbVBNS8NAFHypX7V+VT16WSyCp5KIoN6KXjxWMLbQhrLZbNqlm03YfRFK6I/w4kHFq//Hm//GbZuDVgcWhpl57HsTZlIYdN0vp7Kyura+Ud2sbW3v7O7V9w8eTJprxn2WylR3Q2q4FIr7KFDybqY5TULJO+H4ZuZ3Hrk2IlX3OMl4kNChErFgFK3U6Usbjeig3nCb7hzkL/FK0oAS7UH9sx+lLE+4QiapMT3PzTAoqEbBJJ/W+rnhGWVjOuQ9SxVNuAmK+bpTcmKViMSptk8hmas/JwqaGDNJQptMKI7MsjcT//N6OcaXQSFUliNXbPFRnEuCKZndTiKhOUM5sYQyLeyuhI2opgxtQzVbgrd88l/inzWvmt7deaN1XbZRhSM4hlPw4AJacAtt8IHBGJ7gBV6dzHl23pz3RbTilDOH8AvOxzeo0Y9R</latexit>

<latexit sha1_base64="FGZqL1uA3elL0FNwMMZEjGvpaqk=">AAACaXicbVFNb9QwEHXSQstCYVskqOAyYgXaHlglvbQcQBVcOLZSl1ZaR6uJM9m16jip7VRZRXvgL3LjF3DhR+D9QIK2T7L0/OZ5xn5OKyWti6KfQbix+eDh1vajzuMnO0+fdXf3vtmyNoKGolSluUzRkpKahk46RZeVISxSRRfp1ZdF/eKGjJWlPnezipICJ1rmUqDz0rj7nY8g7zcH8LEDwFOaSN0K38/O/R6ggRXeAb+uMQPuqHGtzGEOzaeI86WJo6qm2OfUVL7T+/jgfj9XdA3RYgzp7O8Qnoy7vWgQLQF3SbwmPbbG6bj7g2elqAvSTii0dhRHlUtaNE4KRb5lbalCcYUTGnmqsSCbtMuo5vDWKxnkpfFLO1iq/55osbB2VqTeWaCb2tu1hXhfbVS7/Dhppa5qR1qsBuW1AlfCInfIpCHh1MwTFEb6u4KYokHh/O90fAjx7SffJcPDwYdBfHbYO/m8TmObvWZvWJ/F7IidsK/slA2ZYL+CneBF8DL4He6F++GrlTUM1mees/8Q9v4AaOOxpw==</latexit> <latexit sha1_base64="FGZqL1uA3elL0FNwMMZEjGvpaqk=">AAACaXicbVFNb9QwEHXSQstCYVskqOAyYgXaHlglvbQcQBVcOLZSl1ZaR6uJM9m16jip7VRZRXvgL3LjF3DhR+D9QIK2T7L0/OZ5xn5OKyWti6KfQbix+eDh1vajzuMnO0+fdXf3vtmyNoKGolSluUzRkpKahk46RZeVISxSRRfp1ZdF/eKGjJWlPnezipICJ1rmUqDz0rj7nY8g7zcH8LEDwFOaSN0K38/O/R6ggRXeAb+uMQPuqHGtzGEOzaeI86WJo6qm2OfUVL7T+/jgfj9XdA3RYgzp7O8Qnoy7vWgQLQF3SbwmPbbG6bj7g2elqAvSTii0dhRHlUtaNE4KRb5lbalCcYUTGnmqsSCbtMuo5vDWKxnkpfFLO1iq/55osbB2VqTeWaCb2tu1hXhfbVS7/Dhppa5qR1qsBuW1AlfCInfIpCHh1MwTFEb6u4KYokHh/O90fAjx7SffJcPDwYdBfHbYO/m8TmObvWZvWJ/F7IidsK/slA2ZYL+CneBF8DL4He6F++GrlTUM1mees/8Q9v4AaOOxpw==</latexit>


zero mean and unit variance — even under
! = 1.6732632423543772848170429916717 the presence of noise and perturbations."

" = 1.0507009873554804934193349852946 [Klambauer et al., 2017] 22


Data Preprocessing and
Normalization

24
Data preprocessing
• Scale input variables to have similar diagonal covariances c_i = ∑_j (x_i^{(j)})²
⎯ Similar covariances → more balanced rate of learning for different weights
⎯ Rescaling to 1 is a good choice, unless some dimensions are less important

• Example: x = [x₁, x₂, x₃]ᵀ, θ = [θ₁, θ₂, θ₃]ᵀ, a = tanh(θᵀx)
⎯ x₁, x₂, x₃ have much different covariances
⎯ Generated gradients ∂L/∂θ |_{x₁,x₂,x₃}: much different
⎯ Gradient update harder: θ^{(t+1)} = θ^{(t)} − η_t ∂L/∂θ

25
Data preprocessing
• Input variables should be as decorrelated as possible
⎯ Input variables are “more independent”
⎯ Network is forced to find non-trivial correlations between inputs
⎯ Decorrelated inputs → Better optimization
⎯ Obviously not the case when inputs are by definition correlated (sequences)

• Extreme case
− Extreme correlation (linear dependency) might cause problems [CAUTION]

26
Data preprocessing

(Assume X [NxD] is data matrix, each example in a row)

27
Data preprocessing
In practice, you may also see PCA and Whitening of the data

[left: decorrelated data (data has diagonal covariance matrix); right: whitened data (covariance matrix is the identity matrix)]

28
Data preprocessing
In practice, you may also see PCA and Whitening of the data

29
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images

- Subtract the mean image (e.g. AlexNet)


(mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
(mean along each channel = 3 numbers)

Not common to normalize variance, or to do PCA or whitening.

30
Weight Initialization

31
Q: what happens when W=0 init is used?

32
First idea: Small random numbers
(Gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but can lead to


non-homogeneous distributions of activations
across the layers of a network.

34
Let's look at some activation statistics

E.g. a 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initializing as described in the last slide.
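A minimal numpy sketch of this experiment (the layer sizes and the 0.01 init scale are from the slides; the batch size and the printed statistics are illustrative assumptions):

import numpy as np

hidden_size, num_layers = 500, 10
x = np.random.randn(1000, hidden_size)  # assumed unit-Gaussian input batch

for layer in range(num_layers):
    W = 0.01 * np.random.randn(hidden_size, hidden_size)  # small random init
    x = np.tanh(x.dot(W))
    # with this init the std of the activations shrinks layer by layer,
    # collapsing towards zero in the deeper layers
    print(layer, x.mean(), x.std())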

35
All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about the backward pass for a W*X gate.

37
*1.0 instead of *0.01: almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.

38
“Xavier initialization”
[Glorot et al., 2010]
Reasonable initialization.
(Mathematical derivation
assumes linear activations)
• If a hidden unit has a big fan-in,
small changes on many of its
incoming weights can cause the
learning to overshoot.
⎯ We generally want smaller
incoming weights when the
fan-in is big, so initialize the
weights to be proportional to
sqrt(fan-in).
• We can also scale the learning
rate the same way.
(from Hinton’s notes)
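In code, "initialize weights proportional to sqrt(fan-in)" is one line (a sketch, assuming numpy is imported as np and fan_in/fan_out are the layer's input/output sizes):

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)  # Xavier initialization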
39
but when using the ReLU nonlinearity
it breaks.

40
He et al., 2015
(note additional /2)
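A sketch of the corresponding init for ReLU layers, with the additional factor of 2 in the variance (same assumptions as the Xavier line above):

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)  # He et al. init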

42
Proper initialization is an active area of
research…
• Understanding the difficulty of training deep feedforward neural networks Glorot and
Bengio, 2010
• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Saxe et al, 2013
• Random walk initialization for training very deep feedforward networks Sussillo and
Abbott, 2014
• Delving deep into rectifiers: Surpassing human-level performance on ImageNet
classification He et al., 2015
• Data-dependent Initializations of Convolutional Neural Networks Krähenbühl et al.,
2015
• All you need is a good init, Mishkin and Matas, 2015

43
Batch Normalization
“you want unit Gaussian activations? just make them so.”

consider a batch of activations at some layer.


To make each dimension unit gaussian, apply:

x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

this is a vanilla differentiable function...

[Ioffe and Szegedy, 2015]


44
Batch Normalization
“you want unit gaussian activations? just make them so.”

1. compute the empirical mean and


variance independently for each
dimension.
2. Normalize:
x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

[figure: the batch of activations is an N×D matrix; statistics are computed per dimension, i.e. per column]

[Ioffe and Szegedy, 2015]


45
Batch Normalization

[diagram: … → FC → BN → tanh → FC → BN → tanh → …]

Usually inserted after Fully Connected (or Convolutional, as we'll see soon) layers, and before the tanh nonlinearity.

x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

Problem: do we necessarily want a unit Gaussian input to a tanh layer?

[Ioffe and Szegedy, 2015]
46
Batch Normalization
Normalize:

x̂^{(k)} = (x^{(k)} − E[x^{(k)}]) / √(Var[x^{(k)}])

And then allow the network to squash the range if it wants to:

y^{(k)} = γ^{(k)} x̂^{(k)} + β^{(k)}

Note, the network can learn:
γ^{(k)} = √(Var[x^{(k)}])
β^{(k)} = E[x^{(k)}]
to recover the identity mapping.
[Ioffe and Szegedy, 2015]
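A compact numpy sketch of the training-time transform (the eps term and the [N, D] layout are illustrative assumptions; gamma and beta are the learned scale and shift):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] batch of activations; statistics are per dimension
    mu = x.mean(axis=0)                      # empirical mean
    var = x.var(axis=0)                      # empirical variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # ~unit Gaussian activations
    return gamma * x_hat + beta              # learned squash / shift

# at test time one would instead plug in running averages of mu and var
# accumulated during training, e.g. running_mu = 0.9*running_mu + 0.1*mu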


47
Batch Normalization
• Improves gradient flow through
the network
• Allows higher learning rates
• Reduces the strong
dependence on initialization
• Acts as a form of regularization
in a funny way, and slightly
reduces the need for dropout,
maybe

[Ioffe and Szegedy, 2015]


48
Batch Normalization
Note: at test time BatchNorm
layer functions differently:

The mean/std are not computed


based on the batch. Instead, a
single fixed empirical mean of
activations during training is used.

(e.g. can be estimated during


training with running averages)

[Ioffe and Szegedy, 2015]


49
Other normalization schemes
• Layer Normalization
Ba et al., Layer Normalization, NIPS 2016 Deep Learning Symposium

• Weight Normalization
Salimans and Kingma, Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks,
NIPS, 2016

• Normalization propagation
Arpit et al., Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep
Networks, ICML, 2016

• Divisive Normalization
Ren et al., Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes. ICLR 2016

• Batch Renormalization
Ioffe, Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models, 2017

50
Improving Generalization

51
Preventing Overfitting
• Approach 1: Get more data!
⎯ Almost always the best bet if you have enough compute power to train on more data.
• Approach 2: Use a model that has the right capacity:
⎯ enough to fit the true regularities.
⎯ not enough to also fit spurious regularities (if they are weaker).
• Approach 3: Average many different models.
⎯ Use models with different forms.
⎯ Or train the model on different subsets of the training data (this is called “bagging”).
• Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.

52
Some ways to limit the capacity of a
neural net
• The capacity can be controlled in many ways:
• Architecture: Limit the number of hidden layers and the number of units
per layer.
• Early stopping: Start with small weights and stop the learning before it
overfits.
• Weight-decay: Penalize large weights using penalties or constraints on
their squared values (L2 penalty) or absolute values (L1 penalty).
• Noise: Add noise to the weights or the activities.

• Typically, a combination of several of these methods is used.

53
Regularization
• Neural networks typically have thousands, if not millions, of parameters
⎯ Usually, the dataset size is smaller than the number of parameters
• Overfitting is a grave danger
• Proper weight regularization is crucial to avoid overfitting

θ* ← argmin_θ ∑_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + λΩ(θ)

• Possible regularization methods
⎯ l2-regularization
⎯ l1-regularization
⎯ Dropout
54
l2-regularization
• Most important (or most popular) regularization

θ* ← argmin_θ ∑_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + (λ/2) ∑_l ‖θ_l‖²

• The l2-regularization can pass inside the gradient descent update rule:

θ^{(t+1)} = θ^{(t)} − η_t(∇_θL + λθ^{(t)}) ⇒ θ^{(t+1)} = (1 − λη_t)θ^{(t)} − η_t∇_θL

“Weight decay”, because weights get smaller

• λ is usually about 10⁻¹, 10⁻²
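The update above as a one-line numpy-style sketch (lr and lam stand for the learning rate η_t and strength λ; theta and grad_theta are assumed arrays):

theta -= lr * (grad_theta + lam * theta)   # gradient step with the l2 term
# equivalently: theta = (1 - lr*lam)*theta - lr*grad_theta  ("weight decay")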
55
l1-regularization
• l1-regularization is one of the most important techniques

θ* ← argmin_θ ∑_{(x,y)∈(X,Y)} ℓ(y, a_L(x; θ_{1,…,L})) + λ ∑_l ‖θ_l‖₁

• Also l1-regularization passes inside the gradient descent update rule:

θ^{(t+1)} = θ^{(t)} − η_t λ θ^{(t)}/|θ^{(t)}| − η_t∇_θL

where θ^{(t)}/|θ^{(t)}| is the sign function

• l1-regularization → sparse weights
• λ ↑ → more weights become 0
56
Data augmentation [Krizhevsky2012]

[figure: an Original image plus augmented versions: Flip, Random crop, Contrast, Tint]
57
Noise as a regularizer: L2 weight-decay via noisy inputs
• Suppose we add Gaussian noise to the inputs.
⎯ The variance of the noise is amplified by the squared weight before going into the next layer: input x_i + N(0, σ_i²) contributes y_j + N(0, w_i²σ_i²) to the next layer.
• In a simple net with a linear output unit directly connected to the inputs, the amplified noise gets added to the output.
• This makes an additive contribution to the squared error.
⎯ So minimizing the squared error tends to minimize the squared weights when the inputs are noisy.

Not exactly equivalent to using an L2 weight penalty.
58
Multi-task Learning
• Improving generalization by pooling the examples arising out of several tasks.
• Different supervised tasks share the same input x, as well as some intermediate-level representation h^{(shared)}
⎯ Task-specific parameters
⎯ Generic parameters (shared across all the tasks)

[Figure 7.2 from Goodfellow et al.: a shared representation h^{(shared)} built on input x feeds task-specific layers h^{(1)}, h^{(2)}, h^{(3)} and outputs y^{(1)}, y^{(2)}]

59
Early stopping
• Start with small weights and stop the learning before it overfits.
• Think of early stopping as a very efficient hyperparameter selection.
⎯ The number of training steps is just another hyperparameter.

[figure (Goodfellow et al., Ch. 5): training error and generalization error vs. capacity, showing the underfitting zone, overfitting zone, generalization gap, and optimal capacity]
60
Model Ensembles
• Train several different models separately, then have all of the models vote on the output for test examples.
• Different models will usually not make all the same errors on the test set.
• Usually ~2% gain!

[Figure 7.5 from Goodfellow et al.: a cartoon depiction of how bagging works. An ‘8’ detector is trained on resampled versions of a dataset containing an ‘8’, a ‘6’ and a ‘9’; each resampled dataset (sampled with replacement) omits or repeats digits, so each ensemble member learns a different, brittle rule, but their average is robust, achieving maximal confidence only when both loops of the ‘8’ are present.]

62
Model Ensembles
• We can also get a small boost from averaging multiple
model checkpoints of a single model.
• keep track of (and use at test time) a running average
parameter vector:
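A sketch of such a running average (the 0.995 decay is an illustrative choice; x are the current weights, x_test the averaged copy used at test time):

x_test = 0.995 * x_test + 0.005 * x   # exponentially-decayed average of weights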

63
Dropout
• “randomly set some neurons to zero in the forward pass”

[Srivastava et al., 2014]


64
Waaaait a second…
How could this possibly be a good idea?

66
Waaaait a second…
How could this possibly be a good idea?
Forces the network to have a redundant representation.

[figure: a cat-score unit fed by features: has an ear (X), has a tail, is furry (X), has claws, mischievous look (X); dropout knocks out a different subset each forward pass]

67
Waaaait a second…
How could this possibly be a good idea?

Another interpretation:
• Dropout is training a large ensemble of models (that share parameters).
• Each binary mask is one model, gets trained on only ~one datapoint.

[Figure from Goodfellow et al., Ch. 7: the base network and the ensemble of sub-networks obtained by dropping different subsets of units]

Ensemble of Sub-Networks 68
At test time….
Ideally:
want to integrate out all the noise

Monte Carlo approximation:


do many forward passes with different
dropout masks, average all predictions

69
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

[diagram: a unit a with inputs x, y and weights w0, w1]

70
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

[diagram: a unit a with inputs x, y and weights w0, w1]

(this can be shown to be an approximation to evaluating the whole ensemble)

71
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).

during test: a = w0*x + w1*y

during train (p = 0.5, four equally likely masks):
E[a] = ¼ * (w0*0 + w1*0)
     + ¼ * (w0*0 + w1*y)
     + ¼ * (w0*x + w1*0)
     + ¼ * (w0*x + w1*y)
     = ¼ * (2*w0*x + 2*w1*y)
     = ½ * (w0*x + w1*y)

With p=0.5, using all inputs in the forward pass would inflate the activations by 2x from what the network was “used to” during training! => Have to compensate by scaling the activations back down by ½.

74
We can do something approximate
analytically

At test time all neurons are active always


=> We must scale the activations so that for each neuron:
output at test time = expected output at training time

75
Dropout Summary

drop in forward pass

scale at test time
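A two-layer numpy sketch of exactly this recipe (the ReLU layers, p = 0.5, and the weight names are illustrative assumptions):

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(x, W1, W2):
    h = np.maximum(0, x.dot(W1))
    mask = np.random.rand(*h.shape) < p   # drop in forward pass
    h *= mask
    return h.dot(W2)

def predict(x, W1, W2):
    h = np.maximum(0, x.dot(W1)) * p      # scale at test time
    return h.dot(W2)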


76
Optimization

78
Training a neural network, main loop:
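The loop on the slide, as a sketch (sample_training_data, evaluate_gradient, and the batch size are hypothetical placeholders):

while True:
    data_batch = sample_training_data(data, batch_size=256)  # sample a batch
    loss, grad = evaluate_gradient(loss_fn, data_batch, weights)
    weights += -step_size * grad   # simple gradient descent update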


79
Training a neural network, main loop:

simple gradient descent update


now: complicate.
80
Gradients Again
• When we write ∇_W L(W), we mean the vector of partial derivatives wrt all coordinates of W:

∇_W L(W) = [∂L/∂W₁, ∂L/∂W₂, …, ∂L/∂W_m]ᵀ

where ∂L/∂W_i measures how fast the loss changes vs. change in W_i.
• In figure: loss surface is blue, gradient vectors are red.

[figure: a bowl-shaped loss surface with its gradient vector field]

• When ∇_W L(W) = 0, it means all the partials are zero, i.e. the loss is not changing in any direction. Thus we are at a local optimum (or at least a saddle point).
• Note: arrows point out from a minimum, in toward a maximum.

Slide adapted from John Canny 81
Optimization
• Visualizing gradient descent in one dimension: w ← w − ε dC/dw

[figure: a 1-D loss curve with successive descent steps]

• The regions where gradient descent converges to a particular local minimum are called basins of attraction.
82
Local Minima
• Since the optimization problem is non-convex, it probably has local
minima.
• This kept people from using neural nets for a long time, because
they wanted guarantees they were getting the optimal solution.
• But are local minima really a problem?
− Common view among practitioners: yes, there are local minima, but
they’re probably still pretty good.
• Maybe your network wastes some hidden units, but then you can just make it larger.
− It’s very hard to demonstrate the existence of local minima in practice.
− In any case, other optimization-related issues are much more important.

83
Saddle Points
• At a saddle point, ∂L/∂W_i = 0 even though we are not at a minimum. Some directions curve upwards, and others curve downwards.
• When would saddle points be a problem?
⎯ If we're exactly on the saddle point, then we're stuck.
⎯ If we're slightly to the side, then we can get unstuck.

[figure: a saddle-shaped loss surface]

Saddle points are much more common in high dimensions!
Y. Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS 2014
85
Plateaux
• A flat region is called a plateau. (Plural: plateaux)

[figure: a loss curve with a flat stretch]
86
Plateaux
• An important example of a plateau is a saturated unit: one in the flat region of its activation function. Recall the backprop equations for the weight derivative:

z̄_i = h̄_i φ′(z_i)
w̄_ij = z̄_i x_j

• If φ′(z_i) is always close to zero, then the weights will get stuck.
• If there is a ReLU unit whose input z_i is always negative, the weight derivatives will be exactly 0. We call this a dead unit.
87
Batch Gradient Descent

Algorithm 1 Batch Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:   ĝ ← (1/N) ∇_θ ∑_i L(f(x^{(i)}; θ), y^{(i)})
4:   Apply update: θ ← θ − ε ĝ
5: end while

• Positive: Gradient estimates are stable
• Negative: Need to compute gradients over the entire training set for one update
88
Gradient Descent

[slides 89–94: animation of successive gradient descent steps on a loss surface]
Stochastic Batch Gradient Descent

95
Minibatching
• Potential Problem: Gradient estimates can be very noisy
• Obvious Solution: Use larger mini-batches
• Advantage: Computation time per update does not depend on
number of training examples N

• This allows convergence on extremely large datasets


• See: Large Scale Learning with Stochastic Gradient Descent by
Leon Bottou
96
Stochastic Gradient Descent

[slides 97–105: animation of noisy stochastic gradient descent steps]
Image credits: Alec Radford 106
Suppose loss function is steep vertically but shallow horizontally:

[figure: elongated loss contours]

Q: What is the trajectory along which we converge towards the minimum with SGD?
Very slow progress along flat direction, jitter along steep one.

109
SGD + Momentum

SGD:
x_{t+1} = x_t − α∇f(x_t)

SGD+Momentum:
v_{t+1} = ρv_t + ∇f(x_t)
x_{t+1} = x_t − αv_{t+1}

- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho=0.9 or 0.99

110
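The momentum update as a sketch in code (compute_gradient is a hypothetical helper; x, rho, learning_rate are assumed given):

vx = np.zeros_like(x)
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx            # velocity: running mean of gradients
    x -= learning_rate * vx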
SGD vs Momentum

notice momentum
overshooting the target,
but overall getting to the
minimum much faster.

112
SGD + Momentum

[diagram: momentum update: the actual step is the vector sum of the momentum step and the gradient step]
113
Nesterov Momentum

[diagram: ordinary momentum combines the gradient step at the current point with the momentum step; Nesterov instead takes the momentum step first and uses a “lookahead” gradient there (a bit different than the original formulation)]

Nesterov: the only difference...
114
Nesterov Momentum
v_{t+1} = ρv_t − α∇f(x_t + ρv_t)
x_{t+1} = x_t + v_{t+1}
115
Nesterov Momentum
v_{t+1} = ρv_t − α∇f(x_t + ρv_t)
x_{t+1} = x_t + v_{t+1}

Annoying, usually we want the update in terms of x_t, ∇f(x_t).

Change of variables x̃_t = x_t + ρv_t and rearrange:

v_{t+1} = ρv_t − α∇f(x̃_t)
x̃_{t+1} = x̃_t − ρv_t + (1 + ρ)v_{t+1} = x̃_t + v_{t+1} + ρ(v_{t+1} − v_t)

116
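The rearranged form as code (a sketch; compute_gradient is hypothetical, and x here plays the role of x̃):

dx = compute_gradient(x)
old_v = v
v = rho * v - learning_rate * dx       # velocity with lookahead gradient
x += -rho * old_v + (1 + rho) * v      # the rearranged position update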
NAG = Nesterov Accelerated Gradient

117
AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011

118
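A sketch of the AdaGrad update (the 1e-7 term avoids division by zero; compute_gradient is hypothetical):

grad_squared = np.zeros_like(x)
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx            # historical sum of squares
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)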
AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

Q: What happens with AdaGrad?

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
119
AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

Q2: What happens to the step size over long time?

Duchi et al., “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
120
RMSProp

[slide: the AdaGrad update vs. the RMSProp update, which replaces the running sum of squared gradients with a decaying average]

[Tieleman and Hinton, 2012] 121
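A sketch of the RMSProp update (decay_rate is typically something like 0.9; compute_gradient is hypothetical):

grad_squared = np.zeros_like(x)
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)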
[animation: AdaGrad vs. RMSProp optimization trajectories]
122
Adaptive Moment Estimation (Adam)
(incomplete, but close)

[update: a momentum-like first moment combined with an AdaGrad/RMSProp-like second moment]

Looks a bit like RMSProp with momentum

Q: What happens at the first timestep?

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

123


Adam (full form)

[update: momentum (first moment), bias correction, and an AdaGrad/RMSProp-like second moment]

The bias correction compensates for the fact that m,v are
initialized at zero and need some time to “warm up”.

[Kingma and Ba, 2014] 124


Adam (full form)

[update: momentum (first moment), bias correction, and an AdaGrad/RMSProp-like second moment]

The bias correction compensates for the fact that m, v are initialized at zero and need some time to “warm up”.

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!

[Kingma and Ba, 2014] 125
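A sketch of the full Adam update with bias correction (hyperparameter names follow the slide; compute_gradient and num_iterations are hypothetical):

first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp-like
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)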


SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!

step decay:
e.g. decay learning rate by half every few epochs.

exponential decay: α = α₀ e^{−kt}

1/t decay: α = α₀ / (1 + kt)
127
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
=> Learning rate decay over time!

[figure: loss vs. epoch, with a sharp drop each time the learning rate is decayed]
128
First-Order Optimization
1) Use gradient to form a linear approximation
2) Step to minimize the approximation

[figure: loss vs. w₁ with the linear approximation at the current point]
129
Second-Order Optimization
1) Use gradient and Hessian to form quadratic approximation
2) Step to the minima of the approximation

130
Second order optimization methods
second-order Taylor expansion:

J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ∇_θJ(θ₀) + ½(θ − θ₀)ᵀH(θ − θ₀)

Solving for the critical point we obtain the Newton parameter update:

θ* = θ₀ − H⁻¹∇_θJ(θ₀)

notice:
no hyperparameters! (e.g. learning rate)

Q: what is nice about this update?
131
Second order optimization methods
second-order Taylor expansion:

J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ∇_θJ(θ₀) + ½(θ − θ₀)ᵀH(θ − θ₀)

Solving for the critical point we obtain the Newton parameter update:

θ* = θ₀ − H⁻¹∇_θJ(θ₀)

notice:
no hyperparameters! (e.g. learning rate)

Q2: why is this impractical for training Deep Neural Nets?
132
Second order optimization methods

• Quasi-Newton methods (BFGS most popular):


instead of inverting the Hessian (O(n^3)), approximate inverse
Hessian with rank 1 updates over time (O(n^2) each).

• L-BFGS (Limited memory BFGS):


Does not form/store the full inverse Hessian.

133
Babysitting the Learning Process

134
Say we start with one hidden layer of 50 neurons:

[diagram: CIFAR-10 images (3072 numbers) → input layer → hidden layer (50 hidden neurons) → output layer (10 output neurons, one per class)]
135
Double check that the loss is reasonable:

disable regularization

loss ~2.3: “correct” for 10 classes
(the call returns the loss and the gradient for all parameters)
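The sanity value comes from the softmax loss under uniform predictions over 10 classes:

import numpy as np
print(-np.log(1.0 / 10))   # 2.302..., the expected initial loss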

136
Double check that the loss is reasonable:

crank up regularization

loss went up, good. (sanity check)

137
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

The code on the slide:
- takes the first 20 examples from CIFAR-10
- turns off regularization (reg = 0.0)
- uses simple vanilla ‘sgd’
(a sketch follows below)
138
Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

Very small loss, train accuracy 1.00, nice!

139
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

140
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

Loss barely changing

141
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low

Loss barely changing: learning rate is probably too low.

Notice train/val accuracy goes to 20% though, what's up with that? (remember this is softmax)

143
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low

Okay now let's try learning rate 1e6. What could possibly go wrong?
144
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

cost: NaN almost always means high learning rate...
145
Let's try to train now…

I like to start with small regularization and find the learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

3e-3 is still too high. Cost explodes….

=> Rough range for learning rate we should be cross-validating is somewhere [1e-3 … 1e-5]
146
Hyperparameter Selection

147
Everything is a hyperparameter
• Network size/depth
• Small model variations
• Minibatch creation strategy
• Optimizer/learning rate

• Models are complicated and opaque, debugging can be difficult!

Adapted from Graham Neubig 148


Cross-validation strategy
First do coarse -> fine cross-validation in stages
First stage: only a few epochs to get rough idea of what params work
Second stage: longer running time, finer search
… (repeat as necessary)

Tip for detecting explosions in the solver:


If the cost is ever > 3 * original cost, break out early

149
For example: run coarse search for 5 epochs
note it’s best to optimize in
log space!
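Sampling in log space, as a sketch (assuming numpy as np; the ranges are illustrative):

learning_rate = 10 ** np.random.uniform(-6, -3)   # sample the exponent uniformly
reg_strength = 10 ** np.random.uniform(-5, 5)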

nice

150
Now run finer search...
adjust range

53% - relatively good


for a 2-layer neural net
with 50 hidden
neurons.

151
Now run finer search...
adjust range

53% - relatively good


for a 2-layer neural net
with 50 hidden
neurons.

But this best cross-validation result is worrying. Why?

152
Now run finer search...
adjust range

53% - relatively good


for a 2-layer neural net
with 50 hidden
neurons.

But this best cross-validation result is worrying.

Learning rate is close to the edge of the sampled range; need a wider search!
153
Random Search vs. Grid Search

Random Search for Hyper-Parameter Optimization


Bergstra and Bengio, 2012
154
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)

[image: a DJ at a mixing desk: the neural networks practitioner turning knobs, with music = loss function]
155
Monitor and visualize the loss curve

157
[figure: loss vs. time]
158
[figure: loss vs. time, flat at first then dropping: bad initialization a prime suspect]

159
Loss function specimen
lossfunctions.tumblr.com

160
lossfunctions.tumblr.com

161
lossfunctions.tumblr.com
162
Monitor and visualize the accuracy:

big gap = overfitting


=> increase regularization strength?

no gap
=> increase model capacity?

163
Visualization
• Check gradients numerically by finite differences
• Visualize features: features need to be uncorrelated and have high variance

• Good training: hidden units are sparse across samples

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial] 165
Visualization
• Check gradients numerically by finite differences
• Visualize features: features need to be uncorrelated and have high variance

• Bad training: many hidden units ignore the input and/or exhibit strong correlations

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial] 166
Visualization
• Check gradients numerically by finite differences
• Visualize features: features need to be uncorrelated and have high variance
• Visualize parameters: learned features should exhibit structure and should be uncorrelated

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial] 167
Take Home Messages

168
Optimization Tricks
• SGD with momentum, batch-normalization, and dropout usually
works very well
• Pick learning rate by running on a subset of the data
• Start with large learning rate & divide by 2 until loss does not diverge
• Decay learning rate by a factor of ~100 or more by the end of training

• Use ReLU nonlinearity


• Initialize parameters so that each feature across layers has similar
variance. Avoid units in saturation.

From Marc'Aurelio Ranzato, CVPR 2014 tutorial 169


Ways To Improve Generalization

• Weight sharing (greatly reduce the number of parameters)

• Dropout

• Weight decay (L2, L1)

• Sparsity in the hidden units

From Marc'Aurelio Ranzato, CVPR 2014 tutorial 170


Babysitting
• Check gradients numerically by finite differences
• Visualize features (features need to be uncorrelated) and have high
variance
• Visualize parameters: learned features should exhibit structure and should be uncorrelated
• Measure error on both training and validation set
• Test on a small subset of the data and check the error → 0.

From Marc'Aurelio Ranzato, CVPR 2014 tutorial 171


Next paper:
The Marginal Value of Adaptive Gradient
Methods in Machine Learning
Wilson et al. NIPS 2017.

Next lecture:
Convolutional
Neural Networks

172
