
CS273B Lecture 6: regularization and optimization for deep learning


James Zou

10/12/16

Recap: architectures

•  Feedforward: learning a nonlinear mapping from inputs to outputs.

•  Convnets

•  RNN, LSTM

Predicting: TF binding, gene expression, disease status from images, risk from SNPs, protein structure.


How to train your neural network

Regularization—prevent overfitting

•  Early stopping

•  L2 regularization (aka weight decay)

•  Multi-task learning; data augmentation

•  Dropout



Optimization—overcome underfitting

•  SGD, SGD with momentum

•  RMSProp

Empirical loss vs true loss

Given a training set (x_1, y_1), (x_2, y_2), ... drawn from a data distribution D,

the goal of neural networks (and most ML) is to solve

    \theta^* = \arg\min_\theta \; \mathbb{E}_D\left[ L(f(x; \theta), y) \right]        (true loss)

where L is the loss metric. However, we can only solve the proxy

    \hat{\theta} = \arg\min_\theta \; \sum_i L(f(x_i; \theta), y_i)        (empirical loss)

Overfitting arises from using this proxy, the empirical loss.
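A small runnable illustration (not from the lecture) of the gap between the two losses: a flexible model can drive the empirical loss near zero while the loss on fresh samples from the same distribution stays large.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, 15)
    y_train = np.sin(3 * x_train) + 0.3 * rng.standard_normal(15)

    # Fit a degree-9 polynomial by least squares (a very flexible model).
    coeffs = np.polyfit(x_train, y_train, deg=9)

    x_test = rng.uniform(-1, 1, 1000)
    y_test = np.sin(3 * x_test) + 0.3 * rng.standard_normal(1000)

    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(train_mse, test_mse)  # empirical loss is much smaller than held-out loss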


Early Stopping

Split the entire dataset into train / validation / test.

[Plot: training error and validation error vs. # of steps. Training error keeps decreasing, while validation error eventually starts to rise; stop at the minimum of the validation error.]

Use in combination with any optimization and regularization.
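A minimal early-stopping sketch (not the lecture's code); `train_one_epoch`, `validation_error`, and the `get_weights`/`set_weights` accessors are hypothetical helpers assumed for illustration.

    def train_with_early_stopping(model, train_data, val_data, patience=5):
        best_err = float("inf")
        best_weights = model.get_weights()  # assumed accessor
        steps_since_improvement = 0
        while steps_since_improvement < patience:
            train_one_epoch(model, train_data)
            err = validation_error(model, val_data)
            if err < best_err:
                best_err = err
                best_weights = model.get_weights()
                steps_since_improvement = 0
            else:
                steps_since_improvement += 1
        model.set_weights(best_weights)  # roll back to the best checkpoint
        return model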




Weight decay

Optimize the loss plus a regularization penalty:

    \theta^* = \arg\min_\theta \; \sum_i L(f(x_i; \theta), y_i) + \lambda \|\theta\|_2^2

In gradient descent:

    g_t = \nabla_\theta \sum_i L(f(x_i; \theta_t), y_i) + 2\lambda\theta_t

    \theta_{t+1} = \theta_t - \epsilon \, g_t = (1 - 2\epsilon\lambda)\,\theta_t - \epsilon \nabla_\theta \sum_i L(f(x_i; \theta_t), y_i)

so each step multiplicatively shrinks ("decays") the weights toward zero.
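A minimal numpy sketch of one such step (not the lecture's code); `grad_loss` is a hypothetical function returning the gradient of the data loss.

    import numpy as np

    def weight_decay_step(theta, grad_loss, lr=0.01, lam=1e-4):
        # Gradient of loss + lambda * ||theta||^2.
        g = grad_loss(theta) + 2 * lam * theta
        # Equivalent to (1 - 2*lr*lam) * theta - lr * grad_loss(theta).
        return theta - lr * g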
Why does this help to reduce overfitting?

•  Corresponds to a Bayesian prior that the weights are close to zero (sketched below).

•  Restricts the complexity of the learned neural
network.
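To make the Bayesian-prior bullet concrete, here is a standard derivation (not from the slides): a Gaussian prior on the weights turns maximum a posteriori (MAP) estimation into exactly the L2-penalized objective.

    \theta_{\mathrm{MAP}} = \arg\max_\theta \; \log p(D \mid \theta) + \log p(\theta)

    p(\theta) = \mathcal{N}(0, \sigma^2 I) \;\Rightarrow\; \log p(\theta) = -\frac{\|\theta\|_2^2}{2\sigma^2} + \mathrm{const}

    \theta_{\mathrm{MAP}} = \arg\min_\theta \; \big[-\log p(D \mid \theta)\big] + \lambda \|\theta\|_2^2, \qquad \lambda = \frac{1}{2\sigma^2}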


Increase your training set: multitask learning

[Diagram: task-specific predictions y1 and y2 sit on top of task-specific hidden units h1, h2, h3, which all feed from a shared hidden layer h_shared.]

Leverages all the data.
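A minimal sketch of this architecture (not the lecture's code) using the Keras functional API; the layer sizes and input dimension are illustrative assumptions. Because both tasks' losses backpropagate into h_shared, the shared layer is effectively trained on all the data.

    from tensorflow.keras import layers, Model

    inputs = layers.Input(shape=(100,))                       # 100 input features (assumed)
    h_shared = layers.Dense(64, activation="relu")(inputs)    # shared representation
    h1 = layers.Dense(32, activation="relu")(h_shared)        # task-1 specific layer
    h2 = layers.Dense(32, activation="relu")(h_shared)        # task-2 specific layer
    y1 = layers.Dense(1, activation="sigmoid", name="y1")(h1)
    y2 = layers.Dense(1, activation="sigmoid", name="y2")(h2)

    model = Model(inputs, [y1, y2])
    model.compile(optimizer="sgd", loss="binary_crossentropy")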

Increase your training set: data augmentation

First, normalize the input—zero mean and unit standard deviation.

Then transform the input data via rotations, shifts, and adding random noise; the transformed copies become new training data.

Example from Jason Brownlee.
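A hedged sketch in the spirit of the Keras examples Jason Brownlee describes, assuming image inputs; ImageDataGenerator normalizes the data and then generates randomly rotated and shifted copies on the fly. The array shape is a stand-in assumption.

    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    x_train = np.random.rand(100, 28, 28, 1)  # stand-in image data (assumed shape)

    datagen = ImageDataGenerator(
        featurewise_center=True,             # zero mean
        featurewise_std_normalization=True,  # unit standard deviation
        rotation_range=20,                   # random rotations (degrees)
        width_shift_range=0.1,               # random horizontal shifts
        height_shift_range=0.1,              # random vertical shifts
    )
    datagen.fit(x_train)  # compute the mean/std used for normalization
    # batches = datagen.flow(x_train, batch_size=32)  # yields augmented minibatches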




Dropout

Set each hidden unit to 0 w/ probability 0.5.

Set each input unit to 0 w/ probability 0.2.

For prediction and back propagation, use only the present edges. The weights of dropped-out edges are not updated and keep their previous values.

Dropout: test time

At test time, keep all units and multiply the output of each unit by its probability of being kept (e.g. 0.5 for the hidden units and 0.8 for the inputs above).
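A minimal numpy sketch of both phases (not the lecture's code): a random mask zeroes units during training, and the test-time output is scaled by the keep probability.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_drop, train=True):
        if train:
            mask = rng.random(h.shape) >= p_drop  # keep each unit w/ prob 1 - p_drop
            return h * mask                       # dropped units output 0
        return h * (1.0 - p_drop)                 # test time: scale by the keep probability

    h = rng.standard_normal(8)
    print(dropout_forward(h, p_drop=0.5))               # training: ~half the units zeroed
    print(dropout_forward(h, p_drop=0.5, train=False))  # test: all units, scaled by 0.5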

Dropout intuition

Dropout is approximately training and averaging an exponentially large ensemble of networks (one network per random dropout mask).

Summary: regularization

Three classes of approaches:

•  L2 regularization—reduce the complexity of the function space.

•  Multi-task learning; data augmentation—effectively increase the number of training examples.

•  Dropout and other noise-addition algorithms—increase the stability of the training algorithm.


Stochastic gradient descent

With learning rate \epsilon, sample a minibatch \{x^{(1)}, \ldots, x^{(m)}\} from the training data with labels y^{(i)}:

    g_t = \frac{1}{m} \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta_t), y^{(i)})

    \theta_{t+1} = \theta_t - \epsilon \, g_t
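A minimal numpy SGD sketch (not the lecture's code); `grad_loss(theta, xb, yb)` is a hypothetical per-minibatch gradient function.

    import numpy as np

    def sgd(theta, X, y, grad_loss, lr=0.1, batch_size=32, n_steps=1000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(n_steps):
            idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
            theta = theta - lr * grad_loss(theta, X[idx], y[idx])     # step along its gradient
        return theta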
SGD with momentum

SGD can zig-zag, especially when the loss landscape is ill-conditioned.

Momentum accumulates past gradients, so each update prefers to go in a similar direction as before.

Figure from Goodfellow, Bengio, Courville



Sample a minibatch \{x^{(1)}, \ldots, x^{(m)}\} and compute the minibatch gradient g_t as in SGD. Then update

    v_{t+1} = \alpha v_t - \epsilon \, g_t

    \theta_{t+1} = \theta_t + v_{t+1}

with momentum parameter \alpha \in [0, 1).
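A minimal momentum update sketch in numpy (not the lecture's code):

    def momentum_step(theta, v, g, lr=0.1, alpha=0.9):
        v = alpha * v - lr * g  # velocity: running direction of past gradients
        theta = theta + v       # move along the accumulated velocity
        return theta, v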
What are limitations of gradient based methods?

•  Local minima and saddle points.

•  Performance depends crucially on step sizes: if too small, then it requires many steps; if too large, then gradients are no longer informative.

•  The algorithms we have seen require setting the learning rate by hand.

RMSProp

Idea: set the learning rate adaptively using the history of gradients.

Sample a minibatch \{x^{(1)}, \ldots, x^{(m)}\} and compute the minibatch gradient g_t as in SGD. With decay rate \rho and a small constant \delta for numerical stability:

    r_t = \rho \, r_{t-1} + (1 - \rho) \, g_t \odot g_t

    \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r_t}} \odot g_t

    \theta_{t+1} = \theta_t + \Delta\theta
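A minimal RMSProp update sketch in numpy (not the lecture's code):

    import numpy as np

    def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
        r = rho * r + (1 - rho) * g * g              # running average of squared gradients
        theta = theta - lr * g / np.sqrt(delta + r)  # per-parameter adaptive step
        return theta, r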
Example: DeepBind

DeepBind optimization

Objective function:

    \theta^* = \arg\min_\theta \; \sum_i L(f(x_i; \theta), y_i) + \lambda \|\theta\|_2^2

Initialization: weights drawn from a Gaussian N(·, ·).

Early stopping using validation data.

SGD with momentum; batch size in [30, 200].

Dropout.

Hyperparameter optimization

•  Amount of weight decay, sampled from a range [·, ·].

•  Learning rate, sampled from a range [·, ·], and momentum.

•  Dropout probability: 0.5, 0.25, or 0.

•  Batch size: 30 to 200.

Shahriari et al., "Taking the human out of the loop: a review of Bayesian optimization." Proceedings of the IEEE, 2016.
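The slides point to Bayesian optimization (Shahriari et al.) for searching this space; as a simpler baseline, here is a hedged random-search sketch. The weight-decay and learning-rate ranges are elided in the slides, so the bounds below are illustrative assumptions, and `train_and_validate` is a hypothetical evaluation function.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_hyperparameters():
        return {
            "weight_decay": 10 ** rng.uniform(-6, -2),   # assumed log-uniform range
            "learning_rate": 10 ** rng.uniform(-4, -1),  # assumed log-uniform range
            "momentum": rng.uniform(0.9, 0.99),          # assumed range
            "dropout_p": rng.choice([0.5, 0.25, 0.0]),   # from the slide
            "batch_size": int(rng.integers(30, 201)),    # 30 to 200, from the slide
        }

    # Evaluate a few random configurations with a hypothetical scorer:
    # results = [train_and_validate(**sample_hyperparameters()) for _ in range(30)]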
