
CS273B Lecture 6: regularization and optimization for deep learning


James Zou

10/12/16

Recap: architectures

•  Feedforward: learning a nonlinear mapping from inputs to outputs.

•  Convnets

•  RNN, LSTM

Predicting: TF binding, gene expression, disease status from images, risk from SNPs, protein structure.


How to train your neural network

Regularization—prevent overfitting

•  Early stopping

•  L2 regularization (aka weight decay)

•  Multi-task learning; data augmentation

•  Dropout



Optimization—overcome underfitting

•  SGD, SGD with momentum

•  RMSProp

Empirical loss vs true loss

Given a training set (x_1, y_1), (x_2, y_2), ... drawn from a data distribution D,

the goal of neural networks (and most ML) is to solve

    \theta^* = \arg\min_\theta \; \mathbb{E}_D\left[ L(f(x; \theta), y) \right]        (true loss)

where L is the loss metric. However, we can only solve the proxy

    \hat{\theta} = \arg\min_\theta \; \sum_i L(f(x_i; \theta), y_i)        (empirical loss)

Overfitting arises from using this proxy, the empirical loss.
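A small runnable illustration (not from the lecture) of the gap between the two losses: a flexible model can drive the empirical loss near zero while the loss on fresh samples from the same distribution stays large.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1, 1, 15)
    y_train = np.sin(3 * x_train) + 0.3 * rng.standard_normal(15)

    # Fit a degree-9 polynomial by least squares (a very flexible model).
    coeffs = np.polyfit(x_train, y_train, deg=9)

    x_test = rng.uniform(-1, 1, 1000)
    y_test = np.sin(3 * x_test) + 0.3 * rng.standard_normal(1000)

    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(train_mse, test_mse)  # empirical loss is much smaller than held-out loss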


Early Stopping

Split the entire dataset into train / validation / test.

[Plot: training error and validation error vs. # of steps. Training error keeps decreasing, while validation error eventually starts to rise; stop at the minimum of the validation error.]

Use in combination with any optimization and regularization.
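A minimal early-stopping sketch (not the lecture's code); `train_one_epoch`, `validation_error`, and the `get_weights`/`set_weights` accessors are hypothetical helpers assumed for illustration.

    def train_with_early_stopping(model, train_data, val_data, patience=5):
        best_err = float("inf")
        best_weights = model.get_weights()  # assumed accessor
        steps_since_improvement = 0
        while steps_since_improvement < patience:
            train_one_epoch(model, train_data)
            err = validation_error(model, val_data)
            if err < best_err:
                best_err = err
                best_weights = model.get_weights()
                steps_since_improvement = 0
            else:
                steps_since_improvement += 1
        model.set_weights(best_weights)  # roll back to the best checkpoint
        return model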




Weight decay

Optimize the loss plus a regularization penalty:

    \theta^* = \arg\min_\theta \; \sum_i L(f(x_i; \theta), y_i) + \lambda \|\theta\|_2^2

In gradient descent:

    g_t = \nabla_\theta \sum_i L(f(x_i; \theta_t), y_i) + 2\lambda\theta_t

    \theta_{t+1} = \theta_t - \epsilon \, g_t = (1 - 2\epsilon\lambda)\,\theta_t - \epsilon \nabla_\theta \sum_i L(f(x_i; \theta_t), y_i)

so each step multiplicatively shrinks ("decays") the weights toward zero.
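A minimal numpy sketch of one such step (not the lecture's code); `grad_loss` is a hypothetical function returning the gradient of the data loss.

    import numpy as np

    def weight_decay_step(theta, grad_loss, lr=0.01, lam=1e-4):
        # Gradient of loss + lambda * ||theta||^2.
        g = grad_loss(theta) + 2 * lam * theta
        # Equivalent to (1 - 2*lr*lam) * theta - lr * grad_loss(theta).
        return theta - lr * g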
Why does this help to reduce overfitting?

•  Corresponds to a Bayesian prior that the weights are close to zero (sketched below).

•  Restricts the complexity of the learned neural
network.
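To make the Bayesian-prior bullet concrete, here is a standard derivation (not from the slides): a Gaussian prior on the weights turns maximum a posteriori (MAP) estimation into exactly the L2-penalized objective.

    \theta_{\mathrm{MAP}} = \arg\max_\theta \; \log p(D \mid \theta) + \log p(\theta)

    p(\theta) = \mathcal{N}(0, \sigma^2 I) \;\Rightarrow\; \log p(\theta) = -\frac{\|\theta\|_2^2}{2\sigma^2} + \mathrm{const}

    \theta_{\mathrm{MAP}} = \arg\min_\theta \; \big[-\log p(D \mid \theta)\big] + \lambda \|\theta\|_2^2, \qquad \lambda = \frac{1}{2\sigma^2}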


Increase your training set: multitask learning

[Diagram: task-specific predictions y1 and y2 sit on top of task-specific hidden units h1, h2, h3, which all feed from a shared hidden layer h_shared.]

Leverages all the data.
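A minimal sketch of this architecture (not the lecture's code) using the Keras functional API; the layer sizes and input dimension are illustrative assumptions. Because both tasks' losses backpropagate into h_shared, the shared layer is effectively trained on all the data.

    from tensorflow.keras import layers, Model

    inputs = layers.Input(shape=(100,))                       # 100 input features (assumed)
    h_shared = layers.Dense(64, activation="relu")(inputs)    # shared representation
    h1 = layers.Dense(32, activation="relu")(h_shared)        # task-1 specific layer
    h2 = layers.Dense(32, activation="relu")(h_shared)        # task-2 specific layer
    y1 = layers.Dense(1, activation="sigmoid", name="y1")(h1)
    y2 = layers.Dense(1, activation="sigmoid", name="y2")(h2)

    model = Model(inputs, [y1, y2])
    model.compile(optimizer="sgd", loss="binary_crossentropy")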

Increase your training set: data augmentation

First, normalize the input—zero mean and unit standard deviation.

Then transform the input data via rotations, shifts, and adding random noise; the transformed copies become new training data.

Example from Jason Brownlee.
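A hedged sketch in the spirit of the Keras examples Jason Brownlee describes, assuming image inputs; ImageDataGenerator normalizes the data and then generates randomly rotated and shifted copies on the fly. The array shape is a stand-in assumption.

    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    x_train = np.random.rand(100, 28, 28, 1)  # stand-in image data (assumed shape)

    datagen = ImageDataGenerator(
        featurewise_center=True,             # zero mean
        featurewise_std_normalization=True,  # unit standard deviation
        rotation_range=20,                   # random rotations (degrees)
        width_shift_range=0.1,               # random horizontal shifts
        height_shift_range=0.1,              # random vertical shifts
    )
    datagen.fit(x_train)  # compute the mean/std used for normalization
    # batches = datagen.flow(x_train, batch_size=32)  # yields augmented minibatches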




Dropout

Set each hidden unit to 0 w/ probability 0.5.

Set each input unit to 0 w/ probability 0.2.

For prediction and back propagation, use only the present edges. The weights of dropped-out edges are not updated and keep their previous values.

Dropout: test time

At test time, keep all units and multiply the output of each unit by its probability of being kept (e.g. 0.5 for the hidden units and 0.8 for the inputs above).
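A minimal numpy sketch of both phases (not the lecture's code): a random mask zeroes units during training, and the test-time output is scaled by the keep probability.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_drop, train=True):
        if train:
            mask = rng.random(h.shape) >= p_drop  # keep each unit w/ prob 1 - p_drop
            return h * mask                       # dropped units output 0
        return h * (1.0 - p_drop)                 # test time: scale by the keep probability

    h = rng.standard_normal(8)
    print(dropout_forward(h, p_drop=0.5))               # training: ~half the units zeroed
    print(dropout_forward(h, p_drop=0.5, train=False))  # test: all units, scaled by 0.5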

Dropout intuition

Dropout is approximately training and averaging an exponentially large ensemble of networks (one network per random dropout mask).

Summary: regularization

Three classes of approaches:

•  L2 regularization—reduce the complexity of the function space.

•  Multi-task learning; data augmentation—effectively increase the number of training examples.

•  Dropout and other noise-addition algorithms—increase the stability of the training algorithm.


Stochastic gradient descent

With learning rate \epsilon, sample a minibatch \{x^{(1)}, \ldots, x^{(m)}\} from the training data with labels y^{(i)}:

    g_t = \frac{1}{m} \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta_t), y^{(i)})

    \theta_{t+1} = \theta_t - \epsilon \, g_t
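A minimal numpy SGD sketch (not the lecture's code); `grad_loss(theta, xb, yb)` is a hypothetical per-minibatch gradient function.

    import numpy as np

    def sgd(theta, X, y, grad_loss, lr=0.1, batch_size=32, n_steps=1000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(n_steps):
            idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
            theta = theta - lr * grad_loss(theta, X[idx], y[idx])     # step along its gradient
        return theta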
SGD with momentum

SGD can zig-zag, especially when the loss landscape is ill-conditioned.

Momentum accumulates past gradients, so each update prefers to go in a similar direction as before.

Figure from Goodfellow, Bengio, Courville



Sample a minibatch \{x^{(1)}, \ldots, x^{(m)}\} and compute the minibatch gradient g_t as in SGD. Then update

    v_{t+1} = \alpha v_t - \epsilon \, g_t

    \theta_{t+1} = \theta_t + v_{t+1}

with momentum parameter \alpha \in [0, 1).
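A minimal momentum update sketch in numpy (not the lecture's code):

    def momentum_step(theta, v, g, lr=0.1, alpha=0.9):
        v = alpha * v - lr * g  # velocity: running direction of past gradients
        theta = theta + v       # move along the accumulated velocity
        return theta, v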
What are limitations of gradient based methods?

•  Local minima and saddle points.

•  Performance depends crucially on step sizes: if too small, then it requires many steps; if too large, then gradients are no longer informative.

•  The algorithms we have seen require setting the learning rate by hand.

RMSProp

Idea: set the learning rate adaptively using the history of gradients.

Sample a minibatch \{x^{(1)}, \ldots, x^{(m)}\} and compute the minibatch gradient g_t as in SGD. With decay rate \rho and a small constant \delta for numerical stability:

    r_t = \rho \, r_{t-1} + (1 - \rho) \, g_t \odot g_t

    \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r_t}} \odot g_t

    \theta_{t+1} = \theta_t + \Delta\theta
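A minimal RMSProp update sketch in numpy (not the lecture's code):

    import numpy as np

    def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
        r = rho * r + (1 - rho) * g * g              # running average of squared gradients
        theta = theta - lr * g / np.sqrt(delta + r)  # per-parameter adaptive step
        return theta, r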
Example: DeepBind

DeepBind optimization

Objective function:

    \theta^* = \arg\min_\theta \; \sum_i L(f(x_i; \theta), y_i) + \lambda \|\theta\|_2^2

Initialization: weights drawn from a Gaussian N(·, ·).

Early stopping using validation data.

SGD with momentum; batch size in [30, 200].

Dropout.

Hyperparameter optimization

•  Amount of weight decay, sampled from a range [·, ·].

•  Learning rate, sampled from a range [·, ·], and momentum.

•  Dropout probability: 0.5, 0.25, or 0.

•  Batch size: 30 to 200.

Shahriari et al., "Taking the human out of the loop: a review of Bayesian optimization." Proceedings of the IEEE, 2016.
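The slides point to Bayesian optimization (Shahriari et al.) for searching this space; as a simpler baseline, here is a hedged random-search sketch. The weight-decay and learning-rate ranges are elided in the slides, so the bounds below are illustrative assumptions, and `train_and_validate` is a hypothetical evaluation function.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_hyperparameters():
        return {
            "weight_decay": 10 ** rng.uniform(-6, -2),   # assumed log-uniform range
            "learning_rate": 10 ** rng.uniform(-4, -1),  # assumed log-uniform range
            "momentum": rng.uniform(0.9, 0.99),          # assumed range
            "dropout_p": rng.choice([0.5, 0.25, 0.0]),   # from the slide
            "batch_size": int(rng.integers(30, 201)),    # 30 to 200, from the slide
        }

    # Evaluate a few random configurations with a hypothetical scorer:
    # results = [train_and_validate(**sample_hyperparameters()) for _ in range(30)]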
