
Machine Learning Srihari

Regularization in Neural Networks


Sargur Srihari
srihari@buffalo.edu


Topics in Neural Net Regularization


•  Definition of regularization
•  Methods
1.  Limiting capacity: number of hidden units
2.  Norm Penalties: L2- and L1- regularization
3.  Early stopping
•  Invariant methods
4.  Data Set Augmentation
5.  Tangent propagation


Definition of Regularization
•  Generalization error
•  Performance on inputs not previously seen
•  Also called test error
•  Regularization is:
•  “any modification to a learning algorithm to reduce
its generalization error but not its training error”
•  Reduce generalization error even at the expense of
increasing training error
•  E.g., Limiting model capacity
is a regularization method


Goals of Regularization
•  Some goals of regularization
1. Encode prior knowledge
2. Express preference for simpler model
3. Make an underdetermined problem determined


Need for Regularization

•  Generalization
•  Prevent over-fitting
•  Occam's razor
•  Bayesian point of view
•  Regularization corresponds to prior distributions on
model parameters


Best Model: one properly regularized

•  The best-fitting model is not obtained by finding the right number of parameters
•  Instead, the best-fitting model is a large model that has been regularized appropriately
•  We review several strategies for creating such a large, deep regularized model


Limiting no. of hidden units

•  No. of input/output units is determined by the data dimensionality
•  Number of hidden units M is a free parameter
•  Adjusted to get best predictive performance
•  A possible approach is to get a maximum-likelihood estimate of M for a balance between under-fitting and over-fitting


Effect of No. of Hidden Units

Sinusoidal Regression Problem

Two-layer network trained on 10 data points with M = 1, 3 and 10 hidden units, minimizing a sum-of-squares error function using conjugate gradient descent.

Generalization error is not a simple function of M, due to the presence of local minima in the error function.

Validation Set to fix no. of hidden units


•  Plot a graph choosing random starts and different numbers of hidden units M, and choose the specific solution having the smallest generalization error
•  30 random starts for each M
[Figure: sum-of-squares test error vs. number of hidden units M; 30 points in each column]
•  Overall best validation set performance happened at M = 8
•  There are other ways to control the complexity of a neural network in order to avoid over-fitting
•  An alternative approach is to choose a relatively large value of M and then control complexity by adding a regularization term
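The validation-based selection of M can be sketched as follows; `train_and_eval` is a hypothetical callback (not from the slides) that trains one network with M hidden units from a given random start and returns its validation error.

```python
import numpy as np

def select_hidden_units(candidates, train_and_eval, n_starts=30):
    """Pick the number of hidden units M (and random start) with the
    lowest validation error, as on the slide (30 starts per M)."""
    best_M, best_err = None, np.inf
    for M in candidates:
        for seed in range(n_starts):
            err = train_and_eval(M, seed)   # train one network, return val error
            if err < best_err:
                best_M, best_err = M, err
    return best_M, best_err

# Toy stand-in for train_and_eval: validation error minimized at M = 8
best_M, _ = select_hidden_units(range(1, 11),
                                lambda M, s: (M - 8) ** 2 + 0.01 * s)
print(best_M)  # 8
```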

Parameter Norm Penalty


•  Generalization error not a simple function of M
•  Due to presence of local minima
•  Need to control capacity to avoid over-fitting
•  Alternatively choose large M and control complexity by
addition of regularization term
•  Simplest regularizer is weight decay
Ẽ(w) = E(w) + (λ/2) wᵀw
•  Effective model complexity determined by choice of λ
•  Regularizer is equivalent to a Gaussian prior over w
•  Simple weight decay has a shortcoming
•  it is not invariant to scaling of the weights
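A minimal sketch of weight decay applied to an error function and its gradient (the function name and toy values are illustrative, not from the text):

```python
import numpy as np

def weight_decay(error, grad, w, lam):
    """Add the quadratic penalty (lam/2) w.T w to an error value
    and lam * w to its gradient."""
    reg_error = error + 0.5 * lam * float(w @ w)
    reg_grad = grad + lam * w          # gradient picks up the decay term
    return reg_error, reg_grad

w = np.array([3.0, 4.0])
E, g = 1.0, np.zeros(2)                # toy unregularized error and gradient
E_reg, g_reg = weight_decay(E, g, w, lam=0.1)
print(E_reg)                           # 1.0 + 0.05 * 25 = 2.25
```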

Invariance to Transformation
Same building, different views, scales


Linear Transformation

https://mathinsight.org/determinant_linear_transformation

Ex: Two-layer MLP


•  Simple weight decay is inconsistent with certain
scaling properties of network mappings
•  To show this, consider a multi-layer perceptron
network with two layers of weights and linear
output units
•  Set of inputs x={xi} and outputs y={yk}
•  Activations of hidden units in first layer have form
zj = h( ∑i wji xi + wj0 )

•  Activations of output units are
yk = ∑j wkj zj + wk0


Linear Transformations of i/o Variables


•  If we perform a linear transformation of the input data
xi → x̃i = a xi + b
•  then the mapping performed by the network is unchanged by making a corresponding linear transformation of the weights and biases from the inputs to the hidden units:
wji → w̃ji = (1/a) wji   and   wj0 → w̃j0 = wj0 − (b/a) ∑i wji
•  A similar linear transformation of the outputs of the network
yk → ỹk = c yk + d
•  is achieved by transforming the second-layer weights and biases:
wkj → w̃kj = c wkj   and   wk0 → w̃k0 = c wk0 + d
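These transformation rules can be verified numerically; the sketch below (arbitrary random weights, tanh hidden units — all values illustrative) checks that rescaling the first-layer weights and biases exactly compensates for the input transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 5, 2
W1 = rng.standard_normal((M, D)); b1 = rng.standard_normal(M)
W2 = rng.standard_normal((K, M)); b2 = rng.standard_normal(K)

def forward(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)           # hidden activations z_j
    return W2 @ z + b2                 # linear outputs y_k

x = rng.standard_normal(D)
a, b = 2.0, 0.5
x_t = a * x + b                        # x_i -> a x_i + b
W1_t = W1 / a                          # w_ji -> w_ji / a
b1_t = b1 - (b / a) * W1.sum(axis=1)   # w_j0 -> w_j0 - (b/a) sum_i w_ji

y = forward(x, W1, b1, W2, b2)
y_t = forward(x_t, W1_t, b1_t, W2, b2)
print(np.allclose(y, y_t))             # True: mapping unchanged
```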



Desirable invariance of regularizer


•  Suppose we train two different networks
•  First network: trained using original data: x={xi}, y={yk}
•  Second network: input and/or target variables are
transformed by one of the linear transformations
xi → x̃i = a xi + b     yk → ỹk = c yk + d

•  Consistency requires that we should obtain


equivalent networks that differ only by linear
transformation of the weights

For the first layer: wji → (1/a) wji   and/or for the second layer: wkj → c wkj

Simple weight decay fails invariance


•  Regularizer should have this property
•  Otherwise arbitrarily favor one solution over another
•  Simple weight decay
Ẽ(w) = E(w) + (λ/2) wᵀw
does not have this property
•  It treats all weights and biases on an equal footing
•  while wji and wkj should be rescaled differently
•  Consequently, the two networks will have different weights, violating invariance
•  We therefore look for a regularizer invariant under
the linear transformations


Regularizer invariance
•  The regularizer should be invariant to re-scaling
of weights and shifts of biases
•  Such a regularizer is
(λ1/2) ∑_{w∈W1} w² + (λ2/2) ∑_{w∈W2} w²

•  where W1 is the set of weights in the first layer and W2 the set of weights in the second layer

•  This regularizer remains unchanged under the weight transformations provided the parameters are rescaled using λ1 → a^(1/2) λ1 and λ2 → c^(−1/2) λ2

•  We have seen before that weight decay is


equivalent to a Gaussian prior. So what is the
equivalent prior to this?

Bayesian linear regression

Prior: p(w | α) = N(w | 0, α⁻¹I)

Likelihood: p(t | X, w, β) = ∏_{n=1}^N N(tn | wᵀϕ(xn), β⁻¹)

For the linear model y(x, w) = w0 + w1x with prior p(w | α) ~ N(0, 1/α), the posterior is

p(w | t) ∝ ∏_{n=1}^N N(tn | wᵀϕ(xn), β⁻¹) · N(w | 0, α⁻¹I)

Log of the posterior is

ln p(w | t) = −(β/2) ∑_{n=1}^N { tn − wᵀϕ(xn) }² − (α/2) wᵀw + const

Thus maximization of the posterior is equivalent to weight decay, i.e., minimization of

E(w) = (1/2) ∑_{n=1}^N { tn − wᵀϕ(xn) }² + (λ/2) wᵀw

a sum-of-squares error with the addition of the quadratic regularization term wᵀw, with λ = α/β
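The correspondence λ = α/β can be checked numerically: the MAP solution satisfies (ΦᵀΦ + λI)w = Φᵀt, so the gradient of the regularized error vanishes there. A sketch with synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
Phi = rng.standard_normal((N, D))        # design matrix of basis functions
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

alpha, beta = 0.5, 25.0
lam = alpha / beta                       # lambda = alpha / beta
# MAP solution solves (Phi^T Phi + lam I) w = Phi^T t
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# gradient of E(w) = 0.5 ||t - Phi w||^2 + (lam/2) w^T w is zero at w_map
grad = Phi.T @ (Phi @ w_map - t) + lam * w_map
print(np.max(np.abs(grad)) < 1e-9)       # True
```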



Equivalent prior for neural network


•  The regularizer invariant to rescaling of weights and shifts of biases is
(λ1/2) ∑_{w∈W1} w² + (λ2/2) ∑_{w∈W2} w²

•  where W1: weights of first layer, W2: weights of second layer


•  It corresponds to a prior of the form
p(w | α1, α2) ∝ exp( −(α1/2) ∑_{w∈W1} w² − (α2/2) ∑_{w∈W2} w² )
where α1 and α2 are hyper-parameters

•  An improper prior which cannot be normalized


•  Leads to difficulties in selecting regularization coefficients
and in model comparison within Bayesian framework
•  Instead include separate priors for biases with their own
hyper-parameters

Separate priors for weights and biases


•  We can illustrate effect of the resulting four
hyperparameters
•  α1b, precision of Gaussian distribution of first layer bias
•  α1w, precision of Gaussian of first layer weights
•  α2b, precision of Gaussian distribution of second layer bias
•  α2w, precision of Gaussian distribution of second layer weights

Effect of hyperparameters on i/o


Network with a single input (x range: −1 to +1), a single linear output (y range: −60 to +40), and 12 hidden units with tanh activation functions. Priors are governed by the four hyper-parameters α1b, α1w, α2b and α2w described above.

Draw five samples from the prior over hyperparameters and plot the resulting input-output network functions; the five samples correspond to five colors, and for each setting the resulting function is plotted.

[Figure observations: the vertical output scale is controlled by α2w; the horizontal output scale is controlled by α1w]

3. Early Stopping
•  Alternative to regularization in controlling complexity
•  Error measured with an independent validation set shows an initial decrease and then an increase
[Figure: training set error and validation set error vs. iteration step]
•  Training is stopped at the point of smallest error with the validation data
•  Effectively limits network complexity
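Early stopping can be sketched as follows; `step` and `val_error` are hypothetical callbacks (one training update; validation-set error), and stopping uses a simple patience rule:

```python
import numpy as np

def train_with_early_stopping(step, val_error, max_iters=1000, patience=10):
    """Run training steps; stop once validation error has not improved
    for `patience` consecutive steps. Returns the best step and error."""
    best_err, best_iter, waited = np.inf, 0, 0
    for t in range(max_iters):
        step()                         # one training update
        err = val_error()
        if err < best_err:
            best_err, best_iter, waited = err, t, 0
        else:
            waited += 1
            if waited >= patience:
                break                  # validation error stopped improving
    return best_iter, best_err

# Toy run: validation error falls, then rises (minimum at step 19)
state = {"t": 0}
def step(): state["t"] += 1
def val_error(): return (state["t"] - 20) ** 2
best_iter, best_err = train_with_early_stopping(step, val_error)
print(best_iter, best_err)             # 19 0
```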

Interpreting the effect of Early Stopping


•  Consider a quadratic error function
•  Axes in weight space are parallel to eigenvectors of the Hessian
[Figure: contour of constant error; wML represents the minimum]
•  In the absence of weight decay, the weight vector starts at the origin and proceeds towards wML
•  Stopping at w̃ is similar to weight decay
Ẽ(w) = E(w) + (λ/2) wᵀw
•  Effective number of parameters in the network grows during training

Invariances
•  In classification there is a need for:
•  Predictions invariant to transformations of input
•  Example:
•  handwritten digit should be assigned same
classification irrespective of position in image
(translation) and size (scale)
•  Such transformations produce significant changes in raw data, yet need to produce the same output from the classifier
•  Examples: changes in pixel intensities; in speech recognition, nonlinear time warping along the time axis

Simple Approach for Invariance


•  Large sample set where all transformations are
present
•  E.g., for translation invariance, examples of objects in many different positions
•  Impractical
•  Number of different samples grows exponentially
with number of transformations
•  Seek alternative approaches for adaptive model
to exhibit required invariances


Approaches to Invariance
1.  Training set augmented
•  Transform training patterns for desired invariances
E.g., shift each image into different positions
2.  Regularization term to error function
•  Penalize changes in output when the input is transformed
•  Leads to tangent propagation.
3.  Invariance built into pre-processing
•  By extracting features invariant to transformations
4.  Build invariance into structure of network (CNN)
•  Local receptive fields and shared weights

Approach 1: Transform each input


•  Synthetically warp each handwritten digit image before presentation to the model
[Figure: original image and warped images; random displacements Δx, Δy ∈ (0,1) at each pixel, then smoothed by convolutions of width 0.01, 30 and 60 respectively]
•  Easy to implement but computationally costly
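As a minimal illustration of such augmentation, the sketch below applies random integer translations with zero padding (a much simpler stand-in for the elastic warping described above; the function name is illustrative):

```python
import numpy as np

def random_shift(img, max_shift=2, rng=None):
    """Translate a 2-D image by a random integer offset (dy, dx),
    zero-padding the exposed border."""
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(img)
    # crop the part of the image that stays in frame after the shift
    src = img[max(0, -dy):img.shape[0] - max(0, dy),
              max(0, -dx):img.shape[1] - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

img = np.arange(16.0).reshape(4, 4)
shifted = random_shift(img)            # one randomly translated copy
print(shifted.shape == img.shape)      # True
```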


Approach 2: Tangent Propagation


•  Regularization to
encourage invariance to
transformations
•  by technique of tangent
propagation
•  Consider the effect of a transformation on input xn
•  A 1-D continuous transformation parameterized by ξ applied to xn sweeps a manifold M in D-dimensional input space
[Figure: two-dimensional input space showing the effect of a continuous transformation with a single parameter ξ]
•  Let the vector resulting from acting on xn by this transformation be denoted s(xn, ξ), defined so that s(x, 0) = x
•  Then the tangent to the curve M is given by the directional derivative τ = ∂s/∂ξ, and the tangent vector at point xn is
τn = ∂s(xn, ξ)/∂ξ |ξ=0

Tangent Propagation as Regularization


•  Under a transformation of input vector
•  The network output vector will change
•  Derivative of output k with respect to ξ is given by
∂yk/∂ξ |ξ=0 = ∑_{i=1}^D (∂yk/∂xi)(∂xi/∂ξ) |ξ=0 = ∑_{i=1}^D Jki τi

•  where Jki is the (k,i) element of Jacobian Matrix J


•  Result used to modify the standard error function
Ẽ = E + λΩ
where λ is a regularization coefficient and
Ω = (1/2) ∑n ∑k ( ∂ynk/∂ξ |ξ=0 )² = (1/2) ∑n ∑k ( ∑_{i=1}^D Jnki τni )²

•  To encourage local invariance in the neighborhood of each data point
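For a single data point, Ω is half the squared norm of Jτ; a small sketch with a hand-picked Jacobian and tangent vector (illustrative values):

```python
import numpy as np

def tangent_prop_penalty(J, tau):
    """Omega for one data point: 0.5 * sum_k (sum_i J[k, i] * tau[i])^2."""
    return 0.5 * float(np.sum((J @ tau) ** 2))

J = np.array([[1.0, 2.0],
              [0.0, 1.0]])             # Jacobian dy_k/dx_i
tau = np.array([1.0, -1.0])            # tangent vector
print(tangent_prop_penalty(J, tau))    # J @ tau = [-1, -1] -> 0.5 * 2 = 1.0
```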



Tangent vector from finite differences


[Figure: original image x; tangent vector τ corresponding to a small clockwise rotation; image x + ετ obtained by adding a small contribution from the tangent vector; true rotated image for comparison]

•  In a practical implementation, the tangent vector τn is approximated using finite differences, by subtracting the original vector xn from the corresponding vector after transformation using a small value of ξ, and then dividing by ξ
•  A related technique, tangent distance, is used to build invariance properties into distance-based methods such as nearest-neighbor classifiers
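The finite-difference approximation can be illustrated with an analytic transformation, here rotation of a 2-D point (a stand-in for image rotation), where the true tangent at ξ = 0 is known in closed form:

```python
import numpy as np

def transform(x, xi):
    """s(x, xi): rotate the 2-D point x by angle xi; s(x, 0) = x."""
    c, s = np.cos(xi), np.sin(xi)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def tangent_vector(x, eps=1e-6):
    """tau ~ (s(x, eps) - x) / eps  (finite-difference approximation)."""
    return (transform(x, eps) - x) / eps

x = np.array([1.0, 0.0])
tau = tangent_vector(x)
# analytic tangent at xi = 0 is (-x[1], x[0]) = (0, 1)
print(np.allclose(tau, [0.0, 1.0], atol=1e-5))   # True
```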


Equivalence of Approaches 1 and 2

•  Expanding the training set is closely related to tangent propagation
•  Training with small transformations of the original input vectors, together with a sum-of-squares error function, can be shown to be equivalent to the tangent propagation regularizer
