
Regularization for Deep Learning

• Modification made to a learning algorithm (LA) so as to reduce its
generalization (test) error but not its training error
• Goal: to perform well on the training data and on new inputs
• Most available strategies are designed to reduce the test error,
possibly at the cost of increased training error
• General regularization strategies
• Adding extra constraints on the ML model, such as restrictions on the parameter values
• Introduce extra terms in the cost / objective function
• Other types – Ensemble methods
Regularization Strategies
• Generally based on regularizing estimators
• An effective regularizer makes a profitable trade-off: it reduces
variance significantly while increasing bias only slightly
• Generalization and overfitting – three regimes for the model family:
– Excludes the true data-generating process (underfits and induces bias)
– Matches the true data-generating process
– Includes the generating process but also many other candidates (overfits; variance dominates)
• Model complexity
– Finding the model of the right size, with the right number of parameters
– In practice, the best fitting model is often a large model that has
been regularized properly
– Intention is to create a large, deep, regularized model
Parameter Norm Penalties
• Limit the model's capacity (NN, linear regression, logistic regression)
– Add a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ), where α ∈ [0, ∞) weights the penalty
• For NNs, the parameter norm penalty (PNP) typically impacts only the
weights of each layer; the biases remain unregularized
• ω – vector of weights affected by the norm penalty
• θ – vector of all parameters, comprising ω and the unregularized
parameters
• Alternatively, NNs may deploy a different penalty coefficient α for each layer
of the N/w.
L2 Parameter Regularization
• Simplest and most commonly used parameter norm penalty
• L2 PNP is known as weight decay
• To avoid overfitting, each update moves the weight w along −∇J/∇w
and additionally subtracts λ∙w, so the weights decay towards zero:
w ← w − ε(∇J/∇w + λ∙w)
• Drive the weights closer to the origin by adding a
regularization term to the objective function
• Ridge regression / Tikhonov regularization
Regularization (revisited)
• Regularization refers to the act of modifying a learning
algorithm to favor “simpler” prediction rules to avoid
overfitting.
• Most commonly, regularization refers to modifying the
loss function to penalize certain values of the weights you
are learning.
• Specifically, penalize weights that are large.
• Identify large weights using the L2 norm of w – the vector's length /
Euclidean norm: ‖w‖₂ = √(w₁² + w₂² + … + wₙ²)
L2 Regularization (ctd..)

• New goal for minimization:
argmin_w L(w) + λ‖w‖₂²
• The first term is the loss-minimizing objective; by adding the second
term, we prefer solutions where w is closer to 0.
• λ – hyperparameter that adjusts the trade-off between having low
training loss and having low weights
L2 Regularization (ctd..)
• Assuming no bias (i.e. θ is just the weight vector w), the regularized
cost function is J̃(w; X, y) = (α/2)wᵀw + J(w; X, y)
• And the gradient: ∇w J̃ = αw + ∇w J(w; X, y)
• Updating the weights: w ← w − ε(αw + ∇w J) = (1 − εα)w − ε∇w J,
i.e. each step first shrinks w by the constant factor (1 − εα)
• Further, a quadratic approximation to J around the weights that
yield minimal unregularized training cost shows what weight decay does
to the learned weights
L2 Regularization (ctd..)
• Let w* = arg min_w J(w), the weight vector attaining minimal
unregularized training cost
• Then J is approximated as Ĵ(w) = J(w*) + ½(w − w*)ᵀH(w − w*)
• H – Hessian matrix of J with respect to w, evaluated at w*
• Minimum of Ĵ occurs where its gradient ∇w Ĵ = H(w − w*) equals 0
• Adding the weight decay gradient, the regularized minimum w̃ satisfies
αw̃ + H(w̃ − w*) = 0, i.e. w̃ = (H + αI)⁻¹Hw*
– When α = 0, w̃ approaches w*
– When α grows, perform an eigendecomposition of H to study its effect
L2 Regularization (ctd..)
• H is decomposed into a diagonal matrix Λ of eigenvalues and an
orthonormal basis of eigenvectors Q as H = QΛQᵀ
• Therefore, w̃ becomes w̃ = Q(Λ + αI)⁻¹ΛQᵀw*
• The component of w* along the i-th eigenvector is rescaled by
λᵢ/(λᵢ + α): directions with large eigenvalues are barely affected,
while directions with small eigenvalues are shrunk to nearly zero
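The direct and eigendecomposition forms of w̃ can be checked numerically. A small sketch: the positive semi-definite H and the vector w* below are randomly generated stand-ins, and α = 0.5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A.T @ A                      # symmetric PSD stand-in for the Hessian
w_star = rng.normal(size=4)
alpha = 0.5

# Direct form: w_tilde = (H + alpha*I)^(-1) H w*
w_direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigendecomposition form: H = Q diag(lam) Q^T, so each component of
# Q^T w* is rescaled by lam_i / (lam_i + alpha)
lam, Q = np.linalg.eigh(H)
w_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
```

Since every rescaling factor is below 1, the regularized solution always has a smaller norm than w*.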
L2 Regularization (ctd..)
• Extending to linear regression – the cost
function J is the sum of squared errors:
J = (Xw − y)ᵀ(Xw − y)
• Applying L2 regularization modifies J to
J = (Xw − y)ᵀ(Xw − y) + ½αwᵀw
• Therefore, the solution with weight decay becomes
w = (XᵀX + αI)⁻¹Xᵀy instead of w = (XᵀX)⁻¹Xᵀy
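Both closed-form solutions are easy to compare in NumPy. This is a sketch: the data matrix, the true weights and α = 0.1 are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                   # noiseless targets, so OLS recovers w_true

alpha = 0.1
# Ordinary least squares: w = (X^T X)^(-1) X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# With weight decay (ridge): w = (X^T X + alpha*I)^(-1) X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
```

The regularized solution always has a strictly smaller norm than the unregularized one, matching the shrinkage analysis above.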
L1 Regularization
• The PNP is defined as Ω(θ) = ‖w‖₁ = Σᵢ|wᵢ|
• Sum of absolute values of the individual
parameters
• For a linear regression model without bias, the regularized objective is
J̃(w; X, y) = α‖w‖₁ + J(w; X, y), with gradient
∇w J̃ = α∙sign(w) + ∇w J(w; X, y)
• The penalty's contribution does not vary linearly with w; instead a
constant α is added / subtracted depending upon sign(wᵢ)
L1 Regularization (ctd..)
• L1 does not provide a clean algebraic solution
• Represent the cost with a truncated Taylor series around the
unregularized optimum w*
• To keep the approximation tractable, the Hessian is assumed
diagonal, H = diag([H₁,₁, …, Hₙ,ₙ])
• Accordingly, the objective function becomes
Ĵ(w) = J(w*) + Σᵢ [ ½Hᵢ,ᵢ(wᵢ − wᵢ*)² + α|wᵢ| ]
L1 Regularization (ctd..)
• Minimizing with respect to each wᵢ gives
wᵢ = sign(wᵢ*)∙max{|wᵢ*| − α/Hᵢ,ᵢ, 0}, with two outcomes:
– if |wᵢ*| ≤ α/Hᵢ,ᵢ, the optimal wᵢ is exactly 0
– if |wᵢ*| > α/Hᵢ,ᵢ, wᵢ is shifted towards 0 by the amount α/Hᵢ,ᵢ
• L1 regularization therefore results in a sparse solution
• Sparsity assists the feature selection mechanism
• Least Absolute Shrinkage and Selection Operator (LASSO)
• The L1 penalty causes a subset of weights to become zero, so the
corresponding features can be discarded
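The per-coordinate solution above is the soft-thresholding operator. A short sketch; the values of w*, the diagonal Hessian entries and α are invented for illustration.

```python
import numpy as np

def l1_optimal(w_star, h_diag, alpha):
    # w_i = sign(w_i*) * max(|w_i*| - alpha / H_ii, 0):
    # components with |w_i*| <= alpha / H_ii are driven exactly to zero
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([3.0, -0.2, 0.05, -4.0])
h_diag = np.ones(4)                      # assumed diagonal Hessian entries
w = l1_optimal(w_star, h_diag, alpha=0.5)
# small components (-0.2, 0.05) are zeroed; large ones are shifted by 0.5
```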
Norm Penalties as Constrained
Optimization
• Regularized objective function with PNP:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
• To constrain Ω(θ) below some constant k, employ a generalized
Lagrangian function (objective + set of penalties):
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
• Penalty – Karush-Kuhn-Tucker (KKT) multiplier α ×
constraint function (Ω(θ) − k)
• Solution: θ* = arg min_θ max_{α, α≥0} L(θ, α)
Norm Penalties (ctd…)
 
• Fixing α at its optimal value α*, the problem reduces to
θ* = arg min_θ J(θ; X, y) + α*Ω(θ)
• A larger (+ve) α causes the constraint region to shrink, i.e. the
optimal Ω(θ*) to shrink
Lagrangian Multiplier

Example
Maximize f(x, y) subject to the constraint g(x, y) = 0

Lagrangian: L(x, y, λ) = f(x, y) + λ∙g(x, y)
Taking the gradient of the Lagrangian and equating it to zero (∇L = 0)
produces the stationary points; evaluating the objective function f at
these stationary points identifies the constrained maximum.
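As a concrete instance (a hypothetical example, not one from the slide): maximize f(x, y) = x·y subject to x + y = 10. With L = x·y + λ(x + y − 10), setting ∇L = 0 gives y + λ = 0 and x + λ = 0, so x = y; the constraint then yields x = y = 5 and f = 25. A quick numerical check:

```python
import numpy as np

# Scan points on the constraint line x + y = 10 and evaluate f(x, y) = x*y
xs = np.linspace(0.0, 10.0, 1001)
vals = xs * (10.0 - xs)

best_x = xs[np.argmax(vals)]   # should be the stationary point x = 5
best_f = vals.max()            # should be the constrained maximum f = 25
```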
Norm Penalties (ctd…)
• Impact analysis
• The PNP imposes a constraint region on the weights
• With an L2 norm penalty, the weights are constrained to lie in an L2 ball
• With an L1 norm penalty, the weights are constrained to a limited
diamond-shaped region
• PNPs can cause non-convex optimization to get trapped in local
minima with small weights (dead units). Thus, one may replace the PNP
with explicit constraints and re-projection
Norm Penalties (ctd…)

• Explicit constraints (EC) with re-projection do not make the
weights approach the origin; they act only when the weights grow large
• Imposes stability on the optimization procedure
• Higher learning rates can introduce a positive
feedback loop: large weights induce large gradients, which in turn
make the weights larger still
• EC with re-projection prevents this feedback loop from escalating
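A minimal sketch of explicit constraint enforcement by re-projection; the ball radius c and the example weights are illustrative choices. Unlike weight decay, the projection does nothing while the weights stay inside the ball.

```python
import numpy as np

def project_l2_ball(w, c):
    # Re-project w onto the L2 ball {w : ||w||_2 <= c};
    # weights already inside the ball are left untouched
    norm = np.linalg.norm(w)
    if norm > c:
        return w * (c / norm)
    return w

# Typical use after every update: w = project_l2_ball(w - lr * grad, c)
p = project_l2_ball(np.array([3.0, 4.0]), 1.0)   # outside: rescaled to the boundary
q = project_l2_ball(np.array([0.3, 0.4]), 1.0)   # inside: unchanged
```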
Regularization and under-constrained
problems
• Regularization assists in properly defining otherwise
under-constrained ML problems
• Linear regression and Principal Component Analysis (PCA)
rely upon inverting the matrix XᵀX
• This fails whenever XᵀX is singular; the regularized forms invert
XᵀX + αI instead, which is guaranteed to be invertible
• Regularization likewise lets iterative methods solve
underdetermined problems with guaranteed convergence
• Ex: Weight Decay (WD) causes gradient descent to quit
increasing the weights' magnitude when the slope of the likelihood
equals the WD coefficient
Regularization and under-constrained problems
(ctd..)

• In such cases, underdetermined linear
equations are solved using the Moore-Penrose
pseudoinverse
• The pseudoinverse of a matrix X is
X⁺ = lim_{α→0} (XᵀX + αI)⁻¹Xᵀ
• It can thus be seen as linear regression with weight decay in the
limit of vanishing regularization – which is why it offers stability
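A short NumPy illustration (the 2×3 system below is made up): with fewer equations than unknowns, X⁺ picks the minimum-norm solution among the infinitely many exact ones, and agrees with the weight-decay form for small α.

```python
import numpy as np

# Underdetermined system: 2 equations, 3 unknowns
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([1.0, 2.0])

w = np.linalg.pinv(X) @ y      # Moore-Penrose pseudoinverse solution

# Compare with the weight-decay limit form (X^T X + alpha*I)^(-1) X^T y
alpha = 1e-6
w_limit = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
```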
Dataset Augmentation (DA)
• Better generalization – training on more data
• Create fake data and add it to the training set
• Easiest for classification tasks
• The classifier should be invariant to a wide variety of transformations
• Particularly useful in image recognition: translations, small
rotations, and flips of the training images
• Caution: transformations must not change the correct class – OCR
tasks must distinguish between 'b' and 'd', and between '6' and '9',
so horizontal flips and 180° rotations are inappropriate there
Dataset Augmentation (ctd..)
• Injecting noise into the inputs of a NN is also a form of DA
• Dropout constructs new i/ps by
multiplying by noise
• Reduces generalization error
• Other forms of DA
– Adding Gaussian noise to the input
– Cropping images randomly (pre-processing)
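The two DA forms listed above can be sketched with NumPy; the image size, noise scale σ and crop size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, sigma=0.1):
    # DA form 1: perturb the input with zero-mean Gaussian noise
    return x + rng.normal(0.0, sigma, size=x.shape)

def random_crop(img, crop_h, crop_w):
    # DA form 2: cut a random crop_h x crop_w window out of a 2-D image
    h, w = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

img = rng.normal(size=(32, 32))
noisy = add_gaussian_noise(img)
crop = random_crop(img, 28, 28)
```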
Sparse Representations (SR)
• Weight decay places the penalty directly on the model
parameters
• An alternative is to place the penalty on the activations, i.e. on
the representation h: J̃(θ; X, y) = J(θ; X, y) + αΩ(h)
• An L1 penalty on the parameters yields parameter sparsity
(parameters become zero / close to zero)
• The same penalty on h yields representational sparsity:
many elements of the representation h are zero
Sparse Representations (ctd..)
• Representational regularization uses the same sorts of mechanisms
as parameter regularization
– Other SR penalty types:
• Student-t prior penalty
• KL divergence penalty
• Hard constraint on the activation values (Orthogonal
Matching Pursuit (OMP))
Sparse Representations (ctd..)
• OMP encodes the input x with the representation h that
solves the constrained optimization problem
arg min_{h, ‖h‖₀ < k} ‖x − Wh‖²
• ‖h‖₀ – the no. of non-zero entries of h
• The problem can be solved efficiently when W is constrained to be
orthogonal
• The method is called OMP-k, where k is the allowed no. of non-zero
features
• OMP-1 is an effective feature extractor for deep architectures
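A greedy OMP-k sketch (an assumed, straightforward implementation rather than an exact algorithm from the slides): the dictionary atoms are the columns of W; each iteration adds the atom most correlated with the residual, then re-fits the coefficients on the selected support by least squares.

```python
import numpy as np

def omp(x, W, k):
    # Encode x with at most k non-zero entries of h such that x ~ W h
    n_atoms = W.shape[1]
    h = np.zeros(n_atoms)
    support = []
    residual = x.astype(float).copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        h[:] = 0.0
        h[support] = coef
        residual = x - W @ h
    return h

# With an orthogonal dictionary, a 2-sparse signal is recovered exactly
W = np.eye(4)
x = np.array([0.0, 3.0, 0.0, 1.0])
h = omp(x, W, 2)
```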
Bagging and other Ensemble Methods

• Bagging (bootstrap aggregating) combines several models to
reduce generalization error
• Train several models separately, then have them vote on the o/p
for the test examples
• Model averaging – a general strategy in ML
• Techniques employing this strategy are called ensemble methods;
bagging is one such technique
• Works because different models usually do not produce the same
errors on the test set
Bagging and other Ensemble Methods (ctd..)
• Consider a set of k regression models
• Each model i makes an error εᵢ on each example; the errors are
zero-mean with variances E[εᵢ²] = v and covariances E[εᵢεⱼ] = c
• The error of the ensemble's average prediction is (1/k)Σᵢ εᵢ
• MSE = E[((1/k)Σᵢ εᵢ)²] = v/k + ((k − 1)/k)∙c
Bagging and other Ensemble Methods (ctd..)

• Error Analysis
– Perfectly correlated errors (c = v): MSE reduces to v, so averaging
does not help
– Perfectly uncorrelated errors (c = 0): MSE = v/k
– Hence the MSE of the ensemble decreases linearly with the ensemble
size in the uncorrelated case
• Ensemble methods construct the member models in
different ways
– Each member of the ensemble can be trained as a completely
different model, using a different algorithm / objective function
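The closed-form MSE can be verified directly from the error covariance matrix; the values of k, v and c below are arbitrary example choices.

```python
import numpy as np

k, v, c = 5, 1.0, 0.25
# Covariance of the k member errors: v on the diagonal, c off the diagonal
Sigma = c * np.ones((k, k)) + (v - c) * np.eye(k)

# MSE of the ensemble mean = a^T Sigma a with averaging weights a = (1/k, ..., 1/k)
a = np.full(k, 1.0 / k)
mse = a @ Sigma @ a

formula = v / k + (k - 1) / k * c    # v/k + ((k-1)/k) c
```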
Bagging and other Ensemble Methods (ctd..)
• Bagging allows the same model, training algorithm and
objective function to be reused several times
• It constructs k different datasets by sampling with replacement
(bootstrap sampling), each dataset comprising the same number of
examples as the original dataset
• Each replica is missing some of the original examples and contains
duplicates of others, so the k trained models differ
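Bootstrap dataset construction is a one-liner per replica; the toy arrays and sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_datasets(X, y, k):
    # k replicas, each of the same size n as the original dataset,
    # sampled with replacement (some examples repeat, some are missing)
    n = len(X)
    replicas = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)
        replicas.append((X[idx], y[idx]))
    return replicas

X = np.arange(10).reshape(10, 1).astype(float)
y = np.arange(10).astype(float)
replicas = bootstrap_datasets(X, y, k=3)
```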
Bagging and other Ensemble Methods (ctd..)
• NNs benefit from model averaging even when all members are trained
on the same dataset, because random initialization and minibatch
ordering make the members differ
• Reduces generalization error at the cost of increased computation
and memory
• Machine learning contests are typically won with model averaging –
e.g. the Netflix Grand Prize
• Boosting, in contrast, constructs an ensemble with higher capacity
than the individual models
• It builds ensembles of neural networks by incrementally adding
neural networks to the ensemble
• It can also interpret an individual neural network as an ensemble,
incrementally adding hidden units to the network


Outline of Dropout Strategy
