
Regularization for Deep Learning

• Modification made to a learning algorithm (LA) so as to reduce its
generalization (test) error but not its training error
• Goal: to perform well on the training data and on new inputs
• Most available strategies are designed to reduce the test error,
possibly at the cost of increased training error
• General regularization strategies
• Adding extra constraints on the ML model, such as restrictions on the parameter values
• Introduce extra terms in the cost / objective function
• Other types – Ensemble methods
Regularization Strategies
• Generally based on regularizing estimators
• An effective regularizer makes a profitable trade-off: it reduces
variance significantly while increasing bias only slightly
• Generalization and overfitting – three regimes for the model family:
– Excludes the true data-generating process (underfits and induces bias)
– Matches the true data-generating process
– Includes the generating process but also many other candidates (overfits; variance dominates)
• Model complexity
– Finding the model of the right size, with the right number of parameters
– In practice, the best fitting model is often a large model that has
been regularized properly
– Intention is to create a large, deep, regularized model
Parameter Norm Penalties
• Limit the model's capacity (NN, linear regression, logistic regression)
– Add a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ), where α ∈ [0, ∞) weights the penalty
• For NNs, the parameter norm penalty (PNP) typically impacts only the
weights of each layer; the biases remain unregularized
• ω – vector of weights affected by the norm penalty
• θ – vector of all parameters, comprising ω and the unregularized
parameters
• Alternatively, NNs may deploy a different penalty coefficient α for each layer
of the N/w.
L2 Parameter Regularization
• Simplest and most commonly used parameter norm penalty
• L2 PNP is known as weight decay
• To avoid overfitting, each update moves the weight w along −∇J/∇w
and additionally subtracts λ∙w, so the weights decay towards zero:
w ← w − ε(∇J/∇w + λ∙w)
• Drive the weights closer to the origin by adding a
regularization term to the objective function
• Ridge regression / Tikhonov regularization
Regularization (revisited)
• Regularization refers to the act of modifying a learning
algorithm to favor “simpler” prediction rules to avoid
overfitting.
• Most commonly, regularization refers to modifying the
loss function to penalize certain values of the weights you
are learning.
• Specifically, penalize weights that are large.
• Identify large weights using the L2 norm of w – the vector's length /
Euclidean norm: ‖w‖₂ = √(w₁² + w₂² + … + wₙ²)
L2 Regularization (ctd..)

• New goal for minimization:
argmin_w L(w) + λ‖w‖₂²
• The first term is the loss-minimizing objective; by adding the second
term, we prefer solutions where w is closer to 0.
• λ – hyperparameter that adjusts the trade-off between having low
training loss and having low weights
L2 Regularization (ctd..)
• Assuming no bias (i.e. θ is just the weight vector w), the regularized
cost function is J̃(w; X, y) = (α/2)wᵀw + J(w; X, y)
• And the gradient: ∇w J̃ = αw + ∇w J(w; X, y)
• Updating the weights: w ← w − ε(αw + ∇w J) = (1 − εα)w − ε∇w J,
i.e. each step first shrinks w by the constant factor (1 − εα)
• Further, a quadratic approximation to J around the weights that
yield minimal unregularized training cost shows what weight decay does
to the learned weights
L2 Regularization (ctd..)
• Let w* = arg min_w J(w), the weight vector attaining minimal
unregularized training cost
• Then J is approximated as Ĵ(w) = J(w*) + ½(w − w*)ᵀH(w − w*)
• H – Hessian matrix of J with respect to w, evaluated at w*
• Minimum of Ĵ occurs where its gradient ∇w Ĵ = H(w − w*) equals 0
• Adding the weight decay gradient, the regularized minimum w̃ satisfies
αw̃ + H(w̃ − w*) = 0, i.e. w̃ = (H + αI)⁻¹Hw*
– When α = 0, w̃ approaches w*
– When α grows, perform an eigendecomposition of H to study its effect
L2 Regularization (ctd..)
• H is decomposed into a diagonal matrix Λ of eigenvalues and an
orthonormal basis of eigenvectors Q as H = QΛQᵀ
• Therefore, w̃ becomes w̃ = Q(Λ + αI)⁻¹ΛQᵀw*
• The component of w* along the i-th eigenvector is rescaled by
λᵢ/(λᵢ + α): directions with large eigenvalues are barely affected,
while directions with small eigenvalues are shrunk to nearly zero
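The direct and eigendecomposition forms of w̃ can be checked numerically. A small sketch: the positive semi-definite H and the vector w* below are randomly generated stand-ins, and α = 0.5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A.T @ A                      # symmetric PSD stand-in for the Hessian
w_star = rng.normal(size=4)
alpha = 0.5

# Direct form: w_tilde = (H + alpha*I)^(-1) H w*
w_direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigendecomposition form: H = Q diag(lam) Q^T, so each component of
# Q^T w* is rescaled by lam_i / (lam_i + alpha)
lam, Q = np.linalg.eigh(H)
w_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
```

Since every rescaling factor is below 1, the regularized solution always has a smaller norm than w*.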
L2 Regularization (ctd..)
• Extending to linear regression – the cost
function J is the sum of squared errors:
J = (Xw − y)ᵀ(Xw − y)
• Applying L2 regularization modifies J to
J = (Xw − y)ᵀ(Xw − y) + ½αwᵀw
• Therefore, the solution with weight decay becomes
w = (XᵀX + αI)⁻¹Xᵀy instead of w = (XᵀX)⁻¹Xᵀy
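Both closed-form solutions are easy to compare in NumPy. This is a sketch: the data matrix, the true weights and α = 0.1 are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                   # noiseless targets, so OLS recovers w_true

alpha = 0.1
# Ordinary least squares: w = (X^T X)^(-1) X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# With weight decay (ridge): w = (X^T X + alpha*I)^(-1) X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
```

The regularized solution always has a strictly smaller norm than the unregularized one, matching the shrinkage analysis above.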
L1 Regularization
• The PNP is defined as Ω(θ) = ‖w‖₁ = Σᵢ|wᵢ|
• Sum of absolute values of the individual
parameters
• For a linear regression model without bias, the regularized objective is
J̃(w; X, y) = α‖w‖₁ + J(w; X, y), with gradient
∇w J̃ = α∙sign(w) + ∇w J(w; X, y)
• The penalty's contribution does not vary linearly with w; instead a
constant α is added / subtracted depending upon sign(wᵢ)
L1 Regularization (ctd..)
• L1 does not provide a clean algebraic solution
• Represent the cost with a truncated Taylor series around the
unregularized optimum w*
• To keep the approximation tractable, the Hessian is assumed
diagonal, H = diag([H₁,₁, …, Hₙ,ₙ])
• Accordingly, the objective function becomes
Ĵ(w) = J(w*) + Σᵢ [ ½Hᵢ,ᵢ(wᵢ − wᵢ*)² + α|wᵢ| ]
L1 Regularization (ctd..)
• Minimizing with respect to each wᵢ gives
wᵢ = sign(wᵢ*)∙max{|wᵢ*| − α/Hᵢ,ᵢ, 0}, with two outcomes:
– if |wᵢ*| ≤ α/Hᵢ,ᵢ, the optimal wᵢ is exactly 0
– if |wᵢ*| > α/Hᵢ,ᵢ, wᵢ is shifted towards 0 by the amount α/Hᵢ,ᵢ
• L1 regularization therefore results in a sparse solution
• Sparsity assists the feature selection mechanism
• Least Absolute Shrinkage and Selection Operator (LASSO)
• The L1 penalty causes a subset of weights to become zero, so the
corresponding features can be discarded
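The per-coordinate solution above is the soft-thresholding operator. A short sketch; the values of w*, the diagonal Hessian entries and α are invented for illustration.

```python
import numpy as np

def l1_optimal(w_star, h_diag, alpha):
    # w_i = sign(w_i*) * max(|w_i*| - alpha / H_ii, 0):
    # components with |w_i*| <= alpha / H_ii are driven exactly to zero
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([3.0, -0.2, 0.05, -4.0])
h_diag = np.ones(4)                      # assumed diagonal Hessian entries
w = l1_optimal(w_star, h_diag, alpha=0.5)
# small components (-0.2, 0.05) are zeroed; large ones are shifted by 0.5
```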
Norm Penalties as Constrained
Optimization
• Regularized objective function with PNP:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
• To constrain Ω(θ) below some constant k, employ a generalized
Lagrangian function (objective + set of penalties):
L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k)
• Penalty – Karush-Kuhn-Tucker (KKT) multiplier α ×
constraint function (Ω(θ) − k)
• Solution: θ* = arg min_θ max_{α, α≥0} L(θ, α)
Norm Penalties (ctd…)
 
• Fixing α at its optimal value α*, the problem reduces to
θ* = arg min_θ J(θ; X, y) + α*Ω(θ)
• A larger (+ve) α causes the constraint region to shrink, i.e. the
optimal Ω(θ*) to shrink
Lagrangian Multiplier

Example
Maximize f(x, y) subject to the constraint g(x, y) = 0

Lagrangian: L(x, y, λ) = f(x, y) + λ∙g(x, y)
Taking the gradient of the Lagrangian and equating it to zero (∇L = 0)
produces the stationary points; evaluating the objective function f at
these stationary points identifies the constrained maximum.
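As a concrete instance (a hypothetical example, not one from the slide): maximize f(x, y) = x·y subject to x + y = 10. With L = x·y + λ(x + y − 10), setting ∇L = 0 gives y + λ = 0 and x + λ = 0, so x = y; the constraint then yields x = y = 5 and f = 25. A quick numerical check:

```python
import numpy as np

# Scan points on the constraint line x + y = 10 and evaluate f(x, y) = x*y
xs = np.linspace(0.0, 10.0, 1001)
vals = xs * (10.0 - xs)

best_x = xs[np.argmax(vals)]   # should be the stationary point x = 5
best_f = vals.max()            # should be the constrained maximum f = 25
```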
Norm Penalties (ctd…)
• Impact analysis
• The PNP imposes a constraint region on the weights
• With an L2 norm penalty, the weights are constrained to lie in an L2 ball
• With an L1 norm penalty, the weights are constrained to a limited
diamond-shaped region
• PNPs can cause non-convex optimization to get trapped in local
minima with small weights (dead units). Thus, one may replace the PNP
with explicit constraints and re-projection
Norm Penalties (ctd…)

• Explicit constraints (EC) with re-projection do not make the
weights approach the origin; they act only when the weights grow large
• Imposes stability on the optimization procedure
• Higher learning rates can introduce a positive
feedback loop: large weights induce large gradients, which in turn
make the weights larger still
• EC with re-projection prevents this feedback loop from escalating
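A minimal sketch of explicit constraint enforcement by re-projection; the ball radius c and the example weights are illustrative choices. Unlike weight decay, the projection does nothing while the weights stay inside the ball.

```python
import numpy as np

def project_l2_ball(w, c):
    # Re-project w onto the L2 ball {w : ||w||_2 <= c};
    # weights already inside the ball are left untouched
    norm = np.linalg.norm(w)
    if norm > c:
        return w * (c / norm)
    return w

# Typical use after every update: w = project_l2_ball(w - lr * grad, c)
p = project_l2_ball(np.array([3.0, 4.0]), 1.0)   # outside: rescaled to the boundary
q = project_l2_ball(np.array([0.3, 0.4]), 1.0)   # inside: unchanged
```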
Regularization and under-constrained
problems
• Regularization assists in properly defining otherwise
under-constrained ML problems
• Linear regression and Principal Component Analysis (PCA)
rely upon inverting the matrix XᵀX
• This fails whenever XᵀX is singular; the regularized forms invert
XᵀX + αI instead, which is guaranteed to be invertible
• Regularization likewise lets iterative methods solve
underdetermined problems with guaranteed convergence
• Ex: Weight Decay (WD) causes gradient descent to quit
increasing the weights' magnitude when the slope of the likelihood
equals the WD coefficient
Regularization and under-constrained problems
(ctd..)

• In such cases, underdetermined linear
equations are solved using the Moore-Penrose
pseudoinverse
• The pseudoinverse of a matrix X is
X⁺ = lim_{α→0} (XᵀX + αI)⁻¹Xᵀ
• It can thus be seen as linear regression with weight decay in the
limit of vanishing regularization – which is why it offers stability
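A short NumPy illustration (the 2×3 system below is made up): with fewer equations than unknowns, X⁺ picks the minimum-norm solution among the infinitely many exact ones, and agrees with the weight-decay form for small α.

```python
import numpy as np

# Underdetermined system: 2 equations, 3 unknowns
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([1.0, 2.0])

w = np.linalg.pinv(X) @ y      # Moore-Penrose pseudoinverse solution

# Compare with the weight-decay limit form (X^T X + alpha*I)^(-1) X^T y
alpha = 1e-6
w_limit = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
```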
Dataset Augmentation (DA)
• Better generalization – training on more data
• Create fake data and add it to the training set
• Easiest for classification tasks
• The classifier should be invariant to a wide variety of transformations
• Particularly useful in image recognition: translations, small
rotations, and flips of the training images
• Caution: transformations must not change the correct class – OCR
tasks must distinguish between 'b' and 'd', and between '6' and '9',
so horizontal flips and 180° rotations are inappropriate there
Dataset Augmentation (ctd..)
• Injecting noise into the inputs of a NN is also a form of DA
• Dropout constructs new i/ps by
multiplying by noise
• Reduces generalization error
• Other forms of DA
– Adding Gaussian noise to the input
– Cropping images randomly (pre-processing)
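The two DA forms listed above can be sketched with NumPy; the image size, noise scale σ and crop size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, sigma=0.1):
    # DA form 1: perturb the input with zero-mean Gaussian noise
    return x + rng.normal(0.0, sigma, size=x.shape)

def random_crop(img, crop_h, crop_w):
    # DA form 2: cut a random crop_h x crop_w window out of a 2-D image
    h, w = img.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

img = rng.normal(size=(32, 32))
noisy = add_gaussian_noise(img)
crop = random_crop(img, 28, 28)
```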
Sparse Representations (SR)
• Weight decay places the penalty directly on the model
parameters
• An alternative is to place the penalty on the activations, i.e. on
the representation h: J̃(θ; X, y) = J(θ; X, y) + αΩ(h)
• An L1 penalty on the parameters yields parameter sparsity
(parameters become zero / close to zero)
• The same penalty on h yields representational sparsity:
many elements of the representation h are zero
Sparse Representations (ctd..)
• Representational regularization uses the same sorts of mechanisms
as parameter regularization
– Other SR penalty types:
• Student-t prior penalty
• KL divergence penalty
• Hard constraint on the activation values (Orthogonal
Matching Pursuit (OMP))
Sparse Representations (ctd..)
• OMP encodes the input x with the representation h that
solves the constrained optimization problem
arg min_{h, ‖h‖₀ < k} ‖x − Wh‖²
• ‖h‖₀ – the no. of non-zero entries of h
• The problem can be solved efficiently when W is constrained to be
orthogonal
• The method is called OMP-k, where k is the allowed no. of non-zero
features
• OMP-1 is an effective feature extractor for deep architectures
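A greedy OMP-k sketch (an assumed, straightforward implementation rather than an exact algorithm from the slides): the dictionary atoms are the columns of W; each iteration adds the atom most correlated with the residual, then re-fits the coefficients on the selected support by least squares.

```python
import numpy as np

def omp(x, W, k):
    # Encode x with at most k non-zero entries of h such that x ~ W h
    n_atoms = W.shape[1]
    h = np.zeros(n_atoms)
    support = []
    residual = x.astype(float).copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        h[:] = 0.0
        h[support] = coef
        residual = x - W @ h
    return h

# With an orthogonal dictionary, a 2-sparse signal is recovered exactly
W = np.eye(4)
x = np.array([0.0, 3.0, 0.0, 1.0])
h = omp(x, W, 2)
```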
Bagging and other Ensemble Methods

• Bagging (bootstrap aggregating) combines several models to
reduce generalization error
• Train several models separately, then have them vote on the o/p
for the test examples
• Model averaging – a general strategy in ML
• Techniques employing this strategy are called ensemble methods;
bagging is one such technique
• Works because different models usually do not produce the same
errors on the test set
Bagging and other Ensemble Methods (ctd..)
• Consider a set of k regression models
• Each model i makes an error εᵢ on each example; the errors are
zero-mean with variances E[εᵢ²] = v and covariances E[εᵢεⱼ] = c
• The error of the ensemble's average prediction is (1/k)Σᵢ εᵢ
• MSE = E[((1/k)Σᵢ εᵢ)²] = v/k + ((k − 1)/k)∙c
Bagging and other Ensemble Methods (ctd..)

• Error Analysis
– Perfectly correlated errors (c = v): MSE reduces to v, so averaging
does not help
– Perfectly uncorrelated errors (c = 0): MSE = v/k
– Hence the MSE of the ensemble decreases linearly with the ensemble
size in the uncorrelated case
• Ensemble methods construct the member models in
different ways
– Each member of the ensemble can be trained as a completely
different model, using a different algorithm / objective function
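The closed-form MSE can be verified directly from the error covariance matrix; the values of k, v and c below are arbitrary example choices.

```python
import numpy as np

k, v, c = 5, 1.0, 0.25
# Covariance of the k member errors: v on the diagonal, c off the diagonal
Sigma = c * np.ones((k, k)) + (v - c) * np.eye(k)

# MSE of the ensemble mean = a^T Sigma a with averaging weights a = (1/k, ..., 1/k)
a = np.full(k, 1.0 / k)
mse = a @ Sigma @ a

formula = v / k + (k - 1) / k * c    # v/k + ((k-1)/k) c
```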
Bagging and other Ensemble Methods (ctd..)
• Bagging allows the same model, training algorithm and
objective function to be reused several times
• It constructs k different datasets by sampling with replacement
(bootstrap sampling), each dataset comprising the same number of
examples as the original dataset
• Each replica is missing some of the original examples and contains
duplicates of others, so the k trained models differ
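Bootstrap dataset construction is a one-liner per replica; the toy arrays and sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_datasets(X, y, k):
    # k replicas, each of the same size n as the original dataset,
    # sampled with replacement (some examples repeat, some are missing)
    n = len(X)
    replicas = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)
        replicas.append((X[idx], y[idx]))
    return replicas

X = np.arange(10).reshape(10, 1).astype(float)
y = np.arange(10).astype(float)
replicas = bootstrap_datasets(X, y, k=3)
```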
Bagging and other Ensemble Methods (ctd..)
• NNs benefit from model averaging even when all members are trained
on the same dataset, because random initialization and minibatch
ordering make the members differ
• Reduces generalization error at the cost of increased computation
and memory
• Machine learning contests are typically won with model averaging –
e.g. the Netflix Grand Prize
• Boosting, in contrast, constructs an ensemble with higher capacity
than the individual models
• It builds ensembles of neural networks by incrementally adding
neural networks to the ensemble
• It can also interpret an individual neural network as an ensemble,
incrementally adding hidden units to the network


Outline of Dropout Strategy
