
Regularization for Deep Learning

• Goal: the ability to perform well not only on the training data but also on new inputs

• Many strategies are designed to reduce the test error, possibly at the expense of an increased training error


• Regularization: modifying the learning algorithm (LA) to reduce its generalization (test) error but not its training error


• General regularization strategies
• Adding extra constraints on the ML model, such as restrictions on parameter values
• Introducing extra (penalty) terms in the cost / objective function

• Other types – Ensemble methods


Regularization Strategies
• Generally based on regularizing estimators
• An effective regularizer makes a favourable trade-off: it decreases variance significantly while increasing bias only slightly
• Generalization and overfitting – three regimes for the model family:
– Excluded the true data-generating process (underfitting)
– Matched the true data-generating process
– Included the true data-generating process but also many other possible processes (overfitting)
• Model complexity
– Not about finding a model of exactly the right size with the right number of parameters
– Instead, the best-fitting model is a large model that has been regularized properly
– The intention is to create a large, deep, well-regularized model
Parameter Norm Penalties
• Limit the model's capacity (applicable to NNs, linear regression, logistic regression)
– Add a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + α Ω(θ), where α ∈ [0, ∞) weights the penalty relative to J
• For NNs, the parameter norm penalty (PNP) acts on the weights of each layer, while the biases remain unregularized
• w – vector of weights affected by the norm penalty
• θ – vector of all parameters, comprising w and the unregularized parameters
• Alternatively, NNs can use a different penalty coefficient α for each layer of the network
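As a rough sketch of the idea above (not from the slides; the function names and the toy quadratic loss are illustrative assumptions), the penalized objective J̃(θ) = J(θ) + αΩ(θ) can be written in NumPy as:

    import numpy as np

    def l2_penalty(w):
        # Omega(theta) = 0.5 * ||w||_2^2, applied to the weights only (biases excluded)
        return 0.5 * np.sum(w ** 2)

    def penalized_objective(loss_fn, w, b, alpha):
        # J_tilde(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)
        return loss_fn(w, b) + alpha * l2_penalty(w)

    # toy quadratic loss in (w, b) -- purely illustrative
    loss = lambda w, b: float(np.sum((w - 1.0) ** 2) + (b - 2.0) ** 2)
    print(penalized_objective(loss, np.array([0.5, -0.3]), 0.0, alpha=0.1))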
L2 Parameter Regularization
• The simplest and most commonly used parameter norm penalty
• The L2 PNP is also known as weight decay
• To avoid overfitting, each update steps w along −∇_w J and additionally subtracts λ·w, so the weights decay towards zero with every weight update (hence "weight decay")
• Drives the weights closer to the origin by adding the regularization term Ω(θ) = (1/2)‖w‖₂² to the objective function
• Known in other communities as ridge regression or Tikhonov regularization
Regularization (revisited)
• Regularization refers to the act of modifying a learning
algorithm to favor “simpler” prediction rules to avoid
overfitting.
• Most commonly, regularization refers to modifying the
loss function to penalize certain values of the weights you
are learning.
• Specifically, penalize weights that are large.
• Identify large weights using the L2 norm of w – the vector's length / Euclidean norm, ‖w‖₂ = √(Σᵢ wᵢ²)
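For concreteness, a tiny illustrative NumPy check of the L2 norm (the numbers are assumed, not from the slides):

    import numpy as np

    w = np.array([3.0, 4.0])
    print(np.linalg.norm(w))   # Euclidean length: 5.0
    print(np.sum(w ** 2))      # squared L2 norm: 25.0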
L2 Regularization (ctd..)

• New goal for minimization: L(w) + λ‖w‖₂², where L(w) is the original loss function
• By minimizing this combined objective, we prefer solutions where w is closer to 0
• λ – hyperparameter that adjusts the trade-off between having low training loss and having low weights
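A minimal sketch, assuming a 1-D quadratic loss (w − 3)² purely for illustration, of how increasing λ pulls the minimizer of L(w) + λw² toward zero:

    def regularized_minimizer(target, lam):
        # argmin_w (w - target)^2 + lam * w^2 has the closed form target / (1 + lam)
        return target / (1.0 + lam)

    for lam in [0.0, 0.1, 1.0, 10.0]:
        print(lam, regularized_minimizer(3.0, lam))
    # as lam grows, the solution shrinks from 3.0 towards 0 -- the trade-off lambda controls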
L2 Regularization (ctd..)
• Assuming no bias parameter (i.e. θ is just w), the regularized cost function is
J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)
• And the gradient: ∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)
• Updating the weights with a single gradient step of learning rate ε:
w ← w − ε(αw + ∇_w J(w; X, y)) = (1 − εα)w − ε ∇_w J(w; X, y)
• Further, make a quadratic approximation to J around the weights w* = argmin_w J(w) that yield the minimal unregularized training cost
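A small NumPy sketch of this update rule; the toy loss J(w) = ‖w − target‖² and the step sizes are illustrative assumptions. Each step rescales w by (1 − εα) before taking the usual gradient step:

    import numpy as np

    alpha, eps = 0.1, 0.05              # weight decay coefficient and learning rate (assumed)
    w = np.array([2.0, -1.5])
    target = np.array([1.0, 1.0])

    def grad_J(w):
        # gradient of the toy unregularized loss J(w) = ||w - target||^2
        return 2.0 * (w - target)

    for _ in range(100):
        # w <- (1 - eps * alpha) * w - eps * grad J(w)
        w = (1.0 - eps * alpha) * w - eps * grad_J(w)

    print(w)   # converges slightly closer to the origin than the unregularized optimum `target`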
L2 Regularization (ctd..)
• The quadratic approximation of J around w* is Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀ H (w − w*)
• H – Hessian matrix of J with respect to w, evaluated at w*
• The minimum of Ĵ occurs where its gradient ∇_w Ĵ(w) = H(w − w*) equals zero
• Adding the weight decay gradient and solving for the regularized minimum w̃:
α w̃ + H(w̃ − w*) = 0, so w̃ = (H + αI)⁻¹ H w*
– When α = 0, w̃ approaches w*
– When α grows, perform an eigendecomposition of H to see its effect
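A numerical sketch of w̃ = (H + αI)⁻¹ H w* with an assumed diagonal Hessian (illustrative values, not from the slides):

    import numpy as np

    H = np.array([[4.0, 0.0],
                  [0.0, 0.5]])          # assumed positive-definite Hessian at w*
    w_star = np.array([1.0, 1.0])       # assumed unregularized minimizer

    for alpha in [0.0, 0.1, 1.0]:
        w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)
        print(alpha, w_tilde)
    # alpha = 0 recovers w*; larger alpha shrinks the small-eigenvalue direction (0.5)
    # much more strongly than the large-eigenvalue direction (4.0)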
L2 Regularization (ctd..)
• H is decomposed into a diagonal matrix Λ of eigenvalues and an orthonormal basis of eigenvectors Q as H = QΛQᵀ
• Therefore, w̃ becomes w̃ = (QΛQᵀ + αI)⁻¹ QΛQᵀ w* = Q(Λ + αI)⁻¹ Λ Qᵀ w*
• The component of w* along the i-th eigenvector of H is rescaled by λᵢ/(λᵢ + α): directions with large eigenvalues are affected little, while directions with small eigenvalues are shrunk towards zero
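The equivalence of the two forms can be checked numerically; the matrix below is an assumed positive-definite Hessian, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    H = A @ A.T + 3 * np.eye(3)          # assumed symmetric positive-definite Hessian
    w_star = rng.standard_normal(3)
    alpha = 0.5

    eigvals, Q = np.linalg.eigh(H)       # H = Q diag(eigvals) Q^T
    Lam = np.diag(eigvals)

    direct  = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)
    via_eig = Q @ np.linalg.solve(Lam + alpha * np.eye(3), Lam @ Q.T @ w_star)
    print(np.allclose(direct, via_eig))  # True: each eigen-direction is scaled by lam_i / (lam_i + alpha)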
L2 Regularization (ctd..)
• Extending to linear regression – the cost function J in terms of the sum of squared errors is
J(w) = (Xw − y)ᵀ(Xw − y)
• Applying L2 regularization modifies J to
(Xw − y)ᵀ(Xw − y) + (α/2) wᵀw
• Therefore the weight-decay solution becomes
w = (XᵀX + αI)⁻¹ Xᵀy, instead of the unregularized w = (XᵀX)⁻¹ Xᵀy