• Regularization: any modification to a learning algorithm intended to reduce its generalization error but not its training error
• General regularization strategies:
– Adding extra constraints on the parameters of the ML model
– Introducing extra terms in the cost/objective function
– Other types, e.g., ensemble methods
Regularization Strategies
• Generally based on regularizing estimators
• An effective regularizer makes a profitable trade: it decreases variance significantly while increasing bias only slightly
• Generalization and overfitting: three regimes for the model family
– Excludes the true data-generating process (underfitting; induces bias)
– Matches the true data-generating process
– Includes the generating process, but also many other possible ones (overfitting; variance dominates)
• Model complexity
– Rather than finding the model of exactly the right size with the right number of parameters, determine the best-fitting (large) model that has been regularized properly
– The intention is to create a large, deep, well-regularized model
Parameter Norm Penalties
• Limit the model's capacity (NN, LR, LoR: neural networks, linear and logistic regression)
– Add a parameter norm penalty Ω(θ) to the objective function J
– J̃(θ; X, y) = J(θ; X, y) + αΩ(θ), where α ∈ [0, ∞) weights the relative contribution of the penalty
• For NNs, the parameter norm penalty (PNP) typically affects only the weights of each layer; the biases remain unregularized
• ω – vector of the weights affected by the penalty
• θ – vector of all parameters, comprising ω and the unregularized parameters
• Alternatively, NNs can use a different penalty coefficient α for each layer of the network
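As a hedged illustration of this convention (my own sketch, not from the slides), the snippet below uses PyTorch parameter groups to apply weight decay to weights only, with hypothetical per-layer coefficients; the model shape and all values are arbitrary.

import torch

# Minimal sketch: weight decay on weights only, with a different
# (illustrative) coefficient per layer; biases are left unregularized.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

per_layer_wd = {'0': 1e-4, '2': 1e-3}   # hypothetical per-layer alphas
groups = []
for name, p in model.named_parameters():
    layer, kind = name.split('.')
    wd = per_layer_wd[layer] if kind == 'weight' else 0.0  # no decay on biases
    groups.append({'params': [p], 'weight_decay': wd})

optimizer = torch.optim.SGD(groups, lr=0.1)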
L2 Parameter Regularization
• Simplest and most commonly used PNP
• The L2 PNP is known as weight decay
• Each gradient update first shrinks the weight vector w by a constant factor and then subtracts the gradient of the loss, so the weights decay towards zero:
– weight update: w ← (1 − εα)∙w − ε∙∇w J(w; X, y), with learning rate ε and decay coefficient α
• Drive the weights closer to the origin by adding a
regularization term to the objective function
• Ridge regression / Tikhonov regularization
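A minimal numpy sketch of this update rule on a least-squares loss (my own example with arbitrary values, not from the slides):

import numpy as np

# Gradient descent with L2 weight decay on J(w) = ||Xw - y||^2 / (2n).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
eps, alpha = 0.1, 0.01                      # learning rate and decay coefficient
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)       # gradient of the data loss
    w = (1 - eps * alpha) * w - eps * grad  # shrink first, then gradient step

Each iteration multiplies w by (1 − εα) before the gradient step, which is exactly the "decay towards zero" behavior described above.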
Regularization (revisited)
• Regularization refers to the act of modifying a learning
algorithm to favor “simpler” prediction rules to avoid
overfitting.
• Most commonly, regularization refers to modifying the
loss function to penalize certain values of the weights you
are learning.
• Specifically, penalize weights that are large.
• Identify large weights using the L2 norm of w: ‖w‖₂, the vector's length (Euclidean norm)
L2 Regularization (ctd..)
• L2 penalty: Ω(θ) = ½‖w‖₂²; the regularized objective is J̃(w) = J(w) + (α/2)‖w‖₂², with gradient ∇w J̃ = ∇w J + αw
L1 Regularization (ctd..)
• L1 penalty: Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|
• Tuning the penalty strength with respect to each weight wᵢ results in two outcomes: the weight is pushed exactly to zero, or it is shifted toward zero by a constant amount
• L1 regularization results in a sparse solution
• Sparsity assists the feature selection mechanism
• Least Absolute Shrinkage and Selection Operator (LASSO)
• The L1 penalty drives a subset of the weights to zero; the corresponding features are effectively discarded
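To make the sparsity effect concrete, here is a hedged numpy sketch (my own example, not from the slides) of proximal gradient descent for the lasso; the soft-threshold step is what drives weights exactly to zero:

import numpy as np

def soft_threshold(z, t):
    # proximal operator of the L1 norm: entries within t of zero become exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [2.0, -1.5]                # sparse ground truth
y = X @ true_w + 0.05 * rng.normal(size=200)

w = np.zeros(10)
eps, alpha = 0.01, 0.5                      # step size and L1 strength (illustrative)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)
    w = soft_threshold(w - eps * grad, eps * alpha)  # ISTA update
# most entries of w end up exactly zero: built-in feature selection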
Norm Penalties as Constrained Optimization
• Regularized objective function with a PNP: J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
• This can be viewed as constrained optimization: minimize J(θ) subject to Ω(θ) ≤ k, with generalized Lagrange function L(θ, α) = J(θ) + α(Ω(θ) − k)
Lagrangian Multiplier Example
• Maximize f(x, y) subject to the constraint g(x, y) = 0
• Lagrangian: L(x, y, λ) = f(x, y) + λ∙g(x, y)
• Taking the gradient of the Lagrangian and equating it to zero produces the stationary points; at any feasible stationary point (g = 0), L equals the objective function f
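As a concrete worked instance (my own example, not from the slides), the sympy snippet below finds the stationary point of L for the hypothetical objective f(x, y) = xy under the constraint x + y = 1:

from sympy import symbols, diff, solve

x, y, lam = symbols('x y lam')
f = x * y                      # hypothetical objective
L = f + lam * (1 - x - y)      # Lagrangian with constraint x + y = 1

# stationary point: all partial derivatives of L vanish
sol = solve([diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sol)   # [{x: 1/2, y: 1/2, lam: 1/2}] -> constrained maximum f = 1/4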
Norm Penalties (ctd…)
• Impact analysis: an explicit constraint (enforced by reprojection) acts only when the weights grow large enough to leave the constraint region, whereas a penalty always pushes the weights toward the origin
• Offers stability: reprojection prevents feedback loops in which large weights produce ever larger gradients, e.g., when training with a high learning rate
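A hedged one-step sketch of reprojection onto a norm ball (my own illustration; the radius r is arbitrary):

import numpy as np

def project_to_ball(w, r):
    # reproject w onto the L2 ball of radius r after a gradient step
    norm = np.linalg.norm(w)
    return w if norm <= r else w * (r / norm)

w = np.array([3.0, 4.0])          # norm 5, outside the ball
w = project_to_ball(w, r=1.0)     # rescaled to norm 1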
Dataset Augmentation (DA)
• Better generalization comes from training on more data
• Create fake data and add it to the training set
• Easiest to apply for classification tasks
• A good classifier should be invariant to a wide variety of transformations
• Particularly useful for image tasks, where such transformation operations are easy to apply
• Caution: OCR tasks must distinguish between 'b' and 'd' and between '6' and '9', so horizontal flips and 180° rotations are inappropriate augmentations
Dataset Augmentation (ctd..)
• Injecting noise into the inputs of a NN is also a form of DA
• Dropout can be seen as constructing new inputs by multiplying the original inputs by noise
• Reduces generalization error
• Other forms of DA (sketched below):
– Adding Gaussian noise to the input
– Randomly cropping images (pre-processing)
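A hedged numpy sketch of these transformations (illustrative only, not from the slides; per the caution above, the flip should be disabled for OCR-style tasks):

import numpy as np

def augment(img, rng):
    """Toy augmentation for a 2-D grayscale image: random horizontal flip,
    random crop after reflection padding, and additive Gaussian noise."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                          # horizontal flip
    padded = np.pad(img, 4, mode='reflect')         # pad 4 px on each side
    top, left = rng.integers(0, 9, size=2)          # random crop offset
    img = padded[top:top + img.shape[0], left:left + img.shape[1]]
    return img + rng.normal(0.0, 0.01, img.shape)   # Gaussian input noise

rng = np.random.default_rng(0)
batch = [augment(np.ones((28, 28)), rng) for _ in range(8)]  # 8 fake variants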
Sparse Representations (SR)
• Weight decay places the penalty directly on the model parameters
• An alternative places the penalty on the activations of the units instead
• An L1 penalty on the parameters yields parameter sparsity (many parameters become zero or close to zero)
• Representational sparsity: many elements of the representation are zero or close to zero (see the sketch below)
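A minimal sketch of the distinction (my own illustration, with arbitrary shapes and values): the penalty term below is computed from the hidden representation h rather than from the weights W:

import numpy as np

# Sparse-coding-style objective: reconstruction error plus L1 on activations.
rng = np.random.default_rng(2)
W = rng.normal(size=(10, 50))               # decoder weights
x = rng.normal(size=10)                     # one input example

h = np.maximum(W.T @ x, 0.0)                # representation (ReLU units)
x_hat = W @ h                               # linear reconstruction
alpha = 0.1                                 # penalty strength (illustrative)
loss = np.sum((x - x_hat) ** 2) + alpha * np.abs(h).sum()  # J + alpha * Omega(h)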
Sparse Representations (ctd..)
• Regularized objective with a representational penalty: J̃(θ; X, y) = J(θ; X, y) + αΩ(h), where h is the vector of unit activations
• Representational regularization works in the same way as parameter regularization
– Other penalties that yield a sparse representation:
• Student-t
• KL divergence
• Hard constraint on the activation values (Orthogonal
Matching Pursuit (OMP))
Sparse Representations (ctd..)
• OMP encodes the input x with the representation h that solves the constrained optimization problem
• h* = arg min over h with ‖h‖₀ < k of ‖x − Wh‖², where ‖h‖₀ counts the nonzero entries of h
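A hedged numpy sketch of the greedy OMP procedure (my own implementation for illustration; the columns of W are assumed unit-norm):

import numpy as np

def omp(W, x, k):
    """Greedy OMP: find h with at most k nonzeros minimizing ||x - W h||^2."""
    residual = x.copy()
    support, coef = [], np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(W.T @ residual)))  # atom most correlated with residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)  # refit on support
        residual = x - W[:, support] @ coef
    h = np.zeros(W.shape[1])
    h[support] = coef
    return h

rng = np.random.default_rng(3)
W = rng.normal(size=(20, 100))
W /= np.linalg.norm(W, axis=0)        # unit-norm dictionary atoms
x = 2.0 * W[:, 5] - 1.0 * W[:, 40]    # signal built from two atoms
h = omp(W, x, k=2)                    # recovers a 2-sparse representation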
Bagging and other Ensemble Methods (ctd..)
• Error analysis: let each of k regression models make an error εᵢ with variance E[εᵢ²] = v and pairwise covariance E[εᵢεⱼ] = c; the expected squared error of the ensemble average is v/k + (k − 1)c/k
– Perfectly correlated (c = v): the MSE reduces to v, so averaging does not help
– Perfectly uncorrelated (c = 0): MSE = v/k
– The expected squared error of the ensemble decreases linearly with the ensemble size k (a simulation follows below)
• The Netflix Grand Prize was won by model averaging over many individual models
• Boosting builds ensembles of neural networks by incrementally adding networks to the ensemble
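A quick numpy simulation of the error analysis above (my own sketch; k, v, and c are arbitrary illustrative values):

import numpy as np

# Simulate k models whose errors have variance v and pairwise covariance c,
# then check the ensemble MSE against the formula v/k + (k - 1)c/k.
rng = np.random.default_rng(4)
k, v, c, n = 10, 1.0, 0.25, 200_000

shared = rng.normal(0.0, np.sqrt(c), size=n)          # common error component
indep = rng.normal(0.0, np.sqrt(v - c), size=(n, k))  # per-model component
errors = shared[:, None] + indep                      # column i: model i's errors

empirical = np.mean(errors.mean(axis=1) ** 2)
predicted = v / k + (k - 1) * c / k
print(empirical, predicted)                           # both ≈ 0.325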