Regularization techniques are used to reduce overfitting in machine learning models. They modify the learning algorithm to favor simpler models by adding constraints or penalty terms to the objective function. Common regularization strategies include L2 regularization, which penalizes weights with large magnitudes, driving them closer to zero. This helps control model complexity and improve generalization to new data.
• Generalization: the ability to perform well not only on the training data but also on new inputs
• Regularization strategies are designed to reduce the test error, possibly at the expense of increased training error
• Regularization is any modification made to the learning algorithm that is intended to reduce its generalization (test) error but not its training error
• General regularization strategies
  – Adding extra constraints on the ML model, e.g. restrictions on parameter values
  – Introducing extra (penalty) terms in the cost / objective function
• Other approaches – ensemble methods
Regularization Strategies
• Generally based on regularizing estimators
• An effective regularizer trades increased bias for a larger decrease in variance
• Generalization and overfitting – the chosen model family may have:
  – Excluded the true data-generating process (underfitting)
  – Matched the true data-generating process
  – Included the generating process but also many others (overfitting)
• Model complexity
  – Finding the model of the right size, with the right number of parameters
  – In practice, the best-fitting model is often a large model that has been regularized appropriately
  – The intention is to create a large, deep model and regularize it properly

Parameter Norm Penalties
• Limit the model's capacity (neural networks, linear regression, logistic regression)
  – Add a parameter norm penalty Ω(θ) to the objective function J:
    J̃(θ; X, y) = J(θ; X, y) + λΩ(θ), where λ ∈ [0, ∞) weighs the penalty term relative to the original objective
• For neural networks, the parameter norm penalty (PNP) affects the weights of each layer; the biases remain unregularized
• w – vector of the weights affected by the penalty
• θ – vector of all parameters, comprising w and the unregularized parameters
• Alternatively, neural networks may use a different penalty coefficient λ for each layer of the network

L2 Parameter Regularization
• The simplest and most commonly used parameter norm penalty
• The L2 PNP is known as weight decay: Ω(θ) = ½‖w‖₂²
• To avoid overfitting, each weight update combines the usual gradient step on ∇_w J with an extra term that subtracts ελw, so the weights decay toward zero
  – Weight update: w ← w − ε(λw + ∇_w J(w)) = (1 − ελ)w − ε∇_w J(w)
• Drives the weights closer to the origin by adding this regularization term to the objective function
• Also known as ridge regression / Tikhonov regularization

Regularization (revisited)
• Regularization refers to the act of modifying a learning algorithm to favor "simpler" prediction rules to avoid overfitting.
• Most commonly, regularization refers to modifying the loss function to penalize certain values of the weights you are learning.
• Specifically, penalize weights that are large.
• Identify large weights using the L2 norm of w – the vector's length / Euclidean norm: ‖w‖₂ = (Σⱼ wⱼ²)^½
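To make the weight-decay update and the L2 norm above concrete, here is a minimal NumPy sketch; the weight vector, gradient, learning rate ε, and coefficient λ are invented placeholder values, not taken from the slides.

```python
import numpy as np

# Invented placeholder values, purely for illustration.
w = np.array([0.8, -1.5, 2.0])        # current weight vector
grad_J = np.array([0.1, -0.2, 0.4])   # gradient of the unregularized loss at w
eps, lam = 0.1, 0.5                   # learning rate and weight-decay coefficient

# L2 norm of w (Euclidean length), used to identify large weights.
l2_norm = np.sqrt(np.sum(w ** 2))     # same as np.linalg.norm(w)

# One weight-decay update: w <- w - eps * (lam * w + grad J(w)).
w_new = w - eps * (lam * w + grad_J)

# Equivalent form: shrink the weights by (1 - eps*lam), then take the usual step.
assert np.allclose(w_new, (1 - eps * lam) * w - eps * grad_J)

print("||w||_2 =", l2_norm)
print("updated w:", w_new)            # the lam*w term pulls each weight toward zero
```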
L2 Regularization (ctd..)
• New goal for minimization: training loss(w) + (λ/2)‖w‖₂²
• By minimizing the loss plus this penalty, we prefer solutions where w is closer to 0.
• λ – hyperparameter that adjusts the trade-off between having low training loss and having small weights

L2 Regularization (ctd..)
• Assuming no bias parameter (i.e. θ is just w), the regularized cost function is
  – J̃(w; X, y) = (λ/2)wᵀw + J(w; X, y)
• And the gradient is
  – ∇_w J̃(w; X, y) = λw + ∇_w J(w; X, y)
• Updating the weights with a single gradient step gives
  – w ← w − ε(λw + ∇_w J(w; X, y)) = (1 − ελ)w − ε∇_w J(w; X, y)
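As a rough illustration of the trade-off controlled by λ, the sketch below minimizes the regularized cost above by plain gradient descent for two settings of λ and compares the resulting weight norms and training errors; the toy data, learning rate, and λ values are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # toy inputs (invented)
y = X @ np.array([3.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)  # toy targets

def fit(lam, eps=0.005, steps=5000):
    """Gradient descent on J~(w) = 1/2 ||Xw - y||^2 + (lam/2) ||w||^2 (no bias)."""
    w = np.zeros(3)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w         # gradient of the regularized cost
        w -= eps * grad                            # w <- w - eps * grad
    return w

for lam in (0.0, 50.0):
    w = fit(lam)
    print(f"lambda={lam:5.1f}  ||w||_2={np.linalg.norm(w):.3f}  "
          f"train MSE={np.mean((X @ w - y) ** 2):.4f}")
# Larger lambda yields smaller weights at the cost of slightly higher training error.
```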
L2 Regularization (ctd..)
• To see how weight decay changes the solution, make a quadratic approximation to J around the weights with minimal unregularized training cost, w* = argmin_w J(w)
• Then J is approximated by
  – Ĵ(w) = J(w*) + ½(w − w*)ᵀH(w − w*)
  – H – Hessian matrix of J with respect to w, evaluated at w*
• The minimum of Ĵ occurs where its gradient ∇_w Ĵ(w) = H(w − w*) is 0
• Adding the weight decay gradient, the regularized minimum w̃ satisfies λw̃ + H(w̃ − w*) = 0, i.e. w̃ = (H + λI)⁻¹Hw*
  – As λ approaches 0, w̃ approaches w*
  – As λ grows, study the effect by performing an eigendecomposition of H

L2 Regularization (ctd..)
• H is decomposed into a diagonal matrix Λ of eigenvalues and an orthonormal basis of eigenvectors Q as H = QΛQᵀ
• Therefore, w̃ becomes
  – w̃ = (QΛQᵀ + λI)⁻¹QΛQᵀw* = Q(Λ + λI)⁻¹ΛQᵀw*
  – The component of w* along the i-th eigenvector of H is rescaled by Λᵢᵢ / (Λᵢᵢ + λ): directions of high curvature (large eigenvalues) are kept almost intact, while directions of low curvature are shrunk toward zero

L2 Regularization (ctd..)
• Extending to linear regression – the cost function J in terms of the sum of squared errors is
  – J(w) = ½(Xw − y)ᵀ(Xw − y), minimized by w* = (XᵀX)⁻¹Xᵀy
• Adding the L2 penalty (λ/2)wᵀw changes the solution to the ridge-regression estimate
  – w̃ = (XᵀX + λI)⁻¹Xᵀy
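As a sanity check on the formulas above, the following sketch (with an invented toy dataset and an invented λ) computes the unregularized least-squares solution, the ridge solution, and the same ridge solution again via the eigendecomposition of H = XᵀX, confirming that the two routes agree and that the regularized weights are shrunk.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))             # toy design matrix (invented)
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
lam = 5.0                                 # regularization strength (invented)

# Unregularized least squares: w* = (X^T X)^-1 X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge / Tikhonov solution: w~ = (X^T X + lam*I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Same result via the eigendecomposition H = Q Lambda Q^T of H = X^T X:
# w~ = Q (Lambda + lam*I)^-1 Lambda Q^T w*, i.e. each component of w* along an
# eigenvector of H is rescaled by Lambda_ii / (Lambda_ii + lam).
evals, Q = np.linalg.eigh(X.T @ X)
w_eig = Q @ np.diag(evals / (evals + lam)) @ Q.T @ w_star

assert np.allclose(w_ridge, w_eig)
print("||w*||_2      =", np.linalg.norm(w_star))
print("||w_ridge||_2 =", np.linalg.norm(w_ridge))   # shrunk toward zero
```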