
Lecture 2

CSE 602 Machine Learning - I
Dr. Tariq Mahmood | May 15, 2024


Cross Validation
Overfitting

• Model learns the training data too well
• Captures noise
• Too complex a model, poor generalization
• High training accuracy, low test accuracy
• Due to insufficient data, large number of parameters
• Address through:
• Cross-Validation
• Regularization
• Feature Selection
• Ensemble Methods
• Increasing the dataset size
• Striking a balance between model complexity and dataset size
Overfitting



Cross-Validation

• A statistical technique used to assess the performance of a predictive model and to reduce the risk of overfitting.
• It involves dividing a dataset into subsets, training the model on
some of these subsets, and evaluating its performance on the
remaining subsets.
• This helps to provide a more accurate estimate of the model's
performance on new, unseen data.



Cross-Validation

• Holdout Validation:
• Description: The dataset is split into two parts - a training set and a
validation (or test) set.
• Procedure:
• Train the model on the training set.
• Evaluate the model on the validation set.
• Example:
• Splitting the dataset into 80% for training and 20% for validation.
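
A minimal sketch of holdout validation using scikit-learn (not shown on the slide; the iris dataset and logistic regression are placeholders, and the 80/20 split mirrors the example above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data; any feature matrix X and label vector y work the same way
X, y = load_iris(return_X_y=True)

# 80% for training, 20% held out for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```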
Hold-Out Validation



Cross-Validation
• K-Fold Cross-Validation:
• Description: The dataset is divided into 'k' folds, and the model is trained 'k'
times, each time using a different fold as the validation set and the remaining folds
as the training set.
• Procedure:
• Split the dataset into 'k' folds.
• Train the model on 'k-1' folds and validate on the remaining fold.
• Repeat this process 'k' times, each time with a different validation fold.
• Average the performance metrics over the 'k' iterations.
• Example:
• If using 5-fold cross-validation, the dataset is split into 5 folds, and the model is trained and
evaluated 5 times.
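
A minimal 5-fold sketch with scikit-learn (placeholder dataset and model; cross_val_score handles the train/validate loop and returns one score per fold):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)          # placeholder data
model = LogisticRegression(max_iter=1000)  # placeholder estimator

# 5 folds: each fold serves exactly once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```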
K-Fold Cross Validation



Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV):
• Description: A special case of k-fold cross-validation where 'k' is set equal
to the number of samples in the dataset. In each iteration, a single data point
is used as the validation set, and the model is trained on the remaining data.
• Procedure:
• Train the model on all data points except one.
• Evaluate the model on the left-out data point.
• Repeat this process for each data point.
• Example:
• If there are 100 samples, LOOCV involves training and evaluating the model 100 times, each time leaving out one sample.
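
A minimal LOOCV sketch with scikit-learn (placeholder data; note that the number of model fits equals the number of samples, so this is expensive on large datasets):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fit per sample: each iteration leaves exactly one data point out for validation
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print("Number of fits:", len(scores))   # equals the number of samples
print("LOOCV accuracy:", scores.mean())
```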
Leave-One-Out Cross-Validation



Cross-Validation
• Stratified K-Fold Cross-Validation:
• Description: Similar to k-fold cross-validation, but it ensures that each fold
maintains the same distribution of target classes as the original dataset. This is
particularly useful when dealing with imbalanced datasets.
• Procedure:
• Split the dataset into 'k' folds while maintaining the class distribution in each fold.
• Train the model on 'k-1' folds and validate on the remaining fold.
• Repeat this process 'k' times.
• Example:
• If a dataset has 80% samples of class A and 20% of class B, each fold in stratified k-
fold will also have this 80-20 distribution.
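
A minimal StratifiedKFold sketch (a synthetic imbalanced dataset is assumed purely for illustration, roughly matching the 80-20 example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: roughly 80% of one class, 20% of the other
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
model = LogisticRegression(max_iter=1000)

# Each fold preserves the original 80-20 class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print("Per-fold accuracy:", scores)
```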
Stratified K-fold Cross Validation





Cross-Validation
• Time Series Cross-Validation:
• Description: Designed for time series data where the order of data points
matters. The training set consists of past data, and the validation set
includes future data.
• Procedure:
• Split the time-ordered dataset into training and validation sets.
• Train the model on past data and evaluate on future data.
• Move the time window forward and repeat the process.
• Example:
• If you have daily stock prices, you may use the data from the first 80% of the timeline for training and the last 20% for validation.
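
A minimal sketch with scikit-learn's TimeSeriesSplit (a placeholder series of 100 time-ordered observations; each split trains only on past points and validates on the points that immediately follow):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder time-ordered data, e.g. 100 daily observations
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training indices always come before the validation indices
    print("train:", train_idx[0], "-", train_idx[-1],
          "| validate:", val_idx[0], "-", val_idx[-1])
```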
Cross-Validation
• Cross-validation helps in obtaining a more reliable estimate of a
model's performance
• Reduces the impact of the dataset's specific characteristics on the
evaluation.
• The choice of the cross-validation technique depends on the nature
of the data and the problem at hand.



Hands-On with Cross Validation

• https://www.w3schools.com/python/python_ml_cross_validation.asp



Occam’s Razor
Definition
• Philosophical and scientific principle that suggests choosing the
simplest explanation or hypothesis that fits the observed data.
• Applied as a guiding principle when selecting a model from a set of models that adequately predict the data.
• Principle: "Entities should not be multiplied without necessity" or
"The simplest explanation is usually the correct one."
• Favor simpler models over more complex ones when both models
perform equally well on the training data.



Definition
• Model Parameters:
• Coefficients or weights learned by the model during the training process.
• Slope and intercept (linear models); weights and biases of the network's connections (ANN).
• Simplifying model parameters = reducing the # of these learned
coefficients.
• Model Hyperparameters:
• Settings or configurations that are set before the learning process begins.
• Control aspects of the algorithm, e.g., the learning rate, regularization strength, or the architecture of the model.
• Simplifying hyperparameters = assigning values that reduce complexity


Some Details…
• Model Complexity:
• The simpler model is often preferable – simple refers to the complexity of
the model, usually measured by the number of parameters.
• Simpler models are more interpretable, easier to understand, and less prone
to overfitting
• Feature Selection:
• When faced with a choice between models that perform similarly, selecting
the model with fewer features can be preferred.
• This is based on the idea that unnecessary or irrelevant features can
introduce noise and complexity without improving predictive performance.
Regularization
Definition
• A set of techniques used in machine learning to prevent overfitting
and improve the generalization of a model to new, unseen data.
• The most common regularization techniques involve adding a
penalty term to the cost function, discouraging the model from
fitting the training data too closely.
• Popular regularization techniques:
• L1 regularization (Lasso),
• L2 regularization (Ridge),
• Elastic Net.
Definition
• Consider linear models – logistic regression, linear regression etc.
• Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
• Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X).
• Regularization is a form of regression, that constrains or shrinks the
coefficient estimates towards zero
• Discourages learning a more complex/flexible model to avoid
the risk of overfitting.
Re-Defined
• Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
• Line fitting uses a loss function: the residual sum of squares (RSS). Coefficients are chosen to minimize the RSS.
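• A standard form of the RSS for the linear model above (reconstructed here, as the slide's formula is not reproduced):
RSS = Σᵢ ( yᵢ − β0 − β1xᵢ1 − … − βpxᵢp )², summed over the n training observations.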

• If there is noise, the estimated coefficients won't generalize well to the future data – enter regularization to shrink or regularize these estimates towards zero.


L2 - Ridge Regression

• The RSS is modified by adding a shrinkage quantity; λ is the tuning parameter that controls how much to penalize complexity.
• An increase in complexity corresponds to an increase in coefficient values.
• Minimizing the modified function requires the coefficients to be small – this prevents coefficients from rising too high.
• We shrink all coefficients except the intercept β0 – it is a measure of
the mean value of the response when xi1 = xi2 = …= xip = 0.
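• Putting this together, a standard form of the ridge objective (consistent with the description above; the slide's own formula is not reproduced) is: minimize RSS + λ ( β1² + β2² + … + βp² ).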
Ridge Defined
• When λ = 0, the penalty has no effect and ridge regression reduces to minimizing the RSS (ordinary least squares).
• As λ→∞, the impact of shrinkage grows and the coefficient estimates approach zero.
• Selecting λ is critical – cross-validation comes in handy. The penalty used by this method is based on the L2 norm of the coefficients.
• GridSearchCV
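
A minimal sketch of tuning λ (called alpha in scikit-learn's Ridge) with GridSearchCV (placeholder regression data; the alpha grid is illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Placeholder regression data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Search over candidate shrinkage strengths using 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validated R^2:", search.best_score_)
```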



L1 - Lasso

• Lasso uses |βj| (the modulus) instead of the squares of β as its penalty – known as the L1 norm.
• Ridge = solving an equation where summation of squares of
coefficients is less than or equal to s.
• Lasso = solving an equation where summation of modulus of
coefficients is less than or equal to s.
• s is a constant that exists for each value of the shrinkage factor λ.
Lasso
• Consider there are 2 parameters in a given problem.
• Ridge regression: β1² + β2² ≤ s - implies that ridge regression
coefficients have the smallest RSS for all points that lie within the
circle given by β1² + β2² ≤ s.
• Lasso regression: |β1| + |β2| ≤ s - implies that the lasso coefficients have the smallest RSS for all points that lie within the diamond given by |β1| + |β2| ≤ s.
• Labeled as Constraint functions
Lasso

• Points on the same ellipse share the same value of RSS.
• For a very large value of s, the green regions will contain the center of the ellipse, making the coefficient estimates of both regression techniques equal to the least squares estimates.

Figure: constraint functions (green areas) for lasso (left) and ridge (right), along with contours for RSS (red ellipse).



Lasso
• But: the coefficient estimates of both L1 and L2 are given by the first point at which an ellipse contacts the constraint region.
• Since ridge has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero.

Figure: constraint functions (green areas) for lasso (left) and ridge (right), along with contours for RSS (red ellipse).
Lasso
• However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis.
• When this occurs, one of the coefficients will equal zero. In higher dimensions (where there are many more than 2 parameters), many of the coefficient estimates may equal zero simultaneously.

Figure: constraint functions (green areas) for lasso (left) and ridge (right), along with contours for RSS (red ellipse).
Lasso
• Voilà – an obvious disadvantage of ridge regression: model interpretability.
• It will shrink the coefficients for least important predictors, very close
to zero.
• But it will never make them exactly zero - the final model will include
all predictors.
• However, L1 penalty has the effect of forcing some of the coefficient
estimates to be exactly equal to zero when λ is sufficiently large.
• Therefore, the lasso method also performs feature selection and is
said to yield sparse models.
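
A minimal sketch of this sparsity on synthetic data (the dataset, the choice of alpha, and the comparison with Ridge are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features actually drive the response
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set exactly to zero:", np.sum(lasso.coef_ == 0))  # usually several
print("Ridge coefficients set exactly to zero:", np.sum(ridge.coef_ == 0))  # usually none
```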
Summary
• A standard RSS fit tends to have some variance – poor generalization.
• Regularization reduces variance without a substantial increase in bias. So λ controls the impact on bias and variance.
• As λ rises, it reduces the value of the coefficients and thus reduces the variance.
• Up to a point, this increase is beneficial (avoiding overfitting), without losing any important properties in the data.
• But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting.
• Therefore, the value of λ should be carefully selected.
Another Perspective
L1 (Lasso)
• L1 regularization adds the absolute values of the model coefficients
to the cost function. Regularized cost function for linear regression
with L1 regularization is:
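One standard form (reconstructed here, since the slide's formula is an image; scaling conventions vary across texts):
J(β) = Σᵢ ( yᵢ − ŷᵢ )² + λ Σⱼ |βⱼ|, where ŷᵢ = β0 + β1xᵢ1 + … + βpxᵢp.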



L1 (Lasso) - Example



L2 (Ridge)
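The slide's formula is an image; by analogy with the L1 case above, a standard form of the L2-regularized cost uses squared coefficients instead:
J(β) = Σᵢ ( yᵢ − ŷᵢ )² + λ Σⱼ βⱼ².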



Elastic Net
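The slide's formula is an image; a common way to write the elastic net penalty (one of several equivalent parameterizations) combines both terms:
penalty = λ1 Σⱼ |βⱼ| + λ2 Σⱼ βⱼ², often expressed with a single strength λ and a mixing parameter α that balances the L1 and L2 parts.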



L1
• Pros:
1. Feature Selection: L1 induces sparsity in the model, effectively setting some
coefficients to exactly zero - Makes it useful for feature selection, as irrelevant or
redundant features may have zero coefficients.
2. Simplicity: L1 tends to result in simpler models with fewer nonzero coefficients,
making them easier to interpret.
• Cons:
1. Not Robust to Correlated Features: Lasso tends to arbitrarily select one feature
among a group of correlated features, leading to instability and sensitivity to small
changes in the data.
2. Global Shrinkage: L1 regularization performs global shrinkage, which means it
tends to shrink all coefficients towards zero. This may not be ideal when all features
are somewhat relevant.
L2
• Pros:
1.Robustness to Correlated Features: L2 is more robust to correlated features
compared to L1 - shrinks the coefficients smoothly and does not arbitrarily
select one feature over another.
2.Stability: Ridge regression is numerically stable, even when features are
highly correlated.
• Cons:
1.Does Not Induce Sparsity: Ridge does not lead to exact zero coefficients, so
it does not perform feature selection - shrinks coefficients towards zero but
rarely to zero.
2. May Not Be Ideal for Sparse Data: In situations where the dataset is sparse, Ridge regularization may not perform as well as L1.
Elastic Net
• Pros:
• Combination of L1 and L2: Elastic Net combines the strengths of both L1 and
L2 regularization, allowing for both feature selection and robustness to correlated
features.
• Flexibility: By adjusting the mixing parameter (α), Elastic Net can be tuned to
behave more like L1 or L2 regularization, offering a balance between sparsity
and robustness.
• Cons:
• Computational Complexity: Elastic Net is computationally more expensive
than L1 or L2 alone
• Interpretability: While more interpretable than some other complex models,
Elastic Net may still be less interpretable than simpler models without regularization.
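
A minimal sketch of Elastic Net in scikit-learn (placeholder data; alpha is the overall penalty strength and l1_ratio plays the role of the mixing parameter described above; the grids are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Placeholder regression data with a few informative features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

# Tune both the penalty strength and the L1/L2 mix with 5-fold cross-validation
param_grid = {"alpha": [0.1, 1.0, 10.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
```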
Elastic Net
• https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset
• https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression

