
Lecture 2

CSE 602 Machine Learning - I
Dr. Tariq Mahmood | May 15, 2024


Cross Validation
Overfitting

• Model learns the training data too well
• Captures noise
• Too complex a model, poor generalization
• High training accuracy, low test accuracy
• Due to insufficient data, large number of parameters
• Address through:
• Cross-Validation
• Regularization
• Feature Selection
• Ensemble Methods
• Increasing the dataset size
• Striking a balance between model complexity and dataset size
Overfitting



Cross-Validation

• A statistical technique used to assess the performance of a predictive model and to reduce the risk of overfitting.
• It involves dividing a dataset into subsets, training the model on
some of these subsets, and evaluating its performance on the
remaining subsets.
• This helps to provide a more accurate estimate of the model's
performance on new, unseen data.



Cross-Validation

• Holdout Validation:
• Description: The dataset is split into two parts - a training set and a
validation (or test) set.
• Procedure:
• Train the model on the training set.
• Evaluate the model on the validation set.
• Example:
• Splitting the dataset into 80% for training and 20% for validation.
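
A minimal sketch of holdout validation using scikit-learn (not shown on the slide; the iris dataset and logistic regression are placeholders, and the 80/20 split mirrors the example above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data; any feature matrix X and label vector y work the same way
X, y = load_iris(return_X_y=True)

# 80% for training, 20% held out for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```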
Hold-Out Validation



Cross-Validation
• K-Fold Cross-Validation:
• Description: The dataset is divided into 'k' folds, and the model is trained 'k'
times, each time using a different fold as the validation set and the remaining folds
as the training set.
• Procedure:
• Split the dataset into 'k' folds.
• Train the model on 'k-1' folds and validate on the remaining fold.
• Repeat this process 'k' times, each time with a different validation fold.
• Average the performance metrics over the 'k' iterations.
• Example:
• If using 5-fold cross-validation, the dataset is split into 5 folds, and the model is trained and
evaluated 5 times.
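
A minimal 5-fold sketch with scikit-learn (placeholder dataset and model; cross_val_score handles the train/validate loop and returns one score per fold):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)          # placeholder data
model = LogisticRegression(max_iter=1000)  # placeholder estimator

# 5 folds: each fold serves exactly once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```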
K-Fold Cross Validation



Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV):
• Description: A special case of k-fold cross-validation where 'k' is set equal
to the number of samples in the dataset. In each iteration, a single data point
is used as the validation set, and the model is trained on the remaining data.
• Procedure:
• Train the model on all data points except one.
• Evaluate the model on the left-out data point.
• Repeat this process for each data point.
• Example:
• If there are 100 samples, LOOCV involves training and evaluating the model 100 times, each time leaving out one sample.
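
A minimal LOOCV sketch with scikit-learn (placeholder data; note that the number of model fits equals the number of samples, so this is expensive on large datasets):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fit per sample: each iteration leaves exactly one data point out for validation
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print("Number of fits:", len(scores))   # equals the number of samples
print("LOOCV accuracy:", scores.mean())
```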
Leave-One-Out Cross-Validation



Cross-Validation
• Stratified K-Fold Cross-Validation:
• Description: Similar to k-fold cross-validation, but it ensures that each fold
maintains the same distribution of target classes as the original dataset. This is
particularly useful when dealing with imbalanced datasets.
• Procedure:
• Split the dataset into 'k' folds while maintaining the class distribution in each fold.
• Train the model on 'k-1' folds and validate on the remaining fold.
• Repeat this process 'k' times.
• Example:
• If a dataset has 80% samples of class A and 20% of class B, each fold in stratified k-
fold will also have this 80-20 distribution.
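
A minimal StratifiedKFold sketch (a synthetic imbalanced dataset is assumed purely for illustration, roughly matching the 80-20 example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: roughly 80% of one class, 20% of the other
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
model = LogisticRegression(max_iter=1000)

# Each fold preserves the original 80-20 class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print("Per-fold accuracy:", scores)
```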
Stratified K-fold Cross Validation





Cross-Validation
• Time Series Cross-Validation:
• Description: Designed for time series data where the order of data points
matters. The training set consists of past data, and the validation set
includes future data.
• Procedure:
• Split the time-ordered dataset into training and validation sets.
• Train the model on past data and evaluate on future data.
• Move the time window forward and repeat the process.
• Example:
• If you have daily stock prices, you may use the data from the first 80% of the timeline for training and the last 20% for validation.
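
A minimal sketch with scikit-learn's TimeSeriesSplit (a placeholder series of 100 time-ordered observations; each split trains only on past points and validates on the points that immediately follow):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder time-ordered data, e.g. 100 daily observations
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training indices always come before the validation indices
    print("train:", train_idx[0], "-", train_idx[-1],
          "| validate:", val_idx[0], "-", val_idx[-1])
```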
Cross-Validation
• Cross-validation helps in obtaining a more reliable estimate of a
model's performance
• Reduces the impact of the dataset's specific characteristics on the
evaluation.
• The choice of the cross-validation technique depends on the nature
of the data and the problem at hand.



Hands-On with Cross Validation

• https://www.w3schools.com/python/python_ml_cross_validation.asp



Occam’s Razor
Definition
• Philosophical and scientific principle that suggests choosing the
simplest explanation or hypothesis that fits the observed data.
• Applied as a guiding principle when selecting a model from a set of models that adequately predict the data.
• Principle: "Entities should not be multiplied without necessity" or
"The simplest explanation is usually the correct one."
• Favor simpler models over more complex ones when both models
perform equally well on the training data.



Definition
• Model Parameters:
• Coefficients or weights learned by the model during the training process.
• Slope and intercept (linear models); weights and biases of the network's connections (ANN).
• Simplifying model parameters = reducing the # of these learned
coefficients.
• Model Hyperparameters:
• Settings or configurations that are set before the learning process begins.
• Control aspects of the algorithm, e.g., the learning rate, regularization strength, or the architecture of the model.
• Simplifying hyperparameters = assigning values that reduce complexity


Some Details…
• Model Complexity:
• The simpler model is often preferable – simple refers to the complexity of
the model, usually measured by the number of parameters.
• Simpler models are more interpretable, easier to understand, and less prone
to overfitting
• Feature Selection:
• When faced with a choice between models that perform similarly, selecting
the model with fewer features can be preferred.
• This is based on the idea that unnecessary or irrelevant features can
introduce noise and complexity without improving predictive performance.
Regularization
Definition
• A set of techniques used in machine learning to prevent overfitting
and improve the generalization of a model to new, unseen data.
• The most common regularization techniques involve adding a
penalty term to the cost function, discouraging the model from
fitting the training data too closely.
• Popular regularization techniques:
• L1 regularization (Lasso),
• L2 regularization (Ridge),
• Elastic Net.
Definition
• Consider linear models – logistic regression, linear regression etc.
• Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
• Y represents the learned relation and β represents the coefficient estimates for the different variables or predictors (X).
• Regularization is a form of regression, that constrains or shrinks the
coefficient estimates towards zero
• Discourages learning a more complex/flexible model to avoid
the risk of overfitting.
Re-Defined
• Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
• Line fitting uses a loss function: the residual sum of squares (RSS). Coefficients are chosen to minimize the RSS.
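• A standard form of the RSS for the linear model above (reconstructed here, as the slide's formula is not reproduced):
RSS = Σᵢ ( yᵢ − β0 − β1xᵢ1 − … − βpxᵢp )², summed over the n training observations.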

• If there is noise, the estimated coefficients won't generalize well to the future data – enter regularization to shrink or regularize these estimates towards zero.


L2 - Ridge Regression

• The RSS is modified by adding a shrinkage quantity; λ is the tuning parameter that controls how much to penalize complexity.
• An increase in complexity corresponds to an increase in coefficient values.
• Minimizing the modified function requires the coefficients to be small – this prevents coefficients from rising too high.
• We shrink all coefficients except the intercept β0 – it is a measure of
the mean value of the response when xi1 = xi2 = …= xip = 0.
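• Putting this together, a standard form of the ridge objective (consistent with the description above; the slide's own formula is not reproduced) is: minimize RSS + λ ( β1² + β2² + … + βp² ).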
Ridge Defined
• When λ = 0, the penalty has no effect and ridge regression reduces to minimizing the RSS (ordinary least squares).
• As λ→∞, the impact of shrinkage grows and the coefficient estimates approach zero.
• Selecting λ is critical – cross-validation comes in handy. The penalty used by this method is based on the L2 norm of the coefficients.
• GridSearchCV
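
A minimal sketch of tuning λ (called alpha in scikit-learn's Ridge) with GridSearchCV (placeholder regression data; the alpha grid is illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Placeholder regression data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# Search over candidate shrinkage strengths using 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validated R^2:", search.best_score_)
```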



L1 - Lasso

• Lasso uses |βj| (the modulus) instead of the squares of β as its penalty – known as the L1 norm.
• Ridge = solving an equation where summation of squares of
coefficients is less than or equal to s.
• Lasso = solving an equation where summation of modulus of
coefficients is less than or equal to s.
• s is a constant that exists for each value of the shrinkage factor λ.
Lasso
• Consider there are 2 parameters in a given problem.
• Ridge regression: β1² + β2² ≤ s - implies that ridge regression
coefficients have the smallest RSS for all points that lie within the
circle given by β1² + β2² ≤ s.
• Lasso regression: |β1| + |β2| ≤ s - implies that the lasso coefficients have the smallest RSS for all points that lie within the diamond given by |β1| + |β2| ≤ s.
• Labeled as Constraint functions
Lasso

• Points on the same ellipse share the same value of RSS.
• For a very large value of s, the green regions will contain the center of the ellipse, making the coefficient estimates of both regression techniques equal to the least squares estimates.

Figure: constraint functions (green areas) for lasso (left) and ridge (right), along with contours for RSS (red ellipse).



Lasso
• But: the coefficient estimates of both L1 and L2 are given by the first point at which an ellipse contacts the constraint region.
• Since ridge has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero.

Figure: constraint functions (green areas) for lasso (left) and ridge (right), along with contours for RSS (red ellipse).
Lasso
• However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis.
• When this occurs, one of the coefficients will equal zero. In higher dimensions (where there are many more than 2 parameters), many of the coefficient estimates may equal zero simultaneously.

Figure: constraint functions (green areas) for lasso (left) and ridge (right), along with contours for RSS (red ellipse).
Lasso
• Voilà – an obvious disadvantage of ridge regression: model interpretability.
• It will shrink the coefficients for least important predictors, very close
to zero.
• But it will never make them exactly zero - the final model will include
all predictors.
• However, L1 penalty has the effect of forcing some of the coefficient
estimates to be exactly equal to zero when λ is sufficiently large.
• Therefore, the lasso method also performs feature selection and is
said to yield sparse models.
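
A minimal sketch of this sparsity on synthetic data (the dataset, the choice of alpha, and the comparison with Ridge are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features actually drive the response
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set exactly to zero:", np.sum(lasso.coef_ == 0))  # usually several
print("Ridge coefficients set exactly to zero:", np.sum(ridge.coef_ == 0))  # usually none
```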
Summary
• A standard RSS fit tends to have some variance – poor generalization.
• Regularization reduces variance without a substantial increase in bias. So λ controls the impact on bias and variance.
• As λ rises, it reduces the value of the coefficients and thus reduces the variance.
• Up to a point, this increase is beneficial (avoiding overfitting), without losing any important properties in the data.
• But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting.
• Therefore, the value of λ should be carefully selected.
Another Perspective
L1 (Lasso)
• L1 regularization adds the absolute values of the model coefficients
to the cost function. Regularized cost function for linear regression
with L1 regularization is:
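One standard form (reconstructed here, since the slide's formula is an image; scaling conventions vary across texts):
J(β) = Σᵢ ( yᵢ − ŷᵢ )² + λ Σⱼ |βⱼ|, where ŷᵢ = β0 + β1xᵢ1 + … + βpxᵢp.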



L1 (Lasso) - Example



L2 (Ridge)
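The slide's formula is an image; by analogy with the L1 case above, a standard form of the L2-regularized cost uses squared coefficients instead:
J(β) = Σᵢ ( yᵢ − ŷᵢ )² + λ Σⱼ βⱼ².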



Elastic Net
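The slide's formula is an image; a common way to write the elastic net penalty (one of several equivalent parameterizations) combines both terms:
penalty = λ1 Σⱼ |βⱼ| + λ2 Σⱼ βⱼ², often expressed with a single strength λ and a mixing parameter α that balances the L1 and L2 parts.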



L1
• Pros:
1. Feature Selection: L1 induces sparsity in the model, effectively setting some
coefficients to exactly zero - Makes it useful for feature selection, as irrelevant or
redundant features may have zero coefficients.
2. Simplicity: L1 tends to result in simpler models with fewer nonzero coefficients,
making them easier to interpret.
• Cons:
1. Not Robust to Correlated Features: Lasso tends to arbitrarily select one feature
among a group of correlated features, leading to instability and sensitivity to small
changes in the data.
2. Global Shrinkage: L1 regularization performs global shrinkage, which means it
tends to shrink all coefficients towards zero. This may not be ideal when all features
are somewhat relevant.
L2
• Pros:
1.Robustness to Correlated Features: L2 is more robust to correlated features
compared to L1 - shrinks the coefficients smoothly and does not arbitrarily
select one feature over another.
2.Stability: Ridge regression is numerically stable, even when features are
highly correlated.
• Cons:
1.Does Not Induce Sparsity: Ridge does not lead to exact zero coefficients, so
it does not perform feature selection - shrinks coefficients towards zero but
rarely to zero.
2. May Not Be Ideal for Sparse Data: In situations where the dataset is sparse, Ridge regularization may not perform as well as L1.
Elastic Net
• Pros:
• Combination of L1 and L2: Elastic Net combines the strengths of both L1 and
L2 regularization, allowing for both feature selection and robustness to correlated
features.
• Flexibility: By adjusting the mixing parameter (α), Elastic Net can be tuned to
behave more like L1 or L2 regularization, offering a balance between sparsity
and robustness.
• Cons:
• Computational Complexity: Elastic Net is computationally more expensive
than L1 or L2 alone
• Interpretability: While more interpretable than some other complex models,
Elastic Net may still be less interpretable than simpler models without regularization.
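
A minimal sketch of Elastic Net in scikit-learn (placeholder data; alpha is the overall penalty strength and l1_ratio plays the role of the mixing parameter described above; the grids are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Placeholder regression data with a few informative features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

# Tune both the penalty strength and the L1/L2 mix with 5-fold cross-validation
param_grid = {"alpha": [0.1, 1.0, 10.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
```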
Elastic Net
• https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset
• https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression

