
🪡

Feature Selection

📌
Lesson Structure
Feature Selection
Feature Selection Methods
Intrinsic methods
Filter methods
Wrapper methods

Interview Questions
Why use feature selection?
How do you select features in general?
How to do feature selection if you have 10,000 features?
How to calculate feature importance?

Feature Selection
Select a subset of the original features for model training.
Is usually used as a pre-processing step before doing the actual learning.

🐥 There is no best feature selection method.

Advantages

Avoids the curse of dimensionality

Improves predictive performance and interpretability of models

Shortens training times → improves computational efficiency

Reduces the generalization error of the model by removing irrelevant features or noise

Improves the predictive power of a model that suffers from overfitting

Domain knowledge is important!

Understand the business problem: know which features matter and which ones don’t

Consult with domain experts

Exploratory data analysis (EDA)

Feature Selection Methods


Intrinsic methods
Embedded methods or implicit methods

Have feature selection naturally embedded with the training process

Tree-based models

Search for the best feature to split a node so that the outcomes are more homogeneous with each
new partition.

If a feature is never used in any split, it is effectively treated as irrelevant to the target variable.

Regularization models

L1 regularization shrinks many of the estimated coefficients to exactly zero → only features with non-zero
coefficients are kept.

Models that use regularization include linear regression, logistic regression, and SVMs.

Pros and Cons

✔ Fast because feature selection is embedded within model fitting process


✔ No external feature selection tool is needed.
✔ Provides a direct connection between feature selection and the objective function (e.g. maximizing
information gain in decision trees, maximizing the likelihood function in logistic regression), which makes
it easier to make informed choices.

❌ Model-dependent and the choice of models is limited.
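A minimal sketch of intrinsic selection via an L1-penalized model with scikit-learn's SelectFromModel; the dataset and parameter values are illustrative:

```python
# Embedded (intrinsic) feature selection: the L1 penalty drives many
# coefficients to exactly zero during model fitting.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

sparse_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(sparse_model).fit(X, y)

print("kept features:", selector.get_support(indices=True))
X_selected = selector.transform(X)   # only the non-zero-coefficient features
```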


Filter methods
Select features that correlate well with target variable.

Evaluation is independent of the algorithm.

The search is performed only once.

Univariate statistical analysis

Analyze how each feature correlates with the target variable and select the ones with higher
correlations.

Feature Importance-based

Use feature importance scores to select features to keep (highest scores) or delete (lowest
scores).

Coefficients as feature importance, e.g. linear regression, logistic regression.

Impurity-based feature importances, e.g. tree-based models.

Impurity-based feature importance
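A minimal sketch of a filter method with scikit-learn's SelectKBest, assuming a generic labeled dataset:

```python
# Filter method: rank each feature by a univariate statistic against the
# target and keep the top-k features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           random_state=0)

# ANOVA F-test between each feature and the class label.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("scores:", selector.scores_.round(2))
X_filtered = selector.transform(X)
# mutual_info_classif can be swapped in to capture non-linear dependencies.
```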

Pros and Cons

✔ Simple and fast.


✔ Can be effective at capturing the large trends in the data.
❌ Tend to select redundant features.
❌ Ignore relationships among features.

Wrapper methods
An iterative process that repeatedly adds subsets of features to the model and then uses the resulting model
performance to guide the selection of the next subset.

Sequential feature selection (SFS)

A family of greedy search algorithms that are used to automatically select a subset of features
that are most relevant to the problem.

https://scikit-learn.org/stable/modules/feature_selection.html#sequential-feature-selection

Forward SFS

Iteratively finds the best new feature to add to the set of selected features.

Start with zero features and find the single feature that maximizes a cross-validated score when a
model is trained on that feature alone.

Once that first feature is selected, we repeat the procedure by adding a new feature to the set
of selected features.

The procedure stops when the desired number of selected features is reached.

Backward SFS

Start with all the features and sequentially remove features from the set until the desired
number of features is reached.
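A minimal sketch with scikit-learn's SequentialFeatureSelector (linked above); the dataset and wrapped estimator are illustrative:

```python
# Forward sequential feature selection (a wrapper method): greedily add the
# feature that maximizes the cross-validated score until 2 features remain.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", cv=5).fit(X, y)
print("selected features:", sfs.get_support(indices=True))
# direction="backward" starts from all features and removes them instead.
```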

Pros and Cons

✔ Search for a wider variety of feature subsets than other methods.


✔ Considers features that are already selected when choosing a new feature.

❌ Have the most potential to overfit the features to the training data.
❌ Significant computation time when the number of features is large.

🎰
Encoding Categorical Data

📌
Lesson Structure
Categorical Data
Ordinal features & Class labels
Nominal features
One-hot encoding
Dummy encoding
Feature hashing (Hashing trick)

Interview Questions
How to deal with categorical variables?
What's the difference when treating a variable as a dummy variable vs. a non-dummy variable?
How to deal with categorical features when the number of levels is very large, i.e. high cardinality?

Categorical Data
Categorical data indicates types of data which may be divided into groups and the groups may or may not have a
specific order. e.g. gender, ethnicity, age group, or a choice of preference.
To make sure the learning algorithm interprets categorical features correctly, we need to convert categorical string
values into integers.

📌 Very few algorithms, e.g. decision trees, LightGBM, and CatBoost, can handle categorical features as they are.

Ordinal and nominal features


Ordinal features: categorical values that can be ordered or sorted. e.g. t-shirt size: L > M > S.
Nominal features: don’t imply any order. e.g. color: green, red, blue.

Class labels
Dependent variable of a classification problem.
Class labels are not ordinal.
It’s a good practice to provide class labels as integer arrays to avoid technical glitches.



Ordinal features & Class labels
Define a mapping to map strings to numbers. Typically, we start with integer 1.

Before mapping:

| User | Rating |
| --- | --- |
| 1 | Poor |
| 2 | Fair |
| 3 | Good |
| 4 | Fair |

After mapping (Poor → 1, Fair → 2, Good → 3):

| User | Rating |
| --- | --- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 2 |

Pros: doesn't increase dimensionality of data.

Cons:

Imposes an artificial order. Only works for ordinal features or class labels.
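A minimal sketch with pandas, mirroring the mapping table above:

```python
# Map ordinal strings to integers with an explicit, ordered mapping.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"user": [1, 2, 3, 4],
                   "rating": ["Poor", "Fair", "Good", "Fair"]})

rating_map = {"Poor": 1, "Fair": 2, "Good": 3}     # order chosen by us
df["rating_encoded"] = df["rating"].map(rating_map)

# Class labels (no order implied) are commonly encoded with LabelEncoder.
labels = LabelEncoder().fit_transform(["spam", "ham", "spam"])  # -> [1, 0, 1]
```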

Nominal features
One-hot encoding
Transform a categorical feature into several binary features, with each level of the category becoming a new feature.
e.g. if we have 3 categories → create 3 new features.

For the new feature vector, only the one that gets chosen will have a value of 1 and the rest will be set to 0.

Before encoding:

| User | Preference |
| --- | --- |
| 1 | Red |
| 2 | Green |
| 3 | Blue |
| 4 | Red |

After one-hot encoding:

| User | Red | Green | Blue |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |

Number of new features = cardinality of the original feature

Pros: Easy to examine how each level of a category contributes to predictions.

Cons:

Increase the dimensionality of feature vectors.

Introduces multicollinearity (each one-hot column is perfectly correlated with the combination of the others), which can be an
issue for certain algorithms, e.g. algorithms that require matrix inversion. If features are highly correlated,
matrices are computationally difficult to invert, which can lead to numerically unstable estimates.

❗ One-hot encoding introduces multicollinearity, which may cause interference and inaccurate parameter
estimates.

Dummy encoding
Remove one feature column from the one-hot encoded array. We do not lose any important information by removing
a feature column.



After dummy encoding (the Blue column is dropped):

| User | Red | Green |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 0 |

Number of new features = cardinality of the original feature - 1

Pros: Avoids collinearity of the features.

Cons: Increase the dimensionality of feature vectors.
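A minimal sketch of both encodings with pandas; the column values mirror the tables above:

```python
# One-hot vs. dummy encoding of a nominal feature.
import pandas as pd

df = pd.DataFrame({"preference": ["Red", "Green", "Blue", "Red"]})

one_hot = pd.get_dummies(df["preference"])                   # 3 new columns
dummy   = pd.get_dummies(df["preference"], drop_first=True)  # 2 new columns

# scikit-learn equivalent: OneHotEncoder(drop="first") for dummy encoding,
# which also remembers the vocabulary for use at serving time.
```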

Problems of one-hot and dummy encoding

✏ Both one-hot encoding and dummy encoding require us to know the vocabulary of the categorical
feature beforehand.

What if the vocabulary from the training data is incomplete?

What if new categories are added to the data (i.e. the cold-start problem)? The model will not be able to
make predictions for such new data.
What if some categorical features have high cardinality, e.g. thousands to millions? The model will take up a
large amount of storage space and grow in size as the training set grows.

Feature hashing (Hashing trick)


Widely used to encode large-scale categorical features in practice, especially in continual learning settings where the
model learns from incoming data in production.

Feature hashing - Wikipedia


In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing
features, i.e. turning arbitrary features into indices in a vector or matrix.
https://en.wikipedia.org/wiki/Feature_hashing#Feature_hashing_(Weinberger_et_al._2009)

Use a deterministic (no random seeds) and portable (the same algorithm can be used for both training and
serving) hash function to generate a hashed value of each category. The hashed value will become the index of
that category.



Source: Lakshmanan, V., Robinson, S., & Munn, M. (2021). Machine Learning Design Patterns: Solutions to common challenges in data
preparation, model building, and MLOps. O'Reilly.

Pros:

Can choose the number of encoded values for a feature in advance, without having to know how many
categories there will be.

Does not increase dimensionality of the data - practical for categorical features with high cardinality.

Potential problems:

Collision - two distinct categories being assigned the same hash code. Model accuracy will be compromised.

📍 Don’t use the hashing trick if you know the vocabulary beforehand, if the vocabulary size is relatively small,
and if cold start is not a concern.
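A minimal sketch with scikit-learn's FeatureHasher; the category strings and output size are illustrative:

```python
# Hashing trick: the number of output columns is fixed in advance (here 8),
# regardless of how many distinct categories ever appear.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([["user_id=12345"],
                           ["user_id=98765"],
                           ["user_id=never_seen_before"]])
print(hashed.toarray().shape)   # (3, 8) - unseen categories still get an index
```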




Regularization

📌
Lesson Structure
What Is Regularization?
When to Use Regularization
Regularization Techniques
L1 and L2 Regularizations
L1 Regularization
L2 Regularization
L1 vs. L2 Regularization

Interview Questions
What are L1 and L2 regularizations?
What are the differences between them?
Why does L1 cause parameter sparsity whereas L2 does not?

What Is Regularization?
Regularization is an umbrella term. It is the process of introducing a regularization term to the loss
function of a model.

This technique adds a penalty term to the objective function, i.e. the penalty term's value is higher when
the model is more complex. Hence, it pushes the coefficients of many features toward zero and discourages
the model from fitting noise.

✏ Goal: improve the generalization of a model. It becomes necessary when the model begins to
overfit.

When to Use Regularization


To handle collinearity (high correlation among features).

To filter out noise from the data.

To reduce the complexity of a model and thus reduce variance (prevent overfitting).

Regularization Techniques

The penalty term can take many forms.

E.g. an Lp norm. Most widely used methods: L1 (lasso) and L2 (ridge) regularization.

Isosurfaces (the surface on which the norm takes a constant value) of Lp norms (p = 0.5, 1, 2, 4)

Hybrid of L1 and L2 regularizations: Elastic net regularization.

For neural networks: dropout and batch normalization.

Non-mathematical methods that have a regularization effect: data augmentation and early stopping.

✏ Feature scaling is important for regularization - we need to ensure all features are on
comparable scales.

L1 and L2 Regularizations

📌 L1 and L2 regularizations can be applied to all parametric models including linear regression,
logistic regression, SVMs, and neural networks.

L1 Regularization
L1 regularization adds the L1 norm $\alpha \sum_{j=1}^{m} |w_j|$ to the loss function. L1 regularizes the absolute sum of
the weights.

E.g. Lasso Regression

$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{m} |w_j|$$

α: hyperparameter
Controls the regularization strength

💡 We need to be careful when adjusting the regularization strength. If the regularization
strength is too high and the weight coefficients shrink to zero, the model can perform
poorly due to underfitting.

Evaluated with cross-validation, AIC or BIC.
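A minimal sketch of this behaviour with scikit-learn's Lasso on synthetic data; the exact number of surviving coefficients depends on the data and seed:

```python
# How the L1 penalty zeroes out coefficients as alpha (the regularization
# strength) grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # feature scaling matters here

for alpha in [0.01, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:<5} non-zero coefficients: {np.sum(coef != 0)}")
```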

Geometric interpretation

Contours: the mean squared error loss function (the squared distance between the true and
predicted values) for two weight coefficients, w1 and w2 .
Goal: find the combinations of w1 and w2 that minimize the loss function.

Source: Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022). Machine learning with pytorch and Scikit-Learn:
Develop machine learning and deep learning models with python. Packt Publishing.

Diamond indicates the sum of the absolute weight coefficients.

Larger α → smaller diamond.

By increasing α, we shrink the weights towards zero.

Solution:

Balance two different losses - cannot decrease either loss without increasing the other.

Diamond intersects with the contours of the unpenalized loss function. Either w1 or w2
becomes zero (sparse vector).

Source: Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022). Machine learning with pytorch and Scikit-Learn:
Develop machine learning and deep learning models with python. Packt Publishing.

💡 Under the L1 constraint, in order to descend to a lower error, some of the
parameters tend to shrink to zero. L1 produces sparse feature vectors, and most
feature weights are zero, provided α is large enough.

Pros and Cons

L1 is the only Lp norm that introduces sparsity in the solution while remaining convex, which keeps the optimization tractable.

L1 performs feature selection by deciding which features are essential for prediction and which
are not (will be forced to be exactly zero). It helps increase model interpretability.

It’s useful in cases where we have a high-dimensional dataset with many features that are
irrelevant.

Makes our models more efficient to store and compute.

The result of L1 may be unstable: the selected parameters may differ between training runs.

L2 Regularization
L2 regularization adds the squared L2 norm $\alpha \sum_{j=1}^{m} w_j^2$ to the loss function. It regularizes the sum of squares of
the weights.

E.g. Ridge regression

$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{m} w_j^2$$

α ≥ 0: hyperparameter
Controls the amount of shrinkage - the larger the value of α, the greater the amount of shrinkage.

Geometric interpretation

Disk indicates the quadratic L2 regularization term. The combination of w1 and w2 cannot fall
outside the disk.

Larger α → smaller disk.

Solution:

Balance two different losses - cannot decrease either loss without increasing the other.

Disk intersects with the contours of the unpenalized loss function. Both w1 and w2 will be
penalized, but neither is zero.

Source: Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022). Machine learning with pytorch and Scikit-Learn:
Develop machine learning and deep learning models with python. Packt Publishing.

Pros and Cons

L2 shrinks the parameters and reduces influence of unimportant features.

It is more stable than L1 regularization.

✏ L2 regularization is differentiable, so gradient descent can be used for optimizing the


objective function.

It does not shrink parameters all the way to zero, so L2 cannot be used for feature selection. It's
helpful when the only goal is to improve the performance of the model (prevent overfitting).

L1 vs. L2 Regularization
L1 and L2 regularization are generally used to add constraints to the optimization problem and prevent
models from overfitting.

L1 pushes weights all the way to zero, which results in sparsity, while L2 tends to spread the penalty
among all the terms, so L1 performs feature selection while L2 does not.

L1 solutions are more sparse/binary: on the L1 constraint boundary, an increase in one parameter must be
exactly offset by a decrease in another, so many feature weights end up exactly at zero.

For correlated features, L1 tends to select one of them while L2 spreads out the weights (see the sketch below).

The penalty is squared in L2, so large weights are penalized much more heavily than small ones, which strongly
discourages any single large coefficient.
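A minimal sketch of the correlated-features behaviour, assuming two nearly duplicated synthetic features:

```python
# L1 (Lasso) tends to keep one of two almost identical features,
# L2 (Ridge) tends to share the weight between them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
X = np.hstack([x, x + 0.01 * rng.normal(size=(300, 1))])  # two ~identical features
y = 3 * x.ravel() + rng.normal(scale=0.1, size=300)

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)  # one weight driven near 0
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # weight shared between both
```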

🕸
Imbalanced Data

📌
Lesson Structure
Imbalanced Data
Why It Causes Problems
How to Deal with Imbalanced Data
Resampling
Model-level methods
Evaluation Metrics

Interview Questions
What's the disadvantage of an imbalanced dataset?
How to handle imbalanced data?
How to deal with an imbalanced dataset when the data contains only 1% of the minority class (label = 1)?

Imbalanced Data
An imbalanced dataset is a dataset where one or more labels make up the majority of the dataset, leaving
far fewer examples of other labels.

This problem applies to both classification and regression tasks.

Classification: binary classification, multiclass classification, multilabel classification.

e.g. 95% of labels is in one class.

Credit: Rafael Alencar, https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

Regression: examples with outlier values that are either much lower or higher than the
median/average of the data.

e.g. Predict prices for houses. Houses worth > $10M are rare.

In many scenarios, getting more data for the minority class may be impractical or hard to acquire because
the data is inherently imbalanced.

e.g., fraud detection and detection of rare diseases.

Why It Causes Problems

🙁 The model cannot learn to predict the minority class well because of class imbalance.

The model may only learn a simple heuristic (e.g. always predict the dominant class) and get stuck
in a suboptimal solution.

An accuracy of over 90% can be misleading because the model may not have predictive power on the
rare class.

e.g. 95% of labels is in one class.

Credit: Rafael Alencar, https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

Often, the minority class is more important than the majority class. A wrong prediction on an example
of the minority class is more costly than a wrong prediction on an example of the majority class.

e.g., Missing a fraudulent transaction is 100x more costly than misclassifying a legitimate example
as fraud.

How to Deal with Imbalanced Data


Data-level: Resampling

Model-level

Metric-level

Resampling
Change the distribution of the training data to reduce the level of class imbalance.

Over-sampling (Upsampling): Add more examples to the minority class

Random over-sampling

Randomly make copies of the minority class until a ratio is reached.

Credit: Rafael Alencar, https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

📌 Simply making replicas may make the model overfit to the few examples.
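A minimal sketch of random over-sampling with scikit-learn's resample utility; the 9:1 imbalance is illustrative, and the imbalanced-learn library provides SMOTE and under-sampling counterparts:

```python
# Random over-sampling: copy minority-class examples until the classes balance.
import numpy as np
from sklearn.utils import resample

X = np.random.randn(1000, 5)
y = np.array([0] * 900 + [1] * 100)

X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=900, random_state=0)  # copy to 1:1

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
```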

Generate synthetic examples


SMOTE (synthetic minority oversampling technique) - creates synthetic examples of the rare class
by combining original examples. It does this using a nearest neighbors approach.

Credit: Rafael Alencar, https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

It can reduce the overfitting caused by random over-sampling because it does not simply duplicate original
examples.

Under-sampling (Downsampling): Remove examples from the majority class

Random under-sampling
Randomly remove samples of the majority class until a ratio is reached.

Credit: Rafael Alencar, https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

📍 Random under-sampling may make the resulting dataset too small for a model to learn
from, so it only works when we have a sufficient number of examples (at least thousands)
in the minority class.

Tomek links

Find pairs of examples from opposite classes that are close in proximity and remove the example of
the majority class in each pair.

Credit: Rafael Alencar, https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

It may help make the decision boundary more clear and the model learn the boundary better. But
the model may not learn from the subtleties of the true decision boundary.

📍 Resampling methods are a good starting point, but they run the risk of overfitting the training data (over-
sampling) and of losing important information by removing data (under-sampling).

Model-level methods
Make the model more robust to class imbalance.

Does not change the distribution of the training data.

Update loss function

Design a loss function that penalizes the wrong classifications of the minority class more than the
wrong classifications of the majority class.
Force the model to treat specific classes with more weight than others during training.

e.g. Class-balanced loss - make the weight of each class inversely proportional to the number of
samples in that class:

$$W_i = \frac{n}{n_i}$$

| Class | Number of examples | Weight |
| --- | --- | --- |
| A | 1,000 | 1.01 |
| B | 10 | 101 |

The loss caused by example $x$ of class $i$: $L(x; \theta) = W_i \sum_j P(j \mid x; \theta)\,\mathrm{Loss}(x, j)$,

where $\mathrm{Loss}(x, j)$ is the loss when $x$ is misclassified as class $j$ (the wrong class).
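A minimal sketch of passing inverse-frequency class weights to a model with scikit-learn; the 1,000/10 counts mirror the table above:

```python
# Weight classes inversely to their frequency and feed the weights into the
# model's loss function.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 1000 + [1] * 10)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))   # the minority class gets the larger weight

# Equivalent shortcut: let the estimator compute the weights internally,
clf = LogisticRegression(class_weight="balanced")
# or pass explicit weights: LogisticRegression(class_weight={0: 1.0, 1: 100.0})
```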

Select appropriate algorithms

Tree-based models work well on tasks involving small and imbalanced datasets.

Logistic regression is able to handle class imbalance relatively well in a standalone manner.
Adjust the probability threshold to improve the accuracy for predicting the minority class.

Combine multiple techniques

1. Under-sampling + ensemble

Use all samples of the minority class and a subset of the majority class to train multiple models
and then ensemble those models.

| Class | Number of examples | Strategy |
| --- | --- | --- |
| A | 1,000 | divide into 10 groups with 100 examples each |
| B | 100 | use all examples |

2. Under-sampling + update loss function


Under-sample the majority class until a ratio is reached, calculate the new weights for both
classes, then pass the new weights to the loss function of the model.

Evaluation Metrics
Choose appropriate evaluation metrics for the task.

📌 We should use un-sampled data instead of resampled data to evaluate the model, because
using the latter will cause the model to overfit the resampled distribution.
The test data should provide an accurate representation of the original dataset.

Accuracy is misleading when classes are imbalanced - performance of the model on the majority class
will dominate the metric.

Consider using accuracy for each class individually.

Precision, recall, and F1 measure a model’s performance with respect to the positive class in a binary
classification problem.

Precision-Recall curve - identify a threshold that works best for the dataset. It gives more importance
to the positive class (put emphasis on how many predictions the model got right out of the total
number it predicted to be positive), which is helpful for dealing with imbalanced data.

AUC of the ROC curve - tune thresholds to increase recall and decrease false positive rate. It treats
both classes equally and is less sensitive to model improvement on minority class, so it’s less helpful
compared to Precision-Recall curve.
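A minimal sketch of computing these metrics with scikit-learn; the labels and scores below are illustrative:

```python
# Metrics that are more informative than plain accuracy on an imbalanced test set.
import numpy as np
from sklearn.metrics import (classification_report, precision_recall_curve,
                             average_precision_score, roc_auc_score)

y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4])
y_pred  = (y_score >= 0.5).astype(int)

print(classification_report(y_true, y_pred))            # per-class precision/recall/F1
print("PR AUC :", average_precision_score(y_true, y_score))
print("ROC AUC:", roc_auc_score(y_true, y_score))
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```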

Credit: Rachel Draelos, Source: https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/

🌳
Random Forest

📍 Random forest is the top ML algorithm asked in interviews.

📌
Lesson Structure
Random Forest
How Random Forest works
4 steps
Hyperparameters
Variance reduction
Pros and cons of Random Forest

Interview Questions
How does random forest work? What are the ways in which it improves upon individual decision trees?
Why is a random forest "random"?
What are the hyperparameters of a random forest?
Why can random forests help reduce variance?
Pros and cons of a random forest.

Random Forest
Is an ensemble learning method for classification and regression tasks, that operates by constructing
multiple decision trees (each trained on a subset of samples using a subset of features) at training time and
outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the
individual trees.

TseKiChun via Wikimedia Commons

How Random Forest works


4 steps
1. Draw a random bootstrap sample of size n (randomly choose n examples from the training dataset
with replacement).

Credit: Eugenia Anello, Image source: https://towardsdatascience.com/an-introduction-to-probability-sampling-methods

2. Grow a decision tree from the bootstrap sample. At each node:

Randomly select d features without replacement. e.g., if there are 20 features, choose a random
five as candidates for constructing the best split.

Credit: Eugenia Anello, Image source: https://towardsdatascience.com/an-introduction-to-probability-sampling-methods

Split the node using the feature that provides the best split according to the objective function, e.g.,
maximizing the information gain.

3. Repeat steps 1-2 k times. Essentially, we will build k decision trees.

4. Aggregate the prediction by each tree to assign the class label by majority vote (classification) or take
the average (regression).

📌 Why is a random forest "random"?


Random forest constructs a large number of trees with random bootstrap samples from the
training data. As each tree is constructed, take a random subset of features before each node is
split. Repeat this process for each node until the tree is large enough.

Random forest vs Bagging

Random forest is a modification of the bagging algorithm.

Only difference: Step 2

Hyperparameters
The number of trees, k ( n_estimators ), in a random forest (step 3). The larger the number of trees, the
better the performance of the random forest model, at the expense of an increased computational cost.

Less commonly used in practice

Size of the bootstrap sample n ( max_samples ) to consider for each tree (step 1). Typically, this is
chosen to be equal to the number of training samples in the original training dataset.

The number of features d ( max_features ) to consider at each split (step 2).

For classification, the default is d = √m, where m is the number of features in the training
dataset.

For regression, a common default is d = m/3, where m is the number of features in the training
dataset.
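A minimal sketch mapping the hyperparameters above to scikit-learn's RandomForestClassifier; the values are illustrative:

```python
# Random forest with the three hyperparameters discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,     # k: number of trees
    max_features="sqrt",  # d: features considered at each split
    max_samples=None,     # n: bootstrap sample size (None = size of training set)
    bootstrap=True,
    random_state=0,
).fit(X, y)
print(rf.feature_importances_[:5])   # impurity-based feature importances
```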

Variance reduction
Average of i.d. random variables
An average of $k$ i.i.d. random variables, each with variance $\sigma^2$, has variance $\sigma^2 / k$.

If the variables are simply i.d. (identically distributed, but not necessarily independent) with positive
pairwise correlation $\rho$, the variance of the average is (proof)

$$\rho \sigma^2 + \frac{1 - \rho}{k} \sigma^2$$

As $k$ increases, the second term disappears, but the first remains. The larger the correlation, the larger
the variance of the average.

✏ The size of the correlation of pairs of trees limits the benefits of averaging.

How random forest reduce variance?

The idea in random forest is to reduce the variance of the average by reducing the correlation between
the trees.

This is achieved in step 1 and step 2 of the algorithm:


Use a bootstrap sample to grow a tree.

Randomly select d features to be considered when splitting a node. At each split in the learning process, the
algorithm inspects a random subset of the features (instead of all features), which reduces the correlation between
the trees, i.e. it creates de-correlated trees. Reducing d will reduce the correlation between any pair
of trees, and thus reduce the variance of the average.

Pros and cons of Random Forest

Pros

1. Has a better generalization performance than an individual decision tree due to randomness (the
combination of bootstrap samples and using a subset of features), which helps to decrease the
model’s variance (thus low overfitting). So it corrects decision trees' habit of overfitting the training
data.

2. Doesn’t require much parameter tuning. Using full-grown trees seldom costs much and results in
fewer tuning parameters.

3. Less sensitive to outliers in the dataset.

4. It generates feature importance which is helpful when interpreting the results.

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Con

1. Computationally expensive. It is fast to train but quite slow to create predictions once trained. More
accurate models require more trees, which means using the model becomes slower.

Overall, for fast, simple, flexible predictive modeling, random forest is probably one of the most useful and
pragmatic algorithms available today.

🪴
Ensemble Learning: Bagging, Boosting
and Stacking

📌
Lesson Structure
Ensemble Methods
Bagging (Bootstrap aggregation)
Boosting
Stacking

Interview Questions
What is ensemble learning? What are some examples of ensemble learning?
What are boosting and bagging?
What are the advantages of bagging and boosting?
What are the differences between bagging and boosting?
Why are boosting models good?
Explain stacking.

Ensemble Methods

📍 The main idea behind ensemble methods is that a group of “weak learners” can come together
to form a “strong learner”.

Better predictive performance than could be obtained from any of the base learners alone.

If one base learner is erroneous, it can be corrected by the others, so the final model is typically less
prone to overfitting and more robust, unlikely to be influenced by small changes in the training data.



Ensemble for classification tasks. Image source: Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022). Machine learning
with pytorch and Scikit-Learn: Develop machine learning and deep learning models with python. Packt Publishing.

Bagging (Bootstrap aggregation)

📍 Bagging is short for bootstrap aggregation. It builds several instances of an estimator on


bootstrap samples of the original training data and then aggregate their individual predictions
to form a final prediction.

Bagging for classification tasks. Image source: Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022). Machine learning
with pytorch and Scikit-Learn: Develop machine learning and deep learning models with python. Packt Publishing.



1. Create bootstrap samples (sampling with replacement) from the training data.

Ensure each sample is independent from the others, as it does not depend on previously chosen
samples when sampling.

2. Use a single learning algorithm to build a model using each sample.

Those models are built in parallel.

3. Use multiple models to make predictions and the predictions are combined using voting or
averaging.

Voting for classification: the most frequent result (mode) is the final prediction.

Averaging for regression: the average of the results is the final prediction.
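A minimal sketch of bagging with scikit-learn's BaggingClassifier; the dataset and parameter values are illustrative:

```python
# Bagging: many trees fit on bootstrap samples, predictions combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner ("base_estimator" in older releases)
    n_estimators=100,                    # number of bootstrap models
    bootstrap=True,                      # sample with replacement
    random_state=0,
).fit(X, y)
print(bag.predict(X[:5]))   # majority vote across the 100 trees
```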

Example: Random forest

Credit: TseKiChun via Wikimedia Commons

Combines a set of decision trees to make predictions

Decision trees are good candidates for bagging - they can capture complex interactions in the
data, and if grown sufficiently deep, have relatively low bias.

Since trees are noisy, they benefit greatly from the voting/averaging.

🌵 From a bias-variance tradeoff point of view, random forest starts with low bias + high
variance (each tree is fully grown) and work towards reducing variance (by taking majority
vote or averaging across trees).

Boosting



🌻 Boosting improves the prediction power by training weak learners sequentially, each
compensating the weaknesses of its predecessors.

Start with a weak learner, gradually turn it into a strong learner by letting future weak learners focus on
correcting mistakes made by previous learners.

Misclassified examples gain a higher weight than examples that are classified correctly, so future
learners focus more on the examples that previous learners misclassified.

Reduce the bias of the weak learner.

Credit: Sirakorn

1. Start with a weak learner (e.g., a shallow tree) better than random guess.

a. Some examples that are correctly classified and some are not.

2. In the next iteration, the weights of the data are re-adjusted such that they can be corrected in the
succeeding round.

Lower weights to those that were classified correctly.

Higher weights to those that were classified incorrectly.

3. This sequential process of giving higher weights to misclassified predictions continues until a stopping
criterion is reached.

4. The final prediction is a weighted result of all weak learners.
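The reweighting scheme described above is essentially AdaBoost; a minimal sketch with scikit-learn, using shallow stumps as the weak learners (values are illustrative):

```python
# AdaBoost: sequentially reweight misclassified examples and combine the
# weak learners with a weighted vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump ("base_estimator" in older releases)
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
).fit(X, y)
print(boost.score(X, y))
```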

Example: Gradient boosted trees

Train decision trees sequentially

Optimize the residual loss of a tree by adding another tree

It appears to outperform bagging on lots of problems and has become the preferred choice.



🌱 From a bias-variance tradeoff point of view, gradient boosted trees start with high bias +
low variance (first tree is shallow) and work towards reducing bias (by making the tree
more complicated).

Bagging vs Boosting
|  | Individual learners | Bias-variance | Example |
| --- | --- | --- | --- |
| Bagging | independent, built in parallel | reduces variance | random forest |
| Boosting | dependent, built sequentially | reduces bias | gradient-boosted trees |

🍡 Bagging methods work best with strong and complex models (e.g., fully developed decision
trees), while boosting methods usually work best with weak models (e.g., shallow decision
trees).

Stacking
Building a meta-model that takes the output of base learners as input.
Combining estimators to reduce their biases.
Can be applied to classification and regression problems.

Credit: Yash Khandelwal, https://www.analyticsvidhya.com/blog/2021/08/ensemble-stacking-for-machine-learning-and-deep-


learning/

Two-level ensemble:

Individual estimators that feed their predictions to the second level

A combiner estimator is fit to the level-one estimator predictions to make the final prediction



Example

A stacking model for a classification task.

Individual classifiers: random forest and SVM

Stacking classifier: logistic regression
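A minimal sketch of this exact setup with scikit-learn's StackingClassifier:

```python
# Two-level stacking: random forest + SVM feed a logistic-regression combiner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,   # out-of-fold predictions are used to train the combiner
).fit(X, y)
print(stack.score(X, y))
```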

Pros and Cons

In practice, a stacking predictor predicts as well as the best predictor of the base layer and
sometimes outperforms it by combining the different strengths of these predictors.

Training a stacking predictor is computationally expensive.



🎍
Gradient Boosting

📌
Lesson Structure
Gradient Boosting
Gradient-boosted Trees
XGBoost

Interview Questions
What is the gradient boosting method?
Describe the architecture of gradient boosting.
Which of the following are appropriate methods of addressing high variance in a Gradient Boosting model?
- Increase the number of trees
- Use L1 or L2 regularization
- Use randomly selected sub-samples
- None of the above
What is XGBoost?

Gradient Boosting
a.k.a Gradient Boosting Machines (GBMs)

An ensemble learning algorithm which is widely used in industrial applications and machine learning
competitions.

A supervised learning algorithm, which attempts to accurately predict a target by combining the
estimates of a set of simpler and weaker learners.

Learners learn sequentially

Convert many weak learners into a complex learner

It's called gradient boosting because it uses a gradient descent procedure to minimize the loss when
adding new learners to the ensemble.

Source: https://www.kaggle.com/code/alexisbcook/xgboost

Gradient-boosted Trees
Weak learner is Classification and Regression Trees (CART)

Builds decision trees in an iterative fashion using prediction residuals (the difference between the
target and the predicted value)

In each round, a new tree is fit on the residuals of the previous tree.

The model improves as we are moving each tree more in the right direction via small updates.
These updates are based on a loss gradient.

The training proceeds iteratively, adding new trees that predict the residuals of previous trees that
are then combined with previous trees to make the final prediction.

Algorithm

1. Start with a simple model that returns a constant value.

Use a decision tree root node (e.g. a tree with a single leaf node), $F_0(X)$.

2. For each tree $i = 1, \ldots, m$, where $m$ is the predefined maximum number of trees:

How gradient boosted tree works. https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-HowItWorks.html

Compute the residual $r_i$

$r_i$ is the negative gradient of the loss function with respect to the prediction of the previous
tree:

$$r_i = -\left[ \frac{\partial L(Y, F(X))}{\partial F(X)} \right]_{F(X) = F_{i-1}(X)}$$

$L(Y, F(X))$ is a differentiable loss function.

For MSE: $L(Y, F(X)) = \sum (Y - F(X))^2$

$F(X)$ is the prediction of the previous tree, $F_{i-1}(X)$; for MSE the negative gradient is
(up to a constant factor) the ordinary residual, $r_i = Y - F_{i-1}(X)$.

Fit a new tree $h_i$ to predict $r_i$ using all features.

Subsequent learners are trained to predict the errors of the previous prediction: $h_i(X) \approx r_i$.

Update the prediction:

$$F_i(X) = F_{i-1}(X) + \alpha\, h_i(X)$$

$F_{i-1}(X)$ is the prediction of the previous tree.
$\alpha$ is the learning rate, typically a small value between 0.01 and 1.
We scale $h_i$ by $\alpha$ to update the model incrementally by taking small steps, which helps
avoid overfitting.

3. Output $F_m(X)$.

The overall prediction is given by a weighted sum of the collection of trees.
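A minimal from-scratch sketch of the algorithm above for the MSE loss, using scikit-learn regression trees as the weak learners (all values illustrative):

```python
# Gradient boosting for MSE: each new tree is fit to the residuals of the
# current ensemble, then added with a small learning rate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

m, alpha = 100, 0.1                 # number of trees, learning rate
F = np.full(len(y), y.mean())       # F0: constant initial prediction
trees = []

for _ in range(m):
    r = y - F                                   # residual = negative MSE gradient (up to a constant)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
    trees.append(tree)
    F = F + alpha * tree.predict(X)             # F_i = F_{i-1} + alpha * h_i

def predict(X_new):
    return y.mean() + alpha * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - predict(X)) ** 2))
```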

Hyperparameters

Boosting reduces bias and increases variance by increasing the complexity of weak learners. It can
overfit. By tuning the hyperparameters, overfitting can be prevented.

Number of trees m (i.e. number of iterations) - increasing m reduces the error on training set
(bias), but setting it too high may lead to overfitting.

Max depth of trees - increasing the max depth will make the model more complex and more likely
to overfit.

Learning rate α - small learning rates (< 0.1) can yield dramatic improvements in a model's
generalization ability, at the cost of increased computational time (both during training and prediction).

Subsample size (randomly sample a fraction f of the size of the training data prior to growing
trees) - smaller values of f introduce randomness into the algorithm and help prevent overfitting.

📌 Which of the following are appropriate methods of addressing high variance in a Gradient
Boosting model? (Select all that apply)
- Increase the number of trees
✔ Use L1 or L2 regularization
✔ Use randomly selected sub-samples
- None of the above

Pros and Cons

Pros

It produces very accurate models and often outperforms random forests in accuracy.

No data pre-processing required - often works well with categorical and numerical values as is.

Handles missing data - imputation not required.

Cons

Gradient boosting is a sequential process that can be slow to train.

Computationally expensive - often require many trees (>1000) which can be time and memory
exhaustive.

Sacrifices interpretability for accuracy - less interpretative in nature.

e.g., it is straightforward to follow the path a single decision tree takes to make a prediction, but
following the paths of thousands of trees in gradient-boosted trees is much harder.

XGBoost
is short for Extreme Gradient Boosting - the most popular implementation of gradient boosting.

💡 XGBoost is the winning solution for many Kaggle competitions.

According to XGBoost Documentation

XGBoost is an optimized distributed gradient boosting library designed to be


highly efficient, flexible and portable. The goal of this library is to push the
extreme of the computation limits of machines to provide a scalable, portable
and accurate library.

XGBoost provides a parallel tree boosting that solve many data science
problems in a fast and accurate way. The same code runs on major distributed
environment (Hadoop, SGE, MPI) and can solve problems beyond billions of
examples.

It integrates several approximations and tricks that speed up the training process significantly.

Algorithm enhancements

Minimizes a regularized (L1 and L2) objective function that combines a convex loss function
and a penalty term for model complexity → avoid overfitting

Efficient handling of missing data → simplifies data preprocessing

Built-in cross-validation capability (at each iteration) → removes the need for a separate search for the
number of boosting iterations

System optimization → increase speed

Parallelized tree building → increase speed

Tree pruning using ‘depth-first’ approach and it prunes the tree in a backward direction (unlike
the stopping criterion for tree splitting used by GBMs)

Hardware optimization (out-of-core computing)
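A minimal sketch using the xgboost Python package; the hyperparameter values are illustrative:

```python
# XGBoost with regularization and subsampling, via its scikit-learn-style API.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,         # row subsampling adds randomness
    colsample_bytree=0.8,  # feature subsampling per tree
    reg_alpha=0.0,         # L1 penalty on leaf weights
    reg_lambda=1.0,        # L2 penalty on leaf weights
).fit(X, y)
print(model.predict(X[:5]))
```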

Credit: Saksham Gulati, https://sakshamgulati123.medium.com/xgboost-4cb311310adb

🥝
K-means

📌
Lesson Structure
K-means Algorithm
How to choose k?
Pros and Cons

Interview Questions
Explain k-means clustering.
How to choose k in k-means?
Pros and cons of k-means.
Implement k-means from scratch.

K-means Algorithm
k-means is a centroid-based clustering algorithm. It’s very popular and used in a variety of applications
such as market segmentation, document clustering, fraud detection, and image segmentation, etc.

4 step algorithm

1. Randomly pick k centroids from the training examples as initial cluster centers.

These centroids should be chosen in a smart way because different positions lead to different
results. A good choice is to place the initial centroids far away from each other (instead of
random initialization).

2. Assign each example to the nearest centroid.

Distance metric: (squared) Euclidean distance

$$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \|x - y\|_2^2$$

💡 Feature scaling is important for k-means.
We want to make sure that the features are measured on the same scale, so we
need to apply normalization or standardization if necessary.

3. Move the centroids to the center (average) of the examples that were assigned to it.

Compute the average for all the points inside each cluster, then move the cluster centroid to
the average.

4. Repeat step 2 and 3 until some stopping criteria are met.

Convergence (cluster assignments do not change)

Maximum number of iterations is reached

A user-defined tolerance is reached (e.g. variance does not improve by at least X)

Convergence of k-means https://en.wikipedia.org/wiki/K-means_clustering

Objective function

The k-means algorithm chooses centroids that minimize the within-cluster sum of squared errors (SSE),
or cluster inertia:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \|x_i - \mu_j\|_2^2 \right)$$

$x_i$: an example assigned to cluster $j$.
$\mu_j$: the centroid of cluster $j$.
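The "implement k-means from scratch" interview question can be answered with a short NumPy sketch of the four steps (random initialization for simplicity; it ignores the empty-cluster edge case):

```python
# From-scratch k-means: assign points to nearest centroid, recompute centroids,
# repeat until convergence or the iteration limit.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1
    for _ in range(n_iter):
        # step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # step 4: converged
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)
```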

How to choose k?
Elbow method
The intuition behind this technique is that the first few clusters will explain a lot of the variation in
the data, but past a certain number of clusters, the amount of information added is diminishing.

Use SSE to quantify the quality (homogeneity) of clustering. If k increases, SSE will decrease
because examples will be closer to the centroids they are assigned to.

Elbow curve: identify the value of k (the "elbow") where the rate of decrease in SSE changes most sharply. This is
the point after which we don't see much further decrease in SSE.

Use the elbow method to find optimal value of k

Looking at the above graph of explained variation on the y-axis versus the number of clusters (k),
there is a sharp bend in the curve at k = 3.

Silhouette Coefficient
Measures how similar points are in its cluster compared to other clusters.

$$s = \frac{b - a}{\max(a, b)}$$

a: The average distance between an example and all other points in the same cluster. (similarity)

b: The average distance between an example and all other points in the next closest cluster.
(dissimilarity)

The silhouette coefficient varies between -1 and 1 for any given example.

1: the example is in the right cluster as b >> a.
0: cluster separation and cohesion are equal.

-1: the example is in a wrong cluster as a >> b.

By plotting the coefficient versus k, we can get an idea of the optimal value of k .

Figures: k-means clustering with 3 centroids; silhouette coefficients when k = 3.

Figures: k-means clustering with 2 centroids; silhouette coefficients when k = 2.

Pros and Cons


Pros

Easy to implement.

Computationally efficient.

Speed is K-means’ big win.

💡 K-means scales well to large numbers of samples and has been used across a large range
of applications in many different fields.

Cons

The number of clusters, k , has to be determined. An inappropriate choice of k can result in poor
clustering performance.

Stability: Initial positions of the centroids influence the final positions, so two runs can result in two
different clusterings.

Clusters are assumed to be roughly circular/spherical (because Euclidean distance doesn't prefer one
direction over another), so k-means does not work well for datasets requiring flexible cluster shapes.

e.g. K-means is unable to separate this half-moon-shaped dataset.

📌 K-means is susceptible to curse of dimensionality. In very high-dimensional spaces,


Euclidean distances tend to become inflated.
Running a dimensionality reduction algorithm such as Principal component analysis (PCA)
prior to k-means can alleviate this problem and speed up the computations.

🎩
Principal Component Analysis (PCA)

📌
Lesson Structure
What Is PCA?
How PCA Works
Pros and Cons

Interview Questions
What is principal component analysis? How does it work? Explain the sort of problems you would use PCA for.
Describe PCA's formulation and derivation in matrix form.
What are the pros and cons of PCA? Explain its limitations as a method.

What Is PCA?
PCA is a dimensionality reduction technique that transforms input features into their principal
components. It converts a set of observations of possibly correlated features into a set of values of linearly
uncorrelated features.

Goal: map the data from the original high-dimensional space to a lower-dimensional space that
captures as much of the variation in the data as possible. It aims to find the most useful subset of
dimensions to summarize the data.

e.g. a dataset with 3 features → use PCA to extract the first principal component that captures the
most variance in the data.



Credit: Kevin Dunn, Source: https://learnche.org/pid/contents

Linear transformation: PCA finds a sequence of linear combinations of features that have maximum
variance and are uncorrelated.

PCA is an unsupervised learning method: it doesn't use class labels.

How PCA Works


General idea
Principal components are the directions of maximum variance, which has the effect of minimizing the
information loss when you perform a projection or a compression down onto these principal
components.

❓ Why does maximum variance mean minimum information loss?

Suppose $x^{(i)}$ is an example in the original dataset, $v$ is the principal component (a vector), and $a^{(i)} v$ is
the transformed (reconstructed) data point using PCA.



Minimize the mean squared error (MSE) between $x^{(i)}$ and $a^{(i)} v$ → $v$ would most closely reconstruct $x^{(i)}$:

$$\min_{v} \sum_{i} \left( x^{(i)} - a^{(i)} v \right)^2$$

$a^{(i)}$ (a scalar) can be calculated easily given $v$: it's the projection of $x^{(i)}$ onto $v$.

Find $v$ to minimize the residual variance.

Minimizing the variance off $v$ (the residual variance) is equivalent to making $v$ the direction of maximum variance of the data.

Source: https://learnche.org/pid/contents

Reduce the average of all the distances of every feature to the projection line (vector v), so it
projects into the direction of maximal variance to minimize the distance from the original data to its
newly transformed data → minimize the information loss.

For 2 PCs: $x^{(i)} \approx a^{(i)} v_1 + b^{(i)} v_2 + m$ (with $m$ the mean vector).

Source: https://learnche.org/pid/contents



📍 All principal components are uncorrelated (orthogonal) to each other, so the 2nd principal
component is mathematically guaranteed to not overlap with the 1st principal component.

Even if the input features are correlated, the resulting PCs will be mutually uncorrelated → PCs can
be treated as independent features.

Credit: Kevin Dunn, Source: https://learnche.org/pid/contents

📌 Principal components are vectors that define a new coordinate system in which the nth axis
goes in the direction of the nth highest variance of the data.

Steps in PCA
1. Standardization
PCA is sensitive to the relative scaling of the original features.

✏ PCA is highly sensitive to data scaling, so we need to standardize the features prior to
PCA if the features were measured on different scales and we want to assign equal importance to all
features.



Source https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#

Accuracy for the normal (i.e. unscaled) test dataset with PCA 81.48%

Accuracy for the standardized test dataset with PCA 98.15%

📌 Standardization (or Z-score normalization) is an important preprocessing step for PCA.

2. Compute covariance matrix


e.g. Covariance matrix $\Sigma$ of 3 features:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}$$

The covariance matrix is a symmetric square matrix ($A = A^T$).


3. Eigendecomposition

The factorization of a square matrix into eigenvectors and eigenvalues.

$$A v = \lambda v$$

where $A$ is the matrix, $v$ an eigenvector, and $\lambda$ the corresponding eigenvalue.

Decompose the covariance matrix $\Sigma$ into eigenpairs.



📌 Eigenvectors of the covariance matrix represent the principal components and the
corresponding eigenvalues define their magnitude.

4. Choose k principal components (k ≤ d)


Sort the eigenpairs in descending order of the eigenvalues

The eigenvector with the largest eigenvalue → the first principal component

The eigenvector with the second largest eigenvalue → the second principal component

How to select the number of principal components?

1. To retain certain % of the variance, e.g. 90%.

2. Choose a cut off when it becomes apparent that adding more PCs doesn’t get much more
variance.

e.g. the first two PCs capture ~60% of the variance in the data.

The proportion of total variance explained by the principal components.

3. Specific use case, e.g. data visualization.

5. Feature transformation

Transform the $d$-dimensional feature vector $x$ into a $k$-dimensional feature vector $x'$:

$$x' = x W$$

where $W$ is a $d \times k$ projection matrix constructed from the top $k$ eigenvectors.
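A minimal from-scratch sketch of the five steps above, which can be checked against sklearn.decomposition.PCA (which uses SVD internally):

```python
# PCA via eigendecomposition of the covariance matrix of standardized data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

X_std = StandardScaler().fit_transform(X)            # 1. standardize
cov = np.cov(X_std, rowvar=False)                    # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)               # 3. eigendecomposition
order = np.argsort(eigvals)[::-1]                    # 4. sort by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print("variance explained per PC:", explained.round(3))

k = 2
W = eigvecs[:, :k]                                   # projection matrix (d x k)
X_pca = X_std @ W                                    # 5. transform to k dims
```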



Pros and Cons
Advantages

Removes correlated features and noise in the data

All the PCs are independent of each other. There is no correlation among them.

The first few PCs can capture the majority of the variance in the data and the rest largely represent noise
in the data.

A data preprocessing step before using a learning algorithm - transformed data are available
to use.

Improves algorithm performance

With high-dimensional features, the performance of an algorithm will degrade.

PCA speeds up the algorithm by getting rid of correlated features and noise that don't
contribute to decision making. The training time of an algorithm reduces significantly with
fewer features.

Visualizes high-dimensional data

PCA transforms a high dimensional data to low dimensional data (2 or 3 dimensions) so that
the data can be visualized easily.

Limitations

PCA is not scale invariant; it is sensitive to the relative scaling of the input features.

Features become less interpretable: PCs are the linear combination of the original features and
they are not as readable and interpretable as original features.

PCA is only based on the mean vector and the covariance matrix. Some distributions (e.g. multivariate normal)
are fully characterized by these, but others are not.

PCA is an unsupervised learning method, so it does not take labels into account.

