
- Understanding Machine Learning Algorithms: Handwritten Notes -
Notes by RaviTeja G

Table of Contents
1. What is Machine Learning?

2. What are the Types of Machine Learning?

3. Supervised Machine Learning

4. Unsupervised Machine Learning

5. Reinforcement Learning

6. Semi-Supervised Learning

7. Steps in ML Project

8. Exploring Step 1 - Data Collection

9. Exploring Step 2 - Data Preparation

- Exploratory Data Analysis

- Data Preprocessing

- Feature Engineering
10. Exploring Step 3 - Train Model on Dataset
- Types of Learning
- Under Fitting and OverFitting
- Regularization techniques
- Hyperparameter Tuning
11. Exploring Step 4 - Evaluation of a Model
- Evaluation Metrics
- Confusion Matrix
- Recall/Sensitivity
- Precision
- Specificity
- F1 Score
- AUC and ROC Curve
- Analysis of a Model
12. Supervised Learning
- Linear Regression
- Regularization Techniques
- Logistic Regression
- Decision Trees
- Ensemble Techniques
- Random Forests
- AdaBoost
- Gradient Boost
- XG Boost
- K-Nearest Neighbours
- Support Vector Machines
- Naive Bayes Classifiers
13. Unsupervised Learning
- Clustering Techniques
- K-Means Clustering
- Hierarchical Clustering
- DB Scan Clustering
- Evaluation of Clustering Models
- Curse of Dimensionality
- Principal Component Analysis
14. Cheat Sheet of Supervised and Unsupervised Algorithms
Understanding Linear Regression

Table of Contents
1. What is Linear Regression
2. Understanding with an example
3. Evaluating the fitness of the model
4. Understanding Gradient descent
5. Understanding Loss Function
6. Measuring Model Strength
7. Another Approach for LR - OLS
Understanding Regularization Techniques

Table Of Contents
1. Understanding Multicollinearity
2. Variance Inflation Factor
3. Regularization
4. Lasso - L1 Form
5. Ridge - L2 Form
6. Elastic Net
7. Difference Between Ridge and Lasso
8. When to use Ridge/Lasso/Elastic Net
9. Polynomial Regression
Understanding Decision Trees

Table Of Contents
1. Why do we need Decision Trees
2. How it works
3. How do we select a root node
4. Understanding Entropy, Information Gain
5. Solving an Example on Entropy
6. Understanding Gini Impurity
7. Solving an Example on Gini Impurity
8. Decision tree for Regression
9. Why Decision Trees are a Greedy Approach
10. Understanding Pruning
Understanding Boosting Algorithms

Table of Contents
1. Understanding Boosting
2. Understanding AdaBoost
3. Solving an Example on AdaBoost
4. Understanding Gradient Boosting
5. Solving an Example on Gradient Boosting
6. AdaBoost vs Gradient Boosting
Understanding K-Nearest Neighbours

Table Of Contents
1. How does K-Nearest Neighbours work
2. How is Distance Calculated
- Euclidean Distance
- Hamming Distance
- Manhattan Distance
3. Why is KNN a Lazy Learner
4. Effects of Choosing the value of K
5. Different ways to perform KNN
6. Understanding KD-Tree
7. Solving an Example of KD Tree
8. Understanding Ball Tree
Understanding Support Vector Machines

Table Of Contents
1. Understanding Concept of SVC
2. What are Support Vectors
3. What is Margin
4. Hard Margin and Soft Margin
5. Kernelized SVC
6. Types of Kernels
7. Understanding SVR
Understanding Naive Bayes Classifiers

Table Of Contents
1. Why do we need Naive Bayes
2. Concept of how it works
3. Mathematical Intuition of Naive Bayes
4. Solving an Example on Naive Bayes
5. Other Bayes Classifiers
- Gaussian Naive Bayes Classifier
- Multinomial Naive Bayes Classifier
- Bernoulli Naive Bayes Classifier
Understanding Clustering

Table of Contents
1. How clustering is different from classification
2. Applications of Clustering
3. What are density based methods
4. What are Hierarchical based methods
5. What are partitioning methods
6. What are Grid Based methods
7. Main Requirements for Clustering Algorithms
Understanding K-Means Clustering

Table Of Contents
1. Concept of K-Means Clustering
2. Math Intuition Behind K-Means
3. Cluster Building Process
4. Edge Case Scenarios of K-Means
5. Challenges and Improvements in K-Means
Understanding Principal Component Analysis

Table Of Contents
1. Idea Behind PCA
2. What are Principal Components
3. Eigen Decomposition Approach
4. Singular Value Decomposition Approach
5. Why do we maximize Variance
6. What is Explained Variance Ratio
7. How to select the optimal number of Principal Components
8. Understanding Scree plot
9. Issues with PCA
10. Understanding Kernel PCA
– Supervised Algorithms –

Regression Models
Each entry in the tables below lists the algorithm, its description & application, its advantages, and its disadvantages.

Linear Regression
Description & Application: Linear Regression models a linear relationship between input variables and a continuous numerical output variable. The default loss function is the mean squared error (MSE).
Advantages: 1. Fast training, because there are few parameters. 2. Interpretable/explainable results through its output coefficients.
Disadvantages: 1. Assumes a linear relationship between input and output variables. 2. Sensitive to outliers. 3. Typically generalizes worse than ridge or lasso regression.

Polynomial Regression
Description & Application: Polynomial Regression models nonlinear relationships between the dependent and independent variables as an n-th degree polynomial.
Advantages: 1. Provides a good approximation of the relationship between the dependent and independent variables. 2. Capable of fitting a wide range of curvature.
Disadvantages: 1. Poor interpretability of the coefficients, since the underlying variables can be highly correlated. 2. The fitted curve is nonlinear in the features, but the model itself is still linear in its coefficients. 3. Prone to overfitting.

Support Vector Regression
Description & Application: Support Vector Regression (SVR) uses the same principle as SVMs but optimizes the cost function to fit the best straight line (or plane) through the data points. With the kernel trick it can efficiently perform non-linear regression by implicitly mapping its inputs into high-dimensional feature spaces.
Advantages: 1. Robust against outliers. 2. Effective learning and strong generalization performance. 3. Different kernel functions can be specified for the decision function.
Disadvantages: 1. Does not perform well with large datasets. 2. Tends to underfit in cases where the number of variables is much smaller than the number of observations.

Gaussian Process Regression
Description & Application: Gaussian Process Regression (GPR) uses a Bayesian approach that infers a probability distribution over the possible functions that fit the data. The Gaussian process is a prior that is specified as a multivariate Gaussian distribution.
Advantages: 1. Provides uncertainty measures on the predictions. 2. A flexible and usable non-linear model that fits many datasets well. 3. Performs well on small datasets, as the GP kernel makes it possible to specify a prior on the function space.
Disadvantages: 1. A poor choice of kernel can make convergence slow. 2. Specifying custom kernels requires deep mathematical understanding.
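
To make the comparison above concrete, here is a minimal sketch (assuming scikit-learn is installed) that fits all four regression models on a synthetic dataset and compares cross-validated R^2 scores. The dataset, polynomial degree, and SVR/GP settings are illustrative choices, not recommendations from these notes.

# Regression models from the table above, compared with 5-fold cross-validation.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor

# Illustrative synthetic data: 5 features with additive noise.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    # Polynomial regression = polynomial feature expansion + a linear fit.
    "Polynomial Regression (deg 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    # SVR is scale-sensitive; the RBF kernel handles non-linearity.
    "SVR (RBF kernel)": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
    # normalize_y helps the default GP kernel cope with unscaled targets.
    "Gaussian Process Regression": make_pipeline(StandardScaler(), GaussianProcessRegressor(normalize_y=True)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
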
Classification Models

SVM
Description & Application: In its simplest form, a support vector machine is a linear classifier. But with the kernel trick, it can efficiently perform non-linear classification by implicitly mapping its inputs into high-dimensional feature spaces. This makes SVM one of the best prediction methods.
Advantages: 1. Effective in cases with a high number of variables. 2. The number of variables can be larger than the number of samples. 3. Different kernel functions can be specified for the decision function.
Disadvantages: 1. Sensitive to overfitting; regularization is crucial. 2. Choosing a "good" kernel function can be difficult. 3. Computationally expensive for big data due to high training complexity. 4. Performs poorly if the data is noisy (target classes overlap).

Nearest Neighbors
Description & Application: Nearest Neighbors predicts the label based on a predefined number of samples closest in distance to the new point.
Advantages: 1. Successful in situations where the decision boundary is irregular. 2. Non-parametric approach, as it makes no assumption about the underlying data.
Disadvantages: 1. Sensitive to noisy and missing data. 2. Computationally expensive, because the entire set of n points must be searched for every prediction.

Logistic Regression (and its extensions)
Description & Application: Logistic Regression models a linear relationship between the input variables and the response variable. It models the output as binary values (0 or 1) rather than numeric values.
Advantages: 1. Explainable & interpretable. 2. Less prone to overfitting when using regularization. 3. Applicable for multi-class predictions.
Disadvantages: 1. Makes a strong assumption about the relationship between input and response variables. 2. Multicollinearity can cause the model to easily overfit without regularization.

Linear Discriminant Analysis
Description & Application: The linear decision boundary maximizes the separability between the classes by finding a linear combination of features.
Advantages: 1. Explainable & interpretable. 2. Applicable for multi-class predictions.
Disadvantages: 1. Multicollinearity can cause the model to overfit. 2. Assumes that all classes share the same covariance matrix. 3. Sensitive to outliers. 4. Doesn't work well with small class sizes.
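
A similarly minimal sketch (again assuming scikit-learn) for the classification models above; the toy dataset and hyperparameters are illustrative only.

# Classification models from the table above on a synthetic binary problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

classifiers = {
    # SVM, KNN and logistic regression are scale-sensitive, so standardize first.
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Nearest Neighbors (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
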
Both Regression and Classification Models
Decision Trees
Description & Application: Decision Tree models learn from the data by making decision rules on the variables to separate the classes in a flowchart-like tree structure. They can be used for both regression and classification.
Advantages: 1. Explainable and interpretable. 2. Can handle missing values.
Disadvantages: 1. Prone to overfitting. 2. Can be unstable with minor data drift. 3. Sensitive to outliers.

Random Forest
Description & Application: Random Forest models learn using an ensemble of decision trees. The output of the random forest is based on a majority vote of the individual trees (or their average, for regression).
Advantages: 1. Effective learning and better generalization performance. 2. Can handle moderately large datasets. 3. Less prone to overfitting than decision trees.
Disadvantages: 1. A large number of trees can slow down performance. 2. Predictions are sensitive to outliers. 3. Hyperparameter tuning can be complex.

Gradient Boosting
Description & Application: An ensemble learning method where weak predictive learners are combined to improve accuracy. Popular implementations include XGBoost, LightGBM, and others.
Advantages: 1. Handles multicollinearity. 2. Handles non-linear relationships. 3. Effective learning and strong generalization performance. 4. XGBoost is fast and is often used as a benchmark algorithm.
Disadvantages: 1. Sensitive to outliers and can therefore overfit. 2. High complexity due to hyperparameter tuning. 3. Computationally expensive.

Ridge Regression
Description & Application: Ridge Regression penalizes variables with low predictive value by shrinking their coefficients towards zero. It can be used for classification and regression.
Advantages: 1. Less prone to overfitting. 2. Best suited when data suffers from multicollinearity. 3. Explainable & interpretable.
Disadvantages: 1. All the predictors are kept in the final model. 2. Doesn't perform feature selection.

Lasso Regression
Description & Application: Lasso Regression penalizes features that have low predictive value by shrinking their coefficients to zero. It can be used for classification and regression.
Advantages: 1. Good generalization performance. 2. Good at handling datasets where the number of variables is much larger than the number of observations. 3. No need for separate feature selection.
Disadvantages: 1. Poor interpretability/explainability, as it may keep only a single variable from a set of highly correlated variables.

AdaBoost
Description & Application: Adaptive Boosting uses an ensemble of weak learners that are combined into a weighted sum representing the final output of the boosted classifier.
Advantages: 1. Explainable & interpretable. 2. Less need for tweaking parameters. 3. Often outperforms Random Forest.
Disadvantages: 1. The weak learners are fitted sequentially and are not jointly optimized. 2. Sensitive to noisy data and outliers.
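
The same pattern works for the models above that handle both tasks; the sketch below (assuming scikit-learn) uses their regression variants, with illustrative hyperparameters.

# Tree-based ensembles and penalized linear models, regression variants.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=4),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=0),
    # alpha controls the strength of the L2 (Ridge) / L1 (Lasso) penalty.
    "Ridge (alpha=1.0)": Ridge(alpha=1.0),
    "Lasso (alpha=0.1)": Lasso(alpha=0.1),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")

The classifier counterparts (DecisionTreeClassifier, RandomForestClassifier, and so on) follow the same fit/predict interface.
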
– Unsupervised Algorithms –
Clustering Algorithms
K-Means
Description & Application: The most common clustering approach. It assumes that the closer data points are to each other, the more similar they are, and it determines K clusters based on Euclidean distances.
Advantages: 1. Scales to large datasets. 2. Interpretable & explainable results. 3. Can generate tight clusters.
Disadvantages: 1. Requires defining the expected number of clusters in advance. 2. Not suitable for identifying clusters with non-convex shapes.

DBSCAN
Description & Application: Density-Based Spatial Clustering of Applications with Noise can handle non-linear cluster structures, purely based on density. It can differentiate and separate regions with varying degrees of density, thereby creating clusters.
Advantages: 1. No assumption on the expected number of clusters. 2. Can handle noisy data and outliers. 3. No assumptions on the shapes and sizes of the clusters. 4. Can identify clusters with different densities.
Disadvantages: 1. Requires optimization of two parameters. 2. Can struggle with very high-dimensional data.

HDBSCAN
Description & Application: Part of the family of density-based algorithms, with roughly two steps: finding the core distance of each point, then expanding clusters from them. It extends DBSCAN by converting it into a hierarchical clustering algorithm.
Advantages: 1. No assumption on the expected number of clusters. 2. Can handle noisy data and outliers. 3. No assumptions on the shapes and sizes of the clusters. 4. Can identify clusters with different densities.
Disadvantages: 1. Mapping unseen objects in HDBSCAN is not straightforward. 2. Can be computationally expensive.

Agglomerative Hierarchical Clustering
Description & Application: Uses hierarchical clustering to determine the distance between samples based on the chosen metric, and pairs are merged into clusters using the linkage type.
Advantages: 1. No need to specify the number of clusters. 2. With the right linkage, it can be used for the detection of outliers. 3. Interpretable results using dendrograms.
Disadvantages: 1. Specifying the metric and linkage type requires a good understanding of the statistical properties of the data. 2. Not straightforward to optimize. 3. Can be computationally expensive for large datasets.

OPTICS
Description & Application: Part of the family of density-based algorithms; it finds core samples of high density and expands clusters from them. It operates with a core distance (ε) and a reachability distance.
Advantages: 1. No assumption on the expected number of clusters. 2. Can handle noisy data and outliers. 3. No assumptions on the shapes and sizes of the clusters. 4. Can identify clusters with different densities. 5. Not required to define a fixed radius as in DBSCAN.
Disadvantages: 1. Only produces a cluster ordering. 2. Does not work well with very high-dimensional data. 3. Slower than DBSCAN.
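
A minimal clustering sketch follows (assuming scikit-learn; HDBSCAN ships as sklearn.cluster.HDBSCAN only in recent scikit-learn versions, or via the separate hdbscan package, so it is omitted here). The blob data and parameter values are illustrative.

# Clustering algorithms from the table above on standardized toy blobs.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, OPTICS

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)  # density-based methods are scale-sensitive

algorithms = {
    "K-Means (k=4)": KMeans(n_clusters=4, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "Agglomerative (Ward linkage)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
    "OPTICS": OPTICS(min_samples=5),
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points
    print(f"{name}: {n_clusters} clusters found")

Note that only K-Means and Agglomerative Clustering need the number of clusters up front; DBSCAN and OPTICS derive it from the density structure.
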
Dimensionality Reduction Techniques

PCA
Description & Application: Principal Component Analysis (PCA) is a feature extraction approach that uses a linear function to reduce dimensionality in datasets while minimizing information loss.
Advantages: 1. Explainable, interpretable results. 2. New unseen data points can be mapped into the existing PCA space. 3. Can be used as a dimensionality reduction technique as a preliminary step to other machine learning tasks. 4. Helps reduce overfitting. 5. Helps remove correlated features.
Disadvantages: 1. Sensitive to outliers. 2. Requires data standardization.

t-SNE
Description & Application: t-distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction method that converts similarities between data points to joint probabilities, using the Student t-distribution in the low-dimensional space.
Advantages: 1. Helps preserve the relationships seen in high dimensionality. 2. Easy to visualise the structure of high-dimensional data in 2 or 3 dimensions. 3. Very effective for visualizing clusters or groups of data points and their relative proximities.
Disadvantages: 1. The cost function is not convex: different initializations can give different results. 2. Computationally intensive for large datasets. 3. Default parameters do not always achieve the best results.

UMAP
Description & Application: Uniform Manifold Approximation and Projection (UMAP) constructs a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible.
Advantages: 1. Can be used as a general-purpose dimensionality reduction technique as a preliminary step to other machine learning tasks. 2. Can be very effective for visualizing clusters or groups of data points and their relative proximities. 3. Able to handle high-dimensional sparse datasets.
Disadvantages: 1. Default parameters do not always achieve the best results.

ICA
Description & Application: Independent Component Analysis (ICA) is a linear dimensionality reduction method that aims to separate a multivariate signal into additive subcomponents, under the assumption that the independent components are non-Gaussian. Where PCA "compresses" the data, ICA "separates" the information.
Advantages: 1. Can separate multivariate signals into their subcomponents. 2. The method has a clear aim: it is only applicable if there are multiple independent generators of information to uncover. 3. Can extract hidden factors in the data by transforming a set of variables into a new set that is maximally independent.
Disadvantages: 1. Without any prior knowledge, determining the number of independent components or sources can be difficult. 2. PCA is often required as a pre-processing step.
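
A minimal dimensionality reduction sketch follows (assuming scikit-learn; UMAP comes from the separate umap-learn package and is left out here). The digits dataset and two-component projections are illustrative.

# PCA, t-SNE and ICA projections of the digits dataset down to 2 dimensions.
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)  # PCA and ICA expect standardized features

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# t-SNE is non-linear and non-convex; results depend on the initialization.
X_tsne = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)

# FastICA separates the data into maximally independent components.
X_ica = FastICA(n_components=2, max_iter=1000, random_state=0).fit_transform(X)

print("Projected shapes:", X_pca.shape, X_tsne.shape, X_ica.shape)
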
Association Rules

Apriori algorithm
Description & Application: The Apriori algorithm uses the join and prune steps iteratively to identify the most frequent itemsets in a given dataset. Prior knowledge (a priori) of frequent itemset properties is used in the process.
Advantages: 1. Explainable & interpretable results. 2. Exhaustive approach based on confidence and support.
Disadvantages: 1. Computationally expensive, since it generates a large number of candidate itemsets. 2. Requires multiple scans of the dataset, which is slow for large datasets.

FP-growth algorithm
Description & Application: The Frequent Pattern growth algorithm (FP-growth) is an improvement on the Apriori algorithm for finding frequent itemsets. It generates a conditional FP-tree for every item in the data.
Advantages: 1. Explainable & interpretable results. 2. Smaller memory footprint than the Apriori algorithm.
Disadvantages: 1. More complex to build than Apriori. 2. Can result in many (incremental) overlapping/trivial itemsets.

FP-Max algorithm
Description & Application: A variant of Frequent Pattern growth that focuses on finding maximal itemsets.
Advantages: 1. Explainable & interpretable results. 2. Smaller memory footprint than the Apriori and FP-growth algorithms.
Disadvantages: 1. More complex to build than Apriori.
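
These association rule miners are not part of scikit-learn; a minimal sketch using the third-party mlxtend package (assuming it is installed) looks like the following. The tiny transaction list and min_support value are illustrative.

# Frequent itemset mining with Apriori, FP-growth and FP-Max via mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax

transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent_apriori = apriori(onehot, min_support=0.5, use_colnames=True)  # join-and-prune search
frequent_fp = fpgrowth(onehot, min_support=0.5, use_colnames=True)      # FP-tree, no candidate generation
maximal = fpmax(onehot, min_support=0.5, use_colnames=True)             # maximal frequent itemsets only

print(frequent_apriori)
print(frequent_fp)
print(maximal)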
