
Prof. Pedram Jahangiry
Utah State University: Huntsman School of Business

Machine Learning
Each model entry below lists: model type and use case, description, pros, cons, and hyperparameters.
Supervised

Linear regression
Type and use case: Linear, parametric. Used for regression only.
Description: Finds the "best fit" line through all the data points.
Pros:
- Highly interpretable (gives significance results)
- Very fast training because of the closed-form solution
- No hyperparameter tuning required
Cons:
- Relies on the validity of the linear regression assumptions
- Cannot capture complex relationships
Hyperparameters: none
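
A minimal sketch of an ordinary least squares fit, using scikit-learn as an assumed implementation (the toy data are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature, four observations
    y = np.array([2.1, 3.9, 6.2, 8.1])           # roughly y = 2x
    model = LinearRegression().fit(X, y)          # closed-form solution, nothing to tune
    print(model.intercept_, model.coef_)          # estimated intercept and slope
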
Polynomial regression
Type and use case: Linear, parametric. Used for regression only.
Description: Extends the linear regression model to capture non-linearities.
Pros:
- Interpretable for low values of d (gives significance results)
- Can capture polynomial relationships
Cons:
- Need to choose the right polynomial degree
- Notorious tail behavior (sensitive to outliers)
Hyperparameters: d: degree of the polynomial
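
One common way to fit a degree-d polynomial is to expand the features and reuse linear regression; the sketch below assumes scikit-learn and a synthetic quadratic signal (values are illustrative):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.2, size=50)                # quadratic signal plus noise
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())  # d = 2
    model.fit(X, y)
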
Penalized regression: Ridge, LASSO and Elastic Net
Type and use case: Linear, parametric. Used for regression only.
Description: Linear method that penalizes irrelevant features using regularization. L1 regularization: LASSO; L2 regularization: Ridge; combining L1 and L2: Elastic Net.
Pros:
- Can be used for feature selection (reducing the dimension of the feature space)
- Interpretable
Cons:
- Requires feature scaling
Hyperparameters: penalty (how much to penalize the parameters); L1 ratio: ratio between L1 and L2 regularization
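
A minimal sketch of the three penalized variants, assuming scikit-learn (alpha plays the role of the penalty and l1_ratio the L1/L2 mix; the synthetic data and values are illustrative):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, Lasso, ElasticNet
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)   # L1 can zero out coefficients (feature selection)
    enet  = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)).fit(X, y)
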
Logistic regression
Type and use case: Linear, parametric. Used for classification only.
Description: Basically the adaptation of linear regression to classification problems.
Pros:
- Probabilistic model (the outputs are probabilities)
- Highly interpretable (gives significance results)
- Easy to understand
- Fast and efficient
Cons:
- Relies on the validity of the linear regression assumptions
- Sensitive to extreme values
- Cannot capture complex relationships
Hyperparameters: the same as penalized regression if regularization is used
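
A minimal classification sketch, assuming scikit-learn (C is the inverse regularization strength; the synthetic data are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)   # regularized, as with penalized regression
    print(clf.predict_proba(X[:3]))                           # outputs are class probabilities
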
KNN
Type and use case: Non-linear, non-parametric. Used for both regression and classification.
Description: Makes a prediction for a new observation by finding similarities ("nearness") between it and its k nearest neighbors in the existing dataset.
Pros:
- Intuitive and simple
- Easy to implement for multi-class problems
- Few parameters/hyperparameters
- No assumptions (non-parametric)
Cons:
- Choice of K
- Slow (memory-based approach)
- Curse of dimensionality
- Hard to interpret
- Requires feature scaling
- Not good with multiple categorical features
Hyperparameters: K value; distance metrics
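
A minimal sketch, assuming scikit-learn (scaling is included because KNN relies on distances; k = 5 and the Euclidean metric are illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
    knn.fit(X, y)   # "training" just stores the data; prediction does the distance work
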
SVM
Type and use case: Kernel basis (non-linear). Linear SVM is parametric; kernel SVM is non-parametric. Used for both regression and classification.
Description: Uses a kernel to transform the feature space to linearly separable boundaries.
Pros:
- SVM can be memory efficient: uses only a subset of the training data (the support vectors)
- Can handle non-linear data sets
- Can handle high-dimensional spaces (even when D > N)
- Linear SVMs are not very sensitive to overfitting (soft margin; regularization)
- Can have high accuracy (even compared to NN)
Cons:
- Requires feature scaling
- No probability outcome
- Does not perform well with noisy data
- Limited interpretability (especially for kernel SVM)
- Memory intensive: long training time on large data sets
Hyperparameters: kernel: linear, rbf, poly, ...; C: cost of misclassification; gamma (for rbf): how far the influence reaches; d (for poly): degree of the polynomial
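
A minimal kernel SVM sketch, assuming scikit-learn (the moons data set and the parameter values are illustrative; kernel, C and gamma are the main knobs):

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, gamma="scale"))   # rbf kernel handles the non-linear boundary
    clf.fit(X, y)
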
Decision Trees
Type and use case: Tree-based (non-linear), non-parametric. Used for both regression and classification.
Description: Progressively divides the data into smaller groups based on a descriptive feature, until the groups are small enough to be described by some label.
Pros:
- Easy to interpret and visualize
- Can easily handle categorical data without the need to create dummy variables
- Can easily capture non-linear patterns
- Can handle data in its raw form (no preprocessing needed)
- No assumptions (non-parametric)
- Can handle collinearity efficiently
Cons:
- Sensitive to noisy data; can overfit noisy data
- Small variations in the data can result in a different decision tree
- Can lead to overfitting
- Poor level of predictive accuracy
Hyperparameters: max tree depth; min samples per leaf (node); min samples split; cost-complexity alpha; criterion: gini/entropy/...
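
A minimal sketch, assuming scikit-learn (the hyperparameter values below are illustrative, not recommendations):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                  criterion="gini", ccp_alpha=0.0).fit(X, y)
    print(export_text(tree))   # the fitted splits can be read off directly, hence the interpretability
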
Random Forest
Type and use case: Ensemble method (non-linear), non-parametric. Used for both regression and classification.
Description: Many trees are created on bootstrapped data and combined using averaging.
Pros: all the advantages of Decision Trees, plus
- Typically more accurate
- Avoids overfitting by reducing the model variance
- Very flexible and parallelizable
- No data preprocessing (no feature scaling)
- Great with high dimensionality
Cons:
- No interpretability
- Complexity
- Many hyperparameters
- Slow on large data sets
Hyperparameters: the Decision Tree parameters, plus m: subset of features; B: number of bootstrapped trees
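
A minimal sketch, assuming scikit-learn (n_estimators plays the role of B and max_features the role of m; the data set and values are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=500,      # B: number of bootstrapped trees
                                max_features="sqrt",   # m: features considered at each split
                                n_jobs=-1, random_state=0).fit(X, y)
    print(rf.feature_importances_[:5])                 # a rough substitute for interpretability
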
Boosting (XGBoost)
Type and use case: Ensemble method (non-linear), non-parametric. Used for both regression and classification.
Description: Implements boosting to build decision trees of weak prediction models and generalizes using a loss function.
Pros: all the advantages of Random Forests, plus
- Regularization for avoiding overfitting
- Efficient handling of missing data
- In-built cross-validation capability
- Cache awareness and out-of-core computing
- Tree pruning using a depth-first approach
- Parallelized tree building
Cons:
- No interpretability
- Many hyperparameters
Hyperparameters: the Random Forest parameters, plus regularization terms
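
A minimal sketch, assuming the xgboost Python package with its scikit-learn style wrapper (the parameter values are illustrative; reg_lambda and reg_alpha are the regularization terms):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier   # assumes the xgboost package is installed

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=3,
                          reg_lambda=1.0)             # L2 regularization on leaf weights
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))                    # accuracy on held-out data
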
Unsupervised

Principal Component Analysis (PCA)
Type and use case: Non-parametric. Used for dimension reduction.
Description: Principal components are vectors that define a new coordinate system in which the first axis goes in the direction of the highest variance in the data; the second PC is orthogonal to PC1, and so on.
Pros:
- Reducing the number of features to the most relevant predictors is very useful in general
- Dimension reduction facilitates data visualization in two or three dimensions
- Before training another supervised or unsupervised learning model, it can be performed as part of EDA to identify patterns and detect correlations
- Machine learning models are quicker to train, tend to overfit less (by avoiding the curse of dimensionality), and are easier to interpret if provided with lower-dimensional datasets
Cons:
- Hard to interpret
- Requires feature scaling
Hyperparameters: none
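
A minimal sketch, assuming scikit-learn (the iris data and the choice of two components are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)     # PCA requires feature scaling
    pca = PCA(n_components=2)                        # keep the first two principal components
    X_2d = pca.fit_transform(X_scaled)
    print(pca.explained_variance_ratio_)             # share of variance captured by each PC
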

K-Means
Type and use case: Can be both parametric and non-parametric. Used for clustering the data.
Description: Uses a measure of similarity to detect groups within a data set: K-means is an algorithm that repeatedly partitions observations into a fixed, pre-specified number of clusters.
Pros:
- Simple to understand
- The k-means algorithm is fast and works well on very large datasets
- Can help visualize the data and facilitate detecting trends or outliers
Cons:
- Need to choose k before running the algorithm
- Requires feature scaling
- Poor performance with clusters of irregular shapes
- Not applicable for categorical data
- Unable to handle noisy data
Hyperparameters: K: number of clusters; distance metrics
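
A minimal sketch, assuming scikit-learn (the blob data and k = 4 are illustrative; k must be fixed before fitting):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)                       # distances are scale sensitive
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
    print(km.labels_[:10], km.inertia_)                                # cluster assignments and within-cluster SSE
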

Hierarchical clustering
Type and use case: Can be both parametric and non-parametric. Used for clustering the data.
Description: Uses a measure of similarity to detect groups within a data set: hierarchical clustering is an iterative procedure used to build a hierarchy of clusters.
Pros:
- The optimal number of clusters can be obtained from the model itself
Cons:
- The choice of distance metrics and linkage methods can be tricky
- Requires feature scaling
Hyperparameters: distance metrics; linkage methods
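
A minimal agglomerative sketch, assuming scikit-learn (the blob data, the number of clusters, and Ward linkage are illustrative choices; the distance metric defaults to Euclidean):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)
    hc = AgglomerativeClustering(n_clusters=3, linkage="ward")   # linkage is a key hyperparameter
    labels = hc.fit_predict(X_scaled)
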
