How to Learn a Decision Tree?
[Figure: example tree with "Weather" as the root node (the start node), branching into Sunny and Rainy]
Top-down approach*: grow the tree from the root node to the leaf nodes.
*ID3 (Iterative Dichotomiser 3)
Decision Trees: Numerical Example
• Given this dataset, let's predict the y class (1 vs. 2) using a Decision Tree.
• Iteratively split the dataset into subsets from a root node, such that the leaf nodes contain mostly one class (as pure as possible).

x1   x2   y
3.5  2    1
5    2.5  2
1    3    1
2    4    1
4    2    1
6    6    2
2    9    2
4    9    2
5    4    1
3    8    2
Decision Trees: Numerical Example
x2 Class 1 x1 x2 y
Class 2 3.5 2 1
9 5 2.5 2
8 1 3 1
7 2 4 1
6 4 2 1
5 6 6 2
4 2 9 2
3 4 9 2
2 5 4 1
1 3 8 2
1 2 3 4 5 6 x1
Decision Trees: Numerical Example
[Figure: the same scatter plot; the tree starts from a root node containing all samples (Class: 1, 2)]
Decision Trees: Numerical Example
[Figure: the scatter plot with a horizontal split line at x2 = 5]
x2 ≤ 5?
  Yes → Class: 1, 2 (still mixed)
  No  → Class = 2 (pure leaf)
What feature (x1 or x2) should we use to split the dataset, to best separate class 1 from class 2?
Decision Trees: Numerical Example
[Figure: the scatter plot with split lines at x2 = 5 and x1 = 4.5]
x2 ≤ 5?
  Yes → x1 ≤ 4.5?
          Yes → Class = 1 (pure leaf)
          No  → Class: 1, 2 (still mixed)
  No  → Class = 2
What feature (x1 or x2) should we use to split the dataset, to best separate class 1 from class 2?
Decision Trees: Numerical Example
[Figure: the scatter plot with split lines at x2 = 5, x1 = 4.5, and x2 = 3]
x2 ≤ 5?
  Yes → x1 ≤ 4.5?
          Yes → Class = 1
          No  → x2 ≤ 3?
                  Yes → Class = 2
                  No  → Class = 1
  No  → Class = 2
All leaf nodes are now pure.
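As a sanity check, here is a minimal scikit-learn sketch that fits a decision tree to this dataset. The exact thresholds the library picks may differ slightly from the slides (e.g. 3.25 instead of 3), but the split structure is the same.

from sklearn.tree import DecisionTreeClassifier, export_text

# The ten (x1, x2) samples and their y classes from the table above
X = [[3.5, 2], [5, 2.5], [1, 3], [2, 4], [4, 2],
     [6, 6], [2, 9], [4, 9], [5, 4], [3, 8]]
y = [1, 2, 1, 1, 1, 2, 2, 2, 1, 2]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # prints the learned splits
print(tree.predict([[5, 2.5]]))  # expected: class 2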
Decision Trees: Example
Not too Somewhat Somewhat Not too Overcast High Misspelled Yes
Demand Address
Weather
Sunny Rainy
High Normal Correct Misspelled
Overcast
[2+, 3-]
Not too [3+, 2-]
[3+, 4-] [1+, 6-] [6+, 2-] [3+, 3-]
[4+, 0-] Not too
sure
Absolutely sure Not too Somewhat Somewhat Not too
sure sure sure sure sure
How to Measure Uncertainty
Gini impurity measures how mixed the classes in a node are:
Gini = 1 - Σᵢ pᵢ², where pᵢ is the fraction of samples of class i in the node.
Gini = 0 for a pure node (absolutely sure); Gini = 0.5 for a 50/50 binary node (not too sure).
[Figure: the Gini impurity curve, and the "Weather" split: Sunny [2+, 3-] (not too sure), Overcast [4+, 0-] (absolutely sure), Rainy [3+, 2-] (not too sure)]
Calculating Gini Impurity
For the "Weather" split:
Gini(Sunny [2+, 3-]) = 1 - (2/5)² - (3/5)² = 0.48 (not too sure)
Gini(Overcast [4+, 0-]) = 1 - (4/4)² - (0/4)² = 0 (absolutely sure)
Gini(Rainy [3+, 2-]) = 1 - (3/5)² - (2/5)² = 0.48 (not too sure)
Comparing against the other candidate splits (Demand: [3+, 4-] / [1+, 6-]; Address: [6+, 2-] / [3+, 3-]):
"Weather" has the highest gain of all, so we start the tree with the "Weather" feature as the root node!
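A short Python sketch of this calculation, using the branch counts from the slide ([positives, negatives] per branch):

def gini(pos, neg):
    # Gini = 1 - p_pos^2 - p_neg^2 for a two-class node
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

branches = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}
n = sum(p + q for p, q in branches.values())  # 14 samples reach the split

for name, (pos, neg) in branches.items():
    print(name, round(gini(pos, neg), 3))  # Sunny 0.48, Overcast 0.0, Rainy 0.48

# Weighted impurity of the whole "Weather" split (lower is better):
print(round(sum((p + q) / n * gini(p, q) for p, q in branches.values()), 3))  # 0.343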
Recap ID3 Algorithm
[Figure: a tree grown top-down; at each level, the feature with the highest Information Gain (given the splits above it) becomes the next node, until the leaves hold a single class]
ID3 Algorithm*: Repeat
1. Select "the best feature" to split on using Information Gain.
2. Separate the training samples according to the selected feature.
3. Stop if a node has samples from a single class, or if all features have been used, and mark it as a leaf node.
4. Assign the leaf node the majority class of the samples in it.
*To build a Decision Tree Regressor: in step 1, replace Information Gain with Standard Deviation Reduction; in step 3, stop when the numerical values are homogeneous (standard deviation is zero) or all features have been used, and mark it as a leaf node; in step 4, assign the leaf node the average value of its samples.
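To make step 1 concrete, here is a minimal sketch of entropy-based Information Gain for categorical features; the three-sample dataset at the bottom is a hypothetical stand-in, not the slides' data.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    # IG = entropy(parent) - weighted entropy of the children after the split
    children = 0.0
    for value in set(s[feature] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[feature] == value]
        children += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - children

# Hypothetical toy data with the features from the earlier example:
samples = [{"Weather": "Sunny", "Demand": "High"},
           {"Weather": "Overcast", "Demand": "High"},
           {"Weather": "Rainy", "Demand": "Normal"}]
labels = ["+", "+", "-"]
best = max(["Weather", "Demand"], key=lambda f: information_gain(samples, labels, f))
print(best)  # the feature ID3 would place at the root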
Ensemble Methods
Ensemble Learning
Ensemble methods create a strong model by combining the predictions of
multiple weak models (aka weak learners or base estimators) built with
a given dataset and a given learning algorithm.
[Figure: the predictions of multiple weak models combined into one strong model]
Bagging in sklearn:
BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, bootstrap=True)
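A minimal usage sketch on synthetic data; note that recent scikit-learn versions rename base_estimator to estimator.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # synthetic data

# Ten trees, each trained on a bootstrap sample of the full dataset
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                        n_estimators=10, max_samples=1.0, bootstrap=True)
bag.fit(X, y)
print(bag.score(X, y))  # accuracy of the combined (majority-vote) prediction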
Linear Regression
[Figure: a regression line fit through data points, with slope Δy/Δx and intercept w0]
ŷ = w0 + w1·x
• The line ŷ is defined by w0 (intercept) and w1 (slope = Δy/Δx).
• The vertical offset of each data point from the regression line is the error between the true label y and the prediction ŷ based on x.
• The "best" line minimizes the sum of the squared errors (SSE): SSE = Σᵢ (yᵢ - ŷᵢ)²
Linear Regression
• Linear regression fits the "best" line to the data:
  x: sqft_living; y: price; w0: intercept; w1: slope
For x = 6000 sqft:
ŷ = -43580.74 + 280.62 × 6000 = $1,640,139.26
Linear Regression
For multiple features (x1, x2, …, xn), the equation extends to:
ŷ = w0 + w1·x1 + w2·x2 + … + wn·xn
Example: Predict house prices (y) using multiple features: number of bedrooms (x1), square feet of living space (x2), number of bathrooms (x3), and number of floors (x4)

bedrooms  sqft_living  bathrooms  floors  price ($)
3         1180         1.00      1.0     221900.0
3         2570         2.25      2.0     538000.0
2         770          1.00      1.0     180000.0
4         1960         3.00      1.0     604000.0
..        ..           ..        ..      ..
Linear Regression
Example (continued):
Calculated regression coefficients:

w0 (intercept)  w1 (bedrooms)  w2 (sqft_living)  w3 (bathrooms)  w4 (floors)
74669.67        -57847.96      309.39            7853.52         200.497

Regression equation:
ŷ = 74669.67 - 57847.96·bedrooms + 309.39·sqft_living + 7853.52·bathrooms + 200.497·floors
Using the regression equation: assuming all other variables stay the same, increasing sqft_living by 1 square foot increases the predicted price by $309.39.
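A sketch of how such coefficients are obtained with scikit-learn, fitting on just the four table rows shown above; with so few rows the learned coefficients will not match the slide's values, which come from the full dataset.

from sklearn.linear_model import LinearRegression

# bedrooms, sqft_living, bathrooms, floors (the four rows shown above)
X = [[3, 1180, 1.00, 1.0],
     [3, 2570, 2.25, 2.0],
     [2, 770, 1.00, 1.0],
     [4, 1960, 3.00, 1.0]]
y = [221900.0, 538000.0, 180000.0, 604000.0]

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # w0 and [w1, w2, w3, w4]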
Logistic Regression
• Linear regression was useful for predicting continuous values; logistic regression instead predicts the probability p that a sample belongs to class 1.
Log-loss (Binary Cross-Entropy):
LogLoss = -(y·log(p) + (1 - y)·log(1 - p))
where y: true class {0, 1}; p: predicted probability of class 1 (i.e., P(y = 1)); log: logarithm
Example: Let's calculate the Log-Loss for the following scenarios:
[Figure: LogLoss as a function of p when y = 1]
MLA-NLP-Lecture2-Logistic-Regression.ipynb
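Since the slide's own scenarios did not survive, here is a sketch with hypothetical (y, p) pairs, using the formula above:

import math

def log_loss(y, p):
    # Binary cross-entropy for a single sample
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical scenarios for a true class y = 1:
for p in [0.9, 0.5, 0.1]:
    print(p, round(log_loss(1, p), 3))  # 0.105, 0.693, 2.303: loss grows as p drops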
Optimization
Optimization in Machine Learning
• We build and train ML models, hoping for the best possible performance.
• In reality, every trained model has some error.
• We learn better and better models, such that the overall model error gets smaller and smaller … ideally, as small as possible!
Optimization
• In ML, we use optimization to minimize an error function of the ML model.
Error function: y = f(w), where w = input (the model parameters), f = function, y = output (the error)
Optimizing the error function:
- Minimizing f(w) means finding the input w that results in the lowest value of f(w)
- Maximizing f(w) means finding the w that gives the largest f(w)
Gradient Optimization
• Gradient: the direction and rate of the fastest increase of a function.
• It can be calculated with the partial derivatives of the function with respect to each input variable in w: ∇f(w) = [∂f(w)/∂w₁, …, ∂f(w)/∂wₙ]
• Because it has a direction, the gradient is a "vector".
Gradient Example
[Figure: an example function f(w), with its gradient vector evaluated at different points]
• The sign of the gradient shows the direction in which the function increases: + to the right and - to the left.
Gradient
[Figure: following the negative gradient downhill on the error surface toward the global minimum]
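A minimal gradient-descent sketch on an assumed example function f(w) = w², stepping against the sign of the gradient until we approach the global minimum:

def f(w):
    return w ** 2          # assumed example error function

def grad(w):
    return 2 * w           # df/dw: positive right of the minimum, negative left

w, learning_rate = 5.0, 0.1
for _ in range(50):
    w -= learning_rate * grad(w)   # step opposite to the gradient (downhill)

print(w, f(w))  # w is now very close to the global minimum at w = 0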
Regularization
Regularization
Underfitting: model too simple; fewer features, smaller weights, weak learning.
Overfitting: model too complex; too many features, larger weights, weak generalization.
'Good Fit' model: a compromise between fit and complexity (drop features, reduce weights).
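One common way to push a model toward the 'good fit' regime is L2 regularization; here is a sketch comparing plain linear regression with Ridge on synthetic data (the alpha value is chosen arbitrarily for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # larger alpha = stronger weight shrinkage
print(np.abs(plain.coef_).sum(), np.abs(ridge.coef_).sum())  # ridge coefficients shrink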
MLA-NLP-Lecture2-Linear-Regression.ipynb
Hyperparameter Tuning
Hyperparameter Tuning
• Hyperparameters are ML algorithm parameters that affect the structure of the algorithms and the performance of the models.
Examples of hyperparameters:
K Nearest Neighbors: n_neighbors, metric
Decision trees: max_depth, min_samples_leaf, class_weight, criterion
Random Forest: n_estimators, max_samples
Ensemble Bagging: base_estimator, n_estimators
Grid Search in sklearn
[Figure: grid of hyperparameter 1 × hyperparameter 2 combinations]
GridSearchCV(estimator, param_grid, scoring=None)
param_grid = {'max_depth': [5, 10, 50, 100, 250],
              'min_samples_leaf': [15, 20, 25, 30, 35]}
Total hyperparameter combinations: 5 x 5 = 25
[5, 15], [5, 20], [5, 25], [10, 15], …
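A usage sketch of the grid above on synthetic data (the classifier and scoring choice are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"max_depth": [5, 10, 50, 100, 250],
              "min_samples_leaf": [15, 20, 25, 30, 35]}

# Tries all 25 combinations with cross-validation and keeps the best one
search = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)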
Randomized Search in sklearn
RandomizedSearchCV: randomized search on hyperparameters
Chooses a fixed number (given by parameter n_iter) of random combinations of
hyperparameter values and only tries those.
Can sample from distributions (sampling with replacement is used), if at least one
parameter is given as a distribution.
[Figure: random sampling of hyperparameter 1 × hyperparameter 2 combinations]
RandomizedSearchCV(estimator, param_distributions,
n_iter=10, scoring=None)
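A matching sketch for randomized search, drawing max_depth from a distribution (scipy's randint) while min_samples_leaf stays a fixed list:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
param_distributions = {"max_depth": randint(5, 250),        # sampled per iteration
                       "min_samples_leaf": [15, 20, 25, 30, 35]}

# Tries only n_iter=10 random combinations instead of the full grid
search = RandomizedSearchCV(DecisionTreeClassifier(), param_distributions,
                            n_iter=10, scoring="accuracy", random_state=0)
search.fit(X, y)
print(search.best_params_)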
MLA-NLP-Lecture2-Tree-Models.ipynb
AWS AI/ML Services
AWS SageMaker: Train and Deploy
SageMaker is an AWS service to easily build, train, tune and deploy ML
models: https://aws.amazon.com/sagemaker/
MLA-NLP-Lecture2-Sagemaker.ipynb
Amazon Comprehend
Comprehend is an AWS NLP service that allows users to gain insights from
text data and build ML models.
In this section, we will implement a custom text classifier using AWS
Comprehend.
Main steps:
1. Create the classifier.
2. Put the data into the correct format.
3. Train the classifier.
4. Make predictions (inference).
Custom Classification
Train classifier
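A boto3 sketch of the training call (step 3); the classifier name, role ARN, and S3 path are hypothetical placeholders, and the training CSV is assumed to be in Comprehend's label,text format:

import boto3

comprehend = boto3.client("comprehend")

# Kick off training of the custom classifier (runs asynchronously in AWS)
response = comprehend.create_document_classifier(
    DocumentClassifierName="my-custom-classifier",                      # hypothetical
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendRole",  # placeholder
    InputDataConfig={"S3Uri": "s3://my-bucket/train.csv"},              # label,text rows
    LanguageCode="en",
)
print(response["DocumentClassifierArn"])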