
Predictive Analytics: Supervised ML

Dr. Praisan Padungweang
School of Information Technology
KMUTT
Example data set

Columns are called features, variables, attributes, or dimensions; the column to be predicted is the label, class, target, or output. Rows are called samples, observations, or records.

Fisher's Iris Dataset

Sepal length  Sepal width  Petal length  Petal width  Species
5.0           2.0          3.5           1.0          I. versicolor
4.5           2.3          1.3           0.3          I. setosa
5.0           2.3          3.3           1.0          I. versicolor
6.2           2.8          4.8           1.8          I. virginica
6.8           2.8          4.8           1.4          I. versicolor
5.6           2.8          4.9           2.0          I. virginica
5.8           2.8          5.1           2.4          I. virginica
7.7           2.8          6.7           2.0          I. virginica
4.4           2.9          1.4           0.2          I. setosa
7.3           2.9          6.3           1.8          I. virginica
4.3           3.0          1.1           0.1          I. setosa
4.4           3.0          1.3           0.2          I. setosa
6.1           3.0          4.6           1.4          I. versicolor
6.0           3.0          4.8           1.8          I. virginica
5.9           3.2          4.8           1.8          I. versicolor
6.5           3.2          5.1           2.0          I. virginica
6.4           3.2          5.3           2.3          I. virginica
6.9           3.2          5.7           2.3          I. virginica
:             :            :             :            :

Fisher's Iris Dataset consists of 150 flowers from 3 species.
https://en.wikipedia.org/wiki/Iris_flower_data_set
Predictive Analytics
Classification Models
LOGISTIC REGRESSION, DECISION TREE, SUPPORT VECTOR MACHINE, NEURAL NETWORK
Machine learning

Supervised Learning
• Labeled data
• Direct feedback
• Predict outcome/feature

Unsupervised Learning
• No labels
• No feedback
• Find hidden structure in data

Reinforcement Learning
• Decision process
• Reward system
• Learning a series of actions
Supervised Machine Learning

Classification example: predicting diabetes
Labeled data (features: plasma glucose concentration x₁ and body mass index x₂; label: Diabetes) is fed to the ML model and learning algorithm; predictions on the labeled data are compared with the labels to provide the feedback used for learning.

id     Plasma glucose concentration (x₁)   Body mass index (x₂)   Diabetes
A001   148                                 33.6                   1
A002   85                                  26.6                   0
:      :                                   :                      :

Model:
z = 7.5156 + (-0.0352)x₁ + (-0.0763)x₂
f(z) = 1 / (1 + e^(-z))

New person: plasma glucose concentration = 100, body mass index = 30. Diabetes?
z = 7.5156 + (-0.0352)(100) + (-0.0763)(30) = 1.7066
f(z) = 1 / (1 + e^(-1.7066)) = 0.8464 => yes

(The same idea applies to image classification: labeled pictures of dogs, cats, and other objects train a model that predicts the class of a new picture.)

Regression example: predicting the EU index return

date       SP (x₁)     NIKKEI (x₂)   EU
5-Jan-09   -0.00468    0             0.012698
6-Jan-09   0.007787    0.004162      0.011341
7-Jan-09   -0.03047    0.017293      -0.01707
:          :           :             :

Model:
z = 0 + 0.6099x₁ + 0.1723x₂

Some day: SP = 0.033007, NIKKEI = 0.005594. EU?
z = 0 + 0.6099 × 0.033007 + 0.1723 × 0.005594 = 0.021094714
EU = 0.021094714

Supervised Learning
• Labeled data
• Direct feedback
• Predict outcome/feature
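A minimal sketch (not part of the slides) of the worked classification prediction above in NumPy; the weights 7.5156, -0.0352, and -0.0763 are the ones shown on the slide.

import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Weights from the slide: bias, plasma glucose, body mass index.
w0, w1, w2 = 7.5156, -0.0352, -0.0763

# New person: plasma glucose concentration = 100, body mass index = 30.
x1, x2 = 100.0, 30.0

z = w0 + w1 * x1 + w2 * x2      # 1.7066
prob = sigmoid(z)               # about 0.8464
print(f"z = {z:.4f}, P(diabetes) = {prob:.4f}, predict {'yes' if prob >= 0.5 else 'no'}")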
Unsupervised Machine Learning

Example: customer segmentation (clustering)
Training data (features only, no labels):

ID     DayMins   EveMins   NightMins   IntlMins
C001   265.1     197.4     244.7       10
C002   161.6     195.5     254.4       13.7
C003   243.4     121.2     162.6       12.2
C004   299.4     61.9      196.9       6.6
C005   166.7     148.3     186.9       10.1
:      :         :         :           :

The ML model and learning algorithm find hidden structure in the data without labels or feedback. The same idea applies to image clustering.

Model: cluster centers

Cluster     Members   DayMins   EveMins   NightMins   IntlMins
Cluster 1   34%       146.30    162.63    197.04      10.99
Cluster 2   30%       153.21    248.94    204.82      9.66
Cluster 3   35%       234.84    196.69    201.17      10.00

New person: DayMins = 124.3, EveMins = 277.1, NightMins = 250.7, IntlMins = 15.5. Cluster?
Distance to Cluster 1 = 6.38, Cluster 2 = 47.85, Cluster 3 = 51.30
=> This person is in Cluster 1.

Unsupervised Learning
• No labels
• No feedback
• Find hidden structure in data

Applications: customer segmentation, anomaly detection, image segmentation, gene micro-array clustering
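The slides do not name the clustering algorithm, but the cluster centers and distance-based assignment are consistent with k-means; below is a small illustrative sketch with scikit-learn's KMeans, using only the five rows shown above.

import numpy as np
from sklearn.cluster import KMeans

# Call-minute features from the slide: DayMins, EveMins, NightMins, IntlMins.
X = np.array([
    [265.1, 197.4, 244.7, 10.0],
    [161.6, 195.5, 254.4, 13.7],
    [243.4, 121.2, 162.6, 12.2],
    [299.4,  61.9, 196.9,  6.6],
    [166.7, 148.3, 186.9, 10.1],
])

# Fit 3 clusters (assumption: k-means; the slide only shows 3 cluster centers).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)

# Assign a new customer to the nearest cluster center.
new_customer = np.array([[124.3, 277.1, 250.7, 15.5]])
print("cluster:", kmeans.predict(new_customer)[0])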
Reinforcement Machine Learning

Reinforcement Learning
• Decision process
• Reward system
• Learning a series of actions
Supervised Machine Learning Model: Regression

Historical data (features: SP, NIKKEI; target: EU) is used as training data for model learning.

date       SP          NIKKEI      EU
5-Jan-09   -0.00468    0           0.012698
6-Jan-09   0.007787    0.004162    0.011341
7-Jan-09   -0.03047    0.017293    -0.01707
8-Jan-09   0.003391    -0.04006    -0.00556
9-Jan-09   -0.02153    -0.00447    -0.01099
:          :           :           :

Model:
z = 0 + 0.6099x₁ + 0.1723x₂

Some day: SP (x₁) = 0.033007, NIKKEI (x₂) = 0.005594. EU?
z = 0 + 0.6099 × 0.033007 + 0.1723 × 0.005594 = 0.021094714
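A hedged sketch of fitting this kind of regression model with scikit-learn's LinearRegression; fitted on only the five rows above, the learned coefficients will not match the 0.6099 and 0.1723 values on the slide, which come from the full data set.

import numpy as np
from sklearn.linear_model import LinearRegression

# Features: SP and NIKKEI returns; target: EU return (rows from the slide).
X = np.array([
    [-0.00468,  0.0],
    [ 0.007787, 0.004162],
    [-0.03047,  0.017293],
    [ 0.003391, -0.04006],
    [-0.02153,  -0.00447],
])
y = np.array([0.012698, 0.011341, -0.01707, -0.00556, -0.01099])

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_, "weights:", model.coef_)

# Predict EU for "some day" with SP = 0.033007 and NIKKEI = 0.005594.
print("EU prediction:", model.predict([[0.033007, 0.005594]])[0])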
Supervised Machine Learning Model: Classification

Historical data (features: plasma glucose concentration, body mass index; target: Diabetes) is used as training data for model learning.

id     Plasma glucose concentration (x₁)   Body mass index (x₂)   Diabetes
A001   148                                 33.6                   1
A002   85                                  26.6                   0
:      :                                   :                      :

Model:
z = 7.5156 + (-0.0352)x₁ + (-0.0763)x₂
f(z) = 1 / (1 + e^(-z))

New person: plasma glucose concentration (x₁) = 100, body mass index (x₂) = 30. Diabetes?
z = 7.5156 + (-0.0352)(100) + (-0.0763)(30) = 1.7066
f(z) = 1 / (1 + e^(-1.7066)) = 0.8464 => yes

Full dataset: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Models Training and Model Selection
¡ Held-out test data
¡ Data is divided into a training set and a test set
¡ The training set is used for model creation (training and validation)
¡ The test set is held out for model selection

The original dataset is split into a training set and a test set. The training set is further divided into a part used for model learning and a validation set; the model is initialized and trained on these. The held-out test set is then used to measure performance (accuracy, sensitivity, specificity) for model selection.
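A minimal sketch of the held-out split described above using scikit-learn's train_test_split; the 80/20 ratio, the further 75/25 split for validation, and the Iris data are illustrative choices, not taken from the slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set for model selection; train/validate on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# A further split of the training data gives a validation set for model tuning.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

print(len(X_fit), len(X_val), len(X_test))   # 90 / 30 / 30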
Supervised model training and evaluation: Training
Class A: 0-No-DR, 1-Mild
Class B: 2-Moderate, 3-Severe, 4-PDR

Supervised model training and evaluation: Validation
Class A: 0-No-DR, 1-Mild
Class B: 2-Moderate, 3-Severe, 4-PDR

Supervised model training and evaluation: Using/Evaluation (test set)
[Figure: eight held-out test images (1-8), each predicted as Class A (0-1) or Class B (2-4) and compared with its actual class]
CLASSIFICATION MODELS

Decision trees
Decision trees are recursive partitioning algorithms that come up with a tree-like structure representing patterns in an underlying data set.

Example Decision Tree
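A minimal sketch of training a decision tree on the Iris data from the first slide, using scikit-learn's DecisionTreeClassifier; the entropy criterion and the depth limit are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Recursive partitioning: each split picks a feature and a threshold.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned tree-like structure (if/else rules on the features).
print(export_text(tree, feature_names=[
    "sepal length", "sepal width", "petal length", "petal width"]))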
Decision tree decision boundaries
A decision tree's decision boundaries are orthogonal to the axes.

Customer   Age   Income   Response
John       30    1,200    Bad
Sarah      25    800      Good
Sophie     52    2,200    Good
David      48    2,000    Bad
Peter      34    1,800    Good
:          :     :        :

Decision Boundary of a Decision Tree
Splitting decision

¡ Use the concept of impurity to compute Information Gain

Impurity measures:
• Entropy => C4.5 decision trees
• Gini => CART decision trees

Example: body temperature data (class t = normal, class 1 = fever)
Temp/Class: 36.0 t, 36.3 t, 36.8 t, 37.3 t, 36.9 t, 37.0 t, 37.6 1, 38.2 1, 40.2 1, 38.6 1, 39.4 1, 37.8 1

Parent node: 6 t and 6 1 => Entropy = 1

Candidate split at Temp ≤ 37.5: children {tttttt} and {111111}, both pure => Entropy = 0 and 0
Alternative split at a higher threshold: children {tttttt111} and {111} => Entropy ≈ 0.9 and 0

Information Gain = Entropy(parent) minus the weighted average of the children's entropies; the split with the higher gain is preferred.
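A small sketch (assuming base-2 entropy, as on the slide) that reproduces the information-gain computation for the split at Temp ≤ 37.5.

import numpy as np

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

temps  = [36.0, 36.3, 36.8, 37.3, 36.9, 37.0, 37.6, 38.2, 40.2, 38.6, 39.4, 37.8]
labels = ['t',  't',  't',  't',  't',  't',  '1',  '1',  '1',  '1',  '1',  '1']

def information_gain(threshold):
    left  = [c for t, c in zip(temps, labels) if t <= threshold]
    right = [c for t, c in zip(temps, labels) if t > threshold]
    # Weighted average of the children's entropies.
    child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - child

print(information_gain(37.5))   # 1.0: a perfect split, both children are pure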
Stopping criteria
Use appropriate criteria such as:
o Tree depth
  o Maximum depth of the tree.
  o Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
o Number of data points per node
  o For a node to be split further, each of its children must receive at least this number of training instances.
o Value of information gain
  o For a node to be split further, the split must improve impurity by at least this much (in terms of information gain).
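As a rough mapping (my assumption, not stated in the slides), these stopping criteria correspond to the max_depth, min_samples_leaf, and min_impurity_decrease parameters of scikit-learn's DecisionTreeClassifier; the values below are illustrative only.

from sklearn.tree import DecisionTreeClassifier

# Illustrative values only; the slides do not prescribe specific numbers.
tree = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure (entropy, C4.5-style)
    max_depth=4,                 # stopping criterion: maximum tree depth
    min_samples_leaf=10,         # each child of a split must get at least 10 instances
    min_impurity_decrease=0.01,  # a split must improve impurity by at least this much
)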
Assignment decision
A leaf node typically assigns the majority class, or a class probability, computed from the training instances that reach it (e.g., majority class Bad vs. Good).
Decision tree applications
¡ Decision trees can be used for various purposes in analytics:
¡ Input selection
  ¡ Attributes that occur at the top of the tree are more predictive of the target.
¡ Initial segmentation
  ¡ Build a tree of two or three levels deep as the segmentation scheme,
  ¡ then use second-stage machine learning models for further refinement.
¡ Final analytical model to be used directly in production
  ¡ A decision tree gives a white-box model with a clear explanation of how it reaches its classifications.
Logistic regression/classification

Temp   Class        Temp   Predicted f(z)
36.0   t            36.0   0.0
36.3   t            36.3   0.1
36.8   t            36.8   0.2
37.3   t            37.3   0.4
36.9   t            36.9   0.2
37.0   t            37.0   0.3
37.6   1            37.6   0.5
38.2   1            38.2   0.8
40.2   1            40.2   1.0
38.6   1            38.6   0.9
39.4   1            39.4   1.0
37.8   1            37.8   0.6

z = -75 + 2·Temp
f(z) = 1 / (1 + e^(-z))

[Plot: f(z) rises from 0.0 near Temp = 36.0 to 1.0 near Temp = 40.0]
Logistic regression
Historical data (features: plasma glucose concentration, body mass index; binary target: Diabetes) is used for model learning.

id     Plasma glucose concentration   Body mass index   Diabetes
A001   148                            33.6              1
A002   85                             26.6              0
A003   183                            23.3              1
A004   89                             28.1              0
A005   137                            43.1              1
A006   116                            25.6              0
A007   78                             31                1
A008   115                            35.3              0
A009   197                            30.5              1
:      :                              :                 :

Model:
z = 7.5156 + (-0.0352)x₁ + (-0.0763)x₂
f(z) = 1 / (1 + e^(-z))

New person: plasma glucose concentration = 100, body mass index = 30. Diabetes?
z = 7.5156 + (-0.0352)(100) + (-0.0763)(30) = 1.7066
f(z) = 1 / (1 + e^(-1.7066)) = 0.8464 => yes

Full dataset: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
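A minimal sketch of fitting a logistic regression with scikit-learn's LogisticRegression on the nine rows above; the coefficients learned from these few rows will differ from the 7.5156 / -0.0352 / -0.0763 values on the slide, which come from the full Pima data set.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows from the slide: plasma glucose concentration, body mass index -> diabetes.
X = np.array([[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1],
              [116, 25.6], [78, 31.0], [115, 35.3], [197, 30.5]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# Probability of diabetes for the new person (glucose = 100, BMI = 30).
print(model.predict_proba([[100, 30]])[0, 1])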
Model decision boundaries

Logistic classification:
f(z) = 1 / (1 + e^(-(w₀ + w₁·Age + w₂·Income)))
gives a single linear decision boundary.

Decision trees give decision boundaries orthogonal to the axes.
Support Vector Machines

¡ A nonlinear SVM classifier will first map the input data to a higher-dimensional feature space using some mapping (kernel methods).

The Feature Space Mapping

https://www.youtube.com/watch?v=3liCbRZPrZA
Support Vector Machines

[Figure: input data in (x₁, x₂) that is not linearly separable is mapped, e.g. with a radial basis function (RBF) kernel, into a higher-dimensional space (x₁, x₂, x₃) where the classes can be separated]
Support Vector Machines

¡ Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data could become more easily separated or better structured.
¡ Different types of kernel functions can be used. The most popular are:
  ¡ Linear kernel
  ¡ Polynomial kernel
  ¡ Radial basis function (RBF) kernel

Empirical evidence has shown that the RBF kernel usually performs best, but note that it includes an extra parameter σ to be tuned.
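A minimal sketch comparing the three kernels with scikit-learn's SVC on the Iris data. Note that scikit-learn parameterizes the RBF kernel as K(x, x') = exp(-gamma * ||x - x'||^2), so gamma plays the role of 1/(2σ²); the gamma and degree values below are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Compare the three popular kernels; gamma/degree values are illustrative.
for clf in (SVC(kernel="linear"),
            SVC(kernel="poly", degree=3),
            SVC(kernel="rbf", gamma=0.5)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(clf.kernel, scores.mean())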
Neural Networks

Neural networks: the perceptron

A perceptron for the diabetes data takes the inputs plasma glucose concentration and body mass index (plus a constant input 1 for the bias), multiplies each by a weight, sums the results, and applies a step activation:

f(z) = 1 if z ≥ 0, -1 otherwise

Weights
w₀ (bias/intercept, on the constant input 1): 7.5156
w₁ (plasma glucose concentration): -0.0352
w₂ (body mass index): -0.0763

Training data (id, plasma glucose concentration, body mass index, Diabetes):
A001 148 33.6 1; A002 85 26.6 0; A003 183 23.3 1; A004 89 28.1 0; A005 137 43.1 1; A006 116 25.6 0; A007 78 31 1; A008 115 35.3 0; A009 197 30.5 1; ...
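A small sketch of the perceptron computation above: a weighted sum of the inputs plus the bias, followed by the step activation, using the weights shown on the slide.

import numpy as np

def step(z):
    """Perceptron activation: +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

w0, w1, w2 = 7.5156, -0.0352, -0.0763   # bias (intercept) and the two weights

def perceptron(glucose, bmi):
    z = w0 + w1 * glucose + w2 * bmi    # weighted sum plus bias
    return step(z)

print(perceptron(100, 30))   # new person from the earlier slides: z = 1.7066 >= 0, output +1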
Neural networks
¡ Multi-Layer Perceptron (MLP)

Layer 1 = Input Layer, Layer 2 = Hidden Layer, Layer 3 = Output Layer
The hidden layer of MLP
¡ Multi-Layer Perceptron (MLP)
The hidden layers map the input layer into a new feature space.

[Figure: inputs x feed hidden units h₀ ... h₄; Layer 1 = Input Layer, Layer 2 = Hidden Layer, Layer 3 = Output Layer]
Neural networks
Each node has a transformation function f(·), also called an activation function. The most popular activation functions are:
§ Linear
  § ranging between -∞ and +∞; f(z) = z
§ Sigmoid (logistic)
  § ranging between 0 and 1; f(z) = 1 / (1 + e^(-z))
§ Hyperbolic tangent
  § ranging between -1 and +1; f(z) = (e^z - e^(-z)) / (e^z + e^(-z))
§ Rectified linear unit (ReLU)
  § ranging between 0 and +∞; f(z) = 0 for z < 0, z for z ≥ 0
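The four activation functions above, written out as a minimal NumPy sketch.

import numpy as np

def linear(z):
    return z                                # range (-inf, +inf)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # range (0, 1)

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))   # range (-1, 1)

def relu(z):
    return np.maximum(0, z)                 # 0 for z < 0, z for z >= 0

z = np.linspace(-3, 3, 7)
for f in (linear, sigmoid, tanh, relu):
    print(f.__name__, np.round(f(z), 3))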
NN-model example

Neural networks: example 1

[Figure: a small neural network with inputs plasma glucose concentration and body mass index (plus a constant input 1), trained on the diabetes records (A001 148 33.6, A002 85 26.6, A003 183 23.3, A004 89 28.1, A005 137 43.1, A006 116 25.6, A007 78 31, A008 115 35.3, A009 197 30.5, ...) to predict Diabetes]
Selecting activation function

¡ Hidden layer: logistic, hyperbolic tangent, linear, ReLU

¡ Output layer
  ¡ For classification (e.g., churn, response, fraud),
    ¡ it is common practice to adopt a logistic transformation in the output layer, since the outputs can then be interpreted as probabilities.
  ¡ For regression targets
    ¡ linear
    ¡ linear, logistic, or hyperbolic tangent for a normalized target
Model training and evaluation strategy

Held-out test data
¡ Data is divided into a training set and a test set
¡ The training set is used for model creation (training and validation)
¡ The test set is held out for model selection

Model comparison: the original dataset is split into a training set and a test set; candidate models are initialized and trained on the training set, and their test performance (accuracy, F-measure, AUC) on the held-out test set determines the selected model.
Model training and evaluation strategy

Hold-out vs. cross-validation for model comparison

¡ K-fold cross-validation

With 5 folds, the training data (the original dataset minus the held-out test set) is split into 5 parts. In each of 5 rounds, one fold serves as the validation set and the remaining 4 folds form the training set. The mean and standard deviation of performance (accuracy, F-measure, AUC) over the rounds are reported.

Suitable for selecting models and parameters.
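A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score, reporting the mean and standard deviation of accuracy as described above; the decision-tree model and the Iris data are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold serves once as the validation set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=5, scoring="accuracy")
print(f"accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")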
Overfitting
An overfit model learns to fit the training data too well:
¡ the predictions are very accurate on the training data,
¡ but fail on unknown data.
Overfitting and underfitting

[Figure: prediction performance (0-100) on the training set vs. the test set for three cases: underfitting/bias (low on both), a suitable model (high on both), and overfitting/variance (high on the training set but much lower on the test set)]
Classification Model Evaluation

Model evaluation for binary classes: a Good/Bad model scores each customer, and the score is thresholded at 0.5 to obtain the predicted class.

Customer   Age   Income   Gender   ...   Response   Score   Predicted
John       30    1,200    M              Bad        0.51    Good
Sarah      25    800      F              Good       0.56    Good
Sophie     52    2,200    F              Good       0.72    Good
David      48    2,000    M              Bad        0.18    Bad
Peter      34    1,800    M              Good       0.36    Bad

Confusion Matrix
                 Predicted Good   Predicted Bad
Actual Good      2                1
Actual Bad       1                1

Accuracy = (2 + 1) / 5 = 0.6
Model evaluation

Model evaluation for multiple classes

The confusion matrix has one row per actual class (A₁ ... A_k) and one column per predicted class (P₁ ... P_k); the cell AᵢPⱼ counts instances whose actual class is i and predicted class is j. The diagonal cells (A₁P₁, A₂P₂, ..., A_kP_k) are the correct predictions (✓); all off-diagonal cells are errors (✗).

Accuracy = (A₁P₁ + A₂P₂ + A₃P₃ + ... + A_kP_k) / N, where N is the total number of instances.
Evaluation Metrics

Confusion matrix for a binary problem:
                  Predicted T                           Predicted F
Actual T          True Positive (TP)                    False Negative (FN) (Type II error)
Actual F          False Positive (FP) (Type I error)    True Negative (TN)

True positive rate, Sensitivity, Recall = TP / (TP + FN)
True negative rate, Specificity = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + FN + FP + TN)
Positive predictive value, Precision = TP / (TP + FP)
F1-score = (2 × Precision × Recall) / (Precision + Recall)
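A minimal sketch computing the metrics above directly from TP/FN/FP/TN counts; the counts used in the example call are made up for illustration.

def classification_metrics(tp, fn, fp, tn):
    recall      = tp / (tp + fn)                   # true positive rate / sensitivity
    specificity = tn / (fp + tn)                   # true negative rate
    precision   = tp / (tp + fp)                   # positive predictive value
    accuracy    = (tp + tn) / (tp + fn + fp + tn)
    f1          = 2 * precision * recall / (precision + recall)
    return recall, specificity, precision, accuracy, f1

# Illustrative counts only.
print(classification_metrics(tp=40, fn=10, fp=5, tn=45))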
Expected Value for Model Evaluation: targeted marketing

Cost-benefit matrix (rows = actual, columns = predicted):
            Predicted R   Predicted N
Actual R    99            0
Actual N    -1            0

Model 1 (Acc = 82.5%)                        Model 2 (Acc = 85%)
            Pred R   Pred N                              Pred R   Pred N
Actual R    150      150                     Actual R    0        300
Actual N    200      1500                    Actual N    0        1700

Dividing each count by 2,000 gives probabilities:
            Pred R   Pred N                              Pred R   Pred N
Actual R    0.075    0.075                   Actual R    0        0.15
Actual N    0.1      0.75                    Actual N    0        0.85

Expected value = Σ (cell probability × cell cost-benefit):
Model 1: 0.075×99 + 0.075×0 + 0.1×(-1) + 0.75×0 = 7.425 - 0.1 = 7.325
Model 2: 0×99 + 0.15×0 + 0×(-1) + 0.85×0 = 0

Model 1 has lower accuracy but a far higher expected value.
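A short sketch reproducing the expected-value calculation above: convert each confusion matrix to probabilities and sum probability times cost-benefit over the cells.

import numpy as np

# Cost-benefit matrix from the slide: rows = actual (R, N), cols = predicted (R, N).
cost_benefit = np.array([[99, 0],
                         [-1, 0]])

# Confusion matrices of the two models (counts over 2,000 customers).
model_1 = np.array([[150, 150],
                    [200, 1500]])
model_2 = np.array([[0, 300],
                    [0, 1700]])

for name, cm in (("model 1", model_1), ("model 2", model_2)):
    p = cm / cm.sum()                        # convert counts to probabilities
    expected_value = (p * cost_benefit).sum()
    print(name, "accuracy =", np.trace(cm) / cm.sum(),
          "expected value =", expected_value)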
REGRESSION

Linear regression
Linear regression is a baseline modeling technique to model a continuous target variable.

BMI     Estimated percentage of body fat
23.63   12.3
23.33   6.1
24.67   25.3
24.88   10.4
25.52   28.7
26.46   20.9
:       :

Regression problem: predict a real-valued output (estimated percentage of body fat) from BMI (weight (kg) / height (m²)).

[Scatter plot: body fat (0-60) against BMI (15-50)]
How Machine Learning Works
Most machine learning techniques learn by minimizing a loss.

Example: regression problem, estimated percentage of body fat vs. BMI

Fitted line: Fat = y' = 1.8454·BMI - 27.642

For a person with BMI = 37.59:
y = real fat = 47.5
y' = predicted fat = 41.73
Error (e) = y - y' = 47.5 - 41.73 = 5.77

Cost/loss: SSE = e₁² + e₂² + e₃² + ... + e_n²
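A small sketch of the loss computation above: predictions from the fitted line, per-point errors e = y - y', and the sum of squared errors; the BMI/fat pairs are the rows shown earlier plus the worked-example point.

import numpy as np

# Fitted line from the slide: Fat = 1.8454 * BMI - 27.642.
def predict(bmi):
    return 1.8454 * bmi - 27.642

# BMI / measured body-fat pairs (earlier table rows plus the worked example).
bmi = np.array([23.63, 23.33, 24.67, 24.88, 25.52, 26.46, 37.59])
fat = np.array([12.3,   6.1,  25.3,  10.4,  28.7,  20.9,  47.5])

errors = fat - predict(bmi)                # e = y - y'
sse = np.sum(errors ** 2)                  # SSE = e1^2 + e2^2 + ... + en^2
print("error for BMI 37.59:", errors[-1])  # about 5.77
print("SSE:", sse)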
The Regression

Three examples of fitting a linear model (feature(s) → target):

(1) House prices
Size in feet² (x)   Price ($) in 1000's (y)
2104                460
1416                232
1534                315
852                 178
:                   :
Price = 134.53·Size + 71270, R² = 0.731

(2) Body fat
BMI     Estimated percentage of body fat
23.63   12.3
23.33   6.1
24.67   25.3
24.88   10.4
25.52   28.7
26.46   20.9
:       :
Fat = y' = 1.8454·BMI - 27.642, R² = 0.5547

(3) Istanbul Stock Exchange index (ISE) from other indices
SP         FTSE       NIKKEI     EU         ISE
-0.00468   0.003894   0          0.012698   0.038376
0.007787   0.012866   0.004162   0.011341   0.031813
-0.03047   -0.02873   0.017293   -0.01707   -0.02635
0.003391   -0.00047   -0.04006   -0.00556   -0.08472
-0.02153   -0.01271   -0.00447   -0.01099   0.009658
-0.02282   -0.00503   -0.04904   -0.01245   -0.04236
0.001757   -0.00614   0          -0.01222   -0.00027
-0.03403   -0.05095   0.002912   -0.04522   -0.03555
:          :          :          :          :
ISE = -0.033·SP - 0.0616·FTSE + 0.3021·NIKKEI + 1.1068·EU + 0.001, R² = 0.519849
Beyond Linear regression
Polynomial regression
h_w(x) = w₀ + w₁x₁ + w₂x₁²

Example: estimated percentage of body fat vs. BMI
Fat = -46.149 + 3.2675·BMI - 0.0268·BMI²
       (w₀)     (w₁x₁)       (w₂x₁²)

[Plot: body fat (0-50) against BMI (15-50) with the fitted quadratic curve]
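A minimal sketch of degree-2 polynomial regression with NumPy's polyfit on the BMI/fat rows shown earlier; with only six points the fitted coefficients will not match the -46.149 / 3.2675 / -0.0268 values on the slide, which come from the full data set.

import numpy as np

# BMI and estimated body-fat percentage (rows from the earlier slides).
bmi = np.array([23.63, 23.33, 24.67, 24.88, 25.52, 26.46])
fat = np.array([12.3, 6.1, 25.3, 10.4, 28.7, 20.9])

# Fit Fat = w2*BMI^2 + w1*BMI + w0 (degree-2 polynomial regression).
w2, w1, w0 = np.polyfit(bmi, fat, deg=2)
print("fitted:", w0, w1, w2)

# Predict body fat for BMI = 30 with the fitted curve.
print("prediction at BMI 30:", np.polyval([w2, w1, w0], 30))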
Beyond Linear regression

Polynomial regression
Price = (4/100000)·Day³ - (2.3/100)·Day² + 4.8·Day + 900

Model fitted to the FTSE SET All-Share Index, 10 Jun 2010 to 2 Sep 2011.
On 7 Sep 2011: Day = 305 → Price = 1359.33
Effect of outlier

[Two plots of body fat against BMI, with one extreme outlier at a very large BMI:
Linear fit: y = 0.3252x + 10.716
Quadratic fit: y = -0.0118x² + 2.3381x - 32.526
A single outlier can pull the fitted line or curve away from the bulk of the data.]
