
Machine Learning Series

Machine Learning with Python 3
By Dr. Nickholas Anting
Learning Outcome

At the end of this section, the attendees will be able to:

• Understand the fundamental concepts of machine learning algorithms in data science applications.

• Train ML algorithms, including regression, classification and clustering algorithms, to build predictive models.

• Evaluate the performance of each ML model.


Machine Learning Series

Part 1
Introduction to Machine Learning

By Dr. Nickholas Anting


Concepts of Machine Learning
An algorithm is trained on historical data to generate a machine learning model, which is then used to predict future outcomes.

[Diagram: Historical Data → Training (ML Algorithm) → Produce → Model]
Conventional Processing
• The programmer specifies a set of rules that are pre-defined from existing conditions.

Copyright by Dr. Nickholas 5


Processing with Machine Learning
• The algorithm learns from data to define the rules of the generated model.
• The algorithm learns from purchase history data to understand the user's purchasing behavior.

[Diagram: Historical Transactional Data → Train ML Algorithm to GENERATE → ML Model]
[Diagram: NEW_INPUT → Apply ML Model → Fraud transaction? NO: Complete payment transaction process. YES: Cancel payment.]
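The contrast between conventional processing and processing with machine learning can be sketched in a few lines of plain Python. The transaction amounts, the labels and the hand-written limit of 1000 are all hypothetical, and the "learning" step is a deliberately simple threshold search, not the algorithm used in this course.

```python
# Hypothetical labelled history: (transaction amount, is_fraud)
history = [(20, 0), (35, 0), (50, 0), (80, 0), (900, 1), (1200, 1), (1500, 1)]


def rule_based(amount):
    """Conventional processing: the programmer hard-codes the rule."""
    return amount > 1000  # limit chosen by hand (assumption)


def learn_threshold(data):
    """ML-style processing: pick the threshold that best fits the history."""
    best_t, best_correct = None, -1
    for t in sorted(amount for amount, _ in data):
        # Count how many historical rows the rule "amount >= t" gets right.
        correct = sum((amount >= t) == bool(label) for amount, label in data)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


threshold = learn_threshold(history)  # the rule is derived from the data


def learned_model(amount):
    """Apply the learned rule to a new input."""
    return amount >= threshold
```

With this toy history the learned threshold is 900, so a new transaction of 950 is flagged by the learned rule but not by the hand-written one.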


What can Machine Learning do?

Who will buy more?

Forecasting
[Figure: Price plotted against Size]


Fraud Detection



Optimization
• Process Workflow
• Engineering Design
Recommendation System

[Diagram: Watched Movies → Machine Learning → Recommended to watch]
Types of Machine Learning
SUPERVISED LEARNING

• Samples in the dataset are labelled.

Car Price Dataset (Price is the label):

HP    MetColor    FuelType    Price
90    1           Diesel      13500
90    1           Diesel      13750
90    1           Diesel      13950

[Diagram labels: New Inputs; Linear Regression; Support Vector Machine]


UNSUPERVISED LEARNING

• The dataset comprises unlabelled samples.

[Diagram labels: Customer Behaviour; K-Means Clustering; Apriori Algorithm]


REINFORCEMENT LEARNING

• A rewards system – instead of minimizing the error, the model maximizes the objective function.
• The model becomes smarter through rewards.

[Diagram labels: Neural Network; Deep Neural Network]


Machine Learning Algorithm
Regression Model (Supervised Learning)
• Predicts a numerical output/response (e.g. a car's price).
• Has a response.
• Linear regression using the OLS algorithm.

Classification Model (Supervised Learning)
• Predicts the output in the form of a class (Buy or Not Buy).
• Logistic regression, Naive Bayes, ...

Clustering Model (Unsupervised Learning)
• Predicts patterns.
• No exact response.
• K-Means, DBSCAN, ...
How do I start with Machine Learning?

Understanding the requirements → Prepare Data → Algorithm Selection → Train the Algorithm → Performance Evaluation
• OK: Saved Model
• NOT OK: New parameters / New algorithm


Building the Machine Learning Model – The Workflow

Exploration & Data Preparation
• Import dataset into repository
• Data Exploration & Observation
• Data Cleaning

Data Pre-Processing
• Feature Selection
• Handling Categorical Variables
• Assigned Input & Output Attributes
• Features/Inputs Scaling

Train & Apply Model
• Partitioning: Train – 80%, Test – 20%
• Training Algorithm
• ML Model
• Apply model

Model Evaluation
• Performance

ML Workflow


Things Needed for Machine Learning

Data

Training algorithm
Programming Tools



APPLICATION OF MACHINE LEARNING
Machine learning models are applied for predictive analytics, whose results are then used in prescriptive modelling.

• COMPUTATIONAL FINANCE: Credit scoring and algorithmic trading.
• IMAGE PROCESSING: Face recognition, motion & object detection.
• COMPUTATIONAL BIOLOGY: Drug discovery, DNA sequencing, etc.
• ENERGY PRODUCTION: Price & load forecasting.
• MANUFACTURING & PRODUCTION: Predictive maintenance.
• NATURAL LANGUAGE PROCESSING: Voice recognition, text processing.


Python Packages for Machine Learning



Data Science Professional Certification

Part 4
Linear Regression Model
By Dr. Nickholas Anting
LINEAR REGRESSION MODEL
• A linear approximation of the relationship between two or more variables.

• Mathematically models the relationship between the dependent and independent variables.

• The output responses are NUMERIC values.

[Figure: Price plotted against Size]
Simple Linear Regression
• A simple relationship between one input and one output variable.
• Only one input is used to predict the output.
• The y-axis represents the dependent variable, while the x-axis represents the independent variable.
• Each data point is an observed value, plotted on the graph of y against x.
• The line is drawn based on the regression equation.
Ordinary Least Squares (OLS)
• The most common method for estimating the linear regression equation.
• "Least squares" stands for the minimum squared error.
• A lower error results in better explanatory power of the regression model.
• The method aims to find the line that minimizes the sum of the squared errors.
• Many lines can be drawn to fit the data; OLS determines the one with the smallest error.
• This best-fitted line is the one that is closest to all the data points.

[Figure: the best-fitted line, with the smallest error, drawn through the data points on a graph of y against x]
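As a minimal sketch of what OLS computes for a single input, the closed-form least-squares solution can be written in plain Python. The function names and the five data points are hypothetical and only illustrate the idea; they are not the course dataset or the course's own code.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares solution for one input variable.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: size (x) and price (y)
size = [1, 2, 3, 4, 5]
price = [2, 4, 5, 4, 5]

a, b = fit_simple_ols(size, price)                # intercept a, slope b
best_error = sum_squared_error(size, price, a, b)

# Any other line, e.g. intercept 2.0 and slope 0.7, has a larger error.
other_error = sum_squared_error(size, price, 2.0, 0.7)
```

For this toy data the fitted line is y = 2.2 + 0.6x with a sum of squared errors of 2.4, against 2.55 for the alternative line.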
Performance Metrics: R-Squared (R²)

• A measure widely used to describe how powerful/good the regression model is.
• R-squared measures the goodness of fit of the regression model.
• The values of R-squared range from 0 (0%) to 1 (100%).
• R² = 1 (100%) means that the model explains the entire variability of the data.
• R² = 0 (0%) means that the model explains none of the variability of the data.
• Observed values of R² usually range between 0.2 (20%) and 0.9 (90%).

[Figure: scale from 0 to 1 with an example value R² = 0.70, and three scatter plots of y against x]
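The two ends of the R-squared scale can be checked with a short sketch. It uses the standard definition R² = 1 − SS_res / SS_tot, which is not spelled out on the slides, and hypothetical numbers.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and three sets of predictions
actual = [2, 4, 5, 4, 5]
perfect = [2, 4, 5, 4, 5]            # explains all variability
mean_only = [4, 4, 4, 4, 4]          # always predicts the mean
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]   # predictions from a fitted line

r2_perfect = r_squared(actual, perfect)
r2_mean_only = r_squared(actual, mean_only)
r2_fitted = r_squared(actual, fitted)
```

Here the perfect predictions give 1, predicting the mean everywhere gives 0, and the fitted line gives 0.6.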
Hands – On 2
Create a machine learning model to predict Price based on Size.

Input/Feature/Predictor – Size
Output/Response – Price
Dataset – real_estate.csv

[Figure: Price (y) plotted against Size (x)]
Understanding the requirement & dataset

• Build an ML model to predict the house price, Price, based on Size.
• The response or output variable is Price.
• Price is a numerical output.


Machine Learning Workflow

[Flowchart: DATA PREPARATION → DATA TRANSFORMATION → PARTITIONING → Train Set → TRAIN ALGORITHM → Generate ML Model; Test Set → APPLY MODEL → PERFORMANCE METRICS → Performance OK? YES: SAVE MODEL. NO: back to TRAIN ALGORITHM. APPLY MODEL and PERFORMANCE METRICS together form MODEL EVALUATION.]

GENERAL PROCESS OF MACHINE LEARNING MODEL
DATA PREPARATION
The process of importing, reviewing and cleaning the raw dataset.

Import dataset → Data exploration → Data Cleaning
Import Dataset

CSV, Excel or Text file → Import → Pandas data frame
Data Exploration

Descriptive Statistics

Missing Values

Duplicated Rows

Checking & Treating Outliers
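The exploration checks listed above can be illustrated without any external package. The sketch below uses only the standard library and a small in-memory CSV whose column names and values are hypothetical. With a pandas data frame, the same checks are typically done with describe(), isnull().sum() and duplicated().sum().

```python
import csv
import io
import statistics

# Hypothetical stand-in for a small CSV file; the real column names may differ.
raw = """size,price
650,230000
800,285000
800,285000
,310000
1200,402000
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Missing values: count the empty cells in each column.
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}

# Duplicated rows: count rows whose full content was already seen.
seen = set()
duplicates = 0
for r in rows:
    key = tuple(r.values())
    if key in seen:
        duplicates += 1
    seen.add(key)

# Descriptive statistics on the non-missing sizes.
sizes = [float(r["size"]) for r in rows if r["size"] != ""]
summary = {
    "count": len(sizes),
    "mean": statistics.mean(sizes),
    "min": min(sizes),
    "max": max(sizes),
}
```

In this toy file one size value is missing and one row is duplicated, which is the kind of issue the data cleaning step would then treat.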
DATA PRE-PROCESSING
A series of final steps to get the dataset ready for further processing.

Features Selection → Handling Categorical variable (if any) → Assign independent and dependent variables → Feature scaling → Dataset Partitioning
Assigned Input & Output Attributes
• Assign the output (response) and input (features) variables.

INPUT, x OUTPUT, y

B. Partitioning Dataset

• The train set is used to train the algorithm.
• The test set is used to validate the model.
• Use random state = 0.

Overall Data 100% (100) → Partitioning/Splitting → Train Data 80% (80) and Test Data 20% (20)
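A minimal sketch of the 80/20 partitioning with a fixed seed, written with the standard library only. The function name is hypothetical. The slide's "random state = 0" presumably corresponds to the random_state parameter of scikit-learn's train_test_split, but that is an assumption about the hands-on code.

```python
import random


def train_test_split_indices(n_rows, test_ratio=0.2, seed=0):
    """Shuffle the row indices reproducibly and cut them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_test = round(n_rows * test_ratio)
    return indices[n_test:], indices[:n_test]


# 100 rows, as in the slide: 80 go to the train set and 20 to the test set.
train_idx, test_idx = train_test_split_indices(100)

# Running the split again with the same seed gives the same partition.
train_again, test_again = train_test_split_indices(100)
```

Fixing the seed makes the split repeatable, so the performance metrics of different runs can be compared on the same test rows.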
Now the Data is Ready

TRAINING ALGORITHM

• Train the OLS method from the statsmodels package.

Use data from Train Set → Train OLS Algorithm for MLR → Generate MLR Model


MODEL PERFORMANCE VALIDATION

Method A
R-Squared

Method B
Correlation between Predicted Output and Actual Output
using Test set

Use data from Test Set → Apply MLR Model → Predict → Predicted Output
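Method B can be sketched as a Pearson correlation between the actual and the predicted outputs of the test set. The price values below are hypothetical, and the slides do not state which correlation coefficient is meant, so Pearson is an assumption here.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical test-set values: actual prices and the model's predictions
actual_price = [200, 250, 300, 410]
predicted_price = [210, 240, 310, 400]

r = pearson_correlation(actual_price, predicted_price)
```

A value close to 1 means the predictions rise and fall together with the actual values; a value near 0 means the model's output carries little information about the real output.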
MULTIPLE LINEAR REGRESSION

Two or more independent variables are used to predict the value of the dependent variable.

It can provide a better model and address the higher complexity of the problem.

The more variables, the more factors the model takes into account.

It is no longer about fitting a line: the model stops being 2D and cannot be visualized in a single 2D graph.

IT IS ABOUT THE BEST FITTING MODEL
Example – House Pricing

• The price of a house can depend on more than one factor.

• These factors (independent variables) include, for example, the area and the location.

• With more than one independent variable, this is considered Multiple Linear Regression.

[Figure: House Pricing plotted against Area and Year (2008, 2010, ...)]
R² > Adjusted R²

R-squared
• Measures how much of the total variability is explained by the model.
• In terms of R-squared, a multiple regression never scores worse than a simple regression on the same data: adding a variable can only keep or increase the explanatory power as measured by R-squared.

Adjusted R-squared
• Penalizes excessive use of variables.
• Compares the explanatory power of regression models that contain different numbers of predictors.
Example

                      Eqn. 1    Eqn. 2 (add new variable)
R-Squared             0.406     0.407
Adjusted R-Squared    0.399     0.392
p-value               0.000     0.000, 0.762
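The penalty applied by adjusted R-squared can be made concrete with its standard formula, 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of samples and p the number of predictors. The slides do not give the sample size, so n = 84 below is an assumed value chosen only for illustration; under that assumption the formula reproduces the rounded figures in the example.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-squared: R-squared penalized by the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


n = 84  # hypothetical sample size, not stated on the slides

eqn1 = adjusted_r_squared(0.406, n, n_predictors=1)  # one variable
eqn2 = adjusted_r_squared(0.407, n, n_predictors=2)  # new variable added
```

R-squared rises slightly from 0.406 to 0.407 when the variable is added, but the adjusted value falls from about 0.399 to about 0.392, which suggests the new variable adds little explanatory power.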
Hands – On 2
Create a machine learning model to predict Price based on Size and Year.

Input/Feature/Predictor – Size, Year
Output/Response – Price
Dataset – real_estate.csv
FEATURES/INPUTS SCALING
[Figure: scaling applied to the size and year features]
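One common way to scale features is standardization, which rescales each column to mean 0 and standard deviation 1. The slides do not say which scaling method is used, so standardization is an assumption here, and the size and year values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales
size = [650.0, 800.0, 1200.0, 1500.0]
year = [2006, 2009, 2015, 2018]

size_scaled = standardize(size)
year_scaled = standardize(year)
```

After scaling, both features are on a comparable scale, so neither dominates the model simply because of its units.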
TRAINING ALGORITHM
• Train the OLS method from the statsmodels package.

Use data from Train Set → Train OLS Algorithm for MLR → Generate MLR Model


APPLY MODEL USING TEST SET

Use data from Test Set → Apply MLR Model → Predict → Predicted Output
[Figure: model results using size only compared with size & year]

• Adding year has improved the predictive power.
Hands – On 3: Data with Categorical Variable
Create a machine learning model to predict the GPA score based on
SAT score and Attendance.

Input/Feature/Predictor – SAT, Attendance

Output/Response – GPA

Handling Categorical Variable
Transform into Dummy Variable
• A dummy variable is used to include categorical data in the model.
• It transforms non-numeric (categorical) data into numeric form.
• NOMINAL to NUMERICAL – use a one-to-many transformation (one new column per category).

Attendance    Attendance_Yes    Attendance_No
Yes           1                 0
No            0                 1
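The transformation in the table above can be sketched in plain Python. The helper name to_dummies is hypothetical. With a pandas data frame, the get_dummies function performs the same kind of transformation.

```python
def to_dummies(values, prefix):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical Attendance column
attendance = ["Yes", "No", "Yes"]
dummies = to_dummies(attendance, "Attendance")

# For a two-class variable, one column already carries all the information,
# so a single dummy (here Attendance_Yes) is enough as a model input.
attendance_yes = [row["Attendance_Yes"] for row in dummies]
```

In a regression model with an intercept, one of the dummy columns is usually dropped because it is redundant: Attendance_No is always 1 minus Attendance_Yes.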
Data Science Professional Certification

Classification Model
Logistic Regression Model

By Dr. Nickholas Anting


Classification Model
• Classification – the process of categorizing the output into categorical classes.
• Classifier – a classification algorithm that is trained on data to predict the output classes.

Examples of output classes:
• BUY or NOT BUY
• STAY or CHURN
• TYPE A or TYPE B or TYPE C
Classification vs Regression

• Regression Problem: Continuous numerical Output
• Classification Problem: Categorical Output


Examples of Classification Problem

• Churn Prediction: STAY or CHURN
• Fraud Detection: LEGIT or FRAUD
• Customer Decision: BUY or NOT BUY


Classification Algorithm

• Logistic Regression
• Naive Bayes
• K-Nearest Neighbors
• Support Vector Machine
• Decision Tree
• Random Forest

LOGISTIC REGRESSION MODEL
• Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes.

• It is a supervised classification algorithm: the model builds a regression model to predict the probability that an event occurs.

• It produces a binary result that is used to predict the outcome of a categorical dependent variable.

[Diagram: Inputs → Logistic Regression Model → Probability → p >= 0.5? Yes: Class 1. No: Class 2. → Output]
Linear vs Logistic Regression

Model
• Linear Regression: data is modelled using a straight line.
• Logistic Regression: the log-odds of the event are represented as a linear function of a combination of predictor variables; data is modelled using the sigmoid function.

Output Type
• Linear Regression: continuous numeric variable.
• Logistic Regression: categorical variable.

Prediction
• Linear Regression: the value of the variable.
• Logistic Regression: the probability that the event occurs.

Accuracy & Goodness of Fit
• Linear Regression: R-Squared, Adjusted R-Squared.
• Logistic Regression: Accuracy, Precision, Recall.
Logistic Regression Curve

[Figure: sigmoid curve for inputs from -6 to 6, rising from 0 to 1, with the threshold at 0.5]

• p ≥ 0.5 → class 1
• p < 0.5 → class 0
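The curve and the 0.5 threshold can be reproduced with a few lines of Python. This is a sketch of the sigmoid function and the thresholding step only, not of how the model's coefficients are trained; the input z stands for the linear combination of the predictors.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return class 1 if the probability reaches the threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0


# Probabilities at the left edge, centre and right edge of the plotted range
p_left, p_centre, p_right = sigmoid(-6), sigmoid(0), sigmoid(6)
```

At z = 0 the probability is exactly 0.5, so the observation is assigned to class 1; towards -6 the probability approaches 0 and towards 6 it approaches 1.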
Model Accuracy/Performance Assessment
The Confusion Matrix
A table used to describe the performance of a classification model, such as Logistic Regression.

            Predicted 1    Predicted 0
Actual 1    69             5
Actual 0    4              90

The model did its job well:
• For 69 observations, the model correctly predicted 1 and the actual value was 1.
• For 90 observations, the model correctly predicted 0 and the actual value was 0.

The model was confused:
• For 4 observations, the model predicted 1 but the actual value was 0.
• For 5 observations, the model predicted 0 but the actual value was 1.
Model Accuracy
A measure of how accurate the logistic regression model is, computed from the confusion matrix.

            Predicted 1    Predicted 0
Actual 1    TP = 69        FN = 5
Actual 0    FP = 4         TN = 90

• Overall, the logistic model predicts the output with about 94.6% accuracy: (TP + TN) / total = (69 + 90) / 168.
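As a quick check of the figures above, the sketch below rebuilds label lists that reproduce the slide's confusion matrix, counts the four cells and computes the accuracy. The function name is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fn, fp, tn


# Label lists constructed to match the slide's matrix (69, 5, 4, 90)
actual = [1] * 69 + [1] * 5 + [0] * 4 + [0] * 90
predicted = [1] * 69 + [0] * 5 + [1] * 4 + [0] * 90

tp, fn, fp, tn = confusion_counts(actual, predicted)

# Accuracy: the share of all observations that were predicted correctly.
accuracy = (tp + tn) / (tp + fn + fp + tn)
```

The result is 159 / 168, or roughly 0.946.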
Precision
• Measures the proportion of positive predictions that are correct.
• Tells how well the model predicts the positive class correctly.
• Answers questions such as "When the model predicts a positive result, what is the chance that the outcome is actually positive?"

            Predicted 1    Predicted 0
Actual 1    TP = 69        FN = 5
Actual 0    FP = 4         TN = 90
Recall
• The ratio of correctly classified positive samples to all actual positive samples.
• Answers the question "What proportion of the actual positive class was identified correctly?"

            Predicted 1    Predicted 0
Actual 1    TP = 69        FN = 5
Actual 0    FP = 4         TN = 90
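Precision and recall for the matrix above follow directly from their standard definitions, TP / (TP + FP) and TP / (TP + FN). The sketch below only restates those formulas with the slide's numbers.

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that is actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything actually positive, the share that was predicted positive."""
    return tp / (tp + fn)


# Cells of the confusion matrix from the slide
tp, fn, fp, tn = 69, 5, 4, 90

model_precision = precision(tp, fp)  # 69 / (69 + 4)
model_recall = recall(tp, fn)        # 69 / (69 + 5)
```

For this matrix, precision is about 0.945 and recall is about 0.932.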
Hands – On 4
Build a machine learning model using the Logistic Regression algorithm to predict the admission status, Admitted (Yes or No), of a student's application to a higher learning institution, based on SAT score.

Data File: Admittance.csv
HANDS-ON 5
Build a machine learning model using the Logistic Regression algorithm to predict the Exited status (Churn or Stay) of the customers of Bank A.

Data File: bank_churn.csv
Machine Learning Process

[Flowchart: DATA PREPARATION & PRE-PROCESSING → Train Set → TRAIN ALGORITHM → Generate ML Model; Test Set → APPLY MODEL → MODEL ASSESSMENT → Performance OK? YES: SAVE MODEL. NO: back to TRAIN ALGORITHM.]


Partitioning Dataset
• An imbalanced class distribution in the response value will cause bias.
• Apply a stratified sampling strategy to distribute the rows into the train and test sets.
• Stratified sampling distributes the data points so that each set keeps the proportion of the classes.

                                 Stay    Churn
Overall (total rows = 10,000)    7963    2037
Train Set – 80%                  6370    1630
Test Set – 20%                   1593    407
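A minimal sketch of stratified sampling, using only the standard library: each class is shuffled and split 80/20 on its own, so both sets keep the class proportions. The function name is hypothetical. In scikit-learn, the stratify parameter of train_test_split serves the same purpose, though whether the hands-on uses it is an assumption.

```python
import random
from collections import Counter


def stratified_split(labels, train_ratio=0.8, seed=0):
    """Split row indices into train and test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)  # 80% of this class
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Class counts from the slide: 7963 Stay and 2037 Churn
labels = ["Stay"] * 7963 + ["Churn"] * 2037

train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
```

With the slide's class counts, this yields 6370 Stay and 1630 Churn rows in the train set, and 1593 Stay and 407 Churn rows in the test set.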
Training Algorithm
• Train Logistic Regression from the sklearn package.

Use data from Train Set → Train Logistic Regression Algorithm → Generate Classification Model


Performance & Accuracy
                   Predicted Churn    Predicted Stay    Class Recall    F-Score
Actual Churn       72                 335               18%             27%
Actual Stay        57                 1536              96%             89%
Class Precision    56%                82%

Overall Accuracy 80.4%

• The overall accuracy is 80.4%. However, this indicator is not sufficient on its own because the output classes are not balanced.

• 82% of the samples predicted as Stay are actually Stay. The model performs well for the Stay class.

• Only 56% of the samples predicted as Churn are actually Churn.
• Only 18% of the actual Churn samples are predicted correctly.

• 96% of the samples that are actually Stay are predicted correctly. This is high.

• The F-Score is the harmonic mean of precision and recall. Based on this, the model can be concluded to perform better at predicting the Stay class.

• To get a better model, apply cross-validation and the SMOTE algorithm, and try to balance the data.
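The percentages in the performance table can be recomputed from the four cells of the confusion matrix. The sketch below uses the standard definitions of precision, recall and F-score; the variable and function names are hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, F-score) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Confusion matrix from the slide (rows = actual, columns = predicted)
churn_as_churn, churn_as_stay = 72, 335
stay_as_churn, stay_as_stay = 57, 1536

# Treat each class in turn as the "positive" class.
churn = class_metrics(tp=churn_as_churn, fp=stay_as_churn, fn=churn_as_stay)
stay = class_metrics(tp=stay_as_stay, fp=churn_as_stay, fn=stay_as_churn)

total = churn_as_churn + churn_as_stay + stay_as_churn + stay_as_stay
accuracy = (churn_as_churn + stay_as_stay) / total

churn_percent = [round(100 * value) for value in churn]
stay_percent = [round(100 * value) for value in stay]
```

Rounded to whole percentages, the Churn class gives precision 56, recall 18 and F-score 27, the Stay class gives 82, 96 and 89, and the accuracy is 0.804, matching the table.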
