Part 1
Introduction to Machine Learning
ML Produce
Algorithm
Add a footer 4
Conventional Processing
• The programmer specifies a set of rules that are pre-defined from existing conditions.
• The algorithm learns from purchase history data to understand the user's purchasing behavior.
Historical Transactional Data → Train ML Algorithm to GENERATE ML Model
NEW_INPUT → Apply ML Model → Fraud transaction?
• NO: Complete payment transaction process
• YES: Cancel payment
Who will buy more?
Forecasting
[Figure: Price plotted against Size, with labels a and b]
Optimization
Recommendation System
Machine Learning
Types of Machine Learning
SUPERVISED LEARNING
90 1 Diesel 13500
90 1 Diesel 13750
New Inputs
90 1 Diesel 13950
Car Price Dataset
Algorithms: Linear Regression, Support Vector Machine
Customer Behaviour: Clustering Algorithm
Algorithms: K-Means, Apriori
• A reward system: instead of minimizing the error, the model maximizes the objective function.
• The model becomes smarter through rewards.
Deep Neural Network
Neural Network
Classification Model
• Predicts output in the form of a class (Buy or Not Buy)
• Logistic regression/Naive Bayes...
Clustering Model
• Finds patterns in the data.
• No exact response (Unsupervised Learning)
• K-Means, DBSCAN...
How do I start with Machine Learning ... ?
Understanding the requirements → Prepare Data → Algorithm Selection → Train the Algorithm → Performance Evaluation
• OK: Saved Model
• NOT OK: New parameters or New algorithm
ML Workflow
Data
Training algorithm
Programming Tools
COMPUTATIONAL FINANCE: Credit scoring and algorithmic trading.
IMAGE PROCESSING: Face recognition, motion & object detection.
COMPUTATIONAL BIOLOGY: Drug discovery, DNA sequencing, etc.
Part 4
Linear Regression Model
By Dr. Nickholas Anting
LINEAR REGRESSION MODEL
• Linear approximation of a relationship between two or more variables.
• Model mathematically the relationship between two or more variables (dependent & independent
variables).
[Figure: Price plotted against Size, with label a]
Simple Linear Regression
• Simple relationship between one input and one output variable
• Only one input to predict output
Ordinary Least Squares (OLS)
• The most common method to estimate the linear regression equation.
• Least squares stands for the minimum squared error.
• A lower error results in better explanatory power of the regression model.
• The method aims to find the line that minimizes the sum of the squared errors.
• Many lines can be drawn to fit the data; OLS determines the one with the smallest error.
• This best-fitted line is the one closest to all the data points.
[Figure: scatter plot of y against x showing the best-fitted line with the smallest error]
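The least-squares idea above can be sketched with NumPy for a single input. This is a minimal illustration only: the size and price values are made up (not taken from real_estate.csv), and naming the intercept a and the slope b is an assumption based on the labels in the figures.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form OLS estimates for one input variable
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    a = y_mean - b * x_mean
    return a, b

def sum_squared_error(x, y, a, b):
    """Sum of squared differences between actual y and the line a + b*x."""
    residuals = np.asarray(y, dtype=float) - (a + b * np.asarray(x, dtype=float))
    return float(np.sum(residuals ** 2))

# Hypothetical illustration data
size = [50, 65, 80, 100, 120]
price = [160, 190, 250, 290, 370]

a, b = fit_simple_ols(size, price)
best_error = sum_squared_error(size, price, a, b)
# Any other line, for example one with a shifted intercept, has a larger error
shifted_error = sum_squared_error(size, price, a + 1.0, b)
```

Comparing best_error with shifted_error shows the point of the last two bullets: among all candidate lines, the OLS line has the smallest sum of squared errors.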
Performance Metrics: R-Squared (R²)
[Figure: three scatter plots of y against x, illustrating R-squared values on a scale from 0 to 1]
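A minimal sketch of how R-squared can be computed from its usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample values are illustrative assumptions, not part of the course material.

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_actual - y_predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y, y)                       # predictions equal the actual values
baseline = r_squared(y, [np.mean(y)] * len(y))  # always predicting the mean
```

The two cases correspond to the ends of the 0-to-1 scale: a model that explains all of the variability, and one that explains none of it.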
Hands – On 2
Create a machine learning model to predict the Price based on
Size.
Input/Feature/Predictor – Size
Output/Response – Price
Dataset: real_estate.csv
[Figure: Price (y) plotted against Size (x), with labels a and b]
Understanding the requirement & dataset → DATA PREPARATION → TRAIN ALGORITHM (Train Set) → Generate ML Model → MODEL EVALUATION → Performance OK?
• YES: SAVE MODEL
• NO: return to an earlier step
DATA PREPARATION
A process of importing, reviewing, and cleansing the raw dataset.
Import Dataset
• CSV
• Excel
• Text
Descriptive Statistics
Missing Values
Duplicated Rows
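The data preparation steps listed above (import, descriptive statistics, missing values, duplicated rows) could look roughly like this in pandas. The in-memory CSV is a hypothetical stand-in for a file such as real_estate.csv; its column names and values are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; columns and values are made up
raw_csv = io.StringIO(
    "price,size,year\n"
    "234000,64.3,2015\n"
    "228000,65.6,2009\n"
    "228000,65.6,2009\n"  # duplicated row
    "281000,,2018\n"      # missing size value
    "255000,71.2,2012\n"
)

df = pd.read_csv(raw_csv)                     # import the dataset
summary = df.describe()                       # descriptive statistics per column
missing_per_column = df.isna().sum()          # count of missing values per column
duplicate_count = int(df.duplicated().sum())  # count of fully duplicated rows

# One simple cleansing choice: drop duplicated rows, then rows with missing values
clean = df.drop_duplicates().dropna().reset_index(drop=True)
```

Dropping rows is only one option; filling missing values with a mean or median is another common choice, depending on the dataset.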
DATA PRE-PROCESSING
A series of final steps to get the dataset ready for further processing.
Assign Input & Output Attributes
• Assign the output (response) and input (features) variables.
INPUT, x OUTPUT, y
B. Partitioning Dataset
Overall Data: 100% (100)
• The train set is used to train the algorithm.
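A minimal sketch of partitioning with scikit-learn's train_test_split, using 100 made-up rows to mirror the "100% (100)" figure. The 80/20 ratio and the random_state value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical observations: one input column x and one output y
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Hold out 20% of the rows as a test set; the rest is used to train the algorithm
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Fixing random_state makes the split reproducible, which helps when comparing different algorithms or parameters on the same partition.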
Now the Data is Ready
TRAINING ALGORITHM
Method A
R-Squared
Method B
Correlation between Predicted Output and Actual Output
using Test set
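Both evaluation methods can be sketched on a held-out test set as follows. The data is synthetic, and the use of scikit-learn's LinearRegression here is an assumption made for illustration; the slides do not show which tool is used at this step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price depends linearly on size, plus random noise
rng = np.random.default_rng(0)
size = rng.uniform(50, 150, size=100).reshape(-1, 1)
price = 1000 + 30 * size.ravel() + rng.normal(0, 200, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    size, price, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)  # train on the train set only
y_pred = model.predict(X_test)                    # apply the model to the test set

r2 = r2_score(y_test, y_pred)                    # Method A: R-squared
correlation = np.corrcoef(y_test, y_pred)[0, 1]  # Method B: correlation of predicted vs. actual
```

Both numbers are computed on rows the model did not see during training, so they indicate how well it generalizes rather than how well it memorized the train set.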
MULTIPLE LINEAR REGRESSION
Example – House Pricing
R² > Adjusted R²
R-squared
• Measures how much of the total variability is explained by the model.
• By R-squared alone, multiple regression always looks at least as good as simple regression: adding a variable never lowers R-squared, even when it adds little explanatory power.
Adjusted R-squared
• Penalizes excessive use of variables.
• Compares the explanatory power of regression models that contain different numbers of predictors.
Example: add a new variable
                      Eqn. 1    Eqn. 2 (new variable added)
R-Squared             0.406     0.407
Adjusted R-Squared    0.399     0.392
p-value(s)            0.000     0.000, 0.762
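The penalty for adding a variable can be sketched with the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sample size n = 84 is an assumption: the slides do not state it, and it is used here only because it gives values that round to the ones in the example.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)

n = 84  # ASSUMED sample size, not stated in the slides

eqn1 = adjusted_r_squared(0.406, n, n_predictors=1)  # one input variable
eqn2 = adjusted_r_squared(0.407, n, n_predictors=2)  # one extra variable added
```

R-squared rises slightly from Eqn. 1 to Eqn. 2, but the adjusted value falls. Together with the high p-value of 0.762, this suggests the new variable adds little explanatory power.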
Hands – On 2
Create a machine learning model to predict the Price based on
Size and Year.
Output/Response – Price
Dataset: real_estate.csv
FEATURES/INPUTS SCALING
size year
Scaling
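A minimal sketch of scaling the two inputs. The slides do not say which scaling method is used, so standardization with scikit-learn's StandardScaler (zero mean, unit standard deviation per column) is an assumption, and the size and year values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: column 0 is size, column 1 is year
X = np.array(
    [[64.3, 2015], [65.6, 2009], [71.2, 2012], [90.1, 2018], [120.5, 2006]],
    dtype=float,
)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and standard deviation 1
```

Size and year sit on very different numeric ranges, so scaling puts them on a comparable footing. In a train/test workflow the scaler would normally be fitted on the train set only and then applied to the test set with scaler.transform.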
TRAINING ALGORITHM
• Train the model using the OLS method from the statsmodels package.
[Results compared for two models: size only, and size & year]
Hands – On 3: Data with Categorical Variable
Create a machine learning model to predict the GPA score based on
SAT score and Attendance.
Output/Response – GPA
Handling Categorical Variable
Transform into Dummy Variable
• Variable that is used to include categorical data into the model.
• Transform non-numeric data or categorical data into numeric form.
• NOMINAL to NUMERICAL – Use one to many node.
Yes Yes 1 0
No No 0 1
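The Yes/No to 1/0 transformation can be sketched with pandas. The column names SAT, GPA, and Attendance follow the hands-on description, but the rows are made-up values, and pd.get_dummies is one possible tool rather than necessarily the one used in the course.

```python
import pandas as pd

# Hypothetical rows; Attendance is the categorical (nominal) variable
df = pd.DataFrame(
    {
        "SAT": [1714, 1664, 1760, 1685],
        "GPA": [2.40, 2.52, 2.54, 2.74],
        "Attendance": ["No", "No", "Yes", "Yes"],
    }
)

# One dummy column per category: 1 where the row has that category, else 0
dummies = pd.get_dummies(df["Attendance"], prefix="Attendance", dtype=int)
encoded = pd.concat([df.drop(columns="Attendance"), dummies], axis=1)
```

For a regression model with an intercept, one of the two dummy columns is redundant; passing drop_first=True to pd.get_dummies keeps a single column.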
Data Science Professional Certification
Classification Model
Logistic Regression Model
• BUY or NOT BUY
• STAY or CHURN
• TYPE A or TYPE B or TYPE C
Classification vs Regression
• Logistic Regression
• Naive Bayes
• K-Nearest Neighbors
• Support Vector Machine
• Decision Tree
• Random Forest
LOGISTIC REGRESSION MODEL
• Logistic Regression is a classification algorithm used to assign observations to a discrete set of classes.
• It is a supervised classification algorithm: the model builds a regression model to predict the probability that an event succeeds.
• It produces a binary result, which is used to predict the outcome of a categorical dependent variable.
Inputs → Logistic Regression Model → Probability → p >= 0.5? → Output
• Yes: Class 1
• No: Class 2
Linear vs Logistic Regression
Logistic Regression Curve
[Figure: S-shaped logistic curve, x-axis from -6 to 6, y-axis from 0 to 1, threshold at 0.5: p≥0.5 = 1, p<0.5 = 0]
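The curve and its threshold can be sketched in a few lines of NumPy. The function names are illustrative assumptions; the inputs -6, 0, and 6 match the ends and centre of the x-axis in the figure.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def classify(z, threshold=0.5):
    """Assign class 1 when the probability is at least the threshold, else class 0."""
    return (sigmoid(z) >= threshold).astype(int)

z = np.array([-6.0, 0.0, 6.0])
p = sigmoid(z)        # probabilities near 0, exactly 0.5, and near 1
labels = classify(z)  # classes after applying the 0.5 threshold
```

The threshold of 0.5 is the default in the slides; it can be moved when one type of error is more costly than the other.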
Model Accuracy/Performance Assessment
The Confusion Matrix
Table used to describe the performance of a classification model, such as Logistic Regression.

            Predicted 1   Predicted 0
Actual 1    69            5
Actual 0    4             90

The model did its job well:
• For 69 observations, the model correctly predicted 1 and the actual value was 1.
• For 90 observations, the model correctly predicted 0 and the actual value was 0.
The model was confused:
• For 4 observations, the model predicted 1 but the actual value was 0.
• For 5 observations, the model predicted 0 but the actual value was 1.
Model Accuracy
A measure to evaluate the accuracy of the logistic regression model
using confusion matrix.
            Predicted 1   Predicted 0
Actual 1    TP = 69       FN = 5
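As a small worked example, accuracy can be computed from the four counts of the confusion matrix shown earlier (TP = 69, FN = 5, FP = 4, TN = 90). The formula used is the usual one, correct predictions divided by all predictions; the variable names are illustrative.

```python
# Counts taken from the confusion matrix in the slides
tp, fn = 69, 5  # actual 1: predicted 1, predicted 0
fp, tn = 4, 90  # actual 0: predicted 1, predicted 0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total  # share of all observations predicted correctly
```

Here 159 of the 168 observations are predicted correctly, which gives an accuracy of roughly 94.6%.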
Precision
• Metric used to measure the proportion of positive predictions that are correct.
• Tells how well the model predicts the positive class.
• Answers questions such as "What are the chances that the outcome is actually positive when the model predicts a positive result?"
            Predicted 1   Predicted 0
Actual 1    TP = 69       FN = 5
Actual 0    FP = 4        TN = 90
Recall
• Ratio of correctly classified positive observations to all actual positive observations.
• Answers the question "What proportion of the actual positive class was identified correctly?"
            Predicted 1   Predicted 0
Actual 1    TP = 69       FN = 5
Actual 0    FP = 4        TN = 90
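Precision and recall for the positive class can be worked out from the same table with their standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The function names are illustrative assumptions.

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that is actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, the share that was predicted positive."""
    return tp / (tp + fn)

# Counts from the confusion matrix: TP = 69, FN = 5, FP = 4
positive_precision = precision(tp=69, fp=4)  # 69 out of 73 positive predictions
positive_recall = recall(tp=69, fn=5)        # 69 out of 74 actual positives
```

The two metrics share a numerator but differ in the denominator: precision looks down the Predicted 1 column, while recall looks across the Actual 1 row.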
Hands – On 4
Build a machine learning model using the Logistic Regression algorithm to predict the admission status (Admitted) of a student's application to a higher learning institution, based on SAT score.
Admitted: Yes or No
Data File: Admittance.csv
HANDS-ON 5
Build a machine learning model using the Logistic Regression algorithm to predict the Exited status of customers in Bank A.
Exited: Churn or Stay
Data File: bank_churn.csv
Machine Learning Process
DATA PREPARATION & PRE-PROCESSING → TRAIN ALGORITHM (Train Set) → Generate ML Model → APPLY MODEL (Test Set) → MODEL ASSESSMENT → Performance OK?
• YES: SAVE MODEL
• NO: return to an earlier step
Total rows = 10,000
                   Stay    Churn
Overall            7963    2037
Train Set – 80%    6370    1630
Test Set – 20%     1593    407
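The class counts above keep roughly the same Stay/Churn proportion in both partitions, which is what a stratified split produces. A minimal sketch with scikit-learn follows, assuming stratification was used; the slides do not say so explicitly. The feature column is a placeholder and random_state is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the overall class counts from the slides: 7963 Stay, 2037 Churn
y = np.array(["Stay"] * 7963 + ["Churn"] * 2037)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, one value per row

# stratify=y keeps the class proportions similar in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = {label: int(np.sum(y_train == label)) for label in ("Stay", "Churn")}
test_counts = {label: int(np.sum(y_test == label)) for label in ("Stay", "Churn")}
```

Without stratification, a random split of imbalanced data can leave the minority class over- or under-represented in the test set, which distorts the assessment.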
Training Algorithm
• Train a Logistic Regression model from the sklearn package.
Class Precision: 56% (Churn), 82% (Stay)
• The overall accuracy is 80.4%. However, this indicator is not sufficient, since the output classes are not balanced.
• 82% of those predicted Stay are actually Stay. The model performs well for the Stay class.
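A minimal sketch of training and assessing a Logistic Regression model with scikit-learn. The data is synthetic and imbalanced on purpose; it is not bank_churn.csv, so the resulting numbers will not match the 80.4%, 56%, or 82% reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 1 = Churn (minority class), 0 = Stay
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
logits = -1.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# average=None returns one precision value per class, ordered [class 0, class 1]
precision_per_class = precision_score(y_test, y_pred, average=None)
matrix = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
```

Looking at precision per class, rather than accuracy alone, is what exposes a model that does well on the majority class but poorly on the minority class.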
[Table: confusion matrix with columns Predicted Churn and Predicted Stay, plus Class Recall and F-Score; Class Precision: 56% (Churn), 82% (Stay)]
• 96% of the samples that are actually Stay are predicted correctly. This is high.
• The F-Score is the harmonic mean of precision and recall. Based on this, the model can be concluded to perform better at predicting the Stay class.
• Apply cross-validation and the SMOTE algorithm to balance the data and obtain a better model.
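The harmonic mean mentioned above can be sketched as a short function. The inputs are the rounded Stay-class values reported in the slides (82% precision, 96% recall), so the result is approximate and may differ slightly from the F-Score shown in the original table.

```python
def f_score(precision, recall):
    """F-Score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded Stay-class values reported in the slides
stay_f_score = f_score(precision=0.82, recall=0.96)
```

The harmonic mean is pulled toward the lower of the two values, so a class needs both good precision and good recall to score well.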