
STMIK Amikom Purwokerto

“Sarana Pasti Meraih Prestasi”

A Roadmap for Building a Machine Learning System
Wiga Maulana Baihaqi, S.Kom., M.Eng.

7/17/20 Wiga Maulana Baihaqi, S.Kom., M.Eng. 1



Agenda for Today’s Session


Process of Machine Learning

Data

Algorithms in Machine Learning

Splitting Data

Evaluation


Process of Machine Learning

[Diagram] Dataset → Machine Learning Algorithm → Knowledge (Patterns) → Evaluation

DATA PRE-PROCESSING: Data Cleaning, Data Reduction
Tasks: Prediction, Classification, Clustering, Association


Data

[Figure: anatomy of a dataset]
• Columns are Attributes/Features; the target column is the Class/Label/Target
• Rows are Records/Objects/Samples/Tuples
• Attribute values can be Nominal or Numeric


Data Cleaning
• Data cleaning is one of those things that everyone does but no one
really talks about. Sure, it’s not the "sexiest" part of machine
learning. And no, there aren’t hidden tricks and secrets to
uncover.
• However, proper data cleaning can make or break your project.
Professional data scientists usually spend a very large portion of
their time on this step.
• Why? Because of a simple truth in machine learning:
Better data beats fancier algorithms.


1. Remove Unwanted Observations


• The first step to data cleaning is removing unwanted
observations from your dataset.
• This includes duplicate or irrelevant observations.


Remove Unwanted Observations: Duplicate Observations
• Duplicate observations most frequently arise during data
collection, such as when you:
• Combine datasets from multiple places
• Scrape data
• Receive data from clients/other departments
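In pandas, fully duplicated rows can be dropped in one call. A minimal sketch with made-up data (the `id` and `roof` columns are hypothetical, imagined as the result of combining two sources):

```python
import pandas as pd

# Hypothetical result of combining two sources: record id=2 appears twice.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "roof": ["Asphalt", "Composition", "Composition", "Shake Shingle"],
})

# Keep only the first occurrence of each fully duplicated row.
deduped = df.drop_duplicates()
print(len(deduped))  # 3 rows remain
```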


Remove Unwanted Observations: Irrelevant Observations
• Irrelevant observations are those that don’t
actually fit the specific problem that you’re
trying to solve.
• For example, if you were building a model for
Single-Family homes only, you wouldn't want
observations for Apartments in there.
• This is also a great time to review your charts
from Exploratory Analysis. You can look at the
distribution charts for categorical features to
see if there are any classes that shouldn’t be
there.

2. Fix Structural Errors


In a categorical feature you might find inconsistent capitalization and labels:
• 'composition' is the same as
'Composition'
• 'asphalt' should be 'Asphalt'
• 'shake-shingle' should be 'Shake
Shingle'
• 'asphalt,shake-shingle' could
probably just be 'Shake Shingle' as
well
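These fixes amount to mapping each inconsistent label to one canonical class. A minimal pandas sketch, assuming a hypothetical column named `roof` that holds the values listed above:

```python
import pandas as pd

# Hypothetical 'roof' column with the inconsistent labels from the slide.
df = pd.DataFrame({"roof": ["composition", "Composition", "asphalt",
                            "shake-shingle", "asphalt,shake-shingle"]})

# Map each variant to its canonical class.
df["roof"] = df["roof"].replace({
    "composition": "Composition",
    "asphalt": "Asphalt",
    "shake-shingle": "Shake Shingle",
    "asphalt,shake-shingle": "Shake Shingle",
})
print(sorted(df["roof"].unique()))
```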



3. Filter Unwanted Outliers
• Outliers can cause problems for certain types of models, but remove an
outlier only when you have a legitimate reason, such as a measurement or
data-entry error, not simply because it is a large value.


4. Handle Missing Data


• Real-world data often has missing values.
• Data can have missing values for a number of reasons, such as
observations that were not recorded and data corruption.
• Handling missing data is important because many machine learning
algorithms do not support data with missing values.


How to Handle Missing Values


• Dropping observations that
have missing values
• Imputing the missing values
based on other observations


Example: Dropping

Before ('?' marks a missing value):
Sepal Length  Sepal Width  Petal Length  Petal Width  Type
5.1           ?            1.4           0.2          Iris-Setosa
4.9           3.0          1.4           0.2          Iris-Setosa
?             3.2          1.3           ?            Iris-versicolor
4.6           3.6          1.5           0.3          Iris-Virginica

After dropping rows with missing values:
Sepal Length  Sepal Width  Petal Length  Petal Width  Type
4.9           3.0          1.4           0.2          Iris-Setosa
4.6           3.6          1.5           0.3          Iris-Virginica
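The same dropping step can be sketched in pandas, encoding each '?' from the table above as NaN:

```python
import numpy as np
import pandas as pd

# The slide's table, with '?' represented as NaN.
df = pd.DataFrame(
    {"Sepal Length": [5.1, 4.9, np.nan, 4.6],
     "Sepal Width": [np.nan, 3.0, 3.2, 3.6],
     "Petal Length": [1.4, 1.4, 1.3, 1.5],
     "Petal Width": [0.2, 0.2, np.nan, 0.3],
     "Type": ["Iris-Setosa", "Iris-Setosa", "Iris-versicolor", "Iris-Virginica"]})

# Drop every row that contains at least one missing value.
cleaned = df.dropna()
print(cleaned["Type"].tolist())  # ['Iris-Setosa', 'Iris-Virginica']
```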


Disadvantages of the Dropping Method?


Example: Imputing
• A constant value that has meaning within the domain, such as 0,
distinct from all other values.
• A value from another randomly selected record.
• A mean, median or mode value for the column.
• A value estimated by another predictive model.
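Three of these imputation options can be sketched with pandas `fillna` on a small made-up column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries.
s = pd.Series([5.1, 4.9, np.nan, 4.6, np.nan])

constant_imp = s.fillna(0)          # a constant with meaning in the domain
mean_imp = s.fillna(s.mean())       # mean of the observed values
median_imp = s.fillna(s.median())   # median of the observed values

print(round(float(s.mean()), 2))    # 4.87
```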


Which method is more accurate?


Data Reduction
• Dimension/Data Reduction is the process of converting data with a
large number of dimensions into data with fewer dimensions, while
still conveying similar information.
• These techniques are typically used when solving machine learning
problems to obtain better features for a classification or
regression task.


Data Reduction
• Consider two dimensions x1 and x2, which are, say, measurements
of several objects in centimeters (x1) and inches (x2).
• If you use both dimensions in machine learning, they convey
similar information and introduce a lot of noise into the system,
so you are better off using just one.
• Here we convert the data from 2D (x1 and x2) to 1D (z1), which
makes the data easier to explain.


What are the benefits of Data Reduction?


• It helps compress the data, reducing the storage space required.
• It reduces the time required to perform the same computations.
Fewer dimensions mean less computing, and may also allow the use
of algorithms unfit for a large number of dimensions.
• It takes care of multicollinearity, which improves model
performance by removing redundant features. For example, there is
no point in storing a value in two different units (e.g.,
centimeters and inches).

What are the benefits of Data Reduction? (continued)
• Reducing the data to 2D or 3D allows us to plot and visualize it,
so you can observe patterns more clearly. For example, 3D data can
be converted into 2D by first identifying a suitable 2D plane and
then representing the points on the two new axes z1 and z2.
• It also helps with noise removal, which in turn can improve the
performance of models.


What are the common methods to perform Dimension Reduction?
• Missing Values
• Low Variance
• Decision Trees
• Random Forest
• PCA (Principal Component Analysis)
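PCA, the last method above, can be sketched with scikit-learn on made-up data mirroring the earlier cm/inches example (the synthetic measurements are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cm = rng.normal(10, 3, 100)            # x1: lengths in centimeters
X = np.column_stack([cm, cm / 2.54])   # x2: the same lengths in inches

# Project the two redundant dimensions onto a single component z1.
pca = PCA(n_components=1)
z1 = pca.fit_transform(X)
print(z1.shape)  # (100, 1)
```

Because x2 is an exact linear function of x1, the single component retains essentially all of the variance.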


Splitting Dataset
• Validation techniques are motivated by two fundamental problems
in pattern recognition: model selection and performance estimation.
• Model Selection
 Almost invariably, all pattern recognition techniques have one or
more free parameters
• The number of neighbors in a kNN classification rule.
• The network size, learning parameters and weights in MLPs.
 How do we select the “optimal” parameter(s) or model for a
given classification problem?


Continue...
• Performance estimation
 Once we have chosen a model, how do we estimate its
performance?
 Performance is typically measured by the TRUE ERROR RATE,
the classifier’s error rate on the ENTIRE POPULATION


The Holdout Method


• Split dataset into two groups
• Training set: used to train the classifier
• Test set: used to estimate the error rate of the trained classifier
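The holdout split can be sketched with scikit-learn's `train_test_split`; here 30% of the Iris dataset is held out for testing (the 30% ratio and `random_state` are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 examples

# Hold out 30% as a test set; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 105 45
```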


The Holdout Method


• The holdout method has two basic drawbacks
• In problems where we have a sparse dataset, we may not be able
to afford the “luxury” of setting aside a portion of the dataset
for testing
• Since it is a single train-and-test experiment, the holdout
estimate of error rate will be misleading if we happen to get an
“unfortunate” split


Random Subsampling
• Random Subsampling performs K data splits of the dataset
• Each split randomly selects a (fixed) number of examples without
replacement
• For each data split we retrain the classifier from scratch with the
training examples and estimate Ei with the test examples
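scikit-learn's `ShuffleSplit` implements this scheme: K independent random splits, each drawn without replacement. A minimal sketch on a tiny made-up dataset (sizes chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 hypothetical examples

# K = 5 random splits, each holding out 30% of the examples for testing.
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
sizes = [(len(train), len(test)) for train, test in ss.split(X)]
print(sizes[0])  # (7, 3)
```

In practice, the classifier would be retrained from scratch inside the loop and the error Ei estimated on each test portion.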


K-Fold Cross Validation


• Create a K-fold partition of the dataset
• For each of K experiments, use K-1 folds for training and the remaining one for
testing

• K-Fold Cross validation is similar to Random Subsampling


• The advantage of K-Fold Cross validation is that all the examples in the dataset are
eventually used for both training and testing
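A K-fold partition can be sketched with scikit-learn's `KFold` (K = 3 and the toy data are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 hypothetical examples

# 3 folds: each experiment trains on K-1 = 2 folds, tests on the remaining one.
folds = list(KFold(n_splits=3).split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Across the K experiments, every example appears in a test fold exactly once, which is the advantage noted above.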


How many folds are needed?


• With a large number of folds
+ The bias of the true error rate estimator will be small (the estimator will be very
accurate)
- The variance of the true error rate estimator will be large
- The computational time will be very large as well (many experiments)
• With a small number of folds
+ The number of experiments and, therefore, computation time are reduced
+ The variance of the estimator will be small
- The bias of the estimator will be large (conservative or higher than the true error
rate)

Leave-one-out Cross Validation


• Leave-one-out is the degenerate case of K-Fold Cross Validation,
where K is chosen as the total number of examples
• For a dataset with N examples, perform N experiments
• For each experiment use N-1 examples for training and the
remaining example for testing
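Leave-one-out is available directly in scikit-learn; a minimal sketch with N = 4 made-up examples:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(8).reshape(4, 2)  # N = 4 hypothetical examples

n_experiments = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Each experiment trains on N-1 = 3 examples, tests on the remaining one.
    assert len(train_idx) == 3 and len(test_idx) == 1
    n_experiments += 1
print(n_experiments)  # 4 experiments for N = 4
```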


Algorithms of Machine Learning

Machine Learning
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering


Knowledge (Patterns)
1. Formula/Function (regression formula or function)
• TRAVEL TIME = 0.48 + 0.6 DISTANCE + 0.34 TRAFFIC LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation Level
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu
5. Cluster


Evaluation
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index
• External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)


Regression Evaluation
• Regression evaluation measures how well your model performed, by
measuring the difference between predicted values and the actual
values.
• Let’s say you feed a model some input X and your model predicts
10, but the actual value is 5.
• This difference between your prediction (10) and the actual
observation (5) is the error term: (y_prediction - y_actual).
• The error term is important because we usually want to minimize
the error; in other words, we want predictions that are very close
to the actual values.


Root Mean Squared Error (RMSE)


• RMSE is a quadratic scoring rule that also measures the average
magnitude of the error. It’s the square root of the average of
squared differences between prediction and actual observation.
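RMSE is easy to compute by hand with NumPy. A minimal sketch on made-up values that include the 10-versus-5 example from the previous slide:

```python
import numpy as np

y_actual = np.array([5.0, 3.0, 4.0])
y_pred = np.array([10.0, 3.0, 2.0])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_pred - y_actual) ** 2))
print(round(rmse, 3))  # 3.109
```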


Mean Absolute Error (MAE)


• MAE measures the average magnitude of the errors in a set of
predictions, without considering their direction. It’s the average
over the test sample of the absolute differences between
prediction and actual observation where all individual differences
have equal weight.
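MAE, by contrast, averages the absolute errors directly. The same made-up values as in the RMSE sketch:

```python
import numpy as np

y_actual = np.array([5.0, 3.0, 4.0])
y_pred = np.array([10.0, 3.0, 2.0])

# Average the absolute errors; every difference gets equal weight.
mae = np.mean(np.abs(y_pred - y_actual))
print(round(mae, 3))  # 2.333
```

Note that RMSE (3.109 here) is larger than MAE (2.333): squaring penalizes the single large error of 5 more heavily.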


Coefficient of Determination
• The coefficient of determination (R2) summarizes the explanatory
power of the regression model and is computed from the
sums-of-squares terms.


Coefficient of Determination
• R2 describes the proportion of variance of the dependent variable
explained by the regression model.
• If the regression model is “perfect”, SSE is zero, and R2 is 1.
• If the regression model is a total failure, SSE is equal to SST, no
variance is explained by regression, and R2 is zero.
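The SSE/SST relationship above can be sketched with NumPy on made-up values (a near-perfect hypothetical fit, so R2 should be close to 1):

```python
import numpy as np

y_actual = np.array([5.0, 3.0, 4.0, 6.0])
y_pred = np.array([4.8, 3.4, 4.1, 5.7])

sse = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
sst = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
r2 = 1 - sse / sst                               # 1 when SSE = 0, 0 when SSE = SST
print(round(r2, 3))  # 0.94
```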


Classification Evaluation
• When we get the data, after data cleaning, pre-processing and
wrangling, the first step is to feed it to a model and, of course,
get output, often as probabilities.
• But hold on! How can we measure the effectiveness of our model?
The better the effectiveness, the better the performance, and
that’s exactly what we want.
• This is where the confusion matrix comes into the limelight. The
confusion matrix is a performance measurement for machine
learning classification.


Confusion Matrix
• It is a performance measurement for machine learning
classification problems where the output can be two or more
classes. For a two-class problem, it is a table with 4 different
combinations of predicted and actual values.



Let’s understand TP, FP, FN, TN in terms of a pregnancy analogy.
• True Positive:
Interpretation: You predicted positive and it’s true.
You predicted that a woman is pregnant and she actually is.
• True Negative:
Interpretation: You predicted negative and it’s true.
You predicted that a man is not pregnant and he actually is not.


• False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
You predicted that a man is pregnant but he actually is not.
• False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
You predicted that a woman is not pregnant but she actually is.


How to Calculate a Confusion Matrix for a 2-class classification problem?

Confusion Matrix              Target
                        Positive    Negative
Model     Positive         a           b       Positive Predictive Value = a/(a+b)
          Negative         c           d       Negative Predictive Value = d/(c+d)
                       Sensitivity  Specificity
                        a/(a+c)      d/(b+d)   Accuracy = (a+d)/(a+b+c+d)


Terminology in Confusion Matrix


• Accuracy: the proportion of the total number of predictions that
were correct.
• Positive Predictive Value or Precision: the proportion of
predicted positive cases that were correct.
• Negative Predictive Value: the proportion of predicted negative
cases that were correct.
• Sensitivity or Recall: the proportion of actual positive cases
which are correctly identified.
• Specificity: the proportion of actual negative cases which are
correctly identified.
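These five metrics follow directly from the a/b/c/d cell counts (a = TP, b = FP, c = FN, d = TN). A minimal sketch, using the worked counts 70/20/30/80 that appear later in this deck:

```python
def confusion_metrics(a, b, c, d):
    """Metrics from confusion-matrix cells: a=TP, b=FP, c=FN, d=TN."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + b),     # positive predictive value
        "npv": d / (c + d),           # negative predictive value
        "sensitivity": a / (a + c),   # recall
        "specificity": d / (b + d),
    }

m = confusion_metrics(70, 20, 30, 80)
print(round(m["accuracy"], 2))  # 0.75
```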

Example: Confusion Matrix

Confusion Matrix              Target
                        Positive    Negative
Model     Positive         70          20      Positive Predictive Value = 0.78
          Negative         30          80      Negative Predictive Value = 0.73
                       Sensitivity  Specificity
                          0.70        0.80     Accuracy = 0.75
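In practice, the matrix is computed from the true and predicted label vectors. A sketch with hypothetical labels constructed to reproduce these counts (70 TP, 20 FP, 30 FN, 80 TN; 1 = positive, 0 = negative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical label vectors reproducing the counts in the table above.
y_true = np.array([1] * 70 + [0] * 20 + [1] * 30 + [0] * 80)
y_pred = np.array([1] * 70 + [1] * 20 + [0] * 30 + [0] * 80)

cm = confusion_matrix(y_true, y_pred)
print(cm)                              # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))  # 0.75
```

Note that scikit-learn orders the matrix with actual classes as rows (negatives first), so the layout is transposed relative to the slide's table.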



Clustering Evaluation
• Evaluation (or "validation") of clustering results is as difficult as
the clustering itself.
• Popular approaches include:
• "Internal" evaluation, where the clustering is summarized to a single quality score
• "External" evaluation, where the clustering is compared to an existing "ground truth" classification
• "Manual" evaluation by a human expert
• "Indirect" evaluation, which evaluates the utility of the clustering in its intended application


Internal Evaluation
• When a clustering result is evaluated based on the data that was
clustered itself, this is called internal evaluation.
• These methods usually assign the best score to the algorithm
that produces clusters with high similarity within a cluster and
low similarity between clusters.


Davies–Bouldin index
• Sum of squares within cluster (SSW):
  $SSW_i = \frac{1}{m_i} \sum_{j=1}^{m_i} d(x_j, c_i)$
  where $m_i$ is the number of data points in cluster $i$ and $c_i$ is the centroid of cluster $i$.
• Sum of squares between clusters (SSB): the distance between centroids,
  $SSB_{i,j} = d(c_i, c_j)$
• Ratio:
  $R_{i,j} = \frac{SSW_i + SSW_j}{SSB_{i,j}}$
• Davies–Bouldin Index (DBI):
  $DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{i \neq j} R_{i,j}$
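scikit-learn ships this index as `davies_bouldin_score`, built on the same within-cluster dispersion and between-centroid distances. A minimal check on made-up data: two tight, well-separated clusters should give a low score (lower is better).

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated hypothetical clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = [0, 0, 1, 1]

score = davies_bouldin_score(X, labels)
print(round(score, 4))  # a small value, reflecting good separation
```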

External evaluation
• Confusion Matrix


Thank You

Any Questions?
