3-2 A Roadmap For Building Machine Learning
Data
Splitting Data
Evaluation
[Diagram: Dataset → Machine Learning Algorithm → Knowledge → Evaluation]
[Diagram: data terminology — each column is an Attribute/Feature, the predicted column is the Class/Label/Target, each row is a Record/Object/Sample/Tuple; attribute types: Nominal, Numeric]
Data Cleaning
• Data cleaning is one of those things that everyone does but no one
really talks about. Sure, it’s not the "sexiest" part of machine
learning. And no, there aren’t hidden tricks and secrets to
uncover.
• However, proper data cleaning can make or break your project.
Professional data scientists usually spend a very large portion of
their time on this step.
• Why? Because of a simple truth in machine learning:
Better data beats fancier algorithms.
Example: Dropping
| Sepal Length | Sepal Width | Petal Length | Petal Width | Type            |
| 5.1          | ?           | 1.4          | 0.2         | Iris-setosa     |
| 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa     |
| ?            | 3.2         | 1.3          | ?           | Iris-versicolor |
| 4.6          | 3.6         | 1.5          | 0.3         | Iris-virginica  |
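A minimal pandas sketch of the dropping strategy, reconstructing the table above with "?" read as missing (NaN); the column names are assumptions:

```python
import pandas as pd
import numpy as np

# The Iris sample from the table above; "?" becomes NaN.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 4.6],
    "sepal_width":  [np.nan, 3.0, 3.2, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5],
    "petal_width":  [0.2, 0.2, np.nan, 0.3],
    "type": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Drop every record that contains at least one missing attribute.
clean = df.dropna()
```

Only the two complete records survive, which shows the cost of dropping: rows with any missing value are lost entirely.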
Example: Imputing
• A constant value that has meaning within the domain, such as 0,
distinct from all other values.
• A value from another randomly selected record.
• A mean, median or mode value for the column.
• A value estimated by another predictive model.
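A brief sketch of the first three strategies on a hypothetical column with one missing value:

```python
import pandas as pd
import numpy as np

# Hypothetical column with one missing value.
col = pd.Series([5.1, 4.9, np.nan, 4.6], name="sepal_length")

imputed_const = col.fillna(0)             # a domain constant distinct from real values
imputed_mean  = col.fillna(col.mean())    # mean of the observed values
imputed_med   = col.fillna(col.median())  # median is more robust to outliers
```

Imputation from a randomly selected record or from a predictive model follows the same `fillna` pattern, only the fill value differs.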
Data Reduction
• Dimension/Data Reduction refers to the process of converting a
set of data having vast dimensions into data with fewer
dimensions, while ensuring that it conveys similar information
concisely.
• These techniques are typically used while solving machine
learning problems to obtain better features for a classification or
regression task.
Data Reduction
• Let’s look at the image shown below. It shows
2 dimensions, x1 and x2, which are, let us say,
measurements of several objects in cm (x1)
and inches (x2).
• Now, if you were to use both these
dimensions in machine learning, they would
convey similar information and introduce a
lot of noise into the system, so you are better off
just using one dimension.
• Here we have converted the dimensionality of the
data from 2D (x1 and x2) to 1D (z1),
which has made the data relatively easier
to explain.
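A sketch of the cm/inches example on synthetic data (the measurements are made up): PCA via SVD projects the two correlated dimensions onto a single axis z1 that captures nearly all the variance.

```python
import numpy as np

# x1 in cm and x2 in inches measure the same length, so the two
# columns are almost perfectly correlated.
rng = np.random.default_rng(0)
x1 = rng.uniform(10, 50, size=100)           # cm
x2 = x1 / 2.54 + rng.normal(0, 0.1, 100)     # inches, plus small noise

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                      # center the data

# PCA via SVD: z1 is the projection onto the first principal axis.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
z1 = Xc @ Vt[0]

# Fraction of total variance captured by the first component.
explained = S[0] ** 2 / (S ** 2).sum()
```

Because the two dimensions are nearly redundant, the first component explains almost 100% of the variance, so z1 alone is enough.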
Splitting Dataset
• Validation techniques are motivated by two fundamental
problems in pattern recognition: model selection and
performance estimation.
• Model Selection
Almost invariably, all pattern recognition techniques have one
or more free parameters
• The number of neighbors in a kNN classification rule.
• The network size, learning parameters and weights in MLPs.
How do we select the “optimal” parameter(s) or model for a
given classification problem?
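As a sketch of model selection, here is a toy 1-D kNN classifier whose number of neighbors k is chosen by validation error (the data, the split, and the candidate values of k are all made up for illustration):

```python
from collections import Counter

def knn_predict(train, k, x):
    """train: list of (value, label); classify x by majority vote of the k nearest."""
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training and validation sets: two well-separated classes.
train = [(0.1, "A"), (0.2, "A"), (0.3, "A"), (0.9, "B"), (1.0, "B"), (1.1, "B")]
val   = [(0.15, "A"), (0.95, "B"), (0.25, "A")]

def val_error(k):
    """Fraction of validation examples misclassified with this k."""
    return sum(knn_predict(train, k, x) != y for x, y in val) / len(val)

# Model selection: pick the k with the lowest validation error.
best_k = min([1, 3, 5], key=val_error)
```

The same select-by-held-out-error pattern applies to any free parameter, e.g. network size or learning rate in an MLP.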
Monday, 11 November 2019 · Lecturer Name
STMIK Amikom Purwokerto
“Sarana Pasti Meraih Prestasi”
Continued...
• Performance estimation
Once we have chosen a model, how do we estimate its
performance?
Performance is typically measured by the TRUE ERROR RATE,
the classifier’s error rate on the ENTIRE POPULATION
Random Subsampling
• Random Subsampling performs K data splits of the dataset
• Each split randomly selects a (fixed) number of examples without
replacement
• For each data split we retrain the classifier from scratch with the
training examples and estimate Ei with the test examples
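The procedure above can be sketched as follows; `train_and_eval` is a hypothetical callback that retrains the classifier from scratch on the training indices and returns the test-set error E_i:

```python
import random

def random_subsampling(data, k, test_size, train_and_eval, seed=0):
    """Perform K random splits of the dataset and return the mean of the
    K test-error estimates E_i."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    errors = []
    for _ in range(k):
        rng.shuffle(idx)                                 # a fresh random split each round
        test_idx, train_idx = idx[:test_size], idx[test_size:]
        errors.append(train_and_eval(train_idx, test_idx))  # retrain from scratch
    return sum(errors) / k
```

With a callback that simply reports the test fraction, the estimate equals that fraction, which is an easy sanity check on the averaging.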
Supervised Learning vs. Unsupervised Learning
Knowledge (Patterns)
1. Formula/Function (regression formula)
• TRAVEL TIME = 0.48 + 0.6 DISTANCE + 0.34 TRAFFIC LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation Level
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu (if the third-semester GPA is 2.8, the student graduates on time)
5. Cluster
Evaluation
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index,
• External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
Regression Evaluation
• An error metric is a measure of how well your model performed. It
does this by measuring the difference between the predicted
values and the actual values.
• Let’s say you feed a model some input X and your
model predicts 10, but the actual value is 5.
• The difference between your prediction (10) and the
actual observation (5) is the error term: (y_prediction
- y_actual).
• The error term is important because we usually want
to minimize the error — in other words, we want our predictions
to be as close as possible to the actual values.
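A minimal sketch of the error term and its aggregation into RMSE (the single prediction is the toy example above; the three-sample vectors are made up):

```python
import math

# The toy example: prediction 10, actual 5.
error = 10.0 - 5.0          # (y_prediction - y_actual)

def rmse(pred, actual):
    """Root Mean Square Error over paired predictions and actuals."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

# Over several samples the individual errors are aggregated.
score = rmse([10, 4, 6], [5, 4, 8])
```

Squaring penalizes large errors more heavily than small ones, which is why RMSE (and MSE) are the default choices listed on the Evaluation slide.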
Coefficient of Determination
• The coefficient of determination (R2) summarizes the explanatory
power of the regression model and is computed from the sums-
of-squares terms.
Coefficient of Determination
• R2 describes the proportion of variance of the dependent
variable explained by the regression model.
• If the regression model is “perfect”, SSE is zero, and R2 is 1.
• If the regression model is a total failure, SSE is equal to SST, no
variance is explained by regression, and R2 is zero.
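The verbal description corresponds to the standard sums-of-squares identity (standard notation; the slide names SSE and SST but does not spell the formula out):

```latex
SST = SSR + SSE, \qquad
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
```

So a "perfect" model (SSE = 0) gives R² = 1, and a total failure (SSE = SST) gives R² = 0, exactly as stated above.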
Classification Evaluation
• When we get the data, after data cleaning, pre-processing and
wrangling, the first step we do is to feed it to an outstanding
model and of course, get output in probabilities.
• But hold on! How in the hell can we measure the effectiveness of
our model? The better the effectiveness, the better the performance,
and that’s exactly what we want.
• And that is where the confusion matrix comes into the limelight.
The Confusion Matrix is a performance measurement for machine
learning classification.
Confusion Matrix
• Well, it is a performance measurement for machine learning
classification problems where the output can be two or more classes.
It is a table with 4 different combinations of predicted and actual
values.
Confusion Matrix

|                | Target Positive       | Target Negative       |                                     |
| Model Positive | a                     | b                     | Positive Predictive Value = a/(a+b) |
| Model Negative | c                     | d                     | Negative Predictive Value = d/(c+d) |
|                | Sensitivity = a/(a+c) | Specificity = d/(b+d) | Accuracy = (a+d)/(a+b+c+d)          |
Confusion Matrix

|                | Target Positive    | Target Negative    |                                  |
| Model Positive | 70                 | 20                 | Positive Predictive Value = 0.78 |
| Model Negative | 30                 | 80                 | Negative Predictive Value = 0.73 |
|                | Sensitivity = 0.70 | Specificity = 0.80 | Accuracy = 0.75                  |
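The numbers above can be reproduced directly from the cell counts, assuming the labeling a = true positives, b = false positives, c = false negatives, d = true negatives:

```python
# Cell counts from the worked confusion matrix above.
a, b, c, d = 70, 20, 30, 80

ppv         = a / (a + b)              # positive predictive value (precision)
npv         = d / (c + d)              # negative predictive value
sensitivity = a / (a + c)              # true positive rate (recall)
specificity = d / (b + d)              # true negative rate
accuracy    = (a + d) / (a + b + c + d)
```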
Clustering Evaluation
• Evaluation (or "validation") of clustering results is as difficult as
the clustering itself.
• Popular approaches involve "internal" evaluation, where the
clustering is summarized to a single quality score, "external"
evaluation, where the clustering is compared to an existing
"ground truth" classification, "manual" evaluation by a human
expert, and "indirect" evaluation by evaluating the utility of the
clustering in its intended application.
Internal Evaluation
• When a clustering result is evaluated based on the data that was
clustered itself, this is called internal evaluation.
• These methods usually assign the best score to the algorithm
that produces clusters with high similarity within a cluster and
low similarity between clusters.
Davies–Bouldin index
• Sum of squares within cluster (SSW):

$$SSW_i = \frac{1}{m_i} \sum_{j=1}^{m_i} d(x_j, c_i)$$

where $m_i$ is the number of data points in cluster $i$ and $c_i$ is the centroid of cluster $i$.
• Sum of squares between clusters (SSB):

$$SSB_{i,j} = d(c_i, c_j)$$

• Ratio:

$$R_{i,j} = \frac{SSW_i + SSW_j}{SSB_{i,j}}$$

• Davies-Bouldin Index (DBI):

$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} R_{i,j}$$
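A self-contained sketch of the DBI computation above, assuming Euclidean distance and using the mean member-to-centroid distance as SSW:

```python
import math

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (tuples).
    Returns DBI = (1/K) * sum_i max_{j != i} (SSW_i + SSW_j) / SSB_{i,j}."""
    centroids, ssw = [], []
    for pts in clusters:
        c = tuple(sum(coord) / len(pts) for coord in zip(*pts))   # centroid c_i
        centroids.append(c)
        ssw.append(sum(math.dist(p, c) for p in pts) / len(pts))  # SSW_i
    K = len(clusters)
    total = 0.0
    for i in range(K):
        total += max(
            (ssw[i] + ssw[j]) / math.dist(centroids[i], centroids[j])  # R_{i,j}
            for j in range(K) if j != i
        )
    return total / K
```

Smaller is better: two tight unit-square clusters whose centroids are far apart, e.g. centered at (0.5, 0.5) and (10.5, 10.5), yield a DBI of 0.1.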
External evaluation
• Confusion Matrix
Thank You