
STMIK Amikom Purwokerto

“Sarana Pasti Meraih Prestasi”

A Roadmap for Building a Machine Learning System
Wiga Maulana Baihaqi, S.Kom., M.Eng.

7/17/20 Wiga Maulana Baihaqi, S.Kom., M.Eng. 1



Agenda for Today’s Session


Process of Machine Learning

Data

Algorithms in Machine Learning

Splitting Data

Evaluation


Process of Machine Learning

[Diagram] Dataset → Machine Learning Algorithm → Knowledge (Patterns) → Evaluation

DATA PRE-PROCESSING: Data Cleaning, Data Reduction
Tasks: Prediction, Classification, Clustering, Association


Data

[Figure: anatomy of a dataset]
• Columns are Attributes/Features; the target column is the Class/Label/Target
• Rows are Records/Objects/Samples/Tuples
• Attribute values can be Nominal or Numeric


Data Cleaning
• Data cleaning is one of those things that everyone does but no one
really talks about. Sure, it’s not the "sexiest" part of machine
learning. And no, there aren’t hidden tricks and secrets to
uncover.
• However, proper data cleaning can make or break your project.
Professional data scientists usually spend a very large portion of
their time on this step.
• Why? Because of a simple truth in machine learning:
Better data beats fancier algorithms.


1. Remove Unwanted Observations


• The first step to data cleaning is removing unwanted
observations from your dataset.
• This includes duplicate or irrelevant observations.


Remove Unwanted Observations: Duplicate Observations
• Duplicate observations most frequently arise during data
collection, such as when you:
• Combine datasets from multiple places
• Scrape data
• Receive data from clients/other departments
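In pandas, fully duplicated rows can be dropped in one call. A minimal sketch with made-up data (the `id` and `roof` columns are hypothetical, imagined as the result of combining two sources):

```python
import pandas as pd

# Hypothetical result of combining two sources: record id=2 appears twice.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "roof": ["Asphalt", "Composition", "Composition", "Shake Shingle"],
})

# Keep only the first occurrence of each fully duplicated row.
deduped = df.drop_duplicates()
print(len(deduped))  # 3 rows remain
```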


Remove Unwanted Observations: Irrelevant Observations
• Irrelevant observations are those that don’t
actually fit the specific problem that you’re
trying to solve.
• For example, if you were building a model for
Single-Family homes only, you wouldn't want
observations for Apartments in there.
• This is also a great time to review your charts
from Exploratory Analysis. You can look at the
distribution charts for categorical features to
see if there are any classes that shouldn’t be
there.

2. Fix Structural Errors


In a categorical feature you might find inconsistent capitalization and labels:
• 'composition' is the same as
'Composition'
• 'asphalt' should be 'Asphalt'
• 'shake-shingle' should be 'Shake
Shingle'
• 'asphalt,shake-shingle' could
probably just be 'Shake Shingle' as
well
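These fixes amount to mapping each inconsistent label to one canonical class. A minimal pandas sketch, assuming a hypothetical column named `roof` that holds the values listed above:

```python
import pandas as pd

# Hypothetical 'roof' column with the inconsistent labels from the slide.
df = pd.DataFrame({"roof": ["composition", "Composition", "asphalt",
                            "shake-shingle", "asphalt,shake-shingle"]})

# Map each variant to its canonical class.
df["roof"] = df["roof"].replace({
    "composition": "Composition",
    "asphalt": "Asphalt",
    "shake-shingle": "Shake Shingle",
    "asphalt,shake-shingle": "Shake Shingle",
})
print(sorted(df["roof"].unique()))
```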



3. Filter Unwanted Outliers
• Outliers can cause problems for certain types of models, but remove an
outlier only when you have a legitimate reason, such as a measurement or
data-entry error, not simply because it is a large value.


4. Handle Missing Data


• Real-world data often has missing values.
• Data can have missing values for a number of reasons, such as
observations that were not recorded and data corruption.
• Handling missing data is important because many machine learning
algorithms do not support data with missing values.


How to Handle Missing Values


• Dropping observations that
have missing values
• Imputing the missing values
based on other observations


Example: Dropping

Before ('?' marks a missing value):
Sepal Length  Sepal Width  Petal Length  Petal Width  Type
5.1           ?            1.4           0.2          Iris-Setosa
4.9           3.0          1.4           0.2          Iris-Setosa
?             3.2          1.3           ?            Iris-versicolor
4.6           3.6          1.5           0.3          Iris-Virginica

After dropping rows with missing values:
Sepal Length  Sepal Width  Petal Length  Petal Width  Type
4.9           3.0          1.4           0.2          Iris-Setosa
4.6           3.6          1.5           0.3          Iris-Virginica
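The same dropping step can be sketched in pandas, encoding each '?' from the table above as NaN:

```python
import numpy as np
import pandas as pd

# The slide's table, with '?' represented as NaN.
df = pd.DataFrame(
    {"Sepal Length": [5.1, 4.9, np.nan, 4.6],
     "Sepal Width": [np.nan, 3.0, 3.2, 3.6],
     "Petal Length": [1.4, 1.4, 1.3, 1.5],
     "Petal Width": [0.2, 0.2, np.nan, 0.3],
     "Type": ["Iris-Setosa", "Iris-Setosa", "Iris-versicolor", "Iris-Virginica"]})

# Drop every row that contains at least one missing value.
cleaned = df.dropna()
print(cleaned["Type"].tolist())  # ['Iris-Setosa', 'Iris-Virginica']
```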


Disadvantages of the Dropping Method?


Example: Imputing
• A constant value that has meaning within the domain, such as 0,
distinct from all other values.
• A value from another randomly selected record.
• A mean, median or mode value for the column.
• A value estimated by another predictive model.
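Three of these imputation options can be sketched with pandas `fillna` on a small made-up column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries.
s = pd.Series([5.1, 4.9, np.nan, 4.6, np.nan])

constant_imp = s.fillna(0)          # a constant with meaning in the domain
mean_imp = s.fillna(s.mean())       # mean of the observed values
median_imp = s.fillna(s.median())   # median of the observed values

print(round(float(s.mean()), 2))    # 4.87
```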


Which method is more accurate?


Data Reduction
• Dimension/Data Reduction is the process of converting data with a
large number of dimensions into data with fewer dimensions, while
still conveying similar information.
• These techniques are typically used when solving machine learning
problems to obtain better features for a classification or
regression task.


Data Reduction
• Consider two dimensions x1 and x2, which are, say, measurements
of several objects in centimeters (x1) and inches (x2).
• If you use both dimensions in machine learning, they convey
similar information and introduce a lot of noise into the system,
so you are better off using just one.
• Here we convert the data from 2D (x1 and x2) to 1D (z1), which
makes the data easier to explain.


What are the benefits of Data Reduction?


• It helps compress the data, reducing the storage space required.
• It reduces the time required to perform the same computations.
Fewer dimensions mean less computing, and may also allow the use
of algorithms unfit for a large number of dimensions.
• It takes care of multicollinearity, which improves model
performance by removing redundant features. For example, there is
no point in storing a value in two different units (e.g.,
centimeters and inches).

What are the benefits of Data Reduction? (continued)
• Reducing the data to 2D or 3D allows us to plot and visualize it,
so you can observe patterns more clearly. For example, 3D data can
be converted into 2D by first identifying a suitable 2D plane and
then representing the points on the two new axes z1 and z2.
• It also helps with noise removal, which in turn can improve the
performance of models.


What are the common methods to perform Dimension Reduction?
• Missing Values
• Low Variance
• Decision Trees
• Random Forest
• PCA (Principal Component Analysis)
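PCA, the last method above, can be sketched with scikit-learn on made-up data mirroring the earlier cm/inches example (the synthetic measurements are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cm = rng.normal(10, 3, 100)            # x1: lengths in centimeters
X = np.column_stack([cm, cm / 2.54])   # x2: the same lengths in inches

# Project the two redundant dimensions onto a single component z1.
pca = PCA(n_components=1)
z1 = pca.fit_transform(X)
print(z1.shape)  # (100, 1)
```

Because x2 is an exact linear function of x1, the single component retains essentially all of the variance.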


Splitting Dataset
• Validation techniques are motivated by two fundamental problems
in pattern recognition: model selection and performance estimation.
• Model Selection
 Almost invariably, all pattern recognition techniques have one or
more free parameters
• The number of neighbors in a kNN classification rule.
• The network size, learning parameters and weights in MLPs.
 How do we select the “optimal” parameter(s) or model for a
given classification problem?


Continue...
• Performance estimation
 Once we have chosen a model, how do we estimate its
performance?
 Performance is typically measured by the TRUE ERROR RATE,
the classifier’s error rate on the ENTIRE POPULATION


The Holdout Method


• Split dataset into two groups
• Training set: used to train the classifier
• Test set: used to estimate the error rate of the trained classifier
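The holdout split can be sketched with scikit-learn's `train_test_split`; here 30% of the Iris dataset is held out for testing (the 30% ratio and `random_state` are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 examples

# Hold out 30% as a test set; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 105 45
```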


The Holdout Method


• The holdout method has two basic drawbacks
• In problems where we have a sparse dataset, we may not be able
to afford the “luxury” of setting aside a portion of the dataset
for testing
• Since it is a single train-and-test experiment, the holdout
estimate of error rate will be misleading if we happen to get an
“unfortunate” split


Random Subsampling
• Random Subsampling performs K data splits of the dataset
• Each split randomly selects a (fixed) number of examples without
replacement
• For each data split we retrain the classifier from scratch with the
training examples and estimate Ei with the test examples
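scikit-learn's `ShuffleSplit` implements this scheme: K independent random splits, each drawn without replacement. A minimal sketch on a tiny made-up dataset (sizes chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 hypothetical examples

# K = 5 random splits, each holding out 30% of the examples for testing.
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
sizes = [(len(train), len(test)) for train, test in ss.split(X)]
print(sizes[0])  # (7, 3)
```

In practice, the classifier would be retrained from scratch inside the loop and the error Ei estimated on each test portion.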


K-Fold Cross Validation


• Create a K-fold partition of the dataset
• For each of K experiments, use K-1 folds for training and the remaining one for
testing

• K-Fold Cross validation is similar to Random Subsampling


• The advantage of K-Fold Cross validation is that all the examples in the dataset are
eventually used for both training and testing
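A K-fold partition can be sketched with scikit-learn's `KFold` (K = 3 and the toy data are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 hypothetical examples

# 3 folds: each experiment trains on K-1 = 2 folds, tests on the remaining one.
folds = list(KFold(n_splits=3).split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Across the K experiments, every example appears in a test fold exactly once, which is the advantage noted above.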


How many folds are needed?


• With a large number of folds
+ The bias of the true error rate estimator will be small (the estimator will be very
accurate)
- The variance of the true error rate estimator will be large
- The computational time will be very large as well (many experiments)
• With a small number of folds
+ The number of experiments and, therefore, computation time are reduced
+ The variance of the estimator will be small
- The bias of the estimator will be large (conservative or higher than the true error
rate)

Leave-one-out Cross Validation


• Leave-one-out is the degenerate case of K-Fold Cross Validation,
where K is chosen as the total number of examples
• For a dataset with N examples, perform N experiments
• For each experiment use N-1 examples for training and the
remaining example for testing
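Leave-one-out is available directly in scikit-learn; a minimal sketch with N = 4 made-up examples:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(8).reshape(4, 2)  # N = 4 hypothetical examples

n_experiments = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Each experiment trains on N-1 = 3 examples, tests on the remaining one.
    assert len(train_idx) == 3 and len(test_idx) == 1
    n_experiments += 1
print(n_experiments)  # 4 experiments for N = 4
```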


Algorithms of Machine Learning

Machine Learning
• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering


Knowledge (Patterns)
1. Formula/Function (regression formula or function)
• TRAVEL TIME = 0.48 + 0.6 DISTANCE + 0.34 TRAFFIC LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation Level
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu
5. Cluster


Evaluation
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index
• External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)


Regression Evaluation
• Regression evaluation measures how well your model performed, by
measuring the difference between predicted values and the actual
values.
• Let’s say you feed a model some input X and your model predicts
10, but the actual value is 5.
• This difference between your prediction (10) and the actual
observation (5) is the error term: (y_prediction - y_actual).
• The error term is important because we usually want to minimize
the error; in other words, we want predictions that are very close
to the actual values.


Root Mean Squared Error (RMSE)


• RMSE is a quadratic scoring rule that also measures the average
magnitude of the error. It’s the square root of the average of
squared differences between prediction and actual observation.
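RMSE is easy to compute by hand with NumPy. A minimal sketch on made-up values that include the 10-versus-5 example from the previous slide:

```python
import numpy as np

y_actual = np.array([5.0, 3.0, 4.0])
y_pred = np.array([10.0, 3.0, 2.0])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_pred - y_actual) ** 2))
print(round(rmse, 3))  # 3.109
```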


Mean Absolute Error (MAE)


• MAE measures the average magnitude of the errors in a set of
predictions, without considering their direction. It’s the average
over the test sample of the absolute differences between
prediction and actual observation where all individual differences
have equal weight.
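MAE, by contrast, averages the absolute errors directly. The same made-up values as in the RMSE sketch:

```python
import numpy as np

y_actual = np.array([5.0, 3.0, 4.0])
y_pred = np.array([10.0, 3.0, 2.0])

# Average the absolute errors; every difference gets equal weight.
mae = np.mean(np.abs(y_pred - y_actual))
print(round(mae, 3))  # 2.333
```

Note that RMSE (3.109 here) is larger than MAE (2.333): squaring penalizes the single large error of 5 more heavily.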


Coefficient of Determination
• The coefficient of determination (R2) summarizes the explanatory
power of the regression model and is computed from the
sums-of-squares terms.


Coefficient of Determination
• R2 describes the proportion of variance of the dependent variable
explained by the regression model.
• If the regression model is “perfect”, SSE is zero, and R2 is 1.
• If the regression model is a total failure, SSE is equal to SST, no
variance is explained by regression, and R2 is zero.
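The SSE/SST relationship above can be sketched with NumPy on made-up values (a near-perfect hypothetical fit, so R2 should be close to 1):

```python
import numpy as np

y_actual = np.array([5.0, 3.0, 4.0, 6.0])
y_pred = np.array([4.8, 3.4, 4.1, 5.7])

sse = np.sum((y_actual - y_pred) ** 2)           # unexplained variation
sst = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
r2 = 1 - sse / sst                               # 1 when SSE = 0, 0 when SSE = SST
print(round(r2, 3))  # 0.94
```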


Classification Evaluation
• When we get the data, after data cleaning, pre-processing and
wrangling, the first step is to feed it to a model and, of course,
get output, often as probabilities.
• But hold on! How can we measure the effectiveness of our model?
The better the effectiveness, the better the performance, and
that’s exactly what we want.
• This is where the confusion matrix comes into the limelight. The
confusion matrix is a performance measurement for machine
learning classification.


Confusion Matrix
• It is a performance measurement for machine learning
classification problems where the output can be two or more
classes. For a two-class problem, it is a table with 4 different
combinations of predicted and actual values.



Let’s understand TP, FP, FN, TN in terms of a pregnancy analogy.
• True Positive:
Interpretation: You predicted positive and it’s true.
You predicted that a woman is pregnant and she actually is.
• True Negative:
Interpretation: You predicted negative and it’s true.
You predicted that a man is not pregnant and he actually is not.


• False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
You predicted that a man is pregnant but he actually is not.
• False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
You predicted that a woman is not pregnant but she actually is.


How to Calculate a Confusion Matrix for a 2-class classification problem?

Confusion Matrix              Target
                        Positive    Negative
Model     Positive         a           b       Positive Predictive Value = a/(a+b)
          Negative         c           d       Negative Predictive Value = d/(c+d)
                       Sensitivity  Specificity
                        a/(a+c)      d/(b+d)   Accuracy = (a+d)/(a+b+c+d)


Terminology in Confusion Matrix


• Accuracy: the proportion of the total number of predictions that
were correct.
• Positive Predictive Value or Precision: the proportion of
predicted positive cases that were correct.
• Negative Predictive Value: the proportion of predicted negative
cases that were correct.
• Sensitivity or Recall: the proportion of actual positive cases
which are correctly identified.
• Specificity: the proportion of actual negative cases which are
correctly identified.
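These five metrics follow directly from the a/b/c/d cell counts (a = TP, b = FP, c = FN, d = TN). A minimal sketch, using the worked counts 70/20/30/80 that appear later in this deck:

```python
def confusion_metrics(a, b, c, d):
    """Metrics from confusion-matrix cells: a=TP, b=FP, c=FN, d=TN."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + b),     # positive predictive value
        "npv": d / (c + d),           # negative predictive value
        "sensitivity": a / (a + c),   # recall
        "specificity": d / (b + d),
    }

m = confusion_metrics(70, 20, 30, 80)
print(round(m["accuracy"], 2))  # 0.75
```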

Example: Confusion Matrix

Confusion Matrix              Target
                        Positive    Negative
Model     Positive         70          20      Positive Predictive Value = 0.78
          Negative         30          80      Negative Predictive Value = 0.73
                       Sensitivity  Specificity
                          0.70        0.80     Accuracy = 0.75
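In practice, the matrix is computed from the true and predicted label vectors. A sketch with hypothetical labels constructed to reproduce these counts (70 TP, 20 FP, 30 FN, 80 TN; 1 = positive, 0 = negative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical label vectors reproducing the counts in the table above.
y_true = np.array([1] * 70 + [0] * 20 + [1] * 30 + [0] * 80)
y_pred = np.array([1] * 70 + [1] * 20 + [0] * 30 + [0] * 80)

cm = confusion_matrix(y_true, y_pred)
print(cm)                              # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))  # 0.75
```

Note that scikit-learn orders the matrix with actual classes as rows (negatives first), so the layout is transposed relative to the slide's table.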



Clustering Evaluation
• Evaluation (or "validation") of clustering results is as difficult as
the clustering itself.
• Popular approaches include:
• "Internal" evaluation, where the clustering is summarized to a single quality score
• "External" evaluation, where the clustering is compared to an existing "ground truth" classification
• "Manual" evaluation by a human expert
• "Indirect" evaluation, which evaluates the utility of the clustering in its intended application


Internal Evaluation
• When a clustering result is evaluated based on the data that was
clustered itself, this is called internal evaluation.
• These methods usually assign the best score to the algorithm
that produces clusters with high similarity within a cluster and
low similarity between clusters.


Davies–Bouldin index
• Sum of squares within cluster (SSW):
  $SSW_i = \frac{1}{m_i} \sum_{j=1}^{m_i} d(x_j, c_i)$
  where $m_i$ is the number of data points in cluster $i$ and $c_i$ is the centroid of cluster $i$.
• Sum of squares between clusters (SSB): the distance between centroids,
  $SSB_{i,j} = d(c_i, c_j)$
• Ratio:
  $R_{i,j} = \frac{SSW_i + SSW_j}{SSB_{i,j}}$
• Davies–Bouldin Index (DBI):
  $DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{i \neq j} R_{i,j}$
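scikit-learn ships this index as `davies_bouldin_score`, built on the same within-cluster dispersion and between-centroid distances. A minimal check on made-up data: two tight, well-separated clusters should give a low score (lower is better).

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated hypothetical clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = [0, 0, 1, 1]

score = davies_bouldin_score(X, labels)
print(round(score, 4))  # a small value, reflecting good separation
```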

External evaluation
• Confusion Matrix


Thank You

Any Questions?
