3-2 A Roadmap For Building Machine Learning
Data
Splitting Data
Evaluation
[Diagram: Dataset → Machine Learning Algorithm → Knowledge → Evaluation]
[Diagram: data terminology — each column is an Attribute/Feature, the predicted column is the Class/Label/Target, each row is a Record/Object/Sample/Tuple; attribute types: Nominal, Numeric]
Data Cleaning
• Data cleaning is one of those things that everyone does but no one
really talks about. Sure, it’s not the "sexiest" part of machine
learning. And no, there aren’t hidden tricks and secrets to
uncover.
• However, proper data cleaning can make or break your project.
Professional data scientists usually spend a very large portion of
their time on this step.
• Why? Because of a simple truth in machine learning:
Better data beats fancier algorithms.
Example: Dropping
| Sepal Length | Sepal Width | Petal Length | Petal Width | Type            |
| 5.1          | ?           | 1.4          | 0.2         | Iris-setosa     |
| 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa     |
| ?            | 3.2         | 1.3          | ?           | Iris-versicolor |
| 4.6          | 3.6         | 1.5          | 0.3         | Iris-virginica  |
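A minimal pandas sketch of the dropping strategy, reconstructing the table above with "?" read as missing (NaN); the column names are assumptions:

```python
import pandas as pd
import numpy as np

# The Iris sample from the table above; "?" becomes NaN.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 4.6],
    "sepal_width":  [np.nan, 3.0, 3.2, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5],
    "petal_width":  [0.2, 0.2, np.nan, 0.3],
    "type": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Drop every record that contains at least one missing attribute.
clean = df.dropna()
```

Only the two complete records survive, which shows the cost of dropping: rows with any missing value are lost entirely.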
Example: Imputing
• A constant value that has meaning within the domain, such as 0,
distinct from all other values.
• A value from another randomly selected record.
• A mean, median or mode value for the column.
• A value estimated by another predictive model.
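A brief sketch of the first three strategies on a hypothetical column with one missing value:

```python
import pandas as pd
import numpy as np

# Hypothetical column with one missing value.
col = pd.Series([5.1, 4.9, np.nan, 4.6], name="sepal_length")

imputed_const = col.fillna(0)             # a domain constant distinct from real values
imputed_mean  = col.fillna(col.mean())    # mean of the observed values
imputed_med   = col.fillna(col.median())  # median is more robust to outliers
```

Imputation from a randomly selected record or from a predictive model follows the same `fillna` pattern, only the fill value differs.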
Data Reduction
• Dimension/Data Reduction refers to the process of converting a
set of data having vast dimensions into data with fewer
dimensions, while ensuring that it conveys similar information
concisely.
• These techniques are typically used while solving machine
learning problems to obtain better features for a classification or
regression task.
Data Reduction
• Let’s look at the image shown below. It shows
2 dimensions, x1 and x2, which are, let us say,
measurements of several objects in cm (x1)
and inches (x2).
• Now, if you were to use both these
dimensions in machine learning, they would
convey similar information and introduce a
lot of noise into the system, so you are better off
just using one dimension.
• Here we have converted the dimensionality of the
data from 2D (x1 and x2) to 1D (z1),
which has made the data relatively easier
to explain.
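A sketch of the cm/inches example on synthetic data (the measurements are made up): PCA via SVD projects the two correlated dimensions onto a single axis z1 that captures nearly all the variance.

```python
import numpy as np

# x1 in cm and x2 in inches measure the same length, so the two
# columns are almost perfectly correlated.
rng = np.random.default_rng(0)
x1 = rng.uniform(10, 50, size=100)           # cm
x2 = x1 / 2.54 + rng.normal(0, 0.1, 100)     # inches, plus small noise

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                      # center the data

# PCA via SVD: z1 is the projection onto the first principal axis.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
z1 = Xc @ Vt[0]

# Fraction of total variance captured by the first component.
explained = S[0] ** 2 / (S ** 2).sum()
```

Because the two dimensions are nearly redundant, the first component explains almost 100% of the variance, so z1 alone is enough.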
Splitting Dataset
• Validation techniques are motivated by two fundamental
problems in pattern recognition: model selection and
performance estimation.
• Model Selection
Almost invariably, all pattern recognition techniques have one
or more free parameters
• The number of neighbors in a kNN classification rule.
• The network size, learning parameters and weights in MLPs.
How do we select the “optimal” parameter(s) or model for a
given classification problem?
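As a sketch of model selection, here is a toy 1-D kNN classifier whose number of neighbors k is chosen by validation error (the data, the split, and the candidate values of k are all made up for illustration):

```python
from collections import Counter

def knn_predict(train, k, x):
    """train: list of (value, label); classify x by majority vote of the k nearest."""
    nearest = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training and validation sets: two well-separated classes.
train = [(0.1, "A"), (0.2, "A"), (0.3, "A"), (0.9, "B"), (1.0, "B"), (1.1, "B")]
val   = [(0.15, "A"), (0.95, "B"), (0.25, "A")]

def val_error(k):
    """Fraction of validation examples misclassified with this k."""
    return sum(knn_predict(train, k, x) != y for x, y in val) / len(val)

# Model selection: pick the k with the lowest validation error.
best_k = min([1, 3, 5], key=val_error)
```

The same select-by-held-out-error pattern applies to any free parameter, e.g. network size or learning rate in an MLP.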
Monday, 11 November 2019 · Lecturer Name
STMIK Amikom Purwokerto
“Sarana Pasti Meraih Prestasi”
Continued...
• Performance estimation
Once we have chosen a model, how do we estimate its
performance?
Performance is typically measured by the TRUE ERROR RATE,
the classifier’s error rate on the ENTIRE POPULATION
Random Subsampling
• Random Subsampling performs K data splits of the dataset
• Each split randomly selects a (fixed) number of examples without
replacement
• For each data split we retrain the classifier from scratch with the
training examples and estimate Ei with the test examples
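The procedure above can be sketched as follows; `train_and_eval` is a hypothetical callback that retrains the classifier from scratch on the training indices and returns the test-set error E_i:

```python
import random

def random_subsampling(data, k, test_size, train_and_eval, seed=0):
    """Perform K random splits of the dataset and return the mean of the
    K test-error estimates E_i."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    errors = []
    for _ in range(k):
        rng.shuffle(idx)                                 # a fresh random split each round
        test_idx, train_idx = idx[:test_size], idx[test_size:]
        errors.append(train_and_eval(train_idx, test_idx))  # retrain from scratch
    return sum(errors) / k
```

With a callback that simply reports the test fraction, the estimate equals that fraction, which is an easy sanity check on the averaging.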
Supervised Learning vs. Unsupervised Learning
Knowledge (Patterns)
1. Formula/Function (regression formula)
• TRAVEL TIME = 0.48 + 0.6 DISTANCE + 0.34 TRAFFIC LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation Level
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu (if the third-semester GPA is 2.8, the student graduates on time)
5. Cluster
Evaluation
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index,
• External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
Regression Evaluation
• An error metric is a measure of how well your model performed. It
does this by measuring the difference between the predicted
values and the actual values.
• Let’s say you feed a model some input X and your
model predicts 10, but the actual value is 5.
• The difference between your prediction (10) and the
actual observation (5) is the error term: (y_prediction
- y_actual).
• The error term is important because we usually want
to minimize the error — in other words, we want our predictions
to be as close as possible to the actual values.
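A minimal sketch of the error term and its aggregation into RMSE (the single prediction is the toy example above; the three-sample vectors are made up):

```python
import math

# The toy example: prediction 10, actual 5.
error = 10.0 - 5.0          # (y_prediction - y_actual)

def rmse(pred, actual):
    """Root Mean Square Error over paired predictions and actuals."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

# Over several samples the individual errors are aggregated.
score = rmse([10, 4, 6], [5, 4, 8])
```

Squaring penalizes large errors more heavily than small ones, which is why RMSE (and MSE) are the default choices listed on the Evaluation slide.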
Coefficient of Determination
• The coefficient of determination (R2) summarizes the explanatory
power of the regression model and is computed from the sums-
of-squares terms.
Coefficient of Determination
• R2 describes the proportion of variance of the dependent
variable explained by the regression model.
• If the regression model is “perfect”, SSE is zero, and R2 is 1.
• If the regression model is a total failure, SSE is equal to SST, no
variance is explained by regression, and R2 is zero.
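The verbal description corresponds to the standard sums-of-squares identity (standard notation; the slide names SSE and SST but does not spell the formula out):

```latex
SST = SSR + SSE, \qquad
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
```

So a "perfect" model (SSE = 0) gives R² = 1, and a total failure (SSE = SST) gives R² = 0, exactly as stated above.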
Classification Evaluation
• When we get the data, after data cleaning, pre-processing and
wrangling, the first step we do is to feed it to an outstanding
model and of course, get output in probabilities.
• But hold on! How in the hell can we measure the effectiveness of
our model? The better the effectiveness, the better the performance,
and that’s exactly what we want.
• And that is where the confusion matrix comes into the limelight.
The Confusion Matrix is a performance measurement for machine
learning classification.
Confusion Matrix
• Well, it is a performance measurement for machine learning
classification problems where the output can be two or more classes.
It is a table with 4 different combinations of predicted and actual
values.
Confusion Matrix

|                | Target Positive       | Target Negative       |                                     |
| Model Positive | a                     | b                     | Positive Predictive Value = a/(a+b) |
| Model Negative | c                     | d                     | Negative Predictive Value = d/(c+d) |
|                | Sensitivity = a/(a+c) | Specificity = d/(b+d) | Accuracy = (a+d)/(a+b+c+d)          |
Confusion Matrix

|                | Target Positive    | Target Negative    |                                  |
| Model Positive | 70                 | 20                 | Positive Predictive Value = 0.78 |
| Model Negative | 30                 | 80                 | Negative Predictive Value = 0.73 |
|                | Sensitivity = 0.70 | Specificity = 0.80 | Accuracy = 0.75                  |
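The numbers above can be reproduced directly from the cell counts, assuming the labeling a = true positives, b = false positives, c = false negatives, d = true negatives:

```python
# Cell counts from the worked confusion matrix above.
a, b, c, d = 70, 20, 30, 80

ppv         = a / (a + b)              # positive predictive value (precision)
npv         = d / (c + d)              # negative predictive value
sensitivity = a / (a + c)              # true positive rate (recall)
specificity = d / (b + d)              # true negative rate
accuracy    = (a + d) / (a + b + c + d)
```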
Clustering Evaluation
• Evaluation (or "validation") of clustering results is as difficult as
the clustering itself.
• Popular approaches involve "internal" evaluation, where the
clustering is summarized to a single quality score, "external"
evaluation, where the clustering is compared to an existing
"ground truth" classification, "manual" evaluation by a human
expert, and "indirect" evaluation by evaluating the utility of the
clustering in its intended application.
Internal Evaluation
• When a clustering result is evaluated based on the data that was
clustered itself, this is called internal evaluation.
• These methods usually assign the best score to the algorithm
that produces clusters with high similarity within a cluster and
low similarity between clusters.
Davies–Bouldin index
• Sum of squares within cluster (SSW):

$$SSW_i = \frac{1}{m_i} \sum_{j=1}^{m_i} d(x_j, c_i)$$

where $m_i$ is the number of data points in cluster $i$ and $c_i$ is the centroid of cluster $i$.
• Sum of squares between clusters (SSB):

$$SSB_{i,j} = d(c_i, c_j)$$

• Ratio:

$$R_{i,j} = \frac{SSW_i + SSW_j}{SSB_{i,j}}$$

• Davies-Bouldin Index (DBI):

$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} R_{i,j}$$
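A self-contained sketch of the DBI computation above, assuming Euclidean distance and using the mean member-to-centroid distance as SSW:

```python
import math

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (tuples).
    Returns DBI = (1/K) * sum_i max_{j != i} (SSW_i + SSW_j) / SSB_{i,j}."""
    centroids, ssw = [], []
    for pts in clusters:
        c = tuple(sum(coord) / len(pts) for coord in zip(*pts))   # centroid c_i
        centroids.append(c)
        ssw.append(sum(math.dist(p, c) for p in pts) / len(pts))  # SSW_i
    K = len(clusters)
    total = 0.0
    for i in range(K):
        total += max(
            (ssw[i] + ssw[j]) / math.dist(centroids[i], centroids[j])  # R_{i,j}
            for j in range(K) if j != i
        )
    return total / K
```

Smaller is better: two tight unit-square clusters whose centroids are far apart, e.g. centered at (0.5, 0.5) and (10.5, 10.5), yield a DBI of 0.1.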
External evaluation
• Confusion Matrix
Thank You