
3 Modeling and Evaluation

3.1 Selecting a Model (Predictive/Descriptive)


3.1.1 Predictive Models
3.1.2 Descriptive Models
3.2 Training a model for supervised learning
3.2.1 Holdout Method
3.3 Model representation and interpretability
3.3.1 Overfitting
3.3.2 Underfitting
3.3.3 Bias-variance trade-off
3.4 Evaluating Performance of a Model
3.4.1 Confusion Matrix
3.5 Improving performance of a model
3.5.1 Ensemble Approach
3.5.2 Model Parameter Tuning
Question Bank

Model: A structured representation of raw input data into a meaningful pattern is called a model.

Model Training: The process of fitting a specific model to a data set is called model training.

Target Function: The target function of a model is the function defining the relationship between the input (also called predictor or independent) variables and the output (also called response or dependent or target) variable. It is represented in the general form Y = f(X) + e, where Y is the output variable, X represents the input variables and 'e' is a random error term.

3.1 Selecting a Model (Predictive/Descriptive)
Machine learning algorithms are broadly of two types:
1. Models for supervised learning, which primarily focus on solving predictive problems.
2. Models for unsupervised learning, which solve descriptive problems.

3.1.1 Predictive Models
Models for the supervised learning approach are Predictive models. They try to predict a certain value using the values in an input data set.
The predictive models have a clear focus on what they want to learn and how they want to learn it.
Predictive models may need to predict the value of a category or class to which a data instance belongs.
Examples:
1. Predicting win/loss in a basketball match.
2. Predicting whether a bank transaction is fraud.
The models which are used for prediction of target features of categorical value are known as classification models. The target feature is known as a class, and the categories into which classes are divided are called levels. Examples of classification models are k-Nearest Neighbor (kNN), Naïve Bayes, and Decision Tree.
Predictive models may also be used to predict numerical values of the target feature based on the predictor features.
Examples:
1. Prediction of financial growth in the coming year.
2. Prediction of a heat wave in the summer.
The models which are used for prediction of the numerical value of the target feature of a data instance are known as regression models. Linear Regression and Logistic Regression models are examples of regression models.

3.1.2 Descriptive Models
Models for the unsupervised learning approach are Descriptive models. They are used to describe a data set or gain insight from a data set.
There is no target feature or single feature of interest in case of unsupervised learning. Based on the values of all features, interesting patterns or insights are derived about the data set.
Descriptive models which group together similar data instances are called clustering models. The most popular model for clustering is k-Means.
Examples:
1. Customer grouping based on society
2. Movie grouping based on language and age
Descriptive models related to pattern discovery are used for market basket analysis of transactional data.
Market Basket Analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
For example, if you are in a store and you buy milk and don't buy bread, you are more likely to buy fruits at the same time than somebody who didn't buy bread.
The set of items a customer buys is referred to as an item set, and market basket analysis seeks to find relationships between purchases.
Typically, the relationship will be in the form of a rule like:
IF {milk, butter} THEN {bread}.
The probability that a customer will buy milk and butter (i.e. that the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer who has bought milk and butter will also purchase bread is referred to as the confidence.
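As an illustration of how support and confidence are computed, the following minimal sketch (written for this chapter, not taken from any association-rule library) counts item sets in a small, made-up list of transactions for the rule IF {milk, butter} THEN {bread}.

# A minimal sketch: support and confidence for IF {milk, butter} THEN {bread}
# on a small, invented transaction list.
transactions = [
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"milk", "bread"},
    {"butter", "bread"},
    {"milk", "butter", "bread", "fruits"},
]

antecedent = {"milk", "butter"}
consequent = {"bread"}

n = len(transactions)
antecedent_count = sum(1 for t in transactions if antecedent <= t)
both_count = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = antecedent_count / n               # how often the antecedent occurs
confidence = both_count / antecedent_count   # P(bread | milk and butter bought)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")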
3.2 Training a model for supervised learning

3.2.1 Holdout Method
The hold-out method involves splitting the data into multiple parts and using one part for training the model and the rest for validating and testing it. It can be used for both model evaluation and model selection.
Model evaluation using the hold-out method entails splitting the dataset into training and test datasets, evaluating model performance, and determining the most optimal model. The objective of this technique is to select the best model based on its accuracy on the testing dataset and compare it with other models. Models are trained to improve model accuracy on test datasets based on the assumption that the test dataset represents the population. The diagram below illustrates the hold-out method for model evaluation.

[Figure: Holdout method — the dataset is split into a training dataset used to train the model and a testing dataset used to evaluate it.]
There are two parts to the dataset in the diagram above. One split is held aside as a training set. Another set is held back for testing or evaluation of the model. The percentage of the split is determined based on the amount of training data available. A typical split of 70-30% is used, in which 70% of the dataset is used for training and 30% is used for testing the model.
Follow the steps below for using the hold-out method for model evaluation:
1. Split the dataset in two (preferably 70-30%; however, the split percentage can vary and should be random).
2. Now, we train the model on the training dataset by selecting some fixed set of hyper-parameters while training the model.
3. Use the hold-out test dataset to evaluate the model.
4. Use the entire dataset to train the final model so that it can generalize better on future datasets.
In this process, the dataset is split into training and test sets, and a fixed set of hyper-parameters is used to evaluate the model, as sketched in the code below.
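A minimal sketch of the hold-out procedure using scikit-learn; the Iris dataset, the decision tree, and the 70-30 split are illustrative choices and not prescribed by the text.

# Minimal hold-out sketch with scikit-learn (Iris is only a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: random 70-30 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, shuffle=True)

# Step 2: train the model with a fixed set of hyper-parameters.
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: retrain on the entire dataset before deployment.
final_model = DecisionTreeClassifier(max_depth=3).fit(X, y)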
3.2.2 K-fold cross validation method
In machine learning, we cannot simply fit a model on the training data and claim that it will work accurately on real data. For this, we have to ensure that our model has captured the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique.
Cross validation is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation set. Finally, the results from each validation step are averaged to produce a more robust estimate of the model's performance.
The main purpose of cross validation is to overcome overfitting, which occurs when a model is trained too well on the training data and performs poorly on new, unseen data.
In the K-Fold Cross Validation method, we split the data set into k subsets (known as folds), perform training on k-1 of the subsets, and leave one subset out for the evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
The general steps to implement K-Fold cross validation are as follows:
Step 1: Shuffle the dataset randomly.
Step 2: Split the dataset into k groups.
Step 3: For each unique group:
a. Take the group as a hold-out or test data set
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
Step 4: Summarize the skill of the model using the sample of model evaluation scores.
[Figure: K-Fold cross validation — in each of the k iterations a different fold serves as the validation fold while the remaining folds are used for training, and the k performance scores are then averaged.]
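The steps above can be sketched with scikit-learn's KFold; the classifier, the Iris dataset and k = 5 are illustrative assumptions, not part of the original text.

# Minimal K-Fold cross validation sketch (k = 5) with scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # Steps 1-2: shuffle and split into k groups

scores = []
for train_idx, test_idx in kf.split(X):                # Step 3: each fold takes a turn as the test set
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # fit on the remaining k-1 folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("mean accuracy:", np.mean(scores))               # Step 4: summarize the evaluation scores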
Advantages of cross-validation:
• More accurate estimate of out-of-sample accuracy.
• More efficient use of data, as every observation is used for both training and testing.
• Overcomes the issue of overfitting.

Three common tactics for choosing a value for k are as follows:
• Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
• k = 5 or 10: The value for k is fixed to 5 or 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and modest variance.
• k = n: The value for k is fixed to n, where n is the size of the dataset, to give each test sample an opportunity to be used in the hold-out dataset. This approach is called leave-one-out cross-validation.

3.3 Model representation and interpretability
Interpretability focuses on the how.
The fitness of a target function approximated by a learning algorithm determines how correctly it is able to classify a set of data it has never seen.
In machine learning, the fact that a model is able to fit the training data does not mean that it will perform well on testing data. This disparity between the performance on the training and test data is called the Generalization Gap. It is common in machine learning problems to observe a gap between the training and testing performance.

[Figure: Underfitting, overfitting and the ideal balance.]

3.3.1 OVERFITTING
Overfitting is a scenario where the machine learning model tries to learn from the data along with the noise present in the data and tries to fit each data point on the curve.
Overfitting is a fundamental issue in supervised machine learning which prevents us from generalizing the model well: it fits the observed training data closely but fails on unseen data in the testing set.
When a model has high variance, it is said to overfit the data.
REASONS FOR OVERFITTING
• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high variance
• The size of training data used is not enough
• The model is too complex
METHODS TO AVOID OVERFITTING
• Use re-sampling techniques like k-fold cross-validation
• Hold back a validation data set
• Reduce features by an effective feature selection method
• Remove the nodes which have no predictive power

3.3.2 UNDERFITTING
Underfitting is a scenario where a machine learning model can neither learn the relationship between variables in the data nor predict or classify a new data point.
In order to avoid overfitting, we could stop the training at an earlier stage. But it might also lead to the model not being able to learn enough from the training data, so it may find it difficult to capture the dominant trend. This is known as underfitting.
As the model doesn't fully learn the patterns, it accepts every new data point during the prediction. An underfitting model has low variance and high bias.
Reasons for Underfitting
• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high bias
• The size of training data used is not enough
• The model is too simple
Methods to avoid Underfitting
• Use more training data

3.3.3 Bias-variance trade-off
It is important to understand prediction errors (bias and variance) when it comes to accuracy in any machine learning algorithm.
[Figure: Validation error and training error versus model complexity, showing the high-bias (too simple) and high-variance (too complex) regions.]
If the algorithm is too simple, then it may be in a high-bias and low-variance condition and thus is error-prone. If the algorithm is too complex, then it may be in a high-variance and low-bias condition. In the latter condition, the new entries will not perform well. There is something between both of these conditions, known as the Bias-Variance Trade-off.
[Figure: High variance (overfitting), high bias (underfitting), and low bias/low variance (good balance).]
This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can't be more complex and less complex at the same time. To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
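The trade-off can be seen by comparing training and validation error for models of increasing complexity. The sketch below is only an illustration: it uses synthetic data and lets polynomial degree stand in for model complexity, so a too-simple model shows high error everywhere while a too-complex one shows a large gap between training and validation error.

# Illustrative sketch: training vs. validation error as model complexity grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)  # noisy target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, validation MSE = {val_err:.3f}")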

3.4 Evaluating Performance of a Model
Classification is one of the major tasks in Supervised Learning. In classification, the number of correct and incorrect predictions made by the model is evaluated, and the accuracy of the model is calculated based on that evaluation.

3.4.1 Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a machine learning model by comparing its predictions to the true values of the dataset. It is used in classification.
It consists of four possible outcomes: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
The rows of the confusion matrix represent the actual values of the target variable, while the columns represent the predicted values of the target variable. Each cell in the matrix represents the number of instances that fall into a particular combination of actual and predicted values.
The structure of the confusion matrix is given below:

                              Predicted Values
                              Positive (1)    Negative (0)
Actual    Positive (1)        TP              FN
Values    Negative (0)        FP              TN

True positives (TP): These are cases where the model predicts a positive class, and the actual class is also positive. For example, if we're trying to classify whether an email is spam or not, a true positive would be when the model correctly predicts that an email is spam and the email is actually spam.
True negatives (TN): These are cases where the model predicts a negative class, and the actual class is also negative. For example, using the same email classification problem, a true negative would be when the model correctly predicts that an email is not spam, and the email is actually not spam.
False positives (FP): These are cases where the model predicts a positive class, but the actual class is negative. For example, in the email classification problem, a false positive would be when the model incorrectly predicts that an email is spam, but the email is actually not spam.
False negatives (FN): These are cases where the model predicts a negative class, but the actual class is positive. For example, in the email classification problem, a false negative would be when the model incorrectly predicts that an email is not spam, but the email is actually spam.
In general, we want to minimize false positives and false negatives and maximize true positives and true negatives to have a good performance of a machine learning model.
The confusion matrix summarizes these four possible outcomes and helps us to calculate various performance metrics such as accuracy, error rate, sensitivity, specificity, precision, recall, and F1-score.
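A small sketch of building a confusion matrix with scikit-learn; the two label vectors below are invented solely to illustrate the four outcomes (1 = positive, e.g. spam; 0 = negative).

# Minimal confusion matrix sketch with scikit-learn.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # true classes (made-up)
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # model predictions (made-up)

# scikit-learn returns rows as actual classes and columns as predicted classes.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")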
Accuracy: This measures the proportion of correctly classified instances out of all the instances in the dataset. It is calculated by dividing the number of correct predictions by the total number of predictions. The formula is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error rate: This indicates the percentage of misclassification. The formula is:
Error rate = (FP + FN) / (TP + TN + FP + FN)
In other words,
Error rate = 1 - Model Accuracy
Sensitivity: The sensitivity of a model measures the proportion of positive examples which were correctly classified. The formula is:
Sensitivity = TP / (TP + FN)
Specificity: The specificity of a model measures the proportion of negative examples which were correctly classified. The formula is:
Specificity = TN / (TN + FP)
Precision: This measures the proportion of true positive predictions out of all the positive predictions made by the model. It is calculated by dividing the number of true positives by the total number of positive predictions. The formula is:
Precision = TP / (TP + FP)
Recall: This measures the proportion of true positive predictions out of all the actual positive instances in the dataset. It is calculated by dividing the number of true positives by the total number of positive instances. The formula is:
Recall = TP / (TP + FN)
F1 score: This is the harmonic mean of precision and recall, and provides a balanced measure of a model's performance. The formula is:
F1 score = 2 * (Precision * Recall) / (Precision + Recall)

Example 1
Suppose we have a binary classification problem where we want to predict if a person has a disease or not based on their medical test results. We have a dataset of 100 patients, where 60 do not have the disease and 40 have the disease. We train a machine learning model on this data to predict the disease status based on test results.
After training the model, we test it on a separate dataset of 20 patients and get the following results:
True positives (TP): The model correctly predicted that 7 patients have the disease.
True negatives (TN): The model correctly predicted that 10 patients do not have the disease.
False positives (FP): The model predicted that 2 patients have the disease, but they do not actually have the disease.
False negatives (FN): The model predicted that 1 patient does not have the disease, but they actually have the disease.
We can use these results to create a confusion matrix, which would look like this:

                        Predicted No Disease    Predicted Disease
Actual No Disease       TN = 10                 FP = 2
Actual Disease          FN = 1                  TP = 7

From the confusion matrix, we can calculate several performance metrics of the model, including accuracy, error rate, sensitivity, specificity, precision, recall, and F1 score.
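The metrics for Example 1 can be checked with a few lines of arithmetic; this sketch simply encodes the formulas above with TP = 7, TN = 10, FP = 2, FN = 1.

# Recomputing the Example 1 metrics directly from the formulas above.
TP, TN, FP, FN = 7, 10, 2, 1

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.85
error_rate  = 1 - accuracy                      # 0.15
sensitivity = TP / (TP + FN)                    # 0.875 (same formula as recall)
specificity = TN / (TN + FP)                    # 0.8333
precision   = TP / (TP + FP)                    # 0.7778
recall      = TP / (TP + FN)                    # 0.875
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, round(error_rate, 2), sensitivity, round(specificity, 4),
      round(precision, 2), round(recall, 2), round(f1, 2))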

For example:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (7 + 10) / (7 + 10 + 2 + 1) = 0.85 (85% accuracy of the model)
Error rate = 1 - Accuracy = 1 - 0.85 = 0.15
Sensitivity = TP / (TP + FN) = 7 / (7 + 1) = 0.875
Specificity = TN / (TN + FP) = 10 / (10 + 2) = 0.8333
Precision = TP / (TP + FP) = 7 / (7 + 2) = 0.78
Recall = TP / (TP + FN) = 7 / (7 + 1) = 0.88
F1 score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.78 * 0.88) / (0.78 + 0.88) = 0.82
These metrics provide insight into how well the model is performing and can help us make improvements or adjustments as necessary.

Example 2:
Let's consider the example of a binary classification problem where we want to predict whether an email is spam or not spam. We have a dataset of 100 emails, where 60 are not spam (negative class) and 40 are spam (positive class). We build a machine learning model to predict whether each email is spam or not, and the results are as follows:

                        Predicted Not Spam    Predicted Spam
Actual Not Spam         50                    10
Actual Spam             5                     35

This confusion matrix tells us that out of the 60 actual not spam emails, the model correctly predicted 50 as not spam (True Negatives, TN) but incorrectly predicted 10 as spam (False Positives, FP). Out of the 40 actual spam emails, the model correctly predicted 35 as spam (True Positives, TP) but incorrectly predicted 5 as not spam (False Negatives, FN).
Using this confusion matrix, we can calculate several performance metrics of the model. For the given example, accuracy is (50 + 35) / 100 = 85%, precision is 35 / (10 + 35) = 77.8%, recall is 35 / (5 + 35) = 87.5%, and F1 score is 2 * (precision * recall) / (precision + recall) = 82.4%.
These metrics provide insight into how well the model is performing and can help us make improvements or adjustments as necessary.

3.5 Improving performance of a model
Let's see the different avenues to improve the performance of models.
1. Feature engineering: Feature engineering is the process of selecting and transforming the input variables used in the model. By selecting the most important features and transforming them into a more suitable form, we can improve the performance of the model.
2. Model Parameter tuning: Most machine learning algorithms have a set of model parameters that can be adjusted to improve performance. Model parameters control the learning rate, regularization, number of trees in a random forest, and so on. Tuning these model parameters can lead to significant performance improvements.
3. Data cleaning and preprocessing: It is important to clean and preprocess the data before training the model. This includes removing missing values, scaling the features, encoding categorical variables, and so on. By carefully preprocessing the data, we can improve the performance of the model.
4. Ensemble methods: Ensemble methods combine multiple models to improve performance. This can include bagging, boosting, and stacking methods.
5. Larger training set: In some cases, the model may be underfitting because the training set is too small. By increasing the size of the training set, we can improve the performance of the model.
6. Regularization: Regularization methods such as L1, L2, and dropout can help prevent overfitting and improve the generalization ability of the model.
7. Different algorithms: If one machine learning algorithm is not performing well, it may be worth trying a different algorithm. There are many different algorithms to choose from, and some may be better suited for the particular problem at hand.
Let's discuss two avenues in detail.

3.5.1 Ensemble Approach
Ensemble learning is a machine learning technique that combines multiple models to improve the overall performance. The basic idea behind ensemble learning is that by combining multiple models, we can reduce the error caused by individual models and improve the overall accuracy.
There are several different ways to create an ensemble of models, but some of the most common methods include:
Bagging: Bagging stands for Bootstrap Aggregating. In bagging, we create multiple samples of the training data by sampling with replacement. Then we train a separate model on each of these samples. Finally, we combine the results of all models to make the final prediction.

[Figure: The Process of Bagging (Bootstrap Aggregation) — bootstrap samples drawn from the training set are used to train separate models, and their predictions are aggregated into the final prediction.]
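A minimal bagging sketch with scikit-learn's BaggingClassifier; the base estimator, the dataset, and the number of bootstrap samples are illustrative, and the `estimator` argument name assumes a recent scikit-learn release.

# Minimal bagging sketch: bootstrap samples + separate decision trees + vote.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model trained on each bootstrap sample
    n_estimators=50,                     # number of bootstrap samples / models
    bootstrap=True,                      # sample the training data with replacement
    random_state=0)
bagger.fit(X_train, y_train)
print("bagging test accuracy:", bagger.score(X_test, y_test))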
Boosting: Boosting is a sequential process that creates an ensemble of models by iteratively improving the performance of a single model. In boosting, we train a base model on the entire training set. Then, we assign weights to each sample in the training set and train a new model on the same dataset. In each iteration, we give more weight to the misclassified samples and less weight to the correctly classified samples. Finally, we combine the results of all models to make the final prediction.
[Figure: The Process of Boosting — models are trained sequentially, with misclassified samples re-weighted at each step, and their predictions are combined into an overall prediction.]
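One way to sketch this sequential re-weighting is AdaBoost, shown here with scikit-learn; AdaBoost is used only as a representative boosting algorithm, and the dataset and parameter values are illustrative.

# Minimal boosting sketch with AdaBoost: models are added sequentially and
# misclassified samples receive more weight at each iteration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
booster.fit(X_train, y_train)
print("boosting test accuracy:", booster.score(X_test, y_test))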
Stacking: Stacking is a process that involves combining multiple models that use different algorithms or different representations of the data. In stacking, we train multiple models on the same dataset. Then, we train a meta-model on the output of these models. The meta-model learns how to combine the outputs of the models to make the final prediction.

[Figure: The Process of Stacking — multiple models are trained on the training set and a meta-model combines their outputs into the final predictions.]
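A minimal stacking sketch with scikit-learn, combining heterogeneous base models with a logistic-regression meta-model; the particular estimators and dataset are illustrative choices.

# Minimal stacking sketch: heterogeneous base models + a meta-model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))  # meta-model combining the outputs
stack.fit(X_train, y_train)
print("stacking test accuracy:", stack.score(X_test, y_test))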
                          Bagging                  Boosting               Stacking
Goal                      Reduce Variance          Reduce Bias            Improve Accuracy
Base Learner Type         Homogeneous              Homogeneous            Heterogeneous
Base Learner Training     Parallel                 Sequential             Meta Model
Aggregation               Max Voting, Averaging    Weighted Averaging     Weighted Averaging
One powerful ensemble based technique is Random Forest. In Random Forest, a large number of decision trees are trained on randomly selected subsets of the data. Each tree is trained on a different subset of the data, using a different set of randomly selected features. This randomness helps to reduce overfitting and improve the generalization ability of the model.

[Figure: Random Forest — an instance is classified by many decision trees and their individual predictions (e.g. Class-A) are combined into the final class.]
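A minimal Random Forest sketch with scikit-learn; the number of trees, the feature-subset setting and the dataset are illustrative defaults, not values prescribed by the text.

# Minimal Random Forest sketch: many trees on random data/feature subsets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # number of decision trees
    max_features="sqrt",     # random subset of features considered at each split
    random_state=0)
forest.fit(X_train, y_train)
print("random forest test accuracy:", forest.score(X_test, y_test))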

Random Forest is a powerful ensemble based technique that can be used for both classification and regression problems.
The ensemble learning approach can improve the performance of machine learning models in several ways. First, by combining multiple models, we can reduce the variance caused by individual models and improve the generalization ability of the model. Second, by using different algorithms or representations of the data, we can capture different aspects of the data and improve the accuracy of the model. Finally, ensemble learning can be used to create more robust models that are less sensitive to changes in the training data.

However, the ensemble learning approach also has some drawbacks. It can be computationally expensive and may require a large amount of memory. In addition, the performance of the ensemble may be highly dependent on the performance of the individual models. Therefore, it is important to carefully select the models and ensure that they are diverse and complementary to each other.

3.5.2 Model Parameter Tuning
Model parameter tuning is the process of adjusting the hyperparameters of a machine learning model in order to improve its performance. Hyperparameters are the parameters that are set before the training of the model and cannot be learned from the data. Examples of hyperparameters include the learning rate, the number of hidden layers in a neural network, and the regularization parameter.
The model parameter tuning approach in machine learning involves the following steps:
Step 1: Define the hyperparameters: The first step is to define the hyperparameters that need to be tuned. This can be done by looking at the model architecture and understanding how the hyperparameters affect the performance of the model.
Step 2: Define the search space: The next step is to define the search space for the hyperparameters. The search space is a range of possible values for each hyperparameter. The search space can be defined using a grid search or a random search.
Step 3: Select a performance metric: The performance metric is the metric that will be used to evaluate the performance of the model. The most common performance metrics are accuracy, precision, recall, and F1-score.
Step 4: Evaluate the model: The model is trained on the training data using different combinations of hyperparameters. The performance of the model is evaluated on the validation data using the selected performance metric.
Step 5: Select the best hyperparameters: The hyperparameters that give the best performance on the validation data are selected as the final hyperparameters. These hyperparameters are then used to train the model on the entire training dataset.
Step 6: Evaluate the final model: The final model is evaluated on the test data using the selected performance metric to determine the generalization performance of the model.
The model parameter tuning approach is an iterative process that involves repeating the above steps until the desired level of performance is achieved.
It is important to note that model parameter tuning can be time-consuming and computationally expensive, especially for large datasets and complex models.
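The six steps above map naturally onto scikit-learn's GridSearchCV, as sketched below; the model, the search space and the accuracy metric are illustrative choices rather than the only possible ones.

# Minimal hyperparameter tuning sketch following the steps above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: hyperparameters to tune and their search space (a grid here).
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Steps 3-4: choose a performance metric and evaluate each combination with cross validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# Step 5: the best hyperparameters are refit on the full training data automatically.
print("best hyperparameters:", search.best_params_)

# Step 6: evaluate the final model on the held-out test data.
print("test accuracy:", search.score(X_test, y_test))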
Question Bank
Short-Answer Questions (3 or 4 marks) :
1. Define model. How can you train a model?
2. Give the difference between predictive model and descriptive model.
3. State any four real-world problems solved by predictive models. Explain any one in brief.
4. State any four real-world problems solved by descriptive models. Explain any one in brief.
5. Write down steps to use the holdout method for model training.
6. Draw a detailed diagram to show the approach of 10-fold cross validation.
7. Give the difference between Bagging and Boosting.
8. Define overfitting. When does it happen?
9. Define underfitting. When does it happen?
10. Write a short note on bias-variance trade-off in context of model fitting.
11. Explain structure of confusion matrix.
12. Define the following terms: Sensitivity, Specificity.
13. Write a brief note on stacking.
14. State various ways to improve performance of a model.

Long-Answer Questions (7 marks):
1. Explain different types of model.
2. Explain Holdout method in detail.
3. Describe k-fold cross validation in detail.
4. Explain bagging, boosting and stacking in detail.
5. Describe Ensemble learning approach in detail.
6. Consider the following confusion matrix of the win/loss prediction of a cricket match. Calculate the accuracy, error rate, sensitivity, specificity, precision, recall and F-measure of the model.

                     Actual Win    Actual Loss
Predicted Win        8             7
Predicted Loss       3

7. While predicting malignancy of tumour of a set of patients using a classification model, following are the data recorded:
1. Correct predictions - 20 malignant, 70 benign
2. Incorrect predictions - 4 malignant, 6 benign
Create a confusion matrix for the same. And, calculate the accuracy, error rate, sensitivity, specificity, precision, recall and F-measure of the model.

Check your knowledge:
1. What is the difference between bagging and boosting?
a) Bagging reduces the variance of a model, while boosting reduces the bias of a model.
b) Bagging reduces the bias of a model, while boosting reduces the variance of a model.
c) Bagging increases the complexity of a model, while boosting reduces the complexity of a model.
d) Bagging reduces the complexity of a model, while boosting increases the complexity of a model.
2. Which of the following is not a metric that can be calculated using a confusion matrix?
a) Precision  b) Recall  c) F1 score  d) R-squared
3. Which of the following is a common technique used to prevent overfitting in a machine learning model?
a) Adding more features to the model
b) Increasing the number of epochs during training
c) Reducing the size of the model
d) Decreasing the learning rate during training
4. What is the holdout method used for?
a) To evaluate the performance of a machine learning model on a dataset that it hasn't seen during training
b) To train a machine learning model on a subset of a dataset
c) To generate new data for a machine learning model
d) To reduce the complexity of a machine learning model
5. Which of the following is a common technique used to prevent underfitting in a machine learning model?
a) Adding more features to the model
b) Increasing the number of epochs during training
c) Reducing the size of the model
d) Decreasing the learning rate during training
