
UNIT-III

One of the core tasks in building any machine learning model is to evaluate its performance. It's fundamental, and it's also really hard.

So how would one measure the success of a machine learning model? How would we know when to stop the training and evaluation and call it good?

While data preparation and training a machine learning model are key steps in the machine learning pipeline, it is equally important to measure the performance of the trained model.
How well the model generalizes to unseen data is what distinguishes adaptive from non-adaptive machine learning models.
By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.
Relying only on accuracy, without a proper evaluation of the ML model using different metrics, can lead to problems when the model is deployed on unseen data and can result in poor predictions.

This happens because, in cases like these, our models don’t learn but
instead memorize; hence, they cannot generalize well on unseen data.
To get started, let’s define these three important terms:
• Learning: ML model learning is concerned with the accurate
prediction of future data, not necessarily the accurate
prediction of training/available data.

• Memorization: ML model performance on limited data; in other words, overfitting on the known training dataset.

• Generalization: Can be defined as the capability of the ML model to apply its learning to previously unseen data. Without generalization there is no learning, just memorization. But note that generalization is also goal specific: for instance, an image recognition model trained well on zoo animal images may not generalize well to images of cars and buildings.

Relying simply on model accuracy while training the model can lead to poor performance during validation.

Evaluation Metrics:
Evaluation metrics are tied to machine learning tasks. There are
different metrics for the tasks of classification, regression,
ranking, clustering, topic modeling, etc. Some metrics, such as
precision-recall, are useful for multiple tasks. Classification,
regression, and ranking are examples of supervised learning,
which constitutes a majority of machine learning applications.

Cross-Validation in Machine Learning

Cross-validation is a technique for validating model efficiency by training it on one subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique to check how a statistical model generalizes to an independent dataset.

In machine learning, there is always a need to test the stability of the model, and we cannot judge this from the training dataset alone. For this purpose, we reserve a particular sample of the dataset that was not part of the training data. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from a simple train-test split.

Hence the basic steps of cross-validation are:

o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.
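As a minimal sketch of this process, the following k-fold cross-validation example uses scikit-learn; the dataset, the classifier, and the number of folds are illustrative assumptions only.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example dataset and classifier (assumptions made only for this sketch)
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: each fold is held out once as the validation set
# while the model is trained on the remaining folds.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())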

Model Accuracy:

Model accuracy for classification models can be defined as the ratio of correctly classified samples to the total number of samples. This is usually expressed in terms of four possible outcomes:
True Positive (TP): A true positive is an outcome where the model correctly predicts the positive class.

True Negative (TN): A true negative is an outcome where the model correctly predicts the negative class.

False Positive (FP): A false positive is an outcome where the model incorrectly predicts the positive class.

False Negative (FN): A false negative is an outcome where the model incorrectly predicts the negative class.
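In terms of these four counts, accuracy can be written as:

• Accuracy = (TP + TN) / (TP + TN + FP + FN)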

Though highly accurate models are what we aim to achieve, accuracy alone may not be sufficient to ensure the model's performance on unseen data. Let's explore this with a use case.

Problem Statement: Build a prediction model for hospitals to identify whether a patient is suffering from cancer or not.

Binary Classification Model: Predict whether the patient has cancer or not.

Let's assume we have a training dataset with labels: 100 cases, 10 labeled as 'Cancer' and 90 labeled as 'Normal'.

Let's try calculating the accuracy of this model on the above dataset, given the following results:

In the above case let’s define the TP, TN, FP, FN:

TP (Actual Cancer and predicted Cancer) = 1

TN (Actual Normal and predicted Normal) = 90

FN (Actual Cancer and predicted Normal) = 8

FP (Actual Normal and predicted Cancer) = 1
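Substituting these counts into the accuracy formula:

• Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1 + 90) / (1 + 90 + 1 + 8) = 91 / 100 = 0.91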


So the accuracy of this model is 91%. But the question remains: is this model useful, even being so accurate?

This highly accurate model may not be useful, as it is not able to identify the actual cancer patients; hence, it can have the worst possible consequences.

So, for these types of scenarios, how can we trust machine learning models?

Accuracy alone doesn’t tell the full story when we’re working with
a class-imbalanced dataset like this one, where there’s a significant
disparity between the number of positive and negative labels.

Precision and Recall


In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class).

High precision means that an algorithm returned substantially more relevant results than irrelevant ones.

In this context, recall is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).

High recall means that an algorithm returned most of the relevant results.
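In terms of the counts defined earlier, these two metrics can be written as:

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)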

Let's try to measure precision and recall for our cancer prediction use case:

Our model has a precision value of TP / (TP + FP) = 1 / (1 + 1) = 0.5. In other words, when it predicts cancer, it is correct 50% of the time.

Our model has a recall value of TP / (TP + FN) = 1 / (1 + 8) ≈ 0.11. In other words, it correctly identifies only 11% of all cancer patients.
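To double-check these numbers, the counts above can be reconstructed as label arrays and passed to scikit-learn; the array construction below is purely illustrative, just one way to reproduce the same counts.

from sklearn.metrics import precision_score, recall_score

# 1 = Cancer (positive class), 0 = Normal (negative class)
y_true = [1] * 1 + [0] * 90 + [1] * 8 + [0] * 1   # actual labels for the TP, TN, FN, FP cases
y_pred = [1] * 1 + [0] * 90 + [0] * 8 + [1] * 1   # model predictions for the same cases

print(precision_score(y_true, y_pred))   # 0.5
print(recall_score(y_true, y_pred))      # 0.111...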

Information Criteria (AIC, BIC, MDL)


There are three statistical approaches to estimating how well a given model fits a dataset and how complex the model is. Each can be shown to be equivalent or proportional to the others, although each was derived from a different framing or field of study:
• Akaike Information Criterion (AIC). Derived from frequentist probability.
• Bayesian Information Criterion (BIC). Derived from Bayesian probability.
• Minimum Description Length (MDL). Derived from information theory.

Each statistic can be calculated using the log-likelihood for a model and the data. Log-likelihood comes from Maximum Likelihood Estimation, a technique for finding or optimizing the parameters of a model in response to a training dataset.
Akaike Information Criterion

The Akaike Information Criterion, or AIC for short, is a method for scoring and
selecting a model.
It is named for the developer of the method, Hirotugu Akaike, and may be shown to have a basis in information theory and frequentist-based inference.
The AIC statistic is defined for logistic regression as follows (taken from "The Elements of Statistical Learning"):

• AIC = -2/N * LL + 2 * k/N

Where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model.

The score, as defined above, is minimized, i.e. the model with the lowest AIC is selected.
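As a minimal sketch of this calculation (the log-likelihood LL is assumed to have been computed already; the example numbers are illustrative only):

def aic(log_likelihood, n_params, n_samples):
    # AIC as defined above: -2/N * LL + 2 * k/N (lower is better)
    return -2.0 / n_samples * log_likelihood + 2.0 * n_params / n_samples

# Illustrative values: LL = -120.5, k = 4 parameters, N = 100 training examples
print(aic(log_likelihood=-120.5, n_params=4, n_samples=100))   # 2.49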

Bayesian Information Criterion


The Bayesian Information Criterion, or BIC for short, is a method for scoring and
selecting a model.
It is named for the field of study from which it was derived: Bayesian probability and
inference. Like AIC, it is appropriate for models fit under the maximum likelihood
estimation framework.

The BIC statistic is calculated for logistic regression as follows (taken from "The Elements of Statistical Learning"):

• BIC = -2 * LL + log(N) * k

Where log() is the natural logarithm (base e), LL is the log-likelihood of the model, N is the number of examples in the training dataset, and k is the number of parameters in the model.

The score, as defined above, is minimized, i.e. the model with the lowest BIC is selected.
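A corresponding sketch for BIC, again assuming the log-likelihood has already been computed:

import math

def bic(log_likelihood, n_params, n_samples):
    # BIC as defined above: -2 * LL + log(N) * k, using the natural logarithm (lower is better)
    return -2.0 * log_likelihood + math.log(n_samples) * n_params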
Minimum Description Length

The Minimum Description Length, or MDL for short, is a method for scoring and
selecting a model.
It is named for the field of study from which it was derived, namely information theory.

Information theory is concerned with the representation and transmission of information on a noisy channel, and as such, measures quantities like entropy, which is the average number of bits required to represent an event from a random variable or probability distribution.

From an information theory perspective, we may want to transmit both the predictions
(or more precisely, their probability distributions) and the model used to generate them.
Both the predicted target variable and the model can be described in terms of the number
of bits required to transmit them on a noisy channel.

The Minimum Description Length is the minimum number of bits, or the minimum of the sum of the number of bits required to represent the data and the model.

The MDL statistic is calculated as follows (taken from "Machine Learning"):

• MDL = L(h) + L(D | h)

Where h is the model, D is the predictions made by the model, L(h) is the number of
bits required to represent the model, and L(D | h) is the number of bits required to
represent the predictions from the model on the training dataset.
The score, as defined above, is minimized, i.e. the model with the lowest MDL is selected.
Assignment # 3

Exercise: Build a decision tree model to predict survival based on certain parameters.

Using the following columns in this file, build a model to predict whether a person would survive or not:
1. Pclass
2. Sex
3. Age
4. Fare

Calculate the score of your model.
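A minimal sketch of one possible solution with scikit-learn is given below; the file name "titanic.csv", the "Survived" target column, and the preprocessing choices are assumptions made for illustration only.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed file name and column layout (Titanic-style CSV with a Survived column)
df = pd.read_csv("titanic.csv")
df = df[["Pclass", "Sex", "Age", "Fare", "Survived"]].copy()

df["Sex"] = df["Sex"].map({"male": 0, "female": 1})   # encode Sex as a number
df["Age"] = df["Age"].fillna(df["Age"].median())      # fill missing ages with the median

X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Score of the model on the held-out test set (mean accuracy)
print("Model score:", model.score(X_test, y_test))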

ROC Curve:
A ROC curve is a graph that shows the performance of a classification model at all possible thresholds (a threshold is a particular value beyond which you say a point belongs to a particular class). The curve is plotted between two parameters:
• TRUE POSITIVE RATE
• FALSE POSITIVE RATE

TPR (also called Recall or Sensitivity) is the ratio of positive examples that are correctly identified, and FPR is the ratio of negative examples that are incorrectly classified. As said earlier, the ROC curve is simply the plot of TPR against FPR across all possible thresholds, and AUC is the entire area beneath this ROC curve.

AUC-ROC curve:

ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
It is a graph that shows the performance of the classification model at different thresholds.
The ROC curve is defined for binary classification; to visualize the performance of a multi-class classification model, the AUC-ROC curve can be extended (for example, with a one-vs-rest approach).
The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
AUC measures how well a model is able to distinguish between classes.
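As a minimal sketch, the ROC curve and AUC can be computed with scikit-learn as shown below; the dataset and classifier are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)           # example binary dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]              # predicted probability of the positive class

# TPR and FPR at every threshold, plus the overall AUC
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()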
