
Precision-Recall

Example

In this example:
• If the classifier predicts negative, you can trust it: the example is
negative. Our AI does not make mistakes on negative predictions; it is
sensitive. Be careful, though: if the example is negative, you can't be
sure it will be predicted as negative (specificity = 78%).
• If the classifier predicts positive, you can't trust it (precision = 33%).
• However, if the example is positive, you can trust that the classifier will
find it (it will not miss it) (recall = 100%).
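To make these numbers concrete, here is a minimal sketch computing the three metrics from confusion-matrix counts. The counts below are illustrative assumptions (the slide's actual confusion matrix appears only in a figure), chosen so that they reproduce precision of roughly 33%, recall of 100%, and specificity of roughly 78%.

```python
# Minimal sketch: the metrics used in these examples, computed from
# confusion-matrix counts. The counts are illustrative, not the slide's data.
TP, FP, TN, FN = 10, 20, 71, 0

precision   = TP / (TP + FP)          # how much to trust a positive prediction
recall      = TP / (TP + FN)          # sensitivity: how many positives are found
specificity = TN / (TN + FP)          # how many negatives are recognized as negative

print(f"precision={precision:.2f}, recall={recall:.2f}, specificity={specificity:.2f}")
# precision=0.33, recall=1.00, specificity=0.78
```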
Precision-Recall
Example

In this example, since the population is imbalanced:
• The precision is relatively high
• The recall is 100% because all the positive examples are predicted
as positive.
• The specificity is 0% because no negative example is predicted as
negative.
Precision-Recall
Example

In this example:
• If it predicts that an example is positive, you can trust it: it is positive.
• However, if it predicts that an example is negative, you can't trust it;
the chances are that it is still positive.
This can still be a useful classifier.
Precision-Recall
Example

In this example:
• The classifier detects all the positive examples as positive
• It also detects all negative examples as negative.
• All the measures are at 100%.
Why so many measures? Which is more important?
Because it depends

• When a false positive is expensive (a false alarm is a nightmare!), precision is more important than recall.
• When a false negative is expensive (missing one is a nightmare!), recall is more important than precision.
Precision-Recall Curves
• The precision-recall curve is used for
evaluating the performance of binary
classification algorithms.
• It provides a graphical representation of a classifier's performance
across many thresholds, rather than a single value.
• It is constructed by calculating and plotting
the precision against the recall for a single
classifier at a variety of thresholds.
• It helps to visualize how the choice of
threshold affects classifier performance and
can even help us select the best threshold for
a specific problem.
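As a sketch of how such a curve might be produced in practice, the snippet below uses scikit-learn's precision_recall_curve (assuming scikit-learn and matplotlib are installed; the labels and scores are invented for illustration):

```python
# Sketch: plotting a precision-recall curve across thresholds with scikit-learn.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true   = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]                        # ground-truth labels
y_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]   # classifier scores

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision, marker=".")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve")
plt.show()
```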
Precision-Recall Curves Interpretation
Precision-Recall Curves

A model that produces a precision-recall curve that is closer to the
top-right corner is better than a model that produces a precision-recall
curve that is skewed towards the bottom of the plot.
Precision-Recall Curves
Example
Which of the following P-R curves represents a perfect classifier?

[Figure: three precision-recall curves labeled (a), (b), and (c)]
Model Evaluation

Metrics for Performance Evaluation


 How to evaluate the performance of a model?

Methods for Performance Evaluation


 How to obtain reliable estimates?

Methods for Model Comparison


 How to compare the relative performance among
competing models?
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:

• Size of training and test sets
• Class distribution

Methods of Estimation:
• Holdout: reserve 2/3 for training and 1/3 for testing
• Random subsampling: repeated holdout
• Cross validation: partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
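A minimal sketch of these estimation methods, assuming scikit-learn is available (the toy arrays X and y stand in for real data):

```python
# Sketch of the estimation methods above using scikit-learn's splitters.
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X, y = np.arange(20).reshape(10, 2), np.array([0, 1] * 5)   # toy data

# Holdout: reserve 2/3 for training, 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# Leave-one-out: k = n
n_splits = sum(1 for _ in LeaveOneOut().split(X))   # equals len(X)
```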
Model Evaluation

Metrics for Performance Evaluation


 How to evaluate the performance of a model?

Methods for Performance Evaluation


 How to obtain reliable estimates?

Methods for Model Comparison


 How to compare the relative performance among
competing models?
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to analyze noisy signals
• Characterize the trade-off between positive hits and false alarms

ROC curve plots TP rate (y-axis) against FP rate (x-axis)


TP rate (TPR) = TP / (TP + FN)

FP rate (FPR) = FP / (FP + TN)

The performance of each classifier is represented as a point on the ROC curve;
changing the algorithm's threshold or the sample distribution changes the
location of the point.
ROC (Receiver Operating Characteristic)
(FPR, TPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(0,1): ideal

Diagonal line:
• Random guessing
• Below diagonal line:
prediction is opposite of the true class
ROC (Receiver Operating Characteristic)
A 1-dimensional data set containing 2 classes (positive and negative):
any point located at x > t is classified as positive.

At threshold t:
TPR = 0.5, FPR = 0.12
Using ROC for Model Comparison
Neither model consistently outperforms the other:
• M1 is better for small FPR
• M2 is better for large FPR

Area Under the ROC curve (AUC):
• Ideal: Area = 1
• Random guess: Area = 0.5
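For illustration, a comparison of this kind could be sketched with scikit-learn's roc_auc_score; the labels and the two score vectors below are invented and do not correspond to the M1 and M2 curves in the figure:

```python
# Sketch: comparing two models by area under the ROC curve (scikit-learn).
from sklearn.metrics import roc_auc_score

y_true    = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
scores_m1 = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
scores_m2 = [0.8, 0.7, 0.9, 0.2, 0.5, 0.6, 0.4, 0.1, 0.3, 0.35]

print("M1 AUC:", roc_auc_score(y_true, scores_m1))
print("M2 AUC:", roc_auc_score(y_true, scores_m2))  # a random guess scores ~0.5
```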
How to construct an ROC curve
1. Use a classifier that produces a probability P(+|A) for each test instance A
2. Sort the instances according to P(+|A) in decreasing order
3. Apply a threshold at each unique value of P(+|A) and count the number of TP, FP, TN, FN at each threshold

TP rate (TPR) = TP / (TP + FN)
FP rate (FPR) = FP / (FP + TN)

Instance  P(+|A)  True Class
1         0.85    +
2         0.53    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.95    +
7         0.76    -
8         0.93    +
9         0.43    -
10        0.25    +
How to construct an ROC curve
1. Use a classifier that produces a probability P(+|A) for each test instance A
2. Sort the instances according to P(+|A) in decreasing order
3. Apply a threshold at each unique value of P(+|A) and count the number of TP, FP, TN, FN at each threshold

TP rate (TPR) = TP / (TP + FN)
FP rate (FPR) = FP / (FP + TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +
How to construct an ROC curve
1. Use a classifier that produces a probability P(+|A) for each test instance A
2. Sort the instances according to P(+|A) in decreasing order
3. Apply a threshold at each unique value of P(+|A) and count the number of TP, FP, TN, FN at each threshold

TP rate (TPR) = TP / (TP + FN)
FP rate (FPR) = FP / (FP + TN)

The AI column shows the predictions at the example threshold t >= 0.76; the TP, FP, TN, FN counts in each row correspond to setting the threshold at that row's value of P(+|A).

#   Threshold >=  True Class (Human)  AI (t >= 0.76)  TP  FP  TN  FN
1   0.95          +                   +               1   0   5   4
2   0.93          +                   +               2   0   5   3
3   0.87          -                   +               2   1   4   3
4   0.85          -                   +
5   0.85          -                   +
6   0.85          +                   +               3   3   2   2
7   0.76          -                   +               3   4   1   2
8   0.53          +                   -               4   4   1   1
9   0.43          -                   -               4   5   0   1
10  0.25          +                   -               5   5   0   0
How to construct an ROC curve
Instance  P(+|A)  True Class  TP  FP  TN  FN  FPR  TPR
1         0.95    +           1   0   5   4   0    1/5
2         0.93    +           2   0   5   3   0    2/5
3         0.87    -           2   1   4   3   1/5  2/5
4         0.85    -
5         0.85    -
6         0.85    +           3   3   2   2   3/5  3/5
7         0.76    -           3   4   1   2   4/5  3/5
8         0.53    +           4   4   1   1   4/5  4/5
9         0.43    -           4   5   0   1   1    4/5
10        0.25    +           5   5   0   0   1    1
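The table above can be reproduced programmatically. The following is a minimal sketch (the array names are our own) that sweeps the threshold over the unique values of P(+|A) and prints the counts and rates at each one:

```python
# Minimal sketch: recompute the TPR/FPR table above from the scores and labels.
import numpy as np

scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])   # 1 = positive, 0 = negative

P, N = labels.sum(), (1 - labels).sum()              # total positives and negatives
for t in sorted(set(scores), reverse=True):
    pred = scores >= t                               # predict positive when P(+|A) >= t
    TP = int(np.sum(pred & (labels == 1)))
    FP = int(np.sum(pred & (labels == 0)))
    FN = int(P - TP)
    TN = int(N - FP)
    print(f"t>={t:.2f}  TP={TP} FP={FP} TN={TN} FN={FN}  "
          f"TPR={TP / P:.2f} FPR={FP / N:.2f}")
```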
ROC Interpretation
AUC (Area Under the Curve)
• AUC stands for "Area under the ROC Curve."
• It measures the entire two-dimensional area underneath the ROC curve
from (0,0) to (1,1).
• It provides an aggregate measure of performance across all possible
classification thresholds.
• AUC ranges in value from 0 to 1.
• A model whose predictions are 100% wrong has an AUC of 0.0, which
means it has the worst measure of separability.
• A model whose predictions are 100% correct has an AUC of 1.0, which
means it has a perfect measure of separability.
• When AUC is 0.5, the model has no class-separation capacity.
AUC (Area Under the Curve)
AUC (Area Under the Curve) Interpretation
Example
The red distribution curve is of the positive class (patients with the disease) and the green
distribution curve is of the negative class (patients without the disease).

This is an ideal situation. When the two curves do not overlap at all, the model has an ideal
measure of separability: it is perfectly able to distinguish between the positive class and the
negative class.
AUC (Area Under the Curve) Interpretation
Example
When AUC is 0.7, there is a 70% chance that the model will rank a randomly chosen positive
example above a randomly chosen negative one, i.e., distinguish between the positive and
negative class.
AUC (Area Under the Curve) Interpretation
Example
This is the worst situation for discrimination. When AUC is approximately 0.5, the model has no
capacity to distinguish between the positive class and the negative class.
AUC (Area Under the Curve) Interpretation
Example
When AUC is approximately 0, the model is actually reversing the classes: it is predicting the
negative class as positive and vice versa.
AUC (Area Under the Curve) Interpretation
Example
Which of the following ROC curves produce AUC values greater than 0.5?

[Figure: five ROC curves labeled (a) through (e)]
ROC vs. PRC
The main difference between ROC curves and precision-recall curves is that the
number of true-negative results is not used for making a PRC.

Curve                                       x-axis                                y-axis
Precision-recall (PRC)                      Recall: TP / (TP + FN)                Precision: TP / (TP + FP)
Receiver Operating Characteristics (ROC)    False Positive Rate: FP / (FP + TN)   Recall / Sensitivity / True Positive Rate: TP / (TP + FN)
Intersection over Union (IoU) for object detection

Is object detection classification or regression?
What is Intersection over Union?
Definition
Intersection over Union (IoU) is an evaluation metric used to measure the
accuracy of an object detector on a particular dataset.

It is a number from 0 to 1 that specifies the amount of overlap between the
predicted and ground-truth bounding boxes.

It is known to be the most popular evaluation metric for tasks such as
segmentation, object detection and tracking.
What is Intersection over Union?
Which one is best?
Intersection over Union

• An IoU of 0 means that there is no overlap between the boxes.
• An IoU of 1 means that the union of the boxes is the same as their overlap,
indicating that they are completely overlapping.

The lower the IoU, the worse the prediction result.
Intersection over Union
In order to apply Intersection over Union to evaluate an (arbitrary) object
detector we need:

1. The ground-truth bounding boxes (i.e., the hand-labeled bounding boxes that
specify where in the image our object is).
2. The predicted bounding boxes from our model.
Intersection over Union
Your goal is to take the training images and bounding boxes, construct an
object detector, and then evaluate its performance on the testing set.

An Intersection over Union score > 0.5 is normally considered a "good"
prediction.
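As a concrete illustration, here is a minimal sketch of the IoU computation for two axis-aligned boxes; the function name and the (x1, y1, x2, y2) corner format are our own choices rather than a specific library's API:

```python
# Minimal sketch of IoU for two axis-aligned boxes given as (x1, y1, x2, y2).
def intersection_over_union(box_a, box_b):
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)        # 0 if the boxes do not overlap

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                  # avoid double-counting the overlap
    return inter / union

ground_truth = (50, 50, 150, 150)
prediction   = (60, 60, 160, 160)
print(intersection_over_union(ground_truth, prediction))  # ~0.68, a "good" detection
```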
Intersection over Union
Confidence Intervals
Confidence Interval for Accuracy
Definition
Confidence, in statistics, is a way to describe probability.
A confidence interval is the range of values you expect your estimate to fall
within if you redo your test, within a certain level of confidence.

For example, if you construct a confidence interval with a 95% confidence
level, you are confident that 95 out of 100 times the estimate will fall
between the upper and lower values specified by the confidence interval.

Confidence level = 1 − α
Confidence Interval for Accuracy
Prediction can be regarded as a Bernoulli trial.
 A Bernoulli trial has 2 possible outcomes.
 Possible outcomes for prediction: correct or wrong.
 A collection of Bernoulli trials has a Binomial distribution:
 x ~ Bin(N, p), where x is the number of correct predictions.

Example: Toss a fair coin 50 times; how many heads would turn up?
Expected number of heads = N × p = 50 × 0.5 = 25

Classification Accuracy = correct predictions / total predictions

Given x (the number of correct predictions), or equivalently acc = x / N,
and N (the number of test instances), can we predict p (the true accuracy
of the model)?
Confidence Interval for Accuracy​
Example
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
• N = 100, acc = 0.8
• Let 1 − α = 0.95 (95% confidence)
• From the standard normal distribution table, Zα/2 = 1.96

Standard normal distribution:
1 − α   Zα/2
0.99    2.58
0.98    2.33
0.95    1.96
0.90    1.65

N          50      100     500     1000    5000
p(lower)   0.670   0.711   0.763   0.774   0.789
p(upper)   0.888   0.866   0.833   0.824   0.811

Confidence Interval for Accuracy
For large test sets (N > 30), acc has a normal distribution
with mean p and variance ​p(1-p) / N

Confidence Interval for p:​
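A minimal sketch (the helper name is our own) that evaluates this interval and reproduces the example's bounds for N = 100 and acc = 0.8:

```python
# Minimal sketch: confidence interval for accuracy via the normal approximation.
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Return (p_lower, p_upper) for the true accuracy p."""
    center = 2 * n * acc + z**2
    spread = z * math.sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_confidence_interval(0.8, 100))
# prints approximately (0.7112, 0.8666); cf. 0.711 and 0.866 in the table above
```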


Performance Metrics for Multiclass AI
Turn it into Binary Classification
{red, blue, green, yellow} → n = 4

One vs. All (Rest):
• Binary Classification Problem 1: red vs. [blue, green, yellow]
• Binary Classification Problem 2: blue vs. [red, green, yellow]
• Binary Classification Problem 3: green vs. [red, blue, yellow]
• Binary Classification Problem 4: yellow vs. [red, blue, green]
n classes → n binary classifiers

One vs. One:
• Binary Classification Problem 1: red vs. blue
• Binary Classification Problem 2: red vs. green
• Binary Classification Problem 3: red vs. yellow
• Binary Classification Problem 4: blue vs. green
• Binary Classification Problem 5: blue vs. yellow
• Binary Classification Problem 6: green vs. yellow
n classes → n(n−1)/2 binary classifiers
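For illustration, both decompositions are available as wrappers in scikit-learn; the sketch below assumes scikit-learn is installed and uses a 4-class toy dataset in place of {red, blue, green, yellow}:

```python
# Sketch: one-vs-rest and one-vs-one decompositions with scikit-learn wrappers.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))   # n = 4 binary classifiers
print(len(ovo.estimators_))   # n(n-1)/2 = 6 binary classifiers
```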


Performance Metrics for Multiclass AI
Turn it into Binary Classification

One vs. All (Rest) One vs. One


Breakout Session
How to construct an ROC curve
Example

Instance  P(+|A)  True Class  FPR  TPR
1         0.95    +
2         0.93    +
3         0.87    +
4         0.85    +
5         0.83    +
6         0.80    -
7         0.76    -
8         0.53    -
9         0.43    -
10        0.25    -
How to construct an ROC curve
Example

Instance  P(+|A)  True Class  FPR  TPR
1         0.95    +           0    1/5
2         0.93    +           0    2/5
3         0.87    +           0    3/5
4         0.85    +           0    4/5
5         0.83    +           0    1
6         0.80    -           1/5  1
7         0.76    -           2/5  1
8         0.53    -           3/5  1
9         0.43    -           4/5  1
10        0.25    -           1    1
How to construct an ROC curve
Example

Instance  P(+|A)  True Class  FPR  TPR
1         0.95    +           0    1/5
2         0.93    -           1/5  1/5
3         0.87    +           1/5  2/5
4         0.85    -           2/5  2/5
5         0.83    +           2/5  3/5
6         0.80    -           3/5  3/5
7         0.76    +           3/5  4/5
8         0.53    -           4/5  4/5
9         0.43    +           4/5  1
10        0.25    -           1    1
Breakout Session
One vs. All (Rest)
One vs. All (Rest)
One vs. One
