
Some issues in evaluation

The Problem with Accuracy

[Cartoon: the classifier labels everything "CAT"; the dog asks "What about me?"]

• Unbalanced class distribution
  • Truth: 80 cats (0), 20 dogs (1)
  • Predict: 99 cats, 1 dog
• Labelling (almost) everything as CAT still scores well (worked through in the sketch below):
  • Accuracy : 0.81
  • Precision : 0.81
  • Recall : 1.00
  • F1 : 0.89
• Recall and Precision do not account for True Negatives!
• Informedness (TPR + TNR − 1) : 0.05
  • 0.05 is basically guessing!
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
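As a worked check of the numbers above, here is a minimal scikit-learn sketch (not from the course notebooks) that reproduces the cat/dog example and computes informedness by hand; the arrays are constructed to match the slide.

# Reproducing the cat/dog example above (illustrative sketch, not the course notebook).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([0] * 80 + [1] * 20)   # truth: 80 cats (0), 20 dogs (1)
y_pred = np.array([0] * 99 + [1] * 1)    # prediction: 99 cats, 1 dog (all real cats kept)

# Treat "cat" (0) as the positive class, as on the slide.
print("Accuracy :", accuracy_score(y_true, y_pred))                 # 0.81
print("Precision:", precision_score(y_true, y_pred, pos_label=0))   # ~0.81
print("Recall   :", recall_score(y_true, y_pred, pos_label=0))      # 1.00
print("F1       :", f1_score(y_true, y_pred, pos_label=0))          # ~0.89

# Informedness = TPR + TNR - 1, which does account for the true negatives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()  # dog = negative class
informedness = tp / (tp + fn) + tn / (tn + fp) - 1
print("Informedness:", informedness)                                # 0.05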

The Problem with Accuracy

• The Area Under the Curve – Receiver Operating Characteristic
  • AUC-ROC
• A performance measurement for binary classification problems at various threshold settings
• Sensitivity vs. Specificity, TPR vs. TNR, TP/(TP+FN) vs. TN/(TN+FP)
  • Sensitivity UP means Specificity DOWN, and vice versa
• ROC is a probability curve
• AUC represents the degree (or measure) of separability
  • AUC near 1 means the classifier has a good measure of separability
  • AUC = 0.5 means no class-separation capacity whatsoever
    • Random!
• Informedness is TPR + TNR − 1: the height of a single ROC point above the chance diagonal
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
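A minimal sketch of computing and plotting AUC-ROC with scikit-learn; the synthetic scores below are illustrative assumptions, not data from the linked article.

# Sketch: ROC curve and AUC for a binary classifier's scores (synthetic data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(500), np.ones(500)])
# Overlapping score distributions => the classes are only partially separable.
y_score = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR vs FPR at every threshold
auc = roc_auc_score(y_true, y_score)                # 1.0 = perfect separability, 0.5 = random

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "k--", label="random (AUC = 0.5)")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()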



What is an “epoch”?
• Epoch
  • Presenting all K examples in the training sample constitutes one epoch of training.
• Batch Learning (also known as offline learning)
  • Adjustments to the weights are performed after all training samples have been presented
  • The cost function for batch learning is defined by the average error energy
• Online Learning (also known as stochastic gradient descent)
  • Adjustments to the weights are performed on a sample-by-sample basis
  • The cost function for online learning is defined by the instantaneous error energy
• Both methods shuffle the order of the samples after each epoch
Batch vs. Online (see the update-rule sketch below)

• Batch
  • The Good
    • Accurate estimate of the gradient vector
    • Parallelization of the learning process
  • The Bad
    • Demanding on storage requirements
• Online
  • The Good
    • Simple to implement
    • Less likely to be trapped in a local minimum
    • Sensitive to redundant and nonstationary data (it can exploit redundancy and track changes)
  • The Bad
    • Sensitive to redundant and nonstationary data (updates are noisy, sample by sample)
    • Cannot parallelize the learning process
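To make the two update schedules concrete, here is a minimal NumPy sketch (an illustrative assumption, using a linear model with squared error rather than a neural network): batch learning makes one update per epoch from the average gradient, while online learning makes one update per shuffled sample.

# Sketch: full-batch vs. online (per-sample) gradient descent for y ~ X @ w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def batch_epoch(w, X, y, lr=0.5):
    """One epoch of batch learning: a single update from the average gradient."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def online_epoch(w, X, y, lr=0.01):
    """One epoch of online learning: one update per shuffled sample."""
    for i in rng.permutation(len(y)):
        grad_i = (X[i] @ w - y[i]) * X[i]
        w = w - lr * grad_i
    return w

w_batch = w_online = np.zeros(3)
for epoch in range(20):
    w_batch = batch_epoch(w_batch, X, y)
    w_online = online_epoch(w_online, X, y)
print(w_batch, w_online)   # both approach [2.0, -1.0, 0.5]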
Why don’t we have both?
• Mini-Batch learning
  • Update the weights several times over the course of a single epoch
• Batch size, e.g. with 1000 training samples:
  • (Full) Batch: 1000
  • Mini-batch: 500, 200, or 100
  • Online: 1
• Mini-Batch can be used when the full Batch cannot fit in memory.
• Mini-Batch might converge quicker and generalize better (see the Keras sketch below).

https://www.kaggle.com/residentmario/full-batch-mini-batch-and-online-learning
Introduction to Machine Learning with Python: A Guide for Data Scientists
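In Keras this choice is just the batch_size argument to fit(); the tiny model and random data below are placeholders to show the three regimes, not the course example.

# Sketch: batch_size selects full-batch, mini-batch, or online learning in Keras.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# batch_size controls how many samples are seen before each weight update:
model.fit(X, y, epochs=5, batch_size=1000, verbose=0)  # full batch: 1 update per epoch
model.fit(X, y, epochs=5, batch_size=100, verbose=0)   # mini-batch: 10 updates per epoch
model.fit(X, y, epochs=5, batch_size=1, verbose=0)     # online/SGD: 1000 updates per epoch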

Overfitting and Early-Stopping

• Networks will almost always memorize a little bit of the training data rather than learn generalizations.
• Test data measures the generalization error.
• Early stopping uses a third set, the validation data, to stop training and prevent overfitting.
• Validation data can also be used to tune the model.


Early Stopping
• Separately compute the error on the training set and on the validation set
• Weights are modified using only the training-set (in-sample) error
• Error will generally keep decreasing on the training set, but may start rising on the validation set
  • That is where overfitting begins…
  • Learning should be halted at that point to get good generalization performance (see the callback sketch below)
https://www.doc.ic.ac.uk/~nuric/teaching/imperial-college-machine-learning-neural-networks.html
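A minimal sketch of early stopping with the Keras EarlyStopping callback; the model and synthetic data are placeholders. validation_split carves the validation set out of the training data, and training halts once the validation loss stops improving.

# Sketch: early stopping on a tiny synthetic binary-classification problem.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 10).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation error, not the training error
    patience=5,                  # allow 5 epochs without improvement before halting
    restore_best_weights=True,   # roll back to the weights from the best epoch
)

# validation_split holds out 20% of the training data as the validation set.
history = model.fit(X, y, validation_split=0.2, epochs=200,
                    callbacks=[early_stop], verbose=0)
print("stopped after", len(history.history["val_loss"]), "epochs")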

Overfitting and Early-Stopping

[Figure: training vs. validation (generalisation) error curves — stop training before the validation error starts to rise. "Just avoid overfitting!"]
Example
• Google Colaboratory Example
• COMP2712 Evaluating Machine Learning
  https://colab.research.google.com/drive/1tbbjAMc9QetoYQRsB19KagXBz6d9Pqwl?usp=sharing
COMP2712 NNML

Feature Level Processing


Feature Normalisation
• Hard for the mouse to get noticed when its scale is so different from the elephant's.
• The values of inputs × weights for Feature A will dominate the summation in the next layer compared to the Feature B values.
• This will make convergence difficult, especially if Feature B is important for class separability.

[Figure: Feature A on a much larger scale than Feature B]
Feature Normalisation
• Ah, now we can see the mouse!
• Now that Feature A and Feature B are on the same scale, convergence will be quicker as the values are comparable.

[Figure: Feature A and Feature B on the same scale]
Feature Normalisation
• Z-score: zero mean and unit standard deviation

• Min-Max scaling: [0, 1]


https://colab.research.google.com/drive/1nbxaVa7YElj9EdG78q1iHhXYDzIjLBH0?usp=sharing

Feature Normalisation

Z-score:  z = (x − μ) / σ

Min-Max:  X_norm = (X − X_min) / (X_max − X_min)
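Both scalings are available in scikit-learn; a minimal sketch with made-up data in which Feature A (the "elephant") dwarfs Feature B (the "mouse"):

# Sketch: z-score and min-max scaling per feature with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[5000.0, 0.2],
              [7000.0, 0.5],
              [6500.0, 0.1],
              [8000.0, 0.9]])

X_z = StandardScaler().fit_transform(X)    # z = (x - mean) / std, per column
X_mm = MinMaxScaler().fit_transform(X)     # (x - min) / (max - min), per column -> [0, 1]

print(X_z.mean(axis=0).round(3), X_z.std(axis=0).round(3))   # ~[0, 0] and [1, 1]
print(X_mm.min(axis=0), X_mm.max(axis=0))                    # [0, 0] and [1, 1]

In practice, fit the scaler on the training set only and then apply the same transform to the validation and test sets.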
Curse of dimensionality
• Model fitting
  • Determine the relationship between the predictors and the outcome so that future values can be predicted.
• The more predictors a model has, the more the model can learn from the data.
  • But real data contains random noise, redundancies, etc.
  • The more predictors, the higher the probability that the model will learn fake patterns within the data
    • GENERALISE NOISE
    • OVERFITTING
• With fewer predictors, the model may not learn enough information
  • UNDERFITTING
• SOLUTION:
  • Find an appropriate balance between simplicity and complexity.
Feature Extraction

• Principal Components Analysis (PCA) to the rescue
• Reduces the original data features into uncorrelated principal components
  • Each component represents a different set of correlated features with a different amount of variation.
• “Retain components that account for 90% of the variation”
  • Depends on the data
  • Could reduce from hundreds of features to tens
Principal Components Analysis (PCA)
• The identification of linear combinations of variables that provide maximum variability within a set of data.
• Suppose that the data are plotted on a graph.
  • PCA finds the average along each axis (variable) within the data and then shifts the points until the centre of the averages is at the origin.
  • A straight line through the origin which minimizes the distance between itself and all the data points is then fit to the data.
• Best-fit line:
  • A line passing through the origin which maximizes the sum of squared distances from the projected points (along the line) to the origin.
  • Once this line is determined, it is referred to as the first principal component.

https://towardsdatascience.com/understanding-principal-component-analysis-ddaf350a363a
PCA

https://builtin.com/data-science/step-step-explanation-principal-component-analysis
• Assuming that the first principal component does not account for 100% of the variation within the data set,
  • use the second principal component:
  • the linear combination of variables which maximizes variability among all other linear combinations that are orthogonal to the first.
  • Simply put, once the first principal component is accounted for, the second principal component is the line perpendicular to the initial best-fit line.
• NOW, use the principal components as features! (see the sketch below)
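A minimal scikit-learn sketch of exactly this idea, using the breast cancer data set that appears later in these slides; the 90% threshold matches the earlier rule of thumb.

# Sketch: keep the principal components that explain ~90% of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)    # 30 original features
X_std = StandardScaler().fit_transform(X)     # PCA is scale-sensitive: standardise first

pca = PCA(n_components=0.90)                  # keep enough components for 90% of the variance
X_pca = pca.fit_transform(X_std)              # use these as the new features

print(X.shape[1], "->", X_pca.shape[1], "features")
print(pca.explained_variance_ratio_.cumsum().round(3))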
PCA

https://rstudio-pubs-static.s3.amazonaws.com/884549_bcb9c4e0773243f6946c2bac58950b54.html
Principal Components as Descriptors
• Features as principal components
  • Directions of the data that explain a maximal amount of variance
  • Lines that capture most of the information in the data
• E.g. features based on more than one image
  • If we have a total of n registered images, the corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector:
    x = [x_1, x_2, …, x_n]^T
  • These vectors can be treated as random quantities
  • We can now talk about mean vectors and covariance matrices
The corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector x = [x_1, x_2, …, x_n]^T

• Mean vector of the population:
  m_x = E{x}  (the expected value of x)
• Covariance matrix of the population:
  C_x = E{(x − m_x)(x − m_x)^T}
  • An n × n matrix
  • c_kk is the variance of x_k
  • c_kj is the covariance between x_k and x_j
The corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector x = [x_1, x_2, …, x_n]^T

• Every pixel is a point in a “cloud” (population) of pixels representing the same location in an image
• A set of images is then represented as a set of “clouds” of pixels
• There should be some simpler representation, right?
  • In the end we’re talking about a sequence of images of the same thing!
• Remember the eigenvalues and eigenvectors?
Towards Principal Components Analysis (PCA): the Hotelling transform
• Covariance matrix of the population:
  C_x = E{(x − m_x)(x − m_x)^T}
• Because this matrix is real and symmetric, finding a set of orthonormal eigenvectors is always possible
• Let e_i and λ_i be the eigenvectors and corresponding eigenvalues of C_x, arranged in descending order of λ_i
• Let A be the matrix whose rows are formed from the eigenvectors of C_x, arranged in descending order; thus the first row of A is the eigenvector corresponding to the largest eigenvalue.
• A is used to define the Hotelling transform:
  y = A (x − m_x)
• This maps the x’s into vectors denoted by y

Principal Components (analysis) transform
• From the Hotelling transform (and because A is orthogonal, so A^{-1} = A^T), the vectors (images) can be reconstructed from y:
  x = A^T y + m_x
• Now, instead of using all the eigenvectors of C_x, we use only the k eigenvectors corresponding to the k largest eigenvalues, collected in a matrix A_k (of dimension k × n):
  y = A_k (x − m_x)   (Principal Components Transform)
• The y vectors are then k-dimensional
• The reconstruction x̂ = A_k^T y + m_x is no longer exact (but it generates less complex images with the same principal features as the originals!) — see the NumPy sketch below
https://colab.research.google.com/drive/1rOL7B6PGb-bovZ7z26K0daqTCzErZJpX?usp=sharing
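The transform can be written in a few lines of NumPy, following the notation above (rows of X are the x vectors, A holds the eigenvectors of C_x as rows); the random correlated data is just an illustrative stand-in for the registered images.

# Sketch: the Hotelling / principal-components transform in NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # 500 correlated 6-D vectors

m_x = X.mean(axis=0)                      # mean vector of the population
C_x = np.cov(X, rowvar=False)             # n x n covariance matrix

eigvals, eigvecs = np.linalg.eigh(C_x)    # real symmetric => orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
A = eigvecs[:, order].T                   # rows of A = eigenvectors of C_x

k = 2
A_k = A[:k]                               # keep the k largest-eigenvalue rows
Y = (X - m_x) @ A_k.T                     # y = A_k (x - m_x): k-dimensional features
X_hat = Y @ A_k + m_x                     # approximate reconstruction x^ = A_k^T y + m_x

print("reconstruction RMSE:", np.sqrt(np.mean((X - X_hat) ** 2)).round(4))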

Feature Reduction: Breast Cancer

[Figures: PCA-based feature reduction on the breast cancer data set — see the notebooks and article below]
https://colab.research.google.com/drive/1rOL7B6PGb-bovZ7z26K0daqTCzErZJpX?usp=sharing
https://colab.research.google.com/drive/1CjzTBsyt7FsrPb4KCh0qfZEsvHpGTCBq?usp=sharing
https://medium.com/apprentice-journal/pca-application-in-machine-learning-4827c07a61db

Feature Reduction: PCA Limitations

• Model performance: PCA can reduce model performance on datasets with little or no feature correlation, or that do not meet the assumption of linearity.
• Classification accuracy: a variance-based PCA framework does not consider the differentiating characteristics of the classes. The information that distinguishes one class from another might lie in the low-variance components and may be discarded.
• Outliers: PCA is also affected by outliers, so normalization of the data needs to be an essential component of any workflow.
• Interpretability: each principal component is a combination of the original features and does not allow the importance of individual features to be recognized.
https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

Output Transformation
• Categorical variables are often called nominal
• Some examples include:
  • A “pet” variable with the values: “dog” and “cat”
  • A “color” variable with the values: “red”, “green” and “blue”
  • A “place” variable with the values: “first”, “second” and “third”
  • A “passing grade” variable: “fail”, “pass”
  • An “iris variety” variable: “Iris-setosa”, “Iris-versicolor”, “Iris-virginica”
• Each value represents a different category
• Classifiers (like the MLP) need numbers!
Output Transformation
• Two solutions
  • Integer Encoding
    • Each unique category value is assigned an integer value.
    • For example, “red” is 1, “green” is 2, and “blue” is 3.
    • Okay for ordinal variables (order matters), but not for true nominal variables
  • One-Hot Encoding
    • A new binary variable is added for each unique integer value.
    • In the “color” example, there are 3 categories and therefore 3 binary variables are needed. A “1” is placed in the binary variable for that color and “0” in the others.
# One-hot encoding with pandas:
import pandas as pd
y_oh = pd.get_dummies(df['class']).values
# ...or, much easier, with Keras:
import tensorflow as tf
train_labels_oh = tf.keras.utils.to_categorical(train_labels)
Output Transformation

Original data:
  Instance   Class
  1          “Red”
  2          “Green”
  3          “Blue”
  4          “Green”

Integer Encoding:
  Instance   Class
  1          1
  2          2
  3          3
  4          2

One-Hot Encoding:
  Instance   Red   Green   Blue
  1          1     0       0
  2          0     1       0
  3          0     0       1
  4          0     1       0
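The one-hot table above can be produced directly with pandas; the column names here are assumed for illustration.

# Sketch: one-hot encoding the Red/Green/Blue example with pandas.
import pandas as pd

df = pd.DataFrame({"Instance": [1, 2, 3, 4],
                   "Class": ["Red", "Green", "Blue", "Green"]})

one_hot = pd.get_dummies(df["Class"]).astype(int)   # one binary column per category
print(pd.concat([df["Instance"], one_hot], axis=1)) # note: columns come out alphabetically (Blue, Green, Red)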
Output Transformation
• Integer Encoding
  • A single output neuron whose value ranges over the classes, e.g. [1, 4]
• One-Hot Encoding
  • One output neuron per class, e.g. targets [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]

[Figure: input / hidden / output layer diagrams for each encoding]
Example
• Google Colaboratory Examples
• COMP2712 ML Review Feature Normalisation: Output
  https://colab.research.google.com/drive/1LkHD_QTzqhmo6URWHxwlLXE8rytW02kD?usp=sharing
• COMP2712 Exploring feature reduction with PCA with ML
  https://colab.research.google.com/drive/1TqDnU8D5M4mNd9hmpDJRofiuj7V9l5Gc?usp=sharing
