
Some issues in evaluation

The Problem with Accuracy

[Cartoon: the classifier labels everything "CAT"; the dog asks "What about me?"]

• Unbalanced class distribution
  • Truth: 80 cats (0), 20 dogs (1)
  • Predict: 99 cats, 1 dog
• Labelling (almost) everything as CAT still scores well (worked through in the sketch below):
  • Accuracy : 0.81
  • Precision : 0.81
  • Recall : 1.00
  • F1 : 0.89
• Recall and Precision do not account for True Negatives!
• Informedness (TPR + TNR − 1) : 0.05
  • 0.05 is basically guessing!
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
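As a worked check of the numbers above, here is a minimal scikit-learn sketch (not from the course notebooks) that reproduces the cat/dog example and computes informedness by hand; the arrays are constructed to match the slide.

# Reproducing the cat/dog example above (illustrative sketch, not the course notebook).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([0] * 80 + [1] * 20)   # truth: 80 cats (0), 20 dogs (1)
y_pred = np.array([0] * 99 + [1] * 1)    # prediction: 99 cats, 1 dog (all real cats kept)

# Treat "cat" (0) as the positive class, as on the slide.
print("Accuracy :", accuracy_score(y_true, y_pred))                 # 0.81
print("Precision:", precision_score(y_true, y_pred, pos_label=0))   # ~0.81
print("Recall   :", recall_score(y_true, y_pred, pos_label=0))      # 1.00
print("F1       :", f1_score(y_true, y_pred, pos_label=0))          # ~0.89

# Informedness = TPR + TNR - 1, which does account for the true negatives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()  # dog = negative class
informedness = tp / (tp + fn) + tn / (tn + fp) - 1
print("Informedness:", informedness)                                # 0.05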

The Problem with Accuracy

• The Area Under the Curve – Receiver Operating Characteristic
  • AUC-ROC
• A performance measurement for binary classification problems at various threshold settings
• Sensitivity vs. Specificity, TPR vs. TNR, TP/(TP+FN) vs. TN/(TN+FP)
  • Sensitivity UP means Specificity DOWN, and vice versa
• ROC is a probability curve
• AUC represents the degree (or measure) of separability
  • AUC near 1 means the classifier has a good measure of separability
  • AUC = 0.5 means no class-separation capacity whatsoever
    • Random!
• Informedness is TPR + TNR − 1: the height of a single ROC point above the chance diagonal
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
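A minimal sketch of computing and plotting AUC-ROC with scikit-learn; the synthetic scores below are illustrative assumptions, not data from the linked article.

# Sketch: ROC curve and AUC for a binary classifier's scores (synthetic data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(500), np.ones(500)])
# Overlapping score distributions => the classes are only partially separable.
y_score = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR vs FPR at every threshold
auc = roc_auc_score(y_true, y_score)                # 1.0 = perfect separability, 0.5 = random

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "k--", label="random (AUC = 0.5)")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()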



What is an “epoch”?
• Epoch
  • Presenting all K examples in the training sample constitutes one epoch of training.
• Batch Learning (also known as offline learning)
  • Adjustments to the weights are performed after all training samples have been presented
  • The cost function for batch learning is defined by the average error energy
• Online Learning (also known as stochastic gradient descent)
  • Adjustments to the weights are performed on a sample-by-sample basis
  • The cost function for online learning is defined by the instantaneous error energy
• Both methods shuffle the order of the samples after each epoch
Batch vs. Online (see the update-rule sketch below)

• Batch
  • The Good
    • Accurate estimate of the gradient vector
    • Parallelization of the learning process
  • The Bad
    • Demanding on storage requirements
• Online
  • The Good
    • Simple to implement
    • Less likely to be trapped in a local minimum
    • Sensitive to redundant and nonstationary data (it can exploit redundancy and track changes)
  • The Bad
    • Sensitive to redundant and nonstationary data (updates are noisy, sample by sample)
    • Cannot parallelize the learning process
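To make the two update schedules concrete, here is a minimal NumPy sketch (an illustrative assumption, using a linear model with squared error rather than a neural network): batch learning makes one update per epoch from the average gradient, while online learning makes one update per shuffled sample.

# Sketch: full-batch vs. online (per-sample) gradient descent for y ~ X @ w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def batch_epoch(w, X, y, lr=0.5):
    """One epoch of batch learning: a single update from the average gradient."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def online_epoch(w, X, y, lr=0.01):
    """One epoch of online learning: one update per shuffled sample."""
    for i in rng.permutation(len(y)):
        grad_i = (X[i] @ w - y[i]) * X[i]
        w = w - lr * grad_i
    return w

w_batch = w_online = np.zeros(3)
for epoch in range(20):
    w_batch = batch_epoch(w_batch, X, y)
    w_online = online_epoch(w_online, X, y)
print(w_batch, w_online)   # both approach [2.0, -1.0, 0.5]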
Why don’t we have both?
• Mini-Batch learning
  • Update the weights several times over the course of a single epoch
• Batch size, e.g. with 1000 training samples:
  • (Full) Batch: 1000
  • Mini-batch: 500, 200, or 100
  • Online: 1
• Mini-Batch can be used when the full Batch cannot fit in memory.
• Mini-Batch might converge quicker and generalize better (see the Keras sketch below).

https://www.kaggle.com/residentmario/full-batch-mini-batch-and-online-learning
Introduction to Machine Learning with Python: A Guide for Data Scientists
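In Keras this choice is just the batch_size argument to fit(); the tiny model and random data below are placeholders to show the three regimes, not the course example.

# Sketch: batch_size selects full-batch, mini-batch, or online learning in Keras.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# batch_size controls how many samples are seen before each weight update:
model.fit(X, y, epochs=5, batch_size=1000, verbose=0)  # full batch: 1 update per epoch
model.fit(X, y, epochs=5, batch_size=100, verbose=0)   # mini-batch: 10 updates per epoch
model.fit(X, y, epochs=5, batch_size=1, verbose=0)     # online/SGD: 1000 updates per epoch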

Overfitting and Early-Stopping

• Networks will almost always memorize a little bit of the training data rather than learn generalizations.
• Test data measures the generalization error.
• Early stopping uses a third set, the validation data, to stop training and prevent overfitting.
• Validation data can also be used to tune the model.


Early Stopping
• Separately compute the error on the training set and on the validation set
• Weights are modified using only the training-set (in-sample) error
• Error will generally keep decreasing on the training set, but may start rising on the validation set
  • That is where overfitting begins…
  • Learning should be halted at that point to get good generalization performance (see the callback sketch below)
https://www.doc.ic.ac.uk/~nuric/teaching/imperial-college-machine-learning-neural-networks.html
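A minimal sketch of early stopping with the Keras EarlyStopping callback; the model and synthetic data are placeholders. validation_split carves the validation set out of the training data, and training halts once the validation loss stops improving.

# Sketch: early stopping on a tiny synthetic binary-classification problem.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 10).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation error, not the training error
    patience=5,                  # allow 5 epochs without improvement before halting
    restore_best_weights=True,   # roll back to the weights from the best epoch
)

# validation_split holds out 20% of the training data as the validation set.
history = model.fit(X, y, validation_split=0.2, epochs=200,
                    callbacks=[early_stop], verbose=0)
print("stopped after", len(history.history["val_loss"]), "epochs")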

Overfitting and Early-Stopping

[Figure: training vs. validation (generalisation) error curves — stop training before the validation error starts to rise. "Just avoid overfitting!"]
Example
• Google Colaboratory Example
• COMP2712 Evaluating Machine Learning
  https://colab.research.google.com/drive/1tbbjAMc9QetoYQRsB19KagXBz6d9Pqwl?usp=sharing
COMP2712 NNML

Feature Level Processing


Feature Normalisation
• Hard for the mouse to get noticed when its scale is so different from the elephant's.
• The values of inputs × weights for Feature A will dominate the summation in the next layer compared to the Feature B values.
• This will make convergence difficult, especially if Feature B is important for class separability.

[Figure: Feature A on a much larger scale than Feature B]
Feature Normalisation
• Ah, now we can see the mouse!
• Now that Feature A and Feature B are on the same scale, convergence will be quicker as the values are comparable.

[Figure: Feature A and Feature B on the same scale]
Feature Normalisation
• Z-score: zero mean and unit standard deviation

• Min-Max scaling: [0, 1]


https://colab.research.google.com/drive/1nbxaVa7YElj9EdG78q1iHhXYDzIjLBH0?usp=sharing

Feature Normalisation

Z-score:  z = (x − μ) / σ

Min-Max:  X_norm = (X − X_min) / (X_max − X_min)
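Both scalings are available in scikit-learn; a minimal sketch with made-up data in which Feature A (the "elephant") dwarfs Feature B (the "mouse"):

# Sketch: z-score and min-max scaling per feature with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[5000.0, 0.2],
              [7000.0, 0.5],
              [6500.0, 0.1],
              [8000.0, 0.9]])

X_z = StandardScaler().fit_transform(X)    # z = (x - mean) / std, per column
X_mm = MinMaxScaler().fit_transform(X)     # (x - min) / (max - min), per column -> [0, 1]

print(X_z.mean(axis=0).round(3), X_z.std(axis=0).round(3))   # ~[0, 0] and [1, 1]
print(X_mm.min(axis=0), X_mm.max(axis=0))                    # [0, 0] and [1, 1]

In practice, fit the scaler on the training set only and then apply the same transform to the validation and test sets.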
Curse of dimensionality
• Model fitting
  • Determine the relationship between the predictors and the outcome so that future values can be predicted.
• The more predictors a model has, the more the model can learn from the data.
  • But real data contains random noise, redundancies, etc.
  • The more predictors, the higher the probability that the model will learn fake patterns within the data
    • GENERALISE NOISE
    • OVERFITTING
• With fewer predictors, the model may not learn enough information
  • UNDERFITTING
• SOLUTION:
  • Find an appropriate balance between simplicity and complexity.
Feature Extraction

• Principal Components Analysis (PCA) to the rescue
• Reduces the original data features into uncorrelated principal components
  • Each component represents a different set of correlated features with a different amount of variation.
• “Retain components that account for 90% of the variation”
  • Depends on the data
  • Could reduce from hundreds of features to tens
Principal Components Analysis (PCA)
• The identification of linear combinations of variables that provide maximum variability within a set of data.
• Suppose that the data are plotted on a graph.
  • PCA finds the average along each axis (variable) within the data and then shifts the points until the centre of the averages is at the origin.
  • A straight line through the origin which minimizes the distance between itself and all the data points is then fit to the data.
• Best-fit line:
  • A line passing through the origin which maximizes the sum of squared distances from the projected points (along the line) to the origin.
  • Once this line is determined, it is referred to as the first principal component.

https://towardsdatascience.com/understanding-principal-component-analysis-ddaf350a363a
PCA

https://builtin.com/data-science/step-step-explanation-principal-component-analysis
• Assuming that the first principal component does not account for 100% of the variation within the data set,
  • use the second principal component:
  • the linear combination of variables which maximizes variability among all other linear combinations that are orthogonal to the first.
  • Simply put, once the first principal component is accounted for, the second principal component is the line perpendicular to the initial best-fit line.
• NOW, use the principal components as features! (see the sketch below)
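A minimal scikit-learn sketch of exactly this idea, using the breast cancer data set that appears later in these slides; the 90% threshold matches the earlier rule of thumb.

# Sketch: keep the principal components that explain ~90% of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)    # 30 original features
X_std = StandardScaler().fit_transform(X)     # PCA is scale-sensitive: standardise first

pca = PCA(n_components=0.90)                  # keep enough components for 90% of the variance
X_pca = pca.fit_transform(X_std)              # use these as the new features

print(X.shape[1], "->", X_pca.shape[1], "features")
print(pca.explained_variance_ratio_.cumsum().round(3))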
PCA

https://rstudio-pubs-static.s3.amazonaws.com/884549_bcb9c4e0773243f6946c2bac58950b54.html
Principal Components as Descriptors
• Features as principal components
  • Directions of the data that explain a maximal amount of variance
  • Lines that capture most of the information in the data
• E.g. features based on more than one image
  • If we have a total of n registered images, the corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector:
    x = [x_1, x_2, …, x_n]^T
  • These vectors can be treated as random quantities
  • We can now talk about mean vectors and covariance matrices
The corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector x = [x_1, x_2, …, x_n]^T

• Mean vector of the population:
  m_x = E{x}  (the expected value of x)
• Covariance matrix of the population:
  C_x = E{(x − m_x)(x − m_x)^T}
  • An n × n matrix
  • c_kk is the variance of x_k
  • c_kj is the covariance between x_k and x_j
The corresponding pixels at the same spatial location in all images can be arranged as an n-dimensional vector x = [x_1, x_2, …, x_n]^T

• Every pixel is a point in a “cloud” (population) of pixels representing the same location in an image
• A set of images is then represented as a set of “clouds” of pixels
• There should be some simpler representation, right?
  • In the end we’re talking about a sequence of images of the same thing!
• Remember the eigenvalues and eigenvectors?
Towards Principal Components Analysis (PCA): the Hotelling transform
• Covariance matrix of the population:
  C_x = E{(x − m_x)(x − m_x)^T}
• Because this matrix is real and symmetric, finding a set of orthonormal eigenvectors is always possible
• Let e_i and λ_i be the eigenvectors and corresponding eigenvalues of C_x, arranged in descending order of λ_i
• Let A be the matrix whose rows are formed from the eigenvectors of C_x, arranged in descending order; thus the first row of A is the eigenvector corresponding to the largest eigenvalue.
• A is used to define the Hotelling transform:
  y = A (x − m_x)
• This maps the x’s into vectors denoted by y

Principal Components (analysis) transform
• From the Hotelling transform (and because A is orthogonal, so A^{-1} = A^T), the vectors (images) can be reconstructed from y:
  x = A^T y + m_x
• Now, instead of using all the eigenvectors of C_x, we use only the k eigenvectors corresponding to the k largest eigenvalues, collected in a matrix A_k (of dimension k × n):
  y = A_k (x − m_x)   (Principal Components Transform)
• The y vectors are then k-dimensional
• The reconstruction x̂ = A_k^T y + m_x is no longer exact (but it generates less complex images with the same principal features as the originals!) — see the NumPy sketch below
https://colab.research.google.com/drive/1rOL7B6PGb-bovZ7z26K0daqTCzErZJpX?usp=sharing
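The transform can be written in a few lines of NumPy, following the notation above (rows of X are the x vectors, A holds the eigenvectors of C_x as rows); the random correlated data is just an illustrative stand-in for the registered images.

# Sketch: the Hotelling / principal-components transform in NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # 500 correlated 6-D vectors

m_x = X.mean(axis=0)                      # mean vector of the population
C_x = np.cov(X, rowvar=False)             # n x n covariance matrix

eigvals, eigvecs = np.linalg.eigh(C_x)    # real symmetric => orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
A = eigvecs[:, order].T                   # rows of A = eigenvectors of C_x

k = 2
A_k = A[:k]                               # keep the k largest-eigenvalue rows
Y = (X - m_x) @ A_k.T                     # y = A_k (x - m_x): k-dimensional features
X_hat = Y @ A_k + m_x                     # approximate reconstruction x^ = A_k^T y + m_x

print("reconstruction RMSE:", np.sqrt(np.mean((X - X_hat) ** 2)).round(4))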

Feature Reduction: Breast Cancer

[Figures: PCA-based feature reduction on the breast cancer data set — see the notebooks and article below]
https://colab.research.google.com/drive/1rOL7B6PGb-bovZ7z26K0daqTCzErZJpX?usp=sharing
https://colab.research.google.com/drive/1CjzTBsyt7FsrPb4KCh0qfZEsvHpGTCBq?usp=sharing
https://medium.com/apprentice-journal/pca-application-in-machine-learning-4827c07a61db

Feature Reduction: PCA Limitations

• Model performance: PCA can reduce model performance on datasets with little or no feature correlation, or that do not meet the assumption of linearity.
• Classification accuracy: a variance-based PCA framework does not consider the differentiating characteristics of the classes. The information that distinguishes one class from another might lie in the low-variance components and may be discarded.
• Outliers: PCA is also affected by outliers, so normalization of the data needs to be an essential component of any workflow.
• Interpretability: each principal component is a combination of the original features and does not allow the importance of individual features to be recognized.
https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

Output Transformation
• Categorical variables are often called nominal
• Some examples include:
  • A “pet” variable with the values: “dog” and “cat”
  • A “color” variable with the values: “red”, “green” and “blue”
  • A “place” variable with the values: “first”, “second” and “third”
  • A “passing grade” variable: “fail”, “pass”
  • An “iris variety” variable: “Iris-setosa”, “Iris-versicolor”, “Iris-virginica”
• Each value represents a different category
• Classifiers (like the MLP) need numbers!
Output Transformation
• Two solutions
  • Integer Encoding
    • Each unique category value is assigned an integer value.
    • For example, “red” is 1, “green” is 2, and “blue” is 3.
    • Okay for ordinal variables (order matters), but not for true nominal variables
  • One-Hot Encoding
    • A new binary variable is added for each unique integer value.
    • In the “color” example, there are 3 categories and therefore 3 binary variables are needed. A “1” is placed in the binary variable for that color and “0” in the others.
# One-hot encoding with pandas:
import pandas as pd
y_oh = pd.get_dummies(df['class']).values
# ...or, much easier, with Keras:
import tensorflow as tf
train_labels_oh = tf.keras.utils.to_categorical(train_labels)
Output Transformation

Original data:
  Instance   Class
  1          “Red”
  2          “Green”
  3          “Blue”
  4          “Green”

Integer Encoding:
  Instance   Class
  1          1
  2          2
  3          3
  4          2

One-Hot Encoding:
  Instance   Red   Green   Blue
  1          1     0       0
  2          0     1       0
  3          0     0       1
  4          0     1       0
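The one-hot table above can be produced directly with pandas; the column names here are assumed for illustration.

# Sketch: one-hot encoding the Red/Green/Blue example with pandas.
import pandas as pd

df = pd.DataFrame({"Instance": [1, 2, 3, 4],
                   "Class": ["Red", "Green", "Blue", "Green"]})

one_hot = pd.get_dummies(df["Class"]).astype(int)   # one binary column per category
print(pd.concat([df["Instance"], one_hot], axis=1)) # note: columns come out alphabetically (Blue, Green, Red)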
Output Transformation
• Integer Encoding
  • A single output neuron whose value ranges over the classes, e.g. [1, 4]
• One-Hot Encoding
  • One output neuron per class, e.g. targets [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]

[Figure: input / hidden / output layer diagrams for each encoding]
Example
• Google Colaboratory Examples
• COMP2712 ML Review Feature Normalisation: Output
  https://colab.research.google.com/drive/1LkHD_QTzqhmo6URWHxwlLXE8rytW02kD?usp=sharing
• COMP2712 Exploring feature reduction with PCA with ML
  https://colab.research.google.com/drive/1TqDnU8D5M4mNd9hmpDJRofiuj7V9l5Gc?usp=sharing
