Biological and numerical neuron: #1
https://cs231n.github.io/neural-networks-1/
Biological and numerical neuron: #2
While originally based on a simplistic model of the neurons in human and animal brains,
the artificial neuron is NOT meant to be a computer-based simulation of a biological
neuron.
INSTEAD, the goal of the artificial neuron is to achieve the same ability to learn from
experience as the biological neuron.
https://jolt.law.harvard.edu/assets/articlePDFs/v31/The-Artificial-Intelligence-Black-Box-and-the-Failure-of-Intent-and-Causation-Yavar-Bathaee.pdf
Axon, synapse, dendrite, neuron
https://cs231n.github.io/neural-networks-1/
Feed forward, Recurrent, Fully and sparsely connected
Ability to generalise: to predict the correct targets for patterns the learning system has not
seen before.
Therefore, it is advisable to choose a sample at each iteration that is the most unfamiliar to
the system.
Note, this applies only to stochastic learning since the order of input presentation is
irrelevant for batch.
The data is zero-centered by subtracting the mean in each dimension. Geometric interpretation: the data cloud is now centered around the origin.
X -= np.mean(X, axis = 0)
Each dimension is additionally scaled by its standard deviation. The red lines indicate the extent of the data - they are of unequal length in the middle, but of equal length on the right.
X /= np.std(X, axis = 0)
https://cs231n.github.io/neural-networks-2/
Decorrelating and whitening of data: #1
https://cs231n.github.io/neural-networks-2/
After performing PCA. The data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal).
Each dimension is additionally scaled by the eigenvalues, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic gaussian blob.
Decorrelating and whitening of data: #2
In practice. We mention PCA/Whitening in these notes for completeness, but these transformations are not used with
Convolutional Networks. However, it is very important to zero-center the data, and it is common to see normalization of every
pixel as well.
https://cs231n.github.io/neural-networks-2/
A more thorough method: Decorrelate the input components
For a linear neuron, we get a big win by decorrelating each component of the input from the
other input components.
There are several different ways to decorrelate inputs. A reasonable method is to use
Principal Components Analysis.
Drop the principal components with the smallest eigenvalues.
This achieves some dimensionality reduction.
Divide the remaining principal components by the square roots of their eigenvalues. For a
linear neuron, this converts an axis aligned elliptical error surface into a circular one.
https://www.cs.toronto.edu/~hinton/coursera/lecture2/lec2.pptx
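The PCA steps above can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's own code: the function name `pca_whiten`, the parameter `n_keep`, the small constant `eps` (added to avoid division by zero), and the toy data are all assumptions made for the example.

```python
import numpy as np

def pca_whiten(X, n_keep, eps=1e-5):
    """Decorrelate X (rows = examples) and rescale the kept components to unit variance."""
    X = X - X.mean(axis=0)                       # zero-center each dimension
    cov = X.T @ X / X.shape[0]                   # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_keep]   # drop the smallest eigenvalues
    X_rot = X @ eigvecs[:, order]                # project onto kept principal components
    return X_rot / np.sqrt(eigvals[order] + eps) # divide by sqrt of the eigenvalues

rng = np.random.default_rng(0)
# Toy correlated data: the third feature is almost a copy of the first.
A = rng.normal(size=(500, 2))
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + 0.01 * rng.normal(size=500)])

X_white = pca_whiten(X, n_keep=2)
cov_white = X_white.T @ X_white / X_white.shape[0]  # close to the identity matrix
```

Dropping the third component loses very little here because it carries almost no variance of its own.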
Common pitfall
An important point to make about the preprocessing is that any preprocessing statistics
(e.g. the data mean) must only be computed on the training data, and then applied to the
validation / test data.
E.g. computing the mean and subtracting it from every image across the entire dataset and
then splitting the data into train/val/test splits would be a mistake.
Instead, the mean must be computed only over the training data and then subtracted
equally from all splits (train/val/test).
https://cs231n.github.io/neural-networks-2/
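A minimal sketch of the correct order of operations, assuming a toy dataset and an arbitrary 800/100/100 split (both are illustrative choices, not from the source): split first, compute the mean on the training split only, then subtract that same mean from every split.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # toy data, one row per example

# Split FIRST, then compute statistics on the training split only.
X_train, X_val, X_test = X[:800], X[800:900], X[900:]
train_mean = X_train.mean(axis=0)

X_train_c = X_train - train_mean
X_val_c = X_val - train_mean    # the training mean, not the validation mean
X_test_c = X_test - train_mean  # the training mean, not the test mean
```

The validation and test splits end up only approximately centered, which is expected: they are treated exactly as unseen data would be.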
Weight initialization: #1
Pitfall: all zero initialization. This turns out to be a mistake, because if every neuron in the network computes the
same output, then they will also all compute the same gradients during backpropagation and undergo the exact
same parameter updates. There is no source of asymmetry between neurons.
Small random numbers. Therefore, we still want the weights to be very close to zero, but as we have argued
above, not identically zero. As a solution, it is common to initialize the weights of the neurons to small numbers
and refer to doing so as symmetry breaking. The idea is that the neurons are all random and unique in the
beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.
Calibrating the variances with 1/sqrt(n). One problem with the above suggestion is that the distribution of the
outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that
we can normalize the variance of each neuron’s output to 1 by dividing its weight vector by the square root of its
fan-in (i.e. its number of inputs).
https://cs231n.github.io/neural-networks-2/
Weight initialization: #2
Pitfall: all zero initialization. This turns out to be a mistake, because if every neuron in the
network computes the same output, then they will also all compute the same gradients
during backpropagation and undergo the exact same parameter updates. There is no
source of asymmetry between neurons.
But there is an exception: Logistic Regression doesn't have a hidden layer. If you initialize
the weights to zeros, the first example x fed into the logistic regression gives a zero
pre-activation (so the output is sigmoid(0) = 0.5), but the derivatives with respect to the
weights depend on the input x (because there's no hidden layer), which is not zero. So at the
second iteration, the weight values follow x's distribution and are different from each other
if x is not a constant vector.
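The exception can be checked numerically. The sketch below assumes a single training example, a cross-entropy loss, and a learning rate of 0.1; the example values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # one training example (not a constant vector)
y = 1.0                          # its label
w = np.zeros(3)                  # all-zero initialization
b = 0.0

a = sigmoid(w @ x + b)           # pre-activation is 0, so the output is 0.5
grad_w = (a - y) * x             # cross-entropy gradient w.r.t. w depends on x
w_new = w - 0.1 * grad_w         # after one step the weights already differ
```

After a single update the weights are proportional to x, so the symmetry is broken by the data itself.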
Weight initialization: #3
It is however okay to initialize the biases to zeros. Symmetry is still broken so long as
the weights are initialized randomly.
Weight initialization: #4
Large random initialisation -> breaks the symmetry, but results are still not satisfactory.
Xavier initialisation -> multiply the random initial weights by the factor sqrt(1/n), where n is the number of inputs to the neuron.
He initialisation -> multiply the random initial weights by the factor sqrt(2/n). This is
also known as Kaiming initialization.
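A rough sketch of the two scaling factors, assuming standard-normal random weights and taking n to be the fan-in. The function names `xavier_init` and `he_init` and the layer sizes are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Standard-normal weights scaled by sqrt(1/n), with n taken as the fan-in.
    return rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    # Standard-normal weights scaled by sqrt(2/n).
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

n_in, n_out = 1000, 200
x = rng.standard_normal((500, n_in))                        # unit-variance inputs

var_raw = (x @ rng.standard_normal((n_in, n_out))).var()    # grows with n_in
var_xavier = (x @ xavier_init(n_in, n_out)).var()           # roughly 1
var_he = (x @ he_init(n_in, n_out)).var()                   # roughly 2
```

The unscaled weights give pre-activations with variance near n_in, which is the problem the calibration is meant to remove. The extra factor of 2 in He initialisation is usually motivated by ReLU units zeroing about half of their inputs.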
The Sigmoid functions: #1
One potential problem with using symmetric sigmoids is that the error surface can be
very flat near the origin.
For this reason it is good to avoid initializing with very small weights.
Because of the saturation of the sigmoids the error surface is also flat far from the origin.
Adding a small linear term to the sigmoid can sometimes help avoid the flat regions.
It is preferable to track epochs rather than iterations since the number of iterations
depends on the arbitrary setting of batch size.
The number of iterations in one epoch is the number of training images divided by the batch size.
For example, CIFAR-10 has 50,000 training images; with a batch size of 100, one epoch =
50,000/100 = 500 iterations.
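The arithmetic can be written as a small helper. The function name is hypothetical, and rounding up for a final partial batch is an assumption that the text does not discuss.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    # Round up so that a final partial batch still counts as one iteration.
    return math.ceil(num_examples / batch_size)

iters = iterations_per_epoch(50_000, 100)   # CIFAR-10 example from the text
epochs_done = 1_250 / iters                 # convert an iteration count into epochs
```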
https://cs231n.github.io/neural-networks-3/
Smith, Leslie N. "Cyclical learning rates for training neural networks." 2017 IEEE winter conference on applications of computer vision (WACV). IEEE, 2017.
Loss function rate: #1
With low learning rates the improvements will be
linear. With high learning rates they will start to look
more exponential.
https://cs231n.github.io/neural-networks-3/
Loss function rate: #2
https://cs231n.github.io/neural-networks-3/
Loss function classification
Loss functions can be classified according to the type of response variable y.
Continuous response, y ∈ R:
• Gaussian L2 loss function
• Laplace L1 loss function
• Huber loss function, δ specified
• Quantile loss function, α specified
Categorical response, y ∈ {0,1}:
• Binomial loss function
• Adaboost loss function
Other families of response variable:
• Loss functions for survival models
• Loss functions for counts data
• Custom loss functions
Natekin, Alexey, and Alois Knoll. "Gradient boosting machines, a tutorial." Frontiers in neurorobotics 7 (2013): 21.
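The four continuous-response losses can be sketched in NumPy as below. The function names and the default values of δ and α are illustrative assumptions; the formulas are the standard per-example definitions, with y the target and f the prediction.

```python
import numpy as np

def l2_loss(y, f):
    # Gaussian (squared error) loss
    return 0.5 * (y - f) ** 2

def l1_loss(y, f):
    # Laplace (absolute error) loss
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    # Quadratic for small residuals, linear for large ones
    r = np.abs(y - f)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def quantile_loss(y, f, alpha=0.5):
    # Asymmetric absolute loss; alpha = 0.5 gives half the L1 loss
    r = y - f
    return np.where(r >= 0, alpha * r, (alpha - 1.0) * r)

y = np.array([0.0, 0.0, 0.0])
f = np.array([0.5, -2.0, 3.0])
```

For these values, `huber_loss(y, f)` treats the first residual quadratically and the other two linearly, which is what makes it less sensitive to outliers than the L2 loss.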
Loss functions for regression and classification
https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
Tips for using learning rate schedules: #1
Increase the initial learning rate. Because the learning rate will decrease, start with a larger
value to decrease from. A larger learning rate will result in much larger changes to the
weights, at least in the beginning, allowing you to benefit from fine tuning later.
Use a large momentum. Using a larger momentum value will help the optimization
algorithm to continue to make updates in the right direction when your learning rate
shrinks to small values.
Experiment with different schedules. It will not be clear which learning rate schedule to
use so try a few with different configuration options and see what works best on your
problem. Also try schedules that change exponentially and even schedules that respond to
the accuracy of your model on the training or test datasets.
Tips for using learning rate schedules: #2
[Figures: Time-Based Learning Rate Schedule; Drop-Based Learning Rate Schedule]
Annealing the learning rate
• In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the
fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter
k. Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer
time.
https://cs231n.github.io/neural-networks-3/
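The schedules mentioned above can be sketched as plain functions of the epoch number. These are common textbook forms, not necessarily the exact ones plotted on the slides; the function and parameter names (`decay`, `drop`, `epochs_per_drop`, `k`) are assumptions chosen for the example.

```python
import math

def time_based_decay(lr0, decay, epoch):
    # Learning rate shrinks smoothly as 1 / (1 + decay * epoch).
    return lr0 / (1.0 + decay * epoch)

def step_decay(lr0, drop, epochs_per_drop, epoch):
    # Learning rate is multiplied by `drop` every `epochs_per_drop` epochs.
    return lr0 * drop ** math.floor(epoch / epochs_per_drop)

def exponential_decay(lr0, k, epoch):
    # Learning rate decays as exp(-k * epoch).
    return lr0 * math.exp(-k * epoch)

# Step decay: halve the rate every 10 epochs, starting from 0.1.
schedule = [step_decay(0.1, 0.5, 10, e) for e in range(30)]
```

In step decay the two hyperparameters read directly as "multiply by 0.5 every 10 epochs", which is the interpretability argument made above.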
Best way of finding the learning rate?
Commonly done (not efficiently): many people discover the optimum learning rate via grid
search. This is incredibly time-consuming!
What better option do we have? Over the course of an epoch, start out with a small
learning rate and increase it at each mini-batch, resulting in a high rate at the end of the
epoch. Calculate the loss for each rate and then, looking at a plot, pick the learning rate
(red circle) that gives the greatest decline.
Smith, Leslie N. "Cyclical learning rates for training neural networks." 2017 IEEE winter conference on applications of computer vision (WACV). IEEE, 2017.
Cyclical learning rate [from the original paper]: #2
An intuitive understanding of why CLR methods work comes from considering the loss
function topology. The difficulty in minimizing the loss arises from saddle points rather than
poor local minima.
Saddle points have small gradients that slow the learning process. However, increasing the
learning rate allows more rapid traversal of saddle point plateaus. A more practical reason
why CLR works is that the optimum learning rate is likely to lie between the
bounds, and near-optimal learning rates will be used throughout training.
Smith, Leslie N. "Cyclical learning rates for training neural networks." 2017 IEEE winter conference on applications of computer vision (WACV). IEEE, 2017.
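A minimal sketch of the triangular cyclical policy, in which the learning rate moves linearly between a lower and an upper bound. The function name and the example bounds and step size are illustrative choices.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size):
    # step_size is the number of iterations in half a cycle.
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Learning rate sampled every 1000 iterations over two full cycles.
lrs = [triangular_clr(i, 0.001, 0.006, 2000) for i in range(0, 8001, 1000)]
```

The rate starts at the lower bound, reaches the upper bound after `step_size` iterations, and returns to the lower bound after `2 * step_size` iterations.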
Differential Learning Rates
We have seen so far how we have applied one learning rate to the entire model.
When training a model from scratch, that probably makes sense, but when it comes to
transfer learning, we can normally get a little better accuracy if we try something different:
training different groups of layers at different rates
The functions used in the context of neural networks usually have defined left derivatives
and defined right derivatives.
Software implementations of neural network training usually return one of the one-sided
derivatives rather than reporting that the derivative is undefined or raising an error.
http://www.deeplearningbook.org/contents/mlp.html
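As an illustration, the rectified linear unit is not differentiable at z = 0, but its left derivative (0) and right derivative (1) are both defined. The sketch below returns the left derivative at that point; this is one possible convention, chosen here as an assumption, not a statement about any particular library.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # At z == 0 the derivative is undefined; return the left derivative (0)
    # instead of raising an error.
    return (z > 0).astype(float)

z = np.array([-1.0, 0.0, 2.0])
```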
How to address non-differentiability?: #2
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine
transformation z = W^T x + b, and then applying an element-wise nonlinear function g(z).
Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z), which
may also be non-differentiable at some points; as long as the left or right derivative is defined there, training can still
proceed. This workaround seems to work very well.
Affine transformation: a geometric transformation that preserves lines and parallelism (but not necessarily distances and
angles). It is composed of a linear map (which may include rotation, scaling, and shear) followed by a translation.
http://www.deeplearningbook.org/contents/mlp.html
Increase layers or increase No of parameters?
The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units.
“using a rectifying nonlinearity is the single most important factor in improving the
performance of a recognition system,” among several different factors of neural network
architecture design.
http://www.deeplearningbook.org/contents/mlp.html
Synthetic data: #1
As the term “synthetic” suggests, synthetic datasets are generated through computer
programs, instead of being composed through the documentation of real-world events. The
primary purpose of a synthetic dataset is to be versatile and robust enough to be useful for
the training of machine learning models.
Why use it? Companies often have difficulty acquiring large amounts of data to train an
accurate model within a given time frame. Hand-labeling data is a costly, slow way to
acquire data. However, generating and using synthetic data can help data scientists and
companies overcome these hurdles and develop reliable machine learning models in a
quicker fashion.
https://www.unite.ai/what-is-synthetic-data/
Synthetic data: #2
Ceiling analysis: #1
Ceiling analysis is a way to systematically find the weakest component of your system, so
that you can spend your time optimising the component that would bring the
greatest improvement to the overall system.
Ceiling analysis determines which component would yield the fastest improvements.
https://medium.com/@rossbulat/ceiling-analysis-in-deep-learning-and-software-development-8bc41e59364a
Ceiling analysis: #2
• Baseline Accuracy: 68%
Perfect Text Detection: 69%
Perfect Face Detection: 78%
Perfect Gender Detection: 100%
• As we can see, perfect text detection only yielded a 1% improvement in our overall accuracy. This suggests that
time invested in improving our text detection algorithm will likely not improve our overall system that much.
• Moving on to face detection: it looks quite strong, but could be optimised, as there is a 9% difference given
perfect face detection results. Working on this component may be a good bet to improve the overall accuracy. Before
making the final call, let’s observe the final component.
• It appears that gender detection is struggling the most, yielding a 22% improvement in accuracy with perfect results.
This is by far the weakest link in the overall system.
https://medium.com/@rossbulat/ceiling-analysis-in-deep-learning-and-software-development-8bc41e59364a
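The ceiling analysis above amounts to differencing consecutive accuracies. A small sketch, using the figures from the slide and assuming, as the 1%, 9% and 22% figures imply, that each stage is made perfect cumulatively, on top of the previous ones:

```python
baseline = 0.68
# Overall accuracy when each stage, cumulatively, is replaced by ground truth.
stages = [
    ("text detection", 0.69),
    ("face detection", 0.78),
    ("gender detection", 1.00),
]

gains = {}
previous = baseline
for name, accuracy in stages:
    gains[name] = round(accuracy - previous, 2)  # improvement attributable to this stage
    previous = accuracy

weakest = max(gains, key=gains.get)  # component with the largest potential gain
```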
Do neural nets have saddle points and why this is a problem?
Experiments show neural nets do have as many saddle points as random matrix theory
predicts
Major implication: most minima are good, and this is more true for big models.
Minor implication: the reason that Newton’s method works poorly for neural nets is its
attraction to the ubiquitous saddle points.
http://www.deeplearningbook.org/slides/sgd_and_cost_structure.pdf
Type of Bias: #1
Bias in data is an inconsistency with the phenomenon that data represents. This inconsistency may occur for a number of
reasons (which are not mutually exclusive).
https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0
Type of Bias: #2
https://www.kaggle.com/alexisbcook/identifying-bias-in-ai
Model card
A model card is a short document that provides key information about a machine learning
model. Model cards increase transparency by communicating information about trained
models to broad audiences.
https://www.kaggle.com/var0101/model-cards
Survivorship bias
The phenomenon where only those that ‘survived’ a long process are included or excluded
in an analysis, thus creating a biased sample.
A great example provided by Sreenivasan Chandrasekar is the following:
“We enroll for gym membership and attend for a few days. We see the same faces of many
people who are fit, motivated and exercising everyday whenever we go to gym. After a few
days we become depressed why we aren’t able to stick to our schedule and motivation
more than a week when most of the people who we saw at gym could. What we didn’t see
was that many of the people who had enrolled for gym membership had also stopped
turning up for gym just after a week and we didn’t see them.”
Two types of data leakage
Leakage happens when your training data contains information that will not be available
when the model is used for prediction.
In other words, leakage is highly likely to make a model look accurate during training
but perform poorly once deployed.
Why only highly likely? Because sometimes leakage changes nothing and goes
unnoticed, which is why preventing it comes down to following good practices.
There are two main types of leakage:
Target leakage
Train-test contamination
https://www.kaggle.com/alexisbcook/data-leakage
Target leakage
Think of it in terms of the timing or chronological order in which data becomes
available, not merely whether a feature helps make good predictions.
The took_antibiotic_medicine column was updated after getting pneumonia.
The model would see this correlation and learn it.
When the model is deployed, this information would not be available, so the model would
perform poorly.
https://www.kaggle.com/alexisbcook/data-leakage
Data leakage [while normalising]: #1
Data leakage in supervised learning is the unintentional introduction of information about
the target that should not be made available. Training on contaminated data leads to
overly optimistic expectations about the model performance.
For example, consider the case where we want to normalize data, that is scale input
variables to the range 0-1.
When we normalize the input variables, this requires that we first calculate the minimum
and maximum values for each variable before using these values to scale the variables.
If the dataset is only then split into train and test datasets, the examples in the training
dataset know something about the data in the test dataset:
they have been scaled by the global minimum and maximum values, so they know more
about the global distribution of the variable than they should.
https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0
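A leak-free version of the same normalisation, sketched with toy data and an arbitrary 150/50 split (both are assumptions made for illustration): split first, take the minimum and maximum from the training set only, and reuse them for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(200, 2))   # toy data, two input variables

X_train, X_test = X[:150], X[150:]           # split BEFORE computing min/max

x_min = X_train.min(axis=0)                  # statistics from training data only
x_max = X_train.max(axis=0)

X_train_scaled = (X_train - x_min) / (x_max - x_min)
X_test_scaled = (X_test - x_min) / (x_max - x_min)   # may fall slightly outside [0, 1]
```

Test values outside the training range map outside [0, 1]; that is the honest behaviour, since the same would happen with genuinely new data.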
Data leakage [while training]: #2
3 sets: training, validation, and test sets. You train on the training data while evaluating
performance on the validation data. You then do a final check on the test set.
Why not have two sets: a training set and a test set? You’d train on the training data and
evaluate on the test data. Simple? No.
Short answer is that: developing a model always involves hyperparameter tuning. To do
this you need a performance feedback signal from validation data. By doing so, like it or
not, some information about the validation data leaks into the model. If you repeat this
many times you’ll leak an increasingly significant amount of information about the
validation set into the model. At the end of the day, you’ll end up with a model that
performs artificially well on the validation data, because that’s what you optimized it for.
You care about performance on completely new data, hence the test set.
Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018.
Data leakage [temporal leak]: #3
If you’re trying to predict the future given the past (for example, tomorrow’s weather, stock
movements, and so on), you should not randomly shuffle your data before splitting it,
because doing so will create a temporal leak: your model will effectively be trained on data
from the future.
In such situations, you should always make sure all data in your test set is posterior to the
data in the training set.
Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018.
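A chronological split can be sketched as below. The timestamps, the toy feature, and the 80/20 ratio are made up for illustration; the point is that the data is sorted by time rather than shuffled.

```python
import numpy as np

timestamps = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])   # records in arbitrary order
values = timestamps * 10.0                                # toy feature tied to each record

order = np.argsort(timestamps)            # sort chronologically; do NOT shuffle
timestamps, values = timestamps[order], values[order]

split = int(0.8 * len(timestamps))        # first 80% for training, last 20% for testing
train_t, test_t = timestamps[:split], timestamps[split:]
train_v, test_v = values[:split], values[split:]
```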
What constitutes “good data”?
Main properties of a learning algorithm
Explainability: Do the model predictions require explanation for a non-technical audience?
In-memory vs. out-of-memory: Can your dataset be fully loaded into the RAM of your
laptop or server?
Number of features and examples: How many training examples do you have in your
dataset? How many features does each example have?
Nonlinearity of the data: Is your data linearly separable? Can it be modeled using a linear
model?
Training speed: How much time is a learning algorithm allowed to use to build a model, and
how often you will need to retrain the model on updated data?
Prediction speed: How fast must the model be when generating predictions? Will your
model be used in a production environment where very high throughput is required?
https://www.dropbox.com/s/z45hr0y8vj2opj4/Chapter5.pdf?dl=0
Offline and online evaluation: #1
An offline model evaluation happens when the model is being trained by the analyst. The
analyst tries out different features, models, algorithms, and hyperparameters. The offline
model evaluation reflects how well the analyst succeeded in finding the right features,
learning algorithm, model, and values of hyperparameters. In other words, the offline model
evaluation reflects how good the model is from an engineering standpoint.
An online model evaluation means testing and comparing models in production by using
online data. Online evaluation, on the other hand, focuses on measuring business outcomes,
such as customer satisfaction, average online time, open rate, and click-through rate. This
information may not be reflected in historical data, but it’s what the business really cares
about. Furthermore, offline evaluation doesn’t allow us to test the model in some conditions
that can be observed online, such as connection and data loss, and call delays.
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
Offline and online evaluation: #2
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
Offline and online evaluation: #3
At the end of the day, why do we need such a distinction? The main reason is that the data may show a
distribution shift, which needs to be monitored.
One way of doing such monitoring is to compare the performance of the model on online and
historical data.
If the performance on online data becomes significantly worse, as compared to historical, it’s
time to retrain the model.
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
Offline and online evaluation: #4
Different online techniques exist:
A/B Test: “Which of the two model candidates works better in production?” A/B testing is often used
on websites and mobile applications to test whether a specific change in the design or wording
positively affects business metrics such as user engagement, click-through rate, or sales rate. The
null hypothesis states that the new model doesn’t change the average value of the business metric.
The alternative hypothesis states that the new model changes the average value of the metric.
G-Test: The first formulation of A/B test is based on the G-test. It is appropriate for a metric that
counts the answer to a “yes” or “no” question. An advantage of the G-test is that you can ask any
question, as long as only two answers are possible.
Z-Test: The second formulation of A/B test applies when the question for each user is “How many?”
or “How much?” (as opposed to the yes-or-no question considered in the G-test formulation).
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
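The two formulations can be sketched with the standard library only. This is a hedged illustration, not the cited chapter's procedure: the function names, the 2x2 layout (model A vs. model B, yes vs. no), and the two-sided p-values are assumptions made here.

```python
import math

def g_test_2x2(yes_a, no_a, yes_b, no_b):
    """G-test of independence for yes/no counts under models A and B."""
    observed = [[yes_a, no_a], [yes_b, no_b]]
    total = yes_a + no_a + yes_b + no_b
    row = [yes_a + no_a, yes_b + no_b]
    col = [yes_a + yes_b, no_a + no_b]
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total   # count expected under the null hypothesis
            if observed[i][j] > 0:
                g += 2.0 * observed[i][j] * math.log(observed[i][j] / expected)
    g = max(g, 0.0)                              # guard against tiny negative rounding
    p_value = math.erfc(math.sqrt(g / 2.0))      # chi-square tail, 1 degree of freedom
    return g, p_value

def z_test_means(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Z-test for a difference in the average of a 'how many / how much' metric."""
    z = (mean_b - mean_a) / math.sqrt(var_a / n_a + var_b / n_b)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value

g, p_g = g_test_2x2(120, 880, 80, 920)            # e.g. clicks vs. no clicks
z, p_z = z_test_means(10.0, 4.0, 100, 11.0, 4.0, 100)
```

In both cases a small p-value is evidence against the null hypothesis that the new model does not change the metric.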
Why don’t we solve it analytically?: #1
y = Σ_i w_i x_i = w^T x
where y is the neuron’s estimate of the desired output, w is the weight vector, and x is the input vector.
https://www.cs.toronto.edu/~hinton/coursera_slides.html
Why don’t we solve it analytically?: #2
It is straight-forward to write down a set of equations, one per training case, and to solve for the
best set of weights.
This is the standard engineering approach so why don’t we use it?
Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural
networks.
The analytic solution relies on it being linear and having a squared error measure.
Iterative methods are usually less efficient but they are much easier to generalize.
https://www.cs.toronto.edu/~hinton/coursera_slides.html
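For a linear neuron with squared error, both routes reach the same weights. The sketch below compares them on toy, noise-free data; the data, the learning rate, and the number of iterations are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # one row per training case
w_true = np.array([2.0, -1.0, 0.5])
t = X @ w_true                           # noise-free targets

# Analytic route: solve the least-squares problem directly.
w_analytic, *_ = np.linalg.lstsq(X, t, rcond=None)

# Iterative route: batch gradient descent on the squared error.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    y = X @ w                            # neuron's current estimates
    w += lr * X.T @ (t - y) / len(X)     # move weights along the negative gradient
```

The analytic solve is a single call here, but only the iterative loop carries over unchanged in spirit to multi-layer, non-linear networks.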
Capacity
https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
Model variance and model bias
Small/Simple models: may not do well on the
training data, but the difference between training
and test error quickly drops.
Big/Rich models: will learn the training data, but
the difference between training and test error
can be large.
https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
Optimal capacity
The curve of training error and test
error for a given training set size, as a
function of the capacity of the
machine (the richness of the class of
function) has a minimum.
https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
Occam’s Razor
Occam’s Razor: when given the choice between several models that explain the data
equally well, choose the “simplest” one.
Occam’s Razor applied to machine learning: choose a trade off between how well the
model fits the training data and how “simple” that model is.
https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
Occam’s Razor: what he really meant?
Occam’s razor famously states that entities should not be multiplied beyond necessity.
In ML, this is often taken to mean that, given two classifiers with the same training error,
the simpler of the two will likely have the lower test error. However, this is not guaranteed to hold.
The conclusion is that simplicity should be preferred because simplicity is a virtue in its
own right, not because of a hypothetical connection with accuracy. This is probably what
Occam meant in the first place.
Brink, Henrik, et al. Real-world machine learning. Shelter Island, NY: Manning, 2017.
Mean imputation of missing data
Mean imputation is the practice of replacing null values in a data set with the mean of the
data.
Mean imputation is generally bad practice because it doesn’t take into account feature
correlation. For example, imagine we have a table showing age and fitness score and
imagine that an eighty-year-old has a missing fitness score. If we took the average fitness
score from an age range of 15 to 80, then the eighty-year-old will appear to have a much
higher fitness score than he actually should.
Second, mean imputation reduces the variance of the data and increases bias in our data.
This leads to a less accurate model and a narrower confidence interval due to a smaller
variance.
https://towardsdatascience.com/understanding-multiple-regression-249b16bde83e
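The variance reduction is easy to see on synthetic data. In this sketch the fitness scores and the 30% missing rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=50.0, scale=10.0, size=1000)   # toy fitness scores

missing = rng.random(1000) < 0.3          # hide roughly 30% of the values
observed = scores[~missing]

imputed = scores.copy()
imputed[missing] = observed.mean()        # mean imputation

var_observed = observed.var()             # variance of the values we actually have
var_imputed = imputed.var()               # smaller: imputed points sit exactly at the mean
```

The mean is preserved, but every imputed point contributes zero spread, so the variance shrinks roughly in proportion to the fraction of values imputed.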
A comparison of imputation techniques
https://projector-video-pdf-converter.datacamp.com/17404/chapter3.pdf
Can you build a machine learning model to monitor
another model?
But would it help to train a separate, second model to predict whether the first model is correct? The
answer might disappoint.
https://evidentlyai.com/blog/can-you-build-a-model-to-monitor-a-model
Just Ask for Generalization
Generalizing to what you want may be easier than optimizing directly for what you want.
We might even ask for "consciousness".
https://evjang.com/2021/10/23/generalization.html
Domain knowledge
The importance of training test
Whenever you’re designing Machine Learning algorithms, you should think of the test set as a very
precious resource that should ideally never be touched until one time at the very end.
Otherwise, the very real danger is that you may tune your hyperparameters to work well on the test
set, but if you were to deploy your model you could see a significantly reduced performance. In
practice, we would say that you overfit to the test set.
If you only use the test set once at end, it remains a good proxy for measuring the generalization of
your classifier (we will see much more discussion surrounding generalization later in the class).
Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at
all. The idea is to split our training set in two: a slightly smaller training set, and what we call a
validation set.
https://cs231n.github.io/classification/
Orthogonalization
Orthogonalization — Refers to the concept of picking parameters to tune which only adjust
one (single) outcome (at the time) of the ML model, e.g. regularization is a knob to reduce
variance.
https://towardsdatascience.com/structuring-your-machine-learning-project-course-summary-in-1-picture-and-22-nuggets-of-wisdom-95b051a6c9dd
Normalisation
Why normalise features? Normalise the features in your data (e.g. one pixel in images)
to have zero mean and unit variance.
Is there any case where we can avoid performing normalisation? Yes, when working
with images. Pixels in images are usually homogeneous and do not exhibit widely different
distributions, alleviating the need for data normalisation.
https://cs231n.github.io/classification/
Why do we have to perform normalisation?
Normalizing is important because a lot of multiplication will be happening as the input
passes through the layers of the neural network; keeping the incoming values between 0
and 1 prevents the values from getting too large during the training phase (known as the
exploding gradient problem).
With smaller networks, BatchNorm is indeed less useful, but as they get larger, the effect of
any layer on another, say 20 layers down, can be vast because of repeated multiplication,
and you may end up with either vanishing or exploding gradients, both of which are fatal to
the training process.
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
Normalisation vs. standardisation: #2
In theory:
Normalization would work better for uniformly
distributed data.
Standardization tends to work best for normally
distributed data.
However, in practice, data is rarely distributed following a
perfect curve.
Common sense solution? Usually, if your dataset is not too
big and you have time, you can try both and see which one
performs better for your task.
What is the benefit? These manipulations are generally used
to improve the numerical stability of some calculations. Some
models benefit from the predictors being on a common scale.
https://www.dropbox.com/s/7leuhzwq8ove3x8/Chapter4.pdf?dl=0
Why normalise inputs?
Z-scores and Normalization
Normalise all the different variables to make their range/distribution comparable via the Z-score
transform: Z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the variable.
The average value of a Z-score over all points is zero. Values greater than the mean become
positive, while those less than the mean become negative. The standard deviation of the Z-scores
is 1, so all distributions of Z-scores have similar properties. Transforming values to Z-scores
accomplishes two goals. First, they aid in visualizing patterns and correlations, by ensuring that all
fields have an identical mean (zero) and operate over a similar range.
Z-scores are best used on normally distributed variables, which, after all, are completely
described by mean μ and standard deviation σ. But they work less well when the distribution is a
power law.
Skiena, Steven S. The data science design manual. Springer, 2017
Should I Standardise then Normalise?
Standardization can give values that are both positive and negative centred around zero. It
may be desirable to normalize data after it has been standardized.
This might be a good idea if you have a mixture of standardized and normalized variables
and wish all input variables to have the same minimum and maximum values as input for a
given algorithm, such as an algorithm that calculates distance measures.
Standardisation with outliers
Many machine learning algorithms perform better when numerical input variables are scaled to a
standard range.
This includes algorithms that use a weighted sum of the input, like linear regression, and
algorithms that use distance measures, like k-nearest neighbours.
Standardizing is a popular scaling technique that subtracts the mean from values and divides by
the standard deviation, so that the input variable has zero mean and unit variance (a standard
Gaussian if the variable was Gaussian to begin with; the shape of the distribution itself is not
changed). Standardization can become skewed or biased if the input variable contains outlier values.
To overcome this, the median and interquartile range can be used when standardizing numerical
input variables, generally referred to as robust scaling.
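The effect of an outlier on the two approaches can be sketched as follows. This is a NumPy illustration with made-up numbers, assuming the robust version centres on the median and divides by the interquartile range; the function names are hypothetical.

```python
import numpy as np

def standard_scale(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()

def robust_scale(x):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

# Five ordinary values and one outlier (hypothetical data).
x = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 1000.0])

s = standard_scale(x)
r = robust_scale(x)

# Spread of the five ordinary values after each kind of scaling.
spread_standard = s[:5].max() - s[:5].min()
spread_robust = r[:5].max() - r[:5].min()
print(spread_standard, spread_robust)
```

With standard scaling the outlier inflates the mean and standard deviation, so the ordinary values are squeezed into a very narrow band; with the median and interquartile range they keep a usable spread.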
What happens if you do not standardise
the data?
We can illustrate it using PCA. In PCA we are interested in the components that maximize
the variance.
If one component (e.g. human height) varies less than another (e.g. weight) because of
their respective scales (meters vs. kilos), PCA might determine that the direction of
maximal variance more closely corresponds with the ‘weight’ axis, if those features are not
scaled.
As a change in height of one meter can be considered much more important than the
change in weight of one kilogram, this is clearly incorrect.
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
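The height and weight argument can be checked numerically. The sketch below uses synthetic data (the means, spreads and the helper `first_component` are assumptions made for illustration) and computes the first principal component from the covariance matrix, with and without standardisation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: height in metres, weight in kilograms.
height_m = rng.normal(1.70, 0.10, size=200)
weight_kg = 70 + 60 * (height_m - 1.70) + rng.normal(0, 8, size=200)
X = np.column_stack([height_m, weight_kg])

def first_component(X):
    """Direction of maximal variance: top eigenvector of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1]

pc_raw = first_component(X)
pc_std = first_component((X - X.mean(axis=0)) / X.std(axis=0))

print(np.abs(pc_raw))   # almost entirely along the weight axis
print(np.abs(pc_std))   # height and weight contribute equally
```

Without scaling, the first component is essentially the weight axis simply because kilograms produce larger numbers than metres. After standardisation both features carry equal weight in the first component.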
Example of standardisation
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
Scaling
As provided by Scikit-learn
StandardScaler ensures that for each feature the mean is 0
and the variance is 1, bringing all features to the same
magnitude. However, this scaling does not ensure any
particular minimum and maximum values for the features.
RobustScaler works similarly to the StandardScaler in that it
ensures statistical properties for each feature that guarantee
that they are on the same scale. However, the RobustScaler
uses the median and quartiles, instead of mean and variance.
This makes the RobustScaler ignore data points that are very
different from the rest (like measurement errors). These odd
data points are also called outliers, and can lead to trouble for
other scaling techniques.
MinMaxScaler, on the other hand, shifts the data such that all
features are exactly between 0 and 1.
Normalizer does a very different kind of rescaling. It scales
each data point such that the feature vector has a Euclidean
length of 1.
Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016.
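The difference between per-feature scaling (as in MinMaxScaler) and per-data-point scaling (as in Normalizer) can be sketched in plain NumPy. This is a re-implementation of the idea for illustration, not a call into scikit-learn, and the data is made up.

```python
import numpy as np

# Hypothetical data: three points with two features each.
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# Min-max style: works per feature (column); every column ends up in [0, 1].
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Normalizer style: works per data point (row); every row gets Euclidean length 1.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

print(X_minmax)
print(X_unit)
```

The first two rows point in the same direction, so after row-wise normalisation they become identical: only the direction of each data point survives, not its magnitude.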
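How to spot whether scaling is done properly with MinMax?
If the test set is scaled separately from the training set, the dataset looks different: the test
points move incongruously to the training set, because they were scaled differently.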
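A minimal sketch of min-max scaling done properly on a train/test split: the minimum and range are computed on the training set only and then reused for the test set. The arrays are hypothetical, and the "bad" variant is included only to show the failure mode.

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[2.0, 20.0], [4.0, 40.0]])

# Correct: statistics come from the training set only.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min
X_train_scaled = (X_train - train_min) / train_range
X_test_ok = (X_test - train_min) / train_range

# Incorrect: the test set is scaled with its own minimum and maximum.
test_min = X_test.min(axis=0)
X_test_bad = (X_test - test_min) / (X_test.max(axis=0) - test_min)

print(X_train_scaled[1], X_test_ok[0], X_test_bad[0])
```

The test point [2, 20] is identical to a training point. With the shared transformation it lands on the same scaled coordinates as that training point; scaled separately, it is moved to the corner of the unit square, which is the incongruous shift to look for.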
Generally the target is not scaled. Normalizing (a type of scaling) the output will not affect the
shape of the objective function, so it's generally not necessary.
If you scale the target, your mean squared error (MSE) is automatically scaled. Additionally,
you need to look at the mean absolute scaled error (MASE). MASE>1 automatically means
that you are doing worse than a constant (naive) prediction.
https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re
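The statement that the MSE is automatically scaled can be checked directly: multiplying the target (and the predictions) by a constant c multiplies the MSE by c squared. The sketch below uses made-up numbers. The last ratio compares the model's mean absolute error with that of a constant (mean) prediction; it is a simplified stand-in in the spirit of a scaled error, not the standard MASE definition.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

def mse(a, b):
    return np.mean((a - b) ** 2)

def mae(a, b):
    return np.mean(np.abs(a - b))

c = 0.01  # scaling factor applied to the target
mse_raw = mse(y_true, y_pred)
mse_scaled = mse(c * y_true, c * y_pred)

# Error relative to a constant prediction (the mean of the targets).
baseline = np.full_like(y_true, y_true.mean())
ratio_raw = mae(y_true, y_pred) / mae(y_true, baseline)
ratio_scaled = mae(c * y_true, c * y_pred) / mae(c * y_true, c * baseline)

print(mse_raw, mse_scaled)      # differ by a factor of c**2
print(ratio_raw, ratio_scaled)  # unchanged by the scaling
```

A raw MSE therefore cannot be compared across differently scaled targets, whereas a ratio against a naive baseline is unaffected by the scaling; a ratio above 1 means the model is worse than the baseline.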
Is it necessary to scale the target value in addition to
scaling features for regression analysis?: #2
Consider the case where there are outliers that can't be filtered out as they are important
to the model.
Consider the case where the distribution is right-skewed (I have not said left-skewed, so
take it as a warning!): then you can normalise the variable.
One type of normalisation can be done by taking the logarithm.
What about left-skewed distributions? A log transformation of a left-skewed distribution
will tend to make it even more left-skewed, for the same reason it often makes a right-skewed
one more symmetric.
https://gdcoder.com/when-why-to-use-log-transformation-in-regression/
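The effect of the logarithm on skewness can be illustrated on synthetic data. This is a sketch under stated assumptions: the right-skewed sample is drawn from a log-normal distribution, the left-skewed sample is its mirror image shifted to stay positive, and `skewness` is a hand-written helper for the third standardised moment.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of the cubed standardised values."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
right = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed, positive
left = right.max() + 1.0 - right                         # mirror image: left-skewed, positive

skew_right, skew_right_log = skewness(right), skewness(np.log(right))
skew_left, skew_left_log = skewness(left), skewness(np.log(left))

print(skew_right, skew_right_log)  # strongly positive, then close to zero
print(skew_left, skew_left_log)    # negative, then even more negative
```

The logarithm compresses large values and stretches small ones. That pulls in a long right tail but lengthens a long left tail, which is why it helps in the first case and hurts in the second.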
Numerical vs. analytic gradient
We discussed the tradeoffs between computing the numerical and analytic gradient.
The numerical gradient is simple to implement, but it is approximate and expensive to compute.
The analytic gradient is exact and fast to compute, but more error-prone, since it requires the
derivation of the gradient with math.
Hence, in practice we always use the analytic gradient and then perform a gradient check,
in which its implementation is compared to the numerical gradient.
https://cs231n.github.io/optimization-1/
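A gradient check can be sketched as follows. The function `f`, its hand-derived gradient, the step size `h` and the particular relative-error formula are all assumptions chosen for illustration; the numerical gradient uses a centred difference.

```python
import numpy as np

def f(w):
    """Toy objective (hypothetical): sum of w^2 + sin(w)."""
    return np.sum(w ** 2) + np.sum(np.sin(w))

def analytic_grad(w):
    """Gradient of f derived by hand: 2w + cos(w)."""
    return 2 * w + np.cos(w)

def numerical_grad(f, w, h=1e-5):
    """Centred finite difference, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w = np.array([0.5, -1.2, 2.0])
ga = analytic_grad(w)
gn = numerical_grad(f, w)

# Largest relative difference between the two gradients.
rel_error = np.max(np.abs(ga - gn) / np.maximum(np.abs(ga) + np.abs(gn), 1e-12))
print(rel_error)
```

A very small relative error suggests the analytic gradient is implemented correctly. Note that the numerical version needs two evaluations of `f` per parameter, which is why it is used only for checking and not for training.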