
General observations

1
Biological and numerical neuron: #1

https://cs231n.github.io/neural-networks-1/
Biological and numerical neuron: #2

 While originally based on a simplistic model of the neurons in human and animal brains,
the artificial neuron is NOT meant to be a computer-based simulation of a biological
neuron.

 INSTEAD, the goal of the artificial neuron is to achieve the same ability to learn from
experience as with the biological neuron

https://jolt.law.harvard.edu/assets/articlePDFs/v31/The-Artificial-Intelligence-Black-Box-and-the-Failure-of-Intent-and-Causation-Yavar-Bathaee.pdf
3
Axon, synapse, dendrite, neuron

Sze, Vivienne, et al. "Efficient processing of deep neural networks: A tutorial and survey." Proceedings of the IEEE 105.12 (2017): 2295-2329.
4
Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning
with PyTorch. Manning Publications Company, 2020
5
An artificial neuron – a linear transformation enclosed in a nonlinear function
Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning
with PyTorch. Manning Publications Company, 2020
6
Sizing neural networks

The first network (left) has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.

https://cs231n.github.io/neural-networks-1/
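A minimal Python sketch (not from the cited notes) that reproduces the count above for layer sizes [3, 4, 2]:

# Count learnable parameters of a fully connected network given its layer sizes:
# 3 inputs, one hidden layer of 4 neurons, 2 output neurons.
def count_parameters(layer_sizes):
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights + biases

print(count_parameters([3, 4, 2]))  # 12 + 8 = 20 weights, 4 + 2 = 6 biases -> 26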
7
8
Feed forward, Recurrent, Fully and sparsely
connected

Sze, Vivienne, et al. "Efficient processing of deep neural networks: A tutorial and survey." Proceedings of the IEEE 105.12 (2017): 2295-2329.
9
Ability to generalise

 Ability to generalise: to predict the correct targets for patterns the learning system has not previously seen.

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
10
How to choose the training points?

 NNs learn the fastest from the most unexpected sample.

 Therefore, it is advisable to choose a sample at each iteration that is the most unfamiliar to
the system.

 Note, this applies only to stochastic learning since the order of input presentation is
irrelevant for batch.

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
11
Normalizing the inputs

1. The average of each input variable over the training set should be close to zero.
2. Scale input variables so that their covariances are about the same.
3. Input variables should be uncorrelated if possible.

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
12
Mean subtraction and normalisation

Middle: the data is zero-centered by subtracting the mean in each dimension (X -= np.mean(X, axis = 0)). Geometric interpretation: the data cloud is now centered around the origin.
Right: each dimension is additionally scaled by its standard deviation (X /= np.std(X, axis = 0)). The red lines indicate the extent of the data: they are of unequal length in the middle, but of equal length on the right.

https://cs231n.github.io/neural-networks-2/ 13
Decorrelating and whitening of data: #1

https://cs231n.github.io/neural-networks-2/
Left: after performing PCA, the data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal).
Right: each dimension is additionally scaled by the eigenvalues, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic gaussian blob.
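A hedged numpy sketch of these two steps, following the cs231n notes; X stands in for an [N x D] data matrix with one example per row:

import numpy as np

X = np.random.randn(1000, 50)          # placeholder data, one example per row
X -= np.mean(X, axis=0)                # zero-center
cov = np.dot(X.T, X) / X.shape[0]      # data covariance matrix
U, S, _ = np.linalg.svd(cov)           # eigenbasis of the covariance
Xrot = np.dot(X, U)                    # decorrelate: rotate into the eigenbasis
Xwhite = Xrot / np.sqrt(S + 1e-5)      # whiten: scale by sqrt of the eigenvalues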

14
Decorrelating and whitening of data: #2

In practice. We mention PCA/Whitening in these notes for completeness, but these transformations are not used with
Convolutional Networks. However, it is very important to zero-center the data, and it is common to see normalization of every
pixel as well.

https://cs231n.github.io/neural-networks-2/ 15
A more thorough method: Decorrelate the input components

 For a linear neuron, we get a big win by decorrelating each component of the input from the
other input components.

 There are several different ways to decorrelate inputs. A reasonable method is to use
Principal Components Analysis.
 Drop the principal components with the smallest eigenvalues.
 This achieves some dimensionality reduction.
 Divide the remaining principal components by the square roots of their eigenvalues. For a
linear neuron, this converts an axis aligned elliptical error surface into a circular one.

https://www.cs.toronto.edu/~hinton/coursera/lecture2/lec2.pptx
Common pitfall

 An important point to make about the preprocessing is that any preprocessing statistics
(e.g. the data mean) must only be computed on the training data, and then applied to the
validation / test data.

 E.g. computing the mean and subtracting it from every image across the entire dataset and
then splitting the data into train/val/test splits would be a mistake.

 Instead, the mean must be computed only over the training data and then subtracted
equally from all splits (train/val/test).

https://cs231n.github.io/neural-networks-2/ 17
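A minimal numpy sketch of the correct procedure, with hypothetical arrays standing in for the real splits:

import numpy as np

X_train = np.random.randn(800, 10)   # hypothetical splits
X_val = np.random.randn(100, 10)
X_test = np.random.randn(100, 10)

mean = X_train.mean(axis=0)   # statistics computed on the training split only
X_train -= mean
X_val -= mean                 # the training mean is reused on the other splits
X_test -= mean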
Weight initialization: #1
 Pitfall: all zero initialization. This turns out to be a mistake, because if every neuron in the network computes the
same output, then they will also all compute the same gradients during backpropagation and undergo the exact
same parameter updates. There is no source of asymmetry between neurons.

 Small random numbers. Therefore, we still want the weights to be very close to zero, but as we have argued
above, not identically zero. As a solution, it is common to initialize the weights of the neurons to small numbers
and refer to doing so as symmetry breaking. The idea is that the neurons are all random and unique in the
beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

 Calibrating the variances with 1/sqrt(n). One problem with the above suggestion is that the distribution of the
outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that
we can normalize the variance of each neuron’s output to 1 by scaling its weight vector by the square root of its
fan-in (i.e. its number of inputs).

 The link below suggests other types of initialisation.

https://cs231n.github.io/neural-networks-2/ 18
Weight initialization: #2
 Pitfall: all zero initialization. This turns out to be a mistake, because if every neuron in the
network computes the same output, then they will also all compute the same gradients
during backpropagation and undergo the exact same parameter updates. There is no
source of asymmetry between neurons.

 But there is an exception: Logistic Regression doesn't have a hidden layer. If you initialize the weights to zeros, the first example x fed into the logistic regression will produce a zero output, but the derivatives of Logistic Regression depend on the input x (because there is no hidden layer), which is not zero. So at the second iteration, the weight values follow x's distribution and are different from each other if x is not a constant vector.

19
Weight initialization: #3

 What you should remember:

 The weights should be initialized randomly to break symmetry.

 It is however okay to initialize the biases to zeros. Symmetry is still broken so long as
the weights are initialized randomly.

20
Weight initialization: #4

 Zeros initialisation -> fails because we do not break the symmetry

 Large random initialisation -> breaks the symmetry, but results are still not satisfactory

 Xavier initialisation -> adds the factor sqrt(1/n) to the random weight initialisation

 He initialisation -> adds the factor sqrt(2/n) to the random weight initialisation. This is also known as Kaiming initialization (a small sketch follows below).
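A minimal numpy sketch of the four options above for a single fully connected layer; n_in and n_out are assumed layer sizes:

import numpy as np

n_in, n_out = 256, 128
W_zeros = np.zeros((n_in, n_out))                              # fails: symmetry is never broken
W_large = 1.0 * np.random.randn(n_in, n_out)                   # breaks symmetry, but can saturate units
W_xavier = np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)  # Xavier: sqrt(1/n) factor
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)      # He / Kaiming: sqrt(2/n) factor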

21
The Sigmoid functions: #1

22

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
The Sigmoid functions: #2

 One of the potential problem using symmetric sigmoids is that the error surface can be
very flat near the origin.

 For this reason it is good to avoid initializing with very small weights.

 Because of the saturation of the sigmoids the error surface is also flat far from the origin.

 Adding a small linear term to the sigmoid can sometimes help avoid the flat regions.

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
23
Initializing weights

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
24
Equalize the learning speeds

LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 9-48.
25
What is an epoch?
 Units of epochs measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once).

 It is preferable to track epochs rather than iterations, since the number of iterations depends on the arbitrary setting of the batch size.

 The number of iterations per epoch is obtained by dividing the number of training images by the batch size. For example, CIFAR-10 has 50,000 training images and the batch size is 100, so one epoch = 50,000/100 = 500 iterations (see the sketch below).
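The same arithmetic as a tiny Python sketch:

num_train_images = 50_000   # CIFAR-10 training set, as in the example above
batch_size = 100
iterations_per_epoch = num_train_images // batch_size
print(iterations_per_epoch)  # 500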

https://cs231n.github.io/neural-networks-3/
Smith, Leslie N. "Cyclical learning rates for training neural networks." 2017 IEEE winter conference on applications of computer vision (WACV). IEEE, 2017.
26
Loss function and learning rate: #1
 With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential.

 Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line).

 This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape.

https://cs231n.github.io/neural-networks-3/ 27
Loss function and learning rate: #2

 The amount of "wiggle" in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high.

 When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high).

https://cs231n.github.io/neural-networks-3/ 28
Loss function classification
 Loss functions can be classified according to the type of response variable y.
 Continuous response, y ∈ R (three of these are sketched in code below):
 Gaussian L2 loss function
 Laplace L1 loss function
 Huber loss function, δ specified
 Quantile loss function, α specified
 Categorical response, y ∈ {0,1}:
 Binomial loss function
 Adaboost loss function
 Other families of response variable:
 Loss functions for survival models
 Loss functions for count data
 Custom loss functions

Natekin, Alexey, and Alois Knoll. "Gradient boosting machines, a tutorial." Frontiers in neurorobotics 7 (2013): 21. 29
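A minimal numpy sketch of three of the regression losses listed above (the Huber formula assumes the standard δ-threshold form):

import numpy as np

def l2_loss(y, y_hat):                      # Gaussian L2 loss
    return 0.5 * (y - y_hat) ** 2

def l1_loss(y, y_hat):                      # Laplace L1 loss
    return np.abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):        # Huber loss, delta specified
    r = np.abs(y - y_hat)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))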
Loss functions for regression and classification

https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
30
Tips for using learning rate schedules: #1
 Increase the initial learning rate. Because the learning rate will decrease, start with a larger
value to decrease from. A larger learning rate will result in a lot larger changes to the
weights, at least in the beginning, allowing you to benefit from fine tuning later.

 Use a large momentum. Using a larger momentum value will help the optimization
algorithm to continue to make updates in the right direction when your learning rate
shrinks to small values.

 Experiment with different schedules. It will not be clear which learning rate schedule to
use so try a few with different configuration options and see what works best on your
problem. Also try schedules that change exponentially and even schedules that respond to
the accuracy of your model on the training or test datasets.

31
Tips for using learning rate schedules: #2
Time-based learning rate schedule (left) and drop-based learning rate schedule (right).
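A hedged Python sketch of the two schedule types named above; the exact formulas vary between libraries, so take these as illustrative only:

def time_based_lr(lr0, decay, iteration):
    # learning rate shrinks smoothly with every iteration
    return lr0 / (1.0 + decay * iteration)

def drop_based_lr(lr0, drop_factor, epochs_per_drop, epoch):
    # learning rate is dropped by a fixed factor every few epochs
    return lr0 * (drop_factor ** (epoch // epochs_per_drop))

print(time_based_lr(0.1, 0.01, 100))    # gradual decay
print(drop_based_lr(0.1, 0.5, 10, 25))  # halved every 10 epochs: 0.1 * 0.5**2 = 0.025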

32
Annealing the learning rate

• In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the
fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter
k. Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer
time.

https://cs231n.github.io/neural-networks-3/ 33
Best way of finding the learning rate?
 Commonly done (not efficient): a lot of people discover the optimum learning rate via grid search. This is incredibly time-consuming!
 What better option do we have? Over the course of an epoch, start out with a small learning rate and increase it to a higher learning rate over each mini-batch, resulting in a high rate at the end of the epoch. Calculate the loss for each rate and then, looking at the plot, pick the learning rate (red circle) that gives the greatest decline. A sketch of this range test follows below.
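A minimal sketch of the range test; train_step is a hypothetical function that runs one mini-batch at the given rate and returns the loss:

import numpy as np

def lr_range_test(train_step, lr_min=1e-6, lr_max=1.0, num_steps=100):
    lrs = np.geomspace(lr_min, lr_max, num_steps)   # exponentially increasing rates
    losses = [train_step(lr) for lr in lrs]         # one mini-batch per rate
    return lrs, losses                              # plot losses vs. lrs, pick the steepest decline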

"Cyclical Learning Rates for Training Neural Networks" by Leslie Smith (2015)
34
Cyclical learning rate
 They oscillate back and forth. The main use of cyclical learning rates is to escape local extreme points, especially sharp local minima (overfitting). Saddle points are abundant in high dimensions, and convergence becomes very slow, if not impossible.

 Thus, if we use a purely decreasing learning rate it is easy to get stuck in a single location, especially in higher dimensions.

 Cyclic learning rates raise the learning rate periodically. This has a short-term negative effect and yet helps to achieve a longer-term beneficial effect.

 This technique is described in the paper cited below; a sketch of the triangular policy follows.


"Cyclical Learning Rates for Training Neural Networks" by Leslie Smith (2015)
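A hedged sketch of the triangular cyclical learning rate policy from the cited paper (parameter names are assumptions):

import numpy as np

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    # the rate ramps linearly from base_lr to max_lr and back every 2 * step_size iterations
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)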
35
https://towardsdatascience.com/advanced-topics-in-neural-networks-f27fbcc638ae
Cyclical learning rate [from the original paper]: #1
 Conventional wisdom dictates that the learning rate should be a single value that monotonically decreases during training. This paper demonstrates the surprising phenomenon that a varying learning rate during training is beneficial overall and thus proposes to let the global learning rate vary cyclically within a band of values instead of setting it to a fixed value.
 Unlike adaptive learning rates (essentially a competitor, based on local learning rates rather than a global one), the CLR methods require essentially no additional computation.

Smith, Leslie N. "Cyclical learning rates for training neural networks." 2017 IEEE
winter conference on applications of computer vision (WACV). IEEE, 2017. 36
Cyclical learning rate [from the original paper]: #2
 An intuitive understanding of why CLR methods work comes from considering the loss
function topology. The difficulty in minimizing the loss arises from saddle points rather than
poor local minima.
 Saddle points have small gradients that slow the learning process. However, increasing the learning rate allows more rapid traversal of saddle point plateaus. A more practical reason why CLR works is that the optimum learning rate is likely to lie between the bounds, so near-optimal learning rates are used throughout training.

Smith, Leslie N. "Cyclical learning rates for training neural networks." 2017 IEEE
winter conference on applications of computer vision (WACV). IEEE, 2017. 37
Differential Learning Rates
 We have seen so far how we have applied one learning rate to the entire model.

 When training a model from scratch that probably makes sense, but when it comes to transfer learning we can normally get a little better accuracy if we try something different: training different groups of layers at different rates (see the PyTorch sketch below).

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019.
38
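A hedged PyTorch sketch of differential learning rates via optimizer parameter groups; the layer names assume a torchvision ResNet-18 and are illustrative only:

import torch
from torchvision import models

# In a real transfer-learning setting this backbone would be pretrained.
model = models.resnet18()
optimizer = torch.optim.SGD(
    [
        {"params": model.layer3.parameters(), "lr": 1e-4},  # earlier layers: small steps
        {"params": model.layer4.parameters(), "lr": 1e-3},
        {"params": model.fc.parameters(), "lr": 1e-2},      # new head: larger steps
    ],
    lr=1e-3,        # default for any group that does not set its own rate
    momentum=0.9,
)
# Parameters not listed in any group (conv1, layer1, layer2, ...) are simply not updated here.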
How to address non-differentiability?: #1
 A function is differentiable at a specific point only if both the left derivative and the right
derivative are defined and equal to each other.

 The functions used in the context of neural networks usually have defined left derivatives
and defined right derivatives.

 Software implementations of neural network training usually return one of the one-sided
derivatives rather than reporting that the derivative is undefined or raising an error.

 This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway.

http://www.deeplearningbook.org/contents/mlp.html 39
How to address non-differentiability?: #2
 Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x and computing an affine transformation z = Wx + b,

 and then applying an element-wise nonlinear function g(z).

 Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z), which can also be non-differentiable; but as long as we have the right or left derivative we can still do something. This workaround seems to work very well (see the sketch below).

 Affine transformation: a geometric transformation that preserves lines and parallelism (but not necessarily distances and angles); it is composed of a linear transformation (rotation, scaling, shear) followed by a translation.
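A minimal numpy sketch of such a hidden unit, using ReLU (which has one-sided derivatives at zero) as the nonlinearity:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)       # non-differentiable at 0, but left/right derivatives exist

def hidden_layer(x, W, b, g=relu):
    z = W @ x + b                   # affine transformation z = Wx + b
    return g(z)                     # element-wise nonlinearity g(z)

x = np.random.randn(3)
W, b = np.random.randn(4, 3), np.zeros(4)
print(hidden_layer(x, W, b))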

http://www.deeplearningbook.org/contents/mlp.html 40
Increase layers or increase No of parameters?

This plot shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance.
http://www.deeplearningbook.org/contents/mlp.html
41
The importance of piecewise linear hidden units

 The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units.

 “using a rectifying nonlinearity is the single most important factor in improving the
performance of a recognition system,” among several different factors of neural network
architecture design.

http://www.deeplearningbook.org/contents/mlp.html
42
Synthetic data: #1
 As the term “synthetic” suggests, synthetic datasets are generated through computer
programs, instead of being composed through the documentation of real-world events. The
primary purpose of a synthetic dataset is to be versatile and robust enough to be useful for
the training of machine learning models.

 Why use it? Companies often have difficulty acquiring large amounts of data to train an accurate model within a given time frame. Hand-labeling data is a costly, slow way to acquire data. However, generating and using synthetic data can help data scientists and companies overcome these hurdles and develop reliable machine learning models in a quicker fashion.

https://www.unite.ai/what-is-synthetic-data/
43
Synthetic data: #2

44
Ceiling analysis: #1

 Ceiling analysis is a way to systematically find the weakest component of your system, and
therefore optimising that weakest component would best serve your time to bring the
greatest improvement to the overall system.

 Ceiling analysis determines what component would yield the fastest improvements.

 In ceiling analysis, we manually overwrite a component to provide 100% accuracy. We do this chronologically until all our components are manually overridden, and observe the changes in accuracy, one component at a time. By the end of this process, our algorithms and overall system will be predicting with 100% accuracy.

https://medium.com/@rossbulat/ceiling-analysis-in-deep-learning-and-software-development-8bc41e59364a
45
Ceiling analysis: #2
• Baseline Accuracy: 68%
Perfect Text Detection: 69%
Perfect Face Detection: 78%
Perfect Gender Detection: 100%

• As we can see, perfect text detection only yielded a 1% improvement in our overall accuracy. This suggests that perhaps
any time invested in improving our text recognition algorithm will likely not improve our overall system that much.

• Text recognition is not the issue here.

• Moving onto the face detection, again, looks quite strong, but could be optimised as there is a 9% difference given
perfect face detection results. Working on this component may be a good bet to improve the overall accuracy. Before
making the final call, let’s observe the final component.

• It appears that gender detection is struggling the most, yielding a 22% improvement in accuracy with perfect results.
Now, this is by far the weakest link in my overall system

https://medium.com/@rossbulat/ceiling-analysis-in-deep-learning-and-software-development-8bc41e59364a
46
Do neural nets have saddle points and why is this a problem?

 Experiments show neural nets do have as many saddle points as random matrix theory
predicts

 Major implication: most minima are good, and this is more true for big models.

 Minor implication: the reason that Newton’s method works poorly for neural nets is its
attraction to the ubiquitous saddle points.

47
http://www.deeplearningbook.org/slides/sgd_and_cost_structure.pdf
Type of Bias: #1
 Bias in data is an inconsistency with the phenomenon that data represents. This inconsistency may occur for a number of
reasons (which are not mutually exclusive).

 There are many kinds of bias:

 Self-selection bias
 Omitted variable bias
 Sponsorship or funding bias
 Sampling bias (also known as distribution shift)
 Prejudice or stereotype bias
 Systematic value distortion
 Experimenter bias
 Labeling bias

 The link below describes ways to avoid each one of these.

https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0 48
Type of Bias: #2

https://www.kaggle.com/alexisbcook/identifying-bias-in-ai 49
Model card
 A model card is a short document that provides key information about a machine learning
model. Model cards increase transparency by communicating information about trained
models to broad audiences.

 GPT-3 model card: https://github.com/openai/gpt-3/blob/master/model-card.md

50
https://www.kaggle.com/var0101/model-cards
Survivorship bias
 The phenomenon where only those that ‘survived’ a long process are included or excluded
in an analysis, thus creating a biased sample.
 A great example provided by Sreenivasan Chandrasekar is the following:
 “We enroll for gym membership and attend for a few days. We see the same faces of many
people who are fit, motivated and exercising everyday whenever we go to gym. After a few
days we become depressed why we aren’t able to stick to our schedule and motivation
more than a week when most of the people who we saw at gym could. What we didn’t see
was that many of the people who had enrolled for gym membership had also stopped
turning up for gym just after a week and we didn’t see them.”

51
Two types of data leakage
 Leakage happens when your training data contains information that will not be available
when the model is used for prediction.
 In other words, leakage is highly likely to cause a model to look very accurate during training but inaccurate once deployed.
 Why only "highly likely"? Because sometimes it would not change a thing and people would not realise it; that is why this comes down to implementing good practices.
 There are two main types of leakage:
 Target leakage
 Train-test contamination

https://www.kaggle.com/alexisbcook/data-leakage 52
Target leakage
 Think of it in terms of the timing or
chronological order that data becomes
available, not merely whether a feature
helps make good predictions.
 The took_antibiotic_medicine column was updated after the patient got pneumonia.
 The model would see this correlation and
learn it.
 While deploying the model, this information
would not be available, so the model would
perform poorly.

https://www.kaggle.com/alexisbcook/data-leakage
53
Data leakage [while normalising]: #1
 Data leakage in supervised learning is the unintentional introduction of information about
the target that should not be made available. Training on contaminated data leads to
overly optimistic expectations about the model performance.

 For example, consider the case where we want to normalize data, that is scale input
variables to the range 0-1.
 When we normalize the input variables, this requires that we first calculate the minimum
and maximum values for each variable before using these values to scale the variables.
 The dataset is then split into train and test datasets, but the examples in the training
dataset know something about the data in the test dataset;
 They have been scaled by the global minimum and maximum values, so they know more about the global distribution of the variable than they should (see the sketch below).

https://www.dropbox.com/s/3l9cp75esgvqpxp/Chapter3.pdf?dl=0 54
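A minimal numpy sketch contrasting the leaky and the leak-free way of min-max scaling, on hypothetical data:

import numpy as np

X = np.random.rand(1000, 5) * 100.0          # hypothetical raw data
X_train, X_test = X[:800], X[800:]

# Leaky: min/max computed on the full dataset before splitting.
X_all_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Leak-free: min/max computed on the training split only, then reused.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = (X_train - lo) / (hi - lo)
X_test_scaled = (X_test - lo) / (hi - lo)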
Data leakage [while training]: #2
 Three sets (training, validation, test): you train on the training data while evaluating performance at each step on the validation data. You then proceed to do a final check on the test set.
 Why not have two sets, a training set and a test set? You'd train on the training data and evaluate on the test data. Simple? No.
 Short answer is that: developing a model always involves hyperparameter tuning. To do
this you need a performance feedback signal from validation data. By doing so, like it or
not, some information about the validation data leaks into the model. If you repeat this
many times you’ll leak an increasingly significant amount of information about the
validation set into the model. At the end of the day, you’ll end up with a model that
performs artificially well on the validation data, because that’s what you optimized it for.
You care about performance on completely new data thus the test set.

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 55
Data leakage [temporal leak]: #3

 If you’re trying to predict the future given the past (for example, tomorrow’s weather, stock
movements, and so on), you should not randomly shuffle your data before splitting it,
because doing so will create a temporal leak: your model will effectively be trained on data
from the future.

 In such situations, you should always make sure all data in your test set is posterior to the
data in the training set.

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 56
What constitutes “good data”?

 it contains enough information that can be used for modelling,
 it has good coverage of what you want to do with the model,
 it reflects real inputs that the model will see in production,
 it is as unbiased as possible,
 it is not a result of the model itself,
 it has consistent labels, and
 it is big enough to allow generalisation.

57
Main properties of a learning algorithm
 Explainability: Do the model predictions require explanation for a non-technical audience?
 In-memory vs. out-of-memory: Can your dataset be fully loaded into the RAM of your
laptop or server?
 Number of features and examples: How many training examples do you have in your
dataset? How many features does each example have?
 Nonlinearity of the data: Is your data linearly separable? Can it be modeled using a linear
model?
 Training speed: How much time is a learning algorithm allowed to use to build a model, and
how often you will need to retrain the model on updated data?
 Prediction speed: How fast must the model be when generating predictions? Will your
model be used in a production environment where very high throughput is required?

https://www.dropbox.com/s/z45hr0y8vj2opj4/Chapter5.pdf?dl=0 58
Offline and online evaluation: #1
 An offline model evaluation happens when the model is being trained by the analyst. The
analyst tries out different features, models, algorithms, and hyperparameters. The offline
model evaluation reflects how well the analyst succeeded in finding the right features,
learning algorithm, model, and values of hyperparameters. In other words, the offline model
evaluation reflects how good the model is from an engineering standpoint.

 An online model evaluation, that is, testing and comparing models in production by using
online data. Online evaluation, on the other hand, focuses on measuring business outcomes,
such as customer satisfaction, average online time, open rate, and click-through rate. This
information may not be reflected in historical data, but it’s what the business really cares
about. Furthermore, offline evaluation doesn’t allow us to test the model in some conditions
that can be observed online, such as connection and data loss, and call delays.

59
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
Offline and online evaluation: #2

60
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
Offline and online evaluation: #3

 At the end of the day, why do we need such a distinction? The main reason is that the data may show a distribution shift.

 As a consequence, the model must be continuously monitored once deployed in production. When a distribution shift happens, the model must be updated with new data and re-deployed.

 One way of doing such monitoring is to compare the performance of the model on online and historical data.

 If the performance on online data becomes significantly worse compared to historical data, it's time to retrain the model.

61
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
Offline and online evaluation: #4
 Different online techniques exist:

 A/B Test: Which of the two model candidates works better in production?” A/B testing is often used
on websites and mobile applications to test whether a specific change in the design or wording
positively affects business metrics such as user engagement, click-through rate, or sales rate. The
null hypothesis states that the new model doesn’t change the average value of the business metric.
The alternative hypothesis states that the new model changes the average value of the metric.

 G-Test: The first formulation of the A/B test is based on the G-test. It is appropriate for a metric that counts the answers to a "yes" or "no" question. An advantage of the G-test is that you can ask any question, as long as only two answers are possible (a sketch follows below).

 Z-Test: The second formulation of the A/B test applies when the question for each user is "How many?" or "How much?" (as opposed to the yes-or-no question considered in the previous subsection).

62
https://www.dropbox.com/s/sslzy9vr4qwarlh/Chapter7.pdf?dl=0
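A hedged sketch of the G-test variant on made-up yes/no counts, using SciPy's log-likelihood option:

import numpy as np
from scipy.stats import chi2_contingency

#                 yes   no   (hypothetical conversion counts per model)
table = np.array([[120, 880],    # model A
                  [150, 850]])   # model B
g_stat, p_value, dof, expected = chi2_contingency(table, lambda_="log-likelihood", correction=False)
print(g_stat, p_value)  # a small p-value rejects "the new model changes nothing"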
Why don’t we solve it analytically?: #1
Linear neurons (also called linear filters): the estimate of the desired output y is a weighted sum of the neuron's input vector x with the weight vector w, y = wᵀx.
https://www.cs.toronto.edu/~hinton/coursera_slides.html
Why don’t we solve it analytically?: #2
 It is straight-forward to write down a set of equations, one per training case, and to solve for the
best set of weights.
 This is the standard engineering approach so why don’t we use it?

 Scientific answer: We want a method that real neurons could use.

 Engineering answer: We want a method that can be generalized to multi-layer, non-linear neural
networks.
 The analytic solution relies on it being linear and having a squared error measure.
 Iterative methods are usually less efficient but they are much easier to generalize.

https://www.cs.toronto.edu/~hinton/coursera_slides.html
Capacity

 The number of training samples for which the training error and test error start converging toward each other is called the capacity of the learning machine.

 There are better formal definitions for this.

https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
65
Model variance and model bias
 Small/Simple models: may not do well on the
training data, but the difference between training
and test error quickly drops.
 Big/Rich models: will learn the training data, but
the difference between training and test error
can be large.

 How much a model deviates from the desired mapping on average is called the model bias of the family of functions.
 How much the output of a model varies when
different drawings of the training set are used is
called the model variance.
 There is a dilemma between bias and variance.

https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
66
Optimal capacity
 The curve of training error and test
error for a given training set size, as a
function of the capacity of the
machine (the richness of the class of
function) has a minimum.

 This is the optimal size for the machine.

https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
67
Occam’s Razor

 Occam's Razor: do not multiply hypotheses beyond what is strictly necessary.

 Occam’s Razor: when given the choice between several models that explain the data
equally well, choose the “simplest” one.

 Occam’s Razor applied to machine learning: choose a trade off between how well the
model fits the training data and how “simple” that model is.

https://cs.nyu.edu/~yann/2008f-G22-2565-001/diglib/lecture03-regularization.pdf
68
Occam's Razor: what did he really mean?

 Occam’s razor famously states that entities should not be multiplied beyond necessity.

 In ML, this is often taken to mean that, given two classifiers with the same training error,
the simpler of the two will likely have the lowest test error.

 The conclusion is that simplicity should be preferred because simplicity is a virtue in its
own right, not because of a hypothetical connection with accuracy. This is probably what
Occam meant in the first place.

Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
69
Full decision diagram for handling missing values

Brink, Henrik, et al. Real-world machine learning. Shelter Island, NY: Manning, 2017. 70
Mean imputation of missing data
 Mean imputation is the practice of replacing null values in a data set with the mean of the
data.

 Mean imputation is generally bad practice because it doesn’t take into account feature
correlation. For example, imagine we have a table showing age and fitness score and
imagine that an eighty-year-old has a missing fitness score. If we took the average fitness
score from an age range of 15 to 80, then the eighty-year-old will appear to have a much
higher fitness score than he actually should.

 Second, mean imputation reduces the variance of the data and increases bias in our data.
This leads to a less accurate model and a narrower confidence interval due to a smaller
variance.

https://towardsdatascience.com/understanding-multiple-regression-249b16bde83e 71
A comparison of imputation techniques

Panels: air quality time-series plot; ffill imputation; bfill imputation; linear interpolation; quadratic interpolation; nearest interpolation.

https://projector-video-pdf-converter.datacamp.com/17404/chapter3.pdf 72
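A hedged pandas sketch of the imputation methods compared above, on a toy series (the air-quality data itself is not reproduced here); the interpolation methods assume SciPy is installed:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0])  # toy series with gaps
ffill = s.ffill()                               # forward fill
bfill = s.bfill()                               # backward fill
linear = s.interpolate(method="linear")
quadratic = s.interpolate(method="quadratic")   # requires SciPy
nearest = s.interpolate(method="nearest")       # requires SciPy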
Can you build a machine learning model to monitor
another model?
 But would it help to train a separate, second model to predict whether the first model is correct? The answer might disappoint.

 Boosting or data drift analysis seems to be a much better bet.

https://evidentlyai.com/blog/can-you-build-a-model-to-monitor-a-model 73
Just Ask for Generalization

 Generalizing to what you want may be easier than optimizing directly for what you want.
We might even ask for "consciousness".

 More on the link below

74
https://evjang.com/2021/10/23/generalization.html
Domain knowledge

 Unlike the ubiquitous applications of ad targeting or image classification, physical problems have idiosyncrasies that require domain knowledge to tackle.

Milani, Pedro M. Machine Learning Approaches to Model Turbulent Mixing in Film Cooling Flows. Diss. Stanford University, 2020.
75
Statistical Learning aka Machine Learning

• Statistical Learning aka Machine Learning.

76
The importance of the test set
 Whenever you’re designing Machine Learning algorithms, you should think of the test set as a very
precious resource that should ideally never be touched until one time at the very end.

 Otherwise, the very real danger is that you may tune your hyperparameters to work well on the test
set, but if you were to deploy your model you could see a significantly reduced performance. In
practice, we would say that you overfit to the test set.

 If you only use the test set once at the end, it remains a good proxy for measuring the generalization of
your classifier (we will see much more discussion surrounding generalization later in the class).

 Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at
all. The idea is to split our training set in two: a slightly smaller training set, and what we call a
validation set.
https://cs231n.github.io/classification/
77
Orthogonalization

 Orthogonalization — Refers to the concept of picking parameters to tune which only adjust
one (single) outcome (at the time) of the ML model, e.g. regularization is a knob to reduce
variance.

https://towardsdatascience.com/structuring-your-machine-learning-project-cours
e-summary-in-1-picture-and-22-nuggets-of-wisdom-95b051a6c9dd
78
Normalisation

 Why normalise features? Normalize the features in your data (e.g. one pixel in images) to have zero mean and unit variance.

 Are there any cases where we can avoid performing normalisation? Yes, while working with images. Pixels in images are usually homogeneous and do not exhibit widely different distributions, alleviating the need for data normalisation.

https://cs231n.github.io/classification/ 79
Why do we have to perform normalisation?
 Normalizing is important because a lot of multiplication will be happening as the input passes through the layers of the neural network; keeping the incoming values between 0 and 1 prevents the values from getting too large during the training phase (known as the exploding gradient problem).

 Connected to this: for small networks BatchNorm is indeed less useful, but as they get larger, the effect of any layer on another, say 20 layers down, can be vast because of repeated multiplication, and you may end up with either vanishing or exploding gradients, both of which are fatal to the training process.

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019.
80
Normalisation vs.
standardisation: #1
 Normalisation is a rescaling of the data from the original range so that all values are within the new range of 0 and 1. It is therefore strictly bounded between 0 and 1. You need to do a full pass over your data to get its min and max.
 Standardising a dataset involves rescaling the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. This means the bounds are no longer between 0 and 1 but could be, say, -2 and 2. Please note the negative values here.
 What do they have in common? They are both scaling methods.
 What is confusing? Standardization is also known as Z-score normalization.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
81
Normalisation vs. standardisation: #2
 In theory:
 Normalization would work better for uniformly
distributed data.
 Standardization tends to work best for normally
distributed data.
 However, in practice, data is rarely distributed following a
perfect curve.
 Common sense solution? Usually, if your dataset is not too
big and you have time, you can try both and see which one
performs better for your task.
 What is the benefit? These manipulations are generally used
to improve the numerical stability of some calculations. Some
models benefit from the predictors being on a common scale.
https://www.dropbox.com/s/7leuhzwq8ove3x8/Chapter4.pdf?dl=0 82
Why normalise inputs?

83
Z-scores and Normalization
 Normalise all the different variables to make their ranges/distributions comparable via the Z-score transform: Z = (x − μ) / σ.

 The average value of a Z-score over all points is zero. Values greater than the mean become positive, while those less than the mean become negative. The standard deviation of the Z-scores is 1, so all distributions of Z-scores have similar properties. Transforming values to Z-scores accomplishes two goals. First, they aid in visualizing patterns and correlations, by ensuring that all fields have an identical mean (zero) and operate over a similar range.
 Z-scores are best used on normally distributed variables, which, after all, are completely described by mean μ and standard deviation σ. But they work less well when the distribution is a power law.

84
Skiena, Steven S. The data science design manual. Springer, 2017
Should I Standardise then Normalise?

 Standardization can give values that are both positive and negative centred around zero. It
may be desirable to normalize data after it has been standardized.

 This might be a good idea if you have a mixture of standardized and normalized variables
and wish all input variables to have the same minimum and maximum values as input for a
given algorithm, such as an algorithm that calculates distance measures.

85
Standardisation with outliers
 Many machine learning algorithms perform better when numerical input variables are scaled to a
standard range.

 This includes algorithms that use a weighted sum of the input, like linear regression, and
algorithms that use distance measures, like k-nearest neighbours.

 Standardizing is a popular scaling technique that subtracts the mean from values and divides by
the standard deviation, transforming the probability distribution for an input variable to a
standard Gaussian (zero mean and unit variance). Standardization can become skewed or biased if
the input variable contains outlier values.

 To overcome this, the median and interquartile range can be used when standardizing numerical
input variables, generally referred to as robust scaling.

86
What happens if you do not standardise
the data?
 We can illustrate it using PCA. In PCA we are interested in the components that maximize
the variance.

 If one component (e.g. human height) varies less than another (e.g. weight) because of
their respective scales (meters vs. kilos), PCA might determine that the direction of
maximal variance more closely corresponds with the ‘weight’ axis, if those features are not
scaled.

 As a change in height of one meter can be considered much more important than the
change in weight of one kilogram, this is clearly incorrect.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
87
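A hedged scikit-learn sketch of this height/weight example; the data is simulated, not taken from the linked page:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height_m = rng.normal(1.70, 0.10, 500)    # metres: tiny numeric variance
weight_kg = rng.normal(70.0, 15.0, 500)   # kilograms: large numeric variance
X = np.column_stack([height_m, weight_kg])

print(PCA(2).fit(X).components_[0])                                  # dominated by weight
print(PCA(2).fit(StandardScaler().fit_transform(X)).components_[0])  # both features contribute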
Example of standardisation

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
88
Scaling
 As provided by Scikit-learn
 StandardScaler ensures that for each feature the mean is 0
and the variance is 1, bringing all features to the same
magnitude. However, this scaling does not ensure any
particular minimum and maximum values for the features.
 RobustScaler works similarly to the StandardScaler in that it
ensures statistical properties for each feature that guarantee
that they are on the same scale. However, the RobustScaler
uses the median and quartiles instead of mean and variance.
This makes the RobustScaler ignore data points that are very
different from the rest (like measurement errors). These odd
data points are also called outliers, and can lead to trouble for
other scaling techniques.
 MinMaxScaler, on the other hand, shifts the data such that all
features are exactly between 0 and 1.
 Normalizer does a very different kind of rescaling. It scales
each data point such that the feature vector has a Euclidean
length of 1.

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 89
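A minimal scikit-learn sketch applying the four scalers described above to hypothetical data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

X = np.random.randn(100, 3) * [1.0, 10.0, 100.0] + [0.0, 5.0, -50.0]  # features on very different scales

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()):
    X_scaled = scaler.fit_transform(X)
    print(type(scaler).__name__, X_scaled.min().round(2), X_scaled.max().round(2))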
How to spot scaling is done properly with MinMax?
Improper scaling (train and test scaled separately): the dataset looks different. The test points moved incongruously relative to the training set, as they were scaled differently. We changed the arrangement of the data in an arbitrary way.

Proper scaling: here, we called fit on the training set, and then called transform on BOTH the training and test sets. You can see that the dataset looks identical to the first; ONLY the ticks on the axes have changed. Now all the features are between 0 and 1.

Guido, Sarah, and Andreas Müller. Introduction to machine learning with Python. Vol. 282. O'Reilly Media, 2016.
90
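A minimal scikit-learn sketch of the correct pattern on hypothetical data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(200, 2) * [10.0, 100.0]        # hypothetical data
X_train, X_test = train_test_split(X, random_state=0)

scaler = MinMaxScaler().fit(X_train)              # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)          # the SAME statistics are applied to the test set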
Is it necessary to scale the target value in addition to
scaling features for regression analysis?: #1
 First off, why do we scale the features? Feature scaling improves the convergence of steepest-descent algorithms, which do not possess the property of scale invariance.

 Generally the target is not scaled. Normalizing (a type of scaling) the output will not affect the shape of the objective function, so it's generally not necessary.

 If you scale the target, your mean squared error (MSE) is automatically scaled. Additionally, you need to look at the mean absolute scaled error (MASE). MASE > 1 automatically means that you are doing worse than a constant (naive) prediction.

 So when do you do it? It is generally done in ANNs.

https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re
91
Is it necessary to scale the target value in addition to
scaling features for regression analysis?: #2

 Consider the case where there are outliers that can't be filtered out, as they are important to the model.
 Consider the case where the distribution is right-skewed (I have not said left-skewed, so take it as a warning!); then you can normalise the variable.
 One type of normalisation can be done by taking the logarithm (see the sketch below).
 What about left-skewed distributions? A log transformation of a left-skewed distribution will tend to make it even more left-skewed, for the same reason it often makes a right-skewed one more symmetric.

https://gdcoder.com/when-why-to-use-log-transformation-in-regression/ 92
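A minimal numpy sketch of the log transform on a simulated right-skewed variable:

import numpy as np

x = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed, strictly positive
x_log = np.log1p(x)                                        # far closer to symmetric
print(np.median(x), x.mean())          # mean well above median: right skew
print(np.median(x_log), x_log.mean())  # mean and median roughly agree after the transform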
Numerical vs. analytic gradient

 We discussed the tradeoffs between computing the numerical and analytic gradient.

 The numerical gradient is simple but it is approximate and expensive to compute.

 The analytic gradient is exact, fast to compute but more error-prone since it requires the
derivation of the gradient with math.

 Hence, in practice we always use the analytic gradient and then perform a gradient check,
in which its implementation is compared to the numerical gradient.

https://cs231n.github.io/optimization-1/ 93
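A minimal numpy sketch of such a gradient check on a toy objective (not the cs231n implementation):

import numpy as np

def f(w):
    return np.sum(w ** 2)          # toy objective with a known gradient

def analytic_grad(w):
    return 2.0 * w

def numerical_grad(f, w, h=1e-5):  # centered finite differences
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2.0 * h)
    return grad

w = np.random.randn(5)
ga, gn = analytic_grad(w), numerical_grad(f, w)
rel_error = np.abs(ga - gn) / (np.abs(ga) + np.abs(gn) + 1e-12)
print(rel_error.max())             # should be tiny, e.g. below 1e-7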
