
Supervised Learning

Macro topics
▪ Regression (quick jump)
▪ Classification (quick jump)
▪ Learning to rank (quick jump)

Supervised learning is also known as data mining or pattern extraction.
Brunton, Steven L., et al. "Data-Driven Aerospace Engineering: Reframing the Industry with Machine Learning." arXiv preprint arXiv:2008.10740 (2020).

▪ If the labels are discrete, such as a categorical description of an image (e.g., dog vs. cat), then the supervised learning task is a classification.

▪ If the labels are continuous, such as the lift profile for a particular airfoil shape, then the task is a regression.

▪ In other words: do we have any idea what our results should look like?

Quick summary on the most used methods

▪ K-Nearest neighbours: For small datasets, good as a baseline, easy to explain. Nonlinear.
▪ Linear models: Go-to as a first algorithm to try, good for very large datasets, good for very high-dimensional data.
▪ Naïve Bayes: Only for classification. Even faster than linear models, good for very large datasets and high-dimensional data. Often less accurate than linear models.
▪ Decision trees: Very fast, don't need scaling of the data, can be visualized and easily explained. Nonlinear.
▪ Random forests: Nearly always perform better than a single decision tree, very robust and powerful. Don't need scaling of data. Not good for very high-dimensional sparse data. Nonlinear.
▪ Gradient boosted decision trees: Often slightly more accurate than random forests. Slower to train but faster to predict than random forests, and smaller in memory. Need more parameter tuning than random forests.
▪ Support vector machines: Powerful for medium-sized datasets of features with similar meaning. Require scaling of data, sensitive to parameters. Linear, but non-linear if used with a kernel.
▪ Neural networks: Can build very complex models, particularly for large datasets. Sensitive to scaling of the data and to the choice of parameters. Large models need a long time to train.
▪ Logistic regression: Linear.
▪ Naïve Bayes: Nonlinear.


Supervised learning applications

Structured vs. unstructured data: #1

Structured vs. unstructured data: #2

Structured vs. unstructured data: #3
▪ Why does structured vs. unstructured data matter?

▪ Unstructured data: Because the examples are easy for humans to understand, you can
recruit people to label them and benchmark trained models against human-level
performance (HLP). If you need more examples, you might be able to collect them by
capturing more text/images/audio or by using data augmentation to distort existing
examples. Error analysis can take advantage of human intuition.

▪ Structured data: This class of data is harder for humans to interpret, and thus harder for
humans to label. Algorithms that learn from structured data often surpass HLP, making that
measure a poor benchmark. It can also be hard to find additional examples. For instance, if
the training dataset comprises records of your customers’ purchases, it’s hard to get data
from additional customers beyond your current user base.
Structured vs. unstructured data: #4
▪ Small dataset: If the dataset includes <1,000 examples, you can examine every example
manually, check if the labels are correct, and even add labels yourself. You’re likely to have
only a handful of labelers, so it’s easy to hash out any disagreements together on a call.
Every single example is a significant fraction of the dataset, so it’s worthwhile to fix every
incorrect label.

▪ Large dataset: If the dataset is >100,000 examples, it’s impractical for a single engineer to
examine every one manually. The number of labelers involved is likely to be large, so it’s
critical to define standards clearly, and it may be worthwhile to automate labeling. If a
significant number of examples are mislabeled, it may be hard to fix them, and you may
have to feed the noisy data to your algorithm and hope it can learn a robust model despite
the noise.

How supervised learning typically works

▪ We start by choosing a model-class: y = f(x, W)


▪ A model-class, f, is a way of using some numerical parameters, W, to map each input
vector, x, into a predicted output y.

▪ Learning usually means adjusting the parameters to reduce the discrepancy between the
target output, t, on each training case and the actual output, y, produced by the model.

▪ For regression, ½(y-t)^2 is often a sensible measure of the discrepancy.


▪ For classification there are other measures that are generally more sensible (they also
work better).

https://www.cs.toronto.edu/~hinton/coursera_slides.html
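To make "adjusting the parameters to reduce the discrepancy" concrete, here is a minimal NumPy sketch (toy data invented for illustration, not from the slides) that fits y = w·x + b by gradient descent on the ½(y−t)² measure described above:

```python
import numpy as np

# Toy data: t = 3*x + 2 plus noise (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
t = 3 * x + 2 + 0.1 * rng.normal(size=100)

# Model-class: y = f(x, W) with parameters W = (w, b)
w, b = 0.0, 0.0
lr = 0.1

for _ in range(500):
    y = w * x + b                      # model output
    error = y - t                      # discrepancy on each training case
    loss = 0.5 * np.mean(error ** 2)   # 1/2 (y - t)^2 averaged over cases
    # Gradients of the loss with respect to the parameters
    grad_w = np.mean(error * x)
    grad_b = np.mean(error)
    # Adjust the parameters to reduce the discrepancy
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach 3 and 2
```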
Anomaly Detection vs. Supervised Learning
▪ Use anomaly detection when...
▪ We have a very small number of positive examples (y=1 ... 0-20 examples is common) and a large
number of negative (y=0) examples.
▪ We have many different "types" of anomalies and it is hard for any algorithm to learn from positive
examples what the anomalies look like; future anomalies may look nothing like any of the
anomalous examples we've seen so far.

▪ Use supervised learning when...


▪ We have a large number of both positive and negative examples. In other words, the training set is
more evenly divided into classes.
▪ We have enough positive examples for the algorithm to get a sense of what new positive examples look like. The future positive examples are likely to be similar to the ones in the training set.

Difference between classification and regression

https://www.dropbox.com/s/wxybbtbiv64yf0j/Chapter1.pdf?dl=0 12
Regression
(go back)

Scalar vs. vector regression
▪ Scalar regression—A task where the target is a continuous scalar value. Predicting house
prices is a good example: the different target prices form a continuous space.

▪ Vector regression—A task where the target is a set of continuous values: for example, a
continuous vector. If you’re doing regression against multiple values (such as the
coordinates of a bounding box in an image), then you’re doing vector regression.

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 14
Why would you ever use a linear model?
▪ Linear models often perform well when the number of features is large compared to the
number of samples.
▪ They are also often used on very large datasets, simply because it’s not feasible to train
other models.
▪ However, in lower-dimensional spaces, other models might yield better generalization
performance.

Guido, Sarah, and Andreas Müller. Introduction to machine


learning with python. Vol. 282. O'Reilly Media, 2016. 15
Polynomial features
▪ A trick to extend linear models to include nonlinear polynomial feature interaction terms
without losing the scalability of linear learning algorithms.

Brink, Henrik, et al. Real-world machine learning. Shelter Island, NY: Manning, 2017.
Linear model: ridge regression and classification: #1
(Formulas shown in the slide: the Ordinary Least Squares and Ridge regression objectives.)

▪ This model solves a regression problem where the loss function is the linear least-squares function and regularization is given by the l2-norm. It is also known as Ridge Regression, which is a particular kind of Tikhonov regularization.

▪ Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of
the coefficients.

▪ In general, the method provides improved efficiency in parameter estimation problems in exchange for a
tolerable amount of bias.

https://scikit-learn.org/stable/modules/linear_model.html 17
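A minimal scikit-learn sketch of the idea above (synthetic, nearly collinear data; the alpha value is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, nearly collinear features to make OLS ill-conditioned
rng = np.random.default_rng(0)
x1 = rng.normal(size=(200, 1))
X = np.hstack([x1, x1 + 1e-3 * rng.normal(size=(200, 1))])
y = X[:, 0] + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)      # l2 penalty shrinks the coefficients

print("OLS coefficients:  ", ols.coef_)    # typically large and opposite-signed
print("Ridge coefficients:", ridge.coef_)  # smaller and more stable
```

With strongly correlated columns the OLS coefficients tend to blow up in opposite directions, while the penalty on the size of the coefficients keeps the Ridge solution small and stable.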
Linear model: ridge regression and
classification: #2
▪ This example also shows the usefulness of applying Ridge regression
to highly ill-conditioned matrices. For such matrices, a slight change
in the target variable can cause huge variances in the calculated
weights. In such cases, it is useful to set a certain regularization
(alpha) to reduce this variation (noise).

▪ When alpha is very large, the regularization effect dominates the squared loss function and the coefficients tend to zero. At the other end of the path, as alpha tends toward zero and the solution tends towards ordinary least squares, the coefficients exhibit big oscillations. In practice it is necessary to tune alpha so that a balance is maintained between the two.

https://scikit-learn.org/stable/modules/linear_model.html 18
Linear model: LASSO #1
▪ LASSO stands for "least absolute shrinkage and selection operator”.

▪ LASSO is an automatic and convenient way to introduce sparsity into the linear regression
model. Sparsity is here interpreted as having fewer features.

▪ When applied in a linear regression model, it performs feature selection and regularization of the selected feature weights. See more later.

▪ By feature selection we mean that it looks for correlated features and, if any are found, it proceeds to drop some of them.

https://christophm.github.io/interpretable-ml-book/limo.html
Linear model: LASSO #2
(Formulas shown in the slide: the Ordinary Least Squares, Ridge regression, and Lasso objectives.)

▪ The lasso estimate thus solves the minimization of the least-squares penalty with α‖w‖₁ added, where alpha is a constant and ‖w‖₁ is the L1-norm of the coefficient vector.

▪ Lasso regression yields sparse models, and the alpha parameter controls the degree of sparsity of the estimated coefficients.

https://scikit-learn.org/stable/modules/linear_model.html
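A small hedged sketch of the sparsity just described (synthetic data from make_regression; the alpha value is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem where only 5 of 30 features are informative
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # larger alpha -> sparser model

print("non-zero weights:", np.sum(lasso.coef_ != 0), "out of", X.shape[1])
```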
Linear model: LASSO #3
▪ The L1-norm of the feature weight vector leads to a penalization of large weights. Since the L1-norm is used, many of the weights receive an estimate of 0 and the others are shrunk. (The accompanying figure also reports the number of non-zero weights.)

▪ The parameter lambda (λ) controls the strength of the regularizing


effect and is usually tuned by cross-validation. Especially when lambda
is large, many weights become 0.

▪ The feature weights can be visualized as a function of the penalty term


lambda. Each feature weight is represented by a curve in the following
figure.

▪ The curves in the graph on the left are called regularisation paths.

▪ Pros: it can be automated, considers all features simultaneously, and


can be controlled via lambda.

https://christophm.github.io/interpretable-ml-book/limo.html
Linear model: LASSO #4

▪ Dataset: current population survey

▪ A Lasso model identifies the


correlation between AGE and
EXPERIENCE and suppresses one of
them for the sake of the prediction.

https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py
LASSO vs. Ridge Regularisation
Least Squares minimizes the sum of the squared residuals, which can result in low bias but high variance. To cure this we use regularisation.

(In the figure, W is the slope.)

Regularisation: LASSO (L1). Goal: reduce overfitting on training data. Robustness: more robust. Solution type: unstable solution; can have more than one solution.

Regularisation: RIDGE (L2). Goal: reduce overfitting on training data. Bias: increases. Variance: decreases. Robustness: less robust. Solution type: stable solution; always one solution.

https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e
Linear model: Elastic-Net
▪ ElasticNet is a linear regression model trained with both L1- and L2-norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.

▪ Elastic-net is useful when there are multiple features which are correlated with one
another.

▪ A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to
inherit some of Ridge’s stability under rotation.

https://scikit-learn.org/stable/modules/linear_model.html 24
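A brief sketch of ElasticNet on synthetic correlated features (alpha and l1_ratio values are illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data with correlated features (low effective rank)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       effective_rank=5, noise=2.0, random_state=0)

# l1_ratio trades off between Ridge-like (0.0) and Lasso-like (1.0) behaviour
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print("non-zero weights:", (enet.coef_ != 0).sum())
```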
Logistic regression: #1
▪ Logistic regression, despite its name, is a linear model for classification rather than regression.

▪ This is a learning algorithm that you use when the output labels Y in a supervised learning problem are all
either zero or one, so for binary classification problems.

▪ Logistic regression is also known in the literature as:


▪ logit regression
▪ maximum-entropy classification (MaxEnt) or
▪ the log-linear classifier.

▪ In this model, the probabilities describing the possible outcomes of a single trial are modelled using a
logistic function.

▪ One advantage not to be overlooked is that logistic regression scales very well to large datasets.

https://scikit-learn.org/stable/modules/linear_model.html
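A minimal usage sketch for binary classification with scikit-learn (synthetic data; default solver and regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Binary classification problem: output labels are 0 or 1
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))        # hard 0/1 predictions
print(clf.predict_proba(X[:5]))  # probabilities from the logistic function
```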
Perceptron: #0
▪ The weights and the bias are learned from the data, and
the activation function is handpicked depending on the
network designer’s intuition of the network and its
target outputs.
▪ The activation function, denoted here by f, is typically a
nonlinear function. A linear function is one whose
graph is a straight line.
▪ So, essentially, a perceptron is a composition of a linear
and a nonlinear function.
▪ The linear expression wx+b is also known as an affine
transform.

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build intelligent language
applications using deep learning. " O'Reilly Media, Inc.", 2019. 26
Perceptron: #1
▪ The perceptron is another simple classification algorithm suitable for large scale learning. By default:
▪ It does not require a learning rate.
▪ It is not regularized (penalized).
▪ It updates its model only on mistakes.

▪ The last characteristic implies that the perceptron is slightly faster to train than SGD with the hinge loss
and that the resulting models are sparser.

https://en.wikipedia.org/wiki/Generalized_linear_model
Perceptron: #2
▪ Pretty much the simplest neural network is the perceptron, which approximates a single
neuron with n binary inputs. It computes a weighted sum of its inputs and “fires” if that
weighted sum is zero or greater. However, there are some problems that simply can’t be
solved by a single perceptron. For example, no matter how hard you try, you cannot use a
perceptron to build an XOR gate that outputs 1 if exactly one of its inputs is 1 and 0
otherwise.

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019. 28
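A tiny sketch of the perceptron computation described above, with hand-picked weights implementing an AND gate (the weights are illustrative; as noted, no choice of weights for a single unit can implement XOR):

```python
import numpy as np

def step(x):
    """Fire (return 1) if the weighted sum is zero or greater."""
    return 1.0 if x >= 0 else 0.0

def perceptron_output(weights, bias, inputs):
    # Weighted sum of the binary inputs plus a bias term
    return step(np.dot(weights, inputs) + bias)

# An AND gate: fires only when both inputs are 1
and_weights, and_bias = np.array([2.0, 2.0]), -3.0
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_output(and_weights, and_bias, np.array([a, b])))
```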
Discrete and continuous perceptrons
▪ Perceptron, which is a type of classifier that uses
the features of our data to make a prediction. The
prediction can be 1 or 0. This is called a discrete
perceptron, since it returns an answer from a
discrete set.

▪ There also exist continuous perceptrons, which are called this because they return an answer that can be any number in the interval between 0 and 1.

▪ This answer can be interpreted as a probability, in the sense that sentences with a higher score are more likely to be happy sentences, and vice versa.

Luis G. Serrano, Grokking Machine Learning MEAP V07 29


Difference between perceptron and feed forward
NNs?
▪ What is common? As with the perceptron, for
each neuron we’ll sum up the products of its
inputs and its weights.
▪ What is different? But here, rather than outputting the step_function applied to that sum, we'll output a smooth approximation of the step function.
▪ Why use sigmoid instead of the simpler
step_function? In order to train a neural
network, we’ll need to use calculus, and in
order to use calculus, we need smooth
functions. The step function isn’t even
continuous, and sigmoid is a good smooth
approximation of it. (Think about the existence of the derivative.)

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019. 30
Sigmoid vs. logistic
▪ Technically “sigmoid” refers to the shape of the function, “logistic” to this
particular function (see below) although people often use the terms
interchangeably.

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019.
How to build XOR with feedforward NNs? #1

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019. 32
How to build XOR with feedforward NNs? #2

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019. 33
How to build XOR with feedforward NNs? #3

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019. 34
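Since the XOR slides above are shown as figures, here is one standard construction in plain Python, in the spirit of Grus's book: a hidden layer containing an AND neuron and an OR neuron, and an output neuron computing "OR but not AND" (the specific weight values are illustrative):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def neuron_output(weights, inputs):
    # weights[-1] is the bias term; inputs are extended with a constant 1
    return sigmoid(sum(w * x for w, x in zip(weights, inputs + [1.0])))

# Hidden layer: an AND neuron and an OR neuron; output layer: "OR but not AND"
xor_network = [
    [[20.0, 20.0, -30.0],    # AND gate
     [20.0, 20.0, -10.0]],   # OR gate
    [[-60.0, 60.0, -30.0]],  # output neuron
]

def feed_forward(network, x):
    outputs = x
    for layer in network:
        outputs = [neuron_output(w, outputs) for w in layer]
    return outputs

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(feed_forward(xor_network, [a, b])[0]))  # prints 0, 1, 1, 0
```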
XOR: perceptron vs. multi-layer perceptron: #1

Separating the points in the figure is equivalent to an either-or (XOR) situation.


Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build intelligent language
applications using deep learning. " O'Reilly Media, Inc.", 2019.
XOR: perceptron vs. multi-layer perceptron:
#2
▪ Although it appears in the plot that the MLP has two decision boundaries, and that is its
advantage, it is actually just one decision boundary!

▪ The decision boundary just appears that way because the intermediate representation has
morphed the space to allow one hyperplane to appear in both of those positions.

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build intelligent language
applications using deep learning. " O'Reilly Media, Inc.", 2019.
XOR: perceptron vs. multi-layer perceptron:
#3

It has learned to “warp” the space in which the data lives so that it can divide the dataset
with a single line by the time it passes through the final layer.
Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build intelligent language 37
applications using deep learning. " O'Reilly Media, Inc.", 2019.
XOR: perceptron vs. multi-layer perceptron: #4

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build intelligent language 38
applications using deep learning. " O'Reilly Media, Inc.", 2019.
Passive-aggressive algorithms

▪ The passive-aggressive algorithms are a family of algorithms for large-scale learning.

▪ They are similar to the Perceptron in that they do not require a learning rate.

▪ However, contrary to the Perceptron, they include a regularization parameter.

https://en.wikipedia.org/wiki/Generalized_linear_model
Regression and outliers

▪ Three of the most used methods are: RANSAC, Theil-Sen and HuberRegressor.

▪ They are regression methods designed to be robust to outliers in the data.

https://en.wikipedia.org/wiki/Generalized_linear_model
Methods for detecting outliers

▪ [1] Standard deviation method


▪ [2] Interquartile range method
▪ [3] Automatic Outlier Detection -> LocalOutlierDetection & DBScan
▪ [4] Boxplots
▪ [5] DBScan clustering
▪ [6] Isolation Forest
▪ [7] Robust Random Cut Forest (AMAZON offer?)

https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623
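A short sketch of method [2], the interquartile-range (boxplot) rule, on made-up numbers:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102,
                 12, 14, 17, 19, 107, 10, 13, 12, 14, 12])
print(data[iqr_outliers(data)])  # the extreme values (including 102 and 107) are flagged
```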
Z-Scores vs. Standard Deviation

▪ Standard deviation is essentially a reflection of the amount of variability within a given data
set. Standard deviation is calculated by first determining the difference between each data
point and the mean. The differences are then squared, summed, and averaged. This
produces the variance. The standard deviation is the square root of the variance.

▪ The Z-score, by contrast, is the number of standard deviations a given data point lies from
the mean. For data points that are below the mean, the Z-score is negative. In most large
data sets, 99% of values have a Z-score between -3 and 3, meaning they lie within three
standard deviations above and below the mean.

https://www.investopedia.com/terms/z/zscore.asp 42
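And the standard-deviation/Z-score idea in code (made-up numbers; the |z| > 2 or 3 cut-off is a common rule of thumb, not a fixed standard):

```python
import numpy as np

def z_scores(values):
    # Number of standard deviations each point lies from the mean
    return (values - values.mean()) / values.std()

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 50.0])
z = z_scores(data)
print(z.round(2))
print(data[np.abs(z) > 2])  # flags the extreme value 50
```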
Ignoring the outliers
▪ From a domain-knowledge perspective, outliers can sometimes be ignored.

▪ From a model perspective, some models are more sensitive to outliers than others. For example, gradient boosted trees might try to mitigate making errors on those outliers and put an unnecessary amount of focus on them, whereas a vanilla decision tree might just treat the outlier as a misclassification.

http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/model_selection/tips_and_tricks/tips_and_tricks.ipynb
What is an inlier?
▪ This is an area where there is a bit of inconsistency in terminology which has the
unfortunate effect of confusing some statistical discussions. The concept of an "inlier" is
generally used to refer to a data value that is in error (i.e., subject to measurement error)
but is nonetheless in the "interior" of the distribution of the correctly measured values. By
this definition the inlier has two aspects: (1) it is in the interior of the relevant distribution
of values; and (2) it is an erroneous value.

▪ Dealing with inliers (which really generally involves not dealing with them): Unless you
have a source of external information indicating measurement error, it is essentially
impossible to identify "inliers". By definition, these are data points that are in the "interior"
of the distribution, where most of the other data occurs.

https://stats.stackexchange.com/questions/291709/difference-between-outlier-and-inlier 44
Polynomial regression: extending linear
models with basis functions
▪ We can combine the features in second-order polynomials, so that the model looks like this: ŷ(w, x) = w₀ + w₁·x₁ + w₂·x₂ + w₃·x₁x₂ + w₄·x₁² + w₅·x₂²

▪ The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating a new set of features: z = [x₁, x₂, x₁x₂, x₁², x₂²]

▪ With this re-labeling of the data, our problem can be written: ŷ(w, z) = w₀ + w₁·z₁ + w₂·z₂ + w₃·z₃ + w₄·z₄ + w₅·z₅

▪ See that the resulting polynomial regression is in the same class of linear models we considered above (i.e. the model is linear in w) and can be solved by the same techniques.
https://en.wikipedia.org/wiki/Generalized_linear_model
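A minimal scikit-learn sketch of this trick: expand the inputs with PolynomialFeatures and fit an ordinary linear model on top (toy quadratic data invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy 1-D data with a clearly curved relationship
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + 2.0

# Expand x into [1, x, x^2] and fit an ordinary linear model on the new features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))  # close to 0.5*4 - 2 + 2 = 2
```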
Kernel regression: #1
▪ Kernel regression is a non-parametric method. That means that there are no parameters to learn. The model is based on the data itself.

▪ In its simplest form, in kernel regression we look for a model like this: f(x) = (1/N) Σᵢ wᵢ·yᵢ, with wᵢ = N·k((xᵢ − x)/b) / Σₖ k((xₖ − x)/b). (If your inputs are multi-dimensional feature vectors, the terms xᵢ − x and xₖ − x have to be replaced by Euclidean distances.)

▪ The function k(·) is a kernel. It can have different forms; the most frequently used one is the Gaussian kernel: k(z) = (1/√(2π)) · exp(−z²/2)

Burkov, Andriy, and Michel Lutz. "The Hundred-Page Machine Learning Book en français." (2019). 46
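A hedged NumPy sketch of the Gaussian-kernel regression just described (toy data; as the next slide notes, the bandwidth b would normally be tuned on a validation set):

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def kernel_regression_predict(x_train, y_train, x_query, b=0.5):
    # Weight each training target by how close its input is to the query point
    k = gaussian_kernel((x_train - x_query) / b)
    w = k / k.sum()
    return np.sum(w * y_train)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 6, 100))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=100)

print(kernel_regression_predict(x_train, y_train, x_query=1.5, b=0.3))  # roughly sin(1.5)
```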
Kernel regression: #2
▪ The value b is a hyperparameter that we tune using the validation set (by running the
model built with a specific value of b on the validation set examples and calculating the
mean squared error).

Burkov, Andriy, and Michel Lutz. "The Hundred-Page Machine Learning Book en français." (2019). 47
Kernel regression: #3

▪ Gaussian process (GP) regression is a supervised learning method that competes with kernel regression. It has some advantages over the latter.

▪ For example, it provides confidence intervals for the regression line in each point.

Burkov, Andriy, and Michel Lutz. "The Hundred-Page Machine Learning Book en français." (2019). 48
Kernel ridge regression

▪ Kernel ridge regression (KRR) combines Ridge regression and classification (linear least
squares with l2-norm regularization) with the kernel trick.

▪ Nonlinear problems can be tackled by replacing the dot product in the support vector
formulation by a kernel function—this is often known as the “kernel trick.”

▪ It thus learns a linear function in the space induced by the respective kernel and the data.
For non-linear kernels, this corresponds to a non-linear function in the original space.

▪ It would be nice to combine the power of the kernel trick with the simplicity of standard
least-squares regression. Kernel ridge regression does just that.
https://scikit-learn.org/stable/modules/kernel_ridge.html
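A brief scikit-learn sketch of kernel ridge regression with an RBF kernel (toy data; the alpha and gamma values are arbitrary illustrations):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# The RBF kernel makes the linear ridge model non-linear in the original space
krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X, y)
print(krr.predict([[1.5]]))  # close to sin(1.5)
```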

Kernel methods: #1
▪ Kernel methods are a class of algorithms for pattern analysis. The best-known member is the SVM.
▪ Study general types of relations (for example clusters, rankings, principal components,
correlations, classifications) in datasets.
▪ Main difference: For many algorithms the data in raw representation have to be explicitly
transformed into feature vector representations via a user-specified feature map: in
contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over
pairs of data points in raw representation.
▪ Kernel methods owe their name to the use of kernel functions, which enable them to
operate in a high-dimensional, implicit feature space without ever computing the
coordinates of the data in that space, but rather by simply computing the inner products
between the images of all pairs of data in the feature space. This operation is often
computationally cheaper than the explicit computation of the coordinates. This approach is
called the "kernel trick"

https://huyenchip.com/ml-interviews-book/contents/8.1.1-overview:-basic-algorithm.html 50
Kernel methods: #2
▪ Algorithms capable of operating with kernels include the kernel perceptron, support vector
machines (SVM), Gaussian processes, principal components analysis (PCA), canonical
correlation analysis, ridge regression, spectral clustering, linear adaptive filters, and many
others.
▪ Any linear model can be turned into a non-linear model by applying the kernel trick to the
model: replacing its features (predictors) with a kernel function.

https://huyenchip.com/ml-interviews-book/contents/8.1.1-overview:-basic-algorithm.html 51
Nearest Neighbour
▪ The principle behind nearest neighbor methods is to find a predefined number of training
samples closest in distance to the new point, and predict the label from these.

▪ The number of samples can be a user-defined constant (k-nearest neighbor learning), or


vary based on the local density of points (radius-based neighbor learning).

▪ The distance can, in general, be any metric measure: standard Euclidean distance is the
most common choice.

▪ Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).

https://scikit-learn.org/stable/modules/neighbors.html 52
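A minimal k-nearest-neighbours classification sketch with scikit-learn (the Iris dataset is used purely as a convenient example; k = 5 is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predict the label from the k=5 closest training samples (Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```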
k-nearest neighbour (k-NN)
▪ k-NN is a non-parametric method used for classification and regression.

▪ Given an object, the algorithm’s output is computed from its k closest training examples in
the feature space.
▪ In k-NN classification, each object is classified into the class most common among its k
nearest neighbors.
▪ In k-NN regression, each object’s value is calculated as the average of the values of its k
nearest neighbors.

▪ Applications: anomaly detection, search, recommender system

https://huyenchip.com/ml-interviews-book/contents/8.1.1-overview:-basic-algorithm.html 53
Learning Vector Quantization (LVQ)

▪ A downside of K-Nearest Neighbours is that you need to hang on to your entire


training dataset.

▪ The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.

MLPs and their names

▪ Is MLP the same as fully connected? Yes, a multilayer perceptron is just a collection of interleaved fully
connected layers and non-linearities. The usual non-linearity nowadays is ReLU, but in the past sigmoid and
tanh non-linearities were also used.

▪ Is MLP a feed forward network? The MLP architecture is a layered feedforward neural network, in which the
nonlinear elements (neurons) are arranged in successive layers, and the information flows unidirectionally, from
input layer to output layer, through the hidden layer(s).

▪ How is MLP different from a deep neural network? MLP is a subset of DNN. While a DNN can have loops, an MLP is always feed-forward, i.e. a Multilayer Perceptron is a finite acyclic graph.

▪ Is MLP a deep neural network? A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). MLP models are the most basic deep neural networks, composed of a series of fully connected layers.

▪ What is the difference between MLP and CNN? MLP stands for Multi Layer Perceptron. CNN stands for
Convolutional Neural Network. ... So MLP is good for simple image classification , CNN is good for
complicated image classification and RNN is good for sequence processing and these neural networks
should be ideally used for the type of problem they are designed for.


What is a multi-layer perceptron?
▪ Multilayer perceptrons (MLPs) are also known as (vanilla)
feed-forward neural networks, or sometimes just neural networks.
▪ The multilayer perceptron structurally extends the simpler
perceptron by grouping many perceptrons in a single layer and
stacking multiple layers together.
▪ The advantages of Multi-layer Perceptron are:
▪ Capability to learn non-linear models.
▪ Capability to learn models in real-time (on-line learning) using
partial fit.
▪ The disadvantages of Multi-layer Perceptron (MLP) include:
▪ MLP with hidden layers have a non-convex loss function.
▪ MLP requires tuning a number of hyperparameters (No. of
hidden neurons, layers, and iterations).
▪ MLP is sensitive to feature scaling.

https://scikit-learn.org/stable/modules/neural_networks_supervised.html 56
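A small scikit-learn sketch of an MLP classifier; note the scaler in the pipeline, reflecting the sensitivity to feature scaling mentioned above (the dataset and layer sizes are arbitrary illustrations):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# MLP is sensitive to feature scaling, so scale inside a pipeline
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                                  random_state=0))
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```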
What is the hidden layer doing?: #1
▪ A NN with one input layer and an output layer. Such a network
simply tries to separate the two classes of data by dividing them
with a line.

▪ Modern neural networks generally have multiple layers between.


Let us consider one that has one hidden layer.

▪ With each layer, the network transforms the data, creating a new
representation. When we get to the final representation, the
network will just draw a line through the data (or, in higher
dimensions, a hyperplane).

▪ Essentially, the hidden layer learns a representation so that the


data is linearly separable

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/ 57
What is the hidden layer doing?: #2
▪ Each layer stretches and squishes space, but it never cuts, breaks, or folds it.

▪ Intuitively, we can see that it preserves topological properties.

▪ For example, a set will be connected afterwards if it was before (and vice versa).

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/ 58
Batch normalisation: #1
▪ Training deep NNs is challenging. One aspect of this challenge is that the model is updated
layer-by-layer backward from the output to the input using an estimate of error that
assumes the weights (in the layers prior to the current layer) are fixed.

▪ The weights of a layer are updated given an expectation that the prior layer outputs values
with a given distribution. This distribution is likely changed after the weights of the prior
layer are updated. This is referred to as internal covariate shift.

▪ In essence, this describes the undesirable situation where, during training, each layer receives inputs whose distribution keeps changing.

https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/ 59
Batch normalisation: #2
▪ Batch normalization, or batch norm for short, is proposed as a technique to help coordinate
the update of multiple layers in the model.

▪ It does this by scaling the output of the layer, specifically by standardizing (mean of zero and
a standard deviation of one) the activations of each input variable per mini-batch, such as
the activations of a node from the previous layer.

▪ This process is also called “whitening” when applied to images in computer vision.

▪ Standardizing the activations of the prior layer means that assumptions the subsequent
layer makes about the spread and distribution of inputs during the weight update will not
change, at least not dramatically. This has the effect of stabilizing and speeding-up the
training process of deep neural networks.

https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/ 60
Batch normalisation: #3
▪ Although reducing internal covariate shift was a motivation in the development of the method, there is
some suggestion that instead batch normalisation is effective because it smooths and, in turn, simplifies
the optimisation function that is being solved when training the network.

▪ In a few words: if the data is normalised just before it enters a NN, there is no guarantee that the intermediate layers get a normalised input. Batch normalisation normalises the intermediate data, whose mean and variance are changing over time during training.

▪ Key advantages are:


▪ Improves gradient flow
▪ Allows higher learning rates
▪ Reduces the strong dependency on initialisation
▪ Acts as a form of regularisation and reduces the dependency on dropout

https://arxiv.org/abs/1805.11604
Batch normalisation: #4

Berner, Julius, et al. "The Modern Mathematics of Deep Learning." arXiv preprint arXiv:2105.04026 (2021).
Batch normalisation: #5
▪ Batchnorm can help correct training that is slow or unstable.
▪ Sometimes also help prediction performance.
▪ SGD will shift the network weights in proportion to how large an activation the data
produces. Features that tend to produce activations of very different sizes can make for
unstable training behavior.
▪ If it's good to normalize the data before it goes into the network, maybe also normalizing
inside the network would be better! In fact, we have a special kind of layer that can do this,
the batch normalization layer. A batch normalization layer looks at each batch as it comes
in, first normalizing the batch with its own mean and standard deviation, and then also
putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in
effect, performs a kind of coordinated rescaling of its inputs.

https://www.kaggle.com/ryanholbrook/dropout-and-batch-normalization
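A minimal PyTorch sketch of inserting a batch normalization layer between fully connected layers, as described above (layer sizes are arbitrary):

```python
import torch
from torch import nn

# A small fully connected network with batch norm between the layers
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalise each mini-batch, then rescale with two trainable parameters
    nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(32, 20)   # a mini-batch of 32 examples with 20 features
print(model(x).shape)     # torch.Size([32, 1])
```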
When multi-tasks learning makes sense

▪ Training on a set of tasks that could benefit from having shared lower-level features

▪ Amount of data you have for each task is quite similar

▪ Can train a big enough network to do well enough on each task

Kernel machines
▪ In ML, kernel machines are a class of algorithms for pattern analysis, whose best known
member is the support vector machine (SVM). The word kernel is used in mathematics to
denote a weighting function for a weighted sum or integral.

Domingos, Pedro. "Every Model Learned by Gradient Descent Is Approximately


a Kernel Machine." arXiv preprint arXiv:2012.00152 (2020). 65
Kernel machines vs. deep NNs
▪ Comparison Kernel machines are shallow architectures, in which one large layer of simple
template matchers is followed by a single layer of trainable coefficients. Kernel machines
can be viewed as neural networks with one hidden layer, with the kernel as the
nonlinearity. For example, a Gaussian kernel machine is a radial basis function network
(Poggio and Girosi, 1990).

▪ Issue: shallow architectures can be very inefficient in terms of required number of


computational elements and examples.

▪ How deep NN compare? deep architectures, in which lower-level features or concepts are
progressively combined into more abstract and higher-level representations. Deep
architectures have the potential to generalize in non-local ways, i.e., beyond immediate
neighbours, and that this is crucial in order to make progress on the kind of complex tasks
required for artificial intelligence.
Bengio, Yoshua, and Yann LeCun. "Scaling learning algorithms
towards AI." Large-scale kernel machines 34.5 (2007): 1-41. 66
Every Model Learned by Gradient Descent Is
Approximately a Kernel Machine: #1
▪ The argument used in the past was: a deep network would seem to be irreducible to a kernel machine, since it can represent some functions exponentially more compactly than a shallow one.

▪ A notable disadvantage of deep networks is their lack of interpretability (Zhang and Zhu, 2018). Knowing that they are effectively path kernel machines greatly ameliorates this. In particular, the weights of a deep network have a straightforward interpretation as a superposition of the training examples in gradient space, where each example is represented by the corresponding gradient of the model.

▪ One well-studied approach to interpreting the output of deep networks involves looking for training
instances that are close to the query in Euclidean or some other simple space (Ribeiro et al., 2016). Path
kernels tell us what the exact space for these comparisons should be, and how it relates to the model’s
predictions.

Domingos, Pedro. "Every Model Learned by Gradient Descent Is Approximately


a Kernel Machine." arXiv preprint arXiv:2012.00152 (2020). 67
Every Model Learned by Gradient Descent Is
Approximately a Kernel Machine: #2
▪ Deep network weights as superpositions of
training examples.

▪ Applying the learned model to a query


example is equivalent to simultaneously
matching the query with each stored
example using the path kernel and outputting
a weighted sum of the results.

Domingos, Pedro. "Every Model Learned by Gradient Descent Is Approximately 68


a Kernel Machine." arXiv preprint arXiv:2012.00152 (2020).
Every Model Learned by Gradient Descent Is
Approximately a Kernel Machine: #3
▪ Perhaps the most significant implication of our result for deep learning is that it casts doubt
on the common view that it works by automatically discovering new representations of the
data, in contrast with other machine learning methods, which rely on predefined features

Domingos, Pedro. "Every Model Learned by Gradient Descent Is Approximately


a Kernel Machine." arXiv preprint arXiv:2012.00152 (2020). 69
Regression

Regression explain via comics
▪ When modeling such a problem statistically, much of the work
of a data scientist or statistician is knowing which fitting
method is most appropriate for the data in question.

▪ Here we see various hypothetical scientists or statisticians


each applying their own interpretations to the exact same
data, and the comic mocks each of them for their various
personal biases or other assorted excuses.

▪ In general, the researcher will specify the form of an equation


for the line to be drawn, and an algorithm will produce the
actual line.

▪ This comic exaggerates various methods of interpreting data, but without background knowledge of the subject matter none of the interpretations makes much sense.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 71
Linear regression: #1

▪ Totally different data sets can result in the same line. It's obvious that some more knowledge about the nature of the data must be used to understand whether this simple line really does make sense.

▪ “Hey, I did a regression.” refers to the fact that this is just the easiest way of fitting data to a curve.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 72
Linear regression: #2
▪ Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear
method for regression.
▪ Linear regression finds the parameters w and b that minimize the mean squared error
between predictions and the true regression targets, y, on the training set.
▪ The mean squared error is the sum of the squared differences between the predictions and
the true values. Linear regression has no parameters, which is a benefit, but it also has no
way to control model complexity.

Does OLS require a normally distributed
data?
▪ You don't need to assume Normal distributions to do regression. Least squares regression is the BLUE estimator (Best Linear Unbiased Estimator) regardless of the distributions.
▪ See the Gauss-Markov Theorem.
▪ A normal distribution is only used to show that the estimator is also the maximum
likelihood estimator.
▪ It is a common misunderstanding that OLS somehow assumes normally distributed data. It
does not. It is far more general.

▪ ATTENTION: OLS regression makes no assumptions about the data (as stated above), it
makes assumptions about the errors, as estimated by residuals.

https://stats.stackexchange.com/questions/75054/how-do-i-perform-a-regression-on-non-normal-data-which-remain-non-normal-when-tr 74
Quadratic fit

▪ Quadratic fit (i.e. fitting a parabola through the data) uses the lowest-degree polynomial that can fit data through a curved line; if the data exhibits clearly "curved" behavior (or if the experimenter feels that its growth should be more than linear), a parabola is often the first, easiest stab at fitting the data.

▪ The comment below the graph "I wanted a curved line, so I made one with math." suggests that a quadratic regression is used when straight lines no longer satisfy the researcher, but they still want to use a simple mathematical expression.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 75
Logarithmic

▪ A logarithmic curve grows slower on higher values, but still grows


without bound to infinity rather than approaching a horizontal
asymptote.

▪ The comment below the graph "Look, it's tapering off!" builds up the
impression that the data diminishes while under this fit it's still
growing to infinity, only much slower than a linear regression does.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 76
Exponential

▪ An exponential curve, on the contrary, is typical of a phenomenon whose


growth gets rapidly faster and faster - a common case is a process that
generates stuff that contributes to the process itself, think bacteria
growth or compound interest.

▪ The comment below the graph "Look, it's growing uncontrollably!"


gives an other frivolous statement suggesting something like chaos. Also
this even faster growth is well defined and has no asymptote at both
axes.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 77
LOESS (locally estimated scatterplot
smoothing)
▪ A LOESS fit doesn't use a single formula to fit all the data, but
approximates data points locally using different polynomials for each
"zone" (weighting differently data points as they get further from it)
and patching them together. As it has many more degrees of freedom
compared to a single polynomial, it generally "fits better" to any data
set, although it is generally impossible to derive any strong, "clean"
mathematical correlation from it - it is just a nice smooth line that
approximates well the data points, with a good degree of rejection
from outliers.

▪ The comment below the graph "I'm sophisticated, not like those
bumbling polynomial people." emphasises this more complicated
interpretation but without a simple mathematical description it's not
very helpful
https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 78
Linear, no slope

▪ The value of c can be determined simply by taking the average of the


y-values in the data.

▪ Apparently, the person making this line figured out pretty early on that
their data analysis was turning into a scatter plot, and wanted to escape
their personal stigma of scatter plots by drawing an obviously false
regression line on top of it.

▪ The comment below the graph "I'm making a scatter plot but I don't
want to." is probably done by a student who isn't happy with their
choice of field of study.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 79
Logistic

▪ Logistic regression is used when a variable can take binary values such as "0" and "1" or "old" and "young".

▪ The comment below the graph "I need to connect these two lines,
but my first idea didn't have enough math." implies the
experimenter just wants to find a mathematically-respectable way to
link two flat lines.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 80
Confidence interval

▪ Not a type of curve fitting, but a method of depicting the


predictive power of a curve.

▪ The comment below the graph "Listen, science is hard. But I'm a
serious person doing my best." is just an honest statement about
this uncertainty.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 81
Piecewise
▪ Mapping different curves to different segments of the data. This is a legitimate
strategy, but the different segments should be meaningful, such as if they were pulled
from different population
▪ This kind of fit would arise naturally in a study based on a regression discontinuity
design. For instance, if students who score below a certain cutoff must take remedial
classes, the line for outcomes of those below the cutoff would reasonably be
separate from the one for outcomes above the cutoff; the distance between the end
of the two lines could be considered the effect of the treatment, under certain
assumptions. This kind of study design is used to investigate causal theories, where
mere correlation in observational data is not enough to prove anything. Thus, the
associated text would be appropriate; there is a theory, and data that might prove the
theory is hard to find
▪ The comment below the graph "I have a theory, and this is the only data I could find."
is a bit ambiguous because there are many data points ignored. Without an
explanation why only a subset of the data is used this isn't a useful interpretation at
all. As a matter of fact, with the extra degrees of freedom offered by the piecewise
regression, it could indicate that the researcher is trying to fit the data to confirm
their theory, rather than building their theory off of the data.
https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 82
Connecting lines
▪ This is often used to smooth gaps in measurements. A simple
example is the weather temperature which is often measured in
distinct intervals. When the intervals are high enough it's safe to
assume that the temperature didn't change that much between
them and connecting the data points by lines doesn't distort the real
situation in many cases.

▪ The comment below the graph "I clicked 'Smooth Lines' in Excel."
refers to the well known spreadsheet application from Microsoft
Office. Like other spreadsheet applications it has the feature to
visualize data from a table into a graph by many ways. "Smooth
Lines" is a setting meant for use on a line graph, a graph in which
one axis represents time; as it simply joins up every point rather
than finding a sensible line, it is not suitable for regression.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 83
Ad-Hoc Filter

▪ Drawing a bunch of different lines by hand, keeping in only the data


points perceived as "good". Not really useful except for marketing
purposes.

▪ The comment below the graph "I had an idea for how to clean up the
data. What do you think?" admits that in fact the data is
whitewashed and tightly focused to a result the presenter wants to
show.

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 84
House of Cards
▪ Not a real method, but a common consequence of misapplication
of statistical methods: a curve can be generated that fits the data
extremely well, but immediately becomes absurd as soon as one
glances outside the training data sample range, and your analysis
comes crashing down "like a house of cards". This is a type of
overfitting. In other words, the model may do quite well for
(approximately) interpolating between values in the sample
range, but not extend at all well to extrapolating values outside
that range.

▪ The name is also a potential reference to the TV show House of


Cards. The plot in House of Cards began with a premise of a rise
to power in the United States government, but as it continued
into more seasons the premise was taken to an extreme,
introducing more and more ridiculous plot points ("WAIT NO, NO,
DON'T EXTEND IT!").

https://www.explainxkcd.com/wiki/index.php/2048:_Curve-Fitting 85
Bayes regression vs. standard regression
▪ Philosophical difference: In ordinary least squares and maximum likelihood estimation, we
are starting with the question "What are the best values for 𝛽𝑖 (perhaps for later use)?”
▪ In the full Bayesian approach, we start with the question "What can we say about the
unknown values 𝛽𝑖?" and then maybe proceed to using the maximum a posteriori or
posterior mean if a point estimate is needed.

https://stats.stackexchange.com/questions/252577/bayes-regression-how-is-it-done-in-comparison-to-standard-regression/252608#252608
Linear regression: #1
▪ The biggest advantage of linear regression models is linearity: It makes the estimation
procedure simple and, most importantly, these linear equations have an easy to understand
interpretation on a modular level (i.e. the weights).

▪ Estimated weights come with confidence intervals. A confidence interval is a range for the
weight estimate that covers the "true" weight with a certain confidence. For example, a
95% confidence interval for a weight of 2 could range from 1 to 3. The interpretation of this
interval would be: If we repeated the estimation 100 times with newly sampled data, the
confidence interval would include the true weight in 95 out of 100 cases, given that the
linear regression model is the correct model for the data.

https://christophm.github.io/interpretable-ml-book/limo.html 87
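To make the confidence-interval point concrete, here is a hedged sketch on synthetic data using statsmodels, which reports such intervals directly:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=200)

# Fit OLS and report 95% confidence intervals for the estimated weights
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)          # intercept and weights
print(model.conf_int(0.05))  # 95% confidence interval per coefficient
```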
Linear regression: #2
▪ Assumption used in linear regression:
▪ Linearity: the linear regression model forces the prediction to be a linear combination of features,
which is both its greatest strength and its greatest limitation.
▪ Normality: It is assumed that the target outcome given the features follows a normal distribution.
▪ Homoscedasticity (constant variance): The variance of the error terms is assumed to be constant
over the entire feature space.
▪ Independence: It is assumed that each instance is independent of any other instance.
▪ Fixed features: Fixed means that they are treated as "given constants" and not as statistical
variables.
▪ Absence of multicollinearity: You do not want strongly correlated features, because this messes up
the estimation of the weights. Sometimes this is also stated as absence of correlated errors -
Correlated errors, like the definition suggests, happen when one error is correlated to another. For
instance, if one positive error makes a negative error systematically, it means that there's a
relationship between these variables. This occurs often in time series, where some patterns are time
related. We'll also not get into this. However, if you detect something, try to add a variable that can
explain the effect you're getting. That's the most common solution for correlated errors.
https://christophm.github.io/interpretable-ml-book/limo.html 88
Linear regression: #3 [homoscedasticity]
▪ Homoscedasticity (constant variance): The variance of the error terms is assumed to be
constant over the entire feature space.

▪ Suppose you want to predict the value of a house given the living area in square meters.
You estimate a linear model that assumes that, regardless of the size of the house, the error
around the predicted response has the same variance. This assumption is often violated in
reality. In the house example, it is plausible that the variance of error terms around the
predicted price is higher for larger houses, since prices are higher and there is more room
for price fluctuations.

▪ Homoscedasticity - I just hope I wrote it right. Homoscedasticity refers to the 'assumption


that dependent variable(s) exhibit equal levels of variance across the range of predictor
variable(s)'. Homoscedasticity is desirable because we want the error term to be the same
across all values of the independent variables.

https://christophm.github.io/interpretable-ml-book/limo.html 89
Linear regression: #4 [fixed features]

▪ Fixed features: the input features are considered "fixed". Fixed means that they are treated
as "given constants" and not as statistical variables.

▪ This implies that they are free of measurement errors.

▪ This is a rather unrealistic assumption. Without that assumption, however, you would have
to fit very complex measurement error models that account for the measurement errors of
your input features. And usually you do not want to do that

https://christophm.github.io/interpretable-ml-book/limo.html 90
Linear regression #5 [collinearity]
▪ Linear regression is an example of approximation learning. Approximation doesn't necessarily
exactly fit the data points, but the goal is that it should come close, but not at the expense of
failing to generalize to new data

▪ Linear regression assumes that the input variables have a Gaussian distribution. It is also assumed
that input variables are relevant to the output variable and that they are not highly correlated with
each other (a problem called collinearity).

▪ Remove collinearity: linear regression will over-fit your data when you have highly correlated input
variables. Consider calculating pairwise correlations for your input data and removing the most
correlated.

▪ Multicollinearity: you do not want strongly correlated features, because this messes up the
estimation of the weights. In a situation where two features are strongly correlated, it becomes
problematic to estimate the weights because the feature effects are additive and it becomes
indeterminable to which of the correlated features to attribute the effects.
http://vision.psych.umn.edu/users/kersten/kersten-lab/courses/Psy5038WF2014/Lectures/Lect_10_RegressWid/Lect_10_RegressWid.nb.pdf
https://christophm.github.io/interpretable-ml-book/limo.html
Linear regression #5.1 [collinearity]

▪ Completely correlated features even make it impossible to find a unique solution for the linear equation.
▪ An example: You have a model to predict the value of a house and have features like
number of rooms and size of the house.
▪ House size and number of rooms are highly correlated: the bigger a house is, the more
rooms it has. If you take both features into a linear model, it might happen, that the size of
the house is the better predictor and gets a large positive weight.
▪ The number of rooms might end up getting a negative weight, because, given that a house
has the same size, increasing the number of rooms could make it less valuable or the linear
equation becomes less stable, when the correlation is too strong.

https://christophm.github.io/interpretable-ml-book/limo.html 92
Linear regression #5.2 [collinearity]
▪ When features are correlated, the design matrix becomes close to singular and, as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance.

▪ One solution can be to use Ridge regression. It uses a parameter to control the amount of
shrinkage. The greater this parameter is the more model becomes robust to collinearity.

https://scikit-learn.org/stable/modules/linear_model.html 93
Linear regression #6 [complete resources]
▪ https://www.kdnuggets.com/2021/08/3-reasons-linear-regression-instead-neural-network
s.html

▪ Which includes a short article and a link to a video:


▪ https://www.youtube.com/watch?v=9ISvX3yL4Pc

https://www.kdnuggets.com/2021/08/3-reasons-linear-regression-instead-neural-networks.html 94
Linear regression #6
▪ Linear regression was used to predict the number of rented bikes on a particular day, given weather and calendar information. The table reports the weight estimates, the standard error of the estimate (SE), and the absolute value of the t-statistic |t|.

▪ An increase of the temperature by 1 degree Celsius increases the predicted number of bicycles by 110.7, when all other features remain fixed.

▪ The predicted target is a linear combination of the weighted


features. The estimated linear equation is a hyperplane in
the feature/target space (a simple line in the case of a single
feature).

▪ The weights specify the slope (gradient) of the hyperplane in


each direction. So it gives the delta increase or decrease.

▪ The good side is that the additivity isolates the


interpretation, but on the bad side, it ignores a possible real
joint distribution.

https://christophm.github.io/interpretable-ml-book/limo.html 95
Linear regression #7 [weight plot]
▪ The problem with the weight plot is that the
features are measured on different scales.

▪ While for the weather the estimated weight


reflects the difference between good and
rainy/stormy/snowy weather, for
temperature it only reflects an increase of 1
degree Celsius.

▪ You can make the estimated weights more


comparable by scaling the features (zero
mean and standard deviation of one) before
fitting the linear model.

https://christophm.github.io/interpretable-ml-book/limo.html 96
Linear regression #8 [effect plot]
▪ Effects = weight per feature times the
feature value of an instance.

▪ A box in a boxplot contains the effect


range for half of your data (25% to 75%
effect quantiles).

▪ The vertical line in the box is the median


effect, i.e. 50% of the instances have a
lower and the other half a higher effect on
the prediction.

▪ The horizontal lines extend to ±1.5IQR/√n,


with IQR being the inter quartile range
(75% quantile minus 25% quantile). The
dots are outliers.

https://christophm.github.io/interpretable-ml-book/limo.html
Linear regression #9 [explaining individual predictions]

▪ Let us say we have a model and the results for


a prediction. How do we explain?

▪ For a temperature of 1.6 degrees Celsius, the


effect is 177.6. We add these individual effects
as crosses to the effect plot, which shows us
the distribution of the effects in the data.

▪ This allows us to compare the individual effects


with the distribution of effects in the data.

98
https://christophm.github.io/interpretable-ml-book/limo.html
Linear regression #9 [extensions]

▪ There are also many extensions of the linear regression model such as:
▪ GLM
▪ GAM
▪ and more

99
https://christophm.github.io/interpretable-ml-book/limo.html
Collinearity in general
▪ In general what is the effect of having collinearity in the features?

▪ This can cause issues when interpreting the importance of each feature.

▪ When features are collinear, permuting one feature (feature permutation is one way to assess how important features are) will have little effect on the model's performance, because the model can get the same information from a correlated feature.

▪ Solution: one way to handle multicollinear features is to perform hierarchical clustering on the Spearman rank-order correlations, pick a threshold, and keep a single feature from each cluster. See the link for a full example in Python; a minimal sketch follows below.
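▪ A hedged sketch of that idea, loosely following the linked scikit-learn example (the helper name, the random data, and the threshold value are illustrative, not part of any library API):
```python
# Hedged sketch: cluster features on Spearman rank correlation and keep
# one representative feature per cluster (threshold chosen arbitrarily).
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def select_uncorrelated_features(X, threshold=1.0):
    corr = spearmanr(X).correlation            # feature-by-feature correlation matrix
    corr = (corr + corr.T) / 2                 # enforce symmetry
    np.fill_diagonal(corr, 1.0)
    dist = 1 - np.abs(corr)                    # correlation -> distance
    linkage = hierarchy.ward(squareform(dist, checks=False))
    cluster_ids = hierarchy.fcluster(linkage, threshold, criterion="distance")
    # keep the first feature encountered in each cluster
    keep = {cid: idx for idx, cid in reversed(list(enumerate(cluster_ids)))}
    return sorted(keep.values())

X = np.random.rand(100, 8)                     # random, illustrative data
print(select_uncorrelated_features(X, threshold=0.5))
```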

https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-
auto-examples-inspection-plot-permutation-importance-multicollinear-py
100
Detecting multicollinearity
▪ One of the most widely used statistical measures for detecting multicollinearity amongst numerical variables is the Variance Inflation Factor (VIF).
▪ It is called the variance inflation factor because it estimates how much the variance of a coefficient is "inflated" because of linear dependence with other predictors. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor were completely uncorrelated with all the other predictors.
▪ For categorical variables, Cramér's V is a statistic measuring the strength of association or dependency between two (nominal) categorical variables.
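▪ A minimal sketch of computing VIF with statsmodels (the `compute_vif` helper and the synthetic predictors are illustrative assumptions, not library functions):
```python
# Hedged sketch: VIF per numerical predictor using statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame) -> pd.Series:
    # VIF is defined for a model with an intercept, so add a constant column.
    X_const = sm.add_constant(X)
    vifs = {
        col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)
        if col != "const"
    }
    return pd.Series(vifs).sort_values(ascending=False)

# Synthetic predictors: 'rooms' is almost a linear function of 'size'.
rng = np.random.default_rng(0)
size = rng.normal(100, 20, 200)
df = pd.DataFrame({
    "size": size,
    "rooms": size / 25 + rng.normal(0, 0.1, 200),
    "age": rng.uniform(0, 50, 200),
})
print(compute_vif(df))   # 'size' and 'rooms' should show very large VIFs
```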

http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/model_selection/collinearity.ipynb 101
Common pitfalls in interpretation of
coefficients of linear models
▪ Coefficients must be scaled to the same unit of measure to retrieve feature importance.
Scaling them with the standard-deviation of the feature is a useful proxy.
▪ Coefficients in multivariate linear models represent the dependency between a given
feature and the target, conditional on the other features.
▪ Correlated features induce instabilities in the coefficients of linear models and their effects
cannot be well teased apart.
▪ Different linear models respond differently to feature correlation and coefficients could
significantly vary from one another.
▪ Inspecting coefficients across the folds of a cross-validation loop gives an idea of their
stability.

https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-a
102
uto-examples-inspection-plot-linear-model-coefficient-interpretation-py
Misinterpreting feature importance
▪ Tree-based models or linear models are commonly used algorithms as they have the
capability of giving us feature importance or coefficients without having to rely on other
external methods. When interpreting these results, there are a couple of caveats to keep in mind.
▪ If the features are co-linear, the importance can shift from one feature to another. The
more features the data set has the more likely the features are co-linear and the less
reliable simple interpretations of feature importance are. Thus it is recommended to
remove multicollinearity before feeding the data into our model.
▪ Linear models such as linear and logistic regression output coefficients. Many times these coefficients lead people to believe that the bigger the value of the coefficient, the more important the feature. That is not wrong, but we need to make sure that we standardized our dataset beforehand, as the scale of a variable changes the absolute value of its coefficient.

http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/model_selection/tips_and_ 103
tricks/tips_and_tricks.ipynb
Ways of solving linear regression?

▪ Solve linear regression directly via the matrix reformulation with the normal equations.

▪ Solve linear regression using a QR matrix decomposition. The QR decomposition approach


is more computationally efficient and more numerically stable than calculating the normal
equation directly, but does not work for all data matrices.

▪ Solve linear regression using SVD and the pseudoinverse. Unlike the QR decomposition, all
matrices have a singular-value decomposition. It is the de-facto standard for solving the
system of linear equations for linear regression. SVD is more stable and the preferred
approach.
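▪ A minimal sketch of the three approaches with NumPy (synthetic data; `np.linalg.lstsq` and `np.linalg.pinv` are SVD-based):
```python
# Hedged sketch: three ways to get the least-squares coefficients.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0, 0.1, 50)

# 1) Normal equations (fast, but numerically fragile if X'X is near singular)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2) QR decomposition (more stable than the normal equations)
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# 3) SVD / pseudoinverse (most robust; works even for rank-deficient X)
beta_svd = np.linalg.pinv(X) @ y

print(beta_normal, beta_qr, beta_svd, sep="\n")
```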

104
Extensions of linear regression
▪ The linear model comes with many other assumptions. The bad news is (well, not really
news) that all those assumptions are often violated in reality.

▪ The good news is that the statistics community has developed a variety of modifications
that transform the linear regression model from a simple blade into a Swiss Army knife.

▪ There are in particular two models:


▪ Generalized Linear Models (GLMs) and
▪ Generalized Additive Models (GAMs)

https://christophm.github.io/interpretable-ml-book/extend-lm.html 105
Linear regression issue No. #1

▪ Problem: The target outcome y given the features


does not follow a Gaussian distribution.

▪ Example: Suppose I want to predict how many


minutes I will ride my bike on a given day. As
features I have the type of day, the weather and so
on. If I use a linear model, it could predict negative
minutes because it assumes a Gaussian distribution
which does not stop at 0 minutes. Also if I want to
predict probabilities with a linear model, I can get
probabilities that are negative or greater than 1.

▪ Solution: Generalized Linear Models (GLMs).

https://christophm.github.io/interpretable-ml-book/extend-lm.html 106
Linear regression issue No. #2
▪ Problem: The features interact.

▪ Example: On average, light rain has a slight negative


effect on my desire to go cycling. But in summer, during
rush hour, I welcome rain, because then all the
fair-weather cyclists stay at home and I have the bicycle
paths for myself! This is an interaction between time and
weather that cannot be captured by a purely additive
model.

▪ Solution: Adding interactions manually.

https://christophm.github.io/interpretable-ml-book/extend-lm.html 107
Linear regression issue No. #3
▪ Problem: The true relationship between the
features and y is not linear.

▪ Example: Between 0 and 25 degrees Celsius, the


influence of the temperature on my desire to ride a
bike could be linear, which means that an increase
from 0 to 1 degree causes the same increase in
cycling desire as an increase from 20 to 21. But at
higher temperatures my motivation to cycle levels
off and even decreases - I do not like to bike when it
is too hot.

▪ Solutions: Generalized Additive Models (GAMs);


transformation of features.
https://christophm.github.io/interpretable-ml-book/extend-lm.html 108
Linear model: matrix of issues & solutions #1
▪ Issue 1: My data violates the assumption of being independent and identically distributed (iid). Example: repeated measurements on the same patient. Solution: mixed models or generalized estimating equations.
▪ Issue 2: My model has heteroscedastic errors. Example: when predicting the value of a house, the model errors are usually higher for expensive houses, which violates the homoscedasticity of the linear model. Solution: robust regression.
▪ Issue 3: I have outliers that strongly influence my model. Solution: robust regression.
▪ Issue 4: I want to predict the time until an event occurs. Example: time-to-event data usually comes with censored measurements, which means that for some instances there was not enough time to observe the event. For example, a company wants to predict the failure of its ice machines, but only has data for two years. Some machines are still intact after two years, but might fail later. Solution: parametric survival models, Cox regression, survival analysis.
https://christophm.github.io/interpretable-ml-book/extend-lm.html 109
Linear model: matrix of issues & solutions #2
▪ Issue 5: My outcome to predict is a category. Solution: if the outcome has two categories, use a logistic regression model, which models the probability for the categories. If you have more categories, search for multinomial regression. Logistic regression and multinomial regression are both GLMs.
▪ Issue 6: I want to predict ordered categories. Example: school grades. Solution: search for the proportional odds model.
▪ Issue 7: My outcome is a count (like the number of children in a family). Solution: search for Poisson regression. The Poisson model is also a GLM. You might also have the problem that the count value of 0 is very frequent; search for zero-inflated Poisson regression or the hurdle model.
https://christophm.github.io/interpretable-ml-book/extend-lm.html 110
Linear model: matrix of issues & solutions #3

▪ Issue 8: I am not sure what features need to be included in the model to draw correct causal conclusions. Example: I want to know the effect of a drug on blood pressure. The drug has a direct effect on some blood value, and this blood value affects the outcome. Should I include the blood value in the regression model? Solution: search for causal inference, mediation analysis.
▪ Issue 9: I want to integrate prior knowledge into my models. Solution: search for Bayesian inference.
https://christophm.github.io/interpretable-ml-book/extend-lm.html 111
Generalised linear model [GLM]: #1

▪ The generalized linear model (GLM) is a flexible generalisation of ordinary linear regression
that allows for response variables that have error distribution models other than a normal
distribution.

▪ In a generalized linear model (GLM), each outcome Y of the dependent variables is assumed
to be generated from a particular distribution in an exponential family, a large class of
probability distributions that includes the normal, binomial, Poisson and gamma
distributions, among others.

▪ In brief: with Generalized Linear Models, one uses a common training technique for a
diverse set of regression models.
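▪ A minimal sketch, assuming scikit-learn >= 0.23 (which provides `PoissonRegressor`); the bike-count setting and parameter values are illustrative:
```python
# Hedged sketch: a Poisson GLM for a non-negative count target
# (e.g. number of rented bikes); unlike plain OLS it cannot predict
# negative counts.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(42)
temperature = rng.uniform(-5, 30, 300)
is_weekend = rng.integers(0, 2, 300)
X = np.column_stack([temperature, is_weekend])
lam = np.exp(0.05 * temperature + 0.3 * is_weekend + 3.0)  # true mean counts
y = rng.poisson(lam)

glm = PoissonRegressor(alpha=1e-3, max_iter=300).fit(X, y)
print(glm.predict(X[:5]))   # predictions are always >= 0
```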
https://en.wikipedia.org/wiki/Generalized_linear_model
112
Generalised linear model [GLM]: #2
▪ GLMs allow the modeller to express the relationship between the regression variables X
and the response variable (a.k.a. dependent variable) y, in a linear and additive way even
though the underlying relationships may be neither linear nor additive.

https://towardsdatascience.com/generalized-linear-models-9ec4dfe3dc3f 113
Binning for linear regression: #1
One way to make linear
model more powerful on
continuous data is to use
discretization (also
known as binning). In the
example, we discretize
the feature and one-hot
encode the transformed
data.

A linear model is fast to build and relatively straightforward to interpret, but can only model linear relationships, while a decision tree can build a much more complex model of the data.
Compared with the result before discretization, the linear model becomes much more flexible, while the decision tree gets much less flexible. Note that binning features generally has no beneficial effect for tree-based models, as these models can learn to split up the data anywhere.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html#sphx-glr- 114
auto-examples-preprocessing-plot-discretization-py
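▪ A minimal sketch of binning plus one-hot encoding before a linear model (synthetic non-linear data; the number of bins is arbitrary):
```python
# Hedged sketch: discretize a continuous feature and one-hot encode the bins,
# so that a linear model can fit a different value per bin.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)   # non-linear target

linear = LinearRegression().fit(X, y)
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot"),  # bins -> one-hot columns
    LinearRegression(),
).fit(X, y)

print("plain linear R^2:", linear.score(X, y))
print("binned linear R^2:", binned.score(X, y))   # typically much higher here
```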
Linear vs. multiple regression
▪ Linear regression: a single input feature is handled.
▪ Multiple regression: multiple input features are handled.

https://towardsdatascience.com/understanding-multiple-regression-249b16bde83e 115
Example of multiple regression
▪ Imagine that you’re a traffic planner in your city and need to estimate the average commute
time of drivers going from the East side of the city to the West. You don’t know how long it
takes on average, but you do know that it will depend on a number of factors.

▪ It probably depends on things like the distance driven, the number of stoplights on the
route, and the number of other cars on the road. In that case you could create a linear
multiple regression equation like the following:

https://towardsdatascience.com/understanding-multiple-regression-249b16bde83e 116
Assume you need to generate a predictive model using
multiple regression. Explain how you intend to validate this
model.

▪ There are two methods:

▪ Adjusted R-squared: every additional independent variable added to a model always increases the R-squared value; therefore, a model with several independent variables may seem to be a better fit even if it isn't. This is where adjusted R² comes in. The adjusted R² compensates for each additional independent variable and only increases if a given variable improves the model beyond what would be expected by chance (a small computation sketch follows below).

▪ Cross-validation

https://towardsdatascience.com/understanding-multiple-regression-249b16bde83e 117
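▪ A minimal sketch of both checks (the adjusted R² formula uses the number of samples n and the number of predictors p; the data are synthetic):
```python
# Hedged sketch: adjusted R^2 and cross-validated R^2 for a multiple regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def adjusted_r2(r2: float, n: int, p: int) -> float:
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

r2 = model.score(X, y)
print("R^2:", r2)
print("adjusted R^2:", adjusted_r2(r2, n=X.shape[0], p=X.shape[1]))
print("cross-validated R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```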
Classification
(go back)

118
Bird's-eye view of classification methods

119
Single- vs. multi-label, multiclass classification
▪ single-label, multiclass classification = each data point should be classified into only one
category
▪ multilabel, multiclass classification = if each data point could belong to multiple categories

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 120
Class vs. label

▪ Classes—A set of possible labels to choose from in a classification problem. For example,
when classifying cat and dog pictures, “dog” and “cat” are the two classes.

▪ Label—A specific instance of a class annotation in a classification problem. For instance, if


picture #1234 is annotated as containing the class “dog,” then “dog” is a label of picture
#1234.

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 121
2 ways to handle labels in multiclass
classification
▪ There are two ways to handle labels in multiclass classification:

▪ Encoding the labels via categorical encoding (also known as one-hot encoding) and
using categorical_crossentropy as a loss function
▪ Encoding the labels as integers and using the sparse_categorical_crossentropy loss
function

Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018. 122
One-vs-Rest and One-vs-One: #1
▪ Algorithms such as the Perceptron, Logistic Regression, and Support Vector Machines were
designed for binary classification and do not natively support classification tasks with more
than two classes.

▪ Solution: split the multi-class classification dataset into binary classification problems. There are two approaches:

▪ One-vs-Rest strategy: splits a multi-class classification into one binary classification problem per class.
▪ One-vs-One strategy: splits a multi-class classification into one binary classification problem per pair of classes.

https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/ 123
One-vs-Rest and One-vs-One: #1.1

Skiena, Steven S. The data science design manual. Springer, 2017. 124
One-vs-Rest and One-vs-One: #2
▪ An example of One-Vs-Rest for Multi-Class Classification

▪ For example, given a multi-class classification problem with examples for each class ‘red,’
‘blue,’ and ‘green‘. This could be divided into three binary classification datasets as follows:
▪ Binary Classification Problem 1: red vs [blue, green]
▪ Binary Classification Problem 2: blue vs [red, green]
▪ Binary Classification Problem 3: green vs [red, blue]

▪ A possible downside of this approach is that it requires one model to be created for each
class. (think about scaling it up to millions).

https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/ 125
One-vs-Rest and One-vs-One: #3
▪ Unlike one-vs-rest that splits it into one binary dataset for each class, the one-vs-one approach splits the dataset into one
dataset for each class versus every other class.

▪ For example, consider a multi-class classification problem with four classes: 'red', 'blue', 'green', and 'yellow'. This could be divided into six binary classification datasets as follows:
▪ Binary Classification Problem 1: red vs. blue
▪ Binary Classification Problem 2: red vs. green
▪ Binary Classification Problem 3: red vs. yellow
▪ Binary Classification Problem 4: blue vs. green
▪ Binary Classification Problem 5: blue vs. yellow
▪ Binary Classification Problem 6: green vs. yellow

▪ Each binary classification model predicts one class label, and the class with the most predictions or votes is chosen by the one-vs-one strategy.

https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/ 126
One-vs-Rest and One-vs-One: #4 [usage]
▪ OVR = generally applied to Logistic regression

▪ OVO = classically, this approach is suggested for support vector machines (SVMs) and related kernel-based algorithms. This is because the performance of kernel methods does not scale well with the size of the training dataset, and using subsets of the training data may counter this effect. A sketch of both wrappers follows below.
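▪ A minimal sketch with scikit-learn's multiclass wrappers (the base estimators and dataset are illustrative choices):
```python
# Hedged sketch: wrapping binary classifiers for a multi-class problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=4, n_informative=6, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # one model per class
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)                   # one model per class pair

print("OvR models:", len(ovr.estimators_))   # 4
print("OvO models:", len(ovo.estimators_))   # 4 * 3 / 2 = 6
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```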

https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/ 127
Classification regression: #2

128
Classification regression: #3 [Sigmoid = Logistic Function]

▪ Just like a perceptron, the


sigmoid neuron has inputs,
x1,x2,…. But instead of being
just 0 or 1, these inputs can also
take on any values between 0
and 1.

▪ So, for instance, 0.638… is a


valid input for a sigmoid neuron.

http://neuralnetworksanddeeplearning.com/chap1.html
129
Classification regression: #3 [Sigmoid = Logistic Function]

▪ Why were the Sigmoid neuron introduced? What does it have of so advantageous?

▪ Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. In fact, with a perceptron a small change can cause the output to flip from 0 to 1. This matters for tuning and for choosing the learning rate.

▪ That's the crucial fact which will allow a network of sigmoid neurons to learn.

http://neuralnetworksanddeeplearning.com/chap1.html 130
Classification regression: #4 [Sigmoid = Logistic Function]
▪ If σ had in fact been a step function, then the
sigmoid neuron would be a perceptron, since the
output would be 1 or 0 depending on whether
w⋅x+b was positive or negative.

▪ Actually, when w⋅x+b=0 the perceptron outputs


0, while the step function outputs 1. So, strictly
speaking, we'd need to modify the step function
at that one point. But you get the idea.

▪ Sigmoid neurons are similar to perceptrons, but


modified so that small changes in their weights
and bias cause only a small change in their
output. That's the crucial fact which will allow a
network of sigmoid neurons to learn.
131
http://neuralnetworksanddeeplearning.com/chap1.html
Sigmoid function
▪ The key point brought in by the sigmoid function is its smoothness: it is the smoothness of σ that is the crucial fact, not its detailed form.

▪ The smoothness of σ means that small changes Δwj in the weights and Δb in the bias will produce a small change in the output, well approximated by Δoutput ≈ Σj (∂output/∂wj) Δwj + (∂output/∂b) Δb.

▪ What this approximation is telling us is that Δoutput is a linear function of the changes Δwj and Δb in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output.

http://neuralnetworksanddeeplearning.com/chap1.html
132
Classification regression: #4

133
Classification regression: #5 [decision boundary]

134
Classification regression: #6 [decision boundary]

135
Classification regression: #7 [decision boundary]

136
Classification regression: #8 [decision boundary]

137
Classification regression: #9

▪ The issue: we cannot use the same cost function that we use for linear regression because
the Logistic Function will cause the output to be wavy, causing many local optima.

▪ In other words, it will not be a convex function.

138
Classification regression: #10

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
139
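▪ For reference (the cost function appears only as an image in the original slide), a standard way to write this convex cross-entropy cost with the sigmoid hypothesis h_θ is:
```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big],
\qquad h_\theta(x) = \frac{1}{1+e^{-\theta^{\top}x}}
```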
Classification regression: #11

140
Classification regression: #12
The wonderful thing about this loss function is that it is convex, meaning that we can find the parameters w which best fit the training examples using gradient descent.

Skiena, Steven S. The data science design manual. Springer, 2017. 141
Logistic regression: #2

▪ It performs well when data is linearly separable


or approximately linearly separable.

▪ Even if it is not linearly separable, we could try


to convert the data as shown in the figure.

▪ Although logistic regression is a low-bias, high-variance algorithm, we can still use L1/L2 regularization.

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 142
Logistic regression: #3
▪ Another disadvantage of the logistic regression model is that the interpretation is more
difficult because the interpretation of the weights is multiplicative and not additive.

▪ Logistic regression can suffer from complete separation. If there is a feature that would
perfectly separate the two classes, the logistic regression model can no longer be trained.
This is because the weight for that feature would not converge, because the optimal weight
would be infinite.

▪ This is really a bit unfortunate, because such a feature is really useful. But you do not need
machine learning if you have a simple rule that separates both classes. The problem of
complete separation can be solved by introducing penalization of the weights or defining a
prior probability distribution of weights.

https://christophm.github.io/interpretable-ml-book/logistic.html 143
Logistic regression: #4
▪ Logistic regression has a series of issues:

▪ Two-Class Problems. Logistic regression is intended for two-class or binary classification


problems. It can be extended for multi-class classification, but is rarely used for this
purpose.
▪ Unstable With Well Separated Classes. Logistic regression can become unstable when
the classes are well separated.
▪ Unstable With Few Examples. Logistic regression can become unstable when there are
few examples from which to estimate the parameters.

▪ Linear Discriminant Analysis (LDA) does address each of these points and is the go-to linear
method for multi-class classification problems. Even with binary-classification problems, it is
a good idea to try both logistic regression and linear discriminant analysis.

https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/
144
Linear vs. logistic regression #1
▪ Linear regression is used to predict a continuous dependent variable from a given set of independent features, whereas logistic regression is used to predict a categorical one.
▪ Linear regression is used to solve regression problems, whereas logistic regression is used to solve classification problems.
▪ In linear regression the approach is to find the best-fit line that predicts the output, whereas in logistic regression the approach is to fit an S-shaped curve that separates the two classes, 0 and 1.
▪ The estimation method for linear regression is least squares, whereas for logistic regression it is maximum likelihood estimation.
▪ In linear regression the output is continuous, like price or age, whereas in logistic regression the output must be categorical, like Yes/No or 0/1.
▪ There should be a linear relationship between the dependent and independent features in the case of linear regression, whereas this is not required in the case of logistic regression.
▪ There can be collinearity between independent features in the case of linear regression, but not in the case of logistic regression.

https://www.analyticssteps.com/blogs/how-does-linear-and-logistic-regression-work-machine-learning 145
Linear vs. logistic regression #2

▪ Logistic regression is the most fundamental classification algorithm. Mathematically, logistic


regression works in a manner similar to linear regression.

▪ For each column, logistic regression finds an appropriate weight, or coefficient, that
maximizes model accuracy.

▪ The primary difference is that instead of summing each term, as in linear regression, logistic
regression uses the sigmoid function.

Corey Wade. “Hands-On Gradient Boosting with XGBoost and scikit-learn


146
Logistic regression (as a NN)

147
LDA – Linear Discriminant Analysis

▪ It is the go-to linear method for multi-class classification problems.


▪ Even with binary-classification problems, it is a good idea to try both logistic regression and
linear discriminant analysis.

148
PCA vs. LDA
▪ LDA (Linear discriminant analysis) is a supervised method whereas PCA is an unsupervised
method.

▪ Principal Component Analysis (PCA) applied to this data identifies the combination of
attributes (principal components, or directions in the feature space) that account for the
most variance in the data. Linear Discriminant Analysis (LDA) tries to identify attributes that
account for the most variance between classes.

▪ PCA is tasked with finding the directions of maximum variance, whereas LDA is tasked with finding a feature subspace that maximises the separability of the classes.

https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html#sphx-glr- 149
auto-examples-decomposition-plot-pca-vs-lda-py
Factor Analysis [dimension reduction]

https://raw.githubusercontent.com/aaronwangy/Data-Science-
Cheatsheet/main/images/page2-1.png
150
How to Prepare Data for LDA
▪ This section lists some suggestions you may consider when preparing your data for use with
LDA.
▪ Classification Problems. LDA is intended for classification problems where the output
variable is categorical. LDA supports both binary and multi-class classification.
▪ Gaussian Distribution. The standard implementation of the model assumes a Gaussian
distribution of the input variables. Consider reviewing the univariate distributions of
each attribute and using transforms to make them more Gaussian-looking (e.g. log and
root for exponential distributions and Box-Cox for skewed distributions).
▪ Remove Outliers. Consider removing outliers from your data. These can skew the basic statistics used to separate classes in LDA, such as the mean and the standard deviation.
▪ Same Variance. LDA assumes that each input variable has the same variance. It is
almost always a good idea to standardize your data before using LDA so that it has a
mean of 0 and a standard deviation of 1.

https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/ 151
LDA: Shrinkage and Covariance Estimator #1
▪ Shrinkage is a form of regularization used to improve the estimation of covariance matrices
in situations where the number of training samples is small compared to the number of
features. In this scenario, the empirical sample covariance is a poor estimator, and
shrinkage helps improve the generalization performance of the classifier.

▪ First option: This automatically determines the optimal shrinkage parameter in an


analytic way following the lemma introduced by Ledoit and Wolf.
▪ Second option: the shrunk Ledoit-Wolf estimator of covariance may not always be the best choice. For example, if the data are normally distributed, the Oracle Approximating Shrinkage (OAS) estimator yields a smaller mean squared error than the one given by the Ledoit-Wolf formula. A small comparison sketch follows below.
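▪ A minimal sketch with scikit-learn (shrinkage requires the 'lsqr' or 'eigen' solver; `shrinkage='auto'` applies the Ledoit-Wolf lemma, and an OAS covariance estimator can be plugged in instead; the dataset is synthetic):
```python
# Hedged sketch: LDA with and without shrinkage when n_samples is small
# relative to n_features.
from sklearn.covariance import OAS
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=60, n_features=40, n_informative=5, random_state=0)

plain = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=None)
ledoit = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")         # Ledoit-Wolf
oas = LinearDiscriminantAnalysis(solver="lsqr", covariance_estimator=OAS())  # Oracle Approximating Shrinkage

for name, clf in [("no shrinkage", plain), ("ledoit-wolf", ledoit), ("oas", oas)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```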

https://scikit-learn.org/stable/modules/lda_qda.html
152
LDA: Shrinkage and Covariance Estimator #2
https://scikit-learn.org/stable/modules/lda_qda.html
153
Extensions to LDA

▪ Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).

▪ Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs is used such
as splines.

▪ Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the
variance (actually covariance), moderating the influence of different variables on LDA.

https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/ 154
Linear and Quadratic Discriminant Analysis

▪ This example plots the covariance


ellipsoids of each class and decision
boundary learned by LDA and QDA.

▪ The ellipsoids display the double


standard deviation for each class.

▪ With LDA, the standard deviation is


the same for all the classes, while
each class has its own standard
deviation with QDA.

https://scikit-learn.org/stable/modules/lda_qda.html 155
one-versus-all (OvA) and one-versus-one (OvO)

▪ The OvA technique (one-versus-all) involves a number of binary classifiers equal to the number of classes. For example, if a dataset has five classes, then OvA uses five binary classifiers, each of which detects one of the five classes. In order to classify a data point in this particular dataset, select the binary classifier that outputs the highest score.
▪ The OvO technique (one-versus-one) also involves multiple binary classifiers, but in this
case a binary classifier is used to train on a pair of classes. For instance, if the classes are
A, B, C, D, and E, then 10 binary classifiers are required: one for A and B, one for A and C,
one for A and D, and so forth, until the last binary classifier for D and E.
▪ In general, if there are n classes, then n*(n-1)/2 binary classifiers are required. Although
the OvO technique requires considerably more binary classifiers (e.g., 190 are required for
20 classes) than the OvA technique (e.g., a mere 20 binary classifiers for 20 classes), the
OvO technique has the advantage that each binary classifier is only trained on the portion
of the dataset that pertains to its two chosen classes.

Python 3 for machine learning, O. Compesato


156
Linear classifier

▪ A linear classifier separates a dataset into two classes. A linear classifier is a line for 2D
points, a plane for 3D points, and a hyperplane (a generalisation of a plane) for higher
dimensional points.

▪ Linear classifiers are often the fastest classifiers, so they are often used when the speed of
classification is of high importance. Linear classifiers usually work well when the input
vectors are sparse (i.e., mostly zero values) or when the number of dimensions is large.

Python 3 for machine learning, O. Compesato


157
k-Nearest Neighbour (kNN): #1

▪ Recall the k-Nearest Neighbour (kNN) classifier, which labels images by comparing them to (annotated) images from the training set. As we saw, kNN has a number of disadvantages:

▪ The classifier must remember all of the training data and store it for future comparisons
with the test data. This is space inefficient because datasets may easily be gigabytes in
size.

▪ Classifying a test image is expensive since it requires a comparison to all training


images.

https://cs231n.github.io/linear-classify/
158
k-Nearest Neighbour (kNN): #2

▪ The k nearest neighbour (kNN) algorithm is a classification algorithm. In brief, data points
that are “near” each other are classified as belonging to the same class. When a new point
is introduced, it is assigned to the class held by the majority of its nearest neighbours.

▪ For example, suppose that k equals 3, and a new data point is introduced. Look at the class
of its 3 nearest neighbours: let’s say they are A, A, and B. Then by majority vote, the new
data point is labelled as a data point of class A.

▪ The kNN algorithm is essentially a heuristic and not a technique with complex mathematical
underpinnings, and yet it’s still an effective and useful algorithm.

Python 3 for machine learning, O. Compesato


159
k-Nearest Neighbour (kNN): #3
k-Nearest Neighbours looks at the k nearest points; if k=1, then the unclassified point would be classified as a blue point.

https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e 160
k-Nearest Neighbour (kNN): #4
▪ If the value of k is too low, it can be subject
to outliers.
▪ If it’s too high, it may overlook classes with
only a few samples.

▪ Elbow method helps decide which k to use.


▪ You can see that the elbow occurs when k=3,
so k should equal 3.

https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e 161
k-Nearest Neighbour (kNN): #5 [conclusion]
▪ 2 important parameters: the number of neighbours and how you measure distance
between data points.
▪ Building the nearest neighbours model is usually very fast, but when your training set is very
large (either in number of features or in number of samples) prediction can be slow.
▪ This approach often does not perform well on datasets with many features (hundreds or
more), and it does particularly badly with datasets where most features are 0 most of the
time (so-called sparse datasets).
▪ So, while the k-nearest neighbours algorithm is easy to understand, it is not often used in practice, due to prediction being slow and its inability to handle many features.
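▪ A minimal sketch of the two key parameters (the built-in toy dataset and the values of k are used only for illustration):
```python
# Hedged sketch: the two main kNN knobs (number of neighbours and the
# distance metric) and how accuracy changes with k.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 3, 10):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    print(f"k={k:2d}  train={knn.score(X_train, y_train):.3f}  test={knn.score(X_test, y_test):.3f}")
```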

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 162
KNN: decision boundary & overfitting #1
▪ As you can see on the left in the figure, using a
single neighbour results in a decision boundary that
follows the training data closely.
▪ Considering more and more neighbours leads to a
smoother decision boundary.
▪ ATTENTION: If you consider the extreme case
where the number of neighbours is the
number of all data points in the training set,
each test point would have exactly the same
neighbours (all training points) and all
predictions would be the same: the class
that is most frequent in the training set.

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 163
KNN: decision boundary & overfitting #2

▪ Considering a single nearest neighbour, the


prediction on the training set is perfect.
▪ But when more neighbours are considered, the
model becomes simpler and the training
accuracy drops.
▪ The test set accuracy for using a single neighbour
is lower than when using more neighbours,
indicating that using the single nearest neighbour
leads to a model that is too complex.
▪ On the other hand, when considering 10
neighbours, the model is too simple and
performance is even worse.

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016. 164
Support vector machines (SVMs): #0
▪ Support Vector Machines (SVMs) are most frequently used for solving classification
problems, which fall under the supervised machine learning category.

▪ However, with small adaptations, SVMs can also be used for other types of problems such
as:

▪ Clustering (unsupervised learning) through the use of Support Vector Clustering algorithm

▪ Regression (supervised learning) through the use of Support Vector Regression algorithm
(SVR)

https://towardsdatascience.com/svm-classifier-and-rbf-kernel-how-to-make-better-models-in-python-73bb4914af5b 165
Support vector machines (SVMs): #1
▪ SVM is a machine learning classifier that works by drawing a hyperplane in
d-dimensional space and classifying points by which side they fall on.

▪ SVMs are yet another type of supervised machine learning algorithm. It is sometimes
cleaner and more powerful.

▪ Support vectors are the data points that lie closest to the decision surface (or
hyperplane)

▪ They are the data points most difficult to classify.

▪ SVMs maximize the margin around the separating hyperplane. This becomes a quadratic
programming problem that is easy to solve by standard methods. It does so by
separating only a few data examples (called supports, hence the name of the algorithm)
from the rest of the data using a function.

166
Support vector machines (SVMs): #1.1
▪ This is the dividing line that maximizes the margin
between the two sets of points. Notice that a few of the
training points just touch the margin; they are indicated
by the black circles.

▪ These points are the pivotal elements of this fit, and are
known as the support vectors, and give the algorithm
its name.

▪ A key to this classifier’s success is that for the


fit, only the position of the support vectors
matters; any points further from the margin
that are on the correct side do not modify the
fit!
VanderPlas, Jake. Python data science handbook: Essential
167
tools for working with data. " O'Reilly Media, Inc.", 2016.
Support vector machines (SVMs): #2
▪ Support vector machines (SVMs) are a set of supervised learning methods used for
classification, regression and outliers detection.

▪ The advantages of support vector machines are:


▪ Effective in high dimensional spaces.
▪ Still effective in cases where number of dimensions is greater than the number of samples.
▪ Uses a subset of training points in the decision function (called support vectors), so it is also memory
efficient.
▪ Versatile: different Kernel functions can be specified for the decision function, for instance RBF kernel

▪ The disadvantages of support vector machines include:


▪ If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
▪ SVMs do not directly provide probability estimates, these could be calculated using an expensive
cross-validation.

https://scikit-learn.org/stable/modules/svm.html
168
Support vector machines (SVMs): #3

169

https://scikit-learn.org/stable/modules/svm.html
Support vector machines (SVMs): #4
▪ An SVM with a linear kernel performs comparably to logistic regression, but an SVM also works well with non-separable data if equipped with a non-linear kernel, such as RBF.

▪ In high dimensions the performance of logistic regression is compromised, while an SVM still performs well.

▪ A good example could be news classification where feature dimension is in tens of


thousands.

▪ For high dimension, high accuracy can be obtained at the expense of intense computation
and high memory consumption.

Liu, Yuxi Hayden. Python Machine Learning By Example. Packt Publishing Ltd, 2017. 170
Support vector machines (SVMs): #5
▪ Why is not a trivial task?

▪ Let us say our task is to divide squares from stars. The figure shows two possible solutions, but even more can exist. Both chosen solutions are too near to the existing observations (as shown by the proximity of the lines to the data points), but there is no reason to think that new observations will behave precisely like those shown in the figure.
Two possible solutions

▪ SVM minimizes the risk of choosing the wrong line (solutions A or


B) by choosing the solution characterized by the largest distance
from the bordering points of the two groups. Having so much
space between groups (the maximum possible) should reduce the
chance of picking the wrong solution!

Viable SVM solution


Mueller, John Paul, and Luca Massaron. Python for data science for dummies. John Wiley & Sons, 2019 171
Support vector machines (SVMs): #6 [the kernel
trick]
▪ However, what if we wanted to apply SVMs to non-linear problems? This is where the kernel
trick comes in. A kernel is a function that takes the original non-linear problem and
transforms it into a linear one within the higher-dimensional space.

▪ By applying this transformation z = x² + y²

▪ Using this three-dimensional space with x, y, and z coordinates, we can now draw a
hyperplane (flat 2D surface) to separate red and black points. Hence, the SVM classification
algorithm can now be used. In the next slide you can see a representation of this procedure.
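▪ A minimal sketch of both ideas on a synthetic "circles" dataset: the explicit feature z = x² + y² makes the classes linearly separable, while an RBF-kernel SVC achieves the same effect implicitly (parameter values are arbitrary):
```python
# Hedged sketch: explicit feature mapping z = x^2 + y^2 vs. the implicit
# mapping performed by an RBF kernel.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Explicit transform: add z = x^2 + y^2, then a linear SVM is enough.
z = (X ** 2).sum(axis=1, keepdims=True)
linear_on_z = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# Implicit transform: the RBF kernel works directly on the 2D points.
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)

print("linear SVM on [x, y, z]:", linear_on_z.score(np.hstack([X, z]), y))
print("RBF SVM on [x, y]:      ", rbf.score(X, y))
```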

https://towardsdatascience.com/svm-classifier-and-rbf-kernel-how-to-make-better-models-in-python-73bb4914af5b 172
Support vector machines (SVMs): #7 [the kernel trick]

173
SVM: pros & cons
▪ Pros:
▪ Their dependence on relatively few support vectors means that they are very compact models, and
take up very little memory.
▪ Once the model is trained, the prediction phase is very fast.
▪ Because they are affected only by points near the margin, they work well with high-dimensional
data—even data with more dimensions than samples, which is a challenging regime for other
algorithms.
▪ Their integration with kernel methods makes them very versatile, able to adapt to many types of
data.
▪ Cons:
▪ The scaling with the number of samples N is O[N^3] at worst or O[N^2] for efficient
implementations. For large numbers of training samples, this computational cost can be prohibitive.
▪ The results are strongly dependent on a suitable choice for the softening parameter C. This must be
carefully chosen via cross-validation, which can be expensive as datasets grow in size.
▪ The results do not have a direct probabilistic interpretation. This can be estimated via an internal
cross-validation (see the probability parameter of SVC), but this extra estimation is costly.

VanderPlas, Jake. Python data science handbook: Essential tools for working with data. " O'Reilly Media, Inc.", 2016
174
General idea behind the kernel trick: #1

https://www.cs.cornell.edu/courses/cs6787/2020fa/lectures/Lecture4.pdf 175
General idea behind the kernel trick: #2

https://www.cs.cornell.edu/courses/cs6787/2020fa/lectures/Lecture4.pdf 176
General idea behind the kernel trick: #3

▪ The benefit of learning with kernels is that we can express a wider class of classification
functions

▪ However, recall that linear classifier learning problems are "easy" to solve because they are convex, and their gradients are easy to compute.

▪ The major cost of learning naively with kernels is evaluating the kernel.

177
Support vector machines (SVMs): #7
▪ Linear SVC is not a probabilistic classifier by default but it has a built-in calibration option.

https://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html#sphx-glr-a
uto-examples-classification-plot-classification-probability-py
178
Support vector machines (SVMs): #8 [ code application]

▪ https://towardsdatascience.com/svm-classifier-and-rbf-kernel-how-to-make-better-models-in-python-73bb4914af5b

179
Support vector machines (SVMs): #9
Why Support Vector Machines were never a good bet for Artificial Intelligence tasks
that need good representations. SVM’s are just a clever reincarnation of Perceptrons

▪ View #1:
▪ They expand the input to a (very large) layer of non-linear non-adaptive features.
▪ They only have one layer of adaptive weights.
▪ They have a very efficient way of fitting the weights that controls overfitting.

▪ View #2:
▪ They use each input vector in the training set to define a non-adaptive "pheature": the global match between a test input and that training input.
▪ They have a clever way of simultaneously doing feature selection and finding weights on the remaining features.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec13.pdf 180
Kernel trick vs. kernel approximation
▪ Kernel approximation: approximate the feature mappings that correspond to certain
kernels, as they are used for example in support vector machines. Perform non-linear
transformations of the input, which can serve as a basis for linear classification or other
algorithms. Makes use of feature maps explicitly. Explicit mappings can be better suited for
online learning and can significantly reduce the cost of learning with very large datasets.

▪ Kernel trick: makes use of feature maps implicitly. Standard kernelized SVMs do not scale
well to large datasets
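▪ A minimal sketch of an explicit approximate feature map followed by a fast linear classifier (gamma, n_components and the dataset are arbitrary choices):
```python
# Hedged sketch: approximate the RBF kernel with random Fourier features
# (RBFSampler), then train a linear model on the transformed data.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

approx_kernel_model = make_pipeline(
    RBFSampler(gamma=0.2, n_components=300, random_state=0),  # explicit feature map
    SGDClassifier(loss="hinge", max_iter=1000),               # linear SVM-like model
).fit(X, y)

print(approx_kernel_model.score(X, y))
```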

https://scikit-learn.org/stable/modules/kernel_approximation.html#kernel-approximation 181
https://raw.githubusercontent.com/aaronwangy/Data-Science-Cheatsheet/main/images/page2-1.png 182
Decision trees vs. SVM
Random forests allow you to determine the feature importance. SVM’s can’t do this.

183
What is a hyperplane?
▪ In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p − 1.

▪ For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace—in other


words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace—that is,
a plane. In p > 3 dimensions, it can be hard to visualize a hyperplane, but the notion of a (p
− 1)-dimensional flat subspace still applies.

▪ The word affine indicates that the subspace need not pass through the origin.

▪ This is used in SVMs.

James, Gareth, et al. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

184
Comparing different classifier: #1
▪ Well calibrated classifiers are probabilistic classifiers for which the output of the prediction
can be DIRECTLY INTERPRETED as a CONFIDENCE LEVEL.

▪ For instance a well calibrated (binary) classifier should classify the samples such that among
the samples to which it gave a prediction probability value close to 0.8, approximately 80%
actually belong to the positive class.

▪ A model is called calibrated if the reported uncertainty actually matches how correct it is.

Guido, Sarah, and Andreas Müller. Introduction to machine learning with python. Vol. 282. O'Reilly Media, 2016.
Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005.
https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-auto-examples-calibration-plot-compare-calibration-py
185
Comparing different classifier: #2
Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005.
https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-auto-examples-calibration-plot-compare-calibration-py
186
Comparing different classifier: #3

▪ [1] LOGISTIC REGRESSION returns well calibrated predictions as it directly optimizes


log-loss. In contrast, the other methods return BIASED probabilities, with different biases
per method.

▪ [2] GAUSSIAN NAIVE BAYES tends to push probabilities to 0 or 1 (note the counts in the
histograms). This is mainly because it makes the assumption that features are conditionally
independent given the class, which is not the case in this dataset which contains 2
redundant features.

Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005.
https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-auto-examples-calibration-plot-compare-calibration-py
187
Comparing different classifier: #4
▪ [3] RANDOM FOREST CLASSIFIER shows the opposite behaviour: the histograms show peaks at
approx. 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. Methods such as
bagging and random forests that average predictions from a base set of models can have difficulty
making predictions near 0 and 1 because variance in the underlying base models will bias
predictions that should be near zero or one away from these values. For example, if a model should
predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If
we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict
values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away
from 0. We observe this effect most strongly with random forests because the base-level trees
trained with random forests have relatively high variance due to feature subsetting.”

▪ [4] SUPPORT VECTOR CLASSIFICATION (SVC) shows an even more sigmoid curve as the
RandomForestClassifier, which is typical for maximum-margin methods, which focus on hard
samples that are close to the decision boundary (the support vectors).

https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-a
uto-examples-calibration-plot-compare-calibration-py 188
Comparing different classifier: #5

▪ We can calibrate uncalibrated classifiers.

▪ There are two methods:

▪ isotonic calibration and


▪ sigmoid calibration
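▪ A minimal sketch using scikit-learn's calibration wrapper (the base classifier, dataset and method choice are illustrative):
```python
# Hedged sketch: wrapping an uncalibrated classifier with isotonic or
# sigmoid (Platt) calibration.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = RandomForestClassifier(n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)  # or method="sigmoid"
calibrated.fit(X_train, y_train)

print(calibrated.predict_proba(X_test[:3]))  # calibrated class probabilities
```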

https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html#sphx-glr-a
uto-examples-calibration-plot-compare-calibration-py 189
naïve Bayes (NB) classifier: #1

▪ An NB classifier assumes the attributes are conditionally independent, and it works well even when the assumption is not true.

▪ This assumption greatly reduces computational cost, and it’s a simple algorithm to
implement that only requires linear time.

▪ Moreover, an NB classifier is easily scalable to larger datasets and good results are obtained
in most cases.

Python 3 for machine learning, O. Compesato 190


naïve Bayes (NB) classifier: #2
▪ Pros:
▪ can be used for binary and multiclass classification
▪ provides different types of NB algorithms
▪ good choice for text classification problems
▪ a popular choice for spam email classification
▪ can be easily trained on small datasets

▪ Cons:
▪ all features are assumed unrelated
▪ it cannot learn relationships between features
▪ it can suffer from “the zero probability problem”

Python 3 for machine learning, O. Compesato 191


naïve Bayes (NB) classifier: #3

▪ The zero probability problem refers to the case when the conditional probability is zero for
an attribute and thus it fails to give a valid prediction.

▪ However, it can be fixed explicitly using a Laplacian estimator (Laplace smoothing), as sketched below.
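▪ A minimal sketch with scikit-learn, where the `alpha` parameter is the Laplace/Lidstone smoothing term (the tiny count matrix is made up):
```python
# Hedged sketch: Laplace smoothing (alpha=1.0) avoids zero probabilities for
# feature values never seen with a given class during training.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_train = np.array([[2, 0, 1],
                    [1, 0, 3],
                    [0, 4, 0],
                    [0, 2, 1]])
y_train = np.array([0, 0, 1, 1])

nb = MultinomialNB(alpha=1.0)         # alpha=1.0 -> Laplace smoothing
nb.fit(X_train, y_train)
print(nb.predict_proba([[0, 1, 2]]))  # no class collapses to probability 0
```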

Python 3 for machine learning, O. Compesato 192


Error function used in classification: #1
▪ There are 3 error functions you can use: absolute error, square error, and log loss, where the last one is the most used.

▪ Notice that with the absolute and square error functions, points that are vastly
misclassified have large errors, but never too large. Let’s look at an example: a
point with label 1 but that the classifier has assigned a prediction of 0.01. This
point is vastly misclassified, since we would hope that the classifier assigns it a
prediction close to 1. The absolute error for this point is the difference between 1
and 0.01, or 0.99. The square error is this difference squared, or 0.9801. But this
is a small error for a point that is so vastly misclassified. We’d like an error
function that gives us a higher error for this point.

193
Luis G. Serrano, Grokking Machine Learning MEAP V07
Error function used in classification: #2
▪ Let’s say we have a point with label 1 (happy), for which the classifier makes a prediction of
0.00001. This point is very poorly classified. The absolute error will be 0.99999, and the
square error will be 0.9999800001.

▪ However, the log loss will be the negative of the natural logarithm of (1-0.99999), which is
11.51. This value is much larger than the absolute or square errors, which means the log
loss error is a better alarm for poorly classified points.
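▪ A quick numeric check of the comparison above:
```python
# Hedged sketch: error values for a point with true label 1 and prediction 0.00001.
import math

y_true, y_pred = 1.0, 0.00001
absolute_error = abs(y_true - y_pred)           # 0.99999
square_error = absolute_error ** 2              # 0.9999800001
log_loss = -math.log(y_pred)                    # -ln(0.00001) ~= 11.51
print(absolute_error, square_error, round(log_loss, 2))
```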

194
Luis G. Serrano, Grokking Machine Learning MEAP V07
More on the log loss error function
▪ There is another reason why you may want to use the log loss. It has to do with
independent probabilities.

▪ The total log loss for a classifier is defined as the sum of the log loss for every point in the dataset. A sum of logs is equivalent to the log of a product. Why do we multiply probabilities? Because when events are independent (or when we assume they are, for the sake of simplicity) their probabilities get multiplied.

▪ If the occurrences of two events don’t depend on each other, the probability of both of
them happening is the product of the probabilities of both events happening.

▪ This is a nice property of the loss function.

195
Luis G. Serrano, Grokking Machine Learning MEAP V07
Dealing with unbalanced binary classification
▪ Change metrics: The accuracy of your model might not be the best metric to look at, and an example explains why. Let's say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as "not fraudulent", it would have an accuracy of 99%! Therefore, you may want to consider using metrics like precision and recall.

▪ Increase the cost of misclassifying the minority class. By increasing the penalty of such, the
model should classify the minority class more accurately.

▪ Oversample the minority class or undersample the majority class.
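▪ A minimal sketch of the second and third points with scikit-learn (the imbalance ratio and the use of class weights are illustrative choices):
```python
# Hedged sketch: re-weighting the minority class and evaluating with
# precision/recall instead of plain accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)  # penalize minority errors more
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))  # look at precision/recall, not accuracy
```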

https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e 196
Maximum Entropy
▪ Maximum Entropy classifiers use a basic model that is similar to the
model used by naive Bayes; however, they employ iterative
optimization to find the set of feature weights that maximizes the
probability of the training set.

VanderPlas, Jake. Python data science handbook: Essential tools for working with data. " O'Reilly Media, Inc.", 2016. 197
Generative vs. discriminative classification
▪ A discriminative model refers to the class of models which learn to classify based on probability estimates, i.e. p(y|X) = p(class label | data point), or which learn a direct map from inputs x to class labels y.
▪ Essentially, for a classification problem, rather than modelling each class, we simply find a line or curve (in two dimensions) or manifold (in multiple dimensions) that divides the classes from each other. SVM is an example.

▪ Whereas a generative model explicitly models the distribution of each class by learning the joint probability p(X, y) between the inputs and class labels, and then makes its predictions by using Bayes' rule to get p(y|X) and picking the most likely label y.
▪ There is a plethora of reasons why one might find a generative process fascinating. One of them is that by using a generative model we can understand the causal relationships between variations in the data and the output observations and, based on that, form an explainable hypothesis. Another important feature of generative modelling is being able to find disentangled factors in the data which correspond to the various data-generating factors. Examples are GANs and variational autoencoders (VAEs).

Ng, Andrew Y., and Michael I. Jordan. "On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes." Advances in neural information processing systems. 2002.
VanderPlas, Jake. Python data science handbook: Essential tools for working with data. O'Reilly Media, Inc., 2016.
https://medium.com/analytics-vidhya/generative-modelling-using-variational-autoencoders-vae-and-beta-vaes-81a56ef0bc9f
198
Deep Label Distribution Learning with Label Ambiguity
• https://csgaobb.github.io/Projects/DLDL.html

199
Learning to rank
(go back)

200
Learning to Rank = LTR
▪ Learning to Rank (LTR) is a class of techniques that apply supervised machine learning (ML)
to solve ranking problems. The main difference between LTR and traditional supervised ML
is this:

▪ Traditional ML solves a prediction problem (classification or regression) on a single


instance at a time.
▪ LTR solves a ranking problem on a list of items. The aim of LTR is to come up with
optimal ordering of those items. As such, LTR doesn’t care much about the exact score
that each item gets, but cares more about the relative ordering among all the items.

https://medium.com/@nikhilbd/intuitive-explanation-of-learning-to-rank-and-ranknet-lambdarank-and-lambdamart-fe1e17fac418 201
RankNet, LambdaRank and LambdaMART

▪ Algorithms: RankNet, LambdaRank and LambdaMART are all LTR algorithms developed by Chris Burges
and his colleagues at Microsoft Research. RankNet was the first one to be developed, followed by
LambdaRank and then LambdaMART.

▪ Here is the reference paper:


https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf

▪ In all three techniques, ranking is transformed into a pairwise classification or regression problem. That
means you look at pairs of items at a time, come up with the optimal ordering for that pair of items, and
then use it to come up with the final ranking for all the results.

https://medium.com/@nikhilbd/intuitive-explanation-of-learning-to-rank-and-ranknet-lambdarank-and-lambdamart-fe1e17fac418 202
RankNet
▪ The cost function for RankNet aims to minimize the number of inversions in ranking.

▪ Here an inversion means an incorrect order among a pair of results, i.e. when we rank a
lower rated result above a higher rated result in a ranked list.

▪ RankNet optimizes the cost function using Stochastic Gradient Descent.

https://medium.com/@nikhilbd/intuitive-explanation-of-learning-to-rank-and-ranknet-lambdarank-and-lambdamart-fe1e17fac418 203
LambdaRank
▪ During the RankNet training procedure, you don't need the costs themselves, only the gradients (λ) of the cost with respect to the model scores.
▪ Furthermore, it was found that scaling the gradients by the change in NDCG (normalized discounted cumulative gain) obtained by swapping each pair of documents gave good results.
▪ The core idea of LambdaRank is to use this new cost function for training a RankNet.
▪ On experimental datasets, this shows both speed and accuracy improvements over the
original RankNet.

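As a hedged sketch of the idea above, the LambdaRank-style gradient for a pair can be written as the RankNet lambda scaled by |ΔNDCG|, the absolute change in NDCG that swapping the two documents would cause. The helper below reuses the `ranknet_pair_lambda` function from the previous sketch and uses invented example values.

```python
def lambdarank_pair_lambda(s_i, s_j, delta_ndcg, sigma=1.0):
    """LambdaRank gradient for a pair where i is ranked above j in the ground truth:
    the RankNet lambda scaled by |delta_ndcg|, i.e. how much swapping i and j would
    change the list's NDCG. Pairs that matter more for NDCG get larger gradients."""
    return ranknet_pair_lambda(s_i, s_j, sigma) * abs(delta_ndcg)

# A mis-ordered pair near the top of the list (large |ΔNDCG|) receives a stronger
# update than an equally mis-ordered pair near the bottom (small |ΔNDCG|).
print(lambdarank_pair_lambda(0.5, 2.0, delta_ndcg=0.30))
print(lambdarank_pair_lambda(0.5, 2.0, delta_ndcg=0.02))
```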
https://medium.com/@nikhilbd/intuitive-explanation-of-learning-to-rank-and-ranknet-lambdarank-and-lambdamart-fe1e17fac418 204
LambdaMART
▪ LambdaMART combines LambdaRank and MART (Multiple Additive Regression Trees).

▪ While MART uses gradient boosted decision trees for prediction tasks, LambdaMART uses
gradient boosted decision trees using a cost function derived from LambdaRank for solving
a ranking task.

▪ On experimental datasets, LambdaMART has shown better results than LambdaRank and
the original RankNet.

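In practice, LambdaMART-style models are usually trained through gradient-boosting libraries rather than implemented from scratch. Below is a minimal sketch using LightGBM's `LGBMRanker`, which exposes a `lambdarank` objective; the synthetic data, query group sizes, and hyperparameters are assumptions made purely for illustration.

```python
import numpy as np
import lightgbm as lgb

# Synthetic example: 100 documents with 5 features, graded relevance labels 0-3,
# split into 10 queries of 10 documents each (the 'group' array gives query sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 4, size=100)
group = [10] * 10

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, learning_rate=0.1)
ranker.fit(X, y, group=group)

# Scores for a new query's documents; sort descending to obtain the ranking.
scores = ranker.predict(rng.normal(size=(10, 5)))
print(np.argsort(-scores))
```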
https://medium.com/@nikhilbd/intuitive-explanation-of-learning-to-rank-and-ranknet-lambdarank-and-lambdamart-fe1e17fac418 205
Discounted cumulative gain
▪ Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval,
it is often used to measure the effectiveness of web search engine algorithms or related
applications.

▪ Using a graded relevance scale of documents in a search-engine result set, DCG measures
the usefulness, or gain, of a document based on its position in the result list.

▪ The gain is accumulated from the top of the result list to the bottom, with the gain of each
result discounted at lower ranks.

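A hedged NumPy sketch of DCG and its normalized variant NDCG is shown below, using the common (2^rel − 1) / log2(rank + 1) form of gain and discount (other variants exist); the relevance grades are made up for the example.

```python
import numpy as np

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, len(rel) + 1)
    return np.sum((2.0 ** rel - 1.0) / np.log2(ranks + 1))

def ndcg(relevances, k=None):
    """DCG normalized by the ideal DCG (relevances sorted in decreasing order)."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevances in the order the system returned the documents (made-up example).
print(ndcg([3, 2, 3, 0, 1, 2], k=6))   # ~0.95: close to the ideal ordering
```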
https://en.wikipedia.org/wiki/Discounted_cumulative_gain
206
Content-based vs. collaborative filtering

▪ In content-based filtering, you use the properties of the objects to find similar products. For
example, using content-based filtering, a movie recommender may recommend movies of
the same genre or movies directed by the same director.

▪ In collaborative filtering, your behavior is compared to that of other users, and users with similar
behavior dictate what is recommended to you. To give a very simple example, if you bought
a TV and another user bought a TV as well as a recliner, you would be recommended the
recliner as well.

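The contrast can be made concrete with a small, hedged sketch: content-based filtering compares item feature vectors, while collaborative filtering compares user interaction vectors. The toy matrices and item/user names below are made up for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Content-based: items described by their own properties (e.g., genre flags for movies).
item_features = np.array([[1, 0, 1],    # item 0: action + thriller
                          [1, 0, 1],    # item 1: action + thriller
                          [0, 1, 0]])   # item 2: romance
print(cosine(item_features[0], item_features[1]))  # 1.0 -> recommend item 1 to fans of item 0

# Collaborative: users described by what they interacted with (rows = users, cols = items).
user_item = np.array([[1, 0, 1, 0],     # user A bought items 0 and 2
                      [1, 1, 1, 0],     # user B bought items 0, 1 and 2
                      [0, 0, 0, 1]])    # user C bought item 3
print(cosine(user_item[0], user_item[1]))  # ~0.82 -> user B is similar, so recommend item 1 to user A
```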
https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e
207
Neural search
▪ The term neural search is a less academic form of the term neural information retrieval,
which first appeared during a research workshop at the SIGIR 2016 conference
(www.microsoft.com/en-us/research/event/neuir2016) focused on applying deep neural
networks to the field of information retrieval.

▪ Why do we need it when we already have good search engines? If you’ve ever worked on
designing, implementing, or configuring a search engine, you’ve surely faced the problem of
obtaining a solution that adapts to your data. DL helps a lot in providing solutions to
these problems that are based on your data, not on fixed rules or algorithms.

Teofili, Tommaso. Deep Learning for Search. Manning Publications Company, 2019.
208
What was search like before the advent of DL?
▪ Before the advent of DL, images had to be decorated with metadata (data about data)
describing their contents before being put into the search engine.
▪ And that metadata usually had to be typed by a human. Deep neural networks can abstract
a representation of an image that captures what’s in there so that no human intervention is
required to put an image description in the search engine.

Teofili, Tommaso. Deep Learning for Search. Manning Publications Company, 2019.
209
Arrow’s Impossibility Theorem
▪ This is a consequence of Arrow’s impossibility theorem, which proves that no election
system for aggregating permutations of preferences can simultaneously satisfy a small set of reasonable fairness properties.

▪ See the reference for the full list of these properties.

▪ Take-home lesson: we do not seek correct rankings, because this is an ill-defined objective.
Instead, we seek rankings that are useful and interesting.

Skiena, Steven S. The data science design manual. Springer, 2017 210
