
Advanced Data Analytics

Lecture 6

Simon Scheidegger
Today’s Roadmap

1. Support Vector Machines (SVM)

2. Basics on Gaussian Process Regression


1. Support Vector Machines (SVMs)

We saw that linear classification was rather limited in that
it could only identify straight line classifiers.

It could only separate out groups of data if it was possible to draw a straight line
(hyperplane in higher dimensions) between them.

This meant that it could not learn to distinguish between the two true classes of the
2D XOR function.

However, we saw that it was possible to modify the problem so that the linear
classifier could solve the problem, by changing the data so that it used more
dimensions than the original data.
SVM – the idea


The problem is how to work out which dimensions to use,
→ that is what kernel methods do – the class of algorithms that
we will talk about in this lecture.

SVM provides very impressive classification performance on
reasonably sized datasets.

SVMs do not work well on extremely large datasets,
→ the computations don’t scale well with the number of training
examples,
→ become computationally very expensive.
More on SVM

There is rather more to the SVM than the kernel method; the algorithm
also reformulates the classification problem in such a way that we can
tell a good classifier from a bad one, even if they both give the same
results on a particular dataset.

It is this distinction that enables the SVM algorithm to be derived, so
that is where we will start.

Three different classification lines. Is there any reason why one is better than
the others?
Fig.: from Marsland (2014)
Basic Idea of SVM

We can measure the distance that we have to travel away from
the line (in a direction perpendicular to the line) before we hit a data
point.


How large could we make the radius of this cylinder until we started to
put points into a no-man’s land, where we don’t know which
class they are from?

This largest radius is known as the margin,
labelled M.

The data points in each class that lie closest
to the classification line have a name as well.

They are called support vectors.
Basic Idea of SVM


Using the argument that the best classifier is the one that goes through
the middle of no-man’s land, we can now make two arguments:


first that the margin should be as large as possible

second that the support vectors are the most useful data points
because they are the ones that we might get wrong.

This leads to an interesting feature of these algorithms:
→ after training, we can throw away all of the data except for the
support vectors, and use them for classification, which is a useful
saving in data storage.
Basic Idea of SVM

By modifying the features we hope to find spaces where the data are linearly
separable.
SVMs more formally


Support Vector Machines: a discriminative method for classification

Independent features and data points need to be “metric”.

Idea: certain hyperplanes (cf. linear regression) separate the data
as well as possible.

In an ideal world, the data lie on the two sides of the hyperplane.
SVMs
Bishop, Chapter 7; Murphy, Chapter 14

Fig. from Zaki & Meira – Data Mining and Analysis


Recall: Linear Classification

Consider a two-class classification problem

Use a linear model  y(x) = wᵀx + b

followed by a threshold function (e.g. classify by the sign of y(x))



For now, let’s assume training data are linearly separable

Recall that the perceptron would converge to a perfect classifier
for such data

But there are many such perfect classifiers
Detour – Kernels
See Bishop (2006), Chapter 6; Murphy, Chapter 14
Recall Generalized Linear Model
Non-linear Mappings

In the lectures on linear models for regression and
classification, we looked at models with fixed basis functions φ(x), e.g. y(x) = wᵀφ(x)

The feature space could be high-dimensional

This was good because if data aren’t separable in original
input space (x), they may be in feature space

We’d like to avoid computing the high-dimensional φ(x) explicitly

We’d also like to work with inputs x which don’t have a natural
vector-space representation

e.g. graphs, sets, strings

N items are always (linearly!) separable in N dimensions
Kernel Trick

In previous lectures on linear models, we would explicitly
compute φ(xn) for each data point.

Run algorithm in feature space.

For some feature spaces, we can compute the dot product φ(xi)ᵀφ(xj)
efficiently.

The efficient method is the computation of a kernel function  k(xi, xj) = φ(xi)ᵀφ(xj)


The kernel trick is to rewrite an algorithm to only have x
enter in the form of dot products

The way to move forward from here:

Kernel trick examples

Kernel functions
A Kernel Trick

Let’s look at the nearest-neighbor classification algorithm

 For input point xi, find the point xj with the smallest distance:

||xi − xj||² = xiᵀxi − 2 xiᵀxj + xjᵀxj

If we used a non-linear feature space φ(·):

||φ(xi) − φ(xj)||² = φ(xi)ᵀφ(xi) − 2 φ(xi)ᵀφ(xj) + φ(xj)ᵀφ(xj)
                   = k(xi, xi) − 2 k(xi, xj) + k(xj, xj)

So nearest-neighbor can be done in a high-dimensional
feature space without actually moving to it.
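As a small illustration (a sketch, not the lecture’s code), the nearest neighbor can be found from kernel evaluations alone via ||φ(xi) − φ(xj)||² = k(xi, xi) − 2 k(xi, xj) + k(xj, xj); the kernel and the data below are made up for the example:

import numpy as np

def k(a, b):
    # example kernel: k(a, b) = (a . b)^2, a dot product in some feature space
    return float(a @ b) ** 2

def nearest_neighbor_kernel(x_new, X):
    # squared feature-space distance from kernel values only:
    # ||phi(x_new) - phi(x_j)||^2 = k(x_new, x_new) - 2 k(x_new, x_j) + k(x_j, x_j)
    d2 = [k(x_new, x_new) - 2.0 * k(x_new, xj) + k(xj, xj) for xj in X]
    return int(np.argmin(d2))

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # hypothetical training points
print(nearest_neighbor_kernel(np.array([1.5, 1.5]), X))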
A Kernel Function

Consider the kernel function  k(x, z) = (xᵀz)²  for x, z ∈ R².

Expanding gives
(x₁z₁ + x₂z₂)² = x₁²z₁² + 2 x₁z₁x₂z₂ + x₂²z₂² = φ(x)ᵀφ(z),
with  φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ.

So this particular kernel function does correspond to a dot
product in a feature space (it is valid).

Computing k(x, z) is faster than explicitly computing φ(x)ᵀφ(z).

In higher dimensions, with a larger exponent, it is much faster.
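A quick numerical check of this identity (a minimal sketch; the vectors are arbitrary):

import numpy as np

def phi(x):
    # explicit feature map for the 2-d example: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(x, z):
    # the kernel computed directly in input space
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(phi(x) @ phi(z), k(x, z))   # both evaluate to 121.0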
Why Kernels?


Why bother with kernels?

Often easier to specify how similar two things are (dot product)
than to construct an explicit feature space φ(x).

There are high-dimensional (even infinite) spaces that have
efficient-to-compute kernels.

So you want to use kernels

Need to know when kernel function is valid, so we can apply the
kernel trick.
Valid Kernels
 Given some arbitrary function k(xi , xj ), how do we know if it
corresponds to a dot product in some space?

Valid kernels: if k(·, ·) satisfies:
 Symmetric: k(xi, xj) = k(xj, xi)
 Positive definite: for any x₁, . . . , xN, the Gram matrix K
(with entries Knm = k(xn, xm)) must be positive semi-definite:

xᵀK x ≥ 0 for all x

then k(·, ·) corresponds to a dot product in some space

a.k.a. Mercer kernel, admissible kernel, reproducing kernel
Examples of some Kernels
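As an illustration (a sketch of typical choices, not an exhaustive list), a few commonly used kernels written as plain Python functions:

import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # a.k.a. RBF / squared-exponential kernel
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))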
Constructing Kernels


Can build new valid kernels from existing valid ones:


Bishop (2006), table on p. 296 gives many such rules
More Kernels (end of detour)

Stationary kernels are only a function of the difference
between arguments: k(x1 , x2 ) = k(x1 − x2 )


Translation invariant in input space:
k(x1 , x2 ) = k(x1 + c, x2 + c)


Homogeneous kernels, a.k.a. radial basis functions, are only a
function of the magnitude of the difference:
 k(x1, x2) = k(||x1 − x2||)

Kernels over sets: e.g. for subsets A1, A2 of a given set,
k(A1, A2) = 2^|A1 ∩ A2|, where |A| denotes the
number of elements in A

Domain-specific: think hard about your problem, figure out
what it means to be similar, define it as k(·, ·), prove positive
definite (Feynman algorithm)
Max. Margin


We can define the margin of a classifier as the minimum
distance to any example.

In support vector machines (SVM) the decision boundary which
maximizes the margin is chosen.
Marginal Geometry
Recall Bishop, Chapter 4
Support Vectors


Assuming the data are separated by the hyperplane, the distance of a point xn
(with target tn ∈ {−1, +1}) to the decision boundary is

tn y(xn) / ||w|| = tn (wᵀφ(xn) + b) / ||w||

The maximum margin criterion chooses w, b by:

arg max_{w,b} { (1/||w||) min_n [ tn (wᵀφ(xn) + b) ] }

Points with this min value are known as support vectors.
Canonical Representation

This optimization problem is complex to solve directly.

Note that rescaling w → κw, b → κb does not change the
distance (many equivalent answers).

So for the point closest to the surface, we rescale, and thus can set:

tn (wᵀφ(xn) + b) = 1

All other points are at least this far away:

tn (wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N

Under these constraints, the optimization becomes: maximize 1/||w||,
i.e. minimize (1/2)||w||².
Canonical Representation (II)

So the optimization problem is now a constrained
optimization problem:

min_{w,b} (1/2)||w||²   subject to   tn (wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N

To solve this, we need to take a detour into Lagrange
multipliers
Recall: Lagrange Multipliers

Consider the problem:  max_x f(x)  subject to  g(x) = 0


Points on g(x) = 0 must have ∇g(x) normal to surface

A stationary point must have no change in f in the direction
of the surface, so ∇f (x) must also be in this same direction

So there must be some λ such that ∇f (x) + λ∇g(x) = 0
Recall: Lagrange Multipliers (II)


Define the Lagrangian:   L(x, λ) = f(x) + λ g(x)

Stationary points of L(x, λ) have

∇x L(x, λ) = ∇f(x) + λ∇g(x) = 0
∂L/∂λ = g(x) = 0

So they are stationary points of the constrained problem!
Recall: Lagrange Multipliers
Bishop (2006) – Appendix E

Consider the problem:  max f(x1, x2) = 1 − x1² − x2²  subject to  g(x1, x2) = x1 + x2 − 1 = 0

Stationary points require:
∂L/∂x1 = −2x1 + λ = 0,   ∂L/∂x2 = −2x2 + λ = 0,   ∂L/∂λ = x1 + x2 − 1 = 0

So the stationary point is (x1*, x2*) = (1/2, 1/2), with λ = 1.
L. Multipliers - Inequality Constraints

Consider the problem:  max_x f(x)  subject to  g(x) ≥ 0

Optimization over a region – solutions are either at stationary
points (gradient 0) inside the region, where the constraint is inactive (λ = 0),
or on the boundary g(x) = 0, where it acts like an equality constraint (λ > 0)
L. Multipliers - Inequality Constraints

Consider the problem:  max_x f(x)  subject to  g(x) ≥ 0,
with Lagrangian  L(x, λ) = f(x) + λ g(x),  λ ≥ 0

Exactly how does the Lagrangian relate to the optimization
problem in this case?

It turns out that the solution to the optimization problem is:
max_x min_{λ≥0} L(x, λ)
Max-min

Lagrangian:  L(x, λ) = f(x) + λ g(x)

Consider the following:

min_{λ≥0} L(x, λ) =  f(x)   if g(x) ≥ 0
                     −∞     if g(x) < 0

Hence  max_x min_{λ≥0} L(x, λ)  solves the constrained problem.
Min-max (Dual form)

So the solution to the optimization problem is:

LP = max_x min_{λ≥0} L(x, λ)

which is called the primal problem.

The dual problem is when one switches the order of the max
and min:

LD = min_{λ≥0} max_x L(x, λ)

These are not the same, but it is always the case that the dual is
a bound for the primal (in the SVM case, where the primal is a
minimization, the dual gives a lower bound).

Slater’s theorem gives conditions for these two problems to
be equivalent, with LD(λ) = LP(x).

Slater’s theorem applies for the SVM optimization problem,
and solving the dual leads to kernelization and can be easier
than solving the primal.
Now what for SVM?

So the optimization problem is now a constrained
optimization problem:

min_{w,b} (1/2)||w||²   subject to   tn (wᵀφ(xn) + b) ≥ 1

 For this problem, the Lagrangian (with N multipliers an ≥ 0) is:

L(w, b, a) = (1/2)||w||² − Σ_{n=1}^N an { tn (wᵀφ(xn) + b) − 1 }

We can find the derivatives of L w.r.t. w, b and set them to 0:

w = Σ_{n=1}^N an tn φ(xn),    0 = Σ_{n=1}^N an tn
The dual formulation

Plugging those equations into L removes w and b and results in
a version of L in which ∇_{w,b} L = 0:

L̃(a) = Σ_n an − (1/2) Σ_n Σ_m an am tn tm k(xn, xm)
subject to  an ≥ 0  and  Σ_n an tn = 0

This new L̃ is the dual representation of the problem
(maximize with constraints)

Note that it is kernelized: the data enter only through k(xn, xm) = φ(xn)ᵀφ(xm)

It is a quadratic, convex optimization problem in a

Bounded above, since K is positive semi-definite

The optimal a can be found

With large datasets, descent strategies (e.g. SMO) are employed
From a to a Classifier

We found a by optimizing something else (the dual)

It is related to the classifier by

y(x) = Σ_n an tn k(x, xn) + b

Recall the complementarity condition for an from the Lagrangian:  an { tn y(xn) − 1 } = 0

 Either an = 0 or xn is a support vector

a will be sparse – many zeros
 Don’t need to store xn for which an = 0

Another formula for finding b:

b = (1/N_S) Σ_{n∈S} ( tn − Σ_{m∈S} am tm k(xn, xm) ),   S = set of support vectors
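In scikit-learn this structure is exposed directly: SVC stores the support vectors, the products an tn (as dual_coef_) and b (as intercept_). A minimal sketch, on made-up data, rebuilding y(x) = Σ an tn k(x, xn) + b from the support vectors only:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # hypothetical 2-class data

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# decision function rebuilt from the support vectors only
K = X @ clf.support_vectors_.T                          # linear kernel k(x, x_n) = x . x_n
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(X)))    # should print True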
Examples


SVM trained using Gaussian kernel (see the previous lecture)

Support vectors circled

Note non-linear decision boundary in x space
Example


From Burges, A Tutorial on Support Vector Machines for
Pattern Recognition (1998).

SVM trained using a cubic polynomial kernel.


Left panel: linearly separable.

Note decision boundary is almost linear, even using cubic
polynomial kernel

Right panel: not linearly separable.

But is separable using polynomial kernel
Non-Separable Data


For most problems, data will not be linearly separable (even
in feature space φ(x))

Can relax the constraints from  tn y(xn) ≥ 1  to  tn y(xn) ≥ 1 − ξn

 The ξn ≥ 0 are called slack variables

 ξn = 0: satisfies the original constraint, so xn is on the margin or the correct side
of the margin
 0 < ξn < 1: inside the margin, but still correctly classified
 ξn > 1: misclassified
Loss Function For Non-separable Data


Non-zero slack variables are bad; penalize them while maximizing
the margin:

min  C Σ_{n=1}^N ξn + (1/2)||w||²

Constant C > 0 controls the importance of a large margin versus
incorrect (non-zero slack) points

Set using cross-validation

Optimization is the same quadratic form, with different constraints, still convex
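A sketch of setting C by cross-validation with scikit-learn (the data here are synthetic placeholders, not the lecture’s dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# C trades off a large margin against the slack (misclassification) penalty
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(pipe, {'svc__C': [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)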
SVM in Python
demo/svm_class.py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
########################################################################
# load data
cars = pd.read_csv('auto-mpg.data.txt', header=None, sep=r'\s+')
# extract power and weight as data matrix X
X = cars.iloc[:, [3,4]].values
# extract origin as target vector y
y = [1 if o==1 else 0 for o in cars.iloc[:, 7].values]
#y = cars.iloc[:, 7].values
# Training data (80%), Test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)

# train SVM on the standardized features (must match the representation used for prediction)
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_standardized, y_train)
y_predicted = svm.predict(X_test_standardized)

# print confusion matrix


print("confusion matrix:\n", confusion_matrix(y_true=y_test, y_pred=y_predicted))

## Correctly classified
print("Correctly Classified:\n", accuracy_score(y_true=y_test, y_pred=y_predicted))
print("Precision:\n", precision_score(y_true=y_test, y_pred=y_predicted))
print("Score:\n", recall_score(y_true=y_test, y_pred=y_predicted))
print("F1:\n", f1_score(y_true=y_test, y_pred=y_predicted))
Action required!


Try 3 different kernels (https://scikit-learn.org). How does the
performance of SVM change?

Try different sizes of training and test data (50%/50%),
(70%/30%), (90%/10%) → How does the performance change?
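A possible starting point for these exercises (a sketch; it assumes X and y as loaded in demo/svm_class.py above):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for kernel in ['linear', 'rbf', 'poly']:
    for test_size in [0.5, 0.3, 0.1]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)
        scaler = StandardScaler().fit(X_tr)
        clf = SVC(kernel=kernel, C=1.0).fit(scaler.transform(X_tr), y_tr)
        acc = accuracy_score(y_te, clf.predict(scaler.transform(X_te)))
        print(kernel, test_size, acc)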
Summary on SVMs


Maximum margin criterion for deciding on decision boundary

Linearly separable data

Relax with slack variables for non-separable case

Global optimization is possible in both cases

Convex problem (no local optima)

Descent methods converge to global optimum

Kernelized
2. Gaussian Process Regression


Today

Basics of Gaussian Process Regression

Noise-free kernels
Gaussian Process Regression

http://www.gaussianprocess.org/gpml/
Recall: Aim of Regression

Given some (potentially) noisy observations of a dependent variable at certain values of the
independent variable x, what is our best estimate of the dependent variable y at a new value, x∗?


Let f denote an (unknown) function which maps inputs x to outputs

f:X → Y

Modeling a function f means mathematically representing the relation between inputs and outputs.

Often, the shape of the underlying function might be unknown, the function can be hard to
evaluate, or other requirements might complicate the process of information acquisition.
Choosing a model


If we expect the underlying function f(x) to be linear, and can make
some assumptions about the input data, we might use a least-squares
method to fit a straight line (linear regression).


Moreover, if we suspect f(x) may also be quadratic, cubic, or
even non-polynomial, we can use the principles of model selection to
choose among the various possibilities.
Model Selection
Example data set by https://archive.ics.uci.edu/ml/datasets/Auto+MPG

One common approach to reliably assess the quality of a machine learning
model and avoid over-fitting is to randomly split the available data into

training data (~70% of the data)
is used for determining optimal coefficients.

validation data (~20% of the data) is used for model selection (e.g., fixing
degree of polynomial, selecting a subset of features, etc.)

test data (~10% of the data) is used to measure the quality that is reported.
For completeness: Polynomial
Regression in Python

Degree of regression
1: linear
2: quadratic
...
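For reference, a minimal sketch of polynomial regression with scikit-learn (the data below are synthetic, not the auto-mpg data used in the lecture):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2 + 0.1 * rng.standard_normal(30)

for degree in (1, 2, 3):                                 # 1: linear, 2: quadratic, ...
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, model.score(x, y))                     # R^2 on the training data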
Why Gaussian Process Regression?


There are many projections possible.


We have to choose one either a priori or by model comparison with a
set of possible projections.

Especially if the problem is to explore and exploit a completely
unknown function, this approach will not be beneficial as there is little
guidance to which projections we should try.

Gaussian process regression offers a principled solution to this


problem in which projections are chosen implicitly, effectively letting
“the data decide” on the complexity of the function.
Recall Multivariate Gaussians
Say you measure two variables, e.g.,
 x1: height
 x2: weight

→ plot the points in the (x1, x2) plane
→ we want to fit a Gaussian to these points.
Recall Multivariate Gaussians (II)

Mean:  x̄ = (x̄1, x̄2)

We can fit a Gaussian with a
covariance that is circular (isotropic),

or fit a Gaussian with a
covariance that is an ellipse (full covariance).
Multivariate Gaussians (III)

Assume the points are Gaussian distributed (this is our “model”).

 How do the points relate to each other? (“How does increasing x1 affect x2?”)

→ The quantity that describes this is called the “covariance*” (cov)


If the entries in the column vector X = (X1, . . . , Xn)ᵀ
are random variables, each with finite variance and expected value, then the
covariance matrix KXX is the matrix whose (i, j) entry is the covariance cov(Xi, Xj).

*Covariance matrix: positive semi-definite (symmetric, eigenvalues non-negative); for GP regression it is typically required to be positive definite.


Multivariate Gaussian (IV)

 Assume for a moment that E[x] = 0 → the covariance E[x1x2] is a “dot” product.

 For two vectors, e.g.  [1 0]·[1 0]ᵀ = 1  → the covariance is a measure of similarity.

 With covariance matrix
   [ 1  0 ]
   [ 0  1 ]
 knowing about x1 does not provide any
 information about x2, as they are uncorrelated:

 E[x1x2] = 0
Multivariate Gaussians (V)

 Assume for a moment that E[x] = 0 → the covariance E[x1x2] is a “dot” product.

 With covariance matrix
   [ 1    0.6 ]
   [ 0.6  1   ]
 → knowing about x1 DOES provide information about x2.
 → if x1 is positive, x2 is positive with great probability.

 E[x1x2] ≠ 0 → knowing something about x1 allows us to
 know something about x2.
Joint Gaussian distributions
see, e.g., Rasmussen et al. (2005), Murphy (2012)

For a joint Gaussian  P(x1, x2)  with mean (μ1, μ2) and covariance blocks Σ11, Σ12, Σ21, Σ22,
the conditional distribution of x1 given x2 is again Gaussian, with

Mean:  μ1 + Σ12 Σ22⁻¹ (x2 − μ2)        Covariance:  Σ11 − Σ12 Σ22⁻¹ Σ21
From Joint to Conditional distributions
see, e.g., Murphy (2012), chapter 4.

Partition x into two “blocks” of vectors, x = (x1, x2), with the corresponding blocks of the mean and covariance.

This theorem (Gaussian conditioning) allows you to go from joint to conditional distributions.


Producing Data from Gaussians

As we have the capability of drawing
1-dim random numbers from a Gaussian
(e.g. by applying the inverse cumulative distribution
function to uniform random numbers),
we can also do this in the multivariate case.

→ We need a way to take “square roots”
of matrices.
→ Cholesky decomposition: K = L Lᵀ; then x = μ + L z, with z ~ N(0, I).

Recall Eq. (4.68) from the previous slide.
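A minimal numpy sketch of this recipe, using the correlated covariance from the earlier slide:

import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.6],
              [0.6, 1.0]])

L = np.linalg.cholesky(K)              # "square root" of K:  K = L L^T
z = rng.standard_normal((2, 10000))    # independent 1-d standard normal draws
samples = mu[:, None] + L @ z          # correlated 2-d Gaussian samples

print(np.cov(samples))                 # close to K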


Observations → Interpolation

We have 3 observations of f(xi) at inputs xi

→ Given the data pairs

D = { (x1, f1), (x2, f2), (x3, f3) }

→ we want to find/learn the function
that describes the data, i.e.,
for a “new” x*, we want to
know what f(x*) would be!

(Figure: three observed points f1, f2, f3 and the unknown value at x*.)
Observations → Interpolation (II)
We assume that the f's (the heights) are Gaussian distributed,
with zero mean and some covariance matrix K.

Note: f1 and f2 should probably be more correlated,
as they are nearby (compared to f1 and f3).

→ The prior mean function reflects the expected
function value at input x:  m(x) = E[f(x)]

→ It is often set to 0.
Observations → Interpolation (II)
We assume that the f's (the heights) are Gaussian distributed,
with zero mean and some covariance matrix K.

Note: f1 and f2 should probably be more correlated,
as they are nearby (compared to f1 and f3).

The covariance matrix is constructed by
some “measure of similarity”, i.e., a kernel function
(parametric ansatz), such as the “squared exponential”:

k(x, x') = σf² exp( −(x − x')² / (2 l²) )

Parameters can be obtained e.g. via MLE (later).
σf – controls the vertical variation.
l – controls the horizontal length scale.
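A short sketch of building such a covariance matrix (the input locations are placeholders):

import numpy as np

def sqexp(x1, x2, sigma_f=1.0, length=1.0):
    # squared exponential: k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2))
    return sigma_f ** 2 * np.exp(-(x1 - x2) ** 2 / (2.0 * length ** 2))

x = np.array([1.0, 2.0, 4.0])                 # observed inputs x1, x2, x3
K = sqexp(x[:, None], x[None, :])             # 3x3 covariance matrix
print(K)                                      # nearby inputs get larger covariances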


Observations → Interpolation (III)
Given data D = { (x1, f1), (x2, f2), (x3, f3) }  →  f(x*) = f* ?

→ Assume f ~ N( 0, K(·,·) ), with the 3-d covariance K built from the training data
→ Assume f(x*) ~ N( 0, K(x*, x*) )

→ Joint distribution over f and f*.
→ We need the conditional of f* given f.
→ In this example, we “cut” the joint Gaussian along the 3 observed dimensions.
→ What is left is a 1-dimensional Gaussian,
i.e., the Gaussian for f*.
Interpolation → Noiseless GPR
(see, e.g., Rasmussen & Williams (2006), with references therein)

Test point = interpolation at X*:

f̄* = K(X*, X) K(X, X)⁻¹ f                                   → predictive mean
cov(f*) = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*)            → confidence intervals!

Where we have data, we have high confidence in our predictions.
Where we do not have data, we cannot be too confident about our predictions.
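A minimal numpy sketch of these two formulas (three made-up observations; the direct matrix inverse is used for readability, the numerically stable Cholesky version appears later):

import numpy as np

def sqexp(a, b, sigma_f=1.0, length=1.0):
    return sigma_f ** 2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length ** 2))

X = np.array([1.0, 2.0, 4.0])        # training inputs (hypothetical)
f = np.array([0.5, 1.0, -0.5])       # training observations (hypothetical)
Xs = np.linspace(0.0, 5.0, 101)      # test inputs

K = sqexp(X, X) + 1e-10 * np.eye(len(X))      # small jitter for numerical stability
Ks = sqexp(Xs, X)
Kss = sqexp(Xs, Xs)

K_inv = np.linalg.inv(K)
mean = Ks @ K_inv @ f                         # predictive mean
cov = Kss - Ks @ K_inv @ Ks.T                 # predictive covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
# 95% confidence band: mean +/- 1.96 * std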
Algorithmic Complexity!


The central computational operation in using Gaussian processes
will involve the inversion of a matrix of size N × N,
for which standard methods require O(N³) computations.

The matrix inversion must be performed once for the given
training set.

→ For large training data sets, however, the direct application of


Gaussian process methods can become infeasible.

→ Sparse Methods (cf. Rasmussen et al. (2006) and references therein).


A note on the predictive mean
Note that the predictive mean can (in general) also be written as

f̄(x*) = Σ_{n=1}^N αn k(xn, x*),

where α = K(X, X)⁻¹ t
and t being the N observations.

→ We can think of the GP posterior mean as an approximation of f(·)
using N symmetric basis functions centered at each observed input.

→ By choosing a covariance function that vanishes when x and x’ are
far apart, for example the squared exponential covariance
function, we see that an observed input–output pair will only affect the
approximation locally.
GP – a distribution over functions
demo/1d_gp_example.ipynb

Procedure
 Create the input vector X1:N
 Build the covariance K = k(X, X) and factorize K = L Lᵀ
 Draw f ~ N(0, K), i.e. f = L z with z ~ N(0, I)
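The same procedure in a few lines of numpy (a sketch; the grid and kernel parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

X = np.linspace(0.0, 5.0, 100)                              # input vector X_{1:N}
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 2.0)           # squared-exponential kernel
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X)))          # K = L L^T (with jitter)
f_samples = L @ rng.standard_normal((len(X), 5))            # five draws f ~ N(0, K)
# each column of f_samples is one random function drawn from the GP prior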
GPR Example code & numerical stability
Look at demo/1d_gp_example.ipynb and Murphy, Chapter 15

The algorithm given there (Murphy, Ch. 15) is numerically more stable: it uses a Cholesky factorization instead of a direct matrix inverse.


Prediction & Confidence Intervals
Note: if you don’t fix the seed, these pictures vary every time you run the code.

(Figure: 95% confidence intervals, unknown true function, and observations;
panels with 10 training points and 50 training points.)

We see that the model perfectly interpolates the training data, and that the predictive
uncertainty increases as we move further away from the observed data.
Non-hypercubic domain: GPR versus ASG

Figure: The upper left panel shows the analytical test function evaluated at random test points on the simplex. The
upper right panel displays a comparison of the interpolation error for GPs, sparse grids, and adaptive sparse grids of
varying resolution and constructed under the assumption that a continuation value exists (denoted by “cont”), or that
there is no continuation value. The lower left panel displays a sparse grid consisting of 32,769 points. The lower middle
panel shows an adaptive sparse grid (cont) that consists of 1,563 points, whereas the lower right panel shows an
adaptive sparse grid, constructed with 3,524 points and under the assumption that the function outside ∆ is not known.
GPR in scikit-learn.org
https://scikit-learn.org/stable/modules/gaussian_process.html
https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py

Look at demo/1d_GPR.ipynb
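A minimal usage sketch of scikit-learn’s GaussianProcessRegressor (the training data here are made up; the demo notebook uses its own example):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])   # hypothetical 1-d training inputs
y = np.sin(X).ravel()

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-10)   # alpha ~ noise level (here: essentially noise-free)
gpr.fit(X, y)                                                # hyperparameters fitted by maximizing the marginal likelihood

X_test = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)             # predictive mean and standard deviation
print(gpr.kernel_)                                           # optimized kernel parameters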
More GP packages (besides scikit-learn.org)
Illustration of noiseless GPR prediction (II)


We use a squared exponential kernel, aka Gaussian kernel or RBF kernel.


In 1d, this is given by  k(x, x') = σf² exp( −(x − x')² / (2 l²) )


Here l controls the horizontal length scale over which the function varies,
and σf controls the vertical variation.

 We usually show predictions from the posterior, p(f∗ |X∗ , X, f).


We see (NEXT SLIDE) that the model perfectly interpolates the training
data, and that the predictive uncertainty increases as we move further
away from the observed data.
The parameters in the Kernel
cf. demo/1d_gp_example.ipynb
(Figure panels: l² = 1.0, l² = 0.01, l² = 10.0)

Let  k(x, x') = σf² exp( −(x − x')² / (2 l²) )  and vary the length scale l².

→ Tuning the parameters by hand
is not a good idea in general
(in particular in high-dimensional settings).
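One principled alternative is to compare (or maximize) the log marginal likelihood. A sketch with scikit-learn on made-up data, holding the length scale fixed at each of the three values from the figure:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0.0, 5.0, 20).reshape(-1, 1)    # hypothetical data
y = np.sin(3.0 * X).ravel()

for l2 in [0.01, 1.0, 10.0]:
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=np.sqrt(l2)),
                                   optimizer=None)            # keep the length scale fixed
    gpr.fit(X, y)
    print(l2, gpr.log_marginal_likelihood_value_)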
