
Advanced Data Analytics

Lecture 6

Simon Scheidegger
Today’s Roadmap

1. Support Vector Machines (SVM)

2. Basics on Gaussian Process Regression


1. Support Vector Machines (SVMs)

We saw that linear classification was rather limited in that
it could only identify straight line classifiers.

It could only separate out groups of data if it was possible to draw a straight line
(hyperplane in higher dimensions) between them.

This meant that it could not learn to distinguish between the two true classes of the
2D XOR function.

However, we saw that it was possible to modify the problem so that the linear
classifier could solve the problem, by changing the data so that it used more
dimensions than the original data.
SVM – the idea


The problem is how to work out which dimensions to use,
→ that is what kernel methods do – the class of algorithms that
we will talk about in this lecture.

SVM provides very impressive classification performance on
reasonably sized datasets.

SVMs do not work well on extremely large datasets,
→ the computations don’t scale well with the number of training
examples,
→ become computationally very expensive.
More on SVM

There is rather more to the SVM than the kernel method; the algorithm
also reformulates the classification problem in such a way that we can
tell a good classifier from a bad one, even if they both give the same
results on a particular dataset.

It is this distinction that enables the SVM algorithm to be derived, so
that is where we will start.

Three different classification lines. Is there any reason why one is better than
the others?
Fig.: from Marsland (2014)
Basic Idea of SVM

We can measure the distance that we have to travel away from
the line (in a direction perpendicular to the line) before we hit a data
point.


How large could we make the radius of this cylinder until we started to
put points into a no-man’s land, where we don’t know which
class they are from?

This largest radius is known as the margin,
labelled M.

The data points in each class that lie closest
to the classification line have a name as well.

They are called support vectors.
Basic Idea of SVM


Using the argument that the best classifier is the one that goes through
the middle of no-man’s land, we can now make two arguments:


first that the margin should be as large as possible

second that the support vectors are the most useful data points
because they are the ones that we might get wrong.

This leads to an interesting feature of these algorithms:
→ after training, we can throw away all of the data except for the
support vectors, and use them for classification, which is a useful
saving in data storage.
Basic Idea of SVM

By modifying the features we hope to find spaces where the data are linearly
separable.
SVMs more formally


Support Vector Machines: a discriminative method for classification

Independent features and data points need to be “metric”.

Idea: certain hyperplanes (cf. linear regression) separate the data
as well as possible.

In an ideal world, the data lie on the two sides of the hyperplane.
SVMs
Bishop, Chapter 7; Murphy, Chapter 14

Fig. from Zaki & Meira – Data Mining and Analysis


Recall: Linear Classification

Consider a two-class classification problem

Use a linear model  y(x) = wᵀx + b

followed by a threshold function (e.g. classify by the sign of y(x))



For now, let’s assume training data are linearly separable

Recall that the perceptron would converge to a perfect classifier
for such data

But there are many such perfect classifiers
Detour – Kernels
See Bishop (2006), Chapter 6; Murphy, Chapter 14
Recall Generalized Linear Model
Non-linear Mappings

In the lectures on linear models for regression and
classification, we looked at models with fixed basis functions φ(x), e.g. y(x) = wᵀφ(x)

The feature space could be high-dimensional

This was good because if data aren’t separable in original
input space (x), they may be in feature space

We’d like to avoid computing the high-dimensional φ(x) explicitly

We’d also like to work with inputs x which don’t have a natural
vector-space representation

e.g. graphs, sets, strings

N items are always (linearly!) separable in N dimensions
Kernel Trick

In previous lectures on linear models, we would explicitly
compute φ(xn) for each data point.

Run algorithm in feature space.

For some feature spaces, we can compute the dot product φ(xi)ᵀφ(xj)
efficiently.

The efficient method is the computation of a kernel function  k(xi, xj) = φ(xi)ᵀφ(xj)


The kernel trick is to rewrite an algorithm to only have x
enter in the form of dot products

The way to move forward from here:

Kernel trick examples

Kernel functions
A Kernel Trick

Let’s look at the nearest-neighbor classification algorithm

 For input point xi, find the point xj with the smallest distance:

||xi − xj||² = xiᵀxi − 2 xiᵀxj + xjᵀxj

If we used a non-linear feature space φ(·):

||φ(xi) − φ(xj)||² = φ(xi)ᵀφ(xi) − 2 φ(xi)ᵀφ(xj) + φ(xj)ᵀφ(xj)
                   = k(xi, xi) − 2 k(xi, xj) + k(xj, xj)

So nearest-neighbor can be done in a high-dimensional
feature space without actually moving to it.
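As a small illustration (a sketch, not the lecture’s code), the nearest neighbor can be found from kernel evaluations alone via ||φ(xi) − φ(xj)||² = k(xi, xi) − 2 k(xi, xj) + k(xj, xj); the kernel and the data below are made up for the example:

import numpy as np

def k(a, b):
    # example kernel: k(a, b) = (a . b)^2, a dot product in some feature space
    return float(a @ b) ** 2

def nearest_neighbor_kernel(x_new, X):
    # squared feature-space distance from kernel values only:
    # ||phi(x_new) - phi(x_j)||^2 = k(x_new, x_new) - 2 k(x_new, x_j) + k(x_j, x_j)
    d2 = [k(x_new, x_new) - 2.0 * k(x_new, xj) + k(xj, xj) for xj in X]
    return int(np.argmin(d2))

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # hypothetical training points
print(nearest_neighbor_kernel(np.array([1.5, 1.5]), X))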
A Kernel Function

Consider the kernel function  k(x, z) = (xᵀz)²  for x, z ∈ R².

Expanding gives
(x₁z₁ + x₂z₂)² = x₁²z₁² + 2 x₁z₁x₂z₂ + x₂²z₂² = φ(x)ᵀφ(z),
with  φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ.

So this particular kernel function does correspond to a dot
product in a feature space (it is valid).

Computing k(x, z) is faster than explicitly computing φ(x)ᵀφ(z).

In higher dimensions, with a larger exponent, it is much faster.
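A quick numerical check of this identity (a minimal sketch; the vectors are arbitrary):

import numpy as np

def phi(x):
    # explicit feature map for the 2-d example: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(x, z):
    # the kernel computed directly in input space
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(phi(x) @ phi(z), k(x, z))   # both evaluate to 121.0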
Why Kernels?


Why bother with kernels?

Often easier to specify how similar two things are (dot product)
than to construct an explicit feature space φ(x).

There are high-dimensional (even infinite) spaces that have
efficient-to-compute kernels.

So you want to use kernels

Need to know when kernel function is valid, so we can apply the
kernel trick.
Valid Kernels
 Given some arbitrary function k(xi , xj ), how do we know if it
corresponds to a dot product in some space?

Valid kernels: if k(·, ·) satisfies:
 Symmetric: k(xi, xj) = k(xj, xi)
 Positive definite: for any x₁, . . . , xN, the Gram matrix K
(with entries Knm = k(xn, xm)) must be positive semi-definite:

xᵀK x ≥ 0 for all x

then k(·, ·) corresponds to a dot product in some space

a.k.a. Mercer kernel, admissible kernel, reproducing kernel
Examples of some Kernels
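As an illustration (a sketch of typical choices, not an exhaustive list), a few commonly used kernels written as plain Python functions:

import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # a.k.a. RBF / squared-exponential kernel
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))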
Constructing Kernels


Can build new valid kernels from existing valid ones:


Bishop (2006), table on p. 296 gives many such rules
More Kernels (end of detour)

Stationary kernels are only a function of the difference
between arguments: k(x1 , x2 ) = k(x1 − x2 )


Translation invariant in input space:
k(x1 , x2 ) = k(x1 + c, x2 + c)


Homogeneous kernels, a.k.a. radial basis functions, are only a
function of the magnitude of the difference:
 k(x1, x2) = k(||x1 − x2||)

Kernels over sets: e.g. for subsets A1, A2 of a given set,
k(A1, A2) = 2^|A1 ∩ A2|, where |A| denotes the
number of elements in A

Domain-specific: think hard about your problem, figure out
what it means to be similar, define it as k(·, ·), prove positive
definite (Feynman algorithm)
Max. Margin


We can define the margin of a classifier as the minimum
distance to any example.

In support vector machines (SVM) the decision boundary which
maximizes the margin is chosen.
Marginal Geometry
Recall Bishop, Chapter 4
Support Vectors


Assuming the data are separated by the hyperplane, the distance of a point xn
(with target tn ∈ {−1, +1}) to the decision boundary is

tn y(xn) / ||w|| = tn (wᵀφ(xn) + b) / ||w||

The maximum margin criterion chooses w, b by:

arg max_{w,b} { (1/||w||) min_n [ tn (wᵀφ(xn) + b) ] }

Points with this min value are known as support vectors.
Canonical Representation

This optimization problem is complex to solve directly.

Note that rescaling w → κw, b → κb does not change the
distance (many equivalent answers).

So for the point closest to the surface, we rescale, and thus can set:

tn (wᵀφ(xn) + b) = 1

All other points are at least this far away:

tn (wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N

Under these constraints, the optimization becomes: maximize 1/||w||,
i.e. minimize (1/2)||w||².
Canonical Representation (II)

So the optimization problem is now a constrained
optimization problem:

min_{w,b} (1/2)||w||²   subject to   tn (wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N

To solve this, we need to take a detour into Lagrange
multipliers
Recall: Lagrange Multipliers

Consider the problem:  max_x f(x)  subject to  g(x) = 0


Points on g(x) = 0 must have ∇g(x) normal to surface

A stationary point must have no change in f in the direction
of the surface, so ∇f (x) must also be in this same direction

So there must be some λ such that ∇f (x) + λ∇g(x) = 0
Recall: Lagrange Multipliers (II)


Define the Lagrangian:   L(x, λ) = f(x) + λ g(x)

Stationary points of L(x, λ) have

∇x L(x, λ) = ∇f(x) + λ∇g(x) = 0
∂L/∂λ = g(x) = 0

So they are stationary points of the constrained problem!
Recall: Lagrange Multipliers
Bishop (2006) – Appendix E

Consider the problem:  max f(x1, x2) = 1 − x1² − x2²  subject to  g(x1, x2) = x1 + x2 − 1 = 0

Stationary points require:
∂L/∂x1 = −2x1 + λ = 0,   ∂L/∂x2 = −2x2 + λ = 0,   ∂L/∂λ = x1 + x2 − 1 = 0

So the stationary point is (x1*, x2*) = (1/2, 1/2), with λ = 1.
L. Multipliers - Inequality Constraints

Consider the problem:  max_x f(x)  subject to  g(x) ≥ 0

Optimization over a region – solutions are either at stationary
points (gradient 0) inside the region, where the constraint is inactive (λ = 0),
or on the boundary g(x) = 0, where it acts like an equality constraint (λ > 0)
L. Multipliers - Inequality Constraints

Consider the problem:  max_x f(x)  subject to  g(x) ≥ 0,
with Lagrangian  L(x, λ) = f(x) + λ g(x),  λ ≥ 0

Exactly how does the Lagrangian relate to the optimization
problem in this case?

It turns out that the solution to the optimization problem is:
max_x min_{λ≥0} L(x, λ)
Max-min

Lagrangian:  L(x, λ) = f(x) + λ g(x)

Consider the following:

min_{λ≥0} L(x, λ) =  f(x)   if g(x) ≥ 0
                     −∞     if g(x) < 0

Hence  max_x min_{λ≥0} L(x, λ)  solves the constrained problem.
Min-max (Dual form)

So the solution to the optimization problem is:

LP = max_x min_{λ≥0} L(x, λ)

which is called the primal problem.

The dual problem is when one switches the order of the max
and min:

LD = min_{λ≥0} max_x L(x, λ)

These are not the same, but it is always the case that the dual is
a bound for the primal (in the SVM case, where the primal is a
minimization, the dual gives a lower bound).

Slater’s theorem gives conditions for these two problems to
be equivalent, with LD(λ) = LP(x).

Slater’s theorem applies for the SVM optimization problem,
and solving the dual leads to kernelization and can be easier
than solving the primal.
Now what for SVM?

So the optimization problem is now a constrained
optimization problem:

min_{w,b} (1/2)||w||²   subject to   tn (wᵀφ(xn) + b) ≥ 1

 For this problem, the Lagrangian (with N multipliers an ≥ 0) is:

L(w, b, a) = (1/2)||w||² − Σ_{n=1}^N an { tn (wᵀφ(xn) + b) − 1 }

We can find the derivatives of L w.r.t. w, b and set them to 0:

w = Σ_{n=1}^N an tn φ(xn),    0 = Σ_{n=1}^N an tn
The dual formulation

Plugging those equations into L removes w and b and results in
a version of L in which ∇_{w,b} L = 0:

L̃(a) = Σ_n an − (1/2) Σ_n Σ_m an am tn tm k(xn, xm)
subject to  an ≥ 0  and  Σ_n an tn = 0

This new L̃ is the dual representation of the problem
(maximize with constraints)

Note that it is kernelized: the data enter only through k(xn, xm) = φ(xn)ᵀφ(xm)

It is a quadratic, convex optimization problem in a

Bounded above, since K is positive semi-definite

The optimal a can be found

With large datasets, descent strategies (e.g. SMO) are employed
From a to a Classifier

We found a by optimizing something else (the dual)

It is related to the classifier by

y(x) = Σ_n an tn k(x, xn) + b

Recall the complementarity condition for an from the Lagrangian:  an { tn y(xn) − 1 } = 0

 Either an = 0 or xn is a support vector

a will be sparse – many zeros
 Don’t need to store xn for which an = 0

Another formula for finding b:

b = (1/N_S) Σ_{n∈S} ( tn − Σ_{m∈S} am tm k(xn, xm) ),   S = set of support vectors
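In scikit-learn this structure is exposed directly: SVC stores the support vectors, the products an tn (as dual_coef_) and b (as intercept_). A minimal sketch, on made-up data, rebuilding y(x) = Σ an tn k(x, xn) + b from the support vectors only:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # hypothetical 2-class data

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# decision function rebuilt from the support vectors only
K = X @ clf.support_vectors_.T                          # linear kernel k(x, x_n) = x . x_n
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(X)))    # should print True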
Examples


SVM trained using Gaussian kernel (see the previous lecture)

Support vectors circled

Note non-linear decision boundary in x space
Example


From Burges, A Tutorial on Support Vector Machines for
Pattern Recognition (1998).

SVM trained using a cubic polynomial kernel.


Left panel: linearly separable.

Note decision boundary is almost linear, even using cubic
polynomial kernel

Right panel: not linearly separable.

But is separable using polynomial kernel
Non-Separable Data


For most problems, data will not be linearly separable (even
in feature space φ(x))

Can relax the constraints from  tn y(xn) ≥ 1  to  tn y(xn) ≥ 1 − ξn

 The ξn ≥ 0 are called slack variables

 ξn = 0: satisfies the original constraint, so xn is on the margin or the correct side
of the margin
 0 < ξn < 1: inside the margin, but still correctly classified
 ξn > 1: misclassified
Loss Function For Non-separable Data


Non-zero slack variables are bad; penalize them while maximizing
the margin:

min  C Σ_{n=1}^N ξn + (1/2)||w||²

Constant C > 0 controls the importance of a large margin versus
incorrect (non-zero slack) points

Set using cross-validation

Optimization is the same quadratic form, with different constraints, still convex
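A sketch of setting C by cross-validation with scikit-learn (the data here are synthetic placeholders, not the lecture’s dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# C trades off a large margin against the slack (misclassification) penalty
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(pipe, {'svc__C': [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)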
SVM in Python
demo/svm_class.py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
########################################################################
# load data
cars = pd.read_csv('auto-mpg.data.txt', header=None, sep=r'\s+')
# extract power and weight as data matrix X
X = cars.iloc[:, [3,4]].values
# extract origin as target vector y
y = [1 if o==1 else 0 for o in cars.iloc[:, 7].values]
#y = cars.iloc[:, 7].values
# Training data (80%), Test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)

# train SVM on the standardized features (must match the representation used for prediction)
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_standardized, y_train)
y_predicted = svm.predict(X_test_standardized)

# print confusion matrix


print("confusion matrix:\n", confusion_matrix(y_true=y_test, y_pred=y_predicted))

## Correctly classified
print("Correctly Classified:\n", accuracy_score(y_true=y_test, y_pred=y_predicted))
print("Precision:\n", precision_score(y_true=y_test, y_pred=y_predicted))
print("Score:\n", recall_score(y_true=y_test, y_pred=y_predicted))
print("F1:\n", f1_score(y_true=y_test, y_pred=y_predicted))
Action required!


Try 3 different kernels (https://scikit-learn.org). How does the
performance of SVM change?

Try different sizes of training and test data (50%/50%),
(70%/30%), (90%/10%) → How does the performance change?
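A possible starting point for these exercises (a sketch; it assumes X and y as loaded in demo/svm_class.py above):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for kernel in ['linear', 'rbf', 'poly']:
    for test_size in [0.5, 0.3, 0.1]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)
        scaler = StandardScaler().fit(X_tr)
        clf = SVC(kernel=kernel, C=1.0).fit(scaler.transform(X_tr), y_tr)
        acc = accuracy_score(y_te, clf.predict(scaler.transform(X_te)))
        print(kernel, test_size, acc)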
Summary on SVMs


Maximum margin criterion for deciding on decision boundary

Linearly separable data

Relax with slack variables for non-separable case

Global optimization is possible in both cases

Convex problem (no local optima)

Descent methods converge to global optimum

Kernelized
2. Gaussian Process Regression


Today

Basics of Gaussian Process Regression

Noise-free kernels
Gaussian Process Regression

http://www.gaussianprocess.org/gpml/
Recall: Aim of Regression

Given some (potentially) noisy observations of a dependent variable at certain values of the
independent variable x, what is our best estimate of the dependent variable y at a new value, x∗?


Let f denote an (unknown) function which maps inputs x to outputs

f:X → Y

Modeling a function f means mathematically representing the relation between inputs and outputs.

Often, the shape of the underlying function might be unknown, the function can be hard to
evaluate, or other requirements might complicate the process of information acquisition.
Choosing a model


If we expect the underlying function f(x) to be linear, and can make
some assumptions about the input data, we might use a least-squares
method to fit a straight line (linear regression).


Moreover, if we suspect f(x) may also be quadratic, cubic, or
even non-polynomial, we can use the principles of model selection to
choose among the various possibilities.
Model Selection
Example data set by https://archive.ics.uci.edu/ml/datasets/Auto+MPG

One common approach to reliably assess the quality of a machine learning
model and avoid over-fitting is to randomly split the available data into

training data (~70% of the data)
is used for determining optimal coefficients.

validation data (~20% of the data) is used for model selection (e.g., fixing
degree of polynomial, selecting a subset of features, etc.)

test data (~10% of the data) is used to measure the quality that is reported.
For completeness: Polynomial
Regression in Python

Degree of regression
1: linear
2: quadratic
...
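For reference, a minimal sketch of polynomial regression with scikit-learn (the data below are synthetic, not the auto-mpg data used in the lecture):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 30).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 0.5 * x.ravel() ** 2 + 0.1 * rng.standard_normal(30)

for degree in (1, 2, 3):                                 # 1: linear, 2: quadratic, ...
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, model.score(x, y))                     # R^2 on the training data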
Why Gaussian Process Regression?


There are many projections possible.


We have to choose one either a priori or by model comparison with a
set of possible projections.

Especially if the problem is to explore and exploit a completely
unknown function, this approach will not be beneficial as there is little
guidance to which projections we should try.

Gaussian process regression offers a principled solution to this


problem in which projections are chosen implicitly, effectively letting
“the data decide” on the complexity of the function.
Recall Multivariate Gaussians
Say you measure two variables, e.g.,
 x1: height
 x2: weight

→ plot the points in the (x1, x2) plane
→ we want to fit a Gaussian to these points.
Recall Multivariate Gaussians (II)

Mean:  x̄ = (x̄1, x̄2)

We can fit a Gaussian with a
covariance that is circular (isotropic),

or fit a Gaussian with a
covariance that is an ellipse (full covariance).
Multivariate Gaussians (III)

Assume the points are Gaussian distributed (this is our “model”).

 How do the points relate to each other? (“How does increasing x1 affect x2?”)

→ The quantity that describes this is called the “covariance*” (cov)


If the entries in the column vector X = (X1, . . . , Xn)ᵀ
are random variables, each with finite variance and expected value, then the
covariance matrix KXX is the matrix whose (i, j) entry is the covariance cov(Xi, Xj).

*Covariance matrix: positive semi-definite (symmetric, eigenvalues non-negative); for GP regression it is typically required to be positive definite.


Multivariate Gaussian (IV)

 Assume for a moment that E[x] = 0 → the covariance E[x1x2] is a “dot” product.

 For two vectors, e.g.  [1 0]·[1 0]ᵀ = 1  → the covariance is a measure of similarity.

 With covariance matrix
   [ 1  0 ]
   [ 0  1 ]
 knowing about x1 does not provide any
 information about x2, as they are uncorrelated:

 E[x1x2] = 0
Multivariate Gaussians (V)

 Assume for a moment that E[x] = 0 → the covariance E[x1x2] is a “dot” product.

 With covariance matrix
   [ 1    0.6 ]
   [ 0.6  1   ]
 → knowing about x1 DOES provide information about x2.
 → if x1 is positive, x2 is positive with great probability.

 E[x1x2] ≠ 0 → knowing something about x1 allows us to
 know something about x2.
Joint Gaussian distributions
see, e.g., Rasmussen et al. (2005), Murphy (2012)

For a joint Gaussian  P(x1, x2)  with mean (μ1, μ2) and covariance blocks Σ11, Σ12, Σ21, Σ22,
the conditional distribution of x1 given x2 is again Gaussian, with

Mean:  μ1 + Σ12 Σ22⁻¹ (x2 − μ2)        Covariance:  Σ11 − Σ12 Σ22⁻¹ Σ21
From Joint to Conditional distributions
see, e.g., Murphy (2012), chapter 4.

Partition x into two “blocks” of vectors, x = (x1, x2), with the corresponding blocks of the mean and covariance.

This theorem (Gaussian conditioning) allows you to go from joint to conditional distributions.


Producing Data from Gaussians

As we have the capability of drawing
1-dim random numbers from a Gaussian
(e.g. by applying the inverse cumulative distribution
function to uniform random numbers),
we can also do this in the multivariate case.

→ We need a way to take “square roots”
of matrices.
→ Cholesky decomposition: K = L Lᵀ; then x = μ + L z, with z ~ N(0, I).

Recall Eq. (4.68) from the previous slide.
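A minimal numpy sketch of this recipe, using the correlated covariance from the earlier slide:

import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.6],
              [0.6, 1.0]])

L = np.linalg.cholesky(K)              # "square root" of K:  K = L L^T
z = rng.standard_normal((2, 10000))    # independent 1-d standard normal draws
samples = mu[:, None] + L @ z          # correlated 2-d Gaussian samples

print(np.cov(samples))                 # close to K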


Observations → Interpolation

We have 3 observations of f(xi) at inputs xi

→ Given the data pairs

D = { (x1, f1), (x2, f2), (x3, f3) }

→ we want to find/learn the function
that describes the data, i.e.,
for a “new” x*, we want to
know what f(x*) would be!

(Figure: three observed points f1, f2, f3 and the unknown value at x*.)
Observations → Interpolation (II)
We assume that the f's (the heights) are Gaussian distributed,
with zero mean and some covariance matrix K.

Note: f1 and f2 should probably be more correlated,
as they are nearby (compared to f1 and f3).

→ The prior mean function reflects the expected
function value at input x:  m(x) = E[f(x)]

→ It is often set to 0.
Observations → Interpolation (II)
We assume that the f's (the heights) are Gaussian distributed,
with zero mean and some covariance matrix K.

Note: f1 and f2 should probably be more correlated,
as they are nearby (compared to f1 and f3).

The covariance matrix is constructed by
some “measure of similarity”, i.e., a kernel function
(parametric ansatz), such as the “squared exponential”:

k(x, x') = σf² exp( −(x − x')² / (2 l²) )

Parameters can be obtained e.g. via MLE (later).
σf – controls the vertical variation.
l – controls the horizontal length scale.
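A short sketch of building such a covariance matrix (the input locations are placeholders):

import numpy as np

def sqexp(x1, x2, sigma_f=1.0, length=1.0):
    # squared exponential: k(x, x') = sigma_f^2 exp(-(x - x')^2 / (2 l^2))
    return sigma_f ** 2 * np.exp(-(x1 - x2) ** 2 / (2.0 * length ** 2))

x = np.array([1.0, 2.0, 4.0])                 # observed inputs x1, x2, x3
K = sqexp(x[:, None], x[None, :])             # 3x3 covariance matrix
print(K)                                      # nearby inputs get larger covariances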


Observations → Interpolation (III)
Given data D = { (x1, f1), (x2, f2), (x3, f3) }  →  f(x*) = f* ?

→ Assume f ~ N( 0, K(·,·) ), with the 3-d covariance K built from the training data
→ Assume f(x*) ~ N( 0, K(x*, x*) )

→ Joint distribution over f and f*.
→ We need the conditional of f* given f.
→ In this example, we “cut” the joint Gaussian along the 3 observed dimensions.
→ What is left is a 1-dimensional Gaussian,
i.e., the Gaussian for f*.
Interpolation → Noiseless GPR
(see, e.g., Rasmussen & Williams (2006), with references therein)

Test point = interpolation at X*:

f̄* = K(X*, X) K(X, X)⁻¹ f                                   → predictive mean
cov(f*) = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*)            → confidence intervals!

Where we have data, we have high confidence in our predictions.
Where we do not have data, we cannot be too confident about our predictions.
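A minimal numpy sketch of these two formulas (three made-up observations; the direct matrix inverse is used for readability, the numerically stable Cholesky version appears later):

import numpy as np

def sqexp(a, b, sigma_f=1.0, length=1.0):
    return sigma_f ** 2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * length ** 2))

X = np.array([1.0, 2.0, 4.0])        # training inputs (hypothetical)
f = np.array([0.5, 1.0, -0.5])       # training observations (hypothetical)
Xs = np.linspace(0.0, 5.0, 101)      # test inputs

K = sqexp(X, X) + 1e-10 * np.eye(len(X))      # small jitter for numerical stability
Ks = sqexp(Xs, X)
Kss = sqexp(Xs, Xs)

K_inv = np.linalg.inv(K)
mean = Ks @ K_inv @ f                         # predictive mean
cov = Kss - Ks @ K_inv @ Ks.T                 # predictive covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
# 95% confidence band: mean +/- 1.96 * std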
Algorithmic Complexity!


The central computational operation in using Gaussian processes
will involve the inversion of a matrix of size N × N,
for which standard methods require O(N³) computations.

The matrix inversion must be performed once for the given
training set.

→ For large training data sets, however, the direct application of


Gaussian process methods can become infeasible.

→ Sparse Methods (cf. Rasmussen et al. (2006) and references therein).


A note on the predictive mean
Note that the predictive mean can (in general) also be written as

f̄(x*) = Σ_{n=1}^N αn k(xn, x*),

where α = K(X, X)⁻¹ t
and t being the N observations.

→ We can think of the GP posterior mean as an approximation of f(·)
using N symmetric basis functions centered at each observed input.

→ By choosing a covariance function that vanishes when x and x’ are
far apart, for example the squared exponential covariance
function, we see that an observed input–output pair will only affect the
approximation locally.
GP – a distribution over functions
demo/1d_gp_example.ipynb

Procedure
 Create the input vector X1:N
 Build the covariance K = k(X, X) and factorize K = L Lᵀ
 Draw f ~ N(0, K), i.e. f = L z with z ~ N(0, I)
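The same procedure in a few lines of numpy (a sketch; the grid and kernel parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

X = np.linspace(0.0, 5.0, 100)                              # input vector X_{1:N}
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 2.0)           # squared-exponential kernel
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X)))          # K = L L^T (with jitter)
f_samples = L @ rng.standard_normal((len(X), 5))            # five draws f ~ N(0, K)
# each column of f_samples is one random function drawn from the GP prior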
GPR Example code & numerical stability
Look at demo/1d_gp_example.ipynb and Murphy, Chapter 15

The algorithm given there (Murphy, Ch. 15) is numerically more stable: it uses a Cholesky factorization instead of a direct matrix inverse.


Prediction & Confidence Intervals
Note: if you don’t fix the seed, these pictures vary every time you run the code.

(Figure: 95% confidence intervals, unknown true function, and observations;
panels with 10 training points and 50 training points.)

We see that the model perfectly interpolates the training data, and that the predictive
uncertainty increases as we move further away from the observed data.
Non-hypercubic domain: GPR versus ASG

Figure: The upper left panel shows the analytical test function evaluated at random test points on the simplex. The
upper right panel displays a comparison of the interpolation error for GPs, sparse grids, and adaptive sparse grids of
varying resolution and constructed under the assumption that a continuation value exists (denoted by “cont”), or that
there is no continuation value. The lower left panel displays a sparse grid consisting of 32,769 points. The lower middle
panel shows an adaptive sparse grid (cont) that consists of 1,563 points, whereas the lower right panel shows an
adaptive sparse grid, constructed with 3,524 points and under the assumption that the function outside ∆ is not known.
GPR in scikit-learn.org
https://scikit-learn.org/stable/modules/gaussian_process.html
https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py

Look at demo/1d_GPR.ipynb
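A minimal usage sketch of scikit-learn’s GaussianProcessRegressor (the training data here are made up; the demo notebook uses its own example):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])   # hypothetical 1-d training inputs
y = np.sin(X).ravel()

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-10)   # alpha ~ noise level (here: essentially noise-free)
gpr.fit(X, y)                                                # hyperparameters fitted by maximizing the marginal likelihood

X_test = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)             # predictive mean and standard deviation
print(gpr.kernel_)                                           # optimized kernel parameters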
More GP packages (besides scikit-learn.org)
Illustration of noiseless GPR prediction (II)


We use a squared exponential kernel, aka Gaussian kernel or RBF kernel.


In 1d, this is given by  k(x, x') = σf² exp( −(x − x')² / (2 l²) )


Here l controls the horizontal length scale over which the function varies,
and σf controls the vertical variation.

 We usually show predictions from the posterior, p(f∗ |X∗ , X, f).


We see (NEXT SLIDE) that the model perfectly interpolates the training
data, and that the predictive uncertainty increases as we move further
away from the observed data.
The parameters in the Kernel
cf. demo/1d_gp_example.ipynb
(Figure panels: l² = 1.0, l² = 0.01, l² = 10.0)

Let  k(x, x') = σf² exp( −(x − x')² / (2 l²) )  and vary the length scale l².

→ Tuning the parameters by hand
is not a good idea in general
(in particular in high-dimensional settings).
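One principled alternative is to compare (or maximize) the log marginal likelihood. A sketch with scikit-learn on made-up data, holding the length scale fixed at each of the three values from the figure:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0.0, 5.0, 20).reshape(-1, 1)    # hypothetical data
y = np.sin(3.0 * X).ravel()

for l2 in [0.01, 1.0, 10.0]:
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=np.sqrt(l2)),
                                   optimizer=None)            # keep the length scale fixed
    gpr.fit(X, y)
    print(l2, gpr.log_marginal_likelihood_value_)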
