Lecture 6
Simon Scheidegger
Today’s Roadmap
The problem is how to work out which dimensions to use,
→ that is what kernel methods, the class of algorithms we will talk about in this lecture, are for.
SVM provides very impressive classification performance on
reasonably sized datasets.
SVMs do not work well on extremely large datasets,
→ the computations don’t scale well with the number of training
examples,
→ they become computationally very expensive.
More on SVM
There is rather more to the SVM than the kernel method; the algorithm
also reformulates the classification problem in such a way that we can
tell a good classifier from a bad one, even if they both give the same
results on a particular dataset.
It is this distinction that enables the SVM algorithm to be derived, so
that is where we will start.
Three different classification lines. Is there any reason why one is better than
the others?
Fig.: from Marsland (2014)
Basic Idea of SVM
We can measure the distance that we have to travel away from
the line (in a direction perpendicular to the line) before we hit a data
point.
How large could we make the radius of a cylinder around the line before we started to
put points into a no-man’s land, where we don’t know which
class they are from?
This largest radius is known as the margin,
labelled M.
The data points in each class that lie closest
to the classification line have a name as well.
They are called support vectors.
Basic Idea of SVM
Using the argument that the best classifier is the one that goes through
the middle of no-man’s land, we can now make two arguments:
first that the margin should be as large as possible
second that the support vectors are the most useful data points
because they are the ones that we might get wrong.
This leads to an interesting feature of these algorithms:
→ after training, we can throw away all of the data except for the
support vectors, and use them for classification, which is a useful
saving in data storage.
Basic Idea of SVM
By modifying the features we hope to find spaces where the data are linearly
separable.
SVMs more formally
Support Vector Machines: a discriminative method for classification
Independent features and data points need to be “metric” (i.e., numeric).
Idea: certain hyperplanes (cf. linear regression) separate the data
as well as possible.
In an ideal world, the data lie on the two sides of the hyperplane.
SVMs
Bishop, Chapter 7; Murphy, Chapter 14
The kernel trick is to rewrite an algorithm so that the inputs x
enter only in the form of dot products
The way to move forward from here:
Kernel trick examples
Kernel functions
A Kernel Trick
Let’s look at the nearest-neighbor classification algorithm
If we used a non-linear feature space φ(x), the distances would have to be computed between the mapped points
So nearest-neighbor can be done in a high-dimensional
feature space without actually moving to it.
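A hedged sketch of the argument (the slide’s own equation is not reproduced here): squared distances in a feature space φ can be written purely in terms of dot products, and hence in terms of a kernel k(x, z) = φ(x)ᵀφ(z):
\[
\|\phi(\mathbf{x}) - \phi(\mathbf{z})\|^{2}
= k(\mathbf{x}, \mathbf{x}) - 2\,k(\mathbf{x}, \mathbf{z}) + k(\mathbf{z}, \mathbf{z}),
\]
so nearest neighbors can be found without ever computing φ(x) explicitly.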
A Kernel Function
Consider the kernel function
With
So this particular kernel function does correspond to a dot
product in a feature space (is valid).
Computing k(x, z) is faster than explicitly computing the feature mapping φ and taking the dot product.
In higher dimensions, or with a larger exponent, it is much faster.
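The specific kernel on the slide is not legible here; a standard example of this kind, in two input dimensions, is
\[
k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^{\top}\mathbf{z})^{2},
\qquad
\phi(\mathbf{x}) = \left( x_1^{2},\; \sqrt{2}\, x_1 x_2,\; x_2^{2} \right)^{\top},
\]
for which φ(x)ᵀφ(z) = (x1 z1 + x2 z2)² = k(x, z): evaluating k costs one dot product, whereas building φ explicitly requires all O(d²) quadratic terms (and O(d^m) terms for exponent m).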
Why Kernels?
Why bother with kernels?
Often it is easier to specify how similar two things are (a dot product)
than to construct an explicit feature space φ(x).
There are high-dimensional (even infinite) spaces that have
efficient-to-compute kernels.
So you want to use kernels
Need to know when a kernel function is valid, so we can apply the
kernel trick.
Valid Kernels
Given some arbitrary function k(xi , xj ), how do we know if it
corresponds to a dot product in some space?
Valid kernels: if k(·, ·) satisfies:
Symmetric: k(xi, xj) = k(xj, xi)
Positive definite: for any x1, . . . , xN, the Gram matrix K (with entries Kij = k(xi, xj)) must be positive semi-definite
Positive semi-definite means xᵀK x ≥ 0 for all x
then k(·, ·) corresponds to a dot product in some space
a.k.a. Mercer kernel, admissible kernel, reproducing kernel
Examples of some Kernels
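The slide’s list of kernels is not reproduced here; commonly used examples include (a sketch):
Linear: k(x, z) = xᵀz
Polynomial: k(x, z) = (xᵀz + c)^m, with c ≥ 0
Gaussian / RBF: k(x, z) = exp(−||x − z||² / (2σ²))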
Constructing Kernels
Can build new valid kernels from existing valid ones:
Bishop (2006), table on p. 296 gives many such rules
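A sketch of some of those construction rules: given valid kernels k1(x, z) and k2(x, z), the following are also valid kernels:
c k1(x, z) for a constant c > 0
f(x) k1(x, z) f(z) for any function f
q(k1(x, z)) for a polynomial q with non-negative coefficients
exp(k1(x, z))
k1(x, z) + k2(x, z)
k1(x, z) k2(x, z)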
More Kernels (end of detour)
Stationary kernels are only a function of the difference
between arguments: k(x1, x2) = k(x1 − x2)
Translation invariant in input space:
k(x1, x2) = k(x1 + c, x2 + c)
Homogeneous kernels, a.k.a. radial basis functions, are only a
function of the magnitude of the difference:
k(x1, x2) = k(||x1 − x2||)
Kernels over sets: e.g. k(A1, A2) = 2^|A1 ∩ A2| for subsets A1, A2, where |A| denotes
the number of elements in A
Domain-specific: think hard about your problem, figure out
what it means to be similar, define as k(·, ·), prove positive
definite (Feynman algorithm)
Max. Margin
We can define the margin of a classifier as the minimum
distance to any example.
In support vector machines (SVM) the decision boundary which
maximizes the margin is chosen.
Marginal Geometry
Recall Bishop, Chapter 4
Support Vectors
Assuming data are separated by the hyperplane, distance to
decision boundary is
The maximum margin criterion chooses w, b by:
Points with this min value are known as support vectors.
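Written out (a sketch following Bishop, Ch. 7, with targets y_n ∈ {−1, +1} and feature map φ): the distance of a correctly classified point x_n to the decision boundary y(x) = wᵀφ(x) + b = 0 is
\[
\frac{y_n \left( \mathbf{w}^{\top} \phi(\mathbf{x}_n) + b \right)}{\|\mathbf{w}\|},
\]
and the maximum margin criterion chooses
\[
\arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_{n} \left[ y_n \left( \mathbf{w}^{\top} \phi(\mathbf{x}_n) + b \right) \right] \right\}.
\]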
Canonical Representation
This optimization problem is complex:
Note that rescaling w and b does not change the
distance (many equivalent answers).
So for the point closest to the surface, we rescale, and thus can set its margin to 1 (written out below):
All other points are at least this far away:
Under these constraints, the optimization becomes:
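Written out (a sketch following Bishop, Ch. 7): the closest point is rescaled so that y_n(wᵀφ(x_n) + b) = 1, all other points satisfy y_n(wᵀφ(x_n) + b) ≥ 1, and the problem becomes
\[
\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{subject to} \quad
y_n \left( \mathbf{w}^{\top}\phi(\mathbf{x}_n) + b \right) \ge 1, \quad n = 1, \dots, N.
\]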
Canonical Representation (II)
So the optimization problem is now a constrained
optimization problem:
To solve this, we need to take a detour into Lagrange
multipliers
Recall: Lagrange Multipliers
Points on g(x) = 0 must have ∇g(x) normal to surface
A stationary point must have no change in f in the direction
of the surface, so ∇f (x) must also be in this same direction
So there must be some λ such that ∇f (x) + λ∇g(x) = 0
Recall: Lagrange Multipliers (II)
Define the Lagrangian L(x, λ)
Stationary points of L(x, λ) recover both the gradient condition and the constraint, so the stationary point lies on the constraint surface (written out below)
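A sketch of the standard definitions referred to above (for a single equality constraint g(x) = 0):
\[
L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x}),
\]
with stationary points satisfying
\[
\nabla_{\mathbf{x}} L = \nabla f(\mathbf{x}) + \lambda \nabla g(\mathbf{x}) = 0,
\qquad
\frac{\partial L}{\partial \lambda} = g(\mathbf{x}) = 0,
\]
so the stationary point satisfies ∇f(x) + λ∇g(x) = 0 while lying on the constraint surface.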
L. Multipliers - Inequality Constraints
Optimization over a region – solutions lie either at stationary
points (gradient 0) inside the region or on the boundary
L. Multipliers - Inequality Constraints
Exactly how does the Lagrangian relate to the optimization
problem in this case?
It turns out that the solution to the optimization problem is a max-min of the Lagrangian: the inner optimization over the multiplier leaves feasible points unchanged and penalizes infeasible ones, hence the constrained problem can be written in this max-min form.
Min-max (Dual form)
So the solution to the optimization problem is:
The dual problem is when one switches the order of the max
and min:
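As a hedged sketch of the general statement (the slide’s own equations are not reproduced here), for a minimization problem the primal and dual values are
\[
p^{*} = \min_{\mathbf{x}} \max_{\lambda \ge 0} L(\mathbf{x}, \lambda),
\qquad
d^{*} = \max_{\lambda \ge 0} \min_{\mathbf{x}} L(\mathbf{x}, \lambda).
\]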
These are not the same, but it is always the case that the dual is a bound for the primal (in the SVM case with minimization, a lower bound).
Slater’s theorem gives conditions for these two problems to be equivalent, with LD(λ) = LP(x).
Slater’s theorem applies for the SVM optimization problem,
and solving the dual leads to kernelization and can be easier
than solving the primal.
Now what for SVM?
So the optimization problem is now a constrained
optimization problem:
We can find the derivatives of L wrt w, b and set to 0:
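Written out (a sketch following Bishop, Ch. 7, with Lagrange multipliers a_n ≥ 0), the Lagrangian is
\[
L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^{2}
- \sum_{n=1}^{N} a_n \left[ y_n \left( \mathbf{w}^{\top}\phi(\mathbf{x}_n) + b \right) - 1 \right],
\]
and setting the derivatives with respect to w and b to zero gives
\[
\mathbf{w} = \sum_{n=1}^{N} a_n y_n \phi(\mathbf{x}_n),
\qquad
0 = \sum_{n=1}^{N} a_n y_n .
\]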
The dual formulation
Plugging those equations into L removes w and b, resulting in a version of L for which ∇w,b L = 0 (written out after this list):
this new L̃ is the dual representation of the problem
(maximize with constraints)
Note that it is kernelized
It is quadratic, convex in a
Bounded above since K positive semi-definite
Optimal a can be found
With large datasets, descent strategies employed
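Written out (again a sketch following Bishop, Ch. 7), the dual representation referred to in this list is
\[
\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n
- \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m y_n y_m\, k(\mathbf{x}_n, \mathbf{x}_m),
\]
maximized subject to a_n ≥ 0 for all n and Σ_n a_n y_n = 0, where k(x_n, x_m) = φ(x_n)ᵀφ(x_m).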
From a to a Classifier
We found a by optimizing something else (the dual)
This is related to classifier by
Recall a condition from the Lagrange multiplier analysis
Another formula for finding b
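A sketch of the corresponding formulas (following Bishop, Ch. 7): new inputs are classified by the sign of
\[
y(\mathbf{x}) = \sum_{n=1}^{N} a_n y_n\, k(\mathbf{x}, \mathbf{x}_n) + b,
\]
where the sum effectively runs only over the support vectors (all other a_n are zero); b can be obtained from the condition y_n y(x_n) = 1 for any support vector, or more stably by averaging this relation over all support vectors.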
Examples
SVM trained using Gaussian kernel (see the previous lecture)
Support vectors circled
Note non-linear decision boundary in x space
Example
From Burges, A Tutorial on Support Vector Machines for
Pattern Recognition (1998).
SVM trained using a cubic polynomial kernel.
Left panel: linearly separable.
Note decision boundary is almost linear, even using cubic
polynomial kernel
Right panel: not linearly separable.
But is separable using polynomial kernel
Non-Separable Data
For most problems, data will not be linearly separable (even
in the feature space).
Can relax the constraints by introducing slack variables ξn ≥ 0 (written out after this list)
Non-zero slack variables are bad; penalize them while maximizing
the margin:
Constant C > 0 controls the importance of a large margin versus
incorrectly classified points (non-zero slack)
Set using cross-validation
Optimization is same quadratic, different constraints, convex
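Written out (a sketch following the soft-margin formulation in Bishop, Ch. 7), the relaxed constraints and penalized objective referred to in this list are
\[
y_n\left(\mathbf{w}^{\top}\phi(\mathbf{x}_n) + b\right) \ge 1 - \xi_n,
\qquad \xi_n \ge 0,
\]
\[
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|\mathbf{w}\|^{2}.
\]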
SVM in Python
demo/svm_class.py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
########################################################################
# load data (whitespace-separated Auto MPG file, no header row)
cars = pd.read_csv('auto-mpg.data.txt', header=None, sep=r'\s+')
# the raw UCI file marks missing horsepower values with '?'; coerce and drop them (no-op on a clean file)
cars[3] = pd.to_numeric(cars[3], errors='coerce')
cars = cars.dropna(subset=[3])
# extract power (horsepower, column 3) and weight (column 4) as data matrix X
X = cars.iloc[:, [3, 4]].values
# extract origin (column 7) as binary target vector y (1 = US-built, 0 = otherwise)
y = [1 if o == 1 else 0 for o in cars.iloc[:, 7].values]
#y = cars.iloc[:, 7].values
# Training data (80%), Test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Standardize features (fit the scaler on the training data only)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)
# train SVM on the standardized training data
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_standardized, y_train)
y_predicted = svm.predict(X_test_standardized)
## Correctly classified
print("Correctly Classified:\n", accuracy_score(y_true=y_test, y_pred=y_predicted))
print("Precision:\n", precision_score(y_true=y_test, y_pred=y_predicted))
print("Recall:\n", recall_score(y_true=y_test, y_pred=y_predicted))
print("F1:\n", f1_score(y_true=y_test, y_pred=y_predicted))
Action required!
Try 3 different kernels (https://scikit-learn.org). How does the
performance of SVM change?
Try different sizes of training and test data (50%/50%),
(70%/30%), (90%/10%) → How does the performance change?
Summary on SVMs
Maximum margin criterion for deciding on decision boundary
Linearly separable data
Relax with slack variables for non-separable case
Global optimization is possible in both cases
Convex problem (no local optima)
Descent methods converge to global optimum
Kernelized
2. Gaussian Process Regression
Today
Basics of Gaussian Process Regression
Noise-free kernels
Gaussian Process Regression
http://www.gaussianprocess.org/gpml/
Recall: Aim of Regression
Given some (potentially) noisy observations of a dependent variable at certain values of the
independent variable x, what is our best estimate of the dependent variable y at a new value, x∗?
Let f denote an (unknown) function which maps inputs x to outputs
f:X → Y
Modeling a function f means mathematically representing the relation between inputs and outputs.
Oftentimes the shape of the underlying function is unknown, the function can be hard to
evaluate, or other requirements complicate the process of information acquisition.
Choosing a model
If we expect the underlying function f(x) to be linear, and can make
some assumptions about the input data, we might use a least-squares
method to fit a straight line (linear regression).
Moreover, if we suspect f(x) may also be quadratic, cubic, or
even non-polynomial, we can use the principles of model selection to
choose among the various possibilities.
Model Selection
Example data set by https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Degree of regression
1: linear
2: quadratic
...
Why Gaussian Process Regression?
Many different projections are possible.
We have to choose one either a priori or by model comparison with a
set of possible projections.
Especially if the problem is to explore and exploit a completely
unknown function, this approach will not be beneficial, as there is little
guidance as to which projections we should try.
→ plot
→ we want to fit a Gaussian to these points.
Recall Multivariate Gaussians (II)
Figure: a 2D Gaussian over x = (x1, x2)ᵀ, with its mean vector E[x] indicated.
How do points relate to each other? (How does increasing x1 affect x2?)
If the entries in the column vector X = (X1, . . . , XN)ᵀ are random variables, each with finite variance and expected value, then the
covariance matrix KXX is the matrix whose (i, j) entry is the covariance:
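\[
\left(K_{XX}\right)_{ij} = \operatorname{cov}(X_i, X_j)
= \mathbb{E}\!\left[\left(X_i - \mathbb{E}[X_i]\right)\left(X_j - \mathbb{E}[X_j]\right)\right].
\]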
Assume two points: (1, 0)(1, 0)ᵀ = 1, whereas E[x1 x2] = (1, 0)(0, 1)ᵀ = 0 → the covariance is a measure of similarity.
Multivariate Gaussians (V)
As above: the covariance measures the similarity between points.
Conditional distribution
Given the joint P(x1, x2) = N(μ, Σ) with μ = (μ1, μ2)ᵀ and blocks Σ11, Σ12, Σ21, Σ22, the conditional P(x1 | x2) is Gaussian with
mean μ1 + Σ12 Σ22⁻¹ (x2 − μ2) and covariance Σ11 − Σ12 Σ22⁻¹ Σ21.
From Joint to Conditional distributions
see, e.g., Murphy (2012), chapter 4 .
Figure: observed function values f(x) at a few inputs, with an unknown value “???” at a new input.
→ The mean of the Gaussian is often set to 0.
Observations → Interpolation (II)
We assume that the f’s (the heights) are Gaussian distributed,
with zero mean and some covariance matrix K.
Figure: function values f1, f2, f3 at inputs x1, x2, x3, with an unknown value “???” at x∗.
Note: f1 and f2 should probably be more correlated than f1 and f3, as x1 and x2 are nearby.
The central computational operation in using Gaussian processes
will involve the inversion of a matrix of size N × N,
for which standard methods require O(N³) computations.
The matrix inversion must be performed once for the given
training set.
Here K is the N × N covariance (Gram) matrix built from the training inputs, and t denotes the N observations.
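The slide’s equations are not legible here; for noise-free GPR the predictive mean and variance at a test input x∗ take the standard form (cf. Murphy, Ch. 15):
\[
\bar{f}_{*} = \mathbf{k}_{*}^{\top} K^{-1} \mathbf{t},
\qquad
\operatorname{var}(f_{*}) = k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} K^{-1} \mathbf{k}_{*},
\]
where Kij = k(xi, xj) over the training inputs and (k∗)i = k(xi, x∗).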
Procedure (sampling from the GP prior):
Create the input vector X1:N
Build the covariance matrix K and factor it with the Cholesky decomposition K = LLᵀ
Draw f ~ N(0, K) as f = L z with z ~ N(0, I)
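A minimal Python sketch of this sampling procedure, assuming a squared exponential kernel (variable names are illustrative; the course notebook demo/1d_gp_example.ipynb may differ):

import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    # squared exponential kernel k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2))
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sqdist / length_scale ** 2)

# create the input vector X_{1:N}
X = np.linspace(-5.0, 5.0, 100)
# build the covariance matrix K (small jitter on the diagonal for numerical stability)
K = rbf_kernel(X, X) + 1e-10 * np.eye(len(X))
# Cholesky factorization K = L L^T
L = np.linalg.cholesky(K)
# draw f ~ N(0, K) as f = L z with z ~ N(0, I); three independent sample paths
z = np.random.randn(len(X), 3)
f_samples = L @ z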
GPR Example code & numerical stability
Look at demo/1d_gp_example.ipynb and Murphy, Chapter 15
Observations
We see that the model perfectly interpolates the training data, and that the predictive
uncertainty increases as we move further away from the observed data.
Non-hypercubic domain: GPR versus ASG
Figure: The upper left panel shows the analytical test function evaluated at random test points on the simplex. The
upper right panel displays a comparison of the interpolation error for GPs, sparse grids, and adaptive sparse grids of
varying resolution and constructed under the assumption that a continuation value exists (denoted by “cont”), or that
there is no continuation value. The lower left panel displays a sparse grid consisting of 32,769 points. The lower middle
panel shows an adaptive sparse grid (cont) that consists of 1,563 points, whereas the lower right panel shows an
adaptive sparse grid, constructed with 3,524 points and under the assumption that the function outside ∆ is not known.
GPR in scikit-learn.org
https://scikit-learn.org/stable/modules/gaussian_process.html
https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py
Look at demo/1d_GPR.ipynb
More GP packages (next to scikit-learn.org)
Illustration of noiseless GPR prediction (II)
We use a squared exponential kernel, aka Gaussian kernel or RBF kernel.
In 1d, this is given by
Here l controls the horizontal length scale over which the function varies,
and σf controls the vertical variation.
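Written out (the standard form; the exact parameterization on the slide may differ slightly):
\[
k(x, x') = \sigma_f^{2} \exp\!\left( -\frac{(x - x')^{2}}{2\,l^{2}} \right).
\]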
We see (NEXT SLIDE) that the model perfectly interpolates the training
data, and that the predictive uncertainty increases as we move further
away from the observed data.
The parameters in the Kernel
cf. demo/1d_gp_example.ipynb
Figure: GP fits for three length scales, l² = 0.01, l² = 1.0, and l² = 10.0.