Lecture 15: Support Vector Machines
Artificial Intelligence SS2009
Overview
Motivation
Statistical learning theory
Optimal separating hyperplanes
Support vector classification and regression
Kernel functions
Motivation
Given data D = {x_i, t_i} distributed according to P(x, t),
which model is a better representation of the data?
[Figure: two fits to the same data; left: high bias, low variance; right: low bias, high variance]
Motivation (cont.)
Neural networks model p(t|x); overfitting is controlled by:
topology restriction
early stopping
weight decay
Bayesian approach
Empirical risk of a classifier y(α, x):

$$R_{\mathrm{emp}}(\alpha) = \frac{1}{2n} \sum_{i=1}^{n} \left| y(\alpha, x_i) - t_i \right|$$
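To make the formula concrete, a minimal sketch (illustrative code, not from the lecture): for labels in {−1, +1}, each misclassified point contributes |y − t| = 2, so R_emp is exactly the misclassification rate.

```python
import numpy as np

def empirical_risk(y_pred, t):
    """R_emp = 1/(2n) * sum_i |y(x_i) - t_i| for labels in {-1, +1}."""
    return np.abs(y_pred - t).sum() / (2 * len(t))

t = np.array([+1, -1, +1, -1])   # true labels
y = np.array([+1, +1, +1, -1])   # predictions with one error
print(empirical_risk(y, t))      # 0.25 = fraction misclassified
```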
Shattering
A classifier shatters a set of data points if, for every possible labeling, the points can be correctly classified
The capacity of a classifier depends on the number of points it can shatter
The VC dimension is the largest number of data points for which there exists an arrangement that can be shattered
Not the same as the number of parameters in the classifier!
Shattering examples
Straight lines can shatter 3 points in 2-space
Classifier: sign(w · x + w0)
[Figure: all 2³ = 8 labelings of three points in the plane, each separated by a straight line]
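This can be checked numerically: for a given labeling, linear separability is a linear-programming feasibility problem, namely finding (w, w0) with t_i(w · x_i + w0) ≥ 1 for all i. A sketch (illustrative code, assuming SciPy is available); the same test shows that four points in XOR configuration cannot be shattered, which anticipates the VC-dimension discussion below:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, t):
    """Feasibility LP: find (w1, w2, w0) with t_i (w . x_i + w0) >= 1.
    linprog wants A_ub @ z <= b_ub, so negate the margin constraints."""
    A = np.column_stack([X, np.ones(len(X))])   # rows: (x_i1, x_i2, 1)
    A_ub = -t[:, None] * A                      # -t_i (w . x_i + w0) <= -1
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(separable(X3, np.array(t))
          for t in itertools.product([-1, 1], repeat=3)))  # True: shattered

X4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(all(separable(X4, np.array(t))
          for t in itertools.product([-1, 1], repeat=4)))  # False: XOR fails
```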
VC dimension
The VC dimension is a capacity measure for classifiers
It is the largest number of data points for which there exists an arrangement that can be shattered
For straight lines in 2-space, the VC dimension is 3
For hyperplanes in n-space, the VC dimension is n + 1
It may be difficult to calculate the VC dimension of a classifier
VC dimension (cont.)
For each classifier, train and calculate the right-hand side of the bound below
The best classifier is the one that minimizes this right-hand side
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\,(\log(2n/h) + 1) - \log(\eta/4)}{n}}$$

[Figure: R_emp, VC confidence, and their sum (the upper bound) plotted against capacity for a nested sequence of models y_1(α, x), ..., y_5(α, x); the bound is minimized at intermediate capacity]
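A small illustration of how the bound drives model selection (the h, n, and R_emp values below are made up for the sketch):

```python
import numpy as np

def vc_confidence(h, n, eta=0.05):
    """VC confidence term: sqrt((h*(log(2n/h)+1) - log(eta/4)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

n = 1000
for h, r_emp in [(3, 0.30), (10, 0.10), (50, 0.05), (200, 0.01)]:
    bound = r_emp + vc_confidence(h, n)
    print(f"h={h:4d}  R_emp={r_emp:.2f}  bound={bound:.3f}")
# The bound is minimized at an intermediate capacity (here h=10),
# not at the model with the smallest empirical risk.
```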
Geometry of hyperplanes
Hyperplane: {x | w · x + w0 = 0}
Distance of a point z to the hyperplane: |w · z + w0| / ‖w‖
Distance of the hyperplane to the origin: |w0| / ‖w‖
[Figure: the hyperplanes {x | 3x − y − 4 = 0} and {x | 6x − 2y − 8 = 0} are identical; scaling (w, w0) by a constant does not change the hyperplane]
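A quick numerical sanity check of the distance formula (illustrative sketch); the two scaled parameterizations above give the same distances:

```python
import numpy as np

def distance(w, w0, z):
    """Distance of point z to the hyperplane {x | w . x + w0 = 0}."""
    return abs(np.dot(w, z) + w0) / np.linalg.norm(w)

z = np.array([2.0, 5.0])
print(distance(np.array([3.0, -1.0]), -4.0, z))   # |3*2 - 5 - 4| / sqrt(10)
print(distance(np.array([6.0, -2.0]), -8.0, z))   # same value: ~0.9487
```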
[Figure: optimal separating hyperplane; the canonical hyperplanes w · x + w0 = ±1 pass through the closest points of each class (labels ±1)]
Their separation, the margin, is 2/‖w‖; maximizing it is equivalent to

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad t_i (w \cdot x_i + w_0) \ge 1$$
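To see this primal problem at work on toy data, a sketch using SciPy's general-purpose SLSQP solver (data values are illustrative; real SVM implementations use specialized QP solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data: two points per class (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):                       # z = (w1, w2, w0)
    return 0.5 * np.dot(z[:2], z[:2])   # 1/2 ||w||^2

constraints = [{"type": "ineq",
                "fun": lambda z, i=i: t[i] * (np.dot(z[:2], X[i]) + z[2]) - 1}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0, "margin =", 2 / np.linalg.norm(w))
```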
Algorithmic aspects
The constrained optimization problem is transformed to the Lagrangian

$$L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ t_i (w \cdot x_i + w_0) - 1 \right]$$

and setting the derivative with respect to w to zero gives

$$w = \sum_{i=1}^{n} \alpha_i t_i x_i$$
The dual problem:

$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$$

$$\text{subject to} \quad 0 \le \alpha_i \le C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i t_i = 0$$
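A sketch of solving this dual directly with the cvxpy modeling package (an assumed dependency, not part of the lecture; production SVMs use specialized solvers such as SMO):

```python
import cvxpy as cp
import numpy as np

# Toy 2-D data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(X), 10.0

K = X @ X.T                                    # Gram matrix of x_i . x_j
a = cp.Variable(n)
obj = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(cp.multiply(a, t), K))
cp.Problem(obj, [a >= 0, a <= C, t @ a == 0]).solve()

alpha = a.value
w = (alpha * t) @ X                            # w = sum_i alpha_i t_i x_i
sv = np.where(alpha > 1e-5)[0]                 # support vectors: alpha_i > 0
w0 = t[sv[0]] - w @ X[sv[0]]                   # from a margin support vector
print("alpha =", alpha.round(3), "w =", w, "w0 =", w0)
```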
Support vector regression:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

$$\text{subject to} \quad f(x_i) - t_i \le \epsilon + \xi_i, \quad t_i - f(x_i) \le \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$
The regression function is

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \, x_i \cdot x + w_0$$
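For completeness, a minimal usage sketch with scikit-learn's SVR (assuming scikit-learn is available; not part of the original slides):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)[:, None]
t = np.sin(x).ravel() + 0.1 * rng.standard_normal(50)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x, t)
print("support vectors:", len(model.support_))  # points outside the eps-tube
```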
Cross-entropy log-likelihood for probabilistic outputs p_i:

$$\sum_{i=1}^{n} t_i \log(p_i) + (1 - t_i) \log(1 - p_i)$$
Nonlinear SVM
Idea: Do a nonlinear projection Φ(x): R^m → H of the original
data points x into some higher-dimensional space H
Then, apply the optimal margin hyperplane algorithm in H
[Figure: points x in input space mapped by Φ(·) into feature space H, where a separating hyperplane exists]
Idea: Project R² → R³ by

$$\Phi(x) = \left( x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2 \right)^\top$$
Then

$$\Phi(x) \cdot \Phi(y) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \cdot \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix} = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x \cdot y)^2$$
Kernel functions
Admissible kernel functions: the Gram matrix (K(x_i, x_j))_{i,j} is
positive semidefinite
Most widely used kernel functions and their parameters:
polynomials (degree)
Gaussians (variance)
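Concretely, sketches of the two kernels (parameter names are illustrative):

```python
import numpy as np

def poly_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel (x . y + c)^degree."""
    return (x @ y + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(x, y), gaussian_kernel(x, y))
```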
SVM examples
[Decision-boundary figures for various kernels and regularization constants:]
Linearly separable, C = 100
Linearly separable, C = 1
Quadratic polynomial, C = 10
Gaussian, σ = 1
Cubic polynomial, C = 10
Degree-4 polynomial, C = 10
Gaussian, σ = 3
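Figures of this kind can be reproduced along the following lines (a sketch assuming scikit-learn; note that sklearn parameterizes the Gaussian kernel by gamma = 1/(2σ²)):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
t = np.sign(X[:, 0] * X[:, 1])        # a nonlinearly separable labeling

sigma = 1.0
clf = SVC(kernel="rbf", C=10.0, gamma=1 / (2 * sigma**2)).fit(X, t)
print("training accuracy:", clf.score(X, t))
print("number of support vectors:", clf.support_vectors_.shape[0])
```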
Summary
SVMs are based on statistical learning theory
This allows calculating bounds on generalization performance
Optimal separating hyperplanes
Kernel trick (projection)
Kernel functions are similarity measures
SVMs perform comparably to neural networks