Lecture 15: Support Vector Machines
Artificial Intelligence SS2009
Overview
Motivation
Statistical learning theory
Optimal separating hyperplanes
Support vector classification and regression
Kernel functions
Motivation
Given data D = {x_i, t_i} distributed according to P(x, t),
which model is a better representation of the data?
[Figure: two fits to the same data; left: high bias, low variance; right: low bias, high variance]
Motivation (cont.)
Neural networks model p(t|x); overfitting is controlled by:
topology restriction
early stopping
weight decay
Bayesian approach
Empirical risk of a classifier y(α, x):

$$R_{\mathrm{emp}}(\alpha) = \frac{1}{2n} \sum_{i=1}^{n} \left| y(\alpha, x_i) - t_i \right|$$
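To make the formula concrete, a minimal sketch (illustrative code, not from the lecture): for labels in {−1, +1}, each misclassified point contributes |y − t| = 2, so R_emp is exactly the misclassification rate.

```python
import numpy as np

def empirical_risk(y_pred, t):
    """R_emp = 1/(2n) * sum_i |y(x_i) - t_i| for labels in {-1, +1}."""
    return np.abs(y_pred - t).sum() / (2 * len(t))

t = np.array([+1, -1, +1, -1])   # true labels
y = np.array([+1, +1, +1, -1])   # predictions with one error
print(empirical_risk(y, t))      # 0.25 = fraction misclassified
```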
Shattering
A classifier shatters a set of data points if, for every possible labeling, the points can be correctly classified
The capacity of a classifier depends on the number of points it can shatter
The VC dimension is the largest number of data points for which there exists an arrangement that can be shattered
Not the same as the number of parameters in the classifier!
Shattering examples
Straight lines can shatter 3 points in 2-space
Classifier: sign(w · x + w0)
[Figure: all 2³ = 8 labelings of three points in the plane, each separated by a straight line]
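This can be checked numerically: for a given labeling, linear separability is a linear-programming feasibility problem, namely finding (w, w0) with t_i(w · x_i + w0) ≥ 1 for all i. A sketch (illustrative code, assuming SciPy is available); the same test shows that four points in XOR configuration cannot be shattered, which anticipates the VC-dimension discussion below:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, t):
    """Feasibility LP: find (w1, w2, w0) with t_i (w . x_i + w0) >= 1.
    linprog wants A_ub @ z <= b_ub, so negate the margin constraints."""
    A = np.column_stack([X, np.ones(len(X))])   # rows: (x_i1, x_i2, 1)
    A_ub = -t[:, None] * A                      # -t_i (w . x_i + w0) <= -1
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.success

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(all(separable(X3, np.array(t))
          for t in itertools.product([-1, 1], repeat=3)))  # True: shattered

X4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(all(separable(X4, np.array(t))
          for t in itertools.product([-1, 1], repeat=4)))  # False: XOR fails
```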
VC dimension
The VC dimension is a capacity measure for classifiers
It is the largest number of data points for which there exists an arrangement that can be shattered
For straight lines in 2-space, the VC dimension is 3
For hyperplanes in n-space, the VC dimension is n + 1
It may be difficult to calculate the VC dimension of a classifier
VC dimension (cont.)
For each classifier, train and calculate the right-hand side of the bound below
The best classifier is the one that minimizes this right-hand side
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\,(\log(2n/h) + 1) - \log(\eta/4)}{n}}$$

[Figure: R_emp, VC confidence, and their sum (the upper bound) plotted against capacity for a nested sequence of models y_1(α, x), ..., y_5(α, x); the bound is minimized at intermediate capacity]
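A small illustration of how the bound drives model selection (the h, n, and R_emp values below are made up for the sketch):

```python
import numpy as np

def vc_confidence(h, n, eta=0.05):
    """VC confidence term: sqrt((h*(log(2n/h)+1) - log(eta/4)) / n)."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

n = 1000
for h, r_emp in [(3, 0.30), (10, 0.10), (50, 0.05), (200, 0.01)]:
    bound = r_emp + vc_confidence(h, n)
    print(f"h={h:4d}  R_emp={r_emp:.2f}  bound={bound:.3f}")
# The bound is minimized at an intermediate capacity (here h=10),
# not at the model with the smallest empirical risk.
```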
Geometry of hyperplanes
Hyperplane: {x | w · x + w0 = 0}
Distance of a point z to the hyperplane: |w · z + w0| / ‖w‖
Distance of the hyperplane to the origin: |w0| / ‖w‖
[Figure: the hyperplanes {x | 3x − y − 4 = 0} and {x | 6x − 2y − 8 = 0} are identical; scaling (w, w0) by a constant does not change the hyperplane]
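A quick numerical sanity check of the distance formula (illustrative sketch); the two scaled parameterizations above give the same distances:

```python
import numpy as np

def distance(w, w0, z):
    """Distance of point z to the hyperplane {x | w . x + w0 = 0}."""
    return abs(np.dot(w, z) + w0) / np.linalg.norm(w)

z = np.array([2.0, 5.0])
print(distance(np.array([3.0, -1.0]), -4.0, z))   # |3*2 - 5 - 4| / sqrt(10)
print(distance(np.array([6.0, -2.0]), -8.0, z))   # same value: ~0.9487
```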
[Figure: optimal separating hyperplane; the canonical hyperplanes w · x + w0 = ±1 pass through the closest points of each class (labels ±1)]
Their separation, the margin, is 2/‖w‖; maximizing it is equivalent to

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad t_i (w \cdot x_i + w_0) \ge 1$$
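To see this primal problem at work on toy data, a sketch using SciPy's general-purpose SLSQP solver (data values are illustrative; real SVM implementations use specialized QP solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data: two points per class (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):                       # z = (w1, w2, w0)
    return 0.5 * np.dot(z[:2], z[:2])   # 1/2 ||w||^2

constraints = [{"type": "ineq",
                "fun": lambda z, i=i: t[i] * (np.dot(z[:2], X[i]) + z[2]) - 1}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0, "margin =", 2 / np.linalg.norm(w))
```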
Algorithmic aspects
The constrained optimization problem is transformed to the Lagrangian

$$L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ t_i (w \cdot x_i + w_0) - 1 \right]$$

and setting the derivative with respect to w to zero gives

$$w = \sum_{i=1}^{n} \alpha_i t_i x_i$$
The dual problem:

$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j t_i t_j \, x_i \cdot x_j$$

$$\text{subject to} \quad 0 \le \alpha_i \le C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i t_i = 0$$
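A sketch of solving this dual directly with the cvxpy modeling package (an assumed dependency, not part of the lecture; production SVMs use specialized solvers such as SMO):

```python
import cvxpy as cp
import numpy as np

# Toy 2-D data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(X), 10.0

K = X @ X.T                                    # Gram matrix of x_i . x_j
a = cp.Variable(n)
obj = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(cp.multiply(a, t), K))
cp.Problem(obj, [a >= 0, a <= C, t @ a == 0]).solve()

alpha = a.value
w = (alpha * t) @ X                            # w = sum_i alpha_i t_i x_i
sv = np.where(alpha > 1e-5)[0]                 # support vectors: alpha_i > 0
w0 = t[sv[0]] - w @ X[sv[0]]                   # from a margin support vector
print("alpha =", alpha.round(3), "w =", w, "w0 =", w0)
```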
Support vector regression:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

$$\text{subject to} \quad f(x_i) - t_i \le \epsilon + \xi_i, \quad t_i - f(x_i) \le \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$
The regression function is

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \, x_i \cdot x + w_0$$
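For completeness, a minimal usage sketch with scikit-learn's SVR (assuming scikit-learn is available; not part of the original slides):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)[:, None]
t = np.sin(x).ravel() + 0.1 * rng.standard_normal(50)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x, t)
print("support vectors:", len(model.support_))  # points outside the eps-tube
```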
Cross-entropy log-likelihood for probabilistic outputs p_i:

$$\sum_{i=1}^{n} t_i \log(p_i) + (1 - t_i) \log(1 - p_i)$$
Nonlinear SVM
Idea: Do a nonlinear projection Φ(x): R^m → H of the original
data points x into some higher-dimensional space H
Then, apply the optimal margin hyperplane algorithm in H
[Figure: points x in input space mapped by Φ(·) into feature space H, where a separating hyperplane exists]
Idea: Project R² → R³ by

$$\Phi(x) = \left( x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2 \right)^\top$$
Then

$$\Phi(x) \cdot \Phi(y) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \cdot \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix} = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x \cdot y)^2$$
Kernel functions
Admissible kernel functions: the Gram matrix (K(x_i, x_j))_{i,j} is
positive semidefinite
Most widely used kernel functions and their parameters:
polynomials (degree)
Gaussians (variance)
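Concretely, sketches of the two kernels (parameter names are illustrative):

```python
import numpy as np

def poly_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel (x . y + c)^degree."""
    return (x @ y + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(x, y), gaussian_kernel(x, y))
```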
SVM examples
[Decision-boundary figures for various kernels and regularization constants:]
Linearly separable, C = 100
Linearly separable, C = 1
Quadratic polynomial, C = 10
Gaussian, σ = 1
Cubic polynomial, C = 10
Degree-4 polynomial, C = 10
Gaussian, σ = 3
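Figures of this kind can be reproduced along the following lines (a sketch assuming scikit-learn; note that sklearn parameterizes the Gaussian kernel by gamma = 1/(2σ²)):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
t = np.sign(X[:, 0] * X[:, 1])        # a nonlinearly separable labeling

sigma = 1.0
clf = SVC(kernel="rbf", C=10.0, gamma=1 / (2 * sigma**2)).fit(X, t)
print("training accuracy:", clf.score(X, t))
print("number of support vectors:", clf.support_vectors_.shape[0])
```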
Summary
SVMs are based on statistical learning theory
This allows calculating bounds on generalization performance
Optimal separating hyperplanes
Kernel trick (projection)
Kernel functions are similarity measures
SVMs perform comparably to neural networks