Kernel Methods
Geoff Gordon
ggordon@cs.cmu.edu
The SVM is popular because
• it can be easy to use
• it often has good generalization performance
• the same algorithm solves a variety of problems with little tuning
SVM concepts
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Classification example—Fisher’s irises
[Figure: Iris data — petal measurements of setosa, versicolor, and virginica]
[Figure: scatter plot of % Middle & Upper Class vs. Industry & Pollution]
A good linear classifier
[Figure: the same data with a good linear decision boundary]
Example—NIST digits
[Figure: sample handwritten digit images from the NIST database]
A linear separator
[Figure: a linear separator on the digit data]
Sometimes a nonlinear classifier is better
[Figure: two digit image panels]
Sometimes a nonlinear classifier is better
[Figure: two-dimensional data that is not linearly separable]
Classification problem
[Figure: the classification problem — a data matrix X of feature vectors (e.g. Industry & Pollution vs. % Middle & Upper Class) and a label vector y]
[Figure: two scatter plots of example two-class data]
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Perceptrons
[Figure: example data for the perceptron]
When an example (x, y) is misclassified, update:

  w += y x
  c += y
That is, gradient descent on l(y f(x)) = [y f(x)]−, since

  ∇w [y(x · w + c)]− = −y x   if y(x · w + c) ≤ 0
                        0      otherwise
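The update rule above can be sketched in a few lines of Python (the function name, epoch limit, and loop structure are illustrative, not from the slides):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Mistake-driven perceptron: on each misclassified (x, y), w += y*x, c += y."""
    w = np.zeros(X.shape[1])
    c = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + c) <= 0:   # wrong side of (or on) the boundary
                w += yi * xi
                c += yi
                mistakes += 1
        if mistakes == 0:                # a full pass with no mistakes: converged
            break
    return w, c
```

On linearly separable data this loop terminates with a separating (w, c); otherwise it runs until the epoch limit.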
Perceptron demo
[Figure: perceptron demo — decision boundary on example data]
Perceptrons as linear inequalities
y(x · w + c) > 0

That is,

  x · w + c > 0   for positive examples
  x · w + c < 0   for negative examples

i.e., ensure correct classification of the training examples. The decision boundary is x · w + c = 0.
Convex program:

  min f(x)  subject to  g_i(x) ≤ 0,  i = 1 … m

where f and the g_i are convex functions
Trick: since any positive rescaling of (w, c) gives the same classifier, write

  y(x · w + c) ≥ 1
Slack variables
[Figure: example data where some points violate the margin constraints]
y(x · w + c) + s ≥ 1
So try to make Σ_i s_i as small as possible
Perceptron as convex program

  min_{w,c,s} Σ_i s_i  subject to
  (y_i x_i) · w + y_i c + s_i ≥ 1
  s_i ≥ 0
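This convex program is in fact a linear program, so a generic LP solver suffices. A sketch using scipy.optimize.linprog (the variable layout and helper name are my own):

```python
import numpy as np
from scipy.optimize import linprog

def perceptron_lp(X, y):
    """Solve: min sum(s)  s.t.  (y_i x_i)·w + y_i c + s_i >= 1,  s >= 0."""
    n, d = X.shape
    cost = np.r_[np.zeros(d + 1), np.ones(n)]             # variables: [w, c, s]
    # rewrite each >= constraint as A_ub @ vars <= b_ub
    A_ub = -np.hstack([y[:, None] * X, y[:, None], np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n   # w, c free; s >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d], res.x[d], res.x[d + 1:]             # w, c, s
```

On separable data the optimum has all s_i ≈ 0 and every training margin at least 1.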
If x plays a point with g(x) < 0, then a wins: by playing big numbers, a makes the payoff approach −∞
Lagrange multipliers: an example
Problem: maximize
f(x, y) = 6x + 8y
subject to
g(x, y) = 1 − x² − y² ≥ 0
Using a Lagrange multiplier a_i ≥ 0 for each margin constraint of the perceptron program

  Z w + c y + s ≥ 1,   s ≥ 0

form the Lagrangian, then maximize over a and minimize over w, c, s
Duality cont’d
subject to s ≥ 0, a ≥ 0

Minimize wrt w, c explicitly by setting gradients to 0:

  0 = aᵀZ     0 = aᵀy

Minimizing wrt s yields an inequality:

  0 ≤ 1 − a
Duality cont’d
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Modernizing the perceptron
Three extensions:
• Margins
• Feature expansion
• Kernel trick
Points with positive margin are correctly classified; negative margin means an error
SVMs are maximum margin
For this lecture we are interested in large margins and compact descriptions
[Figure: two linear separators for the same data, one with a small margin and one with a large margin]
(x − M w/‖w‖) · w + c = 0
x · w + c = M w · w/‖w‖ = M ‖w‖

SVM maximizes M > 0 such that all margins are ≥ M:

  Z w + y c ≥ M ‖w‖

Note that λw, λc is a solution whenever w, c is
Optimizing the margin, cont’d

Using the scaling freedom, set v = w/(M‖w‖) and d = c/(M‖w‖); with slacks, the constraints become

  Z v + y d + s ≥ 1
Modernizing the perceptron
Three extensions:
• Margins
• Feature expansion
• Kernel trick
Feature expansion
Could add new features like a², ab, a⁷b³, sin(b), …
[Figure: data plotted as (x, y) and lifted to (x, y, xy); with the extra feature the classes become linearly separable]
Some popular feature sets
Polynomials of degree k, e.g. for k = 2:

  1, a, a², b, b², ab

RBFs of radius σ:

  exp(−((a − a₀)² + (b − b₀)²) / σ²)
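Both feature sets correspond to kernels that can be evaluated directly from the original inputs. A sketch (function names are mine; note that some references put 2σ² rather than σ² in the RBF denominator):

```python
import numpy as np

def poly_kernel(u, v, k=2):
    # (1 + u·v)^k -- the inner product of expanded polynomial features of degree k
    return (1.0 + np.dot(u, v)) ** k

def rbf_kernel(u, v, sigma=1.0):
    # RBF of radius sigma: exp(-||u - v||^2 / sigma^2)
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)
```

For example, poly_kernel([1, 2], [3, 4]) evaluates (1 + 11)² without ever forming the six expanded features.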
Feature expansion problems
Three extensions:
• Margins
• Feature expansion
• Kernel trick
Kernel trick
  0 = w − Zᵀa     0 = a · y

  y_i x_i · w + y_i c = 1  ⇔  a_i > 0

or recover c as the dual variable for the constraint a · y = 0
Representation of w
[Figure: scatter plot with the current support vectors highlighted]
Partway through optimization
Best w so far is Zᵀa: the difference between the mean of the +ve SVs and the mean of the −ve SVs
c = −x̄ · w
That is,
• Compute mean of +ve SVs, mean of -ve SVs
• Draw the line between means, and its perpendicular bisector
• This bisector is current classifier
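The mean-and-bisector construction above can be sketched directly (the helper name is mine; this assumes equal weights on the support vectors):

```python
import numpy as np

def bisector_classifier(X_pos, X_neg):
    """Classifier through the midpoint of the class means, normal to their difference."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = m_pos - m_neg                # the line between the means gives the normal
    c = -0.5 * (m_pos + m_neg) @ w   # boundary is the perpendicular bisector
    return w, c
```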
At end of optimization
Suppose x = (1, √2 q, q²)ᵀ. Then x · x′ = (1 + q q′)², so dot products of expanded features can be computed directly from the original inputs.

Since there are usually not too many support vectors, this is a reasonably fast calculation
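The identity behind this example, that an explicit degree-2 feature map reproduces the kernel value (1 + q q′)², is easy to check numerically:

```python
import numpy as np

def phi(q):
    # explicit degree-2 feature map for a scalar input q
    return np.array([1.0, np.sqrt(2.0) * q, q ** 2])

# phi(q)·phi(q') = 1 + 2 q q' + (q q')^2 = (1 + q q')^2
q, qp = 0.7, -1.3
lhs = phi(q) @ phi(qp)
rhs = (1.0 + q * qp) ** 2
```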
Summary of SVM algorithm
Training:
• Compute the Gram matrix G_ij = y_i y_j K(q_i, q_j)
• Solve QP to get a
• Compute intercept c by using complementarity or duality
Classification:
• Compute ki = K(q, qi) for support vectors qi
• Compute f = c + Σ_i a_i k_i y_i
• Test sign(f )
Outline
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Advanced kernels
Insight: K(x, y) can be defined even when x and y are not fixed-length vectors
Examples:
• String kernels
• Path kernels
• Tree kernels
• Graph kernels
String kernels
Pick λ ∈ (0, 1) and measure the similarity of two strings by the subsequences they share, discounting longer or gappier matches by powers of λ

Works for words in phrases too: “man bites dog” is similar to “man bites hot dog,” less similar to “dog bites man”
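The slides don't spell the kernel out, so here is a minimal sketch of one member of the family: count pairs of matching subsequence occurrences in the two strings, discounting each matched symbol by λ on each side (λ² per matched position). The function name and exact weighting scheme are my assumptions:

```python
def subseq_kernel(s, t, lam=1.0):
    # C[i][j] = weighted count of matching-subsequence pairs of s[:i] and t[:j];
    # the empty subsequence contributes 1
    n, m = len(s), len(t)
    C = [[1.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # pairs not using s[i-1], plus pairs not using t[j-1], minus the overlap
            C[i][j] = C[i - 1][j] + C[i][j - 1] - C[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                # pairs that match s[i-1] with t[j-1], discounted by lam^2
                C[i][j] += lam * lam * C[i - 1][j - 1]
    return C[n][m]
```

Because it only compares elements, the same code runs on word sequences: splitting the phrases into word lists, "man bites dog" scores higher against "man bites hot dog" than against "dog bites man".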
If K and K′ are kernels, then so are

• K + K′
• αK for α > 0

So we can search for good kernels of the form K = α₁K₁ + α₂K₂ + …, tuning the weights by cross-validation, etc.
“Kernel X”
Examples:
• kernel Fisher discriminant
• kernel logistic regression
• kernel linear and ridge regression
• kernel SVD or PCA
• 1-class learning / density estimation
Summary
They suffer from inefficient use of data, overfitting, and lack of expressiveness
References

http://www.cs.cmu.edu/~ggordon/SVMs
  these slides, together with code

http://svm.research.bell-labs.com/SVMdoc.html
  Burges’s SVM tutorial

http://citeseer.nj.nec.com/burges98tutorial.html
  Burges’s paper “A Tutorial on Support Vector Machines for Pattern Recognition” (1998)

http://www.cs.rhbnc.ac.uk/research/compint/areas/comp_learn/sv/pub/slide1.ps
  slides by Stitson & Weston

http://oregonstate.edu/dept/math/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
  Lagrange multipliers tutorial

http://svm.research.bell-labs.com/SVT/SVMsvt.html
  on-line SVM applet