
Introduction to Support Vector Machines
Jie Tang
25 July 2005
Introduction
• Support Vector Machine (SVM) is a learning methodology based on Vapnik’s statistical learning theory
– Developed in the 1990s
– Designed to address problems in traditional statistical learning (overfitting, capacity control, …)
• It has achieved some of the best performance in practical applications
– Handwritten digit recognition
– Text categorization, …
Classification Problem
• Given a training set $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)\}$, with $x_i \in X = \mathbb{R}^n$, $y_i \in Y = \{1, -1\}$, $i = 1, 2, \dots, l$
• Learn a function $g(x)$ such that the decision function $f(x) = \mathrm{sgn}(g(x))$ classifies a new input $x$
• So this is a supervised batch learning method
Linear classifier

\[ g(x) = w^T x + b \]
\[ \mathrm{sgn}(g(x)) = \begin{cases} 1, & g(x) \ge 0 \\ -1, & g(x) < 0 \end{cases} \]
\[ f(x) = \mathrm{sgn}(g(x)) \]
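As a minimal sketch of the linear classifier above, assuming NumPy and hand-picked values for `w` and `b` (both illustrative, not taken from the slides), the score and the decision function can be written as:

```python
import numpy as np

def g(x, w, b):
    """Linear score g(x) = w^T x + b."""
    return float(np.dot(w, x) + b)

def f(x, w, b):
    """Decision function f(x) = sgn(g(x)), with sgn(0) taken as +1."""
    return 1 if g(x, w, b) >= 0 else -1

# Illustrative parameters (assumed values, not learned from data).
w = np.array([1.0, -1.0])
b = 0.5

# Classify two sample points lying on opposite sides of the hyperplane.
labels = [f(np.array(p), w, b) for p in [(2.0, 0.0), (0.0, 2.0)]]
```

Training an SVM amounts to choosing `w` and `b`; the classifier itself stays this simple.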
Maximum Margin Classifier

w b  ˆ
x (1) 
w w

w b  ˆ
x (2) 
w w
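Maximum Margin Classifier
\[ \gamma^{(i)} = y^{(i)} (w^T x^{(i)} + b), \qquad \hat{\gamma} = \min_i \gamma^{(i)} \]

• Let us select one point on each of the two margin hyperplanes:
\[ w^T x^{(1)} + b = \hat{\gamma}, \qquad w^T x^{(2)} + b = -\hat{\gamma} \]
• Take the point of each hyperplane closest to the origin ($x = 0$), which lies along the unit normal $w / \|w\|$. The distance from each hyperplane to the origin then gives
\[ x^{(1)} = -\frac{w}{\|w\|} \cdot \frac{b - \hat{\gamma}}{\|w\|}, \qquad x^{(2)} = -\frac{w}{\|w\|} \cdot \frac{b + \hat{\gamma}}{\|w\|} \]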
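Maximum Margin Classifier
[Figure: the two margin hyperplanes through $x^{(1)}$ and $x^{(2)}$, with the margin between them]
The margin is the distance between the two hyperplanes:
\[ \|x^{(1)} - x^{(2)}\| = \frac{2\hat{\gamma}}{\|w\|} \]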
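The margin width can be checked numerically. The sketch below assumes NumPy and arbitrary illustrative values for `w`, `b` and `gamma_hat`; it builds the point of each margin hyperplane closest to the origin and measures the distance between them, which should come out as 2·gamma_hat / ||w||.

```python
import numpy as np

w = np.array([3.0, 4.0])   # assumed normal vector, ||w|| = 5
b = 2.0                    # assumed offset
gamma_hat = 1.0            # assumed functional margin

norm_w = np.linalg.norm(w)
unit = w / norm_w          # unit normal of both hyperplanes

# Points of the two margin hyperplanes closest to the origin.
x1 = unit * (gamma_hat - b) / norm_w     # satisfies w^T x + b = +gamma_hat
x2 = unit * (-gamma_hat - b) / norm_w    # satisfies w^T x + b = -gamma_hat

# Distance between the two hyperplanes.
margin = np.linalg.norm(x1 - x2)
```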
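Then
\[ \max_{\hat{\gamma}, w, b} \frac{2\hat{\gamma}}{\|w\|} \quad \text{is equivalent to} \quad \min_{\hat{\gamma}, w, b} \frac{\|w\|}{2\hat{\gamma}} \]

Note: we have constraints
\[ \text{s.t.} \quad w^T x^{(i)} + b \ge \hat{\gamma}, \quad 1 \le i \le k \]
\[ w^T x^{(j)} + b \le -\hat{\gamma}, \quad k < j \le m \]
which are equivalent to
\[ y^{(i)} (w^T x^{(i)} + b) \ge \hat{\gamma}, \quad 1 \le i \le m \]

Scaling w and b so that $\hat{\gamma} = 1$, and minimizing $\frac{1}{2}\|w\|^2$ in place of $\frac{1}{2}\|w\|$ (same minimizer, but differentiable everywhere):
\[ \min_{w, b} \frac{1}{2}\|w\|^2 \]
\[ \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad 1 \le i \le m \]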
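The rescaling step can be illustrated with a small sketch. It assumes NumPy, a toy separable data set and an arbitrary separating hyperplane (all made up for illustration). Dividing `w` and `b` by the functional margin makes the smallest constraint value exactly 1 without changing the geometric margin.

```python
import numpy as np

def functional_margin(w, b, X, y):
    """Smallest value of y_i (w^T x_i + b) over the training set."""
    return float(np.min(y * (X @ w + b)))

# Toy separable data (assumed for illustration).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([5.0, 0.0]), 0.0          # some separating hyperplane

gamma_hat = functional_margin(w, b, X, y)
w_s, b_s = w / gamma_hat, b / gamma_hat   # rescaled so that gamma_hat becomes 1

# The geometric margin gamma_hat / ||w|| is unchanged by the rescaling.
geo_before = gamma_hat / np.linalg.norm(w)
geo_after = functional_margin(w_s, b_s, X, y) / np.linalg.norm(w_s)
```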
Lagrange duality
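For the problem:
\[ \min_{w, b} \frac{1}{2}\|w\|^2 \]
\[ \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad 1 \le i \le m \]

We can write the Lagrangian form:
\[ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right] \]
\[ \text{s.t.} \quad \alpha_i \ge 0, \quad 1 \le i \le m \]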
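Let us review the generalized Lagrangian
\[ \min_w f(w) \]
\[ \text{s.t.} \quad g_i(w) \le 0, \quad 1 \le i \le k \]
\[ h_j(w) = 0, \quad 1 \le j \le l \]

The generalized Lagrangian is:
\[ L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{l} \beta_j h_j(w) \]
\[ \text{s.t.} \quad \alpha_i \ge 0 \]

Let us consider
\[ R_P(w) = \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) \]

Note: the constraints must be satisfied; otherwise, max L will be infinite.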
Let us review the generalized Lagrangian
If the constraints are satisfied, then we must have
\[ \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = f(w) \]

Now you can see that max L takes the same value as the objective of our problem, f(w).
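A one-dimensional toy problem makes this concrete. The sketch below assumes the objective f(w) = w² and a single constraint g(w) = 1 − w ≤ 0 (both chosen only for illustration). The inner maximization over the multiplier returns the objective when the constraint holds and infinity otherwise.

```python
import math

def f(w):
    return w * w            # toy objective (assumed)

def g(w):
    return 1.0 - w          # toy constraint g(w) <= 0, i.e. w >= 1

def R_P(w):
    """max over alpha >= 0 of L(w, alpha) = f(w) + alpha * g(w)."""
    if g(w) > 0:            # constraint violated: growing alpha makes L unbounded
        return math.inf
    return f(w)             # constraint satisfied: the best alpha gives alpha * g(w) = 0

values = [R_P(2.0), R_P(1.0), R_P(0.5)]
```

Minimizing `R_P` over `w` is therefore the same as minimizing `f` over the feasible set.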
Therefore, we can consider the minimization problem
\[ \min_w R_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) \]

Let us define the optimal value of the primal problem as p*.
Then, let us define the dual problem. The two are similar:
\[ R_D(\alpha, \beta) = \min_w L(w, \alpha, \beta) \]
\[ \max_{\alpha, \beta:\, \alpha_i \ge 0} R_D(\alpha, \beta) = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \]

Now we define the optimal value of the dual problem as d*.
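Relationship between Primal and Dual problems
\[ d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^* \]
Why? The max-min of any function is never larger than its min-max.
Just remember it.

Under some conditions, d* = p*, and we can solve the dual problem in lieu of the primal problem.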
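The inequality can be checked numerically on a toy problem. The sketch below assumes f(w) = w² with the constraint 1 − w ≤ 0 (an illustrative choice) and evaluates the Lagrangian on finite grids. The cap on alpha stands in for the unbounded maximization, so the result is only an approximation of p* and d*.

```python
import numpy as np

def lagrangian(w, alpha):
    """L(w, alpha) = w^2 + alpha * (1 - w) for the toy problem."""
    return w**2 + alpha * (1.0 - w)

ws = np.linspace(-3.0, 3.0, 601)       # grid over w
alphas = np.linspace(0.0, 5.0, 501)    # grid over alpha >= 0 (capped at 5)

# L[i, j] = L(ws[i], alphas[j])
L = lagrangian(ws[:, None], alphas[None, :])

d_star = L.min(axis=0).max()   # max over alpha of (min over w)
p_star = L.max(axis=1).min()   # min over w of (max over alpha)
```

For this convex toy problem both values come out close to 1, so the duality gap vanishes; in general only `d_star <= p_star` is guaranteed.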

What are the conditions?
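The famous KKT conditions
Karush-Kuhn-Tucker conditions

\[ \frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i \in [1, n] \]
\[ \frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i \in [1, l] \]
\[ \alpha_i^* g_i(w^*) = 0, \quad i \in [1, k] \]
\[ g_i(w^*) \le 0, \quad i \in [1, k] \]
\[ \alpha_i^* \ge 0, \quad i \in [1, k] \]
What does it imply?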
The famous KKT conditions
Karush-Kuhn-Tucker conditions
\[ \alpha_i^* g_i(w^*) = 0, \quad i \in [1, k] \]
\[ g_i(w^*) < 0 \;\Rightarrow\; \alpha_i^* = 0 \qquad \text{Very important!} \]
\[ g_i(w^*) = 0 \;\Rightarrow\; \alpha_i^* > 0 \;\text{???} \]
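A small worked example may help here. It assumes a one-dimensional toy problem, minimizing w² subject to 1 − w ≤ 0, chosen only for illustration, and solves the KKT conditions by hand:

```latex
\begin{aligned}
&\min_w \; w^2 \quad \text{s.t.} \quad g(w) = 1 - w \le 0, \qquad L(w, \alpha) = w^2 + \alpha (1 - w) \\
&\text{Stationarity: } \tfrac{\partial L}{\partial w} = 2w - \alpha = 0 \;\Rightarrow\; w = \alpha / 2 \\
&\text{Complementary slackness: } \alpha (1 - w) = 0 \\
&\alpha = 0 \;\Rightarrow\; w = 0,\ g(0) = 1 > 0 \quad \text{(infeasible, rejected)} \\
&\alpha > 0 \;\Rightarrow\; w^* = 1,\ \alpha^* = 2,\ g(w^*) = 0 \quad \text{(constraint active)}
\end{aligned}
```

The multiplier is nonzero only because the constraint is active at the optimum. In the SVM setting, the training points with nonzero multipliers are the support vectors.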
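Return to our problem
\[ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right] \]
\[ \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad 1 \le i \le m \]

Let us first minimize L with respect to w and b:
\[ \nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \]
\[ \frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \]

Substitute the two equations back into L(w, b, α).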
We have
\[ L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} \]

Then we have the maximization problem with respect to α:
\[ \max_{\alpha} L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \]
\[ \text{s.t.} \quad \alpha_i \ge 0, \quad i \in [1, m] \]
\[ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \]

Now we have only one parameter: α.
We can solve for it, then recover w, and then b, because:
\[ b^* = -\frac{\max_{i:\, y^{(i)} = -1} w^{*T} x^{(i)} + \min_{i:\, y^{(i)} = 1} w^{*T} x^{(i)}}{2} \]
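The chain from α to w to b can be sketched as follows, assuming NumPy and a two-point toy data set whose dual optimum α = (0.5, 0.5) is easy to derive by hand. The helper names are hypothetical and the code does not solve the dual; it only evaluates the formulas above at a given α.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>."""
    K = X @ X.T                      # Gram matrix of inner products
    v = alpha * y
    return float(alpha.sum() - 0.5 * v @ K @ v)

def recover_w(alpha, X, y):
    """w = sum_i alpha_i y_i x_i."""
    return (alpha * y) @ X

def recover_b(w, X, y):
    """b* = -(max_{y=-1} w^T x + min_{y=+1} w^T x) / 2."""
    s = X @ w
    return float(-(s[y == -1].max() + s[y == 1].min()) / 2.0)

# Two-point toy set (assumed); alpha = (0.5, 0.5) maximizes the dual here.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

w = recover_w(alpha, X, y)
b = recover_b(w, X, y)
dual_value = dual_objective(alpha, X, y)
primal_value = 0.5 * float(w @ w)
```

At the optimum the dual value equals the primal value ½‖w‖², which is the d* = p* situation described earlier.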
How to predict
For a new sample x, we can predict it by:
\[ w^T x + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \]
Non-separable case
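What is the non-separable case? I will not give an example; I suppose you know it.

Then what is the optimization problem?
\[ \min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \]
\[ \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i, \quad 1 \le i \le m \]
\[ \xi_i \ge 0, \quad 1 \le i \le m \]

Next, form the Lagrangian:
\[ L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i \]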
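To make the role of the slack variables concrete, the sketch below assumes NumPy, a toy data set with one point on the wrong side, and fixed illustrative values for `w`, `b` and `C`. It computes the smallest feasible slack for each point and the resulting soft-margin objective.

```python
import numpy as np

def slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i (w^T x_i + b) >= 1 - xi_i."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def primal_objective(w, b, X, y, C):
    """1/2 ||w||^2 + C * sum_i xi_i, using the smallest feasible slacks."""
    return float(0.5 * w @ w + C * slacks(w, b, X, y).sum())

# Toy data (assumed); the last point lies on the wrong side of the hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, C = np.array([1.0, 0.0]), 0.0, 10.0

xi = slacks(w, b, X, y)
obj = primal_objective(w, b, X, y, C)
```

A slack between 0 and 1 marks a point inside the margin but correctly classified; a slack above 1 marks a misclassified point. `C` sets how heavily both are penalized.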
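Dual form
\[ \max_{\alpha} L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \]
\[ \text{s.t.} \quad 0 \le \alpha_i \le C, \quad i \in [1, m] \]
\[ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \]
What is the difference from the previous form??!!

Also note the following conditions:
\[ \alpha_i = 0 \;\Rightarrow\; y^{(i)} (w^T x^{(i)} + b) \ge 1 \]
\[ \alpha_i = C \;\Rightarrow\; y^{(i)} (w^T x^{(i)} + b) \le 1 \]
\[ 0 < \alpha_i < C \;\Rightarrow\; y^{(i)} (w^T x^{(i)} + b) = 1 \]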
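These three conditions are what a solver checks to decide whether a candidate α is optimal. A minimal sketch, assuming NumPy, a numerical tolerance `tol`, and precomputed margins y_i (w^T x_i + b); the function name and the example values are hypothetical:

```python
import numpy as np

def kkt_violations(alpha, margins, C, tol=1e-6):
    """Boolean mask of points that violate the conditions listed above."""
    at_zero = alpha <= tol
    at_C = alpha >= C - tol
    inside = ~at_zero & ~at_C
    bad_zero = at_zero & (margins < 1 - tol)              # alpha = 0 needs margin >= 1
    bad_C = at_C & (margins > 1 + tol)                    # alpha = C needs margin <= 1
    bad_inside = inside & (np.abs(margins - 1) > tol)     # 0 < alpha < C needs margin = 1
    return bad_zero | bad_C | bad_inside

# Illustrative values (assumed, not from a trained model).
alpha = np.array([0.0, 0.3, 1.0, 0.0])
margins = np.array([1.7, 1.0, 0.4, 0.8])
violated = kkt_violations(alpha, margins, C=1.0)
```

Here only the last point violates its condition: its multiplier is zero although its margin is below 1.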
How to train SVM = how to solve the optimization problem
Sequential minimal optimization (SMO) algorithm, due to John Platt.

First, let us introduce the coordinate ascent algorithm:
Loop until convergence:
{
    For i = 1, …, m
    {
        α_i := argmax_{α̂_i} L(α_1, …, α_{i-1}, α̂_i, α_{i+1}, …, α_m)
    }
}
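The loop above can be sketched on an unconstrained toy objective. The example assumes NumPy and a concave quadratic W(a) = cᵀa − ½ aᵀQa with symmetric positive definite Q, chosen for illustration because each one-dimensional argmax then has a closed form. It is not the SVM dual, which carries extra constraints.

```python
import numpy as np

def coordinate_ascent(Q, c, n_sweeps=100):
    """Maximize W(a) = c^T a - 1/2 a^T Q a one coordinate at a time.

    Q is assumed symmetric positive definite, so each one-dimensional
    argmax has the closed form used below.
    """
    a = np.zeros_like(c)
    for _ in range(n_sweeps):              # fixed number of sweeps instead of a convergence test
        for i in range(len(c)):
            # Setting dW/da_i = c_i - sum_j Q_ij a_j to zero and solving for a_i.
            rest = Q[i] @ a - Q[i, i] * a[i]
            a[i] = (c[i] - rest) / Q[i, i]
    return a

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, 1.0])
a_star = coordinate_ascent(Q, c)
```

For this objective the iterates converge to the solution of Qa = c, the unconstrained maximizer.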
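Is coordinate ascent OK?
\[ \alpha_1 y^{(1)} = -\sum_{i=2}^{m} \alpha_i y^{(i)} \]
\[ \alpha_1 = -y^{(1)} \sum_{i=2}^{m} \alpha_i y^{(i)} \]
Is it OK? No: with α_2, …, α_m held fixed, the constraint determines α_1 exactly, so a single coordinate cannot be changed on its own.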
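Plain coordinate ascent does not work for the SVM dual, because the equality constraint ties each multiplier to the others. The sketch below assumes NumPy and illustrative labels and multipliers; it computes the value of α_1 that the constraint forces once the remaining multipliers are fixed.

```python
import numpy as np

def alpha1_from_rest(alpha, y):
    """Value of alpha_1 forced by sum_i alpha_i y_i = 0 when alpha_2..alpha_m are fixed."""
    return float(-y[0] * np.sum(alpha[1:] * y[1:]))

y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.3, 0.5, 0.4, 0.2])   # satisfies the constraint: 0.3 - 0.5 + 0.4 - 0.2 = 0
forced = alpha1_from_rest(alpha, y)
```

The forced value equals the current α_1, so updating one coordinate alone leaves no room to move. At least two multipliers have to change together.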
SMO
Change the algorithm as follows; this is just SMO:
Repeat until convergence
{
    1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
    2. Reoptimize L(α) with respect to α_i and α_j, while holding all the other α's fixed.
}

\[ \alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \]
\[ \alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)} \]
\[ L(\alpha) = L((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \dots, \alpha_m) \]
SMO(2)
\[ L(\alpha) = L((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \dots, \alpha_m) \]

This is a quadratic function in α_2, i.e. it can be written as:
\[ a \alpha_2^2 + b \alpha_2 + c \]
Solving α_2
For the quadratic function, we can simply solve it by setting its derivative to zero. Let $\alpha_2^{new,unclipped}$ denote the resulting value. Since α_2 must stay within its feasible bounds L ≤ α_2 ≤ H, we clip it:
\[ \alpha_2^{new} = \begin{cases} H & \text{if } \alpha_2^{new,unclipped} > H \\ \alpha_2^{new,unclipped} & \text{if } L \le \alpha_2^{new,unclipped} \le H \\ L & \text{if } \alpha_2^{new,unclipped} < L \end{cases} \]

Having found α_2, we can go back to find the optimal α_1.
Please read Platt’s paper if you want to read more details.
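The clipping step can be sketched as follows. The bounds L and H are not spelled out on the slides; the formulas below follow the box constraint 0 ≤ α ≤ C combined with the linear constraint, as given in Platt's SMO paper. The function names are hypothetical, and the computation of the unclipped optimum itself is omitted.

```python
def bounds(alpha1, alpha2, y1, y2, C):
    """Feasible interval [L, H] for the new alpha_2."""
    if y1 != y2:
        return max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    return max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)

def clip(alpha2_unclipped, L, H):
    """Project the unconstrained optimum back onto [L, H]."""
    return min(H, max(L, alpha2_unclipped))

def update_alpha1(alpha1, alpha2_old, alpha2_new, y1, y2):
    """Move alpha_1 so that alpha_1 y_1 + alpha_2 y_2 stays constant."""
    return alpha1 + y1 * y2 * (alpha2_old - alpha2_new)

# Illustrative step (assumed values): the unclipped optimum 1.3 exceeds H.
a1, a2, y1, y2, C = 0.2, 0.6, 1.0, -1.0, 1.0
L, H = bounds(a1, a2, y1, y2, C)
a2_new = clip(1.3, L, H)
a1_new = update_alpha1(a1, a2, a2_new, y1, y2)
```

After the update both multipliers remain in [0, C] and the sum α_1 y_1 + α_2 y_2 is unchanged.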
Kernel
1. Why kernel?
2. What is feature space mapping?

\[ x \mapsto \phi(x) \]
\[ K(x, z) = \phi(x)^T \phi(z) \qquad \text{(kernel function)} \]
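As an illustration of the definition, the sketch below assumes NumPy and uses the quadratic kernel K(x, z) = (xᵀz)², a standard textbook example that is not taken from the slides. For two-dimensional inputs its feature map lists all degree-2 monomials, and the kernel value matches the inner product of the mapped vectors.

```python
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input: all degree-2 monomials x_i * x_j."""
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

def K(x, z):
    """Kernel computed directly in input space: (x^T z)^2."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

via_mapping = float(phi(x) @ phi(z))   # inner product in feature space
via_kernel = K(x, z)                   # same value without building phi
```

For n-dimensional inputs the mapped vectors have n² entries, while the kernel costs only one n-dimensional inner product.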

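With kernel, what’s more interesting to us?
We can compute the kernel without calculating the mapping
Replace all $\langle x^{(i)}, x \rangle$ by $\langle \phi(x^{(i)}), \phi(x) \rangle$

Training:
\[ \max_{\alpha} L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle \]
\[ \text{s.t.} \quad \alpha_i \ge 0, \quad i \in [1, m] \]
\[ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \]

Prediction:
\[ w^T \phi(x) + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} \phi(x^{(i)}) \right)^T \phi(x) + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b \]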

Now, we need to compute φ(x) first. That may be expensive.
But with a kernel, we can skip this step.
Why?
Because, in both training and testing, the data appear only in inner products ⟨x, z⟩.

\[ K(x, z) = \phi(x)^T \phi(z) \]
For example:
\[ K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right) \]
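A minimal sketch of the Gaussian kernel and of a kernelized decision function, assuming NumPy. The multipliers and offset in the example are illustrative values, not the result of training, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))

def kernel_decision_value(x, alpha, X, y, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b; the feature map phi is never evaluated."""
    return float(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b)

# Illustrative values (assumed, not trained).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
b = 0.0

score = kernel_decision_value(np.array([1.0, 0.0]), alpha, X, y, b, gaussian_kernel)
```

The feature space of the Gaussian kernel is infinite-dimensional, so here the kernel form is the only practical way to evaluate the decision function.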
References
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1998.
• Andrew Ng. CS229 Lecture Notes, Part V: Support Vector Machines. Lectures from 10/19/03 to 10/26/03.
• Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121–167 (1998). Kluwer Academic Publishers, Boston.
• N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
People
• Vladimir Vapnik.
• J. Platt
• J. Platt, N. Cristianini, J. Shawe-Taylor
• Shawe-Taylor, J.
• Burges, C. J. C.
• Thorsten Joachims
• Etc.