
Support Vector Machines

(SVM)
Lecture-11
Slides Credits
Saad Ali, Andrew Zisserman, Guo-Jun QI,
many more
Application
Pattern recognition
Object classification/detection
Usage
The classifier must be trained using a set of positive and negative examples.
The classifier learns the regularities in the data.
After training, the classifier is capable of classifying an unknown example with a high degree of accuracy.
Human Detection
Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection", CVPR 2005.
Blocks, Cells
16x16 blocks with 50% overlap.
7x15 = 105 blocks in total.
Each block consists of 2x2 cells of size 8x8.
[Figure: two overlapping 16x16 blocks (Block 1, Block 2) and their 8x8 cells]
Each block consists of 2x2 cells of size 8x8.
Quantize the gradient orientation into 9 bins (0-180 degrees).
The vote is the gradient magnitude.
Interpolate votes linearly between neighboring bin centers.
Example: if θ = 85 degrees, the distances to the bin centers at 70 and 90 degrees are 15 and 5 degrees, respectively.
Hence the vote is split in the ratio 5/20 = 1/4 to bin 70 and 15/20 = 3/4 to bin 90.
The votes can also be weighted with a Gaussian to down-weight the pixels near the edges of the block.
[Figure: histogram of votes over the 9 orientation bins, marked at the bin centers]
Final Feature Vector
Concatenate the histograms of all blocks.
This gives a 1D vector of length 105 blocks x 4 cells x 9 bins = 3780.
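As a concrete check of the numbers above, here is a minimal Python sketch. It assumes the standard 64x128 detection window, 16x16 blocks with an 8-pixel stride, 2x2 cells of 8x8 pixels, and 9 unsigned orientation bins with centers at 10, 30, ..., 170 degrees; the helper split_vote is illustrative, not taken from any library.

```python
import numpy as np

# Descriptor length for a 64x128 window: 16x16 blocks with an 8-pixel stride
blocks_x = (64 - 16) // 8 + 1            # 7 block positions horizontally
blocks_y = (128 - 16) // 8 + 1           # 15 block positions vertically
cells_per_block = 2 * 2
bins = 9
print(blocks_x * blocks_y * cells_per_block * bins)   # 7 * 15 * 4 * 9 = 3780

def split_vote(theta, magnitude, bin_width=20.0):
    """Split a gradient vote linearly between the two neighboring bin centers
    (centers assumed at 10, 30, ..., 170 degrees; wraparound at 0/180 ignored)."""
    lower_center = np.floor((theta - bin_width / 2) / bin_width) * bin_width + bin_width / 2
    upper_center = lower_center + bin_width
    w_upper = (theta - lower_center) / bin_width     # the closer bin gets the larger share
    w_lower = 1.0 - w_upper
    return {lower_center: w_lower * magnitude, upper_center: w_upper * magnitude}

print(split_vote(85.0, 1.0))   # {70.0: 0.25, 90.0: 0.75}, matching the example above
```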
Visualization
Slides Credits
Andrew Zisserman
Linear Classifiers
Label y: denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
    w x + b > 0 : predict +1
    w x + b < 0 : predict -1
How would you classify this data?
Slide Credits: Guo-Jun QI
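A minimal sketch of this decision rule; the weight vector and bias below are arbitrary, chosen only to illustrate the sign test.

```python
import numpy as np

def linear_classify(x, w, b):
    """f(x, w, b) = sign(w . x + b): returns +1 or -1."""
    return 1 if w @ x + b > 0 else -1

w = np.array([1.0, -1.0])     # illustrative parameters, not learned from data
b = 0.5
print(linear_classify(np.array([2.0, 0.0]), w, b))   # +1
print(linear_classify(np.array([0.0, 3.0]), w, b))   # -1
```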
Linear Classifiers
denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
How would you classify this data?
Slide Credits: Guo-Jun QI
Linear Classifiers
denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
Any of these would be fine... but which is the best?
Slide Credits: Guo-Jun QI
Linear Classifiers
denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
How would you classify this data? (One point is misclassified to the +1 class.)
Slide Credits: Guo-Jun QI
Classifier Margin
denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.
Slide Credits: Guo-Jun QI
Maximum Margin
denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those data points that the margin pushes up against.
1. Maximizing the margin makes sense according to intuition.
2. It implies that only the support vectors are important; other training examples can be discarded without affecting the training result.
Slide Credits: Guo-Jun QI
Maximum Margin
denotes +1, denotes -1
f(x, w, b) = sign(w x + b),  with parameters w, b
Linear SVM: keeping only the support vectors will not change the maximum margin classifier.
The classifier is robust to small changes (noise) in the non-support vectors.
Slide Credits: Guo-Jun QI
Linear Classifier
A binary classifier performs the task of separating classes in feature space:
    decision boundary: w^T x + b = 0
    w^T x + b > 0 on one side, w^T x + b < 0 on the other
    f(x) = sign(w^T x + b)
Linear Classifier
Finding the vector perpendicular to the plane (line) w^T x + b = 0:
Take any two points X' and X'' lying on the hyperplane:
    w^T X' + b = 0
    w^T X'' + b = 0
Subtracting the two equations:
    (w^T X' + b) - (w^T X'' + b) = 0
    w^T X' - w^T X'' = 0
    w^T (X' - X'') = 0
Since X' - X'' is an arbitrary vector lying in the hyperplane, w is perpendicular to the line (hyperplane).
Linear Classifier
Binary classifier: task of separating classes in feature space.
Distance from a point X to the line (hyperplane) w^T x + b = 0:
Let X_n be the projection of X onto the hyperplane, so that X - X_n is parallel to w.
Projecting X - X_n onto the unit vector w / ||w||:
    dist(X) = |w^T (X - X_n)| / ||w||
            = |w^T X - w^T X_n| / ||w||
            = |(w^T X + b) - (w^T X_n + b)| / ||w||
            = |w^T X + b| / ||w||            (since w^T X_n + b = 0)
In particular, a point on the plane w^T x + b = 1 lies at distance 1 / ||w|| from the separating hyperplane.
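A quick numerical check of the distance formula, with w and b chosen arbitrarily for illustration; a point on w^T x + b = 1 sits at distance 1/||w||, which is half of the margin derived on the next slide.

```python
import numpy as np

w = np.array([3.0, 4.0])          # arbitrary weight vector, ||w|| = 5
b = -5.0

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

x_plus = np.array([2.0, 0.0])     # w^T x_plus + b = 6 - 5 = 1, so x_plus lies on w^T x + b = 1
print(distance_to_hyperplane(x_plus, w, b))   # 0.2 = 1/||w||
print(2.0 / np.linalg.norm(w))                # 0.4 = 2/||w||, the margin width
```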
Margin
The margin of the separator is the width of separation between the classes.
Take X+ on the plane w^T x + b = 1 and X- on the plane w^T x + b = -1, with X+ - X- parallel to w:
    margin = w^T (X+ - X-) / ||w||
           = (w^T X+ - w^T X-) / ||w||
           = ((w^T X+ + b) - (w^T X- + b)) / ||w||
           = (1 - (-1)) / ||w||
           = 2 / ||w||
Linear SVM Mathematically
Goal:
1) Correctly classify all training data:
    w x_i + b ≥ +1   if y_i = +1
    w x_i + b ≤ -1   if y_i = -1
    i.e.  y_i (w x_i + b) ≥ 1  for all i
2) Maximize the margin M = 2 / ||w||, which is the same as minimizing (1/2) w^T w.
We can formulate a quadratic optimization problem and solve for w and b:
    Minimize  Φ(w) = (1/2) w^T w
    subject to  y_i (w x_i + b) ≥ 1  for all i
Linear SVM
Now we can formulate the quadratic optimization problem as follows.
Given a linearly separable training sample S = ((x_1, y_1), ..., (x_l, y_l)), the hyperplane (w, b) that solves the optimization problem
    minimize  <w · w>
    subject to  y_i (<w · x_i> + b) ≥ 1,   i = 1, ..., l
realizes the maximal margin hyperplane.
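The slides do not tie this to any particular software; as one illustration, here is a minimal scikit-learn sketch (an assumed implementation choice, not part of the lecture) that approximates the hard-margin problem with a very large C and reads off w and b.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C penalizes any slack so heavily that the fit approximates the
# hard-margin problem: minimize <w, w> subject to y_i(<w, x_i> + b) >= 1
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]                 # learned weight vector
b = clf.intercept_[0]            # learned bias
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
```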
Linear SVM
    minimize  <w · w>
    subject to  y_i (<w · x_i> + b) ≥ 1,   i = 1, ..., l
Convert the problem from the primal form into the dual form. The Lagrangian is
    L(w, b, α) = (1/2) <w · w> - Σ_{i=1..l} α_i [ y_i (<w · x_i> + b) - 1 ]
Setting the derivatives with respect to w and b to zero:
    ∂L(w*, b*, α*)/∂w = w - Σ_{i=1..l} y_i α_i x_i = 0
    ∂L(w*, b*, α*)/∂b = Σ_{i=1..l} y_i α_i = 0
Hence:
    w = Σ_{i=1..l} y_i α_i x_i
    Σ_{i=1..l} y_i α_i = 0
Linear SVM
Now we plug the newly defined  w = Σ_{i=1..l} y_i α_i x_i  into L(w, b, α):
    L(w, b, α) = (1/2) <w · w> - Σ_{i=1..l} α_i [ y_i (<w · x_i> + b) - 1 ]
               = (1/2) Σ_{j=1..l} Σ_{i=1..l} y_i y_j α_i α_j <x_i · x_j>
                 - Σ_{j=1..l} Σ_{i=1..l} y_i y_j α_i α_j <x_i · x_j> + Σ_{i=1..l} α_i
               = Σ_{i=1..l} α_i - (1/2) Σ_{j=1..l} Σ_{i=1..l} y_i y_j α_i α_j <x_i · x_j>
(the b term vanishes because Σ_i α_i y_i = 0)
Linear SVM
The dual form of the original problem is:
    maximize  W(α) = Σ_{i=1..l} α_i - (1/2) Σ_{j=1..l} Σ_{i=1..l} y_i y_j α_i α_j <x_i · x_j>
    subject to  Σ_{i=1..l} y_i α_i = 0,   α_i ≥ 0,  i = 1, ..., l.
The optimal weight vector is given by:
    w* = Σ_{i=1..l} y_i α_i* x_i
and realizes the maximal margin hyperplane with the geometric margin given by 1 / ||w*||_2.
Solving the Optimization Problem
Find α_1, ..., α_N such that
    Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j  is maximized and
    (1) Σ_i α_i y_i = 0
    (2) α_i ≥ 0 for all α_i
Slide Credits: Guo-Jun QI
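The slides leave the choice of QP solver open. One possible sketch, under the assumption that the generic solver in cvxopt is available, solves exactly this hard-margin dual and then recovers w, b and the support vectors.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """Maximize Q(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    subject to sum_i a_i y_i = 0 and a_i >= 0, written in cvxopt's
    minimizing form: min 1/2 a'Pa + q'a  s.t.  Ga <= h, Aa = b."""
    l = X.shape[0]
    K = X @ X.T                                   # Gram matrix <x_i, x_j>
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones((l, 1)))                  # maximizing sum(a) = minimizing -sum(a)
    G = matrix(-np.eye(l))                        # -a_i <= 0
    h = matrix(np.zeros((l, 1)))
    A = matrix(y.reshape(1, -1).astype(float))    # sum_i y_i a_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    w = (alpha * y) @ X                           # w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                             # non-zero a_i -> support vectors
    b_val = float(np.mean(y[sv] - X[sv] @ w))     # b = y_k - w^T x_k on support vectors
    return w, b_val, alpha
```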
Finding b
    y_k (w^T x_k + b) = 1
    (w^T x_k + b) = 1 / y_k
    (w^T x_k + b) = y_k            (since y_k = ±1)
    b = y_k - w^T x_k
The Optimization Problem Solution
The solution has the form:
    w = Σ_i α_i y_i x_i
    b = y_k - w^T x_k   for any x_k such that α_k ≠ 0
Each α_i must satisfy the Karush-Kuhn-Tucker conditions:
    α_i [ y_i (w^T x_i + b) - 1 ] = 0,  for any i
If α_i > 0, then y_i (w^T x_i + b) - 1 = 0, i.e. x_i is on the margin.
If y_i (w^T x_i + b) > 1, then α_i = 0.
Each non-zero α_i indicates that the corresponding x_i is a support vector.
Slide Credits: Guo-Jun QI
Maximum Margin
denotes +1, denotes -1
w and b depend only on the support vectors, via the active constraints:
    y_k (w^T x_k + b) - 1 = 0
Slide Credits: Guo-Jun QI
The Optimization Problem Solution
To classify a new test point x, we use:
    f(x) = w^T x + b = Σ_i α_i y_i x_i^T x + b
Recall the dual problem: find α_1, ..., α_N such that
    Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j  is maximized and
    (1) Σ_i α_i y_i = 0
    (2) α_i ≥ 0 for all α_i
Slide Credits: Guo-Jun QI
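Note that the classification rule uses the training points only through inner products with x. A tiny illustrative sketch of evaluating f(x) from a dual solution (the array names are assumptions, matching the dual solver sketch above):

```python
import numpy as np

def decision_function(x, X_train, y_train, alpha, b):
    """f(x) = sum_i alpha_i y_i <x_i, x> + b.
    Only the support vectors (alpha_i > 0) contribute to the sum."""
    return float(np.sum(alpha * y_train * (X_train @ x)) + b)

def predict(x, X_train, y_train, alpha, b):
    return int(np.sign(decision_function(x, X_train, y_train, alpha, b)))
```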
Dataset with noise
Hard margin: so far, all data points are classified correctly - no training error.
What if the training set is noisy?
Slide Credits: Guo-Jun QI
Soft Margin Classification
Previous (hard margin) constraints: y_i (w^T x_i + b) ≥ 1
Introduce slack variables ξ_i to allow misclassification:
    y_i (w^T x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0
What should our quadratic optimization criterion be? We expect the ξ_i to be small:
    Φ(w) = w^T w + C Σ_i ξ_i
Slide Credits: Guo-Jun QI
Hard Margin vs. Soft Margin
The old formulation:
    Find w and b such that Φ(w) = w^T w is minimized, and for all {(x_i, y_i)}:
        y_i (w^T x_i + b) ≥ 1
The new formulation incorporating slack variables:
    Find w and b such that Φ(w) = w^T w + C Σ_i ξ_i is minimized, and for all {(x_i, y_i)}:
        y_i (w^T x_i + b) ≥ 1 - ξ_i   and   ξ_i ≥ 0 for all i
A solution similar to that of the hard margin case can be obtained.
The parameter C can be viewed as a way to control overfitting.
Slide Credits: Guo-Jun QI
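To see the role of C concretely, here is a hedged scikit-learn sketch (again an implementation choice, not from the slides) on noisy, overlapping data; a smaller C tolerates more slack and typically yields a wider margin with more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.5, size=(50, 2)),     # two noisy, overlapping classes
               rng.normal(-2.0, 1.5, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin:.3f}, "
          f"support vectors = {len(clf.support_)}")
```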
Slide Credit Andrew Zisserman
XOR Problem
Not linearly separable.
http://wwwold.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html
XOR Problem
XOR data are not linearly separable in the original (x_1, x_2) space.
Mapping (x_1, x_2) to (x_1, x_1 x_2) makes the data linearly separable.
Slide Credits: Guo-Jun QI
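A small numeric check of this mapping, assuming the +1/-1 coding of the XOR inputs (with 0/1 coding this particular map would not work): after mapping, the second coordinate alone separates the classes.

```python
import numpy as np

# XOR with +/-1 coding: the label is +1 exactly when the two inputs differ
X = np.array([[+1, +1], [-1, -1], [+1, -1], [-1, +1]], dtype=float)
y = np.array([-1, -1, +1, +1])

# Map (x1, x2) -> (x1, x1*x2)
Z = np.column_stack([X[:, 0], X[:, 0] * X[:, 1]])
print(Z)

# In the mapped space, the hyperplane z2 = 0 (w = [0, -1], b = 0) separates the classes
print(np.sign(-Z[:, 1]) == y)     # [ True  True  True  True ]
```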
Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is linearly separable:
    Φ: x → Φ(x)
Slide Credits: Guo-Jun QI
Non-linear SVMs
If every data point is mapped into a high-dimensional space via some transformation Φ: x → Φ(x), the optimization problem is similar:
    Find α_1, ..., α_N such that
    Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)  is maximized and
    (1) Σ_i α_i y_i = 0
    (2) α_i ≥ 0 for all α_i
The classifying function is:
    f(x) = Σ_i α_i y_i Φ(x_i)^T Φ(x) + b
But it relies only on the inner product Φ(x_i)^T Φ(x).
Slide Credits: Guo-Jun QI
The Kernel Trick
The SVM relies only on inner products:
    Linear:     K(x_i, x_j) = x_i^T x_j
    Non-linear: K(x_i, x_j) = Φ(x_i)^T Φ(x_j)
Explicit feature mapping is time-consuming.
Use a kernel function that directly gives the value of the inner product; the feature mapping is not necessary in this case.
Example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)^2.
This is the inner product of Φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2].
Verify:
    K(x_i, x_j) = (1 + x_i^T x_j)^2
                = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
                = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2]
                = Φ(x_i)^T Φ(x_j)
Slide Credits: Guo-Jun QI
Worked expansion of the same kernel:
    K(x_i, x_j) = (1 + x_i · x_j)^2
                = (1 + (x_i1, x_i2) · (x_j1, x_j2))^2
                = (1 + x_i1 x_j1 + x_i2 x_j2)^2
                = 1 + 2(x_i1 x_j1 + x_i2 x_j2) + (x_i1 x_j1 + x_i2 x_j2)^2
                = 1 + x_i1^2 x_j1^2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 x_i2 x_j2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
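The identity derived above is easy to confirm numerically; a short sketch compares the kernel value with the explicit inner product of the mapped vectors (phi is the mapping given two slides earlier, and the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map whose inner product gives (1 + x . z)^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def poly_kernel(xi, xj):
    return (1.0 + xi @ xj) ** 2

xi = np.array([0.7, -1.3])
xj = np.array([2.0, 0.5])
np.testing.assert_allclose(poly_kernel(xi, xj), phi(xi) @ phi(xj))
print(poly_kernel(xi, xj), phi(xi) @ phi(xj))    # both values agree
```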
Examples of Kernel Functions
Linear: K(x_i, x_j) = x_i^T x_j
Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
Gaussian (radial-basis function network): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
Sigmoid: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)
Slide Credits: Guo-Jun QI
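For reference, the four kernels above in a compact NumPy sketch; the parameter names p, sigma, beta0, beta1 mirror the formulas and their default values are arbitrary.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```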
Non-linear SVMs Mathematically
Dual problem formulation:
    Find α_1, ..., α_N such that
    Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)  is maximized and
    (1) Σ_i α_i y_i = 0
    (2) α_i ≥ 0 for all α_i
The solution is:
    f(x) = Σ_i α_i y_i K(x_i, x) + b
Optimization techniques for finding the α_i's remain the same!
Slide Credits: Guo-Jun QI
Nonlinear SVM - Overview
The features are mapped to a high-dimensional space where the training data are separable.
The inner product is computed by a kernel function.
The optimization problem is similar to that of the linear SVM.
Slide Credits: Guo-Jun QI
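As a closing illustration (again using scikit-learn as an assumed implementation), a kernelized SVM on data that no linear classifier can separate, two concentric rings generated only for the example:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def ring(radius, n):
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius + rng.normal(0.0, 0.1, n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

X = np.vstack([ring(1.0, 100), ring(3.0, 100)])    # inner ring vs outer ring
y = np.hstack([np.ones(100), -np.ones(100)])

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))        # ~1.0: the RBF kernel separates the rings
print("support vectors:", len(clf.support_))
```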
Resources
References
Vladimir Vapnik: The Nature of Statistical Learning
Theory. Springer-Verlag, 1995.
Christopher J. C. Burges: A Tutorial on Support Vector
Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
Bernhard Scholkopf and A. J. Smola: Learning with
Kernels. 2002
A useful website: www.kernel-machines.org
Software:
LIBSVM: www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMLight: svmlight.joachims.org/
Slide Credits: Guo-Jun QI
Other Links
http://www.robots.ox.ac.uk/~az/lectures/ml/
http://en.wikipedia.org/wiki/Support_vector_machine
Andrew Ng, Part V: Support Vector Machines (lecture notes)