Support Vector Machines (SVM)
Y.H. Hu
Outline
Linear pattern classifiers and the optimal hyperplane
Optimization problem formulation
Statistical properties of optimal hyperplane
The case of nonseparable patterns
Applications to general pattern classification
Mercer's Theorem
Linear Hyperplane Classifier
[Figure: two classes of samples (○ and ×) in the (x _{1} , x _{2} ) plane, separated by a hyperplane.]
Given: {(x _{i} , d _{i} ); i = 1 to N, d _{i} ∈ {+1, −1}}.
A linear hyperplane classifier is a hyperplane consisting of points x such that
H = {x | g(x) = w ^{T} x + b = 0}
g(x): the discriminant function.
For x on the side of ○: w ^{T} x + b > 0, d = +1;
for x on the side of ×: w ^{T} x + b ≤ 0, d = −1.
Distance from x to H:
r = w ^{T} x/||w|| − (−b/||w||) = g(x)/||w||
Distance from a Point to a Hyperplane
The hyperplane H is characterized by
(*)  w ^{T} x + b = 0
w: the normal vector, perpendicular to H.
(*) says that any vector x on H, projected onto w, has length OA = −b/||w||.
[Figure: hyperplane H with normal vector w; C is a point x ^{*} at distance r from H, and its projection onto w passes through the points A and B.]
Consider a special point C corresponding to the vector x ^{*} . Its projection onto the vector w is
w ^{T} x ^{*} /||w|| = OA + BC,
or equivalently, w ^{T} x ^{*} /||w|| = r − b/||w||.
Hence r = (w ^{T} x ^{*} + b)/||w|| = g(x ^{*} )/||w||
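As a quick numerical check of this formula, here is a minimal MATLAB sketch (w, b, and the test point x are made-up values for illustration):
% Signed distance from a point x to the hyperplane w'*x + b = 0.
w = [3; 4];              % normal vector (example values)
b = -5;                  % offset (example value)
x = [2; 1];              % test point (example value)
g = w'*x + b;            % discriminant function g(x)
r = g/norm(w)            % signed distance; positive on the d = +1 side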
Optimal Hyperplane: Linearly Separable Case
[Figure: two linearly separable classes in the (x _{1} , x _{2} ) plane with the optimal hyperplane drawn in the center of the gap between them.]
The optimal hyperplane should be in the center of the gap.
Support vectors: the samples on the boundaries of the gap. The support vectors alone determine the optimal hyperplane.
Question: how do we find the optimal hyperplane?
For d _{i} = +1,  g(x _{i} ) = w ^{T} x _{i} + b ≥ r||w||  ⇒  w _{o} ^{T} x _{i} + b _{o} ≥ 1
For d _{i} = −1,  g(x _{i} ) = w ^{T} x _{i} + b ≤ −r||w||  ⇒  w _{o} ^{T} x _{i} + b _{o} ≤ −1
Separation Gap
For x _{i} a support vector:
For d _{i} = +1,  g(x _{i} ) = w ^{T} x _{i} + b = r||w||  ⇒  w _{o} ^{T} x _{i} + b _{o} = 1
For d _{i} = −1,  g(x _{i} ) = w ^{T} x _{i} + b = −r||w||  ⇒  w _{o} ^{T} x _{i} + b _{o} = −1
Hence w _{o} = w/(r||w||) and b _{o} = b/(r||w||). But the distance from x _{i} to the hyperplane is r = g(x _{i} )/||w||. Thus w _{o} = w/g(x _{i} ), and r = 1/||w _{o} ||.
The maximum distance between the two classes is
2r = 2/||w _{o} ||.
Hence the objective is to find w _{o} , b _{o} that minimize ||w _{o} || (so that r is maximized), subject to the constraints
w _{o} ^{T} x _{i} + b _{o} ≥ 1 for d _{i} = +1;
and w _{o} ^{T} x _{i} + b _{o} ≤ −1 for d _{i} = −1.
Combining these constraints, one has:
d _{i} (w _{o} ^{T} x _{i} + b _{o} ) ≥ 1
Quadratic Optimization Problem Formulation
Given {(x _{i} , d _{i} ); i = 1 to N}, find w and b such that f(w) = w ^{T} w/2 is minimized subject to the N constraints
d _{i} (w ^{T} x _{i} + b) − 1 ≥ 0,  1 ≤ i ≤ N.
Method of Lagrange Multiplier
Set
J(w, b, a) = f(w) − Σ _{i=1} ^{N} a _{i} [d _{i} (w ^{T} x _{i} + b) − 1]
∂J(w, b, a)/∂w = 0  ⇒  w = Σ _{i=1} ^{N} a _{i} d _{i} x _{i}
∂J(w, b, a)/∂b = 0  ⇒  Σ _{i=1} ^{N} a _{i} d _{i} = 0
Optimization (continued)
The solution of the Lagrange multiplier problem is at a saddle point: the minimum is sought with respect to w and b, while the maximum is sought with respect to the a _{i} .
Kuhn–Tucker condition: at the saddle point,
a _{i} [d _{i} (w ^{T} x _{i} + b) − 1] = 0  for 1 ≤ i ≤ N.
⇒ If x _{i} is NOT a support vector, the corresponding a _{i} = 0!
⇒ Hence, only the support vectors will affect the result of the optimization!
A Numerical Example
[Figure: three training samples (x, d) = (1, −1), (2, +1), (3, +1) on the real line, with the optimal boundary x = 1.5 drawn as a dashed line.]
3 inequalities: 1·w + b ≤ −1; 2w + b ≥ +1; 3w + b ≥ +1.
J = w ^{2} /2 − a _{1} (−w − b − 1) − a _{2} (2w + b − 1) − a _{3} (3w + b − 1)
∂J/∂w = 0  ⇒  w = −a _{1} + 2a _{2} + 3a _{3}
∂J/∂b = 0  ⇒  0 = −a _{1} + a _{2} + a _{3}
Solve: (a) −w − b − 1 = 0; (b) 2w + b − 1 = 0; (c) 3w + b − 1 = 0.
(b) and (c) conflict with each other. Solving (a) and (b) yields w = 2, b = −3. From the Kuhn–Tucker condition, a _{3} = 0. Thus a _{1} = a _{2} = 2. Hence the decision boundary is 2x − 3 = 0, or x = 1.5. This is shown as the dashed line in the figure above.
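To double-check this by machine, here is a minimal MATLAB sketch that solves the primal QP for this example with quadprog (Optimization Toolbox); the variable ordering z = [w; b] is our own choice for illustration:
% Primal QP: minimize w^2/2 subject to d_i*(w*x_i + b) >= 1,
% rewritten as -d_i*x_i*w - d_i*b <= -1 for quadprog's A*z <= b form.
x = [1; 2; 3]; d = [-1; 1; 1];
H = [1 0; 0 0];                  % cost: z'*H*z/2 = w^2/2
f = [0; 0];
A = [-d.*x, -d]; bvec = -ones(3,1);
z = quadprog(H, f, A, bvec)      % returns z = [2; -3], i.e. w = 2, b = -3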
Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem can be formulated in which the Lagrange multipliers provide the solution.
Duality Theorem (Bertsekas 1995)
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) In order for w _{o} to be an optimal primal solution and a _{o} to be an optimal dual solution, it is necessary and sufficient that w _{o} is feasible for the primal problem and
F(w _{o} ) = J(w _{o} , b _{o} , a _{o} ) = min _{w} J(w, b _{o} , a _{o} )
Formulating the Dual Problem
With
J(w, b, a) = (1/2) w ^{T} w − Σ _{i=1} ^{N} a _{i} d _{i} w ^{T} x _{i} − b Σ _{i=1} ^{N} a _{i} d _{i} + Σ _{i=1} ^{N} a _{i}
and
w = Σ _{i=1} ^{N} a _{i} d _{i} x _{i} ,  Σ _{i=1} ^{N} a _{i} d _{i} = 0.
These lead to the
Dual Problem
Maximize
Q(a) = Σ _{i=1} ^{N} a _{i} − (1/2) Σ _{i=1} ^{N} Σ _{j=1} ^{N} a _{i} a _{j} d _{i} d _{j} x _{i} ^{T} x _{j}
Subject to: Σ _{i=1} ^{N} a _{i} d _{i} = 0 and a _{i} ≥ 0 for i = 1, 2, …, N.
Note
Q(a) = Σ _{i=1} ^{N} a _{i} − (1/2) [a _{1} d _{1} … a _{N} d _{N} ] [x _{1} ^{T} x _{1} … x _{1} ^{T} x _{N} ; … ; x _{N} ^{T} x _{1} … x _{N} ^{T} x _{N} ] [a _{1} d _{1} ; … ; a _{N} d _{N} ]
where the middle matrix is the N × N Gram matrix of inner products of the training vectors.
Numerical Example (cont’d)
or
Q(a) = Σ _{i=1} ^{3} a _{i} − (1/2) [a _{1} a _{2} a _{3} ] [1 −2 −3; −2 4 6; −3 6 9] [a _{1} ; a _{2} ; a _{3} ]
Q(a) = a _{1} + a _{2} + a _{3} − [0.5a _{1} ^{2} + 2a _{2} ^{2} + 4.5a _{3} ^{2} − 2a _{1} a _{2} − 3a _{1} a _{3} + 6a _{2} a _{3} ]
subject to the constraints:
−a _{1} + a _{2} + a _{3} = 0, and
a _{1} ≥ 0, a _{2} ≥ 0, and a _{3} ≥ 0.
Use the MATLAB Optimization Toolbox command:
x = fmincon('qalpha', X0, A, B, Aeq, Beq)
The solution is [a _{1} a _{2} a _{3} ] = [2 2 0], as expected.
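Equivalently, one can call quadprog directly on this dual (a minimal sketch; quadprog minimizes, so we negate Q(a)):
% Dual QP for the example: maximize Q(a) = sum(a) - a'*H*a/2
% subject to -a1 + a2 + a3 = 0 and a >= 0.
H = [1 -2 -3; -2 4 6; -3 6 9];
f = -ones(3,1);                  % minimize -sum(a) + a'*H*a/2
Aeq = [-1 1 1]; beq = 0;
lb = zeros(3,1);
a = quadprog(H, f, [], [], Aeq, beq, lb)   % returns a = [2; 2; 0]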
Implication of Minimizing w
Let D denote the diameter of the smallest hyperball that encloses all the input training vectors {x _{1} , x _{2} , …, x _{N} }. The set of optimal hyperplanes described by the equation
w _{o} ^{T} x + b _{o} = 0
has a VC dimension h bounded from above as
h ≤ min{⌈D ^{2} /ρ ^{2} ⌉, m _{0} } + 1
where m _{0} is the dimension of the input vectors, and ρ = 2/||w _{o} || is the margin of separation of the hyperplanes.
The VC dimension determines the complexity of the classifier structure; usually, the smaller the better.
Nonseparable Cases
Recall that in the linearly separable case, each training sample pair (x _{i} , d _{i} ) represents a linear inequality constraint
d _{i} (w ^{T} x _{i} + b) ≥ 1,  i = 1, 2, …, N.
If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:
d _{i} (w ^{T} x _{i} + b) ≥ 1 − z _{i} ,  i = 1, 2, …, N.
{z _{i} ; 1 ≤ i ≤ N} are known as slack variables. If z _{i} > 1, then the corresponding (x _{i} , d _{i} ) will be misclassified.
The minimum-error classifier would minimize Σ _{i=1} ^{N} I(z _{i} − 1), where I(·) is the step function (1 for a positive argument, 0 otherwise), but this count is nonconvex w.r.t. w. Hence an approximation is to minimize
F(w, z) = (1/2) w ^{T} w + C Σ _{i=1} ^{N} z _{i}
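As a sketch in MATLAB (with X, d, w, b, C as in the code slide later): the smallest feasible slacks are z _{i} = max(0, 1 − d _{i} (w ^{T} x _{i} + b)), so F(w, z) can be evaluated as
% Soft-margin objective with the smallest feasible slack variables.
z = max(0, 1 - d.*(X*w + b));    % z_i = max(0, 1 - d_i*(w'*x_i + b))
F = 0.5*(w'*w) + C*sum(z)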
Primal and Dual Problem Formulation
Primal Optimization Problem: Given {(x _{i} , d _{i} ); 1 ≤ i ≤ N}, find w, b such that
F(w, z) = (1/2) w ^{T} w + C Σ _{i=1} ^{N} z _{i}
is minimized subject to the constraints (i) z _{i} ≥ 0, and (ii) d _{i} (w ^{T} x _{i} + b) ≥ 1 − z _{i} for i = 1, 2, …, N.
Dual Optimization Problem: Given {(x _{i} , d _{i} ); 1 ≤ i ≤ N}, find the Lagrange multipliers {a _{i} ; 1 ≤ i ≤ N} such that
Q(a) = Σ _{i=1} ^{N} a _{i} − (1/2) Σ _{i=1} ^{N} Σ _{j=1} ^{N} a _{i} a _{j} d _{i} d _{j} x _{i} ^{T} x _{j}
is maximized subject to the constraints (i) 0 ≤ a _{i} ≤ C (C: a user-specified positive number) and (ii) Σ _{i=1} ^{N} a _{i} d _{i} = 0.
Solution to the Dual Problem
The optimal solution to the dual problem is
w _{o} = Σ _{i=1} ^{N _{s} } a _{o,i} d _{i} x _{i}   (N _{s} : the number of support vectors)
The Kuhn–Tucker condition implies that for i = 1, 2, …, N,
(i) a _{i} [d _{i} (w ^{T} x _{i} + b) − 1 + z _{i} ] = 0   (*)
(ii) m _{i} z _{i} = 0
{m _{i} ; 1 ≤ i ≤ N} are the Lagrange multipliers enforcing the condition z _{i} ≥ 0. At the optimal point of the primal problem, ∂F/∂z _{i} = 0. One may deduce that z _{i} = 0 if a _{i} < C.
Solving (*) over the support vectors, we have
b _{o} = (1/N _{s} ) Σ _{i: a _{i} > 0} (d _{i} − w _{o} ^{T} x _{i} )
Matlab Implementation
% svm1.m: basic support vector machine
% X: N by m matrix. ith row is x_i
% d: N by 1 vector. ith element is d_i
% X, d should be loaded from file or read from input.
% call MATLAB Optimization Toolbox function fmincon
[N,m] = size(X);
a0 = eps*ones(N,1);              % initial guess for the multipliers
C = 1;                           % upper bound on each a_i
a = fmincon('qfun',a0,[],[],d',0,zeros(N,1),C*ones(N,1),...
    [],[],X,d)                   % solve the dual: d'*a = 0, 0 <= a_i <= C
sv = a > 10*eps;                 % support vectors: a_i > 0 (up to roundoff)
wo = X'*(a.*d)                   % w_o = sum_i a_i d_i x_i
bo = sum(d(sv) - X(sv,:)*wo)/sum(sv)   % average of d_i - w_o'*x_i over the SVs

function y = qfun(a,X,d)
% the -Q(a) function. Note that it is actually -Q(a), because calling
% fmincon to minimize -Q(a) is the same as maximizing Q(a).
[N,m] = size(X);
y = -ones(1,N)*a + 0.5*a'*diag(d)*(X*X')*diag(d)*a;
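On recent MATLAB releases, which no longer pass extra parameters (X, d) through fmincon, the equivalent call uses an anonymous function (a sketch, with the same qfun as above):
a = fmincon(@(a) qfun(a,X,d), a0, [], [], d', 0, ...
    zeros(N,1), C*ones(N,1));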
Inner Product Kernels
In general, the input may first be transformed via a set of nonlinear functions {φ _{j} (x)} and then subjected to the hyperplane classifier
g(x) = Σ _{j=1} ^{p} w _{j} φ _{j} (x) + b = Σ _{j=0} ^{p} w _{j} φ _{j} (x) = w ^{T} φ(x),  where b = w _{0} and φ _{0} (x) = 1.
Define the inner product kernel as
K(x, y) = Σ _{j=0} ^{p} φ _{j} (x) φ _{j} (y) = φ(x) ^{T} φ(y)
One may then obtain a dual optimization problem formulation as:
Q(a) = Σ _{i=1} ^{N} a _{i} − (1/2) Σ _{i=1} ^{N} Σ _{j=1} ^{N} a _{i} a _{j} d _{i} d _{j} K(x _{i} , x _{j} )
Often, dim of φ (= p + 1) >> dim of x!
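As a sketch, the kernelized dual objective needs only the Gram matrix of kernel values; here it is evaluated in MATLAB with the polynomial kernel and the XOR data from the later slides (a as given there):
% Kernelized dual objective Q(a) = sum(a) - a'*(D*K*D)*a/2,
% where K(i,j) = K(x_i, x_j) and D = diag(d).
X = [-1 -1; -1 1; 1 -1; 1 1];    % XOR inputs, one per row
d = [-1; 1; 1; -1];              % XOR labels
a = 0.125*ones(4,1);             % multipliers from the XOR solution
K = (1 + X*X').^2;               % polynomial-kernel Gram matrix
D = diag(d);
Q = sum(a) - 0.5*a'*(D*K*D)*a    % = 0.25 for this data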
General Pattern Recognition with SVM
[Figure: block diagram of an SVM classifier — the input x _{i} is mapped through the nonlinear features φ _{j} , and a weighted sum produces the output d _{i} .]
By careful selection of the nonlinear transformation {φ _{j} (x); 1 ≤ j ≤ p}, any pattern recognition problem can be solved.
Polynomial Kernel
Consider a polynomial kernel
K(x, y) = (1 + x ^{T} y) ^{2} = 1 + 2 Σ _{i=1} ^{m} x _{i} y _{i} + 2 Σ _{i=1} ^{m} Σ _{j=i+1} ^{m} x _{i} y _{i} x _{j} y _{j} + Σ _{i=1} ^{m} x _{i} ^{2} y _{i} ^{2}
Let K(x, y) = φ ^{T} (x) φ(y); then
φ(x) = [1, x _{1} ^{2} , …, x _{m} ^{2} , √2 x _{1} , …, √2 x _{m} , √2 x _{1} x _{2} , …, √2 x _{1} x _{m} , √2 x _{2} x _{3} , …, √2 x _{2} x _{m} , …, √2 x _{m−1} x _{m} ]
= [1, φ _{1} (x), …, φ _{p} (x)]
where the number of features is p + 1 = 1 + m + m + (m−1) + (m−2) + … + 1 = (m+2)(m+1)/2.
Hence, using a kernel, a low-dimensional pattern classification problem (of dimension m) is solved in a higher-dimensional feature space (of dimension p + 1). But only the φ _{j} (x) corresponding to support vectors are used for pattern classification!
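A quick numerical check of this identity for m = 2 (a minimal sketch; the feature map phi below follows the ordering given above):
% Verify (1 + x'*y)^2 == phi(x)'*phi(y) for m = 2.
phi = @(v) [1; v(1)^2; v(2)^2; sqrt(2)*v(1); sqrt(2)*v(2); sqrt(2)*v(1)*v(2)];
x = [0.5; -2]; y = [3; 1];       % arbitrary test vectors
k1 = (1 + x'*y)^2                % kernel evaluation
k2 = phi(x)'*phi(y)              % explicit feature-space inner product
% k1 and k2 agree (up to roundoff)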
Numerical Example: XOR Problem
Training samples: (x _{1} , x _{2} ; d) = (−1, −1; −1), (−1, +1; +1), (+1, −1; +1), (+1, +1; −1), where x = [x _{1} , x _{2} ] ^{T} .
Using K(x, x _{i} ) = (1 + x ^{T} x _{i} ) ^{2} , one has
φ(x) = [1  x _{1} ^{2}  x _{2} ^{2}  √2 x _{1}  √2 x _{2}  √2 x _{1} x _{2} ] ^{T}
Stacking the φ(x _{i} ) ^{T} as the rows of Φ:
Φ = [ 1  1  1  −√2  −√2   √2
      1  1  1  −√2   √2  −√2
      1  1  1   √2  −√2  −√2
      1  1  1   √2   √2   √2 ]
ΦΦ ^{T} = [ 9  1  1  1
          1  9  1  1
          1  1  9  1
          1  1  1  9 ]
Note dim[φ(x)] = 6 > dim[x] = 2!
;
XOR Problem (Continued)
Note that K(x _{i} , x _{j} ) can be calculated directly without using Φ! E.g.,
K _{1,1} = (1 + [−1 −1][−1; −1]) ^{2} = 9;  K _{1,2} = (1 + [−1 −1][−1; +1]) ^{2} = 1
The corresponding Lagrange multipliers are a = (1/8)[1 1 1 1] ^{T} .
w = Σ _{i=1} ^{N} a _{i} d _{i} φ(x _{i} ) = Φ ^{T} [a _{1} d _{1}  a _{2} d _{2}  …  a _{N} d _{N} ] ^{T}
= (1/8)(−1)φ(x _{1} ) + (1/8)(+1)φ(x _{2} ) + (1/8)(+1)φ(x _{3} ) + (1/8)(−1)φ(x _{4} ) = [0 0 0 0 0 −1/√2] ^{T}
Hence the hyperplane is: y = w ^{T} φ(x) = −x _{1} x _{2}
(x _{1} , x _{2} ):   (−1, −1)   (−1, +1)   (+1, −1)   (+1, +1)
y = −x _{1} x _{2} :   −1       +1       +1       −1
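The whole example can be verified in a few lines of MATLAB (a sketch following the quantities defined above):
% XOR example: explicit feature map, weight vector, and decision values.
X = [-1 -1; -1 1; 1 -1; 1 1]; d = [-1; 1; 1; -1];
phi = @(v) [1; v(1)^2; v(2)^2; sqrt(2)*v(1); sqrt(2)*v(2); sqrt(2)*v(1)*v(2)];
Phi = [phi(X(1,:))'; phi(X(2,:))'; phi(X(3,:))'; phi(X(4,:))'];
a = ones(4,1)/8;                 % Lagrange multipliers from the slide
w = Phi'*(a.*d)                  % = [0 0 0 0 0 -1/sqrt(2)]'
y = Phi*w                        % = d: all four samples classified correctly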
Other Types of Kernels
Type of SVM                   K(x, y)                           Comments
Polynomial learning machine   (x ^{T} y + 1) ^{p}               p: selected a priori
Radial-basis function         exp(−||x − y|| ^{2} /(2σ ^{2} ))   σ ^{2} : selected a priori
Two-layer perceptron          tanh(β _{0} x ^{T} y + β _{1} )    only some β _{0} , β _{1} values are feasible
What kernel is feasible? It must satisfy Mercer's theorem!
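These kernels are one-liners as MATLAB function handles (a sketch; p, sigma2, b0, b1 are user-chosen parameters):
% The three kernels above as function handles on column vectors x, y.
p = 2; sigma2 = 1; b0 = 1; b1 = -1;              % example parameter values
k_poly = @(x,y) (x'*y + 1)^p;                    % polynomial learning machine
k_rbf  = @(x,y) exp(-norm(x - y)^2/(2*sigma2));  % radial-basis function
k_mlp  = @(x,y) tanh(b0*(x'*y) + b1);            % two-layer perceptron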
Mercer's Theorem
Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigenfunction expansion
K(x, y) = Σ _{i=1} ^{∞} λ _{i} φ _{i} (x) φ _{i} (y)
with λ _{i} > 0 for each i. This expansion converges absolutely and uniformly if and only if
∫ _{a} ^{b} ∫ _{a} ^{b} K(x, y) ψ(x) ψ(y) dx dy ≥ 0
for all ψ(x) such that ∫ _{a} ^{b} ψ ^{2} (x) dx < ∞.
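In practice, Mercer's condition shows up as positive semidefiniteness of every Gram matrix the kernel produces; a quick empirical check in MATLAB (a sketch with random data):
% Empirical Mercer check: the Gram matrix of a valid kernel is
% positive semidefinite for any set of input points.
rng(0);
X = randn(20, 3);                % 20 random points in R^3
sq = sum(X.^2, 2);
D2 = sq + sq' - 2*(X*X');        % squared pairwise distances
K = exp(-D2/2);                  % RBF Gram matrix (sigma^2 = 1)
min(eig((K + K')/2))             % >= 0 up to roundoff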
Testing with Kernels
For many types of kernels, φ(x) cannot be explicitly represented or even found. However,
w = Σ _{i=1} ^{N} a _{i} d _{i} φ(x _{i} ) = Φ ^{T} [a _{1} d _{1}  a _{2} d _{2}  …  a _{N} d _{N} ] ^{T} = Φ ^{T} f
y(x) = w ^{T} φ(x) = f ^{T} Φ φ(x) = f ^{T} [K(x _{1} , x), …, K(x _{N} , x)] ^{T}
Hence there is no need to know φ(x) explicitly! For example, in the XOR problem, f = (1/8)[−1 +1 +1 −1] ^{T} . Suppose that x = (−1, +1). Then
y(x) = (1/8) ( −(1 + [−1 +1][−1; −1]) ^{2} + (1 + [−1 +1][−1; +1]) ^{2} + (1 + [−1 +1][+1; −1]) ^{2} − (1 + [−1 +1][+1; +1]) ^{2} )
= (1/8) (−1 + 9 + 1 − 1) = +1
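The same kernel-only evaluation in MATLAB (a sketch reusing the XOR quantities above):
% Kernel-only testing: y(x) = sum_i f_i * K(x_i, x); no phi(x) needed.
Xtrain = [-1 -1; -1 1; 1 -1; 1 1];
f = [-1; 1; 1; -1]/8;            % f_i = a_i*d_i from the XOR solution
y = @(x) f'*(1 + Xtrain*x).^2;   % polynomial-kernel evaluations
y([-1; 1])                       % returns +1, matching the truth table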