
Support Vector Machines (SVM)

Y.H. Hu

Outline

Linear pattern classifiers and the optimal hyperplane

Optimization problem formulation

Statistical properties of optimal hyperplane

The case of non-separable patterns

Applications to general pattern classification

Mercer's Theorem



Linear Hyperplane Classifier

[Figure: a separating hyperplane in the (x1, x2) plane, with normal vector w, a sample point x at distance r from the plane, and the plane's offset -b/|w| from the origin.]

Given:  {(x_i, d_i); i = 1 to N,  d_i ∈ {+1, -1}}.

A linear hyperplane classifier is a hyperplane consisting of points x such that

H = {x | g(x) = w^T x + b = 0}

g(x): the discriminant function.

For x on the side marked ○:   w^T x + b ≥ 0;   d = +1.

For x on the other side:   w^T x + b ≤ 0;   d = -1.

Distance from x to H:   r = w^T x/|w| - (-b/|w|) = g(x)/|w|.


Distance from a Point to a Hyper-plane

The hyperplane H is characterized by

(*)   w^T x + b = 0

w: normal vector perpendicular to H.

(*) says that any vector x on H, when projected onto w, has length OA = -b/|w|.

[Figure: hyperplane H with normal vector w, the origin O, the point A where the direction of w meets H, and a point C (vector x*) off H; B is the projection of C onto the direction of w, and r is the distance from C to H.]

Consider a special point C corresponding to vector x*. Its projection onto vector w is

w^T x*/|w| = OA + BC.

Or equivalently, w^T x*/|w| = r - b/|w|. Hence

r = (w^T x* + b)/|w| = g(x*)/|w|.
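As a quick numerical check (an addition, not from the original slides), the distance formula can be evaluated directly in MATLAB; the values of w, b, and x below are arbitrary:

% Distance from a point x to the hyperplane g(x) = w'*x + b = 0.
w = [3; 4];               % normal vector (arbitrary example values)
b = -5;                   % offset (arbitrary example value)
x = [2; 1];               % query point
r = (w'*x + b)/norm(w);   % signed distance r = g(x)/|w|; here (6 + 4 - 5)/5 = 1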



Optimal Hyperplane: Linearly Separable Case

[Figure: two linearly separable classes in the (x1, x2) plane; the optimal hyperplane sits in the middle of the gap, at distance r from the nearest samples on each side.]

Optimal hyperplane should be in the center of the gap.

Support vectors: the samples on the boundaries. The support vectors alone can determine the optimal hyperplane.

Question: how to find the optimal hyperplane?

For d_i = +1:   g(x_i) = w^T x_i + b ≥ r|w|,   i.e.   w_o^T x_i + b_o ≥ 1.

For d_i = -1:   g(x_i) = w^T x_i + b ≤ -r|w|,   i.e.   w_o^T x_i + b_o ≤ -1.


Separation Gap

For x_i being a support vector:

For d_i = +1:   g(x_i) = w^T x_i + b = r|w|,   i.e.   w_o^T x_i + b_o = 1.

For d_i = -1:   g(x_i) = w^T x_i + b = -r|w|,   i.e.   w_o^T x_i + b_o = -1.

Hence w_o = w/(r|w|) and b_o = b/(r|w|). But the distance from x_i to the hyperplane is r = g(x_i)/|w|. Thus w_o = w/g(x_i), and r = 1/|w_o|.

The maximum distance between the two classes is

2r = 2/|w_o|.

Hence the objective is to find w_o, b_o that minimize |w_o| (so that r is maximized), subject to the constraints

w_o^T x_i + b_o ≥ 1 for d_i = +1;   and   w_o^T x_i + b_o ≤ -1 for d_i = -1.

Combining these constraints, one has:   d_i (w_o^T x_i + b_o) ≥ 1.


Quadratic Optimization Problem Formulation

Given {(x_i, d_i); i = 1 to N}, find w and b such that f(w) = w^T w/2 is minimized subject to the N constraints

d_i (w^T x_i + b) ≥ 1,    1 ≤ i ≤ N.

Method of Lagrange Multipliers

Set

J(w, b, α) = f(w) - Σ_{i=1}^{N} α_i [d_i (w^T x_i + b) - 1]

∂J(w, b, α)/∂w = 0   ⇒   w = Σ_{i=1}^{N} α_i d_i x_i

∂J(w, b, α)/∂b = 0   ⇒   Σ_{i=1}^{N} α_i d_i = 0

Optimization (continued)

The solution of the Lagrange multiplier problem is at a saddle point, where the minimum is sought w.r.t. w and b, while the maximum is sought w.r.t. α_i.

Kuhn-Tucker condition: at the saddle point,

α_i [d_i (w^T x_i + b) - 1] = 0    for 1 ≤ i ≤ N.

If x_i is NOT a support vector, the corresponding α_i = 0!

Hence, only the support vectors affect the result of the optimization!


A Numerical Example

[Figure: training samples (x_i, d_i) = (1, -1), (2, +1), (3, +1) on the real line; the resulting decision boundary x = 1.5 is drawn as a dashed line.]

3 inequalities:   1·w + b ≤ -1;   2·w + b ≥ +1;   3·w + b ≥ +1.

J = w²/2 - α_1(-w - b - 1) - α_2(2w + b - 1) - α_3(3w + b - 1)

∂J/∂w = 0   ⇒   w = -α_1 + 2α_2 + 3α_3

∂J/∂b = 0   ⇒   0 = α_1 - α_2 - α_3

Solve:   (a) -w - b - 1 = 0;   (b) 2w + b - 1 = 0;   (c) 3w + b - 1 = 0.

(b) and (c) conflict with each other. Solving (a) and (b) yields w = 2, b = -3. From the Kuhn-Tucker condition, α_3 = 0. Thus α_1 = α_2 = 2. Hence the resulting decision boundary is 2x - 3 = 0, or x = 1.5. This is shown as the dashed line in the figure above.
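To double-check the hand calculation, the following MATLAB sketch (not part of the original slides) solves equations (a) and (b) and evaluates the margins; the variable names are my own:

% Solve (a) -w - b - 1 = 0 and (b) 2w + b - 1 = 0 for [w; b].
A  = [-1 -1; 2 1];            % coefficients of (a) and (b)
wb = A \ [1; 1];              % gives w = 2, b = -3
w  = wb(1);  b = wb(2);
% Check d_i*(w*x_i + b) >= 1 for (x, d) = (1, -1), (2, +1), (3, +1):
x = [1; 2; 3];  d = [-1; 1; 1];
margins = d .* (w*x + b);     % [1; 1; 3]: x = 1 and x = 2 lie exactly on the margin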


Primal/Dual Problem Formulation

Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem with the Lagrange multipliers providing the solution can be formulated.

Duality Theorem (Bertsekas 1995)

(a)

If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal values.

(b)

In order for w_o to be an optimal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and

Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o)


Formulating the Dual Problem

With

J(w, b, α) = (1/2) w^T w - Σ_{i=1}^{N} α_i d_i w^T x_i - b Σ_{i=1}^{N} α_i d_i + Σ_{i=1}^{N} α_i

and   w = Σ_{i=1}^{N} α_i d_i x_i,   Σ_{i=1}^{N} α_i d_i = 0,   these lead to the

Dual Problem

Maximize

Q(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j

Subject to:   Σ_{i=1}^{N} α_i d_i = 0   and   α_i ≥ 0   for i = 1, 2, …, N.

Note

Q(α) = Σ_{i=1}^{N} α_i - (1/2) [α_1 d_1 … α_N d_N] [ x_1^T x_1 … x_1^T x_N ; ⋮ ⋱ ⋮ ; x_N^T x_1 … x_N^T x_N ] [α_1 d_1 ; ⋮ ; α_N d_N]

Numerical Example (cont’d)

or

Q(α) = Σ_{i=1}^{3} α_i - (1/2) [-α_1  α_2  α_3] [ 1 2 3 ; 2 4 6 ; 3 6 9 ] [-α_1 ; α_2 ; α_3]

Q(α) = α_1 + α_2 + α_3 - [0.5α_1² + 2α_2² + 4.5α_3² - 2α_1α_2 - 3α_1α_3 + 6α_2α_3]

subject to the constraints:

-α_1 + α_2 + α_3 = 0,   and

α_1 ≥ 0, α_2 ≥ 0, and α_3 ≥ 0.

Use the MATLAB Optimization Toolbox command:

x = fmincon('qalpha', X0, A, B, Aeq, Beq)

The solution is [α_1 α_2 α_3] = [2 2 0], as expected.
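A complete, runnable version of this call is sketched below. It is a reconstruction rather than the original 'qalpha' file: the objective is the negative of Q(α) (fmincon minimizes), and the constraints -α_1 + α_2 + α_3 = 0 and α_i ≥ 0 are passed as an equality constraint and lower bounds.

% Hypothetical sketch of the fmincon call for this 3-sample example.
H  = [1 2 3; 2 4 6; 3 6 9];             % Gram matrix x_i*x_j for x = [1 2 3]
d  = [-1; 1; 1];
Hd = (d*d') .* H;                       % d_i d_j x_i x_j, the matrix inside Q(alpha)
qalpha = @(a) -sum(a) + 0.5*a'*Hd*a;    % -Q(alpha)
a0 = zeros(3,1);                        % feasible starting point
a  = fmincon(qalpha, a0, [], [], d', 0, zeros(3,1), []);   % d'*a = 0, a >= 0
% a is approximately [2; 2; 0]; then w = sum(a.*d.*[1;2;3]) = 2 and b = -3 from (a).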


Implication of Minimizing ||w||

Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x 1 , x 2 , …, x N }. The set of optimal hyper-planes described by the equation

w_o^T x + b_o = 0

has a VC-dimension h bounded from above as

h ≤ min{ ⌈D²/ρ²⌉, m_0 } + 1

where m_0 is the dimension of the input vectors, and ρ = 2/||w_o|| is the margin of separation of the hyperplanes.

The VC-dimension determines the complexity of the classifier structure; usually, the smaller the better.


Non-separable Cases

Recall that in the linearly separable case, each training sample pair (x_i, d_i) represents a linear inequality constraint

d_i (w^T x_i + b) ≥ 1,    i = 1, 2, …, N.

If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:

d_i (w^T x_i + b) ≥ 1 - ξ_i,    i = 1, 2, …, N.

{ξ_i; 1 ≤ i ≤ N} are known as slack variables. If ξ_i > 1, then the corresponding (x_i, d_i) will be misclassified.

The minimum error classifier would minimize Σ_{i=1}^{N} I(ξ_i - 1), but it is non-convex w.r.t. w. Hence an approximation is to minimize

Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i


Primal and Dual Problem Formulation

Primal Optimization Problem: Given {(x_i, d_i); 1 ≤ i ≤ N}, find w, b such that

Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i

is minimized subject to the constraints (i) ξ_i ≥ 0, and (ii) d_i (w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, …, N.

Dual Optimization Problem: Given {(x_i, d_i); 1 ≤ i ≤ N}, find the Lagrange multipliers {α_i; 1 ≤ i ≤ N} such that

Q(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j

is maximized subject to the constraints (i) 0 ≤ α_i ≤ C (a user-specified positive number) and (ii) Σ_{i=1}^{N} α_i d_i = 0.
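Because this dual is a quadratic program, it can also be solved with quadprog from the MATLAB Optimization Toolbox. The sketch below is an illustration only (the slides themselves use fmincon later); it assumes X is an N-by-m sample matrix, d an N-by-1 vector of ±1 labels, and C the user-specified penalty.

% Soft-margin dual via quadprog, which minimizes 0.5*a'*H*a + f'*a.
C = 1;                                   % assumed penalty parameter
[N, m] = size(X);                        % X, d assumed to be loaded already
H = (d*d') .* (X*X');                    % H(i,j) = d_i d_j x_i'*x_j
f = -ones(N, 1);                         % so the objective equals -Q(alpha)
a = quadprog(H, f, [], [], d', 0, zeros(N,1), C*ones(N,1));   % d'*a = 0, 0 <= a_i <= C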


Solution to the Dual Problem

The optimal solution to the dual problem is:

w_o = Σ_{i=1}^{N_s} α_{o,i} d_i x_i

N_s: the number of support vectors.

The Kuhn-Tucker condition implies, for i = 1, 2, …, N,

(i)   α_i [d_i (w^T x_i + b) - 1 + ξ_i] = 0    (*)

(ii)  μ_i ξ_i = 0

{μ_i; 1 ≤ i ≤ N} are the Lagrange multipliers enforcing the condition ξ_i ≥ 0. At the optimal point of the primal problem, ∂Φ/∂ξ_i = 0. One may deduce that ξ_i = 0 if α_i < C.

Solving (*) over the support vectors (for which ξ_i = 0), we have

b_o = (1/N_s) Σ_{i=1}^{N_s} (d_i - w_o^T x_i)

i.e., b_o is the average of d_i - w_o^T x_i over the support vectors.
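In code, this average is taken over the indices where α_i is numerically nonzero and below C. A minimal MATLAB sketch, assuming a, X, d, wo, and C are already available from the dual solution:

% Bias from the support vectors with 0 < a_i < C (for which xi_i = 0).
tol = 1e-6;                              % numerical tolerance (assumed)
sv  = (a > tol) & (a < C - tol);         % margin support vectors
bo  = mean(d(sv) - X(sv,:)*wo);          % average of d_i - wo'*x_i over those vectors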


Matlab Implementation

% svm1.m: basic support vector machine (soft margin, linear kernel)
% X: N by m matrix; i-th row is x_i'
% d: N by 1 vector; i-th element is d_i (+1 or -1)
% X, d should be loaded from file or read from input before this point.
% call the MATLAB Optimization Toolbox function fmincon
[N, m] = size(X);
a0 = eps*ones(N,1);                      % starting point for the multipliers
C  = 1;                                  % penalty parameter
% minimize -Q(a) subject to d'*a = 0 and 0 <= a_i <= C
a  = fmincon(@(a) qfun(a,X,d), a0, [], [], d', 0, zeros(N,1), C*ones(N,1));
wo = X'*(a.*d)                           % w_o = sum_i a_i d_i x_i
sv = (a > 10*eps) & (a < C - 10*eps);    % margin support vectors
bo = mean(d(sv) - X(sv,:)*wo)            % b_o: average of d_i - wo'*x_i over support vectors

function y = qfun(a, X, d)
% -Q(a): minimizing -Q(a) with fmincon is the same as maximizing Q(a)
[N, m] = size(X);
y = -ones(1,N)*a + 0.5*a'*diag(d)*(X*X')*diag(d)*a;


Inner Product Kernels

In general, if the input is first transformed via a set of nonlinear functions {φ_i(x)} and then subjected to the hyperplane classifier

g(x) = Σ_{j=1}^{p} w_j φ_j(x) + b = Σ_{j=0}^{p} w_j φ_j(x) = w^T φ(x),    where b = w_0 and φ_0(x) = 1.

Define the inner product kernel as

K(x, y) = Σ_{j=0}^{p} φ_j(x) φ_j(y) = φ(x)^T φ(y)

one may obtain a dual optimization problem formulation as:

Q(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j K(x_i, x_j)

Often, dim of φ (= p+1) >> dim of x!
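In code, the only change from the linear dual is that the Gram matrix X*X' is replaced by a kernel matrix. A hedged sketch, reusing the assumed X, d, N, and C from before, with a degree-2 polynomial kernel chosen purely as an example:

% Kernel version of the dual: replace the inner products x_i'*x_j by K(x_i, x_j).
K = (1 + X*X').^2;                       % example kernel (1 + x'*y)^2
H = (d*d') .* K;                         % H(i,j) = d_i d_j K(x_i, x_j)
a = quadprog(H, -ones(N,1), [], [], d', 0, zeros(N,1), C*ones(N,1));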


General Pattern Recognition with SVM

[Figure: network view of the SVM. The input vector x_i (components x_1, …, x_m) feeds the nonlinear functions φ_1(x), …, φ_p(x); their outputs are weighted by w_1, …, w_p, summed together with the bias b, and the sum is compared with the desired output d_i.]

By careful selection of the nonlinear transformation {φ_j(x); 1 ≤ j ≤ p}, any pattern recognition problem can be solved.


Polynomial Kernel

Consider a polynomial kernel

K(x, y) = (1 + x^T y)² = 1 + 2 Σ_{i=1}^{m} x_i y_i + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} x_i y_i x_j y_j + Σ_{i=1}^{m} x_i² y_i²

Let K(x, y) = φ^T(x) φ(y); then

φ(x) = [1, x_1², …, x_m², √2 x_1, …, √2 x_m, √2 x_1x_2, …, √2 x_1x_m, √2 x_2x_3, …, √2 x_2x_m, …, √2 x_{m-1}x_m]

     = [1, φ_1(x), …, φ_p(x)]

where p + 1 = 1 + m + m + (m-1) + (m-2) + … + 1 = (m+2)(m+1)/2.

Hence, using a kernel, a low-dimensional pattern classification problem (of dimension m) is solved in a higher-dimensional space (of dimension p+1). But only the φ(x_i) corresponding to support vectors are used for pattern classification!
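The identity K(x, y) = φ(x)^T φ(y) is easy to check numerically. A small MATLAB sketch for m = 2 (the test vectors are arbitrary values chosen for illustration):

% Verify (1 + x'*y)^2 == phi(x)'*phi(y) for the degree-2 polynomial feature map above.
phi = @(v) [1; v(1)^2; v(2)^2; sqrt(2)*v(1); sqrt(2)*v(2); sqrt(2)*v(1)*v(2)];
x = [0.3; -1.2];  y = [2.0; 0.5];        % arbitrary test points
kernel_value  = (1 + x'*y)^2;            % kernel evaluated directly
feature_value = phi(x)'*phi(y);          % kernel evaluated via the explicit feature map
% kernel_value and feature_value agree (up to round-off).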


Numerical Example: XOR Problem

Training samples (x_1 x_2; d):   (-1 -1; -1),   (-1 1; +1),   (1 -1; +1),   (1 1; -1)

x = [x_1, x_2]^T. Using K(x, x_i) = (1 + x^T x_i)², one has

φ(x) = [1, x_1², x_2², √2 x_1, √2 x_2, √2 x_1 x_2]^T

Φ = [ φ(x_1)^T ; φ(x_2)^T ; φ(x_3)^T ; φ(x_4)^T ]

  = [ 1   1   1   -√2   -√2    √2
      1   1   1   -√2    √2   -√2
      1   1   1    √2   -√2   -√2
      1   1   1    √2    √2    √2 ]

Φ Φ^T = [ 9  1  1  1
          1  9  1  1
          1  1  9  1
          1  1  1  9 ]

Note dim[φ(x)] = 6 > dim[x] = 2!

XOR Problem (Continued)

Note that K(x_i, x_j) can be calculated directly without using Φ!

E.g.   K_{1,1} = (1 + [-1  -1][-1 ; -1])² = 9;     K_{1,2} = (1 + [-1  -1][-1 ; 1])² = 1

The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]^T.

w = Σ_{i=1}^{N} α_i d_i φ(x_i) = Φ^T [α_1 d_1   α_2 d_2   …   α_N d_N]^T

  = (1/8)(-1)φ(x_1) + (1/8)(+1)φ(x_2) + (1/8)(+1)φ(x_3) + (1/8)(-1)φ(x_4) = [0  0  0  0  0  -1/√2]^T

Hence the hyperplane is:   y = w^T φ(x) = -x_1 x_2

(x_1, x_2):      (-1, -1)   (-1, +1)   (+1, -1)   (+1, +1)
y = -x_1 x_2:       -1         +1         +1         -1
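The whole example can be reproduced in a few lines of MATLAB. This is a verification sketch added here (the α values are taken as given above rather than recomputed):

% Verify the XOR solution: kernel matrix and the decision function built from K alone.
X = [-1 -1; -1 1; 1 -1; 1 1];            % the four training samples (rows)
d = [-1; 1; 1; -1];                      % XOR labels
K = (1 + X*X').^2;                       % gives [9 1 1 1; 1 9 1 1; 1 1 9 1; 1 1 1 9]
alpha = (1/8)*ones(4,1);                 % Lagrange multipliers from the slide
y = @(x) sum(alpha .* d .* (1 + X*x).^2);   % y(x) = sum_i alpha_i d_i K(x_i, x)
% y([-1; 1]) = +1, y([1; 1]) = -1, etc., i.e. y(x) = -x1*x2 on the four corners.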


Other Types of Kernels

Type of SVM                     K(x, y)                          Comments
Polynomial learning machine     (x^T y + 1)^p                    p: selected a priori
Radial basis function           exp(-||x - y||² / (2σ²))         σ²: selected a priori
Two-layer perceptron            tanh(β_o x^T y + β_1)            only some β_o and β_1 values are feasible

What kernel is feasible? It must satisfy Mercer's theorem!
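For reference, the three kernels in the table can be written as MATLAB one-liners; this is an illustrative sketch, with the hyperparameters (p, σ, β_o, β_1) assumed to be chosen by the user:

% Kernel functions corresponding to the table above (u, v are column vectors).
poly_kernel = @(u, v, p)      (u'*v + 1)^p;                     % polynomial learning machine
rbf_kernel  = @(u, v, sigma)  exp(-norm(u - v)^2/(2*sigma^2));  % radial basis function
mlp_kernel  = @(u, v, b0, b1) tanh(b0*(u'*v) + b1);             % two-layer perceptron (only some b0, b1 are valid)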


Mercer's Theorem

Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigenfunction expansion

K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y)

with λ_i > 0 for each i. This expansion converges absolutely and uniformly if and only if

∫_a^b ∫_a^b K(x, y) ψ(x) ψ(y) dx dy ≥ 0

for all ψ(x) such that ∫_a^b ψ²(x) dx < ∞.


Testing with Kernels

For many types of kernels, φ(x) cannot be explicitly represented or even found. However,

w = Σ_{i=1}^{N} α_i d_i φ(x_i) = Φ^T [α_1 d_1   α_2 d_2   …   α_N d_N]^T = Φ^T f

y(x) = w^T φ(x) = f^T Φ φ(x) = K(x, x_j) f,   where K(x, x_j) denotes the row vector [K(x, x_1), …, K(x, x_N)].

Hence there is no need to know φ(x) explicitly! For example, in the XOR problem, f = (1/8)[-1 +1 +1 -1]^T. Suppose that x = (-1, +1); then

y(x) = K(x, x_j) f

     = [ (1 + [-1  1][-1 ; -1])²   (1 + [-1  1][-1 ; 1])²   (1 + [-1  1][1 ; -1])²   (1 + [-1  1][1 ; 1])² ] · (1/8)[-1  1  1  -1]^T

     = [1  9  1  1] · (1/8)[-1  1  1  -1]^T  =  1
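As a final illustration (an addition, not from the slides), the test-time computation can be written generically: classification needs only kernel evaluations against the training samples, never φ(x) itself. The snippet assumes X, d, and alpha from the XOR sketch above, plus a bias bo (zero in the XOR example):

% Generic kernel SVM output: y(x) = sum_i alpha_i d_i K(x_i, x) + bo.
kernel = @(u, v) (1 + u'*v)^2;           % assumed kernel (degree-2 polynomial)
bo = 0;                                  % bias (zero in the XOR example)
x  = [-1; 1];                            % test point
Kx = zeros(size(X,1), 1);
for i = 1:size(X,1)
    Kx(i) = kernel(X(i,:)', x);          % kernel value against each training sample
end
y = sum(alpha .* d .* Kx) + bo;          % for the XOR data: y = +1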
