
An Introduction to

Support Vector Machine

Slides are from Dr. Belhumeur, Columbia University


Support Vector Machine (SVM)

• A classifier derived from statistical learning theory by Vapnik et
  al. in 1992

• SVM became famous when, using images as input, it gave accuracy
  comparable to neural networks with hand-designed features in a
  handwriting recognition task

• SVM has been widely used in object detection & recognition,
  content-based image retrieval, text recognition, biometrics,
  speech recognition, etc.

• Also used for regression

• Chapter 5.1, 5.2, 5.3, 5.11 (5.4*) in the textbook


V. Vapnik
Outline

• Linear Discriminant Function

• Large Margin Linear Classifier

• Nonlinear SVM: The Kernel Trick


Discriminant Function
• Chapter 2.4: the classifier is said to assign a feature vector x to
  class ωi if

        fi(x) > fj(x)   for all j ≠ i

• For the two-category case, define f(x) ≡ f1(x) − f2(x)

        Decide ω1 if f(x) > 0; otherwise decide ω2

• An example we've learned before: the Minimum-Error-Rate Classifier

        f(x) ≡ p(ω1 | x) − p(ω2 | x)
Discriminant Function
• It can be an arbitrary function of x, such as:

        Nearest Neighbor  |  Linear Functions: f(x) = wT x + b  |  Nonlinear Functions

• For a K-NN classifier it was necessary to 'carry' the training data
• For a linear classifier, the training data is used to learn w and then
  discarded
• Only w is needed for classifying new data
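
A minimal sketch of that last point, with made-up values for w and b (in practice they come from training):

    import numpy as np

    w = np.array([0.8, -0.5])   # hypothetical learned weight vector
    b = 0.2                     # hypothetical learned bias

    def classify(x):
        # Decide omega_1 if f(x) = w^T x + b > 0, otherwise omega_2
        return +1 if w @ x + b > 0 else -1

    print(classify(np.array([1.0, 0.3])))   # only w and b are needed, not the training data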
Linear Discriminant Function

(figure: two classes of training points in the x1-x2 plane; legend: denotes +1 / denotes -1)

• How would you classify these points using a linear discriminant
  function in order to minimize the error rate?

• Infinite number of answers!

• Which one is the best?

Large Margin Linear Classifier

(figure: the margin as a "safe zone" around the decision boundary in the x1-x2 plane; legend: denotes +1 / denotes -1)

• The linear discriminant function (classifier) with the maximum
  margin is the best

• Margin is defined as the width that the boundary could be increased
  by before hitting a data point

• Why is it the best?
  - Robust to outliers and thus strong generalization ability
Linear Discriminant Function

(figure: the hyperplane wT x + b = 0 in the x1-x2 plane, with wT x + b > 0 on one side and wT x + b < 0 on the other)

• f(x) is a linear function:

        f(x) = wT x + b

• A hyperplane in the feature space

• (Unit-length) normal vector of the hyperplane:

        n = w / ||w||
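
A standard consequence (not stated explicitly on the slide, but used in the margin derivation that follows): the signed distance from any point x to this hyperplane is

    r(\mathbf{x}) \;=\; \mathbf{n}^{T}(\mathbf{x} - \mathbf{x}_{0})
                 \;=\; \frac{\mathbf{w}^{T}\mathbf{x} + b}{\lVert\mathbf{w}\rVert},
    \qquad \text{where } \mathbf{w}^{T}\mathbf{x}_{0} + b = 0 .

The distance from the origin to a line wT x = k, quoted on a later slide as k / ||w||, is the special case x = 0.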
Large Margin Linear Classifier

(figure: training points in the x1-x2 plane; legend: denotes +1 / denotes -1)

• Given a set of data points {(xi, yi)}, i = 1, 2, ..., n, where

        for yi = +1,  wT xi + b > 0
        for yi = −1,  wT xi + b < 0

• With a scale transformation on both w and b, the above is
  equivalent to

        for yi = +1,  wT xi + b ≥ 1
        for yi = −1,  wT xi + b ≤ −1
Large Margin Linear Classifier

(figure: the margin of width m between the two boundary hyperplanes; the support vectors, including x+ and x-, lie on these boundaries; legend: denotes +1 / denotes -1)

• We know that

        wT x+ + b = +1
        wT x− + b = −1

• The distance between the origin and the line wT x = k is k / ||w||

• The margin width is:

        m = (x+ − x−) · n = 2 / ||w||
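
The algebra behind that margin value, filled in (notation as in the figure):

    \mathbf{w}^{T}\mathbf{x}^{+} + b = +1, \quad
    \mathbf{w}^{T}\mathbf{x}^{-} + b = -1
    \;\;\Longrightarrow\;\;
    \mathbf{w}^{T}(\mathbf{x}^{+} - \mathbf{x}^{-}) = 2

    m \;=\; (\mathbf{x}^{+} - \mathbf{x}^{-})\cdot\mathbf{n}
      \;=\; (\mathbf{x}^{+} - \mathbf{x}^{-})\cdot\frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}
      \;=\; \frac{2}{\lVert\mathbf{w}\rVert}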
Large Margin Linear Classifier

(figure: as above, with support vectors x+ and x- on the margin boundaries)

• Formulation:

        maximize   2 / ||w||

        such that
        for yi = +1,  wT xi + b ≥ 1
        for yi = −1,  wT xi + b ≤ −1
Large Margin Linear Classifier

(figure: as above)

• Formulation (equivalent):

        minimize   (1/2) ||w||^2

        such that
        for yi = +1,  wT xi + b ≥ 1
        for yi = −1,  wT xi + b ≤ −1
Large Margin Linear Classifier

(figure: as above)

• Formulation (constraints combined):

        minimize   (1/2) ||w||^2

        such that
        yi (wT xi + b) ≥ 1
Solving the Optimization Problem

Quadratic programming with linear constraints:

        minimize   (1/2) ||w||^2

        s.t.   yi (wT xi + b) ≥ 1

Lagrangian Function:

        minimize   Lp(w, b, αi) = (1/2) ||w||^2 − Σ_{i=1..n} αi (yi (wT xi + b) − 1)

        s.t.   αi ≥ 0
Solving the Optimization Problem
        minimize   Lp(w, b, αi) = (1/2) ||w||^2 − Σ_{i=1..n} αi (yi (wT xi + b) − 1)

        s.t.   αi ≥ 0

Setting the gradients of Lp with respect to w and b to zero:

        ∂Lp/∂w = 0   ⟹   w = Σ_{i=1..n} αi yi xi

        ∂Lp/∂b = 0   ⟹   Σ_{i=1..n} αi yi = 0
Solving the Optimization Problem
Primal formulation:

        minimize   Lp(w, b, αi) = (1/2) ||w||^2 − Σ_{i=1..n} αi (yi (wT xi + b) − 1)

        s.t.   αi ≥ 0

Lagrangian Dual Problem (obtained by substituting the conditions above back into Lp; the solution must satisfy the KKT conditions):

        maximize   Σ_{i=1..n} αi − (1/2) Σ_{i=1..n} Σ_{j=1..n} αi αj yi yj xiT xj

        s.t.   αi ≥ 0,   Σ_{i=1..n} αi yi = 0
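
A short derivation sketch of that dual (not shown on the slide; it follows from the two stationarity conditions above):

    L_p \;=\; \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
          \;-\; \mathbf{w}^{T}\!\sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i
          \;-\; b\sum_{i=1}^{n}\alpha_i y_i
          \;+\; \sum_{i=1}^{n}\alpha_i

    \text{Substituting } \mathbf{w} = \sum_{i}\alpha_i y_i \mathbf{x}_i
    \text{ and } \sum_{i}\alpha_i y_i = 0 \text{ gives}

    L_D \;=\; \sum_{i=1}^{n}\alpha_i
          \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
                \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j

Maximizing L_D over αi ≥ 0 subject to Σ_{i} αi yi = 0 is the dual problem stated above.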
Solving the Optimization Problem
(figure: support vectors x+ and x- lying on the margin boundaries in the x1-x2 plane)

• From the KKT conditions, we know:

        αi (yi (wT xi + b) − 1) = 0

• Thus, only support vectors have αi ≠ 0

• The solution has the form:

        w = Σ_{i=1..n} αi yi xi = Σ_{i∈SV} αi yi xi

  and b is obtained from yi (wT xi + b) − 1 = 0, where xi is any support vector
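
A minimal numerical sketch of this procedure, assuming the CVXOPT quadratic-programming solver; the toy data, variable names, and the 1e-6 threshold are illustrative choices, not from the slides:

    import numpy as np
    from cvxopt import matrix, solvers

    solvers.options['show_progress'] = False

    # Tiny made-up, linearly separable dataset (illustrative only)
    X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    n = len(y)

    # Dual QP in CVXOPT's standard form: minimize (1/2) a^T P a + q^T a
    # with P_ij = y_i y_j x_i^T x_j and q = -1,
    # subject to alpha_i >= 0 and sum_i alpha_i y_i = 0
    P = matrix((y[:, None] * X) @ (y[:, None] * X).T)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))            # -alpha_i <= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))      # sum_i alpha_i y_i = 0
    b_eq = matrix(0.0)

    alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])

    sv = alpha > 1e-6                               # support vectors: alpha_i > 0
    w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i alpha_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)                  # from y_i (w^T x_i + b) = 1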


Solving the Optimization Problem
• The linear discriminant function is:

        g(x) = wT x + b = Σ_{i∈SV} αi yi xiT x + b

• Notice it relies on a dot product between the test point x and the
  support vectors xi

• Also keep in mind that solving the optimization problem involved
  computing the dot products xiT xj between all pairs of training points
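
Continuing the CVXOPT sketch above, classifying a new point uses only dot products with the support vectors; this is the property the kernel trick (see the outline) later exploits:

    def g(x_new):
        # g(x) = sum over support vectors of alpha_i y_i (x_i^T x_new) + b
        return np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b

    print(np.sign(g(np.array([2.2, 2.5]))))   # +1 for this point near the positive examples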
Large Margin Linear Classifier

(figure: a few noisy points inside or beyond the margin, with slack distances ξ1 and ξ2 marked in the x1-x2 plane; legend: denotes +1 / denotes -1)

• What if the data is not linearly separable? (noisy data, outliers, etc.)

• Slack variables ξi can be added to allow misclassification of
  difficult or noisy data points
Large Margin Linear Classifier
• Formulation:

        minimize   (1/2) ||w||^2 + C Σ_{i=1..n} ξi

        such that
        yi (wT xi + b) ≥ 1 − ξi
        ξi ≥ 0

• The parameter C can be viewed as a way to control over-fitting.


Large Margin Linear Classifier
• Formulation (Lagrangian Dual Problem):

        maximize   Σ_{i=1..n} αi − (1/2) Σ_{i=1..n} Σ_{j=1..n} αi αj yi yj xiT xj

        such that
        0 ≤ αi ≤ C
        Σ_{i=1..n} αi yi = 0

• The parameter C controls the trade-off between errors of the SVM on
  the training data and margin maximization (C = ∞ leads to the
  hard-margin SVM). A small value of C allows more training errors,
  while a large C leads to behavior similar to that of a hard-margin SVM.

• If C is too large, we have a high penalty for non-separable points and
  may store many support vectors and overfit. If it is too small, we may
  have underfitting.
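
A minimal sketch of this trade-off using scikit-learn's linear-kernel SVC; the dataset and the specific C values are illustrative assumptions:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.datasets import make_blobs

    # Illustrative 2-class data with some overlap between the clusters
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel='linear', C=C).fit(X, y)
        # Small C -> wider margin, more margin violations, more support vectors;
        # large C -> behavior approaching the hard-margin SVM
        print(C, 'support vectors:', clf.n_support_.sum(),
              'training accuracy:', clf.score(X, y))

In practice C is typically chosen by cross-validation.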
