
An Introduction to

Support Vector Machine

Slides are from Dr. Belhumeur, Columbia University


Support Vector Machine (SVM)

• A classifier derived from statistical learning theory by Vapnik et
  al. in 1992

• SVM became famous when, using images as input, it gave accuracy
  comparable to neural networks with hand-designed features in a
  handwriting recognition task

• SVM has been widely used in object detection & recognition,
  content-based image retrieval, text recognition, biometrics,
  speech recognition, etc.

• Also used for regression

• Chapter 5.1, 5.2, 5.3, 5.11 (5.4*) in the textbook


V. Vapnik
Outline

• Linear Discriminant Function

• Large Margin Linear Classifier

• Nonlinear SVM: The Kernel Trick


Discriminant Function
• Chapter 2.4: the classifier is said to assign a feature vector x to
  class ωi if

        fi(x) > fj(x)   for all j ≠ i

• For the two-category case, define f(x) ≡ f1(x) − f2(x)

        Decide ω1 if f(x) > 0; otherwise decide ω2

• An example we've learned before: the Minimum-Error-Rate Classifier

        f(x) ≡ p(ω1 | x) − p(ω2 | x)
Discriminant Function
• It can be an arbitrary function of x, such as:

        Nearest Neighbor  |  Linear Functions: f(x) = wT x + b  |  Nonlinear Functions

• For a K-NN classifier it was necessary to 'carry' the training data
• For a linear classifier, the training data is used to learn w and then
  discarded
• Only w is needed for classifying new data
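
A minimal sketch of that last point, with made-up values for w and b (in practice they come from training):

    import numpy as np

    w = np.array([0.8, -0.5])   # hypothetical learned weight vector
    b = 0.2                     # hypothetical learned bias

    def classify(x):
        # Decide omega_1 if f(x) = w^T x + b > 0, otherwise omega_2
        return +1 if w @ x + b > 0 else -1

    print(classify(np.array([1.0, 0.3])))   # only w and b are needed, not the training data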
Linear Discriminant Function

(figure: two classes of training points in the x1-x2 plane; legend: denotes +1 / denotes -1)

• How would you classify these points using a linear discriminant
  function in order to minimize the error rate?

• Infinite number of answers!

• Which one is the best?

Large Margin Linear Classifier

(figure: the margin as a "safe zone" around the decision boundary in the x1-x2 plane; legend: denotes +1 / denotes -1)

• The linear discriminant function (classifier) with the maximum
  margin is the best

• Margin is defined as the width that the boundary could be increased
  by before hitting a data point

• Why is it the best?
  - Robust to outliers and thus strong generalization ability
Linear Discriminant Function

(figure: the hyperplane wT x + b = 0 in the x1-x2 plane, with wT x + b > 0 on one side and wT x + b < 0 on the other)

• f(x) is a linear function:

        f(x) = wT x + b

• A hyperplane in the feature space

• (Unit-length) normal vector of the hyperplane:

        n = w / ||w||
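
A standard consequence (not stated explicitly on the slide, but used in the margin derivation that follows): the signed distance from any point x to this hyperplane is

    r(\mathbf{x}) \;=\; \mathbf{n}^{T}(\mathbf{x} - \mathbf{x}_{0})
                 \;=\; \frac{\mathbf{w}^{T}\mathbf{x} + b}{\lVert\mathbf{w}\rVert},
    \qquad \text{where } \mathbf{w}^{T}\mathbf{x}_{0} + b = 0 .

The distance from the origin to a line wT x = k, quoted on a later slide as k / ||w||, is the special case x = 0.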
Large Margin Linear Classifier

(figure: training points in the x1-x2 plane; legend: denotes +1 / denotes -1)

• Given a set of data points {(xi, yi)}, i = 1, 2, ..., n, where

        for yi = +1,  wT xi + b > 0
        for yi = −1,  wT xi + b < 0

• With a scale transformation on both w and b, the above is
  equivalent to

        for yi = +1,  wT xi + b ≥ 1
        for yi = −1,  wT xi + b ≤ −1
Large Margin Linear Classifier

(figure: the margin of width m between the two boundary hyperplanes; the support vectors, including x+ and x-, lie on these boundaries; legend: denotes +1 / denotes -1)

• We know that

        wT x+ + b = +1
        wT x− + b = −1

• The distance between the origin and the line wT x = k is k / ||w||

• The margin width is:

        m = (x+ − x−) · n = 2 / ||w||
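
The algebra behind that margin value, filled in (notation as in the figure):

    \mathbf{w}^{T}\mathbf{x}^{+} + b = +1, \quad
    \mathbf{w}^{T}\mathbf{x}^{-} + b = -1
    \;\;\Longrightarrow\;\;
    \mathbf{w}^{T}(\mathbf{x}^{+} - \mathbf{x}^{-}) = 2

    m \;=\; (\mathbf{x}^{+} - \mathbf{x}^{-})\cdot\mathbf{n}
      \;=\; (\mathbf{x}^{+} - \mathbf{x}^{-})\cdot\frac{\mathbf{w}}{\lVert\mathbf{w}\rVert}
      \;=\; \frac{2}{\lVert\mathbf{w}\rVert}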
Large Margin Linear Classifier

(figure: as above, with support vectors x+ and x- on the margin boundaries)

• Formulation:

        maximize   2 / ||w||

        such that
        for yi = +1,  wT xi + b ≥ 1
        for yi = −1,  wT xi + b ≤ −1
Large Margin Linear Classifier

(figure: as above)

• Formulation (equivalent):

        minimize   (1/2) ||w||^2

        such that
        for yi = +1,  wT xi + b ≥ 1
        for yi = −1,  wT xi + b ≤ −1
Large Margin Linear Classifier

(figure: as above)

• Formulation (constraints combined):

        minimize   (1/2) ||w||^2

        such that
        yi (wT xi + b) ≥ 1
Solving the Optimization Problem

Quadratic programming with linear constraints:

        minimize   (1/2) ||w||^2

        s.t.   yi (wT xi + b) ≥ 1

Lagrangian Function:

        minimize   Lp(w, b, αi) = (1/2) ||w||^2 − Σ_{i=1..n} αi (yi (wT xi + b) − 1)

        s.t.   αi ≥ 0
Solving the Optimization Problem
        minimize   Lp(w, b, αi) = (1/2) ||w||^2 − Σ_{i=1..n} αi (yi (wT xi + b) − 1)

        s.t.   αi ≥ 0

Setting the gradients of Lp with respect to w and b to zero:

        ∂Lp/∂w = 0   ⟹   w = Σ_{i=1..n} αi yi xi

        ∂Lp/∂b = 0   ⟹   Σ_{i=1..n} αi yi = 0
Solving the Optimization Problem
Primal formulation:

        minimize   Lp(w, b, αi) = (1/2) ||w||^2 − Σ_{i=1..n} αi (yi (wT xi + b) − 1)

        s.t.   αi ≥ 0

Lagrangian Dual Problem (obtained by substituting the conditions above back into Lp; the solution must satisfy the KKT conditions):

        maximize   Σ_{i=1..n} αi − (1/2) Σ_{i=1..n} Σ_{j=1..n} αi αj yi yj xiT xj

        s.t.   αi ≥ 0,   Σ_{i=1..n} αi yi = 0
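
A short derivation sketch of that dual (not shown on the slide; it follows from the two stationarity conditions above):

    L_p \;=\; \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
          \;-\; \mathbf{w}^{T}\!\sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i
          \;-\; b\sum_{i=1}^{n}\alpha_i y_i
          \;+\; \sum_{i=1}^{n}\alpha_i

    \text{Substituting } \mathbf{w} = \sum_{i}\alpha_i y_i \mathbf{x}_i
    \text{ and } \sum_{i}\alpha_i y_i = 0 \text{ gives}

    L_D \;=\; \sum_{i=1}^{n}\alpha_i
          \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
                \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j

Maximizing L_D over αi ≥ 0 subject to Σ_{i} αi yi = 0 is the dual problem stated above.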
Solving the Optimization Problem
(figure: support vectors x+ and x- lying on the margin boundaries in the x1-x2 plane)

• From the KKT conditions, we know:

        αi (yi (wT xi + b) − 1) = 0

• Thus, only support vectors have αi ≠ 0

• The solution has the form:

        w = Σ_{i=1..n} αi yi xi = Σ_{i∈SV} αi yi xi

  and b is obtained from yi (wT xi + b) − 1 = 0, where xi is any support vector
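
A minimal numerical sketch of this procedure, assuming the CVXOPT quadratic-programming solver; the toy data, variable names, and the 1e-6 threshold are illustrative choices, not from the slides:

    import numpy as np
    from cvxopt import matrix, solvers

    solvers.options['show_progress'] = False

    # Tiny made-up, linearly separable dataset (illustrative only)
    X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    n = len(y)

    # Dual QP in CVXOPT's standard form: minimize (1/2) a^T P a + q^T a
    # with P_ij = y_i y_j x_i^T x_j and q = -1,
    # subject to alpha_i >= 0 and sum_i alpha_i y_i = 0
    P = matrix((y[:, None] * X) @ (y[:, None] * X).T)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))            # -alpha_i <= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))      # sum_i alpha_i y_i = 0
    b_eq = matrix(0.0)

    alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])

    sv = alpha > 1e-6                               # support vectors: alpha_i > 0
    w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i alpha_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)                  # from y_i (w^T x_i + b) = 1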


Solving the Optimization Problem
• The linear discriminant function is:

        g(x) = wT x + b = Σ_{i∈SV} αi yi xiT x + b

• Notice it relies on a dot product between the test point x and the
  support vectors xi

• Also keep in mind that solving the optimization problem involved
  computing the dot products xiT xj between all pairs of training points
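
Continuing the CVXOPT sketch above, classifying a new point uses only dot products with the support vectors; this is the property the kernel trick (see the outline) later exploits:

    def g(x_new):
        # g(x) = sum over support vectors of alpha_i y_i (x_i^T x_new) + b
        return np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b

    print(np.sign(g(np.array([2.2, 2.5]))))   # +1 for this point near the positive examples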
Large Margin Linear Classifier

(figure: a few noisy points inside or beyond the margin, with slack distances ξ1 and ξ2 marked in the x1-x2 plane; legend: denotes +1 / denotes -1)

• What if the data is not linearly separable? (noisy data, outliers, etc.)

• Slack variables ξi can be added to allow misclassification of
  difficult or noisy data points
Large Margin Linear Classifier
• Formulation:

        minimize   (1/2) ||w||^2 + C Σ_{i=1..n} ξi

        such that
        yi (wT xi + b) ≥ 1 − ξi
        ξi ≥ 0

• The parameter C can be viewed as a way to control over-fitting.


Large Margin Linear Classifier
• Formulation (Lagrangian Dual Problem):

        maximize   Σ_{i=1..n} αi − (1/2) Σ_{i=1..n} Σ_{j=1..n} αi αj yi yj xiT xj

        such that
        0 ≤ αi ≤ C
        Σ_{i=1..n} αi yi = 0

• The parameter C controls the trade-off between errors of the SVM on
  the training data and margin maximization (C = ∞ leads to the
  hard-margin SVM). A small value of C allows more training errors,
  while a large C leads to behavior similar to that of a hard-margin SVM.

• If C is too large, we have a high penalty for non-separable points and
  may store many support vectors and overfit. If it is too small, we may
  have underfitting.
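
A minimal sketch of this trade-off using scikit-learn's linear-kernel SVC; the dataset and the specific C values are illustrative assumptions:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.datasets import make_blobs

    # Illustrative 2-class data with some overlap between the clusters
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel='linear', C=C).fit(X, y)
        # Small C -> wider margin, more margin violations, more support vectors;
        # large C -> behavior approaching the hard-margin SVM
        print(C, 'support vectors:', clf.n_support_.sum(),
              'training accuracy:', clf.score(X, y))

In practice C is typically chosen by cross-validation.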
