
Support Vector Machine (SVM)

SVM is a hyperplane based (linear) classifier that ensures a large margin around the hyperplane
Note: We will assume the hyperplane to be of the form w^T x + b = 0 (will keep the bias term b)

Note: SVMs can also learn nonlinear decision boundaries using kernel methods (will see later)
Reason behind the name “Support Vector Machine”?
SVM optimization discovers the most important examples (called “support vectors”) in training data
These examples act to "balance" the margin boundaries (hence the name "support")

Learning a Maximum Margin Hyperplane

Suppose we want a hyperplane w^T x + b = 0 such that

    w^T x_n + b ≥ 1 for y_n = +1
    w^T x_n + b ≤ −1 for y_n = −1

Equivalently, y_n(w^T x_n + b) ≥ 1 ∀n

Define the margin on each side: γ = min_{1≤n≤N} |w^T x_n + b| / ||w|| = 1/||w||

Total margin = 2γ = 2/||w||

Want the hyperplane (w, b) that gives the largest possible margin

Note: Can replace y_n(w^T x_n + b) ≥ 1 by y_n(w^T x_n + b) ≥ m for some m > 0. It won't change the
solution for w, will just scale it by a constant, without changing the direction of w (exercise)

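A brief justification of the γ = 1/||w|| expression above (a short sketch added here, not on the original slide): the distance of any point x_0 from the hyperplane w^T x + b = 0 is |w^T x_0 + b| / ||w||, and the constraints force |w^T x_n + b| ≥ 1, with equality for the closest points.

```latex
% Distance of a point x_0 from the hyperplane w^T x + b = 0:
%   d(x_0) = |w^T x_0 + b| / ||w||
% The constraints give |w^T x_n + b| >= 1, with equality for the closest points, so
\gamma = \min_{1 \le n \le N} \frac{|w^\top x_n + b|}{\lVert w \rVert}
       = \frac{1}{\lVert w \rVert},
\qquad
2\gamma = \frac{2}{\lVert w \rVert}
```
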
Hard-Margin SVM

Hard-Margin: Every training example has to fulfil the margin condition y_n(w^T x_n + b) ≥ 1

Also want to maximize the margin γ ∝ 1/||w||. Equivalent to minimizing ||w||² or ||w||²/2

The objective for hard-margin SVM

    min_{w,b} f(w, b) = ||w||²/2
    subject to y_n(w^T x_n + b) ≥ 1,  n = 1, ..., N

Constrained optimization with N inequality constraints (note: function and constraints are convex)

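As a concrete illustration (not part of the original slides), the primal problem above can be handed directly to a generic convex solver. The sketch below uses CVXPY; the data matrix X, the labels y, and their values are made-up placeholders, and the problem is only feasible if the data are linearly separable.

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data (illustrative placeholders): X is N x D, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape

w = cp.Variable(D)
b = cp.Variable()

# min ||w||^2 / 2   subject to   y_n (w^T x_n + b) >= 1 for all n
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, ", b =", b.value)
```
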
Soft-Margin SVM (More Commonly Used)

Allow some training examples to fall within the margin region, or even be misclassified (i.e., fall on
the wrong side). Preferable if the training data is noisy

Each training example (x_n, y_n) is given a "slack" ξ_n ≥ 0 (the distance by which it "violates" the margin).
If ξ_n > 1 then x_n is totally on the wrong side

Basically, we want a soft-margin condition: y_n(w^T x_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0

Soft-Margin SVM (More Commonly Used)

Goal: Maximize the margin, while also minimizing the sum of slacks (don't want too many training
examples violating the margin condition)

The primal objective for soft-margin SVM can thus be written as

    min_{w,b,ξ} f(w, b, ξ) = ||w||²/2 + C Σ_{n=1}^N ξ_n
    subject to y_n(w^T x_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, ..., N

Constrained optimization with 2N inequality constraints

Parameter C controls the trade-off between a large margin and a small training error

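For comparison (again an added sketch, not from the slides), the soft-margin primal needs only slack variables added to the previous CVXPY snippet; X, y, and the value of C are illustrative placeholders.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [0.5, 0.5]])  # last point is "noisy"
y = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape
C = 1.0  # trade-off between a large margin and a small sum of slacks

w = cp.Variable(D)
b = cp.Variable()
xi = cp.Variable(N, nonneg=True)  # slack variables xi_n >= 0

# min ||w||^2 / 2 + C * sum_n xi_n   s.t.   y_n (w^T x_n + b) >= 1 - xi_n
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, ", b =", b.value, ", slacks =", xi.value)
```
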
Summary: Hard-Margin SVM vs Soft-Margin SVM

Objective for the hard-margin SVM (unknowns are w and b)

    min_{w,b} f(w, b) = ||w||²/2
    subject to y_n(w^T x_n + b) ≥ 1,  n = 1, ..., N

Objective for the soft-margin SVM (unknowns are w, b, and {ξ_n}_{n=1}^N)

    min_{w,b,ξ} f(w, b, ξ) = ||w||²/2 + C Σ_{n=1}^N ξ_n
    subject to y_n(w^T x_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, ..., N

In either case, we have to solve a constrained, convex optimization problem


Solving Hard-Margin SVM

The hard-margin SVM optimization problem is:

    min_{w,b} f(w, b) = ||w||²/2
    subject to 1 − y_n(w^T x_n + b) ≤ 0,  n = 1, ..., N

A constrained optimization problem. Can solve using Lagrange's method

Introduce Lagrange multipliers α_n (n = 1, ..., N), one for each constraint, and solve

    min_{w,b} max_{α≥0} L(w, b, α) = ||w||²/2 + Σ_{n=1}^N α_n {1 − y_n(w^T x_n + b)}

Note: α = [α_1, ..., α_N] is the vector of Lagrange multipliers

Note: It is easier (and helpful; we will soon see why) to solve the dual problem: do the min (over w, b) first and then the max (over α)

Solving Hard-Margin SVM

The dual problem (min then max) is

    max_{α≥0} min_{w,b} L(w, b, α) = w^T w/2 + Σ_{n=1}^N α_n {1 − y_n(w^T x_n + b)}

Take (partial) derivatives of L w.r.t. w, b and set them to zero

    ∂L/∂w = 0 ⇒ w = Σ_{n=1}^N α_n y_n x_n        ∂L/∂b = 0 ⇒ Σ_{n=1}^N α_n y_n = 0

Important: Note the form of the solution w: it is simply a weighted sum of all the training inputs
x_1, ..., x_N (and α_n is like the "importance" of x_n)

Substituting w = Σ_{n=1}^N α_n y_n x_n in the Lagrangian, we get the dual problem as (verify)

    max_{α≥0} L_D(α) = Σ_{n=1}^N α_n − (1/2) Σ_{m,n=1}^N α_m α_n y_m y_n (x_m^T x_n)

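The "verify" step above, worked out as a short sketch: substitute w = Σ_n α_n y_n x_n and use Σ_n α_n y_n = 0, which makes the b term vanish.

```latex
% With w = \sum_n \alpha_n y_n x_n and \sum_n \alpha_n y_n = 0:
\frac{w^\top w}{2} = \frac{1}{2}\sum_{m,n} \alpha_m \alpha_n y_m y_n (x_m^\top x_n),
\quad
\sum_n \alpha_n y_n w^\top x_n = \sum_{m,n} \alpha_m \alpha_n y_m y_n (x_m^\top x_n),
\quad
b \sum_n \alpha_n y_n = 0
% Adding these up together with the \sum_n \alpha_n term gives
L_D(\alpha) = \sum_{n=1}^{N} \alpha_n
            - \frac{1}{2}\sum_{m,n=1}^{N} \alpha_m \alpha_n y_m y_n (x_m^\top x_n)
```
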
Solving Hard-Margin SVM

Can write the objective more compactly in vector/matrix form as

    max_{α≥0} L_D(α) = α^T 1 − (1/2) α^T Gα

where G is an N × N matrix with G_mn = y_m y_n x_m^T x_n, and 1 is a vector of 1s

Good news: This is maximizing a concave function (or, equivalently, minimizing a convex function:
verify that the Hessian of the negated objective is G, which is p.s.d.). Note that our original SVM
objective was also convex

Important: Inputs x only appear as inner products (helps to "kernelize"; more when we see
kernel methods)

Can solve† the above objective function for α using various methods, e.g.,
    Treating the objective as a Quadratic Program (QP) and running an off-the-shelf QP solver such as
    quadprog (MATLAB), CVXOPT, CPLEX, etc.
    Using (projected) gradient methods (projection needed because the α's are constrained). Gradient
    methods will usually be much faster than QP methods.

† If interested in more details of the solver, see: "Support Vector Machine Solvers" by Bottou and Lin

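A minimal projected gradient ascent sketch for the dual above (an added illustration, not the course's reference solver; practical solvers such as those surveyed by Bottou and Lin are far more refined). For simplicity it ignores the equality constraint Σ_n α_n y_n = 0 that comes from keeping the bias b, so it only shows the projection idea; eta and num_iters are arbitrary choices.

```python
import numpy as np

def hard_margin_dual_pg(X, y, eta=0.01, num_iters=1000):
    """Projected gradient ascent on L_D(alpha) = alpha^T 1 - 0.5 alpha^T G alpha, alpha >= 0.

    Note: ignores the constraint sum_n alpha_n y_n = 0 (i.e., effectively drops the bias b).
    """
    N = X.shape[0]
    Xy = y[:, None] * X                 # rows are y_n x_n
    G = Xy @ Xy.T                       # G_mn = y_m y_n x_m^T x_n
    alpha = np.zeros(N)
    for _ in range(num_iters):
        grad = 1.0 - G @ alpha          # gradient of the dual objective
        alpha = alpha + eta * grad      # (unconstrained) ascent step
        alpha = np.maximum(alpha, 0.0)  # projection onto the feasible set alpha >= 0
    return alpha
```
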
Hard-Margin SVM: The Solution

Once we have the α_n's, w and b can be computed as:

    w = Σ_{n=1}^N α_n y_n x_n   (we already saw this)

    b = −(1/2) [ min_{n: y_n=+1} w^T x_n + max_{n: y_n=−1} w^T x_n ]   (exercise)

A nice property: Most α_n's in the solution will be zero (sparse solution)

Reason: Karush-Kuhn-Tucker (KKT) conditions. For the optimal α_n's

    α_n {1 − y_n(w^T x_n + b)} = 0

α_n is non-zero only if x_n lies on one of the two margin boundaries, i.e., for which y_n(w^T x_n + b) = 1

These examples are called support vectors. Recall that the support vectors "support" the margin boundaries

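Once a solver returns α, the recovery of w, b, and the support vectors described above takes a few lines of NumPy (a sketch; the tolerance tol is an arbitrary numerical cutoff for deciding which α_n count as nonzero).

```python
import numpy as np

def recover_solution(alpha, X, y, tol=1e-6):
    # w = sum_n alpha_n y_n x_n
    w = (alpha * y) @ X
    # b = -1/2 [ min_{n: y_n=+1} w^T x_n + max_{n: y_n=-1} w^T x_n ]
    scores = X @ w
    b = -0.5 * (scores[y == 1].min() + scores[y == -1].max())
    # Support vectors: training inputs with (numerically) nonzero alpha_n
    support_idx = np.where(alpha > tol)[0]
    return w, b, support_idx
```
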
Solving Soft-Margin SVM

Recall the soft-margin SVM optimization problem:

    min_{w,b,ξ} f(w, b, ξ) = ||w||²/2 + C Σ_{n=1}^N ξ_n
    subject to 1 ≤ y_n(w^T x_n + b) + ξ_n,  −ξ_n ≤ 0,  n = 1, ..., N

Note: ξ = [ξ_1, ..., ξ_N] is the vector of slack variables

Introduce Lagrange multipliers α_n, β_n (n = 1, ..., N) for the constraints, and solve the Lagrangian:

    min_{w,b,ξ} max_{α≥0,β≥0} L(w, b, ξ, α, β) = ||w||²/2 + C Σ_{n=1}^N ξ_n + Σ_{n=1}^N α_n {1 − y_n(w^T x_n + b) − ξ_n} − Σ_{n=1}^N β_n ξ_n

Note: The slack-related terms (shown in red on the original slides) were not present in the hard-margin SVM

Two sets of dual variables, α = [α_1, ..., α_N] and β = [β_1, ..., β_N]. We'll eliminate the primal
variables w, b, ξ to get a dual problem containing only the dual variables (just like in the hard-margin case)

Solving Soft-Margin SVM

The Lagrangian problem to solve:

    min_{w,b,ξ} max_{α≥0,β≥0} L(w, b, ξ, α, β) = w^T w/2 + C Σ_{n=1}^N ξ_n + Σ_{n=1}^N α_n {1 − y_n(w^T x_n + b) − ξ_n} − Σ_{n=1}^N β_n ξ_n

Take (partial) derivatives of L w.r.t. w, b, ξ_n and set them to zero

    ∂L/∂w = 0 ⇒ w = Σ_{n=1}^N α_n y_n x_n,    ∂L/∂b = 0 ⇒ Σ_{n=1}^N α_n y_n = 0,    ∂L/∂ξ_n = 0 ⇒ C − α_n − β_n = 0

Note: The solution for w again has the same form as in the hard-margin case (a weighted sum of all
inputs, with α_n being the importance of input x_n)

Note: Using C − α_n − β_n = 0 and β_n ≥ 0 ⇒ α_n ≤ C (recall that, for the hard-margin case, we only had α_n ≥ 0)

Substituting these in the Lagrangian L gives the dual problem

    max_{0≤α≤C, β≥0} L_D(α, β) = Σ_{n=1}^N α_n − (1/2) Σ_{m,n=1}^N α_m α_n y_m y_n (x_m^T x_n)    s.t.  Σ_{n=1}^N α_n y_n = 0

Solving Soft-Margin SVM

Interestingly, the dual variables β don't appear in the objective!

Just like the hard-margin case, we can write the dual more compactly as

    max_{0≤α≤C} L_D(α) = α^T 1 − (1/2) α^T Gα

where G is an N × N matrix with G_mn = y_m y_n x_m^T x_n, and 1 is a vector of 1s

Like the hard-margin case, solving the dual requires concave maximization (or convex minimization)
Can be solved† the same way as the hard-margin SVM (except that now 0 ≤ α ≤ C)
Can solve for α using QP solvers or (projected) gradient methods
Given α, the solution for w, b has the same form as in the hard-margin case
Note: α is again sparse. Nonzero α_n's correspond to the support vectors

† If interested in more details of the solver, see: "Support Vector Machine Solvers" by Bottou and Lin

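Relative to the hard-margin projected gradient sketch shown earlier, the only change needed for the soft-margin dual is to project α onto the box [0, C] instead of onto [0, ∞) (again ignoring the Σ_n α_n y_n = 0 equality constraint in that simplified sketch).

```python
import numpy as np

def project_onto_box(alpha, C):
    """Projection step for the soft-margin dual: clip each alpha_n into [0, C]."""
    return np.clip(alpha, 0.0, C)

# In the earlier hard-margin sketch, the step
#     alpha = np.maximum(alpha + eta * grad, 0.0)
# would simply become
#     alpha = project_onto_box(alpha + eta * grad, C)
```
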
Support Vectors in Soft-Margin SVM

The hard-margin SVM solution had only one type of support vector
    ... the ones that lie on the margin boundaries w^T x + b = −1 and w^T x + b = +1

The soft-margin SVM solution has three types of support vectors

1. Lying on the margin boundaries w^T x + b = −1 and w^T x + b = +1 (ξ_n = 0)
2. Lying within the margin region (0 < ξ_n < 1) but still on the correct side
3. Lying on the wrong side of the hyperplane (ξ_n ≥ 1)

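Given a learned (w, b), these three geometric categories can be read off from the slack values ξ_n = max(0, 1 − y_n(w^T x_n + b)) (a sketch added here; whether a boundary point is actually a support vector still depends on its α_n being nonzero, and tol is an arbitrary numerical cushion).

```python
import numpy as np

def support_vector_types(X, y, w, b, tol=1e-6):
    margin = y * (X @ w + b)                    # y_n (w^T x_n + b)
    xi = np.maximum(0.0, 1.0 - margin)          # slack of each training example
    on_boundary = np.abs(margin - 1.0) <= tol   # type 1: on a margin boundary (xi = 0)
    inside_margin = (xi > tol) & (xi < 1.0)     # type 2: within the margin, correct side
    wrong_side = xi >= 1.0                      # type 3: on the wrong side of the hyperplane
    return on_boundary, inside_margin, wrong_side
```
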
SVMs via Dual Formulation: Some Comments

Recall the final dual objectives for hard-margin and soft-margin SVM

    Hard-Margin SVM:  max_{α≥0} L_D(α) = α^T 1 − (1/2) α^T Gα

    Soft-Margin SVM:  max_{0≤α≤C} L_D(α) = α^T 1 − (1/2) α^T Gα

The dual formulation is nice due to two primary reasons:

    Allows conveniently handling the margin-based constraints (via Lagrangians)
    Important: Allows learning nonlinear separators by replacing the inner products (e.g., G_mn = y_m y_n x_m^T x_n)
    with kernelized similarities (kernelized SVMs)

However, the dual formulation can be expensive if N is large. Have to solve for N variables
α = [α_1, ..., α_N], and also need to store an N × N matrix G

A lot of work† on speeding up SVMs in these settings (e.g., can use coordinate descent for α)

† See: "Support Vector Machine Solvers" by Bottou and Lin

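To make the "replace inner products" remark concrete (an added sketch anticipating the kernel lectures): building G from a kernel function k(x_m, x_n) instead of x_m^T x_n is the only change the dual needs. The RBF kernel and its bandwidth gamma below are illustrative choices.

```python
import numpy as np

def gram_matrix(X, y, kernel):
    """G_mn = y_m y_n k(x_m, x_n); with k(x, z) = x^T z this is exactly the linear-SVM G."""
    K = np.array([[kernel(xm, xn) for xn in X] for xm in X])
    return (y[:, None] * y[None, :]) * K

def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))
```
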
SVM: Some Notes

A hugely (perhaps the most!) popular classification algorithm

Reasonably mature, highly optimized SVM software is freely available (perhaps the reason why it is
more popular than various other competing algorithms)
    Some popular ones: libSVM, LIBLINEAR; scikit-learn (sklearn) also provides SVMs

Lots of work on scaling up SVMs† (both large N and large D)

Extensions beyond binary classification (e.g., multiclass, structured outputs)
Can even be used for regression problems (Support Vector Regression)

Nonlinear extensions possible via kernels

† See: "Support Vector Machine Solvers" by Bottou and Lin

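A minimal scikit-learn usage sketch of the (soft-margin, linear) SVM covered in this lecture; the toy data here are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data (illustrative placeholders)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # soft-margin SVM with a linear kernel
clf.fit(X, y)

print("w =", clf.coef_, ", b =", clf.intercept_)
print("support vectors:", clf.support_vectors_)  # the x_n with nonzero alpha_n
print("y_n * alpha_n:", clf.dual_coef_)          # signed dual coefficients of the support vectors
```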
