
Machine Learning Srihari

Support Vector Machines

srihari@buffalo.edu

SVM Discussion Overview


1.  Importance of SVMs
2.  Overview of Mathematical Techniques
Employed
3.  Margin Geometry
4.  SVM Training Methodology
5.  Overlapping Distributions
6.  Dealing with Multiple Classes
7.  SVM and Computational Learning Theory
8.  Relevance Vector Machines

1. Importance of SVMs
•  SVM is a discriminative method that
brings together:
1. computational learning theory
2. previously known methods in linear
discriminant functions
3. optimization theory
•  Widely used for solving problems in
classification, regression and novelty
detection

Alternative Names for SVM


•  Also called Sparse kernel machines
•  Kernel methods predict based on linear
combinations of a kernel function evaluated
at the training points, e.g., Parzen Window
•  Sparse because predictions depend on only a subset of the training points (the support vectors)
•  Also called Maximum margin classifiers

2. Mathematical Techniques Used

1.  The linearly separable case is considered first
    •  With an appropriate nonlinear mapping φ to a sufficiently high dimension, two categories can always be separated by a hyperplane
2.  To handle non-linear separability
    •  Preprocess the data to represent it in a much higher-dimensional space than the original feature space
    •  The kernel trick reduces the computational overhead:
           k(y_j, y_k) = y_j^t ⋅ y_k = φ(x_j)^t ⋅ φ(x_k)
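As an illustration of the kernel trick, the minimal sketch below (with made-up toy inputs) builds the Gram matrix of a polynomial kernel directly from the inputs, without ever constructing the high-dimensional features φ(x):

```python
import numpy as np

# Minimal sketch of the kernel trick: pairwise similarities are computed
# directly from the inputs x_k, never from the explicit features phi(x_k).
def poly_kernel(xj, xk, d=2):
    # k(x_j, x_k) = (x_j . x_k)^d, which equals phi(x_j)^t phi(x_k)
    return (xj @ xk) ** d

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # toy input points (assumed)
K = np.array([[poly_kernel(xj, xk) for xk in X] for xj in X])
print(K)   # 3x3 Gram matrix; all later computation uses only these entries
```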

3. Support Vectors and Margin

•  Support vectors are those nearest patterns at distance b from the hyperplane
•  SVM finds the hyperplane with maximum distance (margin distance b) from the nearest training patterns

Figure: three support vectors are shown as solid dots
Margin Maximization

•  Motivation: computational learning theory


•  or statistical learning theory (PAC learning-VC dimension)
•  Insight (Tong, Koller 2000):
•  Model the distribution of each class using a Parzen density estimator with Gaussian kernels having common parameter σ²
•  Instead of the optimum boundary, determine the best hyperplane relative to the learned density model
   – As σ² → 0 the optimum hyperplane has maximum margin
•  Hyperplane becomes independent of data points
that are not support vectors
Distance from arbitrary plane

•  Hyperplane: g(x) = w^t x + w_0, where w is the weight vector and w_0 is the bias

•  Lemma: the distance from x to the plane is r = g(x) / ||w||

   Proof: Write x = x_p + r (w / ||w||), where x_p is the projection of x onto the plane
   (so g(x_p) = 0) and r is the distance from x to the plane. Then
       g(x) = w^t (x_p + r w/||w||) + w_0 = g(x_p) + r ||w||²/||w|| = r ||w||,
   hence r = g(x) / ||w||.  QED

•  Corollary: the distance of the origin to the plane is
       r = g(0) / ||w|| = w_0 / ||w||,  since g(0) = w^t 0 + w_0 = w_0
   Thus w_0 = 0 implies that the plane passes through the origin

Figure: the plane g(x) = w^t x + w_0 = 0, the projection x_p of x onto it, and the regions g(x) > 0 and g(x) < 0
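A quick numeric check of the lemma; the weight vector, bias and test point below are arbitrary illustrative values:

```python
import numpy as np

# Numeric check of r = g(x) / ||w|| for an arbitrary plane and point.
w = np.array([3.0, 4.0])          # weight vector (assumed values)
w0 = -5.0                         # bias
x = np.array([4.0, 3.0])          # arbitrary point

g = w @ x + w0                    # g(x) = w^t x + w_0
r = g / np.linalg.norm(w)         # signed distance claimed by the lemma

# Project x back onto the plane: x_p = x - r * w/||w||; g(x_p) should be 0.
x_p = x - r * w / np.linalg.norm(w)
print(r)                                  # 3.8
print(np.isclose(w @ x_p + w0, 0.0))      # True: x_p lies on the plane
```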

Choosing a margin

•  Augmented space: g(y) = a^t y, by choosing a_0 = w_0 and y_0 = 1, i.e., the plane passes through the origin
•  For each pattern, let z_k = ±1 according to whether pattern k is in class ω1 or ω2
•  Thus if g(y) = 0 is a separating hyperplane, then
       z_k g(y_k) > 0,  k = 1, ..., n
•  Since the distance of a point y to the hyperplane g(y) = 0 is g(y) / ||a||,
   we could require that the hyperplane be such that all points are at least a distance b from it, i.e.,
       z_k g(y_k) / ||a|| ≥ b

SVM Margin geometry

Figure: the hyperplanes g(y) = a^t y = 0, g(y) = +1 and g(y) = −1 in the (y_1, y_2) plane.
The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes, which has length 2/||a||.

Statement of Optimization Problem

•  The goal is to find the weight vector a that satisfies
       z_k g(y_k) / ||a|| ≥ b,   k = 1, ..., n
   while maximizing the margin b
•  To ensure uniqueness we impose the constraint b ||a|| = 1, i.e., b = 1/||a||
•  This implies that ||a||² is to be minimized
•  Support vectors are the (transformed) training patterns for which equality holds in the constraint above
•  This is called a quadratic optimization problem, since we are trying to minimize a quadratic function subject to a set of linear inequality constraints

4. SVM Training Methodology


1.  Training is formulated as an optimization
problem
•  Dual problem is stated to reduce computational
complexity
•  Kernel trick is used to reduce computation
2.  Determination of the model parameters
corresponds to a convex optimization problem
•  Solution is straightforward (local solution is a global
optimum)
3.  Makes use of Lagrange multipliers

Joseph-Louis Lagrange 1736-1813


•  French Mathematician
•  Born in Turin, Italy
•  Succeeded Euler at Berlin
academy
•  Narrowly escaped execution in the French Revolution due to Lavoisier, who was himself guillotined
•  Made key contributions to
calculus and dynamics

SVM Training: Optimization Problem

•  Optimize
       arg min_a  ½ ||a||²
   subject to the constraints
       z_k a^t y_k ≥ 1,   k = 1, ..., n
•  Can be cast as an unconstrained problem by introducing Lagrange undetermined multipliers, with one multiplier α_k for each constraint
•  The Lagrange function is
       L(a, α) = ½ ||a||² − Σ_{k=1}^{n} α_k [ z_k a^t y_k − 1 ]

Optimization of Lagrange function

•  The Lagrange function is
       L(a, α) = ½ ||a||² − Σ_{k=1}^{n} α_k [ z_k a^t y_k − 1 ]
•  We seek to minimize L with respect to the weight vector a
•  and to maximize it with respect to the undetermined multipliers α_k ≥ 0
•  The last term represents the goal of classifying the points correctly
•  The Karush-Kuhn-Tucker construction shows that this can be recast as a maximization problem which is computationally better

Dual Optimization Problem

•  The problem is reformulated as one of maximizing
       L(α) = Σ_{k=1}^{n} α_k − ½ Σ_{k=1}^{n} Σ_{j=1}^{n} α_k α_j z_k z_j k(y_j, y_k)
•  subject to the constraints (given the training data)
       Σ_{k=1}^{n} α_k z_k = 0,    α_k ≥ 0,  k = 1, ..., n
•  where the kernel function is defined by
       k(y_j, y_k) = y_j^t ⋅ y_k = φ(x_j)^t ⋅ φ(x_k)

Solution of Dual Problem

•  Implementation:
   •  Solved using quadratic programming
   •  Alternatively, since it only needs inner products of the training data, it can be implemented using kernel functions
      •  which is a crucial property for generalizing to the non-linear case
•  The solution is given by
       a = Σ_k α_k z_k y_k
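A hedged sketch of solving this dual numerically for a tiny linearly separable toy set, using a general-purpose constrained optimizer (SciPy's SLSQP) rather than a dedicated quadratic-programming package; the data points are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable training set (values are assumptions for illustration).
Y = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # patterns y_k
z = np.array([1.0, 1.0, -1.0, -1.0])                                # labels z_k = +/-1

K = Y @ Y.T                                 # linear kernel k(y_j, y_k) = y_j^t y_k
H = (z[:, None] * z[None, :]) * K

def neg_dual(alpha):                        # minimize the negative of L(alpha)
    return 0.5 * alpha @ H @ alpha - alpha.sum()

cons = [{'type': 'eq', 'fun': lambda a: a @ z}]   # sum_k alpha_k z_k = 0
bnds = [(0.0, None)] * len(z)                     # alpha_k >= 0
res = minimize(neg_dual, np.zeros(len(z)), method='SLSQP',
               bounds=bnds, constraints=cons)

alpha = res.x
a = (alpha * z) @ Y                         # weight vector a = sum_k alpha_k z_k y_k
support = np.where(alpha > 1e-6)[0]         # support vectors: nonzero multipliers
print(a, support)
```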

Summary of SVM Optimization Problems

Figure: the primal and dual optimization problems shown side by side (note: the figure uses different notation); the dual objective contains the quadratic term in α

Kernel Function: key property

•  If the kernel function is chosen with the property
       K(x, y) = φ(x) ⋅ φ(y)
   then the computational expense of the increased dimensionality is avoided.
•  The polynomial kernel K(x, y) = (x ⋅ y)^d can be shown (next slide) to correspond to a map φ into the space spanned by all products of exactly d input features.

A Polynomial Kernel Function

•  Suppose x = (x_1, x_2)^t is the input vector
•  The feature space mapping is
       φ(x) = (x_1², √2 x_1 x_2, x_2²)^t
•  Then the inner product is
       φ(x)^t φ(y) = (x_1², √2 x_1 x_2, x_2²)(y_1², √2 y_1 y_2, y_2²)^t = (x_1 y_1 + x_2 y_2)²
•  The polynomial kernel function computes the same value:
       K(x, y) = (x ⋅ y)² = ((x_1, x_2)(y_1, y_2)^t)² = (x_1 y_1 + x_2 y_2)²
   i.e., K(x, y) = φ(x)^t φ(y)
•  The inner product φ(x)^t φ(y) needs computing six feature values and 3 × 3 = 9 multiplications
•  The kernel function K(x, y) needs 2 multiplications and a squaring
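A quick numeric check of this identity, with arbitrary illustrative vectors:

```python
import numpy as np

# Check that the explicit quadratic feature map and the polynomial kernel
# K(x, y) = (x . y)^2 give the same inner product.
def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])     # illustrative inputs (assumed values)
y = np.array([3.0, -1.0])

lhs = phi(x) @ phi(y)        # via the six feature values
rhs = (x @ y) ** 2           # via one inner product and a squaring
print(lhs, rhs)              # both equal (x1*y1 + x2*y2)^2 = 1.0 here
```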

Another Polynomial (quadratic) Kernel Function

•  K(x, y) = (x ⋅ y + 1)²
•  This one maps d = 2, p = 2 into a six-dimensional feature space
•  Contains all monomials of x up to degree 2
       K(x, y) = φ(x)^t φ(y)   where   φ(x) = (x_1², x_2², √2 x_1, √2 x_2, √2 x_1 x_2, 1)^t
•  The explicit inner product needs 36 multiplications
•  The kernel function needs 4 multiplications

Non-Linear Case
•  Choose a mapping function φ(·) to a sufficiently high dimension
   •  so that data from the two categories can always be separated by a hyperplane
•  Assume each pattern x_k has been transformed to y_k = φ(x_k), for k = 1, ..., n
•  First choose the non-linear φ functions
   •  to map the input vector to a higher-dimensional feature space
•  The dimensionality of the space can be arbitrarily high, limited only by computational resources

Mapping into Higher Dimensional Feature Space

•  Map each input point x by
       y = Φ(x) = (1, x, x²)^t
•  Points on the 1-d line are mapped onto a curve in 3-d
•  Linear separation in the 3-d space is then possible; the linear discriminant function in 3-d has the form
       g(x) = a_1 y_1 + a_2 y_2 + a_3 y_3
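A small sketch of this 1-d to 3-d map; the discriminant coefficients below are illustrative assumptions, chosen so that the rule |x| < 1 (non-linear in x) becomes a linear discriminant in the mapped space:

```python
import numpy as np

# Map 1-d points onto the curve y = (1, x, x^2)^t in 3-d.
x = np.linspace(-2.0, 2.0, 9)
Y = np.stack([np.ones_like(x), x, x**2], axis=1)   # each row is (1, x, x^2)

# Linear discriminant g(x) = a1*y1 + a2*y2 + a3*y3 with assumed coefficients,
# here g(x) = x^2 - 1: negative strictly inside (-1, 1), positive outside.
a = np.array([-1.0, 0.0, 1.0])
print(np.sign(Y @ a))    # -1 inside, 0 on the boundary points, +1 outside
```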

Pattern Transformation using Kernels

•  Problem with high-dimensional mapping
   •  Very many parameters
   •  A polynomial of degree p over d variables leads to O(d^p) variables in feature space
   •  Example: if d = 50 and p = 2 we need a feature space of size 2500
•  Solution:
   •  The dual optimization problem needs only inner products
   •  Each pattern x_k is transformed into a pattern y_k = Φ(x_k)
   •  The dimensionality of the mapped space can be arbitrarily high

Example of SVM results


•  Two classes in two
dimensions
•  Synthetic Data
•  Shows contours of constant
g (x)
•  Obtained from SVM with
Gaussian kernel function
•  Decision boundary is shown
•  Margin boundaries are
shown
•  Support vectors are shown
•  Shows sparsity of SVM
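A hedged sketch in the same spirit, fitting a Gaussian (RBF) kernel SVM with scikit-learn on synthetic two-class data; the data generation and kernel parameters are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic Gaussian blobs in 2-d (assumed data-generation choices).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1.0, -1.0], scale=0.7, size=(40, 2)),
               rng.normal(loc=[+1.0, +1.0], scale=0.7, size=(40, 2))])
t = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, t)   # Gaussian kernel SVM

# Sparsity: only the support vectors are needed at prediction time.
print(len(clf.support_vectors_), "support vectors out of", len(X), "points")
print(clf.decision_function(X[:3]))                   # values of g(x) for a few points
```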

SVM for the XOR problem

•  XOR: binary-valued features x_1, x_2
•  Not solved by a linear discriminant
•  The function φ maps the input x = [x_1, x_2] into the six-dimensional feature space
       y = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2²]

Figure: the input space and a feature sub-space, showing the hyperplanes corresponding to √2 x_1 x_2 = ±1, which correspond to the hyperbolas x_1 x_2 = ±1 in input space

SVM for XOR: maximization problem

•  We seek to maximize
       Σ_{k=1}^{4} α_k − ½ Σ_{k=1}^{4} Σ_{j=1}^{4} α_k α_j z_k z_j y_j^t y_k
•  Subject to the constraints
       α_1 − α_2 + α_3 − α_4 = 0
       0 ≤ α_k,  k = 1, 2, 3, 4
•  From problem symmetry, at the solution
       α_1 = α_3  and  α_2 = α_4

SVM for XOR: solution

•  Can use iterative gradient descent
•  Or use analytical techniques for this small problem
•  The solution is
       α* = (1/8, 1/8, 1/8, 1/8)
•  Since every α_k is nonzero, all four points are support vectors (unusual, and due to the symmetric nature of XOR)
•  The final discriminant function is g(x_1, x_2) = x_1 x_2
•  The decision hyperplane is defined by g(x_1, x_2) = 0
•  The margin is given by b = 1/||a|| = √2

Figure: the hyperplanes corresponding to √2 x_1 x_2 = ±1 and the hyperbolas x_1 x_2 = ±1
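A short numeric check of this XOR solution in the six-dimensional feature space:

```python
import numpy as np

# Verify the XOR solution with the feature map
# y = [1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2].
def phi(x1, x2):
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, s * x1 * x2, x1**2, x2**2])

X = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # the four XOR patterns
z = np.array([+1, -1, +1, -1])             # class labels
alpha = np.full(4, 1.0 / 8.0)              # the solution alpha* = (1/8, ..., 1/8)

Y = np.array([phi(x1, x2) for x1, x2 in X])
a = (alpha * z) @ Y                        # a = sum_k alpha_k z_k y_k

print(a)                                   # only the sqrt(2) x1 x2 component is nonzero
print([float(a @ phi(x1, x2)) for x1, x2 in X])   # g = x1 * x2 = +1, -1, +1, -1
print(1.0 / np.linalg.norm(a))             # margin b = 1/||a|| = sqrt(2)
```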

5. Overlapping Class Distributions


•  We assumed training data are linearly
separable in the mapped space y
•  Resulting SVM gives exact separation in
input space x although decision boundary will
be nonlinear
•  In practice class-conditionals overlap
•  Exact separation leads to poor generalization
•  Therefore need to allow SVM to misclassify
some training points

ν-SVM applied to non-separable data


•  Support Vectors are indicated
by circles
•  Done by introducing slack
variables
•  With one slack variable per
training data point
•  Maximize the margin while
softly penalizing points that lie
on the wrong side of the margin
boundary
•  ν is an upper bound on the fraction of margin errors (points that lie on the wrong side of the margin boundary)
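A hedged sketch of a ν-SVM on overlapping synthetic data, using scikit-learn's NuSVC; the data generation and the value of ν are assumptions for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import NuSVC

# Overlapping (non-separable) two-class data (assumed generation parameters).
X, t = make_blobs(n_samples=200, centers=[[-1, -1], [1, 1]],
                  cluster_std=1.5, random_state=0)

clf = NuSVC(nu=0.2, kernel='rbf', gamma=1.0).fit(X, t)

# nu upper-bounds the fraction of margin errors; the fraction of support
# vectors printed below will typically be at least nu.
print(len(clf.support_) / len(X))
```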

6. Multiclass SVMs (one-versus rest)


•  SVM is a two-class classifier
•  Several methods have been suggested for combining multiple two-class classifiers
•  Most used approach: one versus rest
•  Also recommended by Vapnik
•  using data from class Ck as the positive examples
and data from the remaining k-1 classes as
negative examples
•  Disadvantages
•  An input may be assigned to multiple classes simultaneously
•  Training sets are imbalanced (e.g., 90% of examples in one class and
10% in the other) – the symmetry of the original problem is lost

Multiclass SVMs (one-versus one)

•  Train k(k-1)/2 different 2-class SVMs on


all possible pairs of classes
•  Classify test points according to which
class has highest number of votes
•  Again leads to ambiguities in classification
•  For large k requires significantly more
training time than one-versus rest
•  Also more computation time for evaluation
•  Can be alleviated by organizing into a directed
acyclic graph (DAGSVM)
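A brief sketch contrasting the two schemes, using scikit-learn's generic wrappers around a two-class SVC on a standard 10-class dataset (the dataset choice is an assumption):

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, t = load_digits(return_X_y=True)        # 10-class digit data

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, t)   # k two-class SVMs
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, t)    # k(k-1)/2 two-class SVMs

print(len(ovr.estimators_), len(ovo.estimators_))           # 10 and 45 for k = 10
```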

7. SVM and Computational Learning Theory


•  SVM motivated using COLT
•  Called PAC learning framework
•  The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization
•  Key quantity in PAC learning is VC dimension
which provides a measure of complexity of a
space of functions
•  Maximizing margin is main conclusion

All dichotomies of 3 points in 2 dimensions are linearly separable

VC Dimension of Hyperplanes in R²

•  The VC dimension provides a measure of the complexity of a class of decision functions
•  Hyperplanes in R^d → VC dimension = d + 1

Fraction of Dichotomies that are Linearly Separable

•  The fraction of the dichotomies of n points (in general position) in d dimensions that are linearly separable is

       f(n, d) = 1                                    for n ≤ d + 1
       f(n, d) = (2 / 2^n) Σ_{i=0}^{d} C(n−1, i)      for n > d + 1

•  When the number of points is at most d + 1, all dichotomies are linearly separable
•  At n = 2(d + 1), called the capacity of the hyperplane, half of the dichotomies are still linearly separable (f = 0.5)
•  The hyperplane is not over-determined until the number of samples is several times the dimensionality

Figure: f(n, d) plotted against n/(d + 1), the fraction of dichotomies of n points in d dimensions that are linearly separable
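The formula is easy to evaluate directly; a small sketch:

```python
from math import comb

def frac_linearly_separable(n, d):
    """Fraction f(n, d) of dichotomies of n points in general position
    in d dimensions that are linearly separable."""
    if n <= d + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

print(frac_linearly_separable(3, 2))   # 1.0: all dichotomies of 3 points in 2-d
print(frac_linearly_separable(6, 2))   # 0.5: at the capacity n = 2(d + 1)
```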

Capacity of a line when d = 2

Figure: some linearly separable and some non-separable labellings of points in the plane

•  f(6, 2) = 0.5, i.e., half the dichotomies are linearly separable
•  Capacity is achieved at n = 2(d + 1) = 6
•  VC dimension = d + 1 = 3

Possible method of training SVM

•  Based on a modification of the Perceptron training rule
•  Instead of using all misclassified samples, use the worst-classified samples

Support vector samples

•  Support vectors are
   •  the training samples that define the optimal separating hyperplane
   •  the most difficult patterns to classify
   •  the patterns most informative for the classification task
•  The worst-classified pattern at any stage is the one on the wrong side of the decision boundary farthest from the boundary
   •  At the end of the training period such a pattern will be one of the support vectors
•  Finding the worst-classified pattern is computationally expensive
   •  For each update, one needs to search through the entire training set to find the worst-classified sample
   •  Only used for small problems
   •  A different method is more commonly used
Generalization Error of SVM

•  Leave-one-out bound: if there are n training patterns, the expected value of the generalization error rate is bounded according to
       E_n[P(error)] ≤ E_n[number of support vectors] / n
   where the expectation is over all training sets of size n (drawn from the distributions describing the categories)
•  i.e., the expected error rate is less than the expected number of support vectors divided by n
•  Rationale: with n points in the training set, train the SVM on n − 1 of them and test on the single remaining point; an error can occur only if the left-out point is a support vector
•  This also means that the error rate on the support vectors will be n times the error rate on the total sample
•  If we find a transformation φ that separates the data well, then
   – the expected number of support vectors is small
   – the expected error rate is small
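A hedged sketch comparing the leave-one-out error with the fraction of support vectors on a single synthetic training set (the bound above holds in expectation over training sets; the data choices below are assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# One synthetic two-class training set with some overlap (assumed parameters).
X, t = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=1)

clf = SVC(kernel='linear', C=1.0)
loo_error = 1.0 - cross_val_score(clf, X, t, cv=LeaveOneOut()).mean()

n_sv = len(clf.fit(X, t).support_)        # support vectors of the full-data fit
print(loo_error, n_sv / len(X))           # LOO error vs. (#support vectors)/n
```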

8. Relevance Vector Machines

•  Address several limitations of SVMs:
   •  The SVM does not provide posterior probabilities; Relevance Vector Machines provide such output
   •  Extension of the SVM to multiple classes is problematic
   •  There is a complexity parameter C (or ν) that must be found using a hold-out method
   •  Predictions are linear combinations of kernel functions centered on training data points, and the kernel must be positive definite
