
Machine Learning Srihari

SVM for Overlapping Classes

srihari@buffalo.edu

SVM Discussion Overview


1.  Importance of SVMs
2.  Overview of Mathematical Techniques Employed
3.  Margin Geometry
4.  SVM Training Methodology
5.  Overlapping Distributions
6.  Dealing with Multiple Classes
7.  SVM and Computational Learning Theory
8.  Relevance Vector Machines

Overlapping Class Distributions


•  So far we have assumed the training data are linearly separable in the mapped feature space ϕ(x)
•  The resulting SVM gives exact separation in the input space x, although the decision boundary will be nonlinear
•  In practice the class-conditional distributions overlap
•  Exact separation then leads to poor generalization
•  We therefore need to allow the SVM to misclassify some training points

Overlapping Distributions
Modifying SVM

•  The implicit error function of the SVM is

$$\sum_{n=1}^{N} E_{\infty}\big(y(\mathbf{x}_n)\,t_n - 1\big) + \lambda \lVert \mathbf{w} \rVert^{2}$$
•  E∞(z) is zero if z ≥ 0 and ∞ otherwise: a point therefore incurs infinite error if it violates the margin constraint tn y(xn) ≥ 1, and zero error if the constraint is satisfied
•  We optimize the parameters to maximize the margin
•  We now modify this approach
•  Points are allowed to be on the wrong side of the margin
•  But with a penalty that increases with the distance from the boundary
– Convenient if the penalty is a linear function of this distance

Definition of Slack Variables ξn

•  Introduce slack variables, ξn ≥ 0, n=1,..,N


•  With one slack variable for each training point
•  These are defined by
•  ξn =0 for data points that are on or inside the
correct margin boundary
•  ξn =|tn-y(xn)| for other points
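As a concrete illustration, here is a minimal NumPy sketch of this definition; the decision values y(xn) and labels tn below are made-up numbers, used only to show how the slack values come out:

```python
import numpy as np

def slack_variables(y_vals, t):
    """Slack values xi_n from decision values y(x_n) and labels t_n in {-1, +1}.

    xi_n = 0 when t_n * y(x_n) >= 1 (on or inside the correct margin boundary),
    otherwise xi_n = |t_n - y(x_n)|, which equals 1 - t_n * y(x_n).
    """
    margins = t * y_vals                        # t_n * y(x_n)
    return np.where(margins >= 1.0, 0.0, 1.0 - margins)

# Hypothetical decision values and labels, purely for illustration
y_vals = np.array([1.8, 0.4, -0.3, -1.2, 0.7])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
print(slack_variables(y_vals, t))   # expected: [0, 0.6, 1.3, 0, 1.7]
```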

Slack Variables
•  A data point that is on the decision
boundary y(xn)=0 will have ξn =1
•  Points with ξn > 1 will be misclassified
•  Exact classification constraints
tn (wTϕ(xn)+b)≥1 , n=1,..,N
are then replaced with

tny(xn) ≥1-ξn n=1,..,N


•  in which slack variables are constrained to
satisfy ξn ≥0
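Equivalently, the two cases in the definition of ξn can be combined into a single expression (this follows directly from the definition, since |tn − y(xn)| = 1 − tn y(xn) whenever tn ∈ {−1, +1} and tn y(xn) < 1):

$$\xi_n = \max\big(0,\; 1 - t_n y(\mathbf{x}_n)\big)$$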

Slack Variables and Margins


[Figure: hard margin vs. soft margin, showing the contours y = 1, y = 0 and y = −1; data points with circles are support vectors. Data points on the wrong side of the decision boundary have ξ > 1; correctly classified points inside the margin but on the correct side of the decision boundary have 0 < ξ ≤ 1; correctly classified points on the correct side of the margin have ξ = 0.]

Soft Margin Classification

•  Allows some of the data points to be


misclassified
•  While slack variables allow for overlapping
class distributions, the framework is still
sensitive to outliers
•  because the penalty for misclassification still
increases linearly with ξ
Optimization with slack variables

•  Our goal is to maximize the margin while


softly penalizing points that lie on the
wrong side of the margin boundary
•  We therefore minimize

$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \lVert \mathbf{w} \rVert^{2}$$

•  The parameter C > 0 controls the trade-off between the slack-variable penalty and the margin
•  Subject to the constraints
tn y(xn) ≥ 1 − ξn and ξn ≥ 0, n=1,..,N
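In widely used SVM libraries this trade-off parameter is exposed directly as C. A short sketch, assuming scikit-learn is available; the overlapping Gaussian data and the particular C values are made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian classes (synthetic data for illustration)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.1, 1.0, 100.0):
    # Large C penalizes slack heavily (approaching the hard-margin limit);
    # small C tolerates more margin violations in exchange for a wider margin.
    clf = SVC(kernel="rbf", C=C).fit(X, t)
    print(f"C={C:6.1f}  support vectors: {clf.support_vectors_.shape[0]}")
```

In the limit C → ∞ we recover the earlier hard-margin SVM for separable data.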

Lagrangian with slack


•  The Lagrangian is then

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n\big\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \big\} - \sum_{n=1}^{N}\mu_n\xi_n$$

•  where {an≥0}, {µn≥0} are Lagrange multipliers


•  The corresponding KKT conditions are
$$a_n \ge 0, \qquad t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0, \qquad a_n\big\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \big\} = 0$$
$$\mu_n \ge 0, \qquad \xi_n \ge 0, \qquad \mu_n \xi_n = 0$$

•  We then optimize out w, b and {ξn} by setting the derivatives of L with respect to w, b and ξn to zero
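Setting each of these derivatives to zero gives the standard stationarity conditions, which are then used to eliminate w, b and {ξn}:

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0, \qquad \frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n$$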

Dual Lagrangian

•  The dual Lagrangian has the form


$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$

•  This is identical to the separable case, except that the constraints are now the box constraints 0 ≤ an ≤ C together with Σn an tn = 0
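Because the dual is a quadratic program in a, it can be solved with any constrained optimizer. The sketch below, assuming NumPy and SciPy, minimizes −L̃(a) for a tiny made-up RBF-kernel problem under the box constraints 0 ≤ an ≤ C and the equality constraint Σn an tn = 0; real SVM implementations use specialized solvers such as SMO rather than a generic routine like this.

```python
import numpy as np
from scipy.optimize import minimize

def fit_dual_svm(X, t, C=1.0, gamma=1.0):
    """Solve the soft-margin dual with a generic constrained optimizer (illustrative only)."""
    N = X.shape[0]
    # RBF kernel matrix k(x_n, x_m)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    Q = (t[:, None] * t[None, :]) * K            # Q_nm = t_n t_m k(x_n, x_m)

    # We minimize the negative dual objective: 0.5 a^T Q a - sum(a)
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    gradient = lambda a: Q @ a - np.ones(N)

    result = minimize(objective, np.zeros(N), jac=gradient, method="SLSQP",
                      bounds=[(0.0, C)] * N,                              # 0 <= a_n <= C
                      constraints={"type": "eq", "fun": lambda a: a @ t})  # sum_n a_n t_n = 0
    return result.x

# Tiny toy data set with labels in {-1, +1}; points with a_n > 0 are the support vectors
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.8]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
print(np.round(fit_dual_svm(X, t, C=10.0), 3))
```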

ν-SVM
•  An equivalent formulation of the SVM
•  It involves maximizing

$$\tilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$

•  Subject to the constraints

$$0 \le a_n \le \frac{1}{N}, \qquad \sum_{n=1}^{N} a_n t_n = 0, \qquad \sum_{n=1}^{N} a_n \ge \nu$$

•  The ν-SVM has the advantage that the parameter ν gives more direct control over the number of support vectors
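For example, in scikit-learn this formulation is available as NuSVC, where ν is set directly; the overlapping synthetic data and ν values below are chosen purely for illustration:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)

# Overlapping synthetic classes (for illustration only)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
t = np.hstack([-np.ones(100), np.ones(100)])

for nu in (0.05, 0.2, 0.5):
    # nu acts as an upper bound on the fraction of margin errors and
    # a lower bound on the fraction of support vectors
    clf = NuSVC(kernel="rbf", nu=nu).fit(X, t)
    frac_sv = clf.support_.size / len(t)
    print(f"nu={nu:4.2f}  fraction of support vectors: {frac_sv:.2f}")
```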

ν-SVM applied to non-separable data


•  Support Vectors are indicated
by circles
•  Done by introducing slack
variables
•  With one slack variable per
training data point
•  Maximize the margin while
softly penalizing points that lie
on the wrong side of the margin
boundary
•  ν is an upper bound on the fraction of margin errors (points that lie on the wrong side of the margin boundary)
