
New Schedule

[Updated schedule table: Dec. 8, three dates unchanged, Oct. 21; the remaining items shift by 2 weeks, 1 week, and 1 week.]



9.913 Pattern Recognition for Vision

Classification

Bernd Heisele

Fall 2004
Overview

Introduction

Linear Discriminant Analysis

Support Vector Machines

Literature & Homework



Introduction

Classification

• Linear, non-linear separation
• Two-class, multi-class problems

Two approaches:
• Density estimation, classify with Bayes decision:
Linear Discr. Analysis (LDA), Quadratic Discr. Analysis (QDA)
• Without density estimation: Support Vector Machines (SVM)
LDA—Bayes Rule

Bayes Rule

$p(x, \omega) = p(x \mid \omega)\, P(\omega) = P(\omega \mid x)\, p(x)$

$P(\omega \mid x) = \dfrac{p(x \mid \omega)\, P(\omega)}{p(x)}$

posterior = likelihood $\times$ prior / evidence

$x$: random vector
$\omega$: class



LDA—Bayes Decision Rule

Decide $\omega_1$ if $\dfrac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > 1$; otherwise decide $\omega_2$.

Likelihood ratio:

$\dfrac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_2)\, P(\omega_2)} > 1$

Log-likelihood ratio:

$\ln\!\left(\dfrac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_2)\, P(\omega_2)}\right) > 0, \qquad \ln\!\left(\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\right) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right) > 0$
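As an illustration (not part of the original slides), here is a minimal NumPy/SciPy sketch of the log-likelihood-ratio test with made-up Gaussian class models and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians and priors (illustrative values only)
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # shared covariance for simplicity
P1, P2 = 0.6, 0.4                            # priors P(omega_1), P(omega_2)

x = np.array([0.3, -0.1])

# log-likelihood ratio plus log-prior ratio; decide omega_1 if it is positive
llr = (multivariate_normal.logpdf(x, mean=mu1, cov=Sigma)
       - multivariate_normal.logpdf(x, mean=mu2, cov=Sigma)
       + np.log(P1 / P2))

print("decide omega_1" if llr > 0 else "decide omega_2")
```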
LDA—Two Classes, Identical Covariance

Gaussian: $p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$

Assume identical covariance matrices $\Sigma_1 = \Sigma_2 = \Sigma$:

$\ln\!\left(\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\right) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)$

$= \tfrac{1}{2}(x - \mu_2)^T \Sigma^{-1}(x - \mu_2) - \tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1}(x - \mu_1) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)$

$= \underbrace{x^T \Sigma^{-1}(\mu_1 - \mu_2)}_{x^T w} + \underbrace{\tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_2 - \mu_1) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)}_{b}$

$= x^T w + b$; linear decision function: decide $\omega_1$ if $x^T w + b > 0$.


LDA—Two Classes, Identical Covariance

Whitening: apply the rotation $U^T$ followed by the scaling $D^{-1/2}$ to map $X$ to $X'$; in $X'$ the decision rule becomes a nearest-mean classifier.

[Figure: the mapped point $x'$ and class means $\mu_1'$, $\mu_2'$ in $X'$, with the decision boundary $f(x') = 0$.]

$f(x) = x^T \Sigma^{-1}(\mu_1 - \mu_2) + \tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_2 - \mu_1) + \ln(P(\omega_1)/P(\omega_2))$

$= x'^T(\mu_1' - \mu_2') + \underbrace{\tfrac{1}{2}(\mu_1' + \mu_2')^T(\mu_2' - \mu_1') + \ln(P(\omega_1)/P(\omega_2))}_{b}$

where $G^T G = \Sigma^{-1}$, $x'^T = x^T G^T$, $x' = G x$, $\mu_1' - \mu_2' = G(\mu_1 - \mu_2)$,

$\Sigma = U D U^T$, $\Sigma^{-1} = U D^{-1} U^T = G^T G$, $G = D^{-1/2}\, U^T$.
LDA—Computation

$\hat{\mu}_i = \dfrac{1}{N_i}\sum_{n=1}^{N_i} x_{i,n}$

$\hat{\Sigma} = \dfrac{1}{N_1 + N_2}\sum_{i=1}^{2}\sum_{n=1}^{N_i} (x_{i,n} - \hat{\mu}_i)(x_{i,n} - \hat{\mu}_i)^T$

(density estimation)

$f(x) = \mathrm{sign}(x^T w + b)$

$w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$

$b = \tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)$

Approximate the prior ratio $P(\omega_1)/P(\omega_2)$ by $N_1/N_2$.
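A minimal NumPy sketch of this recipe (my own illustration, not the course code); `X1` and `X2` are hypothetical arrays holding the training samples of the two classes, one row per sample:

```python
import numpy as np

def fit_lda(X1, X2):
    """Two-class LDA with a pooled covariance estimate, as on the slide."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled covariance, normalized by N1 + N2
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N1 + N2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)
    # prior ratio P(omega_1)/P(omega_2) approximated by N1/N2
    b = 0.5 * (mu1 + mu2) @ S_inv @ (mu2 - mu1) + np.log(N1 / N2)
    return w, b

def predict_lda(X, w, b):
    return np.sign(X @ w + b)   # +1 -> class 1, -1 -> class 2
```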



QDA—Two Classes, Different Covariance Matrices

Quadratic Discriminant Analysis

Decide $\omega_1$ if $f(x) > 0$, where

$f(x) = \ln p(x \mid \omega_1) + \ln P(\omega_1) - \ln p(x \mid \omega_2) - \ln P(\omega_2)$

$\ln p(x \mid \omega_i) = -\tfrac{1}{2}\ln|\Sigma_i| - \tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) + \text{const.}$ (the constant cancels in $f$)

$f(x) = x^T A x + w^T x + w_0$ is quadratic in $x$, where

$A = -\tfrac{1}{2}\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right)$ (a matrix),

$w = \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2$ (a vector),

$w_0 = -\tfrac{1}{2}\mu_1^T\Sigma_1^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T\Sigma_2^{-1}\mu_2 - \tfrac{1}{2}\ln\dfrac{|\Sigma_1|}{|\Sigma_2|} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$ (a scalar, the remaining terms).
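For illustration only, a short NumPy sketch of the quadratic discriminant with the scalar term written out (parameter names are mine, not from the slides):

```python
import numpy as np

def qda_discriminant(x, mu1, S1, mu2, S2, P1=0.5, P2=0.5):
    """f(x) = x^T A x + w^T x + w0; decide omega_1 if f(x) > 0."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    A = -0.5 * (S1i - S2i)                       # the matrix
    w = S1i @ mu1 - S2i @ mu2                    # the vector
    w0 = (-0.5 * mu1 @ S1i @ mu1 + 0.5 * mu2 @ S2i @ mu2
          - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
          + np.log(P1 / P2))                     # the scalar
    return x @ A @ x + w @ x + w0
```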



LDA—Multiclass, Identical Covariance

Find the linear decision boundaries for k classes:


For two classes we have:
$f(x) = x^T w + b$, decide $\omega_1$ if $f(x) > 0$.

In the multi-class case we have $k-1$ decision functions:
$f_{1,2}(x) = x^T w_{1,2} + b_{1,2}$
$f_{1,3}(x) = x^T w_{1,3} + b_{1,3}$
$\;\vdots$
$f_{1,k}(x) = x^T w_{1,k} + b_{1,k}$

$\Rightarrow$ we have to determine $(k-1)(p+1)$ parameters, where $p$ is the dimension of $x$.



LDA—Multiclass, Identical Covariance

[Figure: class means $\mu_1, \mu_2, \mu_3$ in the input space $X$ and their images $\mu_1', \mu_2', \mu_3'$ in the subspace $X'$ spanned by the discriminant directions $w_1, w_2$, with coordinates $y_1(\mu_1), y_2(\mu_1)$.]

Find the $n$-dimensional subspace that gives the best linear discrimination between the $k$ classes:

$y = (w_1 \mid w_2 \mid \dots \mid w_n)^T x$

Also known as the Fisher Linear Discriminant.



LDA—Multiclass, Identical Covariance

Computation

Compute the $d \times k$ matrix $M = (\mu_1 \mid \mu_2 \mid \dots \mid \mu_k)$ and the covariance matrix $\Sigma$.

Compute the transformed means $\mu'$: $M' = G M$, with $\Sigma^{-1} = G^T G$.

Compute the covariance matrix $B'$ of the $\mu'$.

Compute the eigenvectors $v_i'$ of $B'$, ranked by their eigenvalues.

Calculate $y$ by projecting $x$ into $X'$ and then onto the eigenvectors:

$y_i = v_i'^T G x \;\Rightarrow\; w_i = G^T v_i'$
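A NumPy sketch of these steps (an illustration under the slide's assumptions, with hypothetical function and variable names):

```python
import numpy as np

def fisher_lda(X, labels, n_components):
    """Multiclass Fisher LDA via whitening and an eigendecomposition (sketch)."""
    classes = np.unique(labels)
    # pooled within-class covariance Sigma
    Sigma = sum(np.cov(X[labels == c].T, bias=True) * (labels == c).sum()
                for c in classes) / len(X)
    # whitening transform G with Sigma^{-1} = G^T G
    D, U = np.linalg.eigh(Sigma)
    G = np.diag(1.0 / np.sqrt(D)) @ U.T
    # class means and their whitened versions M'
    M = np.stack([X[labels == c].mean(axis=0) for c in classes])   # (k, d)
    M_prime = M @ G.T
    # covariance B' of the whitened means
    B = np.cov(M_prime.T, bias=True)
    # eigenvectors of B' ranked by eigenvalue give the directions v_i'
    evals, evecs = np.linalg.eigh(B)
    V = evecs[:, np.argsort(evals)[::-1][:n_components]]
    W = G.T @ V                     # back to the input space: w_i = G^T v_i'
    return W                        # project with Y = X @ W
```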



LDA—Fisher’s Approach

Find $w$ such that the ratio of between-class to within-class variance is maximized when the data is projected onto $w$:

$y = w^T x$

$\max_w \dfrac{w^T B w}{w^T \Sigma w}$, with $B = M M^T$ the covariance of the $\mu$'s.

This can be written as:

$\max_w\; w^T B w \quad \text{subject to} \quad w^T \Sigma w = 1$

A generalized eigenvalue problem; the solutions are the ranked eigenvectors of $\Sigma^{-1} B$, the same as in the previous derivation.
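Equivalently, the generalized eigenvalue problem can be handed directly to a solver; a small SciPy sketch with made-up matrices standing in for $B$ and $\Sigma$ (scipy.linalg.eigh returns eigenvectors normalized so that $w^T \Sigma w = 1$, matching the constraint above):

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical between-class (B) and within-class (Sigma) covariance matrices
rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
B = A1 @ A1.T                       # stands in for the covariance of the class means
Sigma = A2 @ A2.T + 3 * np.eye(3)   # positive-definite within-class covariance

# Solve the generalized eigenproblem  B w = lambda * Sigma w
evals, evecs = eigh(B, Sigma)

# Discriminant directions: eigenvectors ranked by decreasing eigenvalue
order = np.argsort(evals)[::-1]
W = evecs[:, order]
print(evals[order])
```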



The Coffee Problem: LDA vs. PCA

Image removed due to copyright considerations. See: R. Gutierrez-Osuna,
http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf



LDA/QDA—Summary

Advantages:

• LDA is the Bayes classifier for multivariate Gaussian distributions with common covariance.
• LDA creates linear boundaries, which are simple to compute.
• LDA can be used for representing multi-class data in low dimensions.
• QDA is the Bayes classifier for multivariate Gaussian distributions.
• QDA creates quadratic boundaries.

Problems:
• LDA is based on a single prototype per class (the class center), which is often insufficient in practice.



Variants of LDA

Nonparametric LDA (Fukunaga): removes the unimodal assumption by computing the scatter matrix using local information. More than k-1 features can be extracted.

Orthonormal LDA (Okada & Tomita): computes projections that maximize separability and are pairwise orthonormal.

Generalized LDA (Lowe): incorporates a cost function similar to Bayes risk minimization.

...and many, many more (see "The Elements of Statistical Learning", Hastie, Tibshirani, Friedman).



SVM—Linear, Separable (LS)

Find the separating function $f(x) = \mathrm{sign}(x^T w + b)$ which maximizes the margin $M = 2d$ on the training data.

$\Rightarrow$ Maximum margin classifier.



SVM—Primal, (LS)

The training data consists of $N$ pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$, $y_i \in \{-1, 1\}$.

The problem of maximizing the margin $2d$ can be formulated as:

$\max_{w',b'}\; 2d \quad \text{subject to} \quad y_i (x_i^T w' + b') \ge d, \;\; \|w'\| = 1$

or alternatively, with $w = w'/d$ and $b = b'/d$:

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \ge 1, \quad \text{where } d = \dfrac{1}{\|w\|}$

A convex optimization problem with a quadratic objective function and linear constraints.



SVM—Dual, (LS)

Multiply the constraint equations by positive Lagrange multipliers and subtract them from the objective function:

$L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i\left[\, y_i (x_i^T w + b) - 1 \,\right]$

Minimize $L_P$ with respect to $w$ and $b$ and maximize with respect to the $\alpha_i$, subject to $\alpha_i \ge 0$.

Set the derivatives $\partial L_P/\partial w$ and $\partial L_P/\partial b$ to zero and maximize with respect to the $\alpha_i$:

$w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0$

Substituting into $L_P$ we get the so-called Wolfe dual:

$\max_\alpha\; L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k$
$\text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{N}\alpha_i y_i = 0$

Solve for the $\alpha_i$, then compute $w = \sum_i \alpha_i y_i x_i$ and $b$ from $\alpha_i\left[\, y_i(x_i^T w + b) - 1 \,\right] = 0$.
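As a sanity check of the dual formulation (my own sketch, not the lecture's solver), the QP can be handed to a general-purpose optimizer; the toy data below is hypothetical and linearly separable:

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable training set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Matrix with entries y_i y_k x_i^T x_k
K = (X * y[:, None]) @ (X * y[:, None]).T

def neg_dual(a):
    # negative of L_D, since minimize() minimizes
    return -(a.sum() - 0.5 * a @ K @ a)

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * N                            # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(N), bounds=bnds, constraints=cons).x

w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                   # support vectors
b = np.mean(y[sv] - X[sv] @ w)      # from y_i (x_i^T w + b) = 1
print(w, b)
```

For the non-separable case below, the only change to this sketch would be the box bounds `(0, C)` on the multipliers.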



SVM—Primal vs. Dual (LS)

Primal:

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to:} \quad y_i(x_i^T w + b) \ge 1$

Dual:

$\max_{\alpha_i}\; \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k \quad \text{subject to:} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{N}\alpha_i y_i = 0$

The primal has a dense inequality constraint for every point in the training set. The dual has a single dense equality constraint and a set of box constraints, which makes it easier to solve than the primal.



SVM—Optimality Conditions (LS)

Optimality conditions for linearly separable data:

$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\;\forall i$

$y_i(x_i^T w + b) - 1 \ge 0 \;\;\forall i$

$w = \sum_{i=1}^{N}\alpha_i y_i x_i \;\Rightarrow\; f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b$

$\alpha_i\left(y_i(x_i^T w + b) - 1\right) = 0 \;\;\forall i \;\Rightarrow\; \alpha_i = 0$ for points which are not on the boundary of the margin.

Points with $\alpha_i > 0$ are support vectors.

[Figure: separable data with the margin; points on the margin have $\alpha > 0$ (support vectors), all other points have $\alpha = 0$.]



SVM—Linear, non-separable (LNS)

[Figure: the hyperplanes $f(x) = +1$, $f(x) = 0$, $f(x) = -1$, margin $2d = M$, and a sample $x_i$ violating the margin with slack $\xi_i$.]

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to:}$

$y_i(x_i^T w + b) \ge 1 - \xi_i \;\;\forall i, \qquad \xi_i \ge 0 \;\;\forall i$

The $\xi_i$ are called slack variables; the constant $C$ penalizes errors.
SVM—Dual (LNS)

Same procedure as in the separable case:

$\max_\alpha\; L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N}\alpha_i y_i = 0$

Solve for the $\alpha_i$, then compute $w = \sum_i \alpha_i y_i x_i$ and $b$ from $y_i(x_i^T w + b) - 1 = 0$ for any sample $x_i$ with $0 < \alpha_i < C$.
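In practice this QP is rarely solved by hand; off-the-shelf packages expose the same quantities. A hedged scikit-learn sketch on synthetic data (`dual_coef_` stores the products $\alpha_i y_i$, bounded in magnitude by $C$):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 2-D data with some overlap between the two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)   # soft-margin linear SVM

print("number of support vectors:", len(clf.support_vectors_))
print("dual coefficients alpha_i * y_i:", clf.dual_coef_)
print("bias b:", clf.intercept_)
```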



SVM—Optimality Conditions (LNS)

[Figure: the hyperplanes $f(x) = +1$, $f(x) = 0$, $f(x) = -1$, margin $2d = M$, and a sample $x_i$ with slack $\xi_i$.]

$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b$

$\alpha_i = 0 \;\Rightarrow\; \xi_i = 0,\; y_i f(x_i) > 1$
$0 < \alpha_i < C \;\Rightarrow\; \xi_i = 0,\; y_i f(x_i) = 1$ (unbounded support vectors)
$\alpha_i = C \;\Rightarrow\; \xi_i \ge 0,\; y_i f(x_i) \le 1$ (bounded support vectors)
SVM—Non-linear (NL)

Non-linear mapping:
$x' = \Phi(x)$

[Figure: data in the input space $(x_1, x_2)$ mapped to the feature space $(x_1', x_2')$, where it becomes linearly separable.]

Project into the feature space, then apply the SVM procedure.



SVM—Kernel Trick

$\max_\alpha\; \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i'^T x_k'$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N}\alpha_i y_i = 0$

Only the inner product of the samples appears in the objective function. If we can write $K(x_i, x_k) = x_i'^T x_k'$, we can avoid any computations in the feature space.

Using $w = \sum_i \alpha_i y_i x_i'$, the solution $f(x') = x'^T w + b$ can be written as:

$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, \Phi(x)^T\Phi(x_i) + b = \sum_{i=1}^{N}\alpha_i y_i\, K(x, x_i) + b$



SVM—Kernels

When is a kernel $K(u, v)$ an inner product in a Hilbert space?

$K(u, v) = \sum_n \lambda_n\, \phi_n(u)\,\phi_n(v)$ with positive coefficients $\lambda_n$,

if for any $g(u) \in L_2$:

$\int K(u, v)\, g(u)\, g(v)\, du\, dv \ge 0$ (Mercer's condition)

Some examples of commonly used kernels:

Linear kernel: $u^T v$
Polynomial kernel: $(1 + u^T v)^d$
Gaussian kernel (RBF): $\exp(-\|u - v\|^2)$, shift invariant
MLP: $\tanh(u^T v - \theta)$
SVM—Polynomial Kernel

Polynomial second-degree kernel:

$K(u, v) = (1 + u^T v)^2, \quad u, v \in \mathbb{R}^2$
$= 1 + (u_1 v_1)^2 + (u_2 v_2)^2 + 2 u_1 v_1 u_2 v_2 + 2 u_1 v_1 + 2 u_2 v_2$
$= (1,\, u_1^2,\, u_2^2,\, \sqrt{2}\,u_1 u_2,\, \sqrt{2}\,u_1,\, \sqrt{2}\,u_2)\,(1,\, v_1^2,\, v_2^2,\, \sqrt{2}\,v_1 v_2,\, \sqrt{2}\,v_1,\, \sqrt{2}\,v_2)^T$

$\Phi(x) = (1,\, x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2,\, \sqrt{2}\,x_1,\, \sqrt{2}\,x_2)^T$

A shift-invariant kernel $K(u, v) = K(u - v)$ defined on $L_2([0, T]^d)$ can be written as the Fourier series of $K$:

$f(t) = \dfrac{1}{T}\sum_{k=-\infty}^{\infty}\lambda_k\, e^{j 2\pi k t / T}, \qquad f(t - t_0) = \dfrac{1}{T}\sum_{k=-\infty}^{\infty}\lambda_k\, e^{j 2\pi k t / T}\, e^{-j 2\pi k t_0 / T}$

$K(u - v) = \sum_{k=0}^{\infty}\lambda_k\, e^{j 2\pi \kappa_k u}\, e^{-j 2\pi \kappa_k v}, \quad \forall \kappa_k \in Z([-\infty, \infty]^d)$
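A quick numerical check of the second-degree polynomial expansion above (illustration only, not course code):

```python
import numpy as np

def K(u, v):
    """Second-degree polynomial kernel (1 + u.v)^2."""
    return (1.0 + u @ v) ** 2

def phi(x):
    """Explicit feature map for the 2-D second-degree polynomial kernel."""
    return np.array([1.0, x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1]])

u = np.array([0.5, -1.0])
v = np.array([2.0, 0.3])

print(K(u, v))            # kernel evaluated in the 2-D input space
print(phi(u) @ phi(v))    # same value via the 6-D feature space
```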



SVM—Uniqueness

Are the $\alpha_i$ in the solution unique? No.

[Figure: four training points $x_1, x_2, x_3, x_4$ at the corners of a square, labels $y = +1$ on one side and $y = -1$ on the other, separated by $w = (1, 0)$.]

$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b, \qquad w = (1, 0)$

Two solutions:
$\alpha = (0.25, 0.25, 0.25, 0.25)$
$\alpha = (0.5, 0.5, 0, 0)$

Both satisfy the constraints $\alpha_i \ge 0$, $\sum_{i=1}^{N}\alpha_i y_i = 0$.

More important: is the solution $f(x)$ unique?

Yes, the solution is unique and global if the objective function is strictly convex.
SVM—Multiclass

Bottom-up 1 vs. 1:
[Tree: leaves A, B, C, D; pairwise winners "A or B" and "C or D"; root "A or B or C or D".]
Training: k(k-1)/2 classifiers
Classification: k-1 evaluations

1 vs. All:
A vs. {B, C, D}, B vs. {A, C, D}, C vs. {A, B, D}, D vs. {A, B, C}
Training: k classifiers
Classification: k evaluations
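Off-the-shelf wrappers implement both schemes on top of a binary SVM; a scikit-learn sketch using its digits set as a stand-in k = 10 problem (an assumption of mine, not the course setup):

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)   # k = 10 classes

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)   # trains k(k-1)/2 = 45 binary SVMs
ova = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)  # trains k = 10 binary SVMs

print(len(ovo.estimators_), len(ova.estimators_))          # 45 and 10
```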



SVM—Choosing the Kernel

How to choose the kernel?

• Linear SVMs are simple to compute and fast at run-time, but often not sufficient for complex tasks.

• SVMs with Gaussian kernels have shown excellent performance in many applications (after some tuning of sigma), but are slow at run-time.

• Polynomial kernels of 2nd degree are commonly used in computer vision applications: a good trade-off between classification performance and computational complexity.



SVM—Example
Face detection with linear and 2nd-degree polynomial SVMs & LDA



SVM—Choosing C

How to choose the C-value?

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$

The C-value penalizes points within the margin. A large C-value can lead to poor generalization performance (over-fitting).

From our own experience in object detection tasks: find a kernel and C-value which give you zero errors on the training set.
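A common way to act on this advice is a small grid search over C (and the kernel parameter) with cross-validation; a scikit-learn sketch on stand-in data (the grid values are arbitrary):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)   # stand-in data for illustration

# Search over C and the Gaussian-kernel width by 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```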



SVM—Computation during Classification

In computer vision applications fast classification is usually more important than fast training.

Two ways of computing the decision function $f(x)$:

a) $w^T \Phi(x) + b$    b) $\sum_{i=1}^{N}\alpha_i y_i\, K(x, x_i) + b$

Which one is faster?

• For a linear kernel: a).
• For a polynomial 2nd-degree kernel:
  multiplications for a): $G_{F,\text{poly2}} = (n + 2)\, n$, where $n$ is the dimension of $x$;
  multiplications for b): $G_{K,\text{poly2}} = (n + 2)\, s$, where $s$ is the number of support vectors.
• Gaussian kernel: only b), since the dimension of $\Phi(x)$ is $\infty$.
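A small check that the two evaluation routes agree for a linear kernel, where the kernel expansion can be collapsed into a single weight vector (synthetic data, illustration only):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(200))

clf = SVC(kernel='linear', C=1.0).fit(X, y)
x = rng.standard_normal(10)

# b) kernel expansion over the support vectors
f_kernel = (clf.dual_coef_ @ (clf.support_vectors_ @ x))[0] + clf.intercept_[0]

# a) collapse the expansion into a single weight vector w = sum_i alpha_i y_i x_i
w = clf.dual_coef_ @ clf.support_vectors_
f_linear = (w @ x)[0] + clf.intercept_[0]

print(np.isclose(f_kernel, f_linear))   # same decision value, but a) costs O(n) at run-time
```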



Learning Theory—Problem Formulation

From a given set of training examples $\{x_i, y_i\}$ learn the mapping $x \to y$. The learning machine is defined by a set of possible mappings $x \to f(x, \alpha)$, where $\alpha$ is the adjustable parameter of $f$.

The goal is to minimize the expected risk $R$:

$R(\alpha) = \int V(f(x, \alpha), y)\, dP(x, y)$

$V$ is the loss function; $P$ is the probability distribution function.

We can't compute $R(\alpha)$ since we don't know $P(x, y)$.



Learning Theory—Empirical Risk Minimization

To solve the problem, minimize the "empirical risk" $R_{emp}$ over the training set:

$R_{emp}(\alpha) = \dfrac{1}{N}\sum_{i=1}^{N} V(f(x_i, \alpha), y_i)$

where $V$ is the loss function. Common loss functions:

$V(f(x), y) = (y - f(x))^2$ (least squares)
$V(f(x), y) = (1 - y f(x))_+$ (hinge loss), where $(x)_+ \equiv \max(x, 0)$

[Figure: the hinge loss plotted against $y f(x)$; it equals 1 at $y f(x) = 0$ and is zero for $y f(x) \ge 1$.]
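The two loss functions in a few lines of NumPy (values are illustrative):

```python
import numpy as np

def squared_loss(f, y):
    """Least-squares loss V(f(x), y) = (y - f(x))^2."""
    return (y - f) ** 2

def hinge_loss(f, y):
    """Hinge loss V(f(x), y) = (1 - y f(x))_+ with (x)_+ = max(x, 0)."""
    return np.maximum(0.0, 1.0 - y * f)

f = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])   # classifier outputs f(x_i)
y = np.array([ 1.0, 1.0, 1.0, 1.0, 1.0])   # true labels

print(squared_loss(f, y))   # [9.   1.   0.25 0.   4.  ]
print(hinge_loss(f, y))     # [3.  1.  0.5 0.  0. ]
```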
Learning Theory & SVM

Bound on the expected risk:

For a loss function with $0 \le V(f(x), y) \le 1$, with probability $1 - \eta$, $0 \le \eta \le 1$, the following bound holds:

$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\dfrac{h\ln(2N/h) + h - \ln(\eta/4)}{N}}$

The bound is independent of the probability distribution $P(x, y)$.

$R_{emp}(\alpha)$: empirical risk
$N$: number of training examples
$h$: Vapnik-Chervonenkis (VC) dimension

Keep all parameters in the bound fixed except one: increasing $(1 - \eta)$ increases the bound, increasing $N$ decreases the bound, increasing $h$ increases the bound.



Learning Theory—VC Dimension

The VC dimension is a property of the set of functions $\{f(\alpha)\}$. If for a set of $N$ points labeled in all $2^N$ possible ways one can find an $f \in \{f(\alpha)\}$ which separates the points correctly, one says that the set of points is shattered by $\{f(\alpha)\}$.

The VC dimension is the maximum number of points that can be shattered by $\{f(\alpha)\}$.

The VC dimension of the functions $f: w^T x + b = 0$ in 2 dimensions is 3: a line can shatter three points in general position, but not four.



Learning Theory—SVM

[Figure: the training points enclosed in a sphere of diameter $D$, separated with margin $M$.]

The expected risk $E(R)$ for the optimal hyperplanes:

$E(R) \le \dfrac{E(D^2 / M^2)}{N}$

where the expectation is over all training sets of size $N$.

"Algorithms that maximize the margin have better generalization performance."
Bounds

Most bounds on the expected risk are very loose; instead compute:

Cross-validation error: the error on a cross-validation set which is different from the training set.

Leave-one-out error: leave one training example out of the training set, train the classifier, and test it on the example which was left out. Do this for all examples. For SVMs the leave-one-out error is upper-bounded by the number of support vectors.
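A scikit-learn sketch of the leave-one-out estimate (the dataset and hyperparameters are stand-ins chosen for illustration):

```python
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X, y = X[:200], y[:200]        # small subset so the N-fold loop stays cheap

clf = SVC(kernel='rbf', gamma=1e-3, C=10)

# Leave-one-out: train on N-1 examples, test on the held-out one, repeat N times
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out error:", 1.0 - scores.mean())
```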



Regularization Theory

Given $N$ examples $(x_i, y_i)$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$, solve:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} V(f(x_i), y_i) + \gamma\, \|f\|_K^2$

where $\|f\|_K^2$ is the norm in a Reproducing Kernel Hilbert Space (RKHS) $H$ with reproducing kernel $K$, and $\gamma$ is the regularization parameter.

$\gamma\, \|f\|_K^2$ can be interpreted as a smoothness constraint.

Under rather general conditions the solution can be written as:

$f(x) = \sum_{i=1}^{N} c_i\, K(x, x_i)$

(a smooth function)
Regularization—Reproducing Kernel Hilbert Space (RKHS)
Reproducing Kernel Hilbert Space (RKHS) $H$:

$f(x) = \langle K(x, y), f(y) \rangle_H$ (reproducing property)

Positive numbers $\lambda_n$ and an orthonormal set of functions $\phi_n(x)$ with $\int \phi_n(x)\phi_m(x)\,dx = 0$ for $n \ne m$ and $1$ otherwise:

$K(x, y) \equiv \sum_n \lambda_n\,\phi_n(x)\,\phi_n(y)$, where the $\lambda_n$ are the nonnegative eigenvalues of $K$.

For $f(x) = \sum_n a_n \phi_n(x)$ with $a_n = \int f(x)\phi_n(x)\,dx$, and likewise $g(x) = \sum_n b_n \phi_n(x)$:

$\langle f, g \rangle_H \equiv \sum_n \dfrac{a_n b_n}{\lambda_n}$

$\|f\|_H^2 = \langle f, f \rangle_H = \sum_n a_n^2 / \lambda_n$



Regularization—Simple Example of RKHS
The kernel is a one-dimensional Gaussian with $\sigma = 1$:

$K(x, y) = \exp(-(x - y)^2), \quad x, y \in [0, 1]$

Write $K(x, y)$ as a Fourier expansion using the shift theorem (period $T = 1$):

$K(x, y) = \sum_n \lambda_n\, \exp(j 2\pi n x)\exp(-j 2\pi n y)$

where the $\lambda_n$ are the Fourier coefficients of $\exp(-x^2)$:

$\lambda_n = A\, \exp(-n^2/2)$

$\lambda_n$ decreases with higher frequencies (increasing $n$). This is a property of most kernels. The regularization term

$\|f\|_H^2 = \sum_n a_n^2 / \lambda_n$, where the $a_n$ are the Fourier coefficients of $f(x)$,

penalizes high frequencies more than low frequencies $\Rightarrow$ smoothness!



Regularization—SVM

For the hinge loss function $V(f(x), y) = (1 - y f(x))_+$ it can be shown that the regularization problem is equivalent to the SVM problem:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} (1 - y_i f(x_i))_+ + \lambda\, \|f\|_K^2$

Introducing slack variables $\xi_i = 1 - y_i f(x_i)$ we can rewrite this as:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} \xi_i + \lambda\, \|f\|_K^2, \quad \text{subject to: } y_i f(x_i) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \;\;\forall i$

It can be shown that this is equivalent to the SVM problem (up to $b$):

SVM: $\quad \min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i, \qquad C = 1/(2\lambda N)$
$\text{subject to: } y_i(x_i^T w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \;\;\forall i$


SVM—Summary

• SVMs are maximum margin classifiers.

• Only training points close to the boundary (support vectors) occur in the SVM solution.

• The SVM problem is convex; the solution is global and unique.

• SVMs can handle non-separable data.

• Non-linear separation in the input space is possible by projecting the data into a feature space.

• All calculations can be done in the input space (kernel trick).

• SVMs are known to perform well in high dimensional problems with few examples.

• Depending on the kernel, SVMs can be slow during classification.

• SVMs are binary classifiers; they are not efficient for problems with a large number of classes.



Literature

T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning, Springer, 2001. (LDA, QDA, extensions to LDA, SVM & regularization.)

C. Burges: A Tutorial on SVM for Pattern Recognition, 1999. (Learning theory, SVM.)

R. Rifkin: Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning, 2002. (SVM training, SVM multiclass.)

T. Evgeniou, M. Pontil, T. Poggio: Regularization Networks and SVMs, 1999. (SVM & regularization.)

V. Vapnik: The Nature of Statistical Learning, 1995. (Statistical learning theory, SVM.)



Homework

Classification problem on the NIST handwritten digits data, involving PCA, LDA, and SVMs.

PCA code will be posted today.

