
New Schedule

[Updated schedule table: Dec. 8, three dates unchanged, Oct. 21; the remaining items shift by 2 weeks, 1 week, and 1 week.]



9.913 Pattern Recognition for Vision

Classification

Bernd Heisele

Fall 2004
Overview

Introduction

Linear Discriminant Analysis

Support Vector Machines

Literature & Homework



Introduction

Classification

• Linear, non-linear separation
• Two-class, multi-class problems

Two approaches:
• Density estimation, classify with Bayes decision:
Linear Discr. Analysis (LDA), Quadratic Discr. Analysis (QDA)
• Without density estimation: Support Vector Machines (SVM)
LDA—Bayes Rule

Bayes Rule

$p(x, \omega) = p(x \mid \omega)\, P(\omega) = P(\omega \mid x)\, p(x)$

$P(\omega \mid x) = \dfrac{p(x \mid \omega)\, P(\omega)}{p(x)}$

posterior = likelihood $\times$ prior / evidence

$x$: random vector
$\omega$: class



LDA—Bayes Decision Rule

Decide $\omega_1$ if $\dfrac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} > 1$; otherwise decide $\omega_2$.

Likelihood ratio:

$\dfrac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_2)\, P(\omega_2)} > 1$

Log-likelihood ratio:

$\ln\!\left(\dfrac{p(x \mid \omega_1)\, P(\omega_1)}{p(x \mid \omega_2)\, P(\omega_2)}\right) > 0, \qquad \ln\!\left(\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\right) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right) > 0$
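As an illustration (not part of the original slides), here is a minimal NumPy/SciPy sketch of the log-likelihood-ratio test with made-up Gaussian class models and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditional Gaussians and priors (illustrative values only)
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # shared covariance for simplicity
P1, P2 = 0.6, 0.4                            # priors P(omega_1), P(omega_2)

x = np.array([0.3, -0.1])

# log-likelihood ratio plus log-prior ratio; decide omega_1 if it is positive
llr = (multivariate_normal.logpdf(x, mean=mu1, cov=Sigma)
       - multivariate_normal.logpdf(x, mean=mu2, cov=Sigma)
       + np.log(P1 / P2))

print("decide omega_1" if llr > 0 else "decide omega_2")
```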
LDA—Two Classes, Identical Covariance

Gaussian: $p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$

Assume identical covariance matrices $\Sigma_1 = \Sigma_2 = \Sigma$:

$\ln\!\left(\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\right) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)$

$= \tfrac{1}{2}(x - \mu_2)^T \Sigma^{-1}(x - \mu_2) - \tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1}(x - \mu_1) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)$

$= \underbrace{x^T \Sigma^{-1}(\mu_1 - \mu_2)}_{x^T w} + \underbrace{\tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_2 - \mu_1) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)}_{b}$

$= x^T w + b$; linear decision function: decide $\omega_1$ if $x^T w + b > 0$.


LDA—Two Classes, Identical Covariance

Whitening: apply the rotation $U^T$ followed by the scaling $D^{-1/2}$ to map $X$ to $X'$; in $X'$ the decision rule becomes a nearest-mean classifier.

[Figure: the mapped point $x'$ and class means $\mu_1'$, $\mu_2'$ in $X'$, with the decision boundary $f(x') = 0$.]

$f(x) = x^T \Sigma^{-1}(\mu_1 - \mu_2) + \tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_2 - \mu_1) + \ln(P(\omega_1)/P(\omega_2))$

$= x'^T(\mu_1' - \mu_2') + \underbrace{\tfrac{1}{2}(\mu_1' + \mu_2')^T(\mu_2' - \mu_1') + \ln(P(\omega_1)/P(\omega_2))}_{b}$

where $G^T G = \Sigma^{-1}$, $x'^T = x^T G^T$, $x' = G x$, $\mu_1' - \mu_2' = G(\mu_1 - \mu_2)$,

$\Sigma = U D U^T$, $\Sigma^{-1} = U D^{-1} U^T = G^T G$, $G = D^{-1/2}\, U^T$.
LDA—Computation

$\hat{\mu}_i = \dfrac{1}{N_i}\sum_{n=1}^{N_i} x_{i,n}$

$\hat{\Sigma} = \dfrac{1}{N_1 + N_2}\sum_{i=1}^{2}\sum_{n=1}^{N_i} (x_{i,n} - \hat{\mu}_i)(x_{i,n} - \hat{\mu}_i)^T$

(density estimation)

$f(x) = \mathrm{sign}(x^T w + b)$

$w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$

$b = \tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) + \ln\!\left(\dfrac{P(\omega_1)}{P(\omega_2)}\right)$

Approximate the prior ratio $P(\omega_1)/P(\omega_2)$ by $N_1/N_2$.
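A minimal NumPy sketch of this recipe (my own illustration, not the course code); `X1` and `X2` are hypothetical arrays holding the training samples of the two classes, one row per sample:

```python
import numpy as np

def fit_lda(X1, X2):
    """Two-class LDA with a pooled covariance estimate, as on the slide."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled covariance, normalized by N1 + N2
    S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N1 + N2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)
    # prior ratio P(omega_1)/P(omega_2) approximated by N1/N2
    b = 0.5 * (mu1 + mu2) @ S_inv @ (mu2 - mu1) + np.log(N1 / N2)
    return w, b

def predict_lda(X, w, b):
    return np.sign(X @ w + b)   # +1 -> class 1, -1 -> class 2
```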



QDA—Two Classes, Different Covariance Matrices

Quadratic Discriminant Analysis

Decide $\omega_1$ if $f(x) > 0$, where

$f(x) = \ln p(x \mid \omega_1) + \ln P(\omega_1) - \ln p(x \mid \omega_2) - \ln P(\omega_2)$

$\ln p(x \mid \omega_i) = -\tfrac{1}{2}\ln|\Sigma_i| - \tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) + \text{const.}$ (the constant cancels in $f$)

$f(x) = x^T A x + w^T x + w_0$ is quadratic in $x$, where

$A = -\tfrac{1}{2}\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right)$ (a matrix),

$w = \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2$ (a vector),

$w_0 = -\tfrac{1}{2}\mu_1^T\Sigma_1^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T\Sigma_2^{-1}\mu_2 - \tfrac{1}{2}\ln\dfrac{|\Sigma_1|}{|\Sigma_2|} + \ln\dfrac{P(\omega_1)}{P(\omega_2)}$ (a scalar, the remaining terms).
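For illustration only, a short NumPy sketch of the quadratic discriminant with the scalar term written out (parameter names are mine, not from the slides):

```python
import numpy as np

def qda_discriminant(x, mu1, S1, mu2, S2, P1=0.5, P2=0.5):
    """f(x) = x^T A x + w^T x + w0; decide omega_1 if f(x) > 0."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    A = -0.5 * (S1i - S2i)                       # the matrix
    w = S1i @ mu1 - S2i @ mu2                    # the vector
    w0 = (-0.5 * mu1 @ S1i @ mu1 + 0.5 * mu2 @ S2i @ mu2
          - 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
          + np.log(P1 / P2))                     # the scalar
    return x @ A @ x + w @ x + w0
```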



LDA—Multiclass, Identical Covariance

Find the linear decision boundaries for k classes:


For two classes we have:
$f(x) = x^T w + b$, decide $\omega_1$ if $f(x) > 0$.

In the multi-class case we have $k-1$ decision functions:
$f_{1,2}(x) = x^T w_{1,2} + b_{1,2}$
$f_{1,3}(x) = x^T w_{1,3} + b_{1,3}$
$\;\vdots$
$f_{1,k}(x) = x^T w_{1,k} + b_{1,k}$

$\Rightarrow$ we have to determine $(k-1)(p+1)$ parameters, where $p$ is the dimension of $x$.



LDA—Multiclass, Identical Covariance

[Figure: class means $\mu_1, \mu_2, \mu_3$ in the input space $X$ and their images $\mu_1', \mu_2', \mu_3'$ in the subspace $X'$ spanned by the discriminant directions $w_1, w_2$, with coordinates $y_1(\mu_1), y_2(\mu_1)$.]

Find the $n$-dimensional subspace that gives the best linear discrimination between the $k$ classes:

$y = (w_1 \mid w_2 \mid \dots \mid w_n)^T x$

Also known as the Fisher Linear Discriminant.



LDA—Multiclass, Identical Covariance

Computation

Compute the $d \times k$ matrix $M = (\mu_1 \mid \mu_2 \mid \dots \mid \mu_k)$ and the covariance matrix $\Sigma$.

Compute the transformed means $\mu'$: $M' = G M$, with $\Sigma^{-1} = G^T G$.

Compute the covariance matrix $B'$ of the $\mu'$.

Compute the eigenvectors $v_i'$ of $B'$, ranked by their eigenvalues.

Calculate $y$ by projecting $x$ into $X'$ and then onto the eigenvectors:

$y_i = v_i'^T G x \;\Rightarrow\; w_i = G^T v_i'$
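A NumPy sketch of these steps (an illustration under the slide's assumptions, with hypothetical function and variable names):

```python
import numpy as np

def fisher_lda(X, labels, n_components):
    """Multiclass Fisher LDA via whitening and an eigendecomposition (sketch)."""
    classes = np.unique(labels)
    # pooled within-class covariance Sigma
    Sigma = sum(np.cov(X[labels == c].T, bias=True) * (labels == c).sum()
                for c in classes) / len(X)
    # whitening transform G with Sigma^{-1} = G^T G
    D, U = np.linalg.eigh(Sigma)
    G = np.diag(1.0 / np.sqrt(D)) @ U.T
    # class means and their whitened versions M'
    M = np.stack([X[labels == c].mean(axis=0) for c in classes])   # (k, d)
    M_prime = M @ G.T
    # covariance B' of the whitened means
    B = np.cov(M_prime.T, bias=True)
    # eigenvectors of B' ranked by eigenvalue give the directions v_i'
    evals, evecs = np.linalg.eigh(B)
    V = evecs[:, np.argsort(evals)[::-1][:n_components]]
    W = G.T @ V                     # back to the input space: w_i = G^T v_i'
    return W                        # project with Y = X @ W
```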



LDA—Fisher’s Approach

Find $w$ such that the ratio of between-class to within-class variance is maximized when the data is projected onto $w$:

$y = w^T x$

$\max_w \dfrac{w^T B w}{w^T \Sigma w}$, with $B = M M^T$ the covariance of the $\mu$'s.

This can be written as:

$\max_w\; w^T B w \quad \text{subject to} \quad w^T \Sigma w = 1$

A generalized eigenvalue problem; the solutions are the ranked eigenvectors of $\Sigma^{-1} B$, the same as in the previous derivation.
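Equivalently, the generalized eigenvalue problem can be handed directly to a solver; a small SciPy sketch with made-up matrices standing in for $B$ and $\Sigma$ (scipy.linalg.eigh returns eigenvectors normalized so that $w^T \Sigma w = 1$, matching the constraint above):

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical between-class (B) and within-class (Sigma) covariance matrices
rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
B = A1 @ A1.T                       # stands in for the covariance of the class means
Sigma = A2 @ A2.T + 3 * np.eye(3)   # positive-definite within-class covariance

# Solve the generalized eigenproblem  B w = lambda * Sigma w
evals, evecs = eigh(B, Sigma)

# Discriminant directions: eigenvectors ranked by decreasing eigenvalue
order = np.argsort(evals)[::-1]
W = evecs[:, order]
print(evals[order])
```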



The Coffee Problem: LDA vs. PCA

Image removed due to copyright considerations. See: R. Gutierrez-Osuna,
http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf



LDA/QDA—Summary

Advantages:

• LDA is the Bayes classifier for multivariate Gaussian distributions with common covariance.
• LDA creates linear boundaries, which are simple to compute.
• LDA can be used for representing multi-class data in low dimensions.
• QDA is the Bayes classifier for multivariate Gaussian distributions.
• QDA creates quadratic boundaries.

Problems:
• LDA is based on a single prototype per class (the class center), which is often insufficient in practice.



Variants of LDA

Nonparametric LDA (Fukunaga): removes the unimodal assumption by computing the scatter matrix using local information. More than k-1 features can be extracted.

Orthonormal LDA (Okada & Tomita): computes projections that maximize separability and are pairwise orthonormal.

Generalized LDA (Lowe): incorporates a cost function similar to Bayes risk minimization.

...and many, many more (see "The Elements of Statistical Learning", Hastie, Tibshirani, Friedman).



SVM—Linear, Separable (LS)

Find the separating function $f(x) = \mathrm{sign}(x^T w + b)$ which maximizes the margin $M = 2d$ on the training data.

$\Rightarrow$ Maximum margin classifier.



SVM—Primal, (LS)

The training data consists of $N$ pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$, $y_i \in \{-1, 1\}$.

The problem of maximizing the margin $2d$ can be formulated as:

$\max_{w',b'}\; 2d \quad \text{subject to} \quad y_i (x_i^T w' + b') \ge d, \;\; \|w'\| = 1$

or alternatively, with $w = w'/d$ and $b = b'/d$:

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (x_i^T w + b) \ge 1, \quad \text{where } d = \dfrac{1}{\|w\|}$

A convex optimization problem with a quadratic objective function and linear constraints.



SVM—Dual, (LS)

Multiply the constraint equations by positive Lagrange multipliers and subtract them from the objective function:

$L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i\left[\, y_i (x_i^T w + b) - 1 \,\right]$

Minimize $L_P$ with respect to $w$ and $b$ and maximize with respect to the $\alpha_i$, subject to $\alpha_i \ge 0$.

Set the derivatives $\partial L_P/\partial w$ and $\partial L_P/\partial b$ to zero and maximize with respect to the $\alpha_i$:

$w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0$

Substituting into $L_P$ we get the so-called Wolfe dual:

$\max_\alpha\; L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k$
$\text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{N}\alpha_i y_i = 0$

Solve for the $\alpha_i$, then compute $w = \sum_i \alpha_i y_i x_i$ and $b$ from $\alpha_i\left[\, y_i(x_i^T w + b) - 1 \,\right] = 0$.
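As a sanity check of the dual formulation (my own sketch, not the lecture's solver), the QP can be handed to a general-purpose optimizer; the toy data below is hypothetical and linearly separable:

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable training set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Matrix with entries y_i y_k x_i^T x_k
K = (X * y[:, None]) @ (X * y[:, None]).T

def neg_dual(a):
    # negative of L_D, since minimize() minimizes
    return -(a.sum() - 0.5 * a @ K @ a)

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * N                            # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(N), bounds=bnds, constraints=cons).x

w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                   # support vectors
b = np.mean(y[sv] - X[sv] @ w)      # from y_i (x_i^T w + b) = 1
print(w, b)
```

For the non-separable case below, the only change to this sketch would be the box bounds `(0, C)` on the multipliers.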



SVM—Primal vs. Dual (LS)

Primal:

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 \quad \text{subject to:} \quad y_i(x_i^T w + b) \ge 1$

Dual:

$\max_{\alpha_i}\; \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k \quad \text{subject to:} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{N}\alpha_i y_i = 0$

The primal has a dense inequality constraint for every point in the training set. The dual has a single dense equality constraint and a set of box constraints, which makes it easier to solve than the primal.



SVM—Optimality Conditions (LS)

Optimality conditions for linearly separable data:

$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\;\forall i$

$y_i(x_i^T w + b) - 1 \ge 0 \;\;\forall i$

$w = \sum_{i=1}^{N}\alpha_i y_i x_i \;\Rightarrow\; f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b$

$\alpha_i\left(y_i(x_i^T w + b) - 1\right) = 0 \;\;\forall i \;\Rightarrow\; \alpha_i = 0$ for points which are not on the boundary of the margin.

Points with $\alpha_i > 0$ are support vectors.

[Figure: separable data with the margin; points on the margin have $\alpha > 0$ (support vectors), all other points have $\alpha = 0$.]



SVM—Linear, non-separable (LNS)

[Figure: the hyperplanes $f(x) = +1$, $f(x) = 0$, $f(x) = -1$, margin $2d = M$, and a sample $x_i$ violating the margin with slack $\xi_i$.]

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to:}$

$y_i(x_i^T w + b) \ge 1 - \xi_i \;\;\forall i, \qquad \xi_i \ge 0 \;\;\forall i$

The $\xi_i$ are called slack variables; the constant $C$ penalizes errors.
SVM—Dual (LNS)

Same procedure as in the separable case:

$\max_\alpha\; L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N}\alpha_i y_i = 0$

Solve for the $\alpha_i$, then compute $w = \sum_i \alpha_i y_i x_i$ and $b$ from $y_i(x_i^T w + b) - 1 = 0$ for any sample $x_i$ with $0 < \alpha_i < C$.
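In practice this QP is rarely solved by hand; off-the-shelf packages expose the same quantities. A hedged scikit-learn sketch on synthetic data (`dual_coef_` stores the products $\alpha_i y_i$, bounded in magnitude by $C$):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 2-D data with some overlap between the two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2)),
               rng.normal(loc=[+1, +1], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)   # soft-margin linear SVM

print("number of support vectors:", len(clf.support_vectors_))
print("dual coefficients alpha_i * y_i:", clf.dual_coef_)
print("bias b:", clf.intercept_)
```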



SVM—Optimality Conditions (LNS)

[Figure: the hyperplanes $f(x) = +1$, $f(x) = 0$, $f(x) = -1$, margin $2d = M$, and a sample $x_i$ with slack $\xi_i$.]

$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b$

$\alpha_i = 0 \;\Rightarrow\; \xi_i = 0,\; y_i f(x_i) > 1$
$0 < \alpha_i < C \;\Rightarrow\; \xi_i = 0,\; y_i f(x_i) = 1$ (unbounded support vectors)
$\alpha_i = C \;\Rightarrow\; \xi_i \ge 0,\; y_i f(x_i) \le 1$ (bounded support vectors)
SVM—Non-linear (NL)

Non-linear mapping:
$x' = \Phi(x)$

[Figure: data in the input space $(x_1, x_2)$ mapped to the feature space $(x_1', x_2')$, where it becomes linearly separable.]

Project into the feature space, then apply the SVM procedure.



SVM—Kernel Trick

$\max_\alpha\; \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{k=1}^{N}\sum_{i=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i'^T x_k'$
$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N}\alpha_i y_i = 0$

Only the inner product of the samples appears in the objective function. If we can write $K(x_i, x_k) = x_i'^T x_k'$, we can avoid any computations in the feature space.

Using $w = \sum_i \alpha_i y_i x_i'$, the solution $f(x') = x'^T w + b$ can be written as:

$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, \Phi(x)^T\Phi(x_i) + b = \sum_{i=1}^{N}\alpha_i y_i\, K(x, x_i) + b$



SVM—Kernels

When is a kernel $K(u, v)$ an inner product in a Hilbert space?

$K(u, v) = \sum_n \lambda_n\, \phi_n(u)\,\phi_n(v)$ with positive coefficients $\lambda_n$,

if for any $g(u) \in L_2$:

$\int K(u, v)\, g(u)\, g(v)\, du\, dv \ge 0$ (Mercer's condition)

Some examples of commonly used kernels:

Linear kernel: $u^T v$
Polynomial kernel: $(1 + u^T v)^d$
Gaussian kernel (RBF): $\exp(-\|u - v\|^2)$, shift invariant
MLP: $\tanh(u^T v - \theta)$
SVM—Polynomial Kernel

Polynomial second-degree kernel:

$K(u, v) = (1 + u^T v)^2, \quad u, v \in \mathbb{R}^2$
$= 1 + (u_1 v_1)^2 + (u_2 v_2)^2 + 2 u_1 v_1 u_2 v_2 + 2 u_1 v_1 + 2 u_2 v_2$
$= (1,\, u_1^2,\, u_2^2,\, \sqrt{2}\,u_1 u_2,\, \sqrt{2}\,u_1,\, \sqrt{2}\,u_2)\,(1,\, v_1^2,\, v_2^2,\, \sqrt{2}\,v_1 v_2,\, \sqrt{2}\,v_1,\, \sqrt{2}\,v_2)^T$

$\Phi(x) = (1,\, x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2,\, \sqrt{2}\,x_1,\, \sqrt{2}\,x_2)^T$

A shift-invariant kernel $K(u, v) = K(u - v)$ defined on $L_2([0, T]^d)$ can be written as the Fourier series of $K$:

$f(t) = \dfrac{1}{T}\sum_{k=-\infty}^{\infty}\lambda_k\, e^{j 2\pi k t / T}, \qquad f(t - t_0) = \dfrac{1}{T}\sum_{k=-\infty}^{\infty}\lambda_k\, e^{j 2\pi k t / T}\, e^{-j 2\pi k t_0 / T}$

$K(u - v) = \sum_{k=0}^{\infty}\lambda_k\, e^{j 2\pi \kappa_k u}\, e^{-j 2\pi \kappa_k v}, \quad \forall \kappa_k \in Z([-\infty, \infty]^d)$
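A quick numerical check of the second-degree polynomial expansion above (illustration only, not course code):

```python
import numpy as np

def K(u, v):
    """Second-degree polynomial kernel (1 + u.v)^2."""
    return (1.0 + u @ v) ** 2

def phi(x):
    """Explicit feature map for the 2-D second-degree polynomial kernel."""
    return np.array([1.0, x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1]])

u = np.array([0.5, -1.0])
v = np.array([2.0, 0.3])

print(K(u, v))            # kernel evaluated in the 2-D input space
print(phi(u) @ phi(v))    # same value via the 6-D feature space
```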



SVM—Uniqueness

Are the $\alpha_i$ in the solution unique? No.

[Figure: four training points $x_1, x_2, x_3, x_4$ at the corners of a square, labels $y = +1$ on one side and $y = -1$ on the other, separated by $w = (1, 0)$.]

$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, x_i^T x + b, \qquad w = (1, 0)$

Two solutions:
$\alpha = (0.25, 0.25, 0.25, 0.25)$
$\alpha = (0.5, 0.5, 0, 0)$

Both satisfy the constraints $\alpha_i \ge 0$, $\sum_{i=1}^{N}\alpha_i y_i = 0$.

More important: is the solution $f(x)$ unique?

Yes, the solution is unique and global if the objective function is strictly convex.
SVM—Multiclass

Bottom-up 1 vs. 1:
[Tree: leaves A, B, C, D; pairwise winners "A or B" and "C or D"; root "A or B or C or D".]
Training: k(k-1)/2 classifiers
Classification: k-1 evaluations

1 vs. All:
A vs. {B, C, D}, B vs. {A, C, D}, C vs. {A, B, D}, D vs. {A, B, C}
Training: k classifiers
Classification: k evaluations
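Off-the-shelf wrappers implement both schemes on top of a binary SVM; a scikit-learn sketch using its digits set as a stand-in k = 10 problem (an assumption of mine, not the course setup):

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)   # k = 10 classes

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)   # trains k(k-1)/2 = 45 binary SVMs
ova = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)  # trains k = 10 binary SVMs

print(len(ovo.estimators_), len(ova.estimators_))          # 45 and 10
```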



SVM—Choosing the Kernel

How to choose the kernel?

• Linear SVMs are simple to compute and fast at run-time, but often not sufficient for complex tasks.

• SVMs with Gaussian kernels have shown excellent performance in many applications (after some tuning of sigma), but are slow at run-time.

• Polynomial kernels of 2nd degree are commonly used in computer vision applications: a good trade-off between classification performance and computational complexity.



SVM—Example
Face detection with linear and 2nd-degree polynomial SVMs & LDA



SVM—Choosing C

How to choose the C-value?

$\min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$

The C-value penalizes points within the margin. A large C-value can lead to poor generalization performance (over-fitting).

From our own experience in object detection tasks: find a kernel and C-value which give you zero errors on the training set.
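A common way to act on this advice is a small grid search over C (and the kernel parameter) with cross-validation; a scikit-learn sketch on stand-in data (the grid values are arbitrary):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)   # stand-in data for illustration

# Search over C and the Gaussian-kernel width by 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```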



SVM—Computation during Classification

In computer vision applications fast classification is usually more important than fast training.

Two ways of computing the decision function $f(x)$:

a) $w^T \Phi(x) + b$    b) $\sum_{i=1}^{N}\alpha_i y_i\, K(x, x_i) + b$

Which one is faster?

• For a linear kernel: a).
• For a polynomial 2nd-degree kernel:
  multiplications for a): $G_{F,\text{poly2}} = (n + 2)\, n$, where $n$ is the dimension of $x$;
  multiplications for b): $G_{K,\text{poly2}} = (n + 2)\, s$, where $s$ is the number of support vectors.
• Gaussian kernel: only b), since the dimension of $\Phi(x)$ is $\infty$.
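A small check that the two evaluation routes agree for a linear kernel, where the kernel expansion can be collapsed into a single weight vector (synthetic data, illustration only):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(200))

clf = SVC(kernel='linear', C=1.0).fit(X, y)
x = rng.standard_normal(10)

# b) kernel expansion over the support vectors
f_kernel = (clf.dual_coef_ @ (clf.support_vectors_ @ x))[0] + clf.intercept_[0]

# a) collapse the expansion into a single weight vector w = sum_i alpha_i y_i x_i
w = clf.dual_coef_ @ clf.support_vectors_
f_linear = (w @ x)[0] + clf.intercept_[0]

print(np.isclose(f_kernel, f_linear))   # same decision value, but a) costs O(n) at run-time
```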



Learning Theory—Problem Formulation

From a given set of training examples $\{x_i, y_i\}$ learn the mapping $x \to y$. The learning machine is defined by a set of possible mappings $x \to f(x, \alpha)$, where $\alpha$ is the adjustable parameter of $f$.

The goal is to minimize the expected risk $R$:

$R(\alpha) = \int V(f(x, \alpha), y)\, dP(x, y)$

$V$ is the loss function; $P$ is the probability distribution function.

We can't compute $R(\alpha)$ since we don't know $P(x, y)$.



Learning Theory—Empirical Risk Minimization

To solve the problem, minimize the "empirical risk" $R_{emp}$ over the training set:

$R_{emp}(\alpha) = \dfrac{1}{N}\sum_{i=1}^{N} V(f(x_i, \alpha), y_i)$

where $V$ is the loss function. Common loss functions:

$V(f(x), y) = (y - f(x))^2$ (least squares)
$V(f(x), y) = (1 - y f(x))_+$ (hinge loss), where $(x)_+ \equiv \max(x, 0)$

[Figure: the hinge loss plotted against $y f(x)$; it equals 1 at $y f(x) = 0$ and is zero for $y f(x) \ge 1$.]
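The two loss functions in a few lines of NumPy (values are illustrative):

```python
import numpy as np

def squared_loss(f, y):
    """Least-squares loss V(f(x), y) = (y - f(x))^2."""
    return (y - f) ** 2

def hinge_loss(f, y):
    """Hinge loss V(f(x), y) = (1 - y f(x))_+ with (x)_+ = max(x, 0)."""
    return np.maximum(0.0, 1.0 - y * f)

f = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])   # classifier outputs f(x_i)
y = np.array([ 1.0, 1.0, 1.0, 1.0, 1.0])   # true labels

print(squared_loss(f, y))   # [9.   1.   0.25 0.   4.  ]
print(hinge_loss(f, y))     # [3.  1.  0.5 0.  0. ]
```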
Learning Theory & SVM

Bound on the expected risk:

For a loss function with $0 \le V(f(x), y) \le 1$, with probability $1 - \eta$, $0 \le \eta \le 1$, the following bound holds:

$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\dfrac{h\ln(2N/h) + h - \ln(\eta/4)}{N}}$

The bound is independent of the probability distribution $P(x, y)$.

$R_{emp}(\alpha)$: empirical risk
$N$: number of training examples
$h$: Vapnik-Chervonenkis (VC) dimension

Keep all parameters in the bound fixed except one: increasing $(1 - \eta)$ increases the bound, increasing $N$ decreases the bound, increasing $h$ increases the bound.



Learning Theory—VC Dimension

The VC dimension is a property of the set of functions $\{f(\alpha)\}$. If for a set of $N$ points labeled in all $2^N$ possible ways one can find an $f \in \{f(\alpha)\}$ which separates the points correctly, one says that the set of points is shattered by $\{f(\alpha)\}$.

The VC dimension is the maximum number of points that can be shattered by $\{f(\alpha)\}$.

The VC dimension of the functions $f: w^T x + b = 0$ in 2 dimensions is 3: a line can shatter three points in general position, but not four.



Learning Theory—SVM

[Figure: the training points enclosed in a sphere of diameter $D$, separated with margin $M$.]

The expected risk $E(R)$ for the optimal hyperplanes:

$E(R) \le \dfrac{E(D^2 / M^2)}{N}$

where the expectation is over all training sets of size $N$.

"Algorithms that maximize the margin have better generalization performance."
Bounds

Most bounds on the expected risk are very loose; instead compute:

Cross-validation error: the error on a cross-validation set which is different from the training set.

Leave-one-out error: leave one training example out of the training set, train the classifier, and test it on the example which was left out. Do this for all examples. For SVMs the leave-one-out error is upper-bounded by the number of support vectors.
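A scikit-learn sketch of the leave-one-out estimate (the dataset and hyperparameters are stand-ins chosen for illustration):

```python
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X, y = X[:200], y[:200]        # small subset so the N-fold loop stays cheap

clf = SVC(kernel='rbf', gamma=1e-3, C=10)

# Leave-one-out: train on N-1 examples, test on the held-out one, repeat N times
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out error:", 1.0 - scores.mean())
```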



Regularization Theory

Given $N$ examples $(x_i, y_i)$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$, solve:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} V(f(x_i), y_i) + \gamma\, \|f\|_K^2$

where $\|f\|_K^2$ is the norm in a Reproducing Kernel Hilbert Space (RKHS) $H$ with reproducing kernel $K$, and $\gamma$ is the regularization parameter.

$\gamma\, \|f\|_K^2$ can be interpreted as a smoothness constraint.

Under rather general conditions the solution can be written as:

$f(x) = \sum_{i=1}^{N} c_i\, K(x, x_i)$

(a smooth function)
Regularization—Reproducing Kernel Hilbert Space (RKHS)
Reproducing Kernel Hilbert Space (RKHS) $H$:

$f(x) = \langle K(x, y), f(y) \rangle_H$ (reproducing property)

Positive numbers $\lambda_n$ and an orthonormal set of functions $\phi_n(x)$ with $\int \phi_n(x)\phi_m(x)\,dx = 0$ for $n \ne m$ and $1$ otherwise:

$K(x, y) \equiv \sum_n \lambda_n\,\phi_n(x)\,\phi_n(y)$, where the $\lambda_n$ are the nonnegative eigenvalues of $K$.

For $f(x) = \sum_n a_n \phi_n(x)$ with $a_n = \int f(x)\phi_n(x)\,dx$, and likewise $g(x) = \sum_n b_n \phi_n(x)$:

$\langle f, g \rangle_H \equiv \sum_n \dfrac{a_n b_n}{\lambda_n}$

$\|f\|_H^2 = \langle f, f \rangle_H = \sum_n a_n^2 / \lambda_n$



Regularization—Simple Example of RKHS
The kernel is a one-dimensional Gaussian with $\sigma = 1$:

$K(x, y) = \exp(-(x - y)^2), \quad x, y \in [0, 1]$

Write $K(x, y)$ as a Fourier expansion using the shift theorem (period $T = 1$):

$K(x, y) = \sum_n \lambda_n\, \exp(j 2\pi n x)\exp(-j 2\pi n y)$

where the $\lambda_n$ are the Fourier coefficients of $\exp(-x^2)$:

$\lambda_n = A\, \exp(-n^2/2)$

$\lambda_n$ decreases with higher frequencies (increasing $n$). This is a property of most kernels. The regularization term

$\|f\|_H^2 = \sum_n a_n^2 / \lambda_n$, where the $a_n$ are the Fourier coefficients of $f(x)$,

penalizes high frequencies more than low frequencies $\Rightarrow$ smoothness!



Regularization—SVM

For the hinge loss function $V(f(x), y) = (1 - y f(x))_+$ it can be shown that the regularization problem is equivalent to the SVM problem:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} (1 - y_i f(x_i))_+ + \lambda\, \|f\|_K^2$

Introducing slack variables $\xi_i = 1 - y_i f(x_i)$ we can rewrite this as:

$\min_{f \in H}\; \dfrac{1}{N}\sum_{i=1}^{N} \xi_i + \lambda\, \|f\|_K^2, \quad \text{subject to: } y_i f(x_i) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \;\;\forall i$

It can be shown that this is equivalent to the SVM problem (up to $b$):

SVM: $\quad \min_{w,b}\; \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i, \qquad C = 1/(2\lambda N)$
$\text{subject to: } y_i(x_i^T w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \;\;\forall i$


SVM—Summary

• SVMs are maximum margin classifiers.

• Only training points close to the boundary (support vectors) occur in the SVM solution.

• The SVM problem is convex; the solution is global and unique.

• SVMs can handle non-separable data.

• Non-linear separation in the input space is possible by projecting the data into a feature space.

• All calculations can be done in the input space (kernel trick).

• SVMs are known to perform well in high dimensional problems with few examples.

• Depending on the kernel, SVMs can be slow during classification.

• SVMs are binary classifiers; they are not efficient for problems with a large number of classes.



Literature

T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning, Springer, 2001. (LDA, QDA, extensions to LDA, SVM & regularization.)

C. Burges: A Tutorial on SVM for Pattern Recognition, 1999. (Learning theory, SVM.)

R. Rifkin: Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning, 2002. (SVM training, SVM multiclass.)

T. Evgeniou, M. Pontil, T. Poggio: Regularization Networks and SVMs, 1999. (SVM & regularization.)

V. Vapnik: The Nature of Statistical Learning, 1995. (Statistical learning theory, SVM.)



Homework

Classification problem on the NIST handwritten digits data, involving PCA, LDA, and SVMs.

PCA code will be posted today.

