
CS434a/541a: Pattern Recognition

Prof. Olga Veksler

Lecture 8

Today
Continue with Dimensionality Reduction
Last lecture: PCA
This lecture: Fisher Linear Discriminant

Data Representation vs. Data Classification


PCA finds the most accurate data representation
in a lower dimensional space
  project data in the directions of maximum variance
However the directions of maximum variance may
be useless for classification
  [figure: two classes that are separable in 2D become not separable after applying PCA to each class]
Fisher Linear Discriminant: project to a line which
preserves the direction useful for data classification

Fisher Linear Discriminant


Main idea: find a projection to a line such that samples
from different classes are well separated
Example in 2D:
  [figure: a bad line to project to, classes are mixed up, vs. a good line to project to, classes are well separated]

Fisher Linear Discriminant


Suppose we have 2 classes and d-dimensional
samples $x_1, \dots, x_n$ where
  $n_1$ samples come from the first class
  $n_2$ samples come from the second class
Consider a projection onto a line
Let the line direction be given by a unit vector $v$
The scalar $v^t x_i$ is the distance of the
projection of $x_i$ from the origin
Thus $v^t x_i$ is the projection of
$x_i$ onto a one dimensional subspace
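The projection step on this slide is a single dot product per sample. A minimal numpy sketch, assuming made-up 2D data and a made-up direction (not the lecture's example):

```python
import numpy as np

# d-dimensional samples, one per row (made-up data for illustration)
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [4.0, 5.0]])

v = np.array([1.0, 1.0])
v = v / np.linalg.norm(v)   # unit vector giving the line direction

# projection of each sample x_i onto the line: the scalar v^t x_i
y = X @ v
print(y)                    # one number per sample, i.e. a 1D representation
```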

Fisher Linear Discriminant


Thus the projection of sample $x_i$ onto a line in
direction $v$ is given by $v^t x_i$
How to measure separation between projections of
different classes?
Let $\tilde{\mu}_1$ and $\tilde{\mu}_2$ be the means of the projections of
classes 1 and 2
Let $\mu_1$ and $\mu_2$ be the means of classes 1 and 2
$|\tilde{\mu}_1 - \tilde{\mu}_2|$ seems like a good measure
$$\tilde{\mu}_1 = \frac{1}{n_1}\sum_{x_i \in C_1} v^t x_i = v^t \left(\frac{1}{n_1}\sum_{x_i \in C_1} x_i\right) = v^t \mu_1$$
similarly, $\tilde{\mu}_2 = v^t \mu_2$

Fisher Linear Discriminant


How good is $|\tilde{\mu}_1 - \tilde{\mu}_2|$ as a measure of separation?
  the larger $|\tilde{\mu}_1 - \tilde{\mu}_2|$, the better the expected separation
[figure: the same two classes projected onto the vertical and the horizontal axis]
  the vertical axis is a better line than the horizontal
  axis to project to for class separability
  however the difference between the projected means is larger along the horizontal axis

Fisher Linear Discriminant

The problem with $|\tilde{\mu}_1 - \tilde{\mu}_2|$ is that it does not
consider the variance of the classes
[figure: the same projections annotated with the class variances: small variance along one axis, large variance along the other]

Fisher Linear Discriminant


We need to normalize $|\tilde{\mu}_1 - \tilde{\mu}_2|$ by a factor which is
proportional to the variance
Have samples $z_1, \dots, z_n$. The sample mean is
$$\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$$
Define their scatter as
$$s = \sum_{i=1}^{n} (z_i - \bar{z})^2$$
Thus scatter is just the sample variance multiplied by $n$
  scatter measures the same thing as variance, the spread
  of data around the mean
  scatter is just on a different scale than variance
[figure: a widely spread sample has larger scatter, a tightly clustered sample has smaller scatter]
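A quick numeric check of the scatter/variance relationship, on made-up 1D samples:

```python
import numpy as np

z = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up 1D samples

scatter = np.sum((z - z.mean()) ** 2)
variance = z.var()          # population variance, i.e. mean squared deviation

print(scatter)              # 32.0
print(len(z) * variance)    # 32.0 -- scatter = n * variance
```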

Fisher Linear Discriminant


Fisher's solution: normalize $|\tilde{\mu}_1 - \tilde{\mu}_2|$ by the scatter
Let $y_i = v^t x_i$, i.e. the $y_i$'s are the projected samples
Scatter for the projected samples of class 1 is
$$\tilde{s}_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde{\mu}_1)^2$$
Scatter for the projected samples of class 2 is
$$\tilde{s}_2^2 = \sum_{y_i \in \text{Class 2}} (y_i - \tilde{\mu}_2)^2$$

Fisher Linear Discriminant


We need to normalize by both the scatter of class 1 and
the scatter of class 2
Thus the Fisher linear discriminant is to project onto the line
in the direction $v$ which maximizes
$$J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
  numerator: we want the projected means to be far from each other
  denominator: we want the scatter in class 1 to be as
  small as possible, i.e. samples of class 1 cluster around the
  projected mean $\tilde{\mu}_1$
  likewise we want the scatter in class 2 to be as
  small as possible, i.e. samples of class 2 cluster around the
  projected mean $\tilde{\mu}_2$
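A minimal sketch of evaluating this criterion for a candidate direction, computed directly from the projected samples (the helper name fisher_criterion and the 2D data are made-up, not the lecture's example):

```python
import numpy as np

def fisher_criterion(X1, X2, v):
    """J(v) = (mu1~ - mu2~)^2 / (s1~^2 + s2~^2), computed from projections."""
    v = v / np.linalg.norm(v)
    y1, y2 = X1 @ v, X2 @ v                       # projected samples of each class
    mean_gap = (y1.mean() - y2.mean()) ** 2       # (mu1~ - mu2~)^2
    scatter = np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)
    return mean_gap / scatter

# made-up 2D data for illustration
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[2.0, 0.0], [3.0, 1.0], [4.0, 1.0]])

print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))   # project onto x-axis
print(fisher_criterion(X1, X2, np.array([0.0, 1.0])))   # project onto y-axis: larger J
```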

Fisher Linear Discriminant

$$J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

If we find a $v$ which makes $J(v)$ large, we are
guaranteed that the classes are well separated
  the projected means are far from each other
  a small $\tilde{s}_1$ implies that the projected samples of
  class 1 are clustered around the projected mean $\tilde{\mu}_1$
  a small $\tilde{s}_2$ implies that the projected samples of
  class 2 are clustered around the projected mean $\tilde{\mu}_2$

Fisher Linear Discriminant Derivation

$$J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

All we need to do now is to express $J$ explicitly as a
function of $v$ and maximize it
  straightforward, but we need some linear algebra and calculus
Define the separate class scatter matrices $S_1$ and
$S_2$ for classes 1 and 2. These measure the scatter
of the original samples $x_i$ (before projection)
$$S_1 = \sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t \qquad
S_2 = \sum_{x_i \in \text{Class 2}} (x_i - \mu_2)(x_i - \mu_2)^t$$
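A short numpy sketch of building these class scatter matrices as sums of outer products (the helper scatter_matrix and the data are made-up illustrations):

```python
import numpy as np

def scatter_matrix(X):
    """Sum of outer products (x_i - mu)(x_i - mu)^t over the rows of X."""
    d = X - X.mean(axis=0)
    return d.T @ d            # equivalent to summing the outer products

# made-up 2D samples for each class
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[2.0, 0.0], [3.0, 1.0], [4.0, 1.0]])

S1, S2 = scatter_matrix(X1), scatter_matrix(X2)
S_W = S1 + S2                 # within-class scatter matrix, used on the next slide
print(S_W)
```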

Fisher Linear Discriminant Derivation


Now define the within-class scatter matrix
$$S_W = S_1 + S_2$$
Recall that
$$\tilde{s}_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde{\mu}_1)^2$$
Using $y_i = v^t x_i$ and $\tilde{\mu}_1 = v^t \mu_1$:
$$\tilde{s}_1^2 = \sum_{x_i \in \text{Class 1}} \left(v^t x_i - v^t \mu_1\right)^2
= \sum_{x_i \in \text{Class 1}} \left(v^t (x_i - \mu_1)\right)\left((x_i - \mu_1)^t v\right)
= v^t \left[\sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t \right] v = v^t S_1 v$$

Fisher Linear Discriminant Derivation


Similarly, $\tilde{s}_2^2 = v^t S_2 v$
Therefore $\tilde{s}_1^2 + \tilde{s}_2^2 = v^t S_1 v + v^t S_2 v = v^t S_W v$
Define the between-class scatter matrix
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t$$
$S_B$ measures the separation between the means of the two
classes (before projection)
Let's rewrite the separation of the projected means:
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (v^t \mu_1 - v^t \mu_2)^2
= v^t (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t v = v^t S_B v$$
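A quick numeric check of these two identities, on made-up data and an arbitrary direction (a sketch, not part of the lecture):

```python
import numpy as np

# made-up 2D data and an arbitrary direction, used only to check the identities
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[2.0, 0.0], [3.0, 1.0], [4.0, 1.0]])
v = np.array([0.6, 0.8])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_B = np.outer(m1 - m2, m1 - m2)

y1, y2 = X1 @ v, X2 @ v
proj_scatter = np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)
proj_mean_gap = (y1.mean() - y2.mean())**2

print(np.isclose(proj_scatter, v @ S_W @ v))    # True: s1~^2 + s2~^2 = v^t S_W v
print(np.isclose(proj_mean_gap, v @ S_B @ v))   # True: (mu1~ - mu2~)^2 = v^t S_B v
```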

Fisher Linear Discriminant Derivation


Thus our objective function can be written as
$$J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{v^t S_B v}{v^t S_W v}$$
Maximize $J(v)$ by taking the derivative w.r.t. $v$ and
setting it to 0:
$$\frac{d}{dv} J(v)
= \frac{\left(\frac{d}{dv} v^t S_B v\right) v^t S_W v - \left(\frac{d}{dv} v^t S_W v\right) v^t S_B v}{\left(v^t S_W v\right)^2}
= \frac{(2 S_B v)\, v^t S_W v - (2 S_W v)\, v^t S_B v}{\left(v^t S_W v\right)^2} = 0$$

Fisher Linear Discriminant Derivation


Need to solve
$$v^t S_W v\,(S_B v) - v^t S_B v\,(S_W v) = 0$$
Divide by $v^t S_W v$:
$$\frac{v^t S_W v\,(S_B v)}{v^t S_W v} - \frac{v^t S_B v\,(S_W v)}{v^t S_W v} = 0
\;\;\Rightarrow\;\;
S_B v - \frac{v^t S_B v}{v^t S_W v}\,(S_W v) = 0$$
With $\lambda = \frac{v^t S_B v}{v^t S_W v}$ this becomes
$$S_B v = \lambda\, S_W v$$
a generalized eigenvalue problem

Fisher Linear Discriminant Derivation


$$S_B v = \lambda\, S_W v$$
If $S_W$ has full rank (the inverse exists), we can convert
this to a standard eigenvalue problem
$$S_W^{-1} S_B v = \lambda\, v$$
But $S_B x$, for any vector $x$, points in the same
direction as $\mu_1 - \mu_2$:
$$S_B x = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t x = (\mu_1 - \mu_2)\left((\mu_1 - \mu_2)^t x\right) = \alpha\,(\mu_1 - \mu_2)$$
Thus we can write down the solution of the eigenvalue problem immediately:
$$v = S_W^{-1}(\mu_1 - \mu_2)$$
Check:
$$S_W^{-1} S_B \left[S_W^{-1}(\mu_1 - \mu_2)\right] = S_W^{-1}\left[\alpha\,(\mu_1 - \mu_2)\right] = \alpha \left[S_W^{-1}(\mu_1 - \mu_2)\right]$$
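A minimal sketch of this closed-form solution, using np.linalg.solve rather than an explicit inverse (the helper fisher_direction and the data are made-up illustrations):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Closed-form FLD direction v = S_W^{-1} (mu1 - mu2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    v = np.linalg.solve(S_W, m1 - m2)     # solves S_W v = (mu1 - mu2)
    return v / np.linalg.norm(v)          # the scale does not matter, only the direction

# made-up 2D data for illustration
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0]])
X2 = np.array([[2.0, 0.0], [3.0, 1.0], [4.0, 1.0], [5.0, 2.0]])
print(fisher_direction(X1, X2))
```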

Fisher Linear Discriminant Example


Data
  Class 1 has 5 samples c1 = [(1,2), (2,3), (3,3), (4,5), (5,5)]
  Class 2 has 6 samples c2 = [(1,0), (2,1), (3,1), (3,2), (5,3), (6,5)]
Arrange the data in 2 separate matrices
$$c_1 = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 3 \\ 4 & 5 \\ 5 & 5 \end{bmatrix} \qquad
c_2 = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 3 & 1 \\ 3 & 2 \\ 5 & 3 \\ 6 & 5 \end{bmatrix}$$
Notice that PCA performs very
poorly on this data because the
direction of largest variance is not
helpful for classification

Fisher Linear Discriminant Example


First compute the mean of each class
$$\mu_1 = \text{mean}(c_1) = [3 \;\; 3.6] \qquad \mu_2 = \text{mean}(c_2) = [3.3 \;\; 2]$$
Compute the scatter matrices $S_1$ and $S_2$ for each class
$$S_1 = 4\,\text{cov}(c_1) = \begin{bmatrix} 10 & 8.0 \\ 8.0 & 7.2 \end{bmatrix} \qquad
S_2 = 5\,\text{cov}(c_2) = \begin{bmatrix} 17.3 & 16 \\ 16 & 16 \end{bmatrix}$$
Within-class scatter:
$$S_W = S_1 + S_2 = \begin{bmatrix} 27.3 & 24 \\ 24 & 23.2 \end{bmatrix}$$
  it has full rank, so we don't have to solve the generalized eigenvalue problem
The inverse of $S_W$ is
$$S_W^{-1} = \text{inv}(S_W) = \begin{bmatrix} 0.39 & -0.41 \\ -0.41 & 0.47 \end{bmatrix}$$
Finally, the optimal line direction $v$:
$$v = S_W^{-1}(\mu_1 - \mu_2) = \begin{bmatrix} -0.79 \\ 0.89 \end{bmatrix}$$

Fisher Linear Discriminant Example


Notice that as long as the line
has the right direction, its
exact position does not matter
The last step is to compute
the actual 1D projected vector y.
Let's do it separately for
each class (here $v$ is rescaled; only its direction matters):
$$Y_1 = v^t c_1^t = [-0.65 \;\; 0.73]\begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 2 & 3 & 3 & 5 & 5 \end{bmatrix}
= [\,0.81 \;\; 0.89 \;\; 0.24 \;\; 1.05 \;\; 0.40\,]$$
$$Y_2 = v^t c_2^t = [-0.65 \;\; 0.73]\begin{bmatrix} 1 & 2 & 3 & 3 & 5 & 6 \\ 0 & 1 & 1 & 2 & 3 & 5 \end{bmatrix}
= [\,-0.65 \;\; -0.57 \;\; -1.22 \;\; -0.49 \;\; -1.06 \;\; -0.25\,]$$
  all projected samples of class 1 are positive and all of class 2 are negative, so the classes are well separated on this line
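A short script that reproduces the numbers of this example (a sketch of the computation, not code from the lecture):

```python
import numpy as np

c1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
c2 = np.array([[1, 0], [2, 1], [3, 1], [3, 2], [5, 3], [6, 5]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)           # [3, 3.6] and [3.33, 2]
S1 = (c1 - mu1).T @ (c1 - mu1)                        # = 4 * cov(c1)
S2 = (c2 - mu2).T @ (c2 - mu2)                        # = 5 * cov(c2)
S_W = S1 + S2                                         # [[27.3, 24], [24, 23.2]]

v = np.linalg.solve(S_W, mu1 - mu2)                   # ~ [-0.79, 0.89]
print(v)

# project each class onto the line; class 1 lands on one side, class 2 on the other
print(c1 @ v)
print(c2 @ v)
```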

Multiple Discriminant Analysis (MDA)


Can generalize FLD to multiple classes
In the case of c classes, can reduce the dimensionality to
1, 2, 3, …, c−1 dimensions
Project sample $x_i$ to a linear subspace $y_i = V^t x_i$
  $V$ is called the projection matrix

Multiple Discriminant Analysis (MDA)


Let
  $n_i$ be the number of samples of class $i$
  $\mu_i$ be the sample mean of class $i$
  $\mu$ be the total mean of all samples
$$\mu_i = \frac{1}{n_i}\sum_{x \in \text{class } i} x \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Objective function:
$$J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}$$
The within-class scatter matrix $S_W$ is
$$S_W = \sum_{i=1}^{c} S_i = \sum_{i=1}^{c} \sum_{x_k \in \text{class } i} (x_k - \mu_i)(x_k - \mu_i)^t$$
The between-class scatter matrix $S_B$ is
$$S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^t$$
  its maximum rank is c−1

Multiple Discriminant Analysis (MDA)


$$J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}$$

First solve the generalized eigenvalue problem:
$$S_B v = \lambda\, S_W v$$
  there are at most c−1 distinct solution eigenvalues
Let $v_1, v_2, \dots, v_{c-1}$ be the corresponding eigenvectors
The optimal projection matrix $V$ to a subspace of
dimension k is given by the eigenvectors
corresponding to the largest k eigenvalues
Thus we can project to a subspace of dimension at
most c−1
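A sketch of this recipe, solving the generalized eigenvalue problem with scipy.linalg.eigh (the helper mda_projection, the labels, and the data are made-up illustrations):

```python
import numpy as np
from scipy.linalg import eigh   # solves S_B v = lambda * S_W v for symmetric matrices

def mda_projection(X, labels, k):
    """Return the d x k projection matrix V from the top-k generalized eigenvectors."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for cls in np.unique(labels):
        Xc = X[labels == cls]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                 # within-class scatter
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)    # between-class scatter
    # eigh returns eigenvalues in ascending order, so take the last k eigenvectors
    eigvals, eigvecs = eigh(S_B, S_W)
    return eigvecs[:, -k:]

# made-up 3-class data in 3D, projected down to c - 1 = 2 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(20, 3)) for m in ([0, 0, 0], [3, 0, 1], [0, 3, 2])])
labels = np.repeat([0, 1, 2], 20)
V = mda_projection(X, labels, k=2)
Y = X @ V                      # projected samples y_i = V^t x_i
print(Y.shape)                 # (60, 2)
```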

FDA and MDA Drawbacks


Reduces the dimension only to at most k = c−1 (unlike PCA)
For complex data, projection onto even the best line may
result in non-separable projected samples
FLD will fail:
1. If J(v) is always 0: this happens when $\mu_1 = \mu_2$
   [figures: two examples with equal class means; in one, PCA performs reasonably well, in the other PCA also fails]
2. If J(v) is always large: the classes have large overlap when
   projected onto any line (PCA will also fail)
