
Intelligent Data Analysis and Probabilistic Inference

Lecture 15: Linear Discriminant Analysis

Recommended reading:
Bishop, Chapter 4.1
Hastie et al., Chapter 4.3

Duncan Gillies and Marc Deisenroth


Department of Computing
Imperial College London

February 22, 2016



Classification
[Figure: data in the $(x_1, x_2)$ plane; adapted from PRML (Bishop, 2006)]

§ Input vector $x \in \mathbb{R}^D$; assign it to one of $K$ discrete classes $\mathcal{C}_k$, $k = 1, \dots, K$


§ Assumption: the classes are disjoint, i.e., each input vector is assigned to exactly one class
§ Idea: divide the input space into decision regions whose boundaries are called decision boundaries/surfaces


Linear Classification
[Figure from PRML (Bishop, 2006)]

§ Focus on linear classification models, i.e., the decision boundary is a linear function of $x$, defined by a $(D-1)$-dimensional hyperplane
§ If the data can be separated exactly by linear decision surfaces, they are called linearly separable
§ Implicit assumption: the classes can be modeled well by Gaussians
§ Here: treat classification as a projection problem


Example

§ Measurements for 150 Iris flowers from three different species
§ Four features (petal length/width, sepal length/width)
§ Given a new measurement of these features, predict the Iris species based on a projection onto a low-dimensional space
§ PCA may not be ideal to separate the classes well
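A minimal illustrative sketch in Python, assuming scikit-learn's bundled copy of the Iris data; it only checks that the arrays match the 150 flowers and four features mentioned above.

```python
# Sketch: inspect the Iris data (assumes scikit-learn is available).
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data        # shape (150, 4): sepal/petal lengths and widths
y = iris.target      # shape (150,): species labels 0, 1, 2

print(X.shape, y.shape)        # (150, 4) (150,)
print(iris.target_names)       # ['setosa' 'versicolor' 'virginica']
```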


Orthogonal Projections (Repetition)

§ Project an input vector $x \in \mathbb{R}^D$ down to a one-dimensional subspace with basis vector $w$
§ With $\|w\| = 1$, we get
  $P = w w^\top$: projection matrix, such that $P x = p$
  $p = y w \in \mathbb{R}^D$: projection point (discussed in Lecture 14)
  $y = w^\top x \in \mathbb{R}$: coordinate with respect to the basis $w$ (today)
§ We will largely focus on the coordinates $y$ in the following
§ Projection points equally apply to the concepts discussed today
§ Coordinates equally apply to PCA (see Lecture 14)
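A small NumPy sketch of these quantities; the vectors $w$ and $x$ below are arbitrary illustrative values.

```python
import numpy as np

w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)      # enforce ||w|| = 1
x = np.array([2.0, 1.0])       # arbitrary point in R^2

P = np.outer(w, w)             # projection matrix P = w w^T
y = w @ x                      # coordinate y = w^T x (a scalar)
p = y * w                      # projection point p = y w in R^D

assert np.allclose(P @ x, p)   # P x = p, as stated above
```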

Classification as Projection

[Figure: projections of the data onto $w$ with decision threshold $w_0$]

§ Assume we know the basis vector $w$; then we can compute the projection of any point $x \in \mathbb{R}^D$ onto the one-dimensional subspace spanned by $w$
§ Choose a threshold $w_0$ such that we decide on $\mathcal{C}_1$ if $y \ge w_0$ and on $\mathcal{C}_2$ otherwise


The Linear Decision Boundary of LDA

§ Look at the log-probability ratio
  $$\log\frac{p(\mathcal{C}_1 \mid x)}{p(\mathcal{C}_2 \mid x)} = \log\frac{p(x \mid \mathcal{C}_1)}{p(x \mid \mathcal{C}_2)} + \log\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}\,,$$
  where the decision boundary (for $\mathcal{C}_1$ or $\mathcal{C}_2$) is at 0.
§ Assume a Gaussian likelihood $p(x \mid \mathcal{C}_i) = \mathcal{N}\left(x \mid m_i, \Sigma\right)$ with the same covariance $\Sigma$ in both classes. Decision boundary:
  $$\log\frac{p(\mathcal{C}_1 \mid x)}{p(\mathcal{C}_2 \mid x)} = 0$$
  $$\Leftrightarrow\; \log\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} - \tfrac{1}{2}\left(m_1^\top \Sigma^{-1} m_1 - m_2^\top \Sigma^{-1} m_2\right) + (m_1 - m_2)^\top \Sigma^{-1} x = 0$$
  $$\Leftrightarrow\; (m_1 - m_2)^\top \Sigma^{-1} x = \tfrac{1}{2}\left(m_1^\top \Sigma^{-1} m_1 - m_2^\top \Sigma^{-1} m_2\right) - \log\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$
§ This is of the form $Ax = b$: the decision boundary is linear in $x$
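A sketch of this two-class rule with made-up parameters (the means, shared covariance, and priors are illustrative assumptions, not estimates); it evaluates the linear boundary derived above, written here as $a^\top x = b$ with $a = \Sigma^{-1}(m_1 - m_2)$, and assigns the boundary itself to $\mathcal{C}_1$.

```python
import numpy as np

# Illustrative parameters (assumed, not estimated from data)
m1 = np.array([1.0, 2.0])
m2 = np.array([3.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
p1, p2 = 0.5, 0.5                          # class priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
a = Sigma_inv @ (m1 - m2)                  # (m1 - m2)^T Sigma^{-1} x = a^T x
b = 0.5 * (m1 @ Sigma_inv @ m1 - m2 @ Sigma_inv @ m2) - np.log(p1 / p2)

def decide(x):
    """Return 1 for class C1 if the log-probability ratio is >= 0, else 2."""
    return 1 if a @ x >= b else 2

print(decide(np.array([1.0, 1.5])), decide(np.array([3.0, -0.5])))
```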


Potential Issues

[Figure from PRML (Bishop, 2006)]

§ Considerable loss of information when projecting
§ Even if the data were linearly separable in $\mathbb{R}^D$, we may lose this separability (see figure)
§ Find a good basis vector $w$ that spans the subspace we project onto


Approach: Maximize Class Separation

§ Adjust the components of the basis vector $w$: select the projection that maximizes the class separation
§ Consider two classes: $\mathcal{C}_1$ with $N_1$ points and $\mathcal{C}_2$ with $N_2$ points
§ Corresponding mean vectors:
  $$m_1 = \frac{1}{N_1} \sum_{n \in \mathcal{C}_1} x_n\,, \qquad m_2 = \frac{1}{N_2} \sum_{n \in \mathcal{C}_2} x_n$$
§ Measure class separation as the distance of the projected class means
  $$m_2 - m_1 = w^\top m_2 - w^\top m_1 = w^\top (m_2 - m_1)$$
  and maximize this w.r.t. $w$ under the constraint $\|w\| = 1$


Maximum Class Separation

§ Find $w \propto (m_2 - m_1)$
§ The projected classes may still have considerable overlap (because of strongly non-diagonal covariances of the class distributions)
§ LDA: large separation of the projected class means and small within-class variation (small overlap of the classes)
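A brief illustrative sketch of this mean-difference direction, assuming a two-class data matrix X with labels 0/1; the helper name and the toy numbers are made up.

```python
import numpy as np

def mean_difference_direction(X, labels):
    """Unit vector w proportional to m2 - m1."""
    m1 = X[labels == 0].mean(axis=0)
    m2 = X[labels == 1].mean(axis=0)
    w = m2 - m1
    return w / np.linalg.norm(w)

# Toy two-class data in R^2
X = np.array([[0.0, 0.0], [1.0, 0.5], [4.0, 3.0], [5.0, 3.5]])
labels = np.array([0, 0, 1, 1])
print(mean_difference_direction(X, labels))
```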


Key Idea of LDA

§ Separate samples of distinct groups by projecting them onto a space that
  § maximizes their between-class separability, while
  § minimizing their within-class variability


Fisher Criterion
§ For each class $\mathcal{C}_k$ the within-class scatter (unnormalized variance) is given as
  $$s_k^2 = \sum_{n \in \mathcal{C}_k} (y_n - m_k)^2\,, \qquad y_n = w^\top x_n\,, \quad m_k = w^\top m_k$$
§ Maximize the Fisher criterion:
  $$J(w) = \frac{\text{between-class scatter}}{\text{within-class scatter}} = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{w^\top S_B w}{w^\top S_W w}$$
  with
  $$S_W = \sum_{k} \sum_{n \in \mathcal{C}_k} (x_n - m_k)(x_n - m_k)^\top\,, \qquad S_B = (m_2 - m_1)(m_2 - m_1)^\top$$
§ $S_W$ is the total within-class scatter and proportional to the sample covariance matrix
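A NumPy sketch, under the two-class setting above, that builds $S_W$ and $S_B$ and evaluates $J(w)$ for a candidate direction; the data below are randomly generated placeholders.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Two-class within-class scatter S_W and between-class scatter S_B."""
    m1 = X[labels == 0].mean(axis=0)
    m2 = X[labels == 1].mean(axis=0)
    S_W = np.zeros((X.shape[1], X.shape[1]))
    for k, m_k in [(0, m1), (1, m2)]:
        D = X[labels == k] - m_k
        S_W += D.T @ D                     # sum_n (x_n - m_k)(x_n - m_k)^T
    S_B = np.outer(m2 - m1, m2 - m1)
    return S_W, S_B

def fisher_criterion(w, S_W, S_B):
    """J(w) = (w^T S_B w) / (w^T S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# Placeholder data: two elongated Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], [1.0, 0.3], size=(50, 2)),
               rng.normal([2.0, 1.0], [1.0, 0.3], size=(50, 2))])
labels = np.repeat([0, 1], 50)
S_W, S_B = scatter_matrices(X, labels)
print(fisher_criterion(np.array([1.0, 0.0]), S_W, S_B))
```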


Generalization to k Classes

For $k$ classes, we define the between-class scatter matrix as
$$S_B = \sum_{k} N_k (m_k - \mu)(m_k - \mu)^\top\,, \qquad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i\,,$$
where $\mu$ is the global mean of the data set.


Finding the Projection

Objective: Find $w^\ast$ that maximizes
$$J(w) = \frac{w^\top S_B w}{w^\top S_W w}$$

We find $w$ by setting $dJ/dw = 0$:
$$dJ/dw = 0 \;\Leftrightarrow\; (w^\top S_W w)\, S_B w - (w^\top S_B w)\, S_W w = 0$$
$$\Leftrightarrow\; S_B w - J\, S_W w = 0$$
$$\Leftrightarrow\; S_W^{-1} S_B w - J\, w = 0$$

Eigenvalue problem: $S_W^{-1} S_B w = J\, w$

The projection vector $w$ is an eigenvector of $S_W^{-1} S_B$. Choose the eigenvector that corresponds to the maximum eigenvalue (similar to PCA) to maximize class separability.
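A short numerical sketch of this eigenvalue problem, assuming $S_W$ and $S_B$ are computed as on the previous slides and $S_W$ is invertible; np.linalg.eig may return complex output, so only the real part is kept.

```python
import numpy as np

def lda_direction(S_W, S_B):
    """Eigenvector of S_W^{-1} S_B with the largest eigenvalue (unit length)."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    top = np.argmax(eigvals.real)          # largest eigenvalue = largest J(w)
    w = eigvecs[:, top].real
    return w / np.linalg.norm(w)
```

For the two-class case this direction is proportional to $S_W^{-1}(m_2 - m_1)$ (Bishop, 2006, Chapter 4.1).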


Algorithm

§ $X \in \mathbb{R}^{n \times D}$: the $i$th row represents the $i$th sample
§ $Y \in \mathbb{R}^{n \times k}$: coordinate matrix of the $n$ data points w.r.t. the eigenbasis $W$ spanning the $k$-dimensional subspace

1. Mean normalization
2. Compute the mean vectors $m_i \in \mathbb{R}^D$ for all $k$ classes
3. Compute the scatter matrices $S_W$, $S_B$
4. Compute the eigenvectors and eigenvalues of $S_W^{-1} S_B$
5. Select the $k$ eigenvectors $w_i$ with the largest eigenvalues to form a $D \times k$ matrix $W = [w_1, \dots, w_k]$
6. Project the samples onto the new subspace using $W$ and compute the new coordinates as $Y = X W$ (see the sketch below)
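A compact sketch of steps 1–6, assuming a data matrix $X \in \mathbb{R}^{n \times D}$, integer class labels, a chosen number $k$ of discriminants, and the multi-class $S_B$ from the earlier slide; the function name is an illustrative choice.

```python
import numpy as np

def lda_fit_transform(X, labels, k):
    """Steps 1-6: projection matrix W (D x k) and new coordinates Y = X W."""
    # 1. Mean normalization
    Xc = X - X.mean(axis=0)
    D = Xc.shape[1]

    # 2./3. Class means and scatter matrices
    # (after centering, the global mean of Xc is 0, so S_B = sum_k N_k m_k m_k^T)
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for c in np.unique(labels):
        X_k = Xc[labels == c]
        m_k = X_k.mean(axis=0)
        S_W += (X_k - m_k).T @ (X_k - m_k)
        S_B += len(X_k) * np.outer(m_k, m_k)

    # 4. Eigenvectors and eigenvalues of S_W^{-1} S_B
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]

    # 5. Keep the k leading eigenvectors, 6. project: Y = X W
    W = eigvecs[:, order[:k]].real
    return W, Xc @ W
```

For the Iris example (three classes), $S_B$ has rank at most 2, so at most two eigenvalues are non-zero and $k = 2$ is the natural choice.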


PCA vs LDA
[Figure: Iris data projected onto the first 2 principal axes (PCA) and onto the first 2 linear discriminants (LDA)]

§ Similar to PCA, we can use LDA for dimensionality reduction by solving an eigenvalue problem
§ LDA: the magnitude of the eigenvalues describes the importance of the corresponding eigenspace with respect to classification performance
§ PCA: the magnitude of the eigenvalues describes the importance of the corresponding eigenspace with respect to minimizing the reconstruction error
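A brief sketch of computing both two-dimensional projections for the Iris data, assuming scikit-learn's PCA and LinearDiscriminantAnalysis estimators (plotting omitted).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

Z_pca = PCA(n_components=2).fit_transform(X)                             # ignores the labels y
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # uses the labels y

print(Z_pca.shape, Z_lda.shape)    # (150, 2) (150, 2)
```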


Assumptions in LDA

§ The true covariance matrices of each class are equal
  § Without this assumption: quadratic discriminant analysis (e.g., Hastie et al., 2009)
§ The performance of standard LDA can be seriously degraded if there is only a limited number of training observations $N$ compared to the dimension $D$ of the feature space
  § Remedy: shrinkage (Copas, 1983)
§ LDA explicitly attempts to model the difference between the classes of the data; PCA, on the other hand, does not take any difference between classes into account


Limitations of LDA

§ LDA’s most discriminant features are the means of the data distributions
§ LDA will fail when the discriminatory information is not in the mean but in the variance of the data
§ If the data distributions are very non-Gaussian, the LDA projections will not preserve the complex structure of the data that may be required for classification
  § Nonlinear LDA (e.g., Mika et al., 1999; Baudat & Anouar, 2000)

References

[1] G. Baudat and F. Anouar. Generalized Discriminant Analysis Using a Kernel Approach. Neural Computation, 12(10):2385–2404, 2000.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.
[3] J. B. Copas. Regression, Prediction and Shrinkage. Journal of the Royal Statistical Society, Series B, 45(3):311–354, 1983.
[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer-Verlag, New York, NY, USA, 2nd edition, 2009.
[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher Discriminant Analysis with Kernels. Neural Networks for Signal Processing IX, pages 41–48, 1999.