
Linear Discriminant Analysis

Hosoo Lee
What is feature reduction?

• Feature reduction refers to the mapping of the original high-dimensional data into a
lower-dimensional space
• The criterion for feature reduction differs with the problem setting:
  -- Unsupervised setting: minimize the information loss
  -- Supervised setting: maximize the class discrimination
• Given a set of data points 𝑥1 , 𝑥2 , … , 𝑥𝑛 , compute the linear transformation (projection)

Original data: $x \in \mathbb{R}^{d}$
Linear transformation: $G^{T} \in \mathbb{R}^{p \times d}$
Reduced data: $y = G^{T} x \in \mathbb{R}^{p}$
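As a minimal illustration of this mapping (not from the slides), the NumPy sketch below projects synthetic data; the matrix G here is a random placeholder, standing in for whatever projection a particular method (PCA, LDA, ...) would choose.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 10, 2, 100             # original dimension, reduced dimension, number of samples
X = rng.normal(size=(n, d))      # data points x_1, ..., x_n in R^d stored as rows
G = rng.normal(size=(d, p))      # projection matrix, so G^T is p x d

Y = X @ G                        # reduced data: each row is y = G^T x in R^p
print(X.shape, "->", Y.shape)    # (100, 10) -> (100, 2)
```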
Linear discriminant analysis, two classes
Objective
LDA seeks to reduce dimensionality while preserving as much of the
class discriminatory information as possible
Assume we have a set of 𝐷-dimensional samples {𝑥1 , 𝑥2 , … , 𝑥𝑁 }, 𝑁1 of
which belong to class 𝐶1 and 𝑁2 to class 𝐶2 .

We seek to obtain a scalar 𝑦 by projecting the samples 𝑥 onto a line


$$y = w^T x$$
Of all the possible lines we would like to select the one that maximizes
the separability of the scalars
Linear discriminant analysis, two classes

What is a good projection vector 𝑤?

We need to define a measure of separation


Linear discriminant analysis, two classes

The mean vector of each class in 𝑥-space and 𝑦-space is


$$\mu_i = \frac{1}{N_i}\sum_{x \in C_i} x \qquad\qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in C_i} y = \frac{1}{N_i}\sum_{x \in C_i} w^T x = w^T \mu_i$$

We could then choose the distance between the projected means as


our objective function

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$

[Figure: the class means 𝜇1 and 𝜇2 and their projections onto two candidate directions]
Linear discriminant analysis, two classes

However, the distance between the projected means is not a good
measure since it does not account for the standard deviation
within the classes
Linear discriminant analysis, two classes
A good projection satisfies the following conditions:

The projected means are as far apart as possible

Examples from the same class are projected very close to each other
Fisher’s solution
Fisher suggested maximizing the difference between the means normalized by a
measure of the within-class scatter

For each class we define the scatter, an equivalent of the variance, as

$$\tilde{s}_i^2 = \sum_{y \in C_i} (y - \tilde{\mu}_i)^2$$

$(\tilde{s}_1^2 + \tilde{s}_2^2)$ is called the within-class scatter of the projected examples

The Fisher linear discriminant is defined as the linear function 𝑤 𝑇 𝑥


that maximizes the criterion function

$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
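As a quick illustration (a sketch, not from the slides; the helper name is our own), the code below evaluates this criterion for a candidate direction w, using the 2D dataset from the worked example later in these slides. The second direction is Fisher's and gives a larger J.

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """J(w) = (mu1~ - mu2~)^2 / (s1~^2 + s2~^2) for class samples in the rows of X1, X2."""
    y1, y2 = X1 @ w, X2 @ w                                  # projected samples
    m1, m2 = y1.mean(), y2.mean()                            # projected means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()  # projected scatters
    return (m1 - m2) ** 2 / (s1 + s2)

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

print(fisher_criterion(np.array([1.0, 0.0]), X1, X2))        # project onto the x-axis
print(fisher_criterion(np.array([0.9088, 0.4173]), X1, X2))  # Fisher's direction: larger J
```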
Within-class scatter
To find the optimum projection 𝑤∗,
we need to express 𝑱(𝒘) as an explicit function of 𝒘
We therefore define an equivalent measure of the scatter in the multivariate feature space 𝑥,
the scatter matrices

$$S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$

where 𝑆𝑊 is called the within-class scatter matrix

The scatter of the projection 𝑦 can then be expressed as a function of


the scatter matrix in feature space 𝑥
$$\tilde{s}_i^2 = \sum_{y \in C_i}(y - \tilde{\mu}_i)^2 = \sum_{x \in C_i}(w^T x - w^T \mu_i)^2 = \sum_{x \in C_i} w^T (x - \mu_i)(x - \mu_i)^T\, w = w^T S_i w$$

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
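A quick numerical check of this identity (a sketch reusing the example data from these slides; the helper name is our own):

```python
import numpy as np

def scatter_matrix(X):
    """S = sum over rows x of (x - mu)(x - mu)^T."""
    D = X - X.mean(axis=0)
    return D.T @ D

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
w = np.array([0.9088, 0.4173])                     # any candidate direction

S_W = scatter_matrix(X1) + scatter_matrix(X2)      # within-class scatter matrix

# projected within-class scatter s1~^2 + s2~^2
proj_scatter = sum(((X @ w - (X @ w).mean()) ** 2).sum() for X in (X1, X2))
print(np.isclose(w @ S_W @ w, proj_scatter))       # True
```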
Between-class scatter
The difference between the projected means can be expressed in terms of the
means in the original feature space

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\, w = w^T S_B w$$

The matrix $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter.

Note that, since 𝑆𝐵 is the outer product of two vectors, its rank is at most one.

The Fisher criterion function in terms of $S_W$ and $S_B$ is
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
Fisher’s linear discriminant
We need to solve
$$\max_{w \neq \mathbf{0}} \frac{w^T S_B w}{w^T S_W w}$$
which is equivalent to a maximization problem subject to an equality constraint:
$$\max_{w}\ w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1$$

Method of Lagrange Multipliers


Lagrange Multipliers
Maximize $f(x, y)$
subject to $g(x, y) = c$

Need both 𝑓 and 𝑔 to have continuous partial derivatives.


We introduce a new variable 𝜆, called a Lagrange multiplier,
and the Lagrangian 𝐿 defined by
$$L(x, y, \lambda) = f(x, y) - \lambda\,[g(x, y) - c]$$

Note that if $f(x_0, y_0)$ is a maximum of $f(x, y)$, then
there exists $\lambda_0$ such that $(x_0, y_0, \lambda_0)$ is a stationary point of $L$,
i.e. a point where the partial derivatives of $L$ are zero.

For our problem,
$$\max_{w}\ w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1 \qquad\Longrightarrow\qquad L = w^T S_B w - \lambda\,[w^T S_W w - 1]$$
Lagrange Multipliers

$$L = w^T S_B w - \lambda\,[w^T S_W w - 1]$$
$$\nabla_w L = 2\,(S_B - \lambda S_W)\, w$$

If $A$ is a symmetric matrix, then $\nabla_v[v^T A v] = 2Av$.

Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $v = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$.
Then $f(v) = v^T A v = a v_1^2 + (b + c)\, v_1 v_2 + d v_2^2$ and

$$\nabla f(v) = \begin{pmatrix} \partial f / \partial v_1 \\ \partial f / \partial v_2 \end{pmatrix} = \begin{pmatrix} 2a v_1 + (b + c)\, v_2 \\ (b + c)\, v_1 + 2d v_2 \end{pmatrix} = \begin{pmatrix} 2a & b + c \\ b + c & 2d \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = (A + A^T)\, v$$

If $A$ is a symmetric matrix, $\nabla_v[v^T A v] = 2Av$.
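A quick numerical sanity check of this gradient identity (a sketch with arbitrary matrix values, using central differences):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])             # deliberately non-symmetric
v = np.array([0.5, -1.0])
eps = 1e-6

def f(u):
    return u @ A @ u                               # f(v) = v^T A v

grad_numeric = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(grad_numeric, (A + A.T) @ v))    # True: gradient is (A + A^T) v
print(np.allclose(grad_numeric, 2 * A @ v))        # False here, since A is not symmetric
```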

Both $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ and $S_W = S_1 + S_2$, where
$S_i = \sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T$, are symmetric matrices.
Lagrange Multipliers

$$L = w^T S_B w - \lambda\,[w^T S_W w - 1]$$
$$\nabla_w L = 2\,(S_B - \lambda S_W)\, w = 0$$
$$\Longrightarrow\quad S_B w = \lambda S_W w$$

The scalar λ in $S_B w = \lambda S_W w$ is a generalized eigenvalue of the pair
$(S_B, S_W)$ associated with the (generalized) eigenvector $w$.
Generalized eigenvalue

The scalar λ is called a generalized eigenvalue of the pair $(A, B)$
associated with the (generalized) eigenvector $v$ if
$$Av = \lambda Bv \quad\Longleftrightarrow\quad (A - \lambda B)\, v = \mathbf{0}$$

The expression 𝐴 − 𝜆𝐵 with indeterminate 𝜆 is called a matrix pencil

The terms “matrix pair” and “matrix pencil” are used more or less interchangeably.
For example, if 𝑣 is any eigenvector for the pair 𝐴, 𝐵 , then we also say that 𝑣 is
any eigenvector of the matrix pencil 𝐴 − 𝜆𝐵.

If $B$ is nonsingular, then the generalized eigenvalue problem $Av = \lambda Bv$
is equivalent to the standard eigenvalue problem $B^{-1} A v = \lambda v$.
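For numerical work, SciPy can solve the generalized problem directly; the sketch below (arbitrary example matrices, not from the slides) checks this against the equivalent standard problem $B^{-1}Av = \lambda v$.

```python
import numpy as np
from scipy.linalg import eig

A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])             # nonsingular

vals, vecs = eig(A, B)                             # generalized eigenpairs of (A, B)
vals_std, _ = np.linalg.eig(np.linalg.inv(B) @ A)  # standard problem B^{-1} A v = lambda v

print(np.allclose(np.sort(vals.real), np.sort(vals_std.real)))   # True
```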
Generalized eigenvalue
𝑆𝐵 𝑤 = 𝜆𝑆𝑊 𝑤

Which generalized eigenvalue 𝝀 should we choose?

Recall that we want to solve $\max_w\ w^T S_B w$ subject to $w^T S_W w = 1$.

Since the optimal $w$ satisfies $S_B w = \lambda S_W w$, we have

$$w^T S_B w = \lambda\,(w^T S_W w) = \lambda$$

Hence we need to pick the largest generalized eigenvalue and its


corresponding generalized eigenvector.
Generalized eigenvalue
𝑆𝐵 𝑤 = 𝜆𝑆𝑊 𝑤

If $S_W$ is nonsingular, we can convert this to a standard eigenvalue problem:

$$S_W^{-1} S_B\, w = \lambda w$$

For any vector $v$, $S_B v$ points in the same direction as $\mu_1 - \mu_2$:

$$S_B v = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T v = \alpha\,(\mu_1 - \mu_2), \qquad \alpha = (\mu_1 - \mu_2)^T v \in \mathbb{R}$$

Thus we can solve the eigenvalue problem immediately:

$$v = S_W^{-1}(\mu_1 - \mu_2)$$

since

$$S_W^{-1} S_B\, v = S_W^{-1} S_B\, S_W^{-1}(\mu_1 - \mu_2) = S_W^{-1}\,[\alpha\,(\mu_1 - \mu_2)] = \alpha\, S_W^{-1}(\mu_1 - \mu_2) = \alpha\, v$$
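A minimal NumPy sketch of this closed-form two-class solution (the function name is our own); on the worked example used later in these slides it recovers the direction ±(0.9088, 0.4173).

```python
import numpy as np

def lda_direction(X1, X2):
    """w proportional to S_W^{-1}(mu1 - mu2) for two classes stored as rows of X1 and X2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(S_W, mu1 - mu2)            # avoids forming S_W^{-1} explicitly
    return w / np.linalg.norm(w)

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
print(lda_direction(X1, X2))                       # approximately +/-(0.9088, 0.4173)
```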
LDA in 5 steps
(Step 1) Computing the 𝑑-dimensional mean vectors

(Step 2) Computing the Scatter Matrices


Within-class scatter matrix:
$$S_i = \sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$
Between-class scatter matrix:
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$

(Step 3) Solving the generalized eigenvalue problem for the matrix $S_W^{-1} S_B$

(Step 4) Selecting linear discriminants for the new feature subspace

(Step 5) Transforming the samples onto the new subspace


Example
Compute the LDA projection for the following 2D dataset
$C_1 = \{(4,2), (2,4), (2,3), (3,6), (4,4)\}$
$C_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
(Solution)
(Step 1) 𝜇1 = (3, 3.8), 𝜇2 = (8.4, 7.6)

(Step 2) Scatter matrices, here normalized by $N_i - 1$ (i.e. computed as sample covariances;
this rescaling changes the eigenvalue below but not the LDA direction):

$$S_i = \frac{1}{N_i - 1}\sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T$$

$$S_1 = \begin{pmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{pmatrix}, \qquad S_2 = \begin{pmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{pmatrix}$$

The within-class scatter is
$$S_W = S_1 + S_2 = \begin{pmatrix} 3.3 & -0.30 \\ -0.30 & 5.5 \end{pmatrix}$$
and the between-class scatter is
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{pmatrix}$$
Example
(Step 3) Solving the generalized eigenvalue problem $S_W^{-1} S_B\, w = \lambda w$:

$$\left| S_W^{-1} S_B - \lambda I \right| = \begin{vmatrix} 9.2213 - \lambda & 6.489 \\ 4.2339 & 2.9794 - \lambda \end{vmatrix} = 0 \quad\Longrightarrow\quad \lambda_1 = 12.2007, \;\; \lambda_2 = 0$$

$$\begin{pmatrix} 9.2213 & 6.489 \\ 4.2339 & 2.9794 \end{pmatrix} w_1 = 12.2007\, w_1 \;\Longrightarrow\; w_1 = \begin{pmatrix} 0.9088 \\ 0.4173 \end{pmatrix} \qquad
\begin{pmatrix} 9.2213 & 6.489 \\ 4.2339 & 2.9794 \end{pmatrix} w_2 = 0 \cdot w_2 \;\Longrightarrow\; w_2 = \begin{pmatrix} -0.5755 \\ 0.8178 \end{pmatrix}$$
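The numbers above can be reproduced with a few lines of NumPy (a sketch; note again that the slide's $S_1$ and $S_2$ are sample covariances, normalized by $N_i - 1$):

```python
import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)        # (3, 3.8), (8.4, 7.6)
S1 = np.cov(X1, rowvar=False)                      # [[1, -0.25], [-0.25, 2.2]]
S2 = np.cov(X2, rowvar=False)                      # [[2.3, -0.05], [-0.05, 3.3]]
S_W = S1 + S2                                      # [[3.3, -0.3], [-0.3, 5.5]]
S_B = np.outer(mu1 - mu2, mu1 - mu2)               # [[29.16, 20.52], [20.52, 14.44]]

vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))   # S_W^{-1} S_B w = lambda w
top = np.argmax(vals.real)
print(vals.real[top])                              # approximately 12.2007
print(vecs.real[:, top])                           # approximately +/-(0.9088, 0.4173)
```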
LDA, C classes
Fisher’s LDA generalizes to C-class problems
Instead of one projection 𝑦,
we will now seek 𝐶 − 1 projections [𝑦1 , 𝑦2 , … , 𝑦𝐶−1 ]
by means of 𝐶 − 1 projection vectors 𝑤𝑖
arranged by columns into a projection matrix
$W = [\,w_1 \mid w_2 \mid \cdots \mid w_{C-1}\,]$:

$$y_i = w_i^T x \qquad\qquad y = W^T x$$
Derivation

The within-class scatter generalizes as
$$S_W = \sum_{i=1}^{C} S_i, \qquad \text{where } S_i = \sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T \text{ and } \mu_i = \frac{1}{N_i}\sum_{x \in C_i} x$$

The between-class scatter becomes
$$S_B = \sum_{i=1}^{C} N_i\,(\mu_i - \mu)(\mu_i - \mu)^T, \qquad \text{where } \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\, \mu_i$$
LDA, C classes

• Similarly, we define the mean vectors and scatter matrices for the projected samples as

$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in C_i} y \qquad\qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y$$

$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in C_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T \qquad\qquad \tilde{S}_B = \sum_{i=1}^{C} N_i\,(\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
LDA, C classes
• From our derivation for the two-class problem, we can write

$$\tilde{S}_W = W^T S_W W \qquad\qquad \tilde{S}_B = W^T S_B W$$

Recall that we are looking for a projection that
maximizes the ratio of between-class to within-class scatter.

Since the projection is no longer a scalar (it has 𝐶 − 1 dimensions), we use the
determinant of the scatter matrices to obtain a scalar objective function
$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$
We will seek the projection matrix 𝑊 ∗ that maximizes this ratio
LDA, C classes
• It can be shown that the optimal projection matrix 𝑊 ∗ is the one whose columns are
the eigenvectors corresponding to the largest eigenvalues of the following generalized
eigenvalue problem


$$W^* = [\,w_1^* \mid w_2^* \mid \cdots \mid w_{C-1}^*\,] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \quad\Longrightarrow\quad (S_B - \lambda_i S_W)\, w_i^* = 0$$

Note that
$S_B$ is the sum of $C$ matrices of rank $\le 1$, and the mean vectors are constrained by
$$\sum_{i=1}^{C} N_i\, \mu_i = N \mu$$
-- therefore, $S_B$ will be of rank $(C - 1)$ or less
-- this means that only $(C - 1)$ of the eigenvalues $\lambda_i$ will be non-zero

The projections with maximum class separability information are the eigenvectors
corresponding to the largest eigenvalues of $S_W^{-1} S_B$.
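A sketch of multi-class LDA following this recipe (the function and variable names are our own, not from the slides): build S_W and S_B as defined above and keep the top C − 1 eigenvectors of S_W^{-1} S_B as the columns of W.

```python
import numpy as np

def lda_fit(X, y, n_components=None):
    """Return the projection matrix W whose columns are the top eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)                                      # overall mean
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                   # within-class scatter
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)      # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1]                      # largest eigenvalues first
    k = n_components if n_components is not None else len(classes) - 1
    return vecs.real[:, order[:k]]                           # d x k projection matrix W

# usage: W = lda_fit(X, y); X_reduced = X @ W
```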
Example, Iris
The iris dataset contains measurements for 150 iris flowers from three different
species.

The three classes in the Iris dataset:
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)

The four features of the Iris dataset:

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
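A short sketch of this example using scikit-learn's LDA implementation (assuming scikit-learn is installed), reducing the four features to C − 1 = 2 discriminant axes:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)                    # project onto the 2 linear discriminants

print(X_lda.shape)                                 # (150, 2)
print(lda.explained_variance_ratio_)               # share of between-class variance per axis
```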
Limitations of LDA
LDA produces at most (𝐶 − 1) feature projections

If the classification error estimates establish that more features are needed, some
other method must be employed to provide those additional features

LDA is a parametric method (it assumes unimodal Gaussian likelihoods)


- If the distributions are significantly non-Gaussian, the LDA projections may not
preserve complex structure in the data needed for classification
Limitations of LDA

LDA will also fail if the discriminatory information is not in the mean but in the
variance of the data
PCA vs. LDA

Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are
linear transformation techniques that are commonly used for dimensionality reduction.

PCA can be described as an “unsupervised” algorithm, since it “ignores” class labels


and its goal is to find the directions (the so-called principal components) that
maximize the variance in a dataset.

LDA is “supervised” and computes the directions (“linear discriminants”) that
represent the axes that maximize the separation between multiple classes.
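A brief side-by-side sketch of the two methods on the Iris data (assuming scikit-learn), emphasizing that PCA ignores the labels y while LDA uses them:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                              # unsupervised: labels unused
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)    # supervised: labels required

print(X_pca.shape, X_lda.shape)                                           # (150, 2) (150, 2)
```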
