Linear Discriminant Analysis
Hosoo Lee
What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data into a lower-dimensional space.
• The criterion for feature reduction can differ depending on the problem setting:
  – Unsupervised setting: minimize the information loss
  – Supervised setting: maximize the class discrimination
• Given a set of data points $\{x_1, x_2, \dots, x_n\}$ in $d$ variables, compute a linear transformation (projection) $G^T \in \mathbb{R}^{p \times d}$:
Original data: $x \in \mathbb{R}^d$
Reduced data: $y = G^T x \in \mathbb{R}^p$ (with $p < d$)
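As a minimal sketch of this mapping (in NumPy, with a random matrix standing in for the projection $G$ that LDA will later choose):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 4, 2, 150              # original dimension, reduced dimension, samples
X = rng.normal(size=(n, d))      # rows are data points x in R^d
G = rng.normal(size=(d, p))      # placeholder for the projection LDA will choose

Y = X @ G                        # applies y = G^T x to every row at once
print(Y.shape)                   # (150, 2): each point now lives in R^p
```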
Linear discriminant analysis, two classes
Objective
LDA seeks to reduce dimensionality while preserving as much of the
class discriminatory information as possible
Assume we have a set of $D$-dimensional samples $\{x_1, x_2, \dots, x_N\}$, $N_1$ of which belong to class $C_1$, and $N_2$ to class $C_2$. We obtain a scalar $y$ by projecting the samples onto a line, $y = w^T x$, so the mean of each projected class is $\tilde{\mu}_i = w^T \mu_i$. One candidate objective is the distance between the projected means:
$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T (\mu_1 - \mu_2)|$$
[Figure: two classes with means $\mu_1$ and $\mu_2$, projected onto two candidate directions.]
Linear discriminant analysis, two classes
We look for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.
Fisher's solution
Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter. The scatter of the projected samples of class $C_i$ is
$$\tilde{s}_i^2 = \sum_{y \in C_i} (y - \tilde{\mu}_i)^2$$
and the Fisher criterion becomes
$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
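The criterion translates directly into code. A minimal sketch (the helper name `fisher_criterion` is ours; `X1` and `X2` are assumed to be arrays whose rows are the samples of each class):

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """Fisher criterion J(w) for two classes; rows of X1, X2 are samples."""
    y1, y2 = X1 @ w, X2 @ w                 # project each class onto the line w
    m1, m2 = y1.mean(), y2.mean()           # projected class means
    s1 = ((y1 - m1) ** 2).sum()             # projected scatter, class 1
    s2 = ((y2 - m2) ** 2).sum()             # projected scatter, class 2
    return (m1 - m2) ** 2 / (s1 + s2)
```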
Within-class scatter
To find the optimum projection $w^*$, we need to express $J(w)$ as an explicit function of $w$. We define a measure of the scatter in the multivariate feature space $x$, the scatter matrices:
$$S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$
where $S_W$ is called the within-class scatter matrix. Since
$$\tilde{s}_i^2 = \sum_{x \in C_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in C_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w,$$
the scatter of the projection can be expressed in terms of $S_W$:
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
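A quick numerical sanity check of this identity on synthetic Gaussian data (a sketch; `scatter_matrix` is a small helper defined here, not a library call):

```python
import numpy as np

def scatter_matrix(X, mu):
    """S_i = sum of (x - mu)(x - mu)^T over the rows of X."""
    D = X - mu
    return D.T @ D

rng = np.random.default_rng(1)
X1, X2 = rng.normal(0, 1, (30, 2)), rng.normal(3, 2, (40, 2))
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = scatter_matrix(X1, mu1) + scatter_matrix(X2, mu2)

w = rng.normal(size=2)
lhs = ((X1 @ w - mu1 @ w) ** 2).sum() + ((X2 @ w - mu2 @ w) ** 2).sum()
print(np.isclose(lhs, w @ S_W @ w))   # True: s1^2 + s2^2 == w^T S_W w
```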
Between-class scatter
The difference between the projected means can be expressed in terms of the means in the original feature space:
$$|\tilde{\mu}_1 - \tilde{\mu}_2|^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$$
where $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter matrix. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one.
We can now write the Fisher criterion explicitly as a function of $w$:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
Maximizing this ratio is equivalent to the constrained problem
$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1$$
Lagrange Multipliers
$$L = w^T S_B w - \lambda [w^T S_W w - 1]$$
$$\nabla_w L = 2 (S_B - \lambda S_W) w$$
To evaluate this gradient, recall how to differentiate a quadratic form.
Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $v = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$. Then
$$f(v) = v^T A v = a v_1^2 + (b + c) v_1 v_2 + d v_2^2$$
$$\nabla f(v) = \begin{pmatrix} \partial f / \partial v_1 \\ \partial f / \partial v_2 \end{pmatrix} = \begin{pmatrix} 2a v_1 + (b + c) v_2 \\ (b + c) v_1 + 2d v_2 \end{pmatrix} = \begin{pmatrix} 2a & b + c \\ b + c & 2d \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = (A + A^T) v$$
If $A$ is a symmetric matrix, $\nabla_v [v^T A v] = 2 A v$.
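This identity is easy to verify numerically with central finite differences (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
A = A + A.T                        # symmetrize so the 2*A*v identity applies
v = rng.normal(size=3)

f = lambda u: u @ A @ u
eps, grad_fd = 1e-6, np.zeros(3)
for i in range(3):                 # central finite differences, one coordinate at a time
    e = np.zeros(3)
    e[i] = eps
    grad_fd[i] = (f(v + e) - f(v - e)) / (2 * eps)

print(np.allclose(grad_fd, 2 * A @ v))   # True
```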
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \quad \text{and} \quad S_W = S_1 + S_2, \quad S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$$
are symmetric matrices, so the gradient identity applies to both terms of the Lagrangian.
Lagrange Multipliers
$$L = w^T S_B w - \lambda [w^T S_W w - 1]$$
$$\nabla_w L = 2 (S_B - \lambda S_W) w = 0$$
$$S_B w = \lambda S_W w$$
This is a generalized eigenvalue problem:
$$A v = \lambda B v \iff (A - \lambda B) v = \mathbf{0}$$
The terms "matrix pair" and "matrix pencil" are used more or less interchangeably: if $v$ is an eigenvector for the pair $(A, B)$, then we also say that $v$ is an eigenvector of the matrix pencil $A - \lambda B$.
Multiplying by $w^T$ and using the constraint shows that the eigenvalue equals the objective value:
$$w^T S_B w = \lambda (w^T S_W w) = \lambda$$
so the optimum corresponds to the largest eigenvalue. Because $S_B$ has rank one, the solution can be found directly: for any vector $v$,
$$S_B v = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T v = \alpha (\mu_1 - \mu_2), \quad \text{where } \alpha = (\mu_1 - \mu_2)^T v \text{ is a scalar.}$$
Taking $v = S_W^{-1} (\mu_1 - \mu_2)$ gives
$$S_W^{-1} S_B v = S_W^{-1} [\alpha (\mu_1 - \mu_2)] = \alpha \, S_W^{-1} (\mu_1 - \mu_2) = \alpha v,$$
so $v$ is an eigenvector of $S_W^{-1} S_B$ and the optimal projection is
$$w^* = S_W^{-1} (\mu_1 - \mu_2)$$
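The closed-form solution takes only a few lines of NumPy (a sketch on synthetic data; we solve the linear system rather than forming $S_W^{-1}$ explicitly):

```python
import numpy as np

def lda_direction(X1, X2):
    """Two-class Fisher LDA: w* = S_W^{-1} (mu1 - mu2). Rows are samples."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    return np.linalg.solve(S_W, mu1 - mu2)   # solve S_W w = (mu1 - mu2)

rng = np.random.default_rng(3)
X1 = rng.normal([0, 0], 1.0, (50, 2))
X2 = rng.normal([3, 2], 1.0, (50, 2))
w = lda_direction(X1, X2)
print((X1 @ w).mean(), (X2 @ w).mean())      # well-separated projected means
```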
LDA in 5 steps
(Step 1) Compute the $d$-dimensional mean vectors:
$$\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x$$
(Step 2) Compute the scatter matrices.
Within-class scatter matrix:
$$S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$
Between-class scatter matrix:
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$
For the example data, the within-class scatter is $S_W = \begin{pmatrix} 3.3 & -0.30 \\ -0.30 & 5.5 \end{pmatrix}$ and the between-class scatter is $S_B = \begin{pmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{pmatrix}$.
Example
(Step 3) Solve the generalized eigenvalue problem $S_W^{-1} S_B w = \lambda w$:
$$|S_W^{-1} S_B - \lambda I| = \begin{vmatrix} 9.2213 - \lambda & 6.489 \\ 4.2339 & 2.9794 - \lambda \end{vmatrix} = 0$$
$$\lambda_1 = 12.2007, \qquad \lambda_2 = 0$$
$$w_1 = \begin{pmatrix} 0.9088 \\ 0.4173 \end{pmatrix}, \qquad w_2 = \begin{pmatrix} -0.5755 \\ 0.8178 \end{pmatrix}$$
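These Step-3 numbers can be reproduced directly (a sketch; the eigenvectors match the values above up to sign and scale):

```python
import numpy as np

S_W = np.array([[3.3, -0.30], [-0.30, 5.5]])
S_B = np.array([[29.16, 20.52], [20.52, 14.44]])

M = np.linalg.solve(S_W, S_B)       # S_W^{-1} S_B without forming the inverse
vals, vecs = np.linalg.eig(M)
order = np.argsort(vals)[::-1]      # sort eigenvalues in decreasing order
print(vals[order])                  # approx [12.2007, 0]
print(vecs[:, order])               # columns approx w1, w2 (up to sign/scale)
```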
LDA, C classes
Fisher's LDA generalizes to $C$-class problems. Instead of one projection $y$, we now seek $C - 1$ projections $[y_1, y_2, \dots, y_{C-1}]$ by means of $C - 1$ projection vectors $w_i$, arranged by columns into a projection matrix $W = [w_1 | w_2 | \dots | w_{C-1}]$:
$$y_i = w_i^T x \quad\Longrightarrow\quad y = W^T x$$
Derivation
The between-class scatter becomes
$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$$
where
$$\mu = \frac{1}{N} \sum_{\forall x} x = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$$
LDA, C classes
The mean vectors and scatter matrices of the projected samples are
$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in C_i} y, \qquad \tilde{\mu} = \frac{1}{N} \sum_{\forall y} y$$
$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in C_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T$$
$$\tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
LDA, C classes
• From our derivation for the two-class problem, we can write
$$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W$$
Since the projection is no longer a scalar (it has $C - 1$ dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:
$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$
We will seek the projection matrix $W^*$ that maximizes this ratio.
LDA, C classes
• It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
$$W^* = [w_1^* | w_2^* | \dots | w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \quad\Longrightarrow\quad (S_B - \lambda_i S_W) w_i^* = 0$$
Note that $S_B$ is the sum of $C$ matrices of rank $\leq 1$, and the mean vectors are constrained by
$$\frac{1}{C} \sum_{i=1}^{C} \mu_i = \mu$$
  – Therefore, $S_B$ will be of rank $(C - 1)$ or less.
  – This means that only $(C - 1)$ of the eigenvalues $\lambda_i$ will be non-zero.
The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$.
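Putting the pieces together, a minimal sketch of the multi-class procedure (`lda_projection` is a hypothetical helper, not a library function; rows of `X` are samples):

```python
import numpy as np

def lda_projection(X, labels, n_components):
    """Multi-class LDA: top eigenvectors of S_W^{-1} S_B."""
    mu = X.mean(axis=0)                        # overall mean
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)     # within-class scatter
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)       # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1]        # largest eigenvalues first
    return vecs[:, order[:n_components]].real  # columns form W
```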
Example, Iris
The iris dataset contains measurements for 150 iris flowers from three different species. Four features are measured on each flower:
• sepal length in cm
• sepal width in cm
• petal length in cm
• petal width in cm
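A sketch of running LDA on this dataset with scikit-learn (assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)   # at most C - 1 = 2 components
X_lda = lda.fit(X, y).transform(X)
print(X_lda.shape)                                 # (150, 2)
print(lda.explained_variance_ratio_)               # discriminability per component
```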
Limitations of LDA
LDA produces at most $(C - 1)$ feature projections. If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.
LDA will also fail if the discriminatory information is not in the mean but in the variance of the data.
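A small sketch of this failure mode, with two classes that share a mean and differ only in spread:

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal(0, 0.5, (500, 2))   # class 1: small spread
X2 = rng.normal(0, 3.0, (500, 2))   # class 2: large spread, same mean
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
print(mu1 - mu2)                    # approx zero, so S_B is approx zero
# => every direction w gives J(w) near 0: LDA has nothing to maximize.
```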
PCA vs. LDA
Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques that are commonly used for dimensionality reduction. PCA is "unsupervised" and finds the directions of maximal variance, whereas LDA is "supervised" and computes the directions ("linear discriminants") that represent the axes that maximize the separation between multiple classes.
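A short sketch contrasting the two on the same data (assuming scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)       # unsupervised: ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised
print(X_pca.shape, X_lda.shape)                    # both (150, 2)
```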