
Linear Discriminant Analysis

Hosoo Lee
What is feature reduction?

• Feature reduction refers to the mapping of the original high-dimensional data into a
lower-dimensional space
• The criterion for feature reduction differs with the problem setting:
  -- Unsupervised setting: minimize the information loss
  -- Supervised setting: maximize the class discrimination
• Given a set of data points 𝑥1 , 𝑥2 , … , 𝑥𝑛 , compute the linear transformation (projection)

Original data: $x \in \mathbb{R}^{d}$
Linear transformation: $G^{T} \in \mathbb{R}^{p \times d}$
Reduced data: $y = G^{T} x \in \mathbb{R}^{p}$
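As a minimal illustration of this mapping (not from the slides), the NumPy sketch below projects synthetic data; the matrix G here is a random placeholder, standing in for whatever projection a particular method (PCA, LDA, ...) would choose.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n = 10, 2, 100             # original dimension, reduced dimension, number of samples
X = rng.normal(size=(n, d))      # data points x_1, ..., x_n in R^d stored as rows
G = rng.normal(size=(d, p))      # projection matrix, so G^T is p x d

Y = X @ G                        # reduced data: each row is y = G^T x in R^p
print(X.shape, "->", Y.shape)    # (100, 10) -> (100, 2)
```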
Linear discriminant analysis, two classes
Objective
LDA seeks to reduce dimensionality while preserving as much of the
class discriminatory information as possible
Assume we have a set of 𝐷-dimensional samples {𝑥1 , 𝑥2 , … , 𝑥𝑁 }, 𝑁1 of
which belong to class 𝐶1 and 𝑁2 to class 𝐶2 .

We seek to obtain a scalar 𝑦 by projecting the samples 𝑥 onto a line


$$y = w^T x$$
Of all the possible lines we would like to select the one that maximizes
the separability of the scalars
Linear discriminant analysis, two classes

What is a good projection vector 𝑤?

We need to define a measure of separation


Linear discriminant analysis, two classes

The mean vector of each class in 𝑥-space and 𝑦-space is


$$\mu_i = \frac{1}{N_i}\sum_{x \in C_i} x \qquad\qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in C_i} y = \frac{1}{N_i}\sum_{x \in C_i} w^T x = w^T \mu_i$$

We could then choose the distance between the projected means as


our objective function

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$

[Figure: the class means 𝜇1 and 𝜇2 and their projections onto two candidate directions]
Linear discriminant analysis, two classes

However, the distance between the projected means is not a good
measure since it does not account for the standard deviation
within the classes
Linear discriminant analysis, two classes
A good projection satisfies the following conditions:

The projected means are as far apart as possible

Examples from the same class are projected very close to each other
Fisher’s solution
Fisher suggested maximizing the difference between the means normalized by a
measure of the within-class scatter

For each class we define the scatter, an equivalent of the variance, as

$$\tilde{s}_i^2 = \sum_{y \in C_i} (y - \tilde{\mu}_i)^2$$

$(\tilde{s}_1^2 + \tilde{s}_2^2)$ is called the within-class scatter of the projected examples

The Fisher linear discriminant is defined as the linear function 𝑤 𝑇 𝑥


that maximizes the criterion function

$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
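As a quick illustration (a sketch, not from the slides; the helper name is our own), the code below evaluates this criterion for a candidate direction w, using the 2D dataset from the worked example later in these slides. The second direction is Fisher's and gives a larger J.

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """J(w) = (mu1~ - mu2~)^2 / (s1~^2 + s2~^2) for class samples in the rows of X1, X2."""
    y1, y2 = X1 @ w, X2 @ w                                  # projected samples
    m1, m2 = y1.mean(), y2.mean()                            # projected means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()  # projected scatters
    return (m1 - m2) ** 2 / (s1 + s2)

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

print(fisher_criterion(np.array([1.0, 0.0]), X1, X2))        # project onto the x-axis
print(fisher_criterion(np.array([0.9088, 0.4173]), X1, X2))  # Fisher's direction: larger J
```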
Within-class scatter
To find the optimum projection 𝑤∗,
we need to express 𝑱(𝒘) as an explicit function of 𝒘
We therefore define an equivalent measure of the scatter in the multivariate feature space 𝑥,
the scatter matrices

$$S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$

where 𝑆𝑊 is called the within-class scatter matrix

The scatter of the projection 𝑦 can then be expressed as a function of


the scatter matrix in feature space 𝑥
$$\tilde{s}_i^2 = \sum_{y \in C_i}(y - \tilde{\mu}_i)^2 = \sum_{x \in C_i}(w^T x - w^T \mu_i)^2 = \sum_{x \in C_i} w^T (x - \mu_i)(x - \mu_i)^T\, w = w^T S_i w$$

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
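A quick numerical check of this identity (a sketch reusing the example data from these slides; the helper name is our own):

```python
import numpy as np

def scatter_matrix(X):
    """S = sum over rows x of (x - mu)(x - mu)^T."""
    D = X - X.mean(axis=0)
    return D.T @ D

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
w = np.array([0.9088, 0.4173])                     # any candidate direction

S_W = scatter_matrix(X1) + scatter_matrix(X2)      # within-class scatter matrix

# projected within-class scatter s1~^2 + s2~^2
proj_scatter = sum(((X @ w - (X @ w).mean()) ** 2).sum() for X in (X1, X2))
print(np.isclose(w @ S_W @ w, proj_scatter))       # True
```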
Between-class scatter
The difference between the projected means can be expressed in terms of the
means in the original feature space

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\, w = w^T S_B w$$

The matrix $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter.

Note that, since 𝑆𝐵 is the outer product of two vectors, its rank is at most one.

The Fisher criterion function in terms of $S_W$ and $S_B$ is
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
Fisher’s linear discriminant
We need to solve
$$\max_{w \neq \mathbf{0}} \frac{w^T S_B w}{w^T S_W w}$$
which is equivalent to a maximization problem subject to an equality constraint:
$$\max_{w}\ w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1$$

Method of Lagrange Multipliers


Lagrange Multipliers
Maximize $f(x, y)$
subject to $g(x, y) = c$

Need both 𝑓 and 𝑔 to have continuous partial derivatives.


We introduce a new variable 𝜆, called a Lagrange multiplier,
and the Lagrangian 𝐿 defined by
$$L(x, y, \lambda) = f(x, y) - \lambda\,[g(x, y) - c]$$

Note that if $f(x_0, y_0)$ is a maximum of $f(x, y)$, then
there exists $\lambda_0$ such that $(x_0, y_0, \lambda_0)$ is a stationary point of $L$,
i.e. a point where the partial derivatives of $L$ are zero.

For our problem,
$$\max_{w}\ w^T S_B w \quad \text{subject to} \quad w^T S_W w = 1 \qquad\Longrightarrow\qquad L = w^T S_B w - \lambda\,[w^T S_W w - 1]$$
Lagrange Multipliers

$$L = w^T S_B w - \lambda\,[w^T S_W w - 1]$$
$$\nabla_w L = 2\,(S_B - \lambda S_W)\, w$$

If $A$ is a symmetric matrix, then $\nabla_v[v^T A v] = 2Av$.

Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $v = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$.
Then $f(v) = v^T A v = a v_1^2 + (b + c)\, v_1 v_2 + d v_2^2$ and

$$\nabla f(v) = \begin{pmatrix} \partial f / \partial v_1 \\ \partial f / \partial v_2 \end{pmatrix} = \begin{pmatrix} 2a v_1 + (b + c)\, v_2 \\ (b + c)\, v_1 + 2d v_2 \end{pmatrix} = \begin{pmatrix} 2a & b + c \\ b + c & 2d \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = (A + A^T)\, v$$

If $A$ is a symmetric matrix, $\nabla_v[v^T A v] = 2Av$.
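A quick numerical sanity check of this gradient identity (a sketch with arbitrary matrix values, using central differences):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])             # deliberately non-symmetric
v = np.array([0.5, -1.0])
eps = 1e-6

def f(u):
    return u @ A @ u                               # f(v) = v^T A v

grad_numeric = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(grad_numeric, (A + A.T) @ v))    # True: gradient is (A + A^T) v
print(np.allclose(grad_numeric, 2 * A @ v))        # False here, since A is not symmetric
```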

Both $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ and $S_W = S_1 + S_2$, where
$S_i = \sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T$, are symmetric matrices.
Lagrange Multipliers

$$L = w^T S_B w - \lambda\,[w^T S_W w - 1]$$
$$\nabla_w L = 2\,(S_B - \lambda S_W)\, w = 0$$
$$\Longrightarrow\quad S_B w = \lambda S_W w$$

The scalar λ in $S_B w = \lambda S_W w$ is a generalized eigenvalue of the pair
$(S_B, S_W)$ associated with the (generalized) eigenvector $w$.
Generalized eigenvalue

The scalar λ is called a generalized eigenvalue of the pair $(A, B)$
associated with the (generalized) eigenvector $v$ if
$$Av = \lambda Bv \quad\Longleftrightarrow\quad (A - \lambda B)\, v = \mathbf{0}$$

The expression 𝐴 − 𝜆𝐵 with indeterminate 𝜆 is called a matrix pencil

The terms “matrix pair” and “matrix pencil” are used more or less interchangeably.
For example, if 𝑣 is any eigenvector for the pair 𝐴, 𝐵 , then we also say that 𝑣 is
any eigenvector of the matrix pencil 𝐴 − 𝜆𝐵.

If $B$ is nonsingular, then the generalized eigenvalue problem $Av = \lambda Bv$
is equivalent to the standard eigenvalue problem $B^{-1} A v = \lambda v$.
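For numerical work, SciPy can solve the generalized problem directly; the sketch below (arbitrary example matrices, not from the slides) checks this against the equivalent standard problem $B^{-1}Av = \lambda v$.

```python
import numpy as np
from scipy.linalg import eig

A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])             # nonsingular

vals, vecs = eig(A, B)                             # generalized eigenpairs of (A, B)
vals_std, _ = np.linalg.eig(np.linalg.inv(B) @ A)  # standard problem B^{-1} A v = lambda v

print(np.allclose(np.sort(vals.real), np.sort(vals_std.real)))   # True
```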
Generalized eigenvalue
𝑆𝐵 𝑤 = 𝜆𝑆𝑊 𝑤

Which generalized eigenvalue 𝝀 should we choose?

Recall that we want to solve $\max_w\ w^T S_B w$ subject to $w^T S_W w = 1$.

Since the optimal $w$ satisfies $S_B w = \lambda S_W w$, we have

$$w^T S_B w = \lambda\,(w^T S_W w) = \lambda$$

Hence we need to pick the largest generalized eigenvalue and its


corresponding generalized eigenvector.
Generalized eigenvalue
𝑆𝐵 𝑤 = 𝜆𝑆𝑊 𝑤

If $S_W$ is nonsingular, we can convert this to a standard eigenvalue problem:

$$S_W^{-1} S_B\, w = \lambda w$$

For any vector $v$, $S_B v$ points in the same direction as $\mu_1 - \mu_2$:

$$S_B v = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T v = \alpha\,(\mu_1 - \mu_2), \qquad \alpha = (\mu_1 - \mu_2)^T v \in \mathbb{R}$$

Thus we can solve the eigenvalue problem immediately:

$$v = S_W^{-1}(\mu_1 - \mu_2)$$

since

$$S_W^{-1} S_B\, v = S_W^{-1} S_B\, S_W^{-1}(\mu_1 - \mu_2) = S_W^{-1}\,[\alpha\,(\mu_1 - \mu_2)] = \alpha\, S_W^{-1}(\mu_1 - \mu_2) = \alpha\, v$$
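A minimal NumPy sketch of this closed-form two-class solution (the function name is our own); on the worked example used later in these slides it recovers the direction ±(0.9088, 0.4173).

```python
import numpy as np

def lda_direction(X1, X2):
    """w proportional to S_W^{-1}(mu1 - mu2) for two classes stored as rows of X1 and X2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(S_W, mu1 - mu2)            # avoids forming S_W^{-1} explicitly
    return w / np.linalg.norm(w)

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
print(lda_direction(X1, X2))                       # approximately +/-(0.9088, 0.4173)
```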
LDA in 5 steps
(Step 1) Computing the 𝑑-dimensional mean vectors

(Step 2) Computing the Scatter Matrices


Within-class scatter matrix:
$$S_i = \sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$
Between-class scatter matrix:
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$

(Step 3) Solving the generalized eigenvalue problem for the matrix $S_W^{-1} S_B$

(Step 4) Selecting linear discriminants for the new feature subspace

(Step 5) Transforming the samples onto the new subspace


Example
Compute the LDA projection for the following 2D dataset
$C_1 = \{(4,2), (2,4), (2,3), (3,6), (4,4)\}$
$C_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
(Solution)
(Step 1) 𝜇1 = (3, 3.8), 𝜇2 = (8.4, 7.6)

(Step 2) Scatter matrices, here normalized by $N_i - 1$ (i.e. computed as sample covariances;
this rescaling changes the eigenvalue below but not the LDA direction):

$$S_i = \frac{1}{N_i - 1}\sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T$$

$$S_1 = \begin{pmatrix} 1 & -0.25 \\ -0.25 & 2.2 \end{pmatrix}, \qquad S_2 = \begin{pmatrix} 2.3 & -0.05 \\ -0.05 & 3.3 \end{pmatrix}$$

The within-class scatter is
$$S_W = S_1 + S_2 = \begin{pmatrix} 3.3 & -0.30 \\ -0.30 & 5.5 \end{pmatrix}$$
and the between-class scatter is
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix} 29.16 & 20.52 \\ 20.52 & 14.44 \end{pmatrix}$$
Example
(Step 3) Solving the generalized eigenvalue problem $S_W^{-1} S_B\, w = \lambda w$:

$$\left| S_W^{-1} S_B - \lambda I \right| = \begin{vmatrix} 9.2213 - \lambda & 6.489 \\ 4.2339 & 2.9794 - \lambda \end{vmatrix} = 0 \quad\Longrightarrow\quad \lambda_1 = 12.2007, \;\; \lambda_2 = 0$$

$$\begin{pmatrix} 9.2213 & 6.489 \\ 4.2339 & 2.9794 \end{pmatrix} w_1 = 12.2007\, w_1 \;\Longrightarrow\; w_1 = \begin{pmatrix} 0.9088 \\ 0.4173 \end{pmatrix} \qquad
\begin{pmatrix} 9.2213 & 6.489 \\ 4.2339 & 2.9794 \end{pmatrix} w_2 = 0 \cdot w_2 \;\Longrightarrow\; w_2 = \begin{pmatrix} -0.5755 \\ 0.8178 \end{pmatrix}$$
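The numbers above can be reproduced with a few lines of NumPy (a sketch; note again that the slide's $S_1$ and $S_2$ are sample covariances, normalized by $N_i - 1$):

```python
import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)        # (3, 3.8), (8.4, 7.6)
S1 = np.cov(X1, rowvar=False)                      # [[1, -0.25], [-0.25, 2.2]]
S2 = np.cov(X2, rowvar=False)                      # [[2.3, -0.05], [-0.05, 3.3]]
S_W = S1 + S2                                      # [[3.3, -0.3], [-0.3, 5.5]]
S_B = np.outer(mu1 - mu2, mu1 - mu2)               # [[29.16, 20.52], [20.52, 14.44]]

vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))   # S_W^{-1} S_B w = lambda w
top = np.argmax(vals.real)
print(vals.real[top])                              # approximately 12.2007
print(vecs.real[:, top])                           # approximately +/-(0.9088, 0.4173)
```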
LDA, C classes
Fisher’s LDA generalizes to C-class problems
Instead of one projection 𝑦,
we will now seek 𝐶 − 1 projections [𝑦1 , 𝑦2 , … , 𝑦𝐶−1 ]
by means of 𝐶 − 1 projection vectors 𝑤𝑖
arranged by columns into a projection matrix
$W = [\,w_1 \mid w_2 \mid \cdots \mid w_{C-1}\,]$:

$$y_i = w_i^T x \qquad\qquad y = W^T x$$
Derivation

The within-class scatter generalizes as
$$S_W = \sum_{i=1}^{C} S_i, \qquad \text{where } S_i = \sum_{x \in C_i}(x - \mu_i)(x - \mu_i)^T \text{ and } \mu_i = \frac{1}{N_i}\sum_{x \in C_i} x$$

The between-class scatter becomes
$$S_B = \sum_{i=1}^{C} N_i\,(\mu_i - \mu)(\mu_i - \mu)^T, \qquad \text{where } \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\, \mu_i$$
LDA, C classes

• Similarly, we define the mean vectors and scatter matrices for the projected samples as

$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in C_i} y \qquad\qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y$$

$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in C_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T \qquad\qquad \tilde{S}_B = \sum_{i=1}^{C} N_i\,(\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
LDA, C classes
• From our derivation for the two-class problem, we can write

$$\tilde{S}_W = W^T S_W W \qquad\qquad \tilde{S}_B = W^T S_B W$$

Recall that we are looking for a projection that
maximizes the ratio of between-class to within-class scatter.

Since the projection is no longer a scalar (it has 𝐶 − 1 dimensions), we use the
determinant of the scatter matrices to obtain a scalar objective function
$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$
We will seek the projection matrix 𝑊 ∗ that maximizes this ratio
LDA, C classes
• It can be shown that the optimal projection matrix 𝑊 ∗ is the one whose columns are
the eigenvectors corresponding to the largest eigenvalues of the following generalized
eigenvalue problem


$$W^* = [\,w_1^* \mid w_2^* \mid \cdots \mid w_{C-1}^*\,] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \quad\Longrightarrow\quad (S_B - \lambda_i S_W)\, w_i^* = 0$$

Note that
$S_B$ is the sum of $C$ matrices of rank $\le 1$, and the mean vectors are constrained by
$$\sum_{i=1}^{C} N_i\, \mu_i = N \mu$$
-- therefore, $S_B$ will be of rank $(C - 1)$ or less
-- this means that only $(C - 1)$ of the eigenvalues $\lambda_i$ will be non-zero

The projections with maximum class separability information are the eigenvectors
corresponding to the largest eigenvalues of $S_W^{-1} S_B$.
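A sketch of multi-class LDA following this recipe (the function and variable names are our own, not from the slides): build S_W and S_B as defined above and keep the top C − 1 eigenvectors of S_W^{-1} S_B as the columns of W.

```python
import numpy as np

def lda_fit(X, y, n_components=None):
    """Return the projection matrix W whose columns are the top eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)                                      # overall mean
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                   # within-class scatter
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)      # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1]                      # largest eigenvalues first
    k = n_components if n_components is not None else len(classes) - 1
    return vecs.real[:, order[:k]]                           # d x k projection matrix W

# usage: W = lda_fit(X, y); X_reduced = X @ W
```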
Example, Iris
The iris dataset contains measurements for 150 iris flowers from three different
species.

The three classes in the Iris dataset:
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)

The four features of the Iris dataset:

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
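A short sketch of this example using scikit-learn's LDA implementation (assuming scikit-learn is installed), reducing the four features to C − 1 = 2 discriminant axes:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)                    # project onto the 2 linear discriminants

print(X_lda.shape)                                 # (150, 2)
print(lda.explained_variance_ratio_)               # share of between-class variance per axis
```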
Limitations of LDA
LDA produces at most (𝐶 − 1) feature projections

If the classification error estimates establish that more features are needed, some
other method must be employed to provide those additional features

LDA is a parametric method (it assumes unimodal Gaussian likelihoods)


- If the distributions are significantly non-Gaussian, the LDA projections may not
preserve complex structure in the data needed for classification
Limitations of LDA

LDA will also fail if the discriminatory information is not in the mean but in the
variance of the data
PCA vs. LDA

Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are
linear transformation techniques that are commonly used for dimensionality reduction.

PCA can be described as an “unsupervised” algorithm, since it “ignores” class labels


and its goal is to find the directions (the so-called principal components) that
maximize the variance in a dataset.

LDA is “supervised” and computes the directions (“linear discriminants”) that
represent the axes that maximize the separation between multiple classes.
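A brief side-by-side sketch of the two methods on the Iris data (assuming scikit-learn), emphasizing that PCA ignores the labels y while LDA uses them:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                              # unsupervised: labels unused
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)    # supervised: labels required

print(X_pca.shape, X_lda.shape)                                           # (150, 2) (150, 2)
```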
