
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki (1)    Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chap. 20: Linear Discriminant Analysis

Linear Discriminant Analysis

Given labeled data consisting of d-dimensional points x_i along with their classes y_i, the goal of linear discriminant analysis (LDA) is to find a vector w that maximizes the separation between the classes after projection onto w.

The key difference between principal component analysis and LDA is that the former deals with unlabeled data and tries to maximize variance, whereas the latter deals with labeled data and tries to maximize the discrimination between the classes.

Projection onto a Line

Let D_i denote the subset of points labeled with class c_i, i.e., D_i = {x_j | y_j = c_i}, and let |D_i| = n_i denote the number of points with class c_i. We assume that there are only k = 2 classes.

The projection of any d-dimensional point x_i onto a unit vector w is given as

    x'_i = ( (w^T x_i) / (w^T w) ) w = (w^T x_i) w = a_i w

where a_i specifies the offset or coordinate of x'_i along the line w:

    a_i = w^T x_i

The set of n scalars {a_1, a_2, ..., a_n} represents the mapping from R^d to R, that is, from the original d-dimensional space to a 1-dimensional space (along w).
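
As a quick illustration (not from the book), a minimal NumPy sketch of this projection step; the array X of points and the direction w are assumed inputs:

import numpy as np

def project_onto_line(X, w):
    """Return the offsets a_i = w^T x_i and the projected points x'_i = a_i w."""
    w = w / np.linalg.norm(w)      # ensure w is a unit vector
    a = X @ w                      # offsets a_i, shape (n,)
    X_proj = np.outer(a, w)        # projected points x'_i, shape (n, d)
    return a, X_proj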

Projection onto w : Iris 2D Data
iris-setosa as class c1 (circles), and the other two Iris types as class c2 (triangles)

[Figure: scatter plot of the Iris 2D data showing the points of both classes and their projections onto the direction w.]
Iris 2D Data: Optimal Linear Discriminant Direction

[Figure: the Iris 2D data with the optimal linear discriminant direction w and the points projected onto it.]
Optimal Linear Discriminant

The mean of the projected points is given as:

    m_1 = w^T µ_1        m_2 = w^T µ_2

To maximize the separation between the classes, we maximize the difference between the projected means, |m_1 − m_2|. However, for good separation, the variance of the projected points for each class should also not be too large. LDA maximizes the separation by ensuring that the scatter s_i^2 for the projected points within each class is small, where the scatter is defined as

    s_i^2 = Σ_{x_j ∈ D_i} (a_j − m_i)^2 = n_i σ_i^2

where σ_i^2 is the variance for class c_i.

Linear Discriminant Analysis: Fisher Objective
We incorporate the two LDA criteria, namely, maximizing the distance between projected
means and minimizing the sum of projected scatter, into a single maximization criterion
called the Fisher LDA objective:

    max_w J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2)

In matrix terms, we can rewrite (m_1 − m_2)^2 as follows:

    (m_1 − m_2)^2 = (w^T (µ_1 − µ_2))^2 = w^T B w

where B = (µ_1 − µ_2)(µ_1 − µ_2)^T is a d × d rank-one matrix called the between-class scatter matrix.

The projected scatter for class c_i is given as

    s_i^2 = Σ_{x_j ∈ D_i} (w^T x_j − w^T µ_i)^2 = w^T ( Σ_{x_j ∈ D_i} (x_j − µ_i)(x_j − µ_i)^T ) w = w^T S_i w

where S_i is the scatter matrix for D_i.


Linear Discriminant Analysis: Fisher Objective
The combined scatter for both classes is given as

    s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S w

where the symmetric positive semidefinite matrix S = S_1 + S_2 denotes the within-class scatter matrix for the pooled data.

The LDA objective function in matrix form is

    max_w J(w) = (w^T B w) / (w^T S w)

To solve for the best direction w, we differentiate the objective function with respect to w, which, after simplification, yields the generalized eigenvalue problem

    B w = λ S w

where λ = J(w) is a generalized eigenvalue of B and S. To maximize the objective, λ should be chosen to be the largest generalized eigenvalue, and w to be the corresponding eigenvector.
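
As an illustration of solving this generalized eigenvalue problem numerically, here is a minimal sketch assuming SciPy is available and S is positive definite (the function name is mine, not the book's):

import numpy as np
from scipy.linalg import eigh   # generalized symmetric-definite eigensolver

def largest_generalized_eigpair(B, S):
    """Dominant solution of B w = lambda S w; B symmetric, S symmetric positive definite."""
    evals, evecs = eigh(B, S)          # eigenvalues returned in ascending order
    lam, w = evals[-1], evecs[:, -1]   # take the largest generalized eigenvalue
    return lam, w / np.linalg.norm(w)  # J(w) = lambda, unit-length direction w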
Linear Discriminant Algorithm

LinearDiscriminant (D = {(x_i, y_i)}_{i=1}^n):
 1  D_i ← {x_j | y_j = c_i, j = 1, ..., n}, i = 1, 2   // class-specific subsets
 2  µ_i ← mean(D_i), i = 1, 2                          // class means
 3  B ← (µ_1 − µ_2)(µ_1 − µ_2)^T                       // between-class scatter matrix
 4  Z_i ← D_i − 1_{n_i} µ_i^T, i = 1, 2                // center class matrices
 5  S_i ← Z_i^T Z_i, i = 1, 2                          // class scatter matrices
 6  S ← S_1 + S_2                                      // within-class scatter matrix
 7  λ_1, w ← eigen(S^{-1} B)                           // compute dominant eigenvector
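
A minimal NumPy sketch mirroring these steps for two classes; the function and variable names are illustrative assumptions, not part of the book:

import numpy as np

def linear_discriminant(X, y, c1, c2):
    """LDA direction for the two classes labeled c1 and c2 in (X, y)."""
    D1, D2 = X[y == c1], X[y == c2]               # class-specific subsets
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)   # class means
    B = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter matrix
    Z1, Z2 = D1 - mu1, D2 - mu2                   # centered class matrices
    S = Z1.T @ Z1 + Z2.T @ Z2                     # within-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.solve(S, B))  # eigen(S^{-1} B)
    k = np.argmax(evals.real)
    w = evecs[:, k].real
    return evals[k].real, w / np.linalg.norm(w)   # lambda_1 = J(w), unit direction w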

Linear Discriminant Direction: Iris 2D Data
The between-class scatter matrix is

    B = [  1.587  −0.693 ]
        [ −0.693   0.303 ]

The class scatter matrices are

    S_1 = [ 6.09  4.91 ]        S_2 = [ 43.5   12.09 ]
          [ 4.91  7.11 ]              [ 12.09  10.96 ]

and the within-class scatter matrix is

    S = S_1 + S_2 = [ 49.58  17.01 ]
                    [ 17.01  18.08 ]

The direction of most separation between c_1 and c_2 is the dominant eigenvector corresponding to the largest eigenvalue of the matrix S^{-1} B. The solution is

    J(w) = λ_1 = 0.11

    w = [  0.551 ]
        [ −0.834 ]

[Figure: the Iris 2D data with the linear discriminant direction w drawn through the data.]
Linear Discriminant Analysis: Two Classes
For the two-class scenario, if S is nonsingular, we can directly solve for w without computing the eigenvalues and eigenvectors.

The between-class scatter matrix B points in the same direction as (µ_1 − µ_2) because

    B w = ((µ_1 − µ_2)(µ_1 − µ_2)^T) w
        = (µ_1 − µ_2) ((µ_1 − µ_2)^T w)
        = b (µ_1 − µ_2)

where b = (µ_1 − µ_2)^T w is just a scalar.

The generalized eigenvalue equation can then be rewritten as

    w = (b/λ) S^{-1} (µ_1 − µ_2)

Because b/λ is just a scalar, we can solve for the best linear discriminant as

    w = S^{-1} (µ_1 − µ_2)

We can finally normalize w to be a unit vector.
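
A short sketch of this closed form in NumPy (names are illustrative); solving the linear system avoids forming S^{-1} explicitly:

import numpy as np

def lda_two_class_direction(S, mu1, mu2):
    """Closed-form LDA direction w = S^{-1}(mu1 - mu2), returned as a unit vector."""
    w = np.linalg.solve(S, mu1 - mu2)   # solve S w = (mu1 - mu2) instead of inverting S
    return w / np.linalg.norm(w)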


Linear Discriminant Direction: Iris 2D Data

We can directly compute w as follows:

    w = S^{-1} (µ_1 − µ_2)
      = [  0.066  −0.029 ] [ −1.246 ]  =  [ −0.0527 ]
        [ −0.100   0.044 ] [  0.546 ]     [  0.0798 ]

After normalizing, we have

    w = w / ||w|| = (1/0.0956) [ −0.0527 ]  =  [ −0.551 ]
                               [  0.0798 ]     [  0.834 ]

Note that even though the sign is reversed for w compared with the eigenvector solution, both represent the same direction; only the scalar multiplier is different.

Kernel Discriminant Analysis
The goal of kernel LDA is to find the direction vector w in feature space that maximizes

    max_w J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2)

It is well known that w can be expressed as a linear combination of the points in feature space

    w = Σ_{j=1}^{n} a_j φ(x_j)

The mean for class c_i in feature space is given as

    µ_i^φ = (1/n_i) Σ_{x_j ∈ D_i} φ(x_j)

and the covariance matrix for class c_i in feature space is

    Σ_i^φ = (1/n_i) Σ_{x_j ∈ D_i} (φ(x_j) − µ_i^φ)(φ(x_j) − µ_i^φ)^T

Kernel Discriminant Analysis
The between-class scatter matrix in feature space is

    B_φ = (µ_1^φ − µ_2^φ)(µ_1^φ − µ_2^φ)^T

and the within-class scatter matrix in feature space is

    S_φ = n_1 Σ_1^φ + n_2 Σ_2^φ

S_φ is a d × d symmetric, positive semidefinite matrix, where d is the dimensionality of the feature space.

The best linear discriminant vector w in feature space is the dominant eigenvector, which satisfies the expression

    (S_φ^{-1} B_φ) w = λ w

where we assume that S_φ is nonsingular.

LDA Objective via Kernel Matrix: Between-class Scatter
The projected mean for class c_i is given as

    m_i = w^T µ_i^φ = (1/n_i) Σ_{j=1}^{n} Σ_{x_k ∈ D_i} a_j K(x_j, x_k) = a^T m_i

where a = (a_1, a_2, ..., a_n)^T is the weight vector, and the vector m_i ∈ R^n on the right-hand side is

    m_i = (1/n_i) ( Σ_{x_k ∈ D_i} K(x_1, x_k),  Σ_{x_k ∈ D_i} K(x_2, x_k),  ...,  Σ_{x_k ∈ D_i} K(x_n, x_k) )^T = (1/n_i) K^{c_i} 1_{n_i}

where K^{c_i} is the n × n_i subset of the kernel matrix, restricted to the columns belonging to points only in D_i, and 1_{n_i} is the n_i-dimensional vector all of whose entries are one.

The separation between the projected means is thus

    (m_1 − m_2)^2 = (a^T m_1 − a^T m_2)^2 = a^T M a

where M = (m_1 − m_2)(m_1 − m_2)^T is the between-class scatter matrix.
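
For concreteness, a small NumPy sketch of these quantities from a precomputed kernel matrix K and a NumPy label array y (names are assumptions, not the book's notation):

import numpy as np

def kernel_class_means(K, y, c1, c2):
    """Vectors m_1, m_2 and between-class scatter M = (m_1 - m_2)(m_1 - m_2)^T."""
    m1 = K[:, y == c1].sum(axis=1) / np.sum(y == c1)   # m_1 = (1/n_1) K^{c1} 1
    m2 = K[:, y == c2].sum(axis=1) / np.sum(y == c2)   # m_2 = (1/n_2) K^{c2} 1
    M = np.outer(m1 - m2, m1 - m2)
    return m1, m2, M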


LDA Objective via Kernel Matrix: Within-class Scatter

We can compute the projected scatter for each class, s_1^2 and s_2^2, purely in terms of the kernel function, as follows:

    s_1^2 = Σ_{x_i ∈ D_1} (w^T φ(x_i) − w^T µ_1^φ)^2 = a^T ( Σ_{x_i ∈ D_1} K_i K_i^T − n_1 m_1 m_1^T ) a = a^T N_1 a

where K_i is the ith column of the kernel matrix, and N_1 is the class scatter matrix for c_1.

The sum of the projected scatter values is then given as

    s_1^2 + s_2^2 = a^T (N_1 + N_2) a = a^T N a

where N is the n × n within-class scatter matrix.
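
As a hedged sketch (names assumed), the class scatter matrix N_c can be formed directly from the class columns of K using the form Σ K_i K_i^T − n_c m_c m_c^T given above:

import numpy as np

def kernel_class_scatter(K, y, c):
    """Class scatter N_c = sum_i K_i K_i^T - n_c m_c m_c^T over the points of class c."""
    Kc = K[:, y == c]                      # columns of K for the points in class c
    n_c = Kc.shape[1]
    m_c = Kc.sum(axis=1) / n_c             # class kernel mean vector
    return Kc @ Kc.T - n_c * np.outer(m_c, m_c)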

Kernel LDA
The kernel LDA maximization condition is

    max_w J(w) = max_a J(a) = (a^T M a) / (a^T N a)

The weight vector a is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem:

    M a = λ_1 N a

When there are only two classes, a can be obtained directly:

    a = N^{-1} (m_1 − m_2)

To normalize w to be a unit vector we scale a by 1/√(a^T K a).

We can project any point x onto the discriminant direction as follows:

    w^T φ(x) = Σ_{j=1}^{n} a_j φ(x_j)^T φ(x) = Σ_{j=1}^{n} a_j K(x_j, x)
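
A minimal sketch of the two-class kernel solution and of projecting a new point, assuming N is nonsingular; the kernel argument stands for any kernel function K(x_j, x), and all names are illustrative:

import numpy as np

def kernel_lda_weights(N, m1, m2, K):
    """Two-class kernel LDA weights a = N^{-1}(m_1 - m_2), scaled so that w is a unit vector."""
    a = np.linalg.solve(N, m1 - m2)
    return a / np.sqrt(a @ K @ a)          # scale by 1/sqrt(a^T K a)

def project(x, X, a, kernel):
    """w^T phi(x) = sum_j a_j K(x_j, x) for a new point x."""
    return sum(a_j * kernel(x_j, x) for a_j, x_j in zip(a, X))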

Kernel Discriminant Analysis Algorithm

KernelDiscriminant (D = {(x_i, y_i)}_{i=1}^n, K):
 1  K ← { K(x_i, x_j) }_{i,j=1,...,n}                          // compute n × n kernel matrix
 2  K^{c_i} ← { K(j, k) | y_k = c_i, 1 ≤ j, k ≤ n }, i = 1, 2  // class kernel matrices
 3  m_i ← (1/n_i) K^{c_i} 1_{n_i}, i = 1, 2                    // class means
 4  M ← (m_1 − m_2)(m_1 − m_2)^T                               // between-class scatter matrix
 5  N_i ← K^{c_i} (I_{n_i} − (1/n_i) 1_{n_i × n_i}) (K^{c_i})^T, i = 1, 2  // class scatter matrices
 6  N ← N_1 + N_2                                              // within-class scatter matrix
 7  λ_1, a ← eigen(N^{-1} M)                                   // compute weight vector
 8  a ← a / √(a^T K a)                                         // normalize w to be unit vector
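
A NumPy sketch following these steps for two classes; it assumes the kernel matrix K and label array y are given, and that N is nonsingular (in practice N is often regularized before inversion):

import numpy as np

def kernel_discriminant(K, y, c1, c2):
    """Kernel LDA weight vector a for classes c1 and c2, given a precomputed kernel matrix K."""
    N = np.zeros_like(K, dtype=float)
    means = []
    for c in (c1, c2):
        Kc = K[:, y == c]                            # class kernel matrix, shape (n, n_c)
        n_c = Kc.shape[1]
        means.append(Kc.sum(axis=1) / n_c)           # class mean vector m_c
        C = np.eye(n_c) - np.ones((n_c, n_c)) / n_c  # centering matrix
        N += Kc @ C @ Kc.T                           # within-class scatter N = N_1 + N_2
    m1, m2 = means
    M = np.outer(m1 - m2, m1 - m2)                   # between-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.solve(N, M))  # eigen(N^{-1} M); assumes N nonsingular
    a = evecs[:, np.argmax(evals.real)].real
    return a / np.sqrt(a @ K @ a)                    # scale so that w is a unit vector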

Kernel Discriminant Analysis
Quadratic Homogeneous Kernel

Iris 2D Data: c1 (circles; iris-virginica) and c2 (triangles; the other two types).

Kernel function: K(x_i, x_j) = (x_i^T x_j)^2.

Contours of constant projection onto the optimal kernel discriminant, i.e., points along both curves have the same value when projected onto w.
[Figure: the Iris 2D data (axes u1 and u2) with contours of constant projection onto the optimal kernel discriminant.]
Kernel Discriminant Analysis
Quadratic Homogeneous Kernel

Projecting the points x_i ∈ D onto w separates the two classes.

The projected scatters and means for both classes are as follows:

    m_1 = 0.338        m_2 = 4.476
    s_1^2 = 13.862     s_2^2 = 320.934

The value of J(w) is given as

    J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2) = (0.338 − 4.476)^2 / (13.862 + 320.934) = 17.123 / 334.796 = 0.0511

which, as expected, matches λ_1 = 0.0511 from above.

[Figure: the points of D projected onto the kernel discriminant direction w, shown along a one-dimensional axis from −1 to 10.]
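
As a quick check (a sketch under the assumption that p holds the projected values K a and y the class labels), the value J(w) can be recomputed from the projections:

import numpy as np

def fisher_value(p, y, c1, c2):
    """J(w) = (m1 - m2)^2 / (s1^2 + s2^2) for 1-dimensional projections p."""
    p1, p2 = p[y == c1], p[y == c2]
    m1, m2 = p1.mean(), p2.mean()
    s1, s2 = ((p1 - m1) ** 2).sum(), ((p2 - m2) ** 2).sum()
    return (m1 - m2) ** 2 / (s1 + s2)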

Kernel Feature Space and Optimal Discriminant
In general it is neither desirable nor possible to obtain an explicit discriminant vector w. For the homogeneous quadratic kernel, however, the feature map is known: x = (x_1, x_2)^T ∈ R^2 is mapped to φ(x) = (√2 x_1 x_2, x_1^2, x_2^2)^T ∈ R^3.

The projection of φ(x_i) onto w is also shown in the figure, where

    w = 0.511 x_1 x_2 + 0.761 x_1^2 − 0.4 x_2^2
[Figure: the mapped points φ(x_i) in the 3-dimensional feature space with axes X1X2, X1², and X2², together with the discriminant direction w and the projections of the points onto w.]