
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki (1)    Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chap. 20: Linear Discriminant Analysis

Linear Discriminant Analysis

Given labeled data consisting of d-dimensional points x_i along with their classes y_i, the goal of linear discriminant analysis (LDA) is to find a vector w that maximizes the separation between the classes after projection onto w.

The key difference between principal component analysis and LDA is that the former deals with unlabeled data and tries to maximize variance, whereas the latter deals with labeled data and tries to maximize the discrimination between the classes.

Projection onto a Line

Let D_i denote the subset of points labeled with class c_i, i.e., D_i = {x_j | y_j = c_i}, and let |D_i| = n_i denote the number of points with class c_i. We assume that there are only k = 2 classes.

The projection of any d-dimensional point x_i onto a unit vector w is given as

    x'_i = ( (w^T x_i) / (w^T w) ) w = (w^T x_i) w = a_i w

where a_i specifies the offset or coordinate of x'_i along the line w:

    a_i = w^T x_i

The set of n scalars {a_1, a_2, ..., a_n} represents the mapping from R^d to R, that is, from the original d-dimensional space to a 1-dimensional space (along w).
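
As a quick illustration (not from the book), a minimal NumPy sketch of this projection step; the array X of points and the direction w are assumed inputs:

import numpy as np

def project_onto_line(X, w):
    """Return the offsets a_i = w^T x_i and the projected points x'_i = a_i w."""
    w = w / np.linalg.norm(w)      # ensure w is a unit vector
    a = X @ w                      # offsets a_i, shape (n,)
    X_proj = np.outer(a, w)        # projected points x'_i, shape (n, d)
    return a, X_proj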

Projection onto w : Iris 2D Data
iris-setosa as class c1 (circles), and the other two Iris types as class c2 (triangles)

[Figure: scatter plot of the Iris 2D data showing the points of both classes and their projections onto the direction w.]
Iris 2D Data: Optimal Linear Discriminant Direction

[Figure: the Iris 2D data with the optimal linear discriminant direction w and the points projected onto it.]
Optimal Linear Discriminant

The mean of the projected points is given as:

    m_1 = w^T µ_1        m_2 = w^T µ_2

To maximize the separation between the classes, we maximize the difference between the projected means, |m_1 − m_2|. However, for good separation, the variance of the projected points for each class should also not be too large. LDA maximizes the separation by ensuring that the scatter s_i^2 for the projected points within each class is small, where the scatter is defined as

    s_i^2 = Σ_{x_j ∈ D_i} (a_j − m_i)^2 = n_i σ_i^2

where σ_i^2 is the variance for class c_i.

Linear Discriminant Analysis: Fisher Objective
We incorporate the two LDA criteria, namely, maximizing the distance between projected
means and minimizing the sum of projected scatter, into a single maximization criterion
called the Fisher LDA objective:

    max_w J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2)

In matrix terms, we can rewrite (m_1 − m_2)^2 as follows:

    (m_1 − m_2)^2 = (w^T (µ_1 − µ_2))^2 = w^T B w

where B = (µ_1 − µ_2)(µ_1 − µ_2)^T is a d × d rank-one matrix called the between-class scatter matrix.

The projected scatter for class c_i is given as

    s_i^2 = Σ_{x_j ∈ D_i} (w^T x_j − w^T µ_i)^2 = w^T ( Σ_{x_j ∈ D_i} (x_j − µ_i)(x_j − µ_i)^T ) w = w^T S_i w

where S_i is the scatter matrix for D_i.


Linear Discriminant Analysis: Fisher Objective
The combined scatter for both classes is given as

    s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S w

where the symmetric positive semidefinite matrix S = S_1 + S_2 denotes the within-class scatter matrix for the pooled data.

The LDA objective function in matrix form is

    max_w J(w) = (w^T B w) / (w^T S w)

To solve for the best direction w, we differentiate the objective function with respect to w, which, after simplification, yields the generalized eigenvalue problem

    B w = λ S w

where λ = J(w) is a generalized eigenvalue of B and S. To maximize the objective, λ should be chosen to be the largest generalized eigenvalue, and w to be the corresponding eigenvector.
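
As an illustration of solving this generalized eigenvalue problem numerically, here is a minimal sketch assuming SciPy is available and S is positive definite (the function name is mine, not the book's):

import numpy as np
from scipy.linalg import eigh   # generalized symmetric-definite eigensolver

def largest_generalized_eigpair(B, S):
    """Dominant solution of B w = lambda S w; B symmetric, S symmetric positive definite."""
    evals, evecs = eigh(B, S)          # eigenvalues returned in ascending order
    lam, w = evals[-1], evecs[:, -1]   # take the largest generalized eigenvalue
    return lam, w / np.linalg.norm(w)  # J(w) = lambda, unit-length direction w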
Linear Discriminant Algorithm

LinearDiscriminant (D = {(x_i, y_i)}_{i=1}^n):
 1  D_i ← {x_j | y_j = c_i, j = 1, ..., n}, i = 1, 2   // class-specific subsets
 2  µ_i ← mean(D_i), i = 1, 2                          // class means
 3  B ← (µ_1 − µ_2)(µ_1 − µ_2)^T                       // between-class scatter matrix
 4  Z_i ← D_i − 1_{n_i} µ_i^T, i = 1, 2                // center class matrices
 5  S_i ← Z_i^T Z_i, i = 1, 2                          // class scatter matrices
 6  S ← S_1 + S_2                                      // within-class scatter matrix
 7  λ_1, w ← eigen(S^{-1} B)                           // compute dominant eigenvector
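
A minimal NumPy sketch mirroring these steps for two classes; the function and variable names are illustrative assumptions, not part of the book:

import numpy as np

def linear_discriminant(X, y, c1, c2):
    """LDA direction for the two classes labeled c1 and c2 in (X, y)."""
    D1, D2 = X[y == c1], X[y == c2]               # class-specific subsets
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)   # class means
    B = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter matrix
    Z1, Z2 = D1 - mu1, D2 - mu2                   # centered class matrices
    S = Z1.T @ Z1 + Z2.T @ Z2                     # within-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.solve(S, B))  # eigen(S^{-1} B)
    k = np.argmax(evals.real)
    w = evecs[:, k].real
    return evals[k].real, w / np.linalg.norm(w)   # lambda_1 = J(w), unit direction w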

Linear Discriminant Direction: Iris 2D Data
The between-class scatter matrix is

    B = [  1.587  −0.693 ]
        [ −0.693   0.303 ]

The class scatter matrices are

    S_1 = [ 6.09  4.91 ]        S_2 = [ 43.5   12.09 ]
          [ 4.91  7.11 ]              [ 12.09  10.96 ]

and the within-class scatter matrix is

    S = S_1 + S_2 = [ 49.58  17.01 ]
                    [ 17.01  18.08 ]

The direction of most separation between c_1 and c_2 is the dominant eigenvector corresponding to the largest eigenvalue of the matrix S^{-1} B. The solution is

    J(w) = λ_1 = 0.11

    w = [  0.551 ]
        [ −0.834 ]

[Figure: the Iris 2D data with the linear discriminant direction w drawn through the data.]
Linear Discriminant Analysis: Two Classes
For the two-class scenario, if S is nonsingular, we can directly solve for w without computing the eigenvalues and eigenvectors.

The between-class scatter matrix B points in the same direction as (µ_1 − µ_2) because

    B w = ((µ_1 − µ_2)(µ_1 − µ_2)^T) w
        = (µ_1 − µ_2) ((µ_1 − µ_2)^T w)
        = b (µ_1 − µ_2)

where b = (µ_1 − µ_2)^T w is just a scalar.

The generalized eigenvalue equation can then be rewritten as

    w = (b/λ) S^{-1} (µ_1 − µ_2)

Because b/λ is just a scalar, we can solve for the best linear discriminant as

    w = S^{-1} (µ_1 − µ_2)

We can finally normalize w to be a unit vector.
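
A short sketch of this closed form in NumPy (names are illustrative); solving the linear system avoids forming S^{-1} explicitly:

import numpy as np

def lda_two_class_direction(S, mu1, mu2):
    """Closed-form LDA direction w = S^{-1}(mu1 - mu2), returned as a unit vector."""
    w = np.linalg.solve(S, mu1 - mu2)   # solve S w = (mu1 - mu2) instead of inverting S
    return w / np.linalg.norm(w)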


Linear Discriminant Direction: Iris 2D Data

We can directly compute w as follows:

    w = S^{-1} (µ_1 − µ_2)
      = [  0.066  −0.029 ] [ −1.246 ]  =  [ −0.0527 ]
        [ −0.100   0.044 ] [  0.546 ]     [  0.0798 ]

After normalizing, we have

    w = w / ||w|| = (1/0.0956) [ −0.0527 ]  =  [ −0.551 ]
                               [  0.0798 ]     [  0.834 ]

Note that even though the sign is reversed for w compared with the eigenvector solution, both represent the same direction; only the scalar multiplier is different.

Kernel Discriminant Analysis
The goal of kernel LDA is to find the direction vector w in feature space that maximizes

    max_w J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2)

It is well known that w can be expressed as a linear combination of the points in feature space

    w = Σ_{j=1}^{n} a_j φ(x_j)

The mean for class c_i in feature space is given as

    µ_i^φ = (1/n_i) Σ_{x_j ∈ D_i} φ(x_j)

and the covariance matrix for class c_i in feature space is

    Σ_i^φ = (1/n_i) Σ_{x_j ∈ D_i} (φ(x_j) − µ_i^φ)(φ(x_j) − µ_i^φ)^T

Kernel Discriminant Analysis
The between-class scatter matrix in feature space is

    B_φ = (µ_1^φ − µ_2^φ)(µ_1^φ − µ_2^φ)^T

and the within-class scatter matrix in feature space is

    S_φ = n_1 Σ_1^φ + n_2 Σ_2^φ

S_φ is a d × d symmetric, positive semidefinite matrix, where d is the dimensionality of the feature space.

The best linear discriminant vector w in feature space is the dominant eigenvector, which satisfies the expression

    (S_φ^{-1} B_φ) w = λ w

where we assume that S_φ is nonsingular.

LDA Objective via Kernel Matrix: Between-class Scatter
The projected mean for class c_i is given as

    m_i = w^T µ_i^φ = (1/n_i) Σ_{j=1}^{n} Σ_{x_k ∈ D_i} a_j K(x_j, x_k) = a^T m_i

where a = (a_1, a_2, ..., a_n)^T is the weight vector, and the vector m_i ∈ R^n on the right-hand side is

    m_i = (1/n_i) ( Σ_{x_k ∈ D_i} K(x_1, x_k),  Σ_{x_k ∈ D_i} K(x_2, x_k),  ...,  Σ_{x_k ∈ D_i} K(x_n, x_k) )^T = (1/n_i) K^{c_i} 1_{n_i}

where K^{c_i} is the n × n_i subset of the kernel matrix, restricted to the columns belonging to points only in D_i, and 1_{n_i} is the n_i-dimensional vector all of whose entries are one.

The separation between the projected means is thus

    (m_1 − m_2)^2 = (a^T m_1 − a^T m_2)^2 = a^T M a

where M = (m_1 − m_2)(m_1 − m_2)^T is the between-class scatter matrix.
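
For concreteness, a small NumPy sketch of these quantities from a precomputed kernel matrix K and a NumPy label array y (names are assumptions, not the book's notation):

import numpy as np

def kernel_class_means(K, y, c1, c2):
    """Vectors m_1, m_2 and between-class scatter M = (m_1 - m_2)(m_1 - m_2)^T."""
    m1 = K[:, y == c1].sum(axis=1) / np.sum(y == c1)   # m_1 = (1/n_1) K^{c1} 1
    m2 = K[:, y == c2].sum(axis=1) / np.sum(y == c2)   # m_2 = (1/n_2) K^{c2} 1
    M = np.outer(m1 - m2, m1 - m2)
    return m1, m2, M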


LDA Objective via Kernel Matrix: Within-class Scatter

We can compute the projected scatter for each class, s_1^2 and s_2^2, purely in terms of the kernel function, as follows:

    s_1^2 = Σ_{x_i ∈ D_1} (w^T φ(x_i) − w^T µ_1^φ)^2 = a^T ( Σ_{x_i ∈ D_1} K_i K_i^T − n_1 m_1 m_1^T ) a = a^T N_1 a

where K_i is the ith column of the kernel matrix, and N_1 is the class scatter matrix for c_1.

The sum of the projected scatter values is then given as

    s_1^2 + s_2^2 = a^T (N_1 + N_2) a = a^T N a

where N is the n × n within-class scatter matrix.
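
As a hedged sketch (names assumed), the class scatter matrix N_c can be formed directly from the class columns of K using the form Σ K_i K_i^T − n_c m_c m_c^T given above:

import numpy as np

def kernel_class_scatter(K, y, c):
    """Class scatter N_c = sum_i K_i K_i^T - n_c m_c m_c^T over the points of class c."""
    Kc = K[:, y == c]                      # columns of K for the points in class c
    n_c = Kc.shape[1]
    m_c = Kc.sum(axis=1) / n_c             # class kernel mean vector
    return Kc @ Kc.T - n_c * np.outer(m_c, m_c)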

Kernel LDA
The kernel LDA maximization condition is

    max_w J(w) = max_a J(a) = (a^T M a) / (a^T N a)

The weight vector a is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem:

    M a = λ_1 N a

When there are only two classes, a can be obtained directly:

    a = N^{-1} (m_1 − m_2)

To normalize w to be a unit vector we scale a by 1/√(a^T K a).

We can project any point x onto the discriminant direction as follows:

    w^T φ(x) = Σ_{j=1}^{n} a_j φ(x_j)^T φ(x) = Σ_{j=1}^{n} a_j K(x_j, x)
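
A minimal sketch of the two-class kernel solution and of projecting a new point, assuming N is nonsingular; the kernel argument stands for any kernel function K(x_j, x), and all names are illustrative:

import numpy as np

def kernel_lda_weights(N, m1, m2, K):
    """Two-class kernel LDA weights a = N^{-1}(m_1 - m_2), scaled so that w is a unit vector."""
    a = np.linalg.solve(N, m1 - m2)
    return a / np.sqrt(a @ K @ a)          # scale by 1/sqrt(a^T K a)

def project(x, X, a, kernel):
    """w^T phi(x) = sum_j a_j K(x_j, x) for a new point x."""
    return sum(a_j * kernel(x_j, x) for a_j, x_j in zip(a, X))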

Kernel Discriminant Analysis Algorithm

KernelDiscriminant (D = {(x_i, y_i)}_{i=1}^n, K):
 1  K ← { K(x_i, x_j) }_{i,j=1,...,n}                          // compute n × n kernel matrix
 2  K^{c_i} ← { K(j, k) | y_k = c_i, 1 ≤ j, k ≤ n }, i = 1, 2  // class kernel matrices
 3  m_i ← (1/n_i) K^{c_i} 1_{n_i}, i = 1, 2                    // class means
 4  M ← (m_1 − m_2)(m_1 − m_2)^T                               // between-class scatter matrix
 5  N_i ← K^{c_i} (I_{n_i} − (1/n_i) 1_{n_i × n_i}) (K^{c_i})^T, i = 1, 2  // class scatter matrices
 6  N ← N_1 + N_2                                              // within-class scatter matrix
 7  λ_1, a ← eigen(N^{-1} M)                                   // compute weight vector
 8  a ← a / √(a^T K a)                                         // normalize w to be unit vector
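
A NumPy sketch following these steps for two classes; it assumes the kernel matrix K and label array y are given, and that N is nonsingular (in practice N is often regularized before inversion):

import numpy as np

def kernel_discriminant(K, y, c1, c2):
    """Kernel LDA weight vector a for classes c1 and c2, given a precomputed kernel matrix K."""
    N = np.zeros_like(K, dtype=float)
    means = []
    for c in (c1, c2):
        Kc = K[:, y == c]                            # class kernel matrix, shape (n, n_c)
        n_c = Kc.shape[1]
        means.append(Kc.sum(axis=1) / n_c)           # class mean vector m_c
        C = np.eye(n_c) - np.ones((n_c, n_c)) / n_c  # centering matrix
        N += Kc @ C @ Kc.T                           # within-class scatter N = N_1 + N_2
    m1, m2 = means
    M = np.outer(m1 - m2, m1 - m2)                   # between-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.solve(N, M))  # eigen(N^{-1} M); assumes N nonsingular
    a = evecs[:, np.argmax(evals.real)].real
    return a / np.sqrt(a @ K @ a)                    # scale so that w is a unit vector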

Kernel Discriminant Analysis
Quadratic Homogeneous Kernel

Iris 2D Data: c1 (circles; iris-virginica) and c2 (triangles; the other two types).

Kernel function: K(x_i, x_j) = (x_i^T x_j)^2.

Contours of constant projection onto the optimal kernel discriminant, i.e., points along both curves have the same value when projected onto w.
[Figure: the Iris 2D data (axes u1 and u2) with contours of constant projection onto the optimal kernel discriminant.]
Kernel Discriminant Analysis
Quadratic Homogeneous Kernel

Projecting the points x_i ∈ D onto w separates the two classes.

The projected scatters and means for both classes are as follows:

    m_1 = 0.338        m_2 = 4.476
    s_1^2 = 13.862     s_2^2 = 320.934

The value of J(w) is given as

    J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2) = (0.338 − 4.476)^2 / (13.862 + 320.934) = 17.123 / 334.796 = 0.0511

which, as expected, matches λ_1 = 0.0511 from above.

[Figure: the points of D projected onto the kernel discriminant direction w, shown along a one-dimensional axis from −1 to 10.]
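
As a quick check (a sketch under the assumption that p holds the projected values K a and y the class labels), the value J(w) can be recomputed from the projections:

import numpy as np

def fisher_value(p, y, c1, c2):
    """J(w) = (m1 - m2)^2 / (s1^2 + s2^2) for 1-dimensional projections p."""
    p1, p2 = p[y == c1], p[y == c2]
    m1, m2 = p1.mean(), p2.mean()
    s1, s2 = ((p1 - m1) ** 2).sum(), ((p2 - m2) ** 2).sum()
    return (m1 - m2) ** 2 / (s1 + s2)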

Kernel Feature Space and Optimal Discriminant
In general it is neither desirable nor possible to obtain an explicit discriminant vector w. For the homogeneous quadratic kernel, however, the feature map is known: x = (x_1, x_2)^T ∈ R^2 is mapped to φ(x) = (√2 x_1 x_2, x_1^2, x_2^2)^T ∈ R^3.

The projection of φ(x_i) onto w is also shown in the figure, where

    w = 0.511 x_1 x_2 + 0.761 x_1^2 − 0.4 x_2^2
[Figure: the mapped points φ(x_i) in the 3-dimensional feature space with axes X1X2, X1², and X2², together with the discriminant direction w and the projections of the points onto w.]