
Multivariate Analysis (slides 8)

• Today we consider Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).
• These are used when it is assumed that there are K groups within the
data and that a subset of the data is labelled, i.e., its group
membership is known.
• Discriminant analysis refers to a set of ‘supervised’ statistical techniques
where the class information is used to help reveal the structure of the data.
• This structure then allows the ‘classification’ of future observations.

1
Discriminant Analysis
• We want to be able to use knowledge of labelled data (i.e., those whose
group membership is known) in order to classify the group membership of
unlabelled data.
• We previously considered the k-nearest neighbours technique for this
problem.
• We shall now consider the alternative approaches of:
– LDA (linear discriminant analysis)
– QDA (quadratic discriminant analysis)

2
LDA & QDA
• Unlike k-Nearest Neighbours (and all the other techniques covered so far),
both LDA and QDA assume a distribution over the data.
• Once we introduce distributions (and parameters for these distributions),
we can quantify our uncertainty over the structure of the data.
• As far as classification is concerned, this means that we can consider the
probability of group assignment.
• The distinction between a point that is assigned a probability of 0.51 to one
group and 0.49 to another, against a point that is assigned a probability of
0.99 to one group and 0.01 to another, can be quite important.

3
Multivariate Normal Distribution
• Let x^T = (x_1, x_2, . . . , x_m), where x_1, x_2, . . . , x_m are random variables.
• The Multi-Variate Normal (MVN) distribution has two parameters:
– Mean µ, an m-dimensional vector.
– Covariance matrix Σ, with dimension m × m.
• A vector x is said to follow a MVN distribution, denoted x ∼ MVN(µ, Σ),
if it has the following probability density function:

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right]$$

• Here |Σ| denotes the determinant of Σ.
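As a quick numerical check (not part of the original slides), the density can be evaluated directly from the formula above; the helper name mvn_pdf and the example values of µ and Σ are made up for illustration, and SciPy is assumed to be available for comparison.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate the MVN density at x using the formula on this slide."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff            # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])
x = np.array([0.5, -1.0])
print(mvn_pdf(x, mu, Sigma))                              # hand-rolled formula
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))     # SciPy check: values should agree
```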

4
Multivariate Normal Distribution
• The MVN distribution is very useful when modelling multivariate data.
• Notice:
$$\{x : f(x \mid \mu, \Sigma) > C\} = \left\{ x : (x - \mu)^{T} \Sigma^{-1} (x - \mu) < -2 \log\!\left[ C\,(2\pi)^{m/2} |\Sigma|^{1/2} \right] \right\}$$

• This corresponds to an m-dimensional ellipsoid centered at point µ.


• If it is assumed that the data within a group k follows a MVN distribution
with mean µk and covariance Σk , then the scatter of the data should be
roughly elliptical.
• The mean fixes the location of the scatter and the covariance affects the
shape of the ellipsoid.

5
Normal Contours
   
• For example, the contour plot of a MVN with mean $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and covariance $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 3 \end{pmatrix}$ is shown below:
[Figure: contour plot of the MVN(µ, Σ) density.]

6
Normal Contours: Data
• Sampling from this distribution and overlaying the results on the contour
plot gives:
[Figure: random sample overlaid on the contour plot of the density.]
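A minimal sketch of how plots like these could be reproduced, assuming NumPy, SciPy and Matplotlib are available; the grid ranges and sample size are arbitrary choices, not taken from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])
rv = multivariate_normal(mean=mu, cov=Sigma)

# Contours of the density over a grid of (x, y) points
xs, ys = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-6, 6, 200))
plt.contour(xs, ys, rv.pdf(np.dstack([xs, ys])))

# Overlay a random sample from the same distribution
sample = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=200)
plt.scatter(sample[:, 0], sample[:, 1], s=10)
plt.show()
```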

7
Shape of Scatter
• If we assume that the data within each group follows a MVN distribution
with mean µk and covariance Σk , then we also assume that the scatter is
roughly elliptical.
• The mean sets the location of this scatter and the covariance sets the shape
of the ellipse.

[Figure: two elliptical data scatters, whose location is set by the mean and whose shape is set by the covariance.]
8
Mahalanobis Distance
• The Mahalanobis distance from a point x to a mean µ is D, where

$$D^{2} = (x - \mu)^{T} \Sigma^{-1} (x - \mu).$$

• Two points have the same Mahalanobis distance if they are on the same
ellipsoid centered on µ (as defined earlier).
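A small sketch of the Mahalanobis distance calculation (NumPy assumed; the example µ and Σ reuse the earlier illustrative values):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Distance D such that D^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])
# Distinct points lying on the same ellipsoid have the same distance from mu.
print(mahalanobis(np.array([1.0, 0.8]), mu, Sigma))
print(mahalanobis(np.array([-1.0, -0.8]), mu, Sigma))
```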
[Figure: points at equal Mahalanobis distance from µ lie on the same ellipse centred on µ.]
9
Which Is Closest?
• Suppose we wish to find the mean µk to which a point x is closest, as
measured by Mahalanobis distance.
• That is, we want to find the k that minimizes the expression:

$$(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k)$$

• The point x is closer to µk than it is to µl (under Mahalanobis distance) when:

$$(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) < (x - \mu_l)^{T} \Sigma_l^{-1} (x - \mu_l).$$

• This is a quadratic expression in x; a small sketch of choosing the closest mean in this sense follows.
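A hypothetical sketch of the rule, with made-up means and covariance matrices (NumPy assumed):

```python
import numpy as np

def closest_group(x, mus, Sigmas):
    """Index k minimising the squared Mahalanobis distance (x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
    d2 = [(x - mu) @ np.linalg.inv(S) @ (x - mu) for mu, S in zip(mus, Sigmas)]
    return int(np.argmin(d2))

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5],
                               [0.5, 1.0]])]
print(closest_group(np.array([1.0, 0.5]), mus, Sigmas))   # 0: nearer the first mean
```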

10
When Covariance is Equal
• If Σk = Σ for all k, then the previous expression becomes:

$$(x - \mu_k)^{T} \Sigma^{-1} (x - \mu_k) < (x - \mu_l)^{T} \Sigma^{-1} (x - \mu_l).$$

• Expanding both quadratic forms and cancelling the common term x^T Σ^{-1} x, this can be simplified as:

$$-2 x^{T} \Sigma^{-1} \mu_k + \mu_k^{T} \Sigma^{-1} \mu_k < -2 x^{T} \Sigma^{-1} \mu_l + \mu_l^{T} \Sigma^{-1} \mu_l$$

$$\Leftrightarrow\; -2 \mu_k^{T} \Sigma^{-1} x + \mu_k^{T} \Sigma^{-1} \mu_k < -2 \mu_l^{T} \Sigma^{-1} x + \mu_l^{T} \Sigma^{-1} \mu_l$$

• This is now a linear expression in x.

11
Estimating Equal Covariance
• In LDA we need to pool the covariance matrices of individual classes.
• Remember that the sample covariance matrix Q for a set of n observations
of dimension m is the matrix whose elements are

$$q_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

for i = 1, 2, . . . , m and j = 1, 2, . . . , m.
• Then the pooled covariance matrix is defined as:

$$Q_p = \frac{1}{n-g} \sum_{l=1}^{g} (n_l - 1) Q_l$$

where g is the number of classes, Ql is the estimated sample covariance
matrix for class l, nl is the number of data points in class l, whilst n is the
total number of data points.
12
Estimating Equal Covariance
• This formula arises from summing the squares and cross products over data
points in all classes:

$$W_{ij} = \sum_{l=1}^{g} \sum_{k=1}^{n_l} (x_{ki} - \bar{x}_{li})(x_{kj} - \bar{x}_{lj})$$

for i = 1, . . . , m and j = 1, . . . , m, where $\bar{x}_{li}$ denotes the mean of variable i within class l.
• Hence:

$$W = \sum_{l=1}^{g} (n_l - 1) Q_l$$

• Given n data points falling in g groups, we have n − g degrees of freedom
because we need to estimate the g group means.
• This results in the previous formula for the pooled covariance matrix:

$$Q_p = \frac{W}{n-g}$$
13
Modelling Assumptions
• Both LDA and QDA are parametric statistical methods.
• To classify a new observation xi into one of the known K groups, we need
P(i ∈ k|xi) for k = 1, . . . , K.
• That is, we need to know the posterior probability of belonging to each
group, given the data.
• We then classify the new observation as belonging to the class which has
the largest posterior probability.
• Bayes' Theorem states that the posterior probability of observation i
belonging to group k is:

$$P(i \in k \mid x_i) = \frac{\pi_k\, f(x_i \mid i \in k)}{\sum_{l=1}^{K} \pi_l\, f(x_i \mid i \in l)}$$

14
Modelling Assumptions
• Discriminant analysis assumes that observations from group k follow a
MVN distribution with mean µk and covariance Σk .
• That is:

$$f(x_i \mid i \in k) = f(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\left[ -\frac{1}{2} (x_i - \mu_k)^{T} \Sigma_k^{-1} (x_i - \mu_k) \right]$$

• Discriminant analysis (as presented here) also assumes values for
πk = P(i ∈ k), which is the proportion of population objects belonging to
class k (this can be known or estimated).
• Note that $\sum_{k=1}^{K} \pi_k = 1$.
• Typically, πk = 1/K is used.
• The πk are sometimes referred to as prior probabilities.
• We can then compute P(i ∈ k|xi) and assign data points to groups so as to
maximise this probability; a small numerical sketch follows.
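A sketch of computing these posterior probabilities via Bayes' Theorem, with made-up group parameters and equal priors (NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, pis, mus, Sigmas):
    """P(i in k | x_i) is proportional to pi_k * f(x_i | mu_k, Sigma_k); normalise over k."""
    unnorm = np.array([pi * multivariate_normal(mean=mu, cov=S).pdf(x)
                       for pi, mu, S in zip(pis, mus, Sigmas)])
    return unnorm / unnorm.sum()

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
post = posteriors(np.array([1.5, 1.0]), [0.5, 0.5], mus, Sigmas)
print(post, "-> assign to group", post.argmax())
```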
15
Some Calculations
• The probability of an observation i belonging to group k, conditional on xi
being known, satisfies:

$$P(i \in k \mid x_i) \propto \pi_k\, f(x_i \mid \mu_k, \Sigma_k).$$

• Hence,

$$P(i \in k \mid x_i) > P(i \in l \mid x_i) \;\Leftrightarrow\; \pi_k\, f(x_i \mid \mu_k, \Sigma_k) > \pi_l\, f(x_i \mid \mu_l, \Sigma_l)$$

• Taking logarithms and substituting in the probability density function for a
MVN distribution, we find after simplification:

$$\log \pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^{T}\Sigma_k^{-1}(x_i - \mu_k) \;>\; \log \pi_l - \frac{1}{2}\log|\Sigma_l| - \frac{1}{2}(x_i - \mu_l)^{T}\Sigma_l^{-1}(x_i - \mu_l)$$
16
Linear Discriminant Analysis
• If equal covariances are assumed, then P(i ∈ k|xi) > P(i ∈ l|xi) if and only if:

$$\log \pi_k + x_i^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k \;>\; \log \pi_l + x_i^{T}\Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^{T}\Sigma^{-1}\mu_l.$$
• Hence the name linear discriminant analysis.
• If πk = 1/K for all k, then this reduces further to:

$$\left( x_i - \frac{1}{2}(\mu_k + \mu_l) \right)^{T} \Sigma^{-1} (\mu_k - \mu_l) > 0$$
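A sketch of the resulting linear classification rule, assuming the pooled covariance and the group means have already been estimated; the parameter values below are illustrative, not from the slides. (Libraries such as scikit-learn provide ready-made LDA implementations, but this follows the score above directly.)

```python
import numpy as np

def lda_scores(x, pis, mus, Sigma):
    """Score for class k: log(pi_k) + x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([np.log(pi) + x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu
                     for pi, mu in zip(pis, mus)])

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
scores = lda_scores(np.array([1.5, 1.0]), [0.5, 0.5], mus, Sigma)
print(scores.argmax())   # assign to the class with the largest linear score
```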

17
Quadratic Discriminant Analysis
• No simplification arises in the unequal covariance case, hence
P(i ∈ k|xi) > P(i ∈ l|xi) if and only if:

$$\log \pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^{T}\Sigma_k^{-1}(x_i - \mu_k) \;>\; \log \pi_l - \frac{1}{2}\log|\Sigma_l| - \frac{1}{2}(x_i - \mu_l)^{T}\Sigma_l^{-1}(x_i - \mu_l)$$
• Hence the name quadratic discriminant analysis.
• If πk = 1/K for all k, then some simplification arises, since the log πk and log πl terms cancel.
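The same sketch with class-specific covariance matrices gives the quadratic rule; again the parameter values are made up for illustration (NumPy assumed).

```python
import numpy as np

def qda_scores(x, pis, mus, Sigmas):
    """Score for class k: log(pi_k) - 0.5*log|Sigma_k| - 0.5*(x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
    scores = []
    for pi, mu, S in zip(pis, mus, Sigmas):
        _, logdet = np.linalg.slogdet(S)
        diff = x - mu
        scores.append(np.log(pi) - 0.5 * logdet - 0.5 * diff @ np.linalg.inv(S) @ diff)
    return np.array(scores)

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.0],
                               [0.0, 0.5]])]
scores = qda_scores(np.array([1.5, 1.0]), [0.5, 0.5], mus, Sigmas)
print(scores.argmax())   # assign to the class with the largest quadratic score
```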

18
Summary
• In LDA the decision boundary between class k and class l is given by:

$$\log \frac{P(k \mid x)}{P(l \mid x)} = \log \frac{\pi_k}{\pi_l} + \log \frac{f(x \mid k)}{f(x \mid l)} = 0$$

• Unlike k-nearest neighbours, both LDA and QDA are model-based classifiers
where P(data|group) is assumed to follow a MVN distribution:
– The model based assumption allows for the generation of the probability
for class membership.
– The MVN assumption means that groups are assumed to follow an
elliptical shape.
• Whilst LDA assumes groups have the same covariance matrix, QDA
permits different covariance structures between groups.

19
