
Multivariate Analysis (slides 8)

• Today we consider Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).
• These are used when it is assumed that there are K groups within the
data and that a subset of the data is labelled, i.e., its group
membership is known.
• Discriminant analysis refers to a set of ‘supervised’ statistical techniques
where the class information is used to help reveal the structure of the data.
• This structure then allows the ‘classification’ of future observations.

1
Discriminant Analysis
• We want to be able to use knowledge of labelled data (i.e., those whose
group membership is known) in order to classify the group membership of
unlabelled data.
• We previously considered the k-nearest neighbours technique for this
problem.
• We shall now consider the alternative approaches of:
– LDA (linear discriminant analysis)
– QDA (quadratic discriminant analysis)

2
LDA & QDA
• Unlike k-Nearest Neighbours (and all the other techniques covered so far),
both LDA and QDA assume a distribution over the data.
• Once we introduce distributions (and parameters for these distributions),
we can quantify our uncertainty over the structure of the data.
• As far as classification is concerned, this means that we can consider the
probability of group assignment.
• The distinction between a point that is assigned a probability of 0.51 to one
group and 0.49 to another, against a point that is assigned a probability of
0.99 to one group and 0.01 to another, can be quite important.

3
Multivariate Normal Distribution
• Let x^T = (x_1, x_2, . . . , x_m), where x_1, x_2, . . . , x_m are random variables.
• The Multi-Variate Normal (MVN) distribution has two parameters:
– Mean µ, an m-dimensional vector.
– Covariance matrix Σ, with dimension m × m.
• A vector x is said to follow a MVN distribution, denoted x ∼ MVN(µ, Σ),
if it has the following probability density function:

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right]$$

• Here |Σ| denotes the determinant of Σ.
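As a quick numerical check (not part of the original slides), the density can be evaluated directly from the formula above; the helper name mvn_pdf and the example values of µ and Σ are made up for illustration, and SciPy is assumed to be available for comparison.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate the MVN density at x using the formula on this slide."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff            # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])
x = np.array([0.5, -1.0])
print(mvn_pdf(x, mu, Sigma))                              # hand-rolled formula
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))     # SciPy check: values should agree
```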

4
Multivariate Normal Distribution
• The MVN distribution is very useful when modelling multivariate data.
• Notice:
$$\{x : f(x \mid \mu, \Sigma) > C\} = \left\{ x : (x - \mu)^{T} \Sigma^{-1} (x - \mu) < -2 \log\!\left[ C\,(2\pi)^{m/2} |\Sigma|^{1/2} \right] \right\}$$

• This corresponds to an m-dimensional ellipsoid centered at point µ.


• If it is assumed that the data within a group k follows a MVN distribution
with mean µk and covariance Σk , then the scatter of the data should be
roughly elliptical.
• The mean fixes the location of the scatter and the covariance affects the
shape of the ellipsoid.

5
Normal Contours
   
• For example, the contour plot of a MVN with mean $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and covariance $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 3 \end{pmatrix}$ is shown below:
[Figure: contour plot of the MVN(µ, Σ) density.]

6
Normal Contours: Data
• Sampling from this distribution and overlaying the results on the contour
plot gives:
[Figure: random sample overlaid on the contour plot of the density.]
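A minimal sketch of how plots like these could be reproduced, assuming NumPy, SciPy and Matplotlib are available; the grid ranges and sample size are arbitrary choices, not taken from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])
rv = multivariate_normal(mean=mu, cov=Sigma)

# Contours of the density over a grid of (x, y) points
xs, ys = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-6, 6, 200))
plt.contour(xs, ys, rv.pdf(np.dstack([xs, ys])))

# Overlay a random sample from the same distribution
sample = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=200)
plt.scatter(sample[:, 0], sample[:, 1], s=10)
plt.show()
```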

7
Shape of Scatter
• If we assume that the data within each group follows a MVN distribution
with mean µk and covariance Σk , then we also assume that the scatter is
roughly elliptical.
• The mean sets the location of this scatter and the covariance sets the shape
of the ellipse.

[Figure: two elliptical data scatters, whose location is set by the mean and whose shape is set by the covariance.]
8
Mahalanobis Distance
• The Mahalanobis distance from a point x to a mean µ is D, where

$$D^{2} = (x - \mu)^{T} \Sigma^{-1} (x - \mu).$$

• Two points have the same Mahalanobis distance if they are on the same
ellipsoid centered on µ (as defined earlier).
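A small sketch of the Mahalanobis distance calculation (NumPy assumed; the example µ and Σ reuse the earlier illustrative values):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Distance D such that D^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])
# Distinct points lying on the same ellipsoid have the same distance from mu.
print(mahalanobis(np.array([1.0, 0.8]), mu, Sigma))
print(mahalanobis(np.array([-1.0, -0.8]), mu, Sigma))
```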
[Figure: points at equal Mahalanobis distance from µ lie on the same ellipse centred on µ.]
9
Which Is Closest?
• Suppose we wish to find the mean µk to which a point x is closest, as
measured by Mahalanobis distance.
• That is, we want to find the k that minimizes the expression:

$$(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k)$$

• The point x is closer to µk than it is to µl (under Mahalanobis distance) when:

$$(x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) < (x - \mu_l)^{T} \Sigma_l^{-1} (x - \mu_l).$$

• This is a quadratic expression in x; a small sketch of choosing the closest mean in this sense follows.
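A hypothetical sketch of the rule, with made-up means and covariance matrices (NumPy assumed):

```python
import numpy as np

def closest_group(x, mus, Sigmas):
    """Index k minimising the squared Mahalanobis distance (x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
    d2 = [(x - mu) @ np.linalg.inv(S) @ (x - mu) for mu, S in zip(mus, Sigmas)]
    return int(np.argmin(d2))

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5],
                               [0.5, 1.0]])]
print(closest_group(np.array([1.0, 0.5]), mus, Sigmas))   # 0: nearer the first mean
```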

10
When Covariance is Equal
• If Σk = Σ for all k, then the previous expression becomes:

$$(x - \mu_k)^{T} \Sigma^{-1} (x - \mu_k) < (x - \mu_l)^{T} \Sigma^{-1} (x - \mu_l).$$

• Expanding both quadratic forms and cancelling the common term x^T Σ^{-1} x, this can be simplified as:

$$-2 x^{T} \Sigma^{-1} \mu_k + \mu_k^{T} \Sigma^{-1} \mu_k < -2 x^{T} \Sigma^{-1} \mu_l + \mu_l^{T} \Sigma^{-1} \mu_l$$

$$\Leftrightarrow\; -2 \mu_k^{T} \Sigma^{-1} x + \mu_k^{T} \Sigma^{-1} \mu_k < -2 \mu_l^{T} \Sigma^{-1} x + \mu_l^{T} \Sigma^{-1} \mu_l$$

• This is now a linear expression in x.

11
Estimating Equal Covariance
• In LDA we need to pool the covariance matrices of individual classes.
• Remember that the sample covariance matrix Q for a set of n observations
of dimension m is the matrix whose elements are

$$q_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$

for i = 1, 2, . . . , m and j = 1, 2, . . . , m.
• Then the pooled covariance matrix is defined as:

$$Q_p = \frac{1}{n-g} \sum_{l=1}^{g} (n_l - 1) Q_l$$

where g is the number of classes, Ql is the estimated sample covariance
matrix for class l, nl is the number of data points in class l, whilst n is the
total number of data points.
12
Estimating Equal Covariance
• This formula arises from summing the squares and cross products over data
points in all classes:

$$W_{ij} = \sum_{l=1}^{g} \sum_{k=1}^{n_l} (x_{ki} - \bar{x}_{li})(x_{kj} - \bar{x}_{lj})$$

for i = 1, . . . , m and j = 1, . . . , m, where $\bar{x}_{li}$ denotes the mean of variable i within class l.
• Hence:

$$W = \sum_{l=1}^{g} (n_l - 1) Q_l$$

• Given n data points falling in g groups, we have n − g degrees of freedom
because we need to estimate the g group means.
• This results in the previous formula for the pooled covariance matrix:

$$Q_p = \frac{W}{n-g}$$
13
Modelling Assumptions
• Both LDA and QDA are parametric statistical methods.
• To classify a new observation xi into one of the known K groups, we need
P(i ∈ k|xi) for k = 1, . . . , K.
• That is, we need to know the posterior probability of belonging to each
group, given the data.
• We then classify the new observation as belonging to the class which has
the largest posterior probability.
• Bayes' Theorem states that the posterior probability of observation i
belonging to group k is:

$$P(i \in k \mid x_i) = \frac{\pi_k\, f(x_i \mid i \in k)}{\sum_{l=1}^{K} \pi_l\, f(x_i \mid i \in l)}$$

14
Modelling Assumptions
• Discriminant analysis assumes that observations from group k follow a
MVN distribution with mean µk and covariance Σk .
• That is:

$$f(x_i \mid i \in k) = f(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{m/2}\,|\Sigma_k|^{1/2}} \exp\left[ -\frac{1}{2} (x_i - \mu_k)^{T} \Sigma_k^{-1} (x_i - \mu_k) \right]$$

• Discriminant analysis (as presented here) also assumes values for
πk = P(i ∈ k), which is the proportion of population objects belonging to
class k (this can be known or estimated).
• Note that $\sum_{k=1}^{K} \pi_k = 1$.
• Typically, πk = 1/K is used.
• The πk are sometimes referred to as prior probabilities.
• We can then compute P(i ∈ k|xi) and assign data points to groups so as to
maximise this probability; a small numerical sketch follows.
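A sketch of computing these posterior probabilities via Bayes' Theorem, with made-up group parameters and equal priors (NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, pis, mus, Sigmas):
    """P(i in k | x_i) is proportional to pi_k * f(x_i | mu_k, Sigma_k); normalise over k."""
    unnorm = np.array([pi * multivariate_normal(mean=mu, cov=S).pdf(x)
                       for pi, mu, S in zip(pis, mus, Sigmas)])
    return unnorm / unnorm.sum()

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
post = posteriors(np.array([1.5, 1.0]), [0.5, 0.5], mus, Sigmas)
print(post, "-> assign to group", post.argmax())
```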
15
Some Calculations
• The probability of an observation i belonging to group k, conditional on xi
being known, satisfies:

$$P(i \in k \mid x_i) \propto \pi_k\, f(x_i \mid \mu_k, \Sigma_k).$$

• Hence,

$$P(i \in k \mid x_i) > P(i \in l \mid x_i) \;\Leftrightarrow\; \pi_k\, f(x_i \mid \mu_k, \Sigma_k) > \pi_l\, f(x_i \mid \mu_l, \Sigma_l)$$

• Taking logarithms and substituting in the probability density function for a
MVN distribution, we find after simplification:

$$\log \pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^{T}\Sigma_k^{-1}(x_i - \mu_k) \;>\; \log \pi_l - \frac{1}{2}\log|\Sigma_l| - \frac{1}{2}(x_i - \mu_l)^{T}\Sigma_l^{-1}(x_i - \mu_l)$$
16
Linear Discriminant Analysis
• If equal covariances are assumed, then P(i ∈ k|xi) > P(i ∈ l|xi) if and only if:

$$\log \pi_k + x_i^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k \;>\; \log \pi_l + x_i^{T}\Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^{T}\Sigma^{-1}\mu_l.$$
• Hence the name linear discriminant analysis.
• If πk = 1/K for all k, then this reduces further to:

$$\left( x_i - \frac{1}{2}(\mu_k + \mu_l) \right)^{T} \Sigma^{-1} (\mu_k - \mu_l) > 0$$
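A sketch of the resulting linear classification rule, assuming the pooled covariance and the group means have already been estimated; the parameter values below are illustrative, not from the slides. (Libraries such as scikit-learn provide ready-made LDA implementations, but this follows the score above directly.)

```python
import numpy as np

def lda_scores(x, pis, mus, Sigma):
    """Score for class k: log(pi_k) + x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([np.log(pi) + x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu
                     for pi, mu in zip(pis, mus)])

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
scores = lda_scores(np.array([1.5, 1.0]), [0.5, 0.5], mus, Sigma)
print(scores.argmax())   # assign to the class with the largest linear score
```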

17
Quadratic Discriminant Analysis
• No simplification arises in the unequal covariance case, hence
P(i ∈ k|xi) > P(i ∈ l|xi) if and only if:

$$\log \pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^{T}\Sigma_k^{-1}(x_i - \mu_k) \;>\; \log \pi_l - \frac{1}{2}\log|\Sigma_l| - \frac{1}{2}(x_i - \mu_l)^{T}\Sigma_l^{-1}(x_i - \mu_l)$$
• Hence the name quadratic discriminant analysis.
• If πk = 1/K for all k, then some simplification arises, since the log πk and log πl terms cancel.
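The same sketch with class-specific covariance matrices gives the quadratic rule; again the parameter values are made up for illustration (NumPy assumed).

```python
import numpy as np

def qda_scores(x, pis, mus, Sigmas):
    """Score for class k: log(pi_k) - 0.5*log|Sigma_k| - 0.5*(x - mu_k)^T Sigma_k^{-1} (x - mu_k)."""
    scores = []
    for pi, mu, S in zip(pis, mus, Sigmas):
        _, logdet = np.linalg.slogdet(S)
        diff = x - mu
        scores.append(np.log(pi) - 0.5 * logdet - 0.5 * diff @ np.linalg.inv(S) @ diff)
    return np.array(scores)

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.0],
                               [0.0, 0.5]])]
scores = qda_scores(np.array([1.5, 1.0]), [0.5, 0.5], mus, Sigmas)
print(scores.argmax())   # assign to the class with the largest quadratic score
```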

18
Summary
• In LDA the decision boundary between class k and class l is given by:

$$\log \frac{P(k \mid x)}{P(l \mid x)} = \log \frac{\pi_k}{\pi_l} + \log \frac{f(x \mid k)}{f(x \mid l)} = 0$$

• Unlike k-nearest neighbours, both LDA and QDA are model-based classifiers
where P(data|group) is assumed to follow a MVN distribution:
– The model based assumption allows for the generation of the probability
for class membership.
– The MVN assumption means that groups are assumed to follow an
elliptical shape.
• Whilst LDA assumes groups have the same covariance matrix, QDA
permits different covariance structures between groups.

19
