
Classification Methods II: Linear and Quadratic Discriminant Analysis

Rebecca C. Steorts, Duke University

STA 325, Chapter 4 ISL

Agenda

- Linear Discriminant Analysis (LDA)

Classification

- Recall that linear regression assumes the response is quantitative.
- In many cases, the response is qualitative (categorical).

Here, we study approaches for predicting qualitative responses, a
process that is known as classification.
Predicting a qualitative response for an observation can be referred
to as classifying that observation, since it involves assigning the
observation to a category, or class.
We have already covered two such methods, namely K-nearest neighbors (KNN)
and logistic regression.

Setup

We have a set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we
can use to build a classifier.
We want our classifier to perform well on both the training and the
test data.

Linear Discriminant Analysis (LDA)

- Logistic regression involves directly modeling $\Pr(Y = k \mid X = x)$ using the logistic function, given by (4.7) in ISL for the case of two response classes.
- That is, we model the conditional distribution of the response $Y$, given the predictor(s) $X$.
- We now consider an alternative and less direct approach to estimating these probabilities.
- We model the distribution of the predictors $X$ separately in each of the response classes (i.e., given $Y$), and then use Bayes' theorem to flip these around into estimates for $\Pr(Y = k \mid X = x)$.
- When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.
- Why do we need another method when we have logistic regression?

LDA

- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable.
- Linear discriminant analysis (LDA) does not suffer from this problem.
- If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
- Linear discriminant analysis is popular when we have more than two response classes.

Using Bayes' Theorem for Classification

- We wish to classify an observation into one of $K$ classes, $K \geq 2$.
- Let $\pi_k$ be the overall or prior probability that a randomly chosen observation comes from the $k$th class.
- Let $f_k(x) = \Pr(X = x \mid Y = k)$ be the density function of $X$ for an observation that comes from the $k$th class.
- $f_k(x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$.
- $f_k(x)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$.
- Then **Bayes' theorem** states that

$$ P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{i=1}^{K} \pi_i f_i(x)}. \tag{1} $$
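To make equation (1) concrete, here is a minimal Python sketch. The helper names `normal_pdf` and `posterior`, the priors, and the two unit-variance normal densities are all assumptions chosen for illustration; none of them come from the slides.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a N(mean, sd^2) distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior(x, priors, densities):
    """Equation (1): pi_k f_k(x) / sum_i pi_i f_i(x), returned for every class k."""
    weighted = [pi * f(x) for pi, f in zip(priors, densities)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-class problem: priors 0.7 and 0.3, unit-variance normals
# centered at -1 and +1.
priors = [0.7, 0.3]
densities = [lambda x: normal_pdf(x, -1.0, 1.0),
             lambda x: normal_pdf(x, 1.0, 1.0)]

probs = posterior(0.0, priors, densities)
print(probs)
```

At $x = 0$ the two densities are equal by symmetry, so the posterior probabilities reduce to the priors; moving $x$ toward either mean shifts the posterior toward that class.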

Using Bayes' Theorem for Classification

- Let $p_k(X) = \Pr(Y = k \mid X)$.
- Instead of directly computing $p_k(X)$, we can simply plug in estimates of $\pi_k$ and $f_k(X)$ into (1).
- Estimating $\pi_k$ is easy if we have a random sample of $Y$s from the population.
- We simply compute the fraction of the training observations that belong to the $k$th class.
- However, estimating $f_k(X)$ tends to be more challenging, unless we assume some simple forms for these densities.

Bayes Classifier

- A Bayes classifier classifies an observation to the class for which $p_k(x)$ is largest and has the lowest possible error rate out of all classifiers.
- (This is of course only true if the terms in (1) are all correctly specified.)
- Therefore, if we can find a way to estimate $f_k(X)$, then we can develop a classifier that approximates the Bayes classifier.
- Such an approach is the topic of the following sections.

Intro to LDA

- Assume that $p = 1$, that is, we only have one predictor.
- We would like to obtain an estimate for $f_k(x)$ that we can plug into (1) in order to estimate $p_k(x)$.
- We will then classify an observation to the class for which $p_k(x)$ is greatest.
- In order to estimate $f_k(x)$, we will first make some assumptions about its form.
- Assume that $f_k(x)$ is normal or Gaussian with mean $\mu_k$ and variance $\sigma_k^2$.
- Assume that $\sigma_1^2 = \cdots = \sigma_K^2 =: \sigma^2$, meaning there is a shared variance term across all $K$ classes.

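The normal (Gaussian) density

Recall that the normal density is

$$ f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu_k)^2 \right\}. \tag{2} $$

Plugging (2) into (1) yields

$$ p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu_k)^2 \right\}}{\sum_{i=1}^{K} \pi_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu_i)^2 \right\}}. \tag{3} $$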

LDA (continued)

- The Bayes classifier involves assigning an observation $X = x$ to the class for which (3) is largest.
- Taking the log of (3) and rearranging the terms, we can show that this is equivalent to assigning the observation to the class for which

$$ \delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \tag{4} $$

is largest (Exercise).
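One possible route through the exercise is sketched here, assuming the normal density carries its usual negative exponent, $-(x - \mu_k)^2 / (2\sigma^2)$. The denominator of (3) is the same for every class, so only the log of the numerator matters when comparing classes:

```latex
\begin{align*}
\log\bigl(\pi_k f_k(x)\bigr)
  &= \log \pi_k - \log\bigl(\sqrt{2\pi}\,\sigma\bigr) - \frac{(x - \mu_k)^2}{2\sigma^2} \\
  &= \log \pi_k - \log\bigl(\sqrt{2\pi}\,\sigma\bigr)
     - \frac{x^2}{2\sigma^2} + x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} \\
  &= \underbrace{x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k}_{\delta_k(x)}
     \;+\; \underbrace{\Bigl( -\frac{x^2}{2\sigma^2} - \log\bigl(\sqrt{2\pi}\,\sigma\bigr) \Bigr)}_{\text{does not depend on } k}.
\end{align*}
```

Because the second group of terms is shared by all classes, the class maximizing (3) is the class maximizing $\delta_k(x)$. The $x^2$ term cancels only because the variance is common to all classes.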

LDA (continued)

- For instance, if $K = 2$ and $\pi_1 = \pi_2$, then the Bayes classifier assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise.
- In this case, the Bayes decision boundary corresponds to the point where

$$ x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{(\mu_1 - \mu_2)(\mu_1 + \mu_2)}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}. $$
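As a quick numerical check of the two-class case, the sketch below evaluates the discriminant (4) on either side of the midpoint. The means, variance, and function names are hypothetical values picked for illustration.

```python
import math

def delta(x, mu, sigma2, prior):
    """One-predictor discriminant from equation (4)."""
    return x * mu / sigma2 - mu ** 2 / (2.0 * sigma2) + math.log(prior)

# Hypothetical parameters: two classes, equal priors, shared variance.
mu1, mu2, sigma2 = 1.0, 3.0, 1.0
boundary = (mu1 + mu2) / 2.0  # midpoint of the two class means

def bayes_class(x):
    """Return 1 or 2, whichever class has the larger discriminant at x."""
    return 1 if delta(x, mu1, sigma2, 0.5) > delta(x, mu2, sigma2, 0.5) else 2

print(boundary, bayes_class(1.9), bayes_class(2.1))
```

Points just below the midpoint go to the class with the smaller mean and points just above go to the other class, which is what the boundary formula predicts when the priors are equal.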

Bayes decision boundaries

[Figure 1: Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.]

Bayes decision boundaries

- In practice, even if we are quite certain of our assumption that $X$ is drawn from a Gaussian distribution within each class, we still have to estimate the parameters $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$, and $\sigma^2$.
- The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into (4).
- In particular, the following estimates are used:

Next part

$$ \hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i \tag{5} $$

$$ \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)^2 \tag{6} $$

- Here $n$ is the total number of training observations, and $n_k$ is the number of training observations in the $k$th class.
- $\hat{\mu}_k$ is simply the average of all the training observations from the $k$th class.
- $\hat{\sigma}^2$ can be seen as a weighted average of the sample variances for each of the $K$ classes.

Next part

- Sometimes we have knowledge of the class membership probabilities $\pi_1, \ldots, \pi_K$, which can be used directly.
- In the absence of any additional information, LDA estimates $\pi_k$ using the proportion of the training observations that belong to the $k$th class. In other words,

$$ \hat{\pi}_k = n_k / n. \tag{7} $$

LDA classifier

The LDA classifier plugs the estimates given in (5), (6), and (7) into (4), and assigns an observation $X = x$ to the class for which

$$ \hat{\delta}_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k) \tag{8} $$

is largest. The word linear in the classifier's name stems from the fact that the discriminant functions $\hat{\delta}_k(x)$ in (8) are linear functions of $x$ (as opposed to a more complex function of $x$).
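The plug-in recipe for one predictor can be sketched in a few lines of Python. This is an illustrative implementation under the assumptions of the slides (Gaussian classes, shared variance); the function names `fit_lda_1d` and `predict_lda_1d` and the toy data are made up for this example.

```python
import math
from collections import defaultdict

def fit_lda_1d(xs, ys):
    """Estimate (5), (6), and (7) from one predictor xs and class labels ys."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n, K = len(xs), len(groups)
    mu = {k: sum(v) / len(v) for k, v in groups.items()}             # (5)
    ss = sum((x - mu[k]) ** 2 for k, v in groups.items() for x in v)
    sigma2 = ss / (n - K)                                            # (6)
    prior = {k: len(v) / n for k, v in groups.items()}               # (7)
    return mu, sigma2, prior

def predict_lda_1d(x, mu, sigma2, prior):
    """Assign x to the class with the largest estimated discriminant (8)."""
    def delta(k):
        return x * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + math.log(prior[k])
    return max(mu, key=delta)

# Made-up training data, for illustration only.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = ["a", "a", "a", "b", "b", "b"]
mu, sigma2, prior = fit_lda_1d(xs, ys)
print(mu, sigma2, prior)
print(predict_lda_1d(4.0, mu, sigma2, prior), predict_lda_1d(5.0, mu, sigma2, prior))
```

With equal class sizes the estimated boundary sits at the midpoint of the two sample means (4.5 here), so 4.0 and 5.0 fall on opposite sides.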

LDA classifier (p > 1)

- We now extend the LDA classifier to the case of multiple predictors.
- To do this, we will assume that $X = (X_1, \ldots, X_p)$ is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.

Iris application

- We motivate this section with a very well known dataset (introduced by R.A. Fisher and available in R).
- We illustrate the Iris data through naive linear regression, give the multivariate form of LDA, and go through some extensions.

Iris application

Suppose that response categories are coded as an indicator variable. Suppose $G$ has $K$ classes; then $Y_1$ is a vector of 0's and 1's indicating, for example, whether each person is in class 1.

- The indicator response matrix is defined as $Y = (Y_1, \ldots, Y_K)$.
- $Y$ is a matrix of 0's and 1's with each row having a single 1 indicating a person is in class $k$.
- The $i$th person of interest has covariate values $x_{i1}, \ldots, x_{ip}$ that will be represented by $X_{N \times p}$.
- Our goal is to predict what class each observation is in given its covariate values.

Iris application

Let's proceed blindly and use a naive method of linear regression. We will do the following:

- Fit a linear regression to each column of $Y$.
- The coefficient matrix is $\hat{\beta} = (X^\top X)^{-1} X^\top Y$.
- The fitted values are $\hat{Y} = X (X^\top X)^{-1} X^\top Y$.
- The $k$th column of $\hat{\beta}$ contains the linear regression coefficient estimates that we get from regressing $Y_k$ onto $X_1, \ldots, X_p$.

LDA

Look at $\hat{Y}$ corresponding to the indicator variable for each class $k$. Assign each person to the class for which $\hat{Y}$ is the largest. More formally stated, a new observation with covariate $x$ is classified as follows:

- Compute the fitted output $\hat{Y}_{\text{new}}(x) = [(1, x)^\top \hat{\beta}]^\top$.
- Identify the largest component of $\hat{Y}_{\text{new}}(x)$ and classify according to $\hat{G}(x) = \arg\max_k \, [\hat{Y}_{\text{new}}(x)]_k$.

We now pose the question: does this approach make sense?

LDA

- The regression line estimates $E(Y_k \mid X = x) = P(G = k \mid X = x)$, so the method seems somewhat sensible at first.
- Although $\sum_k \hat{Y}_k(x) = 1$ for any $x$ as long as there is an intercept in the model (exercise), $\hat{Y}_k(x)$ can be negative or greater than 1, which is nonsensical given the initial problem statement.
- Worse problems can occur when classes are masked by others due to the rigid nature of the regression model.
- We illustrate the "masking effect" with the Iris dataset.

What is masking?

When the number of classes $K \geq 3$, a class may be hidden or masked by others because there is no region in the feature space that is labeled as this class.
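Masking can be reproduced on a tiny synthetic example. The sketch below assumes a single predictor with three classes laid out left to right (made-up data, not the Iris data) and fits one least-squares line per indicator column; `simple_ols`, `fitted`, and `classify` are hypothetical helper names.

```python
def simple_ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up data: classes 1, 2, 3 occupy x = 1..4, 5..8, 9..12 respectively.
xs = list(range(1, 13))
labels = [1] * 4 + [2] * 4 + [3] * 4
classes = [1, 2, 3]

# One regression per column of the indicator response matrix.
coefs = {k: simple_ols(xs, [1.0 if g == k else 0.0 for g in labels])
         for k in classes}

def fitted(x):
    """Fitted indicator value for each class at x."""
    return {k: b0 + b1 * x for k, (b0, b1) in coefs.items()}

def classify(x):
    """Class whose fitted indicator value is largest at x."""
    f = fitted(x)
    return max(f, key=f.get)

predictions = [classify(x) for x in xs]
print(predictions)   # class 2 never appears
print(fitted(1))     # values sum to 1, but one of them is negative
```

The fitted line for the middle class is essentially flat at 1/3, so one of the two outer classes always has a larger fitted value and the middle class is never predicted, even on its own training points. The fitted values still sum to 1 at every $x$, and some fall outside $[0, 1]$.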

Iris data

- The Iris data (Fisher, Annals of Eugenics, 1936) gives the measurements of sepal and petal length and width for 150 flowers using 3 species of iris (50 flowers per species).
- The species considered are setosa, versicolor, and virginica.
- To best illustrate the methods of classification, we considered how petal width and length predict the species of a flower.
- We see the dataset partially represented in Table 1.

Sepal L   Sepal W   Petal L   Petal W   Species
5.1       3.5       1.4       0.2       setosa
4.9       3.0       1.4       0.2       setosa
4.7       3.2       1.3       0.2       setosa
...       ...       ...       ...       ...
7.0       3.2       4.7       1.4       versicolor
6.4       3.2       4.5       1.5       versicolor
6.9       3.1       4.9       1.5       versicolor
...       ...       ...       ...       ...
6.3       3.3       6.0       2.5       virginica
5.8       2.7       5.1       1.9       virginica
7.1       3.0       5.9       2.1       virginica

Table 1: Iris data with 3 species (setosa, versicolor, and virginica)

Iris data

We can see that using linear regression (Figure 2) to predict the different classes can lead to a masking effect of one group or more. This occurs for the following reasons:

1. There is a plane that is high in the bottom left corner (setosa) and low in the top right corner (virginica).
2. There is a second plane that is high in the top right corner (virginica) but low in the bottom left corner (setosa).
3. The third plane is approximately flat since it tries to linearly fit a collection of points that is high in the middle (versicolor) and low on both ends.

Iris data

[Figure 2: Iris data, Petal Length (vertical axis) against Petal Width (horizontal axis); legend: Setosa, Versicolor, Virginica.]

Iris data with LDA

We instead consider LDA, which has the following assumptions: for each person, conditional on them being in class $k$, we assume

$$ X \mid G = k \sim N_p(\mu_k, \Sigma). $$

That is,

$$ f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right\}. $$

LDA assumes $\Sigma_k = \Sigma$ for all $k$ (just as was true in the $p = 1$ setting). In practice the parameters of the Gaussian distribution are unknown and must be estimated by:

- $\hat{\pi}_k = N_k / N$, where $N_k$ is the number of people of class $k$
- $\hat{\mu}_k = \sum_{i:\, g_i = k} x_i / N_k$
- $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{i:\, g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top / (N - K)$

LDA Computation

We're interested in computing

$$ P(G = k \mid X = x) = \frac{P(G = k, X = x)}{P(X = x)} = \frac{P(X = x \mid G = k)\, P(G = k)}{\sum_{j=1}^{K} P(X = x, G = j)} = \frac{f_k(x)\, \pi_k}{\sum_{j=1}^{K} f_j(x)\, \pi_j}. $$

We will compute $P(G = k \mid X = x)$ for each class $k$. Consider comparing $P(G = k_1 \mid X = x)$ and $P(G = k_2 \mid X = x)$.

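LDA Computation

Then

$$
\begin{aligned}
\log\left( \frac{P(G = k_1 \mid X = x)}{P(G = k_2 \mid X = x)} \right)
&= \log\left( \frac{f_{k_1}(x)\, \pi_{k_1}}{f_{k_2}(x)\, \pi_{k_2}} \right) \\
&= -\frac{1}{2} (x - \mu_{k_1})^\top \Sigma^{-1} (x - \mu_{k_1}) + \frac{1}{2} (x - \mu_{k_2})^\top \Sigma^{-1} (x - \mu_{k_2}) + \log\left( \frac{\pi_{k_1}}{\pi_{k_2}} \right) \\
&= (\mu_{k_1} - \mu_{k_2})^\top \Sigma^{-1} x - \frac{1}{2} \mu_{k_1}^\top \Sigma^{-1} \mu_{k_1} + \frac{1}{2} \mu_{k_2}^\top \Sigma^{-1} \mu_{k_2} + \log\left( \frac{\pi_{k_1}}{\pi_{k_2}} \right).
\end{aligned}
$$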

Boundary Lines for LDA

- Now let's consider the boundary between predicting someone to be in class $k_1$ or class $k_2$.
- To be on the boundary, we must decide what $x$ would need to be if we think that a person is equally likely to be in class $k_1$ or $k_2$.

This reduces to solving

$$ (\mu_{k_1} - \mu_{k_2})^\top \Sigma^{-1} x - \frac{1}{2} \mu_{k_1}^\top \Sigma^{-1} \mu_{k_1} + \frac{1}{2} \mu_{k_2}^\top \Sigma^{-1} \mu_{k_2} + \log\left( \frac{\pi_{k_1}}{\pi_{k_2}} \right) = 0, $$

which is linear in $x$.

Boundary Lines for LDA

- The boundary will be a line for two dimensional problems.
- The boundary will be a plane for three dimensional problems.

The linear log-odds function implies that our decision boundary between classes $k_1$ and $k_2$ will be the set where $P(G = k_1 \mid X = x) = P(G = k_2 \mid X = x)$, which is linear in $x$. In $p$ dimensions, this is a hyperplane.

We can then say that class $k_1$ is more likely than class $k_2$ if

$$
\begin{aligned}
& P(G = k_1 \mid X = x) > P(G = k_2 \mid X = x) \\
\implies\; & \log\left( \frac{P(G = k_1 \mid X = x)}{P(G = k_2 \mid X = x)} \right) > 0 \\
\implies\; & (\mu_{k_1} - \mu_{k_2})^\top \Sigma^{-1} x - \frac{1}{2} \mu_{k_1}^\top \Sigma^{-1} \mu_{k_1} + \frac{1}{2} \mu_{k_2}^\top \Sigma^{-1} \mu_{k_2} + \log\left( \frac{\pi_{k_1}}{\pi_{k_2}} \right) > 0.
\end{aligned}
$$

Linear Discriminant Function

The linear discriminant function $\delta_k^L(x)$ is defined as

$$ \delta_k^L(x) = \mu_k^\top \Sigma^{-1} x - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log(\pi_k). $$

We can tell which class is more likely for a particular value of $x$ by comparing the classes' linear discriminant functions.
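The following sketch puts the multivariate plug-in estimates and the linear discriminant function together for $p = 2$, using only the standard library. It assumes the discriminant carries the factor $\tfrac{1}{2}$ on $\mu_k^\top \Sigma^{-1} \mu_k$, as the log-odds derivation gives. The data and the names `inv2`, `fit_lda`, `delta_L`, and `predict_lda` are made up for illustration.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_lda(X, g):
    """Plug-in estimates: priors N_k/N, class means, pooled covariance over N - K."""
    classes = sorted(set(g))
    N, K = len(X), len(classes)
    prior, mu = {}, {}
    S = [[0.0, 0.0], [0.0, 0.0]]
    for k in classes:
        rows = [x for x, gi in zip(X, g) if gi == k]
        prior[k] = len(rows) / N
        mu[k] = [sum(r[j] for r in rows) / len(rows) for j in range(2)]
        for r in rows:
            d = [r[0] - mu[k][0], r[1] - mu[k][1]]
            for i in range(2):
                for j in range(2):
                    S[i][j] += d[i] * d[j] / (N - K)
    return prior, mu, S

def delta_L(x, prior_k, mu_k, S_inv):
    """mu_k' S^-1 x - 0.5 * mu_k' S^-1 mu_k + log(pi_k), using symmetry of S^-1."""
    w = [S_inv[0][0] * mu_k[0] + S_inv[0][1] * mu_k[1],
         S_inv[1][0] * mu_k[0] + S_inv[1][1] * mu_k[1]]  # S^-1 mu_k
    return (w[0] * x[0] + w[1] * x[1]
            - 0.5 * (w[0] * mu_k[0] + w[1] * mu_k[1])
            + math.log(prior_k))

def predict_lda(x, prior, mu, S):
    """Class with the largest linear discriminant at x."""
    S_inv = inv2(S)
    return max(prior, key=lambda k: delta_L(x, prior[k], mu[k], S_inv))

# Made-up two-class, two-predictor training data.
X = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0],
     [6.0, 5.0], [7.0, 4.0], [7.0, 6.0], [8.0, 5.0]]
g = ["a"] * 4 + ["b"] * 4
prior, mu, S = fit_lda(X, g)
print(prior, mu, S)
print(predict_lda([3.0, 3.0], prior, mu, S), predict_lda([6.0, 4.0], prior, mu, S))
```

For more than two predictors the same steps apply with a general matrix inverse (or a linear solve) in place of the 2x2 formula.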

Quadratic Discriminant Analysis (QDA)

We now introduce Quadratic Discriminant Analysis, which handles the following:

- If the $\Sigma_k$ are not assumed to be equal, then convenient cancellations in our derivations earlier do not occur.
- The quadratic pieces in $x$ end up remaining, leading to quadratic discriminant functions (QDA).
- QDA is similar to LDA except a covariance matrix must be estimated for each class $k$.

QDA

The quadratic discriminant function $\delta_k^Q(x)$ is defined as

$$ \delta_k^Q(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log(\pi_k). $$

LDA and QDA seem to be widely accepted due to a bias-variance trade-off that leads to stability of the models. That is, we want our model to have low variance, so we are willing to sacrifice some bias of a linear decision boundary in order for our model to be more stable.
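The quadratic discriminant function can be evaluated directly for $p = 2$. In the sketch below, the parameters are hypothetical: two classes share the same mean and prior but have different spreads, a case where no linear boundary can separate them. The names `delta_Q`, `params`, and `predict_qda` are made up for illustration.

```python
import math

def delta_Q(x, prior_k, mu_k, Sigma_k):
    """-0.5 log|Sigma_k| - 0.5 (x - mu_k)' Sigma_k^-1 (x - mu_k) + log(pi_k), p = 2."""
    (a, b), (c, d) = Sigma_k
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    u = [x[0] - mu_k[0], x[1] - mu_k[1]]
    quad = (u[0] * (inv[0][0] * u[0] + inv[0][1] * u[1])
            + u[1] * (inv[1][0] * u[0] + inv[1][1] * u[1]))
    return -0.5 * math.log(det) - 0.5 * quad + math.log(prior_k)

# Hypothetical parameters: same mean, equal priors, different covariance matrices.
params = {
    "tight": {"prior": 0.5, "mu": [0.0, 0.0], "Sigma": [[1.0, 0.0], [0.0, 1.0]]},
    "wide":  {"prior": 0.5, "mu": [0.0, 0.0], "Sigma": [[9.0, 0.0], [0.0, 9.0]]},
}

def predict_qda(x):
    """Class with the largest quadratic discriminant at x."""
    return max(params, key=lambda k: delta_Q(x, params[k]["prior"],
                                             params[k]["mu"], params[k]["Sigma"]))

print(predict_qda([0.5, 0.0]), predict_qda([4.0, 0.0]))
```

Points near the shared mean go to the low-variance class and points far away go to the high-variance class, so the decision boundary is a circle. With a common covariance matrix and equal means the two linear discriminants would differ only through the priors, which illustrates what the extra flexibility of QDA buys.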

LDA and QDA on Iris data

[Figure 3: Clustering of Iris data using LDA. Axes: Petal Width (horizontal) and Petal Length (vertical); legend: Setosa, Versicolor, Virginica.]

LDA and QDA on Iris data

[Figure 4: Clustering of Iris data using QDA. Axes: Petal Width (horizontal) and Petal Length (vertical); legend: Setosa, Versicolor, Virginica.]

Extensions of LDA and QDA

- We consider some extensions of LDA and QDA, namely Regularized Discriminant Analysis (RDA), proposed by Friedman (1989). This was proposed as a compromise between LDA and QDA.
- This method says that we should shrink the covariance matrices of QDA toward a common covariance matrix as done in LDA.
- Regularized covariance matrices take the form $\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}$, $0 \leq \alpha \leq 1$.
- In practice, $\alpha$ is chosen based on performance of the model on validation data or by using cross-validation.
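The shrinkage formula is a simple elementwise blend of two matrices. A minimal sketch, with a hypothetical function name and made-up covariance estimates:

```python
def regularized_cov(Sigma_k, Sigma_pooled, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled, elementwise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[alpha * sk + (1.0 - alpha) * sp for sk, sp in zip(row_k, row_p)]
            for row_k, row_p in zip(Sigma_k, Sigma_pooled)]

# Hypothetical class-specific and pooled covariance estimates.
Sigma_k = [[4.0, 1.0], [1.0, 2.0]]
Sigma_pooled = [[2.0, 0.0], [0.0, 2.0]]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, regularized_cov(Sigma_k, Sigma_pooled, alpha))
```

Setting $\alpha = 0$ recovers the pooled matrix used by LDA and $\alpha = 1$ recovers the class-specific matrix used by QDA; intermediate values interpolate between the two.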

Extensions of LDA and QDA

- Fisher proposed an alternative derivation to dimension reduction in LDA that is equivalent to the ideas previously discussed.
- He suggested the proper way to rotate the coordinate axes was by maximizing the variance between classes relative to the variance within the classes.

Extensions

- Let $Z = a^\top X$ and find the linear combination $Z$ such that the between-class variance is maximized relative to the within-class variance.
- Denote the covariance of the centroids by $B$.
- Denote the pooled within-class covariance of the original data by $W$.
- The between-class variance of $Z$ is $a^\top B a$ and the within-class variance of $Z$ is $a^\top W a$.
- $B + W = T$, the total covariance matrix of $X$.

Extensions

Fisher's problem amounts to maximizing

$$ \max_{a} \frac{a^\top B a}{a^\top W a} \quad \text{(exercise)}. $$

The solution is the eigenvector of $W^{-1} B$ corresponding to its largest eigenvalue. Once we find the solution to the maximization problem above, denoted by $a_1$, we repeat the maximization, except this time the new maximizer, $a_2$, must be orthogonal to $a_1$. This process continues, and the $a_k$ are called the discriminant coordinates. In terms of what we did earlier, the $a_k$'s are equivalent to the $x_k$'s. Reduced-Rank LDA is desirable in high dimensions since it leads to a further dimension reduction in LDA.
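One way to see why an eigenvector of $W^{-1} B$ solves Fisher's problem is sketched here, assuming $W$ is invertible. The ratio is unchanged when $a$ is rescaled, so we may fix $a^\top W a = 1$ and use a Lagrange multiplier $\lambda$:

```latex
\begin{align*}
&\max_{a}\; a^\top B a \quad \text{subject to } a^\top W a = 1, \\
&\mathcal{L}(a, \lambda) = a^\top B a - \lambda \bigl( a^\top W a - 1 \bigr), \\
&\nabla_a \mathcal{L} = 2 B a - 2 \lambda W a = 0
  \;\Longrightarrow\; W^{-1} B\, a = \lambda a, \\
&\text{and at any such } a:\quad
  \frac{a^\top B a}{a^\top W a} = \frac{\lambda\, a^\top W a}{a^\top W a} = \lambda.
\end{align*}
```

Every stationary point is therefore an eigenvector of $W^{-1} B$, and the value of the ratio there is the matching eigenvalue, so the maximum is attained at the eigenvector belonging to the largest eigenvalue.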