
Bayes Classifiers

Probabilistic Setting for Classification



Let X = [X1 , . . . , Xd ]T ∈ Rd be a feature vector for a classification
problem, and let Y ∈ {1, . . . , K} be the associated class label.
Assumption
X and Y are jointly distributed random variables.

We will consider two perspectives for the joint distribution:

Marginal distribution of Y and conditional distribution of X|Y :

P (X, Y ) = P (Y )P (X|Y )

Marginal distribution of X and conditional distribution of Y |X:

P (X, Y ) = P (X)P (Y |X)



Example: Two-dimensional Data with Gaussian Class-conditional Distributions

Figure: Two-dimensional data with Gaussian class-conditional distributions.


Example: Two-dimensional Data (cont.) I

For a binary (K = 2) classification problem with a d = 2 dimensional feature space, the marginal distribution of Y is uniform on the values 1 and 2.

The conditional distribution of X|Y is

$$X \mid Y = 1 \sim N(\mu_1, \Sigma), \qquad X \mid Y = 2 \sim N(\mu_2, \Sigma),$$

where

$$\mu_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.3 \end{bmatrix}.$$
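As a concrete illustration of this generative model, here is a minimal Python sketch (assuming numpy; the seed and sample size are arbitrary choices) that draws labeled samples from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model parameters from the example above.
mu = {1: np.array([1.0, 1.0]), 2: np.array([1.0, -1.0])}
Sigma = np.array([[0.9, 0.4],
                  [0.4, 0.3]])

n = 500  # sample size, an arbitrary choice for the illustration

# Y is uniform on {1, 2}; X | Y = k ~ N(mu_k, Sigma).
y = rng.integers(1, 3, size=n)
X = np.stack([rng.multivariate_normal(mu[k], Sigma) for k in y])
```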


Example: Two-dimensional Data (cont.) II

Since Σ₂₂ = 0.3, X₂ given Y = k is Gaussian with variance 0.3, so the probability Pr(X₂ ≤ 0.5) is

$$\begin{aligned}
\Pr(X_2 \le 0.5) &= \Pr(X_2 \le 0.5 \mid Y = 1) \Pr(Y = 1) + \Pr(X_2 \le 0.5 \mid Y = 2) \Pr(Y = 2) \\
&= 0.5\, \Phi(0.5; 1, \sqrt{0.3}) + 0.5\, \Phi(0.5; -1, \sqrt{0.3}) \\
&= 0.5888,
\end{aligned}$$

where Φ(x; µ, σ) is the CDF of a N(µ, σ²) random variable.
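This value is easy to check numerically; a minimal sketch, assuming scipy is available:

```python
import numpy as np
from scipy.stats import norm

sigma = np.sqrt(0.3)  # std. dev. of X2 under either class, since Sigma_22 = 0.3
p = 0.5 * norm.cdf(0.5, loc=1, scale=sigma) + 0.5 * norm.cdf(0.5, loc=-1, scale=sigma)
print(round(p, 4))  # 0.5888
```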


Example: Two-dimensional Data (cont.) III


The marginal density of X is (by the law of total probability)

$$\begin{aligned}
P(X = x) &= P(X = x \mid Y = 1) P(Y = 1) + P(X = x \mid Y = 2) P(Y = 2) \\
&= \frac{1}{2} \phi(x; \mu_1, \Sigma) + \frac{1}{2} \phi(x; \mu_2, \Sigma),
\end{aligned}$$

where ϕ(x; µ, Σ) is the pdf of a N(µ, Σ) random variable.

The conditional distribution of Y|X = x is

$$\Pr(Y = 1 \mid X = x) = \frac{\frac{1}{2} \phi(x; \mu_1, \Sigma)}{\frac{1}{2} \phi(x; \mu_1, \Sigma) + \frac{1}{2} \phi(x; \mu_2, \Sigma)},$$

$$\Pr(Y = 2 \mid X = x) = \frac{\frac{1}{2} \phi(x; \mu_2, \Sigma)}{\frac{1}{2} \phi(x; \mu_1, \Sigma) + \frac{1}{2} \phi(x; \mu_2, \Sigma)}.$$
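A short sketch (again assuming scipy) that evaluates these posteriors at a test point; the point x below is an arbitrary choice:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
Sigma = np.array([[0.9, 0.4], [0.4, 0.3]])

x = np.array([0.5, 0.0])  # arbitrary test point

# Prior-weighted class-conditional densities (both priors are 1/2).
p1 = 0.5 * multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
p2 = 0.5 * multivariate_normal.pdf(x, mean=mu2, cov=Sigma)

post1 = p1 / (p1 + p2)  # Pr(Y = 1 | X = x)
post2 = p2 / (p1 + p2)  # Pr(Y = 2 | X = x)
```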



The Bayes Classifier

The Bayes Classifier: Introduction

Given a joint distribution (X, Y ), we may ask “what is the best possible
classifier for that joint distribution?”

Classifier function: f : Rd → {1, . . . , K}


Performance is measured by the probability of error:

$$R(f) := \Pr(f(X) \ne Y)$$

R(f ) is an example of a risk.


The Bayes risk is the smallest risk of any classifier:

$$R^* := \min_{\text{all } f} R(f)$$

If R(f) = R*, then f is called the Bayes classifier.
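For intuition, the Bayes risk can be estimated by Monte Carlo. A sketch for the two-Gaussian example above (the sample size and seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu = {1: np.array([1.0, 1.0]), 2: np.array([1.0, -1.0])}
Sigma = np.array([[0.9, 0.4], [0.4, 0.3]])

# Draw (X, Y) pairs from the joint distribution.
n = 100_000
y = rng.integers(1, 3, size=n)
means = np.stack([mu[k] for k in y])
X = means + rng.multivariate_normal(np.zeros(2), Sigma, size=n)

# Bayes classifier for equal priors: pick the larger class-conditional density.
d1 = multivariate_normal.pdf(X, mean=mu[1], cov=Sigma)
d2 = multivariate_normal.pdf(X, mean=mu[2], cov=Sigma)
f_star = np.where(d1 >= d2, 1, 2)

print("Monte Carlo estimate of R*:", np.mean(f_star != y))
```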


Terminology

Posterior distribution: ηk(x) = Pr(Y = k|X = x)

Prior distribution: πk = Pr(Y = k)

Class-conditional distributions: gk(x) = Pr(X = x|Y = k) (interpreted as a pdf when X is continuous)


Formulas and Theorem

Theorem 1
The Bayes classifier is given by

$$\begin{aligned}
f^*(x) &= \arg\max_{k=1,\dots,K} \eta_k(x) && (1) \\
&= \arg\max_{k=1,\dots,K} \pi_k g_k(x) && (2)
\end{aligned}$$


Proof of Theorem
The proof starts with the definition of R(f):

$$\begin{aligned}
R(f) &= E\big[\mathbf{1}_{\{f(X) \ne Y\}}\big] \\
&= E_X E_{Y|X}\big[\mathbf{1}_{\{f(X) \ne Y\}}\big] \\
&= E_X\left[ \sum_{k=1}^{K} P(Y = k \mid X)\, \mathbf{1}_{\{f(X) \ne k\}} \right] \\
&= E_X\big[ P(Y = 1 \mid X) + \cdots + P(Y = f(X) - 1 \mid X) \\
&\qquad\; + P(Y = f(X) + 1 \mid X) + \cdots + P(Y = K \mid X) \big] \\
&= E_X\big[ 1 - P(Y = f(X) \mid X) \big]
\end{aligned}$$

To minimize R(f), f should be a function such that f(x) = arg maxₖ P(Y = k|x) for each x. This proves (1).

To prove (2), note that by Bayes' rule

$$\eta_k(x) = \frac{\pi_k g_k(x)}{\sum_{l=1}^{K} \pi_l g_l(x)}.$$

Since the denominator does not depend on k, maximizing ηk(x) over k is equivalent to maximizing πk gk(x).


Plug-in Classifiers

In machine learning, we don't know the joint distribution of (X, Y), and therefore we cannot determine the Bayes classifier.

Instead, we observe training data $(X_1, Y_1), \dots, (X_n, Y_n) \overset{\text{iid}}{\sim} P_{XY}$ (iid: independent and identically distributed).

One approach to classification is to use the training data to estimate either the posterior probability function ηk(x), or the class-conditional distributions gk(x) and prior probabilities πk, and "plug" these estimates into the corresponding formula for the Bayes classifier.


Examples of Plug-in Classifiers

Linear discriminant analysis

Naïve Bayes

Logistic regression

Linear Discriminant Analysis

Linear Classifiers

Linear Discriminant Analysis (LDA) is a classification method that produces a linear classifier:

$$f(x) = \operatorname{sign}(w^T x + b),$$

where $w \in \mathbb{R}^d$ is the weight vector, $b \in \mathbb{R}$ is the bias, and $\operatorname{sign}(t)$ is 1 if $t \ge 0$ and $-1$ otherwise.


Multiclass Linear Classifiers

For multiclass classification with labels 1, 2, ..., K,

$$f(x) = \arg\max_{k=1,\dots,K} \,(w_k^T x + b_k).$$

LDA is one example of such a classifier.


The LDA Assumption

The Bayes classifier is given by

$$f^*(x) = \arg\max_k \pi_k g_k(x).$$

For LDA, it is assumed that each class-conditional distribution is multivariate Gaussian with a common covariance matrix Σ:

$$g_k(x) = \phi(x; \mu_k, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right).$$


Estimating Parameters

Parameters πk, µk, and Σ are estimated from the training data by

$$\hat{\pi}_k = \frac{n_k}{n}, \qquad n_k = |\{i : y_i = k\}|,$$

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i,$$

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^T.$$
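A minimal numpy sketch of these estimators (the helper name lda_estimates is my own; X is an n × d array and y holds the class labels):

```python
import numpy as np

def lda_estimates(X, y):
    """Estimate LDA parameters: priors, class means, and pooled covariance."""
    n, d = X.shape
    classes = np.unique(y)
    pi_hat = {k: np.mean(y == k) for k in classes}         # pi_hat_k = n_k / n
    mu_hat = {k: X[y == k].mean(axis=0) for k in classes}  # per-class means
    # Pooled covariance: average outer product of within-class residuals.
    resid = X - np.stack([mu_hat[k] for k in y])
    Sigma_hat = resid.T @ resid / n
    return pi_hat, mu_hat, Sigma_hat
```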


The LDA Classifier

The LDA classifier is

$$\hat{f}(x) = \arg\max_{1 \le k \le K} \hat{\pi}_k \cdot \phi(x; \hat{\mu}_k, \hat{\Sigma}),$$

or equivalently,

$$\hat{f}(x) = \arg\max_{1 \le k \le K} \log \hat{\pi}_k + \log \phi(x; \hat{\mu}_k, \hat{\Sigma}).$$


Linear Classifier in Binary Case

LDA Classifier (with the two classes labeled +1 and −1):

$$\hat{f}(x) = \operatorname{sign}(w^T x + b),$$

where

$$w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_{-1}), \qquad b = \log\frac{\hat{\pi}_1}{\hat{\pi}_{-1}} + \frac{1}{2}\left( \hat{\mu}_{-1}^T \hat{\Sigma}^{-1} \hat{\mu}_{-1} - \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 \right).$$
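A sketch of the resulting binary rule, reusing the lda_estimates helper from the previous sketch (labels assumed to be +1 and −1):

```python
import numpy as np

def binary_lda(X, y):
    """Return (w, b) for the binary LDA rule f(x) = sign(w @ x + b)."""
    pi, mu, Sigma = lda_estimates(X, y)  # helper from the earlier sketch
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu[1] - mu[-1])
    b = (np.log(pi[1] / pi[-1])
         + 0.5 * (mu[-1] @ Sigma_inv @ mu[-1] - mu[1] @ Sigma_inv @ mu[1]))
    return w, b

# Usage: predict +1 if w @ x + b >= 0, and -1 otherwise.
```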


Binary Case Derivation

$$\begin{aligned}
\hat{f}(x) &= \operatorname{sign}\Big( \log \hat{\pi}_1 + \log \phi(x; \hat{\mu}_1, \hat{\Sigma}) - \log \hat{\pi}_{-1} - \log \phi(x; \hat{\mu}_{-1}, \hat{\Sigma}) \Big) \\
&= \operatorname{sign}\Big( \log \hat{\pi}_1 - \tfrac{d}{2} \log 2\pi - \tfrac{1}{2} \log |\hat{\Sigma}| - \tfrac{1}{2} (x - \hat{\mu}_1)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_1) \\
&\qquad\quad - \log \hat{\pi}_{-1} + \tfrac{d}{2} \log 2\pi + \tfrac{1}{2} \log |\hat{\Sigma}| + \tfrac{1}{2} (x - \hat{\mu}_{-1})^T \hat{\Sigma}^{-1} (x - \hat{\mu}_{-1}) \Big) \\
&= \operatorname{sign}\left( \log \frac{\hat{\pi}_1}{\hat{\pi}_{-1}} + \tfrac{1}{2} \Big[ (x - \hat{\mu}_{-1})^T \hat{\Sigma}^{-1} (x - \hat{\mu}_{-1}) - (x - \hat{\mu}_1)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_1) \Big] \right) \\
&= \operatorname{sign}\left( \log \frac{\hat{\pi}_1}{\hat{\pi}_{-1}} + x^T \hat{\Sigma}^{-1} (\hat{\mu}_1 - \hat{\mu}_{-1}) + \tfrac{1}{2} \big( \hat{\mu}_{-1}^T \hat{\Sigma}^{-1} \hat{\mu}_{-1} - \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 \big) \right) \\
&= \operatorname{sign}(w^T x + b)
\end{aligned}$$


Linear Classifier in Multiclass Case

LDA Classifier:

$$\hat{f}(x) = \arg\max_k \, w_k^T x + b_k,$$

where $w_k = \hat{\Sigma}^{-1} \hat{\mu}_k$ and $b_k = \log \hat{\pi}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k$.
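The multiclass analogue, again sketched on top of lda_estimates:

```python
import numpy as np

def multiclass_lda(X, y):
    """Return per-class (w_k, b_k) so that f(x) = argmax_k w_k @ x + b_k."""
    pi, mu, Sigma = lda_estimates(X, y)
    Sigma_inv = np.linalg.inv(Sigma)
    w = {k: Sigma_inv @ mu[k] for k in mu}
    b = {k: np.log(pi[k]) - 0.5 * mu[k] @ Sigma_inv @ mu[k] for k in mu}
    return w, b

def lda_predict(x, w, b):
    # Assign x to the class with the largest linear discriminant score.
    return max(w, key=lambda k: w[k] @ x + b[k])
```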


Multiclass Case Derivation

$$\begin{aligned}
\hat{f}(x) &= \arg\max_k \, \log \hat{\pi}_k + \log \phi(x; \hat{\mu}_k, \hat{\Sigma}) \\
&= \arg\max_k \, \log \hat{\pi}_k - \tfrac{d}{2} \log 2\pi - \tfrac{1}{2} \log |\hat{\Sigma}| - \tfrac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_k) \\
&= \arg\max_k \, \log \hat{\pi}_k - \tfrac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_k) \\
&= \arg\max_k \, \log \hat{\pi}_k - \tfrac{1}{2} \Big[ x^T \hat{\Sigma}^{-1} x - 2 x^T \hat{\Sigma}^{-1} \hat{\mu}_k + \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k \Big] \\
&= \arg\max_k \, \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \tfrac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k \\
&= \arg\max_k \, w_k^T x + b_k
\end{aligned}$$



Mahalanobis Distance in LDA

Definition

Mahalanobis Distance:

$$D_A(x, x') := \sqrt{(x - x')^T A (x - x')}.$$

LDA Classifier in Terms of Mahalanobis Distance:

$$\hat{f}(x) = \arg\min_k \, -\log \hat{\pi}_k + \frac{1}{2} D^2_{\hat{\Sigma}^{-1}}(x, \hat{\mu}_k)$$

If we ignore the log π̂k terms, LDA can be viewed as assigning a test point x to the class whose estimated mean µ̂k is closest to x in the Mahalanobis distance associated with Σ̂⁻¹.
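The same rule in its Mahalanobis form, as a short sketch that takes the estimates returned by lda_estimates:

```python
import numpy as np

def lda_predict_mahalanobis(x, pi, mu, Sigma):
    """Classify x by minimizing -log(pi_k) + (1/2) * squared Mahalanobis distance."""
    Sigma_inv = np.linalg.inv(Sigma)
    def score(k):
        diff = x - mu[k]
        return -np.log(pi[k]) + 0.5 * diff @ Sigma_inv @ diff
    return min(mu, key=score)
```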


Visualization

Figure: Mahalanobis Distance in LDA.
