
Bayes Classifiers

Probabilistic Setting for Classification



Let X = [X1 , . . . , Xd ]T ∈ Rd be a feature vector for a classification
problem, and let Y ∈ {1, . . . , K} be the associated class label.
Assumption
X and Y are jointly distributed random variables.

We will consider two perspectives for the joint distribution:

Marginal distribution of Y and conditional distribution of X|Y :

P (X, Y ) = P (Y )P (X|Y )

Marginal distribution of X and conditional distribution of Y |X:

P (X, Y ) = P (X)P (Y |X)



Example: Two-dimensional Data with Gaussian Class-conditional Distributions

Figure: Two-dimensional data with Gaussian class-conditional distributions.


Example: Two-dimensional Data (cont.) I

For a binary (K = 2) classification problem with a d = 2 dimensional feature space, the marginal distribution of Y is uniform on the values 1 and 2.

The conditional distribution of X|Y is

$$X \mid Y = 1 \sim N(\mu_1, \Sigma), \qquad X \mid Y = 2 \sim N(\mu_2, \Sigma),$$

where

$$\mu_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 0.9 & 0.4 \\ 0.4 & 0.3 \end{bmatrix}.$$
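As a concrete illustration of this generative model, here is a minimal Python sketch (assuming numpy; the seed and sample size are arbitrary choices) that draws labeled samples from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model parameters from the example above.
mu = {1: np.array([1.0, 1.0]), 2: np.array([1.0, -1.0])}
Sigma = np.array([[0.9, 0.4],
                  [0.4, 0.3]])

n = 500  # sample size, an arbitrary choice for the illustration

# Y is uniform on {1, 2}; X | Y = k ~ N(mu_k, Sigma).
y = rng.integers(1, 3, size=n)
X = np.stack([rng.multivariate_normal(mu[k], Sigma) for k in y])
```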


Example: Two-dimensional Data (cont.) II

Since Σ₂₂ = 0.3, X₂ given Y = k is Gaussian with variance 0.3, so the probability Pr(X₂ ≤ 0.5) is

$$\begin{aligned}
\Pr(X_2 \le 0.5) &= \Pr(X_2 \le 0.5 \mid Y = 1) \Pr(Y = 1) + \Pr(X_2 \le 0.5 \mid Y = 2) \Pr(Y = 2) \\
&= 0.5\, \Phi(0.5; 1, \sqrt{0.3}) + 0.5\, \Phi(0.5; -1, \sqrt{0.3}) \\
&= 0.5888,
\end{aligned}$$

where Φ(x; µ, σ) is the CDF of a N(µ, σ²) random variable.
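This value is easy to check numerically; a minimal sketch, assuming scipy is available:

```python
import numpy as np
from scipy.stats import norm

sigma = np.sqrt(0.3)  # std. dev. of X2 under either class, since Sigma_22 = 0.3
p = 0.5 * norm.cdf(0.5, loc=1, scale=sigma) + 0.5 * norm.cdf(0.5, loc=-1, scale=sigma)
print(round(p, 4))  # 0.5888
```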


Example: Two-dimensional Data (cont.) III


The marginal density of X is (by the law of total probability)

$$\begin{aligned}
P(X = x) &= P(X = x \mid Y = 1) P(Y = 1) + P(X = x \mid Y = 2) P(Y = 2) \\
&= \frac{1}{2} \phi(x; \mu_1, \Sigma) + \frac{1}{2} \phi(x; \mu_2, \Sigma),
\end{aligned}$$

where ϕ(x; µ, Σ) is the pdf of a N(µ, Σ) random variable.

The conditional distribution of Y|X = x is

$$\Pr(Y = 1 \mid X = x) = \frac{\frac{1}{2} \phi(x; \mu_1, \Sigma)}{\frac{1}{2} \phi(x; \mu_1, \Sigma) + \frac{1}{2} \phi(x; \mu_2, \Sigma)},$$

$$\Pr(Y = 2 \mid X = x) = \frac{\frac{1}{2} \phi(x; \mu_2, \Sigma)}{\frac{1}{2} \phi(x; \mu_1, \Sigma) + \frac{1}{2} \phi(x; \mu_2, \Sigma)}.$$
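A short sketch (again assuming scipy) that evaluates these posteriors at a test point; the point x below is an arbitrary choice:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
Sigma = np.array([[0.9, 0.4], [0.4, 0.3]])

x = np.array([0.5, 0.0])  # arbitrary test point

# Prior-weighted class-conditional densities (both priors are 1/2).
p1 = 0.5 * multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
p2 = 0.5 * multivariate_normal.pdf(x, mean=mu2, cov=Sigma)

post1 = p1 / (p1 + p2)  # Pr(Y = 1 | X = x)
post2 = p2 / (p1 + p2)  # Pr(Y = 2 | X = x)
```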



The Bayes Classifier

The Bayes Classifier: Introduction

Given a joint distribution (X, Y ), we may ask “what is the best possible
classifier for that joint distribution?”

Classifier function: f : Rd → {1, . . . , K}


Performance is measured by the probability of error:

$$R(f) := \Pr(f(X) \ne Y)$$

R(f ) is an example of a risk.


The Bayes risk is the smallest risk of any classifier:

$$R^* := \min_{\text{all } f} R(f)$$

If R(f) = R*, then f is called the Bayes classifier.
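For intuition, the Bayes risk can be estimated by Monte Carlo. A sketch for the two-Gaussian example above (the sample size and seed are arbitrary choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu = {1: np.array([1.0, 1.0]), 2: np.array([1.0, -1.0])}
Sigma = np.array([[0.9, 0.4], [0.4, 0.3]])

# Draw (X, Y) pairs from the joint distribution.
n = 100_000
y = rng.integers(1, 3, size=n)
means = np.stack([mu[k] for k in y])
X = means + rng.multivariate_normal(np.zeros(2), Sigma, size=n)

# Bayes classifier for equal priors: pick the larger class-conditional density.
d1 = multivariate_normal.pdf(X, mean=mu[1], cov=Sigma)
d2 = multivariate_normal.pdf(X, mean=mu[2], cov=Sigma)
f_star = np.where(d1 >= d2, 1, 2)

print("Monte Carlo estimate of R*:", np.mean(f_star != y))
```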


Terminology

Posterior distribution: ηk(x) = Pr(Y = k|X = x)

Prior distribution: πk = Pr(Y = k)

Class-conditional distributions: gk(x) = Pr(X = x|Y = k) (interpreted as a pdf when X is continuous)


Formulas and Theorem

Theorem 1
The Bayes classifier is given by

$$\begin{aligned}
f^*(x) &= \arg\max_{k=1,\dots,K} \eta_k(x) && (1) \\
&= \arg\max_{k=1,\dots,K} \pi_k g_k(x) && (2)
\end{aligned}$$


Proof of Theorem
The proof starts with the definition of R(f):

$$\begin{aligned}
R(f) &= E\big[\mathbf{1}_{\{f(X) \ne Y\}}\big] \\
&= E_X E_{Y|X}\big[\mathbf{1}_{\{f(X) \ne Y\}}\big] \\
&= E_X\left[ \sum_{k=1}^{K} P(Y = k \mid X)\, \mathbf{1}_{\{f(X) \ne k\}} \right] \\
&= E_X\big[ P(Y = 1 \mid X) + \cdots + P(Y = f(X) - 1 \mid X) \\
&\qquad\; + P(Y = f(X) + 1 \mid X) + \cdots + P(Y = K \mid X) \big] \\
&= E_X\big[ 1 - P(Y = f(X) \mid X) \big]
\end{aligned}$$

To minimize R(f), f should be a function such that f(x) = arg maxₖ P(Y = k|x) for each x. This proves (1).

To prove (2), note that by Bayes' rule

$$\eta_k(x) = \frac{\pi_k g_k(x)}{\sum_{l=1}^{K} \pi_l g_l(x)}.$$

Since the denominator does not depend on k, maximizing ηk(x) over k is equivalent to maximizing πk gk(x).


Plug-in Classifiers

In machine learning, we don't know the joint distribution of (X, Y), and therefore we cannot determine the Bayes classifier.

Instead, we observe training data $(X_1, Y_1), \dots, (X_n, Y_n) \overset{\text{iid}}{\sim} P_{XY}$ (iid: independent and identically distributed).

One approach to classification is to use the training data to estimate either the posterior probability function ηk(x), or the class-conditional distributions gk(x) and prior probabilities πk, and "plug" these estimates into the corresponding formula for the Bayes classifier.


Examples of Plug-in Classifiers

Linear discriminant analysis

Naïve Bayes

Logistic regression

Linear Discriminant Analysis

Linear Classifiers

Linear Discriminant Analysis (LDA) is a classification method that produces a linear classifier:

$$f(x) = \operatorname{sign}(w^T x + b),$$

where $w \in \mathbb{R}^d$ is the weight vector, $b \in \mathbb{R}$ is the bias, and $\operatorname{sign}(t)$ is 1 if $t \ge 0$ and $-1$ otherwise.


Multiclass Linear Classifiers

For multiclass classification with labels 1, 2, ..., K,

$$f(x) = \arg\max_{k=1,\dots,K} \,(w_k^T x + b_k).$$

LDA is one example of such a classifier.


The LDA Assumption

The Bayes classifier is given by

$$f^*(x) = \arg\max_k \pi_k g_k(x).$$

For LDA, it is assumed that each class-conditional distribution is multivariate Gaussian with a common covariance matrix Σ:

$$g_k(x) = \phi(x; \mu_k, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right).$$


Estimating Parameters

Parameters πk, µk, and Σ are estimated from the training data by

$$\hat{\pi}_k = \frac{n_k}{n}, \qquad n_k = |\{i : y_i = k\}|,$$

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i,$$

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^T.$$
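A minimal numpy sketch of these estimators (the helper name lda_estimates is my own; X is an n × d array and y holds the class labels):

```python
import numpy as np

def lda_estimates(X, y):
    """Estimate LDA parameters: priors, class means, and pooled covariance."""
    n, d = X.shape
    classes = np.unique(y)
    pi_hat = {k: np.mean(y == k) for k in classes}         # pi_hat_k = n_k / n
    mu_hat = {k: X[y == k].mean(axis=0) for k in classes}  # per-class means
    # Pooled covariance: average outer product of within-class residuals.
    resid = X - np.stack([mu_hat[k] for k in y])
    Sigma_hat = resid.T @ resid / n
    return pi_hat, mu_hat, Sigma_hat
```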


The LDA Classifier

The LDA classifier is

$$\hat{f}(x) = \arg\max_{1 \le k \le K} \hat{\pi}_k \cdot \phi(x; \hat{\mu}_k, \hat{\Sigma}),$$

or equivalently,

$$\hat{f}(x) = \arg\max_{1 \le k \le K} \log \hat{\pi}_k + \log \phi(x; \hat{\mu}_k, \hat{\Sigma}).$$


Linear Classifier in Binary Case

LDA Classifier (with the two classes labeled +1 and −1):

$$\hat{f}(x) = \operatorname{sign}(w^T x + b),$$

where

$$w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_{-1}), \qquad b = \log\frac{\hat{\pi}_1}{\hat{\pi}_{-1}} + \frac{1}{2}\left( \hat{\mu}_{-1}^T \hat{\Sigma}^{-1} \hat{\mu}_{-1} - \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 \right).$$
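A sketch of the resulting binary rule, reusing the lda_estimates helper from the previous sketch (labels assumed to be +1 and −1):

```python
import numpy as np

def binary_lda(X, y):
    """Return (w, b) for the binary LDA rule f(x) = sign(w @ x + b)."""
    pi, mu, Sigma = lda_estimates(X, y)  # helper from the earlier sketch
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu[1] - mu[-1])
    b = (np.log(pi[1] / pi[-1])
         + 0.5 * (mu[-1] @ Sigma_inv @ mu[-1] - mu[1] @ Sigma_inv @ mu[1]))
    return w, b

# Usage: predict +1 if w @ x + b >= 0, and -1 otherwise.
```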


Binary Case Derivation

$$\begin{aligned}
\hat{f}(x) &= \operatorname{sign}\Big( \log \hat{\pi}_1 + \log \phi(x; \hat{\mu}_1, \hat{\Sigma}) - \log \hat{\pi}_{-1} - \log \phi(x; \hat{\mu}_{-1}, \hat{\Sigma}) \Big) \\
&= \operatorname{sign}\Big( \log \hat{\pi}_1 - \tfrac{d}{2} \log 2\pi - \tfrac{1}{2} \log |\hat{\Sigma}| - \tfrac{1}{2} (x - \hat{\mu}_1)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_1) \\
&\qquad\quad - \log \hat{\pi}_{-1} + \tfrac{d}{2} \log 2\pi + \tfrac{1}{2} \log |\hat{\Sigma}| + \tfrac{1}{2} (x - \hat{\mu}_{-1})^T \hat{\Sigma}^{-1} (x - \hat{\mu}_{-1}) \Big) \\
&= \operatorname{sign}\left( \log \frac{\hat{\pi}_1}{\hat{\pi}_{-1}} + \tfrac{1}{2} \Big[ (x - \hat{\mu}_{-1})^T \hat{\Sigma}^{-1} (x - \hat{\mu}_{-1}) - (x - \hat{\mu}_1)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_1) \Big] \right) \\
&= \operatorname{sign}\left( \log \frac{\hat{\pi}_1}{\hat{\pi}_{-1}} + x^T \hat{\Sigma}^{-1} (\hat{\mu}_1 - \hat{\mu}_{-1}) + \tfrac{1}{2} \big( \hat{\mu}_{-1}^T \hat{\Sigma}^{-1} \hat{\mu}_{-1} - \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 \big) \right) \\
&= \operatorname{sign}(w^T x + b)
\end{aligned}$$


Linear Classifier in Multiclass Case

LDA Classifier:

$$\hat{f}(x) = \arg\max_k \, w_k^T x + b_k,$$

where $w_k = \hat{\Sigma}^{-1} \hat{\mu}_k$ and $b_k = \log \hat{\pi}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k$.
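The multiclass analogue, again sketched on top of lda_estimates:

```python
import numpy as np

def multiclass_lda(X, y):
    """Return per-class (w_k, b_k) so that f(x) = argmax_k w_k @ x + b_k."""
    pi, mu, Sigma = lda_estimates(X, y)
    Sigma_inv = np.linalg.inv(Sigma)
    w = {k: Sigma_inv @ mu[k] for k in mu}
    b = {k: np.log(pi[k]) - 0.5 * mu[k] @ Sigma_inv @ mu[k] for k in mu}
    return w, b

def lda_predict(x, w, b):
    # Assign x to the class with the largest linear discriminant score.
    return max(w, key=lambda k: w[k] @ x + b[k])
```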


Multiclass Case Derivation

$$\begin{aligned}
\hat{f}(x) &= \arg\max_k \, \log \hat{\pi}_k + \log \phi(x; \hat{\mu}_k, \hat{\Sigma}) \\
&= \arg\max_k \, \log \hat{\pi}_k - \tfrac{d}{2} \log 2\pi - \tfrac{1}{2} \log |\hat{\Sigma}| - \tfrac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_k) \\
&= \arg\max_k \, \log \hat{\pi}_k - \tfrac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}^{-1} (x - \hat{\mu}_k) \\
&= \arg\max_k \, \log \hat{\pi}_k - \tfrac{1}{2} \Big[ x^T \hat{\Sigma}^{-1} x - 2 x^T \hat{\Sigma}^{-1} \hat{\mu}_k + \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k \Big] \\
&= \arg\max_k \, \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \tfrac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k \\
&= \arg\max_k \, w_k^T x + b_k
\end{aligned}$$



Mahalanobis Distance in LDA

Definition

Mahalanobis Distance:

$$D_A(x, x') := \sqrt{(x - x')^T A (x - x')}.$$

LDA Classifier in Terms of Mahalanobis Distance:

$$\hat{f}(x) = \arg\min_k \, -\log \hat{\pi}_k + \frac{1}{2} D^2_{\hat{\Sigma}^{-1}}(x, \hat{\mu}_k)$$

If we ignore the log π̂k terms, LDA can be viewed as assigning a test point x to the class whose estimated mean µ̂k is closest to x in the Mahalanobis distance associated with Σ̂⁻¹.
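The same rule in its Mahalanobis form, as a short sketch that takes the estimates returned by lda_estimates:

```python
import numpy as np

def lda_predict_mahalanobis(x, pi, mu, Sigma):
    """Classify x by minimizing -log(pi_k) + (1/2) * squared Mahalanobis distance."""
    Sigma_inv = np.linalg.inv(Sigma)
    def score(k):
        diff = x - mu[k]
        return -np.log(pi[k]) + 0.5 * diff @ Sigma_inv @ diff
    return min(mu, key=score)
```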


Visualization

Figure: Mahalanobis Distance in LDA.
