
Pattern Recognition Lecture

Bayes Decision Theory

Prof. Dr. Marcin Grzegorzek

Research Group for Pattern Recognition


Institute for Vision and Graphics
University of Siegen, Germany
Pattern Recognition Chain

patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Overview

1 Introduction
2 Bayes Decision Theory
3 Discriminant Functions and Decision Surfaces
4 Bayesian Classification for Normal Distributions
5 Estimation of Unknown Probability Density Functions
Statistical Classification - Problem Statement

Classification of an unknown pattern into the most probable of the classes!

• Set of classes: {ω1, ω2, . . . , ωM}
• Unknown pattern represented by its feature vector x
• Conditional probabilities: P(ωi|x), i = 1, 2, . . . , M
• Classification result: the class with the maximum conditional probability

But how to compute the conditional probability for a particular class?
Probability P vs. Density p

Probability P
is a real number in the range [0, 1] describing an event.

Density p
is a value of a function1 p(x) describing the distribution of the random variable x.

If the random variable takes only discrete values, the densities become probabilities!

1 This function is often referred to as the pdf - probability density function.
A Priori Probability vs. A Posteriori Probability

A priori probability - probability before classification
• How probable is a particular class ωi for a pattern x before applying any classification algorithm?
• Answer: P(ωi)

A posteriori probability - probability after classification
• How probable is a particular class ωi for a pattern x after applying a statistical classification algorithm?
• Answer: P(ωi|x)
Likelihood Density Function

• How are feature vectors x distributed within a class ωi?
• Answer: p(x|ωi)
• p(x|ωi) is the likelihood function of ωi with respect to x
• p(x|ωi) can be trained from examples
Bayes Decision Theory for a Two-Class Problem

Known
Classes: {ω1, ω2}
A priori probabilities: P(ω1) and P(ω2)
Likelihood density functions: p(x|ω1) and p(x|ω2)
Pattern to be classified: x = [x1, x2, . . . , xl]T

Assumption
The feature vectors can take any value in the l-dimensional feature space: x = [x1, x2, . . . , xl]T ∈ IRl

Unknown
A posteriori probabilities: P(ω1|x) and P(ω2|x)
Computation of the A Posteriori Probability

Using the Bayes Rule

P(ωi|x) = p(x|ωi)P(ωi) / p(x),   i = 1, 2   (1)

p(x) - density function for x
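As a minimal sketch, the Bayes rule (Eq. 1) can be evaluated numerically for a two-class case. The likelihood and prior values below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Illustrative values (assumed): likelihoods p(x|w_i) at a fixed
# pattern x, and the a priori probabilities P(w_i)
likelihoods = np.array([0.30, 0.10])   # p(x|w1), p(x|w2)
priors      = np.array([0.50, 0.50])   # P(w1),  P(w2)

# Evidence p(x) = sum_i p(x|w_i) * P(w_i)  (law of total probability)
evidence = np.dot(likelihoods, priors)

# Bayes rule (Eq. 1): P(w_i|x) = p(x|w_i) * P(w_i) / p(x)
posteriors = likelihoods * priors / evidence
```

The posteriors sum to one by construction, since p(x) normalises the numerators.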
Bayes Classification Rule (1)

Higher a posteriori probability wins

If P(ω1|x) > P(ω2|x), x is classified to ω1
If P(ω1|x) < P(ω2|x), x is classified to ω2
Bayes Classification Rule (2)

Considering the Bayes Rule (Eq. 1)

If p(x|ω1)P(ω1)/p(x) > p(x|ω2)P(ω2)/p(x), x is classified to ω1
If p(x|ω1)P(ω1)/p(x) < p(x|ω2)P(ω2)/p(x), x is classified to ω2
Bayes Classification Rule (3)

p(x) can be disregarded, because it is the same for all classes

If p(x|ω1)P(ω1) > p(x|ω2)P(ω2), x is classified to ω1
If p(x|ω1)P(ω1) < p(x|ω2)P(ω2), x is classified to ω2
Bayes Classification Rule (4)

If the a priori probabilities are equal: P(ω1) = P(ω2)

If p(x|ω1) > p(x|ω2), x is classified to ω1
If p(x|ω1) < p(x|ω2), x is classified to ω2

We are done, since the likelihood density functions p(x|ω1) and p(x|ω2) are assumed to have been trained from examples!
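A minimal sketch of this two-class rule, using illustrative 1-D Gaussian likelihoods with means 0 and 2 (assumed values for demonstration only):

```python
import math

def gauss(x, mu, sigma=1.0):
    """1-D Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, P1=0.5, P2=0.5):
    """Bayes rule: pick the class with the larger p(x|w_i) * P(w_i)."""
    return 1 if gauss(x, 0.0) * P1 > gauss(x, 2.0) * P2 else 2
```

With equal priors the comparison reduces to the likelihoods alone, exactly as on this slide; the decision boundary then lies at the midpoint of the two means.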
Classification Error Probability

[Figure: the likelihood densities p(x|ω1) and p(x|ω2) plotted over x; the decision boundary x0 separates the regions R1 and R2, and the overlap of the two densities corresponds to the error]

Error Probability: Pe = (1/2) ∫_{−∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{∞} p(x|ω1) dx
Classification Error Probability in General

• A priori probabilities are not equal: P(ω1) ≠ P(ω2)
• Feature vectors have more than one dimension: l > 1
  x = [x1, x2, . . . , xl]T
• General form:

  Pe = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx
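As a sketch, Pe can be approximated numerically for two equiprobable 1-D Gaussians N(0, 1) and N(2, 1) (illustrative choices); the decision boundary x0 then sits at the midpoint 1:

```python
import math

def gauss(x, mu):
    """1-D unit-variance Gaussian density N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def integrate(f, a, b, n=20000):
    """Midpoint-rule quadrature of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

x0 = 1.0   # boundary between R1 = (-inf, x0) and R2 = (x0, inf)

# Pe = P(w1) * int_{R2} p(x|w1) dx + P(w2) * int_{R1} p(x|w2) dx
# (the integration limits +-10 truncate negligible Gaussian tails)
Pe = 0.5 * integrate(lambda x: gauss(x, 0.0), x0, 10.0) \
   + 0.5 * integrate(lambda x: gauss(x, 2.0), -10.0, x0)
```

For this symmetric setup Pe equals the standard-normal tail probability beyond one standard deviation, roughly 0.159.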
Classification Error Probability

The Bayesian Classifier is OPTIMAL with respect to minimising the classification error probability!
Minimising Average Risk for Two Classes

• Classification error probability assigns the same importance


Introduction
to all errors, which is wrong for many applications (e. g.,
Bayes Decision
Theory ω1 → “malignant tumour”, ω2 → “benign tumour” ).
Discriminant
Functions and
• In such cases a penalty term is assigned to weight each
Decision
Surfaces
error.
Bayesian • A modified version of the error probability has to be
Classification
for Normal minimised:
Distributions
Z Z
Estimation of
Unknown r = λ12 P(ω1 ) p(x|ω1 )dx + λ21 P(ω2 ) p(x|ω2 )dx
Probability
Density
Functions
R2 R1

• For the tumour example λ12 is much greater than λ21 .


Modified Bayes Classification Rule

If the a priori probabilities are equal: P(ω1) = P(ω2)

If p(x|ω2) > p(x|ω1) λ12/λ21, x is classified to ω2
If p(x|ω2) < p(x|ω1) λ12/λ21, x is classified to ω1
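A sketch of the risk-weighted rule with illustrative penalties, in the spirit of the tumour example (λ12 ≫ λ21; all numbers are assumed):

```python
def classify_min_risk(px1, px2, lam12=10.0, lam21=1.0):
    """Two-class minimum-risk rule for equal priors:
    decide w2 only if p(x|w2) > p(x|w1) * lam12 / lam21."""
    return 2 if px2 * lam21 > px1 * lam12 else 1

# With px1 = 0.2 and px2 = 0.3 the plain Bayes rule would pick w2,
# but the heavy penalty lam12 tips the decision to the "safe" class w1.
plain_bayes = classify_min_risk(0.2, 0.3, lam12=1.0, lam21=1.0)
weighted    = classify_min_risk(0.2, 0.3)   # lam12 = 10 by default
```

A large λ12 raises the bar for deciding ω2, so borderline patterns are pushed toward ω1, which is the safe choice when missing a malignant tumour is the costly error.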
Discriminant Functions

• Sometimes it is more convenient to work with functions of probabilities instead of the probabilities themselves

  gi(x) ≡ f(P(ωi|x))

• f(·) is a monotonically increasing function
• gi(x) is known as a discriminant function
• The decision test is now stated as

  classify x into ωi if gi(x) > gj(x) ∀j ≠ i

• The decision surfaces, separating contiguous regions, are described by

  gij(x) ≡ gi(x) − gj(x) = 0,   i, j = 1, 2, . . . , M,   i ≠ j
Assumption

• The likelihood density functions describing the data in each of the classes are multivariate Gaussian (normal) distributions

  p(x|ωi) = (2π)^(−l/2) |Σi|^(−1/2) exp(−(1/2)(x − µi)T Σi^(−1) (x − µi))

• This "monster" will be denoted by

  p(x|ωi) = N(µi, Σi),   i = 1, 2, . . . , M
Discriminant Function f(·) = ln(·)

• Due to the exponential form of the involved densities, the following discriminant function is applied:

  gi(x) = ln(p(x|ωi)P(ωi)) = ln p(x|ωi) + ln P(ωi)

• Inserting the "monster" above, this is equivalent to

  gi(x) = −(1/2)(x − µi)T Σi^(−1) (x − µi) + ln P(ωi) + ci   (2)

• Where: ci = −(l/2) ln 2π − (1/2) ln |Σi|
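The discriminant of Eq. (2), including the constant ci, can be sketched directly in code; the mean, covariance and prior below are illustrative assumptions:

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Log-discriminant g_i(x) of Eq. (2) for one Gaussian class,
    with c_i = -(l/2) ln 2*pi - (1/2) ln |Sigma_i| included."""
    l = len(mu)
    d = x - mu
    quad = -0.5 * d @ np.linalg.inv(Sigma) @ d
    c = -0.5 * l * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return quad + np.log(prior) + c

mu = np.array([0.0, 0.0])     # illustrative class mean
Sigma = np.eye(2)             # illustrative covariance
val = g(np.array([0.0, 0.0]), mu, Sigma, prior=0.5)
```

By construction exp(gi(x)) equals p(x|ωi)P(ωi); at the mean of a 2-D standard Gaussian the density is 1/(2π), so exp(val) should equal 0.5/(2π).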
Quadrics as Decision Curves

Assuming l = 2 and σ1,2 = σ2,1 = 0, the decision curves

  gi(x) − gj(x) = 0

are quadrics (i. e., ellipsoids, parabolas, hyperbolas, pairs of lines)

[Figure: panels (a) and (b) in the (x1, x2) plane showing example quadric decision curves between two Gaussian classes ω1 and ω2]
Decision Hyperplanes

• The only quadric contribution in Equation (2) is xT Σi−1 x


Introduction • Assuming that the covariance matrix is the same for all
Bayes Decision classes Σi = Σ the quadric term will be the same for all
Theory

Discriminant
discriminant functions
Functions and
Decision • Thus, the quadric term can be disregarded by decision
Surfaces
surface equations. The same is true for the constant ci
Bayesian
Classification • The simplified version of the discriminant function is just a
for Normal
Distributions linear function
Estimation of
Unknown
gi (x) = wi T x + wi 0
Probability
Density where
Functions

1
wi = Σ −1 µi and wi 0 = ln P(ωi ) − µi T Σ −1 µi
2
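These weight formulas can be sketched as follows; the shared covariance, class means and priors are illustrative assumptions:

```python
import numpy as np

Sigma = np.eye(2)                  # shared covariance (illustrative)
Sigma_inv = np.linalg.inv(Sigma)

def linear_discriminant(mu, prior):
    """w_i = Sigma^-1 mu_i,  w_i0 = ln P(w_i) - (1/2) mu_i^T Sigma^-1 mu_i."""
    w = Sigma_inv @ mu
    w0 = np.log(prior) - 0.5 * mu @ Sigma_inv @ mu
    return w, w0

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
w1, w10 = linear_discriminant(mu1, 0.5)
w2, w20 = linear_discriminant(mu2, 0.5)

x = np.array([0.4, 0.3])           # a test pattern near mu1
g1 = w1 @ x + w10                  # equals ln P(w1) here, since mu1 = 0
g2 = w2 @ x + w20
```

Because both discriminants are linear in x, the decision surface g1(x) = g2(x) is a hyperplane, here the line x1 + x2 = 2.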
Minimum Distance Classifiers

• Assuming equiprobable classes with the same covariance


Introduction matrix and neglecting the constants Eq. 2 is simplified to:
Bayes Decision
Theory 1
gi (x) = − (x − µi )T Σ −1 (x − µi )
Discriminant
Functions and
2
Decision
Surfaces
• If Σ = σ 2 I (diagonal matrix) the maximum gi (x) implies
Bayesian
Classification the minimum Euclidean distance dǫ = ||x − µi ||
for Normal
Distributions

Estimation of
Unknown
• Thus, a feature vector x is assigned to a class bi according
Probability
Density
to its Euclidean distance to the respective mean points µi
Functions
bi = argmax(gi (x)) = argmin(||x − µi ||)
i i
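A sketch of the resulting minimum-Euclidean-distance classifier; the class means below are illustrative:

```python
import numpy as np

means = np.array([[0.0, 0.0],      # mu_0  (illustrative class means)
                  [4.0, 0.0],      # mu_1
                  [0.0, 4.0]])     # mu_2

def classify_min_dist(x):
    """Assign x to the class whose mean is nearest in Euclidean
    distance; equivalent to maximising g_i(x) when Sigma = sigma^2 * I
    and the classes are equiprobable."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1)))

label = classify_min_dist(np.array([3.2, 0.5]))
```

No training beyond estimating the mean vectors is needed, which is why this special case is so cheap compared to the full quadratic classifier.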
Remarks

• In practice, it is quite common to assume a Gaussian distribution of the data. In this case, the Bayesian classifier is either linear or quadratic in nature. These approaches are known as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).

• A major problem associated with LDA and QDA is the large number of parameters to be estimated: l parameters in each mean vector and approximately l²/2 in each covariance matrix. Moreover, a large number of training points N is needed.

• LDA and QDA perform very well for many different applications. However, in many cases the assumed normal distribution is not the right method to statistically model the data.
Problem Statement

• So far, we have assumed that the likelihood density functions p(x|ωi) for i = 1, 2, . . . , M are known.
• This is not the most common case. In many problems, the likelihood density functions have to be estimated from the available training data.
• Here, two estimation methods will be considered, namely
  • Maximum Likelihood Parameter Estimation
  • Maximum a Posteriori Probability Estimation
Maximum Likelihood Parameter Estimation (1)

• Let us consider an M-class problem with feature vectors


Introduction

Bayes Decision
distributed according to p(x|ωi ), i = 1, 2, . . . , M.
Theory
• The likelihood functions are assumed to be given in a
Discriminant
Functions and parametric form. The statistical parameters for the classes
Decision
Surfaces ωi form vectors θi which are unknown
Bayesian
Classification
for Normal
p(x|ωi ) = p(x|ωi ; θi )
Distributions

Estimation of • Goal: to estimate the unknown parameters using a set of


Unknown
Probability known feature vectors in each class.
Density
Functions • Since the estimation process is the same for all classes, the
index i will be skipped for further investigations.
Maximum Likelihood Parameter Estimation (2)

• Let X = {x1 , x2 , . . . , xN } be a set of feature vectors


Introduction describing training samples of a particular class.
Bayes Decision • Assuming statistical independence between the different
Theory

Discriminant
feature vectors, we can form the joint density function
Functions and
Decision N
Surfaces Y
Bayesian
p(X ; θ) = p(x1 , x2 , . . . , xN ; θ) = p(xk ; θ)
Classification k=1
for Normal
Distributions

Estimation of
• The ML method estimates θ so that the likelihood
Unknown
Probability
function takes its maximum value
Density
Functions N
Y
θbML = argmax p(xk ; θ)
θ k=1
Maximum Likelihood Parameter Estimation (3)

• To find a maximum, the gradient has to be zero


QN
∂ k=1 p(xk ; θ)
Introduction

Bayes Decision =0
Theory ∂θ
Discriminant
Functions and
• Due to the monotonicity of the logarithmic function, we
Decision
Surfaces
can use also the log-likelihood function
Bayesian N
Classification Y
for Normal
Distributions
L(θ) = ln p(xk ; θ)
Estimation of
k=1
Unknown
Probability • Looking for the maximum here, we have
Density
Functions
N N
∂L(θ) X ∂ ln p(xk ; θ) X 1 ∂p(xk ; θ)
= = =0
∂θ ∂θ p(xk ; θ) ∂θ
k=1 k=1
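For a Gaussian with unknown mean θ and known unit variance, setting this gradient to zero yields the sample mean in closed form. The sketch below checks that numerically on illustrative synthetic data (all numbers assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=1.0, size=1000)   # synthetic training set

# Closed-form ML solution of dL/dtheta = 0 for a Gaussian mean
theta_ml = X.mean()

def log_lik(theta):
    """Log-likelihood L(theta), up to additive constants."""
    return -0.5 * np.sum((X - theta) ** 2)
```

The log-likelihood is indeed largest at the sample mean: moving θ in either direction away from theta_ml can only decrease L(θ).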
Maximum a Posteriori Probability Estimation

• Set of feature vectors X = {x1 , x2 , . . . , xN }


Introduction

Bayes Decision • θ is an unknown random vector


Theory
• The starting point is the following density function
Discriminant
Functions and
Decision
p(θ)p(X |θ)
Surfaces
p(θ|X ) =
Bayesian p(X )
Classification
for Normal
Distributions • The MAP estimate θbMAP is defined as a point where
Estimation of
Unknown
p(θ|X ) becomes maximum
Probability
Density
∂ ∂
θbMAP :
Functions
p(θ|X ) = 0 or (p(θ)p(X |θ)) = 0
∂θ ∂θ
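A sketch of the MAP estimate for the classic conjugate case: Gaussian samples with unknown mean θ, known variance σ², and a Gaussian prior p(θ) = N(µ0, σ0²). Maximising p(θ)p(X|θ) then has the closed form below; all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0               # known noise std; prior N(mu0, sigma0^2)
X = rng.normal(loc=3.0, scale=sigma, size=200)   # synthetic training set

N = len(X)
# Setting d/dtheta [ln p(theta) + ln p(X|theta)] = 0 gives a
# precision-weighted average of the sample mean and the prior mean:
theta_map = (N / sigma**2 * X.mean() + mu0 / sigma0**2) \
          / (N / sigma**2 + 1 / sigma0**2)
```

The prior pulls the estimate from the sample mean toward µ0; as N grows, the data term dominates and θ̂_MAP approaches the ML estimate.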
