
Pattern Recognition Lecture

Bayes Decision Theory

Prof. Dr. Marcin Grzegorzek

Research Group for Pattern Recognition


Institute for Vision and Graphics
University of Siegen, Germany
Pattern Recognition Chain

patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Overview

1 Introduction
2 Bayes Decision Theory
3 Discriminant Functions and Decision Surfaces
4 Bayesian Classification for Normal Distributions
5 Estimation of Unknown Probability Density Functions
Statistical Classification - Problem Statement

Classification of an unknown pattern into the most probable of the classes!

• Set of classes: {ω1, ω2, . . . , ωM}
• Unknown pattern represented by its feature vector x
• Conditional probabilities: P(ωi|x), i = 1, 2, . . . , M
• Classification result: the class with the maximum conditional probability

But how to compute the conditional probability for a particular class?
Probability P vs. Density p

Probability P
is a real number in the range [0, 1] describing an event.

Density p
is a value of a function1 p(x) describing the distribution of the random variable x.

If the random variable takes only discrete values, the densities become probabilities!

1 This function is often referred to as the pdf - probability density function.
A Priori Probability vs. A Posteriori Probability

A priori probability - probability before classification
• How probable is a particular class ωi for a pattern x before applying any classification algorithm?
• Answer: P(ωi)

A posteriori probability - probability after classification
• How probable is a particular class ωi for a pattern x after applying a statistical classification algorithm?
• Answer: P(ωi|x)
Likelihood Density Function

• How are feature vectors x distributed within a class ωi?
• Answer: p(x|ωi)
• p(x|ωi) is the likelihood function of ωi with respect to x
• p(x|ωi) can be trained from examples
Bayes Decision Theory for a Two-Class Problem

Known
Classes: {ω1, ω2}
A priori probabilities: P(ω1) and P(ω2)
Likelihood density functions: p(x|ω1) and p(x|ω2)
Pattern to be classified: x = [x1, x2, . . . , xl]T

Assumption
The feature vectors can take any value in the l-dimensional feature space: x = [x1, x2, . . . , xl]T ∈ IRl

Unknown
A posteriori probabilities: P(ω1|x) and P(ω2|x)
Computation of the A Posteriori Probability

Using the Bayes Rule

P(ωi|x) = p(x|ωi)P(ωi) / p(x),   i = 1, 2   (1)

p(x) - density function for x
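As a minimal sketch, the Bayes rule (Eq. 1) can be evaluated numerically for a two-class case. The likelihood and prior values below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Illustrative values (assumed): likelihoods p(x|w_i) at a fixed
# pattern x, and the a priori probabilities P(w_i)
likelihoods = np.array([0.30, 0.10])   # p(x|w1), p(x|w2)
priors      = np.array([0.50, 0.50])   # P(w1),  P(w2)

# Evidence p(x) = sum_i p(x|w_i) * P(w_i)  (law of total probability)
evidence = np.dot(likelihoods, priors)

# Bayes rule (Eq. 1): P(w_i|x) = p(x|w_i) * P(w_i) / p(x)
posteriors = likelihoods * priors / evidence
```

The posteriors sum to one by construction, since p(x) normalises the numerators.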
Bayes Classification Rule (1)

Higher a posteriori probability wins

If P(ω1|x) > P(ω2|x), x is classified to ω1
If P(ω1|x) < P(ω2|x), x is classified to ω2
Bayes Classification Rule (2)

Considering the Bayes Rule (Eq. 1)

If p(x|ω1)P(ω1)/p(x) > p(x|ω2)P(ω2)/p(x), x is classified to ω1
If p(x|ω1)P(ω1)/p(x) < p(x|ω2)P(ω2)/p(x), x is classified to ω2
Bayes Classification Rule (3)

p(x) can be disregarded, because it is the same for all classes

If p(x|ω1)P(ω1) > p(x|ω2)P(ω2), x is classified to ω1
If p(x|ω1)P(ω1) < p(x|ω2)P(ω2), x is classified to ω2
Bayes Classification Rule (4)

If the a priori probabilities are equal: P(ω1) = P(ω2)

If p(x|ω1) > p(x|ω2), x is classified to ω1
If p(x|ω1) < p(x|ω2), x is classified to ω2

We are done, since the likelihood density functions p(x|ω1) and p(x|ω2) are assumed to have been trained from examples!
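A minimal sketch of this two-class rule, using illustrative 1-D Gaussian likelihoods with means 0 and 2 (assumed values for demonstration only):

```python
import math

def gauss(x, mu, sigma=1.0):
    """1-D Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, P1=0.5, P2=0.5):
    """Bayes rule: pick the class with the larger p(x|w_i) * P(w_i)."""
    return 1 if gauss(x, 0.0) * P1 > gauss(x, 2.0) * P2 else 2
```

With equal priors the comparison reduces to the likelihoods alone, exactly as on this slide; the decision boundary then lies at the midpoint of the two means.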
Classification Error Probability

[Figure: the likelihood densities p(x|ω1) and p(x|ω2) plotted over x; the decision boundary x0 separates the regions R1 and R2, and the overlap of the two densities corresponds to the error]

Error Probability: Pe = (1/2) ∫_{−∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{∞} p(x|ω1) dx
Classification Error Probability in General

• A priori probabilities are not equal: P(ω1) ≠ P(ω2)
• Feature vectors have more than one dimension: l > 1
  x = [x1, x2, . . . , xl]T
• General form:

  Pe = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx
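As a sketch, Pe can be approximated numerically for two equiprobable 1-D Gaussians N(0, 1) and N(2, 1) (illustrative choices); the decision boundary x0 then sits at the midpoint 1:

```python
import math

def gauss(x, mu):
    """1-D unit-variance Gaussian density N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def integrate(f, a, b, n=20000):
    """Midpoint-rule quadrature of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

x0 = 1.0   # boundary between R1 = (-inf, x0) and R2 = (x0, inf)

# Pe = P(w1) * int_{R2} p(x|w1) dx + P(w2) * int_{R1} p(x|w2) dx
# (the integration limits +-10 truncate negligible Gaussian tails)
Pe = 0.5 * integrate(lambda x: gauss(x, 0.0), x0, 10.0) \
   + 0.5 * integrate(lambda x: gauss(x, 2.0), -10.0, x0)
```

For this symmetric setup Pe equals the standard-normal tail probability beyond one standard deviation, roughly 0.159.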
Classification Error Probability

The Bayesian Classifier is OPTIMAL with respect to minimising the classification error probability!
Minimising Average Risk for Two Classes

• Classification error probability assigns the same importance


Introduction
to all errors, which is wrong for many applications (e. g.,
Bayes Decision
Theory ω1 → “malignant tumour”, ω2 → “benign tumour” ).
Discriminant
Functions and
• In such cases a penalty term is assigned to weight each
Decision
Surfaces
error.
Bayesian • A modified version of the error probability has to be
Classification
for Normal minimised:
Distributions
Z Z
Estimation of
Unknown r = λ12 P(ω1 ) p(x|ω1 )dx + λ21 P(ω2 ) p(x|ω2 )dx
Probability
Density
Functions
R2 R1

• For the tumour example λ12 is much greater than λ21 .


Modified Bayes Classification Rule

If the a priori probabilities are equal: P(ω1) = P(ω2)

If p(x|ω2) > p(x|ω1) λ12/λ21, x is classified to ω2
If p(x|ω2) < p(x|ω1) λ12/λ21, x is classified to ω1
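A sketch of the risk-weighted rule with illustrative penalties, in the spirit of the tumour example (λ12 ≫ λ21; all numbers are assumed):

```python
def classify_min_risk(px1, px2, lam12=10.0, lam21=1.0):
    """Two-class minimum-risk rule for equal priors:
    decide w2 only if p(x|w2) > p(x|w1) * lam12 / lam21."""
    return 2 if px2 * lam21 > px1 * lam12 else 1

# With px1 = 0.2 and px2 = 0.3 the plain Bayes rule would pick w2,
# but the heavy penalty lam12 tips the decision to the "safe" class w1.
plain_bayes = classify_min_risk(0.2, 0.3, lam12=1.0, lam21=1.0)
weighted    = classify_min_risk(0.2, 0.3)   # lam12 = 10 by default
```

A large λ12 raises the bar for deciding ω2, so borderline patterns are pushed toward ω1, which is the safe choice when missing a malignant tumour is the costly error.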
Discriminant Functions

• Sometimes it is more convenient to work with functions of probabilities instead of the probabilities themselves

  gi(x) ≡ f(P(ωi|x))

• f(·) is a monotonically increasing function
• gi(x) is known as a discriminant function
• The decision test is now stated as

  classify x into ωi if gi(x) > gj(x) ∀j ≠ i

• The decision surfaces, separating contiguous regions, are described by

  gij(x) ≡ gi(x) − gj(x) = 0,   i, j = 1, 2, . . . , M,   i ≠ j
Assumption

• The likelihood density functions describing the data in each of the classes are multivariate Gaussian (normal) distributions

  p(x|ωi) = (2π)^(−l/2) |Σi|^(−1/2) exp(−(1/2)(x − µi)T Σi^(−1) (x − µi))

• This "monster" will be denoted by

  p(x|ωi) = N(µi, Σi),   i = 1, 2, . . . , M
Discriminant Function f(·) = ln(·)

• Due to the exponential form of the involved densities, the following discriminant function is applied:

  gi(x) = ln(p(x|ωi)P(ωi)) = ln p(x|ωi) + ln P(ωi)

• Inserting the "monster" above, this is equivalent to

  gi(x) = −(1/2)(x − µi)T Σi^(−1) (x − µi) + ln P(ωi) + ci   (2)

• Where: ci = −(l/2) ln 2π − (1/2) ln |Σi|
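The discriminant of Eq. (2), including the constant ci, can be sketched directly in code; the mean, covariance and prior below are illustrative assumptions:

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Log-discriminant g_i(x) of Eq. (2) for one Gaussian class,
    with c_i = -(l/2) ln 2*pi - (1/2) ln |Sigma_i| included."""
    l = len(mu)
    d = x - mu
    quad = -0.5 * d @ np.linalg.inv(Sigma) @ d
    c = -0.5 * l * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return quad + np.log(prior) + c

mu = np.array([0.0, 0.0])     # illustrative class mean
Sigma = np.eye(2)             # illustrative covariance
val = g(np.array([0.0, 0.0]), mu, Sigma, prior=0.5)
```

By construction exp(gi(x)) equals p(x|ωi)P(ωi); at the mean of a 2-D standard Gaussian the density is 1/(2π), so exp(val) should equal 0.5/(2π).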
Quadrics as Decision Curves

Assuming l = 2 and σ1,2 = σ2,1 = 0, the decision curves

  gi(x) − gj(x) = 0

are quadrics (i. e., ellipsoids, parabolas, hyperbolas, pairs of lines)

[Figure: panels (a) and (b) in the (x1, x2) plane showing example quadric decision curves between two Gaussian classes ω1 and ω2]
Decision Hyperplanes

• The only quadric contribution in Equation (2) is xT Σi−1 x


Introduction • Assuming that the covariance matrix is the same for all
Bayes Decision classes Σi = Σ the quadric term will be the same for all
Theory

Discriminant
discriminant functions
Functions and
Decision • Thus, the quadric term can be disregarded by decision
Surfaces
surface equations. The same is true for the constant ci
Bayesian
Classification • The simplified version of the discriminant function is just a
for Normal
Distributions linear function
Estimation of
Unknown
gi (x) = wi T x + wi 0
Probability
Density where
Functions

1
wi = Σ −1 µi and wi 0 = ln P(ωi ) − µi T Σ −1 µi
2
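These weight formulas can be sketched as follows; the shared covariance, class means and priors are illustrative assumptions:

```python
import numpy as np

Sigma = np.eye(2)                  # shared covariance (illustrative)
Sigma_inv = np.linalg.inv(Sigma)

def linear_discriminant(mu, prior):
    """w_i = Sigma^-1 mu_i,  w_i0 = ln P(w_i) - (1/2) mu_i^T Sigma^-1 mu_i."""
    w = Sigma_inv @ mu
    w0 = np.log(prior) - 0.5 * mu @ Sigma_inv @ mu
    return w, w0

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
w1, w10 = linear_discriminant(mu1, 0.5)
w2, w20 = linear_discriminant(mu2, 0.5)

x = np.array([0.4, 0.3])           # a test pattern near mu1
g1 = w1 @ x + w10                  # equals ln P(w1) here, since mu1 = 0
g2 = w2 @ x + w20
```

Because both discriminants are linear in x, the decision surface g1(x) = g2(x) is a hyperplane, here the line x1 + x2 = 2.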
Minimum Distance Classifiers

• Assuming equiprobable classes with the same covariance


Introduction matrix and neglecting the constants Eq. 2 is simplified to:
Bayes Decision
Theory 1
gi (x) = − (x − µi )T Σ −1 (x − µi )
Discriminant
Functions and
2
Decision
Surfaces
• If Σ = σ 2 I (diagonal matrix) the maximum gi (x) implies
Bayesian
Classification the minimum Euclidean distance dǫ = ||x − µi ||
for Normal
Distributions

Estimation of
Unknown
• Thus, a feature vector x is assigned to a class bi according
Probability
Density
to its Euclidean distance to the respective mean points µi
Functions
bi = argmax(gi (x)) = argmin(||x − µi ||)
i i
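A sketch of the resulting minimum-Euclidean-distance classifier; the class means below are illustrative:

```python
import numpy as np

means = np.array([[0.0, 0.0],      # mu_0  (illustrative class means)
                  [4.0, 0.0],      # mu_1
                  [0.0, 4.0]])     # mu_2

def classify_min_dist(x):
    """Assign x to the class whose mean is nearest in Euclidean
    distance; equivalent to maximising g_i(x) when Sigma = sigma^2 * I
    and the classes are equiprobable."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1)))

label = classify_min_dist(np.array([3.2, 0.5]))
```

No training beyond estimating the mean vectors is needed, which is why this special case is so cheap compared to the full quadratic classifier.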
Remarks

• In practice, it is quite common to assume a Gaussian distribution of the data. In this case, the Bayesian classifier is either linear or quadratic in nature. These approaches are known as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).

• A major problem associated with LDA and QDA is the large number of parameters to be estimated: l parameters in each mean vector and approximately l²/2 in each covariance matrix. Moreover, a large number of training points N is needed.

• LDA and QDA perform very well for many different applications. However, in many cases the assumed normal distribution is not the right method to statistically model the data.
Problem Statement

• So far, we have assumed that the likelihood density functions p(x|ωi) for i = 1, 2, . . . , M are known.
• This is not the most common case. In many problems, the likelihood density functions have to be estimated from the available training data.
• Here, two estimation methods will be considered, namely
  • Maximum Likelihood Parameter Estimation
  • Maximum a Posteriori Probability Estimation
Maximum Likelihood Parameter Estimation (1)

• Let us consider an M-class problem with feature vectors


Introduction

Bayes Decision
distributed according to p(x|ωi ), i = 1, 2, . . . , M.
Theory
• The likelihood functions are assumed to be given in a
Discriminant
Functions and parametric form. The statistical parameters for the classes
Decision
Surfaces ωi form vectors θi which are unknown
Bayesian
Classification
for Normal
p(x|ωi ) = p(x|ωi ; θi )
Distributions

Estimation of • Goal: to estimate the unknown parameters using a set of


Unknown
Probability known feature vectors in each class.
Density
Functions • Since the estimation process is the same for all classes, the
index i will be skipped for further investigations.
Maximum Likelihood Parameter Estimation (2)

• Let X = {x1 , x2 , . . . , xN } be a set of feature vectors


Introduction describing training samples of a particular class.
Bayes Decision • Assuming statistical independence between the different
Theory

Discriminant
feature vectors, we can form the joint density function
Functions and
Decision N
Surfaces Y
Bayesian
p(X ; θ) = p(x1 , x2 , . . . , xN ; θ) = p(xk ; θ)
Classification k=1
for Normal
Distributions

Estimation of
• The ML method estimates θ so that the likelihood
Unknown
Probability
function takes its maximum value
Density
Functions N
Y
θbML = argmax p(xk ; θ)
θ k=1
Maximum Likelihood Parameter Estimation (3)

• To find a maximum, the gradient has to be zero


QN
∂ k=1 p(xk ; θ)
Introduction

Bayes Decision =0
Theory ∂θ
Discriminant
Functions and
• Due to the monotonicity of the logarithmic function, we
Decision
Surfaces
can use also the log-likelihood function
Bayesian N
Classification Y
for Normal
Distributions
L(θ) = ln p(xk ; θ)
Estimation of
k=1
Unknown
Probability • Looking for the maximum here, we have
Density
Functions
N N
∂L(θ) X ∂ ln p(xk ; θ) X 1 ∂p(xk ; θ)
= = =0
∂θ ∂θ p(xk ; θ) ∂θ
k=1 k=1
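For a Gaussian with unknown mean θ and known unit variance, setting this gradient to zero yields the sample mean in closed form. The sketch below checks that numerically on illustrative synthetic data (all numbers assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=1.0, size=1000)   # synthetic training set

# Closed-form ML solution of dL/dtheta = 0 for a Gaussian mean
theta_ml = X.mean()

def log_lik(theta):
    """Log-likelihood L(theta), up to additive constants."""
    return -0.5 * np.sum((X - theta) ** 2)
```

The log-likelihood is indeed largest at the sample mean: moving θ in either direction away from theta_ml can only decrease L(θ).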
Maximum a Posteriori Probability Estimation

• Set of feature vectors X = {x1 , x2 , . . . , xN }


Introduction

Bayes Decision • θ is an unknown random vector


Theory
• The starting point is the following density function
Discriminant
Functions and
Decision
p(θ)p(X |θ)
Surfaces
p(θ|X ) =
Bayesian p(X )
Classification
for Normal
Distributions • The MAP estimate θbMAP is defined as a point where
Estimation of
Unknown
p(θ|X ) becomes maximum
Probability
Density
∂ ∂
θbMAP :
Functions
p(θ|X ) = 0 or (p(θ)p(X |θ)) = 0
∂θ ∂θ
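A sketch of the MAP estimate for the classic conjugate case: Gaussian samples with unknown mean θ, known variance σ², and a Gaussian prior p(θ) = N(µ0, σ0²). Maximising p(θ)p(X|θ) then has the closed form below; all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0               # known noise std; prior N(mu0, sigma0^2)
X = rng.normal(loc=3.0, scale=sigma, size=200)   # synthetic training set

N = len(X)
# Setting d/dtheta [ln p(theta) + ln p(X|theta)] = 0 gives a
# precision-weighted average of the sample mean and the prior mean:
theta_map = (N / sigma**2 * X.mean() + mu0 / sigma0**2) \
          / (N / sigma**2 + 1 / sigma0**2)
```

The prior pulls the estimate from the sample mean toward µ0; as N grows, the data term dominates and θ̂_MAP approaches the ML estimate.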
