
Bayes Decision Theory

Minimum-Error-Rate Classification

Classifiers, Discriminant Functions and Decision Surfaces

The Normal Density



Minimum-Error-Rate Classification

• Actions are decisions on classes
• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error, which is the error rate



Minimum Error Rate Classifier Derivation
• Zero-one loss function:

  $$\lambda(\alpha_i, \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

• Therefore, the conditional risk is:

  $$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

• The risk corresponding to this loss function is the average probability of error
• Minimizing the risk requires maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x)
• Minimum-error-rate rule: decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
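A minimal sketch of this rule in Python, assuming the priors and class-conditional densities are known; the two Gaussian class-conditionals and prior values below are hypothetical, chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: priors and Gaussian class-conditional densities
priors = np.array([0.6, 0.4])                  # P(w1), P(w2)
densities = [norm(loc=0.0, scale=1.0),         # p(x | w1)
             norm(loc=2.0, scale=1.0)]         # p(x | w2)

def classify(x):
    # Posteriors are proportional to p(x | wi) * P(wi); the normalizer is common to all classes
    scores = np.array([d.pdf(x) * p for d, p in zip(densities, priors)])
    # Minimum error rate: decide the class with the largest posterior
    return np.argmax(scores) + 1               # 1-based class index

print(classify(0.3))   # -> 1
print(classify(1.8))   # -> 2
```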



Likelihood Ratio Classification
• Decision regions and the zero-one loss function:

  $$\text{Let } \theta_\lambda = \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}, \text{ then decide } \omega_1 \text{ if } \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \theta_\lambda$$

• If λ is the zero-one loss function,

  $$\lambda = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \text{then } \theta_\lambda = \frac{P(\omega_2)}{P(\omega_1)} = \theta_a$$

  $$\text{If } \lambda = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix}, \quad \text{then } \theta_\lambda = \frac{2\,P(\omega_2)}{P(\omega_1)} = \theta_b$$

Figure: class-conditional pdfs and the likelihood ratio p(x | ω1)/p(x | ω2)
• If we use the zero-one loss function, the decision boundaries are determined by the threshold θa
• If the loss function penalizes miscategorizing ω2 as ω1 more than the converse, we get the larger threshold θb, and hence the region R1 becomes smaller
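As a sketch of how the loss matrix moves the threshold, reusing the same hypothetical priors and Gaussian class-conditionals as above:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.6, 0.4                              # hypothetical priors P(w1), P(w2)
d1, d2 = norm(0.0, 1.0), norm(2.0, 1.0)        # hypothetical class-conditional densities

def decide(x, loss):
    # loss[i][j] = lambda(a_{i+1} | w_{j+1}): cost of action a_{i+1} when the true class is w_{j+1}
    theta = (loss[0][1] - loss[1][1]) / (loss[1][0] - loss[0][0]) * (p2 / p1)
    ratio = d1.pdf(x) / d2.pdf(x)              # likelihood ratio p(x|w1)/p(x|w2)
    return 1 if ratio > theta else 2

zero_one = [[0, 1], [1, 0]]                    # threshold theta_a = P(w2)/P(w1)
asym     = [[0, 2], [1, 0]]                    # threshold theta_b = 2 P(w2)/P(w1), shrinks R1
print(decide(1.1, zero_one), decide(1.1, asym))   # -> 1 2: the stricter loss shifts the decision to w2
```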
Classifiers, Discriminant Functions
and Decision Surfaces

• There are many methods of representing pattern classifiers
• One is a set of discriminant functions gi(x), i = 1,…, c
• The classifier assigns feature vector x to class ωi if gi(x) > gj(x) ∀ j ≠ i
• The classifier is a machine that computes c discriminant functions

Figure: functional structure of a general statistical pattern classifier with d inputs and c discriminant functions gi(x)
Forms of Discriminant Functions

• Let gi(x) = −R(αi | x)
  (maximum discriminant corresponds to minimum risk)
• For minimum error rate, we take gi(x) = P(ωi | x)
  (maximum discriminant corresponds to maximum posterior)
• Any monotonically increasing transformation gives an equivalent discriminant, e.g.:
  gi(x) ≡ P(x | ωi) P(ωi)
  gi(x) = ln P(x | ωi) + ln P(ωi)
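A brief sketch showing that these forms are equivalent discriminants, i.e. they produce the same argmax decision; the priors and univariate Gaussian class-conditionals below are hypothetical:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.3, 0.7])                          # hypothetical P(w1), P(w2)
densities = [norm(-1.0, 1.0), norm(1.0, 2.0)]          # hypothetical p(x | wi)

def g_posterior(x):   # gi(x) = P(wi | x)
    joint = np.array([d.pdf(x) * p for d, p in zip(densities, priors)])
    return joint / joint.sum()

def g_joint(x):       # gi(x) = p(x | wi) P(wi)
    return np.array([d.pdf(x) * p for d, p in zip(densities, priors)])

def g_log(x):         # gi(x) = ln p(x | wi) + ln P(wi)
    return np.array([d.logpdf(x) + np.log(p) for d, p in zip(densities, priors)])

x = 0.25
# All three forms are monotone transformations of each other, so the argmax agrees
print(np.argmax(g_posterior(x)), np.argmax(g_joint(x)), np.argmax(g_log(x)))
```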



Decision Region
• Feature space divided into c decision regions
if gi(x) > gj(x) ∀j ≠ i then x is in Ri

Figure: a 2-D, two-category classifier with Gaussian class-conditional pdfs.
The decision boundary consists of two hyperbolas; hence decision region R2 is not simply connected.
Ellipses mark where the density is 1/e times that of the peak of the distribution.
The Two-Category case
• A two-category classifier is a dichotomizer: it has two discriminant functions g1 and g2
• Let g(x) ≡ g1(x) − g2(x)
• Decide ω1 if g(x) > 0; otherwise decide ω2
• Two equivalent forms of g(x):

  $$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$

  $$g(x) = \ln \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
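A minimal dichotomizer sketch using the log form of g(x); the priors and Gaussian densities are again hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and univariate Gaussian class-conditionals
p1, p2 = 0.5, 0.5
d1, d2 = norm(0.0, 1.0), norm(3.0, 1.5)

def g(x):
    # g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2)
    return (d1.logpdf(x) - d2.logpdf(x)) + np.log(p1 / p2)

for x in (0.5, 2.5):
    print(x, "-> w1" if g(x) > 0 else "-> w2")
```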
The Normal Distribution
A bell-shaped distribution defined by the probability density function

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

If the random variable X follows a normal distribution, then
• The probability that X will fall into the interval (a, b) is given by
  $$\int_a^b p(x)\,dx$$
• The expected, or mean, value of X is
  $$E[X] = \int_{-\infty}^{\infty} x\, p(x)\,dx = \mu$$
• The variance of X is
  $$\mathrm{Var}(X) = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\, p(x)\,dx = \sigma^2$$
• The standard deviation of X is σx = σ
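These definitions can be checked numerically; a small sketch using SciPy's normal density and numerical integration (the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 1.0, 2.0                      # arbitrary parameters
p = norm(loc=mu, scale=sigma).pdf         # the density p(x) defined above

# Probability that X falls in (a, b), by numerical integration of p(x)
prob, _ = quad(p, -1.0, 3.0)
print(prob)                               # ~0.683 for (mu - sigma, mu + sigma)

# Mean and variance recovered as integrals of x p(x) and (x - mu)^2 p(x)
mean, _ = quad(lambda x: x * p(x), -np.inf, np.inf)
var, _  = quad(lambda x: (x - mean) ** 2 * p(x), -np.inf, np.inf)
print(mean, var)                          # ~1.0, ~4.0
```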
Relationship between Entropy and Normal Density

• Entropy of a distribution:

  $$H(p(x)) = -\int_{-\infty}^{\infty} p(x) \ln p(x)\,dx$$

• Measured in nats; if log2 is used, the unit is bits
• Entropy measures the uncertainty in the values of points selected randomly from a distribution
• The normal distribution has maximum entropy over all distributions having a given mean and variance
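A short numerical check of the Gaussian's entropy against its closed form H = ½ ln(2πeσ²); the σ below is arbitrary, and the integration limits are chosen wide enough that the tails are negligible:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 2.0                                     # arbitrary standard deviation
p = norm(loc=0.0, scale=sigma).pdf

# Differential entropy H = -integral of p(x) ln p(x) dx, in nats
H_numeric, _ = quad(lambda x: -p(x) * np.log(p(x)), -20 * sigma, 20 * sigma)

# Closed form for a Gaussian: H = 0.5 * ln(2 * pi * e * sigma^2)
H_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
print(H_numeric, H_closed)                      # the two values agree
```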
Normal Distribution with Mean 0, Standard Deviation 1

Figure: the standard normal density. With 80% confidence the r.v. will lie in the two-sided interval [−1.28, 1.28].
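A quick check of this interval using SciPy's standard normal:

```python
from scipy.stats import norm

# Two-sided 80% interval for a standard normal: leaves 10% in each tail
lo, hi = norm.ppf(0.10), norm.ppf(0.90)
print(lo, hi)                              # ~ -1.2816, 1.2816
print(norm.cdf(hi) - norm.cdf(lo))         # ~ 0.80
```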



The Normal Density in Pattern Recognition
• Univariate density
• Analytically tractable, continuous
• A lot of processes are asymptotically Gaussian
• Central Limit Theorem: the aggregate effect of a sum of a large number of small, independent random disturbances will lead to a Gaussian distribution
• Handwritten characters and speech sounds can be viewed as an ideal or prototype corrupted by a random process

  $$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$

where:
  µ = mean (or expected value) of x
  σ² = expected squared deviation, or variance

The univariate normal distribution has roughly 95% of its area in the range |x − µ| < 2σ.
The peak of the distribution has value p(µ) = 1/(√(2π) σ).
Multivariate density

The multivariate normal density in d dimensions is:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$

abbreviated as p(x) ~ N(µ, Σ), where:
  x = (x1, x2, …, xd)t (t denotes the transpose)
  µ = (µ1, µ2, …, µd)t is the mean vector
  Σ is the d × d covariance matrix
  |Σ| and Σ⁻¹ are its determinant and inverse, respectively
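A small sketch evaluating this density directly and comparing it with SciPy's multivariate_normal; the 2-D mean and covariance below are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D parameters
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def mvn_pdf(x, mu, Sigma):
    # Direct evaluation of the d-dimensional normal density formula
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([0.5, 0.0])
print(mvn_pdf(x, mu, Sigma))                        # explicit formula
print(multivariate_normal(mu, Sigma).pdf(x))        # SciPy reference, same value
```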



Mean and Covariance Matrix

• Formal definitions:

  $$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\,d\mathbf{x}$$

  $$\Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t] = \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\, p(\mathbf{x})\,d\mathbf{x}$$

• The components of the mean vector are the means of the individual variables
• Covariance matrix:
  Diagonal elements are the variances of the variables
  Off-diagonal elements are the covariances of pairs of variables
  Statistical independence implies the off-diagonal elements are zero
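A short sketch estimating the mean vector and covariance matrix from samples, with sample averages standing in for the integrals; the true parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([2.0, -1.0])
Sigma_true = np.array([[1.0, 0.6],
                       [0.6, 2.0]])

# Draw samples from N(mu, Sigma) and estimate the mean vector and covariance matrix
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)
mu_hat = X.mean(axis=0)                       # sample mean vector
Sigma_hat = np.cov(X, rowvar=False)           # sample covariance matrix (d x d)
print(mu_hat)       # close to mu_true
print(Sigma_hat)    # diagonal ~ variances, off-diagonal ~ covariances
```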
Multivariate Normal Density

• Specified by d + d(d+1)/2 parameters: the mean and the independent elements of the covariance matrix
• Loci of points of constant density are hyperellipsoids

Figure: samples drawn from a 2-D Gaussian lie in a cloud centered at the mean µ. Ellipses show lines of equal probability density of the Gaussian.
Linear Combinations of Normally Distributed Variables are Normally Distributed

Whitening Transform
If Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of the corresponding eigenvalues, then the transformation Aw = ΦΛ^(−1/2) applied to the coordinates ensures that the transformed distribution has a covariance matrix equal to the identity matrix.

Figure: the action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, A, takes the source distribution into the distribution N(Atµ, AtΣA). Another linear transformation, a projection P onto a line defined by vector a, leads to N(µ, σ²) measured along that line. While the transforms yield distributions in a different space, they are shown superimposed on the original x1-x2 space. A whitening transform Aw leads to a circularly symmetric Gaussian, here shown displaced.
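A small numerical sketch of the whitening transform: eigendecompose an assumed covariance matrix, form Aw = ΦΛ^(−1/2), and verify that the transformed covariance is the identity:

```python
import numpy as np

# Hypothetical covariance matrix of the original distribution
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

# Eigendecomposition: columns of Phi are orthonormal eigenvectors, eigvals the eigenvalues
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)          # whitening transform A_w = Phi Lam^(-1/2)

# Covariance after the transform y = A_w^t x is A_w^t Sigma A_w, which should be the identity
Sigma_white = A_w.T @ Sigma @ A_w
print(np.round(Sigma_white, 6))               # ~ identity matrix
```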
Mahalanobis Distance

$$r^2 = (\mathbf{x}-\boldsymbol{\mu})^t\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})$$

is the squared Mahalanobis distance from x to µ.

• Contours of constant density are hyperellipsoids of constant Mahalanobis distance
• For a given dimensionality, the scatter of the samples varies directly with |Σ|^(1/2)

Figure: samples drawn from a 2-D Gaussian lie in a cloud centered at the mean µ. Ellipses show lines of equal probability density of the Gaussian.
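A brief sketch computing the squared Mahalanobis distance for an assumed mean and covariance, checked against SciPy's implementation (which returns r, not r²):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical mean and covariance
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([1.5, -1.0])
diff = x - mu
r2 = diff @ Sigma_inv @ diff                  # squared Mahalanobis distance
print(r2)
print(mahalanobis(x, mu, Sigma_inv) ** 2)     # same value via SciPy
```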

