
Bayesian decision theory and density estimation

Giovanni Montana
Imperial College
14 July, 2010
1. Bayes decision error for minimum error
2. Bayes decision error for minimum risk
3. Non-parametric density estimation methods
4. Performance assessment
Bayes decision error for minimum error
The problem
We have observed {x_i, t_i} on a training data set (random sample)
Input x is a p-dimensional vector (x_1, x_2, . . . , x_p) in Euclidean space
Each x will be called a pattern, data point or observation
Output t is generally univariate
  In regression, the output is typically a continuous measurement
  In classification, the output is a class label C_k, k = 1, . . . , K
In some applications the response may also be multivariate, perhaps high-dimensional
The joint probability distribution p(x, t) is generally unknown and is estimated using the training data
Given a new, unseen pattern x*, pattern classification consists in
  Assigning the correct class label t* for x*
  Making a decision accordingly
Bayes decision error for minimum error
Example: patient classification using MRI data
Data points {x_i}, i = 1, . . . , n consist of n_1 healthy subjects and n_2 individuals with Alzheimer's disease
For each sample, the input x consists of p pixel intensities
Output t consists of two classes, C_1 (healthy) and C_2 (diseased)
Take t = 0 to indicate C_1 and t = 1 to indicate C_2
We wish to classify a new patient, and take a related action (e.g. provide a treatment), on the basis of x
Bayes decision error for minimum error
Making decisions when no learning data is available
Assume known prior probabilities p(C_k), k = 1, 2
On unseen data, we want to make as few misclassifications as possible
Suppose we only use our prior information
We would assign a new pattern x to class k if

    p(C_k) > p(C_j),   j ≠ k

For classes with equal probability, objects are randomly assigned
We can do better by making use of the data x
Bayes decision error for minimum error
Posterior probabilities
We are interested in the posterior probabilities p(C_k|x), k = 1, 2
Using Bayes' theorem

    p(C_k|x) = \frac{p(x|C_k) p(C_k)}{p(x)}

If p(x, C) is known, all the required probabilities are available
The evidence p(x) is

    p(x) = p(x, C_1) + p(x, C_2) = p(x|C_1) p(C_1) + p(x|C_2) p(C_2)

It is just a scaling factor ensuring that \sum_k p(C_k|x) = 1
How do we make as few misclassifications as possible?
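As a concrete illustration (not from the original slides), here is a minimal Python sketch that evaluates these posterior probabilities for two classes; the Gaussian class-conditional densities and the prior values are assumptions chosen only for the example.

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.6, 0.4])          # p(C_1), p(C_2): assumed values

    def class_likelihoods(x):
        """Assumed class-conditional densities p(x|C_1) and p(x|C_2)."""
        return np.array([norm.pdf(x, loc=0.0, scale=1.0),
                         norm.pdf(x, loc=2.0, scale=1.0)])

    x_new = 1.3                                 # a new observation
    joint = class_likelihoods(x_new) * priors   # p(x|C_k) p(C_k)
    evidence = joint.sum()                      # p(x), the scaling factor
    posterior = joint / evidence                # p(C_k|x), sums to one
    print(posterior)

Changing the priors moves the posteriors even when the likelihoods stay the same, which is exactly the effect exploited by the decision rules discussed next.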
Bayes decision error for minimum error
Example: conditional densities with univariate input
Left: class-conditional densities p(x|C_1) and p(x|C_2)
Right: posterior probabilities p(C_1|x) and p(C_2|x)
Bayes decision error for minimum error
Intuitive approach
After observing x, we find that p(C_1|x) > p(C_2|x)
Intuitively, we would classify x as C_1
We would take the opposite decision if p(C_2|x) > p(C_1|x)
Accordingly, we would incur an error with probability

    p(error|x) = \begin{cases} p(C_1|x) & \text{if we decide } C_2 \\ p(C_2|x) & \text{if we decide } C_1 \end{cases}

Is this approach optimal?
Note that

    p(error) = \int_{-\infty}^{\infty} p(error, x) dx = \int_{-\infty}^{\infty} p(error|x) p(x) dx

where p(x) is fixed
If we ensure that p(error|x) is as small as possible for every x, the integral must be as small as possible
Bayes decision error for minimum error
Bayes decision rule for minimal error (two classes)
Divide the input space x into decision regions R_k, k = 1, 2
  If pattern x is in R_k, then x is classified as belonging to C_k
The probability of misclassification is

    p(error) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
             = \int_{R_1} p(x, C_2) dx + \int_{R_2} p(x, C_1) dx
             = \int_{R_1} p(C_2|x) p(x) dx + \int_{R_2} p(C_1|x) p(x) dx

Ignoring the common factor p(x), the rule that minimises the probability of misclassification is

    assign x to C_1 if p(C_1|x) > p(C_2|x)

Under this decision rule,

    p(error|x) = min[p(C_1|x), p(C_2|x)]
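A minimal sketch of this rule under the same assumed two-Gaussian model: every point of a grid is assigned to the class with the larger joint density p(x, C_k), and p(error) is approximated by numerically integrating min_k p(x, C_k) over the grid.

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.6, 0.4])                            # assumed priors
    means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])  # assumed class-conditionals

    grid = np.linspace(-6.0, 8.0, 4001)
    joint = np.vstack([norm.pdf(grid, means[k], sds[k]) * priors[k] for k in range(2)])

    decision = joint.argmax(axis=0) + 1                  # class with larger p(x, C_k) at each x
    dx = grid[1] - grid[0]
    p_error = joint.min(axis=0).sum() * dx               # integral of min_k p(x, C_k)
    print(decision[np.searchsorted(grid, 1.3)], p_error)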
Bayes decision error for minimum error
Example: p(error) with K = 2
p(error) is the coloured area
x̂ is a boundary value defining the two decision regions
x_0 is the optimal threshold – the probability of error is minimised
Bayes decision error for minimum error
Likelihood ratio
According to the Bayes rule,

    x belongs to class C_1 if p(C_1|x) > p(C_2|x)

The rule provides the required decision boundary
Using Bayes' theorem, we can rewrite the rule as

    x belongs to class C_1 if p(x|C_1) p(C_1) > p(x|C_2) p(C_2)

or alternatively

    x belongs to class C_1 if \frac{p(x|C_1)}{p(x|C_2)} > \frac{p(C_2)}{p(C_1)}

where the left-hand side is the likelihood ratio
Clearly the rule depends on both the prior probabilities and the class-conditional densities
Bayes decision error for minimum error
Example: Decision regions in the univariate case
Using a 0–1 loss function, R_1 and R_2 are determined by θ_a
If the loss function penalises misclassifying C_2 as C_1 more than the converse, the threshold is higher (θ_b > θ_a) and R_1 is smaller.
Bayes decision error for minimum error
Bayes decision rule for minimal error (more than two classes)
Divide the input space x into decision regions R_k, k = 1, . . . , K
The probability of correct classification is

    p(correct) = \sum_{k=1}^{K} p(x ∈ R_k, C_k)
               = \sum_{k=1}^{K} \int_{R_k} p(x, C_k) dx
               = \sum_{k=1}^{K} \int_{R_k} p(C_k|x) p(x) dx

where p(x) is fixed
This probability is maximised when x is assigned to the class C_k for which p(C_k|x) is the largest
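For K > 2 classes the rule is simply an argmax over p(x|C_k) p(C_k). A small sketch with three assumed univariate Gaussian classes:

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.5, 0.3, 0.2])          # assumed values
    means = np.array([-2.0, 0.0, 3.0])
    sds = np.array([1.0, 0.8, 1.5])

    def classify(x):
        joint = norm.pdf(x, means, sds) * priors   # p(x|C_k) p(C_k), k = 1..K
        return joint.argmax() + 1                  # class maximising p(C_k|x)

    print([classify(x) for x in (-2.5, 0.1, 4.0)])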
Bayes decision error for minimum error
Remark
We have seen that, in the two-class case, the error is

    p(error) = \int_{-\infty}^{\infty} p(error|x) p(x) dx

Under the optimal decision rule

    p(error|x) = min[p(C_1|x), p(C_2|x)]

Note that, even if the posterior densities are continuous, this form of conditional error virtually always leads to a discontinuous integrand in the error
For arbitrary densities, an upper bound for the error is given by

    p(error|x) < 2 p(C_2|x) p(C_1|x)
Bayes decision error for minimum risk
Loss matrix
In some applications either a loss or utility function may be available
A (K × K) loss matrix Λ with elements λ_{ij} defines the loss incurred when a pattern in class i is classified as belonging to class j, with i, j = 1, . . . , K
When K = 2, we have

    Λ = \begin{pmatrix} 0 & λ_{01} \\ λ_{10} & 0 \end{pmatrix}

In the patient classification example, we could take λ_{10} > λ_{01}
An optimal solution now minimises the loss function
The loss function depends on the true classification, which is not available with certainty – the uncertainty is quantified by p(x, C_k)
Bayes decision error for minimum risk
Conditional and expected loss
The total conditional loss or risk incurred when a pattern x is assigned to class k is

    r(C_k|x) = \sum_{i=1}^{K} λ_{ik} p(C_i|x)

The average loss or risk over the region supporting C_k is

    E[r(C_k|x)] = \int_{R_k} r(C_k|x) p(x) dx
                = \int_{R_k} \sum_{i=1}^{K} λ_{ik} p(C_i|x) p(x) dx
                = \int_{R_k} \sum_{i=1}^{K} λ_{ik} p(x, C_i) dx
Bayes decision error for minimum risk
Minimising the expected loss
The average loss or risk is

    E[r(x)] = \sum_{k=1}^{K} \int_{R_k} \sum_{i=1}^{K} λ_{ik} p(x, C_i) dx

We want to define regions {R_k} that minimise this expected loss
This implies that we should minimise

    \sum_{i=1}^{K} λ_{ik} p(C_i|x) p(x)

This is the same as classifying x to the class k for which the conditional risk

    r(C_k|x) = \sum_{i=1}^{K} λ_{ik} p(C_i|x)

is minimum – the resulting minimum overall risk is called the Bayes risk
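The sketch below illustrates the minimum-risk rule for two classes; the loss matrix and the posterior values are assumptions made only for the example. Note how a sufficiently asymmetric loss can overturn the minimum-error decision.

    import numpy as np

    # lambda_ik = loss of assigning class k to a pattern whose true class is i (assumed values)
    Lambda = np.array([[0.0, 1.0],      # true class C_1
                       [5.0, 0.0]])     # true class C_2: deciding C_1 costs 5
    posterior = np.array([0.7, 0.3])    # assumed p(C_1|x), p(C_2|x)

    risk = posterior @ Lambda           # r(C_k|x) = sum_i lambda_ik p(C_i|x)
    decision = risk.argmin() + 1        # Bayes minimum-risk decision
    print(risk, decision)               # decides C_2 even though p(C_1|x) is larger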
Bayes decision error for minimum risk
Example: two classes (1/2)
Suppose there are only two classes, C_1 and C_2
Let λ_{ij} = λ(C_i|C_j) be the loss incurred for deciding C_i when the true class is C_j
The conditional risks are

    r(C_1|x) = λ_{11} p(C_1|x) + λ_{12} p(C_2|x)

and

    r(C_2|x) = λ_{21} p(C_1|x) + λ_{22} p(C_2|x)

The rule says that we should decide C_1 if

    r(C_1|x) < r(C_2|x)

Using posterior probabilities, we decide C_1 if

    (λ_{21} − λ_{11}) p(C_1|x) > (λ_{12} − λ_{22}) p(C_2|x)

where both bracketed differences are positive, or otherwise decide C_2
Bayes decision error for minimum risk
Example: two classes (2/2)
Alternatively, by employing Bayes' formula, we decide C_1 if

    (λ_{21} − λ_{11}) p(x|C_1) p(C_1) > (λ_{12} − λ_{22}) p(x|C_2) p(C_2)

or otherwise decide C_2
Yet another formulation of the same rule suggests that we decide C_1 if

    \frac{p(x|C_1)}{p(x|C_2)} > \frac{λ_{12} − λ_{22}}{λ_{21} − λ_{11}} \, \frac{p(C_2)}{p(C_1)}

where the left-hand side is the likelihood ratio, or otherwise decide C_2 – assuming (λ_{21} − λ_{11}) > 0
Bayes decision error for minimum risk
A special case: the zero-one loss matrix
A special loss function has elements λ_{ik} = 1 − I_{ik}, where I_{ik} = 1 if i = k and I_{ik} = 0 otherwise
We classify a pattern as belonging to class k if

    \sum_{i=1}^{K} λ_{ik} p(C_i|x) = \sum_{i=1}^{K} (1 − I_{ik}) p(C_i|x) = 1 − \sum_{i=1}^{K} I_{ik} p(C_i|x)

is a minimum – we have used the fact that \sum_{i=1}^{K} p(C_i|x) = 1
This suggests that we should choose the class k for which

    1 − p(C_k|x)

is the smallest
Equivalently, choose the class k for which p(C_k|x) is a maximum
Minimising the expected loss will then minimise the misclassification rate
Bayes decision error for minimum risk
The reject option
Classification errors arise from regions where the largest of the p(C_k|x) is significantly less than unity – that is, when we are very uncertain about class membership
In some applications we may prefer not to classify (i.e. to reject) patterns for which we are very uncertain
Introduce an arbitrary threshold θ ∈ [0, 1]
Define a reject region

    reject region = {x : max_k p(C_k|x) ≤ θ}

Then take either one of the following decisions:
  Do not classify patterns that fall in the reject region
  Classify all other patterns using the Bayes decision rules, as before
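A short sketch of the reject rule; the posterior vectors and the threshold θ are assumed values.

    import numpy as np

    theta = 0.8
    posteriors = np.array([[0.95, 0.05],    # confident pattern: classify
                           [0.55, 0.45]])   # uncertain pattern: reject

    for post in posteriors:
        if post.max() <= theta:
            print("reject")
        else:
            print("assign to class", post.argmax() + 1)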
Bayes decision error for minimum risk
Example: the reject option with univariate input
Bayes decision error for minimum risk
The reject option that minimises the expected loss
We now account for the loss incurred when a reject decision is made
If we decide to assign pattern x to class k, we will incur an expected loss of

    \sum_{i=1}^{K} λ_{ik} p(C_i|x)

Suppose that, if we opt for the reject option, we incur a loss of λ
If we take

    j = arg min_k \sum_{i=1}^{K} λ_{ik} p(C_i|x)

then the expected loss is minimised if we follow this rule:
  Choose class j if min_k \sum_{i=1}^{K} λ_{ik} p(C_i|x) < λ
  Otherwise reject
Bayes decision error for minimum risk
The reject option that minimises the expected loss
For a zero-one loss matrix, λ_{ik} = 1 − I_{ik}, we choose class j when

    min_k \sum_{i=1}^{K} λ_{ik} p(C_i|x) = min_k {1 − p(C_k|x)} < λ

or, equivalently, we reject when

    max_k p(C_k|x) < 1 − λ

In the standard reject criterion, we reject if

    max_k p(C_k|x) < θ

Hence the two criteria for rejection are equivalent provided that

    θ = 1 − λ
Bayes decision error for minimum risk
Summary
Strategies for obtaining decision regions needed to classify patterns were introduced
The Bayes rule for minimum error is the optimum rule
Introducing the costs of making incorrect decisions leads to the Bayes rule for minimum risk
We have assumed that a priori probabilities and class-conditional distributions are known – generally these will be learned from the data
Two alternatives will be briefly described next – both require knowledge of the class-conditional probability density functions
  the Neyman-Pearson rule
  the minimax rule
Bayes decision error for minimum risk
Neyman-Pearson decision rule for two-class problems
An alternative to the Bayes decision rule for a two-class problem is the Neyman-Pearson rule
This is often used in radar detection, when a signal has to be classified as real (class C_1) or noise (class C_2)
Two possible errors can be made

    ε_1 = \int_{R_2} p(x|C_1) dx = error probability of Type I
    ε_2 = \int_{R_1} p(x|C_2) dx = error probability of Type II

where
  C_1 is the positive class, so ε_1 is the false negative rate
  C_2 is the negative class, so ε_2 is the false positive rate
Bayes decision error for minimum risk
Neyman-Pearson decision rule
The rule arises from a constrained optimisation setup: minimise ε_1 subject to ε_2 being equal to a constant ε_0
Hence we wish to minimise

    err = \int_{R_2} p(x|C_1) dx + µ \left( \int_{R_1} p(x|C_2) dx − ε_0 \right)

where µ is the Lagrange multiplier
This will be minimised if we choose R_1 such that

    if µ p(x|C_2) − p(x|C_1) < 0 then x ∈ C_1
Bayes decision error for minimum risk
Decision rule
Expressed in terms of the likelihood ratio, the Neyman-Pearson decision rule states that

    if \frac{p(x|C_1)}{p(x|C_2)} > µ then x ∈ C_1

The threshold µ is found so that

    \int_{R_1} p(x|C_2) dx = ε_0

which often requires numerical integration
The rule depends only on the within-class distributions and ignores the a priori probabilities
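As a sketch, with two assumed Gaussian class-conditional densities, C_1 ~ N(2, 1) for the signal and C_2 ~ N(0, 1) for the noise, the likelihood ratio is monotone in x, so R_1 is a half-line {x > c} and the constraint on ε_2 can be solved with the Gaussian tail function instead of general numerical integration.

    from scipy.stats import norm

    eps0 = 0.05                                          # required false positive rate (epsilon_2)
    c = norm.isf(eps0, loc=0.0, scale=1.0)               # boundary: P(x > c | C_2) = eps0
    mu = norm.pdf(c, 2.0, 1.0) / norm.pdf(c, 0.0, 1.0)   # likelihood-ratio threshold
    eps1 = norm.cdf(c, loc=2.0, scale=1.0)               # resulting Type I error (false negatives)
    print(c, mu, eps1)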
Bayes decision error for minimum risk
Minimax criterion
Bayes decision rules rely on a knowledge of both the within-class
distributions and the prior class probabilities
Often the prior class probabilities are unknown or they may vary
according to external factors
A reasonable approach is to design a classifier so that the worst
possible error that can be made for any value of the priors is as small
as possible
The minimax procedure is designed to minimise the maximum
possible overall error or risk
We will consider the simple case of K = 2 classes
Bayes decision error for minimum risk
Bayes minimum error for a given prior
Let us consider the classification error

    e_B = p(C_2) \int_{R_1} p(x|C_2) dx + p(C_1) \int_{R_2} p(x|C_1) dx

As a function of p(C_1), e_B is non-linear because the decision regions also depend on the prior
When a value for p(C_1) has been selected, the two regions R_1 and R_2 are determined, and we regard the error as a function of p(C_1) only – we call it \tilde{e}_B
\tilde{e}_B is linear in p(C_1) and we can easily determine the value of p(C_1) which gives the largest error
Bayes decision error for minimum risk
Example: Bayes minimum error curves
Bayes decision error for minimum risk
Example: Bayes minimum error curve
The true Bayes error must be less than or equal to that linearly bounded value, since one has the freedom to change the decision boundary at each value of p(C_1)
Also note that the Bayes error is zero at p(C_1) = 0 and p(C_1) = 1, since the Bayes decision rule under those conditions is to always decide C_2 or C_1, respectively, and this gives zero error
Thus the curve of the Bayes error is concave down over all prior probabilities
For fixed decision regions, the maximum error will occur at an extreme value of the prior – in the example, p(C_1) = 1
Bayes decision error for minimum risk
Minimax solution
The minimax procedure minimises the largest possible error,

    max(\tilde{e}_B(0), \tilde{e}_B(1)) = max\left( \int_{R_2} p(x|C_1) dx, \; \int_{R_1} p(x|C_2) dx \right)

This is a minimum when \tilde{e}_B(p(C_1)) is horizontal and touches the Bayes minimum error curve at its peak
A minimum is achieved when

    \int_{R_2} p(x|C_1) dx = \int_{R_1} p(x|C_2) dx

which is the point where the error does not change as a function of the prior – in the example, p(C_1) = 0.6
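A sketch of the minimax condition for two assumed Gaussian classes, C_1 ~ N(0, 1) and C_2 ~ N(2, 1), with regions R_1 = {x < c} and R_2 = {x ≥ c}: the boundary is found numerically where the two conditional errors coincide (brentq and the class parameters are choices made only for this example).

    from scipy.stats import norm
    from scipy.optimize import brentq

    def error_gap(c):
        # int_{R_2} p(x|C_1) dx  -  int_{R_1} p(x|C_2) dx
        return norm.sf(c, 0.0, 1.0) - norm.cdf(c, 2.0, 1.0)

    c_minimax = brentq(error_gap, -5.0, 7.0)     # boundary where the two errors are equal
    worst_error = norm.sf(c_minimax, 0.0, 1.0)   # the minimax error
    print(c_minimax, worst_error)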
Bayes decision error for minimum risk
Three main approaches to solving decision problems
1. Using generative models
   First estimate the class-conditional densities p(x|C_k), k = 1, . . . , K
   Then use Bayes' theorem to find the posterior probabilities p(C_k|x)

       p(C_k|x) = \frac{p(x|C_k) p(C_k)}{p(x)}   where   p(x) = \sum_{k=1}^{K} p(x|C_k) p(C_k)

   Or model the joint distribution p(x, C_k) and then find p(C_k|x)
   Each p(C_k) can be estimated as the proportion of data points in class k
   Use decision theory to allocate a new input x to a class
2. Using discriminative models
   Model the posterior class probabilities p(C_k|x) directly
   Use decision theory to allocate a new input x to a class
3. Using discriminative functions
   Find a function f(x) which maps an input x directly onto a class
Non-parametric density estimation methods
Parametric and non-parametric density estimation
In the parametric approach, probability distributions have specific
functional forms governed by a small number of parameters
The parameters are estimated from the data, for instance using a
maximum-likelihood or Bayesian approach
A limitation of this approach is that the chosen density might be a
poor model for the distribution that generated the data, which results
in poor predictive performance
In the non-parametric approach, we make fewer assumptions about the form of the distributions
  Both frequentist and Bayesian approaches have been developed
  We shall focus on commonly used frequentist approaches
Non-parametric density estimation methods
Overview
We will introduce three popular methods for density estimation
  Histogram methods
  Kernel methods (or Parzen estimation methods)
  Nearest neighbours methods
Non-parametric density estimation methods
Histogram methods for density estimation
Let us assume a simple univariate pattern x observed on n objects
Using standard histograms we
  partition the observations x_1, x_2, . . . , x_n into distinct bins of width ∆_i
  count the number n_i of observations falling in each bin
In order to obtain a proper probability density, we divide n_i by the total number of observations n and by the width ∆_i to obtain

    p_i = \frac{n_i}{n ∆_i}

It then follows that the estimated density is indeed a probability density, so

    \int p(x) dx = 1

Under this approach the density is constant over the width of each bin, and usually ∆_i = ∆
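A sketch of the histogram estimate using numpy; the data-generating mixture is an assumption echoing the example on the next slide.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 0.5, 30), rng.normal(3.0, 0.7, 20)])   # n = 50 points

    counts, edges = np.histogram(x, bins=12)      # n_i and the bin edges
    widths = np.diff(edges)                       # Delta_i
    density = counts / (x.size * widths)          # p_i = n_i / (n * Delta_i)
    print((density * widths).sum())               # the estimate integrates to 1

The same normalised values are returned directly by np.histogram(x, bins=12, density=True).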
Non-parametric density estimation methods
Example: density estimation using the histogram method
50 data points generated from the distribution in green (mixture of two Gaussians)
When ∆ is too small, the estimated density can be under-smoothed (spiky), otherwise it can be too smooth
The edge location of the bins also plays a role, but has less effect
Non-parametric density estimation methods
Histogram method: pros and cons
Once the histogram has been computed, the raw data can be discarded, which is an advantage with large data sets
It is suitable for on-line updating, when the data points arrive sequentially
The estimated density has discontinuities due to the bin edges rather than any property of the true density
It also does not scale well with dimensionality
  Suppose x is p-dimensional with p large
  If each dimension is divided into M bins, we have a total of M^p bins
  The number of data points required by the method becomes far too large
  This exponential scaling with p is sometimes referred to as the curse of dimensionality
Non-parametric density estimation methods
The curse of dimensionality
Non-parametric density estimation methods
Towards alternative approaches
Despite its shortcomings, the method is widely used, especially for a quick visualisation of the data (in few dimensions)
Moreover, this approach highlights two elements needed for developing more complex methods
First, to estimate the probability density at a particular location, we should consider data points that lie within some local neighbourhood of that point
  This notion of locality requires that we assume some distance measure, for instance the Euclidean distance
  In histograms, this proximity notion was defined by the bins
  The bin width is a smoothing parameter that defines the spatial extent of the local region
Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results
Non-parametric density estimation methods
Preliminaries
Assume that n data points are drawn from an unknown probability density p(x) in a p-dimensional Euclidean space
Let us consider a small region Q containing x
The probability mass associated with this region is given by

    P = \int_{Q} p(x) dx

Since we have n data points drawn from p(x), each point has probability P of falling inside Q
The number k of points that lie inside Q follows a binomial distribution

    p(k | n, P) = \frac{n!}{k!\,(n−k)!} P^k (1−P)^{n−k}

The average fraction of points falling in Q is E[k/n] = P, with variance var[k/n] = P(1−P)/n
Non-parametric density estimation methods
Distribution of k as function of n
The true value is P = 0.7 – note that as n increases the curve peaks around P
In the limit n →∞ the curve approaches a delta function
Non-parametric density estimation methods
Density estimate
The expected fraction of points within Q is E[k/n] = P, so for large n we take

    k ≈ n P

If, however, we also assume that the region Q is sufficiently small, so that the probability density is roughly constant over it, then

    P ≈ p(x) V

where V is the volume of Q
Combining the two results we obtain a density estimate at x

    p(x) = \frac{P}{V} = \frac{k}{n V}
Non-parametric density estimation methods
Towards two alternative methods
Notice that the validity of the estimate

    p(x) = \frac{k}{n V}

depends on two contradictory assumptions:
  We want Q to be sufficiently large (in relation to the value of that density) so that the number k of points falling inside the region is large enough for the binomial to be sharply peaked around P
  We want Q to be sufficiently small that the density is approximately constant over the region
There are two possible approaches:
  We can fix V and determine k from the data – this is done by defining localised regions around x
  We can fix k and determine V from the data – this is done by searching for neighbours of x
Non-parametric density estimation methods
Kernel density estimator
We wish to determine the probability density at x
Take the region Q to be a small hypercube centred on x
In order to count the number k of points within Q we define a function representing the unit cube centred on the origin

    g(u) = \begin{cases} 1 & \text{if } |u_j| ≤ 1/2, \; j = 1, 2, . . . , p \\ 0 & \text{otherwise} \end{cases}

This is called the Parzen window, an instance of a kernel function
We have that

    g\!\left( \frac{x − x_i}{h} \right) = \begin{cases} 1 & \text{if } x_i \text{ lies inside a cube of side } h \text{ centred on } x \\ 0 & \text{otherwise} \end{cases}
Non-parametric density estimation methods
Kernel density estimation
The number k of data points inside the cube is therefore

    k = \sum_{i=1}^{n} g\!\left( \frac{x − x_i}{h} \right)

We have previously established that an estimate for p(x) is k/(nV)
Combining the two results gives

    p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V} g\!\left( \frac{x − x_i}{h} \right) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^p} g\!\left( \frac{x − x_i}{h} \right)

where we have taken V = h^p, the volume of a hypercube of side h in p dimensions
Non-parametric density estimation methods
Remarks on the kernel density estimator
As with the histogram method, the kernel density estimator suffers from the presence of discontinuities – here these are located at the boundaries of the cubes
Smoother density estimates can be obtained simply by using alternative kernel functions
A common choice is the Gaussian kernel, which yields the estimator

    p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2\pi h^2)^{p/2}} \exp\!\left( −\frac{\lVert x − x_i \rVert^2}{2 h^2} \right)

where h is the standard deviation of the Gaussian components
The density model places a Gaussian over each data point and then adds the contributions over all points
It can be proved that the kernel density estimator converges to the true density in the limit n → ∞
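A sketch of the Gaussian kernel estimator in one dimension (p = 1); the data set and the bandwidth h are assumptions made only for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0.0, 0.5, 30), rng.normal(3.0, 0.7, 20)])
    h = 0.3                                            # bandwidth (standard deviation)

    def kde(x, data, h):
        # place a Gaussian over each data point and average the contributions
        z = (x[:, None] - data[None, :]) / h           # (n_eval, n_data) standardised differences
        return np.exp(-0.5 * z**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

    grid = np.linspace(-2.0, 5.0, 200)
    estimate = kde(grid, data, h)
    print((estimate * (grid[1] - grid[0])).sum())      # close to 1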
Non-parametric density estimation methods
Example: kernel density estimation using a Gaussian kernel
50 data points generated from the distribution in green (mixture of two Gaussians)
h acts as a smoothing parameter
When h is too small the estimated density is under-smoothed (too noisy), otherwise it is over-smoothed
Non-parametric density estimation methods
Other kernels and improvements
Any other kernel can be chosen subject to

    g(u) ≥ 0   and   \int g(u) du = 1

thus ensuring that the resulting probability distribution is non-negative everywhere and integrates to one
The computational cost of evaluating the density grows linearly with n
Choosing the correct h is critical – much research has been done to develop procedures that learn h from the data
So far the parameter h has been fixed in all component kernels
  in regions of high density, a large h may lead to over-smoothing, while in regions of low density a small h may lead to noisy estimates
  the optimal h therefore depends on the region of the input space
Non-parametric density estimation methods
Nearest neighbours methods
Instead of fixing V and determining k from the data, we can fix k
and then use the data to find V accordingly
Suppose we want to estimate the density p(x)
Take a small sphere centred at the point x at which we want to estimate p(x)
We let the radius of the sphere grow until it contains exactly k points
The density estimate is then given as before by

    p(x) = \frac{k}{n V}

with V set to the volume of the sphere
The degree of smoothing is now governed by the parameter k, the number of nearest neighbours
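A sketch of the k-nearest-neighbour density estimate in one dimension, where the 'volume' V is the length of the interval reaching the k-th nearest point; the data and the value of k are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, 200)
    k = 10

    def knn_density(x, data, k):
        dist = np.abs(data - x)
        r = np.sort(dist)[k - 1]           # radius just containing k points
        volume = 2.0 * r                   # length of the interval [x - r, x + r]
        return k / (data.size * volume)    # p(x) = k / (n V)

    print(knn_density(0.0, data, k), knn_density(2.0, data, k))

Note that, unlike the kernel estimator, the resulting function is not guaranteed to integrate to one.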
Non-parametric density estimation methods
Example: nearest neighbours
50 data points generated from the distribution in green (mixture of two Gaussians)
k acts as a smoothing parameter
When k is too small the estimated density is under-smoothed (too noisy), otherwise it is over-smoothed
Non-parametric density estimation methods
Nearest neighbours method for classification
We can apply the nearest neighbour method to estimate the probability density in each class C_m, and then use Bayes' rule to obtain the posterior probabilities
Suppose we have M classes
We have observed n_m patterns in class C_m, with

    \sum_{m=1}^{M} n_m = n

Given a new pattern x to be classified, we take a sphere around it containing exactly its k nearest points
The sphere has volume V and contains k_m points from class C_m, with

    \sum_{m=1}^{M} k_m = k
Non-parametric density estimation methods
Nearest neighbours method for classification
The class priors can be estimated by

    p(C_m) = \frac{n_m}{n}

The class-conditional density estimates are

    p(x|C_m) = \frac{k_m}{n_m V},   m = 1, 2, . . . , M

As before, the unconditional density is given by

    p(x) = \frac{k}{n V}

Combining these results we obtain the posterior probabilities

    p(C_m|x) = \frac{p(x|C_m) p(C_m)}{p(x)} = \frac{k_m}{k}
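A sketch of the resulting classifier: the posterior for each class is simply the fraction k_m/k of the k nearest training patterns belonging to it. The toy data set, the labels and k are assumptions.

    import numpy as np

    X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
    t = np.array([1, 1, 1, 2, 2, 2])       # class labels of the training patterns
    k = 3

    def knn_posteriors(x_new, X, t, k):
        dist = np.linalg.norm(X - x_new, axis=1)
        neighbours = t[np.argsort(dist)[:k]]          # labels of the k closest patterns
        classes = np.unique(t)
        return classes, np.array([(neighbours == c).mean() for c in classes])

    classes, post = knn_posteriors(np.array([0.8, 0.9]), X, t, k)
    print(classes[post.argmax()], post)               # majority class and k_m / k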
Non-parametric density estimation methods
Remarks on the nearest neighbours method
This approach gives a sensible rule:
  find the k nearest neighbours of x and count how many fall in each class
  classify x as belonging to C_m if the majority of its neighbours are in C_m
As with the kernel density estimator, it can be proved that the nearest-neighbour density estimator converges to the true density in the limit n → ∞
An interesting property of the nearest neighbours method (when k = 1) is that, in the limit n → ∞, the error is never more than twice the minimum achievable error rate of an optimal classifier
The notion of similarity between any two given patterns x and x′ is important
  For patterns living in a Euclidean space, the Euclidean distance is an obvious choice, but other alternatives are available (more on this later)
  When the patterns are objects of a different nature (graphs, trees, strings) other distances should be used
Non-parametric density estimation methods
Example: classification with nearest neighbours
Classification using K = 3
Non-parametric density estimation methods
Example: classification with nearest neighbours
Decision region obtained using K = 1
Non-parametric density estimation methods
Example: oil flow data set
During the course of this study the proportions of oil, water and gas in North Sea oil transfer pipelines were measured
Depending on the geometrical configuration of these three materials, a pipe was labelled as
  Stratified
  Annular
  Homogeneous
Each data point comprises a 12-dimensional pattern consisting of non-invasive measurements taken with gamma ray densitometers
The principle is the following: if a narrow beam of gamma rays passes through the pipe, the attenuation in the intensity of the beam provides information about the density of material along its path
The ultimate objective is to classify a pipe as belonging to one of these three classes
Non-parametric density estimation methods
Example: oil flow data set
The three classes define geometrical configurations of the oil, water and gas
Each pipe is a pattern in a 12-dimensional space
Non-parametric density estimation methods
Example: oil flow data set
Red is Homogeneous, green is Annular and blue is Stratified
The pattern × has to be classified based on variables x_6 and x_7
Non-parametric density estimation methods
Example: oil flow data set
Decision region obtained using K = 1
The regions are fragmented and complex – overfitting may take place
Non-parametric density estimation methods
Example: oil flow data set
Decision region obtained using K = 3
The regions are smoother
Non-parametric density estimation methods
Example: oil flow data set
Decision region obtained using K = 31
The regions are too smooth
Non-parametric density estimation methods
Summary
When the true class-conditional densities are known, the minimal
achievable error rate is that of a Bayes classifier
When they are not known, they are first learned from the data using
either a parametric or non-parametric approach
Non-parametric approaches make no assumptions about the form of
the distributions and we have briefly considered three methods:
histograms, kernels and nearest-neighbours
All non-parametric methods rely on some tuning parameters that will inevitably affect the classification performance on unseen patterns
Performance assessment
The generalisation error
As we will see, almost all classifiers also rely on tuneable parameters
These parameters control the complexity of the model, which in turn
determines the complexity of the decision regions

  If the model is too simple (for instance a nearest neighbour classifier with a very large k), some salient features of the data will not be captured
  If the model is too complex, it will generate complex decision boundaries that may not describe the true boundaries well
This is the issue of generalisation – ultimately we want the model to perform well on new, unseen patterns
The question then is how to obtain measures of the generalisation error
Performance assessment
The independent test approach
Suppose the data set D contains n labelled patterns
We can randomly split the data set into two parts
  A training set (e.g. 90% of the data) used for adjusting the parameters
  A test set used to estimate the generalisation error
A measure of the generalisation error E can simply be the fraction of wrongly classified labels in the test set
Since the ultimate goal is that of obtaining a low generalisation error, we train the classifier until we reach a minimum of this error
Performance assessment
Generalisation error as a function of model complexity
Classifiers that are too complex perform well on training data but not on independent data
Performance assessment
v-fold cross validation
A simple generalisation of the previous approach is given by the v-fold
cross-validation approach
The data set D is divided into v disjoint sets of equal size n/v
The classifier is trained v times, each time with a different set held
out as a validation set
For each model, a generalisation error is computed, say E_i, i = 1, . . . , v
The estimated generalisation error is the mean of the v errors
In the limit, when v = n, the method is called leave-one-out cross-validation
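A sketch of v-fold cross-validation wrapped around the k-NN classifier sketched earlier; the synthetic data set, v and k are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)) + np.repeat([[0.0, 0.0], [2.0, 2.0]], 50, axis=0)
    t = np.repeat([1, 2], 50)
    v, k = 5, 3

    def knn_predict(x_new, X_train, t_train, k):
        dist = np.linalg.norm(X_train - x_new, axis=1)
        values, counts = np.unique(t_train[np.argsort(dist)[:k]], return_counts=True)
        return values[counts.argmax()]

    indices = rng.permutation(len(X))
    folds = np.array_split(indices, v)                 # v disjoint validation sets
    errors = []
    for i in range(v):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(v) if j != i])
        preds = np.array([knn_predict(x, X[train], t[train], k) for x in X[test]])
        errors.append(np.mean(preds != t[test]))       # E_i for this fold

    print(np.mean(errors))                             # estimated generalisation error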

1

Bayes decision error for minimum error

2

Bayes decision error for minimum risk

3

Non-parametric density estimation methods

4

Performance assessment

Giovanni Montana Imperial College ()

Bayesian decision theory and density estimation

14 July, 2010

2 / 68

Bayes decision error for minimum error

The problem
We have observed {xi , ti } on a training data set (random sample) Input x is a p dimensional vector (x1 , x2 , . . . , xp ) in Euclidean space Each x will be called pattern or data point or observation Output t is generally univariate
In regression, typically the output is a continuous measurement In classification, the output is a class label Ck , k = 1, . . . , K

In some applications the response may also be multivariate, perhaps high-dimensional Joint probability distribution p(x, t) is generally unknown and estimated using the training data Given a new, unseen pattern x∗ , pattern classification consists in
Assigning the correct class label t ∗ for x∗ Making a decision accordingly
Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July, 2010 3 / 68

n consists of n1 healthy subjects and n2 individuals with Alzheimer’s disease For each sample.Bayes decision error for minimum error Example: patient classification using MRI data Data points {xi }. C1 (healthy) and C2 (diseased) Take t = 0 to indicate C1 and t = 1 to indicate C2 We wish to classify a new patient.g. on the basis on x Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. i = 1. 2010 4 / 68 . . . input x consists of p pixel intensities Output t consists of two classes. . provide a treatment). and take a related action (e. .

2 On unseen data. 2010 5 / 68 . objects are randomly assigned We can do better by making use of the data x Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. we want to make as few misclassifications as possible Suppose we only use our prior information We would assign a new pattern x to class k if p(Ck ) > p(Cj ).Bayes decision error for minimum error Making decisions when no learning data is available Assume known prior probabilities p(Ck ). k = 1. j =k For classes with equal probability.

2010 6 / 68 . 2 Using Bayes’ theorem p(Ck |x) = p(x|Ck )p(Ck ) p(x) If p(x. C1 ) + p(x. C) is known. all the requires probabilities are available The evidence p(x) is p(x) = p(x.Bayes decision error for minimum error Posterior probabilities We are interested in posterion probabilities p(Ck |x). k = 1. C2 ) = p(x|C1 )p(C1 ) + p(x|C2 )p(C2 ) It is just a scaling factor so that p(Ck |x) = 1 k How do we make as few misclassifications as possible? Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

2010 7 / 68 .Bayes decision error for minimum error Example: conditional densities with univariate input Left: class conditional densities p(x|C1 ) and p(x|C2 ) Right: posterior probabilities p(C1 |x) and p(C2 |x) Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

x)dx = −∞ p(error|x) p(x) dx fixed If we ensure that p(error|x) is small as possible for every x. 2010 8 / 68 . we would incur an error with probability p(error|x) = Is this approach optimal? Note that ∞ ∞ p(C1 |x) p(C2 |x) if we decide C2 if we decide C1 p(error) = −∞ p(error. the integral must be as small as possible Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. we find that p(C1 |x) > p(C2 |x) Intuitively.Bayes decision error for minimum error Intuitive approach After observing x. we would classify x as C1 We would take the opposite decision if p(C2 |x) > p(C1 |x) Accordingly.

C2 ) + p(x ∈ R2 . C2 ) dx + R2 p(x. then x is classified as belonging to Ck The probability of misclassification is p(error) = p(x ∈ R1 . p(error|x) = min[p(C1 |x). 2 If pattern x is in Rk . k = 1. 2010 9 / 68 . p(C2 |x)] Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. the rule that minimises the probability of misclassification is assign x to C1 if p(C1 |x) > p(C2 |x) Under this decision rule.Bayes decision error for minimum error Bayes decision rule for minimal error (two classes) Divide the input space x into decision regions Rk . C1 ) dx p(C1 |x)p(x) dx R2 = R1 p(C2|x)p(x) dx + Ignoring the common factor p(x). C1 ) = R1 p(x.

the probability of error is minimised Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum error Example: p(error) with K = 2 p(error) is the coloured area x is a boundary value defining the two decision regions ˆ x0 is the optimal threshold . 2010 10 / 68 .

Bayes decision error for minimum error Likelihood ratio According to the Bayes rule. 2010 11 / 68 . we can rewrite the rule as x belongs to class C1 if p(x|C1 )p(C1 ) > p(x|C2 )p(C2 ) or alternatively x belongs to class C1 if p(x|C1 ) p(x|C2 ) likelihood ratio > p(C2 ) p(C1 ) Clearly the rule depends on both the prior probabilities and class-conditional densities Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. x belongs to class C1 if p(C1 |x) > p(C2 |x) The rule provides the required decision boundary Using Bayes’ theorem.

the threshold is higher (θb > θa ) and R1 is smaller. Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum error Example: Decision regions in the univariate case Using a 0 − 1 loss function. 2010 12 / 68 . R1 and R2 are determined by θa If the loss function penalises misclassifying C2 as C1 more than the converse.

. .Bayes decision error for minimum error Bayes decision rule for minimal error (more than two classes) Divide the input space x into decision regions Rk . . k = 1. . Ck ) p(x. K The probability of correct classification is K p(correct) = k=1 K p(x ∈ Rk . 2010 13 / 68 . Ck ) dx k=1 Rk K = = k=1 Rk p(Ck |x) p(x) dx fixed This probability is maximised when x is assigned to class Ck for which p(Ck |x) is the largest Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

an upper bound for the error is given by p(error|x) < 2p(C2 |x)p(C1 |x) Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 14 / 68 . this form of conditional error virtually always leads to a discontinuous integrand in the error For arbitrary densities. the error is ∞ p(error) = −∞ p(error|x)p(x)dx Under the optimal decision rule p(error|x) = min[p(C1 |x).Bayes decision error for minimum error Remark We have seen that. even if the posterior densities are continuous. p(C2 |x)] Note that. in the two-classes case.

.Bayes decision error for minimum risk Loss matrix In some applications either a loss or utility function may be available A (K × K ) loss matrix Λ with elements λij defines the loss incurred when a pattern in class i is classified as belonging to class j. . . we could take: λ10 > λ01 An optimal solution now minimises the loss function The loss function depends on the true classification. j = 1. which is not available with certainty– the uncertainty is quantified by p(x. K When K = 2. 2010 15 / 68 . with i. we have Λ= 0 λ01 λ10 0 In the patient classification example. . Ck ) Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Conditional and expected loss The total conditional loss or risk incurred when a pattern x is assigned to class k is K r (Ck |x) = i=1 λik p(Ci |x) The average loss or risk over the region supporting Ck is E[r (Ck |x)] = Rk K r (Ck |x)p(x) dx λik p(Ci |x)p(x) dx Rk i=1 K = = Rk i=1 λik p(x. 2010 16 / 68 . Ci ) dx Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Minimising the expected loss The average loss or risk is K K E[r (x)] = k=1 Rk i=1 λik p(x. 2010 17 / 68 . Ci ) dx We want to define regions {Rk } that minimise this expected loss This implies that we should minimise K λik p(Ci |x)p(x) i=1 This is the same as classifying x to the class k for which the conditional risk K r (Ck |x) = i=1 λik p(Ci |x) is minimum – the resulting minimum overall risk is called Bayes risk Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Example: two classes (1/2) Suppose there are only two classes. C1 and C2 Let λij = λ(Ci |Cj ) be the loss incurred for deciding Ci when the true class is Cj The conditional risks are r (C1 |x) = λ11 p(C1 |x) + λ12 p(C2 |x) and r (C2 |x) = λ21 p(C1 |x) + λ22 p(C2 |x) The rule says that we should decide C1 if r (C1 |x) < r (C2 |x) Using posterior probabilities. 2010 18 / 68 . we decide C1 if (λ21 − λ11 ) p(C1 |x) > (λ12 − λ22 ) p(C2 |x) positive positive or otherwise decide C2 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

we decide C1 if (λ21 − λ11 )p(x|C1 )p(C1 ) > (λ12 − λ22 )p(x|C2 )p(C2 ) or otherwise decide C2 Yet another formulation of the same rule suggests that we decide C1 if p(x|C1 ) p(x|C2 ) likelihood ratio > λ12 − λ22 p(C2 ) λ21 − λ11 p(C1 ) or otherwise decide C2 – assuming (λ21 − λ11 ) > 0 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. by employing Bayes formula.Bayes decision error for minimum risk Example: two classes (2/2) Alternatively. 2010 19 / 68 .

choose class k for which p(Ck |x) is a maximum Minimising the expect loss will minimise the misclassification rate Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum risk A special case: the zero-one loss matrix A special loss function has elements λik = 1 − Iik Iik = 1 if i = k otherwise Iik = 0 We classify a pattern as belonging to class k if K K K λik p(Ci |x) = i=1 i=1 (1 − Iik )p(Ci |x) = 1 − i=1 Iik p(Ci |x) is a minimum – we have used the fact that K p(Ci |x) = 1 i=1 This suggests that we should choose class k for which 1 − p(Ck |x) is the smallest Equivalently. 2010 20 / 68 .

reject) patterns for which we are very uncertain Introduce an arbitrary threshold θ ∈ [0. 1] Define a reject region reject region = {x : maxk p(Ck |x) ≤ θ} Then take either one of the following decisions: Do not classify patterns that falls in the reject region Classify all other patterns using the Bayes decision rules.e. as before Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 21 / 68 .Bayes decision error for minimum risk The reject option Classification errors arise from regions where the largest of p(Ck |x) is significantly less than unity – that is when we are very uncertain about class membership In some applications we may want not to classify (i.

Bayes decision error for minimum risk Example: the reject option with univariate input Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 22 / 68 .

we incur a loss of λ If we take j = arg min k i=1 K λik p(Ci |x) then the expected loss is minimised if we follow this rule: Choose class j if mink Otherwise reject Giovanni Montana Imperial College () K i=1 λik p(Ci |x) < λ Bayesian decision theory and density estimation 14 July. we will incur an expected loss of K λik p(Ci |x) i=1 Suppose that.Bayes decision error for minimum risk The reject option that minimises the expected loss We now account for the loss incurred when a reject decision is made If we decide to assign pattern x to class k. 2010 23 / 68 . if we opt for the reject option.

Bayes decision error for minimum risk The reject option that minimises the expected loss For a zero-one loss matrix. when max p(Ck |x) < 1 − λ k In the standard reject criterion. 2010 24 / 68 . we reject if max p(Ck |x) < θ k Hence the two criteria for rejection are equivalent provided that θ =1−λ Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. λik = 1 − Iik we choose class j when K min k i=1 λik p(Ci |x) = min{1 − p(Ck |x)} < λ k Or equivalently.

2010 25 / 68 .Bayes decision error for minimum risk Summary Strategies for obtaining decision regions needed to classify patterns were introduced The Bayes rule for minimum error is the optimum rule Introducing the costs of making incorrect decisions leads to the Bayes rule for minimum risk We have assumed that a priori probabilities and class-conditional distributions are known – generally these will be learned from the data Two alternatives will be briefly described next – both require knowledge of the class-conditional probability density functions the Neyman-Pearson rule the minimax rule Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Neyman-Person decision rule for two-class problems An alternative to Bayes decision rule for a two-class problem is the Neyman-Pearson rule This is often used in radar detection when a signal has to be classified as real (class C1 ) and noise (class C2 ) Two possible errors can be made 1 = R2 p(x|C1 ) dx = error probability of Type I p(x|C2 ) dx = error probability of Type II R1 2 = where C1 is the positive class. so C2 is the negative class. 2010 26 / 68 Bayesian decision theory and density estimation . so Giovanni Montana Imperial College () 1 2 is the false negative rate is the false positive rate 14 July.

2010 27 / 68 .Bayes decision error for minimum risk Neyman-Person decision rule The rule arises from a constrained optimisation setup: minimises subject to 2 being equal to a constant 0 Hence we wish to minimise err = R2 1 p(x|C1 ) dx + µ R1 p(x|C2 ) dx − 0 where µ is the Lagrange multiplier This will be minimised if we choose R1 such that if µp(x|C2 ) − p(x|C1 ) < 0 then x ∈ C1 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Decision rule Expressed in terms of the likelihood ratio. the Neyman-Person decision rule states that if p(x|C1 ) > µ then x ∈ C1 p(x|C2 ) The threshold µ is found so that p(x|C2 ) dx = R1 0 which often requires numerical integration The rule depends only on the within-class distributions and ignores the a priori probabilities Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 28 / 68 .

2010 29 / 68 .Bayes decision error for minimum risk Minimax criterion Bayes decision rules rely on a knowledge of both the within-class distributions and the prior class probabilities Often the prior class probabilities are unknown or they may vary according to external factors A reasonable approach is to design a classifier so that the worst possible error that can be made for any value of the priors is as small as possible The minimax procedure is designed to minimise the maximum possible overall error or risk We will consider the simple case of K = 2 classes Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

the two regions R1 and R2 are determined.Bayes decision error for minimum risk Bayes minimum error for a given prior Let us consider the classification error eB = p(C2 ) R1 p(x|C2 )dx + p(C1 ) R2 p(x|C1 )dx As a function of p(C1 ). eB is non-linear because the decision regions also depend on the prior When a value for p(C1 ) has been selected. and we regard the error function as a function of p(C1 ) only – we call it eB ˜ eB is linear in p(C1 ) and we can easily determine the value of p(C1 ) ˜ which gives the largest error Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 30 / 68 .

Bayes decision error for minimum risk Example: Bayes minimum error curves Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 31 / 68 .

2010 32 / 68 . the max error will occur at an extreme value of the prior – in the example. p(C1 ) = 1 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum risk Example: Bayes minimum error curve The true Bayes error must be less than or equal to that linearly bounded value. respectively. since one has the freedom to change the decision boundary at each value of p(C1 ) Also note that the Bayes error is zero at p(C1 ) = 0 and p(C1 ) = 1 since the Bayes decision rule under those conditions is to always decide C1 or C2 . and this gives zero error Thus the curve of Bayes error is concave down all prior probabilities For fixed decision regions.

R1 p(x|C2 )dx This is a minimum when eB (p(C1 )) is horizontal and touches the ˜ Bayes minimum error curve at its peak A minimum is achieved when p(x|C1 )dx = R2 R1 p(x|C2 )dx which is the point where the error will not change as function of the prior – in the example. eB (1)) = max e ˜ R2 p(x|C1 )dx. max(˜B (0). 2010 33 / 68 .6 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. p(C1 ) = 0.Bayes decision error for minimum risk Minimax solution The minimax procedure minimises the largest possible error.

Bayes decision error for minimum risk Three main approaches to solving decision problems 1 Using generative models First estimate the class-conditional densities p(x|Ck ). k = 1. K Then use Bayes’ theorem to find the posterior probabilities p(Ck |x) p(Ck |x) = p(x|Ck )p(Ck ) p(x) K p(x) = k=1 p(x|Ck )p(Ck ) Or model the joint distribution p(x. . . . Ck ) and then find p(Ck |x) Each p(Ck ) can be estimated as proportion of data points in class k Use decision theory to allocate a new input x to a class 2 Using discriminative models Model the posterior class probabilities p(Ck |x) directly Use decision theory to allocate a new input x to a class 3 Using discriminative functions Find a function f (x) which maps an input x directly onto a class Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. . 2010 34 / 68 .

2010 35 / 68 . probability distributions have specific functional forms governed by a small number of parameters The parameters are estimated from the data. which results in poor predictive performance In the non-parametric approach. we make fewer assumptions about the form of the distributions Both frequentist and Bayesian approach have been developed We shall focus on commonly used frequentist approaches Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. for instance using a maximum-likelihood or Bayesian approach A limitation of this approach is that the chosen density might be a poor model for the distribution that generated the data.Non-parametric density estimation methods Parametric and non-parametric density estimation In the parametric approach.

Non-parametric density estimation methods Overview We will introduce three popular methods for density estimation Histogram methods Kernel methods (or Parzen estimation methods) Nearest neighbours methods Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 36 / 68 .

. . .Non-parametric density estimation methods Histogram methods for density estimation Let us assume a simple univariate pattern x observed on n objects Using standard histograms we partition the observations x1 . we divide ni by the total number of observations n and by the width ∆i to obtain ni pi = n∆i The it follows that the estimated density is indeed a probability density so p(x) dx = 1 Under this approach the density is constant over the width of each bin and usually ∆i = ∆ Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. cn into distinct bins of width ∆i count the number ni of observations falling in each bin In order to obtain a proper probability density. 2010 37 / 68 . x2 . .

Non-parametric density estimation methods
Example: density estimation using the histogram method
50 data points generated from the distribution in green (mixture of two Gaussians)
When ∆ is too small, the estimated density can be under-smoothed (spiky); when ∆ is too large, it can be too smooth
The edge location of the bins also plays a role, but has less effect

Non-parametric density estimation methods
Histogram method: pros and cons
Once the histogram has been computed, the raw data can be discarded, which is an advantage with large data sets
It is suitable for on-line updating, when the data points arrive sequentially
The estimated density has discontinuities due to the bin edges rather than any property of the true density
Also, it does not scale well with dimensionality: suppose x is p-dimensional with p large; if each dimension is divided into M bins, we have a total of M^p bins (for example, M = 10 bins per dimension with p = 12 already gives 10^12 bins)
The number of data points required by the method then becomes far too large
This exponential scaling with p is sometimes referred to as the curse of dimensionality

Non-parametric density estimation methods
The curse of dimensionality

Non-parametric density estimation methods
Towards alternative approaches
Despite its shortcomings, the histogram method is widely used, especially for a quick visualisation of the data (in few dimensions)
Moreover, this approach highlights two elements needed for developing more complex methods
First, to estimate the probability density at a particular location, we should consider data points that lie within some local neighbourhood of that point
This notion of locality requires that we assume some distance measure, for instance the Euclidean distance; in histograms, this proximity notion was defined by the bins
The bin width is a smoothing parameter that defines the spatial extent of the local region
Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results

Non-parametric density estimation methods
Preliminaries
Assume that n data points are drawn from an unknown probability density p(x) in a p-dimensional Euclidean space
Let us consider a small region Q containing x
The probability mass associated with this region is given by P = ∫_Q p(x) dx
Suppose we have n data points from p(x); then each point has a probability P of being in Q
The number k of points that lie inside Q follows a Binomial distribution,
k | (n, P) ∼ n! / ( k! (n − k)! ) P^k (1 − P)^(n−k)
The mean fraction of points falling inside Q is E[k/n] = P, with variance var[k/n] = P(1 − P)/n
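A small simulation of this binomial argument; the value P = 0.7 (matching the example on the next slide) and the sample sizes are illustrative choices.

```python
# Simulate the fraction k/n of n points falling in a region Q of known mass P.
import numpy as np

rng = np.random.default_rng(1)
P = 0.7                                     # probability mass of the region Q
for n in (10, 100, 1000):
    k = rng.binomial(n, P, size=20000)      # many repetitions of drawing n points
    frac = k / n
    print(n, frac.mean(), frac.var(), P * (1 - P) / n)
# The mean of k/n stays at P while its variance shrinks like P(1-P)/n,
# so the distribution of k/n concentrates around P as n grows.
```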

Non-parametric density estimation methods
Distribution of k as a function of n
The true value is P = 0.7 – note that as n increases the distribution of the observed fraction k/n peaks around P
In the limit n → ∞ the curve approaches a delta function

Non-parametric density estimation methods
Density estimate
The expected fraction of points within Q is E[k/n] = P, so for large n we take k ≈ nP
If, however, we also assume that the region Q is sufficiently small, so that the probability density is roughly constant over the region, then P ≈ p(x)V, where V is the volume of Q
Combining the two we obtain a density estimate at x:
p(x) = P/V ≈ k/(nV)

Non-parametric density estimation methods
Towards two alternative methods
Notice that the validity of the estimate p(x) = k/(nV) depends on two contradictory assumptions:
We want Q to be sufficiently large (in relation to the value of that density) so that the number k of points falling inside the region is large enough for the binomial to be sharply peaked around P
We want Q to be sufficiently small that the density is approximately constant over the region
There are two possible approaches:
We can fix V and determine k from the data – this is done by defining localised regions around x
We can fix k and determine V from the data – this is done by searching for neighbours of x

Non-parametric density estimation methods
Kernel density estimator
We wish to determine the probability density at x
Take the region Q to be a small hypercube centred on x
In order to count the number k of points within Q, we define a function representing the unit cube centred on the origin:
g(u) = 1 if |u_j| ≤ 1/2 for j = 1, 2, . . . , p, and 0 otherwise
This is called the Parzen window, an instance of a kernel function
We have that g((x − x_i)/h) = 1 if x_i lies inside a cube of side h centred on x, and 0 otherwise

Non-parametric density estimation methods
Kernel density estimation
The number k of data points inside the cube is therefore
k = Σ_{i=1}^n g((x − x_i)/h)
We have previously established that an estimate for p(x) is k/(nV)
Combining the two results gives
p(x) = (1/n) Σ_{i=1}^n (1/V) g((x − x_i)/h) = (1/n) Σ_{i=1}^n (1/h^p) g((x − x_i)/h)
where we have taken V = h^p, the volume of a hypercube of side h in p dimensions
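A minimal sketch of this hypercube estimator in code; the two-dimensional Gaussian sample, the query point at the origin and the value h = 0.5 are illustrative assumptions.

```python
# Parzen-window (hypercube) estimator p(x) = (1/n) * sum_i (1/h^p) * g((x - x_i)/h).
import numpy as np

def parzen_window(u):
    """g(u) = 1 if every component satisfies |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_density(x, data, h):
    """Estimate p(x) from an (n, p) array of training points."""
    n, p = data.shape
    u = (x - data) / h                  # shape (n, p)
    k = parzen_window(u).sum()          # number of points inside the cube of side h
    return k / (n * h**p)

rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2))        # toy 2-D sample
print(parzen_density(np.zeros(2), data, h=0.5))
```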

Non-parametric density estimation methods
Remarks on the kernel density estimator
As with the histogram method, the kernel density estimator suffers from the presence of discontinuities – here these are located at the boundaries of the cubes
Smoother density estimates can be obtained simply by using alternative kernel functions
A common choice is the Gaussian kernel, which yields the estimator
p(x) = (1/n) Σ_{i=1}^n 1/(2πh^2)^(p/2) exp( −‖x − x_i‖^2 / (2h^2) )
where h is the standard deviation of the Gaussian components
The density model places a Gaussian over each data point and then adds the contributions over all points
It can be proved that the kernel density estimator converges to the true density in the limit n → ∞
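A corresponding sketch of the Gaussian-kernel estimator; the sample, the query point and the values of h are illustrative choices, and the normaliser uses the p-dimensional form (2πh²)^(p/2).

```python
# Gaussian kernel density estimator: a Gaussian placed over each data point.
import numpy as np

def gaussian_kde(x, data, h):
    """p(x) = (1/n) * sum_i N(x | x_i, h^2 I) for an (n, p) data array."""
    n, p = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)        # ||x - x_i||^2 for each point
    norm = (2.0 * np.pi * h**2) ** (p / 2.0)         # Gaussian normalising constant
    return np.mean(np.exp(-sq_dist / (2.0 * h**2)) / norm)

rng = np.random.default_rng(3)
data = rng.normal(size=(500, 2))
for h in (0.05, 0.3, 1.0):                           # h acts as the smoothing parameter
    print(h, gaussian_kde(np.zeros(2), data, h))
```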

Non-parametric density estimation methods
Example: kernel density estimation using a Gaussian kernel
50 data points generated from the distribution in green (mixture of two Gaussians)
h acts as a smoothing parameter: when h is too small the estimated density is under-smoothed (too noisy), and when h is too large it is over-smoothed

Non-parametric density estimation methods
Other kernels and improvements
Any other kernel can be chosen subject to g(u) ≥ 0 and ∫ g(u) du = 1, thus ensuring that the estimated density is nonnegative everywhere and integrates to one
The computational cost of evaluating the density grows linearly with n
Choosing the correct h is critical – much research has been done to develop procedures that learn h from the data
So far the parameter h has been fixed in all component kernels: a large h may lead to over-smoothing in regions of high density, while a small h may lead to noisy estimates in regions of low density
The optimal h therefore depends on the region of the input space
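As a sketch of how a different kernel can be plugged in, the snippet below uses a univariate Epanechnikov kernel, which is nonnegative and integrates to one; the kernel choice, bandwidth and data are purely illustrative.

```python
# Swapping the kernel: a 1-D estimator p(x) = (1/(n*h)) * sum_i K((x - x_i)/h).
import numpy as np

def epanechnikov(u):
    """K(u) = 3/4 (1 - u^2) for |u| <= 1, 0 otherwise; nonnegative, integrates to 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kde_1d(x, data, h, kernel):
    return np.mean(kernel((x - data) / h)) / h

rng = np.random.default_rng(4)
data = rng.normal(size=400)
print(kde_1d(0.0, data, h=0.4, kernel=epanechnikov))
```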

Non-parametric density estimation methods
Nearest neighbours methods
Instead of fixing V and determining k from the data, we can fix k and then use the data to find V accordingly
Suppose we want to estimate the density at x
Take a small sphere centred at x, the point at which we want to estimate p(x), and let its radius grow until it contains exactly k points
The density estimate is then given as before by p(x) = k/(nV), with V set to the volume of the sphere
The degree of smoothing is now governed by the parameter k, the number of nearest neighbours
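A minimal sketch of this estimate; the data, the query point and the choice k = 10 are illustrative assumptions, and the sphere volume uses the standard formula for a p-dimensional ball.

```python
# k-nearest-neighbour density estimate p(x) = k / (n * V), with V the volume of
# the smallest sphere around x containing exactly k points.
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    n, p = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[k - 1]                                      # radius reaching the k-th neighbour
    volume = (pi ** (p / 2) / gamma(p / 2 + 1)) * r**p    # volume of a p-dimensional ball
    return k / (n * volume)

rng = np.random.default_rng(5)
data = rng.normal(size=(500, 2))
print(knn_density(np.zeros(2), data, k=10))
```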

Non-parametric density estimation methods
Example: nearest neighbours
50 data points generated from the distribution in green (mixture of two Gaussians)
k acts as a smoothing parameter: when k is too small the estimated density is under-smoothed (too noisy), and when k is too large it is over-smoothed

Non-parametric density estimation methods
Nearest neighbours method for classification
We can apply the nearest neighbour method to estimate the probability density in each class Ck, then use Bayes' rule to obtain the posterior probabilities
Suppose we have M classes, and we have observed nm patterns in class Cm, with Σ_{m=1}^M nm = n
Given a new pattern x to be classified, we take a sphere around it containing exactly its k nearest points
The sphere has volume V and contains km points from class Cm, with Σ_{m=1}^M km = k

Non-parametric density estimation methods
Nearest neighbours method for classification
The class priors can be estimated by p(Cm) = nm / n
The class-conditional density estimates are p(x|Cm) = km / (nm V), m = 1, 2, . . . , M
As before, the unconditional density is given by p(x) = k / (nV)
Combining these results we obtain the posterior probabilities
p(Cm|x) = p(x|Cm) p(Cm) / p(x) = km / k
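A minimal sketch of the classifier implied by this result: estimate p(Cm|x) as km/k from the k nearest training points and pick the class with the largest count; the two-class toy data below are made up for illustration.

```python
# k-nearest-neighbour classification via the posterior estimate p(C_m | x) = k_m / k.
import numpy as np

def knn_posteriors(x, data, labels, k, n_classes):
    """Return the estimated posteriors k_m / k for each class m."""
    idx = np.argsort(np.linalg.norm(data - x, axis=1))[:k]   # k nearest training points
    counts = np.bincount(labels[idx], minlength=n_classes)   # k_m for each class
    return counts / k

rng = np.random.default_rng(6)
class0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))    # toy class C1
class1 = rng.normal(loc=[2, 2], scale=1.0, size=(100, 2))    # toy class C2
data = np.vstack([class0, class1])
labels = np.array([0] * 100 + [1] * 100)

post = knn_posteriors(np.array([1.5, 1.5]), data, labels, k=15, n_classes=2)
print(post, "-> predict class", int(np.argmax(post)))
```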

Non-parametric density estimation methods
Remarks on the nearest neighbours method
This approach gives a sensible rule: compute the k nearest neighbours of x and classify x as belonging to Cm if the majority of these neighbours are in Cm
As with the kernel density estimator, it can be proved that the nearest neighbours density estimator converges to the true density in the limit n → ∞
An interesting property of nearest neighbours methods (when k = 1) is that, in the limit n → ∞, the error is never more than twice the minimum achievable error rate of an optimal classifier
The notion of similarity between any two given patterns x and x′ is important
For patterns living in a Euclidean space, the Euclidean distance is an obvious choice, but other alternatives are available (more on this later)
When the patterns are objects of a different nature (graphs, trees, strings), other distances should be used

Non-parametric density estimation methods

Example: classification with nearest neighbours
Classification using K = 3


Non-parametric density estimation methods
Example: classification with nearest neighbours
Decision region obtained using K = 1

Non-parametric density estimation methods
Example: oil flow data set
During the course of this study, the proportions of oil, water and gas in North Sea oil transfer pipelines were measured
Depending on the geometrical configuration of these three materials, a pipe was labelled as Stratified, Annular or Homogeneous
Each data point comprises a 12-dimensional pattern consisting of non-invasive measurements taken with gamma ray densitometers
The principle is the following: if a narrow beam of gamma rays passes through the pipe, the attenuation in the intensity of the beam provides information about the density of material along its path
The ultimate objective is to classify a pipe as belonging to one of those three classes

Non-parametric density estimation methods
Example: oil flow data set
The three classes define geometrical configurations of the oil, water and gas
Each pipe is a pattern in a 12-dimensional space

Non-parametric density estimation methods
Example: oil flow data set
Red is Homogeneous, green is Annular and blue is Stratified
The pattern × has to be classified based on variables x6 and x7

Non-parametric density estimation methods
Example: oil flow data set
Decision region obtained using K = 1
The regions are fragmented and complex – overfitting may take place

Non-parametric density estimation methods
Example: oil flow data set
Decision region obtained using K = 3
The regions are smoother

Non-parametric density estimation methods
Example: oil flow data set
Decision region obtained using K = 31
The regions are too smooth

Non-parametric density estimation methods
Summary
When the true class-conditional densities are known, the minimal achievable error rate is that of a Bayes classifier
When they are not known, they are first learned from the data using either a parametric or non-parametric approach
Non-parametric approaches make no assumptions about the form of the distributions; we have briefly considered three such methods: histograms, kernels and nearest neighbours
All non-parametric methods rely on some tuning parameters that will inevitably affect the classification performance on unseen patterns

Performance assessment
The generalisation error
As we will see, almost all classifiers also rely on tuneable parameters
These parameters control the complexity of the model, which in turn determines the complexity of the decision regions
If the model is too simple, some salient features of the data will not be captured
If the model is too complex (for instance a nearest neighbour classifier with k = 1), it will generate complex decision boundaries that may not describe well the true boundaries
This is the issue of generalisation – ultimately we want the model to perform well on new, unseen patterns
The question then is how to obtain measures of the generalisation error

Performance assessment
The independent test approach
Suppose the data set D contains n labelled patterns
We can randomly split the data set into two parts:
A training set (e.g. 90% of the data) used for adjusting the parameters
A test set used to estimate the generalisation error
A measure of the generalisation error E can simply be the fraction of wrongly classified labels in the test set
Since the ultimate goal is that of obtaining a low generalisation error, we train the classifier until we reach a minimum of this error
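A minimal sketch of the hold-out estimate; the toy two-class data, the 90/10 split and the simple 1-nearest-neighbour rule are illustrative assumptions, standing in for whichever classifier is being assessed.

```python
# Hold-out estimate of the generalisation error: fraction of wrongly classified
# labels on an independent test set.
import numpy as np

def nn_predict(x_train, t_train, x_new):
    """Toy 1-nearest-neighbour rule used here only as a stand-in classifier."""
    idx = np.argmin(np.linalg.norm(x_train - x_new, axis=1))
    return t_train[idx]

rng = np.random.default_rng(7)
n = 200
x = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
t = np.array([0] * (n // 2) + [1] * (n // 2))

perm = rng.permutation(n)                            # random split of the data set
train, test = perm[: int(0.9 * n)], perm[int(0.9 * n):]
preds = np.array([nn_predict(x[train], t[train], xi) for xi in x[test]])
error = np.mean(preds != t[test])                    # fraction of misclassified labels
print("estimated generalisation error:", error)
```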

Performance assessment
Generalisation error as a function of model complexity
Classifiers that are too complex perform well on training data but not on independent data

Performance assessment
v-fold cross-validation
A simple generalisation of the previous approach is given by the v-fold cross-validation approach
The data set D is divided into v disjoint sets of equal size n/v
The classifier is trained v times, each time with a different set held out as a validation set
For each model, a generalisation error is computed, say E_i, i = 1, . . . , v
The estimated generalisation error is the mean of the v errors
In the limit, when v = n, the method is called leave-one-out cross-validation
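A matching sketch of v-fold cross-validation, reusing the same toy 1-nearest-neighbour rule as an illustrative stand-in for the classifier under assessment; setting v = n recovers leave-one-out cross-validation.

```python
# v-fold cross-validation: average the validation error E_i over v disjoint folds.
import numpy as np

def nn_predict(x_train, t_train, x_new):
    idx = np.argmin(np.linalg.norm(x_train - x_new, axis=1))
    return t_train[idx]

def cv_error(x, t, v, rng):
    n = len(t)
    folds = np.array_split(rng.permutation(n), v)       # v disjoint validation sets
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        preds = np.array([nn_predict(x[train], t[train], xi) for xi in x[fold]])
        errors.append(np.mean(preds != t[fold]))         # error E_i on this fold
    return np.mean(errors)                               # mean of the v errors

rng = np.random.default_rng(8)
n = 200
x = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
t = np.array([0] * (n // 2) + [1] * (n // 2))
print("10-fold CV error:", cv_error(x, t, v=10, rng=rng))
print("leave-one-out error:", cv_error(x, t, v=n, rng=rng))
```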