
Bayesian decision theory and density estimation

Giovanni Montana

Imperial College

14 July, 2010

Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July, 2010 1 / 68

1. Bayes decision rule for minimum error
2. Bayes decision rule for minimum risk
3. Non-parametric density estimation methods
4. Performance assessment

Bayes decision rule for minimum error

The problem

- We have observed {x_i, t_i} on a training data set (a random sample)
- The input x is a p-dimensional vector (x_1, x_2, . . . , x_p) in Euclidean space
- Each x will be called a pattern, data point or observation
- The output t is generally univariate
  - in regression, the output is typically a continuous measurement
  - in classification, the output is a class label C_k, k = 1, . . . , K
- In some applications the response may also be multivariate, perhaps high-dimensional
- The joint probability distribution p(x, t) is generally unknown and is estimated using the training data
- Given a new, unseen pattern x*, pattern classification consists in
  - assigning the correct class label t* to x*
  - making a decision accordingly

Example: patient classification using MRI data

- The data points {x_i}, i = 1, . . . , n consist of n_1 healthy subjects and n_2 individuals with Alzheimer's disease
- For each sample, the input x consists of p pixel intensities
- The output t consists of two classes, C_1 (healthy) and C_2 (diseased)
- Take t = 0 to indicate C_1 and t = 1 to indicate C_2
- We wish to classify a new patient, and take a related action (e.g. provide a treatment), on the basis of x

Making decisions when no learning data is available

- Assume known prior probabilities p(C_k), k = 1, 2
- On unseen data, we want to make as few misclassifications as possible
- Suppose we only use our prior information
- We would assign a new pattern x to class k if

    p(C_k) > p(C_j),  j ≠ k

- For classes with equal probability, objects are randomly assigned
- We can do better by making use of the data x

Posterior probabilities

- We are interested in the posterior probabilities p(C_k | x), k = 1, 2
- Using Bayes' theorem,

    p(C_k | x) = p(x | C_k) p(C_k) / p(x)

- If p(x, C) is known, all the required probabilities are available
- The evidence p(x) is

    p(x) = p(x, C_1) + p(x, C_2) = p(x | C_1) p(C_1) + p(x | C_2) p(C_2)

- It is just a scaling factor ensuring that ∑_k p(C_k | x) = 1
- How do we make as few misclassifications as possible?
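The Bayes' theorem computation above can be sketched in a few lines. This is a minimal illustration with assumed density values and priors (the numbers are hypothetical, not from the slides):

```python
# Hypothetical illustration: posterior probabilities via Bayes' theorem
# for two classes, from class-conditional densities and priors.

def posterior(px_given_c, priors):
    """Return p(C_k | x) given p(x | C_k) and p(C_k) for each class k."""
    joint = [l * p for l, p in zip(px_given_c, priors)]
    evidence = sum(joint)                 # p(x) = sum_k p(x | C_k) p(C_k)
    return [j / evidence for j in joint]  # posteriors sum to one by construction

# Assumed values of the densities at one observed x
post = posterior(px_given_c=[0.3, 0.1], priors=[0.5, 0.5])
print(round(post[0], 2), round(post[1], 2))  # 0.75 0.25
```

Dividing by the evidence is exactly the "scaling factor" role described above: it guarantees the posteriors sum to one.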

Example: conditional densities with univariate input

[Figure. Left: class-conditional densities p(x | C_1) and p(x | C_2). Right: posterior probabilities p(C_1 | x) and p(C_2 | x).]

Intuitive approach

- After observing x, suppose we find that p(C_1 | x) > p(C_2 | x)
- Intuitively, we would classify x as C_1
- We would take the opposite decision if p(C_2 | x) > p(C_1 | x)
- Accordingly, we would incur an error with probability

    p(error | x) = p(C_1 | x) if we decide C_2
                   p(C_2 | x) if we decide C_1

- Is this approach optimal? Note that

    p(error) = ∫_{−∞}^{∞} p(error, x) dx = ∫_{−∞}^{∞} p(error | x) p(x) dx

  where p(x) is fixed
- If we ensure that p(error | x) is as small as possible for every x, the integral must be as small as possible

Bayes decision rule for minimal error (two classes)

- Divide the input space into decision regions R_k, k = 1, 2
- If pattern x is in R_k, then x is classified as belonging to C_k
- The probability of misclassification is

    p(error) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
             = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
             = ∫_{R_1} p(C_2 | x) p(x) dx + ∫_{R_2} p(C_1 | x) p(x) dx

- Ignoring the common factor p(x), the rule that minimises the probability of misclassification is:

    assign x to C_1 if p(C_1 | x) > p(C_2 | x)

- Under this decision rule,

    p(error | x) = min[ p(C_1 | x), p(C_2 | x) ]
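The minimum-error rule is a one-liner once posteriors are available. A minimal sketch, using assumed posterior values (two classes, so the conditional error is simply the smaller posterior):

```python
# Minimum-error Bayes rule for two classes: assign x to the class with
# the largest posterior; p(error|x) is the smaller posterior.

def bayes_min_error(posteriors):
    k = max(range(len(posteriors)), key=lambda i: posteriors[i])  # argmax_k p(C_k | x)
    p_err = min(posteriors)   # two-class case: min[p(C_1|x), p(C_2|x)]
    return k, p_err

cls, err = bayes_min_error([0.8, 0.2])  # assumed posteriors at some x
print(cls, err)  # 0 0.2
```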

Example: p(error) with K = 2

[Figure: p(error) is the coloured area. x̂ is a boundary value defining the two decision regions; x_0 is the optimal threshold, at which the probability of error is minimised.]

Likelihood ratio

- According to the Bayes rule, x belongs to class C_1 if p(C_1 | x) > p(C_2 | x)
- The rule provides the required decision boundary
- Using Bayes' theorem, we can rewrite the rule as:

    x belongs to class C_1 if p(x | C_1) p(C_1) > p(x | C_2) p(C_2)

- or alternatively:

    x belongs to class C_1 if p(x | C_1) / p(x | C_2) > p(C_2) / p(C_1)

  where the left-hand side is the likelihood ratio
- Clearly the rule depends on both the prior probabilities and the class-conditional densities
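The likelihood-ratio form can be sketched directly; the densities and priors below are assumed for illustration. Note how shrinking the prior on C_1 raises the effective threshold and can flip the decision even though the likelihood ratio favours C_1:

```python
# Likelihood-ratio form of the minimum-error rule:
# decide C_1 when p(x|C_1)/p(x|C_2) > p(C_2)/p(C_1), else decide C_2.

def decide_by_likelihood_ratio(px_c1, px_c2, prior_c1, prior_c2):
    return 1 if px_c1 / px_c2 > prior_c2 / prior_c1 else 2

# Equal priors: threshold is 1, so the larger density wins
print(decide_by_likelihood_ratio(0.3, 0.1, 0.5, 0.5))  # 1
# A small prior on C_1 raises the threshold (0.8/0.2 = 4 > ratio 3)
print(decide_by_likelihood_ratio(0.3, 0.1, 0.2, 0.8))  # 2
```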

Example: decision regions in the univariate case

- Using a 0–1 loss function, R_1 and R_2 are determined by the threshold θ_a
- If the loss function penalises misclassifying C_2 as C_1 more than the converse, the threshold is higher (θ_b > θ_a) and R_1 is smaller

Bayes decision rule for minimal error (more than two classes)

- Divide the input space into decision regions R_k, k = 1, . . . , K
- The probability of correct classification is

    p(correct) = ∑_{k=1}^{K} p(x ∈ R_k, C_k)
               = ∑_{k=1}^{K} ∫_{R_k} p(x, C_k) dx
               = ∑_{k=1}^{K} ∫_{R_k} p(C_k | x) p(x) dx

  where p(x) is fixed
- This probability is maximised when x is assigned to the class C_k for which p(C_k | x) is largest

Remark

- We have seen that, in the two-class case, the error is

    p(error) = ∫_{−∞}^{∞} p(error | x) p(x) dx

- Under the optimal decision rule,

    p(error | x) = min[ p(C_1 | x), p(C_2 | x) ]

- Note that, even if the posterior densities are continuous, this form of conditional error virtually always leads to a discontinuous integrand in the error
- For arbitrary densities, an upper bound for the error is given by

    p(error | x) ≤ 2 p(C_1 | x) p(C_2 | x)

Bayes decision rule for minimum risk

Loss matrix

- In some applications either a loss or a utility function may be available
- A (K × K) loss matrix Λ with elements λ_ij defines the loss incurred when a pattern in class i is classified as belonging to class j, with i, j = 1, . . . , K
- When K = 2, we have

    Λ = ( 0     λ_12 )
        ( λ_21  0    )

- In the patient classification example, we could take λ_21 > λ_12
- An optimal solution now minimises the loss function
- The loss function depends on the true classification, which is not available with certainty – the uncertainty is quantified by p(x, C_k)

Conditional and expected loss

- The total conditional loss, or risk, incurred when a pattern x is assigned to class k is

    r(C_k | x) = ∑_{i=1}^{K} λ_ik p(C_i | x)

- The average loss or risk over the region supporting C_k is

    E[r(C_k | x)] = ∫_{R_k} r(C_k | x) p(x) dx
                  = ∫_{R_k} ∑_{i=1}^{K} λ_ik p(C_i | x) p(x) dx
                  = ∫_{R_k} ∑_{i=1}^{K} λ_ik p(x, C_i) dx

Minimising the expected loss

- The average loss or risk is

    E[r(x)] = ∑_{k=1}^{K} ∫_{R_k} ∑_{i=1}^{K} λ_ik p(x, C_i) dx

- We want to define regions {R_k} that minimise this expected loss
- This implies that, at each x, we should minimise

    ∑_{i=1}^{K} λ_ik p(C_i | x) p(x)

- This is the same as assigning x to the class k for which the conditional risk

    r(C_k | x) = ∑_{i=1}^{K} λ_ik p(C_i | x)

  is minimum – the resulting minimum overall risk is called the Bayes risk
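The minimum-risk rule above can be sketched directly. The loss matrix below is hypothetical, in the spirit of the patient example: deciding "healthy" for a diseased patient costs 10, the converse costs 1. Note how the risk rule can overrule the posterior:

```python
# Minimum-risk rule: pick the class k minimising
# r(C_k|x) = sum_i lambda_ik p(C_i|x). Loss values are assumed.

def min_risk_class(loss, posteriors):
    K = len(posteriors)
    risks = [sum(loss[i][k] * posteriors[i] for i in range(K)) for k in range(K)]
    return min(range(K), key=lambda k: risks[k]), risks

# loss[i][k] = cost of deciding class k when the truth is class i
loss = [[0, 1],    # true C_1: deciding C_2 costs 1
        [10, 0]]   # true C_2: deciding C_1 costs 10
cls, risks = min_risk_class(loss, posteriors=[0.8, 0.2])
print(cls, risks)  # 1 [2.0, 0.8]  -- decide C_2 despite p(C_1|x) = 0.8
```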

Example: two classes (1/2)

- Suppose there are only two classes, C_1 and C_2
- Let λ_ij = λ(C_i | C_j) be the loss incurred for deciding C_i when the true class is C_j
- The conditional risks are

    r(C_1 | x) = λ_11 p(C_1 | x) + λ_12 p(C_2 | x)
    r(C_2 | x) = λ_21 p(C_1 | x) + λ_22 p(C_2 | x)

- The rule says that we should decide C_1 if

    r(C_1 | x) < r(C_2 | x)

- Using posterior probabilities, we decide C_1 if

    (λ_21 − λ_11) p(C_1 | x) > (λ_12 − λ_22) p(C_2 | x)

  where both (λ_21 − λ_11) and (λ_12 − λ_22) are positive; otherwise we decide C_2

Example: two classes (2/2)

- Alternatively, by employing Bayes' formula, we decide C_1 if

    (λ_21 − λ_11) p(x | C_1) p(C_1) > (λ_12 − λ_22) p(x | C_2) p(C_2)

  and otherwise decide C_2
- Yet another formulation of the same rule suggests that we decide C_1 if

    p(x | C_1) / p(x | C_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · [p(C_2) / p(C_1)]

  where the left-hand side is the likelihood ratio; otherwise we decide C_2 – assuming (λ_21 − λ_11) > 0

A special case: the zero-one loss matrix

- A special loss function has elements λ_ik = 1 − I_ik, where I_ik = 1 if i = k and I_ik = 0 otherwise
- We classify a pattern as belonging to the class k for which

    ∑_{i=1}^{K} λ_ik p(C_i | x) = ∑_{i=1}^{K} (1 − I_ik) p(C_i | x) = 1 − ∑_{i=1}^{K} I_ik p(C_i | x)

  is a minimum – we have used the fact that ∑_{i=1}^{K} p(C_i | x) = 1
- This suggests that we should choose the class k for which 1 − p(C_k | x) is smallest
- Equivalently, choose the class k for which p(C_k | x) is a maximum
- Minimising the expected loss will minimise the misclassification rate

The reject option

- Classification errors arise from regions where the largest of the p(C_k | x) is significantly less than unity – that is, where we are very uncertain about class membership
- In some applications we may prefer not to classify (i.e. to reject) patterns for which we are very uncertain
- Introduce an arbitrary threshold θ ∈ [0, 1]
- Define a reject region

    reject region = { x : max_k p(C_k | x) ≤ θ }

- Then:
  - do not classify patterns that fall in the reject region
  - classify all other patterns using the Bayes decision rules, as before

Example: the reject option with univariate input

[Figure]

The reject option that minimises the expected loss

- We now account for the loss incurred when a reject decision is made
- If we decide to assign pattern x to class k, we incur an expected loss of ∑_{i=1}^{K} λ_ik p(C_i | x)
- Suppose that, if we opt for the reject option, we incur a loss of λ
- If we take

    j = arg min_k ∑_{i=1}^{K} λ_ik p(C_i | x)

  then the expected loss is minimised if we follow this rule:
  - choose class j if min_k ∑_{i=1}^{K} λ_ik p(C_i | x) < λ
  - otherwise reject

The reject option that minimises the expected loss (continued)

- For a zero-one loss matrix, λ_ik = 1 − I_ik, we choose class j when

    min_k ∑_{i=1}^{K} λ_ik p(C_i | x) = min_k { 1 − p(C_k | x) } < λ

- or equivalently, when

    max_k p(C_k | x) > 1 − λ

  and we reject otherwise
- In the standard reject criterion, we reject if max_k p(C_k | x) ≤ θ
- Hence the two criteria for rejection are equivalent provided that θ = 1 − λ
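The equivalence θ = 1 − λ turns the reject rule into a simple threshold test. A minimal sketch with assumed posteriors and an assumed reject loss:

```python
# Reject option under a zero-one loss with reject loss lam:
# accept the top class only when max_k p(C_k|x) > 1 - lam (theta = 1 - lam).

def classify_with_reject(posteriors, reject_loss):
    theta = 1.0 - reject_loss
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if posteriors[best] > theta:
        return best   # confident enough: classify
    return None       # too uncertain: reject

print(classify_with_reject([0.9, 0.1], reject_loss=0.3))    # 0
print(classify_with_reject([0.55, 0.45], reject_loss=0.3))  # None
```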

Summary

- Strategies for obtaining the decision regions needed to classify patterns were introduced
- The Bayes rule for minimum error is the optimum rule
- Introducing the costs of making incorrect decisions leads to the Bayes rule for minimum risk
- We have assumed that the a priori probabilities and class-conditional distributions are known – generally these will be learned from the data
- Two alternatives will be briefly described next – both require knowledge of the class-conditional probability density functions:
  - the Neyman-Pearson rule
  - the minimax rule

Neyman-Pearson decision rule for two-class problems

- An alternative to the Bayes decision rule for a two-class problem is the Neyman-Pearson rule
- It is often used in radar detection, when a signal has to be classified as real (class C_1) or noise (class C_2)
- Two possible errors can be made:

    ε_1 = ∫_{R_2} p(x | C_1) dx = error probability of Type I
    ε_2 = ∫_{R_1} p(x | C_2) dx = error probability of Type II

  where
  - C_1 is the positive class, so ε_1 is the false negative rate
  - C_2 is the negative class, so ε_2 is the false positive rate

Neyman-Pearson decision rule

- The rule arises from a constrained optimisation setup: minimise ε_1 subject to ε_2 being equal to a constant ε_0
- Hence we wish to minimise

    err = ∫_{R_2} p(x | C_1) dx + µ ( ∫_{R_1} p(x | C_2) dx − ε_0 )

  where µ is a Lagrange multiplier
- This will be minimised if we choose R_1 such that

    if µ p(x | C_2) − p(x | C_1) < 0 then x ∈ C_1

Decision rule

- Expressed in terms of the likelihood ratio, the Neyman-Pearson decision rule states that

    if p(x | C_1) / p(x | C_2) > µ then x ∈ C_1

- The threshold µ is found so that

    ∫_{R_1} p(x | C_2) dx = ε_0

  which often requires numerical integration
- The rule depends only on the within-class distributions and ignores the a priori probabilities
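The threshold-finding step can be illustrated numerically. The sketch below assumes two unit-variance Gaussian class-conditionals, C_1 ~ N(1, 1) and C_2 ~ N(0, 1) (these distributions are assumptions, not from the slides). Here the likelihood ratio is monotone in x, so R_1 = {x > t}, and the constraint ∫_{R_1} p(x | C_2) dx = ε_0 fixes the boundary t by bisection:

```python
# Neyman-Pearson threshold for assumed Gaussians C_1 ~ N(1,1), C_2 ~ N(0,1).
# R_1 = {x > t}; find t so that the false positive rate P(X > t | C_2) = eps0.
import math

def std_normal_sf(x):
    """P(Z > x) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

eps0 = 0.05
lo, hi = -10.0, 10.0
for _ in range(100):                 # bisection: sf is decreasing in x
    mid = (lo + hi) / 2
    if std_normal_sf(mid) > eps0:
        lo = mid
    else:
        hi = mid
t = (lo + hi) / 2
# The corresponding Lagrange multiplier is the likelihood ratio at the boundary
mu = math.exp(-(t - 1) ** 2 / 2) / math.exp(-t ** 2 / 2)
print(round(t, 3), round(mu, 2))  # 1.645 3.14
```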

Minimax criterion

- Bayes decision rules rely on knowledge of both the within-class distributions and the prior class probabilities
- Often the prior class probabilities are unknown, or they may vary according to external factors
- A reasonable approach is to design a classifier so that the worst possible error that can be made for any value of the priors is as small as possible
- The minimax procedure is designed to minimise the maximum possible overall error or risk
- We will consider the simple case of K = 2 classes

Bayes minimum error for a given prior

- Let us consider the classification error

    e_B = p(C_2) ∫_{R_1} p(x | C_2) dx + p(C_1) ∫_{R_2} p(x | C_1) dx

- As a function of p(C_1), e_B is non-linear because the decision regions also depend on the prior
- When a value for p(C_1) has been selected, the two regions R_1 and R_2 are determined, and we can regard the error as a function of p(C_1) only – we call it ẽ_B
- ẽ_B is linear in p(C_1), and we can easily determine the value of p(C_1) which gives the largest error

Example: Bayes minimum error curves

[Figure]

Example: Bayes minimum error curve

- The true Bayes error must be less than or equal to that linearly bounded value, since one has the freedom to change the decision boundary at each value of p(C_1)
- Also note that the Bayes error is zero at p(C_1) = 0 and at p(C_1) = 1, since the Bayes decision rule under those conditions is to always decide C_2 or C_1, respectively, and this gives zero error
- Thus the curve of the Bayes error is concave down over all prior probabilities
- For fixed decision regions, the maximum error will occur at an extreme value of the prior – in the example, at p(C_1) = 1

Minimax solution

- The minimax procedure minimises the largest possible error,

    max( ẽ_B(0), ẽ_B(1) ) = max( ∫_{R_2} p(x | C_1) dx, ∫_{R_1} p(x | C_2) dx )

- This is a minimum when ẽ_B(p(C_1)) is horizontal and touches the Bayes minimum error curve at its peak
- A minimum is achieved when

    ∫_{R_2} p(x | C_1) dx = ∫_{R_1} p(x | C_2) dx

  which is the point where the error does not change as a function of the prior – in the example, p(C_1) = 0.6
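The equal-error condition can be found numerically. The sketch below assumes two unit-variance Gaussian classes, C_1 ~ N(0, 1) and C_2 ~ N(2, 1), and a threshold rule "decide C_1 if x < t" (assumed setup, not from the slides); by symmetry the minimax boundary is the midpoint t = 1:

```python
# Minimax boundary for assumed Gaussians C_1 ~ N(0,1), C_2 ~ N(2,1):
# find t equalising the two error masses in the minimax condition.
import math

def sf(x, m):
    """P(X > x) for X ~ N(m, 1)."""
    return 0.5 * math.erfc((x - m) / math.sqrt(2))

def err_gap(t):
    e1 = sf(t, 0.0)         # mass of C_1 falling in R_2 = {x > t}
    e2 = 1.0 - sf(t, 2.0)   # mass of C_2 falling in R_1 = {x < t}
    return e1 - e2          # decreasing in t

lo, hi = -5.0, 7.0
for _ in range(100):        # bisection on the (decreasing) gap
    mid = (lo + hi) / 2
    if err_gap(mid) > 0:
        lo = mid
    else:
        hi = mid
t = (lo + hi) / 2
print(round(t, 3))  # 1.0 (the midpoint, by symmetry)
```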

Three main approaches to solving decision problems

1. Using generative models
   - First estimate the class-conditional densities p(x | C_k), k = 1, . . . , K
   - Then use Bayes' theorem to find the posterior probabilities p(C_k | x):

       p(C_k | x) = p(x | C_k) p(C_k) / p(x),   p(x) = ∑_{k=1}^{K} p(x | C_k) p(C_k)

   - Or model the joint distribution p(x, C_k) and then find p(C_k | x)
   - Each p(C_k) can be estimated as the proportion of data points in class k
   - Use decision theory to allocate a new input x to a class
2. Using discriminative models
   - Model the posterior class probabilities p(C_k | x) directly
   - Use decision theory to allocate a new input x to a class
3. Using discriminant functions
   - Find a function f(x) which maps an input x directly onto a class

Non-parametric density estimation methods

Parametric and non-parametric density estimation

- In the parametric approach, probability distributions have specific functional forms governed by a small number of parameters
- The parameters are estimated from the data, for instance using a maximum-likelihood or a Bayesian approach
- A limitation of this approach is that the chosen density might be a poor model of the distribution that generated the data, which results in poor predictive performance
- In the non-parametric approach, we make fewer assumptions about the form of the distributions
- Both frequentist and Bayesian approaches have been developed
- We shall focus on commonly used frequentist approaches

Overview

We will introduce three popular methods for density estimation:
- histogram methods
- kernel methods (or Parzen estimation methods)
- nearest neighbours methods

Histogram methods for density estimation

- Let us assume a simple univariate pattern x observed on n objects
- Using standard histograms, we
  - partition the observations x_1, x_2, . . . , x_n into distinct bins of width ∆_i
  - count the number n_i of observations falling in each bin
- In order to obtain a proper probability density, we divide n_i by the total number of observations n and by the width ∆_i to obtain

    p_i = n_i / (n ∆_i)

- It then follows that the estimated density is indeed a probability density, so that

    ∫ p(x) dx = 1

- Under this approach the density is constant over the width of each bin, and usually ∆_i = ∆
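The construction p_i = n_i / (n ∆_i) can be sketched in a few lines; the data points below are hypothetical. By construction the estimate integrates to one:

```python
# Minimal histogram density estimator with equal-width bins:
# p_i = n_i / (n * delta) for each bin i.

def histogram_density(xs, x_min, x_max, n_bins):
    delta = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for x in xs:
        i = min(int((x - x_min) / delta), n_bins - 1)  # clamp the right edge
        counts[i] += 1
    n = len(xs)
    return [c / (n * delta) for c in counts], delta

density, delta = histogram_density([0.1, 0.2, 0.25, 0.7, 0.9], 0.0, 1.0, 4)
print(density)                                  # [1.6, 0.8, 0.8, 0.8]
print(round(sum(p * delta for p in density), 10))  # 1.0
```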

Example: density estimation using the histogram method

[Figure: 50 data points generated from the distribution in green (a mixture of two Gaussians).]

- When ∆ is too small, the estimated density is under-smoothed (spiky); when ∆ is too large, it is over-smoothed
- The edge location of the bins also plays a role, but has less effect

Histogram method: pros and cons

- Once the histogram has been computed, the raw data can be discarded, which is an advantage with large data sets
- It is suitable for on-line updating, when the data points arrive sequentially
- The estimated density has discontinuities due to the bin edges rather than to any property of the true density
- It also does not scale well with dimensionality:
  - suppose x is p-dimensional with p large
  - if each dimension is divided into M bins, we have a total of M^p bins
  - the number of data points required by the method becomes too large
  - this exponential scaling with p is sometimes referred to as the curse of dimensionality

The curse of dimensionality

[Figure]

Towards alternative approaches

- Despite its shortcomings, the method is widely used, especially for quick visualisation of the data (in few dimensions)
- Moreover, this approach highlights two elements needed for developing more sophisticated methods
- First, to estimate the probability density at a particular location, we should consider data points that lie within some local neighbourhood of that point
  - this notion of locality requires that we assume some distance measure, for instance the Euclidean distance
  - in histograms, this proximity notion is defined by the bins, and the bin width is a smoothing parameter that defines the spatial extent of the local region
- Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results

Preliminaries

- Assume that n data points are drawn from an unknown probability density p(x) in a p-dimensional Euclidean space
- Let us consider a small region Q containing x
- The probability mass associated with this region is given by

    P = ∫_Q p(x) dx

- Since the n data points are drawn from p(x), each point has probability P of being in Q
- The number k of points that lie inside Q follows a Binomial distribution:

    p(k | n, P) = [ n! / (k! (n − k)!) ] P^k (1 − P)^(n−k)

- The average fraction of points in Q is E[k/n] = P, with variance var[k/n] = P(1 − P)/n

Distribution of k as a function of n

[Figure: the true value is P = 0.7 – note that as n increases, the distribution of k/n peaks around P. In the limit n → ∞ it approaches a delta function.]

Density estimate

- The estimated average fraction of points within Q is k/n ≈ P, so for large n we take

    k ≈ n P

- If, however, we also assume that the region Q is sufficiently small that the probability density is roughly constant over it, then

    P ≈ p(x) V

  where V is the volume of Q
- Combining the two, we obtain a density estimate at x:

    p(x) = P / V = k / (n V)

Towards two alternative methods

- Notice that the validity of the estimate

    p(x) = k / (n V)

  depends on two contradictory assumptions:
  - we want Q to be sufficiently large (in relation to the value of the density) that the number k of points falling inside the region is large enough for the binomial to be sharply peaked around P
  - we want Q to be sufficiently small that the density is approximately constant over the region
- There are two possible approaches:
  - we can fix V and determine k from the data – this is done by defining localised regions around x
  - we can fix k and determine V from the data – this is done by searching for neighbours of x

Kernel density estimator

- We wish to determine the probability density at x
- Take the region Q to be a small hypercube centred on x
- In order to count the number k of points within Q, we define a function representing the unit cube centred on the origin:

    g(u) = 1 if |u_j| ≤ 1/2 for all j = 1, 2, . . . , p, and 0 otherwise

- This is called the Parzen window, an instance of a kernel function
- We then have

    g((x − x_i)/h) = 1 if x_i lies inside a cube of side h centred on x, and 0 otherwise

Kernel density estimation

- The number k of data points inside the cube therefore is

    k = ∑_{i=1}^{n} g((x − x_i)/h)

- We have previously established that an estimate for p(x) is k/(nV)
- Combining the two results gives

    p(x) = (1/n) ∑_{i=1}^{n} (1/V) g((x − x_i)/h) = (1/n) ∑_{i=1}^{n} (1/h^p) g((x − x_i)/h)

  where we have taken V = h^p, the volume of a hypercube of side h in p dimensions
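The Parzen-window estimator above can be sketched in one dimension (p = 1), where the "hypercube" is simply the interval [x − h/2, x + h/2]; the data values are hypothetical:

```python
# Parzen-window estimate p(x) = (1/(n h^p)) sum_i g((x - x_i)/h) with the
# unit-cube kernel, in one dimension (p = 1).

def parzen_cube(x, data, h):
    def g(u):
        # Unit "cube" (interval) centred on the origin
        return 1.0 if abs(u) <= 0.5 else 0.0
    n = len(data)
    return sum(g((x - xi) / h) for xi in data) / (n * h)

data = [0.0, 0.2, 0.4, 1.5]               # assumed sample
print(parzen_cube(0.2, data, h=1.0))      # 0.75: three of four points fall in [-0.3, 0.7]
```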

Remarks on the kernel density estimator

- As with the histogram method, the kernel density estimator suffers from the presence of discontinuities – here these are located at the boundaries of the cubes
- Smoother density estimates can be obtained simply by using alternative kernel functions
- A common choice is the Gaussian kernel, which (for univariate x) yields the estimator

    p(x) = (1/n) ∑_{i=1}^{n} (2πh²)^(−1/2) exp( −(x − x_i)² / (2h²) )

  where h is the standard deviation of the Gaussian components
- This density model places a Gaussian over each data point and then adds the contributions over all points
- It can be proved that the kernel density estimator converges to the true density in the limit n → ∞
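The Gaussian-kernel estimator can be sketched directly; the data points are hypothetical. A quick Riemann sum confirms the estimate is a proper density:

```python
# Univariate Gaussian kernel density estimator: one Gaussian bump of
# standard deviation h per data point, averaged over the sample.
import math

def gaussian_kde(x, data, h):
    n = len(data)
    norm = math.sqrt(2 * math.pi * h * h)
    return sum(math.exp(-(x - xi) ** 2 / (2 * h * h)) / norm for xi in data) / n

data = [-1.0, 0.0, 1.0]   # assumed sample
# Check that the estimate roughly integrates to one over a wide grid
grid = [i * 0.01 for i in range(-1000, 1001)]
area = sum(gaussian_kde(x, data, h=0.3) for x in grid) * 0.01
print(round(area, 2))  # 1.0
```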

Example: kernel density estimation using a Gaussian kernel

[Figure: 50 data points generated from the distribution in green (a mixture of two Gaussians).]

- h acts as a smoothing parameter
- When h is too small, the estimated density is under-smoothed (too noisy); when h is too large, it is over-smoothed

Other kernels and improvements

- Any other kernel can be chosen, subject to

    g(u) ≥ 0  and  ∫ g(u) du = 1

  thus ensuring that the estimated density is nonnegative everywhere and integrates to one
- The computational cost of evaluating the density grows linearly with n
- Choosing the correct h is critical – much research has been done to develop procedures that learn h from the data
- So far the parameter h has been fixed across all component kernels:
  - in regions of high density, a large h may lead to over-smoothing, while in regions of low density a small h may lead to noisy estimates
  - the optimal h depends on the region of the input space

Nearest neighbours methods

- Instead of fixing V and determining k from the data, we can fix k and then use the data to find V accordingly
- Suppose we want to estimate the density at a point x
- Take a small sphere centred at the point x at which we want to estimate p(x)
- We let the radius of the sphere grow until it contains exactly k points
- The density estimate is then given, as before, by

    p(x) = k / (n V)

  with V set to the volume of the sphere
- The degree of smoothing is now governed by the parameter k, the number of nearest neighbours
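The fix-k-grow-V procedure can be sketched in one dimension, where the "sphere" around x is the interval [x − r, x + r] with volume 2r; the data values are hypothetical:

```python
# k-nearest-neighbour density estimate in one dimension: grow the interval
# around x until it holds exactly k points, then p(x) = k / (n V).

def knn_density(x, data, k):
    dists = sorted(abs(x - xi) for xi in data)
    radius = dists[k - 1]      # distance to the k-th nearest point
    volume = 2 * radius        # 1-D "sphere" is the interval [x-r, x+r]
    return k / (len(data) * volume)

data = [0.0, 0.1, 0.3, 0.9, 1.0]          # assumed sample
print(round(knn_density(0.2, data, k=3), 3))  # 1.5, i.e. 3 / (5 * 0.4)
```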

Non-parametric density estimation methods

Example: nearest neighbours

50 data points generated from the distribution in green (mixture of two Gaussians)

k acts as a smoothing parameter

When k is too small the estimated density is under-smoothed (too noisy); when it is too large, the estimate is over-smoothed


Nearest neighbours method for classiﬁcation

We can apply the nearest neighbour method to estimate the probability density in each class C_m, then use Bayes' rule to obtain the posterior probabilities

Suppose we have M classes

We have observed n_m patterns in class C_m, with ∑_{m=1}^{M} n_m = n

Given a new pattern x to be classified, we take a sphere around it containing exactly its k nearest points

The sphere has volume V and contains k_m points from class C_m, with ∑_{m=1}^{M} k_m = k


Nearest neighbours method for classiﬁcation

The class priors can be estimated by

p(C_m) = n_m / n

The class-conditional density estimates are

p(x|C_m) = k_m / (n_m V),   m = 1, 2, . . . , M

As before, the unconditional density is given by

p(x) = k / (nV)

Combining these results we obtain the posterior probabilities

p(C_m|x) = p(x|C_m) p(C_m) / p(x) = k_m / k
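The result p(C_m|x) = k_m / k can be computed directly from the k nearest labelled patterns; the 1-D patterns and labels below are hypothetical.

```python
from collections import Counter

def knn_posterior(x, patterns, labels, k):
    """Posterior estimates p(C_m | x) = k_m / k from the k nearest patterns."""
    nearest = sorted(zip(patterns, labels), key=lambda pl: abs(pl[0] - x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

patterns = [0.1, 0.3, 0.4, 2.0, 2.1, 2.3]
labels = ["A", "A", "A", "B", "B", "B"]
post = knn_posterior(1.8, patterns, labels, k=3)
print(max(post, key=post.get))  # the majority class among the 3 neighbours: B
```

Choosing the class with the largest posterior is exactly the majority vote among the k neighbours, which is the classification rule discussed next.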


Remarks on the nearest neighbours method

This approach gives a sensible rule:

compute the nearest neighbours of x in each class

classify x as belonging to C_m if the majority of its neighbours are in C_m

As with the kernel approach, it can be proved that the nearest-neighbours density estimator converges to the true density in the limit n → ∞

An interesting property of nearest neighbours methods (when k = 1)

is that, in the limit n →∞ the error is never more than twice the

minimum achievable error rate of an optimal classifier

The notion of similarity between any two given patterns x and x′ is important

For patterns living in a Euclidean space, the Euclidean distance is an obvious choice, but other alternatives are available (more on this later)

When the patterns are objects of a different nature (graphs, trees, strings) other distances should be used
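A minimal illustration of why the metric matters: with the hypothetical 2-D points below, the nearest neighbour of the origin differs under the Euclidean and Manhattan distances.

```python
def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

x = (0.0, 0.0)
candidates = [(3.0, 0.0), (2.0, 2.0)]
print(min(candidates, key=lambda c: euclidean(x, c)))  # (2.0, 2.0): distance ~2.83 vs 3
print(min(candidates, key=lambda c: manhattan(x, c)))  # (3.0, 0.0): distance 3 vs 4
```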


Example: classiﬁcation with nearest neighbours

Classiﬁcation using K = 3


Example: classiﬁcation with nearest neighbours

Decision region obtained using K = 1


Example: oil ﬂow data set

During the course of this study, the proportions of oil, water and gas in North Sea oil transfer pipelines were measured

Depending on the geometrical conﬁguration of these three materials,

a pipe was labelled as

Stratiﬁed

Annular

Homogeneous

Each data point comprises a 12-dimensional pattern consisting of

non-invasive measurements taken with gamma ray densitometers

The principle is the following: if a narrow beam of gamma rays

passes through the pipe, the attenuation in the intensity of the beam

provides information about the density of material along its path

The ultimate objective is to classify a pipe as belonging to one of

those three classes


Example: oil ﬂow data set

The three classes deﬁne geometrical conﬁgurations of the oil, water and gas

Each pipe is a pattern in a 12-dimensional space


Example: oil ﬂow data set

Red is Homogeneous, green is Annular and blue is Stratified

The pattern × has to be classified based on variables x_6 and x_7


Example: oil ﬂow data set

Decision region obtained using K = 1

The regions are fragmented and complex – overﬁtting may take place


Example: oil ﬂow data set

Decision region obtained using K = 3

The regions are smoother


Example: oil ﬂow data set

Decision region obtained using K = 31

The regions are too smooth


Summary

When the true class-conditional densities are known, the minimal

achievable error rate is that of a Bayes classiﬁer

When they are not known, they are ﬁrst learned from the data using

either a parametric or non-parametric approach

Non-parametric approaches make no assumptions about the form of

the distributions and we have brieﬂy considered three methods:

histograms, kernels and nearest-neighbours

All non-parametric methods rely on some tuning parameters that will

inevitably affect the classification performance on unseen patterns


Performance assessment

The generalisation error

As we will see, almost all classiﬁers also rely on tuneable parameters

These parameters control the complexity of the model, which in turn

determines the complexity of the decision regions

If the model is too simple (for instance a nearest neighbour with a very large k), some salient features of the data will not be captured

If the model is too complex (for instance k = 1), it will generate overly complex decision boundaries that may not describe the true boundaries well

This is the issue of generalisation – ultimately we want the model to

perform well on new, unseen patterns

The question then is, how to obtain measures of generalisation error


The independent test approach

Suppose the data set D contains n labelled patterns

We can randomly split the data set into two parts

A training set (e.g. 90% of the data) used for adjusting the parameters

A test set used to estimate the generalisation error

A measure of generalisation error E can simply be the fraction of

wrongly classiﬁed labels in the test set

Since the ultimate goal is that of obtaining low generalisation error,

we train the classiﬁer until we reach a minimum of this error
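This split-and-score procedure can be sketched as follows; the `fit_threshold` classifier and the data are hypothetical placeholders for whatever model is being assessed.

```python
import random

def holdout_error(patterns, labels, fit, train_frac=0.9, seed=0):
    """Fraction of wrongly classified labels in a random held-out test set."""
    idx = list(range(len(patterns)))
    random.Random(seed).shuffle(idx)
    n_train = int(train_frac * len(idx))
    train, test = idx[:n_train], idx[n_train:]
    classify = fit([patterns[i] for i in train], [labels[i] for i in train])
    wrong = sum(classify(patterns[i]) != labels[i] for i in test)
    return wrong / len(test)

def fit_threshold(xs, ys):
    # Hypothetical "training": a fixed 1-D threshold classifier
    return lambda x: "A" if x < 1.0 else "B"

patterns = [0.1, 0.2, 0.3, 0.4, 0.5, 1.9, 2.0, 2.1, 2.2, 2.3]
labels = ["A"] * 5 + ["B"] * 5
print(holdout_error(patterns, labels, fit_threshold, train_frac=0.8))  # 0.0: the classes separate perfectly
```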


Generalisation error as function of model complexity

Classiﬁers that are too complex perform well on training data but not on independent data


v-fold cross validation

A simple generalisation of the previous approach is v-fold cross-validation

The data set D is divided into v disjoint sets of equal size n/v

The classiﬁer is trained v times, each time with a diﬀerent set held

out as a validation set

For each held-out set, a generalisation error is computed, say E_i , i = 1, . . . , v

The estimated generalisation error is the mean of the v errors

In the limit, when v = n, the method is called leave-one-out

cross-validation
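The procedure can be sketched as follows, using contiguous folds for simplicity; `fit_threshold` is again a hypothetical classifier standing in for the model under assessment. Setting v = n gives leave-one-out cross-validation.

```python
def v_fold_cv_error(patterns, labels, fit, v):
    """Mean of the v validation-fold errors."""
    n = len(patterns)
    fold_errors = []
    for f in range(v):
        val = range(f * n // v, (f + 1) * n // v)          # held-out fold
        train = [i for i in range(n) if i not in val]
        classify = fit([patterns[i] for i in train], [labels[i] for i in train])
        wrong = sum(classify(patterns[i]) != labels[i] for i in val)
        fold_errors.append(wrong / len(val))
    return sum(fold_errors) / v

def fit_threshold(xs, ys):
    return lambda x: "A" if x < 1.0 else "B"  # hypothetical fitted model

patterns = [0.1, 2.0, 0.2, 2.1, 0.3, 2.2, 0.4, 2.3]
labels = ["A", "B", "A", "B", "A", "B", "A", "B"]
print(v_fold_cv_error(patterns, labels, fit_threshold, v=4))   # 4-fold
print(v_fold_cv_error(patterns, labels, fit_threshold, v=8))   # leave-one-out
```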

Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July, 2010 68 / 68

1

Bayes decision error for minimum error

2

Bayes decision error for minimum risk

3

Non-parametric density estimation methods

4

Performance assessment

Giovanni Montana Imperial College ()

Bayesian decision theory and density estimation

14 July, 2010

2 / 68

Bayes decision error for minimum error

The problem

We have observed {xi , ti } on a training data set (random sample) Input x is a p dimensional vector (x1 , x2 , . . . , xp ) in Euclidean space Each x will be called pattern or data point or observation Output t is generally univariate

In regression, typically the output is a continuous measurement In classiﬁcation, the output is a class label Ck , k = 1, . . . , K

In some applications the response may also be multivariate, perhaps high-dimensional Joint probability distribution p(x, t) is generally unknown and estimated using the training data Given a new, unseen pattern x∗ , pattern classiﬁcation consists in

Assigning the correct class label t ∗ for x∗ Making a decision accordingly

Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July, 2010 3 / 68

n consists of n1 healthy subjects and n2 individuals with Alzheimer’s disease For each sample.Bayes decision error for minimum error Example: patient classiﬁcation using MRI data Data points {xi }. C1 (healthy) and C2 (diseased) Take t = 0 to indicate C1 and t = 1 to indicate C2 We wish to classify a new patient.g. on the basis on x Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. i = 1. 2010 4 / 68 . . . input x consists of p pixel intensities Output t consists of two classes. . provide a treatment). and take a related action (e. .

2 On unseen data. 2010 5 / 68 . objects are randomly assigned We can do better by making use of the data x Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. we want to make as few misclassiﬁcations as possible Suppose we only use our prior information We would assign a new pattern x to class k if p(Ck ) > p(Cj ).Bayes decision error for minimum error Making decisions when no learning data is available Assume known prior probabilities p(Ck ). k = 1. j =k For classes with equal probability.

2010 6 / 68 . 2 Using Bayes’ theorem p(Ck |x) = p(x|Ck )p(Ck ) p(x) If p(x. C1 ) + p(x. C) is known. all the requires probabilities are available The evidence p(x) is p(x) = p(x.Bayes decision error for minimum error Posterior probabilities We are interested in posterion probabilities p(Ck |x). k = 1. C2 ) = p(x|C1 )p(C1 ) + p(x|C2 )p(C2 ) It is just a scaling factor so that p(Ck |x) = 1 k How do we make as few misclassiﬁcations as possible? Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

2010 7 / 68 .Bayes decision error for minimum error Example: conditional densities with univariate input Left: class conditional densities p(x|C1 ) and p(x|C2 ) Right: posterior probabilities p(C1 |x) and p(C2 |x) Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

x)dx = −∞ p(error|x) p(x) dx ﬁxed If we ensure that p(error|x) is small as possible for every x. 2010 8 / 68 . we would incur an error with probability p(error|x) = Is this approach optimal? Note that ∞ ∞ p(C1 |x) p(C2 |x) if we decide C2 if we decide C1 p(error) = −∞ p(error. the integral must be as small as possible Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. we ﬁnd that p(C1 |x) > p(C2 |x) Intuitively.Bayes decision error for minimum error Intuitive approach After observing x. we would classify x as C1 We would take the opposite decision if p(C2 |x) > p(C1 |x) Accordingly.

C2 ) + p(x ∈ R2 . C2 ) dx + R2 p(x. then x is classiﬁed as belonging to Ck The probability of misclassiﬁcation is p(error) = p(x ∈ R1 . p(error|x) = min[p(C1 |x). 2 If pattern x is in Rk . k = 1. 2010 9 / 68 . p(C2 |x)] Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. the rule that minimises the probability of misclassiﬁcation is assign x to C1 if p(C1 |x) > p(C2 |x) Under this decision rule.Bayes decision error for minimum error Bayes decision rule for minimal error (two classes) Divide the input space x into decision regions Rk . C1 ) dx p(C1 |x)p(x) dx R2 = R1 p(C2|x)p(x) dx + Ignoring the common factor p(x). C1 ) = R1 p(x.

the probability of error is minimised Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum error Example: p(error) with K = 2 p(error) is the coloured area x is a boundary value deﬁning the two decision regions ˆ x0 is the optimal threshold . 2010 10 / 68 .

Bayes decision error for minimum error Likelihood ratio According to the Bayes rule. 2010 11 / 68 . we can rewrite the rule as x belongs to class C1 if p(x|C1 )p(C1 ) > p(x|C2 )p(C2 ) or alternatively x belongs to class C1 if p(x|C1 ) p(x|C2 ) likelihood ratio > p(C2 ) p(C1 ) Clearly the rule depends on both the prior probabilities and class-conditional densities Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. x belongs to class C1 if p(C1 |x) > p(C2 |x) The rule provides the required decision boundary Using Bayes’ theorem.

the threshold is higher (θb > θa ) and R1 is smaller. Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum error Example: Decision regions in the univariate case Using a 0 − 1 loss function. 2010 12 / 68 . R1 and R2 are determined by θa If the loss function penalises misclassifying C2 as C1 more than the converse.

. .Bayes decision error for minimum error Bayes decision rule for minimal error (more than two classes) Divide the input space x into decision regions Rk . . k = 1. . Ck ) p(x. K The probability of correct classiﬁcation is K p(correct) = k=1 K p(x ∈ Rk . 2010 13 / 68 . Ck ) dx k=1 Rk K = = k=1 Rk p(Ck |x) p(x) dx ﬁxed This probability is maximised when x is assigned to class Ck for which p(Ck |x) is the largest Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

an upper bound for the error is given by p(error|x) < 2p(C2 |x)p(C1 |x) Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 14 / 68 . this form of conditional error virtually always leads to a discontinuous integrand in the error For arbitrary densities. the error is ∞ p(error) = −∞ p(error|x)p(x)dx Under the optimal decision rule p(error|x) = min[p(C1 |x).Bayes decision error for minimum error Remark We have seen that. even if the posterior densities are continuous. p(C2 |x)] Note that. in the two-classes case.

.Bayes decision error for minimum risk Loss matrix In some applications either a loss or utility function may be available A (K × K ) loss matrix Λ with elements λij deﬁnes the loss incurred when a pattern in class i is classiﬁed as belonging to class j. . . we could take: λ10 > λ01 An optimal solution now minimises the loss function The loss function depends on the true classiﬁcation. j = 1. which is not available with certainty– the uncertainty is quantiﬁed by p(x. K When K = 2. 2010 15 / 68 . with i. we have Λ= 0 λ01 λ10 0 In the patient classiﬁcation example. . Ck ) Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Conditional and expected loss The total conditional loss or risk incurred when a pattern x is assigned to class k is K r (Ck |x) = i=1 λik p(Ci |x) The average loss or risk over the region supporting Ck is E[r (Ck |x)] = Rk K r (Ck |x)p(x) dx λik p(Ci |x)p(x) dx Rk i=1 K = = Rk i=1 λik p(x. 2010 16 / 68 . Ci ) dx Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Minimising the expected loss The average loss or risk is K K E[r (x)] = k=1 Rk i=1 λik p(x. 2010 17 / 68 . Ci ) dx We want to deﬁne regions {Rk } that minimise this expected loss This implies that we should minimise K λik p(Ci |x)p(x) i=1 This is the same as classifying x to the class k for which the conditional risk K r (Ck |x) = i=1 λik p(Ci |x) is minimum – the resulting minimum overall risk is called Bayes risk Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Example: two classes (1/2) Suppose there are only two classes. C1 and C2 Let λij = λ(Ci |Cj ) be the loss incurred for deciding Ci when the true class is Cj The conditional risks are r (C1 |x) = λ11 p(C1 |x) + λ12 p(C2 |x) and r (C2 |x) = λ21 p(C1 |x) + λ22 p(C2 |x) The rule says that we should decide C1 if r (C1 |x) < r (C2 |x) Using posterior probabilities. 2010 18 / 68 . we decide C1 if (λ21 − λ11 ) p(C1 |x) > (λ12 − λ22 ) p(C2 |x) positive positive or otherwise decide C2 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

we decide C1 if (λ21 − λ11 )p(x|C1 )p(C1 ) > (λ12 − λ22 )p(x|C2 )p(C2 ) or otherwise decide C2 Yet another formulation of the same rule suggests that we decide C1 if p(x|C1 ) p(x|C2 ) likelihood ratio > λ12 − λ22 p(C2 ) λ21 − λ11 p(C1 ) or otherwise decide C2 – assuming (λ21 − λ11 ) > 0 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. by employing Bayes formula.Bayes decision error for minimum risk Example: two classes (2/2) Alternatively. 2010 19 / 68 .

choose class k for which p(Ck |x) is a maximum Minimising the expect loss will minimise the misclassiﬁcation rate Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum risk A special case: the zero-one loss matrix A special loss function has elements λik = 1 − Iik Iik = 1 if i = k otherwise Iik = 0 We classify a pattern as belonging to class k if K K K λik p(Ci |x) = i=1 i=1 (1 − Iik )p(Ci |x) = 1 − i=1 Iik p(Ci |x) is a minimum – we have used the fact that K p(Ci |x) = 1 i=1 This suggests that we should choose class k for which 1 − p(Ck |x) is the smallest Equivalently. 2010 20 / 68 .

reject) patterns for which we are very uncertain Introduce an arbitrary threshold θ ∈ [0. 1] Deﬁne a reject region reject region = {x : maxk p(Ck |x) ≤ θ} Then take either one of the following decisions: Do not classify patterns that falls in the reject region Classify all other patterns using the Bayes decision rules.e. as before Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 21 / 68 .Bayes decision error for minimum risk The reject option Classiﬁcation errors arise from regions where the largest of p(Ck |x) is signiﬁcantly less than unity – that is when we are very uncertain about class membership In some applications we may want not to classify (i.

Bayes decision error for minimum risk Example: the reject option with univariate input Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 22 / 68 .

we incur a loss of λ If we take j = arg min k i=1 K λik p(Ci |x) then the expected loss is minimised if we follow this rule: Choose class j if mink Otherwise reject Giovanni Montana Imperial College () K i=1 λik p(Ci |x) < λ Bayesian decision theory and density estimation 14 July. we will incur an expected loss of K λik p(Ci |x) i=1 Suppose that.Bayes decision error for minimum risk The reject option that minimises the expected loss We now account for the loss incurred when a reject decision is made If we decide to assign pattern x to class k. 2010 23 / 68 . if we opt for the reject option.

Bayes decision error for minimum risk The reject option that minimises the expected loss For a zero-one loss matrix. when max p(Ck |x) < 1 − λ k In the standard reject criterion. 2010 24 / 68 . we reject if max p(Ck |x) < θ k Hence the two criteria for rejection are equivalent provided that θ =1−λ Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. λik = 1 − Iik we choose class j when K min k i=1 λik p(Ci |x) = min{1 − p(Ck |x)} < λ k Or equivalently.

2010 25 / 68 .Bayes decision error for minimum risk Summary Strategies for obtaining decision regions needed to classify patterns were introduced The Bayes rule for minimum error is the optimum rule Introducing the costs of making incorrect decisions leads to the Bayes rule for minimum risk We have assumed that a priori probabilities and class-conditional distributions are known – generally these will be learned from the data Two alternatives will be brieﬂy described next – both require knowledge of the class-conditional probability density functions the Neyman-Pearson rule the minimax rule Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Neyman-Person decision rule for two-class problems An alternative to Bayes decision rule for a two-class problem is the Neyman-Pearson rule This is often used in radar detection when a signal has to be classiﬁed as real (class C1 ) and noise (class C2 ) Two possible errors can be made 1 = R2 p(x|C1 ) dx = error probability of Type I p(x|C2 ) dx = error probability of Type II R1 2 = where C1 is the positive class. so C2 is the negative class. 2010 26 / 68 Bayesian decision theory and density estimation . so Giovanni Montana Imperial College () 1 2 is the false negative rate is the false positive rate 14 July.

2010 27 / 68 .Bayes decision error for minimum risk Neyman-Person decision rule The rule arises from a constrained optimisation setup: minimises subject to 2 being equal to a constant 0 Hence we wish to minimise err = R2 1 p(x|C1 ) dx + µ R1 p(x|C2 ) dx − 0 where µ is the Lagrange multiplier This will be minimised if we choose R1 such that if µp(x|C2 ) − p(x|C1 ) < 0 then x ∈ C1 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

Bayes decision error for minimum risk Decision rule Expressed in terms of the likelihood ratio. the Neyman-Person decision rule states that if p(x|C1 ) > µ then x ∈ C1 p(x|C2 ) The threshold µ is found so that p(x|C2 ) dx = R1 0 which often requires numerical integration The rule depends only on the within-class distributions and ignores the a priori probabilities Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 28 / 68 .

2010 29 / 68 .Bayes decision error for minimum risk Minimax criterion Bayes decision rules rely on a knowledge of both the within-class distributions and the prior class probabilities Often the prior class probabilities are unknown or they may vary according to external factors A reasonable approach is to design a classiﬁer so that the worst possible error that can be made for any value of the priors is as small as possible The minimax procedure is designed to minimise the maximum possible overall error or risk We will consider the simple case of K = 2 classes Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

the two regions R1 and R2 are determined.Bayes decision error for minimum risk Bayes minimum error for a given prior Let us consider the classiﬁcation error eB = p(C2 ) R1 p(x|C2 )dx + p(C1 ) R2 p(x|C1 )dx As a function of p(C1 ). eB is non-linear because the decision regions also depend on the prior When a value for p(C1 ) has been selected. and we regard the error function as a function of p(C1 ) only – we call it eB ˜ eB is linear in p(C1 ) and we can easily determine the value of p(C1 ) ˜ which gives the largest error Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 30 / 68 .

Bayes decision error for minimum risk Example: Bayes minimum error curves Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 31 / 68 .

2010 32 / 68 . the max error will occur at an extreme value of the prior – in the example. p(C1 ) = 1 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.Bayes decision error for minimum risk Example: Bayes minimum error curve The true Bayes error must be less than or equal to that linearly bounded value. respectively. since one has the freedom to change the decision boundary at each value of p(C1 ) Also note that the Bayes error is zero at p(C1 ) = 0 and p(C1 ) = 1 since the Bayes decision rule under those conditions is to always decide C1 or C2 . and this gives zero error Thus the curve of Bayes error is concave down all prior probabilities For ﬁxed decision regions.

R1 p(x|C2 )dx This is a minimum when eB (p(C1 )) is horizontal and touches the ˜ Bayes minimum error curve at its peak A minimum is achieved when p(x|C1 )dx = R2 R1 p(x|C2 )dx which is the point where the error will not change as function of the prior – in the example. eB (1)) = max e ˜ R2 p(x|C1 )dx. max(˜B (0). 2010 33 / 68 .6 Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. p(C1 ) = 0.Bayes decision error for minimum risk Minimax solution The minimax procedure minimises the largest possible error.

Bayes decision error for minimum risk Three main approaches to solving decision problems 1 Using generative models First estimate the class-conditional densities p(x|Ck ). k = 1. K Then use Bayes’ theorem to ﬁnd the posterior probabilities p(Ck |x) p(Ck |x) = p(x|Ck )p(Ck ) p(x) K p(x) = k=1 p(x|Ck )p(Ck ) Or model the joint distribution p(x. . . . Ck ) and then ﬁnd p(Ck |x) Each p(Ck ) can be estimated as proportion of data points in class k Use decision theory to allocate a new input x to a class 2 Using discriminative models Model the posterior class probabilities p(Ck |x) directly Use decision theory to allocate a new input x to a class 3 Using discriminative functions Find a function f (x) which maps an input x directly onto a class Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. . 2010 34 / 68 .

2010 35 / 68 . probability distributions have speciﬁc functional forms governed by a small number of parameters The parameters are estimated from the data. which results in poor predictive performance In the non-parametric approach. we make fewer assumptions about the form of the distributions Both frequentist and Bayesian approach have been developed We shall focus on commonly used frequentist approaches Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. for instance using a maximum-likelihood or Bayesian approach A limitation of this approach is that the chosen density might be a poor model for the distribution that generated the data.Non-parametric density estimation methods Parametric and non-parametric density estimation In the parametric approach.

Non-parametric density estimation methods Overview We will introduce three popular methods for density estimation Histogram methods Kernel methods (or Parzen estimation methods) Nearest neighbours methods Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 36 / 68 .

. . .Non-parametric density estimation methods Histogram methods for density estimation Let us assume a simple univariate pattern x observed on n objects Using standard histograms we partition the observations x1 . we divide ni by the total number of observations n and by the width ∆i to obtain ni pi = n∆i The it follows that the estimated density is indeed a probability density so p(x) dx = 1 Under this approach the density is constant over the width of each bin and usually ∆i = ∆ Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. cn into distinct bins of width ∆i count the number ni of observations falling in each bin In order to obtain a proper probability density. 2010 37 / 68 . x2 . .

the estimated density can be under-smoothed (spiky).Non-parametric density estimation methods Example: density estimation using the histogram method 50 data points generated from the distribution in green (mixture of two Gaussians) When ∆ is too small. 2010 38 / 68 . but has less eﬀect Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. otherwise it can be too smooth The edge location of the bin also pays a role.

we have a total of M p bins The number of data points required by the method is too large This exponential scaling with p is sometimes referred to as the curse of dimensionality Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. when the data points arrive sequentially The estimated density has discontinuities due to the bin edges rather than any property of the true density Also it does not scale well with dimensionality Suppose x is p-dimensional with p large If each dimension is divided into M bins. 2010 39 / 68 .Non-parametric density estimation methods Histogram method: pro and con Once the histogram has been computed. the raw data can be discarded. which is an advantage with large data sets It is suitable for on-line updating.

Non-parametric density estimation methods The curse of dimensionality Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. 2010 40 / 68 .

we should consider data points that lie within some local neighbourhood of that point This notion of locality requires that we assume some distance measure. 2010 41 / 68 .Non-parametric density estimation methods Towards alternative approaches Despite its shortcomings. the method is widely used. this approach highlights two elements needed for developing more complex methods First. this proximity notion was deﬁned by the bins The bin width is a smoothing parameter that deﬁnes the spatial extent of the local region Second. especially for a quickly visualisation of the data (in few dimensions) Moreover. the value of the smoothing parameter should not be too large or too small in order to obtain good results Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. to estimate the probability density at a particular location. for instance the Euclidean distance In histograms.

P) ∼ n! P K (1 − P)n−k k!(n − k)! The average number of points in Q is P with variance P(1 − P)n Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July. then each point has a probability P of being in Q The number k of points that lie inside Q follows a Binomial k | (n.Non-parametric density estimation methods Preliminaries Assume that n data points are drawn from an unknown probability density p(x) in a p-dimensional Euclidean space Let us consider a small region Q containing x The probability mass associated with this region is given by P= Q p(x) dx Suppose we have n data points from p(x). 2010 42 / 68 .

Non-parametric density estimation methods Distribution of k as function of n The true value is P = 0. 2010 43 / 68 .7 – note that as n increases the curve peaks around P In the limit n → ∞ the curve approaches a delta function Giovanni Montana Imperial College () Bayesian decision theory and density estimation 14 July.

then P ≈ p(x)V where V is the volume of Q Combining the two we obtain a density estimate at x p(x) = Giovanni Montana Imperial College () P k = V nV 14 July. so that the probability density is roughly constant over the region. 2010 44 / 68 Bayesian decision theory and density estimation .Non-parametric density estimation methods Density estimate The estimated average of points within Q is k =P n so for large n we take k ≈ nP If however we also assume that the region Q is suﬃciently small.

Non-parametric density estimation methods

Towards two alternative methods

- Notice that the validity of the estimate p(x) = k / (nV) depends on two contradictory assumptions:
  - We want Q to be sufficiently large (in relation to the value of the density) so that the number k of points falling inside the region is large enough for the binomial to be sharply peaked around P.
  - We want Q to be sufficiently small that the density is approximately constant over the region.
- There are two possible approaches:
  - We can fix V and determine k from the data – this is done by defining localised regions around x.
  - We can fix k and determine V from the data – this is done by searching for neighbours of x.

Non-parametric density estimation methods

Kernel density estimator

- We wish to determine the probability density at x. Take the region Q to be a small hypercube centred on x.
- In order to count the number k of points within Q, we define a function representing the unit cube centred on the origin:

  g(u) = 1 if |u_j| ≤ 1/2 for j = 1, 2, …, p; 0 otherwise

- This is called the Parzen window, an instance of a kernel function.
- We have that g((x − x_i)/h) = 1 if x_i lies inside a cube of side h centred on x, and 0 otherwise.

Non-parametric density estimation methods

Kernel density estimation

- The number k of data points inside the cube therefore is

  k = Σ_{i=1}^{n} g((x − x_i)/h)

- We have previously established that an estimate for p(x) is k / (nV). Combining the two results gives

  p(x) = (1/n) Σ_{i=1}^{n} (1/V) g((x − x_i)/h) = (1/n) Σ_{i=1}^{n} (1/h^p) g((x − x_i)/h)

  where we have taken V = h^p, the volume of a hypercube of side h in p dimensions.
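The Parzen-window estimator can be coded directly from this formula. The sketch below is our own toy illustration (function name and 1-D example data are not from the slides); points are tuples so the same code works for any dimension p:

```python
def parzen_estimate(x, data, h):
    """Hypercube (Parzen window) density estimate at point x.

    g((x - xi)/h) equals 1 exactly when xi lies inside the cube of
    side h centred on x, so k is the count of such points and the
    estimate is p(x) = k / (n * h^p)."""
    p = len(x)
    n = len(data)
    k = sum(
        1 for xi in data
        if all(abs((x[j] - xi[j]) / h) <= 0.5 for j in range(p))
    )
    return k / (n * h ** p)

# Toy 1-D sample: a cluster of five points near 0 plus an outlier at 3
data = [(-0.2,), (-0.1,), (0.0,), (0.1,), (0.2,), (3.0,)]
print(parzen_estimate((0.0,), data, h=1.0))   # 5/6 ≈ 0.833
print(parzen_estimate((3.0,), data, h=1.0))   # 1/6 ≈ 0.167
```

The estimated density is higher near the cluster, as expected, and the output jumps discontinuously as points enter or leave the cube – the motivation for the smoother kernels discussed next.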

Non-parametric density estimation methods

Remarks on the kernel density estimator

- As with the histogram method, the kernel density estimator suffers from the presence of discontinuities – here these are located at the boundaries of the cubes.
- Smoother density estimates can be obtained simply by using alternative kernel functions. A common choice is the Gaussian kernel, which yields the estimator

  p(x) = (1/n) Σ_{i=1}^{n} 1/(2πh²)^{p/2} exp( −‖x − x_i‖² / (2h²) )

  where h is the standard deviation of the Gaussian components.

- The density model places a Gaussian over each data point and then adds the contributions over all points.
- It can be proved that the kernel density estimator converges to the true density in the limit n → ∞.
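A minimal sketch of the Gaussian-kernel estimator (our own code, assuming the p-dimensional normalisation (2πh²)^(p/2) as in the formula above):

```python
import math

def gaussian_kde(x, data, h):
    """Gaussian-kernel density estimate: a Gaussian of standard
    deviation h is placed over each data point and the contributions
    are averaged."""
    p = len(x)
    norm = (2 * math.pi * h ** 2) ** (p / 2)
    total = 0.0
    for xi in data:
        sq_dist = sum((x[j] - xi[j]) ** 2 for j in range(p))
        total += math.exp(-sq_dist / (2 * h ** 2)) / norm
    return total / len(data)

# Sanity check: with a single 1-D data point, the estimate at that
# point is the peak of a N(0, h^2) density, i.e. 1 / sqrt(2*pi*h^2)
print(gaussian_kde((0.0,), [(0.0,)], h=1.0))
```

Unlike the hypercube version, this estimate varies smoothly with x, at the cost of every data point contributing to every evaluation.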

Non-parametric density estimation methods

Example: kernel density estimation using a Gaussian kernel

- 50 data points generated from the distribution in green (a mixture of two Gaussians).
- h acts as a smoothing parameter: when h is too small the estimated density is under-smoothed (too noisy); when h is too large it is over-smoothed.

Non-parametric density estimation methods

Other kernels and improvements

- Any other kernel can be chosen subject to g(u) ≥ 0 and ∫ g(u) du = 1, thus ensuring that the probability distribution is nonnegative everywhere and integrates to one.
- The computational cost of evaluating the density grows linearly with n.
- Choosing the correct h is critical – much research has been done to develop procedures that learn h from the data.
- So far the parameter h has been fixed in all component kernels: a large h may lead to over-smoothing in regions of high density, while a small h may lead to noisy estimates elsewhere; the optimal h depends on the region of the input space.

Non-parametric density estimation methods

Nearest neighbours methods

- Instead of fixing V and determining k from the data, we can fix k and then use the data to find V accordingly.
- Suppose we want to estimate the density at x. Take a small sphere centred at x and let its radius grow until it contains exactly k points.
- The density estimate is then given, as before, by p(x) = k / (nV), with V set to the volume of the sphere.
- The degree of smoothing is now governed by the parameter k, the number of nearest neighbours.
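A one-dimensional sketch of this estimator (toy code of ours, not from the slides; in 1-D the "sphere" around x is an interval of length 2r):

```python
def knn_density(x, data, k):
    """k-nearest-neighbour density estimate for scalar data: grow an
    interval around x until it holds exactly k points, then apply
    p(x) = k / (n V).  Assumes the k-th neighbour is at a nonzero
    distance (duplicated points would make V = 0)."""
    n = len(data)
    dists = sorted(abs(xi - x) for xi in data)
    radius = dists[k - 1]        # distance to the k-th nearest point
    V = 2 * radius               # volume of a 1-D sphere: an interval
    return k / (n * V)

data = [0.0, 1.0, 2.0, 10.0]
print(knn_density(0.0, data, k=2))   # radius 1, V = 2, so 2/(4*2) = 0.25
```

Note that V now adapts to the local density: in sparse regions the interval must grow further to capture k points, lowering the estimate.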

Non-parametric density estimation methods

Example: nearest neighbours

- 50 data points generated from the distribution in green (a mixture of two Gaussians).
- k acts as a smoothing parameter: when k is too small the estimated density is under-smoothed (too noisy); when k is too large it is over-smoothed.

Non-parametric density estimation methods

Nearest neighbours method for classification

- We can apply the nearest neighbour method to estimate the probability density in each class Cm, then use the Bayes rule to obtain the posterior probabilities.
- Suppose we have M classes and have observed nm patterns in class Cm, with Σ_{m=1}^{M} nm = n.
- Given a new pattern x to be classified, we take a sphere around it containing exactly its k nearest points. The sphere has volume V and contains km points from class Cm, with Σ_{m=1}^{M} km = k.

Non-parametric density estimation methods

Nearest neighbours method for classification

- The class priors can be estimated by p(Cm) = nm / n.
- The class-conditional density estimates are p(x|Cm) = km / (nm V), for m = 1, 2, …, M.
- As before, the unconditional density is given by p(x) = k / (nV).
- Combining these results we obtain the posterior probabilities

  p(Cm|x) = p(x|Cm) p(Cm) / p(x) = km / k
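Since the posterior reduces to p(Cm|x) = km/k, classifying x amounts to a majority vote among its k nearest neighbours. A 1-D sketch of ours (toy data, hypothetical function name):

```python
from collections import Counter

def knn_classify(x, data, labels, k):
    """Classify x by majority vote among its k nearest neighbours:
    the class Cm with the largest count km maximises p(Cm|x) = km/k."""
    order = sorted(range(len(data)), key=lambda i: abs(data[i] - x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

data = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_classify(0.15, data, labels, k=3))   # 'A'
print(knn_classify(1.05, data, labels, k=3))   # 'B'
```

With patterns in more than one dimension, only the distance function changes; the vote itself is identical.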

Non-parametric density estimation methods

Remarks on the nearest neighbours method

- This approach gives a sensible rule: compute the k nearest neighbours of x and count how many fall in each class; classify x as belonging to Cm if the majority of its neighbours are in Cm.
- As with the kernel density estimator, it can be proved that the nearest-neighbour density estimator converges to the true density in the limit n → ∞.
- An interesting property of the nearest neighbours method (when k = 1) is that, in the limit n → ∞, its error is never more than twice the minimum error rate achievable by an optimal classifier.
- The notion of similarity between any two given patterns x and x′ is important:
  - For patterns living in a Euclidean space, the Euclidean distance is an obvious choice, but other alternatives are available (more on this later).
  - When the patterns are objects of a different nature (graphs, trees, strings), other distances should be used.

Non-parametric density estimation methods

Example: classification with nearest neighbours

- [Figure: classification using K = 3]

Non-parametric density estimation methods

Example: classification with nearest neighbours

- [Figure: decision region obtained using K = 1]

Non-parametric density estimation methods

Example: oil flow data set

- During the course of this study, the proportions of oil, water and gas in North Sea oil transfer pipelines were measured.
- Depending on the geometrical configuration of these three materials, a pipe was labelled as Stratified, Annular, or Homogeneous.
- Each data point comprises a 12-dimensional pattern consisting of non-invasive measurements taken with gamma ray densitometers.
- The principle is the following: if a narrow beam of gamma rays passes through the pipe, the attenuation in the intensity of the beam provides information about the density of material along its path.
- The ultimate objective is to classify a pipe as belonging to one of these three classes.

Non-parametric density estimation methods

Example: oil flow data set

- The three classes define geometrical configurations of the oil, water and gas.
- Each pipe is a pattern in a 12-dimensional space.

Non-parametric density estimation methods

Example: oil flow data set

- Red is Homogeneous, green is Annular and blue is Stratified.
- The pattern × has to be classified based on variables x6 and x7.

Non-parametric density estimation methods

Example: oil flow data set

- Decision region obtained using K = 1.
- The regions are fragmented and complex – overfitting may take place.

Non-parametric density estimation methods

Example: oil flow data set

- Decision region obtained using K = 3.
- The regions are smoother.

Non-parametric density estimation methods

Example: oil flow data set

- Decision region obtained using K = 31.
- The regions are too smooth.

Non-parametric density estimation methods

Summary

- When the true class-conditional densities are known, the minimal achievable error rate is that of a Bayes classifier.
- When they are not known, they are first learned from the data using either a parametric or a non-parametric approach.
- Non-parametric approaches make no assumptions about the form of the distributions; we have briefly considered three methods: histograms, kernels and nearest neighbours.
- All non-parametric methods rely on some tuning parameters that will inevitably affect the classification performance on unseen patterns.

Performance assessment

The generalisation error

- As we will see, almost all classifiers also rely on tuneable parameters. These parameters control the complexity of the model, which in turn determines the complexity of the decision regions.
- If the model is too simple, some salient features of the data will not be captured.
- If the model is too complex (for instance a nearest neighbour classifier with k = 1), it will generate complex decision boundaries that may not describe well the true boundaries.
- This is the issue of generalisation – ultimately we want the model to perform well on new, unseen patterns. The question then is how to obtain measures of the generalisation error.

Performance assessment

The independent test approach

- Suppose the data set D contains n labelled patterns. We can randomly split the data set into two parts:
  - a training set (e.g. 90% of the data) used for adjusting the parameters;
  - a test set used to estimate the generalisation error.
- A measure of generalisation error E can simply be the fraction of wrongly classified labels in the test set.
- Since the ultimate goal is to obtain a low generalisation error, we train the classifier until we reach a minimum of this error.
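The holdout procedure can be sketched as follows (our own minimal implementation; the function names, the 90/10 split default, and the toy 1-nearest-neighbour rule are illustrative assumptions):

```python
import random

def holdout_error(patterns, labels, classify, train_frac=0.9, seed=0):
    """Estimate the generalisation error with a random train/test split.

    `classify(train_x, train_y, x)` is any rule trained on the training
    portion; the return value is the fraction of mislabelled test
    patterns."""
    idx = list(range(len(patterns)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    train_x = [patterns[i] for i in train]
    train_y = [labels[i] for i in train]
    wrong = sum(classify(train_x, train_y, patterns[i]) != labels[i]
                for i in test)
    return wrong / len(test)

# Example: a 1-nearest-neighbour rule on two well-separated 1-D clusters
def nn1(tx, ty, x):
    return ty[min(range(len(tx)), key=lambda i: abs(tx[i] - x))]

patterns = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
labels = ['A'] * 10 + ['B'] * 10
print(holdout_error(patterns, labels, nn1))   # 0.0 on separable data
```

Because the estimate depends on a single random split, it can be noisy for small n – the motivation for cross-validation below.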

Performance assessment

Generalisation error as a function of model complexity

- Classifiers that are too complex perform well on training data but not on independent data.

Performance assessment

v-fold cross-validation

- A simple generalisation of the previous approach is given by the v-fold cross-validation approach.
- The data set D is divided into v disjoint sets of equal size n/v.
- The classifier is trained v times, each time with a different set held out as a validation set.
- For each model a generalisation error is computed, say Ei, i = 1, …, v.
- The estimated generalisation error is the mean of the v errors.
- In the limit, when v = n, the method is called leave-one-out cross-validation.
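The steps above can be sketched in a few lines (our own toy implementation; fold sizes are only approximately n/v when v does not divide n, and the nearest-neighbour rule is just an illustrative choice):

```python
def cross_val_error(patterns, labels, classify, v):
    """v-fold cross-validation estimate of the generalisation error.

    Each fold is held out once as a validation set while the classifier
    is trained on the rest; the v error rates E1, ..., Ev are averaged.
    Setting v = len(patterns) gives leave-one-out cross-validation."""
    n = len(patterns)
    folds = [list(range(i, n, v)) for i in range(v)]   # v disjoint folds
    errors = []
    for fold in folds:
        held = set(fold)
        tx = [patterns[i] for i in range(n) if i not in held]
        ty = [labels[i] for i in range(n) if i not in held]
        wrong = sum(classify(tx, ty, patterns[i]) != labels[i]
                    for i in fold)
        errors.append(wrong / len(fold))
    return sum(errors) / v

def nearest_label(tx, ty, x):
    """A trivial 1-nearest-neighbour rule used to illustrate the loop."""
    return ty[min(range(len(tx)), key=lambda i: abs(tx[i] - x))]

patterns = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
labels = ['A'] * 10 + ['B'] * 10
print(cross_val_error(patterns, labels, nearest_label, v=5))   # 0.0
```

Every pattern is used for validation exactly once, so the estimate wastes less data than a single holdout split, at the cost of training the classifier v times.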
