
Nonparametric techniques

(References: Bishop, Pattern Recognition and Machine Learning; Duda, Hart and Stork, Pattern Classification)


Introduction
• Recall Bayesian parameter estimation and maximum likelihood estimation.

• In both, the form of the density function is assumed known.

• For example, we assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi.

• But what if it is not Gaussian, or does not take any parametric form?

• Moreover, all of the classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities.
Non-parametric density Estimation
• Estimate arbitrary distributions without the assumption that the forms of the underlying densities are known.

• There are several types of nonparametric methods:

• 1. Estimate the density functions p(x|ωj) from sample patterns.

• 2. Directly estimate the posterior probabilities P(ωj|x).


Histogram methods for density estimation
• Consider the case of a single continuous variable x.

• Standard histograms simply partition x into distinct bins of width ∆i and then count the number ni of observations of x falling in bin i.
Histogram methods for density estimation
• An illustration of the histogram approach to density estimation: a data set of 50 data points is generated from the distribution shown by the green curve.

• To turn this count into a normalized probability density, divide ni by the total number N of observations and by the width ∆i of the bins, to obtain the probability density for each bin:

pi = ni / (N · ∆i)

• This gives a density model that is constant over the width of each bin and integrates to 1.
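As a quick illustration, here is a minimal NumPy sketch of this normalization (the bimodal data set and the bin count M are made-up assumptions for the example):

```python
import numpy as np

# Hypothetical 1-D data drawn from a bimodal distribution (N = 50 points)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 25), rng.normal(1, 1.0, 25)])

# Partition the range of x into M bins of equal width Delta
M = 10
edges = np.linspace(x.min(), x.max(), M + 1)
delta = edges[1] - edges[0]              # bin width Delta_i (equal here)
n_i, _ = np.histogram(x, bins=edges)     # counts n_i per bin

# Normalized density: p_i = n_i / (N * Delta_i)
p_i = n_i / (len(x) * delta)

# Check: the histogram density integrates to 1
assert np.isclose((p_i * delta).sum(), 1.0)
```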


Histogram methods for density estimation
• Effect of histogram bin width:

• When ∆ is very small (top figure), the resulting density model is very spiky, with a lot of structure that is not present in the underlying distribution that generated the data set.

• When ∆ is too large (bottom figure), the result is a model that is too smooth and that consequently fails to capture the bimodal property of the green curve.

• The best results are obtained for some intermediate value of ∆ (middle figure). In principle, a histogram density model also depends on the choice of edge location for the bins, though this is typically much less significant than the value of ∆.
Histogram methods for density estimation
• Once the histogram has been computed, the data set itself can be discarded, which can be advantageous if the data set is large.
• The histogram approach is also easily applied if the data points arrive sequentially.
• However, the estimated density has discontinuities that are due to the bin edges rather than any property of the underlying distribution that generated the data.
• Another major limitation of the histogram approach is its scaling with dimensionality: if we divide each variable in a D-dimensional space into M bins, then the total number of bins will be M^D. For example, with M = 10 bins per axis in D = 10 dimensions, we already need 10^10 bins.
• This exponential scaling with D is an example of the curse of dimensionality.
Histogram methods for density estimation
• Two important lessons:

• 1. To estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point.

• 2. The value of the smoothing parameter should be neither too large nor too small in order to obtain good results.
Non-parametric density Estimation
• Two popular methods:

• 1. Parzen window / kernel method


• 2. K-Nearest Neighbor Method
Non-parametric density Estimation
• Let us suppose that observations are being drawn from some unknown probability density p(x) in some D-dimensional space, and we wish to estimate the value of p(x).

• Let R be a small region around the point x (say, a ball of small radius).

• The probability P that a vector x will fall in the region R is the integral of the probability density over the region:

P = ∫R p(x′) dx′

• If p is nearly constant in the region R, then P ≈ p(x) · V, where V is the volume of the region R.

• Thus p(x) ≈ P / V.

• To estimate p(x), we can take a region of known volume V; the question is, how do we know P?
Non-parametric density Estimation
• Suppose that out of n i.i.d. samples, k samples fall in the region R.
Independent and Identically distributed random variables
• Random variables are i.i.d. if each has the same probability distribution as the others and all are mutually independent.

• Example 1: Toss a coin 10 times and record how many times the coin lands on heads.
1. Independent – the outcome of one toss does not affect the others, so the 10 results are independent of each other.
2. Identically Distributed – if the coin is of homogeneous material, the probability of heads is 0.5 on every toss, so the distribution is identical each time.

• Example 2: Choose a card from a standard deck of 52 cards, then place the card back in the deck. Repeat this 52 times and record the number of times a King appears.
1. Independent – each draw does not affect the next one, so the 52 results are independent of each other.
2. Identically Distributed – because the card is replaced, the probability of a King is 4/52 on every draw, so the distribution is identical each time.
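As a tiny illustration, both examples are counts of i.i.d. trials and can be simulated directly (a sketch; the counts follow binomial distributions, which is exactly the distribution used on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
heads = rng.binomial(n=10, p=0.5)    # Example 1: heads in 10 fair coin tosses
kings = rng.binomial(n=52, p=4/52)   # Example 2: Kings in 52 draws with replacement
print(heads, kings)
```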
Non-parametric density Estimation
• Now suppose that we have collected a data set comprising N observations drawn from p(x).
• Because each data point has a probability P of falling within R, the total number K of points that lie inside R will be distributed according to the binomial distribution:

Bin(K|N, P) = [N! / (K! (N − K)!)] · P^K (1 − P)^(N−K)

• The mean of this distribution is E[K] = N·P, so for large N we expect K ≈ N·P.
• Combining K ≈ N·P with P ≈ p(x)·V yields the basic nonparametric density estimate:

p(x) ≈ K / (N · V)
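A quick sanity check of this estimate as a minimal sketch (the choice of a standard normal, the query point x0, and the region width are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
samples = rng.normal(0.0, 1.0, N)        # draws from an assumed standard normal p(x)

x0, half_width = 0.5, 0.05               # small region R = [x0 - 0.05, x0 + 0.05]
V = 2 * half_width                       # "volume" (length) of R in 1-D
K = np.sum(np.abs(samples - x0) < half_width)  # number of points falling in R

estimate = K / (N * V)                   # p(x0) ~= K / (N V)
true_pdf = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)
print(estimate, true_pdf)                # the two values should be close
```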

Non-parametric density Estimation
• The choice of V affects the quality of the approximation.

• There are two contradictory requirements:

• 1. For the approximation P ≈ p(x) · V to be good, V should be very small.
• 2. On the other hand, if V is very small it may contain zero samples (unless n is very, very large).

• So the choice of V is a compromise between these two requirements.
Non-parametric density Estimation
• Let the total number of samples be n.
• Let kn be the number of samples falling in region Rn.
• Let Vn be the volume of region Rn.
• Then pn(x), the nth estimate of p(x), is:

pn(x) = kn / (n · Vn)

• If n goes to infinity, do we get a good estimate?

• For pn(x) → p(x) as n → ∞, we require:

1. Vn → 0
2. kn → ∞, and
3. kn/n → 0
Non-parametric density Estimation
• In a practical scenario we only have a finite n.
• We choose the size of V based on n.

• We have two ways to approach this problem:

1. Fix the volume V and count k (Parzen window method).
2. Fix k and choose the volume V that incorporates the k nearest neighbors of x (k-nearest-neighbor method).
1. Parzen Window Method
• Assume that the region Rn is a d-dimensional hypercube.
• If hn is the length of an edge of that hypercube, then its volume is given by:

Vn = hn^d
1. Parzen Window Method
• We obtain an expression for kn, the number of samples falling in the window hypercube, by defining the following window function:

ϕ(u) = 1 if |uj| ≤ 1/2 for j = 1, …, d, and 0 otherwise

• where ϕ defines a unit hypercube centered at the origin.

• Note that ϕ is symmetric, i.e. ϕ(u) = ϕ(−u).
1. Parzen Window Method
• We now estimate pn(x), the probability density at a given point x.
• We know that

pn(x) = kn / (n · Vn)

• and

kn = Σi=1..n ϕ((x − xi) / hn)

• thus,

pn(x) = (1/n) Σi=1..n (1/Vn) ϕ((x − xi) / hn)
1. Parzen Window Method
• For pn(x) to be a legitimate density, the window function should be non-negative:

ϕ(u) ≥ 0

• and the window function should itself be a density function:

∫ ϕ(u) du = 1
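A minimal NumPy sketch of the hypercube Parzen estimate above (the 2-D Gaussian data, the edge length h, and the query point are illustrative assumptions):

```python
import numpy as np

def phi(u):
    """Unit hypercube window: 1 if every coordinate |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(x, samples, h):
    """Parzen estimate p_n(x) = (1/n) * sum_i (1/h^d) * phi((x - x_i)/h)."""
    n, d = samples.shape
    V = h ** d                          # volume of the hypercube window
    counts = phi((x - samples) / h)     # 1 for each sample inside the window
    return counts.sum() / (n * V)

# Illustrative use: 2-D samples from a standard normal, query at the origin
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))
print(parzen_hypercube(np.zeros(2), samples, h=0.5))
```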


1. Parzen Window Method
• The hypercube window results in a density estimate with discontinuities, in this case at the boundaries of the cubes.

• We can obtain a smoother density model if we choose a smoother kernel function.
1. Parzen Window Method
• Gaussian kernel (smooth and spherically symmetric, in contrast to the hypercube):

p(x) = (1/N) Σn=1..N [1 / (2πh²)^(D/2)] exp(−‖x − xn‖² / (2h²))

• where h represents the standard deviation of the Gaussian components.
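A minimal sketch of this Gaussian kernel density estimate (the bimodal data set and the bandwidth h = 0.3 are illustrative assumptions):

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2 I), with bandwidth h."""
    N, D = samples.shape
    sq_dists = np.sum((x - samples) ** 2, axis=1)    # ||x - x_n||^2
    norm_const = (2 * np.pi * h ** 2) ** (D / 2)     # Gaussian normalizer
    return np.mean(np.exp(-sq_dists / (2 * h ** 2)) / norm_const)

# Illustrative use: bimodal 1-D data, moderate bandwidth
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 100),
                          rng.normal(1, 1.0, 100)]).reshape(-1, 1)
print(gaussian_kde(np.array([0.0]), samples, h=0.3))
```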


1. Parzen Window Method
• A 1-D Gaussian kernel is used in the illustration here.
1. Parzen Window Method
• We see that h acts as a smoothing parameter: if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out.

• The best density model is obtained for some intermediate value of h (middle panel).
2. KNN Method
• A difficulty with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels.

• Instead of fixing V and determining the value of K from the data, we consider a fixed value of K and use the data to find an appropriate value for V.

• Consider a small sphere centred on the point x at which we wish to estimate the density p(x), and allow the radius of the sphere to grow until it contains precisely K data points. The estimate is then p(x) ≈ K / (N · V), with V the volume of the resulting sphere.
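A minimal sketch of this K-nearest-neighbour density estimate in 1-D (the data and K are illustrative; in 1-D the "volume" of the sphere is just the length of the interval):

```python
import numpy as np

def knn_density(x, samples, K):
    """p(x) ~= K / (N * V), where V is the volume of the smallest
    sphere around x containing exactly K samples (1-D case)."""
    N = len(samples)
    radius = np.sort(np.abs(samples - x))[K - 1]  # distance to Kth nearest neighbour
    V = 2 * radius                                # length of the interval in 1-D
    return K / (N * V)

# Illustrative use on bimodal data
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 100)])
print(knn_density(0.0, samples, K=10))
```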
2. KNN Method
• We see that the parameter K governs the degree of smoothing: a small value of K leads to a very noisy density model (top panel), whereas a large value (bottom panel) smoothes out the bimodal nature of the true distribution (shown by the green curve) from which the data set was generated.
2. KNN Method
• The K-nearest-neighbour technique for density estimation can be extended to the problem of classification.
• We apply the K-nearest-neighbour density estimation technique to each class separately and then make use of Bayes' theorem.
2. KNN Method
• Let us suppose that we have a data set comprising Nk points in class Ck, with N points in total.

• If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points, irrespective of their class.
2. KNN Method
• Suppose this sphere has volume V and contains Kk points from class Ck.

• The estimate of the density associated with each class is:

p(x|Ck) = Kk / (Nk · V)

• Similarly, the unconditional density is given by:

p(x) = K / (N · V)

• The class priors are given by:

p(Ck) = Nk / N
2. KNN Method
• Combining these with Bayes' theorem, we obtain the posterior probability of class membership:

p(Ck|x) = p(x|Ck) p(Ck) / p(x) = Kk / K

• To minimise the probability of misclassification, we assign x to the class with the largest posterior, i.e. the class having the most representatives among the K nearest neighbours.
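A minimal sketch of the resulting classifier, which assigns x to the class with the largest Kk/K (the two-class Gaussian data and K = 5 are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, samples, labels, K):
    """Posterior p(C_k|x) = K_k / K over the K nearest neighbours of x."""
    dists = np.linalg.norm(samples - x, axis=1)   # distances to all points
    nearest = labels[np.argsort(dists)[:K]]       # labels of K nearest points
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)], counts / K  # predicted class, posteriors

# Illustrative use: two Gaussian classes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([0.5, 0.5]), X, y, K=5))
```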
