
Nonparametric techniques

(References: Bishop, Pattern Recognition and Machine Learning; Duda, Hart and Stork, Pattern Classification)


Introduction
• Recall Bayesian parameter estimation and maximum likelihood estimation.

• In both, the form of the density function is assumed known.

• For example, we assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi.

• But what if it is not Gaussian, or does not take any parametric form?

• Moreover, all of the classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities.
Non-parametric density Estimation
• Estimate arbitrary distributions without the assumption that the forms of the underlying densities are known.

• There are several types of nonparametric methods:

• 1. Estimate the density functions p(x|ωj) from sample patterns.

• 2. Directly estimate the posterior probabilities P(ωj|x).


Histogram methods for density estimation
• Consider the case of a single continuous variable x.

• Standard histograms simply partition x into distinct bins of width ∆i and then count the number ni of observations of x falling in bin i.
Histogram methods for density estimation
• An illustration of the histogram approach to density estimation: a data set of 50 data points is generated from the distribution shown by the green curve.

• To turn this count into a normalized probability density, divide ni by the total number N of observations and by the width ∆i of the bins, to obtain the probability density for each bin:

pi = ni / (N · ∆i)

• This gives a density model that is constant over the width of each bin and integrates to 1.
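As a quick illustration, here is a minimal NumPy sketch of this normalization (the bimodal data set and the bin count M are made-up assumptions for the example):

```python
import numpy as np

# Hypothetical 1-D data drawn from a bimodal distribution (N = 50 points)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 25), rng.normal(1, 1.0, 25)])

# Partition the range of x into M bins of equal width Delta
M = 10
edges = np.linspace(x.min(), x.max(), M + 1)
delta = edges[1] - edges[0]              # bin width Delta_i (equal here)
n_i, _ = np.histogram(x, bins=edges)     # counts n_i per bin

# Normalized density: p_i = n_i / (N * Delta_i)
p_i = n_i / (len(x) * delta)

# Check: the histogram density integrates to 1
assert np.isclose((p_i * delta).sum(), 1.0)
```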


Histogram methods for density estimation
• Effect of histogram bin width:

• When ∆ is very small (top figure), the resulting density model is very spiky, with a lot of structure that is not present in the underlying distribution that generated the data set.

• When ∆ is too large (bottom figure), the result is a model that is too smooth and that consequently fails to capture the bimodal property of the green curve.

• The best results are obtained for some intermediate value of ∆ (middle figure). In principle, a histogram density model also depends on the choice of edge location for the bins, though this is typically much less significant than the value of ∆.
Histogram methods for density estimation
• Once the histogram has been computed, the data set itself can be discarded, which can be advantageous if the data set is large.
• The histogram approach is also easily applied if the data points arrive sequentially.
• However, the estimated density has discontinuities that are due to the bin edges rather than any property of the underlying distribution that generated the data.
• Another major limitation of the histogram approach is its scaling with dimensionality: if we divide each variable in a D-dimensional space into M bins, then the total number of bins will be M^D. For example, with M = 10 bins per axis in D = 10 dimensions, we already need 10^10 bins.
• This exponential scaling with D is an example of the curse of dimensionality.
Histogram methods for density estimation
• Two important lessons:

• 1. To estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point.

• 2. The value of the smoothing parameter should be neither too large nor too small in order to obtain good results.
Non-parametric density Estimation
• Two popular methods:

• 1. Parzen window / kernel method


• 2. K-Nearest Neighbor Method
Non-parametric density Estimation
• Let us suppose that observations are being drawn from some unknown probability density p(x) in some D-dimensional space, and we wish to estimate the value of p(x).

• Let R be a small region around the point x (say, a ball of small radius).

• The probability P that a vector x will fall in the region R is the integral of the probability density over the region:

P = ∫R p(x′) dx′

• If p is nearly constant in the region R, then P ≈ p(x) · V, where V is the volume of the region R.

• Thus p(x) ≈ P / V.

• To estimate p(x), we can take a region of known volume V; the question is, how do we know P?
Non-parametric density Estimation
• Suppose that out of n i.i.d. samples, k samples fall in the region R.
Independent and Identically distributed random variables
• Random variables are i.i.d. if each has the same probability distribution as the others and all are mutually independent.

• Example 1: Toss a coin 10 times and record how many times the coin lands on heads.
1. Independent – the outcome of one toss does not affect the others, so the 10 results are independent of each other.
2. Identically Distributed – if the coin is of homogeneous material, the probability of heads is 0.5 on every toss, so the distribution is identical each time.

• Example 2: Choose a card from a standard deck of 52 cards, then place the card back in the deck. Repeat this 52 times and record the number of times a King appears.
1. Independent – each draw does not affect the next one, so the 52 results are independent of each other.
2. Identically Distributed – because the card is replaced, the probability of a King is 4/52 on every draw, so the distribution is identical each time.
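As a tiny illustration, both examples are counts of i.i.d. trials and can be simulated directly (a sketch; the counts follow binomial distributions, which is exactly the distribution used on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
heads = rng.binomial(n=10, p=0.5)    # Example 1: heads in 10 fair coin tosses
kings = rng.binomial(n=52, p=4/52)   # Example 2: Kings in 52 draws with replacement
print(heads, kings)
```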
Non-parametric density Estimation
• Now suppose that we have collected a data set comprising N observations drawn from p(x).
• Because each data point has a probability P of falling within R, the total number K of points that lie inside R will be distributed according to the binomial distribution:

Bin(K|N, P) = [N! / (K! (N − K)!)] · P^K (1 − P)^(N−K)

• The mean of this distribution is E[K] = N·P, so for large N we expect K ≈ N·P.
• Combining K ≈ N·P with P ≈ p(x)·V yields the basic nonparametric density estimate:

p(x) ≈ K / (N · V)
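A quick sanity check of this estimate as a minimal sketch (the choice of a standard normal, the query point x0, and the region width are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
samples = rng.normal(0.0, 1.0, N)        # draws from an assumed standard normal p(x)

x0, half_width = 0.5, 0.05               # small region R = [x0 - 0.05, x0 + 0.05]
V = 2 * half_width                       # "volume" (length) of R in 1-D
K = np.sum(np.abs(samples - x0) < half_width)  # number of points falling in R

estimate = K / (N * V)                   # p(x0) ~= K / (N V)
true_pdf = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)
print(estimate, true_pdf)                # the two values should be close
```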

Non-parametric density Estimation
• The choice of V affects the quality of the approximation.

• There are two contradictory requirements:

• 1. For the approximation P ≈ p(x) · V to be good, V should be very small.
• 2. On the other hand, if V is very small it may contain zero samples (unless n is very, very large).

• So the choice of V is a compromise between these two requirements.
Non-parametric density Estimation
• Let the total number of samples be n.
• Let kn be the number of samples falling in region Rn.
• Let Vn be the volume of region Rn.
• Then pn(x), the nth estimate of p(x), is:

pn(x) = kn / (n · Vn)

• If n goes to infinity, do we get a good estimate?

• For pn(x) → p(x) as n → ∞, we require:

1. Vn → 0
2. kn → ∞, and
3. kn/n → 0
Non-parametric density Estimation
• In a practical scenario we only have a finite n.
• We choose the size of V based on n.

• We have two ways to approach this problem:

1. Fix the volume V and count k (Parzen window method).
2. Fix k and choose the volume V that incorporates the k nearest neighbors of x (k-nearest-neighbor method).
1. Parzen Window Method
• Assume that the region Rn is a d-dimensional hypercube.
• If hn is the length of an edge of that hypercube, then its volume is given by:

Vn = hn^d
1. Parzen Window Method
• We obtain an expression for kn, the number of samples falling in the window hypercube, by defining the following window function:

ϕ(u) = 1 if |uj| ≤ 1/2 for j = 1, …, d, and 0 otherwise

• where ϕ defines a unit hypercube centered at the origin.

• Note that ϕ is symmetric, i.e. ϕ(u) = ϕ(−u).
1. Parzen Window Method
• We now estimate pn(x), the probability density at a given point x.
• We know that

pn(x) = kn / (n · Vn)

• and

kn = Σi=1..n ϕ((x − xi) / hn)

• thus,

pn(x) = (1/n) Σi=1..n (1/Vn) ϕ((x − xi) / hn)
1. Parzen Window Method
• For pn(x) to be a legitimate density, the window function should be non-negative:

ϕ(u) ≥ 0

• and the window function should itself be a density function:

∫ ϕ(u) du = 1
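A minimal NumPy sketch of the hypercube Parzen estimate above (the 2-D Gaussian data, the edge length h, and the query point are illustrative assumptions):

```python
import numpy as np

def phi(u):
    """Unit hypercube window: 1 if every coordinate |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(x, samples, h):
    """Parzen estimate p_n(x) = (1/n) * sum_i (1/h^d) * phi((x - x_i)/h)."""
    n, d = samples.shape
    V = h ** d                          # volume of the hypercube window
    counts = phi((x - samples) / h)     # 1 for each sample inside the window
    return counts.sum() / (n * V)

# Illustrative use: 2-D samples from a standard normal, query at the origin
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))
print(parzen_hypercube(np.zeros(2), samples, h=0.5))
```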


1. Parzen Window Method
• The hypercube window results in a density estimate with discontinuities, in this case at the boundaries of the cubes.

• We can obtain a smoother density model if we choose a smoother kernel function.
1. Parzen Window Method
• Gaussian kernel (smooth and spherically symmetric, in contrast to the hypercube):

p(x) = (1/N) Σn=1..N [1 / (2πh²)^(D/2)] exp(−‖x − xn‖² / (2h²))

• where h represents the standard deviation of the Gaussian components.
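A minimal sketch of this Gaussian kernel density estimate (the bimodal data set and the bandwidth h = 0.3 are illustrative assumptions):

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """p(x) = (1/N) * sum_n N(x | x_n, h^2 I), with bandwidth h."""
    N, D = samples.shape
    sq_dists = np.sum((x - samples) ** 2, axis=1)    # ||x - x_n||^2
    norm_const = (2 * np.pi * h ** 2) ** (D / 2)     # Gaussian normalizer
    return np.mean(np.exp(-sq_dists / (2 * h ** 2)) / norm_const)

# Illustrative use: bimodal 1-D data, moderate bandwidth
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 100),
                          rng.normal(1, 1.0, 100)]).reshape(-1, 1)
print(gaussian_kde(np.array([0.0]), samples, h=0.3))
```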


1. Parzen Window Method
• A 1-D Gaussian kernel is used in the illustration here.
1. Parzen Window Method
• We see that h acts as a smoothing parameter: if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out.

• The best density model is obtained for some intermediate value of h (middle panel).
2. KNN Method
• A difficulty with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels.

• Instead of fixing V and determining the value of K from the data, we consider a fixed value of K and use the data to find an appropriate value for V.

• Consider a small sphere centred on the point x at which we wish to estimate the density p(x), and allow the radius of the sphere to grow until it contains precisely K data points. The estimate is then p(x) ≈ K / (N · V), with V the volume of the resulting sphere.
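A minimal sketch of this K-nearest-neighbour density estimate in 1-D (the data and K are illustrative; in 1-D the "volume" of the sphere is just the length of the interval):

```python
import numpy as np

def knn_density(x, samples, K):
    """p(x) ~= K / (N * V), where V is the volume of the smallest
    sphere around x containing exactly K samples (1-D case)."""
    N = len(samples)
    radius = np.sort(np.abs(samples - x))[K - 1]  # distance to Kth nearest neighbour
    V = 2 * radius                                # length of the interval in 1-D
    return K / (N * V)

# Illustrative use on bimodal data
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 100)])
print(knn_density(0.0, samples, K=10))
```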
2. KNN Method
• We see that the parameter K governs the degree of smoothing: a small value of K leads to a very noisy density model (top panel), whereas a large value (bottom panel) smoothes out the bimodal nature of the true distribution (shown by the green curve) from which the data set was generated.
2. KNN Method
• The K-nearest-neighbour technique for density estimation can be extended to the problem of classification.
• We apply the K-nearest-neighbour density estimation technique to each class separately and then make use of Bayes' theorem.
2. KNN Method
• Let us suppose that we have a data set comprising Nk points in class Ck, with N points in total.

• If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points, irrespective of their class.
2. KNN Method
• Suppose this sphere has volume V and contains Kk points from class Ck.

• The estimate of the density associated with each class is:

p(x|Ck) = Kk / (Nk · V)

• Similarly, the unconditional density is given by:

p(x) = K / (N · V)

• The class priors are given by:

p(Ck) = Nk / N
2. KNN Method
• Combining these with Bayes' theorem, we obtain the posterior probability of class membership:

p(Ck|x) = p(x|Ck) p(Ck) / p(x) = Kk / K

• To minimise the probability of misclassification, we assign x to the class with the largest posterior, i.e. the class having the most representatives among the K nearest neighbours.
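A minimal sketch of the resulting classifier, which assigns x to the class with the largest Kk/K (the two-class Gaussian data and K = 5 are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, samples, labels, K):
    """Posterior p(C_k|x) = K_k / K over the K nearest neighbours of x."""
    dists = np.linalg.norm(samples - x, axis=1)   # distances to all points
    nearest = labels[np.argsort(dists)[:K]]       # labels of K nearest points
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)], counts / K  # predicted class, posteriors

# Illustrative use: two Gaussian classes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([0.5, 0.5]), X, y, K=5))
```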
