
AA1

Chapter 4. Non-parametric classifiers


Non-parametric estimation of p(x|y)

If we cannot assume a model for p(x|wi) we can resort to non-parametric estimators, like the histogram. A precise definition follows:

• The probability that x is in region R is:

$$P = \int_R p(\mathbf{x}')\,d\mathbf{x}'$$

• If we have n independent observations D = {x1, …, xn}, the probability of having k among the n vectors in the region is:

$$P_k = \binom{n}{k} P^k (1-P)^{n-k}, \qquad E[k] = nP$$

2

The ML estimator of P is: $\hat{P} = k/n$

If p(x) is continuous and R is small enough that p(x) does not change inside it:

$$\int_R p(\mathbf{x}')\,d\mathbf{x}' \simeq p(\mathbf{x})\, V_R$$

Combining both expressions we get the histogram:

$$p(\mathbf{x}) \simeq \frac{\int_R p(\mathbf{x}')\,d\mathbf{x}'}{\int_R d\mathbf{x}'} = \frac{k/n}{V_R} = \hat{p}(\mathbf{x})$$

3
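As a concrete illustration (not part of the slides), here is a minimal sketch of the estimator $\hat{p}(\mathbf{x}) = (k/n)/V_R$, assuming a hyper-cubic region R of half-width h around x; the value h = 0.5 and the Gaussian test data are arbitrary choices.

```python
import numpy as np

def histogram_estimate(x, data, h=0.5):
    """Estimate p(x) as (k/n) / V_R, with R a hyper-cube of side 2h around x."""
    data = np.atleast_2d(data)                       # shape (n, d)
    n, d = data.shape
    inside = np.all(np.abs(data - x) <= h, axis=1)   # which samples fall in R
    k = inside.sum()
    V_R = (2.0 * h) ** d                             # volume of the hyper-cube
    return (k / n) / V_R

# Example: 1-D standard normal samples, density estimated at x = 0
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 1))
print(histogram_estimate(np.array([0.0]), samples))  # close to 1/sqrt(2*pi) ~ 0.40
```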
Convergence conditions

The histogram averages values over a region, and hence it is a distorted version of p(x).

To reduce this effect we are interested in having VR → 0, which implies k → 0 if the number of samples is finite.

How can we guarantee the convergence of

$$\hat{p}(\mathbf{x}) = \frac{k_n/n}{V_{R,n}}$$

when n → ∞? How shall we design VR,n?

4
Having $\lim_{n\to\infty} \hat{p}(\mathbf{x}) = p(\mathbf{x})$ implies:

$$\lim_{n\to\infty} V_{R,n} = 0, \qquad \lim_{n\to\infty} k_n = \infty, \qquad \lim_{n\to\infty} \frac{k_n}{n} = 0$$

We can guarantee the three conditions in two ways:

1. Parzen windows: taking VR,n as a function of n.
2. k-nearest neighbors: adapting VR,n in every region of the domain of x in such a way that k grows at a smaller rate than n.

5
Parzen windows

Assume that region R is defined by a window function φ(x) enclosing a hyper-volume VR,n around x. A measure of the number of vectors inside this region is:

$$k_n(\mathbf{x}) = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$$

The estimation of the pdf is given by:

$$\hat{p}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x}-\mathbf{x}_i)$$

6
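A minimal sketch of the Parzen estimator above, assuming the unit hyper-cube window φ(u) = 1 if all |uj| ≤ 1/2 and 0 otherwise (so that VR,n = hn^d); the bandwidth values mirror Example 1, but the test data and evaluation point are arbitrary.

```python
import numpy as np

def parzen_estimate(x, data, h_n=0.5):
    """Parzen estimate: (1/n) * sum_i (1/h_n^d) * phi((x - x_i)/h_n),
    with phi the unit hyper-cube window."""
    data = np.atleast_2d(data)
    n, d = data.shape
    u = (x - data) / h_n
    phi = np.all(np.abs(u) <= 0.5, axis=1).astype(float)  # 1 inside the cube, 0 outside
    return phi.sum() / (n * h_n ** d)

rng = np.random.default_rng(0)
samples = rng.normal(size=(2000, 1))
for h in (1.0, 0.5, 0.2):          # compare the effect of the window width
    print(h, parzen_estimate(np.array([0.0]), samples, h_n=h))
```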
The conditions on φ(x) such that p̂(x) is a pdf are:

$$\varphi(\mathbf{x}) \ge 0, \qquad \int \varphi(\mathbf{x})\,d\mathbf{x} = 1, \qquad V_n = h_n^d$$

Example 1: the window φ(x) for h = 1, 0.5 and 0.2. A large h will give a very much averaged estimation of the pdf; a small h will give a noisy estimation of the pdf.

7
Histogram obtained with n = 5 vectors, for three different values of h: h = 1, 0.5 and 0.2 (plots of fn(x)).

Usually radial basis functions are used for φ(x). These are defined as functions for which:

$$\|\mathbf{x}_1\| = \|\mathbf{x}_2\| \;\Rightarrow\; \varphi(\mathbf{x}_1) = \varphi(\mathbf{x}_2)$$

8
Mean and variance of the histogram

Mean:

$$E[\hat{p}(\mathbf{x})] = \frac{1}{n}\sum_{i=1}^{n} E\!\left[\frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] = \int \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} = \int \delta_n(\mathbf{x}-\mathbf{v})\, p(\mathbf{v})\,d\mathbf{v}$$

It is a convolution of the window with the true pdf.

Interpretation: if VR,n → 0, then δn(x) → δ(x) (a Dirac delta function) and the estimator is unbiased, but the number of vectors in each region tends to zero and the estimation will not be good, so we have to evaluate the variance.

9
Variance:

$$\sigma_n^2(\mathbf{x}) = \sum_{i=1}^{n} E\!\left[\left(\frac{1}{n V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) - \frac{1}{n} E[\hat{p}(\mathbf{x})]\right)^{\!2}\right]
= n\, E\!\left[\frac{1}{n^2 V_{R,n}^2}\,\varphi^2\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] - \frac{1}{n} E[\hat{p}(\mathbf{x})]^2
= \frac{1}{n V_{R,n}} \int \frac{1}{V_{R,n}}\,\varphi^2\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} - \frac{1}{n} E[\hat{p}(\mathbf{x})]^2$$

Upper bound:

$$\sigma_n^2(\mathbf{x}) \le \frac{\sup \varphi(\cdot)\; E[\hat{p}(\mathbf{x})]}{n V_{R,n}}$$

Interpretation: given n, a low variance implies a large VR,n, hence a large bias.

10
Variance can be small if VR,n is large.

Example 2:
Gaussian window

11
Example 3: Decision boundaries for a two-class problem using Parzen Gaussian windows for a small (left) and a large (right) value of h (a hyperparameter to be validated). The regions correspond to Pr(w1|x) > Pr(w2|x) and Pr(w1|x) < Pr(w2|x).

The optimum size of the window possibly depends on the region under analysis: it should be larger where the density of data is small…

12
Neural probabilistic classifier

A classifier based on the estimation of the pdf using Parzen windows can be built using a neural network-like structure that estimates the pdf and implements a Bayesian rule.

We have n training vectors of dimension d: xi.

(Figure: network structure with the input vectors at the bottom, one block per training vector, class units w1, w2, …, wc, and a final max decision.)
13
Training phase

1. Training vectors are normalized.
2. The pattern block i weights its inputs with the weights wi = xi.
3. An edge is set between the block associated to the vector xi and its associated class.

Links can be weighted with a priori probabilities, if known.

14
Prediction phase

1. The vector x to classify is normalized.
2. The pattern blocks compute an activity factor:

$$\varphi\!\left(\frac{\mathbf{x}-\mathbf{w}_i}{h_n}\right) = \exp\!\left(-\frac{(\mathbf{x}-\mathbf{w}_i)^T(\mathbf{x}-\mathbf{w}_i)}{2\sigma^2}\right) = \exp\!\left(-\frac{\mathbf{x}^T\mathbf{x} + \mathbf{w}_i^T\mathbf{w}_i - 2\,\mathbf{x}^T\mathbf{w}_i}{2\sigma^2}\right) = \exp\!\left(\frac{\mathbf{x}^T\mathbf{w}_i - 1}{\sigma^2}\right), \quad \text{since } \mathbf{x}^T\mathbf{x} = \mathbf{w}_i^T\mathbf{w}_i = 1.$$

3. Each block sums its inputs and generates a likelihood function ∝ p(x|wj).
4. The decision on the class is taken by maximizing gj(x).

15
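A sketch of the training and prediction phases under the assumptions above (normalized vectors, activity exp((xᵀwi − 1)/σ²)); the function and variable names, the σ value and the random data are illustrative, not from the slides.

```python
import numpy as np

def pnn_predict(x, W, labels, n_classes, sigma=0.3):
    """Probabilistic neural-network style prediction.
    W holds the normalized training vectors (one per pattern block, w_i = x_i)."""
    x = x / np.linalg.norm(x)                      # 1. normalize the input vector
    activity = np.exp((W @ x - 1.0) / sigma**2)    # 2. activity of each pattern block
    g = np.array([activity[labels == c].sum()      # 3. per-class sums ~ p(x|w_j)
                  for c in range(n_classes)])
    return int(np.argmax(g))                       # 4. decide by maximizing g_j(x)

# Tiny illustration with random training vectors and labels
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
W = X / np.linalg.norm(X, axis=1, keepdims=True)   # training phase: w_i = normalized x_i
y = rng.integers(0, 2, size=20)
print(pnn_predict(rng.normal(size=3), W, y, n_classes=2))
```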
K-nearest neighbour estimation

Instead of looking for the best window (in shape and size), the volume of the cell is increased or decreased as a function of the training data:

To estimate p(x) we enlarge the volume VR(x) around x until kn vectors are included: the kn nearest neighbors.

$$\hat{p}(\mathbf{x}) = \frac{k_n/n}{V_R(\mathbf{x})}$$

Convergence is guaranteed if $\lim_{n\to\infty} \frac{k_n}{n} = 0$.

Example 4: $k_n = \sqrt{n}$ and $k_n = \ln(n)$.

16
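A sketch of the k-NN density estimate, assuming VR(x) is taken as the volume of the smallest hyper-sphere around x that contains the kn nearest neighbours (one possible choice of region, not fixed by the slides).

```python
import math
import numpy as np

def knn_density(x, data, k):
    """k-NN estimate (k/n) / V_R(x), with V_R(x) the volume of the d-ball
    whose radius is the distance to the k-th nearest neighbour."""
    data = np.atleast_2d(data)
    n, d = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[k - 1]                                     # radius reaching the k-th neighbour
    V_R = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d
    return (k / n) / V_R

rng = np.random.default_rng(0)
samples = rng.normal(size=(2000, 2))
k_n = int(round(math.sqrt(len(samples))))                # e.g. k_n = sqrt(n), as in Example 4
print(knn_density(np.zeros(2), samples, k_n))            # ~ 1/(2*pi) ~ 0.159 at the mode
```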
Example 5:
k-nearest-neighbors
estimations. Compare
them with those in
slide 11

17
A posteriori probabilities Pr(wi|x) can be computed as:

$$\Pr(w_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid w_i)\Pr(w_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid w_j)\Pr(w_j)} = \frac{p(\mathbf{x} \mid w_i)\Pr(w_i)}{\sum_{j=1}^{c} p(\mathbf{x}, w_j)} = \frac{\dfrac{k_i}{n V_R(\mathbf{x})}}{\dfrac{k}{n V_R(\mathbf{x})}} = \frac{k_i}{k}$$

where

$$p(\mathbf{x} \mid w_i)\Pr(w_i) = \frac{k_i/n_i}{V_R(\mathbf{x})}\cdot\frac{n_i}{n} = \frac{k_i}{n V_R(\mathbf{x})}$$

This is the fraction of vectors belonging to class wi among the k neighbours of x.

18
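A minimal sketch of Pr(wi|x) ≈ ki/k: the estimated posterior is simply the fraction of each class among the k nearest neighbours (variable names and test data are illustrative).

```python
import numpy as np

def knn_posteriors(x, X_train, y_train, k, n_classes):
    """Return Pr(w_i|x) ~ k_i / k: class fractions among the k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbours = y_train[np.argsort(dists)[:k]]
    counts = np.bincount(neighbours, minlength=n_classes)
    return counts / k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_posteriors(np.array([0.8, 0.8]), X, y, k=8, n_classes=2))
```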
K-nearest neighbour classifier

Classification rule. Using the previous equation we can classify a vector x with the following rule (which is optimum if n is large):

Select the class most represented in a region around x.

Example 6: with k = 8, the selected class for vector x is ‘+’. The size of the region is not constant over the whole domain. (Figure: scatter plot in the x1–x2 plane.)
19
• Parzen: select the class with the largest sum of window weights φ((x − xi)/hn) in the window centered at xi.
• K-nearest-neighbors: find the window containing the k neighbours around x. The most represented class is the chosen one.

We can obtain reasonable performance if the class is chosen as the one associated to the nearest training vector (“1-nearest”) (Cover & Hart, 1965):

$$P_{MAP}(e) \le P_{1NN}(e) \le P_{MAP}(e)\left(2 - \frac{c}{c-1}\,P_{MAP}(e)\right) \le 2\,P_{MAP}(e)$$

when N → ∞. For c = 2:

$$P_{MAP}(e) \le P_{1NN}(e) \le 2\,P_{MAP}(e)\bigl(1 - P_{MAP}(e)\bigr)$$

(Figure: decision regions for the 1-nearest rule.)

20
Distances

Properties:

Non-negativity: $D(\mathbf{x}, \mathbf{z}) \ge 0$
Reflexivity: $D(\mathbf{x}, \mathbf{z}) = 0$ iff $\mathbf{x} = \mathbf{z}$
Symmetry: $D(\mathbf{x}, \mathbf{z}) = D(\mathbf{z}, \mathbf{x})$
Triangle inequality: $D(\mathbf{x}, \mathbf{t}) \le D(\mathbf{x}, \mathbf{z}) + D(\mathbf{z}, \mathbf{t})$

The best distance depends on the problem and has to be validated.

- Minkowski distance: $D(\mathbf{x}, \mathbf{z}) = \|\mathbf{x} - \mathbf{z}\|_p = \left(\sum_{i=1}^{d} |x_i - z_i|^p\right)^{1/p}$, with p = 1 (Manhattan) and p = 2 (Euclidean).
- Mahalanobis distance: $D(\mathbf{x}, \mathbf{z}) = \sqrt{(\mathbf{x} - \mathbf{z})^T C^{-1} (\mathbf{x} - \mathbf{z})}$
- Hamming distance: fraction of positions in which x and z differ.


21
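Sketches of the distances listed above, assuming the covariance matrix C of the Mahalanobis distance is given (it would normally be estimated from the training data).

```python
import numpy as np

def minkowski(x, z, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

def mahalanobis(x, z, C):
    """Mahalanobis distance with covariance matrix C."""
    diff = x - z
    return float(np.sqrt(diff @ np.linalg.inv(C) @ diff))

def hamming(x, z):
    """Fraction of positions in which x and z differ."""
    return np.mean(x != z)

x, z = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
print(minkowski(x, z, p=1), minkowski(x, z, p=2))
print(mahalanobis(x, z, C=np.eye(3)))            # equals the Euclidean distance here
print(hamming(np.array([1, 0, 1]), np.array([1, 1, 1])))
```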
Similarities

Properties:

Bounds: $-1 \le s(\mathbf{x}, \mathbf{z}) \le 1$ or $0 \le s(\mathbf{x}, \mathbf{z}) \le 1$
Maximum: $s(\mathbf{x}, \mathbf{z}) = 1$ iff $\mathbf{x} = \mathbf{z}$
Symmetry: $s(\mathbf{x}, \mathbf{z}) = s(\mathbf{z}, \mathbf{x})$

- Cosine similarity: $s(\mathbf{x}, \mathbf{z}) = \dfrac{\mathbf{x}^T \mathbf{z}}{\|\mathbf{x}\|_2\, \|\mathbf{z}\|_2}$
- Pearson correlation: $s(\mathbf{x}, \mathbf{z}) = \dfrac{(\mathbf{x} - \boldsymbol{\mu})^T (\mathbf{z} - \boldsymbol{\mu})}{\|\mathbf{x} - \boldsymbol{\mu}\|_2\, \|\mathbf{z} - \boldsymbol{\mu}\|_2}$, where μ is the mean of all training samples.
- Jaccard coefficient (for binary vectors): $s(\mathbf{x}, \mathbf{z}) = \dfrac{\left|\{\, j : x_j = z_j = 1 \,\}\right|}{\left|\{\, j : x_j = 1 \text{ or } z_j = 1 \,\}\right|}$

22
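Corresponding sketches of the similarities; μ is taken to be the mean of the training samples, and the Jaccard version assumes binary vectors.

```python
import numpy as np

def cosine(x, z):
    return float(x @ z / (np.linalg.norm(x) * np.linalg.norm(z)))

def pearson(x, z, mu):
    """Cosine similarity of the mean-centred vectors (mu = mean of the training samples)."""
    return cosine(x - mu, z - mu)

def jaccard(x, z):
    """For binary vectors: |{j: x_j = z_j = 1}| / |{j: x_j = 1 or z_j = 1}|."""
    both = np.sum((x == 1) & (z == 1))
    either = np.sum((x == 1) | (z == 1))
    return both / either

x, z = np.array([1.0, 0.0, 2.0]), np.array([2.0, 1.0, 2.0])
print(cosine(x, z))
print(pearson(x, z, mu=np.array([1.0, 0.5, 1.5])))
print(jaccard(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))
```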
Choosing k

Nearest neighbor is very sensitive to the value of k:
- If k is low, the classifier overfits.
- If k is large, the classifier underfits.

It is a hyperparameter that has to be selected using cross-validation. Boundaries get smoother as k increases.

Example 7. The iris dataset, where c = 3 and d = 2. Sepal width and sepal length are the features.
23
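A sketch of selecting k by cross-validation on the iris data of Example 7, using scikit-learn; restricting to the two sepal features and using a 5-fold split are choices made here, not prescribed by the slide.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                              # keep only sepal length and sepal width (d = 2)

scores = {}
for k in range(1, 31):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))   # k with the best cross-validated accuracy
```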
Choosing k

Decision boundaries as a function of k, ranging from overfitting (small k) to underfitting (large k).

24
Importance of standardization

Top: both axes are scaled properly. Correct decision boundaries are found for k=5.
Bottom: the scale of the vertical axis has been increased 100 times.
25
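A sketch of the effect of standardization, assuming scikit-learn is available: the scale of one feature is inflated, as in the bottom plot, and a StandardScaler pipeline undoes the damage before k-NN.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_bad = X.copy()
X_bad[:, 1] *= 100.0                      # inflate the scale of one feature

knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X_bad, y, cv=5).mean())          # accuracy typically drops
print(cross_val_score(make_pipeline(StandardScaler(), knn), # standardization restores
                      X_bad, y, cv=5).mean())               # comparable feature scales
```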
Combining labels to make predictions

Options for classification:
• Majority vote (ties are broken randomly)
• Distance-weighted vote: choose weights that are higher for closer points, e.g. the inverse of the distance or the inverse of the squared distance (see the sketch below)

Options for regression:
• Use the average of the target values of the k nearest neighbors
• Use the weighted average of the targets of the k nearest neighbors, weighting e.g. by the inverse of the distance or the inverse of the squared distance

Using weights for prediction, we can relax the constraint of using only k samples and use the whole training set instead.

26
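A sketch of the distance-weighted options above with 1/d² weights; the same weights serve both the weighted vote (classification) and the weighted average (regression). Function names and test data are illustrative.

```python
import numpy as np

def weighted_knn_predict(x, X_train, targets, k, classify=True, eps=1e-12):
    """Distance-weighted k-NN prediction with weights 1/d^2."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]
    w = 1.0 / (dists[idx] ** 2 + eps)               # closer neighbours weigh more
    if classify:
        votes = np.bincount(targets[idx], weights=w)
        return int(np.argmax(votes))                # distance-weighted vote
    return float(np.sum(w * targets[idx]) / np.sum(w))   # weighted average (regression)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
y_reg = X[:, 0] + X[:, 1]
print(weighted_knn_predict(np.array([0.5, 0.5]), X, y_cls, k=7))
print(weighted_knn_predict(np.array([0.5, 0.5]), X, y_reg, k=7, classify=False))
```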
Comparison in toy datasets

27
Example 8: Estimation of the pdf to predict crime in San Francisco

• Problem definition
Predict the type of a crime in San Francisco, among 38 types, from a labeled database.

• Database
800,000 crimes recorded between 2003 and 2015. Each crime is characterized by:

28
29
• Evaluation: multiclass logarithmic loss (lower is better)

$$\text{logloss} = -\frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \sum_{j=1}^{c} y_{ij} \log \Pr(j \mid \mathbf{x}_i), \qquad y_{ij} \in \{0, 1\}$$

Validation is done using a fraction of the database.

30
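A sketch of the multiclass log loss above; clipping the probabilities to avoid log(0) is a practical safeguard not stated on the slide, and the test values are made up.

```python
import numpy as np

def multiclass_logloss(y_true, probs, eps=1e-15):
    """logloss = -(1/N) * sum_i sum_j y_ij * log Pr(j|x_i); lower is better.
    y_true holds class indices; probs has shape (N, c)."""
    probs = np.clip(probs, eps, 1.0)
    n = len(y_true)
    return -np.mean(np.log(probs[np.arange(n), y_true]))

y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
print(multiclass_logloss(y_true, probs))
```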
• Estimation of p(x | w ) for 4 classes…

31
32
Conclusions
• These are the only suitable methods when d is large compared to the number of training vectors.
• Non-parametric prediction can be slow, especially with a large training set. There are ways to mitigate this (use prototypes, hashing).
• Prone to overfitting; it is better to remove “noisy” samples from the training set (i.e. samples whose nearest neighbours belong to a different class) and to choose an appropriate k.
• Suffers from the curse of dimensionality: as d increases, everything seems to be close.
• Suffers from the presence of irrelevant features.
• Standardization of features is crucial to avoid giving excessive relevance to features with large absolute values.

33
Organization of a machine learning competition

(Diagram: database #1, with feature vectors and labels (xi, yi), i = 1, …, N, is used to train the classifier.)

34
Organization of a machine learning competition

(Diagram: database #1, with feature vectors and labels (xi, yi), i = 1, …, N, is used to train the classifier. Database #2 contains only feature vectors xi, i = N+1, …, M; the classifier produces Pr(wk|xi), k = 1, …, c, which are submitted. The organizer evaluates the score against the held-out labels yi, i = N+1, …, M, and publishes a public ranking. Participants can retrain and resubmit as many times as desired.)

35
Organization of a machine learning competition

(Diagram, two phases. Phase 1: database #1, with feature vectors and labels (xi, yi), i = 1, …, N, trains the classifier; database #2, with feature vectors xi, i = N+1, …, M, is classified to produce Pr(wk|xi), k = 1, …, c; each submission is scored by the organizer against the labels yi, i = N+1, …, M, and a public ranking is published; participants can retrain and resubmit as many times as desired. Phase 2: database #3, with feature vectors xi, i = M+1, …, P, is classified with a single submission; the organizer evaluates the score against the labels yi, i = M+1, …, P, and produces the final ranking.)

The organisation in two phases prevents participants from using the ranking database to validate their classifier.

36
… example of a hosted competition

37
