Chapter 4. Non-parametric classifiers
Non-parametric estimation of p(x|y)
If we cannot assume a model for p(x|wi) we can resort to non-parametric
estimators, like the histogram. A precise definition follows:
• The probability that x falls in region R is:
  $P = \int_R p(\mathbf{x}')\,d\mathbf{x}'$
• If we have n independent observations D = {x1, …, xn}, the probability of having k among the n vectors inside the region is:
  $P_k = \binom{n}{k}\, P^k (1-P)^{n-k}, \qquad E[k] = nP$
2
The ML estimator of P is: $\hat{P} = k/n$
If p(x) is continuous and R is small enough that p(x) does not change inside it:
  $\int_R p(\mathbf{x}')\,d\mathbf{x}' \simeq p(\mathbf{x})\, V_R$
Combining both expressions we get the histogram:
  $\hat{p}(\mathbf{x}) = \frac{\int_R p(\mathbf{x}')\,d\mathbf{x}'}{\int_R d\mathbf{x}'} \simeq \frac{k/n}{V_R}$
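As an illustration, here is a minimal Python sketch of this region-based estimator, assuming a hypercube region of side h around x (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def histogram_density(x, data, h):
    """Estimate p(x) as (k/n) / V_R, with R a hypercube of side h centred at x."""
    n, d = data.shape
    inside = np.all(np.abs(data - x) <= h / 2, axis=1)  # which training vectors fall inside R
    k = np.sum(inside)
    V_R = h ** d                                        # volume of the region
    return (k / n) / V_R

# Toy usage: 1-D standard Gaussian samples
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
print(histogram_density(np.array([0.0]), data, h=0.5))  # should be close to 1/sqrt(2*pi) ~ 0.40
```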
3
Convergence conditions
The histogram is averaging values in a region, and hence it is a
distorted version of p(x).
To reduce this effect we are interested in having $V_R \to 0$, which implies $k \to 0$ if the number of samples is finite.
How can we guarantee the convergence of
  $\hat{p}(\mathbf{x}) = \frac{k_n/n}{V_{R,n}}$
when $n \to \infty$? How shall we design $V_{R,n}$?
4
Having $\lim_{n\to\infty} \hat{p}(\mathbf{x}) = p(\mathbf{x})$ implies:
  $\lim_{n\to\infty} V_{R,n} = 0$
  $\lim_{n\to\infty} k_n = \infty$
  $\lim_{n\to\infty} \frac{k_n}{n} = 0$
We can guarantee the three conditions in two ways:
1. Parzen windows: taking VR,n as a function of n.
2. k-nearest neighbors: adapting VR,n in every region of the domain
of x in such a way that k grows at a smaller rate than n.
5
Parzen windows
Assume that region R is defined by a window function $\varphi(\mathbf{x})$ enclosing a hyper-volume $V_{R,n}$ around x. A measure of the number of vectors inside this region is:
  $k_n(\mathbf{x}) = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$
The estimation of the pdf is given by:
  $\hat{p}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x}-\mathbf{x}_i)$
where $\delta_n(\mathbf{x}) = \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right)$.
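A minimal numpy sketch of the Parzen estimate, assuming a Gaussian window (names are illustrative):

```python
import numpy as np

def parzen_density(x, data, h):
    """Parzen estimate with a Gaussian window: (1/n) * sum_i (1/h^d) K((x - xi)/h)."""
    n, d = data.shape
    u = (x - data) / h                                   # (n, d) scaled differences
    norm = (2 * np.pi) ** (d / 2) * h ** d               # Gaussian normalisation times h^d
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / norm   # one window per training vector
    return phi.mean()
```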
6
The conditions on $\varphi(\mathbf{x})$ such that $\hat{p}(\mathbf{x})$ is a pdf are:
  $\varphi(\mathbf{x}) \ge 0$
  $\int \varphi(\mathbf{x})\,d\mathbf{x} = 1$
  $V_n = h_n^d$
Example 1: the window function $\varphi(x)$ for h = 1, h = 0.5 and h = 0.2. A large h will give a very much averaged estimation of the pdf; a small h will give a noisy estimation of the pdf.
7
Histogram $f_n(x)$ obtained with n = 5 vectors, for three different values of h: h = 1, h = 0.5 and h = 0.2.
Usually radial basis functions are used for $\varphi(\mathbf{x})$. These are defined as functions for which:
  $\|\mathbf{x}_1\| = \|\mathbf{x}_2\| \;\Rightarrow\; \varphi(\mathbf{x}_1) = \varphi(\mathbf{x}_2)$
8
Mean and variance of the histogram
Mean
  $E[\hat{p}(\mathbf{x})] = \frac{1}{n}\sum_{i=1}^{n} E\!\left[\frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] = \int \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} = \int \delta_n(\mathbf{x}-\mathbf{v})\, p(\mathbf{v})\,d\mathbf{v}$
It is a convolution of the
window with the true pdf
Interpretation
If $V_{R,n} \to 0$, then $\delta_n(\mathbf{x}) \to \delta(\mathbf{x})$ (a Dirac delta function) and the estimator is unbiased, but the number of vectors in each region tends to zero and the estimation will not be good ⟹ we have to evaluate the variance.
9
Variance
  $\sigma_n^2(\mathbf{x}) = \sum_{i=1}^{n} E\!\left[\left(\frac{1}{nV_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) - \frac{1}{n}E[\hat{p}(\mathbf{x})]\right)^{\!2}\right]$
  $= n\,E\!\left[\frac{1}{n^2 V_{R,n}^2}\,\varphi^2\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] - \frac{1}{n}\left(E[\hat{p}(\mathbf{x})]\right)^2$
  $= \frac{1}{nV_{R,n}}\int \frac{1}{V_{R,n}}\,\varphi^2\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} - \frac{1}{n}\left(E[\hat{p}(\mathbf{x})]\right)^2$
Upper bound:
  $\sigma_n^2(\mathbf{x}) \le \frac{\sup\varphi(\cdot)\; E[\hat{p}(\mathbf{x})]}{nV_{R,n}}$
Interpretation
Given n, a low variance implies a large $V_{R,n}$ ⟹ large bias.
10
Variance can be small if VR,n is large.
Example 2: estimation using a Gaussian window.
11
Example 3: Decision boundaries for a two-class problem using Parzen Gaussian windows, for a small (left) and a large (right) value of h (a hyperparameter to be validated). The figure marks the regions where Pr(w1|x) > Pr(w2|x) and Pr(w1|x) < Pr(w2|x).
The optimum size of the window possibly depends on the region under analysis: it should be larger where the density of data is small…
12
Neural probabilistic classifier
A classifier based on the estimation of the pdf using Parzen windows can be built using a neural network-like structure that estimates the pdf and implements a Bayesian rule.
We have n training vectors of dimension d: xi.
(Figure: input vectors → blocks associated to the training vectors → class blocks w1, w2, …, wc → max → decision.)
13
Training phase
1. Training vectors are normalized.
2. The pattern block i weights its inputs with the weights wi = xi.
3. An edge is set between the block associated to the vector xi and its associated class.
(Figure: input vectors → pattern blocks (training vectors) → classes w1, w2, …, wc.)
14
Links can be weighted with a
priori probabilities if known
Prediction phase
1. The vector x to classify is normalized.
2. The pattern blocks compute an activity factor:
   $\varphi\!\left(\frac{\mathbf{x}-\mathbf{w}_i}{h_n}\right) = \exp\!\left(-\frac{(\mathbf{x}-\mathbf{w}_i)^T(\mathbf{x}-\mathbf{w}_i)}{2\sigma^2}\right) = \exp\!\left(-\frac{\mathbf{x}^T\mathbf{x} + \mathbf{w}_i^T\mathbf{w}_i - 2\,\mathbf{x}^T\mathbf{w}_i}{2\sigma^2}\right)$
   Since $\mathbf{x}^T\mathbf{x} = \mathbf{w}_i^T\mathbf{w}_i = 1$, this reduces to $\exp\!\left(\frac{\mathbf{x}^T\mathbf{w}_i - 1}{\sigma^2}\right)$.
3. Each block sums its inputs and generates a likelihood function p(x|wj).
4. The decision on the class is taken by maximizing gj(x).
(Figure: input vectors → pattern blocks (training vectors) → classes w1, w2, …, wc → max → decision.)
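A compact numpy sketch of this prediction rule, assuming the training vectors are stored unit-normalized as rows of W with integer class labels y (all names are illustrative):

```python
import numpy as np

def pnn_predict(x, W, y, sigma, priors=None):
    """Probabilistic-neural-network style prediction with a Gaussian Parzen window.

    x: test vector (d,), W: unit-normalized training vectors (n, d),
    y: integer class labels (n,), sigma: window width,
    priors: optional class priors aligned with np.unique(y).
    """
    x = x / np.linalg.norm(x)                      # step 1: normalize the input vector
    activity = np.exp((W @ x - 1.0) / sigma**2)    # step 2: activity of each pattern block
    classes = np.unique(y)
    scores = np.array([activity[y == c].sum() for c in classes])  # step 3: sum per class
    if priors is not None:                         # optional a priori weighting of the links
        scores = scores * priors
    return classes[np.argmax(scores)]              # step 4: maximize g_j(x)
```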
15
K-nearest neighbour estimation
Instead of looking for the best window (in shape and size), the volume of the cell is increased or decreased as a function of the training data: to estimate p(x) we enlarge the volume $V_R(\mathbf{x})$ around x until $k_n$ vectors are included: the $k_n$ nearest neighbors.
  $\hat{p}(\mathbf{x}) = \frac{k_n/n}{V_R(\mathbf{x})}$
Convergence is guaranteed if $\lim_{n\to\infty} \frac{k_n}{n} = 0$ (with $k_n \to \infty$, as in the conditions above).
Example 4: estimates obtained with $k_n = \sqrt{n}$ and $k_n = \ln(n)$.
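A minimal sketch of the k-NN density estimate, assuming a Euclidean ball as the region around x (names are illustrative):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """k-NN density estimate: p(x) ~ (k/n) / V_R(x), with V_R the ball reaching the k-th neighbour."""
    n, d = data.shape
    dist = np.linalg.norm(data - x, axis=1)
    r = np.sort(dist)[k - 1]                          # radius containing exactly k training vectors
    V_R = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d   # volume of a d-dimensional ball of radius r
    return (k / n) / V_R
```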
16
Example 5:
k-nearest-neighbors
estimations. Compare
them with those in
slide 11
17
A posteriori probabilities Pr(wi|x) can be computed as follows. If $n_i$ is the number of training vectors of class $w_i$ (with $n = \sum_i n_i$) and $k_i$ of them fall among the k nearest neighbors of x, then
  $p(\mathbf{x}|w_i)\Pr(w_i) \simeq \frac{k_i/n_i}{V_R(\mathbf{x})}\cdot\frac{n_i}{n} = \frac{k_i}{n\,V_R(\mathbf{x})}$
and therefore
  $\Pr(w_i|\mathbf{x}) = \frac{p(\mathbf{x}|w_i)\Pr(w_i)}{\sum_{j=1}^{c} p(\mathbf{x}|w_j)\Pr(w_j)} = \frac{p(\mathbf{x}|w_i)\Pr(w_i)}{\sum_{j=1}^{c} p(\mathbf{x},w_j)} = \frac{k_i/(n\,V_R(\mathbf{x}))}{k/(n\,V_R(\mathbf{x}))} = \frac{k_i}{k}$
i.e. the fraction of vectors belonging to class wi among the k neighbours of x.
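In code, the posterior reduces to counting labels among the k nearest neighbours; a sketch with illustrative names:

```python
import numpy as np

def knn_posteriors(x, data, labels, k, n_classes):
    """Pr(wi|x) estimated as ki / k, the label frequencies among the k nearest neighbours."""
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:k]               # indices of the k nearest training vectors
    counts = np.bincount(labels[nearest], minlength=n_classes)
    return counts / k                            # ki / k for each class

# Decision rule of the next slide: assign x to np.argmax(knn_posteriors(x, data, labels, k, c))
```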
18
K-nearest neighbour classifier
Classification rule. Using the previous equation we can classify a
vector x with the following rule (which is optimum if n is large):
Select the class most represented in a region around x.
Example 6: k = 8. The selected class for vector x is ‘+’. The size of the region is not constant over the whole domain.
(Figure: scatter plot in the (x1, x2) plane showing the region around x containing its k = 8 nearest neighbours.)
19
• Parzen: select the class whose training vectors xi give the largest weighted sum of window values $\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$.
• K-nearest-neighbors: find the window containing the k neighbours around x. The most represented class is the chosen one.
We can obtain reasonable performance if the class is chosen as the one associated to the nearest training vector (“1-nearest”) (Cover & Hart, 1967). When N → ∞:
  $P_{MAP}(e) \le P_{1NN}(e) \le P_{MAP}(e)\left(2 - \frac{c}{c-1}\,P_{MAP}(e)\right)$
For c = 2:
  $P_{MAP}(e) \le P_{1NN}(e) \le 2\,P_{MAP}(e)\left(1 - P_{MAP}(e)\right)$
(Figure: decision regions for the 1-nearest-neighbour rule.)
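As a quick worked illustration (not from the slides): for c = 2 and a Bayes error of $P_{MAP}(e) = 0.1$, the bound gives
  $P_{1NN}(e) \le 2 \cdot 0.1 \cdot (1 - 0.1) = 0.18$
so the asymptotic 1-NN error is never more than twice the Bayes error.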
20
Distances
Properties:
  Non-negativity: $D(\mathbf{x},\mathbf{z}) \ge 0$
  Reflexivity: $D(\mathbf{x},\mathbf{z}) = 0$ iff $\mathbf{x} = \mathbf{z}$
  Symmetry: $D(\mathbf{x},\mathbf{z}) = D(\mathbf{z},\mathbf{x})$
  Triangle inequality: $D(\mathbf{x},\mathbf{z}) + D(\mathbf{z},\mathbf{t}) \ge D(\mathbf{x},\mathbf{t})$
The best distance depends on the problem and has to be validated.
- Minkowski distance: $D(\mathbf{x},\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_p = \left(\sum_{i=1}^{d} |x_i - z_i|^p\right)^{1/p}$
    p = 1 (Manhattan), p = 2 (Euclidean)
- Mahalanobis distance: $D(\mathbf{x},\mathbf{z}) = \sqrt{(\mathbf{x}-\mathbf{z})^T \mathbf{C}^{-1} (\mathbf{x}-\mathbf{z})}$
- Hamming distance: fraction of elements in which x and z differ
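Possible implementations of these distances in numpy (a sketch; names are illustrative):

```python
import numpy as np

def minkowski(x, z, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

def mahalanobis(x, z, C):
    """Mahalanobis distance with covariance matrix C."""
    diff = x - z
    return np.sqrt(diff @ np.linalg.inv(C) @ diff)

def hamming(x, z):
    """Fraction of positions in which x and z differ."""
    return np.mean(x != z)
```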
21
Similarities
Properties:
  Bounds: $-1 \le s(\mathbf{x},\mathbf{z}) \le 1$ or $0 \le s(\mathbf{x},\mathbf{z}) \le 1$
  Maximum: $s(\mathbf{x},\mathbf{z}) = 1$ iff $\mathbf{x} = \mathbf{z}$
  Symmetry: $s(\mathbf{x},\mathbf{z}) = s(\mathbf{z},\mathbf{x})$
- Cosine similarity: $s(\mathbf{x},\mathbf{z}) = \frac{\mathbf{x}^T\mathbf{z}}{\|\mathbf{x}\|_2\,\|\mathbf{z}\|_2}$
- Pearson correlation: $s(\mathbf{x},\mathbf{z}) = \frac{(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{z}-\boldsymbol{\mu})}{\|\mathbf{x}-\boldsymbol{\mu}\|_2\,\|\mathbf{z}-\boldsymbol{\mu}\|_2}$, where $\boldsymbol{\mu}$ is the mean of all training samples
- Jaccard coefficient: $s(\mathbf{x},\mathbf{z}) = \frac{\#\{j : x_j = z_j = 1\}}{\#\{j : x_j = 1 \text{ or } z_j = 1\}}$
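And the corresponding similarity measures (a sketch; the Jaccard version assumes binary vectors, as in the formula above):

```python
import numpy as np

def cosine_similarity(x, z):
    return (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

def pearson_similarity(x, z, mu):
    """Cosine similarity after centring both vectors with the training mean mu."""
    return cosine_similarity(x - mu, z - mu)

def jaccard(x, z):
    """Jaccard coefficient for binary vectors."""
    both = np.sum((x == 1) & (z == 1))
    either = np.sum((x == 1) | (z == 1))
    return both / either
```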
22
Choosing k
Nearest neighbour classification is very sensitive to the value of k:
- If k is low, the classifier overfits
- If k is large, the classifier underfits
It is a hyperparameter that has to be selected using cross-validation.
Boundaries get smoother as k increases.
Example 7. The iris dataset, where c = 3
and d = 2. Sepal width and sepal length
are features.
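One possible way to select k by cross-validation, sketched with scikit-learn on the iris data restricted to the two sepal features (assuming scikit-learn is available; the grid of k values is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target          # sepal length and sepal width only (d = 2, c = 3)

scores = {}
for k in range(1, 31, 2):                     # odd values of k to reduce ties
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```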
23
Choosing k
Decision boundaries as a function of k, going from overfitting (small k) to underfitting (large k).
24
Importance of standardization
Top: both axes are scaled properly. Correct decision boundaries are found for k=5.
Bottom: the scale of the vertical axis has been increased 100 times.
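A common remedy is to standardize each feature with the training mean and standard deviation before computing distances; a minimal sketch (names are illustrative):

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize features with the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                 # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```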
25
Combining labels to make predictions
Options for classification:
• Majority vote (ties are broken randomly)
• Distance-weighted vote: choose weights that are higher for closer points,
e.g. inverse of the distance or inverse of the squared distance
Options for regression:
• Use the average of target values of the k nearest neighbors
• Use the weighted average of targets of k nearest neighbors, e.g. inverse of
the distance or inverse of the squared distance
Using weights for prediction, we can relax the constraint of using only k samples and use the whole training set instead.
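A sketch of the distance-weighted variants, assuming inverse-distance weights (the small epsilon guarding against zero distances is an implementation detail not in the slides):

```python
import numpy as np

def weighted_knn(x, data, targets, k, classify=True, eps=1e-12):
    """Distance-weighted k-NN: weighted vote for classification, weighted average for regression."""
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:k]
    w = 1.0 / (dist[nearest] + eps)              # inverse-distance weights
    if classify:
        classes = np.unique(targets[nearest])
        votes = [w[targets[nearest] == c].sum() for c in classes]
        return classes[np.argmax(votes)]         # weighted vote
    return np.sum(w * targets[nearest]) / np.sum(w)   # weighted average of targets
```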
26
Comparison in toy datasets
27
Example 8: Estimation of the pdf to predict crime in San Francisco
• Problem definition
Predict the type of a crime in San Francisco, among 38 types, from a labeled data base.
• Data base
800,000 crimes recorded between 2003 and 2015. Each crime is characterized by a set of features (detailed in the following figure).
28
29
• Evaluation: multiclass logarithmic loss (the lower, the better)
  $\text{logloss} = -\frac{1}{N_{test}}\sum_{i=1}^{N_{test}}\sum_{j=1}^{c} y_{ij}\,\log \Pr\!\left(j \mid \mathbf{x}_i\right), \qquad y_{ij} \in \{0,1\}$
Validation using a fraction of the data base.
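A direct numpy implementation of this metric (a sketch; the clipping that avoids log(0) is an implementation detail not specified by the formula):

```python
import numpy as np

def multiclass_logloss(y_onehot, probs, eps=1e-15):
    """Multiclass logarithmic loss; y_onehot is (N, c) with 0/1 entries, probs is (N, c)."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))
```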
30
• Estimation of p(x | w ) for 4 classes…
31
32
Conclusions
• These are the only suitable methods when d is large compared to the number
of training vectors.
• Non-parametric prediction can be slow, especially with a large training set. There are ways to mitigate this (e.g. prototypes, hashing).
• Prone to overfitting: it is better to remove “noisy” samples from the training set (i.e. samples whose nearest neighbours belong to a different class) and to choose an appropriate k.
• Suffers from the curse of dimensionality: as d increases, everything seems to be close to everything else.
• Suffers from the presence of irrelevant features.
• Standardization of features is crucial to avoid giving excessive relevance to features with large absolute values.
33
Organization of a machine learning competition
(Figure: Data base #1, with feature vectors and labels (xi, yi), i = 1, …, N, is used to train the classifier.)
34
Organization of a machine learning competition
(Figure: Data base #1, with feature vectors and labels (xi, yi), i = 1, …, N, is used to train the classifier. Data base #2, with feature vectors xi only, i = N+1, …, M, is classified to produce Pr(wk|xi), k = 1, …, c. The submitted predictions yi, i = N+1, …, M, are scored by the organizer and produce a public ranking. The classifier can be retrained and resubmitted as many times as desired.)
35
Organization of a machine learning competition
(Figure, Phase 1: Data base #1, with feature vectors and labels (xi, yi), i = 1, …, N, trains the classifier; Data base #2, with feature vectors xi, i = N+1, …, M, is classified to produce Pr(wk|xi), k = 1, …, c; the submitted predictions yi, i = N+1, …, M, are scored by the organizer and produce the public ranking; retraining and resubmission are allowed as many times as desired.
Figure, Phase 2: Data base #3, with feature vectors xi, i = M+1, …, P, is classified by the final classifier to produce Pr(wk|xi), k = 1, …, c; a single submission yi, i = M+1, …, P, is scored by the organizer and produces the final ranking.)
The organisation in two phases prevents participants from using the ranking data base to validate their classifiers.
36
… example of a hosted competition
37