Chapter 4. Non-parametric classifiers
Non-parametric estimation of p(x|y)
If we cannot assume a model for p(x|wi) we can resort to non-parametric
estimators, like the histogram. A precise definition follows:
• The probability that x falls in region R is:
  $P = \int_R p(\mathbf{x}')\,d\mathbf{x}'$
• If we have n independent observations D = {x1, …, xn}, the probability of having k among the n vectors inside the region is:
  $P_k = \binom{n}{k}\, P^k (1-P)^{n-k}, \qquad E[k] = nP$
2
The ML estimator of P is: $\hat{P} = k/n$
If p(x) is continuous and R is small enough that p(x) does not change inside it:
  $\int_R p(\mathbf{x}')\,d\mathbf{x}' \simeq p(\mathbf{x})\, V_R$
Combining both expressions we get the histogram:
  $\hat{p}(\mathbf{x}) = \frac{\int_R p(\mathbf{x}')\,d\mathbf{x}'}{\int_R d\mathbf{x}'} \simeq \frac{k/n}{V_R}$
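As an illustration, here is a minimal Python sketch of this region-based estimator, assuming a hypercube region of side h around x (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def histogram_density(x, data, h):
    """Estimate p(x) as (k/n) / V_R, with R a hypercube of side h centred at x."""
    n, d = data.shape
    inside = np.all(np.abs(data - x) <= h / 2, axis=1)  # which training vectors fall inside R
    k = np.sum(inside)
    V_R = h ** d                                        # volume of the region
    return (k / n) / V_R

# Toy usage: 1-D standard Gaussian samples
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
print(histogram_density(np.array([0.0]), data, h=0.5))  # should be close to 1/sqrt(2*pi) ~ 0.40
```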
3
Convergence conditions
The histogram is averaging values in a region, and hence it is a
distorted version of p(x).
To reduce this effect we are interested in having $V_R \to 0$, which implies $k \to 0$ if the number of samples is finite.
How can we guarantee the convergence of
  $\hat{p}(\mathbf{x}) = \frac{k_n/n}{V_{R,n}}$
when $n \to \infty$? How shall we design $V_{R,n}$?
4
Having $\lim_{n\to\infty} \hat{p}(\mathbf{x}) = p(\mathbf{x})$ implies:
  $\lim_{n\to\infty} V_{R,n} = 0$
  $\lim_{n\to\infty} k_n = \infty$
  $\lim_{n\to\infty} \frac{k_n}{n} = 0$
We can guarantee the three conditions in two ways:
1. Parzen windows: taking VR,n as a function of n.
2. k-nearest neighbors: adapting VR,n in every region of the domain
of x in such a way that k grows at a smaller rate than n.
5
Parzen windows
Assume that region R is defined by a window function $\varphi(\mathbf{x})$ enclosing a hyper-volume $V_{R,n}$ around x. A measure of the number of vectors inside this region is:
  $k_n(\mathbf{x}) = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$
The estimation of the pdf is given by:
  $\hat{p}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x}-\mathbf{x}_i)$
where $\delta_n(\mathbf{x}) = \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right)$.
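A minimal numpy sketch of the Parzen estimate, assuming a Gaussian window (names are illustrative):

```python
import numpy as np

def parzen_density(x, data, h):
    """Parzen estimate with a Gaussian window: (1/n) * sum_i (1/h^d) K((x - xi)/h)."""
    n, d = data.shape
    u = (x - data) / h                                   # (n, d) scaled differences
    norm = (2 * np.pi) ** (d / 2) * h ** d               # Gaussian normalisation times h^d
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / norm   # one window per training vector
    return phi.mean()
```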
6
The conditions on $\varphi(\mathbf{x})$ such that $\hat{p}(\mathbf{x})$ is a pdf are:
  $\varphi(\mathbf{x}) \ge 0$
  $\int \varphi(\mathbf{x})\,d\mathbf{x} = 1$
  $V_n = h_n^d$
Example 1: the window function $\varphi(x)$ for h = 1, h = 0.5 and h = 0.2. A large h will give a very much averaged estimation of the pdf; a small h will give a noisy estimation of the pdf.
7
Histogram $f_n(x)$ obtained with n = 5 vectors, for three different values of h: h = 1, h = 0.5 and h = 0.2.
Usually radial basis functions are used for $\varphi(\mathbf{x})$. These are defined as functions for which:
  $\|\mathbf{x}_1\| = \|\mathbf{x}_2\| \;\Rightarrow\; \varphi(\mathbf{x}_1) = \varphi(\mathbf{x}_2)$
8
Mean and variance of the histogram
Mean
  $E[\hat{p}(\mathbf{x})] = \frac{1}{n}\sum_{i=1}^{n} E\!\left[\frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] = \int \frac{1}{V_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} = \int \delta_n(\mathbf{x}-\mathbf{v})\, p(\mathbf{v})\,d\mathbf{v}$
It is a convolution of the
window with the true pdf
Interpretation
If $V_{R,n} \to 0$, then $\delta_n(\mathbf{x}) \to \delta(\mathbf{x})$ (a Dirac delta function) and the estimator is unbiased, but the number of vectors in each region tends to zero and the estimation will not be good ⟹ we have to evaluate the variance.
9
Variance
  $\sigma_n^2(\mathbf{x}) = \sum_{i=1}^{n} E\!\left[\left(\frac{1}{nV_{R,n}}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) - \frac{1}{n}E[\hat{p}(\mathbf{x})]\right)^{\!2}\right]$
  $= n\,E\!\left[\frac{1}{n^2 V_{R,n}^2}\,\varphi^2\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] - \frac{1}{n}\left(E[\hat{p}(\mathbf{x})]\right)^2$
  $= \frac{1}{nV_{R,n}}\int \frac{1}{V_{R,n}}\,\varphi^2\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} - \frac{1}{n}\left(E[\hat{p}(\mathbf{x})]\right)^2$
Upper bound:
  $\sigma_n^2(\mathbf{x}) \le \frac{\sup\varphi(\cdot)\; E[\hat{p}(\mathbf{x})]}{nV_{R,n}}$
Interpretation
Given n, a low variance implies a large $V_{R,n}$ ⟹ large bias.
10
Variance can be small if VR,n is large.
Example 2: estimation using a Gaussian window.
11
Example 3: Decision boundaries for a two-class problem using Parzen Gaussian windows, for a small (left) and a large (right) value of h (a hyperparameter to be validated). The figure marks the regions where Pr(w1|x) > Pr(w2|x) and Pr(w1|x) < Pr(w2|x).
The optimum size of the window possibly depends on the region under analysis: it should be larger where the density of data is small…
12
Neural probabilistic classifier
A classifier based on the estimation of the pdf using Parzen windows can be built using a neural network-like structure that estimates the pdf and implements a Bayesian rule.
We have n training vectors of dimension d: xi.
(Figure: input vectors → blocks associated to the training vectors → class blocks w1, w2, …, wc → max → decision.)
13
Training phase
1. Training vectors are normalized.
2. The pattern block i weights its inputs with the weights wi = xi.
3. An edge is set between the block associated to the vector xi and its associated class.
(Figure: input vectors → pattern blocks (training vectors) → classes w1, w2, …, wc.)
14
Links can be weighted with a
priori probabilities if known
Prediction phase
1. The vector x to classify is normalized.
2. The pattern blocks compute an activity factor:
   $\varphi\!\left(\frac{\mathbf{x}-\mathbf{w}_i}{h_n}\right) = \exp\!\left(-\frac{(\mathbf{x}-\mathbf{w}_i)^T(\mathbf{x}-\mathbf{w}_i)}{2\sigma^2}\right) = \exp\!\left(-\frac{\mathbf{x}^T\mathbf{x} + \mathbf{w}_i^T\mathbf{w}_i - 2\,\mathbf{x}^T\mathbf{w}_i}{2\sigma^2}\right)$
   Since $\mathbf{x}^T\mathbf{x} = \mathbf{w}_i^T\mathbf{w}_i = 1$, this reduces to $\exp\!\left(\frac{\mathbf{x}^T\mathbf{w}_i - 1}{\sigma^2}\right)$.
3. Each block sums its inputs and generates a likelihood function p(x|wj).
4. The decision on the class is taken by maximizing gj(x).
(Figure: input vectors → pattern blocks (training vectors) → classes w1, w2, …, wc → max → decision.)
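A compact numpy sketch of this prediction rule, assuming the training vectors are stored unit-normalized as rows of W with integer class labels y (all names are illustrative):

```python
import numpy as np

def pnn_predict(x, W, y, sigma, priors=None):
    """Probabilistic-neural-network style prediction with a Gaussian Parzen window.

    x: test vector (d,), W: unit-normalized training vectors (n, d),
    y: integer class labels (n,), sigma: window width,
    priors: optional class priors aligned with np.unique(y).
    """
    x = x / np.linalg.norm(x)                      # step 1: normalize the input vector
    activity = np.exp((W @ x - 1.0) / sigma**2)    # step 2: activity of each pattern block
    classes = np.unique(y)
    scores = np.array([activity[y == c].sum() for c in classes])  # step 3: sum per class
    if priors is not None:                         # optional a priori weighting of the links
        scores = scores * priors
    return classes[np.argmax(scores)]              # step 4: maximize g_j(x)
```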
15
K-nearest neighbour estimation
Instead of looking for the best window (in shape and size), the volume of the cell is increased or decreased as a function of the training data: to estimate p(x) we enlarge the volume $V_R(\mathbf{x})$ around x until $k_n$ vectors are included: the $k_n$ nearest neighbors.
  $\hat{p}(\mathbf{x}) = \frac{k_n/n}{V_R(\mathbf{x})}$
Convergence is guaranteed if $\lim_{n\to\infty} \frac{k_n}{n} = 0$ (with $k_n \to \infty$, as in the conditions above).
Example 4: estimates obtained with $k_n = \sqrt{n}$ and $k_n = \ln(n)$.
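A minimal sketch of the k-NN density estimate, assuming a Euclidean ball as the region around x (names are illustrative):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """k-NN density estimate: p(x) ~ (k/n) / V_R(x), with V_R the ball reaching the k-th neighbour."""
    n, d = data.shape
    dist = np.linalg.norm(data - x, axis=1)
    r = np.sort(dist)[k - 1]                          # radius containing exactly k training vectors
    V_R = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d   # volume of a d-dimensional ball of radius r
    return (k / n) / V_R
```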
16
Example 5:
k-nearest-neighbors
estimations. Compare
them with those in
slide 11
17
A posteriori probabilities Pr(wi|x) can be computed as follows. If $n_i$ is the number of training vectors of class $w_i$ (with $n = \sum_i n_i$) and $k_i$ of them fall among the k nearest neighbors of x, then
  $p(\mathbf{x}|w_i)\Pr(w_i) \simeq \frac{k_i/n_i}{V_R(\mathbf{x})}\cdot\frac{n_i}{n} = \frac{k_i}{n\,V_R(\mathbf{x})}$
and therefore
  $\Pr(w_i|\mathbf{x}) = \frac{p(\mathbf{x}|w_i)\Pr(w_i)}{\sum_{j=1}^{c} p(\mathbf{x}|w_j)\Pr(w_j)} = \frac{p(\mathbf{x}|w_i)\Pr(w_i)}{\sum_{j=1}^{c} p(\mathbf{x},w_j)} = \frac{k_i/(n\,V_R(\mathbf{x}))}{k/(n\,V_R(\mathbf{x}))} = \frac{k_i}{k}$
i.e. the fraction of vectors belonging to class wi among the k neighbours of x.
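In code, the posterior reduces to counting labels among the k nearest neighbours; a sketch with illustrative names:

```python
import numpy as np

def knn_posteriors(x, data, labels, k, n_classes):
    """Pr(wi|x) estimated as ki / k, the label frequencies among the k nearest neighbours."""
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:k]               # indices of the k nearest training vectors
    counts = np.bincount(labels[nearest], minlength=n_classes)
    return counts / k                            # ki / k for each class

# Decision rule of the next slide: assign x to np.argmax(knn_posteriors(x, data, labels, k, c))
```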
18
K-nearest neighbour classifier
Classification rule. Using the previous equation we can classify a
vector x with the following rule (which is optimum if n is large):
Select the class most represented in a region around x.
Example 6: k = 8. The selected class for vector x is ‘+’. The size of the region is not constant over the whole domain.
(Figure: scatter plot in the (x1, x2) plane showing the region around x containing its k = 8 nearest neighbours.)
19
• Parzen: select the class whose training vectors xi give the largest weighted sum of window values $\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$.
• K-nearest-neighbors: find the window containing the k neighbours around x. The most represented class is the chosen one.
We can obtain reasonable performance if the class is chosen as the one associated to the nearest training vector (“1-nearest”) (Cover & Hart, 1967). When N → ∞:
  $P_{MAP}(e) \le P_{1NN}(e) \le P_{MAP}(e)\left(2 - \frac{c}{c-1}\,P_{MAP}(e)\right)$
For c = 2:
  $P_{MAP}(e) \le P_{1NN}(e) \le 2\,P_{MAP}(e)\left(1 - P_{MAP}(e)\right)$
(Figure: decision regions for the 1-nearest-neighbour rule.)
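As a quick worked illustration (not from the slides): for c = 2 and a Bayes error of $P_{MAP}(e) = 0.1$, the bound gives
  $P_{1NN}(e) \le 2 \cdot 0.1 \cdot (1 - 0.1) = 0.18$
so the asymptotic 1-NN error is never more than twice the Bayes error.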
20
Distances
Properties:
  Non-negativity: $D(\mathbf{x},\mathbf{z}) \ge 0$
  Reflexivity: $D(\mathbf{x},\mathbf{z}) = 0$ iff $\mathbf{x} = \mathbf{z}$
  Symmetry: $D(\mathbf{x},\mathbf{z}) = D(\mathbf{z},\mathbf{x})$
  Triangle inequality: $D(\mathbf{x},\mathbf{z}) + D(\mathbf{z},\mathbf{t}) \ge D(\mathbf{x},\mathbf{t})$
The best distance depends on the problem and has to be validated.
- Minkowski distance: $D(\mathbf{x},\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_p = \left(\sum_{i=1}^{d} |x_i - z_i|^p\right)^{1/p}$
    p = 1 (Manhattan), p = 2 (Euclidean)
- Mahalanobis distance: $D(\mathbf{x},\mathbf{z}) = \sqrt{(\mathbf{x}-\mathbf{z})^T \mathbf{C}^{-1} (\mathbf{x}-\mathbf{z})}$
- Hamming distance: fraction of elements in which x and z differ
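Possible implementations of these distances in numpy (a sketch; names are illustrative):

```python
import numpy as np

def minkowski(x, z, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

def mahalanobis(x, z, C):
    """Mahalanobis distance with covariance matrix C."""
    diff = x - z
    return np.sqrt(diff @ np.linalg.inv(C) @ diff)

def hamming(x, z):
    """Fraction of positions in which x and z differ."""
    return np.mean(x != z)
```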
21
Similarities
Properties:
  Bounds: $-1 \le s(\mathbf{x},\mathbf{z}) \le 1$ or $0 \le s(\mathbf{x},\mathbf{z}) \le 1$
  Maximum: $s(\mathbf{x},\mathbf{z}) = 1$ iff $\mathbf{x} = \mathbf{z}$
  Symmetry: $s(\mathbf{x},\mathbf{z}) = s(\mathbf{z},\mathbf{x})$
- Cosine similarity: $s(\mathbf{x},\mathbf{z}) = \frac{\mathbf{x}^T\mathbf{z}}{\|\mathbf{x}\|_2\,\|\mathbf{z}\|_2}$
- Pearson correlation: $s(\mathbf{x},\mathbf{z}) = \frac{(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{z}-\boldsymbol{\mu})}{\|\mathbf{x}-\boldsymbol{\mu}\|_2\,\|\mathbf{z}-\boldsymbol{\mu}\|_2}$, where $\boldsymbol{\mu}$ is the mean of all training samples
- Jaccard coefficient: $s(\mathbf{x},\mathbf{z}) = \frac{\#\{j : x_j = z_j = 1\}}{\#\{j : x_j = 1 \text{ or } z_j = 1\}}$
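And the corresponding similarity measures (a sketch; the Jaccard version assumes binary vectors, as in the formula above):

```python
import numpy as np

def cosine_similarity(x, z):
    return (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

def pearson_similarity(x, z, mu):
    """Cosine similarity after centring both vectors with the training mean mu."""
    return cosine_similarity(x - mu, z - mu)

def jaccard(x, z):
    """Jaccard coefficient for binary vectors."""
    both = np.sum((x == 1) & (z == 1))
    either = np.sum((x == 1) | (z == 1))
    return both / either
```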
22
Choosing k
Nearest neighbour classification is very sensitive to the value of k:
- If k is low, the classifier overfits
- If k is large, the classifier underfits
It is a hyperparameter that has to be selected using cross-validation.
Boundaries get smoother as k increases.
Example 7. The iris dataset, where c = 3
and d = 2. Sepal width and sepal length
are features.
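One possible way to select k by cross-validation, sketched with scikit-learn on the iris data restricted to the two sepal features (assuming scikit-learn is available; the grid of k values is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target          # sepal length and sepal width only (d = 2, c = 3)

scores = {}
for k in range(1, 31, 2):                     # odd values of k to reduce ties
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```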
23
Choosing k
Decision boundaries as a function of k, going from overfitting (small k) to underfitting (large k).
24
Importance of standardization
Top: both axes are scaled properly. Correct decision boundaries are found for k=5.
Bottom: the scale of the vertical axis has been increased 100 times.
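A common remedy is to standardize each feature with the training mean and standard deviation before computing distances; a minimal sketch (names are illustrative):

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize features with the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                 # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```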
25
Combining labels to make predictions
Options for classification:
• Majority vote (ties are broken randomly)
• Distance-weighted vote: choose weights that are higher for closer points,
e.g. inverse of the distance or inverse of the squared distance
Options for regression:
• Use the average of target values of the k nearest neighbors
• Use the weighted average of targets of k nearest neighbors, e.g. inverse of
the distance or inverse of the squared distance
Using weights for prediction, we can relax the constraint of using only k samples and use the whole training set instead.
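A sketch of the distance-weighted variants, assuming inverse-distance weights (the small epsilon guarding against zero distances is an implementation detail not in the slides):

```python
import numpy as np

def weighted_knn(x, data, targets, k, classify=True, eps=1e-12):
    """Distance-weighted k-NN: weighted vote for classification, weighted average for regression."""
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:k]
    w = 1.0 / (dist[nearest] + eps)              # inverse-distance weights
    if classify:
        classes = np.unique(targets[nearest])
        votes = [w[targets[nearest] == c].sum() for c in classes]
        return classes[np.argmax(votes)]         # weighted vote
    return np.sum(w * targets[nearest]) / np.sum(w)   # weighted average of targets
```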
26
Comparison in toy datasets
27
Example 8: Estimation of the pdf to predict crime in San Francisco
• Problem definition
Predict the type of a crime in San Francisco, among 38 types, from a labeled data base.
• Data base
800,000 crimes recorded between 2003 and 2015. Each crime is characterized by a set of features (detailed in the following figure).
28
29
• Evaluation: multiclass logarithmic loss (the lower, the better)
  $\text{logloss} = -\frac{1}{N_{test}}\sum_{i=1}^{N_{test}}\sum_{j=1}^{c} y_{ij}\,\log \Pr\!\left(j \mid \mathbf{x}_i\right), \qquad y_{ij} \in \{0,1\}$
Validation using a fraction of the data base.
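A direct numpy implementation of this metric (a sketch; the clipping that avoids log(0) is an implementation detail not specified by the formula):

```python
import numpy as np

def multiclass_logloss(y_onehot, probs, eps=1e-15):
    """Multiclass logarithmic loss; y_onehot is (N, c) with 0/1 entries, probs is (N, c)."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))
```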
30
• Estimation of p(x | w ) for 4 classes…
31
32
Conclusions
• These are the only suitable methods when d is large compared to the number
of training vectors.
• Non-parametric prediction can be slow, especially with a large training set. There are ways to mitigate this (e.g. prototypes, hashing).
• Prone to overfitting: it is better to remove “noisy” samples from the training set (i.e. samples whose nearest neighbours belong to a different class) and to choose an appropriate k.
• Suffers from the curse of dimensionality: as d increases, everything seems to be close to everything else.
• Suffers from the presence of irrelevant features.
• Standardization of features is crucial to avoid giving excessive relevance to features with large absolute values.
33
Organization of a machine learning competition
(Figure: Data base #1, with feature vectors and labels (xi, yi), i = 1, …, N, is used to train the classifier.)
34
Organization of a machine learning competition
(Figure: Data base #1, with feature vectors and labels (xi, yi), i = 1, …, N, is used to train the classifier. Data base #2, with feature vectors xi only, i = N+1, …, M, is classified to produce Pr(wk|xi), k = 1, …, c. The submitted predictions yi, i = N+1, …, M, are scored by the organizer and produce a public ranking. The classifier can be retrained and resubmitted as many times as desired.)
35
Organization of a machine learning competition
(Figure, Phase 1: Data base #1, with feature vectors and labels (xi, yi), i = 1, …, N, trains the classifier; Data base #2, with feature vectors xi, i = N+1, …, M, is classified to produce Pr(wk|xi), k = 1, …, c; the submitted predictions yi, i = N+1, …, M, are scored by the organizer and produce the public ranking; retraining and resubmission are allowed as many times as desired.
Figure, Phase 2: Data base #3, with feature vectors xi, i = M+1, …, P, is classified by the final classifier to produce Pr(wk|xi), k = 1, …, c; a single submission yi, i = M+1, …, P, is scored by the organizer and produces the final ranking.)
The organisation in two phases prevents participants from using the ranking data base to validate their classifiers.
36
… example of a hosted competition
37