K-nearest neighbor
Leif E. Peterson (2009), Scholarpedia, 4(2):1883. doi:10.4249/scholarpedia.1883 revision #137311 [link to/cite this article]

Leif E. Peterson, Center for Biostatistics, The Methodist Hospital Research Institute

K-nearest-neighbor (kNN) classification is one of the most fundamental and simple classification methods
and should be one of the first choices for a classification study when there is little or no prior knowledge about
the distribution of the data. K-nearest-neighbor classification was developed from the need to perform
discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to
determine. In an unpublished US Air Force School of Aviation Medicine report in 1951, Fix and Hodges
introduced a non-parametric method for pattern classification that has since become known as the k-nearest
neighbor rule (Fix & Hodges, 1951). Later, in 1967, some of the formal properties of the k-nearest-neighbor rule
were worked out; for instance it was shown that for k = 1 and n → ∞ the k-nearest-neighbor classification error
is bounded above by twice the Bayes error rate (Cover & Hart, 1967). Once such formal properties of k-nearest-
neighbor classification were established, a long line of investigation ensued including new rejection approaches
(Hellman, 1970), refinements with respect to Bayes error rate (Fukunaga & Hostetler, 1975), distance weighted
approaches (Dudani, 1976; Bailey & Jain, 1978), soft computing (Bermejo & Cabestany, 2000) methods and fuzzy
methods (Jozwik, 1983; Keller et al., 1985).

Contents
1 Characteristics of kNN
1.1 Between-sample geometric distance
1.2 Classification decision rule and confusion matrix
1.3 Feature transformation
1.4 Performance assessment with cross-validation
2 Pseudocode (algorithm)
3 Commonly Employed Data Sets
4 Performance Evaluation
5 Acknowledgments
6 References
7 Recommended reading

Characteristics of kNN

Between-sample geometric distance


The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the
specified training samples. Let xi be an input sample with p features (xi1 , xi2 , … , xip ) , n be the total number
of input samples (i = 1, 2, … , n) and p the total number of features (j = 1, 2, … , p) . The Euclidean distance
between sample xi and xl (l = 1, 2, … , n) is defined as
$$d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \cdots + (x_{ip} - x_{lp})^2}.$$
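As an illustration, a minimal sketch of this distance computation in Python/NumPy (the function name and the example vectors are my own, not part of the original article):

import numpy as np

def euclidean_distance(x_i, x_l):
    """Euclidean distance between two p-dimensional feature vectors."""
    x_i, x_l = np.asarray(x_i, dtype=float), np.asarray(x_l, dtype=float)
    return np.sqrt(np.sum((x_i - x_l) ** 2))

# Example with p = 3 features: sqrt(1 + 4 + 0) = 2.236...
print(euclidean_distance([1.0, 2.0, 3.0], [2.0, 0.0, 3.0]))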

A graphic depiction of the nearest neighbor concept is illustrated in the Voronoi tessellation (Voronoi, 1907) shown in Figure 1. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as

$$R_i = \{x \in \mathbb{R}^p : d(x, x_i) \le d(x, x_m), \ \forall\, i \ne m\},$$

where R_i is the Voronoi cell for sample x_i, and x represents all possible points within Voronoi cell R_i. Voronoi tessellations primarily reflect two characteristics of a coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge. Using the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.
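The Voronoi-cell definition can be reproduced numerically by labeling every point of a dense grid with the index of its nearest sample; the sketch below (Python/NumPy) uses made-up 2-dimensional coordinates, since the 19 points of Figure 1 are not listed numerically:

import numpy as np

# Hypothetical 2-D coordinates standing in for the 19 "+" samples of Figure 1
rng = np.random.default_rng(0)
samples = rng.uniform(0, 10, size=(19, 2))

def voronoi_cell_index(point, samples):
    """Index i of the sample whose Voronoi cell R_i contains `point`,
    i.e. the sample nearest to `point` under Euclidean distance."""
    d = np.sqrt(((samples - point) ** 2).sum(axis=1))
    return int(np.argmin(d))

# Labeling a dense grid of points by nearest sample traces out the Voronoi cells R_i.
xs, ys = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 10, 200))
grid = np.column_stack([xs.ravel(), ys.ravel()])
cells = np.array([voronoi_cell_index(p, samples) for p in grid]).reshape(xs.shape)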
Figure 1: Voronoi tessellation showing Voronoi cells of 19 samples marked with a "+". The Voronoi tessellation reflects two characteristics of the example 2-dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge.

Classification decision rule and confusion matrix
Classification typically involves partitioning samples into training and testing categories. Let x_i be a training sample and x be a test sample, and let ω be the true class of a training sample and ω̂ be the predicted class for a test sample (ω, ω̂ = 1, 2, … , Ω). Here, Ω is the total number of classes.

During the training process, we use only the true class ω of each training sample to train the classifier, while during testing we predict the class ω̂ of each test sample. It warrants noting that kNN is a "supervised" classification method in that it uses the class labels of the training data. Unsupervised classification methods, or "clustering" methods, on the other hand, do not employ the class labels of the training data.

With the 1-nearest neighbor rule, the predicted class of test sample x is set equal to the true class ω of its nearest neighbor, where m_i is a nearest neighbor to x if the distance

$$d(m_i, x) = \min_j \{ d(m_j, x) \}.$$


For k-nearest neighbors, the predicted class of test sample x is set equal to the most frequent true class among the k nearest training samples. This forms the decision rule D : x → ω̂.
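A minimal sketch of this decision rule in Python (the function and variable names are illustrative, not taken from the article):

import numpy as np
from collections import Counter

def knn_predict(x, train_X, train_y, k=1):
    """Predict the class of test sample x as the most frequent class label
    among its k nearest training samples (Euclidean distance); k=1 gives the 1-NN rule."""
    train_X = np.asarray(train_X, dtype=float)
    d = np.sqrt(((train_X - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]                    # indices of the k closest training samples
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]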
The confusion matrix used for tabulating test sample class predictions during testing is denoted as C and has dimensions Ω × Ω. During testing, if the predicted class of test sample x is correct (i.e., ω̂ = ω), then the diagonal element c_ωω of the confusion matrix is incremented by 1. However, if the predicted class is incorrect (i.e., ω̂ ≠ ω), then the off-diagonal element c_ωω̂ is incremented by 1. Once all the test samples have been classified, the classification accuracy is based on the ratio of the number of correctly classified samples to the total number of samples classified, given in the form

$$Acc = \frac{\sum_{\omega}^{\Omega} c_{\omega\omega}}{n_{total}},$$

where c_ωω is a diagonal element of C and n_total is the total number of samples classified.
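A sketch of this bookkeeping in Python, assuming for the example that the Ω class labels are coded as integers 0, ..., Ω-1 so they can be used directly as matrix indices:

import numpy as np

def accuracy(C):
    """Acc = sum of diagonal elements of the confusion matrix / total samples classified."""
    return np.trace(C) / C.sum()

# Example with Omega = 2 classes and four test predictions (true class, predicted class)
C = np.zeros((2, 2), dtype=int)
for omega, omega_hat in [(0, 0), (0, 1), (1, 1), (1, 1)]:
    C[omega, omega_hat] += 1        # diagonal if correct, off-diagonal otherwise
print(accuracy(C))                  # 3 of 4 correct -> 0.75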
Consider a machine learning study to classify 19 samples with 2 features, X and Y . Table 1 lists the pairwise
Euclidean distance between the 19 samples. Assume that sample x11 is being used as a test sample while all
remaining samples are used for training. For k=4, the four samples closest to sample x11 are sample x10 (blue
class label), sample x12 (red class label), x13 (red class label), and x14 (red class label).

Table 1. Euclidean distance matrix D listing all possible pairwise Euclidean distances between 19 samples.
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18
x2 1.5
x3 1.4 1.6
x4 1.6 1.4 1.3
x5 1.7 1.4 1.5 1.5
x6 1.3 1.4 1.4 1.5 1.4
x7 1.6 1.3 1.4 1.4 1.5 1.8
x8 1.5 1.4 1.6 1.3 1.7 1.6 1.4
x9 1.4 1.3 1.4 1.5 1.2 1.4 1.3 1.5
x10 2.3 2.4 2.5 2.3 2.6 2.7 2.8 2.7 3.1
x11 2.9 2.8 2.9 3.0 2.9 3.1 2.9 3.1 3.0 1.5
x12 3.2 3.3 3.2 3.1 3.3 3.4 3.3 3.4 3.5 3.3 1.6
x13 3.3 3.4 3.2 3.2 3.3 3.4 3.2 3.3 3.5 3.6 1.4 1.7
x14 3.4 3.2 3.5 3.4 3.7 3.5 3.6 3.3 3.5 3.6 1.5 1.8 0.5
x15 4.2 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 1.7 1.6 0.3 0.5
x16 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 1.6 1.5 0.4 0.5 0.4
x17 5.9 6.2 6.2 5.8 6.1 6.0 6.1 5.9 5.8 6.0 2.3 2.3 2.5 2.3 2.4 2.5
x18 6.1 6.3 6.2 5.8 6.1 6.0 6.1 5.9 5.8 6.0 3.1 2.7 2.6 2.3 2.5 2.6 3.0
x19 6.0 6.1 6.2 5.8 6.1 6.0 6.1 5.9 5.8 6.0 3.0 2.9 2.7 2.4 2.5 2.8 3.1 0.4

Figure 2 shows an X-Y scatterplot of the 19 samples plotted as a function of their X and Y values. One can notice that among the four samples closest to test sample x11 (labeled green), 3/4 of the class labels are for Class A (red color), and therefore, the test sample is assigned to Class A.

Figure 2: X-Y Scatterplot of the 19 samples for which pairwise Euclidean distances are listed in Table 1. Among the 4 nearest neighbors of the test sample, the most frequent class label color is red, and thus the test sample is assigned to the red class.

Feature transformation
Increased performance of a classifier can sometimes be achieved when the feature values are transformed prior to classification analysis. Two commonly used feature transformations are standardization and fuzzification.

Standardization removes scale effects caused by use of features with different measurement scales. For example, if one feature is based on patient weight in units of kg and another feature is based on blood protein values in units of ng/dL in the range [-3,3], then patient weight will have a much greater influence on the distance between samples and may bias the performance of the classifier. Standardization transforms raw feature values into z-scores using the mean and standard deviation of the feature values over all input samples, given by the relationship

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},$$

where x_ij is the value for the ith sample and jth feature, μ_j is the average of all x_ij for feature j, and σ_j is the standard deviation of all x_ij over all input samples. If the feature values take on a Gaussian distribution, then the histogram of z-scores will represent a standard normal distribution having a mean of zero and variance of unity. Once standardization is performed on a set of features, the range and scale of the z-scores should be similar, provided the distributions of raw feature values are alike.
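A minimal sketch of the z-score standardization in Python (the feature matrix X below is hypothetical):

import numpy as np

def standardize(X):
    """Transform each feature (column) of X into z-scores using the mean and
    standard deviation of that feature over all input samples."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)       # mu_j for each feature j
    sigma = X.std(axis=0)     # sigma_j for each feature j
    return (X - mu) / sigma

# Example: weight in kg and a protein level on a much smaller scale
X = np.array([[70.0, 0.5], [82.0, -1.2], [95.0, 2.1], [60.0, 0.3]])
Z = standardize(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))   # approximately [0 0] and [1 1]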
Fuzzification is a transformation which exploits uncertainty in feature values in order to increase classification performance. Fuzzification replaces the original features by mapping original values of an input feature into 3 fuzzy sets representing linguistic membership functions in order to facilitate the semantic interpretation of each fuzzy set (Klir and Yuan, 1995; Dubois and Prade, 2000; Pal and Mitra, 2004). First, determine x_min and x_max as the minimum and maximum values of x_ij for feature j over all input samples, and q1 and q2 as the quantile values of x_ij at the 33rd and 66th percentile. Next, calculate the averages Avg1 = (x_min + q1)/2, Avg2 = (q1 + q2)/2, and Avg3 = (q2 + x_max)/2. Finally, translate each value of x_ij for feature j into 3 fuzzy membership values in the range [0,1] as μ_low,i,j, μ_med,i,j, and μ_high,i,j using the relationships

$$\mu_{low,i,j} = \begin{cases} 1 & x < Avg_1 \\ \dfrac{q_2 - x}{q_2 - Avg_1} & Avg_1 \le x < q_2 \\ 0 & x \ge q_2, \end{cases}$$

$$\mu_{med,i,j} = \begin{cases} 0 & x < q_1 \\ \dfrac{x - q_1}{Avg_2 - q_1} & q_1 \le x < Avg_2 \\ \dfrac{q_2 - x}{q_2 - Avg_2} & Avg_2 \le x < q_2 \\ 0 & x \ge q_2, \end{cases}$$

$$\mu_{high,i,j} = \begin{cases} 0 & x < q_1 \\ \dfrac{x - q_1}{Avg_3 - q_1} & q_1 \le x < Avg_3 \\ 1 & x \ge Avg_3. \end{cases}$$
The above computations result in 3 fuzzy sets (vectors) μlow,j , μmed,j and μhigh,j of length n which replace the
original input feature.
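A sketch of this fuzzification step in Python; np.interp implements the piecewise-linear membership functions above (the function and variable names are my own):

import numpy as np

def fuzzify_feature(x):
    """Replace one feature vector x (length n) by three membership vectors
    (low, medium, high) built from its 33rd/66th percentiles and the averages Avg1-Avg3."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    q1, q2 = np.percentile(x, [33, 66])
    avg1, avg2, avg3 = (x_min + q1) / 2, (q1 + q2) / 2, (q2 + x_max) / 2
    mu_low = np.interp(x, [avg1, q2], [1.0, 0.0])            # 1 below Avg1, linear decay to 0 at q2
    mu_med = np.interp(x, [q1, avg2, q2], [0.0, 1.0, 0.0])   # rises to 1 at Avg2, falls to 0 at q2
    mu_high = np.interp(x, [q1, avg3], [0.0, 1.0])           # 0 below q1, linear rise to 1 at Avg3
    return mu_low, mu_med, mu_high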
The statistical significance of class discrimination for each jth feature can be assessed by using the F-ratio test,
given as

$$F(j) = \frac{\sum_{\omega=1}^{\Omega} n_\omega (\bar{y}_\omega - \bar{y})^2 / (\Omega - 1)}{\sum_{\omega=1}^{\Omega} \sum_{i=1}^{n_\omega} (y_{\omega i} - \bar{y}_\omega)^2 / (n - \Omega)},$$

where n_ω is the number of training samples in class ω (ω = 1, 2, … , Ω), ȳ_ω is the mean feature value among training samples in class ω, ȳ is the mean feature value for all training samples, y_ωi is the feature value among training samples in class ω, (Ω − 1) is the numerator degrees of freedom, and (n − Ω) is the denominator degrees of freedom for the F-ratio test. Tail probabilities, i.e., Prob_j, are derived for values of the F-ratio statistic based on the numerator and denominator degrees of freedom. A simple way to quantify simultaneously the total statistical significance of class discrimination for p independent features is to sum the minus natural logarithm of the feature-specific p-values using the form

$$\mathrm{sum}[-\log(\text{p-value})] = \sum_{j}^{p} -\ln(\mathrm{Prob}_j).$$

High values of sum[-log(p-value)] for a set of features (>1000) suggest that the feature values are heterogeneous across the classes considered and can discriminate classes well, whereas low values of sum[-log(p-value)] (<100) suggest poor discrimination ability of the features.
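A sketch of this F-ratio screening in Python, using scipy.stats for the tail probabilities (the helper names are illustrative):

import numpy as np
from scipy.stats import f as f_dist

def f_ratio_pvalue(y, classes):
    """One-way F ratio for a single feature y and its upper-tail probability Prob_j."""
    y, classes = np.asarray(y, dtype=float), np.asarray(classes)
    labels = np.unique(classes)
    Omega, n = len(labels), len(y)
    ybar = y.mean()
    between = sum(len(y[classes == w]) * (y[classes == w].mean() - ybar) ** 2 for w in labels)
    within = sum(((y[classes == w] - y[classes == w].mean()) ** 2).sum() for w in labels)
    F = (between / (Omega - 1)) / (within / (n - Omega))
    return F, f_dist.sf(F, Omega - 1, n - Omega)

def sum_neg_log_p(X, classes):
    """sum[-log(p-value)] over all p features (columns) of the feature matrix X."""
    return sum(-np.log(f_ratio_pvalue(X[:, j], classes)[1]) for j in range(X.shape[1]))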

Performance assessment with cross-validation


A basic rule in classification analysis is that class predictions are not made for data samples that are used for
training or learning. If class predictions are made for samples used in training or learning, the accuracy will be
artificially biased upward. Instead, class predictions are made for samples that are kept out of the training process.
The performance of most classifiers is typically evaluated through cross-validation, which involves the
determination of classification accuracy for multiple partitions of the input samples used in training. For
example, during 5-fold (κ = 5) cross-validation training, a set of input samples is split up into 5 partitions
D1 , D2 , … , D5 having equal sample sizes to the extent possible. The notion of ensuring uniform class
representation among the partitions is called stratified cross-validation, which is preferred. To begin, for 5-fold
cross-validation, samples in partitions D2 , D3 , … , D5 are first used for training while samples in partition D1
are used for testing. Next, samples in groups D1, D3, … , D5 are used for training and samples in partition D2 are
used for testing. This is repeated until each partition has been used singly for testing. It is also customary to re-
partition all of the input samples, e.g., 10 times, in order to get a better estimate of accuracy.
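A sketch of the stratified partitioning in Python (the round-robin assignment is one simple way to keep class proportions roughly equal across the κ folds; the names are my own):

import numpy as np

def stratified_partitions(classes, kappa=5, seed=0):
    """Split sample indices into kappa folds while keeping roughly uniform
    class representation in every fold (stratified cross-validation)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(kappa)]
    for label in np.unique(classes):
        idx = rng.permutation(np.flatnonzero(classes == label))
        for pos, i in enumerate(idx):      # deal the samples of this class out round-robin
            folds[pos % kappa].append(int(i))
    return [np.array(fold) for fold in folds]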

Pseudocode (algorithm)

Pseudocode is defined as a listing of sequential steps for solving a computational problem. Pseudocode is used by
computer programmers to mentally translate each computational step into a set of programming instructions
involving various mathematical operations (addition, subtraction, multiplication, division, power and
transcendental functions, differentiation/integration, etc.) and resources (vectors, arrays, graphics, input/output,
etc.) in order to solve an analytic problem. Following is a listing of pseudocode for the k-nearest-neighbor
classification method using cross-validation.

Algorithm 1. (PseudoCode for κ-Fold Cross-Validation)

begin
initialize the n × n distance matrix D, initialize the Ω × Ω confusion matrix C, set t ← 0, TotAcc ← 0, and
set NumIterations equal to the desired number of iterations (re-partitions).
calculate distances between all the input samples and store in the n × n matrix D. (For a large number of samples, use only the lower or upper triangle of D for storage since it is a square symmetric matrix.)
for t ← 1 to NumIterations do

set C ← 0, and ntotal ← 0 .


partition the input samples into κ equally-sized groups.
for fold ← 1 to κ do

assign samples in the fold-th partition to testing, and use the remaining samples for training. Set the
number of samples used for testing as n_test.
set n_total ← n_total + n_test.

for i ← 1 to n_test do

for test sample x_i determine the k closest training samples based on the calculated distances.
determine ω̂, the most frequent class label among the k closest training samples.
increment confusion matrix C by 1 in element c_ω,ω̂, where ω is the true and ω̂ the predicted class
label for test sample x_i. If ω = ω̂ then the increment of +1 will occur on the diagonal of the
confusion matrix, otherwise, the increment will occur in an off-diagonal.

determine the classification accuracy using Acc = (∑_j^Ω c_jj) / n_total, where c_jj is a diagonal element of the
confusion matrix C.

calculate TotAcc = TotAcc + Acc.

calculate AvgAcc = TotAcc / NumIterations

end
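For reference, Algorithm 1 translates into a short, self-contained Python sketch; it uses simple (non-stratified) random partitions, as in the pseudocode, and np.searchsorted only to map arbitrary class labels to confusion-matrix indices:

import numpy as np
from collections import Counter

def knn_cross_validation(X, y, k=5, kappa=10, num_iterations=10, seed=0):
    """Average kappa-fold cross-validation accuracy of kNN, following Algorithm 1."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, labels = len(y), np.unique(y)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # n x n distance matrix
    rng = np.random.default_rng(seed)
    tot_acc = 0.0
    for _ in range(num_iterations):
        C = np.zeros((len(labels), len(labels)), dtype=int)       # Omega x Omega confusion matrix
        folds = np.array_split(rng.permutation(n), kappa)         # kappa roughly equal partitions
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)              # remaining samples are used for training
            for i in fold:
                nearest = train[np.argsort(D[i, train])[:k]]      # k closest training samples
                omega_hat = Counter(y[nearest]).most_common(1)[0][0]
                C[np.searchsorted(labels, y[i]), np.searchsorted(labels, omega_hat)] += 1
        tot_acc += np.trace(C) / C.sum()                          # Acc for this re-partition
    return tot_acc / num_iterations

For example, knn_cross_validation(X, y, k=5, kappa=10) corresponds to the 10-fold, 5NN setting referred to in the following sections, applied to whatever data X, y are supplied.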

The above pseudocode was applied to several commonly used data sets (see next section) where the fold value
was varied in order to assess performance (accuracy) as a function of the size of the cross-validation partitions.

Commonly Employed Data Sets

Nine data sets from the Machine Learning Repository (http://mlearn.ics.uci.edu/MLRepository.html) of the
University of California - Irvine (UCI) were used for several k-nearest neighbor runs (Newman et al., 1998). Table
2 lists the data sets, number of classes, number of samples, and number of features (attributes) in each data set.

Table 2. Data sets used.


Data set #Samples #Classes #Features Reference
Cancer (Wisconsin) 699 2 9 Wolberg & Mangasarian, 1990
Dermatology 366 6 34 Guvenir et al., 1998
Glass 214 6 9 Evett & Spiehler, 1987
Ionosphere 351 2 32 Sigillito et al., 1989
Fisher Iris 150 3 4 Fisher, 1936
Liver 345 2 8 Forsyth, 1990
Pima Diabetes 768 2 8 Smith et al., 1988
Soybean 266 15 38 Michalski & Chilausky, 1980
Wine 178 3 13 Aeberhard et al., 1992

Performance Evaluation

Figure 3 shows the strong linear relationship between 10-fold cross-validation accuracy for the 9 data sets and the ratio of the feature sum[-log(p)] to the number of features. The liver data set resulted in the lowest accuracy, while the Fisher Iris data resulted in the greatest accuracy. The low value of sum[-log(p-value)] for features in the liver data set will on average result in lower classification accuracy, whereas the greater level of sum[-log(p-value)] for the Fisher Iris data and cancer data set will yield much greater levels of accuracy.

Figure 3: Linear relationship between classification accuracy and the ratio sum[-log(p)]/#features. 5NN used with feature standardization.

Figure 4 reflects k-nearest neighbor performance (k=5, feature standardization) for various cross validation methods for each data set. 2- and 5-fold cross validation ("CV2" and "CV5") performed worse than 10-fold ("CV10") and leave-one-out cross validation ("CV-1"). 10-fold cross validation ("CV10") was approximately the same as leave-one-out cross validation ("CV-1"). Bootstrapping resulted in slightly lower performance when compared with CV10 and CV-1.

Figure 4: Bias as a function of various cross-validation methods for the data sets used. Feature values standardized and k=5.

Figure 5 shows that, when averaging performance over all data sets (k=5), both feature standardization and feature fuzzification resulted in greater accuracy levels when compared with no feature transformation.

Figure 5: Bias as a function of cross validation method averaged over all training sets as a function of feature transformation. K=5 used.

Figures 6, 7, and 8 illustrate the CV10 accuracy for each data set as a function of k with no transformation, standardization, and fuzzification, respectively. It was apparent that feature standardization (Figure 7) and fuzzification (Figure 8) greatly improved the accuracy of the dermatology and wine data sets. Fuzzification (Figure 8) slightly reduced the performance of the Fisher Iris data set. Interestingly, performance for the soybean data set did not improve with increasing values of k, suggesting overlearning or overfitting.

Figure 6: Bias as a function of k without feature transformation. 10-fold cross validation used.

Figure 7: Bias as a function of k with feature standardization. 10-fold cross validation used.

Figure 8: Bias as a function of k with feature fuzzification. 10-fold cross validation used.
Average accuracy as a function of k for feature standardization and fuzzification, for all data sets combined, is shown in Figure 9. Again, feature standardization and fuzzification resulted in improved accuracy values over the range of k. Finally, Figures 10, 11, and 12 show the bootstrap accuracy as a function of training sample size for k=5 (i.e., 5NN), with and without feature standardization and fuzzification. The use of feature standardization and fuzzification resulted in substantial performance gains for the dermatology and wine data sets. Feature fuzzification markedly improved performance for the dermatology data set, especially at lower sample sizes. Standardization also improved the dermatology data set performance at smaller sample sizes. Performance for the liver, glass, and soybean data sets was not improved by feature standardization or fuzzification.

Figure 9: Bias as a function of k averaged over all training sets as a function of feature transformation.

Figure 10: Bootstrap bias as a function of the number of training instances sampled randomly with replacement. No feature transformation used.

Figure 11: Bootstrap bias as a function of the number of training instances sampled randomly with replacement. Feature standardization used.

Figure 12: Bootstrap bias as a function of the number of training instances sampled randomly with replacement. Feature fuzzification used.

Performance of the k-nearest neighbor classification method was assessed using several data sets, cross-validation, and bootstrapping. All methods involved initial use of a distance matrix and construction of a confusion matrix during sample testing, from which classification accuracy was determined. With regard to accuracy calculation, for cross-validation it is recommended that the confusion matrix be filled incrementally with results for all input samples partitioned into the various groups, and that accuracy then be calculated -- rather than calculating accuracy and averaging after each partition of training samples is used for testing. In other words, for e.g. 5-fold cross-validation, it is not recommended to calculate accuracy after the first 4/5ths of samples are used for training and the first 1/5th of samples are used for testing. Instead, it is better to determine accuracy after all 5 partitions have been used for testing to fill in the confusion matrix for each input sample considered along the way. Then, re-partition the samples into 5 groups again and repeat training and testing on each of the partitions. Another example would be to consider an analysis for which there are 100 input samples and 10-fold cross-validation is to be used. The suggestion is not to calculate average accuracy every time 10 of the samples are used for testing, but rather to go through the 10 partitions in order to fill in the confusion matrix for the entire set of 100 samples, and then calculate accuracy. This should be repeated e.g. 10 times, re-partitioning the samples each time.
The hold-out method of accuracy determination is another approach to assess the performance of k-nearest neighbor. Here, input samples are randomly split into 2 groups, with 2/3 (~66%) of the input samples assigned to the training set and the remaining 1/3 (~33%) of the samples assigned to testing. Training results are used to classify the test samples. A major criticism of the hold-out method when compared with cross-validation is that it makes inefficient use of the entire data set, since the data are split one time and used once in this configuration to assess classification accuracy.

It is important to recognize that the hold-out method is not the same as predicting class membership for an independent set of supplemental experimental validation samples. Validation sets are used when the goal is to confirm the predictive capabilities of a classification scheme based on the results from an independent set of supplemental samples not used previously for training and testing. Laboratory investigations involving molecular biology and genomics commonly use validation sets raised independently from the original training/testing samples. By using an independent set of validation samples, the ability of a set of pre-selected features (e.g. mRNA or microRNA transcripts, or proteins) to correctly classify new samples can be better evaluated. The attempt to validate a set of features using a new set of samples should be done carefully, since processing new samples at a later date using different lab protocols, buffers, and technicians can introduce significant systematic error into the investigation. As a precaution, a laboratory should plan on processing the independent validation set of samples in the same laboratory, using the same protocol and buffer solutions, the same technician(s), and preferably at the same time the original samples are processed. Waiting until a later phase in a study to generate the independent validation set of samples may seriously degrade the predictive ability of the features identified from the original samples, ultimately jeopardizing the classification study.

The data sets used varied over the number of classes, features, and statistical significance for class discrimination based on the feature-specific F-ratio tests. An important finding during the performance evaluation of k-nearest neighbor was that feature standardization improved accuracy for some data sets and did not reduce accuracy for any. On the other hand, while feature fuzzification improved performance for several data sets, it nevertheless resulted in decreased performance for one data set (Fisher Iris). The effect of feature standardization and fuzzification varies depending on the data set and the classifier being used. In an independent analysis of 14 classifiers applied to 9 large DNA microarray data sets, it was found that feature standardization or fuzzification improved performance for all classifiers except the naive Bayes classifier, quadratic discriminant analysis, and artificial neural networks (Peterson and Coleman, 2008). While standardization reduced the performance of only quadratic discriminant analysis, fuzzification reduced the performance of the naive Bayes, quadratic discriminant analysis, and artificial neural network classifiers.

In light of the transformations explored in this study of k-nearest neighbor classification, it is recommended that at least feature standardization be performed and its effects comparatively assessed when using k-nearest neighbor classification. In addition, the effect of the value of k should also be determined in order to identify regions where overlearning or overfitting may occur. Lastly, there may be unique characteristics of the sample and feature space being studied which may cause other classifiers to result in better (or worse) performance when compared with k-nearest neighbor classification. Hence, a full evaluation of k-nearest neighbor performance as a function of feature transformation and k is suggested.

Acknowledgments

We are grateful to the current and past librarians of the University of California-Irvine (UCI) Machine Learning Repository, namely, Patrick M. Murphy, David Aha, and Christopher J. Merz.

References

Aeberhard, S., Coomans, D., de Vel, O. Comparison of Classifiers in High Dimensional Settings. Tech. Rep. no. 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland, 1992.
Bailey, T., Jain, A. A note on distance-weighted k-nearest neighbor rules. IEEE Trans. Systems, Man, Cybernetics, Vol. 8, pp. 311-313, 1978.
Bermejo, S., Cabestany, J. Adaptive soft k-nearest-neighbour classifiers. Pattern Recognition, Vol. 33, pp. 1999-2005, 2000.
Cover, T.M., Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, IT-13(1):21–27, 1967.
Dubois, D., Prade, H. Fundamentals of Fuzzy Sets, Boston (MA), Kluwer, 2000.

Dudani, S.A. The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man Cybern., SMC-6:325–327, 1976.
Evett, I.W., Spiehler, E.J. Rule Induction in Forensic Science. Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN, 1987.
Fisher, R.A. The use of multiple measurements in taxonomic problems. Annals Eugenics, 7, Part II, 179-188, 1936.
Fix, E., Hodges, J.L. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
Forsyth, R.S., BUPA Liver Disorders. Nottingham NG3 5DX, 0602-621676, 1990.
Fukunaga, K., Hostetler, L. k-nearest-neighbor bayes-risk estimation. IEEE Trans. Information Theory, 21(3),
285-293, 1975.
Guvenir, H.A., Demiroz, G., Ilter, N. Learning differential diagnosis of erythemato-squamous diseases using
voting feature intervals. Artificial Intelligence in Medicine. 13(3):147-165; 1998.
Hellman, M.E. The nearest neighbor classification rule with a reject option. IEEE Trans. Syst. Man Cybern.,
3:179–185, 1970.
Jozwik, A. A learning scheme for a fuzzy k-nn rule. Pattern Recognition Letters, 1:287–289, 1983.
Keller, J.M., Gray, M.R., Givens, J.A. A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern., SMC-15(4):580–585, 1985.
Klir, G.J., Yuan, B. Fuzzy Sets and Fuzzy Logic, Upper Saddle River(NJ), Prentice-Hall, 1995.
Michalski, R.S., Chilausky R.L. Learning by Being Told and Learning from Examples: An Experimental
Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for
Soybean Disease Diagnosis. International Journal of Policy Analysis and Information Systems. 4(2), 1980.
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases
[1] (http://www.ics.uci.edu/~mlearn/MLRepository.html) . Irvine, CA: University of California, Department
of Information and Computer Science.
Pal, S.K., Mitra, P. Pattern Recognition Algorithms for Data Mining: Scalability, Knowledge Discovery and
Soft Granular Computing. Boca Raton (FL), Chapman & Hall, 2004.
Peterson, L.E., Coleman, M.A. Machine learning-based receiver operating characteristic (ROC) curves for
crisp and fuzzy classification of DNA microarrays in cancer research. Int. J. of Approximate Reasoning. 47,
17-36; 2008.
Sigillito, V.G., Wing, S.P., Hutton, L.V., Baker, K.B. Classification of radar returns from the ionosphere using
neural networks. Johns Hopkins APL Technical Digest, 10, 262-266; 1989.
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press, 1988.

Voronoi, G. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Journal für
die Reine und Angewandte Mathematik. 133, 97-178; 1907.
Wolberg, W.H., Mangasarian, O.L. Multisurface method of pattern separation for medical diagnosis applied
to breast cytology. PNAS. 87: 9193-9196; 1990.
Internal references
Jan A. Sanders (2006) Averaging. Scholarpedia, 1(11):1760.
Milan Mares (2006) Fuzzy sets. Scholarpedia, 1(10):2031.

Recommended reading

k Nearest Neighbor Demo (Java) (http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html)


k Nearest Neighbor - Electronic Statistical Textbook (Statsoft, Inc.) (http://www.statsoft.com/textbook/stknn.html)
Dasarthy, B.V. Nearest Neighbor Classification Techniques. IEEE Press, Hoboken(NJ), 1990.
(http://www.amazon.com/Nearest-Neighbor-Pattern-Classification-Techniques/dp/0818689307/)
Mitchell, T.M. Machine Learning. McGraw-Hill, Columbus(OH), 1997. (http://www.amazon.com/Machine-Learning-Mcgraw-Hill-International-Edit/dp/0071154671/)
Duda, R.O, Hart, P.G., Stork, D.E. Pattern Classification. John Wiley & Sons, New York(NY), 2001.
(http://www.amazon.com/Pattern-Classification-2nd-Richard-Duda/dp/0471056693/)
Hastie, T., Tibshirani, R., Friedman, J.H. The Elements of Statistical Learning. Berlin, Springer-Verlag, 2001.
(http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845/)
