K-nearest neighbor
Leif E. Peterson (2009), Scholarpedia, 4(2):1883. doi:10.4249/scholarpedia.1883, revision #137311
Leif E. Peterson, Center for Biostatistics, The Methodist Hospital Research Institute
K-nearest-neighbor (kNN) classification is one of the most fundamental and simple classification methods
and should be one of the first choices for a classification study when there is little or no prior knowledge about
the distribution of the data. K-nearest-neighbor classification was developed from the need to perform
discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to
determine. In an unpublished US Air Force School of Aviation Medicine report in 1951, Fix and Hodges introduced a non-parametric method for pattern classification that has since become known as the k-nearest-neighbor rule (Fix & Hodges, 1951). Later, in 1967, some of the formal properties of the k-nearest-neighbor rule were worked out; for instance, it was shown that for k = 1 and n → ∞ the k-nearest-neighbor classification error is bounded above by twice the Bayes error rate (Cover & Hart, 1967). Once such formal properties of k-nearest-
neighbor classification were established, a long line of investigation ensued including new rejection approaches
(Hellman, 1970), refinements with respect to Bayes error rate (Fukunaga & Hostetler, 1975), distance weighted
approaches (Dudani, 1976; Bailey & Jain, 1978), soft computing methods (Bermejo & Cabestany, 2000), and fuzzy methods (Jozwik, 1983; Keller et al., 1985).
Contents
1 Characteristics of kNN
1.1 Between-sample geometric distance
1.2 Classification decision rule and confusion matrix
1.3 Feature transformation
1.4 Performance assessment with cross-validation
2 Pseudocode (algorithm)
3 Commonly Employed Data Sets
4 Performance Evaluation
5 Acknowledgments
6 References
7 Recommended reading
Characteristics of kNN
Classification decision rule and confusion matrix
Classification typically involves partitioning samples into training and testing categories. Let $x_i$ be a training sample and $x$ be a test sample, and let $\omega$ be the true class of a training sample and $\hat{\omega}$ be the predicted class for a test sample ($\omega, \hat{\omega} = 1, 2, \ldots, \Omega$), where $\Omega$ is the total number of classes.
During the training process, we use only the true class $\omega$ of each training sample to train the classifier, while during testing we predict the class $\hat{\omega}$ of each test sample. It warrants noting that kNN is a "supervised" classification method in that it uses the class labels of the training data. Unsupervised classification methods, or "clustering" methods, on the other hand, do not employ the class labels of the training data.
With the 1-nearest neighbor rule, the predicted class of test sample $x$ is set equal to the true class $\omega$ of its nearest neighbor, where $m_i$ is a nearest neighbor to $x$ if the distance $d(m_i, x) = \min_j \{ d(m_j, x) \}$.
The classification accuracy based on the confusion matrix $C$ is
\[
Acc = \frac{\sum_{\omega=1}^{\Omega} c_{\omega\omega}}{n_{total}},
\]
where $c_{\omega\omega}$ is a diagonal element of $C$ and $n_{total}$ is the total number of samples classified.
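For illustration, a minimal sketch (not from the original article) of computing this accuracy from a hypothetical confusion matrix, using numpy:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix C: rows are true classes, columns predicted.
C = np.array([[8, 2],
              [1, 9]])

n_total = C.sum()              # total number of samples classified
acc = np.trace(C) / n_total    # sum of diagonal elements c_ww divided by n_total
print(f"Acc = {acc:.3f}")      # 17/20 = 0.850
```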
Consider a machine learning study to classify 19 samples, each with 2 features, $X$ and $Y$. Table 1 lists the pairwise Euclidean distances between the 19 samples. Assume that sample $x_{11}$ is being used as a test sample while all remaining samples are used for training. For k=4, the four samples closest to sample $x_{11}$ are sample $x_{10}$ (blue class label), sample $x_{12}$ (red class label), $x_{13}$ (red class label), and $x_{14}$ (red class label).

Figure 2 shows an X-Y scatterplot of the 19 samples plotted as a function of their $X$ and $Y$ values. One can notice that among the four samples closest to test sample $x_{11}$ (labeled green), 3/4 of the class labels are for Class A (red color), and therefore the test sample is assigned to Class A.
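The decision rule in this example can be sketched in Python. The coordinates below are made up (the actual values behind Table 1 are not reproduced here), so this only mirrors the structure of the example: Euclidean distances from a test point, then a majority vote among the k = 4 nearest training samples:

```python
import numpy as np
from collections import Counter

# Made-up training coordinates and labels standing in for the 19-sample example.
train_xy = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]])
train_labels = ["B", "B", "A", "A", "A"]
test_xy = np.array([5.2, 8.2])      # the test sample, analogous to x11

# Euclidean distance from the test sample to every training sample.
dists = np.sqrt(((train_xy - test_xy) ** 2).sum(axis=1))

k = 4
nearest = np.argsort(dists)[:k]                     # indices of the k closest samples
votes = Counter(train_labels[i] for i in nearest)   # tally the neighbors' class labels
print(votes.most_common(1)[0][0])                   # "A": 3 of the 4 neighbors are Class A
```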
Feature transformation
Increased performance of a classifier can sometimes be achieved when the feature values are transformed prior
to classification analysis. Two commonly used feature transformations are standardization and fuzzification.
Standardization removes scale effects caused by use of features with different measurement scales. For
example, if one feature is based on patient weight in units of kg and another feature is based on blood protein
values in units of ng/dL in the range [-3,3], then patient weight will have a much greater influence on the distance between samples. Feature values can be standardized by transforming them into z-scores:
\[
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\]
where $x_{ij}$ is the value for the $i$th sample and $j$th feature, $\mu_j$ is the average of all $x_{ij}$ for feature $j$, and $\sigma_j$ is the standard deviation of all $x_{ij}$ over all input samples. If the feature
values take on a Gaussian distribution, then the histogram of z-scores will represent a standard normal distribution having a mean of zero and variance of unity. Once standardization is performed on a set of features, the range and scale of the z-scores should be similar, providing the distributions of raw feature values are alike.

Figure 2: X-Y scatterplot of the 19 samples for which the pairwise Euclidean distances are listed in Table 1. Among the 4 nearest neighbors of the test sample, the most frequent class label color is red, and thus the test sample is assigned to the red class.
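A minimal sketch of z-score standardization, assuming numpy and made-up feature values:

```python
import numpy as np

def standardize(X):
    """Replace each feature column with its z-scores: z_ij = (x_ij - mu_j) / sigma_j."""
    mu = X.mean(axis=0)       # mu_j: average of feature j over all samples
    sigma = X.std(axis=0)     # sigma_j: standard deviation of feature j
    return (X - mu) / sigma

# Made-up data: patient weight in kg next to a protein value in [-3, 3].
X = np.array([[70.0, 0.5], [85.0, -1.2], [60.0, 2.1], [95.0, -0.3]])
Z = standardize(X)
print(Z.mean(axis=0))   # approximately [0, 0]
print(Z.std(axis=0))    # [1, 1] up to floating-point error
```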
Fuzzification is a transformation which exploits
uncertainty in feature values in order to increase
classification performance. Fuzzification replaces the original features by mapping original values of an input
feature into 3 fuzzy sets representing linguistic membership functions in order to facilitate the semantic
interpretation of each fuzzy set (Klir and Yuan, 1995; Dubois and Prade, 2000; Pal and Mitra, 2004). First, determine $x_{min}$ and $x_{max}$ as the minimum and maximum values of $x_{ij}$ for feature $j$ over all input samples, and $q_1$ and $q_2$ as the quantile values of $x_{ij}$ at the 33rd and 66th percentile. Next, calculate the averages $Avg_1 = (x_{min} + q_1)/2$, $Avg_2 = (q_1 + q_2)/2$, and $Avg_3 = (q_2 + x_{max})/2$. Finally, translate each value of $x_{ij}$ for feature $j$ into 3 fuzzy membership values in the range [0,1], $\mu_{low,i,j}$, $\mu_{med,i,j}$, and $\mu_{high,i,j}$, using the relationships
\[
\mu_{low,i,j} = \begin{cases}
1 & x < Avg_1 \\
\frac{q_2 - x}{q_2 - Avg_1} & Avg_1 \le x < q_2 \\
0 & x \ge q_2,
\end{cases}
\]
\[
\mu_{med,i,j} = \begin{cases}
0 & x < q_1 \\
\frac{x - q_1}{Avg_2 - q_1} & q_1 \le x < Avg_2 \\
\frac{q_2 - x}{q_2 - Avg_2} & Avg_2 \le x < q_2 \\
0 & x \ge q_2,
\end{cases}
\]
\[
\mu_{high,i,j} = \begin{cases}
0 & x < q_1 \\
\frac{x - q_1}{Avg_3 - q_1} & q_1 \le x < Avg_3 \\
1 & x \ge Avg_3.
\end{cases}
\]
The above computations result in 3 fuzzy sets (vectors) $\mu_{low,j}$, $\mu_{med,j}$, and $\mu_{high,j}$ of length $n$ which replace the original input feature.
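The three membership functions, as reconstructed above, can be sketched in Python; numpy's default percentile interpolation stands in for the article's 33rd/66th quantile computation, and the data vector is made up:

```python
import numpy as np

def fuzzify(x):
    """Map a feature vector x into 3 fuzzy membership vectors (low, med, high)
    using the piecewise-linear membership functions given above."""
    xmin, xmax = x.min(), x.max()
    q1, q2 = np.percentile(x, 33), np.percentile(x, 66)   # 33rd/66th percentiles
    avg1, avg2, avg3 = (xmin + q1) / 2, (q1 + q2) / 2, (q2 + xmax) / 2

    low = np.where(x < avg1, 1.0,
          np.where(x < q2, (q2 - x) / (q2 - avg1), 0.0))
    med = np.where(x < q1, 0.0,
          np.where(x < avg2, (x - q1) / (avg2 - q1),
          np.where(x < q2, (q2 - x) / (q2 - avg2), 0.0)))
    high = np.where(x < q1, 0.0,
           np.where(x < avg3, (x - q1) / (avg3 - q1), 1.0))
    return low, med, high

# A made-up feature vector of length n = 7; the 3 vectors returned replace it.
x = np.array([0.1, 0.4, 0.5, 0.7, 0.9, 1.3, 2.0])
mu_low, mu_med, mu_high = fuzzify(x)
```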
The statistical significance of class discrimination for the $j$th feature can be assessed by using the F-ratio test, given as
\[
F(j) = \frac{\sum_{\omega=1}^{\Omega} n_\omega (\bar{y}_\omega - \bar{y})^2 / (\Omega - 1)}{\sum_{\omega=1}^{\Omega} \sum_{i=1}^{n_\omega} (y_{\omega i} - \bar{y}_\omega)^2 / (n - \Omega)},
\]
where $n_\omega$ is the number of training samples in class $\omega$ ($\omega = 1, 2, \ldots, \Omega$), $\bar{y}_\omega$ is the mean feature value among training samples in class $\omega$, $\bar{y}$ is the mean feature value for all training samples, $y_{\omega i}$ is the feature value among training samples in class $\omega$, $(\Omega - 1)$ is the numerator degrees of freedom, and $(n - \Omega)$ is the denominator degrees of freedom for the F-ratio test. Tail probabilities, i.e., $\mathrm{Prob}_j$, are derived for values of the F-ratio statistic based on the numerator and denominator degrees of freedom. A simple way to quantify simultaneously the total statistical significance of class discrimination for $p$ independent features is to sum the minus natural logarithm of feature-specific p-values using the form
\[
\mathrm{sum}[-\log(\textrm{p-value})] = \sum_{j=1}^{p} -\ln(\mathrm{Prob}_j).
\]
High values of sum[-log(p-value)] for a set of features (>1000) suggest that the feature values are heterogeneous across the classes considered and can discriminate classes well, whereas low values of sum[-log(p-value)] (<100) suggest poor discrimination ability of the features.
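A sketch of this feature screening, assuming scipy's stats.f_oneway (which computes the one-way F-ratio with the same $(\Omega-1, n-\Omega)$ degrees of freedom) and synthetic two-class data:

```python
import numpy as np
from scipy import stats

def sum_neg_log_p(X, y):
    """Sum -ln(p) over all p features, where each p-value comes from the
    one-way F-ratio test of class discrimination described above."""
    classes = np.unique(y)
    total = 0.0
    for j in range(X.shape[1]):
        groups = [X[y == w, j] for w in classes]   # feature j split by class
        F, prob = stats.f_oneway(*groups)          # F-ratio, tail probability Prob_j
        total += -np.log(prob)
    return total

# Synthetic two-class data with 3 features; class means differ by 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(1, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
print(sum_neg_log_p(X, y))
```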
Pseudocode (algorithm)
Pseudocode is defined as a listing of sequential steps for solving a computational problem. Pseudocode is used by
computer programmers to mentally translate each computational step into a set of programming instructions
involving various mathematical operations (addition, subtraction, multiplication, division, power and
transcendental functions, differentiation/integration, etc.) and resources (vectors, arrays, graphics, input/output,
etc.) in order to solve an analytic problem. Following is a listing of pseudocode for the k-nearest-neighbor
classification method using cross-validation.
begin
  initialize the n × n distance matrix D, initialize the Ω × Ω confusion matrix C, set t ← 0, TotAcc ← 0, n_total ← 0, and set NumIterations equal to the desired number of iterations (re-partitions).
  calculate distances between all the input samples and store them in the n × n matrix D. (For a large number of samples, use only the lower or upper triangular of D for storage, since it is a square symmetric matrix.)
  for t ← 1 to NumIterations do
    assign samples in the tth partition (fold) to testing, and use the remaining samples for training. Set the number of samples used for testing as n_test.
    set n_total ← n_total + n_test.
    for i ← 1 to n_test do
      for test sample x_i, determine the k closest training samples based on the calculated distances.
      determine ω̂, the most frequent class label among the k closest training samples.
      increment confusion matrix C by 1 in element c_{ω,ω̂}, where ω is the true and ω̂ the predicted class label for test sample x_i. If ω = ω̂, the increment of +1 will occur on the diagonal of the confusion matrix; otherwise, it will occur in an off-diagonal element.
  determine the classification accuracy using Acc = (∑_{j=1}^{Ω} c_jj) / n_total, where c_jj is a diagonal element of the confusion matrix C.
end
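The pseudocode can be turned into a short, runnable Python sketch. Nothing below comes from the original article: the fold assignment is random, ties in the vote go to the first label Counter encounters, and names such as knn_cross_validation are illustrative:

```python
import numpy as np
from collections import Counter

def knn_cross_validation(X, y, k=5, n_folds=10, seed=0):
    """Sketch of the pseudocode above: accumulate a confusion matrix over
    cross-validation folds and report overall classification accuracy."""
    n = len(y)
    classes = np.unique(y)                        # the Omega class labels
    C = np.zeros((len(classes), len(classes)), dtype=int)

    # n x n matrix D of pairwise Euclidean distances between all input samples.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds          # random assignment of samples to folds

    for t in range(n_folds):
        test_idx = np.where(folds == t)[0]        # samples in the t-th partition
        train_idx = np.where(folds != t)[0]       # remaining samples for training
        for i in test_idx:
            # k closest training samples to test sample x_i
            nearest = train_idx[np.argsort(D[i, train_idx])[:k]]
            # predicted class: most frequent label among the k neighbors
            pred = Counter(y[nearest]).most_common(1)[0][0]
            # increment confusion matrix element c_{omega, omega_hat}
            C[np.searchsorted(classes, y[i]), np.searchsorted(classes, pred)] += 1

    acc = np.trace(C) / C.sum()                   # sum of diagonal elements / n_total
    return acc, C

# Example on synthetic two-class data:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
acc, C = knn_cross_validation(X, y, k=5, n_folds=10)
print(acc)
```

Setting n_folds = n in this sketch gives leave-one-out cross-validation (the "CV-1" scheme discussed below).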
The above pseudocode was applied to several commonly used data sets (see next section), where the fold value was varied in order to assess performance (accuracy) as a function of the size of the cross-validation partitions.
Commonly Employed Data Sets
Nine data sets from the Machine Learning Repository (http://mlearn.ics.uci.edu/MLRepository.html) of the University of California - Irvine (UCI) were used for several k-nearest neighbor runs (Newman et al., 1998). Table 2 lists the data sets, number of classes, number of samples, and number of features (attributes) in each data set.
Performance Evaluation
Figure 3 shows the strong linear relationship between 10-fold cross-validation accuracy and the ratio of the feature sum[-log(p)] to the number of features for the 9 data sets. The liver data set resulted in the lowest accuracy, while the Fisher Iris data resulted in the greatest accuracy. The low value of sum[-log(p-value)] for features in the liver data set will on average result in lower classification accuracy, whereas the greater level of sum[-log(p-value)] for the Fisher Iris data and cancer data set will yield much greater levels of accuracy.

Figure 3: Linear relationship between classification accuracy and the ratio sum[-log(p)]/#features. 5NN used with feature standardization.
Figure 4 reflects k-nearest neighbor performance (k=5,
feature standardization) for various cross validation
methods for each data set. 2- and 5-fold cross validation
("CV2" and "CV5") performed worse than 10-fold
("CV10") and leave-one-out cross validation ("CV-1"). 10-
fold cross validation ("CV10") was approximately the
same as leave-one-out cross validation ("CV-1").
Bootstrapping resulted in slightly lower performance
when compared with CV10 and CV-1.
Figure 5 shows that, when averaging performance over all data sets (k=5), both feature standardization and feature fuzzification resulted in greater accuracy levels when compared with no feature transformation.
Figures 6, 7, and 8 illustrate the CV10 accuracy for each data set as a function of k with no transformation, standardization, and fuzzification, respectively. It was apparent that feature standardization (Figure 7) and fuzzification (Figure 8) greatly improved the accuracy of the dermatology and wine data sets. Fuzzification (Figure 8) slightly reduced the performance of the Fisher Iris data set. Interestingly, performance for the soybean data set did not improve with increasing values of k, suggesting overlearning or overfitting.
Average accuracy as a function of k is shown for feature standardization and fuzzification for all data sets combined.
The hold-out method makes inefficient use of the entire data set, since data are split one time and used only once in this configuration to assess classification accuracy. It is important to recognize
that the hold-out method is not the same as predicting
class membership for an independent set of supplemental
experimental validation samples. Validation sets are used
when the goal is to confirm the predictive capabilities of a
classification scheme based on the results from an
independent set of supplemental samples not used
previously for training and testing. Laboratory investigations involving molecular biology and genomics commonly use validation sets raised independently from the original training/testing samples. By using an independent set of validation samples, the ability of a set of pre-selected features (e.g. mRNA or microRNA transcripts, or proteins) to correctly classify new samples can be better evaluated.

Figure 6: Bias as a function of k without feature transformation. 10-fold cross validation used.

The attempt to validate a set of
features using a new set of samples should be done
carefully, since processing new samples at a later date
using different lab protocols, buffers, and technicians can
introduce significant systematic error into the
investigation. As a precautionary method, a laboratory
should plan on processing the independent validation set
of samples in the same laboratory, using the same protocol and buffer solutions, the same technician(s), and preferably at the same time the original samples are processed. Waiting until a later phase in a study to generate the independent validation set of samples may seriously degrade the predictive ability of the features identified from the original samples, ultimately jeopardizing the classification study.

Figure 7: Bias as a function of k with feature standardization. 10-fold cross validation used.
Acknowledgments
References

Voronoi, G. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Journal für die Reine und Angewandte Mathematik. 133: 97-178; 1907.

Wolberg, W.H., Mangasarian, O.L. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. PNAS. 87: 9193-9196; 1990.
Internal references
Jan A. Sanders (2006) Averaging. Scholarpedia, 1(11):1760.
Milan Mares (2006) Fuzzy sets. Scholarpedia, 1(10):2031.
Recommended reading