Abstract
Deep neural networks require large training sets but suffer
from high computational cost and long training times. Train-
ing on much smaller training sets while maintaining nearly
the same accuracy would be very beneficial. In the few-shot
learning setting, a model must learn a new class given only
a small number of samples from that class. One-shot learn-
ing is an extreme form of few-shot learning where the model
must learn a new class from a single example. We propose the
‘less than one’-shot learning task where models must learn
N new classes given only M < N examples and we show
that this is achievable with the help of soft labels. We use a
soft-label generalization of the k-Nearest Neighbors classifier
to explore the intricate decision landscapes that can be cre-
ated in the ‘less than one’-shot learning setting. We analyze
these decision landscapes to derive theoretical lower bounds
for separating N classes using M < N soft-label samples
and investigate the robustness of the resulting systems.

Figure 1: A SLaPkNN classifier is fitted on 2 soft-label prototypes and partitions the space into 3 classes. The soft label distribution of each prototype is illustrated by the pie charts.
Introduction

Deep supervised learning models are extremely data-hungry, generally requiring a very large number of samples to train on. Meanwhile, it appears that humans can quickly generalize from a tiny number of examples (Lake, Salakhutdinov, and Tenenbaum 2015). Getting machines to learn from ‘small’ data is an important aspect of trying to bridge this gap in abilities. Few-shot learning (FSL) is one approach to making models more sample-efficient. In this setting, models must learn to discern new classes given only a few examples per class (Lake, Salakhutdinov, and Tenenbaum 2015; Snell, Swersky, and Zemel 2017; Wang et al. 2020). Further progress in this area has enabled a more extreme form of FSL called one-shot learning (OSL), a difficult task where models must learn to discern a new class given only a single example of it (Fei-Fei, Fergus, and Perona 2006; Vinyals et al. 2016). In this paper, we propose ‘less than one’-shot learning (LO-shot learning), a setting where a model must learn N new classes given only M < N examples, less than one example per class. At first glance, this appears to be an impossible task, but we demonstrate feasibility both theoretically and empirically. As an analogy, consider an alien zoologist who has arrived on Earth and is tasked with catching a unicorn. It has no familiarity with local fauna and there are no photos of unicorns, so humans show it a photo of a horse and a photo of a rhinoceros, and say that a unicorn is something in between. With just two examples, the alien has now learned to recognize three different animals. The type of information sharing that allows us to describe a unicorn by its common features with horses and rhinoceroses is the key to enabling LO-shot learning. In particular, this unintuitive setting relies on soft labels to encode and decode more information from each example than is otherwise possible. An example can be seen in Figure 1, where two samples with soft labels are used to separate a space into three classes. There is already some existing evidence that LO-shot learning is feasible. Sucholutsky and Schonlau (2019) showed that it is possible to design a set of five soft-labelled synthetic images, sometimes called ‘prototypes’, that train neural networks to over 90% accuracy on the ten-class MNIST task. We discuss the algorithm used to achieve this result in more detail in the next section. In this paper, we aim to investigate the theoretical foundations of LO-shot learning, so we choose to work with a simpler model that is easier to analyze than deep neural networks. Specifically, we propose a generalization of the k-Nearest Neighbors classifier that can be fitted on soft-label points. We use it to explore the intricate decision landscapes that can still be created even with an extremely limited number of training samples. We first analyze these decision landscapes in a data-agnostic way to derive theoretical lower bounds for separating N classes using M < N soft-label samples. Unexpectedly, we find that our model fitted on just two prototypes with carefully designed soft labels can be used to divide the decision space into any finite number of classes. Additionally, we provide a method for analyzing the stability and robustness of the created decision landscapes. We find that by carefully tuning the hyper-parameters we can elicit certain desirable properties. We also perform a case study to confirm that soft labels can be used to represent training sets using fewer prototypes than there are classes, achieving large increases in sample-efficiency over regular (hard-label) prototypes. In extreme cases, using soft labels can even reduce the minimal number of prototypes required to perfectly separate N classes from O(N²) down to O(1).

The rest of this paper is split into three sections. In the first section we discuss related work. In the second section we explore decision landscapes that can be created in the ‘less than one’-shot learning setting and derive theoretical lower bounds for separating N classes using M < N soft-label samples. We also examine the robustness of the decision landscapes to perturbations of the prototypes. In the final section we discuss the implications of this paper and propose future directions for research.

Related Work

Wang et al. (2018) showed that Dataset Distillation (DD) can use backpropagation to create small synthetic datasets that train neural networks to nearly the same accuracy as when training on the original datasets. Networks can reach over 90% accuracy on MNIST after training on just one such distilled image per class (ten in total), an impressive example of one-shot learning. Sucholutsky and Schonlau (2019) showed that dataset sizes could be reduced even further by enhancing DD with soft, learnable labels. Soft-Label Dataset Distillation (SLDD) can create a dataset of just five distilled images (less than one per class) that trains neural networks to over 90% accuracy on MNIST. In other words, SLDD can create five samples that allow a neural network to separate ten classes. To the best of our knowledge, this is the first example of LO-shot learning and it motivates us to further explore this direction.

However, we do not focus on the process of selecting or generating a small number of prototypes based on a large training dataset. There are already numerous methods making use of various tricks and heuristics to achieve impressive results in this area. These include active learning (Cohn, Ghahramani, and Jordan 1996; Tong and Koller 2001), coreset selection (Tsang, Kwok, and Cheung 2005; Bachem, Lucic, and Krause 2017; Sener and Savarese 2017), pruning (Angelova, Abu-Mostafam, and Perona 2005), and kNN prototype methods (Bezdek and Kuncheva 2001; Triguero et al. 2011; Garcia et al. 2012; Kusner et al. 2014), among many others. There are even methods that perform soft-label dataset condensation (Ruta 2006). We instead focus on theoretically establishing the link between soft-label prototypes and ‘less than one’-shot learning.

Our analysis is centered around a distance-weighted kNN variant that can make use of soft-label prototypes. Distance-weighted kNN rules have been studied extensively (Dudani 1976; Macleod, Luk, and Titterington 1987; Gou et al. 2012; Yigit 2015) since inverse distance weighting was first proposed by Shepard (1968). Much effort has also gone into providing finer-grained class probabilities by creating soft, fuzzy, and conditional variants of the kNN classifier (Mitchell and Schaefer 2001; El Gayar, Schwenker, and Palm 2006; Thiel 2008; El-Zahhar and El-Gayar 2010; Kanjanatarakul, Kuson, and Denoeux 2018; Wang and Zhang 2019; Gweon, Schonlau, and Steiner 2019). We chose our kNN variant, described in the next section, because it works well with soft-label prototypes but remains simple and easy to implement.

Several studies have been conducted on the robustness and stability of kNN classifiers. El Gayar, Schwenker, and Palm (2006) found that using soft labels with kNN resulted in classifiers that were more robust to noise in the training data. Sun, Qiao, and Cheng (2016) proposed a nearest neighbor classifier with optimal stability based on their extensive study of the stability of hard-label kNN classifiers for binary classification. Our robustness and stability analyses focus specifically on the LO-shot learning setting, which has never previously been explored.

‘Less Than One’-Shot Learning

We derive and analyze several methods of configuring M soft-label prototypes to divide a space into N classes using the distance-weighted SLaPkNN classifier. All proofs for results in this section can be found in the appendix contained in the supplemental materials.

Definitions

Definition 1. A hard label is a vector of length N representing a point’s membership to exactly one out of N classes.

y^{hard} = e_i = [0, …, 0, 1, 0, …, 0]^T

Hard labels can only be used when each point belongs to exactly one class. If there are n classes, and some point x belongs to class i, then the hard label for this point is the ith unit vector from the standard basis.

Definition 2. A soft label is the vector-representation of a point’s simultaneous membership to several classes. Soft labels can be used when each point is associated with a distribution of classes. We denote soft labels by y^{soft}.

Definition 2.1. A probabilistic (soft) label is a soft label whose elements form a valid probability distribution.

∀ i ∈ {1, …, n}: y_i^{soft} ≥ 0, and Σ_{i=1}^{n} y_i^{soft} = 1
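The SLaPkNN classifier used throughout this section is defined formally in a part of the paper outside this excerpt; the text describes it only as a distance-weighted kNN variant fitted on soft-label prototypes. The following is a minimal sketch of such a classifier, assuming inverse-distance weighting and an argmax over the summed, weighted soft labels; the function name `slapknn_predict` and the small epsilon guard against division by zero are our own additions, not from the paper.

```python
import numpy as np

def slapknn_predict(prototypes, soft_labels, queries, k=2):
    """Sketch of a soft-label prototype kNN (SLaPkNN-style) prediction.

    For each query, take the k nearest prototypes, weight each prototype's
    soft label by inverse distance, sum the weighted labels, and predict
    the class with the largest total influence.
    """
    prototypes = np.asarray(prototypes, dtype=float)
    soft_labels = np.asarray(soft_labels, dtype=float)
    preds = []
    for q in np.atleast_2d(np.asarray(queries, dtype=float)):
        d = np.linalg.norm(prototypes - q, axis=1)
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-12)  # inverse-distance weights
        influence = (w[:, None] * soft_labels[nearest]).sum(axis=0)
        preds.append(int(np.argmax(influence)))
    return preds

# Two prototypes with probabilistic labels over four classes: the symmetric
# 6/14, 5/14, 3/14, 0 construction discussed later in the text.
x = [[0.0, 0.0], [4.0, 0.0]]
y = [[6/14, 5/14, 3/14, 0.0],
     [0.0, 3/14, 5/14, 6/14]]
# Sweeping along the segment between the two prototypes crosses four classes.
print(slapknn_predict(x, y, [[0.5, 0.0], [1.5, 0.0], [2.5, 0.0], [3.5, 0.0]]))
# → [0, 1, 2, 3]
```

With only two soft-label prototypes, this sketch recovers four distinct classes between them, matching the interval-partitioning construction analyzed later in this section.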
Examining the asymptotic behavior as the total number of classes increases, we notice a potentially undesirable property of this configuration. Increases in M ‘dilute’ classes a and c, which results in them shrinking towards x_0. In the limit, only class b remains. It is possible to find a configuration of probabilistic labels that results in asymptotically stable classes, but this would require either lifting our previous restriction that the third label value be zero when separating three classes with two points, or changing the underlying geometrical arrangement of our prototypes.

Figure 3: SLaPkNN can separate 2M − 1 classes using M soft-label prototypes. (a) Seven classes using four soft-label prototypes. (b) Nine classes using five soft-label prototypes.

Proposition 4. Suppose M soft-label prototypes are arranged as the vertices of an M-sided regular polygon. There exist soft labels (Y_1, …, Y_M) such that fitting SLaPkNN with k = 2 will divide the space into 2M classes.

In this configuration, it is possible to decouple the system from the number of pairs that each prototype participates in. It can then be shown that the local decision boundaries, in the neighbourhood of any pair of adjacent prototypes, do […]tions that further increase the number of classes that can be separated by M soft-label prototypes.

Theorem 5. Suppose M soft-label prototypes are arranged as the vertices and center of an (M − 1)-sided regular polygon. There exist soft labels (Y_1, …, Y_M) such that fitting SLaPkNN with k = M will divide the space into 3M − 2 classes.

Figure 5: SLaPkNN can separate 3M − 2 classes using M soft-label prototypes. (a) Ten classes using four soft-label prototypes. (b) Thirteen classes using five soft-label prototypes.

Interval Partitioning Methods. We now aim to show that two points can even induce multiple classes between them.

Lemma 6. Assume that two points are positioned 4 units apart in two-dimensional Euclidean space. Without loss of generality, suppose that point x_1 = (0, 0) and point x_2 = (4, 0) have probabilistic labels y_1 and y_2, respectively. We denote the ith element of each label by y_{1,i} and y_{2,i}. There exist values of y_1 and y_2 such that SLaPkNN with k = 2 can separate four classes when fitted on x_1 and x_2.

We can again find a solution to this system that has symmetrical labels and also sets an element of the label to zero:

y_{1,1} = y_{2,4} = 6/14
y_{1,2} = y_{2,3} = 5/14
y_{1,3} = y_{2,2} = 3/14
y_{1,4} = y_{2,1} = 0/14

We visualize the results of fitting a SLaPkNN classifier with k = 2 to a set of two points with these labels in Figure 6.

Figure 6: A SLaPkNN classifier is fitted on two points and used to partition the space into four classes. The probabilistic soft labels of each point are illustrated by the pie charts.

We can further extend this line of reasoning to produce the main theorem of this paper.

Theorem 7 (Main Theorem). SLaPkNN with k = 2 can separate n ∈ [1, ∞) classes using two soft-label prototypes.

This unexpected result is crucial for enabling extreme LO-shot learning as it shows that in some cases we can completely disconnect the number of classes from the number of prototypes required to separate them. The full proof can be found in the supplemental materials, but we provide a system of equations describing soft labels that result in two soft-label prototypes separating n ∈ [1, ∞) classes:

y_{1,i} = y_{2,(n−i)} = (Σ_{j=i}^{n−1} j) / (Σ_{j=1}^{n−1} j²) = (n(n − 1) − i(i − 1)) / (2 Σ_{j=1}^{n−1} j²),   i = 1, 2, …, n

Other Results

The diverse decision landscapes we have already seen were generated using probabilistic labels and k = 2. Using unrestricted soft labels and modifying k enables us to explore a much larger space of decision landscapes. Due to space constraints, we present only a few of the results from our experiments with modifying these parameters, but more are included in the accompanying supplemental material. One unexpected result is that SLaPkNN can generate any number of concentric elliptical decision bounds using a fixed number of prototypes, as seen in Figures 7a and 7b. In Figure 7d, we show that variations in k can cause large changes in the decision landscape, even producing non-contiguous classes.

Robustness

In order to understand the robustness of LO-shot learning landscapes to noise, we aim to analyze which regions are at highest risk of changing their assigned class if the prototype positions or labels are perturbed. We use the difference between the largest predicted label value and second-largest predicted label value as a measure of confidence in our prediction for a given point. The risk of a given point being reclassified is inversely proportional to the confidence. Since our aim is to inspect the entire landscape at once, rather than individual points, we visualize risk as a gradient from black (high confidence/low risk) to white (low confidence/high risk) over the space. We find that due to the inverse distance weighting mechanism, there are often extremely high confidence regions directly surrounding the prototypes, resulting in a very long-tailed, right-skewed distribution of confidence values. As a result, we need to either clip values or convert them to a logarithmic scale in order to properly visualize them. We generally find that clipping is helpful for understanding intra-class changes in risk, while the log scale is more suited for visualizing inter-class changes in risk that occur at decision boundaries.

Figure 7 visualizes the risk gradient for many of the decision landscapes derived above. Overall, we find that LO-shot learning landscapes have lower risk in regions that are distant from decision boundaries and lowest risk near the prototypes themselves. Changes in k can have a large effect on the inter-class risk behaviour, which can be seen as changes in the decision boundaries. However, they tend to have a smaller effect on intra-class behavior in regions close to the prototypes. The most noticeable changes occur when increasing k from 1 to 2, as this causes a large effect on the decision landscape itself, and from 2 to 3, which tends to not affect the decision landscape but causes a sharp increase in risk at decision boundaries. Meanwhile, increasing the number of classes (M) can in some cases cause large changes in the intra-class risk behaviour, but we can design prototype configurations (such as the ellipse-generating configuration) that prevent this behaviour. In general, it appears that by tuning the prototype positions and labels we can simultaneously elicit desired properties both in the decision landscape and risk gradient. This suggests that LO-shot learning is not only feasible, but can also be made at least partially robust to noisy training data.

Figure 7: Various LO-shot learning decision landscapes and risk gradients are presented. Each color represents a different class. Gray-scale is used to visualize the risk gradient, with darker shades corresponding to lower risk. In (a), the two colorful charts show decision landscapes and the two gray-scale charts show risk landscapes. In (b)-(d), the risk gradient is laid directly over the decision landscape. The pie charts represent the soft labels of each prototype. (a) SLaPkNN with k = 3 is shown separating N = 3, 5, 7, 9 classes (left to right, top to bottom) with elliptical decision boundaries after being fitted on 3 prototypes. The pie charts represent unrestricted labels. (b) SLaPkNN with k = 2 is shown separating N = 2, 3, 4, 5 classes (left to right, top to bottom) after being fitted on 2 prototypes. The pie charts represent probabilistic labels. (c) SLaPkNN with k = 2 is shown separating 2M classes after being fitted on M = 4, 6, 8, 10 prototypes (left to right, top to bottom). The pie charts represent probabilistic labels. (d) SLaPkNN with k = 1, 2, 3, 4, 5 (left to right, top to bottom) is shown separating 15 classes after being fitted on 8 prototypes. The pie charts represent unrestricted labels.

Figure 8: Decision boundaries of a vanilla 1NN classifier fitted on the minimum number of prototypes required to perfectly separate circle classes. From inner to outer, the circles have 1, 4, 7, 10, 13, and 16 prototypes.

Figure 9: SLaPkNN can separate 6 circles using 5 soft-label prototypes. Each pie chart represents the soft label of one prototype, and is labeled with its location. 4 of the prototypes are located outside of the visible range of the chart.

Case Study: Prototype Generation for Circles

We consider a simple example that captures the difference between hard and soft-label prototype generation. Assume that our data consists of concentric circles, where each circle is associated with a different class. We wish to select or generate the minimal number of prototypes such that the associated kNN classification rule will separate each circle as a different class. We tested several hard-label prototype methods and found their performance to be poor on this simulated data. Many of the methods produced prototypes that did not achieve separation between classes, which prevented us from performing a fair comparison between hard and soft-label prototypes. In order to allow such a comparison, we analytically derive a near-optimal hard-label prototype configuration for fitting a 1NN classifier to the circle data.

Theorem 8. Suppose we have N concentric circles with radius r_t = t·c for the tth circle. An upper bound for the minimum number of hard-label prototypes required for 1NN to produce decision boundaries that perfectly separate all circles is

Σ_{t=1}^{N} π / cos⁻¹(1 − 1/(2t²)),

with the tth circle requiring π / cos⁻¹(1 − 1/(2t²)) prototypes.

The proof can be found in the appendix. We can use the approximation cos⁻¹(1 − y) ≈ √(2y) to get that

π / cos⁻¹(1 − 1/(2t²)) ≈ π / (1/t) = tπ.

Thus the upper bound for the minimum number of prototypes required is approximately Σ_{t=1}^{N} tπ = N(N + 1)π/2. Figure 8 visualizes the decision boundaries of a regular kNN that perfectly separates six concentric circles given ⌈tπ⌉ prototypes for the tth circle. It is possible that the number of prototypes may be slightly reduced by carefully adjusting the prototype locations on adjacent circles to maximize the minimal distance between the prototype midpoints of one circle and the prototypes of neighboring circles. However, SLaPkNN requires only a constant number of prototypes to generate concentric ellipses. In Figure 9, SLaPkNN fitted on five soft-label prototypes is shown separating the six circles. The decision boundaries created by SLaPkNN match the underlying geometry of the data much more accurately than those created by 1NN. This is because 1NN can only create piecewise-linear decision boundaries.

In this case study, enabling soft labels reduced the minimal number of prototypes required to perfectly separate N classes from O(N²) down to O(1). We note that the number of required hard-label prototypes may be reduced by carefully tuning k as well as the neighbor weighting mechanism (e.g. uniform or distance-based). However, even in the best case, the number of required hard-label prototypes is at the very least linear in the number of classes.

Conclusion

We have presented a number of results that we believe can be used to create powerful and efficient dataset condensation and prototype generation techniques. More generally, our contributions lay the theoretical foundations necessary to establish ‘less than one’-shot learning as a viable new direction in machine learning research. We have shown that even a simple classifier like SLaPkNN can perform LO-shot learning, and we have proposed a way to analyze the robustness of the decision landscapes produced in this setting. We believe that creating a soft-label prototype generation algorithm that specifically optimizes prototypes for LO-shot learning is an important next step in exploring this area. Such an algorithm will also be helpful for empirically analyzing LO-shot learning in high-dimensional spaces where manually designing soft-label prototypes is not feasible.

Additionally, we are working on showing that LO-shot learning is compatible with a large variety of machine learning models. Improving prototype design is critical for speeding up instance-based, or lazy learning, algorithms like kNN by reducing the size of their training sets. However, eager learning models like deep neural networks would benefit more from the ability to learn directly from a small number of real samples to enable their usage in settings where little training data is available. This remains a major open challenge in LO-shot learning.
References

Angelova, A.; Abu-Mostafam, Y.; and Perona, P. 2005. Pruning training sets for learning of object categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, 494–501. IEEE.

Bachem, O.; Lucic, M.; and Krause, A. 2017. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476.

Bezdek, J. C.; and Kuncheva, L. I. 2001. Nearest prototype classifier designs: An experimental study. International Journal of Intelligent Systems 16(12): 1445–1473.

Cohn, D. A.; Ghahramani, Z.; and Jordan, M. I. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research 4: 129–145.

Dudani, S. A. 1976. The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics (4): 325–327.

El Gayar, N.; Schwenker, F.; and Palm, G. 2006. A study of the robustness of KNN classifiers trained using soft labels. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, 67–80. Springer.

El-Zahhar, M. M.; and El-Gayar, N. F. 2010. A semi-supervised learning approach for soft labeled data. In 2010 10th International Conference on Intelligent Systems Design and Applications, 1136–1141. IEEE.

Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4): 594–611.

Garcia, S.; Derrac, J.; Cano, J.; and Herrera, F. 2012. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3): 417–435.

Gou, J.; Du, L.; Zhang, Y.; Xiong, T.; et al. 2012. A new distance-weighted k-nearest neighbor classifier. Journal of Information and Computational Science 9(6): 1429–1436.

Gweon, H.; Schonlau, M.; and Steiner, S. H. 2019. The k conditional nearest neighbor algorithm for classification and class probability estimation. PeerJ Computer Science 5: e194.

Kanjanatarakul, O.; Kuson, S.; and Denoeux, T. 2018. An evidential K-nearest neighbor classifier based on contextual discounting and likelihood maximization. In International Conference on Belief Functions, 155–162. Springer.

Kusner, M.; Tyree, S.; Weinberger, K.; and Agrawal, K. 2014. Stochastic neighbor compression. In International Conference on Machine Learning, 622–630.

Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266): 1332–1338.

Macleod, J. E.; Luk, A.; and Titterington, D. M. 1987. A re-examination of the distance-weighted k-nearest neighbor classification rule. IEEE Transactions on Systems, Man, and Cybernetics 17(4): 689–696.

Mitchell, H.; and Schaefer, P. 2001. A “soft” K-nearest neighbor voting scheme. International Journal of Intelligent Systems 16(4): 459–468.

Ruta, D. 2006. Dynamic data condensation for classification. In International Conference on Artificial Intelligence and Soft Computing, 672–681. Springer.

Sener, O.; and Savarese, S. 2017. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489.

Shepard, D. 1968. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, 517–524.

Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.

Sucholutsky, I.; and Schonlau, M. 2019. Soft-Label Dataset Distillation and Text Dataset Distillation. arXiv preprint arXiv:1910.02551.

Sun, W. W.; Qiao, X.; and Cheng, G. 2016. Stabilized nearest neighbor classifier and its statistical properties. Journal of the American Statistical Association 111(515): 1254–1265.

Thiel, C. 2008. Classification on soft labels is robust against label noise. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 65–73. Springer.

Tong, S.; and Koller, D. 2001. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2(Nov): 45–66.

Triguero, I.; Derrac, J.; Garcia, S.; and Herrera, F. 2011. A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(1): 86–100.

Tsang, I. W.; Kwok, J. T.; and Cheung, P.-M. 2005. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research 6(Apr): 363–392.

Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 3630–3638.

Wang, S.; and Zhang, L. 2019. Regularized Deep Transfer Learning: When CNN Meets kNN. IEEE Transactions on Circuits and Systems II: Express Briefs.

Wang, T.; Zhu, J.-Y.; Torralba, A.; and Efros, A. A. 2018. Dataset distillation. arXiv preprint arXiv:1811.10959.

Wang, Y.; Yao, Q.; Kwok, J. T.; and Ni, L. M. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR) 53(3): 1–34.

Yigit, H. 2015. ABC-based distance-weighted kNN algorithm. Journal of Experimental & Theoretical Artificial Intelligence 27(2): 189–198.