
Journal of Intelligent & Fuzzy Systems 35 (2018) 1633–1644

DOI:10.3233/JIFS-169701
IOS Press

ARSkNN: An efficient k-nearest neighbor classification technique using mass based similarity measure

Ashish Kumar a,∗, Roheet Bhatnagar a and Sumit Srivastava b
a Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur, India
b Department of Information Technology, Manipal University Jaipur, Jaipur, India

∗ Corresponding author. Ashish Kumar, Department of Computer Science and Engineering, Manipal University Jaipur, 303007 Jaipur, India. E-mail: kumar.ashish@jaipur.manipal.edu.

Abstract. Even though distance computation is at the core of k-Nearest Neighbor classification techniques, similarity measures are often favored over distance in many realistic scenarios. Most of the similarity measures used to classify an instance are based on a geometric model, and their effectiveness decreases as the number of dimensions increases. This paper establishes an efficient technique, called ARSkNN, for finding the class of a given instance using a unique similarity measure that does not compute distance at all for k-NN classification. Our empirical results show that the ARSkNN classification technique is better than previously established k-NN classifiers. The performance of the algorithm was verified and validated on various datasets from different domains.

Keywords: Data mining, classification, nearest neighbor, similarity measure

1. Introduction

In the domain of data mining, classification techniques have attracted a great deal of interest from researchers, owing to their ample range of practical applications, e.g., geostatistics, computer vision, speech recognition and many more. The popular and well-established k-Nearest Neighbors (k-NN) classification approach is among the most straightforward and oldest of classification approaches. In this approach, the class of the queried instance is established in two steps: in the first step, the k nearest neighbors of the queried instance are found, and in the second step, the class that holds the majority among those nearest neighbors is assigned to the queried instance [12]. The accuracy of k-NN classification is drastically influenced by the metric used to determine the similarities or dissimilarities (distances) between the instances with known classes and the queried instance [34]. The vast majority of the works reported for the metric learning phase, i.e., the first phase of the k-NN classification approach, depend on distance metrics [3, 9, 25, 34]. Nevertheless, in numerous real-world applications such as document classification, similarities (such as cosine similarity) are preferred over distances (such as Euclidean distance).

Additionally, numerous studies have revealed that similarity-based measures are often chosen over distance measures even on non-textual datasets (such as Iris, Wine and Balance) [22]. The preference for a distance- or similarity-based measure depends on the representation or measurement type of the instances that need classification. Many similarity measures used in the metric learning phase of k-NN classification are directly or indirectly calculated with the help of distance measures.
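The two-step procedure just described can be written down in a few lines. The following is a minimal sketch of the distance-based k-NN baseline against which ARSkNN is later compared; the function names and the toy data are ours, not the authors', and Euclidean distance is assumed as the metric.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Distance metric assumed for the baseline; any Lp norm could be used instead.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, labels, query, k=3):
    """Classic two-step k-NN: find the k nearest training instances,
    then return the majority class among them."""
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    top_k = [labels[i] for i in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy usage: two 2-D classes.
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify(X, labels, (0.2, 0.1), k=3))  # -> "A"
```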


The recently proposed mass-based similarity measure (Massim) is a substitute for distance-based similarity measures and has shown better results in information retrieval tasks [31]. The cardinality of the smallest local region that covers two instances is the core concept used to calculate Massim. Our algorithm ARSkNN uses Massim as the metric learner in the k-NN classification approach.

This paper makes the following contributions. It:

– establishes an efficient kNN classification technique, ARSkNN (the name comes from the first character of the first names of its proposers Ashish, Roheet and Sumit), which utilizes a mass-based similarity measure rather than distance.
– establishes the empirical and experimental foundation of ARSkNN.
– empirically evaluates ARSkNN and compares it with the traditional kNN. The results show its superiority in terms of avg. runtime and avg. accuracy percentage.

2. Literature review

This section reviews k-NN classifiers and similarity measures. Massim and the operational concept behind Massim are also explained here. The approach of calculating a mass-based similarity measure (Massim) has been developed only recently, and hence the available literature is very limited.

2.1. k-Nearest neighbors classification

The cornerstones of k-nearest neighbors classification [12] are Nearest Neighbor classification and the k-NN rule first suggested in 1951 by Fix et al. It is also recognized by several other names, among them instance-based classifier, case-based classifier, memory-based classifier and lazy classifier. There are three crucial pillars of the k-NN classifier: a set of class-labeled instances, the value of k, and a dissimilarity or similarity metric. It is a lazy learning classification technique because it does not create any generalization from the training instances of the dataset. Therefore, it needs all instances of the training data in memory for each testing instance. Due to this, there is a minimal training phase but a costly testing phase in terms of both space and time complexity [1].

To enhance k-NN classifiers, various measures have been recommended, such as the Mahalanobis distance [34], dynamic (adaptive) distance [33] and local distance metrics [20]. Furthermore, the customary k-NN classifier has seen a variety of augmentations and transformations, for example, distance lower bound based nearest neighbor [9], the Nearest Feature Line (NFL) classification technique [19], the Nearest Feature Space (NFS) and Nearest Feature Plane (NFP) classification techniques [10], the Nearest Neighbor Line classification technique (NNLC) [37], the Center-based Nearest Neighbor classification technique (CBNNC) [16], Meaningful Nearest Neighbor (MNN) [21], MaxNearestDist based nearest neighbor [23], Optimal Weighted Nearest Neighbor classifiers [24] and the k-regular nearest neighbor graph [6]. However, all of the above classifiers are based on distance (dissimilarity) or similarity metrics. Genetic algorithms and genetic programming have also been used to propose robust nearest neighbor classifiers [15]. Xiong et al. [35] have suggested that distance-based classifiers are very convenient and relevant in the domain of writer recognition.

Until now, there has been no k-NN classifier that takes into account a mass-based similarity metric (Massim). This paper comes up with an efficient classifier that does exactly this, which is the novelty of our ARSkNN approach.

2.2. Concept of similarity measures

Similarity is a notion that can be utilized not only in the data mining domain but also in numerous other subjects, such as ecology and psychology. To determine the similarity between two objects, we can use a similarity measure, which is a real-valued function: the more similar two objects are to each other, the higher the value returned by the function.

After a similarity measure to classify biological species was formulated by Jaccard in 1901, various dissimilarity (distance) and similarity measures have been formulated across scientific disciplines over the past hundred years. Zezula et al. [36] and Cha [7] have carried out exhaustive surveys of similarity measures. Choi et al. have also reviewed 76 binary distance and similarity metrics, along with their correlations, via hierarchical clustering techniques [11].

A similarity measure is treated as a metric if it results in a higher value as the similarity between objects increases. A metric similarity S assures the following [27]:

1. Non-Negativity: S(X, Y) >= 0, i.e., the distance between any two elements of the metric space cannot be less than 0.
2. Reflexivity: S(X, Y) = 0 if and only if X = Y, i.e., if two elements are equivalent to each other, then the distance between them is 0.
3. Symmetry: S(X, Y) = S(Y, X), i.e., the distance between any two elements is the same in either direction.
4. Triangle Inequality: S(X, Z) <= S(X, Y) + S(Y, Z), i.e., for any three elements x, y, z of the metric space, the sum of the distances over (x, y) and (y, z) is not less than the distance over (x, z).

These four properties are also known as the distance axioms. Nonetheless, a few researchers have challenged the triangle inequality axiom by proposing non-metric similarity measures [26]. Similarity measures based on distance are binary functions dist(x, y) [Table 1], i.e., all of them take two arguments, regardless of whether they are metric or non-metric. On large datasets, the fundamental downside of distance-based similarity measures is that they are computationally expensive due to their high time complexities, which stem from the pairwise comparison matrix over the selected data. There are some established techniques by which one can transform a similarity measure into a distance measure and vice versa [8].

2.3. Massim - similarity measure based on mass

In 2010, Kai Ming Ting et al. suggested the idea of mass and how to estimate mass [29]. It was proposed and established as a replacement for density estimation. Massim is not quite the same as other similarity measures. It is built on the notion of mass [30], that is, the cardinality of a region, instead of distance. Crucially, it does not quantify any distance measure, which is the heart and soul of all distance-based similarities; rather, it is built on the data distribution in the local region. Massim assumes that two similar objects should lie in the same local region. In terms of mass, Massim is a binary function that quantifies the resemblance between two instances, whereas Mass itself is a unary function.

A comparison of distance-based and mass-based similarity measures is shown in Table 1. As listed in Table 2, Massim does not fulfill all four distance axioms, unlike distance-based similarity measures. Massim is thus a fundamentally different kind of similarity measure, based on mass.

Table 1
Distance-based similarity versus mass-based similarity

Computation: dist(x, y) is calculated (exclusively) from the coordinates of x and y in the feature space; Mass(x, y) is calculated (exclusively) from the data distribution in the local region of the feature space.
Definition: dist(x, y) is determined by the length of the shortest path from x to y; the mass-based function M(x, y) is determined by the cardinality of the smallest local region covering both x and y.
Inequality: similarity(x, y) > similarity(x, z) ≡ dist(x, y) < dist(x, z); similarity(x, y) > similarity(x, z) ≡ Mass(x, y) < Mass(x, z).
Metric: all four distance axioms generally hold for distance; all major distance axioms hold for mass except one.

Table 2
Axioms followed by distance-based and mass-based similarity measures

Axiom 1 (non-negativity): dist(x, y) >= 0; Mass(x, y) >= 1.
Axiom 2 (reflexivity): dist(x, y) = 0 ⇐⇒ x = y; (i) ∀x, y: Mass(x, x) <= Mass(x, y), (ii) ∃x ≠ y: Mass(x, x) ≠ Mass(y, y).
Axiom 3 (symmetry): dist(x, y) = dist(y, x); Mass(x, y) = Mass(y, x).
Axiom 4 (triangle inequality): dist(x, z) <= dist(x, y) + dist(y, z); Mass(x, z) < Mass(x, y) + Mass(y, z).

2.4. Operational concept of Massim

There are two adaptations of Massim: the first for a single-dimensional and the second for a multi-dimensional representation of objects (instances) [31]. For the single-dimensional adaptation, a model E splits the feature space into the smallest convex local region R(x, y|E) covering x, y ∈ ℝ. Each binary split b_i separates the real line characterized by D = {x_1 < x_2 < x_3 < ... < x_n} into two non-intersecting local regions r_1 (where x <= b_i) and r_2 (where x > b_i); in addition, one region r_3, which spreads over the entire real line, is also considered. Massim(x, y) is then expressed as:

\mathrm{Massim}(x, y) = \Big( \sum_{i=1}^{n-1} M_i(x, y)^{e} \, p(b_i) \Big)^{1/e}    (1)

where the exponent e selects either the arithmetic or the harmonic mean, p(b_i) is the probability of choosing the binary split b_i, and

M_i(x, y) =
\begin{cases}
i & \text{if } R(x, y|E) = r_1 \\
n - i & \text{if } R(x, y|E) = r_2 \\
n & \text{if } R(x, y|E) = r_3
\end{cases}    (2)

A smoother form of the single-dimensional Massim, the level-h Massim, is estimated iteratively using the following equation:

\mathrm{Massim}_h(x, y) = \Big( \sum_{i=1}^{n-1} \mathrm{Massim}_i^{h-1}(x, y)^{e} \, p(b_i) \Big)^{1/e}    (3)

with the condition

\mathrm{Massim}_h(x, y) =
\begin{cases}
\sum_{i=1}^{n-1} m_i(x) \, p(b_i) & h = 1 \\
\sum_{i=1}^{n-1} \mathrm{massim}_i^{h-1}(x) \, p(b_i) & h > 1
\end{cases}    (4)

where

m_i(x) =
\begin{cases}
i & \text{if } x <= x_i \\
n - i & \text{if } x > x_i
\end{cases}    (5)

The level-h Massim takes the shape of the data distribution into account, which makes it more effective, and it is used for multi-modal data distributions [31].
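To make the single-dimensional case concrete, the following is a small sketch of the level-1 Massim of Equations (1)–(2). It is an illustrative reimplementation, not the authors' code; it assumes the arithmetic mean (e = 1), a uniform split probability p(b_i) = 1/(n−1), and splits placed midway between consecutive sorted values.

```python
def massim_1d(data, x, y, e=1):
    """Level-1 single-dimensional Massim (Eqs. 1-2), assuming e = 1
    (arithmetic mean) or e = -1 (harmonic mean) and uniform p(b_i)."""
    d = sorted(data)
    n = len(d)
    p = 1.0 / (n - 1)                      # uniform split probability (assumption)
    total = 0.0
    for i in range(1, n):                  # split b_i lies between d[i-1] and d[i]
        b = (d[i - 1] + d[i]) / 2.0
        if x <= b and y <= b:              # both fall in r1, which holds i points
            m = i
        elif x > b and y > b:              # both fall in r2, which holds n - i points
            m = n - i
        else:                              # the split separates them: smallest covering region is r3
            m = n
        total += (m ** e) * p
    return total ** (1.0 / e)

# The lower the Massim value, the more similar the pair.
D = [1.0, 2.0, 3.0, 4.0]
print(massim_1d(D, 1.0, 2.0))  # 3.0: only the first split separates them
print(massim_1d(D, 1.0, 4.0))  # 4.0: every split separates them
```

Under these assumptions, the pair (1, 2) obtains a lower Massim (3.0) than the pair (1, 4) (4.0), so it is judged more similar, in line with the inequality given in Table 1.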
The similarity of two instances (x, y) under the multi-dimensional Massim is assessed by a standard average of the mass of each smallest local region that covers the instances (x, y). The same random local regions guarantee the mass inequality produced by half-space splits [29]. For multi-dimensional datasets, every half-space split is carried out over an arbitrarily chosen feature. An h-level split is represented by a tree structure in which every subtree contains h half-space splits; hence there are 2^h distinct local regions in such a tree [31]. The multi-dimensional mass similarity is then calculated using the following equation:

\mathrm{Massim}_h(x, y) \approx \Big( \frac{1}{t} \sum_{i=1}^{t} M_{h,i}(x, y)^{e} \Big)^{1/e}    (6)

where M_{h,i}(x, y) is the mass in R(x, y|E_i(h)), i.e., the smallest region among the multi-dimensional models E_i(h) enclosing both x and y, and t > 1 is the number of arbitrarily selected regions used to outline the mass similarity of x and y.

3. ARSkNN: the proposed k-NN classifier

In our previous research work [18], ARSkNN was proposed as a novel and robust k-nearest neighbor classifier which utilizes Massim, a mass-based similarity measure, instead of techniques that rely upon distance-based similarity measures. The algorithm is explained in the following paragraphs.

In this algorithm, y is the query instance for which Mass_h(x, y) is calculated for every instance x in the given dataset D. To calculate it, the instances (x, y) are analyzed with t similarity trees (sTrees) from a constructed similarity forest (sForest), which is generated from the dataset D = {(x_1, c_1), (x_2, c_2), ..., (x_n, c_n)} without consideration of the class labels c_i [31]. To find the k nearest neighbors (instances) in D with respect to the queried instance, Algorithm 1, i.e., ARSkNN, is used.

The nearest neighbors of y are established as follows: ∀x_i, x_j ∈ D, if Mass_h(x_i, y) < Mass_h(x_j, y) then similarity(x_i, y) > similarity(x_j, y). This implies that an instance having lower mass with respect to the query instance is more similar to the query instance. After discovering the k nearest neighbors of y, the class of y is decided by majority voting among them.
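The listing of Algorithm 1 is not reproduced here, so the following sketch only illustrates the procedure described above; it is our reconstruction under stated assumptions, not the authors' implementation. It builds a small sForest of random half-space splits (one randomly chosen attribute and split value per internal node), approximates Mass_h(x, y) by the training mass of the deepest node that still contains both x and y, averages over the t sTrees as in Equation (6) with e = 1, and finally votes among the k instances with the lowest mass. Names such as `STree` and `arsknn_classify` are ours.

```python
import random
from collections import Counter

class STree:
    """One similarity tree: random half-space splits on randomly chosen
    attributes, with the training mass (point count) stored at every node."""
    def __init__(self, points, height, rng):
        self.mass = len(points)
        self.attr = self.split = self.left = self.right = None
        if height == 0 or len(points) <= 1:
            return
        self.attr = rng.randrange(len(points[0]))
        values = [p[self.attr] for p in points]
        lo, hi = min(values), max(values)
        if lo == hi:
            return
        self.split = rng.uniform(lo, hi)
        left_pts = [p for p in points if p[self.attr] <= self.split]
        right_pts = [p for p in points if p[self.attr] > self.split]
        if not left_pts or not right_pts:   # degenerate split: keep this node as a leaf
            self.attr = self.split = None
            return
        self.left = STree(left_pts, height - 1, rng)
        self.right = STree(right_pts, height - 1, rng)

    def pair_mass(self, x, y):
        # Mass of the smallest region of this tree covering both x and y.
        node = self
        while node.left is not None:
            gx, gy = x[node.attr] <= node.split, y[node.attr] <= node.split
            if gx != gy:
                break                       # the split separates x and y
            node = node.left if gx else node.right
        return node.mass

def build_sforest(data, t=100, height=8, sample_size=256, rng=None):
    # Modeling phase: the forest is built once, on random subsamples of the data.
    rng = rng or random.Random(0)
    size = min(sample_size, len(data))
    return [STree([data[rng.randrange(len(data))] for _ in range(size)], height, rng)
            for _ in range(t)]

def mass_similarity(forest, x, y):
    # Equation (6) with e = 1: plain average over the t sTrees.
    return sum(tree.pair_mass(x, y) for tree in forest) / len(forest)

def arsknn_classify(forest, train, labels, query, k=3):
    # Lower mass means higher similarity, so take the k smallest masses and vote.
    ranked = sorted(range(len(train)), key=lambda i: mass_similarity(forest, train[i], query))
    return Counter(labels[i] for i in ranked[:k]).most_common(1)[0][0]

# Toy usage with the same 2-D data as before.
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["A", "A", "B", "B"]
forest = build_sforest(X, t=50, height=4, sample_size=4)
print(arsknn_classify(forest, X, labels, (0.2, 0.1), k=3))  # -> "A"
```

Note that, in this sketch as in the paper, the sForest is built once before any query arrives, which is what gives ARSkNN its cheap testing phase discussed in Section 3.1.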

3.1. Complexity discussion

The average-case time complexity of the traditional k-nearest neighbor classifier using an Lp-norm (Minkowski) metric is O(kn log n) [32], whereas ARSkNN has O(nt log d) complexity in the modeling phase, which follows from Equation (6) defined in Section 2.4. For the class assignment phase, the time complexity is O(kn). Here, n represents the number of instances in dataset D; d represents the number of arbitrarily chosen instances of dataset D; t represents the number of arbitrary regions utilized for calculating Massim; and k is the number of nearest neighbors.

In the first (modeling) phase, the time complexity can be scaled down from O(nt log d) to O((n + d)t) by adopting any suitable indexing technique, since all sTrees of every sForest are balanced binary trees [18].

3.2. Comparison with traditional kNN

The traditional kNN always experiences challenges on high-dimensional datasets, due to the fact that all instances are approximately equidistant from the queried instance [13]. Techniques such as linear discriminant analysis (LDA), principal component analysis (PCA) or canonical correlation analysis (CCA) are needed as a pre-processing step to save kNN from the curse of dimensionality. These techniques, combined with kNN, are computationally expensive due to their own run-time complexity. ARSkNN, however, does not need these techniques as a pre-processing step, because Massim arbitrarily selects an attribute to generate each internal node of an sTree, and each sTree itself is generated from a small subset of the entire dataset.

The traditional kNN needs all the training instances in memory to classify a queried instance. This is the reason why it is also known as a lazy classifier [1]. It is also a drawback that causes kNN problems when applied to datasets with very high dimensionality. ARSkNN, in contrast, does not need to keep all training instances in memory, owing to an intermediate modeling phase in which it builds an sForest of sTrees. It is therefore easy to apply ARSkNN to high-dimensional datasets.

4. Implementation

The implementation of the traditional kNN, defined as IBK in the Waikato Environment for Knowledge Analysis (Weka) 3.7.10 platform [17], has been used in this experiment. Weka's implementation of the nearest neighbor search algorithm using Euclidean distance is named LinearNNSearch.

The implementation of ARSkNN was done using JDK 1.8.0 with the NetBeans 8.0.2 IDE. The jar file, named ARSkNN.jar, was loaded at runtime into the Weka 3.7.10 platform and tested on various datasets from different domains, as described below.

4.1. Datasets used

The datasets were chosen from various research domains, varying in size, for executing and testing the algorithm. The characteristics, attribute and instance information of all datasets utilized in the evaluation are outlined in Table 3. The datasets used, along with brief descriptions, are given below.

Table 3
Characteristics of the datasets used in the evaluation
Dataset Name [Ref] # Instances # Attributes # Classes Domain
PenDigits [2] 10992 16 10 Digit Recognition
Letter Recognition [14] 20000 16 26 Character Recognition
Magic04 [5] 19020 10 2 Gamma Particles
Statlog Middle 6435 4 7 Landsat Satellite
Statlog 6435 36 7 Landsat Satellite
Covertype Small [4] 10000 54 7 Forest Covertype

4.1.1. Pendigits dataset
The Pendigits dataset consists of 250 samples from each of 44 writers. The writers wrote 250 digits inside boxes of 500×500 pixel resolution on a tablet. These samples were converted into 10992 instances. Each instance has 16 numeric attributes, and all instances are classified into 10 classes corresponding to the digits 0–9, with approximately 1090 instances per class. kNN with k = 5 and a multilayer perceptron were used as classifiers in the first research approach applied to this dataset [2]. In a paper published in 2013, the accuracy percentage on this dataset using the DEMass-Bayes classifier is reported as 98.87% [28].

4.1.2. Letter recognition dataset
The Letter Recognition dataset is about character image features. It is based on 20 different fonts for the 26 English alphabet characters. The 20 fonts were distorted to create 20000 unique instances with 16 numeric attributes each; each letter has approximately 770 instances. In the first publication using this dataset, an adaptive (IF-THEN) classifier was used for classification [14]. The accuracy percentage using the DEMass-Bayes classifier on this dataset is 92.27% [28].

4.1.3. Magic04 dataset
The Magic04 dataset was created using a Monte Carlo program to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The dataset consists of 19020 instances, of which 12332 are gammas and 6688 are hadrons. Each instance is characterized by 10 numeric attributes. A total of 14 different classifiers were applied in the first research done on this dataset, and kNN was one of them [5]. The DEMass-Bayes classifier produced an accuracy percentage of 84.17 on this dataset [28].

4.1.4. Statlog dataset
The Statlog dataset is created from Landsat Satellite data to identify land use in 7 different classes. The dataset is a very small sub-area of a scene captured by the Landsat Satellite, comprising 82×100 pixels. Each instance of the dataset corresponds to a 3×3 square region of pixels contained within the 82×100 pixel area. The original dataset has 6435 instances with 36 numeric attributes. There is no known first research study on this dataset, but many researchers have used it in their work.

4.1.5. Covertype dataset
The Covertype dataset is used to predict the cover type of four forest areas in the Roosevelt National Forest. In this dataset, 12 independent variables comprising a total of 54 numeric attributes for 581012 instances are generated from digital spatial data processed in a GIS. In the first research done on this dataset, an Artificial Neural Network classifier was used for the classification task [4]. The DEMass-Bayes classifier produced an accuracy percentage of 92.05 on this dataset [28].

The Pendigits, Letter Recognition, Magic04 and Statlog datasets are taken as provided in the UCI repository. For the Statlog Middle dataset, only the four spectral values for the central pixel, given by attributes 17, 18, 19 and 20 of the real dataset, are taken. The Covertype Small dataset was prepared by randomly choosing 10000 instances from the original Covertype dataset.

5. Empirical evaluations

This section demonstrates the experimental results achieved by evaluating the performance of ARSkNN vis-à-vis the traditional kNN. Evaluations are carried out to:

(i) compare ARSkNN with traditional kNN; and
(ii) analyse the behaviour of ARSkNN when key parameters are changed.

The entire classification experiment was set up on a system equipped with an Intel Core i7 processor with a 2.4 GHz clock and 8 GB of main memory. Ten-fold cross-validation on Weka 3.7.10 was executed for both classifiers during the experiment.

Table 4 presents the outcomes for the average classification accuracy percentage over 10-fold cross-validation for IBK and for ARSkNN with 10, 50 and 100 sTrees, for the distinct values of k, namely 1, 3, 5 and 10. Table 4 shows that ARSkNN contributes better average classification accuracy than IBK on the Magic04, Statlog Middle, Statlog and Covertype Small datasets.

The results on accuracy percentage shown in Table 4 are also represented graphically in Figs. 1–4. In these figures, a comparison of IBK, ARSkNN with 10 trees, ARSkNN with 50 trees and ARSkNN with 100 trees over 10-fold cross-validation is illustrated for all six datasets in terms of average classification accuracy.

The number of nearest neighbors (k) is taken as 1 in Fig. 1. It is clearly evident from Fig. 1 that ARSkNN with 100 trees produced significantly better accuracies than IBK for the Magic04, Statlog Middle, Statlog and Covertype Small datasets. On the Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse than that of IBK.

In Fig. 2, the number of nearest neighbors (k) is 3. As shown in Fig. 2, better classification accuracies for the Magic04 and Statlog Middle datasets are produced by ARSkNN with 50 trees, and ARSkNN with 100 trees produced significantly better accuracies for the Statlog and Covertype Small datasets. On the Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse than that of IBK.
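The comparison protocol (10-fold cross-validation, reporting average accuracy and runtime per classifier and per value of k) can be sketched as follows. This is only a rough stand-in for the authors' Weka setup: it uses scikit-learn's KNeighborsClassifier as the distance-based baseline and the library's bundled digits data as a stand-in dataset, since neither Weka's IBK nor ARSkNN is available in scikit-learn.

```python
import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)              # stand-in dataset; the paper used UCI datasets

for k in (1, 3, 5, 10):
    clf = KNeighborsClassifier(n_neighbors=k)    # distance-based baseline (Euclidean)
    accs, start = [], time.time()
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, test_idx in folds.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    print(f"k={k}: avg accuracy {100 * np.mean(accs):.2f}%, runtime {time.time() - start:.2f}s")
```

The actual numbers reported in Tables 4 and 5 come from the Weka setup described in Section 4, not from this sketch.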

Table 4
Average classification accuracy (%) over 10-fold cross-validation for IBK and for ARSkNN with 10 sTrees, 50 sTrees and 100 sTrees, for distinct values of k (1, 3, 5 and 10)

(a) k = 1
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            99.36   98.33         99.19         99.28
Letter Recognition   95.96   86            94.46         95.63
Magic04              80.93   83.18         85.29         85.52
Statlog Middle       82.25   82            83.49         83.57
Statlog              90.2    88.51         91.11         91.37
Covertype Small      80.9    79.12         81.64         81.92

(b) k = 3
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            99.35   98.46         99.19         99.24
Letter Recognition   95.63   86.87         93.41         94.58
Magic04              83.2    84.02         84.9          84.84
Statlog Middle       84.07   84.35         84.75         84.61
Statlog              90.78   89.19         90.45         90.87
Covertype Small      79.09   78.95         79.89         80.64

(c) k = 5
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            99.26   98.31         98.97         98.94
Letter Recognition   95.5    85.66         92.48         93.37
Magic04              83.64   83.84         84.55         84.52
Statlog Middle       84.66   85.19         85.48         85.45
Statlog              90.62   88.92         89.85         90.99
Covertype Small      78.01   77.53         78.1          78.62

(d) k = 10
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            99.02   97.98         98.54         98.57
Letter Recognition   94.79   82.98         90.12         91.18
Magic04              83.79   83.47         83.82         83.86
Statlog Middle       85.62   85.71         86.23         86.02
Statlog              90.48   88.22         89.23         89.33
Covertype Small      75.55   75.79         75.74         75.93

Note: Boldface values represent the best accuracy achieved in the dataset.

Fig. 1. Avg. Accuracy (%) of classifiers for k = 1.



Fig. 2. Avg. Accuracy (%) of classifiers for k = 3.




Fig. 3. Avg. Accuracy (%) of classifiers for k = 5.



Fig. 4. Avg. Accuracy (%) of classifiers for k = 10.

Table 5
Avg. runtime (in seconds) over 10-fold cross-validation for IBK and for ARSkNN with 10 sTrees, 50 sTrees and 100 sTrees, for distinct values of k (1, 3, 5 and 10)

(a) k = 1
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            0.55    0.02          0.11          0.25
Letter Recognition   3.05    0.04          0.24          0.52
Magic04              1.87    0.04          0.17          0.44
Statlog Middle       0.10    0.01          0.03          0.07
Statlog              0.29    0.02          0.12          0.45
Covertype Small      0.49    0.04          0.24          0.61

(b) k = 3
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            0.61    0.02          0.09          0.21
Letter Recognition   3.31    0.04          0.24          0.54
Magic04              2.04    0.04          0.21          0.48
Statlog Middle       0.11    0.01          0.03          0.07
Statlog              0.36    0.02          0.13          0.34
Covertype Small      0.73    0.04          0.23          0.53

(c) k = 5
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            0.64    0.02          0.10          0.20
Letter Recognition   3.22    0.04          0.24          0.54
Magic04              2.24    0.03          0.19          0.46
Statlog Middle       0.11    0.01          0.04          0.07
Statlog              0.35    0.02          0.11          0.32
Covertype Small      0.92    0.05          0.23          0.46

(d) k = 10
Data Sets            IBK     ARS-10 Tree   ARS-50 Tree   ARS-100 Tree
PenDigits            0.70    0.02          0.09          0.18
Letter Recognition   3.59    0.04          0.23          0.54
Magic04              2.54    0.03          0.26          0.49
Statlog Middle       0.12    0.00          0.03          0.06
Statlog              0.40    0.02          0.10          0.27
Covertype Small      0.90    0.05          0.22          0.49

Note: Boldface values represent the worst runtime achieved in the dataset.

Figure 3 shows the results for 5 nearest neighbors (k = 5). As can be seen from Fig. 3, better classification accuracies for the Magic04 and Statlog Middle datasets are produced by ARSkNN with 50 trees, and ARSkNN with 100 trees produced significantly better accuracies for the Statlog and Covertype Small datasets. On the Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse than that of IBK.


Fig. 5. Avg. Runtime (in seconds) of classifiers for k = 1.



Fig. 6. Avg. Runtime (in seconds) of classifiers for k = 3.

Figure 4 illustrates the results for the number of nearest neighbors (k) equal to 10. As shown in Fig. 4, better classification accuracies for the Statlog Middle dataset are produced by ARSkNN with 50 trees, and ARSkNN with 100 trees produced significantly better accuracies for the Magic04 and Covertype Small datasets. On the Pendigits, Letter Recognition and Statlog datasets, the accuracy of ARSkNN is not significantly worse than that of IBK.

Table 5 shows the obtained results for the average runtime in seconds over 10-fold cross-validation for IBK and for ARSkNN with 10, 50 and 100 sTrees, for the distinct values of k, namely 1, 3, 5 and 10. Table 5 shows that ARSkNN achieves better average runtime than IBK on all the datasets used for evaluation.

The results on average runtime (in seconds) shown in Table 5 are also represented graphically in Figs. 5–8. In these figures, a comparison of IBK, ARSkNN with 10 trees, ARSkNN with 50 trees and ARSkNN with 100 trees over 10-fold cross-validation is illustrated for all six datasets in terms of average runtime (in seconds).

Figure 5 shows the result for the nearest neighbor count (k) equal to 1. Figure 5 reflects that ARSkNN takes significantly less average runtime than IBK, even with 100 trees, on the Pendigits, Letter Recognition, Magic04 and Statlog Middle datasets. On the Statlog and Covertype Small datasets, the average runtimes of ARSkNN with 10 trees and with 50 trees are lower than IBK's, but the average runtime of ARSkNN with 100 trees is higher than IBK's.

In Fig. 6, the number of nearest neighbors (k) is taken as 3. Figure 6 reflects that ARSkNN takes significantly less average runtime than IBK, even with 100 trees, on all datasets: Pendigits, Letter Recognition, Magic04, Covertype Small, Statlog and Statlog Middle.

The value of k is taken as 5 in Fig. 7. As can be seen from Fig. 7, ARSkNN takes significantly less average runtime than IBK, even with 100 trees, on all datasets: Pendigits, Letter Recognition, Magic04, Covertype Small, Statlog and Statlog Middle.


Fig. 7. Avg. Runtime (in seconds) of classifiers for k = 5.



Fig. 8. Avg. Runtime (in seconds) of classifiers for k = 10.

The results with the number of nearest neighbors (k) equal to 10 for all six datasets are shown in Fig. 8. The advantage of ARSkNN over IBK in terms of lower runtime on the Pendigits, Letter Recognition, Magic04, Covertype Small, Statlog and Statlog Middle datasets is reflected in Fig. 8.

6. Results and discussion

This research established that ARSkNN is significantly better than IBK (using Euclidean distance) in accuracy percentage for four datasets, namely Magic04, Statlog Middle, Statlog and Covertype Small, as shown in Table 4. One should not expect ARSkNN to outperform IBK on all datasets, because a classifier's accuracy percentage depends upon a number of factors, such as whether the attributes of the dataset's domain match the similarity measure. However, ARSkNN was found to be significantly better than IBK (with Euclidean distance) in average runtime for all the datasets. This is because, rather than computing the distance between each training instance and the testing instance in the testing phase, ARSkNN calculates the similarity measure before the testing phase, during the modeling of the sForest, and the testing instance is compared against the similarities in the modeled sForest.

The algorithm will give a higher accuracy percentage if more samples are taken to build the sTrees, but one should not take more than half of the total number of training instances. Generally, it is observed that for ARSkNN, k = 3, 100 sTrees and 5000 samples per sTree give better accuracy on most of the datasets. However, it has been observed that the average accuracy percentage of IBK depends upon the learning metric (similarities or dissimilarities), the value of k and the characteristics of the dataset.

The similarities and differences between ARSkNN and the traditional kNN classifier are discussed below. Both classifiers utilize the voting mechanism for deciding the classes of testing instances after finding the k nearest neighbors, and for both classifiers there is a need to decide the proper value of k.
The fundamental difference between the two is that the traditional kNN classifier uses (dis)similarity measures which have distance calculation as their basis, whereas ARSkNN uses a similarity measure known as Massim, which has mass estimation as its basis. There are some applications of tree structures (e.g., the k-d tree) for indexing in order to speed up the nearest neighbor search. The reader should not confuse these with the purpose of the sTrees, which are used to hold similarity measures and not for indexing; because sTrees are balanced binary trees, such indexing techniques can be used to decrease the time complexity of the sTrees.

7. Conclusion

This paper establishes the empirical and experimental foundation of ARSkNN as an efficient approach among kNN classifiers. ARSkNN represents a paradigm shift from distance-based kNN to Massim-based kNN. ARSkNN overcomes the key drawback of kNN classifiers, i.e., keeping all training instances in memory for the sake of classifying a single testing instance. Our results verified and established that ARSkNN has a better accuracy percentage and better average runtime than kNN.

There are several possibilities for extending the proposed algorithm:

(i) distance metrics other than Euclidean distance can be used for the comparison of ARSkNN with kNN;
(ii) ARSkNN can also be applied to datasets from other domains, such as document classification and geostatistics;
(iii) the efficiency of ARSkNN can be checked on further parameters such as sensitivity (recall), F-measure and balanced classification ratio (BCR).

References

[1] D.W. Aha, Lazy Learning, Kluwer Academic Publishers, Norwell, MA, USA, 1997, pp. 7–10.
[2] F. Alimoglu and E. Alpaydin, Methods of combining multiple classifiers based on different representations for pen-based handwritten digit recognition, in: Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96), Citeseer, 1996.
[3] L. Baoli, L. Qin and Y. Shiwen, An adaptive k-nearest neighbor text categorization strategy, ACM Transactions on Asian Language Information Processing (TALIP) 3(4) (2004), 215–226.
[4] J.A. Blackard and D.J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture 24(3) (1999), 131–151.
[5] R.K. Bock, A. Chilingarian, M. Gaug, F. Hakl, Th. Hengstebeck, M. Jiřina, J. Klaschka, E. Kotrč, P. Savický, S. Towers, et al., Methods for multidimensional event classification: A case study using images from a Cherenkov gamma-ray telescope, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 516(2) (2004), 511–528.
[6] K. Broelemann, X. Jiang, S. Mukherjee and A.S. Chowdhury, Variants of k-regular nearest neighbor graph and their construction, Information Processing Letters (2017).
[7] S.-H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, City 1(2) (2007), 1.
[8] S. Chen, B. Ma and K. Zhang, On the similarity metric and the distance metric, Theoretical Computer Science 410(24) (2009), 2365–2376.
[9] Y.-S. Chen, Y.-P. Hung, T.-F. Yen and C.-S. Fuh, Fast and versatile algorithm for nearest neighbor search based on a lower bound tree, Pattern Recognition 40(2) (2007), 360–375.
[10] J.-T. Chien and C.-C. Wu, Discriminant waveletfaces and nearest feature classifiers for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12) (2002), 1644–1649.
[11] S.-S. Choi, S.-H. Cha and C.C. Tappert, A survey of binary similarity and distance measures, Journal of Systemics, Cybernetics and Informatics 8(1) (2010), 43–48.
[12] T. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13(1) (1967), 21–27.
[13] D. Francois, V. Wertz and M. Verleysen, The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7) (2007), 873–886.
[14] P.W. Frey and D.J. Slate, Letter recognition using Holland-style adaptive classifiers, Machine Learning 6(2) (1991), 161–182.
[15] C. Gagné and M. Parizeau, Coevolution of nearest neighbor classifiers, International Journal of Pattern Recognition and Artificial Intelligence 21(05) (2007), 921–946.
[16] Q.-B. Gao and Z.-Z. Wang, Center-based nearest neighbor classifier, Pattern Recognition 40(1) (2007), 346–349.
[17] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I.H. Witten, The WEKA data mining software: An update, ACM SIGKDD Explorations Newsletter 11(1) (2009), 10–18.
[18] A. Kumar, R. Bhatnagar and S. Srivastava, ARSkNN-a k-NN classifier using mass based similarity measure, Procedia Computer Science 46 (2015), 457–462.
[19] S.Z. Li and J. Lu, Face recognition using the nearest feature line method, IEEE Transactions on Neural Networks 10(2) (1999), 439–443.
[20] Y.-K. Noh, B.-T. Zhang and D.D. Lee, Generative local metric learning for nearest neighbor classification, in: Advances in Neural Information Processing Systems, 2010, pp. 1822–1830.
[21] D. Omercevic, O. Drbohlav and A. Leonardis, High-dimensional feature matching: Employing the concept of meaningful nearest neighbors, in: 2007 IEEE 11th International Conference on Computer Vision, IEEE, 2007, pp. 1–8.
[22] A.M. Qamar, E. Gaussier, J.-P. Chevallet and J.H. Lim, Similarity learning for nearest neighbor classification, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 983–988.
[23] H. Samet, K-nearest neighbor finding using MaxNearestDist, IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2) (2008), 243–252.
[24] R.J. Samworth et al., Optimal weighted nearest neighbour classifiers, The Annals of Statistics 40(5) (2012), 2733–2763.
[25] S. Shalev-Shwartz, Y. Singer and A.Y. Ng, Online and batch learning of pseudo-metrics, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 94.
[26] T. Skopal and B. Bustos, On nonmetric similarity search problems in complex domains, ACM Computing Surveys (CSUR) 43(4) (2011), 34.
[27] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, New York, 1999.
[28] K.M. Ting, T. Washio, J.R. Wells, F.T. Liu and S. Aryal, DEMass: A new density estimator for big data, Knowledge and Information Systems 35(3) (2013), 493–524.
[29] K.M. Ting, G.-T. Zhou, F.T. Liu and J.S.C. Tan, Mass estimation and its applications, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp. 989–998.
[30] K.M. Ting, G.-T. Zhou, F.T. Liu and S.C. Tan, Mass estimation, Machine Learning 90(1) (2013), 127–160.
[31] K.M. Ting, T.L. Fernando and G.I. Webb, Mass-based similarity measure: An effective alternative to distance-based similarity measures, Technical report 2013/276, Clayton School of IT, Monash University, Australia, 2013.
[32] P.M. Vaidya, An O(n log n) algorithm for the all-nearest-neighbors problem, Discrete & Computational Geometry 4(2) (1989), 101–115.
[33] J. Wang, P. Neskovic and L.N. Cooper, Improving nearest neighbor rule with a simple adaptive distance measure, Pattern Recognition Letters 28(2) (2007), 207–213.
[34] K.Q. Weinberger and L.K. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research 10(Feb) (2009), 207–244.
[35] Y.-J. Xiong, Y. Lu and P.S.P. Wang, Off-line text-independent writer recognition: A survey, International Journal of Pattern Recognition and Artificial Intelligence 31(05) (2017), 1756008.
[36] P. Zezula, G. Amato, V. Dohnal and M. Batko, Similarity Search: The Metric Space Approach, volume 32, Springer Science & Business Media, 2006.
[37] W. Zheng, L. Zhao and C. Zou, Locally nearest neighbor classifiers for pattern classification, Pattern Recognition 37(6) (2004), 1307–1309.
