You are on page 1of 7

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/308486078

Analysis of Eight Data Mining Algorithms for Smarter Internet of Things (IoT)

Article  in  Procedia Computer Science · December 2016


DOI: 10.1016/j.procs.2016.09.068

CITATIONS READS

70 670

4 authors:

Furqan Alam Rashid Mehmood


King Abdulaziz University King Abdulaziz University
12 PUBLICATIONS   369 CITATIONS    134 PUBLICATIONS   1,724 CITATIONS   

SEE PROFILE SEE PROFILE

Iyad Katib Aiiad Albeshri


King Abdulaziz University King Abdulaziz University
79 PUBLICATIONS   733 CITATIONS    44 PUBLICATIONS   558 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Multimedia Big Data View project

Enterprise Systems (ERP, SCM, CRM, PLM etc.) View project

All content following this page was uploaded by Rashid Mehmood on 21 June 2017.

The user has requested enhancement of the downloaded file.


Available online at www.sciencedirect.com

ScienceDirect
Procedia Computer Science 98 (2016) 437 – 442

International Workshop on Data Mining in IoT Systems (DaMIS 2016)

Analysis of Eight Data Mining Algorithms for Smarter Internet of


Things (IoT)
Furqan Alama , Rashid Mehmoodb,∗, Iyad Katiba , Aiiad Albeshria
a Department of Computer Science, Faculty of Computing and Information Technology (FCIT), King Abdulaziz University, Jeddah, KSA
b High Performance Computing Center, King Abdulaziz University, Jeddah, KSA

Abstract
Internet of Things (IoT) is set to revolutionize all aspects of our lives. The number of objects connected to IoT is expected to reach
50 billion by 2020, giving rise to an enormous amounts of valuable data. The data collected from the IoT devices will be used to
understand and control complex environments around us, enabling better decision making, greater automation, higher efficiencies,
productivity, accuracy, and wealth generation. Data mining and other artificial intelligence methods would play a critical role in
creating smarter IoTs, albeit with many challenges. In this paper, we examine the applicability of eight well-known data mining
algorithms for IoT data. These include, among others, the deep learning artificial neural networks (DLANNs), which build a feed
forward multi-layer artificial neural network (ANN) for modelling high-level data abstractions. Our preliminary results on three
real IoT datasets show that C4.5 and C5.0 have better accuracy, are memory efficient and have relatively higher processing speeds.
ANNs and DLANNs can provide highly accurate results but are computationally expensive.
©c 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
 2015 The Authors. Published by Elsevier B.V.
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-reviewunder
Peer-review underresponsibility
responsibility
of of
thethe Conference
Program Chairs Program Chairs.
Keywords: Internet of Things (IoT), Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Linear Discriminant Analysis (LDA),
Naı̈ve Bayes (NB), C4.5, C5.0, Artificial Neural Networks (ANNs), Deep Learning ANNs (DLANNs), Big Data, Smart Cities;

1. Introduction

The Internet of Things (IoT) is “a global infrastructure for the information society, enabling advanced services by
interconnecting (physical and virtual) things based on existing and evolving interoperable information and communi-
cation technologies” 1 . IoT 2,3 , is one of the major technological developments of our times given its potential is fully
realized. It is set to revolutionize all aspects of our lives, be it work, social interactions, or entertainment. A plethora
of objects are emerging every day to be part of the IoT infrastructure and the number of objects connected to IoT is
expected to reach 50 billion by 2020 4 . This would generate enormous amounts of valuable data.
A major objective of IoT is to make the environment around us smarter, by giving the environment the information
it needs, through real-time and historic data feeds, and apply computational intelligence on the information to make

∗ Corresponding author. Tel.: +966-537-069-318.


E-mail address: RMehmood@kau.edu.sa

1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Program Chairs
doi:10.1016/j.procs.2016.09.068
438 Furqan Alam et al. / Procedia Computer Science 98 (2016) 437 – 442

smart decisions automatically. This would enhance our ability to manage our cities, roads, health, homes, forests and
much more. We will have better crowd and traffic management, emergency predictions, better prediction of accidents
and crimes, etc. The data collected from IoT devices will be used to understand and control complex environments
around us, enabling better decision making, greater automation, higher efficiencies, productivity, accuracy, and wealth
generation. A major challenge in these settings is the timely analyses of large amounts of data (i.e. big data) to produce
highly reliable and accurate insights and decisions so that IoT could live up to its promise. Machine learning is among
the top methods to gain hidden insights from IoT data.
The aim of this research is to explore whether the conventional data mining algorithms would also work for the
IoT datasets, or new families of data mining algorithms are required. To this end, this paper provides a preliminary
analysis on examining the applicability of several well-known data mining algorithms to real IoT datasets. We have
used eight data mining algorithms. These are Support Vector Machine (SVM), K-Nearest Neighbours (KNN), Linear
Discriminant Analysis (LDA), Naı̈ve Bayes (NB), C4.5, C5.0, Artificial Neural Networks (ANNs), and Deep Learning
ANNs (DLANNs). The main contribution of this work is the analysis of the effectiveness and efficiency of eight of
the well-known data mining algorithms.
The remainder of this paper is organized as follows. Section 2 provides a review of the relevant literature. Our
experimental methodology is explained in Section 3. Simulation results and analysis are reported in Section 4. Con-
clusions are drawn in Section 5.

2. Background Material and Literature Review

Modern day data mining tasks are far more challenging due to an unprecedented increase in the amount and
complexity of data 8,7 . With the emergence of IoT paradigm, a completely new set of challenges have been added
to the data mining domain 5,6,9 . Support Vector Machine (SVM), K-nearest neighbor (KNN), Nave Bayes (NB),
Linear discriminant analysis (LDA), C4.5, C50 and ANNs are widely used in the field of data mining. SVM was
designed initially to address bias variance tradeoff, over-fitting and capacity control 10 . Burges 10 also stated that SVM
accuracy depends a lot on the quality of training data and machine capacity. The use of SVM is further extended
from classification to regression and element ranking. SVM is a very efficient tool to work with in complex and noisy
domains. The Computational inefficiency is one of the major drawback of SVM, however several optimizations have
been done to reduce its computational cost and to increase its scalability 9,11 .
A lazy learner algorithm known as KNN is one of the simplest available classifier which is easy to understand and
implement. Due to the simplicity of KNN, several issues arise that limit its performance such as the selection of right
distance measure, number of neighbors and majority vote to combing class labels is not always effective 5,9 . KNN is
also used effectively for various tasks in wireless sensor networks (WSN) and IoT domain for intrusion detection 12 ,
indoor positioning systems 13 and activity recognition 14 . For binary classification problems, SVM is a one of the best
options. However, scalability is always an issue 9,10,11 . LDA which is also known as Fisher Discriminant Analysis
(FDA), is an appropriate answer for multi-class problems 15,16 . The C4.5 algorithm, another popular data mining
algorithm, proposed by Ross Quinlan 17,18 is considered as the best algorithm in data mining 9 . Quinlan later on
developed an improvised version of C4.5, known as C5.0, which is claimed faster than C4.5, has better memory
efficiency, and support for boosting, winnowing and weighting 19 .
The approach taken by the NB and ANN algorithms to solve the given data mining task is completely different to
the ones we have discussed earlier in this section. The NB algorithm is probability based and is a very old classifier.
It uses Bayes’ theorem with the independence supposition among the features. NB is a robust and simple classifier
like KNN. Although simple, it often produces surprisingly accurate results 9,20,21 . On the other side, ANN algorithms
are based on mimicking neural system of human brain. ANNs are extremely efficient in solving data mining tasks
with higher accuracy. However, ANN based algorithms are too complex and an extensive amount of computation is
required for solving a problem with higher accuracy 22,23,24 . Further extension of ANNs are the revolutionary learning
algorithms based on deep learning concept. DLANNs have extreme learning ability, process enormous amount of data,
and produce highly accurate results which are not possible with other conventional machine learning and data mining
algorithms 25 . However, this deep learning branch of computer science is still in its infancy period. Deep learning
can give us novel insights from IoT data which is not possible from other data mining algorithms. Particularly, with
respect to IoT, little work is done to gain benefits from DLANNs 26,27,28 .
Furqan Alam et al. / Procedia Computer Science 98 (2016) 437 – 442 439

Fig. 1. Experimental Methodology

3. Experimental Methodology

We have considered eight well-known data mining algorithms in this paper. These eight algorithms also include
DLANNs which build feed-forward multilayer ANNs. All the simulations are performed using the R platform. Par-
ticularly, for simulating DLANNs, we used the H2O package available in R. For the experiments, we used three real
sensor datasets from the UCI data repository 29,30,31 . Datasets are collected by using sensors and accelerometers and
are used to classify human activities, robot navigation, body postures and movements. Before simulating the algo-
rithms, we preprocessed the datasets to make them suitable for the classifiers. This is a preliminary analysis and
hence, we have only used partial datasets. Our experimental methodology is depicted in Fig. 1.

4. Results and Analysis

We now present a preliminary analysis of the eight well established data mining algorithms as mentioned in Fig.
1. The experiments have been carried out on the Aziz supercomputer. The Aziz supercomputer is Fujitsu made and is
able to deliver peak performance of 230 teraflops. It has a total of 11,904 cores in 496 nodes. Aziz was ranked number
360 in the June 2015 Top500 competition (http://www.top500.org/), currently it is at number 491 (November
2015). For performance evaluation of the algorithms, we have summarised the results in form of confusion matrix
(CM). With the help of CM, we are able to know the total number of instances rightly and wrongly classified. The
class to which the wrongly classified instances belong to can also be identified with CM.
The CMs of all eight algorithms simulated on three different datasets 29,30,31 are given in Fig. 2 to Fig. 4. Moreover,
Table 1, classification accuracy (CA) percentage and elapsed time is mentioned. With respect to Fig. 2 to Fig. 4
and Table 1, we note that C4.5, C5.0, ANN and DLANN algorithms performed far better than SVM, KNN, NB and
LDA. The C4.5, C5.0 and ANN algorithms are very close to each other when considering the classification accuracy.
With average accuracy (AAC) of 97.15% obtained by C4.5, it performs slightly better than 96.61% AAC of the C5.0
algorithm. AAC of ANN is 96.19% for all three datasets. C4.5 tops among all the eights algorithms is terms of the
classification accuracy, followed closely by C5.0.
All the datasets 29,30,31 are multi-label as mentioned in Fig.2 to Fig.4. Consequently, SVM shows its weaknesses
towards multi-label data classification as compared to binary classification problem where its performance is one
of the best 9 . SVM performed better than KNN with 4.09% higher AAC. The choice of k-neighbours and distance
measure affects the CA of KNN. The remaining two algorithms, NB and LDA, performed the worst in terms of CA.
Our these results strongly agree to the conclusion made by Wu et al. 9 .

4.1. Execution Time

Both NB and LDA algorithms are the fastest amongst all the eight algorithms. LDA has slightly better processing
times than NB. LDA will not have higher CA if discriminatory information is in the variance. Also, it is a parametric
method. Average processing time (APT) of C4.5, C5.0 is 7.70 and 7.21 seconds respectively. SVM uses a lot of
system resources and has slow processing speed. KNN is lighter and has low execution times as mentioned in Table.1.
440 Furqan Alam et al. / Procedia Computer Science 98 (2016) 437 – 442

Fig. 2. Confusion matrix of (a) SVM; (b) KNN; (c) NB; (d) C4.5, (e) LDA; (f) C5.0; (g) ANNs and (h) DLANNs for dataset 29

Fig. 3. Confusion matrix of (a) SVM; (b) KNN; (c) NB; (d) C4.5, (e) LDA; (f) C5.0; (g) ANNs and (h) DLANNs for dataset 30
Furqan Alam et al. / Procedia Computer Science 98 (2016) 437 – 442 441

Fig. 4. Confusion matrix of (a) SVM; (b) KNN; (c) NB; (d) C4.5, (e) LDA; (f) C5.0; (g) ANNs and (h) DLANNs for dataset 31

ANNs and DLANN have higher computational requirements. For IoT, there can be problems where high CA does not
matter much but processing time matters; in those cases, NB and LDA can be handy.

4.2. The Deep Learning Algorithm

Based on our preliminary assessment, we believe that DLANNs can have the best CA among all the simulated
algorithms. We observed that an improved classification accuracy could be achieved by increasing the epochs, hidden
layers and neurons. In DLANNs, CA also depends significantly on its parameters tuning. DLANNs have a very
complex structure, need a large amount of system resources, and as a result, DLANN algorithm has the highest
execution time among all the eight algorithms presented in this work.

Table 1. Classification accuracy in % and elapsed time in seconds for all the algorithms
Dataset 29 Dataset 30 Dataset 31
Algorithm Accuracy% Elapsed Time Accuracy% Elapsed Time Accuracy% Elapsed Time
SVM 98.57 2350.1 86.43 5.2 91.75 3.12
KNN 98.94 450.6 86.88 0.88 78.67 1.32
NB 77.04 0.52 70.09 0.02 52.72 0.51
C4.5 99.69 22.65 91.81 0.15 99.95 0.32
C5.0 99.62 21.1 90.26 0.13 99.96 0.4
LDA 81.85 0.98 71.53 0.02 66.4 0.04
ANN 99.03 33228.1 89.55 94.2 100 36
DLANN 99.52 12600 87.10 210.12 98.49 620.1

5. Conclusion

The IoT paradigm brings new sets of data mainly collected from sensor devices. To capture this hidden knowledge
from IoT data is a challenging task in data mining. Some researchers argue that a new family of data mining algorithms
are needed to handle IoT data. In our work, we examined the applicability of some of the well established data mining
442 Furqan Alam et al. / Procedia Computer Science 98 (2016) 437 – 442

algorithms including DLANNs. With our preliminary analysis, we conclude that C4.5, C5.0, ANNs and DLANNS
can give relatively higher accuracy results. We plan to conduct a detailed study on larger and diverse IoT datasets in
the future.

Acknowledgements

The experiments reported in this paper were performed on the Aziz supercomputer at King AbdulAziz University,
Jeddah, Saudi Arabia. We are also thankful to UCI data repository for making datasets available.

References

1. ITU. Recommendation ITU-T Y.2060 (06/2012), http://handle.itu.int/11.1002/1000/11559, 2012.


2. Atzori L, Iera A, Morabito G. The Internet of Things: A survey. Computer Networks. 2010;54(15):2787-2805.
3. Ma H. Internet of Things: Objectives and Scientific Challenges. J Comput Sci Technol. 2011;26(6):919-924.
4. Evans D, The Internet of Things: How the Next Evolution of the Internet Is Changing Everything, Cisco. April 2011. Retrieved 11 June 2016.
5. Chen F, Deng P, Wan J, Zhang D, Vasilakos A, Rong X. Data Mining for the Internet of Things: Literature Review and Challenges. International
Journal of Distributed Sensor Networks. 2015;2015:1-14.
6. Ngai E, Xiu L, Chau D. Application of data mining techniques in customer relationship management: A literature review and classification.
Expert Systems with Applications. 2009;36(2):2592-2602.
7. S. Cuomo, P. D. Michele, A. Galletti and F. Piccialli, ”A Cultural Heritage Case Study of Visitor Experiences Shared on a Social Network,”
2015 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Krakow, 2015, pp. 539-544.
8. Angelo Chianese, Fiammetta Marulli, Francesco Piccialli, Paolo Benedusi, Jai E. Jung, An associative engines based approach supporting
collaborative analytics in the Internet of cultural things, Future Generation Computer Systems, 2016.
9. Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H et al. Top 10 algorithms in data mining. Knowledge and Information Systems.
2007;14(1):1-37.
10. BURGES C. A Tutorial on Support Vector Machines for Pattern Recognition. Bell Laboratories and Lucent Technologies. 1998.
11. Burbidge RBuxton B. An Introduction to Support Vector Machines for Data Mining. UCL: Computer Science Dept. 2001.
12. Li W, Yi P, Wu Y, Pan L, Li J. A New Intrusion Detection System Based on KNN Classification Algorithm in Wireless Sensor Network.
Journal of Electrical and Computer Engineering. 2014;2014:1-8.
13. Li D, Zhang B, Li C. A Feature Scaling based k-Nearest Neighbor Algorithm for Indoor Positioning Systems. IEEE Internet of Things Journal.
2015;:1-1.
14. Bourobou SYoo Y. User Activity Recognition in Smart Homes Using Pattern Clustering Applied to Temporal ANN Algorithm. Sensors.
2015;15(5):11953-11971.
15. Pang S, Ozawa S, Kasabov N. Incremental Linear Discriminant Analysis for Classification of Data Streams. IEEE Trans Syst, Man, Cybern
B. 2005;35(5):905-914.
16. Li T, Zhu S, Ogihara M. Using discriminant analysis for multi-class classification: an experimental investigation. Knowledge and Information
Systems. 2006;10(4):453-472.
17. Quinlan J. C4.5: Program for machine learning. Morgan Kaufmann; 1993.
18. Kuhn MJohnson K. Applied predictive modeling. New York: Springer; 2013.
19. Hssina B, Merbouha A, Ezzikouri H, Erritali M. A comparative study of decision tree ID3 and C4.5. International Journal of Advanced
Computer Science and Applications, Special Issue on Advances in Vehicular Ad Hoc Networking and Applications. 2014.
20. Chunyue S, Zhihuan S, Ping L, Wenyuan S. The Study of Naive Bayes Algorithm Online in Data Mining. Proceedings of the 5th World
Congress on Intelligent Control and Automation,. 2004;6:15-19.
21. Liangxiao Jiang, Zhang H, Zhihua Cai. A Novel Bayes Model: Hidden Naive Bayes. IEEE Trans Knowl Data Eng. 2009;21(10):1361-1371.
22. Hongjun Lu, Setiono R, Huan Liu. Effective data mining using neural networks. IEEE Trans Knowl Data Eng. 1996;8(6):957-961.
23. Yashpal Singh, Alok hauhan A. Neural network in data mining. Journal of Theoretical and Applied Information Technology. 2009;:37-42.
24. Kamruzzaman Sarkar A. A New Data Mining Scheme Using Artificial Neural Networks. Sensors. 2011;11(12):4622-4647.
25. Schmidhuber J. Deep learning in neural networks: An overview. Neural Networks. 2015;61:85-117.
26. Xin H, Hong Z, Ding Y. A New Pedestrian Detect Method in Crowded Scenes. 2013 IEEE International Conference on Green Computing and
Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing. 2013.
27. Niu X, Zhu Y, Zhang X. DeepSense: A novel learning mechanism for traffic prediction with taxi GPS traces. 2014 IEEE Global Communica-
tions Conference. 2014.
28. Aryal JDutta R. Smart city and geospatiality: Hobart deeply learned. 2015 31st IEEE International Conference on Data Engineering Work-
shops. 2015.
29. Ugulino W, Cardador D, Vega K, Velloso E, Milidi? R, Fuks H. Wearable Computing: Accelerometers Data Classification of Body Postures
and Movements. Advances in Artificial Intelligence - SBIA 2012. 2012;:52-61.
30. Lichman M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information
and Computer Science. 2013.
31. Freire A, Barreto G, Veloso M, Varela A. Short-term memory mechanisms in neural network learning of robot navigation tasks: A case study.
2009 6th Latin American Robotics Symposium (LARS 2009). 2009.

View publication stats

You might also like