Professional Documents
Culture Documents
ISSN: 3159-0040
Vol. 2 Issue 6, June - 2015
Abstract— Data Analysis is getting more and more cluster data. To find mutually exclusive clustering of
important in Today’s world. If data of a person gets lost unique shape partitioning method is useful. To store
then, it will be a loss of his/her identity so maintain a data at multiple levels, consider Hierarchical methods.
big or huge amount of data without loosing, is a Density-based clusters can find arbitrarily shaped
challenge. Data clustering can do that because it can clusters and many more. Such as we can use grid
maintain a huge amount of data by dividing it into based methods for fast processing time, it depends on
various clusters. the grid size.
Clustering is separation of similar data from A. Intrusion Detection System-
dissimilar content. This paper reviews the data The larger and larger amount of data is generating
clustering techniques used to separate large amount of because of the network communication. The social
data. Motivation for this paper comes from an
media, social networking sites making a tremendous
enormous amount of data in clusters and finding a
amount of profit through all the communication. The
small errors in them. The review shows that Incremental
clustering is becoming a significant problem because management of data is also creating it is importance
the data is generating in an enormous amount. The data and with the increasing amount of network
can be of various types like it could be in the form of communication the risk is also increasing, so various
text, spatial data, images, sequence data, data in the techniques of Intrusion detection system(IDS) are
form of streams, multimedia data. To manage such coming to the market. Intrusion detection systems can
kinds of data, data mining has various techniques. be host-based intrusion detection system (HIDS), or it
Affinity Propagation (AP) is one of the methods that has could be Network intrusion detection system (NIDS).
use in much incremental clustering problems. Most HIDS monitors log files and system calls and NIDS
recent approaches reduce the data content by various monitors network.
methods such as compressed model that combines
horizontal compression with vertical compression. The Abnormal behaviors should be discovered at an
review covers incremental clustering, Affinity initial stage therefore research on anomaly detection is
propagation to maintain a large amount of data and going on rapidly than traditional signature based
other methods regarding clustering. The analysis detection. Training data should be secure by IDS
shows a trend towards reducing a data into small size because it is on a large scale, so an effective
so the data analyzation will become accessible. Data is technique to minimize time cost for detection is to
information and to take out the knowledge from that compress the volume of the model. To compress the
information is important, so that mining of data will volume of model Chen et.al. [1] Proposed a
become efficient. compressed model that combines horizontal
compression with vertical compression. To compress
Keywords—incremental clustering, affinity large training data, this model plays a significant role of
propagation, Data mining, intrusion detection. compressing data.
I. INTRODUCTION Intrusion detection system is an important research
With the increasing amount of data, the topic with many potential applications. Because of the
management of data is becoming more and more larger amount of communication, the Ids design is
difficult. So to manage that data, data analysts are getting much more complicated.[1] To discover
using different techniques of data mining so that to abnormal behaviors at it is an early stage the anomaly
mine that data will be feasible. Data mining is an detection method can used, because anomaly
approach to discovering interesting patterns from a detection is more modern and efficient than signature
large amount of data. Clustering is making a group of based detection. Anomaly detection has the advantage
a set of objects into multiple clusters. In clustering the to detect unknown attacks because it uses heuristic
objects that are in the cluster should have the same learning on historical training data. To find out a
similarity, and they should be very dissimilar to the systematic solution on training data the compressed
objects in another cluster. model is used.[1][22] The compressed model is useful
for efficiently and effectively handling the problems.
Partitioning methods can do clustering; Hierarchical
methods, Density-based methods, Grid-based There are some attacks that are harmful to our
clustering, these are some methods that used in system. The attacks are as follows –
clustering However it entirely depends on how to
www.jmest.org
JMESTN42350667 1305
Journal of Multidisciplinary Engineering Science and Technology (JMEST)
ISSN: 3159-0040
Vol. 2 Issue 6, June - 2015
B. Incremental clustering-
Clustering is making a set of similar data points
and for that we have to define a measure of similarity.
Clustering is a solution to a problem of large training
data. However, the new issue of the incremental
cluster regarding clustering is arised, Because of the
Fig 3. Incremental clustering
network communication and other uses the data that
is generating a massive amount, so the cluster size is The data is increasing and with the data the clusters
also increasing. Most of the applications in clustering are also increasing and the cluster size grows too.
deals with static data [2]. The solution to the problem The data points that are not similar to any data points
of incremental clustering is necessary. Apply Affinity and which are far from others they include as outliers.
Data streaming is also large scale clusters that stores
www.jmest.org
JMESTN42350667 1306
Journal of Multidisciplinary Engineering Science and Technology (JMEST)
ISSN: 3159-0040
Vol. 2 Issue 6, June - 2015
data of living streaming. The data of streaming is also traditional model without model compression [1].
increasing. Zhang et.al. [3]. applied Affinity Affinity propagation is a way better algorithm in data
Propagation (AP) make clusters. stream clustering, AP is used to cluster the data with
the best representatives. AP gives guarantee of
clustering optimality to select exemplars [1][2][3]. AP
C. Aims and Overview- clusters multispectral images [4].
This paper reviews some incremental clustering
MapReduce parallelization approach compress the
problems that are coming to the researcher. There are
large scale of training data. The compression of data
some solutions found on the issue of incremental
using MapReduce can do the data compression. For
clustering. AP is a better algorithm for clustering
applying MapReduce, first MapReduce parallel
problems.
processing for OneR compresses training data, then
Chen et.al. employed Affinity Propagation as MapReduce parallel processing for AP compresses
vertical compression to select small exemplars from the reversed transposed matrix and in the end the
training data in a large amount[1]. Sun & Guo applied compression takes place. Chen et. al. [1] proposed a
Affinity Propagation (AP) in incremental clustering compressed model and checked the results using
problems [2]. Zhang et.al. also applied AP on Data KDD99 and CDMC2012 dataset on KNN and SVM
Stream Clustering to Cluster the data[3]. Zhang et.al. classifiers.
have done analysis of Functional magnetic resonance
B. Incremental clustering of AP-
imaging(fMRI) data using AP and integrated principal
component analysis(PCA)[4]. On the same technology, Clustering is a main topic in data mining.
AP Clustering in Multispectral images is done by Yang However, with clustering the problem of incremental is
et al. arising now a days [3][4]. The fig 1,2,3 gives the idea
of incremental clustering. Clustering algorithms are
II. RELATED WORK discovered and designed to identify patterns in static
A. AP in vertical compresssion- data[2].
To achieve high efficiency of classification in IDS However Vedic Mathematics offers an entirely unique
Chen et. al. [1] proposed a compressed model. In and a new approach for pattern recognition [7].
compressed model, OneR classifier is used for
horizontal compression and Affinity propagation (AP) Sun & Guo [2] have extended a clustering
algorithm AP to handle dynamic data. Affinity
employed as vertical compression. AP used in many
Propagation (AP) is an exemplar based method. In AP,
clustering problems. Sun & Guo [2] used AP in
exemplars are recognized by passing the messages
incremental clustering problems. AP clustering based on a bipartite graph. There is a difficulty in expanding
on K-medoids. Feature selection or Attribute reduction AP in dynamic clustering of data; that is the objects
are the modern methods to improve detection that are established have a particular relationship to
efficiency [1]. Chen et. al. [1] improved efficiency of each other. After AP, While new relationships of
the model with the increasing accuracy so that the objects are at the initial level objects add at different
model will detect intrusions. Compressed model is statuses and different time, so it was hard to get
built using training data. proper exemplary set by continuing the AP.
www.jmest.org
JMESTN42350667 1307
Journal of Multidisciplinary Engineering Science and Technology (JMEST)
ISSN: 3159-0040
Vol. 2 Issue 6, June - 2015
www.jmest.org
JMESTN42350667 1308
Journal of Multidisciplinary Engineering Science and Technology (JMEST)
ISSN: 3159-0040
Vol. 2 Issue 6, June - 2015
www.jmest.org
JMESTN42350667 1309
Journal of Multidisciplinary Engineering Science and Technology (JMEST)
ISSN: 3159-0040
Vol. 2 Issue 6, June - 2015
[16] Zhang, J. P., Chen, F. C., Liu, L. X., et al. (2013).
Online stream clustering using density and affinity
propagation algorithm. In 2013 4th IEEE
international conference on software engineering
and service science (ICSESS) (pp. 828–832).
IEEE.
[17] Wei, M., Xia, L., Jin, J., et al. (2012). Research of
intrusion detection based on clustering analysis.
In Proceedings of the 2012 international
conference on cybernetics and informatics (pp.
1973–1979). New York: Springer.
[18] Wang, F., Chawla, S., & Surian, D. (2013). Latent
outlier detection and the low precision problem. In
Proceedings of the ACM SIGKDD workshop on
outlier detection and description (pp. 46–52).
ACM.
[19] Srinivas, M., & Andrew, H. S. (2003). Feature
selection for intrusion detection using neural
networks and support vector machines. In Annual
meeting of the transportation research board.
[20] J. Han, M. Kamber, and J. Pei, Data Mining:
Concepts and Techniques, 3rd edition, Morgan
Kaufmann, 2011. p. 444.
[21] T.W. Liao, ”Clustering of Time Series Data: A
Survey,” Pattern Recognition, vol. 38, no. 11, pp.
1857-1874, Nov. 2005.
[22] H. Geng, X. Deng, and H. Ali, ”A New Clustering
Algorithm Using Message Passing and its
Applications in Analyzing Microarray Data,” Proc.
Fourth Int’l Conf. Machine Learning and
Applications (ICMLA ’05), 2005.
[23] L. Nicolas, ”New Incremental Fuzzy c medoids
Clustering Algorithms,” Proc. 2010 Annual
Meeting of the North American on Fuzzy
Information Processing Society (NAFIPS ’10), pp.
1-6, July 2010.
[24] Y. He, Q. Chen, X. Wang, R. Xu, X. Bai, and X.
Meng, ”An Adaptive Affinity thPropagation
Document Clustering,” Proc. the 7 Int’l Conf.
Informatics and Systems (INFOS ’10), pp. 1-7,
March 2010.
[25] A triangle area based nearest neighbors approach
intrusion detection Chih-Fong Tsai∗, Chia-YingLin
[26] Chang, C. C., & Lin, C. J. (2011). LIBSVM: A
library for support vector machines. ACM
Transactions on Intelligent Systems and
Technology (TIST), 2(3), 27.
[27] Daniela, B., Kavé, S., & Margin, M. (2010). A
signal processing view on packet Sampling and
anomaly detection. In IEEE INFOCOM (pp. 713–
721).
[28] Davis, J. J., & Clark, A. J. (2011). Data
preprocessing for anomaly-based network
intrusion detection: A review. Computers &
Security, 30(6), 353–375.
[29] Meng, Y. (2012). Adaptive false alarm filter using
machine learning in intrusion detection. Practical
Applications of Intelligent Systems. Berlin,
Heidelberg: Springer, pp. 573–584.
www.jmest.org
JMESTN42350667 1310