STREAMING DATA
A PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
March 20
BONAFIDE CERTIFICATE
________________________ ________________________
ABSTRACT
In traditional clustering, the dataset is generally static, so it can be processed and
analyzed many times. A data stream clustering algorithm, however, must satisfy
constraints on real-time response, bounded memory, single-pass evaluation, and
concept-drift detection, which limits the applicability of traditional clustering
algorithms. In this era, huge volumes of data emanate from various digital sources.
Streaming data is a continuously arriving sequence of data generated in areas such as
telephone records, multimedia, financial transactions, and network traffic. Further, a
data stream can be noisy and its properties may change over time. Our project applies an
online, bio-inspired approach to cluster dynamic data streams. The proposed algorithm is
density-based: high-density regions of data points are separated from low-density areas,
and clusters are formed by grouping micro-clusters. A tumbling window is used to read the
stream of data, and during a single pass of the window rough clusters are incrementally
formed. These rough clusters are often noisy and impure, and they can be refined using the
sorting behavior of ants, which pick up items and drop them in a more similar region.
Sorting ants cluster the data by picking up micro-clusters and dropping them into the most
similar and denser cluster. Clusters are summarized using the concept of micro-clusters,
and these statistics are stored offline for future use so that the data can be accessed or
edited later. The algorithm requires only three parameters (window size, sleep max, and
epsilon) and little computational time. To identify the best epsilon value, the silhouette
coefficient is used: from a range of epsilon values, the one with the highest silhouette
coefficient is selected, and that epsilon value gives accurate results. Experimental
results show that the clustering quality is scalable, robust to noise, and compares
favorably with leading stream clustering algorithms.
ACKNOWLEDGEMENT
Apart from our own efforts, the success of our project depends largely on the
encouragement of many others. We take this opportunity to express our gratitude to the
people who have been instrumental in the successful completion of our project. We
thank our college management for providing the amenities required for our project.
We would like to convey our sincere thanks to our respected Principal,
Dr.S.Arivazhagan, M.E.,Ph.D., Mepco Schlenk Engineering College, for providing us with
facilities to complete our project.
We extend our profound gratitude and heartfelt thanks to Dr.T.Revathi, M.E.,Ph.D.,
Senior Professor and Head of the Department of Information Technology for providing us
constant encouragement.
We are bound to thank our project coordinator Dr.J.Angela Jennifa Sujana,
M.Tech., Ph.D., Associate Professor of Information Technology. We sincerely thank our
project guide Dr.T.Revathi, M.E.,Ph.D., Senior Professor, Department of Information
Technology, for her inspiring guidance and valuable suggestions to complete our project
successfully.
The guidance and support received from all the staff members and lab technicians of
our department who contributed to our project, was vital for the success of the project. We
are grateful for their constant support and help.
We would like to thank our parents and friends for their help and support in our
project.
TABLE OF CONTENTS
CONTENT PAGE NO
LIST OF TABLES x
LIST OF FIGURES xii
LIST OF SYMBOLS xiv
LIST OF ABBREVIATIONS xvi
CHAPTER 1 INTRODUCTION 1
1.1 STREAM CLUSTERING 2
1.2 CLUSTERING AND ITS TYPES 2
1.2.1 PARTITIONING CLUSTERING 2
1.2.2 HIERARCHICAL CLUSTERING 2
1.2.3 FUZZY CLUSTERING 2
1.2.4 MODEL BASED CLUSTERING 3
1.3 DENSITY BASED CLUSTERING 3
1.4 TUMBLING WINDOW 3
1.5 DENSITY BASED APPROACH FOR CLUSTERING 3
CHAPTER 2 LITERATURE STUDY 5
2.1 OVERVIEW 6
2.2 DENSITY-BASED CLUSTERING OVER AN 6
EVOLVING DATA STREAM WITH NOISE
2.3 A SINGLE PASS ALGORITHM FOR CLUSTERING 6
EVOLVING DATA STREAMS BASED ON SWARM
INTELLIGENCE
2.4 A FRAMEWORK FOR CLUSTERING EVOLVING 7
DATA STREAMS
2.5 ANTCLUST: ANT CLUSTERING AND WEB USAGE 7
MINING
2.6 THE CLUSTREE: INDEXING MICRO-CLUSTERS 7
FOR ANYTIME STREAM MINING
CHAPTER 3 SYSTEM STUDY 9
3.1 SCOPE 10
3.2 PRODUCT FUNCTION 10
3.3 SYSTEM REQUIREMENTS 11
3.3.1 HARDWARE INTERFACES 11
3.3.2 SOFTWARE INTERFACES 11
3.3.2.1 ANACONDA 11
3.3.2.2 SPYDER 11
3.3.2.3 PYTHON 12
3.3.2.4 PYSPARK 12
CHAPTER 4 SYSTEM DESIGN 13
4.1 OVERVIEW 14
4.2 OVERALL ARCHITECTURE 14
4.3 MODULES 14
4.3.1 FORMATION OF ROUGH CLUSTERS 15
4.3.2 GROUPING MICRO-CLUSTERS 16
4.3.3 CATEGORIZATION OF CLUSTERS 18
CHAPTER 5 IMPLEMENTATION METHODOLOGY 21
5.1 OVERVIEW 22
5.2 ESSENTIAL LIBRARIES 22
5.2.1 NUMPY 22
5.2.2 MATH 22
5.2.2.1 Sqrt() 22
5.2.3 SCIPY 22
5.2.3.1 scipy.spatial.distance.pdist() 23
5.2.4 FUNCTOOLS 23
5.2.4.1 Reduce() 23
5.2.5 TKINTER 23
5.2.6 MATPLOTLIB 23
5.2.6.1 Matplotlib.pyplot 23
5.3 FUNCTIONS USED FOR IMPLEMENTATION 23
5.3.1 FIND CLUSTERS 24
5.3.2 SUITABILITY 24
5.3.3 SIMILARITY 24
5.3.4 GROUP MICRO-CLUSTERS 24
5.3.5 RADIUS 24
5.3.6 GROUP CLUSTERS 24
5.3.7 PICK 25
5.3.8 DROP 25
5.3.9 PURITY 25
5.3.10 F-MEASURE 25
5.3.11 SILHOUETTE 25
5.3.12 SHOW TABLE 25
CHAPTER 6 PERFORMANCE METRICS 26
6.1 OVERVIEW 27
6.2 SILHOUETTE CO-EFFICIENT 27
6.3 PURITY 28
6.4 F-MEASURE 28
6.5 RAND INDEX 29
CHAPTER 7 RESULTS AND DISCUSSION 30
7.1 OVERVIEW 31
7.2 DATASET 31
7.2.1 STATIONARY DATASET 31
7.2.2 NON-STATIONARY DATASET 31
7.3 SCREENSHOTS 32
CHAPTER 8 CONCLUSION AND FUTURE WORK 44
8.1 CONCLUSION 45
8.2 FUTURE WORK 45
CHAPTER 9 APPENDIX 46
9.1 CODING 47
CHAPTER 10 REFERENCES 57
10.1 REFERENCES 58
LIST OF TABLES
TABLE NO   TABLE NAME                                        PAGE NO
7.3.1      Performance of Stationary dataset                 42
7.3.2      Performance of Non-stationary dataset             42
7.3.3      Performance of B1C10D25 with noise                42
7.3.4      Result of Existing Stream Clustering Algorithms   43
7.3.5      Result of Existing Ant Clustering Algorithms      43
LIST OF FIGURES

LIST OF SYMBOLS
NOTATION   MEANING
X          Dataset
C          Center of dataset
R          Radius of dataset
x_i        Data point
S          Silhouette value
P          Purity
F          F-Measure
K          Clusters
R          Rand index
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
In recent years, a large amount of streaming data has been generated, and analyzing
and processing such data has become a hot topic. Streaming data is data, such as
multimedia, telephone records, and financial transactions, that arrives continuously.
Streaming data clustering groups streaming data into clusters with similar behavior.
Streaming data should be processed incrementally using stream processing techniques
without access to all of the data. The goal of stream clustering is to group the
streaming data into similar classes. Streaming data can be examined only once. A
stream can be unbounded and infinite, but only a limited amount of memory is
available. The nature of the stream means that data can drift, and new clusters can
appear, disappear, and reappear repeatedly.
Clustering is the task of dividing data points into groups such that points in the
same group are more similar to one another than to those in other groups. In simple
words, it segregates data with similar traits and assigns them into clusters.
Clustering can be approached in several ways:
Partitioning algorithms are clustering techniques that subdivide the data set into a
set of k groups, where k is the number of groups pre-specified by the analyst.
In fuzzy clustering, each data point can be a member of more than one cluster. Each
point has a set of membership coefficients corresponding to its degree of belonging
to a given cluster.
In model-based clustering, data are viewed as coming from a distribution that is a
mixture of two or more clusters. It finds the best fit of models to the data and
estimates the number of clusters.
1.3 DENSITY BASED CLUSTERING
Density-based clustering is a partitioning method that can find clusters of different
shapes and sizes in data containing noise and outliers. It defines clusters as
high-density areas separated by areas of low density. In our proposed algorithm, highly
dense areas are described using micro-clusters with center c and radius r. A
micro-cluster has a maximum radius epsilon (є), i.e. radius ≤ є, and a data point is
assigned to a micro-cluster if it lies within that radius. The number of micro-clusters
can be greater than the number of actual clusters, but there are far fewer micro-clusters
than data points. This has two advantages:
i) statistics about clusters can be stored in a fraction of the space;
ii) evaluating micro-clusters is easier than evaluating individual data points.
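The micro-cluster summary described above can be sketched as a small Python class. The class and method names here are ours, not from the report, but the statistics (count N, per-dimension linear sum LSum, and squared sum SSum) follow the center and radius definitions given later in equations (3) and (4):

```python
import math

class MicroCluster:
    """Minimal micro-cluster summary: N points are represented by their
    count, per-dimension linear sum, and per-dimension squared sum."""

    def __init__(self, point):
        self.N = 1
        self.LSum = list(point)                 # linear sum per dimension
        self.SSum = [p * p for p in point]      # squared sum per dimension

    def add(self, point):
        """Absorb a new point by updating the additive statistics."""
        self.N += 1
        for i, p in enumerate(point):
            self.LSum[i] += p
            self.SSum[i] += p * p

    def center(self):
        # C = LSum / N
        return [ls / self.N for ls in self.LSum]

    def radius(self):
        # R = sqrt(SSum/N - (LSum/N)^2), summed over dimensions
        var = sum(ss / self.N - (ls / self.N) ** 2
                  for ss, ls in zip(self.SSum, self.LSum))
        return math.sqrt(max(var, 0.0))
```

Because the statistics are additive, absorbing a point or merging two micro-clusters never requires revisiting the raw data, which is what makes the single-pass constraint satisfiable.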
point is considered as a micro-cluster, and these micro-clusters attempt to merge only
with other micro-clusters present in the same cluster. These rough clusters are refined
using an ant-inspired sorting method. The pick-and-drop method is based on the behavior
of certain species of ants that sort their larvae into piles: a sorting ant picks up
isolated items and drops them at locations where similar items are present. Sorting ants
are assigned to every cluster. They refine the initial clusters by picking up
micro-clusters and dropping them into more suitable clusters. Smaller clusters dissolve
and their contents move to similar, larger clusters. This simplifies the method,
decreases the total complexity, and allows for effective sampling.
CHAPTER 2
LITERATURE STUDY
2.1 OVERVIEW:
This chapter describes the existing traditional and streaming algorithms, their
limitations regarding available memory, high computational time, and poor robustness to
noisy data, and techniques to cluster streaming data.
2.4 A FRAMEWORK FOR CLUSTERING EVOLVING DATA
STREAMS [3]:
The CluStream algorithm uses the idea of micro-clusters to cluster dynamic data
streams. CluStream is risky when the data stream is noisy because it maintains only a
fixed number of clusters. CluStream proposes two phases: an online phase and an offline
phase. In the online phase, a group of micro-clusters is stored in memory; when a data
point arrives from the stream, it is either added to an existing micro-cluster or to a
new one. The online phase satisfies the single-pass limitation so that huge amounts of
data can be clustered. A new micro-cluster is formed by merging the two most similar
micro-clusters and deleting the originals. The offline phase applies a weighted k-means
algorithm to the micro-clusters to obtain the final clusters from the data stream. Using
micro-clusters instead of raw data points confirms that the approach scales to large
amounts of data. The algorithm is evaluated using the open-source software Massive
Online Analysis.
approach is the first anytime clustering procedure for data streams. The anytime
clustering of ClusTree overcomes the problems of older batch-processing algorithms.
Experimental results show that it is capable of handling a variety of stream
characteristics for accurate and scalable anytime clustering of stream data.
Furthermore, the authors discussed the suitability of this method for discovering
clusters of arbitrary shape and for modeling cluster changes and data evolution using
novel methods.
CHAPTER 3
SYSTEM STUDY
3.1 SCOPE:
The scope of A Density Based Approach To Cluster Streaming Data is to achieve a
significant decrease in computational time and to cluster vast amounts of data without
compromising cluster quality.
3.2 PRODUCT FUNCTION:
Initially, each data point is considered an individual cluster, and the unbounded data
stream is passed through the tumbling window.
The first data point forms a new cluster; the suitability of each successive data point
with the already formed clusters is then calculated. If it is less than or equal to
epsilon, the point is added to that cluster; otherwise it forms a new cluster.
The similarity between each cluster and its neighbouring clusters is updated, and the
formed clusters are refined by creating micro-clusters.
Two micro-clusters are grouped into a single micro-cluster if the resulting radius is
less than epsilon; otherwise the merge operation fails.
Micro-clusters m1 and m2 are said to be density reachable if the distance between the
centre of m1 and the centre of m2 is less than epsilon.
The merge operation takes place within a cluster only.
A sorting ant is assigned to each cluster; it randomly picks up a micro-cluster from its
cluster and combines it into a larger cluster.
If the selected micro-cluster and another micro-cluster in the same cluster are density
reachable, a reachable count is incremented.
If a micro-cluster is picked up successfully, the carrying flag is set to true and the
ant moves to the most similar cluster and drops the micro-cluster based on the inverse
of the pick probability.
If the drop succeeds, the selected micro-cluster moves to the new cluster; otherwise it
moves back to its native cluster. This operation continues until the ant is asleep or
the cluster becomes empty.
If either the pick or the drop operation is unsuccessful, a counter is incremented. When
the counter reaches the sleep max value, the cluster is considered sorted and the
sleeping flag is set to true.
The counter is reset to zero after a successful operation or when a new micro-cluster is
placed in the cluster by a foreign sorting ant.
If there is only one micro-cluster in a cluster, it is an outlier. The result is the set
of non-empty sorted clusters.
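The merge and density-reachability conditions described above can be sketched in a few lines of Python. The epsilon value and the function names below are illustrative, not from the report:

```python
import math

EPSILON = 0.5  # assumed threshold; the report tunes epsilon via the silhouette coefficient

def distance(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def density_reachable(center1, center2, epsilon=EPSILON):
    # Two micro-clusters are density reachable if their centers lie within epsilon.
    return distance(center1, center2) < epsilon
```

A merge would additionally check that the radius of the combined micro-cluster stays within epsilon, as the product-function steps state.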
3.3.2.1 ANACONDA
Anaconda is a platform used for machine learning and other large-scale data
processing. It is a free, open-source distribution capable of working with the R and
Python programming languages. It consists of more than 1500 packages and a virtual
environment manager, Anaconda Navigator, which comprises all the libraries to be
installed. It ships with applications such as Spyder, JupyterLab, Jupyter Notebook,
Orange, and RStudio.
3.3.2.2 SPYDER
The IDE used to implement the proposed system is the Spyder environment. It is an
open-source, cross-platform integrated development environment that combines advanced
features such as debugging, editing, and analysis of huge data sets. The tool helps with
interactive execution, data exploration, and visualization of data.
3.3.2.3 PYTHON
Python is an interpreted, general-purpose programming language with high-level data
structures. It can be used for creating server-side web applications, and it is also
suitable as an extension language for customized applications.
3.3.2.4 PYSPARK
PySpark is the Python Application Programming Interface for Apache Spark. Large-scale
data analysis is handled by the distributed framework known as Spark. PySpark is widely
used in data science and machine learning, integrates easily with other languages, and
processes data faster than comparable frameworks.
CHAPTER 4
SYSTEM DESIGN
4.1 OVERVIEW
This chapter presents an overview of the whole system. Section 4.2 shows the overall
architecture; Section 4.3 defines the three main modules; Section 4.3.1 describes how
clusters are formed from the unbounded data streams; Section 4.3.2 describes the merge
operation on the previously formed rough clusters; and Section 4.3.3 describes how the
clusters are categorized and stored offline.
4.3 MODULES
1.  while <window size> do
2.    for <each point> do
3.      if <clusters exist> then
4.        find suitability with clusters (1)
5.        if <suitable> then
6.          add the data point to the cluster
7.        else
8.          create a new cluster
9.      else if <no clusters> then
10.       create a new cluster
11. update similarity value (2)
12. return rough clusters
The data streams are processed through a non-overlapping tumbling window. At each
iteration, a fixed-size, non-repeated set of data is considered. In a single pass of the
window, clusters are formed incrementally. The first point seeds a new cluster;
subsequent data points are assigned to an existing cluster or seed a new cluster. The
suitability of a data point q with a cluster c containing n samples is estimated using
the Euclidean distance as follows:

Suitability(S) = ( Σ distance(c_i, q) ) / n        (1)
Where,
n – number of points already present in the cluster
q – data point
c_i – ith point of the existing cluster
Every cluster is evaluated and each data point is assigned to the most suitable cluster,
provided the suitability is equal to or below epsilon. A new cluster is formed if the
suitability value is greater than epsilon. The parameter є is also the maximum radius of
a micro-cluster in the subsequent step.
As we evaluate each point's suitability with every cluster, we store each cluster's
suitability. Upon establishing a cluster, we update the similarity information between
that cluster and its neighboring clusters. The similarity between clusters c1 and c2 is
the average suitability of each data point q in cluster c1 with cluster c2:

Similarity(c1, c2) = ( Σ Suitability(q, c2) ) / n        (2)

The similarity to every neighboring cluster is a running average, updated whenever a new
point is allotted to the cluster. Although comparable, Similarity(c1, c2) ≠ Similarity(c2, c1).
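The suitability test of equation (1) and the assignment rule above can be sketched as follows. The function names are ours; a real implementation would also maintain the running similarity averages of equation (2):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def suitability(q, cluster):
    # Eq. (1): average distance from point q to the n points already in the cluster
    return sum(dist(p, q) for p in cluster) / len(cluster)

def assign(q, clusters, epsilon):
    """Add q to the most suitable existing cluster if its suitability is
    <= epsilon; otherwise seed a new cluster (clusters is a list of lists)."""
    if clusters:
        best = min(clusters, key=lambda c: suitability(q, c))
        if suitability(q, best) <= epsilon:
            best.append(q)
            return
    clusters.append([q])
```

For example, with epsilon = 0.5, a point at (0.1, 0.0) joins a cluster seeded at the origin, while a point at (5.0, 5.0) seeds a new cluster.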
C = LSum / N        (3)

R = sqrt( SSum/N − (LSum/N)² )        (4)
A micro-cluster can also comprise a temporal variable, but this is not vital in this
window model. A micro-cluster M can absorb a point p if, after updating the LSum and
SSum of M with p, radius(M) ≤ epsilon. Likewise, two micro-clusters m1 and m2 can try to
merge into a single micro-cluster M by adding their components:

M.N = m1.N + m2.N
M.LSum_i = m1.LSum_i + m2.LSum_i        (5)
M.SSum_i = m1.SSum_i + m2.SSum_i

If radius(M) ≤ є, the clusters merge; otherwise, the merge operation fails.
Micro-clusters m1 and m2 are said to be density reachable if

distance(center(m1), center(m2)) < є        (6)
The earlier step discovers clusters in a single pass of the window. The clusters
identified at this phase are often rough, impure, and too many. In this stage,
micro-clusters are formed and merged, and inter-cluster sorting is performed. At first,
each d-dimensional point p in each cluster is treated as its own micro-cluster M, with
radius zero and center p. Formally, we have

M.N = 1
M.LSum_i = p_i, i = {1,..., d}
M.SSum_i = p_i², i = {1,..., d}

Where,
p_i – ith dimension of data point p
Before the sorting operation, each micro-cluster tries to merge with the other
micro-clusters in the same cluster. The merge operation is performed by comparing each
micro-cluster with every other one in the same cluster; if, after adding their
constituent parts (5), the radius is less than or equal to є, the merge succeeds.
Merging at this step ensures that only neighboring micro-clusters try to merge and
avoids the unnecessary cost of comparing micro-clusters in different dense areas.
Another benefit is that, during the sorting phase, the n data points represented by a
micro-cluster can be moved in a single operation, speeding up the categorization and
reducing the number of pairwise comparisons. An ant is assigned to each cluster for sort
operations. Each ant is resident in its own cluster. The ants probabilistically choose
to pick up a micro-cluster from their cluster: a micro-cluster M is selected at random
from cluster k and is iteratively compared with the n micro-clusters in the same
cluster. The Euclidean distance from the center of M to each of the selected
micro-clusters is evaluated and, if the two are density reachable (6), a nearby count is
incremented. The probability of picking up a micro-cluster is calculated as follows:

Ppick = 1 − (nearby / n)        (7)
This ensures a higher probability of a pick-up in clusters containing fewer
micro-clusters, which leads to the dissolution of smaller clusters into larger, similar
clusters. If a micro-cluster is successfully picked up, the Boolean variable carrying is
set to true and the sorting ant moves to a neighboring cluster and attempts to drop it.
The ant moves to the most similar cluster, ensuring that it does not try to drop
micro-clusters into clusters that are dissimilar to its own. An ant attempts to drop its
micro-cluster in the new cluster based on the converse of (7). If the drop operation
succeeds, the micro-cluster is moved to the new cluster; otherwise, the micro-cluster
remains in its original cluster. The ant then returns to its resident cluster and
updates the similarity information between the two clusters with the latest suitability
score [see (2)]. Each ant continues to sort its cluster until either the cluster is
empty or the ant is "asleep." Each sorting ant has a counter; if a pick-and-drop
operation is unsuccessful, either picking or dropping, this counter is incremented. When
the counter reaches sleepmax, the cluster is considered sorted and a Boolean flag sleep
is set to true. The counter is reset to zero after a successful operation or if a new
micro-cluster is placed in the cluster by a foreign sorting ant. When every ant is
sleeping, the process terminates. This step simplifies each cluster and causes many
smaller, similar clusters to dissolve and form larger clusters. Clusters containing only
one micro-cluster are outliers, and the clustering result is given as the group of
non-empty clusters. Each cluster contains a group of density-reachable micro-clusters
that summarize the partitioned region of high density in the feature space. This summary
information is stored offline for further use.
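The pick and drop probabilities can be sketched as below, assuming equation (7) reads Ppick = 1 − nearby/n (so the drop probability, its converse, is nearby/n). The function names and the injectable random source are our own conveniences:

```python
import random

def p_pick(nearby, n):
    # Eq. (7): fewer density-reachable neighbours -> higher chance of being picked up
    return 1.0 - nearby / n

def p_drop(nearby, n):
    # converse of Eq. (7): many similar neighbours here -> likely to be dropped
    return nearby / n

def attempt_pick(nearby, n, rng=random.random):
    """Probabilistic pick attempt; rng is injectable so the behaviour can be tested."""
    return rng() < p_pick(nearby, n)
```

An isolated micro-cluster (nearby = 0) is picked up with probability 1, while one surrounded entirely by density-reachable neighbours (nearby = n) is never picked, which matches the sorting behaviour described above.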
CHAPTER 5
IMPLEMENTATION METHODOLOGY
5.1 OVERVIEW:
The implementation methodology describes the main functional requirements needed for
the project.
5.2.1 NUMPY:
NumPy is a library for the Python language for large, multi-dimensional arrays and
matrices, together with a large collection of high-level mathematical functions that
operate on these arrays. It is particularly useful for algorithm developers: processing
with plain Python functions and operators is slow on multi-dimensional arrays, and NumPy
overcomes this restriction. The package also allows custom datatypes to be defined.
5.2.2 MATH:
The Math Library provides us access to some common mathematical functions and
constants in Python, which we can use in our code for complex mathematical computations.
The library is an in-built Python module, so we don't have to do any installation. The
mathematical function used in this project is as follows:
5.2.2.1 Sqrt():
sqrt() function is an inbuilt function in Python that returns the square root of any
number. It is used to find radius of the micro-cluster.
5.2.3 SCIPY:
SciPy is an open-source scientific library for Python. It depends on NumPy, which
provides convenient and fast N-dimensional array manipulation. The SciPy library is
built to work with NumPy arrays and provides many user-friendly and efficient numerical
routines, such as those for numerical integration and optimization.
5.2.3.1 scipy.spatial.distance.pdist:
In the project, pdist is used to find pairwise Euclidean distances between observations
in an n-dimensional array. squareform converts a vector-form distance vector to a
square-form distance matrix, and vice versa.
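A minimal example of the two functions together, with illustrative sample points of our own choosing:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [0.0, 4.0]])

d = pdist(points)      # condensed vector of pairwise Euclidean distances
m = squareform(d)      # 3x3 symmetric distance matrix with a zero diagonal

print(m[0, 1])         # 5.0, the distance between the first two points
```

The condensed form returned by pdist stores each pair once, which halves the memory needed for large pairwise comparisons.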
5.2.4 FUNCTOOLS:
The functools module is for higher-order functions, i.e. functions that act on or
return other functions. In general, any callable object can be treated as a function
for the purposes of this module.
5.2.4.1 Reduce():
The reduce() function applies a given function cumulatively to the elements of the
sequence passed to it. In our project it is used to convert a 2-d list into a 1-d list.
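For instance, the 2-d-to-1-d flattening mentioned above can be done like this (the sample list is illustrative):

```python
from functools import reduce
from operator import add

# flatten a 2-d list into a 1-d list, as done in the project
grid = [[1, 2], [3], [4, 5]]
flat = reduce(add, grid, [])

print(flat)  # [1, 2, 3, 4, 5]
```

The empty-list initializer makes the call safe even when the outer list is empty.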
5.2.5 TKINTER:
The GUI used in the project is built with the tkinter library and its widgets. Tkinter
is the standard GUI library for Python and provides a fast and easy way to create GUI
applications. It delivers a powerful object-oriented interface to the Tk GUI toolkit and
offers various controls, such as buttons, labels, and text boxes, for use in a GUI
application.
5.2.6 MATPLOTLIB:
Matplotlib is one of the most common packages used for data visualization in Python. It
is a cross-platform library for creating 2-dimensional plots from data in arrays, and it
is written in Python. Matplotlib, together with NumPy, can serve as an open-source
plotting environment.
5.2.6.1 matplotlib.pyplot:
matplotlib.pyplot is a collection of command-style functions that make matplotlib work
like MATLAB. Each pyplot function makes some change to a figure: e.g., it creates a
figure, creates a plotting area in a figure, plots lines in a plotting area, decorates
the plot with labels, etc. In this project, matplotlib.pyplot is used to generate
graphs, such as bar graphs and Cartesian plots, which represent the different parameters
versus the performance of the algorithm.
5.3.2 Suitability:
The suitability of a data point with a cluster c of n samples is estimated using the
Euclidean distance. Every cluster is evaluated and each data point is assigned to the
most suitable cluster, provided the suitability is equal to or below epsilon. A new
cluster is formed if the suitability value is greater than epsilon.
5.3.3 Similarity:
The similarity between clusters c1 and c2 is the average suitability of each data point
q in cluster c1 with cluster c2. The similarity to every neighboring cluster is a
running average, updated whenever a new point is allotted to the cluster.
5.3.5 Radius:
The radius R of a micro-cluster is calculated in this method using the linear sum and
squared sum of the d-dimensional arrays together with the number of data points (n).
5.3.7 Pick:
The Euclidean distance from the center of M to each of the selected micro-clusters is
evaluated and, if the two are density reachable, a nearby count is incremented. If the
probability of picking up a micro-cluster is high, the pick succeeds and the
micro-cluster is selected.
5.3.8 Drop:
The ant moves to the most similar cluster and drops the micro-cluster based on the
inverse of the pick probability. If the drop succeeds, the selected micro-cluster moves
to the new cluster; otherwise it moves back to its native cluster.
5.3.9 Purity:
The purity metric measures how homogeneous a cluster is. The overall purity over the
total number of clusters is calculated in this method.
5.3.10 F-Measure:
The F-measure, also called F-score or F1-score, is the harmonic mean of the recall and
precision scores. The overall F-measure over the total number of clusters is calculated
in this method.
5.3.11 Silhouette:
The silhouette coefficient is a measure of how similar an object is to its own cluster
compared to other clusters; it ranges from −1 to +1. The silhouette value is calculated
for a range of epsilon values, and the highest silhouette value identifies the epsilon
that gives the best clustering results.
CHAPTER 6
PERFORMANCE METRICS
6.1 OVERVIEW:
Our algorithm is evaluated across four metrics: 1) silhouette coefficient; 2) F-measure;
3) purity; and 4) Rand index. The performance on non-stationary data streams is compared
with three important stream clustering algorithms, CluStream, DenStream, and ClusTree,
and the performance on stationary data streams is compared with ant clustering
algorithms.
value indicates that the object is well matched to its native cluster and poorly matched
to adjacent clusters.

c(a) = ( 1 / (|Ka| − 1) ) Σ dist(a, b),  over b ∈ Ka, b ≠ a

Where,
dist(a, b) – distance between data points a and b
Ka – the cluster containing a
c(a) – mean distance between a and all other data points in the same cluster

d(a) = min over clusters Ki not containing a of ( 1 / |Ki| ) Σ dist(a, b),  over b ∈ Ki

Where,
Ki – another cluster
d(a) – minimum mean distance between a and all the data points of any other cluster

S(j) = 1 − c(j)/d(j),    if c(j) < d(j)
S(j) = 0,                if c(j) = d(j)
S(j) = d(j)/c(j) − 1,    if c(j) > d(j)
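The silhouette of a single point can be computed directly from the definitions above. This is a sketch with our own function names; the closed form (d − c)/max(c, d) used below is equivalent to the three-case definition:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(point, own, others):
    """Silhouette of one point: `own` is its cluster (a list of points that
    includes `point`), `others` is a list of the remaining clusters."""
    if len(own) < 2:
        return 0.0  # cohesion is undefined for a singleton cluster
    # c: mean distance to the other members of the point's own cluster
    c = sum(dist(point, b) for b in own if b is not point) / (len(own) - 1)
    # d: smallest mean distance to the members of any other cluster
    d = min(sum(dist(point, b) for b in k) / len(k) for k in others)
    return (d - c) / max(c, d)
```

Averaging this value over all points, for each candidate epsilon, gives the curve from which the report selects the epsilon with the highest silhouette.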
6.3 PURITY:
The purity metric measures how homogeneous a cluster is. A cluster is allocated to the
class which appears most often inside the cluster; the accuracy of this is estimated by
counting the instances of this class and dividing by the total number of instances in
the cluster. K contains n clusters. In every identified cluster Ki (i = {1,..., n}), Si
represents the most frequent class label in cluster Ki. For cluster Ki:

Precision(Ki) = S_sum^i / |Ki|

Where,
S_sum^i – number of instances of Si in Ki
S_total^i – total number of instances of Si in the present window

Overall purity (P) can now be stated in terms of the total number of clusters
identified, as follows:

P = (1/n) Σ_{i=1}^{n} Precision(Ki)
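A minimal sketch of the purity computation, taking each cluster as a list of true class labels (the representation is our assumption):

```python
from collections import Counter

def purity(clusters):
    """Overall purity: the mean, over clusters, of the fraction of each
    cluster occupied by its most frequent class label."""
    precisions = [Counter(k).most_common(1)[0][1] / len(k) for k in clusters]
    return sum(precisions) / len(precisions)
```

For example, a cluster ["a", "a", "b"] has precision 2/3, and a pure cluster ["b", "b"] has precision 1, giving an overall purity of 5/6.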
6.4 F-MEASURE:
The F-measure, also called F-score or F1-score, is the harmonic mean of the recall and
precision scores found by the algorithm:

Recall(Ki) = S_sum^i / S_total^i

F(Ki) = 2 · Precision(Ki) · Recall(Ki) / ( Precision(Ki) + Recall(Ki) )
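The per-cluster F-score follows directly from the quantities defined above; the function name and argument names below are ours:

```python
def f_measure(s_sum, cluster_size, s_total):
    """Per-cluster F-score from the report's quantities:
    s_sum       - instances of the majority class inside the cluster
    cluster_size- total instances in the cluster (|Ki|)
    s_total     - instances of that class in the present window."""
    precision = s_sum / cluster_size
    recall = s_sum / s_total
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

For instance, a cluster of 5 points containing 4 of a class that has 8 instances in the window has precision 0.8 and recall 0.5, giving F ≈ 0.615.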
6.5 RAND-INDEX:
The Rand index (R) measures the number of decisions that are correct, penalizing false
positives and false negatives. It measures the accuracy of the structure, defined as
follows:

R = (TP + TN) / (TP + FP + TN + FN)

Where,
TP – true positive decisions
TN – true negative decisions
FP – false positive decisions
FN – false negative decisions
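Over pairs of points, TP + TN is the number of pairs on which the predicted clustering and the ground truth agree (both place the pair together, or both separate it), and the denominator is the total number of pairs. A pairwise sketch, with illustrative names:

```python
from itertools import combinations

def rand_index(predicted, truth):
    """predicted and truth give one cluster/class label per point.
    Counts the pairs both labelings treat the same way (TP + TN)
    over all pairs of points."""
    agree = sum(
        (predicted[i] == predicted[j]) == (truth[i] == truth[j])
        for i, j in combinations(range(len(predicted)), 2)
    )
    total = len(predicted) * (len(predicted) - 1) // 2
    return agree / total
```

Identical labelings score 1.0; the index decreases as the two labelings split or merge pairs differently.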
CHAPTER 7
RESULTS AND DISCUSSION
7.1 OVERVIEW
This chapter explains the results of our project; the screenshots for each step are
included and explained.
7.2 DATASETS:
A stationary dataset is one whose statistical properties, such as the mean, variance,
and autocorrelation, are constant over time.
DATASET   CLASSES   FEATURES   EXAMPLES
Iris      3         4          150
Wine      3         13         178
Zoo       7         17         101
Figure 7.3.4 Silhouette Vs Epsilon on Wine Dataset
Figure 7.3.4 shows the silhouette vs. epsilon graph for the stationary Wine dataset.
The epsilon value with the highest silhouette value is considered the best epsilon value.
Figure 7.3.6 Purity Vs Epsilon on Wine Dataset
Figure 7.3.6 shows the purity vs. epsilon bar graph for the stationary Wine dataset.
The purity values for the Wine dataset at different epsilon values are represented in
the graph.
Figure 7.3.8 Purity Vs Epsilon on 1CDT Dataset
Figure 7.3.8 shows the purity vs. epsilon bar graph for the non-stationary 1CDT dataset.
The purity values for the 1CDT dataset at different epsilon values are represented in
the graph.
Figure 7.3.10 Threshold value Vs Performance on Zoo Dataset
Figure 7.3.10 shows the threshold values for the Zoo dataset. The plot shows how purity
and F-measure vary with respect to the threshold value.
Figure 7.3.12 Performance Vs Sleep max on Network Intrusion
Figure 7.3.12 shows how purity and F-measure vary with respect to the sleep max value on
the Network Intrusion dataset.
Figure 7.3.14 F-Measure Vs Window size on Forest Cover
Figure 7.3.14 shows how the F-measure varies as the data stream is processed over 1000
windows of size 500 on the Forest Cover dataset.
Figure 7.3.16 F-Measure of DBCSD Vs other Ant Clustering Algorithms
Figure 7.3.16 shows, as a bar graph, the higher F-measure of our algorithm compared with
the F-measure of other ant clustering algorithms on stationary datasets.
Figure 7.3.18 F-Measure of DBCSD Vs other Stream Clustering Algorithms
Figure 7.3.18 shows, as a bar graph, the higher F-measure of our algorithm compared with
the F-measure of other stream clustering algorithms on non-stationary datasets.
Figure 7.3.20 Clustering Result of Wine Dataset
Figure 7.3.20 is a scatter plot of the clusters formed after categorization. The three
colors, red, blue, and purple, represent three different clusters.
DATASET PURITY F-MEASURE
Wine 0.95 0.96
Iris 0.96 0.91
Zoo 0.93 0.92
59
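For reference, the purity and F-measure values reported in these tables can be computed from the true class labels gathered inside each cluster. The sketch below uses a standard formulation (majority-label purity and a macro-averaged F-measure) as an illustration; it is not the exact code of the appendix.

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of true class labels per cluster.
    Purity = fraction of points matching their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

def f_measure(clusters):
    """Macro-averaged F-measure: per cluster, precision is the majority
    fraction inside the cluster, recall is the majority count over all
    occurrences of that label in the whole stream."""
    label_totals = Counter(label for c in clusters for label in c)
    score = 0.0
    for c in clusters:
        label, count = Counter(c).most_common(1)[0]
        precision = count / len(c)
        recall = count / label_totals[label]
        score += 2 * precision * recall / (precision + recall)
    return score / len(clusters)
```

A clustering of [["a", "a", "b"], ["b", "b", "b"]] has purity 5/6, since five of the six points carry their cluster's majority label.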
Table 7.3.2 Performance of Non-stationary dataset
Table 7.3.2 shows the purity (P) and F-measure (F) values for the non-stationary
datasets.
Table 7.3.5 Result of Existing Ant Clustering Algorithms
Table 7.3.5 shows the Performance of DBCSD on stationary over Existing ant clustering
algorithms
61
CHAPTER 8
CONCLUSION AND FUTURE WORK
8.1 CONCLUSION
In this project we have used the behaviour of ants for clustering dynamic data
streams. The system is tested with non-stationary datasets such as network intrusion
and forest cover, and stationary datasets such as iris, wine and zoo. Rough clusters
are formed in a single pass of the window. These clusters are then refined using a
process inspired by the sorting behaviour of ants, built on the typical pick-and-drop
method ants use. Rough clusters are formed rapidly in one pass and then sorted. In
traditional algorithms, data points are moved independently, which consumes excessive
time; by combining similar points into micro-clusters, a number of points can be
moved in a single operation, further speeding up the algorithm. Experimental results
show that our algorithm performs well while needing fewer parameters. The parameter
epsilon was shown to be highly sensitive and critically affects the performance of
the algorithm. It is data-dependent and requires manual tuning; by calculating the
silhouette value, a suitable epsilon value is found in our project.
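The single-pass, tumbling-window reading used throughout can be sketched as a small generator; this is an illustrative sketch, not the project's Spark-based reader. The stream is consumed in non-overlapping windows of a fixed size, so every record is examined exactly once.

```python
def tumbling_windows(stream, size):
    """Yield consecutive non-overlapping windows of `size` records from a
    data stream; the final window may be smaller than `size`."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield window
            window = []
    if window:  # flush the final partial window
        yield window
```

Because the windows never overlap, each record contributes to exactly one rough-clustering step, which is what keeps the algorithm single-pass.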
In future work, we plan to apply various clustering algorithms to obtain better
cluster quality and to investigate an adaptive, local ε parameter. This could
potentially allow the discovery of clusters with varying densities in the data.
CHAPTER 9
APPENDIX
9.1 CODING:
import math
import random
import timeit
from functools import reduce

import numpy as np
import pandas as pd
import scipy.spatial as sc1
import matplotlib.pyplot as plt
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
import tkinter as tk
from tkinter import filedialog
from pyspark import SparkContext
from pyspark import SparkConf

radii = []
a = []
Data = []
dist = []
# Find rough clusters
class find:
    clust = []  # class-level list: rough clusters persist across windows

    def __init__(self, epsilon):
        self.epsilon = epsilon

    def findcluster(self, ds, ran):
        suit = 0
        max = 0
        hh = 0
        ranlen = 0
        if len(self.clust) == 0:
            self.clust.append([0])
            ran = 1
            ranlen = len(ds)
        else:
            ranlen = ran + len(ds)
        for i in range(ran, ranlen):
            for j in range(len(self.clust)):
                suit = 0
                for k in range(len(self.clust[j])):
                    # Suitability calculation
                    suit = suit + dist[self.clust[j][k]][i]
                suit = suit / (len(self.clust[j]) + 1)
                if suit > max:
                    max = suit
                    hh = j
            if max <= self.epsilon:
                self.clust[hh].append(i)
            else:
                self.clust.append([i])
            max = 0
        print("Rough clusters\n", self.clust)
        return self.clust
# Continuation of class Microcluster (mergemicro); the beginning of the
# class (radius, similarity and the start of mergemicro) is not listed here.
        for j in range(len(s)):
            for h in range(len(t)):
                temp = 0
                var = 0
                for k in range(len(t[h])):
                    var = var + radii[t[h][k]]
                var = (var + radii[s[j]]) / (len(t[h]) + 1)
                if var < self.epsilon:
                    t[h].append(s[j])
                    temp = 1
                    break
            if temp == 0:
                t.append([s[j]])
            temp = 0
        a[i] = t
        print("\nMerged MicroClusters\n", a)
        print("Length before sorting\n", len(a))
        return a
# Sorting clusters after the merge operation
class SortCluster(Microcluster):
    def __init__(self, epsilon, sim, dist, sclust):
        self.sim = sim
        self.epsilon = epsilon
        self.dist = dist
        self.sclust = sclust
        self.ant = [False for i in range(len(self.sclust))]
        self.counter = [0 for i in range(len(self.sclust))]
        self.dummy = [0 for i in range(len(self.sclust))]
        self.sleepmax = []
        for f in range(len(self.sclust)):
            self.sleepmax.insert(f, 3)

    # Drop the selected micro-cluster into the most similar cluster
    def drop(self, lt, i):
        maxi = 0
        lists = []
        count = 0
        temp = 0
        t7 = 0
        if len(self.sclust) == 1:
            return 0
        for j in range(len(self.sclust)):
            if i != j:
                if maxi < self.sim[i][j]:
                    maxi = self.sim[i][j]
                    temp = j
        lists = self.sclust[temp]
        if len(lists) == 1:
            count = 0
            t7 = len(lists[0])
        else:
            for j in range(len(lists)):
                q = abs(self.Mean(lists[j]) - self.Mean(lt))
                if q <= self.epsilon:
                    count = count + 1
                t7 = t7 + len(lists[j])
        t7 = t7 + len(lt)
        pdrop = 1 / (1 - (count / t7))
        if pdrop <= 2:
            self.sclust[temp].append(lt)
            self.counter[i] = 0
            return 1
        else:
            self.counter[i] = self.counter[i] + 1
            return 0
    # Pick a micro-cluster
    def pick(self, i, k, count, t7):
        ppick = 1 - (count / t7)
        if ppick >= 0.5:
            self.counter[i] = 0
            if self.drop(self.sclust[i][k], i):
                self.sclust[i].remove(self.sclust[i][k])
                self.sleepmax[i] = 3
                if self.sclust[i] == []:
                    self.sclust.remove(self.sclust[i])
                    self.ant.remove(self.ant[len(self.ant) - 1])
                self.sim = super().similarity(self.sclust, dist)
                return 1
            else:
                return 0
        else:
            self.counter[i] = self.counter[i] + 1
            return 0

    def Mean(self, x):
        sum1 = 0
        for i in range(len(x)):
            sum1 = sum1 + (sum(Data[x[i]]) / len(Data[x[i]]))
        return sum1
    def mergecluster(self):
        i = j = k = q = t7 = count = tp = t1 = 0
        while i < len(self.sclust):
            while self.ant[i] == False:
                j = 0
                while j < len(self.sclust[i]):
                    k = 0
                    while k < len(self.sclust[i]):
                        tp = 0
                        if len(self.sclust[i]) == 1:
                            count = 1
                            t7 = len(self.sclust[i][k])
                            tp = 1
                            break
                        if j != k:
                            q = abs(self.Mean(self.sclust[i][k]) - self.Mean(self.sclust[i][j]))
                            if q <= self.epsilon:
                                count = count + 1
                            t7 = t7 + len(self.sclust[i][k])
                        k = k + 1
                    if tp == 1:
                        if self.pick(i, k, count, t7) == 0:
                            self.dummy[i] = self.dummy[i] + 1
                            j = j + 1
                        else:
                            t1 = 1
                            break
                    else:
                        t7 = t7 + len(self.sclust[i][j])
                        if self.pick(i, j, count, t7) == 0:
                            self.dummy[i] = self.dummy[i] + 1
                            j = j + 1
                    t7 = 0
                    tp = 0
                    count = 0
                if t1 == 1:
                    break
                if self.dummy[i] == len(self.sclust[i]):
                    self.ant[i] = True
                if self.counter[i] == 0:
                    self.ant[i] = True
                if self.counter[i] >= self.sleepmax[i]:
                    self.ant[i] = True
            if t1 == 1:
                self.dummy[i] = 0
            if t1 == 0:
                i = i + 1
                if i < len(self.sclust):
                    self.sleepmax[i] = len(self.sclust[i])
            t1 = 0
        print("ANT", self.ant)
        print("Sorted cluster", self.sclust)
        print("Length after sorting:", len(self.sclust))
        return self.sclust
    # Count how many points in the whole clustering carry the true label t
    # (used as the recall denominator in the F-Measure calculation)
    def recalls(self, t, f):
        v = 0
        for i in range(len(f)):
            for j in range(len(f[i])):
                for k in range(len(f[i][j])):
                    if Data[f[i][j][k]][0] == t:
                        v = v + 1
        return v
    # Purity and F-Measure calculation
    def Purity(self, f):
        r = ind = sum = fscore = tot = maxi = 0
        arr = []
        for i in range(len(f)):
            arr = []
            dup = []
            ct = []
            for j in range(len(f[i])):
                for k in range(len(f[i][j])):
                    arr.insert(ind, int(Data[f[i][j][k]][0]))
                    ind = ind + 1
                r = r + len(f[i][j])
            dup = arr
            dup = list(dict.fromkeys(dup))
            ct = []
            for ii in range(len(dup)):
                ct.insert(ii, arr.count(dup[ii]))
            maxi = max(ct)
            for ii in range(len(ct)):
                if ct[ii] == maxi:
                    break
            t = ii + 1
            if len(f[i]) == 1:
                recall = 1
            else:
                recall = self.recalls(t, f)
            recall = maxi / recall
            sum = maxi / r
            score = 2 * ((recall * sum) / (recall + sum))
            tot = tot + sum
            fscore = fscore + score
            t = 0
            r = 0
            maxi = 0
        tot = tot / len(f)
        fscore = fscore / len(f)
        print("precision:", tot)
        file1 = open("1.txt", "a+")
        file2 = open("2.txt", "a+")
        file1.write(str(self.epsilon) + '\t' + str(tot) + '\t' + '\n')
        file2.write(str(self.epsilon) + '\t' + str(fscore) + '\t' + '\n')
        file1.close()
        file2.close()
        return tot, fscore
    # Silhouette calculation
    def silhouette(self, clust, dist):
        ll = []
        ai = 0
        bi = 0
        si = 0
        sumi = 0
        mini = 1000000
        for i in range(len(clust)):
            ll.insert(i, reduce(lambda aa, bb: aa + bb, clust[i]))
        for i in range(len(ll)):
            if len(ll[i]) == 1:
                ai = 0
                j = 0
            else:
                j = random.randint(0, len(ll[i]) - 1)
                for k in range(len(ll[i])):
                    if j != k:
                        ai = ai + dist[ll[i][j]][ll[i][k]]
                ai = ai / (len(ll[i]) - 1)
            for r in range(len(ll)):
                if i != r:
                    bi = 0
                    for k in range(len(ll[r])):
                        bi = bi + dist[ll[i][j]][ll[r][k]]
                    bi = bi / (len(ll[r]))
                    if bi < mini:
                        mini = bi
            bi = mini
            if ai < bi:
                si = 1 - (ai / bi)
            elif bi < ai:
                si = (bi / ai) - 1
            else:
                si = 0
            sumi = sumi + si
        sumi = sumi / (len(ll))
        print("silhouette", sumi)
        return sumi
def display():
    global path
    f1 = str(filedialog.askopenfilename(filetypes=[("csv files", "*.csv")]))
    lab = tk.Label(top, text=f1, font=("Courier", 14))
    lab.place(x=300, y=200)
    print(f1)
    path = f1
def Clustering():
    global dist, data, data1, cols, row
    n1 = textbox.get("1.0", "end-1c")
    window = int(n1)
    start = timeit.default_timer()
    sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
    logData = sc.textFile(path).cache()  # load the csv file
    li = logData.map(lambda l: l.split(",")).collect()  # split each csv line into a list of strings
    sum1 = 0
    for i in range(0, len(li)):
        t = list(map(float, li[i]))  # convert the list of strings into floats
        Data.append(t)
        sum1 = sum1 + max(Data[i])
    sum1 = sum1 / len(Data)
    fg = sc1.distance.pdist(Data, 'euclidean')
    dist = sc1.distance.squareform(fg)
    # Choose the range of candidate epsilon values from the data scale
    st = e = 0
    if sum1 < 5:
        st, e, inc = 4, 8, 1
    elif sum1 < 6:
        st, e, inc = 3, 4, 0.2
    elif sum1 < 10:
        st, e, inc = 15, 20, 1
    elif sum1 < 15:
        st, e, inc = 10, 15, 1
    elif sum1 < 30:
        st, e, inc = 10, 25, 5
    elif sum1 > 700:
        st, e, inc = 500, 600, 20
    row = 0
    for i in np.arange(st, e, inc):
        epsilon = round(i, 1)
        f = find(epsilon)
        m = Microcluster(epsilon)
        for ik in range(len(Data)):
            radii.insert(ik, m.radius(Data[ik]))
        p = 0
        ii = 0
        # Read the stream through non-overlapping (tumbling) windows
        while p < len(Data):
            p = p + window
            sett = Data[ii:p]
            rough = f.findcluster(sett, ii)
            ii = p
        mer = m.mergemicro(rough)
        epi = str(epsilon)
        bef = str(len(mer))
        sim = m.similarity(mer, dist)
        s = SortCluster(epsilon, sim, dist, mer)
        fp = s.mergecluster()
        aftr = str(len(fp))
        tot, fscore = s.Purity(fp)
        kk = str(s.silhouette(fp, dist))
        data[row][0] = epi
        data[row][1] = bef
        data[row][2] = aftr
        data[row][3] = tot
        data[row][4] = fscore
        data[row][5] = kk
        if float(kk) < 0.0:
            kk = str(0.0)
        data1[row][0] = float(epsilon)
        data1[row][1] = float(kk)
        row = row + 1
        f.clust.clear()
        mer.clear()
    print("row:", row)
def showtable():
    n = 70
    m = 450
    k = 70
    # Table to display the results
    for y in range(row + 1):
        for x in range(len(cols)):
            if y == 0:
                e = tk.Entry(font=('Courier 10 bold'), bg='khaki', justify='center')
                e.grid(column=x, row=y)
                e.insert(0, cols[x])
                e.place(x=n, y=450)
                n = n + 150
            else:
                e = tk.Entry(font=('Courier 10 bold'))
                e.grid(column=x, row=y)
                e.insert(0, data[y - 1][x])
                e.place(x=k, y=m)
                k = k + 150
        m = m + 20
        k = 70
    # Bar graph for silhouette vs. epsilon
    df = pd.DataFrame(data1[:row], columns=['Epsilon', 'Silhouette value'])
    figure1 = plt.Figure(figsize=(5, 4), dpi=100)
    ax1 = figure1.add_subplot(111)
    bar1 = FigureCanvasTkAgg(figure1, top)
    bar1.get_tk_widget().pack(side=tk.RIGHT, fill=tk.BOTH)
    df = df[['Epsilon', 'Silhouette value']].groupby('Epsilon').sum()
    df.plot(kind='bar', legend=True, ax=ax1, color='khaki', width=0.3)
    ax1.set_title('Epsilon Vs. Silhouette value')
top = tk.Tk()
background_image = tk.PhotoImage(file=r"C:\Users\priya\Desktop\project\.spyder-py3\clus.jpg")
background_label = tk.Label(top, image=background_image)
background_label.place(x=10, y=10, relwidth=1, relheight=1)
top.geometry(str(top.winfo_screenwidth() - 20) + "x" + str(top.winfo_screenheight() - 40))
top.title("A Density Based Approach To Cluster Streaming Data")
lab = tk.Label(top, text="Browse Dataset", font=("Courier", 14))
lab.place(x=100, y=200)
b1 = tk.Button(top, text="UPLOAD", font=("Courier", 14), command=display, bg="khaki")
b1.place(x=800, y=197)
lab = tk.Label(top, text="Enter Window Size", font=("Courier", 14))
lab.place(x=100, y=250)
textbox = tk.Text(top, height=1.5, width=30)
textbox.place(x=350, y=250)
b2 = tk.Button(top, text="StartClustering", font=("Courier", 14), command=lambda: getnumber(), bg="khaki")
b2.place(x=690, y=250)
b3 = tk.Button(top, text="GetResults", font=("Courier", 14), command=lambda: getresult(), bg="khaki")
b3.place(x=390, y=350)
top.mainloop()
# Generate a graph for purity and F-measure
import matplotlib.pyplot as plt
import numpy as np

leg = ["Purity", "F-measure"]
files = ["1.txt", "2.txt"]
for i in range(2):
    plt.title('Performance metrics for wine dataset')
    plt.xlabel('Epsilon')
    plt.ylabel('Performance')
    f = open(files[i], "r")
    clus = []
    time = []
    for line in f:
        col = line.split('\t')
        clus.append(col[0])
        time.append(col[1])
    f.close()
    clus = np.array(clus).astype(float)
    time = np.array(time).astype(float)
    if i == 0:
        plt.plot(clus, time, color='green', linestyle='dashed', linewidth=1,
                 marker='o', markerfacecolor='blue', markersize=6, label=leg[i])
    if i == 1:
        plt.plot(clus, time, color='black', linestyle='dashed', linewidth=1,
                 marker='o', markerfacecolor='red', markersize=6, label=leg[i])
    plt.legend()
CHAPTER 10
REFERENCES
10.1 REFERENCES:
[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving
data streams,” in Proc. 29th Int. Conf. Very Large Data Bases, vol. 29. Berlin, Germany,
2003, pp. 81–92.
[2] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-based clustering over an evolving data
stream with noise,” in Proc. SIAM Int. Conf. Data Min., vol. 6, 2006, pp. 328–339.
[3] A. Forestiero, C. Pizzuti, and G. Spezzano, “A single pass algorithm for clustering
evolving data streams based on swarm intelligence,” Data Min. Knowl. Disc., vol. 26, no. 1,
pp. 1–26, Nov. 2011.
[4] P. Kranen, I. Assent, C. Baldauf, and T. Seidl, “The ClusTree: Indexing micro-clusters for
anytime stream mining,” Knowl. Inf. Syst., vol. 29, no. 2, pp. 249–272, 2011.
[5] P. S. Shelokar, V. K. Jayaraman, and B. D. Kulkarni, “An ant colony approach for
clustering,” Analytica Chimica Acta, vol. 509, no. 2, pp. 187–195, May 2004.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering
clusters in large spatial databases with noise,” in Proc. KDD, vol. 96, 1996, pp. 226–231.
[7] N. Labroche, N. Monmarché, and G. Venturini, “AntClust: Ant clustering and Web usage
mining,” in Proc. Genet. Evol. Comput. Conf., 2003, pp. 25–36.
[8] A. H. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, “Density clustering based
on radius of data,” World Acad. Sci. Eng. Technol., vol. 3, 2006.
[9] S. Mahran and K. Mahar, “Using grid for accelerating density-based clustering,” in Proc.
IEEE Int. Conf. Comput. Inf. Technol., Sydney, NSW, Australia, 2008, pp. 35–40.
[10] N. Masmoudi, H. Azzag, M. Lebbah, C. Bertelle, and M. B. Jemaa, “How to use ants for
data stream clustering,” in Proc. IEEE Congr. Evol. Comput., Sendai, Japan, 2015, pp. 656–663.
[11] P. Hore, L. O. Hall, and D. B. Goldgof, “Creating streaming iterative soft clustering
algorithms,” in Proc. IEEE Annu. Meeting North Amer. Fuzzy Inf. Process. Soc. (NAFIPS),
San Diego, CA, USA, 2007, pp. 484–488.
[12] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” J. Amer. Stat.
Assoc., vol. 66, no. 336, pp. 846–850, Dec. 1971.
[13] S. U. Rehman, A. Asghar, S. Fong, and S. Sarasvady, “DBSCAN: Past, present and
future,” in Proc. 5th Int. Conf. Appl. Digit. Inf. Web Technol. (ICADIWT), 2014, pp. 232–
238.
[14] C. W. Reynolds, “Flocks, herds and schools: A distributed behavioural model,” ACM
SIGGRAPH Comput. Graphics, vol. 21, no. 4, pp. 25–34, Jul. 1987.
[15] T. A. Runkler, “Ant colony optimization of clustering models,” Int. J. Intell. Syst., vol.
20, no. 12, pp. 1233–1251, 2005.