
EKT: EXERCISE-AWARE

KNOWLEDGE TRACING FOR


STUDENT PERFORMANCE
PREDICTION
A PROJECT REPORT

Submitted by

C. ASHWIN (Reg. No. 2020060)


K.A. HARI HARA SANKAR (Reg. No. 202006021)
N. HARIRAM (Reg. No. 202006022)

in partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY

in

INFORMATION TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY


MEPCO SCHLENK ENGINEERING COLLEGE, SIVAKASI
(An Autonomous Institution affiliated to Anna University, Chennai)

March 20

BONAFIDE CERTIFICATE

Certified that this project report titled EKT: EXERCISE-AWARE KNOWLEDGE

TRACING FOR STUDENT PERFORMANCE PREDICTION is the bonafide work of


Mr. C. Ashwin (Reg. No. 2020060), Mr. K.A. Hari Hara Sankar (Reg. No. 202006021),
Mr. N. Hariram (Reg. No. 202006022), who carried out the research under my supervision.
Certified further, that to the best of my knowledge the work reported herein does not form
part of any other project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.

________________________ ________________________

Dr.T.REVATHI, M.E., Ph.D., Dr.T.REVATHI, M.E., Ph.D.,

Internal guide Head of the Department


Senior Professor, Senior Professor,
Department of Information Technology, Department of Information Technology,
Mepco Schlenk Engineering College, Mepco Schlenk Engineering College,
Sivakasi. Sivakasi.

Submitted for Viva-Voce Examination held at MEPCO SCHLENK ENGINEERING

COLLEGE, SIVAKASI (AUTONOMOUS) on …………………………

Internal Examiner External Examiner

ABSTRACT
For offering proactive services (e.g., personalized exercise recommendation) to the
students in computer supported intelligent education, one of the fundamental tasks is
predicting student performance (e.g., scores) on future exercises, where it is necessary to
track the change of each student’s knowledge acquisition during her exercising activities.
Unfortunately, to the best of our knowledge, existing approaches can only exploit the
exercising records of students, and the problem of extracting the rich information existing in the
materials (e.g., knowledge concepts, exercise content) of exercises to achieve both more
precise prediction of student performance and more interpretable analysis of knowledge
acquisition remains underexplored. To this end, in this paper, we present a holistic study of
student performance prediction. To directly achieve the primary goal of performance
prediction, we first propose a general Exercise-Enhanced Recurrent Neural Network
(EERNN) framework by exploring both student’s exercising records and the text content of
corresponding exercises. In EERNN, we simply summarize each student’s state into an
integrated vector and trace it with a recurrent neural network, where we design a bidirectional
LSTM to learn the encoding of each exercise from its content. For making final predictions,
we design two implementations on the basis of EERNN with different prediction strategies,
i.e., EERNNM with Markov property and EERNNA with Attention mechanism. Then, to
explicitly track student’s knowledge acquisition on multiple knowledge concepts, we extend
EERNN to an explainable Exercise-aware Knowledge Tracing (EKT) framework by
incorporating the knowledge concept information, where the student’s integrated state vector
is now extended to a knowledge state matrix. In EKT, we further develop a memory network
for quantifying how much each exercise can affect the mastery of students on multiple
knowledge concepts during the exercising process. Finally, we conduct extensive
experiments and evaluate both EERNN and EKT frameworks on a large-scale real-world dataset.
The results in both general and cold-start scenarios clearly demonstrate the effectiveness of
both frameworks in student performance prediction, as well as the superior interpretability of
EKT.

ACKNOWLEDGEMENT

Apart from our efforts, the success of our project depends largely on the
encouragement of many others. We take this opportunity to express our gratitude to the
people who have been instrumental in the successful completion of our project. We would
also like to thank our college management for providing the amenities required for our
project.
We would like to convey our sincere thanks to our respected Principal, Dr.S.Arivazhagan,
M.E.,Ph.D., Mepco Schlenk Engineering College, for providing us with facilities to complete
our project. We extend our profound gratitude and heartfelt thanks to Dr.T.Revathi,
M.E.,Ph.D., Senior Professor and Head of the Department of Information Technology for
providing us constant encouragement. We are bound to thank our project coordinator
Dr.J.Angela Jennifa Sujana, M.Tech., Ph.D., Associate Professor of Information
Technology. We sincerely thank our project guide Dr.T.Revathi, M.E.,Ph.D., Senior
Professor, Department of Information Technology, for her inspiring guidance and valuable
suggestions to complete our project successfully. The guidance and support received from all
the staff members and lab technicians of our department who contributed to our project, was
vital for the success of the project. We are grateful for their constant support and help. We
would like to thank our parents and friends for their help and support in our project.

TABLE OF CONTENTS

CONTENT PAGE NO
LIST OF TABLES x
LIST OF FIGURES xii
LIST OF SYMBOLS xiv
LIST OF ABBREVIATIONS xvi

CHAPTER 1 INTRODUCTION 1
1.1 STREAM CLUSTERING 2
1.2 CLUSTERING AND ITS TYPES 2
1.2.1 PARTITIONING CLUSTERING 2
1.2.2 HIERARCHICAL CLUSTERING 2
1.2.3 FUZZY CLUSTERING 2
1.2.4 MODEL BASED CLUSTERING 3
1.3 DENSITY BASED CLUSTERING 3
1.4 TUMBLING WINDOW 3
1.5 DENSITY BASED APPROACH FOR CLUSTERING 3
CHAPTER 2 LITERATURE STUDY 5
2.1 OVERVIEW 6
2.2 DENSITY-BASED CLUSTERING OVER AN 6
EVOLVING DATA STREAM WITH NOISE
2.3 A SINGLE PASS ALGORITHM FOR CLUSTERING 6
EVOLVING DATA STREAMS BASED ON SWARM
INTELLIGENCE
2.4 A FRAMEWORK FOR CLUSTERING EVOLVING 7
DATA STREAMS
2.5 ANTCLUST: ANT CLUSTERING AND WEB USAGE 7
MINING
2.6 THE CLUSTREE: INDEXING MICRO-CLUSTERS 7
FOR ANYTIME STREAM MINING
CHAPTER 3 SYSTEM STUDY 9
3.1 SCOPE 10
3.2 PRODUCT FUNCTION 10
3.3 SYSTEM REQUIREMENTS 11
3.3.1 HARDWARE INTERFACES 11
3.3.2 SOFTWARE INTERFACES 11
3.3.2.1 ANACONDA 11
3.3.2.2 SPYDER 11
3.3.2.3 PYTHON 12
3.3.2.4 PYSPARK 12
CHAPTER 4 SYSTEM DESIGN 13

4.1 OVERVIEW 14
4.2 OVERALL ARCHITECTURE 14
4.3 MODULES 14
4.3.1 FORMATION OF ROUGH CLUSTERS 15
4.3.2 GROUPING MICRO-CLUSTERS 16
4.3.3 CATEGORIZATION OF CLUSTERS 18
CHAPTER 5 IMPLEMENTATION METHODOLOGY 21
5.1 OVERVIEW 22
5.2 ESSENTIAL LIBRARIES 22
5.2.1 NUMPY 22
5.2.2 MATH 22
5.2.2.1 Sqrt() 22
5.2.3 SCIPY 22
5.2.3.1 scipy.spatial.distance.pdist() 23
5.2.4 FUNCTOOLS 23
5.2.4.1 Reduce() 23
5.2.5 TKINTER 23
5.2.6 MATPLOTLIB 23
5.2.6.1 Matplotlib.pyplot 23
5.3 FUNCTIONS USED FOR IMPLEMENTATION 23
5.3.1 FIND CLUSTERS 24
5.3.2 SUITABILITY 24
5.3.3 SIMILARITY 24
5.3.4 GROUP MICRO-CLUSTERS 24
5.3.5 RADIUS 24
5.3.6 GROUP CLUSTERS 24
5.3.7 PICK 25
5.3.8 DROP 25
5.3.9 PURITY 25
5.3.10 F-MEASURE 25
5.3.11 SILHOUETTE 25
5.3.12 SHOW TABLE 25
CHAPTER 6 PERFORMANCE METRICS 26
6.1 OVERVIEW 27
6.2 SILHOUETTE CO-EFFICIENT 27
6.3 PURITY 28
6.4 F-MEASURE 28
6.5 RAND INDEX 29
CHAPTER 7 RESULTS AND DISCUSSION 30

7.1 OVERVIEW 31
7.2 DATASET 31
7.2.1 STATIONARY DATASET 31
7.2.2 NON-STATIONARY DATASET 31
7.3 SCREENSHOTS 32
CHAPTER 8 CONCLUSION AND FUTURE WORK 44
8.1 CONCLUSION 45
8.2 FUTURE WORK 45
CHAPTER 9 APPENDIX 46
9.1 CODING 47
CHAPTER 10 REFERENCES 57
10.1 REFERENCES 58

LIST OF TABLES

TABLE NO TABLE NAME PAGE NO
7.3.1 Performance of Stationary dataset 42
7.3.2 Performance of Non-stationary dataset 42
7.3.3 Performance of B1C10D25 with noise 42
7.3.4 Result of Existing Stream Clustering Algorithms 43
7.3.5 Result of Existing Ant Clustering Algorithms 43

LIST OF FIGURES

FIGURE NO. TOPIC PAGE NO.
1.4.1 Tumbling Window 3
4.2.1 System Architecture 14
4.3.1.1 Formation of Rough Clusters 16
4.3.2.1 Grouping micro-clusters within cluster 18
4.3.3.1 Categorization of Clusters 20

7.3.1 Home Page 32


7.3.2 User Input 32
7.3.3 Clustering Result 33
7.3.4 Silhouette Vs Epsilon on wine dataset 33
7.3.5 Silhouette Vs Epsilon on 1CDT dataset 34
7.3.6 Purity Vs Epsilon on Wine Dataset 34
7.3.7 F-Measure Vs Epsilon for Wine Dataset 35
7.3.8 Purity Vs Epsilon on 1CDT Dataset 35
7.3.9 F-Measure Vs Epsilon on 1CDT Dataset 36
7.3.10 Threshold Vs Performance on Zoo Dataset 36
7.3.11 Performance Vs Sleep max on Iris Dataset 37
7.3.12 Performance Vs Sleep max on network intrusion 37
7.3.13 F-Measure Vs window size on 4CE1CF stream 38
7.3.14 F-Measure Vs window size on Forest cover 38
7.3.15 Purity of DBCSD Vs other Ant Clustering Algorithms 39

7.3.16 F-Measure of DBCSD Vs other Ant Clustering Algorithms 39


7.3.17 Purity of DBCSD Vs other Stream Clustering Algorithms 40
7.3.18 F-Measure of DBCSD Vs other Stream Clustering Algorithms 40
7.3.19 Clustering Result of Iris Dataset 41
7.3.20 Clustering Result of Wine Dataset 41

LIST OF SYMBOLS

NOTATION MEANING
X Dataset
C Center of dataset
R Radius of dataset
xi Data point
S Silhouette Value
P Purity
F F-Measure
K Clusters
R Rand index

LIST OF ABBREVIATIONS

S.NO ACRONYMS ABBREVIATIONS


1 є Epsilon Value
2 1CDT One Class Diagonal Translation
3 2CHT Two Classes Horizontal Translation
4 4CR Four Classes Rotating Separated 
5 4CE1CF Four Classes Expanding and One Class Fixed

CHAPTER 1
INTRODUCTION

1.1 STREAM CLUSTERING

In recent years, a large amount of streaming data has been generated, and analyzing
and processing such data has become a hot topic. Streaming data is data such as multimedia
data, telephone records, and financial transactions that arrives continuously. Streaming data
clustering groups this data into clusters with similar behaviour. Streaming data should be
processed incrementally using stream processing techniques, without access to all of the data
at once. The goal of stream clustering is to group the streaming data into similar classes.
Streaming data can be examined only once. A stream can be unbounded and infinite, yet only
a limited amount of memory is available. The nature of the stream means that data can drift,
and new clusters can appear, disappear, and reappear repeatedly.

1.2 CLUSTERING AND ITS TYPES

Clustering is the task of dividing data points into groups such that the points in each
group are more similar to one another than to points in other groups. In simple words, it
segregates data with similar traits and assigns them to clusters. Clustering can be done using
different approaches:

1.2.1 PARTITIONING CLUSTERING

Partitioning algorithms are clustering techniques that subdivide the data sets into a set
of k groups, where k is the number of groups pre-specified by the analyst.

1.2.2 HIERARCHICAL CLUSTERING

Hierarchical clustering is an alternative approach to partitioning clustering for


identifying groups in the dataset. It does not require pre-specifying the number of clusters to
be generated.

1.2.3 FUZZY CLUSTERING

In fuzzy clustering, each data point can be a member of more than one cluster. Each
point has a set of membership coefficients reflecting the extent to which it belongs to a given cluster.

1.2.4 MODEL BASED CLUSTERING

In model-based clustering, data are viewed as coming from a distribution that is a
mixture of two or more clusters. It finds the best fit of models to the data and estimates the
number of clusters.
1.3 DENSITY BASED CLUSTERING
It is a partitioning method which can find clusters of different shapes and sizes
from data containing noise and outliers. Density-based clustering defines clusters as high-
density areas separated by areas of low density. In our proposed algorithm, highly dense areas
are described using micro-clusters with center c and radius r. Micro-clusters have a maximum
radius epsilon є, i.e. radius ≤ epsilon. A data point is assigned to a micro-cluster if the data
point lies within its radius. The number of micro-clusters can be greater than the number of
actual clusters, but micro-clusters are fewer than the data points. This has two advantages:
i) statistics about clusters can be stored in a fraction of the space;
ii) evaluating micro-clusters is easier than evaluating the individual data points.

1.4 TUMBLING WINDOW


Tumbling windows are a series of fixed-size, non-overlapping and contiguous time
intervals. The window size must be a positive constant. The window size is static and
cannot be changed dynamically at runtime.

Figure 1.4.1 Tumbling Window
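As a small illustration of this windowing scheme (an assumed toy example, not taken from the project code), the following Python sketch splits a finite list of points into fixed-size, non-overlapping windows:

def tumbling_windows(points, window_size):
    """Yield fixed-size, non-overlapping, contiguous chunks of the stream."""
    for start in range(0, len(points), window_size):
        yield points[start:start + window_size]

# Example: a stream of 10 points processed in windows of size 4
stream = list(range(10))
for window in tumbling_windows(stream, 4):
    print(window)   # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]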

1.5 DENSITY BASED APPROACH FOR CLUSTERING


This method follows a density-based approach in which each data point is initially
considered as a cluster. A tumbling window, a type of sliding window, is used to feed the data
points: a fixed-size, non-overlapping chunk of data is considered in each iteration. Rough
clusters are incrementally created in a single pass of the window. The first point seeds a
new cluster; subsequent points are assigned to a prevailing cluster or, if too dissimilar, a new
cluster is seeded. After forming rough clusters, micro-clusters are created. Initially, each
point is considered as a micro-cluster, and these micro-clusters attempt to merge with other
micro-clusters present in the same cluster only. These rough clusters are refined using an ant-
inspired sorting method. The pick-and-drop method is based on the behaviour of certain
species of ant which sort their larvae into piles: a sorting ant picks up isolated items
and drops them at locations where similar items are present. Sorting ants are assigned to every
cluster. They refine the primary clusters by picking micro-clusters and dropping them in more
suitable clusters. Smaller clusters dissolve and their contents move to similar,
larger clusters. This simplifies the method, decreases the total complexity and allows for
actual sampling.

CHAPTER 2
LITERATURE STUDY

2.1 OVERVIEW:
The existing systems, both traditional and streaming algorithms, their limitations such as
bounded available memory, high computational time and poor tolerance to noisy data, and the
techniques used to cluster streaming data are described in this chapter.

2.2 DENSITY-BASED CLUSTERING OVER AN EVOLVING DATA


STREAM WITH NOISE [1]:
In DenStream, an online component summarizes data points into micro-clusters,
and these micro-clusters are grouped offline. The DenStream algorithm uses a time-stamped
window and gives importance to the most recent data. The offline clustering is an extension of the
density-based algorithm which clusters in terms of core points, border points, and noise. A core
point has at least a minimum number of points (minPoints) within the distance epsilon. Border points have a
smaller number of points than minPoints within distance epsilon, and the
noisy data are considered as outliers. Micro-clusters are defined by the maximum
radius epsilon and minPoints, the least count of data points within epsilon for a
micro-cluster to be considered dense. The clustering in the offline component is very expensive
and is performed only when a user makes a clustering request.

2.3 A SINGLE PASS ALGORITHM FOR CLUSTERING EVOLVING


DATA STREAMS BASED ON SWARM INTELLIGENCE [2]:
DenStream has two phases: an online component to form micro-clusters and an
offline component to group and cluster the micro-clusters. In FlockStream, these two phases
were combined into a single online phase that summarizes the data points by forming micro-clusters
and clusters the micro-clusters into the most similar clusters. It was inspired by the flocking
behaviour of birds. It follows a self-organizing tactic and uses a distributed approach to group
the most similar micro-clusters. FlockStream requires significantly fewer pairwise distance
comparisons than DenStream while achieving similar cluster purity. It adopts the concept of
time-weighted non-core and core micro-clusters introduced in DenStream.

2.4 A FRAMEWORK FOR CLUSTERING EVOLVING DATA
STREAMS [3]:
The CluStream algorithm uses the idea of micro-clusters to cluster dynamic data
streams. The CluStream algorithm is risky when the data stream has noisy data because it keeps only a
fixed number of micro-clusters. CluStream proposes two different phases, an online phase and an
offline phase. In the online phase a group of micro-clusters is stored in memory; when a data
point arrives from the stream it is either added to an existing micro-cluster or forms a
new micro-cluster. The online phase satisfies the single-pass limitation so that huge amounts of data can be
clustered. Room for a new micro-cluster is made by deleting an existing micro-cluster or by merging
the two most similar micro-clusters. The offline phase uses a weighted k-means algorithm on the
micro-clusters to get the final clusters from the data streams. The idea of using micro-clusters
instead of data points confirms that the method can be used for large amounts of data. The algorithm
is evaluated using the open-source software Massive Online Analysis (MOA).

2.5 ANTCLUST: ANT CLUSTERING AND WEB USAGE MINING [4]:


The AntClust algorithm was inspired by the biological behaviour of ants, but it does not apply
the pick-and-drop method for clustering. Ants leave pheromone trails
from nest to food source while searching for food; while returning to the nest they follow
their pheromone trails and deposit more pheromone. Each data
object is associated with an ant, and every ant is assigned an odor. Nests are shared among
ants having the same odor. Ants move from nest to nest along pheromone traces and find
their most suitable nest in terms of the Euclidean distance. To compute the final clusters the k-means
method is used, which restricts its suitability for a streaming environment.

2.6 THE CLUSTREE: INDEXING MICRO-CLUSTERS FOR ANYTIME


STREAM MINING [5]:
Stream data clustering is of growing importance in various applications. This
algorithm proposes a parameter-free, index-based method that adapts itself to the changing speed of the data
stream and is capable of anytime clustering. The ClusTree maintains the values
essential for calculating the mean and variance of micro-clusters. It also presents solutions for handling
bursty streams through aggregation mechanisms and suggests novel descent approaches
that improve the clustering result on slower streams for as long as time permits. This
approach is the first anytime clustering procedure for data streams. Anytime clustering with ClusTree
overcomes the problems of older batch-processing algorithms. Experimental results
show that it is capable of handling a variety of stream characteristics for
accurate and scalable anytime clustering of stream data. Furthermore, the authors discuss the
compatibility of this method for discovering clusters of arbitrary shape and for showing cluster
changes and data evolution using new methods.

CHAPTER 3
SYSTEM STUDY

3.1 SCOPE:

The scope of A Density Based Approach to Cluster Streaming Data is to achieve a
significant decrease in computational time and to cluster vast amounts of data without
compromising cluster quality.
3.2 PRODUCT FUNCTION:
 Initially each data point is considered as an individual cluster and the unbounded data
streams are passed through the tumbling window.
 The first data point forms a new cluster; for each successive data point its suitability
with the already formed clusters is calculated. If it is less than or equal to epsilon the point
is added to that cluster, otherwise it forms a new cluster on its own.
 The similarity between each cluster and its neighbouring clusters is updated. The formed
clusters are refined by creating micro-clusters.
 Two micro-clusters are grouped into a single micro-cluster if the merged radius is less than
epsilon; otherwise the merge operation fails.
 Micro-clusters m1 and m2 are said to be density reachable if the distance between the centre
of m1 and the centre of m2 is less than epsilon.
 The merge operation takes place within the cluster only.
 A sorting ant is assigned to each cluster; it randomly picks up a micro-cluster from its
cluster and tries to move it into a larger, similar cluster.
 If the selected micro-cluster and the other micro-clusters in the same cluster are density
reachable, then a reachable count is incremented.
 If a micro-cluster is picked successfully, the carrying flag is set to true and the ant moves to
the most similar cluster and tries to drop the micro-cluster based on the inverse of the pick
probability.
 If the drop succeeds, the selected micro-cluster moves to the new cluster; otherwise it moves
back to its native cluster. This operation continues until the ant is asleep or the cluster becomes
empty.
 If either the pick or the drop operation is unsuccessful, a counter is incremented. When the
counter reaches the sleep max value, the cluster is considered sorted and the sleeping flag
is set to true.
 The counter is reset to zero after a successful operation or when a new micro-cluster is placed
in the cluster by a foreign sorting ant.
 If there exists only one micro-cluster in a cluster then it is an outlier. The result is the set of
non-empty sorted clusters.

3.3 SYSTEM REQUIREMENTS


The software and hardware requirements of the system are as follows:

3.3.1 HARDWARE INTERFACES

 Intel® Core™ i5-8265U, 1.6 GHz


 8 GB RAM

3.3.2 SOFTWARE INTERFACES


 Platform – Anaconda
 IDE – Spyder
 Technologies used – Python
 API– PySpark

3.3.2.1 ANACONDA
The Anaconda platform is used for machine learning and other large-scale data
processing. It is a free, open-source distribution that works with the R and Python
programming languages. It consists of more than 1500 packages and a virtual
environment manager. The virtual environment manager is named Anaconda Navigator and
it comprises all the libraries to be installed within it. It ships with several default applications
such as Spyder, JupyterLab, Jupyter Notebook, Orange, RStudio etc.

3.3.2.2 SPYDER
To implement the proposed system, the IDE used is the Spyder environment. It is an open-
source, cross-platform integrated development environment. It combines advanced
features such as debugging, editing, and analysis of huge data. This tool helps in interactive
execution, data exploration and visualization of data.

3.3.2.3 PYTHON
Python is an interpreted, general-purpose programming language with high-level data
structures. It can be used for creating web applications on the server side. Python is also suitable as an
extension language for customized applications.

3.3.2.4 PYSPARK
PySpark is the Application Programming Interface, written in the Python language, that
exposes Apache Spark functionality. Spark is the distributed framework that handles large-scale
data analysis. PySpark is widely used in data science and machine learning. It can be easily
integrated with other languages. This framework works with greater speed when compared
with other frameworks for processing data.
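The appendix (Chapter 9) reads the input CSV through PySpark in essentially the following way; the fragment below is only an illustrative sketch of that usage, and the file name iris.csv is a placeholder:

from pyspark import SparkConf, SparkContext

# Create (or reuse) a local Spark context
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

# Read the CSV file as an RDD of lines, split each line into fields,
# and collect the result as a Python list of float vectors
lines = sc.textFile("iris.csv").cache()
data = lines.map(lambda l: [float(v) for v in l.split(",")]).collect()
print(len(data), "points loaded")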

CHAPTER 4
SYSTEM DESIGN

4.1 OVERVIEW
This chapter presents an overview of the whole system. Section 4.2 shows the
overall architecture, Section 4.3 lists the three main modules, Section 4.3.1 describes how clusters
are formed from the unbounded data streams, Section 4.3.2 describes the merge operation on
the previously formed rough clusters, and Section 4.3.3 describes how the clusters are categorized
and stored offline.

4.2 OVERALL ARCHITECTURE:

[System architecture: Data points → Form rough clusters → Create micro-clusters → Update cluster similarity → Unite micro-clusters into clusters → Assign ants for categorization → Final clusters]

Figure 4.2.1 System architecture

4.3 MODULES

 Formation of Rough Clusters


 Grouping Micro-Clusters
 Categorization of Clusters

4.3.1 FORMATION OF ROUGH CLUSTERS:

1. while <window size> do
2.   for <each point> do
3.     if <clusters exist> then
4.       find suitability with clusters (1)
5.       if <suitable> then
6.         add the data point to the cluster
7.       else
8.         create a new cluster
9.     else if <no clusters> then
10.      create a new cluster
11. update similarity value (2)
12. return rough clusters

The data streams are processed through a non-overlapping tumbling window. At each
iteration, a fixed-size, non-repeated set of data is considered. Clusters are formed in a single
pass of the window. The first point seeds a new cluster, and subsequent data points are assigned
to a prevailing cluster or seed a new cluster. The suitability of a data point q with a cluster c
that already contains n points is estimated from the Euclidean distances as follows:

Suitability(S) = ( Σ distance(c_i, q) ) / n        (1)

Where,
n – number of points already present in the cluster
q – data point
c – existing cluster

Every cluster is evaluated and each data point is assigned to the most suitable cluster,
provided the suitability is equal to or below epsilon. A new cluster is formed if the suitability
value is greater than epsilon. The parameter є is also the maximum radius for a micro-cluster
in the subsequent step.

As we evaluate each point's suitability with every cluster, we store each cluster's
suitability. Upon establishing a cluster, we update the similarity information between that
cluster and its neighbouring clusters. The similarity between clusters c1 and c2 is the
average suitability of the data points q in cluster c1 with cluster c2:

Similarity(c1, c2) = ( Σ distance(c1, c2) ) / n        (2)

The similarity to every neighbouring cluster is a running average, updated whenever a new
point is allotted to the cluster. Although comparable, Similarity(c1, c2) ≠ Similarity(c2, c1).

[Workflow: for each data point, calculate suitability; if suitable, add the point to an existing cluster, otherwise add it to a new cluster; update cluster similarity; output the rough clusters.]

Figure 4.3.1.1 Work Flow of Formation of Rough Clusters
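A minimal sketch of the suitability test in (1) and the rough-cluster assignment is given below; the function names and the list-of-points cluster representation are illustrative assumptions and not the project's actual implementation (see Chapter 9 for that):

import numpy as np

def suitability(cluster, q):
    # Eq. (1): average Euclidean distance from point q to the n points already in the cluster
    n = len(cluster)
    return sum(np.linalg.norm(np.asarray(c) - np.asarray(q)) for c in cluster) / n

def assign_point(clusters, q, epsilon):
    # Assign q to the most suitable existing cluster if its suitability is <= epsilon,
    # otherwise seed a new cluster with q
    if clusters:
        best = min(clusters, key=lambda c: suitability(c, q))
        if suitability(best, q) <= epsilon:
            best.append(q)
            return
    clusters.append([q])

# Example: process one window of points
clusters = []
for point in [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]]:
    assign_point(clusters, point, epsilon=1.0)
print(len(clusters))   # 2 rough clusters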

4.3.2 GROUPING MICRO-CLUSTERS:

1. Add micro-cluster c2 to c1 (5)
2. New micro-cluster c3 formed
3. R := radius of c3 (4)
4. if (R ≤ є) then // unite success
5.   Remove c1 and c2
6.   Return c3
7. else // unite operation fails
8.   Return c1 and c2

This algorithm identifies clusters as groups of micro-clusters, and a micro-cluster is a
group of neighbouring points within a fixed radius; the є-neighbourhood fixes this radius. A
micro-cluster containing N data points {X_j}, j = {1, ..., N}, is described using three
components:
1) the number of points the micro-cluster holds, N;
2) the linear sum (LSum) of each dimension (i.e., Σ_{i=1..N} X_i);
3) the squared sum (SSum) of each dimension (i.e., Σ_{i=1..N} X_i²).
LSum and SSum are d-dimensional arrays, where d is the number of dimensions of a data
point. From these quantities, the center C and radius R of the micro-cluster can be determined
using the following formulas:

C = LSum / N        (3)

R = sqrt( SSum/N − (LSum/N)² )        (4)

A micro-cluster can also carry a time component, but this is not vital in this
window model. A micro-cluster M can absorb a point p if, after updating the LSum and SSum
of M with p, radius(M) ≤ epsilon. Likewise, two micro-clusters m1 and m2 can try to merge
into a single micro-cluster M as:

M = ( N_i + N_j, LSum_i + LSum_j, SSum_i + SSum_j )        (5)

If radius(M) ≤ є, the clusters merge; else, the merge operation fails. Micro-clusters m1
and m2 are said to be density reachable if

distance( center(m1), center(m2) ) ≤ є        (6)

The solution provided by our algorithm is a group of clusters comprising density-reachable
micro-clusters. The algorithm works in two steps:

1) Rough clusters are formed in a single pass of the window.

2) These rough clusters are refined and their summary information is stored offline.
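The following sketch shows one possible implementation of the (N, LSum, SSum) summary and of the radius and merge tests in (3)–(5). The class and function names are illustrative assumptions, and the per-dimension variances are summed here to obtain a single scalar radius:

import numpy as np

class MicroCluster:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1             # N, number of points
        self.lsum = p.copy()   # LSum, linear sum per dimension
        self.ssum = p * p      # SSum, squared sum per dimension

    def center(self):
        return self.lsum / self.n                              # Eq. (3)

    def radius(self):
        var = self.ssum / self.n - (self.lsum / self.n) ** 2   # Eq. (4), per dimension
        return float(np.sqrt(np.maximum(var, 0.0).sum()))

def try_merge(m1, m2, epsilon):
    # Eq. (5): add the summaries; keep the merged micro-cluster only if its radius stays within epsilon
    merged = MicroCluster(m1.center())      # temporary instance; fields overwritten below
    merged.n = m1.n + m2.n
    merged.lsum = m1.lsum + m2.lsum
    merged.ssum = m1.ssum + m2.ssum
    return merged if merged.radius() <= epsilon else None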

[Workflow: starting from the rough clusters, each point becomes a new micro-cluster; the merged radius is calculated; if it is below є two micro-clusters are grouped, otherwise the grouping fails; the result is the set of grouped micro-clusters.]

Figure 4.3.2.1 Work Flow of Grouping micro-clusters

4.3.3 CATEGORIZATION OF CLUSTERS:

1. Assign an ant to each cluster
2. Initialize sleepmax for each ant
3. while <ants not asleep> do
4.   for <each ant> do
5.     if <not sleeping> then
6.       pick up a micro-cluster (7)
7.       find the most similar cluster
8.       drop the micro-cluster into the new similar cluster (7)
9.       update similarity value (2)
10. return final clusters

The earlier step discovers clusters in a single pass of the window. The clusters
identified at this phase are often rough, impure and too many. In this stage, micro-clusters are
formed and merged, and inter-cluster sorting is performed. At first, each d-dimensional point p in
each cluster is treated as its own micro-cluster M, with radius zero and center p. Formally, we have
M.N = 1
M.LSum_i = p_i, i = {1, ..., d}
M.SSum_i = p_i², i = {1, ..., d}
Where,
p_i – i-th dimension of data point p
Before the sorting operation, each micro-cluster tries to merge with the other micro-clusters
in the same cluster. The merge operation is performed by comparing each micro-cluster with
every other in the same cluster. If, after adding their constituent parts (5), the radius is less
than or equal to є, the merge operation is a success. Merging at this step ensures that only
neighbouring micro-clusters try to merge and avoids the unnecessary calculation of comparing
micro-clusters in different dense areas. Another benefit is that, during the sorting segment, the n
data points represented by a micro-cluster can be moved in a single operation, speeding up
the categorization method and reducing the number of pairwise comparisons. An ant is assigned
to each cluster for sorting operations. Each ant is resident in its own cluster. The ants
probabilistically choose to pick up a micro-cluster from their cluster. A micro-cluster M is
selected at random from cluster k and is iteratively compared with the n micro-clusters in the
same cluster. The Euclidean distance from the center of M to each of the selected micro-
clusters is evaluated and, if the two are density reachable (6), a nearby count is incremented.
The probability of picking up a micro-cluster is calculated as follows:

P_pick = 1 − (nearby / n)        (7)
This ensures a higher probability of a pick-up in clusters containing fewer micro-clusters,
which leads to the dissolution of smaller clusters into large similar clusters. If a micro-cluster is
successfully picked up, the Boolean variable carrying is set to true and the categorizing ant moves to
a neighbouring cluster and attempts to drop it. The ant moves to the most similar cluster, ensuring that
it does not try to drop micro-clusters in clusters that are very different from its own. The ant
attempts to drop its micro-cluster in the new cluster based on the converse of (7). If the drop
operation succeeds, the micro-cluster is moved to the new cluster; otherwise, the micro-cluster
remains in its original cluster. The ant returns to its resident cluster and updates the
similarity information between the two clusters with the latest suitability score [see (2)]. Each
ant continues to categorize its cluster until either the cluster is empty or the
categorizing ant is "asleep." Each categorizing ant has a counter, and if a pick-and-drop
operation is unsuccessful, either picking or dropping, this counter is incremented. When the
counter reaches sleepmax, the cluster is considered organized and a Boolean flag sleep is set to true.
The counter is reset to zero after a successful operation or if a new micro-cluster is placed
in the cluster by a foreign categorizing ant. When every ant is sleeping, the
process terminates. This step simplifies each cluster and causes many smaller, similar
clusters to dissolve and form larger clusters. Clusters containing only one micro-cluster are
outliers, and the clustering result is given as the group of non-empty clusters. Each cluster
contains a group of density-reachable micro-clusters which summarize the partitioned region
of high density in the feature space. This summary information is stored offline for further
use.
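A small sketch of the pick decision in (7) and of a drop decision based on its converse is given below. The appendix code uses fixed thresholds on these quantities rather than random draws, so the helper functions and the random comparisons here are illustrative assumptions only:

import random

def pick_probability(nearby, n):
    # Eq. (7): the fewer density-reachable neighbours, the more likely the pick
    return 1.0 - nearby / n

def try_pick(nearby, n):
    # Attempt a pick with probability P_pick
    return random.random() < pick_probability(nearby, n)

def try_drop(nearby, n):
    # Drop with the converse probability: likely when many micro-clusters
    # in the target cluster are density reachable from the carried one
    return random.random() < nearby / n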
[Workflow: ants are assigned to each cluster; while not all ants are asleep, each awake ant calculates P_pick; on a low probability the sleep counter is incremented, on a high probability the ant finds the most similar cluster, drops the micro-cluster there and updates the cluster similarity; the final clusters are formed once all ants are asleep.]

Figure 4.3.3.1 Work Flow of Categorization of Clusters

CHAPTER 5
IMPLEMENTATION METHODOLOGY
5.1 OVERVIEW:
The implementation methodology describes the main functional requirements needed
for the project.

5.2 ESSENTIAL LIBRARIES:


The libraries used in this project are NumPy, math, SciPy, functools, tkinter and Matplotlib.

5.2.1 NUMPY:
NumPy is a library for the Python language for working with large, multi-
dimensional arrays and matrices, together with a large collection of high-level
mathematical functions that operate on these arrays. It is particularly useful for algorithm
developers. Plain Python is slow when working with multi-dimensional arrays and the
functions and operators applied to them; NumPy helps to overcome this restriction. Any
datatype can be defined with this package.

5.2.2 MATH:
The Math Library provides us access to some common mathematical functions and
constants in Python, which we can use in our code for complex mathematical computations.
The library is an in-built Python module, so we don't have to do any installation. The
mathematical function used in this project is as follows:

5.2.2.1 Sqrt():
The sqrt() function in the math module returns the square root of a number. It is used to
find the radius of a micro-cluster.

5.2.3 SCIPY:
SciPy is an open-source scientific library for Python. The SciPy library depends on
NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy
library is built to work with NumPy arrays and provides many user-friendly and efficient
numerical routines, such as routines for numerical integration and optimization.

5.2.3.1 scipy.spatial.distance.pdist:
It is used in the project to find the Euclidean distance between every pair of n-dimensional
points. squareform converts a vector-form distance vector to a square-form distance
matrix, and vice versa.
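For example, the pairwise distance matrix used throughout the project can be built as follows (a standard SciPy usage example with made-up points):

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
condensed = pdist(points, metric='euclidean')   # condensed vector of pairwise distances
dist = squareform(condensed)                    # full square distance matrix
print(dist[0][1])                               # 5.0, distance between the first two points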

5.2.4 FUNCTOOLS:
The functools module is for higher-order functions: functions that act on or return other functions.
In general, any callable object can be treated as a function for the purposes of this module.
5.2.4.1 Reduce():
The reduce() function applies the function passed in its argument cumulatively to all of the
elements of the sequence passed along with it. It is used to convert a 2-d list into a 1-d
list in our project.
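For instance, a 2-d list can be flattened into a 1-d list with reduce as described above (a trivial illustrative example, not project code):

from functools import reduce

matrix = [[1, 2], [3, 4], [5, 6]]
flat = reduce(lambda a, b: a + b, matrix)   # concatenate the sub-lists one after another
print(flat)                                 # [1, 2, 3, 4, 5, 6]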

5.2.5 TKINTER:

The GUI used in the project is made with the help of the tkinter library and tkinter
widgets. Tkinter is the standard GUI Python library. Tkinter provides a fast and easy way to
create GUI applications. Tkinter delivers a powerful object-oriented interface to the Tk GUI
toolkit. Tkinter offers various controls, such as buttons, labels and text boxes used in a GUI
application.

5.2.6 MATPLOTLIB:
Matplotlib is one of the most common packages used for data visualization in Python. It
is a cross-platform library for creating 2-dimensional plots from data in arrays. Matplotlib is
written in Python. Matplotlib together with NumPy can be used as an open-source alternative to MATLAB.

5.2.6.1 matplotlib.pyplot:
The matplotlib.pyplot module is a collection of command-style functions that make Matplotlib
work like MATLAB. Each pyplot function makes some change to a figure: e.g., it creates a
figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot
with labels, etc. In this project matplotlib.pyplot is used to generate graphs such as bar graphs and
Cartesian graphs which represent the different parameters versus the performance of the algorithm.

5.3 FUNCTIONS USED FOR IMPLEMENTATION:


The user-defined functions used for the implementation of the project are:
5.3.1 Find Clusters:
In this method, the first data point forms a new cluster; for each successive data point
the suitability with the already formed clusters is calculated, and if it is less than or
equal to epsilon the point is added to that cluster, otherwise it forms a new cluster on its own.
Finally, rough clusters are formed.

5.3.2 Suitability:
The suitability of a data point with a cluster c containing n samples is estimated from
the Euclidean distances. Every cluster is evaluated and each data point is assigned to the most
suitable cluster, provided the suitability is equal to or below epsilon. A new cluster is formed
if the suitability value is greater than epsilon.

5.3.3 Similarity:

The similarity between clusters c1 and c2 is the average of each data point q in cluster
c1's suitability with cluster c2. The similarity to every neighbouring cluster is a running
average, updated whenever a new point is allotted to the cluster.

5.3.4 Form Micro-cluster:


In this method, after forming the rough clusters, each data point is considered as a single
micro-cluster. Two micro-clusters m1 and m2 can try to merge into a single micro-cluster
M; if radius(M) ≤ є, they merge, else the merge operation fails.

5.3.5 Radius:
The radius R of a micro-cluster is calculated in this method using the linear sum and
squared sum (d-dimensional arrays) together with the number of data points N.

5.3.6 Group clusters:


An ant is assigned to each cluster for categorization. Each ant is resident in its own
cluster. The categorizing ants probabilistically choose to pick up a micro-cluster from their
cluster. A micro-cluster M is selected at random from cluster k and is iteratively compared
with the n micro-clusters in the same cluster. This ensures a higher probability of a pick-up in
clusters containing fewer micro-clusters, which leads to the dissolution of smaller clusters into
large similar clusters. If a micro-cluster is successfully picked up, the Boolean variable
carrying is set to true and the categorizing ant moves to a neighbouring cluster and attempts to
drop it. The ant moves to the most similar cluster, ensuring that it does not try to drop micro-
clusters in clusters that are very different from its own. When every ant is sleeping, the
process terminates. This step simplifies each cluster and causes many smaller, similar
clusters to dissolve and form larger clusters. Clusters containing only one micro-cluster are
outliers, and the clustering result is given as the group of non-empty clusters.

5.3.7 Pick:
The Euclidean distance from the center of M to each of the selected micro-clusters is
evaluated and, if the two are density reachable, a nearby count is incremented. If the
probability of picking up the micro-cluster is high, the pick is successful and the micro-
cluster is selected.

5.3.8 Drop:

The ant moves to the most similar cluster and drops the micro-cluster based on the inverse of
the pick probability. If the drop succeeds, the selected micro-cluster moves to the new cluster,
else it moves back to its native cluster.

5.3.9 Purity
The purity metric measures how homogeneous a cluster is. The overall purity for the total
number of clusters is calculated in this method.

5.3.10 F-Measure:
The F-measure, also called F-score or F1-score, is the harmonic mean of the recall and
precision scores. The overall F-measure for the total number of clusters is calculated in this
method.

5.3.11 Silhouette:
The silhouette coefficient is a measure of how similar an object is to its own cluster
compared to the other clusters. The silhouette ranges from -1 to +1. The silhouette value is
calculated for a range of epsilon values; the highest silhouette value gives the epsilon which
yields the best clustering results.

5.3.12 Show Table:


In this method, the clustering results are displayed in table format. The dataset is
processed for different epsilon values, and the number of clusters before and after sorting,
purity, F-score, and silhouette values are displayed.

CHAPTER 6
PERFORMANCE METRICS
6.1 OVERVIEW:
Our algorithm is evaluated across four metrics: 1) silhouette coefficient; 2) F-measure;
3) purity; and 4) Rand index. On non-stationary data streams the performance is compared
with three important stream clustering algorithms, CluStream, DenStream and ClusTree, and
on stationary data streams it is compared with ant clustering algorithms.

6.2 SILHOUETTE CO-EFFICIENT:

The silhouette coefficient is a measure of how similar an object is to its own
cluster compared to the other clusters. The silhouette ranges from -1 to +1, where a high
value indicates that the object is well matched to its own cluster and poorly matched to
adjacent clusters.

c(a) = ( 1 / (|K_a| − 1) ) Σ_{b ∈ K_a, b ≠ a} dist(a, b)

Where,
dist(a, b) – distance between data points a and b
K_a – the cluster containing a
c(a) – mean distance between a and all other data points in the same cluster

d(a) = min_{i ≠ a} ( 1 / |K_i| ) Σ_{b ∈ K_i} dist(a, b)

Where,
K_i – any other cluster
d(a) – minimum mean distance between a and all the data points of any other cluster

S(j) = 1 − c(j)/d(j),   if c(j) < d(j)
S(j) = 0,               if c(j) = d(j)
S(j) = d(j)/c(j) − 1,   if c(j) > d(j)
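A minimal sketch of the silhouette of a single point, following the c/d notation above; the function signature and the list-of-points cluster representation are illustrative assumptions, and the other members of the point's own cluster are passed in separately:

import numpy as np

def point_silhouette(a, own_cluster_rest, other_clusters):
    # own_cluster_rest holds the other members of a's cluster (assumed non-empty)
    dist = lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
    c = sum(dist(a, b) for b in own_cluster_rest) / len(own_cluster_rest)   # c(a)
    d = min(sum(dist(a, b) for b in k) / len(k) for k in other_clusters)    # d(a)
    if c < d:
        return 1 - c / d
    if c > d:
        return d / c - 1
    return 0.0

print(point_silhouette([0, 0], [[0, 1]], [[[5, 5], [6, 6]]]))   # close to +1: well clustered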

6.3 PURITY:
The purity metric measures how homogeneous a cluster is. A cluster is assigned to the class
which appears most often inside the cluster; the accuracy of this is estimated by counting the
instances of this class and dividing by the total number of instances in the cluster. K contains
n clusters. In every identified cluster K_i (i = {1, ..., n}), S_i represents the most frequent
class label in cluster K_i. The following quantities are computed for cluster K_i:

Precision_{K_i} = S_sum^i / |K_i|

Where,
S_sum^i – number of instances of S_i in K_i
S_total^i – total number of instances of S_i in the present window

Overall purity (P) can now be stated in terms of the total number of clusters identified, as
follows:

P = (1/n) Σ_{i=1..n} Precision_{K_i}
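The overall purity can be computed directly from the class labels of the points in each identified cluster; the following short sketch (cluster contents and helper name are illustrative assumptions) mirrors the definition above:

from collections import Counter

def purity(clusters):
    # Each cluster is a list of class labels; its precision is the share of its majority label
    precisions = [Counter(labels).most_common(1)[0][1] / len(labels) for labels in clusters]
    return sum(precisions) / len(precisions)

print(purity([['a', 'a', 'b'], ['b', 'b', 'b', 'a']]))   # (2/3 + 3/4) / 2 ≈ 0.708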

6.4 F-MEASURE:
The F-measure, also called F-score or F1-score, is the harmonic mean of the recall and
precision scores found by the algorithm.

Recall_{K_i} = S_sum^i / S_total^i

Score_{K_i} = 2 · ( Precision_{K_i} · Recall_{K_i} ) / ( Precision_{K_i} + Recall_{K_i} )

Overall, the F-measure (F) can now be stated in terms of the total number of clusters
identified, as follows:

F = (1/n) Σ_{i=1..n} Score_{K_i}
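A matching sketch for the F-measure, reusing the majority-label precision above and taking the recall against the total count of that label in the current window (again with illustrative names and toy labels):

from collections import Counter

def f_measure(clusters):
    # Total count of every label across the window (all clusters passed in)
    totals = Counter(label for labels in clusters for label in labels)
    scores = []
    for labels in clusters:
        majority, count = Counter(labels).most_common(1)[0]
        precision = count / len(labels)          # share of the majority label inside the cluster
        recall = count / totals[majority]        # share of that label captured by this cluster
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)

print(f_measure([['a', 'a', 'b'], ['b', 'b', 'b', 'a']]))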

6.5 RAND-INDEX:
The Rand index (R) measures the number of decisions that are correct, penalizing false
positives and false negatives. It measures the accuracy of the clustering structure, defined as
follows:

R = (TP + TN) / (TP + TN + FP + FN)

Where,
TP – true positive decisions
TN – true negative decisions
FP – false positive decisions
FN – false negative decisions
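For the Rand index, the TP/TN/FP/FN counts can be obtained by comparing every pair of points in the predicted clustering against the ground-truth classes; the sketch below (function name and toy labels assumed) is one way to do this:

from itertools import combinations

def rand_index(predicted, truth):
    # predicted[i] and truth[i] are the cluster label and the true class of point i
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_cluster = predicted[i] == predicted[j]
        same_class = truth[i] == truth[j]
        if same_cluster and same_class:
            tp += 1
        elif not same_cluster and not same_class:
            tn += 1
        elif same_cluster and not same_class:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)

print(rand_index([0, 0, 1, 1], ['a', 'a', 'b', 'a']))   # 3 of 6 pairs agree -> 0.5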

CHAPTER 7
RESULTS AND DISCUSSION
7.1 OVERVIEW
This chapter explains the results of our project; the screenshots for each step are
included and explained.

7.2 DATASETS:

The datasets used in our project are:

7.2.1 STATIONARY DATASET:

A stationary dataset is one whose statistical properties such as the mean, variance and
autocorrelation are all constant over time.

DATASETS CLASSES FEATURES EXAMPLES
Iris 3 4 150
Wine 3 13 178
Zoo 7 17 101

Table 7.2.1 Stationary Dataset

7.2.2 NON-STATIONARY DATASET:

A non-stationary dataset is one whose statistical properties such as the mean,
variance and autocorrelation change over time.

DATASETS CLASSES FEATURES EXAMPLES


1CDT 2 2 16,000
2CHT 2 2 16,000
4CR 4 2 144,000
4CE1CF 5 2 173,000
Network Intrusion 2 42 494,000
Forest Cover 7 54 580,000

Table 7.2.2 Non-stationary Dataset


7.3 SCREENSHOTS

Figure 7.3.1 Homepage


Figure 7.3.1 shows the homepage, which comprises various options: uploading a
dataset file from the system, a textbox to enter the window size used to process the data, a
start-clustering button to cluster the dataset, and a get-result button to display the results.

Figure 7.3.2 User Input


Figure 7.3.2 shows the list of folders containing datasets. The input dataset can be
selected from any of the folders. These folders include various datasets like Iris, Wine, Zoo
etc. The user is allowed to select the input file from the folder.

Figure 7.3.3 Clustering Results


Figure 7.3.3 shows the results after the clustering process is completed. The dataset is
processed for different epsilon values, and the number of clusters before and after sorting,
purity, F-score, and silhouette values are displayed.

Figure 7.3.4 Silhouette Vs Epsilon on wine dataset
Figure 7.3.4 shows the silhouette vs. epsilon graph for the stationary wine dataset.
The epsilon value with the highest silhouette value is considered the best epsilon value.

Figure 7.3.5 Silhouette Vs Epsilon on 1CDT dataset


Figure 7.3.5 shows the silhouette vs. epsilon bar graph for the non-stationary 1CDT dataset.
The best epsilon value is 15, which has the highest silhouette value.

Figure 7.3.6 Purity Vs Epsilon on Wine Dataset
Figure 7.3.6 shows the purity vs. epsilon bar graph for the stationary wine dataset.
The purity values of the wine dataset at different epsilon values are represented in the graph.

Figure 7.3.7 F-Measure Vs Epsilon on Wine Dataset


Figure 7.3.7 shows the F-Measure vs. epsilon bar graph for the stationary wine dataset.
The F-Measure values of the wine dataset at different epsilon values are represented in the
graph.

Figure 7.3.8 Purity Vs Epsilon on 1CDT Dataset
Figure 7.3.8 shows the purity vs. epsilon bar graph for the non-stationary 1CDT dataset.
The purity values of the 1CDT dataset at different epsilon values are represented in the
graph.

Figure 7.3.9 F-Measure Vs Epsilon on 1CDT Dataset


Figure 7.3.9 shows the F-Measure vs. epsilon bar graph for the non-stationary 1CDT dataset.
The F-Measure values of the 1CDT dataset at different epsilon values are represented
in the graph.

Figure 7.3.10 Threshold value Vs Performance on Zoo Dataset
Figure 7.3.10 shows the effect of the threshold value on the Zoo dataset. This plot shows
how the purity and F-Measure vary with respect to the change in threshold value.

Figure 7.3.11 Performance Vs Sleep max on Iris Dataset


Figure 7.3.11 shows the effect of the sleep max value on the Iris dataset. This plot shows
how the purity and F-Measure vary with respect to the change in sleep max value.

Figure 7.3.12 Performance Vs Sleep max on Network Intrusion
Figure 7.3.12 shows how the purity and F-Measure vary with respect to the change in
sleep max value on the network intrusion dataset.

Figure 7.3.13 F-Measure Vs window size on 4CE1CF Stream


Figure 7.3.13 shows how the F-Measure varies over a stream of data processed in 1000
windows of size 500 on the 4CE1CF data stream.

Figure 7.3.14 F-Measure Vs window size on Forest Cover
Figure 7.3.14 shows how the F-Measure varies over a stream of data processed in 1000
windows of size 500 on the Forest Cover dataset.

Figure 7.3.15 Purity of DBCSD Vs other Ant Clustering Algorithms


Figure 7.3.15 shows the purity of our algorithm compared with the purity of other ant
clustering algorithms on stationary datasets.

Figure 7.3.16 F-Measure of DBCSD Vs other Ant Clustering Algorithms
Figure 7.3.16 shows the F-Measure of our algorithm compared with the F-Measure of other
ant clustering algorithms on stationary datasets.

Figure 7.3.17 Purity of DBCSD Vs other Stream Clustering Algorithms


Figure 7.3.17 shows the purity of our algorithm compared with the purity of other stream
clustering algorithms on non-stationary datasets.

Figure 7.3.18 F-Measure of DBCSD Vs other Stream Clustering Algorithms
Figure 7.3.18 shows the F-Measure of our algorithm compared with the F-Measure of other
stream clustering algorithms on non-stationary datasets.

Figure 7.3.19 Clustering Result of Iris Dataset


Figure 7.3.19 is a scatter plot showing the various clusters formed after categorization of
clusters on the Iris dataset. The six colors red, blue, black, grey, green and purple represent six
different clusters.

Figure 7.3.20 Clustering Result of Wine Dataset
Figure 7.3.20 is a scatter plot showing the various clusters formed after categorization of
clusters on the Wine dataset. The three colors red, blue and purple represent three different clusters.
DATASET PURITY F-MEASURE
Wine 0.95 0.96
Iris 0.96 0.91
Zoo 0.93 0.92

Table 7.3.1 Performance of stationary dataset


Table 7.3.1 shows the purity and F-measure values. The performance on stationary
datasets like Wine, Iris and Zoo is displayed.

DATASET PURITY F-MEASURE


1CDT 0.99 0.99
2CHT 0.83 0.46
4CR 0.99 0.97
4CE1CF 0.95 0.80
Network Intrusion 1.0 0.91
Forest Cover 0.85 0.60

Table 7.3.2 Performance of Non-stationary dataset
Table 7.3.2 shows the purity and F-measure values. The performance on the non-stationary
datasets is given in the above table.

NOISE % PURITY F-MEASURE


0% 1.0 1.0
3% 0.998 0.996
5% 0.998 0.996
8% 0.998 0.996

Table 7.3.3 Performance of B1C10D25 with noise


Table 7.3.3 shows the purity and F-measure values with respect to the change in noise
percentage, showing how our approach is robust to noise. B1C10D25 specifies that the
dataset holds 100,000 data points of 25 dimensions, belonging to ten different clusters.

DATASET DENSTREAM CLUSTREAM CLUSTREE DBCSD

P F P F P F P F

1CDT 1.0 0.82 1.0 0.88 1.0 0.85 0.99 0.99

2CHT 0.43 0.27 0.24 0.23 0.22 0.89 0.83 0.46

4CR 1.0 0.67 1.0 0.89 1.0 0.89 0.99 0.97


4CE1CF 0.99 0.35 0.99 0.36 0.99 0.86 0.95 0.80
NETWORK INTRUSION 1.0 0.8 0.35 0.13 0.36 0.16 1.0 0.91
FOREST COVER 0.89 0.1 0 0 0 0 0.85 0.60
Table 7.3.4 Result of Existing Stream Clustering Algorithms
Table 7.3.4 shows the performance of DBCSD on non-stationary datasets compared with
existing stream clustering algorithms.

DATASET Antclust ACAm DBCSD


P F P F P F
Iris 0.89 0.84 0.77 0.77 0.96 0.91
Zoo 0.66 0.68 0.77 0.76 0.93 0.92
Wine 0.94 0.73 0.86 0.86 0.95 0.96

Table 7.3.5 Result of Existing Ant Clustering Algorithms
Table 7.3.5 shows the performance of DBCSD on stationary datasets compared with existing
ant clustering algorithms.

CHAPTER 8
CONCLUSION AND FUTURE WORK
8.1 CONCLUSION
In this project we have used the behaviour of ants for clustering dynamic data streams. The
system is tested with non-stationary datasets like network intrusion, forest cover, etc. and
stationary datasets like iris, wine and zoo. The rough clusters are formed in a single pass of
the window. The clusters formed earlier are refined using a process inspired by the
categorizing behaviour of ants. This categorization technique is based on the typical pick-
and-drop method used by ants. Rough clusters are formed rapidly in one pass and then
sorted. In traditional algorithms, data points are moved independently, which consumes
excessive time. By combining similar points into micro-clusters, a number of points can be
moved in a single operation, further speeding up the algorithm. Experimental
results show that our algorithm performs well while needing fewer parameters. The parameter
epsilon was shown to be very sensitive and to critically affect the performance of the
algorithm. It is data-dependent and requires manual tuning. By calculating the silhouette
value, the best epsilon value is found in our project.

8.2 FUTURE WORK

In future work, we plan to apply various clustering algorithms to get better
cluster quality and to investigate an adaptive, local є parameter. This could potentially allow
the discovery of clusters with varying densities in the data.

CHAPTER 9
APPENDIX
9.1 CODING:
import math
import random
from functools import reduce
import scipy.spatial as sc1
from pyspark import SparkContext
from pyspark import SparkConf
import numpy
import numpy as np
import pandas as pd
import tkinter as tk
from tkinter import filedialog
import matplotlib.pyplot as plt
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
import timeit
radii=[]
a=[]
Data=[]
dist=[]
#Find Rough Clusters
class find:
clust=[]
def __init__(self,epsilon):
self.epsilon=epsilon
def findcluster(self,ds,ran):
suit=0
max=0
hh=0
i=j=k=0
ranlen=0
if len(self.clust)==0:
self.clust.append([0])
ran=1
ranlen=len(ds)

else:
ranlen=ran+len(ds)
for i in range(ran,ranlen):
for j in range(len(self.clust)):
suit=0
for k in range(len(self.clust[j])):
#Suitablility Calculation
suit=suit+dist[self.clust[j][k]][i]
suit=suit/(len(self.clust[j])+1)
if(suit>max): max=suit
hh=j
if max<=self.epsilon:
self.clust[hh].append(i)
else:
self.clust.append([i])
max=0
print("Rough self.clusters\n",self.clust)
return self.clust

# To group microcluster within the rough cluster


class Microcluster:
def __init__(self,epsilon):
self.epsilon=epsilon
# Radius for each cluster
def radius(self,x):
res=sum(map(lambda i:i*i,x))
radius=math.sqrt((res/len(x))-(numpy.mean(x)**2))
return radius
def suitability(self,p,pt,d):
sum2=0
h=0
for h in range(len(p)):
sum2=sum2+d[p[h]][pt]
sum2=sum2/len(p)
return sum2
#Similarity Updation
def similarity(self,a,dist):
sim=[[0 for i in range(len(a))] for j in range(len(a))]
for z in range(len(a)):
for f in range(len(a)):
if f!=z:
o=0
sum1=0
ll=reduce(lambda aa,bb :aa+bb,a[f])
for y in range(len(a[z])):
for tt in range(len(a[z][y])):
sum1=sum1+self.suitability(ll,a[z][y][tt],dist)
o=o+len(a[z][y])
sum1=sum1/o
sim[z][f]=sum1
return sim
def mergemicro(self,a):  # form micro-clusters within each rough cluster and merge them
temp=0
for i in range(len(a)):
s=a[i]
t=[]

for j in range(len(s)):
for h in range(len(t)):
temp=0
var=0
for k in range(len(t[h])):
var=var+radii[t[h][k]]
var=(var+radii[s[j]])/(len(t[h])+1)
if var<self.epsilon:
t[h].append(s[j])
temp=1
break
if temp==0:
t.append([s[j]])
temp=0
a[i]=t
print("\nMerged MicroClusters\n",a)
print("Length before sorting\n",len(a))
return a
#Sorting Cluster after merge operation
class SortCluster(Microcluster):
def __init__(self,epsilon,sim,dist,sclust):
self.sim=sim
self.epsilon=epsilon
self.dist=dist
self.sclust=sclust
self.ant=[False for i in range(len(self.sclust))]
self.counter=[0 for i in range(len(self.sclust))]
self.dummy=[0 for i in range(len(self.sclust))]
self.sleepmax=[]
for f in range(len(self.sclust)):
self.sleepmax.insert(f,3)
#To drop the selected micro cluster into the most similar cluster
def drop(self,lt,i):
maxi=0
lists=[]
count=0
temp=0
t7=0
if len(self.sclust)==1:
return 0
for j in range(len(self.sclust)):
if i!=j:
if maxi<self.sim[i][j]:
maxi=self.sim[i][j]
temp=j
lists=self.sclust[temp]
if len(lists)==1:
count=0
t7=len(lists[0])
else:
for j in range(len(lists)):
q=abs(self.Mean(lists[j])-self.Mean(lt))
if q<=epsilon:
count=count+1

t7=t7+len(lists[j])
t7=t7+len(lt)
pdrop=1/(1-(count/t7))
if pdrop <=2:
self.sclust[temp].append(lt)
self.counter[i]=0
return 1
else:
self.counter[i]=self.counter[i]+1
return 0
#To pick a micro-cluster
def pick(self,i,k,count,t7):
ppick=1-(count/t7)
if ppick >=0.5:
self.counter[i]=0
if self.drop(self.sclust[i][k],i):
self.sclust[i].remove(self.sclust[i][k])
self.sleepmax[i]=3
#print(self.sclust)
if self.sclust[i]==[]:
self.sclust.remove(self.sclust[i])
self.ant.remove(self.ant[len(self.ant)-1])
self.sim=super().similarity(self.sclust,dist)
return 1
else:
return 0
else:
self.counter[i]=self.counter[i]+1
return 0
def Mean(self,x):
sum1=0
i=0
for i in range(len(x)):
sum1= sum1+(sum(Data[x[i]])/len(Data[x[i]]))
return sum1
def mergecluster(self):
i=j=k=q=t7=count=tp=t1=0
while i < len(self.sclust):
print("iiii",i)
while self.ant[i]==False:
j=0
while j < len(self.sclust[i]):
k=0
while k < len(self.sclust[i]):
tp=0
if len(self.sclust[i])==1:
count=1
t7=len(self.sclust[i][k])
tp=1
break
if j!=k:
q=abs(self.Mean(self.sclust[i][k])-self.Mean(self.sclust[i][j]))
if q<=epsilon:
count=count+1

t7=t7+len(self.sclust[i][k])
k=k+1
if tp==1:
if self.pick(i,k,count,t7)==0:
self.dummy[i]=self.dummy[i]+1
j=j+1
else:
t1=1
break
else:
t7=t7+len(self.sclust[i][j])
if self.pick(i,j,count,t7)==0:
self.dummy[i]=self.dummy[i]+1
j=j+1
t7=0
tp=0
count=0
if t1==1:
break
if self.dummy[i]==len(self.sclust[i]):
self.ant[i]=True
if self.counter[i]==0:
self.ant[i]=True
if(self.counter[i]>=self.sleepmax[i]):
self.ant[i]=True
if t1==1:
self.dummy[i]=0
if t1==0:
i=i+1
if i<len(self.sclust) :
self.sleepmax[i]=len(self.sclust[i])
t1=0
print("ANT",self.ant)
print("Sorted cluster",self.sclust)
print("Length after sorting:",len(self.sclust))
return self.sclust
# To count the instances of class label t across the clusters (used for the recall in the F-Measure)
def recalls(self,t,f):
v=0
for i in range(len(f)):
for j in range(len(f[i])):
for k in range(len(f[i][j])):
if Data[f[i][j][k]][0]==t:
v=v+1
return v
# To Find Purity Value
def Purity(self,f):
r=ind=sum=fscore=tot=maxi=0
arr=[]
for i in range(len(f)):
arr=[]
dup=[]
ct=[]
for j in range(len(f[i])):

for k in range(len(f[i][j])):
arr.insert(ind,int(Data[f[i][j][k]][0]))
ind=ind+1
r=r+len(f[i][j])
dup=arr
dup=list(dict.fromkeys(dup))
ct=[]
for ii in range(len(dup)):
ct.insert(ii,arr.count(dup[ii]))
maxi=max(ct)
for ii in range(len(ct)):
if(ct[ii]==maxi):
break
t=ii+1
if len(f[i])==1:
recall=1
else:
recall=self.recalls(t,f)
recall=maxi/recall
sum=(maxi/r)
score=2*((recall*sum)/(recall+sum))
tot=tot+sum
fscore=fscore+score
t=0
r=0
maxi=0
tot=tot/len(f)
fscore=fscore/len(f)
print("precision:",tot)
file1 = open("1.txt","a+")
file2=open("2.txt","a+")
file1.write(str(self.epsilon)+'\t'+str(tot)+'\t'+'\n')
file2.write(str(self.epsilon)+'\t'+str(fscore)+'\t'+'\n')
file1.close()
file2.close()
return tot,fscore # return the averaged precision and F-measure so the caller can tabulate them
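# Standard definitions used above: for each final cluster, precision = maxi / r (i.e., the cluster's
# purity), where r is the number of points in the cluster and maxi the size of its majority class;
# recall = maxi / F_Measure(t,f), the total number of points of that class; the per-cluster
# F-measure is 2*precision*recall/(precision+recall). tot and fscore are their averages over all clusters.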
# Silhouette calculation
def silhouette(self,clust,dist):
ll=[]
ai=0
bi=0
si=0
sumi=0
mini=1000000
# print("cluster",clust[0])
for i in range(len(clust)):
ll.insert(i,reduce(lambda aa,bb :aa+bb,clust[i]))
#print("liiii",ll)
for i in range(len(ll)):
mini=1000000 # reset the nearest-other-cluster distance for every cluster
if(len(ll[i])==1):
ai=0
j=0
else:
j=random.randint(0,len(ll[i])-1)
# print("jjjj",j)

# print("len",len(ll[i]))
for k in range(len(ll[i])):
if j!=k:
ai=ai+dist[ll[i][j]][ll[i][k]]
ai=ai/(len(ll[i])-1)
for r in range(len(ll)):
if(i!=r):
bi=0
for k in range(len(ll[r])):
bi=bi+dist[ll[i][j]][ll[r][k]]
bi=bi/(len(ll[r]))
if bi<mini:
mini=bi
bi=mini
if(ai<bi):
si=1-(ai/bi)
elif bi<ai:
si=(bi/ai)-1
else:
si=0
sumi=sumi+si
sumi=sumi/(len(ll))
print("silhouette",sumi)
return sumi
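# Note: this is a sampled approximation of the standard silhouette coefficient s = (b - a)/max(a, b).
# For each cluster one random member j is taken as representative; a is its mean distance to the other
# members of its own cluster, b is the smallest mean distance to the members of any other cluster, and
# the per-cluster values are averaged.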
def display():
global path
f1=str(filedialog.askopenfilename(filetypes=[("csv files","*.csv")]))
lab=tk.Label(top,text=f1,font=("Courier", 14))
lab.place(x=300,y=200)
print(f1)
path=f1
def Clustering():
n1=textbox.get("1.0","end-1c")
window=int(n1)
start = timeit.default_timer()
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
logData = sc.textFile(path).cache() # to locate the csv file
li=logData.map(lambda l:l.split(",")).collect() # split the csv file as list of strings
sum1=0
for i in range(0,len(li)):
t=list(map(float,li[i])) # convert the list of strings into float
Data.append(t)
sum1=sum1+max(Data[i])
sum1=sum1/len(Data)
print(sum1)
fg=sc1.distance.pdist(Data,'euclidean')
global dist
dist=sc1.distance.squareform(fg)
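# pdist returns the condensed pairwise Euclidean distances; squareform expands them into the full
# symmetric distance matrix indexed by point number, which the clustering and silhouette code uses.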
st=e=0
inc=1 # default step size in case sum1 falls outside the ranges handled below
if sum1<5:
st=4
e=8
inc=1
elif sum1<6:

st=3
e=4
inc=0.2
elif sum1<10:
st=15
e=20
inc=1
elif sum1<15:
st=10
e=15
inc =1
elif sum1<30:
st=10
e=25
inc=5
elif sum1>700 :
st=500
e=600
inc=20
global data
global data1
global cols
global row
row=0
for i in np.arange(st,e,inc):
epsilon=round(i,1)
print("EPI",epsilon)
f=find(epsilon)
m=Microcluster(epsilon)
for ik in range(len(Data)):
radii.insert(ik,m.radius(Data[ik]))
p=0
sett=[]
ii=0
while(p<len(Data)):
p=p+window
sett=Data[ii:p]
rough=f.findcluster(sett,ii)
ii=p
print("Rough",rough)
mer=m.mergemicro(rough)
epi=str(epsilon)
bef=str(len(mer))
sim=m.similarity(mer,dist)
s=SortCluster(epsilon,sim,dist,mer)
fp=s.mergecluster()
aftr=str(len(fp))
tot,fscore=s.Purity(fp) # Purity also appends precision and F-measure to 1.txt and 2.txt
rough=[]
kk=str(s.silhouette(fp,dist))
data[row][0]=epi
data[row][1]=bef
data[row][2]=aftr

data[row][3]=tot
data[row][4]=fscore
data[row][5]=kk
if float(kk) <0.0:
kk=str(0.0)
data1[row][0]=float(epsilon)
data1[row][1]=float(kk)
row=row+1
print("silh",kk)
clust.clear()
mer.clear()
print("row:",row)
def showtable():
n=70
m=450
k=70
# Table to display the Results
for y in range(row+1):
for x in range(len(cols)):
if y==0:
e=tk.Entry(font=('Courier 10 bold'),bg='khaki',justify='center')
e.grid(column=x, row=y)
e.insert(0,cols[x])
e.place(x=n,y=450)
n=n+150
else:
e=tk.Entry(font=('Courier 10 bold'))
e.grid(column=x, row=y)
e.insert(0,data[y-1][x])
e.place(x=k,y=m)
k=k+150
m=m+20
k=70
# Bargraph for silhouette and epsilon
df = pd.DataFrame(data1[:row],columns=['Epsilon','Silhouette value'])
print("dffffff",df)
figure1 = plt.Figure(figsize=(5,4), dpi=100)
ax1 = figure1.add_subplot(111)
bar1 = FigureCanvasTkAgg(figure1, top)
bar1.get_tk_widget().pack(side=tk.RIGHT, fill=tk.BOTH)
df = df[['Epsilon','Silhouette value']].groupby('Epsilon').sum()
df.plot(kind='bar', legend=True, ax=ax1,color='khaki' ,width=0.3)
ax1.set_title('Epsilon vs. Silhouette value')
top = tk.Tk()
background_image=tk.PhotoImage(file=r"C:\Users\priya\Desktop\project\.spyder-py3\clus.jpg")
background_label = tk.Label(top, image=background_image)
background_label.place(x=10, y=10, relwidth=1, relheight=1)
top.geometry(str(top.winfo_screenwidth()-20)+"x"+str(top.winfo_screenheight()-40))
top.title("A Density Based Approach To Cluster Streaming Data")
lab=tk.Label(top,text="Browse Dataset",font=("Courier", 14))
lab.place(x=100,y=200)
b1=tk.Button(top,text="UPLOAD",font=("Courier", 14),command=display,bg="khaki")
b1.place(x=800,y=197)
lab=tk.Label(top,text="Enter Window Size",font=("Courier", 14))

lab.place(x=100,y=250)
textbox=tk.Text(top,height=1.5,width=30)
textbox.place(x=350,y=250)
b2=tk.Button(top,text="StartClustering",font=("Courier",14),command=lambda:getnumber(),bg="kha
ki")
b2.place(x=690,y=250)
b2=tk.Button(top,text="GetResults",font=("Courier",14),command=lambda:getresult(),bg="khaki")
b2.place(x=390,y=350)
top.mainloop()
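# The plotting script below runs only after the Tkinter window is closed, since top.mainloop() blocks
# until then; it reads the purity and F-measure values that Purity() appended to 1.txt and 2.txt.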
# To generate a graph for purity and F-measure
import matplotlib.pyplot as plt
import numpy as np
leg=["Purity","F-measure"]
files=["1.txt","2.txt"]
for i in range(2):
plt.title('performance metrics for wine dataset')
plt.xlabel('Epsilon')
plt.ylabel('Performance')
f=open(files[i], "r")
clus=[]
time=[]
for line in f:
col=line.split('\t')
clus.append(col[0])
time.append(col[1])
print(clus)
print(time)
clus=np.array(clus).astype(float)
time=np.array(time).astype(float)
if i==0:
plt.plot(clus,time, color='green', linestyle='dashed', linewidth = 1,
marker='o', markerfacecolor='blue', markersize=6,label=leg[i])
plt.legend()
if i==1:
plt.plot(clus,time, color='black', linestyle='dashed', linewidth = 1,
marker='o', markerfacecolor='red', markersize=6,label=leg[i])
plt.legend()
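As an optional sanity check (not part of the project code), the silhouette value produced by SortCluster.silhouette() can be compared against scikit-learn's exact silhouette_score, assuming scikit-learn is installed; flatten_labels below is a hypothetical helper that converts the nested cluster structure returned by mergecluster() into one label per data point.

# Hypothetical cross-check sketch (assumes scikit-learn is available)
import numpy as np
from sklearn.metrics import silhouette_score

def flatten_labels(clusters, n_points):
    labels = np.full(n_points, -1)
    for cid, cluster in enumerate(clusters):
        for micro in cluster:          # each cluster is a list of micro-clusters
            for idx in micro:          # each micro-cluster is a list of point indices
                labels[idx] = cid
    return labels

# Example usage (fp and Data as produced by Clustering()):
# labels = flatten_labels(fp, len(Data))
# print(silhouette_score(np.array(Data), labels, metric='euclidean'))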

