- for unsupervised anomaly detection in network traﬃc
Koﬃ Bruno Yao (koﬃ@diku.dk)
February 28, 2006
Contents

0.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction 2
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Goal of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 Introduction to computer network security . . . . . . . . . . . 6
2.1.1 Network security . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Network intrusion detection systems . . . . . . . . . . 7
2.1.3 Network anomaly detection . . . . . . . . . . . . . . . 8
2.1.4 Computer attacks . . . . . . . . . . . . . . . . . . . . . 9
2.2 Introduction to clustering . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Notation and deﬁnitions . . . . . . . . . . . . . . . . . 12
2.2.2 The clustering problem . . . . . . . . . . . . . . . . . . 12
2.2.3 The clustering process . . . . . . . . . . . . . . . . . . 13
2.2.4 Feature selection . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 Choice of clustering algorithm . . . . . . . . . . . . . . 13
2.2.6 Cluster validity . . . . . . . . . . . . . . . . . . . . . . 16
2.2.7 Clustering tendency . . . . . . . . . . . . . . . . . . . . 17
2.2.8 Clustering of network traﬃc data . . . . . . . . . . . . 18
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Clustering methods and algorithms 20
3.1 Hierarchical clustering methods . . . . . . . . . . . . . . . . . 21
3.2 Partitioning clustering methods . . . . . . . . . . . . . . . . . 24
3.2.1 Squared-error clustering . . . . . . . . . . . . . . . . . 24
3.2.2 Model-based clustering . . . . . . . . . . . . . . . . . . 27
3.2.3 Density-based clustering . . . . . . . . . . . . . . . . . 42
3.2.4 Grid-based clustering . . . . . . . . . . . . . . . . . . . 45
3.2.5 Online clustering . . . . . . . . . . . . . . . . . . . . . 47
3.2.6 Fuzzy clustering . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Discussion of the classical clustering methods . . . . . . . . . 52
3.4 Combining clustering methods . . . . . . . . . . . . . . . . . . 54
3.4.1 Two-level clustering with kmeans . . . . . . . . . . . . 54
3.4.2 Initialisation of clustering algorithms with the results
of leader clustering . . . . . . . . . . . . . . . . . . . . 60
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 Experiments 62
4.1 Design of the experiments . . . . . . . . . . . . . . . . . . . . 62
4.2 Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Choice of data set . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Description of the feature set . . . . . . . . . . . . . . 65
4.3 Implementation issues . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Evaluation of clustering methods 72
5.1 Evaluation methodology . . . . . . . . . . . . . . . . . . . . . 72
5.2 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Evaluation measure requirements . . . . . . . . . . . . 74
5.2.2 Choice of evaluation measures . . . . . . . . . . . . . . 74
5.3 k-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . 76
5.4 Discussion and analysis of the experiment results . . . . . . . 76
5.4.1 Results of the experiments . . . . . . . . . . . . . . . . 76
5.4.2 Analysis of the experiment results . . . . . . . . . . . . 79
6 Conclusion 86
6.1 Resume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A Deﬁnitions 95
A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.2 Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B Feature set 98
B.1 The feature set of the KDD Cup 99 data set . . . . . . . . . . 98
C Computer attacks 101
C.1 Probe attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
C.2 Denial of service attacks . . . . . . . . . . . . . . . . . . . . . 102
C.3 User to root attacks . . . . . . . . . . . . . . . . . . . . . . . . 102
C.4 Remote to local attacks . . . . . . . . . . . . . . . . . . . . . . 103
C.5 Other attack scenarios . . . . . . . . . . . . . . . . . . . . . . 104
D Theorems 105
D.1 Algorithm: Hill climbing . . . . . . . . . . . . . . . . . . . . . 105
D.2 Theorem: Jensen’s inequality . . . . . . . . . . . . . . . . . . 105
D.3 Theorem: The Lagrange method . . . . . . . . . . . . . . . . 106
E Results of the experiments 107
List of Figures
3.1 A dendrogram corresponding to the distance matrix in table 3.1 22
3.2 Variation of the sum of squared-errors in kmeans . . . . . . . 28
3.3 Variation of the log-likelihood with the iterations of the clas-
siﬁcation maximum likelihood . . . . . . . . . . . . . . . . . . 38
3.4 A 3x3 Kohonen network map . . . . . . . . . . . . . . . . . . . 40
3.5 Querying recursively a multi-resolution grid with STING . . . 45
3.6 Variation of the - fuzzy - sum of squared errors in fuzzy kmeans
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Variation of classiﬁcation accuracy with the number of basic
clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Variation of the sum of squared errors (SSE) with the number
of clusters in kmeans . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 The classiﬁcation accuracy of the clustering algorithms in ta-
bles E.1 and E.2. L+kmeans refers to leader + kmeans and
fuzzy K refers to fuzzy kmeans. The number of clusters is 23. 77
5.2 The number of diﬀerent cluster categories found by the algo-
rithms when the number of clusters is 23. The total number
of labels contained in the data set is 23. . . . . . . . . . . . . 81
5.3 The cluster entropies when the number of clusters is 23. The
cluster entropy measures the homogeneity of the clusters. The
lower the cluster entropy is, the more homogeneous the clusters
are. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 The classiﬁcation accuracy of the clustering algorithms in ta-
bles E.3 and E.4. The number of clusters is 49. . . . . . . . . 83
5.5 The number of diﬀerent cluster categories found by the algo-
rithms when the number of clusters is 49. The total number
of labels contained in the data set is 23. . . . . . . . . . . . . 84
5.6 The cluster entropies when the number of clusters is 49. . . . . 85
List of Tables
3.1 Example of distance matrix used for hierarchical clustering . . 21
4.1 Distribution of labels in the data set . . . . . . . . . . . . . . 70
B.1 Basic features of the KDD Cup 99 data set . . . . . . . . . . . 99
B.2 Content-based features . . . . . . . . . . . . . . . . . . . . . . 99
B.3 Traﬃc-based features . . . . . . . . . . . . . . . . . . . . . . . 100
E.1 Random initialisation . . . . . . . . . . . . . . . . . . . . . . . 108
E.2 Experimental results of various classical algorithms and combinations of those algorithms run on a slightly modified KDD Cup 1999 data set. The number of clusters is set to the number of attack and normal labels in the data set, which is 23. The results in table E.1 are obtained with random initialisation of the algorithms, and those of table E.2 correspond to initialisation of the algorithms with leader clustering. . . . 108
E.3 Random initialisation . . . . . . . . . . . . . . . . . . . . . . . 108
E.4 Experiment results when the number of clusters is 49 . . . . . 109
Abstract

Network anomaly detection aims at detecting malicious activities in com-
puter network traﬃc data. In this approach, the normal proﬁle of the network
traﬃc is modelled and any signiﬁcant deviation from this normal proﬁle is
interpreted as malicious. While supervised anomaly detection models the
normal traﬃc behaviour on the basis of an attack free data set, unsuper-
vised anomaly detection works on a data set which contains both normal
and attack data. Clustering has recently been investigated as one way of
approaching the issues of unsupervised anomaly detection.
This thesis investigates the cluster-based approach to oﬀ-line anomaly de-
tection. Our goal is to study the purity of the clusters created by diﬀerent
clustering methods. Ideally each cluster should contain a single type of data:
either normal or a speciﬁc attack type. The result of such a clustering can
assist a security expert in understanding diﬀerent attack types and in la-
belling the data set. One of the main challenges in clustering network traﬃc
data for anomaly detection is related to the skewed distribution of the attack
categories. Generally, a very large proportion of the network traﬃc data is
normal and only a small percentage constitutes anomalies.
Six classical clustering algorithms (kmeans, SOM, EM-based clustering, classification EM clustering, fuzzy kmeans and leader clustering) and different combination scenarios of these algorithms are discussed, implemented and experimentally compared. The experiments are performed on the KDD Cup
99 data set, which is widely used for evaluating intrusion detection systems.
The evaluation of the clustering is done on the basis of the purity of the
clusters produced by the clustering algorithms. Two of the indexes used for
quantifying the purity of clusters are: the classiﬁcation accuracy and cluster
homogeneity. The classiﬁcation accuracy is measured by the proportion of
items successfully classiﬁed and cluster homogeneity is measured by the clus-
ter entropy. We have also investigated a technique that combines different clustering methods; it has given promising results.
Keywords: Off-line network anomaly detection, unsupervised anomaly detection, clustering methods, external assessment of clustering methods.
0.1 Preface

This thesis has been written by Koffi Bruno Yao at the Department of Computer Science of the University of Copenhagen (DIKU). The thesis was written in the period 19/04/2005 to 01/03/2006 and was supervised by Peter Johansen, professor at DIKU. I would like to thank my supervisor for his support. The primary audience of this thesis is researchers in anomaly detection; however, any reader with an interest in clustering will find the thesis useful. The reader is expected to have a basic understanding of computer networks and some basic mathematical knowledge.
1 Introduction

1.1 Motivation

It is important for companies to keep their computer systems secure because their economic activities rely on them. Despite the existence of attack prevention mechanisms such as firewalls, most company computer networks still fall victim to attacks. According to the statistics of CERT , the number of reported incidents against computer networks increased from 252 in 1990 to 21756 in 2000, and to 137529 in 2003. This happens because firewalls are misconfigured or because malicious activities are cleverly designed to circumvent the firewall policies. It is therefore crucial to have another line of defence that can detect and stop malicious activities. This line of defence is the intrusion detection system (IDS).
During the last decades, diﬀerent approaches to intrusion detection have
been explored. The two most common approaches are misuse detection and
anomaly detection. In misuse detection, attacks are detected by matching
the current traﬃc pattern with the signature of known attacks. Anomaly
detection keeps a proﬁle of normal system behaviour and interprets any sig-
niﬁcant deviation from this normal proﬁle as malicious activity. One of the
strengths of anomaly detection is the ability to detect new attacks. Anomaly
detection’s most serious weakness is that it generates too many false alarms.
Anomaly detection falls into two categories: supervised anomaly detection
and unsupervised anomaly detection. In supervised anomaly detection, the
instances of the data set used for training the system are labelled either as
normal or as a specific attack type. The problem with this approach is that labelling the data is time consuming. Unsupervised anomaly detection, on the
other hand, operates on unlabeled data. The advantage of using unlabeled
data is that the unlabeled data is easy and inexpensive to obtain. The main
challenge in performing unsupervised anomaly detection is distinguishing the
normal data patterns from attack data patterns.
Recently, clustering has been investigated as one approach to solving this
problem. As attack data patterns are assumed to diﬀer from normal data
patterns, clustering can be used to distinguish attack data patterns from
normal data patterns. Clustering network traffic data is difficult because:
1. the data volume is high
2. the data dimensionality is high
3. the distribution of attack and normal classes is skewed
4. the data is a mixture of categorical and continuous attributes
5. the data requires substantial pre-processing.
1.2 Goal of the thesis
Although diﬀerent clustering algorithms have been studied for this pur-
pose, to our knowledge not much has been done in the direction of comparing
and combining diﬀerent clustering approaches. We believe that such a study
could help in designing the most appropriate clustering approaches for the
purpose of unsupervised anomaly detection.
The main goals of this thesis are:
1. to provide a comprehensive study of the clustering problem and the
diﬀerent methods used to solve it
2. to implement and experimentally compare some classical clustering algorithms
3. and to combine different clustering approaches.
1.3 Related works
Clustering has been studied in many scientiﬁc disciplines. A wide variety
of algorithms are found in clustering literature.  gives a good review of the
main classical clustering concepts and algorithms. [3, 20] provide an excellent
mathematical approach to the clustering problem.  is also an excellent
source; it covers all the main steps in clustering. It discusses clustering
from a statistical perspective. [10, 18] present recently developed clustering
algorithms for clustering large data sets.
There are in the literature many examples of experimental comparisons of
clustering algorithms. Some examples of recent works are found in [35, 12]. In
, dynamical clustering, kmeans, SOM, hierarchical agglomerative cluster-
ing and CLICK have been compared for gene expression data.  compares
kmeans, SOM and ART-C on text documents. These comparisons differ in the selection of clustering algorithms, the data set, the evaluation criteria used for assessing the algorithms, and the evaluation methodology. Some
of these experiments compare clusters on the basis of internal criteria such as
the number of clusters, compactness and separability of clusters, while other
works compare clustering algorithms on the basis of external indices. An
external index measures how well a partition created by a given clustering
algorithm matches an a priori partitioning of the data set. Our choice of clus-
tering algorithms, data set, evaluation criterion and evaluation methodology
distinguishes our work from these works.
 provides a good review of data mining approaches for intrusion detection. Much work has been done in the area of unsupervised anomaly detection [7, 4, 6]. In , Eskin uses clustering to group normal data; intrusions
are considered to be outliers. Eskin follows a probability-based approach to outlier detection. In this approach, the data space has an unknown proba-
bility distribution. In this data space, anomalies are located in sparse regions
while normal data are found in dense regions.
1.4 Thesis organization
This thesis is composed of two main parts:
• A theoretical part, in which the clustering problem and the diﬀerent
clustering methods are studied. This part consists of chapters 2 and 3. Chapter 2 is an introduction to anomaly detection and clustering.
Chapter 3 discusses the diﬀerent clustering methods. In conclusion,
diﬀerent combinations of these methods are proposed.
• An experimental part, which consists of chapters 4 and 5. In chapter 4, the data set and the design of the experiments are discussed. Chapter 5 discusses the evaluation of clustering methods. Chapter 6 concludes the thesis.
2 Background

This chapter provides background in network security and clustering relevant for understanding the thesis.
• Section 2.1 gives an introduction to network security. The deﬁnitions
of the network terminology frequently used in this thesis are found in appendix A.
• Section 2.2 gives an introduction to clustering.
• Section 2.3 summarizes this chapter.
2.1 Introduction to computer network security
Computer networks interconnect multiple computers and make it easy and
fast to share resources between these computers. The most popular example
of such a network is the global Internet. In this thesis, the term computer
networks mainly refers to private computer networks, geographically limited
and connected to the outside world. Such networks are exposed to numerous security threats. This section gives a brief discussion of some of the
main issues pertaining to network security.
2.1.1 Network security
Computer network security aims at preventing, detecting and stopping any
activity that has the potential of compromising the confidentiality and integrity of communication on the network as well as the availability of the
network’s resources and services. Another goal of security is to recover from
such malicious activities when they take place. Attack prevention is generally
implemented by security mechanisms such as authentication, cryptography
and ﬁrewalls. Although attack prevention mechanisms are crucial, they are
not enough for assuring the security of the network. Firewalls, for example,
can prevent malicious activities from penetrating into the internal network,
but they are not able to prevent malicious activities that are initiated from
inside the network. Firewalls can also be subject to attacks and prevented
from working by, for example, denial of service (DOS) attacks. Attacks can
also pass through ﬁrewalls successfully because the ﬁrewalls have been mis-
conﬁgured. Because of these weaknesses in prevention mechanisms, computer
networks will always be vulnerable to malicious activities.
Attack detection and recovery mechanisms complement attack prevention
mechanisms. This function of detection and recovery is mainly implemented
by intrusion detection systems. A distinction is made between host-based
intrusion detection systems (HIDS) and network based intrusion detection
systems (NIDS). HIDS detect intrusions directed against a single host. NIDS
detect intrusions directed against the entire network. In this thesis, we will
focus on network intrusion detection. In the next section, we will present
systems for network intrusion detection. We will discuss their architecture,
and the diﬀerent steps followed when dealing with intrusion.
2.1.2 Network intrusion detection systems
In this thesis, we cluster data for network intrusion detection. This section
discusses how the input data and the clustering result ﬁt into the architecture
of network intrusion detection systems. Network intrusion detection systems
are designed to detect the presence of malicious activities on the network.
The architecture of network intrusion detection systems generally consists
of three parts: agent, detector and notifier. Agents gather network traffic data; detectors analyse the information gathered by agents to determine the presence of attacks. The notifier makes the decision as to whether a notification about the presence of an intrusion should be sent. The same software can perform all these tasks in a simple network. In more complex networks, these
functions are distributed over the network for reasons of security, eﬃciency,
scalability and robustness.
In the context of this thesis, only agents and detectors are relevant.
Agents generally gather network traﬃc data by sniﬃng the network. Sniﬀ-
ing the network involves the agent having access to all the network traﬃc.
In an Ethernet-based network one computer can play the role of an agent.
Agents generally process the gathered data into a format that is easy for the
detector to use. The detector can use diﬀerent techniques for the detection
of intrusions. The two main techniques are misuse detection and anomaly
detection. Misuse detection detects attacks by matching the current network
traﬃc against a database of known attack signatures. Anomaly detection,
on the other hand, ﬁnds attacks by identifying traﬃc patterns that deviate
signiﬁcantly from the normal traﬃc.
The data set used in this thesis is an example of data obtained from
network intrusion detection agents. The output of the clustering serves to
deﬁne or enrich models used by the detector.
In the next section, we will look at network anomaly detection, the detection technique this thesis focuses on.
2.1.3 Network anomaly detection
As we explained earlier, detectors need models or rules for detecting intru-
sions. These models can be built oﬀ-line on the basis of earlier network traﬃc
data gathered by agents. Once the model has been built, the task of detect-
ing and stopping intrusions can be performed online. One of the weaknesses
of this approach is that it is not adaptive. This is because small changes in
traﬃc aﬀect the model globally. Some approaches to anomaly detection per-
form the model construction and anomaly detection simultaneously on-line.
In some of these approaches clustering has been used. One of the advan-
tages of online modelling is that it is less time consuming because it does
not require a separate training phase. Furthermore, the model reﬂects the
current nature of network traﬃc. The problem with this approach is that it
can lead to inaccurate models. This happens because this approach fails to
detect attacks performed systematically over a long period of time. These
types of attacks can only be detected by analysing network traﬃc gathered
over a long period of time.
The clusters obtained by clustering network traﬃc data oﬀ-line can be
used for either anomaly detection or misuse detection. For anomaly detec-
tion, it is the clusters formed by the normal data that are relevant for model
construction. For misuse detection, it is the diﬀerent attack clusters that are
used for model construction.
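The off-line, cluster-based detection scheme just described can be sketched in a few lines of Python. This is an illustrative toy, not the implementation developed in this thesis: the single-pass leader clustering, the distance threshold and the `min_fraction` cluster-size cut-off are all simplifying assumptions, chosen only to show how small clusters can be flagged as candidate anomalies.

```python
import math

def leader_cluster(points, threshold):
    """Single-pass leader clustering: assign each point to the first
    cluster whose leader lies within `threshold`, else start a new cluster."""
    leaders, clusters = [], []
    for p in points:
        for i, ld in enumerate(leaders):
            if math.dist(p, ld) <= threshold:
                clusters[i].append(p)
                break
        else:
            leaders.append(p)
            clusters.append([p])
    return clusters

def flag_anomalous(clusters, min_fraction):
    """Flag clusters holding less than `min_fraction` of all points as
    anomalous, following the sparse-region assumption for attacks."""
    total = sum(len(c) for c in clusters)
    return [len(c) / total < min_fraction for c in clusters]

# Dense "normal" traffic near the origin, two isolated "attack" points.
data = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.1, 0.1),
        (0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
clusters = leader_cluster(data, threshold=1.0)
flags = flag_anomalous(clusters, min_fraction=0.2)
```

In a real setting, the clusters not flagged would feed the normal-behaviour model used for anomaly detection, while the flagged clusters could be inspected or used to enrich a misuse detection signature base, as described above.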
This section has described mechanisms for detecting attacks against a
computer network. The next section is a discussion of computer attacks.
2.1.4 Computer attacks
A computer attack is any activity that aims at compromising the conﬁden-
tiality, the integrity or the availability of a computer system. Compromising
the conﬁdentiality consists in gaining unauthorized access to resources and
services on the computer system. Compromising the integrity consists in
unauthorized modiﬁcation of information on the computer system. Finally,
compromising the availability of the computer system makes the computer
system unavailable to legal users. These attacks can be performed at the
physical level, by damaging computer hardware, or they can be performed
at a software level. It is attacks performed at the software level that we refer to when using the term computer attacks in this thesis.
The computer attacks, considered in this thesis, fall into four main cat-
egories: probe attacks, denial of service (DOS) attacks, user to root (U2R)
attacks and remote to local (R2L) attacks. Probe attacks are attacks that
probe computers or computer networks in order to detect the services that
are available on the computer system. This information can then be used to
attack the computer system in a speciﬁc way. Denial of service attacks are
attacks that aim to make the computer systems unavailable for legal users.
This is done, for instance, by keeping computers busy dealing with tasks
submitted by the attacker. User to root attacks aim at gaining unauthorized
access to system resources. The attacker tries to obtain root privileges in
order to perform malicious activities. Examples of such attacks are buﬀer
overﬂow attacks. In buﬀer overﬂow attacks, the attacker gets root-privileges
by overwriting memory locations containing security sensitive information.
In remote to local attacks, the attacker exploits misconﬁgurations or weak-
nesses on a server host to gain remote access to the computer system with
the same level of privileges as an authorized user. For example, exploiting
the misconﬁguration on a FTP-server could make it possible for the attacker
to remotely add ﬁles to the FTP-server.
Attackers perform computer attacks for intellectual, economical or politi-
cal reasons or just for fun. Computer attacks performed for economic reasons
are a growing problem. According to , IT criminality was more profitable than drug trading in 2005. Two examples of economic IT criminality that are on the rise are blackmailing organizations and phishing. In phishing, the
attacker sends emails to the victims. In these emails, he presents himself as
from an organization the victim knows and trusts, for example the victim’s
bank. The goal of the attacker is to collect the victim’s bank account in-
formation and misuse it. Blackmailing an organization consists in launching attacks against the organization if that organization refuses to satisfy the attacker's request. Attacks against computer systems are possible because of:
• social engineering: Legal users of the computer systems can delib-
erately lend their password to unauthorized users. Most legal users
have diﬃculty in following strict security policies. This can result in
passwords being made available to attackers.
• misuse of features: The denial of service attack named smurf is an
example of the misuse of features. This attack is based on misusing the
ping tool. The normal purpose of the ping tool is to make it possible for
one host to test if it has a connection to another host. Smurf abuses
this facility; an attacker makes a false ping-request to a large number
of hosts simultaneously on behalf of the victim host. As a consequence,
all the receivers of the ping-request will send a response back to the
victim host. This large volume of traﬃc will eventually put the victim
host out of normal function.
• misconﬁguration of computer systems: Correct conﬁguration of com-
puter systems is not easy. Generally, there is a large number of param-
eter values to select from. An example of a computer attack that takes
advantage of a misconﬁgured computer system is the ftp write attack.
This attack exploits a misconﬁguration concerning write privileges of
an anonymous account on a FTP-server. This misconﬁguration can
lead to a situation where any FTP-user can add an arbitrary file to the server.
• ﬂaws in software implementation: As software gets more and more
complex, the chance that ﬂaws exist in software also increases. Ac-
cording to the statistics of CERT , the number of reported vulnera-
bilities in widely used software has increased from 171 in 1995 to 1090
in 2000 and to 5990 in 2005. The buﬀer overﬂow attack is an example of
an attack that exploits ﬂaws in software implementation. This attack
works by overﬂowing the input buﬀer in order to overwrite memory
locations that contain security relevant information. This is possible
because some software fails to check the size of the inputs entered by the user.
• usurpation or masquerade: The attacker steals the identity of a legal
user. The attacker can also steal a TCP-connection successfully established by a legal user and then act as if he were that legal user.
It is practically impossible to protect a computer network totally from
all these vulnerability factors. Therefore computer networks will always be
vulnerable to some forms of attack. A short description of the computer
attacks considered in this thesis is found in appendix C.  provides a
complete description of these attacks.
2.2 Introduction to clustering
Clustering, also known as cluster analysis, is used in scientiﬁc disciplines such
as psychology, biology, machine learning, data mining and statistics.
The term clustering was invented in the thirties in psychology. However,
numerical taxonomy in biology and pattern recognition in machine learning
have played an important role in the development of the concept of clustering
in the sixties.
2.2.1 Notation and deﬁnitions
• Notation: |A|: Given a set A, the notation |A| refers to the size of A.
• Definition: Partition of a set: Let S be a set and {S_i, i ∈ {1, ..., N}} be N non-empty subsets of S.
The family of subsets {S_i, i ∈ {1, ..., N}} is a partition of the set S if and only if:
∀(i, j) ∈ {1, ..., N} × {1, ..., N} with i ≠ j, S_i ∩ S_j = ∅, and ⋃_{i=1}^{N} S_i = S.
• Note: In this thesis, the terms data points, data patterns, data items
and data instances refer all to the instances of a data set.
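The two conditions of the partition definition are easy to verify mechanically. The following Python sketch (illustrative only; `is_partition` is not a function from the thesis) checks that the subsets are non-empty, pairwise disjoint, and that their union is S:

```python
def is_partition(subsets, s):
    """Check the partition conditions: non-empty subsets, pairwise
    disjointness, and union equal to the whole set s."""
    if any(not sub for sub in subsets):
        return False
    union = set()
    for sub in subsets:
        if union & sub:          # overlaps an earlier subset
            return False
        union |= sub
    return union == s

S = {1, 2, 3, 4, 5}
print(is_partition([{1, 2}, {3}, {4, 5}], S))    # True: disjoint, covers S
print(is_partition([{1, 2}, {2, 3}, {4, 5}], S)) # False: 2 appears twice
```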
2.2.2 The clustering problem
Clustering is the process of grouping data into clusters, so that pairs of data
in the same clusters have a higher similarity than pairs of data from diﬀerent
clusters. It provides a tool for exploring the structure of the data. Formally,
the clustering problem can be expressed as follows:
The clustering problem
Given a data set D = {v_1, ..., v_n} of tuples and a similarity measure S : D × D → R, the clustering problem is defined as the mapping of each v_i ∈ D to some class L. The mapping is performed under the constraint that: ∀ v_i, v_j ∈ L and v_m ∉ L, S(v_i, v_j) > S(v_i, v_m).
Another problem, related to the clustering problem, is the classification problem. The difference between these two problems is that in classification
the class labels are known a priori and the goal of the classiﬁcation is to
assign instances to the class they belong to. In clustering, on the other hand,
no a priori class structure is known. The goal of the clustering is then to
deﬁne the class structure, that is how many categories the data set contains,
and to assign instances to a category in a meaningful way.
Clustering can be performed in various ways, depending on how the sim-
ilarity between pairs of data items is deﬁned. In the next chapter, diﬀerent
methods for performing clustering will be discussed.
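The constraint in the formal statement of the clustering problem can likewise be checked on a toy example. In this hedged sketch, similarity is taken to be negative Euclidean distance (one possible choice, not prescribed by the thesis), and the hypothetical `satisfies_constraint` tests whether every intra-cluster pair is more similar than every inter-cluster pair:

```python
import math

def satisfies_constraint(clusters, sim):
    """Check the clustering constraint: every pair inside a cluster is
    more similar than any pair drawn from two different clusters."""
    intra = [sim(a, b) for c in clusters for a in c for b in c if a != b]
    inter = [sim(a, b) for i, c in enumerate(clusters) for a in c
             for d in clusters[i + 1:] for b in d]
    return min(intra) > max(inter)

sim = lambda a, b: -math.dist(a, b)   # similarity as negative distance
good = [[(0, 0), (0, 1)], [(9, 9), (9, 8)]]   # tight, well-separated groups
bad = [[(0, 0), (9, 9)], [(0, 1), (9, 8)]]    # groups mix the two regions
print(satisfies_constraint(good, sim))  # True
print(satisfies_constraint(bad, sim))   # False
```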
2.2.3 The clustering process
The main steps in clustering are: feature selection, the choice of clustering
algorithm, and the validation of the clustering results.
2.2.4 Feature selection
Feature selection aims at selecting an optimal subset of relevant features for
representing the data. The deﬁnition of an optimal subset of features depends
on the speciﬁc application at hand. An optimal subset may be deﬁned as
a subset that provides the best classiﬁcation accuracy. The classiﬁcation
accuracy measures the proportion of items that are correctly classified in a classification task. In the context of anomaly detection, we are interested in a feature set that efficiently discriminates normal data patterns from
attack data patterns.
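As a concrete illustration of judging a feature by the classification accuracy it yields, the sketch below scores a single numeric feature with a crude threshold rule. The feature name, threshold and labels are invented for the example; a real feature-selection procedure would search over feature subsets and use a proper classifier.

```python
def single_feature_accuracy(values, labels, threshold):
    """Accuracy of labelling an item 'attack' whenever one feature
    exceeds a threshold -- a crude per-feature relevance score."""
    correct = sum((v > threshold) == (lab == "attack")
                  for v, lab in zip(values, labels))
    return correct / len(values)

# Hypothetical feature: connections per second; flooding attacks are high.
conn_rate = [2, 3, 1, 250, 300, 4]
labels = ["normal", "normal", "normal", "attack", "attack", "normal"]
score = single_feature_accuracy(conn_rate, labels, threshold=100)
```

A feature that discriminates well between normal and attack patterns scores close to 1; an uninformative feature scores near the base rate of the majority class.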
2.2.5 Choice of clustering algorithm
Finding the optimal set of clusters, the one that maximizes the intra-cluster similarity and minimizes the inter-cluster similarity, is an NP-hard problem because all the possible partitions of the data set would need to be examined. Generally, we want a clustering algorithm that can provide an acceptable solution, not necessarily the optimal one.
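The combinatorial explosion behind this hardness is easy to make concrete. The number of ways to partition n items into k non-empty clusters is the Stirling number of the second kind, S(n, k) = k·S(n−1, k) + S(n−1, k−1); the short sketch below (not from the thesis) shows how quickly it grows:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """S(n, k): number of partitions of n items into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(4, 2))    # 7 partitions of 4 items into 2 clusters
print(stirling2(10, 3))   # 9330 -- already large for just 10 items
print(stirling2(25, 4))   # roughly 4.7e13: exhaustive search is hopeless
```

Even for a modest data set, enumerating every partition is out of the question, which is why practical algorithms settle for good local optima of an objective function.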
A clustering algorithm is mainly characterized by the type of similarity mea-
sure it uses and by how it proceeds in ﬁnding clusters. Many clustering algo-
rithms approach the clustering problem as an optimisation problem. These
algorithms find clusters by optimising a specified function, called the objective function. For this class of algorithms, the objective function is also a main
characteristic of the algorithm. Similarity measures and objective functions
will be discussed below.
The definition of the similarity between data items depends on the type of the data. Two main types of data exist: continuous data and categorical data. Examples of similarity measures for each of these types of data will be presented in the following.
Distance measures in continuous data: For continuous data, distance measures are used for quantifying the degree of similarity or dissimilarity of two data instances. The lower the distance between two instances, the more similar they are; the higher the distance, the more dissimilar they are. A distance measure is a non-negative function δ : D × D → R with the following properties:

δ(x, y) = 0 ⇐⇒ x = y, ∀x, y ∈ D (2.1)
δ(x, y) = δ(y, x), ∀x, y ∈ D (2.2)
δ(x, y) ≤ δ(x, z) + δ(z, y), ∀x, y, z ∈ D (2.3)

Here are some examples of distance measures:

• Minkowski distance: d(x, y) = (Σ_{i=1}^{m} |x_i − y_i|^p)^{1/p}, p > 0. If p = 1, it is the Manhattan (city-block) distance; if p = 2, it is the euclidean distance.

• Tchebyschev distance: d(x, y) = max_{1≤i≤m} |x_i − y_i|, where m is the number of features.
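As an illustration, both distance measures can be computed directly from their definitions. This is a small sketch, not part of the thesis implementation; the example vectors are invented.

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Tchebyschev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 3.0), (4.0, 0.0)
print(minkowski(x, y, 1))  # p = 1 (city-block): 7.0
print(minkowski(x, y, 2))  # p = 2 (euclidean): 5.0
print(chebyshev(x, y))     # 4.0
```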
Similarity measures in categorical data: Given a data set D, an index of similarity is a function S : D × D → [0, 1] satisfying the following properties:

S(x, x) = 1, ∀x ∈ D (2.4)
S(x, y) = S(y, x), ∀x, y ∈ D (2.5)
Similarity indices can, in principle, be used on arbitrary data types. However, they are generally used for measuring similarity in categorical data. (Sometimes binary data, which is essentially categorical data with two categories, is considered as a separate type. In this thesis, no distinction is made between categorical data and binary data.) They are seldom applied to continuous data because distance measures are more suitable for continuous data than similarity indices are. Different similarity indices for binary or categorical data are found in the literature. Here are three examples of similarity indices. In the following expressions, a is the number of positive matches, d is the number of negative matches, and b and c are the numbers of mismatches between two instances A and B.

• The matching coefficient: (a + d) / (a + b + c + d)

• The Russel and Rao measure of similarity: a / (a + b + c + d)

• The Jaccard index: a / (a + b + c)
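The three indices follow directly from the match counts a, b, c and d. The following sketch, with hypothetical example vectors, is only illustrative:

```python
def binary_counts(A, B):
    """Count positive matches a, negative matches d, and the two kinds
    of mismatches b, c between two binary vectors of equal length."""
    a = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)
    d = sum(1 for x, y in zip(A, B) if x == 0 and y == 0)
    b = sum(1 for x, y in zip(A, B) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(A, B) if x == 0 and y == 1)
    return a, b, c, d

def matching(A, B):
    a, b, c, d = binary_counts(A, B)
    return (a + d) / (a + b + c + d)

def russel_rao(A, B):
    a, b, c, d = binary_counts(A, B)
    return a / (a + b + c + d)

def jaccard(A, B):
    a, b, c, d = binary_counts(A, B)
    return a / (a + b + c)

A = [1, 1, 0, 0, 1]
B = [1, 0, 0, 1, 1]
print(matching(A, B))    # (2 + 1) / 5 = 0.6
print(russel_rao(A, B))  # 2 / 5 = 0.4
print(jaccard(A, B))     # 2 / 4 = 0.5
```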
The choice of similarity measure depends on the type of data at hand: categorical or continuous. Depending on the intent of the investigator, continuous data can be converted to binary data by fixing some thresholds. Alternatively, categorical data can be converted to continuous data. As the feature set is selected to provide an optimal description of the data, converting from one data type to another may result in a loss of information about the data. This will affect the quality of the analysis being conducted on the data set. The method of analysis to be conducted on the data also influences the choice of similarity measure. For example, euclidean distance is appropriate for methods that are easily explained geometrically.
Objective functions are used by clustering methods that approach the clustering problem as an optimization problem. An objective function defines the criterion to be optimised by a clustering algorithm in order to obtain an optimal clustering of the data set. Different objective functions are found in the clustering literature. Each of them is based on implicit or explicit assumptions about the data set. A good choice of objective function helps reveal a meaningful structure in the data set. The most widely used objective function is the sum of squared-errors. Given a data set D = {x_1, ..., x_N} and a partition P = {C_1, ..., C_K}, the sum of squared-errors of P is:

SSE(P) = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − µ_k||²,

where µ_k = (1/|C_k|) Σ_{x ∈ C_k} x is the mean of cluster C_k and |C_k| is the size of cluster C_k. The popularity of the sum of squared-errors objective function is partly related to its simplicity.
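The definition above translates directly into code. This sketch assumes the data items are numeric vectors already grouped into clusters and is purely illustrative:

```python
def sse(partition):
    """Sum of squared-errors of a partition; each cluster is a list of
    equal-length numeric vectors."""
    total = 0.0
    for cluster in partition:
        dim = len(cluster[0])
        # cluster mean (centre)
        mu = [sum(x[i] for x in cluster) / len(cluster) for i in range(dim)]
        # squared euclidean distance of every member to the centre
        total += sum(sum((x[i] - mu[i]) ** 2 for i in range(dim))
                     for x in cluster)
    return total

P = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
print(sse(P))  # means are (1, 0) and (5, 5); SSE = 1 + 1 + 0 = 2.0
```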
2.2.6 Cluster validity
The assessment of the quality of clustering results is important. It helps in
identifying meaningful partitioning of the data set. This assessment is im-
portant because the data set can be partitioned in diﬀerent ways. Generally,
the same clustering algorithm, executed with diﬀerent initial values, will pro-
duce diﬀerent partitions of the data set. Some of these partitions are more
meaningful than others. What is considered as a meaningful partitioning is
application speciﬁc. It depends on the kind of information or structure the
investigator is looking for.
Cluster validity can be performed at diﬀerent levels: hierarchical, individ-
ual, and partition levels. The validity of the hierarchical structure of clusters
is only relevant for hierarchical clustering. Hierarchical clustering creates a
hierarchy of clusters. The study of the validity of the hierarchical structure
aims at judging the quality of that hierarchical structure. The validity of individual clusters measures the compactness and the isolation of a cluster. A good cluster is expected to be compact and well separated from the other clusters. The validity of the partition structure evaluates the quality of the partition produced by a clustering algorithm. For example, it may be used to determine whether the correct number of clusters has been found or whether the clusters found by the algorithm match an a priori partitioning of the data.
In this thesis, only the validity of the partition’s structure is considered
because we evaluate the clustering algorithms against an a priori partition of
the data set. So in the rest of this thesis, when we refer to cluster validity,
we mean validity of partition’s structure. The assessment of the partition’s
structure can be performed at different levels: external, internal and relative validity.
External validity: In external validity, the partition produced by a
clustering algorithm is compared with an a priori partition of the data set.
Some of the most common examples of external indices found in the cluster literature are the Jaccard and Rand indices. These indices quantify the degree
of agreement between a partition produced by a cluster algorithm and an a
priori partition of the data set.
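Both external indices can be computed by counting, for every pair of items, whether the two partitions agree on placing the pair in the same cluster. A small sketch, with label vectors invented for the example:

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """For every pair of items, record whether each labeling puts the
    pair in the same cluster: ss = same in both, dd = different in both,
    sd and ds = the two kinds of disagreement."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(labels_a, labels_b):
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(labels_a, labels_b):
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return ss / (ss + sd + ds)

found = [0, 0, 1, 1]   # labels produced by a clustering algorithm
truth = [0, 0, 0, 1]   # a priori labels
print(rand_index(found, truth))     # 0.5
print(jaccard_index(found, truth))  # 0.25
```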
Internal validity: Internal validity only makes use of the data involved in the clustering to assess the quality of the clustering result. An example of such data is the proximity matrix: an N × N matrix whose entry (i, j) represents the similarity between data patterns i and j.
Relative validity: The purpose of relative validity is to evaluate the partition produced by a clustering algorithm by comparing it with other partitions produced by the same algorithm, initialised with different parameters.
External validity is independent of the clustering algorithms used. It is
therefore appropriate for the comparison of diﬀerent clustering algorithms.
Cluster validation by visualization:
This cluster validation is carried out by evaluating the quality of the clustering result with the human eye. This requires an appropriate representation of the clusters so that they are easy to visualize. This approach is impractical for large data sets and for high-dimensional data. It only works in 2 or 3 dimensions because the human eye cannot visualize higher dimensions. For visualizing high-dimensional data, the dimension of the data has to be reduced to 2 or 3. SOM, which is one clustering algorithm we will
study later, is often used as a tool for reducing the dimensions of the data
for visualization. Cluster validation by visualization will not be considered
in this thesis. There are two reasons for this: first, the size of the data set is large and the dimension of the data is high; second, the visualization cannot be quantified. We need to be able to quantify the
quality of the partitions in order to compare the algorithms on this basis.
2.2.7 Clustering tendency
Clustering tendency evaluates whether the data set is suitable for clustering.
It determines whether the data set contains any structure. This study should
be performed before using clustering as a tool for exploring the structure of
the data. Despite its importance, this step is most often omitted, probably because it is time consuming. An example of an algorithm for studying the presence or absence of structure in the data set, and one that also identifies the optimal number of clusters in the data, is the model explorer algorithm, presented by Ben-Hur et al.
Here is the description of the model explorer algorithm:
1. Choose the number of clusters K, the number of sub-samples L, the similarity measure between two partitions, and the proportion α of the data set to be sampled (without replacement).

2. Generate two sub-samples s and t of the data set, each of size α × (the size of the data set).

3. Cluster both sub-samples using the same clustering algorithm.

4. Compute the similarity between the two partitions. Only elements common to s and t are involved in this computation.

5. Repeat steps 2 to 4 L times.
The model explorer algorithm is based on the following assumption: if the data set has a structure, this structure will remain stable under small perturbations of the data set, such as removing or adding values. So the model explorer algorithm gives an indication of the presence or the absence of structure in the data. In the case of the presence of structure, the model explorer algorithm finds the optimal number of clusters in the data. The main problem with the model explorer algorithm is that it is computationally expensive.
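The steps above can be sketched as follows. The clustering algorithm and the partition-similarity measure are left as caller-supplied functions, since the algorithm does not fix them; this is an illustrative sketch for a single value of K, and it omits the decision rule for choosing the optimal K.

```python
import random

def model_explorer(data, k, L, alpha, cluster_fn, partition_similarity):
    """Sketch of the model explorer algorithm.

    cluster_fn(points, k) -> one label per point;
    partition_similarity(labels_s, labels_t) -> similarity in [0, 1],
    computed on the items common to both sub-samples.
    Both functions are assumed to be supplied by the caller."""
    m = int(alpha * len(data))
    similarities = []
    for _ in range(L):
        # two sub-samples of size alpha * |data|, drawn without replacement
        s = random.sample(range(len(data)), m)
        t = random.sample(range(len(data)), m)
        labels_s = cluster_fn([data[i] for i in s], k)
        labels_t = cluster_fn([data[i] for i in t], k)
        common = sorted(set(s) & set(t))      # only shared elements count
        ls = [labels_s[s.index(i)] for i in common]
        lt = [labels_t[t.index(i)] for i in common]
        similarities.append(partition_similarity(ls, lt))
    return similarities  # stable structure: values concentrated near 1
```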
2.2.8 Clustering of network traﬃc data
The eﬃciency of the clustering algorithms depends on the nature of the data.
Some of the main diﬃculties in clustering network traﬃc data are:
• the size of the data is large,
• the dimension of the data is high,
• the distribution of the class is skewed,
• the data is a mixture of categorical and continuous data,
• the data needs to be pre-processed.
2.3 Summary
In this chapter, aspects of network security and clustering relevant for the
rest of the thesis have been introduced. Network intrusion detection has been
brieﬂy presented. Because of the sophistication of network attack techniques
and the weaknesses in attack prevention mechanisms, network intrusion de-
tection systems are important for ensuring the security of computer networks.
The clustering problem has been deﬁned and steps of the clustering process
have been presented. The main steps of the clustering process are: feature
selection, choice of clustering algorithms and cluster validity. In the next
chapter, clustering methods will be discussed more deeply.
3 Clustering methods and algorithms
In this chapter, diﬀerent clustering methods will be discussed. For each of
the methods, examples of clustering algorithms will be presented.
• Section 3.1 discusses hierarchical clustering.
• Section 3.2 discusses partitioning methods. It is one of the most impor-
tant sections of this chapter as the discussion of partitioning clustering
provides the basis for the implementation of the algorithms used for
the experiments. The main classes of clustering methods are: squared-
error clustering, model-based clustering, density-based clustering and
grid-based clustering. Online clustering and fuzzy clustering methods
are also discussed. The main concepts of the algorithms not used for
the experiments will be presented while the algorithms that are part of
the experiments will be discussed in more detail.
• Section 3.3 compares the clustering methods and algorithms theoretically.

• Section 3.4 studies how to combine clustering methods. In this section, we propose a clustering technique appropriate to the clustering of network traffic data.
A clustering method deﬁnes the general strategy for grouping the data
instances into clusters. It speciﬁes for example the objective criterion. It also
deﬁnes the basic theory or concept the clustering is based on. A clustering
Element a b c d e f
a 0 1 1 3 2 5
b 1 0 2 2 1 4
c 1 2 0 3 2 5
d 3 2 3 0 1 4
e 2 1 2 1 0 3
f 5 4 5 4 3 0
Table 3.1: Example of distance matrix used for hierarchical clustering
algorithm, on the other hand, is a particular implementation of a clustering
method. For example, the clustering method deﬁned by the sum of squared-
errors objective function can be implemented in diﬀerent ways. An example
of such implementation is the kmeans algorithm.
Clustering methods can be categorized in diﬀerent ways. At a higher
level, one can distinguish between two main clustering strategies: hierarchical
methods and partitioning methods. Hierarchical clustering organizes the
data instances into a tree of clusters. Each hierarchical level of the tree
corresponds to a partition of the data set. Partitioning methods, on the
other hand, create a single partition of the data set. Both categories will be
discussed in the following sections.
3.1 Hierarchical clustering methods
As mentioned earlier, hierarchical clustering methods organize the data in-
stances into a hierarchy of clusters. This organization follows a tree structure
known as a dendrogram. The root of the dendrogram represents the entire
data set. The clusters located at the leaves contain exactly one data instance.
Figure 3.1 shows an example of a dendrogram corresponding to the dis-
tance matrix in table 3.1. Cutting the dendrogram at each level of the tree
hierarchy gives a diﬀerent partition of the data set. Hierarchical clustering
methods can be divided into two main categories: hierarchical agglomerative clustering (HAC) and hierarchical divisive clustering (HDC).
Hierarchical agglomerative clustering constructs clusters by moving step by step from the leaves to the root of the dendrogram. HAC starts with clusters consisting of a single element and iteratively merges them to form the clusters of the next level of the tree hierarchy. This process continues
Figure 3.1: A dendrogram corresponding to the distance matrix in table 3.1
until the entire data set falls into a single cluster. At this point the root of
the dendrogram is known.
Hierarchical divisive clustering uses the dendrogram from the root to the leaves. HDC starts with a single cluster representing the entire data set. Then it proceeds by iteratively dividing large clusters at the current level i into smaller clusters at level i + 1. This process stops when each of the current clusters consists of a single element. At this point the leaves of the dendrogram are known.
The following are the main steps by which HAC organizes the data in-
stances into a hierarchy of clusters. How HDC proceeds can trivially be
deduced from the steps of HAC.
1. Compute the distances between all pairs of items and store them in a distance matrix.
2. Identify and merge the two most similar clusters.
3. Update the distance matrix by computing the distance between the
new cluster and all the others clusters.
4. Repeat steps 2 and 3 until the desired number of clusters is obtained or until all the items fall in a single cluster.
In order to merge clusters, the distance between pairs of clusters needs to be
computed. Below are some examples of inter-cluster distances.
Inter-cluster distances for hierarchical clustering

Four distances are frequently used for measuring the similarity of two clusters in hierarchical clustering. Let C_i and C_j be two clusters; the four inter-cluster distances are:

• The maximum distance between C_i and C_j:
dist_max(C_i, C_j) = max {dist(p, q) : p ∈ C_i, q ∈ C_j}

• The minimum distance between C_i and C_j:
dist_min(C_i, C_j) = min {dist(p, q) : p ∈ C_i, q ∈ C_j}

• The average distance dist_avg(C_i, C_j): the average of the distances of all pairs of elements (p, q) with p ∈ C_i and q ∈ C_j.

• The distance between the mean µ_i of C_i and the mean µ_j of C_j:
dist_mean(C_i, C_j) = dist(µ_i, µ_j)

In these expressions, dist is the distance measure used between pairs of elements. Generally, the euclidean distance is used.

The difference between these distances has been illustrated in the literature: if the clusters are compact and non-overlapping, the four distances give similar results, but if the clusters overlap or are not hyperspherical in shape, they give results that differ significantly. The dist_mean measure is less computationally expensive than the other three distance measures because it does not compute the distance between all pairs of instances of the two clusters. These inter-cluster distance measures correspond to different strategies for merging clusters. When dist_min is used, the algorithm is known as the nearest-neighbor algorithm, and when dist_max is used, the algorithm is called the farthest-neighbor algorithm.
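As an illustration of HAC with the minimum (nearest-neighbour) inter-cluster distance, the following sketch merges the closest pair of clusters until the desired number of clusters remains, using the distance matrix of table 3.1. It recomputes single-link distances naively, so it is illustrative rather than efficient:

```python
def single_link_hac(dist, n_clusters):
    """Agglomerative clustering with the minimum (nearest-neighbour)
    inter-cluster distance; dist is a symmetric matrix (list of lists)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

# distance matrix from table 3.1, elements a..f mapped to indices 0..5
D = [[0, 1, 1, 3, 2, 5],
     [1, 0, 2, 2, 1, 4],
     [1, 2, 0, 3, 2, 5],
     [3, 2, 3, 0, 1, 4],
     [2, 1, 2, 1, 0, 3],
     [5, 4, 5, 4, 3, 0]]
print(single_link_hac(D, 2))  # a..e merge together; f remains on its own
```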
The problem with hierarchical clustering is that it is computationally expensive both in time and space, because the distances between all pairs of instances of the data set need to be computed and stored. The time complexity of HAC is at least O(N² log N), where N is the size of the data set. This is because there are at least log N levels in the dendrogram and each of them requires O(N²) for creating a partition. Because of its high computation time, hierarchical clustering is not suitable for clustering large data sets.

Hierarchical clustering algorithms do not aim at maximizing a global objective function. At each step of the clustering process, they make local decisions in order to find the best way of clustering the data.
In this section, hierarchical clustering has been briefly discussed. Hierarchical clustering is impractical for large data sets. The next section is about partitioning clustering methods.
3.2 Partitioning clustering methods
Partitioning clustering methods, as opposed to hierarchical clustering meth-
ods create a single partition of the data set. The main categories of parti-
tioning clustering methods are described in the following.
3.2.1 Squared-error clustering
The objective of squared-error clustering is to find the partition of the data set with the minimal sum of squared-errors. The squared-error of a cluster is defined as the sum of the squared euclidean distances of the cluster members to the cluster's centre. The sum of squared-errors of a partition P = {C_1, ..., C_K} is defined as the sum of the squared-errors of all the clusters. In other words:

SSE(P) = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x − µ_k||²,

where µ_k is the mean of cluster C_k.

The general form of a squared-error clustering algorithm is: given a data set D,

1. Initialisation:

(a) Specify the number of clusters and assign arbitrarily each instance of the data set to a cluster.

(b) Compute the centre of each cluster.
2. Iterations: Repeat steps 2a, 2b and 2c until the difference between the squared-errors of two consecutive iterations is below a specified threshold.

(a) Assign each instance of D to the cluster whose centre it is closest to.

(b) Compute the new centre of each cluster.

(c) Compute the squared-error.
Why does the squared-error clustering algorithm converge?

Sum of squared-errors clustering is an example of an optimisation algorithm based on local iterative optimisation steps. Here follows the description of a local search algorithm and a proof of its convergence.

Local search algorithm: Let P be a finite set of possible solutions (in partitioning clustering, P is the set of all partitions), and let f : P → R be the function to be minimized (in sum of squared-errors clustering, f is the sum of squared-errors). The algorithm starts from an initial solution x_0 ∈ P. It then finds a minimizer x_1 ∈ P of f in a neighbourhood of x_0. If x_1 ≠ x_0, a minimizer x_2 ∈ P of f is found in a neighbourhood of x_1. A sequence of solutions x_0, x_1, ..., x_t ∈ P is constructed in this way. The iterations stop when x_{t+1} gets very close to x_t.

Proof: It is clear that f(x_0) ≥ f(x_1) ≥ ... ≥ f(x_t), and the stopping criterion is satisfied at the point where f(x_{t+1}) = f(x_t). This means that the inequalities that hold before the stopping criterion is met are all strict, so the algorithm progresses at every step. It stops at some point in time because P is a finite set. The convergence is local and not optimal because the algorithm performs locally; only a subset of the solution space is investigated.
More precisely, squared-error clustering is based on a version of a local search algorithm called alternating minimization. Alternating minimization is appropriate in situations where:

• the variables of the function to be optimised fall in two or more groups,

• and optimising the function while keeping some of the variables constant is easier than doing the optimisation with all the variables at a time.

The alternating minimization proceeds in the following way: let (c, sse) be the two groups of variables. In the case of squared-error clustering, these are respectively the centres of the clusters and the sum of squared-errors. At each iteration t, the minimization occurs by keeping one group constant: c_t is found as the value of c that minimizes the function f(c, sse_{t−1}), and the value of sse_t is then the value of sse that minimizes f(c_t, sse).
The main strengths of squared-error clustering are its simplicity and efficiency. Some of its limitations are:

• The sum of squared-errors criterion is only appropriate in situations where the clusters are compact and non-overlapping.

• The partition with the lowest sum of squared-errors is not always the one that reveals the true structure of the data. Sometimes a partition consisting of large clusters has a smaller sum of squared-errors than a partition that reflects the true structure of the data. This situation often occurs when the data contains outliers.
One of the most popular examples of squared-error clustering is the kmeans algorithm. Kmeans is an iterative clustering algorithm which moves items among clusters until a specified convergence criterion is met. Convergence is reached when only very small changes are observed between two consecutive iterations. The convergence criterion can be expressed in terms of the sum of squared-errors, but it does not need to be.
Input: a data set D of size N and the number of clusters K.
Output: a set of K clusters with minimal sum of squared-errors.

1. Randomly choose K instances from D as the initial cluster centres. Repeat steps 2 and 3 until no change occurs.

2. Assign each instance to the cluster whose centre the instance is closest to.

3. Recompute the cluster centres. The centre of cluster C_k is given by µ_k = (1/|C_k|) Σ_{x ∈ C_k} x, where |C_k| is the size of C_k.
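The algorithm above can be sketched as follows, assuming the data instances are numeric tuples and using squared euclidean distance for the assignment step. This is an illustrative implementation, not the one used in the experiments:

```python
import random

def kmeans(data, k, max_iter=100):
    """Basic kmeans over a list of equal-length numeric tuples."""
    centres = random.sample(data, k)          # step 1: initial centres
    for _ in range(max_iter):
        # step 2: assign each instance to the closest centre
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(x, centres[c])))
            clusters[i].append(x)
        # step 3: recompute each centre as the mean of its cluster
        new_centres = [tuple(sum(col) / len(cl) for col in zip(*cl))
                       if cl else centres[i]
                       for i, cl in enumerate(clusters)]
        if new_centres == centres:            # no change: convergence
            break
        centres = new_centres
    return centres, clusters

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, clusters = kmeans(data, 2)
# for this data the centres converge to (0.0, 0.5) and (10.0, 10.5)
```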
Kmeans is simple and efficient. Because of these qualities, kmeans is widely used as a clustering tool.

The problems associated with kmeans are mainly those common to squared-error clustering: the clustering result is not optimal and the sum of squared-errors is not always a good indicator of the quality of the clustering. The number of clusters needs to be specified by the user, and the quality of the clustering depends on the initial values. Kmeans is appropriate when the clusters are compact, well separated, spherical and approximately of similar size. The algorithm does not explicitly handle outliers, and the presence of outliers can degrade the quality of the clustering. The time complexity of kmeans is O(I ∗ K ∗ N), where N is the size of the data set, I is the number of iterations and K is the number of clusters. Generally, the maximum number of iterations is specified in advance; in that case, the time complexity is O(K ∗ N). Figure 3.2 illustrates how the sum of squared-errors varies during the iterations of the kmeans algorithm.
Figure 3.2 shows that the sum of squared-errors decreases very slowly after the 10th iteration. This indicates that convergence of the kmeans algorithm is reached around the 10th iteration.
3.2.2 Model-based clustering
Model-based clustering methods assume that the data set has an underly-
ing mathematical model, and they aim at uncovering this unknown model.
Generally, the model is specified in advance and what remains is the computation of its parameters. Two main classes of model-based clustering exist. The first class is based on a probabilistic approach and the second on an artificial neural network approach.
• Probabilistic clustering
Figure 3.2: Variation of the sum of squared-errors in kmeans
In the probabilistic approach, the mixture of gaussians is often used to
describe the underlying model of the data set. The model parameters
can be learned by two diﬀerent approaches: the mixture likelihood ap-
proach and the classiﬁcation likelihood approach. The main diﬀerence
between these two approaches is that the former assumes overlapping
clusters while the latter does not. The expectation maximization (EM) algorithm is generally used for learning the model parameters under the mixture likelihood approach, and the classification EM algorithm is used for learning model parameters under the classification likelihood approach. An example of each of these approaches is presented in the following.
– Clustering under the mixture likelihood approach, or EM-based clustering

The maximum likelihood parameter estimation is at the heart of this clustering approach.
The maximum likelihood parameter estimation: Given a density function p(x|Θ), where Θ is a parameter set, and given a data set D = {x_1, ..., x_N}, the maximum-likelihood parameter estimation consists in finding the value Θ* of the parameter Θ that maximizes the likelihood function λ, defined as:

M(D|Θ) = Π_{n=1}^{N} p(x_n|Θ) = λ(Θ|D) (3.5)

For the purpose of identifying clusters in a data set, the density function used is a mixture of density functions. Each component of the mixture represents a cluster. The mixture of density functions is defined as:

∀x ∈ D, p(x|Θ) = Σ_{k=1}^{K} α_k p_k(x|Θ_k),

where Θ = (Θ_1, ..., Θ_K, α_1, ..., α_K) is a set of parameters, and p_k(x|Θ_k) and α_k are respectively the density function and the mixture proportion of the k-th component.
The maximum likelihood parameter estimation approach is based on two assumptions: for a specified value of the parameter Θ,

- the instances x_n of the data set D are statistically independent;
- the selection of instances from a mixture component is done independently of the other components.

An intuitive way of explaining the selection of each instance x_n under the mixture model is that it happens in two steps:

- firstly, a component k is selected with probability α_k;
- secondly, x_n is selected from the component k with the probability p_k(x_n|Θ_k).
For the experiments in this thesis, the model used is the mixture
of isotropic gaussians. This model is also known as the mixture
of spherical gaussians. In this model, each component of the mix-
ture is a spherical gaussian. The mixture of isotropic gaussians
has been chosen because of its simplicity, eﬃciency and scalability
to higher dimensions.
The EM algorithm is a general method, used for estimating the
parameters of the mixture model. It is an iterative procedure that
consists of two steps: the expectation step and the maximization step. The expectation step is commonly called the E-step and the
maximization step is called the M-step. The E-step estimates the
extent to which instances belong to clusters. The M-step computes
the new parameters of the model on the basis of the estimates of
the E-step. In the case of the mixture of isotropic gaussians, the
model parameters are the means, standard deviations and the
weights of the clusters. This step is called the maximization step because it finds the values of the parameters that maximize the likelihood. The E and M steps are repeated until convergence of the parameters is reached. Convergence is reached when the parameter values of two consecutive iterations get very close. At the end of the iterations, a partitioning of the data set is obtained by assigning each data instance to the cluster to which the instance has the highest membership degree. This way of assigning instances to clusters is called the maximum a posteriori (MAP) assignment. MAP assignment gives a crisp or hard clustering of the data set. A soft clustering, also called fuzzy clustering, can be obtained by using the cluster membership degrees computed in the E-step.
Algorithm: Learning a mixture of isotropic gaussians with the EM algorithm

Input: the data set D of size N and a set of parameters Θ for the mixture of gaussians: Θ = {α_k, µ_k, σ_k}, k = 1, ..., K.
Output: a partition of the data set into K clusters: {C_1, ..., C_K}.

1. Random initialisation of the parameter set Θ.

2. Repeat steps 3 and 4 until the log-likelihood function converges.

3. E-step: estimation of the posterior probabilities of the k-th cluster for each instance x_n:
P(k|x_n, Θ) = α_k p_k(x_n|Θ_k) / Σ_{j=1}^{K} α_j p_j(x_n|Θ_j)

4. M-step: re-estimation of the parameter set Θ of the model.

5. MAP assignment of instances to clusters.
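The algorithm above can be sketched as follows, assuming small data sets (no guard against numerical underflow or collapsing components is included). The optional `init` parameter for the initial means is an addition for reproducibility, not part of the algorithm as stated:

```python
import math
import random

def em_isotropic(data, k, iters=100, init=None):
    """EM for a mixture of k isotropic gaussians over equal-length tuples."""
    d, n = len(data[0]), len(data)
    # random initialisation of the parameter set (or caller-supplied means)
    mus = list(init) if init is not None else random.sample(data, k)
    sigmas = [1.0] * k
    alphas = [1.0 / k] * k

    def gamma(x, mu, sigma):
        # isotropic gaussian density in d dimensions
        sq = sum((a - b) ** 2 for a, b in zip(x, mu))
        return math.exp(-sq / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2) ** (d / 2)

    for _ in range(iters):
        # E-step: posterior membership degrees P(k | x_n, Theta)
        post = []
        for x in data:
            num = [alphas[j] * gamma(x, mus[j], sigmas[j]) for j in range(k)]
            s = sum(num)
            post.append([v / s for v in num])
        # M-step: re-estimate mixture proportions, means, standard deviations
        for j in range(k):
            w = sum(post[i][j] for i in range(n))
            alphas[j] = w / n
            mus[j] = tuple(sum(post[i][j] * data[i][t] for i in range(n)) / w
                           for t in range(d))
            var = sum(post[i][j] *
                      sum((a - b) ** 2 for a, b in zip(data[i], mus[j]))
                      for i in range(n)) / (d * w)
            sigmas[j] = math.sqrt(var)
    # MAP assignment: each instance joins its highest-membership cluster
    labels = [max(range(k), key=lambda j: post[i][j]) for i in range(n)]
    return alphas, mus, sigmas, labels
```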
In the rest of this section about EM-based clustering, we will ex-
plain how the expressions of the model parameters used in the
iterations of the EM algorithm are obtained.
How does the EM algorithm work?
This section constitutes preparatory remarks concerning the EM
algorithm. Estimating the parameters of the mixture model using the maximum likelihood approach can be difficult or easy depending on the expression of the likelihood function. In simple cases, the problem can be solved by computing the derivative of the likelihood function with respect to the model parameters. The values of the parameters that maximize the likelihood function are then found by setting the derivative to zero. In most cases, the problem is not easy, and special techniques such as the EM algorithm are needed to solve it. The EM algorithm can be explained in various ways. In the following, we interpret the EM approach as a lower bound maximization problem. This approach has been presented in the literature. In this approach, the maximization of the complex log-likelihood expression is replaced by the maximization of a simpler lower bound function.
Here follows a brief explanation of why maximizing the bound function helps maximize the log-likelihood function. One of the constraints the lower bound function must satisfy is that it must touch the log-likelihood function at the current estimate of the maximizer. Given that constraint and given two functions g and h, let y = arg max_x g(x). Suppose that g(x) ≤ h(x) ∀x, and that for some z, g(z) = h(z). Then, if g(y) > g(z), it follows that h(y) ≥ g(y) > g(z) = h(z), so maximizing g improves h as well. (Here, z is the current estimate of the maximizer of h and y is its new estimate.)
Computation of the model parameters

As mentioned earlier, the mixture model of interest is the mixture of isotropic gaussians (MIXIG), and its parameters are {α_k, µ_k, σ_k}, k = 1, ..., K. The parameters α_k, µ_k and σ_k are respectively the mixture proportion, the mean and the standard deviation of the k-th mixture component. In this section, the expressions used for computing the new estimates of the parameters at each iteration of the EM process will be derived.

In the E-step, the posterior probabilities for the t-th iteration are computed. The posterior probabilities express the membership degree of instances to clusters. The membership degree of instance x_n to the k-th cluster, given the current parameters Θ = (Θ_1, ..., Θ_K), is:

P(k|x_n, Θ) = α_k p_k(x_n|Θ_k) / Σ_{j=1}^{K} α_j p_j(x_n|Θ_j)

The denominator of this fraction ensures that the posteriors of each instance sum to 1 over the K clusters. The numerator expresses how the instance x_n is selected from the data set: first a cluster is chosen with probability α_k, and then x_n is selected from the chosen cluster according to the density function governing the selected cluster; this gives the value α_k p_k(x_n|Θ_k).
The density function for each of the clusters is an isotropic gaussian, and its expression is:

γ(x | Θ_k) = (2π σ_k²)^{−d/2} exp(−||x − μ_k||² / (2σ_k²)), where Θ_k = (μ_k, σ_k) and d is the dimension of the data space.

In the following, we find a lower bound function of the log-likelihood. Let us recall the likelihood function; it is given by:

λ(Θ | D) = Π_{n=1}^N h(x_n | Θ), where the function h is the gaussian mixture density function. By definition h(x | Θ) = Σ_{k=1}^K α_k γ(x | Θ_k), where the function γ is the isotropic gaussian. Putting it all together we get:

λ(Θ | D) = Π_{n=1}^N Σ_{k=1}^K α_k γ(x_n | Θ_k)

The logarithm of λ is easier to manipulate than λ. Because the function log λ varies the same way as the function λ does, maximizing log λ is the same as maximizing λ. The logarithm of λ, called the log-likelihood function, is:

δ(Θ | D) = log λ(Θ | D) = Σ_{n=1}^N log Σ_{k=1}^K α_k γ(x_n | Θ_k)

This expression is complex and difficult to maximize because of the logarithm of a sum it contains. Therefore, a lower bound function of this function will be found and maximized instead. In order to make the manipulation of symbols easier, some notations are here introduced: s(k, n) = α_k γ(x_n | Θ_k); that gives:

δ(Θ | D) = Σ_{n=1}^N log Σ_{k=1}^K s(k, n)

And using Jensen's inequality [appendix D.2], this gives:

δ(Θ | D) ≥ Σ_{n=1}^N Σ_{k=1}^K P(k | x_n, Θ^t) log( s(k, n) / P(k | x_n, Θ^t) ) = B(Θ)

By rewriting B(Θ), we get:

B(Θ) = Σ_{n=1}^N Σ_{k=1}^K P(k | x_n, Θ^t) log(s(k, n)) − Σ_{n=1}^N Σ_{k=1}^K P(k | x_n, Θ^t) log(P(k | x_n, Θ^t))
From the E-step, the second term of the right side of this expression is known, therefore the maximization of B(Θ) is reduced to the maximization of:

b(Θ) = Σ_{n=1}^N Σ_{k=1}^K P(k | x_n, Θ^t) log(s(k, n)) (3.10)

This results in the following formulas:

μ_k = Σ_{n=1}^N P(k | x_n, Θ^t) x_n / Σ_{n=1}^N P(k | x_n, Θ^t)

σ_k² = Σ_{n=1}^N P(k | x_n, Θ^t) ||x_n − μ_k||² / (d Σ_{n=1}^N P(k | x_n, Θ^t))

α_k = (1/N) Σ_{n=1}^N P(k | x_n, Θ^t)

The formulas are obtained by computing the derivative of the lower bound function with respect to each of the parameters of the model and by setting each of these derivatives to zero. Here are the details of the derivation of the formulas of the model parameters.

Formula of the mean:

∂b(Θ)/∂μ_k = Σ_{n=1}^N P(k | x_n, Θ^t) (x_n − μ_k) / σ_k² = 0 (3.11)

which gives the formula of μ_k above. The computation of the standard deviation is obtained in the same way. First, μ_k is inserted in b(Θ) and then b(Θ) is derived with respect to σ_k. This gives the expression of σ_k² above.

For the derivation of the expression of the mixture probabilities, the constraint Σ_{k=1}^K α_k = 1 must be considered. In order to do this, the lagrange method [D.3] is used. The expression b(Θ) is extended by including the constraint Σ_{k=1}^K α_k = 1. This results in a new function:

f(Θ) = b(Θ) + λ(Σ_{k=1}^K α_k − 1)

where λ is the lagrange multiplier. By inserting the expression of b(Θ), this gives:

f(Θ) = Σ_{n=1}^N Σ_{k=1}^K P(k | x_n, Θ^t) log(s(k, n)) + λ(Σ_{k=1}^K α_k − 1)

with s(k, n) = α_k γ(x_n | Θ_k). Setting the derivative of f(Θ) with respect to α_k to zero gives:

(1/α_k) Σ_{n=1}^N P(k | x_n, Θ^t) + λ = 0 (3.17)

which is equivalent to:

α_k = −(1/λ) Σ_{n=1}^N P(k | x_n, Θ^t) (3.18)

By taking into account the constraint Σ_{k=1}^K α_k = 1, we get:

−(1/λ) Σ_{k=1}^K Σ_{n=1}^N P(k | x_n, Θ^t) = 1

This is equivalent to −(1/λ) Σ_{n=1}^N 1 = 1, since Σ_{k=1}^K P(k | x_n, Θ^t) = 1, and therefore:

λ = −N (3.21)

Replacing λ by its value in the equation 3.18 gives the estimate of the mixing probability:

α_k = (1/N) Σ_{n=1}^N P(k | x_n, Θ^t). (3.22)
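To make the E-step and M-step above concrete, here is a minimal NumPy sketch of EM for a mixture of isotropic gaussians. It is an illustration only, not the implementation used in this thesis; the function name, the initialisation scheme (means spread across the data set) and the fixed iteration count are choices made for the example.

```python
import numpy as np

def em_isotropic_gmm(X, K, n_iter=50):
    """EM sketch for a mixture of isotropic gaussians (MIXIG).

    Component k has mixing proportion alpha[k], mean mu[k] and a single
    variance sigma2[k] shared by all d dimensions.
    """
    N, d = X.shape
    # Spread the initial means across the data set (any sensible init works).
    mu = X[np.linspace(0, N - 1, K).astype(int)].astype(float)
    sigma2 = np.ones(K)
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior P(k | x_n) proportional to alpha_k * gamma(x_n | Theta_k).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # (N, K)
        log_w = np.log(alpha) - 0.5 * d * np.log(2 * np.pi * sigma2) - 0.5 * sq / sigma2
        log_w -= log_w.max(axis=1, keepdims=True)                     # numerical stability
        post = np.exp(log_w)
        post /= post.sum(axis=1, keepdims=True)                       # rows sum to 1
        # M-step: the formulas derived in the text, e.g. alpha_k as in (3.22).
        Nk = post.sum(axis=0)                                         # effective cluster sizes
        mu = (post.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (post * sq).sum(axis=0) / (d * Nk)
        alpha = Nk / N
    return alpha, mu, sigma2, post
```

The posteriors returned by the E-step can be turned into a hard partition with `post.argmax(axis=1)`, which is exactly the MAP assignment used by the classification EM discussed next.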
In this section, we have discussed one example of probability-based clustering that uses the mixture likelihood approach. This approach assumes that clusters overlap. The next section is another example of probabilistic clustering. It is based on the classification likelihood approach. This approach assumes that the clusters do not overlap.

– Clustering under the classification likelihood approach (CEM)
The objective of clustering under the classification likelihood approach is to find a partition of the data set that maximizes the classification likelihood criterion κ defined as:

κ(Θ | D) = Σ_{k=1}^K Σ_{x_n ∈ C_k} log( α_k γ(x_n | Θ_k) ),

where C_k is the k-th cluster and μ_k, σ_k and α_k are respectively its mean, standard deviation and mixture proportion.
While the EM algorithm is a general method for estimating the
model parameters under the mixture approach, the classiﬁcation
EM is a method for estimating the model parameters under the
classiﬁcation approach. The classiﬁcation EM algorithm has been
proposed by G. Celeux and G. Govaert.
The classiﬁcation likelihood objective criterion κ is a special case
of the mixture likelihood criterion. In this special case, each in-
stance belongs exclusively to a single cluster. Like the EM al-
gorithm, the classiﬁcation EM algorithm has an expectation step
and a maximization step. During the expectation step, the ex-
pected membership degree of each instance to each of the clusters
is computed. Using the cluster membership degrees, computed in
the E-step, the maximization step computes the values of the pa-
rameters that maximize the log-likelihood function. In addition,
the classiﬁcation EM algorithm has a classiﬁcation step, called
C-step. The C-step takes place between the E-step and the M-
step. In the classiﬁcation step, instances are assigned to clusters
according to the maximum a posteriori (MAP) principle. Below
is a description of the classiﬁcation EM algorithm:
Algorithm: learning model parameters via the CEM algorithm
Input: a data set D of size N, the desired number of clusters K
Output: a partition of D into K clusters.

1. Initialisation: start from an initial partition P^0 of the data set. Repeat the E, C and M steps until convergence is reached.
2. E-step: computation of the posterior probabilities. For i = 1, ..., N and for k = 1, ..., K, the posterior probability for data instance x_i belonging to cluster C_k is given by

P(k | x_i, Θ^t) = α_k^t f(x_i | Θ_k^t) / Σ_{j=1}^K α_j^t f(x_i | Θ_j^t),

where α_k^t and Θ_k^t are the values of the parameters of the model at the t-th iteration and f is a density distribution function.
3. C-step: MAP assignment of items to clusters. Each instance x_i is assigned to the cluster for which its posterior probability is maximal.
4. M-step: computation of the parameter values. For k = 1, ..., K, α_k = N_k / N, where N_k is the size of C_k. The formula for the computation of the parameter Θ_k depends on the exact expression of f.

In this thesis, f is a mixture of isotropic gaussians. The mean of the cluster k is μ_k and its variance is σ_k². This gives the expressions:

μ_k = (1/N_k) Σ_{x_i ∈ C_k} x_i, ∀k = 1, ..., K

σ_k² = (1/(d N_k)) Σ_{x_i ∈ C_k} ||x_i − μ_k||², ∀k = 1, ..., K,

where d is the dimension of the data space and N_k is the size of cluster C_k.

These formulas are intuitive. As a special case of the mixture likelihood, they can be derived from those of the EM-based approach. This is done by replacing P(k | x_i, Θ) with 1 if x_i ∈ C_k and with 0 if x_i ∉ C_k in the formulas obtained with the EM algorithm.
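The CEM loop above can be sketched compactly, under the same illustrative assumptions as the earlier EM sketch (NumPy, isotropic gaussian components, means initialised by spreading over the data set). This is a sketch, not the thesis implementation.

```python
import numpy as np

def cem_isotropic(X, K, n_iter=30):
    """Classification EM (CEM) sketch for isotropic gaussian components.

    E-step: posteriors; C-step: MAP assignment; M-step: re-estimate
    alpha_k, mu_k and sigma2_k from the resulting hard partition.
    """
    N, d = X.shape
    # Initial partition P0: assign each instance to the nearest of K means
    # spread across the data set (any sensible initialisation works).
    mu = X[np.linspace(0, N - 1, K).astype(int)].astype(float)
    labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    for _ in range(n_iter):
        # M-step: hard-partition estimates of the parameters.
        Nk = np.array([(labels == k).sum() for k in range(K)])
        if (Nk == 0).any():
            break                                  # degenerate partition; stop
        mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        sigma2 = np.array([((X[labels == k] - mu[k]) ** 2).sum() / (d * Nk[k])
                           for k in range(K)])
        alpha = Nk / N
        # E-step + C-step: MAP assignment via (unnormalised) log-posteriors.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_post = np.log(alpha) - 0.5 * d * np.log(sigma2) - 0.5 * sq / sigma2
        new_labels = log_post.argmax(axis=1)
        if (new_labels == labels).all():
            break                                  # C-step is stable: converged
        labels = new_labels
    return labels, mu
```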
Figure 3.3 shows how the log-likelihood of the data increases with the iterations of the classification maximum likelihood.

Figure 3.3: Variation of the log-likelihood with the iterations of the classification maximum likelihood

From this figure, it appears that the log-likelihood converges after 10 iterations. One of the drawbacks is that it is computationally expensive, especially for a high number of clusters.
The two previous examples are examples of model-based clustering that use a probabilistic approach. The next method, which is also an example of model-based clustering, uses an artificial neural network approach.
• Artiﬁcial neural network based methods
Artificial neural networks (ANN) are inspired by the way the human brain works. An ANN consists of many interconnected processing units, called neurons. They are generally modelled as a directed graph. The source of the graph is called the input layer and the sink the output layer.
Sometimes, hidden layers are located between the input layer and the output layer.

ANNs are used for both classification and clustering. They can be competitive or non-competitive. In competitive learning, the output nodes
compete and only one of them wins. A commonly used competitive
approach for clustering is self-organizing maps (SOM). The term self-
organizing refers to the ability of the nodes of the networks to organize
themselves into clusters.
SOM are represented by a single layered neural network in which each
output node is connected to all input nodes. This is illustrated in ﬁgure
3.4. When an input vector is presented to the input layer, only a sin-
gle output node is activated. This activated node is called the winner.
When the winner has been identiﬁed its weights are adjusted. At the
end of the learning process, similar items get associated to the same
output node. The most popular examples of SOM are the Kohonen
self-organizing maps.
Kohonen Self-Organizing Maps(SOM) algorithm
Kohonen self-organizing maps were developed by Teuvo Kohonen around
1982. They have two layers: an input layer and a competitive layer, as
illustrated in ﬁgure 3.4. The competitive layer is a grid of nodes. Each
input node is connected to all the nodes in the competitive layer. The
links between the input nodes and the nodes of the competitive layer
have a weight, and each node in the competitive layer has an activa-
tion function. The network learns in the following way: initially, the
weights of the network are randomly initialised. Then for each input
vector presented to the input layer, each of the competitive nodes pro-
duces an output value. The node that produces the best output value
is the winner of the competition. As a result, the weights of the winner
node as well as those of the nodes in its neighbourhood are adjusted.
Figure 3.4: A 3×3 Kohonen network map
Description of the Kohonen SOM algorithm

1. Initialisation: the weights of the network are randomly chosen and the neighbourhoods of the output nodes are specified.
Iterations: repeat steps 2, 3 and 4 until convergence. Convergence is reached when the variation in weights of two consecutive iterations becomes very small.
2. Find the winner node: for a given input X, the node of the Kohonen layer most similar to X is chosen as the winner.
3. Update the weights: the weights of the winner node as well as those of the nodes in its neighbourhood are updated: W(t + 1) = W(t) + η(t)(X − W(t)), where η(t) is the learning rate at iteration t.
4. Decrease the learning rate and reduce the size of the neighbourhood of output nodes.
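The training loop above can be sketched as follows with NumPy. This is an illustrative sketch only (function names, the gaussian neighbourhood kernel and the linear decay schedules are example choices, not the thesis implementation); weights are initialised with instances of the data set, which avoids output nodes that never win.

```python
import numpy as np

def train_som(X, rows=3, cols=3, n_epochs=10, lr0=0.5):
    """Sketch of Kohonen SOM training on a rows x cols competitive layer."""
    N, d = X.shape
    # Initialise weights with data instances spread across the data set.
    W = X[np.linspace(0, N - 1, rows * cols).astype(int)].astype(float).copy()
    # Lattice coordinates of the competitive-layer nodes.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    rng = np.random.default_rng(0)
    for epoch in range(n_epochs):
        lr = lr0 * (1.0 - epoch / n_epochs)                           # step 4: decrease learning rate
        radius = max(1e-3, (rows / 2.0) * (1.0 - epoch / n_epochs))   # step 4: shrink neighbourhood
        for x in X[rng.permutation(N)]:
            # Step 2: the node whose weights are most similar to x wins.
            winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            # Step 3: update the winner and its lattice neighbours.
            dist2 = ((coords - coords[winner]) ** 2).sum(axis=1)
            h = np.exp(-dist2 / (2.0 * radius ** 2))                  # neighbourhood kernel
            W += lr * h[:, None] * (x - W)
    return W

def som_assign(X, W):
    """Map each instance to its winning output node."""
    return np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
```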
Initialisation of the SOM algorithm: The weights of the network can be initialised randomly. But with random initialisation, some of the output nodes may never win the competition. This problem can be avoided by randomly choosing instances of the data set as initial values of the weights.

Choice of distance measure: The dot product and the euclidean distance are commonly used as distance measures. The dot product is used in situations where the input patterns and the network weights are normalised.
Learning rate: The learning rate controls the amount by which the
weights of the winner node and that of its neighbours are adjusted. The
initial learning rate is speciﬁed at the initialisation. Then it decreases
as the number of iterations increases. Decreasing the learning rate
ensures that the learning process stops at some point in time. This
is important because usually the convergence criterion is deﬁned in
terms of very small changes in the weights of two consecutive iterations.
Competitive learning does not give any guarantees that this convergence criterion will eventually be satisfied.
Deﬁning the neighbourhood: Initially, the neighbourhood is set to
a large value which then decreases with the iterations. This corresponds
to assigning instances to nodes with more precision as the number of iterations increases.

The time complexity of SOM is O(M ∗ N), where M is the size of the
grid and N is the size of the data set. The justiﬁcation of this time
complexity is the following: during the training a number of operations
(finding the winner and updating the neighbourhood), which is at most twice the size of the grid, take place. And the maximum number
of iterations is equal to the size of the data. So the time complexity
for the training is O(M ∗ N). As the assignment only takes O(N) this
gives a total complexity of O(M ∗ N).
One of the main strengths of SOM is its ability to preserve the topology
of the input data: items that are close to each other in the input space
remain close in the output space. This makes SOM a valuable tool for
visualizing high-dimensional data in low dimensions. SOM also supports
parallel processing; this can speed up the learning process.
Some of the limitations of the Kohonen SOM are: it is most appropriate for detecting hyperspherical clusters, and the choice of initial parameter values - the initial weights of the connections, the learning rate, and the size of the neighbourhood - is difficult. The quality of the clustering depends on the choice of the initial values of these parameters and also on the order in which items are processed.
This subsection has discussed model-based clustering. This approach
assumes that the data can be described by a mathematical model and aims at uncovering this model. The two main approaches to model-based clustering are the probabilistic approach and the artificial neural
network approach. The next subsection approaches the clustering prob-
lem diﬀerently. It views clusters as dense regions in the data space.
3.2.3 Density-based clustering
In the density-based approach, a cluster is deﬁned as a region of the
data space with high density. This dense region is bordered by low-
density regions that separate the cluster from other points of the data
space. There are two main types of density-based clustering: the ap-
proach based on connectivity and the approach based on density func-
tions. An example of a clustering algorithm based on connectivity is
DBSCAN  and an example based on density functions is DENCLUE
. These two algorithms are popular for clustering large spatial data
The two algorithms will be brieﬂy presented in the following.
1. DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN finds clusters by first identifying points in dense regions and then growing the regions around these points until the borders of these regions are met. To be more specific, DBSCAN finds clusters by:
- First identifying core points of the data set. A core point is a point whose neighbourhood contains a minimum number of points.
The size of the neighbourhood and the minimum number of
points are two parameters of the algorithm.
- Next, DBSCAN iteratively merges core points that are directly
density-reachable. A point p is directly density-reachable from a
point q if q is a core point and p belongs to the neighbourhood of
q. The iterations stop when it is no longer possible to add new points to any of the clusters. DBSCAN is designed for spatial data sets. A spatial data set captures the spatial relationship between
the instances of the data set. Examples of spatial data sets are
geographical data sets or image databases. The time complexity of DBSCAN is O(N log N) when the spatial index R*-tree is used.
Some of the diﬃculties in using DBSCAN are related to the choice
of appropriate values of the neighbourhood and the minimum
number of points that characterizes a core point.
Some of the advantages of density-based clustering are the ability to detect clusters of arbitrary shapes and the scalability to large data sets.
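The core-point identification and cluster-growing steps of DBSCAN described above can be sketched as follows. This is a naive O(N²) illustration using pairwise distances rather than an R*-tree index; the function name and parameters are chosen for the example.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns a cluster label per point, -1 for noise.

    A core point has at least min_pts points within distance eps; clusters
    are grown by merging directly density-reachable points.
    """
    N = len(X)
    # Neighbourhoods: all points within distance eps (including the point itself).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    neighbours = [np.flatnonzero(d2[i] <= eps ** 2) for i in range(N)]
    core = [len(nb) >= min_pts for nb in neighbours]
    labels = np.full(N, -1)
    cluster = 0
    for i in range(N):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster from this unvisited core point.
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster          # j is density-reachable from i
                if core[j]:                  # expand only through core points
                    frontier.extend(neighbours[j])
        cluster += 1
    return labels
```

Points never reached by any core point keep the label -1 and are treated as noise.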
2. DENCLUE: DENsity-based CLUstEring
DENCLUE is a clustering algorithm that uses density distribu-
tion functions to identify clusters. Clusters are identiﬁed as local
maxima of the overall density function. The overall density func-
tion is the sum of all the inﬂuence functions of each point of the
data space. Given a data set D, the inﬂuence function of a point
y ∈ D is a function f^y : D → R+, which models the impact of y within a neighbourhood. The gaussian distribution function is an example of an influence function that is commonly used. The gaussian influence function is defined as:

f^y_Gauss(x) = exp(−d(x, y)² / (2σ²)), where d is a distance measure.
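As a small illustration, the overall density at a point can be computed by summing the gaussian influences of all data points (a hypothetical helper sketch; the value of σ and the use of euclidean distance are example choices):

```python
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Gaussian influence of data point y at location x: exp(-d(x, y)^2 / (2 sigma^2))."""
    d2 = ((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2).sum()
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def overall_density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influence functions of every data point."""
    return sum(gaussian_influence(x, y, sigma) for y in data)
```

Density attractors are then local maxima of `overall_density` over the data space.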
Clusters are generated by density attractors. A density attractor
is a local maximum of the (overall) density function.
There are two types of clusters: center-defined clusters and arbitrary-shaped clusters.

Center-Defined Cluster: Given a threshold Γ, a center-defined cluster for a density attractor x* is the subset C of the data set D defined by:

C = {y ∈ D | y is density-attracted by x* and f_D(x*) ≥ Γ}

Arbitrary-shaped Cluster: Given Γ, an arbitrary-shaped cluster for the set of density attractors A is a subset C of the data set D, such that:
- ∀x ∈ C, ∃ x* ∈ A with f_D(x*) ≥ Γ and x is density-attracted by x*,
- for all pairs of density attractors x*_1, x*_2 ∈ A, there is a path from x*_1 to x*_2 such that for all points y on this path, f_D(y) ≥ Γ.

The notion of density-attracted points used in these two definitions of clusters is defined as follows:

Density-attracted points: Given ε > 0, a point x ∈ D is density-attracted to a density attractor x* if and only if there exists a chain of points x_0, x_1, ..., x_k such that x_0 = x and d(x_k, x*) ≤ ε, where each x_i is obtained from x_{i−1} by a step in the direction of the gradient of the density function, for 0 < i ≤ k.
For continuous and diﬀerentiable inﬂuence functions, such as the
gaussian inﬂuence function, a hill-climbing algorithm guided by
the gradient can be used to find density-attracted points. The description of the hill-climbing algorithm can be found in the appendix. DENCLUE considers outliers as noise and removes them. Some of the strengths of DENCLUE are:
-It has a solid mathematical foundation
-It is good for clustering high dimensional data
-It detects clusters of arbitrary shapes
The main limitations of these algorithms are that they have been designed for spatial data only. The fact that they remove outliers does not make them suitable for identifying attack clusters, which are small. The quality of the clustering result depends on the choice of the density parameter and the noise threshold.
This subsection has presented the density-based approach to cluster-
ing. In this approach, clusters are deﬁned as dense regions of the data
space. Two examples of algorithms have been presented: DBSCAN
and DENCLUE. These algorithms are designed for spatial data. The
next subsection discusses the grid-based approach to clustering. This
approach is also designed for spatial data. It views the data space as a grid of cells.
Figure 3.5: Querying recursively a multi-resolution grid with STING
3.2.4 Grid-based clustering
In grid-based clustering, the data space is partitioned into grid cells. A
summary of the information about each cell is computed and kept in
a grid data structure. Cells that contain a number of points above a
speciﬁed threshold are considered as dense. Dense cells are connected
to form clusters.
One popular example of a grid-based clustering algorithm is STING. It makes use of statistical information about the grid cells. In the following we will explore STING.
STING: STatistical INformation Grid
STING was proposed by Wang et al. It is a multi-resolution
grid. The data space is divided into rectangular cells. The cells are or-
ganized in diﬀerent hierarchical levels corresponding to diﬀerent levels
of resolutions. A cell at a hierarchical level i is partitioned into cells at
the next hierarchical level i+1.
Statistical information about each cell is pre-computed and stored.
Some of the statistical information stored is:
- count: the number of items in the cell,
- mean: the mean of the cell,
- s: the standard deviation,
- min: the minimum value of the cell,
- max: the maximum value of the cell,
- the distribution: the distribution of the cell if it is known.
Statistical information about the cells is computed in a bottom-up fash-
ion. For example, the distribution at the hierarchical level i can be
estimated to be the distribution of the majority of cells at the hierar-
chical level i −1.
The statistical information about the cells is used in a top-down fash-
ion to answer spatial queries. A query asks for the selection of cells
satisfying certain conditions on density for example.
The query-answering is performed in the following way:
First the level of the cell hierarchy where the answering is to begin is
found. Generally, it contains a small number of cells. Then for each
cell at this level, the relevancy of the cell in answering the query is
estimated. Only the relevant cells are submitted to further processing
in the next hierarchical level down.
This process is repeated until the lowest level of the hierarchy is reached.
At this stage the cells satisfying the query are returned. Usually this
ends the clustering process. In cases where very accurate results are
desired, the relevant cells are submitted to further processing. Only
cell members that satisfy the query are returned.
Figure 3.5 illustrates a top-down querying of the grid in STING. Start-
ing at level 1, the possible candidates satisfying the query are localized.
These initial solutions are reﬁned at levels 2 and 3. The desired cells
are returned at level 3. As it appears from this ﬁgure, the borders of
the cluster are either horizontal or vertical.
Some of the strengths of STING and grid based clustering in general are:
- It is scalable to large data sets. The query processing time is linear with respect to the number of cells.
- The grid structure supports parallel processing and incremental updates.
One of the weaknesses of STING is that the borders of the clusters
are either vertical or horizontal. Grid based clustering algorithms are
designed for clustering spatial data. They will not be considered for
the experiment as the data to be used does not capture the spatial
relationship between items.
The following two clustering methods correspond to two other ways
of categorizing clustering methods. In the ﬁrst of these categorizations,
a distinction is made between online and oﬀ-line clustering algorithms.
All the methods discussed up to now, except SOM, are oﬀ-line methods.
The second categorization distinguishes between crisp clustering and
fuzzy clustering. Crisp clustering creates distinct clusters, while in
fuzzy clustering, items belong to more than one cluster.
3.2.5 Online clustering
One of the main diﬀerences between oﬀ-line clustering and online clus-
tering is that the former requires that the entire data set is available
at each step of the clustering. That is so because oﬀ-line clustering
algorithms generally aim at ﬁnding the global optimiser of an objective
function. The latter - online clustering algorithms - generate clusters as the data is produced. Online clustering algorithms are appropriate for
clustering in a data ﬂow environment. Network traﬃc is an example of
such a type of data.
Online clustering algorithms do not aim at optimising a global crite-
rion, rather they proceed by making local decisions. Optimisation of
a global criterion often leads to a stability problem, in that the clus-
ters produced by these methods are sensitive to small changes in the
data. The advantage of online clustering is that it leads to adaptable
and stable cluster structures. An example of online clustering is leader clustering.

The leader clustering algorithm
Leader clustering starts by selecting a representative of a cluster. This
representative is called the leader of the cluster. When assigning in-
stances to clusters, the distances of the instance to each of the current
clusters are computed. The instance is assigned to the closest cluster
if its distance to that cluster is below a specified threshold. If the distance of the instance to each of the existing clusters is greater than the
threshold, a new cluster, consisting of that single instance, is created.
This process is repeated for each of the instances. Generally, euclidean
distance is used.
Description of the algorithm

1. Initialisation: choose a threshold and initialise the first cluster centre μ_1. Generally the first item of the data set is chosen. For each of the remaining items x repeat the following steps:
2. Identify the closest cluster C_k.
3. If |x − μ_k| is below the threshold, update the centre μ_k; otherwise create a new cluster with x as its leader.
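The steps above can be sketched as a single-scan procedure (an illustrative sketch; the function name and the running-mean centre update are example choices):

```python
import numpy as np

def leader_cluster(data, threshold):
    """Leader clustering sketch: one scan of the data, euclidean distance.

    The first instance seeds the first cluster; each later instance joins
    the closest cluster if it is within `threshold` (the centre is then
    updated as a running mean), otherwise it starts a new cluster.
    """
    leaders = [np.asarray(data[0], dtype=float)]
    counts = [1]
    assignments = [0]
    for x in data[1:]:
        x = np.asarray(x, dtype=float)
        dists = [np.linalg.norm(x - ld) for ld in leaders]
        k = int(np.argmin(dists))
        if dists[k] < threshold:
            counts[k] += 1
            leaders[k] = leaders[k] + (x - leaders[k]) / counts[k]  # running-mean update
            assignments.append(k)
        else:
            leaders.append(x)            # x becomes the leader of a new cluster
            counts.append(1)
            assignments.append(len(leaders) - 1)
    return leaders, assignments
```

A single pass over N instances against K current leaders gives the O(K ∗ N) running time discussed below.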
One of the main drawbacks of leader clustering, that is common to
on-line clustering algorithms, is that the clustering result is dependent
on the order in which instances are processed. When leader clustering
is used for oﬀ-line clustering, this problem can be solved by selecting
instances in a random order.
Some of the strengths of leader clustering are: it is fast, robust to
outliers and does not require the number of clusters to be speciﬁed ex-
plicitly. Its robustness in the presence of outliers indicates that it may
have some potential for clustering network traffic data for anomaly detection.

The time complexity of the leader clustering is O(K ∗ N), where K
is the number of clusters and N is the size of the data set. A single
scan of the data set is required and a constant number of operations is
performed during the processing of each instance.
3.2.6 Fuzzy clustering
Another way of categorizing clustering methods is to consider the de-
gree of membership of data instances to clusters. A distinction is made
between crisp clustering and fuzzy clustering. In crisp clustering, each
data instance is assigned to only one cluster. In fuzzy clustering, on the
other hand, each instance belongs to more than one cluster with some
degree of membership. The degree of membership of a data instance X_i to a cluster C_k is a real value z_{ik} ∈ [0, 1], where Σ_{k=1}^K z_{ik} = 1. Crisp clustering can be considered as a special case of fuzzy clustering, where z_{ik} = 1 if X_i belongs to C_k and z_{ik} = 0 otherwise.
Fuzzy clustering aims at minimizing a fuzzy objective criterion. An
example of fuzzy clustering is the EM-based clustering studied earlier.
Another example is fuzzy kmeans discussed below.
The fuzzy kmeans algorithm

Fuzzy kmeans, also known as fuzzy cmeans, was proposed by Dunn in 1974 and improved by Bezdek in 1981. The algorithm aims at minimizing the following objective function:

Q = Σ_{i=1}^N Σ_{k=1}^K z_{ik}^b ||X_i − μ_k||²

b is called the fuzzifier and it controls the degree of fuzziness. When the fuzzifier b is close to 1, the clustering tends to be crisp and when the fuzzifier b becomes very large, the degree of membership approaches 1/K; that means the data instance is a member of all the clusters to the same degree. Generally, the value of the fuzzifier b is chosen to be 2.
Description of the fuzzy kmeans algorithm

1. Initialisation: choose the number of clusters K, the initial cluster centres, the fuzzifier b, a threshold and the cluster memberships z_{ik} (where i = 1, ..., N and k = 1, ..., K).
2. Normalize z_{ik} so that Σ_{k=1}^K z_{ik} = 1, ∀i = 1, ..., N.
Iterations: repeat steps 3 and 4 until (Q(t) − Q(t − 1)) ≤ threshold.
3. Recompute the cluster means:
μ_k = Σ_{i=1}^N z_{ik}^b X_i / Σ_{i=1}^N z_{ik}^b
4. Recompute the degrees of cluster membership:
z_{ik} = (1/||X_i − μ_k||²)^{1/(b−1)} / Σ_{j=1}^K (1/||X_i − μ_j||²)^{1/(b−1)}
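Steps 3 and 4 can be sketched as follows with NumPy. As an illustrative variant, the centres rather than the memberships are initialised (spread across the data set), and b = 2; this is a sketch, not the thesis implementation.

```python
import numpy as np

def fuzzy_kmeans(X, K, b=2.0, n_iter=100, tol=1e-8):
    """Fuzzy kmeans (fuzzy cmeans) sketch.

    z[i, k] is the degree of membership of instance i in cluster k; each
    row is kept normalised so that the memberships sum to 1.
    """
    N, d = X.shape
    # Initial centres spread across the data set (a common variant of step 1;
    # the algorithm description initialises memberships instead).
    mu = X[np.linspace(0, N - 1, K).astype(int)].astype(float)
    prev_q = np.inf
    for _ in range(n_iter):
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # Step 4: membership update; rows sum to 1 by construction.
        inv = (1.0 / sq) ** (1.0 / (b - 1.0))
        z = inv / inv.sum(axis=1, keepdims=True)
        # Step 3: recompute cluster means with weights z^b.
        zb = z ** b
        mu = (zb.T @ X) / zb.sum(axis=0)[:, None]
        q = (zb * sq).sum()              # fuzzy sum of squared errors Q
        if abs(prev_q - q) <= tol:
            break
        prev_q = q
    return mu, z
```

A crisp partition can be recovered at the end with `z.argmax(axis=1)`, i.e. a MAP-style assignment.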
Derivation of the formulas: the above formulas are obtained as follows.

Let us first derive the expression of the cluster membership z_{ik}. We are looking for an extremum of the fuzzy objective function Q under the constraint that Σ_{k=1}^K z_{ik} = 1. In order to include that constraint we use the Lagrange method:

P = Q − λ(Σ_{k=1}^K z_{ik} − 1)

where λ is the Lagrange multiplier. By setting the derivative of P with respect to z_{ik} to zero, we get:

∂Q/∂z_{ik} − λ = 0 (3.28)

Using the expression of the derivative of Q gives:

b z_{ik}^{b−1} ||X_i − μ_k||² = λ, (3.29)

which is equivalent to:

z_{ik} = ( λ / (b ||X_i − μ_k||²) )^{1/(b−1)} (3.30)

Using the constraint Σ_{k=1}^K z_{ik} = 1, we get:

Σ_{k=1}^K ( λ / (b ||X_i − μ_k||²) )^{1/(b−1)} = 1 (3.31)

which is equivalent to:

(λ/b)^{1/(b−1)} = 1 / Σ_{k=1}^K (1/||X_i − μ_k||²)^{1/(b−1)} (3.32)

When inserting the value of λ in equation 3.30, we get the cluster membership:

z_{ik} = (1/||X_i − μ_k||²)^{1/(b−1)} / Σ_{j=1}^K (1/||X_i − μ_j||²)^{1/(b−1)} (3.33)

Finding the formula for the means of the clusters is simple because no constraints have to be satisfied. It is obtained by deriving Q with respect to μ_k and setting the derivative to zero:

∂Q/∂μ_k = Σ_{i=1}^N z_{ik}^b (X_i − μ_k) = 0 (3.34)

which gives the formula for μ_k used in step 3.

Figure 3.6: Variation of the - fuzzy - sum of squared errors in fuzzy kmeans
Figure 3.6 shows how the fuzzy sum of squared errors varies with the number of iterations of fuzzy kmeans. In this figure, it appears that the fuzzy sum of squared errors decreases very slowly after the 11th iteration. That indicates that convergence of the fuzzy kmeans is reached around the 11th iteration.

A limitation of fuzzy kmeans is that it is more computationally expensive than the standard kmeans.
Fuzzy clustering is appropriate in situations where the clusters overlap.
In this thesis, we are looking for partitions of the data set. So our
interest in fuzzy clustering is limited to studying the eﬀects that fuzzy
concepts have on clustering results. At the end of the clustering pro-
cess, a partition -non-overlapping clusters- is returned using a MAP
assignment, for example.
3.3 Discussion of the classical clustering algorithms
The clustering algorithms discussed in the previous sections of this
chapter fall into two groups: the traditional ones and the most recent
ones. The traditional algorithms are HAC, kmeans, EM-based cluster-
ing, CEM-based clustering, SOM, fuzzy kmeans and leader clustering.
The most recent ones are examples of density-based clustering such
as DBSCAN and DENCLUE and examples of grid-based clustering
such as STING. The categorization of each of the clustering algorithms
as instances of a speciﬁc clustering method provides a framework for
understanding and discussing properties of the algorithms. Although
these algorithms belong to diﬀerent methods, some of them can be
easily related. Kmeans is a special case of classiﬁcation EM-based
clustering which, in turn, is a special case of EM. Kmeans can also be
seen as a special case of fuzzy kmeans clustering. A one-dimensional
SOM, in which only the winner node’s weights are updated during the
competitive learning, is equivalent to the online version of the kmeans
algorithm. The diﬀerence between online kmeans and kmeans is that
the former updates the cluster centres as items are assigned to clusters.
Only the centre of the cluster to which a new instance is assigned is
recomputed. The latter assigns all the instances to the clusters before re-computing the centres of the clusters. The relation between those algorithms will be helpful in explaining the performance of the algorithms.
All the discussed clustering algorithms have their strengths and limita-
tions. Generally, each of these algorithms will produce a good clustering
result if the assumptions and ideas the algorithm is based on match those of the data set. A major difference between these algorithms is their
running times. Model-based clustering algorithms, such as EM-based
clustering and CEM-based clustering, and hierarchical clustering are
computationally expensive. Online clustering is fast and squared-error
clustering has an acceptable running time. So clustering algorithms
such as EM-based, CEM-based clustering and HAC are impractical for
clustering large data sets. The computational time of EM-based clustering increases drastically with the number of desired clusters. The execution time of SOM increases only slightly with the number of clusters - the size of the SOM grid.
Of the partitioning clustering methods discussed, only the examples
of density based clustering and grid based clustering are useful in the
detection of clusters of arbitrary shapes and sizes. In both approaches,
identifying clusters is achieved by merging small dense clusters. The
main diﬀerence in these approaches is how they deﬁne and identify the
small clusters. DENCLUE, which is an example of density based clus-
tering, uses density distribution functions and identiﬁes dense regions
by ﬁnding the local maxima of the overall density function. DBSCAN,
which is another example of a density based clustering algorithm, local-
izes points that contain a number of items above a speciﬁed threshold.
STING, an example of grid based clustering, uses suﬃcient statistics
about grid cells for identifying the dense cells. These algorithms are de-
signed for spatial databases. They use efficient spatial data structures, such as the R*-tree, for merging dense clusters. This makes them scalable
to large data sets. Hierarchical agglomerative clusters are also constructed by merging small clusters. But it is impractical to use HAC for clustering large data sets because HAC does not use an efficient spatial data structure.
In the next section, we will study the issue of combining clustering
methods. We will speciﬁcally study how the merging of small clusters
can be eﬃciently adapted to the data set at hand.
3.4 Combining clustering methods
Clustering methods can be combined using two main approaches.
1. The first approach combines the clustering results produced by pairs of clustering algorithms. It deduces new partitions of the data set by studying the agreement in the clustering results produced by different clustering methods. Various algorithms and techniques for studying the agreement in the partitions provided by different clustering methods have been studied in the literature. This approach will not be considered in this thesis because it is computationally expensive.
2. The second approach combines ideas and techniques from diﬀerent
clustering methods to derive new clustering techniques. The goal
is to use diﬀerent ideas and techniques from diﬀerent clustering
methods as building blocks for new clustering techniques to solve
the problem at hand. Two diﬀerent architectures will be explored.
– The ﬁrst involves initialising a clustering algorithm with the
partition produced by another clustering algorithm.
– The second clustering architecture consists in two levels. The
ﬁrst level creates a large number of small clusters using one of
the studied clustering algorithms and the second level merges
the clusters created at the ﬁrst level. This clustering archi-
tecture will be called two-level clustering.
3.4.1 Two-level clustering with kmeans
We use the two-level architecture in order to detect clusters of arbitrary shapes and sizes. Because the distribution of the attacks is skewed, producing a high number of small clusters will help us to identify small attack clusters. Large clusters, consisting for example of normal data, can be constructed by merging small clusters.
In this study, kmeans is used for the creation of the ﬁrst level clusters.
In principle, the choice of clustering algorithm for the creation of the
clusters at the ﬁrst level does not make a signiﬁcant diﬀerence as long
as the clusters created are of high purity. Kmeans has been chosen
because it is fast compared to most of the other algorithms, and because
it has some properties that are essential for the success of the proposed
method. In the rest of this section, the ﬁrst level clusters will be referred
to as basic clusters.
Merging basic clusters degrades the purity of the clustering. Our aim is
to merge clusters in such a way that the purity of the clusters degrades
as little as possible. As the attack labels are not known during the
clustering process, we do not have a way of directly measuring the
purity of the clustering. Other characteristics of the data will be used
to approximate the purity of clusters.
A cluster is said to be 100% pure if it contains attacks of exactly one kind. Merging two 100% pure clusters that contain the same attack type will not degrade the purity of the clustering. It will be assumed that two basic clusters are of the same type, and can therefore be merged, if the following two conditions are satisfied:
- the two clusters are close to each other,
- the two clusters have approximately the same density.
The ﬁrst of these conditions is based on the assumption that data in-
stances of the same attack type are close to each other. The second
condition is based on the assumption that clusters of the same attack
type have approximately the same density. The density of a cluster is
deﬁned as the average number of items in a speciﬁed radius ρ.
Estimation of the density of basic clusters:
Because kmeans is used for the creation of basic clusters, a basic cluster's size can be used as an approximation of the cluster density. This is possible because kmeans is based on the implicit assumption that clusters are spheres of identical radius δ. So, by choosing ρ equal to δ, the cluster size can be used to estimate the cluster density.
Proof of the assumption regarding the shape and size of kmeans clusters:
The estimate of the density of basic clusters produced by kmeans is based on our assertion that kmeans assumes that clusters are spheres of the same size. The goal of this section is to prove this assertion.
As we explained earlier, kmeans aims at minimizing the sum of squared-errors criterion. Let us recall the expression of the sum of squared-errors of a partition P:

SSE(P) = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ||x_i − µ_k||²,

where µ_k is the center of cluster C_k.
The purpose of the proof is to show that the minimization of the SSE is equivalent to the maximization of a special case of the classification likelihood criterion. This special case corresponds to the situation where the model is a mixture of isotropic gaussians with identical standard deviations and identical mixture proportions. In this special case, CEM aims at finding clusters that are spheres of the same size. The expression of the classification likelihood criterion, shown earlier, is:

CL(P) = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} log(p_k p(x_i | µ_k, σ_k)),

where C_k is the k-th cluster and µ_k, σ_k and p_k are respectively its mean, standard deviation and mixture proportion.
In the case where the mixture proportions and standard deviations are identical for all the clusters, we have p_k = 1/K and σ_k = σ, ∀k, 1 ≤ k ≤ K. So,

CL(P) = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} log((1/K) p(x_i | µ_k, σ)), (3.36)

which is equivalent to:

CL(P) = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} log p(x_i | µ_k, σ) + R, (3.37)

where R = −N log K is a constant.
Using the expression of the isotropic gaussian, that is,

p(x_i | µ_k, σ) = (1 / (√(2π) σ)^d) exp(−||x_i − µ_k||² / (2σ²)),

we obtain:

CL(P) = −Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ( ||x_i − µ_k||² / (2σ²) + d log(√(2π) σ) ) + R, (3.38)

where µ_k is the center of the k-th cluster, because the maximum likelihood estimate of the mean of a cluster is the center of the cluster, as shown in the formula of the M step of the CEM algorithm. So,

CL(P) = −(1 / (2σ²)) SSE(P) − N d log(√(2π) σ) + R, (3.39)

where N is the size of the data set D and d is the dimension of D. This last equation proves that minimizing the SSE is equivalent to maximizing the classification likelihood criterion for a mixture of isotropic gaussians with identical mixture proportions and identical standard deviations.
Merging basic clusters
In order to produce clusters of arbitrary shapes, basic clusters are linked
instead of being fused. The linking of basic clusters results in multi-
centered clusters. The fusion of basic clusters into one center-based cluster would have produced spherical clusters. The distance between two
multi-centered clusters is deﬁned as the distance between their closest
basic clusters. The distance between two basic clusters is deﬁned as
the euclidean distance between their means. This distance measure has
been chosen because its computation is fast.
Selecting an optimal number of basic clusters :
The parameters that inﬂuence the quality of clustering with the two-
level approach are: the purity of the basic clusters and the number
of times basic clusters are linked. These two conditions are mutually
antagonistic. A high purity of basic clusters requires a large number of
basic clusters. But with a large number of basic clusters a high number
of linking operations are required. We need a mechanism for choosing
an optimal number of basic clusters.
Figure 3.7 illustrates how the classiﬁcation accuracy obtained with two-
level clustering varies with the number of basic clusters. Figure 3.7
shows that the classiﬁcation accuracy is highest when the number of
basic clusters is 200.
In order to choose the appropriate number of basic clusters, we study
how the SSE is related to the classiﬁcation accuracy for kmeans clus-
tering. This study shows that SSE and classiﬁcation accuracy vary in
a similar way with the number of clusters of kmeans. So we use SSE
for identifying the optimal number of basic clusters. As SSE measures
Figure 3.7: Variation of classification accuracy with the number of basic clusters.
the compactness of the clusters, it makes sense to use it as a measure
of the homogeneity of the clusters. The identiﬁcation of an optimal
number of clusters is achieved by plotting the variation of the SSE
with the number of clusters. The optimal number of basic clusters is
chosen in the region of the graph where SSE begins to decrease very
slowly. Let us call this region of the graph R. Selecting a point within region R is reasonable because the purity of the clusters does not vary significantly beyond R, and because merging a high number of clusters decreases the purity of the final clusters.
In short, the optimal number of basic clusters is found experimentally
by studying the variation of SSE with respect to the number of clus-
ters. If the difference between the SSE values of two consecutive numbers of clusters, say α and β, is below a specified threshold, either α or β is selected as a reasonable number of basic clusters.
Figure 3.8 shows how the SSE varies with the number of clusters in kmeans. Selecting the number of basic clusters within the interval [150, 250] is reasonable.
Some of the main strengths of two-level clustering are:
Figure 3.8: Variation of the sum of squared-errors (SSE) with the number of clusters in kmeans.
– It detects clusters of arbitrary shapes and sizes.
– It is possible to adjust the quality of the clustering by varying the number of basic clusters.
Some of the weaknesses are:
– When the number of basic clusters is high, the computation time may also be high; however, it is no worse than that of most of the other clustering algorithms considered in this study.
– Finding the optimal number of basic clusters is diﬃcult. It may
require experimentation and this is time consuming.
Two-level clustering can be seen as a combination of kmeans and HAC. It also makes use of ideas from density-based clustering when merging basic clusters. In the following, we summarize the steps used for performing the two-level clustering. As it is a combination of kmeans, HAC and density-based clustering, we call this algorithm KHADENS (Kmeans, HAc and DENSity).
1. Initialisation: Specify the number of basic clusters β; this is done through experimentation. Specify the minimum distance minDist and the minimum size ratio minDens which two basic clusters must satisfy in order to be merged.
2. Creation of the basic clusters: Create β clusters using the kmeans algorithm.
3. Merging clusters: Start with the basic clusters. For each pair of multi-centered clusters MC_i and MC_j, merge them if there is a basic cluster bc_k ∈ MC_i and a basic cluster bc_l ∈ MC_j such that d(bc_k, bc_l) ≤ minDist and the size ratio of bc_k and bc_l is at least minDens. Repeat this step until no change occurs.
Another variation of this algorithm has been explored. In this variation
that we call KHAC, the closest clusters are iteratively merged until the
desired number of clusters is reached. The main diﬀerence between
these two algorithms is that the size of basic clusters is not considered
when merging clusters in KHAC.
The running time of KHADENS and KHAC is mainly the time used
for the creation of the basic clusters. The merging of the basic clusters
is fast as it generally involves a small number of clusters.
3.4.2 Initialisation of clustering algorithms with
the results of leader clustering
Leader clustering is very fast and robust to outliers. It can, therefore, be
used for the identiﬁcation of better initial cluster centres to be used in
each of the other algorithms. The procedure for initialising a clustering
algorithm CA with the leader clustering algorithm is the following: Let
K be the number of clusters desired by CA. The leader clustering is
used to cluster the data set into M clusters, where M ≥ K. Then the
centres of K of the clusters created by the leader algorithm are used as
initial centres for the clustering algorithm CA.
In this chapter, diﬀerent clustering methods have been discussed. A
distinction has been made between clustering methods and clustering
algorithms. A clustering method deﬁnes the general concept and theory
the clustering is based on, while a clustering algorithm is a particular
implementation of a clustering method. Examples of each of the con-
sidered clustering methods have been discussed. Most of the classical
clustering algorithms considered in this thesis approach the clustering
problem as an optimisation problem. They aim at optimising a global
objective function. They make use of an iterative process to solve the
problem. Another group of algorithms do not approach the clustering
problem as an optimisation problem. They view clusters as dense re-
gions in the data space and identify clusters by merging small units of
dense regions. This is the case for density based clustering and grid
based clustering. A clustering architecture inspired by some properties
of the clustering algorithms using an optimisation approach and the
clustering algorithms which construct clusters by making local decisions has been proposed. This architecture takes into consideration the characteristics of the data set at hand. The discussed clustering algorithms, with the exception of DBSCAN, DENCLUE and STING, are used for our experiments, which are discussed in the next chapter.
This chapter describes and discusses the design and the execution of the experiments. This discussion is important in order to understand and explain the aspects of the experiments that have an impact on the performance of the clustering algorithms.
– Section 4.1 discusses the design of the experiments.
– Section 4.2 discusses the data set and its feature set.
– Section 4.3 discusses implementation issues.
– Section 4.4 summarizes the chapter.
The experiments are conducted on a Pentium 4 processor with 1.5 GB of memory. The operating system is FreeBSD release 5.4.
4.1 Design of the experiments
The design and the implementation are modular. The programming
language used for the implementations of the clustering algorithms and
the experiment is C++. As an object-oriented programming language, C++ supports modularity. Furthermore, C++ is an efficient and flexible programming language. Efficiency is an important issue in our experiments because of the large size of the data set. The
CHAPTER 4. EXPERIMENTS 63
architecture of the system used for implementing the clustering algo-
rithms and performing the experiments is composed of four modules:
the data preparation module, the clustering algorithms module, the
experiment module and the evaluation module.
– Data preparation module: this module puts the data in a form
that is easily used by the clustering algorithms. It transforms
non-numeric feature values into numeric feature values and it nor-
malizes the feature values.
– Clustering algorithm module: This module implements the dis-
tance measures and the clustering algorithms. Common and im-
portant clustering concepts have been encapsulated into classes so they can be easily shared by the different clustering algorithms.
– Experiment module: This module implements operations related
to the execution of the experiment. It implements for example the
execution of the ten-fold cross validation. It also implements the
diﬀerent indices to be used for evaluation of the algorithms.
– Evaluation module: On the basis of the evaluation indices com-
puted in the experiment module, the evaluation module compares
the clustering algorithms. It computes the means and standard
deviations of the indices and makes a paired t-test comparison.
4.2 Data set
The performance of clustering algorithms partly depends on the char-
acteristics of the data set. This section describes and discusses the data
set selected for the experiments.
4.2.1 Choice of data set
The data set chosen for the experiment is the KDD Cup 99 data set, which is publicly available. The KDD Cup 99 data set is a processed version of a data set developed under the sponsorship of the Defense Advanced Research Projects Agency (DARPA) of the USA, in 1998 and 1999, for the off-line evaluation of intrusion detection systems
(IDS) [17, 41]. Currently the DARPA data set is the most widely used
data set for testing IDS.
This DARPA project was prepared and executed by the Massachusetts
Institute of Technology (MIT) Lincoln Laboratory. MIT Lincoln Labs
set up an environment on a local area network that simulated a mili-
tary network under intensive attacks. The simulated network consists
of hundreds of users on thousands of hosts. Working in a simulated
environment made it possible for the experimenters to have complete
control of the data generation process. The experiment was carried out over nine weeks, and raw network traffic data, also called raw tcpdump data, was collected during this period.
The raw tcpdump data was then processed into the connection records
used in the KDD Cup 99 data set. The KDD Cup 99 data set contains
a rich variety of computer attacks. The full size of the KDD Cup 99 is
about ﬁve million network connection records. Each connection record
is described by 41 features and is labelled either as normal or as a
speciﬁc network attack. One of the reasons for choosing this data set
is that the data set is standard. This will make it easy to compare the
results of our work with those of similar works. Another reason is that it is difficult to get another data set which contains as rich a variety of attacks as the one used here.
Some criticisms have been made about the generation of the DARPA data set. One of the strongest criticisms was made by J. McHugh. The network traffic generated in the DARPA data set has two components: the background traffic data, which consists of network traffic data generated during the normal usage of the network, and the attack data. According to McHugh, the generation process of the background traffic data has not been described explicitly by the experimenters. Therefore, there is no direct evidence that the background traffic matches the normal usage pattern of the network to be simulated. He made similar criticisms about the generation of the attack data: the intensive attacks the network has been submitted to do not reflect a real-world attack scenario.
Although some of these criticisms are important and can be useful for the generation of future off-line intrusion detection evaluation data sets, the DARPA data set has many strengths which still make it the best publicly available data set for the evaluation of intrusion detection systems.
4.2.2 Description of the feature set
The choice of the feature set is crucial for the success of clustering.
The goal of this section is to describe the feature set, and its ability to
discriminate normal patterns and attack patterns.
Generally, the construction of efficient features for intrusion detection is done either manually or by a semi-automated process. The manual construction of features uses only security domain knowledge, while semi-automated feature construction automates part of the feature construction process. To our knowledge, none of the existing feature construction methods for network attack detection fully automates the feature construction process.
Stolfo et al. used a semi-automated approach to identify useful features for discriminating normal patterns from attack patterns.
Their approach is based on data mining concepts and techniques like
link analysis and sequence analysis. Link analysis determines the re-
lation between ﬁelds of a data record; sequence analysis identiﬁes the
sequential pattern between records. Their work has led to the feature
set describing the data set used in this thesis.
In the following, we will first describe the feature set, then give a brief explanation of the approach used to derive it, and finally present a discussion of the discriminative capability of the feature set. A short description of the feature set is available in appendix B.1. The full description can be found in [16, 14].
There are 41 features, which fall into diﬀerent categories: basic features
and derived features.
1. The basic features describe single network connections.
2. The derived features can be divided into content-based features
and traﬃc based features.
(a) The content-based features are derived using domain knowledge.
(b) The traffic-based features are obtained by studying the sequential patterns between the connection records as well as the correlation between basic features.
In order to construct the feature set, the raw tcpdump data has been
pre-processed into connection records. The basic features are directly
obtained from the connection records. The derived features fall into
two groups: the content-based features and the traﬃc based features.
Content-based features are used for the description of attacks that are
embedded in the data portion of the IP packet. The description of
these types of attacks requires some domain knowledge and cannot be
done only on the basis of information available in the packet header.
Most of these attacks are R2L and U2R attacks. Traﬃc based features
have been computed automatically; they are eﬀective for the detection
of DOS and probe attacks. The diﬀerent types of attack contained in
the data set are described in appendix C.
In order to derive the traﬃc features, Stolfo et al. made use of an
algorithm that identiﬁes frequent sequential patterns. The algorithm
takes the network connection records, described by the current basic
features, as input and computes the frequent sequential patterns. The
frequent episodes algorithm is executed on two diﬀerent data sets: an
intrusion-free data set and a data set with intrusions. Then these two
results are compared in order to identify intrusion patterns.
The derived features are constructed on the basis of patterns that only
appear in intrusion data records. Therefore, they are able to discrim-
inate between normal and intrusion connection records. Although ex-
perience shows that the feature set considered here discriminates well
between normal and intrusive patterns it has some limitations when it
is used for anomaly detection. Because the feature set has been derived
on the basis of intrusions in the training data set, the derived feature
set cannot describe attacks not included in the training data set. The
feature set is, therefore, more suitable for misuse detection than for
anomaly detection. Another limitation of the feature set is that it may
not discriminate well between normal data and attacks embedded in
the data portion of the data packet. The reason for this is that the
feature set has been constructed primarily on the basis of information
available in the header of the packet. The content-based features may
not describe correctly attacks embedded in the data portion of the IP
packet. These features have been derived from indices that characterize
the session between two communicating hosts. These indices may not
be sufficient to capture the full nature of an attack embedded in the data portion of the packet.
Scaling and normalization of the feature values
The purpose of scaling the feature values is to avoid a situation where
features with large and infrequent values dominate the small and fre-
quent values during the computation of distances. Normalization scales all the feature values into the range [0, 1]. Some examples of scaling
schemes are the linear scale, logarithmic scale, and scaling using the
mean and standard deviation of the feature.
The linear scale of a value x of feature j is:

Norm(x) = (x − min_j) / (max_j − min_j),

where min_j and max_j are respectively the minimum and the maximum value of feature j.
The logarithmic scale is NormL(x) = Norm(log(x) + 1).
And the third scaling scheme, based on the mean µ_j and standard deviation σ_j of feature j, is defined as: NormD(x) = (x − µ_j) / σ_j.
The advantage of the linear scale compared to the other two scaling
schemes is its simplicity. Furthermore, the linear scale normalizes the
feature values. For these reasons, the linear scale has been used for
scaling the feature values.
Handling categorical feature values
All the clustering algorithms considered in this thesis are appropriate
for numerical feature values. As the feature set of the KDD Cup 99
data set is a mixture of continuous and categorical values, we need
a mechanism for converting the categorical feature values to numeric
values. Converting from one feature type to another must be done with
care because it may result in the loss of information about the data.
This loss of information may aﬀect the discriminative capacity of the
resulting feature set.
One way of quantifying a categorical feature value is to replace it by its frequency. For example, consider the feature that describes the transport protocol used for communication; two of its possible values are TCP and UDP. The categorical feature value TCP is converted to 0.6 if 60% of the connection records use the TCP protocol.
This conversion scheme has been used earlier in the implementation of CLAD, a cluster-based anomaly detection system. It is
reasonable to use frequency for quantifying categorical feature values
because values that appear more frequently are less likely to be anoma-
lous. The frequency can help us to separate normal connection records
from attack connection records.
Another method for encoding categorical feature values is the so-called 1-to-N encoding scheme. In this scheme, each categorical feature is extended to N features, where N is the number of different values this feature can take. The value 1 (or 1/N in the normalized form) is set in the column corresponding to the observed value in the extended feature space, and the other columns of that feature in the extended feature space are set to 0 to mark the absence of those categories.
One of the problems with this encoding scheme is that it increases the
dimension of the data space. How serious this problem is, depends on
the number of categorical features and on the number of diﬀerent val-
ues each of them can take.
Once the categorical feature values have been converted to numerical values and the feature values have been normalized, the euclidean distance is used as the dissimilarity measure between instances.
Usage of the data set
This section describes how the data set is used for the experiments.
A 10% version of the KDD Cup data set is also available. We use this 10% version; it contains the same attack labels as the full version.
It has been constructed by selecting, from the original data set, 10% of
each of the most frequent attack categories and by keeping the smaller
attack categories unchanged. The advantage of using this version of the
data set is that the data set is smaller and therefore faster to process.
Working with the original data set would have made the execution of
the experiment impossible on the computational resources at our disposal.
About eighty percent of the data are attacks. Most of these attacks are DOS attacks: neptune and smurf. A large percentage of these data consists of duplicates. In order to reduce the size of the data set, we select only a low percentage of the smurf data and the neptune data. This new distribution of attack and normal labels is closer to a real-life scenario. Most research in unsupervised anomaly detection makes some assumptions about the data set; without such assumptions the task of unsupervised anomaly detection is not possible. The subset selected for the experiments consists of 10% attacks and 90% normal data. Table 4.1 shows the distribution of attack categories for this data set.
For each of the 10 phases in the ten-fold cross validation, each clustering
algorithm is run 3 times. We proceed in this way because most of
the algorithms are randomly initialised and the result of clustering is
dependent on the initial values. As mentioned above, the instances
of the data set are labelled: either as normal or as a speciﬁc attack
category. The labels are not used during clustering, they are only used
during the evaluation of the clustering algorithms.
4.3 Implementation issues
The implementations of the clustering algorithms have been kept simple, because we have focused on highlighting the basic ideas in each of the clustering algorithms. We have avoided optimisation techniques that could possibly influence the results.
In the implementation of two-level clustering, no significant difference has been observed between successively linking the two closest clusters until the desired number of clusters is reached (KHAC) and the linking approach described for KHADENS. So, for simplicity, KHAC has been used in our experiments. One possible reason why the two merging strategies produce similar results may be that the closest clusters have almost the same size.
attack type number percentage
normal 107011 89.94
back 2424 2.04
buﬀer overﬂow 33 0.03
ftp write 8 0.007
guess password 59 0.05
imap 13 0.011
ipsweep 1370 1.15
land 21 0.018
loadmodule 10 0.008
multihop 7 0.006
neptune 649 0.54
nmap 253 0.21
perl 3 0.002
phf 4 0.003
ping of death (pod) 290 0.24
portsweep 1146 0.96
rootkit 11 0.009
satan 1077 0.90
smurf 1693 1.42
spy 2 0.002
teardrop 1077 0.90
warezclient 1120 0.94
warezmaster 22 0.018
TOTAL 118980 100
Table 4.1: Distribution of labels in the data set
For each of the clustering algorithms, various tests have been performed
in order to select the best parameter values. The experiments have been
performed with the best parameter values identiﬁed.
4.4 Summary
This chapter has covered the design and execution of our experiments.
Special attention has been paid to the data set and feature set used.
– The data set used is a slightly modiﬁed version of the KDD data
set. The feature values have been scaled and normalized using a
linear scale. The categorical feature values have been transformed
to numeric values using a frequency encoding.
– For each of the clustering algorithms, different tests have been performed in order to choose the best set of parameters.
– The limitations of the feature set for unsupervised anomaly detection have been discussed. Firstly, the algorithm used for the construction of the features relies on the existence of an attack-free data set, but the fact that it is difficult to obtain an attack-free data set is the main motivation for performing unsupervised anomaly detection; so for the purpose of unsupervised anomaly detection we need some other method to compute the feature set. Secondly, for the purpose of anomaly detection, it is the normal traffic patterns we want to describe and not the attacks, so it would be more appropriate to construct features that describe the normal patterns rather than the attacks.
In the next chapter, we evaluate the clustering algorithms.
Evaluation of clustering methods
In this chapter, the studied clustering algorithms are compared experimentally.
– Section 5.1 describes the evaluation methodology used.
– Section 5.2 discusses the evaluation measures.
– Section 5.3 discusses the usage of the k-fold cross validation method.
– Section 5.4 presents and analyses the results of the experiments.
– Section 5.5 summarizes the chapter.
5.1 Evaluation methodology
The clustering algorithms are evaluated on the basis of external in-
dices. External evaluation is possible because data labels are available.
Because the considered clustering algorithms are instances of diﬀerent
clustering methods, external evaluation is the correct method for evalu-
ating the algorithms. Evaluating the algorithms on the basis of internal
indices, such as the sum of squared-errors, is not appropriate. This is
because internal indices are generally based on assumptions about the
clustering methods used or about the data set. For example, using
the sum of squared-errors as a measure of compactness and evaluating
CHAPTER 5. EVALUATION OF CLUSTERING METHODS 73
the clustering algorithms using it will provide favorable conditions for
squared-errors clustering algorithms such as kmeans.
The methodology used for comparing the clustering algorithms is described in the following. We use ten-fold cross validation. Different experiments in the clustering literature have shown that ten folds are appropriate when performing k-fold cross validation. The clusters produced during the
training phase are used as a classifier for the classification of the test data. The same assignment method and measure used for
assigning instances to clusters during training is used for assignment
during the test. The idea in using cross validation is to measure the
generality of the clusters produced during the training phase. The k-
fold cross validation method is described in the next section. As all
of the studied clustering algorithms are dependent on the initialisa-
tion values, for each pair of training and test data set of the ten-fold
cross validation, the clustering algorithms are run three times. Run-
ning the experiments will produce 30 values for each of the evaluation
indices and for each of the clustering algorithms. Then for each of the
clustering algorithms, the average of the 30 indices and the standard
deviation are computed. A paired t-test is used to compare each pair
of clustering algorithms. The paired t-test is used to estimate the sta-
tistical signiﬁcance of the diﬀerence in performance for each pair of
algorithms. In order to evaluate the performance of each of the stud-
ied clustering algorithms individually, each of them is compared to the
result of a random clustering of the data set. The random clustering is
done by assigning instances to clusters randomly.
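The averaging and significance testing described above can be sketched in a few lines. The following is a minimal Python illustration, not the implementation used in the thesis; the function name and the toy score lists are our own.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two matched lists of evaluation scores.

    In the setup described above, each list would hold 30 values: one
    score per (fold, run) pair of the ten-fold cross validation with
    three runs per fold."""
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)
```

The resulting statistic is compared against the t distribution with n − 1 degrees of freedom to decide whether the difference in performance between two algorithms is significant.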
The experiments are run for two different numbers of clusters: 23 and 49. 23 is the number of categories in the data set. We choose 49 arbitrarily in order to study how the algorithms perform with another number of clusters.
5.2 Evaluation measures
This section discusses the choice of evaluation measures.
CHAPTER 5. EVALUATION OF CLUSTERING METHODS 74
5.2.1 Evaluation measure requirements
The goal in clustering network traffic data for anomaly detection is to create clusters that ideally consist of a single category. The category is either a specific attack, such as smurf, or normal.
So the different types of attack/normal category identified by a clustering algorithm are a good indication of how well the algorithm performs this task. We do not expect the clustering algorithms to produce clusters that are 100% pure. So the category of a cluster is defined to be the most frequent label in the cluster. The percentage represented by this majority label is another indication of how pure that cluster is.
It is also useful to measure whether a cluster contains few attack categories or several.
These requirements lead to the choice of three evaluation measures that are studied in the next section. Each of these evaluation measures covers one of the requirements. The first one is the number of different categories: it counts the number of different attack or normal categories found by the clustering algorithms. The second is the classification accuracy: it computes the proportion of the most frequent label in each cluster. And the third measure is the cluster entropy, which estimates the homogeneity of the clusters.
5.2.2 Choice of evaluation measures
Some of the classical external validation measures found in the literature are the Jaccard, Hubert, Rand and Corrected Rand indices. But these measures do not match our requirement that they should measure the purity of clusters. The measures used in this thesis are: the count of cluster categories, the classification accuracy and the cluster entropy.
The count of cluster categories is the number of different cluster categories found by the clustering algorithm. The category of a cluster is defined as the most frequent label in the cluster.
The classification accuracy of a cluster is defined as the proportion represented by the label that is in the majority in this cluster. The overall classification accuracy of the clustering is defined as the weighted mean of the classification accuracies of the clusters produced by this clustering algorithm.
The cluster entropy is a measure from the literature that captures the homogeneity of clusters. Clusters which contain data from different attack classes have a higher entropy, and clusters which contain only a few attack classes have a low entropy, close to zero. The overall cluster entropy is the weighted mean of the cluster entropies.
The classification accuracy of a cluster is the proportion of the label most often found in that cluster. That is:

accuracy = (size of majority label) / (size of cluster)

And the overall classification accuracy of the clustering is the weighted mean of the classification accuracies of the clusters. The weight of a cluster is its size divided by the total number of instances.
The entropy at the cluster level captures the homogeneity of the cluster. The entropy of cluster i is defined as:

E_i = − Σ_j (N_ij / n_i) log(N_ij / n_i)

where n_i is the size of the i-th cluster and N_ij is the number of instances of cluster i which belong to the class label j. And the cluster entropy is the weighted sum of the cluster entropies:

E = Σ_i (n_i / N) E_i

where N is the total size of the data set and n_i is the number of instances in cluster i.
The cluster entropy is lowest when each cluster consists of a single data type, and it is highest when the proportions of the data categories in the clusters are all the same.
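The three measures above can be made concrete with a short sketch. This is a minimal Python illustration under our own naming (cluster_accuracy, cluster_entropy, weighted_mean); the thesis implementation is not reproduced here.

```python
import math
from collections import Counter

def cluster_accuracy(labels):
    """Classification accuracy of one cluster: the proportion of the
    most frequent (majority) label."""
    return max(Counter(labels).values()) / len(labels)

def cluster_entropy(labels):
    """Entropy of one cluster: 0 when the cluster is pure, maximal
    when every label is equally frequent."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(labels).values())

def weighted_mean(clusters, measure):
    """Overall value of a per-cluster measure, weighted by cluster
    size as in the definitions above."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * measure(c) for c in clusters)
```

The count of cluster categories is then simply the number of distinct majority labels over all clusters.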
5.3 k-fold cross validation
K-fold cross validation is used in classiﬁcation to evaluate the accuracy
of classifiers. It consists in randomly dividing the data set into k disjoint
subsets of approximately equal size. The classiﬁer is trained and tested
k times. Each training uses k-1 subsets and the subset left out is used
for testing. K-fold cross validation can be adapted to clustering.
As with classiﬁcation, the system is trained and tested k times. The
training consists in clustering k-1 of the subsets. The subset left out
is used for the test. The test is done by assigning instances of the
test data set to the clusters produced during training. The same as-
signment method and measure used for assigning instances to clusters
during training is also used for assignment during testing. During the
assignment in the test phase, the characteristics of the clusters are not
updated. For example, the means of the clusters are not recomputed.
So the performance of a clustering indicates how well the algorithm
performed during the training and the test phases.
Using an independent test data set makes it possible to evaluate the
robustness and the generality of the clusters produced by the clustering
algorithms. After all, the goal of the oﬀ-line clustering we perform is to
create clusters that will be used for performing a classiﬁcation of new
data. Therefore it is important that the clustering algorithms are able to correctly classify an independent data set.
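As an illustration of this frozen test phase, the sketch below assigns a test instance to the closest of a fixed set of cluster centres. It is a minimal Python sketch assuming squared Euclidean distance; the names are ours.

```python
def nearest_centre(x, centres):
    """Return the index of the closest cluster centre.

    During the test phase the centres are frozen: this function only
    reads them and never updates them, mirroring the procedure above."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centres)), key=lambda i: sq_dist(x, centres[i]))
```

The same assignment rule, with the same distance measure as used in training, classifies every instance of the held-out subset.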
5.4 Discussion and analysis of the experiments
This section analyses the results of the experiments.
5.4.1 Results of the experiments
Figures 5.1, 5.2, 5.3, 5.4, 5.5 and 5.6, located from page 77 to page 85, show the experiment results. The data used for the generation of
the histograms in these figures are found in appendix E. In these histograms, the notation L + clustAlgo, where clustAlgo is a clustering algorithm, refers to initializing clustAlgo with the centres of the clusters produced by the leader clustering algorithm. And the notation fuzzy K refers to the fuzzy kmeans algorithm.
Figure 5.1: The classiﬁcation accuracy of the clustering algorithms in tables
E.1 and E.2. L+kmeans refers to leader + kmeans and fuzzy K refers to
fuzzy kmeans. The number of clusters is 23.
When the number of clusters is 23, figures 5.1, 5.2 and 5.3 show that the two-level clustering (KHAC), and SOM or kmeans initialised with the clustering results of the leader clustering algorithm, give the
best classiﬁcation accuracies. The clusters identiﬁed by these cluster-
ing approaches are homogeneous: the majority of items in each of these
clusters are from the same attack/normal category. The clusters pro-
duced by these algorithms represent a wider variety of attack categories.
When initialised with the results of the leader clustering algorithm, the performances of kmeans and SOM are very similar.
When the number of clusters is 49, figures 5.4, 5.5 and 5.6 show that KHAC, leader clustering and the combination of leader clustering with any of the other algorithms, except EM, have the best classification accuracies, and the clusters found by these algorithms represent a larger variety of attack categories. Although initialising any of the other algorithms with the leader clustering improves the performance of the algorithm, these combinations do not perform significantly better than the leader clustering alone. And this is true for the classification accuracies, the cluster entropies and the number of cluster categories. Because most of the studied clustering algorithms, except the leader clustering, are slow, using only the leader clustering seems more appropriate than using any of the other algorithms, either alone or in combination with the leader clustering.
The homogeneity of the clusters produced by kmeans is slightly better than that of any of the other algorithms. The homogeneity of the clusters produced by fuzzy kmeans, EM-based clustering and CEM clustering is poorer than that of the other algorithms.
For both 23 and 49 clusters, each of the clustering algorithms outperforms random clustering. The performance of the EM-based clustering algorithm is not so impressive. Some of the conclusions that can be drawn from these results are that:
– The performance of the clustering algorithms depends on the number of clusters to be found. The difference in the performance of the clustering algorithms decreases as the number of clusters increases. This indicates that for a large number of clusters, other criteria, such as the running time, can be used to guide the selection of the clustering algorithm.
– The leader clustering is of significance in clustering network traffic data for anomaly detection: it achieves good performance for each of the evaluation measures considered, independently of the number of clusters to be found. Furthermore it is very fast, and using it for initializing the other algorithms improves the performance of these algorithms significantly. This improvement is impressive in the case of the SOM algorithm.
– When the desired number of clusters is small, the two-level clustering seems to be a good choice of algorithm.
These conclusions can be reformulated as follows:
– KHAC is a good choice of clustering algorithm when the desired
number of clusters is small.
– Leader clustering is more appropriate for a high number of clusters.
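For reference, the leader algorithm itself fits in a few lines. The following is our own minimal Python rendering, assuming Euclidean distance and a user-chosen threshold; the leaders it returns are what can seed kmeans or SOM as described above.

```python
def leader_clustering(points, threshold):
    """Single pass over the data: each point joins the first leader
    within `threshold`, otherwise it becomes a new leader itself.

    The single pass is what makes the algorithm fast and scalable:
    every point is examined exactly once."""
    leaders, assignment = [], []
    for p in points:
        for i, leader in enumerate(leaders):
            dist = sum((a - b) ** 2 for a, b in zip(p, leader)) ** 0.5
            if dist <= threshold:
                assignment.append(i)
                break
        else:  # no leader close enough: create a new one
            leaders.append(p)
            assignment.append(len(leaders) - 1)
    return leaders, assignment
```

Note that the number of clusters is controlled indirectly by the threshold rather than given explicitly.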
5.4.2 Analysis of the experiment results
It seems that the clustering algorithms that create clusters one at a time, e.g. leader clustering and KHAC, perform better than the others. One possible explanation is related to the skewed distribution of attack categories. The group of algorithms that performs poorly consists of algorithms that are randomly initialised, and their performance is dependent on the initial choice of cluster centres. When the initial cluster centres are selected randomly from the data set, the chance that representatives from different categories will be picked out is not equal for each category. The categories that are in the majority are more likely to be selected. This explains why initializing with leader clustering improves the performance of those algorithms. KHAC and leader clustering are not initialised in this way, so this problem does not affect them. It may be preferable to initialise the clustering algorithms by choosing totally random values rather than choosing random items from the data set. Another observation which tends to confirm the above explanation is the fact that the performances of most of the studied algorithms are similar for a high number of clusters. That is because with a high number of clusters the chance of selecting representatives of different attack categories as initial centres is high.
The good performance of KHAC can also be related to the fact that it is the only one of the studied algorithms that is able to detect clusters of arbitrary shape. The penalty for approximating the shape of clusters incorrectly is higher for large clusters than for small clusters. This could explain why KHAC has a good performance when the number of clusters is low.
The EM-based algorithm did not produce good results compared to most of the other algorithms. This was surprising, because most of the other algorithms can be explained as special cases of EM-based clustering. One possible explanation for the poor performance of EM-based clustering may be that the mixture of isotropic Gaussians does not match the underlying model of the data. But this explanation does not seem to hold, because the classification EM clustering, which also assumes that the components of the model are non-overlapping isotropic Gaussians, gives better results. We could not relate the poor performance of EM-based clustering to the fact that it assumes overlapping clusters. This is because the fuzzy kmeans algorithm, which also makes an assumption of overlapping clusters, has a much better performance.
We conclude that the EM-based clustering's poor performance is related to some parameters of the EM-based clustering algorithm that may not have been chosen correctly. For example, the numbers of clusters considered in our experiments may not be optimal for the EM-based clustering. Alternatively, it may simply be related to the fact that this clustering algorithm is not appropriate for this task. The EM-based clustering is also less attractive for this task because of its high computation time.
Figure 5.2: The number of diﬀerent cluster categories found by the algorithms
when the number of clusters is 23. The total number of labels contained in
the data set is 23.
Figure 5.3: The cluster entropies when the number of clusters is 23. The
cluster entropy measures the homogeneity of the clusters. The lower the
cluster entropy is the more homogeneous the clusters are.
Figure 5.4: The classiﬁcation accuracy of the clustering algorithms in tables
E.3 and E.4. The number of clusters is 49.
Figure 5.5: The number of diﬀerent cluster categories found by the algorithms
when the number of clusters is 49. The total number of labels contained in
the data set is 23.
Figure 5.6: The cluster entropies when the number of clusters is 49.
In this thesis, we have:
– discussed issues of network security and in particular issues con-
cerning unsupervised anomaly detection. We have discussed how
clustering can be used to solve this problem.
– discussed the clustering problem and the most common clustering methods. Examples of clustering algorithms have also been presented.
– implemented and compared classical clustering methods. The classical clustering algorithms considered for this study are: standard kmeans, fuzzy kmeans, Expectation Maximization (EM) based clustering, Classification Expectation Maximization based clustering, Kohonen self-organizing feature maps and leader clustering.
– investigated two combinations of clustering methods.
∗ The first one uses the results of the leader clustering algorithm for the initialization of each of the other studied algorithms. This method improves significantly the performance of these algorithms.
∗ The second combination is a technique we have proposed.
Essentially, this technique is a combination of kmeans and
CHAPTER 6. CONCLUSION 87
Hierarchical Agglomerative Clustering. We call this combi-
nation KHAC. The purpose of KHAC is to create a large
number of small clusters using kmeans and then merge these
small clusters in a similar fashion to hierarchical agglomer-
ative clustering. The advantage of this clustering technique
is its ability to detect arbitrarily shaped clusters. We found
that KHAC gives better results compared to most of the other
studied algorithms. The performance of KHAC is especially
impressive for small numbers of ﬁnal clusters.
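To make the second, agglomerative phase of this idea concrete, here is a minimal Python sketch. It assumes single-linkage merging of the small kmeans clusters, which is what lets arbitrarily shaped clusters form; the actual KHAC implementation is not reproduced here.

```python
def merge_to_k(clusters, k):
    """Merge small clusters (lists of points, e.g. produced by kmeans
    with a large k) until only k remain, always joining the closest
    pair under single linkage."""
    def single_link(c1, c2):
        # Smallest squared distance between any two points drawn from
        # the two clusters.
        return min(sum((a - b) ** 2 for a, b in zip(p, q))
                   for p in c1 for q in c2)
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        _, i, j = min((single_link(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i].extend(clusters.pop(j))
    return clusters
```

In the full two-level scheme, kmeans first produces many small clusters and this routine then reduces them to the desired final number.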
On the basis of our results, we can say that clustering can be successfully
used for unsupervised anomaly detection. Some of the clustering algo-
rithms are more appropriate for this task than others. We investigated
the potential of the leader clustering algorithm. This algorithm is very
simple and fast and produces good clustering results compared to most
of the other studied algorithms. When leader clustering is used for initializing the other clustering algorithms included in this thesis, the clustering results of these algorithms improve significantly.
The main goal of the thesis has been to investigate the eﬃciency of dif-
ferent classical clustering algorithms in clustering network traﬃc data
for unsupervised anomaly detection. The clusters obtained by cluster-
ing the network traﬃc data set are intended to be used by a security
expert for manual labelling. A second goal has been to study some
possible ways of combining these algorithms in order to improve their
performance. We can say that these goals have been achieved. The
results of our experiments have given us an indication of which cluster-
ing algorithms are good for this task and which ones are less suitable
for this task. Furthermore, we have studied ways of combining cluster-
ing ideas in order to eﬃciently solve the problem. We have found out
that, when the number of clusters is low, KHAC which is a combina-
tion of clustering concepts we have proposed, produces better results
than most of the other studied algorithms. Our data shows the potential of the leader clustering algorithm in performing this task. Clustering algorithms similar to the leader clustering algorithm have been successfully
used in some earlier works [6, 30] for clustering network traﬃc data.
The reasons for using this particular algorithm have not been explicitly stated in these works. In conclusion, we can say that leader clustering is to be preferred, not only because it is fast but also because it performs better than most of the other clustering algorithms. So leader-like clustering algorithms could be investigated further in future research on unsupervised anomaly detection. What makes them especially attractive is their scalability to large data sets. And KHAC seems attractive when the number of clusters is low.
One of the limitations of this thesis is that it has not been possible to validate the conclusions of the experiments against a real-life data set. This has not been possible because of the difficulties of acquiring such a data set.
6.4 Future work
This work will serve as a ﬁrst step in building a complete cluster-based
anomaly detection system.
A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur and J. Srivastava. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection.
A.K. Jain, M.N. Murty and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, 1999.
 Richard O. Duda, Peter E. Hart and David D. Stork. Pattern Classiﬁcation.
John Wiley & sons, second edition, 2001.
Eleazar Eskin. Anomaly Detection over Noisy Data using Learned Probability Distributions. Located at: http://citeseer.ist.psu.edu/eskin00anomaly.html.
 Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical So-
ciety B, 39, 1-38, 1977.
 Leonid Portnoy. Intrusion detection with unlabeled data using clustering.
located at: http://citeseer.ist.psu.edu/574910.html, 2001
E. Eskin, A. Arnold, M. Prerau, L. Portnoy and S. Stolfo. A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. Available at: http://www.cs.cmu.edu/~aarnold/ids/uad-dmsa02.pdf.
Dorothy E. Denning. An Intrusion Detection Model. IEEE Transactions on Software Engineering, Vol. SE-13, No. 2, February 1987, pages 222-232. Also located at: http://www.cs.georgetown.edu/~denning/infosec/ids-model.rtf.
 S. T. Brugger. Data mining methods for network intrusion detection,
University of California, Davis, appeared in ACM and available at:
S.B. Kotsiantis and P.E. Pintelas. Recent Advances in Clustering: A Brief Review.
W. Lee, S. Stolfo and K. Mok. Mining in a Data-flow Environment: Experience in Network Intrusion Detection. In Proc. 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 114-124, 1999.
 J. He, A.H. Tan, C.L. Tan, and S.Y. Sung. On Quantitative Evaluation
of Clustering Systems. In W.Wu, H. Xiong and S. Shekhar Clustering and
Information retrieval(pp. 105-133), Kluwer Academic Publishers, 2004.
 K. Kendall. A database of Computer Attacks for the Evaluation of Intrusion
Detection Systems, Master thesis, Massachusetts Institute of Technology,
 W. Lee and S.J. Stolfo, A Framework for constructing features and models for
intrusion detection systems, ACM Transactions on Information and System
Security, Vol.3 No.4, November 2000, pages 227-261.
 The internet traﬃc archive ( 2000): http://ita.ee.lbl.gov
 KDD cup 99. Located at: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
 DARPA. Located at: http://www.ll.mit.edu/IST/ideval/
D.P. Mercer. Clustering Large Datasets. October 2003.
I. Costa, F. de Carvalho and M.C.P. de Souto. Comparative Analysis of Clustering Methods for Gene Expression Time Course Data.
Boris Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers.
 Robert Rosenthal and Ralph L. Rosnow, Essentials of Behavioral Research,
Methods and Data Analysis, second edition, 1991.
 Jiawei Han and Micheline Kamber. Data Mining, Concepts and Techniques,
Morgan Kaufmann Publishers ,2001.
Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall.
 Giles Celeux and Gerard Govaert. A classiﬁcation EM algorithm for cluster-
ing and two stochastic versions, INRIA, 1991.
 Stuart J.Russel and Peter Norvig. Artiﬁcial Intelligence, a modern approach
, second edition , Prentice Hall, 2003.
 Thomas P. Minka. Expectation maximization as
lower bound maximization, 1998, tutorial located at:
Alexander Hinneburg and Daniel A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. American Association for Artificial Intelligence (www.aaai.org), 1998.
Wei Wang, Jiong Yang and Richard Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Twenty-third International Conference on Very Large Data Bases, pp. 186-195, Athens, Greece. Morgan Kaufmann, 1997.
Ben Krose and Patrick van der Smagt. Artificial Neural Networks, eighth edition, November 1996.
Philip K. Chan, Matthew V. Mahoney and Muhammad H. Arshad. Learning Rules and Clusters for Anomaly Detection in Network Traffic. Located at: http://www.cs.fit.edu/~pkc/papers/cyber.pdf. Florida Institute of Technology and Massachusetts Institute of Technology.
Sushmita Mitra and Tinku Acharya. Data Mining: Multimedia, Soft Computing and Bioinformatics. Wiley Interscience, 2003.
 Ludmila I. Kuncheva. Combining pattern classiﬁers, methods and algorithms.
Wiley Interscience, 2004.
A. Ben-Hur, A. Elisseeff and I. Guyon. A Stability Based Method for Discovering Structure in Clustered Data. In Proc. Pacific Symposium on Biocomputing, 2002, pp. 6-17.
Wenke Lee, Salvatore J. Stolfo and Kui W. Mok. Mining in a Data-flow Environment: Experience in Network Intrusion Detection. March 1999.
Costa et al. Comparative Analysis of Clustering Methods for Gene Expression Time Course Data. August 2004.
 Ron Kohavi. A study of Cross-validation and Bootstrap for Accuracy Estima-
tion and Model Selection, International Conference on Artiﬁcial Intelligence,
 Daniel Boley. Principal direction divisive partitioning. Data Mining and
Knowledge Discovery, 2(4) :325-344, 1998,
 Michalis Vazirgiannis, Maria Halkidi, Dimitrios Gunopulos. Uncertainty han-
dling and quality assessment in data mining, Springer 2003.
 Wenke Lee and S. J. Stolfo. Data Mining Approaches for Intrusion Detection,
 Stefano Zanero and Sergio M. Savaresi. Unsupervised learning techniques for
an intrusion detection system, ACM March 2004.
R. Lippmann, J.W. Haines, D.J. Fried, J. Korba and K. Das. The 1999 DARPA Off-line Intrusion Detection Evaluation. Lincoln Laboratory MIT, 2000.
John McHugh. Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory. ACM Transactions on Information and System Security, Vol. 3, No. 4, November 2000, pages 262-294.
 E. Eskin,M. Miller,Z. Zhong,G. Yi, W. Lee, and S. Stolfo. Adaptive model
generation for intrusion detection systems.
 http://www.cert.org/stats/cert stats.html#incidents
Martin Ester, Hans-Peter Kriegel, Jorg Sander and Xiaowei Xu. A Density-based Clustering Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
 Teuvo Kohonen. Self-organizing maps, 2nd edition Springer, 1997.
 A. Ultsch and C. Vetter. Self-organizing-feature-maps versus sta-
tistical clustering methods: A benchmark. University of Marbug.
Research Report 0994. located at: http://www.mathematik.uni-
 Ross J. Anderson. Security Engineering: A guide to building dependable
distributed systems. John Wiley & Sons, 2001.
 A. Wespi, G. Vigna and L.Deri. Recent Advances in Intrusion Detection.
5th International Symposium, Raid 2002 Zurich, Switzerland, October 2002
 D. Gollmann. Computer Security. John Wiley & Sons, 1999.
P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry.
Bjarne Stroustrup. The C++ Programming Language, third edition, Addison-Wesley.
DOS: Denial of service attacks.
OS: Operating systems.
IDS: Intrusion detection systems.
NIDS: Network intrusion detection systems.
pod: Ping of Death.
IP: Internet Protocol.
TCP: Transmission Control Protocol.
UDP: User Datagram Protocol.
ICMP: Internet control message protocol.
HTTP: Hypertext Transfer Protocol.
FTP: File Transfer Protocol.
In this thesis network traﬃc refers to transfer of IP packets through
network communication channels.
APPENDIX A. DEFINITIONS 95
Firewalls are security systems protecting the boundary of an internal network.
To broadcast a message
To broadcast a message consists in delivering that message to every
host on a given network.
A program that is used to test if a connection can be established to a given host.
A protocol specifies how modules running on different hosts should
communicate with each other.
A host is a synonym for computer.
A CGI (common gateway interface) script is a program running on a
server and which can be invoked by a client from the CGI interface.
Client and Server
On a network, a client is the host that requests service from another
host. And the host delivering the service is called the server.
A TCP connection is a sequence of IP packets ﬂowing from the packet
sender to the packet receiver under the control of a speciﬁc protocol.
The duration of the connection is limited in time.
Is a log obtained by monitoring network traffic. Different tools exist for sniffing network traffic. One such tool, which has been used for collecting the network traffic data used in this thesis, is the program called TCPDUMP.
Data mining is the process of extracting useful models from large volumes of data.
B.1 The feature set of the KDD Cup 99
Tables B.1, B.2 and B.3 respectively describe the basic features, the
content-based features and the traﬃc-based features of the KDD Cup
99 data set.
APPENDIX B. FEATURE SET 98
name of feature | description | feature type
duration | the length of the connection in seconds | continuous
protocol-type | the type of transport protocol used | symbolic
service | the network service, e.g. http | symbolic
src bytes | the number of bytes sent from source to destination | continuous
dst bytes | the number of bytes sent from destination to source | continuous
flag | indicates a normal or error status of the connection | symbolic
land | indicates whether source and destination are the same | symbolic
urgent | the number of urgent packets | continuous
wrong fragments | the number of wrong fragments | continuous
Table B.1: Basic features of the KDD Cup 99 data set
name of feature | description | feature type
hot | number of hot indicators | continuous
num failed logins | number of unsuccessful logins | continuous
logged in | indicates whether logged in successfully or not | symbolic
num compromised | number of compromised conditions | continuous
root shell | indicates whether a root shell is obtained or not | symbolic
su attempted | set to 1 if an attempt to switch to root is made, else 0 | symbolic
num roots | number of root accesses | continuous
num file creation | number of file creation actions | continuous
num shells | number of shell prompts | continuous
num access files | number of operations on access control files | continuous
num outbound cmds | number of outbound commands in an ftp session | continuous
is hot login | indicates whether the login is hot or not | symbolic
is guest login | indicates whether it is a guest login or not | symbolic
Table B.2: Content-based features
name of feature | description | feature type
count | number of connections to same host | continuous
serror rate | % con. to same host with SYN errors | continuous
rerror rate | % con. to same host with REJ errors | continuous
same srv rate | % con. to same host with the same service | continuous
diff srv rate | % con. to same host with different services | continuous
srv count | number of con. to the same service | continuous
srv serror rate | % con. to same service with SYN errors | continuous
srv rerror rate | % con. to same service with REJ errors | continuous
srv diff host rate | % con. to same service on different hosts | continuous
dst host count | number of connections to same host | continuous
dst host serror rate | % con. from dst. to same host with SYN errors | continuous
dst host rerror rate | % con. from dst. to same host with REJ errors | continuous
dst host same srv rate | % con. from dst. to same host with the same service | continuous
dst host diff srv rate | % con. from dst. to same host with different services | continuous
dst host srv count | number of con. from dst. to the same service | continuous
dst host srv serror rate | % con. from dst. to same service with SYN errors | continuous
dst host srv rerror rate | % con. from dst. to same service with REJ errors | continuous
dst host srv diff host rate | % con. from dst. to same service on different hosts | continuous
dst host same src port rate | % con. from dst. to the same source port | continuous
Table B.3: Traffic-based features
Here is a list of the computer attacks considered in this thesis:
C.1 Probe attacks
probes the network to discover available services on the network.
probes a host to ﬁnd available services on that host.
is a complete and ﬂexible tool for scanning a network either ran-
domly or sequentially.
is an administration tool; it gathers information about the net-
work. This information can be used by an attacker.
APPENDIX C. COMPUTER ATTACKS 101
C.2 Denial of service attacks
– Ping of death (pod) makes the victim host unavailable by sending
it oversized ICMP packets as ping requests.
is a denial of service attack against Apache webservers. The attacker sends requests containing many front slashes, the processing of which is time consuming.
A spoofed SYN packet is sent to the victim host, resulting in that host repeatedly synchronizing with itself.
A broadcast of ping requests with a spoofed sender address, which results in the victim being bombarded with a huge number of ping responses.
The attacker half-opens a number of TCP connections to the victim host, making it impossible for the victim host to accept new TCP connections from other hosts.
Confuses the victim host by sending it overlapping IP fragments: overlapping IP fragments are incorrectly dealt with by some older operating systems.
C.3 User to root attacks
This attack exploits a flaw in how SUNOS 4.1 dynamically loads modules. This flaw makes it possible for any user of the system
to get root privileges.
Exploits a bug in some PERL implementations on some earlier systems. The bug consists in these PERL implementations improperly handling their root privileges, leading to a situation where any user can obtain root privileges.
– Buﬀer overﬂow
Consists in overﬂowing input buﬀers in order to overwrite memory
locations containing security relevant information.
C.4 Remote to local attacks
Imap causes a buﬀer overﬂow by exploiting a bug in the authenti-
cation procedure of the imap server on some versions of LINUX.
The attacker gets root privileges and can execute an arbitrary se-
quence of commands.
– Ftp write
This attack exploits a misconﬁguration aﬀecting write privileges
of anonymous accounts on an FTP server.
This allows any ftp user to add arbitrary ﬁles to the FTP server.
Is an example of a badly written CGI script that is distributed with the Apache server. Exploiting this flaw allows the attacker to execute code with the privileges of the HTTP server.
The warezmaster attack is possible in a situation where write permissions are improperly assigned on an FTP server. When this is the case, the attacker can upload copies of illegal
APPENDIX C. COMPUTER ATTACKS 103
software that can then be download by other users.
The Warezclient attack consists in downloading illegal software
previously upload during a warezmaster attack.
C.5 Other attack scenarios
The four categories of attacks described above usually take place within
a short period of time.
In most realistic attack scenarios, however, the attacker performs his
attack over a longer period of time in order to minimize the chances of
detection and to make the attack more precise and successful.
These attack scenarios are performed by combining some of the basic
attack categories described above.
Here are some of these attack scenarios:
– Guessing passwords
– Making use of spy programs
A spy program monitors the activity on the victim host and makes
information available to the attacker.
– Making use of a rootkit
A rootkit is a program that hides the presence of other (malicious)
programs or data files. Spyware programs often make use of rootkits
in order to avoid detection by anti-spyware programs.
– Multihop attack
This attack first compromises a host on a network and then uses
that host to attack other hosts on the network.
D.1 Algorithm: Hill climbing
The hill-climbing algorithm is a local optimisation algorithm.
• Hill climbing algorithm. Let g(x) be the gradient of a function f(x).
In searching for the maximizer of f(x), the algorithm proceeds as follows:
- It starts with an arbitrary solution s_0 ∈ S, where S is the solution
space.
- Then a sequence {s_t, t ≥ 0} of solutions that approaches the maximizer
of f(x) is constructed. The sequence is defined by
s_{t+1} = α(s_t + g(s_t)) + (1 − α)s_t = s_t + αg(s_t), where α > 0.
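The update above amounts to gradient ascent with step size α. A minimal sketch (the objective f, its gradient g, the step size, and the iteration count are illustrative choices, not taken from the thesis):

```python
# Gradient-ascent hill climbing: iterate s_{t+1} = s_t + alpha * g(s_t).

def hill_climb(g, s, alpha=0.1, steps=1000):
    """Follow the gradient g from start point s to approach a maximizer."""
    for _ in range(steps):
        s = s + alpha * g(s)
    return s

# Example: f(x) = -(x - 3)^2 has gradient g(x) = -2(x - 3) and maximizer x = 3.
g = lambda x: -2.0 * (x - 3.0)
s_star = hill_climb(g, s=0.0)
```

With a fixed step size the iteration only converges to a local maximizer, which is why hill climbing is called a local optimisation algorithm.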
D.2 Theorem: Jensen’s inequality
Let f be a convex function defined on an interval I. Let x_1, ..., x_n ∈ I
and α_1, ..., α_n ≥ 0 with α_1 + ... + α_n = 1. Then
f(α_1 x_1 + ... + α_n x_n) ≤ α_1 f(x_1) + ... + α_n f(x_n) (D.1)
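As a quick numerical illustration (the convex function f(x) = x² and the random weights are arbitrary choices, not from the thesis), the inequality can be checked directly:

```python
# Numerical check of Jensen's inequality for the convex function f(x) = x**2:
# f(sum a_i x_i) <= sum a_i f(x_i) for weights a_i >= 0 summing to 1.
import random

f = lambda x: x * x
xs = [random.uniform(-5, 5) for _ in range(10)]
ws = [random.random() for _ in range(10)]
total = sum(ws)
ws = [w / total for w in ws]          # normalise: a_i >= 0, sum a_i = 1

lhs = f(sum(a * x for a, x in zip(ws, xs)))
rhs = sum(a * f(x) for a, x in zip(ws, xs))
assert lhs <= rhs + 1e-12
```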
D.3 Theorem: The Lagrange method
Let f : R^n → R and g : R^n → R be C^1, that is, f and g are differentiable
and their respective derivatives are continuous. Let α ∈ R^n be such that
∇g(α) ≠ 0 (∇g is the gradient of g). If α is an extremum of f under the
constraint g(x_1, ..., x_n) = 0, then there exists λ ∈ R such that
∇f(α) = λ∇g(α) (D.2)
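A small sanity check of condition (D.2) at a known constrained extremum; the functions f(x, y) = x + y and g(x, y) = x² + y² − 2 are illustrative choices, not from the thesis:

```python
# Maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - 2 = 0.
# The constrained maximizer is (1, 1); there grad f must equal lambda * grad g.

def grad_f(x, y):
    return (1.0, 1.0)          # gradient of f(x, y) = x + y

def grad_g(x, y):
    return (2.0 * x, 2.0 * y)  # gradient of g(x, y) = x^2 + y^2 - 2

x, y = 1.0, 1.0                # the constrained maximizer
gf, gg = grad_f(x, y), grad_g(x, y)
lam = gf[0] / gg[0]            # lambda recovered from the first coordinate
# The second coordinate must yield the same lambda: grad f = lambda * grad g.
assert abs(gf[1] - lam * gg[1]) < 1e-12
```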
Results of the experiments
Algorithms classiﬁcation accuracy cluster entropy nb of categories
random 0.899±0.0 0.548±0.0 1.0±0.0
kmeans 0.929±0.001 0.209±0.005 4.5±0.2
leader 0.937±0.001 0.240±0.004 7.3±0.1
EM 0.907±0.001 0.274±0.006 3.6±0.2
CEM 0.916±0.002 0.276±0.006 2.6±0.2
som 0.919±0.001 0.252±0.004 3.2±0.1
fuzzy kmeans 0.915±0.001 0.243±0.003 3.3±0.2
KHAC 0.954±0.001 0.204±0.003 9.4±0.1
Table E.1: Random initialisation (23 clusters)
Algorithms classiﬁcation accuracy cluster entropy nb of categories
leader + kmeans 0.941±0.001 0.194±0.004 8.0±0.2
leader + EM 0.909±0.0 0.268±0.003 3.3±0.1
leader + CEM 0.935±0.002 0.219±0.005 7.4±0.2
leader + som 0.944±0.001 0.187±0.004 8.0±0.2
leader + fuzzy kmeans 0.937±0.001 0.196±0.002 5.8±0.1
Table E.2: Experimental results of various classical algorithms, and of
combinations of those algorithms, run on a slightly modified KDD Cup 1999
data set. The number of clusters is set to the number of attack and normal
labels in the data set, which is 23. The results in Table E.1 are obtained
with random initialisation of the algorithms, and those of Table E.2
correspond to initialisation of the algorithms with leader clustering.
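The exact formulas behind the reported metrics are not restated in this excerpt; one common definition of cluster entropy, the size-weighted entropy of the class-label distribution inside each cluster, can be sketched as follows (that this matches the thesis's definition is an assumption):

```python
# Cluster entropy as the size-weighted entropy of label distributions
# within clusters; 0 means every cluster contains a single class.
from collections import Counter
from math import log2

def cluster_entropy(clusters, labels):
    """clusters, labels: parallel lists of cluster ids and class labels."""
    n = len(labels)
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        h = -sum((k / len(members)) * log2(k / len(members))
                 for k in counts.values())
        total += (len(members) / n) * h   # weight by cluster size
    return total

# A perfectly pure clustering has entropy 0.
assert cluster_entropy([0, 0, 1, 1], ['a', 'a', 'b', 'b']) == 0.0
```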
Algorithms classiﬁcation accuracy cluster entropy nb of categories
random 0.899±0.0 0.546±0.0 1.0±0.0
kmeans 0.954±0.003 0.123±0.005 7.9±0.3
leader 0.951±0.001 0.151±0.001 12.8±0.3
EM 0.927±0.002 0.204±0.008 5.8±0.3
CEM 0.930±0.004 0.253±0.026 5.6±0.5
som 0.929±0.003 0.198±0.008 4.8±0.2
fuzzy kmeans 0.935±0.002 0.184±0.006 6.1±0.4
KHAC 0.962±0.001 0.146±0.003 9.6±0.3
Table E.3: Random initialisation (49 clusters)
Algorithms classiﬁcation accuracy cluster entropy nb of categories
leader + kmeans 0.954±0.0 0.138±0.002 13.6±0.1
leader + EM 0.938±0.0 0.165±0.003 7.0±0.1
leader + CEM 0.952±0.0 0.150±0.002 12.7±0.2
leader + som 0.951±0.001 0.147±0.002 13.1±0.3
leader + fuzzy kmeans 0.953±0.001 0.146±0.002 10.0±0.1
Table E.4: Experimental results when the number of clusters is set to 49
and the algorithms are initialised with leader clustering.