
Using Clustering to perform Anomaly Detection for Intrusion Detection
Matt Helgren and Manish Katyal

Abstract
There are two main classes of techniques for building Intrusion Detection Systems: Signature based techniques and Anomaly based techniques. Systems based on Signature based techniques have been widely adopted and are commercially available. Anomaly Detection techniques are still in their infancy; while they hold great promise, several challenges must still be resolved. Their key strength in comparison to Signature based systems is that they can be used to detect attacks which have never been seen before. In this paper we review two Anomaly Detection based approaches: CLAD, an existing technique, and our own approach. Both approaches use Clustering techniques to detect "abnormal" data, which is flagged as attacks. We present the results of our experiments using these approaches with the DARPA dataset.

Overview
Computer networks and the Internet have become a very important part of how businesses operate and how we access information from our homes. The number of computing devices on these networks has increased dramatically, and even more importantly, the amount of business and personal data on these devices has also increased. With this, hackers have arrived on the scene, for reasons both financial and ideological, to try to compromise computer systems. In an attempt to keep our systems and networks secure, an entire industry has developed to produce firewalls, anti-virus, anti-spyware and intrusion detection systems.

Intrusion Detection Systems (IDS) are software and hardware systems that monitor network traffic for known attacks or abnormal behavior. Generally these systems can be described as signature based or non-signature based detectors, where signature based detectors match known attributes of an attack against observed network data. Non-signature based systems tend to be model based systems that classify network traffic behavior as attacks or non-attacks.

Other features that distinguish IDS systems are whether they operate on real-time data and whether they are host or network based. Real-time systems are able to assess an attack after seeing only a single packet or a few network packets. Non-real-time systems require all the data to be available beforehand in order to analyze it for attacks (a form of closed-world assumption). Host based IDS systems are those that use some amount of log or audit information from the host computer operating system. Network based IDS make no assumptions about the availability of host data and operate on network-level packet information. In this paper, we focus on non-real-time, network based IDS systems.

One of the best known signature based systems comes from an open-source project called SNORT. SNORT monitors all network packets and matches them against rules that have been defined by users. Almost all known network attacks currently have rules in SNORT, which makes it very effective at detecting general hacking attempts. The major drawback of signature based systems lies in the fact that an attack must be well known and must have a signature that has been distributed.

Given that hackers are constantly developing new attacks or variations on existing attacks, there is a very real need for IDS systems that are more flexible than the current signature approaches. Non-signature based systems have this flexibility in that they can treat the problem as a two-class classification problem (attack or no attack) and use standard data mining techniques to solve it. Anomaly Detection (AD) is now being studied extensively in this area for its ability to learn models of good data and then flag abnormal data as possible attacks. The drawback of AD is that it is not as precise as signature based systems, and the number of false positives is likely to be much higher. A good AD system for IDS is one that generates many true positives and few false positives, since every false positive requires a network administrator to manually validate the data, incurring cost.
Several papers [PES2001, MC02, ZKS2004] have proposed the use of Clustering to perform Anomaly Detection for IDS. Clustering has several strengths, most notably that it is an unsupervised technique: it does not require manual labeling of the training data, which supervised classification techniques need and which is impractical given the sheer volume of data. Supervised classification techniques also cannot detect attacks they have never seen, whereas Clustering, since it does not require labeled data, can be used to detect novel attacks [ZKS2004]. Clustering can further reduce the manual overhead of labeling each individual data item for supervised classification techniques: experts can label whole groups or clusters of similar data items at a time, increasing their productivity [ZKS2004]. Novel attacks found using Clusters can be analyzed by experts, and rules to detect them can then be written for Signature based systems.

The current approaches using Clustering differ in how they identify "abnormal" or anomalous data. The task of labeling clusters as normal or attack (abnormal) clusters is a challenging one, and there is no technique that we know of which will work in all circumstances. The technique outlined in [ZKS2004] uses the cluster size, the distance of the cluster to known "normal" clusters, and the purity of the clusters as factors in identifying clusters of attacks. The authors discovered that labeling clusters based on these factors was not reliable. For example, they labeled one Cluster as normal because it was pure and was close to the largest cluster found in the dataset; this labeling proved inaccurate, as the Cluster turned out to be an Attack Cluster containing a large number of Denial of Service attacks [ZKS2004]. The technique in [PES2001] requires the number of attacks to be limited to between 1 and 1.5% of the complete training data, an unrealistic assumption as demonstrated by the DARPA dataset. The approach presented in PHAD [MC02] clusters each feature independently during training and generates an overall anomaly score from the anomaly score of each individual feature, which can limit the types of attacks PHAD can find. CLAD [AC03] uses "a cluster's size and position relative to other clusters" to find anomalous data, trying to find what the authors classify as "strong outliers" that are "suspicious both at the global and local level".

The Clustering techniques also differ in the features selected for clustering. PHAD looks only at packet headers while ignoring application data; attacks such as apache2, which occur in the application payload, are thus undetectable by PHAD. CLAD operates solely on TCP connection network data and thus cannot detect ICMP or UDP attacks.

Problem Statement
Detect network intrusions by examining network packet data, using only AD clustering techniques, in a non-real-time manner.

Approach
We have focused on AD using Clustering to detect network-based attacks for the reasons
stated in the introduction.
Our approach is two-pronged:
• Investigate the use of CLAD to perform anomaly detection and determine its
current practical usability
• Investigate our proposed technique for self-labeling clusters as attack or normal
clusters

In the first approach we decided to emulate an existing technique, "Clustering for Anomaly Detection" (CLAD), to validate its effectiveness and understand any inherent weaknesses. CLAD is interesting as it defines a method to discriminate anomalous clusters based on statistical measures of the clusters.

In the second approach, we propose a technique for automatically classifying clusters as attack or normal clusters. Techniques proposed in [PES2001], [ZKS2004] and [AC03] make assumptions about the attack clusters (purity, size, inter-cluster distance, percentage of attacks relative to the overall dataset, etc.) in order to identify them, and as stated earlier there are problems with such assumptions. Our technique takes a different approach: we leverage a priori knowledge of attack-free data to label new data. A Cluster is labeled as a "normal cluster" or an "attack cluster" solely based on the percentage of known attack-free data it contains. We present an overview and analysis of our approach and the results of our experiments with the DARPA 1999 dataset [DARPA].

For all analyses the Lincoln Laboratory dataset has been used. This dataset was produced through a DARPA project in 1999 to simulate an Air Force base computer network and its possible intrusions. There are 5 weeks of data, of which the first and third weeks are attack free. The data itself is raw output from the UNIX packet sniffer tcpdump and is freely available on the web [DARPA]. Along with the data there is a catalog that details the types of attacks that were simulated and general information on where they occur in the datasets for evaluation. A very detailed paper exists that analyzes the data and the attacks [KK1999].

Approach 1 – Identifying attacks with CLAD Clustering

CLAD is based on summarized TCP connection information rather than on a per-packet basis. To produce this, the tcpdump data must be summarized on a per-TCP-connection basis. This is done by tracking every TCP connection opened, where a connection is identified by its source and destination IP addresses and TCP ports; all subsequent packets are associated with this connection through the closing TCP packet. Each connection found can then be summarized by its unique connection parameters (source and destination IP, source and destination TCP port number), the duration of the connection in seconds, the length of data transferred, the TCP flags from the first and last packets, and the first 10 bytes of payload sent over the connection.
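To make the summarization step concrete, the following minimal Java sketch shows how a connection might be keyed and accumulated; the class and field names are our own, not those of the original te.cpp code.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: field names are assumptions, not te.cpp's actual structures.
    class ConnectionSummary {
        String srcIp, dstIp;
        int srcPort, dstPort;
        long startTime, endTime;              // seconds; duration = endTime - startTime
        long bytesTransferred;                // total length of data transferred
        int firstFlags, lastFlags;            // TCP flags from first and last packets
        byte[] payloadPrefix = new byte[10];  // first 10 bytes of payload
    }

    class ConnectionTracker {
        private final Map<String, ConnectionSummary> open = new HashMap<>();

        // A connection is identified by its source/destination IPs and TCP ports.
        static String key(String srcIp, int srcPort, String dstIp, int dstPort) {
            return srcIp + ":" + srcPort + "->" + dstIp + ":" + dstPort;
        }

        // A full implementation would create a summary on the opening packet,
        // update byte counts and flags for each packet with the same key, and
        // emit the summary when the closing TCP packet is seen.
    }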

After the TCP summarization process, all features are transformed into continuous values based on their frequency of appearance for that feature. This transformation works on all features, whether they are nominal, string or already continuous. For example, if the TCP source address 192.168.0.1 appears in 5 instances in the summarized TCP data, all occurrences of the string "192.168.0.1" are replaced with the continuous value 5. The only exception is the start time, which is not used for the clustering process.
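As a minimal sketch of this frequency transform (the helper name is our own), each feature column can be converted independently:

    import java.util.HashMap;
    import java.util.Map;

    class FrequencyTransform {
        // Replace each raw value (nominal, string, or numeric treated as a
        // string) with the number of times it occurs in that feature's column.
        static double[] toFrequencies(String[] column) {
            Map<String, Integer> counts = new HashMap<>();
            for (String v : column) counts.merge(v, 1, Integer::sum);
            double[] out = new double[column.length];
            for (int i = 0; i < column.length; i++) out[i] = counts.get(column[i]);
            return out;
        }
    }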

Once the appropriate values are made continuous, they are quantized and scaled. The quantization process puts all the continuous values on a log scale, as the authors of CLAD found that the values exhibited a power-law distribution in which small values are much more frequent than larger values. To avoid negative values from the log function, 1 is added to every value before taking the log. Once values are quantized, they are scaled to the range [0,1]. This is done by finding the min and max values for each attribute and applying the function scale(y) = y / (MAXy - MINy).
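A sketch of the quantize-and-scale step under the same assumptions, following the scale(y) formula quoted above:

    class QuantizeScale {
        // Put values on a log scale; adding 1 first keeps the log non-negative.
        static void logQuantize(double[] col) {
            for (int i = 0; i < col.length; i++) col[i] = Math.log(col[i] + 1);
        }

        // Per-attribute scaling with the paper's scale(y) = y / (MAXy - MINy).
        static void scale(double[] col) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double v : col) { min = Math.min(min, v); max = Math.max(max, v); }
            if (max > min) for (int i = 0; i < col.length; i++) col[i] /= (max - min);
        }
    }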

The instance clustering process in CLAD is based on assignment to the closest cluster using a Euclidean distance measure. Instances are only assigned to clusters if they are within a predefined width of the cluster centroid, and an instance can be assigned to multiple clusters if it is within that distance of multiple centroids. This was determined to be necessary because cluster density is an important part of determining outliers, and mutually exclusive clusters might arbitrarily affect density. The basic clustering process is outlined in the following steps:

1. Start with an empty cluster set.
2. For each instance d:
3.   For each cluster c:
     a. If the distance from c to d is less than width, assign d to c.
4.   If d is not assigned to any cluster, create a new cluster c with d as its centroid.
5. For each instance d:
6.   For each cluster c:
     a. If the distance from c to d is less than width, assign d to c.

Essentially the assignment process is repeated twice, with the second pass assigning instances to any clusters that may have been created after the instance was processed the first time. The computational complexity of the algorithm is O(d * c * #features), where d ranges over instances and c over clusters.
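A minimal Java sketch of this two-pass assignment (class names are our own; distance() is the Euclidean measure described above):

    import java.util.ArrayList;
    import java.util.List;

    class CladClusterer {
        List<double[]> centroids = new ArrayList<>();
        List<List<double[]>> members = new ArrayList<>();

        void cluster(List<double[]> data, double width) {
            // Pass 1: join every cluster within `width`; otherwise seed a new one.
            for (double[] d : data) {
                boolean assigned = false;
                for (int c = 0; c < centroids.size(); c++) {
                    if (distance(centroids.get(c), d) < width) {
                        members.get(c).add(d);
                        assigned = true;
                    }
                }
                if (!assigned) {
                    centroids.add(d);
                    List<double[]> m = new ArrayList<>();
                    m.add(d);
                    members.add(m);
                }
            }
            // Pass 2: rescan so instances also join clusters created after them.
            for (double[] d : data)
                for (int c = 0; c < centroids.size(); c++)
                    if (distance(centroids.get(c), d) < width && !members.get(c).contains(d))
                        members.get(c).add(d);
        }

        static double distance(double[] a, double[] b) {  // Euclidean distance
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }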

Cluster width is determined by taking a random sample of 1% of the data set and computing the pair-wise distances of those points. The distances are then sorted, the smallest 1% of the pair-wise values are averaged, and that average is used as the cluster width.
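A sketch of this width computation under the same assumptions, reusing distance() from the clustering sketch above:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    class WidthEstimator {
        // Average of the smallest 1% of pair-wise distances in a 1% random sample.
        static double estimateWidth(List<double[]> data, long seed) {
            Random rng = new Random(seed);
            List<double[]> sample = new ArrayList<>();
            for (double[] d : data) if (rng.nextDouble() < 0.01) sample.add(d);

            List<Double> dists = new ArrayList<>();
            for (int i = 0; i < sample.size(); i++)
                for (int j = i + 1; j < sample.size(); j++)
                    dists.add(CladClusterer.distance(sample.get(i), sample.get(j)));

            Collections.sort(dists);
            int k = Math.max(1, dists.size() / 100);  // smallest 1% of the distances
            double sum = 0;
            for (int i = 0; i < k; i++) sum += dists.get(i);
            return sum / k;
        }
    }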

To determine which clusters represent outliers, and thus anomalies, CLAD defines two measures that are used in its discriminant function. The first is the Inter Cluster Distance (ICD), the average distance between all cluster centroids; the theory is that anomalies will be separated from other clusters by significantly more than the average separation. The second measure is the Median Absolute Deviation (MAD) of cluster size; the authors use MAD for cluster size as it is more robust to outliers and skewed distributions. A cluster is considered an outlier when it is more than one standard deviation away from the average ICD and its size is more than one MAD from the median cluster size. These clusters are considered to be distant and sparse/dense clusters, which represent outliers in the data.
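The discriminant could be sketched as follows (helper names are ours, and distance() is reused from the clustering sketch; we measure size deviation from the median, matching the Median Size and MAD statistics reported in Table 1):

    import java.util.Arrays;

    class OutlierTest {
        // Flags clusters that are both distant (ICD more than one standard
        // deviation above the mean) and sparse/dense (size more than one MAD
        // from the median cluster size).
        static boolean[] findOutliers(double[][] centroids, int[] sizes) {
            int n = centroids.length;
            double[] icd = new double[n];  // average distance to the other centroids
            for (int i = 0; i < n; i++) {
                double s = 0;
                for (int j = 0; j < n; j++)
                    if (j != i) s += CladClusterer.distance(centroids[i], centroids[j]);
                icd[i] = s / (n - 1);
            }
            double mean = Arrays.stream(icd).average().orElse(0);
            double sd = Math.sqrt(Arrays.stream(icd)
                    .map(v -> (v - mean) * (v - mean)).average().orElse(0));

            double[] sorted = Arrays.stream(sizes).asDoubleStream().sorted().toArray();
            double median = sorted[n / 2];
            double[] devs = Arrays.stream(sizes).asDoubleStream()
                    .map(v -> Math.abs(v - median)).sorted().toArray();
            double mad = devs[n / 2];  // median absolute deviation of cluster size

            boolean[] outlier = new boolean[n];
            for (int i = 0; i < n; i++)
                outlier[i] = icd[i] > mean + sd && Math.abs(sizes[i] - median) > mad;
            return outlier;
        }
    }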

To determine the usefulness of the CLAD method and to gain hands-on experience with these techniques, we implemented CLAD as a clusterer for Weka [WEKA], an open source data mining toolkit containing many machine learning techniques. We chose Weka because it provides a standard platform for integrating and transforming data and for analyzing results. The implementation is entirely in Java, except for the TCP summarization code "te.cpp" [FIT], which was reused from the Florida Institute of Technology. Additionally, an external Java program was developed to evaluate the outliers detected by the Weka clusterer against the DARPA disclosed attack list.

The general process for generating results was the following:

1. Obtain the 1999 DARPA data files from the Lincoln Laboratory web site [DARPA] for weeks 4 and 5.
2. Download the master condensed attack list for 1999 from the same web site.
3. Use the "te" TCP connection summarization program from the FIT website [FIT] to generate TCP summarized instance data.
   a. The program was modified to produce the Weka ARFF file format.
   b. The program was modified to filter data per defined TCP port.
4. Cluster and generate outliers with the CLAD Weka clusterer we wrote.
5. Analyze the outliers against the master attack list to determine the overall true positives and false positives.

Some important things to note about the analysis:

• The CLAD authors determined that their technique was more effective when the data was partitioned and analyzed by TCP port number. Because of this, all results are reported by port number.

• Weeks 1 and 3 contain attack-free data, while weeks 4 and 5 include attack data. The CLAD authors decided to include weeks 1 and 3 even though attack-free data is not strictly necessary; apparently the additional data was useful to make outliers more pronounced. In our analysis we found that adding weeks 1 and 3 to the data set decreased true positives, so our results include only weeks 4 and 5.

Evaluation of a true positive can be done in several ways with the DARPA data, as correlation to the master attack list is somewhat loose: the correlation is done only through the time and destination IP identifiers.

Table 1: Results from CLAD implementation

Port | TP/FP | Possible | # Clusters | # Outliers Detected | Cluster Width | ICD Ave. | ICD SD | Median Size | MAD
20   | 6/17  | 54       | 102        | 23                  | 0.0           | 3.25     | .24    | 2           | 1
21   | 14/7  | 66       | 200        | 24                  | .22           | 1.88     | .46    | 2           | 1
23   | 22/85 | 97       | 509        | 111                 | 0.0           | 2.28     | .57    | 2           | 1
25   | 28/64 | 107      | 1036       | 120                 | .24           | 2.08     | .63    | 3           | 2
53   | 11/23 | 15       | 3675       | 94                  | 0.0           | .28      | .34    | 1           | 0
79   | 5/16  | 33       | 127        | 23                  | 0.0           | 2.27     | .60    | 2           | 1
80   | 29/96 | 68       | 807        | 188                 | .18           | 2.43     | .42    | 2           | 1
110  | 2/1   | 6        | 40         | 3                   | 0.0           | 2.61     | .96    | 18.5        | 17.5

Table 1 is a breakdown of the results obtained by our CLAD implementation per port, along with the statistics for the analysis of each data set. Cluster width refers to the width that CLAD computes for clustering. One observation was that the overall SSE (Sum of Squared Errors) for the resulting clusters was either very small or 0, implying that clusters were generally made up of identical data items. In those cases the cluster width parameter had no effect on the number of clusters.

Figure 1: Detection versus False Alarms curve for port 80 data, weeks 4 and 5 (plot omitted; x-axis: False Positives, 0-1000; y-axis: True Positives, 0-60)

Figure 1 shows a DFA (Detection versus False Alarms) curve for CLAD analyzing port 80 data from the DARPA data set, weeks 4 and 5. Results were obtained by analyzing the clustering for outliers while varying the threshold for a distant outlier (typically the average ICD plus its standard deviation) between zero and three times the average ICD.

Based on the results generated (Table 1) and a comparison with the results in the CLAD paper, the technique appears moderately effective: it correctly identifies roughly one attack for every 2-3 it incorrectly identifies. The issues identified through this work concern CLAD's ability to scale to real-time traffic, its handling of moderately frequent attacks, and the reliability of its results. These concerns are discussed in the following paragraphs.

Applying this technique to real-time network analysis would require storing weeks of summarized TCP connections. While this is feasible from a storage standpoint (2 weeks of DARPA data with 25k connections is around 2.5MB), the issue is how quickly the data could be analyzed after the connections and attacks are made. The problem with TCP connection analysis is that a connection may not end until minutes or hours after it opens, which adds delay between the time a possible attack is started, the analysis is done, and the alarm is generated.

Only attacks that exhibit the sparse or dense behavior can be detected; any attack that resides in a cluster of median or near-median size will not be. In real terms this means that, given a clustering where the average cluster size is 3 and the median absolute deviation is 2, any attack that is repeated exactly 2-4 times cannot be detected as long as all attributes are held constant. This is due to how the average and MAD values are used to discriminate outliers and attacks (described above).

A final observation about CLAD is that all results from the DARPA data are somewhat suspect, as the evaluation of results is somewhat loose. In our analysis, the evaluation program would match attacks that could not possibly have occurred in the partition of the data set being used. This happens because matches are done only by time and destination IP, so it is possible to match an attack by "accident". Future DARPA research should use a more rigorous evaluation method.

Future work around CLAD might be to find another AD technique that complements CLAD in areas where it has shortcomings, such as its "blind spot" with cluster sizes. An ensemble approach could allow CLAD to work with other techniques to vote on the likelihood of attacks. Given the end use of the system, if false positives were very undesirable, the ensemble could report only those attacks that every technique agrees is an attack.

Approach 2 – Identifying attack data using Clustering


In the following section, we describe our proposed technique for identifying attack data using Clustering. The technique assumes that attack-free training data is available; the objective is to leverage this knowledge to identify clusters of attack data in the testing dataset. The entire dataset, comprising training and testing data, is clustered. The resulting clusters are analyzed and labeled as either "normal" or "attack" clusters using a simple heuristic: for each Cluster we calculate the percentage of the data that is known to be attack-free, and if that percentage exceeds a threshold defined by a global parameter, the Cluster is labeled a normal Cluster; otherwise it is an attack cluster. Our hypothesis is that Clusters contain instances that are more similar to each other than to instances in other Clusters, so if a Cluster contains a "decent" representation of known attack-free data, the rest of the data in the Cluster must be attack-free as well; otherwise, it is an "attack" Cluster. In future work, we plan to automatically derive this threshold parameter, since it is a function of the ratio of how much we know (a priori attack-free data) to how much we don't know (the size of the test dataset).
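A minimal Java sketch of this labeling heuristic (class and parameter names are our own):

    class ClusterLabeler {
        // knownNormal[i]: instances in cluster i that come from known attack-free
        // data; total[i]: cluster i's size; threshold: the global parameter
        // (we used 0.10 in the experiments below).
        static String[] label(int[] knownNormal, int[] total, double threshold) {
            String[] labels = new String[total.length];
            for (int i = 0; i < total.length; i++) {
                double fraction = total[i] == 0 ? 0 : (double) knownNormal[i] / total[i];
                labels[i] = fraction >= threshold ? "normal" : "attack";
            }
            return labels;
        }
    }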

The process outlined above is iterative. We start with a set of data that we know is attack-free. The dataset that needs to be analyzed is split into several smaller subsets, and one by one the subsets are clustered along with the known attack-free data. After every clustering step, our knowledge of what is normal (attack-free data) and abnormal (attacks) increases, which should improve our results. The size of a subset is a function of the amount of attack-free data available. In future work, we plan to devise a mechanism to automatically derive this parameter (the training data maximum size), as our algorithm is extremely sensitive to it.
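A sketch of this iterative loop, with the clustering step abstracted as a function parameter since any clusterer could be plugged in (the structure is ours, not a fixed API; instances are compared by reference for brevity):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    class IterativeLabeler {
        // `cluster` maps a batch of instances to a list of clusters. Members of
        // clusters labeled "normal" grow the attack-free set for the next round.
        static List<double[]> run(List<double[]> attackFree,
                                  List<List<double[]>> testSubsets,
                                  Function<List<double[]>, List<List<double[]>>> cluster,
                                  double threshold) {
            List<double[]> known = new ArrayList<>(attackFree);
            for (List<double[]> subset : testSubsets) {
                List<double[]> batch = new ArrayList<>(known);
                batch.addAll(subset);
                for (List<double[]> c : cluster.apply(batch)) {
                    long normal = c.stream().filter(known::contains).count();
                    if ((double) normal / c.size() >= threshold)   // normal cluster:
                        for (double[] d : c)                       // fold members into
                            if (!known.contains(d)) known.add(d);  // the attack-free set
                }
            }
            return known;  // instances never folded in are flagged as attack data
        }
    }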

To test our hypothesis, we chose the DARPA 1999 dataset. The DARPA dataset contains
millions of records. As a simplification, we restricted our analysis to the ICMP protocol
data for Week 3 (attack free data), Week 4 and Week 5. Table 2 shows the breakdown of
the ICMP data by the weeks.
Table 2: Records breakdown by week

Week                                  | Number of ICMP Records (packets)
Week 3 (attack-free data)             | 7,169
Week 4 (attacks and attack-free data) | 66,414
Week 5 (attacks and attack-free data) | 18,042
Total                                 | 91,625

This restricted data set contains a large number of attacks: "smurf", "Ping of Death" (pod) and "satan". Smurf and pod are denial of service attacks, while satan is a probe attack [KK1999]. A denial of service attack is "an attack in which the attacker makes some computing or memory resource too busy to handle legitimate requests, or denies legitimate users access to a machine" [KK1999]. In pod, the intent is to make hosts behave in an unpredictable manner by sending them several large ICMP packets. In the smurf attack, an attacker tries to flood a victim with a large number of ICMP "echo reply" packets. It does this by sending ICMP "echo request" packets, with the source address spoofed to be that of the victim, to the broadcast addresses of a number of subnets. This causes the machines listening on those subnets to respond in large numbers to the victim with ICMP "echo reply" packets. In a probe attack, an attacker scans a network of computers to either gather information or find known vulnerabilities [KK1999]. Satan is used to scan for vulnerabilities in services such as FTP, X Server, NFS, etc., and uses a variety of network protocols in addition to ICMP.
Table 3: Available ICMP data features

Feature Name         | Data Type | Description
Time                 | Numeric   | UTC time in seconds since 1970
Ethernet Size        | Numeric   |
Ethernet Source      | String    | MAC address of the source
Ethernet Destination | String    | MAC address of the destination
Ethernet Protocol    | Nominal   | IPv4 or IPv6
IP Header Length     | Numeric   | Size of the IP header
IP TOS               | String    | Type of Service flags
IP Length            | Numeric   | Length of the packet
IP Fragmentation ID  | Numeric   |
IP Protocol          | Nominal   | Either TCP, UDP or ICMP
IP Source            | String    | IP address of the source
IP Destination       | String    | IP address of the destination
ICMP Type            | Nominal   | Value indicates "Echo Request", etc.
ICMP Code            | Nominal   | Used in conjunction with the type to determine the intent of the message
ICMP Checksum        | String    | To ensure message validity

Table 3 shows the 15 features captured per ICMP data record in the DARPA 1999 tcpdump dataset. Because of a flaw in the implementation of the hacking tools, several of the ICMP records had an invalid checksum of "0x0000". Such packets would typically be discarded as invalid. We decided, however, to retain these records and disregard the ICMP Checksum value, since a large percentage of the attacks had an invalid checksum value.
We performed feature extraction using the techniques outlined in the CLAD paper. This involved transformation, quantization and scaling to convert the feature values to the [0,1] range. We then reduced the number of features to ten: Time, Ethernet Protocol, IP Header Length, IP Protocol and ICMP Checksum were removed. The floor function used in the feature extraction process made the value of Time the same in all instances; the values of Ethernet Protocol, IP Header Length and IP Protocol were constant across all ICMP IPv4 instances; and ICMP Checksum was removed because it contained invalid values. After feature selection and extraction, the dataset was converted to the CLUTO format.

For Clustering we used gCLUTO [gCLUTO], a graphical clustering toolkit for low and high-dimensional datasets. It can effectively process relatively large volumes of data (on the order of tens of thousands of records) and provides graphical tools to analyze the characteristics of the various clusters. It is built on top of the CLUTO clustering library [CLUTO], which provides three main types of Clustering algorithms: partitional, agglomerative and graph-partitioning. We used the "Repeated Bisection" partitional Clustering method, chosen based on the experiments we conducted; it has low computational requirements and can cluster large, high dimensional datasets.

Figure 2 - E1 Criterion Function:

  minimize E_1 = \sum_{i=1}^{k} n_i \frac{\sum_{v \in S_i,\, u \in S} sim(v,u)}{\sqrt{\sum_{v,u \in S_i} sim(v,u)}}

The parameters for this method are k, the number of clusters; the Clustering criterion; the similarity function; and the Cluster selection technique. The algorithm generates k clusters by performing k-1 repeated bisections: the dataset is first clustered into two groups, and then, based on the Cluster selection technique, one of these groups is selected and bisected further. This continues until the desired number of clusters is found. Clustering is viewed as an optimization problem that attempts to minimize the Clustering criterion function. After experimenting, we chose the E1 criterion function shown in Figure 2. In the equation above, k is the number of clusters, S is the set of all records to be clustered, Si is the set of records assigned to the ith cluster, ni is the number of records in the ith cluster, u and v represent two records, and sim(v,u) is the similarity between the two records. We used the correlation coefficient as the similarity function.
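Although we drove the clustering through the gCLUTO GUI, the underlying CLUTO toolkit also ships a command-line program, vcluster. A roughly equivalent batch invocation would look like the following (illustrative only; icmp.mat stands in for the matrix file produced by our feature extraction, and 20 is the number of clusters):

    vcluster -clmethod=rb -crfun=e1 -sim=corr icmp.mat 20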

Table 4: Clustering Results

Cluster Number | Week 3 Data (attack free) | Week 4 Data | Week 5 Data | % of known attack-free data | Cluster label
0              | 0                         | 65          | 1058        | 0%                          | Attack
1              | 3                         | 6           | 2667        | 0.11%                       | Attack
2              | 0                         | 2030        | 0           | 0%                          | Attack
3              | 0                         | 0           | 5504        | 0%                          | Attack
4              | 0                         | 0           | 1344        | 0%                          | Attack
5              | 1                         | 2415        | 6           | 0.04%                       | Attack
6              | 0                         | 2408        | 496         | 0%                          | Attack
7              | 853                       | 517         | 647         | 42.29%                      | Normal
8              | 131                       | 74          | 116         | 40.81%                      | Normal
9              | 0                         | 2405        | 0           | 0%                          | Attack
10             | 1152                      | 1712        | 974         | 30.02%                      | Normal
11             | 109                       | 598         | 142         | 12.84%                      | Normal
12             | 853                       | 413         | 548         | 47.02%                      | Normal
13             | 28                        | 26          | 20          | 37.84%                      | Normal
14             | 164                       | 62          | 261         | 33.68%                      | Normal
15             | 683                       | 347         | 584         | 42.32%                      | Normal
16             | 2440                      | 1098        | 2937        | 37.68%                      | Normal
17             | 726                       | 460         | 707         | 38.35%                      | Normal
18             | 0                         | 36,743      | 10          | 0%                          | Attack
19             | 26                        | 15,035      | 21          | 0.17%                       | Attack

Table 4 above shows the results of Clustering. For each Cluster, we provide the breakdown of the instances by the subset (the week) they belong to, and based on this breakdown we determine whether the Cluster is an "Attack" or a "Normal" cluster. Our threshold parameter was 10%: any cluster in which 10% or more of the instances were known to be attack-free was judged to be normal. Based on this heuristic, we found 10 of the 20 Clusters to be "Attack" Clusters.
Table 5: Validation of Results

Cluster Number | Label  | Actual Label | Number of Attack Instances | Class Purity
0              | Attack | Attack/POD   | 1002                       | 89.23%
1              | Attack | Attack/Smurf | 2653                       | 99.14%
2              | Attack | Attack/Smurf | 2030                       | 100.00%
3              | Attack | Attack/Smurf | 5500                       | 99.93%
4              | Attack | Attack/Satan | 1340                       | 99.70%
5              | Attack | Attack/Smurf | 2415                       | 99.71%
6              | Attack | Attack/Smurf | 2544                       | 87.60%
7              | Normal | Normal       | 1                          | 99.5%
8              | Normal | Normal       | 0                          | 100%
9              | Attack | Attack/Smurf | 2405                       | 100%
10             | Normal | Normal       | 0                          | 100%
11             | Normal | Normal       | 0                          | 100%
12             | Normal | Normal       | 1                          | 99.94%
13             | Normal | Normal       | 0                          | 100%
14             | Normal | Normal       | 0                          | 100%
15             | Normal | Normal       | 0                          | 100%
16             | Normal | Normal       | 0                          | 100%
17             | Normal | Normal       | 0                          | 100%
18             | Attack | Attack/Smurf | 36695                      | 99.84%
19             | Attack | Attack/Smurf | 14986                      | 99.36%

Table 5 above shows the correlation of our results with the actual attacks in the dataset. All the attack instances were accounted for, and accounted for accurately, as indicated by the class purity.
Table 6: Confusion Matrix

        | Classified as Attack | Classified as Normal
Attacks | 71,572               | 2
Normal  | 671                  | 19,382
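Read as rates (our arithmetic from Table 6), 71,572 of the 71,574 attack instances were classified as attacks, a detection rate of about 99.99%, while 671 of the 20,053 normal instances were misclassified as attacks, a false alarm rate of about 3.3%.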
Figure 3: Mountain Visualization of the Clusters

Figure 3 shows the mountain visualization of the Clusters found, generated using gCLUTO. Since the dataset contains a very large percentage of attacks, the normal clusters are small and isolated, while the attack clusters lie very close to each other (with the exception of attack Cluster 0).

We believe that the results are promising: we found all the attack clusters with a moderate number of false alarms. There are, however, several challenges that we plan to address in future work. Our current technique is not scalable, since we require all the data instances to be present in memory for analysis. As noted earlier, we "learn" in an incremental manner: the technique could process "windows" of data at a time, where a window could be the data collected over a week or a month, depending on the amount of data being collected. Every time we process a window or subset of the data, our knowledge of what is attack-free (normal) data increases. While we believe this will improve our results, it is also a scalability problem, as each individual instance must be retained for future clustering. We believe an incremental, online Clustering algorithm such as BIRCH [ZR96], which uses summarized information rather than individual records to perform clustering, would be the solution.

The technique depends on the right choice of parameters: the number of clusters, the threshold parameter that determines whether a cluster is an attack cluster or a normal cluster, and the training data maximum size parameter that determines the size of the subsets of testing data. Our results are extremely sensitive to the correct choice of these parameters. In future work we could build mechanisms to automatically derive the threshold and data maximum size parameters; for the number of clusters, at this time we believe experimentation is the only way to derive an appropriate value.

In the approach above, we only leverage a priori attack-free data. In the future, we plan to also leverage a priori knowledge of attack data. The main issue here is scalability.

While the above results are promising, they are based on a single dataset (DARPA), so it remains to be seen whether this approach is generally applicable. The dataset also comprised a very limited set of attacks: 3 out of the possible 58 attacks in the DARPA dataset. The 3 attacks were probe and Denial of Service (DoS) attacks, which have very different characteristics from attacks such as apache2. DoS and probe attacks involve large numbers of records; apache2 attacks involve relatively few, and it remains to be seen whether our approach could detect such needle-in-a-haystack attacks.

Conclusion
In conclusion, we believe that Anomaly Detection techniques are still a work in progress, and several challenges remain to be addressed. None of the techniques that we reviewed (including ours) can be applied effectively in real time; they can, however, be used for after-the-fact forensic analysis. Building scalable algorithms that can deal with large volumes of data is another challenge. Several algorithms (ours included) require training data that is attack free, which for obvious reasons limits their commercial viability and acceptance. Anomaly detection techniques can also generate a large number of false alarms or alerts, which are typically handled manually and can be very tedious and time consuming to deal with. On the other hand, anomaly detection techniques can detect new attacks for which no rules exist because they have never been seen before, and they can be used in conjunction with Signature based systems such as SNORT.

Given that anomaly detection systems can never be as precise as signature based systems and will generate more false alarms, the likely candidates for their use are institutions that are willing to manually validate results: security companies, government agencies or universities that can invest in such research and distribute the new knowledge of attacks as signatures to other users. AD then becomes a tool that helps the security expert discover new attacks, rather than being directly responsible for securing computer networks in real time. In this light, AD should continue to be developed, as it will ultimately make our networks more secure.

References
[MC02] Matthew V. Mahoney and Philip K. Chan. "Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks". SIGKDD '02, July 23-26, 2002.

[ZKS2004] Shi Zhong, Taghi M. Khoshgoftaar, and Naeem Seliya. "Evaluating Clustering Techniques for Network Intrusion Detection". In 10th ISSAT Int. Conf. on Reliability and Quality Design, pp. 149-155, Las Vegas, Nevada, USA, August 2004.

[LHF2000] Lippmann, R., et al. "The 1999 DARPA Off-Line Intrusion Detection Evaluation". Computer Networks 34(4), 579-595, 2000.

[KK1999] Kristopher Kendall. "A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems". Masters Thesis, MIT, 1999.

[PES2001] Leonid Portnoy, Eleazar Eskin, and Salvatore Stolfo. "Intrusion Detection with Unlabeled Data Using Clustering". In ACM Workshop on Data Mining Applied to Security, 2001.

[gCLUTO] gCLUTO software for clustering. http://www-users.cs.umn.edu/~karypis/cluto/gcluto/

[CLUTO] CLUTO library for clustering. http://www-users.cs.umn.edu/~karypis/cluto/index.html

[AC03] Muhammad H. Arshad and Philip K. Chan. "Identifying Outliers via Clustering for Anomaly Detection". Florida Institute of Technology Technical Report CS-2003-19, 2003.

[FIT] Network Anomaly Intrusion Detection Research at Florida Tech. http://www.cs.fit.edu/~mmahoney/dist/

[DARPA] DARPA 1999 dataset. http://www.ll.mit.edu/IST/ideval/docs/docs_index.html

[ZR96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. "BIRCH: An Efficient Data Clustering Method for Very Large Databases". In Proc. ACM SIGMOD, 1996.

[WEKA] Weka toolkit. http://www.cs.waikato.ac.nz/ml/weka/