Dr. Dirk Ourston, Ms. Sara Matzner, Mr. William Stump, and Dr. Bryan Hopkins,
Applied Research Laboratories, The University of Texas at Austin
P.O. Box 8029, Austin, TX 78713-8029
1-512-835-3200
{ourston, matzner, stump, bhopkins}@arlut.utexas.edu
Abstract. This paper describes a novel approach using Hidden Markov Models (HMM) to detect complex Internet attacks. These attacks consist of several steps that may occur over an extended period of time. Within each step, specific actions may be interchangeable. A perpetrator may deliberately use a choice of actions within a step to mask the intrusion. In other cases, alternate action sequences may be random (due to noise) or because of lack of experience on the part of the perpetrator. For an intrusion detection system to be effective against complex Internet attacks, it must be capable of dealing with the ambiguities described above. We describe research results concerning the use of HMMs as a defense against complex Internet attacks. We describe why HMMs are particularly useful when there is an order to the actions constituting the attack (that is, for the case where one action must precede or follow another action in order to be effective). Because of this property, we show that HMMs are well suited to address the multi-step attack problem. In a direct comparison with two other classic machine learning techniques, decision trees and neural nets, we show that HMMs perform generally better than decision trees and substantially better than neural networks in detecting these complex intrusions.

Keywords: Coordinated Internet attacks, Hidden Markov Models, rare data, noise, multi-stage network intrusions, partial data.
plural of HMM) to describe the sequential and probabilistic nature of multi-stage Internet attacks [1].

The remainder of this paper is organized as follows. Based on the problem domain being addressed and the techniques used, we compare our work to that of other research groups. We present in detail the salient properties of an HMM that motivated our approach. We present our system architecture, showing the HMM within the deployed system. Then we discuss the process that we employed to prepare the incoming data for use by the HMM algorithm. We present experimental results using a variety of measures, and we compare the performance of HMMs against two classic machine learning algorithms, decision trees and neural networks. We provide an introduction to techniques that can be applied to the rare data problem because, as we have already mentioned, complex Internet attacks may occur only rarely. Finally, we discuss our plans for future work, including investigation of the rare data problem, investigation of methods for recognizing the early stages of an attack, and methods for coping with noisy or missing data.

Related Work

Other researchers have also studied multi-stage Internet attacks. For example, work at Stanford Research Institute (SRI) [2] has included a probabilistic approach to intrusion detection. The SRI approach calculates the similarity between alerts of various types, emanating from multiple sensors. If the alerts are sufficiently similar, they are fused into a meta-alert that summarizes the information contained in the individual alerts. In the SRI approach, the probability of transitioning from one attack phase to another (attack states in our formulation) is specified by the incident class similarity matrix. This matrix is comparable to the state transition matrix in the HMM representation. Therefore, a significant difference between our approach and the SRI approach is that we compute the state transition matrix automatically, using ML techniques, whereas the SRI approach derives the incident class similarity matrix manually, using human judgment. However, an HMM could be used as a source for the correlated attack reports, discussed in Ref. [2], using the meta-alerts as input.

A group at Purdue [3] has included temporal sequence learning in their approach to intrusion detection. This work has focused on host-based intrusions, rather than network intrusions. The ML algorithm that they have selected is instance-based learning [4]. In this technique, a set of exemplars is chosen for the class to be modeled. These exemplars are then compared with example data using a similarity metric to determine if the example is a member of the given class. The HMM representation could be applied to this domain directly using the same format for the examples that was used at Purdue.

Researchers at IBM [5] developed a system that incorporates network alarms instead of raw network data. This system is limited in that it is sensitive to extraordinary events and does not target misuse specifically, as does the HMM representation that we have implemented. The IBM group uses an association rule mining algorithm for identifying events that frequently occur together within the incoming alert data stream. These rules are then used to characterize "normal" operation of the network. Any deviation from the conditions characterized by these rules is considered to be an anomaly, or a potential attack. An approach using an HMM could be applied to this problem domain, with the exception that the HMM output would be various specific attack types, rather than the single broad category of "anomalies."

Our approach, using Hidden Markov Models [6], has only recently been applied to the intrusion detection problem [7]. The work at the University of New Mexico [7] considered HMMs as a tool for detecting misuse based on operating system calls. This study found that HMMs provided greater classification accuracy than other methods, but at the expense of additional computations.

The HMM technique has been extensively applied to problems in speech recognition [8], text understanding [9], image identification [10], and microbiology [11]. Finally, several institutions have addressed the problem of complex Internet attacks [12], [13], applying non-probabilistic ML techniques [14], [15].

The HMM

The Hidden Markov Model (see Figure 1) starts with a finite set of states. Transitions among the states are governed by a set of probabilities (transition probabilities) associated with each state. In a particular state, an outcome or observation can be generated according to a separate probability distribution associated with the state. It is only the outcome, not the state, that is visible to an external observer. The states are "hidden" to the outside; hence the name Hidden Markov Model. The Markov Model used for the hidden layer is a first-order Markov Model, which means that the probability of being in a particular state depends only on the previous state. While in a particular state, the Markov Model is said to "emit" an observable corresponding to that state. One of the goals of using an HMM is to deduce from the set of emitted observables the most likely path in state space that was followed by the system.

Given the set of observables contained in an example corresponding to an attack, an HMM can also determine the likelihood of an attack of a specific type. In our case, the observables are the set of alerts contained in the example. Each example is constructed to contain all of the alerts that occurred between a specified source host and a specified target host, over a specified time period (in our case 24 hours). The goal for the HMM is therefore to determine the most likely attack type corresponding to a sequence of alerts.
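The likelihood computation described above can be sketched with the standard forward algorithm. The states, alert names, and probability values below are invented for illustration; they are not the trained parameters used in this work.

```python
# Sketch of the forward algorithm for scoring an alert sequence against an
# HMM. All states, alert names, and probabilities here are hypothetical.
states = ["probe", "exploit", "compromise"]
alerts = ["port_probe", "buffer_overflow", "root_login"]

pi = [0.80, 0.15, 0.05]            # initial state distribution
A = [[0.60, 0.35, 0.05],           # A[s][t] = P(next state t | current state s)
     [0.05, 0.60, 0.35],
     [0.05, 0.15, 0.80]]
B = [[0.80, 0.15, 0.05],           # B[s][o] = P(alert o emitted | state s)
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]

def sequence_likelihood(obs):
    """P(observed alert sequence | model), summed over all hidden state paths."""
    n = len(states)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

seq = [alerts.index(a) for a in ["port_probe", "buffer_overflow", "root_login"]]
print(sequence_likelihood(seq))
```

Because this transition matrix makes probe-to-exploit far more likely than the reverse, the in-order alert sequence scores higher than the same alerts reversed, which is exactly the ordering sensitivity discussed above.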
[Figure 1: HMM diagram showing a Markov Model layer with the states Probe, Exploit, Consolidate, and Compromise, and the alert-type observables they emit.]
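Recovering the most likely hidden state path from the emitted observables, the other HMM goal noted above, is conventionally computed with the Viterbi algorithm. A minimal sketch, again with invented states and probabilities:

```python
# Sketch of the Viterbi algorithm: recover the most likely hidden state
# path for an observed alert sequence. States, transition matrix A, and
# emission matrix B are hypothetical illustrations.
states = ["probe", "exploit", "compromise"]
pi = [0.80, 0.15, 0.05]
A = [[0.60, 0.35, 0.05],
     [0.05, 0.60, 0.35],
     [0.05, 0.15, 0.80]]
B = [[0.80, 0.15, 0.05],   # rows: states; columns: alert types 0, 1, 2
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]

def most_likely_path(obs):
    """Best-scoring state sequence for the observations, via dynamic programming."""
    n = len(states)
    delta = [pi[s] * B[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev, nxt = [], []
        for t in range(n):
            best = max(range(n), key=lambda s: delta[s] * A[s][t])
            prev.append(best)
            nxt.append(delta[best] * A[best][t] * B[t][o])
        backpointers.append(prev)
        delta = nxt
    path = [max(range(n), key=lambda s: delta[s])]
    for prev in reversed(backpointers):
        path.append(prev[path[-1]])
    path.reverse()
    return [states[s] for s in path]

print(most_likely_path([0, 1, 2]))  # ['probe', 'exploit', 'compromise']
```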
To train the HMM, a set of training examples is created, with each example labeled by category (see Developing Labeled Data, below). The maximum likelihood method [16] is used to adjust the HMM parameters for optimal classification of the example set. The HMM parameters are the initial probability distribution for the HMM states, the state transition probability matrix, and the observable probability distribution.

The state transition probability matrix is a square matrix with size equal to the number of states. Each of the elements represents the probability of transitioning from a given state to another possible state. The matrix is not symmetric. For example, the likelihood of transitioning from the state corresponding to an ip scan into the state corresponding to a port probe is very high, since these two events are likely to occur in proximity to each other in the order given. However, the transition from port probe to ip scan is much less likely, since port probes are usually conducted on hosts that have already been identified by ip scans. The observable probability distribution is a non-square matrix, with dimensions number of states by number of observables. The observable probability distribution represents the probability that a given observable will be emitted by a given state. For example, in the late stages of an attack, the observable "root login" would be much more likely than "port probe," since port probes typically happen at the beginning of an attack and a root login typically happens towards the end.

We implemented our system using a separate HMM for each classification category, since the standard HMM is designed to simulate a single category and the intrusion detection domain includes several attack categories. This approach facilitates parallelization in the operational system, if that is desired.

Why Use an HMM?

There are several properties of alert sequences corresponding to multi-stage Internet attacks that match well with the HMM representation. A multi-stage Internet attack often consists of several steps, each of which is dependent on the outcome of the previous step. These steps are identified in an input example by the alert corresponding to the step. In the HMM, each step is probabilistically dependent on the previous step. Also, in the HMM representation, each step is identified by the observable, that is, the alert, corresponding to the step. The steps themselves are modeled internally in the HMM as states.

An intruder mounting an Internet attack has many ways to accomplish his goals. Within the HMM, each alternative is represented as a specific state sequence. Furthermore, the HMM calculates a probability that a particular state sequence was followed during the attack. This probability is related to the frequency with which the set of transitions contained in the sequence was seen in the HMM training data.

There may also be different ways to accomplish each step in the state sequence. For example, suppose that the goal for a given step is to discover whether or not a host exists at a given ip address. One method to achieve this goal is to attempt to initiate a telnet connection with the host. A second is to consult the DNS server responsible for the network. A third method might involve attempting to establish an FTP connection. Out of the various possible actions, only
one is chosen, resulting in the observable for the step. The likelihood associated with each alert type is then calculated according to the frequency with which the particular alert type had been associated with the given state in the training data. The HMM therefore can simulate variations in both the attack step sequence and the choice of methods for accomplishing the goals of each step. The hidden and visible layers of the HMM also have semantic value and could support the development of a cognitive model of the way attackers accomplish their goals. For example, attackers sometimes attempt to disguise their work as normal network activities. The attacker only reveals his motives when he has gained his objectives, and sometimes not then. The HMM observable layer models the overt portion of the attack, whereas the hidden level represents the attacker's true intentions, a useful property when the attacker attempts to hide his true intentions.

System Architecture

Figure 2 shows the architecture for the intrusion detection system. The current status of the system is that all of the modules shown in the architecture have been implemented as research prototypes, and the system is currently undergoing extensive test and evaluation (the results of some of our experimentation form the basis for this paper).
[Figure 2 diagram: Sensor 1 through Sensor n feed connection records into an alarm database; data pre-filtering then feeds the HMM.]
Figure 2. System architecture
The flow through the system is as follows: One or more sensors monitor network traffic and declare alerts when potential intrusions are detected. The alerts from the sensors are stored in an alarm database for later processing by the system. Data pre-filtering (see Preparing the Data, below) is used to eliminate redundant data from the input data. After the alarm data has been pre-filtered, it is assembled into examples for analysis by the HMM classification module. Currently, the example input to the HMM consists of a source address (the source host), a destination address, and an ordered sequence of alerts that occurred between the source ip and the destination ip during a 24-hour period. The HMM highlights high-interest examples for further analysis. In the event that an example is incorrectly classified by the HMM, the network analyst corrects the classification and sends the example back to the alert database so that the HMM can be updated using that example.

Preparing the Data

In the real-time data stream, alerts come in as independent events from each of many network sensors. A single record of the real-time data consists of the type of alert being reported by the network sensor, as well as various types of context information including bytes transferred, source host address, target host address, and so on. Viewing the data in this manner tends to obscure correlations among the alerts, and makes identification of coordinated attacks more difficult. Consequently, we elected to group the data when forming our examples. Each example consists of the temporally ordered sequence of alerts that occurred between a given source/destination host pair over a 24-hour period.

Using the network sensor data as input allows us to work at a higher level of abstraction than is possible with raw packet data. The network sensors filter out a large number of normal connections, letting our algorithms focus on events that more probably correspond to intrusions. Although using sensors improves the quality of the data being processed by our algorithms, we still have to deal with the "false positive" problem. This problem originates from the fact that most, if not all, network sensors adhere to the philosophy that it is preferable to include many erroneously identified intrusions in the input data stream (false positives) rather than to miss one real intrusion (false negative).

In addition, our input data originally suffered from the alert repetition problem: a single alert type or a set of alert types being repeated over and over in the example. A particular example of this phenomenon occurs during probe activities. In this case the source host sends out service requests over all possible port numbers on the target machine, attempting to find vulnerabilities in the service applications. An alert is generated for each probe attempt, resulting in …pppp… appearing in the example, where p corresponds to a probe alert. It can also happen that a particular alert sequence gets repeated in the example as …abcabcabc…, where a, b, and c are each unique alert types. However, for our purposes in identifying coordinated attacks these repetitive signatures provide little or no information
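The grouping step described above can be sketched as follows. The record fields and values are hypothetical, and the collapsing of consecutive repeats is one plausible pre-filtering rule consistent with the repetition discussion, not necessarily the exact rule used in the deployed system.

```python
from collections import defaultdict

# Hypothetical sketch of example building: raw alert records are grouped by
# (source, destination) host pair per 24-hour window and ordered by time;
# consecutive repeats of the same alert type (the "...pppp..." pattern)
# are collapsed to a single occurrence.
records = [
    {"time": 100, "src": "10.0.0.5", "dst": "10.0.1.9", "alert": "probe"},
    {"time": 101, "src": "10.0.0.5", "dst": "10.0.1.9", "alert": "probe"},
    {"time": 102, "src": "10.0.0.5", "dst": "10.0.1.9", "alert": "probe"},
    {"time": 340, "src": "10.0.0.5", "dst": "10.0.1.9", "alert": "exploit"},
    {"time": 500, "src": "10.0.0.7", "dst": "10.0.1.9", "alert": "probe"},
]

DAY = 86400  # seconds in a 24-hour window

def build_examples(records):
    groups = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        key = (r["src"], r["dst"], r["time"] // DAY)
        seq = groups[key]
        if not seq or seq[-1] != r["alert"]:   # collapse consecutive repeats
            seq.append(r["alert"])
    return dict(groups)

examples = build_examples(records)
print(examples[("10.0.0.5", "10.0.1.9", 0)])  # ['probe', 'exploit']
```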
[Figure 3 plot: classification performance (0.3 to 1.0) versus number of training examples (0 to 500), with the HMM curve above the C4.5 curve and the NN curve lowest.]
Figure 3. Performance versus number of training examples
We extended the HmmLib software by adding the capability to learn from multiple examples and to process HMMs for all attack categories in parallel. For these test results, we used 10 to 500 training examples, 100 test examples, and averaged the results over 100 samples. The training and test populations were obtained using sampling without replacement (i.e., the training and test populations were disjoint). The parameters used for the neural net, which were default values for the WEKA [17] neural network implementation, are as follows (note that the WEKA neural network uses the standard back propagation algorithm for learning):

Learning rate = .3
Momentum = .2
Number of epochs = 500
Number of hidden layers = 1
Nodes per hidden layer = 20

When compared with the C4.5 algorithm, the HMM algorithm is competitive with C4.5 at all levels and significantly out-performs C4.5 at the lower training levels. For instance, with only 10 examples from which to learn, the HMM algorithm scores 45 percent versus 30 percent for the C4.5 algorithm. In addition, the HMM algorithm learns much more quickly. By 50 training examples, the HMM algorithm is already achieving 79 percent performance versus 67 percent for C4.5.

The neural network algorithm was clearly inferior to the HMM for this domain, where its likelihood of correctly classifying an example only slightly exceeded chance at the highest training level. It should also be noted that the neural network required vastly more time than the other two algorithms to complete its training. The significant performance advantage displayed by the HMM at low training levels indicated potential applicability to the "rare data" problem: the case where only a few examples of a complex Internet attack are available.
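The disjoint train/test sampling used in these experiments can be sketched as follows; the pool of example IDs is synthetic, and the helper name is our own.

```python
import random

# Sketch of sampling without replacement: for each run, training examples
# are drawn first and the test set comes from the remaining pool, so the
# two populations never overlap. Example IDs are synthetic stand-ins.
pool = list(range(1000))

def disjoint_split(pool, n_train, n_test, rng):
    shuffled = pool[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

rng = random.Random(0)
train, test = disjoint_split(pool, 300, 100, rng)
print(len(train), len(test), set(train).isdisjoint(test))  # 300 100 True
```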
Figure 4 presents the confusion matrix [19], precision, recall, and F-measure data for our current system. The confusion matrix presents the probability of assigning an input example of a particular class to a particular output category, while precision, recall, and F-measure are defined as follows [20]:

Precision = tp / (tp + fp)

Recall = tp / (tp + fn)

F-measure = 1 / (α/P + (1 − α)/R)

where tp is the number of positive examples in the input data that have been correctly identified by the classifier; fp is the number of negative examples that have been incorrectly identified as positive by the classifier; fn is the number of positive examples that have been identified as negative by the classifier; α is a parameter designed to allow a weighting of the importance ascribed to precision or recall; P is precision and R is recall. For equal weighting of precision and recall, as was done for our experiments, α is set to 0.5.

Figure 4 was generated from the 300 training example case data point shown above in Figure 3. The confusion matrix is shown in the box in Figure 4. Each element of the confusion matrix indicates the proportion of examples from a particular category that were assigned to another category by the classifier. For example, the value of ".96" in the upper left cell of the matrix indicates that 96 percent of the test examples in category 1 were correctly assigned to category 1 by the classifier.

The second column of the figure shows the average number of test examples in a category, averaged over the number of iterations for testing. The row in the figure that is just above the confusion matrix, labeled "Examples assigned to category," indicates the average number of examples that were classified into a particular category. Points along the diagonal of the confusion matrix (where one would like to see "1's") are shown in bold. Precision, recall, and F-measure are given in the last three rows of the figure (just below the confusion matrix). Each element in these rows corresponds to a given category, starting at category 1 on the left and running to category 13 at the far right. Values containing question marks indicate cases where the denominator of the corresponding expression was zero.

In order to avoid data security issues, the category labels do not indicate the specific category definitions. It can be said that the progression among the categories is from less interesting to more interesting in terms of analyst interest. Category 1 is the least interesting category, and category 13 is the most interesting. The point should also be made that the data used for our analysis is "real" data corresponding to suspected intrusions.

As can be seen in Figure 4, good performance can be obtained from this level of training (overall performance is 92.5 percent). However, cases for which there are few examples tend to have a lower performance result. For example, category 6 had, on the average, only .2 examples in the test population; it was classified with an accuracy of 35 percent. Category 7 also had, on average, .2 examples and achieved only 15 percent classification accuracy.
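The three measures defined above translate directly into code; the tp, fp, and fn counts below are illustrative, not taken from Figure 4. Note that with α = 0.5 the F-measure reduces to the familiar harmonic mean 2PR/(P + R).

```python
# The precision, recall, and F-measure definitions, in code.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    # Undefined (the "question mark" cases) when p or r is zero.
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p = precision(tp=96, fp=4)     # 0.96
r = recall(tp=96, fn=24)       # 0.80
print(round(f_measure(p, r), 3))  # 0.873, equal to 2*p*r/(p + r)
```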
Considering the precision, recall, and F-measure calculations, these values are very good, except for those categories where there are insufficient training and test examples. In addition, for categories 2 and 7, the classifier attained a high value for precision, but a low value for recall. This means that for those categories the classifier had few false positives but had a high proportion of false negatives. This could mean that the training population, with only a few examples, had almost no positive examples, causing the classifier to learn that almost all examples should be labeled as negative.

Figure 5 shows the ROC curves [21] obtained from the testing activity. Note that the ROC curves presented in Figure 5 were obtained from the category 1 testing data. ROC curves for the other categories are similar. The point that should be made regarding this figure is that the ROC curves gain most of their performance in the region from 0 to 40 percent false alarm rate, and the ROC curves improve as the training level is increased.

The improvement in performance with training level is shown more clearly in Figure 6, which presents area under the ROC curve as a function of training level. Note that categories for which there were insufficient test examples to generate a valid curve are not included in Figure 6. Categories that start out at lower performance values increase performance with training, as expected. However, the top two categories, 8 and 11, actually appear to decrease in performance as training is increased, an effect for which we currently have no explanation.

The Rare Data Problem

In general, the most dangerous attacks happen only rarely. Therefore the training set for the HMM contains only a few examples of these attack types. In addition, the input alert data stream contains many low priority alerts that tend to mask identification of these rare, high priority attacks. For example, in the confusion matrix of Figure 4, there are very few examples of category 2, and almost all of these examples are mistakenly assigned to category 1, a very common category. Similarly, the examples of category 5 (rare) are also assigned to category 1, and the examples of category 6 are mistakenly assigned to category 8. This problem is known in the literature as the problem of "skewed distributions" or "imbalanced data sets" [22], [23], [24], [25]. This phenomenon has only recently been addressed from within the ML community and to date is not well explored. Because of the incipient nature of the work in this area, in this section we only provide a brief list of approaches that have been used to attack the rare data problem in the past. It is our intention to look more deeply at this problem in the next stage of our research.

The standard approaches for handling the rare data problem are as follows:

Over-sample the rare data population [25]. In this case, the samples of the rare data population are duplicated so as to increase their representation during the learning process.

Under-sample the abundant data population [25]. This is simply the inverse of the method described above. In both cases the objective is to obtain a balance between the most populous sample types and the rare samples.

Apply recognition filters to the data to eliminate samples of one class or another [25].

"Boost" the performance of the algorithm against rare data [24].

Teach the classification algorithm in two stages consisting of: 1) training the classifier to properly classify all of the positive examples, and 2) amending the classifier by teaching it to properly classify all of the negative examples that were improperly classified as positive during the first stage [22], [26].
[Figure 5 plot: ROC curves, with false positive rate from 0 to 1 on the horizontal axis.]
Figure 5. ROC curves as a function of training level
[Figure 6 plot: area under the ROC curve (0 to 1) versus number of training examples (0 to 300), with separate curves for categories 11, 8, 1, 3, 4, 10, 9, and 5.]
Figure 6. Area under the ROC curve as a function of training level
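The first two approaches listed for the rare data problem, over-sampling and under-sampling, can be sketched on a synthetic, imbalanced label set; the data and helper names are our own illustrations.

```python
import random

# Sketch of over- and under-sampling for imbalanced classes:
# over-sampling duplicates rare-class examples up to a target count;
# under-sampling discards abundant-class examples down to a target count.
rng = random.Random(0)
data = [("rare", i) for i in range(5)] + [("common", i) for i in range(95)]

def oversample(data, label, target, rng):
    minority = [d for d in data if d[0] == label]
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return data + extra

def undersample(data, label, target, rng):
    majority = [d for d in data if d[0] == label]
    keep = set(rng.sample(range(len(majority)), target))
    kept = [d for i, d in enumerate(majority) if i in keep]
    return [d for d in data if d[0] != label] + kept

balanced_up = oversample(data, "rare", 95, rng)
balanced_down = undersample(data, "common", 5, rng)
print(len(balanced_up), len(balanced_down))  # 190 10
```

In both cases the resulting set has equal numbers of "rare" and "common" labels, which is the balance objective described above.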
Each of the approaches listed has strengths and weaknesses in the context of the intrusion detection problem. When only a few examples (i.e., < 10) of a particular attack exist in the training data, none of these approaches may work and we may have to resort to other approaches to solve the rare data problem.

The approaches listed above represent possible starting points for the work that needs to be done in accounting for rare data in the intrusion detection domain. This work is especially vital in regard to multi-stage attacks, as these attacks are almost always rare.

Future Work

The work done so far concerning coordinated Internet attacks has identified areas where further work is required. One problem that needs to be addressed is