Network Attacks
Dr. Dirk Ourston, Ms. Sara Matzner, Mr. William Stump, and Dr. Bryan Hopkins,
Applied Research Laboratories, The University of Texas at Austin
P.O. Box 8029, Austin, TX 78713-8029
1-512-835-3200
{ourston, matzner, stump, bhopkins}@arlut.utexas.edu
Abstract. This paper describes a novel approach using Hidden Markov Models (HMM) to detect complex Internet attacks. These attacks consist of several steps that may occur over an extended period of time. Within each step, specific actions may be interchangeable. A perpetrator may deliberately use a choice of actions within a step to mask the intrusion. In other cases, alternate action sequences may be random (due to noise) or because of lack of experience on the part of the perpetrator. For an intrusion detection system to be effective against complex Internet attacks, it must be capable of dealing with the ambiguities described above. We describe research results concerning the use of HMMs as a defense against complex Internet attacks. We describe why HMMs are particularly useful when there is an order to the actions constituting the attack (that is, for the case where one action must precede or follow another action in order to be effective). Because of this property, we show that HMMs are well suited to address the multi-step attack problem. In a direct comparison with two other classic machine learning techniques, decision trees and neural nets, we show that HMMs perform generally better than decision trees and substantially better than neural networks in detecting these complex intrusions.

Keywords: Coordinated Internet attacks, Hidden Markov Models, rare data, noise, multi-stage network intrusions, partial data.
[Figure: Markov model with hidden states Probe, Exploit, Consolidate, and Compromise and their associated observables (alert types).]
To train the HMM, a set of training examples is created, with each example labeled by category (see Developing Labeled Data, below). The maximum likelihood method [16] is used to adjust the HMM parameters for optimal classification of the example set. The HMM parameters are the initial probability distribution for the HMM states, the state transition probability matrix, and the observable probability distribution.

The state transition probability matrix is a square matrix whose size equals the number of states. Each element represents the probability of transitioning from a given state to another possible state. The matrix is not symmetric. For example, the likelihood of transitioning from the state corresponding to an ip scan into the state corresponding to a port probe is very high, since these two events are likely to occur in proximity to each other in the order given. However, the transition from port probe to ip scan is much less likely, since port probes are usually conducted on hosts that have already been identified by ip scans. The observable probability distribution is a non-square matrix, with dimensions number of states by number of observables. It represents the probability that a given observable will be emitted by a given state. For example, in the late stages of an attack, the observable “root login” would be much more likely than “port probe,” since port probes typically happen at the beginning of an attack and a root login typically happens towards the end.

We implemented our system using a separate HMM for each classification category, since the standard HMM is designed to simulate a single category and the intrusion detection domain includes several attack categories. This approach facilitates parallelization in the operational system, if that is desired.

Why Use an HMM?
There are several properties of alert sequences corresponding to multi-stage Internet attacks that match well with the HMM representation. A multi-stage Internet attack often consists of several steps, each of which is dependent on the outcome of the previous step. These steps are identified in an input example by the alert corresponding to the step. In the HMM, each step is probabilistically dependent on the previous step. Also, in the HMM representation, each step is identified by the observable, that is, the alert, corresponding to the step. The steps themselves are modeled internally in the HMM as states.

An intruder mounting an Internet attack has many ways to accomplish his goals. Within the HMM, each alternative is represented as a specific state sequence. Furthermore, the HMM calculates the probability that a particular state sequence was followed during the attack. This probability is related to the frequency with which the set of transitions contained in the sequence was seen in the HMM training data.

There may also be different ways to accomplish each step in the state sequence. For example, suppose that the goal for a given step is to discover whether or not a host exists at a given ip address. One method to achieve this goal is to attempt to initiate a telnet connection with the host. A second is to consult the DNS server responsible for the network. A third method might involve attempting to establish an FTP connection. Out of the various possible actions, only
[Figure 2. System architecture: sensors 1 through n produce connection records into a network alarm database; after data pre-filtering, the data is passed to the HMM.]
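The per-category training and classification scheme described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it scores an alert sequence under one HMM per attack category using the forward algorithm and returns the best-scoring category; all parameter values shown are hypothetical.

```python
import numpy as np

def sequence_log_likelihood(obs, pi, A, B):
    """Log-likelihood of an alert sequence under one HMM (forward algorithm).

    pi : (n_states,)          initial state probability distribution
    A  : (n_states, n_states) state transition probability matrix
    B  : (n_states, n_alerts) observable (emission) probability matrix
    obs: sequence of alert-type indices
    """
    alpha = pi * B[:, obs[0]]          # forward probabilities at t = 0
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()               # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # transition, then emit the next alert
        log_p += np.log(alpha.sum())   # accumulate likelihood in log space
        alpha /= alpha.sum()
    return log_p

def classify(obs, models):
    """One HMM per attack category; return the highest-likelihood category.

    models maps a category name to its (pi, A, B) parameter triple.
    """
    return max(models, key=lambda c: sequence_log_likelihood(obs, *models[c]))
```

Because each category's model is trained and scored independently, the per-category parallelization mentioned earlier falls out naturally from this structure.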
The flow through the system is as follows: One or more sensors monitor network traffic and declare alerts when potential intrusions are detected. The alerts from the sensors are stored in an alarm database for later processing by the system. Data pre-filtering (see Preparing the Data, below) is used to eliminate redundant data from the input. After the alarm data has been pre-filtered, it is assembled into examples for analysis by the HMM classification module. Currently, the example input to the HMM consists of a source address (the source host), a destination address, and an ordered sequence of alerts that occurred between the source ip and the destination ip during a 24-hour period. The HMM highlights high-interest examples for further analysis. In the event that an example is incorrectly classified by the HMM, the network analyst corrects the classification and sends the example back to the alert database so that the HMM can be updated using that example.

Preparing the Data
In the real-time data stream, alerts come in as independent events from each of many network sensors. A single record of the real-time data consists of the type of alert being reported by the network sensor, as well as various types of context information, including bytes transferred, source host address, target host address, and so on. Viewing the data in this manner tends to obscure correlations among the alerts and makes identification of coordinated attacks more difficult. Consequently, we elected to group the data when forming our examples. Each example consists of the temporally ordered sequence of alerts that occurred between a given source/destination host pair over a 24-hour period.

Using the network sensor data as input allows us to work at a higher level of abstraction than is possible with raw packet data. The network sensors filter out a large number of normal connections, letting our algorithms focus on events that more probably correspond to intrusions. Although using sensors improves the quality of the data being processed by our algorithms, we still have to deal with the “false positive” problem. This problem originates from the fact that most, if not all, network sensors adhere to the philosophy that it is preferable to include many erroneously identified intrusions in the input data stream (false positives) rather than to miss one real intrusion (a false negative).

In addition, our input data originally suffered from the alert repetition problem: a single alert type or a set of alert types being repeated over and over in the example. A particular example of this phenomenon occurs during probe activities. In this case the source host sends out service requests over all possible port numbers on the target machine, attempting to find vulnerabilities in the service applications. An alert is generated for each probe attempt, resulting in …pppp… appearing in the example, where p corresponds to a probe alert. It can also happen that a particular alert sequence gets repeated in the example as …abcabcabc…, where a, b, and c are each unique alert types. However, for our purposes in identifying coordinated attacks these repetitive signatures provide little or no information
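A pre-filter for the alert repetition problem just described could be sketched as follows. This is an illustrative reconstruction, not the authors' actual filtering code, and the maximum pattern length is an assumed parameter.

```python
def collapse_repeats(alerts, max_pattern=3):
    """Drop immediate repetitions of alert patterns up to max_pattern long,
    so ['p','p','p','p'] -> ['p'] and 'abcabcabc' -> 'abc', preserving order."""
    out = list(alerts)
    changed = True
    while changed:
        changed = False
        for k in range(1, max_pattern + 1):
            result = []
            i = 0
            while i < len(out):
                # skip out[i:i+k] when it merely repeats the last k alerts kept
                if len(result) >= k and result[-k:] == out[i:i + k]:
                    i += k
                    changed = True
                else:
                    result.append(out[i])
                    i += 1
            out = result
    return out
```

For example, `collapse_repeats(list("abcabcabc"))` reduces the repeated probe signature to a single `['a', 'b', 'c']` occurrence while leaving non-repetitive sequences untouched.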
[Figure 3. Performance versus number of training examples (0–500): the HMM curve lies above C4.5, which lies above the neural network (NN).]
Considering the precision, recall, and F-measure calculations, these values are very good, except for those categories where there are insufficient training and test examples. In addition, for categories 2 and 7, the classifier attained a high value for precision but a low value for recall. This means that for those categories the classifier had few false positives but a high proportion of false negatives. This could mean that the training population, with only a few examples, had almost no positive examples, causing the classifier to learn that almost all examples should be labeled as negative. Figure 5 shows the ROC curves [21] obtained from the testing activity. Note that the ROC curves presented in Figure 5 were obtained from the category 1 testing data; ROC curves for the other categories are similar. The point that should be made regarding this figure is that the ROC curves gain most of their performance in the region from 0 to 40 percent false alarm rate, and the ROC curves improve as the training level is increased.

The improvement in performance with training level is shown more clearly in Figure 6, which presents area under the ROC curve as a function of training level. Note that categories for which there were insufficient test examples to generate a valid curve are not included in Figure 6. Categories that start out at lower performance values increase performance with training, as expected. However, the top two categories, 8 and 11, actually appear to decrease in performance as training is increased, an effect for which we currently have no explanation.

The Rare Data Problem
In general, the most dangerous attacks happen only rarely. Therefore the training set for the HMM contains only a few examples of these attack types. In addition, the input alert data stream contains many low priority alerts that tend to mask identification of these rare, high priority attacks. For example, in the confusion matrix of Figure 4, there are very few examples of category 2, and almost all of these examples are mistakenly assigned to category 1, a very common category. Similarly, the examples of category 5 (rare) are also assigned to category 1, and the examples of category 6 are mistakenly assigned to category 8. This problem is known in the literature as the problem of “skewed distributions” or “imbalanced data sets” [22], [23], [24], [25]. This phenomenon has only recently been addressed from within the ML community and to date is not well explored. Because of the incipient nature of the work in this area, in this section we only provide a brief list of approaches that have been used to attack the rare data problem in the past. It is our intention to look more deeply at this problem in the next stage of our research.

The standard approaches for handling the rare data problem are as follows:

Over-sample the rare data population [25]. In this case, the samples of the rare data population are duplicated so as to increase their representation during the learning process.

Under-sample the abundant data population [25]. This is simply the inverse of the method described above. In both cases the objective is to obtain a balance between the most populous sample types and the rare samples.

Apply recognition filters to the data to eliminate samples of one class or another [25].

“Boost” the performance of the algorithm against rare data [24].

Teach the classification algorithm in two stages, consisting of: 1) training the classifier to properly classify all of the positive examples, and 2) amending the classifier by teaching it to properly classify all of the negative examples that were improperly classified as positive during the first stage [22], [26].
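As an illustration of the first of these approaches (not the authors' code), a naive over-sampler that duplicates minority-class examples until every class matches the majority-class count might look like this; the `seed` parameter is an assumption added for reproducibility:

```python
import random

def oversample(examples, labels, seed=0):
    """Duplicate minority-class examples until each class is as large as
    the majority class. Returns a list of (example, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # whole copies of the class, plus a random remainder to reach target
        picks = xs * (target // len(xs)) + rng.sample(xs, target % len(xs))
        balanced.extend((x, y) for x in picks)
    return balanced
```

Under-sampling is the mirror image: instead of duplicating rare examples up to the majority count, draw the minority count of examples from each abundant class.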
[Figure 5. ROC curves as a function of training level (x-axis: false positive rate, 0–1).]

[Figure 6. Area under the ROC curve as a function of training level (x-axis: number of training examples, 0–300), with curves for categories 11, 8, 1, 3, 4, 10, 9, and 5.]
Each of the approaches listed has strengths and weaknesses in the context of the intrusion detection problem. When only a few examples (i.e., < 10) of a particular attack exist in the training data, none of these approaches may work, and we may have to resort to other approaches to solve the rare data problem.

The approaches listed above represent possible starting points for the work that needs to be done in accounting for rare data in the intrusion detection domain. This work is especially vital in regard to multi-stage attacks, as these attacks are almost always rare.

Future Work
The work done so far concerning coordinated Internet attacks has identified areas where further work is required. One problem that needs to be addressed is