
Analysis and Design for Intrusion Detection System Based on Data Mining


Duanyang Zhao, Qingxiang Xu, Zhilin Feng
Zhijiang College of Zhejiang University of Technology
Hangzhou, Zhejiang Province, 310024, China
{sunny, xqx, fengzl}@zjc.zjut.edu.cn


Abstract-Network and host Intrusion Detection Systems (IDS)
have become a standard component of security infrastructures.
Because intrusions are variable, complicated, and uncertain,
intrusion detection still faces many unresolved problems. Host-based
and network-based approaches each have their strengths and
weaknesses, and a truly effective intrusion detection system will
employ both technologies. We discuss the differences between
host-based and network-based intrusion detection techniques to
demonstrate how the two can work together to provide more effective
intrusion detection and protection. We propose a hybrid IDS that
combines network and host IDS with anomaly and misuse detection
modes, uses auditing programs to extract an extensive set of
features describing each network connection or host session, and
applies data mining programs to learn rules that accurately capture
the behavior of intrusions and normal activities.
Keywords-intrusion detection; hybrid IDS; data mining; analysis
engine; Apriori algorithm
I. INTRODUCTION
The Apriori algorithm in data mining finds the attribute values
that frequently appear together in a given data set. It can mine
the relationships between attribute values in a database table,
which makes it a suitable method for intrusion detection systems.
The most representative research in this area is that of Wenke
Lee's research group at Columbia University [1][2], beginning in
1998. Funded by the Defense Advanced Research Projects Agency
(DARPA) and the National Science Foundation (NSF), the group
focused on research in this area. Since then, the IDS research
group of the Department of Computer Science, under the leadership
of Professor Salvatore J. Stolfo, has carried out extensive studies
on data mining-based IDS and has divided its research into twelve
sub-topics. Their research is among the leading work in the world.
The SANS (SysAdmin, Audit, Network, Security) Institute has also
done notable work in this area [3].
In recent years, both the Chinese Academy of Sciences (CAS) and
key universities and colleges in China have been actively carrying
out research in this area [4][5]. With the support of the
Development Project of National Key Basic Research and the Major
Projects Fund of the CAS Knowledge Innovation Project, Dr. Xu Jing
of the Computing Center, Research Institute of High Energy Physics
of CAS, made a preliminary implementation of an intrusion detection
system based on data mining. With the support of the National
Natural Science Fund, doctoral students in the Departments of
Computer Science at Nanjing University of Science, Wuhan
University, Northern Jiaotong University, and other key
universities have carried out similar research.
Backdoor programs, through which hackers control target hosts,
may produce unexpected connection records on the network. Because
of the huge amount of data processed on a network, the number of
connection records that remain after filtering is still very large,
and every newly established connection adds another record.
Therefore, we cannot achieve intrusion detection by simply
comparing connection records.
In recent years, the use of data mining for intrusion detection
systems has attracted more and more attention, but many problems
remain. For example, it is difficult to establish a clear standard
for selecting test data, the results mined from the experimental
data contain large amounts of useless information, and it is
unclear how to express the mined rules so that an intrusion
detection system can use them.
The rest of this paper is organized as follows. Section II
describes the framework of the hybrid intrusion detection system.
Section III presents the experimental design and results of the
Apriori algorithm. Finally, Section IV draws conclusions and
discusses future work.
II. THE FRAMEWORK OF HYBRID IDS
Intrusion detection technology is a new security support
mechanism that monitors the network system, without affecting
network performance, to guard against internal and external
attacks and misuse. Intrusion detection systems can be classified
in several ways. By the object of detection, they are divided into
host-based, network-based, and hybrid IDS; by system architecture,
into centralized and distributed IDS; and by detection type, into
anomaly-based and misuse-based IDS.
The hybrid IDS in this paper combines misuse and anomaly
detection engines, uses data mining algorithms to process vast
amounts of security audit data, and generates detection models and
test models separately from the network data and the host system
calls, as shown in Fig. 1:
2010 Second International Workshop on Education Technology and Computer Science
978-0-7695-3987-4/10 $26.00 2010 IEEE
DOI 10.1109/ETCS.2010.478

Figure 1. The hybrid IDS based on data mining algorithms
The hybrid IDS consists of four parts: data warehouse,
sensors, analysis engine and alarm system.
A. Data warehouse
Data warehouse technology provides a subject-oriented,
integrated, and time-variant collection of data that supports the
decision-making process, and it supports multi-process and
multi-threaded access. Many commercial DBMSs offer these functions.
This project uses the SQL Server 2005 data warehouse technology,
which includes Analysis Services. It makes it easy to set up a data
warehouse and achieve distributed computing, provides OLE DB
controls and ADO (ActiveX Data Objects) technology, and has a
flexible data model. These features can improve the speed of data
mining and of the analysis engine.
A data warehouse also lets different components asynchronously
handle the same piece of data stored in the database. It is
therefore the hub for the data and models of the whole system.
B. Sensors
Sensors are closely tied to the operating system, usually
Windows or UNIX/Linux. This paper takes Windows as the example for
describing the technical means used by the sensors.
1) Host Sensors
They gather information on monitored hosts from a variety of
sources, such as application logs, security logs and event logs,
running applications, and registry changes.
Once the audit features are set up, Windows Server monitors
various states of the system and writes them to logs. Using Windows
API functions, we develop programs that monitor the system logs,
running applications, and registry changes, and send them to the
Host Sensor Manager of the analysis engine for analysis.
We use hook functions to intercept API calls.
Hooks are an important part of the Windows message-processing
mechanism. By installing hooks, an application can set subroutines
that monitor system message traffic. Before messages reach their
destinations, the subroutines intercept them and analyze them
according to user requirements. Hooks are divided into
thread-specific hooks and global hooks. Thread-specific hooks
monitor a specified thread, while global hooks monitor all threads
in the system. For global hooks, the hook functions must reside in
a separate dynamic-link library (DLL) so that they can be called
from the associated applications.
A hook function is a mechanism by which an application can
monitor message flows and process certain types of messages before
they reach the target window. For example:
A process can install a WH_GETMESSAGE hook to examine each
window message in the system. It installs the hook by calling the
SetWindowsHookEx function as follows:
    HHOOK hHook = SetWindowsHookEx(WH_GETMESSAGE, GetMsgProc, hinstDLL, 0);
Here the parameter WH_GETMESSAGE indicates the type of hook to be
installed, GetMsgProc is the address of the hook procedure that the
system calls when a message is retrieved from the message queue,
hinstDLL is the handle of the DLL that contains GetMsgProc, and the
final argument 0 associates the hook with all existing threads on
the same desktop as the calling thread, which makes it a global
hook.
2) Network Sensors
Network sensors use the Windows netstat tool to collect
information about the network connections established between
computers. The netstat command can list all open ports on a
computer. We could design a program that runs netstat at regular
intervals and outputs the results, but this adds to the load on the
system: on a relatively busy system, one day of records may reach
several GB in size.
Therefore, we optimize the program so that it captures the
network connection information at lower cost. It first lists all
open ports, then monitors when a port is newly opened and when it
is closed, records only the port information that has changed, and
outputs the records to the Network Sensor Manager in the Analysis
Engine. Each record includes the port service, port number,
activation time, time stamp, and so on.
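The change-only recording described above can be illustrated with a minimal Python sketch. It assumes that two successive snapshots of open ports have already been obtained (for example by parsing netstat output, which is not shown) and that each snapshot maps a port number to a service name; the function name diff_ports and the record fields are hypothetical and are not taken from the authors' implementation.

```python
from datetime import datetime, timezone


def diff_ports(previous, current, now=None):
    """Return change records between two snapshots of open ports.

    Each snapshot is assumed to map a port number to a service name.
    Only ports that were opened or closed since the previous snapshot
    produce a record, so unchanged ports cost nothing to store.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.isoformat()
    events = []
    # Ports present now but not before were newly opened.
    for port in sorted(current.keys() - previous.keys()):
        events.append({"event": "opened", "port": port,
                       "service": current[port], "time": stamp})
    # Ports present before but missing now were closed.
    for port in sorted(previous.keys() - current.keys()):
        events.append({"event": "closed", "port": port,
                       "service": previous[port], "time": stamp})
    return events


before = {80: "http", 25: "smtp"}
after = {80: "http", 4444: "unknown"}
for record in diff_ports(before, after):
    print(record["event"], record["port"], record["service"])
```

In this sketch port 80 produces no record because it is open in both snapshots, which is the source of the storage saving.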
(Fig. 1 components: Host Sensors and Network Sensors; Analysis
Engine with Host Sensor Manager, Network Sensor Manager, Misuse
Detector, Anomaly Detector, Pattern Mining, and Mining Algorithm
Library; Data Warehouse; Alarm System with Alarm Manager, Alarm
Strategy, System Protection Strategy, Intruder Tracing, and Archive
Information.)
C. Analysis Engine
The Analysis Engine consists of three parts: the Network/Host
Sensor Managers, the Misuse and Anomaly Detectors, and the Mining
Algorithm Library with Pattern Mining.
1) The Sensor Managers receive data from the sensors, analyze
the data, translate them into database records, and store them in
the data warehouse.
2) The Misuse and Anomaly Detectors detect intrusions by
matching against the patterns stored in the data warehouse.
Traditional IDS fall into two separate types: misuse detection
and anomaly detection. Anomaly detection, also known as
behavior-based detection, builds behavioral models of users under
normal circumstances during a learning phase, then compares the
current user behavior with these models and reports an intrusion
if the deviation exceeds a confidence threshold. The basic
principle is that any behavior inconsistent with the known normal
behaviors is treated as an intrusion.
Misuse detection, also called knowledge-based intrusion
detection, builds intrusion patterns for the known intrusions, then
matches the current user behavior and system status against these
patterns. The basic principle is that any behavior consistent with
the known intrusion behaviors is treated as an intrusion.
We integrate these two models into the hybrid IDS and thus form
new basic principles of intrusion detection: a behavior is normal
if it is consistent with the normal behavior model; a behavior is
an intrusion if it is consistent with the anomaly behavior model;
and any other behavior is passed to the Pattern Mining module,
which uses the Mining Algorithm Library to generate a new detection
model and adds it to the detection models in the data warehouse.
When comparing an unknown behavior with the normal/anomaly behavior
models, the detectors decide whether it is normal or anomalous by
comparing the computed support and confidence with a given minimum
support and minimum confidence.
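The three-way decision above can be sketched in Python under simplifying assumptions. A behavior record and each stored pattern are modeled as sets of attribute values, and a pattern matches when all of its values occur in the record. The support and confidence threshold comparison is omitted. Checking the intrusion patterns first when a record matches both models is an assumption of this sketch, as is the function name classify_behavior.

```python
def classify_behavior(record, normal_patterns, intrusion_patterns):
    """Return 'normal', 'intrusion', or 'unknown' for one behavior record.

    A pattern (a frozenset of attribute values) matches the record
    when every value in the pattern also occurs in the record.
    """
    items = frozenset(record)
    # Assumption: intrusion patterns take priority over normal ones.
    if any(pattern <= items for pattern in intrusion_patterns):
        return "intrusion"
    if any(pattern <= items for pattern in normal_patterns):
        return "normal"
    # Neither model matches: hand the record to pattern mining.
    return "unknown"


normal = [frozenset({"80", "sf"})]
intrusion = [frozenset({"4444", "rej"})]
print(classify_behavior({"192.168.7.13", "80", "sf"}, normal, intrusion))
print(classify_behavior({"10.0.0.9", "23", "s0"}, normal, intrusion))
```

Records labeled 'unknown' correspond to the behaviors that the Pattern Mining module would process to extend the detection models.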
3) The Mining Algorithm Library and Pattern Mining mine
unknown intrusions.
From the point of view of the data warehouse, data mining can be
regarded as an advanced stage of online analytical processing
(OLAP). We apply data mining technology to IDS: we use association
analysis and sequential pattern analysis to extract security-related
features, generate classification models based on them, and
identify security incidents automatically. The analytical methods
of data mining used here can be divided into three kinds:
a) Association analysis
Its purpose is to uncover hidden relationships among the data.
Based on the correlation among a set of items, association analysis
can be used to identify correlations between intrusion behaviors.
The basic definitions of association analysis are as follows.
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of binary attributes,
whose elements are referred to as items. Let $D$ be a set of
transactions, where each transaction $T$ is a set of items with
$T \subseteq I$. Let $X$ be a set of items in $I$; if
$X \subseteq T$, then transaction $T$ contains $X$.
An associational rule is an implication of the form
$X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and
$X \cap Y = \emptyset$. The support of rule $X \Rightarrow Y$ in the
transaction set $D$ is the ratio of the number of transactions that
contain both $X$ and $Y$ to the number of all transactions, denoted
by Support($X \Rightarrow Y$), that is:
$$\mathrm{Support}(X \Rightarrow Y) = \frac{|\{T : X \cup Y \subseteq T,\ T \in D\}|}{|D|}$$
The confidence of rule $X \Rightarrow Y$ in the transaction set $D$
is the ratio of the number of transactions that contain both $X$
and $Y$ to the number of transactions that contain $X$, denoted by
Confidence($X \Rightarrow Y$), that is:
$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{|\{T : X \cup Y \subseteq T,\ T \in D\}|}{|\{T : X \subseteq T,\ T \in D\}|}$$
Given a transaction set D, the task of association analysis is
to find the associational rules whose support and confidence are
not less than the minimum support (minsupp) and the minimum
confidence (minconf) specified by the user.
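The two measures can be computed directly from their definitions. The following Python sketch assumes each transaction is a set of attribute values; the four sample transactions are invented for illustration and are not the experimental data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction also containing rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    covered = sum(1 for t in transactions if lhs <= t)
    both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both / covered if covered else 0.0


# Hypothetical connection records: port, flag, category.
transactions = [
    frozenset({"80", "sf", "normal"}),
    frozenset({"80", "sf", "normal"}),
    frozenset({"25", "sf", "normal"}),
    frozenset({"80", "rej", "attack"}),
]
print(support(transactions, {"80", "sf", "normal"}))       # 0.5
print(confidence(transactions, {"80", "sf"}, {"normal"}))  # 1.0
```

With minsupp = 0.5 and minconf = 1.0, the rule {80, sf} => {normal} would be accepted on this sample, since two of the four records contain all three values and every record containing 80 and sf is tagged normal.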
Agrawal et al. introduced the association rule mining problem
in 1993, and the basic Apriori algorithm followed in 1994. The
algorithm has been improved considerably in recent years. This
project applies recent variants of it for pattern mining.
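For orientation, the level-wise search at the heart of the basic Apriori algorithm can be sketched in Python as follows. This is a textbook-style illustration, not the variant used in the project. It relies on the property that every subset of a frequent itemset is itself frequent, and the sample transactions are invented.

```python
from itertools import combinations


def apriori(transactions, minsupp):
    """Return {itemset: support} for all itemsets with support >= minsupp."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    candidates = [frozenset({item}) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count step: keep the candidates that reach the minimum support.
        level = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= minsupp:
                level[c] = s
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = [c for c in joined
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))]
    return frequent


transactions = [
    frozenset({"80", "sf", "normal"}),
    frozenset({"80", "sf", "normal"}),
    frozenset({"25", "sf", "normal"}),
    frozenset({"80", "rej", "attack"}),
]
for itemset, s in sorted(apriori(transactions, 0.5).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Rules are then derived from each frequent itemset by splitting it into a left and a right part and keeping the splits whose confidence reaches minconf.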
b) Sequence pattern analysis
Like association analysis, its purpose is to uncover
relationships among the data, but it focuses on the order in which
events occur. Many intrusion behaviors are ordered, in that some
actions must occur after others. For example, a hacker generally
scans the system's ports before attacking.
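The ordering idea can be made concrete with a small Python sketch. It only checks whether a given ordered pattern occurs in an event sequence and measures how many sessions contain it; it does not mine the patterns themselves. The event names and function names are hypothetical.

```python
def occurs_in_order(events, pattern):
    """True if the steps of pattern appear in events in the same order,
    possibly with other events in between."""
    remaining = iter(events)
    # Each membership test consumes the iterator up to the match, so a
    # later step can only be found after the earlier ones.
    return all(step in remaining for step in pattern)


def sequence_support(sessions, pattern):
    """Fraction of sessions in which the ordered pattern occurs."""
    return sum(occurs_in_order(s, pattern) for s in sessions) / len(sessions)


sessions = [
    ["port_scan", "exploit"],
    ["login", "port_scan", "dns_query", "exploit"],
    ["exploit", "port_scan"],
]
print(occurs_in_order(sessions[1], ["port_scan", "exploit"]))
print(sequence_support(sessions, ["port_scan", "exploit"]))
```

The third session contains both events but in the opposite order, so it does not count toward the support of the pattern.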
c) Classification analysis
Assume a collection of records and a set of tags, where each
tag denotes a category with distinct characteristics. We assign a
tag to each record, that is, we classify the records by tag. We
then examine the tagged records and describe the characteristics of
each category. For example, intrusions can be divided into three
categories by the harm they cause: fatal intrusions, general
intrusions, and weak intrusions. Classification analysis examines
previous intrusions, assigns each a risk level, and then describes
each level according to the classification criteria.
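As a concrete illustration of classifying tagged records, the following Python sketch trains a simple count-based naive Bayes classifier and assigns an untagged record to the category with the largest product of prior and per-attribute likelihoods. It assumes categorical attributes, applies no smoothing (an unseen attribute value gives a category a score of zero), and uses invented sample records and tags; none of the names come from the authors' implementation.

```python
from collections import Counter, defaultdict


def train(records, labels):
    """Count records per category and attribute values per category."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # keyed by (category, attribute index)
    for record, label in zip(records, labels):
        for k, value in enumerate(record):
            value_counts[(label, k)][value] += 1
    return class_counts, value_counts


def classify(record, class_counts, value_counts):
    """Return the category maximizing prior times attribute likelihoods."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior probability of the category
        for k, value in enumerate(record):
            # Likelihood of this attribute value within the category.
            score *= value_counts[(label, k)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label


records = [("tcp", "80", "sf"), ("tcp", "80", "sf"), ("tcp", "25", "sf"),
           ("tcp", "80", "rej"), ("udp", "53", "rej")]
labels = ["normal", "normal", "normal", "attack", "attack"]
model = train(records, labels)
print(classify(("tcp", "80", "sf"), *model))
print(classify(("udp", "53", "rej"), *model))
```

A practical classifier would usually add smoothing so that a single unseen value does not rule out a category outright.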
The Bayesian classification algorithm is as follows:
Each connection record is described by an n-dimensional
feature vector $X = (x_1, x_2, \ldots, x_n)$, whose components are
the values of the n attributes of the connection record. Assume
that there are m categories $C_1, C_2, \ldots, C_m$. Given an
unknown (untagged) connection record $X$, the classifier predicts
that $X$ belongs to the category with the highest posterior
probability; that is, the Bayesian classifier assigns the unknown
connection record $X$ to the category $C_i$ if and only if
$P(C_i|X) > P(C_j|X)$ for all $1 \le j \le m$, $j \ne i$.
According to Bayes' theorem,
$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}.$$
Because $P(X)$ is the same constant for every category, it suffices
to maximize $P(X|C_i)P(C_i)$. The prior probability of a category
is $P(C_i) = s_i/s$, where $s_i$ is the number of connection
records in the category $C_i$ and $s$ is the total number of
connection records.
To reduce the overhead of calculating $P(X|C_i)$, we assume that
the attributes are conditionally independent given the category, so
that
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i),$$
where $P(x_k|C_i) = s_{ik}/s_i$, $s_{ik}$ is the number of
connection records in the category $C_i$ that have the value $x_k$,
and $s_i$ is the number of connection records in the category
$C_i$.
To classify an unknown connection record, we calculate
$P(X|C_i)P(C_i)$ for each category $C_i$ and assign the connection
record $X$ to the category $C_i$ if and only if
$P(X|C_i)P(C_i) > P(X|C_j)P(C_j)$ for all $1 \le j \le m$,
$j \ne i$.
Although these algorithms suit different scenarios, we use them
together in this project to improve the performance of the IDS.
D. Alarm system
The main function of the alarm system is to take emergency
measures based on alarm strategies, such as appropriate system
protection, archiving, and intrusion tracing when necessary. The
details are omitted here.
III. THE EXPERIMENT FOR ASSOCIATION ANALYSIS
A. The design of associational rule detector
Association analysis in data mining is divided into two parts:
the learning phase of the rules and the detecting phase of the
application of the rules learnt.
1) In the learning phase: the Analysis Engine applies
association analysis to connection records from the Network Sensor
Manager, mines the associations between the values of data items
under the normal state of the network, and obtains the
associational rule set, which is filtered by some manually defined
rules so as to be suitable as a detection rule set for detecting
intrusions.
2) In the detecting phase: the Analysis Engine gets the
connection records from the Network Sensor Manager and matches them
against the detection rule set to determine whether an intrusion
has taken place. Matching against the detection rules is the
association analysis of the detecting phase.
The detection rule set built in the learning phase is the core
of the Analysis Engine. Its validity has a direct impact on the
accuracy of the detection results.
B. The experimental results for associational rules
Our experimental data come from a file of network packets
recorded by the TCPdump tool.
We convert the network packets into the format of connection
records and save them as a text file, in which the variables are
separated by a space.
Association analysis builds the rule sets from the connection
records, with the minimum support set to 5% and the minimum
confidence set to 100%. However, the rule sets contain a large
number of useless rules, which cannot simply be used to express
meaningful associations between the values of connection
attributes. If we used them as the standard for monitoring network
intrusions, the system would make wrong detections. Therefore, we
have to remove the useless rules, as follows:
First, filter out the rules in which the right items do not
contain the categories;
Then, filter out the rules in which the left items do not
contain the IP and ports.
After the above filtering steps, the final associational rules
are closer to the ideal. A few of the associational rules are shown
in Table I.
TABLE I. ASSOCIATIONAL RULES
The left items of rules          The right items of rules   Support (%)   Confidence (%)
192.168.7.13 80 sf               normal                     7.8           100.0
192.168.4.16 25 passive exter    normal                     8.3           100.0
192.168.2.10 80 active           normal                     23.2          100.0
192.168.4.18 tcp 25 sf           normal                     8.5           100.0
192.168.7.23 80 active           normal                     19.4          100.0
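The rule filtering and the detecting-phase match described in this section can be sketched in Python as follows. The sketch assumes each rule is a pair of tuples (left items, right items), that "normal" is the only category label, that a port is any all-digit item, and that a record matching no normal rule is flagged as suspicious. These are assumptions made for illustration, since the paper does not give the matching details. The first rule is taken from Table I and the other two are invented to show the filters.

```python
import re

CATEGORIES = {"normal"}  # assumed set of category labels
IP = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def keep_rule(lhs, rhs):
    """Keep a rule only if its right side names a category and its
    left side contains both an IP address and a port."""
    has_category = any(item in CATEGORIES for item in rhs)
    has_ip = any(IP.match(item) for item in lhs)
    has_port = any(item.isdigit() for item in lhs)
    return has_category and has_ip and has_port


def matches_normal(record, rules):
    """True if some rule's left items all occur in the record."""
    items = set(record)
    return any(set(lhs) <= items for lhs, _rhs in rules)


mined = [
    (("192.168.7.13", "80", "sf"), ("normal",)),   # from Table I
    (("sf",), ("normal",)),                        # no IP or port: dropped
    (("192.168.2.10", "80", "active"), ("tcp",)),  # no category: dropped
]
detection_rules = [rule for rule in mined if keep_rule(*rule)]
print(len(detection_rules))
print(matches_normal(("192.168.7.13", "80", "sf", "tcp"), detection_rules))
print(matches_normal(("192.168.7.13", "4444", "rej"), detection_rules))
```

Under the stated assumption, the last record would be flagged because no retained normal rule covers it.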
IV. CONCLUSIONS
The hybrid IDS can efficiently detect both known and unknown
intrusions. Research on intrusion detection based on data mining is
one of the hot topics at home and abroad. A series of theoretical
and practical problems remain to be resolved, and a number of key
technologies require further in-depth study. The experiment shows
that designing and implementing an efficient and accurate IDS based
on data mining is a large, complex project.
In applying data mining algorithms to the original connection
records, the key question is how to obtain the corresponding
frequent patterns effectively. In the future, we will focus on how
to select appropriate and representative original data and how to
filter out useless rules precisely.
ACKNOWLEDGMENT
This work has been supported by the Natural Science Foundation
of Zhejiang Province, China (No. Y1080343).
REFERENCES
[1] W. Lee and S. J. Stolfo. Data mining approaches for intrusion
detection. In Proceedings of the 7th USENIX Security Symposium, San
Antonio, TX, January 1998.
[2] W. Lee and S. J. Stolfo. A data mining framework for building
intrusion detection models. In Proceedings of the 1999 IEEE
Symposium on Security and Privacy, Oakland, CA, May 1999.
[3] http://www.sans.org/resources/idfaq/data_mining.php?printer=Y, 2003.4
[4] Chinese Academy of Sciences (CAS). Network IDS technology in CAS
reached the international advanced level, in Chinese. Retrieved from
http://www.cas.cn/jzd/jcx/jcxlc/200204/t20020403_1034832.shtml
[5] Xu Jing, Liu Baoxu and Xu Rongsheng. Design and implementation of
data mining-based IDS, in Chinese, Computer Engineering, Beijing.
2002, 28(6), pp. 9-10, 169.
