
Identifying Key Variables for Intrusion Detection Using Soft Computing

Paradigms
Srinivas Mukkamala (1), Andrew H. Sung (1,2) and Ajith Abraham (3)
(1) Department of Computer Science, New Mexico Tech, Socorro, New Mexico 87801
(2) Institute for Complex Additive Systems Analysis, New Mexico Tech, Socorro, New Mexico 87801
Srinivas|sung@cs.nmt.edu
(3) Department of Computer Science, Oklahoma State University, 700 N Greenwood Avenue, Tulsa, OK 74106
ajith.abraham@ieee.org

ABSTRACT- This paper concerns using learning machines for intrusion detection. Two classes of learning machines are studied: Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). We show that SVMs are superior to ANNs for intrusion detection in three critical respects: SVMs train, and run, an order of magnitude faster; SVMs scale much better; and SVMs give higher classification accuracy. We also address the related issue of ranking the importance of input features, which is itself a problem of great interest in modeling. Since elimination of the insignificant and/or useless inputs leads to a simplification of the problem and possibly faster and more accurate detection, feature selection is very important in intrusion detection.

Two methods for feature ranking are presented: the first is independent of the modeling tool, while the second is specific to SVMs. The two methods are applied to identify the important features in the 1999 DARPA intrusion data, and it is shown that the two methods produce largely consistent results. We present various experimental results indicating that SVM-based intrusion detection using a reduced number of features can deliver enhanced or comparable performance. An SVM-based IDS for class-specific detection is thereby proposed. Finally, we also illustrate some of our current ongoing research work using neuro-fuzzy systems and linear genetic programming.

I. INTRODUCTION
This paper concerns intrusion detection and the related issue of identifying important input features for intrusion detection. Intrusion detection is a problem of great significance to critical infrastructure protection owing to the fact that computer networks are at the core of the nation's operational control. This paper summarizes our current work to build Intrusion Detection Systems (IDSs) using Artificial Neural Networks or ANNs [1], Support Vector Machines or SVMs [2], and some of our ongoing research work using neuro-fuzzy systems and linear genetic programming. Since the ability to identify the important and redundant inputs of a classifier leads directly to reduced size, faster training, and possibly more accurate results, it is critical to be able to identify the important features of network traffic data for intrusion detection in order for the IDS to achieve maximal performance. Therefore, we also study feature ranking and selection, which is itself a problem of great interest in building models based on experimental data.

Since most intrusions can be uncovered by examining patterns of user activities, many IDSs have been built by utilizing recognized attack and misuse patterns to develop learning machines [3,4,5,6,7,8,9,10,11]. In our recent work, SVMs were found to be superior to ANNs in many important respects of intrusion detection [12,13,14]; we will therefore concentrate on SVMs and briefly summarize the results of ANNs.

The data we used in our experiments originated from MIT's Lincoln Lab. It was developed for intrusion detection system evaluations by DARPA and is considered a benchmark for intrusion detection evaluations [15].

We performed experiments to rank the importance of the input features for each of the five classes (normal, probe, denial of service, user to super-user, and remote to local) of patterns in the DARPA data. It is shown that using only the important features for classification gives good accuracies and, in certain cases, reduces the training and testing times of the SVM classifier.

In the rest of the paper, a brief introduction to the data we used is given in Section 2. In Section 3 we describe the method of deleting one input feature at a time and the performance metrics considered for deciding the importance of a particular feature; some theoretical aspects of neuro-fuzzy systems and linear genetic programming are also introduced in that section. In Section 4 we present the experimental results of using SVMs for feature ranking. In Section 5 we present the experimental results of using ANNs for feature ranking. In Section 6 we summarize our results and give a brief description of our proposed IDS architecture.
II. THE DATA
In the 1998 DARPA intrusion detection evaluation program, an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical U.S. Air Force LAN. The LAN was operated like a real environment, but was blasted with multiple attacks. For each TCP/IP connection, 41 various quantitative and qualitative features were extracted [16]. Of this database a subset of 494,021 records was used, of which 20% represent normal patterns.

Attack types fall into four main categories:
1. DOS: denial of service
2. R2L: unauthorized access from a remote machine
3. U2Su: unauthorized access to local super user (root) privileges
4. Probing: surveillance and other probing

III. RANKING THE SIGNIFICANCE OF INPUTS
Feature selection and ranking [17,18] is an important issue in intrusion detection. Of the large number of features that can be monitored for intrusion detection purposes, which are truly useful, which are less significant, and which may be useless? The question is relevant because the elimination of useless features (or audit trail reduction) enhances the accuracy of detection while speeding up the computation, thus improving the overall performance of an IDS. In cases where there are no useless features, by concentrating on the most important ones we may well improve the time performance of an IDS without affecting the accuracy of detection in statistically significant ways.

The feature ranking and selection problem for intrusion detection is similar in nature to various engineering problems that are characterized by:
• Having a large number of input variables x = (x1, x2, ..., xn) of varying degrees of importance; i.e., some elements of x are essential, some are less important, some of them may not be mutually independent, and some may be useless or noise;
• Lacking an analytical model or mathematical formula that precisely describes the input-output relationship, Y = F(x);
• Having available a finite set of experimental data, based on which a model (e.g., neural networks) can be built for simulation and prediction purposes.

Due to the lack of an analytical model, one can only seek to determine the relative importance of the input variables through empirical methods. A complete analysis would require examination of all possibilities, e.g., taking two variables at a time to analyze their dependence or correlation, then taking three at a time, etc. This, however, is both infeasible (requiring 2^n experiments!) and not infallible (since the available data may be of poor quality in sampling the whole input space). In the following, therefore, we apply the technique of deleting one feature at a time [14] to rank the input features and identify the most important ones for intrusion detection using SVMs [19].

A. Performance-Based Method for Ranking Importance
We first describe a general (i.e., independent of the modeling tools being used), performance-based input ranking methodology: one input feature is deleted from the data at a time, and the resultant data set is used for the training and testing of the classifier. The classifier's performance is then compared to that of the original classifier (based on all features) in terms of relevant performance criteria. Finally, the importance of the feature is ranked according to a set of rules based on the performance comparison.

The procedure is summarized as follows:
1. compose the training set and the testing set;
for each feature do the following:
2. delete the feature from the (training and testing) data;
3. use the resultant data set to train the classifier;
4. analyze the performance of the classifier using the test set, in terms of the selected performance criteria;
5. rank the importance of the feature according to the rules.

B. Performance Metrics
To rank the importance of the 41 features (of the DARPA data) in an SVM-based IDS, we consider three main performance criteria: overall accuracy of (5-class) classification, training time, and testing time. Each feature will be ranked as "important", "secondary", or "insignificant", according to the following rules, which are applied to the result of the performance comparison of the original 41-feature SVM and each 40-feature SVM:

Rule set:
1. If accuracy decreases and training time increases and testing time decreases, then the feature is important.
2. If accuracy decreases and training time increases and testing time increases, then the feature is important.
3. If accuracy decreases and training time decreases and testing time increases, then the feature is important.
4. If accuracy is unchanged and training time increases and testing time increases, then the feature is important.
5. If accuracy is unchanged and training time decreases and testing time increases, then the feature is secondary.
6. If accuracy is unchanged and training time increases and testing time decreases, then the feature is secondary.
7. If accuracy is unchanged and training time decreases and testing time decreases, then the feature is unimportant.
8. If accuracy increases and training time increases and testing time decreases, then the feature is secondary.
9. If accuracy increases and training time decreases and testing time increases, then the feature is secondary.
10. If accuracy increases and training time decreases and testing time decreases, then the feature is unimportant.

According to the above rules, the 41 features are ranked into the 3 types of {Important}, <Secondary>, or (Unimportant), for each of the 5 classes of patterns, as follows:

class 1 Normal: {1,3,5,6,8-10,14,15,17,20-23,25,29,33,35,36,38,39,41}, <2,4,7,11,12,16,18,19,24,30,31,34,37,40>, (13,32)

class 2 Probe: {3,5,6,23,24,32,33}, <1,4,7-9,12-19,21,22,25-28,34-41>, (2,10,11,20,29,30,31,36,37)

class 3 DOS: {1,3,5,6,8,19,23-28,32,33,35,36,38-41}, <2,7,9-11,14,17,20,22,29,30,34,37>, (4,12,13,15,16,18,19,21,3)

class 4 U2Su: {5,6,15,16,18,32,33}, <7,8,11,13,17,19-24,26,30,36-39>, (9,10,12,14,27,29,31,34,35,40,41)

class 5 R2L: {3,5,6,24,32,33}, <2,4,7-23,26-31,34-41>, (1,20,25,38)

Because SVMs are only capable of binary classifications, we will need to employ five SVMs for the five-class identification problem in intrusion detection. But since the set of important features may differ from class to class, using five SVMs becomes an advantage rather than a hindrance, i.e., in building an IDS using five SVMs, each SVM can use only the important features for the class for which it is responsible for making classifications.

C. SVM-specific Feature Ranking Method
Information about the features and their contribution towards classification is hidden in the support vector decision function. Using this information, one can rank the significance of the features, i.e., in the equation

F(X) = Σ Wi Xi + b

the point X belongs to the positive class if F(X) is positive, and to the negative class if F(X) is negative. The value of F(X) depends on the contribution of each Xi and Wi. The absolute value of Wi measures the strength of the classification: if Wi is a large positive value, then the ith feature is a key factor for the positive class; if Wi is a large negative value, then the ith feature is a key factor for the negative class; and if Wi is close to zero on either the positive or the negative side, then the ith feature does not contribute significantly to the classification. Based on this idea, a ranking can be done by considering the support vector decision function.

D. Support Vector Decision Function (SVDF)
The input ranking is done as follows: first, the original data set is used for the training of the classifier; then the classifier's decision function is used to rank the importance of the features. The procedure is:
1. calculate the weights from the support vector decision function;
2. rank the importance of the features by the absolute values of the weights.

According to this ranking method, the 41 features are placed into the 3 categories of {Important}, <Secondary> or (Unimportant), for each of the 5 classes of patterns, as follows:

class 1 Normal: {1,2,3,4,5,6,10,12,17,23,24,27,28,29,31,32,33,34,36,39}, <11,13,14,16,19,22,25,26,30,35,37,38,40,41>, (7,8,9,15,18,20,21)

class 2 Probe: {1,2,3,4,5,6,23,24,29,32,33}, <10,12,22,28,34,35,36,38,39,41>, (7,8,9,11,13,14,15,16,17,18,19,20,21,25,26,27,30,31,37,40)

class 3 DOS: {1,5,6,23,24,25,26,32,36,38,39}, <2,3,4,10,12,29,33,34>, (7,8,9,11,13,14,15,16,17,18,19,20,21,22,27,28,30,31,35,36,37,40,41)
class 4 U2Su: {1,2,3,5,6,12,23,24,32,33}, <4,10,13,14,17,22,27,29,31,34,36,37,39>, (7,8,9,11,15,16,18,19,20,21,25,26,28,30,35,38,40,41)

class 5 R2L: {1,3,5,6,32,33}, <2,4,10,12,22,23,24,29,31,34,36,37,38,40>, (7,8,9,11,13,14,15,16,17,18,19,20,21,25,26,27,28,30,35,39,41)

E. Fuzzy Inference System (FIS)
Fuzzy logic provides a framework to model uncertainty and the human way of thinking, reasoning and perception. Fuzzy if-then rules and fuzzy reasoning are the backbone of FISs, which are the most important modeling tools based on fuzzy set theory. We will use the Adaptive Neuro-Fuzzy Inference System (ANFIS) [23] framework, based on neural network learning, to fine-tune the rule antecedent parameters, and a least mean squares estimation to adapt the rule consequent parameters of a Takagi-Sugeno FIS. A step in the learning procedure has two parts: in the first part the input patterns are propagated, and the optimal conclusion parameters are estimated by an iterative least mean squares procedure, while the antecedent parameters (membership functions) are assumed to be fixed for the current cycle through the training set. In the second part the patterns are propagated again, and in this epoch, backpropagation is used to modify the antecedent parameters, while the conclusion parameters remain fixed [22]. In IDSs, fuzzy inference systems could play an important role, as they are highly interpretable and easy to construct.

F. Linear Genetic Programming (LGP)
LGP is a variant of the Genetic Programming (GP) technique that acts on linear genomes [24]. The linear genetic programming technique used for our current experiment is based on machine-code-level manipulation and evaluation of programs. Its main characteristic, in comparison to tree-based GP, is that the evolvable units are not the expressions of a functional programming language (like LISP); instead, programs of an imperative language (like C) are evolved. In the Automatic Induction of Machine Code by Genetic Programming, individuals are manipulated directly as binary code in memory and executed directly, without passing through an interpreter during fitness calculation. The LGP tournament selection procedure puts the lowest selection pressure on the individuals by allowing only two individuals to participate in a tournament; a copy of the winner replaces the loser of each tournament. Crossover points only occur between instructions. Inside instructions, the mutation operation randomly replaces the instruction identifier, a variable, or the constant with one from the valid ranges. In LGP the maximum size of the program is usually restricted to prevent programs from growing without bounds. As LGP can be implemented at the machine code level, it is fast enough to detect intrusions in an online mode.

IV. EXPERIMENTS
SVMs are used, in each of the two methods, for ranking the importance of the input features. Once the importance of the input features was ranked, the classifiers were trained and tested with only the important features. Further, we validate the ranking by comparing the performance of the classifier [20,21] using all input features to that using only the important features; and we also compare the performance of a classifier using the union of the important features for all five classes.

A. SVM Performance Statistics
Our results are summarized in the following tables. Table 1 gives the performance results of the five SVMs for each respective class of data. Table 2 shows the results of SVMs performing classification, with each SVM using as input the important features for its class. Table 3 shows the results of SVMs performing classification, with each SVM using as input the union of the important features for all five classes. Table 4 shows the results of SVMs performing classification, with each SVM using as input the important and secondary features for each respective class. Table 5 shows the results of SVMs performing classification, with each SVM using as input the important features obtained from the SVDF ranking. Table 6 shows the results of SVMs performing classification, with each SVM using as input the union of the important features for each class as obtained from the SVDF ranking; this union has 23 features. Table 7 shows the results of SVMs performing classification, with each SVM using as input the important and secondary features for each respective class.

TABLE 1
Performance of SVMs using 41 features

Class    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal          7.66                  1.26               99.55
Probe          49.13                  2.10               99.70
DOS            22.87                  1.92               99.25
U2Su            3.38                  1.05               99.87
R2L            11.54                  1.02               99.78
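The delete-one-feature-at-a-time procedure of Section III (train on all features, retrain with one feature removed, compare accuracy, training time, and testing time, then apply the rule set) can be sketched as below. This is a minimal illustration, not the authors' actual setup: scikit-learn's SVC and a small synthetic binary data set stand in for the SVMlight classifier [19] and the DARPA data, and direction combinations not covered by the ten rules are defaulted to "secondary" here.

```python
# Sketch of the performance-based feature ranking of Section III.A/B.
# Assumptions: scikit-learn SVC + synthetic data replace SVMlight + DARPA data.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# The paper's rule set, keyed on (accuracy, training time, testing time)
# changes observed after deleting one feature.
RULES = {
    ("down", "up",   "down"): "important",    # rule 1
    ("down", "up",   "up"):   "important",    # rule 2
    ("down", "down", "up"):   "important",    # rule 3
    ("same", "up",   "up"):   "important",    # rule 4
    ("same", "down", "up"):   "secondary",    # rule 5
    ("same", "up",   "down"): "secondary",    # rule 6
    ("same", "down", "down"): "unimportant",  # rule 7
    ("up",   "up",   "down"): "secondary",    # rule 8
    ("up",   "down", "up"):   "secondary",    # rule 9
    ("up",   "down", "down"): "unimportant",  # rule 10
}

def evaluate(X_tr, y_tr, X_te, y_te):
    """Train an SVM and return (accuracy, training time, testing time)."""
    clf = SVC()
    t0 = time.perf_counter(); clf.fit(X_tr, y_tr)
    train_t = time.perf_counter() - t0
    t0 = time.perf_counter(); acc = clf.score(X_te, y_te)
    test_t = time.perf_counter() - t0
    return acc, train_t, test_t

def change(new, old):
    return "up" if new > old else "down" if new < old else "same"

# Synthetic stand-in for one binary (class-vs-rest) DARPA problem.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
base = evaluate(X_tr, y_tr, X_te, y_te)   # original classifier, all features

ranks = {}
for i in range(X.shape[1]):               # delete one feature at a time
    red = evaluate(np.delete(X_tr, i, axis=1), y_tr,
                   np.delete(X_te, i, axis=1), y_te)
    key = (change(red[0], base[0]), change(red[1], base[1]),
           change(red[2], base[2]))
    # combinations the ten rules do not cover default to "secondary"
    ranks[i] = RULES.get(key, "secondary")

print(ranks)
```

In practice each of the five class-specific SVMs would run this loop over the 41 DARPA features, producing the per-class {Important}, <Secondary>, (Unimportant) partitions listed in Section III.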
TABLE 2
Performance of SVMs using important features

Class    No of Features    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal        25                 9.36                   1.07              99.59
Probe          7                37.71                   1.87              99.38
DOS           19                22.79                   1.84              99.22
U2Su           8                 2.56                   0.85              99.87
R2L            6                 8.76                   0.73              99.78

TABLE 3
Performance of SVMs using union of important features (30)

Class    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal          7.67                  1.02               99.51
Probe          44.38                  2.07               99.67
DOS            18.64                  1.41               99.22
U2Su            3.23                  0.98               99.87
R2L             9.81                  1.01               99.78

TABLE 4
Performance of SVMs using important and secondary features

Class    No of Features    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal        39                 8.15                   1.22              99.59
Probe         32                47.56                   2.09              99.65
DOS           32                19.72                   2.11              99.25
U2Su          25                 2.72                   0.92              99.87
R2L           37                 8.25                   1.25              99.80

TABLE 5
Performance of SVMs using important features as ranked by SVDF

Class    No of Features    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal        20                 4.58                   0.78              99.55
Probe         11                40.56                   1.20              99.36
DOS           11                18.93                   1.00              99.16
U2Su          10                 1.46                   0.70              99.87
R2L            6                 6.79                   0.72              99.72

TABLE 6
Performance of SVMs using union of important features (total 19) as ranked by SVDF

Class    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal          4.35                  1.03               99.55
Probe          26.52                  1.73               99.42
DOS             8.64                  1.61               99.19
U2Su            2.04                  0.18               99.85
R2L             5.67                  1.12               99.78

TABLE 7
Performance of SVMs using important and secondary features as ranked by SVDF

Class    No of Features    Training Time (sec)    Testing Time (sec)    Accuracy (%)
Normal        34                 4.61                   0.97              99.55
Probe         21                39.69                   1.45              99.56
DOS           19                73.55                   1.50              99.56
U2Su          23                 1.73                   0.79              99.87
R2L           20                 5.94                   0.91              99.78

V. NEURAL NETWORK EXPERIMENTS
This section summarizes the authors' recent work in comparing ANNs and SVMs for intrusion detection [12,13,14]. Since a (multi-layer feedforward) ANN is capable of making multi-class classifications, a single ANN is employed to perform the intrusion detection, using the same training and testing sets as those for the SVMs.

Neural networks are used for ranking the importance of the input features, taking training time, testing time, and classification accuracy as the performance measures, and a set of rules is used for the ranking. The method is therefore an extension of the feature ranking method described in [17], where a cement bonding quality problem is used as the engineering application. Once the importance of the input features was ranked, the ANNs were trained and tested with the data set containing only the important features. We then compare the performance of the trained classifier against the original ANN trained with data containing all input features.

A. ANN Performance Statistics
Table 8 gives the comparison of the ANN using all 41 features to that using the 34 important features that were obtained by our feature-ranking algorithm described above.
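For comparison with the performance-based loop, the SVDF ranking of Section III.D (rank features by the absolute values of the weights Wi in F(X) = Σ Wi Xi + b) can be sketched as below. Again this is a hypothetical illustration: scikit-learn's LinearSVC and synthetic data stand in for SVMlight [19] and the DARPA set, and the three-way split thresholds are illustrative, not taken from the paper.

```python
# Sketch of the SVDF (weight-based) feature ranking of Section III.D.
# Assumptions: LinearSVC + synthetic data replace SVMlight + DARPA data;
# the 0.5/0.2 thresholds below are arbitrary illustrative cut-offs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=1)
clf = LinearSVC(dual=False).fit(X, y)   # one binary SVM, as in class-vs-rest

w = clf.coef_.ravel()                   # the weights W_i of F(X) = sum W_i X_i + b
order = np.argsort(-np.abs(w))          # features sorted by decreasing |W_i|

# Crude three-way split mirroring {Important}, <Secondary>, (Unimportant).
scale = np.abs(w).max()
category = {}
for i in order:
    a = abs(w[i]) / scale
    category[int(i)] = ("important" if a > 0.5
                        else "secondary" if a > 0.2
                        else "unimportant")

print([int(i) for i in order])
print(category)
```

Because |W_i| comes directly from one trained classifier, this ranking needs a single training run per class, whereas the performance-based method retrains once per deleted feature.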
TABLE 8
Neural network results using all 41 features and using the 34 important features

No of features    Accuracy (%)    False positive rate    False negative rate    Number of epochs
     41              87.07               6.66                   6.27                  412
     34              81.57              18.19                   0.25                   27

VI. SUMMARY & CONCLUSIONS
A number of observations and conclusions are drawn from the results reported:
• SVMs outperform ANNs in the important respects of scalability (SVMs can train with a larger number of patterns, while ANNs would take a long time to train or fail to converge at all when the number of patterns gets large); training time and running time (SVMs run an order of magnitude faster); and prediction accuracy.
• SVMs easily achieve high detection accuracy (higher than 99%) for each of the 5 classes of data, regardless of whether all 41 features are used, only the important features for each class are used, or the union of all important features for all classes is used.

We note, however, that the differences in accuracy figures tend to be very small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in their sizes. More definitive conclusions can only be made after analyzing more comprehensive sets of network traffic data.

Regarding feature ranking, we observe that:
• The two feature ranking methods produce largely consistent results: except for the class 1 (Normal) and class 4 (U2Su) data, the features ranked as Important by the two methods heavily overlap.
• The most important features for the two classes 'Normal' and 'DOS' heavily overlap.
• 'U2Su' and 'R2L', the two smallest classes representing the most serious attacks, each have a small number of important features and a large number of secondary features.
• The performances of (a) using the important features for each class (Tables 2 and 5), (b) using the union of important features (Tables 3 and 6), and (c) using the union of important and secondary features for each class (Tables 4 and 7) do not show significant differences, and are all similar to that of using all 41 features.
• Using the important features for each class gives the most remarkable performance: the testing time decreases for each class, while the accuracy increases slightly for one class ('Normal'), decreases slightly for two classes ('Probe' and 'DOS'), and remains the same for the two most serious attack classes.
• Experimentation related to fuzzy inference systems and linear genetic programming is in progress, and we will be able to summarize the comparative performance soon.

VII. ACKNOWLEDGEMENTS
Support for this research received from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech) and a U.S. Department of Defense IASP capacity building grant is gratefully acknowledged. We would also like to acknowledge many insightful conversations with Dr. Jean-Louis Lassez and David Duggan that helped clarify some of our ideas.

VIII. REFERENCES
[1] Hertz J., Krogh A., Palmer R. G. (1991) "Introduction to the Theory of Neural Computation," Addison-Wesley.
[2] Joachims T. (1998) "Making Large-Scale SVM Learning Practical," LS8-Report, University of Dortmund, LS VIII-Report.
[3] Denning D. (Feb. 1987) "An Intrusion-Detection Model," IEEE Transactions on Software Engineering, Vol. SE-13, No. 2.
[4] Kumar S., Spafford E. H. (1994) "An Application of Pattern Matching in Intrusion Detection," Technical Report CSD-TR-94-013, Purdue University.
[5] Ghosh A. K. (1999) "Learning Program Behavior Profiles for Intrusion Detection," USENIX.
[6] Cannady J. (1998) "Artificial Neural Networks for Misuse Detection," National Information Systems Security Conference.
[7] Ryan J., Lin M-J., Miikkulainen R. (1998) "Intrusion Detection with Neural Networks," Advances in Neural Information Processing Systems 10, Cambridge, MA: MIT Press.
[8] Debar H., Becke M., Siboni D. (1992) "A Neural Network Component for an Intrusion Detection System," Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy.
[9] Debar H., Dorizzi B. (1992) "An Application of a Recurrent Network to an Intrusion Detection System," Proceedings of the International Joint Conference on Neural Networks, pp.78-83.
[10] Luo J., Bridges S. M. (2000) "Mining Fuzzy Association Rules and Fuzzy Frequency
Episodes for Intrusion Detection," International Journal of Intelligent Systems, John Wiley & Sons, pp.687-703.
[11] Cramer M., et al. (1995) "New Methods of Intrusion Detection using Control-Loop Measurement," Proceedings of the Technology in Information Security Conference (TISC) '95, pp.1-10.
[12] Mukkamala S., Janoski G., Sung A. H. (2001) "Monitoring Information System Security," Proceedings of the 11th Annual Workshop on Information Technologies & Systems, pp.139-144.
[13] Mukkamala S., Janoski G., Sung A. H. (2002) "Intrusion Detection Using Neural Networks and Support Vector Machines," Proceedings of the IEEE International Joint Conference on Neural Networks, pp.1702-1707.
[14] Mukkamala S., Janoski G., Sung A. H. (2002) "Comparison of Neural Networks and Support Vector Machines in Intrusion Detection," Workshop on Statistical and Machine Learning Techniques in Computer Intrusion Detection, June 11-13, 2002, http://www.mts.jhu.edu/~cidwkshop/abstracts.html
[15] http://kdd.ics.uci.edu/databases/kddcup99/task.htm
[16] Stolfo S. J., Fan W., Lee W., Prodromidis A., Chan P. K. "Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection," Results from the JAM Project.
[17] Sung A. H. (1998) "Ranking Importance of Input Parameters of Neural Networks," Expert Systems with Applications, pp.405-41.
[18] Lin Y., Cunningham G. A. (1995) "A New Approach to Fuzzy-Neural System Modeling," IEEE Transactions on Fuzzy Systems, Vol. 3, No. 2, pp.190-198.
[19] Joachims T. (2000) "SVMlight is an Implementation of Support Vector Machines (SVMs) in C," http://ais.gmd.de/~thorsten/svm_light, University of Dortmund, Collaborative Research Center on Complexity Reduction in Multivariate Data (SFB475).
[20] Vladimir V. N. (1995) "The Nature of Statistical Learning Theory," Springer.
[21] Joachims T. (2000) "Estimating the Generalization Performance of a SVM Efficiently," Proceedings of the International Conference on Machine Learning, Morgan Kaufmann.
[22] Abraham A. (2001) "Neuro-Fuzzy Systems: State-of-the-Art Modeling Techniques," Connectionist Models of Neurons, Learning Processes, and Artificial Intelligence, Jose Mira et al. (Eds.), Germany, Springer-Verlag, LNCS 2084, pp.269-276.
[23] Jang S. R., Sun C. T., Mizutani E. (1997) "Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence," US, Prentice Hall Inc.
[24] Banzhaf W., Nordin P., Keller R. E., Francone F. D. (1998) "Genetic Programming: An Introduction on the Automatic Evolution of Computer Programs and Its Applications," Morgan Kaufmann Publishers, Inc.
