Robust and Efficient Intrusion

Detection Systems
Kapil Kumar Gupta
Submitted in total fulfilment of the requirements of the degree of
Doctor of Philosophy
January 2009
Department of Computer Science and Software Engineering
THE UNIVERSITY OF MELBOURNE
Abstract
I
NTRUSION Detection systems are now an essential component in the overall network and
data security arsenal. With the rapid advancement in the network technologies including
higher bandwidths and ease of connectivity of wireless and mobile devices, the focus of intrusion
detection has shifted from simple signature matching approaches to detecting attacks based on an-
alyzing contextual information which may be specific to individual networks and applications. As
a result, anomaly and hybrid intrusion detection approaches have gained significance. However,
present anomaly and hybrid detection approaches suffer from three major setbacks; limited attack
detection coverage, large number of false alarms and inefficiency in operation.
In this thesis, we address these three issues by introducing efficient intrusion detection frame-
works and models which are effective in detecting a wide variety of attacks and which result in very
few false alarms. Additionally, using our approach, attacks can not only be accurately detected but
can also be identified which helps to initiate effective intrusion response mechanisms in real-time.
Experimental results performed on the benchmark KDD 1999 data set and two additional data
sets collected locally confirm that layered conditional random fields are particularly well suited to
detect attacks at the network level and user session modeling using conditional random fields can
effectively detect attacks at the application level.
We first introduce the layered framework with conditional random fields as the core intrusion
detector. Layered conditional random field can be used to build scalable and efficient network
intrusion detection systems which are highly accurate in attack detection. We show that our sys-
tems can operate either at the network level or at the application level and perform better than
other well known approaches for intrusion detection. Experimental results further demonstrate
that our system is robust to noise in training data and handles noise better than other systems such
as the decision trees and the naive Bayes. We then introduce our unified logging framework for
audit data collection and perform user session modeling using conditional random fields to build
iii
real-time application intrusion detection systems. We demonstrate that our system can effectively
detect attacks even when they are disguised within normal events in a single user session. Using
our user session modeling approach based on conditional random fields also results in early at-
tack detection. This is desirable since intrusion response mechanisms can be initiated in real-time
thereby minimizing the impact of an attack.
iv
Declaration
This is to certify that
1. the thesis comprises only my original work towards the PhD,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies
and appendices.
Kapil Kumar Gupta,
January 2009
v
List of Publications
Part of the work which is described in this thesis has been published as journal articles, book
chapters and conference proceedings. Following is the list of the papers which have been published
during the course of the candidature.
1. Robust Application Intrusion Detection using User Session Modeling – Kapil Kumar Gupta,
Baikunth Nath, Kotagiri Ramamohanarao. Submitted to the ACM Transactions on Infor-
mation and Systems Security (TISSEC). Under Review.
2. Layered Approach using Conditional Random Fields for Intrusion Detection – Kapil Ku-
mar Gupta, Baikunth Nath, Kotagiri Ramamohanarao. IEEE Transactions on Depend-
able and Secure Computing (TDSC). In Press.
3. User Session Modeling for Effective Application Intrusion Detection – Kapil Kumar Gupta,
Baikunth Nath, Kotagiri Ramamohanarao. In Proceedings of the 23
rd
International In-
formation Security Conference, Lecture Notes in Computer Science, Springer Verlag,
vol (278), pages 269 - 284, 2008.
4. Sequence Labeling for Effective Intrusion Detection – Kotagiri Ramamohanarao, Kapil
Kumar Gupta, Baikunth Nath. In Proceedings of the 2
nd
Annual Computer Security
Conference. In Press.
5. Intrusion Detection in Networks and Applications – Kapil Kumar Gupta, Baikunth Nath,
Kotagiri Ramamohanarao. In Handbook of Communication Networks and Distributed
Systems, World Scientific. In Press.
6. The Curse of Ease of Access to the Internet – Kotagiri Ramamohanarao, Kapil Kumar
Gupta, Tao Peng, Christopher Leckie. In Proceedings of the 3
rd
International Conference
on Information Systems Security, Lecture Notes in Computer Science, Springer Verlag,
vol (4812), pages 234 - 249, 2007.
vii
7. Conditional Random Fields for Intrusion Detection – Kapil Kumar Gupta, Baikunth Nath,
Kotagiri Ramamohanarao. In Proceedings of the IEEE 21
st
International Conference
on Advanced Information Networking and Applications Workshops, IEEE Computer
Society, vol (1), pages 203 - 208, 2007.
8. Network Security Framework – Kapil Kumar Gupta, Baikunth Nath, Kotagiri Ramamoha-
narao. International Journal of Computer Science and Network Security (IJCSNS),
vol 6(7B), pages 151 - 157, 2006.
9. Attacking Confidentiality: An Agent Based Approach – Kapil Kumar Gupta, Baikunth
Nath, Kotagiri Ramamohanarao, Ashraf Kazi. In Proceedings of the IEEE International
Conference on Intelligence and Security Informatics, Lecture Notes in Computer Sci-
ence, Springer Verlag, vol (3975), pages 285 - 296, 2006.
viii
Acknowledgements
It gives me immense pleasure to thank and express my gratitude towards my supervisors Assoc.
Prof. Baikunth Nath and Prof. Ramamohanarao Kotagiri, for their support throughout the
course of my study. Their constant motivation, support and expert guidance has helped me to
overcome all odds making this journey a truly rewarding experience in my life. I thank them from
the bottom of my heart.
I would also like to thank my Ph.D. committee member Assoc. Prof. Chris Leckie for his
valuable feedback and critical reviews which have helped to improve the quality of the thesis.
I am grateful for the support received from the University of Melbourne via numerous chan-
nels including the Melbourne International Fee Remission Scholarship (MIFRS), tremendous sup-
port from the School of Graduate Research, supportive staff at the university libraries and various
other university resources. In particular, I thank the staff at the Department of Computer Science
and Software Engineering, Melbourne School of Engineering who has been extremely helpful at
numerous occasions.
I am extremely grateful to the National ICT Australia (NICTA) for the financial support
in the form of the prestigious NICTA Studentship and regular support to present the research at
various international conferences and to visit international laboratories.
I do not have words to express my gratitude towards my parents and my elder brother whose
support and uncountable sacrifices have paved the way for me to pursue this study. It would not
have been possible for me to undertake this challenging task without their constant support.
I would like to thank my friends in the research lab, room 3.08, and in the department for
making the place a fun place to work and who also helped me to collect the data sets used in this
research. Finally, Alauddin Bhuiyan deserves a special mention and I shall cherish the frequent
tea breaks that we had together.
ix
Contents
1 Introduction 1
1.1 Motivation and Problem Description . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Emerging Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions to Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Layered Framework for Intrusion Detection . . . . . . . . . . . . . . . . 5
1.3.2 Layered Conditional Random Fields for Network Intrusion Detection . . 5
1.3.3 Unified Logging Framework for Audit Data Collection . . . . . . . . . . 6
1.3.4 User Session Modeling for Application Intrusion Detection . . . . . . . . 7
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Intrusion Detection and Intrusion Detection System . . . . . . . . . . . . . . . . 11
2.2.1 Principles and Assumptions in Intrusion Detection . . . . . . . . . . . . 13
2.2.2 Components of Intrusion Detection Systems . . . . . . . . . . . . . . . . 13
2.2.3 Challenges and Requirements for Intrusion Detection Systems . . . . . . 14
2.3 Classification of Intrusion Detection Systems . . . . . . . . . . . . . . . . . . . 15
2.3.1 Classification based upon the Security Policy definition . . . . . . . . . . 17
2.3.2 Classification based upon the Audit Patterns . . . . . . . . . . . . . . . . 19
2.4 Audit Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Properties of Audit Patterns useful for Intrusion Detection . . . . . . . . 22
2.4.2 Univariate or Multivariate Audit Patterns . . . . . . . . . . . . . . . . . 23
2.4.3 Relational or Sequential Representation . . . . . . . . . . . . . . . . . . 24
xi
2.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 Frameworks for building Intrusion Detection Systems . . . . . . . . . . 26
2.6.2 Network Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.3 Monitoring Access Logs . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.4 Application Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Layered Framework for Building Intrusion Detection Systems 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Description of our Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Components of Individual Layers . . . . . . . . . . . . . . . . . . . . . 41
3.4 Advantages of Layered Framework . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Comparison with other Frameworks . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Layered Conditional Random Fields for Network Intrusion Detection 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.2 Integrating the Layered Framework . . . . . . . . . . . . . . . . . . . . 55
4.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Building Individual Layers of the System . . . . . . . . . . . . . . . . . 57
4.5.2 Implementing the Integrated System . . . . . . . . . . . . . . . . . . . . 64
4.6 Comparison and Analysis of Results . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.1 Significance of Layered Framework . . . . . . . . . . . . . . . . . . . . 70
4.6.2 Significance of Feature Selection . . . . . . . . . . . . . . . . . . . . . . 71
4.6.3 Significance of Our Results . . . . . . . . . . . . . . . . . . . . . . . . 72
xii
4.7 Robustness of the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7.1 Addition of Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Unified Logging Framework and Audit Data Collection 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Description of our Framework . . . . . . . . . . . . . . . . . . . . . . . 83
5.4 Audit Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.2 Normal Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.3 Attack Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6 User Session Modeling using Unified Log for Application Intrusion Detection 91
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.1 Feature Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4.2 Session Modeling using a Moving Window of Events . . . . . . . . . . . 96
6.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5.1 Experiments with Clean Data (p = 1) . . . . . . . . . . . . . . . . . . . 99
6.5.2 Experiments with Disguised Attack Data (p = 0.60) . . . . . . . . . . . . 102
6.6 Analysis of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.6.1 Effect of ‘S’ on Attack Detection . . . . . . . . . . . . . . . . . . . . . 114
6.6.2 Effect of ‘p’ on Attack Detection (0 < p ≤ 1) . . . . . . . . . . . . . . . 116
6.6.3 Significance of Using Unified Log . . . . . . . . . . . . . . . . . . . . . 118
6.6.4 Test Time Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6.5 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.7 Issues in Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
xiii
6.7.1 Availability of Training Data . . . . . . . . . . . . . . . . . . . . . . . . 123
6.7.2 Suitability of Our Approach for a Variety of Applications . . . . . . . . . 123
6.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7 Conclusions 125
7.1 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Bibliography 131
Appendices 147
A An Introduction to Conditional Random Fields 149
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.3 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A.3.1 Directed Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.3.2 Undirected Graphical Models . . . . . . . . . . . . . . . . . . . . . . . 167
A.4 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.4.1 Representation of Conditional Random fields . . . . . . . . . . . . . . . 169
A.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.4.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.4.4 Tools Available for Conditional Random Fields . . . . . . . . . . . . . . 175
A.5 Comparing the Directed and Undirected Graphical Models . . . . . . . . . . . . 175
A.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
B Feature Selection for Network Intrusion Detection 177
B.1 Feature Selection for Probe Layer . . . . . . . . . . . . . . . . . . . . . . . . . 177
B.2 Feature Selection for DoS Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.3 Feature Selection for R2L Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.4 Feature Selection for U2R Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.5 Template Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
C Feature Selection for Application Intrusion Detection 181
C.1 Template Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
xiv
List of Tables
2.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 KDD 1999 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Detecting Probe Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 58
4.3 Detecting Probe Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . 59
4.4 Detecting DoS Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 60
4.5 Detecting DoS Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . 60
4.6 Detecting R2L Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 61
4.7 Detecting R2L Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . 62
4.8 Detecting U2R Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 63
4.9 Detecting U2R Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . 63
4.10 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Attack Detection at Individual Layers (Case:1) . . . . . . . . . . . . . . . . . . 66
4.12 Attack Detection at Individual Layers (Case:2) . . . . . . . . . . . . . . . . . . 67
4.13 Comparison of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.14 Layered Vs. Non Layered Framework . . . . . . . . . . . . . . . . . . . . . . . 70
4.15 Significance of Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.16 Ranking Various Methods for Intrusion Detection . . . . . . . . . . . . . . . . . 73
6.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Effect of ‘S’ on Attack Detection for Data Set One, when p = 0.60 . . . . . . . . 114
6.3 Analysis of Performance of Different Methods . . . . . . . . . . . . . . . . . . . 115
6.4 Effect of ‘S’ on Attack Detection for Data Set Two, when p = 0.60 . . . . . . . 116
6.5 Comparison of Test Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
xv
B.1 Probe Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
B.2 DoS Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.3 R2L Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.4 U2R Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
xvi
List of Figures
1.1 Behaviour of an Intruding Agent . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Classification of Intrusion Detection Systems . . . . . . . . . . . . . . . . . . . 16
2.2 Knowledge Representation for a Resource (R) . . . . . . . . . . . . . . . . . . . 17
2.3 Representation of a Signature Based System . . . . . . . . . . . . . . . . . . . . 18
2.4 Representation of a Behaviour Based System . . . . . . . . . . . . . . . . . . . 18
2.5 Representation of a Hybrid System . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Graphical Representation of a Conditional Random Field . . . . . . . . . . . . . 35
3.1 Layered Framework for Building Intrusion Detection Systems . . . . . . . . . . 40
3.2 Traditional Layered Defence Approach to Provide Enterprise Wide Security . . . 44
4.1 Conditional Random Fields for Network Intrusion Detection . . . . . . . . . . . 51
4.2 Representation of Probe Layer with Feature Selection . . . . . . . . . . . . . . . 53
4.3 Integrating Layered Framework with Conditional Random Fields . . . . . . . . . 55
4.4 Effect of Noise on Probe Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5 Effect of Noise on DoS Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Effect of Noise on R2L Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Effect of Noise on U2R Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Framework for Building Application Intrusion Detection System . . . . . . . . . 83
5.2 Representation of a Single Event in the Unified log . . . . . . . . . . . . . . . . 85
5.3 Representation of a Normal Session . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Representation of an Anomalous Session . . . . . . . . . . . . . . . . . . . . . . 89
6.1 User Session Modeling using Conditional Random Fields . . . . . . . . . . . . . 95
xvii
6.2 Comparison of F-Measure (p = 1) . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Comparison of F-Measure (p = 0.60) . . . . . . . . . . . . . . . . . . . . . . . 103
6.4 Results using Conditional Random Fields at p = 0.60 . . . . . . . . . . . . . . . 105
6.5 Results using Support Vector Machines at p = 0.60 . . . . . . . . . . . . . . . . 107
6.6 Results using Decision Trees at p = 0.60 . . . . . . . . . . . . . . . . . . . . . 109
6.7 Results using Naive Bayes Classifier at p = 0.60 . . . . . . . . . . . . . . . . . 111
6.8 Results using Hidden Markov Models at p = 0.60 . . . . . . . . . . . . . . . . . 113
6.9 Effect of ‘p’: Results using Conditional Random Fields when 0 < p ≤ 1 . . . . 117
6.10 Significance of Using Unified Log . . . . . . . . . . . . . . . . . . . . . . . . . 119
A.1 Fully Connected Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.2 Fully Disconnected Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . 154
A.3 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
A.4 Maxent Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A.5 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
A.6 Decoding in an Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . 162
A.7 Maximum Entropy Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . 165
A.8 Label Bias Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.9 Undirected Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.10 Linear Chain Conditional Random Field . . . . . . . . . . . . . . . . . . . . . . 170
A.11 Factorization in Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . 176
xviii
Chapter 1
Introduction
I
N this thesis, we address three significant issues which severely restrict the utility of anomaly
and hybrid intrusion detection systems in present networks and applications. The three issues
are; limited attack detection coverage, large number of false alarms and inefficiency in operation.
Present anomaly and hybrid intrusion detection systems have limited attack detection capability,
suffer from a large number of false alarms and cannot be deployed in high speed networks and ap-
plications without dropping audit patterns. Hence, most existing intrusion detection systems such
as the USTAT, IDIOT, EMERALD, Snort and others are developed using knowledge engineering
approaches where domain experts can build focused and optimized pattern matching models [1].
Though such systems result in very few false alarms, they are specific in attack detection and of-
ten tend to be incomplete. As a result their effectiveness is limited. Further, due to their manual
development process, signature based systems are expensive and slow to build. We, thus, address
these shortcomings and develop better anomaly and hybrid intrusion detection systems which are
accurate in attack detection, efficient in operation and have wide attack detection coverage.
1.1 Motivation and Problem Description
I
NTRUSION detection as defined by the Sysadmin, Audit, Networking, and Security (SANS)
institute is the act of detecting actions that attempt to compromise the confidentiality, integrity
or availability of a resource [2]. Today, intrusion detection is one of the high priority and chal-
lenging tasks for network administrators and security professionals.
The objective of an intrusion detection system is to provide data security and ensure continuity
of services provided by a network [3]. Present networks provide critical services which are neces-
sary for businesses to perform optimally and are, thus, a target of attacks which aim to bring down
1
2 Introduction
the services provided by the network. Additionally, with more and more data becoming available
in digital format and more applications being developed to access this data, the data and appli-
cations are also a victim of attackers who exploit these applications to gain access to data. With
the deployment of more sophisticated security tools, in order to protect the data and services, the
attackers often come up with newer and more advanced methods to defeat the installed security
systems [4], [5].
According to the Internet Systems Consortium (ISC) survey, the number of hosts on the In-
ternet exceeded 550,000,000 in July 2008 [6]. Earlier, a project in 2002, estimated the size of
the Internet to be 532,897 TB [7]. Increasing dependence of businesses on the services over the
Internet, though, has led to their rapid growth; it has also made the networks and applications a
prime target of attacks. Configuration errors and vulnerabilities in software are exploited by the
attackers who launch powerful attacks such as the Denial of Service (DoS) [8] and Information
attacks [9]. Rapid increase in the number of vulnerabilities has resulted in an exponential rise in
the number of attacks. According to the Computer Emergency Response Team (CERT), the num-
ber of vulnerabilities in software has been increasing and many of them exist in highly deployed
software [10], [11]. Considering that it is near to impossible to build ‘perfect’ software, it be-
comes critical to build effective intrusion detection systems which can detect attacks reliably. The
prospect of obtaining valuable information, as a result of a successful attack, subside the threat of
legal convictions. The inability to prevent attacks furthers the need for intrusion detection. The
problem becomes more profound since authorized users can misuse their privileges and attackers
can masquerade as authentic users by exploiting vulnerable applications.
Given the diverse type of attacks (Denial of Service, Probing, Remote to Local, User to Root
and others), it is a challenge for any intrusion detection system to detect a wide variety of attacks
with very few false alarms in real-time environment. Ideally, the system must detect all intrusions
with no false alarms. The challenge is, thus, to build a system which has broad attack detection
coverage and at the same time which results in very few false alarms. The system must also
be efficient enough to handle large amount of audit data without affecting performance at the
deployed environment. The simplest way to ensure a high level of security, provided we can
ensure hardware security, is to disable all resource sharing and communication between any two
computers. However, this is in no way a solution for securing today’s highly networked computing
environment and, hence, the need to develop better intrusion detection systems.
1.2 Emerging Attacks 3
1.1.1 Research Objectives
In this thesis:
1. We aim to develop systems which have broad attack detection coverage and which are not
specific in detecting only the previously known attacks.
2. We aim to reduce the number of false alarms generated by anomaly and hybrid intrusion
detection systems, thereby improving their attack detection accuracy.
3. We aim to develop anomaly intrusion detection systems which can operate efficiently in
high speed networks without dropping audit data.
Issues such as scalability, availability of training data, robustness to noise in the training data
and others are also implicitly addressed.
1.2 Emerging Attacks
For an intrusion detection system, it is important to detect previously known attacks with high
accuracy. However, detecting previously unseen attacks is equally important in order to minimize
the losses as a result of a successful intrusion.
In [5], we describe a scenario in which a software agent can be used to attack a specific
target without affecting any other network with a purpose to search and transmit confidential and
sensitive information without authorized access. Such an attack can be carried out by experts
with the motive to hide the entire attack and protect their identity from being discovered. Further,
since the attack targets only a single network, it would not be detected by large scale cooperative
intrusion detection systems. The most significant part of the entire attack is that none of the present
systems can detect such attacks and the agent can destroy itself when the attack is successful
without leaving traces of its activities. Unlike worms, the replication in case of an intruding agent
is limited and it does not degrade performance at the target making their detection very difficult.
We represent the behaviour of the intruding agent in Figure 1.1 by a flow diagram.
In addition to detecting the Denial of Service attacks, which target availability aspect, and
the Information attacks, which target confidentiality and integrity aspects, the intrusion detection
systems must also be able to detect attacks which present a change in the motive of the attackers.
Such attacks are network specific and the attacker follows a criminal pursuit which is driven by
4 Introduction
Time
Out
Information
Correct
Transmit Information and
Await Confirmation
Success
Search Information
Start
Destroy Itself and Traces
Set Up a Knowledge Database
Search and Control a Zombie
Out
Time
Attempt to Enter the Target Network
Success
Update Knowledge Database
and Adjust Behaviour
Success
Replicate
Yes
No
No
Yes
End
Attempts
n>N
Time
Out
No
Yes
No
Yes
Yes
Yes
No
No
Yes
No
Yes
No
Figure 1.1: Behaviour of an Intruding Agent
the goal to make money [4]. This has not only resulted in increasing the severity of attacks, but
the attacks have become isolated; targeting only a few nodes in a single network. Such attacks are
very difficult to detect using generic systems and hence, better intrusion detection systems must
be developed which are capable of detecting such specific attacks.
1.3 Contributions to Thesis
In order to launch an attack, an attacker often follows a sequence of events. The events in such a
sequence are highly correlated and long range dependencies exist between them. Further, in order
to prevent detection, the attacker can also hide the individual events within a large number of
normal events. As a result, considering the events in isolation affects classification and results in a
1.3 Contributions to Thesis 5
large number of false alarms. Additionally, the individual events themselves are vector quantities
and consist of multiple features which are monitored continuously. These features are also highly
correlated and must not be analyzed in isolation.
In order to operate in high speed networks, present anomaly based systems consider the events
individually, thereby, discarding any correlation between the sequential events. In cases when the
present systems consider a sequence of events, they monitor only one feature, ignoring others,
which results in a poor model. Hence, we introduce efficient intrusion detection frameworks
and methods which consider a sequence of events and which analyze multiple features without
assuming any independence among the features.
1.3.1 Layered Framework for Intrusion Detection
In Chapter 3, we introduce our Layered Framework for building intrusion detection systems
which can be used, for example, as a network intrusion detection system and can detect a wide
variety of attacks reliably and efficiently when compared to the traditional network intrusion de-
tection systems. In our layered framework, we use a number of separately trained and sequentially
arranged sub systems in order to decrease the number of false alarms and increase the attack de-
tection coverage. In particular, our layered framework has the following advantages:
• The framework is customizable and domain specific knowledge can be easily incorporated
to build individual layers which help to improve accuracy.
• Individual intrusion detection sub systems are light weight and can be trained separately.
• Different anomaly and hybrid intrusion detectors can be incorporated in our framework.
• Our framework not only helps to detect an attack but it also helps to identify the type of
attack. As a result, specific intrusion response mechanisms can be initiated automatically
thereby reducing the impact of an attack.
• Our framework is scalable and the number of layers can be increased (or decreased) in the
overall framework.
1.3.2 Layered Conditional Random Fields for Network Intrusion Detection
Network monitoring is one of the common and widely applied methods for detecting malicious
activities in an entire network. However, real-time monitoring of every single event even in a
6 Introduction
moderate size network may not be feasible, simply due to the large amount of network traffic. As
a result, it is only possible to perform pattern matching using attack signatures which may at best
detect only previously known attacks. Anomaly based systems result in dropping audit data when
they are used to analyze every event. As a result, network monitoring often involves analyzing only
the summary statistics from the audit data. The summary statistics may include features of a single
TCP session between two IP addresses or may include network level features such as the load on
sever, number of incoming connections per unit time and others. Such statistics are represented
in the KDD 1999 data set [12]. In Chapter 4, we introduce the Layered Conditional Random
Fields which can be used to build accurate anomaly intrusion detection systems which can operate
efficiently in high speed networks. In particular, our system has the following advantages:
• The attack detection accuracy improves for individual sub systems when using conditional
random fields.
• The overall system has wide attack detection coverage, where every sub system is trained
to detect attacks belonging to a single attack class.
• Attacks can be detected efficiently in high speed networks.
• Our system is robust to noise and performs better than any other compared system.
1.3.3 Unified Logging Framework for Audit Data Collection
In order to access application data, a user has no option but to access the application which interacts
with the application data. Hence, application access and the corresponding data accesses are
highly correlated. In order to detect attacks effectively, we aim to capture this correlation between
the application access and the corresponding data accesses. Hence, in Chapter 5, we present our
Unified Logging Framework which efficiently integrates the application and the data access logs.
We have collected two such data sets which can be downloaded and used freely [13]. In particular,
our unified logging framework has the following advantages:
• By using the unified log, the objective is to capture the user-application and the application-
data interaction in order to improve attack detection. Further, this interaction is fixed and
does not vary overtime as opposed to modeling user profiles which changes frequently.
• Our framework is application independent and can be deployed for a variety of applications.
1.4 Thesis Organization 7
1.3.4 User Session Modeling for Application Intrusion Detection
Network monitoring is often restricted to monitoring summary statistics due to excessive amount
of network traffic and is further affected due to network address translation and encryption, mak-
ing it difficult to provide high level of security. Thus, it becomes necessary to extend network
monitoring and focus on data and applications which are often target of attacks. Further, as we
have already mentioned, many attacks require a number of sequential operations to be performed.
In Chapter 6, we introduce User Session Modeling using Conditional Random Fields which an-
alyzes the unified log to detect application level attacks. In particular, our system has the following
advantages:
• Conditional random fields perform best outperforming other well known anomaly detection
approaches including decision trees, naive Bayes classifiers, support vector machines and
hidden Markov models. Our system based on conditional random fields is particularly
effective when attacks span over a sequence of events (such as password guessing followed
by launching of the exploit to gain administrative privileges on the target and finally leading
to unauthorized access of data.)
• Our approach is robust in detecting disguised attacks.
• Using our system, attacks can be blocked in real-time.
• Performing session modeling using conditional random fields in our unified logging frame-
work, attacks can be detected at smaller window widths thereby resulting in an efficient
system which does not require a large amount of history to be maintained.
1.4 Thesis Organization
This thesis is organized as follows; we first present the taxonomy of intrusion detection and give
the related literature review in Chapter 2. We then describe our layered framework which can be
used to build effective and efficient intrusion detection systems in Chapter 3. In Chapter 4, we
describe how conditional random fields can be integrated in our layered framework. We present
our experimental results and demonstrate that layered conditional random fields outperform well
known methods for intrusion detection and are a strong candidate to build robust and efficient
network intrusion detection systems. We then describe our unified logging framework in Chapter
8 Introduction
5, which integrates the application access logs and the corresponding data access logs to provide
a unified audit log. The unified log captures the necessary user-application and application-data
interaction which is useful to detect application level attacks effectively. In Chapter 6, we then
use the conditional random fields and perform user session modeling using a moving window of
events in our unified logging framework to build real-time application intrusion detection systems.
Our experimental results suggest that performing user session modeling using conditional random
fields’ attacks can be detected by analyzing only a small number of events in a user session which
results in an efficient and an accurate system. Finally, in Chapter 7 we conclude and give possible
directions for future research.
Chapter 2
Background
D
ETECTING intrusions in networks and applications has become one of the most critical
tasks to prevent their misuse by attackers. The cost involved in protecting these valuable
resources is often negligible when compared with the actual cost of a successful intrusion, which
strengthens the need to develop more powerful intrusion detection systems. Intrusion detection
started in 1980’s and since then a number of approaches have been introduced to build intrusion
detection systems [1], [14], [15], [16], [17], [18], [19], [20]. However, intrusion detection is still
at its infancy and naive attackers can launch powerful attacks which can bring down an entire net-
work [5]. To identify the shortcoming of different approaches for intrusion detection, we explore
the related research in intrusion detection. We describe the problem of intrusion detection in detail
and analyze various well known methods for intrusion detection with respect to two critical re-
quirements viz. accuracy of attack detection and efficiency of system operation. We observe that
present methods for intrusion detection suffer from a number of drawbacks which significantly af-
fect their attack detection capability. Hence, we introduce conditional random fields for effective
intrusion detection and motivate our approach for building intrusion detection systems which can
operate efficiently and which can detect a wide variety of attacks with relatively higher accuracy,
both at the network and at the application level.
2.1 Introduction
P
RESENT networks are increasingly based on the concept of resource sharing as it is a neces-
sity for collaboration, and provides an easy means of communication and economic growth.
However, the need to communicate and share resources increases the complexity of the system.
The systems are getting bigger with more and more add on features making them complex. This
9
10 Background
results in vulnerabilities in software and configuration errors in networks and deployed applica-
tions. Ease of access of resources in addition to vulnerabilities and poor management of resources
can be exploited to launch attacks [3]. Further, features intended for some specific usage in many
applications may also be exploited for misuse of systems. A typical example of this is the response
generated by the SQL server which is often exploited in the SQL injection attacks. As a result, the
number of attacks has increased significantly [10]. Additionally, the attacks have become more
complex and difficult to detect using traditional intrusion detection approaches, demanding more
effective solutions [5]. More stringent monitoring has further increased the resources required by
the intrusion detection systems. However, addition of more resources may not always provide a
desired level of security.
The notion of intrusion detection was born in the 1980’s with a paper from Anderson [21],
which described that audit trails contain valuable information and could be utilized for the pur-
pose of misuse detection by identifying anomalous user behaviour. The lead was then taken by
Denning at the SRI International and the first model of intrusion detection, ‘Intrusion Detection
Expert System’ (IDES) [22], [23] was born in 1984. Another project at the Lawrence Livermore
Laboratories developed the ‘Haystack’ intrusion detection system in 1988 [24]. This further led
to the concept of distributed intrusion detection system which augmented the existing solution by
tracking client machines as well as the servers. The last system to be released under the same
generation, called ‘Stalker’, was released in 1989 which was again a host based, pattern matching
system [25]. Until then, the majority of the systems were host based and analyzed the individual
host level audit records. Todd Heberlein, in 1990 introduced the concept of network intrusion
detection and came up with the system called the ‘Network Security Monitor’ (NSM) [26], [27].
These developments gradually paved way for the intrusion detection systems to enter into the
commercial market with products such as ‘Net Ranger’, ‘Real Secure’ and ‘Snort’ acquiring big
market shares [25], [28].
Present intrusion detection systems are very often based on analyzing individual audit patterns
by extracting signatures or are based on analyzing summary statistic collected at the network or
at the application level [9], [29]. Such systems are unable to detect attacks reliably because they
neglect the sequence structure in the audit patterns and consider every pattern to be independent.
In most situations such independence assumptions do not hold which severely affect the attack
detection capability of an intrusion detection system.
2.2 Intrusion Detection and Intrusion Detection System 11
Another approach for intrusion detection is based on analyzing sequence structure in the audit
patterns. Methods based on analyzing sequence of system calls issued by privileged processes
are well known [30], [31]. However, to reduce system complexity, the system considers only one
feature which is the sequence of system calls. Other features, such as the arguments of the system
calls, are ignored. In cases, when multiple features are considered, the features are assumed inde-
pendent and separate models are built using individual features. Results from all the models are
then combined using a voting mechanism. This again may not detect attacks reliably. To improve
attack detection, all of the features must be considered collectively and not independently [32],
[33]. Assuming events to be independent makes the model simple and improves speed of opera-
tion; but at the cost of reduced attack detection and increased number of false alarms. Frequent
false alarms, in turn, make the system administrators to ignore the alarms altogether.
Present networks and applications are, thus, far away froma state where they can be considered
secure. Hence, in this chapter we explore the problem of intrusion detection to identify the root
causes of the inability of the present intrusion detection systems to detect attacks reliably. We then
motivate the use of conditional random fields [34] for building effective network and application
intrusion detection systems [32], [33], [35], [36], [37].
The rest of the chapter is organized as follows; in Section 2.2, we give the taxonomy of in-
trusion detection which is described in detail in [38]. We then give their classification in Section
2.3, followed by the properties of the audit patterns which can be used to detect attacks in Sec-
tion 2.4. We present the evaluation metrics for analyzing intrusion detection systems in Section
2.5 and give a detailed literature review for intrusion detection in Section 2.6. We then describe
conditional random fields in Section 2.7. Finally, we conclude this chapter in Section 2.8.
2.2 Intrusion Detection and Intrusion Detection System
The intrusion detection systems are a critical component in the network security arsenal. Security
is often implemented as a multi layer infrastructure and different approaches for providing security
can be categorized into the following six areas [39]:
1. Attack Deterrence – Attack deterrence refers to persuading an attacker not to launch an
attack by increasing the perceived risk of negative consequences for the attacker. Having
a strong legal system may be helpful in attack deterrence. However, it requires strong
12 Background
evidence against the attacker in case an attack was launched. Research in this area focuses
on methods such as those discussed in [40] which can effectively trace the true source of
attack as very often the attacks are launched with spoofed source IP address. (Spoofing
refers to sending IP packets with modified source IP address so that the true sender of the
packet cannot be traced.)
2. Attack Prevention – Attack prevention aims to prevent an attack by blocking it before
an attack can reach the target. However, it is very difficult to prevent all attacks. This
is because, to prevent an attack, the system requires complete knowledge of all possible
attacks as well as the complete knowledge of all the allowed normal activities which is not
always available. An example of attack prevention system is a firewall [41].
3. Attack Deflection – Attack deflection refers to tricking an attacker by making the attacker
believe that the attack was successful though, in reality, the attacker was trapped by the
system and deliberately made to reveal the attack. Research in this area focuses on attack
deflection systems such as the honey pots [42].
4. Attack Avoidance – Attack avoidance aims to make the resource unusable by an attacker
even though the attacker is able to illegitimately access that resource. An example of
security mechanism for attack avoidance is the use of cryptography [43]. Encrypting data
renders the data useless to the attacker, thus, avoiding possible threat.
5. Attack Detection – Attack detection refers to detecting an attack while the attack is still in
progress or to detect an attack which has already occurred in the past. Detecting an attack
is significant for two reasons; first the system must recover from the damage caused by
the attack and second, it allows the system to take measures to prevent similar attacks in
future. Research in this area focuses on building intrusion detection systems.
6. Attack Reaction and Recovery – Once an attack is detected, the system must react to an
attack and perform the recovery mechanisms as defined in the security policy.
Tools available to perform attack detection followed by reaction and recovery are known as the
intrusion detection systems. However, the difference between intrusion prevention and intrusion
detection is slowly diminishing as the present intrusion detection systems increasingly focus on
real-time attack detection and blocking an attack before it reaches the target. Such systems are
better known as the Intrusion Prevention Systems.
2.2 Intrusion Detection and Intrusion Detection System 13
2.2.1 Principles and Assumptions in Intrusion Detection
Denning [22] defines the principle for characterizing a system under attack. The principle states
that for a system which is not under attack, the following three conditions hold true:
1. Actions of users conform to statistically predictable patterns.
2. Actions of users do not include sequences which violate the security policy.
3. Actions of every process correspond to a set of specifications which describe what the
process is allowed to do.
Systems under attack do not meet at least one of the three conditions. Further, intrusion de-
tection is based upon some assumptions which are true regardless of the approach adopted by the
intrusion detection system. These assumptions are:
1. There exists a security policy which defines the normal and (or) the abnormal usage of
every resource.
2. The patterns generated during the abnormal system usage are different from the patterns
generated during the normal usage of the system; i.e., the abnormal and normal usage of a
system results in different system behaviour. This difference in behaviour can be used to
detect intrusions.
As we shall discuss later, different methods can be used to detect intrusions which make a
number of assumptions that are specific only to the particular method. Hence, in addition to the
definition of the security policy and the access patterns which are used in the learning phase of
the detector, the attack detection capability of an intrusion detection system also depends upon the
assumptions made by individual methods for intrusion detection [44].
2.2.2 Components of Intrusion Detection Systems
An intrusion detection system typically consists of three sub systems or components:
1. Data Preprocessor – Data preprocessor is responsible for collecting and providing the audit
data (in a specified form) that will be used by the next component (analyzer) to make a
decision. Data preprocessor is, thus, concerned with collecting the data from the desired
source and converting it into a format that is comprehensible by the analyzer.
14 Background
Data used for detecting intrusions range from user access patterns (for example, the se-
quence of commands issued at the terminal and the resources requested) to network packet
level features (such as the source and destination IP addresses, type of packets and rate of
occurrence of packets) to application and system level behaviour (such as the sequence of
system calls generated by a process.) We refer to this data as the audit patterns.
2. Analyzer (Intrusion Detector) – The analyzer or the intrusion detector is the core compo-
nent which analyzes the audit patterns to detect attacks. This is a critical component and
one of the most researched. Various pattern matching, machine learning, data mining and
statistical techniques can be used as intrusion detectors. The capability of the analyzer to
detect an attack often determines the strength of the overall system.
3. Response Engine – The response engine controls the reaction mechanism and determines
how to respond when the analyzer detects an attack. The system may decide either to raise
an alert without taking any action against the source or may decide to block the source for
a predefined period of time. Such an action depends upon the predefined security policy of
the network.
In [45], the authors define the Common Intrusion Detection Framework (CIDF) which recog-
nizes a common architecture for intrusion detection systems. The CIDF defines four components
that are common to any intrusion detection system. The four components are; Event generators (E-
boxes), event Analyzers (A-boxes), event Databases (D-boxes) and the Response units (R-boxes).
The additional component, called the D-boxes, is optional and can be used for later analysis.
2.2.3 Challenges and Requirements for Intrusion Detection Systems
The purpose of an intrusion detection system is to detect attacks. However, it is equally important
to detect attacks at an early stage in order to minimize their impact. The major challenges and
requirements for building intrusion detection systems are:
1. The system must be able to detect attacks reliably without giving false alarms. It is very
important that the false alarm rate is low as in a live network with large amount of traf-
fic, the number of false alarms may exceed the total number of attacks detected correctly
thereby decreasing the confidence in the attack detection capability of the system. Ideally,
the system must detect all intrusions with no false alarms. The challenge is to build a sys-
2.3 Classification of Intrusion Detection Systems 15
tem which has broad attack detection coverage, i.e. it can detect a wide variety of attacks
and at the same time which results in very few false alarms.
2. The system must be able to handle large amount of data without affecting performance and
without dropping data, i.e. the rate at which the audit patterns are processed and decision
is made must be greater than or equal to the rate of arrival of new audit patterns. Hence the
speed of operation is critical for systems deployed in high speed networks. In addition, the
system must be capable of operating in real-time by initiating a response mechanism once
an attack is detected. The challenge is to prevent an attack rather than simply detecting it.
3. A system which can link an alert generated by the intrusion detector to the actual security
incident is desirable. Such a system would help in quick analysis of the attack and may
also provide effective response to intrusion as opposed to a system which offers no after
attack analysis. Hence, it is not only necessary to detect an attack, but it is also important
to identify the type of attack.
4. It is desirable to develop a system which is resistant to attacks since, a system that can be
exploited during an attack may not be able to detect attacks reliably.
5. Every network and application is different. The challenge is to build a system which is
scalable and which can be easily customized as per the specific requirements of the envi-
ronment where it is deployed.
2.3 Classification of Intrusion Detection Systems
Classifying intrusion detection systems helps to better understand their capabilities and limitations.
We therefore, present the classification of intrusion detection systems in Figure 2.1.
From Figure 2.1, we observe that for any intrusion detection system, security policy and audit
patterns are the two prime information sources. The audit patterns must be analyzed to detect an
attack and the security policy defines the acceptable and non acceptable usage of a resource and
helps to qualify whether an event is normal or attack. Hence, based on the given classification,
an example of an intrusion detection system can be a centralized system deployed on a network
with sliding window based data collection which operates in real-time and is based on signature
analysis with active response to intrusion.
16 Background
Audit
Patterns
Information Sources for an
Intrusion Detection System
Security
Policy
Signature Based
Behaviour Based
Hybrid
Knowledge of the Resources
Frequency of Analysis
Batch Mode
Near Real Time
Real Time
Response on Intrusion
Passive
Active
Number of Audit Sources
Centralized
Distributed
Alert Correlation
Audit Source Location
Network Based
Host Based
Application Based
Periodic Snapshot Based
Frequency of Audit Data Collection
Session Based
Sliding Window Based
Figure 2.1: Classification of Intrusion Detection Systems
2.3 Classification of Intrusion Detection Systems 17
2.3.1 Classification based upon the Security Policy definition
Intrusion detection systems are classified in two ways based upon the security policy definition.
1. Security policy defines the normal and abnormal usage of every resource. Consider a set U,
which represents the complete domain (universe) for a resource R. The set U consists of
both, normal and abnormal usage of R. Hence, U = U
R−normal
, U
R−attack
. The problem is
to identify the set U such that it is complete and unambiguous. However, in most practical
situations it is very difficult to identify and define the complete set U and only a small
portion of this set is available which is denoted as S. Hence, the security policy is defined
with only the knowledge contained in the subset S, where S = S
R−normal
, S
R−attack
. This
is represented in Figure 2.2.
U
R−attack
U
R−normal
(a) Total Knowledge
S
R−attack
S
R−normal
(b) Available Knowledge
Figure 2.2: Knowledge Representation for a Resource (R)
where |U
R−normal
| ≥ |S
R−normal
| and |U
R−attack
| ≥ |S
R−attack
|
Based upon the elements of subset S, intrusion detection system can be classified as:
(a) Signature (Misuse) Based – When the set S only contains the events which are
known to be attack, the system focuses on detecting known misuses and is known
as signature or misuse based system [42]. Signature based system are represented in
Figure 2.3.
Signature based systems employ pattern matching approaches to detect attacks.
They can detect attacks with very few false alarms but have limited attack detection
capability since they cannot detect unseen attacks. Their attack detection capabil-
ity is directly proportional to the available knowledge of attacks in the set S, i.e.
knowledge of S
R−attack
. To be effective, such systems require complete knowledge
of attacks, i.e. S
R−attack
should be equal to U
R−attack
, which is not always possible.
18 Background
Correctly
Correctly
Detected Attacks
Detected Normals
Missed Attacks
Figure 2.3: Representation of a Signature Based System
(b) Behaviour (Anomaly) Based – When the set S only consists of events which are
known to be normal, the goal of the intrusion detection system is to identify signifi-
cant deviations from the known normal behaviour [42] as shown in Figure 2.4.
Correctly
Detected Normals
Correctly
Detected Attacks
False Alarms
Missed Attacks
Figure 2.4: Representation of a Behaviour Based System
For behaviour based systems to be effective complete knowledge of normal be-
haviour of a resource is required, i.e. the set S
R−normal
should be equal to the set
U
R−normal
. Since the complete knowledge of a resource may not be available, a
threshold is used which gives some flexibility to the system. Events which lie be-
yond the threshold are detected as attacks. Hence, behaviour based systems; in
general, suffer from a large false alarm rate. False alarms can be reduced by increas-
ing the threshold, however, this affects the attack detection and the system may not
be able to detect a wide variety of attacks. Hence, there is a tradeoff in limiting the
number of false alarms and the capability of the system to detect a variety of attacks.
(c) Hybrid – In most environments, it may not be possible to completely define either
the normal or the abnormal behaviour. As a result, an intrusion detection systemmay
generate a large number of false alarms or may be specific in detecting only a few
types of attacks. A hybrid system uses the partial knowledge of both, i.e., S
R−normal
and S
R−attack
, to detect attacks; often resulting in fewer false alarms and detecting
2.3 Classification of Intrusion Detection Systems 19
more attacks. Such systems generally employ machine learning approaches. A
hybrid system is represented in Figure 2.5.
False Alarms
Missed Attacks
Correctly
Detected Attacks
Correctly
Detected Normals
Figure 2.5: Representation of a Hybrid System
2. The security policy also defines how the system must respond when an attack is detected;
based upon which the intrusion detection systems can be classified as:
(a) Passive Response Systems – In a passive response system, the system does not take
any measure to respond to an attack once an attack is detected. It simply generates
an alert which can be analyzed by the administrator at some later stage [39], [42].
(b) Active Response Systems – In active response systems, the intrusion detection sys-
tems respond to attacks by various possible approaches which may include blocking
the source of the attack for a predefined time period [39], [42].
2.3.2 Classification based upon the Audit Patterns
1. The source from which the audit patterns are collected affects the attack detection capabil-
ity of a system. For example, when network statistics are used as the audit patterns, they
cannot provide any detail about the user and system interaction. Based on this, intrusion
detection systems are classified as:
(a) Network Based – In a network based system, the audit patterns collected at the net-
work level are used by the intrusion detector [46], [47]. Though a single system
(or a few strategically placed systems) is (are) sufficient for the entire network, the
attack detection capability of a network based system is limited. This is because it
is hard to infer the contextual information directly from the network audit patterns.
Further, the audit patterns may be encrypted rendering them unusable by the intru-
sion detector at the network level. In addition, large amount of audit patterns at the
20 Background
network level may also affect the total attack detection accuracy. This is because of
two reasons; first, a significant portion of the total incoming patterns may be allowed
to pass into the network without any analysis and second, in high speed networks,
it may be practical to analyze only the summary statistics collected at regular time
intervals. These statistics may include features such as the total number of connec-
tions, amount of incoming and outgoing traffic. Such features only provide a high
level summary which may not be able to detect attacks reliably [42].
(b) Host Based – The intrusion detector in a host based system analyzes the audit pat-
terns generated at the kernel level of the system which include system access logs
and the error logs [42]. The audit patterns collected at the individual host contains
more specific information than the network level audit patterns, which may be used
to detect attacks reliably. However, it becomes difficult to manage a large num-
ber of host based systems in a big network. Additionally, host based systems can
themselves be the victims of an attack.
(c) Application Based – The application based systems are concerned only with a single
application and detect attacks directed at a particular application or a privileged
process [31]. They can analyze either the application access logs or the system
calls generated by the processes to detect anomalous activities. The application
based systems can be very effective as they can exploit the complete knowledge of
the application and can be used even when encryption is used in communication.
They can also analyze the user and application interactions which can significantly
improve the attack detection accuracy.
2. In order to detect intrusions, the audit patterns can be collected froma single source or from
a number of sources. When the audit patterns are collected from more than one source, the
decision can be made by individual nodes or by aggregating the audit patterns at a single
point and then analyzing them together. Based upon this property, the intrusion detection
systems can be classified as:
(a) Centralized System – In a centralized system, the audit patterns are collected either
from a single source or from multiple sources but are processed at a single point
where they are analyzed together to determine the global state of the network [42].
However, such systems may themselves become a target of attacks.
2.3 Classification of Intrusion Detection Systems 21
(b) Distributed System – In contrast to the centralized systems, the distributed systems
can make local decisions close to the source of the audit patterns and may report
only a small summary of activities to a higher level in the system. The advantage of
a distributed system for intrusion detection is that immediate response mechanism
can be activated based upon local decisions. However, distributed systems can be
less accurate due to lack of global knowledge. Agent based systems are examples
of distributed intrusion detection systems [42].
(c) Alert Correlation – Alert correlation based systems analyze the alerts generated by
a number of cooperating intrusion detection systems [39]. The individual systems
may themselves be centralized or decentralized. Alert correlation systems can only
be effective when multiple networks are attacked with similar attacks such as in case
of worm outbreak. Incase when the attacks are network specific, the alert correlation
systems will not be effective even though a few target networks may detect some
anomalous activities. In such cases, the local alerts will be discarded as false alarms
due to lack of global consensus.
3. Regardless of the source and the number of audit patterns, the intrusion detection systems
can be classified depending upon the frequency at which the audit patterns are collected.
Based on this, they are classified as:
(a) Session Based – Audit patterns can be collected at the end of every session by sum-
marizing different features. Methods can be used which analyze the summary of
every session once the session is terminated.
(b) Sliding Window Based – In case of sliding window based collection of audit pat-
terns, events are recorded using a moving window of fixed or variable width. The
width of the window defines the number of events recorded together and the step
size for sliding the window determines how fast the window is advanced forward.
(c) Periodic Snapshot Based – Instead of recording every event or summarizing a ses-
sion at its termination, snapshots of different states of the entire system can be taken
at regular intervals which can be analyzed to detect intrusions.
4. Depending upon the frequency of analysis of audit patterns, the intrusion detection systems
can be classified as:
22 Background
(a) Batch Mode – In batch mode intrusion detection, the audit patterns are aggregated
in a central repository. The patterns are then analyzed for intrusions at predefined
time intervals. Such systems cannot provide any immediate response to intrusion
and can only perform the recovery task once an attack is detected.
(b) Near Real-time – An intrusion detection system is said to perform in near real-time
when the system cannot detect an intrusion when it commenced, but can detect it at
some later stage during the attack or immediately at the end of an attack. In such
systems, there is some delay before the patterns are made available to the intrusion
detector. Patterns collected by taking periodic snapshots or using moving window
with step size greater than one can be used for near real-time intrusion detection.
(c) Real-time – A real-time intrusion detection system must detect an attack as soon
as it is commenced, i.e. the system is said to perform in real-time if and only if,
for an event ‘x’ when the attack commenced, the attacker cannot succeed with the
event ‘x+1’. Hence, for real-time intrusion detection, the system must detect an
attack immediately. However, in practice it is very difficult to build such a system
given the constraint that it should have low false alarm rate and high attack detection
accuracy. Real-time intrusion detection systems can be implemented by using a
moving window with a step of size one. Network based signature detection systems,
which perform pattern matching can also perform in real-time by checking every
event for known attacks. However, they are limited in detecting only those attacks
whose signatures are known in prior. A typical example is the Snort [48].
2.4 Audit Patterns
The raw patterns must be preprocessed and presented in a format which can be interpreted by the
intrusion detector before they can be analyzed.
2.4.1 Properties of Audit Patterns useful for Intrusion Detection
Different properties in the audit patterns can be analyzed for detecting intrusions. The authors in
[49] describe three properties which can be used to detect intrusions.
2.4 Audit Patterns 23
1. Frequency of Event(s) – Frequency determines how often an event occurs in a predefined
time interval. A threshold can be used to define the limit. When the frequency crosses this
limit, an alarm can be raised. Properties such as the number of invalid login attempts and
the number of rows accessed in a database can be used to measure frequency.
2. Duration of Event(s) – Rather than counting the number of occurrences of an event, the
duration property determines the acceptable time duration for a particular event. It is based
upon selecting a threshold which defines an acceptable range for a particular event. For
example, large number of invalid login attempts for a single user id in a very short time
span can be considered as an attempt to guess a password and hence an attack.
Systems analyzing the frequency or (and) duration property for the events can perform
efficiently but they suffer from large false alarm rate as it is often difficult to determine the
correct threshold for the events.
3. Ordering of Events – Analyzing the order in which events occur can improve the attack
detection accuracy and reduce false alarms. This is because, very often, intrusion is a multi
step process in which a number of events must occur sequentially in order to launch a
successful attack. However, to avoid detection from systems which do analyze a sequence
of events, the attacks can be spread over a long period of time such that the events cannot
be correlated unless a long history is maintained by the intrusion detection system.
A system which can analyze all of the above mentioned properties can detect attacks with high
accuracy. However, such a system may be inefficient in operation.
2.4.2 Univariate or Multivariate Audit Patterns
The audit patterns used to detect attacks may either be univariate or multivariate. As, discussed
before, the audit patterns may be collected from the routers and switches for the network level
systems. When only one feature is analyzed, in case of univariate audit patterns, the analysis is
much simpler in comparison to when many features are analyzed together, as in case of multivari-
ate analysis. However, a single feature itself may not be the complete representation and, hence,
insufficient to detect attacks. For example, when the sequence of system calls generated by a priv-
ileged process is analyzed for detecting abnormal behaviour, discarding other features such as the
parameters of the system calls can affect the attack detection capability of the system [9].
24 Background
2.4.3 Relational or Sequential Representation
Very often, the audit patterns collected are sequential where one or more features are recorded
continuously. However, the raw audit patterns may be processed into a relational form and a
number of new features can be added. These features often give a high level representation of
the audit patterns in a summarized form. Examples of such features include; total amount of
data transferred in a session and duration of a session. Frequency and duration properties of
events can be easily represented in a relational form. Converting the audit patterns from sequential
to relational form has two advantages; first, more features can be added and second, efficient
methods can be used for analysis of audit patterns in relational form. However, this may result in
affecting the attack detection capability as in relational form the ordering of events and, hence, the
relationship among sequential events is lost. When the audit patterns are represented sequentially,
event ordering can be exploited in favour of higher attack detection accuracy. However, in general,
sequence analysis is slower when compared to the relational analysis.
2.5 Evaluation Metrics
Evaluating different methods for detecting intrusions is important. Intrusion detection is an ex-
ample of a problem with imbalanced classes, i.e. the number of instances in the classes is not
equally distributed. The number of attacks is very small when compared with the total number of
normal events. Note that, in case of the Denial of Service attacks, the amount of attack traffic is
extremely large as compared to the normal traffic. Hence, evaluating intrusion detection systems
using simple accuracy metric may result in very high accuracy [50]. Other metrics such as Pre-
cision, Recall and F-Measure, which do not depend on the size of the test set, are, thus, used for
evaluating intrusion detectors. These are defined with the help of the confusion matrix as follows:
Table 2.1: Confusion Matrix
Predicted Normal Predicted Attack
True Normal True Negative False Positive
True Attack False Negative True Positive
2.6 Literature Review 25
Precision =
number o f True Positives
number o f True Positives + number o f False Positives
Recall =
number o f True Positives
number o f True Positives + number o f False Negatives
F − Measure =
(1 + β
2
) ∗ Recall ∗ Precision
β
2
∗ (Recall + Precision)
where β corresponds to the relative importance of Precision vs. Recall and is usually set to 1.
Hence, a system must have high Precision (i.e. it must detect only attacks), high Recall (i.e. it
must detect all attacks) and, thus, a high F-Measure.
In addition to evaluating the attack detection capability of the detector, time taken to detect
an attack is also significant. The time performance is generally measured for the time taken by
the intrusion detector to detect an attack from the time the audit patterns are fed into the detector.
This is sufficient for comparison when different methods use exactly the same data for analysis,
however, it does not represent the efficiency of the intrusion detection system, since the time taken
in collecting and preprocessing the audit patterns is not considered. Hence, in real environments,
total time must be measured which is the time from the point when intrusion actually started to the
point in time when the response mechanism is activated.
2.6 Literature Review
Two most significant motives to launch attacks as described in [3] are, either to force a network
to stop some service(s) that it is providing or to steal some information stored in a network. An
intrusion detection system must be able to detect such anomalous activities. However, what is
normal and what is anomalous is not defined, i.e., an event may be considered normal with respect
to some criteria, but the same may be labeled anomalous when this criterion is changed. Hence,
the objective is to find anomalous test patterns which are similar to the anomalous patterns which
occurred during training. The underlying assumption is that the evaluating criterion is unchanged
and the system is properly trained such that it can reliably separate normal and anomalous events.
26 Background
2.6.1 Frameworks for building Intrusion Detection Systems
A number of frameworks have been proposed for building intrusion detection systems. The com-
mon intrusion detection framework is described in [45]. The authors in [50] and [51] describe a
data mining framework for building intrusion detection systems. Using the approach described
in [51], the rules can be learned inductively instead of manually coding the intrusion patterns and
profiles. However, their approach requires the use of a large amount of noise free audit data to train
the models. Agent based intrusion detection frameworks are discussed in [52] and [53]. Frame-
works which describe the collaborative use of intrusion detection systems have also been proposed
[54], [55]. The system described in [54] is based on the combination of network based and host
based systems while the system in [55] employs both, signature based and behaviour based tech-
niques for detecting intrusions. All of these frameworks suffer from one major drawback; a single
intrusion detector used within these frameworks is trained to detect a wide variety of attacks. This
results in a large number of false alarms. To ameliorate this, we introduce our Layered Framework
for building Intrusion Detection Systems in Chapter 3.
2.6.2 Network Intrusion Detection
The prospect of maintaining a single system which can be used to detect network wide attacks
make network monitoring a preferred option as opposed to monitoring individual hosts in a large
network. A number of techniques such as association rules, clustering, naive Bayes classifier, sup-
port vector machines, genetic algorithms, artificial neural networks and others have been applied
to detect intrusions at network level. It is important to note that different methods are based on
specific assumptions and analyze different properties in the audit patterns, resulting in different
attack detection capabilities. These methods can be broadly divided into three major categories:
Pattern Matching
Pattern matching techniques search for predefined set of patterns (known as signatures) in the
audit patterns to detect intrusions. Pattern matching approaches are employed on the audit patterns
which do not have any state or sequence information. Hence, they assume independence among
events. However, this assumption may not always hold as a single intrusion may span over multiple
events which are correlated. The prime advantage of pattern matching approaches is that they
2.6 Literature Review 27
are very efficient and triggers an alert only when an exact match of an attack signature is found
resulting in very few false alarms. They can, however, detect attacks only if the corresponding
pattern (signature) exists in the signature database. Hence, they cannot detect unseen attacks for
which there are no signatures [9], [42]. Snort system [48] is based upon pattern matching.
Statistical Methods
Statistical methods based on modeling the monitored variables as independent Gaussian random
variables and methods such as those based on the Hotelling T
2
test statistic can be used to detect
attacks by calculating deviations in the present profile from the stored normal profile [9]. They
are based upon modeling the underlying process which generates the audit patterns and exploit
the frequency and duration property of events. They often analyze properties such as the overall
system load and statistical distribution of events, which represent a summary measure. When the
deviations exceed a predefined threshold, the system triggers an alarm. To determine this threshold
accurately is a critical issue. When the threshold is low, the system raises a large number of
(false) alarms and when the threshold is high, the system may not detect attacks reliably. Though
these methods can handle multiple features in the audit patterns, very often, in order to reduce
complexity and improve system performance only a single feature is considered, as in the Intrusion
Detection Expert System (IDES) [23], or the features are assumed to be independent, as in the
Haystack system [24]. This, however, affects the attack detection accuracy. Statistical methods
can operate either in batch mode (Haystack system) or in real-time mode (IDES).
Data Mining and Machine Learning
Data mining and machine learning methods focus on analyzing the properties of the audit patterns
rather than identifying the process which generated them [9]. These methods include approaches
for mining association rules, classification and cluster analysis. Classification methods are one
of the most researched and include methods like the decision trees, Bayesian classifiers, artificial
neural networks, k-nearest neighbour classification, support vector machines and many others.
• Clustering – Clustering of data has been applied extensively for intrusion detection using a
number of methods such as k-means, fuzzy c-means and others [56], [57]. Clustering meth-
ods are based upon calculating the numeric distance of a test point from different cluster
28 Background
centres’ and then adding the point to the closest cluster. One of the main drawbacks of clus-
tering technique is that since a numeric distance measure is used, the observations must be
numeric. Observations with symbolic features cannot be readily used for clustering which
results in inaccuracy. In addition, clustering methods consider the features independently
and are unable to capture the relationship between different features of a single record which
results in lower accuracy. Another issue when applying any clustering method is to select
the distance measure as different distance measures result in clusters with different shapes
and sizes. Frequently used distance measures are the Euclidian distance and the Maha-
lanobis distance [9]. Clustering can, however, be performed in case only the normal audit
patterns are available. In such cases, density based clustering methods can be used which
are based on the assumption that intrusions are rare and dissimilar to the normal events.
This is similar to identifying the outlier points which can be considered as intrusions.
• Data Mining – Data mining approaches [50], [51] are based on mining association rules
[58] and using frequent episodes [59] to build classifiers by discovering relevant patterns
of program and user behaviour. Association rules and frequent episodes are used to learn
the record patterns that describe user behaviour. These approaches can deal with symbolic
features and the features can be defined in the formof packet and connection details. Mining
association rules for intrusion detection has the advantage that they are easy to interpret.
However, they are based upon building a database of rules of normal and frequent items
during the training phase. During testing, patterns from the test data are extracted and
various classification methods can be used to classify the test data. The detection accuracy
suffers as the database of rules is not a complete representation of the normal audit patterns.
• Bayesian Classifiers – Naive Bayes classifiers are also proposed for intrusion detection [60].
However, they make strict independence assumption between the features in an observation
resulting in lower attack detection accuracy when the features are correlated, which is of-
ten the case. Bayesian network [61] can also be used for intrusion detection [62], [63].
However, they tend to be attack specific and build a decision network based on special
characteristics of individual attacks. As a result, the size of a Bayesian network increases
rapidly as the number of features and the type of attacks modeled by the network increases.
• Decision Trees – Decision trees have also been used for intrusion detection [60], [64]. De-
cision trees select the best features for each decision node during tree construction based on
2.6 Literature Review 29
some well defined criteria. One such criterion is the gain ratio which is used in C4.5. De-
cision trees generally have very high speed of operation and high attack detection accuracy
and have been successfully used to build effective intrusion detection systems.
• Artificial Neural Networks – Neural networks have been used extensively to build net-
work intrusion detection systems as discussed in [65], [66], [67], [68], [69], [70] and [71].
Though, the neural networks can work effectively with noisy data, like other methods, they
require large amount of data for training and it is often hard to select the best possible
architecture for the neural network.
• Support Vector Machines – Support vector machines map real valued input feature vector
to higher dimensional feature space through nonlinear mapping and have been used for
detecting intrusions [70], [71], [72]. They can provide real-time attack detection capability,
deal with large dimensionality of data and perform multi class classification.
For data mining and machine learning based approaches, the accuracy of the trained system
also depends upon the amount of audit patterns available during training. Generally, training
with more audit patterns result in a better model. The above discussed methods often deal with
the summarized representation of the audit patterns and may analyze multiple features which are
considered independently. The prime reason for working with summary patterns is that the system
tends to be simple, efficient and give fairly good attack detection accuracy. Similar to the pattern
matching and statistical methods, these methods assume independence among consecutive events
and hence do not consider the order of occurrence of events for attack detection.
• Markov Models – Markov chains [73], [74] and hidden Markov models [75] can be used
when dealing with sequential representation of audit patterns. [31], [76], [77] and [78] de-
scribes the use of hidden Markov model for intrusion detection. Hidden Markov models
have been shown to be effective in modeling sequences of system calls of a privileged pro-
cess, which can be used to detect anomalous traces. However, modeling system calls alone
may not always provide accurate classification as various connection level features are ig-
nored. Further, hidden Markov models cannot model long range dependencies between
the observations [34]. Very often the sequence itself is a vector and has many correlated
features. However, in order to gain computational efficiency the multivariate data analysis
problem is broken into multiple univariate data analysis problems and the individual results
30 Background
are combined using a voting mechanism [9]. This however, results in inaccuracy as the
correlation among the features is lost. The authors in [49] show that modeling the ordering
property of events, in addition to the duration and frequency, results in higher attack detec-
tion accuracy. The drawback with modeling the ordering of events is that the complexity of
the system increases which affects the performance of the system. Hence, there is a tradeoff
between detection accuracy and the time required for attack detection.
• Others – Other approaches for detecting intrusion include the use of genetic algorithm and
autonomous and probabilistic agents [79], [80]. These methods are generally aimed at
developing a distributed intrusion detection system.
A number of intrusion detection systems such as the IDES (Intrusion Detection Expert Sys-
tem), Haystack system, the MIDAS (Multics Intrusion Detection System), W&S (Wisdom and
Sense) system, TIM (Time based Inductive Machine), Snort and others have been developed which
operate at the network level [1]. However, network intrusion detection systems must perform very
efficiently in order to handle large amount of network data and hence many of the network in-
trusion detection systems are primarily based on signature matching. When anomaly detection
systems are used at network level, they either consider only one feature [23] or assume the fea-
tures to be independent [24]. However, we propose to use a hybrid system based on conditional
random fields and integrate the layered framework to build a single system which can operate in
high speed networks and can detect a wide variety of attacks with very few false alarms. We, thus,
present the Layered Conditional Random Fields for Network Intrusion Detection in Chapter 4.
The most closely related work, to our work, is of Lee et al. [51], [81], [82], [83], [84].
They, however, consider a data mining approach for mining association rules and finding fre-
quent episodes in order to calculate the support and confidence of the rules. Instead, in our work
we define features from the observations as well as from the observations and the previous la-
bels and perform sequence labeling via the conditional random fields to label every feature in the
observation. This setting is sufficient to model the correlation between different features in an
observation. We also compare our work with [85] which describes the use of maximum entropy
principle for detecting anomalies in the network traffic. The key difference between [85] and our
work is that, the authors in [85] use only the normal audit patterns during training and build a be-
haviour based system while we train our system using both the normal and the anomalous patterns
i.e. we build a hybrid system. Secondly, the system in [85] fails to model long range dependencies
2.6 Literature Review 31
in the observations, which can be easily represented in our model. As we shall describe in Chapter
4, we also integrate the layered framework with the conditional random fields to gain the benefits
of computational efficiency, wide attack detection coverage and high accuracy of attack detection
in a single system.
2.6.3 Monitoring Access Logs
A number of approaches have been described to monitor the data access logs or (and) the appli-
cation access logs in order to detect attacks, particularly at the user access level. We now review
some of these well known approaches.
Monitoring Data Access Logs
In [86], the authors focus on detecting malicious database modifications using database logs to
mine dependencies among data items by creating dependency rules. For example, any update
operation must satisfy certain rules which define what data items must be read before an update and
what data items must be written after the update operation. In order to detect malicious queries, the
authors in [87] perform clustering of queries that might return one or more features, each of which
can further return multiple records. In [88], the authors discuss that time differences between
multiple transactions in database systems can be used to detect malicious transactions when an
intruder masquerades as a normal user. The authors describe the use of Petri-Nets for finding
anomalies at the user task level. In [89], the authors describe that the database logs can be used to
build role profiles to model normal behaviour which can then be used to identify intruders. The
authors use naive Bayes classifier to perform classification using features extracted from the SQL
commands, the set of relations accessed and the attributes referenced. In [90] and [91], the authors
describe that fingerprinting of SQL queries can be used to detect malicious requests. They also
present an algorithm which summarizes the raw transactional SQL queries into compact regular
expressions that can be used for matching against known attack signatures. Further, ordering
constraints are imposed on the SQL queries in [90] which improves attack detection. In [92],
the authors describe the use of database logs to build user profiles based on user query frequent
item-sets. They also define support and confidence functions for fingerprints generated for the
queries depending upon the user profile. In [93], the authors describe that data objects can be
32 Background
tagged with time semantics that captures expectations about update rates which are unknown to
attackers. This is, however, applicable only to data which is refreshed regularly. In [94], the
authors describe the use of audit logs for building user profiles. They consider both, the integrity
constraints encoded in the data dictionary and the user profiles to define a distance measure which
estimates the closeness of a set of attributes which are referenced together. The authors in [95]
describe a system which determines whether a query should be denied in order to protect the
privacy of users by constructing auditors for ‘max’, ‘min’ and ‘sum’ queries. In [96], [97] and
[98], the authors describe Hippocratic Databases and present an auditing framework to detect
suspicious queries which do not adhere to data disclosure policies.
Such approaches are; generally, rule based and expensive to build. Additionally, they have lim-
ited attack detection capability since they cannot detect attacks whose signatures are not available.
Further, such systems are application specific and their deployment in different environments re-
quire recreating the set of rules applicable in the new domain. Another drawback is that, a system
based on modeling user profiles results in a large false alarm rate. This is because of two reasons;
first, the user behaviour is not fixed and changes overtime and, second, profile based systems em-
ploy threshold to determine the acceptable deviation in normal activities. The thresholds are often
determined empirically and, hence, may be unreliable. Additionally, these methods consider data
access patterns in isolation of the events which generates the data request.
Monitoring Web Access Logs
Contrary to the systems which monitor data queries alone, there exist systems which analyze the
web server access logs to detect malicious data and application accesses. Systems such as [99]
combine static and dynamic verification of web requests to ensure absence of particular kind of
erroneous behaviour in web applications. They, however, do not consider underlying data access
and hence cannot detect a wide variety of attacks. The system described in [100] performs ap-
plication layer protocol analysis to detect intrusions. The authors in [101] describe an anomaly
based approach for detecting attacks against web applications by analyzing its internal state and
learning the relationships between critical execution points and the internal states. In [102], the
authors describe a technique called protomatching which combines protocol analysis, normaliza-
tion and pattern matching into a single phase and hence can be used to perform signature analysis
efficiently. The authors claim that their protomatching approach improves the efficiency of the
2.6 Literature Review 33
Snort [48] intrusion detection system by up to 49%. In [103], the authors model network traffic
into network sessions and packets to identify instances with high attack probability. The authors
in [104] describe a tool for performing intrusion detection at application level. Their system uses
‘Apache’ web server to implement an audit data source which monitors the web server.
In order to improve attack detection at the application level we are, however, interested in
analyzing the behaviour of a web application in conjunction with the underlying data accesses
rather than analyzing them separately. Hence, we present our Unified Logging Framework in
Chapter 5. The advantage of our framework is that it is application independent, since we do
not extract application specific signatures, and therefore our framework can be used in a variety
of applications. Further, instead of modeling user profiles, our system models application-data
interaction which does not depend upon a particular user and therefore does not change overtime.
2.6.4 Application Intrusion Detection
Network monitoring, though significant, is not sufficient to detect attacks which are directed to-
wards individual applications. In order to detect such malicious application and data accesses,
intrusion detection must also be performed at the application level. Further, for an attack to be suc-
cessful, very often, a sequence of events must be followed. Present application intrusion detection
systems consider every event individually rather than considering a sequence of events resulting in
a large number of false alarms and, hence, poor attack detection accuracy. To ameliorate this, we
introduce User Session Modeling using Unified Log for Application Intrusion Detection in Chap-
ter 6. We perform session modeling at the user application level, as opposed to the network packet
level, and integrate the unified logging framework to build an application intrusion detection sys-
tem. We show that using conditional random fields, session modeling can be performed with our
unified logging framework and attacks can be detected by monitoring only a small number of
events in a sequence. This results in an efficient and an accurate system.
The most closely related works to ours are [105], [106] and [107]. In [105], the authors
describe an anomaly based learning approach for detecting SQL attacks by learning profiles of
the normal database accesses for web applications. Our work is different from this because we
consider both the normal and the anomalous data patterns during training and build a classification
system based on user session modeling, while in [105] the authors use only the normal patterns
during training to build an anomaly based system and analyze the events independently. In [106],
34 Background
the authors describe anomaly detection techniques to detect attacks against web servers and web
based applications by correlating the server side programs referenced by client queries with the
parameters contained in the queries. Their system primarily focuses on the web server logs to
produce an anomaly score by creating profiles for every server side program and its features and
then establishing their threshold, while in our system we combine the web server logs with the data
access logs to detect malicious data accesses and use a moving window (of size more than one)
to analyze a sequence of events. The authors in [107] describe a two layer system in which the
first layer generates pre alarms and the second layer makes the final decision to activate an alarm.
Even though the authors use both the web access logs and the data access logs, they build separate
profiles using the two logs. We also compare our work with [103]. Their system analyzes network
sessions and network packets while, we model the user application sessions to detect malicious
data accesses.
2.7 Conditional Random Fields
Conditional models are probabilistic systems which are used to model the conditional distribu-
tion over a set of random variables. Such models have been extensively used in natural language
processing tasks and computational biology. Conditional models offer a better framework as they
do not make any unwarranted assumptions on the observations and can be used to model rich
overlapping features among the visible observations. Maxent classifiers [108], [109], [110] maxi-
mum entropy Markov models [85], [111] and conditional random fields [34] are such conditional
models. The simplest conditional classifier is the Maxent classifier based upon maximum entropy
classification which estimates the conditional distribution of every class given the observations.
The training data is used to constrain this conditional distribution while ensuring maximum en-
tropy and hence maximum uniformity. We now give a brief description of the conditional random
fields which is motivated from the work in [34]. A comprehensive introduction to the conditional
random fields is provided in Appendix A.
Let X be the random variable over a data sequence to be labeled and Y be the corresponding
label sequence. Also, let G = (V, E) be a graph such that Y = (Y
v
)
v∈(V)
, so that Y is indexed
by the vertices of G. Then (X, Y) is a conditional random field, when conditioned on X, the
random variables Y
v
obey the Markov property with respect to the graph: p(Y
v
|X, Y
w
, w = v) =
2.7 Conditional Random Fields 35
p(Y
v
|X, Y
w
, w ∼ v), where w ∼ v means that w and v are neighbors in G, i.e. a conditional
random field is a random field globally conditioned on X. For a simple sequence (or chain)
modeling, as in our case, the joint distribution over the label sequence Y given X has the form:
p
θ
(y|x) ∝ exp(

e∈E,k
λ
k
f
k
(e, y|
e
, x) +

v∈V,k
µ
k
g
k
(v, y|
v
, x)) (2.1)
where x is the data sequence, y is a label sequence, and y|
s
is the set of components of y associated
with the vertices or edges in sub graph S. The features f
k
and g
k
are assumed to be given and fixed.
Further, the parameter estimation problem is to find the parameters θ = (λ
1
, λ
2
, ...; µ
1
, µ
2
, ...)
from the training data D = (x
i
, y
i
)
N
i=1
with the empirical distribution ˜ p(x, y).
The graphical structure of a conditional random field is represented in Figure 2.6 where
x
1
, x
2
, x
3
, x
4
represents an observed sequence of length four and every event in the sequence is
correspondingly labeled as y
1
, y
2
, y
3
, y
4
.
y
1
y
2
x
1
x
2
x
4
x
3
y
3
y
4
Figure 2.6: Graphical Representation of a Conditional Random Field
The prime advantage of conditional random fields is that they are discriminative models which
directly model the conditional distribution p(y|x). Further, conditional random fields are undi-
rected models and free from label bias and observation bias which are present in other conditional
models [112]. Generative models such as the Markov chains, hidden Markov models and joint dis-
tribution have two disadvantages. First, the joint distribution is not required since the observations
are completely visible and the interest is in finding the correct class which is the conditional distri-
bution p(y|x). Second, inferring conditional probability p(y|x) from the joint distribution, using
the Bayes rule, requires marginal distribution p(x) which is difficult to estimate as the amount of
training data is limited and the observation x contains highly dependent features. As a result strong
independence assumptions are made to reduce complexity. This results in reduced accuracy [113].
36 Background
Instead, conditional random fields predict the label sequence y given the observation sequence
x, allowing them to model arbitrary relationships among different features in the observations
without making independence assumptions.
Conditional random fields, thus, offer us the required framework to build effective intrusion
detection systems. The task of intrusion detection can be compared to many problems in machine
learning, natural language processing and bio-informatics such as gene prediction, determining
secondary structures of protein sequences, part of speech tagging, text segmentation, shallow pars-
ing, named entity recognition, object recognition and many others. The conditional random fields
have proven to be very successful in such tasks. Hence, in this thesis, we explore the suitability of
conditional random fields for building robust intrusion detection systems.
2.8 Conclusions
In this chapter, we presented the taxonomy of intrusion detection and explored the problem in
detail. We first discussed the principles and assumptions involved in building intrusion detection
systems and described the components of intrusion detection systems in detail. We then presented
various challenges and requirements for effective intrusion detection and presented a classification
of intrusion detection systems. We then discussed methods which have been used for detecting
intrusions, their underlying assumptions, and their strengths and limitations with regards to their
attack detection capability. We presented the literature review where we explored various frame-
works and methods which have been used to build network and application intrusion detection
systems. Finally, we drew similarities between intrusion detection and various tasks in computa-
tional linguistics, computational biology and motivated our approach to build intrusion detection
systems based on conditional random fields.
In the next chapter, we present our layered framework and describe how it can be used to build
accurate and efficient anomaly and hybrid network intrusion detection systems.
Chapter 3
Layered Framework for Building Intrusion
Detection Systems
Present networks and enterprises follow a layered defence approach to ensure security at different
access levels by using a variety of tools such as network surveillance, perimeter access control, fire-
walls, network, host and application intrusion detection systems, data encryption and others. Given
this traditional layered defence approach, only a single system is employed at every layer which is
expected to detect attacks at that particular location. However, with the rapid increase in the number
and type of attacks, a single system is not effective enough given the constraints of achieving high
attack detection accuracy and high system throughput. Hence, we propose a layered framework for
building intrusion detection systems which can be used, for example, to build a network intrusion de-
tection system which can detect a wide variety of attacks reliably and efficiently when compared to
the traditional network intrusion detection systems. Another advantage of our Layer based Intrusion
Detection System (LIDS) framework is that it is very general and easily customizable depending upon
the specific requirements of individual networks.
3.1 Introduction
T
WO significant requirements for building intrusion detection systems are broad attack de-
tection coverage and efficiency in operation, i.e., an intrusion detection system must detect
different type of attacks effectively and must operate efficiently in high traffic networks. Present
networks are prone to a number of attacks, a large number of which are previously known. How-
ever, the number of previously unseen attacks is on a rise [10].
Signature based systems using pattern matching approaches can be used effectively and effi-
ciently to detect previously known attacks in high speed networks. However, even a slight variation
in attacks may not be detected by a signature based system. As a result, anomaly and hybrid sys-
tems are used to detect previously unseen attacks and have been proven to be more reliable in
37
38 Layered Framework for Building Intrusion Detection Systems
detecting novel attacks when compared with the signature based systems. A common practice to
build anomaly and hybrid intrusion detection systems is to train a single system with labeled data
to build a classifier which can then be used to detect attacks from a previously unseen test set.
At times, when labeled data is not available, clustering based systems can be used to distinguish
between legitimate and malicious packets. However, a significant disadvantage of such systems is
that they result in a large number of false alarms. The attack detection coverage of the system is
further affected when a single system is trained to detect different type of attacks. To maximize
attack detection, various systems such as [55] and [114] employ both the signature based and the
anomaly based systems together. However, the anomaly based systems still remain a bottleneck
in the joint system. This is because, a single anomaly detector is trained which is expected to
accurately detect a variety of attacks and perform efficiently.
Thus, for a network intrusion detection system monitoring the incoming and outgoing network
traffic and ensuring confidentiality, integrity and availability via a single system may not be possi-
ble due to several reasons including the complexity and the diverse type of attacks at the network
level. Ensuring high speed of operation further limits the deployment, particularly, of anomaly
and hybrid network intrusion detection systems. Network monitoring using a network intrusion
detection system is only a single line of defence in the traditional layered defence approach which
aims to provide complete organizational security. Hence, network intrusion detection systems are
complemented by a variety of other tools such as network surveillance, perimeter access control,
firewalls, host and application intrusion detection systems, file integrity checkers, data encryption
and others and are deployed at different access points in a layered organizational security frame-
work [115]. In this chapter we propose a layered framework for building anomaly and hybrid
network intrusion detection systems which can operate efficiently in high speed networks and can
accurately detect a variety of attacks. Our proposed framework is very general and can be easily
customized by adding domain specific knowledge as per the specific requirements of the network
in concern, thereby, giving flexibility in implementation.
The rest of the chapter is organized as follows; we give motivating examples to highlight the
significance of the layered framework for intrusion detection in Section 3.2. We then describe our
layered framework in Section 3.3. We highlight the advantages of our framework in Section 3.4
and compare the layered framework with others in Section 3.5. Finally, we conclude this chapter
in Section 3.6.
3.2 Motivating Examples 39
3.2 Motivating Examples
Anomaly and hybrid intrusion detection systems typically employ various data mining and ma-
chine learning based approaches which are inefficient when compared to the signature based sys-
tems which employ pattern matching. Hence, it becomes critical to search for methods which can
be used to build efficient anomaly and hybrid intrusion detection systems. However, given that
the present networks are prone to a wide variety of attacks, using a single system would not only
degrade performance but will also be less effective in attack detection.
Consider for example, a single network intrusion detection system which is deployed to detect
every network attack in a high speed network. A network is prone to different types of attacks
such as the Denial of Service (DoS), Probe and others. We note that the DoS and Probe attacks
are different and require different features for their effective detection. When same features are
used to detect the two attacks the accuracy decreases. It also makes the system bulky which
affects its speed of operation. Hence, for effective attack detection, a network intrusion detection
system must differentiate between different types of attacks. Thus, using a single system is not a
viable option. One possible solution is having a number of sub systems each of which is specific
in detecting a single category of attack (such as DoS, Probe and others). This is not only more
effective in detecting individual classes of attacks, but it also results in an efficient system. The
number of sub systems to be used can be determined by analyzing the potential risks and the
availability of resources at individual installations.
Hence, we propose a layered framework for building efficient anomaly and hybrid intrusion
detection systems where different layers in the system are trained independently to detect different
type of attacks with high accuracy. For example, based on our proposed framework a network
intrusion detection system may consist of four layers, where the layers correspond to four different
attack classes; Denial of Service, Probe, Remote to Local and User to Root.
3.3 Description of our Framework
Figure 3.1 represents our framework for building Layer based Intrusion Detection Systems (LIDS).
The figure represents an ‘n’ layer system where every layer in itself is a small intrusion de-
tection system which is specifically trained to detect only a single type of attack, for example the
40 Layered Framework for Building Intrusion Detection Systems
Layer Two
Feature Selection
Intrusion Detection
Sub System
Intrusion Detection
Sub System
Intrusion Detection
Sub System Feature Selection
All Features
Block
No
Yes
Allow
Block
Normal
No
Normal
Block
Yes
No
Normal
Layer One
Feature Selection
Yes
Layer n
Figure 3.1: Layered Framework for Building Intrusion Detection Systems
DoS attack. A number of such sub systems are then deployed sequentially, one after the other.
This serves dual purpose; first, every layer can be trained with only a small number of features
which are significant in detecting a particular class of attack. Second, the size of the sub system
remains small and hence, it performs efficiently. A common disadvantage of using a modular ap-
proach, similar to our layered framework, is that it increases the communication overhead among
the modules (sub systems). However, this can be easily eliminated in our framework by making
every layer completely independent of every other layer. As a result, some features may be present
in more than one layer. Depending upon the security policy of the network, every layer can simply
block an attack once it is detected without the need of a central decision maker.
A number of such layers essentially act as filters, which blocks anomalous connection as soon
as they are detected in a particular layer, thereby providing a quick response to intrusion and
simultaneously reducing the analysis at subsequent layers. It is important to note that a different
response may be initiated at different layers depending upon the class of attack the layer is trained
to detect. The amount of audit data analyzed by the system is more at the first layer and decreases
at subsequent layers as more and more attacks are detected and blocked. In the worst case, when
no attacks are detected until at the last layer, all the layers have the same load. However, the
3.3 Description of our Framework 41
overall load for the average case is expected to be much less since attacks are detected and blocked
at every subsequent layer. On the contrary, if the layers are arranged in parallel rather than in
a sequence, the load at every sub system is same and is equal to that of the worst case in the
sequential configuration. Additionally, the initial layers in the sequential configuration can be
replicated to perform load balancing in order to improve performance.
3.3.1 Components of Individual Layers
Given that a network is prone to a wide variety of attacks, it is often not feasible to add a separate
layer to detect every single attack. However, a number of similar attacks can be grouped together
and represented as a single attack class. Every layer in our framework corresponds to a sub system
which is trained independently to detect attacks belonging to a single attack class. As a result, the
total number of layers in our framework remains small. For example, both, ‘Smurf’ and ‘Neptune’
result in Denial of Service and, hence, can be detected at a single layer rather than at two different
layers.
Additionally, the layered framework is very general and the number of layers in the overall
system can be adjusted depending upon the individual requirements of the network in concern.
Consider for example, a data repository which is a replica of a real-time application data and
which does not provide any online services. To ensure security of this data, the priority is to simply
detect network scans as opposed to detecting malicious data accesses. For such an environment,
only a single layer which can reliably detect the Probe attacks is sufficient. Hence, the number of
layers in our framework can be easily customized depending upon the identified threats and the
availability of resources.
Even though the number of layers and the significance of every layer in our framework depend
upon the target network, every layer has two significant components:
1. Feature Selection Component – In order to detect intrusions, a large number of features
can be monitored. These features include ‘protocol’, ‘type of service’, ‘number of bytes
from source to destination’, ‘number of bytes from destination to source’, ‘whether or not a
user is logged in’, ‘number of root accesses’, ‘number of files accessed’ and many others.
However, to detect a single attack class, only a small set of these features is required at
every layer. Using more features than required makes the system inefficient. For example,
to detect Probe attacks, features such as the ‘protocol’ and ‘type of service’ are significant
42 Layered Framework for Building Intrusion Detection Systems
while features such as ‘number of root accesses’ and ‘number of files accessed’ are not
significant.
2. Intrusion Detection and Response Sub System – The second component in every layer
is the intrusion detection and response unit. To detect intrusions, our framework is not
restrictive in using a particular anomaly or hybrid detector. A variety of previously well
known intrusion detection methods such as the naive Bayes classifier, decision trees, sup-
port vector machines and others can be used. A prime advantage of our framework is
that newer methods, such as conditional random fields as we will discuss in the following
chapters, which are more effective in detecting attacks can be easily incorporated in our
framework. Finally, once an attack is detected, the response unit can provide adequate
intrusion response depending upon the security policy.
In order to take advantages of our proposed framework, every layer must contain both of the
above mentioned components.
3.4 Advantages of Layered Framework
We now summarize the advantages of using our layered framework.
• Using our layered framework improves attack detection accuracy and the system can detect
a wide variety of attacks by making use of the domain specific knowledge.
• The layered framework does not degrade system performance as individual layers are in-
dependent and are trained with only a small number of features, thereby, resulting in an
efficient system. Additionally, using our layered framework opens avenues to perform
pipelining resulting in very high speed of operation. Implementing pipelining, particu-
larly in multi core processors, can significantly improve the performance by reducing the
multiple I/O operations to a single I/O operation since all the features can be read in a single
operation and analyzed by different layers in the layered framework.
• Our framework is easily customizable and the number of layers can be adjusted depending
upon the requirements of the target network.
• Our framework is not restrictive in using a single method to detect attacks. Different meth-
ods can be seamlessly integrated in our framework to build effective intrusion detectors.
3.5 Comparison with other Frameworks 43
• Our proposed layered framework for building effective and efficient network intrusion de-
tection systems fits well in the traditional layered defence approach for providing network
and enterprise level security.
• Our framework has the advantage that the type of attack can be inferred directly from the
layer at which it is detected. As a result, specific intrusion response mechanisms can be
activated for different attacks.
3.5 Comparison with other Frameworks
Ensuring continuity of services and security of data from unauthorized disclosure and malicious
modifications are critical for any organization. However, providing a desired level of security at
the enterprise level can be challenging. No single tool can provide enterprise wide security and
hence, a number of different security tools are deployed. For this, a layered defence approach is
often employed to provide security at the organizational level. This traditional layered defense
approach incorporates a variety of security tools such as the network surveillance, perimeter ac-
cess control, firewalls, network, host and application intrusion detection systems, file integrity
checkers, data encryption and others which are deployed at different access points in a layered
security framework. The traditional layered architecture is perceived as a framework for ensuring
complete organizational security rather than as an approach for building effective and efficient
intrusion detection systems. Figure 3.2 represents the traditional layered defence approach.
However, as discussed earlier, we present a layered framework for building intrusion detection
systems. Our framework fits well in the traditional layered defence approach and can be used to
develop effective and efficient network intrusion detection systems. Further, the four components
viz., event generators, event Analyzers, event Databases and the response units, presented in the
Common Intrusion Detection Framework [45] can be defined for every intrusion detection sub
system in our layered framework.
In the data mining framework for intrusion detection, [84], the authors describe the use of
data mining algorithms to compute activity patterns from system audit data to extract features
which are then used to generate rules to detect intrusions. The same approach can be applied
for building an intrusion detection system based on our layered framework. Our framework can
not only seamlessly integrate the use of data mining technique for intrusion detection, but can
44 Layered Framework for Building Intrusion Detection Systems
Data
Security
Application Security
Business Continuity
Content Management
Network Security
Surveillance
Host Security (Infrastructure Protection)
Perimeter Security (Network Access Control)
Figure 3.2: Traditional Layered Defence Approach to Provide Enterprise Wide Security
also help to improve its performance by selecting only a small number of significant features for
building separate intrusion detection sub systems which can be used to effectively detect different
classes of attacks at different layers.
A number of other frameworks have been proposed which describe the use of classifier com-
bination [55], [114], [116], [117]. In [55] and [114], the authors apply a combination of anomaly
and misuse detectors for better qualification of analyzed events. The authors in [116] describe the
combination of ‘strong’ classifiers using stacking where decision tress, naive Bayes and a number
of other classification methods are used as base classifiers. The authors show that the output from
these classifiers can be combined to generate a better classifier rather than selecting the individual
best classifier. In [117], the authors use a combination of ‘weak’ classifiers where the individual
classification power of weak classifiers is slightly better than that of random guessing. The authors
show that a number of such classifiers when combined by using simple majority voting mechanism
provide good classification. Our framework is, however, not based upon classifier combination.
Combination of classifiers is expensive with regards to the processing time and decision making.
In addition, centralized decision making systems often tend to be complex and slow in operation.
3.6 Conclusions 45
The only purpose of classifier combination is to improve accuracy. Rather, our system is based
upon serial layering of multiple hybrid detectors which are trained independently and which oper-
ate without the influence of any central controller. In our framework, the results from individual
classifiers at a layer are not combined at any later stage and, hence, an attack is blocked at the
layer where it is detected. There is no communication overhead among the layers and the central
decision maker which results in an efficient system. In addition, since the layers are independent
they can be trained separately and deployed independently. As already discussed, using a stacked
system is expensive when compared to the sequential model. From our experimental results in the
following chapters, we will show that an intrusion detection system based on our layered frame-
work performs better and is more efficient when compared with individual systems as well as with
systems based on classifier combination.
3.6 Conclusions
In this chapter, we presented our layered framework for building effective and efficient intrusion
detection systems. We compared our framework with other well known frameworks and high-
lighted its specific advantages. In addition to improving the attack detection accuracy and detect-
ing a variety of attacks, our framework can be used to build efficient anomaly and hybrid network
intrusion detection systems. In particular our framework can identify the class of an attack once
detected, is scalable and can be easily customized depending upon the specific requirements of a
network.
Given the layered framework, in the next chapter, we first demonstrate the effectiveness of
conditional random fields to build intrusion detection sub systems which are individually trained
to effectively detect a single attack class. We then integrate the trained (sub) systems into our
layered framework to build accurate and efficient network intrusion detection systems which are
not based on attack signatures. Experimental results demonstrate that our system outperforms
other well known approaches for intrusion detection.
Chapter 4
Layered Conditional Random Fields for
Network Intrusion Detection
Ever increasing network bandwidth poses a significant challenge to build efficient network intrusion
detection systems which can detect a wide variety of attacks with acceptable reliability. In order to
operate in high traffic environment, present network intrusion detection systems are often signature
based. However, signature based systems have obvious disadvantages. As a result, anomaly and
hybrid intrusion detection systems must be used to detect novel attacks. However, such systems are
inefficient and suffer from a large false alarm rate. To ameliorate these drawbacks, we first develop
better hybrid intrusion detection methods which are not based on attack signatures and which can
detect a wide variety of attacks with very few false alarms. We then integrate the layered framework,
discussed in previous chapter, to build a single system which is effective in attack detection and which
can also perform efficiently in high traffic environment.
4.1 Introduction
I
NCREASING network bandwidth has enabled a large number of services to be provided over
a network. High speed of communication and increasing complexity in systems has, however,
made it difficult to detect intrusive activities in real-time. In order to operate in high speed net-
works, intrusion detection systems are either signature based which perform pattern matching or
operate on summarized audit patterns which are collected regularly at predefined intervals. Pattern
matching systems operate on signatures extracted from previously known attacks and are limited
in detecting only the attacks with known signatures. Anomaly and hybrid intrusion detection sys-
tems, in addition to detecting previously known attacks, can also detect previously unseen attacks;
however, they are expensive in operation. As a result, such systems analyze summarized data
instead of monitoring a sequence of events.
Anomaly and hybrid intrusion detection systems suffer from two major disadvantages; first,
47
48 Layered Conditional Random Fields for Network Intrusion Detection
they generate a large number of false alarms and second, they are expensive in operation. Further,
a single system has limited attack detection coverage and it cannot detect a wide variety of attacks
reliably. Hence, in this chapter, we focus on building accurate hybrid intrusion detection systems
which can perform efficiently in high speed network environment.
We first develop hybrid intrusion detection systems based on conditional random fields which
can detect a wide variety of attacks and which result in very few false alarms. To improve the
efficiency of the system, we then integrate the layered framework, as discussed in the previous
chapter, and demonstrate that a single system based on our framework is more effective than pre-
viously well known methods for network intrusion detection. Experimental results on the bench-
mark KDD 1999 intrusion data set [12] and comparison with other well known methods such as
decision trees and naive Bayes show that our approach based on layered conditional random
fields outperform these methods; in terms of, both, accuracy of attack detection and efficiency of
operation. Impressive part of our results is the percentage improvement in attack detection accu-
racy, particularly, for User to Root (U2R) attacks (34.8% improvement) and Remote to Local (R2L)
attacks (34.5% improvement). Statistical tests also demonstrate higher confidence in detection ac-
curacy with layered conditional random fields. We also show that our system is robust and can
detect attacks with higher accuracy, when compared with other methods, even when trained with
noisy data.
Rest of the chapter is organized as follows; in Section 4.2 we motivate the use of conditional
random fields for intrusion detection which can model complex relationships between different
features in the data set. We then describe the data set used in our experiments in Section 4.3.
In Section 4.4, we describe how conditional random fields can be used for effective intrusion
detection followed by the algorithm to integrate the layered framework with conditional random
fields to build an effective and an efficient network intrusion detection system. In Section 4.5
we give details of the experiments performed and describe the implementation of our integrated
system. In Section 4.6, we compare our results with other methods such as decision trees, naive
Bayes classifier, multi layer perceptron, support vector machines, K-means clustering, principle
component analysis and approaches based on classifier combination, which are known to perform
well for intrusion detection. We analyze the robustness of our system in Section 4.7 by introducing
noise in the training data. Finally, we draw conclusions and highlight the advantages of layered
conditional random fields for network intrusion detection in Section 4.8.
4.2 Motivating Examples 49
4.2 Motivating Examples
Network intrusion detection systems operate at the periphery of the networks and are, thus, over-
loaded with large amount of network traffic, particularly in high speed networks. As a result, the
anomaly and hybrid intrusion detection systems generally operate on summarized audit patterns.
However, when audit patterns are summarized, they are represented with multiple features which
are correlated and complex relationships exist between them. To detect intrusions effectively,
these features must not be considered independently. Methods, such as conditional random fields,
which can capture relationships among multiple features, would perform better when compared
with methods which consider the features to be independent such as the naive Bayes classifier.
Consider, for example, a network intrusion detection system which uses two features ‘logged
in’ and ‘number of file creations’ to classify network connections as either normal or attack.
When these features are analyzed in isolation they do not provide significant information which
can help in detecting attacks. However, analyzing these features together can provide meaningful
information for classification. This is because, a particular user may or may not have privileges to
create files in the system or the system may detect anomalous activity by calculating deviation in
the current profile and then comparing it with the previously saved profile for that particular user.
Consider another network intrusion detection system which analyzes connection level feature
such as ‘service invoked at the destination’ in order to detect attacks. When this feature is analyzed
in isolation, it is significant only when an attacker requests for a service that is not available at
the destination and the system may then tag the connection as a Probe attack. However, if this
information is analyzed in combination with other features such as ‘protocol type’ and ‘amount
of data transferred between the source and the destination’; the audit data provides significant
details which help in improving classification. In this case, if the features are considered to be
independent, the system is limited in detecting only Probe attacks. However, as we will show
from our experiments, if these features are not considered to be independent, the system may not
only detect Probe attacks, but it can also correctly detect R2L and U2R attacks.
Such relationships between different features in the observed data, if considered by an intru-
sion detection system during classification can significantly decrease classification error, thereby
improving the attack detection accuracy. We thus explore the effectiveness of conditional random
fields which can effectively model such relationships and compare their performance with other
well known approaches for intrusion detection.
50 Layered Conditional Random Fields for Network Intrusion Detection
4.3 Data Description
We perform our experiments with the benchmark KDD 1999 intrusion data set [12]. The data set
is a version of the 1998 DARPA intrusion detection evaluation program, prepared and managed by
the MIT Lincoln Labs. The data set contains about five million connection records as the training
data and about two million connection records as the test data. In our experiments, we use the
ten percent of the total training data and ten percent of the test data (with corrected labels) which
are provided separately. This leads to 494,020 training and 311,029 test instances. Each record
in the data set represents a connection between two IP addresses, starting and ending at some
well defined times with a well defined protocol. Further, with 41 different features, every record
represents a separate connection and, hence in our experiments, we consider every record to be
independent of every other record.
Table 4.1 gives the number of instances for every class in the data set. The training data is
either labeled as normal or as one of the 24 different kinds of attack. All of the 24 attacks can be
grouped into one of the four classes; Probe, Denial of Service (DoS), unauthorized access from a
remote machine or Remote to Local (R2L) and unauthorized access to root or User to Root (U2R).
Similarly the test data is also labeled as either normal or as one of the attacks belonging to the
four attack classes. It is important to note that the test data includes specific attacks which are not
present in the training data. This makes the intrusion detection task more realistic [12].
Table 4.1: KDD 1999 Data Set
Training Set Test Set
Normal 97,277 60,593
Probe 4,107 4,166
DoS 391,458 229,853
R2L 1,126 16,349
U2R 52 68
Total 494,020 311,029
4.4 Methodology 51
4.4 Methodology
Given the network audit patterns where every connection between two hosts is presented in a
summarized form with 41 features, our objective is to detect most of the anomalous connections
while generating very few false alarms. In our experiments, we used the KDD 1999 data set
described in Section 4.3. Conventional methods, such as decision trees and naive Bayes, are
known to perform well in such an environment; however, they assume observation features to
be independent. We propose to use conditional random fields which can capture the correlations
among different features in the data and hence perform better when compared with other methods.
The KDD 1999 data set represents multiple features, a total of 41, for every session in rela-
tional form with only one label for the entire record. In this case, using a conditional model would
result in a maximum entropy classifier [108], [110]. However, we represent the audit data in the
form of a sequence and assign label to every feature in the sequence using the first order Markov
assumption instead of assigning a single label to the entire observation. Though, this increases
complexity, it also improves the attack detection accuracy. To manage complexity and improve
system’s performance, we integrate the layered framework, described in the previous chapter, with
the conditional random fields to build a single system which is more efficient and more effective.
Figure 4.1 represents how conditional random fields can be used for detecting network intrusions.
= 0 = SF
flag protocol duration service
attack attack attack attack attack
= 8
src_byte
= icmp = eco_i
(a) Attack Event
= 0
flag src_byte
= tcp = smtp
= SF
normal normal normal normal normal
duration protocol service
= 4854
(b) Normal Event
Figure 4.1: Conditional Random Fields for Network Intrusion Detection
In the figure, observation features ‘duration’, ‘protocol’, ‘service’, ‘flag’ and ‘source bytes’ are
used to discriminate between attack and normal events. The features take some possible value for
every connection which are then used to determine the most likely sequence of labels < attack,
attack, attack, attack, attack > or < normal, normal,normal, normal, normal >. Custom
feature functions can be defined which describe the relationships among different features in the
observation. During training, feature weights are learnt and during testing, features are evaluated
52 Layered Conditional Random Fields for Network Intrusion Detection
for the given observation which is then labeled accordingly. It is evident from the figure that every
input feature is connected to every label which indicates that all the features in an observation
determine the final labeling of the entire sequence. Thus, a conditional random field can model
dependencies among different features in an observation. Present intrusion detection systems do
not consider such relationships. They either consider only one feature, as in case of system call
modeling, or assume independence among different features in an observation, as in case of a
naive Bayes classifier. Our experimental results, described in Section 4.5, clearly suggest that
conditional random fields can effectively model such relationships among different features of an
observation resulting in higher attack detection accuracy.
We also note that in the KDD 1999 data set, attacks can be represented in four classes; Probe,
DoS, R2L and U2R. In order to consider this as a two class classification problem, the attacks
belonging to all the four attack classes can be re-labeled as attack and mixed with the audit patterns
belonging to the normal class to build a single model which can be trained to detect any kind of
attack. Another approach for considering the same problem, as a two class problem, is to use
only the attacks belonging to a single attack class mixed with audit patterns belonging to the
normal class to train a separate sub system for all the four attack classes. The problem can also be
considered as a five class classification problem, where a single system is trained with five classes
(normal, Probe, DoS, R2L and U2R) instead of two. Such a system can easily identify an attack
once it is detected but is very slow in operation, making their deployment impractical in high speed
networks.
As we will see from our experimental results, particularly from Table 4.14 in Section 4.6,
considering every attack class separately not only improves the attack detection accuracy but also
helps to improve the overall system performance when integrated with the layered framework.
Furthermore, it also helps to identify the class of an attack once it is detected at a particular layer
in the layered framework. However, a drawback of this implementation is that it requires domain
knowledge to perform feature selection for every layer. Nonetheless, this is one time process and
given the critical nature of the problem of intrusion detection, if domain knowledge can help to
improve the attack detection accuracy it is recommended to do so.
Using conditional random fields improve the attack detection accuracy particularly for the
U2R attacks. They are also effective in detecting the Probe, R2L and the DoS attacks. However,
when we consider all the 41 features in the data set for each of the four attack classes separately,
4.4 Methodology 53
conditional random fields can be expensive during training and testing. For a simple linear chain
structure, the time complexity for training a conditional random field is O(TL
2
NI) where T is the
length of the sequence, L is the number of labels, N is the number of training instances and I is
the number of iterations. During inference, the Viterbi algorithm [118], [119] is employed which
has a complexity of O(TL
2
). The quadratic complexity is significant when the number of labels is
large as in language tasks. However, for intrusion detection there are only two labels normal and
attack and, thus, our system is very efficient. We further improve the overall system performance
by implementing the layered framework and performing feature selection which decreases T, i.e.,
the length of the sequence. We now describe feature selection for all the four attack classes.
4.4.1 Feature Selection
Attacks belonging to different classes are different and, hence for better attack detection, it be-
comes necessary to consider them separately. As a result, in our layered system, we train every
layer separately to optimally detect a single class of attack. We therefore select different features
for different layers based upon the type of attack the layer is trained to detect. In Figure 4.2, we
represent a detailed view of a single layer (Probe layer) which can be used to detect Probe attacks
in our integrated system.
All Features
Probe Layer
Feature Selection
Audit Data
(Normal + Probe)
Normal
No
Yes
Allow
Block
Figure 4.2: Representation of Probe Layer with Feature Selection
The Probe layer is optimally trained to detect only the Probe attacks. Hence, we use only
the Probe attacks and the normal instances from the audit data to train this layer. Other layers
can be trained similarly. Note that, we select different features to train different layers in our
framework. Experimental results clearly suggest that feature selection significantly improves the
54 Layered Conditional Random Fields for Network Intrusion Detection
attack detection capability of our system. Ideally, we would like to perform feature selection
automatically. However, experimental results in Section 4.6.2 suggest that present methods for
automatic feature selection are not effective. Hence, we use domain knowledge to select features
for all the four attack classes. We now describe our approach for selecting features for every layer
and why some features were chosen over others.
1. Probe Layer – Probe attacks are aimed at acquiring information about the target network
from a source which is often external to the network. Hence, basic connection level fea-
tures such as the ‘duration of connection’ and ‘source bytes’ are significant; while features
like ‘number of file creations’ and ‘number of files accessed’ are not expected to provide
information for detecting Probe attacks.
2. DoS Layer – DoS attacks are meant to prevent the target from providing service(s) to its
users by flooding the network with illegitimate requests. Hence, to detect attacks at the
DoS layer; network traffic features such as the ‘percentage of connections having same
destination host and same service’ and packet level features such as the ‘source bytes’ and
‘percentage of packets with errors’ are significant. To detect DoS attacks, it may not be
important to know whether a user is ‘logged in or not’ and hence, such features are not
considered in the DoS layer.
3. R2L Layer – R2L attacks are one of the most difficult attacks to detect as they involve both,
the network level and the host level features. Hence, to detect R2L attacks, we selected
both, the network level features such as the ‘duration of connection’, ‘service requested’
and the host level features such as the ‘number of failed login attempts’ among others.
4. U2R Layer – U2R attacks involve the semantic details which are very difficult to capture
at an early stage at the network level. Such attacks are often content based and target an
application. Hence for detecting U2R attacks, we selected features such as ‘number of file
creations’, ‘number of shell prompts invoked’, while we ignored features such as ‘protocol’
and ‘source bytes’.
From all the 41 features in the KDD 1999 data set, we select only five features for Probe layer,
nine features for DoS layer, 14 features for R2L layer and eight features for U2R layer. Since every
layer in our framework is independent, feature sets for all the four layers are not disjoint. We list
the features used for all the four layers in Appendix B.
4.4 Methodology 55
4.4.2 Integrating the Layered Framework
The layered framework, introduced in Chapter 3, is general and can be tailored to build specific in-
trusion detection systems. In this section, we describe how we can integrate the layered framework
with the conditional random fields to build an effective and an efficient hybrid network intrusion
detection system.
Given the four different attack classes in the KDD1999 data, we implement a four layer system
where every layer corresponds to a single attack class. The four layers are arranged in a sequence
as represented in Figure 4.3.
Feature Selection
R2L Layer
Feature Selection
DoS Layer
Feature Selection
Normal
Normal
Normal
All Features
Normal
Yes
No
No
Yes
No
Yes Yes
Allow
Block Block
Block
No
Probe Layer
Feature Selection
U2R Layer
Block
Figure 4.3: Integrating Layered Framework with Conditional Random Fields
In the system, every layer is trained separately with the normal instances and with the attack
instances belonging to a single attack class. The layers are then arranged one after the other in a
sequence as shown in Figure 4.3. However, during testing, all the audit patterns (irrespective of
their attack class, which is unknown) are passed into the system starting from the first layer. If
the layer detects the instance as an attack, the system labels the instance as a Probe attack and
initiates the response mechanism; otherwise it passes the instance to the next layer. Same process
is repeated at every layer until either an instance is detected as an attack or it reaches the last layer
where the instance is labeled as normal if no attack is detected. We now give the algorithm to
integrate the layered framework with conditional random fields.
56 Layered Conditional Random Fields for Network Intrusion Detection
Algorithm: Integrating Layered Framework & Conditional Random Fields
Algorithm 1 Training
1: Select the number of layers, n, for the complete system.
2: Separately perform features selection for each layer.
3: Train a separate model with conditional random fields for each layer using the features se-
lected from Step 2.
4: Plug in the trained models sequentially such that only the connections labeled as normal are
passed to the next layer.
Algorithm 2 Testing
1: For each (next) test instance perform Steps 2 through 5.
2: Test the instance and label it either as attack or normal.
3: If the instance is labeled as attack, block it and identify it as an attack represented by the layer
name at which it is detected and go to Step 1. Else pass the sequence to the next layer.
4: If the current layer is not the last layer in the system, test the instance and go to Step 3. Else
go to Step 5.
5: Test the instance and label it either as normal or as an attack. If the instance is labeled as an
attack, block it and identify it as an attack corresponding to the layer name.
4.5 Experiments and Results
For our experiments, we use the conditional random field toolkit CRF++ [120] and the Weka tool
[121]. We develop python and shell scripts for data formatting and implementing the layered
framework and perform all of our experiments on a desktop running with Intel(R) Core(TM) 2,
CPU 2.4 GHz and 2 GB RAM under exactly the same conditions.
In our experiments we perform hybrid detection, i.e., we use both normal and anomalous audit
patterns to train the model in a supervised learning environment. We perform our experiments ten
times and report the best, the average and the worst cases. To measure the efficiency of attack
detection, we consider only the test time efficiency since the real-time performance of an intrusion
detection system depends upon the test time efficiency alone. We observe that our system based on
layered framework and conditional random fields, which we refer to as the “Layered Conditional
Random Fields”, is very efficient during testing. The time required to test every instance when
we consider all the 41 features for all the four layers is 0.2236 ms. This reduces to 0.0678 ms when
we perform feature selection and implement the layered framework. More details are presented in
the following sections.
4.5 Experiments and Results 57
4.5.1 Building Individual Layers of the System
To determine the effectiveness of conditional random fields for intrusion detection we perform
two set of experiments. From the first experiment, we examine the accuracy of conditional ran-
dom fields and compare them with other techniques which are known to perform well. In this
experiment we use all the 41 features to make a decision. We observe that the conditional random
fields perform very well particularly for detecting U2R attacks while the decision trees achieve
higher attack detection for the Probe and R2L attacks. The difference in attack detection accuracy
for DoS attacks is not significant. The reason for better accuracy for decision trees is that they per-
form feature selection and use only a small set of features in the final model. Hence, we perform
our second experiment where we select a subset of features for all the four layers separately as
discussed earlier in Section 4.4.1.
For our experiments, we divided the training data into five different classes; normal, Probe,
DoS, R2L and U2R. Similarly, we divided the test data into five classes. As discussed in Section
4.4, we perform experiments separately for all the four attack classes by randomly selecting data
corresponding to that particular attack class and normal data only. For example, to detect Probe
attacks, we train and test the system with Probe attacks and normal audit patterns only. We do not
add other attacks such as DoS, R2L and U2R in the training data when training the sub system to
detect Probe attacks. Not including other attacks allow the system to better learn features specific
to the Probe attacks and normal events. Hence, for four attack classes we train four independent
models, separately, with and without feature selection to compare their performance. We perform
similar experiments with decision trees and naive Bayes. We call the models as layered conditional
random fields, layered decision trees and layered naive Bayes when we perform feature selection.
For better comparison and readability, we present the results for the two experiments for all the
four layers together.
Detecting Probe Attacks
To detect Probe attacks, we train our system by randomly selecting 10,000 normal records and the
entire Probe records from the training data. For testing the model, we select all the normal and
Probe records from the test data. Hence, we have about 15,000 training and 64,759 test instances.
1. Experiments with all 41 Features – In Table 4.2, we give the results for detecting Probe
58 Layered Conditional Random Fields for Network Intrusion Detection
attacks when we use all the 41 features for training and testing in the first experiment. The
table represents that the system takes a total of 14.53 seconds to label all the 64,759 test
instances. Results suggest that decision trees are more efficient than conditional random
fields and naive Bayes. This is because they have a small tree structure, often with very
few decision nodes, which is very efficient. The attack detection accuracy is also higher
for the decision trees since they select the best possible features during tree construction.
However, when we performfeature selection, the layered conditional randomfields achieve
much higher accuracy and there is significant improvement in train and test time efficiency.
Table 4.2: Detecting Probe Attacks (with all 41 Features)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Conditional Best 84.60 89.94 86.73
Random Average 82.53 88.06 85.21 200.6 14.53
Fields Worst 80.44 86.13 83.19
Naive
Best 73.20 97.00 83.30
Bayes
Average 72.26 96.65 82.70 1.08 6.31
Worst 71.20 96.30 81.90
Decision
Best 93.20 97.70 95.40
Trees
Average 87.36 95.73 91.34 2.04 2.40
Worst 85.50 90.90 88.80
2. Experiments with Feature Selection – In the second experiment, we use the same data as
used in previous experiment, however, we perform feature selection in this experiment.
We give the results for detecting Probe attacks with feature selection in Table 4.3. The
table suggests that the layered conditional random fields perform better and faster than the
previous experiment and are the best choice for detecting Probe attacks. The system takes
only 2.04 seconds to label all the 64,759 test instances. We observe that there is no sig-
nificant advantage with respect to time for the layered decision trees. This is because the
size of the final tree with decision trees and with layered decision trees is not significantly
different, resulting in similar efficiency. We also observe that the Recall and, hence, the
F-Measure for layered naive Bayes decreases drastically. This can be explained as follows;
the classification accuracy with naive Bayes generally improves as the number of features
4.5 Experiments and Results 59
increases. However, if the number of features increases to a very large extent, the esti-
mation tends to become unreliable. As a result, when we use all the 41 features, naive
Bayes performs well but when we perform feature selection and use only five features, its
classification accuracy decreases. The results from Table 4.3 clearly suggest that layered
conditional random fields are a better choice for detecting Probe attacks.
Table 4.3: Detecting Probe Attacks (with Feature Selection)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Layered Best 89.72 98.03 93.68
Conditional Average 88.19 97.82 92.73 6.91 2.04
Random Fields Worst 82.92 96.48 89.82
Layered Best 78.80 21.30 33.60
Naive Average 77.23 19.57 31.22 0.45 1.13
Bayes Worst 74.70 17.00 27.70
Layered Best 87.50 97.70 92.30
Decision Average 87.04 97.41 91.93 0.54 1.00
Trees Worst 86.60 95.20 90.80
Detecting DoS Attacks
We randomly select 20,000 normal records and 4,000 DoS records from the training data to train
the system to detect DoS attacks. For testing, we select all the normal and DoS records from the
test set. Hence, we have 24,000 training instances and 290,446 test instances.
1. Experiments with all 41 Features – In Table 4.4, we give the results for detecting DoS
attacks when we use all the 41 features. The table represents that the system takes a total
of 64.42 seconds to label all the 290,446 test instances. The results show that all the
three methods have similar attack detection accuracy; however, decision trees give a slight
advantage with regards to the test time efficiency.
2. Experiments with Feature Selection – To detect DoS attacks with feature selection we
perform experiments on the same data used in the previous experiment, but we perform
60 Layered Conditional Random Fields for Network Intrusion Detection
Table 4.4: Detecting DoS Attacks (with all 41 Features)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Conditional Best 99.82 97.11 98.43
Random Average 99.78 97.05 98.40 256.11 64.42
Fields Worst 99.75 96.99 98.37
Naive
Best 99.40 97.00 98.20
Bayes
Average 99.32 97.00 98.17 1.79 26.28
Worst 99.30 97.00 98.10
Decision
Best 99.90 97.20 98.60
Trees
Average 99.90 97.00 98.46 6.09 9.04
Worst 99.90 96.70 98.30
feature selection in this experiment. Table 4.5 presents the results. With feature selection,
the system takes only 15.17 seconds to label all the 290,446 test instances. The results
follow the same trend as in the previous experiment. Considering the test time efficiency,
layered decision trees are a better choice for detecting DoS attacks. It is important to note
that there is slight increase in the detection accuracy when feature selection is performed;
however, this increase is not significant. In this experiment, the real advantage of feature
selection is seen in terms of improvement in the test time performance.
Table 4.5: Detecting DoS Attacks (with Feature Selection)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Layered Best 99.99 97.12 98.53
Conditional Average 99.98 97.05 98.50 26.59 15.17
Random Fields Worst 99.97 97.01 98.48
Layered Best 99.40 97.00 98.20
Naive Average 99.39 97.00 98.19 0.68 6.50
Bayes Worst 99.30 97.00 98.10
Layered Best 99.90 97.30 98.60
Decision Average 99.90 97.10 98.50 1.31 3.87
Trees Worst 99.90 97.00 98.40
4.5 Experiments and Results 61
Detecting R2L Attacks
For training our system to detect R2L attacks, we randomly select 1,000 normal records and all the
R2L records from the training data. To test the model, we select all the normal and R2L records
from the test set. Hence, we have about 2,000 training instances and 76,942 test instances.
1. Experiments with all 41 Features – In Table 4.6, we give the results for detecting R2L
attacks when we use all the 41 features. We observe that to test all the 76,942 test in-
stances, the system take 17.16 seconds. Table 4.6 suggests that decision trees have higher
F-Measure, but the conditional random fields have higher Precision when compared with
other methods, i.e., a system using conditional random fields generates less false alarms.
Table 4.6: Detecting R2L Attacks (with all 41 Features)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Conditional Best 93.67 16.81 28.42
Random Average 92.35 15.10 25.94 23.40 17.16
Fields Worst 90.54 12.42 21.89
Naive
Best 74.10 7.40 13.40
Bayes
Average 70.03 6.63 12.12 0.38 7.33
Worst 61.30 5.40 10.00
Decision
Best 98.30 37.10 53.20
Trees
Average 84.68 23.29 35.62 0.60 2.75
Worst 63.70 10.40 18.30
2. Experiments with Feature Selection – In the second experiment, we use the same data as
used in the previous experiment, however, we perform feature selection in this experiment.
From the results in Table 4.7, we observe that the system takes only 5.96 seconds to test
all the 76,942 test instances and the layered conditional random fields perform much better
than conditional random fields (increase in F-Measure of about 60%), layered decision
trees (increase of about 125%), decision trees (increase of about 17%), layered naive Bayes
(increase of about 250%) and naive Bayes (increase of about 250%) and are the best choice
for detecting R2L attacks. Layered condition random fields take slightly more time which
is acceptable as they achieve much higher attack detection accuracy.
62 Layered Conditional Random Fields for Network Intrusion Detection
Table 4.7: Detecting R2L Attacks (with Feature Selection)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Layered Best 95.84 31.67 47.52
Conditional Average 94.70 27.08 42.08 5.30 5.96
Random Fields Worst 91.37 24.98 39.23
Layered Best 88.30 7.20 13.30
Naive Average 81.81 6.47 11.98 0.31 2.99
Bayes Worst 78.20 4.10 7.80
Layered Best 89.70 14.50 24.90
Decision Average 85.48 10.39 18.43 0.36 1.43
Trees Worst 78.80 7.30 13.50
Detecting U2R Attacks
To detect U2R attacks, in the first experiment, we train our system by randomly selecting 1,000
normal records and all the U2R records from the training data. We used all the normal and U2R
records from the test set for testing the system. Hence, we have about 1,000 training instances
and 60,661 test instances.
1. Experiments with all 41 Features – In Table 4.8, we give the results for detecting U2R
attacks when we use all of the 41 features. The system takes 13.45 seconds to label 60,661
test instances. Table 4.8, clearly shows that conditional random fields are far better for
detecting U2R attacks when compared with other methods. The F-Measure for conditional
random fields is more than 150% with respect to the decision trees and more than 600%
with respect to the naive Bayes. The U2R attacks are very difficult to detect and most of the
present intrusion detection systems fail to detect such attacks with acceptable reliability.
We observe that conditional random fields can be used to reliably detect the U2R attacks
in particular.
2. Experiments with Feature Selection – In the second experiment, we use the same data
as used in the previous experiment to detect U2R attacks; however, we perform feature
selection in this experiment. We give the results for detecting U2R attacks with feature
4.5 Experiments and Results 63
Table 4.8: Detecting U2R Attacks (with all 41 Features)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Conditional Best 58.62 60.29 56.74
Random Average 52.16 55.02 53.44 8.35 13.45
Fields Worst 47.30 50.00 49.30
Naive
Best 5.30 91.20 10.00
Bayes
Average 3.94 85.88 7.54 0.31 5.90
Worst 3.20 82.40 6.20
Decision
Best 24.80 63.20 34.90
Trees
Average 12.93 57.49 20.42 0.37 2.22
Worst 6.30 51.50 11.20
selection in Table 4.9. We observe that the system takes only 2.67 seconds to label all the
60,661 test instances. Table 4.9, clearly suggests that layered conditional random fields
are the best choice for detecting U2R attacks and are far better than conditional random
fields (increase of about 8%), layered decision trees (increase of about 30%), decision trees
(increase of about 184%), layered naive Bayes (increase of about 38%) and naive Bayes
(increase of about 675%). We also observe that the attack detection capability increases
for the decision trees and the naive Bayes when we perform feature selection.
Table 4.9: Detecting U2R Attacks (with Feature Selection)
Precision Recall F-Measure Train Test
(%) (%) (%) (sec.) (sec.)
Layered Best 58.57 64.71 61.11
Conditional Average 55.07 62.35 58.19 0.85 2.67
Random Fields Worst 34.96 60.29 45.03
Layered Best 50.00 66.20 51.40
Naive Average 35.48 55.12 41.97 0.25 1.83
Bayes Worst 19.60 52.90 29.80
Layered Best 51.00 38.20 43.70
Decision Average 51.00 38.20 43.70 0.29 0.93
Trees Worst 51.00 38.20 43.70
64 Layered Conditional Random Fields for Network Intrusion Detection
It is evident from our results that the attack detection accuracy using layered conditional ran-
dom fields is significantly higher for detecting the U2R, R2L and Probe attacks. The difference in
attack detection accuracy is, however, not significant for the DoS attacks. Further, regardless of the
method considered and particularly for conditional random fields, the time required for training
and testing the system reduces significantly once we perform feature selection.
4.5.2 Implementing the Integrated System
In many situations, there is a tradeoff between efficiency and accuracy of the system and there
can be various avenues to improve system performance. Methods such as naive Bayes assume
independence among the observed data. This certainly increases system efficiency but it severely
affects the accuracy as we observed from the experimental results. To balance this tradeoff we
use the conditional random fields which are more accurate, though expensive, but we implement
the layered approach to improve overall system performance. The performance of our integrated
system, “Layered Conditional Random Fields”, is comparable to that of the decision trees and
the naive Bayes and our system has higher attack detection accuracy.
Experimental results in Section 4.5.1 suggest that conditional random fields (with feature se-
lection) can be very effective in detecting different attacks when different attack classes are con-
sidered separately. However, in real scenario, the category of an attack is unknown. Rather, it
would be beneficial if an intrusion detection system not only detects an attack but also identifies
the type of attack, thereby enabling specific intrusion response mechanisms depending upon the
type of attack. We perform further experiments with the integrated system presented in Section
4.4.2. Results show that integrating the layered framework not only improves the efficiency of the
overall system, but it also helps to identify the type of attack once it is detected. This is because
individual layers in the layered framework are trained to detect only a particular class of attack.
As soon as a layer detects an attack, the category of the attack can be inferred from the class of
attack the layer is trained to detect. For example, if an attack is detected at the U2R layer in the
layered framework, it is very likely that the attack is of U2R type and hence, the system labels the
attack as U2R and initiates specific response mechanisms.
To examine the effectiveness of our integrated system, layered conditional random fields, we
perform experiments in an environment similar to the real life deployment of the system. For this
experiment, we perform feature selection and use exactly the same training instances as used for
4.5 Experiments and Results 65
training the individual models in the experiments described in Section 4.5.1. However, we re-label
the entire data in the test set as either normal or attack. During testing, all the instances from the
test set are passed through the system starting from the first layer. If layer one detects an attack, it
is blocked and labeled as Probe. Only the connections which are labeled as normal at the first layer
are allowed to pass to the next layer. Since, the layer is trained to detect Probe attacks effectively;
most of the Probe attacks are detected. Other attacks such as DoS can either be seen as normal or
as Probe. If other attacks are detected as Probe, it must be considered as an advantage, since the
attack is detected at an early stage. Similarly, if some Probe attacks are not detected at the first
layer, they may be detected at subsequent layers. Same process is repeated at the following layers
where an attack is blocked and labeled as DoS, R2L or U2R at layer two, layer three and layer
four respectively. We perform all experiments 10 times and report their average. Table 4.10 gives
the % detection with respect to each of the five classes in a confusion matrix.
Table 4.10: Confusion Matrix
% Detection
Probe DoS R2L U2R Normal (Total % Blocked)
Probe 97.82 0.11 0.69 0.00 1.38 98.62
DoS 25.50 71.90 0.00 0.00 2.60 97.40
R2L 3.00 0.00 26.58 0.04 70.38 29.62
U2R 5.15 0.00 77.65 3.53 13.67 86.33
Normal 0.91 0.07 0.35 0.05 98.62 1.38
From Table 4.10, we observe that an intrusion detection system based on layered conditional
random fields can detect most of the Probe (98.62%), DoS (97.40%) and U2R (86.33%) attacks
while giving very few false alarms at each layer. The system can also detect R2L attacks with
much higher accuracy (29.62%) when compared with previously reported systems. The confusion
matrix shows that only 71.90% of DoS attacks are labeled as DoS. However, it is very important to
note that the accuracy for detecting DoS attacks is not 71.90%, rather it is 25.50 +71.90 +0.00 +
0.00 = 97.40%. This is because, 25.50% of the DoS attacks have been detected at the first layer
itself, though the system identifies them as Probe attacks since the first layer represents the Probe
66 Layered Conditional Random Fields for Network Intrusion Detection
layer. This is acceptable because it is critical to detect an attack as early as possible which helps
to minimize the impact of an attack.
We also note that most of the U2R attacks are detected in the third layer and hence labeled as
R2L. However, if we remove the third layer, the fourth layer can detect U2R attacks with similar
accuracy. Looking at the R2L and U2R attacks in Table 4.10, it is natural to think that the two
layers can be merged. However, this has two disadvantages. First, merging the two layers re-
sult in increasing the number of features which affects efficiency. When the layers are merged, the
merged layer performs poorly with respect to the total test time when compared with the combined
test time for both the unmerged layers. Second, the U2R attacks are not detected effectively and
their individual attack detection accuracy decreases. This is because the number of U2R attacks in
the training data is very small and the system simply learns the features which are specific to the
R2L attacks. Hence, we use two separate layers for detecting R2L and U2R attacks. Using the lay-
ered framework, it is hoped that any attack, even though its category is unknown, can be detected
at any one of the layers in the system. The number of layers can also be increased or decreased
in the layered framework, making the system scalable and flexible to specific requirements of the
particular environment where it is deployed.
We evaluate the performance of every layer in our system in Table 4.11. The table clearly
shows that out of all the 250,436 attack instances in the test set, more than 25% of the attacks are
blocked at layer one and more than 90% of all the attacks are blocked by the end of layer two.
Thus, the layered framework is very effective in reducing the attack traffic at every layer in the
system. This configuration takes only 21 seconds to classify all the 250,436 attacks.
Table 4.11: Attack Detection at Individual Layers (Case:1)
Accuracy Test Time
Total Cumulative Per Instance Total Cumulative
(%) (%) (m sec.) (sec.) (sec.)
Probe 25.226 25.226 0.031 8 8
DoS 65.996 91.222 0.053 10 18
R2L 1.770 92.992 0.090 2 20
U2R 0.004 92.996 0.056 1 21
4.6 Comparison and Analysis of Results 67
We can further optimize this configuration by putting the DoS layer before the Probe layer.
We can do this because the data is relational and every layer in the system is independent. Putting
the DoS layer before the Probe layer improves overall system performance and helps to detect a
large number of attacks at the first layer itself. Such optimization becomes significant in severe
attack situations when the target is overwhelmed with illegitimate connections. We present the
results in Table 4.12.
Table 4.12: Attack Detection at Individual Layers (Case:2)
Accuracy Test Time
Total Cumulative Per Instance Total Cumulative
(%) (%) (m sec.) (sec.) (sec.)
DoS 89.807 89.807 0.051 13 13
Probe 1.415 91.222 0.031 1 14
R2L 1.770 92.992 0.090 2 16
U2R 0.004 92.996 0.056 1 17
Table 4.12 shows that our system can analyze 250,436 test instances in 17 seconds, i.e., it can
handle 1.4731 ∗ 10
4
instances per second. Now, assuming the average size of an instance to be
1.5 KB, the overall bandwidth which our system can handle is easily in excess of 100 Mbps. It is
important to note that this performance is achieved on a desktop running with Intel(R) Core(TM)
2, CPU 2.4 GHz and 2 GB RAM in the Windows environment. Significant performance improve-
ment can be achieved by building dedicated devices for large scale commercial deployment.
4.6 Comparison and Analysis of Results
Experimental results from Section 4.5 clearly suggest that conditional random fields when inte-
grated with the layered framework can be used to build effective and efficient network intrusion de-
tection systems. In this section, we compare the layered conditional random fields with other well
known methods for intrusion detection based on the anomaly detection principle. The anomaly
based systems primarily detect deviations from the learnt normal data using statistical methods,
machine learning or data mining approaches [9]. Standard techniques such as decision trees and
68 Layered Conditional Random Fields for Network Intrusion Detection
naive Bayes are known to perform well. However, our experimental results show that layered
conditional random fields perform far better than these techniques. The main reason for better
accuracy of our system is that the conditional random fields do not consider observation features
to be independent. In [122], the authors present a comparative study of various classifiers when
applied to the KDD 1999 data set. To improve attack detection, the authors in [123] propose the
use of principle component analysis before applying any machine learning algorithm. The use of
support vector machines for intrusion detection is discussed in [72]. We compare these methods
with our layered conditional random fields for intrusion detection in Table 4.13. The table repre-
sents the Probability of Detection (PD) and the False Alarm Rate (FAR) in % for different methods
including the winners of the KDD 1999 cup.
Comparison from Table 4.13 suggests that layered conditional random fields perform sig-
nificantly better than previously reported results including the winner of the KDD 1999 cup and
various other methods applied to the KDD 1999 data set. The most impressive part of layered con-
ditional random fields is the margin of improvement when compared with other methods. They
have very high attack detection of 98.6% for Probe attacks (5.8% improvement) and 97.40% detec-
tion for DoS attacks. They outperform by a significant percentage for R2L (34.5% improvement)
and U2R (34.8% improvement) attacks.
4.6 Comparison and Analysis of Results 69
Table 4.13: Comparison of Results
Probe DoS R2L U2R
Layered Conditional PD 98.60 97.40 29.600 86.3000
Random Fields FAR 0.91 0.07 0.350 0.0500
KDD 1999 Winner [122]
PD 83.30 97.10 8.400 13.2000
FAR 0.60 0.30 0.005 0.0030
Multi Classifier [122]
PD 88.70 97.30 9.600 29.8000
FAR 0.40 0.40 0.100 0.4000
Multi Layer Perceptron [122]
PD 88.70 97.20 5.600 13.2000
FAR 0.40 0.30 0.010 0.0500
Gaussian Classifier [122]
PD 90.20 82.40 9.600 22.8000
FAR 11.30 0.90 0.100 0.5000
K-Means Clustering [122]
PD 87.60 97.30 6.400 29.8000
FAR 2.60 0.40 0.100 0.4000
Nearest Cluster Algorithm [122]
PD 88.80 97.10 3.400 2.2000
FAR 0.50 0.30 0.010 0.0006
Incremental Radial PD 93.20 73.00 5.900 6.1000
Basis Function [122] FAR 18.80 0.20 0.300 0.0400
Leader Algorithm [122]
PD 83.80 97.20 0.100 6.6000
FAR 0.30 0.30 0.003 0.0300
Hypersphere Algorithm [122]
PD 84.80 97.20 1.000 8.3000
FAR 0.40 0.30 0.005 0.0090
Fuzzy ARTMAP [122]
PD 77.20 97.00 3.700 6.1000
FAR 0.20 0.30 0.004 0.0010
C4.5 (Decision Trees) [122]
PD 80.80 97.00 4.600 1.8000
FAR 0.70 0.30 0.005 0.0020
Nearest Neighbour
with Principle PD 86.13 97.32 2.510 64.0400
Component Analysis FAR 0.27 0.23 0.001 0.0001
(4 axis) [123]
Decision Trees
with Principle PD 70.40 97.58 0.070 7.0200
Component Analysis FAR 0.85 0.12 0.030 0.0001
(2 axis) [123]
Support Vector Machines [72]
PD 36.65 91.60 22.000 12.0000
FAR - - - -
70 Layered Conditional Random Fields for Network Intrusion Detection
4.6.1 Significance of Layered Framework
To evaluate the effectiveness of the layered framework, we perform further experiments where we
do not implement the layered framework, i.e., we train a single system with two classes, normal
and attack, by labeling all the Probe, DoS, R2L and U2R attacks as attack. We perform experi-
ments, both, with and without feature selection. For experiments when we do not implement the
layered framework but we perform feature selection, we select 21 features out of the total of 41
features by applying the union operation on the feature sets of the four individual attack classes.
Table 4.14 presents the results.
Table 4.14: Layered Vs. Non Layered Framework
Attack Detection in % Time Taken (sec.)
Probe DoS R2L U2R Test
Layered
Feature Selection 98.62 97.40 29.62 86.33 17
All Features 88.06 97.05 15.10 55.03 56
Non Feature Selection 92.21 96.88 16.01 60.00 29
Layered
All Features 87.94 96.12 17.58 48.24 57
Comparison from Table 4.14 clearly suggests that a system implementing the layered frame-
work with feature selection is more efficient and more accurate in detecting attacks particularly the
U2R, R2L and Probe attacks. The motivation behind layered framework is to improve performance
speed while feature selection helps to improve classification accuracy. Hence, a system which im-
plements feature selection with layered framework can benefit from both; high performance speed
and high classification accuracy. Further, in Table 4.14, we should read the time in relative terms
rather than in absolute terms since, for ease of experiments, we use scripts for implementation. In
real environment, high speed can be achieved by implementing the complete system in languages
with efficient compilers such as the C Language. Further, as discussed earlier, we can implement
pipelining in multi core processors where every core represents a single layer and due to pipelin-
ing, multiple I/O operations can be replaced by a single I/O operation providing very high speed
of operation.
4.6 Comparison and Analysis of Results 71
4.6.2 Significance of Feature Selection
Form the experiments in previous sections; we observe that performing feature selection improves
the attack detection accuracy as well as the efficiency of the system. In our experiments, we
performed manual feature selection, using our domain knowledge. However, it would be advan-
tageous if we can select features automatically for different attack classes. For experiments with
automatic feature selection, we use methods such as those discussed in [124] and [125] which
can automatically extract significant features. We compare the results of manual feature selection
with automatic feature selection for all the layers. We observe that the system using automatic
feature selection has similar test time performance when compared to the system with manual
feature selection, but the accuracy of detection is significantly lower when features are induced
automatically than the system based on manual feature selection. We compare the effect of fea-
ture selection on intrusion detection in Table 4.15. For automatic feature selection, we perform
experiments using the Mallet tool [126].
Table 4.15: Significance of Feature Selection
F-Measure (%)
Probe DoS R2L U2R
Best 93.68 98.53 47.52 61.11
Manual Average 92.73 98.50 42.08 58.19
Worst 89.82 98.48 39.23 45.03
Best 87.39 98.38 42.06 53.90
Automatic Average 86.28 98.31 32.14 49.80
Worst 85.03 98.20 25.15 46.58
No
Best 86.73 98.43 28.42 56.74
Selection
Average 85.21 98.40 25.94 53.44
Worst 83.19 98.37 21.89 49.30
It is not surprising that manual feature selection performs better than automatic feature selec-
tion. However, we also considered other methods for automatic feature selection. We performed
experiments with feed forward neural network to determine the weights for all the 41 features. We
then discarded the features with weights close to zero. This results in only a small set of features
for each layer. However, when we performed similar experiments with the reduced set of features,
72 Layered Conditional Random Fields for Network Intrusion Detection
there was no significant improvement in the attack detection accuracy. We then used Principle
Component Analysis (PCA) for dimensionality reduction [123]. However, main drawback of us-
ing PCA followed by conditional random fields is that, the PCA transforms a large number of
possibly correlated features into a small number of uncorrelated features known as the principle
components. Hence, when we applied PCA to the data set and then implemented the system us-
ing conditional random fields in the newly transformed feature space, the combined approach did
not provide significant advantage. This is because, the strength of conditional random fields is to
model correlation between features, but the features in the transformed space are independent. We
then used the C4.5 algorithm [127] to perform feature selection. We constructed a decision tree
and selected only a small set of features which were selected by the C4.5 algorithm for further
experiments. However, there was no significant improvement in the results.
Given the critical nature of the task of intrusion detection, it is important to detect most of the
attacks with very few false alarms; hence, we use domain knowledge to improve attack detection
accuracy. Nonetheless, automatic feature selection with layered conditional random fields is still
a feasible scheme for building reliable network intrusion detection systems which can operate
efficiently in high speed networks.
4.6.3 Significance of Our Results
Experimental results show that conditional random fields have high attack detection accuracy.
However, if we use all the 41 features for all the four attack classes, the time required to train
and test the model is very high. To address this, we perform feature selection and implement the
layered framework with the conditional random fields to produce a four layer system. The four
layers correspond to Probe, DoS, R2L and U2R attacks. We observe that the test time performance
of the integrated system is comparable with other methods; however, the time required to train
the model is slightly higher. We also observe that feature selection not only improves the test
time efficiency, but it also increases the accuracy of attack detection. This is because using more
features than required can generate superfluous rules often resulting in fitting irregularities in the
data which can misguide classification. With regards to improving the attack detection accuracy,
the main strength of layered conditional random fields lies in detecting R2L and U2R attacks
which are not satisfactorily detected by other methods. Our system also gives slight improvement
for detecting Probe attacks but has similar accuracy for detecting DoS attacks.
4.7 Robustness of the System 73
The prime reason for better attack detection accuracy for conditional random fields is that they
do not consider observation features to be independent. This results in capturing the correlation
among different features in the observation resulting in higher accuracy. Considering both, the
accuracy and the time required for testing, layered conditional random fields score better.
To determine the statistical significance of our results, we rank all the six systems in order
of significance for detecting Probe, DoS, R2L and U2R attacks. We use the Wilcoxon test [128]
with 95% confidence interval to discriminate the performance of these methods. We compare
the ranking for various methods in Table 4.16, where a system with rank ‘1’ represents the best
system.
Table 4.16: Ranking Various Methods for Intrusion Detection
Probe DoS R2L U2R
Layered Conditional Random Fields 1 1 1 1
Conditional Random Fields 4 4 3 2
Layered Decision Trees 1 1 4 3
Decision Trees 1 1 1 5
Layered Naive Bayes 6 5 5 3
Naive Bayes 5 5 5 6
The results of the test indicate that layered conditional random fields are significantly better
(or equal) for detecting attacks when compared with other methods. Thus, layered conditional
random fields are a strong candidate for building effective and efficient network intrusion detection
systems.
4.7 Robustness of the System
In order to test the robustness of our system, it is important to perform similar experiments with
a number of other data sets. However, given the domain of the problem, no other data sets are
freely available which can be used for similar experimentation. To ameliorate this problem to
some extent and to study the robustness of our system, we add substantial amount of noise in the
training data and perform similar experiments.
74 Layered Conditional Random Fields for Network Intrusion Detection
4.7.1 Addition of Noise
We control the addition of noise in the data by two parameters, the probability of adding noise to
a feature, ‘p’, and the scaling factor, ‘s’. We perform four set of experiments with noisy data, one
for each layer. For every set of experiment, we vary the parameter ‘p’ from 0 and 1 (by keeping
it at values 0.10, 0.20, 0.33, 0.50, 0.75, 0.90 and 0.95) and vary the parameter ‘s’ from -1000 and
+1000. In case, when the original feature is ‘0’, we add noise to that feature by using an additive
function (a random value between -1000 and +1000) instead of scaling. We represent the effect of
noise for detecting Probe, DoS, R2L and U2R attacks separately in Figures 4.4, 4.5, 4.6 and 4.7
respectively. The figures clearly suggest that the layered conditional random fields are robust to
noise in the training data and perform better than other methods.
4.7 Robustness of the System 75
30
40
50
60
70
80
90
100
0 10 20 30 40 50 60 70 80 90 100
F
-
M
e
a
s
u
r
e
Noise %
LCRF CRF DT NB
Figure 4.4: Effect of Noise on Probe Layer
89
90
91
92
93
94
95
96
97
98
99
0 10 20 30 40 50 60 70 80 90 100
F
-
M
e
a
s
u
r
e
Noise %
LCRF CRF DT NB
Figure 4.5: Effect of Noise on DoS Layer
76 Layered Conditional Random Fields for Network Intrusion Detection
0
5
10
15
20
25
30
35
40
45
0 10 20 30 40 50 60 70 80 90 100
F
-
M
e
a
s
u
r
e
Noise %
LCRF CRF DT NB
Figure 4.6: Effect of Noise on R2L Layer
0
10
20
30
40
50
60
0 10 20 30 40 50 60 70 80 90 100
F
-
M
e
a
s
u
r
e
Noise %
LCRF CRF DT NB
Figure 4.7: Effect of Noise on U2R Layer
4.8 Conclusions 77
4.8 Conclusions
In this chapter, we addressed the core issues concerning the anomaly and hybrid intrusion detec-
tion systems at the network level; viz, the accuracy of attack detection, capability of detecting
a wide variety of attacks and efficiency of operation. Our experimental results in Section 4.5.1
show that conditional random fields are very effective in improving the attack detection rate and
decreasing the false alarm rate. Having a low false alarm rate is important for any intrusion detec-
tion system. Further, experimental results presented in Section 4.5.2, show that feature selection
and implementing the layered framework significantly reduces the time required to train and test
the model. Experiments also suggest that conditional random fields can be very effective in re-
ducing the false alarms, thereby improving the attack detection accuracy. Further, our system can
be implemented to detect a variety of attacks including the DoS, Probe, R2L and the U2R. Other
type of attacks can also be detected by adding new layers in the system, making our system highly
scalable. We compared our approach with some well known methods for intrusion detection such
as the decision trees and naive Bayes. These methods, however, cannot detect the R2L and the
U2R attacks effectively, while our integrated system can effectively and efficiently detect such
attacks giving an improvement of 34.5% for the R2L attacks and 34.8% for the U2R attacks. Our
system also helps in identifying an attack once it is detected at a particular layer which expedites
the intrusion response mechanism, thus minimizing the impact of an attack. We showed that our
system is robust to noise in the training data and performs better than any other compared system.
Our system has all the advantages of the layered framework discussed in the previous chapter,
and, in particular the number of layers in the system can be easily increased or decreased giving
flexibility to network administrators.
Our system can clearly provide better intrusion detection capabilities at the network level.
However, as discussed earlier, to provide a higher level of security it is significant to detect in-
trusions at the application level along with detecting intrusions at the periphery of the network.
Hence, in the following chapters, we focus on developing intrusion detection systems which can
operate at the application level and which can be effective in detecting application level attacks.
Chapter 5
Unified Logging Framework and Audit
Data Collection
In order to detect malicious activities at the application level, present intrusion detection systems
either analyze the application access logs or the data access logs. A stacked system can also be used
which analyzes the two logs separately one after the other. Such systems, however, cannot model the
application-data interaction which is significant to detect low level application specific attacks. To
overcome this deficiency in present application intrusion detection systems, we introduce a unified
logging framework which combines the application and the data access logs to produce a unified log
which can be used as the audit patterns to detect attacks at the application level. This unified log can
easily incorporate features from both, the application accesses and the corresponding data accesses.
As a result, application-data interaction can be captured which improves attack detection. Finally,
our framework does not encode application specific features to extract attack signatures and can be
used for a variety of similar applications.
5.1 Introduction
U
SING our layered framework, as discussed in previous chapters, can undoubtedly provide
effective network intrusion detection capability. However, to ensure a higher level of secu-
rity, network level systems must be complimented with application level systems. This is because;
the attack detection capability of a network based system is different from that of a host based
and application based system. A network based system primarily focuses on monitoring network
packets and, hence, cannot detect data and application level attacks particularly when Network
Address Translation (NAT) and encryption are used in communication. Further, attacks can be
split into more than one packet to avoid their detection. As a result, network intrusion detection
systems cannot reliably detect application attacks such as the SQL injection. Similarly, host and
application based systems cannot protect against network attacks such as the Denial of Service.
79
80 Unified Logging Framework and Audit Data Collection
Methods which are effective in detecting attacks at network level, such as those discussed
earlier cannot be directly used to detect low level application attacks. Detecting application level
attacks often require monitoring every single data access in real-time environment which may not
always be feasible, simply due to large number of data requests per unit time. Further, attackers
may come up with previously unseen attacks compounding the situation even more difficult [5].
Present application intrusion detection systems either analyze only the web access logs or only the
data access logs or use two separate systems (based on analyzing the web access logs and the data
access logs) which operate independently and, hence, cannot detect attacks reliably. Such systems
are often signature based and, thus, have limited attack detection. Therefore, it becomes critical
to develop better application intrusion detection systems which can detect attacks reliably and are
not entirely dependent on attack signatures. Detecting malicious data accesses, thus, presents a
major challenge and alternate methods must be considered which are efficient and at the same time
which can detect attacks reliably.
We note that to effectively detect application level attacks the application-data interaction must
be captured. Hence, we introduce a unified logging framework which combines the application
access logs and the corresponding data access logs in real-time to provide a unified log with
features from both the application accesses and the corresponding data accesses. This captures the
correlation between the two logs and also eliminates the need to analyze them separately, thereby
resulting in a system which is accurate and which operates efficiently.
The rest of the chapter is organized as follows; we motivate our unified logging framework
with some examples in Section 5.2. We then describe our proposed framework in Section 5.3 and
the setup for data collection in Section 5.4. Finally, we conclude this chapter in Section 5.5.
5.2 Motivating Example
Data access in three tier application architecture is restricted via the application and, hence, ap-
plications are one of the prime targets of attack. However, the ultimate objective of attacking an
application is either to launch a Denial of Service or to access the underlying data. To detect such
malicious data accesses it becomes critical to consider the user behaviour (via the web applica-
tion requests) and the corresponding application behaviour (via the corresponding data accesses)
together, i.e., by analyzing the application’s interaction with the underlying data.
5.3 Proposed Framework 81
Consider for example, a simple website which links page A to either page B, page C or any
other page. This depends on the logic encoded in the application. Transition from page A to page
B may be valid only if some conditions are satisfied, such as ‘the user must be logged in’ to transit
from page A to page B. If this condition is not satisfied, the transition is considered as anomalous.
Considering only a single feature, ‘transition sequence of web pages’ may not be sufficient to
detect attacks. Other features such as ‘the result of authentication module’ are significant for
decision making. Neglecting such features result in false alarms. This is because; the encoded
logic cannot be modeled by analyzing the web accesses alone. However, when the system is
made aware of the data access pattern via features such as ‘the number of requests generated by
a particular page’, ‘the corresponding next page’ and other features, it can effectively model the
user-application interaction, thereby resulting in better attack detection.
Similarly, monitoring the data access queries alone without any knowledge of the web applica-
tion which requests the data is insufficient to detect attacks since they lack the necessary contextual
information. Hence, to detect attacks reliably, we propose monitoring web accesses together with
the corresponding data accesses using our unified logging framework.
5.3 Proposed Framework
In order to detect malicious data accesses, the straight forward approach is to audit every data
access request before it is processed by the application. However, this is not the ideal solution to
detect data breaches due to the following reasons:
1. In most applications, the number of data accesses per unit time is very large as compared to
the number of web accesses and, thus, monitoring every data request in real-time severely
affects system performance.
2. Assuming that we can somehow monitor every data request by using a signature based
system; the system is application specific because the attack signatures are defined by
encoding application specific knowledge.
3. The system must be regularly updated with new signatures to detect attacks. As with any
signature based system, it cannot detect zero day attacks.
Thus, monitoring every data request is not feasible in high speed application environment. We
also observe that the real world applications follow the three tier architecture [129] which ensures
82 Unified Logging Framework and Audit Data Collection
application and data independence, i.e., data is managed separately and is not encoded into the
application. To access application data, an attacker has no option but to exploit the application. To
detect such attacks, an intrusion detection system can either monitor the application requests or
(and) monitor the data requests. When a system monitors the application accesses alone, it cannot
detect attacks such as the SQL injection since the system lacks useful information about the data
accessed. Similarly, analyzing every data access in isolation limits the attack detection capability
of an intrusion detection system. Further, using two separate systems does not capture application-
data interaction which affects attack detection. As discussed earlier, previous approaches either
consider only the application accesses or the data accesses or consider both in isolation and, hence,
unable to correlate the events together resulting in a large number of false alarms. We, thus,
propose a unified logging framework which generates a single audit log that can be used by the
application intrusion detection system to detect a variety of attacks including the SQL injection,
cross site scripting and other application level attacks. Before we describe our framework in detail,
we define some key terms which will be helpful in better understanding of the remaining of the
chapter.
1. Application: Application is a software by which a user can access data. There exists no
other way in which the data can be made available to a user.
2. User: User is either an individual or any other application which access data.
3. Event: Data transfer between a user and an application is a result of multiple sequential
events. Data transfer can be considered as a request-response system where a request for
data access is followed by a response. An event is such a single request-response pair. We
use the term event interchangeably with the term request. A single event is represented as
an ‘N’ feature vector which is denoted as:
e
i
= f
1
, f
2
, f
3
... f
N
4. User Session: A user session is an ordered set of events or actions performed, i.e., a
session is a sequence of one or more request-response pairs. Every session can be uniquely
identified by a session id. A user session is represented as a sequence of event vectors as:
s
i
= start, e
1
, e
2
, e
3
..., end
5.3 Proposed Framework 83
5.3.1 Description of our Framework
We present our unified logging framework in Figure 5.1 which can be used for building effective
application intrusion detection systems.
Control
Web Server
with
Deployed
Application
Data
Intrusion
Detection
System
User / Client
Web Server
Log
Data Access
Log
Unified
Log
Session
Figure 5.1: Framework for Building Application Intrusion Detection System
In our framework, we define two modules; session control module and logs unification module,
in addition to an intrusion detection system which is used to detect malicious data accesses in an
application. The logs unification module provides input audit patterns to the intrusion detection
system and the response generated by the intrusion detection system is passed on to the session
control module which can initiate appropriate intrusion response mechanisms. We have already
discussed that the three tier architecture restricts data access only via the application. Hence, user
access is restricted via the application and, thus, the application acts as bridging element between
the user and the data. In our framework, every request first passes through the session control
which is described next.
84 Unified Logging Framework and Audit Data Collection
Session Control Module
The prime objective of an intrusion detection system is to detect attacks reliably. However, it
must also ensure that once an attack is detected, appropriate intrusion response mechanisms are
activated in order to mitigate their impact and prevent similar attacks in future. The session control
module serves dual purpose in our framework. First, it is responsible for establishing new sessions
and for checking the session id for previously established sessions. For this, it maintains a list of
valid sessions which are allowed to access the application. Every request to access the application
is checked for a valid session id at the session control and anomalous requests can be blocked
depending upon the installed security policy. Second, the session control also accepts input from
the intrusion detection system. As a result, it is capable of acting as an intrusion response system.
If a request is evaluated to be anomalous by the intrusion detection system, the response from the
application can be blocked at the session control before data is made visible to the user, thereby
preventing malicious data accesses in real-time. The session control can either be implemented as
a part of the application or can also be implemented as a separate entity.
Once the session id is evaluated for a request, the request is sent to the application where it is
processed. The web server logs every request. All corresponding data accesses are also logged.
The two logs are then combined by the logs unification module to generate unified log which is
described next.
Logs Unification Module
In Section 5.2, we discussed that analyzing the web assess logs and the data access logs in isolation
is not sufficient to detect application level attacks. Hence, we propose using unified log which can
better detect attacks as compared to independent analysis of the two logs. The logs unification
module is used to generate the unified log. The unified log incorporates features from both the
web access logs and the corresponding data access logs. Using the unified log, thus, helps to
capture the user-application interaction and the application-data interactions. However, very often,
the number of data accesses is extremely large when compared to the number of web requests.
Hence, we first process the data access logs and represent them using simple statistics such as ‘the
number of queries invoked by a single web request’ and ‘the time taken to process them’ rather
than analyzing every data access individually. We then use the session id, present in both, the
5.4 Audit Data Collection 85
application access logs and the associated data access logs, to uniquely map the extracted statistics
(obtained from the data access logs) to the corresponding web requests in order to generate a
unified log. Figure 5.2, represents how the web access logs and the corresponding data access logs
can be uniquely mapped to generate a unified log. In the figure, f
1
, f
2
, f
3
... f
N
and g

1
, g

2
, g

3
...g

M
represent the features of web access logs and the features extracted from the reduced data access
logs respectively.
Web Request Reduced Data Acesses
Unified Log
Log Unification
Data Access Log Reduction
Web Request
Data Access (d
e11
= g
1
, g
2
, ..., g
m
)
Data Access (d
e12
= g
1
, g
2
, ..., g
m
)
Data Access (d
e13
= g
1
, g
2
, ..., g
m
)
(W
e1
= f
1
, f
2
, ..., f
n
) (d
e1
= g

1
, g

2
, ..., g

m
)
(e
1
= f
1
, f
2
, ..., f
n
, g

1
, g

2
, ..., g

m
)
(W
e1
= f
1
, f
2
, ..., f
n
)
Figure 5.2: Representation of a Single Event in the Unified log
From Figure 5.2, we observe that a single web request may result in more than one data
accesses which depend upon the logic encoded into the application. Once the web access logs
and the corresponding data access logs are available, the next step involves the reduction of data
access logs by extracting simple statistics as discussed before. The session id can, then, be used to
uniquely combine the two logs to generate the unified log.
5.4 Audit Data Collection
As presented in our framework, the log unification module generates a unified log which can be
used by an application intrusion detection system. However, there is no data set which can be
used for our experiments. Application data sets such as [130] are available, but are restricted to
86 Unified Logging Framework and Audit Data Collection
monitoring the sequence of system calls for privileged processes. Such data sets cannot be used in
our experiments. Further, getting real world application data, for example a bank website data, is
very hard, if not impossible. Hence, we collected data sets locally.
We collected two separate data sets by setting up an environment that mimics a real world
application environment. Both the data sets are made freely available and can be downloaded
from [13]. For the first data set, we used an online shopping application [131] and deployed it
on a web server running Apache, version 2.0.55. At the backend, the application was connected
to MySQL database, version 4.1.22. Both, the web requests and the corresponding data accesses
were logged. The servers and the application were installed on a desktop running with Intel(R)
Core(TM) 2, CPU 2.4 GHz and 2 GB RAM. The operating system installed was Microsoft Win-
dows XP Professional Service Pack 2. To collect the second data set, we used another online
shopping application [132] and deployed it separately on exactly the same configuration. For both
the applications, we consider a web request to be a single request to render a page by the server and
not a single HTTP GET request as it may contain multiple images, frames and dynamic content.
A request can be easily identified from the web server logs. This request further generates one or
more data requests which depend on the logic encoded in the application.
5.4.1 Feature Selection
We used two features from the data access logs and four features from the web access logs to
represent the unified log. Thus, we generate a unified log format where every user session is
represented as a sequence of vectors, each having six features. The six features are:
1. Number of data queries generated in a single web request.
2. Time taken to process the request.
3. Response generated for the request.
4. Amount of data transferred (in bytes).
5. Request made (or the function invoked) by the client.
6. Reference to the previous request in the same session.
Web access logs contain useful information such as the details of every request made by a
client (user), response of the web server, amount of data transferred etc. Similarly, data access
logs contain important details such as the exact data table and columns accessed, in case the
5.4 Audit Data Collection 87
data is stored in a database. Performing intrusion detection at the data access level, in isolation,
requires substantially more resources when compared to our approach. Furthermore, monitoring
the logs together eliminate the need to monitor every data query since we can use simple statistics
to represent the features of the data accesses logs in the unified log. The unified log is then used
as input to the intrusion detection system, which is the final module in our framework and is
discussed in the next chapter.
5.4.2 Normal Data Collection
To collect normal data, the postgraduate students in our department were encouraged to access
the application. The application was accessible like any other online shopping website; however,
the access to the application was restricted to only from within the department. For the purpose
of normal data collection, the students were advised not to provide any personal information and
were asked to use dummy information instead of using their actual details. The application was
accessed using different scenarios; some examples of the scenarios are:
1. A user is not interested in shopping but clicks on a few links to explore the website.
2. A user is not a registered user. The user visits the website, looks at some items, adds few
items to cart but does not buy them.
3. A user is not a registered user. The user visits the website, looks at some items, adds few
items to cart, buys them by registering and completes the check out process and finally
logs off.
4. A user is a registered user, visits the website, searches for an item, adds it to cart, starts the
check out process but does not finish buying and logs off.
5. A user is a registered user, visits the website, searches for an item, adds some items to cart,
buys some products and logs off.
In addition to these, other scenarios were also considered. For data collection, the system
was online for five consecutive days, separately, for both the data sets. Further, the students were
asked to use different browsers to access the same application. The students were not restricted
to create a single user account, and many of them created multiple accounts. This is significant
because, in this case, we cannot consider a one to one mapping between a user and an IP address.
Hence, we did not use the IP address to identify a user accessing the application, which is also not
88 Unified Logging Framework and Audit Data Collection
possible in any real world application due to sharing of computers and the use of Network Address
Translation in networks.
For the first data set, we observe that 35 different users accessed the application which results
in 117 unique sessions composed of 2,615 web requests and 232,655 data requests. We then
combine the web server logs with the data server logs to generate unified log as discussed earlier
in Section 5.3. This results in 117 user sessions with only 2,615 event vectors, each of which
include features from the web requests and the associated data requests. We also observe that a
large number of user sessions are terminated without actual purchase, resulting in abandoning the
shopping cart. This is a realistic scenario and in reality a large number of the shopping carts are
abandoned without purchase.
Similarly, for the second data set, we combine 1,642 web requests with 931,671 data accesses
which results in 60 unique user sessions with 1,642 event vectors. Note that, the number of
data accesses per web request is large in the second data set when compared to the first. This is
because; the two applications are different. Also, we did not make any additional change specific
to the second application to collect the second data set. This shows that our framework for unified
logging can be employed with minimum effort for a variety of existing applications.
We represent a normal user session from the data set in Figure 5.3. The session depicts a user
browsing the website and looking at different products displayed on the index page of the deployed
web application.
0,0,301,369,GET /catalog HTTP 1.1,−,normal
0,0,200,28885,GET /catalog/ HTTP 1.1,−,normal
131,1,200,28480,GET /catalog/index.php,http://dummydata.xyz/catalog/,normal
84,0,200,25431,GET /catalog/index.php,http://dummydata.xyz/catalog/index.php,normal
108,1,200,27121,GET /catalog/product_info.php,http://dummydata.xyz/catalog/index.php,normal
105,0,200,25252,GET /catalog/index.php,http://dummydata.xyz/catalog/product_info.php,normal
Figure 5.3: Representation of a Normal Session
5.4.3 Attack Data Collection
To collect attack data, we disabled access to the system to other users and generated the attack
traffic manually. We launched attacks based upon two criteria:
1. Attacks which do not require any control over the web server or the database such as
password guessing and SQL injection attack.
5.5 Conclusions 89
2. Attacks which require prior control over the web server such as website defacement and
cross site scripting.
To collect the attack data, both, the web requests and the data accesses were logged. The logs
were then combined using our framework. For the first data set, we generate 45 different attack
sessions with 272 web requests resulting in 44,390 data requests. Combining the two together,
the unified log has 45 unique attack sessions with 272 event vectors. For the second data set, we
generate 241 web requests and 249,597 corresponding data requests. Combing the logs result in
25 unique sessions with 241 event vectors in the unified log.
A typical anomalous session in the data set is represented in Figure 5.4. The session depicts
a scenario where the deployed application has been modified by taking control of the web server.
This is because; we observe that a user has bypassed the login module which is necessary to
complete a genuine transaction. In this case, a user successfully completes the transaction and
the login module is never invoked. This is possible only when the deployed application has been
modified and hence, the entire session is labeled as attack.
103,0,200,28623,GET /catalog/index.php HTTP 1.1,−,attack
203,0,200,35467,GET /catalog/checkout_shipping.php HTTP 1.1,http://dummydata.xyz/catalog/index.php,attack
208,0,200,40401,GET /catalog/checkout_payment.php HTTP 1.1,http://dummydata.xyz/catalog/checkout_shipping.php,attack
203,0,200,47801,GET /catalog/checkout_payment_address.php HTTP 1.1,http://dummydata.xyz/catalog/checkout_payment.php,attack
203,0,200,25605,GET /catalog/checkout_success.php HTTP 1.1,http://dummydata.xyz/catalog/checkout_payment_address.php,attack
Figure 5.4: Representation of an Anomalous Session
5.5 Conclusions
In this chapter, we introduced our unified logging framework which efficiently combines the ap-
plication access logs and the corresponding data access logs to generate unified log. The unified
log can be used as the input audit patterns for building application intrusion detection system.
The advantage of using unified log is that they include features of both, the user behaviour and
the application behaviour and can, thus, capture the application-data interaction which helps in
improving attack detection at the application level. We showed that our framework is not specific
to any particular application since it does not encode application specific signatures and can be
used for a variety of applications. Finally, we described our audit data collection methodology
which was used to collect two different data sets. The two data sets can be used for building and
90 Unified Logging Framework and Audit Data Collection
evaluating application intrusion detection systems and can be downloaded from [13].
In the next chapter, we perform experiments using our collected data sets and analyze the
effectiveness of the unified log in building application intrusion detection systems. We introduce
user session modeling using a moving window of events to model sequence of events in a user
session which can be used to effectively detect application level attacks.
Chapter 6
User Session Modeling using Unified Log
for Application Intrusion Detection
Present application intrusion detection systems suffer from two disadvantages; first they analyze
every single event independently to detect possible attacks and second, they are based on signature
matching and, hence, have limited attack detection capabilities. To overcome these deficiencies and
to improve attack detection at the application level, we introduce a novel approach of modeling user
sessions as a sequence of events instead of analyzing every event in isolation. From our experiments,
we show that the attack detection accuracy improves significantly when we perform session modeling.
We integrate our unified logging framework, discussed in previous chapter, to build effective appli-
cation intrusion detection systems which are not specific in detecting a single type of attack. Our
experimental results on the locally collected data sets show that our approach based on conditional
random fields is effective and can detect attacks at an early stage by analyzing only a small number of
sequential events. We also show that our system is robust and can reliably detect disguised attacks.
6.1 Introduction
A
PPLICATIONS have unrestricted access to the underlying application data and are thus
a prime target of attacks resulting in loss of one or more of the three basic security re-
quirements, viz., confidentiality, integrity and availability of the data. To prevent such malicious
data accesses, it becomes critical to detect any compromise of applications which accesses the
data. Web-based applications, in particular, are easy targets and can be exploited by the attackers.
Hence, we integrate our unified logging framework, discussed in Chapter 5, and introduce user
session modeling to detect application level attacks reliably and efficiently.
Present application intrusion detection systems cannot detect attacks reliably because, to per-
form efficiently, they are often signature based and, thus, unable to detect novel attacks whose
signatures are not available. Similarly, hybrid and anomaly detection systems are inefficient and
91
92 User Session Modeling using Unified Log for Application Intrusion Detection
unreliable, resulting in a large number of false alarms because they are based on thresholds which
are difficult to estimate accurately. Further, application based systems often consider sequential
events independently and hence unable to capture the sequence behaviour in consecutive events
in a single user session. Very often, attacks are a result of more than one events and monitoring
the events individually result in reduced attack detection accuracy. Hence, to detect attacks effec-
tively, we introduce user session modeling at application level by monitoring a sequence of events
using a moving window. We also integrate the unified logging framework which generates a single
unified log with features from both, the application accesses and the corresponding data accesses.
We evaluate various methods such as conditional random fields, support vector machines, decision
trees, naive Bayes and hidden Markov models and compare their attack detection capability. As
we will demonstrate from our experimental results, integrating the unified logging framework and
modeling user sessions result in better attack detection accuracy, particularly for the conditional
random fields. Session modeling, however, increases the complexity of the system. Nonetheless,
our experiments show that using conditional random fields higher attack detection accuracy can be
achieved by analyzing only a few events, which is desirable, as opposed to other methods which
must analyze a large number of events to operate with comparable accuracy. Further our system
operates efficiently as it uses simple statistics rather than analyzing all the features in every data
access. Finally, our system performs best and is able to detect disguised attacks reliably when
compared with other methods.
The rest of the chapter is organized as follows; we motivate the use of session modeling for
application intrusion detection in Section 6.2. We then describe the data sets used in our experi-
ments in Section 6.3 and our methodology in Section 6.4. We describe our experimental set up and
present our results in Section 6.5 followed by the analysis of our results in Section 6.6. In Section
6.7, we discuss some implementation issues such as the availability of training data and suitability
of our approach for a variety of applications. Finally, we conclude this chapter in Section 6.8.
6.2 Motivating Example
Recalling from the previous chapter, we defined an event as a single request-response pair which
can be represented as an ‘N’ feature vector as:
e
i
= f
1
, f
2
, f
3
... f
N
6.2 Motivating Example 93
Similarly, we defined a user session as an ordered set of events or actions performed, i.e., a
session is a sequence of one or more request-response pairs and is represented as a sequence of
event vectors:
s
i
= start, e
1
, e
2
, e
3
..., end
In many situations, to launch an attack the attacker must follow a sequence of events. For such
cases in particular, the attack will be successful when the entire sequence of events is performed.
Each event individually is not significant; however, the events if performed in a sequence can result
in powerful attacks. Further, the situation can be relaxed to give advantage to an attacker such that
the individual anomalous events may not strictly follow each other. As a result, the anomalous
events may be disguised within a number of legitimate events, such that the attack is successful
and hence, the overall session is considered as anomalous. For example, a single session with five
sequential events along with their labels may be represented as follows:
< Session Start >
e
1
< f
1
1
, g
1
2
, ...h
1
n
– Normal>
e
2
< f
2
1
, g
2
2
, ...h
2
n
– Normal>
e
3
< f
3
1
, g
3
2
, ...h
3
n
– Attack >
e
4
< f
4
1
, g
4
2
, ...h
4
n
– Normal>
e
5
< f
5
1
, g
5
2
, ...h
5
n
– Attack >
< Session End >
In the above sequence of events, e
1
...e
5
, when we consider every event individually, anoma-
lous events may not be detected, however, if the events are analyzed such that their sequence of
occurrence is taken into consideration, the attack sessions can be detected effectively. Consider for
example, a website which collects and stores credit card information and the following sequence
of events occur in a single session:
1. A user attempts to log in by entering a (stolen) user id and password. The log in is suc-
cessful. (Note that, SQL injection can also be used to reveal such login information).
2. The user then visits the home page and modifies some information (to create a backdoor
for reentry).
94 User Session Modeling using Unified Log for Application Intrusion Detection
3. The user exploits the application to gain administrator access.
4. The user then visits the home page of the original user (in order to attempt) to disguise the
previous event within normal events.
5. The user exploits administrator rights to reveal credit card information of other users.
It must be noted that in the above sequence of events, the individual events appear to be normal
events and may not be detected by the intrusion detection system when the system analyzes the
events in isolation. In particular, the third event in the above sequence, when analyzed in isolation,
may be considered as normal since the administrator can access the application using the super
user access. However, the overall sequence of events; transition from a user with limited access to
a user with administrator access and finally revealing the credit card information is made visible
only when the system analyzes all the events together in the session. Using session modeling,
we aim to minimize the number of false alarms and detect such attacks, including the disguised
attacks, which cannot be reliably detected by traditional intrusion detection systems.
6.3 Data Description
To perform experiments using user session modeling at the application level, there does not exists
any freely available data set which can be used. As a result, we collected the data sets locally as
described earlier in Chapter 5. We summarize the two data sets in Table 6.1.
Table 6.1: Data Sets
Number of Number of Number of
Web Requests Data Accesses Sessions
Data Set Normal 2615 232,655 117
One
Attack 272 44,390 45
Data Set Normal 1642 931,671 60
Two
Attack 241 249,597 25
Every session in both the data sets represents a sequence of event vectors, with each event
vector having six features. It is important to note that, though, both the applications are examples
6.4 Methodology 95
of online shopping website; there is difference in the two applications. One significant difference
is the application’s interaction with the underlying database which is encoded as the application
logic. As a result, the number of data accesses in the second data set (1,181,268) is significantly
larger than in the first (277,045). Further, the size of the two data sets is also different; the first
data set consists of 162 sessions, while the second has only 85 sessions. It is also important to
note that the two data sets were collected independently at different times.
6.4 Methodology
In order to gain data access an attacker performs a sequence of malicious events. An experienced
attacker can also disguise attacks within a number of normal events in order to avoid detection.
Hence, to reduce the false alarms and increase attack detection accuracy, intrusion detection sys-
tems must be capable of analyzing entire sequence of events rather than considering every event
in isolation [49]. We therefore propose user session modeling to detect application level attacks.
To model a sequence of event vectors, we need a method which does not assume independence
among sequential events. Hence, we use conditional random fields as the core intrusion detector
in our application intrusion detection system. The advantage of using conditional random fields is
that they predict the label sequence y given the observation sequence x allowing them to model
arbitrary relationships between different features in the observations without making independence
assumptions. Figure 6.1 shows, how conditional random fields can be used to model user sessions.
e
1
=
f
1
...f
6
e
2
=
f
1
...f
6
e
3
=
f
1
...f
6
e
4
=
f
1
...f
6
y
2
y
3
y
4
y
1
Figure 6.1: User Session Modeling using Conditional Random Fields
In the figure, e
1
, e
2
, e
3
, e
4
represents a user session of length four and every event e
i
in the session
is correspondingly labeled as y
1
, y
2
, y
3
, y
4
. Further, every event e
i
is a feature vector of length
six as described in the unified logging framework. The conditional random fields do not assume
96 User Session Modeling using Unified Log for Application Intrusion Detection
any independence among the sequence of events e
1
, e
2
, e
3
, e
4
. We note that a user session can
be of variable length and some sessions may be longer than others. Analyzing every session at
its termination is effective since complete session information is available; however, it has two
disadvantages:
1. The attack detection is not real-time.
2. The size of the session can be very large with more than 50 events. As a result, analyzing
all the events together increases the complexity and the amount of history that must be
maintained for session analysis.
Hence, we perform user session modeling using a moving window of events. We vary the
width of the window from 1 to 20 in all our experiments. Since the complexity of the system
increases as the width of the window increases, a method which can reliably detect attacks with
only a small number of events, i.e., at small values of window width, is considered better. Hence,
we restrict the window width to 20 in our experiments.
6.4.1 Feature Functions
For a conditional random field, it is critical to define the feature functions because the ability
of a conditional random field to model correlation between different features depends upon the
predefined features used for training the random field.
We use our domain knowledge to identify such dependencies in the features and then define
functions which extracts features from the training data. Examples of features extracted include; if
f eature
1
is request made = ‘abc’ and f eature
2
is reference to previous request = ‘xyz’ then label
is ‘normal’. Similarly another example can be; if f eature is amount of data transferred = ‘pqr’
then label is ‘attack’. Using feature conjunction, as shown in the first example helps to capture
the correlation between different features. Based on our domain knowledge, other features were
extracted similarly using the CRF++ tool [120]. The feature functions used in our experiments are
presented in Appendix C.
6.4.2 Session Modeling using a Moving Window of Events
We use the logs generated by the unified logging framework presented in Chapter 5 and perform
user session modeling using a moving window of events to build effective application intrusion
6.5 Experiments and Results 97
detection systems. For example, consider a session of length 10 represented as a sequence of
events:
< start >, e
1
, e
2
, e
3
, e
4
, e
5
, e
6
, e
7
, e
8
, e
9
, e
10
, < end >
Using a moving window of width five with a step size of one, the events in this session can be
analyzed as shown below (note that ∅ represents absence of an event):
e
1
, ∅, ∅, ∅, ∅ −→ ‘Label’ at step 1
e
1
, e
2
, ∅, ∅, ∅ −→ ‘Label’ at step 2
e
1
, e
2
, e
3
, ∅, ∅ −→ ‘Label’ at step 3
e
1
, e
2
, e
3
, e
4
, ∅ −→ ‘Label’ at step 4
e
1
, e
2
, e
3
, e
4
, e
5
−→ ‘Label’ at step 5
e
2
, e
3
, e
4
, e
5
, e
6
−→ ‘Label’ at step 6
e
3
, e
4
, e
5
, e
6
, e
7
−→ ‘Label’ at step 7
e
4
, e
5
, e
6
, e
7
, e
8
−→ ‘Label’ at step 8
e
5
, e
6
, e
7
, e
8
, e
9
−→ ‘Label’ at step 9
e
6
, e
7
, e
8
, e
9
, e
10
−→ ‘Label’ at step 10
It is evident from the above representation that the window of events is advanced forward by
one and hence, such a systemcan performin real-time. However, depending upon the requirements
of a particular application, the window can be advanced forward with a step size > 1. In such
cases, the system no longer operates in real-time. [Note that, if the analysis is performed only at
the end of every session, the system operates in batch mode.]
6.5 Experiments and Results
We now describe the experimental setup and compare our results using a number of methods
such as the conditional random fields, decision trees, naive Bayes, support vector machines and
hidden Markov models for detecting malicious data accesses at the application level. It is impor-
tant to note that the accuracy of attack detection and efficiency of operation are the two critical
factors which determine the suitability of any method for intrusion detection. A method which
98 User Session Modeling using Unified Log for Application Intrusion Detection
can detect most of the attacks but is extremely slow in operation may not be useful. Similarly, a
technique which is efficient but cannot detect attacks with acceptable level of confidence is not
useful. Hence, an intrusion detection technique must balance the two. Decision trees are very fast
and generally result in accurate classification. The naive Bayes classifier is simple to implement
and very efficient. Support vector machines are also considered to be high quality systems which
can handle data in high dimensional space. Hidden Markov models are well known for modeling
sequences and have been successful in various tasks in language processing. These methods have
been effectively used for building anomaly and hybrid intrusion detection systems. Our experi-
mental results, from Chapter 4, suggests that conditional random fields outperform these methods
and can be used to build accurate network intrusion detection systems. In this chapter, we analyze
the effectiveness of conditional random fields for building application intrusion detection systems
and compare their performance with these methods.
For our experiments, we use the CRF++ toolkit [120], hidden Markov model toolbox for
MatLab and the weka tool [121] and perform experiments, separately, using both the data sets.
We perform all experiments ten times by randomly selecting training and test data and report
their average. We use exactly the same training and test samples for all the five methods that
we compare (conditional random fields, decision trees, hidden Markov models, support vector
machines and naive Bayes classifier). It is important to note that methods such as decision trees,
naive Bayes and support vector machines are not designed for labeling sequential data. However,
to experiment with these methods, we convert every session into a single record by appending
sequential events at the end of the previous event and then label the entire session as either normal
or as attack. For example, for a session of length five, where every event is described by six
features, we create a single record with 5 ∗ 6 = 30 features. Additionally, for the support vector
machines we experiment with three kernels; poly-kernel, rbf-kernel and normalized-poly-kernel,
and vary the value of c between 1 and 100 for all of the kernels [121]. As we mentioned before,
we use six features to represent every event. Hence, for experiments with the hidden Markov
models, we build six different hidden Markov models, one for each feature, and then combine the
individual results using a voting mechanism to get the final label for the sequence, i.e., we label
the sequence as attack when the number of votes in favour of the attack class is greater than or
equal to three.
We perform our experiments using a moving window of events and vary the window width ‘S’
6.5 Experiments and Results 99
from 1 to 20. Window of width S = 1 indicates that we consider only the current event and do
not consider the history, while a window of width S = 20 implies that a sequence of 20 events
is analyzed to perform the labeling. We limit ‘S’ to 20 as the complexity of the system increases
which affects system’s efficiency. This can, however, be exploited by attackers since they can
hide the attacks within normal events, making attack detection very difficult. Thus, to make the
intrusion detection task more realistic, we define the disguised attack parameter, ‘p’ as follows:
p =
number o f Attack events
number o f Normal events + number o f Attack events
where number o f Attack events > 0 and number o f Normal events ≥ 0. The value of ‘p’
lies in the range (0,1]. The attacks are not disguised when p = 1, since in this case the number
of normal events is 0. As the value of ‘p’ decreases, when the number of normal events increases,
the attacks are disguised in a large number of normal events. To create disguised attack data, we
add a random number of attack events at random locations in the normal sessions and label all the
events in the session as attack. This results in hiding the attacks within normal events such that the
attack detection becomes difficult. We perform experiments to reflect these scenarios by varying
the number of normal events in an attack session by setting ‘p’ between 0 and 1.
6.5.1 Experiments with Clean Data (p = 1)
We first set p = 1, i.e., the attacks are not disguised. In Figure 6.2, we compare the attack detection
accuracy (F-Measure) as we increase the window width ‘S’ from 1 to 20 for a fixed value of p = 1
for conditional random fields, support vector machines, decision trees, naive Bayes and hidden
Markov models for both the data sets.
Results for both the data sets show similar trends. We observe that conditional random fields
and support vector machines perform similarly and their attack detection capability (F-Measure)
increases, slowly but steadily, as the value of ‘S’ increases. This shows that modeling a user session
results in better attack detection accuracy when compared to analyzing the events individually, i.e.
attack detection accuracy improves as ‘S’ increases.
Conditional randomfields do not consider the sequence of events in a session to be independent
and, hence, can model the correlation between events. Hence, they can detect attacks reliably.
Support vector machines also result in good attack detection accuracy and can easily handle a
100 User Session Modeling using Unified Log for Application Intrusion Detection
large number of features, thereby resulting in good classification.
Decision trees and naive Bayes perform poorly and have low F-Measure regardless of the
window width ‘S’. Their accuracy improves initially as ‘S’ increases but when ‘S’ becomes large
their accuracy tends to decrease. This is because; they consider features independently to label a
particular event in a session and then combine the results for all the features but do not consider the
correlation between them. When the number of features is less, the error due to loss of correlation
is small which increases with the number of features. Also, the number of input features increases
as ‘S’ increases; but the decision trees selects a subset of features and its size remains fairly
constant. Hence, there is little effect of ‘S’ on the attack detection accuracy for decision trees
when compared with the naive Bayes classifier.
Hidden Markov models also perform poorly; however, their accuracy improves slightly as ‘S’
increases. When compared with the conditional random fields, hidden Markov models have lower
attack detection accuracy because they are generative systems which model the joint distribution
instead of the conditional distribution and, thus, make independence assumptions. Furthermore,
they cannot model long range dependencies in the observations, thereby resulting in poor perfor-
mance.
6.5 Experiments and Results 101
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
CRF
SVM
Naive Bayes
C4.5
HMM
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
CRF
SVM
Naive Bayes
C4.5
HMM
(b) Data Set Two
Figure 6.2: Comparison of F-Measure (p = 1)
102 User Session Modeling using Unified Log for Application Intrusion Detection
6.5.2 Experiments with Disguised Attack Data (p = 0.60)
In order to test the robustness of different methods, we perform experiments with disguised attack
data. Using such a data set makes attack detection realistic and more difficult as an attacker may
try to hide the attack within normal events. As discussed earlier, we define the disguised attack
parameter, ‘p’, where p < 1 indicates that the attack is disguised within normal events in a session.
In Figure 6.3, we compare the results for all the five methods for both the data sets at p = 0.60.
For both the data sets, we observe that the attack detection capability decreases as the attacks
are disguised within normal events. However, the conditional random fields perform best, outper-
forming all other methods and are robust in detecting disguised attacks when compared with any
other method. Hidden Markov models are least effective for the first data set while support vector
machines and naive Bayes classifier have similar performance. The decision trees are least effec-
tive in detecting disguised attacks for the second data set. Again, the attack detection accuracy
increases as ‘S’ increases.
The reason for better accuracy for the conditional random fields is that they can model long
range dependencies among the events in a sequence, since they do not assume independence within
the event vectors and, thus, perform effectively even when the attacks are disguised. As we de-
crease ‘p’, the support vector machines did not perform as well. The reason for this is that, the
support vector machines cannot geometrically differentiate between the normal and attack events
because of the overlap between the normal data space and the attack data space.
The variation in performance of the hidden Markov models and the decision trees for the two
data sets is attributed to the size of the data sets. The size of the first data set is bigger, compared
to that of the second. As a result, the decision trees can better select significant features in the
first data set resulting in higher accuracy. However, for the second data set, due to its small size,
the decision trees cannot perform optimally compared to the hidden Markov models. The hidden
Markov models perform better because they consider the sequence information which becomes
significant when the size of the data set is small.
6.5 Experiments and Results 103
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
CRF
SVM
Naive Bayes
C4.5
HMM
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
CRF
SVM
Naive Bayes
C4.5
HMM
(b) Data Set Two
Figure 6.3: Comparison of F-Measure (p = 0.60)
104 User Session Modeling using Unified Log for Application Intrusion Detection
Results using Conditional Random Fields
We study the Precision, Recall and F-Measure for conditional random fields at p = 0.60 and
present the results in Figure 6.4.
Results for conditional random fields, from both the data sets, suggest that they have high F-
Measure which increases steadily as the windowwidth ‘S’ increases. The best value for F-Measure
for data set one is 0.87 at S = 15, while it is 0.65 at S = 20 for data set two. This suggests that
the system based on conditional random field generates fewer false alarms and performs reliably
even when attacks are disguised.
6.5 Experiments and Results 105
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(b) Data Set Two
Figure 6.4: Results using Conditional Random Fields at p = 0.60
106 User Session Modeling using Unified Log for Application Intrusion Detection
Results using Support Vector Machines
Figure 6.5 represents the variation in Precision, Recall and F-Measure for support vector machines
as we increase ‘S’ from 1 to 20 at p = 0.60.
As mentioned earlier, for support vector machines, we experiment with three kernels; poly-
kernel, rbf-kernel and normalized-poly-kernel, and vary the value of c between 1 and 100 for all of
the three kernels. We observe that the poly-kernel with c = 1 performs best and, hence, we report
the results using the same kernel. Figure 6.5 shows that support vector machines have moderate
Precision for both the data sets, but low Recall and hence low F-Measure. The best value of F-
Measure for support vector machines for data set one is 0.82 at S = 17, while it is 0.49 at S = 20
for data set two in comparison to the conditional random fields which have the F-Measure of 0.87
and 0.65 for data set one and data set two respectively.
6.5 Experiments and Results 107
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(b) Data Set Two
Figure 6.5: Results using Support Vector Machines at p = 0.60
108 User Session Modeling using Unified Log for Application Intrusion Detection
Results using Decision Trees
We study the variation in Precision, Recall and F-Measure for decision trees in Figure 6.6.
Results from Figure 6.6 show that the decision trees have very low F-Measure suggesting
that they cannot be used effectively for detecting anomalous data accesses when the attacks are
disguised. The detection accuracy for decision trees remains fairly constant as ‘S’ increases and is
maximum at S = 20 and at S = 19 for the two data sets.
6.5 Experiments and Results 109
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(b) Data Set Two
Figure 6.6: Results using Decision Trees at p = 0.60
110 User Session Modeling using Unified Log for Application Intrusion Detection
Results using Naive Bayes Classifier
Figure 6.7 represents the variation in Precision, Recall and F-Measure for the naive Bayes classifier
as we vary ‘S’ from 1 to 20 at p = 0.60.
Experimental results using both the data sets show similar trend for the naive Bayes classifica-
tion. The results suggest that the system has low F-Measure and there is little improvement in the
attack detection accuracy as ‘S’ increases. The maximum value for F-Measure is 0.67 at S = 12
for data set one and 0.43 at S = 19 for data set two, suggesting that a system based on naive Bayes
classifier cannot detect attacks reliably.
6.5 Experiments and Results 111
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(b) Data Set Two
Figure 6.7: Results using Naive Bayes Classifier at p = 0.60
112 User Session Modeling using Unified Log for Application Intrusion Detection
Results using Hidden Markov Models
We present the Precision, Recall and F-Measure for the hidden Markov models for both the data
sets at p = 0.60 in Figure 6.8.
From Figure 6.8, we observe that the hidden Markov models have very high Recall but very
low Precision and hence low F-Measure. There is little effect of ‘S’ on the F-Measure which does
not improve significantly.
6.5 Experiments and Results 113
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
Width of Window, S
Precision Recall F-Measure
(b) Data Set Two
Figure 6.8: Results using Hidden Markov Models at p = 0.60
114 User Session Modeling using Unified Log for Application Intrusion Detection
6.6 Analysis of Results
Experimental results clearly suggest that the conditional random fields outperform other methods
and are the best choice to build application intrusion detection systems.
6.6.1 Effect of ‘S’ on Attack Detection
In our experiments, we use a moving window to model user sessions by varying ‘S’ from 1 to 20.
We want ‘S’ to be small since the complexity and the amount of history that must be maintained
increases with ‘S’ and the system cannot respond to attacks in real-time. Window width of 20 and
beyond is often large, resulting in delayed attack detection and high computation cost. Tables 6.2
and 6.4 describe the effect of ‘S’ on attack detection for the two data sets.
Table 6.2: Effect of ‘S’ on Attack Detection for Data Set One, when p = 0.60
F-Measure
Width of Hidden
Decision Naive
Support Conditional
Window Markov
Trees Bayes
Vector Random
‘S’ Models Machines Fields
1 0.00 0.47 0.61 0.56 0.62
2 0.24 0.47 0.58 0.66 0.66
3 0.27 0.44 0.61 0.69 0.68
4 0.26 0.47 0.65 0.71 0.79
5 0.27 0.46 0.64 0.72 0.76
6 0.30 0.44 0.60 0.69 0.76
7 0.31 0.33 0.61 0.68 0.81
8 0.35 0.47 0.65 0.74 0.81
9 0.36 0.51 0.65 0.70 0.80
10 0.35 0.48 0.65 0.75 0.83
11 0.35 0.51 0.66 0.80 0.84
12 0.39 0.41 0.67 0.75 0.82
13 0.38 0.44 0.65 0.77 0.84
14 0.38 0.47 0.63 0.74 0.86
15 0.39 0.50 0.66 0.80 0.87
16 0.40 0.50 0.63 0.77 0.86
17 0.39 0.47 0.65 0.82 0.86
18 0.41 0.51 0.64 0.78 0.87
19 0.40 0.53 0.64 0.76 0.86
20 0.41 0.56 0.66 0.81 0.86
6.6 Analysis of Results 115
From Table 6.2, we observe that conditional random fields perform best and their attack de-
tection capability increases as the window width increases. Additionally, when we increase ‘S’
beyond 20 (not shown in the graphs), the attack detection accuracy increases steadily and the
system achieves very high F-Measure when we analyze the events in the entire session together.
Results for the first data set show that the hidden Markov model performs best at S = 18 while
conditional random fields achieve the same performance at S = 1. Similarly, decision trees an-
alyzes 20 events to reach their best performance while conditional random fields achieve same
performance by analyzing only a single event (i.e., at S = 1). Naive Bayes peak their perfor-
mance at S = 12 while conditional random fields achieve same performance at S = 3. Finally,
support vector machines reach their best performance at window width of 17 while the conditional
random fields achieve same performance at S = 10. We compare various methods in Table 6.3.
Table 6.3: Analysis of Performance of Different Methods
HMM C4.5 Naive Bayes SVM CRF
HMM (0.41) S = 18 S = 1 S = 1 S = 1 S = 1
C4.5 (0.56) S > 20 S = 20 S = 1 S = 1 S = 1
Naive Bayes (0.67) S > 20 S > 20 S = 12 S = 3 S = 3
SVM (0.82) S > 20 S > 20 S > 20 S = 17 S = 10
CRF (0.87) S > 20 S > 20 S > 20 S > 20 S = 15
Table 6.3 can be interpreted as follows. Row one in the table shows that the hidden Markov
models achieve the best F-Measure of 0.41 at S = 18 while decision trees, naive Bayes classifier,
support vector machines and conditional random fields achieve the same F-Measure at S = 1.
Similarly, the last row indicates that the conditional random fields achieve the highest F-Measure
of 0.87 at ‘S’ value of 15 while all other methods require more than 20 events to achieve the same
performance.
Hence, performing session modeling using conditional random fields result in higher accu-
racy for attack detection at lower values of ‘S’ which is desirable, since it results in early attack
detection and an efficient system.
116 User Session Modeling using Unified Log for Application Intrusion Detection
Table 6.4: Effect of ‘S’ on Attack Detection for Data Set Two, when p = 0.60
F-Measure
Width of Hidden
Decision Naive
Support Conditional
Window Markov
Trees Bayes
Vector Random
‘S’ Models Machines Fields
1 0.00 0.28 0.28 0.21 0.50
2 0.35 0.01 0.31 0.36 0.48
3 0.37 0.03 0.35 0.40 0.52
4 0.42 0.02 0.36 0.39 0.50
5 0.39 0.04 0.38 0.37 0.53
6 0.41 0.13 0.37 0.42 0.53
7 0.37 0.18 0.37 0.35 0.57
8 0.42 0.06 0.39 0.35 0.58
9 0.42 0.25 0.42 0.38 0.55
10 0.45 0.21 0.41 0.40 0.55
11 0.46 0.35 0.37 0.32 0.52
12 0.44 0.16 0.36 0.35 0.56
13 0.44 0.23 0.39 0.25 0.56
14 0.42 0.26 0.41 0.36 0.58
15 0.42 0.34 0.38 0.46 0.59
16 0.43 0.21 0.37 0.43 0.59
17 0.42 0.31 0.40 0.41 0.60
18 0.41 0.35 0.41 0.46 0.63
19 0.41 0.42 0.43 0.41 0.63
20 0.40 0.30 0.41 0.49 0.65
6.6.2 Effect of ‘p’ on Attack Detection (0 < p ≤ 1)
To analyze the robustness of conditional random fields, we experiment with the disguised attack
data by varying the disguised attack parameter, ‘p’, between 0 and 1. Figure 6.9, represents the
effect of ‘p’ on conditional random fields for different values of ‘S’ for both the data sets. We do
not present the results for other methods since they perform poorly at lower values of ‘p’.
From Figure 6.9, we make two observations; first, as the value of ‘p’ decreases, i.e., attacks
are disguised within normal events, the attack detection accuracy decreases making it difficult to
detect attacks and second, regardless of the value of ‘p’ and for a fixed value of ‘p’, the attack
detection accuracy increases as the width of the window ‘S’ increases.
6.6 Analysis of Results 117
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
p=1.00 p=0.60 p=0.45 p=0.35 p=0.25 p=0.15
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
p=1.00 p=0.60 p=0.45 p=0.35 p=0.25 p=0.15
(b) Data Set Two
Figure 6.9: Effect of ‘p’: Results using Conditional Random Fields when 0 < p ≤ 1
118 User Session Modeling using Unified Log for Application Intrusion Detection
6.6.3 Significance of Using Unified Log
We performed our experiments using the unified log (based on the framework described in Chapter
5) to detect application level attacks. By using the unified log, our system can analyze the user
behaviour (via the web accesses) and its effect on the application behaviour (via the corresponding
data accesses). Features in both the logs are correlated and analyzing themindividually by building
separate systems significantly affects attack detection capability. Hence, we perform additional
experiments where we build three separate systems and compare them with our approach. The
first system analyzes only the application logs, while the second system analyzes only the data
access logs. In the third system, we combine the individual responses from both the systems using
a voting mechanism to determine the final labeling. If either of the two systems labels an event
as attack, we label the event as attack. We call this system as a voting based system. We use
the same instances as used in our previous experiments and present the results with conditional
random fields at ‘p’ value of 0.60 by varying ‘S’ from 1 to 20. We present the comparison in
Figure 6.10.
The results clearly suggest that using a single system, based on our approach of session mod-
eling with unified log, performs best. We also observe that when we use two separate systems and
use a voting mechanism to determine the final label, the performance improves for the first data
set, but it decreases for the second data set. Hence, we can conclude that using a voting mechanism
may not always be useful.
An advantage of our system is that it can be deployed in real environment as it analyzes only
the summary statistics extracted from the data access logs rather than analyzing every data access
to match previously known attack signatures. From Table 6.1 in Section 6.3, it is evident that
using the unified log eliminate the need to consider over one million (1,181,268) data accesses
for the second data set. Instead, our approach limits the number of events to the number of web
accesses, which is significantly smaller when compared to the number of data accesses. Hence,
our approach uses features from both, the web access logs and the corresponding data access logs,
and at the same time limits the load at the intrusion detection system which is significant in high
speed application environment.
6.6 Analysis of Results 119
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
Unified Log
Voting Based
Web Access Logs Alone
Data Access Logs Alone
(a) Data Set One
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20
F
-
M
e
a
s
u
r
e
Width of Window, S
Unified Log
Voting Based
Web Access Logs Alone
Data Access Logs Alone
(b) Data Set Two
Figure 6.10: Significance of Using Unified Log
120 User Session Modeling using Unified Log for Application Intrusion Detection
6.6.4 Test Time Performance
It is not justified to compare the efficiency of our system with that of a signature based system
because the two systems are significantly different in their attack detection capability. Signature
based systems simply perform signature matching for previously known attacks while the strength
of anomaly and hybrid systems, such as one described in this chapter, lies in their capability of
detecting novel attacks in addition to detecting previously seen attacks.
It is important to note that the unification of logs does incur some overhead. However, this
overhead can be eliminated by developing better software engineering practices which are aware
of the security implications, particularly in web based applications. Security aware software engi-
neering practices can be followed which can provide standardized unified log rather than separately
logging web accesses and their corresponding data accesses. Nonetheless, the overhead incurred
is very small when compared with the time required to individually analyze the web access logs
and the data access logs.
We now compare the test time performance of different methods. We are generally not inter-
ested in the training time because training is often a one time process and can be performed offline.
Hence, we focus only on the test time complexity. During testing, both conditional random fields
and hidden Markov models employ the Viterbi algorithm which has a complexity of O(TL
2
),
where T is the length of the sequence and L is the number of labels. The quadratic complexity
is problematic when the number of labels is large, such as in the language tasks, but for intrusion
detection we have a limited number of labels (normal and attack) and, hence, the system is ef-
ficient. Support vector machines, naive Bayes classifier and decision trees are very efficient and
can handle large dimensionality in data. Table 6.5 compares the average test time for analyzing a
session by different methods at S = 20 and p = 0.60 for both the data sets.
The test time performance of various systems presented in Table 6.5, appear to be counter in-
tuitive and we expect the performance of naive Bayes classifier and decision trees to be better than
that of the conditional random fields. On the contrary, we observe that the conditional random
fields perform best. However, considering the fact that when we increase ‘S’ to 20, the complex-
ity increases for decision tress, support vector machines and naive Bayes classifiers because the
number of features increases from 6 to 120 while the number of features in conditional random
fields still remain equal to six. Further, we consider a first order Markov assumption for labeling
6.6 Analysis of Results 121
Table 6.5: Comparison of Test Time
Test Time (µ sec.)
Data Set One Data Set Two
Conditional Random Fields 510 555
Hidden Markov Models 7361 7415
Decision Trees 3515 3510
Naive Bayes Classifier 4125 4080
Support Vector Machines 9740 9125
in the conditional random fields and the label set itself is very small (equal to two) which results
in high test time efficiency. Additionally, the time complexity for hidden Markov models is higher
because of the additional overhead involved in combining the results from six independent models
to get the final label as we discussed earlier.
From our experiments, we observe that the results follow the same trend for both the data
sets and, hence, we can conclude that our results are not the artifact of a particular data set, that
our framework is application independent and can be easily used for a variety of applications.
Therefore, considering both, the attack detection accuracy and the test time performance, the
conditional random fields score better and are a strong candidate for building robust and efficient
application intrusion detection systems.
6.6.5 Discussion of Results
Experimental results from both the data sets clearly suggest that conditional random fields, when
compared with other methods, perform best and are able to detect attacks reliably, even when the
attacks are disguised in normal events, i.e., at lower values of ‘p’. Further, performing session
modeling using our unified logging framework, based on unified web access and data access logs,
help to improve attack detection accuracy. This is because, very often, to launch an attack, the
attacker performs a number of events in a sequence. As a result, systems based on session model-
ing can detect attacks better when compared to those which analyze every event in isolation. This
is clear from our experiments where we show that the attack detection improves as the value of
122 User Session Modeling using Unified Log for Application Intrusion Detection
‘S’ increases. We also note that an experienced attacker may disguise attacks within 20 or more
normal events. Even then, our system is capable of detecting attacks as the system does not con-
sider events independently. However, there is a tradeoff between the disguise attack parameter ‘p’
and the window width ‘S’. In general, for better attack detection, ‘S’ must be increased when ‘p’
decreases. The advantage of conditional random fields is that higher attack detection occurs at
lower values of ‘S’ which is desirable for the reasons discussed before.
The reason for better attack detection with conditional random fields is that they do not con-
sider the features to be independent and are able to model the correlation between them. Further,
they can model the long range dependencies between sequential events in a session and, hence,
they can reliably detect attacks when the value of ‘p’ decreases. Conditional random fields do not
make any unwarranted assumptions about the data, and once trained they are very efficient and
robust. Support vector machines, decision trees and naive Bayes classifiers on the other hand con-
sider the events to be independent and ignore the correlation between features, thereby resulting in
lower accuracy of attack detection. Similarly, as we discussed earlier, hidden Markov models are
generative systems and cannot represent long range dependencies among observations, thereby
resulting in lower accuracy of attack detection.
Performing session modeling using a moving window of events in our unified logging frame-
work helps to correlate the user behaviour and the application behaviour providing rich interacting
features which improve attack detection. Our experimental results confirm that when the unified
log are analyzed using session modeling, the system can detect attacks with higher accuracy as
opposed to the independent analysis of the web access logs and the data access logs.
Finally, it is important to note that simulating a few attacks does not necessarily imply that
our system is limited in detecting only these attacks. We have already discussed that our system
focuses on modeling the interaction between the user behaviour and the application behaviour.
Hence, our system can detect any illegitimate data access since malicious modifications result
in different application-data interaction when compared to the legitimate requests. Our system
focuses on detecting such modifications by combining the user behaviour with the application
behaviour instead of using specially crafted signatures which are limited in detecting specific
attacks. Further, we performed our experiments on two data sets and our results clearly suggest
that the conditional random fields perform best for both the data sets, establishing that our results
are not an artifact of a particular data set.
6.7 Issues in Implementation 123
6.7 Issues in Implementation
Experimental results show that our approach based on conditional random fields can be used to
build effective application intrusion detection systems. However, before deployment, it is critical
to resolve issues such as the availability of the training data and suitability of our approach for a
variety of applications. We now discuss various methods which can be employed to resolve such
issues.
6.7.1 Availability of Training Data
Though our system is application independent and can be used to detect malicious data access in
a variety of applications, it must be trained before the system can be deployed online to detect
attacks. This requires training data which is specific to the application. To obtain such data may
be difficult. However, training data can be made available as early as during the application testing
phase when the application is tested to identify errors. Logs generated during the application test-
ing phase can be used for training the intrusion detection system. However, this requires security
aware software engineering practices which must ensure that necessary measures are taken to pro-
vide training data during the application development phase, which can be used to train effective
application intrusion detection systems.
6.7.2 Suitability of Our Approach for a Variety of Applications
As we already discussed, our framework is generic and can be deployed for a variety of applica-
tions. It is particularly suited to applications which follow the three tier architecture which have
application and data independence. Furthermore, our framework can be easily extended and de-
ployed in the Service Oriented Architecture [133]. This is because as part of the business solution,
the service oriented architecture defines numerous services each of which provides specific func-
tionality and which have the capability to interact among themselves. Our proposed framework
can be considered as a special case for the service oriented architecture which defines only one
service. Nonetheless, it can be easily extended to the general service oriented architecture by se-
lecting many services. This would, however, require some domain specific knowledge in order
to identify the correlated services (applications). The challenge is to identify such correlations
automatically and this provides an interesting direction for future work.
124 User Session Modeling using Unified Log for Application Intrusion Detection
6.8 Conclusions
In this chapter, we implemented user session modeling using a moving window of events in our
unified logging framework to build application intrusion detection systems which can detect ap-
plication level attacks effectively and efficiently. Experimental results confirm that conditional
random fields can be effectively used in our framework and perform better when compared with
other methods. In our framework, we considered a sequence of events in a session, rather than
analyzing the events individually which improves the attack detection accuracy. Our system based
on conditional random fields can detect attacks at smaller values of ‘S’ resulting in early attack
detection. We also showed that unified log not only helps to improve the attack detection accu-
racy but also to improve system’s performance since we can use summary statistics rather than
analyzing every data access. Our experimental results with multiple data sets show similar trends
and confirm that our framework is application independent and can be used for a variety of appli-
cations. Another advantage of our system is that it models user-application and application-data
interaction which does not vary overtime as compared to modeling user profiles which change
frequently. Application and data interaction vary only in case of an attack which is detected by
our system. We also showed that our system using conditional random fields is robust and is able
to detect disguised attacks effectively.
Finally, following better security aware software engineering practices and taking care of log-
ging mechanism during application development would not only help in application testing and
related areas but would also provide necessary framework for building better and efficient appli-
cation intrusion detection systems, such as those discussed in this chapter.
Chapter 7
Conclusions
I
N this thesis, we explored the suitability of conditional random fields for building robust
and efficient intrusion detection systems which can operate, both, at the network and at the
application level. In particular, we introduced novel frameworks and developed models which
address three critical issues that severely affect the large scale deployment of present anomaly and
hybrid intrusion detection systems in high speed networks. The three issues are:
1. Limited attack detection coverage
2. Large number of false alarms and
3. Inefficiency in operation
Other issues such as the scalability and ease of system customization, robustness of the system
to noise in the training data, availability of training data, and the ability of the system to detect
disguised attacks were also addressed. As a result of this research, we conclude that:
1. Layered framework can be used to build efficient intrusion detection systems. In addition,
the framework offers ease of scalability for detecting different variety of attacks as well
as ease of customization by incorporating domain specific knowledge. The framework
also identifies the type of attack and, hence, specific intrusion response mechanism can be
initiated which helps to minimize the impact of the attack.
2. Conditional random fields are a strong candidate for building robust and efficient intru-
sion detection systems. Integrating the layered framework with the conditional random
fields can be used to build effective and efficient network intrusion detection systems. Us-
ing conditional random fields as intrusion detectors result in very few false alarms and,
thus, the attacks can be detected with very high accuracy.
125
126 Conclusions
3. Unified logging framework can capture user-application and application-data interactions
which are significant to detect application level attacks. The framework is application
independent and can be used for a variety of applications.
4. User session modeling using the unified log must be performed in order to detect applica-
tion level attacks with high accuracy. Conditional random fields can be effectively used in
this framework to model a sequence of events in a user session. Using conditional random
fields’ attacks can be detected at smaller window widths, thereby, resulting in an efficient
system. Additionally, the system is robust and can effectively detect disguised attacks.
We performed a range of experiments which show that, in order to detect intrusions effectively,
it is critical to model the correlations between multiple features in an observation. Assuming var-
ious features to be independent, though, makes a model simple and efficient; it affects its attack
detection capability. Conditional random fields can easily model such correlations by defining
specific feature functions which make them a strong candidate for building effective intrusion
detectors. Further, we introduced the layered framework which helps to improve overall system
performance. Our framework is highly scalable, easily customizable and can be used to build effi-
cient network intrusion detection systems which can detect a wide variety of attacks. Experimental
results on the benchmark KDD 1999 intrusion data set [12] and comparison with other well known
methods for intrusion detection such as decision trees, naive Bayes, support vector machines and
the winners of the KDD 1999 cup, show that our approach, based on layered conditional random
fields, outperform these methods; in terms of, both, accuracy of attack detection and efficiency of
system operation. The impressive part of our results is the percentage improvement in attack de-
tection accuracy, particularly, for User to Root (U2R) attacks (34.8% improvement) and Remote
to Local (R2L) attacks (34.5% improvement). Statistical tests also demonstrate higher confidence
in detection accuracy with layered conditional random fields. We also showed that our system
is robust and can detect attacks with higher accuracy, when compared with other methods, even
when trained with noisy data. Finally, our system is not based on attack signatures and, hence,
capable of detecting novel attacks.
We also performed experiments which showthat, in order to effectively detect application level
attacks, it is critical to model the sequence of events. This is because, very often, an attacker must
perform a number of sequential operations in order to launch a successful attack. Additionally,
7.1 Directions for Future Research 127
for most applications, and in particular for web based applications, the application access logs
and the corresponding data access logs are highly correlated. To detect attacks at the application
level, the application logs or (and) the data access logs can be used. However, present application
intrusion detection systems analyze the logs separately, often using two separate systems, resulting
in inefficient systems which give a large number of false alarms and, hence, low attack detection
accuracy. To address these issues, we introduced our unified logging framework which integrates
the application access logs and the corresponding data access logs to generate unified log. As
a result, the user-application and the application-data interaction can be captured; this can be
used to detect attacks with high accuracy. Further, the user-application and the application-data
interactions are stable and do not vary overtime as opposed to modeling user profiles which change
frequently. Experimental results confirm that our system, based on user session modeling using
conditional random fields which analyze unified log, can detect attacks at an early stage by
analyzing only a small number of past events resulting in an efficient system which can block
attacks in real-time. Experimental results also demonstrate that our system is robust and can detect
disguised attacks effectively, outperforming other methods such as the hidden Markov models,
support vector machines, decision trees and the naive Bayes. In particular, for data set one at p
= 0.60, using conditional random fields in our unified logging framework achieves an F-Measure
of 0.87 while the same for hidden Markov models, decision trees, naive Bayes and support vector
machines is 0.41, 0.56, 0.67 and 0.82 respectively. Similarly, for data set two at p = 0.60, our
system achieves an F-Measure of 0.65 while the same for hidden Markov models, decision trees,
naive Bayes and support vector machines is 0.46, 0.42, 0.43 and 0.49 respectively. Finally, the two
data sets which we collected can be downloaded from [13] and can be used to build and evaluate
application intrusion detection systems.
7.1 Directions for Future Research
The critical nature of the task of detecting intrusions in networks and applications leaves no mar-
gin for errors. The effective cost of a successful intrusion overshadows the cost of developing
intrusion detection systems and hence, it becomes critical to identify the best possible approach
for developing better intrusion detection systems.
Every network and application is custom designed and it becomes extremely difficult to de-
128 Conclusions
velop a single solution which can work for every network and application. In this thesis, we
proposed novel frameworks and developed methods which perform better than previously known
approaches. However, in order to improve the overall performance of our system we used the
domain knowledge for selecting better features for training our models. This is justified because
of the critical nature of the task of intrusion detection. Using domain knowledge to develop bet-
ter systems is not a significant disadvantage; however, developing completely automatic systems
presents an interesting direction for future research.
From our experiments, it is evident that our systems performed efficiently. However, devel-
oping faster implementations of conditional random fields particularly for the domain of intrusion
detection requires further investigation.
Another possible direction for future research is to employ our approach, layered framework,
for building highly efficient systems since they give opportunity to implement pipelining of layers
in multi core processors.
We demonstrated the effectiveness of our application intrusion detection system in the well
known three tier application architecture. However, our framework can be extended and deployed
in the Service Oriented Architecture [133] and presents another line of interesting research.
There is ample scope and need to build systems which aim at preventing attacks rather than
simply detecting them. Integrating intrusion detection systems with the security policy in individ-
ual networks would help to minimize the false alarms and qualify the alarms raised by the intrusion
detection systems.
Thoughts for Practitioners
We now outline some open issues which are significant but outside the scope of this thesis which
must be considered in order to develop better intrusion detection systems [3].
1. Many of the attacks are successful because the attackers enjoy anonymity and they can
launch attacks from spoofed sources, making it very hard to trace back the true source of
the attack. However, if there is a reliable method to trace back the packets to their actual
source, many of the attacks can be prevented. Solutions are available for this such as
the adjusted probabilistic packet marking and others [40], but they require a global effort
which is not very easy to ensure. The problem is to identify the true source of attack
without affecting the performance of the overall system.
7.1 Directions for Future Research 129
2. Security policy plays an important role in a network and describes the acceptable and non
acceptable usage of the resources. There are two major issues in defining the security
policy; first, the security policy must be complete and second, the policy must be clear and
unambiguous. Hence, the problem is to clearly define the acceptable and the unacceptable
usage of every resource.
3. Many systems are based upon authenticating a user. However, authentication mechanisms
such as the use of login and password are weak and can be compromised. Multi factor
authentication and use of biometric methods have been introduced but they can also be
bypassed. The problem is how to link the supplied credentials with the actual human user?
Methods based on user profiling can be used which learn the normal user profile and then
can be used to detect significant deviations from the learnt profile. However, they are based
upon thresholds which are selected by empirical analysis and, hence, may not always be
accurate.
The field of intrusion detection has been around since 1980’s and a lot of advancement has
been made in the same. However, to keep pace with the rapid and ever changing networks and ap-
plications, the research in intrusion detection must synchronize with the present networks. Present
networks increasingly support wireless technologies, removable and mobile devices. Intrusion de-
tection systems must integrate with such networks and devices and provide support for advances
in a comprehensible manner.
Bibliography
[1] Stefan Axelsson. Research in Intrusion-Detection Systems: A Survey. Technical Report
98-17, Department of Computer Engineering, Chalmers University of Technology, 1998.
[2] SANS Institute - Intrusion Detection FAQ. Last accessed: Novmeber 30, 2008. http:
//www.sans.org/resources/idfaq/.
[3] Kotagiri Ramamohanarao, Kapil Kumar Gupta, Tao Peng, and Christopher Leckie. The
Curse of Ease of Access to the Internet. In Proceedings of the 3
rd
International Confer-
ence on Information Systems Security (ICISS), pages 234–249. Lecture Notes in Computer
Science, Springer Verlag, Vol (4812), 2007.
[4] Overview of Attack Trends, 2002. Last accessed: November 30, 2008. http://www.
cert.org/archive/pdf/attack_trends.pdf.
[5] Kapil Kumar Gupta, Baikunth Nath, Kotagiri Ramamohanarao, and Ashraf Kazi. Attacking
Confidentiality: An Agent Based Approach. In Proceedings of IEEE International Confer-
ence on Intelligence and Security Informatics, pages 285–296. Lecture Notes in Computer
Science, Springer Verlag, Vol (3975), 2006.
[6] The ISC Domain Survey. Last accessed: Novmeber 30, 2008. https://www.isc.
org/solutions/survey/.
[7] Peter Lyman, Hal R. Varian, Peter Charles, Nathan Good, Laheem Lamar Jor-
dan, Joyojeet Pal, and Kirsten Swearingen. How much Information. Last ac-
cessed: Novmeber 30, 2008. http://www2.sims.berkeley.edu/research/
projects/how-much-info-2003.
[8] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Survey of Network-Based
Defense Mechanisms Countering the DoS and DDoS Problems. ACM Computing Surveys,
39(1):3, 2007. ACM.
131
132 BIBLIOGRAPHY
[9] Animesh Patcha and Jung-Min Park. An Overview of Anomaly Detection Techniques:
Existing Solutions and Latest Technological Trends. Computer Networks, 51(12):3448–
3470, 2007.
[10] CERT/CC Statistics. Last accessed: Novmeber 30, 2008. http://www.cert.org/
stats/.
[11] Thomas A. Longstaff, James T. Ellis, Shawn V. Hernan, Howard F. Lipson, Robert D.
Mcmillan, Linda Hutz Pesante, and Derek Simmel. Security of the Internet. Technical
Report The Froehlich/Kent Encyclopedia of Telecommunications Vol (15), CERT Coordi-
nation Center, 1997. Last accessed: Novmeber 30, 2008. http://www.cert.org/
encyc_article/tocencyc.html.
[12] KDD Cup 1999 Intrusion Detection Data. Last accessed: Novmeber 30, 2008. http:
//kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[13] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Application based
Intrusion Detection Dataset. Last accessed: Novmeber 30, 2008. http://www.csse.
unimelb.edu.au/
˜
kgupta.
[14] Stefan Axelsson. Intrusion Detection Systems: A Taxomomy and Survey. Technical Report
99-15, Department of Computer Engineering, Chalmers University of Technology, 2000.
[15] Anita K. Jones and Robert S. Sielken. Computer System Intrusion Detection: A Sur-
vey. Technical report, Department of Computer Science, University of Virginia, 1999.
Last accessed: Novmeber 30, 2008. http://www.cs.virginia.edu/
˜
jones/
IDS-research/Documents/jones-sielken-survey-v11.pdf.
[16] Peyman Kabiri and Ali A. Ghorbani. Research on Intrusion Detection and Response: A
Survey. International Journal of Network Security, 1(2):84–102, 2005.
[17] Joseph S. Sherif and Tommy G. Dearmond. Intrusion Detection: Systems and Models.
In Proceedings of the Eleventh IEEE International Workshops on Enabling Technologies:
Infrastructure for Collaborative Enterprises. WET ICE, pages 115–133. IEEE, 2002.
[18] Mikko T. Siponen and Harri Oinas-Kukkonen. A Review of Information Security Issues
and Respective Research Contributions. SIGMIS Database, 38(1):60–80, 2007. ACM.
BIBLIOGRAPHY 133
[19] Teresa F. Lunt. A survey of intrusion detection techniques. Computers and Security,
12(4):405–418, 1993. Elsevier Advanced Technology Publications.
[20] Emilie Lundin and Erland Jonsson. Survey of Intrusion Detection Research. Technical
Report 02-04, Department of Computer Engineering, Chalmers University of Technology,
2002.
[21] James P. Anderson. Computer Security Threat Monitoring and Surveillance, 1980.
Last accessed: Novmeber 30, 2008. http://csrc.nist.gov/publications/
history/ande80.pdf.
[22] Dorothy E. Denning. An Intrusion-Detection Model. IEEE Transactions on Software En-
gineering, 13(2):222–232, 1987. IEEE.
[23] H. S. Javitz and A. Valdes. The SRI IDES Statistical Anomaly Detector. In Proceedings of
the IEEE Symposium on Security and Privacy, pages 316–326. IEEE, 1991.
[24] S.E.Smaha. Haystack: An Intrusion Detection System. In Proceedings of the 4th Aerospace
Computer Security Applications Conference, pages 37–44. IEEE, 1988.
[25] Paul Innella. The Evolution of Intrusion Detection Systems, 2001. Last accessed: Novme-
ber 30, 2008. http://www.securityfocus.com/infocus/1514.
[26] L. T. Heberlein, G.V. Dias, K. N. Levitt, B. Mukherjee, J. Wood, and D. Wolber. A Network
Security Monitor. In Proceedings of the IEEE Symposium on Research in Security and
Privacy, pages 296–304. IEEE, 1990.
[27] Biswanath Mukherjee, L. Todd Heberlein, and Karl N. Levitt. Network Intrusion Detection.
IEEE Network, 8(3):26–41, 1994. IEEE.
[28] John McHugh. Intrusion and intrusion detection. International Journal of Information
Security, 1(1):14–35, 2001. Springer.
[29] Herv´ e Debar, Marc Dacier, and Andreas Wespi. Towards a taxonomy of intrusion-detection
systems. Computer Networks, 31(9):805–822, 1999. Elsevier.
[30] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff. A Sense of Self for Unix
Processes. In Proceedinges of the IEEE Symposium on Research in Security and Privacy,
pages 120–128. IEEE, 1996.
134 BIBLIOGRAPHY
[31] Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. Detecting Intrusions Using
System Calls: Alternative Data Models. In Proceedings of the IEEE Symposium on Security
and Privacy, pages 133–145. IEEE, 1999.
[32] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Layered Approach us-
ing Conditional Random Fields for Intrusion Detection. IEEE Transactions on Dependable
and Secure Computing, In Press.
[33] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Robust Application
Intrusion Detection using User Session Modeling. ACM Transactions on Information and
Systems Security, Under Review.
[34] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of Eigh-
teenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann,
2001.
[35] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Network Security
Framework. International Journal of Computer Science and Network Security, 6(7B):151–
157, 2006.
[36] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Conditional Random
Fields for Intrusion Detection. In Proceedings of 21st International Conference on Ad-
vanced Information Networking and Applications Workshops (AINAW), pages 203–208.
IEEE, 2007.
[37] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. User Session Modeling
for Effective Application Intrusion Detection. In Proceedings of the 23rd International
Information Security Conference (SEC 2008), pages 269–283. Lecture Notes in Computer
Science, Springer Verlag, Vol (278), 2008.
[38] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Intrusion Detection
in Networks and Applications. In Handbook of Communication Networks and Distributed
Systems. World Scientific, To Appear.
[39] Christopher Kruegel, Fredrik Valeur, and Giovanni Vigna. Intrusion Detection and Corre-
lation: Challenges and Solutions. Springer, 2005.
BIBLIOGRAPHY 135
[40] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Adjusted Probabilistic Packet
Marking for IP Traceback. In Proceedings of the Second IFIP Networking Conference,
pages 697–708. Springer, 2002.
[41] William R. Cheswick and Steven M. Bellovin. Firewalls and Internet Security. Addison-
Wesley, 1994.
[42] Rebecca Bace and Peter Mell. Intrusion Detection Systems. Gaithersburg, MD : Computer
Security Division, Information Technology Laboratory, National Institute of Standards and
Technology, 2001.
[43] Bruce Schneier. Applied Cryptography. John Wiley & Sons, 1996.
[44] Kymie Tan. Defining the Operational Limits of Sequence-Based Anomaly Detectors. PhD
thesis, The University of Melbourne, 2002.
[45] Stuart Staniford-Chen, Brian Tung, Phil Porras, Cliff Kahn, Dan Schnackenberg, Rich
Feiertag, and Maureen Stillman. The Common Intrusion Detection Framework - Data For-
mats, March 1998. Last accessed: Novmeber 30, 2008. http://tools.ietf.org/
html/draft-staniford-cidf-data-formats-00.
[46] Giovanni Vigna and Richard A. Kemmerer. NetSTAT: A Network-based Intrusion Detec-
tion Approach. In Proceedings of the 14th Annual Computer Security Applications Confer-
ence, pages 25–34. IEEE, 1998.
[47] Carol Taylor and Jim Alves-Foss. An Empirical Analysis of NATE: Network Analysis
of Anomalous Traffic Events. In Proceedings of the 2002 Workshop on New Security
Paradigms, pages 18–26. ACM, 2002.
[48] Snort, a Network based Intrusion Detection System. Last accessed: Novmeber 30, 2008.
http://www.snort.org/.
[49] Nong Ye, Xiangyang Li, Qiang Chen, Syed Masum Emran, and Mingming Xu. Probabilis-
tic Techniques for Intrusion Detection Based on Computer Audit Data. IEEE Transactions
on Systems, Man and Cybernetics, Part A: Systems and Humans, 31(4):266–274, 2001.
136 BIBLIOGRAPHY
[50] Paul Dokas, Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, and
Pang-Ning Tan. Data Mining for Network Intrusion Detection. In Proceedings of the NSF
Workshop on Next Generation Data Mining, pages 21–30, 2002.
[51] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. A Data Mining Framework for Build-
ing Intrusion Detection Model. In Proceedings of the IEEE Symposium on Security and
Privacy, pages 120–132. IEEE, 1999.
[52] Dalila Boughaci, Habiba Drias, Ahmed Bendib, Youcef Bouznit, and Belaid Benhamou.
Distributed Intrusion Detection Framework Based on Mobile Agents. In Proceedings of the
International Conference on Dependability of Computer Systems, pages 248–255. IEEE,
2006.
[53] Jai Sundar Balasubramaniyan, Jose Omar Garcia-Fernandez, David Isacoff, Eugene H.
Spafford, and Diego Zamboni. An Architecture for Intrusion Detection Using Autonomous
Agents. In Proceeding of the 14th Annual Computer Security Applications Conference,
pages 13–24. IEEE, 1998.
[54] Yu-Sung Wu, Bingrui Foo, Yongguo Mei, and Saurabh Bagchi. Collaborative Intrusion
Detection System (CIDS): A Framework for Accurate and Efficient IDS. In Proceedings of
the 19th Annual Computer Security Applications Conference, pages 234–244. IEEE, 2003.
[55] Elvis Tombini, Herv´ e Debar, Ludovic Me, and Mireille Ducasse. A Serial Combination of
Anomaly and Misuse IDSes Applied to HTTP Traffic. In Proceedings of the 20th Annual
Computer Security Applications Conference, pages 428–437. IEEE, 2004.
[56] L. Portnoy, E. Eskin, and S. Stolfo. Intrusion Detection with Unlabeled Data using Clus-
tering. In Proceedings of the ACM Workshop on Data Mining Applied to Security (DMSA).
ACM, 2001.
[57] H. Shah, J. Undercoffer, and A. Joshi. Fuzzy Clustering for Intrusion Detection. In Pro-
ceedings of the 12th IEEE International Conference on Fuzzy Systems, pages 1274–1278.
IEEE, 2003.
[58] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items
in Large Databases. In Proceedings of the International Conference on Management of
Data (SIGMOD), pages 207–216. ACM, 1993.
BIBLIOGRAPHY 137
[59] H.Mannila, H.Toivonen, and A.I.Verkamo. Discovering Frequent Episodes in Sequences. In
Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining,
pages 210–215. AAAI, 1995.
[60] Nahla Ben Amor, Salem Benferhat, and Zied Elouedi. Naive Bayes vs Decision Trees in In-
trusion Detection Systems. In Proceedings of the ACM Symposium on Applied Computing,
pages 420–424. ACM, 2004.
[61] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian Network Classifiers. Ma-
chine Learning, 29(2-3):131–163, 1997. Springer.
[62] Darren Mutz, Fredrik Valeur, Giovanni Vigna, and Christopher Kruegel. Anomalous Sys-
tem Call Detection. ACM Transactions on Information and System Security, 9(1):61–93,
2006. ACM.
[63] Christopher Kruegel, Darren Mutz, William Robertson, and Fredrik Valeur. Bayesian Event
Classification for Intrusion Detection. In Proceedings of 19th Annual Computer Security
Applications Conference, pages 14–23. IEEE, 2003.
[64] Gray Stein, Bing Chen, Annie S. Wu, and Kien A. Hua. Decision Tree Classifier for Net-
work Intrusion Detection with GA-Based Feature Selection. In Proceedings of the 43rd
Annual SouthEast Regional Conference - Volume 2, pages 136–141. ACM, 2005.
[65] Herv´ e Debar, Monique Becke, and Didier Siboni. A Neural Network Component for an
Intrusion Detection System. In Proceedings of the IEEE Symposiumon Research in Security
and Privacy, pages 240–250. IEEE, 1992.
[66] Anup K. Ghosh, James Wanken, and Frank Charron. Detecting Anomalous and Unknown
Intrusions Against Programs. In Proceedings of the 14th Annual Computer Security Appli-
cations Conference, pages 259–267. IEEE, 1998.
[67] Jake Ryan, Meng-Jang Lin, and Risto Mikkulainen. Intrusion Detection with Neural Net-
works. In Advances in Neural Information Processing Systems, pages 943–949. MIT, 1997.
[68] Zheng Zhang, Jun Li, C.N. Manikopoulos, Jay Jorgenson, and Jose Ucles. HIDE: a Hi-
erarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural
138 BIBLIOGRAPHY
Network Classification. In Proceedings of the IEEE Workshop on Information Assurance
and Security United States Military Academy, pages 85–90. IEEE, 2001.
[69] Anup K. Ghosh, Aaron Schwartzbard, and Michael Schatz. Learning Program Behavior
Profiles for Intrusion Detection. In Proceedings of the 1st USENIX Workshop on Intrusion
Detection and Network Monitoring, pages 51–62. USENIX Association, 1999.
[70] Srinivas Mukkamala, Guadalupe Janoski, and Andrew H. Sung. Intrusion Detection Using
Neural Networks and Support Vector Machines. In Proceedings of the International Joint
Conference on Neural Networks (IJCNN), pages 1702–1707. IEEE, 2002.
[71] Andrew H. Sung and Srinivas Mukkamala. Identifying Important Features for Intrusion
Detection Using Support Vector Machines and Neural Networks. In Proceedings of Sym-
posium on Applications and the Internet, pages 209–216. IEEE, 2003.
[72] Dong Seong Kim and Jong Sou Park. Network-Based Intrusion Detection with Support
Vector Machines. In Proceedings of the Information Networking, Networking Technologies
for Enhanced Internet Services International Conference, ICOIN, pages 747–756. Lecture
Notes in Computer Science, Springer Verlag, 2003.
[73] S. Jha, K. Tan, and R.A. Maxion. Markov chains, Classifiers, and Intrusion Detection. In
Proceedings of the 14th IEEE Computer Security Foundations Workshop, pages 206–219.
IEEE, 2001.
[74] Nong Ye, Yebin Zhang, and Connie M. Borror. Robustness of the Markov-Chain Model for
Cyber-Attack Detection. IEEE Transactions on Reliability, 53(1):116–123, 2004.
[75] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[76] Svetlana Radosavac. Detection and Classification of Network Intrusions using Hidden
Markov Models. Master’s thesis, University of Maryland, 2003.
[77] Wei Wang, Xiao-Hong Guan, and Xiang-Liang Zhang. Modeling Program behaviors by
Hidden Markov Models for Intrusion Detection. In Proceedings of International Confer-
ence on Machine Learning and Cybernetics, pages 2830–2835. IEEE, 2004.
BIBLIOGRAPHY 139
[78] Ye Du, Huiqiang Wang, and Yonggang Pang. A Hidden Markov Models-Based Anomaly
Intrusion Detection Method. In Proceeedings of the Fifth World Congress on Intelligent
Control and Automation (WCICA), pages 4348–4351. IEEE, 2004.
[79] Autonomous Agents for Intrusion Detection. Last accessed: Novmeber 30, 2008.
http://www.cerias.purdue.edu/about/history/coast/projects/
aafid.php.
[80] Probabilistic Agent based Intrusion Detection. Last accessed: Novmeber 30, 2008. http:
//www.cse.sc.edu/research/isl/agentIDS.shtml.
[81] Wenke Lee and Salvatore J. Stolfo. Data Mining Approaches for Intrusion Detection. In
Proceedings of the 7th USENIX Security Symposium, pages 79–94, 1998.
[82] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. Mining Audit Data to build Intrusion
Detection Models. In Proceedings of the 4th International Conference on Knowledge Dis-
covery and Data Mining, pages 66–72. AAAI, 1998.
[83] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. Mining in a Data-flow Environment:
Experience in Network Intrusion Detection. In Proceedings of the Fifth International Con-
ference on Knowledge Discovery and Data Mining, pages 114–124. ACM, 1999.
[84] Wenke Lee and Salvatore J. Stolfo. A Framework for Constructing Features and Models
for Intrusion Detection Systems. ACM Transactions on Information and System Security
(TISSEC), 3(4):227–261, 2000. ACM.
[85] Yu Gu, Andrew McCallum, and Don Towsley. Detecting Anomalies in Network Traffic
Using Maximum Entropy Estimation. In Proceedings of the Internet Measurement Confer-
ence, pages 345–350. USENIX Association, 2005.
[86] Yi Hu and Brajendra Panda. A Data Mining Approach for Database Intrusion Detection. In
Proceedings of the ACM symposium on Applied Computing, pages 711–716. ACM, 2004.
[87] Yong Zhong, Zhen Zhu, and Xiao-Lin Qin. A Clustering Method Based on Data Queries
and Its Application in Database Intrusion Detection. In Proceedings of the Fourth Interna-
tional Conference on Machine Learning and Cybernetics, Vol (4), pages 2096–2101. IEEE,
2005.
140 BIBLIOGRAPHY
[88] Yi Hu and Brajendra Panda. Identification of Malicious Transactions in Database Systems.
In Proceedings of the 7th International Database Engineering and Applications Sympo-
sium, pages 329–335. IEEE, 2003.
[89] Elisa Bertino, Ashish Kamra, Evimaria Terzi, and Athena Vakali. Intrusion Detection in
RBAC-Administered Databases. In Proceedings of the 21st Annual Computer Security
Applications Conference, pages 170–182. IEEE, 2005.
[90] Wai Lup Low, Joseph Lee, and Peter Teoh. DIDAFIT: Detecting Intrusions in Databases
Through Fingerprinting Transactions. In Proceedings of the 4th International Conference
on Enterprise Information Systems, pages 121–128, 2002.
[91] Sin Yeung Lee, Wai Lup Low, and Pei Yuen Wong. Learning Fingerprints for a Database
Intrusion Detection System. In Proceedings of the 7th European Symposium on Research
in Computer Security, Vol (2502), pages 264–279. Lecture Notes in Computer Science,
Springer Verlag, 2002.
[92] Yong Zhong and Xiao-Lin-Qin. Research on Algorithm of User Query Frequent Itemsets
Mining. In Proceedings of Third International Conference on Machine Learning and Cy-
bernetics, Vol (3), pages 1671–1676. IEEE, 2004.
[93] Victor C.S. Lee, John A. Stankovic, and Sang H. Son. Intrusion Detection in Real-time
Database Systems Via Time Signatures. In Proceedings of the Sixth IEEE Real Time Tech-
nology and Applications Symposium, pages 124–133. IEEE, 2000.
[94] Christina Yip Chung, Michael Gertz, and Karl Levitt. DEMIDS: A Misuse Detection Sys-
tem for Database Systems. In Proceeding of the 3rd International IFIP TC-11 WG11.5
Working Conference on Integrity and Internal Control in Information Systems, pages 159–
178. Kluwer, 1999.
[95] Shubha U. Nabar, Bhaskara Marthi, KrishnaramKenthapadi, Nina Mishra, and Rajeev Mot-
wani. Towards Robustness in Query Auditing. In Proceedings of the 32nd International
Conference on Very large Data Bases, pages 151–162. ACM, 2006.
[96] Rakesh Agarwal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. Hippocratic
Databases. In Proceedings of the 28th International Conference on Very Large Databases,
pages 143–154. Morgan Kaufmann, 2002.
BIBLIOGRAPHY 141
[97] Rakesh Agrawal, Roberto J. Bayardo Jr., Christos Faloutsos, Jerry Kiernan, Ralf Rantzau,
and Ramakrishnan Srikant. Auditing Compliance with a Hippocratic Database. In Pro-
ceedings of the 30th International Conference on Very Large Databases, pages 516–527.
Morgan Kaufmann, 2004.
[98] Kristen LeFevre, Rakesh Agrawal, Vuk Ercegovac, Raghu Ramakrishnan, Yirong Xu, and
David J. DeWitt. Limiting Disclosure in Hippocratic Databases. In Proceedings of the 30th
International Conference on Very Large Databases, pages 108–119. Morgan Kaufmann,
2004.
[99] Lieven Desmet, Frank Piessens, Wouter Joosen, and Pierre Verbaeten. Bridging the Gap
Between Web Application Firewalls and Web Applications. In Proceedings of the Fourth
ACM workshop on Formal methods in security, FMSE, pages 67–77. ACM, 2006.
[100] Holger Dreger, Anja Feldmann, Michael Mai, Vern Paxson, and Robin Sommer. Dynamic
Application-Layer Protocol Analysis for Network Intrusion Detection. In Proceedings of
the 15th Usenix Security Symposium, pages 257–272. USENIX Association, 2006.
[101] Marco Cova, Davide Balzarotti, Viktoria Felmetsger, and Giovanni Vigna. Swaddler: An
Approach for the Anomaly-Based Detection of State Violations in Web Applications. In
Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detec-
tion (RAID), pages 63–86. Springer, 2007.
[102] Shai Rubin, Somesh Jha, and Barton P. Miller. Protomatching Network Traffic for High
Throughput Network Intrusion Detection. In Proceedings of the Proceedings of the 13th
ACM conference on Computer and Communications Security, pages 47–58. ACM, 2006.
[103] Bruce D. Caulkins, Joohan Lee, and Morgan Wang. Packet- vs. Session-Based Modeling
for Intrusion Detection Systems. In Proceedings of the International Conference on Infor-
mation Technology: Coding and Computing (ITCC’05), pages 116–121. IEEE, 2005.
[104] Magnus Almgren and Ulf Lindqvist. Application-Integrated Data Collection for Security
Monitoring. In Proceedings of the 4th International Symposium on Recent Advances in
Intrusion Detection, pages 22–36. Lecture Notes in Computer Science, Springer Verlag,
Vol (2212), 2001.
142 BIBLIOGRAPHY
[105] Fredrik Valeur, Darren Mutz, and Giovanni Vigna. A Learning-Based Approach to the De-
tection of SQL Attacks. In Proceedings of Second International Conference on Detection of
Intrusions and Malware, and Vulnerability Assessment (DIMVA), pages 123–140. Springer,
2005.
[106] Christopher Kruegel and Giovanni Vigna. Anomaly Detection of Web-Based Attacks.
In Proceedings of the 10th ACM Conference on Computer and Communications Security
(CCS), pages 251–261. ACM, 2003.
[107] Shu Wenhui and Tan T H Daniel. A Novel Intrusion Detection System Model for Securing
Web-based Database Systems. In Proceedings of the 25th Annual International Computer
Software and Applications Conference (COMPSAC), pages 249–254. IEEE, 2001.
[108] Adwait Ratnaparkhi. A Maximum Entropy Model for Part-of-Speech Tagging. In Pro-
ceedings of the Conference on Empirical Methods in Natural Language Processing, pages
133–142. Association for Computational Linguistics, 1996.
[109] Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolu-
tion. PhD thesis, University of Pennsylvania, 1998.
[110] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A Maximum Entropy
Approach to Natural Language Processing. Computational Linguistics, 22(1):39–71, 1996.
[111] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum Entropy Markov
Models for Information Extraction and Segmentation. In Proceedings of the 17th Interna-
tional Conference on Machine Learning, pages 591–598. Morgan Kaufmann, 2000.
[112] Dan Klein and Christopher D. Manning. Conditional Structure versus Conditional Esti-
mation in NLP Models. In Proceedings of the ACL-02 Conference on Empirical methods
in Natural Language Processing Vol (10), pages 9–16. Association for Computational Lin-
guistics, 2002.
[113] Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for
Relational Learning. In Introduction to Statistical Relational Learning. MIT, 2006.
[114] L. Ertoz, A. Lazarevic, E. Eilertson, Pang-Ning Tan, Paul Dokas, V. Kumar, and Jaideep
Srivastava. Protecting Against Cyber Threats in Networked Information Systems. In Pro-
BIBLIOGRAPHY 143
ceedings of SPIE; Battlespace Digitization and Network Centric Systems III, pages 51–56,
2003.
[115] Shon Harris. CISSP All-in-One Exam Guide. McGraw-Hill Osborne Media, 2007.
[116] Saso Dzeroski and Bernard Zenko. Is Combining Classifiers Better than Selecting the Best
One. In Proceedings of the Nineteenth International Conference on Machine Learning,
pages 123–129. Morgan Kaufmann, 2002.
[117] Chuanyi Ji and Sheng Ma. Combinations of Weak Classifiers. IEEE Transactions on Neural
Networks, 8(1):32–42, 1997.
[118] Andrew Viterbi. Error Bounds for Convolutional Codes and an Asymptotically Optimum
Decoding Algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[119] G.D.Forney. The Viterbi Algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[120] Taku Kudo. CRF++: Yet another CRF toolkit. Last accessed: Novmeber 30, 2008. http:
//crfpp.sourceforge.net/.
[121] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Tech-
niques. Morgan Kaufmann, 2005.
[122] Maheshkumar Sabhnani and Gursel Serpen. Application of Machine Learning Algorithms
to KDDIntrusion Detection Dataset within Misuse Detection Context. In Proceedings of the
International Conference on Machine Learning; Models, Technologies and Applications,
MLMTA, pages 209–215. CSREA, 2003.
[123] Yacine Bouzida and Sylvain Gombault. Eigenconnections to Intrusion Detection. In Secu-
rity and Protection in Information Processing Systems, pages 241–258. Springer, 2004.
[124] Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. In Pro-
ceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence, pages
403–410. Morgan Kaufmann, 2003.
[125] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing Features of Random
Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393,
1997.
144 BIBLIOGRAPHY
[126] Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit, 2002.
Last accessed: Novmeber 30, 2008. http://mallet.cs.umass.edu.
[127] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[128] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics, 1(6):80–83,
1945.
[129] W.W. Eckerson. Three Tier Client/Server Architecture: Achieving Scalability, Perfor-
mance, and Efficiency in Client Server Applications. Open Information Systems, 10(1),
1995.
[130] Computer Immune Systems- Data Sets and Software. Last accessed: Novmeber 30, 2008.
http://www.cs.unm.edu/
˜
immsec/systemcalls.htm.
[131] osCommerce, Open Source Online Shop E-Commerce Solutions. Last accessed: Novmeber
30, 2008. http://www.oscommerce.com/.
[132] Zen Cart, the art of e-commerce. Last accessed: Novmeber 30, 2008. http://www.
zencart.com/.
[133] Eric Newcomer and Greg Lomow. Understanding SOA with Web Services. Addison-Wesley
Professional, 2004.
[134] Thomas G. Dietterich. Machine Learning for Sequential Data: A Review. In Proceed-
ings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pat-
tern Recognition, pages 15–30. Lecture Notes in Computer Science, Springer Verlag, No.
(2396), 2002.
[135] Hanna Wallach. Conditional Random Fields: An Introduction. Technical Report MS-
CIS-04-21, Department of Computer and Information Science, University of Pennsylvania,
2004.
[136] Hanna Wallach. Efficient Training of Conditional Random Fields. Master’s thesis, Division
of Informatics, University of Edinburgh, 2002.
[137] Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic Conditional
Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence
BIBLIOGRAPHY 145
Data. In Proceedings of the 21st International Conference on Machine Learning, pages
99–106. ACM, 2004.
[138] Yang Wang, Kia-Fock Loe, and Jian-Kang Wu. A Dynamic Conditional Random Field
Model for Foreground and Shadow Segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 28(2):279–289, 2006.
[139] Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. In Pro-
ceedings of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology, pages 134–141. Association
for Computational Linguistics, 2003.
[140] Tak-Lam Wong and Wai Lam. Semi Supervised Learning for Sequence Labeling Using
Conditional Random Fields. In Proceedings of the 4th International Conference on Ma-
chine Learning and Cybernetics, pages 2832–2837. IEEE, 2005.
[141] Ariadna Quattoni, Michael Collins, and Trevor Darrel. Conditional Random Fields for
Object Recognition. In Proceedings of Advances in Neural Information Processing Systems,
pages 1097–1104. MIT, 2004.
[142] John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel Conditional Random Fields: Representa-
tion and Clique Selection. In Proceedings of the 21st International Conference on Machine
Learning, pages 64–71. ACM, 2004.
[143] Aron Culotta, David Kulp, and Andrew McCallum. Gene Prediction with Conditional Ran-
dom Fields. Technical Report UM-CS-2005-028, University of Massachusetts, Amherst,
2005.
[144] Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for
Information Extraction. In Advances in Neural Information Processing Systems, pages
1185–1192, 2004.
[145] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[146] Kevin Murphy. An Introduction to Graphical Models. Technical report, Intel Research,
2001.
146 BIBLIOGRAPHY
[147] Kamal Nigam, John Lafferty, and Andrew McCallum. Using Maximum Entropy for Text
Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages
61–67, 1999.
[148] Edwin Thompson Jaynes. Information Theory and Statistical Mechanics. The Physical
Review, 106(4):620–630, 1957.
[149] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A Maximization Tech-
nique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains.
The Annals of Mathematical Statistics, 41(1):164–171, 1970.
[150] Arthur Pentland Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood
from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, B,
39(1):1–38, 1977.
[151] Serafim Batzoglou. CS 262 Computational Genomics Winter 2005. Last accessed: Novme-
ber 30, 2008. http://robotics.stanford.edu/
˜
serafim/CS262_2005/
index.html.
[152] Roman Klinger and Katrin Tomanek. Classical Probabilistic Models and Conditional Ran-
dom Fields. Technical Report TR07-2-013, Technical University of Dortmund, 2007.
[153] Sunita Sarawagi. CRF Project. Last accessed: Novmeber 30, 2008. http://crf.
sourceforge.net/.
[154] Kevin P Murphy. Conditional random fields (chains, trees and general graphs; includes BP
code). Last accessed: Novmeber 30, 2008. http://www.cs.ubc.ca/
˜
murphyk/.
Appendices
147
Appendix A
An Introduction to Conditional Random
Fields
Conditional random fields have been effectively used for a variety of tasks including gene predic-
tion, determining secondary structures of protein sequences, part of speech tagging, text segmentation,
shallow parsing, named entity recognition, object recognition, intrusion detection and many others.
Conditional random fields exploit the sequence structure in the observations without making unwar-
ranted assumptions, which results in better classification. We describe the theory behind conditional
random fields in detail; give their properties along with the assumptions made which motivate their
use in a particular problem including their advantages and disadvantages with respect to previously
known approaches which can be used for similar tasks.
A.1 Introduction
The need to correctly label a sequence of observations is of vital importance in a variety of do-
mains including computational linguistics, computational biology and real-time intrusion detec-
tion. Computational linguistics involve various tasks such as text segmentation, determining the
part of speech tags for a sentence, information extraction, named entity recognition and others.
Similarly, computational biology includes various tasks such as biological sequence alignment,
determining secondary structure of protein sequences, gene prediction and many more. The need
to label sequence of observations also arises in intrusion detection tasks to correctly identify ma-
licious events.
The problem of sequence labeling is defined as follows: given a sequence of observations
x
1
, x
2
, x
2
, ..., x
n
, label every observation as y
1
, y
2
, y
2
, ..., y
n
from a finite set of labels Y [134],
[135]. We shall, thus, focus on a sequence of observations and discuss various methods which
have been proposed to label them. In particular, we shall emphasize on conditional random fields,
[34], [113], [124], [136], highlighting their advantages over other methods and list a number of
149
150 An Introduction to Conditional Random Fields
applications where they have been successfully applied [32], [33], [36], [37], [137], [138], [139],
[140], [141], [142], [143], [144].
The rest of the chapter is organized as follows; in Section A.2, we give a brief background
on probability distributions and describe the notations used. In Section A.3, we discuss various
graphical methods and highlight drawbacks in previously introduced methods such as the maxi-
mum entropy Markov models, hidden Markov models, naive Bayes classifiers and others which
motivate the use of conditional random fields. We then describe conditional random fields, in
details in Section A.4, highlighting situations where conditional random fields are expected to
perform better than their predecessors. We emphasize on feature functions, training and testing
and the complexity involved in using conditional random fields. We also give a brief description
of the tools which implement conditional random fields. In Section A.5, we compare the directed
and the undirected graphical models. Finally, we conclude the chapter in Section A.6.
A.2 Background
Many real life problems in language processing, computational biology and real-time intrusion
detection involve sequence labeling, time series prediction and sequence classification. In order to
perform such tasks, probabilistic approaches have gained wide acceptance which involve estimat-
ing either the joint distribution or the conditional distribution which are defined as follows.
• Joint Probability Distribution- Given N random variables, the joint distribution of the given
random variables is the distribution, D, of all the variables occurring together. When there
are only two random variables, X and Y, the joint distribution is represented as: P(X =
x, Y = y), ∀ x, y values.
• Conditional Probability Distribution- Given N random variables, the conditional distribu-
tion is a distribution, D, of a subset of variables given the occurrences of the remaining
random variables in the set N. For two random variables, X and Y, the conditional distri-
bution of Y given X is represented as: P(Y = y|X = x), ∀ x, y values.
The observations in most sequence labeling tasks are known and the objective is to assign
the correct label given the observations. The aim is, thus, to predict the label sequence which
maximizes the probability of the class labels given the observations. However, many machine
learning approaches first estimate the joint distribution of the observations and the labels and then
A.2 Background 151
determine the required conditional distribution using the Bayes rule. Once the complete joint
distribution is available, calculating the required marginal and conditional probabilities is an easy
task. However, the major issue in a joint distribution is estimating the required joint distribution
itself. To learn the joint distribution from the training data is difficult due to the following reasons:
1. The number of observations required to determine the complete joint distribution is ex-
ponential in the number of variables. For M variables each taking K possible labels, this
number is O(K
M
). Assuming complete independence among variables significantly re-
duces this to O(K ∗ M), however, making such strong independence assumptions affects
the accuracy of the model.
2. The amount of training data is limited and hence, it is difficult to estimate the accurate joint
distribution. The joint distribution learnt from a limited data set can result in over-fitting
and mirrors the training data. As a result, the learnt model does not generalize to new
observations.
Estimating the joint distribution without making any independence assumptions is, thus, fea-
sible only in situations when the number of random variables is small and large amount of data
samples are available for training. On the contrary, assuming complete independence among the
random variables though makes the model tractable but severely affects the modeling capability.
Hence, the objective is to build models which optimally balance the dual constraints; making the
model tractable with the help of independence assumptions without affecting the modeling power
of the system and improving the generalization capability of the model on unseen observations
given the limited number of training data samples. Domain knowledge is typically used to deter-
mine such dependence and independence relations as in case of the Bayesian networks which are
described later.
Estimating the conditional distribution directly from the training observations eliminates the
need of estimating the joint distribution and does not necessitate any unwarranted independence
assumptions among the random variables.
For estimating either the joint or the conditional distribution, a diagrammatic representation
of the random variables presenting their dependence relations is advantageous and graphical mod-
els have become an important tool for various machine learning tasks as presented in [145] and
[146]. As mentioned in [145], various complicated problems can be formulated and solved using
purely algebraic manipulation, however, the use of graphical models augments the analysis us-
152 An Introduction to Conditional Random Fields
ing diagrammatic representations of probability distributions which not only help to visualize the
structure of the probabilistic model but also gives insight to the properties of the model including
the conditional independence which significantly improves the probabilistic analysis and helps to
reduce the need of using larger data sets.
Notation
We use the following notations for the rest of the chapter.

−→
x = x
1
x
2
...x
t
is the observed vector. Let there be m alphabets for each x
i
.
• y is estimated class. We use the term “class” interchangeably with the term “label”. Let
there be k possible classes. For sequence labeling y is a vector,
−→
y = y
1
, y
2
, ..., y
t
, whose
length is equal to that of the observation
−→
x .
Note that, the graphical methods can be applied to label a single observation with multiple
features as well as to label a sequence of observations where each observation is itself represented
by multiple features. Methods which generally deal with a single observation are naive Bayes
classifier and Maxent classifier. Similarly, methods which deal with a sequence of observations
are hidden Markov models, maximum entropy Markov models, Markov random fields and condi-
tional random fields. Very often, when a sequence of observations is considered, the observation
represents the value of a single feature observed overtime, even though more than one feature can
be used to represent an observation sequence.
A.3 Graphical Models
Graphical models are often used to model the probability distribution over a set of random vari-
ables by factorizing complex distributions, with a large number of random variables, into a product
of simpler distributions, each with a small set of variables. A graph, G, is a set of vertices, V, con-
nected by edges, E, where a vertex represents a single or a group of random variable(s) and the
edges between the vertices represents the relationship between these random variables.
Based upon the type of edges used in the graphs, the graphical models can be broadly classified
as Directed or Undirected.
A.3 Graphical Models 153
A.3.1 Directed Graphical Models
Def.: A directed graphical model is a graph G = (V, E) where V = {V
1
, V
2
, ..., V
N
} are the
vertices and E = {(V
i
, V
j
), i = j} are the directed edges from vertex V
i
to vertex V
j
. A vertex V
i
can be represented by the random variable representation X
i
.
A directed graphical model incorporates the parent child relationship via the direction of an
edge, i.e., an edge pointing from the vertex V
i
to vertex V
j
implicitly describes the parent child
relationship such that X
i
is the parent of X
j
. The joint distribution over a set of random variables
can be factorized into a product of local conditional distributions in the directed graphical models.
Directed graphical models are also known as the Bayesian Networks [61].
An important restriction for the directed graphs is the absence of closed loops, i.e., there
should be no directed path starting from and ending to the same vertex. Such graphs are called
as the Directed Acyclic Graphs (DAG). The directed graphical models factorize according to the
probability distribution given in equation A.1, where x
i
represents a node and x
π
i
represents its
parents.
p(x
1
, x
2
, ..., x
n
) =
N

i=1
p(x
i
|x
Π
i
) (A.1)
Figure A.1 represents a fully connected directed graphical model for three random variables.
x
1
x
3
x
2
Figure A.1: Fully Connected Graphical Model
The graphical model represented in Figure A.1 can be factorized as:
p(x
1
, x
2
, x
3
) = p(x
1
) ∗ p(x
2
|x
1
) ∗ p(x
3
|x
1
, x
2
) (A.2)
Thus, for a fully connected graph with M variables each taking K possible values the total number
154 An Introduction to Conditional Random Fields
of parameters that must be specified for an arbitrary joint distribution is equal to K
M
−1 which
grows exponentially with M. This is not feasible for most real world applications which often
involve a large number of random variables with complex dependencies among them. The com-
plexity can be drastically reduced by assuming the random variables to be completely independent.
Figure A.2 represents a graphical model where the random variables are assumed to be completely
independent.
x
1
x
3
x
2
Figure A.2: Fully Disconnected Graphical Model
The graphical model represented in Figure A.2 can be factorized as:
p(x
1
, x
2
, x
3
) = p(x
1
) ∗ p(x
2
) ∗ p(x
3
) (A.3)
Assuming the variables to be completely independent significantly reduces the number of required
parameters to M(K −1), which is manageable.
Conditional independence properties can be used to simplify the structure of the graph. In
case of directed graphs, the conditional independence properties can be tested by applying the d-
separation test. This involves testing whether or not the path between two nodes is blocked. More
details on d-separation can be found in [145]. We shall now describe some of the well known
directed graphical models.
Naive Bayes Classifier
Naive Bayes classifier is a well known directed graphical model which is frequently used to de-
termine the class label for a given observation. The naive Bayes classifier is represented in Figure
A.3.
A.3 Graphical Models 155
x
1
x
2
x
t
y
Figure A.3: Naive Bayes Classifier
The objective is to find the label, y, which maximizes the probability of the given observation,
i.e., find:
argmax
y
p(y|
−→
x ) (A.4)
The Bayes Rule can be used to find p(y|
−→
x ):
p(y|
−→
x ) =
p(y) ∗ p(
−→
x |y)
p(
−→
x )
Hence, Equation A.4 can be rewritten as:
argmax
y
p(y|
−→
x ) = argmax
y
_
p(y) ∗ p(
−→
x |y)
p(
−→
x )
_
= argmax
y
p(y) ∗ p(
−→
x |y)
= argmax
y
p(y) ∗ p(x
1
, x
2
, ...x
t
|y)
Making naive Bayes assumption, “Every feature x
i
is conditionally independent of every other
feature”, the resulting naive Bayes classifier is given by:
argmax
y
p(y|
−→
x ) = argmax
y
p(y) ∗ p(x
1
|y) ∗ p(x
2
|y), ..., ∗p(x
t
|y)
= argmax
y
_
p(y) ∗
t

i=1
p(x
i
|y)
_
(A.5)
The classifier presented in Equation A.5 considers the features in the observation to be inde-
pendent and discard any correlation which may exist between them. This makes the model simple
156 An Introduction to Conditional Random Fields
but, it affects classification accuracy of the resulting classifier.
Maxent Classifier
Recalling that naive Bayes classifier is a directed graphical model which is often used to assign a
single label to an observation. We also observed that, to make the model simple, the naive Bayes
classifier assumes different observation features to be completely independent which affects clas-
sification. Similar to the naive Bayes classifier, the Maxent classifier (or logistic regression) can
be used to classify an observation which may be represented by multiple features. Contrary to the
naive Bayes assumption, the Maxent classifier does not assume independence among the observa-
tion features thereby resulting in better classification accuracy. Maxent classifier is represented in
Figure A.4.
x
1
x
2
x
t
y
Figure A.4: Maxent Classifier
Maxent classifier is motivated by the assumption that the log probability, log p(y|x), for each
class is a linear function of the observation x and a normalization constant. This results in a
conditional distribution which is represented in Equation A.6.
p(y|x) =
1
Z(x)
∗ exp
K

k=1
λ
k
∗ f
k
(y, x) (A.6)
where
Z(x) =

y
exp
K

k=1
λ
k
∗ f
k
(y, x)
is the normalization constant, λ
k
is the bias weight and f
k
(y, x) is a feature function defined on an
observation and label pair for every feature k. The ability of this model to capture the correlation
A.3 Graphical Models 157
between observation features depend upon the feature functions, f
k
(y, x), and the weights, λ
k
,
learnt during training [147].
Such a conditional probability model is based on the Principle of Maximum Entropy [148]
which states that; when for a probability distribution only incomplete information is available, the
only unbiased assumption that can be made is a distribution which is as uniform as possible given
the available information. This means that the model should follow all the constraints imposed on
it (which are defined by the feature functions extracted from the training data), but beyond these
constraints, the model should be as uniform as possible, i.e., one which does not make any further
assumptions. More details on Maximum Entropy models can be obtained from [110].
Generative and Discriminative Graphical Models
Def.: A graphical model which models the joint probability of the observations and the labels,
p(y, x), is known as a generative model. The naive Bayes classifier discussed earlier is an exam-
ple of a generative graphical model. Other well known generative models are the hidden Markov
models, Bayesian networks, Markov random fields. The prime disadvantage of generative models
is that they need to enumerate all possible observation sequences. However, in many real world
situations, the amount of data available for training is limited and hence, independence assump-
tions are made which results in approximate models.
Def.: A graphical model which models the conditional distribution of the labels given the ob-
servations, p(y|x), is known as a discriminative model. Maxent classifier (logistic regression)
discussed earlier is a typical discriminative model. Other well known methods such as the support
vector machines, maximum entropy Markov models, conditional random fields, neural networks
and nearest neighbor are examples of discriminative models.
Hidden Markov Model
The naive Bayes classifier is generally used to predict only a single class label. This model can
be extended to estimate a sequence of labels,
−→
y , for an observed sequence,
−→
x , of length t. As
mentioned earlier, very often the observed sequence represents the values of a single feature taken
158 An Introduction to Conditional Random Fields
over a period of time. The hidden Markov model is well known example of a directed and a
generative graphical model. They are doubly stochastic models; the state sequence is generated
by a stochastic process from which the output sequence is then generated [75]. Hence, given an
output sequence (observation), one cannot uniquely determine the labeling (i.e., the sequence of
states which generated the observation) since there may exist more than one sequence of states
which generated the particular observation.
We shall concentrate only on the first order hidden Markov model which assumes that a state
at time t depends only on the state at time t −1. Further, the observation at time t depends only on
the state at time t. Since we consider only a single feature which is observed overtime, this results
in a chain like structure as represented in Figure A.5.
y
1
x
1
x
2
x
t
y
t
y
2
y
t−1
x
t−1
Figure A.5: Hidden Markov Model
The hidden Markov model represented in Figure A.5 can be factorized as:
p(
−→
y ,
−→
x ) =
t

i=1
p(y
i
|y
i−1
) ∗ p(x
i
|y
i
) (A.7)
The best label sequence,
−→
y , is one which maximizes this joint distribution, p(
−→
y ,
−→
x ). The draw-
back of hidden Markov model is that observation at time t, i.e., x
t
, is assumed to be independent
of observation at any other time which can affect accuracy.
For an hidden Markov model, we often assume the number of states to be equal to the number
of class labels. Hence, the set of states is represented by Q = (q
1
, q
2
, ..., q
k
). Further, let the
transition probability from state q
t−1
= i to state q
t
= j be represented by a
ij
such that a
i1
+
a
i2
+ ... + a
ik
= 1, ∀ states i = 1...k. The starting probabilities, a
0i
, are initialized for each state i
such that a
01
+ a
02
+ ... + a
0k
= 1. Also, each state has a probability of emitting an observation,
e
i
(b) = p(x
i
= b|q
i
= k) such that e
i
(b
1
) + ... +e
i
(b
m
) = 1, ∀ states i = 1...k.
A.3 Graphical Models 159
Three main questions are considered when using an hidden Markov model.
1. Evaluation - Given an hidden Markov model M and an observation sequence
−→
x , what is
the probability of the observation sequence given the model, i.e., find
p(
−→
x |M)
2. Decoding - Given an hidden Markov model M and an observation sequence
−→
x , what is
the sequence of states that maximizes the joint probability of the observation sequence and
the state sequence, i.e. find
argmax
−→
q
p(
−→
x ,
−→
q |M)
3. Learning - Given an hidden Markov model M with unspecified transition and emission
probabilities, and an observation sequence
−→
x , what are the parameters (transition and
emission probabilities) that maximize the probability of the observation sequence, i.e.,
find
argmax
θ
p(
−→
x |θ)
Evaluation
The objective is to find p(
−→
x |M), i.e., the probability of the observation sequence given the model.
The naive approach is to perform summation over all possible ways of generating the observation
sequence,
−→
x , i.e.,
p(
−→
x ) =

−→
q
p(
−→
x ,
−→
q )
=

−→
q
p(
−→
x |
−→
q ) ∗ p(
−→
q )
Summing over exponential number of paths is not desirable. Dynamic programming can be used
to perform this efficiently. First, define the forward probability as follows:
f
k
(i) = p(x
1
, ..., x
i
, q
i
= k)
=

q
1
...q
i−1
p(x
1
, ..., x
i−1
, q
1
, ..., q
i−1
, q
i
= k) ∗ e
k
(x
i
)
160 An Introduction to Conditional Random Fields
=

q
1
...q
i−1
p(x
1
, ..., x
i−1
, q
1
, ..., q
i−1
) ∗ p(q
i
= k|q
i−1
) ∗ e
k
(x
i
)
=

l

q
1
...q
i−2
p(x
1
, ..., x
i−1
, q
1
, ..., q
i−1
= l) ∗ p(q
i
= k|q
i−1
) ∗ e
k
(x
i
)
=

l

q
1
...q
i−2
p(x
1
, ..., x
i−1
, q
1
, ..., q
i−2
, q
i−1
= l) ∗ a
lk
∗ e
k
(x
i
)
=

l
p(x
1
, ...x
i−1
, q
i−1
= l) ∗ a
lk
∗ e
k
(x
i
)
= e
k
(x
i
) ∗

l
f
l
(i −1) ∗ a
lk
Using this idea, the forward algorithm [75] can be used to perform this efficiently with a time
complexity of O(K
2
T) and a space complexity of O(KT) where K is the possible number of
states and T is the length of the observation sequence. The algorithm is described next.
Initialization:
f
0
(0) = 1
f
k
(0) = 0, ∀ k > 0
Iteration:
f
k
(i) = e
k
(x
i
) ∗ ∑
l
f
l
(i −1) ∗ a
lk
Termination:
p(
−→
x ) = ∑
k
f
k
(t) ∗ a
k0
, where a
k0
is the probability of terminating in state k.
Similar to the forward algorithm described above, the backward algorithm [75] employs dy-
namic programming and can be used in conjunction with the forward algorithm to determine the
most likely state at position i given the observation sequence
−→
x . First, define the backward prob-
ability as follows:
b
k
(i) = p(x
i+1
, ..., x
t
|q
i
= k)
=

q
i+1
...q
t
p(x
i+1
, ..., x
t
, q
i+1
, ..., q
t
|q
i
= k)
=

l

q
i+1
...q
t
p(x
i+1
, ..., x
t
, q
i+1
= l, q
i+2
..., q
t
|q
i
= k)
A.3 Graphical Models 161
=

l
e
l
(x
i+1
) ∗ a
kl ∑
q
i+1
...q
t
p(x
i+2
, ..., x
t
, q
i+2
, ..., q
t
|q
t+1
= l)
=

l
e
l
(x
i+1
) ∗ a
kl
∗ b
l
(i + 1)
Using the above concept, the backward algorithm described next.
Initialization:
b
k
(t) = a
k0
, ∀ k
Iteration:
b
k
(i) = ∑
l
e
l
(x
i+1
) ∗ a
kl
∗ b
l
(i + 1)
Termination:
p(
−→
x ) = ∑
l
a
0l
∗ e
l
(x
1
)b
l
(1)
The backward algorithm also has a time complexity of O(K
2
T) and a space complexity of
O(KT) where K is the possible number of states and T is the length of the observation sequence.
The most likely state at position i, given the observation sequence
−→
x , can now be calculated using
the Equation A.8.
p(q
i
= k|
−→
x ) =
f
k
(i) ∗ b
k
(i)
p(
−→
x )
(A.8)
This is also known as posterior decoding. The most likely state can be calculated at each position
using Equation A.8. However, this does not represent the most likely sequence of states given the
entire observation sequence of length t.
Decoding
The objective is to find:
−→
q∗ = argmax
−→
q
p(
−→
x ,
−→
q |M)
Consider the given observation sequence x
1
, x
2
, ...x
t
as shown in Figure A.6.
To calculate the sequence of states which maximizes the joint probability of the observation se-
quence and the state sequence, dynamic programming can be used to perform the computation
162 An Introduction to Conditional Random Fields
k k k k k
2 2 2 2 2
1 1 1 1 1
x
3
x
1
x
2
x
t
x
t−1
Figure A.6: Decoding in an Hidden Markov Model
efficiently. Let V
k
(i) is the probability of most likely sequence of states ending in state q
i
= k
V
k
(i) = max
q
1
...q
i−1
p(x
1
, ...x
i−1
, q
1
, ...q
i−1
, x
i
, q
i
= k) (A.9)
Given V
k
(i) for all states k, and for a fixed position i, calculate V
l
(i + 1) as:
V
l
(i + 1) = max
q
1
...q
i
p(x
1
, ...x
i
, q
1
, ...q
i
, x
i+1
, q
i+1
= l)
= max
q
1
...q
i
p(x
i+1
, q
i+1
= l|x
1
, ...x
i
, q
1
, ...q
i
) ∗ p(x
1
, ...x
i
, q
1
, ...q
i
)
= max
q
1
...q
i
p(x
i+1
, q
i+1
= l|q
i
) ∗ p(x
1
, ...x
i−1
, q
1
, ...q
i−1
, x
i
, q
i
)
= max
k
[p(x
i+1
, q
i+1
= l|q
i
= k)∗
max
q
1
...q
i−1
p(x
1
, ...x
i−1
, q
1
, ...q
i−1
, x
i
, q
i
= k)]
= max
k
[p(x
i+1
|q
i+1
= l) ∗ p(q
i+1
= l|q
i
= k) ∗ V
k
(i)]
= e
l
(x
i+1
) ∗ max
k
[a
kl
∗ V
k
(i)] (A.10)
The Viterbi algorithm [118], [119] implements this idea with a time complexity of O(K
2
T) and
a space complexity of O(KT) where K is the possible number of states and T is the length of the
observation sequence. The algorithm is described in the following steps:
A.3 Graphical Models 163
Initialization:
V
0
(0) = 1, where 0 is the imaginary start position.
V
k
(0) = 0, ∀ k > 0
Iteration:
V
j
(i) = e
j
(x
i
) ∗ max
k
[a
kj
∗ V
k
(i −1)]
Ptr
j
(i) = argmax
k
a
kj
∗ V
k
(i −1)
Termination:
p(
−→
x ,
−→
q∗) = max
k
V
k
(t)
Traceback:
q

t
= argmax
k
V
k
(t)
q

i−1
= Ptr
q
i
(i)
Learning
In order to estimate the parameters of an hidden Markov model, two learning scenarios exist; when
the labeled training data is available and when the training data is not labeled. In this chapter, we
shall only discuss the first case when labeled training data is available. In case when training
data is not labeled, the Baum-Welch algorithm [149] can be used which is based on the principle
of expectation maximization [150]. Alternately, the Viterbi training can also be used. In case
when the training data is labeled the observation sequence,
−→
x = x
1
, x
2
, ..., x
t
, is given and the
corresponding state sequence,
−→
q = q
1
, q
2
, ..., q
t
is known. We define,
A
kl
= number o f times transition occurs f rom k to l in
−→
q .
E
k
(x) = number o f times state k in
−→
q emits x in
−→
x .
The maximum likelihood parameters θ, i.e., maximize p(
−→
x |θ), can be shown to be:
a
kl
=
A
kl

i
A
ki
(A.11)
e
k
(b) =
E
k
(b)

c
E
k
(c)
(A.12)
164 An Introduction to Conditional Random Fields
Hence, given the labeled training data, the best estimate of the parameters that can be obtained
is the average frequency of transitions and emissions that occur in the training data. A common
drawback is that this can result in over-fitting which affects their generalization capability.
We, thus, observe that for an hidden Markov model, the transition and emission probabilities
can be used to determine the likelihood of the parse, i.e., given an hidden Markov model, an
observation sequence
−→
x and a parse
−→
q , the likelihood of this parse is:
p(
−→
x ,
−→
q ) = p(x
1
, x
2
, ..., x
t
, q
1
, q
2
, ..., q
t
)
= a
0q
1
a
q
1
q
2
...a
q
t−1
q
t
e
q
1
(x
1
)e
q
2
(x
3
)...e
q
t
(x
t
)
A compact approach to represent a
0q
1
a
q
1
q
2
...a
q
t−1
q
t
e
q
1
(x
1
)e
q
2
(x
3
)...e
q
t
(x
t
) is to consider all the
parameters a
ij
and e
i
(b) as features. Let there be n such features (both a
ij
and e
i
(b)). Counting
the number of times every feature j = 1, ..., n occurs in (
−→
x ) and (
−→
q ), we represent the count as
F(j, x, q) = number o f parameters θ
j
which occur in (
−→
x ,
−→
q )
Thus,
p(
−→
x ,
−→
q ) =

j=1,...,n
θ
F(j,
−→
x ,
−→
q )
j
which can be reduced to the form
p(
−→
x ,
−→
q ) = exp
_

j=1,...,n
log(θ
j
) ∗ F(j,
−→
x ,
−→
q )
_
(A.13)
Equation A.13 gives another way of representing an hidden Markov model which presents an intu-
itive approach for understanding the maximum entropy Markov models and, thus, the conditional
random fields which are discussed next.
Maximum Entropy Markov Model
Similar to how we extended the naive Bayes classifier to perform sequence labeling in the hidden
Markov models, given the maximum entropy model (Maxent classifier) in Equation A.6, we can
A.3 Graphical Models 165
extend it to perform sequence labeling for an observation sequence,
−→
x . This results in a maximum
entropy Markov model as represented in Figure A.7.
y
1
x
1
y
2
y
t
x
2
x
t
y
t−1
x
t−1
Figure A.7: Maximum Entropy Markov Model
One approach to perform sequence labeling is to run the Maxent classifier locally for every
observation in the sequence resulting in a label for every observation, x
i
. An obvious drawback of
this approach is that the labels for each observation x
i
are optimal locally as opposed to the optimal
sequence of labels, y
1
, y
2
, ..., y
t
. To avoid this, the Viterbi decoding can be performed similar to
the hidden Markov models. The maximum entropy Markov model represented in Figure A.7 can
be factorized as:
p(y
t
|y
t−1
, x
t
) =
1
Z(y
t−1
, x
t
)
∗ exp

k
λ
k
∗ f
k
(y
t
, y
t−1
, x
t
) (A.14)
where
Z(y
t−1
, x
t
) =

y
t
exp

k
λ
k
∗ f
k
(y
t
, y
t−1
, x
t
)
is the partition function, λ
k
is the weight and f
k
(y
t
, y
t−1
, x
t
) is the feature function defined for a
feature k. Using the Viterbi algorithm, decoding can be performed similar to the hidden Markov
model such that the probability of the overall sequence of labels is maximized as opposed to
finding the optimum class label at each observation x
t
.
Comparing Equation A.13 and Equation A.14, we note that the hidden Markov model models
the joint probability of the observation sequence and the label sequence by assuming that a state
at time t depends only on the state at time t − 1. Further, they assume that the observation at
time t depends only on the state at time t. Instead, the maximum entropy Markov model models
166 An Introduction to Conditional Random Fields
the conditional distribution of the label sequence by conditioning on the observation at time t.
Maximum entropy Markov models, thus, often perform better than the hidden Markov models,
however, they suffer from the Label Bias problem [34], [112] which is described next.
Label Bias in Maximum Entropy Markov Models
Label bias is the phenomenon in which the model effectively ignores the observation thereby
resulting in inaccurate results. This is attributed to the directed graphical structure and, hence, local
conditional modeling in each state [112]. As we discussed earlier, the maximum entropy Markov
model is analogous to a sequence of independent Maxent classifiers, thus, the probability at every
instant sums to one. As a result, if certain sequence of states is more frequent during training, the
same path is preferred irrespective of the observation at any later stage (during decoding). In other
words, the previous state explains the current state so well that the observation at the current state
is effectively ignored.
In [34], the authors explain the label bias phenomenon with the following example. Consider
the finite state model represented in Figure A.8.
o
i
r
b
r
b
4 5
1 2
3 0
Figure A.8: Label Bias Problem
Suppose that the observation sequence is r i b. Once, the model observes, the observation r,
it assign equal probability to both, state 1 and state 4. Next the model observes the observation i.
However, given the model, both the states 1 and 4 have only one outgoing transition and because
the incoming probability is equal to the outgoing probability, due to local normalization, when
the model observes i or any other observation, both the states have no choice but to ignore the
observation and move to the next state with the maximum probability. As a result, both states 2
and 5 result in equal probability. Further, if one of the observation sequences is more common in
the training, the transitions would prefer the corresponding path irrespective of the observation.
A.3 Graphical Models 167
In the above example, it is possible to eliminate label bias by collapsing the states 1 and 4,
however, this is a special case and not always possible [34]. Another approach is to start with
a fully connected structure; however, this would preclude the use of prior structural knowledge.
Similar to the label bias, the authors in [112] describe what they call as the observation bias where
the observations explain the states such that the previous states are effectively ignored.
Conditional random fields effectively address these issues by dropping local normalization
and instead normalize globally on the observation sequence. However, before we describe the
conditional random fields, we present the general undirected graphical models which are necessary
for better understanding of the conditional random fields.
A.3.2 Undirected Graphical Models
Def.: An undirected graphical model is a graph G = (V, E) where V = {V
1
, V
2
, ..., V
N
} are the
vertices and E = {(V
i
, V
j
), i = j} are the undirected edges from vertex V
i
to vertex V
j
. A vertex
V
i
can be represented by the random variable representation X
i
. Undirected graphical models are
also known as the Markov Random Fields [145].
Similar to the directed graphical models, the undirected graphical models describe the factor-
ization of a set of random variables and their notion of conditional independence. The undirected
graphical models factorize according to the probability distribution given in Equation A.15.
p(x
1
, x
2
, ..., x
n
) =
1
Z

c∈C
ψ
c
(x
c
) (A.15)
such that
ψ
c
(x
c
) > 0, ∀ c, x
c
and
Z =

x
1

x
2
...

x
n

c∈C
ψ
c
(x
c
)
where C is the set of cliques in the graph, Z is the normalization factor known as the partition
function and ψ
c
are the strictly positive real valued functions known as the potential functions
defined over the cliques. Potentials have no specific probabilistic interpretations. To make sure
that the Equation A.15 represents a probability distribution, it is necessary to calculate the partition
function Z. Figure A.9 represents an undirected graphical model for three random variables.
168 An Introduction to Conditional Random Fields
x
1
x
3
x
2
Figure A.9: Undirected Graphical Model
The undirected graphical model represented in Figure A.9 can be factorized as:
p(x
1
, x
2
, x
3
) =
1
Z
ψ
1,2
(x
1
, x
2

1,3
(x
1
, x
3

2,3
(x
2
, x
3

1,2,3
(x
1
, x
2
, x
3
) (A.16)
where
Z =

x
1

x
2

x
3
ψ
1,2
(x
1
, x
2

1,3
(x
1
, x
3

2,3
(x
2
, x
3

1,2,3
(x
1
, x
2
, x
3
)
The complexity of an undirected graphical model depends upon the size of the largest clique.
The overall complexity can be determined from ∑
c∈C
O(
k
m
c
) where, m
c
is the size of the clique
c. For the undirected graphical models, conditional independence properties can be simply deter-
mined by graph separation. More details can be found in [145].
A.4 Conditional Random Fields
In [34], the authors proposed the conditional random fields as a solution for the label bias prob-
lem. However, conditional random fields can also be considered as a generalization of the hidden
Markov models, [113], [151] which also gives a better view and help in better understanding.
A major drawback in an hidden Markov model is that the state q
i
can observe only the ob-
servation symbol x
i
, i.e., a strong independence assumption is made that the state at any instant
depends only upon the previous state. Further, we observed that using the dynamic programming
approach, all ‘K
2
’ a
kl
features and all ‘K’ e
l
(x
i
) features are significant at every instant. Rearrang-
ing Equation A.7 and Equation A.10 as:
V
l
(i) = V
k
(i −1) + (a(k, l) +e(l, i))
= V
k
(i −1) + g(k, l, x
i
)
A.4 Conditional Random Fields 169
We note that the restriction in an hidden Markov model arise from the x
i
part in the function
g(k, l, x
i
). Generalizing this function to g(k, l, x, i) removes the independence assumptions made
in the hidden Markov model which forms the basis for conditional random fields. A large number
of features can be defined at every position which can capture long range dependencies in the
observation sequence x. Higher the value of function g, the more likely state k will follow state
l at position i. A conditional random field, thus, includes all the features present in an hidden
Markov model and also has the capability to define a large number of additional features which
significantly improves its modeling power compared to that of an hidden Markov model.
A.4.1 Representation of Conditional Random fields
Using Equation A.15,
p(x
1
, x
2
, ..., x
t
) =
1
Z

c∈C
ψ
c
(x
c
)
p(
−→
x ) =
1
Z

c∈C
ψ
c
(x
c
)
The conditional probability can be written as:
p(
−→
y |
−→
x ) =
p(
−→
y ,
−→
x )
p(
−→
x )
=
p(
−→
y ,
−→
x )

−→
y
p(
−→
y

,
−→
x )
=
1
Z
∗ ∏
c∈C
ψ
c
(y
c
, x
c
)
1
Z
∗ ∑
−→
y

c∈C
ψ
c
(y

c
, x
c
)
=
1
Z(
−→
x )

c∈C
ψ
c
(y
c
, x
c
) (A.17)
where
Z(
−→
x ) =

−→
y


c∈C
ψ
c
(y

c
, x
c
)
Equation A.17, presents the general formulation of a conditional random field. However, in this
chapter, we shall focus on a linear chain structure for conditional random fields which is motivated
from [34], [113], [151] and [152] and is described next.
170 An Introduction to Conditional Random Fields
Linear Chain Conditional Random Field
Consider an observation sequence
−→
x of length t + 1. A linear chain conditional random field is
represented in Figure A.10.
y
1
y
2
x
1
x
2
y
t+1
x
t+1
x
t
y
t
Figure A.10: Linear Chain Conditional Random Field
Using Equation A.17, a linear chain conditional random field can be formulated as:
p(
−→
y |
−→
x ) =
1
Z(
−→
x )
t

j=1
ψ
j
(
−→
y ,
−→
x ) (A.18)
where
Z(
−→
x ) =

−→
y

t

j=1
ψ
j
(
−→
y

,
−→
x )
and
ψ
j
(
−→
y ,
−→
x ) = exp
_
m

i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x , j)
_
Equation A.18 can be rewritten as:
p(
−→
y |
−→
x ) =
1
Z(
−→
x )
∗ exp
_
t

j=1
m

i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x , j)
_
(A.19)
where
Z(
−→
x ) =

−→
y
exp
_
t

j=1
m

i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x , j)
_
In Equation A.19, summing over all possible label sequences ensures that it is a probability dis-
tribution. Further, for an observation of length t + 1 for a linear chain structure represented in
Figure A.4, there exists t possible maximal cliques which are represented by adjacent nodes in the
chain. The index j, thus, represents the position in the input sequence and sums over a sequence
A.4 Conditional Random Fields 171
of length t. Index i represents the m overall feature functions defined on the specified set of vari-
ables. Further, the feature weights λ
i
are not dependent on the position j, rather they are tied to
the individual feature functions.
From Equation A.18, we observe that the potential function ψ must be a strictly positive real
valued function. Using the exponential function, for defining the potentials, implicitly represents
the positivity constraint on the potentials. Further, as we shall observe later, since exponential
function is a continuous function and easily differentiable, it can be effectively used for maximum
likelihood parameter estimation (estimating λ

s) during training.
Feature Functions and Feature Selection
In hidden Markov models, every label (or state), y
i
, can look only at the observation x
i
and hence
they cannot model long range dependencies between the observation sequence. As discussed
earlier, conditional random fields do not assume such independence among observations. This is
accomplished by using the features defined while training the random field. In order to define
the features, a clique template is defined which can extract a variety of features from the given
training samples. The clique template makes assumptions on the structure of the underlying data
by defining the composition of the cliques.
For a linear chain conditional random field, there exist only one clique template which defines
the links between y
j
and y
j−1
and
−→
x . Given the clique template, features can then be extracted
for different realizations of y
j
and y
j−1
and
−→
x from the training data.
A.4.2 Training
Given the labeled training sequences,
−→
x , the objective of training a conditional random field is to
determine the weights, λ

s which maximize p(
−→
y |
−→
x ). Maximum likelihood method is applied
for parameter estimation. The log likelihood L on the training data D is given by:
L(D) =

(
−→
y ,
−→
x )∈D
log p(
−→
y |
−→
x ) (A.20)
=

(
−→
y ,
−→
x )∈D
_
log
_
exp(∑
t
j=1

m
i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x ))

−→
y
exp(∑
t
j=1

m
i=1
λ
i
f
i
(y

j−1
, y

j
,
−→
x ))
__
172 An Introduction to Conditional Random Fields
To avoid over-fitting, the likelihood is often penalized with some form of a prior distribution which
has a regularizing influence. A number of priors such as the Gaussian, Laplacian, Hyperbolic and
others can be used. Consider a simple prior of the form∑
m
i=1
λ
2
i

2
i
where σ
i
is the standard deviation
of the parameter λ
i
.
Hence, the likelihood becomes:
L(D) =

(
−→
y ,
−→
x )∈D
_
log
_
exp(∑
t
j=1

m
i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x ))

−→
y
exp(∑
t
j=1

m
i=1
λ
i
f
i
(y

j−1
, y

j
,
−→
x ))
__

m

i=1
λ
2
i

2
i
=

(
−→
y ,
−→
x )∈D
log
_
exp(
t

j=1
m

i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x ))
_


(
−→
y ,
−→
x )∈D
log
_
_

−→
y

exp(
t

j=1
m

i=1
λ
i
f
i
(y

j−1
, y

j
,
−→
x ))
_
_

m

i=1
λ
2
i

2
i
=

(
−→
y ,
−→
x )∈D
t

j=1
m

i=1
λ
i
f
i
(y
j−1
, y
j
,
−→
x )
. ¸¸ .
A


(
−→
y ,
−→
x )∈D
log Z(
−→
x )
. ¸¸ .
B

m

i=1
λ
2
i

2
i
. ¸¸ .
C
(A.21)
Taking partial derivatives of the likelihood with respect to the parameters, λ

s, we get:

∂λ
i
A =

(
−→
y ,
−→
x )∈D
t

j=1
f
i
(y
j−1
, y
j
,
−→
x ) (A.22)
which is same as the expected value of the feature under its empirical distribution and is denoted
as
˜
E( f i).

∂λ
i
B =

(
−→
y ,
−→
x )∈D
1
Z(
−→
x )
∂Z(
−→
x )
∂λ
i
=

(
−→
y ,
−→
x )∈D
1
Z(
−→
x )

∂λ
i

−→
y

exp
_
t

j=1
m

i=1
λ
i
f
i
(y

j−1
, y

j
,
−→
x )
_
=

(
−→
y ,
−→
x )∈D
1
Z(
−→
x )

−→
y

exp
_
t

j=1
m

i=1
λ
i
f
i
(y

j−1
, y

j
,
−→
x )
_
t

j=1
f
i
(y

j−1
, y

j
,
−→
x )
=

(
−→
y ,
−→
x )∈D

−→
y

1
Z(
−→
x )
exp
_
t

j=1
m

i=1
λ
i
f
i
(y

j−1
, y

j
,
−→
x )
_
t

j=1
f
i
(y

j−1
, y

j
,
−→
x )
=

(
−→
y ,
−→
x )∈D

−→
y

p(
−→
y |
−→
x )
t

j=1
f
i
(y

j−1
, y

j
,
−→
x ) (A.23)
A.4 Conditional Random Fields 173
which is the expectation under the model distribution and is denoted as E( f i).

∂λ
i
C =

i

2
i
=
λ
i
σ
2
i
(A.24)
Using A.21, A.22, A.23 and A.24, we get:
∂L(D)
∂λ
i
=
˜
E( f i) − E( f i) −
λ
i
σ
2
i
(A.25)
To find the maximum, we equate the right hand side in Equation A.25 to 0. Hence,
˜
E( f i) − E( f i) −
λ
i
σ
2
i
= 0 (A.26)
˜
E( f i) can be easily computed by counting how often every feature occurs in the training data. To
efficiently calculate the E( f i), a modified version of the forward-backward algorithm can be used.
Consider states s

and s. As described in [34], defining the forward (α) and backward (β) scores
as follows:
α
j
(s|
−→
x ) =

s

α
j−1
(s

|
−→
x ) ∗ ψ
j
(
−→
x , s

, s) (A.27)
β
j
(s|
−→
x ) =

s

β
j+1
(s

|
−→
x ) ∗ ψ
j
(
−→
x , s

, s) (A.28)
where
ψ
j
(
−→
x , s, s

) = exp
_
m

i=1
λ
i
f
i
(y
j−1
= s, y
j
= s

,
−→
x )
_
Using the α and β functions, it is possible to compute the expectation under the model distribution
efficiently by:
E( f
i
) =

(
−→
y ,
−→
x )
1
Z(
−→
x )
t

j=1

s

s

f
i
(s, s

,
−→
x ) ∗ α
j
(s|
−→
x )ψ
j
(
−→
x , s, s


j
(s

|
−→
x )
The forward-backward algorithm has a complexity of O(K
2
T) where K is the number of states
and T is the length of the sequence. Training a conditional random field involves many iterations
of the forward-backward algorithm.
174 An Introduction to Conditional Random Fields
A.4.3 Inference
Given the observed sequence
−→
x and the trained conditional random field, the objective is to find
the most likely sequence of labels for the given observation. As with the hidden Markov models,
the Viterbi algorithm can be used to effectively determine the sequence of states. Often the number
of states is assumed to be equal to the number of labels and, hence, we use the two interchangeably.
Let δ
j
(s|
−→
x ) represent the highest score of the sequence of states ending in state s at position
j and is defined as:
δ
j
(s|
−→
o ) = max
y
1
,...,y
j−1
p(y
1
, ..., y
j−1
, y
j
= s|
−→
x ) (A.29)
We then calculate
δ
j+1
(s|
−→
x ) = max
s

δ
j
(s

) ∗ ψ
j+1
(
−→
x , s, s

) (A.30)
The algorithm is described in the following steps:
Initialization: ∀ s ∈ S :
δ
1
(s) = ψ
1
(
−→
x , s
0
, s)
q
1
(s) = s
0
Recursion: ∀ s ∈ S : 1 ≤ j ≤ t
δ
j
(s) = max
s
δ
j−1
(s

) ∗ ψ(
−→
x , s, s

)
q
j
(s) = argmax
s

δ
j−1
(s

) ∗ ψ(
−→
x , s

, s)
Termination:
p

= max
s
δ
t
(s

)
−→
l

t
= argmax
s

δ
t
(s

)
Traceback:
−→
l

t
= q
t+1
(
−→
l

t+1
)
The complexity of the algorithm is O(K
2
T) where K is the number of states and T is the length
of the sequence.
A.5 Comparing the Directed and Undirected Graphical Models 175
A.4.4 Tools Available for Conditional Random Fields
We now list some of the tools which implement conditional random fields. This, however, is not
a complete list of tools and many other tools exist that can be used. The tools include CRF++
[120], Mallet [126], Sunita Sarawagi’s CRF package [153], Kevin Murphy’s MATLAB CRF code
[154]. We mainly experimented with the CRF++ and found it to be effective and easy to use
and customize. [120] gives an in depth description of the software and describes the commands
necessary to run it using example of the named entity recognition task from language processing.
A.5 Comparing the Directed and Undirected Graphical Models
Both directed and undirected graphical models allow complex distributions to be factorized into
a product of simpler distributions (functions). However, the two models differ in the way they
determine the conditional independence relations. The directed models determine the conditional
independence properties via the d-separation test while the undirected models determine the same
via graph separation [145].
As a result, the two models also differ in the way the probability distribution is factorized.
In directed graphs, the factorization results into a product of conditional probability distributions
while in undirected graphs, the factorization results into arbitrary functions. Factorization into
arbitrary functions enable us to define functions which can capture dependencies among variables,
however, it comes with the cost of calculating the normalization constant Z. The directed graphical
models do not require calculating such a partition function.
Given the two ways (directed and undirected models) to factorize a distribution, consider a
set S which represents the universe of distributions. Then, using the approach of directed graph-
ical models, we can represent only a subset, D, of distributions which follow all the conditional
independence properties. Similarly, using the approach presented by the undirected models, we
can represent a subset, U, of distributions which follow all the conditional independence relations.
This can be represented as shown in Figure A.11.
Note that, Figure A.11 shows that there exist a subset of distributions (which follow all the
conditional independence relations) which can be represented, both, by directed and undirected
models and there also exist distributions which can be represented only by either of the two. A
176 An Introduction to Conditional Random Fields
Distributions Represented
by Undirected Models
Distributions Represented by Both
Directed and Undirected Models
Distributions Represented
by Directed Models
Set of All Distributions
Figure A.11: Factorization in Graphical Models
trivial example where the factorizations are alike is when all the random variables are independent.
A.6 Conclusions
In this chapter we described conditional random fields in detail, discussed properties along with
the assumptions made which motivate their use in a particular problem including their advantages
and disadvantages with respect to previously known approaches which can be used for similar
tasks. The key features for conditional random fields are:
• Conditional random fields can be considered as a generalization of the hidden Markov mod-
els.
• Conditional random fields eliminate the label bias problem which is present in other condi-
tional models such as the maximum entropy Markov models.
• Long range dependencies can be modeled among observations using conditional random
fields.
• Training a conditional random field involves many iterations of the forward-backward al-
gorithm which has a complexity of O(K
2
T) where K is the number of states and T is the
length of the sequence.
• Inference or test time complexity for a conditional random field is also O(K
2
T) where K is
the number of states and T is the length of the sequence.
• Conditional random fields have been shown to be successful in many domains including
computational linguistics, computational biology and real-time intrusion detection.
Appendix B
Feature Selection for Network Intrusion
Detection
As described in Chapter 4, every record in the KDD 1999 data set presents 41 features which
can be used for detecting a variety of attacks such as the Probe, DoS, R2L and U2R. However,
using all the 41 features for detecting attacks belonging to all these classes severely affects the
performance of the system and also generates superfluous rules, resulting in fitting irregularities in
the data which can misguide classification. Hence, we performed feature selection to effectively
detect different classes of attacks. We now describe our approach for selecting features for every
layer and why some features were chosen over others.
B.1 Feature Selection for Probe Layer
Probe attacks are aimed at acquiring information about the target network from a source that is
often external to the network. For detecting Probe attacks, basic connection level features such as
the ‘duration of connection’ and ‘source bytes’ are significant, while features like ‘number of file
creations’ and ‘number of files accessed’ are not expected to provide significant information. We
selected only five features for Probe layer. The features selected for detecting Probe attacks are
presented in Table B.1.
Table B.1: Probe Layer Features
Feature Number Feature Name
1 duration
2 protocol type
3 service
4 flag
5 src bytes
177
178 Feature Selection for Network Intrusion Detection
B.2 Feature Selection for DoS Layer
DoS attacks are meant to prevent the target from providing service(s) to its users by flooding
the network with illegitimate requests. Hence, to detect attacks at the DoS layer, network traffic
features such as the ‘percentage of connections having same destination host and same service’
and packet level features such as the ‘duration’ of a connection, ‘protocol type’, ‘source bytes’,
‘percentage of packets with errors’ and others are significant. To detect DoS attacks, it may not be
important to know whether a user is ‘logged in or not’, or whether or not the ‘root shell’ is invoked
or ‘number of files accessed’ and, hence, such features are not considered in the DoS layer. From
all the 41 features, we selected only nine features for the DoS layer. The features selected for
detecting DoS attacks are presented in Table B.2.
Table B.2: DoS Layer Features
Feature Number Feature Name
1 duration
2 protocol type
4 flag
5 src bytes
23 count
34 dst host same srv rate
38 dst host serror rate
39 dst host srv serror rate
40 dst host rerror rate
B.3 Feature Selection for R2L Layer
R2L attacks are one of the most difficult attacks to detect and most of the present systems cannot
detect them reliably. However, our experimental results presented earlier show that careful feature
selection can significantly improve their detection. We observed that effective detection of the R2L
attacks involve both, the network level and the host level features. Hence, to detect R2L attacks, we
selected both, the network level features such as the ‘duration of connection’, ‘service requested’
and the host level features such as the ‘number of failed login attempts’ among others. Detecting
R2L attacks, require a large number of features and we selected 14 features. The features selected
for detecting R2L attacks are presented in Table B.3.
B.4 Feature Selection for U2R Layer 179
Table B.3: R2L Layer Features
Feature Number Feature Name
1 duration
2 protocol type
3 service
4 flag
5 src bytes
10 hot
11 num failed logins
12 logged in
13 num compromised
17 num file creations
18 num shells
19 num access files
21 is host login
22 is guest login
B.4 Feature Selection for U2R Layer
U2R attacks involve the semantic details which are very difficult to capture at an early stage at
the network level. Such attacks are often content based and target an application. Hence, for
detecting U2R attacks, we selected features such as ‘number of file creations’, ‘number of shell
prompts invoked’, while we ignored features such as ‘protocol’ and ‘source bytes’. From all the
41 features, we selected only eight features for the U2R layer. Features selected for detecting U2R
attacks are presented in Table B.4.
Table B.4: U2R Layer Features
Feature Number Feature Name
10 hot
13 num compromised
14 root shell
16 num root
17 num file creations
18 num shells
19 num access files
21 is host login
180 Feature Selection for Network Intrusion Detection
B.5 Template Selection
To train a conditional random field, feature functions must be chosen in prior. Hence, we defined
a template which can be used to extract all possible feature functions from the given training data
to train the conditional random filed using the CRF++ tool [120].
The template can be used to define both, unigram and bigram feature functions. For unigram
feature functions, which begins with U, the template defines a special macro %x[row,col] which is
used to specify a token in the input data, where row specifies the relative position from the current
focusing token and col specifies the absolute position of the column. The number of feature
functions generated by this type of template amounts to (L * N), where L is the number of output
classes and N is the number of unique string expanded from the given template.
For bigram feature functions, which begins with B, a combination of the current output token
and previous output token (bigram) is automatically generated. This type of template generates a
total of (L * L * N) distinct features, where L is the number of output classes and N is the number
of unique features generated by the templates. A sample template used in our experiments is
presented next.
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
# Bigram
B
Appendix C
Feature Selection for Application
Intrusion Detection
As described in Chapter 5, we used 6 features to represent a user session. The six features are:
1. Number of data queries generated in a single web request.
2. Time taken to process the request.
3. Response generated for the request.
4. Amount of data transferred (in bytes).
5. Request made (or the function invoked) by the client.
6. Reference to the previous request in the same session.
C.1 Template Selection
To train a conditional random field, feature functions must be chosen in prior. Hence, we defined
a template which can be used to extract all possible feature functions from the given training data
to train the conditional random filed using the CRF++ tool [120].
The template can be used to define both, unigram and bigram feature functions. For unigram
feature functions, which begins with U, the template defines a special macro %x[row,col] which is
used to specify a token in the input data, where row specifies the relative position from the current
focusing token and col specifies the absolute position of the column. The number of feature
functions generated by this type of template amounts to (L * N), where L is the number of output
classes and N is the number of unique string expanded from the given template.
For bigram feature functions, which begins with B, a combination of the current output token
and previous output token (bigram) is automatically generated. This type of template generates a
total of (L * L * N) distinct features, where L is the number of output classes and N is the number
181
182 Feature Selection for Application Intrusion Detection
of unique features generated by the templates. A sample template used in our experiments is
presented next.
\# Unigram
U001:%x[-4,0] , U002:%x[-3,0] , U003:%x[-2,0] ,
U004:%x[-1,0] , U005:%x[0,0] , U006:%x[1,0] ,
U007:%x[2,0] , U008:%x[3,0] , U009:%x[4,0] ,
U101:%x[-4,1] , U102:%x[-3,1] , U103:%x[-2,1] ,
U104:%x[-1,1] , U105:%x[0,1] , U106:%x[1,1] ,
U107:%x[2,1] , U108:%x[3,1] , U109:%x[4,1] ,
U201:%x[-4,2] , U202:%x[-3,2] , U203:%x[-2,2] ,
U204:%x[-1,2] , U205:%x[0,2] , U206:%x[1,2] ,
U207:%x[2,2] , U208:%x[3,2] , U209:%x[4,2] ,
U301:%x[-4,3] , U302:%x[-3,3] , U303:%x[-2,3] ,
U304:%x[-1,3] , U305:%x[0,3] , U306:%x[1,3] ,
U307:%x[2,3] , U308:%x[3,3] , U309:%x[4,3] ,
U401:%x[-4,4] , U402:%x[-3,4] , U403:%x[-2,4] ,
U404:%x[-1,4] , U405:%x[0,4] , U406:%x[1,4] ,
U407:%x[2,4] , U408:%x[3,4] , U409:%x[4,4] ,
U501:%x[-4,5] , U502:%x[-3,5] , U503:%x[-2,5] ,
U504:%x[-1,5] , U505:%x[0,5] , U506:%x[1,5] ,
U507:%x[2,5] , U508:%x[3,5] , U509:%x[4,5]
\# Bigram
B

Abstract

I

NTRUSION Detection systems are now an essential component in the overall network and data security arsenal. With the rapid advancement in the network technologies including

higher bandwidths and ease of connectivity of wireless and mobile devices, the focus of intrusion detection has shifted from simple signature matching approaches to detecting attacks based on analyzing contextual information which may be specific to individual networks and applications. As a result, anomaly and hybrid intrusion detection approaches have gained significance. However, present anomaly and hybrid detection approaches suffer from three major setbacks; limited attack detection coverage, large number of false alarms and inefficiency in operation. In this thesis, we address these three issues by introducing efficient intrusion detection frameworks and models which are effective in detecting a wide variety of attacks and which result in very few false alarms. Additionally, using our approach, attacks can not only be accurately detected but can also be identified which helps to initiate effective intrusion response mechanisms in real-time. Experimental results performed on the benchmark KDD 1999 data set and two additional data sets collected locally confirm that layered conditional random fields are particularly well suited to detect attacks at the network level and user session modeling using conditional random fields can effectively detect attacks at the application level. We first introduce the layered framework with conditional random fields as the core intrusion detector. Layered conditional random field can be used to build scalable and efficient network intrusion detection systems which are highly accurate in attack detection. We show that our systems can operate either at the network level or at the application level and perform better than other well known approaches for intrusion detection. Experimental results further demonstrate that our system is robust to noise in training data and handles noise better than other systems such as the decision trees and the naive Bayes. We then introduce our unified logging framework for audit data collection and perform user session modeling using conditional random fields to build iii

real-time application intrusion detection systems. We demonstrate that our system can effectively detect attacks even when they are disguised within normal events in a single user session. Using our user session modeling approach based on conditional random fields also results in early attack detection. This is desirable since intrusion response mechanisms can be initiated in real-time thereby minimizing the impact of an attack.

iv

Kapil Kumar Gupta. the thesis comprises only my original work towards the PhD.000 words in length. due acknowledgement has been made in the text to all other material used. 3. January 2009 v . the thesis is less than 100. maps. exclusive of tables. 2.Declaration This is to certify that 1. bibliographies and appendices.

.

3.List of Publications Part of the work which is described in this thesis has been published as journal articles. Sequence Labeling for Effective Intrusion Detection – Kotagiri Ramamohanarao. Baikunth Nath. Under Review. Kotagiri Ramamohanarao. 6. Lecture Notes in Computer Science. In Press. 5. Lecture Notes in Computer Science. In Proceedings of the 3rd International Conference on Information Systems Security. book chapters and conference proceedings. User Session Modeling for Effective Application Intrusion Detection – Kapil Kumar Gupta. vii . Layered Approach using Conditional Random Fields for Intrusion Detection – Kapil Kumar Gupta. The Curse of Ease of Access to the Internet – Kotagiri Ramamohanarao. Christopher Leckie. Baikunth Nath.284. 2007. vol (4812). 4. Baikunth Nath. Intrusion Detection in Networks and Applications – Kapil Kumar Gupta. 2. pages 234 . Kotagiri Ramamohanarao. vol (278). Tao Peng. Kotagiri Ramamohanarao.249. Baikunth Nath. Kotagiri Ramamohanarao. Kapil Kumar Gupta. Submitted to the ACM Transactions on Information and Systems Security (TISSEC). Robust Application Intrusion Detection using User Session Modeling – Kapil Kumar Gupta. 2008. Baikunth Nath. Springer Verlag. In Proceedings of the 23rd International Information Security Conference. 1. Following is the list of the papers which have been published during the course of the candidature. In Press. World Scientific. In Handbook of Communication Networks and Distributed Systems. Kapil Kumar Gupta. IEEE Transactions on Dependable and Secure Computing (TDSC). pages 269 . In Press. Springer Verlag. In Proceedings of the 2nd Annual Computer Security Conference.

IEEE Computer Society. Baikunth Nath. Kotagiri Ramamohanarao. pages 203 . Springer Verlag. Conditional Random Fields for Intrusion Detection – Kapil Kumar Gupta. Attacking Confidentiality: An Agent Based Approach – Kapil Kumar Gupta. Kotagiri Ramamohanarao.208.296. 2006. Ashraf Kazi.7. pages 151 . Lecture Notes in Computer Science. Network Security Framework – Kapil Kumar Gupta. viii . vol (1).157. In Proceedings of the IEEE 21st International Conference on Advanced Information Networking and Applications Workshops. 2007. Baikunth Nath. 9. International Journal of Computer Science and Network Security (IJCSNS). vol (3975). Kotagiri Ramamohanarao. 8. vol 6(7B). In Proceedings of the IEEE International Conference on Intelligence and Security Informatics. Baikunth Nath. pages 285 . 2006.

ix . tremendous support from the School of Graduate Research. Their constant motivation. room 3. I thank the staff at the Department of Computer Science and Software Engineering.Acknowledgements It gives me immense pleasure to thank and express my gratitude towards my supervisors Assoc. and in the department for making the place a fun place to work and who also helped me to collect the data sets used in this research. for their support throughout the course of my study. Alauddin Bhuiyan deserves a special mention and I shall cherish the frequent tea breaks that we had together. In particular.08. I would like to thank my friends in the research lab. Baikunth Nath and Prof. I am grateful for the support received from the University of Melbourne via numerous channels including the Melbourne International Fee Remission Scholarship (MIFRS). Ramamohanarao Kotagiri. Finally. Prof. It would not have been possible for me to undertake this challenging task without their constant support. committee member Assoc.D. I thank them from the bottom of my heart. support and expert guidance has helped me to overcome all odds making this journey a truly rewarding experience in my life. supportive staff at the university libraries and various other university resources. I am extremely grateful to the National ICT Australia (NICTA) for the financial support in the form of the prestigious NICTA Studentship and regular support to present the research at various international conferences and to visit international laboratories. I do not have words to express my gratitude towards my parents and my elder brother whose support and uncountable sacrifices have paved the way for me to pursue this study. Prof. Chris Leckie for his valuable feedback and critical reviews which have helped to improve the quality of the thesis. Melbourne School of Engineering who has been extremely helpful at numerous occasions. I would also like to thank my Ph.

.

. . . . . . .4 Audit Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . .3. . Layered Conditional Random Fields for Network Intrusion Detection . . . . . . . . .4 Thesis Organization . .4 Layered Framework for Intrusion Detection . . . . . . .2 Introduction . . . . . . . . .1. . . . . . . . .3. . . .3. 1.1 1. . User Session Modeling for Application Intrusion Detection . 1. . . . . . . xi . .2. . . . . . .3 Properties of Audit Patterns useful for Intrusion Detection . . . . . . . Unified Logging Framework for Audit Data Collection .3 Principles and Assumptions in Intrusion Detection . . . . . .3 Research Objectives . . . . . . .1 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Classification of Intrusion Detection Systems . . . . . . Components of Intrusion Detection Systems . . . . . . . . . . . . . . . . . .3. .3 1. . . 2. . . . . . . . . . . . . . . .3 2. . . . . .3. . . . . . . . . .2 Classification based upon the Security Policy definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Contents 1 Introduction 1. . . . . . .1 2. . . . . . . . . . . . . . . . . . . . . .2. . . . . .1 1. . . Challenges and Requirements for Intrusion Detection Systems .2 2. . . . . . . . . . .3. . . .2. . Intrusion Detection and Intrusion Detection System . . . . . . . . . . . . Contributions to Thesis . . . .1 2. . . . . . . Relational or Sequential Representation . . . . . . . . . . 2. . . . . . . .2 1. . . . . . . . . . . . . . . . 2. . . . 1 1 3 3 4 5 5 6 7 7 9 9 11 13 13 14 15 17 19 22 22 23 24 Emerging Attacks . . . . . . . . . . . .1 Motivation and Problem Description . . . . . . . Classification based upon the Audit Patterns . .4. . . . . . 2 Background 2. .4. . . Univariate or Multivariate Audit Patterns . . .2 2.2 1. .1 2. 1. . . . . . . . . . .4. .

. . . .2. . . . .1 3. . . . . . . . 4. . . . . . . . . . . . . . . . . . . . .3 Introduction . . . . Advantages of Layered Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 2. .2 Building Individual Layers of the System . . . . . . . . . . .3 Significance of Layered Framework . . . . Monitoring Access Logs . . . . . . . . . . . . .6. . . . . . . . . . . . . . . . . . . . . Integrating the Layered Framework . . . . .6 Evaluation Metrics . . . . . . . . . . . 24 25 26 26 31 33 34 36 37 37 39 39 41 42 43 45 47 47 49 50 51 53 55 56 57 64 67 70 71 72 2. . . . . . . . . . . . . . . . . Implementing the Integrated System . . 4. . .6. 2. . . . .6.2 4. . . .2 4. . . . . . . . .6. . . . . . . . . . . . . .1 4. . . . . . . . . .6 Comparison and Analysis of Results . . . .3 2. . . . . . . . . . . . . . . .6 Components of Individual Layers . .7 2.1 3. . . . . . . . . . . .5. . . . . . . . . Methodology .4 3. . . . .5. . . .4. . . . . . . . . . . . . . . . . . . . . . . . . .2 4. . . . . . . . . . . . . . .4 Introduction . . . . . . . . . . . Conclusions . . . .3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Description of our Framework . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Feature Selection . . Motivating Examples . 4 Layered Conditional Random Fields for Network Intrusion Detection 4. 4. . . . . . Application Intrusion Detection . . . . . . . . . . xii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison with other Frameworks . . . .2 3. . . . Significance of Feature Selection . . . . . . . . .6. . . . . . . . . . . . . . . . . . . . . .6. . . . . . 3 Layered Framework for Building Intrusion Detection Systems 3. . . . . . . . . . . . . . . . . . . . . . . . . Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . Significance of Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6. . . . . . . . . . . . . . . . . . . . . . .1 2. . . . . . . . . . . . Network Intrusion Detection . . Experiments and Results . . .4 Frameworks for building Intrusion Detection Systems . . .8 Conditional Random Fields . . . . .3 4. . . . . . . . . . . . . . 3. . . . . . . . . . . . . . . . . .5 3. . . . . . . .1 4. . . . . . . . . . . . . . . . Data Description . . .1 4. . . . . . . . . . . . . . . Literature Review . . . .1 4. . . . . . . . .4. . . . . . . . . . . . . . . . . . . . 4. . . . . . . . . . . . .

. . . . . . . . . .3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . .4. Motivating Example . . . . . . . . . . . . . . . . . Methodology . . . . . . . . . . . . . Audit Data Collection . . . . . . . . . . . . . . . . . . . . .5. . . . Attack Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6. 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5. . . . . . .8 Conclusions . . . . . . . . . . . . . . 4.1 5. . . . . . . . . . . . . . . . . . . . . . . . . Proposed Framework .4. . . . . . . . . . . . . . . . 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Significance of Using Unified Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 74 77 79 79 80 81 83 85 86 87 88 89 91 91 92 94 95 96 96 97 99 4.1 6. . . . . . . . . . . . .2 6. . . . 5. . . 6. . .4 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7 Issues in Implementation . . . .3. . . . . . . . . . . . . . . . . . . .1 6. . . . . .2 6. . . . .6. . . . . . . .4 Introduction . . . .2 5. . . . . . . . . . . . . . . . . . . . . . . . . . . Session Modeling using a Moving Window of Events . . . . . . . . . .1 6. . . . .4. . . . .5 Feature Functions . . . . . . . .60) . . . . . . . . . . . . . . . . .6. . . . . . . . . .4. . 118 Test Time Performance . . . 123 xiii . .5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6. .3 Introduction . . . . . . . . .5 Effect of ‘S’ on Attack Detection . . . . . . . 114 Effect of ‘p’ on Attack Detection (0 < p ≤ 1) . . . . . . . . . . . . . . . . .1 5. . . . . . . .3 6. . . . . . . . . . . . .1 6. . . . 6 User Session Modeling using Unified Log for Application Intrusion Detection 6. . . . . . . . . . .7 Robustness of the System . . . . . . . .2 Experiments with Clean Data (p = 1) . . . . . . . . . . . . . . . . . . . . . . . . Experiments with Disguised Attack Data (p = 0. . . . . . 114 6. . Data Description . . . . . . . . . .1 Addition of Noise . . . . . . . . . .3 6. . . . . . . . . . . . . . . . . . . . . . .5. . . . . . . . . . . . . . . . . . . . . . . . 121 6. . . . . . . . . . Motivating Example . . Experiments and Results . . . . . . . . . . . . . . . . .7. . . . . . .2 6. . . . . . . .4.1 5.6. .6 Analysis of Results . . . . . . . .4 Description of our Framework . . . . .2 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6. . . . .4. . Normal Data Collection . . . . 5 Unified Logging Framework and Audit Data Collection 5. . . . . . . . . . . . . . . . . . . . . 120 Discussion of Results . . .

. . . . . . . . . . . . . . . . 127 131 147 149 Bibliography Appendices A An Introduction to Conditional Random Fields A. . . . 152 A.1 Feature Selection for Probe Layer . . . . . . . . . . .1 Representation of Conditional Random fields . . . . . . . . . .2 Undirected Graphical Models . . . . . . . . . . . .3 Inference . .4 Feature Selection for U2R Layer . . . . . .4. . .6. . 167 A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 B Feature Selection for Network Intrusion Detection 177 B. . . . . . . . . . . . . . . 150 A. . . . . . . . . . 124 125 7 Conclusions 7. . . . . . . . . . . . . . . . . .1 Directions for Future Research . . . . .1 Introduction . . . . . . . . . . . . . . . . . . . 179 B. . . . . .7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6 Conclusions . . . . 169 A. . . 178 B. . . . . . . . . 175 A.3 Feature Selection for R2L Layer . . . . . . . . . . . . 174 A. . . . . . .7. . . . . . . . . . . . . . . . . . . . . . . . .4. . . .1 Directed Graphical Models . . . . . . . . . . . .2 Feature Selection for DoS Layer . . . .4. . . . . . . . . . . . . . . . . .5 Comparing the Directed and Undirected Graphical Models . . . . . . . . . . . . .3 Graphical Models . . . . . . . . . . . . . . . . . . . . . . 180 C Feature Selection for Application Intrusion Detection 181 C. . . . . . . . . . . . . .3. . . . . . . . . . . . . .4 Tools Available for Conditional Random Fields . 175 A. . . . . . . 123 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . 177 B. . . . . . . . . . . . 181 xiv . . . . . . . . . 149 A. . . . . . . . . . . . . 178 B. . .2 Training . . . . . . . . .2 6. . . . . . . . . . . .2 Background . . . . . . . . . . . . . . . . .1 Template Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 A.4. . . . . . . . . . . . . . . . .1 6. . . . . . . . . . . 123 Suitability of Our Approach for a Variety of Applications . . . . . . .3. . . . . .4 Conditional Random Fields . 171 A. . . . . . . . . . . . . 168 A. . . .8 Availability of Training Data .5 Template Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . Effect of ‘S’ on Attack Detection for Data Set One. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Effect of ‘S’ on Attack Detection for Data Set Two. . . . . . . . . . . . . . . . . . . . . . .3 6.4 6. .13 Comparison of Results . . . .14 Layered Vs. . . . 6. . . . . . . . . . . . . . . . . . . . . . . 4. . . . . . . . . . . . . . . . . . . . . .1 6. . . . . . . . . Detecting R2L Attacks (with Feature Selection) . . . 4. . .List of Tables 2. . . .7 4. . . . . . . . . . . . . . . . . . . . . .16 Ranking Various Methods for Intrusion Detection . . . .3 4. . . . . . when p = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. . . KDD 1999 Data Set . . . . . . .60 . . . . . . . . .60 . . . . . . . . . . . . . . . 121 xv . . . . . Detecting U2R Attacks (with all 41 Features) . . . 4. . . . . . . . . . .1 4. . . . . . Detecting DoS Attacks (with all 41 Features) . . . . . . . . . . . . . . . .2 6. . . . . . . . .6 4. . .1 4. .5 4. . .12 Attack Detection at Individual Layers (Case:2) . . . . Detecting Probe Attacks (with all 41 Features) . 24 50 58 59 60 60 61 62 63 63 65 66 67 69 70 71 73 94 4. . . . .10 Confusion Matrix .8 4. . Detecting Probe Attacks (with Feature Selection) . . . . . . . . . . . . . .4 4. . . . . . . . . . . . . . . . . . . . . . . . Non Layered Framework . . . . . . . . . . .9 Confusion Matrix .2 4. Detecting U2R Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . .11 Attack Detection at Individual Layers (Case:1) . . . . . . . . . . . . . .5 Data Sets . . . . . . . 116 Comparison of Test Time . 114 Analysis of Performance of Different Methods . . . . . . . . . . . . . . . . . . . . . . Detecting R2L Attacks (with all 41 Features) . . . . . . . . . . . . 4. . . . . . . . . . 4. . when p = 0. . . . . . . . . . . . .15 Significance of Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Detecting DoS Attacks (with Feature Selection) . . . . . . . . .

. . .4 U2R Layer Features . . . . . . . . . . . . . 177 B. . . . . . . . . . . . . . .3 R2L Layer Features . . . . .B. . . . . . . . 179 B. . . . . . . 179 xvi . . . . . . . . .1 Probe Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 B. . . . . . . . . . . . . . . . . . . . . . . .2 DoS Layer Features . . . . . . . . . . .

. . . Representation of a Signature Based System . . . . . . . . . . . . . . . . . . . . .2 4. . . . . . . . . . . . . . . .4 6. . . . .3 5. . . . . . . .1 2. . . . . Effect of Noise on U2R Layer . . Framework for Building Application Intrusion Detection System . . Graphical Representation of a Conditional Random Field . . . . . . . .4 2. . . . . . .7 5. . . . . . . . . . . . . . . . . . . . . . . . Layered Framework for Building Intrusion Detection Systems .4 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 4. . . . . . . . . . . . .1 Behaviour of an Intruding Agent . . . . . . . . . . . . . . . . Effect of Noise on DoS Layer . . . . . .3 4. . . . . . . . . Effect of Noise on R2L Layer . . . . Representation of Probe Layer with Feature Selection . . . . . . . . . . . . . Knowledge Representation for a Resource (R) . . . . . . . . . . . . . . .3 2. . .1 3. . . .1 5.6 3. Conditional Random Fields for Network Intrusion Detection . . . . . . . . . . . . . . . Representation of a Normal Session .5 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Classification of Intrusion Detection Systems . . . .2 2. . . . . . . . . . . . . . . . . . . . . Representation of an Anomalous Session . . . . . . . . Representation of a Behaviour Based System . . . . . . Effect of Noise on Probe Layer . . . . Traditional Layered Defence Approach to Provide Enterprise Wide Security . .6 4. . . . .List of Figures 1. . . . . . . . . . . . . . . . . . Representation of a Hybrid System . .2 5. Integrating Layered Framework with Conditional Random Fields . . . .5 4. . . . . . . . . . . User Session Modeling using Conditional Random Fields . . . . . . . . . . . . xvii 4 16 17 18 18 19 35 40 44 51 53 55 75 75 76 76 83 85 88 89 95 . . . . . . . . . . . . . .1 4. . . . . . . . . . . .1 2. . Representation of a Single Event in the Unified log . . . . . . . .

. . . . . . . . . . . . . .6 Decoding in an Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . .4 Maxent Classifier . . . .8 6. . . . . . . . 158 A. . .11 Factorization in Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 A. . . . . . . . . . . . . . . . . .6 6. .6. . . . . . . . . . . . . . . . 165 A. . . . . . . . . . . 168 A. . . . . . . . . . . . 107 Results using Decision Trees at p = 0. . . . .2 6. . . . . . . . . . . .60 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 A. 155 A. .9 Comparison of F-Measure ( p = 1) . . . . . . . . . . . . . . . . . . . . . . . 176 xviii . . . . . . . . . . . . . . . . . . . . . . . . . . .4 6. . . .7 6. 170 A. 162 A. . . . . .7 Maximum Entropy Markov Model . . .10 Linear Chain Conditional Random Field .5 6. . 101 Comparison of F-Measure ( p = 0. . 156 A. . . . . . . . . . . . . . . 113 Effect of ‘p’: Results using Conditional Random Fields when 0 < p ≤ 1 . . . . . . . . . . . . . . . . . 111 Results using Hidden Markov Models at p = 0. . . .60 . . .3 6. . . . . . . . .9 Undirected Graphical Model . .1 Fully Connected Graphical Model . . . . . . . . . . 109 Results using Naive Bayes Classifier at p = 0. . . 119 A. . . . . . . . . .60) . . . . . . . . . . . . .10 Significance of Using Unified Log . . . .60 . . . . . . . . .60 . . . . . . . . . . . . . . . . . . . . . . . 154 A. . . . . . . . . . . . . . . . . . . .3 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60 . . . . . . . . . 105 Results using Support Vector Machines at p = 0. . . . . . . . . . . . . . . .2 Fully Disconnected Graphical Model . . . . . . .5 Hidden Markov Model . . . . . 103 Results using Conditional Random Fields at p = 0. 117 6. . . . . . . . .8 Label Bias Problem . . . . . . . .

Hence. 1. intrusion detection is one of the high priority and challenging tasks for network administrators and security professionals. Snort and others are developed using knowledge engineering approaches where domain experts can build focused and optimized pattern matching models [1]. a target of attacks which aim to bring down 1 . The objective of an intrusion detection system is to provide data security and ensure continuity of services provided by a network [3]. Present networks provide critical services which are necessary for businesses to perform optimally and are.Chapter 1 Introduction I N this thesis. Though such systems result in very few false alarms. limited attack detection coverage. signature based systems are expensive and slow to build. and Security (SANS) institute is the act of detecting actions that attempt to compromise the confidentiality. Today. Networking. The three issues are. most existing intrusion detection systems such as the USTAT. efficient in operation and have wide attack detection coverage. address these shortcomings and develop better anomaly and hybrid intrusion detection systems which are accurate in attack detection. we address three significant issues which severely restrict the utility of anomaly and hybrid intrusion detection systems in present networks and applications. Audit. suffer from a large number of false alarms and cannot be deployed in high speed networks and applications without dropping audit patterns. As a result their effectiveness is limited. due to their manual development process. We. they are specific in attack detection and often tend to be incomplete. EMERALD. Further. Present anomaly and hybrid intrusion detection systems have limited attack detection capability. integrity or availability of a resource [2].1 Motivation and Problem Description I NTRUSION detection as defined by the Sysadmin. large number of false alarms and inefficiency in operation. thus. thus. IDIOT.

the need to develop better intrusion detection systems. The simplest way to ensure a high level of security. Additionally. the number of vulnerabilities in software has been increasing and many of them exist in highly deployed software [10]. According to the Computer Emergency Response Team (CERT). in order to protect the data and services. the number of hosts on the Internet exceeded 550. this is in no way a solution for securing today’s highly networked computing environment and.000. According to the Internet Systems Consortium (ISC) survey. the attackers often come up with newer and more advanced methods to defeat the installed security systems [4].897 TB [7]. Considering that it is near to impossible to build ‘perfect’ software. The prospect of obtaining valuable information. Increasing dependence of businesses on the services over the Internet. [11]. has led to their rapid growth. The problem becomes more profound since authorized users can misuse their privileges and attackers can masquerade as authentic users by exploiting vulnerable applications. With the deployment of more sophisticated security tools. [5]. . a project in 2002. Given the diverse type of attacks (Denial of Service. However. the system must detect all intrusions with no false alarms. Earlier. to build a system which has broad attack detection coverage and at the same time which results in very few false alarms. with more and more data becoming available in digital format and more applications being developed to access this data. the data and applications are also a victim of attackers who exploit these applications to gain access to data. it becomes critical to build effective intrusion detection systems which can detect attacks reliably. User to Root and others). thus. estimated the size of the Internet to be 532. it is a challenge for any intrusion detection system to detect a wide variety of attacks with very few false alarms in real-time environment.000 in July 2008 [6]. it has also made the networks and applications a prime target of attacks. though. Remote to Local.2 Introduction the services provided by the network. Configuration errors and vulnerabilities in software are exploited by the attackers who launch powerful attacks such as the Denial of Service (DoS) [8] and Information attacks [9]. as a result of a successful attack. subside the threat of legal convictions. The inability to prevent attacks furthers the need for intrusion detection. Probing. Ideally. provided we can ensure hardware security. The challenge is. Rapid increase in the number of vulnerabilities has resulted in an exponential rise in the number of attacks. is to disable all resource sharing and communication between any two computers. The system must also be efficient enough to handle large amount of audit data without affecting performance at the deployed environment. hence.

1.2 Emerging Attacks

3

1.1.1 Research Objectives
In this thesis: 1. We aim to develop systems which have broad attack detection coverage and which are not specific in detecting only the previously known attacks. 2. We aim to reduce the number of false alarms generated by anomaly and hybrid intrusion detection systems, thereby improving their attack detection accuracy. 3. We aim to develop anomaly intrusion detection systems which can operate efficiently in high speed networks without dropping audit data. Issues such as scalability, availability of training data, robustness to noise in the training data and others are also implicitly addressed.

1.2 Emerging Attacks
For an intrusion detection system, it is important to detect previously known attacks with high accuracy. However, detecting previously unseen attacks is equally important in order to minimize the losses as a result of a successful intrusion. In [5], we describe a scenario in which a software agent can be used to attack a specific target without affecting any other network with a purpose to search and transmit confidential and sensitive information without authorized access. Such an attack can be carried out by experts with the motive to hide the entire attack and protect their identity from being discovered. Further, since the attack targets only a single network, it would not be detected by large scale cooperative intrusion detection systems. The most significant part of the entire attack is that none of the present systems can detect such attacks and the agent can destroy itself when the attack is successful without leaving traces of its activities. Unlike worms, the replication in case of an intruding agent is limited and it does not degrade performance at the target making their detection very difficult. We represent the behaviour of the intruding agent in Figure 1.1 by a flow diagram. In addition to detecting the Denial of Service attacks, which target availability aspect, and the Information attacks, which target confidentiality and integrity aspects, the intrusion detection systems must also be able to detect attacks which present a change in the motive of the attackers. Such attacks are network specific and the attacker follows a criminal pursuit which is driven by

4

Introduction

Start Search Information Set Up a Knowledge Database Time Out No Time Out No Yes No Yes

Search and Control a Zombie

Success

Yes Transmit Information and Await Confirmation

No

Success

Yes Attempt to Enter the Target Network No Attempts n>N Success No Yes Update Knowledge Database and Adjust Behaviour Yes Destroy Itself and Traces Yes Time Out No

Yes

No

Information Correct

Replicate

End

Figure 1.1: Behaviour of an Intruding Agent the goal to make money [4]. This has not only resulted in increasing the severity of attacks, but the attacks have become isolated; targeting only a few nodes in a single network. Such attacks are very difficult to detect using generic systems and hence, better intrusion detection systems must be developed which are capable of detecting such specific attacks.

1.3 Contributions to Thesis
In order to launch an attack, an attacker often follows a sequence of events. The events in such a sequence are highly correlated and long range dependencies exist between them. Further, in order to prevent detection, the attacker can also hide the individual events within a large number of normal events. As a result, considering the events in isolation affects classification and results in a

1.3 Contributions to Thesis

5

large number of false alarms. Additionally, the individual events themselves are vector quantities and consist of multiple features which are monitored continuously. These features are also highly correlated and must not be analyzed in isolation. In order to operate in high speed networks, present anomaly based systems consider the events individually, thereby, discarding any correlation between the sequential events. In cases when the present systems consider a sequence of events, they monitor only one feature, ignoring others, which results in a poor model. Hence, we introduce efficient intrusion detection frameworks and methods which consider a sequence of events and which analyze multiple features without assuming any independence among the features.

1.3.1 Layered Framework for Intrusion Detection
In Chapter 3, we introduce our Layered Framework for building intrusion detection systems which can be used, for example, as a network intrusion detection system and can detect a wide variety of attacks reliably and efficiently when compared to the traditional network intrusion detection systems. In our layered framework, we use a number of separately trained and sequentially arranged sub systems in order to decrease the number of false alarms and increase the attack detection coverage. In particular, our layered framework has the following advantages: • The framework is customizable and domain specific knowledge can be easily incorporated to build individual layers which help to improve accuracy. • Individual intrusion detection sub systems are light weight and can be trained separately. • Different anomaly and hybrid intrusion detectors can be incorporated in our framework. • Our framework not only helps to detect an attack but it also helps to identify the type of attack. As a result, specific intrusion response mechanisms can be initiated automatically thereby reducing the impact of an attack. • Our framework is scalable and the number of layers can be increased (or decreased) in the overall framework.

1.3.2 Layered Conditional Random Fields for Network Intrusion Detection
Network monitoring is one of the common and widely applied methods for detecting malicious activities in an entire network. However, real-time monitoring of every single event even in a

6

Introduction

moderate size network may not be feasible, simply due to the large amount of network traffic. As a result, it is only possible to perform pattern matching using attack signatures which may at best detect only previously known attacks. Anomaly based systems result in dropping audit data when they are used to analyze every event. As a result, network monitoring often involves analyzing only the summary statistics from the audit data. The summary statistics may include features of a single TCP session between two IP addresses or may include network level features such as the load on sever, number of incoming connections per unit time and others. Such statistics are represented in the KDD 1999 data set [12]. In Chapter 4, we introduce the Layered Conditional Random Fields which can be used to build accurate anomaly intrusion detection systems which can operate efficiently in high speed networks. In particular, our system has the following advantages: • The attack detection accuracy improves for individual sub systems when using conditional random fields. • The overall system has wide attack detection coverage, where every sub system is trained to detect attacks belonging to a single attack class. • Attacks can be detected efficiently in high speed networks. • Our system is robust to noise and performs better than any other compared system.

1.3.3 Unified Logging Framework for Audit Data Collection
In order to access application data, a user has no option but to access the application which interacts with the application data. Hence, application access and the corresponding data accesses are highly correlated. In order to detect attacks effectively, we aim to capture this correlation between the application access and the corresponding data accesses. Hence, in Chapter 5, we present our Unified Logging Framework which efficiently integrates the application and the data access logs. We have collected two such data sets which can be downloaded and used freely [13]. In particular, our unified logging framework has the following advantages: • By using the unified log, the objective is to capture the user-application and the applicationdata interaction in order to improve attack detection. Further, this interaction is fixed and does not vary overtime as opposed to modeling user profiles which changes frequently. • Our framework is application independent and can be deployed for a variety of applications.

naive Bayes classifiers. it becomes necessary to extend network monitoring and focus on data and applications which are often target of attacks.4 Thesis Organization 7 1. as we have already mentioned. we describe how conditional random fields can be integrated in our layered framework. 1. many attacks require a number of sequential operations to be performed. attacks can be blocked in real-time. We present our experimental results and demonstrate that layered conditional random fields outperform well known methods for intrusion detection and are a strong candidate to build robust and efficient network intrusion detection systems.4 User Session Modeling for Application Intrusion Detection Network monitoring is often restricted to monitoring summary statistics due to excessive amount of network traffic and is further affected due to network address translation and encryption. • Using our system. making it difficult to provide high level of security. Further. our system has the following advantages: • Conditional random fields perform best outperforming other well known anomaly detection approaches including decision trees. Thus. we introduce User Session Modeling using Conditional Random Fields which analyzes the unified log to detect application level attacks.1. We then describe our unified logging framework in Chapter .4 Thesis Organization This thesis is organized as follows.) • Our approach is robust in detecting disguised attacks. support vector machines and hidden Markov models. attacks can be detected at smaller window widths thereby resulting in an efficient system which does not require a large amount of history to be maintained. • Performing session modeling using conditional random fields in our unified logging framework. In Chapter 6. we first present the taxonomy of intrusion detection and give the related literature review in Chapter 2. We then describe our layered framework which can be used to build effective and efficient intrusion detection systems in Chapter 3.3. In Chapter 4. Our system based on conditional random fields is particularly effective when attacks span over a sequence of events (such as password guessing followed by launching of the exploit to gain administrative privileges on the target and finally leading to unauthorized access of data. In particular.

in Chapter 7 we conclude and give possible directions for future research.8 Introduction 5. . Our experimental results suggest that performing user session modeling using conditional random fields’ attacks can be detected by analyzing only a small number of events in a user session which results in an efficient and an accurate system. we then use the conditional random fields and perform user session modeling using a moving window of events in our unified logging framework to build real-time application intrusion detection systems. In Chapter 6. Finally. which integrates the application access logs and the corresponding data access logs to provide a unified audit log. The unified log captures the necessary user-application and application-data interaction which is useful to detect application level attacks effectively.

intrusion detection is still at its infancy and naive attackers can launch powerful attacks which can bring down an entire network [5]. [18].1 Introduction P RESENT networks are increasingly based on the concept of resource sharing as it is a necessity for collaboration. [15]. However. [20]. We describe the problem of intrusion detection in detail and analyze various well known methods for intrusion detection with respect to two critical requirements viz. both at the network and at the application level. which strengthens the need to develop more powerful intrusion detection systems. Hence. and provides an easy means of communication and economic growth. we explore the related research in intrusion detection. 2. [19]. [14]. The systems are getting bigger with more and more add on features making them complex. This 9 . [17]. accuracy of attack detection and efficiency of system operation. the need to communicate and share resources increases the complexity of the system. However. [16]. We observe that present methods for intrusion detection suffer from a number of drawbacks which significantly affect their attack detection capability. The cost involved in protecting these valuable resources is often negligible when compared with the actual cost of a successful intrusion. Intrusion detection started in 1980’s and since then a number of approaches have been introduced to build intrusion detection systems [1]. To identify the shortcoming of different approaches for intrusion detection.Chapter 2 Background D ETECTING intrusions in networks and applications has become one of the most critical tasks to prevent their misuse by attackers. we introduce conditional random fields for effective intrusion detection and motivate our approach for building intrusion detection systems which can operate efficiently and which can detect a wide variety of attacks with relatively higher accuracy.

[23] was born in 1984. . [29]. Additionally. In most situations such independence assumptions do not hold which severely affect the attack detection capability of an intrusion detection system. ‘Intrusion Detection Expert System’ (IDES) [22]. the attacks have become more complex and difficult to detect using traditional intrusion detection approaches. The last system to be released under the same generation. called ‘Stalker’. Another project at the Lawrence Livermore Laboratories developed the ‘Haystack’ intrusion detection system in 1988 [24]. [27]. More stringent monitoring has further increased the resources required by the intrusion detection systems. Ease of access of resources in addition to vulnerabilities and poor management of resources can be exploited to launch attacks [3]. in 1990 introduced the concept of network intrusion detection and came up with the system called the ‘Network Security Monitor’ (NSM) [26]. pattern matching system [25]. However. Todd Heberlein. was released in 1989 which was again a host based. demanding more effective solutions [5]. which described that audit trails contain valuable information and could be utilized for the purpose of misuse detection by identifying anomalous user behaviour. features intended for some specific usage in many applications may also be exploited for misuse of systems. Such systems are unable to detect attacks reliably because they neglect the sequence structure in the audit patterns and consider every pattern to be independent. the number of attacks has increased significantly [10]. [28]. As a result. Present intrusion detection systems are very often based on analyzing individual audit patterns by extracting signatures or are based on analyzing summary statistic collected at the network or at the application level [9]. The notion of intrusion detection was born in the 1980’s with a paper from Anderson [21]. These developments gradually paved way for the intrusion detection systems to enter into the commercial market with products such as ‘Net Ranger’. The lead was then taken by Denning at the SRI International and the first model of intrusion detection. Further. A typical example of this is the response generated by the SQL server which is often exploited in the SQL injection attacks. This further led to the concept of distributed intrusion detection system which augmented the existing solution by tracking client machines as well as the servers. Until then. ‘Real Secure’ and ‘Snort’ acquiring big market shares [25]. addition of more resources may not always provide a desired level of security.10 Background results in vulnerabilities in software and configuration errors in networks and deployed applications. the majority of the systems were host based and analyzed the individual host level audit records.

This again may not detect attacks reliably. However. Other features. the features are assumed independent and separate models are built using individual features. [33]. Hence. [37]. We then motivate the use of conditional random fields [34] for building effective network and application intrusion detection systems [32].7. we conclude this chapter in Section 2. make the system administrators to ignore the alarms altogether. it requires strong .6. To improve attack detection. We present the evaluation metrics for analyzing intrusion detection systems in Section 2. to reduce system complexity. are ignored. the system considers only one feature which is the sequence of system calls. in turn. Assuming events to be independent makes the model simple and improves speed of operation. followed by the properties of the audit patterns which can be used to detect attacks in Section 2. Having a strong legal system may be helpful in attack deterrence. Methods based on analyzing sequence of system calls issued by privileged processes are well known [30]. in this chapter we explore the problem of intrusion detection to identify the root causes of the inability of the present intrusion detection systems to detect attacks reliably.2.2 Intrusion Detection and Intrusion Detection System 11 Another approach for intrusion detection is based on analyzing sequence structure in the audit patterns. when multiple features are considered. we give the taxonomy of intrusion detection which is described in detail in [38]. all of the features must be considered collectively and not independently [32]. such as the arguments of the system calls.2 Intrusion Detection and Intrusion Detection System The intrusion detection systems are a critical component in the network security arsenal. Attack Deterrence – Attack deterrence refers to persuading an attacker not to launch an attack by increasing the perceived risk of negative consequences for the attacker. [36]. Present networks and applications are.2. 2. far away from a state where they can be considered secure.5 and give a detailed literature review for intrusion detection in Section 2. We then give their classification in Section 2. The rest of the chapter is organized as follows.4. However.8. but at the cost of reduced attack detection and increased number of false alarms. Frequent false alarms. In cases. We then describe conditional random fields in Section 2. Security is often implemented as a multi layer infrastructure and different approaches for providing security can be categorized into the following six areas [39]: 1.3. [35]. [31]. [33]. thus. in Section 2. Results from all the models are then combined using a voting mechanism. Finally.

it allows the system to take measures to prevent similar attacks in future.) 2. Research in this area focuses on methods such as those discussed in [40] which can effectively trace the true source of attack as very often the attacks are launched with spoofed source IP address. Detecting an attack is significant for two reasons. the system must react to an attack and perform the recovery mechanisms as defined in the security policy. Attack Prevention – Attack prevention aims to prevent an attack by blocking it before an attack can reach the target. (Spoofing refers to sending IP packets with modified source IP address so that the true sender of the packet cannot be traced. 5. An example of attack prevention system is a firewall [41]. Attack Deflection – Attack deflection refers to tricking an attacker by making the attacker believe that the attack was successful though. 3. An example of security mechanism for attack avoidance is the use of cryptography [43]. 4. avoiding possible threat. This is because. Such systems are better known as the Intrusion Prevention Systems. Tools available to perform attack detection followed by reaction and recovery are known as the intrusion detection systems. the difference between intrusion prevention and intrusion detection is slowly diminishing as the present intrusion detection systems increasingly focus on real-time attack detection and blocking an attack before it reaches the target. Research in this area focuses on attack deflection systems such as the honey pots [42]. Research in this area focuses on building intrusion detection systems. to prevent an attack. first the system must recover from the damage caused by the attack and second. in reality. . However. it is very difficult to prevent all attacks. the attacker was trapped by the system and deliberately made to reveal the attack. the system requires complete knowledge of all possible attacks as well as the complete knowledge of all the allowed normal activities which is not always available. 6. Encrypting data renders the data useless to the attacker. thus. Attack Reaction and Recovery – Once an attack is detected.12 Background evidence against the attacker in case an attack was launched. Attack Detection – Attack detection refers to detecting an attack while the attack is still in progress or to detect an attack which has already occurred in the past. Attack Avoidance – Attack avoidance aims to make the resource unusable by an attacker even though the attacker is able to illegitimately access that resource. However.

2.2 Intrusion Detection and Intrusion Detection System

13

2.2.1 Principles and Assumptions in Intrusion Detection
Denning [22] defines the principle for characterizing a system under attack. The principle states that for a system which is not under attack, the following three conditions hold true: 1. Actions of users conform to statistically predictable patterns. 2. Actions of users do not include sequences which violate the security policy. 3. Actions of every process correspond to a set of specifications which describe what the process is allowed to do. Systems under attack do not meet at least one of the three conditions. Further, intrusion detection is based upon some assumptions which are true regardless of the approach adopted by the intrusion detection system. These assumptions are: 1. There exists a security policy which defines the normal and (or) the abnormal usage of every resource. 2. The patterns generated during the abnormal system usage are different from the patterns generated during the normal usage of the system; i.e., the abnormal and normal usage of a system results in different system behaviour. This difference in behaviour can be used to detect intrusions. As we shall discuss later, different methods can be used to detect intrusions which make a number of assumptions that are specific only to the particular method. Hence, in addition to the definition of the security policy and the access patterns which are used in the learning phase of the detector, the attack detection capability of an intrusion detection system also depends upon the assumptions made by individual methods for intrusion detection [44].

2.2.2 Components of Intrusion Detection Systems
An intrusion detection system typically consists of three sub systems or components: 1. Data Preprocessor – Data preprocessor is responsible for collecting and providing the audit data (in a specified form) that will be used by the next component (analyzer) to make a decision. Data preprocessor is, thus, concerned with collecting the data from the desired source and converting it into a format that is comprehensible by the analyzer.

14

Background Data used for detecting intrusions range from user access patterns (for example, the sequence of commands issued at the terminal and the resources requested) to network packet level features (such as the source and destination IP addresses, type of packets and rate of occurrence of packets) to application and system level behaviour (such as the sequence of system calls generated by a process.) We refer to this data as the audit patterns. 2. Analyzer (Intrusion Detector) – The analyzer or the intrusion detector is the core component which analyzes the audit patterns to detect attacks. This is a critical component and one of the most researched. Various pattern matching, machine learning, data mining and statistical techniques can be used as intrusion detectors. The capability of the analyzer to detect an attack often determines the strength of the overall system. 3. Response Engine – The response engine controls the reaction mechanism and determines how to respond when the analyzer detects an attack. The system may decide either to raise an alert without taking any action against the source or may decide to block the source for a predefined period of time. Such an action depends upon the predefined security policy of the network. In [45], the authors define the Common Intrusion Detection Framework (CIDF) which recog-

nizes a common architecture for intrusion detection systems. The CIDF defines four components that are common to any intrusion detection system. The four components are; Event generators (Eboxes), event Analyzers (A-boxes), event Databases (D-boxes) and the Response units (R-boxes). The additional component, called the D-boxes, is optional and can be used for later analysis.

2.2.3 Challenges and Requirements for Intrusion Detection Systems
The purpose of an intrusion detection system is to detect attacks. However, it is equally important to detect attacks at an early stage in order to minimize their impact. The major challenges and requirements for building intrusion detection systems are: 1. The system must be able to detect attacks reliably without giving false alarms. It is very important that the false alarm rate is low as in a live network with large amount of traffic, the number of false alarms may exceed the total number of attacks detected correctly thereby decreasing the confidence in the attack detection capability of the system. Ideally, the system must detect all intrusions with no false alarms. The challenge is to build a sys-

2.3 Classification of Intrusion Detection Systems

15

tem which has broad attack detection coverage, i.e. it can detect a wide variety of attacks and at the same time which results in very few false alarms. 2. The system must be able to handle large amount of data without affecting performance and without dropping data, i.e. the rate at which the audit patterns are processed and decision is made must be greater than or equal to the rate of arrival of new audit patterns. Hence the speed of operation is critical for systems deployed in high speed networks. In addition, the system must be capable of operating in real-time by initiating a response mechanism once an attack is detected. The challenge is to prevent an attack rather than simply detecting it. 3. A system which can link an alert generated by the intrusion detector to the actual security incident is desirable. Such a system would help in quick analysis of the attack and may also provide effective response to intrusion as opposed to a system which offers no after attack analysis. Hence, it is not only necessary to detect an attack, but it is also important to identify the type of attack. 4. It is desirable to develop a system which is resistant to attacks since, a system that can be exploited during an attack may not be able to detect attacks reliably. 5. Every network and application is different. The challenge is to build a system which is scalable and which can be easily customized as per the specific requirements of the environment where it is deployed.

2.3 Classification of Intrusion Detection Systems
Classifying intrusion detection systems helps to better understand their capabilities and limitations. We therefore, present the classification of intrusion detection systems in Figure 2.1. From Figure 2.1, we observe that for any intrusion detection system, security policy and audit patterns are the two prime information sources. The audit patterns must be analyzed to detect an attack and the security policy defines the acceptable and non acceptable usage of a resource and helps to qualify whether an event is normal or attack. Hence, based on the given classification, an example of an intrusion detection system can be a centralized system deployed on a network with sliding window based data collection which operates in real-time and is based on signature analysis with active response to intrusion.

16

Background

Knowledge of the Resources Signature Based Behaviour Based Hybrid Security Policy Response on Intrusion Passive Active

Audit Source Location Network Based Host Based Application Based Number of Audit Sources Centralized Distributed Alert Correlation Frequency of Audit Data Collection Session Based Sliding Window Based Periodic Snapshot Based Frequency of Analysis Batch Mode Near Real Time Real Time

Information Sources for an Intrusion Detection System

Audit Patterns

Figure 2.1: Classification of Intrusion Detection Systems

1.2: Knowledge Representation for a Resource (R) where |UR−normal | ≥ |SR−normal | and |UR−attack | ≥ |SR−attack | Based upon the elements of subset S. The set U consists of both. This is represented in Figure 2.1 Classification based upon the Security Policy definition Intrusion detection systems are classified in two ways based upon the security policy definition. knowledge of SR−attack . They can detect attacks with very few false alarms but have limited attack detection capability since they cannot detect unseen attacks. Security policy defines the normal and abnormal usage of every resource. UR−normal UR−attack (a) Total Knowledge SR−normal SR−attack (b) Available Knowledge Figure 2.3. Consider a set U. The problem is to identify the set U such that it is complete and unambiguous. SR−attack should be equal to UR−attack . which represents the complete domain (universe) for a resource R. Signature based systems employ pattern matching approaches to detect attacks. intrusion detection system can be classified as: (a) Signature (Misuse) Based – When the set S only contains the events which are known to be attack. . the system focuses on detecting known misuses and is known as signature or misuse based system [42]. the security policy is defined with only the knowledge contained in the subset S. SR−attack . i.2. U = UR−normal .3. normal and abnormal usage of R. To be effective.e.2. in most practical situations it is very difficult to identify and define the complete set U and only a small portion of this set is available which is denoted as S. Hence. where S = SR−normal . i. Their attack detection capability is directly proportional to the available knowledge of attacks in the set S. UR−attack . Signature based system are represented in Figure 2. such systems require complete knowledge of attacks. However.3 Classification of Intrusion Detection Systems 17 2. Hence. which is not always possible.e.

this affects the attack detection and the system may not be able to detect a wide variety of attacks. the goal of the intrusion detection system is to identify significant deviations from the known normal behaviour [42] as shown in Figure 2.18 Background Missed Attacks Correctly Detected Normals Correctly Detected Attacks Figure 2. behaviour based systems. (c) Hybrid – In most environments. Hence. it may not be possible to completely define either the normal or the abnormal behaviour.3: Representation of a Signature Based System (b) Behaviour (Anomaly) Based – When the set S only consists of events which are known to be normal. however. to detect attacks. in general.. A hybrid system uses the partial knowledge of both. an intrusion detection system may generate a large number of false alarms or may be specific in detecting only a few types of attacks. i. Since the complete knowledge of a resource may not be available. False Alarms Correctly Detected Attacks Correctly Detected Normals Missed Attacks Figure 2. i.4: Representation of a Behaviour Based System For behaviour based systems to be effective complete knowledge of normal behaviour of a resource is required. SR−normal and SR−attack . As a result. suffer from a large false alarm rate. Events which lie beyond the threshold are detected as attacks.e.e. the set SR−normal should be equal to the set UR−normal . False alarms can be reduced by increasing the threshold. Hence.4. often resulting in fewer false alarms and detecting . a threshold is used which gives some flexibility to the system. there is a tradeoff in limiting the number of false alarms and the capability of the system to detect a variety of attacks.

2. when network statistics are used as the audit patterns. (b) Active Response Systems – In active response systems. large amount of audit patterns at the .3 Classification of Intrusion Detection Systems 19 more attacks. the system does not take any measure to respond to an attack once an attack is detected. The source from which the audit patterns are collected affects the attack detection capability of a system. based upon which the intrusion detection systems can be classified as: (a) Passive Response Systems – In a passive response system. This is because it is hard to infer the contextual information directly from the network audit patterns. [42]. the audit patterns may be encrypted rendering them unusable by the intrusion detector at the network level.5. the attack detection capability of a network based system is limited. Further. Though a single system (or a few strategically placed systems) is (are) sufficient for the entire network. A hybrid system is represented in Figure 2. In addition. It simply generates an alert which can be analyzed by the administrator at some later stage [39]. they cannot provide any detail about the user and system interaction. intrusion detection systems are classified as: (a) Network Based – In a network based system. Such systems generally employ machine learning approaches. the intrusion detection systems respond to attacks by various possible approaches which may include blocking the source of the attack for a predefined time period [39]. The security policy also defines how the system must respond when an attack is detected.2 Classification based upon the Audit Patterns 1. For example.3.2. False Alarms Correctly Detected Attacks Correctly Detected Normals Missed Attacks Figure 2. the audit patterns collected at the network level are used by the intrusion detector [46]. Based on this. [42].5: Representation of a Hybrid System 2. [47].

. In order to detect intrusions. Based upon this property. Such features only provide a high level summary which may not be able to detect attacks reliably [42]. first. amount of incoming and outgoing traffic. in high speed networks. The application based systems can be very effective as they can exploit the complete knowledge of the application and can be used even when encryption is used in communication. These statistics may include features such as the total number of connections. it may be practical to analyze only the summary statistics collected at regular time intervals. the audit patterns can be collected from a single source or from a number of sources. However. the audit patterns are collected either from a single source or from multiple sources but are processed at a single point where they are analyzed together to determine the global state of the network [42]. (c) Application Based – The application based systems are concerned only with a single application and detect attacks directed at a particular application or a privileged process [31]. The audit patterns collected at the individual host contains more specific information than the network level audit patterns.20 Background network level may also affect the total attack detection accuracy. They can also analyze the user and application interactions which can significantly improve the attack detection accuracy. Additionally. which may be used to detect attacks reliably. This is because of two reasons. the decision can be made by individual nodes or by aggregating the audit patterns at a single point and then analyzing them together. a significant portion of the total incoming patterns may be allowed to pass into the network without any analysis and second. it becomes difficult to manage a large number of host based systems in a big network. the intrusion detection systems can be classified as: (a) Centralized System – In a centralized system. such systems may themselves become a target of attacks. (b) Host Based – The intrusion detector in a host based system analyzes the audit patterns generated at the kernel level of the system which include system access logs and the error logs [42]. They can analyze either the application access logs or the system calls generated by the processes to detect anomalous activities. However. When the audit patterns are collected from more than one source. 2. host based systems can themselves be the victims of an attack.

The individual systems may themselves be centralized or decentralized. Depending upon the frequency of analysis of audit patterns. the intrusion detection systems can be classified as: . (c) Periodic Snapshot Based – Instead of recording every event or summarizing a session at its termination. Alert correlation systems can only be effective when multiple networks are attacked with similar attacks such as in case of worm outbreak. Agent based systems are examples of distributed intrusion detection systems [42]. they are classified as: (a) Session Based – Audit patterns can be collected at the end of every session by summarizing different features.2. Incase when the attacks are network specific. the distributed systems can make local decisions close to the source of the audit patterns and may report only a small summary of activities to a higher level in the system. distributed systems can be less accurate due to lack of global knowledge. Regardless of the source and the number of audit patterns. 4. In such cases. the intrusion detection systems can be classified depending upon the frequency at which the audit patterns are collected. (b) Sliding Window Based – In case of sliding window based collection of audit patterns. events are recorded using a moving window of fixed or variable width. 3. Based on this. The width of the window defines the number of events recorded together and the step size for sliding the window determines how fast the window is advanced forward. the alert correlation systems will not be effective even though a few target networks may detect some anomalous activities. The advantage of a distributed system for intrusion detection is that immediate response mechanism can be activated based upon local decisions. the local alerts will be discarded as false alarms due to lack of global consensus.3 Classification of Intrusion Detection Systems 21 (b) Distributed System – In contrast to the centralized systems. However. Methods can be used which analyze the summary of every session once the session is terminated. (c) Alert Correlation – Alert correlation based systems analyze the alerts generated by a number of cooperating intrusion detection systems [39]. snapshots of different states of the entire system can be taken at regular intervals which can be analyzed to detect intrusions.

the system must detect an attack immediately. Hence. Such systems cannot provide any immediate response to intrusion and can only perform the recovery task once an attack is detected.22 Background (a) Batch Mode – In batch mode intrusion detection. they are limited in detecting only those attacks whose signatures are known in prior. i. In such systems. (c) Real-time – A real-time intrusion detection system must detect an attack as soon as it is commenced.e. Real-time intrusion detection systems can be implemented by using a moving window with a step of size one. (b) Near Real-time – An intrusion detection system is said to perform in near real-time when the system cannot detect an intrusion when it commenced. but can detect it at some later stage during the attack or immediately at the end of an attack. However. in practice it is very difficult to build such a system given the constraint that it should have low false alarm rate and high attack detection accuracy. 2. Network based signature detection systems.4. the system is said to perform in real-time if and only if. there is some delay before the patterns are made available to the intrusion detector. Patterns collected by taking periodic snapshots or using moving window with step size greater than one can be used for near real-time intrusion detection. for real-time intrusion detection.1 Properties of Audit Patterns useful for Intrusion Detection Different properties in the audit patterns can be analyzed for detecting intrusions. The patterns are then analyzed for intrusions at predefined time intervals. However. for an event ‘x’ when the attack commenced. . which perform pattern matching can also perform in real-time by checking every event for known attacks. The authors in [49] describe three properties which can be used to detect intrusions. 2. the attacker cannot succeed with the event ‘x+1’. A typical example is the Snort [48].4 Audit Patterns The raw patterns must be preprocessed and presented in a format which can be interpreted by the intrusion detector before they can be analyzed. the audit patterns are aggregated in a central repository.

Systems analyzing the frequency or (and) duration property for the events can perform efficiently but they suffer from large false alarm rate as it is often difficult to determine the correct threshold for the events. the audit patterns may be collected from the routers and switches for the network level systems. hence. to avoid detection from systems which do analyze a sequence of events. This is because. A system which can analyze all of the above mentioned properties can detect attacks with high accuracy. Properties such as the number of invalid login attempts and the number of rows accessed in a database can be used to measure frequency. the duration property determines the acceptable time duration for a particular event. When only one feature is analyzed.2 Univariate or Multivariate Audit Patterns The audit patterns used to detect attacks may either be univariate or multivariate. As.4. Duration of Event(s) – Rather than counting the number of occurrences of an event. in case of univariate audit patterns. discussed before. 2. However. as in case of multivariate analysis. . However. discarding other features such as the parameters of the system calls can affect the attack detection capability of the system [9].4 Audit Patterns 23 1. It is based upon selecting a threshold which defines an acceptable range for a particular event. For example. such a system may be inefficient in operation. very often. when the sequence of system calls generated by a privileged process is analyzed for detecting abnormal behaviour. the analysis is much simpler in comparison to when many features are analyzed together. a single feature itself may not be the complete representation and. insufficient to detect attacks. Ordering of Events – Analyzing the order in which events occur can improve the attack detection accuracy and reduce false alarms. When the frequency crosses this limit. A threshold can be used to define the limit.2. intrusion is a multi step process in which a number of events must occur sequentially in order to launch a successful attack. large number of invalid login attempts for a single user id in a very short time span can be considered as an attempt to guess a password and hence an attack. Frequency of Event(s) – Frequency determines how often an event occurs in a predefined time interval. However. the attacks can be spread over a long period of time such that the events cannot be correlated unless a long history is maintained by the intrusion detection system. 2. 3. For example. an alarm can be raised.

e.5 Evaluation Metrics Evaluating different methods for detecting intrusions is important. are. thus. i.1: Confusion Matrix Predicted Normal True Normal True Attack True Negative False Negative Predicted Attack False Positive True Positive . total amount of data transferred in a session and duration of a session. the amount of attack traffic is extremely large as compared to the normal traffic. the audit patterns collected are sequential where one or more features are recorded continuously. Examples of such features include. used for evaluating intrusion detectors.3 Relational or Sequential Representation Very often. in case of the Denial of Service attacks. event ordering can be exploited in favour of higher attack detection accuracy. hence. Frequency and duration properties of events can be easily represented in a relational form. first. When the audit patterns are represented sequentially. However. in general. the number of instances in the classes is not equally distributed. However.24 Background 2. Intrusion detection is an example of a problem with imbalanced classes. evaluating intrusion detection systems using simple accuracy metric may result in very high accuracy [50]. Converting the audit patterns from sequential to relational form has two advantages. Other metrics such as Precision. efficient methods can be used for analysis of audit patterns in relational form. the raw audit patterns may be processed into a relational form and a number of new features can be added. These are defined with the help of the confusion matrix as follows: Table 2. However.4. The number of attacks is very small when compared with the total number of normal events. sequence analysis is slower when compared to the relational analysis. which do not depend on the size of the test set. this may result in affecting the attack detection capability as in relational form the ordering of events and. the relationship among sequential events is lost. Note that. These features often give a high level representation of the audit patterns in a summarized form. 2. more features can be added and second. Recall and F-Measure. Hence.

since the time taken in collecting and preprocessing the audit patterns is not considered. a system must have high Precision (i. high Recall (i. The time performance is generally measured for the time taken by the intrusion detector to detect an attack from the time the audit patterns are fed into the detector. what is normal and what is anomalous is not defined. either to force a network to stop some service(s) that it is providing or to steal some information stored in a network. it must detect only attacks). 2. This is sufficient for comparison when different methods use exactly the same data for analysis. it does not represent the efficiency of the intrusion detection system.6 Literature Review Two most significant motives to launch attacks as described in [3] are. i. total time must be measured which is the time from the point when intrusion actually started to the point in time when the response mechanism is activated. An intrusion detection system must be able to detect such anomalous activities.e. an event may be considered normal with respect to some criteria. a high F-Measure. Recall and is usually set to 1. However. The underlying assumption is that the evaluating criterion is unchanged and the system is properly trained such that it can reliably separate normal and anomalous events. the objective is to find anomalous test patterns which are similar to the anomalous patterns which occurred during training. however. Hence. Hence.. thus. time taken to detect an attack is also significant. Hence.2. In addition to evaluating the attack detection capability of the detector.e. but the same may be labeled anomalous when this criterion is changed. . in real environments.e. it must detect all attacks) and.6 Literature Review 25 Precision = number o f True Positives number o f True Positives + number o f False Positives Recall = number o f True Positives number o f True Positives + number o f False Negatives F − Measure = (1 + β2 ) ∗ Recall ∗ Precision β2 ∗ ( Recall + Precision) where β corresponds to the relative importance of Precision vs.

resulting in different attack detection capabilities. clustering. Frameworks which describe the collaborative use of intrusion detection systems have also been proposed [54]. A number of techniques such as association rules. The common intrusion detection framework is described in [45]. Pattern matching approaches are employed on the audit patterns which do not have any state or sequence information. This results in a large number of false alarms. The system described in [54] is based on the combination of network based and host based systems while the system in [55] employs both. The authors in [50] and [51] describe a data mining framework for building intrusion detection systems. naive Bayes classifier. It is important to note that different methods are based on specific assumptions and analyze different properties in the audit patterns. we introduce our Layered Framework for building Intrusion Detection Systems in Chapter 3. they assume independence among events.6. 2.1 Frameworks for building Intrusion Detection Systems A number of frameworks have been proposed for building intrusion detection systems. Hence. signature based and behaviour based techniques for detecting intrusions. artificial neural networks and others have been applied to detect intrusions at network level. this assumption may not always hold as a single intrusion may span over multiple events which are correlated.6. their approach requires the use of a large amount of noise free audit data to train the models. genetic algorithms. [55]. However. To ameliorate this. These methods can be broadly divided into three major categories: Pattern Matching Pattern matching techniques search for predefined set of patterns (known as signatures) in the audit patterns to detect intrusions. All of these frameworks suffer from one major drawback. support vector machines.2 Network Intrusion Detection The prospect of maintaining a single system which can be used to detect network wide attacks make network monitoring a preferred option as opposed to monitoring individual hosts in a large network. Using the approach described in [51]. However. The prime advantage of pattern matching approaches is that they . Agent based intrusion detection frameworks are discussed in [52] and [53].26 Background 2. a single intrusion detector used within these frameworks is trained to detect a wide variety of attacks. the rules can be learned inductively instead of manually coding the intrusion patterns and profiles.

To determine this threshold accurately is a critical issue. They can. Classification methods are one of the most researched and include methods like the decision trees. affects the attack detection accuracy. • Clustering – Clustering of data has been applied extensively for intrusion detection using a number of methods such as k-means. Clustering methods are based upon calculating the numeric distance of a test point from different cluster . [42]. Snort system [48] is based upon pattern matching. Bayesian classifiers. as in the Intrusion Detection Expert System (IDES) [23]. support vector machines and many others. artificial neural networks. fuzzy c-means and others [56]. detect attacks only if the corresponding pattern (signature) exists in the signature database.6 Literature Review 27 are very efficient and triggers an alert only when an exact match of an attack signature is found resulting in very few false alarms. very often. however. Data Mining and Machine Learning Data mining and machine learning methods focus on analyzing the properties of the audit patterns rather than identifying the process which generated them [9]. Hence. They are based upon modeling the underlying process which generates the audit patterns and exploit the frequency and duration property of events. When the threshold is low. This. they cannot detect unseen attacks for which there are no signatures [9]. [57]. however. which represent a summary measure. These methods include approaches for mining association rules. Statistical methods can operate either in batch mode (Haystack system) or in real-time mode (IDES). in order to reduce complexity and improve system performance only a single feature is considered.2. the system raises a large number of (false) alarms and when the threshold is high. They often analyze properties such as the overall system load and statistical distribution of events. as in the Haystack system [24]. Though these methods can handle multiple features in the audit patterns. the system may not detect attacks reliably. classification and cluster analysis. or the features are assumed to be independent. k-nearest neighbour classification. Statistical Methods Statistical methods based on modeling the monitored variables as independent Gaussian random variables and methods such as those based on the Hotelling T 2 test statistic can be used to detect attacks by calculating deviations in the present profile from the stored normal profile [9]. When the deviations exceed a predefined threshold. the system triggers an alarm.

28 Background centres’ and then adding the point to the closest cluster. As a result. Mining association rules for intrusion detection has the advantage that they are easy to interpret. During testing. Frequently used distance measures are the Euclidian distance and the Mahalanobis distance [9]. they are based upon building a database of rules of normal and frequent items during the training phase. Another issue when applying any clustering method is to select the distance measure as different distance measures result in clusters with different shapes and sizes. Decision trees select the best features for each decision node during tree construction based on . • Data Mining – Data mining approaches [50]. One of the main drawbacks of clustering technique is that since a numeric distance measure is used. the size of a Bayesian network increases rapidly as the number of features and the type of attacks modeled by the network increases. Clustering can. However. however. which is often the case. Bayesian network [61] can also be used for intrusion detection [62]. These approaches can deal with symbolic features and the features can be defined in the form of packet and connection details. [64]. Observations with symbolic features cannot be readily used for clustering which results in inaccuracy. clustering methods consider the features independently and are unable to capture the relationship between different features of a single record which results in lower accuracy. However. Association rules and frequent episodes are used to learn the record patterns that describe user behaviour. density based clustering methods can be used which are based on the assumption that intrusions are rare and dissimilar to the normal events. The detection accuracy suffers as the database of rules is not a complete representation of the normal audit patterns. patterns from the test data are extracted and various classification methods can be used to classify the test data. they tend to be attack specific and build a decision network based on special characteristics of individual attacks. the observations must be numeric. In such cases. • Bayesian Classifiers – Naive Bayes classifiers are also proposed for intrusion detection [60]. [63]. be performed in case only the normal audit patterns are available. However. [51] are based on mining association rules [58] and using frequent episodes [59] to build classifiers by discovering relevant patterns of program and user behaviour. In addition. they make strict independence assumption between the features in an observation resulting in lower attack detection accuracy when the features are correlated. • Decision Trees – Decision trees have also been used for intrusion detection [60]. This is similar to identifying the outlier points which can be considered as intrusions.

Further. modeling system calls alone may not always provide accurate classification as various connection level features are ignored. [69]. the neural networks can work effectively with noisy data. Decision trees generally have very high speed of operation and high attack detection accuracy and have been successfully used to build effective intrusion detection systems.5. the accuracy of the trained system also depends upon the amount of audit patterns available during training. [67]. Very often the sequence itself is a vector and has many correlated features. Similar to the pattern matching and statistical methods. training with more audit patterns result in a better model. Though. [76]. They can provide real-time attack detection capability. [68]. [31]. One such criterion is the gain ratio which is used in C4. these methods assume independence among consecutive events and hence do not consider the order of occurrence of events for attack detection. • Artificial Neural Networks – Neural networks have been used extensively to build network intrusion detection systems as discussed in [65]. they require large amount of data for training and it is often hard to select the best possible architecture for the neural network. • Support Vector Machines – Support vector machines map real valued input feature vector to higher dimensional feature space through nonlinear mapping and have been used for detecting intrusions [70]. [72]. The prime reason for working with summary patterns is that the system tends to be simple. in order to gain computational efficiency the multivariate data analysis problem is broken into multiple univariate data analysis problems and the individual results . [71]. The above discussed methods often deal with the summarized representation of the audit patterns and may analyze multiple features which are considered independently. [74] and hidden Markov models [75] can be used when dealing with sequential representation of audit patterns.2. Hidden Markov models have been shown to be effective in modeling sequences of system calls of a privileged process. Generally. efficient and give fairly good attack detection accuracy.6 Literature Review 29 some well defined criteria. which can be used to detect anomalous traces. • Markov Models – Markov chains [73]. [66]. deal with large dimensionality of data and perform multi class classification. For data mining and machine learning based approaches. [70] and [71]. However. like other methods. hidden Markov models cannot model long range dependencies between the observations [34]. However. [77] and [78] describes the use of hidden Markov model for intrusion detection.

present the Layered Conditional Random Fields for Network Intrusion Detection in Chapter 4. Instead. results in inaccuracy as the correlation among the features is lost. [81].30 Background are combined using a voting mechanism [9]. consider a data mining approach for mining association rules and finding frequent episodes in order to calculate the support and confidence of the rules. Hence. results in higher attack detection accuracy. The drawback with modeling the ordering of events is that the complexity of the system increases which affects the performance of the system. Secondly. [82]. TIM (Time based Inductive Machine). there is a tradeoff between detection accuracy and the time required for attack detection. The key difference between [85] and our work is that. [83]. [80]. [51]. These methods are generally aimed at developing a distributed intrusion detection system. This setting is sufficient to model the correlation between different features in an observation. [84]. However. W&S (Wisdom and Sense) system. Haystack system. This however. the system in [85] fails to model long range dependencies . thus. • Others – Other approaches for detecting intrusion include the use of genetic algorithm and autonomous and probabilistic agents [79]. However. in our work we define features from the observations as well as from the observations and the previous labels and perform sequence labeling via the conditional random fields to label every feature in the observation. We. the authors in [85] use only the normal audit patterns during training and build a behaviour based system while we train our system using both the normal and the anomalous patterns i. the MIDAS (Multics Intrusion Detection System). The most closely related work. in addition to the duration and frequency. however. to our work. they either consider only one feature [23] or assume the features to be independent [24]. network intrusion detection systems must perform very efficiently in order to handle large amount of network data and hence many of the network intrusion detection systems are primarily based on signature matching. The authors in [49] show that modeling the ordering property of events. we propose to use a hybrid system based on conditional random fields and integrate the layered framework to build a single system which can operate in high speed networks and can detect a wide variety of attacks with very few false alarms. When anomaly detection systems are used at network level. is of Lee et al. A number of intrusion detection systems such as the IDES (Intrusion Detection Expert Sys- tem). We also compare our work with [85] which describes the use of maximum entropy principle for detecting anomalies in the network traffic. we build a hybrid system.e. Snort and others have been developed which operate at the network level [1]. They.

each of which can further return multiple records. which can be easily represented in our model. In [92]. They also present an algorithm which summarizes the raw transactional SQL queries into compact regular expressions that can be used for matching against known attack signatures. Monitoring Data Access Logs In [86]. wide attack detection coverage and high accuracy of attack detection in a single system.2. Further.3 Monitoring Access Logs A number of approaches have been described to monitor the data access logs or (and) the application access logs in order to detect attacks. In order to detect malicious queries. the authors focus on detecting malicious database modifications using database logs to mine dependencies among data items by creating dependency rules. any update operation must satisfy certain rules which define what data items must be read before an update and what data items must be written after the update operation. the authors describe that fingerprinting of SQL queries can be used to detect malicious requests. the authors describe the use of database logs to build user profiles based on user query frequent item-sets. the authors describe that the database logs can be used to build role profiles to model normal behaviour which can then be used to identify intruders.6 Literature Review 31 in the observations. As we shall describe in Chapter 4. ordering constraints are imposed on the SQL queries in [90] which improves attack detection. In [90] and [91]. particularly at the user access level. We now review some of these well known approaches. we also integrate the layered framework with the conditional random fields to gain the benefits of computational efficiency. the authors describe that data objects can be . In [89]. the authors in [87] perform clustering of queries that might return one or more features. the set of relations accessed and the attributes referenced. For example. 2. They also define support and confidence functions for fingerprints generated for the queries depending upon the user profile.6. The authors describe the use of Petri-Nets for finding anomalies at the user task level. the authors discuss that time differences between multiple transactions in database systems can be used to detect malicious transactions when an intruder masquerades as a normal user. The authors use naive Bayes classifier to perform classification using features extracted from the SQL commands. In [88]. In [93].

there exist systems which analyze the web server access logs to detect malicious data and application accesses. The authors in [101] describe an anomaly based approach for detecting attacks against web applications by analyzing its internal state and learning the relationships between critical execution points and the internal states. the user behaviour is not fixed and changes overtime and. a system based on modeling user profiles results in a large false alarm rate. Additionally. however. generally. The system described in [100] performs application layer protocol analysis to detect intrusions. hence. Another drawback is that. Further. Systems such as [99] combine static and dynamic verification of web requests to ensure absence of particular kind of erroneous behaviour in web applications. Monitoring Web Access Logs Contrary to the systems which monitor data queries alone. may be unreliable. profile based systems employ threshold to determine the acceptable deviation in normal activities. these methods consider data access patterns in isolation of the events which generates the data request. In [94]. The authors in [95] describe a system which determines whether a query should be denied in order to protect the privacy of users by constructing auditors for ‘max’. first. This is because of two reasons. In [96]. the authors describe the use of audit logs for building user profiles. such systems are application specific and their deployment in different environments require recreating the set of rules applicable in the new domain. second. they have limited attack detection capability since they cannot detect attacks whose signatures are not available.32 Background tagged with time semantics that captures expectations about update rates which are unknown to attackers. the integrity constraints encoded in the data dictionary and the user profiles to define a distance measure which estimates the closeness of a set of attributes which are referenced together. They. normalization and pattern matching into a single phase and hence can be used to perform signature analysis efficiently. The authors claim that their protomatching approach improves the efficiency of the . Such approaches are. They consider both. Additionally. The thresholds are often determined empirically and. the authors describe a technique called protomatching which combines protocol analysis. ‘min’ and ‘sum’ queries. applicable only to data which is refreshed regularly. [97] and [98]. rule based and expensive to build. do not consider underlying data access and hence cannot detect a wide variety of attacks. In [102]. the authors describe Hippocratic Databases and present an auditing framework to detect suspicious queries which do not adhere to data disclosure policies. however. This is.

very often. The most closely related works to ours are [105]. This results in an efficient and an accurate system. intrusion detection must also be performed at the application level. we present our Unified Logging Framework in Chapter 5. Further. In [105]. .6. as opposed to the network packet level.6 Literature Review 33 Snort [48] intrusion detection system by up to 49%. hence. since we do not extract application specific signatures. Present application intrusion detection systems consider every event individually rather than considering a sequence of events resulting in a large number of false alarms and. Our work is different from this because we consider both the normal and the anomalous data patterns during training and build a classification system based on user session modeling. In [103]. 2. we introduce User Session Modeling using Unified Log for Application Intrusion Detection in Chapter 6. is not sufficient to detect attacks which are directed towards individual applications.2. a sequence of events must be followed. session modeling can be performed with our unified logging framework and attacks can be detected by monitoring only a small number of events in a sequence. In order to improve attack detection at the application level we are.4 Application Intrusion Detection Network monitoring. Their system uses ‘Apache’ web server to implement an audit data source which monitors the web server. our system models application-data interaction which does not depend upon a particular user and therefore does not change overtime. In [106]. instead of modeling user profiles. though significant. for an attack to be successful. and therefore our framework can be used in a variety of applications. [106] and [107]. Hence. To ameliorate this. The advantage of our framework is that it is application independent. while in [105] the authors use only the normal patterns during training to build an anomaly based system and analyze the events independently. the authors model network traffic into network sessions and packets to identify instances with high attack probability. We perform session modeling at the user application level. Further. the authors describe an anomaly based learning approach for detecting SQL attacks by learning profiles of the normal database accesses for web applications. We show that using conditional random fields. The authors in [104] describe a tool for performing intrusion detection at application level. and integrate the unified logging framework to build an application intrusion detection system. however. In order to detect such malicious application and data accesses. interested in analyzing the behaviour of a web application in conjunction with the underlying data accesses rather than analyzing them separately. poor attack detection accuracy.

let G = (V. The authors in [107] describe a two layer system in which the first layer generates pre alarms and the second layer makes the final decision to activate an alarm. Their system analyzes network sessions and network packets while. when conditioned on X. the random variables Yv obey the Markov property with respect to the graph: p(Yv | X. The training data is used to constrain this conditional distribution while ensuring maximum entropy and hence maximum uniformity. Y ) is a conditional random field.34 Background the authors describe anomaly detection techniques to detect attacks against web servers and web based applications by correlating the server side programs referenced by client queries with the parameters contained in the queries. The simplest conditional classifier is the Maxent classifier based upon maximum entropy classification which estimates the conditional distribution of every class given the observations. they build separate profiles using the two logs. so that Y is indexed by the vertices of G. w = v) = . We now give a brief description of the conditional random fields which is motivated from the work in [34]. 2. [111] and conditional random fields [34] are such conditional models. A comprehensive introduction to the conditional random fields is provided in Appendix A. Maxent classifiers [108]. E) be a graph such that Y = (Yv )v∈(V ) . Such models have been extensively used in natural language processing tasks and computational biology. Let X be the random variable over a data sequence to be labeled and Y be the corresponding label sequence.7 Conditional Random Fields Conditional models are probabilistic systems which are used to model the conditional distribution over a set of random variables. [110] maximum entropy Markov models [85]. We also compare our work with [103]. we model the user application sessions to detect malicious data accesses. Then ( X. Conditional models offer a better framework as they do not make any unwarranted assumptions on the observations and can be used to model rich overlapping features among the visible observations. Also. while in our system we combine the web server logs with the data access logs to detect malicious data accesses and use a moving window (of size more than one) to analyze a sequence of events. [109]. Their system primarily focuses on the web server logs to produce an anomaly score by creating profiles for every server side program and its features and then establishing their threshold. Even though the authors use both the web access logs and the data access logs. Yw .

µ2 . as in our case. λ2 .) ˜ from the training data D = ( xi . x3 .6 where x1 .1) where x is the data sequence. Generative models such as the Markov chains. x4 represents an observed sequence of length four and every event in the sequence is correspondingly labeled as y1 . the parameter estimation problem is to find the parameters θ = (λ1 . y4 . y). Further. inferring conditional probability p(y| x ) from the joint distribution.. y3 . a conditional random field is a random field globally conditioned on X. . Further. hidden Markov models and joint distribution have two disadvantages. First. y2 . x )) (2. Second. y|e .7 Conditional Random Fields 35 p(Yv | X. Yw .e. This results in reduced accuracy [113]. yi )i=1 with the empirical distribution p( x.. y is a label sequence.k ∑ λk f k (e. where w ∼ v means that w and v are neighbors in G. requires marginal distribution p( x ) which is difficult to estimate as the amount of training data is limited and the observation x contains highly dependent features.. x ) + v∈V. The features f k and gk are assumed to be given and fixed. µ1 . . the joint distribution over the label sequence Y given X has the form: pθ (y| x ) ∝ exp( e∈ E. using the Bayes rule. and y|s is the set of components of y associated with the vertices or edges in sub graph S. As a result strong independence assumptions are made to reduce complexity.. the joint distribution is not required since the observations are completely visible and the interest is in finding the correct class which is the conditional distribution p(y| x ).k ∑ µk gk (v. .6: Graphical Representation of a Conditional Random Field The prime advantage of conditional random fields is that they are discriminative models which directly model the conditional distribution p(y| x ). i. w ∼ v). y|v . The graphical structure of a conditional random field is represented in Figure 2. conditional random fields are undirected models and free from label bias and observation bias which are present in other conditional models [112].2. For a simple sequence (or chain) modeling. x2 . N y1 y2 y3 y4 x1 x2 x3 x4 Figure 2..

Finally. We first discussed the principles and assumptions involved in building intrusion detection systems and described the components of intrusion detection systems in detail. we drew similarities between intrusion detection and various tasks in computational linguistics. We then discussed methods which have been used for detecting intrusions. 2. The conditional random fields have proven to be very successful in such tasks. their underlying assumptions. allowing them to model arbitrary relationships among different features in the observations without making independence assumptions. In the next chapter. conditional random fields predict the label sequence y given the observation sequence x. determining secondary structures of protein sequences. part of speech tagging. . in this thesis. and their strengths and limitations with regards to their attack detection capability. text segmentation. We then presented various challenges and requirements for effective intrusion detection and presented a classification of intrusion detection systems. natural language processing and bio-informatics such as gene prediction. we explore the suitability of conditional random fields for building robust intrusion detection systems.36 Background Instead.8 Conclusions In this chapter. Hence. The task of intrusion detection can be compared to many problems in machine learning. we present our layered framework and describe how it can be used to build accurate and efficient anomaly and hybrid network intrusion detection systems. offer us the required framework to build effective intrusion detection systems. object recognition and many others. computational biology and motivated our approach to build intrusion detection systems based on conditional random fields. shallow parsing. Conditional random fields. We presented the literature review where we explored various frameworks and methods which have been used to build network and application intrusion detection systems. named entity recognition. thus. we presented the taxonomy of intrusion detection and explored the problem in detail.

we propose a layered framework for building intrusion detection systems which can be used. Signature based systems using pattern matching approaches can be used effectively and efficiently to detect previously known attacks in high speed networks. perimeter access control. However. only a single system is employed at every layer which is expected to detect attacks at that particular location. As a result.e. firewalls. an intrusion detection system must detect different type of attacks effectively and must operate efficiently in high traffic networks. However. for example. the number of previously unseen attacks is on a rise [10]. Another advantage of our Layer based Intrusion Detection System (LIDS) framework is that it is very general and easily customizable depending upon the specific requirements of individual networks. with the rapid increase in the number and type of attacks. anomaly and hybrid systems are used to detect previously unseen attacks and have been proven to be more reliable in 37 . network. host and application intrusion detection systems. 3. to build a network intrusion detection system which can detect a wide variety of attacks reliably and efficiently when compared to the traditional network intrusion detection systems.. data encryption and others.1 Introduction T WO significant requirements for building intrusion detection systems are broad attack detection coverage and efficiency in operation. a single system is not effective enough given the constraints of achieving high attack detection accuracy and high system throughput. Present networks are prone to a number of attacks. Hence.Chapter 3 Layered Framework for Building Intrusion Detection Systems Present networks and enterprises follow a layered defence approach to ensure security at different access levels by using a variety of tools such as network surveillance. even a slight variation in attacks may not be detected by a signature based system. i. a large number of which are previously known. However. Given this traditional layered defence approach.

The rest of the chapter is organized as follows.4 and compare the layered framework with others in Section 3.38 Layered Framework for Building Intrusion Detection Systems detecting novel attacks when compared with the signature based systems. data encryption and others and are deployed at different access points in a layered organizational security framework [115]. Our proposed framework is very general and can be easily customized by adding domain specific knowledge as per the specific requirements of the network in concern. we give motivating examples to highlight the significance of the layered framework for intrusion detection in Section 3. We highlight the advantages of our framework in Section 3. However. network intrusion detection systems are complemented by a variety of other tools such as network surveillance. A common practice to build anomaly and hybrid intrusion detection systems is to train a single system with labeled data to build a classifier which can then be used to detect attacks from a previously unseen test set. thereby. clustering based systems can be used to distinguish between legitimate and malicious packets. To maximize attack detection. Network monitoring using a network intrusion detection system is only a single line of defence in the traditional layered defence approach which aims to provide complete organizational security. firewalls.6. when labeled data is not available.3. Thus. giving flexibility in implementation. of anomaly and hybrid network intrusion detection systems. The attack detection coverage of the system is further affected when a single system is trained to detect different type of attacks. host and application intrusion detection systems.5. integrity and availability via a single system may not be possible due to several reasons including the complexity and the diverse type of attacks at the network level. particularly. At times. However. perimeter access control. file integrity checkers. a single anomaly detector is trained which is expected to accurately detect a variety of attacks and perform efficiently. . Ensuring high speed of operation further limits the deployment.2. This is because. Hence. a significant disadvantage of such systems is that they result in a large number of false alarms. Finally. for a network intrusion detection system monitoring the incoming and outgoing network traffic and ensuring confidentiality. the anomaly based systems still remain a bottleneck in the joint system. We then describe our layered framework in Section 3. In this chapter we propose a layered framework for building anomaly and hybrid network intrusion detection systems which can operate efficiently in high speed networks and can accurately detect a variety of attacks. various systems such as [55] and [114] employ both the signature based and the anomaly based systems together. we conclude this chapter in Section 3.

Consider for example. a single network intrusion detection system which is deployed to detect every network attack in a high speed network. One possible solution is having a number of sub systems each of which is specific in detecting a single category of attack (such as DoS. Probe and others). Denial of Service. for example the . a network intrusion detection system must differentiate between different types of attacks. Remote to Local and User to Root. given that the present networks are prone to a wide variety of attacks. based on our proposed framework a network intrusion detection system may consist of four layers. The figure represents an ‘n’ layer system where every layer in itself is a small intrusion detection system which is specifically trained to detect only a single type of attack. This is not only more effective in detecting individual classes of attacks. Hence.3. for effective attack detection. 3. For example. where the layers correspond to four different attack classes. Probe. When same features are used to detect the two attacks the accuracy decreases. Hence. Thus. Hence. A network is prone to different types of attacks such as the Denial of Service (DoS). we propose a layered framework for building efficient anomaly and hybrid intrusion detection systems where different layers in the system are trained independently to detect different type of attacks with high accuracy. using a single system would not only degrade performance but will also be less effective in attack detection. We note that the DoS and Probe attacks are different and require different features for their effective detection. However. using a single system is not a viable option.1 represents our framework for building Layer based Intrusion Detection Systems (LIDS). Probe and others.2 Motivating Examples Anomaly and hybrid intrusion detection systems typically employ various data mining and machine learning based approaches which are inefficient when compared to the signature based systems which employ pattern matching. but it also results in an efficient system. it becomes critical to search for methods which can be used to build efficient anomaly and hybrid intrusion detection systems.2 Motivating Examples 39 3.3 Description of our Framework Figure 3. The number of sub systems to be used can be determined by analyzing the potential risks and the availability of resources at individual installations. It also makes the system bulky which affects its speed of operation.

all the layers have the same load. which blocks anomalous connection as soon as they are detected in a particular layer. Depending upon the security policy of the network.40 Layered Framework for Building Intrusion Detection Systems All Features Layer One Feature Selection Intrusion Detection Sub System Normal No Block Yes Layer Two Feature Selection Intrusion Detection Sub System Normal No Block Yes Layer n Feature Selection Intrusion Detection Sub System Normal No Block Yes Allow Figure 3. However. A common disadvantage of using a modular approach. every layer can be trained with only a small number of features which are significant in detecting a particular class of attack. one after the other. it performs efficiently. some features may be present in more than one layer. Second. the size of the sub system remains small and hence. is that it increases the communication overhead among the modules (sub systems). the . However.1: Layered Framework for Building Intrusion Detection Systems DoS attack. In the worst case. when no attacks are detected until at the last layer. As a result. similar to our layered framework. A number of such sub systems are then deployed sequentially. thereby providing a quick response to intrusion and simultaneously reducing the analysis at subsequent layers. The amount of audit data analyzed by the system is more at the first layer and decreases at subsequent layers as more and more attacks are detected and blocked. A number of such layers essentially act as filters. first. this can be easily eliminated in our framework by making every layer completely independent of every other layer. It is important to note that a different response may be initiated at different layers depending upon the class of attack the layer is trained to detect. This serves dual purpose. every layer can simply block an attack once it is detected without the need of a central decision maker.

‘whether or not a user is logged in’. Hence. Consider for example. hence. For example.3 Description of our Framework 41 overall load for the average case is expected to be much less since attacks are detected and blocked at every subsequent layer. if the layers are arranged in parallel rather than in a sequence. For such an environment. However. ‘number of bytes from source to destination’. ‘number of files accessed’ and many others. the priority is to simply detect network scans as opposed to detecting malicious data accesses.1 Components of Individual Layers Given that a network is prone to a wide variety of attacks.3. As a result. a number of similar attacks can be grouped together and represented as a single attack class. to detect Probe attacks. the load at every sub system is same and is equal to that of the worst case in the sequential configuration. the number of layers in our framework can be easily customized depending upon the identified threats and the availability of resources. to detect a single attack class. ‘number of bytes from destination to source’. the layered framework is very general and the number of layers in the overall system can be adjusted depending upon the individual requirements of the network in concern. Additionally. only a single layer which can reliably detect the Probe attacks is sufficient. ‘Smurf’ and ‘Neptune’ result in Denial of Service and. it is often not feasible to add a separate layer to detect every single attack. a large number of features can be monitored. the initial layers in the sequential configuration can be replicated to perform load balancing in order to improve performance. For example. both. Even though the number of layers and the significance of every layer in our framework depend upon the target network. features such as the ‘protocol’ and ‘type of service’ are significant . Every layer in our framework corresponds to a sub system which is trained independently to detect attacks belonging to a single attack class. can be detected at a single layer rather than at two different layers. a data repository which is a replica of a real-time application data and which does not provide any online services. Feature Selection Component – In order to detect intrusions. To ensure security of this data. 3. Using more features than required makes the system inefficient.3. ‘type of service’. On the contrary. ‘number of root accesses’. Additionally. only a small set of these features is required at every layer. the total number of layers in our framework remains small. every layer has two significant components: 1. However. These features include ‘protocol’.

Implementing pipelining.4 Advantages of Layered Framework We now summarize the advantages of using our layered framework. • Our framework is easily customizable and the number of layers can be adjusted depending upon the requirements of the target network. which are more effective in detecting attacks can be easily incorporated in our framework. • Using our layered framework improves attack detection accuracy and the system can detect a wide variety of attacks by making use of the domain specific knowledge. can significantly improve the performance by reducing the multiple I/O operations to a single I/O operation since all the features can be read in a single operation and analyzed by different layers in the layered framework. such as conditional random fields as we will discuss in the following chapters. the response unit can provide adequate intrusion response depending upon the security policy. In order to take advantages of our proposed framework. thereby. A prime advantage of our framework is that newer methods. Intrusion Detection and Response Sub System – The second component in every layer is the intrusion detection and response unit. every layer must contain both of the above mentioned components. decision trees. A variety of previously well known intrusion detection methods such as the naive Bayes classifier. 2. support vector machines and others can be used. using our layered framework opens avenues to perform pipelining resulting in very high speed of operation. Different methods can be seamlessly integrated in our framework to build effective intrusion detectors. • The layered framework does not degrade system performance as individual layers are independent and are trained with only a small number of features. our framework is not restrictive in using a particular anomaly or hybrid detector. 3. Finally. Additionally. once an attack is detected. particularly in multi core processors. • Our framework is not restrictive in using a single method to detect attacks.42 Layered Framework for Building Intrusion Detection Systems while features such as ‘number of root accesses’ and ‘number of files accessed’ are not significant. To detect intrusions. . resulting in an efficient system.

5 Comparison with other Frameworks Ensuring continuity of services and security of data from unauthorized disclosure and malicious modifications are critical for any organization. specific intrusion response mechanisms can be activated for different attacks. the authors describe the use of data mining algorithms to compute activity patterns from system audit data to extract features which are then used to generate rules to detect intrusions. network. Further. event generators. data encryption and others which are deployed at different access points in a layered security framework. For this. No single tool can provide enterprise wide security and hence. As a result. However. a layered defence approach is often employed to provide security at the organizational level. However. [84]. but can . event Databases and the response units. The same approach can be applied for building an intrusion detection system based on our layered framework. In the data mining framework for intrusion detection. Our framework fits well in the traditional layered defence approach and can be used to develop effective and efficient network intrusion detection systems. host and application intrusion detection systems. a number of different security tools are deployed. This traditional layered defense approach incorporates a variety of security tools such as the network surveillance. event Analyzers. firewalls. 3. as discussed earlier. presented in the Common Intrusion Detection Framework [45] can be defined for every intrusion detection sub system in our layered framework. perimeter access control. Our framework can not only seamlessly integrate the use of data mining technique for intrusion detection.3. • Our framework has the advantage that the type of attack can be inferred directly from the layer at which it is detected.. we present a layered framework for building intrusion detection systems. The traditional layered architecture is perceived as a framework for ensuring complete organizational security rather than as an approach for building effective and efficient intrusion detection systems. providing a desired level of security at the enterprise level can be challenging. the four components viz. Figure 3. file integrity checkers.2 represents the traditional layered defence approach.5 Comparison with other Frameworks 43 • Our proposed layered framework for building effective and efficient network intrusion detection systems fits well in the traditional layered defence approach for providing network and enterprise level security.

Combination of classifiers is expensive with regards to the processing time and decision making. The authors in [116] describe the combination of ‘strong’ classifiers using stacking where decision tress. The authors show that a number of such classifiers when combined by using simple majority voting mechanism provide good classification. centralized decision making systems often tend to be complex and slow in operation. In [117]. [114]. not based upon classifier combination. The authors show that the output from these classifiers can be combined to generate a better classifier rather than selecting the individual best classifier. In addition. A number of other frameworks have been proposed which describe the use of classifier combination [55].44 Layered Framework for Building Intrusion Detection Systems Surveillance Perimeter Security (Network Access Control) Network Security Host Security (Infrastructure Protection) Content Management Business Continuity Application Security Data Security Figure 3. [116]. however. naive Bayes and a number of other classification methods are used as base classifiers. the authors use a combination of ‘weak’ classifiers where the individual classification power of weak classifiers is slightly better than that of random guessing. [117]. In [55] and [114]. . the authors apply a combination of anomaly and misuse detectors for better qualification of analyzed events. Our framework is.2: Traditional Layered Defence Approach to Provide Enterprise Wide Security also help to improve its performance by selecting only a small number of significant features for building separate intrusion detection sub systems which can be used to effectively detect different classes of attacks at different layers.

6 Conclusions In this chapter. We compared our framework with other well known frameworks and highlighted its specific advantages. . we presented our layered framework for building effective and efficient intrusion detection systems. our framework can be used to build efficient anomaly and hybrid network intrusion detection systems. We then integrate the trained (sub) systems into our layered framework to build accurate and efficient network intrusion detection systems which are not based on attack signatures. In our framework. In addition to improving the attack detection accuracy and detecting a variety of attacks. the results from individual classifiers at a layer are not combined at any later stage and. our system is based upon serial layering of multiple hybrid detectors which are trained independently and which operate without the influence of any central controller. Experimental results demonstrate that our system outperforms other well known approaches for intrusion detection. is scalable and can be easily customized depending upon the specific requirements of a network. using a stacked system is expensive when compared to the sequential model. hence. since the layers are independent they can be trained separately and deployed independently. we will show that an intrusion detection system based on our layered framework performs better and is more efficient when compared with individual systems as well as with systems based on classifier combination. 3. As already discussed.3. From our experimental results in the following chapters. Given the layered framework. we first demonstrate the effectiveness of conditional random fields to build intrusion detection sub systems which are individually trained to effectively detect a single attack class.6 Conclusions 45 The only purpose of classifier combination is to improve accuracy. in the next chapter. In addition. an attack is blocked at the layer where it is detected. Rather. In particular our framework can identify the class of an attack once detected. There is no communication overhead among the layers and the central decision maker which results in an efficient system.

.

1 Introduction I NCREASING network bandwidth has enabled a large number of services to be provided over a network. however. High speed of communication and increasing complexity in systems has. they are expensive in operation.Chapter 4 Layered Conditional Random Fields for Network Intrusion Detection Ever increasing network bandwidth poses a significant challenge to build efficient network intrusion detection systems which can detect a wide variety of attacks with acceptable reliability. first. discussed in previous chapter. Pattern matching systems operate on signatures extracted from previously known attacks and are limited in detecting only the attacks with known signatures. In order to operate in high traffic environment. Anomaly and hybrid intrusion detection systems suffer from two major disadvantages. Anomaly and hybrid intrusion detection systems. can also detect previously unseen attacks. however. However. As a result. such systems analyze summarized data instead of monitoring a sequence of events. in addition to detecting previously known attacks. We then integrate the layered framework. we first develop better hybrid intrusion detection methods which are not based on attack signatures and which can detect a wide variety of attacks with very few false alarms. intrusion detection systems are either signature based which perform pattern matching or operate on summarized audit patterns which are collected regularly at predefined intervals. made it difficult to detect intrusive activities in real-time. signature based systems have obvious disadvantages. present network intrusion detection systems are often signature based. In order to operate in high speed networks. such systems are inefficient and suffer from a large false alarm rate. 4. However. As a result. anomaly and hybrid intrusion detection systems must be used to detect novel attacks. to build a single system which is effective in attack detection and which can also perform efficiently in high traffic environment. To ameliorate these drawbacks. 47 .

7 by introducing noise in the training data. K-means clustering.8% improvement) and Remote to Local (R2L) attacks (34. We analyze the robustness of our system in Section 4.3. we compare our results with other methods such as decision trees. we focus on building accurate hybrid intrusion detection systems which can perform efficiently in high speed network environment. and demonstrate that a single system based on our framework is more effective than previously well known methods for network intrusion detection. we then integrate the layered framework. Impressive part of our results is the percentage improvement in attack detection accuracy. a single system has limited attack detection coverage and it cannot detect a wide variety of attacks reliably. when compared with other methods. in Section 4. we draw conclusions and highlight the advantages of layered conditional random fields for network intrusion detection in Section 4.5% improvement).2 we motivate the use of conditional random fields for intrusion detection which can model complex relationships between different features in the data set. We first develop hybrid intrusion detection systems based on conditional random fields which can detect a wide variety of attacks and which result in very few false alarms. Experimental results on the benchmark KDD 1999 intrusion data set [12] and comparison with other well known methods such as decision trees and naive Bayes show that our approach based on layered conditional random fields outperform these methods. naive Bayes classifier.5 we give details of the experiments performed and describe the implementation of our integrated system. To improve the efficiency of the system. multi layer perceptron. We then describe the data set used in our experiments in Section 4. principle component analysis and approaches based on classifier combination. which are known to perform well for intrusion detection. Statistical tests also demonstrate higher confidence in detection accuracy with layered conditional random fields. as discussed in the previous chapter.6.48 Layered Conditional Random Fields for Network Intrusion Detection they generate a large number of false alarms and second.4. Rest of the chapter is organized as follows. support vector machines. Hence. Further. In Section 4.8. Finally. they are expensive in operation. even when trained with noisy data. both. particularly. In Section 4. in terms of. we describe how conditional random fields can be used for effective intrusion detection followed by the algorithm to integrate the layered framework with conditional random fields to build an effective and an efficient network intrusion detection system. We also show that our system is robust and can detect attacks with higher accuracy. . for User to Root (U2R) attacks (34. accuracy of attack detection and efficiency of operation. In Section 4. in this chapter.

the audit data provides significant details which help in improving classification. as we will show from our experiments. When this feature is analyzed in isolation. Methods. thereby improving the attack detection accuracy. a particular user may or may not have privileges to create files in the system or the system may detect anomalous activity by calculating deviation in the current profile and then comparing it with the previously saved profile for that particular user. but it can also correctly detect R2L and U2R attacks. As a result. for example. However. the anomaly and hybrid intrusion detection systems generally operate on summarized audit patterns. they are represented with multiple features which are correlated and complex relationships exist between them. . When these features are analyzed in isolation they do not provide significant information which can help in detecting attacks. This is because. We thus explore the effectiveness of conditional random fields which can effectively model such relationships and compare their performance with other well known approaches for intrusion detection. Consider. overloaded with large amount of network traffic. if considered by an intrusion detection system during classification can significantly decrease classification error. Consider another network intrusion detection system which analyzes connection level feature such as ‘service invoked at the destination’ in order to detect attacks.4. Such relationships between different features in the observed data. analyzing these features together can provide meaningful information for classification. However. thus. the system is limited in detecting only Probe attacks. To detect intrusions effectively. However. when audit patterns are summarized. would perform better when compared with methods which consider the features to be independent such as the naive Bayes classifier. if this information is analyzed in combination with other features such as ‘protocol type’ and ‘amount of data transferred between the source and the destination’. In this case. the system may not only detect Probe attacks. it is significant only when an attacker requests for a service that is not available at the destination and the system may then tag the connection as a Probe attack. if these features are not considered to be independent.2 Motivating Examples 49 4. these features must not be considered independently. if the features are considered to be independent. such as conditional random fields. a network intrusion detection system which uses two features ‘logged in’ and ‘number of file creations’ to classify network connections as either normal or attack. particularly in high speed networks. However.2 Motivating Examples Network intrusion detection systems operate at the periphery of the networks and are. which can capture relationships among multiple features.

The data set is a version of the 1998 DARPA intrusion detection evaluation program. we consider every record to be independent of every other record. Table 4. Each record in the data set represents a connection between two IP addresses. Probe.020 Test Set 60.458 1. All of the 24 attacks can be grouped into one of the four classes. It is important to note that the test data includes specific attacks which are not present in the training data. prepared and managed by the MIT Lincoln Labs.50 Layered Conditional Random Fields for Network Intrusion Detection 4.853 16. Table 4.1: KDD 1999 Data Set Training Set Normal Probe DoS R2L U2R Total 97. This leads to 494.029 . starting and ending at some well defined times with a well defined protocol. In our experiments. Further.126 52 494.3 Data Description We perform our experiments with the benchmark KDD 1999 intrusion data set [12]. hence in our experiments.166 229. The training data is either labeled as normal or as one of the 24 different kinds of attack.277 4. The data set contains about five million connection records as the training data and about two million connection records as the test data. unauthorized access from a remote machine or Remote to Local (R2L) and unauthorized access to root or User to Root (U2R).1 gives the number of instances for every class in the data set.107 391.349 68 311.029 test instances. every record represents a separate connection and. Denial of Service (DoS). This makes the intrusion detection task more realistic [12]. we use the ten percent of the total training data and ten percent of the test data (with corrected labels) which are provided separately. with 41 different features. Similarly the test data is also labeled as either normal or as one of the attacks belonging to the four attack classes.020 training and 311.593 4.

we used the KDD 1999 data set described in Section 4. Conventional methods. a total of 41. for every session in relational form with only one label for the entire record. feature weights are learnt and during testing. such as decision trees and naive Bayes. this increases complexity. attack.1 represents how conditional random fields can be used for detecting network intrusions. however. ‘service’.4. are known to perform well in such an environment. we represent the audit data in the form of a sequence and assign label to every feature in the sequence using the first order Markov assumption instead of assigning a single label to the entire observation. Though.4 Methodology 51 4. ‘protocol’. ‘flag’ and ‘source bytes’ are used to discriminate between attack and normal events. The KDD 1999 data set represents multiple features. attack. we integrate the layered framework. attack attack attack attack attack normal normal normal normal normal duration =0 protocol = icmp service = eco_i flag = SF src_byte =8 duration =0 protocol = tcp service = smtp flag = SF src_byte = 4854 (a) Attack Event (b) Normal Event Figure 4. The features take some possible value for every connection which are then used to determine the most likely sequence of labels < attack. normal >. [110]. To manage complexity and improve system’s performance. In our experiments. However.1: Conditional Random Fields for Network Intrusion Detection In the figure. attack.4 Methodology Given the network audit patterns where every connection between two hosts is presented in a summarized form with 41 features. attack > or < normal. they assume observation features to be independent. features are evaluated .3. described in the previous chapter. Figure 4. our objective is to detect most of the anomalous connections while generating very few false alarms. normal. using a conditional model would result in a maximum entropy classifier [108].normal. normal. with the conditional random fields to build a single system which is more efficient and more effective. We propose to use conditional random fields which can capture the correlations among different features in the data and hence perform better when compared with other methods. In this case. During training. observation features ‘duration’. it also improves the attack detection accuracy. Custom feature functions can be defined which describe the relationships among different features in the observation.

14 in Section 4. . considering every attack class separately not only improves the attack detection accuracy but also helps to improve the overall system performance when integrated with the layered framework. DoS. As we will see from our experimental results. We also note that in the KDD 1999 data set. Using conditional random fields improve the attack detection accuracy particularly for the U2R attacks. Furthermore. clearly suggest that conditional random fields can effectively model such relationships among different features of an observation resulting in higher attack detection accuracy. Probe. Our experimental results.5.52 Layered Conditional Random Fields for Network Intrusion Detection for the given observation which is then labeled accordingly. R2L and U2R) instead of two. making their deployment impractical in high speed networks. as in case of system call modeling. the attacks belonging to all the four attack classes can be re-labeled as attack and mixed with the audit patterns belonging to the normal class to build a single model which can be trained to detect any kind of attack. R2L and the DoS attacks. as a two class problem. Nonetheless. It is evident from the figure that every input feature is connected to every label which indicates that all the features in an observation determine the final labeling of the entire sequence. is to use only the attacks belonging to a single attack class mixed with audit patterns belonging to the normal class to train a separate sub system for all the four attack classes. Present intrusion detection systems do not consider such relationships. this is one time process and given the critical nature of the problem of intrusion detection. or assume independence among different features in an observation. as in case of a naive Bayes classifier. However. They are also effective in detecting the Probe. They either consider only one feature. where a single system is trained with five classes (normal. attacks can be represented in four classes. However. Such a system can easily identify an attack once it is detected but is very slow in operation. Another approach for considering the same problem. described in Section 4. Probe. a drawback of this implementation is that it requires domain knowledge to perform feature selection for every layer. Thus. when we consider all the 41 features in the data set for each of the four attack classes separately. The problem can also be considered as a five class classification problem. DoS. if domain knowledge can help to improve the attack detection accuracy it is recommended to do so. particularly from Table 4.6. R2L and U2R. it also helps to identify the class of an attack once it is detected at a particular layer in the layered framework. In order to consider this as a two class classification problem. a conditional random field can model dependencies among different features in an observation.

e. [119] is employed which has a complexity of O( TL2 ). the time complexity for training a conditional random field is O( TL2 N I ) where T is the length of the sequence. hence for better attack detection. All Features Probe Layer Feature Selection Audit Data (Normal + Probe) Normal No Block Yes Allow Figure 4. As a result.4. We therefore select different features for different layers based upon the type of attack the layer is trained to detect. the Viterbi algorithm [118]. Experimental results clearly suggest that feature selection significantly improves the .4 Methodology 53 conditional random fields can be expensive during training and testing. Hence. we represent a detailed view of a single layer (Probe layer) which can be used to detect Probe attacks in our integrated system.. it becomes necessary to consider them separately. 4. we select different features to train different layers in our framework. Other layers can be trained similarly. in our layered system.4. However. The quadratic complexity is significant when the number of labels is large as in language tasks. During inference. L is the number of labels. the length of the sequence. we train every layer separately to optimally detect a single class of attack. i.2: Representation of Probe Layer with Feature Selection The Probe layer is optimally trained to detect only the Probe attacks. We further improve the overall system performance by implementing the layered framework and performing feature selection which decreases T.1 Feature Selection Attacks belonging to different classes are different and. For a simple linear chain structure. In Figure 4. Note that. for intrusion detection there are only two labels normal and attack and. N is the number of training instances and I is the number of iterations.2. We now describe feature selection for all the four attack classes. thus. our system is very efficient. we use only the Probe attacks and the normal instances from the audit data to train this layer.

to detect R2L attacks. From all the 41 features in the KDD 1999 data set. Since every layer in our framework is independent. network traffic features such as the ‘percentage of connections having same destination host and same service’ and packet level features such as the ‘source bytes’ and ‘percentage of packets with errors’ are significant. Probe Layer – Probe attacks are aimed at acquiring information about the target network from a source which is often external to the network. DoS Layer – DoS attacks are meant to prevent the target from providing service(s) to its users by flooding the network with illegitimate requests. experimental results in Section 4. ‘service requested’ and the host level features such as the ‘number of failed login attempts’ among others. Hence. We list the features used for all the four layers in Appendix B. to detect attacks at the DoS layer. we selected features such as ‘number of file creations’. we select only five features for Probe layer. basic connection level features such as the ‘duration of connection’ and ‘source bytes’ are significant. 4. 14 features for R2L layer and eight features for U2R layer. the network level features such as the ‘duration of connection’. Such attacks are often content based and target an application. To detect DoS attacks. we selected both. R2L Layer – R2L attacks are one of the most difficult attacks to detect as they involve both.54 Layered Conditional Random Fields for Network Intrusion Detection attack detection capability of our system.6. 1. Hence. Ideally. We now describe our approach for selecting features for every layer and why some features were chosen over others. . it may not be important to know whether a user is ‘logged in or not’ and hence. we use domain knowledge to select features for all the four attack classes. such features are not considered in the DoS layer. the network level and the host level features. while features like ‘number of file creations’ and ‘number of files accessed’ are not expected to provide information for detecting Probe attacks. U2R Layer – U2R attacks involve the semantic details which are very difficult to capture at an early stage at the network level. Hence. However. Hence. 3.2 suggest that present methods for automatic feature selection are not effective. while we ignored features such as ‘protocol’ and ‘source bytes’. nine features for DoS layer. we would like to perform feature selection automatically. Hence for detecting U2R attacks. 2. feature sets for all the four layers are not disjoint. ‘number of shell prompts invoked’.

4.4 Methodology

55

4.4.2 Integrating the Layered Framework
The layered framework, introduced in Chapter 3, is general and can be tailored to build specific intrusion detection systems. In this section, we describe how we can integrate the layered framework with the conditional random fields to build an effective and an efficient hybrid network intrusion detection system. Given the four different attack classes in the KDD 1999 data, we implement a four layer system where every layer corresponds to a single attack class. The four layers are arranged in a sequence as represented in Figure 4.3.

All Features

Probe Layer Feature Selection

Normal No Block

Yes

DoS Layer Feature Selection

Normal No Block

Yes

R2L Layer Feature Selection

Normal No Block

Yes

U2R Layer Feature Selection

Normal No Block

Yes

Allow

Figure 4.3: Integrating Layered Framework with Conditional Random Fields In the system, every layer is trained separately with the normal instances and with the attack instances belonging to a single attack class. The layers are then arranged one after the other in a sequence as shown in Figure 4.3. However, during testing, all the audit patterns (irrespective of their attack class, which is unknown) are passed into the system starting from the first layer. If the layer detects the instance as an attack, the system labels the instance as a Probe attack and initiates the response mechanism; otherwise it passes the instance to the next layer. Same process is repeated at every layer until either an instance is detected as an attack or it reaches the last layer where the instance is labeled as normal if no attack is detected. We now give the algorithm to integrate the layered framework with conditional random fields.

56

Layered Conditional Random Fields for Network Intrusion Detection

Algorithm: Integrating Layered Framework & Conditional Random Fields Algorithm 1 Training 1: Select the number of layers, n, for the complete system. 2: Separately perform features selection for each layer. 3: Train a separate model with conditional random fields for each layer using the features selected from Step 2. 4: Plug in the trained models sequentially such that only the connections labeled as normal are passed to the next layer.

Algorithm 2 Testing 1: For each (next) test instance perform Steps 2 through 5. 2: Test the instance and label it either as attack or normal. 3: If the instance is labeled as attack, block it and identify it as an attack represented by the layer name at which it is detected and go to Step 1. Else pass the sequence to the next layer. 4: If the current layer is not the last layer in the system, test the instance and go to Step 3. Else go to Step 5. 5: Test the instance and label it either as normal or as an attack. If the instance is labeled as an attack, block it and identify it as an attack corresponding to the layer name.

4.5 Experiments and Results
For our experiments, we use the conditional random field toolkit CRF++ [120] and the Weka tool [121]. We develop python and shell scripts for data formatting and implementing the layered framework and perform all of our experiments on a desktop running with Intel(R) Core(TM) 2, CPU 2.4 GHz and 2 GB RAM under exactly the same conditions. In our experiments we perform hybrid detection, i.e., we use both normal and anomalous audit patterns to train the model in a supervised learning environment. We perform our experiments ten times and report the best, the average and the worst cases. To measure the efficiency of attack detection, we consider only the test time efficiency since the real-time performance of an intrusion detection system depends upon the test time efficiency alone. We observe that our system based on layered framework and conditional random fields, which we refer to as the “Layered Conditional Random Fields”, is very efficient during testing. The time required to test every instance when we consider all the 41 features for all the four layers is 0.2236 ms. This reduces to 0.0678 ms when we perform feature selection and implement the layered framework. More details are presented in the following sections.

4.5 Experiments and Results

57

4.5.1 Building Individual Layers of the System
To determine the effectiveness of conditional random fields for intrusion detection we perform two set of experiments. From the first experiment, we examine the accuracy of conditional random fields and compare them with other techniques which are known to perform well. In this experiment we use all the 41 features to make a decision. We observe that the conditional random fields perform very well particularly for detecting U2R attacks while the decision trees achieve higher attack detection for the Probe and R2L attacks. The difference in attack detection accuracy for DoS attacks is not significant. The reason for better accuracy for decision trees is that they perform feature selection and use only a small set of features in the final model. Hence, we perform our second experiment where we select a subset of features for all the four layers separately as discussed earlier in Section 4.4.1. For our experiments, we divided the training data into five different classes; normal, Probe, DoS, R2L and U2R. Similarly, we divided the test data into five classes. As discussed in Section 4.4, we perform experiments separately for all the four attack classes by randomly selecting data corresponding to that particular attack class and normal data only. For example, to detect Probe attacks, we train and test the system with Probe attacks and normal audit patterns only. We do not add other attacks such as DoS, R2L and U2R in the training data when training the sub system to detect Probe attacks. Not including other attacks allow the system to better learn features specific to the Probe attacks and normal events. Hence, for four attack classes we train four independent models, separately, with and without feature selection to compare their performance. We perform similar experiments with decision trees and naive Bayes. We call the models as layered conditional random fields, layered decision trees and layered naive Bayes when we perform feature selection. For better comparison and readability, we present the results for the two experiments for all the four layers together.

Detecting Probe Attacks To detect Probe attacks, we train our system by randomly selecting 10,000 normal records and the entire Probe records from the training data. For testing the model, we select all the normal and Probe records from the test data. Hence, we have about 15,000 training and 64,759 test instances. 1. Experiments with all 41 Features – In Table 4.2, we give the results for detecting Probe

58

Layered Conditional Random Fields for Network Intrusion Detection attacks when we use all the 41 features for training and testing in the first experiment. The table represents that the system takes a total of 14.53 seconds to label all the 64,759 test instances. Results suggest that decision trees are more efficient than conditional random fields and naive Bayes. This is because they have a small tree structure, often with very few decision nodes, which is very efficient. The attack detection accuracy is also higher for the decision trees since they select the best possible features during tree construction. However, when we perform feature selection, the layered conditional random fields achieve much higher accuracy and there is significant improvement in train and test time efficiency. Table 4.2: Detecting Probe Attacks (with all 41 Features) Precision (%) Conditional Random Fields Naive Bayes Decision Trees Best Average Worst Best Average Worst Best Average Worst 84.60 82.53 80.44 73.20 72.26 71.20 93.20 87.36 85.50 Recall (%) 89.94 88.06 86.13 97.00 96.65 96.30 97.70 95.73 90.90 F-Measure (%) 86.73 85.21 83.19 83.30 82.70 81.90 95.40 91.34 88.80 Train (sec.) Test (sec.)

200.6

14.53

1.08

6.31

2.04

2.40

2. Experiments with Feature Selection – In the second experiment, we use the same data as used in previous experiment, however, we perform feature selection in this experiment. We give the results for detecting Probe attacks with feature selection in Table 4.3. The table suggests that the layered conditional random fields perform better and faster than the previous experiment and are the best choice for detecting Probe attacks. The system takes only 2.04 seconds to label all the 64,759 test instances. We observe that there is no significant advantage with respect to time for the layered decision trees. This is because the size of the final tree with decision trees and with layered decision trees is not significantly different, resulting in similar efficiency. We also observe that the Recall and, hence, the F-Measure for layered naive Bayes decreases drastically. This can be explained as follows; the classification accuracy with naive Bayes generally improves as the number of features

82 96.45 1.54 1. The results from Table 4.3 clearly suggest that layered conditional random fields are a better choice for detecting Probe attacks.000 normal records and 4. For testing.70 97. its classification accuracy decreases. However.80 Train (sec.41 95. we select all the normal and DoS records from the test set.91 2. if the number of features increases to a very large extent.5 Experiments and Results 59 increases.93 90. we have 24.00 Detecting DoS Attacks We randomly select 20.04 86.13 0. The results show that all the three methods have similar attack detection accuracy.72 88. when we use all the 41 features.20 F-Measure (%) 93.000 DoS records from the training data to train the system to detect DoS attacks.92 78.82 33.73 89. Experiments with Feature Selection – To detect DoS attacks with feature selection we perform experiments on the same data used in the previous experiment.04 0. decision trees give a slight advantage with regards to the test time efficiency.4.000 training instances and 290.60 Recall (%) 98. Experiments with all 41 Features – In Table 4.68 92. we give the results for detecting DoS attacks when we use all the 41 features.30 91.70 92. however. Table 4.4.42 seconds to label all the 290.23 74.03 97.57 17.22 27.80 77. Hence.50 87. 1. the estimation tends to become unreliable. naive Bayes performs well but when we perform feature selection and use only five features.3: Detecting Probe Attacks (with Feature Selection) Precision (%) Layered Conditional Random Fields Layered Naive Bayes Layered Decision Trees Best Average Worst Best Average Worst Best Average Worst 89. As a result. The table represents that the system takes a total of 64. 2.446 test instances.48 21.) Test (sec. but we perform .00 97.60 31.) 6.30 19.19 82.446 test instances.70 87.

) Test (sec.20 97.05 96.00 97.09 9.11 64.446 test instances.60 98.5: Detecting DoS Attacks (with Feature Selection) Precision (%) Layered Conditional Random Fields Layered Naive Bayes Layered Decision Trees Best Average Worst Best Average Worst Best Average Worst 99.99 97.60 Layered Conditional Random Fields for Network Intrusion Detection Table 4. Table 4.60 98.48 98.19 98.82 99. Considering the test time efficiency. The results follow the same trend as in the previous experiment.68 6. however.4: Detecting DoS Attacks (with all 41 Features) Precision (%) Conditional Random Fields Naive Bayes Decision Trees Best Average Worst Best Average Worst Best Average Worst 99.78 99.70 F-Measure (%) 98.40 Train (sec.20 98.10 97. the real advantage of feature selection is seen in terms of improvement in the test time performance.59 15.10 98. In this experiment.79 26.53 98.10 98.50 98.00 97.17 98.11 97.05 97.90 Recall (%) 97.39 99.17 0.01 97.) 256.50 98.40 99. this increase is not significant.00 F-Measure (%) 98.37 98.00 96.00 97.30 99.87 .90 99.90 99. It is important to note that there is slight increase in the detection accuracy when feature selection is performed.31 3.50 1.32 99.90 Recall (%) 97.00 97.99 99.43 98. the system takes only 15.20 98. Table 4.42 1.) Test (sec.75 99.98 99.00 97.5 presents the results.46 98.40 98.40 99.30 Train (sec. With feature selection. layered decision trees are a better choice for detecting DoS attacks.00 97.90 99.30 99.97 99.28 6.) 26.04 feature selection in this experiment.90 99.12 97.30 97.17 seconds to label all the 290.

i.) 23..96 seconds to test all the 76.6 suggests that decision trees have higher F-Measure.40 12.000 normal records and all the R2L records from the training data. we give the results for detecting R2L attacks when we use all the 41 features.54 74. we perform feature selection in this experiment.89 13. Table 4.00 53. we select all the normal and R2L records from the test set.40 37. we have about 2.63 5. but the conditional random fields have higher Precision when compared with other methods.30 98.42 25.942 test instances. decision trees (increase of about 17%).30 84.6: Detecting R2L Attacks (with all 41 Features) Precision (%) Conditional Random Fields Naive Bayes Decision Trees Best Average Worst Best Average Worst Best Average Worst 93.6. Experiments with Feature Selection – In the second experiment.60 2.10 23. a system using conditional random fields generates less false alarms.81 15. .33 0.62 18.e.10 70.67 92.16 0. Hence.40 F-Measure (%) 28.29 10.35 90.16 seconds. 1.942 test instances. Experiments with all 41 Features – In Table 4. we randomly select 1.38 7. the system take 17.942 test instances and the layered conditional random fields perform much better than conditional random fields (increase in F-Measure of about 60%). To test the model. layered decision trees (increase of about 125%).75 2. Layered condition random fields take slightly more time which is acceptable as they achieve much higher attack detection accuracy.000 training instances and 76.42 7. We observe that to test all the 76.4. From the results in Table 4.68 63. we observe that the system takes only 5.03 61. layered naive Bayes (increase of about 250%) and naive Bayes (increase of about 250%) and are the best choice for detecting R2L attacks. we use the same data as used in the previous experiment.) Test (sec. Table 4.94 21.40 17.5 Experiments and Results Detecting R2L Attacks 61 For training our system to detect R2L attacks.7.30 Train (sec.70 Recall (%) 16.40 6. however.10 12.12 10.20 35.

98 7.84 94.10 14. Table 4.23 13.30 81.) Test (sec.661 test instances. 1.31 2. we give the results for detecting U2R attacks when we use all of the 41 features.37 88.80 24. The F-Measure for conditional random fields is more than 150% with respect to the decision trees and more than 600% with respect to the naive Bayes.7: Detecting R2L Attacks (with Feature Selection) Precision (%) Layered Conditional Random Fields Layered Naive Bayes Layered Decision Trees Best Average Worst Best Average Worst Best Average Worst 95. clearly shows that conditional random fields are far better for detecting U2R attacks when compared with other methods. we have about 1.99 0.000 normal records and all the U2R records from the training data. The system takes 13.48 78. Experiments with all 41 Features – In Table 4.08 39.62 Layered Conditional Random Fields for Network Intrusion Detection Table 4.) 5. We observe that conditional random fields can be used to reliably detect the U2R attacks in particular.52 42. 2.81 78. Experiments with Feature Selection – In the second experiment.70 85.96 0.43 13.45 seconds to label 60. The U2R attacks are very difficult to detect and most of the present intrusion detection systems fail to detect such attacks with acceptable reliability.30 5.08 24.8. we perform feature selection in this experiment.30 F-Measure (%) 47.8. in the first experiment.661 test instances.47 4.43 Detecting U2R Attacks To detect U2R attacks.70 91.50 10. however. we use the same data as used in the previous experiment to detect U2R attacks.39 7. We give the results for detecting U2R attacks with feature .98 7.50 Train (sec.36 1.90 18. We used all the normal and U2R records from the test set for testing the system.80 Recall (%) 31. we train our system by randomly selecting 1. Hence.30 11.000 training instances and 60.67 27.20 89.20 6.

00 51.20 24.42 11.70 43.80 43.16 47.50 F-Measure (%) 56.54 6.62 52.661 test instances.90 38.5 Experiments and Results 63 Table 4.8: Detecting U2R Attacks (with all 41 Features) Precision (%) Conditional Random Fields Naive Bayes Decision Trees Best Average Worst Best Average Worst Best Average Worst 58.40 63.12 52. We observe that the system takes only 2.57 55.70 43.96 50.35 60.35 13.00 Recall (%) 64.90 0.20 57.20 Train (sec.74 53.80 12.29 55.93 .30 3.11 58.) 0. layered decision trees (increase of about 30%).20 38.97 29.67 0.20 34. layered naive Bayes (increase of about 38%) and naive Bayes (increase of about 675%).70 Train (sec.48 19.30 10.40 41. We also observe that the attack detection capability increases for the decision trees and the naive Bayes when we perform feature selection.30 Recall (%) 60.20 F-Measure (%) 61.49 51.20 55.37 2.4.20 85.85 2.00 91.9: Detecting U2R Attacks (with Feature Selection) Precision (%) Layered Conditional Random Fields Layered Naive Bayes Layered Decision Trees Best Average Worst Best Average Worst Best Average Worst 58.07 34.03 51.) Test (sec.67 seconds to label all the 60.83 0.00 51.00 7.30 5.02 50.9.31 5. decision trees (increase of about 184%).) Test (sec.93 6.71 62.00 35.22 selection in Table 4.20 38.25 1.) 8.94 3. Table 4.19 45. Table 4.45 0.29 0.88 82.60 51.29 66. clearly suggests that layered conditional random fields are the best choice for detecting U2R attacks and are far better than conditional random fields (increase of about 8%).9.90 20.44 49.

but it also helps to identify the type of attack once it is detected. For example.5. the category of an attack is unknown.1 suggest that conditional random fields (with feature selection) can be very effective in detecting different attacks when different attack classes are considered separately. the system labels the attack as U2R and initiates specific response mechanisms. but we implement the layered approach to improve overall system performance. we perform experiments in an environment similar to the real life deployment of the system.4. Further. As soon as a layer detects an attack. The difference in attack detection accuracy is. we perform feature selection and use exactly the same training instances as used for . regardless of the method considered and particularly for conditional random fields. The performance of our integrated system. To examine the effectiveness of our integrated system. “Layered Conditional Random Fields”. layered conditional random fields. though expensive. in real scenario.64 Layered Conditional Random Fields for Network Intrusion Detection It is evident from our results that the attack detection accuracy using layered conditional ran- dom fields is significantly higher for detecting the U2R. however. Experimental results in Section 4. R2L and Probe attacks. there is a tradeoff between efficiency and accuracy of the system and there can be various avenues to improve system performance. not significant for the DoS attacks. 4. To balance this tradeoff we use the conditional random fields which are more accurate. it would be beneficial if an intrusion detection system not only detects an attack but also identifies the type of attack. if an attack is detected at the U2R layer in the layered framework. This is because individual layers in the layered framework are trained to detect only a particular class of attack. This certainly increases system efficiency but it severely affects the accuracy as we observed from the experimental results.2. the time required for training and testing the system reduces significantly once we perform feature selection. is comparable to that of the decision trees and the naive Bayes and our system has higher attack detection accuracy. Results show that integrating the layered framework not only improves the efficiency of the overall system. thereby enabling specific intrusion response mechanisms depending upon the type of attack. the category of the attack can be inferred from the class of attack the layer is trained to detect. However.2 Implementing the Integrated System In many situations. Methods such as naive Bayes assume independence among the observed data.5. Rather. it is very likely that the attack is of U2R type and hence. We perform further experiments with the integrated system presented in Section 4. For this experiment.

since the attack is detected at an early stage.62 (Total % Blocked) 98. The system can also detect R2L attacks with much higher accuracy (29.5 Experiments and Results 65 training the individual models in the experiments described in Section 4. If layer one detects an attack. it must be considered as an advantage.90 0.91 DoS 0.58 77.50 3.40 29.5. 25.00 5. we re-label the entire data in the test set as either normal or attack.11 71.50 + 71.62 97.35 U2R 0.10.05 Normal 1.38 13.60 70.00 26.10 gives the % detection with respect to each of the five classes in a confusion matrix. DoS (97.53 0. R2L or U2R at layer two.38 From Table 4. Only the connections which are labeled as normal at the first layer are allowed to pass to the next layer.69 0.82 25.65 0. During testing. if some Probe attacks are not detected at the first layer.04 3.10: Confusion Matrix % Detection Probe Probe DoS R2L U2R Normal 97. most of the Probe attacks are detected.00 + 0. we observe that an intrusion detection system based on layered conditional random fields can detect most of the Probe (98.33%) attacks while giving very few false alarms at each layer. Same process is repeated at the following layers where an attack is blocked and labeled as DoS. This is because.62 86.00 0. Table 4.67 98. Similarly. The confusion matrix shows that only 71.00 0.00 0.15 0.00 0. We perform all experiments 10 times and report their average.07 R2L 0.62%). Table 4. However. However.40%) and U2R (86.62%) when compared with previously reported systems.90 + 0. If other attacks are detected as Probe.38 2. rather it is 25. layer three and layer four respectively.90%. it is very important to note that the accuracy for detecting DoS attacks is not 71. Since.90% of DoS attacks are labeled as DoS. they may be detected at subsequent layers. though the system identifies them as Probe attacks since the first layer represents the Probe .33 1.1. it is blocked and labeled as Probe.40%. the layer is trained to detect Probe attacks effectively.50% of the DoS attacks have been detected at the first layer itself.4. Other attacks such as DoS can either be seen as normal or as Probe.00 = 97. all the instances from the test set are passed through the system starting from the first layer.

) 8 18 20 21 . First.10. Looking at the R2L and U2R attacks in Table 4. merging the two layers result in increasing the number of features which affects efficiency.66 Layered Conditional Random Fields for Network Intrusion Detection layer. the U2R attacks are not detected effectively and their individual attack detection accuracy decreases.436 attack instances in the test set. The table clearly shows that out of all the 250. it is natural to think that the two layers can be merged. This configuration takes only 21 seconds to classify all the 250. Using the layered framework. However.436 attacks. making the system scalable and flexible to specific requirements of the particular environment where it is deployed. the layered framework is very effective in reducing the attack traffic at every layer in the system. this has two disadvantages. the merged layer performs poorly with respect to the total test time when compared with the combined test time for both the unmerged layers.226 91.004 Cumulative (%) 25.770 0.11: Attack Detection at Individual Layers (Case:1) Accuracy Total (%) Probe DoS R2L U2R 25.090 0. Hence. it is hoped that any attack. Table 4. We also note that most of the U2R attacks are detected in the third layer and hence labeled as R2L. if we remove the third layer. more than 25% of the attacks are blocked at layer one and more than 90% of all the attacks are blocked by the end of layer two. We evaluate the performance of every layer in our system in Table 4. we use two separate layers for detecting R2L and U2R attacks.053 0. even though its category is unknown.222 92.996 Per Instance (m sec.226 65. the fourth layer can detect U2R attacks with similar accuracy. This is because the number of U2R attacks in the training data is very small and the system simply learns the features which are specific to the R2L attacks. Second.11.996 1. The number of layers can also be increased or decreased in the layered framework.031 0. This is acceptable because it is critical to detect an attack as early as possible which helps to minimize the impact of an attack.056 Test Time Total (sec.) 8 10 2 1 Cumulative (sec. When the layers are merged. can be detected at any one of the layers in the system. However.) 0. Thus.992 92.

Such optimization becomes significant in severe attack situations when the target is overwhelmed with illegitimate connections.4731 ∗ 104 instances per second.415 1.e.807 1.996 Per Instance (m sec.6 Comparison and Analysis of Results Experimental results from Section 4.436 test instances in 17 seconds. i.12.056 Test Time Total (sec. We can do this because the data is relational and every layer in the system is independent.12: Attack Detection at Individual Layers (Case:2) Accuracy Total (%) DoS Probe R2L U2R 89. it can handle 1.4. we compare the layered conditional random fields with other well known methods for intrusion detection based on the anomaly detection principle.) 13 1 2 1 Cumulative (sec.004 Cumulative (%) 89.992 92.770 0.031 0.) 13 14 16 17 Table 4.807 91.222 92. machine learning or data mining approaches [9].) 0.090 0. Table 4.5 clearly suggest that conditional random fields when integrated with the layered framework can be used to build effective and efficient network intrusion detection systems. the overall bandwidth which our system can handle is easily in excess of 100 Mbps.051 0. The anomaly based systems primarily detect deviations from the learnt normal data using statistical methods. assuming the average size of an instance to be 1.6 Comparison and Analysis of Results 67 We can further optimize this configuration by putting the DoS layer before the Probe layer. 4. Standard techniques such as decision trees and . CPU 2. It is important to note that this performance is achieved on a desktop running with Intel(R) Core(TM) 2. In this section.4 GHz and 2 GB RAM in the Windows environment.12 shows that our system can analyze 250.. Significant performance improvement can be achieved by building dedicated devices for large scale commercial deployment. We present the results in Table 4. Now. Putting the DoS layer before the Probe layer improves overall system performance and helps to detect a large number of attacks at the first layer itself.5 KB.

5% improvement) and U2R (34.8% improvement) and 97. . The most impressive part of layered conditional random fields is the margin of improvement when compared with other methods.8% improvement) attacks. The use of support vector machines for intrusion detection is discussed in [72]. the authors in [123] propose the use of principle component analysis before applying any machine learning algorithm.13 suggests that layered conditional random fields perform significantly better than previously reported results including the winner of the KDD 1999 cup and various other methods applied to the KDD 1999 data set.40% detection for DoS attacks.6% for Probe attacks (5. In [122]. The main reason for better accuracy of our system is that the conditional random fields do not consider observation features to be independent. However.68 Layered Conditional Random Fields for Network Intrusion Detection naive Bayes are known to perform well. They outperform by a significant percentage for R2L (34. We compare these methods with our layered conditional random fields for intrusion detection in Table 4. The table represents the Probability of Detection (PD) and the False Alarm Rate (FAR) in % for different methods including the winners of the KDD 1999 cup. They have very high attack detection of 98. To improve attack detection. Comparison from Table 4. our experimental results show that layered conditional random fields perform far better than these techniques.13. the authors present a comparative study of various classifiers when applied to the KDD 1999 data set.

70 0.400 0.0001 PD FAR PD FAR 70.4000 2.700 0.32 0.300 0.100 0.23 R2L 29.001 U2R 86.50 93.8000 0.003 1.100 6.10 0.400 0.30 97.0300 8.00 0.005 2.10 0.005 3.40 77.0200 0.80 0.20 0.30 82.40 90.4.70 86.40 0.60 - 0.350 8.13 0.20 97.60 2.100 3.60 88.30 97.80 0.0030 29.0500 22.600 0.27 DoS 97.8000 0.91 83.0001 12.40 88.00 0.12 91.6 Comparison and Analysis of Results 69 Table 4.30 84.0090 6.58 0.00 0.005 9.30 0.6000 0.20 18.0000 - .0020 64.13: Comparison of Results Probe Layered Conditional Random Fields KDD 1999 Winner [122] Multi Classifier [122] Multi Layer Perceptron [122] Gaussian Classifier [122] K-Means Clustering [122] Nearest Cluster Algorithm [122] Incremental Radial Basis Function [122] Leader Algorithm [122] Hypersphere Algorithm [122] Fuzzy ARTMAP [122] C4.000 0.20 0.30 73.40 0.40 97.010 9.5 (Decision Trees) [122] Nearest Neighbour with Principle Component Analysis (4 axis) [123] Decision Trees with Principle Component Analysis (2 axis) [123] Support Vector Machines [72] PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR PD FAR 98.30 97.3000 0.0010 1.400 0.65 - 97.1000 0.010 5.070 0.004 4.40 0.30 0.30 97.30 0.80 0.20 11.30 87.0400 6.20 80.30 97.70 0.20 0.2000 0.1000 0.60 88.40 97.8000 0.600 0.60 0.85 36.80 0.20 0.2000 0.000 - 7.600 0.0500 13.07 97.510 0.100 5.4000 13.3000 0.600 0.030 22.900 0.600 0.80 83.0006 6.8000 0.5000 29.2000 0.0400 0.90 97.

. for ease of experiments. Further.14. we perform further experiments where we do not implement the layered framework. i. R2L and U2R attacks as attack.12 R2L 29. by labeling all the Probe. R2L and Probe attacks. high performance speed and high classification accuracy.6. both. DoS.14 clearly suggests that a system implementing the layered framework with feature selection is more efficient and more accurate in detecting attacks particularly the U2R.05 96. Hence. In real environment. Table 4. For experiments when we do not implement the layered framework but we perform feature selection. as discussed earlier. Non Layered Framework Attack Detection in % Probe Feature Selection Layered All Features Feature Selection All Features 98.88 96.06 92.70 Layered Conditional Random Fields for Network Intrusion Detection 4. we select 21 features out of the total of 41 features by applying the union operation on the feature sets of the four individual attack classes.40 97. normal and attack.1 Significance of Layered Framework To evaluate the effectiveness of the layered framework. a system which implements feature selection with layered framework can benefit from both. Further. we train a single system with two classes.24 Time Taken (sec. We perform experiments.10 16. we should read the time in relative terms rather than in absolute terms since. we use scripts for implementation.00 48.e.03 60.58 U2R 86. .) Test 17 56 29 57 Non Layered Comparison from Table 4. multiple I/O operations can be replaced by a single I/O operation providing very high speed of operation.14: Layered Vs.62 88.21 87. high speed can be achieved by implementing the complete system in languages with efficient compilers such as the C Language. with and without feature selection. in Table 4. we can implement pipelining in multi core processors where every core represents a single layer and due to pipelining. The motivation behind layered framework is to improve performance speed while feature selection helps to improve classification accuracy.01 17.14 presents the results.94 DoS 97.33 55.62 15. Table 4.

However.21 83.06 32.53 98.73 85. This results in only a small set of features for each layer.2 Significance of Feature Selection Form the experiments in previous sections. We performed experiments with feed forward neural network to determine the weights for all the 41 features.19 45. We then discarded the features with weights close to zero. using our domain knowledge. For automatic feature selection.28 85. We compare the effect of feature selection on intrusion detection in Table 4.50 98.74 53.89 U2R 61.31 98. We compare the results of manual feature selection with automatic feature selection for all the layers. However.6 Comparison and Analysis of Results 71 4.11 58.90 49.58 56. when we performed similar experiments with the reduced set of features. Table 4.08 39. we performed manual feature selection.82 87. we use methods such as those discussed in [124] and [125] which can automatically extract significant features. In our experiments. However.30 Manual Automatic No Selection It is not surprising that manual feature selection performs better than automatic feature selection. it would be advantageous if we can select features automatically for different attack classes.6.94 21.03 86.39 86.48 98.03 53. but the accuracy of detection is significantly lower when features are induced automatically than the system based on manual feature selection.14 25.4.40 98. .68 92. For experiments with automatic feature selection.19 DoS 98. we observe that performing feature selection improves the attack detection accuracy as well as the efficiency of the system.37 R2L 47.23 42.15: Significance of Feature Selection F-Measure (%) Probe Best Average Worst Best Average Worst Best Average Worst 93.38 98.73 89.15.20 98.15 28.80 46.43 98. We observe that the system using automatic feature selection has similar test time performance when compared to the system with manual feature selection. we perform experiments using the Mallet tool [126].44 49. we also considered other methods for automatic feature selection.42 25.52 42.

we perform feature selection and implement the layered framework with the conditional random fields to produce a four layer system.5 algorithm [127] to perform feature selection. main drawback of using PCA followed by conditional random fields is that. We then used Principle Component Analysis (PCA) for dimensionality reduction [123]. Hence.6. the strength of conditional random fields is to model correlation between features. With regards to improving the attack detection accuracy. We also observe that feature selection not only improves the test time efficiency. it is important to detect most of the attacks with very few false alarms. Given the critical nature of the task of intrusion detection. but the features in the transformed space are independent.5 algorithm for further experiments. We observe that the test time performance of the integrated system is comparable with other methods. However. DoS. the PCA transforms a large number of possibly correlated features into a small number of uncorrelated features known as the principle components. This is because. the time required to train and test the model is very high. However. if we use all the 41 features for all the four attack classes. automatic feature selection with layered conditional random fields is still a feasible scheme for building reliable network intrusion detection systems which can operate efficiently in high speed networks. the combined approach did not provide significant advantage. We then used the C4. . This is because using more features than required can generate superfluous rules often resulting in fitting irregularities in the data which can misguide classification. R2L and U2R attacks. but it also increases the accuracy of attack detection. the time required to train the model is slightly higher. The four layers correspond to Probe. however.3 Significance of Our Results Experimental results show that conditional random fields have high attack detection accuracy. when we applied PCA to the data set and then implemented the system using conditional random fields in the newly transformed feature space.72 Layered Conditional Random Fields for Network Intrusion Detection there was no significant improvement in the attack detection accuracy. Our system also gives slight improvement for detecting Probe attacks but has similar accuracy for detecting DoS attacks. we use domain knowledge to improve attack detection accuracy. hence. To address this. there was no significant improvement in the results. Nonetheless. We constructed a decision tree and selected only a small set of features which were selected by the C4. 4. the main strength of layered conditional random fields lies in detecting R2L and U2R attacks which are not satisfactorily detected by other methods. However.

16. To ameliorate this problem to some extent and to study the robustness of our system.7 Robustness of the System In order to test the robustness of our system. where a system with rank ‘1’ represents the best system. Considering both. we add substantial amount of noise in the training data and perform similar experiments. To determine the statistical significance of our results. However. no other data sets are freely available which can be used for similar experimentation. 4. Thus. This results in capturing the correlation among different features in the observation resulting in higher accuracy. we rank all the six systems in order of significance for detecting Probe. it is important to perform similar experiments with a number of other data sets. DoS. given the domain of the problem. Table 4. We use the Wilcoxon test [128] with 95% confidence interval to discriminate the performance of these methods. the accuracy and the time required for testing.4. . We compare the ranking for various methods in Table 4. layered conditional random fields score better.16: Ranking Various Methods for Intrusion Detection Probe Layered Conditional Random Fields Conditional Random Fields Layered Decision Trees Decision Trees Layered Naive Bayes Naive Bayes 1 4 1 1 6 5 DoS 1 4 1 1 5 5 R2L 1 3 4 1 5 5 U2R 1 2 3 5 3 6 The results of the test indicate that layered conditional random fields are significantly better (or equal) for detecting attacks when compared with other methods. layered conditional random fields are a strong candidate for building effective and efficient network intrusion detection systems.7 Robustness of the System 73 The prime reason for better attack detection accuracy for conditional random fields is that they do not consider observation features to be independent. R2L and U2R attacks.

We perform four set of experiments with noisy data. 0. when the original feature is ‘0’.50.74 Layered Conditional Random Fields for Network Intrusion Detection 4.4.6 and 4. R2L and U2R attacks separately in Figures 4. we add noise to that feature by using an additive function (a random value between -1000 and +1000) instead of scaling. The figures clearly suggest that the layered conditional random fields are robust to noise in the training data and perform better than other methods.20. and the scaling factor.90 and 0. ‘s’. one for each layer.1 Addition of Noise We control the addition of noise in the data by two parameters. 0. 4. 0. For every set of experiment. DoS.5.7 respectively. In case.7. 0.10.75. we vary the parameter ‘p’ from 0 and 1 (by keeping it at values 0. the probability of adding noise to a feature. ‘p’. 4. 0. .95) and vary the parameter ‘s’ from -1000 and +1000. We represent the effect of noise for detecting Probe.33.

5: Effect of Noise on DoS Layer .7 Robustness of the System 75 LCRF 100 CRF DT NB 90 80 F-Measure 70 60 50 40 30 0 10 20 30 40 50 Noise % 60 70 80 90 100 Figure 4.4: Effect of Noise on Probe Layer LCRF 99 98 97 96 95 94 93 92 91 90 89 0 10 20 30 CRF DT NB F-Measure 40 50 Noise % 60 70 80 90 100 Figure 4.4.

6: Effect of Noise on R2L Layer LCRF 60 CRF DT NB 50 40 F-Measure 30 20 10 0 0 10 20 30 40 50 Noise % 60 70 80 90 100 Figure 4.7: Effect of Noise on U2R Layer .76 Layered Conditional Random Fields for Network Intrusion Detection LCRF 45 40 35 30 F-Measure 25 20 15 10 5 0 0 10 20 30 CRF DT NB 40 50 Noise % 60 70 80 90 100 Figure 4.

. as discussed earlier. show that feature selection and implementing the layered framework significantly reduces the time required to train and test the model. We compared our approach with some well known methods for intrusion detection such as the decision trees and naive Bayes.5. Probe. viz. while our integrated system can effectively and efficiently detect such attacks giving an improvement of 34. thereby improving the attack detection accuracy. in particular the number of layers in the system can be easily increased or decreased giving flexibility to network administrators. making our system highly scalable. cannot detect the R2L and the U2R attacks effectively. Other type of attacks can also be detected by adding new layers in the system. Further. Our system also helps in identifying an attack once it is detected at a particular layer which expedites the intrusion response mechanism. Our system has all the advantages of the layered framework discussed in the previous chapter. These methods. Our experimental results in Section 4.5% for the R2L attacks and 34. Further. and. thus minimizing the impact of an attack. in the following chapters. capability of detecting a wide variety of attacks and efficiency of operation.5.8 Conclusions 77 4. We showed that our system is robust to noise in the training data and performs better than any other compared system. the accuracy of attack detection. we focus on developing intrusion detection systems which can operate at the application level and which can be effective in detecting application level attacks. our system can be implemented to detect a variety of attacks including the DoS. Experiments also suggest that conditional random fields can be very effective in reducing the false alarms. Hence.1 show that conditional random fields are very effective in improving the attack detection rate and decreasing the false alarm rate. Our system can clearly provide better intrusion detection capabilities at the network level.8% for the U2R attacks. however. R2L and the U2R. to provide a higher level of security it is significant to detect intrusions at the application level along with detecting intrusions at the periphery of the network.4. However. Having a low false alarm rate is important for any intrusion detection system.2. we addressed the core issues concerning the anomaly and hybrid intrusion detection systems at the network level.8 Conclusions In this chapter. experimental results presented in Section 4.

.

Chapter 5 Unified Logging Framework and Audit Data Collection In order to detect malicious activities at the application level. Further. network intrusion detection systems cannot reliably detect application attacks such as the SQL injection. A network based system primarily focuses on monitoring network packets and. This is because. 5.1 Introduction U SING our layered framework. network level systems must be complimented with application level systems. However. Finally. the attack detection capability of a network based system is different from that of a host based and application based system. we introduce a unified logging framework which combines the application and the data access logs to produce a unified log which can be used as the audit patterns to detect attacks at the application level. our framework does not encode application specific features to extract attack signatures and can be used for a variety of similar applications. 79 . can undoubtedly provide effective network intrusion detection capability. as discussed in previous chapters. Such systems. Similarly. application-data interaction can be captured which improves attack detection. however. present intrusion detection systems either analyze the application access logs or the data access logs. cannot detect data and application level attacks particularly when Network Address Translation (NAT) and encryption are used in communication. As a result. A stacked system can also be used which analyzes the two logs separately one after the other. to ensure a higher level of secu- rity. As a result. cannot model the application-data interaction which is significant to detect low level application specific attacks. host and application based systems cannot protect against network attacks such as the Denial of Service. This unified log can easily incorporate features from both. hence. the application accesses and the corresponding data accesses. attacks can be split into more than one packet to avoid their detection. To overcome this deficiency in present application intrusion detection systems.

. it becomes critical to develop better application intrusion detection systems which can detect attacks reliably and are not entirely dependent on attack signatures. However. Detecting application level attacks often require monitoring every single data access in real-time environment which may not always be feasible. 5. Therefore. Such systems are often signature based and. thus. we conclude this chapter in Section 5.4. Detecting malicious data accesses. thus. the ultimate objective of attacking an application is either to launch a Denial of Service or to access the underlying data.3 and the setup for data collection in Section 5. hence. Present application intrusion detection systems either analyze only the web access logs or only the data access logs or use two separate systems (based on analyzing the web access logs and the data access logs) which operate independently and. we introduce a unified logging framework which combines the application access logs and the corresponding data access logs in real-time to provide a unified log with features from both the application accesses and the corresponding data accesses. Hence. cannot detect attacks reliably. We then describe our proposed framework in Section 5. Further. by analyzing the application’s interaction with the underlying data. such as those discussed earlier cannot be directly used to detect low level application attacks.2. The rest of the chapter is organized as follows. We note that to effectively detect application level attacks the application-data interaction must be captured. i. This captures the correlation between the two logs and also eliminates the need to analyze them separately. hence.80 Unified Logging Framework and Audit Data Collection Methods which are effective in detecting attacks at network level. have limited attack detection. Finally.2 Motivating Example Data access in three tier application architecture is restricted via the application and. thereby resulting in a system which is accurate and which operates efficiently. we motivate our unified logging framework with some examples in Section 5.5. To detect such malicious data accesses it becomes critical to consider the user behaviour (via the web application requests) and the corresponding application behaviour (via the corresponding data accesses) together. attackers may come up with previously unseen attacks compounding the situation even more difficult [5].e. simply due to large number of data requests per unit time. applications are one of the prime targets of attack.. presents a major challenge and alternate methods must be considered which are efficient and at the same time which can detect attacks reliably.

3. the number of data accesses per unit time is very large as compared to the number of web accesses and. Transition from page A to page B may be valid only if some conditions are satisfied. it can effectively model the user-application interaction. the straight forward approach is to audit every data access request before it is processed by the application. the transition is considered as anomalous. Hence.3 Proposed Framework 81 Consider for example. monitoring every data request in real-time severely affects system performance. page C or any other page. In most applications. This depends on the logic encoded in the application. However. ‘transition sequence of web pages’ may not be sufficient to detect attacks. If this condition is not satisfied. we propose monitoring web accesses together with the corresponding data accesses using our unified logging framework. to detect attacks reliably. This is because. monitoring the data access queries alone without any knowledge of the web application which requests the data is insufficient to detect attacks since they lack the necessary contextual information. Similarly. it cannot detect zero day attacks. Thus. a simple website which links page A to either page B. Neglecting such features result in false alarms. such as ‘the user must be logged in’ to transit from page A to page B.3 Proposed Framework In order to detect malicious data accesses. As with any signature based system. However. the system is application specific because the attack signatures are defined by encoding application specific knowledge.5. thus. when the system is made aware of the data access pattern via features such as ‘the number of requests generated by a particular page’. thereby resulting in better attack detection. Considering only a single feature. Assuming that we can somehow monitor every data request by using a signature based system. 2. ‘the corresponding next page’ and other features. The system must be regularly updated with new signatures to detect attacks. this is not the ideal solution to detect data breaches due to the following reasons: 1. We also observe that the real world applications follow the three tier architecture [129] which ensures . the encoded logic cannot be modeled by analyzing the web accesses alone. 5. Other features such as ‘the result of authentication module’ are significant for decision making. monitoring every data request is not feasible in high speed application environment.

To detect such attacks. A user session is represented as a sequence of event vectors as: si = start.. f N 4. An event is such a single request-response pair. Further. There exists no other way in which the data can be made available to a user. A single event is represented as an ‘N’ feature vector which is denoted as: ei = f 1 ... We use the term event interchangeably with the term request. When a system monitors the application accesses alone. Event: Data transfer between a user and an application is a result of multiple sequential events. we define some key terms which will be helpful in better understanding of the remaining of the chapter. using two separate systems does not capture applicationdata interaction which affects attack detection. User: User is either an individual or any other application which access data. 3. Similarly. f 3 .e. data is managed separately and is not encoded into the application. propose a unified logging framework which generates a single audit log that can be used by the application intrusion detection system to detect a variety of attacks including the SQL injection. e1 . an attacker has no option but to exploit the application. e2 . a session is a sequence of one or more request-response pairs. Application: Application is a software by which a user can access data. i. To access application data. As discussed earlier. Every session can be uniquely identified by a session id.. i.e. 1. hence. analyzing every data access in isolation limits the attack detection capability of an intrusion detection system. previous approaches either consider only the application accesses or the data accesses or consider both in isolation and. end . an intrusion detection system can either monitor the application requests or (and) monitor the data requests. cross site scripting and other application level attacks. We.. unable to correlate the events together resulting in a large number of false alarms. thus.. e3 . User Session: A user session is an ordered set of events or actions performed. Before we describe our framework in detail. it cannot detect attacks such as the SQL injection since the system lacks useful information about the data accessed. f 2 . Data transfer can be considered as a request-response system where a request for data access is followed by a response.82 Unified Logging Framework and Audit Data Collection application and data independence. 2..

in addition to an intrusion detection system which is used to detect malicious data accesses in an application. The logs unification module provides input audit patterns to the intrusion detection system and the response generated by the intrusion detection system is passed on to the session control module which can initiate appropriate intrusion response mechanisms. user access is restricted via the application and.1 which can be used for building effective application intrusion detection systems.3. In our framework.3 Proposed Framework 83 5. session control module and logs unification module. Hence. thus. We have already discussed that the three tier architecture restricts data access only via the application. we define two modules. User / Client Session Control Web Server with Deployed Application Web Server Log Intrusion Detection System Unified Log Data Data Access Log Figure 5.1: Framework for Building Application Intrusion Detection System In our framework.1 Description of our Framework We present our unified logging framework in Figure 5. . every request first passes through the session control which is described next.5. the application acts as bridging element between the user and the data.

we first process the data access logs and represent them using simple statistics such as ‘the number of queries invoked by a single web request’ and ‘the time taken to process them’ rather than analyzing every data access individually. the session control also accepts input from the intrusion detection system. Once the session id is evaluated for a request. it must also ensure that once an attack is detected. The session control can either be implemented as a part of the application or can also be implemented as a separate entity. The unified log incorporates features from both the web access logs and the corresponding data access logs.2. If a request is evaluated to be anomalous by the intrusion detection system. the response from the application can be blocked at the session control before data is made visible to the user. Second. it maintains a list of valid sessions which are allowed to access the application.84 Session Control Module Unified Logging Framework and Audit Data Collection The prime objective of an intrusion detection system is to detect attacks reliably. the . appropriate intrusion response mechanisms are activated in order to mitigate their impact and prevent similar attacks in future. First. Every request to access the application is checked for a valid session id at the session control and anomalous requests can be blocked depending upon the installed security policy. the request is sent to the application where it is processed. The two logs are then combined by the logs unification module to generate unified log which is described next. For this. Hence. All corresponding data accesses are also logged. As a result. it is capable of acting as an intrusion response system. we discussed that analyzing the web assess logs and the data access logs in isolation is not sufficient to detect application level attacks. thereby preventing malicious data accesses in real-time. present in both. The web server logs every request. we propose using unified log which can better detect attacks as compared to independent analysis of the two logs. very often. Hence. helps to capture the user-application interaction and the application-data interactions. the number of data accesses is extremely large when compared to the number of web requests. Using the unified log. However. Logs Unification Module In Section 5. The session control module serves dual purpose in our framework. However. thus. it is responsible for establishing new sessions and for checking the session id for previously established sessions. We then use the session id. The logs unification module is used to generate the unified log.

f 1 . gm ) Data Access (de12 = g1 .. g2 .g M represent the features of web access logs and the features extracted from the reduced data access logs respectively.... g1 . f2 .. .. f2 .. there is no data set which can be used for our experiments. Once the web access logs and the corresponding data access logs are available... gm ) Log Unification Unified Log (e1 = f1 . gm ) Data Access (de13 = g1 .. f 2 .. fn . the next step involves the reduction of data access logs by extracting simple statistics as discussed before. g3 . then. Web Request (We1 = f1 . gm ) Figure 5. Figure 5. 5.. g2 ..2. the log unification module generates a unified log which can be used by an application intrusion detection system. The session id can. . g2 ..2: Representation of a Single Event in the Unified log From Figure 5. f N and g1 . gm ) Data Access Log Reduction Web Request Reduced Data Acesses (We1 = f1 . . .. . fn ) Data Access (de11 = g1 . ..4 Audit Data Collection As presented in our framework.... represents how the web access logs and the corresponding data access logs can be uniquely mapped to generate a unified log... we observe that a single web request may result in more than one data accesses which depend upon the logic encoded into the application. f 3 . .. g2 .. Application data sets such as [130] are available. However. .4 Audit Data Collection 85 application access logs and the associated data access logs. f2 . In the figure. be used to uniquely combine the two logs to generate the unified log.. g2 ..5. to uniquely map the extracted statistics (obtained from the data access logs) to the corresponding web requests in order to generate a unified log... g2 . but are restricted to ..2. fn ) (de1 = g1 .

we used an online shopping application [131] and deployed it on a web server running Apache.86 Unified Logging Framework and Audit Data Collection monitoring the sequence of system calls for privileged processes. For the first data set. 6. we collected data sets locally. is very hard. frames and dynamic content. Web access logs contain useful information such as the details of every request made by a client (user).55. version 2. Amount of data transferred (in bytes). in case the . Such data sets cannot be used in our experiments. This request further generates one or more data requests which depend on the logic encoded in the application. we generate a unified log format where every user session is represented as a sequence of vectors. for example a bank website data. We collected two separate data sets by setting up an environment that mimics a real world application environment. 2.1. Time taken to process the request. the web requests and the corresponding data accesses were logged. To collect the second data set.0.4 GHz and 2 GB RAM. CPU 2. The six features are: 1. Further. we used another online shopping application [132] and deployed it separately on exactly the same configuration. Thus. data access logs contain important details such as the exact data table and columns accessed. Number of data queries generated in a single web request. At the backend.1 Feature Selection We used two features from the data access logs and four features from the web access logs to represent the unified log.22.4. Request made (or the function invoked) by the client. Both the data sets are made freely available and can be downloaded from [13]. version 4. Reference to the previous request in the same session. each having six features. Response generated for the request. 5. The servers and the application were installed on a desktop running with Intel(R) Core(TM) 2. Both. amount of data transferred etc. 5. we consider a web request to be a single request to render a page by the server and not a single HTTP GET request as it may contain multiple images. 4. if not impossible. The operating system installed was Microsoft Windows XP Professional Service Pack 2. 3. getting real world application data. Similarly. A request can be easily identified from the web server logs. response of the web server. For both the applications. Hence. the application was connected to MySQL database.

however. the access to the application was restricted to only from within the department. we cannot consider a one to one mapping between a user and an IP address.4 Audit Data Collection 87 data is stored in a database. visits the website. buys some products and logs off. searches for an item. The unified log is then used as input to the intrusion detection system. we did not use the IP address to identify a user accessing the application. For the purpose of normal data collection. the students were advised not to provide any personal information and were asked to use dummy information instead of using their actual details. adds few items to cart. requires substantially more resources when compared to our approach. A user is a registered user. adds few items to cart but does not buy them.4. 3.2 Normal Data Collection To collect normal data. the system was online for five consecutive days. The application was accessed using different scenarios. other scenarios were also considered. 5. The students were not restricted to create a single user account. separately. buys them by registering and completes the check out process and finally logs off. in this case. in isolation. and many of them created multiple accounts. looks at some items. the postgraduate students in our department were encouraged to access the application. 2. adds some items to cart. The application was accessible like any other online shopping website. adds it to cart. for both the data sets. starts the check out process but does not finish buying and logs off. Furthermore. searches for an item. A user is not a registered user. Further. 4. For data collection. 5. A user is not a registered user. some examples of the scenarios are: 1. A user is a registered user.5. visits the website. This is significant because. which is also not . the students were asked to use different browsers to access the same application. looks at some items. which is the final module in our framework and is discussed in the next chapter. The user visits the website. monitoring the logs together eliminate the need to monitor every data query since we can use simple statistics to represent the features of the data accesses logs in the unified log. Hence. Performing intrusion detection at the data access level. In addition to these. A user is not interested in shopping but clicks on a few links to explore the website. The user visits the website.

642 web requests with 931.88 Unified Logging Framework and Audit Data Collection possible in any real world application due to sharing of computers and the use of Network Address Translation in networks.28480.normal 0.xyz/catalog/. .http://dummydata.normal Figure 5. 0.25252.php.php.GET /catalog/ HTTP 1. This results in 117 user sessions with only 2. Similarly. Attacks which do not require any control over the web server or the database such as password guessing and SQL injection attack. Note that.−. This shows that our framework for unified logging can be employed with minimum effort for a variety of existing applications.normal 131.1.http://dummydata. The session depicts a user browsing the website and looking at different products displayed on the index page of the deployed web application.3. We then combine the web server logs with the data server logs to generate unified log as discussed earlier in Section 5.200. each of which include features from the web requests and the associated data requests.0. This is because.xyz/catalog/index. We launched attacks based upon two criteria: 1. we disabled access to the system to other users and generated the attack traffic manually.−.GET /catalog/index.1.4.1.GET /catalog/index.0.27121.normal 84. we observe that 35 different users accessed the application which results in 117 unique sessions composed of 2.normal 108.200. We represent a normal user session from the data set in Figure 5.28885.http://dummydata.GET /catalog/product_info.200.615 web requests and 232. Also.php. the number of data accesses per web request is large in the second data set when compared to the first.php.3.200.3: Representation of a Normal Session 5.25431.php.php.301.655 data requests.xyz/catalog/index.0. for the second data set.GET /catalog HTTP 1. we combine 1.642 event vectors. we did not make any additional change specific to the second application to collect the second data set. This is a realistic scenario and in reality a large number of the shopping carts are abandoned without purchase.671 data accesses which results in 60 unique user sessions with 1.xyz/catalog/product_info.1.http://dummydata. resulting in abandoning the shopping cart. For the first data set.3 Attack Data Collection To collect attack data.GET /catalog/index. the two applications are different.369.200. We also observe that a large number of user sessions are terminated without actual purchase.normal 105.0.615 event vectors.php.

we generate 241 web requests and 249.attack 203.47801.−. The unified log can be used as the input audit patterns for building application intrusion detection system.35467. For the first data set.200.attack Figure 5.http://dummydata.php. both.40401.0.0.php HTTP 1. This is because. The logs were then combined using our framework. we introduced our unified logging framework which efficiently combines the application access logs and the corresponding data access logs to generate unified log.php HTTP 1. A typical anomalous session in the data set is represented in Figure 5. Combining the two together. Combing the logs result in 25 unique sessions with 241 event vectors in the unified log.http://dummydata.xyz/catalog/checkout_shipping.php.php. we generate 45 different attack sessions with 272 web requests resulting in 44. we observe that a user has bypassed the login module which is necessary to complete a genuine transaction.php.1.php HTTP 1.xyz/catalog/checkout_payment. This is possible only when the deployed application has been modified and hence.28623.xyz/catalog/checkout_payment_address.0. We showed that our framework is not specific to any particular application since it does not encode application specific signatures and can be used for a variety of applications. a user successfully completes the transaction and the login module is never invoked.GET /catalog/checkout_payment.200.5 Conclusions In this chapter.GET /catalog/checkout_payment_address. Finally.200. the web requests and the data accesses were logged. capture the application-data interaction which helps in improving attack detection at the application level. the entire session is labeled as attack. we described our audit data collection methodology which was used to collect two different data sets. thus.200. the unified log has 45 unique attack sessions with 272 event vectors.1.5 Conclusions 89 2. Attacks which require prior control over the web server such as website defacement and cross site scripting.4: Representation of an Anomalous Session 5.attack 203.http://dummydata.5.0.597 corresponding data requests.attack 203. To collect the attack data.390 data requests.attack 208.xyz/catalog/index.GET /catalog/index.4.php HTTP 1.php HTTP 1.25605.1. The session depicts a scenario where the deployed application has been modified by taking control of the web server. 103.0. The two data sets can be used for building and .1.1. the user behaviour and the application behaviour and can. For the second data set. In this case.http://dummydata. The advantage of using unified log is that they include features of both.GET /catalog/checkout_success.GET /catalog/checkout_shipping.200.

In the next chapter. we perform experiments using our collected data sets and analyze the effectiveness of the unified log in building application intrusion detection systems.90 Unified Logging Framework and Audit Data Collection evaluating application intrusion detection systems and can be downloaded from [13]. We introduce user session modeling using a moving window of events to model sequence of events in a user session which can be used to effectively detect application level attacks. .

viz. Similarly. Our experimental results on the locally collected data sets show that our approach based on conditional random fields is effective and can detect attacks at an early stage by analyzing only a small number of sequential events. To overcome these deficiencies and to improve attack detection at the application level. in particular. and introduce user session modeling to detect application level attacks reliably and efficiently. discussed in Chapter 5. discussed in previous chapter.1 Introduction A PPLICATIONS have unrestricted access to the underlying application data and are thus a prime target of attacks resulting in loss of one or more of the three basic security re- quirements. To prevent such malicious data accesses. we introduce a novel approach of modeling user sessions as a sequence of events instead of analyzing every event in isolation. first they analyze every single event independently to detect possible attacks and second. unable to detect novel attacks whose signatures are not available. to perform efficiently. Web-based applications. have limited attack detection capabilities. hence. thus.. From our experiments. we show that the attack detection accuracy improves significantly when we perform session modeling. we integrate our unified logging framework. Hence. We also show that our system is robust and can reliably detect disguised attacks. it becomes critical to detect any compromise of applications which accesses the data. are easy targets and can be exploited by the attackers. 6. We integrate our unified logging framework. Present application intrusion detection systems cannot detect attacks reliably because. they are based on signature matching and. they are often signature based and. confidentiality. to build effective application intrusion detection systems which are not specific in detecting a single type of attack. integrity and availability of the data. hybrid and anomaly detection systems are inefficient and 91 .Chapter 6 User Session Modeling using Unified Log for Application Intrusion Detection Present application intrusion detection systems suffer from two disadvantages.

92

User Session Modeling using Unified Log for Application Intrusion Detection

unreliable, resulting in a large number of false alarms because they are based on thresholds which are difficult to estimate accurately. Further, application based systems often consider sequential events independently and hence unable to capture the sequence behaviour in consecutive events in a single user session. Very often, attacks are a result of more than one events and monitoring the events individually result in reduced attack detection accuracy. Hence, to detect attacks effectively, we introduce user session modeling at application level by monitoring a sequence of events using a moving window. We also integrate the unified logging framework which generates a single unified log with features from both, the application accesses and the corresponding data accesses. We evaluate various methods such as conditional random fields, support vector machines, decision trees, naive Bayes and hidden Markov models and compare their attack detection capability. As we will demonstrate from our experimental results, integrating the unified logging framework and modeling user sessions result in better attack detection accuracy, particularly for the conditional random fields. Session modeling, however, increases the complexity of the system. Nonetheless, our experiments show that using conditional random fields higher attack detection accuracy can be achieved by analyzing only a few events, which is desirable, as opposed to other methods which must analyze a large number of events to operate with comparable accuracy. Further our system operates efficiently as it uses simple statistics rather than analyzing all the features in every data access. Finally, our system performs best and is able to detect disguised attacks reliably when compared with other methods. The rest of the chapter is organized as follows; we motivate the use of session modeling for application intrusion detection in Section 6.2. We then describe the data sets used in our experiments in Section 6.3 and our methodology in Section 6.4. We describe our experimental set up and present our results in Section 6.5 followed by the analysis of our results in Section 6.6. In Section 6.7, we discuss some implementation issues such as the availability of training data and suitability of our approach for a variety of applications. Finally, we conclude this chapter in Section 6.8.

6.2 Motivating Example
Recalling from the previous chapter, we defined an event as a single request-response pair which can be represented as an ‘N’ feature vector as: ei = f 1 , f 2 , f 3 ... f N

6.2 Motivating Example

93

Similarly, we defined a user session as an ordered set of events or actions performed, i.e., a session is a sequence of one or more request-response pairs and is represented as a sequence of event vectors: si = start, e1 , e2 , e3 ..., end In many situations, to launch an attack the attacker must follow a sequence of events. For such cases in particular, the attack will be successful when the entire sequence of events is performed. Each event individually is not significant; however, the events if performed in a sequence can result in powerful attacks. Further, the situation can be relaxed to give advantage to an attacker such that the individual anomalous events may not strictly follow each other. As a result, the anomalous events may be disguised within a number of legitimate events, such that the attack is successful and hence, the overall session is considered as anomalous. For example, a single session with five sequential events along with their labels may be represented as follows:

< Session Start >
1 1 e1 < f 1 , g2 , ...h1 n 2 2 e2 < f 1 , g2 , ...h2 n 3 3 e3 < f 1 , g2 , ...h3 n 4 4 e4 < f 1 , g2 , ...h4 n 5 5 e5 < f 1 , g2 , ...h5 n

– – – – –

Normal> Normal> Attack > Normal> Attack >

< Session End >
In the above sequence of events, e1 ...e5 , when we consider every event individually, anomalous events may not be detected, however, if the events are analyzed such that their sequence of occurrence is taken into consideration, the attack sessions can be detected effectively. Consider for example, a website which collects and stores credit card information and the following sequence of events occur in a single session: 1. A user attempts to log in by entering a (stolen) user id and password. The log in is successful. (Note that, SQL injection can also be used to reveal such login information). 2. The user then visits the home page and modifies some information (to create a backdoor for reentry).

94

User Session Modeling using Unified Log for Application Intrusion Detection 3. The user exploits the application to gain administrator access. 4. The user then visits the home page of the original user (in order to attempt) to disguise the previous event within normal events. 5. The user exploits administrator rights to reveal credit card information of other users. It must be noted that in the above sequence of events, the individual events appear to be normal

events and may not be detected by the intrusion detection system when the system analyzes the events in isolation. In particular, the third event in the above sequence, when analyzed in isolation, may be considered as normal since the administrator can access the application using the super user access. However, the overall sequence of events; transition from a user with limited access to a user with administrator access and finally revealing the credit card information is made visible only when the system analyzes all the events together in the session. Using session modeling, we aim to minimize the number of false alarms and detect such attacks, including the disguised attacks, which cannot be reliably detected by traditional intrusion detection systems.

6.3 Data Description
To perform experiments using user session modeling at the application level, there does not exists any freely available data set which can be used. As a result, we collected the data sets locally as described earlier in Chapter 5. We summarize the two data sets in Table 6.1. Table 6.1: Data Sets Number of Web Requests Data Set One Data Set Two Normal Attack Normal Attack 2615 272 1642 241 Number of Data Accesses 232,655 44,390 931,671 249,597 Number of Sessions 117 45 60 25

Every session in both the data sets represents a sequence of event vectors, with each event vector having six features. It is important to note that, though, both the applications are examples

6.4 Methodology

95

of online shopping website; there is difference in the two applications. One significant difference is the application’s interaction with the underlying database which is encoded as the application logic. As a result, the number of data accesses in the second data set (1,181,268) is significantly larger than in the first (277,045). Further, the size of the two data sets is also different; the first data set consists of 162 sessions, while the second has only 85 sessions. It is also important to note that the two data sets were collected independently at different times.

6.4 Methodology
In order to gain data access an attacker performs a sequence of malicious events. An experienced attacker can also disguise attacks within a number of normal events in order to avoid detection. Hence, to reduce the false alarms and increase attack detection accuracy, intrusion detection systems must be capable of analyzing entire sequence of events rather than considering every event in isolation [49]. We therefore propose user session modeling to detect application level attacks. To model a sequence of event vectors, we need a method which does not assume independence among sequential events. Hence, we use conditional random fields as the core intrusion detector in our application intrusion detection system. The advantage of using conditional random fields is that they predict the label sequence y given the observation sequence x allowing them to model arbitrary relationships between different features in the observations without making independence assumptions. Figure 6.1 shows, how conditional random fields can be used to model user sessions.

y1

y2

y3

y4

e1 = f1 ...f6

e2 = f1 ...f6

e3 = f1 ...f6

e4 = f1 ...f6

Figure 6.1: User Session Modeling using Conditional Random Fields In the figure, e1 , e2 , e3 , e4 represents a user session of length four and every event ei in the session is correspondingly labeled as y1 , y2 , y3 , y4 . Further, every event ei is a feature vector of length six as described in the unified logging framework. The conditional random fields do not assume

Hence. The attack detection is not real-time.1 Feature Functions For a conditional random field.2 Session Modeling using a Moving Window of Events We use the logs generated by the unified logging framework presented in Chapter 5 and perform user session modeling using a moving window of events to build effective application intrusion .4. Since the complexity of the system increases as the width of the window increases. The size of the session can be very large with more than 50 events.. The feature functions used in our experiments are presented in Appendix C. is considered better. a method which can reliably detect attacks with only a small number of events. Examples of features extracted include. e4 . we restrict the window width to 20 in our experiments. 6. e3 . however.e. We use our domain knowledge to identify such dependencies in the features and then define functions which extracts features from the training data. Similarly another example can be.4. Based on our domain knowledge. it has two disadvantages: 1. other features were extracted similarly using the CRF++ tool [120]. Hence. e2 . Analyzing every session at its termination is effective since complete session information is available. i. 6. it is critical to define the feature functions because the ability of a conditional random field to model correlation between different features depends upon the predefined features used for training the random field. As a result. we perform user session modeling using a moving window of events. We note that a user session can be of variable length and some sessions may be longer than others. if f eature1 is request made = ‘abc’ and f eature2 is reference to previous request = ‘xyz’ then label is ‘normal’. as shown in the first example helps to capture the correlation between different features. 2. We vary the width of the window from 1 to 20 in all our experiments. at small values of window width. Using feature conjunction.96 User Session Modeling using Unified Log for Application Intrusion Detection any independence among the sequence of events e1 . analyzing all the events together increases the complexity and the amount of history that must be maintained for session analysis. if f eature is amount of data transferred = ‘pqr’ then label is ‘attack’.

e5 . e2 .5 Experiments and Results 97 detection systems. e8 . e7 . the window can be advanced forward with a step size > 1. e2 . e3 . e6 . ∅ e1 . ∅ e1 . ∅ e1 . ∅. e10 . ∅. ∅. e4 . support vector machines and hidden Markov models for detecting malicious data accesses at the application level. e3 . e8 e5 . the system operates in batch mode. e8 .5 Experiments and Results We now describe the experimental setup and compare our results using a number of methods such as the conditional random fields. e5 . e6 . For example. e2 . e7 . A method which . e9 e6 . e5 . e9 . e6 . e10 −→ ‘Label’ at step 1 −→ ‘Label’ at step 2 −→ ‘Label’ at step 3 −→ ‘Label’ at step 4 −→ ‘Label’ at step 5 −→ ‘Label’ at step 6 −→ ‘Label’ at step 7 −→ ‘Label’ at step 8 −→ ‘Label’ at step 9 −→ ‘Label’ at step 10 It is evident from the above representation that the window of events is advanced forward by one and hence. the events in this session can be analyzed as shown below (note that ∅ represents absence of an event): e1 . e6 . naive Bayes. e5 e2 .] 6. However. [Note that. e2 . e7 . e4 . e8 . In such cases. the system no longer operates in real-time. e3 . e3 . e7 .6. such a system can perform in real-time. e9 . depending upon the requirements of a particular application. decision trees. e5 . e4 . if the analysis is performed only at the end of every session. e4 . ∅ e1 . e1 . ∅. ∅. consider a session of length 10 represented as a sequence of events: < start >. e3 . < end > Using a moving window of width five with a step size of one. e4 . e6 e3 . It is important to note that the accuracy of attack detection and efficiency of operation are the two critical factors which determine the suitability of any method for intrusion detection. ∅. e2 . e7 e4 .

As we mentioned before.. Our experimental results. In this chapter. For example. separately. a technique which is efficient but cannot detect attacks with acceptable level of confidence is not useful. we label the sequence as attack when the number of votes in favour of the attack class is greater than or equal to three. However. We use exactly the same training and test samples for all the five methods that we compare (conditional random fields. we convert every session into a single record by appending sequential events at the end of the previous event and then label the entire session as either normal or as attack. one for each feature. Decision trees are very fast and generally result in accurate classification. for a session of length five. hidden Markov model toolbox for MatLab and the weka tool [121] and perform experiments. and then combine the individual results using a voting mechanism to get the final label for the sequence. an intrusion detection technique must balance the two. to experiment with these methods. for the support vector machines we experiment with three kernels. and vary the value of c between 1 and 100 for all of the kernels [121]. support vector machines and naive Bayes classifier). These methods have been effectively used for building anomaly and hybrid intrusion detection systems. naive Bayes and support vector machines are not designed for labeling sequential data. Hence. Hence. Similarly. we build six different hidden Markov models. from Chapter 4. we use the CRF++ toolkit [120]. rbf-kernel and normalized-poly-kernel. The naive Bayes classifier is simple to implement and very efficient. i. We perform our experiments using a moving window of events and vary the window width ‘S’ . We perform all experiments ten times by randomly selecting training and test data and report their average. hidden Markov models.e. Hidden Markov models are well known for modeling sequences and have been successful in various tasks in language processing. It is important to note that methods such as decision trees. we create a single record with 5 ∗ 6 = 30 features. suggests that conditional random fields outperform these methods and can be used to build accurate network intrusion detection systems. poly-kernel. decision trees. for experiments with the hidden Markov models. For our experiments. we use six features to represent every event.98 User Session Modeling using Unified Log for Application Intrusion Detection can detect most of the attacks but is extremely slow in operation may not be useful. we analyze the effectiveness of conditional random fields for building application intrusion detection systems and compare their performance with these methods. Additionally. using both the data sets. Support vector machines are also considered to be high quality systems which can handle data in high dimensional space. where every event is described by six features.

hence. i. Thus. naive Bayes and hidden Markov models for both the data sets. Hence. they can detect attacks reliably. the attacks are disguised in a large number of normal events. we define the disguised attack parameter.. Results for both the data sets show similar trends. when the number of normal events increases. however. can model the correlation between events. be exploited by attackers since they can hide the attacks within normal events. to make the intrusion detection task more realistic.1]. while a window of width S = 20 implies that a sequence of 20 events is analyzed to perform the labeling. we compare the attack detection accuracy (F-Measure) as we increase the window width ‘S’ from 1 to 20 for a fixed value of p = 1 for conditional random fields.e. The attacks are not disguised when p = 1. ‘p’ as follows: number o f Attack events number o f Normal events + number o f Attack events p= where number o f Attack events > 0 and number o f Normal events ≥ 0. the attacks are not disguised. We perform experiments to reflect these scenarios by varying the number of normal events in an attack session by setting ‘p’ between 0 and 1. We limit ‘S’ to 20 as the complexity of the system increases which affects system’s efficiency. We observe that conditional random fields and support vector machines perform similarly and their attack detection capability (F-Measure) increases.6.5.2. as the value of ‘S’ increases. This can. support vector machines. slowly but steadily. In Figure 6. The value of ‘p’ lies in the range (0. attack detection accuracy improves as ‘S’ increases. To create disguised attack data. we add a random number of attack events at random locations in the normal sessions and label all the events in the session as attack. i. decision trees. As the value of ‘p’ decreases.1 Experiments with Clean Data (p = 1) We first set p = 1. 6. making attack detection very difficult. This shows that modeling a user session results in better attack detection accuracy when compared to analyzing the events individually.5 Experiments and Results 99 from 1 to 20. Conditional random fields do not consider the sequence of events in a session to be independent and. This results in hiding the attacks within normal events such that the attack detection becomes difficult.e. since in this case the number of normal events is 0. Support vector machines also result in good attack detection accuracy and can easily handle a . Window of width S = 1 indicates that we consider only the current event and do not consider the history.

Their accuracy improves initially as ‘S’ increases but when ‘S’ becomes large their accuracy tends to decrease. the error due to loss of correlation is small which increases with the number of features. there is little effect of ‘S’ on the attack detection accuracy for decision trees when compared with the naive Bayes classifier. When the number of features is less. thereby resulting in good classification. they cannot model long range dependencies in the observations. This is because. they consider features independently to label a particular event in a session and then combine the results for all the features but do not consider the correlation between them. Hence. however. Decision trees and naive Bayes perform poorly and have low F-Measure regardless of the window width ‘S’. their accuracy improves slightly as ‘S’ increases. make independence assumptions. . When compared with the conditional random fields. thereby resulting in poor performance. Also. Furthermore. thus. the number of input features increases as ‘S’ increases.100 User Session Modeling using Unified Log for Application Intrusion Detection large number of features. but the decision trees selects a subset of features and its size remains fairly constant. hidden Markov models have lower attack detection accuracy because they are generative systems which model the joint distribution instead of the conditional distribution and. Hidden Markov models also perform poorly.

S 15 20 (b) Data Set Two Figure 6.6 0.8 F-Measure 0.8 F-Measure 0.5 HMM 0.2 0 0 5 10 Width of Window.2: Comparison of F-Measure ( p = 1) .6.5 HMM 0.2 0 0 5 10 Width of Window. S 15 20 (a) Data Set One CRF SVM 1 Naive Bayes C4.6 0.5 Experiments and Results 101 CRF SVM 1 Naive Bayes C4.4 0.4 0.

60) In order to test the robustness of different methods. where p < 1 indicates that the attack is disguised within normal events in a session.60. compared to that of the second. thus. In Figure 6. The size of the first data set is bigger.102 User Session Modeling using Unified Log for Application Intrusion Detection 6. we compare the results for all the five methods for both the data sets at p = 0. . the support vector machines did not perform as well. we perform experiments with disguised attack data. The hidden Markov models perform better because they consider the sequence information which becomes significant when the size of the data set is small. The variation in performance of the hidden Markov models and the decision trees for the two data sets is attributed to the size of the data sets. the attack detection accuracy increases as ‘S’ increases. As a result. The reason for this is that. perform effectively even when the attacks are disguised. The reason for better accuracy for the conditional random fields is that they can model long range dependencies among the events in a sequence. However. we define the disguised attack parameter. the conditional random fields perform best. The decision trees are least effective in detecting disguised attacks for the second data set. Again. the decision trees cannot perform optimally compared to the hidden Markov models. since they do not assume independence within the event vectors and. for the second data set. For both the data sets. the decision trees can better select significant features in the first data set resulting in higher accuracy. As discussed earlier. the support vector machines cannot geometrically differentiate between the normal and attack events because of the overlap between the normal data space and the attack data space. As we decrease ‘p’.3. Using such a data set makes attack detection realistic and more difficult as an attacker may try to hide the attack within normal events. outperforming all other methods and are robust in detecting disguised attacks when compared with any other method. we observe that the attack detection capability decreases as the attacks are disguised within normal events. due to its small size.5.2 Experiments with Disguised Attack Data (p = 0. Hidden Markov models are least effective for the first data set while support vector machines and naive Bayes classifier have similar performance. However. ‘p’.

5 Experiments and Results 103 CRF SVM 1 Naive Bayes C4.3: Comparison of F-Measure ( p = 0.8 F-Measure 0.60) .2 0 0 5 10 Width of Window.4 0.5 HMM 0.4 0. S 15 20 (a) Data Set One CRF SVM 1 Naive Bayes C4.6.8 F-Measure 0.5 HMM 0. S 15 20 (b) Data Set Two Figure 6.6 0.6 0.2 0 0 5 10 Width of Window.

. while it is 0. Recall and F-Measure for conditional random fields at p = 0. The best value for F-Measure for data set one is 0.60 and present the results in Figure 6.65 at S = 20 for data set two. from both the data sets.4.104 User Session Modeling using Unified Log for Application Intrusion Detection Results using Conditional Random Fields We study the Precision. suggest that they have high FMeasure which increases steadily as the window width ‘S’ increases. Results for conditional random fields.87 at S = 15. This suggests that the system based on conditional random field generates fewer false alarms and performs reliably even when attacks are disguised.

S 15 20 (b) Data Set Two Figure 6.6 0.2 0 0 5 10 Width of Window.60 .6.4 0.4 0.4: Results using Conditional Random Fields at p = 0.8 0.2 0 0 5 10 Width of Window.6 0.8 0.5 Experiments and Results 105 Precision 1 Recall F-Measure 0. S 15 20 (a) Data Set One Precision 1 Recall F-Measure 0.

rbf-kernel and normalized-poly-kernel.60. The best value of FMeasure for support vector machines for data set one is 0. for support vector machines.5 shows that support vector machines have moderate Precision for both the data sets. we experiment with three kernels. Recall and F-Measure for support vector machines as we increase ‘S’ from 1 to 20 at p = 0. we report the results using the same kernel. and vary the value of c between 1 and 100 for all of the three kernels. . As mentioned earlier.5 represents the variation in Precision.82 at S = 17. while it is 0.65 for data set one and data set two respectively. We observe that the poly-kernel with c = 1 performs best and.106 User Session Modeling using Unified Log for Application Intrusion Detection Results using Support Vector Machines Figure 6.49 at S = 20 for data set two in comparison to the conditional random fields which have the F-Measure of 0. polykernel.87 and 0. Figure 6. but low Recall and hence low F-Measure. hence.

4 0.6 0.6 0. S 15 20 (b) Data Set Two Figure 6. S 15 20 (a) Data Set One Precision 1 Recall F-Measure 0.4 0.6.5: Results using Support Vector Machines at p = 0.2 0 0 5 10 Width of Window.5 Experiments and Results 107 Precision 1 Recall F-Measure 0.8 0.60 .8 0.2 0 0 5 10 Width of Window.

Recall and F-Measure for decision trees in Figure 6.108 User Session Modeling using Unified Log for Application Intrusion Detection Results using Decision Trees We study the variation in Precision.6. . Results from Figure 6.6 show that the decision trees have very low F-Measure suggesting that they cannot be used effectively for detecting anomalous data accesses when the attacks are disguised. The detection accuracy for decision trees remains fairly constant as ‘S’ increases and is maximum at S = 20 and at S = 19 for the two data sets.

8 0.6.6: Results using Decision Trees at p = 0.6 0.4 0.5 Experiments and Results 109 Precision 1 Recall F-Measure 0.4 0.60 .2 0 0 5 10 Width of Window.6 0.8 0. S 15 20 (b) Data Set Two Figure 6. S 15 20 (a) Data Set One Precision 1 Recall F-Measure 0.2 0 0 5 10 Width of Window.

67 at S = 12 for data set one and 0.60.7 represents the variation in Precision.110 User Session Modeling using Unified Log for Application Intrusion Detection Results using Naive Bayes Classifier Figure 6. suggesting that a system based on naive Bayes classifier cannot detect attacks reliably. Recall and F-Measure for the naive Bayes classifier as we vary ‘S’ from 1 to 20 at p = 0. . The results suggest that the system has low F-Measure and there is little improvement in the attack detection accuracy as ‘S’ increases. Experimental results using both the data sets show similar trend for the naive Bayes classification.43 at S = 19 for data set two. The maximum value for F-Measure is 0.

S 15 20 (b) Data Set Two Figure 6.6 0.5 Experiments and Results 111 Precision 1 Recall F-Measure 0.4 0. S 15 20 (a) Data Set One Precision 1 Recall F-Measure 0.8 0.6 0.60 .8 0.4 0.2 0 0 5 10 Width of Window.2 0 0 5 10 Width of Window.7: Results using Naive Bayes Classifier at p = 0.6.

8.60 in Figure 6. Recall and F-Measure for the hidden Markov models for both the data sets at p = 0. we observe that the hidden Markov models have very high Recall but very low Precision and hence low F-Measure. . From Figure 6.112 User Session Modeling using Unified Log for Application Intrusion Detection Results using Hidden Markov Models We present the Precision. There is little effect of ‘S’ on the F-Measure which does not improve significantly.8.

2 0 0 5 10 Width of Window.6 0. S 15 20 (b) Data Set Two Figure 6.60 .8 0.8 0.6. S 15 20 (a) Data Set One Precision 1 Recall F-Measure 0.5 Experiments and Results 113 Precision 1 Recall F-Measure 0.4 0.2 0 0 5 10 Width of Window.4 0.8: Results using Hidden Markov Models at p = 0.6 0.

47 0.87 0.31 0.64 0.6 Analysis of Results Experimental results clearly suggest that the conditional random fields outperform other methods and are the best choice to build application intrusion detection systems.47 0.83 0. 6.2: Effect of ‘S’ on Attack Detection for Data Set One.65 0.27 0. Window width of 20 and beyond is often large.63 0.51 0.61 0.50 0.38 0.65 0.44 0.86 0.80 0.47 0.2 and 6.81 0.60 0.75 0.68 0.70 0.86 0.66 Support Vector Machines 0.66 0.66 0. resulting in delayed attack detection and high computation cost.35 0.64 0.65 0.67 0.69 0.4 describe the effect of ‘S’ on attack detection for the two data sets.66 0.82 0.51 0.50 0.81 Conditional Random Fields 0.65 0. Tables 6. We want ‘S’ to be small since the complexity and the amount of history that must be maintained increases with ‘S’ and the system cannot respond to attacks in real-time. we use a moving window to model user sessions by varying ‘S’ from 1 to 20.75 0.35 0.56 0.114 User Session Modeling using Unified Log for Application Intrusion Detection 6.44 0.41 0.80 0.65 0.40 0.48 0.39 0.76 0.6.51 0.27 0.84 0.33 0.66 0.74 0.86 0.65 0.77 0.35 0.41 Decision Trees 0. Table 6.76 0.58 0.30 0.38 0.56 Naive Bayes 0.78 0.47 0.36 0.87 0.86 0.86 .63 0.1 Effect of ‘S’ on Attack Detection In our experiments.76 0.61 0.47 0.62 0.61 0.39 0.60 F-Measure Width of Window ‘S’ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Hidden Markov Models 0.00 0.44 0.47 0.74 0.53 0.84 0.46 0.77 0.79 0.64 0.71 0.24 0.72 0.82 0.39 0.68 0.80 0.26 0.41 0.69 0. when p = 0.40 0.81 0.

Row one in the table shows that the hidden Markov models achieve the best F-Measure of 0. at S = 1).41) C4. Hence.6 Analysis of Results 115 From Table 6.. Table 6. support vector machines reach their best performance at window width of 17 while the conditional random fields achieve same performance at S = 10. Similarly. Results for the first data set show that the hidden Markov model performs best at S = 18 while conditional random fields achieve the same performance at S = 1.67) SVM (0.3: Analysis of Performance of Different Methods HMM HMM (0. naive Bayes classifier.5 (0. Naive Bayes peak their performance at S = 12 while conditional random fields achieve same performance at S = 3.87) S = 18 S > 20 S > 20 S > 20 S > 20 C4. Finally. .e.87 at ‘S’ value of 15 while all other methods require more than 20 events to achieve the same performance.2. when we increase ‘S’ beyond 20 (not shown in the graphs). decision trees analyzes 20 events to reach their best performance while conditional random fields achieve same performance by analyzing only a single event (i.6.82) CRF (0.41 at S = 18 while decision trees.5 S=1 S = 20 S > 20 S > 20 S > 20 Naive Bayes S=1 S=1 S = 12 S > 20 S > 20 SVM S=1 S=1 S=3 S = 17 S > 20 CRF S=1 S=1 S=3 S = 10 S = 15 Table 6. the attack detection accuracy increases steadily and the system achieves very high F-Measure when we analyze the events in the entire session together. Additionally. since it results in early attack detection and an efficient system.3. performing session modeling using conditional random fields result in higher accuracy for attack detection at lower values of ‘S’ which is desirable. we observe that conditional random fields perform best and their attack detection capability increases as the window width increases. We compare various methods in Table 6. support vector machines and conditional random fields achieve the same F-Measure at S = 1.3 can be interpreted as follows. Similarly. the last row indicates that the conditional random fields achieve the highest F-Measure of 0.56) Naive Bayes (0.

41 0. the attack detection accuracy increases as the width of the window ‘S’ increases. . We do not present the results for other methods since they perform poorly at lower values of ‘p’.6.46 0.42 0.116 User Session Modeling using Unified Log for Application Intrusion Detection Table 6.42 0.39 0.21 0.36 0.37 0.25 0.26 0.34 0.57 0.59 0. attacks are disguised within normal events.39 0.52 0.41 0.25 0.40 0.35 0.42 0.41 0.35 0.44 0.45 0.41 0.41 0.13 0.36 0.e.43 0.42 0.21 0.00 0.35 0.40 Decision Trees 0.36 0. regardless of the value of ‘p’ and for a fixed value of ‘p’.41 Support Vector Machines 0.35 0.16 0.53 0.38 0.28 0.65 6.37 0.28 0.42 0.53 0.41 0.30 Naive Bayes 0.43 0.49 Conditional Random Fields 0.52 0.37 0.43 0.9.55 0.37 0. represents the effect of ‘p’ on conditional random fields for different values of ‘S’ for both the data sets.37 0.9.35 0.39 0.50 0.03 0.37 0.21 0. when p = 0.18 0.42 0. i.35 0.06 0.23 0..01 0.38 0.58 0. we make two observations.60 F-Measure Width of Window ‘S’ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Hidden Markov Models 0.36 0.56 0.55 0.46 0.50 0.58 0.42 0.63 0.02 0.46 0.39 0.31 0. ‘p’.37 0. From Figure 6. between 0 and 1.42 0. first.31 0.32 0.41 0.44 0.38 0. Figure 6.04 0.40 0.41 0.63 0. we experiment with the disguised attack data by varying the disguised attack parameter.48 0.35 0.2 Effect of ‘p’ on Attack Detection (0 < p ≤ 1) To analyze the robustness of conditional random fields.42 0.4: Effect of ‘S’ on Attack Detection for Data Set Two.59 0. the attack detection accuracy decreases making it difficult to detect attacks and second.60 0.56 0. as the value of ‘p’ decreases.40 0.

S 15 20 (a) Data Set One p=1.4 0.35 p=0.60 p=0.9: Effect of ‘p’: Results using Conditional Random Fields when 0 < p ≤ 1 .8 F-Measure 0.4 0.35 p=0.00 1 p=0.6 Analysis of Results 117 p=1.6.25 p=0.60 p=0.00 1 p=0.6 0.45 p=0.45 p=0.8 F-Measure 0.2 0 0 5 10 Width of Window.2 0 0 5 10 Width of Window. S 15 20 (b) Data Set Two Figure 6.25 p=0.15 0.6 0.15 0.

In the third system. we can conclude that using a voting mechanism may not always be useful. it is evident that using the unified log eliminate the need to consider over one million (1. while the second system analyzes only the data access logs. An advantage of our system is that it can be deployed in real environment as it analyzes only the summary statistics extracted from the data access logs rather than analyzing every data access to match previously known attack signatures. We use the same instances as used in our previous experiments and present the results with conditional random fields at ‘p’ value of 0. we combine the individual responses from both the systems using a voting mechanism to determine the final labeling. . From Table 6. our approach limits the number of events to the number of web accesses. We call this system as a voting based system. the web access logs and the corresponding data access logs. Hence. based on our approach of session modeling with unified log.268) data accesses for the second data set. We present the comparison in Figure 6. which is significantly smaller when compared to the number of data accesses. Hence. we label the event as attack.118 User Session Modeling using Unified Log for Application Intrusion Detection 6. our system can analyze the user behaviour (via the web accesses) and its effect on the application behaviour (via the corresponding data accesses). Features in both the logs are correlated and analyzing them individually by building separate systems significantly affects attack detection capability. The results clearly suggest that using a single system. our approach uses features from both. performs best.3.60 by varying ‘S’ from 1 to 20.3 Significance of Using Unified Log We performed our experiments using the unified log (based on the framework described in Chapter 5) to detect application level attacks. the performance improves for the first data set. and at the same time limits the load at the intrusion detection system which is significant in high speed application environment.181.10. The first system analyzes only the application logs. If either of the two systems labels an event as attack. Hence. By using the unified log. We also observe that when we use two separate systems and use a voting mechanism to determine the final label. we perform additional experiments where we build three separate systems and compare them with our approach.1 in Section 6.6. Instead. but it decreases for the second data set.

8 F-Measure 0.6 Analysis of Results 119 Unified Log Voting Based 1 Web Access Logs Alone Data Access Logs Alone 0.10: Significance of Using Unified Log .8 F-Measure 0.6.2 0 0 5 10 Width of Window.6 0.2 0 0 5 10 Width of Window.4 0.6 0.4 0. S 15 20 (a) Data Set One Unified Log Voting Based 1 Web Access Logs Alone Data Access Logs Alone 0. S 15 20 (b) Data Set Two Figure 6.

support vector machines and naive Bayes classifiers because the number of features increases from 6 to 120 while the number of features in conditional random fields still remain equal to six. Nonetheless. appear to be counter intuitive and we expect the performance of naive Bayes classifier and decision trees to be better than that of the conditional random fields.4 Test Time Performance It is not justified to compare the efficiency of our system with that of a signature based system because the two systems are significantly different in their attack detection capability. we focus only on the test time complexity. this overhead can be eliminated by developing better software engineering practices which are aware of the security implications. both conditional random fields and hidden Markov models employ the Viterbi algorithm which has a complexity of O( TL2 ). However. Table 6.120 User Session Modeling using Unified Log for Application Intrusion Detection 6. It is important to note that the unification of logs does incur some overhead. The quadratic complexity is problematic when the number of labels is large. We now compare the test time performance of different methods. On the contrary. lies in their capability of detecting novel attacks in addition to detecting previously seen attacks. Signature based systems simply perform signature matching for previously known attacks while the strength of anomaly and hybrid systems. considering the fact that when we increase ‘S’ to 20. Hence.5 compares the average test time for analyzing a session by different methods at S = 20 and p = 0. but for intrusion detection we have a limited number of labels (normal and attack) and. During testing. the complexity increases for decision tress. where T is the length of the sequence and L is the number of labels. Further. particularly in web based applications.6. such as one described in this chapter. the overhead incurred is very small when compared with the time required to individually analyze the web access logs and the data access logs. The test time performance of various systems presented in Table 6. the system is efficient. hence. we consider a first order Markov assumption for labeling . naive Bayes classifier and decision trees are very efficient and can handle large dimensionality in data. we observe that the conditional random fields perform best. such as in the language tasks. However.5. We are generally not interested in the training time because training is often a one time process and can be performed offline. Support vector machines. Security aware software engineering practices can be followed which can provide standardized unified log rather than separately logging web accesses and their corresponding data accesses.60 for both the data sets.

when compared with other methods. at lower values of ‘p’. help to improve attack detection accuracy.6. we can conclude that our results are not the artifact of a particular data set.5: Comparison of Test Time Test Time (µ sec. very often. based on unified web access and data access logs. This is because. From our experiments. systems based on session modeling can detect attacks better when compared to those which analyze every event in isolation. 6. As a result.5 Discussion of Results Experimental results from both the data sets clearly suggest that conditional random fields. the conditional random fields score better and are a strong candidate for building robust and efficient application intrusion detection systems.e. perform best and are able to detect attacks reliably. Additionally. Further. we observe that the results follow the same trend for both the data sets and. even when the attacks are disguised in normal events. hence. that our framework is application independent and can be easily used for a variety of applications. the attacker performs a number of events in a sequence.) Data Set One Conditional Random Fields Hidden Markov Models Decision Trees Naive Bayes Classifier Support Vector Machines 510 7361 3515 4125 9740 Data Set Two 555 7415 3510 4080 9125 in the conditional random fields and the label set itself is very small (equal to two) which results in high test time efficiency. the time complexity for hidden Markov models is higher because of the additional overhead involved in combining the results from six independent models to get the final label as we discussed earlier. the attack detection accuracy and the test time performance.6. performing session modeling using our unified logging framework. i.6 Analysis of Results 121 Table 6. to launch an attack.. Therefore. considering both. This is clear from our experiments where we show that the attack detection improves as the value of .

hidden Markov models are generative systems and cannot represent long range dependencies among observations. Even then. The advantage of conditional random fields is that higher attack detection occurs at lower values of ‘S’ which is desirable for the reasons discussed before. they can reliably detect attacks when the value of ‘p’ decreases. and once trained they are very efficient and robust. Hence. However. decision trees and naive Bayes classifiers on the other hand consider the events to be independent and ignore the correlation between features. The reason for better attack detection with conditional random fields is that they do not consider the features to be independent and are able to model the correlation between them. Conditional random fields do not make any unwarranted assumptions about the data. we performed our experiments on two data sets and our results clearly suggest that the conditional random fields perform best for both the data sets. Our system focuses on detecting such modifications by combining the user behaviour with the application behaviour instead of using specially crafted signatures which are limited in detecting specific attacks. Further. Similarly.122 User Session Modeling using Unified Log for Application Intrusion Detection ‘S’ increases. Finally. Further. In general. thereby resulting in lower accuracy of attack detection. hence. it is important to note that simulating a few attacks does not necessarily imply that our system is limited in detecting only these attacks. our system is capable of detecting attacks as the system does not consider events independently. the system can detect attacks with higher accuracy as opposed to the independent analysis of the web access logs and the data access logs. We also note that an experienced attacker may disguise attacks within 20 or more normal events. establishing that our results are not an artifact of a particular data set. as we discussed earlier. our system can detect any illegitimate data access since malicious modifications result in different application-data interaction when compared to the legitimate requests. . We have already discussed that our system focuses on modeling the interaction between the user behaviour and the application behaviour. Support vector machines. ‘S’ must be increased when ‘p’ decreases. thereby resulting in lower accuracy of attack detection. Our experimental results confirm that when the unified log are analyzed using session modeling. there is a tradeoff between the disguise attack parameter ‘p’ and the window width ‘S’. Performing session modeling using a moving window of events in our unified logging framework helps to correlate the user behaviour and the application behaviour providing rich interacting features which improve attack detection. they can model the long range dependencies between sequential events in a session and. for better attack detection.

7. It is particularly suited to applications which follow the three tier architecture which have application and data independence. Furthermore.7. it can be easily extended to the general service oriented architecture by selecting many services. The challenge is to identify such correlations automatically and this provides an interesting direction for future work. require some domain specific knowledge in order to identify the correlated services (applications). it is critical to resolve issues such as the availability of the training data and suitability of our approach for a variety of applications. 6.7 Issues in Implementation 123 6. which can be used to train effective application intrusion detection systems. however. This is because as part of the business solution. the service oriented architecture defines numerous services each of which provides specific functionality and which have the capability to interact among themselves. However.1 Availability of Training Data Though our system is application independent and can be used to detect malicious data access in a variety of applications. 6.2 Suitability of Our Approach for a Variety of Applications As we already discussed. before deployment. training data can be made available as early as during the application testing phase when the application is tested to identify errors. However. our framework can be easily extended and deployed in the Service Oriented Architecture [133].6. our framework is generic and can be deployed for a variety of applications. Nonetheless. However. it must be trained before the system can be deployed online to detect attacks. . Our proposed framework can be considered as a special case for the service oriented architecture which defines only one service. this requires security aware software engineering practices which must ensure that necessary measures are taken to provide training data during the application development phase. To obtain such data may be difficult. This would. We now discuss various methods which can be employed to resolve such issues.7 Issues in Implementation Experimental results show that our approach based on conditional random fields can be used to build effective application intrusion detection systems. This requires training data which is specific to the application. Logs generated during the application testing phase can be used for training the intrusion detection system.

such as those discussed in this chapter. . we considered a sequence of events in a session.124 User Session Modeling using Unified Log for Application Intrusion Detection 6. Another advantage of our system is that it models user-application and application-data interaction which does not vary overtime as compared to modeling user profiles which change frequently. Application and data interaction vary only in case of an attack which is detected by our system. In our framework. We also showed that unified log not only helps to improve the attack detection accuracy but also to improve system’s performance since we can use summary statistics rather than analyzing every data access. following better security aware software engineering practices and taking care of logging mechanism during application development would not only help in application testing and related areas but would also provide necessary framework for building better and efficient application intrusion detection systems.8 Conclusions In this chapter. We also showed that our system using conditional random fields is robust and is able to detect disguised attacks effectively. we implemented user session modeling using a moving window of events in our unified logging framework to build application intrusion detection systems which can detect application level attacks effectively and efficiently. Experimental results confirm that conditional random fields can be effectively used in our framework and perform better when compared with other methods. Our experimental results with multiple data sets show similar trends and confirm that our framework is application independent and can be used for a variety of applications. Finally. Our system based on conditional random fields can detect attacks at smaller values of ‘S’ resulting in early attack detection. rather than analyzing the events individually which improves the attack detection accuracy.

Conditional random fields are a strong candidate for building robust and efficient intrusion detection systems. The three issues are: 1. availability of training data. Inefficiency in operation Other issues such as the scalability and ease of system customization. both. 2. the attacks can be detected with very high accuracy. The framework also identifies the type of attack and. Limited attack detection coverage 2. Integrating the layered framework with the conditional random fields can be used to build effective and efficient network intrusion detection systems. In addition. Large number of false alarms and 3. we explored the suitability of conditional random fields for building robust and efficient intrusion detection systems which can operate. we introduced novel frameworks and developed models which address three critical issues that severely affect the large scale deployment of present anomaly and hybrid intrusion detection systems in high speed networks. thus. robustness of the system to noise in the training data. and the ability of the system to detect disguised attacks were also addressed. Layered framework can be used to build efficient intrusion detection systems. In particular. Using conditional random fields as intrusion detectors result in very few false alarms and.Chapter 7 Conclusions I N this thesis. 125 . the framework offers ease of scalability for detecting different variety of attacks as well as ease of customization by incorporating domain specific knowledge. hence. As a result of this research. specific intrusion response mechanism can be initiated which helps to minimize the impact of the attack. we conclude that: 1. at the network and at the application level.

hence. Additionally. the system is robust and can effectively detect disguised attacks. show that our approach. We performed a range of experiments which show that. Further. capable of detecting novel attacks. Conditional random fields can easily model such correlations by defining specific feature functions which make them a strong candidate for building effective intrusion detectors. even when trained with noisy data.126 Conclusions 3. Conditional random fields can be effectively used in this framework to model a sequence of events in a user session. Our framework is highly scalable. we introduced the layered framework which helps to improve overall system performance. accuracy of attack detection and efficiency of system operation. Unified logging framework can capture user-application and application-data interactions which are significant to detect application level attacks. thereby. both. our system is not based on attack signatures and. Additionally. Assuming various features to be independent. it is critical to model the sequence of events. Using conditional random fields’ attacks can be detected at smaller window widths.5% improvement). This is because. Finally. resulting in an efficient system. naive Bayes.8% improvement) and Remote to Local (R2L) attacks (34. . for User to Root (U2R) attacks (34. it affects its attack detection capability. 4. very often. Statistical tests also demonstrate higher confidence in detection accuracy with layered conditional random fields. outperform these methods. makes a model simple and efficient. Experimental results on the benchmark KDD 1999 intrusion data set [12] and comparison with other well known methods for intrusion detection such as decision trees. an attacker must perform a number of sequential operations in order to launch a successful attack. User session modeling using the unified log must be performed in order to detect application level attacks with high accuracy. it is critical to model the correlations between multiple features in an observation. though. easily customizable and can be used to build efficient network intrusion detection systems which can detect a wide variety of attacks. based on layered conditional random fields. The impressive part of our results is the percentage improvement in attack detection accuracy. The framework is application independent and can be used for a variety of applications. We also performed experiments which show that. in order to effectively detect application level attacks. We also showed that our system is robust and can detect attacks with higher accuracy. in order to detect intrusions effectively. particularly. support vector machines and the winners of the KDD 1999 cup. in terms of. when compared with other methods.

87 while the same for hidden Markov models. support vector machines. Experimental results confirm that our system.60. this can be used to detect attacks with high accuracy. often using two separate systems.43 and 0. the application access logs and the corresponding data access logs are highly correlated. However.7. 7.46.56. To address these issues. the two data sets which we collected can be downloaded from [13] and can be used to build and evaluate application intrusion detection systems. decision trees.1 Directions for Future Research 127 for most applications. decision trees. can detect attacks at an early stage by analyzing only a small number of past events resulting in an efficient system which can block attacks in real-time. resulting in inefficient systems which give a large number of false alarms and. for data set one at p = 0. we introduced our unified logging framework which integrates the application access logs and the corresponding data access logs to generate unified log. Similarly. To detect attacks at the application level.49 respectively. Every network and application is custom designed and it becomes extremely difficult to de- .67 and 0. our system achieves an F-Measure of 0. Further.1 Directions for Future Research The critical nature of the task of detecting intrusions in networks and applications leaves no margin for errors. As a result.42. using conditional random fields in our unified logging framework achieves an F-Measure of 0. low attack detection accuracy. Finally. Experimental results also demonstrate that our system is robust and can detect disguised attacks effectively. present application intrusion detection systems analyze the logs separately. outperforming other methods such as the hidden Markov models. decision trees and the naive Bayes.41. it becomes critical to identify the best possible approach for developing better intrusion detection systems. 0. for data set two at p = 0. 0. The effective cost of a successful intrusion overshadows the cost of developing intrusion detection systems and hence.65 while the same for hidden Markov models. In particular.60. the user-application and the application-data interactions are stable and do not vary overtime as opposed to modeling user profiles which change frequently.82 respectively. and in particular for web based applications. the application logs or (and) the data access logs can be used. naive Bayes and support vector machines is 0. based on user session modeling using conditional random fields which analyze unified log. hence. 0. 0. the user-application and the application-data interaction can be captured. naive Bayes and support vector machines is 0.

There is ample scope and need to build systems which aim at preventing attacks rather than simply detecting them. it is evident that our systems performed efficiently. Integrating intrusion detection systems with the security policy in individual networks would help to minimize the false alarms and qualify the alarms raised by the intrusion detection systems. however. we proposed novel frameworks and developed methods which perform better than previously known approaches. 1. However. In this thesis. . The problem is to identify the true source of attack without affecting the performance of the overall system. However. in order to improve the overall performance of our system we used the domain knowledge for selecting better features for training our models. developing faster implementations of conditional random fields particularly for the domain of intrusion detection requires further investigation. our framework can be extended and deployed in the Service Oriented Architecture [133] and presents another line of interesting research. However. From our experiments. However. layered framework. but they require a global effort which is not very easy to ensure. Solutions are available for this such as the adjusted probabilistic packet marking and others [40]. if there is a reliable method to trace back the packets to their actual source. Many of the attacks are successful because the attackers enjoy anonymity and they can launch attacks from spoofed sources. many of the attacks can be prevented. making it very hard to trace back the true source of the attack. developing completely automatic systems presents an interesting direction for future research. We demonstrated the effectiveness of our application intrusion detection system in the well known three tier application architecture. Using domain knowledge to develop better systems is not a significant disadvantage. Thoughts for Practitioners We now outline some open issues which are significant but outside the scope of this thesis which must be considered in order to develop better intrusion detection systems [3]. This is justified because of the critical nature of the task of intrusion detection.128 Conclusions velop a single solution which can work for every network and application. Another possible direction for future research is to employ our approach. for building highly efficient systems since they give opportunity to implement pipelining of layers in multi core processors.

the security policy must be complete and second. However. The field of intrusion detection has been around since 1980’s and a lot of advancement has been made in the same. However. Intrusion detection systems must integrate with such networks and devices and provide support for advances in a comprehensible manner. The problem is how to link the supplied credentials with the actual human user? Methods based on user profiling can be used which learn the normal user profile and then can be used to detect significant deviations from the learnt profile. the problem is to clearly define the acceptable and the unacceptable usage of every resource. the policy must be clear and unambiguous. Hence. first. to keep pace with the rapid and ever changing networks and applications. they are based upon thresholds which are selected by empirical analysis and. . the research in intrusion detection must synchronize with the present networks. Present networks increasingly support wireless technologies. authentication mechanisms such as the use of login and password are weak and can be compromised. Many systems are based upon authenticating a user. may not always be accurate.7. There are two major issues in defining the security policy. However. 3. Security policy plays an important role in a network and describes the acceptable and non acceptable usage of the resources. hence. Multi factor authentication and use of biometric methods have been introduced but they can also be bypassed.1 Directions for Future Research 129 2. removable and mobile devices.

.

org/resources/idfaq/. Tao Peng. and Kirsten Swearingen.edu/research/ projects/how-much-info-2003. In Proceedings of IEEE International Conference on Intelligence and Security Informatics. http: //www. http://www2. Attacking Confidentiality: An Agent Based Approach. ACM Computing Surveys. and Kotagiri Ramamohanarao. Hal R. cert. Laheem Lamar Jordan. Technical Report 98-17. https://www. Vol (3975).isc. Christopher Leckie. [3] Kotagiri Ramamohanarao. pages 234–249. Survey of Network-Based Defense Mechanisms Countering the DoS and DDoS Problems.Bibliography [1] Stefan Axelsson. 2007. pages 285–296. and Ashraf Kazi. [2] SANS Institute .berkeley.org/archive/pdf/attack_trends. Joyojeet Pal. Springer Verlag. ACM. Kapil Kumar Gupta. 1998. Baikunth Nath. Nathan Good. and Christopher Leckie. 2007.pdf. Last accessed: Novmeber 30. [8] Tao Peng. 2008.sans.sims. [4] Overview of Attack Trends.Intrusion Detection FAQ. Vol (4812). [5] Kapil Kumar Gupta. http://www. 2002. Department of Computer Engineering. Kotagiri Ramamohanarao. Last accessed: November 30. 2008. Varian. Research in Intrusion-Detection Systems: A Survey. Last ac- cessed: Novmeber 30. Last accessed: Novmeber 30. 131 . Peter Charles. The Curse of Ease of Access to the Internet. 2008. 39(1):3. org/solutions/survey/. 2008. Springer Verlag. Lecture Notes in Computer Science. Lecture Notes in Computer Science. 2006. [6] The ISC Domain Survey. How much Information. Chalmers University of Technology. [7] Peter Lyman. In Proceedings of the 3rd International Conference on Information Systems Security (ICISS).

Hernan.uci.ics. A Review of Information Security Issues and Respective Research Contributions. Last accessed: Novmeber 30. Robert D. 51(12):3448– 3470. 2002.cert. and Kotagiri Ramamohanarao. Department of Computer Engineering. Sherif and Tommy G. [17] Joseph S. Linda Hutz Pesante. Baikunth Nath. [11] Thomas A. Ghorbani. WET ICE. 38(1):60–80. 1(2):84–102. Mcmillan. unimelb.org/ stats/. Jones and Robert S. SIGMIS Database.au/˜kgupta. 1997. International Journal of Network Security. 2008. CERT Coordination Center. 2008. Research on Intrusion Detection and Response: A Survey. pages 115–133. Intrusion Detection: Systems and Models. [13] Kapil Kumar Gupta. Technical report.edu/˜jones/ IDS-research/Documents/jones-sielken-survey-v11. 2008. James T. Shawn V. http: //kdd. Security of the Internet. IEEE. Ellis. 2007. [18] Mikko T. Sielken. Longstaff. 2008. University of Virginia. [12] KDD Cup 1999 Intrusion Detection Data. http://www. Chalmers University of Technology. and Derek Simmel. 1999. . Lipson.virginia.html. Department of Computer Science.132 BIBLIOGRAPHY [9] Animesh Patcha and Jung-Min Park. http://www.html. Technical Report 99-15. http://www. [15] Anita K. [16] Peyman Kabiri and Ali A. [10] CERT/CC Statistics. http://www. Last accessed: Novmeber 30. Last accessed: Novmeber 30. Last accessed: Novmeber 30.edu. 2005. 2008.edu/databases/kddcup99/kddcup99.pdf.cert. In Proceedings of the Eleventh IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises. Dearmond. Computer Networks. 2007. Application based Intrusion Detection Dataset. [14] Stefan Axelsson. Intrusion Detection Systems: A Taxomomy and Survey. Siponen and Harri Oinas-Kukkonen.cs. Last accessed: Novmeber 30. ACM. 2000. Technical Report The Froehlich/Kent Encyclopedia of Telecommunications Vol (15). Computer System Intrusion Detection: A Survey.org/ encyc_article/tocencyc. An Overview of Anomaly Detection Techniques: Existing Solutions and Latest Technological Trends. Howard F.csse.

[27] Biswanath Mukherjee. Computer Networks. 1994. Todd Heberlein. Lunt. [21] James P. 31(9):805–822. Network Intrusion Detection. 1987.securityfocus. pages 120–128. In Proceedings of the IEEE Symposium on Research in Security and Privacy. and D. Computers and Security.V. [25] Paul Innella. IEEE Transactions on Software Engineering. [26] L. A. 1993. Valdes. Chalmers University of Technology. 2002. and Andreas Wespi.Smaha. Levitt. Mukherjee. L. [24] S.nist. 1980. N. [28] John McHugh. The Evolution of Intrusion Detection Systems. IEEE. 1988. Levitt. http://csrc. IEEE. 2008. Denning. Wolber. In Proceedinges of the IEEE Symposium on Research in Security and Privacy. IEEE. 2001. Somayaji. and T. pages 316–326. IEEE. S. Department of Computer Engineering. and Karl N. Technical Report 02-04. Hofmeyr. A. A Sense of Self for Unix Processes. A survey of intrusion detection techniques. Javitz and A. T. In Proceedings of the IEEE Symposium on Security and Privacy. [30] S. [29] Herv´ Debar. 2001. Computer Security Threat Monitoring and Surveillance. 1990. Last accessed: Novmeber 30. 8(3):26–41.pdf. Longstaff. 2008. A Network Security Monitor. Forrest. [22] Dorothy E. An Intrusion-Detection Model. Heberlein. G. A. IEEE. Elsevier Advanced Technology Publications. Intrusion and intrusion detection. 1(1):14–35. Survey of Intrusion Detection Research. pages 296–304. pages 37–44. Haystack: An Intrusion Detection System. Towards a taxonomy of intrusion-detection e systems. S. J. 12(4):405–418. . 1991. The SRI IDES Statistical Anomaly Detector. http://www.E. IEEE Network. 1999. Marc Dacier. Wood. K. Elsevier. 13(2):222–232. International Journal of Information Security.gov/publications/ history/ande80.com/infocus/1514. Last accessed: Novmeber 30. 1996.BIBLIOGRAPHY 133 [19] Teresa F. [20] Emilie Lundin and Erland Jonsson. Springer. [23] H. IEEE. B. Anderson. Dias. In Proceedings of the 4th Aerospace Computer Security Applications Conference.

[39] Christopher Kruegel. Baikunth Nath. and Kotagiri Ramamohanarao. Network Security Framework. Baikunth Nath. 2008. Detecting Intrusions Using System Calls: Alternative Data Models. Intrusion Detection and Correlation: Challenges and Solutions. IEEE Transactions on Dependable and Secure Computing. and Giovanni Vigna. Springer. and Kotagiri Ramamohanarao. pages 133–145. pages 203–208. Stephanie Forrest. Springer Verlag. and Fernando Pereira. [37] Kapil Kumar Gupta. [36] Kapil Kumar Gupta. Morgan Kaufmann. Baikunth Nath. [34] John Lafferty. In Proceedings of 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW). In Proceedings of Eighteenth International Conference on Machine Learning. [32] Kapil Kumar Gupta. and Kotagiri Ramamohanarao. 2006. Vol (278). Fredrik Valeur. [38] Kapil Kumar Gupta. International Journal of Computer Science and Network Security. and Kotagiri Ramamohanarao. Intrusion Detection in Networks and Applications. pages 269–283. [33] Kapil Kumar Gupta. Under Review. Layered Approach using Conditional Random Fields for Intrusion Detection. [35] Kapil Kumar Gupta.134 BIBLIOGRAPHY [31] Christina Warrender. In Proceedings of the IEEE Symposium on Security and Privacy. Lecture Notes in Computer Science. World Scientific. Baikunth Nath. IEEE. In Proceedings of the 23rd International Information Security Conference (SEC 2008). Conditional Random Fields for Intrusion Detection. IEEE. and Kotagiri Ramamohanarao. 2005. Baikunth Nath. 6(7B):151– 157. . Baikunth Nath. In Press. 2007. 2001. User Session Modeling for Effective Application Intrusion Detection. 1999. ACM Transactions on Information and Systems Security. Andrew McCallum. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Robust Application Intrusion Detection using User Session Modeling. To Appear. and Kotagiri Ramamohanarao. pages 282–289. In Handbook of Communication Networks and Distributed Systems. and Barak Pearlmutter.

pages 697–708.ietf. Kemmerer. IEEE Transactions on Systems. Adjusted Probabilistic Packet Marking for IP Traceback. 2002. Defining the Operational Limits of Sequence-Based Anomaly Detectors. IEEE. Cliff Kahn. Firewalls and Internet Security. March 1998. Rich Feiertag. pages 18–26. AddisonWesley. [47] Carol Taylor and Jim Alves-Foss. Gaithersburg. 2001. [43] Bruce Schneier. Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data. ACM. Cheswick and Steven M. and Mingming Xu. Syed Masum Emran. National Institute of Standards and Technology. Brian Tung. 2002. 2001. Last accessed: Novmeber 30. Last accessed: Novmeber 30.Data Formats. 1994. Qiang Chen. Information Technology Laboratory. PhD thesis. NetSTAT: A Network-based Intrusion Detection Approach. The Common Intrusion Detection Framework . 31(4):266–274.org/ html/draft-staniford-cidf-data-formats-00. and Maureen Stillman. [46] Giovanni Vigna and Richard A. MD : Computer Security Division. 1996. 1998. pages 25–34. a Network based Intrusion Detection System. [49] Nong Ye. [42] Rebecca Bace and Peter Mell. and Kotagiri Ramamohanarao. Xiangyang Li. . http://tools. [44] Kymie Tan. Man and Cybernetics. 2008. 2008. http://www.snort. In Proceedings of the Second IFIP Networking Conference. 2002. An Empirical Analysis of NATE: Network Analysis of Anomalous Traffic Events. Dan Schnackenberg. [45] Stuart Staniford-Chen. Springer. Christopher Leckie.BIBLIOGRAPHY 135 [40] Tao Peng. Intrusion Detection Systems. [41] William R.org/. In Proceedings of the 14th Annual Computer Security Applications Conference. Bellovin. Part A: Systems and Humans. John Wiley & Sons. The University of Melbourne. Applied Cryptography. [48] Snort. Phil Porras. In Proceedings of the 2002 Workshop on New Security Paradigms.

Herv´ Debar. Data Mining for Network Intrusion Detection. Jose Omar Garcia-Fernandez. pages 1274–1278. pages 13–24. IEEE. Eskin. IEEE. Portnoy. Fuzzy Clustering for Intrusion Detection. and Diego Zamboni. IEEE. Undercoffer. and Saurabh Bagchi. [54] Yu-Sung Wu. Spafford. Imielinski. pages 21–30. 1993. David Isacoff. Mok. Salvatore J. In Proceeding of the 14th Annual Computer Security Applications Conference. In Proceedings of the 20th Annual Computer Security Applications Conference. . [53] Jai Sundar Balasubramaniyan. [56] L. pages 207–216. 1999. pages 120–132. In Proceedings of the NSF Workshop on Next Generation Data Mining. [57] H. Bingrui Foo. 1998. and Mireille Ducasse. Jaideep Srivastava. Levent Ertoz. Agrawal. Stolfo. In Proceedings of the International Conference on Dependability of Computer Systems. E. Ahmed Bendib. Collaborative Intrusion Detection System (CIDS): A Framework for Accurate and Efficient IDS. An Architecture for Intrusion Detection Using Autonomous Agents. Yongguo Mei. 2006. In Proceedings of the 19th Annual Computer Security Applications Conference. 2001. IEEE. In Proceedings of the ACM Workshop on Data Mining Applied to Security (DMSA). A Serial Combination of e Anomaly and Misuse IDSes Applied to HTTP Traffic. pages 428–437. Ludovic Me. In Proceedings of the International Conference on Management of Data (SIGMOD). pages 248–255. Mining Association Rules between Sets of Items in Large Databases. Intrusion Detection with Unlabeled Data using Clustering. Habiba Drias. Distributed Intrusion Detection Framework Based on Mobile Agents. [51] Wenke Lee. Eugene H.136 BIBLIOGRAPHY [50] Paul Dokas. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems. [52] Dalila Boughaci. and Belaid Benhamou. 2004. Joshi. IEEE. and S. ACM. A Data Mining Framework for Building Intrusion Detection Model. ACM. [55] Elvis Tombini. J. In Proceedings of the IEEE Symposium on Security and Privacy. Vipin Kumar. and Kui W. IEEE. [58] R. Stolfo. Aleksandar Lazarevic. pages 234–244. Shah. 2003. and Pang-Ning Tan. and A. and A. 2003. Swami. 2002. T. Youcef Bouznit.

pages 943–949.N.Volume 2. Discovering Frequent Episodes in Sequences. Machine Learning. Dan Geiger. pages 259–267. Monique Becke. [60] Nahla Ben Amor. In Proceedings of 19th Annual Computer Security Applications Conference.Verkamo. Jun Li. Intrusion Detection with Neural Networks. Fredrik Valeur. and Risto Mikkulainen. Ghosh. 2005. and Moises Goldszmidt. MIT. 2006. and Kien A. 1997. 29(2-3):131–163. [62] Darren Mutz. Anomalous System Call Detection. Wu. ACM. pages 14–23.Toivonen. C. Giovanni Vigna. 1992. pages 136–141. Salem Benferhat. [65] Herv´ Debar. pages 420–424. In Proceedings of the 43rd Annual SouthEast Regional Conference . In Proceedings of the IEEE Symposium on Research in Security and Privacy. Annie S. Meng-Jang Lin. and Didier Siboni.I. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining. Darren Mutz. 1997. and A. [66] Anup K. ACM. IEEE. AAAI. ACM. Decision Tree Classifier for Network Intrusion Detection with GA-Based Feature Selection. pages 240–250. William Robertson. IEEE. Hua. In Proceedings of the ACM Symposium on Applied Computing. and Zied Elouedi. 1998. Jay Jorgenson. A Neural Network Component for an e Intrusion Detection System.BIBLIOGRAPHY 137 [59] H. [68] Zheng Zhang. Naive Bayes vs Decision Trees in Intrusion Detection Systems. 1995. [61] Nir Friedman. [63] Christopher Kruegel. HIDE: a Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural . IEEE. and Frank Charron. Manikopoulos.Mannila. and Fredrik Valeur. 2003. [64] Gray Stein. ACM Transactions on Information and System Security. In Proceedings of the 14th Annual Computer Security Applications Conference. Bayesian Network Classifiers. [67] Jake Ryan. pages 210–215. and Christopher Kruegel. In Advances in Neural Information Processing Systems. Bing Chen. Detecting Anomalous and Unknown Intrusions Against Programs. H. 9(1):61–93. Bayesian Event Classification for Intrusion Detection. and Jose Ucles. James Wanken. 2004. Springer.

In Proceedings of the 14th IEEE Computer Security Foundations Workshop. IEEE. Networking Technologies for Enhanced Internet Services International Conference. IEEE Transactions on Reliability. Robustness of the Markov-Chain Model for Cyber-Attack Detection. IEEE. 77(2):257–286. University of Maryland. IEEE. In Proceedings of the Information Networking. [76] Svetlana Radosavac. Ghosh. USENIX Association. and Andrew H. 1999. In Proceedings of International Conference on Machine Learning and Cybernetics. and Michael Schatz. and Xiang-Liang Zhang. Springer Verlag. In Proceedings of the 1st USENIX Workshop on Intrusion Detection and Network Monitoring. In Proceedings of the IEEE Workshop on Information Assurance and Security United States Military Academy. pages 2830–2835. Detection and Classification of Network Intrusions using Hidden Markov Models. 2004. 2003. Markov chains. 2002. Maxion.138 BIBLIOGRAPHY Network Classification. pages 206–219. [77] Wei Wang. Master’s thesis. In Proceedings of Symposium on Applications and the Internet. Network-Based Intrusion Detection with Support Vector Machines. K. Yebin Zhang. pages 209–216. [73] S.A. Classifiers. [72] Dong Seong Kim and Jong Sou Park. IEEE. [75] Lawrence R. [69] Anup K. IEEE. Xiao-Hong Guan. pages 747–756. Identifying Important Features for Intrusion Detection Using Support Vector Machines and Neural Networks. Proceedings of the IEEE. and Intrusion Detection. 2001. Aaron Schwartzbard. Sung. Learning Program Behavior Profiles for Intrusion Detection. 2003. pages 1702–1707. pages 85–90. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Intrusion Detection Using Neural Networks and Support Vector Machines. ICOIN. pages 51–62. [74] Nong Ye. 53(1):116–123. [70] Srinivas Mukkamala. and Connie M. 2001. Tan. Sung and Srinivas Mukkamala. 2004. Rabiner. Modeling Program behaviors by Hidden Markov Models for Intrusion Detection. 2003. Lecture Notes in Computer Science. Jha. [71] Andrew H. and R. . In Proceedings of the International Joint Conference on Neural Networks (IJCNN). Guadalupe Janoski. Borror. 1989.

ACM. .cerias. Mok.cse.BIBLIOGRAPHY 139 [78] Ye Du. [86] Yi Hu and Brajendra Panda. 2008. AAAI. USENIX Association. http://www. ACM. Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation. 2004. [82] Wenke Lee. Stolfo. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. pages 66–72. pages 711–716. and Xiao-Lin Qin. Stolfo. Mining in a Data-flow Environment: Experience in Network Intrusion Detection. 2004. Vol (4). A Clustering Method Based on Data Queries and Its Application in Database Intrusion Detection. and Yonggang Pang. 2008. In Proceedings of the Internet Measurement Conference. and Kui W. pages 345–350. 1998. ACM Transactions on Information and System Security (TISSEC). In Proceeedings of the Fifth World Congress on Intelligent Control and Automation (WCICA). Stolfo. Salvatore J. pages 114–124. A Data Mining Approach for Database Intrusion Detection. 1999. [87] Yong Zhong. Mining Audit Data to build Intrusion Detection Models. In Proceedings of the ACM symposium on Applied Computing. pages 4348–4351. Data Mining Approaches for Intrusion Detection.edu/research/isl/agentIDS. IEEE. http: //www. 2005. Stolfo. [80] Probabilistic Agent based Intrusion Detection. In Proceedings of the Fourth International Conference on Machine Learning and Cybernetics. Last accessed: Novmeber 30. ACM. In Proceedings of the 7th USENIX Security Symposium. Salvatore J. [85] Yu Gu. pages 2096–2101. [83] Wenke Lee. [81] Wenke Lee and Salvatore J. [79] Autonomous Agents for Intrusion Detection. 2000. Last accessed: Novmeber 30. Andrew McCallum.shtml.edu/about/history/coast/projects/ aafid. 2005. A Framework for Constructing Features and Models for Intrusion Detection Systems. and Don Towsley. 1998. 3(4):227–261. pages 79–94.purdue. IEEE. and Kui W. [84] Wenke Lee and Salvatore J. Zhen Zhu.php. A Hidden Markov Models-Based Anomaly Intrusion Detection Method. Huiqiang Wang.sc. Mok.

and Athena Vakali. Identification of Malicious Transactions in Database Systems. IEEE. DIDAFIT: Detecting Intrusions in Databases Through Fingerprinting Transactions. Michael Gertz. 2003. 2002. Kluwer. [89] Elisa Bertino. Wai Lup Low. pages 264–279. IEEE. IEEE. Nina Mishra. Learning Fingerprints for a Database Intrusion Detection System. Vol (3). pages 329–335. DEMIDS: A Misuse Detection System for Database Systems. In Proceedings of Third International Conference on Machine Learning and Cybernetics. pages 1671–1676. Stankovic. In Proceedings of the 7th European Symposium on Research in Computer Security.140 BIBLIOGRAPHY [88] Yi Hu and Brajendra Panda. and Pei Yuen Wong. and Sang H. [96] Rakesh Agarwal. and Karl Levitt. In Proceedings of the 28th International Conference on Very Large Databases. Ashish Kamra. and Peter Teoh. Lee. pages 151–162. . IEEE. and Yirong Xu. [92] Yong Zhong and Xiao-Lin-Qin. Vol (2502). In Proceedings of the Sixth IEEE Real Time Technology and Applications Symposium. [90] Wai Lup Low. Krishnaram Kenthapadi. Evimaria Terzi. In Proceedings of the 4th International Conference on Enterprise Information Systems. John A. [94] Christina Yip Chung. Nabar. 2004. Ramakrishnan Srikant. pages 159– 178. Jerry Kiernan. pages 124–133. [95] Shubha U. 2005. Research on Algorithm of User Query Frequent Itemsets Mining. Morgan Kaufmann.5 Working Conference on Integrity and Internal Control in Information Systems. Lecture Notes in Computer Science. Towards Robustness in Query Auditing. 2002.S. In Proceedings of the 32nd International Conference on Very large Data Bases. In Proceedings of the 7th International Database Engineering and Applications Symposium. Bhaskara Marthi. 2000. Joseph Lee. pages 143–154. [93] Victor C. Intrusion Detection in Real-time Database Systems Via Time Signatures. [91] Sin Yeung Lee. 1999. In Proceedings of the 21st Annual Computer Security Applications Conference. Hippocratic Databases. ACM. 2006. In Proceeding of the 3rd International IFIP TC-11 WG11. and Rajeev Motwani. Intrusion Detection in RBAC-Administered Databases. pages 170–182. pages 121–128. Son. Springer Verlag. 2002.

2005. and Barton P. In Proceedings of the 4th International Symposium on Recent Advances in Intrusion Detection. 2001. Wouter Joosen.. Bayardo Jr. and Giovanni Vigna. 2006. and Morgan Wang. pages 257–272. Auditing Compliance with a Hippocratic Database. Bridging the Gap Between Web Application Firewalls and Web Applications. IEEE. [102] Shai Rubin. Springer Verlag. pages 22–36. FMSE. Davide Balzarotti. and Pierre Verbaeten. [103] Bruce D. DeWitt. and Ramakrishnan Srikant. Raghu Ramakrishnan. pages 67–77. ACM.BIBLIOGRAPHY 141 [97] Rakesh Agrawal. In Proceedings of the 15th Usenix Security Symposium. Miller. Morgan Kaufmann. Frank Piessens. Rakesh Agrawal. [99] Lieven Desmet. Packet. Morgan Kaufmann. In Proceedings of the Fourth ACM workshop on Formal methods in security. pages 108–119. Session-Based Modeling for Intrusion Detection Systems. Jerry Kiernan. Caulkins. 2004. In Proceedings of the Proceedings of the 13th ACM conference on Computer and Communications Security. 2007.vs. Roberto J. [100] Holger Dreger. 2004. Ralf Rantzau. Swaddler: An Approach for the Anomaly-Based Detection of State Violations in Web Applications. Viktoria Felmetsger. [104] Magnus Almgren and Ulf Lindqvist. Somesh Jha. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05). [98] Kristen LeFevre. and David J. pages 63–86. . Springer. Limiting Disclosure in Hippocratic Databases. pages 47–58. [101] Marco Cova. Vern Paxson. Michael Mai. 2006. Yirong Xu. Christos Faloutsos. ACM. Dynamic Application-Layer Protocol Analysis for Network Intrusion Detection. USENIX Association. Protomatching Network Traffic for High Throughput Network Intrusion Detection. pages 516–527. Vol (2212). Joohan Lee. and Robin Sommer. pages 116–121. Anja Feldmann. In Proceedings of the 30th International Conference on Very Large Databases. 2006. In Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detection (RAID). Application-Integrated Data Collection for Security Monitoring. In Proceedings of the 30th International Conference on Very Large Databases. Lecture Notes in Computer Science. Vuk Ercegovac.

Anomaly Detection of Web-Based Attacks. E. Kumar. Springer. pages 123–140. Dayne Freitag. pages 591–598. A Novel Intrusion Detection System Model for Securing Web-based Database Systems. University of Pennsylvania. [113] Charles Sutton and Andrew McCallum. In Introduction to Statistical Relational Learning. Darren Mutz. Conditional Structure versus Conditional Estimation in NLP Models. 2000. 2003. and Vincent J. In Pro- . In Proceedings of the 17th International Conference on Machine Learning. Association for Computational Linguistics. Maximum Entropy Markov Models for Information Extraction and Segmentation. 1996. Pang-Ning Tan. [106] Christopher Kruegel and Giovanni Vigna. [109] Adwait Ratnaparkhi. A. An Introduction to Conditional Random Fields for Relational Learning. Maximum Entropy Models for Natural Language Ambiguity Resolution. Morgan Kaufmann. ACM. and Giovanni Vigna. pages 133–142. In Proceedings of Second International Conference on Detection of Intrusions and Malware. 2005. Eilertson.142 BIBLIOGRAPHY [105] Fredrik Valeur. A Maximum Entropy Approach to Natural Language Processing. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS). [114] L. A Maximum Entropy Model for Part-of-Speech Tagging. [110] Adam L. 1998. Della Pietra. A Learning-Based Approach to the Detection of SQL Attacks. MIT. 2001. PhD thesis. 2006. Stephen A. Protecting Against Cyber Threats in Networked Information Systems. [111] Andrew McCallum. 2002. pages 249–254. V. pages 9–16. In Proceedings of the ACL-02 Conference on Empirical methods in Natural Language Processing Vol (10). Manning. [108] Adwait Ratnaparkhi. Association for Computational Linguistics. and Jaideep Srivastava. and Fernando Pereira. pages 251–261. 1996. IEEE. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. [112] Dan Klein and Christopher D. [107] Shu Wenhui and Tan T H Daniel. 22(1):39–71. In Proceedings of the 25th Annual International Computer Software and Applications Conference (COMPSAC). Computational Linguistics. Paul Dokas. Berger. and Vulnerability Assessment (DIMVA). Ertoz. Lazarevic. Della Pietra.

[119] G. 2004. . 2007. IEEE Transactions on Pattern Analysis and Machine Intelligence. pages 403–410. 2003. [116] Saso Dzeroski and Bernard Zenko. In Security and Protection in Information Processing Systems. [115] Shon Harris. Data Mining: Practical Machine Learning Tools and Techniques. CISSP All-in-One Exam Guide. In Proceedings of the Nineteenth International Conference on Machine Learning. IEEE Transactions on Information Theory. The Viterbi Algorithm. In Proceedings of the International Conference on Machine Learning. [123] Yacine Bouzida and Sylvain Gombault. Vincent Della Pietra. McGraw-Hill Osborne Media. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence. Models. 2005. Proceedings of the IEEE. 2008. 1967. 2003. pages 209–215. CSREA. Witten and Eibe Frank. Inducing Features of Random Fields. IEEE Transactions on Neural Networks. CRF++: Yet another CRF toolkit. pages 123–129. 1973. [118] Andrew Viterbi. [122] Maheshkumar Sabhnani and Gursel Serpen. http: //crfpp. Combinations of Weak Classifiers. 13(2):260–269. 8(1):32–42. Eigenconnections to Intrusion Detection. [125] Stephen Della Pietra.sourceforge. 1997. [117] Chuanyi Ji and Sheng Ma.Forney. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context. Technologies and Applications. Battlespace Digitization and Network Centric Systems III. Morgan Kaufmann. Efficiently Inducing Features of Conditional Random Fields. 2003. and John Lafferty. MLMTA.D. 19(4):380–393. pages 241–258. Is Combining Classifiers Better than Selecting the Best One. [124] Andrew McCallum. [120] Taku Kudo. 2002. Last accessed: Novmeber 30. [121] Ian H. Morgan Kaufmann.net/. Morgan Kaufmann.BIBLIOGRAPHY 143 ceedings of SPIE. pages 51–56. Springer. 1997. 61(3):268–278.

Individual Comparisons by Ranking Methods. Last accessed: Novmeber 30. http://www. 2004. In Proceedings of the Joint IAPR International Workshop on Structural. and Efficiency in Client Server Applications. 2002. Three Tier Client/Server Architecture: Achieving Scalability.144 BIBLIOGRAPHY [126] Andrew Kachites McCallum. [132] Zen Cart. Efficient Training of Conditional Random Fields. [130] Computer Immune Systems. [135] Hanna Wallach. MALLET: A Machine Learning for Language Toolkit. 10(1). Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence . pages 15–30.cs. 1(6):80–83. [127] J. (2396). University of Pennsylvania. http://www. 2008.com/. Master’s thesis. the art of e-commerce. and Andrew McCallum. Performance. Department of Computer and Information Science. [136] Hanna Wallach.edu/˜immsec/systemcalls. Syntactic. 2008. No. Machine Learning for Sequential Data: A Review. Lecture Notes in Computer Science. Open Information Systems. C4. [131] osCommerce. [129] W. http://www. Conditional Random Fields: An Introduction. University of Edinburgh. [128] Frank Wilcoxon. Ross Quinlan. 1993. Technical Report MSCIS-04-21.oscommerce. Understanding SOA with Web Services. 1995. Springer Verlag. [133] Eric Newcomer and Greg Lomow. 2008.umass. zencart. Last accessed: Novmeber 30. Khashayar Rohanimanesh. Last accessed: Novmeber 30. Eckerson. http://mallet. 1945. 2008.Data Sets and Software. Division of Informatics. Morgan Kaufmann. 2002. Open Source Online Shop E-Commerce Solutions. 2002. Dietterich. Last accessed: Novmeber 30. Biometrics. and Statistical Pattern Recognition.W.com/.unm. 2004.5: Programs for Machine Learning. [137] Charles Sutton.htm. Addison-Wesley Professional.cs. [134] Thomas G.edu.

David Kulp. [144] Sunita Sarawagi and William W. 2006. IEEE Transactions on Pattern Analysis and Machine Intelligence. [143] Aron Culotta. 2004. Kia-Fock Loe. [139] Fei Sha and Fernando Pereira. and Yan Liu. Shallow Parsing with Conditional Random Fields. In Proceedings of the 4th International Conference on Machine Learning and Cybernetics. MIT. pages 64–71. ACM. Pattern Recognition and Machine Learning. [142] John Lafferty. 2004. Semi Supervised Learning for Sequence Labeling Using Conditional Random Fields. In Proceedings of the 21st International Conference on Machine Learning. A Dynamic Conditional Random Field Model for Foreground and Shadow Segmentation. In Proceedings of the 21st International Conference on Machine Learning. Cohen. Intel Research. Michael Collins. pages 1185–1192. Xiaojin Zhu. and Jian-Kang Wu. pages 99–106. University of Massachusetts. pages 1097–1104. 28(2):279–289. [146] Kevin Murphy. Kernel Conditional Random Fields: Representation and Clique Selection. Technical report. Gene Prediction with Conditional Random Fields. An Introduction to Graphical Models. 2004. 2006. In Proceedings of Advances in Neural Information Processing Systems. Amherst. 2004. In Advances in Neural Information Processing Systems. IEEE. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. [140] Tak-Lam Wong and Wai Lam. Springer. and Andrew McCallum. and Trevor Darrel. . Semi-Markov Conditional Random Fields for Information Extraction. [141] Ariadna Quattoni. ACM. 2003. 2005. Technical Report UM-CS-2005-028. [138] Yang Wang. pages 134–141. 2005. Bishop. 2001. Association for Computational Linguistics. Conditional Random Fields for Object Recognition. [145] Christopher M.BIBLIOGRAPHY 145 Data. pages 2832–2837.

html.ca/˜murphyk/. [151] Serafim Batzoglou. http://robotics. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. Last accessed: Novmeber 30. Technical University of Dortmund.ubc. Nan M. CRF Project. 2008. pages 61–67. Journal of the Royal Statistical Society. trees and general graphs. John Lafferty. George Soules. The Annals of Mathematical Statistics. and Donald B. 2007. Using Maximum Entropy for Text Classification. [149] Leonard E. Ted Petrie. http://www. Technical Report TR07-2-013.stanford. Laird. includes BP code). and Andrew McCallum. 2008. [150] Arthur Pentland Dempster. Information Theory and Statistical Mechanics.146 BIBLIOGRAPHY [147] Kamal Nigam. Last accessed: Novmeber 30. 1970. sourceforge. B.cs. 41(1):164–171. Baum. [154] Kevin P Murphy. The Physical Review. [148] Edwin Thompson Jaynes. [152] Roman Klinger and Katrin Tomanek. http://crf. 1999. In IJCAI-99 Workshop on Machine Learning for Information Filtering. Rubin. . Last accessed: Novmeber 30. 1957.edu/˜serafim/CS262_2005/ index. 39(1):1–38. [153] Sunita Sarawagi. 2008. and Norman Weiss. Maximum Likelihood from Incomplete Data via the EM Algorithm. Classical Probabilistic Models and Conditional Random Fields. CS 262 Computational Genomics Winter 2005. 1977. Conditional random fields (chains. 106(4):620–630.net/.

Appendices 147 .

.

[136].. intrusion detection and many others. We describe the theory behind conditional random fields in detail. The need to label sequence of observations also arises in intrusion detection tasks to correctly identify malicious events.. A. . determining the part of speech tags for a sentence. Similarly. yn from a finite set of labels Y [134]. computational biology includes various tasks such as biological sequence alignment. named entity recognition and others... gene prediction and many more. [124]. shallow parsing. part of speech tagging. In particular. we shall emphasize on conditional random fields. thus.Appendix A An Introduction to Conditional Random Fields Conditional random fields have been effectively used for a variety of tasks including gene prediction. which results in better classification. [135]. . focus on a sequence of observations and discuss various methods which have been proposed to label them. [113]. information extraction. object recognition. text segmentation. Computational linguistics involve various tasks such as text segmentation. label every observation as y1 .. named entity recognition. Conditional random fields exploit the sequence structure in the observations without making unwarranted assumptions. We shall.1 Introduction The need to correctly label a sequence of observations is of vital importance in a variety of domains including computational linguistics. determining secondary structure of protein sequences. The problem of sequence labeling is defined as follows: given a sequence of observations x1 . y2 . give their properties along with the assumptions made which motivate their use in a particular problem including their advantages and disadvantages with respect to previously known approaches which can be used for similar tasks. x2 . xn . computational biology and real-time intrusion detection. determining secondary structures of protein sequences.. x2 . highlighting their advantages over other methods and list a number of 149 . [34]. y2 .

many machine learning approaches first estimate the joint distribution of the observations and the labels and then . hidden Markov models. In order to perform such tasks. highlighting situations where conditional random fields are expected to perform better than their predecessors. We also give a brief description of the tools which implement conditional random fields. [138]. ∀ x.Given N random variables. X and Y. The rest of the chapter is organized as follows. In Section A.2 Background Many real life problems in language processing. [143]. we give a brief background on probability distributions and describe the notations used. in Section A. the joint distribution is represented as: P( X = x. of all the variables occurring together. thus. Finally. [142]. we conclude the chapter in Section A. We emphasize on feature functions. However. probabilistic approaches have gained wide acceptance which involve estimating either the joint distribution or the conditional distribution which are defined as follows. When there are only two random variables.4. [144]. [36]. [37]. ∀ x. we compare the directed and the undirected graphical models. • Conditional Probability Distribution. [140]. the joint distribution of the given random variables is the distribution. The aim is.6.5. y values. The observations in most sequence labeling tasks are known and the objective is to assign the correct label given the observations.2. of a subset of variables given the occurrences of the remaining random variables in the set N. naive Bayes classifiers and others which motivate the use of conditional random fields. y values. we discuss various graphical methods and highlight drawbacks in previously introduced methods such as the maximum entropy Markov models.3. [33]. [139]. [141]. We then describe conditional random fields. For two random variables.Given N random variables. In Section A.150 An Introduction to Conditional Random Fields applications where they have been successfully applied [32]. the conditional distribution is a distribution. training and testing and the complexity involved in using conditional random fields. the conditional distribution of Y given X is represented as: P(Y = y| X = x ). • Joint Probability Distribution. D. D. [137]. X and Y. in details in Section A. to predict the label sequence which maximizes the probability of the class labels given the observations. Y = y). A. time series prediction and sequence classification. computational biology and real-time intrusion detection involve sequence labeling.

Domain knowledge is typically used to determine such dependence and independence relations as in case of the Bayesian networks which are described later. the learnt model does not generalize to new observations. Assuming complete independence among variables significantly reduces this to O(K ∗ M). calculating the required marginal and conditional probabilities is an easy task. assuming complete independence among the random variables though makes the model tractable but severely affects the modeling capability. various complicated problems can be formulated and solved using purely algebraic manipulation. Once the complete joint distribution is available. it is difficult to estimate the accurate joint distribution. the use of graphical models augments the analysis us- . a diagrammatic representation of the random variables presenting their dependence relations is advantageous and graphical models have become an important tool for various machine learning tasks as presented in [145] and [146]. however. The number of observations required to determine the complete joint distribution is exponential in the number of variables. making such strong independence assumptions affects the accuracy of the model. feasible only in situations when the number of random variables is small and large amount of data samples are available for training. thus. To learn the joint distribution from the training data is difficult due to the following reasons: 1.A. making the model tractable with the help of independence assumptions without affecting the modeling power of the system and improving the generalization capability of the model on unseen observations given the limited number of training data samples. As a result. 2. For M variables each taking K possible labels. On the contrary. Estimating the joint distribution without making any independence assumptions is. Estimating the conditional distribution directly from the training observations eliminates the need of estimating the joint distribution and does not necessitate any unwarranted independence assumptions among the random variables. however. The amount of training data is limited and hence. The joint distribution learnt from a limited data set can result in over-fitting and mirrors the training data. this number is O(K M ). However. the objective is to build models which optimally balance the dual constraints. Hence.2 Background 151 determine the required conditional distribution using the Bayes rule. For estimating either the joint or the conditional distribution. the major issue in a joint distribution is estimating the required joint distribution itself. As mentioned in [145].

is a set of vertices. the graphical methods can be applied to label a single observation with multiple features as well as to label a sequence of observations where each observation is itself represented by multiple features..xt is the observed vector. with a large number of random variables. → • − = x1 x2 .. E. Based upon the type of edges used in the graphs. V. whose y 1 2 t → length is equal to that of the observation − ..3 Graphical Models Graphical models are often used to model the probability distribution over a set of random variables by factorizing complex distributions. x Note that. Let → there be k possible classes. For sequence labeling y is a vector.. Very often.. y . A graph. when a sequence of observations is considered. the observation represents the value of a single feature observed overtime. even though more than one feature can be used to represent an observation sequence. maximum entropy Markov models. each with a small set of variables. .152 An Introduction to Conditional Random Fields ing diagrammatic representations of probability distributions which not only help to visualize the structure of the probabilistic model but also gives insight to the properties of the model including the conditional independence which significantly improves the probabilistic analysis and helps to reduce the need of using larger data sets. Markov random fields and conditional random fields. Notation We use the following notations for the rest of the chapter. where a vertex represents a single or a group of random variable(s) and the edges between the vertices represents the relationship between these random variables. G. connected by edges. x • y is estimated class. into a product of simpler distributions. Methods which generally deal with a single observation are naive Bayes classifier and Maxent classifier. the graphical models can be broadly classified as Directed or Undirected. Let there be m alphabets for each xi . − = y . We use the term “class” interchangeably with the term “label”. . methods which deal with a sequence of observations are hidden Markov models. y . A. Similarly.

A vertex Vi can be represented by the random variable representation Xi . An important restriction for the directed graphs is the absence of closed loops. i.1) i =1 Figure A. where xi represents a node and xπi represents its parents. x2 ) (A. x2 ..1 can be factorized as: p ( x1 . x3 ) = p ( x1 ) ∗ p ( x2 | x1 ) ∗ p ( x3 | x1 . The directed graphical models factorize according to the probability distribution given in equation A.1 Directed Graphical Models Def. Vj ). there should be no directed path starting from and ending to the same vertex. A directed graphical model incorporates the parent child relationship via the direction of an edge. N p( x1 .: A directed graphical model is a graph G = (V. Directed graphical models are also known as the Bayesian Networks [61]. x2 .e. ..1.1 represents a fully connected directed graphical model for three random variables. Such graphs are called as the Directed Acyclic Graphs (DAG).. i. x1 x2 x3 Figure A. The joint distribution over a set of random variables can be factorized into a product of local conditional distributions in the directed graphical models.e. for a fully connected graph with M variables each taking K possible values the total number ..3.A.2) Thus..3 Graphical Models 153 A... an edge pointing from the vertex Vi to vertex Vj implicitly describes the parent child relationship such that Xi is the parent of X j . i = j} are the directed edges from vertex Vi to vertex Vj . VN } are the vertices and E = {(Vi . V2 . xn ) = ∏ p ( xi | xΠ ) i (A.. . E) where V = {V1 .1: Fully Connected Graphical Model The graphical model represented in Figure A.

Figure A. the conditional independence properties can be tested by applying the dseparation test. This is not feasible for most real world applications which often involve a large number of random variables with complex dependencies among them. x1 x2 x3 Figure A. Naive Bayes Classifier Naive Bayes classifier is a well known directed graphical model which is frequently used to determine the class label for a given observation. The naive Bayes classifier is represented in Figure A.3) Assuming the variables to be completely independent significantly reduces the number of required parameters to M(K − 1). In case of directed graphs. x2 .2: Fully Disconnected Graphical Model The graphical model represented in Figure A. which is manageable.3.154 An Introduction to Conditional Random Fields of parameters that must be specified for an arbitrary joint distribution is equal to K M − 1 which grows exponentially with M.2 represents a graphical model where the random variables are assumed to be completely independent. The complexity can be drastically reduced by assuming the random variables to be completely independent. Conditional independence properties can be used to simplify the structure of the graph. x3 ) = p ( x1 ) ∗ p ( x2 ) ∗ p ( x3 ) (A.2 can be factorized as: p ( x1 . We shall now describe some of the well known directed graphical models. This involves testing whether or not the path between two nodes is blocked. More details on d-separation can be found in [145]. .

Equation A. x2 ... find: → argmax p(y|− ) x y (A.3 Graphical Models 155 y x1 x2 xt Figure A.e. “Every feature xi is conditionally independent of every other feature”.5) The classifier presented in Equation A.3: Naive Bayes Classifier The objective is to find the label.A.. ∗ p( xt |y) x y y = argmax y p ( y ) ∗ ∏ p ( xi | y ) i =1 t (A. which maximizes the probability of the given observation. i. the resulting naive Bayes classifier is given by: → argmax p(y|− ) = argmax p(y) ∗ p( x1 |y) ∗ p( x2 |y)... . y. This makes the model simple ..4) → The Bayes Rule can be used to find p(y|− ): x → p(y|− ) = x Hence.xt |y) y Making naive Bayes assumption.4 can be rewritten as: → p(y) ∗ p(− |y) x −) → p( x → argmax p(y|− ) = argmax x y y → p(y) ∗ p(− |y) x −) → p( x → = argmax p(y) ∗ p(− |y) x y = argmax p(y) ∗ p( x1 .5 considers the features in the observation to be independent and discard any correlation which may exist between them. .

This results in a conditional distribution which is represented in Equation A. y x1 x2 xt Figure A.156 An Introduction to Conditional Random Fields but. λk is the bias weight and f k (y. Similar to the naive Bayes classifier. We also observed that. x ) is a feature function defined on an observation and label pair for every feature k. the Maxent classifier (or logistic regression) can be used to classify an observation which may be represented by multiple features. the Maxent classifier does not assume independence among the observation features thereby resulting in better classification accuracy. x) K is the normalization constant. for each class is a linear function of the observation x and a normalization constant. Contrary to the naive Bayes assumption.6) Z(x) = ∑ exp y k =1 ∑ λk ∗ f k (y. Maxent classifier is represented in Figure A.6. log p(y| x ). The ability of this model to capture the correlation . the naive Bayes classifier assumes different observation features to be completely independent which affects classification. to make the model simple. it affects classification accuracy of the resulting classifier.4. x) (A. 1 ∗ exp Z(x) K p(y| x ) = where k =1 ∑ λk ∗ f k (y. Maxent Classifier Recalling that naive Bayes classifier is a directed graphical model which is often used to assign a single label to an observation.4: Maxent Classifier Maxent classifier is motivated by the assumption that the log probability.

: A graphical model which models the conditional distribution of the labels given the observations. Generative and Discriminative Graphical Models Def. Maxent classifier (logistic regression) discussed earlier is a typical discriminative model. but beyond these constraints.3 Graphical Models 157 between observation features depend upon the feature functions. Such a conditional probability model is based on the Principle of Maximum Entropy [148] which states that. p(y. Other well known generative models are the hidden Markov models. − . for an observed sequence.A. − . and the weights. i. the only unbiased assumption that can be made is a distribution which is as uniform as possible given the available information. neural networks and nearest neighbor are examples of discriminative models. of length t. in many real world situations.e. The naive Bayes classifier discussed earlier is an example of a generative graphical model. conditional random fields. This model can → → be extended to estimate a sequence of labels. independence assumptions are made which results in approximate models. when for a probability distribution only incomplete information is available. learnt during training [147]. The prime disadvantage of generative models is that they need to enumerate all possible observation sequences. one which does not make any further assumptions.. However. Other well known methods such as the support vector machines. is known as a discriminative model. the model should be as uniform as possible. p(y| x ). Def. f k (y. maximum entropy Markov models. This means that the model should follow all the constraints imposed on it (which are defined by the feature functions extracted from the training data). λk . very often the observed sequence represents the values of a single feature taken . More details on Maximum Entropy models can be obtained from [110]. x ).: A graphical model which models the joint probability of the observations and the labels. Hidden Markov Model The naive Bayes classifier is generally used to predict only a single class label. As y x mentioned earlier. Bayesian networks. is known as a generative model. the amount of data available for training is limited and hence. x ). Markov random fields.

. We shall concentrate only on the first order hidden Markov model which assumes that a state at time t depends only on the state at time t − 1. + aik = 1. the sequence of states which generated the observation) since there may exist more than one sequence of states which generated the particular observation.158 An Introduction to Conditional Random Fields over a period of time. is assumed to be independent of observation at any other time which can affect accuracy. qk )..e. The hidden Markov model is well known example of a directed and a generative graphical model. For an hidden Markov model. Hence. i.7) → → → The best label sequence..5 can be factorized as: → → p(− ... + ei (bm ) = 1. q2 . They are doubly stochastic models. ei (b) = p( xi = b|qi = k) such that ei (b1 ) + .. Also.e. Since we consider only a single feature which is observed overtime. ∀ states i = 1. is one which maximizes this joint distribution. let the transition probability from state qt−1 = i to state qt = j be represented by aij such that ai1 + ai2 + . one cannot uniquely determine the labeling (i. The starting probabilities. the observation at time t depends only on the state at time t. . p(− .. − ) = y x ∏ p ( y i | y i −1 ) ∗ p ( x i | y i ) i =1 t (A.. the state sequence is generated by a stochastic process from which the output sequence is then generated [75]. are initialized for each state i such that a01 + a02 + . Hence.. The drawy y x back of hidden Markov model is that observation at time t. Further. each state has a probability of emitting an observation. .5: Hidden Markov Model The hidden Markov model represented in Figure A... Further... − ). the set of states is represented by Q = (q1 .. + a0k = 1. xt . this results in a chain like structure as represented in Figure A. − . given an output sequence (observation).k.k. ∀ states i = 1.5. y1 y2 yt−1 yt x1 x2 xt−1 xt Figure A. we often assume the number of states to be equal to the number of class labels. a0i ..

Dynamic programming can be used to perform this efficiently. what is x the probability of the observation sequence given the model. what is x the sequence of states that maximizes the joint probability of the observation sequence and the state sequence. find → argmax p(− |θ ) x θ Evaluation → The objective is to find p(− | M). xi−1 ...e. − .. i.. qi−1 . and an observation sequence − .e. − | M) x q − → q 3. i.e... .3 Graphical Models Three main questions are considered when using an hidden Markov model.qi−1 ∑ p( x1 . i.Given an hidden Markov model M and an observation sequence − . 159 → 1. x The naive approach is to perform summation over all possible ways of generating the observation → sequence.. i. the probability of the observation sequence given the model. what are the parameters (transition and x emission probabilities) that maximize the probability of the observation sequence. q1 . qi = k) ∗ ek ( xi ) . .e. Learning . .A.. find → → argmax p(− . find → p(− | M) x → 2.... i.. First..e.Given an hidden Markov model M with unspecified transition and emission → probabilities.Given an hidden Markov model M and an observation sequence − . Decoding .. Evaluation . qi = k) = q1 . x → p(− ) = x ∑ − → q → → p(− .. define the forward probability as follows: f k (i ) = p( x1 . xi . − ) x q →→ → = ∑ p(− |− ) ∗ p(− ) x q q − → q Summing over exponential number of paths is not desirable.

xt ...qt ∑ p( xi+1 .... ∀ k > 0 Iteration: f k (i ) = ek ( xi ) ∗ ∑l f l (i − 1) ∗ alk Termination: → p(− ) = ∑k f k (t) ∗ ak0 . q1 .. define the backward probx ability as follows: bk (i ) = p( xi+1 .... x Similar to the forward algorithm described above..qt ∑ p( xi+1 .. where ak0 is the probability of terminating in state k. qi+2 .. . qi−1 ) ∗ p(qi = k|qi−1 ) ∗ ek ( xi ) =∑ l q1 ... q1 ..qi−2 ∑ p( x1 ..qi−2 ∑ = ∑ p( x1 . q1 . qt |qi = k) =∑ l qi+1 . . . . ..xi−1 ..... the backward algorithm [75] employs dynamic programming and can be used in conjunction with the forward algorithm to determine the → most likely state at position i given the observation sequence − ... xi−1 .. qi−1 = l ) ∗ alk ∗ ek ( xi ) l = ek ( xi ) ∗ ∑ f l (i − 1) ∗ alk l Using this idea. .qi−1 ∑ p( x1 ... First. qi+1 . qt |qi = k) . qi−1 = l ) ∗ alk ∗ ek ( xi ) =∑ l q1 . . The algorithm is described next. Initialization: f 0 (0) = 1 f k (0) = 0.... xt .. qi−1 = l ) ∗ p(qi = k|qi−1 ) ∗ ek ( xi ) p( x1 . xi−1 . .. xi−1 ....... ... ... qi+1 = l.. qi−2 . xt |qi = k) = qi+1 . the forward algorithm [75] can be used to perform this efficiently with a time complexity of O(K2 T ) and a space complexity of O(KT ) where K is the possible number of states and T is the length of the observation sequence.... .160 An Introduction to Conditional Random Fields = q1 .

→ The most likely state at position i.qt ∑ p( xi+2 .8) This is also known as posterior decoding. . However. To calculate the sequence of states which maximizes the joint probability of the observation sequence and the state sequence. ∀ k Iteration: bk (i ) = ∑l el ( xi+1 ) ∗ akl ∗ bl (i + 1) Termination: → p(− ) = ∑l a0l ∗ el ( x1 )bl (1) x The backward algorithm also has a time complexity of O(K2 T ) and a space complexity of O(KT ) where K is the possible number of states and T is the length of the observation sequence. − | M) → → → q∗ x q − → q Consider the given observation sequence x1 .. ..xt as shown in Figure A...A..3 Graphical Models 161 = ∑ el ( xi+1 ) ∗ akl l qi+1 . qt |qt+1 = l ) = ∑ el ( xi+1 ) ∗ akl ∗ bl (i + 1) l Using the above concept. xt . Initialization: bk (t) = ak0 . The most likely state can be calculated at each position using Equation A.8.6. given the observation sequence − ...8.. x2 ... → p ( qi = k | − ) = x f k ( i ) ∗ bk ( i ) → p(− ) x (A. Decoding The objective is to find: − = argmax p(− . the backward algorithm described next. can now be calculated using x the Equation A. qi+2 . this does not represent the most likely sequence of states given the entire observation sequence of length t. . dynamic programming can be used to perform the computation .

xi+1 .9) Given Vk (i ) for all states k.6: Decoding in an Hidden Markov Model efficiently... .qi−1 k max p( x1 .... qi = k)] = max [ p( xi+1 |qi+1 = l ) ∗ p(qi+1 = l |qi = k) ∗ Vk (i )] = el ( xi+1 ) ∗ max [ akl ∗ Vk (i )] k (A. q1 .... . .. .. and for a fixed position i. .10) The Viterbi algorithm [118]..qi ) q1 . q1 .qi−1 .. .qi k = max [ p( xi+1 .. . xi . .xi ... qi+1 = l ) q1 ..162 An Introduction to Conditional Random Fields 1 1 1 1 1 2 2 2 2 2 k k k k k x1 x2 x3 xt−1 xt Figure A. qi+1 = l |qi = k)∗ q1 .. .....qi−1 (A. qi ) q1 . .xi .qi = max p( xi+1 .qi−1 . .. q1 .xi−1 . . qi+1 = l |qi ) ∗ p( x1 .. The algorithm is described in the following steps: .. q1 . qi+1 = l | x1 ..qi ) ∗ p( x1 ... xi .xi−1 .xi−1 . [119] implements this idea with a time complexity of O(K2 T ) and a space complexity of O(KT ) where K is the possible number of states and T is the length of the observation sequence.qi ..qi = max p( xi+1 . qi = k) q1 ..xi . calculate Vl (i + 1) as: Vl (i + 1) = max p( x1 . q1 ... Let Vk (i ) is the probability of most likely sequence of states ending in state qi = k Vk (i ) = max p( x1 .... q1 .qi−1 . xi .

q is known. Vk (0) = 0. In this chapter. q → → Ek ( x ) = number o f times state k in− emits x in − . the Baum-Welch algorithm [149] can be used which is based on the principle of expectation maximization [150]. ∀ k > 0 Iteration: Vj (i ) = e j ( xi ) ∗ maxk [ akj ∗ Vk (i − 1)] Ptr j (i ) = argmaxk akj ∗ Vk (i − 1) Termination: → q∗) p(− .e... q 1 2 t → Akl = number o f times transition occurs f rom k to l in − .. In case → when the training data is labeled the observation sequence. we shall only discuss the first case when labeled training data is available.. In case when training data is not labeled. − = x1 . x2 . two learning scenarios exist. is given and the x − = q ..11) (A. . We define.3 Graphical Models Initialization: V0 (0) = 1..12) . where 0 is the imaginary start position. xt . q x → The maximum likelihood parameters θ. i. − = maxk Vk (t) x → Traceback: q∗ = argmaxk Vk (t) t qi∗−1 = Ptrqi (i ) 163 Learning In order to estimate the parameters of an hidden Markov model. can be shown to be: x Akl ∑i Aki Ek (b) ek ( b ) = ∑c Ek (c) akl = (A.A. maximize p(− |θ ). the Viterbi training can also be used. q . Alternately. when the labeled training data is available and when the training data is not labeled. → corresponding state sequence. ..

− ) x q (A. thus. the transition and emission probabilities can be used to determine the likelihood of the parse..164 An Introduction to Conditional Random Fields Hence..− . given an hidden Markov model.......eqt ( xt ) is to consider all the parameters aij and ei (b) as features.. Maximum Entropy Markov Model Similar to how we extended the naive Bayes classifier to perform sequence labeling in the hidden Markov models. q) = number o f parameters θ j which occur in (− . We..aqt−1 qt eq1 ( x1 )eq2 ( x3 ). − ) = x q which can be reduced to the form j=1..aqt−1 qt eq1 ( x1 )eq2 ( x3 ).. observe that for an hidden Markov model. q1 .. an → → observation sequence − and a parse − . xt . i.. x.. − ) = exp x q j=1......eqt ( xt ) A compact approach to represent a0q1 aq1 q2 .6.. the best estimate of the parameters that can be obtained is the average frequency of transitions and emissions that occur in the training data..− ) x q → → p(− . n occurs in (− ) and (− ).e. . we can .. given the maximum entropy model (Maxent classifier) in Equation A. .13) Equation A.n ∏ θj →→ F ( j. . Counting → → the number of times every feature j = 1. − ) x q Thus. − ) = p( x1 . Let there be n such features (both aij and ei (b)).n ∑ → → log(θ j ) ∗ F ( j. → → p(− . − . qt ) x q = a0q1 aq1 q2 . we represent the count as x q → → F ( j. q2 . the conditional random fields which are discussed next.13 gives another way of representing an hidden Markov model which presents an intuitive approach for understanding the maximum entropy Markov models and. the likelihood of this parse is: x q → → p(− . given the labeled training data... A common drawback is that this can result in over-fitting which affects their generalization capability. x2 . thus..

Instead. xt ) yt k is the partition function. − .A. x t ) p ( y t | y t −1 . Comparing Equation A. . y1 y2 yt−1 yt x1 x2 xt−1 xt Figure A. Further. the maximum entropy Markov model models ..14) Z ( y t −1 .7.3 Graphical Models 165 → extend it to perform sequence labeling for an observation sequence. xi . yt−1 . they assume that the observation at time t depends only on the state at time t. x t ) k (A.7 can be factorized as: 1 ∗ exp Z ( y t −1 . yt .. λk is the weight and f k (yt .. we note that the hidden Markov model models the joint probability of the observation sequence and the label sequence by assuming that a state at time t depends only on the state at time t − 1. decoding can be performed similar to the hidden Markov model such that the probability of the overall sequence of labels is maximized as opposed to finding the optimum class label at each observation xt .14. x t ) = ∑ exp ∑ λk ∗ f k (yt .13 and Equation A. y t −1 . x t ) = where ∑ λ k ∗ f k ( y t .7: Maximum Entropy Markov Model One approach to perform sequence labeling is to run the Maxent classifier locally for every observation in the sequence resulting in a label for every observation. This results in a maximum x entropy Markov model as represented in Figure A. Using the Viterbi algorithm. yt−1 . To avoid this. The maximum entropy Markov model represented in Figure A. y2 . xt ) is the feature function defined for a feature k. An obvious drawback of this approach is that the labels for each observation xi are optimal locally as opposed to the optimal sequence of labels. y1 . the Viterbi decoding can be performed similar to the hidden Markov models.

however. the same path is preferred irrespective of the observation at any later stage (during decoding). Once. state 1 and state 4.8. the authors explain the label bias phenomenon with the following example. thus. r 0 i 1 2 b 3 r 4 o 5 b Figure A. . both the states have no choice but to ignore the observation and move to the next state with the maximum probability.166 An Introduction to Conditional Random Fields the conditional distribution of the label sequence by conditioning on the observation at time t. Further. However. it assign equal probability to both. local conditional modeling in each state [112]. hence. Consider the finite state model represented in Figure A. In other words. In [34]. Next the model observes the observation i. due to local normalization. the maximum entropy Markov model is analogous to a sequence of independent Maxent classifiers. [112] which is described next. the model observes. the observation r. As we discussed earlier. if one of the observation sequences is more common in the training. if certain sequence of states is more frequent during training. the probability at every instant sums to one. This is attributed to the directed graphical structure and. Label Bias in Maximum Entropy Markov Models Label bias is the phenomenon in which the model effectively ignores the observation thereby resulting in inaccurate results. thus. both the states 1 and 4 have only one outgoing transition and because the incoming probability is equal to the outgoing probability. given the model. Maximum entropy Markov models. when the model observes i or any other observation.8: Label Bias Problem Suppose that the observation sequence is r i b. often perform better than the hidden Markov models. the transitions would prefer the corresponding path irrespective of the observation. both states 2 and 5 result in equal probability. As a result. As a result. the previous state explains the current state so well that the observation at the current state is effectively ignored. they suffer from the Label Bias problem [34].

Another approach is to start with a fully connected structure.. . . p( x1 .. this would preclude the use of prior structural knowledge. i = j} are the undirected edges from vertex Vi to vertex Vj .9 represents an undirected graphical model for three random variables. it is possible to eliminate label bias by collapsing the states 1 and 4. it is necessary to calculate the partition function Z. however. However. ∑ ∏ ψc (xc ) x1 x2 xn c∈C where C is the set of cliques in the graph. xc and Z= 1 Z c∈C ∏ ψc (xc ) (A.15 represents a probability distribution. Potentials have no specific probabilistic interpretations. Undirected graphical models are also known as the Markov Random Fields [145].15) ∑ ∑ .3 Graphical Models 167 In the above example. before we describe the conditional random fields. Similar to the label bias.. The undirected graphical models factorize according to the probability distribution given in Equation A. A vertex Vi can be represented by the random variable representation Xi . x2 . this is a special case and not always possible [34]. Conditional random fields effectively address these issues by dropping local normalization and instead normalize globally on the observation sequence. A.3. we present the general undirected graphical models which are necessary for better understanding of the conditional random fields.. however. the authors in [112] describe what they call as the observation bias where the observations explain the states such that the previous states are effectively ignored. V2 ... Vj ).: An undirected graphical model is a graph G = (V.15. the undirected graphical models describe the factorization of a set of random variables and their notion of conditional independence. . Figure A. Z is the normalization factor known as the partition function and ψc are the strictly positive real valued functions known as the potential functions defined over the cliques. ∀ c. Similar to the directed graphical models.. E) where V = {V1 .2 Undirected Graphical Models Def.A. xn ) = such that ψc ( xc ) > 0. VN } are the vertices and E = {(Vi .. To make sure that the Equation A.

168 An Introduction to Conditional Random Fields x1 x2 x3 Figure A. all ‘K2 ’ akl features and all ‘K’ el ( xi ) features are significant at every instant. x3 )ψ2.9: Undirected Graphical Model The undirected graphical model represented in Figure A. we observed that using the dynamic programming approach. the authors proposed the conditional random fields as a solution for the label bias problem. conditional random fields can also be considered as a generalization of the hidden Markov models. conditional independence properties can be simply determined by graph separation. Rearranging Equation A.3 (x1 . a strong independence assumption is made that the state at any instant depends only upon the previous state.10 as: Vl (i ) = Vk (i − 1) + ( a(k. x2 )ψ1. x3 ) Z (A.e. x3 )ψ1.2.3 (x1 . A. A major drawback in an hidden Markov model is that the state qi can observe only the observation symbol xi . mc is the size of the clique m c.3 (x2 . x2 )ψ1.3 ( x1 .2 (x1 . i )) = Vk (i − 1) + g(k.3 ( x2 . l. x3 )ψ2.9 can be factorized as: p ( x1 . xi ) . The overall complexity can be determined from ∑c∈C O(k c ) where.4 Conditional Random Fields In [34]. More details can be found in [145].2 ( x1 . x2 . [113].16) ∑ ∑ ∑ ψ1. x3 )ψ1. Further. x2 . l ) + e(l. However. [151] which also gives a better view and help in better understanding. x3 ) = where Z= 1 ψ1. x2 .2. i.3 ( x1 . For the undirected graphical models.7 and Equation A. x3 ) x1 x2 x3 The complexity of an undirected graphical model depends upon the size of the largest clique..

However. xt ) = 1 Z 1 Z c∈C ∏ ψc (xc ) ∏ ψc (xc ) → p(− ) = x The conditional probability can be written as: c∈C →→ p(− |− ) = y x → → p(− . thus. x2 . x ) y ∗ ∏c∈C ψc (yc . xc ) → ∗ ∑− ∏c∈C ψc (yc . xc ) − → y c∈C Equation A.17. xc ) Z ( → ) c∈C x = 1 Z 1 Z (A.A. presents the general formulation of a conditional random field.. includes all the features present in an hidden Markov model and also has the capability to define a large number of additional features which significantly improves its modeling power compared to that of an hidden Markov model.4. l..1 Representation of Conditional Random fields Using Equation A.17) where → Z (− ) = x ∑ ∏ ψc (yc . the more likely state k will follow state l at position i. [113]. Higher the value of function g. A large number of features can be defined at every position which can capture long range dependencies in the observation sequence x. Generalizing this function to g(k. i ) removes the independence assumptions made in the hidden Markov model which forms the basis for conditional random fields. A conditional random field.4 Conditional Random Fields 169 We note that the restriction in an hidden Markov model arise from the xi part in the function g(k. x. − ) y x → p(− ) x → → p(− . p( x1 . xi ). − ) y x = − − → → → ∑− p( y . in this chapter. l. we shall focus on a linear chain structure for conditional random fields which is motivated from [34]. . [151] and [152] and is described next. xc ) y 1 = − ∏ ψc (yc ..15. . A.

− . there exists t possible maximal cliques which are represented by adjacent nodes in the chain. y1 y2 yt yt+1 x1 x2 xt xt+1 Figure A. for an observation of length t + 1 for a linear chain structure represented in Figure A.18 can be rewritten as: i =1 → x ∑ λ i f i ( y j −1 . summing over all possible label sequences ensures that it is a probability distribution. Further.17. thus.19) → Z (− ) = x ∑ exp − → y j =1 i =1 → x ∑ ∑ λ i f i ( y j −1 . − ) − → y j =1 m t → → ψj (− . j ) t m (A.19. The index j.10. y j . a linear chain conditional random field can be formulated as: →→ p(− |− ) = y x where t 1 − − → → → ∏ ψj ( y . represents the position in the input sequence and sums over a sequence . x ) Z ( − ) j =1 x (A. j ) →→ p(− |− ) = y x where 1 → ∗ exp Z (− ) x j =1 i =1 → x ∑ ∑ λ i f i ( y j −1 . A linear chain conditional random field is x represented in Figure A.4.10: Linear Chain Conditional Random Field Using Equation A. − . − ) = exp y x Equation A. − . y j . y j . j ) t m In Equation A.18) → Z (− ) = x and → → y x ∑ ∏ ψj (− .170 Linear Chain Conditional Random Field An Introduction to Conditional Random Fields → Consider an observation sequence − of length t + 1.

As discussed earlier. as we shall observe later. a clique template is defined which can extract a variety of features from the given training samples. − . x )∈ D ∑ − − →→ ∑ →→ log p(− |− ) y x log (A.4 Conditional Random Fields 171 of length t. it can be effectively used for maximum likelihood parameter estimation (estimating λ s) during training.4. implicitly represents the positivity constraint on the potentials.2 Training → Given the labeled training sequences. − )) x j = − → t m → ∑− exp(∑ j=1 ∑i=1 λi f i (y j−1 .20) = →→ (− . since exponential function is a continuous function and easily differentiable. Index i represents the m overall feature functions defined on the specified set of variables. From Equation A. features can then be extracted x j j −1 → for different realizations of y j and y j−1 and − from the training data. In order to define the features. y j . conditional random fields do not assume such independence among observations. Further. For a linear chain conditional random field. The log likelihood L on the training data D is given by: L( D ) = ( y .18. Given the clique template. The clique template makes assumptions on the structure of the underlying data by defining the composition of the cliques. x A. Feature Functions and Feature Selection In hidden Markov models. Using the exponential function. x )) y .A. for defining the potentials. the feature weights λi are not dependent on the position j. rather they are tied to the individual feature functions. every label (or state). the objective of training a conditional random field is to x →→ determine the weights. This is accomplished by using the features defined while training the random field. Maximum likelihood method is applied y x for parameter estimation.− )∈ D y x → exp(∑t=1 ∑im 1 λi f i (y j−1 . yi . λ s which maximize p(− |− ). we observe that the potential function ψ must be a strictly positive real valued function. y j . can look only at the observation xi and hence they cannot model long range dependencies between the observation sequence. there exist only one clique template which defines → the links between y and y and − . Further.

Laplacian. A number of priors such as the Gaussian. y j .− )∈ D y x t ∑ log ∑ exp( ∑ − → y → x ∑ λi fi (y j−1 . − ))  t j =1 i =1 m →→ (− . − ) ∑ f i ( y j −1 . Hyperbolic and others can be used. y j . − ) t t m ( y . − ) → x − → j =1 i =1 j =1 →→ → (− . y j . the likelihood is often penalized with some form of a prior distribution which has a regularizing influence.22) which is same as the expected value of the feature under its empirical distribution and is denoted ˜ as E( f i ). x )∈ D B ∑ − − →→ → log Z (− ) − ∑ x λ2 i 2σi2 i =1 m C (A. the likelihood becomes: L( D ) = ( y . − ) j =1 t (A.23) . − )) − ∑ m λ2 i 2σi2 i =1 m = ( y .− )∈ D y x ∑ ∂ 1 − ) ∂λ → Z( x i ∑ exp − → y t j =1 i =1 m → x ∑ ∑ λ i f i ( y j −1 . x )∈ D j=1 i =1 ∑ − − →→ → x ∑ ∑ λ i f i ( y j −1 . y j .− )∈ D − y x y ∑ →→ → y x x ∑ p ( − | − ) ∑ f i ( y j −1 .21) Taking partial derivatives of the likelihood with respect to the parameters. we get: t ∂ → A= x ∑ ∑ f i ( y j −1 . − ) j =1 t ( y . x )∈ D ∑ − − →→ 1 → ∑ exp Z (− ) − x → y 1 j =1 i =1 t m → → x x ∑ ∑ λ i f i ( y j −1 . Consider a simple prior of the form ∑im 1 = of the parameter λi . − ) − A m ( y . y j . x )∈ D j =1 i =1 → x ∑ λi fi (y j−1 .172 An Introduction to Conditional Random Fields To avoid over-fitting. y j . y j . x )∈ D y ∑ − − →→ → → x x ∑ Z(− ) exp ∑ ∑ λi fi (y j−1 . → ∂ 1 ∂Z (− ) x B= ∑ Z(− ) ∂λi → ∂λi x →→ (− .− )∈ D y x = = = = →→ (− . − ) ∑ fi (y j−1 . x )) j = y exp( ∑  t −∑ λ2 i 2 i =1 2σi m = log ( y . y j . λ2 i 2σi2 where σi is the standard deviation Hence.− )∈ D j=1 y x (A. λ s. − ) ∂λi →→ (− . y j . y j . y j . x )∈ D ∑ − − →→ ∑ − − →→ − log → exp(∑t=1 ∑im 1 λi f i (y j−1 . − )) x j = − → → ∑− exp(∑t=1 ∑im 1 λi f i (y j−1 . y j .

λi ˜ E( f i ) − E( f i ) − 2 = 0 σi 173 (A.21.25 to 0. Consider states s and s. y j = s . we equate the right hand side in Equation A.− ) y x ∑ 1 → Z (− ) x j =1 s → → → → x x x x ∑ ∑ ∑ fi (s. s . s ) β j (s |− ) s t The forward-backward algorithm has a complexity of O(K2 T ) where K is the number of states and T is the length of the sequence.27) (A. s ) s s (A. Training a conditional random field involves many iterations of the forward-backward algorithm.28) → ψj (− . ∂ 2λi C= 2 ∂λi 2σi λi = 2 σi Using A.24) (A. Hence.23 and A. A. s.26) ˜ E( f i ) can be easily computed by counting how often every feature occurs in the training data.25) (A. A. s ) = exp x i =1 → x ∑ λi fi (y j−1 = s. s . As described in [34]. To efficiently calculate the E( f i ). − ) m Using the α and β functions.4 Conditional Random Fields which is the expectation under the model distribution and is denoted as E( f i ). . s .24.A. defining the forward (α) and backward (β) scores as follows: → α j (s|− ) = x → β j (s|− ) = x where → → x x ∑ α j −1 ( s | − ) ∗ ψ j ( − . a modified version of the forward-backward algorithm can be used. s ) → → x x ∑ β j +1 ( s | − ) ∗ ψ j ( − . we get: ∂L( D ) λi ˜ = E( f i ) − E( f i ) − 2 ∂λi σi To find the maximum. s. it is possible to compute the expectation under the model distribution efficiently by: E( f i ) = →→ ( − .22. − ) ∗ α j (s|− )ψj (− .

.. we use the two interchangeably. hence. s ) x j s j −1 → q j (s) = argmaxs δj−1 (s ) ∗ ψ(− . the objective is to find x the most likely sequence of labels for the given observation. → Let δj (s|− ) represent the highest score of the sequence of states ending in state s at position x j and is defined as: → → δj (s|− ) = max p(y1 .30) The algorithm is described in the following steps: Initialization: ∀ s ∈ S : → δ1 (s) = ψ1 (− .y j−1 (A. s) x q1 ( s ) = s0 Recursion: ∀ s ∈ S : 1 ≤ j ≤ t → δ (s) = max δ (s ) ∗ ψ(− . y j = s|− ) o x y1 . s. s.3 Inference → Given the observed sequence − and the trained conditional random field. y j−1 . the Viterbi algorithm can be used to effectively determine the sequence of states. .... .29) We then calculate → → δj+1 (s|− ) = max δj (s ) ∗ ψj+1 (− . s0 .. As with the hidden Markov models.174 An Introduction to Conditional Random Fields A. s .4.. Often the number of states is assumed to be equal to the number of labels and. s ) x x s (A. s) x Termination: p∗ = maxs δt (s ) −∗ → l t = argmaxs δt (s ) Traceback: −∗ → − → l t = qt+1 ( l ∗+1 ) t The complexity of the algorithm is O(K2 T ) where K is the number of states and T is the length of the sequence.

As a result. using the approach presented by the undirected models. In directed graphs. both. U. The directed models determine the conditional independence properties via the d-separation test while the undirected models determine the same via graph separation [145]. This. however. A. Figure A. This can be represented as shown in Figure A.5 Comparing the Directed and Undirected Graphical Models Both directed and undirected graphical models allow complex distributions to be factorized into a product of simpler distributions (functions). Similarly. Sunita Sarawagi’s CRF package [153].A.11. of distributions which follow all the conditional independence properties. D. [120] gives an in depth description of the software and describes the commands necessary to run it using example of the named entity recognition task from language processing. we can represent only a subset. by directed and undirected models and there also exist distributions which can be represented only by either of the two. the two models differ in the way they determine the conditional independence relations. of distributions which follow all the conditional independence relations. Mallet [126]. the factorization results into a product of conditional probability distributions while in undirected graphs.11 shows that there exist a subset of distributions (which follow all the conditional independence relations) which can be represented. The directed graphical models do not require calculating such a partition function.5 Comparing the Directed and Undirected Graphical Models 175 A. We mainly experimented with the CRF++ and found it to be effective and easy to use and customize. Note that. the two models also differ in the way the probability distribution is factorized. The tools include CRF++ [120]. we can represent a subset. the factorization results into arbitrary functions. is not a complete list of tools and many other tools exist that can be used. it comes with the cost of calculating the normalization constant Z. consider a set S which represents the universe of distributions. Given the two ways (directed and undirected models) to factorize a distribution. Kevin Murphy’s MATLAB CRF code [154]. using the approach of directed graphical models.4. however. Factorization into arbitrary functions enable us to define functions which can capture dependencies among variables. However. A .4 Tools Available for Conditional Random Fields We now list some of the tools which implement conditional random fields. Then.

discussed properties along with the assumptions made which motivate their use in a particular problem including their advantages and disadvantages with respect to previously known approaches which can be used for similar tasks. • Conditional random fields eliminate the label bias problem which is present in other conditional models such as the maximum entropy Markov models. . The key features for conditional random fields are: • Conditional random fields can be considered as a generalization of the hidden Markov models. • Training a conditional random field involves many iterations of the forward-backward algorithm which has a complexity of O(K2 T ) where K is the number of states and T is the length of the sequence. A.11: Factorization in Graphical Models trivial example where the factorizations are alike is when all the random variables are independent.6 Conclusions In this chapter we described conditional random fields in detail.176 An Introduction to Conditional Random Fields Set of All Distributions Distributions Represented by Directed Models Distributions Represented by Both Directed and Undirected Models Distributions Represented by Undirected Models Figure A. computational biology and real-time intrusion detection. • Conditional random fields have been shown to be successful in many domains including computational linguistics. • Inference or test time complexity for a conditional random field is also O(K2 T ) where K is the number of states and T is the length of the sequence. • Long range dependencies can be modeled among observations using conditional random fields.

while features like ‘number of file creations’ and ‘number of files accessed’ are not expected to provide significant information. resulting in fitting irregularities in the data which can misguide classification. basic connection level features such as the ‘duration of connection’ and ‘source bytes’ are significant. DoS. Hence. We now describe our approach for selecting features for every layer and why some features were chosen over others. For detecting Probe attacks.1. Table B. B. The features selected for detecting Probe attacks are presented in Table B. using all the 41 features for detecting attacks belonging to all these classes severely affects the performance of the system and also generates superfluous rules. we performed feature selection to effectively detect different classes of attacks.1: Probe Layer Features Feature Number Feature Name 1 duration 2 protocol type 3 service 4 flag 5 src bytes 177 . We selected only five features for Probe layer. every record in the KDD 1999 data set presents 41 features which can be used for detecting a variety of attacks such as the Probe. However. R2L and U2R.Appendix B Feature Selection for Network Intrusion Detection As described in Chapter 4.1 Feature Selection for Probe Layer Probe attacks are aimed at acquiring information about the target network from a source that is often external to the network.

the network level features such as the ‘duration of connection’.2 Feature Selection for DoS Layer DoS attacks are meant to prevent the target from providing service(s) to its users by flooding the network with illegitimate requests.2: DoS Layer Features Feature Number Feature Name 1 duration 2 protocol type 4 flag 5 src bytes 23 count 34 dst host same srv rate 38 dst host serror rate 39 dst host srv serror rate 40 dst host rerror rate B. or whether or not the ‘root shell’ is invoked or ‘number of files accessed’ and.2. ‘service requested’ and the host level features such as the ‘number of failed login attempts’ among others. The features selected for detecting DoS attacks are presented in Table B. our experimental results presented earlier show that careful feature selection can significantly improve their detection. The features selected for detecting R2L attacks are presented in Table B.3. Hence.3 Feature Selection for R2L Layer R2L attacks are one of the most difficult attacks to detect and most of the present systems cannot detect them reliably. we selected only nine features for the DoS layer. Table B. to detect attacks at the DoS layer. require a large number of features and we selected 14 features. Detecting R2L attacks. Hence. the network level and the host level features. network traffic features such as the ‘percentage of connections having same destination host and same service’ and packet level features such as the ‘duration’ of a connection. it may not be important to know whether a user is ‘logged in or not’. ‘protocol type’. we selected both. to detect R2L attacks. hence. From all the 41 features.178 Feature Selection for Network Intrusion Detection B. To detect DoS attacks. We observed that effective detection of the R2L attacks involve both. ‘percentage of packets with errors’ and others are significant. . such features are not considered in the DoS layer. However. ‘source bytes’.

Features selected for detecting U2R attacks are presented in Table B. for detecting U2R attacks. ‘number of shell prompts invoked’.B. Table B. From all the 41 features.4: U2R Layer Features Feature Number Feature Name 10 hot 13 num compromised 14 root shell 16 num root 17 num file creations 18 num shells 19 num access files 21 is host login . we selected features such as ‘number of file creations’.3: R2L Layer Features Feature Number Feature Name 1 duration 2 protocol type 3 service 4 flag 5 src bytes 10 hot 11 num failed logins 12 logged in 13 num compromised 17 num file creations 18 num shells 19 num access files 21 is host login 22 is guest login B.4 Feature Selection for U2R Layer U2R attacks involve the semantic details which are very difficult to capture at an early stage at the network level. Such attacks are often content based and target an application. while we ignored features such as ‘protocol’ and ‘source bytes’. Hence.4 Feature Selection for U2R Layer 179 Table B. we selected only eight features for the U2R layer.4.

180 Feature Selection for Network Intrusion Detection B.0]/%x[0. unigram and bigram feature functions.0] U06:%x[0.0] U02:%x[0. which begins with B.0] U03:%x[1. the template defines a special macro %x[row.0] U01:%x[-1. For unigram feature functions.0] U04:%x[2.0]/%x[1. # Unigram U00:%x[-2. Hence. which begins with U. a combination of the current output token and previous output token (bigram) is automatically generated. where L is the number of output classes and N is the number of unique features generated by the templates. The number of feature functions generated by this type of template amounts to (L * N). where L is the number of output classes and N is the number of unique string expanded from the given template. feature functions must be chosen in prior. For bigram feature functions. A sample template used in our experiments is presented next. we defined a template which can be used to extract all possible feature functions from the given training data to train the conditional random filed using the CRF++ tool [120]. where row specifies the relative position from the current focusing token and col specifies the absolute position of the column. This type of template generates a total of (L * L * N) distinct features.5 Template Selection To train a conditional random field. The template can be used to define both.0] U05:%x[-1.col] which is used to specify a token in the input data.0] # Bigram B .

Request made (or the function invoked) by the client. Reference to the previous request in the same session. The number of feature functions generated by this type of template amounts to (L * N). which begins with U. Response generated for the request. The template can be used to define both. This type of template generates a total of (L * L * N) distinct features. unigram and bigram feature functions. a combination of the current output token and previous output token (bigram) is automatically generated. 3.1 Template Selection To train a conditional random field. feature functions must be chosen in prior.col] which is used to specify a token in the input data. Amount of data transferred (in bytes). 4. Hence. the template defines a special macro %x[row. 5. which begins with B. where L is the number of output classes and N is the number 181 . where L is the number of output classes and N is the number of unique string expanded from the given template. 2. Number of data queries generated in a single web request. For unigram feature functions. For bigram feature functions.Appendix C Feature Selection for Application Intrusion Detection As described in Chapter 5. where row specifies the relative position from the current focusing token and col specifies the absolute position of the column. we defined a template which can be used to extract all possible feature functions from the given training data to train the conditional random filed using the CRF++ tool [120]. The six features are: 1. 6. we used 6 features to represent a user session. Time taken to process the request. C.

U301:%x[-4. U101:%x[-4.1] .2] U207:%x[2.5] . U508:%x[3. U109:%x[4. U504:%x[-1. U102:%x[-3. U404:%x[-1. U106:%x[1.0] .5] .4] . U402:%x[-3.3] .1] .0] . . . U305:%x[0. U303:%x[-2. U309:%x[4. U204:%x[-1. U003:%x[-2. U409:%x[4.1] . U006:%x[1.0] U007:%x[2. U206:%x[1.1] .4] U407:%x[2. U108:%x[3.4] .0] . U008:%x[3.0] .0] .1] . U104:%x[-1.3] . U005:%x[0. U209:%x[4. U105:%x[0.0] .3] . A sample template used in our experiments is presented next.3] .1] .1] U107:%x[2.2] .5] .4] .4] . U501:%x[-4.5] U507:%x[2. U002:%x[-3.2] . U403:%x[-2.2] .2] . U009:%x[4.5] .0] . U406:%x[1. U203:%x[-2.1] . U304:%x[-1. . U509:%x[4.4] .3] U307:%x[2.2] .4] . U505:%x[0.3] .4] . \# Bigram B . U506:%x[1.0] . U401:%x[-4.1] . U302:%x[-3. U004:%x[-1.2] . U405:%x[0.5] . . U201:%x[-4. U503:%x[-2.2] . U408:%x[3. \# Unigram U001:%x[-4.2] . U308:%x[3. U306:%x[1. . U205:%x[0.5] .3] .3] . U103:%x[-2.4] . U202:%x[-3. U208:%x[3.5] .5] .182 Feature Selection for Application Intrusion Detection of unique features generated by the templates. U502:%x[-3.3] .

Sign up to vote on this title
UsefulNot useful