
CHAPTER ONE

INTRODUCTION

1.1 BACKGROUND TO THE STUDY

Intrusion detection systems (IDS) are security tools used to improve the security of communication and information systems; they focus primarily on detecting malicious network traffic (Marib, 2018). An IDS complements other security mechanisms such as firewalls, antivirus software, and access control schemes. By detection method, IDSs are classified into signature-based and anomaly-based systems. A signature-based system identifies traffic patterns or application data as malicious by matching them against known attack signatures, and therefore requires a regularly updated database storing all new attack signatures, whereas an anomaly-based system compares all activity against a defined model of normal behaviour (Agrawal, 2015).
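The two detection styles can be illustrated with a minimal sketch; the signature names, traffic labels, and the normal-profile numbers below are hypothetical examples, not a production detector:

```python
# Minimal illustration of the two detection styles described above.
# Signature names and the normal profile are hypothetical.

# Signature-based: flag traffic that matches a known attack signature.
KNOWN_SIGNATURES = {"smurf", "neptune", "teardrop"}  # must be kept up to date

def signature_detect(traffic_type):
    return traffic_type in KNOWN_SIGNATURES

# Anomaly-based: flag activity that deviates too far from a normal profile.
# Here the "profile" is just the mean and spread of bytes per connection.
def anomaly_detect(bytes_sent, normal_mean=500.0, normal_std=100.0, k=3.0):
    return abs(bytes_sent - normal_mean) > k * normal_std

print(signature_detect("smurf"))      # known attack -> True
print(signature_detect("slowloris"))  # not in database -> False (signature gap)
print(anomaly_detect(5000))           # far from normal profile -> True
```

Note how the signature detector misses any attack absent from its database, while the anomaly detector catches it only if it deviates from the profile.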

The main objective of an IDS is to detect attacks on the network and raise an alarm. A good IDS detects new or malicious attacks within a short time and carries out the necessary actions. Currently used IDSs do not achieve 100% accuracy; hence this study has been carried out to improve and increase IDS accuracy (Sheta and Alamleh, 2017). Many machine learning techniques have been used to help detect network attacks, improve the detection rate, and develop effective classification and clustering models that distinguish normal packets from abnormal ones. Accurately detecting intrusions within the complete network traffic is therefore treated as a classification problem.

IDSs are also classified according to the detection methods used to identify malicious attacks (Onik and Haq, 2016).

Due to the unprecedented growth of computer networks, the increasing number of devices running on them, and the rising number of cyber-attacks, network security has become a fundamental issue in modern computing. A main task of the technology expert is therefore to secure data in terms of confidentiality, integrity, and availability (Hemant, Sarkhedi & Vaghamshi, 2013).

As computer networks have become an important part of society, their security is of great importance; a failure here can have dangerous effects such as information theft. Information obtained from a private network, for example government information, can be used against the progress of society, leading to pandemonium and panic at large.

Traditional protection techniques such as user authentication, data encryption, avoidance of programming errors, and firewalls are used as the first line of defence for computer security. However, a weak password is easily compromised, user authentication cannot stop every unauthorized user, and firewalls are vulnerable to configuration errors and to ambiguous or undefined security policies (Summers, 1997). They are generally unable to fully protect against malicious mobile code, insider attacks, and other forms of intrusion. Programming errors cannot be avoided entirely, as the complexity of systems and application software evolves rapidly and leaves behind exploitable weaknesses. Consequently, computer systems are likely to remain insecure for the foreseeable future.

Therefore, intrusion detection is required as an additional wall of protection alongside prevention techniques. Intrusion detection is useful not only in detecting successful intrusions but also in monitoring attempts to break security, which provides important information for timely countermeasures (Sundaram, 1996).

Intrusion detection and prevention is a developing field, receiving considerable attention today due to the prevalent activities of hackers. Moreover, merely deploying an IDS to secure a network matters less than how effectively and efficiently that IDS performs its duties. Anomaly detection recognizes any variation from the defined patterns for users; it is based on building profiles of monitored activity, and it refers to the problem of finding patterns in data that do not conform to expected behaviour. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains. Of these, "anomalies" and "outliers" are the two terms used most commonly in the context of anomaly detection, sometimes interchangeably. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance, or healthcare, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance of enemy activities

(Varun, Arindam, & Vipin, 2009). Misuse detection uses a rule-based approach to detect known attacks by matching attack patterns against a list of signatures, much like antivirus applications. The signatures must be updated regularly, because this type of IDS cannot detect an attack whose signature is not in its library. In contrast, anomaly-based detection involves monitoring users' activities to catch any deviation from a normal behaviour profile. Although it can detect unknown attacks, its probability of raising false alarms is considerable (Muda, Yassin, Sulaiman & Udzir, 2011). Intrusion Detection Systems (IDS) have become a standard component in security infrastructures, as they allow network administrators to detect policy violations, which range from external attackers trying to gain unauthorized access to insiders abusing their access.

1.2 STATEMENT OF PROBLEM

Machine learning algorithms such as Naïve Bayes (NB), Neural Networks (NN), Support Vector Machines (SVM), K-Nearest Neighbours (KNN), fuzzy logic models, and genetic algorithms (Chiba, 2017) have been widely used over the last decades to find intrusions in computer networks. However, current intrusion detection systems suffer from various problems, such as low accuracy, imbalanced detection rates across attack types, high false alarm rates, and redundancy among the input attributes in the training data. Rida and Omri (2016) also suggested that removing noise and redundancy from the dataset helps in building intrusion detection models. Features may contain false correlations, which hinder the process of detecting intrusions; further, some features may be redundant, since the information they add is already contained in other features. Extra features increase computation time and can reduce the accuracy of an IDS. Another shortcoming is that intrusion attacks or anomalies in network infrastructures mostly lead to great financial losses and massive leaks of sensitive data, thereby decreasing the efficiency and productivity of an organization (Zhouhair, Noreddine & Khalid, 2018).

These gaps, chiefly in detection accuracy and false alarm rate, are what this project seeks to address. Improving detection accuracy has thus become a major problem in intrusion detection systems.

1.3 AIM AND OBJECTIVES

The aim of this project is to develop an intrusion detection system based on feature selection and a hybridized decision tree (decision tree classification algorithm) to improve the detection rate and accuracy and to reduce the false alarm rate.

The following are the specific objectives:

i. to acquire the network intrusion detection dataset.

ii. to implement an intrusion detection system based on Random

forest and J48 algorithms.


iii. to evaluate the performance of the classifier in terms of accuracy,

precision, sensitivity and specificity.
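Objective (iii) can be made concrete. Given a binary confusion matrix (attack vs. normal), the four measures follow directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); the counts used below are hypothetical placeholders, not results of this study:

```python
# Evaluation metrics for a binary IDS classifier, computed from the
# confusion-matrix counts: TP (attacks caught), TN (normal passed),
# FP (false alarms), FN (missed attacks). Counts below are hypothetical.
def evaluate(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)   # of raised alarms, how many were real attacks
    sensitivity = tp / (tp + fn)   # recall, i.e. the detection rate
    specificity = tn / (tn + fp)   # normal traffic correctly passed
    return accuracy, precision, sensitivity, specificity

acc, prec, sens, spec = evaluate(tp=90, tn=95, fp=5, fn=10)
print(f"accuracy={acc:.3f} precision={prec:.3f} "
      f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Note that sensitivity corresponds to the detection rate and (1 − specificity) to the false alarm rate discussed in the problem statement.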

1.4 SIGNIFICANCE OF THE STUDY

Intrusion Detection Systems (IDS) play an important role in an organization’s

security framework. Security tools such as anti-virus software, firewalls, packet

sniffers and access control lists aid in preventing attackers from gaining easy

access to an organization’s systems but they are in no way foolproof. These

factors make the need for a proper security framework even more

paramount. A tool is therefore needed to alert system administrators of the

possibility of rogue activities occurring on their networks. Intrusion

Detection Systems can play such a role. Therefore a need exists for development

of effective and efficient algorithms and intrusion detection systems for this

purpose.

1.5 DEFINITION OF TERMS

Data mining: The term data mining is frequently used to designate the process of

extracting useful information from large databases.

Machine learning: It is a collection of algorithms for data analysis and predictive

modeling.

Classification: Classification assigns each data item to one of a set of predetermined classes; it predicts the target class for each data item.


Intrusion: An intrusion can be defined as "any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource".

Intrusion Detection System (IDS): An intrusion detection system is a software program that helps to identify malicious programs or activity entering a system or network.

KDD (Knowledge Discovery in Databases): denotes the process of extracting useful knowledge from large data sets. Data mining, by contrast, refers to one particular step in this process: specifically, the data mining step applies so-called data mining techniques to extract patterns from the data.

1.6 Project Layout

This project is organized as follows. Chapter two consists of a review of the literature and of important concepts relevant to the subject matter of this project. Chapter three presents the methodology. Chapter four presents the system design, implementation, and the necessary system evaluation. Chapter five contains the conclusion, summary, and limitations of the study, together with future work.


CHAPTER TWO

LITERATURE REVIEW

2.0 Related work

Belhadj-Aissa and Guerroumi (2016) presented a new Artificial Immune Systems (AIS) based approach for network anomaly detection built on the Negative Selection process (NADNS). From the biological point of view, Negative Selection (NS) is the principle of distinguishing self-cells from non-self-cells, which is highly coherent with the (normal/anomaly) classification problem in intrusion detection.

N. Lokeswari and B. Chakradhar Rao (2016) proposed an anomaly intrusion detection model based on an artificial neural network classifier to detect intruders trying to enter a system. They chose the back-propagation algorithm (BPN) as the learning algorithm for their multilayer perceptron (MLP) network, with weight updating based on a particle swarm optimization weight extraction algorithm (PSO WENN).

Erza Aminanto et al. (2017) proposed an anomaly detection system that detects network intrusions based on the Ant Clustering Algorithm (ACA) and a Fuzzy Inference System (FIS).

Tao Ma et al. (2018) proposed a novel approach called KDSVM, which combines K-means clustering with the feature-learning advantage of a deep neural network (DNN) model and the strong classification of support vector machines (SVM) to detect network intrusions.

Panda et al. (2018) stated that integrating a hybrid intelligent scheme, in which different classifiers are implemented, would improve detection and make it more genuine, thereby improving result quality. In their paper, the researchers applied a 2-class classification strategy based on the 10-fold cross-validation process, which increased the rate of intrusion detection and decreased the rate of false alarms. Aburomman and Reaz (2016) carried out a study describing the different algorithms used for classifying intrusions based on popular machine learning methods. They studied homogeneous and heterogeneous systems along with various hybrid techniques, and stated that implementing ensemble-based techniques helped in solving pattern classification problems.
Security measures have failed in many cases to stop the wide variety of possible

attacks. The goal of intrusion detection is to build a system that would

automatically scan network activity and detect such intrusion attacks, providing

the necessary information to the system administrator to allow for corrective

action. A strong case can be made for the use of data mining techniques to

improve the current state of intrusion detection.

2.1 INTRUSION AND INTRUSION DETECTION

An intrusion can be defined as "any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource" (Wenke & Salvatore, 1998). An intrusion is a type of malicious activity that tries to defeat the security properties of a computer system. Intrusion detection is the process of monitoring events, gathering intrusion-related information, and inspecting the events for signs of malicious acts (Maharaj & Khanna, 2014). Maharaj and Khanna (2014) state that 'the primary goal of intrusion detection is to model usual application behaviour, so that we can recognize attacks by their peculiar effects without raising too many false alarms'. Intrusion detection is an area growing in significance as more and more sensitive data are stored and processed in networked systems (D'silva & Vora, 2013).

The goal of intrusion detection is to detect security violations in

information systems. Intrusion detection is a passive approach to security as it

monitors information systems and raises alarms when security violations are

detected (Reddy, Reddy & Rajulu, 2011).

2.2 INTRUSION DETECTION SYSTEM

Intrusion detection is a technology designed to observe computer activities for the purpose of finding security violations; the security of a computer system is compromised when an intrusion takes place. Intrusion detection is the process of identifying and responding to malicious activity targeted at computing and networking resources (Amoroso, 1999). Intrusion prevention techniques, such as user authentication and information protection, have been used to protect computer systems as a first line of defence. Prevention alone is not sufficient because, as systems become ever more complex, there are always exploitable weaknesses due to design and programming errors. Nowadays, intrusion detection is one of the highest-priority tasks for network administrators and security professionals, as network-based computer systems play increasingly vital roles in modern society and have become targets of attack. Intrusion detection systems support the following three essential security properties:

1. Data confidentiality: Information that is being transferred through

the network should be accessible only to those that have been

properly authorized.

2. Data integrity: Information should maintain its integrity from the moment it is transmitted to the moment it is actually received. No corruption or data loss is acceptable, whether from random events or malicious activity.

3. Data availability: The network or system resource should be accessible and usable upon demand by an authorized system user.

Any intrusion detection system has some inherent requirements. Its prime purpose is to detect as many attacks as possible with a minimum number of false alarms; i.e., the system must be accurate in detecting attacks. However, an accurate system that cannot handle a large amount of network traffic and is slow in decision making will not fulfil the purpose of an intrusion detection system (IDS). Data mining techniques such as data reduction, data classification, and feature selection therefore play an important role in IDS.

2.3 TAXONOMY OF IDS

An IDS uses several techniques to determine what qualifies as an intrusion versus normal traffic. One useful method of classifying intrusion detection systems is according to data source; each category has a distinct approach to monitoring and securing data and systems. There are two general categories under this classification:

1. Host-based IDSs (HIDS) – examine data held on individual computers that serve as hosts. The architecture of a host-based system is agent-based, meaning that a software agent resides on each of the hosts governed by the system (Nadiammai, Krishaveni, & Hemalatha, 2011).

2. Network-based IDSs (NIDS) – examine data exchanged between computers. By contrast, the most efficient host-based intrusion detection systems are capable of monitoring and collecting system audit trails in real time as well as on a scheduled basis, thus distributing both CPU utilization and network overhead and providing a flexible means of security administration (Nadiammai, Krishaveni, & Hemalatha, 2011).
2.4 INTRUSION DETECTION APPROACHES

The signatures of some attacks are known, whereas other attacks only

reflect some deviation from normal patterns. Consequently, two main approaches

have been devised to detect intruders.

2.4.1 Anomaly Detection: Anomaly detection assumes that intrusions will always reflect some deviation from normal patterns. It may be divided into static and dynamic anomaly detection. A static anomaly detector is based on the assumption that there is a portion of the system being monitored that does not change: the system's code and the constant portion of data upon which its correct functioning depends. For example, the operating system software and the data used to bootstrap a computer never change. If the static portion of the system ever deviates from its original form, either an error has occurred or an intruder has altered the static portion of the system. Dynamic anomaly detection typically operates on audit records or on monitored network traffic data. Operating system audit records do not record all events, only events of interest; therefore only behaviour that results in a recorded event will be observed, and these events may occur in a sequence.
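The static case described above can be sketched as a simple hash-based integrity check: a fingerprint of the unchanging portion is recorded once and compared against later snapshots. The file content used here is a hypothetical stand-in for real system binaries:

```python
# Sketch of a static anomaly detector: the "static portion" of the system
# (e.g., OS binaries, boot data) should never change, so a stored hash of
# it can be compared against a freshly computed one. Content is hypothetical.
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Baseline taken when the system is known to be clean.
baseline = fingerprint(b"kernel-image-v1")

# Later check: any deviation means an error or an intruder's modification.
def static_portion_changed(current: bytes) -> bool:
    return fingerprint(current) != baseline

print(static_portion_changed(b"kernel-image-v1"))   # unchanged -> False
print(static_portion_changed(b"kernel-image-v1x"))  # altered   -> True
```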

2.4.2 Misuse Detection: It is based on the knowledge of system vulnerabilities

and known attack patterns. Misuse detection is concerned with finding intruders

who are attempting to break into a system by exploiting some known


vulnerability. Ideally, a system security administrator should be aware of all the

known vulnerabilities and eliminate them. The term intrusion scenario is used as a

description of a known kind of intrusion; it is a sequence of events that would

result in an intrusion without some outside preventive intervention. An intrusion

detection system continually compares recent activity to known intrusion

scenarios to ensure that one or more attackers are not attempting to exploit known

vulnerabilities. To perform this, each intrusion scenario must be described or

modeled.
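One way to model an intrusion scenario, following the description above, is as an ordered sequence of events checked against recent activity; a match means the known vulnerability is being exploited. The scenario and event names below are invented for illustration:

```python
# Sketch of misuse detection as intrusion-scenario matching: a scenario is
# a sequence of events that, without preventive intervention, would result
# in an intrusion. Event names here are hypothetical.
def scenario_matches(scenario, events):
    """True if `scenario` occurs as an ordered subsequence of `events`."""
    it = iter(events)
    return all(step in it for step in scenario)  # `in` consumes the iterator

SCENARIO_PASSWORD_GRAB = ["login_fail", "login_fail", "copy_passwd_file"]

activity = ["login_fail", "read_mail", "login_fail", "copy_passwd_file"]
print(scenario_matches(SCENARIO_PASSWORD_GRAB, activity))  # True
```

A real system would compare recent activity against every modelled scenario continually, as the text describes.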

2.4.3 Advantages and Disadvantages of Anomaly Detection and Misuse

Detection

The main disadvantage of misuse detection approaches is that they will

detect only the attacks for which they are trained to detect. Novel attacks or

unknown attacks or even variants of common attacks often go undetected. The

main advantage of anomaly detection approaches is the ability to detect novel

attacks or unknown attacks against software systems, variants of known attacks,

and deviations of normal usage of programs regardless of whether the source is a

privileged internal user or an unauthorized external user. The disadvantage of the

anomaly detection approach is that well-known attacks may not be detected,

particularly if they fit the established profile of the user. Once detected, it is often

difficult to characterize the nature of the attack for forensic purposes. Finally a

high false positive rate may result for a narrowly trained detection algorithm, or
conversely, a high false negative rate may result for a broadly trained anomaly

detection approach.

2.4.4 Combining misuse and anomaly detection

Anomaly detection and misuse detection have major shortcomings that

hamper their effectiveness in detecting intrusions. Research can be carried into

intrusion detection methodologies which combine the anomaly detection

approach and the misuse detection approach (Lunt, 1989). These techniques seek

to incorporate the benefits of both of the standard approaches to intrusion

detection. The combined approach permits a single intrusion detection system to

monitor for indications of both external and internal attacks. While a significant advantage over the use of either method separately, a combined anomaly/misuse mechanism does possess some disadvantages. The use of two knowledge bases increases the amount of system resources that must be dedicated to the intrusion detection system (Cannady & Harrell, 1996). Additional disk space is required to store the profiles, and increased memory requirements are encountered as the mechanism compares user activities with information in the dual knowledge bases. In addition, the technique shares the inability of either method individually to detect collaborative or extended attack scenarios. Pattern recognition possesses a distinct advantage over anomaly and misuse detection methods in that it is capable of identifying attacks which occur over an extended period of time, across a series of user sessions, or by multiple attackers working in concert. This approach is also effective in reducing the need to review a potentially large amount of audit data (Cannady & Harrell, 1996).

Figure 2.1 shows a taxonomy of Intrusion Detection Systems. More details and information on the various IDS systems and the way they work can be found in (Mitchell, 2005).

Figure 2.1: Intrusion Detection System Taxonomy (Stefan, 2000)

2.5 INTRUSION ATTACKS

Almost all researchers categorize intrusion attacks into four different types. Jaiganesh et al. (2013) detailed four types of attacks made on a network-based intrusion detection system:
1. Denial of Service attack (DoS): an attack in which the attacker makes the computing or memory resources too busy or too full to handle legitimate requests.

2. User to Root attack (U2R): an attack in which the attacker starts from a normal user account and tries to gain root access.

3. Remote to Local attack (R2L): an attack in which the attacker sends packets to a machine over a network without having an account on that machine.

4. Probing attack: an attempt to gather information about a network of computers.

However, Patel et al. (2013) observed that there are more than four types of attack; they stated that the classes in the KDD'99 dataset can be categorized into five main classes (one normal class and four main intrusion classes: PROBE, DoS, U2R, and R2L).

1. Normal connections are generated by simulated daily user behaviour, such as downloading files and visiting web pages.

2. Denial of Service (DoS) attacks make the computing power or memory of a victim machine too busy or too full to handle legitimate requests. DoS attacks are classified based on the services that an attacker renders unavailable to legitimate users, e.g. apache2, land, mailbomb, and back.

3. Remote to User (R2L) is an attack in which a remote user gains access to a local user account by sending packets to a machine over a network; examples include sendmail and xlock.

4. User to Root (U2R) is an attack in which an intruder begins with access to a normal user account and then becomes a root user by exploiting various vulnerabilities of the system. The most common U2R exploits are regular buffer overflows, loadmodule, fdformat, and ffbconfig.

5. Probing (Probe) is an attack that scans a network to gather information or find known vulnerabilities. An intruder with a map of the machines and services available on a network can use that information to look for exploits.
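In practice, the per-connection attack labels in KDD'99 are mapped to these five classes before a model is trained. A partial mapping, covering only the attack names mentioned above plus two assumed Probe examples (ipsweep, portsweep), might look like:

```python
# Partial mapping of KDD'99 connection labels to the five main classes.
# Only a few labels per class are shown; the full dataset contains more.
LABEL_TO_CLASS = {
    "normal": "Normal",
    "apache2": "DoS", "land": "DoS", "mailbomb": "DoS", "back": "DoS",
    "sendmail": "R2L", "xlock": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "fdformat": "U2R", "ffbconfig": "U2R",
    "ipsweep": "Probe", "portsweep": "Probe",
}

def categorize(label):
    # Unlisted labels fall through to "Unknown" rather than a wrong class.
    return LABEL_TO_CLASS.get(label, "Unknown")

print(categorize("land"))   # DoS
print(categorize("xlock"))  # R2L
```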

2.6 DRAWBACKS OF INTRUSION DETECTION SYSTEMS (IDSS)

Intrusion Detection Systems (IDS) have become a standard component in

security infrastructures as they allow network administrators to detect policy

violations. These policy violations range from external attackers trying to gain

unauthorized access to insiders abusing their access. Current IDS have a number

of significant drawbacks:

1. Current IDS are usually tuned to detect known service level network

attacks. This leaves them vulnerable to original and novel malicious

attacks.
2. Data overload: Another aspect, which does not relate directly to misuse detection but is extremely important, is how much data an analyst can efficiently analyze. The amount of data that needs to be examined seems to grow rapidly; depending on the intrusion detection tools employed by a company and its size, logs can reach millions of records per day.

3. False positives: A common complaint is the number of false positives an IDS generates. A false positive occurs when normal traffic is mistakenly classified as malicious and treated accordingly.

4. False negatives: The case in which an IDS does not generate an alert when an intrusion is actually taking place (classification of malicious traffic as normal).

Data mining can help improve intrusion detection by addressing each of the above-mentioned problems:

1. Remove normal activity from alarm data to allow analysts to focus on

real attacks.

2. Identify false alarm generators and “bad” sensor signatures.

3. Find anomalous activity that uncovers a real attack.

4. Identify long, ongoing patterns (different IP address, same activity).

To accomplish these tasks, data miners employ one or more of the following

techniques:
I. Data summarization with statistics, including finding outliers

II. Visualization: presenting a graphical summary of the data

III. Clustering of the data into natural categories

IV. Association rule discovery: defining normal activity and enabling

the discovery of anomalies

V. Classification: predicting the category to which a particular record

belongs
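Technique (I), statistical summarization with outlier finding, can be sketched with the standard library; the byte counts below are made-up sample data, and the threshold of two standard deviations is an arbitrary illustrative choice:

```python
# Sketch of technique I: summarize a numeric feature (bytes per connection,
# values hypothetical) and flag outliers more than k standard deviations
# from the mean.
import statistics

def find_outliers(values, k=2.0):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]

traffic = [500, 480, 510, 495, 505, 490, 20000]  # one anomalous connection
print(find_outliers(traffic))  # -> [20000]
```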

2.7 DATA MINING AND IDS

Data mining techniques can be differentiated by their different model

functions and representation, preference criterion, and algorithms (Fayyad et al.,

The model function we are mainly interested in is classification: labelling activity as normal, malicious, or as a particular type of attack (Ghosh, Schwartzbar & Schatz, 1999). We are also interested in link and sequence analysis (Eric Bloedron et al., 2001). Additionally, data mining systems provide the means to

easily perform data summarization and visualization, aiding the security analyst

in identifying areas of concern (Eric Bloedron et al., 2001). The models must be

represented in some form. Common representations for data mining techniques

include rules, decision trees, linear and non-linear functions (including neural

nets), instance-based examples, and probability models (Fayyad et al., 1996).


A. Off Line Processing

The use of data mining techniques in IDSs usually implies analysis of the collected data in an offline environment. There are important advantages in performing intrusion detection offline, in addition to the real-time detection tasks typically employed. The most important of these advantages are:

1. In off-line analysis, it is assumed that all connections have already

finished and, therefore, we can compute all the features and check the

detection rules one by one (Fayyad et al., 1996).

2. The estimation and detection process is generally very demanding and therefore cannot be addressed in an online environment because of the various real-time constraints (Fayyad et al., 1996). Many real-time IDSs will start to drop packets when flooded with data faster than they can process it.

3. An offline environment provides the ability to transfer logs from

remote sites to a central site for analysis during off-peak times.

B. Data Mining and Real Time IDSs

Even though offline processing has a number of significant

advantages, data mining techniques can also be used to enhance IDSs in

real time. (Lee et al., 1998). (Ghosh, Schwartzbar & Schatz, 1999) were

one of the first to address important and challenging issues of accuracy,


efficiency, and usability of real-time IDSs. They implemented feature

extraction and construction algorithms for labeled audit data. They

developed several anomaly detection algorithms. In the paper, the authors

explore the use of information- theoretic measures, i.e., entropy,

conditional entropy, relative entropy, information gain, and information

cost to capture intrinsic characteristics of normal data and use such

measures to guide the process of building and evaluating anomaly

detection models. They also develop efficient approaches that use statistics

on packet header values for network anomaly detection. A real-time IDS,

called ”Judge”, was also developed to test and evaluate the use of those

techniques. A serious limitation of their approaches (as well as with most

existing IDSs) is that they only do intrusion detection at the network or

system level. However, with the rapid growth of e-Commerce and e-

Government applications, there is an urgent need to do intrusion and fraud

detection at the application-level. This is because many attacks may focus

on applications that have no effect on the underlying network or system

activities.
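The entropy measure mentioned above can be computed directly from an event distribution: regular (predictable) audit data has low entropy, while mixed data has higher entropy. The service distributions below are made-up examples:

```python
# Sketch of the entropy measure used to characterize audit data: a
# predictable event distribution has low entropy, an evenly mixed one has
# high entropy. The service distributions are hypothetical.
import math
from collections import Counter

def entropy(events):
    counts = Counter(events)
    n = len(events)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

regular = ["http"] * 95 + ["smtp"] * 5          # mostly one service
mixed   = ["http", "smtp", "ftp", "dns"] * 25   # evenly mixed services

print(round(entropy(regular), 3))  # low, close to 0
print(round(entropy(mixed), 3))    # 2.0 bits for 4 equally likely classes
```

Conditional entropy and information gain are built from the same quantity, applied to joint and conditioned distributions.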

C. Multi sensor Correlation

The use of multiple sensors to collect data by various sources has been

presented by numerous researchers as a way to increase the performance

of an IDS.
1. Lee et al. (1998) state that using multiple sensors for intrusion detection should increase the accuracy of IDSs.

2. Kumar (1995) states that "correlation of information from different sources has allowed additional information to be inferred that may be difficult to obtain directly".

3. Lee et al. (1998) note that "an IDS should consist of multiple co-operative lightweight subsystems that each monitor a separate part (such as an access point) of the entire environment".

4. Dickerson and Dickerson (2000) also explore a possible

implementation of such a mechanism. Their architecture consists of

three layers:

– A set of Data Collectors (packet collectors)

– A set of Data Processors

– A Threat analyzer that utilizes fuzzy logic and basically performs

a risk assessment of the collected data.

2.8 BENEFITS OF DATA MINING TECHNIQUES

1. Large databases may contain valuable implicit regularities that can be discovered automatically.

2. Difficult-to-program applications, which are too difficult for traditional

manual programming.
3. Software applications that customize to the individual user’s

preferences, such as personalized advertising.

There are several reasons why data mining approaches play a role in these three domains. First, classifying security incidents requires analyzing a vast amount of historical data, and it is difficult for human beings to find patterns in such an enormous amount of data. Data mining, however, seems well suited to overcoming this problem and can therefore be used to discover those patterns.

2.9 REASONS TO USE DATA MINING APPROACHES IN IDS

1. It is very hard to program an IDS using ordinary programming languages, which require knowledge to be made explicit and formalized.

2. The adaptive and dynamic nature of machine-learning makes it a

suitable solution for this situation.

3. The environment of an IDS and its classification task highly depend on

personal preferences. What may seem to be an incident in one

environment may be normal in other environments. This way, the ability

of computers to learn enables them to know someone’s “personal” (or

organizational) preferences, and improve the performance of the IDS, for

this particular environment (Narayana, Prasad, Srividhya, & Ranga

Reddy, 2011).
2.10 THE DATA MINING PROCESS OF BUILDING INTRUSION

DETECTION MODELS

The recent rapid development in knowledge discovery in databases (KDD) has

brought a better understanding of the techniques and process frameworks that

can support systematic analysis of the vast amounts of audit data available. The process of using

data mining approaches to build intrusion detection models is shown in Figure 2.2.

Figure 2.2: The Data Mining Process of Building ID Models (Lee,

1999).

Here raw (binary) audit data is first processed into ASCII network packet

information (or host event data), which is in turn summarized into connection

records (or host session records) containing a number of within-connection

features, e.g., service, duration, flag (indicating the normal or error status
according to the protocols), etc. Data mining programs are then applied to the

connection records to compute the frequent patterns, i.e., association rules and

frequent episodes, which are then analyzed to construct additional features for the

connection records. Classification programs, for example, RIPPER, are then used

to inductively learn the detection models. This process is of course iterative. For

example, poor performance of the classification models often indicates that more

pattern mining and feature construction is needed (Lee, 1999).

Data Mining is the automated process of going through large amounts of data

with the intention to discover useful information about the data that is not

obvious. Useful information may include special relations between the data,

specific models of the data that repeat, specific patterns, ways of

classifying it, or specific values that fall outside the “normal” pattern or

model (Agrawal & Srikant, 1994). In order to understand how data mining can

help advance intrusion detection, it is important to know how current IDS work to

identify an intrusion. Intrusion detection systems are a combination of hardware

and software resources aimed at protecting the confidentiality, availability and

integrity of a computer system or network. To an analyst sitting in front of an

IDS, an ideal system would alert on all malicious connections, whether it is a

known or novel attack (Chittur, 2001). However, the search for the ideal IDS

continues and the amount of network data is increasing. Besides the issue of data

overload facing network analysts due to increasing complexity and large size of

networks, traditional methods for intrusion detection are based on extensive


knowledge of signatures of known attacks that are provided by human experts.

The signature database has to be manually revised for each new type of intrusion

that is discovered. A significant limitation of signature-based methods is that they

cannot detect emerging cyber threats. In addition, once a new attack is discovered

and its signature developed, often there is a substantial latency in its deployment

across networks.

2.11 DATA MINING APPROACHES FOR IDS

The central theme of our approach is to apply data mining techniques for

intrusion detection in network-based systems. Data mining generally refers to the

process of (automatically) extracting models from large stores of data. The recent

rapid development in data mining has made available a wide variety of

algorithms, drawn from the fields of statistics, pattern recognition, machine

learning, and databases. Several types of algorithms (Lee, Stolfo, & Mok, 1999)

are particularly relevant to our research:

1. Classification: Maps a data item into one of several pre-defined

categories. These algorithms normally output “classifiers”, for example,

in the form of decision trees or rules. An ideal application in intrusion

detection would be to gather sufficient “normal” and “abnormal” audit data

for a user or a program, then apply a classification algorithm to learn a

classifier that can label or predict new unseen audit data as belonging to
the normal class or the abnormal class (Narayana, Prasad,

Srividhya, & Pandu Ranga, 2011).

2. Link analysis: Determines relations between fields in the database.

Finding out the correlations in audit data will provide insight for selecting

the right set of system features for intrusion detection

(Narayana, Prasad, Srividhya, & Pandu Ranga, 2011).

3. Sequence analysis: Models sequential patterns. These algorithms can

discover what time-based sequences of audit events frequently occur

together (Narayana, Prasad, Srividhya, & Pandu

Ranga, 2011). These frequent event patterns provide guidelines for

incorporating temporal statistical measures into intrusion detection

models. For example, patterns from audit data containing network-based

denial-of-service (DOS) attacks suggest that several per-host and per-

service measures should be included.

4. Association Rule: This technique searches for frequently occurring item sets

in a large dataset. Association rule mining determines association rules

and/or correlation relationships among large sets of data items. The mining

process of association rule can be divided into two steps as follows:


i. Frequent Item Set Generation: Generates all sets of items whose

support is greater than the specified threshold, called

minsupport.

ii. Association Rule Generation: from the previously generated

frequent item sets, it generates association rules in the form of

“if-then” statements that have confidence greater than the

specified threshold, called minconfidence.

5. Clustering: It is an unsupervised machine learning mechanism for

discovering patterns in unlabeled data. It is used to label data and assign it

into clusters where each cluster consists of members that are quite similar.

Members from different clusters are different from each other. Hence

clustering methods can be useful for classifying network data for detecting

intrusions. Clustering can be applied to both anomaly detection and

misuse detection.
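The two-step association rule procedure above (frequent item set generation, then rule generation) can be sketched in plain Python. The toy transactions, feature names, and threshold values below are illustrative assumptions, not drawn from a real audit log:

```python
from itertools import combinations

# Toy "connection records": each transaction is a set of nominal features.
transactions = [
    {"proto=tcp", "service=http", "flag=SF"},
    {"proto=tcp", "service=http", "flag=S0"},
    {"proto=udp", "service=dns",  "flag=SF"},
    {"proto=tcp", "service=http", "flag=SF"},
]

def frequent_itemsets(transactions, minsupport):
    """Step i: generate every item set whose support >= minsupport."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(set(cand) <= t for t in transactions) / n
            if support >= minsupport:
                frequent[cand] = support
                found = True
        if not found:          # no frequent set of this size: none larger exists
            break
    return frequent

def rules(frequent, minconfidence):
    """Step ii: derive "if antecedent then consequent" rules above minconfidence."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for ante in combinations(itemset, k):
                conf = supp / frequent[ante]
                if conf >= minconfidence:
                    cons = tuple(i for i in itemset if i not in ante)
                    out.append((ante, cons, conf))
    return out

freq = frequent_itemsets(transactions, minsupport=0.5)
strong = rules(freq, minconfidence=0.9)
```

On this toy data, for example, the rule "if proto=tcp then service=http" comes out with confidence 1.0.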

2.12 CLASSIFICATION TECHNIQUES

Classification is a data mining (machine learning) technique used to predict

group membership for data instances. It consists of predicting a certain outcome

based on a given input. According to D’silva and Vora (2013), classification is a

supervised learning technique. A classification-based IDS classifies all

network traffic as either normal or malicious. Classification techniques are mostly

used for anomaly detection. The classification process is as follows:

i. It accepts collection of items as input.


ii. Maps the items into predefined groups or classes defined by some

attributes.

iii. After mapping, it outputs a classifier that can accurately predict the

class to which a new item belongs.

Classification is one of the data mining functionalities. It finds a model or function

that separates classes or data concepts in order to predict the classes of an

unknown object. For example, a loan officer requires data analysis to determine

which loan applicants are "safe" or "risky". The data analysis task is

classification, where a model or classifier is constructed to predict class

(categorical) labels, such as “safe” or “risky” for the loan application data. These

categories can be represented by discrete values, where the ordering among

values has no meaning. Because the class labels of the training data are already known,

it is also called supervised learning. Classification consists of two processes:

i. Training and

ii. Testing.

The first process, training, builds a classification model by analyzing training

data containing class labels. While the second process, testing, examines a

classifier (using testing data) for accuracy (in which case the test data contains

the class labels) or its ability to classify unknown objects (records) for

prediction (Crosbie & Spafford, 1995).
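The two phases can be illustrated with a deliberately simple nearest-centroid classifier on synthetic two-feature records; the data and the model here are stand-ins chosen so the sketch stays self-contained, not the classifiers used in this study:

```python
import random

# "Training" builds a model from labelled records; "testing" measures its
# accuracy on held-out records. Feature values are synthetic, not audit data.
random.seed(0)
normal   = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
abnormal = [(random.gauss(4, 1), random.gauss(4, 1)) for _ in range(100)]
data = [(x, "normal") for x in normal] + [(x, "abnormal") for x in abnormal]
random.shuffle(data)

train, test = data[:150], data[150:]           # hold out 25% for testing

def fit(records):
    """Training phase: compute one centroid per class label."""
    sums, counts = {}, {}
    for (a, b), label in records:
        sa, sb = sums.get(label, (0.0, 0.0))
        sums[label] = (sa + a, sb + b)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (sa / counts[lbl], sb / counts[lbl])
            for lbl, (sa, sb) in sums.items()}

def predict(centroids, point):
    """Testing phase: assign the point to the class with the nearest centroid."""
    return min(centroids, key=lambda lbl: (point[0] - centroids[lbl][0]) ** 2
                                        + (point[1] - centroids[lbl][1]) ** 2)

model = fit(train)                                                   # training
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)  # testing
```

The well-separated synthetic classes make the held-out accuracy close to 1, which is the behaviour step iii above describes: the output of training is a classifier that can predict the class of new items.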


Classification algorithms are laid under the following classification techniques:

i. Decision Tree based Methods

ii. Rule-based Method

iii. Memory – based Reasoning

iv. Neural Networks

v. Naïve Bayes and Bayesian Belief Networks

vi. Support Vector Machines

2.13 FEATURE SELECTION

Hebat, Sherif, and Mohamed (2012) in their work reveal that subsequent to

preprocessing of data, the features of the data set are identified as either being

significant to the intrusion detection process, or redundant. This process is known

as feature selection. Redundant features are generally found to be closely

correlated with one or more other features. As a result, omitting them from the

intrusion detection process does not degrade classification accuracy. In fact, the

accuracy may improve due to the resulting data reduction, and removal of noise

and measurement errors associated with the omitted features. Therefore, choosing

a good subset of features proves to be significant in improving the performance of

the system.

The authors presented two methods of feature selection, which are:

1) Information Gain: In this method, the features are filtered to create the

most prominent feature subset before the start of the learning process.
2) Gain ratio: a modification of the information gain that solves the issue

of bias towards features with a larger set of values, exhibited by

information gain. Its normalizing term, the intrinsic (split) information, is

large when the data are spread evenly over many branches and small

when all data belong to one branch.

Gain ratio takes the number and size of branches into account when choosing an

attribute: it corrects the information gain by dividing by the intrinsic information of a

split (i.e. how much information is needed to tell which branch an

instance belongs to), where the intrinsic information is the entropy of the distribution of

instances into branches.
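A small numerical sketch of why gain ratio corrects information gain's bias: an "id"-like feature with a unique value per row achieves a perfect information gain yet a penalised gain ratio. The feature values and labels below are invented for the example:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Expected reduction in label entropy after splitting on `feature`."""
    n = len(rows)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [lbl for r, lbl in zip(rows, labels) if r[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def split_info(rows, feature):
    """Intrinsic information: entropy of the distribution into branches."""
    return entropy([r[feature] for r in rows])

def gain_ratio(rows, labels, feature):
    return info_gain(rows, labels, feature) / split_info(rows, feature)

rows = [
    {"id": 1, "proto": "tcp"}, {"id": 2, "proto": "tcp"},
    {"id": 3, "proto": "udp"}, {"id": 4, "proto": "udp"},
]
labels = ["normal", "normal", "attack", "attack"]

# "id" is unique per row, so information gain rates it (spuriously) as perfect;
# gain ratio penalises its many one-row branches.
g_id, g_proto = info_gain(rows, labels, "id"), info_gain(rows, labels, "proto")
r_id, r_proto = gain_ratio(rows, labels, "id"), gain_ratio(rows, labels, "proto")
```

Here both features get an information gain of 1.0 bit, but the gain ratio of "proto" (1.0) beats that of "id" (0.5), so the genuinely predictive feature is preferred.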

Krzysztof and Norbert (2007) state that effective and versatile classification cannot be

achieved by single classification algorithms; it requires hybrid, complex

models comprising a feature selection stage. As with other data analysis tasks,

also in feature selection it is very beneficial to take advantage of different kinds of

algorithms. The authors further stressed that one of the most efficient heuristics

used for decision tree construction is the Separability of Split Value (SSV)

criterion. Its basic advantage is that it can be applied to both continuous and

discrete features in such a manner that the estimates of separability can be

compared regardless of the substantial difference in types. Furthermore, they opined

that the SSV criterion has been successfully used not only for building

classification trees. It proved efficient in data type conversion (from continuous to

discrete and in the opposite direction) and as the discretization part of feature

selection methods, which finally rank the features according to such indices like
Mutual Information. It is known that extra features can increase computation time

and can impact the accuracy of an IDS, so feature selection is a very good way of

improving the machine learning algorithms used for classification purposes.


CHAPTER THREE
SYSTEM APPROACH AND METHODOLOGY
3.1 Methodology

The approach used in this project is to apply the machine learning algorithm known

as decision trees extensively to the classification task of an intrusion detection system.

3.2 Decision Trees

Decision Trees are a class of very powerful Machine Learning models capable of

achieving high accuracy in many tasks while being highly interpretable. What

makes decision trees special in the realm of ML models is really their clarity of

information representation. The “knowledge” learned by a decision tree through

training is directly formulated into a hierarchical structure. This structure holds

and displays the knowledge in such a way that it can easily be understood, even by

non-experts.

3.3 Proposed Approach

The approach used in this project is to discuss extensively the intended algorithms

with their respective relation with classification and intrusion detection.

3.4 INFORMATION GAIN IN TERMS OF FEATURE REDUCTION

According to Ibrahim, Badr and Shaheen (2012), subsequent to preprocessing of

data, the features of the data set are identified as either being significant to the

intrusion detection process, or redundant. This process is known as feature

selection. Redundant features are generally found to be closely correlated with


one or more other features. As a result, omitting them from the intrusion detection

process does not degrade classification accuracy. In fact, the accuracy may

improve due to the resulting data reduction, and removal of noise and

measurement errors associated with the omitted features. Therefore, choosing a

good subset of features proves to be significant in improving the performance of

the system.

They thus define Information Gain: In this method, the features are filtered to

create the most prominent feature subset before the start of the learning process.

Mathematically, (Xindong et al, 2008) stated that information gain for a dataset

(D) containing Si tuples of class Ci for i = (1,…..,m) is defined as:

Information measures the information required to classify an arbitrary tuple:

Info(D) = -∑(i=1..m) (Si/S) log2(Si/S) ………………….(1)

where Si is the number of tuples in D belonging to class Ci, and S is the

total number of tuples in D.

The entropy (expected information) of feature X with values (X1, X2, ……., Xv) is:

E(X) = ∑(j=1..v) ((S1j + …… + Smj)/S) × Info(Dj) ……………….(2)

where Sij is the number of tuples of class Ci in the subset Dj for which X takes the value Xj.

The information gained by branching on feature X is then:

Gain(X) = Info(D) – E(X) ………………………..(3)
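Equations (1)-(3) can be checked numerically. The class and subset counts below are the classic Quinlan weather-data counts, used here purely for illustration:

```python
from math import log2

# D contains S = 14 tuples: 9 of class c1 and 5 of class c2.
S = 14
class_counts = [9, 5]                                         # S1, S2

info_D = -sum(si / S * log2(si / S) for si in class_counts)   # equation (1)

# Feature X takes 3 values; Sij = tuples of class Ci in subset Dj (X = Xj).
# Rows: classes c1, c2.  Columns: subsets D1, D2, D3.
Sij = [[2, 4, 3],
       [3, 0, 2]]

def info(counts):
    """Info() of a subset, skipping empty classes to avoid log2(0)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

E_X = sum(sum(col) / S * info(col) for col in zip(*Sij))      # equation (2)

gain_X = info_D - E_X                                         # equation (3)
```

With these counts, Info(D) ≈ 0.940 bits, E(X) ≈ 0.694 bits, and Gain(X) ≈ 0.247 bits.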


3.5 DATASET

Since 1999, KDD’99 (Yimin, 2004) has been the most widely used data set for

the evaluation of anomaly detection methods. This data set is built based on the

data captured in DARPA’98 IDS evaluation program (KDD, 1999). DARPA’98

is about 4 gigabytes of compressed raw (binary) tcpdump data of 7 weeks of

network traffic. The two weeks of test data contain around 2 million connection

records. The KDD training dataset consists of approximately 4,900,000 single

connection vectors, each of which contains 41 features and is labeled as either

normal or an attack, with exactly one specific attack type. The simulated attacks

fall in one of the following four categories:

(1) Denial of Service Attack (DoS): is an attack in which the attacker

makes some computing or memory resource too busy or too full to handle

legitimate requests, or denies legitimate users access to a machine.

(2) User to Root Attack (U2R): is a class of exploit in which the attacker

starts out with access to a normal user account on the system (perhaps

gained by sniffing passwords, a dictionary attack, or social engineering)

and is able to exploit some vulnerability to gain root access to the system.

(3) Remote to Local Attack (R2L): occurs when an attacker who has the

ability to send packets to a machine over a network but who does not have
an account on that machine exploits some vulnerability to gain local

access as a user of that machine.

(4) Probing Attack: is an attempt to gather information about a network of

computers for the apparent purpose of circumventing its security controls.

Table 1 shows the four categories and the corresponding attacks in

each category.

Figure 3.2: Classification of attacks on KDD data set


KDD Cup dataset → Import dataset → Data preprocessing → Apply feature selection → Apply the classifiers (RF and J48) → Result

Figure 3.1. An architecture of the proposed algorithm

3.6 DATA PRE-PROCESSOR

In complex classification domains, some data may hinder the classification

process. Features may contain false correlations, which hinder the process of

detecting intrusions. Further, some features may be redundant since the

information they add is contained in other features. Extra features can increase

computation time, and can impact the accuracy of IDS. Feature selection

improves classification by searching for the subset of features, which best

classifies the training data. Dimension Reduction techniques are proposed as a

data pre-processing step. This process identifies a suitable low-dimensional


representation of original data. Reducing the dimensionality improves the

computational efficiency and accuracy of the data analysis.

3.7 Random Forest Algorithm

Random Forest is a widely used machine learning algorithm that works through a

bagging approach to create an ensemble of decision trees, each built on a random subset of the

data. It is considered one of the most effective algorithms for almost any

prediction task. It can be used for both classification and regression

problems. It is a combination of tree predictors where each tree depends on the

values of a random vector sampled independently with the same distribution for

all trees in the forest.

The pseudocode for the random forest algorithm can be split into two stages. In the

first, ‘n’ random trees are created; this forms the random forest. In the second

stage, the outcome for the same test feature vector from all decision trees is combined.

The final prediction is then derived by assessing the results of each decision tree,

or simply by taking the prediction that appears most often among the trees.

The Random Forest machine learning algorithm maintains accuracy even when there

is inconsistent data, and it is simple to use. It also gives estimates of which variables

are important for the classification. It runs efficiently on large databases while

generating an internal unbiased estimate of the generalisation error, and it

provides methods for balancing error in class-unbalanced data sets. However,

random forests are difficult to analyse theoretically, and growing a large number of trees

can slow down prediction in real-time systems. Another drawback is

that the algorithm does not predict beyond the range of the response

values in the training data.

The random forest algorithm is a supervised classification algorithm. As the name

suggests, it builds a forest from a number of trees.

In general, the more trees in the forest, the more robust it is; likewise,

in the random forest classifier, a higher number of trees

tends to give more accurate results.

3.7.1 Random Forest Pseudocode:

1. Randomly select “k” features from the total “m” features, where k << m.
2. Among the “k” features, calculate the node “d” using the best split
point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until “l” number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 “n” times to
create “n” trees.

The random forest algorithm thus begins by randomly

selecting “k” features out of the total “m” features.
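The five steps above can be sketched in plain Python. To keep the sketch short, each tree is a one-level decision stump rather than a full tree; the toy data and parameter choices are illustrative assumptions, not the WEKA implementation used later in this project:

```python
import random
from collections import Counter

# Minimal random-forest sketch: each tree is grown on a bootstrap sample,
# considers only k of the m features at its (single) split, and the forest
# predicts by majority vote over the trees.
random.seed(1)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_stump(rows, labels, feature_ids):
    """Step 2: choose the (feature, threshold) split with fewest training errors."""
    best, best_err = None, len(rows) + 1
    for f in feature_ids:
        for t in {r[f] for r in rows}:
            left  = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] >  t]
            if not left or not right:
                continue
            lm, rm = majority(left), majority(right)
            err = sum(l != lm for l in left) + sum(l != rm for l in right)
            if err < best_err:
                best, best_err = (f, t, lm, rm), err
    return best

def grow_forest(rows, labels, n_trees=25, k=1):
    """Steps 1-5: n trees, each on a bootstrap sample with k random features."""
    forest, n, m = [], len(rows), len(rows[0])
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = random.sample(range(m), k)              # k of the m features
        stump = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if stump:
            forest.append(stump)
    return forest

def predict(forest, row):
    votes = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return majority(votes)                              # majority vote

# Toy data: two integer features; label depends on their sum.
rows = [(x, y) for x in range(5) for y in range(5)]
labels = ["attack" if x + y >= 5 else "normal" for x, y in rows]
forest = grow_forest(rows, labels)
train_acc = sum(predict(forest, r) == l for r, l in zip(rows, labels)) / len(rows)
```

Because each stump sees only one feature, no single tree can capture the diagonal boundary, yet the voted ensemble still recovers a usable classifier, which is the point of the bagging scheme described above.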

3.7.2 Random Forest Algorithm


Require: Initially the tree has exactly one leaf (TreeRoot) which covers the whole space
Require: The dimensionality of the input, D. Parameters λ, m and τ.
SelectCandidateSplitDimensions(TreeRoot, min(1 + Poisson(λ), D))
for t = 1, 2, . . . do
    Receive (Xt, Yt, It) from the environment
    At ← leaf containing Xt
    if It = estimation then
        UpdateEstimationStatistics(At, (Xt, Yt))
        for all S ∈ CandidateSplits(At) do
            for all A ∈ CandidateChildren(S) do
                if Xt ∈ A then
                    UpdateEstimationStatistics(A, (Xt, Yt))
                end if
            end for
        end for
    else if It = structure then
        if At has fewer than m candidate split points then
            for all d ∈ CandidateSplitDimensions(At) do
                CreateCandidateSplit(At, d, πdXt)
            end for
        end if
        for all S ∈ CandidateSplits(At) do
            for all A ∈ CandidateChildren(S) do
                if Xt ∈ A then
                    UpdateStructuralStatistics(A, (Xt, Yt))
                end if
            end for
        end for
        if CanSplit(At) then
            if ShouldSplit(At) then
                Split(At)
            else if MustSplit(At) then
                Split(At)
            end if
        end if
    end if
end for
3.8 J48
C4.5 is a successor of ID3 developed by Ross Quinlan and is implemented in

WEKA as J48 in Java. Both adopt a greedy, top-down approach to

decision tree construction. It is used for classification, in which new data is labelled

according to already existing observations (the training data set). Decision tree

induction begins with a dataset (the training set), which is partitioned at every node

into smaller partitions, thus following a recursive divide-and-conquer

strategy. In addition to the data set, which is a collection of objects, a set of

attributes is also passed. An object can be an event or an activity, and the attributes are

the information related to that object. Every tuple in the data set carries a

class label which identifies whether the object belongs to a particular class or not.

Splitting is performed further only if the tuples fall in different classes. The

partitioning of the dataset uses a heuristic that chooses the attribute which best partitions the

data set. This is referred to as an attribute selection measure. These attribute

selection measures determine the type of branching that occurs at a

node: the Gini index and information gain, for example, partition a node into a

binary or multiway split, respectively. Techniques for converting multi-branch

trees into strictly binary ones can be used if need be. C4.5 uses gain ratio as the attribute

selection measure, which has an advantage over the information gain used in its

predecessor ID3: ID3's information gain favours attributes with many unique

values, producing wide n-ary branch trees whose splits generalize poorly for

classification.

The J48 algorithm is as follows:

3.8.1 Algorithm (J48)

INPUT:

Dataset // Training data

OUTPUT:

Tree // Decision tree

BUILD(*Dataset)

{Tree = ∅;

Tree = Create root node, adding an arc for each split predicate, with labels assigned;

For each arc do

Dataset = Database created by applying the splitting predicate to Dataset;

If a stopping point is reached on this path, then Tree = create leaf node and label it with the

appropriate class;

Else

Tree = BUILD(Dataset);

Tree = add Tree to arc;

}
3.8.2 Pseudocode of J48

Step 1: All the rows in the dataset are passed to the root node.

Step 2: Based on the values of the rows in the node considered, each of the

predictor variables is split at all its possible split points.

Step 3: At each split point, the parent node is split into binary (child) nodes

by separating the rows with values lower than or equal to the split point from those with

values higher than the split point for the considered predictor variable. For

categorical predictor variables, each category of the variable is considered in

turn.

Step 4: The predictor variable and split point with the highest value of the

splitting criterion I are selected for the node,

where PL and PR are the probabilities of a sample lying in the left sub-tree and right

sub-tree respectively, and the remaining terms are the probabilities that a sample is in class Cj and

in the left sub-tree or right sub-tree.
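The recursive divide-and-conquer strategy described above can be sketched as a minimal gain-ratio tree builder. The data, attribute names, and helper functions below are invented for illustration; this is a sketch in the spirit of C4.5, not WEKA's J48 code:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5-style attribute selection measure for a nominal attribute."""
    n = len(rows)
    values = [r[attr] for r in rows]
    split = entropy(values)                 # intrinsic (split) information
    if split == 0:                          # one value only: useless split
        return 0.0
    rem = sum(values.count(v) / n *
              entropy([l for val, l in zip(values, labels) if val == v])
              for v in set(values))
    return (entropy(labels) - rem) / split

def build(rows, labels, attrs):
    """Recursive divide-and-conquer induction; a leaf is just a class label."""
    if len(set(labels)) == 1:               # stopping point: pure node
        return labels[0]
    if not attrs:                           # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    tree = {"attr": best, "branches": {}}
    for v in {r[best] for r in rows}:
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree["branches"][v] = build([r for r, _ in sub], [l for _, l in sub],
                                    [a for a in attrs if a != best])
    return tree

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

rows = [
    {"service": "http", "flag": "SF"}, {"service": "http", "flag": "S0"},
    {"service": "ftp",  "flag": "SF"}, {"service": "ftp",  "flag": "S0"},
]
labels = ["normal", "attack", "normal", "attack"]
tree = build(rows, labels, ["service", "flag"])
```

On this toy data the builder splits on "flag" (gain ratio 1.0) rather than the uninformative "service" (gain ratio 0), yielding a two-leaf tree.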


3.9 The Data Mining Tools


The experimental tool used was the Waikato Environment for Knowledge Analysis

(WEKA). WEKA is a popular suite of machine learning software

developed at the University of Waikato in New Zealand. It is open-source software available under

the GNU General Public License. The WEKA workbench contains a

collection of visualization tools and algorithms for data analysis and predictive

modelling, together with graphical user interfaces for easy access to this

functionality. Written in Java, the toolkit contains a large collection of state-of-the-art

machine learning and data mining

algorithms, with tools for regression, classification,

clustering, association rules, visualization, and data pre-processing. WEKA has

become very popular with academic and industrial researchers, and is also widely

used for teaching purposes. To use WEKA, the collected data need to be prepared

and converted to the CSV file format to be compatible with the WEKA data mining

toolkit.
3.10 System Configuration

Minimum Hardware Configuration

- Processor - Intel Core Duo

- Speed - 1.1 GHz

- RAM - 4 GB (min)

- Hard Disk - 20 GB

Software Configuration

- Operating System: Windows XP and higher

- Programming Tool: Weka.


- Dataset: KDD Cup 99 Dataset
CHAPTER FOUR

RESULTS AND DISCUSSION


4.1 INTRODUCTION

The goal of this simulation is to show how efficiently and effectively the two

classification algorithms can detect intrusions. This study used two

classification algorithms, Random Forest and J48. The experiments were

performed using the Weka data mining tool. The dataset used in this project is the

KDD dataset. Next, we discuss the data used to train and test the

classifiers. The data exploration and presentation processes consist of the

following steps: data sets, data mining tools and performance measurement terms.

4.1.1 The Dataset

The dataset used in this research is the KDD dataset. It is a data set suggested

to solve some of the inherent problems of the KDD Cup '99 data set, being essentially

a processed version of the KDD Cup '99 dataset. This dataset enables researchers

to train their algorithms on the full dataset (because of its smaller number of

records) instead of using only a portion of the full dataset, as is the case with the

KDD Cup '99 data set.

4.1.2 Data Mining Tools

The experiments were done using Weka 3.6.7. Weka (Waikato Environment for

Knowledge Analysis) is a popular suite of machine learning software written in


Java, developed at the University of Waikato, New Zealand. Weka supports

several standard data mining tasks, more specifically, data preprocessing,

clustering, classification, regression, visualization, and feature selection.

The experiments were carried out on a 32-bit Windows 8 Professional operating

system, with 2 GB of RAM and a Pentium (R) Dual-Core CPU at 2.20 GHz per core.

Due to the iterative nature of the experiments and resultant processing power

required, the java heap size for weka-3-6.7 was set to 1024 MB.

To assess the effectiveness of the algorithms, each one of them was trained on the

KDD data set using a ten-fold validation test mode in a Weka (Waikato

Environment for Knowledge Analysis) environment. To test and evaluate the

algorithms we use 10-fold cross validation. In this process the data set is divided

into 10 subsets. Each time, one of the 10 subsets is used as the test set and the

other nine subsets form the training set. Performance statistics are calculated

across all 10 trials. This provides a good indication of how well the classifier will

perform on unseen data.
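The 10-fold procedure just described can be sketched in plain Python. The synthetic data and the majority-class "classifier" are stand-ins so the example stays self-contained; they are not this study's dataset or classifiers:

```python
import random
from collections import Counter

# 10-fold cross validation: the data set is split into 10 subsets; each in turn
# serves as the test set while the remaining 9 form the training set.
random.seed(0)
data = [(i, "normal" if random.random() < 0.8 else "attack") for i in range(200)]
random.shuffle(data)

k = 10
folds = [data[i::k] for i in range(k)]          # 10 equal-sized subsets

accuracies = []
for i in range(k):
    test = folds[i]
    train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
    # Stand-in "classifier": always predict the training set's majority class.
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    acc = sum(lbl == majority for _, lbl in test) / len(test)
    accuracies.append(acc)

mean_accuracy = sum(accuracies) / k             # averaged over all 10 trials
```

Averaging the 10 per-fold accuracies gives the performance estimate described above; in Weka this is what the "10-fold cross-validation" test option computes internally for the chosen classifier.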


4.1.3 Performance Measurement Terms

(1). Correctly Classified Instance: The correctly and incorrectly classified

instances show the percentage of test instances that were correctly and

incorrectly classified. The percentage of correctly classified instances is

often called accuracy or sample accuracy.

(2). Kappa Statistics: Kappa is a chance-corrected measure of

agreement between the classifications and the true classes. It's calculated

by taking the agreement expected by chance away from the observed

agreement and dividing by the maximum possible agreement. A value

greater than 0 means that the classifier is doing better than chance.

(3) Mean Absolute Error, Root Mean Squared Error, Relative Absolute

Error: The error rates are used for numeric prediction rather than

classification. In numeric prediction, predictions aren't just right or wrong,

the error has a magnitude, and these measures reflect that. Detection of

attack is measured by following metrics:

(i). True positive (TP): Corresponds to the number of detected attacks and

it is in fact an attack.

(ii). False positive (FP): Or false alarm, corresponds to the number of

detected attacks that is in fact normal.


The accuracy of an intrusion detection system is measured with respect to its

detection rate and false alarm rate.
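A minimal sketch of how these measurement terms are computed from a 2×2 confusion matrix; the counts below are invented for illustration:

```python
# Confusion-matrix counts (hypothetical): attacks detected / missed,
# normal records flagged / passed.
TP, FN = 90, 10
FP, TN = 5, 895
n = TP + FN + FP + TN

accuracy       = (TP + TN) / n        # correctly classified instances
detection_rate = TP / (TP + FN)       # true positive rate
false_alarm    = FP / (FP + TN)       # false positive rate

# Kappa: observed agreement corrected for agreement expected by chance.
p_obs = accuracy
p_chance = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / (n * n)
kappa = (p_obs - p_chance) / (1 - p_chance)
```

On these counts a trivial classifier labelling everything "normal" would still score 90% accuracy but a kappa of 0, which is exactly why chance-corrected measures matter under skewed class distributions.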

In misuse detection related problems, standard data mining techniques are

not applicable due to several specific details that include dealing with skewed

class distribution, learning from data streams and labeling network connections.

The problem of skewed class distribution in network intrusion detection is

very apparent, since intrusion, as the class of interest, is much rarer than

the class representing normal network behavior. In such scenarios, where the

normal behavior may typically represent 98-99% of the entire population a trivial

classifier that labels everything with the majority class can achieve 98-99%

accuracy.

4.1.4 10 - Fold Cross Validation

The 10-fold cross validation is used in the field of machine learning to determine

how accurately a learning algorithm will be able to predict data that it was not

trained on.
Figure 4.1. KDD dataset loaded into the model
Figure 4.2. J48 Classifier error
Figure 4.3. J48 classifier tree generated
Figure 4.4 J48 Margin curve
Figure 4.5 J48 Threshold curve
Figure 4.6 J48 cost curve
Figure 4.7 J48 Cost benefit analysis curve
Figure 4.8. J48 classifier performance

Table 4.1. Proposed classifiers performance

Classifier     Correctly   Incorrectly  Kappa      Mean      Root mean  Relative  Root relative
               classified  classified   statistic  absolute  squared    absolute  squared
               instances   instances               error     error      error     error
Random Forest  99          22           99         67        45         1.3       9
J48            99          1            99         78        70         1.5       14
Figure 4.9. Random forest classifier error
Figure 4.10. Random forest margin curve
Figure 4.11. Random forest classifier

Table 4.2. Performance evaluation of the study

Metrics        TP rate  FP rate  Precision  Recall  F-measure  ROC area
Random forest  99       4        99         99      99         100
J48            99       6        99         99      99         99

As can be seen from Table 4.2, the RF and J48 have the same TP rate, precision,

recall and F-measure. The RF has the highest ROC. The J48 has the highest FP

rate. The two classifiers traded off ROC and FP rate.



Figure 4.12. Performance of the proposed model

Figure 4.12 reveals that RF outperformed J48 in terms of ROC area, while

the two have the same TP rate, precision, recall and F-measure. RF also performed

better in terms of FP rate. The two proposed algorithms

achieved a detection accuracy of 99%, with Random Forest having a false alarm rate

of 4% and J48 a false alarm rate of 6%. This shows that RF performed

better than the J48 algorithm.


CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
5.1 SUMMARY

Finally, the conclusions of the study and the summary of the results obtained from

all the experiments carried out using the RF and J48 algorithms for anomaly

detection have been presented in this section.

In today’s day and age, completely preventing security breaches with

currently available technology is quite unrealistic. Hence, intrusion detection is a

very important feature in the network security. Furthermore, the misuse detection

methods are unable to detect the unknown attacks; hence, anomaly detection

needs to be used for identifying such attacks. The data mining technique is

applied in the anomaly-based detection techniques for improving the intrusion

detection accuracy rate.

This project developed and applied the J48 and RF classification

algorithms for intrusion and anomaly detection. It was seen that the proposed

approach performed well, with RF performing better than J48.

This new method was very effective for detection of many attacks and showed

higher detection accuracy, as compared to the algorithms reported earlier.

5.2 CONCLUSION

The current research focused on the design and implementation of an intrusion detection system using data mining and machine learning algorithms. In this project, two decision tree algorithms, RF and J48, were chosen to find out which of the two is more accurate and efficient, and an intrusion detection system was implemented using these two well-known machine learning algorithms. The results obtained from the experiments revealed that RF performs better than J48 for the design and implementation of an intrusion detection system: both algorithms achieved a detection rate of 99%, but RF produced a false alarm rate of 4% against 6% for J48.
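This gap is consistent with how the two classifiers differ: J48 builds a single decision tree, whereas a random forest aggregates many trees by majority vote, which tends to cancel out an individual tree's mistakes. A toy sketch of that voting step, in which the stub "trees" and the connection features are hypothetical stand-ins for trained decision trees:

```python
def forest_predict(trees, record):
    """Majority vote over an ensemble of tree classifiers (1 = attack, 0 = normal)."""
    votes = [tree(record) for tree in trees]
    return 1 if sum(votes) > len(votes) / 2 else 0

# Hypothetical stub trees keyed on toy connection features.
tree_a = lambda r: 1 if r["failed_logins"] > 3 else 0
tree_b = lambda r: 1 if r["bytes_sent"] > 50_000 else 0
tree_c = lambda r: 1 if r["failed_logins"] > 3 or r["duration"] < 1 else 0

suspicious = {"failed_logins": 5, "bytes_sent": 200, "duration": 0.2}
print(forest_predict([tree_a, tree_b, tree_c], suspicious))  # two of three trees vote attack
```

Even though tree_b misses this record, the ensemble still flags it, which is the intuition behind RF's higher ROC area here.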

5.3 RECOMMENDATION

For future work, the intrusion detection accuracy rate and the performance of the proposed technique should be further improved, and the technique should be implemented in real network environments. Future work can also further explore the features of the J48 algorithm and improve the split value and the construction of the decision tree by applying the correlation feature selection technique.

Optimization using other swarm intelligence algorithms, algorithms that combine kernel methods with other classification methods for pattern analysis, and optimization techniques for SVM parameter tuning may also be developed.
Abstract

Nowadays it is very important to maintain a high level of security to ensure safe and trusted communication of information between organizations. Secured data communication over the internet and other networks, however, is always under threat of intrusion and misuse. A wide range of security technologies such as information encryption, access control and intrusion prevention are used to protect network-based systems, but many intrusions still go undetected. Over the past years, a growing number of research projects have applied data mining to intrusion detection with different classification algorithms. Many of these approaches achieve a high detection rate and accuracy, but the majority suffer from a high false alarm rate caused by normal connections being falsely classified as attacks, often a result of large datasets containing noisy or meaningless attributes. The IDS research area therefore needs to focus not only on detection rate and accuracy but also on reducing noise in the dataset while retaining its value, so that intrusions can be properly identified with a low false alarm rate. This project presents classification algorithms based on Random Forest (RF) and J48. The first stage of the process is data preprocessing based on feature selection with Principal Component Analysis (PCA), followed by network traffic classification with RF and J48. The KDDCup 99 dataset is used for the experiment, which was performed in WEKA. The results showed that RF gave a detection accuracy of 99% with a 4% false alarm rate, while J48 gave a detection accuracy of 99% with a 6% false alarm rate. The experimental results thus show that the RF algorithm has the higher detection rate with the lower false alarm rate.
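As a rough illustration of the PCA preprocessing stage described above, the first principal component of two-feature data can be computed in closed form from the 2x2 covariance matrix. The feature values below are synthetic, and a real pipeline would use a library PCA over all KDDCup 99 attributes:

```python
import math

def principal_axis(xs, ys):
    """First principal component of 2-D data via the closed-form 2x2 eigenproblem."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n                      # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                      # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n    # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]].
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector, normalized to unit length.
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Synthetic features that vary together: the principal axis is near the y = x diagonal.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
print(principal_axis(xs, ys))
```

Projecting records onto such top components is what lets PCA discard noisy or meaningless attributes before classification.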


REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining Associations between Sets of Items in Massive Databases. In Proceedings of the ACM SIGMOD 1993 International Conference on Management of Data, pages 207-216.

Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, pages 487-499.

Ajayi, A., Idowu, S.A., and Anyaehie, A. (2013). Comparative study of selected data mining algorithms used for intrusion detection. International Journal of Soft Computing and Engineering (IJSCE).

Amoroso, E.G. (1999). Intrusion detection: an introduction to internet surveillance, correlation, traceback, traps, and response. Intrusion.Net Books, NJ.

Axelsson, S. (2000). “The base-rate fallacy and the difficulty of intrusion

detection”, ACM Trans. Information and System Security 3 (3), pp. (186-

205).

Barbarà, D., Couto, J., Jajodia, S., Popyack, L., And Wu,N., ADAM: Testbed for

Exploring the Use of Data Mining in Intrusion Detection, ACM SIGMOD

Record, 30(4), 2001,pp. 15-24.


Barbara, D., Wu, N. and Jajodia, S. [2001]. “Detecting Novel Network Intrusions

Using Bayes Estimators”, Proceedings Of the First SIAM Int. Conference

on Data Mining, (SDM 2001), Chicago, IL.

Berry, M. J. A. and Linoff, G. (1997). Data Mining Techniques. John Wiley and Sons, Inc.

Mukherjee, B., Heberlein, L.T., and Levitt, K. (1994). "Network Intrusion Detection", IEEE Network, June 1994.

Carbone, P. L. (1997). “Data mining or knowledge discovery in databases: An

overview”, In Data Management Handbook, New York: Auerbach

Publications.

Chen, W.H., Hsu, S.H., and Shen, H.P. (2005). Application of SVM and ANN for

intrusion detection. Comput. Oper. Res., 32: 2617-2634.

Chittur, A., ”Model generation for an intrusion detection system using genetic

algorithms”, High School Honors Thesis, Ossining High School. In

cooperation with Columbia Univ, 2001

Cohen, W. W. (1995). Fast Effective Rule Induction. In Proceedings of the 12th International Conference on Machine Learning, pages 115-123.

Elmasri, R. and Navathe, S. B. (1994). Fundamentals of Database Systems. Addison-Wesley.
Crosbie, M. and E. H. Spafford, ”Active defense of a computer system using

autonomous agents”, Technical Report CSD-TR- 95-008, Purdue Univ.,

West Lafayette, IN, 15 February 1995.

D’silva, M., Deepali, V., (2013) “Comparative Study of Data Mining Techniques

to Enhance Intrusion Detection”. IJERA, Vol. 3, Issue 1, January –

February 2013.

Dasgupta, D. and F. A. Gonzalez, "An intelligent decision support system for intrusion detection and response", In Proc. of International Workshop on Mathematical Methods, Models and Architectures for Computer Networks Security (MMM-ACNS), St. Petersburg, Springer, 21-23 May, 2001.

Dickerson, J. E. and J. A. Dickerson, "Fuzzy network profiling for intrusion detection", In Proc. of NAFIPS 19th International Conference of the North American Fuzzy Information Processing Society, Atlanta, pp. 301-306. North American Fuzzy Information Processing Society (NAFIPS), July 2000.

Didaci, L., Giacinto, A. & Roli, F. (2002). “Ensemble learning for intrusion

detection in computer networks”, Proceedings of AI*IA, Workshop on

“Apprendimento automatico: metodi e applicazioni”, Siena, Italy.


Eitel and Giri, (2008) “A Comparative Study Of Data Mining Algorithms For

Network Intrusion Detection In The Presence Of Poor Quality Data”.

ICIQ-03, 2008.

Bloedorn, E., et al. (2001). "Data Mining for Network Intrusion Detection: How to Get Started". Technical paper.

Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S. J., A Geometric

Framework for Unsupervised Anomaly Detection: Detecting Intrusions in

Unlabeled Data, In D.Barbarà and S. Jajodia (eds.), Applications of Data

Mining in Computer Security, Kluwer Academic Publishers, Boston, MA,

2002, pp. 78-99.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors (1996b). Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press.

Fayyad, U. (1998). Mining Databases: Towards Algorithms for Knowledge

Discovery. Bulletin of the IEEE Computer Society Technical Committee

on Data Engineering, 22(1):39-48.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996a). From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3):37-54.


Fayyad, U. M., G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM 39 (11), November 1996, 27-34.

G. J. Klir, "Fuzzy arithmetic with requisite constraints", Fuzzy Sets and Systems, 91:165-175, 1997.

Ghosh, A. K., A. Schwartzbard, and M. Schatz,” Learning program behavior

profiles for intrusion detection”, In Proc. 1st USENIX, 9- 12 April, 1999.

Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques, Morgan

Kaufmann Publisher.

Heady, R., Luger, G., Maccabe, A., and Servilla, M. (1990). The architecture of a network level intrusion detection system. Technical Report, Department of Computer Science, University of New Mexico, August 1990.

Heba Ezzat Ibrahim, Sherif M. Badr, and Mohamed A. Shaheen, (2012)

“Adaptive Layered Approach using Machine Learning Techniques with

Gain Ratio for Intrusion Detection Systems”. IJCA, Volume 56 – No.7,

October 2012.

Jaiganesh, V., Sumathi, P., and Vinitha, A.,(2013) “Classification Algorithms in

Intrusion Detection System: A Survey”. IJCTA, Vol 4(5), September –

October, 2013.
Jaiganesh, V., Mangayarkarasi, S., and Sumathi, P., (2013) “Intrusion Detection

Systems: A Survey and Analysis of Classification Techniques”.

IJARCCE, Vol. 2, Issue 4, April 2013.

James Cannady (1998). Artificial Neural Networks for Misuse Detection.

National Information Systems Security Conference.

James Cannady, Jay Harrell (1996). A comparative Analysis of current Intrusion

Detection Technologies.

Jimmy Shum and Heidar A. Malki,“Network Intrusion Detection System Using

Neural Networks” Fourth International Conference on Natural

Computation in IEEE 2008.

KDD Cup 1999. Available on:

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, December

2009.

Kesavulu, E., Reddy, V. N. and Rajulu, P. G. (2011). “A Study of Intrusion

Detection in Data Mining”. Proceedings of the World Congress on

Engineering 2011 Vol IIIWCE 2011, July 6 - 8, 2011,London, U.K.

Kumar, S., ”Classification and Detection of Computer Intrusion”, PhD. thesis,

1995, Purdue Univ., West Lafayette, IN.

Krzysztof, G. and Norbert, J. (2012). "Feature selection with Decision Tree Criterion".
Landwehr, C.E., Bull, A.R., McDermott, J.P., and Choi, W.S.(1994). A taxonomy

of computer program security flaws. ACM Comput. Surv.,

vol.26,no.3,pp.211–254,1994.

Lane, T. D. (2000). “Machine Learning Techniques for the computer security

domain of anomaly detection”, Ph.D. Thesis, Purdue Univ., West

Lafayette, IN.

Lee, W (1999). A Data Mining Framework for Constructing Features and Models

for Intrusion Detection Systems. PhD Thesis, Computer Science

Department, Columbia University.

Lee, W., Stolfo, S.J. & Mok, K.W. (1999). “Mining in a data-flow environment:

Experience in network intrusion detection,” (Chaudhuri, S. & Madigan, D.

Eds.). Proc. of the Fifth International Conference on Knowledge

Discovery and Data Mining (KDD-99) (pp. 114-124), San Diego, CA:

ACM,

Lee, W. and S. J. Stolfo, ”Data mining approaches for intrusion detection”, In

Proc. of the 7th USENIX Security Symp., San Antonio, TX.USENIX,

1998.

Lee, W. , S.J.Stolfo et al, ”A data mining and CIDF based approach for detecting

novel and distributed intrusions”, Proc. of Third International Workshop


on Recent advances in Intrusion Detection (RAID 2000), Toulouse,

France.

Lee, W., S. J. Stolfo, and K. W. Mok, "Mining in a data-flow environment: Experience in network intrusion detection," In S. Chaudhuri and D. Madigan (Eds.), Proc. of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, CA, pp. 114-124. ACM, 12-15 August 1999.

Lee, W., S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: A data mining approach," Artificial Intelligence Review 14 (6), 533-567, 2000.

Lunt, T.F. (1989). Real -Time Intrusion Detection. Proceedings from IEEE

COMPCON.

Mannila, H. (1996). Data Mining: Machine Learning, Statistics, and Databases. In Proceedings of the 8th International Conference on Scientific and Statistical Database Management, pages 1-8.

Mannila, H., Smyth, P., and Hand, D. J. (2001). Principles of Data Mining. MIT Press.

Mannila, H., Toivonen, H., and Verkamo, A. I. (1997). Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery, 1:259-289.

Markou, M. and Singh, S., Novelty Detection: A review, Part 1: Statistical

Approaches, Signal Processing, 8(12), 2003, pp. 2481-2497.


Miller, R. and Yang, T. (1997). Association Rules over Interval Data. In Proceedings of the 1997 ACM-SIGMOD Conference on Management of Data, pages 452-461.

Mitchell Rowton (2005). Introduction to Network Security Intrusion Detection.

Mounji, A. (1997). Languages and Tools for Rule-Based Distributed Intrusion

Detection. PhD thesis, Faculties Universitaires Notre-Dame dela Paix

Namur (Belgium).

Mukkamala, S., Janoski, G., and Sung, A. (2002). Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of IEEE International Joint Conference on Neural Networks, pp. 1702-1707, 2002.

Mukkamala, S., Sung, A.H., Abraham, A., (2003) Intrusion detection using

ensemble of soft computing paradigms, third international conference on

intelligent systems design and applications, intelligent systems design and

applications, advances in soft computing. Germany: Springer;2003.p.239–

48.

Mukkamala, S., Sung, A.H., Abraham, A. (2004a). Modeling intrusion detection

systems using linear genetic Programming approach, The 17th

international conference on industrial & engineering applications of

artificial intelligence and expert systems, innovations in applied artificial

intelligence. In:Robert, O., Chunsheng, Y., Moonis, A., editors. Lecture


Notes in Computer Science, vol.3029. Germany:Springer; 2004a. p.633–

42.

Mukkamala, S., Sung, A.H., Abraham, A., Ramos, V.(2004b) Intrusion detection

systems using adaptive regression splines. In: SerucaI, Filipe, J.,

Hammoudi, S., Cordeiro, J., editors. Proceedings of the 6th international

conference on enterprise information systems, ICEIS’04, vol.3, Portugal.

2004b. p.26–33[ISBN:972-8865-00-7].

Nadiammai, G.V., Krishnaveni, S., and Hemalatha, M. (2011). "A comprehensive analysis and study in intrusion detection system using data mining techniques". IJCA, Volume 35, No. 8, December 2011.

Narayana, M.S., Prasad, B.V.V.S., Srividhya, A., and Pandu Ranga, R.K. (2011). International Journal of Computer Science and Telecommunications, Vol. 2, Issue 6, September 2011, pp. 8-14. ISSN 2047-3338.

Neri, F., "Comparing local search with respect to genetic evolution to detect intrusion in computer networks", In Proc. of the 2000 Congress on Evolutionary Computation CEC00, La Jolla, CA, pp. 238-243. IEEE Press, 16-19 July, 2000.

Neri, F., "Mining TCP/IP traffic for network intrusion detection", In R. L. de Mantaras and E. Plaza (Eds.), Proc. of Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Volume 1810 of Lecture Notes in Computer Science, Barcelona, Spain, pp. 313-322. Springer, May 31 - June 2, 2000.

Noel, S., Wijesekera, D., and Youman, C., Modern Intrusion Detection, Data

Mining, and Degrees of Attack Guilt, In D. Barbarà and S. Jajodia (eds.),

Applications of Data Mining in Computer Security, Kluwer Academic

Publishers, Boston, MA, 2002, pp. 2-25.

Patel, H., Sarkhedi, B., and Vaghamshi, H., (2013) “Intrusion Detection in Data

Mining with Classification Algorithm”. IJAREEIE, Vol. 2, Issue7, July

2013.

Patel, R.,Thakkar, A., Ganatra, A., (2012) “ A Survey and Comparative Analysis

of Data Mining Techniques for Network Intrusion Detection

Systems”.IJSCE, Volume-2, Issue-1, March 2012.

Garcia-Teodoro, P. and Diaz-Verdejo, J. (2009). "Anomaly-based network intrusion detection: Techniques, systems and challenges". Computers & Security, Elsevier, 2009.

Phurivit Sangkatsanee, Naruemon Wattanapongsakorn and Chalermpol

Charnsripinyo (2012). Real-time Intrusion Detection and Classification.

SANS, "FAQ: Data Mining in Intrusion Detection", http://www.sans.org/security-resources/idfaq/data_mining.php

Shyu, M., Chen, S., Sarinnapakorn, K. and Chang, L. (2003). A novel Anomaly

detection scheme based on principal component classifier, Proceedings of


the IEEE Foundations and New Directions of Data Mining Workshop, in

conjunction with the Third IEEE International Conference on DataMining

(ICDM03), pp.172–179, 2003.

Stefan, A., (2000). “Intrusion Detection Systems: A Survey and Taxonomy”.

Chalmers University of Technology, Sweden.

Summers, R.C. Secure computing: threats and safeguards. NewYork: McGraw-

Hill; 1997.

Sundaram, A. (1996). An introduction to intrusion detection. ACM Cross Roads

1996;2(4).

Tavallaee,M. , Bagheri, E. , Lu,W. & Ghorbani, A. (2009). “A Detailed

Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE

Symposium on Computational Intelligence for Security and Defense

Applications (CISDA), 2009.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, 1995.

Yimin Wu, (2004). High-dimensional Pattern Analysis in Multimedia

Information Retrieval and Bioinformatics, Doctoral Thesis, State

University of New York, January 2004.

Zhao, J., Chen, M., and Lou, Q. (2011). Research of intrusion detection system based on neural networks. Proceedings of IEEE 3rd International Conference on Communication Software and Networks, May 27-29, 2011, Xi'an, China, pp. 174-178.
