
CHAPTER ONE

INTRODUCTION

1.1 BACKGROUND TO THE STUDY

Intrusion detection systems (IDS) are security tools used to improve the security of communication and information systems; they focus primarily on detecting malicious network traffic (Marib, 2018). An IDS complements other security mechanisms such as firewalls, antivirus software, and access control schemes. By detection method, IDSs are classified into signature-based and anomaly-based systems. A signature-based system identifies traffic patterns or application data as malicious by matching them against known attack signatures, and therefore requires a regularly updated database storing all new attack signatures, whereas an anomaly-based system compares all activity against a defined model of normal behaviour (Agrawal, 2015).
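The two detection styles can be illustrated with a minimal sketch; the signature names, traffic labels, and the normal-profile numbers below are hypothetical examples, not a production detector:

```python
# Minimal illustration of the two detection styles described above.
# Signature names and the normal profile are hypothetical.

# Signature-based: flag traffic that matches a known attack signature.
KNOWN_SIGNATURES = {"smurf", "neptune", "teardrop"}  # must be kept up to date

def signature_detect(traffic_type):
    return traffic_type in KNOWN_SIGNATURES

# Anomaly-based: flag activity that deviates too far from a normal profile.
# Here the "profile" is just the mean and spread of bytes per connection.
def anomaly_detect(bytes_sent, normal_mean=500.0, normal_std=100.0, k=3.0):
    return abs(bytes_sent - normal_mean) > k * normal_std

print(signature_detect("smurf"))      # known attack -> True
print(signature_detect("slowloris"))  # not in database -> False (signature gap)
print(anomaly_detect(5000))           # far from normal profile -> True
```

Note how the signature detector misses any attack absent from its database, while the anomaly detector catches it only if it deviates from the profile.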

The main objective of an IDS is to detect attacks on the network and raise an alarm. A good IDS detects new or malicious attacks within a short time and carries out the necessary actions. Currently used IDSs do not achieve 100% accuracy; hence this study has been carried out to improve and increase IDS accuracy (Sheta and Alamleh, 2017). Many machine learning techniques have been used to help detect network attacks, improve the detection rate, and develop effective classification and clustering models that distinguish normal packets from abnormal ones. Accurately detecting intrusions within the complete network traffic is therefore treated as a classification problem.

IDSs are also classified according to the detection methods used to identify malicious attacks (Onik and Haq, 2016).

Due to the unprecedented growth of computer networks, the increasing number of devices running on them, and the rising number of cyber-attacks, network security has become a fundamental issue in modern computing. A main task of the technology expert is therefore to secure data in terms of confidentiality, integrity, and availability (Hemant, Sarkhedi & Vaghamshi, 2013).

As computer networks have become an important part of society, their security is of great importance; a failure here can have dangerous effects such as information theft. Information obtained from a private network, for example government information, can be used against the progress of society, leading to pandemonium and panic at large.

Traditional protection techniques such as user authentication, data encryption, avoidance of programming errors, and firewalls are used as the first line of defence for computer security. However, a weak password is easily compromised, user authentication cannot stop every unauthorized user, and firewalls are vulnerable to configuration errors and to ambiguous or undefined security policies (Summers, 1997). They are generally unable to fully protect against malicious mobile code, insider attacks, and other forms of intrusion. Programming errors cannot be avoided entirely, as the complexity of systems and application software evolves rapidly and leaves behind exploitable weaknesses. Consequently, computer systems are likely to remain insecure for the foreseeable future.

Therefore, intrusion detection is required as an additional wall of protection alongside prevention techniques. Intrusion detection is useful not only in detecting successful intrusions but also in monitoring attempts to break security, which provides important information for timely countermeasures (Sundaram, 1996).

Intrusion detection and prevention is a developing field, receiving considerable attention today due to the prevalent activities of hackers. Moreover, merely deploying an IDS to secure a network matters less than how effectively and efficiently that IDS performs its duties. Anomaly detection recognizes any variation from the defined patterns for users; it is based on building profiles of monitored activity, and it refers to the problem of finding patterns in data that do not conform to expected behaviour. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains. Of these, "anomalies" and "outliers" are the two terms used most commonly in the context of anomaly detection, sometimes interchangeably. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance, or healthcare, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance of enemy activities

(Varun, Arindam, & Vipin, 2009). Misuse detection uses a rule-based approach to detect known attacks by matching attack patterns against a list of signatures, much like antivirus applications. The signatures must be updated regularly, because this type of IDS cannot detect an attack whose signature is not in its library. In contrast, anomaly-based detection involves monitoring users' activities to catch any deviation from a normal behaviour profile. Although it can detect unknown attacks, its probability of raising false alarms is considerable (Muda, Yassin, Sulaiman & Udzir, 2011). Intrusion Detection Systems (IDS) have become a standard component in security infrastructures, as they allow network administrators to detect policy violations, which range from external attackers trying to gain unauthorized access to insiders abusing their access.

1.2 STATEMENT OF PROBLEM

Machine learning algorithms such as Naïve Bayes (NB), Neural Networks (NN), Support Vector Machines (SVM), K-Nearest Neighbours (KNN), fuzzy logic models, and genetic algorithms (Chiba, 2017) have been widely used over the last decades to find intrusions in computer networks. However, current intrusion detection systems suffer from various problems, such as low accuracy, imbalanced detection rates across attack types, high false alarm rates, and redundancy among the input attributes in the training data. Rida and Omri (2016) also suggested that removing noise and redundancy from the dataset helps in building intrusion detection models. Features may contain false correlations, which hinder the process of detecting intrusions; further, some features may be redundant, since the information they add is already contained in other features. Extra features increase computation time and can reduce the accuracy of an IDS. Another shortcoming is that intrusion attacks or anomalies in network infrastructures mostly lead to great financial losses and massive leaks of sensitive data, thereby decreasing the efficiency and productivity of an organization (Zhouhair, Noreddine & Khalid, 2018).

These gaps, chiefly in detection accuracy and false alarm rate, are what this project seeks to address. Improving detection accuracy has thus become a major problem in intrusion detection systems.

1.3 AIM AND OBJECTIVES

The aim of this project is to develop an intrusion detection system based on feature selection and a hybridized decision tree (decision tree classification algorithm) to improve the detection rate and accuracy and to reduce the false alarm rate.

The following are the specific objectives:

i. to acquire the network intrusion detection dataset.

ii. to implement an intrusion detection system based on Random

forest and J48 algorithms.


iii. to evaluate the performance of the classifier in terms of accuracy,

precision, sensitivity and specificity.
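Objective (iii) can be made concrete. Given a binary confusion matrix (attack vs. normal), the four measures follow directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); the counts used below are hypothetical placeholders, not results of this study:

```python
# Evaluation metrics for a binary IDS classifier, computed from the
# confusion-matrix counts: TP (attacks caught), TN (normal passed),
# FP (false alarms), FN (missed attacks). Counts below are hypothetical.
def evaluate(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)   # of raised alarms, how many were real attacks
    sensitivity = tp / (tp + fn)   # recall, i.e. the detection rate
    specificity = tn / (tn + fp)   # normal traffic correctly passed
    return accuracy, precision, sensitivity, specificity

acc, prec, sens, spec = evaluate(tp=90, tn=95, fp=5, fn=10)
print(f"accuracy={acc:.3f} precision={prec:.3f} "
      f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Note that sensitivity corresponds to the detection rate and (1 − specificity) to the false alarm rate discussed in the problem statement.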

1.4 SIGNIFICANCE OF THE STUDY

Intrusion Detection Systems (IDS) play an important role in an organization’s

security framework. Security tools such as anti-virus software, firewalls, packet

sniffers and access control lists aid in preventing attackers from gaining easy

access to an organization’s systems but they are in no way foolproof. These

factors make the need for a proper security framework even more

paramount. A tool is therefore needed to alert system administrators of the

possibility of rogue activities occurring on their networks. Intrusion

Detection Systems can play such a role. Therefore a need exists for development

of effective and efficient algorithms and intrusion detection systems for this

purpose.

1.5 DEFINITION OF TERMS

Data mining: The term data mining is frequently used to designate the process of

extracting useful information from large databases.

Machine learning: It is a collection of algorithms for data analysis and predictive

modeling.

Classification: Classification assigns each data item to one of a set of predetermined classes; it predicts the target class for each data item.


Intrusion: An intrusion can be defined as "any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource".

Intrusion Detection System (IDS): An intrusion detection system is a software program that helps to identify malicious programs or activity entering a system or network.

KDD (Knowledge Discovery in Databases): denotes the process of extracting useful knowledge from large data sets. Data mining, by contrast, refers to one particular step in this process: specifically, the data mining step applies so-called data mining techniques to extract patterns from the data.

1.6 Project Layout

This project is organized as follows. Chapter two consists of a review of the literature and of important concepts relevant to the subject matter of this project. Chapter three presents the methodology. Chapter four presents the system design, implementation, and the necessary system evaluation. Chapter five contains the conclusion, summary, and limitations of the study, together with future work.


CHAPTER TWO

LITERATURE REVIEW

2.0 Related work

Belhadj-Aissa and Guerroumi (2016) presented a new Artificial Immune Systems (AIS) based approach for network anomaly detection built on the Negative Selection process (NADNS). From the biological point of view, Negative Selection (NS) is the principle of distinguishing self-cells from non-self-cells, which is highly coherent with the (normal/anomaly) classification problem in intrusion detection.

N. Lokeswari and B. Chakradhar Rao (2016) proposed an anomaly intrusion detection model based on an artificial neural network classifier to detect intruders trying to enter a system. They chose the back-propagation algorithm (BPN) as the learning algorithm for their multilayer perceptron (MLP) network, with weight updating based on a particle swarm optimization weight extraction algorithm (PSO WENN).

Erza Aminanto et al. (2017) proposed an anomaly detection system that detects network intrusions based on the Ant Clustering Algorithm (ACA) and a Fuzzy Inference System (FIS).

Tao Ma et al. (2018) proposed a novel approach called KDSVM, which combines K-means clustering with the feature-learning advantage of a deep neural network (DNN) model and the strong classification of support vector machines (SVM) to detect network intrusions.

Panda et al. (2018) stated that integrating a hybrid intelligent scheme, in which different classifiers are implemented, would improve detection and make it more genuine, thereby improving result quality. In their paper, the researchers applied a 2-class classification strategy based on the 10-fold cross-validation process, which increased the rate of intrusion detection and decreased the rate of false alarms. Aburomman and Reaz (2016) carried out a study describing the different algorithms used for classifying intrusions based on popular machine learning methods. They studied homogeneous and heterogeneous systems along with various hybrid techniques, and stated that implementing ensemble-based techniques helped in solving pattern classification problems.
Security measures have failed in many cases to stop the wide variety of possible

attacks. The goal of intrusion detection is to build a system that would

automatically scan network activity and detect such intrusion attacks, providing

the necessary information to the system administrator to allow for corrective

action. A strong case can be made for the use of data mining techniques to

improve the current state of intrusion detection.

2.1 INTRUSION AND INTRUSION DETECTION

An intrusion can be defined as "any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource" (Wenke & Salvatore, 1998). An intrusion is a type of malicious activity that tries to defeat the security properties of a computer system. Intrusion detection is the process of monitoring events, gathering intrusion-related information, and inspecting the events for signs of malicious acts (Maharaj & Khanna, 2014). Maharaj and Khanna (2014) state that 'the primary goal of intrusion detection is to model usual application behaviour, so that we can recognize attacks by their peculiar effects without raising too many false alarms'. Intrusion detection is an area growing in significance as more and more sensitive data are stored and processed in networked systems (D'silva & Vora, 2013).

The goal of intrusion detection is to detect security violations in

information systems. Intrusion detection is a passive approach to security as it

monitors information systems and raises alarms when security violations are

detected (Reddy, Reddy & Rajulu, 2011).

2.2 INTRUSION DETECTION SYSTEM

Intrusion detection is a technology designed to observe computer activities for the purpose of finding security violations; the security of a computer system is compromised when an intrusion takes place. Intrusion detection is the process of identifying and responding to malicious activity targeted at computing and networking resources (Amoroso, 1999). Intrusion prevention techniques, such as user authentication and information protection, have been used to protect computer systems as a first line of defence. Prevention alone is not sufficient because, as systems become ever more complex, there are always exploitable weaknesses due to design and programming errors. Nowadays, intrusion detection is one of the highest-priority tasks for network administrators and security professionals, as network-based computer systems play increasingly vital roles in modern society and have become targets of attack. Intrusion detection systems support the following three essential security properties:

1. Data confidentiality: Information that is being transferred through

the network should be accessible only to those that have been

properly authorized.

2. Data integrity: Information should maintain its integrity from the moment it is transmitted to the moment it is actually received. No corruption or data loss is acceptable, whether from random events or malicious activity.

3. Data availability: The network or system resource should be accessible and usable upon demand by an authorized system user.

Any intrusion detection system has some inherent requirements. Its prime purpose is to detect as many attacks as possible with a minimum number of false alarms; i.e., the system must be accurate in detecting attacks. However, an accurate system that cannot handle a large amount of network traffic and is slow in decision making will not fulfil the purpose of an intrusion detection system (IDS). Data mining techniques such as data reduction, data classification, and feature selection therefore play an important role in IDS.

2.3 TAXONOMY OF IDS

An IDS uses several techniques to determine what qualifies as an intrusion versus normal traffic. One useful method of classifying intrusion detection systems is according to data source; each category has a distinct approach to monitoring and securing data and systems. There are two general categories under this classification:

1. Host-based IDSs (HIDS) – examine data held on individual computers that serve as hosts. The architecture of a host-based system is agent-based, meaning that a software agent resides on each of the hosts governed by the system (Nadiammai, Krishaveni, & Hemalatha, 2011).

2. Network-based IDSs (NIDS) – examine data exchanged between computers. By contrast, the most efficient host-based intrusion detection systems are capable of monitoring and collecting system audit trails in real time as well as on a scheduled basis, thus distributing both CPU utilization and network overhead and providing a flexible means of security administration (Nadiammai, Krishaveni, & Hemalatha, 2011).
2.4 INTRUSION DETECTION APPROACHES

The signatures of some attacks are known, whereas other attacks only

reflect some deviation from normal patterns. Consequently, two main approaches

have been devised to detect intruders.

2.4.1 Anomaly Detection: Anomaly detection assumes that intrusions will always reflect some deviation from normal patterns. It may be divided into static and dynamic anomaly detection. A static anomaly detector is based on the assumption that there is a portion of the system being monitored that does not change: the system's code and the constant portion of data upon which its correct functioning depends. For example, the operating system software and the data used to bootstrap a computer never change. If the static portion of the system ever deviates from its original form, either an error has occurred or an intruder has altered the static portion of the system. Dynamic anomaly detection typically operates on audit records or on monitored network traffic data. Operating system audit records do not record all events, only events of interest; therefore only behaviour that results in a recorded event will be observed, and these events may occur in a sequence.
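The static case described above can be sketched as a simple hash-based integrity check: a fingerprint of the unchanging portion is recorded once and compared against later snapshots. The file content used here is a hypothetical stand-in for real system binaries:

```python
# Sketch of a static anomaly detector: the "static portion" of the system
# (e.g., OS binaries, boot data) should never change, so a stored hash of
# it can be compared against a freshly computed one. Content is hypothetical.
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Baseline taken when the system is known to be clean.
baseline = fingerprint(b"kernel-image-v1")

# Later check: any deviation means an error or an intruder's modification.
def static_portion_changed(current: bytes) -> bool:
    return fingerprint(current) != baseline

print(static_portion_changed(b"kernel-image-v1"))   # unchanged -> False
print(static_portion_changed(b"kernel-image-v1x"))  # altered   -> True
```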

2.4.2 Misuse Detection: It is based on the knowledge of system vulnerabilities

and known attack patterns. Misuse detection is concerned with finding intruders

who are attempting to break into a system by exploiting some known


vulnerability. Ideally, a system security administrator should be aware of all the

known vulnerabilities and eliminate them. The term intrusion scenario is used as a

description of a known kind of intrusion; it is a sequence of events that would

result in an intrusion without some outside preventive intervention. An intrusion

detection system continually compares recent activity to known intrusion

scenarios to ensure that one or more attackers are not attempting to exploit known

vulnerabilities. To perform this, each intrusion scenario must be described or

modeled.
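One way to model an intrusion scenario, following the description above, is as an ordered sequence of events checked against recent activity; a match means the known vulnerability is being exploited. The scenario and event names below are invented for illustration:

```python
# Sketch of misuse detection as intrusion-scenario matching: a scenario is
# a sequence of events that, without preventive intervention, would result
# in an intrusion. Event names here are hypothetical.
def scenario_matches(scenario, events):
    """True if `scenario` occurs as an ordered subsequence of `events`."""
    it = iter(events)
    return all(step in it for step in scenario)  # `in` consumes the iterator

SCENARIO_PASSWORD_GRAB = ["login_fail", "login_fail", "copy_passwd_file"]

activity = ["login_fail", "read_mail", "login_fail", "copy_passwd_file"]
print(scenario_matches(SCENARIO_PASSWORD_GRAB, activity))  # True
```

A real system would compare recent activity against every modelled scenario continually, as the text describes.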

2.4.3 Advantages and Disadvantages of Anomaly Detection and Misuse

Detection

The main disadvantage of misuse detection approaches is that they will

detect only the attacks for which they are trained to detect. Novel attacks or

unknown attacks or even variants of common attacks often go undetected. The

main advantage of anomaly detection approaches is the ability to detect novel

attacks or unknown attacks against software systems, variants of known attacks,

and deviations of normal usage of programs regardless of whether the source is a

privileged internal user or an unauthorized external user. The disadvantage of the

anomaly detection approach is that well-known attacks may not be detected,

particularly if they fit the established profile of the user. Once detected, it is often

difficult to characterize the nature of the attack for forensic purposes. Finally a

high false positive rate may result for a narrowly trained detection algorithm, or
conversely, a high false negative rate may result for a broadly trained anomaly

detection approach.

2.4.4 Combining misuse and anomaly detection

Anomaly detection and misuse detection have major shortcomings that

hamper their effectiveness in detecting intrusions. Research can be carried into

intrusion detection methodologies which combine the anomaly detection

approach and the misuse detection approach (Lunt, 1989). These techniques seek

to incorporate the benefits of both of the standard approaches to intrusion

detection. The combined approach permits a single intrusion detection system to

monitor for indications of both external and internal attacks. While a significant advantage over the use of either method separately, a combined anomaly/misuse mechanism does possess some disadvantages. The use of two knowledge bases increases the amount of system resources that must be dedicated to the intrusion detection system (Cannady & Harrell, 1996). Additional disk space is required to store the profiles, and increased memory requirements are encountered as the mechanism compares user activities with information in the dual knowledge bases. In addition, the technique shares the inability of either method individually to detect collaborative or extended attack scenarios. Pattern recognition possesses a distinct advantage over anomaly and misuse detection methods in that it is capable of identifying attacks which occur over an extended period of time, across a series of user sessions, or by multiple attackers working in concert. This approach is also effective in reducing the need to review a potentially large amount of audit data (Cannady & Harrell, 1996).

Figure 2.1 shows a taxonomy of Intrusion Detection Systems. More details and information on the various IDS systems and the way they work can be found in (Mitchell, 2005).

Figure 2.1: Intrusion Detection System Taxonomy (Stefan, 2000)

2.5 INTRUSION ATTACKS

Almost all researchers categorize intrusion attacks into four different types. Jaiganesh et al. (2013) detailed four types of attacks made on a network-based intrusion detection system:
1. Denial of Service attack (DoS): an attack in which the attacker makes the computing or memory resources too busy or too full to handle legitimate requests.

2. User to Root attack (U2R): an attack in which the attacker starts from a normal user account and tries to gain root access.

3. Remote to Local attack (R2L): an attack in which the attacker sends packets to a machine over a network without having an account on that machine.

4. Probing attack: an attempt to gather information about a network of computers.

However, Patel et al. (2013) observed that there are more than four types of attack; they stated that the classes in the KDD'99 dataset can be categorized into five main classes (one normal class and four main intrusion classes: PROBE, DoS, U2R, and R2L).

1. Normal connections are generated by simulated daily user behaviour, such as downloading files and visiting web pages.

2. Denial of Service (DoS) attacks make the computing power or memory of a victim machine too busy or too full to handle legitimate requests. DoS attacks are classified based on the services that an attacker renders unavailable to legitimate users, e.g. apache2, land, mailbomb, and back.

3. Remote to User (R2L) is an attack in which a remote user gains access to a local user account by sending packets to a machine over a network; examples include sendmail and xlock.

4. User to Root (U2R) is an attack in which an intruder begins with access to a normal user account and then becomes a root user by exploiting various vulnerabilities of the system. The most common U2R exploits are regular buffer overflows, loadmodule, fdformat, and ffbconfig.

5. Probing (Probe) is an attack that scans a network to gather information or find known vulnerabilities. An intruder with a map of the machines and services available on a network can use that information to look for exploits.
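In practice, the per-connection attack labels in KDD'99 are mapped to these five classes before a model is trained. A partial mapping, covering only the attack names mentioned above plus two assumed Probe examples (ipsweep, portsweep), might look like:

```python
# Partial mapping of KDD'99 connection labels to the five main classes.
# Only a few labels per class are shown; the full dataset contains more.
LABEL_TO_CLASS = {
    "normal": "Normal",
    "apache2": "DoS", "land": "DoS", "mailbomb": "DoS", "back": "DoS",
    "sendmail": "R2L", "xlock": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "fdformat": "U2R", "ffbconfig": "U2R",
    "ipsweep": "Probe", "portsweep": "Probe",
}

def categorize(label):
    # Unlisted labels fall through to "Unknown" rather than a wrong class.
    return LABEL_TO_CLASS.get(label, "Unknown")

print(categorize("land"))   # DoS
print(categorize("xlock"))  # R2L
```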

2.6 DRAWBACKS OF INTRUSION DETECTION SYSTEMS (IDSS)

Intrusion Detection Systems (IDS) have become a standard component in

security infrastructures as they allow network administrators to detect policy

violations. These policy violations range from external attackers trying to gain

unauthorized access to insiders abusing their access. Current IDS have a number

of significant drawbacks:

1. Current IDS are usually tuned to detect known service level network

attacks. This leaves them vulnerable to original and novel malicious

attacks.
2. Data overload: Another aspect, which does not relate directly to misuse detection but is extremely important, is how much data an analyst can efficiently analyze. The amount of data that needs to be examined seems to grow rapidly; depending on the intrusion detection tools employed by a company and its size, logs can reach millions of records per day.

3. False positives: A common complaint is the number of false positives an IDS generates. A false positive occurs when normal traffic is mistakenly classified as malicious and treated accordingly.

4. False negatives: The case in which an IDS does not generate an alert when an intrusion is actually taking place (classification of malicious traffic as normal).

Data mining can help improve intrusion detection by addressing each of the above-mentioned problems:

1. Remove normal activity from alarm data to allow analysts to focus on

real attacks.

2. Identify false alarm generators and “bad” sensor signatures.

3. Find anomalous activity that uncovers a real attack.

4. Identify long, ongoing patterns (different IP address, same activity).

To accomplish these tasks, data miners employ one or more of the following

techniques:
I. Data summarization with statistics, including finding outliers

II. Visualization: presenting a graphical summary of the data

III. Clustering of the data into natural categories

IV. Association rule discovery: defining normal activity and enabling

the discovery of anomalies

V. Classification: predicting the category to which a particular record

belongs
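Technique (I), statistical summarization with outlier finding, can be sketched with the standard library; the byte counts below are made-up sample data, and the threshold of two standard deviations is an arbitrary illustrative choice:

```python
# Sketch of technique I: summarize a numeric feature (bytes per connection,
# values hypothetical) and flag outliers more than k standard deviations
# from the mean.
import statistics

def find_outliers(values, k=2.0):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]

traffic = [500, 480, 510, 495, 505, 490, 20000]  # one anomalous connection
print(find_outliers(traffic))  # -> [20000]
```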

2.7 DATA MINING AND IDS

Data mining techniques can be differentiated by their different model

functions and representation, preference criterion, and algorithms (Fayyad et al.,

The model function we are mainly interested in is classification: labelling activity as normal, malicious, or as a particular type of attack (Ghosh, Schwartzbar & Schatz, 1999). We are also interested in link and sequence analysis (Eric Bloedron et al., 2001). Additionally, data mining systems provide the means to

easily perform data summarization and visualization, aiding the security analyst

in identifying areas of concern (Eric Bloedron et al., 2001). The models must be

represented in some form. Common representations for data mining techniques

include rules, decision trees, linear and non-linear functions (including neural

nets), instance-based examples, and probability models (Fayyad et al., 1996).


A. Off Line Processing

The use of data mining techniques in IDSs usually implies analysis of the collected data in an offline environment. There are important advantages in performing intrusion detection offline, in addition to the real-time detection tasks typically employed. The most important of these advantages are:

1. In off-line analysis, it is assumed that all connections have already

finished and, therefore, we can compute all the features and check the

detection rules one by one (Fayyad et al., 1996).

2. The estimation and detection process is generally very demanding and therefore cannot be addressed in an online environment because of the various real-time constraints (Fayyad et al., 1996). Many real-time IDSs will start to drop packets when flooded with data faster than they can process it.

3. An offline environment provides the ability to transfer logs from

remote sites to a central site for analysis during off-peak times.

B. Data Mining and Real Time IDSs

Even though offline processing has a number of significant

advantages, data mining techniques can also be used to enhance IDSs in

real time. (Lee et al., 1998). (Ghosh, Schwartzbar & Schatz, 1999) were

one of the first to address important and challenging issues of accuracy,


efficiency, and usability of real-time IDSs. They implemented feature

extraction and construction algorithms for labeled audit data. They

developed several anomaly detection algorithms. In the paper, the authors

explore the use of information- theoretic measures, i.e., entropy,

conditional entropy, relative entropy, information gain, and information

cost to capture intrinsic characteristics of normal data and use such

measures to guide the process of building and evaluating anomaly

detection models. They also develop efficient approaches that use statistics

on packet header values for network anomaly detection. A real-time IDS,

called ”Judge”, was also developed to test and evaluate the use of those

techniques. A serious limitation of their approaches (as well as with most

existing IDSs) is that they only do intrusion detection at the network or

system level. However, with the rapid growth of e-Commerce and e-

Government applications, there is an urgent need to do intrusion and fraud

detection at the application-level. This is because many attacks may focus

on applications that have no effect on the underlying network or system

activities.
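The entropy measure mentioned above can be computed directly from an event distribution: regular (predictable) audit data has low entropy, while mixed data has higher entropy. The service distributions below are made-up examples:

```python
# Sketch of the entropy measure used to characterize audit data: a
# predictable event distribution has low entropy, an evenly mixed one has
# high entropy. The service distributions are hypothetical.
import math
from collections import Counter

def entropy(events):
    counts = Counter(events)
    n = len(events)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

regular = ["http"] * 95 + ["smtp"] * 5          # mostly one service
mixed   = ["http", "smtp", "ftp", "dns"] * 25   # evenly mixed services

print(round(entropy(regular), 3))  # low, close to 0
print(round(entropy(mixed), 3))    # 2.0 bits for 4 equally likely classes
```

Conditional entropy and information gain are built from the same quantity, applied to joint and conditioned distributions.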

C. Multi sensor Correlation

The use of multiple sensors to collect data by various sources has been

presented by numerous researchers as a way to increase the performance

of an IDS.
1. Lee et al. (1998) state that using multiple sensors for intrusion detection should increase the accuracy of IDSs.

2. Kumar (1995) states that "correlation of information from different sources has allowed additional information to be inferred that may be difficult to obtain directly".

3. Lee et al. (1998) note that "an IDS should consist of multiple co-operative lightweight subsystems that each monitor a separate part (such as an access point) of the entire environment".

4. Dickerson and Dickerson (2000) also explore a possible

implementation of such a mechanism. Their architecture consists of

three layers:

– A set of Data Collectors (packet collectors)

– A set of Data Processors

– A Threat analyzer that utilizes fuzzy logic and basically performs

a risk assessment of the collected data.

2.8 BENEFITS OF DATA MINING TECHNIQUES

1. Large databases may contain valuable implicit regularities that can be discovered automatically.

2. Difficult-to-program applications, which are too difficult for traditional

manual programming.
3. Software applications that customize to the individual user’s

preferences, such as personalized advertising.

There are several reasons why data mining approaches play a role in these three domains. First, classifying security incidents requires analyzing a vast amount of historical data, and it is difficult for human beings to find patterns in such an enormous amount of data. Data mining, however, seems well suited to overcoming this problem and can therefore be used to discover those patterns.

2.9 REASONS TO USE DATA MINING APPROACHES IN IDS

1. It is very hard to program an IDS using ordinary programming languages, which require knowledge to be made explicit and formalized.

2. The adaptive and dynamic nature of machine-learning makes it a

suitable solution for this situation.

3. The environment of an IDS and its classification task highly depend on

personal preferences. What may seem to be an incident in one

environment may be normal in other environments. This way, the ability

of computers to learn enables them to know someone’s “personal” (or

organizational) preferences, and improve the performance of the IDS, for

this particular environment (Narayana, Prasad, Srividhya, & Ranga

Reddy, 2011).
2.10 THE DATA MINING PROCESS OF BUILDING INTRUSION

DETECTION MODELS

The recent rapid development in knowledge discovery in databases (KDD) has

brought a better understanding of the techniques and process frameworks that

can support systematic analysis of the vast amounts of audit data available. The process of using

data mining approaches to build intrusion detection models is shown in Figure 2.2.

Figure 2.2: The Data Mining Process of Building ID Models (Lee,

1999).

Here raw (binary) audit data is first processed into ASCII network packet

information (or host event data), which is in turn summarized into connection

records (or host session records) containing a number of within-connection

features, e.g., service, duration, flag (indicating the normal or error status
according to the protocols), etc. Data mining programs are then applied to the

connection records to compute the frequent patterns, i.e., association rules and

frequent episodes, which are then analyzed to construct additional features for the

connection records. Classification programs, for example, RIPPER, are then used

to inductively learn the detection models. This process is of course iterative. For

example, poor performance of the classification models often indicates that more

pattern mining and feature construction is needed (Lee, 1999).

Data Mining is the automated process of going through large amounts of data

with the intention to discover useful information about the data that is not

obvious. Useful information may include special relations between the data,

specific models of the data that repeat, specific patterns, ways of

classifying it, or specific values that fall outside the “normal” pattern or

model (Agrawal & Srikant, 1994). In order to understand how data mining can

help advance intrusion detection, it is important to know how current IDS work to

identify an intrusion. Intrusion detection systems are a combination of hardware

and software resources aimed at protecting the confidentiality, availability and

integrity of a computer system or network. To an analyst sitting in front of an

IDS, an ideal system would alert on all malicious connections, whether it is a

known or novel attack (Chittur, 2001). However, the search for the ideal IDS

continues and the amount of network data is increasing. Besides the issue of data

overload facing network analysts due to increasing complexity and large size of

networks, traditional methods for intrusion detection are based on extensive


knowledge of signatures of known attacks that are provided by human experts.

The signature database has to be manually revised for each new type of intrusion

that is discovered. A significant limitation of signature-based methods is that they

cannot detect emerging cyber threats. In addition, once a new attack is discovered

and its signature developed, often there is a substantial latency in its deployment

across networks.

2.11 DATA MINING APPROACHES FOR IDS

The central theme of our approach is to apply data mining techniques for

intrusion detection in network-based systems. Data mining generally refers to the

process of (automatically) extracting models from large stores of data. The recent

rapid development in data mining has made available a wide variety of

algorithms, drawn from the fields of statistics, pattern recognition, machine

learning, and databases. Several types of algorithms (Lee, Stolfo, & Mok, 1999)

are particularly relevant to our research:

1. Classification: Maps a data item into one of several pre-defined

categories. These algorithms normally output “classifiers”, for example,

in the form of decision trees or rules. An ideal application in intrusion

detection would be to gather sufficient “normal” and “abnormal” audit data

for a user or a program, then apply a classification algorithm to learn a

classifier that can label or predict new unseen audit data as belonging to
the normal class or the abnormal class (Narayana, Prasad,

Srividhya, & Pandu Ranga, 2011).

2. Link analysis: Determines relations between fields in the database.

Finding out the correlations in audit data will provide insight for selecting

the right set of system features for intrusion detection

(Narayana, Prasad, Srividhya, & Pandu Ranga, 2011).

3. Sequence analysis: Models sequential patterns. These algorithms can

discover what time-based sequences of audit events frequently occur

together (Narayana, Prasad, Srividhya, & Pandu

Ranga, 2011). These frequent event patterns provide guidelines for

incorporating temporal statistical measures into intrusion detection

models. For example, patterns from audit data containing network-based

denial-of-service (DOS) attacks suggest that several per-host and per-

service measures should be included.

4. Association Rule: This technique searches for frequently occurring item sets

in a large dataset. Association rule mining determines association rules

and/or correlation relationships among large sets of data items. The mining

process of association rule can be divided into two steps as follows:


i. Frequent Item Set Generation: Generates all sets of items whose

support is greater than the specified threshold, called

minsupport.

ii. Association Rule Generation: from the previously generated

frequent item sets, it generates association rules in the form of

“if-then” statements that have confidence greater than the

specified threshold, called minconfidence.

5. Clustering: It is an unsupervised machine learning mechanism for

discovering patterns in unlabeled data. It is used to label data and assign it

into clusters where each cluster consists of members that are quite similar.

Members from different clusters are different from each other. Hence

clustering methods can be useful for classifying network data for detecting

intrusions. Clustering can be applied to both anomaly detection and

misuse detection.
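The two-step association rule procedure above (frequent item set generation, then rule generation) can be sketched in plain Python. The toy transactions, feature names, and threshold values below are illustrative assumptions, not drawn from a real audit log:

```python
from itertools import combinations

# Toy "connection records": each transaction is a set of nominal features.
transactions = [
    {"proto=tcp", "service=http", "flag=SF"},
    {"proto=tcp", "service=http", "flag=S0"},
    {"proto=udp", "service=dns",  "flag=SF"},
    {"proto=tcp", "service=http", "flag=SF"},
]

def frequent_itemsets(transactions, minsupport):
    """Step i: generate every item set whose support >= minsupport."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(set(cand) <= t for t in transactions) / n
            if support >= minsupport:
                frequent[cand] = support
                found = True
        if not found:          # no frequent set of this size: none larger exists
            break
    return frequent

def rules(frequent, minconfidence):
    """Step ii: derive "if antecedent then consequent" rules above minconfidence."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for ante in combinations(itemset, k):
                conf = supp / frequent[ante]
                if conf >= minconfidence:
                    cons = tuple(i for i in itemset if i not in ante)
                    out.append((ante, cons, conf))
    return out

freq = frequent_itemsets(transactions, minsupport=0.5)
strong = rules(freq, minconfidence=0.9)
```

On this toy data, for example, the rule "if proto=tcp then service=http" comes out with confidence 1.0.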

2.12 CLASSIFICATION TECHNIQUES

Classification is a data mining (machine learning) technique used to predict

group membership for data instances. It consists of predicting a certain outcome

based on a given input. According to D’silva and Vora (2013), classification is a

supervised learning technique. A classification-based IDS classifies all

network traffic as either normal or malicious. Classification techniques are mostly

used for anomaly detection. The classification process is as follows:

i. It accepts collection of items as input.


ii. Maps the items into predefined groups or classes defined by some

attributes.

iii. After mapping, it outputs a classifier that can accurately predict the

class to which a new item belongs.

Classification is one of the data mining functionalities. It finds a model or function

that separates classes or data concepts in order to predict the classes of an

unknown object. For example, a loan officer requires data analysis to determine

which loan applicants are "safe" or "risky". The data analysis task is

classification, where a model or classifier is constructed to predict class

(categorical) labels, such as “safe” or “risky” for the loan application data. These

categories can be represented by discrete values, where the ordering among

values has no meaning. Because the class labels of the training data are already known,

it is also called supervised learning. Classification consists of two processes:

i. Training and

ii. Testing.

The first process, training, builds a classification model by analyzing training

data containing class labels. While the second process, testing, examines a

classifier (using testing data) for accuracy (in which case the test data contains

the class labels) or its ability to classify unknown objects (records) for

prediction (Crosbie & Spafford, 1995).
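The two phases can be illustrated with a deliberately simple nearest-centroid classifier on synthetic two-feature records; the data and the model here are stand-ins chosen so the sketch stays self-contained, not the classifiers used in this study:

```python
import random

# "Training" builds a model from labelled records; "testing" measures its
# accuracy on held-out records. Feature values are synthetic, not audit data.
random.seed(0)
normal   = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
abnormal = [(random.gauss(4, 1), random.gauss(4, 1)) for _ in range(100)]
data = [(x, "normal") for x in normal] + [(x, "abnormal") for x in abnormal]
random.shuffle(data)

train, test = data[:150], data[150:]           # hold out 25% for testing

def fit(records):
    """Training phase: compute one centroid per class label."""
    sums, counts = {}, {}
    for (a, b), label in records:
        sa, sb = sums.get(label, (0.0, 0.0))
        sums[label] = (sa + a, sb + b)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (sa / counts[lbl], sb / counts[lbl])
            for lbl, (sa, sb) in sums.items()}

def predict(centroids, point):
    """Testing phase: assign the point to the class with the nearest centroid."""
    return min(centroids, key=lambda lbl: (point[0] - centroids[lbl][0]) ** 2
                                        + (point[1] - centroids[lbl][1]) ** 2)

model = fit(train)                                                   # training
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)  # testing
```

The well-separated synthetic classes make the held-out accuracy close to 1, which is the behaviour step iii above describes: the output of training is a classifier that can predict the class of new items.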


Classification algorithms are laid under the following classification techniques:

i. Decision Tree based Methods

ii. Rule-based Method

iii. Memory – based Reasoning

iv. Neural Networks

v. Naïve Bayes and Bayesian Belief Networks

vi. Support Vector Machines

2.13 FEATURE SELECTION

Hebat, Sherif, and Mohamed (2012) in their work reveal that subsequent to

preprocessing of data, the features of the data set are identified as either being

significant to the intrusion detection process, or redundant. This process is known

as feature selection. Redundant features are generally found to be closely

correlated with one or more other features. As a result, omitting them from the

intrusion detection process does not degrade classification accuracy. In fact, the

accuracy may improve due to the resulting data reduction, and removal of noise

and measurement errors associated with the omitted features. Therefore, choosing

a good subset of features proves to be significant in improving the performance of

the system.

The authors presented two methods of feature selection, which are:

1) Information Gain: In this method, the features are filtered to create the

most prominent feature subset before the start of the learning process.
2) Gain ratio: a modification of the information gain that solves the issue

of bias towards features with a larger set of values, exhibited by

information gain. Its normalizing term, the intrinsic (split) information, is

large when the data are spread evenly over many branches and small

when all data belong to one branch.

Gain ratio takes the number and size of branches into account when choosing an

attribute: it corrects the information gain by dividing by the intrinsic information of a

split (i.e. how much information is needed to tell which branch an

instance belongs to), where the intrinsic information is the entropy of the distribution of

instances into branches.
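A small numerical sketch of why gain ratio corrects information gain's bias: an "id"-like feature with a unique value per row achieves a perfect information gain yet a penalised gain ratio. The feature values and labels below are invented for the example:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Expected reduction in label entropy after splitting on `feature`."""
    n = len(rows)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [lbl for r, lbl in zip(rows, labels) if r[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def split_info(rows, feature):
    """Intrinsic information: entropy of the distribution into branches."""
    return entropy([r[feature] for r in rows])

def gain_ratio(rows, labels, feature):
    return info_gain(rows, labels, feature) / split_info(rows, feature)

rows = [
    {"id": 1, "proto": "tcp"}, {"id": 2, "proto": "tcp"},
    {"id": 3, "proto": "udp"}, {"id": 4, "proto": "udp"},
]
labels = ["normal", "normal", "attack", "attack"]

# "id" is unique per row, so information gain rates it (spuriously) as perfect;
# gain ratio penalises its many one-row branches.
g_id, g_proto = info_gain(rows, labels, "id"), info_gain(rows, labels, "proto")
r_id, r_proto = gain_ratio(rows, labels, "id"), gain_ratio(rows, labels, "proto")
```

Here both features get an information gain of 1.0 bit, but the gain ratio of "proto" (1.0) beats that of "id" (0.5), so the genuinely predictive feature is preferred.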

Krzysztof and Norbert (2007) state that effective and versatile classification cannot be

achieved by single classification algorithms; it requires hybrid, complex

models comprising a feature selection stage. As with other data analysis tasks,

also in feature selection it is very beneficial to take advantage of different kinds of

algorithms. The authors further stressed that one of the most efficient heuristics

used for decision tree construction is the Separability of Split Value (SSV)

criterion. Its basic advantage is that it can be applied to both continuous and

discrete features in such a manner that the estimates of separability can be

compared regardless of the substantial difference in types. Furthermore, they opined

that the SSV criterion has been successfully used not only for building

classification trees. It proved efficient in data type conversion (from continuous to

discrete and in the opposite direction) and as the discretization part of feature

selection methods, which finally rank the features according to such indices like
Mutual Information. It is known that extra features can increase computation time

and can impact the accuracy of an IDS, so feature selection is a very good way of

improving the machine learning algorithms used for classification purposes.


CHAPTER THREE
SYSTEM APPROACH AND METHODOLOGY
3.1 Methodology

The approach used in this project is to apply the machine learning algorithm known

as decision trees extensively to the classification task of an intrusion detection system.

3.2 Decision Trees

Decision Trees are a class of very powerful Machine Learning models capable of

achieving high accuracy in many tasks while being highly interpretable. What

makes decision trees special in the realm of ML models is really their clarity of

information representation. The “knowledge” learned by a decision tree through

training is directly formulated into a hierarchical structure. This structure holds

and displays the knowledge in such a way that it can easily be understood, even by

non-experts.

3.3 Proposed Approach

The approach used in this project is to discuss extensively the intended algorithms

with their respective relation with classification and intrusion detection.

3.4 INFORMATION GAIN IN TERMS OF FEATURE REDUCTION

According to Ibrahim, Badr and Shaheen (2012), subsequent to preprocessing of

data, the features of the data set are identified as either being significant to the

intrusion detection process, or redundant. This process is known as feature

selection. Redundant features are generally found to be closely correlated with


one or more other features. As a result, omitting them from the intrusion detection

process does not degrade classification accuracy. In fact, the accuracy may

improve due to the resulting data reduction, and removal of noise and

measurement errors associated with the omitted features. Therefore, choosing a

good subset of features proves to be significant in improving the performance of

the system.

They thus define Information Gain: In this method, the features are filtered to

create the most prominent feature subset before the start of the learning process.

Mathematically, (Xindong et al, 2008) stated that information gain for a dataset

(D) containing Si tuples of class Ci for i = (1,…..,m) is defined as:

Information measures the information required to classify an arbitrary tuple:

Info(D) = -∑(i=1..m) (Si/S) log2(Si/S) ………………….(1)

where Si is the number of tuples in D belonging to class Ci, and S is the

total number of tuples in D.

The entropy (expected information) of feature X with values (X1, X2, ……., Xv) is:

E(X) = ∑(j=1..v) ((S1j + …… + Smj)/S) × Info(Dj) ……………….(2)

where Sij is the number of tuples of class Ci in the subset Dj for which X takes the value Xj.

The information gained by branching on feature X is then:

Gain(X) = Info(D) – E(X) ………………………..(3)
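Equations (1)-(3) can be checked numerically. The class and subset counts below are the classic Quinlan weather-data counts, used here purely for illustration:

```python
from math import log2

# D contains S = 14 tuples: 9 of class c1 and 5 of class c2.
S = 14
class_counts = [9, 5]                                         # S1, S2

info_D = -sum(si / S * log2(si / S) for si in class_counts)   # equation (1)

# Feature X takes 3 values; Sij = tuples of class Ci in subset Dj (X = Xj).
# Rows: classes c1, c2.  Columns: subsets D1, D2, D3.
Sij = [[2, 4, 3],
       [3, 0, 2]]

def info(counts):
    """Info() of a subset, skipping empty classes to avoid log2(0)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

E_X = sum(sum(col) / S * info(col) for col in zip(*Sij))      # equation (2)

gain_X = info_D - E_X                                         # equation (3)
```

With these counts, Info(D) ≈ 0.940 bits, E(X) ≈ 0.694 bits, and Gain(X) ≈ 0.247 bits.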


3.5 DATASET

Since 1999, KDD’99 (Yimin, 2004) has been the most widely used data set for

the evaluation of anomaly detection methods. This data set is built based on the

data captured in DARPA’98 IDS evaluation program (KDD, 1999). DARPA’98

is about 4 gigabytes of compressed raw (binary) tcpdump data of 7 weeks of

network traffic. The two weeks of test data contain around 2 million connection

records. The KDD training dataset consists of approximately 4,900,000 single

connection vectors, each of which contains 41 features and is labeled as either

normal or an attack, with exactly one specific attack type. The simulated attacks

fall in one of the following four categories:

(1) Denial of Service Attack (DoS): is an attack in which the attacker

makes some computing or memory resource too busy or too full to handle

legitimate requests, or denies legitimate users access to a machine.

(2) User to Root Attack (U2R): is a class of exploit in which the attacker

starts out with access to a normal user account on the system (perhaps

gained by sniffing passwords, a dictionary attack, or social engineering)

and is able to exploit some vulnerability to gain root access to the system.

(3) Remote to Local Attack (R2L): occurs when an attacker who has the

ability to send packets to a machine over a network but who does not have
an account on that machine exploits some vulnerability to gain local

access as a user of that machine.

(4) Probing Attack: is an attempt to gather information about a network of

computers for the apparent purpose of circumventing its security controls.

Table 1 shows the four categories and the corresponding attacks in

each category.

Figure 3.2: Classification of attacks on KDD data set


KDD Cup dataset → Import dataset → Data preprocessing → Apply feature selection → Apply the classifiers (RF and J48) → Result

Figure 3.1. An architecture of the proposed algorithm

3.6 DATA PRE-PROCESSOR

In complex classification domains, some data may hinder the classification

process. Features may contain false correlations, which hinder the process of

detecting intrusions. Further, some features may be redundant since the

information they add is contained in other features. Extra features can increase

computation time, and can impact the accuracy of IDS. Feature selection

improves classification by searching for the subset of features, which best

classifies the training data. Dimension Reduction techniques are proposed as a

data pre-processing step. This process identifies a suitable low-dimensional


representation of original data. Reducing the dimensionality improves the

computational efficiency and accuracy of the data analysis.

3.7 Random Forest Algorithm

Random Forest is a widely used machine learning algorithm that works through a

bagging approach to create an ensemble of decision trees, each built on a random subset of the

data. It is considered one of the most effective algorithms for almost any

prediction task. It can be used for both classification and regression

problems. It is a combination of tree predictors where each tree depends on the

values of a random vector sampled independently with the same distribution for

all trees in the forest.

The pseudocode for the random forest algorithm can be split into two stages. In the

first, ‘n’ random trees are created; this forms the random forest. In the second

stage, the outcome for the same test feature vector from all decision trees is combined.

The final prediction is then derived by assessing the results of each decision tree,

or simply by taking the prediction that appears most often among the trees.

The Random Forest machine learning algorithm maintains accuracy even when there

is inconsistent data, and it is simple to use. It also gives estimates of which variables

are important for the classification. It runs efficiently on large databases while

generating an internal unbiased estimate of the generalisation error, and it

provides methods for balancing error in class-unbalanced data sets. However,

random forests are difficult to analyse theoretically, and growing a large number of trees

can slow down prediction in real-time systems. Another drawback is

that the algorithm does not predict beyond the range of the response

values in the training data.

The random forest algorithm is a supervised classification algorithm. As the name

suggests, it builds a forest from a number of trees.

In general, the more trees in the forest, the more robust it is; likewise,

in the random forest classifier, a higher number of trees

tends to give more accurate results.

3.7.1 Random Forest Pseudocode:

1. Randomly select “k” features from the total “m” features, where k << m.
2. Among the “k” features, calculate the node “d” using the best split
point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until “l” number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 “n” times to
create “n” trees.

The random forest algorithm thus begins by randomly

selecting “k” features out of the total “m” features.
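The five steps above can be sketched in plain Python. To keep the sketch short, each tree is a one-level decision stump rather than a full tree; the toy data and parameter choices are illustrative assumptions, not the WEKA implementation used later in this project:

```python
import random
from collections import Counter

# Minimal random-forest sketch: each tree is grown on a bootstrap sample,
# considers only k of the m features at its (single) split, and the forest
# predicts by majority vote over the trees.
random.seed(1)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_stump(rows, labels, feature_ids):
    """Step 2: choose the (feature, threshold) split with fewest training errors."""
    best, best_err = None, len(rows) + 1
    for f in feature_ids:
        for t in {r[f] for r in rows}:
            left  = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] >  t]
            if not left or not right:
                continue
            lm, rm = majority(left), majority(right)
            err = sum(l != lm for l in left) + sum(l != rm for l in right)
            if err < best_err:
                best, best_err = (f, t, lm, rm), err
    return best

def grow_forest(rows, labels, n_trees=25, k=1):
    """Steps 1-5: n trees, each on a bootstrap sample with k random features."""
    forest, n, m = [], len(rows), len(rows[0])
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = random.sample(range(m), k)              # k of the m features
        stump = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if stump:
            forest.append(stump)
    return forest

def predict(forest, row):
    votes = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return majority(votes)                              # majority vote

# Toy data: two integer features; label depends on their sum.
rows = [(x, y) for x in range(5) for y in range(5)]
labels = ["attack" if x + y >= 5 else "normal" for x, y in rows]
forest = grow_forest(rows, labels)
train_acc = sum(predict(forest, r) == l for r, l in zip(rows, labels)) / len(rows)
```

Because each stump sees only one feature, no single tree can capture the diagonal boundary, yet the voted ensemble still recovers a usable classifier, which is the point of the bagging scheme described above.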

3.7.2 Random Forest Algorithm


Require: Initially the tree has exactly one leaf (TreeRoot) which covers the whole space
Require: The dimensionality of the input, D. Parameters λ, m and τ.
SelectCandidateSplitDimensions(TreeRoot, min(1 + Poisson(λ), D))
for t = 1, 2, . . . do
    Receive (Xt, Yt, It) from the environment
    At ← leaf containing Xt
    if It = estimation then
        UpdateEstimationStatistics(At, (Xt, Yt))
        for all S ∈ CandidateSplits(At) do
            for all A ∈ CandidateChildren(S) do
                if Xt ∈ A then
                    UpdateEstimationStatistics(A, (Xt, Yt))
                end if
            end for
        end for
    else if It = structure then
        if At has fewer than m candidate split points then
            for all d ∈ CandidateSplitDimensions(At) do
                CreateCandidateSplit(At, d, πdXt)
            end for
        end if
        for all S ∈ CandidateSplits(At) do
            for all A ∈ CandidateChildren(S) do
                if Xt ∈ A then
                    UpdateStructuralStatistics(A, (Xt, Yt))
                end if
            end for
        end for
        if CanSplit(At) then
            if ShouldSplit(At) then
                Split(At)
            else if MustSplit(At) then
                Split(At)
            end if
        end if
    end if
end for
3.8 J48
C4.5 is a successor of ID3 developed by Ross Quinlan and is implemented in

WEKA as J48 in Java. Both adopt a greedy, top-down approach to

decision tree construction. It is used for classification, in which new data is labelled

according to already existing observations (the training data set). Decision tree

induction begins with a dataset (the training set), which is partitioned at every node

into smaller partitions, thus following a recursive divide-and-conquer

strategy. In addition to the data set, which is a collection of objects, a set of

attributes is also passed. An object can be an event or an activity, and the attributes are

the information related to that object. Every tuple in the data set carries a

class label which identifies whether the object belongs to a particular class or not.

Splitting is performed further only if the tuples fall in different classes. The

partitioning of the dataset uses a heuristic that chooses the attribute which best partitions the

data set. This is referred to as an attribute selection measure. These attribute

selection measures determine the type of branching that occurs at a

node: the Gini index and information gain, for example, partition a node into a

binary or multiway split, respectively. Techniques for converting multi-branch

trees into strictly binary ones can be used if need be. C4.5 uses gain ratio as the attribute

selection measure, which has an advantage over the information gain used in its

predecessor ID3: ID3's information gain favours attributes with many unique

values, producing wide n-ary branch trees whose splits generalize poorly for

classification.

The J48 algorithm is as follows:

3.8.1 Algorithm (J48)

INPUT:

Dataset // Training data

OUTPUT:

Tree // Decision tree

BUILD(*Dataset)

{Tree = ∅;

Tree = Create root node, adding an arc for each split predicate, with labels assigned;

For each arc do

Dataset = Database created by applying the splitting predicate to Dataset;

If a stopping point is reached on this path, then Tree = create leaf node and label it with the

appropriate class;

Else

Tree = BUILD(Dataset);

Tree = add Tree to arc;

}
3.8.2 Pseudocode of J48

Step 1: All the rows in the dataset are passed to the root node.

Step 2: Based on the values of the rows in the node considered, each of the

predictor variables is split at all its possible split points.

Step 3: At each split point, the parent node is split into binary (child) nodes

by separating the rows with values lower than or equal to the split point from those with

values higher than the split point for the considered predictor variable. For

categorical predictor variables, each category of the variable is considered in

turn.

Step 4: The predictor variable and split point with the highest value of the

splitting criterion I are selected for the node,

where PL and PR are the probabilities of a sample lying in the left sub-tree and right

sub-tree respectively, and the remaining terms are the probabilities that a sample is in class Cj and

in the left sub-tree or right sub-tree.
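The recursive divide-and-conquer strategy described above can be sketched as a minimal gain-ratio tree builder. The data, attribute names, and helper functions below are invented for illustration; this is a sketch in the spirit of C4.5, not WEKA's J48 code:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5-style attribute selection measure for a nominal attribute."""
    n = len(rows)
    values = [r[attr] for r in rows]
    split = entropy(values)                 # intrinsic (split) information
    if split == 0:                          # one value only: useless split
        return 0.0
    rem = sum(values.count(v) / n *
              entropy([l for val, l in zip(values, labels) if val == v])
              for v in set(values))
    return (entropy(labels) - rem) / split

def build(rows, labels, attrs):
    """Recursive divide-and-conquer induction; a leaf is just a class label."""
    if len(set(labels)) == 1:               # stopping point: pure node
        return labels[0]
    if not attrs:                           # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    tree = {"attr": best, "branches": {}}
    for v in {r[best] for r in rows}:
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree["branches"][v] = build([r for r, _ in sub], [l for _, l in sub],
                                    [a for a in attrs if a != best])
    return tree

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

rows = [
    {"service": "http", "flag": "SF"}, {"service": "http", "flag": "S0"},
    {"service": "ftp",  "flag": "SF"}, {"service": "ftp",  "flag": "S0"},
]
labels = ["normal", "attack", "normal", "attack"]
tree = build(rows, labels, ["service", "flag"])
```

On this toy data the builder splits on "flag" (gain ratio 1.0) rather than the uninformative "service" (gain ratio 0), yielding a two-leaf tree.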


3.9 The Data Mining Tools


The experimental tool used was the Waikato Environment for Knowledge Analysis

(WEKA). WEKA is a popular suite of machine learning software

developed at the University of Waikato in New Zealand. It is open-source software available under

the GNU General Public License. The WEKA workbench contains a

collection of visualization tools and algorithms for data analysis and predictive

modelling, together with graphical user interfaces for easy access to this

functionality. Written in Java, the toolkit contains a large collection of state-of-the-art

machine learning and data mining

algorithms, with tools for regression, classification,

clustering, association rules, visualization, and data pre-processing. WEKA has

become very popular with academic and industrial researchers, and is also widely

used for teaching purposes. To use WEKA, the collected data need to be prepared

and converted to the CSV file format to be compatible with the WEKA data mining

toolkit.
3.10 System Configuration

Minimum Hardware Configuration

- Processor - Intel Core Duo

- Speed - 1.1 GHz

- RAM - 4 GB (min)

- Hard Disk - 20 GB

Software Configuration

- Operating System: Windows XP and higher

- Programming Tool: Weka.


- Dataset: KDD Cup 99 Dataset
CHAPTER FOUR

RESULTS AND DISCUSSION


4.1 INTRODUCTION

The goal of this simulation is to show how efficiently and effectively the two

classification algorithms can detect intrusions. This study used two

classification algorithms, Random Forest and J48. The experiments were

performed using the Weka data mining tool. The dataset used in this project is the

KDD dataset. Next, we discuss the data used to train and test the

classifiers. The data exploration and presentation processes consist of the

following steps: data sets, data mining tools and performance measurement terms.

4.1.1 The Dataset

The dataset used in this research is the KDD dataset. It is a data set suggested

to solve some of the inherent problems of the KDD Cup '99 data set, being essentially

a processed version of the KDD Cup '99 dataset. This dataset enables researchers

to train their algorithms on the full dataset (because of its smaller number of

records) instead of using only a portion of the full dataset, as is the case with the

KDD Cup '99 data set.

4.1.2 Data Mining Tools

The experiments were done using Weka 3.6.7. Weka (Waikato Environment for

Knowledge Analysis) is a popular suite of machine learning software written in


Java, developed at the University of Waikato, New Zealand. Weka supports

several standard data mining tasks, more specifically, data preprocessing,

clustering, classification, regression, visualization, and feature selection.

The experiments were carried out on a 32-bit Windows 8 Professional operating

system, with 2 GB of RAM and a Pentium (R) Dual-Core CPU at 2.20 GHz per core.

Due to the iterative nature of the experiments and resultant processing power

required, the java heap size for weka-3-6.7 was set to 1024 MB.

To assess the effectiveness of the algorithms, each one of them was trained on the

KDD data set using a ten-fold validation test mode in a Weka (Waikato

Environment for Knowledge Analysis) environment. To test and evaluate the

algorithms we use 10-fold cross validation. In this process the data set is divided

into 10 subsets. Each time, one of the 10 subsets is used as the test set and the

other nine subsets form the training set. Performance statistics are calculated

across all 10 trials. This provides a good indication of how well the classifier will

perform on unseen data.
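The 10-fold procedure just described can be sketched in plain Python. The synthetic data and the majority-class "classifier" are stand-ins so the example stays self-contained; they are not this study's dataset or classifiers:

```python
import random
from collections import Counter

# 10-fold cross validation: the data set is split into 10 subsets; each in turn
# serves as the test set while the remaining 9 form the training set.
random.seed(0)
data = [(i, "normal" if random.random() < 0.8 else "attack") for i in range(200)]
random.shuffle(data)

k = 10
folds = [data[i::k] for i in range(k)]          # 10 equal-sized subsets

accuracies = []
for i in range(k):
    test = folds[i]
    train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
    # Stand-in "classifier": always predict the training set's majority class.
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    acc = sum(lbl == majority for _, lbl in test) / len(test)
    accuracies.append(acc)

mean_accuracy = sum(accuracies) / k             # averaged over all 10 trials
```

Averaging the 10 per-fold accuracies gives the performance estimate described above; in Weka this is what the "10-fold cross-validation" test option computes internally for the chosen classifier.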


4.1.3 Performance Measurement Terms

(1). Correctly Classified Instance: The correctly and incorrectly classified

instances show the percentage of test instances that were correctly and

incorrectly classified. The percentage of correctly classified instances is

often called accuracy or sample accuracy.

(2). Kappa Statistics: Kappa is a chance-corrected measure of

agreement between the classifications and the true classes. It's calculated

by taking the agreement expected by chance away from the observed

agreement and dividing by the maximum possible agreement. A value

greater than 0 means that the classifier is doing better than chance.

(3) Mean Absolute Error, Root Mean Squared Error, Relative Absolute

Error: The error rates are used for numeric prediction rather than

classification. In numeric prediction, predictions aren't just right or wrong,

the error has a magnitude, and these measures reflect that. Detection of

attack is measured by following metrics:

(i). True positive (TP): Corresponds to the number of detected attacks and

it is in fact an attack.

(ii). False positive (FP): Or false alarm, corresponds to the number of

detected attacks that is in fact normal.


The accuracy of an intrusion detection system is measured with respect to its

detection rate and false alarm rate.
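A minimal sketch of how these measurement terms are computed from a 2×2 confusion matrix; the counts below are invented for illustration:

```python
# Confusion-matrix counts (hypothetical): attacks detected / missed,
# normal records flagged / passed.
TP, FN = 90, 10
FP, TN = 5, 895
n = TP + FN + FP + TN

accuracy       = (TP + TN) / n        # correctly classified instances
detection_rate = TP / (TP + FN)       # true positive rate
false_alarm    = FP / (FP + TN)       # false positive rate

# Kappa: observed agreement corrected for agreement expected by chance.
p_obs = accuracy
p_chance = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / (n * n)
kappa = (p_obs - p_chance) / (1 - p_chance)
```

On these counts a trivial classifier labelling everything "normal" would still score 90% accuracy but a kappa of 0, which is exactly why chance-corrected measures matter under skewed class distributions.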

In misuse detection related problems, standard data mining techniques are

not applicable due to several specific details that include dealing with skewed

class distribution, learning from data streams and labeling network connections.

The problem of skewed class distribution in network intrusion detection is

very apparent, since intrusion, as the class of interest, is much rarer than

the class representing normal network behavior. In such scenarios, where the

normal behavior may typically represent 98-99% of the entire population a trivial

classifier that labels everything with the majority class can achieve 98-99%

accuracy.

4.1.4 10 - Fold Cross Validation

The 10-fold cross validation is used in the field of machine learning to determine

how accurately a learning algorithm will be able to predict data that it was not

trained on.
Figure 4.1. KDD dataset loaded into the model
Figure 4.2. J48 Classifier error
Figure 4.3. J48 classifier tree generated
Figure 4.4 J48 Margin curve
Figure 4.5 J48 Threshold curve
Figure 4.6 J48 cost curve
Figure 4.7 J48 Cost benefit analysis curve
Figure 4.8. J48 classifier performance

Table 4.1. Proposed classifiers performance

Classifier     Correctly   Incorrectly  Kappa      Mean      Root mean  Relative  Root relative
               classified  classified   statistic  absolute  squared    absolute  squared
               instances   instances               error     error      error     error
Random Forest  99          22           99         67        45         1.3       9
J48            99          1            99         78        70         1.5       14
Figure 4.9. Random forest classifier error
Figure 4.10. Random forest margin curve
Figure 4.11. Random forest classifier

Table 4.2. Performance evaluation of the study

Metrics        TP rate  FP rate  Precision  Recall  F-measure  ROC area
Random forest  99       4        99         99      99         100
J48            99       6        99         99      99         99

As can be seen from Table 4.2, the RF and J48 have the same TP rate, precision,

recall and F-measure. The RF has the highest ROC. The J48 has the highest FP

rate. The two classifiers traded off ROC and FP rate.



Figure 4.12. Performance of the proposed model

Figure 4.12 reveals that RF outperformed J48 in terms of ROC area, while

the two have the same TP rate, precision, recall and F-measure. RF also performed

better in terms of FP rate. The two proposed algorithms

achieved a detection accuracy of 99%, with Random Forest having a false alarm rate

of 4% and J48 a false alarm rate of 6%. This shows that RF performed

better than the J48 algorithm.


CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
5.1 SUMMARY

Finally, the conclusions of the study and the summary of the results obtained from

all the experiments carried out using the RF and J48 algorithms for anomaly

detection have been presented in this section.

In today’s day and age, completely preventing security breaches with

currently available technology is quite unrealistic. Hence, intrusion detection is a

very important feature in the network security. Furthermore, the misuse detection

methods are unable to detect the unknown attacks; hence, anomaly detection

needs to be used for identifying such attacks. The data mining technique is

applied in the anomaly-based detection techniques for improving the intrusion

detection accuracy rate.

This project developed and applied the J48 and RF classification

algorithms for intrusion and anomaly detection. It was seen that the proposed

approach performed well, with RF performing better than J48.

This new method was very effective for detection of many attacks and showed

higher detection accuracy, as compared to the algorithms reported earlier.

5.2 CONCLUSION

The current research focused on the design and implementation of an intrusion detection system using data mining and machine learning algorithms. In this project, two decision tree algorithms, RF and J48, were chosen to find out which of the two is more accurate and efficient, and an intrusion detection system was implemented using these two well-known machine learning algorithms. The results obtained from the experiments revealed that RF performs better than J48 for the design and implementation of an intrusion detection system: both algorithms achieved a detection rate of 99%, but RF produced a false alarm rate of 4% against 6% for J48.
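This gap is consistent with how the two classifiers differ: J48 builds a single decision tree, whereas a random forest aggregates many trees by majority vote, which tends to cancel out an individual tree's mistakes. A toy sketch of that voting step, in which the stub "trees" and the connection features are hypothetical stand-ins for trained decision trees:

```python
def forest_predict(trees, record):
    """Majority vote over an ensemble of tree classifiers (1 = attack, 0 = normal)."""
    votes = [tree(record) for tree in trees]
    return 1 if sum(votes) > len(votes) / 2 else 0

# Hypothetical stub trees keyed on toy connection features.
tree_a = lambda r: 1 if r["failed_logins"] > 3 else 0
tree_b = lambda r: 1 if r["bytes_sent"] > 50_000 else 0
tree_c = lambda r: 1 if r["failed_logins"] > 3 or r["duration"] < 1 else 0

suspicious = {"failed_logins": 5, "bytes_sent": 200, "duration": 0.2}
print(forest_predict([tree_a, tree_b, tree_c], suspicious))  # two of three trees vote attack
```

Even though tree_b misses this record, the ensemble still flags it, which is the intuition behind RF's higher ROC area here.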

5.3 RECOMMENDATION

For future work, the intrusion detection accuracy rate and the performance of the proposed technique should be further improved, and the technique should be implemented in real network environments. Future work can also further explore the features of the J48 algorithm and improve the split value and the construction of the decision tree by applying the correlation feature selection technique.

Optimization using other swarm intelligence algorithms, algorithms that combine kernel methods with other classification methods for pattern analysis, and optimization techniques for SVM parameter tuning may also be developed.
Abstract

Nowadays it is very important to maintain a high level of security to ensure safe and trusted communication of information between organizations. Secured data communication over the internet and other networks, however, is always under threat of intrusion and misuse. A wide range of security technologies such as information encryption, access control and intrusion prevention are used to protect network-based systems, but many intrusions still go undetected. Over the past years, a growing number of research projects have applied data mining to intrusion detection with different classification algorithms. Many of these approaches achieve a high detection rate and accuracy, but the majority suffer from a high false alarm rate caused by normal connections being falsely classified as attacks, often a result of large datasets containing noisy or meaningless attributes. The IDS research area therefore needs to focus not only on detection rate and accuracy but also on reducing noise in the dataset while retaining its value, so that intrusions can be properly identified with a low false alarm rate. This project presents classification algorithms based on Random Forest (RF) and J48. The first stage of the process is data preprocessing based on feature selection with Principal Component Analysis (PCA), followed by network traffic classification with RF and J48. The KDDCup 99 dataset is used for the experiment, which was performed in WEKA. The results showed that RF gave a detection accuracy of 99% with a 4% false alarm rate, while J48 gave a detection accuracy of 99% with a 6% false alarm rate. The experimental results thus show that the RF algorithm has the higher detection rate with the lower false alarm rate.
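As a rough illustration of the PCA preprocessing stage described above, the first principal component of two-feature data can be computed in closed form from the 2x2 covariance matrix. The feature values below are synthetic, and a real pipeline would use a library PCA over all KDDCup 99 attributes:

```python
import math

def principal_axis(xs, ys):
    """First principal component of 2-D data via the closed-form 2x2 eigenproblem."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n                      # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                      # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n    # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]].
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector, normalized to unit length.
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Synthetic features that vary together: the principal axis is near the y = x diagonal.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
print(principal_axis(xs, ys))
```

Projecting records onto such top components is what lets PCA discard noisy or meaningless attributes before classification.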


REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining Associations between Sets of Items in Massive Databases. In Proceedings of the ACM SIGMOD 1993 International Conference on Management of Data, pages 207-216.

Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, pages 487-499.

Ajayi, A., Idowu, S.A., and Anyaehie, A. (2013). Comparative study of selected data mining algorithms used for intrusion detection. International Journal of Soft Computing and Engineering (IJSCE).

Amoroso, E.G. (1999). Intrusion detection: an introduction to internet surveillance, correlation, traceback, traps, and response. Intrusion.Net Books, NJ.

Axelsson, S. (2000). “The base-rate fallacy and the difficulty of intrusion

detection”, ACM Trans. Information and System Security 3 (3), pp. (186-

205).

Barbarà, D., Couto, J., Jajodia, S., Popyack, L., And Wu,N., ADAM: Testbed for

Exploring the Use of Data Mining in Intrusion Detection, ACM SIGMOD

Record, 30(4), 2001,pp. 15-24.


Barbara, D., Wu, N. and Jajodia, S. [2001]. “Detecting Novel Network Intrusions

Using Bayes Estimators”, Proceedings Of the First SIAM Int. Conference

on Data Mining, (SDM 2001), Chicago, IL.

Berry, M. J. A. and Linoff, G. (1997). Data Mining Techniques. John Wiley and Sons, Inc.

Mukherjee, B., Heberlein, L.T., and Levitt, K. (1994). "Network Intrusion Detection", IEEE Network, June 1994.

Carbone, P. L. (1997). “Data mining or knowledge discovery in databases: An

overview”, In Data Management Handbook, New York: Auerbach

Publications.

Chen, W.H., Hsu, S.H., and Shen, H.P. (2005). Application of SVM and ANN for

intrusion detection. Comput. Oper. Res., 32: 2617-2634.

Chittur, A., ”Model generation for an intrusion detection system using genetic

algorithms”, High School Honors Thesis, Ossining High School. In

cooperation with Columbia Univ, 2001

Cohen, W. W. (1995). Fast Effective Rule Induction. In Proceedings of the 12th International Conference on Machine Learning, pages 115-123.

Elmasri, R. and Navathe, S. B. (1994). Fundamentals of Database Systems. Addison-Wesley.
Crosbie, M. and E. H. Spafford, ”Active defense of a computer system using

autonomous agents”, Technical Report CSD-TR- 95-008, Purdue Univ.,

West Lafayette, IN, 15 February 1995.

D’silva, M., Deepali, V., (2013) “Comparative Study of Data Mining Techniques

to Enhance Intrusion Detection”. IJERA, Vol. 3, Issue 1, January –

February 2013.

Dasgupta, D. and F. A. Gonzalez, "An intelligent decision support system for intrusion detection and response", In Proc. of International Workshop on Mathematical Methods, Models and Architectures for Computer Networks Security (MMM-ACNS), St. Petersburg, Springer, 21-23 May, 2001.

Dickerson, J. E. and J. A. Dickerson, "Fuzzy network profiling for intrusion detection", In Proc. of NAFIPS 19th International Conference of the North American Fuzzy Information Processing Society, Atlanta, pp. 301-306. North American Fuzzy Information Processing Society (NAFIPS), July 2000.

Didaci, L., Giacinto, A. & Roli, F. (2002). “Ensemble learning for intrusion

detection in computer networks”, Proceedings of AI*IA, Workshop on

“Apprendimento automatico: metodi e applicazioni”, Siena, Italy.


Eitel and Giri, (2008) “A Comparative Study Of Data Mining Algorithms For

Network Intrusion Detection In The Presence Of Poor Quality Data”.

ICIQ-03, 2008.

Bloedorn, E., et al. (2001). "Data Mining for Network Intrusion Detection: How to Get Started". Technical paper.

Eskin, E., Arnold, A., Prerau, M., Portnoy, L., and Stolfo, S. J., A Geometric

Framework for Unsupervised Anomaly Detection: Detecting Intrusions in

Unlabeled Data, In D.Barbarà and S. Jajodia (eds.), Applications of Data

Mining in Computer Security, Kluwer Academic Publishers, Boston, MA,

2002, pp. 78-99.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors (1996b). Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press.

Fayyad, U. (1998). Mining Databases: Towards Algorithms for Knowledge

Discovery. Bulletin of the IEEE Computer Society Technical Committee

on Data Engineering, 22(1):39-48.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996a). From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3):37-54.


Fayyad, U. M., G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM 39 (11), November 1996, 27-34.

G. J. Klir, "Fuzzy arithmetic with requisite constraints", Fuzzy Sets and Systems, 91:165-175, 1997.

Ghosh, A. K., A. Schwartzbard, and M. Schatz,” Learning program behavior

profiles for intrusion detection”, In Proc. 1st USENIX, 9- 12 April, 1999.

Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques, Morgan

Kaufmann Publisher.

Heady, R., Luger, G., Maccabe, A., and Servilla, M. (1990). The architecture of a network level intrusion detection system. Technical Report, Department of Computer Science, University of New Mexico, August 1990.

Heba Ezzat Ibrahim, Sherif M. Badr, and Mohamed A. Shaheen, (2012)

“Adaptive Layered Approach using Machine Learning Techniques with

Gain Ratio for Intrusion Detection Systems”. IJCA, Volume 56 – No.7,

October 2012.

Jaiganesh, V., Sumathi, P., and Vinitha, A.,(2013) “Classification Algorithms in

Intrusion Detection System: A Survey”. IJCTA, Vol 4(5), September –

October, 2013.
Jaiganesh, V., Mangayarkarasi, S., and Sumathi, P., (2013) “Intrusion Detection

Systems: A Survey and Analysis of Classification Techniques”.

IJARCCE, Vol. 2, Issue 4, April 2013.

James Cannady (1998). Artificial Neural Networks for Misuse Detection.

National Information Systems Security Conference.

James Cannady, Jay Harrell (1996). A comparative Analysis of current Intrusion

Detection Technologies.

Jimmy Shum and Heidar A. Malki,“Network Intrusion Detection System Using

Neural Networks” Fourth International Conference on Natural

Computation in IEEE 2008.

KDD Cup 1999. Available on:

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, December

2009.

Kesavulu, E., Reddy, V. N. and Rajulu, P. G. (2011). “A Study of Intrusion

Detection in Data Mining”. Proceedings of the World Congress on

Engineering 2011 Vol IIIWCE 2011, July 6 - 8, 2011,London, U.K.

Kumar, S., ”Classification and Detection of Computer Intrusion”, PhD. thesis,

1995, Purdue Univ., West Lafayette, IN.

Krzysztof, G. and Norbert, J. (2012). "Feature selection with Decision Tree Criterion".
Landwehr, C.E., Bull, A.R., McDermott, J.P., and Choi, W.S.(1994). A taxonomy

of computer program security flaws. ACM Comput. Surv.,

vol.26,no.3,pp.211–254,1994.

Lane, T. D. (2000). “Machine Learning Techniques for the computer security

domain of anomaly detection”, Ph.D. Thesis, Purdue Univ., West

Lafayette, IN.

Lee, W (1999). A Data Mining Framework for Constructing Features and Models

for Intrusion Detection Systems. PhD Thesis, Computer Science

Department, Columbia University.

Lee, W., Stolfo, S.J. & Mok, K.W. (1999). “Mining in a data-flow environment:

Experience in network intrusion detection,” (Chaudhuri, S. & Madigan, D.

Eds.). Proc. of the Fifth International Conference on Knowledge

Discovery and Data Mining (KDD-99) (pp. 114-124), San Diego, CA:

ACM,

Lee, W. and S. J. Stolfo, ”Data mining approaches for intrusion detection”, In

Proc. of the 7th USENIX Security Symp., San Antonio, TX.USENIX,

1998.

Lee, W. , S.J.Stolfo et al, ”A data mining and CIDF based approach for detecting

novel and distributed intrusions”, Proc. of Third International Workshop


on Recent advances in Intrusion Detection (RAID 2000), Toulouse,

France.

Lee, W., S. J. Stolfo, and K. W. Mok, "Mining in a data-flow environment: Experience in network intrusion detection," In S. Chaudhuri and D. Madigan (Eds.), Proc. of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, CA, pp. 114-124. ACM, 12-15 August 1999.

Lee, W., S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: A data mining approach," Artificial Intelligence Review 14 (6), 533-567, 2000.

Lunt, T.F. (1989). Real -Time Intrusion Detection. Proceedings from IEEE

COMPCON.

Mannila, H. (1996). Data Mining: Machine Learning, Statistics, and Databases. In Proceedings of the 8th International Conference on Scientific and Statistical Database Management, pages 1-8.

Mannila, H., Smyth, P., and Hand, D. J. (2001). Principles of Data Mining. MIT Press.

Mannila, H., Toivonen, H., and Verkamo, A. I. (1997). Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery, 1:259-289.

Markou, M. and Singh, S., Novelty Detection: A review, Part 1: Statistical

Approaches, Signal Processing, 8(12), 2003, pp. 2481-2497.


Miller, R. and Yang, T. (1997). Association Rules over Interval Data. In Proceedings of the 1997 ACM-SIGMOD Conference on Management of Data, pages 452-461.

Mitchell Rowton (2005). Introduction to Network Security Intrusion Detection.

Mounji, A. (1997). Languages and Tools for Rule-Based Distributed Intrusion

Detection. PhD thesis, Faculties Universitaires Notre-Dame dela Paix

Namur (Belgium).

Mukkamala, S., Janoski, G., and Sung, A. (2002). Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of IEEE International Joint Conference on Neural Networks, pp. 1702-1707, 2002.

Mukkamala, S., Sung, A.H., Abraham, A., (2003) Intrusion detection using

ensemble of soft computing paradigms, third international conference on

intelligent systems design and applications, intelligent systems design and

applications, advances in soft computing. Germany: Springer;2003.p.239–

48.

Mukkamala, S., Sung, A.H., Abraham, A. (2004a). Modeling intrusion detection

systems using linear genetic Programming approach, The 17th

international conference on industrial & engineering applications of

artificial intelligence and expert systems, innovations in applied artificial

intelligence. In:Robert, O., Chunsheng, Y., Moonis, A., editors. Lecture


Notes in Computer Science, vol.3029. Germany:Springer; 2004a. p.633–

42.

Mukkamala, S., Sung, A.H., Abraham, A., Ramos, V.(2004b) Intrusion detection

systems using adaptive regression splines. In: SerucaI, Filipe, J.,

Hammoudi, S., Cordeiro, J., editors. Proceedings of the 6th international

conference on enterprise information systems, ICEIS’04, vol.3, Portugal.

2004b. p.26–33[ISBN:972-8865-00-7].

Nadiammai, G.V., Krishnaveni, S., and Hemalatha, M. (2011). "A comprehensive analysis and study in intrusion detection system using data mining techniques". IJCA, Volume 35, No. 8, December 2011.

Narayana, M.S., Prasad, B.V.V.S., Srividhya, A., and Pandu Ranga, R.K. (2011). International Journal of Computer Science and Telecommunications, Vol. 2, Issue 6, September 2011, pp. 8-14. ISSN 2047-3338.

Neri, F., "Comparing local search with respect to genetic evolution to detect intrusion in computer networks", In Proc. of the 2000 Congress on Evolutionary Computation CEC00, La Jolla, CA, pp. 238-243. IEEE Press, 16-19 July, 2000.

Neri, F., "Mining TCP/IP traffic for network intrusion detection", In R. L. de Mantaras and E. Plaza (Eds.), Proc. of Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Volume 1810 of Lecture Notes in Computer Science, Barcelona, Spain, pp. 313-322. Springer, May 31 - June 2, 2000.

Noel, S., Wijesekera, D., and Youman, C., Modern Intrusion Detection, Data

Mining, and Degrees of Attack Guilt, In D. Barbarà and S. Jajodia (eds.),

Applications of Data Mining in Computer Security, Kluwer Academic

Publishers, Boston, MA, 2002, pp. 2-25.

Patel, H., Sarkhedi, B., and Vaghamshi, H., (2013) “Intrusion Detection in Data

Mining with Classification Algorithm”. IJAREEIE, Vol. 2, Issue7, July

2013.

Patel, R.,Thakkar, A., Ganatra, A., (2012) “ A Survey and Comparative Analysis

of Data Mining Techniques for Network Intrusion Detection

Systems”.IJSCE, Volume-2, Issue-1, March 2012.

Garcia-Teodoro, P. and Diaz-Verdejo, J. (2009). "Anomaly-based network intrusion detection: Techniques, systems and challenges". Computers & Security, Elsevier, 2009.

Phurivit Sangkatsanee, Naruemon Wattanapongsakorn and Chalermpol

Charnsripinyo (2012). Real-time Intrusion Detection and Classification.

SANS, "FAQ: Data Mining in Intrusion Detection", http://www.sans.org/security-resources/idfaq/data_mining.php

Shyu, M., Chen, S., Sarinnapakorn, K. and Chang, L. (2003). A novel Anomaly

detection scheme based on principal component classifier, Proceedings of


the IEEE Foundations and New Directions of Data Mining Workshop, in

conjunction with the Third IEEE International Conference on DataMining

(ICDM03), pp.172–179, 2003.

Stefan, A., (2000). “Intrusion Detection Systems: A Survey and Taxonomy”.

Chalmers University of Technology, Sweden.

Summers, R.C. Secure computing: threats and safeguards. NewYork: McGraw-

Hill; 1997.

Sundaram, A. (1996). An introduction to intrusion detection. ACM Cross Roads

1996;2(4).

Tavallaee,M. , Bagheri, E. , Lu,W. & Ghorbani, A. (2009). “A Detailed

Analysis of the KDD CUP 99 Data Set,” Submitted to Second IEEE

Symposium on Computational Intelligence for Security and Defense

Applications (CISDA), 2009.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, 1995.

Yimin Wu, (2004). High-dimensional Pattern Analysis in Multimedia

Information Retrieval and Bioinformatics, Doctoral Thesis, State

University of New York, January 2004.

Zhao, J., Chen, M., and Lou, Q. (2011). Research of intrusion detection system based on neural networks. Proceedings of IEEE 3rd International Conference on Communication Software and Networks, May 27-29, 2011, Xi'an, China, pp. 174-178.
