Professional Documents
Culture Documents
ABSTRACT Machine learning techniques are being widely used to develop an intrusion detection sys-
tem (IDS) for detecting and classifying cyberattacks at the network-level and the host-level in a timely
and automatic manner. However, many challenges arise since malicious attacks are continually changing
and are occurring in very large volumes requiring a scalable solution. There are different malware datasets
available publicly for further research by cyber security community. However, no existing study has shown
the detailed analysis of the performance of various machine learning algorithms on various publicly available
datasets. Due to the dynamic nature of malware with continuously changing attacking methods, the malware
datasets available publicly are to be updated systematically and benchmarked. In this paper, a deep neural
network (DNN), a type of deep learning model, is explored to develop a flexible and effective IDS to detect
and classify unforeseen and unpredictable cyberattacks. The continuous change in network behavior and
rapid evolution of attacks makes it necessary to evaluate various datasets which are generated over the
years through static and dynamic approaches. This type of study facilitates to identify the best algorithm
which can effectively work in detecting future cyberattacks. A comprehensive evaluation of experiments of
DNNs and other classical machine learning classifiers are shown on various publicly available benchmark
malware datasets. The optimal network parameters and network topologies for DNNs are chosen through the
following hyperparameter selection methods with KDDCup 99 dataset. All the experiments of DNNs are run
till 1,000 epochs with the learning rate varying in the range [0.01–0.5]. The DNN model which performed
well on KDDCup 99 is applied on other datasets, such as NSL-KDD, UNSW-NB15, Kyoto, WSN-DS, and
CICIDS 2017, to conduct the benchmark. Our DNN model learns the abstract and high-dimensional feature
representation of the IDS data by passing them into many hidden layers. Through a rigorous experimental
testing, it is confirmed that DNNs perform well in comparison with the classical machine learning classifiers.
Finally, we propose a highly scalable and hybrid DNNs framework called scale-hybrid-IDS-AlertNet which
can be used in real-time to effectively monitor the network traffic and host-level events to proactively alert
possible cyberattacks.
INDEX TERMS Cyber security, intrusion detection, malware, big data, machine learning, deep learning,
deep neural networks, cyberattacks, cybercrime.
Such cyberattacks are constantly evolving with very sophis- required to persevere today’s rapidly increasing high-speed
ticated algorithms with the advancement of hardware, soft- network size, speed and dynamics. These challenges form
ware, and network topologies including the recent devel- the prime motivation for this work with a research focus on
opments in the Internet of Things (IoT) [4]. Malicious evaluating the efficacy of various classical machine learning
cyberattacks pose serious security issues that demand the classifiers and deep neural networks (DNNs) applied to NIDS
need for a novel, flexible and more reliable intrusion detec- and HIDS. This work assumes the following;
tion system (IDS). An IDS is a proactive intrusion detection • An attacker aims at pretence as normal user to remain
tool used to detect and classify intrusions, attacks, or viola- hidden from the IDS. However, the patterns of intru-
tions of the security policies automatically at network-level sive behaviors differ in some aspect. This is due to the
and host-level infrastructure in a timely manner. Based specific objective of an attacker for example getting an
on intrusive behaviors, intrusion detection is classified unauthorized access to computer and network resources.
into network-based intrusion detection system (NIDS) and • The usage pattern of network resources can be captured,
host-based intrusion detection system (HIDS) [5]. An IDS however the existing methods ends up in high false
system which uses network behavior is called as NIDS. The positive rate.
network behaviors are collected using network equipment via • The patterns of intrusions exist in normal traffic with a
mirroring by networking devices, such as switches, routers, very low profile over long time interval.
and network taps and analyzed in order to identify attacks and Overall, this work has made the following contributions to
possible threats concealed within in network traffic. An IDS the cyber security domain:
system which uses system activities in the form of various • By combining both NIDS and HIDS collaboratively,
log files running on the local host computer in order to detect an effective deep learning approach is proposed by
attacks is called as HIDS. The log files are collected via modeling a deep neural network (DNN) to detect
local sensors. While NIDS inspects each packet contents in cyberattacks proactively. In this study, the efficacy
network traffic flows, HIDS relies on the information of log of various classical machine learning algorithms and
files which includes sensors logs, system logs, software logs, DNNs are evaluated on various NIDS and HIDS datasets
file systems, disk resources, users account information and in identifying whether network traffic behavior is either
others of each system. Many organizations use a hybrid of normal or abnormal due to an attack that can be classi-
both NIDS and HIDS. fied into corresponding attack categories.
Analysis of network traffic flows is done using misuse • The advanced text representation methods of natural
detection, anomaly detection and stateful protocol analysis. language processing (NLP) are explored with host-level
Misuse detection uses predefined signatures and filters to events, i.e. system calls with the aim to capture the
detect the attacks. It relies on human inputs to constantly contextual and semantic similarity and to preserve the
update the signature database. This method is accurate in sequence information of system calls. The comparative
finding the known attacks but is completely ineffective in the performance of these methods is conducted with the
case of unknown attacks. Anomaly detection uses heuristic ADFA-LD and ADFA-WD datasets.
mechanisms to find the unknown malicious activities. In most • This study uses various benchmark datasets to conduct a
of the scenarios, anomaly detection produces a high false comparative experimentation. This is mainly due to the
positive rate [5]. To combat this problem, most organizations reason that each dataset suffers from various issues such
use the combination of both the misuse and anomaly detec- as data corruptions, traffic variety, inconsistencies, out
tion in their commercial solution systems. Stateful protocol of date and contemporary attacks.
analysis is most powerful in comparison to the aforemen- • A scalable hybrid intrusion detection framework called
tioned detection methods due to the fact that stateful protocol SHIA is introduced to process large amount of
analysis acts on the network layer, application layer and network-level and host-level events to automatically
transport layer. This uses the predefined vendors specification identify malicious characteristics in order to provide
settings to detect the deviations of appropriate protocols and appropriate alerts to the network admin. The proposed
applications. Though deep learning approaches are being framework is highly scalable on commodity hardware
considered more recently to enhance the intelligence of such server and by joining additional computing resources to
intrusion detection techniques, there is a lack of study to the existing framework, the performance can be further
benchmark such machine learning algorithms with publicly enhanced to handle big data in real-time systems.
available datasets. The most common issues in the exist- The code and detailed results are made publicly
ing solutions based on machine learning models are: firstly, available [7] for further research. The remainder of the chap-
the models produce high false positive rate [3], [5] with wider ter is organized as follows. Section II discusses various
range of attacks; secondly, the models are not generalizable stages of compromise according to attackers perspective.
as existing studies have mainly used only a single dataset Section III discusses the related works of similar research
to report the performance of the machine learning model; work done to NIDS and HIDS. Information of scalable
thirdly, the models studied so far have completely unseen framework, the mathematical details of DNNs and text
today’s huge network traffic; and finally the solutions are representation methods for intrusion detection is placed
in Section IV. Section V includes information related to over a network or locally within a system. Now an attack can
major shortcomings of IDS datasets, problem formulation be defined as a set of actions that potentially compromises
and statistical measures. Section VI includes description of the confidentiality, data integrity, availability, or any kind of
datasets. Section VII and Section VIII includes experimental security policy of a resource. Primarily, an IDS system aims
analysis and a brief overview of proposed system and archi- at detecting all these types of attacks to prevent the com-
tecture design. Section IX presents the experimental results. puters and networks from malicious activities. In this work,
Conclusion, future work directions and discussions are placed we focus towards the categorization scheme as suggested by
in Section X. the DARPA Intrusion Detection Evaluation.
detection, anomaly detection, and malicious domain name the feature relevance in terms of information gain. In addi-
detection [11]. This work focuses towards analyzing the tion, they presented the most relevant features for each class
effectiveness of various classical machine learning and deep label. Reference [21] discussed random forest techniques in
neural networks (DNNs) for NIDS with the publicly avail- misuse detection by learning patterns of intrusions, anomaly
able network-based intrusion datasets such as KDDCup 99, detection with outlier detection mechanism, and hybrid detec-
NSL-KDD, Kyoto, UNSW-NB15, WSN-DS and CICIDS tion by combining both the misuse and anomaly detection.
2017. They reported that the misuse approach worked better than
A large study of academic research used the de facto stan- winning entries of KDDCup 99 challenge results, and in
dard benchmark data, KDDCup 99 to improve the efficacy of addition anomaly detection worked better compared to other
intrusion detection rate. KDDCup 99 was used for the third published unsupervised anomaly detection methods. Overall,
International Knowledge Discovery and Data Mining Tools it was concluded that the hybrid system enhances the perfor-
Competition and the data was created as the processed form mance with the advantage of combining both the misuse and
of tcpdump data of the 1998 DARPA intrusion detection (ID) anomaly detection approaches [22], [23], [72]. In [24], an ID
evaluation network. The aim of the contest was to create algorithm using AdaBoost technique was proposed that used
a predictive model to classify the network connections into decision stumps as weak classifiers. Their system performed
two classes: Normal or Attack. Attacks were categorized into better than other published results with a lower false alarm
denial of service (’DoS’), ’Probe’, remote-to-local (’R2L’), rate, a higher detection rate, and a computationally faster
user-to-root (’U2R’) categories. The mining audit data for algorithm. However, the drawback is that it failed to adopt
automated models for ID (MADAMID) was used as feature the incremental learning approach. In [25], the performance
construction framework in KDDCup 99 competition [17]. of the shared nearest neighbor (SNN) based model in ID was
MADAMID outputs 41 features: first 9 features are basic studied and reported as the best algorithm with a high detec-
features of a packet, 10-22 are content features, 23-31 are traf- tion rate. With the reduced dataset they were able to conclude
fic features, and 32-41 are host-based features. The choices that SNN performed well in comparison to the K-means for
of available datasets are: (1) full dataset and (2) comple- ’U2R’ attack category. However, their work failed to show
mentary 10% data. The detailed evaluation results of KDD- the results on the entire testing dataset.
Cup 98 and KDDCup 99 challenge was published in [3]. In [26], Bayesian networks for ID was explored using
Totally, 24 entries were submitted in the KDDCup 98, in that Naive Bayesian networks with a root node to represent a
3 winning entries used variants of decision tree to whom class of a connection and leaf nodes to represent features
they showed only the marginal statistics significance in per- of a connection. Later, [27] investigated the application of
formance. The 9th winning entry in the contest used the Naive Bayes network to ID and through detailed experi-
1-nearest neighbor classifier. The first significant difference mental analysis, they showed that Bayesian networks per-
in performance was found between 17th and 18th entries. formed equally well and sometimes even better in ’U2R’ and
This inferred that the first 17 submissions method were robust ’Probe’ categories in comparison with the winning entries
and were profiled by [3]. The Third International Knowledge of KDDCup 99 challenge. In [28], a non-parametric den-
Discovery and Data Mining Tools Competition task remained sity estimation method based on Parzen-window estimators
as a baseline work and after this contest many machine learn- was studied with Gaussian kernels and Normal distribution.
ing solutions have been found. Most of the published results Without the intrusion data, their system was comparatively
took only the 10% data of training and testing and few of favorable to the existing winning entries that was based on
them used custom-built datasets. Recently, a comprehensive ensemble of decision trees. In [29], a genetic algorithm based
literature survey on machine learning based ID with KDDCup NIDS was proposed that facilitates to model both tempo-
99 dataset was conducted [18]. ral and spatial information to identify complex anomalous
After the challenge, most of the published results of behavior. An overview of ensemble learning techniques for
KDDCup 99 have used several feature engineering meth- ID was given in [30], and swarm intelligence techniques for
ods for dimensionality reduction [18]. While few studies ID using ant colony optimization, ant colony clustering and
employed custom-built datasets, majority used the same particle swarm optimization of systems were studied in [31].
dataset for newly available machine learning classifiers [18]. A comparative study in such research works show that the
These published results are partially comparable to the results descriptive statistics was predominantly used.
of the KDDCup 99 contest. Overall, a comprehensive literature review shows very
In [19], the classification model consists of two-stages: few studies use modern deep learning approaches for NIDS
i) P-rules stage to predict the presence of the class, and and the commonly used benchmark datasets for experimen-
ii) N-rules stage to predict the absence of the class. This tal analysis are KDDCup 99 and NSL-KDD [3], [32]–[34].
performed well in comparison with the aforementioned The IDS based on recurrent neural network (RNN) out-
KDDCup 99 results except for the user-to-root (’U2R’) cat- performed other classical machine learning classifiers in
egory. In [20], the significance of feature relevance analysis identifying intrusion and intrusion type on the NSL-KDD
was investigated for IDS with the most widely used dataset, dataset [32]. Two level approach proposed for IDS in which
KDDCup 99. For each feature they were able to express the first level extracts the optimal features using sparse
autoencoder in an unsupervised way and classified using communication with the operating system via system calls,
softmax regression [33]. The application of stacked autoen- their behavior, ordering, type and length generates a unique
coder was proposed for optimal feature extraction in an trace. This can be used to distinguish between the known
unsupervised way where the proposed method is completely and unknown applications [12]. System calls of normal and
non-symmetric and classification was done using Random intrusive process are entirely different. Thus analysis of those
forest. Novel long short-term memory (LSTM) architecture system calls provides significant information about the pro-
was proposed and by modeling the network traffic informa- cesses of a system. Various feature engineering approaches
tion in time series obtained better performance. The proposed have been used for system call based process classification.
method performed well compared to all the existing methods They are N-gram [12], [13], sequence gram [14] and pair
and as well as KDDCup 98 and 99 challenge entries [3]. The gram [15]. An important advantage of HIDS is that it provides
performance of various RNN types were evaluated by [34]. detailed information about the attacks.
Various deep learning architectures and classical machine The three main components of HIDS, namely the data
learning algorithms were evaluated for anomaly based ID source, the sensor, and the decision engine play an important
on NSL-KDD dataset [74]. The configuration of SVM was role in detecting security intrusions. The sensor component
formulated as bi-objective optimization problem and solved monitors the changes in data source, and the decision engine
using hyper-heuristic framework. The performance was eval- uses the machine learning module to implement to the intru-
uated for malware and anomaly ID. The proposed framework sion detection mechanism. However, the benchmarking the
is very suitable for big data cyber security problems [75]. data source component requires much investigation.
To enhance the anomaly based ID rate, the spatial and Compiling the KDDCup 99 dataset involved the
temporal features were extracted using convolutional neu- data source component with system calls and Sequence
ral network and long short-term memory architecture. The Time-Delay Embedding (STIDE) approach used to analyze
performance was shown on both KDDCup 99 and ISCX the fixed length pattern of system calls to distinguish between
2012 datasets [76]. Two step attack detection method was normal and anomalous behaviors [13]. A large number of
proposed along with a secure communication protocol for decision engines have been used to analyze patterns of
big data systems to identify insider attack. In the first step, system calls to detect intrusions. Such a data source is
process profiling was done independently at each node and most commonly used among cyber security research com-
in second step using hash matching and consensus, process munity. Apart from system calls, since Windows operating
matching was done [77]. An online detection and estimation system (OS) does not provide a direct access to system calls,
method was proposed for smart grid system [78]. The method log entries [35] and registry entry manipulations [36] form
specifically designed for identifying false data injection and the other two most commonly used data sources. This work
jamming attacks in real-time and additionally provides online focuses on the decision engine component to benchmark the
estimates of the unknown and time-varying attack parameters data source.
and recovered state estimates [78]. A scalable framework Classical methods aim to find information about the nature
for ID over vehicular ad hoc network was proposed. The of the host activity by analyzing the patterns in the sequence
framework uses distributed machine learning i.e. alternating of system calls. While STIDE was most commonly used
direction method of multipliers (ADMM) to train a machine simple algorithm, Support Vector Machines (SVMs), Hid-
learning model in a distributed way to learn whether an den Markov Models (HMMs) and Artificial Neural Net-
activity normal or attack [79]. works (ANNs) are more recently adopted complex meth-
ods. In [37], N-gram feature extraction approach was used
B. HOST-BASED INTRUSION DETECTION SYSTEMS (HIDS) for compiling the ADFA-LD system call data and N-gram
Various software tools such as Metasploit, Sqlmap, Nmap, features were passed to different classical machine learning
Browser Exploitation provide the necessary framework to classifiers to identify and categorize attacks. In [39], in order
examine and gather information from target system vulner- to reduce the dimensions of system calls, K-means and KNN
abilities. Malicious attackers use such information to launch were experimented using a frequency based model. A revised
attacks to various applications like FTP server, web server, version of N-gram model was used in [38] to represent system
SSH server, etc. Existing methods such as firewall, cryptogra- calls with various classical machine learning classifiers for
phy methods and authentications aim to defend host systems both Binary and Multi-class categories. An approach for
against such attacks. However, these solutions have limita- HIDS based on N-gram system call representations with
tions and malicious attackers are able to gain unauthorized various classical machine learning classifiers was proposed
access to the system. To address this, a typical HIDS operates in [40]. To reduce the dimensions of N-gram, dimensionality
at host-level by analyzing and monitoring all traffic activities reduction methods were employed. In [41], frequency dis-
on the system application files, system calls and operating tribution based feature engineering approach with machine
system [73]. These types of traffic activities are typically learning algorithms was explored to handle the zero-day and
called as audit trials. A system call of an operating system is stealth attacks in Windows OS. In [42], an ensemble approach
a key feature that interacts between the core kernel functions for HIDS was proposed using language modeling to reduce
and low level system applications. Since an application makes the false alarm rates which is a drawback in classical methods.
This method leveraged the semantic meaning and communi- The proposed scalable architecture employs distributed
cations of system call. The effectiveness of their methods was and parallel machine learning algorithms with various opti-
evaluated on three different publicly available datasets. mization techniques that makes it capable of handling very
Overall, the published results are limited in detecting the high volume of network and host-level events. The scalable
intrusions and cyberattacks using HIDS. Studies that show an architecture also leverages the processing capability of the
increase in detection rate of intrusions and cyberattacks also general purpose graphical processing unit (GPGPU) cores for
show an increase in false alarm rate. faster and parallel analysis of network and host-level events.
The pros and cons of NIDS and HIDS with its efficacy are The framework contains two types of analytic engines, they
discussed in detail by [16]. Major advantages of HIDSs are: are real-time and non-real-time. The purpose of analytic
HIDSs facilitate to detect local attacks and are unaffected by engine is to monitor network and host-level events to generate
the encryption of network traffic. Major disadvantage is that an alert for an attack. The developed framework can be scaled
they need all the configuration files to identify attack, but it out to analyze even larger volumes of network event data by
is a daunting task due to the huge amount of data. Allowing adding additional computing resources. The scalability and
access to big data technology in the domain of cyber security real-time detection of malicious activities from early warning
is of paramount importance, particularly IDS. The motivation signals makes the developed framework stand out from any
of this research is to develop a novel scalable platform with system of similar kind.
hybrid framework of NIDS and HIDS, which is capable of
handling large amount of data with the aim to detect the B. TEXT REPRESENTATION METHODS
intrusions and cyberattacks more accurately. System calls are essential in any operating system depict-
ing the computer processes and they constitute a humon-
gous amount of unstructured and fragmented texts that a
IV. PROPOSED SCALABLE FRAMEWORK typical HIDS uses to detect intrusions and cyberattacks.
Today’s ICT system is considerably more complex, con- In this research we consider text representation methods
nected and involved in generating extremely large volume to classify the process behaviors using system call trace.
of data, typically called as big data. This is primarily due to Classical machine learning approaches adopt feature extrac-
the advancement in technologies and rapid deployments of tion, feature engineering and feature representation meth-
large number of applications. Big data is a buzzword which ods. However, with advanced machine learning embedded
contains techniques to extract important information from approach such as deep learning, the necessity of the feature
large volume of data. Allowing access to big data technology engineering and feature extraction steps can be completely
in the domain cyber security particularly IDS is of paramount avoided. We adopt such advanced deep learning along with
importance [44]. The advancement in big data technology text representation methods to capture the contextual and
facilitates to extract various patterns of legitimate and mali- sequence related information from system calls. The follow-
cious activities from large volume of network and system ing feature representation methods in the field of NLP are
activities data in a timely manner that in turn facilitates used to convert the system calls into feature vectors in this
to improve the performance of IDS. However, processing study.
of big data by using the conventional technologies is often
• Bag-of-Words (BoW): This classical and most com-
difficult [43]. The purpose of this section is to describe the
monly used representation method is used to form
computing architecture and the advanced methods adopted in
a dictionary by assigning a unique number for each
the proposed framework, such as text representation methods,
system call. Term document matrix (TDM) and term
deep neural networks (DNNs) and the training mechanisms
frequency-inverse document frequency (TF-IDF) are
employed in DNNs.
employed to estimate the feature vectors. The drawback
is that it cannot capture the sequence information of
A. SCALABLE COMPUTING ARCHITECTURE system calls [46].
The technologies such as Hadoop Map reduce and Apache • N-grams: An N-gram text representation method has
Spark in the field of high performance computing is found the capability to preserve the sequence information of
to be an effective solution to process the big data and to pro- system calls. The size of N can be 1 (uni-gram), 2 (bi-
vide timely actions. We have developed scalable framework gram), 3 (tri-gram), 4 (four-gram), etc., which can be
based on big data techniques, Apache Spark cluster com- employed appropriately depending on the context.
puting platform [45]. Due to the confidential nature of the • Keras Embedding: This follows a sequential represen-
research, the scalable framework details cannot be disclosed. tation method to convert the system calls into a numeric
The Apache spark cluster computing framework is setup over form of vocabulary by simply assigning a unique num-
Apache Hadoop Yet Another Resource Negotiator (YARN). ber for each system call. The size of vocabulary defines
This framework facilitates to efficiently distribute, execute the number of unique system calls and their frequency
and harvest tasks. Each system has specifications(32 GB of occurrence places them in an ascending order within a
RAM, 2 TB hard disk, Intel(R) Xeon(R) CPU E3-1220 v3 lookup table. Each system call in a vector is transformed
@ 3.10GHz) running over 1 Gbps Ethernet network. to a numeric using the lookup table for assigning a
this [53] used Snort ID system on DARPA / KDDCup 98 tcp- the case of Binary classification. Let L and A be the number
dump traces. The system performed poorly, the accuracy and of Normal and Attack connection records in the test dataset,
the false positive rates were impermissible. This is mainly due respectively and the following terms are used for determining
to the system failed to detect the attacks belongs to the ’DoS’ the quality of the classification models:
and ’Probe’ category with a fixed signature. In contrast to this, • True Positive (TP) - the number of connection records
the detection performance of ’R2L’ and ’U2R’ is much better. correctly classified to the Normal class.
Despite of the harsh criticisms yet, KDDCup 99 is been • True Negative (TN ) - the number of connection records
the most widely used reliable benchmark dataset in most of correctly classified to the Attack class.
the study related to ID system evaluation and other security • False Positive (FP) - the number of Normal connec-
related tasks [55]. To resolve the inherent issues that are exists tion records wrongly classified to the Attack connection
in KDDCup 99, [55] proposed a most refined version called record.
NSL-KDD. They removed the redundant connection records • False Negative (FN ) - the number of Attack connection
in the entire train and test data and in addition the invalid records wrongly classified to the Normal connection
records, numbered 136,489 and 136,497 were removed from record.
test data. Thus it protects the classifier not to be biased Based on the aforementioned terms, the following most
in the direction of the more frequent connection records. commonly used evaluation metrics are considered.
NSL-KDD is also not a real world representative of network
1) Accuracy: It estimates the ratio of the correctly recog-
traffic data. Still, this refined version failed to entirely solve
nized connection records to the entire test dataset. If the
the issues reported by [57]. To enhance the performances in
accuracy is higher, the machine learning model is better
detecting attacks, 10 more extra features were added with
(Accuracy ∈ [0, 1]). accuracy serves as a good measure
14 important features from KDDCup 99 [58]. The Kyoto
for the test dataset that contains balanced classes and
dataset was generated using honeypots. Thus, each flow of
defined as follows
network traffic was done automatically. The normal traf-
fic of Kyoto dataset was not captured from the real world TP + TN
Accuracy = (15)
network traffic. Moreover, the dataset doesn’t contain false TP + TN + FP + FN
positives that help to minimize the number of alerts to the 2) Precision: It estimates the ratio of the correctly iden-
network admin [58]. In [56] generated a new dataset follow- tified attack connection records to the number of
ing two different profile system in which one system was all identified attack connection records. If the Preci-
for generating attacks and other one was for normal activi- sion is higher, the machine learning model is better
ties. This dataset doesn’t contain network traffic of HTTPS (Precision ∈ [0, 1]). Precision is defined as follows
protocol. Most of the attacks were simulated and failed to
TP
preserve the characteristics of real world statistics. In [57] Precision = (16)
proposed UNSW-NB15. They adopted the notion of profiles TP + FP
that contains the comprehensive information of intrusions and 3) F1-Score: F1-Score is also called as F1-Measure. It is
applications, protocols, or lower level network entities from the harmonic mean of Precision and Recall. If the
the modern network traffic and detailed information about F1-Score is higher, the machine learning model is better
the network traffic. Recently, to provide benchmark dataset (F1−Score ∈ [0, 1]). F1-Score is defined as follows
to the research community, [59] generated reliable dataset.
Pr ecision × Recall
This meets the real world benign and attacks of network F1 − Score = 2 × (17)
Pr ecision + Recall
activities. Moreover, the detailed evaluation of network traffic
features was done by them and detailed experiments towards 4) True Positive Rate (TPR): It is also called as Recall.
the importance of features to detect various attacks were It estimates the ratio of the correctly classified Attack
done. connection records to the total number of Attack con-
Most widely used dataset for HIDS is KDDCup 98, nection records. If the TPR is higher, the machine
KDDCup 99 and University of New Maxico (UNM). These learning model is better (TPR ∈ [0, 1]). TPR is defined
datasets were compiled decades ago and most of them are as follows
irrelevant for today’s operating system. Recently, [62] made TP
the dataset to be publicly available. Thus, this dataset has TPR = (18)
TP + FN
been used as new benchmark for evaluating system call
5) False Positive Rate (FPR): It estimates the ratio of
based HIDS. The dataset comprises of modern vulnerability
the Normal connection records flagged as Attacks to
exploits and attacks.
the total number of Normal connection records. If the
FPR is lower, the machine learning model is better
D. STATISTICAL MEASURES
(FPR ∈ [0, 1]). FPR is defined as follows
In evaluation to estimate the various statistical measures the
ground truth value is required. The ground truth composed of FP
FPR = (19)
set of connection records labeled either Normal or Attack in FP + TN
TABLE 1. Training and testing connection records from KDDCup 99 and NSL-KDD datasets.
6) Receiver Operating Characteristics (ROC) curve: •Basic features [1-9]: The packet capture (Pcap)
ROC is plotted based on the trade-off between the TPR files of tcpdump are used to extract the basic
on the y axis to FPR on the x axis across different features from the packet headers, TCP segments,
thresholds. Area Under the ROC Curve (AUC) is the and UDP datagram instead of payload. This task
size of the area under the ROC curve used along with was carried out using a remodeled network analy-
ROC as a comparison metric for the machine learning sis framework, Bro IDS.
models. If the AUC is higher, the machine learning • Content features [10-22]: Content features are
model is better. extracted from the full payload of TCP/IP packets
rooted on domain knowledge in tcpdump files.
Z 1 TP FP The feature analysis of payload has remained
AUC = d
0 TP + FN TN + FP as research area for the last years. Recently,
in [65], a deep learning approach was introduced
to analyze the entire payload data instead of fol-
VI. MODELLING THE DATASET
lowing the feature extraction process. Content
Due to security and privacy issues, most of the datasets
features are mainly used to identify ’R2L’ and
are not publicly available. Additionally, the data which are
’U2R’ category attacks. For example, many failed
publicly available are laboriously anonymized and do not
login attempts is the most prominent feature to
contemplate today’s network traffic variety. Due to these
indicate the malicious behavior in the entire pay-
issues, the exemplary dataset is yet to be discerned [64]. The
load. Unlike other category attacks, ’R2L’ and
details of various IDS datasets are discussed in [59] and [66].
’U2R’ category do not have the prominent sequen-
A detailed overview of available datasets between 1998 and
tial patterns due to events happening in a single
2016 is discussed in detail by [66]. We consider the pros and
connection.
cons of existing datasets used in NIDS and HIDS and discuss
• Time-based traffic features [23-41]: Time-based
how our datasets were modeled.
traffic features are extracted with a specific tem-
poral window of two seconds. These are grouped
A. DATASETS USED IN NIDS into ’same host’ and ’same service’ based on the
1) KDDCup 99: KDDCup 99 dataset was built by pro- connection characteristics in the past 2 seconds.
cessing tcpdump data of the 1998 DARPA intrusion To handle slow probing attacks, the aforemen-
detection challenge dataset. The Mining Audit data tioned characteristics are recalculated based on a
for automated models for ID (MADMAID) frame- connection window of 100 connections to the same
work was used to extract features from raw tcpdump host. These are typically termed as connection
data. The detailed statistics of the dataset is reported based or host-based traffic features.
in Table 1. KDDCup 1998 dataset was created by MIT 2) NSL-KDD: NSL-KDD is the distilled version of
Lincon laboratory using 1000’s of UNIX machines and KDDCup 99 intrusion data. The filters are used to
100’s users accessing those machines. The network remove redundant connection records in KDDCup
traffic data was captured and stored in tcpdump format 99 and connection records numbered 136,489 and
for 10 weeks. The data of first seven weeks was used 136,497 are removed from the test data. NSL-KDD can
as training dataset and rest used as testing dataset. protect machine learning algorithms not to be biased.
KDDCup 99 dataset is available in two forms. They This can suits well for misuse detection in compared to
are full dataset and 10% dataset. The dataset contains the KDDCup 99 dataset. This also suffers from repre-
41 features and 5 classes (’Normal’, ’DoS’, ’Probe’, senting the real-time network traffic profile character-
’R2L’, ’U2R’). These features are grouped into differ- istics. The detailed statistics of NSL-KDD is reported
ent categories as given below: in Table 1.
41534 VOLUME 7, 2019
R. Vinayakumar et al.: Deep Learning Approach for Intelligent IDS
3) UNSW-NB15: The cyber security research team of 4) Kyoto: The honeypot systems of Kyoto university
Australian Centre for Cyber Security (ACCS) has intro- network traffic data have 24 statistical features. Among
duced a new data called as UNSW-NB15 to resolve 24 features, 14 features are from KDDCup 99. These
the issues found in the KDDCup 99 and NSL-KDD features are important as they are collected from raw
datasets. This data is generated in a hybrid way, con- traffic data of Kyoto university honeypot systems.
taining the normal and attack behaviors of a live net- Additionally, 10 more features are identified with the
work traffic using IXIA Perfect Storm tool that has a honeypot’s network traffic system. In this work, the net-
repository of new attacks and common vulnerability work logs of the year 2015 are considered. The logs
exposures (CVE), a storehouse containing information are preprocessed and divided into training and test-
regarding security vulnerabilities and exposures, which ing datasets. The detailed statistics of the connection
are known publicly. Two servers were used in IXIA records are reported in Table 5.
traffic generator tool where one server generated the 5) WSN-DS: It is an IDS dataset developed for wireless
normal activities, whereas the other generated mali- sensor networks (WSN). This composed of four dif-
cious activities in the network. Tcpdump tool used for ferent types of ’DoS’ attacks: Blackhole, Grayhole,
capturing network packet traces which took several Flooding, and Scheduling. They used Low-energy
hours to compile the whole data of 100 GBs that were adaptive clustering hierarchy (LEACH) protocol to
divided into 1,000 MB pcaps using tcpdump. From collect data from network Simulator 2 (NS-2) and then
pcap files, the features were extracted using Argus and preprocessed to generate 23 features. This dataset was
Bro-IDS in Linux Ubuntu 14.0.4. In addition to the termed as WSN-DS and its detailed statistics is reported
above methods, depth analysis of each packet was done in Table 3.
with 12 algorithms which are developed using C#. The 6) CICIDS2017: This dataset includes the contemporary
data is accessible in two forms as follows: activities of benign and attacks which depicts the real-
a) Full connection records consisting of 2 million time network traffic. The main interest is given towards
connection records collecting the real-time background traffic during cre-
b) A partition of full connection records which is ating this dataset. Using B-profile system, benign back-
composed of 82,332 train connection records ground traffic was collected. This benign traffic con-
and 175,341 test connection records confined tains the characteristics of 25 users based on the HTTP,
with 10 attacks. The partitioned dataset consists HTTPS, FTP, SSH, and email protocols. The network
of 42 features with their parallel class labels traffic for five days was collected and dumped with
which are Normal and nine different Attacks. The normal activity traffic on one day, and attacks injected
information regarding simulated attacks category on other days. The various attacks injected were Brute
and its detailed statistics are described in Table 2. Force FTP, Brute Force SSH, ’DoS’, Heartbleed, Web
parameter for the DNNs, a medium sized architecture was In order to find an optimal learning rate, three trials of
used for experiments with a specific hidden units, learning experiments for 500 epochs with learning rate varying in
rate and activation function. A medium sized DNN contains the range [0.01-0.5] were run. These experiments used a
3 layers. One is input layer, second one is hidden layer or fully medium sized DNN with 1,024 units. Learning rate has a
connected layer and third one is output layer. For KDDCup strong impact on the training speed, thus we had selected
99, the input layer contains 41 neurons, hidden layer contains the range [0.01-0.5]. The peak value for attack detection rate
128, 256, 384, 512, 640, 768, 896 and 1,024 units and output was obtained when the learning rate was 0.1. There was
layer contains 1 neuron in classifying the connection record a quick decrease in attack detection rate when the learn-
as either normal or attack. It contains 5 neurons in classi- ing rate was 0.2 and reached to peak accuracy at learn-
fying the connection record as either normal or attack and ing rates of 0.35, 0.45 and 0.45 compared to learning rate
categorizing attack into corresponding attack categories. The 0.1. This attack detection rate was intensified by running
connection between the units between input layer and hidden the experiments till 1,000 epochs. As we had examined
layer and hidden layer to output layer are fully connected. more complex architectures for this experiment, it showed
Initially, the train and test datasets were normalized using L2 less performance for epochs less than 500, henceforth we
normalization. Two trials of experiments were run for hidden decided to use the learning rate 0.1 for the rest of the
units 128, 256, 384, 512, 640, 768, 896 and 1,024 with a experimentation process, because learning rate greater than
medium sized DNN. The experiment was run for each param- 0.1 was found to be time consuming. To find the opti-
eter with appropriate units and for 300 epochs. The DNN mal learning rate for Multi-class classification of KDDCup
with various units have learnt the patterns of normal con- 99, we run two trials of experiments for learning rate in
nection records with epochs 200 in comparison to the those the range [0.01-0.5]. The experiments with lower learning
with attacks. To capture the significant features which can rate 0.01 showed better performance for attacks ’DoS’ and
distinguish the attack connection record by DNN, 200 epochs ’Probe’. When we increase the learning rate from 0.01,
were required. After 200 epochs, the performance of nor- the attack detection rate remained same. Experiments with
mal connection records fluctuated due to overfitting. Each learning rate 0.1 has showed optimal performance in detec-
number of units with DNN took varied number of epochs to tion of ’R2L’ and ’U2R’ attacks category. Based on the
attain considerable performance. It was found that the layer observations of performance of detection of attacks in both
containing 1,024 units had shown highest number of attack Binary and Multi-class classification, the learning rate was set
detection rates. When we increased the number of hidden to 0.1
units from 1,024 to 2,048, the performance in attack detection In third trial of experiments, we had also run experi-
rate deteriorated. Hence, we decided to use 1,024 units for ments with the sigmoid and tanh activation functions for
the rest of the experiments. The medium sized DNN with both the Binary and Multi-class classification. These func-
1,024 units in hidden layer was used for experiments with tions achieved good attack detection rates than the ReLU for
Multi-class classification of KDDCup 99 using three trials 100 epochs in Binary classification. When the same set of
of experiments for each hidden units until 500 epochs as experiments were run for 500 epochs the experiments with
the DNN performed well in comparison to the other units. ReLU activation function performed better than the sigmoid
A medium sized DNN with less number of units, 128, 256 and and tanh activation functions. In the case of Multi-class
384 learned the patterns of high frequency attack, ’DoS’. The classification, the performance of ReLU was good in com-
performance in detection of ’DoS’ attack remained same for parison to the sigmoid and tanh activation functions. Thus,
other variants in the number of units of a medium sized DNN. we decided to set the ReLU activation function for the rest
Due to the large number of connection records of ’DoS’, of the experiments. All the models were trained using adam
DNN network with less number of units was able to achieve optimizer with a batch size of 64 for 500 epochs to monitor
optimal detection rate. The acceptable detection rate for validation accuracy.
’Probe’ category of attacks was found with units 512 and 640.
A medium sized DNN with hidden units of 896 performed B. FINDING AN OPTIMAL NETWORK TOPOLOGY OF DNN
well for detection of ’R2L’ attacks in comparison to the other The following network topologies were used to choose
units. A medium sized DNN required 1,024 units to detect the best network topology for training an IDS model with
the attacks of ’U2R’. Once the number of units increased KDDCup 99.
from 1,024, the performance of DNN network deteriorated.
By considering all these test experiments, 1,024 was set as 1) DNN 1 layer
the ideal hidden layer units. To achieve a considerable perfor- 2) DNN 2 layers
mance, each DNN network topology required varied number 3) DNN 3 layers
of epochs. DNN network with less number of parameters 4) DNN 4 layers
have achieved good performance till 100 epochs but when it 5) DNN 5 layers
reaches 500 epochs, complex DNN networks have performed For all the above network topologies, we had run 3 trials
well in comparison to the DNN network with less number of of experimentation for 300 epochs each. We observed that
parameters. most of the deep learning architectures learnt the Normal
category patterns of input data for epochs less than 400, C. PROPOSED DNN ARCHITECTURE
while the number of epochs required for discovering Attack This work proposes a unique DNN architecture for NIDS
category was fluctuating. Both the DNN 1 layer and DNN and HIDS composed of an input layer, 5 hidden layers and
2 layers networks have completely failed to learn the attack an output layer. The hierarchical layers in the DNN facili-
categories of ’R2L’ and ’U2R’. The performance of ’DoS’ tate to extract highly complex features and do better pattern
and ’Probe’ attack categories was good with DNN 3 layers in recognition capabilities in IDS data. Each layer estimates
comparison to the DNN 2 layers and DNN 1 layer. The com- non-linear features that are passed to the next layer and the
plex network architectures required a large number of epochs last layer in the DNN performs the classification. An input
in order to reach the optimum accuracy. The performance layer contains 41 neurons for KDDCup 99, 41 neurons for
of DNN 5 layers for various Attack categories and Normal NSL-KDD, 43 neurons for UNSW-NB15, 17 neurons for
category was good as compared to other DNNs network WSN-DS and 77 neurons for CICIDS 2017. An output layer
topologies. By considering all these factors, we’ve decided to contains 1 neuron for Binary classification for all types of
use 5 layer DNNs network for the remaining experimentation datasets and 5 neurons for Multi-class classification in KDD-
process. Cup 99, 5 neurons for NSL-KDD, 10 neurons for UNSW-
In order to increase the speed of training and to avert NB15, 5 neurons for WSN-DS and 8 neurons for CICIDS
over fitting, we used batch normalization and dropout 2017. The detailed information and configuration details of
(0.01) approach. When we run experiments without dropout, the DNN architecture is shown in Table 8. The DNN is
the models ends up in over fitting. Also, the experiments with trained using the backpropogation mechanism [60]. Gener-
batch normalization achieved better results in comparison to ally, the units in input to hidden layer and hidden to output
the networks without batch normalization [10]. For HIDS, layer are fully connected. The DNN is composed of various
the best performed DNNs in NIDS as such used. In HIDS, components, a brief description of each component is given
we had followed hyper parameter tuning methods only in below.
conversion of system calls into numeric representation. For Fully connected layer: This layer is called as fully con-
N-gram, the 3 trials of experiments were run with 1-gram, nected layer since the units in this layer have connection to
2-gram, 3-gram and 4-gram with different DNNs network every other unit in the succeeding layer. Generally, the fully
topologies. DNNs network topologies with 3-gram system connected layers map the data into high dimensions. The
call representation performed well in comparison to the other output will be more accurate, when the dimension of data is
N-gram system call representation. more. It uses ReLU as the non-linear activation function.
Batch Normalization and Regularization: Dropout system. The SHIA framework is basically decomposed into
(0.01) and Batch Normalization [10] was used in between two modules described below:
fully connected layers to obviate overfitting and speedup
1) Packet and system call processing module: In this
the DNN model training. A dropout removes neurons with
module, the networks which are to be monitored are
their connections randomly. In our alternative architectures,
connected to a single port mirroring switch, which
the DNNs could easily overfit the training data without regu-
larization even when trained on large number samples. replicates the flow of the entire network traffic of all
Classification: The last layer is a fully connected layer the switches. The proposed SHIA IDS is required to
monitor a network composed of different subnets, each
which uses sigmoid activation function for Binary classifi-
with n number of different machines. The monitored
cation and softmax activation function for Multi-class classi-
networks are hosts composed of computer machines
fication. The prediction loss function for sigmoid is defined
using Binary cross entropy and the prediction loss for softmax which allow users to communicate and transfer data.
is defined using the Categorical cross entropy as follows: All the traffic generated by the internet were collected,
without considering the internal traffic between net-
The prediction loss for Binary classification is estimated
works.
using Binary cross entropy given by,
The SHIA framework with the network where one of
N
1 X the switches is used as a port mirroring switch and
loss(pd, ed) = − [edi log pdi +(1−edi ) log(1−pdi )] connected to a traffic collector. This module collects
N
i=1 network traffic data using Netmap packet capturing
(20) tool and stores it in NoSQL database. Feature vectors
where pd is a vector of predicted probability for all samples are passed into the DNN module, and a copy of data
in testing dataset, ed is a vector of expected class label, values is passed into NoSQL database. Likewise, the system
are either 0 or 1. calls and configuration files are collected in a dis-
The prediction loss for Multi-class classification is esti- tributed manner by following the methodology pro-
mated using Categorical cross entropy given by, vided in [62]. These are passed to NoSQL database and
X followed by the text representation method to map the
loss(pd, ed) = − pd(x) log(ed(x)) (21) system calls to feature vectors. A copy of these feature
x
vectors are dumped into NoSQL database and these
where ed is true probability distribution, pd is predicted
feature vectors are again passed into the DNN module
probability distribution. We have used adam as an optimizer
for classification.
to minimize the loss of Binary cross entropy and Categorical
2) DNN module: From the experimental analysis,
cross entropy.
we found that the DNN performed well over other
VIII. SCALE-HYBRID-IDS-ALERTNET (SHIA) FRAMEWORK
algorithms in all cases of HIDS and NIDS. Thus, DNN
is used to model the network activities with the aim
An IDS has become an indispensable tool for any type of
organization or industry due to the wide growing nature of to detect the attacks more accurately. DNNs require
data and internet. There are many attempts that have been large volume of network and host-level events to learn
made to develop solutions for IDS. These methods used single the behaviors of legitimate and malicious activities.
To make the classifier more generalizable, the network
host for storage and computational resources and the algo-
traffic activities can be collected in different time with
rithms are not distributed. Using these legacy solutions, it is
different users in an isolated network. Finally, the DNN
entirely difficult to monitor and identify intrusions in today’s
network. This is due to high speed networks and attacks are module outputs are passed to Front End Broker. This
more and occurring rapidly. The legacy intrusion detection displays the results to the network admin. The imple-
methods which are existing in internet struggle to keep a mentation details regarding how to conduct a compre-
hensive experimental analysis will be considered as one
watch on the networks more efficiently. To address this, based
of the significant work directions towards future work.
on [61] our work leverages the distributed computing and dis-
tributed machine learning models to propose Scale-Hybrid-
IDS-AlertNet (SHIA) framework for an effective identifica- IX. RESULTS
tion of intrusions and attacks at both the network-level and Publicly available NIDS and HIDS datasets were used to
host-level. The framework provides a scalable design and acts evaluate the performance of classical machine learning and
as a distributed monitoring and reporting system. The SHIA DNNs in order to identify a baseline method. These datasets
enhances the computational performance using the charac- were separated into train and test datasets, and normalized
teristics of distributed computing and distributed machine using L2 normalization. Train datasets were used to train
learning algorithms using a hybrid system placed in different machine learning model and test datasets were used to eval-
locations in the network. The deployed system is designed uate the trained machine learning models. Train accuracy
to use the computational resource optimally and at the same of Multi-class using DNN for KDDCup 99, NSL-KDD and
time with low latency in response while monitoring a critical UNSW NB-15, WSN-DS are shown in Figure 4a and 4b
FIGURE 4. Train accuracy. (a) KDDCup 99 and NSL-KDD. (b) UNSW-NB-15 and WSN-DS. (c) Visualization of 100 connection records with their
corresponding activation values of the last hidden layer neurons from Kyoto.
FIGURE 5. ROC curves of (a) KDDCup 99-using classical machine learning classifiers, (b) KDDCup 99-using DNNs, (c) NSL-KDD-using classical
machine learning classifiers, (d) NSL-KDD-using DNNs.
FIGURE 6. ROC curves of (a) UNSW-NB 15-using classical machine learning classifiers, (b) UNSW-NB 15-using DNNs, (c) Kyoto-using classical
machine learning classifiers, (d) Kyoto-using DNNs.
respectively. For KDDCup 99 and NSLKDD datasets, most obtained in terms of FPR is less in comparison to other
of the DNN network topologies showed train accuracy in classical machine learning classifiers in all the datasets. The
the range 95% to 99%. For UNSW-NB15 and WSN-DS, experiments with 3-gram representation performed well as
all the DNN network topologies showed train accuracy in compared to 1-gram and 2-gram.
the range 65% to 75%. Several observations were extracted
including ROC curve. The ROC curve for KDDCup 99, A. PERFORMANCE COMPARISONS
NSL-KDD, UNSW NB-15, Kyoto, WSN-DS is shown in The detailed results for Binary as well as Multi-class clas-
Figure 5a, Figure 5b, Figure 5c, Figure 5d, Figure 6a, sification of various classical machine learning classifiers
Figure 6b, Figure 6c, Figure 6d, Figure 7a, Figure 7b and and DNNs are reported in Table 9, Table 10 and Table 11,
Figure 7c, Figure 7d respectively. In most of the cases, Table 12 respectively. In terms of accuracy noted that the DT,
DNN performed well in comparison to the classical machine AB and RF classifiers performed better than the other classi-
learning classifiers with AUC used as the standard metric. fiers namely LR, NB, KNN and SVM-rbf. Additionally, the
This indicates that the DNN obtained a highest TPR and a performance of DT, AB and RF classifiers remains the same
lowest FPR and in some cases close to 0. The performance range across different datasets. However, the performance
FIGURE 7. ROC curves of (a) WSN-DS-using classical machine learning classifiers, (b) WSN-DS-using DNNs, (c) CICIDS 2017-using classical
machine learning classifiers, (d) CICIDS 2017-using DNNs.
TABLE 9. Test results of DNNs for Binary class classification. TABLE 10. Test results of DNNs for Multi-class classification.
TABLE 11. Test results of various classical machine learning classifiers for TABLE 12. Test results of various classical machine learning classifiers for
binary class classification. Multi-class classification.
FIGURE 8. Sailency map for a randomly chosen connection record from (a) KDDCup 99, (b) CICIDS 2017 and visualization of 100 connection
records with their corresponding activation values of the last hidden layer neurons from (c) KDDCup 99, and (d) NSLKDD.
TABLE 13. Test results using minimal feature sets. TABLE 14. Test results of host-based IDS.
TABLE 15. Detailed test results for Multi-class classification- KDDCup 99.
normal connection records. This shows that they have similar has been followed. This uses Taylor expansion for the fea-
characteristics and it requires additional features to classify it tures of penultimate layer and finds the first order partial
correctly. derivative of the classification results before placing them
To know the importance of each feature and as well as through the softmax function. This helps to detect the signif-
to identify the significant features, the methodology of [69] icant features to distinguish the connection record as either
Normal or Attack and categorize an Attack into its categories. KDDCup 99 dataset. Thus, the dataset has unique set of
The connection record which belongs to ’R2L’ is shown train and test connection records. Moreover, the connection
in Figure 8a. The features have similar characteristics as of records of NSL-KDD is highly non-linearly separable in
features of ’U2R’. For CICIDS 2017, the connection record comparison to the KDDCup 99. Moreover, the performance
which belongs to ’DoS’ is shown in Figure 8b. The features of both the classical machine learning classifiers and DNNs
have similar characteristics between the ’DoS’ and ’DDoS’. considerably less in comparison to the KDDCup 99 and
This shows that the dataset requires few more additional fea- NSL-KDD. On subset of CICIDS 2017, both the classi-
tures to classify the connection record to ’DoS’ and ’DDoS’ cal machine learning classifiers and DNNs performed well.
correctly. Thus, the proposed SHIA architecture in this work can work
Both classical machine learning classifiers and DNNs net- well in real-time. The performance of DNNs can be enhanced
works have performed well on KDDCup 99 in comparison by carefully following a hyper parameter techniques for
to the NSL-KDD. The NSL-KDD is a refined version of NSL-KDD and UNSW-NB 15.
TABLE 19. Detailed test results for Multi-class classification- CICIDS 2017.
The detailed test results of ADFA-LD and ADFA-WD B. IMPORTANCE OF MINIMAL FEATURE SETS
datasets are reported in Table 14. N-gram and Keras embed- Feature selection is an important step for intrusion detection.
ding text representation methods are used to transform sys- It is an important step in order to identify the various types of
tem call into numeric vectors with DNNs. For comparative attacks more accurately. Without feature selection, there may
study, tf-idf is used as system call representation method with be a possibility in misclassification of attacks and it would
classical machine learning classifier, SVM. The performance take a large time to train a model [70]. The significance of
of N-gram and Keras embedding is good in compared to feature selection method was discussed in detail for intru-
tf-idf. This is due to the fact that both N-gram and Keras sion detection using NLS-KDD dataset [71]. They reported
embedding have the capability to preserve the sequence infor- that the feature selection method significantly reduces the
mation of the system calls. Moreover, the Keras embedding training and testing time and also showed improved intru-
performed well over N-gram representation method. This sion detection rate. To evaluate the performance of vari-
is due to the fact that it facilitates to capture the relation ous DNN topologies and static machine learning classifiers,
among system calls. To choose the best value for N in two trials of experiments are run on minimal feature sets
N-gram, the experiments are run with 1, 2 and 3 gram. on KDDCup 99 and NSL-KDD [3]. The detailed results are
When the N is increased from 3 to 4, the performance reported in Table 13. The experiments with 11 and 8 fea-
reduced. The experiments with 3-gram representation per- ture sets performed well in comparison to the experiments
formed well in compared to 1-gram and 2-gram. However, with 4 feature set. Moreover, experiments with 11 feature
the N-gram and tf-idf representation produces very large sets performed well in comparison to the 8 feature set. The
matrix and in some cases this type of matrix can be sparse. difference in performance between 11 and 8 minimal feature
Thus there may be chance that using any classifier it is very sets is marginal.
difficult to achieve best performance on the sparse repre-
sentation. An additional advantage of Keras embedding is X. CONCLUSIONS AND FUTURE WORK
that the weights of the embedding layer is updated during In this paper, we proposed a hybrid intrusion detection alert
backpropogation. system using a highly scalable framework on commodity
hardware server which has the capability to analyze the [8] M. Tang, M. Alazab, Y. Luo, and M. Donlon, ‘‘Disclosure of cyber secu-
network and host-level activities. The framework employed rity vulnerabilities: time series modelling,’’ Int. J. Electron. Secur. Digit.
Forensics, vol. 10, no. 3, pp. 255–275, 2018.
distributed deep learning model with DNNs for handling and [9] V. Paxson, ‘‘Bro: A system for detecting network intruders in real-
analyzing very large scale data in real-time. The DNN model time,’’ Comput. Netw., vol. 31, nos. 23–24, pp. 2435–2463, 1999.
was chosen by comprehensively evaluating their performance doi: 10.1016/S1389-1286(99)00112-7.
[10] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521,
in comparison to classical machine learning classifiers on no. 7553, p. 436, 2015.
various benchmark IDS datasets. In addition, we collected [11] Y. Xin et al., ‘‘Machine learning and deep learning methods for cyberse-
curity,’’ IEEE Access, vol. 6, pp. 35365–35381, 2018.
host-based and network-based features in real-time and [12] S. A. Hofmeyr, S. Forrest, and A. Somayaji, ‘‘Intrusion detection using
employed the proposed DNN model for detecting attacks and sequences of system calls,’’ J. Comput. Secur., vol. 6, no. 3, pp. 151–180,
intrusions. In all the cases, we observed that DNNs exceeded 1998.
[13] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, ‘‘A sense of
in performance when compared to the classical machine self for unix processes,’’ in Proc. IEEE Symp. Secur. Privacy, May 1996,
learning classifiers. Our proposed architecture is able to per- pp. 120–128.
form better than previously implemented classical machine [14] N. Hubballi, S. Biswas, and S. Nandi, ‘‘Sequencegram: n-gram modeling
of system calls for program based anomaly detection,’’ in Proc. 3rd Int.
learning classifiers in both HIDS and NIDS. To the best of Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2011, pp. 1–10.
our knowledge this is the only framework which has the [15] N. Hubballi, ‘‘Pairgram: Modeling frequency information of lookahead
capability to collect network-level and host-level activities pairs for system call based anomaly detection,’’ in Proc. 4th Int. Conf.
Commun. Syst. Netw. (COMSNETS), Jan. 2012, pp. 1–10.
in a distributed manner using DNNs to detect attack more [16] H. Kozushko, ‘‘Intrusion detection: Host-based and network-based intru-
accurately. sion detection systems,’’ Independ. Study, New Mexico Inst. Mining Tech-
The performance of the proposed framework can be fur- nol., Socorro, NM, USA, 2003.
[17] W. Lee and S. J. Stolfo, ‘‘A framework for constructing features and models
ther enhanced by adding a module for monitoring the DNS for intrusion detection systems,’’ ACM Trans. Inf. Syst. Secur., vol. 3, no. 4,
and BGP events in the networks. The execution time of the 2000, Art. no. 227261. doi: 10.1145/382912.382914.
proposed system can be enhanced by adding more nodes to [18] A. Ozgur and H. Erdem, ‘‘A review of KDD99 dataset usage in intrusion
detection and machine learning between 2010 and 2015,’’ PeerJ PrePrints,
the existing cluster. In addition, the proposed system does vol. 4, Apr. 2016, Art. no. e1954.
not give detailed information on the structure and charac- [19] R. Agarwal and M. V. Joshi, ‘‘PNrule: A new framework for learning
teristics of the malware. Overall, the performance can be classifier models in data mining,’’ Dept. Comput. Sci., Univ. Minnesota,
Minneapolis, MN, USA, Tech. Rep. TR 00-015, 2000.
further improved by training complex DNNs architectures [20] H. G. Kayacik, A. N. Zincir-Heywood, and M. I. Heywood, ‘‘Selecting
on advanced hardware through distributed approach. Due to features for intrusion detection: A feature relevance analysis on KDD 99
extensive computational cost associated with complex DNNs intrusion detection datasets,’’ Proc. 3rd Annu. Conf. Privacy, Secur. Trust,
2005, pp. 12–14.
architectures, they were not trained in this research using the [21] J. Zhang, M. Zulkernine, and A. Haque, ‘‘Random-forests-based network
benchmark IDS datasets. This will be an important task in intrusion detection systems,’’ IEEE Trans. Syst., Man, Cybern. C, Appl.
Rev., vol. 38, no. 5, pp. 649–659, Sep. 2008.
an adversarial environment and is considered as one of the [22] S. Huda, J. Abawajy, M. Alazab, M. Abdollalihian, R. Islam, and
significant directions for future work. J. Yearwood, ‘‘Hybrids of support vector machine wrapper and filter based
framework for malware detection,’’ Future Gener. Comput. Syst., vol. 55,
pp. 376–390, Feb. 2016.
ACKNOWLEDGMENT [23] M. Alazab et al., ‘‘A hybrid wrapper-filter approach for Malware detec-
The authors would like to thank NVIDIA India, for the GPU tion,’’ J. Netw., vol. 9, no. 11, pp. 2878–2891, 2014.
[24] W. Hu, W. Hu, and S. Maybank, ‘‘AdaBoost-based algorithm for network
hardware support to research grant. They would also like intrusion detection,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38,
to thank Computational Engineering and Networking (CEN) no. 2, pp. 577–583, Apr. 2008.
department for encouraging the research. [25] L. Ertöz, M. Steinbach, and V. Kumar, ‘‘Finding clusters of different sizes,
shapes, and densities in noisy, high dimensional data,’’ in Proc. SIAM Int.
Conf. Data Mining, 2013, pp. 47–58.
REFERENCES [26] N. B. Amor, S. Benferhat, and Z. Elouedi, ‘‘Naive Bayesian networks
in intrusion detection systems,’’ in Proc. 23rd Workshop Probabilistic
[1] B. Mukherjee, L. T. Heberlein, and K. N. Levitt, ‘‘Network intrusion Graph. Models Classification, 14th Eur. Conf. Mach. Learn. (ECML) 7th
detection,’’ IEEE Netw., vol. 8, no. 3, pp. 26–41, May 1994. Eur. Conf. Princ. Pract. Knowl. Discovery Databases (PKDD), Cavtat-
[2] D. Larson, ‘‘Distributed denial of service attacks–holding back the flood,’’ Dubrovnik, Croatia, 2003, p. 11.
Netw. Secur., vol. 2016, no. 3, pp. 5–7, 2016. [27] A. Valdes and K. Skinner, ‘‘Adaptive, model-based monitoring for cyber
[3] R. C. Staudemeyer, ‘‘Applying long short-term memory recurrent neural attack detection,’’ in Proc. Int. Workshop Recent Adv. Intrusion Detection.
networks to intrusion detection,’’ South Afr. Comput. J., vol. 56, no. 1, Berlin, Germany: Springer, Oct. 2000, pp. 80–93.
pp. 136–154, 2015. [28] D.-Y. Yeung and C. Chow, ‘‘Parzen-window network intrusion detectors,’’
[4] S. Venkatraman and M. Alazab, ‘‘Use of data visualisation for in Proc. 16th Int. Conf. Pattern Recognit., vol. 4, Aug. 2002, pp. 385–388.
zero-day Malware detection,’’ Secur. Commun. Netw., vol. 2018, [29] W. Li, ‘‘Using genetic algorithm for network intrusion detection,’’ in Proc.
Dec. 2018, Art. no. 1728303. [Online]. Available: https://doi.org/10.1155/ United States Dept. Energy Cyber Secur. Group Training Conf., 2004,
2018/1728303 pp. 24–27.
[5] P. Mishra, V. Varadharajan, U. Tupakula, and E. S. Pilli, ‘‘A detailed [30] L. Didaci, G. Giacinto, and F. Roli, ‘‘Ensemble learning for intrusion detec-
investigation and analysis of using machine learning techniques for intru- tion in computer networks,’’ in Proc. Workshop Mach. Learn. Methods
sion detection,’’ IEEE Commun. Surveys Tuts., to be published. doi: Appl., Siena, Italy, 2002, pp. 1–11.
10.1109/comst.2018.2847722. [31] C. Kolias, G. Kambourakis, and M. Maragoudakis, ‘‘Swarm intelligence in
[6] A. Azab, M. Alazab, and M. Aiash, ‘‘Machine learning based botnet intrusion detection: A survey,’’ Comput. Secur., vol. 30, no. 8, pp. 625–642,
identification traffic,’’ in Proc. 15th IEEE Int. Conf. Trust, Secur. Privacy 2011. doi: 10.1016/j.cose.2011.08.009.
Comput. Commun. (Trustcom), Tianjin, China, Aug. 2016, pp. 1788–1794. [32] C. yin, Y. Zhu, J. Fei, and X. He, ‘‘A deep learning approach for intru-
[7] R. Vinayakumar. (Jan. 2, 2019). Vinayakumarr/Intrusion-Detection V1 sion detection using recurrent neural networks,’’ IEEE Access, vol. 5,
(Version V1). [Online]. Available: http://doi.org/10.5281/zenodo.2544036 pp. 21954–21961, 2017.
[33] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, ‘‘A deep learning [55] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, ‘‘A detailed analysis
approach for network intrusion detection system,’’ in Proc. 9th EAI of the KDD CUP 99 data set,’’ in Proc. 2nd IEEE Symp. Comput. Intell.
Int. Conf. Bio-Inspired Inf. Commun. Technol. (BIONETICS), 2016, Secur. Defence Appl., Jul. 2009, pp. 1–6.
pp. 21–26. [56] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, ‘‘Toward
[34] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, ‘‘Long short term developing a systematic approach to generate benchmark datasets for
memory recurrent neural network classifier for intrusion detection,’’ intrusion detection,’’ Comput. Secur., vol. 31, no. 3, pp. 357–374,
in Proc.Int. Conf. Platform Technol. Service (PlatCon), Feb. 2016, 2012.
pp. 1–5. [57] N. Moustafa and J. Slay, ‘‘UNSW-NB15: A comprehensive data set for
[35] F. A. B. H. Ali and Y. Y. Len, ‘‘Development of host based intrusion network intrusion detection systems (UNSW-NB15 network data set),’’
detection system for log files,’’ in Proc. IEEE Symp. Bus., Eng. Ind. Appl. in Proc. IEEE Mil. Commun. Inf. Syst. Conf. (MilCIS), Nov. 2015,
(ISBEIA), Sep. 2011, pp. 281–285. pp. 1–6.
[36] M. Topallar, M. O. Depren, E. Anarim, and K. Ciliz, ‘‘Host-based intrusion [58] J. Song et al., ‘‘Statistical analysis of honeypot data and building of Kyoto
detection by monitoring Windows registry accesses,’’ in Proc. IEEE 12th 2006+ dataset for NIDS evaluation,’’ in Proc. 1st Workshop Building Anal.
Signal Process. Commun. Appl. Conf., Apr. 2004, pp. 728–731. Datasets Gathering Exper. Returns Secur., 2011, pp. 29–36.
[37] E. Aghaei and G. Serpen, ‘‘Ensemble classifier for misuse detection using [59] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, ‘‘Toward generat-
N-gram feature vectors through operating system call traces,’’ Int. J. ing a new intrusion detection dataset and intrusion traffic characteriza-
Hybrid Intell. Syst., vol. 14, no. 3, pp. 141–154, 2017. tion,’’ in Proc. 4th Int. Conf. Inf. Syst. Secur. Privacy (ICISSP), 2018,
[38] B. Borisaniya and D. Patel, ‘‘Evaluation of modified vector space repre- pp. 1–8.
sentation using ADFA-LD and ADFA-WD datasets,’’ J. Inf. Secur., vol. 6, [60] H. Leung and S. Haykin, ‘‘The complex backpropagation algorithm,’’
no. 3, p. 250, 2015. IEEE Trans. Signal Process., vol. 39, no. 9, pp. 2101–2104, Sep. 1991.
[39] M. Xie, J. Hu, X. Yu, and E. Chang, ‘‘Evaluating host-based anomaly [61] M. Alazab, ‘‘Profiling and classifying the behavior of malicious codes,’’
detection systems: Application of the frequency-based algorithms to J. Syst. Softw., vol. 100, pp. 91–102, Feb. 2015.
ADFA-LD,’’ in Proc. Int. Conf. Netw. Syst. Secur. Cham, Switzerland: [62] G. Creech and J. Hu, ‘‘A semantic approach to host-based intrusion detec-
Springer, 2014, pp. 542–549. tion systems using contiguousand discontiguous system call patterns,’’
[40] B. Subba, S. Biswas, and S. Karmakar, ‘‘Host based intrusion detection IEEE Trans. Comput., vol. 63, no. 4, pp. 807–819, Apr. 2014.
system using frequency analysis of n-gram terms,’’ in Proc. IEEE Region [63] L. Van der Maaten and G. Hinton, ‘‘Visualizing data using t-SNE,’’
10 Conf. (TENCON), Nov. 2017, pp. 2006–2011. J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[41] W. Haider, G. Creech, Y. Xie, and J. Hu, ‘‘Windows based data sets for [64] J. O. Nehinbe, ‘‘A critical evaluation of datasets for investigating IDSs and
evaluation of robustness of host based intrusion detection systems (IDS) IPSs researches,’’ in Proc. IEEE 10th Int. Conf. Cybern. Intell. Syst. (CIS),
to zero-day and stealth attacks,’’ Future Internet, vol. 8, no. 3, p. 29, Sep. 2011, pp. 92–97.
2016. [65] Z. Wang, ‘‘The applications of deep learning on traffic identification,’’
[42] G. Kim, H. Yi, J. Lee, Y. Paek, and S. Yoon. (2016). ‘‘LSTM- BlackHat USA, pp. 21–26, 2015.
based system-call language modeling and robust ensemble method for [66] A. Gharib, I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani,
designing host-based intrusion detection systems.’’ [Online]. Available: ‘‘An evaluation framework for intrusion detection dataset,’’ in Proc. Int.
https://arxiv.org/abs/1611.01726 Conf. Inf. Sci. Secur. (ICISS), Dec. 2016, pp. 1–6.
[43] M. Kezunovic, L. Xie, and S. Grijalva, ‘‘The role of big data in improving [67] S. Ioffe and C. Szegedy, ‘‘Batch normalization: Accelerating deep network
power system operation and protection,’’ in in Proc. IREP Symp. Bulk training by reducing internal covariate shift,’’ in Proc. Int. Conf. Mach.
Power Syst. Dyn. Control-IX Optim., Secur. Control Emerg. Power Grid Learn., 2015, pp. 448–456.
(IREP), Aug. 2013, pp. 1–9. [68] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdi-
[44] M. Tang, M. Alazab, and Y. Luo, ‘‘Big data for cybersecurity: Vulnera- nov, ‘‘Dropout: A simple way to prevent neural networks from overfitting,’’
bility disclosure trends and dependencies,’’ IEEE Trans. Big Data, to be J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
published. doi: 10.1109/tbdata.2017.2723570. [69] K. Simonyan, A. Vedaldi, and A. Zisserman. (2013). ‘‘Deep inside con-
[45] R. Vinayakumar, P. Poornachandran, and K. P. Soman, ‘‘Scalable frame- volutional networks: Visualising image classification models and saliency
work for cyber threat situational awareness based on domain name systems maps.’’ [Online]. Available: https://arxiv.org/abs/1312.6034
data analysis,’’ in Big Data in Engineering Applications (Studies in Big [70] M. Alazab, S. Venkatraman, P. Watters, and M. Alazab, ‘‘Zero-day mal-
Data), vol. 44, S. Roy, P. Samui, R. Deo, and S. Ntalampiras, Eds. Singa- ware detection based on supervised learning algorithms of API call sig-
pore: Springer, 2018. natures,’’ in Proc. 9th Australas. Data Mining Conf., vol. 121, 2011,
[46] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Informa- pp. 171–182.
tion Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008, Ch. 20, [71] A. Alazab, M. Hobbs, J. Abawajy, and M. Alazab, ‘‘Using feature selection
pp. 405–416. for intrusion detection system,’’ in Proc. Int. Symp. Commun. Inf. Technol.
[47] X. Glorot, A. Bordes, and Y. Bengio, ‘‘Deep sparse rectifier neural net- (ISCIT), Oct. 2012, pp. 296–301.
works,’’ in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011, pp. 315–323. [72] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, ‘‘A multimodal deep
[48] A. L. Maas, A. Y. Hannun, and A. Y. Ng, ‘‘Rectifier nonlinearities improve learning method for Android Malware detection using various features,’’
neural network acoustic models,’’ in Proc. ICML, 2013, vol. 30, no. 1, IEEE Trans. Inf. Forensics Secur., vol. 14, no. 3, pp. 773–788, Mar. 2019.
pp. 1–3. [73] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, ‘‘Madam: Effective
[49] V. Nair and G. E. Hinton, ‘‘Rectified linear units improve restricted boltz- and efficient behavior-based android malware detection and prevention,’’
mann machines,’’ in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, IEEE Trans. Dependable Secure Comput., vol. 15, no. 1, pp. 83–97,
pp. 807–814. Jan. 2018.
[50] M. V. Mahoney and P. K. Chan, ‘‘An analysis of the 1999 DARPA/Lincoln [74] S. Naseer et al., ‘‘Enhanced network anomaly detection based on deep
Laboratory evaluation data for network anomaly detection,’’ in Recent neural networks,’’ IEEE Access, vol. 6, pp. 48231–48246, 2018.
Advances in Intrusion Detection (Lecture Notes in Computer Science), [75] N. R. Sabar, X. Yi, and A. Song, ‘‘A bi-objective hyper-heuristic sup-
vol. 2820. Berlin, Germany: Springer, 2003, pp. 220–237. port vector machines for big data cyber-security,’’ IEEE Access, vol. 6,
[51] M. Sabhnani and G. Serpen, ‘‘Why machine learning algorithms fail in pp. 10421–10431, 2018.
misuse detection on KDD intrusion detection data set,’’ Intell. Data Anal., [76] W. Wang et al., ‘‘HAST-IDS: Learning hierarchical spatial-temporal fea-
vol. 8, no. 4, pp. 403–415, 2004. tures using deep neural networks to improve intrusion detection,’’ IEEE
[52] Y. Bouzida and F. Cuppens, ‘‘Neural networks vs. decision trees for intru- Access, vol. 6, pp. 1792–1806, 2018.
sion detection,’’ in Proc. IEEE/IST Workshop Monitoring, Attack Detection [77] S. Aditham and N. Ranganathan, ‘‘A system architecture for the detection
Mitigation (MonAM), Sep. 2006, pp. 1–29. of insider attacks in big data systems,’’ IEEE Trans. Dependable Secure
[53] S. T. Brugger and J. Chow, ‘‘An assessment of the DARPA IDS evaluation Comput., vol. 15, no. 6, pp. 974–987, Nov. 2018.
dataset using snort,’’ Dept. Comput. Sci., Univ. California, Davis, Davis, [78] M. N. Kurt, Y. Yılmaz, and X. Wang, ‘‘Real-time detection of hybrid and
CA, USA, Tech. Rep. CSE-2007-1, 2005. stealthy cyber-attacks in smart grid,’’ IEEE Trans. Inf. Forensics Security,
[54] J. McHugh, ‘‘Testing intrusion detection systems: A critique of the 1998 vol. 14, no. 2, pp. 498–513, Feb. 2019.
and 1999 DARPA intrusion detection system evaluations as performed [79] T. Zhang and Q. Zhu, ‘‘Distributed privacy-preserving collaborative intru-
by Lincoln Laboratory,’’ ACM Trans. Inf. Syst. Secur., vol. 3, no. 4, sion detection systems for VANETs,’’ IEEE Trans. Signal Inf. Process.
pp. 262–294, 2000. doi: 10.1145/382912.382923. Netw., vol. 4, no. 1, pp. 148–161, Mar. 2018.