Multi-Stage Optimized Machine Learning Framework for Network Intrusion Detection
Abstract—Cyber-security has garnered significant attention due to the increased dependency of individuals and organizations on the Internet and their concern about the security and privacy of their online activities. Several previous machine learning (ML)-based network intrusion detection systems (NIDSs) have been developed to protect against malicious online behavior. This paper proposes a novel multi-stage optimized ML-based NIDS framework that reduces computational complexity while maintaining its detection performance. This work studies the impact of oversampling techniques on the models' training sample size and determines the minimal suitable training sample size. Furthermore, it compares two feature selection techniques, information gain and correlation-based, and explores their effect on detection performance and time complexity. Moreover, different ML hyper-parameter optimization techniques are investigated to enhance the NIDS's performance. The performance of the proposed framework is evaluated using two recent intrusion detection datasets, the CICIDS 2017 and the UNSW-NB 2015 datasets. Experimental results show that the proposed model significantly reduces the required training sample size (up to 74%) and feature set size (up to 50%). Moreover, the model performance is enhanced with hyper-parameter optimization, with detection accuracies over 99% for both datasets, outperforming recent literature works by 1-2% higher accuracy and 1-2% lower false alarm rate.

Index Terms—Network intrusion detection, machine learning, hyper-parameter optimization, Bayesian optimization, particle swarm optimization, genetic algorithm.

I. INTRODUCTION

THE INTERNET has become an essential aspect of daily life, with individuals and organizations depending on it to facilitate communication, conduct business, and store information [1], [2]. This dependence is coupled with these individuals and organizations' concern about the security and privacy of their online activities [3]. Accordingly, the area of cyber-security has garnered significant attention from both industry and academia. To that end, more resources are being deployed and allocated to protect modern Internet-based networks from potential attacks or anomalous activities. Several protection mechanisms have been proposed, such as firewalls, user authentication, and the deployment of antivirus and malware programs as a first line of defense [4]. However, these mechanisms have not been able to completely protect organizations' networks, particularly against contemporary attacks [5].

Typically, network intrusion detection systems (NIDSs) can be divided into two main categories: signature-based detection systems (misuse detection) and anomaly-based detection systems [6]. Signature-based detection systems base their detection on the observation of pre-defined attack patterns. Thus, they have proven to be effective for attacks with well-known signatures and patterns. However, these systems are vulnerable to new attacks due to their inability to detect new attacks by learning from previous observations [7]. In contrast, anomaly-based detection systems base their detection on the observation of any behavior or pattern that deviates from what is considered to be normal. Therefore, these systems can detect unknown attacks or intrusions based on the built models that characterize normal behavior [8].

Despite the continuous improvements in NIDS performance, there is still room for further improvement. This is particularly evident given the high volume of generated network traffic data, continuously evolving environments, the vast number of collected features that form the training datasets (high-dimensional datasets), and the need for real-time intrusion detection [9]. For example, redundant or irrelevant features can have a negative impact on the detection capabilities of NIDSs and slow down the model training process. Therefore, it is important to choose the most suitable subset of features and optimize the parameters of the machine learning (ML)-based detection models to enhance their performance [10].

This paper extends our previous work in [11] by proposing a novel multi-stage optimized ML-based NIDS framework that reduces the computational complexity while maintaining its detection performance. To that end, this work first studies the impact of oversampling techniques on the models' training sample size and determines the minimum suitable training size for effective intrusion detection. Furthermore, it compares two different feature selection techniques, namely information gain and correlation-based feature selection, and explores their effect on the models' detection

Manuscript received March 18, 2020; revised June 15, 2020; accepted August 4, 2020. Date of publication August 7, 2020; date of current version June 10, 2021. The associate editor coordinating the review of this article and approving it for publication was S. Kanhere. (Corresponding author: MohammadNoor Injadat.)
MohammadNoor Injadat, Abdallah Moubayed, and Abdallah Shami are with the Department of Electrical and Computer Engineering, University of Western Ontario, London, ON N6A 5B9, Canada (e-mail: minjadat@uwo.ca; amoubaye@uwo.ca; abdallah.shami@uwo.ca).
Ali Bou Nassif is with the Department of Electrical and Computer Engineering, University of Western Ontario, London, ON N6A 5B9, Canada, and also with the Department of Computer Engineering, University of Sharjah, Sharjah, UAE (e-mail: anassif@sharjah.ac.ae).
Digital Object Identifier 10.1109/TNSM.2020.3014929
1932-4537 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
1804 IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 18, NO. 2, JUNE 2021
performance and time complexity. Moreover, different ML hyper-parameter optimization techniques are investigated to enhance the NIDS's performance and ensure its effectiveness and robustness.

To evaluate the performance of the proposed optimized ML-based NIDS framework, two recent state-of-the-art intrusion detection datasets are used, namely the CICIDS 2017 dataset [12] (the updated version of the ISCX 2012 dataset [13] used in our previous work [11]) and the UNSW-NB 2015 dataset [14]. The performance evaluation is conducted using various evaluation metrics such as accuracy (acc), precision, recall, and false alarm rate (FAR).

The remainder of this paper is organized as follows: Section II briefly summarizes some of the previous literature works that focused on this research problem and presents their limitations. Section III summarizes the contributions of this work. Section IV discusses the theoretical and mathematical background of the different deployed techniques. Section V presents the proposed multi-stage optimized ML-based NIDS framework. Section VI describes the two datasets under consideration in more detail. Section VII presents and discusses the experimental results obtained. Finally, Section VIII concludes the paper and proposes potential future research endeavors.

II. RELATED WORK AND LIMITATIONS

A. Related Work

ML classification techniques have been proposed as part of various network attack detection frameworks and other applications using different classification models such as Support Vector Machines (SVM) [15], Decision Trees [16], KNN [17], Artificial Neural Networks (ANN) [18], [19], and Naive Bayes [20], as illustrated in [1]. One such application is the DNS typo-squatting attack detection framework presented in [21], [22]. Also, ML techniques have been proposed to detect zero-day attacks, as illustrated by the probabilistic Bayesian network model presented in [23]. Comparatively, a hybrid ML-fuzzy logic-based system that focuses on distributed denial of service (DDoS) attack detection has been proposed in [24]. These ML classification techniques have also been proposed for botnet detection [25] as well as for mobile phone malware detection [26].

Similarly, several previous works focused on the use of ML classification techniques for network intrusion detection. For example, Salo et al. conducted a literature survey and identified 19 different data mining techniques commonly used for intrusion detection [27], [28]. The result of this review highlighted the need for more ML-based research to address real-time IDSs. The authors then proposed an ensemble feature selection and anomaly detection method for network intrusion detection [29]. In contrast, Yang et al. proposed a decision tree (DT)-based IDS model for autonomous and connected vehicles [30]. The goal of the IDS is to detect both intra-vehicle and external vehicle network attacks [30].

In a similar fashion, several previous research works proposed the use of various optimization techniques to enhance the performance of their NIDSs. For example, Chung and Wahid proposed a hybrid approach that included feature selection and classification with simplified swarm optimization (SSO), in addition to using weighted local search (WLS) to further enhance its performance [31]. Similarly, Kuang et al. presented a hybrid GA-SVM model combined with kernel principal component analysis (KPCA) to improve the performance [15]. Comparatively, Zhang et al. combined misuse and anomaly detection using RF [32]. In contrast, our previous work in [11] proposed a Bayesian optimization model to hyper-tune the parameters of different supervised ML algorithms for anomaly-based IDSs.

B. Limitations of Related Work

Despite the many previous works in the literature that focused on the intrusion detection problem, the previously proposed models suffer from various shortcomings. For example, many of these works do not address the class imbalance issue often encountered in intrusion detection datasets. Also, the training sample size is often selected randomly rather than through a systematic approach. They are also limited by the use of outdated datasets such as NSL-KDD. Additionally, the reported results are usually obtained using only one dataset rather than being validated across multiple datasets. Few works considered hyper-parameter optimization using multiple techniques; most used only one method. Also, only some research works studied the time complexity of their proposed framework, a metric that is often overlooked.

III. RESEARCH CONTRIBUTIONS

The main contributions and differences between this work and our previous work in [11] can be summarized as follows:
• Propose a novel multi-stage optimized ML-based NIDS framework that reduces computational complexity and enhances detection accuracy.
• Study the impact of oversampling techniques and determine the minimum suitable training sample size for effective intrusion detection.
• Explore the impact of different feature selection techniques on the NIDS detection performance and time (training and testing) complexity.
• Propose and investigate different ML hyper-parameter optimization techniques and their corresponding enhancement of the NIDS detection performance.
• Evaluate the performance of the optimized ML-based NIDS framework using two recent state-of-the-art datasets, namely the CICIDS 2017 dataset [12] and the UNSW-NB 2015 dataset [14].
• Compare the performance of the proposed framework with recent works from the literature and illustrate the improvement in detection accuracy, the reduction in FAR, and the reduction in both the training sample size and feature set size.

To the best of our knowledge, no previous work has proposed such a multi-stage optimized ML-based NIDS framework and evaluated it using these datasets.
IV. BACKGROUND AND PRELIMINARIES

As mentioned earlier, this paper proposes a multi-stage optimized ML-based NIDS framework that reduces computational complexity while maintaining its detection performance. Multiple techniques are deployed at different stages to implement this framework. An overview of the techniques used is given in what follows.

A. Data Pre-Processing

The data pre-processing stage involves performing data normalization using the Z-score method and minority class oversampling using the SMOTE algorithm.

1) Z-Score Normalization: The first step of the data pre-processing stage is performing Z-score data normalization. However, the data must first be encoded using a label encoder to transform any categorical features into numerical ones. Then, data normalization is performed by calculating the normalized value x_norm of each data sample x_i as follows:

    x_{norm} = \frac{x_i - \mu}{\sigma}    (1)

where \mu is the mean vector of the features and \sigma is the standard deviation. It is worth mentioning that Z-score data normalization is performed given that ML classification models tend to perform better with normalized datasets [33].

2) SMOTE Technique: The second step is performing minority class oversampling using the SMOTE algorithm. This algorithm aims at synthetically creating more instances of the minority class to reduce the class imbalance that often negatively impacts the ML classification model's performance [34]. Thus, performing minority class oversampling is important, especially for network traffic datasets, which typically suffer from this issue, to improve the training model performance [35].

Upon analyzing the original minority class instances, the SMOTE algorithm synthesizes new instances using the k-nearest neighbors concept. Accordingly, the algorithm groups all the instances of the minority class into one set X_{minority}. For each instance X_{inst} within X_{minority}, a new synthetic instance X_{new} is determined as follows [36]:

    X_{new} = X_{inst} + rand(0, 1) \times (X_j - X_{inst}), \quad j = 1, 2, \ldots, k    (2)

where rand(0,1) is a random value in the range [0,1] and X_j is a randomly selected sample from the set \{X_1, X_2, \ldots, X_k\} of k nearest neighbors of X_{inst}. Note that unlike other oversampling algorithms that replicate minority class instances, the SMOTE algorithm generates new high-quality instances that statistically resemble samples of the minority class [35], [36].

B. Feature Selection

This work compares two different feature selection techniques, namely information gain-based and correlation-based feature selection, and explores their effect on the models' detection performance and time complexity. This is particularly relevant when designing ML models for large-scale systems that generate high-dimensional data [37].

1) Information Gain-Based Feature Selection: The first algorithm considered is the information gain-based feature selection (IGBFS) algorithm. As the name suggests, it uses information theory concepts such as entropy and mutual information to select the relevant features [38], [39]. IGBFS ranks features based on the amount of information (in bits) that can be gained about the target class and selects the ones with the highest amount of information as part of the feature subset provided to the ML model. Thus, the feature evaluation function is [39]:

    I(S; C) = H(S) - H(S|C) = \sum_{s_i \in S} \sum_{c_j \in C} P(s_i, c_j) \log \frac{P(s_i, c_j)}{P(s_i) \times P(c_j)}    (3)

where I(S;C) is the mutual information between feature subset S and class C, H(S) is the entropy/uncertainty of discrete feature subset S, H(S|C) is the conditional entropy/uncertainty of discrete feature subset S given class C, P(s_i, c_j) is the joint probability of the feature having a value s_i and the class being c_j, P(s_i) is the probability of the feature having a value s_i, and P(c_j) is the probability of the class being c_j.

2) Correlation-Based Feature Selection: The second feature selection algorithm considered is the correlation-based feature selection (CBFS) algorithm. It is often used due to its simplicity, since it ranks features based on their correlation with the target class and selects the highest-ranked ones [40], [41], [42]. CBFS includes a feature as part of the subset if it is considered to be relevant (i.e., if it is highly correlated with or predictive of the class [41], [43]). When using CBFS, the Pearson correlation coefficient is used as the feature subset evaluation function. Thus, the evaluation function is [41]:

    Merit_S = \frac{k \times \bar{r}_{cf}}{\sqrt{k + k \times (k - 1) \times \bar{r}_{ff}}}    (4)

where Merit_S is the merit of the feature subset S, k is the number of features in feature subset S, \bar{r}_{cf} is the average class-feature Pearson correlation, and \bar{r}_{ff} is the average feature-feature Pearson correlation.

C. Hyper-Parameter Optimization

This work explores different hyper-parameter optimization methods, namely random search (RS), the Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) meta-heuristic algorithms, and the Bayesian optimization algorithm [11], [44], [45].

1) Random Search: The first hyper-parameter optimization technique is the RS method. This method belongs to the class of heuristic optimization models [46]. Similar to the grid search algorithm [47], [48], RS tries different combinations of the parameters to be optimized. Mathematically, this translates to the following model:

    \max_{parm} f(parm)    (5)

where f is an objective function to be maximized (typically the accuracy of the model) and parm is the set of parameters
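The two pre-processing steps described above translate almost directly from Eqs. (1) and (2). The snippet below is a minimal NumPy sketch, not the authors' implementation; in practice one would typically use scikit-learn's StandardScaler and imbalanced-learn's SMOTE, and the toy minority sample here is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def z_score(X):
    # Eq. (1): x_norm = (x_i - mu) / sigma, applied per feature
    return (X - X.mean(axis=0)) / X.std(axis=0)

def smote_like(X_min, n_new, k=5):
    # Eq. (2): X_new = X_inst + rand(0,1) * (X_j - X_inst),
    # where X_j is one of the k nearest neighbours of X_inst.
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neigh = np.argsort(d)[1:k + 1]       # k nearest, skipping the point itself
        j = rng.choice(neigh)
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synth)

X_min = rng.normal(size=(20, 4))             # toy minority-class sample
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)                           # (30, 4)
```

Note that the synthetic points lie on line segments between minority instances and their neighbours, which is what lets SMOTE enlarge the minority class without simply duplicating rows.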
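For intuition, the two selection criteria can be contrasted on synthetic data: information gain scores a discrete feature via the H(S) − H(S|C) form of Eq. (3), while the merit score of Eq. (4) is driven by the class-feature Pearson correlation term. The data and thresholds below are purely illustrative assumptions.

```python
import numpy as np

def entropy(v):
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, cls):
    # Eq. (3): I(S;C) = H(S) - H(S|C) for a single discrete feature
    h_cond = 0.0
    for c in np.unique(cls):
        mask = cls == c
        h_cond += mask.mean() * entropy(feature[mask])
    return entropy(feature) - h_cond

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)                 # toy binary class labels
informative = y ^ (rng.random(500) < 0.1)   # tracks the class, 10% noise
noise = rng.integers(0, 2, 500)             # independent of the class

# The informative feature wins under both criteria:
print(info_gain(informative, y) > info_gain(noise, y))   # True
# Pearson-correlation ranking (the r_cf term in Eq. (4)):
print(abs(np.corrcoef(informative, y)[0, 1])
      > abs(np.corrcoef(noise, y)[0, 1]))                # True
```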
to be tuned. In contrast to the grid search method, the RS method does not perform an exhaustive search of all possible combinations, but rather randomly chooses a subset of combinations to test [46]. Therefore, RS tends to outperform the grid search method, especially when the number of hyper-parameters is small [46]. Additionally, this method allows the optimization to be performed in parallel, further reducing its computational complexity [44].

2) Meta-Heuristic Optimization Algorithms: The second class of hyper-parameter optimization methods is the meta-heuristic optimization algorithms. These algorithms aim at identifying or generating a heuristic that may provide a sufficiently good solution to the optimization problem at hand [49]. They tend to find suitable solutions for combinatorial optimization problems with a lower computational complexity [49], making them good candidates for hyper-parameter optimization. This work considers two well-known meta-heuristics for hyper-parameter optimization, namely PSO and GA.

1) PSO: a well-known meta-heuristic algorithm that simulates social behavior, such as flocks of birds traveling toward a "promising position" [50]. In the case of hyper-parameter optimization, the desired "position" is the set of suitable values for the hyper-parameters. In general, the PSO algorithm uses a population (a set of particles) to search for a suitable solution by iteratively updating the particles' positions within the search space.

More specifically, each particle looks at its own best previous experience pbest (the cognition part) and the best experience of the other particles gbest (the social part) to determine its change of search direction. Mathematically, the position of particle i at iteration t is represented as a vector x_i^t = \{x_{i1}^t, x_{i2}^t, \ldots, x_{iD}^t\} and its velocity as v_i^t = \{v_{i1}^t, v_{i2}^t, \ldots, v_{iD}^t\}, where D is the number of parameters to be optimized. Assuming that pbest_i^t is particle i's best solution up to iteration t and gbest^t is the best solution within the population at iteration t, each particle changes its velocity as follows [50]:

    v_{id}^t = v_{id}^{t-1} + c_1 r_1 (pbest_{id}^t - x_{id}^t) + c_2 r_2 (gbest_d^t - x_{id}^t)    (6)

where c_1 is the particle's cognition learning factor, c_2 is the social learning factor, and r_1 and r_2 are random numbers in [0,1]. Accordingly, the particle's new position becomes [50]:

    x_{id}^{t+1} = x_{id}^t + v_{id}^t    (7)

Within the context of hyper-parameter optimization, x_i^t = parm, where parm is the set of parameters for the ML model under consideration. For example, in the case of SVM, the parameters are C and γ.

2) GA: another well-known meta-heuristic algorithm, inspired by evolution and the process of natural selection [51]. It is often used to identify high-quality solutions to combinatorial optimization problems using biologically inspired operations including mutation, crossover, and selection [51]. Using these operators, GA algorithms can search the solution space efficiently [51].

In the context of ML hyper-parameter optimization, the GA algorithm works as follows [51]:
a) Initialize a population of random solutions denoted as chromosomes. Each chromosome is a vector of potential hyper-parameter value combinations.
b) Determine the fitness of each chromosome using a fitness function, typically the ML model's accuracy when using each chromosome's vector.
c) Rank the chromosomes according to their relative fitness in descending order.
d) Replace the least-fit chromosomes with new chromosomes generated through crossover and mutation processes.
e) Repeat steps b)-d) until the performance is no longer improving or some stopping criterion is met.

Due to its effectiveness in identifying very good solutions (near-optimal in many cases), this meta-heuristic has been used in a variety of applications including workflow scheduling [52], photovoltaic systems [53], wireless networking [54], and, in this case, machine learning [55].

3) Bayesian Optimization: The third hyper-parameter optimization method considered in this work is the Bayesian Optimization method. This method belongs to the class of probabilistic global optimization models [56]. It aims at minimizing a scalar objective function f(x) for some value x. The output of this optimization process for the same input x differs based on whether the function is deterministic or stochastic [57]. The minimization process is divided into three main parts: a surrogate model that fits all the evaluated points of the objective function f(x), a Bayesian update process that modifies the surrogate model after each new evaluation of the objective function, and an acquisition function a(x). Different surrogate models can be assumed, namely the Gaussian Process and the Tree Parzen Estimator.

1) Gaussian Process (GP): The model is assumed to follow a Gaussian distribution. Thus, it is of the form [58]:

    p(f(x) \mid x, parm) = N(f(x) \mid \hat{\mu}, \hat{\sigma}^2)    (8)

where parm is the configuration space of the hyper-parameters and f(x) is the value of the objective function, with \hat{\mu} and \hat{\sigma}^2 being its mean and variance, respectively. Note that such a model is effective when the number of hyper-parameters is small, but is ineffective for conditional hyper-parameters [59].

2) Tree Parzen Estimator (TPE): The model is assumed to follow one of two density functions, l(x) or g(x), depending on some pre-defined threshold f^*(x) [58]:

    p(x \mid f(x), parm) = \begin{cases} l(x) & \text{if } f(x) < f^*(x) \\ g(x) & \text{if } f(x) > f^*(x) \end{cases}    (9)

where parm is the configuration space of the hyper-parameters and f(x) is the value of the objective function.
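Equations (6) and (7) translate almost line-for-line into code. The sketch below applies them to a toy two-parameter objective standing in for model accuracy; the swarm size, learning factors, iteration count, and search bounds are all illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def objective(p):
    # stand-in for model accuracy; maximised at p = (3, -1)
    return -((p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2)

n, D, c1, c2 = 15, 2, 1.5, 1.5          # swarm size, dims, learning factors
x = rng.uniform(-5, 5, (n, D))          # particle positions
v = np.zeros((n, D))                    # particle velocities
pbest = x.copy()                        # per-particle best positions
pbest_val = np.array([objective(p) for p in x])
gbest = pbest[pbest_val.argmax()].copy()
init_val = pbest_val.max()

for t in range(60):
    r1, r2 = rng.random((n, D)), rng.random((n, D))
    # Eq. (6): cognitive pull toward pbest, social pull toward gbest
    v = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v                           # Eq. (7): move each particle
    val = np.array([objective(p) for p in x])
    better = val > pbest_val
    pbest[better], pbest_val[better] = x[better], val[better]
    gbest = pbest[pbest_val.argmax()].copy()

print(np.round(gbest, 2))               # best hyper-parameter vector found
```

By construction the recorded gbest never worsens; practical PSO variants usually add an inertia weight or velocity clamping, which the paper's Eq. (6) form omits.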
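The GA loop a)-e) can likewise be rendered as a compact sketch. The fitness function below stands in for a model's validation accuracy, and the specific operator choices (truncation selection, uniform crossover, Gaussian mutation) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

def fitness(chrom):
    # stand-in for validation accuracy; maximised at (2.0, 0.5)
    return -float(np.sum((chrom - np.array([2.0, 0.5])) ** 2))

pop = rng.uniform(-4, 4, (20, 2))                  # a) random chromosomes
for gen in range(40):
    fit = np.array([fitness(c) for c in pop])      # b) fitness of each
    order = fit.argsort()[::-1]                    # c) rank, best first
    parents = pop[order[:10]]                      # keep the fittest half
    children = []
    while len(children) < 10:                      # d) crossover + mutation
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        child = np.where(rng.random(2) < 0.5, a, b)    # uniform crossover
        child = child + rng.normal(0.0, 0.1, 2)        # Gaussian mutation
        children.append(child)
    pop = np.vstack([parents, children])           # e) next generation

best = max(pop, key=fitness)
print(np.round(best, 1))    # best chromosome, near the optimum (2.0, 0.5)
```

Because the fittest half survives unchanged each generation (elitism), the best fitness is monotonically non-decreasing, which is one simple way to realise the stopping criterion in step e).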
Fig. 1. Proposed Multi-stage Optimized ML-based NIDS Framework.

Note that TPE estimators follow a tree structure and can optimize all hyper-parameter types [59].

Based on the surrogate model assumption, the acquisition function is maximized to determine the subsequent evaluation point. The role of the function is to measure the expected improvement in the objective while avoiding values that would increase it [57]. Therefore, the expected improvement (EI) can be determined as follows:

    EI(x, Q) = E_Q\left[\max\left(0, \mu_Q(x_{best}) - f(x)\right)\right]    (10)

where x_{best} is the location of the lowest posterior mean and \mu_Q(x_{best}) is the lowest value of the posterior mean.

V. PROPOSED MULTI-STAGE OPTIMIZED ML-BASED NIDS FRAMEWORK

A. General Framework Description

This work focuses on building a multi-stage optimized ML-based NIDS framework that achieves high detection accuracy and low FAR, and has a low time complexity. The proposed framework is divided into three main stages to achieve this goal. The first stage is data pre-processing, which includes performing Z-score normalization and the Synthetic Minority Oversampling TEchnique (SMOTE). This is done to improve the performance of the training model and reduce the class imbalance often observed in network traffic data [34]. In turn, this can reduce the training sample size, since the ML model would have enough samples to understand the behavior of each class [35].

The second stage of the proposed framework is conducting a feature selection process to reduce the number of features needed for the ML classification model. This is done to reduce the time complexity of the classification model and consequently decrease its training time without sacrificing its detection performance.

B. Security Considerations

The proposed multi-stage optimized ML-based NIDS framework is a signature-based NIDS. This is illustrated by the fact that the framework oversamples the minority class, which typically is the attack class in network traffic [27], [28]. Thus, the framework learns from the observed patterns of known initiated attacks [27], [28]. However, it is worth noting that the framework can also work as an anomaly-based NIDS, since it is trained using a binary classification model so that it can classify any anomalous behavior as an attack.

This framework can be deployed as one module within a more comprehensive security framework/policy that an individual or organization can adopt. This security framework/policy can include other mechanisms such as firewalls, deep packet inspection, user access control, and user authentication mechanisms [61], [62]. This would offer a multi-layer secure framework that can preserve the privacy and security of the users' data and information.

C. Complexity

To determine the time complexity of the proposed multi-stage optimized ML-based NIDS framework, we need to determine the complexity of each algorithm used in each stage. Given that this work compares the performance of different algorithms within the different stages of the framework, the overall time complexity is determined by the combination of algorithms that results in the highest aggregate complexity. It is assumed that the data is composed of M samples and N features. Starting with the first stage, i.e., the data pre-processing stage, the complexity of the Z-score normalization process is O(N), since we need to normalize all the samples of the N features within the dataset. On the other hand, the complexity of the SMOTE algorithm is O(M_{min}^2 N), where M_{min} is the number of samples belonging to the minority class [63]. Thus, the overall complexity of the first stage is O(M_{min}^2 N).

The complexity of the second stage is dependent on the complexity of the different feature selection algorithms considered. The complexity of correlation-based feature selection
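For intuition on the acquisition step, the expected improvement has a well-known closed form when the surrogate's posterior at a candidate x is Gaussian with mean mu and standard deviation sigma (here for minimization, matching the formulation above). This is the textbook GP-EI expression, not necessarily the exact variant implemented by the authors' toolchain, and the numbers below are illustrative.

```python
import math

def expected_improvement(mu, sigma, best):
    # Closed-form EI for minimisation under a Gaussian posterior:
    # EI = E[max(0, best - f(x))] with f(x) ~ N(mu, sigma^2)
    if sigma <= 0:
        return max(0.0, best - mu)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf

# A candidate predicted slightly worse than the incumbent but with high
# uncertainty can still carry more expected improvement than a safe one:
print(round(expected_improvement(mu=1.1, sigma=1.0, best=1.0), 3))   # 0.351
print(round(expected_improvement(mu=0.9, sigma=0.01, best=1.0), 3))  # 0.1
```

This trade-off between posterior mean and posterior uncertainty is what lets Bayesian optimization balance exploitation against exploration when choosing the next hyper-parameter configuration to evaluate.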
The data was collected through two different simulations conducted on two different days, namely January 22 and February 17, 2015. The resulting dataset consists of 2,540,044 instances and 49 features (1 class feature and 48 statistical features) representing the different characteristics of a network traffic request, such as source and destination details, duration, protocol used, and packet size [14]. These instances are labeled as follows: 2,218,761 normal instances and 321,283 attack instances. In this case, no merging of attacks was needed since the dataset was originally labeled in a binary fashion.

In a similar fashion, Fig. 3 shows the first and second principal components for the UNSW-NB 2015 dataset. Again, we can observe that the features are non-linear. However, it can be observed that the level of intertwining between the two classes is lower. Accordingly, it is easier to separate the two classes.

Note that there are other network intrusion detection datasets that could be studied, such as the NSL-KDD dataset and the Kyoto 2006+ dataset. However, these datasets have already been extensively studied. Moreover, they are outdated and may not contain recent attack patterns. In contrast, the two datasets considered in this work are more recent and contain more attack patterns. As such, studying them will produce better-equipped NIDSs that are trained to detect more attack types.

C. Attack Types

The two datasets considered in this work contain some similar attacks and some that are different. For example, the CICIDS 2017 dataset contains the following attacks: Denial-of-Service (DoS), port scanning, brute-force, web attacks, botnets, and infiltration [12]. In contrast, the UNSW-NB 2015 dataset contains the following attacks: fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, and worms [14]. Accordingly, it can be deduced that the proposed framework learns the patterns of various attack types.

Note that the proposed framework adopts a binary classification model by labeling all attack types as "attack". The goal is to develop a NIDS that can detect various attacks rather than just a finite group of common attacks such as DoS. This reiterates the idea that the proposed multi-stage optimized ML-based NIDS can work as an anomaly-based NIDS despite its training as a signature-based NIDS.

VII. EXPERIMENTAL PERFORMANCE EVALUATION

A. Experimental Setup

The experiments conducted for this work were completed using Python 3.7.4 running on Anaconda's Jupyter Notebook. This was run on a virtual machine with 3 Intel Xeon CPU E5-2660 v3 2.6 GHz processors and 64 GB of memory, running Windows Server 2016. The experimental results are divided into three main subsections, namely the impact of data pre-processing on training sample size, the impact of feature

to two main reasons. Firstly, these classifiers were the top performing classifiers in our previous work, as they showed their effectiveness for network intrusion detection [11]. Secondly, these classifiers have lower computational complexities when compared to other classifiers. For example, the KNN classifier has a complexity of O(MN), where M is the number of instances and N is the number of features [68], [69]. Similarly, the complexity of the RF classifier is O(M^2 \sqrt{N} t), where t is the number of trees within the RF classifier. However, since this classifier allows for multi-threading, its training time is significantly reduced to approximately O(M^2 \sqrt{N} t / N_{threads}), where N_{threads} is the maximum number of participating threads [30]. In contrast, the complexity of SVM can reach an order of O(M^3 N) [70]. Therefore, training such a model would be computationally prohibitive, especially given the dataset sizes used in this work. Note that the parameters to be tuned are:
• KNN: number of neighbors K.
• RF: splitting criterion (Gini or entropy) and number of trees.

It is worth noting that the runtime complexity (also commonly referred to as testing complexity) of the optimized KNN and RF models is O(MN) and O(Nt) respectively, where M is the number of training samples, N is the number of features, and t is the number of decision trees forming the RF classifier [71], [72]. In the case of KNN, any new instance is classified after calculating the distance between itself and all other instances in the training sample and identifying its K nearest neighbors [71]. Conversely, when using the RF classifier, the new instance is fed to the t different decision trees, each of which uses N splits based on the N features considered, and the class is determined based on the majority vote among these t trees.

B. Results and Discussion

1) Impact of Data Pre-Processing on Training Sample Size: Starting with the impact of the data pre-processing stage on the training sample size, the learning curves show the variation of the training accuracy and the cross-validation accuracy as the training sample size changes. Both datasets were split randomly into training and testing samples after normalization using a 70%/30% split criterion.

Using the SMOTE technique, the number of instances of each type in each dataset's training sample is as follows:
• CICIDS 2017: 1,818,477 benign instances (denoted as 0) and 1,800,000 attack instances (denoted as 1).
• UNSW-NB 2015: 1,775,010 normal instances (denoted as 0) and 1,500,000 attack instances (denoted as 1).

It can be seen from Fig. 4 that the number of training samples needed for the CICIDS 2017 dataset for the training accuracy and cross-validation accuracy to converge is close to 2.3 million samples. Similarly, for the UNSW-NB 2015 dataset, the number of training samples needed is close to
selection on feature set size and training sample size, and the 1.3 million samples as can be seen from Fig. 5. This can be
impact of optimization methods on the ML models’ detection attributed to the fact that both datasets are originally imbal-
performance. anced with much fewer attack samples when compared to
The classification models used in this work are KNN clas- normal samples. Hence, the model struggles to learn the attack
sifier and the RF classifier. These classifiers were chosen due patterns and behaviors.
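The oversampling step above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration of the interpolation idea behind SMOTE [35] on a toy two-feature attack class; it is not the implementation used in this work (a real pipeline would typically call a library such as imbalanced-learn), and all names and data in it are illustrative.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating a random minority
    sample with one of its k nearest minority neighbours -- the core
    idea of SMOTE [35]. Simplified sketch, not a reference implementation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority (attack) class with two normalized flow features.
attacks = [(0.10, 0.20), (0.15, 0.22), (0.20, 0.18),
           (0.12, 0.30), (0.18, 0.25), (0.11, 0.27)]
new_points = smote(attacks, n_synthetic=10)
print(len(attacks) + len(new_points))  # 16 minority instances after oversampling
```

Because each synthetic point is a convex combination of two real minority samples, oversampling enlarges the attack class without stepping outside the region the attacks already occupy.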
1810 IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, VOL. 18, NO. 2, JUNE 2021
Fig. 4. Learning Curve Showing Training and Cross-Validation Accuracy for CICIDS 2017 Dataset Before SMOTE.
Fig. 7. Learning Curve Showing Training and Cross-Validation Accuracy for UNSW-NB 2015 Dataset After SMOTE.
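Learning curves like those in Figs. 4-7 can be reproduced in miniature without any ML library. The sketch below trains a deliberately simple nearest-centroid classifier (a stand-in for the paper's KNN/RF models) on growing subsets of synthetic, normalized two-class data and reports validation accuracy at each training size; the data generator and all names are illustrative assumptions, not the study's datasets.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic one-feature, two-class data standing in for normalized flows
# (the study itself uses CICIDS 2017 / UNSW-NB 2015 features).
def sample():
    label = rng.randint(0, 1)
    centre = 0.3 if label == 0 else 0.7
    return (rng.gauss(centre, 0.15), label)

data = [sample() for _ in range(2000)]
train, valid = data[:1400], data[1400:]  # 70%/30% split, as in the paper

def fit_centroids(points):
    by_class = {0: [], 1: []}
    for x, y in points:
        by_class[y].append(x)
    return {y: statistics.mean(xs) for y, xs in by_class.items() if xs}

def accuracy(centroids, points):
    predict = lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))
    return sum(1 for x, y in points if predict(x) == y) / len(points)

# Learning curve: validation accuracy as the training sample grows.
for size in (50, 200, 800, 1400):
    print(size, round(accuracy(fit_centroids(train[:size]), valid), 3))
```

The same pattern, evaluated at increasing sample sizes, is what reveals the convergence points (roughly 2.3M and 1.3M samples) reported for the two datasets.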
INJADAT et al.: MULTI-STAGE OPTIMIZED MACHINE LEARNING FRAMEWORK FOR NETWORK INTRUSION DETECTION 1811
Fig. 8. Mutual Information Score of Features for CICIDS 2017 Dataset Showing the Highest Scoring Features in Descending Order.
Fig. 9. Mutual Information Score of Features for UNSW-NB 2015 Dataset Showing the Highest Scoring Features in Descending Order.
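The mutual-information scores ranked in Figs. 8 and 9 follow the standard information-gain formula I(X;Y) = H(Y) − H(Y|X). Below is a small standard-library sketch on toy categorical flow attributes; the feature names and values are illustrative, not taken from the datasets.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return 0.0 - sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """I(feature; label) = H(label) - H(label | feature): the score
    IGBFS uses to rank features (cf. Figs. 8 and 9)."""
    n = len(labels)
    h_cond = sum(
        len(sub) / n * entropy(sub)
        for v in set(feature)
        for sub in [[y for x, y in zip(feature, labels) if x == v]]
    )
    return entropy(labels) - h_cond

# Toy flows: 'flag' separates the classes perfectly, 'proto' not at all.
label = [0, 0, 0, 1, 1, 1]
flag  = ['S', 'S', 'S', 'F', 'F', 'F']
proto = ['tcp', 'tcp', 'udp', 'tcp', 'tcp', 'udp']
print(round(info_gain(flag, label), 6),
      round(info_gain(proto, label), 6))  # 1.0 0.0
```

Ranking features by this score and keeping the top of the list is exactly the descending order shown in the two figures.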
TABLE I
OPTIMAL PARAMETER VALUES WITH IGBFS FOR DIFFERENT ML MODELS

TABLE II
OPTIMAL PARAMETER VALUES WITH CBFS FOR DIFFERENT ML MODELS
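The RF splitting criteria reported in Tables I and II differ only in the node-impurity measure they minimize. A short sketch of the two measures on hypothetical class distributions:

```python
import math

def gini(p):
    """Gini impurity 1 - sum(q^2) of a class distribution p."""
    return 1.0 - sum(q * q for q in p)

def entropy(p):
    """Shannon entropy (in bits) of the same distribution."""
    return 0.0 - sum(q * math.log2(q) for q in p if q > 0)

# A pure node, a maximally mixed node, and a skewed node.
for dist in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.1)]:
    print(dist, round(gini(dist), 3), round(entropy(dist), 3))
```

Both measures are zero for pure nodes and maximal at a 50/50 split; a decision tree simply picks the feature and threshold that reduce the chosen measure the most, which is why the preferred criterion can change with the feature selection method used upstream.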
TABLE III
PERFORMANCE RESULTS OF THE MULTI-STAGE OPTIMIZED ML-BASED NIDS FRAMEWORK WITH IGBFS FOR TESTING DATASETS

TABLE IV
PERFORMANCE RESULTS OF THE MULTI-STAGE OPTIMIZED ML-BASED NIDS FRAMEWORK WITH CBFS FOR TESTING DATASETS
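The hyper-parameter searches behind Tables III and IV differ mainly in how they spend their evaluation budget. The toy sketch below contrasts a budget-limited random search with an exhaustive sweep over the number of neighbors K, using a synthetic accuracy function whose optimum is placed at K = 7 purely by construction; the real framework evaluates cross-validation accuracy on the IDS datasets instead.

```python
import random

# Surrogate for cross-validation accuracy as a function of K; its
# maximum is at K = 7 by construction (purely illustrative).
def cv_accuracy(k):
    return 0.99 - 0.002 * abs(k - 7)

search_space = range(1, 51)

# Random search stops after a fixed evaluation budget, so with a small
# budget it can miss the optimal K -- the behavior discussed for RS/PSO.
rng = random.Random(1)
best_rs = max((rng.choice(search_space) for _ in range(10)), key=cv_accuracy)

# An exhaustive sweep over the same space always recovers K = 7.
best_grid = max(search_space, key=cv_accuracy)
print(best_rs, best_grid)
```

GA and the Bayesian methods (BO-GP, BO-TPE) sit between these extremes: they also stop after a budget, but use past evaluations to steer the next trial, which is why they tend to land on the same optimum as the exhaustive sweep.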
this is due to the fact that the algorithm's stopping criterion is typically the number of iterations, and thereby it does not test all potential values. Accordingly, it is possible for it to miss the optimal number of neighbors. Similarly, one of the stopping criteria in the PSO algorithm is also the number of evaluations, which can likewise lead to it missing the optimal value. In contrast, the GA, BO-GP, and BO-TPE all resulted in a similar number of neighbors for both the CICIDS 2017 and UNSW-NB 2015 datasets. For the GA algorithm, the number of generations is typically set sufficiently high to reach the optimal value for the number of neighbors. In a similar manner, the BO-GP and BO-TPE determine the actual optimal value based on the assumed model.

In the case of the RF method, the RS and PSO algorithms tend to choose a lower number of trees compared to the GA, BO-GP, and BO-TPE. This is due to these algorithms' stopping criterion, which often leads to a premature stoppage. In contrast, the GA, BO-GP, and BO-TPE determine that the number of trees needed is higher as they explore more potential values, allowing them to select more optimal values for the number of trees. In terms of the splitting criterion, the entropy criterion is mostly selected. This is expected, since the IGBFS method selects features based on their information gain, which is determined using the entropy of each feature. As such, this criterion is more suitable when using IGBFS.

Looking at Table II, similar observations about the hyper-parameter optimization performance of the different algorithms can be made for both the KNN and RF methods. The only difference is that for the RF method, the splitting criterion is chosen to be the Gini index. This is due to the CBFS method using the correlation as the selection criterion rather than the entropy. Therefore, the features chosen may result in a low amount of information (equivalent to having a high entropy with respect to the class), and thus would be overlooked if the entropy splitting criterion were chosen. This is the reason behind choosing the Gini splitting criterion when the CBFS method is used.

Tables III and IV show the performance of the two classification algorithms when using the IGBFS and CBFS methods, respectively. Several observations can be made. The first observation is that the optimized models outperform the regular models recently reported in [12], [30], [75] by 1-2% on average in terms of accuracy, with a reduction of 1-2% in FAR, for both datasets. This is expected, since one of the main goals of hyper-parameter optimization is to improve the performance of the ML models. The second observation is that the RF classifier outperforms the KNN classifier for both the IGBFS and CBFS methods, as seen in the CICIDS 2017 and UNSW-NB 2015 datasets. This reiterates the results previously obtained in [11] with the ISCX 2012 dataset and the results reported in [12], [30], [75], in which the RF classifier also outperformed the KNN model. This can be attributed to the RF classifier being an ensemble model. Accordingly, it is effective with non-linear and high-dimensional datasets like the datasets under consideration in this work. The third observation is that the BO-TPE-RF method had the highest detection accuracy for both the CICIDS 2017 and UNSW-NB 2015 datasets for both feature selection algorithms, with a detection accuracy of 99.99% and 100%, respectively. This proves the effectiveness and robustness of the proposed multi-stage optimized ML-based NIDS framework, as it outperformed other NIDS frameworks.

VIII. CONCLUSION

The area of cyber-security has garnered significant attention from both the industry and academia due to the
[23] X. Sun, J. Dai, P. Liu, A. Singhal, and J. Yen, “Using Bayesian networks for probabilistic identification of zero-day attack paths,” IEEE Trans. Inf. Forensics Security, vol. 13, no. 10, pp. 2506–2521, Oct. 2018.
[24] A. Alsirhani, S. Sampalli, and P. Bodorik, “DDoS detection system: Using a set of classification algorithms controlled by fuzzy logic system in Apache Spark,” IEEE Trans. Netw. Service Manag., vol. 16, no. 3, pp. 936–949, Sep. 2019.
[25] A. A. Daya, M. A. Salahuddin, N. Limam, and R. Boutaba, “BotChase: Graph-based bot detection using machine learning,” IEEE Trans. Netw. Service Manag., vol. 17, no. 1, pp. 15–29, Mar. 2020.
[26] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “A multimodal deep learning method for Android malware detection using various features,” IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 773–788, Mar. 2019.
[27] F. Salo, M. Injadat, A. B. Nassif, A. Shami, and A. Essex, “Data mining techniques in intrusion detection systems: A systematic literature review,” IEEE Access, vol. 6, pp. 56046–56058, 2018.
[28] F. Salo, M. Injadat, A. B. Nassif, and A. Essex, “Data mining with big data in intrusion detection systems: A systematic literature review,” in Proc. Int. Symp. Big Data Manag. Anal., Apr. 2019, pp. 1–8.
[29] F. Salo, M. Injadat, A. Moubayed, A. B. Nassif, and A. Essex, “Clustering enabled classification using ensemble feature selection for intrusion detection,” in Proc. IEEE Int. Conf. Comput. Netw. Commun. (ICNC), 2019, pp. 276–281.
[30] L. Yang, A. Moubayed, I. Hamieh, and A. Shami, “Tree-based intelligent intrusion detection system in Internet of Vehicles,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[31] Y. Y. Chung and N. Wahid, “A hybrid network intrusion detection system using simplified swarm optimization (SSO),” Appl. Soft Comput., vol. 12, no. 9, pp. 3014–3022, 2012.
[32] J. Zhang, M. Zulkernine, and A. Haque, “Random-forests-based network intrusion detection systems,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 5, pp. 649–659, 2008.
[33] K. M. Ali Alheeti and K. McDonald-Maier, “Intelligent intrusion detection in external communication systems for autonomous vehicles,” Syst. Sci. Control Eng., vol. 6, no. 1, pp. 48–56, Sep. 2018.
[34] Z. Chen et al., “Machine learning based mobile malware detection using highly imbalanced network traffic,” Inf. Sci., vols. 433–434, pp. 346–364, Apr. 2018.
[35] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, Aug. 2002.
[36] X. Tan et al., “Wireless sensor networks intrusion detection based on SMOTE and the random forest algorithm,” Sensors, vol. 19, no. 1, p. 203, 2019.
[37] M. B. Catalkaya, O. Kalipsiz, M. S. Aktas, and U. O. Turgut, “Data feature selection methods on distributed big data processing platforms,” in Proc. 3rd Int. Conf. Comput. Sci. Eng. (UBMK), Sep. 2018, pp. 133–138.
[38] R. S. B. Krishna and M. Aramudhan, “Feature selection based on information theory for pattern classification,” in Proc. Int. Conf. Control Instrum. Commun. Comput. Technol. (ICCICCT), Jul. 2014, pp. 1233–1236.
[39] B. Bonev, “Feature selection based on information theory,” Ph.D. dissertation, Dept. Comput. Sci., Univ. Alicante, Alicante, Spain, Jun. 2010.
[40] J. Li et al., “Feature selection: A data perspective,” ACM Comput. Surveys, vol. 50, no. 6, p. 94, 2018.
[41] M. A. Hall, “Correlation-based feature selection for machine learning,” Ph.D. dissertation, Dept. Comput. Sci., Univ. Waikato, Hamilton, New Zealand, 1999.
[42] A. Moubayed, M. Injadat, A. Shami, and H. Lutfiyya, “Relationship between student engagement and performance in e-learning environment using association rules,” in Proc. IEEE World Eng. Educ. Conf. (EDUNINE), Mar. 2018, pp. 1–6.
[43] J. H. Gennari, P. Langley, and D. Fisher, “Models of incremental concept formation,” Artif. Intell., vol. 40, nos. 1–3, pp. 11–61, 1989.
[44] P. Koch, B. Wujek, O. Golovidov, and S. Gardner, “Automated hyperparameter tuning for effective machine learning,” in Proc. SAS Global Forum Conf., 2017, pp. 1–23.
[45] L. Yang and A. Shami, “On hyperparameter optimization of machine learning algorithms: Theory and practice,” Neurocomputing, to be published. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231220311693
[46] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, pp. 281–305, Feb. 2012.
[47] M. Injadat, A. Moubayed, A. B. Nassif, and A. Shami, “Systematic ensemble model selection approach for educational data mining,” Knowl. Based Syst., vol. 200, Jul. 2020, Art. no. 105992. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0950705120302999
[48] M. Injadat, A. Moubayed, A. B. Nassif, and A. Shami, “Multi-split optimized bagging ensemble model selection for multi-class educational datasets,” Appl. Intell., to be published.
[49] L. Bianchi, M. Dorigo, L. M. Gambardella, and W. J. Gutjahr, “A survey on metaheuristics for stochastic combinatorial optimization,” Nat. Comput., vol. 8, no. 2, pp. 239–287, 2009.
[50] S.-W. Lin, K.-C. Ying, S.-C. Chen, and Z.-J. Lee, “Particle swarm optimization for parameter determination and feature selection of support vector machines,” Expert Syst. Appl., vol. 35, no. 4, pp. 1817–1824, 2008.
[51] G. Cohen, M. Hilario, and A. Geissbuhler, “Model selection for support vector classifiers via genetic algorithms. An application to medical decision support,” in Proc. Int. Symp. Biol. Med. Data Anal., 2004, pp. 200–211.
[52] S. G. Ahmad, C. S. Liew, E. U. Munir, T. F. Ang, and S. U. Khan, “A hybrid genetic algorithm for optimization of scheduling workflow applications in heterogeneous computing systems,” J. Parallel Distrib. Comput., vol. 87, pp. 80–90, Jan. 2016.
[53] S. Blaifi, S. Moulahoum, I. Colak, and W. Merrouche, “An enhanced dynamic model of battery using genetic algorithm suitable for photovoltaic applications,” Appl. Energy, vol. 169, pp. 888–898, May 2016.
[54] U. Mehboob, J. Qadir, S. Ali, and A. Vasilakos, “Genetic algorithms in wireless networking: Techniques, applications, and issues,” Soft Comput., vol. 20, no. 6, pp. 2467–2501, 2016.
[55] A. Rikhtegar, M. Pooyan, and M. T. Manzuri-Shalmani, “Genetic algorithm-optimised structure of convolutional neural network for face recognition applications,” IET Comput. Vis., vol. 10, no. 6, pp. 559–566, 2016.
[56] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2951–2959.
[57] E. Brochu, V. M. Cora, and N. De Freitas, “A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” 2010. [Online]. Available: arXiv:1012.2599.
[58] SigOpt. (2015). Bayesian Optimization Primer. [Online]. Available: https://app.sigopt.com/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf
[59] K. Eggensperger et al., “Towards an empirical foundation for assessing Bayesian optimization of hyperparameters,” in Proc. NIPS Workshop Bayesian Optim. Theory Practice, vol. 10, 2013, p. 3.
[60] D. Ashlock, Evolutionary Computation for Modeling and Optimization. New York, NY, USA: Springer, 2006.
[61] A. Moubayed, A. Refaey, and A. Shami, “Software-defined perimeter (SDP): State of the art secure solution for modern networks,” IEEE Netw., vol. 33, no. 5, pp. 226–233, Sep./Oct. 2019.
[62] P. Kumar, A. Moubayed, A. Refaey, A. Shami, and J. Koilpillai, “Performance analysis of SDP for secure internal enterprises,” in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), 2019, pp. 1–6.
[63] F. Hu and H. Li, “A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE,” Math. Probl. Eng., vol. 2013, Nov. 2013, Art. no. 694809.
[64] A. Lissovoi, P. S. Oliveto, and J. A. Warwicker, “On the time complexity of algorithm selection hyper-heuristics for multimodal optimisation,” in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 2322–2329.
[65] R. Cheng and Y. Jin, “A social learning particle swarm optimization algorithm for scalable optimization,” Inf. Sci., vol. 291, pp. 43–60, Jan. 2015.
[66] P. S. Oliveto and C. Witt, “Improved time complexity analysis of the simple genetic algorithm,” Theor. Comput. Sci., vol. 605, pp. 21–41, Nov. 2015.
[67] M. Feurer and F. Hutter, “Hyperparameter optimization,” in Automated Machine Learning. Cham, Switzerland: Springer, 2019, pp. 3–33.
[68] The Kernel Trip. (Apr. 2018). Computational Complexity of Machine Learning Algorithms. Accessed: Feb. 27, 2020. [Online]. Available: https://www.thekerneltrip.com/machine/learning/computational-complexity-learning-algorithms/
[69] C.-T. Chu et al., “Map-reduce for machine learning on multicore,” in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 281–288.
[70] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[71] O. Veksler. (2015). CS434a/541a Class Notes. [Online]. Available: http://www.csd.uwo.ca/courses/CS9840a/Lecture2_knn.pdf
[72] X. Solé, A. Ramisa, and C. Torras, “Evaluation of random forests on large-scale classification problems using a bag-of-visual-words representation,” in Proc. CCIA, 2014, pp. 273–276.
[73] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, vol. 112. New York, NY, USA: Springer, 2013.
[74] A. B. Nassif, D. Ho, and L. F. Capretz, “Regression model for software effort estimation based on the use case point method,” in Proc. Int. Conf. Comput. Softw. Model., vol. 14, 2011, pp. 106–110.
[75] N. Moustafa, B. Turnbull, and K. R. Choo, “An ensemble intrusion detection technique based on proposed statistical flow features for protecting network traffic of Internet of Things,” IEEE Internet Things J., vol. 6, no. 3, pp. 4815–4830, Jun. 2019.

Ali Bou Nassif (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the University of Western Ontario, London, ON, Canada, in 2012. He is currently an Assistant Professor and an Assistant Dean of the Graduate Studies, University of Sharjah, UAE, and an Adjunct Research Professor with Western University. He has published more than 60 papers in international journals and conferences. His interests are machine learning and soft computing, software engineering, cloud computing and service-oriented architecture, and mobile computing. He is a Registered Professional Engineer in Ontario, as well as a member of the IEEE Computer Society.