Multi-Stage Optimized Machine Learning Framework for Network Intrusion Detection
MohammadNoor Injadat, Member, IEEE, Abdallah Moubayed, Member, IEEE,
Ali Bou Nassif, Member, IEEE, and Abdallah Shami, Senior Member, IEEE

Abstract—Cyber-security garnered significant attention due to the increased dependency of individuals and organizations on the Internet and their concern about the security and privacy of their online activities. Several previous machine learning (ML)-based network intrusion detection systems (NIDSs) have been developed to protect against malicious online behavior. This paper proposes a novel multi-stage optimized ML-based NIDS framework that reduces computational complexity while maintaining its detection performance. This work studies the impact of oversampling techniques on the models' training sample size and determines the minimal suitable training sample size. Furthermore, it compares between two feature selection techniques, information gain and correlation-based, and explores their effect on detection performance and time complexity. Moreover, different ML hyper-parameter optimization techniques are investigated to enhance the NIDS's performance. The performance of the proposed framework is evaluated using two recent intrusion detection datasets, the CICIDS 2017 and the UNSW-NB 2015 datasets. Experimental results show that the proposed model significantly reduces the required training sample size (up to 74%) and feature set size (up to 50%). Moreover, the model performance is enhanced with hyper-parameter optimization with detection accuracies over 99% for both datasets, outperforming recent literature works by 1-2% higher accuracy and 1-2% lower false alarm rate.

Index Terms—Network intrusion detection, machine learning, hyper-parameter optimization, Bayesian optimization, particle swarm optimization, genetic algorithm.

Manuscript received March 18, 2020; revised June 15, 2020; accepted August 4, 2020. Date of publication August 7, 2020; date of current version June 10, 2021. The associate editor coordinating the review of this article and approving it for publication was S. Kanhere. (Corresponding author: MohammadNoor Injadat.)
MohammadNoor Injadat, Abdallah Moubayed, and Abdallah Shami are with the Department of Electrical and Computer Engineering, University of Western Ontario, London, ON N6A 5B9, Canada (e-mail: minjadat@uwo.ca; amoubaye@uwo.ca; abdallah.shami@uwo.ca).
Ali Bou Nassif is with the Department of Electrical and Computer Engineering, University of Western Ontario, London, ON N6A 5B9, Canada, and also with the Department of Computer Engineering, University of Sharjah, Sharjah, UAE (e-mail: anassif@sharjah.ac.ae).
Digital Object Identifier 10.1109/TNSM.2020.3014929

I. INTRODUCTION

THE INTERNET has become an essential aspect of daily life with individuals and organizations depending on it to facilitate communication, conduct business, and store information [1], [2]. This dependence is coupled with these individuals and organizations' concern about the security and privacy of their online activities [3]. Accordingly, the area of cyber-security has garnered significant attention from both the industry and academia. To that end, more resources are being deployed and allocated to protect modern Internet-based networks from potential attacks or anomalous activities. Several protection mechanisms have been proposed such as firewalls, user authentication, and the deployment of antivirus and malware programs as a first line of defense [4]. However, these mechanisms have not been able to completely protect the organizations' networks, particularly against contemporary attacks [5].

Typically, network intrusion detection systems (NIDSs) can be divided into two main categories: signature-based detection systems (misuse detection) and anomaly-based detection systems [6]. Signature-based detection systems base their detection on the observation of pre-defined attack patterns. Thus, they have proven to be effective for attacks with well-known signatures and patterns. However, these systems are vulnerable against new attacks due to their inability to detect new attacks by learning from previous observations [7]. In contrast, anomaly-based detection systems base their detection on the observation of any behavior or pattern that deviates from what is considered to be normal. Therefore, these systems can detect unknown attacks or intrusions based on the built models that characterize normal behavior [8].

Despite the continuous improvements in NIDS performance, there is still room for further improvement. This is particularly evident given the high volume of generated network traffic data, continuously evolving environments, the vast amount of features collected that form the training datasets (high-dimensional datasets), and the need for real-time intrusion detection [9]. For example, having redundant or irrelevant features can have a negative impact on the detection capabilities of NIDSs as it slows down the model training process. Therefore, it is important to choose the most suitable subset of features and optimize the parameters of the machine learning (ML)-based detection models to enhance their performance [10].

This paper extends our previous work in [11] by proposing a novel multi-stage optimized ML-based NIDS framework that reduces the computational complexity while maintaining its detection performance. To that end, this work first studies the impact of oversampling techniques on the models' training sample size and determines the minimum suitable training size for effective intrusion detection. Furthermore, it compares between two different feature selection techniques, namely information gain and correlation-based feature selection, and explores their effect on the models' detection

performance and time complexity. Moreover, different ML hyper-parameter optimization techniques are investigated to enhance the NIDS's performance and ensure its effectiveness and robustness.

To evaluate the performance of the proposed optimized ML-based NIDS framework, two recent state-of-the-art intrusion detection datasets are used, namely the CICIDS 2017 dataset [12] (which is the updated version of the ISCX 2012 dataset [13] used in our previous work [11]) and the UNSW-NB 2015 dataset [14]. The performance evaluation is conducted using various evaluation metrics such as accuracy (acc), precision, recall, and false alarm rate (FAR).

The remainder of this paper is organized as follows: Section II briefly summarizes some of the previous literature works that focused on this research problem and presents their limitations. Section III summarizes the contributions of this work. Section IV discusses the theoretical mathematical background of the different deployed techniques. Section V presents the proposed multi-stage optimized ML-based NIDS framework. Section VI describes the two datasets under consideration in more detail. Section VII presents and discusses the experimental results obtained. Finally, Section VIII concludes the paper and proposes potential future research endeavors.

II. RELATED WORK AND LIMITATIONS

A. Related Work

ML classification techniques have been proposed as part of various network attack detection frameworks and other applications using different classification models such as Support Vector Machines (SVM) [15], Decision Trees [16], KNN [17], Artificial Neural Networks (ANN) [18], [19], and Naive Bayes [20] as illustrated in [1]. One such application is the DNS typo-squatting attack detection framework presented in [21], [22]. Also, ML techniques have been proposed to detect zero-day attacks as illustrated by the probabilistic Bayesian network model presented in [23]. Comparatively, a hybrid ML-fuzzy logic-based system that focuses on distributed denial of service (DDoS) attack detection has been proposed in [24]. These ML classification techniques have also been proposed for botnet detection [25] as well as for mobile phone malware detection [26].

Similarly, several previous works focused on the use of ML classification techniques for network intrusion detection. For example, Salo et al. conducted a literature survey and identified 19 different data mining techniques commonly used for intrusion detection [27], [28]. The result of this review highlighted the need for more ML-based research to address real-time IDSs. The authors then proposed an ensemble feature selection and anomaly detection method for network intrusion detection [29]. In contrast, Yang et al. proposed a decision tree (DT)-based IDS model for autonomous and connected vehicles [30]. The goal of the IDS is to detect both intra-vehicle and external vehicle network attacks [30].

In a similar fashion, several previous research works proposed the use of various optimization techniques to enhance the performance of their NIDSs. For example, Chung and Wahid proposed a hybrid approach that included feature selection and classification with simplified swarm optimization (SSO) in addition to using weighted local search (WLS) to further enhance its performance [31]. Similarly, Kuang et al. presented a hybrid GA-SVM model associated with kernel principal component analysis (KPCA) to improve the performance [15]. Comparatively, Zhang et al. combined misuse and anomaly detection using RF [32]. In contrast, our previous work in [11] proposed a Bayesian optimization model to hyper-tune the parameters of different supervised ML algorithms for anomaly-based IDSs.

B. Limitations of Related Work

Despite the many previous works in the literature that focused on the intrusion detection problem, the previously proposed models suffer from various shortcomings. For example, many of these works do not focus on the class imbalance issue often encountered in intrusion detection datasets. Also, the training sample size is often selected randomly rather than using a systematic approach. They are also limited by the use of outdated datasets such as NSL-KDD 99. Additionally, the reported results are usually obtained using only one dataset rather than being validated using multiple datasets. Few works considered hyper-parameter optimization using different techniques; most relied on a single method instead. Also, only some research works studied the time complexity of their proposed framework, a metric that is often overlooked.

III. RESEARCH CONTRIBUTIONS

The main contributions and differences between this work and our previous work in [11] can be summarized as follows:
• Propose a novel multi-stage optimized ML-based NIDS framework that reduces computational complexity and enhances detection accuracy.
• Study the impact of oversampling techniques and determine the minimum suitable training sample size for effective intrusion detection.
• Explore the impact of different feature selection techniques on the NIDS detection performance and time (training and testing) complexity.
• Propose and investigate different ML hyper-parameter optimization techniques and their corresponding enhancement of the NIDS detection performance.
• Evaluate the performance of the optimized ML-based NIDS framework using two recent state-of-the-art datasets, namely the CICIDS 2017 dataset [12] and the UNSW-NB 2015 dataset [14].
• Compare the performance of the proposed framework with recent works from the literature and illustrate the improvement of detection accuracy, reduction of FAR, and a reduction of both the training sample size and feature set size.

To the best of our knowledge, no previous work proposed such a multi-stage optimized ML-based NIDS framework and evaluated it using these datasets.


IV. BACKGROUND AND PRELIMINARIES

As mentioned earlier, this paper proposes a multi-stage optimized ML-based NIDS framework that reduces computational complexity while maintaining its detection performance. Multiple techniques are deployed at different stages for this to be implemented. An overview of the used techniques is given in what follows.

A. Data Pre-Processing

The data pre-processing stage involves performing data normalization using the Z-score method and minority class oversampling using the SMOTE algorithm.

1) Z-Score Normalization: The first step of the data pre-processing stage is performing Z-score data normalization. However, the data must first be encoded using a label encoder to transform any categorical features into numerical ones. Then, data normalization is performed by calculating the normalized value x_norm of each data sample x_i as follows:

x_norm = (x_i − μ) / σ    (1)

where μ is the mean vector of the features and σ is the standard deviation. It is worth mentioning that Z-score data normalization is performed given that ML classification models tend to perform better with normalized datasets [33].

2) SMOTE Technique: The second step is performing minority class oversampling using the SMOTE algorithm. This algorithm aims at synthetically creating more instances of the minority class to reduce the class imbalance that often negatively impacts the ML classification model's performance [34]. Thus, performing minority class oversampling is important, especially for network traffic datasets which typically suffer from this issue, to improve the training model performance [35].

Upon analyzing the original minority class instances, the SMOTE algorithm synthesizes new instances using the k-nearest neighbors concept. Accordingly, the algorithm groups all the instances of the minority class into one set X_minority. For each instance X_inst within X_minority, a new synthetic instance X_new is determined as follows [36]:

X_new = X_inst + rand(0, 1) × (X_j − X_inst),  j = 1, 2, . . . , k    (2)

where rand(0, 1) is a random value in the range [0,1] and X_j is a randomly selected sample from the set {X_1, X_2, . . . , X_k} of k nearest neighbors of X_inst. Note that unlike other oversampling algorithms that replicate minority class instances, the SMOTE algorithm generates new high-quality instances that statistically resemble samples of the minority class [35], [36].

B. Feature Selection

This work compares between two different feature selection techniques, namely information gain-based and correlation-based feature selection, and explores their effect on the models' detection performance and time complexity. This is particularly relevant when designing ML models for large-scale systems that generate high-dimensional data [37].

1) Information Gain-Based Feature Selection: The first algorithm considered is the information gain-based feature selection (IGBFS) algorithm. As the name suggests, it uses information theory concepts such as entropy and mutual information to select the relevant features [38], [39]. The IGBFS ranks features based on the amount of information (in bits) that can be gained about the target class and selects the ones with the highest amount of information as part of the feature subset provided for the ML model. Thus, the feature evaluation function is [39]:

I(S; C) = H(S) − H(S|C) = Σ_{s_i ∈ S} Σ_{c_j ∈ C} P(s_i, c_j) log[ P(s_i, c_j) / (P(s_i) × P(c_j)) ]    (3)

where I(S;C) is the mutual information between feature subset S and class C, H(S) is the entropy/uncertainty of discrete feature subset S, H(S|C) is the conditional entropy/uncertainty of discrete feature subset S given class C, P(s_i, c_j) is the joint probability of the feature having a value s_i and the class being c_j, P(s_i) is the probability of the feature having a value s_i, and P(c_j) is the probability of the class being c_j.

2) Correlation-Based Feature Selection: The second feature selection algorithm considered is the correlation-based feature selection (CBFS) algorithm. It is often used due to its simplicity since it ranks features based on their correlation with the target class and selects the highest ones [40], [41], [42]. CBFS includes a feature as part of the subset if it is considered to be relevant (i.e., if it is highly correlated with or predictive of the class [41], [43]). When using CBFS, the Pearson's correlation coefficient is used as the feature subset evaluation function. Thus, the evaluation function is [41]:

Merit_S = (k × r_cf) / √( k + k × (k − 1) × r_ff )    (4)

where Merit_S is the merit of the feature subset S, k is the number of features in feature subset S, r_cf is the average class-feature Pearson correlation, and r_ff is the average feature-feature Pearson correlation.
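For illustration, the following Python sketch shows how the first (pre-processing) stage described above could be realized with common open-source libraries, assuming scikit-learn and imbalanced-learn are available; the file name, the "Label" column, and the SMOTE settings are placeholder assumptions rather than the exact pipeline used in this work.

```python
# Minimal sketch of stage 1: label encoding, Z-score normalization per Eq. (1),
# and SMOTE oversampling per Eq. (2). File/column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE

df = pd.read_csv("network_traffic.csv")          # hypothetical input file
X, y = df.drop(columns=["Label"]), df["Label"]   # "Label" column assumed

# Encode categorical features (e.g., protocol) into numerical values.
for col in X.select_dtypes(include="object").columns:
    X[col] = LabelEncoder().fit_transform(X[col])
y = LabelEncoder().fit_transform(y)              # BENIGN/ATTACK -> 0/1

# Z-score normalization: x_norm = (x - mu) / sigma.
X_norm = StandardScaler().fit_transform(X)

# SMOTE: synthesize minority-class (attack) instances from k nearest neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_norm, y)
```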

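Similarly, a small sketch of the two feature-scoring ideas in Section IV-B is shown below, assuming the arrays X_res and y_res produced by the previous snippet; scikit-learn's mutual_info_classif is used as a stand-in for the information gain of Eq. (3), while cfs_merit implements the merit function of Eq. (4) directly.

```python
# IGBFS-style ranking and a direct CBFS merit computation (Eqs. (3)-(4)).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Score each feature by its mutual information with the class and rank them.
mi_scores = mutual_info_classif(X_res, y_res, random_state=42)
igbfs_ranking = np.argsort(mi_scores)[::-1]      # most informative features first

def cfs_merit(X, y, subset):
    """Merit_S = k*r_cf / sqrt(k + k*(k-1)*r_ff) for a candidate feature subset."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

print(cfs_merit(X_res, y_res, list(igbfs_ranking[:5])))  # merit of a 5-feature subset
```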

C. Hyper-Parameter Optimization

This work explores different hyper-parameter optimization methods, namely random search (RS), the Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) meta-heuristic algorithms, and the Bayesian optimization algorithm [11], [44], [45].

1) Random Search: The first hyper-parameter optimization technique is the RS method. This method belongs to the class of heuristic optimization models [46]. Similar to the grid search algorithm [47], [48], RS tries different combinations of the parameters to be optimized. Mathematically, this translates to the following model:

max_{parm} f(parm)    (5)

where f is an objective function to be maximized (typically the accuracy of the model) and parm is the set of parameters to be tuned. In contrast to the grid search method, the RS method does not perform an exhaustive search of all possible combinations, but rather only randomly chooses a subset of combinations to test [46]. Therefore, RS tends to outperform the grid search method, especially when the number of hyper-parameters is small [46]. Additionally, this method also allows for the optimization to be performed in parallel, further reducing its computational complexity [44].

2) Meta-Heuristic Optimization Algorithms: The second class of hyper-parameter optimization methods is the meta-heuristic optimization algorithms. These algorithms aim at identifying or generating a heuristic that may provide a sufficiently good solution to the optimization problem at hand [49]. They tend to find suitable solutions for combinatorial optimization problems with a lower computational complexity [49], making them good candidates for hyper-parameter optimization. This work considers two well-known meta-heuristics for hyper-parameter optimization, namely PSO and GA.

1) PSO: PSO is a well-known meta-heuristic algorithm that aims at simulating social behavior such as flocks of birds traveling to a "promising position" [50]. In the case of hyper-parameter optimization, the desired "position" is the suitable values for the hyper-parameters. In general, the PSO algorithm uses a population or a set of particles to search for a suitable solution by iteratively updating these particles' positions within the search space.
More specifically, each particle looks at its own best previous experience pbest (the cognition part) and the best experience of other particles gbest (the social part) to determine its searching direction change. Mathematically, the position of the particle at each iteration t is represented as a vector x_i^t = {x_i1^t, x_i2^t, . . . , x_iD^t} and its velocity as v_i^t = {v_i1^t, v_i2^t, . . . , v_iD^t}, where D is the number of parameters to be optimized. Assuming that pbest_i^t is particle i's best solution until iteration t and gbest^t is the best solution within the population at iteration t, each particle changes its velocity as follows [50]:

v_id^t = v_id^(t−1) + c_1 r_1 (pbest_id^t − x_id^t) + c_2 r_2 (gbest_d^t − x_id^t)    (6)

where c_1 is the particle's cognition learning factor, c_2 is the social learning factor, and r_1 and r_2 are random numbers in [0,1]. Accordingly, the particle's new position becomes [50]:

x_id^(t+1) = x_id^t + v_id^t    (7)

Within the context of hyper-parameter optimization, x_i^t = parm, where parm is the set of parameters for the ML model under consideration. For example, in the case of SVM, the parameters are C and γ.

2) GA: GA is another well-known meta-heuristic algorithm that is inspired by evolution and the process of natural selection [51]. It is often used to identify high-quality solutions to combinatorial optimization problems using biologically inspired operations including mutation, crossover, and selection [51]. Using these operators, GA algorithms can search the solution space efficiently [51].
In the context of ML hyper-parameter optimization, the GA algorithm works as follows [51]:
a) Initialize a population of random solutions denoted as chromosomes. Each chromosome is a vector of potential hyper-parameter value combinations.
b) Determine the fitness of each chromosome using a fitness function. The function is typically the ML model's accuracy when using each chromosome's vector.
c) Rank the chromosomes according to their relative fitness in descending order.
d) Replace the least-fit chromosomes with new chromosomes generated through crossover and mutation processes.
e) Repeat steps b)-d) until the performance is no longer improving or some stopping criterion is met.
Due to its effectiveness in identifying very good solutions (near-optimal in many cases), this meta-heuristic has been used in a variety of applications including workflow scheduling [52], photovoltaic systems [53], wireless networking [54], and in this case machine learning [55].

3) Bayesian Optimization: The third hyper-parameter optimization method considered in this work is the Bayesian Optimization (BO) method. This method belongs to the class of probabilistic global optimization models [56]. This method aims at minimizing a scalar objective function f(x) for some value x. The output of this optimization process for the same input x differs based on whether the function is deterministic or stochastic [57]. The minimization process is divided into three main parts: a surrogate model that fits all the points of the objective function f(x), a Bayesian update process that modifies the surrogate model after each new evaluation of the objective function, and an acquisition function a(x). Different surrogate models can be assumed, namely the Gaussian Process and the Tree Parzen Estimator.

1) Gaussian Process (GP): The model is assumed to follow a Gaussian distribution. Thus, it is of the form [58]:

p(f(x) | x, parm) = N(f(x) | μ̂, σ̂²)    (8)

where parm is the configuration space of the hyper-parameters and f(x) is the value of the objective function, with μ̂ and σ̂² being its mean and variance respectively. Note that such a model is effective when the number of hyper-parameters is small, but is ineffective for conditional hyper-parameters [59].

2) Tree Parzen Estimator (TPE): The model is assumed to follow one of two density functions, l(x) or g(x), depending on some pre-defined threshold f*(x) [58]:

p(x | f(x), parm) = l(x) if f(x) < f*(x);  g(x) if f(x) > f*(x)    (9)

where parm is the configuration space of the hyper-parameters and f(x) is the value of the objective function. Note that TPE estimators follow a tree structure and can optimize all hyper-parameter types [59].

Based on the surrogate model assumption, the acquisition function is maximized to determine the subsequent evaluation point. The role of the function is to measure the expected improvement in the objective while avoiding values that would increase it [57]. Therefore, the expected improvement (EI) can be determined as follows:

EI(x, Q) = E_Q[ max(0, μ_Q(x_best) − f(x)) ]    (10)

where x_best is the location of the lowest posterior mean and μ_Q(x_best) is the lowest value of the posterior mean.
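To make the four optimization categories above concrete, a few short Python sketches follow. The first illustrates random search over the hyper-parameters that are tuned later in this work (the number of neighbors for KNN; the number of trees and splitting criterion for RF) using scikit-learn's RandomizedSearchCV; the search ranges and iteration budget are assumptions, not the values used in our experiments.

```python
# Illustrative random search (Eq. (5)): sample a subset of hyper-parameter
# combinations instead of exhaustively testing a grid.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

knn_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": list(range(1, 31))},
    n_iter=20, scoring="accuracy", cv=3, random_state=42)

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": list(range(10, 201, 10)),
                         "criterion": ["gini", "entropy"]},
    n_iter=20, scoring="accuracy", cv=3, random_state=42)

# knn_search.fit(X_train, y_train); knn_search.best_params_ holds the chosen K.
```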

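The second sketch shows the PSO velocity and position updates of Eqs. (6)-(7) in plain NumPy; a complete optimizer would additionally evaluate a fitness function per particle, clip positions to valid hyper-parameter ranges, and update the pbest/gbest bookkeeping. The learning factors and swarm size are assumed values.

```python
# Minimal sketch of one PSO iteration (Eqs. (6)-(7)).
import numpy as np

rng = np.random.default_rng(42)
n_particles, dims = 10, 2            # e.g., two hyper-parameters to tune
c1, c2 = 2.0, 2.0                    # cognition and social learning factors

x = rng.uniform(0.0, 1.0, (n_particles, dims))   # particle positions
v = np.zeros((n_particles, dims))                # particle velocities
pbest = x.copy()                                 # each particle's best position so far
gbest = x[0].copy()                              # swarm's best position so far

def pso_step(x, v, pbest, gbest):
    r1 = rng.uniform(size=x.shape)
    r2 = rng.uniform(size=x.shape)
    v_new = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (6)
    x_new = x + v_new                                           # Eq. (7)
    return x_new, v_new

x, v = pso_step(x, v, pbest, gbest)
```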
Authorized licensed use limited to: VNR Vignana Jyothi Inst of Eng & Tech. Downloaded on March 10,2023 at 19:48:04 UTC from IEEE Xplore. Restrictions apply.
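The third sketch follows the GA steps a)-e) for a single hyper-parameter (the number of KNN neighbors); the population size, mutation rate, generation count, and the use of 3-fold cross-validated accuracy as the fitness function are assumptions made for illustration only.

```python
# Sketch of the GA loop of steps a)-e) for tuning the KNN neighbor count.
import random
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(k, X, y):
    # Fitness = cross-validated accuracy of the model built with this chromosome.
    return cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=3).mean()

def ga_search(X, y, pop_size=8, generations=10, mutation_rate=0.2):
    population = [random.randint(1, 30) for _ in range(pop_size)]      # step a)
    for _ in range(generations):
        ranked = sorted(population, key=lambda k: fitness(k, X, y),
                        reverse=True)                                   # steps b)-c)
        parents = ranked[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a + b) // 2                                        # crossover
            if random.random() < mutation_rate:
                child = max(1, child + random.randint(-3, 3))           # mutation
            children.append(child)
        population = parents + children                                 # step d)
    return max(population, key=lambda k: fitness(k, X, y))              # step e)

# best_k = ga_search(X_train, y_train)
```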
INJADAT et al.: MULTI-STAGE OPTIMIZED MACHINE LEARNING FRAMEWORK FOR NETWORK INTRUSION DETECTION 1807
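Finally, TPE-based Bayesian optimization can be realized with the hyperopt library, one common TPE implementation; the search space, evaluation budget, and use of 3-fold accuracy as the objective below are assumptions, and X_train/y_train denote pre-processed training arrays.

```python
# Sketch of BO-TPE for the RF hyper-parameters using hyperopt.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

space = {
    "n_estimators": hp.quniform("n_estimators", 10, 200, 10),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
}

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 criterion=params["criterion"], random_state=42)
    acc = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return {"loss": -acc, "status": STATUS_OK}   # TPE minimizes, so negate accuracy

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)
```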

Fig. 1. Proposed Multi-stage Optimized ML-based NIDS Framework.

V. PROPOSED MULTI-STAGE OPTIMIZED ML-BASED NIDS FRAMEWORK

A. General Framework Description

This work focuses on building a multi-stage optimized ML-based NIDS framework that achieves high detection accuracy and low FAR, and has a low time complexity. The proposed framework is divided into three main stages to achieve this goal.

The first stage includes the data pre-processing, which involves performing Z-score normalization and the Synthetic Minority Oversampling TEchnique (SMOTE). This is done to improve the performance of the training model and reduce the class imbalance often observed in network traffic data [34]. In turn, this can reduce the training sample size since the ML model would have enough samples to understand the behavior of each class [35].

The second stage of the proposed framework is conducting a feature selection process to reduce the number of features needed for the ML classification model. This is done to reduce the time complexity of the classification model and consequently decrease its training time without sacrificing its performance [37]. With that in mind, two different methods are compared within this stage of the framework.

The third stage of the framework involves the optimization of the hyper-parameters of the different ML classification models considered. To that end, three different hyper-parameter tuning/optimization models are investigated, namely random search, meta-heuristic optimization algorithms including particle swarm optimization (PSO) and genetic algorithm (GA), and the Bayesian Optimization (BO) algorithm. These models represent three different hyper-parameter tuning/optimization categories, which are heuristic [46], meta-heuristic [60], and probabilistic global optimization [56] models, respectively.

The results of these optimization stages are combined to build the optimized ML classification model for an effective NIDS that classifies new instances as either normal or attack instances. Figure 1 illustrates the different stages of the proposed framework.

B. Security Considerations

The proposed multi-stage optimized ML-based NIDS framework is a signature-based NIDS system. This is illustrated by the fact that the framework oversamples the minority class, which typically is the attack class in network traffic [27], [28]. Thus, the framework learns from the observed patterns of the known initiated attacks [27], [28]. However, it is worth noting that the framework can work as an anomaly-based NIDS since it is trained by adopting a binary classification model so that it can classify any anomalous behavior as an attack.

This framework can be deployed as one module within a more comprehensive security framework/policy that an individual or organization can adopt. This security framework/policy can include other mechanisms such as firewalls, deep packet inspection, user access control, and user authentication mechanisms [61], [62]. This would offer a multi-layer secure framework that can preserve the privacy and security of the users' data and information.

C. Complexity

To determine the time complexity of the proposed multi-stage optimized ML-based NIDS framework, we need to determine the complexity of each algorithm used in each stage. Given that this work compares the performance of different algorithms within the different stages of the framework, the overall time complexity is determined by the combination of algorithms that results in the highest aggregate complexity. It is assumed that the data is composed of M samples and N features. Starting with the first stage, i.e., the data pre-processing stage, the complexity of the Z-score normalization process is O(N) since we need to normalize all the samples of the N features within the dataset. On the other hand, the complexity of the SMOTE algorithm is O(M_min² N), where M_min is the number of samples belonging to the minority class [63]. Thus, the overall complexity of the first stage is O(M_min² N).

The complexity of the second stage is dependent on the complexity of the different feature selection algorithms considered. The complexity of correlation-based feature selection
consequently decrease its training time without sacrificing its sidered. The complexity of Correlation-based feature selection


is O(MN²) since this method needs to calculate all the class-feature and feature-feature correlations [41]. In contrast, the complexity of the information gain-based feature selection method is O(MN). This is due to the fact that this method has to calculate the joint probabilities of the class-feature interaction [39]. Therefore, the overall complexity of the second stage is O(MN²).

Similarly, the complexity of the third stage depends on the complexity of each of the hyper-parameter optimization methods and the underlying ML model. Starting with the RS method, its complexity is O(N_parm log N_parm), where N_parm is the number of parameters to be optimized [64]. Conversely, the complexity of the PSO algorithm is O(N_parm N_pop), where N_pop is the population size, i.e., the number of swarm particles or potential solutions that we start with [65]. In a similar fashion, it can be shown that the complexity of the GA algorithm is also O(N_parm N_pop), where N_pop is the population size, i.e., the number of chromosomes/potential solutions at the initialization stage [66]. For the GP-based BO algorithm, the complexity is O(M_red³), where M_red is the size of the reduced training sample. This is because the optimization process is carried out on the training sample chosen after pre-processing and feature selection. In contrast, the time complexity of the TPE-based BO model is O(M_red log M_red) since this model follows a tree-like structure when performing the optimization [67].

Based on the aforementioned discussion, the overall complexity of the proposed framework is O(MN²). This is because the second stage will dominate the complexity as it would still use the complete dataset rather than the reduced training dataset. As such, even if we consider the complexity of the potential ML classification model (for example, the complexity of the KNN classifier can be estimated as O(M_red N_red) [68], [69], where N_red is the size of the reduced feature set), it is dependent on the reduced training sample dataset with a reduced feature set size. Hence, the multi-stage optimized ML-based NIDS framework's complexity is O(MN²). Determining the overall time complexity of the complete framework including the optimized ML model training is essential since the model will be frequently re-trained to learn new attack patterns. This is based on the fact that network intrusion attacks continue to evolve and thus organizations need to have flexible and dynamic NIDSs to keep up with these new attacks.

VI. DATASETS DESCRIPTION

This work uses two state-of-the-art intrusion datasets to evaluate the performance of the proposed multi-stage optimized ML-based NIDS framework. In what follows, a brief description of the two datasets is given.

A. CICIDS 2017

The first dataset under consideration is the Canadian Institute of Cybersecurity's IDS 2017 (CICIDS2017) dataset [12]. This dataset is an extension of the ISCX 2012 dataset used in our previous work [11]. The dataset was generated with the goal of resembling realistic background traffic [12]. As such, the dataset contains benign traffic and 14 of the most up-to-date common network attacks. The data collection process spanned a duration of five days from Monday July 3 till Friday July 7, 2017. Within this period, different attacks were generated during different time windows. The resulting dataset contained 3,119,345 instances and 83 features (1 class feature and 82 statistical features) representing the different characteristics of a network traffic request such as duration, protocol used, packet size, as well as source and destination details. However, nearly 300,000 samples were unlabeled and hence were discarded. Therefore, the refined dataset considered in this work contains 2,830,540 instances in total, with 2,359,087 being BENIGN and 471,453 being ATTACK. Note that the attack instances represent various types of real-world network traffic attacks such as denial-of-service (DoS) and port scanning. However, this work merged all attacks into one label as the goal is to detect an attack regardless of its nature.

Fig. 2. Principal Component Analysis of CICIDS 2017 Dataset Illustrating its Non-linear Nature.

Fig. 2 shows the first and second principal components for the CICIDS 2017 dataset. It can be seen that the two classes are intertwined. Moreover, it can be observed that the features of the dataset are non-linear. Hence, we would expect a non-linear kernel to perform better in classifying the instances of this dataset.

Fig. 3. Principal Component Analysis of UNSW-NB 2015 Dataset Illustrating its Non-linear Nature.

B. UNSW-NB 2015

The second dataset considered is the University of New South Wales's network intrusion dataset (UNSW-NB 2015) generated in 2015 [14]. The dataset is a hybrid of real modern network normal activities and synthetic attack behaviors [14].


The data was collected through two different simulations conducted on two different days, namely January 22 and February 17, 2015. The resulting dataset consists of 2,540,044 instances and 49 features (1 class feature and 48 statistical features) representing the different characteristics of a network traffic request such as source and destination details, duration, protocol used, and packet size [14]. These instances are labeled as follows: 2,218,761 normal instances and 321,283 attack instances. In this case, no merging of attacks was needed since the dataset was originally labeled in a binary fashion.

In a similar fashion, Fig. 3 shows the first and second principal components for the UNSW-NB 2015 dataset. Again, we can observe that the features are non-linear. However, it can be observed that the level of intertwining between the two classes is lower. Accordingly, it is easier to separate between the two classes.

Note that there are other network intrusion detection datasets that can be studied such as the NSL-KDD 99 dataset and the Kyoto 2006+ dataset. However, these datasets have already been extensively studied. Moreover, they are outdated and may not have recent attack patterns. In contrast, the two datasets considered in this work are more recent and have more attack patterns. As such, studying them will provide better equipped NIDSs that are trained to detect more attack types.

C. Attack Types

The two datasets considered in this work contain some similar attacks and some that are different. For example, the CICIDS 2017 dataset contains the following attacks: Denial-of-Service (DoS), port scanning, brute-force, web attacks, botnets, and infiltration [12]. In contrast, the UNSW-NB 2015 dataset contains the following attacks: fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, and worms [14]. Accordingly, it can be deduced that the proposed framework learns the patterns of various attack types.

Note that the proposed framework adopts a binary classification model by labeling all attack types as "attack". The goal is to develop a NIDS that can detect various attacks rather than just a finite group of common attacks such as DoS. This reiterates the idea that the proposed multi-stage optimized ML-based NIDS can work as an anomaly-based NIDS despite its training as a signature-based NIDS.

VII. EXPERIMENTAL PERFORMANCE EVALUATION

A. Experimental Setup

The experiments conducted for this work were completed using Python 3.7.4 running on Anaconda's Jupyter Notebook. This was run on a virtual machine with 3 Intel Xeon CPU E5-2660 v3 2.6 GHz processors and 64 GB of memory running Windows Server 2016. The experimental results are divided into three main subsections, namely the impact of data pre-processing on training sample size, the impact of feature selection on feature set size and training sample size, and the impact of optimization methods on the ML models' detection performance.

The classification models used in this work are the KNN classifier and the RF classifier. These classifiers were chosen due to two main reasons. Firstly, these classifiers were the top performing classifiers in our previous work as they showed their effectiveness with network intrusion detection [11]. Secondly, these classifiers have lower computational complexities when compared to other classifiers. For example, the KNN classifier has a complexity of O(MN), where M is the number of instances and N is the number of features [68], [69]. Similarly, the complexity of the RF classifier is O(M²√N t), where t is the number of trees within the RF classifier. However, since this classifier allows for multi-threading, its training time is significantly reduced to approximately O(M²√N t / N_threads), where N_threads is the maximum number of participating threads [30]. In contrast, the complexity of SVM can reach an order of O(M³N) [70]. Therefore, training such a model would be computationally prohibitive, especially given the dataset sizes used in this work. Note that the parameters to be tuned are:
• KNN: number of neighbors K.
• RF: splitting criterion (Gini or Entropy) and number of trees.

It is worth noting that the runtime complexity (also commonly referred to as testing complexity) of the KNN and RF optimized models is O(MN) and O(Nt) respectively, where M is the number of training samples, N is the number of features, and t is the number of decision trees forming the RF classifier [71], [72]. In the case of KNN, any new instance is classified after calculating the distance between itself and all other instances in the training sample and identifying its K nearest neighbors [71]. Conversely, when using the RF classifier, the new instance is fed to the t different decision trees, each of which uses N splits based on the N features considered, and the class is determined based on the majority vote among these t trees.

B. Results and Discussion

1) Impact of Data Pre-Processing on Training Sample Size: Starting with the impact of the data pre-processing stage on the training sample size, the learning curves showing the variation of the training accuracy and the cross-validation accuracy as the training sample size changes are examined. Both datasets were split randomly into training and testing samples after normalization using a 70%/30% split criterion.

Using the SMOTE technique, the number of instances of each type in each dataset's training sample is as follows:
• CICIDS 2017: 1,818,477 benign instances (denoted as 0) and 1,800,000 attack instances (denoted as 1).
• UNSW-NB 2015: 1,775,010 normal instances (denoted as 0) and 1,500,000 attack instances (denoted as 1).

It can be seen from Fig. 4 that the number of training samples needed for the CICIDS 2017 dataset for the training accuracy and cross-validation accuracy to converge is close to 2.3 million samples. Similarly, for the UNSW-NB 2015 dataset, the number of training samples needed is close to 1.3 million samples as can be seen from Fig. 5. This can be attributed to the fact that both datasets are originally imbalanced with much fewer attack samples when compared to normal samples. Hence, the model struggles to learn the attack patterns and behaviors.
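For reference, the following sketch shows how such learning curves can be produced: a random 70%/30% train/test split followed by scikit-learn's learning_curve, which reports training and cross-validation accuracy at increasing training sample sizes. The classifier, fold count, and size grid below are illustrative assumptions.

```python
# Sketch of the 70/30 split and learning-curve computation used in this analysis.
import numpy as np
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42)

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=42), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=3, scoring="accuracy")

# Convergence of the two curves (and a small gap between them) indicates that
# enough training samples are available and the model is not overfitting.
print(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1))
```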


Fig. 4. Learning Curve Showing Training and Cross-Validation Accuracy for CICIDS 2017 Dataset Before SMOTE.

Fig. 5. Learning Curve Showing Training and Cross-Validation Accuracy for UNSW-NB 2015 Dataset Before SMOTE.

Fig. 6. Learning Curve Showing Training and Cross-Validation Accuracy for CICIDS 2017 Dataset After SMOTE.

Fig. 7. Learning Curve Showing Training and Cross-Validation Accuracy for UNSW-NB 2015 Dataset After SMOTE.

In contrast, it can be seen from Figs. 6 and 7 that the number of training samples needed is around 600,000 samples and 800,000 samples for the CICIDS 2017 and UNSW-NB 2015 datasets respectively. This represents a drop of approximately 74% and 39% in the training sample size for the two datasets respectively. This highlights the positive impact of using the SMOTE technique as it was able to significantly reduce the size of the training sample needed without sacrificing the detection performance. This is mainly due to the introduction of more attack samples that allow the ML model to better learn their patterns and behaviors. To further highlight the impact of using the data pre-processing phase, the time needed to build the learning curve was determined. For example, building the learning curve for the UNSW-NB 2015 dataset needed close to 600 minutes prior to applying SMOTE. In contrast, it required around 90 minutes after implementing SMOTE. This highlights the time complexity reduction associated with adopting an oversampling technique.

Moreover, it can be seen from all these figures that the models developed before and after SMOTE for both datasets do not suffer from overfitting, as illustrated by the relatively small error gap between the training and cross-validation accuracy in Figs. 4 and 5 and the zero error gap seen in Figs. 6 and 7. As per [73], overfitting can be observed from the learning curve whenever the error gap between the training accuracy and the cross-validation accuracy is large. Thus, a small or zero error gap implies that the developed model is not too specific to the training dataset but can perform equally well on the testing and cross-validation sets.

2) Impact of Feature Selection on Feature Set Size and Training Sample Size: The second stage of analysis involves studying the impact of the different feature selection algorithms on the feature set size and training sample size.

1) Impact of feature selection on feature set size: Starting with the IGBFS method, Figs. 8 and 9 show the mutual information score for each of the features for the CICIDS 2017 and UNSW-NB 2015 datasets respectively. For example, for the CICIDS 2017 dataset, some of the most informative features include the average packet size and packet length variance. Similarly, for the UNSW-NB 2015 dataset, some of the most informative features are also the packet size (denoted by the sbyte and dbyte features) and the time-to-live values. This illustrates the tendency of attacks to have different packet sizes when compared to normal traffic. Moreover, the figures also show that some IPs may have a higher tendency to initiate attacks, which means they are more likely to be compromised.
Based on the figures, the number of features selected for the CICIDS 2017 and UNSW-NB 2015 datasets is 31 features and 19 features, respectively. This represents a reduction of 62% and 61% in the feature set size for the two datasets respectively. This is caused by the IGBFS method choosing the relevant features that provide the most information about the class.
In contrast, when using the CBFS method, the number of selected features for the CICIDS 2017 and UNSW-NB 2015 datasets is 41 and 32 features respectively. This represents a reduction of 50% and 33.3% for each of the datasets, respectively. This reduction is due to the


Fig. 8. Mutual Information Score of Features for CICIDS 2017 Dataset Showing the Highest Scoring Features in Descending Order.

Fig. 9. Mutual Information Score of Features for UNSW-NB 2015 Dataset Showing the Highest Scoring Features in Descending Order.

Fig. 10. Learning Curve Showing Training and Cross-Validation Accuracy for CICIDS 2017 Dataset After IGBFS.

CBFS method choosing the relevant features that are highly correlated with the class feature, i.e., the features whose variation is also reflected in a variation in the corresponding class.
The IGBFS method tends to choose a lower number of features when compared to the CBFS method. This is because the CBFS method relies on correlation. Thus, two features may both be retained even when one of them is redundant, simply because they are highly correlated with each other and one of them is highly correlated with the class. On the other hand, the IGBFS method studies the features one by one with respect to the class and selects the features that provide the highest amount of information about the class without considering the mutual information between the features themselves. Hence, a lower number of features is typically chosen by the IGBFS method.

2) Impact of feature selection on training sample size: In addition to the impact of the feature selection process on the feature set size, this work also studies its impact on the training sample size. Starting with the IGBFS method, it can be seen from Figs. 10 and 11 that the training sample size was reduced to 250,000 and 110,000 samples for the CICIDS 2017 and UNSW-NB 2015 datasets, respectively. This represents a reduction of 59% and 86% when compared to the required training sample size after the SMOTE technique is applied. This shows that the IGBFS method can keep the features that provide the most information about the class and discard any feature that may be negatively impacting the learning process.

Similarly for the case of using the CBFS method, it can be observed from Figs. 12 and 13 that the required training sample size for the CICIDS 2017 and UNSW-NB 2015 datasets is reduced to 500,000 and 200,000, respectively. This represents a reduction of 17% and 75% when compared to the required


Fig. 11. Learning Curve Showing Training and Cross-Validation Accuracy for UNSW-NB 2015 Dataset After IGBFS.

Fig. 12. Learning Curve Showing Training and Cross-Validation Accuracy for CICIDS 2017 Dataset After CBFS.

Fig. 13. Learning Curve Showing Training and Cross-Validation Accuracy for UNSW-NB 2015 Dataset After CBFS.

TABLE I
OPTIMAL PARAMETER VALUES WITH IGBFS FOR DIFFERENT ML MODELS

TABLE II
OPTIMAL PARAMETER VALUES WITH CBFS FOR DIFFERENT ML MODELS

training sample size after the SMOTE technique is applied. This shows that the CBFS method is also able to select relevant features that have a positive impact on the learning process. However, since some of the features selected may be redundant, this may have a negative impact on the learning process when compared to that of the IGBFS. The time needed to build the learning curve using the two feature selection methods was determined to further highlight the impact of feature selection on the reduction of time complexity. For example, building the learning curve for the UNSW-NB 2015 dataset required around 21 minutes and 25 minutes for the IGBFS and CBFS methods, respectively. Accordingly, applying either of the two feature selection methods will have a positive impact on the feature set size and training sample size, with the IGBFS method having a slight advantage over the CBFS method.

Figs. 10, 11, 12, and 13 show a relatively small or zero error gap between the training accuracy and the cross-validation accuracy. This indicates that the model is suitable to be generalized to the testing and cross-validation datasets and is not being overfit to the training dataset [73].

3) Impact of Optimization Methods on the ML Models' Detection Performance: To evaluate the performance of the different classifiers and study the impact of the different optimization methods on them, we determine four evaluation metrics, namely the accuracy (acc), precision, recall/true positive rate (TPR), and false alarm/positive rate (FAR/FPR) as per [11], [74] using the following equations:

Acc = (tp + tn) / (tp + tn + fp + fn)    (11)

Precision = tp / (tp + fp)    (12)

Recall/TPR = tp / (tp + fn)    (13)

FAR/FPR = fp / (tn + fp)    (14)

where tp is the number of true positives, tn is the number of true negatives, fp is the number of false positives, and fn is the number of false negatives. These values compose the confusion matrix of any ML model.
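These metrics can be computed directly from the binary confusion matrix, as in the short sketch below; the fitted model and the held-out test split are assumed to be available from the earlier steps, and label 1 is taken to be the attack (positive) class.

```python
# Direct computation of Eqs. (11)-(14) from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)                      # "model" assumed already fitted
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

acc = (tp + tn) / (tp + tn + fp + fn)               # Eq. (11)
precision = tp / (tp + fp)                          # Eq. (12)
recall_tpr = tp / (tp + fn)                         # Eq. (13)
far_fpr = fp / (tn + fp)                            # Eq. (14)
print(acc, precision, recall_tpr, far_fpr)
```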


TABLE III
PERFORMANCE RESULTS OF THE MULTI-STAGE OPTIMIZED ML-BASED NIDS FRAMEWORK WITH IGBFS FOR TESTING DATASETS

TABLE IV
PERFORMANCE RESULTS OF THE MULTI-STAGE OPTIMIZED ML-BASED NIDS FRAMEWORK WITH CBFS FOR TESTING DATASETS

Table I gives the optimal parameter values for the two different classifiers when the IGBFS technique is used. In the case of the KNN method, the RS and PSO methods tend to choose smaller values for the number of neighbors when compared to the GA, BO-GP, and BO-TPE methods. For the RS method, this is due to the fact that the algorithm's stopping criterion is typically the number of iterations and it thereby does not test all potential values. Accordingly, it is possible for it to miss the optimal number of neighbors. Similarly, one of the stopping criteria in the PSO algorithm is also the number of evaluations, which can also lead to it missing the optimal value. In contrast, the GA, BO-GP, and BO-TPE all resulted in a similar number of neighbors for both the CICIDS 2017 and UNSW-NB 2015 datasets. For the GA algorithm, the number of generations is typically set sufficiently high to reach the optimal value for the number of neighbors. In a similar manner, the BO-GP and BO-TPE determine the actual optimal value based on the assumed model.

In the case of the RF method, the RS and PSO algorithms tend to choose a lower number of trees compared to the GA, BO-GP, and BO-TPE. This is due to the algorithms' stopping criterion that often leads to a premature stoppage. In contrast, the GA, BO-GP, and BO-TPE determine that the number of trees needed is higher as they explore more potential values, allowing them to select more optimal values for the number of trees. In terms of the splitting criterion, the entropy criterion is mostly selected. This is expected since the IGBFS method selects features based on their information gain, which is determined using the entropy of each feature. As such, this criterion would be more suitable when using IGBFS.

Looking at Table II, similar observations about the hyper-parameter optimization performance of the different algorithms can be made for both the KNN and RF methods. The only difference is that for the RF method, the splitting criterion is chosen to be the Gini index. This is due to the CBFS method using the correlation as the selection criterion rather than the entropy. Therefore, the features chosen may result in a low amount of information (equivalent to having a high entropy with respect to the class), and thus would be overlooked if the entropy splitting criterion is chosen. This is the reason behind choosing the Gini splitting criterion when the CBFS method is used.

Tables III and IV show the performance of the two classification algorithms when using the IGBFS and CBFS methods, respectively. Several observations can be made. The first observation is that the optimized models outperform the regular models recently reported in [12], [30], [75] by 1-2% on average in terms of accuracy with a reduction of 1-2% in FAR for both datasets. This is expected since one of the main goals of hyper-parameter optimization is to improve the performance of the ML models. The second observation is that the RF classifier outperforms the KNN classifier for both the IGBFS and CBFS methods as seen in the CICIDS 2017 and UNSW-NB 2015 datasets. This reiterates the previously obtained results in [11] with the ISCX 2012 dataset and the reported results in [12], [30], [75] in which the RF classifier also outperformed the KNN model. This can be attributed to the RF classifier being an ensemble model. Accordingly, it is effective with non-linear and high-dimensional datasets like the datasets under consideration in this work. The third observation is that the BO-TPE-RF method had the highest detection accuracy for both the CICIDS 2017 and UNSW-NB 2015 datasets for both feature selection algorithms, with a detection accuracy of 99.99% and 100%, respectively. This proves the effectiveness and robustness of the proposed multi-stage optimized ML-based NIDS framework as it outperformed other NIDS frameworks.

VIII. CONCLUSION

The area of cyber-security has garnered significant attention from both the industry and academia due to the


increased dependency of individuals and organizations on the R EFERENCES


Internet and their concern about the security and privacy [1] C.-F. Tsai, Y.-F. Hsu, C.-Y. Lin, and W.-Y. Lin, “Intrusion detection
of their activities. More resources are being deployed and by machine learning: A review,” Expert Syst. Appl., vol. 36, no. 10,
allocated to protect modern Internet-based networks against pp. 11994–12000, 2009.
potential attacks or anomalous activities. Accordingly, dif- [2] A. Moubayed, M. Injadat, A. Shami, and H. Lutfiyya, “Student engage-
ment level in e-learning environment: Clustering using k-means,” Amer.
ferent types of network intrusion detection systems (NIDSs) J. Distance Educ., vol. 34, no. 2, pp. 137–156, 2019.
have been proposed in the literature. Despite the continu- [3] M. Injadat, F. Salo, and A. B. Nassif, “Data mining
ous improvements in NIDS performance, there is still room techniques in social media: A survey,” Neurocomputing,
vol. 214, pp. 654–670, Nov. 2016. [Online]. Available:
for further improvement. More insights can be extracted http://www.sciencedirect.com/science/article/pii/S092523121630683X
from the high volume of network traffic data gener- [4] M. B. Salem, S. Hershkop, and S. J. Stolfo, “A survey of insider attack
ated, the continuously changing environments, the plethora detection research,” in Insider Attack and Cyber Security. Boston, MA,
USA: Springer, 2008, pp. 69–90.
of features collected as part of training datasets (high [5] W. Bul’ajoul, A. James, and M. Pannu, “Improving network intrusion
dimensional datasets), and the need for real-time intrusion detection system performance through quality of service configura-
detection. tion and parallel technology,” J. Comput. Syst. Sci., vol. 81, no. 6,
pp. 981–999, 2015.
Choosing the most suitable subset of features and optimizing the parameters of the machine learning (ML)-based detection models is essential to enhance their performance. Accordingly, this paper expanded on our previous work by proposing a multi-stage optimized ML-based NIDS framework that reduced the computational complexity while maintaining its detection performance. Using two recent state-of-the-art intrusion detection datasets (the CICIDS 2017 dataset and the UNSW-NB 2015 dataset) for performance evaluation, this work first studied the impact of oversampling techniques on the models' training sample size and determined the minimum suitable training size for effective intrusion detection. Experimental results showed that using the SMOTE oversampling technique can reduce the training sample size by between 39% and 74% of the original datasets' size. Additionally, this work compared two different feature selection techniques, namely information gain-based feature selection (IGBFS) and correlation-based feature selection (CBFS), and explored their impact on the feature set size, the training sample size, and the models' detection performance. The experimental results showed that the feature selection methods were able to reduce the feature set size by almost 60%. Moreover, they further reduced the required training sample size by between 33% and 50% when compared to the training sample after SMOTE. Finally, this work investigated the impact of different ML hyper-parameter optimization techniques on the NIDS's performance using two ML classification models, namely the K-nearest neighbors (KNN) and the Random Forest (RF) classifiers. Experimental results showed that the optimized RF classifier with Bayesian Optimization using the Tree Parzen Estimator (BO-TPE-RF) had the highest detection accuracy when compared to the other optimization techniques, enhancing the detection accuracy by 1-2% and reducing the false alarm rate (FAR) by 1-2% when compared to recent works from the literature. It was also observed that using the IGBFS method achieved better detection accuracy when compared to the CBFS method.
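To make the multi-stage pipeline summarized above concrete, the following minimal sketch chains the three stages (SMOTE oversampling, information-gain-style feature selection, and BO-TPE hyper-parameter tuning of an RF classifier) using scikit-learn, imbalanced-learn, and hyperopt. It is an illustration under stated assumptions rather than the authors' implementation: the synthetic data, the number of selected features, and the search ranges are placeholders, and mutual information is used as a stand-in for the paper's IGBFS scoring.

# Minimal sketch of the three optimization stages; all values are illustrative only.
from hyperopt import fmin, tpe, hp, Trials
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score

# Stand-in for a labeled flow-feature dataset (e.g., CICIDS 2017 or UNSW-NB 2015).
X, y = make_classification(n_samples=3000, n_features=30, n_informative=12,
                           weights=[0.9, 0.1], random_state=42)

# Stage 1: oversample the minority (attack) class with SMOTE.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

# Stage 2: keep the k highest-scoring features (mutual information ~ information gain).
X_sel = SelectKBest(mutual_info_classif, k=15).fit_transform(X_bal, y_bal)

# Stage 3: Bayesian optimization with the Tree Parzen Estimator over RF hyper-parameters.
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 25),
    "max_depth": hp.quniform("max_depth", 5, 50, 5),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
}

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]),
                                 criterion=params["criterion"],
                                 random_state=42)
    acc = cross_val_score(clf, X_sel, y_bal, cv=3, scoring="accuracy").mean()
    return 1.0 - acc  # hyperopt minimizes, so return the classification error

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=Trials())
print("Best RF hyper-parameters (criterion is returned as an index):", best)

In a full experiment, the tuned model would then be retrained on the reduced training sample and evaluated on a held-out test set in terms of detection accuracy and false alarm rate, mirroring the comparisons reported above.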
Other models such as deep learning classifiers can be explored in future work since these models perform admirably on non-linear and high-dimensional datasets. Investigating the impact of combining supervised and unsupervised ML techniques may also prove valuable in this field for detecting novel attacks.
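As a purely illustrative example of the second direction (not part of the proposed framework), an unsupervised anomaly detector can be used to enrich the feature space of a supervised classifier. The sketch below, which assumes scikit-learn and synthetic stand-in data, appends an Isolation Forest anomaly score as an extra input feature to an RF classifier.

# Illustrative combination of unsupervised and supervised learning; not from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for labeled network-flow features with a rare attack class.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Unsupervised stage: learn what "normal-looking" traffic is from the training split.
iso = IsolationForest(random_state=0).fit(X_tr)
score_tr = iso.score_samples(X_tr).reshape(-1, 1)  # lower scores = more anomalous
score_te = iso.score_samples(X_te).reshape(-1, 1)

# Supervised stage: the RF sees the raw features plus the anomaly score.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(np.hstack([X_tr, score_tr]), y_tr)
print("Test accuracy with anomaly-score feature:", rf.score(np.hstack([X_te, score_te]), y_te))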




MohammadNoor Injadat (Member, IEEE) received the B.Sc. degree in computer science from Al Al-Bayt University, Jordan, in 2000, the M.Sc. degree from University Putra Malaysia, Malaysia, in 2002, and the M.E. degree in electrical and computer engineering and the Ph.D. degree in software engineering from the Department of Electrical and Computer Engineering, University of Western Ontario, in 2015 and 2020, respectively. His research interests include data mining, machine learning, social network analysis, e-learning analytics, and network security.

Abdallah Moubayed (Member, IEEE) received the B.E. degree in electrical engineering from the Lebanese American University, Beirut, Lebanon, in 2012, the M.Sc. degree in electrical engineering from the King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, in 2014, and the Ph.D. degree in electrical and computer engineering from the University of Western Ontario in August 2018, where he is currently a Postdoctoral Associate with the Optimized Computing and Communications Lab. His research interests include wireless communication, resource allocation, wireless network virtualization, performance and optimization modeling, machine learning and data analytics, computer network security, cloud computing, and e-learning.

Ali Bou Nassif (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the University of Western Ontario, London, ON, Canada, in 2012. He is currently an Assistant Professor and an Assistant Dean of Graduate Studies with the University of Sharjah, UAE, and an Adjunct Research Professor with Western University. He has published more than 60 papers in international journals and conferences. His interests are machine learning and soft computing, software engineering, cloud computing and service-oriented architecture, and mobile computing. He is a Registered Professional Engineer in Ontario, as well as a member of the IEEE Computer Society.

Abdallah Shami (Senior Member, IEEE) is a Professor with the ECE Department, Western University, London, ON, Canada, where he is the Director of the Optimized Computing and Communications Laboratory. He is currently an Associate Editor for the IEEE Transactions on Mobile Computing, IEEE Network, and IEEE Communications Surveys & Tutorials. He has chaired key symposia for IEEE GLOBECOM, IEEE ICC, IEEE ICNC, and ICCIT. He was the elected Chair of the IEEE Communications Society Technical Committee on Communications Software from 2016 to 2017, and the IEEE London Ontario Section Chair from 2016 to 2018.
