
Computer Networks 148 (2019) 164–175

Contents lists available at ScienceDirect

Computer Networks
journal homepage: www.elsevier.com/locate/comnet

Dimensionality reduction with IG-PCA and ensemble classifier for network intrusion detection
Fadi Salo a, Ali Bou Nassif b,a, Aleksander Essex a,∗
a Department of Electrical and Computer Engineering, The University of Western Ontario, London, Canada
b Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, UAE

Article info

Article history:
Received 1 June 2018
Revised 16 October 2018
Accepted 10 November 2018
Available online 14 November 2018

Keywords:
Network security
Intrusion detection
Data mining
Dimensionality reduction
Ensemble classifier
IG
PCA

Abstract

Handling redundant and irrelevant features in high-dimensional datasets has long been a challenge for network anomaly detection. Eliminating such features with spectral information not only speeds up the classification process but also helps classifiers make accurate decisions during attack recognition, especially when coping with large-scale and heterogeneous data. A novel hybrid dimensionality reduction technique is proposed for intrusion detection, combining the approaches of information gain (IG) and principal component analysis (PCA) with an ensemble classifier based on support vector machine (SVM), instance-based learning algorithms (IBK), and multilayer perceptron (MLP). The performance of this IG-PCA-Ensemble method was evaluated on three well-known datasets, namely ISCX 2012, NSL-KDD, and Kyoto 2006+. Experimental results show that the proposed hybrid dimensionality reduction method with the ensemble of the base learners contributes more critical features and significantly outperforms individual approaches, achieving high accuracy and low false alarm rates. A comparative analysis against related work finds that the proposed IG-PCA-Ensemble method exhibits better classification accuracy, detection rate, and false alarm rate than the majority of the existing state-of-the-art approaches.
© 2018 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: fsalo@uwo.ca (F. Salo), anassif@sharjah.ac.ae (A.B. Nassif), aessex@uwo.ca (A. Essex).
https://doi.org/10.1016/j.comnet.2018.11.010
1389-1286/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Despite recent developments in computer networks, security applications, and people's awareness of the latest protection technology, current mitigations against contemporary cyber-attacks remain unable to provide complete protection. The necessity of evolving efficient security approaches has, therefore, attracted considerable attention from industry and academia to better detect security threats. Classical security methods such as firewalls, malware prevention, data encryption, and user authentication form a necessary but incomplete set of tools to secure computers and networks from today's attacks. As such, additional lines of defense, such as artificial intelligence, have become a quickly growing area of interest [1]. Given such challenges, intrusion detection systems (IDS) and traditional security systems can complement each other on security infrastructures around the world. Generally, IDSs are active systems that continually monitor and analyze network traffic to identify deviation from the expected behavior of passing traffic. Two types of IDSs can be distinguished based on the detection philosophy [2]. The first methodology is signature-based (misuse) intrusion detection, which analyzes network packets using pre-defined attack signatures to classify intrusion attempts. This method, as such, cannot recognize novel attacks [3].

In contrast, anomaly-based detection has the potential to discover previously unknown attacks by examining network traffic for any deviations from normal activities. In this method, a model of trustworthy system behavior is defined a priori using, for example, data mining techniques. Observed events and behaviors are then classified as normal or as an anomaly [4]. Research in this area has focused on improving the accuracy and efficiency of IDSs. Given the promising effectiveness of anomaly-based IDS, this approach has been widely adopted and has become a principal focus of IDS research. In this context, various machine learning techniques (MLT) have been utilized to build an effective IDS. Examples include Bayesian networks (BN), Markov models, neural networks (NN), fuzzy logic techniques, k-nearest neighbor (k-NN), and support vector machine (SVM). Mukkamala et al. [5] combined different MLTs including SVM, Multivariate Adaptive Regression Splines (MARS), and Artificial Neural Networks (ANN) to improve their model accuracy in detecting intrusions. The process compared five different classifiers along with their model. Results have shown that the ensemble of SVM, MARS, and ANN achieved the best accuracy. Similarly, the assembling of radial basis function neural networks (RBF) and decision trees (J48) was investigated by Mrutyunjaya et al. [6]. The experiment was performed on the popular NSL-KDD [7] and KDDCup99 [8] datasets and the result shows a high detection rate with low false alarms. Chandrasekhar and Raghuveer [9] classified network patterns using neuro-fuzzy and radial SVM to detect intrusions. The authors utilized KDDCup99 to distinguish normal traffic from differing attack types such as probing, Denial of Service Attacks (DoS), Remote to Local (R2L), and User to Root Attacks (U2R). The experiment was compared with other existing outstanding techniques, and the result proved the effectiveness of their technique. Tan et al. [10] proposed a new technique to classify traffic as either a normal pattern or a DoS attack. The idea was based on treating network traffic records as images to be evaluated in the proposed detection system. The experiment used two different datasets (KDDCup99 and ISCX 2012) and acquired a remarkable detection accuracy of 99.95% and 90.12% respectively.

Despite advances, the sheer volume of data has posed a continual challenge to IDSs, with growing computational and storage complexities leading to unsatisfactory classification results. The benchmark dataset in the field of IDS research is KDDCup99. This dataset includes millions of records with a wide variety of intrusions to provide training and testing datasets for researchers. Accordingly, classifying such a dataset may encounter many difficulties which could degrade the performance of the classifier, or lead to a complete failure due to insufficient memory. Also, preprocessing large-volume datasets usually presents other serious challenges in handling redundant data, noisy data, and high dimensionality, which could affect the classifiers' efficiency. The main contributions of this paper are listed as follows:

• This work proposes a novel dimensionality reduction method, in which both feature selection and extraction techniques are integrated. In this method, IG and PCA were utilized to both reduce features and extract a new set of uncorrelated features. Subsequently, the selected optimal subspace, which represents the lower dimensional feature, is used in the training and testing phase of the proposed model.
• Improve the predictive performance by combining decisions from multiple classifiers (SVM, IBK, and MLP) into one using an ensemble approach. To boost the performance of the ensemble, a vote classifier is utilized based on the average of probabilities (AOP) combination rule [11].
• Conduct experiments on three different benchmark IDS datasets to provide a more robust performance evaluation of the proposed framework. The widely studied DARPA/KDD99 dataset is outdated and does not include many novel attacks. Furthermore, the utilized datasets include different numbers of features and instances, which increases the challenge of testing the proposed dimensionality reduction method.

The rest of the paper is organized as follows: Section 2 presents related work. Section 3 describes the proposed method. Section 4 presents the experimental setup and results. Finally, the conclusion is presented in Section 5.

2. Related work

As a means of improving computational performance, the technique of dimensionality reduction can be used as a pre-processing step in a machine learning algorithm to eliminate irrelevant features and retain the features most related to the predictive modeling problem. In practice, there are two methods used to reduce the number of attributes in the dataset. The first method is feature selection, which can be used to retrieve a subset of the original features without alteration. Feature selection is typically classified into three categories: filter, wrapper, and embedded methods [12]. The filter method chooses features based on their intrinsic properties, i.e., relevance, and without considering the performance of the classifier. Some common examples of the filter method are Information Gain, Chi-Squared, and Correlation Coefficient Scores [13]. Unlike the filter method, wrapper methods pick features according to their utility in improving classifier performance; examples include Recursive Feature Elimination, Forward Feature Selection, and Backward Feature Elimination [14]. Thus, wrapper methods obtain relatively better performance in classification models; however, they are computationally more expensive than the filter method, especially when dealing with high-dimensional data. The embedded method, on the other hand, leverages the advantages of both filter and wrapper methods. Principally, it is similar to the wrapper method but computationally less expensive since the selection of the best features occurs while the model is being created, e.g., Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge regression [15].

In the second method, feature extraction creates new features by producing new combinations of the original features [16]. Stein et al. [17] used genetic algorithms to select, for example, the most relevant attributes in the KDDCUP99 dataset for decision tree classifiers. The result achieved better detection accuracy and a lower false alarm rate when feature selection was applied. Likewise, Mukkamala et al. [18] introduced a new feature selection approach to reduce the KDDCUP99 dimensionality from 41 to 6 features. The authors evaluated their model based on an SVM classifier to detect intrusions, and the result shows an improvement in classification accuracy. Using the same dataset, Chen et al. [19] presented an IDS model based on Flexible Neural Trees (FNT). Surprisingly, their model achieved a significant result in detection performance (99.19%) with only 4 features. Mukherjee and Sharma [20] proposed an IDS using Naive Bayes with a novel method called Feature Vitality Based Reduction (FVBR). The method has been compared with three different feature selection methods including IG, Gain Ratio (GR), and Correlation-based Feature Selection (CFS). Results show that the proposed FVBR method outperformed the other three standard feature selection methods. Guisong et al. [21] proposed a hierarchical intrusion detection model using PCA neural networks (PCANN) for online anomaly and misuse detection. Their experiments were carried out and evaluated on the DARPA 1998 dataset. The obtained results show that the proposed model outperforms other models listed in their study. Kuang et al. [22] presented another novel model, named KPCA-GA-SVM, for ID. The model incorporates kernel principal component analysis (KPCA) with a genetic algorithm (GA) to reduce the input features and optimize the classifier parameters respectively, as well as the SVM classifier to identify attacks. The experiment results show that the KPCA-GA-SVM model achieved a superior predictive accuracy with less training time in comparison with other algorithms. PCA and Fisher Discriminant Ratio (FDR) have been utilized to reduce features and eliminate noise from the dataset by De la Hoz et al. [23]. Their model was based on probabilistic self-organizing maps (PSOM) and intended to model the feature space and distinguish normal from anomalous patterns. The achieved results in terms of accuracy, specificity, and sensitivity were 90%, 93%, and 97% respectively.

In this work, a novel dimensionality reduction technique is proposed by integrating both feature selection and extraction techniques with an ensemble classifier for distinguishing between normal and anomalous connections. To make a fair comparison with some of the state-of-the-art methods, the proposed model was evaluated on three well-known datasets, namely ISCX 2012, NSL-KDD, and Kyoto 2006+.

Fig. 1. The framework of the IG-PCA-Ensemble model.
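As a concrete instance of the filter-style feature selection surveyed in Section 2, and of the first stage of the framework in Fig. 1, the following is a minimal sketch of information-gain ranking in the sense of Eqs. (1)–(3) of Section 3.1. It is an illustration only, not the authors' implementation (the paper reports using WEKA): the function names, the toy flows, and the assumption that every attribute is already discrete are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_information(labels):
    """Expected information I(d_1, ..., d_m) of a label list, in bits (cf. Eq. (1))."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain(T) = I(d_1, ..., d_m) - E(T) for one discrete attribute (cf. Eqs. (2)-(3))."""
    # Split the labels into one subset S_j per distinct attribute value.
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    # E(T): size-weighted information of the subsets.
    entropy = sum(len(s) / total * class_information(s) for s in subsets.values())
    return class_information(labels) - entropy


def rank_by_gain(columns, labels, k):
    """Return the names of the k attributes with the highest information gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Toy, made-up flows (hypothetical values, not taken from any of the datasets).
labels = ["attack", "attack", "normal", "normal"]
columns = {
    "direction": ["L2R", "L2R", "R2L", "R2L"],  # separates the two classes
    "protocol": ["tcp", "udp", "tcp", "udp"],   # independent of the class
}
top = rank_by_gain(columns, labels, k=1)
```

In this toy data the attribute that separates the classes receives the full one bit of gain, while the attribute that is independent of the class receives none. Continuous attributes would need to be discretized before a sketch like this applies.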

3. Methodology

In this paper, a new hybrid approach for dimensionality reduction using IG and PCA with an ensemble classifier based on SVM, IBK, and MLP algorithms is presented. Fig. 1 demonstrates the detection framework of the proposed method, which consists of three stages: (i) dimensionality reduction using IG and PCA to determine the relevant features, (ii) building and training the ensemble classifier based on SVM, IBK, and MLP algorithms, and (iii) attack recognition using a voting technique to combine the probability distributions of the base learners. This ensemble was chosen after extensive experimentation with assembling various learning methods, including SVM, IBK, and MLP, to detect intrusions. In this scenario, six different algorithms were trained to distinguish normal traffic from attacks. The ensemble of SVM, IBK, and MLP achieved the best performance in terms of classification accuracy. Detailed information about the framework is provided in Sections 3.1–3.3.

3.1. Feature selection: Information Gain (IG)

The first step in this study is to reduce the dimensionality of the utilized datasets by employing IG with a ranker as a filtering approach. The main concept of this approach is to compute the information gain of each attribute and rank the attributes in decreasing order. Each attribute gains a score from 1 (most relevant) to 0 (least relevant). Attributes with the highest scores are considered as the input subset of features to the next dimensionality reduction step. More precisely, let S represent a sample training dataset with its instances and corresponding attributes. Let $m$ be the number of classes, $d_i$ the number of training instances of class $i$, and $D$ the total number of instances in the training set. Eq. (1) computes the expected information required to classify a given instance.

$$ I(d_1, d_2, \ldots, d_m) = -\sum_{i=1}^{m} \frac{d_i}{D} \cdot \log_2 \frac{d_i}{D} \tag{1} $$

An attribute $T$ with values $\{t_1, t_2, \ldots, t_v\}$ can split the training dataset into $v$ subsets $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ is the subset which has the value $t_j$ for attribute $T$. Moreover, if $S_j$ includes $d_{ij}$ instances of class $i$, the entropy of the attribute $T$ is calculated as:

$$ E(T) = \sum_{j=1}^{v} \frac{d_{1j} + \cdots + d_{mj}}{D} \times I(d_{1j}, \ldots, d_{mj}) \tag{2} $$

Eq. (3) shows the calculation required to obtain the information gain for $T$:

$$ \mathrm{Gain}(T) = I(d_1, \ldots, d_m) - E(T) \tag{3} $$

In this study, the filter method was utilized because it is computationally simple and faster than other methods [24]. Thus, IG for a large number of candidate attributes has been calculated for the three datasets. Ultimately, a small number of attributes with high information gain is considered for the next step.

3.2. Feature extraction: Principal Component Analysis (PCA)

The selected attributes from the IG method can be used directly for classification. However, a common problem of IG is a bias toward attributes with a large range of possible values [25]. In such a scenario, these attributes return a near-zero entropy value, which raises their gain above that of other attributes. Consequently, the ranking of these features might not reflect their true relevance to the training instances. To mitigate this limitation, attributes from the feature selection stage are reduced further by applying the PCA technique to choose an optimum subset of attributes. This lets PCA narrow the search range from the entire original feature space to the pre-selected features. PCA has been widely utilized in feature extraction and data compression because it reduces computational complexity, distracting noise, and the risk of over-fitting, and because its calculation is flexible and reversible [26,27]. Primarily, it is a statistical procedure that converts a set of features, using orthogonal transformations, into a set of values of linearly uncorrelated variables without losing much information. The transformed variables, called principal components, are sorted from the largest possible variance to the lowest. In other words, the first component (PCA1) covers the maximum variance, and each component that follows covers a smaller share of the variance.

More specifically, let $\{x(t)\}$ for $t = 1, \ldots, n$ be a random dataset including its corresponding instances and features with zero mean. The covariance matrix of $x(t)$ is shown in Eq. (4):

$$ R = \frac{1}{n-1} \sum_{t=1}^{n} x(t)\, x(t)^{T}. \tag{4} $$

The linear transformation from $x(t)$ to $y(t)$ in PCA can be computed as

$$ y(t) = M^{T} x(t) \tag{5} $$

where $M$ represents an $n \times n$ orthogonal matrix whose $i$th column is the $i$th eigenvector of the sample covariance matrix $R$. PCA first solves the eigenvalue problem shown in Eq. (6):

$$ \lambda_i q_i = R q_i, \tag{6} $$

where $\lambda_i$ denotes an eigenvalue of $R$ (consider $\lambda_1 > \lambda_2 > \ldots > \lambda_n$), while the corresponding eigenvector is denoted by $q_i$. According to Eq. (5), the principal component is calculated by:

$$ y_i(t) = q_i^{T} x(t), \quad i = 1, \ldots, n, \tag{7} $$

where $y_i(t)$ signifies the $i$th principal component. The principal directions ($k$ eigenvectors) can be acquired by sorting the eigenvalues $\lambda_i$ in descending order for feature extraction. Eq. (8) states the calculation to project a new sample $x(t)$ into the principal space. Let

$$ \hat{x}(t) = \sum_{i=1}^{k} \left( b_i^{T} x(t) \right) b_i, \tag{8} $$

where $B = \{b_i : b_i = q_i,\ i = 1, \ldots, k\}$. Eq. (9) shows the projection error of $x(t)$ by calculating the distance $d$ between $x(t)$ and $\hat{x}(t)$:

$$ e_t = d(x(t), \hat{x}(t)). \tag{9} $$

The selection process of the hybrid IG-PCA approach is summarized in Algorithm 1.

Algorithm 1 IG-PCA dimensionality reduction method.
 1: Input dataset X (X includes n instances with their corresponding T features)
 2: procedure Compute_IG(X)
 3:     Calculate the estimated information required to categorize a given instance (cf. Eq. (1))
 4:     for each attribute Ti of X do
 5:         Compute entropy for attribute Ti (cf. Eq. (2))
 6:         Compute information gain for attribute Ti (cf. Eq. (3))
 7:     Y ← the k attributes with the highest scores
 8:     return Y
 9: procedure Compute_PCA(Y)
10:     Compute the covariance matrix of Y (cf. Eq. (4))
11:     Compute eigenvectors (q1, ..., qi) and eigenvalues (λ1, ..., λi) of the covariance matrix above (cf. Eq. (6))
12:     Z ← the k eigenvectors with the largest eigenvalues (cf. Eq. (7))
13:     return Z    ▷ the new k-dimensional feature space
14: end

3.3. Classification using an ensemble approach

Ensemble techniques are powerful methods to improve the predictive performance of the final model by creating multiple independent models and combining them to obtain results with higher overall accuracy [28,29]. The improved effectiveness of ensemble classifiers over individual classifiers is typically the result of three scenarios: representational issue, statistical reason, and computational reason [30]. In the first scenario, an issue can arise when a single classifier is not able to find the best representation in the hypothesis space. Likewise, the second scenario occurs when the input dataset is not sufficient to train the learning algorithm, in which case the result will more likely lead to a weak hypothesis. In the last case, a problem can emerge when it is too computationally time consuming for an individual classifier to produce a suitable hypothesis.

Ensemble techniques have also arisen as a possible methodology to solve the class imbalance problem [31]. There are various ensemble learning techniques available such as: boosting [32], bagging [33], voting [34], Bayesian parameter averaging [35], and stacking [36]. In this paper, a novel ensemble classifier that uses SVM, IBK, and MLP learners is proposed to enhance the intrusion detection accuracy. These classifiers were used in a vote algorithm based on the average of probabilities (AOP) combination rule.

Suppose we have $\ell$ classifiers $C = \{C_1, \ldots, C_\ell\}$ and $c$ classes $\Omega = \{\omega_1, \ldots, \omega_c\}$. For the datasets considered in this paper, $c = 2$ (i.e., attack/non-attack), and $\ell = 3$ as per the classifiers listed above. A classifier $C_i : \mathbb{R}^n \to [0, 1]^c$ accepts an object $x \in \mathbb{R}^n$ and outputs a vector $[P_{C_i}(\omega_1 | x), \ldots, P_{C_i}(\omega_c | x)]$, where $P_{C_i}(\omega_j | x)$ denotes the probability assigned by $C_i$ to the hypothesis that object $x$ belongs to class $\omega_j$. For each class $\omega_j$, let $m_j$ be the mean of the probabilities assigned by the $\ell$ respective classifiers, i.e.,

$$ m_j = \frac{1}{\ell} \sum_{i=1}^{\ell} P_{C_i}(\omega_j | x). \tag{10} $$

Let $M = [m_1, \ldots, m_c]$ be the set of mean probabilities for each class. Object $x$ is assigned to the class in $M$ with the greatest mean, i.e., $x$ is assigned to class $\omega_k$ if

$$ m_k = \max M. \tag{11} $$

The performance of the proposed ensemble approach is measured using three well-known intrusion detection evaluation datasets, namely ISCX 2012, Kyoto 2006+, and NSL-KDD (binary class).

3.3.1. Support vector machine (SVM)

SVM is a learning technique which aims to find the optimal separating hyperplane that maximizes the margin between classes in a high dimensional feature space. Intuitively, support vectors are the vectors that define the hyperplane. A desirable property of SVM is that it can perform classification using the support vectors rather than the entire dataset, and thereby, it is extremely robust to outliers and can predict very efficiently. Let the $N$ training data points (vectors) be $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. Each data point has an associated Lagrangian multiplier $\alpha_i$ assigning a relative weight/importance. If the hyperplane is defined by $(w, b)$, the predicted class of point $x$ can be calculated as

$$ f(x) = \mathrm{sgn}(w \cdot k(x, x_i) - b) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i \exp\left( -\frac{\lVert x - x_i \rVert^2}{2\sigma^2} \right) - b \right), \tag{12} $$

where sgn is the sign function, $k(\cdot, \cdot)$ is the radial basis function (RBF) kernel, $w$ is the weight vector, $x$ is a point in the input space with an unknown classification, $\sigma$ is the standard deviation, and $b$ is the bias. Once the hyperplane is defined, all the points located closest to it will have $\alpha_i > 0$; these points are the support vectors. The remaining points will have $\alpha_i = 0$. SVM has been utilized successfully in various real-world problems including bioinformatics, image classification, and text categorization.

3.3.2. Instance-based learning algorithms (IBK)

IBK classifiers, also known as k-nearest neighbors (k-NN), are a simple and effective approach belonging to the lazy learning category. In such a technique, no learning is required for the model and the prediction can be performed from the raw training instances. k-NN classifies a new (unseen) instance by a majority vote among its k most similar training instances, where the distance is the key factor to identify the similarity between two data points. Suppose we have pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$; for a new instance $x$, k-NN identifies the k nearest training points and assigns the majority label among them. Euclidean distance is often used in k-NN to identify the similarity between two points (vectors):

$$ d^2(x_i, x_j) = \lVert x_i - x_j \rVert^2 = \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 \tag{13} $$

where $x_i, x_j \in \mathbb{R}^d$ and $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$.
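The average-of-probabilities rule of Eqs. (10) and (11), which combines the probability outputs of the base learners, can be sketched in a few lines. This is a hedged illustration rather than the WEKA vote classifier used in the paper: the function names are hypothetical, and the probability vectors below are made-up stand-ins for the outputs of SVM, IBK, and MLP on a single object.

```python
def average_of_probabilities(prob_vectors):
    """Mean class-probability vector over all classifiers (cf. Eq. (10))."""
    n_classifiers = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    if any(len(p) != n_classes for p in prob_vectors):
        raise ValueError("all classifiers must score the same classes")
    return [sum(p[j] for p in prob_vectors) / n_classifiers
            for j in range(n_classes)]


def vote(prob_vectors, classes):
    """Assign the class with the largest mean probability (cf. Eq. (11))."""
    means = average_of_probabilities(prob_vectors)
    best = max(range(len(classes)), key=means.__getitem__)
    return classes[best], means


# Hypothetical per-classifier outputs [P(attack | x), P(normal | x)].
classes = ["attack", "normal"]
outputs = [
    [0.90, 0.10],  # stand-in for the SVM output
    [0.40, 0.60],  # stand-in for the IBK output
    [0.55, 0.45],  # stand-in for the MLP output
]
label, means = vote(outputs, classes)
```

In this made-up case one base learner on its own would favor the "normal" class, but the averaged probabilities favor "attack", which illustrates how a confident classifier can outweigh a weakly dissenting one under this rule.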

Table 1 Standard classifier algorithms, i.e., Decision Tree, tend only to pre-
Statistics of subsets of the utilized datasets .
dict the majority class and treat the minority class as noise [31].
Dataset Class Number of instances Total instances Thus, selecting some important samples as the original training
ISCX 2012 Normal 12,393 30112 samples is an open issue for researchers. In this study, random
Attack 17,719 sampling without replacement was performed in the preprocessing
NSL-KDD Normal 18,274 34327 step, however other more sophisticated sampling methods exist
Attack 16,053 [31,39], which may make an interesting topic for future work. To
Kyoto 2006+ Normal 15,037 34337
allow different integration testing environments, the samples were
Attack 19,300
taken from three well-known datasets, namely ISCX 2012 (syn-
thesized dataset), NSL-KDD (virtualized dataset), and Kyoto 2006+
3.3.3. Multilayer Perceptron (MLP) (realistic dataset). Given that the datasets only have two classes,
Multilayer Perceptron(MLP) is a feed-forward fully artificial multiple large samples were randomly taken from the datasets in
neural network model with one or more layers between the in- such a manner that the obtained training dataset is balanced in
put and output layer. The basic concept of this technique is to map terms of normal and anomalous instances. This is done to ensure
multiple real-valued inputs into a set of proper output by adjust- that the resulting model is not biased towards one class over the
ing the weight among its internal nodes (neurons). MLP learns a other. During the selection process, the stratified split option was
function f (x ) : Ri → Ro by training on a dataset, using the back- considered to ensure that the minority attacks have a represen-
propagation learning technique [37], where i, o ∈ Z+ are number of tative sampling of the data. For further verification, we compared
input and output dimensions respectively. This can be calculated the selected dataset with the remaining unsampled portion of the
as dataset by creating a join on the datasets, using the SQL query, to
determine which instances were unused. In this way, the training

n 
y=ϕ wi X + b = ϕ (W T X + b) (14) datasets will include sufficient instances of each class to provide
i=1 a fair representation of their characteristics. Table 1 shows the se-
lected samples of the utilized datasets. Additional details about the
where, ϕ is the activation function, w signifies the vector of
datasets are presented in Sections 4.1.1–4.1.3.
weights, X signifies the vector of inputs, and b is the bias. In prac-
tice, this neural classifier has been widely adopted in diverse fields
such as pattern classification, prediction, and recognition. 4.1.1. The ISCX 2012 dataset
The Information Security Center of Excellence (ISCX) dataset
4. Experimental setup and results was built in at the University of New Brunswick ISCX to provide
a contemporary benchmark for ID [41]. The dataset traced real
As stated before, this paper aims to develop an efficient intru- packets for seven days of network activity including HTTP, SMTP,
sion detection method with high accuracy and low false alarms. SSH, IMAP, POP3 and FTP protocols covering various scenarios of
For this purpose, a hybrid method, combined IG and PCA named normal and malicious activities. The dataset includes a total of
IG-PCA, was performed to reduce the dimensionality of the input 2,450,324 flows and consists of 19 features in addition to the class
datasets in order to eliminate the irrelevant features, as well as (“app name”, “total source bytes”, “total destination bytes”, “total des-
to improve the classification efficiency. In the classification step, tination packets”, “total source packets”, “source payload as base64”,
three different data mining algorithms, SVM, IBK, and MLP, as well “destination payload as base64”, “destination payload as UTF”, “direc-
as an ensemble classifier combined the utilized algorithms were tion”, “source TCP flags description”, “destination TCP flags descrip-
trained and tested based on the ISCX 2012 dataset. To obtain the tion”, “source”, “protocol name”, “source port”, “destination”, “desti-
final prediction result of the ensemble classifier, the AOP algo- nation port”, “start date time”, “stop date time”, and “tag”).
rithm was used as a combination rule to construct a weighted
vote of the classifiers’ predictions. Results show that the IG-PCA 4.1.2. The NSL-KDD dataset
with ensemble classifier outperformed every single other classifier The NSL-KDD dataset was proposed in 2009 as a new revised
by achieving high classification performance results. To ensure the version of the original dataset KDDCup99 [42]. The new edition
robustness of the proposed approach, the proposed model was re- addresses some drawbacks inherited from the original dataset such
evaluated based on NSL-KDD and Kyoto 2006+ datasets. Ultimately, as the elimination of redundant records, inclusion of a more rea-
the IG-PCA with ensemble classifier model has shown the high- sonable number of instances, and maintenance of diversity of the
est accuracy, lowest false positive rate, as well as a significant im- selected samples. NSL-KDD includes 41 different features that can
provement in the computational complexity. WEKA 3.8 [38] was be categorized into four groups:
utilized for machine learning and experiments were conducted on
desktop PC with a 3.4 GHz Intel Core i7-6700 processor and 16GB 1. Basic features of individual connections: these features provide
RAM. information about TCP/IP connection. Features in this group
are “duration”, “protocol type”, “service”, “flag”, “src bytes”, “dst
4.1. Description of the benchmark datasets bytes”, “land”, “wrong fragment”, “urgent”.
2. Traffic features: these features were computed using a time
Contending with the sheer volume of network traffic data is a window (2 seconds). Features in this group are “count”, “ser-
significant challenge to contemporary IDSs. Working at scale can ror rate”, “rerror rate”, “same srv rate”, “diff srv rate”,“srv count”,
expose computational complexities and slow the classification pro- “srv rerror rate”, “srv serror rate”,”srv diff host rate”.
cess, and lead to unsatisfactory results. Hence, the data reduction 3. Host features: these features computed using a window of hun-
step is considered one of the most important and challenging steps dred connections. Features in this group are “dst host count”,“dst
in data mining techniques, especially when dealing with heteroge- host serror rate”, “dst host rerror rate”, “dst host same src port
neous data such as the network traffic dataset [39]. In some cases, rate”, “dst host diff srv rate”, “dst host srv count”, “dst host srv
the selection may cause loss of information and lead to wrong de- serror rate”, “dst host srv rerror rate”, “dst host same srv rate”,
cision [39]. Handling imbalanced dataset is also considered a com- “dst host srv diff host rate”.
mon problem with the network traffic data due to low represen- 4. Content features: A connection of these features was suggested
tation of some attacks in contrast with the normal packets [40]. by domain knowledge. Features in this group are “hot”, “num
F. Salo, A.B. Nassif and A. Essex / Computer Networks 148 (2019) 164–175 169

Fig. 2. Performance based on the number of PCA on ISCX2012 dataset.

Table 2
Confusion matrix (2 × 2 dimensions).

                        Prediction
                        Intrusion    Legitimate
Actual    Intrusion     TP           FN
          Legitimate    FP           TN

Table 3
Most relevant features based on IG.

Rank #   Feature Name                      Information Gain
1        sourcePort                        0.9162
2        destination                       0.7512
3        direction                         0.6861
4        totalDestinationBytes             0.56
5        appName                           0.5531
6        source                            0.5396
7        totalSourceBytes                  0.5242
8        destinationPort                   0.4152
9        destinationTCPFlagsDescription    0.3489

failed logins”, “logged in”, “num compromised”, “root shell”, “su attempted”, “num root”, “num file creations”, “num shells”, “num access files”, “num outbound cmds”, “is host login”, “is guest login”.

Although this dataset includes a large number of features to classify various attack types, the content of some features is not discriminating enough, and can be misleading in the classification process. Thus, dimensionality reduction will prove helpful in this regard.

4.1.3. The Kyoto 2006+ dataset
The Kyoto 2006+ dataset was built between Nov 2006 and Aug 2009 from real network traffic, without any human modification, at Kyoto University [43]. The collected data was obtained from the university servers and honeypots. It consists of 24 features: 14 significant and essential features extracted from KDDCUP99 and 10 additional features. The first 14 features include (“duration”, “service”, “source bytes”, “destination bytes”, “count”, “same srv rate”, “serror rate”, “srv serror rate”, “dst host count”, “dst host srv count”, “dst host same src port rate”, “dst host serror rate”, “dst host srv serror rate”, and “flag”), while the remaining features include (“IDS detection”, “malware detection”, “ashula detection”, “label”, “source IP address”, “source port number”, “destination IP address”, “destination port number”, “start time”, and “duration”).

4.2. Data preprocessing

Data preprocessing is the most time-consuming and essential step in data mining. Realistic data typically comes from heterogeneous platforms and can be noisy, redundant, incomplete, and inconsistent. Thus, it is important to transform raw data into a format suitable for analysis and knowledge discovery. In this research, the preprocessing step involves removing outliers and redundant instances, as well as data transformation. The utilized datasets contain symbolic, continuous, and binary values. For instance, the feature “app name” in the ISCX 2012 dataset includes symbolic values such as “DNS”, “HTTPWeb”, and “IMAP”. As many classifiers accept only numerical values, the conversion process is considered vital and has a significant impact on IDS accuracy. In practice, there are different methods used to handle symbolic features, such as replacing every single value with an integer [44]. Even though this method is applicable in many cases, it is not the optimal encoding process for classifiers that rely on Euclidean distance [45]. Alternatively, the nominal to binary method [46] was performed in this work to transform all nominal features into binary numeric attributes.
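To make the nominal to binary conversion of Section 4.2 concrete, the following is a minimal sketch in plain Python. It is not the filter used in the paper; the function name `nominal_to_binary` and the four sample values are illustrative assumptions, with the category names taken from the “app name” example above.

```python
def nominal_to_binary(values):
    """Turn a nominal column into 0/1 indicator vectors, one position per category.

    Illustrative helper only (hypothetical name); not the paper's implementation.
    """
    categories = sorted(set(values))                  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                         # exactly one indicator is set
        encoded.append(row)
    return categories, encoded


# Hypothetical sample of the "app name" feature.
app_names = ["DNS", "HTTPWeb", "IMAP", "DNS"]
categories, encoded = nominal_to_binary(app_names)
print(categories)  # ['DNS', 'HTTPWeb', 'IMAP']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With this encoding every pair of distinct categories is equally far apart in Euclidean distance, whereas an integer coding (DNS = 0, HTTPWeb = 1, IMAP = 2) would place DNS closer to HTTPWeb than to IMAP for no semantic reason.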

Fig. 3. Comparison performance on the three datasets.

4.3. Data normalization

As stated before, most MLTs can process numerical inputs. However, different scales among features can degrade the classification performance. For example, features that take on large numeric values, e.g., “src_bytes”, can dominate the classifier’s model relative to features with relatively small numeric values such as “num_failed_logins”. Accordingly, mapping features onto a normalized range (i.e., between 0 and 1) is vital. The minimum-maximum method was used as an approach that is simple, fast, and has lower memory consumption than other methods. Eq. (15) shows this equation:

$$f(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \cdot (d - c) + c \tag{15}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of attribute $x$, and $[c, d]$ is the new interval, i.e., $[0, 1]$.

4.4. Results and discussion

The proposed method has been evaluated using the aforementioned datasets. Resampling without replacement was used to split the selected samples of each dataset into two different subsets for training and testing. With this, the training subset can

Table 4
Best classification performance for all dimensionality reduction stages based on the ISCX2012 dataset.

(a) The performance results based on the original features

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
SVM               -      87.02         0.870  0.091  0.901      0.871      3.18          25.61
IBK               -      94.29         0.996  0.132  0.914      0.953      0             8.86
MLP               -      82.42         0.824  0.127  0.872      0.824      3.92          0.15
Ensemble          -      87.27         0.873  0.090  0.903      0.873      7.36          35.58

(b) The performance results based on the selected features using IG

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
IG-SVM            -      88.93         0.889  0.078  0.913      0.889      2.58          22.93
IG-IBK            -      96.94         0.995  0.067  0.954      0.975      0             6.39
IG-MLP            -      94.12         0.989  0.126  0.917      0.952      1.72          0.09
IG-Ensemble       -      97.17         0.994  0.060  0.959      0.976      4.32          29.83

(c) The performance results based on the selected features using IG+PCA

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
IG-PCA-SVM        5      98.82         0.988  0.011  0.992      0.990      0.05          0.34
IG-PCA-IBK        13     98.72         0.986  0.011  0.992      0.989      0             3.84
IG-PCA-MLP        12     98.66         0.987  0.014  0.987      0.987      2.57          0.14
IG-PCA-Ensemble   7      99.01         0.991  0.010  0.991      0.992      2             3.49
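The metric columns in Table 4 derive from the confusion matrix of Table 2. The sketch below uses the standard textbook definitions, which are assumed (not confirmed here) to match the formulas the paper defers to [47]. The function name `binary_metrics` and the example counts are hypothetical and are not results from the paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks flagged as normal
    fp: normal flagged as attacks       tn: normal flagged as normal
    """
    total = tp + fn + fp + tn
    detection_rate = tp / (tp + fn)            # DR, also called recall or sensitivity
    false_alarm_rate = fp / (fp + tn)          # FAR
    precision = tp / (tp + fp)
    f_measure = 2 * precision * detection_rate / (precision + detection_rate)
    return {
        "accuracy": (tp + tn) / total,
        "dr": detection_rate,
        "far": false_alarm_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the sketch returns every metric as a fraction, while the tables report accuracy as a percentage and the remaining metrics as fractions.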

accurately estimate the model performance on new unseen data, while the testing subset is held back to evaluate the model performance. In this scenario, it is not necessary to construct subsets for cross-validation assessment, which can be time consuming with large datasets. Three experiments were conducted to evaluate the effectiveness of the proposed method. According to the confusion matrix presented in Table 2, these evaluation metrics were utilized: accuracy, detection rate (DR, also recall or sensitivity), false alarm rate (FAR), f-measure, precision, and ROC curve. The mathematical calculations of the utilized evaluation metrics are explained in [47]. In the confusion matrix, the true positive (TP) is the number of actual attacks identified as attacks, true negative (TN) is the number of normal instances identified as normal, false negative (FN) is the number of attacks identified as normal instances, and false positive (FP) is the number of normal instances identified as attacks.

4.4.1. Experiment based on ISCX 2012 dataset
The first experiment was performed using the ISCX 2012 dataset. First, essential features were identified by calculating the IG of each attribute and ranking the attributes in decreasing order. Overall, nine candidate features were selected from the original 19 for the next stage. Table 3 shows the selected features along with their corresponding information gain measures. With IG alone, the approach still produced a number of false alarms. To mitigate this limitation, a second reduction step was performed using the feature extraction method (PCA) based on the selected features listed in Table 3. To avoid bias, the PCAs were generated based on the training dataset only, which ensures that information from the testing dataset is not leaked into the training dataset. If the entire dataset is used to calculate the PCAs, the model will not perform as well when real new unseen data are fed to the model. Likewise, if the PCAs are calculated on the two sets separately, this will create two incompatible data sets. We cannot train a classifier in one space and apply it to a different space. Thus, the same statistics of the training dataset were utilized to map the testing dataset into the same feature space by using the batch-filtering method [48].

The new datasets of size n × k were used to evaluate the performance of the proposed method, where n is the number of instances and k is the number of PCAs. Accordingly, four different classifiers were constructed using the training dataset, while the testing dataset was used for classification.

Table 4 summarizes the best classification performance of the different dimensionality reduction methods in the context of accuracy, DR, FAR, precision, and f-measure. The number of features in Table 4(c) represents the first principal components for each classifier in the new subspace. For instance, the number of features retained after IG-PCA with the ensemble classifier was 7, which represents (PCA1, PCA2, ..., PCA7). We ran the experiment several times to examine the classifiers’ performance, gradually increasing the number of PCAs/features for each classifier. In each iteration, we kept increasing the number of PCAs until the addition of a new PCA did not improve the performance of the model. As shown in Fig. 2, we observed that the performance of the models did not improve after 25 PCAs, and the proposed ensemble approach achieved the highest accuracy rate of 99.01% with 7 PCAs, outperforming all individual classifiers. In contrast, the best accuracies of the SVM, IBK, and MLP classifiers were 98.82%, 98.72%, and 98.66% with 5, 13, and 12 PCAs, respectively. Furthermore, the proposed IG-PCA-Ensemble model exhibits one of the highest scores in DR, precision, and f-measure, and the lowest FAR in comparison with other combined models.

The proposed dimensionality reduction algorithm reduced the computational cost significantly when it was applied to the ensemble model. Table 4 shows a comparison of the consumed training and testing times based on the number of selected features. According to the table, the ensemble model with IG and PCA reduced the training and testing times considerably compared with the ensemble model using all features or IG alone. The proposed model reduced the training and testing times from 7.36 s and 35.58 s to 2 s and 3.49 s, respectively.

4.4.2. Experiments based on NSL-KDD and Kyoto 2006+ datasets
In order to demonstrate the performance of the proposed IG-PCA-Ensemble model, additional experiments were conducted on the NSL-KDD and Kyoto 2006+ datasets. In the preprocessing step of these datasets, IG and PCAs were computed as was done in the first experiment. As shown in Table 5, 13 features were selected from NSL-KDD and 10 from Kyoto 2006+; the table lists them along with their corresponding information gain measures. Tables 6 and 7 depict the best performance obtained based on the different dimensionality reduction methods on the datasets. As stated in the tables, the proposed model continues to show promising results in the classification performance. In this context, the optimum accuracies achieved on NSL-KDD and Kyoto 2006+ were 98.24% and 98.95%, respectively, with 12 PCAs for each. In addition, the training and testing times improved considerably when the dimensionality reduction algorithm was applied to the ensemble model.

Comparison results based on the three datasets are shown in Fig. 3. The results visualized in Fig. 3

Table 5
List of features selected by IG on NSL-KDD and Kyoto 2006+ datasets.

NSL-KDD                                                  Kyoto 2006+
Rank #  Feature name               Information gain      Rank #  Feature name               Information gain
1       src_bytes                  0.783                 1       Destination_IP_Address     0.988
2       service                    0.664                 2       Start_Time                 0.964
3       dst_bytes                  0.625                 3       Source_IP_Address          0.95
4       flag                       0.539                 4       Destination_Port_Number    0.912
5       diff_srv_rate              0.526                 5       Source_Port_Number         0.887
6       same_srv_rate              0.508                 6       Service                    0.851
7       dst_host_srv_count         0.495                 7       Destination_bytes          0.763
8       dst_host_same_srv_rate     0.451                 8       Dst_host                   0.762
9       dst_host_serror_rate       0.428                 9       Source_bytes               0.473
10      dst_host_srv_serror_rate   0.41                  10      Flag                       0.453
11      dst_host_diff_srv_rate     0.409
12      serror_rate                0.408
13      logged_in                  0.407
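The rankings in Tables 3 and 5 rest on information gain, the reduction in class entropy obtained by conditioning on a feature. The following is a minimal sketch for a nominal feature using the standard definition. The function names and the tiny four-record sample are hypothetical, and continuous attributes such as src_bytes would first need discretizing, a step this sketch omits and the paper's tooling may handle differently.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(class; feature) = H(class) - H(class | feature) for a nominal feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy sample: "flag" separates the classes perfectly, "land" not at all.
labels = ["attack", "attack", "normal", "normal"]
features = {
    "flag": ["S0", "S0", "SF", "SF"],
    "land": ["0", "0", "0", "0"],
}
ranking = sorted(features, key=lambda f: information_gain(features[f], labels), reverse=True)
print(ranking)  # ['flag', 'land']
```

Features would then be kept from the top of the ranking down, as with the nine ISCX 2012 features, the 13 NSL-KDD features, and the 10 Kyoto 2006+ features reported in the tables.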

Table 6
Best classification performance for all dimensionality reduction stages based on the NSL-KDD dataset.

(a) The performance results based on the original features (41)

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
SVM               -      88.42         0.884  0.131  0.905      0.882      2.63          21.29
IBK               -      83.97         0.84   0.182  0.875      0.834      0             61.15
MLP               -      90.50         0.905  0.108  0.919      0.904      6.58          0.22
Ensemble          -      89.13         0.891  0.123  0.909      0.889      9.19          82.64

(b) The performance results based on the selected features using IG (13 features)

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
IG-SVM            -      95.33         0.905  0.052  0.956      0.948      0.51          4.1
IG-IBK            -      88.80         0.880  0.127  0.906      0.886      0             30.48
IG-MLP            -      90.54         0.818  0.018  0.976      0.890      3.08          0.11
IG-Ensemble       -      91.35         0.913  0.098  0.925      0.912      3.62          35.74

(c) The performance results based on the selected features using IG+PCA

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
IG-PCA-SVM        8      97.22         0.988  0.042  0.954      0.971      0.33          2.19
IG-PCA-IBK        12     98.20         0.979  0.015  0.982      0.981      0             4.13
IG-PCA-MLP        12     96.89         0.954  0.018  0.979      0.966      1.05          0.12
IG-PCA-Ensemble   12     98.24         0.982  0.017  0.982      0.981      1.52          6.43
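The “# PCA” column in the (c) tables reports, for each classifier, the number of leading principal components that gave the best result. One way to sketch that search is shown below. It is an assumption about the procedure, not the authors' code: the `evaluate` callback stands in for training and scoring a classifier on the first k components, and the accuracy values are invented for illustration.

```python
def select_num_components(evaluate, max_components):
    """Scan k = 1..max_components and return (best_k, best_score).

    evaluate(k) is assumed to train on the first k principal components
    (fitted on the training split only) and return a score such as accuracy.
    Ties are resolved in favour of the smaller k.
    """
    best_k, best_score = None, float("-inf")
    for k in range(1, max_components + 1):
        score = evaluate(k)
        if score > best_score:          # strict comparison keeps the smaller k on ties
            best_k, best_score = k, score
    return best_k, best_score


# Hypothetical accuracy curve, not taken from the paper.
hypothetical_accuracy = {1: 0.90, 2: 0.95, 3: 0.97, 4: 0.97, 5: 0.96}
best_k, best_score = select_num_components(hypothetical_accuracy.get, max_components=5)
print(best_k, best_score)  # 3 0.97
```

Scanning the whole range rather than stopping at the first non-improving k is a design choice of this sketch; an early-stopping variant would be cheaper but could miss a later improvement.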

Table 7
Best classification performance for all dimensionality reduction stages based on the Kyoto 2006+ dataset.

(a) The performance results based on the original features (24)

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
SVM               -      84.42         0.844  0.122  0.884      0.840      3.45          21.8
IBK               -      90.06         0.838  0.019  0.983      0.905      0             36.95
MLP               -      95.82         0.725  0.057  0.957      0.963      3.6           0.23
Ensemble          -      87.39         0.874  0.101  0.898      0.874      9.13          59.02

(b) The performance results based on the selected features using IG (10 features)

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
IG-SVM            -      93.21         0.932  0.054  0.996      0.936      3.08          21.85
IG-IBK            -      94.47         0.938  0.047  0.963      0.950      0             25.06
IG-MLP            -      96.31         0.979  0.057  0.956      0.968      2.5           0.15
IG-Ensemble       -      95.89         0.964  0.047  0.940      0.964      5.23          48.97

(c) The performance results based on the selected features using IG+PCA

Classifier        # PCA  Accuracy (%)  DR     FAR    Precision  F-Measure  Building (s)  Testing (s)
IG-PCA-SVM        17     94.99         0.985  0.095  0.930      0.957      0.24          2.06
IG-PCA-IBK        10     93.92         0.905  0.017  0.986      0.944      0             4.11
IG-PCA-MLP        13     98.39         0.986  0.007  0.994      0.990      0.95          0.11
IG-PCA-Ensemble   12     98.95         0.998  0.021  0.984      0.991      1.38          7.07
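The Ensemble rows in Tables 4, 6 and 7 combine the SVM, IBK, and MLP base learners. Assuming that the AOP rule named in the paper denotes averaging the class probabilities of the base learners, the combination step can be sketched as follows. The function name, the class order, and the probability values are hypothetical.

```python
def average_of_probabilities(prob_vectors):
    """Average per-class probabilities across base learners and pick the top class.

    prob_vectors: one probability list per base learner, all in the same class order.
    Returns (averaged_probabilities, index_of_predicted_class).
    """
    n_learners = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    averaged = [sum(p[c] for p in prob_vectors) / n_learners for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted


classes = ["normal", "attack"]
# Hypothetical outputs of the three base learners for one connection.
svm_probs = [0.40, 0.60]
ibk_probs = [0.90, 0.10]
mlp_probs = [0.55, 0.45]

averaged, predicted = average_of_probabilities([svm_probs, ibk_probs, mlp_probs])
print(classes[predicted])  # normal
```

Unlike a plain majority vote, this rule lets a confident base learner outweigh a hesitant one, since each learner contributes its full probability estimate rather than a single vote.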

(a) and (b) indicate that implementing the two dimensionality reduction stages significantly improved the classification accuracy and reduced the misclassification rate of the proposed intrusion detection model compared with models using all features. Similarly, from Fig. 3(c) and (d) it can be observed that the computational cost was reduced enormously for the training and testing times across all dimensionality reduction stages. Fig. 3(e) shows the classification accuracy of the proposed IG-PCA-Ensemble model on the three datasets. The best accuracies obtained were 99.01%, 98.24% and 98.95% with 7, 12, and 12 selected features (PCAs) for ISCX 2012, NSL-KDD and Kyoto 2006+, respectively. The ROC (receiver operating characteristic) curve was also used to evaluate the performance of the model. The ROC curves of the three datasets plotted in Fig. 3(f) show that the proposed model had a good result in terms of high detection, with

Table 8
Comparison results with other approaches.

Method Dataset Feature selection/extraction tech. # features Accuracy FAR DR Year

DMNB [51] NSL-KDD PCA, Random Projection (RP) All 96.50 3.0 N/A 2010
DBN-SVM [52] NSL-KDD deep belief network (DBN) All 92.84 N/A N/A 2011
TUIDS [49] NSL-KDD N/A All 96.55 1.12 98.88 2012
PSOM [53] NSL-KDD PCA, Fisher Discriminant Ratio (FDR), Kernel PCA, Isomap 23 88.30 0.14 N/A 2013
HbPHAD [54] ISCX 2012 N/A All N/A N/A 99.04 2014
PSOM+PCA+FDR [23] NSL-KDD PCA, FDR 8 90 N/A 97 2015
EMD [10] ISCX 2012 PCA N/A 90.12 7.92 90.04 2015
OS-ELM [55] Kyoto 2006+ Filtered, CFS and Consistency subsets evaluation 11 96.37 5.76 N/A 2015
EMFFS [50] NSL-KDD IG, GR, Chi-squared, Relief 13 99.67 0.42 99.76 2016
HG-GA-SVM [56] NSL-KDD HG, GA 35 97 0.83 97.14 2017
SVM [57] ISCX 2012 N/A 11 N/A 1.10 98.50 2017
SLFN [57] ISCX 2012 N/A 11 N/A 5.56 88.16 2017
RFAODE [58] Kyoto 2006+ N/A 15 90.51 0.14 92.38 2017
Bagging-REPTree [59] NSL-KDD FVBRM 25 83.22 8.09 N/A 2018
Bagging-J48 [59] NSL-KDD Gain Ratio 35 84.25 2.79 N/A 2018
IG-PCA-Ensemble (proposed) ISCX 2012 IG, PCA 7 99.011 0.01 99.1 2018
IG-PCA-Ensemble (proposed) NSL-KDD IG, PCA 12 98.24 0.017 98.2 2018
IG-PCA-Ensemble (proposed) Kyoto 2006+ IG, PCA 12 98.95 0.021 99.8 2018

N/A: name not available.

an area under ROC curve (AUC) of 0.998 (ISCX 2012), 0.996 (NSL-KDD), and 0.985 (Kyoto 2006+) for both normal and attack connections.

4.5. Additional comparison

In this section, the performance of the proposed IG-PCA-Ensemble method is compared with other state-of-the-art approaches. The comparison includes the utilized techniques, number of features, accuracy, FAR, and DR for intrusion detection (normal or attack separation). Table 8 depicts the comparison results based on ISCX 2012, NSL-KDD, and Kyoto 2006+ datasets. The comparisons are provided for the reader’s reference, with the caveat that the techniques used in sampling and preprocessing differ between research groups. Nevertheless, the proposed method achieved promising results in the context of accuracy rate, DR, and FAR. On the ISCX 2012 dataset, the proposed approach achieved the highest performance, with accuracy (99.01%), DR (99.1%), and the lowest FAR (0.01). The NSL-KDD dataset, on the other hand, yielded lower detection rates by 0.62% in contrast with TUIDS [49]. However, our proposed approach performed well in terms of accuracy rate and FAR. Likewise, the obtained accuracy and DR were lower by 1.43% and 1.56% respectively compared to EMFFS [50]. Additionally, a lower FAR by 0.403% was achieved, which is a useful property for real-world IDSs. In Kyoto 2006+, our approach continues to show improved performance in relation to other approaches.

4.6. Threats to validity

The utilized datasets represent the major threats to validity in this research. The widely studied KDD99/NSL-KDD dataset is obsolete and does not capture contemporary attack types. As a synthetic dataset, it is not a strong representative of contemporary networks. Hence, to provide a more robust performance evaluation, the proposed method was validated against three distinct dataset types: synthesized (ISCX 2012), virtualized (NSL-KDD), and realistic (Kyoto 2006+). Random sampling is another threat to validity as it makes the precise experiment difficult to replicate. Thus, to verify the reliability of the proposed approach, the experiments were repeated on three different datasets with a large data sample size. Finally, although the proposed approach exhibited encouraging performance in binary classification, it requires further study in multiple-class classification problems.

5. Conclusion

Much work has been carried out in the field of intrusion detection to develop robust security systems augmenting traditional approaches. In this context, recent studies have shown that an efficient feature selection approach is a major component to assist the classification method in the detection process. In this paper, a novel hybrid technique combining IG and PCA is proposed to discard irrelevant features and retain the optimum attribute subset, while the ensemble classifier based on SVM, IBK, and MLP is used to construct the classification model. The AOP algorithm was then utilized to obtain the final decision of the base learners to recognize whether a given instance is normal or attack. To allow different integration testing environments, the proposed IG-PCA-Ensemble was evaluated based on three intrusion detection benchmark datasets, namely ISCX 2012, NSL-KDD, and Kyoto 2006+. Furthermore, the performance of the proposed method was compared against recent and related approaches. Based on the experimental results, the IG-PCA-Ensemble approach delivers improved accuracy, FAR, and DR. On the ISCX 2012 dataset, the proposed approach achieved the highest accuracy rate (99.01%), DR (99.1%), and the lowest FAR (0.01). Similarly, the proposed approach yielded promising results on both the NSL-KDD and Kyoto 2006+ datasets, with accuracy rates of 98.24% and 98.95%, DRs of 98.2% and 99.8%, and FARs of 0.017 and 0.021, respectively. In addition, the computational cost of the proposed method has been reduced considerably across all experiments. Although the proposed novel IG-PCA-Ensemble method has exhibited encouraging performance, its capability could be further improved to accommodate enormous data flows in real time, and this issue will be considered in future work.

References

[1] S. Pontarelli, G. Bianchi, S. Teofili, Traffic-aware design of a high-speed fpga network intrusion detection system, IEEE Trans. Comput. 62 (11) (2013) 2322–2334, doi:10.1109/TC.2012.105.
[2] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, E. Vázquez, Anomaly-based network intrusion detection: techniques, systems and challenges, Comput. Secur. 28 (1–2) (2009) 18–28.
[3] Y. Tang, S. Chen, An automated signature-based approach against polymorphic internet worms, IEEE Trans. Parallel Distrib. Syst. 18 (7) (2007) 879–892.
[4] O. Joldzic, Z. Djuric, P. Vuletic, A transparent and scalable anomaly-based dos detection method, Comput. Networks 104 (2016) 27–42.
[5] S. Mukkamala, A.H. Sung, A. Abraham, Intrusion detection using an ensemble of intelligent paradigms, J. Network Comput. Appl. 28 (2) (2005) 167–182.
[6] M. Panda, A. Abraham, M.R. Patra, A hybrid intelligent approach for network intrusion detection, Procedia Eng. 30 (2012) 1–9.

[7] S. Revathi, A. Malathi, A detailed analysis on nsl-kdd dataset using various machine learning techniques for intrusion detection, Int. J. Eng. Res. Technol. ESRSA Publ. 2 (12) (2013) 1848–1853.
[8] M.K. Siddiqui, S. Naahid, Analysis of kdd cup 99 dataset using clustering based data mining, Int. J. Database Theory Appl. 6 (5) (2013) 23–34.
[9] A. Chandrasekhar, K. Raghuveer, An effective technique for intrusion detection using neuro-fuzzy and radial svm classifier, in: Computer Networks & Communications (NetCom), Springer, 2013, pp. 499–507.
[10] Z. Tan, A. Jamdagni, X. He, P. Nanda, R.P. Liu, J. Hu, Detection of denial-of-service attacks based on computer vision techniques, IEEE Trans. Comput. 64 (9) (2015) 2519–2533, doi:10.1109/TC.2014.2375218.
[11] C. Catal, S. Tufekci, E. Pirmit, G. Kocabag, On the use of ensemble of classifiers for accelerometer-based activity recognition, Appl. Soft Comput. 37 (2015) 1018–1022, doi:10.1016/j.asoc.2015.01.025.
[12] V. Hajisalem, S. Babaie, A hybrid intrusion detection system based on abc-afs algorithm for misuse and anomaly detection, Comput. Networks 136 (2018) 37–50.
[13] I. Dutt, S. Borah, I.K. Maitra, K. Bhowmik, A. Maity, S. Das, Real-time hybrid intrusion detection system using machine learning techniques, in: Advances in Communication, Devices and Networking, Springer, 2018, pp. 885–894.
[14] C. Lai, M.J. Reinders, L. Wessels, Random subspace method for multivariate feature selection, Pattern Recognit. Lett. 27 (10) (2006) 1067–1076, doi:10.1016/j.patrec.2005.12.018.
[15] J. Puneet, N. Bouguila, Intrusion detection using unsupervised approach, in: Emerging Technologies for Developing Countries: First International EAI Conference, AFRICATEK 2017, Marrakech, Morocco, March 27–28, 2017 Proceedings, 206, Springer, 2017, p. 192.
[16] A. Sophian, G.Y. Tian, D. Taylor, J. Rudlin, A feature extraction technique based on principal component analysis for pulsed eddy current ndt, NDT E Int. 36 (1) (2003) 37–41.
[17] G. Stein, B. Chen, A.S. Wu, K.A. Hua, Decision tree classifier for network intrusion detection with ga-based feature selection, in: Proceedings of the 43rd annual Southeast regional conference-Volume 2, ACM, 2005, pp. 136–141.
[18] S. Mukkamala, A.H. Sung, Significant feature selection using computational intelligent techniques for intrusion detection, in: Advanced Methods for Knowledge Discovery from Complex Data, Springer, 2005, pp. 285–306.
[19] Y. Chen, A. Abraham, B. Yang, Feature selection and classification using flexible neural tree, Neurocomputing 70 (1–3) (2006) 305–313.
[20] S. Mukherjee, N. Sharma, Intrusion detection using naive bayes classifier with feature reduction, Procedia Technol. 4 (2012) 119–128.
[21] G. Liu, Z. Yi, S. Yang, A hierarchical intrusion detection model based on the pca neural networks, Neurocomputing 70 (7–9) (2007) 1561–1568.
[22] F. Kuang, W. Xu, S. Zhang, A novel hybrid kpca and svm with ga model for intrusion detection, Appl. Soft Comput. 18 (2014) 178–184.
[23] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, B. Prieto, Pca filtering and probabilistic som for network intrusion detection, Neurocomputing 164 (2015) 71–81.
[24] H. Shi, H. Li, D. Zhang, C. Cheng, X. Cao, An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification, Comput. Networks 132 (2018) 81–98.
[25] E. Adi, Z. Baig, P. Hingston, Stealthy denial of service (dos) attack modelling and detection for http/2 services, J. Network Comput. Appl. 91 (2017) 1–13.
[26] Y. Wang, Y. Zhao, Q. Zhou, Z. Lin, Image encryption using partitioned cellular automata, Neurocomputing 275 (2018) 1318–1332.
[27] P. Nskh, M.N. Varma, R.R. Naik, Principle component analysis based intrusion detection system using support vector machine, in: Recent Trends in Electronics, Information & Communication Technology (RTEICT), IEEE International Conference on, IEEE, 2016, pp. 1344–1350.
[28] R.P. Duin, D.M. Tax, Experiments with classifier combining rules, in: International Workshop on Multiple Classifier Systems, Springer, 2000, pp. 16–29.
[29] G. Seni, J.F. Elder, Ensemble methods in data mining: improving accuracy through combining predictions, Synth. Lect. Data Mining Knowl. Disc. 2 (1) (2010) 1–126.
[30] T.G. Dietterich, Ensemble methods in machine learning, in: International workshop on multiple classifier systems, Springer, 2000, pp. 1–15.
[31] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst., Man, Cybern., Part C (Appl. Rev.) 42 (4) (2012) 463–484, doi:10.1109/TSMCC.2011.2161285.
[32] G.I. Webb, Z. Zheng, Multistrategy ensemble learning: reducing error by combining ensemble learning techniques, IEEE Trans. Knowl. Data Eng. 16 (8) (2004) 980–991.
[33] S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: a review of classification techniques, Emerg. Artif. Intell. Appl. Comput. Eng. 160 (2007) 3–24.
[34] J. Hu, An approach to eeg-based gender recognition using entropy measurement methods, Knowl. Based Syst. 140 (2018) 134–141, doi:10.1016/j.knosys.2017.10.032.
[35] K. Friston, K. Stephan, B. Li, J. Daunizeau, Generalised filtering, Math. Probl. Eng. 2010 (2010).
[36] C. Hung, J.-H. Chen, A selective ensemble based on expected probabilities for bankruptcy prediction, Expert Syst. Appl. 36 (3) (2009) 5297–5303.
[37] H. Leung, S. Haykin, The complex backpropagation algorithm, IEEE Trans. Signal Process. 39 (9) (1991) 2101–2104.
[38] G. Holmes, A. Donkin, I.H. Witten, Weka: a machine learning workbench, in: Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on, IEEE, 1994, pp. 357–361.
[39] R. Singh, H. Kumar, R. Singla, Analyzing statistical effect of sampling on network traffic dataset, in: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I, Springer, 2014, pp. 401–408.
[40] R. Singh, H. Kumar, R. Singla, Sampling based approaches to handle imbalances in network traffic dataset for machine learning techniques, arXiv preprint arXiv:1311.2677 (2013) 1–11.
[41] A. Shiravi, H. Shiravi, M. Tavallaee, A.A. Ghorbani, Toward developing a systematic approach to generate benchmark datasets for intrusion detection, Comput. Secur. 31 (3) (2012) 357–374.
[42] S. Aljawarneh, M. Aldwairi, M.B. Yassein, Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model, J. Comput. Sci. 25 (2018) 152–160.
[43] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, K. Nakao, Statistical analysis of honeypot data and building of kyoto 2006+ dataset for nids evaluation, in: Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, ACM, 2011, pp. 29–36.
[44] H.G. Kayacik, A.N. Zincir-Heywood, M.I. Heywood, A hierarchical som-based intrusion detection system, Eng. Appl. Artif. Intell. 20 (4) (2007) 439–451.
[45] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, A geometric framework for unsupervised anomaly detection, in: Applications of data mining in computer security, Springer, 2002, pp. 77–101.
[46] J. Maudes, J.J. Rodríguez, C. García-Osorio, Cascading for nominal data, in: International Workshop on Multiple Classifier Systems, Springer, 2007, pp. 231–240.
[47] S. Elhag, A. Fernández, A. Bawakid, S. Alshomrani, F. Herrera, On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems, Expert Syst. Appl. 42 (1) (2015) 193–202.
[48] R.R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, D. Scuse, Weka manual for version 3-6-0, University of Waikato, Hamilton, New Zealand (2008) 1–341.
[49] P. Gogoi, M.H. Bhuyan, D. Bhattacharyya, J.K. Kalita, Packet and flow based network intrusion dataset, in: International Conference on Contemporary Computing, Springer, 2012, pp. 322–334.
[50] O. Osanaiye, H. Cai, K.-K.R. Choo, A. Dehghantanha, Z. Xu, M. Dlodlo, Ensemble-based multi-filter feature selection method for ddos detection in cloud computing, EURASIP J. Wirel. Commun. Netw. 2016 (1) (2016) 130.
[51] M. Panda, A. Abraham, M.R. Patra, Discriminative multinomial naive bayes for network intrusion detection, in: Information Assurance and Security (IAS), 2010 Sixth International Conference on, IEEE, 2010, pp. 5–10.
[52] M.A. Salama, H.F. Eid, R.A. Ramadan, A. Darwish, A.E. Hassanien, Hybrid intelligent intrusion detection scheme, in: Soft computing in industrial applications, Springer, 2011, pp. 293–303.
[53] E. de la Hoz, A. Ortiz, J. Ortega, E. de la Hoz, Network anomaly classification by support vector classifiers ensemble and non-linear projection techniques, in: International Conference on Hybrid Artificial Intelligence Systems, Springer, 2013, pp. 103–111.
[54] W. Yassin, N.I. Udzir, A. Abdullah, M.T. Abdullah, Z. Muda, H. Zulzalil, Packet header anomaly detection using statistical analysis, in: International Joint Conference SOCO 14-CISIS 14-ICEUTE 14, Springer, 2014, pp. 473–482.
[55] R. Singh, H. Kumar, R. Singla, An intrusion detection system using network traffic profiling and online sequential extreme learning machine, Expert Syst. Appl. 42 (22) (2015) 8609–8624.
[56] M.G. Raman, N. Somu, K. Kirthivasan, R. Liscano, V.S. Sriram, An efficient intrusion detection system based on hypergraph - genetic algorithm for parameter optimization and feature selection in support vector machine, Knowl. Based Syst. 134 (2017) 1–12, doi:10.1016/j.knosys.2017.07.005.
[57] H. Huang, R.S. Khalid, H. Yu, Distributed machine learning on smart-gateway network towards real-time indoor data analytics, in: Data Science and Big Data: An Environment of Computational Intelligence, Springer, 2017, pp. 231–263.
[58] M. Jabbar, R. Aluvalu, et al., RFAODE: a novel ensemble intrusion detection system, Procedia Comput. Sci. 115 (2017) 226–234.
[59] N.T. Pham, E. Foo, S. Suriadi, H. Jeffrey, H.F.M. Lahza, Improving performance of intrusion detection system using ensemble methods and feature selection, in: Proceedings of the Australasian Computer Science Week Multiconference, in: ACSW ’18, ACM, New York, NY, USA, 2018, pp. 2:1–2:6, doi:10.1145/3167918.3167951.

Fadi Salo received the B.Sc. and M.Sc. degrees in computer science from Al-Ahliyya Amman University and University Putra Malaysia in Jordan and Malaysia in 1999 and 2005, respectively. He obtained a Master of Engineering in Electrical and Computer Engineering from The University of Western Ontario in 2015. He is currently a Ph.D. candidate in the field of Software Engineering in the Department of Electrical and Computer Engineering. His research interests include data mining, social network analysis, cloud computing, data analytics, network security, and intrusion detection. He is a student member of the IEEE.

Ali Bou Nassif is an assistant professor at the University of Sharjah, UAE, and adjunct research professor at The University of Western Ontario. He received a Ph.D. in Electrical and Computer Engineering from The University of Western in 2012. His research includes applications of statistical and artificial intelligence models to software engineering, electrical engineering, e-learning, cybersecurity and social media. He is a member of the IEEE and is a licensed professional engineer in Ontario.

Aleksander Essex is an assistant professor of software engineering in the Department of Electrical and Computer Engineering at The University of Western Ontario. Specializing in cybersecurity, his research focuses on cyber threats to electronic and online voting, and on secure multi-party cryptographic techniques for private health informatics. Part of his research focuses on applications of cryptography to the sharing of health information, such as private records linkage and genomic privacy. He received a Ph.D. in computer science from the University of Waterloo in 2012. He is a member of the IEEE, the ACM, and is a licensed professional engineer in Ontario.
