
Computers & Security 95 (2020) 101851


Building Auto-Encoder Intrusion Detection System based on random forest feature selection

XuKui Li a, Wei Chen a,∗, Qianru Zhang b, Lifa Wu a

a Nanjing University of Posts and Telecommunications, No. 9 Wenyuan Road, Nanjing, Jiangsu, China
b University of Hong Kong, Pokfulam Road, Central and Western District, Hong Kong, China

∗ Corresponding author. E-mail address: chenwei@njupt.edu.cn (W. Chen).

https://doi.org/10.1016/j.cose.2020.101851

Article history: Received 11 November 2019; Revised 12 March 2020; Accepted 20 April 2020; Available online 29 April 2020.

Keywords: Network security; Network Intrusion Detection System; Deep learning; Auto-Encoder; Unsupervised clustering.

Abstract: Machine learning techniques have been widely used in intrusion detection for many years. However, these techniques still suffer from a lack of labeled datasets, heavy overhead and low accuracy. To improve classification accuracy and reduce training time, this paper proposes an effective deep learning method, namely AE-IDS (Auto-Encoder Intrusion Detection System), based on the random forest algorithm. This method constructs the training set with feature selection and feature grouping. After training, the model can predict the results with an auto-encoder, which greatly reduces the detection time and effectively improves the prediction accuracy. The experimental results show that the proposed method is superior to traditional machine learning based intrusion detection methods in terms of easy training, strong adaptability, and high detection accuracy.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Network security has become one of the most important issues in the world. As one of the important technologies to ensure network security, intrusion detection systems have always been the focus of the industry (Wang et al., 2018). An intrusion detection system is a security management system, which collects and analyzes information from computers and networks to check whether abnormal behaviors appear. As a proactive security defense technology, intrusion detection can effectively protect the security of cyberspace.

With the development of data analysis technologies, intrusion detection systems have continuously evolved. From the initial statistical theory to today's artificial intelligence algorithms, intrusion detection technologies have become more and more powerful. However, with the complexity of network topology and the diversification of intrusion behavior, the existing intrusion detection technologies have gradually presented some drawbacks, mainly reflected in:

(1) The detection rate is unstable. As network traffic data grows larger, it is very difficult to obtain the characteristics of intrusions from massive data. Since intrusion behavior is diversifying, it is difficult to maintain a stable detection rate for all types of intrusions.

(2) Poor adaptive ability. Adaptive ability mainly refers to the ability to detect new attacks and to suit new situations. Existing intrusion detection technology always has certain defects in this aspect (Roshan et al., 2018). It is important to build a model with adaptive and self-learning ability, which can detect unknown new attacks and suit new situations.

(3) Low efficiency of data processing. With the advent of the era of big data, a large amount of network data can be generated in a short period of time. Intrusion detection tries to identify intrusion behaviors from these large quantities of data, which brings great challenges to current intrusion detection technologies. How to quickly detect network intrusion behavior with a high detection rate is one of the major challenges that needs to be solved urgently.

Faced with the increasing challenges of network security, people need to develop new intrusion detection methods on the achievements of existing deep learning techniques. Deep learning can overcome most of the above shortcomings because of its excellent prediction ability and strong adaptability (Hodo et al., 2016). Therefore, we apply the auto-encoder, a deep neural network, to the intrusion detection system AE-IDS (Auto-Encoder Intrusion Detection System). AE-IDS is a lightweight self-coding intrusion detection system, which can be deployed online.
AE-IDS uses the random forest algorithm to select features, reduce the dimension of the data, and obtain the best features. Affinity Propagation (AP) classifies the best features into several feature subsets to achieve feature grouping. Finally, anomaly detection uses the auto-encoder (AE) to reconstruct the new data representation, calculate the root mean square error (RMSE) between the old and new data, and classify the RMSE through GMM or K-means.

AE-IDS is able to detect unknown abnormal network behaviors by comparing the normal traffic model and incoming network traffic. Network traffic deviating from the normal traffic model is classified as intrusion (Aljawarneh et al., 2018). The advantage of AE-IDS is that it can predict new attacks. After extracting related features from the network traffic, appropriate feature selection and dimensionality reduction can improve the data quality, speed up the detection, and enhance the classification results (Javaid et al., 2016a). The auto-encoder (Yousefi-Azar et al., 2017) can be used as an unsupervised learning method to identify network traffic anomalies and reduce training time. We use the CSE-CIC-IDS 2018 on AWS dataset (University of New Brunswick, 0000) to verify AE-IDS. The experimental results show that our intrusion detection method is efficient and accurate.

In conclusion, the main contributions of this paper are as follows:

(1) We propose a novel online lightweight learning method, AE-IDS, which uses random forests to select the effective features from original datasets. Online real-time detection directly uses the selected features. It can achieve effective representation and reduce dimensionality, which improves the results of neural networks and traditional machine learning algorithms, such as K-means and GMM, in binary classification.

(2) The AP clustering algorithm clusters the best features to find the features with strong correlation, which helps feature merging. Because the orders of feature magnitude differ, the output results tend to be biased towards features with larger orders of magnitude, resulting in poor AE expression ability. Clustering puts highly correlated features into the same group, which contains features of similar orders of magnitude.

(3) The innovation of our method lies in the combination of a 3-layer shallow AE neural network and traditional unsupervised machine learning clustering algorithms. AE-IDS uses a shallow AE neural network to reduce computational complexity and achieve the lightweight purpose. In addition, the accuracy of the unsupervised clustering algorithm can be improved by the self-learning 3-layer auto-encoder.

The rest of this paper is structured as follows. We briefly introduce the related research work in Section 2, especially the application of traditional machine learning technology and deep learning methods in anomaly detection. In Section 3, we outline the background knowledge of the auto-encoder and its working principles. Section 4 describes the design and implementation of AE-IDS in detail. Section 5 analyzes experimental results with the CSE-CIC-IDS 2018 on AWS datasets. Section 6 makes conclusions.

2. Related work

Network intrusion detection has become the most important part of network defense infrastructure. Various machine learning methods are applied to detect anomalies or attacks in network traffic. These methods include decision tree, K-Nearest Neighbor (K-NN), naive Bayesian network, SOM, SVM and Artificial Neural Network (ANN).

In the past (Buczak and Guven, 2015; Al-Yaseen et al., 2017; Kong et al., 2017; Hoz et al., 2015; Fiore et al., 2013), IDS methods using machine learning techniques have been extensively studied. However, these solutions usually consider little about how to label the training dataset, which is very important to the accuracy of detection models. Actually, the cost of labelling a training dataset is expensive. At the same time, we also need to consider the computation and storage resources consumed by training or testing the model.

Jashuva et al. mentioned the importance of function and attribute selection for performing accurate wireless network intrusion detection, and made a manual function selection (Chundi and Rao, 2013). The first 20 features are selected to train 8 classifiers. Their overall accuracy ranges from 89.43% to 96.2%. However, manual feature selection can be a very cumbersome and time-consuming process. In this paper, we adopt a self-learning method, which is an inherent feature of deep learning, to complete the challenging task of feature selection. As a result, it achieves higher overall classification accuracy.

In Javaid et al. (2016b), the authors used deep learning for intrusion detection on the NSL-KDD dataset. The NSL-KDD dataset is based on the KDD Cup 99 dataset and consists of network traffic captured by the DARPA Intrusion Detection Assessment Program in 1998. The NSL-KDD dataset is an improved version, in which redundant records are deleted and partitioned, so that records are grouped into various levels of difficulty according to how many of the learning algorithms developed so far can classify them correctly. The method proposed by the authors correctly classifies the test data into two categories (attack and normal) and five categories (normal and 4 attacks) with accuracy of 88.39% and 79.1%, respectively. However, the dataset seems outdated since it does not contain the new types of attack data that have appeared in recent years. The authors only considered the sparse auto-encoder with the Sigmoid activation function, which limited its performance.

In Han et al. (2018) and Wang et al. (2017b), the authors proposed to use the auto-encoder to extract features from datasets and reduce feature dimensions, thus significantly reducing memory requirements and improving network threat detection. However, the auto-encoder itself was not used for anomaly detection. Finally, the authors used classifiers to detect network threats. Their solutions were supervised, requiring experts to label the training dataset, while our solution is unsupervised.

In Sakurada and Yairi (2014), Sakurada et al. proposed to use the non-linear dimension-reduction auto-encoder in anomaly detection tasks. The dimension reduction of artificial data and real data was processed by the auto-encoder. Its performance was illustrated by comparing with linear principal component analysis and kernel principal component analysis. In addition, the accuracy of the auto-encoder could be improved by extending it to a denoising auto-encoder. But their work is different from ours because (1) it was offline detection, (2) the architecture used by the authors was not lightweight and not extended by feature subsets, (3) it was not applied to network intrusion detection, and (4) we do not use a denoising auto-encoder. Here, we note that an appropriate feature extraction framework is very helpful to speed up the computational efficiency.

In Wang et al. (2017a), Wang et al. proposed a new method based on the RDF-SVM algorithm to detect DDoS attacks. Considering the importance of feature selection in DDoS attack detection, the RDF-SVM algorithm aimed to use random forest to calculate the importance of features, and to use SVM to re-filter features. It could prevent erroneous deletion of features. Finally, the best feature subset was obtained, which would achieve a higher detection rate and recall rate. In our method, the random forest algorithm is applied to select features, aiming at extracting the most effective features and removing the redundant features with high correlation.

Iman et al. made a survey about the state-of-the-art in IDS dataset generation and evaluation (Sharafaldin et al., 2018). They analyzed the limitations of 11 publicly available datasets released since 1998. They generated a new IDS dataset and released it online (University of New Brunswick, 0000), including 7 of the latest families of attacks, which are publicly available and appear in the real world.

They compared the performance and accuracy of the selected features using 7 different machine learning algorithms. They also compared the quality of the generated dataset with other public datasets based on the 11 criteria of the proposed dataset evaluation framework. Since this dataset has the advantage of a wide scale in features and samples, and the dataset was also updated in 2018, we finally select the updated version of this dataset (CSE-CIC-IDS 2018 on AWS) for evaluation.

Yisroel Mirsky et al. proposed Kitsune, a lightweight network intrusion detection system (Mirsky et al., 2018). Kitsune used unsupervised learning to detect attacks online. Kitsune's core algorithm is KitNet, which uses a collection of auto-encoder neural networks to distinguish between normal and abnormal traffic. Kitsune was composed of four modules: packet capturer, feature extractor, feature mapper, and anomaly detector. It used the idea of ensemble learning to integrate several auto-encoders into a classifier. It reconstructed network traffic by auto-encoder, which improved its performance gradually. Experimental results showed that Kitsune could detect various attacks, and its performance was as good as offline detectors.

This paper also uses the auto-encoder as the basic module for anomaly detection, which is inspired by Kitsune. But AE-IDS and KitNet are different in many aspects. In feature selection, Kitsune sets a damping window to update statistical information gradually, and dynamically extracts features hidden in network traffic. AE-IDS uses the random forest algorithm to select the most useful features, which significantly improves the performance of detection. In feature mapping, Kitsune maps n features to k smaller feature subsets by performing agglomerative hierarchical clustering. AE-IDS uses AP clustering to group features according to the degree of similarity. The features in each subset are of the same order of magnitude, which helps AE-IDS calculate the average value from normal samples to implement feature grouping.

3. Background

3.1. Auto-Encoder

The auto-encoder (AE) (Mirsky et al., 2018; Zhou and Paffenroth, 2017) is the basic core part of AE-IDS. To better describe the structure of the auto-encoder, we refer to the example in Fig. 1.

Fig. 1. The structure of the auto-encoder.

Firstly, the AE compresses the input data x into a latent space representation, and then reconstructs the output x'. The AE neural network is composed of an encoder and a decoder. The encoder calculates the n-dimensional implicit variable Z by the encoding function f(x). The n-dimension is much smaller than the m-dimension, because the encoder can effectively compress data into the lower dimensional space by learning. In fact, the 1×m matrix x is multiplied by the m×n matrix W to get the 1×n matrix Z. The decoder generates the m-dimensional new data x', using the n-dimensional hidden variable Z as the input of the decoding function g(Z). Therefore, the whole AE can be described by the function g(f(x)) = x', where the output x' is close to the original input x.

The AE neural network consists of network layers, in which each layer is fully connected by neurons. Each neuron has associated weights. Specifically, Lk denotes that the AE is composed of k layers of neural networks, l(i) denotes layer i in the AE, and |l(i)| denotes the number of neurons in l(i). The weights connecting l(i) to l(i+1) are expressed as the |l(i)| × |l(i+1)| matrix W(i), and the bias of l(i+1) is the |l(i+1)|-dimensional vector b(i). Finally, we record the parameter θ as the set of weights and biases for all layers, i.e. the tuple θ = (W, b). The AE activates l(2) with the output of l(1) weighted by W(1), then l(2) activates l(3) with its output weighted by W(2), and so on until the last layer is activated. This process is called forward propagation.

Let z(i) be the |l(i)|-dimensional vector of neuron outputs in l(i). We calculate z(i+1) by passing z(i) to l(i+1). The formula is as follows:

z(i+1) = f(W(i) · z(i) + b(i))    (1)

f is a common activation function of neurons. We use the sigmoid function in AE-IDS, which is defined as:

f(z) = 1 / (1 + exp(−z))    (2)

Finally, we define the output of the last layer as x' = z(k). Let the function h be the series of fully connected forward propagation calculations from the input x to the final output x', expressed as:

hθ(x) = x'    (3)

During the training period, the back propagation algorithm is used to adjust the weights W and biases b of the AE so that hθ(x) ≈ x; the best W and b are found given the input x and its reconstruction target. The objective of back propagation is to calculate the gradient ∇C of the cost function C, i.e. the partial derivatives of the cost function C with respect to the weights W and biases b, ∂C/∂W and ∂C/∂b. The cost function C is defined as follows:

C(W, b) = (1/n) Σ_{i=1}^{n} (hθ(x_i) − x_i)²    (4)

n is the number of training data x.

In application, the backpropagation algorithm (Baldi et al., 2018) and the stochastic gradient descent method (SGD) (Bottou, 2012) are usually combined: the gradient ∇Cx is calculated for individual training input samples and then used to estimate the gradient ∇C. The reason why we use the SGD algorithm to optimize (Ergen and Kozat, 2017) the weights W and biases b step by step is that AE-IDS is an online method, and the features of each network packet are different. Even for the same type of packet sequence, the features could differ because the packets may arrive in a different time order due to network delay or error. If new data arrives during the training phase, a single iteration of the internal loop is performed, and then it waits for the next iteration. So each observed data point is processed only once.

3.2. RMSE

The RMSE (Root Mean Square Error) is the square root of the mean squared deviation between the observed value and the true value. RMSE can be used to calculate the accuracy of the prediction (Chai and Draxler, 2014). The calculation formula of RMSE is as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (X_{s,i} − X_{m,i})² )    (5)

X_{s,i} denotes the observed value, X_{m,i} denotes the real value, and n denotes the number of observations.

The function of the auto-encoder is to reconstruct new data from the original data. In the training phase, we use the features of normal samples as the input for the auto-encoder learning. It can obtain new data with high similarity to the normal samples. In the decoding process, we try to make the result of the loss function approach zero, which means the observed value X_{s,i} and the real value X_{m,i} are made as equal as possible, that is, X_{s,i} − X_{m,i} ≈ 0. ELM (the Extreme Learning Machine method) is used to optimize it.
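To make Eqs. (1)–(5) concrete, the following is a minimal NumPy sketch of a 3-layer tied-weight auto-encoder forward pass and its RMSE. It is an illustration only, not the authors' released code: the array shapes, the sigmoid on the output layer, the random initialization and the 0.75 hidden-layer ratio (borrowed from Section 4.5.1) are our assumptions.

```python
import numpy as np

def sigmoid(z):
    # Activation function from Eq. (2)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b1, b2):
    """Forward pass of a 3-layer tied-weight auto-encoder.

    x  : (m,)   input feature vector
    W  : (m, n) encoding weight matrix; W.T is reused for decoding
    b1 : (n,)   hidden-layer bias
    b2 : (m,)   output-layer bias
    """
    z = sigmoid(x @ W + b1)        # encoder, Eq. (1)
    x_hat = sigmoid(z @ W.T + b2)  # decoder with tied weights (cf. Eq. (8))
    return x_hat

def rmse(x, x_hat):
    # Reconstruction error, Eq. (5)
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

# Toy usage: reconstruct one "normal" sample with random parameters.
rng = np.random.default_rng(0)
m, n = 8, 6                        # hidden size roughly m * 0.75
W = rng.normal(scale=0.1, size=(m, n))
b1, b2 = np.zeros(n), np.zeros(m)
x = rng.random(m)
print(rmse(x, forward(x, W, b1, b2)))
```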

Fig. 2. AE-IDS architecture.

Therefore, in the testing stage, the RMSE of normal samples tends to be zero by the above formula, while the RMSE of abnormal samples tends to be much larger than 0. RMSE is used to measure the deviation between the observed value and the real value.

4. AE-IDS design and implementation

In this section, we present the design and implementation details of AE-IDS. First, the overall architecture of AE-IDS is introduced. Then the four main modules are discussed, including Data Preprocessing, Feature Selection, Feature Grouping and Anomaly Detection.

4.1. AE-IDS architecture

AE-IDS is an IDS using an auto-encoder neural network, which aims at effectively detecting anomalies in network traffic. Since AE-IDS is designed as a lightweight real-time intrusion detection system, it should consume as little storage and computation resources as possible. We provide the source code for download (https://github.com/Battlingboy/AE-IDS). The AE-IDS architecture is shown in Fig. 2, which consists of the following four main modules: Data Preprocessing, Feature Selection, Feature Grouping and Anomaly Detection. These four modules are closely connected. The results of the previous module affect the results of the next one.

Data preprocessing: The default values and the infinite values in the raw samples are required to be substituted with suitable values. The datasets are divided into a sparse matrix and a dense matrix. Each matrix is divided into a training set and a test set by proportion. L2 normalization is carried out for the two matrices.

Feature selection: The random forest algorithm is applied to select the most significant features from the sparse matrix and the dense matrix. The randomness greatly reduces the variance of the data and improves the accuracy of feature selection. At the same time, the out-of-bag data of the random forest algorithm is used to test the model trained by the random forest algorithm. It can verify the accuracy of the random forest feature selection.

Feature grouping: Feature grouping is performed on the selected best features according to the degree of similarity. Firstly, we extract some normal data features and average them. Secondly, the AP clustering algorithm is used to group the averaged features and obtain several feature subsets. The AP clustering algorithm makes every feature subset contain features of the same order of magnitude as much as possible. The advantages include two points: firstly, it increases the comparability of the data features in each group in terms of order of magnitude; secondly, the variation of abnormal flows in these subsets can be observed.

Anomaly detection: The anomaly detection module is formed by the combination of the auto-encoder and K-means / GMM. After feature grouping, several feature subsets are obtained. In this module, there are as many auto-encoders as the total number of subsets. The normal network data is used to train the auto-encoders. When AE-IDS is deployed, the auto-encoders handle the new incoming traffic and calculate the RMSE. The RMSE value is the criterion to judge whether the network traffic is normal or not.

4.2. Data preprocessing

The data preprocessing module cleans the dataset, selects the best features, clusters similar features, and provides reasonable data for anomaly detection. In this module, the default values and infinity values of the original data are processed. Then the original data is divided into dense matrices and sparse matrices. Finally, L2 normalization is performed on the dense matrix and the sparse matrix, respectively, as shown in Fig. 3.

Fig. 3. The steps of data preprocessing.

4.2.1. Replace default and infinite values

In this paper, Pandas (a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool) is used to process the CSE-CIC-IDS 2018 on AWS dataset. If there are default values (NaN) or infinite values (Infinity) for features, it is necessary to count the proportion of samples containing these features in the whole dataset. How to replace NaN and Infinity is determined by this occupancy ratio. For example, "Flow Bytes/s" and "Flow Pkts/s" in the dataset have default values (NaN) and infinite values (Infinity). Because the ratio of NaN and Infinity to all other values is not more than 1%, NaN and Infinity are replaced by 0.

4.2.2. Sparse matrix and dense matrix

Most datasets extracted from network traffic can be roughly divided into two categories: one is the sparse matrix, the other is the dense matrix. A sparse matrix means the number of elements with value 0 is much larger than that of non-zero elements, and the distribution of non-zero elements is irregular. A dense matrix, on the contrary, means the non-zero elements occupy the majority of the matrix. For example, in the CSE-CIC-IDS 2018 on AWS datasets, we can use the "TotLen Fwd Pkts" and "TotLen Bwd Pkts" features to judge which matrix a record should belong to. "TotLen Fwd Pkts" is the total size of packets in the forward (source to destination) direction. "TotLen Bwd Pkts" is the total size in the backward (destination to source) direction.
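The preprocessing described above can be sketched with Pandas and scikit-learn as follows. This is a simplified illustration under our assumptions: the column names follow the CSE-CIC-IDS 2018 CSV files, "Label" is assumed to mark benign versus attack rows, and treating a record as sparse when both packet-length totals are zero is our reading of the informal "very small" criterion in Section 4.2.2; it is not the released AE-IDS code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

def preprocess(csv_path):
    """Replace NaN/Infinity, split sparse vs. dense records, L2-normalize each part."""
    df = pd.read_csv(csv_path)

    # Replace infinite and default (NaN) values with 0, since their ratio is below 1%.
    df = df.replace([np.inf, -np.inf], np.nan).fillna(0)

    # Assumed sparse criterion: no payload in either direction.
    sparse_mask = (df["TotLen Fwd Pkts"] + df["TotLen Bwd Pkts"]) == 0
    parts = {"sparse": df[sparse_mask], "dense": df[~sparse_mask]}

    # L2-normalize each matrix separately; labels are kept aside.
    out = {}
    for name, part in parts.items():
        labels = part["Label"].to_numpy()
        feats = normalize(part.drop(columns=["Label"]).to_numpy(dtype=float), norm="l2")
        out[name] = (feats, labels)
    return out
```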

If the values of "TotLen Fwd Pkts" and "TotLen Bwd Pkts" are very small, it means most of the network traffic statistics of the record are 0, so the record should belong to the sparse matrix. After the dataset is divided into the different types of matrices, different learning parameters can be used for each matrix, which reduces model training time and improves the classification accuracy.

4.2.3. Proportion of training set to test set

In our training phase, only normal samples are used for training. 85% of the normal samples are used and divided into the sparse matrix and the dense matrix. The remaining 15% of the normal samples and 100% of the abnormal samples are used for testing. Once a record is judged as sparse matrix (or dense matrix), we use the corresponding sparse matrix (or dense matrix) model for training.

4.2.4. L2 regularization

To calculate the similarity between two samples, L2 regularization (Nakatsukasa et al., 2017) is used. This method is mainly applied to text classification and clustering. Therefore, we regularize the classified sparse matrix (or dense matrix) by L2. The reason for choosing L2 regularization is that it can weaken the strong features as much as possible, and highlight some features with smaller values but more significance.

4.3. Feature selection

Different features naturally play different roles in the classification process. But some features have little effect, and they can be removed to improve the performance. There are many methods to remove these features. Among them, the random forest algorithm (Breiman, 2001) has the following advantages:

1. It can process a large amount of high-dimensional data in a short time.
2. Random selection of data and features can improve classification performance.
3. When building forests, it can produce unbiased error estimates.

The random forest algorithm is based on the construction of multiple decision trees, each of which works as a classifier. In random forests, each tree in the forest is built from samples drawn from the original dataset to construct a sub-dataset. The sub-datasets are put into each decision tree, and each decision tree outputs a result. The final decision result is determined by the voting of all decision trees. A tree in a random forest does not use all the candidate features. Instead, some features are randomly selected from all the features, and then the optimal features are selected from the randomly selected features. Because of this randomness, the forest bias usually increases slightly (relative to the bias of a single non-random tree), but its variance also decreases, which can not only compensate for the increase of the bias, but also produce a better overall classification model (Geurts et al., 2006).

We use the random forest algorithm to construct 1000 decision trees as classifiers to vote for the best results. For each node, m features are randomly selected, and the decision of each node in the decision tree is based on these features. According to these m features, the optimal splitting mode is calculated. Each tree grows completely without pruning, which may be adopted after a normal tree classifier is built. We use the Gini index as the scoring criterion to select partition features for subtrees. The Gini index is an index to measure the purity of a sample set. A higher Gini index indicates greater inequality and less purity.

When the random forest algorithm is applied to feature selection, the Permutation Importance (Gregorutti et al., 2017) score of each feature is calculated in each decision tree. Under the condition that other features remain unchanged, the random forest algorithm calculates the Permutation Importance of each feature of each decision tree: the value distribution of this feature over the samples in each decision tree is randomly permuted. The prediction accuracy is calculated before the permutation, and the prediction accuracy is calculated again after the feature is randomly permuted. The difference between the two prediction accuracy values is used as the measurement to evaluate the feature's Permutation Importance.

There are three kinds of values for Permutation Importance: positive, negative and zero. After random permutation, the corresponding Permutation Importance of an important feature should be a positive value. Poor features are scattered in different samples; the corresponding Permutation Importance of poor features is a negative value. An irrelevant feature does not have any correlation with the label of the sample, and the corresponding Permutation Importance is always zero no matter how many permutations are made. The higher the Permutation Importance score is, the more important the feature is. After calculating the importance score for each feature, the random forest algorithm ranks all features according to the importance score from large to small, and removes the features with an importance score lower than or equal to zero.

4.3.1. The workflow of feature selection

The specific workflow of this module is as follows:

Step 1: Set up 1000 decision trees to build a random forest. The sub-datasets are constructed by sampling with replacement for each decision tree in the random forest.

Step 2: Construct sub-decision trees by ensuring that each sub-decision tree computes the output result of its sub-dataset, and each decision tree outputs a result.

Step 3: The output result of the random forest is determined by the votes of the sub-decision trees.

Step 4: Calculate the number of classification errors E_i of the out-of-bag data in each sub-decision tree of the random forest.

Step 5: Randomly permute the values of feature X in the out-of-bag data of each decision tree, and recalculate the number of classification errors E_{xi}.

Step 6: Calculate significance and verify the feature selection. Let i = 1, 2, ..., n, where n is the number of decision trees contained in the random forest, and repeat steps 4 and 5. The importance of feature X is defined as:

I_X = (1/n) Σ_{i=1}^{n} (E_{xi} − E_i)

If the out-of-bag error of the model is significantly increased after noise is added to a feature, it indicates that the feature has a greater impact on the prediction result and thus has higher significance.

4.3.2. Assessment of features

To assess the feature selection method, we take the DDoS attack-HOIC dataset as an example to conduct feature selection. Experimental results are shown in Table 1. We rank the features according to the Permutation Importance score. The importance of features is different in the Dense Matrix and the Sparse Matrix, which also proves that it is necessary to select features separately in the two matrices.
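The selection step can be sketched with scikit-learn as below. This is an approximation of the procedure in Section 4.3.1 rather than the authors' implementation: scikit-learn's permutation_importance shuffles features at the estimator level instead of per tree on out-of-bag data, and the names X, y and feature_names are placeholders for the preprocessed matrix, its labels and its column names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def select_features(X, y, feature_names, n_trees=1000):
    """Rank features by permutation importance and keep those with a positive score."""
    forest = RandomForestClassifier(
        n_estimators=n_trees,   # 1000 trees, as in Step 1
        bootstrap=True,         # sub-datasets sampled with replacement
        oob_score=True,         # out-of-bag accuracy, reported as oob_score in Table 2
        n_jobs=-1,
        random_state=0,
    ).fit(X, y)

    # Accuracy drop after shuffling each feature, averaged over several repeats.
    result = permutation_importance(forest, X, y, n_repeats=5, random_state=0, n_jobs=-1)
    order = np.argsort(result.importances_mean)[::-1]
    selected = [feature_names[i] for i in order if result.importances_mean[i] > 0]
    return selected, forest.oob_score_
```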

Table 1
The results of DDoS attack-HOIC feature selection.

DenseMatrix SparseMatrix

Selection Feature Feature Name Permutation Importance Selection Feature Feature Name Permutation Importance

1. feature 67 Init Bwd Win Byts 0.142448 1. feature 48 ACK Flag Cnt 0.167369
2. feature 0 Dst Port 0.124863 2. feature 66 Init Fwd Win Byts 0.088511
3. feature 11 Bwd Pkt Len Max 0.088525 3. feature 21 Fwd IAT Tot 0.087468
4. feature 14 Bwd Pkt Len Std 0.087701 4. feature 2 Flow Duration 0.079206
5. feature 10 Fwd Pkt Len Std 0.084192 5. feature 1 Protocol 0.047451
6. feature 52 Down/Up Ratio 0.073227 6. feature 69 Fwd Seg Size Min 0.040396
7. feature 7 Fwd Pkt Len Max 0.063529 7. feature 19 Flow IAT Max 0.034278
8. feature 54 Fwd Seg Size Avg 0.028953 8. feature 62 Subflow Fwd Pkts 0.029227
9. feature 5 TotLen Fwd Pkts 0.028386 9. feature 68 Fwd Act Data Pkts 0.028637
10. feature 3 Tot Fwd Pkts 0.02716 10. feature 24 Fwd IAT Max 0.028466
11. feature 9 Fwd Pkt Len Mean 0.027025 11. feature 3 Tot Fwd Pkts 0.028393
12. feature 63 Subflow Fwd Byts 0.025216 12. feature 76 Idle Max 0.026967
13. feature 35 Fwd Header Len 0.024327 13. feature 18 Flow IAT Std 0.019029
14. feature 62 Subflow Fwd Pkts 0.024067 14. feature 63 Subflow Fwd Byts 0.018089
15. feature 65 Subflow Bwd Byts 0.021435 15. feature 35 Fwd Header Len 0.018001
16. feature 13 Bwd Pkt Len Mean 0.020893 16. feature 74 Idle Mean 0.015988
17. feature 55 Bwd Seg Size Avg 0.02013 17. feature 22 Fwd IAT Mean 0.015723
18. feature 64 Subflow Bwd Pkts 0.01464 18. feature 5 TotLen Fwd Pkts 0.015555
19. feature 6 TotLen Bwd Pkts 0.013768 19. feature 47 PSH Flag Cnt 0.014592
20. feature 4 Tot Bwd Pkts 0.013105 20. feature 70 Active Mean 0.014084
21. feature 36 Bwd Header Len 0.012986 21. feature 75 Idle Std 0.013421
22. feature 39 Pkt Len Min 0.013288

Table 2
The OOB_score of datasets.

Datasets    Dense Matrix (%)    Sparse Matrix (%)    Raw Data (%)

Bot 99.99 99.99 99.99


DoS attacks-SlowHTTPTest – 100 100
DoS attacks-Hulk 100 99.99 100
Brute Force-XSS 100 82.36 75.10
Brute Force-Web 87.34 81.11 77.59
SQL Injection 81.32 80.73 82.59
Infiltration 82.34 80.44 80.97
DDoS attack-HOIC 100 100 100
DDoS attack-LOIC-UDP 100 – 100
FTP-BruteForce – 99.99 99.99
SSH-Bruteforce 99.99 100 99.99

Because bootstrap sampling is used in the process of generating a random decision tree, only part of the samples are used to generate each tree. The unused samples are called out-of-bag samples. The accuracy of a tree can be evaluated by using its out-of-bag samples. The other subtrees can be evaluated according to the same principle, and finally the average value can be obtained. Due to the existence of out-of-bag samples, there is no need to use cross-validation. By randomly permuting each feature in turn, the change in the performance of the algorithm can be observed. A larger score indicates a more important feature. Therefore, we can rank the features according to their importance and then choose the best combination of features. We use oob_score to represent the out-of-bag estimation score on the training dataset. As shown in Table 2, except for the Brute Force-Web, Brute Force-XSS and SSH-Bruteforce datasets, most oob_scores are above 99%.

4.4. Feature grouping

Feature grouping refers to dividing the features into different groups according to similarity, and each group becomes a feature subset. Feature grouping tries to keep the difference between the features in the same group as small as possible, while keeping the difference between the features in different groups as large as possible. In the field of artificial intelligence, feature grouping is helpful to observe, extract and identify data.

This module takes the selected features from normal network traffic without considering the dependent variable, so it adopts an unsupervised clustering algorithm as the feature grouping method. The clustering algorithm can divide the features into different groups by calculating the similarity or distance between the data features, which not only eliminates the need for statistics of data features but also overcomes the problem of uneven distribution of elements.

4.4.1. Selecting affinity propagation

Traditional clustering algorithms can be roughly divided into three categories: partition clustering, hierarchical clustering, and density clustering. We compare the AP algorithm with these three categories of algorithms to show the advantages of the AP algorithm.

Although some partition-based clustering algorithms (e.g., K-means) and the AP algorithm both use a greedy strategy to solve the combinatorial optimization problem, the AP algorithm is more robust. In detail, the first step of the K-means algorithm is setting the number of clusters (i.e., the K value), and the K value greatly affects the quality of the clusters. However, the AP algorithm is a continuous optimization process, which means each sample point is considered as a candidate class representative point, and the clusters are gradually identified. Due to this reason, the AP algorithm is not troubled by the selection of the initial points and it can guarantee convergence to the global optimum. Thus, the AP algorithm is more stable than the K-means algorithm.

For the hierarchical clustering algorithms (e.g., single-link clustering, the CURE algorithm, and the BIRCH algorithm) (Reddy and Vinzamuri, 2018), there are two main advantages: (1) they are simple and easy to understand; (2) the hierarchical relationship between clusters can be easily found. However, the main limitation of these algorithms is that choosing different distance measures and linkage criteria greatly affects the correctness of the final result. Moreover, multiple agglomeration or split operations bring higher time and space complexity. Although the AP algorithm is also based on the similarity matrix of the data, it does not have any requirements on the symmetry of the similarity matrix. Due to this reason, the AP algorithm has a broader scope than the hierarchical clustering algorithms.

For the density clustering algorithms based on the DBSCAN algorithm (Lv et al., 2016), the main advantage is that they do not need to determine the number of clusters K before feature grouping. Moreover, they are robust to noise and can find clusters of any shape. However, they need two fixed parameters (i.e., the radius eps and the minimum number of points MinPts) to find the clusters. If the degrees of sparseness of the clusters are different, clustering them with fixed parameters can result in significantly different clustering effects. In other words, the clustering results are unstable in this case. As the AP algorithm does not need such fixed parameters, it does not have this limitation.

This module groups the m features (dimensions) of the data x into n smaller sub-datasets through AP clustering. In the anomaly detection, we use the n smaller sub-datasets to reconstruct data through the AE networks and generate the RMSE. In fact, the n sub-datasets are the inputs of the n auto-encoders in the anomaly detection section. Let f' denote an ordered set of n instances, which can be expressed as:

f' = [f1', f2', f3', ..., fn']    (6)

and

Σ_{i=1}^{n} ||fi'|| = ||x||    (7)

Among them, the instance fi' is actually a subset of the data x which contains several strongly interrelated features. Affinity Propagation (AP) clusters the unevenly distributed features into several similar feature subsets. It does not need to preset the number of clusters, but iteratively updates until suitable feature subsets are obtained, and the AP algorithm considers all nodes for clustering at the same time in each iteration. Compared with other algorithms, it can obtain more stable clustering results. This is an important reason for this module to choose the AP clustering algorithm.

4.4.2. The workflow of feature grouping

The specific workflow is as follows:

Step 1: 85% × 25% of the normal samples are extracted, and the average values of each feature are calculated to get the data points to be clustered.

Step 2: Compute the similarity matrix of the data points.

Step 3: The responsibility information (Hang et al., 2016) of each point in the similarity matrix is updated. At the same time, the availability information (Gan and Ng, 2015) is also calculated.

Step 4: Update the responsibility information and the availability information of the samples.

Step 5: The summation for each sample determines whether it can be selected as a clustering center. If the clustering centers remain unchanged after several iterations, or the number of iterations exceeds the predefined number, or the updated information remains unchanged after several iterations, the iteration ends (Frey and Dueck, 2007).

Step 6: Cluster similar features into the same group.

In the feature selection part, the random forest algorithm has helped us to select the best features, and then we use AP clustering to divide these most representative features into several feature groups. In this way, not only does the anomaly detection complexity decrease, but also the anomalous changes occurring in each group can be easily captured. We do not directly hand over the data to AP clustering, but extract part of the normal samples and calculate the averages for AP clustering. During AP clustering, normal samples do not change very much in the selected features, and the variance is small.

Although the AP clustering algorithm has obvious advantages in clustering large data, its algorithmic complexity is high, O(n²·log n). For example, if we extracted 85% × 25% of the normal samples for AP clustering directly, it would cost a lot in space and time, which is contrary to our intention of a lightweight design for AE-IDS. 85% of the normal samples are extracted for model training; the first 25% of this training set is used for feature grouping, and the remaining 75% of the training set is used for auto-encoder training.

After calculating the average values of the selected features, we divide them into three feature subsets by AP clustering. Each feature subset contains features of the same order of numerical value. The results are shown in Fig. 4. The x-axis represents the feature number and the y-axis represents the value size.

Fig. 4. The feature grouping results after AP clustering.
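A minimal sketch of this grouping step with scikit-learn is shown below. It reflects our reading of Section 4.4.2 — one data point per selected feature, namely its average value over the first 25% of the normal training rows — and is an assumption rather than the released code; in particular, clustering the one-dimensional feature means is a simplification of the similarity-matrix description above.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def group_features(train_normal, n_grouping_rows):
    """Group selected features by the magnitude of their mean values.

    train_normal    : 2-D array (samples x selected features) of normal traffic
    n_grouping_rows : number of leading training rows to average (25% of the training set)
    """
    # One data point per feature: its average over the grouping rows.
    means = train_normal[:n_grouping_rows].mean(axis=0).reshape(-1, 1)

    # Affinity Propagation decides the number of groups by itself.
    ap = AffinityPropagation(random_state=0).fit(means)

    groups = {}
    for feat_idx, label in enumerate(ap.labels_):
        groups.setdefault(label, []).append(feat_idx)
    return list(groups.values())

# Example: features whose averages lie on different orders of magnitude
# end up in different groups.
rng = np.random.default_rng(0)
X = np.hstack([rng.random((200, 3)), 1e3 * rng.random((200, 3))])
print(group_features(X, 50))
```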

We force the error of the required output, hθ(fi') − fi', to 0. SGD (Stochastic Gradient Descent) can be used to optimize the parameters W1i and the biases b1i, b2i through back propagation. Only three layers of neural network are used to build the AE in our paper because of the following considerations:

1. Although a more complex deep neural network (with many network layers and network nodes) can achieve an excellent fitting effect on the training dataset, it is very likely to over-fit on the test dataset generated by network traffic. The computation cost is also too high for an online IDS.

2. Selecting the most effective features and using a shallow neural network can already obtain a good expression of the features. The difference between abnormal data and normal data in the optimal features is enough to distinguish them.

If the feature subset fi' is a row vector of 1×m and the weight matrix Wi is defined as a matrix of m×(m×0.75), then the hidden variable z in the middle layer will be a matrix of 1×(m×0.75). In this way, the feature compression of the original feature subset fi' is completed. We set the hidden size to m×0.75 to ensure that the original feature subset fi' keeps a strong and effective feature representation. We look for a highly correlated feature subset within the optimal features, which were selected by the random forest and are already representative. The amount of compression is kept small because we do not want to lose too many effective features at this stage.

After SGD optimization, the RMSE can be calculated. The forcing error is expected to be zero, i.e. hθ(fi') − fi' = 0. The purpose of using the Extreme Learning Machine (ELM) in the training stage is to improve the learning efficiency and simplify the parameter setting of the Backward Propagation (BP) algorithm. Because the forcing error hθ(fi') − fi' = 0, the RMSE value in the training stage is 0. In the test stage, we directly input the feature subsets into the trained model and calculate the RMSE. The parameters W1i and the biases b1i, b2i are all based on normal data. When abnormal data are input into the model, the RMSE will increase dramatically, because the abnormal data show strong differences in the selected features.

4.5.2. Evaluate anomaly score

After the RMSE value is calculated for each feature subset, the average RMSE value of all feature subsets is calculated as the anomaly score s. The value of s is in the range [0, +∞), and a larger value indicates a higher anomaly probability. The output is usually normalized so that a score less than the threshold φ is considered normal and a score greater than the threshold φ is considered abnormal (Gil et al., 2018). The choice of the threshold φ has a great influence on the performance of the algorithm. The threshold value φ is set to the maximum error of the 85% of normal samples used for training. We use min-max normalization to transform the original data linearly and map the result values to [0, 1].

The training algorithm and the executing algorithm are shown in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1: Training Algorithm.
Input: the first layer L(1); the second layer L(2); the feature grouping f'.
Output: z = RMSE.
1  Training feature sub-datasets: train(L(1), L(2), f');
2  Initializing L(2): z = zeros(k);
3  for each θi ∈ L(1) do
4      Normalization: vi = norm(fi');
5      The first layer propagates forward: Wi, zi = hθi(1)(fi');
6      The second layer propagates forward: Wi^T, yi = hθi(2)(zi);
7      Back-propagation: si = bθi(fi', yi);
8      Updating weights: θi = SGD(Wi, si);
9      Abnormal score calculation: z[i] = RMSE(vi, yi);
10 end
11 return z;

Algorithm 2: Executing Algorithm.
Input: the first layer L(1); the second layer L(2); the feature grouping f'.
Output: z = RMSE.
1  Executing feature sub-datasets: train(L(1), L(2), f');
2  Initializing L(2): z = zeros(k);
3  for each θi ∈ L(1) do
4      Normalization: vi = norm(fi');
5      The first layer propagates forward: Wi, zi = hθi(1)(fi');
6      The second layer propagates forward: Wi^T, yi = hθi(2)(zi);
7      Abnormal score calculation: z[i] = RMSE(vi, yi);
8  end
9  return z;

The anomaly alarm depends on the cut-off threshold φ. A simple method to set φ is to use the maximum score calculated during the training period, where the training dataset is assumed to represent all normal data. The K-means++ (Sandhu et al., 2019) clustering method can be used as the anomaly classifier. We can also send out an alarm according to the probability. Specifically, the output RMSE score can be fitted to a normal distribution (Bugdary and Maymon, 2019). When the probability that a score belongs to a normal sample is less than the probability that it belongs to an abnormal sample, an alarm is sent out. An unsupervised GMM classifier is used in this method.

By performing the following steps, the auto-encoder can be applied to anomaly detection. Let φ be the abnormal threshold with initial value -1.

1. Training phase: train the auto-encoders with normal network traffic.
   1) Execution: s = RMSE(x, x')
   2) Update: if (s ≥ φ): φ = s

2. Testing phase: run the trained auto-encoders with online network traffic.
   1) Execution: s = RMSE(x, x')
   2) Judgment: if (s ≥ φ): Alarm
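The sketch below mirrors what Algorithms 1 and 2 compute: one small tied-weight auto-encoder per feature subset, trained on normal records, with the per-record anomaly score taken as the mean RMSE and the threshold φ tracked as the largest training score. It is an illustration under our assumptions, not the released AE-IDS code: the gradient update is deliberately simplified (sigmoid derivatives are omitted), and the ELM-style optimization used by the authors is replaced by plain online SGD.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SubsetAE:
    """Tiny tied-weight auto-encoder for one feature subset (hidden = 0.75 * m)."""

    def __init__(self, m, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        n = max(1, int(m * 0.75))
        self.W = rng.normal(scale=0.1, size=(m, n))
        self.b1, self.b2 = np.zeros(n), np.zeros(m)
        self.lr = lr

    def reconstruct(self, x):
        z = sigmoid(x @ self.W + self.b1)
        return sigmoid(z @ self.W.T + self.b2), z

    def rmse(self, x):
        x_hat, _ = self.reconstruct(x)
        return float(np.sqrt(np.mean((x - x_hat) ** 2)))

    def sgd_step(self, x):
        # One online step on the squared reconstruction error; the activation
        # derivatives are omitted, so this is a crude but runnable approximation.
        x_hat, z = self.reconstruct(x)
        err = x_hat - x                                          # (m,)
        grad_W = np.outer(err, z) + np.outer(x, err @ self.W)    # decoder + encoder paths
        self.W -= self.lr * grad_W
        self.b2 -= self.lr * err
        self.b1 -= self.lr * (err @ self.W)
        return float(np.sqrt(np.mean(err ** 2)))

def train(aes, normal_subset_rows, epochs=1):
    """Algorithm 1 (sketch): fit one AE per subset and track the largest training score."""
    phi = -1.0
    for _ in range(epochs):
        for row in normal_subset_rows:          # row = list of per-subset vectors
            scores = [ae.sgd_step(x) for ae, x in zip(aes, row)]
            phi = max(phi, float(np.mean(scores)))
    return phi

def execute(aes, row, phi):
    """Algorithm 2 (sketch): score an incoming record and raise an alarm if s >= phi."""
    s = float(np.mean([ae.rmse(x) for ae, x in zip(aes, row)]))
    return s, s >= phi
```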

4.5.3. The workflow of anomaly detection


The specific workflow of this module is as follows:

Step 1: Build an auto-encoder with a three-layer neural network. Among the layers, the numbers of nodes in the first layer and the last layer are the same.

Step 2: Use the extreme learning machine to optimize the parameter weights and biases; the loss function is the root mean square error (RMSE) between the fitted data and the raw data.

Step 3: Divide the normal samples into the sparse matrix and the dense matrix, group the selected features and put them into the corresponding auto-encoders. The number of auto-encoders is the same as the number of feature groups. Calculate the average of the RMSE scores as the output.

Step 4: In the training phase, part of the normal samples are input to the auto-encoders, and the weights and biases are calculated.

Step 5: In the testing phase, the remaining normal samples are put into the trained model, and the average RMSE value is calculated as the output.

Step 6: Normalize the RMSE and cluster the RMSE scores by GMM / K-means.
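Step 6 can be sketched with scikit-learn as follows. The min-max scaling and the rule that the cluster with the larger mean score is labelled as the attack class are our assumptions for the illustration; the paper only states that GMM or K-means is applied to the normalized scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler

def classify_scores(rmse_scores, method="gmm"):
    """Split min-max-normalized anomaly scores into two clusters; 1 = predicted attack."""
    s = MinMaxScaler().fit_transform(np.asarray(rmse_scores, dtype=float).reshape(-1, 1))

    if method == "kmeans":
        model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(s)
        labels, centers = model.labels_, model.cluster_centers_.ravel()
    else:
        model = GaussianMixture(n_components=2, random_state=0).fit(s)
        labels, centers = model.predict(s), model.means_.ravel()

    attack_cluster = int(np.argmax(centers))   # higher reconstruction error = attack
    return (labels == attack_cluster).astype(int)
```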

4.6. The overall workflow of AE-IDS

As shown in Fig. 5, x represents the original input data, x1 is selected from the normal samples and x2 is selected from the abnormal samples. If x1 and x2 are less than 20% of x, up-sampling is performed. Otherwise, random forest is used to select features. We define Xtrain = 85% × x1. When the counter n reaches m (m = Xtrain × 25%), the AP algorithm is applied to do feature grouping in order to increase the final accuracy and efficiency. When n reaches Xtrain (m1 = Xtrain × 75%, Xtrain = m1 + m), we use the auto-encoders to do anomaly detection. At the final step, the GMM and K-means algorithms are used to make the final decision.

Fig. 5. The AE-IDS flowchart.
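The data-split bookkeeping described above can be written down as a short helper. This is a sketch under our assumptions (NumPy arrays as input, shuffling before splitting); the up-sampling branch triggered by the 20% check is only noted in a comment, and the function names are hypothetical.

```python
import numpy as np

def split_for_aeids(x_normal, x_abnormal, seed=0):
    """Partition the data as described in Section 4.6 (up-sampling branch omitted)."""
    rng = np.random.default_rng(seed)
    x_normal = rng.permutation(x_normal)

    # 85% of the normal samples form the training set Xtrain ...
    n_train = int(0.85 * len(x_normal))
    x_train, x_test_normal = x_normal[:n_train], x_normal[n_train:]

    # ... of which the first 25% (m) feeds AP feature grouping and the
    # remaining 75% (m1) trains the auto-encoders.
    m = int(0.25 * n_train)
    x_group, x_ae = x_train[:m], x_train[m:]

    # Test set: the remaining 15% of normal samples plus all abnormal samples.
    x_test = np.vstack([x_test_normal, x_abnormal])
    y_test = np.concatenate([np.zeros(len(x_test_normal)), np.ones(len(x_abnormal))])
    return x_group, x_ae, x_test, y_test
```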
5. Evaluation

In this section, we describe the evaluation details about how to set up the experimental environment, how to use the CSE-CIC-IDS 2018 dataset to evaluate anomaly detection, as well as how to use the code libraries to build the modules. The source code of AE-IDS has been released on GitHub (https://github.com/Battlingboy/AE-IDS).

5.1. Introduction to the CSE-CIC-IDS 2018 dataset

Table 3 dense and sparse matrices, respectively. L2 regularization is used


Statistics of CSE-CIC-IDS 2018 on AWS dataset.
to normalize dense and sparse matrices. Pandas is used during the
Datasets DenseMatrix SparseMatrix RawData whole process2 .
709,675 338,900 1,048,575 Step2: The best features are selected by the Random forest al-
Bot 0:567,492 0:194,892 0:762,384 gorithm and stored in a training dataset and a testing dataset at
1:142,183 1:144,008 1:286,191 a ratio. Training dataset extracts 85% of the normal data from the
467,947 440,737 908,684 processed data. Testing dataset contains the remaining 15% normal
DoS attacks-Hulk 0:442,399 0:4,373 0:446,772 and 100% abnormal data. The scikit-learn3 toolkit is used to ac-
1:25,548 1:436,364 1:461,912 complish Random forest algorithm.
442,399 144,263 586,662 Step3: The AP clustering algorithm used 25% of training dataset
DoS attacks-SlowHTTPTest 0:442,399 0:4,373 0:446,772 to group features. It adjusts the cluster centers by AP clustering
1:0 1:139,890 1:139,890 and updates responsibility and availability. This step repeats until
730,496 317,664 1,048,160 reaches the stop condition. We accomplish this by scikit-learn.
Brute Force-XSS 0:730,418 0:317,591 0:1,048,009 Step4: Each feature group is sent to an auto-encoder to com-
1:78 1:73 1:151 pute RMSE. All RMSE values from all auto-encoders are calcu-
730,670 317,701 1,048,371 lated and averaged as anomaly score. Extreme learning machine
Brute Force-Web 0:730,418 0:317,591 0:1,048,009 is used to optimize parameter weight and bias. The loss function
1:252 1:110 1:362 is anomaly score of fitting data and input data.
659,438 388,624 1,048,062 Step5: In the execution phase, testing dataset is put into the
SQL Injection 0:659,411 0:18,760 0:1,048,009 trained model. Anomaly score is calculated according to normal-
1:27 1:26 1:53 ized RMSE’s mean value and classified by GMM / K-means. Auto-
204,075 127,025 613,071 encoder is built using python. GMM and Kmeans are implemented
Infiltration 0:156,313 0:81,724 0:696,667 by scikit-learn tookkit.
1:47,762 1:45,301 1:88,064

522,387 524,458 1,046,845 5.3. Experimental analysis


DDoS attack-HOIC 0:358,637 0:2,196 0:360,833
1:163,750 1:52,262 1:686,012
The dataset includes network traffic and system logs, from which 83 features are extracted, such as duration, number of packets, number of bytes, and length of packets. The dataset contains two-way flows, in which the first packet determines the forward (source-to-destination) and backward (destination-to-source) directions.

The general information about the CSE-CIC-IDS 2018 on AWS dataset is listed in Table 3, including the number of samples in each sub-dataset, the number of normal and abnormal samples, the number of samples after classifying them into the sparse matrix (dense matrix), and the number of normal and abnormal samples in the sparse matrix (dense matrix). Although up-sampling can solve the data imbalance problem and eliminate the outlier sensitivity problem, the AUC areas of the experimental results with and without up-sampling are the same. In the end, we did not use up-sampling for data preprocessing.

Table 3 (continued)
Number of normal (0) and abnormal (1) samples in the dense matrix, sparse matrix and raw data of each sub-dataset.

Sub-dataset            Dense Matrix                            Sparse Matrix                           Raw Data
Brute Force-XSS        0: 730,418; 1: 78; total 730,496        0: 317,591; 1: 73; total 317,664        0: 1,048,009; 1: 151; total 1,048,160
Brute Force-Web        0: 730,418; 1: 252; total 730,670       0: 317,591; 1: 110; total 317,701       0: 1,048,009; 1: 362; total 1,048,371
SQL Injection          0: 659,411; 1: 27; total 659,438        0: 18,760; 1: 26; total 388,624         0: 1,048,009; 1: 53; total 1,048,062
Infiltration           0: 156,313; 1: 47,762; total 204,075    0: 81,724; 1: 45,301; total 127,025     0: 696,667; 1: 88,064; total 613,071
DDoS attack-HOIC       0: 358,637; 1: 163,750; total 522,387   0: 2,196; 1: 52,262; total 524,458      0: 360,833; 1: 686,012; total 1,046,845
DDoS attack-LOIC-UDP   0: 360,558; 1: 1,730; total 362,288     0: 275; 1: 0; total 275                 0: 360,833; 1: 1,730; total 362,563
FTP-BruteForce         0: 448,576; 1: 0; total 448,576         0: 219,050; 1: 193,360; total 412,410   0: 667,626; 1: 193,360; total 860,986
SSH-Bruteforce         0: 448,576; 1: 93,818; total 542,394    0: 219,050; 1: 93,771; total 312,821    0: 667,626; 1: 187,589; total 855,215
0: normal samples; 1: abnormal samples.

5.2. Experiment environment

We have implemented AE-IDS using Python 3.7 on the Windows 10 platform. The experiments are carried out on a PC with an Intel(R) Core(TM) i7-7700HQ CPU. The main frequency is 2.80 GHz and the memory size is 16 GB RAM. The following steps are performed for every attack dataset.

Step1: The default values and the infinite values are replaced with zero-filling. We divide each processed dataset into the sparse matrix and the dense matrix, and extract the normal samples from the sparse matrix or dense matrix. AE-IDS builds the normal model with these normal samples.
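Step 1 can be expressed in a few lines of pandas. The rule used below for assigning a record to the sparse or dense matrix (the fraction of zero-valued features) is only an assumption for this sketch; the exact splitting criterion is defined earlier in the paper.

```python
import numpy as np
import pandas as pd

def preprocess_and_split(df, zero_ratio_threshold=0.5):
    """Replace default/infinite values with zero, then split records into a
    sparse matrix and a dense matrix.

    The threshold on the fraction of zero-valued features is an assumed
    splitting rule used only for illustration.
    """
    clean = df.replace([np.inf, -np.inf], 0).fillna(0)
    zero_ratio = (clean == 0).mean(axis=1)            # fraction of zero features per record
    sparse_matrix = clean[zero_ratio >= zero_ratio_threshold]
    dense_matrix = clean[zero_ratio < zero_ratio_threshold]
    return sparse_matrix, dense_matrix
```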

5.3. Experimental analysis

We compare the sparse matrix and dense matrix with the raw data to prove that the divided dataset reduces the training time and improves the detection accuracy. We also compare AE-IDS and KitNet (Mirsky et al., 2018) to prove that AE-IDS is more efficient and accurate.

5.3.1. Comparison of confusion matrix results

A confusion matrix, also known as an error matrix, is a standard format for accuracy evaluation. It is mainly used to compare the classification results with the actual measured values, and it displays the accuracy of the classification results in matrix form.

Table 4
The Confusion Matrix of AE-IDS.

Table 4 is the confusion matrix of the AE-IDS sparse matrix and dense matrix experimental results. It shows that using the sparse matrix and dense matrix in detection can significantly improve the accuracy and efficiency. We can get good detection results for most of the sub-datasets, such as Bot, DoS attacks-SlowHTTPTest and SSH-BruteForce. However, both the Infiltration and SQL Injection sub-datasets do not perform well in AE-IDS, because the features of normal and abnormal samples are almost the same in these two sub-datasets. We made a survey of the original network traffic. Infiltration attacks are usually performed inside the intranet and do not show suspicious network behavior. SQL Injection works by submitting an HTTP request containing a malformed SQL statement, which, from a network traffic perspective, looks similar to a normal HTTP request. This means it is not enough to detect Infiltration and SQL Injection attacks only by network traffic; system logs are very useful to detect these two attacks.

To compare with KitNet, we run KitNet on the same AWS dataset. Table 5 shows that AE-IDS performs as well as KitNet in most of the dataset experiments. In the Bot sub-dataset experiment, AE-IDS performs much better than KitNet. Although KitNet uses correlation coefficients between features to cluster hierarchically, it is relatively inaccurate and consumes too many computation resources. AE-IDS performs slightly better than KitNet in computation overhead. This is due to the use of the random forest and AP algorithms for feature selection and grouping. AP clustering performs well because the normal sample features of these datasets do not change much and can be replaced by average values. In the Infiltration, Brute Force-Web, SQL Injection and SSH-Bruteforce datasets, the features of the abnormal samples are almost the same as those of the normal data, which causes poor results. The GMM method was also evaluated and gives similar results, which are not listed in the paper.
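For reference, per-dataset confusion matrices of the kind summarised in Tables 4 and 5 can be produced directly from the predicted labels. The snippet below is a generic illustration, not the paper's evaluation script.

```python
from sklearn.metrics import confusion_matrix

def summarize(y_true, y_pred):
    """Return the 2x2 confusion matrix entries (0 = normal, 1 = abnormal)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"TN": tn, "FP": fp, "FN": fn, "TP": tp}
```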

5.3.2. Recall

In anomaly detection, people pay more attention to the recall rate (recall = TP / (TP + FN)) and even want to ensure that it is 100%. Table 6 shows the performance of AE-IDS and KitNet on the recall rate. The dense matrix and sparse matrix used by AE-IDS are significantly better than the raw data. The overall recall rate of AE-IDS is better than that of KitNet on the same dataset. The DoS attacks-SlowHTTPTest and FTP-BruteForce datasets only have normal samples in the dense matrix, and DDoS attack-LOIC-UDP only has normal samples in the sparse matrix, so the recall rates of those experiments are not available. The results show that our method can adapt well to the imbalanced environment and effectively improve the performance for positive classes.
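The recall values in Table 6 follow directly from the definition above; a minimal way to reproduce them from predicted labels is shown below (a generic sketch, not the paper's code).

```python
from sklearn.metrics import recall_score

def recall_percent(y_true, y_pred):
    """Recall = TP / (TP + FN), reported as a percentage as in Table 6.
    When a dataset contains no abnormal samples the value is undefined
    (reported as N/A in the table); zero_division=0 avoids a warning here."""
    return 100.0 * recall_score(y_true, y_pred, pos_label=1, zero_division=0)
```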

Table 5
The Confusion Matrix of AE-IDS and KitNet.

Table 6
The recall rate of all experiments.

DataSet                     AE-IDS DenseMatrix (%)   AE-IDS SparseMatrix (%)   AE-IDS RawData (%)   KitNet (%)
Bot                         96.98                    99.73                     47.26                0.05
Brute Force-Web             61.90                    56.36                     40.61                35.36
Brute Force-XSS             92.31                    57.53                     17.22                3.31
DDoS attack-LOIC-UDP        100.00                   N/A                       100.00               98.44
DDoS attack-HOIC            100.00                   100.00                    100.00               100.00
DoS attacks-SlowHTTPTest    N/A                      100.00                    100.00               100.00
DoS attacks-Hulk            100.00                   100.00                    2.99                 0.27
Infiltration                19.38                    10.63                     23.37                N/A
SQL Injection               N/A                      N/A                       11.32                3.77
SSH-Bruteforce              100.00                   100.00                    99.41                100.00
FTP-Bruteforce              N/A                      100.00                    100.00               100.00

Fig. 6. The AUC of AE-IDS detection results.

5.3.3. AUC

The ROC curve is based on the confusion matrix and is used to evaluate the prediction ability of the model. In real datasets, there is often a phenomenon of category imbalance: there are more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the testing dataset may change with time. In this case, the ROC curve remains unchanged.

AUC (Area Under Curve) is the area under the ROC curve enclosed by the coordinate axes, and its maximum value is 1. If the AUC is close to 1.0, the detection results are trustworthy. Criteria for judging the quality of a classifier (prediction model) from the AUC include:

• AUC = 1, perfect classification.
• AUC = [0.85, 0.95], good effect.
• AUC = [0.7, 0.85], general effect.
• AUC = [0.5, 0.7], less effective.
• AUC = 0.5, just like a random guess; the model has no predictive value.
• AUC < 0.5, worse than a random guess.
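The ROC curves and AUC values in Fig. 6 can be reproduced from the per-sample anomaly scores with scikit-learn; the snippet below is a generic sketch of that computation.

```python
from sklearn.metrics import roc_curve, auc

def roc_auc(y_true, anomaly_scores):
    """Compute the ROC curve and its AUC from anomaly scores
    (a higher score means more likely abnormal)."""
    fpr, tpr, thresholds = roc_curve(y_true, anomaly_scores, pos_label=1)
    return fpr, tpr, auc(fpr, tpr)
```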

Fig. 6 shows the ROC curves and AUC areas of the AE-IDS sparse matrix and dense matrix experiments. AE-IDS has a good ability to distinguish abnormal behavior in most experiments. In the Infiltration dataset, the reason why AE-IDS does not perform well is that the intruder penetrates the intranet and becomes a legitimate user of the intranet, so all behaviors in the intranet are considered normal. The attacker performs SQL Injection by sending an HTTP request containing a malformed SQL statement, which is difficult for AE-IDS to detect.

Table 7
Detection time used by AE-IDS.

Datasets                    DenseMatrix (s)   SparseMatrix (s)
Bot                         297.30            106.88
DoS attacks-Hulk            101.78            99.53
DoS attacks-SlowHTTPTest    1253.62           101.54
Brute Force-XSS             330.10            108.20
Brute Force-Web             336.72            119.42
SQL Injection               369.18            89.51
Infiltration                273.41            90.59
DDoS attack-HOIC            230.78            249.55
DDoS attack-LOIC-UDP        138.19            1.75
FTP-BruteForce              1158.42           158.65
SSH-Bruteforce              170.99            88.38

5.3.4. Efficiency comparison

We compare the time spent by AE-IDS and KitNet on the same dataset, as shown in Tables 7 and 8. Table 7 shows that the efficiency is obviously improved when the dataset is divided into a sparse matrix and a dense matrix. Generally, the time summation of the sparse matrix and the dense matrix is less than the time of the raw data. For datasets such as DoS attacks-SlowHTTPTest and FTP-BruteForce, only normal samples are provided, so the random forest algorithm cannot select features from them. It then uses all features for AP clustering and obtains many feature groups, which produces multiple auto-encoder neural networks. Thus, it takes a much longer time to finish training. From Table 8, we can see that AE-IDS consumes less time than KitNet on the raw datasets. Because KitNet does not use the random forest algorithm for feature selection, it is much slower than AE-IDS. In addition, AE-IDS skillfully uses AP clustering in feature grouping, which makes anomaly detection use much less time.
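The feature-selection step discussed above relies on random forest importance scores. A minimal sketch of that selection with scikit-learn is given below; the number of trees and the importance threshold are assumptions of this illustration, and, as noted above, the step only applies when both normal and abnormal samples are present.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, importance_threshold=0.01, n_estimators=100):
    """Keep the features whose random forest importance exceeds a threshold.

    X: samples x features, y: 0 = normal, 1 = abnormal. The threshold and
    forest size are illustrative choices, not values taken from the paper.
    """
    if len(np.unique(y)) < 2:          # only normal samples: selection is not possible
        return np.arange(X.shape[1])   # fall back to all features
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0).fit(X, y)
    return np.where(forest.feature_importances_ > importance_threshold)[0]
```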

Table 8
Detection time used by AE-IDS and KitNet.

Datasets                    AE-IDS (s)   KitNet (s)
Bot                         712.96       3495.81
DoS attacks-Hulk            629.48       2495.78
DoS attacks-SlowHTTPTest    248.77       1902.74
Brute Force-XSS             500.25       3799.07
Brute Force-Web             2493.83      3420.34
SQL Injection               746.92       2246.02
Infiltration                331.66       1739.20
DDoS attack-HOIC            418.36       1412.46
DDoS attack-LOIC-UDP        295.75       1497.70
FTP-BruteForce              213.35       2271.77
SSH-Bruteforce              377.11       2277.37
6. Conclusion

AE-IDS is an unsupervised auto-encoder based intrusion detection system that includes four modules: data preprocessing, feature selection, feature grouping and anomaly detection. It does not depend on a labelled training dataset; only normal traffic is enough for training. Because of its three-layer neural network structure, it is efficient and light-weight. The auto-encoder technique can efficiently solve the sample imbalance problem, which is very common in network environments. AE-IDS works online and performs better than some popular batch/offline methods. It reduces the computation cost through the feature selection and feature grouping operations. AP clustering is helpful to find significant feature subsets and merge similar features. Experimental results show that it can detect most attacks accurately and speed up the training and testing process.

At present, network attacks become more and more hidden and are mixed into normal network traffic. The encryption of malicious traffic also brings many challenges to intrusion detection. Features extracted from network traffic are not enough; we need to consider more information, including system logs and security device alarms, to effectively defend against intrusions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

XuKui Li: Methodology, Validation, Software, Investigation, Data curation. Wei Chen: Conceptualization, Supervision, Resources, Writing - review & editing. Qianru Zhang: Visualization, Data curation. Lifa Wu: Funding acquisition.

Acknowledgment

We would like to thank our anonymous reviewers for their valuable comments and suggestions. This work is supported by the National Key Research and Development Program of China under Grant No. 2019YFB2101704 and the National Natural Science Foundation of China under Grant No. 61602258.

References

Al-Yaseen, W.L., Othman, Z.A., Nazri, M.Z.A., 2017. Multi-level hybrid support vector machine and extreme learning machine based on modified k-means for intrusion detection system. Expert Syst. Appl. 67, 296–303.
Aljawarneh, S., Aldwairi, M., Yassein, M.B., 2018. Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model. J. Comput. Sci. 25, 152–160.
Baldi, P., Sadowski, P., Lu, Z., 2018. Learning in the machine: random backpropagation and the deep learning channel. Artif. Intell. 260, 1–35.
Bottou, L., 2012. Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade. Springer, pp. 421–436.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Buczak, A.L., Guven, E., 2015. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 18 (2), 1153–1176.
Bugdary, S., Maymon, S., 2019. Online clustering by penalized weighted GMM. arXiv preprint: arXiv:1902.02544.
Chai, T., Draxler, R.R., 2014. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7 (3), 1247–1250.
Chundi, J., Rao, V.V.G., 2013. Role of feature reduction in intrusion detection systems for wireless attacks. Int. J. Eng. Trends Technol. 6 (5), 241–246.
Ergen, T., Kozat, S.S., 2017. Efficient online learning algorithms based on LSTM neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29 (8), 3772–3783.
Fiore, U., Palmieri, F., Castiglione, A., De Santis, A., 2013. Network anomaly detection with the restricted Boltzmann machine. Neurocomputing 122, 13–23.
Frey, B.J., Dueck, D., 2007. Clustering by passing messages between data points. Science 315 (5814), 972–976.
Gan, G., Ng, M.K.-P., 2015. Subspace clustering using affinity propagation. Pattern Recognit. 48 (4), 1455–1464.
Geurts, P., Ernst, D., Wehenkel, L., 2006. Extremely randomized trees. Mach. Learn. 63 (1), 3–42.
Gil, P., Martins, H., Januário, F., 2018. Outliers detection methods in wireless sensor networks. Artif. Intell. Rev. 1–26.
Gregorutti, B., Michel, B., Saint-Pierre, P., 2017. Correlation and variable importance in random forests. Stat. Comput. 27 (3), 659–678.
Han, K., Wang, Y., Zhang, C., Li, C., Xu, C., 2018. Autoencoder inspired unsupervised feature selection. In: Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 2941–2945.
Hang, W., Chung, F.-l., Wang, S., 2016. Transfer affinity propagation-based clustering. Inf. Sci. (Ny) 348, 337–356.
Hodo, E., Bellekens, X., Hamilton, A., Dubouilh, P.-L., Iorkyase, E., Tachtatzis, C., Atkinson, R., 2016. Threat analysis of IoT networks using artificial neural network intrusion detection system. In: Proceedings of the 2016 International Symposium on Networks, Computers and Communications (ISNCC). IEEE, pp. 1–6.
Hoz, E.D.L., Hoz, E.D.L., Ortiz, A., Ortega, J., Prieto, B., 2015. PCA filtering and probabilistic SOM for network intrusion detection. Neurocomputing 164 (Complete), 71–81.
Javaid, A., Niyaz, Q., Sun, W., Alam, M., 2016. A deep learning approach for network intrusion detection system. In: Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp. 21–26.
Kong, L., Huang, G., Wu, K., 2017. Identification of abnormal network traffic using support vector machine. In: Proceedings of the 2017 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). IEEE, pp. 288–292.
Lv, Y., Ma, T., Tang, M., Cao, J., Tian, Y., Al-Dhelaan, A., Al-Rodhaan, M., 2016. An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171, 9–22.
Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A., 2018. Kitsune: an ensemble of autoencoders for online network intrusion detection. In: Proceedings of NDSS. Internet Society. doi:10.14722/ndss.2018.23211.
Nakatsukasa, Y., Soma, T., Uschmajew, A., 2017. Finding a low-rank basis in a matrix subspace. Math. Program. 162 (1–2), 325–361.
Reddy, C.K., Vinzamuri, B., 2018. A survey of partitional and hierarchical clustering algorithms. In: Data Clustering. Chapman and Hall/CRC, pp. 87–110.
Roshan, S., Miche, Y., Akusok, A., Lendasse, A., 2018. Adaptive and online network intrusion detection system using clustering and extreme learning machines. J. Frankl. Inst. 355 (4), 1752–1779.
Sakurada, M., Yairi, T., 2014. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, p. 4.
Sandhu, S.S., Tripathy, B., Jagga, S., 2019. KMST+: a k-means++-based minimum spanning tree algorithm. In: Smart Innovations in Communication and Computational Sciences. Springer, pp. 113–127.
Sharafaldin, I., Habibi Lashkari, A., Ghorbani, A., 2018. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: The International Conference on Information Systems Security and Privacy (ICISSP), pp. 108–116. doi:10.5220/0006639801080116.
University of New Brunswick. CSE-CIC-IDS2018 on AWS dataset. https://www.unb.ca/cic/datasets/ids-2018.html.
Wang, C., Zheng, J., Li, X., 2017. Research on DDoS attacks detection based on RDF-SVM. In: Proceedings of the 2017 10th International Conference on Intelligent Computation Technology and Automation (ICICTA). IEEE, pp. 161–165.
Wang, C.-R., Xu, R.-F., Lee, S.-J., Lee, C.-H., 2018. Network intrusion detection using equality constrained-optimization-based extreme learning machines. Knowl. Based Syst. 147, 68–80.
Wang, S., Ding, Z., Fu, Y., 2017. Feature selection guided auto-encoder. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence.
Yousefi-Azar, M., Varadharajan, V., Hamey, L., Tupakula, U., 2017. Autoencoder-based feature learning for cyber security applications. In: Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 3854–3861.
Zhou, C., Paffenroth, R.C., 2017. Anomaly detection with robust deep autoencoders. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 665–674.
Zou, W., Yao, F., Zhang, B., Guan, Z., 2018. Back propagation convex extreme learning machine. In: Proceedings of ELM-2016. Springer, pp. 259–272.

XuKui Li is a master candidate at Nanjing University of Posts and Telecommunications, majoring in software engineering. His research interest is in the field of information security and machine learning.

Wei Chen received the Ph.D. degree from Wuhan University in 2005. He is currently a Professor at the Nanjing University of Posts and Telecommunications. His research interest is in the field of information security, including Web security, IoT security and mobile system security.

Qianru Zhang received her Master degree from the University of Hong Kong in 2019. Her research interest is in the field of artificial intelligence algorithms, including natural language processing and machine learning.

Lifa Wu received the Ph.D. degree from Nanjing University in 1998. He is currently a Professor at the Nanjing University of Posts and Telecommunications. His research interest is in the field of information security, including system security, software security and privacy protection.
