Received: 26 March 2021    Revised: 13 September 2021    Accepted: 11 November 2021

DOI: 10.1049/ise2.12051

ORIGINAL RESEARCH

IET Information Security

PCA mix-based Hotelling's T2 multivariate control charts for intrusion detection system

Mo Shaohui | Gulanbaier Tuerhong | Mairidan Wushouer | Tuergen Yibulayin

College of Information Science and Engineering, Xinjiang University, Urumqi, China

Correspondence: Gulanbaier Tuerhong, College of Information Science and Engineering, Xinjiang University, 830017 Urumqi, China. Email: gulam@xju.edu.cn

Funding information: Natural Science Foundation of the Autonomous Region, Grant/Award Numbers: 2021D01C118, 2018D01C075; The Autonomous Region university scientific research program, Grant/Award Numbers: XJEDU2017S006, XJEDU2018Y005; High-Level Innovative Talents Project of the Autonomous Region, Grant/Award Numbers: 100400016, 042419006; Doctoral startup fund of Xinjiang University, Grant/Award Numbers: 620312308, 620312310

Abstract
Most data in the field of network intrusion detection are high-dimensional mixtures of continuous and categorical variables, which easily leads traditional multivariate control charts to erroneous detection results. Hotelling's T2 multivariate control charts based on Principal Component Analysis mix (PCA mix) with a bootstrap control limit were proposed and applied to the network intrusion detection system. The proposed chart was compared with the conventional Hotelling's T2 control chart based on PCA, and the performance of the control limits obtained with the bootstrap method was compared to the ones calculated using the most commonly used kernel density estimation. The experimental results revealed that the proposed method had better performance in intrusion detection than its counterparts.

KEYWORDS
computer network security, data compression, industrial control, security of data

1 | INTRODUCTION

Network intrusion detection is a crucial link in a network security system, determining whether network intrusions can be detected. With the increasingly prominent problem of network security, intrusion monitoring and analysis of network data provide effective help to combat network crimes [1]. Moreover, network intrusion detection technologies can complete the analysis of network security without affecting network performance, and can respond proactively to prevent intrusion behaviour from damaging the network and ensure the security of network operation [2].

According to the technology used, intrusion detection can be divided into anomaly-based detection and feature-based detection [3]. In general, the detection methods adopted by researchers for feature-based intrusion detection are roughly divided into three types: data mining technology-based [4], machine learning methods-based [5–8] and statistical process control (SPC)-based detection [9]. Zhongjin Fang and Shu Zhou [10] establish a network intrusion detection system model based on data mining. This system reduces the detection time and strengthens the detection ability. P. Anitha and B. Kaarthick [11] proposed a data mining approach for an intrusion detection system, based on an Oppositional Laplacian grey wolf optimization algorithm with a higher detection rate. Ansam Khraisat et al. [12] propose a Hybrid IDS (HIDS) combining the C5 decision tree classifier and One Class Support Vector Machine (OC-SVM) to achieve high accuracy and low false alarm rates. Generally speaking, algorithms based on data mining and machine learning can achieve a good detection rate; however, they do not pay attention to the relationships between features [13]. Also, the data transmitted in the network usually has multiple features, which can be highly correlated with each other, and if this correlation is not considered, it might affect the intrusion detection performance of machine learning-based approaches.

The intrusion detection system based on SPC not only ensures real-time detection; its multivariate control chart, the most commonly used SPC tool, can also closely relate the features of the data, as it is designed for processes involving

This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial License, which permits use, distribution and reproduction in any medium, provided
the original work is properly cited and is not used for commercial purposes.
© 2021 The Authors. IET Information Security published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

IET Inf. Secur. 2022;16:161–177. wileyonlinelibrary.com/journal/ise2 161


17518717, 2022, 3, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/ise2.12051, Wiley Online Library on [07/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

highly correlated data [9]. A. Sánchez-Fernández et al. [14] use a new methodology based on time series models and multivariate statistical process control (MSPC) to improve the system's performance and reduce manufacturing costs. Ahsan Muhammad et al. [15] use a PCA mix-based T2 control chart in order to detect outliers.

Hotelling's T2 control chart was proposed by Hotelling in 1947. It is a multivariate control chart, which addresses the fact that single-variable control methods cannot account for the influence of multiple variables on each other [16]. Key studies in this area are presented hereafter: Ye et al. [17] analysed the abnormal fluctuation of intrusions using Hotelling's T2 control chart. Sivasamy and Sundan [13] further optimized the algorithm by combining Hotelling's T2 control charts with multivariate statistical analysis technology. Sparks [9] used Hotelling's T2 control chart to monitor the intrusion of a network. Ahsan et al. [18] proposed a Hotelling's T2 control chart based on PCA to monitor anomalies in the network. Nevertheless, the classical Hotelling's T2 control chart is not suitable for the data in a real network: it needs to detect both mean shifts and counter-relationships when IDS is carried out [19], and its performance in detecting a shift in a process may decrease because the data in a real network have a large number of quality characteristics [20]. In addition, it is desirable to achieve real-time intrusion detection, and the traditional Hotelling's T2 control chart cannot meet this requirement [21]. In this work, Hotelling's T2 control chart determines whether the observed data is in control by calculating its distance from historical normal data. When the distance between the observed data and the scaled mean value of historical normal data is less than a certain value, the observed data is judged to be in control; otherwise, it is out of control. The control limit CL, which is used to judge whether the observed data is in control, can be obtained by calculating the mean values and covariance matrix of historical normal data. This calculation can yield many candidate CLs; the detection results of the different CLs are then compared to select the best one.

The principal component analysis (PCA) mix algorithm combines PCA and multiple correspondence analysis (MCA), and reduces high-dimensional mixed data consisting of qualitative and quantitative variables into lower-dimensional data used for calculating the monitoring statistic T2 mix. In 1958, Kaiser [22] introduced the varimax criterion into PCA, and in 1991 Kiers [23] extended the maximum variance criterion to realize a simple structure in the PCA mix algorithm, which is known as the original PCA mix method of Henk A. L. Kiers. In 2012, Chavent et al. [24] further improved the PCA mix algorithm by introducing singular value decomposition, forming a PCA mix algorithm based on SVD. In 2017, Chavent et al. [25] also implemented multivariate analysis methods: PCA mix, PCA rot (rotation in PCA mix) and MFA mix (multi-factor analysis of mixed data in datasets). PCA can deal with numerical variables, MCA can handle categorical variables, and PCA mix can simultaneously handle both categorical and continuous variables. When the variables in the dataset are all numerical, the PCA mix algorithm is actually the standard PCA algorithm; when the dataset contains mixed variables, the numerical part continues to be processed by the standard PCA algorithm, and the discrete part is processed by the MCA algorithm. The PCA mix algorithm performs a generalized singular value decomposition (GSVD) of the data.

Ahsan et al. [21] proposed a multivariate control chart based on PCA mix with control limits established by kernel density estimation (KDE). The proposed chart demonstrated great performance by successfully detecting more outliers [15]. However, the KDE method needs more custom parameters. In response, the bootstrap method, which offers similar performance, is employed in this work [26]. The bootstrap method provides a non-parametric way of determining the decision boundary of the proposed approach. The results from experiments on five publicly available intrusion detection datasets show that our approach produced better performance in accuracy and practicability than its counterparts. In 1979, Efron [27] created the bootstrap method for calculating confidence intervals. The bootstrap method poses a real-time internal test problem for each statistical subgroup listed, so it has become one of the common tools in statistical error estimation and hypothesis testing. The bootstrap method assumes that the sample data x1, x2, …, xn are drawn from a population X with distribution Fn. Given x, construct the estimate F̂n, and then regenerate a batch of random variables from F̂n: X∗ = x∗1, x∗2, …, x∗n. Let R(X, F) be a preselected random variable and use the distribution of R∗ = R(X∗, F̂n) to approximate the distribution of R(X, F); R∗ is then called the bootstrap distribution. In recent years, confidence interval estimation based on the bootstrap has commonly used the following four methods: standard bootstrap (SB), percentile bootstrap (PB), percentile-t bootstrap (PTB), and bias-corrected percentile bootstrap (BCPB). This work adopts the PB method.

In this work, a network intrusion detection system is built based on the PCA mix algorithm, Hotelling's T2 control chart and the bootstrap method. The goal is to obtain higher accuracy. Our paper makes the following contributions:

• Proposed a PCA mix-based Hotelling's T2 control chart with the bootstrap method, and compared it with the conventional PCA-based Hotelling's T2 control chart.
• Compared the performance of the control limits obtained with the bootstrap method to the ones calculated using the most commonly used KDE.
• Extended the application of the proposed PCA mix-based Hotelling's T2 control chart to the network intrusion detection system and evaluated its performance using five publicly available intrusion detection datasets.

The rest of this paper is organized as follows: Section 2 presents the charting procedures. Section 3 describes the datasets, comparison algorithms and performance metrics. Section 4 analyses and compares the experimental results. Last but not least, Section 5 is allocated to the conclusion and future development.

2 | CHARTING PROCEDURES

2.1 | Multivariate Hotelling's T2 control chart with PCA mix

In this work, the PCA mix algorithm is used to reduce the dimension of the data, and then the multivariate Hotelling's T2 control chart is trained on the reduced-dimension data. The PCA mix algorithm performs a generalized singular value decomposition (GSVD) of the data. The specific steps are as follows.

• Let m denote the number of observation units, a1 the number of continuous variables, a2 the number of discrete variables, A1 the m × a1 continuous matrix and A2 the m × a2 discrete matrix.
• Let A be the m × (a1 + a2) matrix that combines the normalized matrix A1 and the normalized matrix A2: A = [A1, A2].
• Build a diagonal matrix B, the weights matrix of the rows of A. The b rows are weighted by 1/b, such that B = (1/b) I_b.
• Build a diagonal matrix C, the weights matrix of the columns of A. The first a1 columns are weighted by 1 (as in the Euclidean distance calculation of PCA). The last a2 columns are weighted by b/b_j (as in the weighted distance represented by the χ2 distance in MCA), where j = 1, 2, ..., a2. That is, C = diag(1, 1, ..., 1, b/b_1, ..., b/b_{a2}).
• The GSVD of A gives the decomposition

A = Y Λ Z'  (1)

where Λ is the diagonal matrix of the square roots of the eigenvalues of matrix A. Let r represent the rank of matrix A; then Y is the m × r matrix of eigenvectors of A and Z is the (a1 + a2) × r matrix of eigenvectors of A.

The eigenvector matrix Y can be expressed as follows:

Y_mix = A C Z  (2)

• or directly computed from the GSVD decomposition as

Y_mix = Y Λ  (3)

• The eigenvector matrix Z can be expressed as

Z_mix = C Z Λ  (4)

where the matrix Z_mix is divided up as follows: Z_mix,1 contains the factor scores of the a1 continuous variables and Z_mix,2 contains the factor scores of the a2 discrete variables.

• Calculate the mean value α of the matrix Z_mix after dimension reduction, using n to represent the dimension of Z_mix:

α = (m1 + m2 + ⋯ + mn) / n  (5)

• The covariance matrix ϕ and the T2 statistic can be calculated as follows:

ϕ = [(m1 − α)(m1 − α)' + (m2 − α)(m2 − α)' + ⋯ + (mm − α)(mm − α)'] / (m − 1)  (6)

T2 = m (mi − α)' ϕ⁻¹ (mi − α)  (7)

2.2 | Proposed algorithm with bootstrap control limit

Using PB technology and removing the extreme values below 1/j and above 1 − 1/j, j groups of decision boundaries are obtained and stored in the array C = {c1, c2, ..., cj}. In this work, 1000 groups of decision boundaries are selected, so the extreme values below 0.1% and above 99.9% are removed.

CL_PB = (1/j) Σ_{i=1}^{j} c_i = T2(j × (1 − α))  (8)

2.3 | Experimental protocol

This study presents an intrusion detection method based on a non-parametric multivariate control chart, which can realize real-time monitoring in a real network. The establishment of the monitoring system is mainly divided into two steps: data preprocessing and construction of the multivariate control chart.

Figure 1 shows the process of intrusion detection using multivariate control charts. The first step is to read the datasets and preprocess them. In this step, the dataset should be organized to better carry out the experiments, generally through extracting features, transforming eigenvalues, selecting parameters and handling missing values.

Then, the method of dimension reduction is chosen; this experiment chooses the PCA mix algorithm to reduce the dimension, and the results are compared with those of the PCA algorithm.
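The reduction and charting steps of Sections 2.1 and 2.2 can be sketched in code. This is an illustrative reading of the procedure, not the authors' implementation: the dummy-coding and chi-square weighting of the categorical block, the component count, and the resampling settings are assumptions made for the example.

```python
import numpy as np

def pcamix_scores(A1, A2_labels, n_components):
    """PCA mix factor scores (Section 2.1): standard PCA treatment for the
    continuous block A1, MCA-style chi-square weighting for the categorical
    block A2, then one SVD of the combined weighted matrix."""
    m = A1.shape[0]
    blocks = [(A1 - A1.mean(axis=0)) / A1.std(axis=0)]      # standardized continuous part
    for j in range(A2_labels.shape[1]):
        col = A2_labels[:, j]
        cats = np.unique(col)
        G = (col[:, None] == cats[None, :]).astype(float)   # indicator (dummy) matrix
        p = G.mean(axis=0)                                  # category frequencies b_j / b
        blocks.append((G - p) / np.sqrt(p))                 # chi-square column weights
    M = np.hstack(blocks) / np.sqrt(m)                      # row weights 1/m folded in
    U, s, Vt = np.linalg.svd(M, full_matrices=False)        # plays the role of the GSVD
    return np.sqrt(m) * U[:, :n_components] * s[:n_components]

def t2_scores(scores):
    """Hotelling's T2 for each observation (Equations 5-7)."""
    alpha = scores.mean(axis=0)                  # mean vector of the reduced data
    phi = np.cov(scores, rowvar=False)           # covariance matrix, divisor m - 1
    d = scores - alpha
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(phi), d)

def bootstrap_cl(t2, n_boot=1000, alpha=0.01, seed=None):
    """Percentile-bootstrap control limit (Section 2.2): resample the
    in-control T2 values, take the (1 - alpha) quantile of each resample,
    trim the extreme 0.1% at both ends and average the rest."""
    rng = np.random.default_rng(seed)
    cls = np.sort([np.quantile(rng.choice(t2, size=t2.size), 1 - alpha)
                   for _ in range(n_boot)])
    trim = max(1, n_boot // 1000)
    return float(cls[trim:-trim].mean())
```

In Phase II, a connection whose T2 score exceeds the returned limit would be flagged as out of control, that is, as an intrusion.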

FIGURE 1 Flow of the data chart during programme execution

Then, Hotelling's T2 control chart is trained on the trained connection data.

Next, we use the bootstrap method to calculate the decision boundary and compare the result with that of KDE. The bootstrap method is used to set the control limit, so that the best CL is obtained during monitoring and the network is better monitored in real time. KDE is a traditional method for setting the control limit in intrusion detection.

Finally, the proposed method is compared with other techniques from the literature.

The proposed algorithm can be divided into the following two stages:

2.4 | Phase I: building the normal profile

Step 1 The trained connection data (clean data) matrix Y is obtained by cleaning the data of the training matrix.

Step 2 Calculate Z, the dimension-reduced trained connection data Y, using the PCA mix algorithm in Section 2.1.

Step 3 Calculate W and T2, which are the covariance matrix and robust mean vector of the reduced-dimension matrix Z from Step 2.

Step 4 Use CL as in Section 2.2 to calculate the bootstrap control limit by α.

2.5 | Phase II: test

This phase uses the values of Z, W, T2 and CL from Phase I. The detailed steps are as follows:

Step 1 The new trained connection data (clean data) matrix Ytest is obtained by cleaning the data of the training matrix.

Step 2 Calculate T2test from the new trained connection data Ytest using the W and T2 taken from Phase I.

Step 3 If T2test > CL, then the connection is an intrusion; if T2test < CL, then the connection is normal: update Ytest, and recalculate W and T2.

3 | EXPERIMENTAL STUDY

3.1 | Datasets

3.1.1 | KDDCUP99 dataset

The KDDCUP99 dataset is the most common dataset in intrusion detection. The KDDCUP99 dataset is divided into a labelled training dataset and an unlabelled test dataset. There are 4,898,461 samples in the training dataset, which can be divided into two types: normal and attack. Attacks can further be divided into four types: DOS, Probe, R2L and U2R. In this work, 10% of the original training dataset, namely the kddcup.data_10_percent file, is selected as the experimental dataset.

The dataset is a high-dimensional mixture with 41 attributes, including 32 continuous attributes and 9 discrete attributes. In the training dataset, the normal data is about 19.85% and the rest is attack data; the ratio between attack data and normal data is nearly 4:1. Not only is the dataset extremely unbalanced overall, but the proportions of the various attack types are also extremely unbalanced, as shown in Table 1.

3.1.2 | NSL_KDD dataset

The NSL_KDD dataset was created to solve some problems of the KDDCUP99 dataset. This dataset adjusts the proportion of positive and negative samples of the KDDCUP99 dataset, makes the number of samples in the training dataset and the

test dataset more reasonable, and removes redundant data from the KDDCUP99 dataset. Although it still cannot reflect the real network environment perfectly, it can be used as a benchmark dataset for different intrusion detection methods in the absence of a public network-based intrusion detection dataset. In this work, the kddtrain+ file processed from the original dataset is selected as the experimental dataset.

The dataset is a high-dimensional mixture with 41 attributes, including 32 continuous attributes and 9 discrete attributes. In the training dataset, the normal data is about 53.45% and the rest is attack data; not only is the split between attack data and normal data unbalanced, but within the attack data the proportions of the various attack types are also extremely unbalanced, as shown in Table 2.

TABLE 1 Characteristics of the KDDCUP99 dataset

Class    Training size    Percent (%)
Normal   97,277           19.69
DOS      391,458          79.24
Probe    4107             0.83
U2R      1126             0.23
R2L      52               0.01
Total    494,021          100

TABLE 2 Characteristics of the NSL-KDD dataset

Class    Training size    Percent (%)
Normal   67,334           53.45
DOS      45,927           36.46
Probe    11,656           9.25
U2R      995              0.79
R2L      52               0.05
Total    125,973          100

3.1.3 | UNSW-NB15 dataset

The UNSW-NB15 dataset is a new dataset generated by the Australian Centre for Cyber Security (ACCS) in a simulated network environment in 2015. The dataset includes modern normal data and nine types of attack traffic data. The UNSW-NB15 dataset contains new types of covert attack, which can fully represent the situation in a real network environment. There are 82,332 samples in the UNSW-NB15 training dataset, which can be divided into two types: normal and attack. Attacks can further be divided into nine types: Fuzzers, Analysis, Backdoors, DOS, Exploits, Generic, Reconnaissance, Shellcode and Worms. In this work, the UNSW-NB15 training dataset obtained after processing the original dataset is selected as the experimental dataset.

The dataset also belongs to the high-dimensional mixtures, with 43 attributes, including 39 continuous attributes and 4 discrete attributes. In the training dataset, the normal data is about 44.94% and the rest is attack data; not only is the split between attack data and normal data unbalanced, but within the attack data the proportions of the various attack types are also extremely unbalanced, as shown in Table 3.

TABLE 3 Characteristics of the UNSW-NB15 dataset

Class           Training size    Percent (%)
Normal          37,000           44.94
Generic         18,871           22.92
Exploits        11,132           13.52
Fuzzers         6062             7.36
DOS             4089             4.97
Reconnaissance  3496             4.25
Analysis        677              0.82
Backdoors       583              0.71
Shellcode       378              0.46
Worms           44               0.05
Total           82,332           100

3.1.4 | ISCX-URL2016 dataset

The ISCX-URL2016 dataset is a tagged dataset of URL traffic released by the University of New Brunswick (UNB) in 2016. The dataset contains benign URLs and four kinds of malicious URLs: Spam URLs, Phishing URLs, Malware URLs and Defacement URLs. The ISCX-URL2016 dataset was crawled from various websites one by one, representing real-world network traffic [28]. This work selects the 'All' file as the experimental dataset.

There are 79 features in the ISCX-URL2016 dataset, including 10 discrete features and 69 continuous features. The benign traffic URLs are about 21.20% of the training data, and the rest are malicious traffic URLs. The number of benign traffic URLs is much smaller than that of malicious traffic URLs; they are extremely unbalanced, with a ratio of 1:4. However, within the malicious traffic URLs, the proportions of the various malicious URL types are relatively balanced, as shown in Table 4.

TABLE 4 Characteristics of the ISCX-URL2016 dataset

Class       Training size    Percent (%)
Benign      7781             21.20
Spam        6698             18.25
Phishing    7586             20.67
Malware     6712             18.28
Defacement  7930             21.60
Total       36,707           100
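As a quick sanity check, the Percent (%) columns in these characteristics tables are simply each class count divided by the dataset total; recomputing the Table 3 (UNSW-NB15) column:

```python
# Recompute the Percent (%) column of Table 3 (UNSW-NB15) from the class counts.
counts = {"Normal": 37000, "Generic": 18871, "Exploits": 11132, "Fuzzers": 6062,
          "DOS": 4089, "Reconnaissance": 3496, "Analysis": 677, "Backdoors": 583,
          "Shellcode": 378, "Worms": 44}
total = sum(counts.values())
percent = {cls: round(100 * n / total, 2) for cls, n in counts.items()}
print(total, percent["Normal"], percent["Worms"])  # 82332 44.94 0.05
```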



3.1.5 | TON_IoT dataset

The TON_IoT dataset is a new Internet of Things (IoT) and Industrial IoT (IIoT) dataset, collected from a realistic and large-scale network in 2020. The network was designed at the Cyber Range and IoT Labs at UNSW Canberra (Australia) to mimic the complexity and scalability of industrial IoT and Industry 4.0 networks. The dataset contains normal data and 7 types of attack data: backdoor, ddos, ransomware, injection, xss, password and scanning [29]. This work selects the 'Train_Test_Windows 7' file as the experimental dataset.

The dataset also belongs to the high-dimensional mixtures, with 132 attributes, including 120 continuous attributes and 12 discrete attributes. In the training dataset, the normal data is about 62.58% and the rest is attack data; not only is the split between attack data and normal data unbalanced, but within the attack data the proportions of the various attack types are also extremely unbalanced, as shown in Table 5.

TABLE 5 Characteristics of the TON_IoT dataset

Class       Training size    Percent (%)
Normal      10,000           62.58
DDOS        2134             13.35
Backdoor    1779             11.13
Injection   998              6.25
Password    757              4.74
Scanning    226              1.41
Ransomware  82               0.51
XSS         4                0.03
Total       15,980           100

TABLE 6 Confusion matrix

Real marker      Predicted positive     Predicted negative
Positive sample  True positive (TP)     False negative (FN)
Negative sample  False positive (FP)    True negative (TN)

TABLE 7 Features of datasets

Dataset        Class       Num    Total
KDDCUP99       Continuous  30     38
               Discrete    8
NSL-KDD        Continuous  30     38
               Discrete    8
UNSW-NB15      Continuous  38     41
               Discrete    3
ISCX-URL2016   Continuous  56     66
               Discrete    10
TON_IoT        Continuous  75     76
               Discrete    1

TABLE 8 Eigenvalue transformation

Dataset     Feature's name   Eigenvalue transformation
KDDCUP99    protocol_type    tcp = 1; udp = 2; icmp = 3
            flag             OTH = 1; REJ = 2; RSTO = 3; RSTOS0 = 4; RSTR = 5; S0 = 6; S1 = 7; S2 = 8; S3 = 9; SF = 10; SH = 11
            sign             normal = 0; others = 1
NSL-KDD     protocol_type    tcp = 1; udp = 2; icmp = 3
            flag             OTH = 1; REJ = 2; RSTO = 3; RSTOS0 = 4; RSTR = 5; S0 = 6; S1 = 7; S2 = 8; S3 = 9; SF = 10; SH = 11
            sign             normal = 0; others = 1
UNSW-NB15   proto            tcp = 1; udp = 2; icmp = 3
            service          http = 1; ftp = 2; smtp = 3; ssh = 4; dns = 5; ftp-data = 6; irc = 7; - = 8; others = 9
            state            ACC = 1; CLO = 2; CON = 3; ECO = 4; ECR = 5; FIN = 6; INT = 7; MAS = 8; PAR = 9; REQ = 10; RST = 11; TST = 12; TXD = 13; URH = 14; URN = 15; - = 16
            label            normal = 0; others = 1

3.2 | Comparison algorithms

3.2.1 | PCA

PCA is a statistical method that extracts a new set of uncorrelated variables by orthogonal transformation [30]. The extracted variables are known as principal components (PCs); the coefficients of the variables can be obtained from the eigenvectors of the covariance or correlation matrix of the input data [31].

PCA can reduce the number of features, speed up the calculation and eliminate the multicollinearity problem in the process, and can also be used for feature reduction [32] and feature selection [33]. PCA is widely used in network anomaly monitoring. For example, Wang et al. [34] developed an intrusion detection method based on PCA with fast computing speed and high efficiency. Moreover, PCA can be combined with machine learning methods, such as support vector machines [35], decision tree

TABLE 9 Characteristics of datasets

Dataset        Method    Dimension    TP      FN      FP      TN      Hit (%)
KDDCUP99       No        –            29,260  495     9071    20,113  83.77
               PCA       6            29,649  106     1482    27,702  97.31
               PCA mix   8            29,438  317     250     28,934  99.04
NSL_KDD        No        –            17,728  2206    1124    19,079  91.70
               PCA       7            19,541  393     8331    11,872  78.26
               PCA mix   11           18,469  1465    1308    18,895  93.09
UNSW-NB15      No        –            7       17,894  14      16,786  48.39
               PCA       7            4005    13,896  1804    14,996  54.76
               PCA mix   12           13,058  4843    2417    14,383  79.08
ISCX-URL2016   No        –            2086    1219    249     950     67.41
               PCA       7            1832    561     503     1608    76.38
               PCA mix   11           2061    1172    274     997     67.90
TON_IoT        No        –            2810    190     2275    772     59.24
               PCA       4            2320    680     267     2780    84.34
               PCA mix   15           2793    207     230     2817    92.77

FIGURE 2 Density of different kernel functions (Gaussian, rectangular, triangular, Epanechnikov, biweight, cosine, optcosine). (a) KDDCUP99, (b) NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016 and (e) TON_IoT
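Figure 2 compares the kernel shapes used for KDE-based control limits. The following is a minimal sketch of such a density estimate, with two of the kernels from the figure (Gaussian and Epanechnikov) implemented directly; the bandwidth and the synthetic T2 sample below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def kde(grid, samples, h, kernel):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h), evaluated on a grid."""
    u = (grid[:, None] - samples[None, :]) / h
    return kernel(u).mean(axis=1) / h

gaussian = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
epanechnikov = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

rng = np.random.default_rng(0)
t2 = rng.chisquare(df=4, size=500)          # stand-in for in-control T2 statistics
grid = np.linspace(0, t2.max() + 3, 300)
f_gauss = kde(grid, t2, h=0.8, kernel=gaussian)
f_epan = kde(grid, t2, h=0.8, kernel=epanechnikov)
```

A KDE-based control limit is then the point on the grid where the estimated upper-tail mass drops below the chosen false-alarm rate; the bootstrap limit of Section 2.2 avoids having to choose h and the kernel.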

FIGURE 3 Type I and type II error rates of two different dimension reduction methods (PCA vs. PCA mix). (a) KDDCUP99, (b) NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016 and (e) TON_IoT
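The type I / type II trade-off plotted in Figure 3 can be traced by sweeping the control limit across the T2 scores of normal and attack connections. The scores below are synthetic stand-ins for illustration, not the paper's data:

```python
import numpy as np

def error_rates(t2_normal, t2_attack, limits):
    """Type I rate = normal connections flagged (false alarms);
    type II rate = attack connections passed (misses), per candidate CL."""
    type1 = np.array([(t2_normal > cl).mean() for cl in limits])
    type2 = np.array([(t2_attack <= cl).mean() for cl in limits])
    return type1, type2

rng = np.random.default_rng(0)
t2_normal = rng.chisquare(df=4, size=1000)         # in-control T2 (illustrative)
t2_attack = rng.chisquare(df=4, size=1000) + 6.0   # shifted out-of-control T2
limits = np.linspace(0.0, 30.0, 100)
type1, type2 = error_rates(t2_normal, t2_attack, limits)
```

Raising CL trades false alarms for misses, which is exactly the curve shape in Figure 3.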

[36], naive Bayes [37] and so on. For instance, Gharsellaoui et al. [38] used multi-scale principal component analysis (MSPCA) to obtain higher classification accuracy.

3.2.2 | KDE

In probability theory, in order not to put forward all kinds of hypotheses before the experiment, which could make the actual results differ greatly from the theoretical results, non-parametric estimation methods were proposed. KDE is one of these methods, which can study the distribution directly from the data itself.

KDE borrows its intuitive approach from the familiar histogram, which offers a method to smoothly aggregate data points. Assume that X = {x1, x2, ..., xn} is the sample data of n points collected, and these data are independent random variables. A narrow-band Gaussian kernel, f̂_j, is introduced. The KDE can be expressed as follows:

f̂_j(x) = (1/n) Σ_{i=1}^{n} K_j(x − x_i) = (1/(nj)) Σ_{i=1}^{n} K((x − x_i)/j)  (9)

where K_j is the scaled kernel; j is the bandwidth, which is greater than zero and can be used to adjust the smoothness of the KDE; K is the kernel function, which satisfies the properties of a probability density: it is non-negative, its integral is one and its mean value is zero; x_i is the ith sample point.

3.3 | Performance metrics

The confusion matrix is obtained by comparing the prediction results of the test samples with the actual results. The confusion matrix is a combination of statistical real markers and prediction results.

When the real marker is a positive sample and the prediction is also a positive sample, it is called a true positive, and TP is used to represent the number of samples in this case. When the real marker is a positive sample and the prediction result is a negative

F I G U R E 4 P–R curve of two different


dimension reduction methods. (a) KDDCUP99, (b)
(a) (b)
NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016

1
and (e) TON_IoT

precision rate/%

precision rate/%
0.8

0.8
PCA PCA

0.6

0.6
PCA mix PCA mix

0.4

0.4
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
recall rate/% recall rate/%
(c) (d)

1
precision rate/%

precision rate/%
0.8

0.8
PCA PCA
0.6

0.6
PCA mix PCA mix
0.4

0.4
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
recall rate/% recall rate/%
(e)
1
precision rate/%
0.4 0.6 0.8

PCA
PCA mix

0 0.2 0.4 0.6 0.8 1


recall rate/%

T A B L E 10 F1 value is called true negative, and TN is used to represent the number of


Dataset Method F1 samples in this case. When the real marker is a negative sample
and the predicted one is a positive sample, it is called false pos-
KDDCUP99 PCA 0.8078808
itive, and FP is used to represent the number of samples in this
PCA mix 0.9902969 case. Its form is shown in Table 6. The sum of the number of
NSL_KDD PCA 0.6693305 samples in these four cases is the total number of samples.
By the confusion matrix, we can calculate P, R, F1 and Hit
PCA mix 0.9316372
of the classifier.
UNSW‐NB15 PCA 0.6518637
TP
PCA mix 0.7984789 P¼ ð10Þ
T P þ FP
ISCX‐URL2016 PCA 0.7191962
PCA mix 0.7415691 TP
R¼ ð11Þ
TON_IoT PCA 0.7039397 T P þ FN
PCA mix 0.9274448 2PR 2 � TP
F1 ¼ ¼ ð12Þ
P þ R 2T P þ FN þ FP
sample, it is called false negative, and FN is used to represent the
number of samples in this case. When the real marker is a TP þ TN
Hit ¼ ð13Þ
negative sample and the predicted one is also a negative sample, it T P þ FN þ FP þ T N
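As a quick numerical check of Equations (10)–(13), the following sketch (illustrative Python, not code from the paper) recomputes the metrics from raw confusion-matrix counts, using the KDDCUP99 bootstrap row reported later in Table 11:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Precision, recall, F1 and hit rate from confusion-matrix counts,
    following Equations (10)-(13)."""
    p = tp / (tp + fp)                       # Equation (10)
    r = tp / (tp + fn)                       # Equation (11)
    f1 = 2 * p * r / (p + r)                 # Equation (12), equals 2*tp/(2*tp+fn+fp)
    hit = (tp + tn) / (tp + fn + fp + tn)    # Equation (13)
    return p, r, f1, hit

# KDDCUP99 / bootstrap row of Table 11
p, r, f1, hit = confusion_metrics(tp=29438, fn=317, fp=250, tn=28934)
print(round(100 * hit, 2))  # 99.04, matching the Hit (%) column
```

The remaining Hit (%) entries of Table 11 can be reproduced the same way from their TP/FN/FP/TN counts.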
FIGURE 5 ROC curve of two different dimension reduction methods. (a) KDDCUP99, (b) NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016 and (e) TON_IoT

3.4 | Data preprocessing

In this approach, before dimensional reduction, data preprocessing mainly includes four parts: extracting features, transforming eigenvalues, selecting parameters and handling missing values.

1) Extracting features: After removing the features that have a large number of identical or missing values and do not affect the final result, there are 8 discrete and 30 continuous features in the KDDCUP99 dataset, 8 discrete and 30 continuous features in the NSL_KDD dataset, 3 discrete and 38 continuous features in the UNSW-NB15 dataset, 10 discrete and 56 continuous features in the ISCX-URL2016 dataset, and 1 discrete and 75 continuous features in the TON_IoT dataset (Table 7).

2) Transforming eigenvalues: To facilitate calculation, all features are converted into numbers. In this step, only discrete features need to be converted into character types and continuous features need to be standardized, because the PCA mix method can directly reduce the dimension of high-dimensional mixture data. The specific situation is shown in Table 8. The data in the ISCX-URL2016 and TON_IoT datasets are numerical values and do not need to be converted.

3) Selecting parameters: Through the confusion matrix, the hit rate on each dataset is compared. Finally, the dimension reduction of the datasets is integrated, and the best dimensions are shown in Table 9. Taking different K values, such as k = 3, 5, 7, ..., 19, and comparing their F1 values, we finally choose k = 5 in the OCKNN algorithm.

Different kernel functions (Figure 2), such as the Gaussian, triangular, rectangular, Epanechnikov, biweight, cosine and optcosine kernels, and the corresponding parameters are taken. By comparing their density curves, we finally obtain the best classification effect when the kernel function is Epanechnikov and the bandwidth is 0.001457.

4) Handling missing values: In the process of data collection, some data are missing because of machine failure or human factors. When operating on data with missing values, we first need to handle them to avoid noise that would affect the overall data. There are many ways to deal with missing values, and different actual situations call for different treatments.

This work uses the method of mean filling: when the column containing the missing value is a continuous feature, average the other data in that column and fill the missing part with the average value. When the column containing the missing value is a discrete feature, find the mode of the other data in that column and fill the missing part with the mode.

FIGURE 6 AUC value of two different dimension reduction methods
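The transforming and filling steps described above can be sketched in a few lines. This is an illustrative stand-alone implementation (the function and variable names are ours, not the paper's): missing entries are filled with the column mean (continuous) or mode (discrete), discrete columns are then mapped to integer codes, and continuous columns are z-score standardized.

```python
from statistics import mean, mode, pstdev

def preprocess(rows, discrete_idx):
    """Mean/mode filling followed by integer coding and z-score
    standardisation for a mixed continuous/categorical sample.
    `rows` is a list of feature lists; missing entries are None;
    `discrete_idx` marks the categorical columns."""
    cols = [list(col) for col in zip(*rows)]
    for j, col in enumerate(cols):
        observed = [v for v in col if v is not None]
        # mean filling for continuous columns, mode filling for discrete ones
        fill = mode(observed) if j in discrete_idx else mean(observed)
        cols[j] = [fill if v is None else v for v in col]
        if j in discrete_idx:
            codes = {v: k for k, v in enumerate(sorted(set(cols[j])))}
            cols[j] = [codes[v] for v in cols[j]]
        else:
            m, s = mean(cols[j]), pstdev(cols[j])
            cols[j] = [(v - m) / s if s else 0.0 for v in cols[j]]
    return [list(r) for r in zip(*cols)]

sample = [["tcp", 0.1], ["udp", None], ["tcp", 0.9], [None, 0.5]]
print(preprocess(sample, discrete_idx={0}))
```

After this step each column is numeric, so the mixed sample can be handed to the dimension reduction stage.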

4 | DISCUSSION

4.1 | The comparison to other approaches of control charts

4.1.1 | The dimensional reduction

In this section, we use the KDDCUP99, NSL_KDD, UNSW-NB15, ISCX-URL2016 and TON_IoT datasets for the experiment, and compare the results of different dimensional reduction methods on the sample datasets. The three methods are no dimensional reduction, PCA-based dimensional reduction and PCA mix-based dimensional reduction. The bootstrap method is used to compute the control limit in all three cases. The specific situation is shown in Table 9.

Comparing the accuracy of the different dimensionality reduction methods across the datasets, the highest accuracy on most datasets is achieved with the PCA mix algorithm: the KDDCUP99 dataset reaches 99.04% when the reduced dimension is 8; the NSL_KDD dataset reaches 93.09% at dimension 11; the UNSW-NB15 dataset reaches 79.08% at dimension 12; the ISCX-URL2016 dataset reaches 66.92% at dimension 11; and the TON_IoT dataset reaches 92.77% at dimension 15. Therefore, the best dimensional reduction method is the PCA mix algorithm. In this experiment, the reduced dimension with the highest accuracy was selected, that is, the reduced dimensions of the PCA method were kept the same as those of the PCA mix algorithm, and for the five datasets they were 8, 11, 12, 11 and 15, respectively (Figure 3).

Figure 4 shows the P–R curves, where a straight line represents the result of PCA-based dimensional reduction and a line segment represents the result of PCA mix-based dimensional reduction. As Figure 4 shows, the P–R curves intersect each other on each of the five datasets, so the classification effect is compared through their F1 values.

It can be seen from Table 10 that the maximum F1 after dimension reduction with the PCA mix algorithm is slightly greater than the maximum F1 after dimension reduction with PCA: the difference in F1 is 18.21847% on the KDDCUP99 dataset, 26.23067% on NSL_KDD, 14.66152% on UNSW-NB15, 2.2379% on ISCX-URL2016, and 22.35051% on TON_IoT. Combining the five datasets, it can be concluded that dimension reduction with the PCA mix algorithm is slightly better.

In the ROC curves of Figure 5, the straight line represents the ROC curve after PCA dimension reduction, and the line segment represents the ROC curve after PCA mix dimension reduction. The two curves in Figure 5 cross, so it is impossible to compare the classification effect of the two methods directly, and we instead draw a conclusion by comparing the AUC values.

Through the comparison of AUC values shown in Figure 6, it can be intuitively concluded that the effect of PCA mix dimension reduction is better than that of PCA dimension reduction.
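For a concrete picture of what a PCA mix-style reduction does, here is a rough stand-in in Python. It is not the PCAmixdata algorithm used in the paper [24, 25], which combines PCA with multiple correspondence analysis; it simply z-scores the continuous block, one-hot encodes the categorical block, and projects the combined matrix onto its leading principal components:

```python
import numpy as np

def pca_mix_sketch(X_cont, X_cat, n_components):
    """Rough stand-in for the PCA mix idea (not the PCAmixdata algorithm):
    z-score the continuous block, one-hot encode the categorical block,
    centre the combined matrix and project it onto its top principal
    components via SVD."""
    Xc = (X_cont - X_cont.mean(axis=0)) / X_cont.std(axis=0)
    onehot = [
        (X_cat[:, [j]] == np.unique(X_cat[:, j])).astype(float)
        for j in range(X_cat.shape[1])
    ]
    Z = np.hstack([Xc] + onehot)
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T  # principal-component scores

X_cont = np.array([[0.1, 1.0], [0.5, 2.0], [0.9, 3.0], [0.3, 4.0]])
X_cat = np.array([["tcp"], ["udp"], ["tcp"], ["icmp"]])
print(pca_mix_sketch(X_cont, X_cat, 2).shape)  # (4, 2)
```

In the experiments above, the number of components kept per dataset (8, 11, 12, 11 and 15) plays the role of `n_components`.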
TABLE 11 Confusion matrix

Dataset         Method     Dimension   TP       FN     FP     TN       Hit (%)
KDDCUP99        bootstrap  8           29,438   317    250    28,934   99.04
                kde        10          29,353   402    208    28,976   98.97
NSL_KDD         bootstrap  11          18,469   1465   1308   18,895   93.09
                kde        10          17,240   2694   1181   19,022   90.35
UNSW-NB15       bootstrap  12          13,058   4843   2417   14,383   79.08
                kde        9           14,837   3064   4135   12,665   79.25
ISCX-URL2016    bootstrap  11          2122     1263   213    906      67.23
                kde        11          2172     1561   163    608      61.72
TON_IoT         bootstrap  15          2793     207    230    2817     92.77
                kde        15          2831     169    301    2746     92.23

FIGURE 7 Sample classification of two different control limit methods. (a) KDDCUP99, (b) NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016 and (e) TON_IoT
4.1.2 | The control limit method

In this section, the KDDCUP99, NSL_KDD, UNSW-NB15, ISCX-URL2016 and TON_IoT datasets are used. With the PCA mix algorithm used for dimension reduction in every case, the bootstrap and KDE methods are used to compute the control limit and are compared. The specific situation is shown in Table 11.

Under the same dimension reduction, the highest accuracy on the five datasets is generally obtained with the bootstrap control limit: the KDDCUP99 dataset reaches 99.04%, the NSL_KDD dataset 93.09%, the ISCX-URL2016 dataset 67.23%, and the TON_IoT dataset 92.77%. The accuracy on the UNSW-NB15 dataset is higher at the KDE control limit, reaching 79.25%, whereas at the bootstrap control limit the highest accuracy reaches 79.08%, only 0.17% lower. Therefore, across the five datasets, the bootstrap control limit method is relatively better.

In the scatter diagrams of Figure 7, the hollow circles represent the positive samples in the test set, the multiplication signs represent the negative samples, the solid horizontal line represents the best control limit calculated by the bootstrap method, and the dotted horizontal line represents the best control limit calculated by the KDE method. Although the difference is not obvious, it can be seen that the optimal control limit calculated by the bootstrap method, except on the last dataset, separates the data more clearly than the one calculated by the KDE method.

The better control limit method can also be identified by comparing the two types of error probability curves. As shown in Figure 8, the line segments represent the two types of error probability curves at the bootstrap control limit, and the combinations of points and line segments represent those at the KDE control limit. Figure 8 shows that, irrespective of the dataset, the two error probability curves of the bootstrap control limit are closer to the axes, so the model effect is better.

FIGURE 8 Type I and type II error rates of two different control limit methods. (a) KDDCUP99, (b) NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016 and (e) TON_IoT

Next, consider the P–R curves in Figure 9, where line segments represent the P–R curves at the bootstrap control limit and combinations of points and line segments represent those at the KDE control limit. Most of the P–R curves of the five datasets intersect each other, so the better method cannot be identified visually, but the classification effect of the two methods can be compared through their F1 values.

It can be seen from Table 12 that the maximum F1 at the bootstrap control limit is slightly greater than the maximum F1 at the KDE control limit on all five datasets: the difference in F1 is 1.33057% on the KDDCUP99 dataset, 5.26422% on NSL_KDD, 0.70417% on UNSW-NB15, 2.60714% on ISCX-URL2016, and 0.40919% on TON_IoT, so the bootstrap control limit method is slightly better.

Comparing the ROC curves in Figure 10, the ROC curve at the bootstrap control limit is represented by line segments, and the ROC curve at the KDE control limit by combinations of points and line segments. Although the two curves partially cross, the ROC curve at the bootstrap control limit basically 'contains' the ROC curve at the KDE control limit. To compare the classification effect more intuitively, we can use the AUC value.

Through the comparison of AUC values shown in Figure 11, it can be intuitively concluded that the classification effect of the bootstrap control limit is far better than that of the KDE control limit method.
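Stepping back from the figure comparisons, the bootstrap control limit itself can be sketched as follows. This illustrates the general resampling-percentile idea (in the spirit of [26, 27]); the resample count, seed and quantile handling are our assumptions, not the paper's exact settings:

```python
import random
import statistics

def bootstrap_control_limit(t2_values, alpha=0.05, n_boot=1000, seed=0):
    """Sketch of a bootstrap control limit for Hotelling's T2 statistics:
    resample the in-control T2 values with replacement, take the
    (1 - alpha) empirical quantile of each resample, and average the
    resulting quantiles to obtain the upper control limit."""
    rng = random.Random(seed)
    n = len(t2_values)
    k = int((1 - alpha) * (n - 1))  # index of the empirical quantile
    quantiles = []
    for _ in range(n_boot):
        resample = sorted(rng.choice(t2_values) for _ in range(n))
        quantiles.append(resample[k])
    return statistics.mean(quantiles)

# toy in-control T2 sample; points above the limit would signal an attack
t2 = [0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0, 2.3, 2.6, 2.9, 3.2, 6.0]
ucl = bootstrap_control_limit(t2)
print(min(t2) < ucl <= max(t2))  # True
```

A monitored observation whose T2 statistic exceeds `ucl` would be flagged as anomalous; the KDE alternative instead takes the quantile from a smoothed density estimate of the same T2 sample.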

(a) (b)
1

1
0.5 0.6 0.7 0.8 0.9

0.5 0.6 0.7 0.8 0.9


precision rate/%

precision rate/%

kde kde
bootstrap bootstrap

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1


recall rate/% recall rate/%
(c) (d)
1
1

0.9
0.5 0.6 0.7 0.8 0.9

precision rate/%
precision rate/%

0.8
0.7

kde
bootstrap
0.6

kde
bootstrap
0.5

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1


recall rate/% recall rate/%
(e)
0.5 0.6 0.7 0.8 0.9
precision rate/%

kde
bootstrap

F I G U R E 9 P–R curve of two different control


limit methods. (a) KDDCUP99, (b) NSL_KDD, (c)
0 0.2 0.4 0.6 0.8 1 UNSW–NB15, (d) ISCX–URL2016 and (e)
recall rate/% TON_IoT
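The AUC values behind these comparisons are areas under the empirical ROC curves; a minimal trapezoidal-rule sketch (illustrative, not the authors' implementation):

```python
def auc_from_roc(fpr, tpr):
    """Area under an ROC curve by the trapezoidal rule; the points must
    be sorted by increasing false-positive rate."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# a curve hugging the top-left corner gives an AUC close to 1
print(round(auc_from_roc([0.0, 0.1, 1.0], [0.0, 0.9, 1.0]), 3))  # 0.9
```

A random classifier traces the diagonal and scores 0.5, which is why the AUC gap between the two control limit methods is a meaningful summary of the crossing ROC curves.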
TABLE 12 F1 value

Dataset         Method     F1
KDDCUP99        kde        0.9769912
                bootstrap  0.9902969
NSL_KDD         kde        0.878995
                bootstrap  0.9316372
UNSW-NB15       kde        0.7914372
                bootstrap  0.7984789
ISCX-URL2016    kde        0.7158866
                bootstrap  0.7419580
TON_IoT         kde        0.9233529
                bootstrap  0.9274448

4.2 | Comparison to the other methods

In this subsection, we compare the proposed method with other techniques from the literature.

For the KDDCUP99 dataset, the proposed method is compared with a honeynet-based SVM algorithm [34], the Load-balancing Partition Support Vector Machine (LBP-SVM) [39], a deep learning method that combines CNN and LSTM [40], a framework based on Hybrid Multi-Level Data Mining (HMLD) [42], and an Improved Elephant Herding Optimization (IEHO) [41]. For this dataset, the proposed method presents better performance, as shown in Table 13.

For the NSL_KDD dataset, the proposed method is compared with the introduction of the ReLU function in neural network architectures [43], an incremental extreme learning
neural network architectures [43], incremental extreme learning

(a) (b)

0.9
0.9

True Positive Rate/%


True Positive Rate/%

0.6
0.6

kde kde
bootstrap bootstrap

0.3
0.3

0
0

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1


False Positive Rate/% False Positive Rate/%
(c) (d) 0.9
0.9

True Positive Rate/%


True Positive Rate/%

0.6
0.6

kde
bootstrap kde
bootstrap
0.3
0.3

0
0

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1


False Positive Rate/% False Positive Rate/%
(e)
0.9
True Positive Rate/%
0.6

kde
0.3

bootstrap

F I G U R E 1 0 ROC curve of two different


0

control limit methods. (a) KDDCUP99, (b)


NSL_KDD, (c) UNSW–NB15, (d) ISCX–URL2016 0 0.2 0.4 0.6 0.8 1
and (e) TON_IoT False Positive Rate/%
machine (I-ELM) with an adaptive principal component (A-PCA) [44], a multivariate correlation analysis–long short-term memory network (MCA-LSTM) [45], a modified Naive Bayes algorithm based on the artificial bee colony algorithm (ABCWNB) [46], and a multivariate control chart based on the fast minimum covariance determinant (Fast-MCD) algorithm and kernel density estimation (KDE) [47]. On this dataset, the proposed method also shows higher performance; the specific situation is shown in Table 13.

Furthermore, for the UNSW-NB15 dataset, the performance of the proposed method is better than that of the other techniques.

The ISCX-URL2016 and TON_IoT datasets were created in 2019 and 2020, respectively. At present, only a few basic machine learning algorithms have been applied to these datasets, so there are no comparable results.

TABLE 13 Performance comparison of the proposed method to the other methods in monitoring intrusion

Dataset       Method                    Hit (%)
KDDCUP99      honeynet-based SVM [34]   89.9
              LBP-SVM [39]              92.19
              LSTM, CNN [40]            92.536
              HMLD [42]                 96.7
              IEHO [41]                 98
              Proposed method           99.04
NSL_KDD       ReLU [43]                 74.662
              I-ELM+A-PCA [44]          81.22
              MCA-LSTM [45]             82.15
              ABCWNB [46]               91.08
              Fast-MCD, KDE [47]        91.71
              Proposed method           93.09
UNSW-NB15     I-ELM+A-PCA [44]          70.51
              NB [48]                   75.73
              DNN [49]                  76.1
              EM [50]                   77.2
              MCA-LSTM [45]             77.74
              Proposed method           79.08

FIGURE 11 AUC value of two different control limit methods

5 | CONCLUSION

In summary, this research work presented the proposed PCA mix-based Hotelling's T2 control charts with a bootstrap control limit. Compared with the conventional PCA-based Hotelling's T2 control chart, the proposed method performs better at detecting anomalies in the network, which has been verified in most of the experiments. To obtain higher accuracy, this work uses the bootstrap method. Meanwhile, this study compared the performance of the control limits obtained with the bootstrap method against those calculated using the most commonly used KDE, and the experimental results showed that the control limits calculated with bootstrap are better.

In the future, we aim to investigate the root cause of intrusion detection by using explainable machine learning approaches.

ACKNOWLEDGEMENTS
This paper is one of the results of the Natural Science Foundation of the Autonomous Region (2021D01C118, 2018D01C075), the Autonomous Region university scientific research program (XJEDU2017S006, XJEDU2018Y005), the High-Level Innovative Talents Project of the Autonomous Region (100400016, 042419006) and the Doctoral startup fund of Xinjiang University (No. 620312308, No. 620312310).

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID
Mo Shaohui https://orcid.org/0000-0001-9447-1396

REFERENCES
1. Efstathopoulos, G., et al.: Operational data based intrusion detection system for smart grid. In: 2019 IEEE 24th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), pp. 1–6. (2019)
2. Radoglou, G.P., et al.: ARIES: a novel multivariate intrusion detection system for smart grid. Sensors. 20(18), 5305 (2020)
3. Wei, W., et al.: A multi-objective immune algorithm for intrusion feature selection. Appl. Soft Comput. J. 95(10) (2020)
4. Buczak, A.L., Guven, E.: A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 18(2), 1153–1176 (2016)
5. Mishra, P., et al.: A detailed investigation and analysis of using machine learning techniques for intrusion detection. IEEE Commun. Surv. Tutor. 21(1), 686–728 (2019)
6. Sarker, I.H., et al.: IntruDTree: a machine learning based cyber security intrusion detection model. Symmetry. 12(5), 754 (2020)
7. Sarnovsky, M., Paralic, J.: Hierarchical intrusion detection using machine learning and knowledge model. Symmetry. 12(2), 203 (2020)
8. Martins, N., et al.: Adversarial machine learning applied to intrusion and malware scenarios: a systematic review. IEEE Access. 8, 35403–35419 (2020)
9. Sparks, R.: Monitoring highly correlated multivariate processes using Hotelling's T2 statistic: problems and possible solutions. Qual. Reliab. Eng. Int. 31(6), 1089–1097 (2015)
10. Zhongjin, F., Shu, Z.: Research on a network intrusion detection system based on data mining. In: Proceedings of 2012 Third International Conference on Theoretical and Mathematical Foundations of Computer Science (ICTMF 2012) (2012)
11. Anitha, P., Kaarthick, B.: Oppositional based Laplacian Grey Wolf Optimization Algorithm with SVM for data mining in intrusion detection system. J. Ambient Intell. Humaniz. Comput. 1–12 (2019)
12. Khraisat, A., et al.: Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine. Electronics. 9(1) (2020)
13. Sivasamy, A.A., Sundan, B.: A dynamic intrusion detection system based on multivariate Hotelling's T2 statistics approach for network environments. Sci. World J. 2015, 850153 (2015)
14. Sánchez-Fernández, A., et al.: Fault detection based on time series modeling and multivariate statistical process control. Chemom. Intell. Lab. Syst. 182, 57–69 (2018)
15. Muhammad, A., et al.: Outlier detection using PCA mix based T2 control chart for continuous and categorical data. Commun. Stat. Simul. Comput. 50(5), 1496–1523 (2021)
16. Hotelling, H.: Multivariate Quality Control, Illustrated by the Air Testing of Sample Bombsights. McGraw Hill, New York (1947)
17. Ye, N., et al.: Multivariate statistical analysis of audit trails for host-based intrusion detection. IEEE Trans. Comput. 51(7), 810–820 (2002)
18. Ahsan, M., et al.: Intrusion detection system using multivariate control chart Hotelling's T2 based on PCA. Adv. Sci. Eng. 8(5), 1905–1911 (2018)
19. Ye, N., et al.: Probabilistic techniques for intrusion detection based on computer audit data. IEEE Trans. Syst. Man Cybern. A Syst. Humans. 31(4), 266–274 (2001)
20. Shanmugam, R.: Introduction to engineering statistics and lean six sigma: statistical quality control and design of experiments and systems. J. Stat. Comput. Simul. 89(15), 2980 (2019)
21. Ahsan, M., et al.: Multivariate control chart based on PCA mix for variable and attribute quality characteristics. Prod. Manuf. Res. 6(1), 364–384 (2018)
22. Kaiser, H.F.: The varimax criterion for analytic rotation in factor analysis. Psychometrika. 23(3), 187–200 (1958)
23. Kiers, H.A.L.: Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika. 56(2), 197–212 (1991)
24. Chavent, M., Kuentz Simonet, V., Saracco, J.: Orthogonal rotation in PCAmix. Adv. Data Anal. Classif. 6(2), 131–146 (2012)
25. Chavent, M., et al.: Multivariate analysis of mixed data: the R package PCAmixdata. 2017. arXiv:1411.4911v4 [stat.CO]
26. Phaladiganon, P., et al.: Bootstrap-based T2 multivariate control charts. Commun. Stat. Simul. Comput. 40(5), 645–662 (2011)
27. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979)
28. Mohammad, S.I.M., et al.: Detecting malicious URLs using lexical analysis. Netw. Syst. Secur. 467–482 (2016)
29. Alsaedi, A., et al.: TON_IoT Telemetry Dataset: a new generation dataset of IoT and IIoT for data-driven intrusion detection systems. IEEE Access. 8, 165130–165150 (2020)
30. Bencheikh, F., et al.: New reduced kernel PCA for fault detection and diagnosis in cement rotary kiln. Chemom. Intell. Lab. Syst. 204, 104091 (2020)
31. Kim, C., Klabjan, D.: A simple and fast algorithm for L1-norm kernel PCA. IEEE Trans. Pattern Anal. Mach. Intell. 42(8), 1842–1855 (2020)
32. Levada, A.L.M.: Parametric PCA for unsupervised metric learning. Pattern Recognit. Lett. 135, 425–430 (2020)
33. Li, M., et al.: Fast hybrid dimensionality reduction method for classification based on feature selection and grouped feature extraction. Expert Syst. Appl. 150, 113277 (2020)
34. Wang, Z., et al.: Honeynet construction based on intrusion detection. 2019. https://doi.org/10.1145/3331453.3360983
35. Tao, Y., Cuicui, L.: Recognition system for leaf diseases of Ophiopogon japonicus based on PCA-SVM. Plant Dis. Pests. 2, 9–13 (2020)
36. Miao, L.: Application of CART decision tree combined with PCA algorithm in intrusion detection. In: 2017 IEEE 8th International Conference on Software Engineering and Service Science (2017)
37. Choubey, D.K., et al.: Performance evaluation of classification methods with PCA and PSO for diabetes. Netw. Model. Anal. Health Inf. Bioinforma. 9(5) (2020)
38. Gharsellaoui, S., et al.: Multivariate features extraction and effective decision making using machine learning approaches. Energies. 13(3) (2020)
39. Chen, X., Wang, Z.J., Ji, X.: A load-balancing divide-and-conquer SVM solver. ACM Trans. Embed. Comput. Syst. 16(3) (2017)
40. Lu, X., Liu, P., Lin, J.: Network traffic anomaly detection based on information gain and deep learning. In: Proceedings of the 2019 3rd International Conference on Information System and Data Mining, pp. 11–15. (2019)
41. Xu, H., et al.: Applying an improved elephant herding optimization algorithm with spark-based parallelization to feature selection for intrusion detection. Int. J. Perform. Eng. (6), 1600–1610 (2019)
42. Yao, H., et al.: An intrusion detection framework based on hybrid multi-level data mining. Int. J. Parallel Program. 47(4), 740–758 (2019)
43. Nader, A., Azar, D.: Searching for activation functions using a self-adaptive evolutionary algorithm. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pp. 145–146. (2020)
44. Shi, Q., et al.: A framework of intrusion detection system based on Bayesian network in IoT. Internet Things Smart Environ. 14 (2018)
45. Dong, R., et al.: Network intrusion detection model based on multivariate correlation analysis – long short-time memory network. IET Inf. Secur. 14(2), 166–174 (2020)
46. Yang, J., et al.: Modified naive Bayes algorithm for network intrusion detection based on artificial bee colony algorithm. In: 2018 IEEE 4th International Symposium on Wireless Systems within the International Conferences on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS-SWS), pp. 35–40. (2018)
47. Ahsan, M., et al.: Robust adaptive multivariate Hotelling's T2 control chart based on kernel density estimation for intrusion detection system. Expert Syst. Appl. 145, 113105 (2020)
48. Nawir, M., et al.: Performances of machine learning algorithms for binary classification of network anomaly detection system. J. Phys. Conf. Ser. 1018, 012015 (2018)
49. Kasongo, S.M., Sun, Y.: A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Comput. Secur. 92, 101752 (2020)
50. Moustafa, N., Slay, J.: A hybrid feature selection for network intrusion detection systems: central points. In: Proceedings of the 16th Australian Information Warfare Conference, pp. 5–13. (2015)

How to cite this article: Shaohui, M., et al.: PCA mix-based Hotelling's T2 multivariate control charts for intrusion detection system. IET Inf. Secur. 16(3), 161–177 (2022). https://doi.org/10.1049/ise2.12051