
electronics

Article
A Study on Performance Metrics for Anomaly Detection Based
on Industrial Control System Operation Data
Ga-Yeong Kim 1, Su-Min Lim 1 and Ieck-Chae Euom 2,*

1 System Security Research Center, Chonnam National University, Gwangju 61186, Korea;
rkdud8727@gmail.com (G.-Y.K.); smsky1010@gmail.com (S.-M.L.)
2 Department of Data Science, Chonnam National University, Gwangju 61186, Korea
* Correspondence: iceuom@chonnam.ac.kr

Abstract: Recently, OT (operational technology) networks of industrial control systems have been combined with IT networks. Therefore, OT networks have inherited the vulnerabilities and attack paths existing in IT networks. Consequently, attacks on industrial control systems are increasing, and research on technologies combined with artificial intelligence for detecting attacks is active. Current research focuses on detecting attacks and improving the detection accuracy. Few studies exist on metrics that interpret anomaly detection results. Different analysis metrics are required depending on the characteristics of the industrial control system data used for anomaly detection and the type of attack they contain. We focused on the fact that industrial control system data are time series data. The accuracy and F1-score are used as metrics for interpreting anomaly detection results. However, these metrics are not suitable for evaluating anomaly detection in time series data. Because it is not possible to accurately determine the start and end of an attack, range-based performance metrics must be used. Therefore, in this study, we propose a range-based performance metric with an improved algorithm for evaluating anomaly detection performed on time series data. The previously studied range-based performance metric, time-series aware precision and recall (TaPR), evaluated all attacks equally. In this study, improved performance metrics were derived by defining ambiguous instances according to the characteristics of each attack and redefining the algorithm of the TaPR metric. This study provides accurate assessments when performing anomaly detection on time series data and allows predictions to be evaluated based on the characteristics of the attack.

Keywords: industrial control system; time-series data; detection of unknown attacks; anomaly detection; performance metrics; statistics and probability; big data

1. Introduction

Recently, attacks targeting industrial control systems have been increasing. This is because OT networks, which were more closed off in the past, have been combined with IT networks, and the number of devices communicating with the outside has increased [1,2]. Therefore, OT networks have inherited the vulnerabilities, attack paths, and vectors that exist in IT networks. Notably, because various industrial control systems comprise national infrastructure, the scale of damage caused by an attack is significantly large. Therefore, research on protection and detection methods is being actively conducted. In the past, systems such as firewalls, IDSs, and IPSs were introduced to protect against and detect external attacks. However, recently, attention has shifted toward solutions involving artificial intelligence, a core element of the Fourth Industrial Revolution [3,4]. Most studies focus on using machine learning, including deep learning, to train on testbed-based data to improve accuracy and detect more anomalies. However, research on how to interpret the results of anomaly detection is limited. An appropriate interpretation must be made according to the characteristics of the data applied to the anomaly detection and the type of attack they contain.


The data extracted from industrial control systems are characteristically time series
data. To perform and evaluate anomaly detection on time series data, non-traditional
performance metrics are needed. In many cases, standard point-based metrics are used
when evaluating detections. However, standard point-based metrics cannot accurately
evaluate anomaly detection on time series data. Standard point-based metrics give higher scores to methods that detect longer anomalies. However, it is important to detect various attacks in industrial control systems. Therefore, point-based metrics are not appropriate because they give a higher score to a method that detects one long attack than to a method that detects multiple short attacks. In addition, abnormal behavior that occurs in industrial control systems often progresses slowly over a long period of time. Therefore, it is difficult to clearly grasp the start and end points of an attack. It is necessary to evaluate the result of anomaly detection in consideration of these characteristics.
The purpose of our research is to evaluate the performance of anomaly detection and suggest directions for further improving the performance metrics, using performance metrics that reflect the characteristics of the data. We redefined the ambiguous section of the time-series aware precision and recall (TaPR) metric [5], which is a range-based performance metric. TaPR consists of a detection score and a partial score. The partial score is given in consideration of the fact that it is difficult to grasp the start and end points of an attack. Examining how TaPR derives this ambiguous section, we found that it applies the same ratio to all anomalous behaviors. The attack duration section runs from the moment the attacker starts attacking the system to the moment the attack stops. However, the section in which the system is actually affected is different. Assuming the system was attacked while operating normally, the physical operating values will begin to change over time, and the real affected section begins the moment they cross the threshold set by the operator. The real affected section ends not at the moment the attacker finishes the attack, but at the moment the physical operating value returns below the threshold. Therefore, the tail of this section is not directly driven by the attacker; rather, the attack's influence remains and continues to affect the system. The attack duration and the real affected section are different for each attack. We focus on this difference and define ambiguous sections according to the characteristics of each attack.
In summary, our contributions include the following:
• We create training models and perform anomaly detection using operation datasets
derived from real industrial control systems;
• We propose a range-based metric suitable for evaluating anomaly detection against
time series data;
• We redefine the algorithm for the ambiguous section of TaPR in consideration of the
characteristics of the attack.
This paper is structured as follows. In Section 2, we discuss industrial control systems and the network and operating system data in the industrial control system datasets that have been published thus far. Section 2 also describes the standard and time-series-based performance metrics typically used for evaluating anomaly detection. After that, we analyze the operational data to determine which performance index to use for the anomaly detection research. Section 3 proposes an improved algorithm for the time-series aware precision and recall (TaPR) metric, a time-series-based performance metric. We redefine the algorithm of the ambiguous section of the TaPR metric using the upper control line (UCL) and lower control line (LCL), which are concepts of statistical process control (SPC). In Section 4, we perform and analyze anomaly detection on our operational dataset using deep learning. Additionally, the redefined algorithm proposed in Section 3 is applied to the case study. Finally, Section 5 draws conclusions and presents future research directions.

2. Related Works
2.1. Industrial Control System Structure and Security Requirements

A control system is a device that manages, commands, directs, and regulates the operation of other devices or systems. It is a system that can affect the real world and that connects cyber and physical processes. According to the IEC 62443 standard, an industrial control system is a collection of processes, personnel, hardware, and software that affects, or can affect, the safety, security, and reliable operation of industrial processes. The industrial control system structure is described in reference to the PERA (Purdue Enterprise Reference Architecture) model [6] in Figure 1, one of the reference architectures demonstrating industrial control networks. The PERA model is a reference model easily understood by dividing devices and equipment into layers [7,8]. The PERA model is primarily divided into a safety zone, an industrial zone from levels 0 to 3, and an enterprise zone from levels 4 to 5.

Figure 1. Structure of the PERA (Purdue Enterprise Reference Architecture) model.
The safety zone is where the safety instrumentation system, responsible for the safe operation of the industrial control system, is located. The industrial zone is called OT (operation technology) and includes the cell/area zone containing levels 0, 1, and 2, as well as level 3, which serves as a control center that integrates and manages multiple cells/areas. Level 0 is the process area, consisting of I/O devices that recognize and manipulate physical processes, including sensors that measure temperature, humidity, pressure, etc., and actuators that control equipment according to the commands from the controller. Level 1 is the basic control area, primarily composed of a manufacturing process controller interacting with level 0 devices and operated by a PLC (programmable logic controller) and VFD (variable frequency drive). Level 2 is the supervision and control area, where an operator HMI (human–machine interface) and alarm systems that supervise and control operations are operated [9].
Level 3 is the field operation management area comprising an application server that integrates the monitoring and control of physical on-site processes, an engineer workstation

that configures and controls the overall hardware of the industrial control system, an
HMI, a domain controller, and a historian server that primarily uses a real-time database.
This level is an essential zone because there are applications, devices, and controllers
crucial for monitoring and controlling the industrial control system. An antivirus or patch
management server operates in this area. The enterprise zone includes levels 4 and 5
and refers to an environment where the production management system, manufacturing
execution system (MES), process monitoring, general IT system application, enterprise
resource planning (ERP) system, and e-mail server are installed [9].
Industrial control systems introduced and operated in the national infrastructure
are rapidly becoming intelligent and digitized. In addition, the increase in connectivity between heterogeneous devices in various environments owing to the Fourth Industrial Revolution expands the attack surface, extending the security threats of the existing IT environment to the OT environment and leading to more diverse security threats than before [10].
The international standard IEC 62443 defines the relevant regulations and requirements. IEC 62443-1-1 describes seven functional requirements (FRs), such as the definition of the control system capability security level and component requirements [11]. This study focused on the timely event response and resource availability
among the seven requirements of IEC 62443-4-2 [12]. The two requirements focus on a
control system’s safe monitoring and operation. Therefore, monitoring the operation data in
the industrial control system in real time and performing anomaly detection are necessary.
In the past, behavior-based IDSs/IPSs were used, but recently, anomaly detection has also
begun to use neural network algorithms.

2.2. Datasets
The data extracted from industrial control systems can be divided into network traffic
and operating system datasets. Table 1 shows the operating system datasets, of which five types were investigated. Each operating system dataset contains scenario-based attacks conducted through its attack points.

Table 1. Industrial control system operation dataset list.

Refs.     Name                        Proposed       Attack Points   Year
[13]      ICS Cyber Attack Datasets   Tommy Morris   32              2015
[14]      BATADAL                     iTrust         14              2017
[15]      WADI                        iTrust         15              2019
[16]      SWaT Datasets               iTrust         41              2015
[17,18]   HAI Datasets                NSR            50              2020

The first dataset is from the ICS Cyber Attack Datasets [13], consisting of datasets for five
processes: power systems, gas pipelines, gas pipelines and water storage tanks, improved
gas pipelines, and energy management systems. The second dataset is BATADAL [14],
a dataset from a contest aimed at generating an attack detection algorithm, based on a
water distribution system, and consisting of three datasets, two for training data and one
for validation. Training data #1 are normal data collected for one year, and training data
#2 are data collected for 6 months and consist of normal data and attacked data. Data
for verification are unlabeled data collected for 4 months. The third dataset is WADI [15],
which was generated from multiple testbeds extended from the SWaT testbed. The water distribution testbed simulates water demand and consumption over time. It collects steady-state data
for 14 days and contains attacks for 2 days.
The fourth dataset is SWaT [16], which was collected from the testbed of a water
treatment system. It contains the physical devices that make up each stage of the water
treatment system, and eight versions through seven renewals exist thus far. The fifth
dataset is HAI [17], which comprises data collected from a water treatment process system.
It consists of three physical systems, namely, GE's turbine, Emerson's boiler, and FESTO's water treatment system, and a dSPACE HIL simulator. The variables measured and controlled by the control system were collected at 1 s intervals at 78 points. Currently, versions 20.07 [18] and 21.03 exist, and the renewed 21.03 version adds attack and collection points.
2.3. Performance Metrics for Anomaly Detection

Performance metrics for anomaly detection evaluation can be divided into point-based and range-based methods, as shown in Figure 2. In point-based methods, each point is labeled only as abnormal or normal, and credit is given when the prediction matches the label. In contrast, range-based methods not only divide the data into abnormal and normal labels but also assign different scores according to the performance metrics. This allows the evaluator to assign different weights to give different scores.

Figure 2. Performance metric tree for anomaly detection evaluation.
2.3.1. Standard Metric

The standard metric refers to a point-based performance index and a function constructed using items of a confusion matrix. The function uses accuracy and often considers precision and recall as well. There are the F1-score using precision and recall, the ROC curve (receiver operating characteristic curve) using the true and false positive rates, and the AUC (area under the curve) indicating the area under the ROC curve.
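For reference, these point-based metrics are computed from the confusion matrix counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = (2 × Precision × Recall) / (Precision + Recall)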
2.3.2. Range-Based Recall and Precision

An anomaly is defined as a range-based event caused by an atypical occurrence; therefore, using a point-based performance evaluation index is inappropriate for evaluating its detection [19]. Therefore, range-based recall (RR) and precision (RP), which are range-based performance indexes, were proposed. Recall receives a score if the abnormality is sufficiently identified and a penalty if not. However, because time series data are expressible as a range, classical point-based recall is an inappropriate performance evaluation index.
2.3.3. TaPR

TaPR was proposed as a performance index suitable for evaluating anomaly detection methods in time series data, with time-series aware precision and recall [5]. TaPR consists of a detection score indicating the number of anomalies detected and a subscore indicating how accurately each anomaly was detected. TaPR addresses two problems in the existing performance metrics: a method that detects only long anomalies receives a high score, and there is ambiguity over whether the values observed while the physical process returns to normal, yet is still affected by the attack, should be regarded as normal or abnormal.
Figure 3 presents the problem of giving a high score to a method that detects only long anomalies. Method 1 performed anomaly detection with predictions 1 and 2, and method 2 with prediction 3. The actual anomalies, anomalies 1 and 2, occurred in time units 2 to 8 and 9 to 13 on the time axis. Table 2 shows the precision and recall for methods 1 and 2.

Figure 3. Anomaly detection case to explain problems with existing performance metrics.

Method 1 detected all anomalies but had lower precision and recall values than method 2, which detected only one long anomaly. If abnormality 2 is an attack that can have a fatal effect on the internals of the industrial control system, method 1 outperforms method 2 in abnormality detection.

Table 2. Precision and recall for methods 1 and 2.

Method   Precision   Recall
1        0.67        0.50
2        1.00        0.625
We describe the second problem, the ambiguity, in reference to Figure 4. In Figure 4, the X-axis represents the time, and the Y-axis represents the physical operating values of internal system devices. During normal operation, an attacker carries out an attack, and consequently, the physical operating value exceeds the threshold and reaches a value determined as abnormal. Once the attack ends, section (a) outlines the ambiguous section while the value returns to normal. Section (a) is the process of returning to normal: it is no longer directly affected by the attacker, but the influence of the attack remains. In addition, half of section (a) exceeds the threshold and may be regarded as abnormal, and the other half may be determined as normal because it does not exceed the threshold.

Figure 4. Example of an ambiguous section of the TaPR performance metric.
is TaPR. TaPR consists of a
detection score and a partial score. The detection score indicates the number of detected
anomalies, and the partial score indicates how accurately each anomaly is detected. TaPR
emphasizes the importance of the number of anomalies detected through the sum of these
two scores. In addition, although the value appears ‘normal’ in parts of this section owing to external influences, section (a) is highly likely to be ‘abnormal’; thus, if this section is predicted as ‘abnormal’, a corresponding score is still given. TaP and TaR are calculated
as the sum of the detection score (TaP_d, TaR_d) and the portion score (TaP_p, TaR_p). The ambiguous instances used in the portion score are denoted as a′, and the number of ambiguous instances is denoted as δ [5] in Table 3.

Table 3. Factors and formulas constituting the performance metrics TaP and TaR.

Factor   Description                          Formula
a        A set of instances                   a = {t, t + 1, ..., t + l − 1}
l        The number of instances              |a|
A        A set of anomalies                   A = {a_1, a_2, ..., a_n}
n        The number of anomalies              -
ρ        The suspicious instances             ρ = {t′, t′ + 1, ..., t′ + l − 1}
P        A set of predictions                 P = {ρ_1, ρ_2, ..., ρ_m}
m        The number of predictions            -
a′       The ambiguous instances              a′ = {t + l + 1, t + l + 2, ..., t + l + δ + 1}
δ        The number of ambiguous instances    -
A′       A set of ambiguous instances         A′ = {a′_1, a′_2, ..., a′_n}
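Assuming the weighting parameter α as defined in [5], the final scores combine the two components as follows:

TaP = α × TaP_d + (1 − α) × TaP_p
TaR = α × TaR_d + (1 − α) × TaR_p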

2.4. Requirements for Performance Metrics


To apply the anomaly detection evaluation method to time series operation data, the
following conditions are required for performance metrics:
1. The highest score is not given to the method of detecting long-length anomalies.
2. Scores are given to cases that detect various anomalies.
3. Scores are given to cases that detect proactive signs.
4. When an external influence such as an attack causes the value to exceed the threshold separating normal from abnormal, and the external influence then ends, a score is given for the section in which the abnormal value returns to normal.
5. A detection score is given according to the characteristics of the attack.
Supplementing the explanation of No. 4: there are cases in which the external influence is applied but the threshold is not yet exceeded, and cases in which the external influence is over but the value has not yet returned below the threshold. Determining such a section as only abnormal or normal is difficult.
Table 4 summarizes papers that performed anomaly detection on the ICS operation
dataset and evaluated performance metrics. It is indicated whether the performance
metrics used in the papers are point-based or range-based. The authors of [20] conducted
an anomaly detection study on the SWaT dataset. SWaT’s attack types consist of single-stage
single-point, single-stage multipoint, multistage single-point, and multistage multipoint
attacks. Among them, the problem was that there was no research to detect multi-step
attacks, and the attacked variables could not be identified. The model used for anomaly
detection was the genetic algorithm (GA). To improve detection performance, exponentially
weighted moving average smoothing, the mean p-powered error measure, individual error
weights for each tag value, and disjoint prediction windows were adjusted. Then, the NAB
performance metric was used to evaluate the prediction of anomaly detection. The authors
of [21] proposed a deep learning-based anomaly detection method using the Seq-to-Seq
model. The study was conducted on the SWaT dataset, and by using the Seq-to-Seq model,
it can be said that the encoding–decoding method is suitable for time series operational
data. Anomaly detection prediction consists of three steps. In the first stage, the model
predicts the sensor and actuator values, and in the second stage, the difference between the
predicted data and the actual data is calculated. In the last step, if the difference calculated
in step 2 is greater than the set value, it is regarded as abnormal, and a warning is sent.
After that, false negative cases, in which attacks were not detected, and false positive cases are analyzed. The authors of [22] performed anomaly detection on the SWaT dataset through
five machine learning algorithms. SVM, random forest, and KNN were used as supervised
learning algorithms, and one-class SVM (OCSVM) and an autoencoder (AE) were used as
unsupervised learning algorithms. As an anomaly detection performance evaluation index,
values derived from the accuracy and F1-score were compared. On average, supervised
learning using data labeling performed better than unsupervised learning, and random
forest showed the best performance among the supervised learning algorithms. The authors
of [23] studied an intrusion detection mechanism using physical operational data. The
method can be integrated into the network intrusion detection system (NIDS), improving ICS system security. Using operational data, anomalies can be detected in each section without interfering with the communication of each level section. Anomalies
were detected using three machine learning models for the HAI 20.03 dataset. The data
imbalance problem was solved using the Synthetic Minority Over-sampling Technique
(SMOTE). The stratified shuffle split (SSS) method was used for training dataset separation.
As a machine learning model, KNN, a decision tree, and a random forest algorithm were
used to build the model. Random forest showed the best performance using precision,
recall, F1-score, and AUC as performance metrics. The authors of [24] performed anomaly
detection based on a stacked autoencoder (SAE) and deep support vector data description
(deep SVDD). HAI version 20.03 was used as operational data. Data preprocessing was
performed through data scaling, and a model was created through the SAE and deep SVDD.
Single-event anomaly detection was performed by comparing accuracies and setting the threshold that yielded the highest accuracy.

Table 4. List of papers researching anomaly detection targeting ICS operation data. (O: satisfies a requirement, X: does not satisfy a requirement).

              Performance Metrics                        Req.1   Req.2   Req.3   Req.4   Req.5   Refs.
Point-based   Confusion Matrix                           X       X       X       X       X       [20–24]
              Specificity                                X       X       X       X       X
              Recall                                     X       X       X       X       X
              Precision                                  X       X       X       X       X
              Accuracy                                   X       X       X       X       X
Range-based   Range-Based Recall and Precision           O       O       X       O       X       [5,19]
              Time-Series Aware Precision and Recall     O       O       O       O       X

Table 4 shows whether the performance metrics presented in Figure 2 satisfy the relevant requirements. The point-based performance metrics satisfy none of the conditions. In addition, a study evaluating anomaly detection based on point-based performance
metrics is presented. There were papers that studied the range-based performance metrics
themselves, but there was no study of anomaly detection using them as performance
metrics. Among the range-based performance metrics, RR and RP do not satisfy conditions
3 and 5, and TaPR does not satisfy condition 5. Therefore, in Section 3, we propose perfor-
mance metrics that satisfy all the conditions by improving the TaPR metric. Moreover, the
existing research performed anomaly detection targeting industrial control system time
series data but evaluated it using performance metrics in a point-based method. In this
paper, the results are derived from the performance metrics of the point-based method and
the range-based method. We then compare the results to discuss which method is more
suitable for evaluating anomaly detection targeting time series data.

3. Proposed Performance Metrics


In this section, the method of deriving performance metrics of the existing TaPR is
improved. As a direct definition was not specified in the original study, the derivation of the ambiguous section (factor a′) can be identified by examining the Python code of TaPR, as follows:
...
# The ambiguous section starts right after the last attacked time point.
start_id = self._anomalies[i].get_time()[1] + 1
# Its length is a fixed ratio (delta) of the attack duration, applied to every attack.
end_id = start_id + int(self._delta_ratio * (self._anomalies[i].get_time()[1] -
                        self._anomalies[i].get_time()[0]))
...
For example, an actual attack occurring in the time interval from 10 to 100 is assumed.
The size of the section where the attack occurred is then 90. Because start_id adds 1 to the
last point of the attack time, it becomes 101. The end_id is obtained by multiplying the ratio (delta, 0 ≤ delta ≤ 1) by the size of the section where the attack occurred and then adding start_id. Therefore, if delta is 1.0, the ambiguous section runs from 101 to 191, immediately after the attack.
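This calculation can be reproduced with a minimal standalone sketch, with hypothetical attack bounds standing in for self._anomalies[i].get_time():

# Hypothetical attack interval: first and last attacked time indices.
attack_start, attack_end = 10, 100
delta_ratio = 1.0  # ambiguous-section length as a fraction of the attack length
start_id = attack_end + 1  # 101
end_id = start_id + int(delta_ratio * (attack_end - attack_start))  # 101 + 90 = 191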

3.1. Statistical Process Control (SPC)


Because the duration of the actual attack and the section affected by each attack are
different, there is a limit to applying only one ratio to all attacks. Therefore, the following
method is proposed. Because the attack duration section is defined in the test dataset, it is
not defined separately. The actual affected section must be defined. First, the upper and
lower limits are set based on the threshold. Here, the concept of process control [25,26]
is introduced. The industrial control system dataset used in this experiment comprises data collected from several processes, for which statistical process management was determined to be appropriate for detecting anomalies. Process control
selects quality characteristics that indicate the state of the process and obtains information
about the process. The process is maintained in control by acting on the cause of the
abnormality found. A graph in which one centerline and two control limit lines are set to
manage the dispersion of quality is called a control chart. The control limit line is divided
into the specification limit, which is a standard for judging the quantity and defects of
products already produced, and the control limit, which is a management procedure that
instructs improvement measures for the process to eliminate the cause of abnormalities.
The threshold that separates abnormality from normality is defined as the central line (CL)
of the control chart. The upper control line (UCL) and lower control line (LCL) are set at a width of one standard deviation (1σ) on either side of the CL.
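As a minimal sketch of this setup (assuming errors holds the series of mean prediction errors and threshold is the operator-chosen detection threshold; both names are hypothetical):

import numpy as np

errors = np.loadtxt('mean_errors.csv')  # hypothetical file of mean errors per time step
threshold = 0.04  # hypothetical detection threshold
CL = threshold  # central line: the normal/abnormal threshold
sigma = errors.std()  # one standard deviation of the error series
UCL = CL + sigma  # upper control line (CL + 1σ)
LCL = CL - sigma  # lower control line (CL − 1σ)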

3.2. Defining the Real Affected Section

The starting point of the real affected section is where the size of the mean error, which is the blue line in the graph, exceeds the lower limit value. Next, if the magnitude of the average error falls below the lower limit again and stays there for 1 min, the affected section ends. The size of the ambiguous instance is the absolute value of the duration of the attack minus the affected section. Therefore, the start point of the ambiguous instance is obtained by adding 1 to the last point of the attack duration section, and the end point is obtained by adding the size of the ambiguous instances to the start point. Algorithm 1 for finding ambiguous instances is summarized as follows.

Initialize Central Line (CL) to threshold
UCL = CL + 1σ
LCL = CL − 1σ

Algorithm 1. Definition of the real affected section.

INPUT: CL, UCL, LCL
OUTPUT: size of ambiguous instance
1.  attack duration.append(anomalies[i].get_time()[0], anomalies[i].get_time()[1])
2.  FOR attack duration do
3.      IF error of the mean > LCL then
4.          real affected section.append(current time)
5.      ELSE IF error of the mean < LCL then
6.          end time of real affected section = current time
7.          WHILE 1 minute do
8.              IF error of the mean > LCL then
9.                  BREAK
10.             ELSE
11.                 real affected section.append(end time of real affected section)
12.             ENDIF
13.         ENDWHILE
14.     ENDIF
15. ENDFOR
16. IF attack duration[1] > real affected section[1] then
17.     size of ambiguous instance = attack duration[1] − real affected section[1]
18. ELSE
19.     size of ambiguous instance = real affected section[1] − attack duration[1]

During the attack, the algorithm executes as illustrated in Figure 5. First, the magnitude of the mean error and the magnitude of the LCL are compared. If the magnitude of the mean error is larger than that of the LCL, the current time point is appended to RI, the real affected section. Next, if the magnitude of the average error is smaller than the LCL, the time point is stored in RI_end_time, which temporarily stores the end time point. If the magnitude of the average error then remains smaller than the LCL for 1 min, RI_end_time is stored as the end time. However, if the size of the average error becomes larger than the LCL again, the end point must be set again. Because ambiguous instances must be calculated and stored as absolute values, the sizes of AD (attack duration) and RA (real affected section) are compared and their difference is stored in AI (ambiguous instances). The start point is set by adding 1 to the last point of the attack time, and end_id is set by adding AI to start_id.

Figure 5. Example of applying the algorithm to the redefinition of the real affected section.
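A minimal Python sketch of Algorithm 1, under our reading of it, follows; mean_error is a hypothetical per-second list of mean prediction errors, and attack_start/attack_end stand in for anomalies[i].get_time():

def ambiguous_instance_size(mean_error, attack_start, attack_end, lcl, patience=60):
    # Scan forward from the attack start, extending the real affected section
    # while the mean error stays above the LCL.
    affected_end = attack_start
    t = attack_start
    while t < len(mean_error):
        if mean_error[t] > lcl:
            affected_end = t  # still affected: extend the section end
            t += 1
        else:
            below = 0  # candidate end: wait 1 min (60 s) below the LCL
            while t < len(mean_error) and mean_error[t] <= lcl and below < patience:
                below += 1
                t += 1
            if below >= patience:
                break  # the real affected section ends at affected_end
    # Size of the ambiguous instance: |end of attack duration − end of affected section|
    return abs(attack_end - affected_end)

# The ambiguous section then starts right after the attack:
# start_id = attack_end + 1; end_id = start_id + ambiguous_instance_size(...)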
4. Experimentation
4.1. Model Generation Process
4.1.1. Training Data
This is a learning data selection process for anomaly detection based on industrial
control system operation data using artificial intelligence. Among the industrial control system operation datasets surveyed in Section 2.2, the water treatment system datasets SWaT and HAI were compared. When comparing data points and attack points between them,
SWaT had 60 and 41, and HAI had 78 and 50, respectively. From the diversity of data points
and attack points, we determined that HAI had more suitable data for experimentation.
HAI was also more suitable for constructing attack cases because of its various attack points, while experimenting with SWaT was difficult because its recent data lack anomaly labels.

4.1.2. Choosing a Deep Learning Model


Because the data are time series, selecting a model suitable for time series is necessary.
Representative recurrent neural network (RNN), long short-term memory (LSTM), and
gated recurrent unit (GRU) algorithms were compared. The RNN [27] receives several
input values in chronological order as current inputs and calculates them. However, if
the length of the training data is long owing to the structure of the RNN, transferring
the previous information to the present is difficult, resulting in a gradient loss problem
(long-term dependency). LSTM is an algorithm that solves the gradient loss problem of
RNNs. It has a structure in which each unit in the middle layer of a neural network is
replaced with a memory unit called an LSTM block. LSTM [28] consists of four gates and
one cell state. The gate layer includes a forget gate layer, an input gate layer, the current
state, and an output gate layer, and the cell includes the cell state. The cell state indicates
the long-term memory. GRU [29] is a simplified version of the cells constituting the LSTM.
Compared to LSTM, the structure is simple, and the calculation speed is fast because there
are fewer parameters to learn. GRU has two gates: update and reset gates. The update gate
determines how much of the previous memory is remembered, and the reset gate decides
whether the new input is merged with the old memory.
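For reference, one common formulation of these two gates and the resulting state update (omitting biases) is:

z_t = σ(W_z x_t + U_z h_{t−1})   (update gate)
r_t = σ(W_r x_t + U_r h_{t−1})   (reset gate)
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}))   (candidate state from the new input)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t   (mixing old memory and new candidate)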
The performance metric comparison in Table 5 was performed by changing only the
model in the HAI 2.0 data and HAI 2.0 baseline code provided by the HAICon2020 Indus-
trial Control System Security Threat Detection AI Contest [17]. Table 5 shows that the GRU performed best. Studies have demonstrated that LSTMs show better performance as the amount of data increases. However, as there is no precise comparison index, each model was applied directly to the data to be tested, and the GRU, which showed the best performance, was selected as the model for the experiment; PyTorch was used
for implementation.

Table 5. Comparison of models based on performance metrics.

Method   Threshold   F1-Score   TaP     TaR
RNN      0.38        78.02      97.20   64.40
LSTM     0.38        81.60      97.40   67.66
GRU      0.38        83.10      97.50   71.50

4.1.3. Training Data Preprocessing


The HAI version 21.03 data used as training data consist of three training files provided in CSV format. Each file is in the form shown in Table 6. Column 1 is the time
at which the data were collected, and columns 2 to 80 are the physical values collected from
field devices P1, P2, P3, and P4. Column 81 indicates whether an attack event occurred (1)
or not (0), and columns 82 to 84 indicate at which stage of the process the attack occurred
with attack_P1, attack_P2, and attack_P3, respectively.

Table 6. HAI version 21.03 training data (921,603 rows × 84 columns).

Time                     P1_B2004   ...   Attack   ...   Attack_P3
2020-07-11 12:00:00 AM   0.10121    ...   0        ...   0
2020-07-11 12:00:01 AM   0.10121    ...   0        ...   0
...                      ...        ...   ...      ...   ...

The preprocessing of the training data consists of three steps:


1. Refining data by removing outliers and missing values;
2. Scaling data using normalization;
3. Eliminating unnecessary data points with feature graphs.
The first step purifies the data by removing outliers and missing values. An outlier is a value with a larger deviation than the normal values it belongs to and is detected using methods such as normal distribution, likelihood, nearest neighbor, density, and clustering. Thereafter, outliers are replaced with the upper or lower limits, the standard deviation of the mean, the absolute mean deviation, or extreme percentiles. The training data of HAI 21.03 are the data collected when operated normally without an attack; all attack labels are 0, and all data points are within the range of the minimum and maximum values. Therefore, there are no outliers. Missing values indicate absent values and are expressed as NA, which must be processed during preprocessing because they cause distortions later when using the model to forecast. Therefore, missing values are processed by the deletion, substitution, or insertion of predicted values. The HAI 21.03 training data do not have missing values.
The second step scales the data using normalization, as shown in Figure 6. Data scaling is a required preprocessing step because if the range of each data point is too wide or too small, the data may converge or diverge to 0 during the learning process. For example, P1_B4022 is data that represent temperature with a range from 0 to 40, and P1_FT03 is data that represent pressure with a range from 0 to 2500. Because each range of data differs, either standardization or normalization is used for scaling. Standardization involves distributing data on either side of 0 as a value indicating the distance between the data and the mean. In contrast, normalization involves transforming the data range from a minimum of 0 to a maximum of 1 to reduce the effect of the relative size of the data. The minimum and maximum physical data values (columns 2 to 80) were derived from the training data, and normalization was performed.

Figure 6. Normalized training data.
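A minimal sketch of this min–max step (pandas; the file name is hypothetical) is as follows:

import pandas as pd

train = pd.read_csv('train.csv')  # hypothetical merged training file
sensor_cols = train.columns[1:80]  # physical values (columns 2 to 80)
tag_min = train[sensor_cols].min()  # per-column minimum from the training data
tag_max = train[sensor_cols].max()  # per-column maximum from the training data
# Min-max normalization to [0, 1]; the same tag_min/tag_max are reused for the test data.
# Note: constant columns give tag_max == tag_min (division by zero); they are removed
# in the third preprocessing step below.
train[sensor_cols] = (train[sensor_cols] - tag_min) / (tag_max - tag_min)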

The third step removes unnecessary data points using the feature graph. After normalizing the data points of the training data and converting them to values between 0 and 1, we used the graph to study the characteristics of the data points.
Figure 7 shows the feature graph of data point P1_B3005. Because the values are distributed between 0 and 1 and change over time, they represent the characteristics well. In contrast, the feature graph of data point P1_PCV02D in Figure 8 does not change its value over time. Thus, it is unnecessary data for anomaly detection. Data points with these characteristics were removed. The list of the 22 removed data points includes the following: P1_PCV02D, P1_PP01AD, P1_PP01AR, P1_PP01BD, P1_PP01BR, P1_PP02D, P1_PP02R, P1_STSP, P2_ASD, P2_AutoGO, P2_Emerg, P2_MSD, P2_ManualGO, P2_OnOff, P2_RTR, P2_TripEx, P2_VTR01, P2_VTR02, P2_VTR03, P2_VTR04, P3_LH, and P3_LL.

Figure 7. Feature graph for data point P1_B3005.

Figure 8. Feature graph for data point P1_PCV02D. This feature graph means that the values of the data points do not change over time.
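This screening can be automated with a short sketch (train_raw is a hypothetical DataFrame of the raw values; the check runs before normalization, since min–max scaling a constant column divides by zero):

# Data points whose value never changes over time produce flat feature graphs.
constant_cols = [c for c in sensor_cols if train_raw[c].nunique() == 1]
train = train.drop(columns=constant_cols)  # drops the 22 flat data points, e.g., P1_PCV02D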

4.1.4. Model Generation

In the model generation process, the model was created considering the following factors:
1. Sliding window;
2. Number of hidden layers;
3. Hidden layer size.

The sliding window [30] is a method of recognizing data according to the characteristics of time series data. The method receives data and remembers the pattern for that window, and it involves input and output values. The input is the value at the beginning of the window, and the output is the value at the end of the window. As a result of an experiment considering the size of the training data, a sliding window input size between 75 and 90 was suitable for the experiment. Therefore, it was set at intervals of 5. Table 7 shows the corresponding sliding window input and output values from which the model was created.

Table 7. Sliding window input and output settings.

     Input   Output
1    74      75
2    79      80
3    84      85
4    89      90
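A minimal sketch of cutting such windows from the normalized series (numpy; case 1 sizes from Table 7, with a placeholder matrix standing in for the real training data):

import numpy as np

def make_windows(series, n_input=74):
    # Each sample uses n_input consecutive steps to predict the following one.
    X, y = [], []
    for i in range(len(series) - n_input):
        X.append(series[i:i + n_input])  # window input: steps i .. i + 73
        y.append(series[i + n_input])  # window output: the 75th value
    return np.array(X), np.array(y)

values = np.random.rand(1000, 57)  # placeholder for the normalized training matrix
X, y = make_windows(values)  # X: (926, 74, 57), y: (926, 57)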

The following is the number of hidden layers and their sizes. The stacked GRU used
is a multi-layer neural network and consists of input, hidden, and output layers. Each layer
comprises nodes, and the number of hidden layers determines how many exist. The model
was created using the stacked GRU algorithm selected earlier as the deep learning model.
The L1 loss function was used, and AdamW [31] was used as the optimization function.
The AdamW function separates the weight decay step of the existing Adam algorithm.
During training, the number of epochs was set to 64, and the model parameters with the best loss value were stored. A rectified linear unit (ReLU) was added as an activation function to solve the gradient vanishing problem. In addition, the PyTorch library was used for implementation. Table 8 shows how the learning proceeded according to the size of the sliding window input and output.
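A minimal PyTorch sketch of such a model follows; the exact hidden layer count and size are not specified above, so the values here are placeholders:

import torch
import torch.nn as nn

class StackedGRU(nn.Module):
    def __init__(self, n_tags=57, hidden_size=100, n_layers=3):
        super().__init__()
        # Stacked GRU: several recurrent layers over the sliding-window input.
        self.gru = nn.GRU(n_tags, hidden_size, num_layers=n_layers, batch_first=True)
        self.relu = nn.ReLU()  # mitigates the vanishing gradient problem
        self.fc = nn.Linear(hidden_size, n_tags)

    def forward(self, x):  # x: (batch, window, n_tags)
        out, _ = self.gru(x)
        return self.fc(self.relu(out[:, -1]))  # predict the next time step

model = StackedGRU()
criterion = nn.L1Loss()  # L1 loss, as described above
optimizer = torch.optim.AdamW(model.parameters())  # Adam with decoupled weight decay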

Table 8. Comparison of learning models according to the size of the sliding window.

         Input   Output   Wall Time         Loss Value
Case 1   74      75       4 h 50 min 52 s   3.266
Case 2   79      80       5 h 9 min 44 s    3.262
Case 3   84      85       5 h 30 min 13 s   3.301
Case 4   89      90       5 h 50 min 9 s    3.298

4.2. Anomaly Detection


The test data were formatted as shown in Table 6, and the size was 402,005 rows ×
84 columns. In addition, the preprocessing of the test data was performed at the same
stage as the training data. The data do not have missing values. The 22 data points were
removed, and the data were normalized in the scaling step. This is the process of predicting
the results of a test dataset with a model created using a training dataset. In this process,
a threshold is used. To determine whether an attack occurs, the average of the differences calculated over the entire field is computed, and then the abnormality is predicted using the threshold value.
The blue line in Figure 9 indicates the size of the average error. If the blue line exceeds
the red line, the average error is determined to be abnormal (an attack). Abnormality
is indicated by 1, and normality by 0. The predicted label is stored in LABELS, and the
correct label of the test set is extracted as ATTACK_LABELS. In this case, the LABELS and
ATTACK_LABELS sizes do not match. Because the model learned with the sliding window method, the first part of the series, equal in length to the sliding window input, cannot be predicted. In
addition, because the three learning datasets were combined into one set of data, ‘time’
does not continue between the file changes and is thus unpredictable, and the number
of predicted labels, LABELS, is less than ATTACK_LABELS. To solve this, we applied a
function that fills in the blanks with the normal 0 and remaining values with the judgment
of the model.
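A minimal sketch of this prediction step (numpy; diffs, threshold, attack_labels, and valid_idx are hypothetical names for the per-field differences, the chosen boundary value, the correct test labels, and the time indices the model could actually predict):

import numpy as np

mean_error = diffs.mean(axis=1)  # blue line: average difference over the entire field
pred = (mean_error > threshold).astype(int)  # 1 = abnormal (attack), 0 = normal
labels = np.zeros(len(attack_labels), dtype=int)  # fill the blanks with normal (0)
labels[valid_idx] = pred  # keep the model's judgment where a prediction exists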
As shown in Table 9, the F1-score, TaP, and TaR performance metrics according to the threshold values were derived. The number of detected anomalies was given the highest priority, and the cases with the subsequent highest performance metrics were C1-2, C2-2, C3-1, and C4-2. Among them, C1-2, which had the highest number of detected anomalies and performance metrics, was applied in the analysis of the four anomaly detection experiments and evaluation methods. A total of 50 attacks in the test dataset were labeled in ATTACK_LABELS, and of these, 36 anomalies were detected as abnormal (attacks) by C1-2: ['1', '3', '5', '6', '7', '11', '12', '13', '14', '16', '17', '18', '22', '23', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '45', '47', '48', '49', '50'].

Figure 9. Average of differences produced by the entire field (blue)/threshold (red). The blue line is the average of the differences calculated by the entire field to determine if there is an attack. The red line is a value arbitrarily specified by the experimenter, and if the blue line exceeds the red line, that point is predicted to be an attack.

Table 9. Performance metrics according to the threshold of each case.

Case     Threshold   Number of Anomalies Detected   F1-Score   TaP     TaR
C1-1     0.042       36                             0.501      0.413   0.635
C1-2     0.04        36                             0.504      0.416   0.639
C1-3     0.38        35                             0.503      0.415   0.638
C1-4     0.36        35                             0.507      0.417   0.646
C2-1     0.05        36                             0.482      0.396   0.611
C2-2     0.047       36                             0.496      0.413   0.620
C2-3     0.044       35                             0.504      0.425   0.621
C2-4     0.04        34                             0.513      0.437   0.622
C3-1     0.047       35                             0.484      0.401   0.611
C3-2     0.044       34                             0.496      0.417   0.611
C3-3     0.04        33                             0.506      0.428   0.619
C3-4     0.038       31                             0.487      0.409   0.603
C4-1     0.046       34                             0.500      0.417   0.624
C4-2     0.044       34                             0.504      0.421   0.628
C4-3     0.042       34                             0.501      0.415   0.632
C4-4     0.04        32                             0.498      0.420   0.611

4.3. Analysis of Performance Metrics

For the performance evaluation, a section was set in the test dataset. Among the 36 anomalies detected by C1-2, attacks ['1', '10', '16', '27', '33'] were selected. Those five attacks are shown in Figures 10–14.
Figure 10. Section 1: 2000~3000 (Attack 1). The red line is the threshold, and when the blue line crosses the red line, we predict an attack. The orange line is the actual attack range.

Figure 11. Section 2: 71,950~72,250 (Attack 10). The red line is the threshold, and when the blue line crosses the red line, we predict an attack. The orange line is the actual attack range.

Figure 12. Section 3: 114,375~114,675 (Attack 16). The red line is the threshold, and when the blue line crosses the red line, we predict an attack. The orange line is the actual attack range.

Figure 13. Section 4: 217,000~217,500 (Attack 27). The red line is the threshold, and when the blue line crosses the red line, we predict an attack. The orange line is the actual attack range.

Figure 14. Section 5: 262,100~262,500 (Attack 33). The red line is the threshold, and when the blue line crosses the red line, we predict an attack. The orange line is the actual attack range.

Comparison of Performance Metrics by Attack

Table 10 compares the anomaly detection performance of the five case intervals selected in the interval setting for the performance evaluation. The point-based standard metrics accuracy and F1-score and the range-based standard metrics TaP and TaR were used as performance metrics.

Table 10. Comparison of performance metrics by attack.

Section   Attack   Accuracy   F1-Score   TaP     TaR
1         1        0.953      0.819      0.778   0.865
2         10       0.883      0.654      0.534   0.843
3         16       0.916      0.895      0.895   0.895
4         27       0.952      0.774      0.654   0.940
5         33       0.9875     0.898      0.820   0.992

All five attacks scored an accuracy of 0.88 or higher. If only this metric is checked, we can conclude that all attacks were accurately detected. However, referring to the F1-score, which was always lower than the accuracy, as it is a combination of recall and precision, a score between TaP and TaR was derived. When evaluating anomaly detection in time series data by deriving various performance metric results, using point-based performance metrics may therefore lead to an inaccurate evaluation.

As evident in Figure 15, 71,955~72,000 was undetected, 72,000~72,045 was detected, 72,045~72,055 was undetected, 72,055~72,140 was undetected, 72,140~72,160 was detected, and 72,160~72,235 was undetected. From the performance index for this section, the accuracy was 0.883, which is unsuitable as an index for evaluating anomaly detection here. In contrast, the TaP and TaR scores considering the time series were 0.534 and 0.843. Regarding the ambiguous section, the actual duration of the attack was 71,955~72,235, and it can be inferred that the affected section began at 71,980.

Through this analysis, we see that although an attack was predicted, there are sections in which the attack was detected and sections in which it was not. Additionally, an attack (red line) occurred, but the blue line did not cross the threshold value, which confirms that the attack slowly affects the inside of the system. Therefore, to evaluate anomaly detection on time series data, time series features must be considered.
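To make the range-based scores concrete, the split into a detection score and a portion score (used for TaP_d/TaP_p and TaR_d/TaR_p below) can be sketched as follows. This is a simplified illustration of the idea behind TaPR [5], not its exact implementation:

    def detection_and_portion(anomaly, prediction):
        """anomaly, prediction: (start, end) index ranges for one attack.
        Detection score: 1.0 if the prediction overlaps the anomaly at all.
        Portion score: the fraction of the anomaly range that is covered."""
        a_start, a_end = anomaly
        p_start, p_end = prediction
        overlap = max(0, min(a_end, p_end) - max(a_start, p_start))
        detection = 1.0 if overlap > 0 else 0.0
        portion = overlap / (a_end - a_start)   # divide by the number of instances
        return detection, portion

For example, for the first detected segment of Attack 10, detection_and_portion((71955, 72235), (72000, 72045)) returns a detection score of 1.0 and a portion of about 0.16, which illustrates why a detected attack can still have a low portion score.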

Figure 15. Analysis of detected/undetected attacks for Section 2 (Attack 10).

Table 11 shows that TaP_d and TaR_d were 1.000 because an attack was detected. TaP_p was derived by considering the ambiguous instance value after calculating the case in which the actual anomaly and prediction overlap. TaR_p was derived by calculating the case in which the anomaly and prediction overlap and then dividing by the number of instances. As such, using TaP and TaR as performance metrics is more suitable for anomaly detection analysis than the accuracy or F1-score.

Table 11. Comparing performance metrics by attack.

TaP = 0.534                          TaR = 0.843
TaP_d          TaP_p                 TaR_d          TaR_p
1.000          0.365                 1.000          0.729
4.4. Application of the Proposed Algorithm

In this section, we define ambiguous intervals by applying the redefined algorithm proposed in Section 3. Figure 16 shows the size of the attack duration section, the real affected section, and the ambiguous section of the proposed technique. We set upper and lower lines based on the threshold value, which is the central line. The attack duration section does not need to be calculated, as it is defined in the test dataset. Therefore, we need to define the real affected section. Depending on the algorithm, we need to set the start and end points of the real affected section. In the process of setting the end point of the real affected section, the green point is stored in RI_end_time but does not last for 1 min; thus, the subsequent red point is stored in RI_end_time instead. In addition, because the end point of the attack duration section occurred after the end point of the real affected section, the absolute value obtained by subtracting the end point of the real affected section from the end point of the attack duration section is the size of the ambiguous section. To derive the TaPR scores using the size of the newly defined ambiguous interval, the interval is substituted into the partial scores for calculating TaR and TaP. Therefore, when calculating S(a',p), the improved TaPR score is derived by substituting the size of the newly defined ambiguous section. Derived as performance metrics, these values can be said to improve on the existing performance metrics because the size of the ambiguous section is defined to suit each attack. A minimal sketch of this end-point rule follows Figure 16.
Figure 16. An example of an ambiguous section obtained by applying the proposed algorithm.
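The following sketch illustrates the rule shown in Figure 16 under stated assumptions: candidate end points are given as (timestamp, seconds persisted) pairs, RI_end_time follows the paper's naming, and the other names are ours; the 1 min persistence check is simplified to a counter:

    def ambiguous_section_size(attack_end, candidate_ends, min_duration=60):
        """Set the real affected section's end (RI_end_time) as the first candidate
        point persisting for at least 1 min, then size the ambiguous section as the
        absolute difference from the attack duration section's end point."""
        RI_end_time = attack_end                  # fallback if no candidate persists
        for end_time, persisted in candidate_ends:
            if persisted >= min_duration:         # must last for 1 min
                RI_end_time = end_time
                break
        return RI_end_time, abs(attack_end - RI_end_time)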
5. Conclusions

Cybersecurity threats to industrial control systems that control and manage national infrastructure are on the rise. Therefore, the ability to detect and interpret anomalies in industrial control systems is important. To proactively detect and respond to threats, we need to use a real testbed-based operating system dataset. Furthermore, it is also important to analyze the results of predicting attacks. Industrial control system data mainly have the characteristics of time series. However, existing studies used a point-based performance evaluation method that does not consider these time series characteristics. Therefore, we used a real testbed-based operational dataset to perform anomaly detection and derived results using various performance metric methods.

We then redefined the ambiguous section algorithm for TaPR, a range-based performance metric. Traditional TaPR partial scores were calculated at the same rate without considering the characteristics of the attack, and it is difficult to conduct an accurate evaluation without considering those characteristics. Further, it is difficult to pinpoint when attacks on industrial control systems begin and end. Additionally, it is necessary to consider the time when the inside of the system returns to normal even after the attack is over. Therefore, we used the SPC upper and lower limit concepts to reflect the characteristics of these attacks. Using these concepts, we redefined the algorithm of the ambiguous section, and we could derive an evaluation that reflects the characteristics of the attack.

The proposed algorithm allows for a more accurate evaluation when performing anomaly detection targeting industrial control system-based time series datasets. However, we still need to quantitatively analyze whether the proposed performance indicators are superior to the existing performance metrics. This is because it is difficult to quantitatively express whether the interpretation was accurate, rather than simply determining that the anomaly detection was not successful.

Future research will apply the upper and lower limits used in the proposed algorithm to anomaly detection itself. Existing studies only used thresholds to determine anomalies and normality. Each process has its own characteristics after normalizing the various process values. Therefore, there is a limit to judging abnormal and normal with only one value. We plan to perform anomaly detection using the upper and lower limits of the SPC applied to the proposed algorithm together with the threshold value.
Author Contributions: Conceptualization, G.-Y.K.; methodology, G.-Y.K.; software, G.-Y.K.; validation, G.-Y.K., S.-M.L. and I.-C.E.; formal analysis, G.-Y.K. and I.-C.E.; investigation, G.-Y.K.; resources, G.-Y.K.; data curation, G.-Y.K.; writing—original draft preparation, G.-Y.K.; writing—review and editing, G.-Y.K., S.-M.L. and I.-C.E.; visualization, G.-Y.K.; supervision, I.-C.E.; project administration, I.-C.E.; funding acquisition, I.-C.E. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the Nuclear Safety Research Program through the Korea Foundation of Nuclear Safety (KoFONS) using financial resources granted by the Nuclear Safety and Security Commission (NSSC) of the Republic of Korea (No. 2106061), as well as the results of a study supported by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) under grant no. 2019-0-01343, Regional strategic industry convergence security core talent training business.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The funders had no role in the design of the study; in the collection, analyses,
or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References
1. Kevin, E.H.; Ronald, E.F. History of Industrial Control System Cyber Incidents, Internet Publication. Available online: https:
//www.osti.gov/servlets/purl/1505628 (accessed on 8 April 2022).
2. Joseph, S. Evolution of ICS Attacks and the Prospects for Future Disruptive Events, Internet Publication. Available online:
https://www.dragos.com/wp-content/uploads/Evolution-of-ICS-Attacks-and-the-Prospects-for-Future-Disruptive-Events-
Joseph-Slowik-1.pdf (accessed on 8 April 2022).
3. Stampar, M.; Fertalj, K. Artificial intelligence in network intrusion detection. In Proceedings of the 38th International Convention
on Information and Communication Technology, Electronics and Microelectronics, MIPRO, Opatija, Croatia, 25–29 May 2015.
4. Hongyu, L.; Bo, L. Machine Learning and Deep Learning Methods for Intrusion Detection Systems: A Survey. Appl. Sci. 2019, 9, 4396.
5. Hwang, W.S.; Yun, J.H.; Kim, J.; Kim, H.C. Time-Series Aware Precision and Recall for Anomaly Detection: Considering Variety of
Detection Result and Addressing Ambiguous Labeling. In Proceedings of the 28th ACM International Conference on Information
and Knowledge Management, CIKM ’19, Beijing, China, 3–7 November 2019; pp. 2241–2244.
6. Williams, T.J. The Purdue Enterprise Reference Architecture. IFAC Proc. Vol. 1993, 26 Pt 4, 559–564. [CrossRef]
7. CISCO. Network and Security in Industrial Automation Environments—Design Guide; CISCO: San Jose, CA, USA, 2020; pp. 8–13.
8. SANS. The Purdue Model and Best Practices for Secure ICS Architectures. Available online: https://www.sans.org/blog/
introduction-to-ics-security-part-2/ (accessed on 4 April 2022).
9. Kim, G.H. Industrial Control System Security, 1981th ed.; IITP: Daejeon, Korea, 2021; pp. 5–7.
10. Choi, S.O.; Kim, H.C. Energy sector infrastructure security monitoring plan based on MITRE ATT&CK framework. Rev. KIISC
2020, 30, 13–23.
11. Han, G.H. Trends in Standards and Testing and Certification Technology-Smart Manufacturing Security Standardization Status.
TTA J. 2018, 178, 80–90.
12. Korea Industrial Standards. Security for Industrial Automation and Control Systems—Part 4-2: Technical Security Requirements
for IACS Components. Available online: https://standard.go.kr/KSCI/standardIntro/getStandardSearchView.do?menuId=919&
topMenuId=502&upperMenuId=503&ksNo=KSXIEC62443-4-2&tmprKsNo=KS_C_NEW_2019_3780&reformNo=00 (accessed on
9 February 2022).
13. Shengyi, P.; Thomas, M.; Uttam, A. Developing a Hybrid Intrusion Detection System Using Data Mining for Power Systems.
IEEE Trans. Smart Grid 2015, 6, 3104–3113.
14. iTrust. BATtle of Attack Detection Algorithms (BATADAL). Available online: https://itrust.sutd.edu.sg/itrust-labs_datasets/
dataset_info/ (accessed on 9 February 2022).
15. iTrust. Water Distribution (WADI). Available online: https://itrust.sutd.edu.sg/itrust-labs-home/itrust-labs_wadi/ (accessed on
8 April 2022).
16. iTrust. Secure Water Treatment (SWaT). Available online: https://itrust.sutd.edu.sg/itrust-labs-home/itrust-labs_swat/ (accessed
on 8 April 2022).
17. Shin, H.K.; Lee, W.; Yun, J.H.; Min, B.G. Two ICS Security Datasets and Anomaly Detection Contest on the HIL-based Augmented
ICS Testbed. In Proceedings of the Cyber Security Experimentation and Test Workshop, CSET ’21, Virtual, CA, USA, 9 August
2021; ACM: New York, NY, USA, 2021; pp. 36–40.
18. Shin, H.K.; Lee, W.; Yun, J.H.; Kim, H.C. HAI 1.0: HIL-Based Augmented ICS Security Dataset. In Proceedings of the 13th USENIX
Workshop on Cyber Security Experimentation and Test (CSET 20), Boston, MA, USA, 10 August 2020.
19. Lee, T.J.; Justin, G.; Nesime, T.; Eric, M.; Stan, Z. Precision and Recall for Range-Based Anomaly Detection. In Proceedings of the
SysML Conference, Stanford, CA, USA, 15–16 February 2018.
20. Dmitry, S.; Pavel, F.; Andrey, L. Anomaly Detection for Water Treatment System based on Neural Network with Automatic
Architecture Optimization. arXiv 2018, arXiv:1807.07282.
21. Kim, J.; Yun, J.H.; Kim, H.C. Anomaly Detection for Industrial Control Systems Using Sequence-to-Sequence Neural Networks.
arXiv 2019, arXiv:1911.04831.
22. Giuseppe, B.; Mauro, C.; Federico, T. Evaluation of Machine Learning Algorithms for Anomaly Detection in Industrial Networks. In
Proceedings of the 2019 IEEE International Symposium on Measurements Networking (MN), Catania, Italy, 8–10 July 2019; pp. 1–6.
23. Sohrab, M.; Alireza, A.; Kang, K.; Arman, S. A Machine Learning Approach for Anomaly Detection in Industrial Control Systems
Based on Measurement Data. Electronics 2021, 10, 407.
Electronics 2022, 11, 1213 21 of 21

24. Kim, D.Y.; Hwang, C.W.; Lee, T.J. Stacked-Autoencoder Based Anomaly Detection with Industrial Control System. In Software
Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing; Springer: Cham, Switzerland, 2021; pp. 181–191.
25. Roland, C. Statistical process control (SPC). Assem. Autom. 1996, 16, 10–14.
26. Vanli, O.A.; Castillo, E.D. Statistical Process Control in Manufacturing; Encyclopedia of Systems and Control; Springer: London, UK, 2014.
27. David, E.R.; Geoffrey, E.H.; Ronald, J.W. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
28. Sepp, H.; Jürgen, S. Long Short-Term Memory. Neural Comput. 1997, 8, 1735–1780.
29. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
30. Koç, C.K. Analysis of sliding window techniques for exponentiation. Comput. Math. Appl. 1995, 30, 17–24. [CrossRef]
31. Ilya, L.; Frank, H. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning
Representations, ICLR, New Orleans, LA, USA, 6–9 May 2019.
