
A Game-Theoretic Approach to
Cross-Layer Security Decision-Making
in Industrial Cyber-Physical Systems
Kaixing Huang, Chunjie Zhou, Yuanqing Qin, Weixun Tu

Abstract—Current security measures in industrial Cyber-Physical Systems (ICPSs) lack the active decision-making capability to defend against highly organized cyber-attacks. In this paper, a security decision-making approach based on a stochastic game model is proposed to characterize the interaction between attackers and defenders in ICPSs and to generate optimal defense strategies that minimize system losses. The major distinction of this approach is that it presents a practical way to build a cross-layer security game model for ICPSs by means of quantitative vulnerability analysis and time-based unified payoff quantification. A case study on a hardware-in-the-loop simulation testbed is carried out to demonstrate the feasibility of the proposed approach.

Index Terms—Industrial Cyber-Physical System, security, game theory, decision-making.

Manuscript received September 29, 2018; revised January 23, 2019; accepted March 5, 2019. The work of C. Zhou was supported in part by the National Science Foundation of China under Grant 61433006, Grant 61873103, and Grant 61272204. (Corresponding author: Chunjie Zhou.) K. Huang, C. Zhou, Y. Qin, and W. Tu are with the Key Laboratory of Image Processing and Intelligent Control, Ministry of Education, and the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: hyanglu1573@hust.edu.cn; cjiezhou@hust.edu.cn; qinyuanqing@hust.edu.cn; m201672519@hust.edu.cn).

I. INTRODUCTION

The massive deployment of information and communication technologies in industry is transforming traditional legacy electromechanical systems into modern industrial Cyber-Physical Systems (ICPSs), which tightly integrate the cyber space with the physical space [1]. ICPSs are expected to significantly improve manufacturing productivity and enable smart services. However, ICPSs increasingly suffer from cyber-attacks due to their growing connections to the Internet [2]. As cyber-attacks against ICPSs can cause equipment damage, environmental pollution, or even fatalities [3], ensuring the security of ICPSs is an issue of great concern.

Existing security countermeasures for ICPSs (e.g., encryption, access control, intrusion detection) lack a quantitative decision-making mechanism to actively defend against advanced persistent threats [4], [5]. Game theory, as an effective formal tool for strategic behavior analysis, provides the capability to quantitatively model the interaction between attackers and defenders, which can guide system operators to carry out appropriate attack mitigation strategies and reduce the loss caused by attacks [6]. Recently, some researchers have employed game theory to propose security decision-making approaches for ICPSs.

In [7], the authors propose a hybrid approach that combines game theory with classical optimization to produce decision support for the defenders of ICPSs. Feng et al. [8] integrate risk assessment with a game model to optimize the allocation of defensive resources across multiple chemical facilities. Chen et al. [9] present a comprehensive game framework to seek reliable strategies for defending power systems. Yuan et al. [10] build a hierarchical Stackelberg game model to address the problem of resilient control of networked control systems under denial-of-service attacks. Niu and Jagannathan [11] use a zero-sum game to derive the optimal strategy for defending dynamic systems in the presence of both cyber-attacks and physical disturbances. In [4], the authors introduce a games-in-games principle for defending ICPSs. This is a cross-layer solution for resilient defense of ICPSs, and it is quite promising for protecting ICPSs against cross-layer attacks in which the attacker penetrates from the cyber space into the physical space.

Despite these previous efforts, the majority of them make overly simple assumptions about the cyber layer of ICPSs: they generally model the cyber layer as multiple independent elements (e.g., [7]–[9]) or as abstract dynamic systems (e.g., [4], [11]). In fact, ICPSs contain many kinds of devices communicating with each other through complex networks in the cyber layer. The lack of adequate cyber-layer modeling makes these methods not entirely applicable to real-world ICPSs.

Moreover, most previous works assume that the game model parameters (e.g., gains, losses, game state transition probabilities) can be obtained from security experts. In reality, ICPSs are usually very complex, so it is difficult, if not impossible, to build a game model with all parameters accurately assigned by security experts. For example, the payoff (net gain) parameters in the cyber layer and the physical layer have to be evaluated with totally different metrics in existing approaches. In the cyber layer, payoffs are typically measured in dollar values, while payoffs in the physical layer are usually quantified in terms of control performance degradation. Such inconsistent quantification metrics increase the difficulty of building a comprehensive security game framework that covers both the cyber and physical layers. Besides, as indicated in [12], traditional methods usually fail to explore the patterns among system variables, while a data-driven realization can take full advantage of the abundant running data to build more accurate models.


In short, a unified payoff quantification method with a data-driven parameter learning ability can help alleviate the difficulty of formulating a cross-layer security game model for ICPSs.

Motivated by the discussion above, this paper presents a stochastic game model for cross-layer security decision-making in ICPSs. In view of the difficulties in game parameter configuration, we first analyze the probability of successful vulnerability exploitation to obtain the game state transition probability distribution, and then develop a time-based unified payoff quantification method to quantify the gains and losses of the game players. Furthermore, a Q-learning algorithm is devised to learn the optimal strategy profiles when a portion of the model parameters still cannot be accurately specified with the proposed method. The main contributions of this work are:

1) A cross-layer game-theoretic approach to security decision-making in ICPSs is proposed to defend against cross-layer attacks in which the attacker penetrates from the cyber space into the physical space.
2) Unlike many previous game-theoretic security approaches, which construct two separate game models for the cyber and physical layers of ICPSs, our approach only needs to build one game model, with the help of a time-based unified payoff quantification method.
3) A reinforcement learning algorithm is devised for solving the security game model of ICPSs, so that the optimal strategy profiles can still be learned when some of the game model parameters cannot be accurately specified.

The remainder of this paper is organized as follows. In Section II, we model the attack propagation process in ICPSs. Section III presents the game formulation in detail. Section IV introduces the proposed Q-learning algorithm. A case study is then described in Section V. Finally, we conclude the paper and outline future work in Section VI.

II. ATTACK MODELING

In this section, we introduce the architecture of ICPSs, describe the attack propagation process, and quantify the probability of successful attacks based on vulnerability analysis, which will contribute to the game formulation process in the next section.

A. Attack Propagation Process

A typical ICPS has a hierarchical structure and consists of three networks from the top down: the corporate network, the control network and the physical network [13]. Due to the complexity of ICPSs, cyber-attacks are normally launched from remote hosts and infiltrate into the system through the corporate network. Since the goal of the attacker is typically to disrupt the physical process under control [14], the attacker needs to keep exploiting system vulnerabilities for privilege escalation until he/she reaches the physical network. Fig. 1 illustrates the attack propagation process.

Fig. 1. Attack propagation process in ICPSs. (The corporate and control networks form the cyber layer; the plant, sensors and actuators form the physical layer.)

As shown in Fig. 1, the physical layer comprises the physical plant to be controlled and the control components (including sensors, actuators, etc.); the cyber layer consists of the network devices that establish physical-layer communications and many other management devices. Traditional security measures in the IT domain generally only consider the security issues in communication channels and computers, namely the cyber layer in Fig. 1. However, in ICPSs the control system in the physical layer requires more attention, because cyber-attacks can cause tremendous damage once they have penetrated into the physical layer.

B. Probability of Success of Attacks

In order to quantify the probability of a successful attack that exploits a specific vulnerability, we adopt the metrics provided by the Common Vulnerability Scoring System (CVSS) [15], an open industry standard for rating the severity of vulnerabilities. CVSS has three groups of metrics: base, temporal and environmental. Specifically, only the exploitability subscore defined in the base group is considered here, because it evaluates the difficulty of exploiting a vulnerability. The exploitability subscore consists of the access vector (S_AV), the access complexity (S_AC) and the authentication (S_AU) metrics. The scores of S_AV, S_AC and S_AU for each vulnerability can be obtained from public vulnerability databases. Then, the probability of successful exploitation of a vulnerability is computed using the following equation [16]:

p = 2 × S_AV × S_AC × S_AU.    (1)
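As a concrete illustration, the following minimal sketch (an illustration only, not part of the original implementation) evaluates (1) with the standard CVSS v2 values of the three base metrics; the metric levels of a given vulnerability would be looked up in a public database such as the NVD.

# Standard CVSS v2 values for the three exploitability metrics.
ACCESS_VECTOR = {"LOCAL": 0.395, "ADJACENT": 0.646, "NETWORK": 1.0}
ACCESS_COMPLEXITY = {"HIGH": 0.35, "MEDIUM": 0.61, "LOW": 0.71}
AUTHENTICATION = {"MULTIPLE": 0.45, "SINGLE": 0.56, "NONE": 0.704}

def exploit_probability(av: str, ac: str, au: str) -> float:
    """Probability of successful exploitation, p = 2 * S_AV * S_AC * S_AU as in (1).
    For CVSS v2 metric values the result always lies in [0, 1]."""
    return 2.0 * ACCESS_VECTOR[av] * ACCESS_COMPLEXITY[ac] * AUTHENTICATION[au]

# Example: a network-reachable, low-complexity vulnerability requiring no
# authentication (such as V1 in Table II) yields p = 0.99968, i.e. roughly 1.0.
print(exploit_probability("NETWORK", "LOW", "NONE"))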


Since more than one vulnerability sometimes needs to be exploited to carry out an attack, the relationship between these vulnerabilities has to be identified in order to calculate the success probability of the attack. Here we adopt the logical AND gate and the logical OR gate to describe the dependencies between these vulnerabilities. Given an atomic attack a which can be achieved by exploiting a set of vulnerabilities {v_1, v_2, ..., v_N}, a logical AND assumes that all the vulnerabilities must be exploited to implement a. Accordingly, the probability p(a) of success for a is expressed as:

p(a) = 0 if ∃ v_i with p(v_i) = 0;  otherwise p(a) = ∏_{i=1}^{N} p(v_i),    (2)

where p(v_i) denotes the probability that v_i is successfully exploited, which is acquired by (1). A logical OR only requires that at least one vulnerability be exploited, so we have:

p(a) = 0 if p(v_i) = 0 for all v_i;  otherwise p(a) = 1 − ∏_{i=1}^{N} (1 − p(v_i)).    (3)
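The two combination rules are straightforward to implement. The helpers below are a small sketch under the same notation (an illustration, not the authors' code); the probabilities passed in would come from (1).

from math import prod

def p_and(probabilities):
    """Logical AND gate of (2): all vulnerabilities must be exploited."""
    if any(p == 0 for p in probabilities):
        return 0.0
    return prod(probabilities)

def p_or(probabilities):
    """Logical OR gate of (3): at least one vulnerability must be exploited."""
    if all(p == 0 for p in probabilities):
        return 0.0
    return 1.0 - prod(1.0 - p for p in probabilities)

# An atomic attack that depends on two vulnerabilities with p = 0.86 and p = 1.0:
print(p_and([0.86, 1.0]))   # 0.86 when both are required
print(p_or([0.86, 1.0]))    # 1.0 when either one suffices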

III. GAME FORMULATION

A. Overview of the Security Game Model

A game problem involves at least three elements: players, strategies and payoffs. Players are the decision makers. A strategy is an action that a player may take. A combination of selected strategies is a strategy profile. The payoff is the net gain to a player resulting from a specific strategy profile. For more information about game-theoretic concepts, please refer to [6].

Due to the complexity of ICPSs, the attacker usually does not have complete information about the system, so the attack behavior is not deterministic. To account for this kind of uncertainty, we formulate the attack-defense interactions in ICPSs as a two-player stochastic game. A stochastic game proceeds from one state to another according to a probability distribution determined jointly by the strategies of the defender and the attacker, as well as by the current game state. A stochastic game G can be defined as a 9-tuple:

G = <P, S, A, D, O, π^A, π^D, U^A, U^D>,    (4)

where all the elements are explained in Table I.

TABLE I
DEFINITION OF GAME PARAMETERS

Players: P = {P^A, P^D}; P^A is the attacker and P^D is the defender.
States: S = {S_1, S_2, ..., S_T}.
Attacker's actions: A = {a_1, a_2, ..., a_L}.
Defender's actions: D = {d_1, d_2, ..., d_M}.
Transition probability: O : S × A × D × S → [0, 1].
Attacker's strategy profile: π^A = {π_1^A, π_2^A, ..., π_T^A}, with π_i^A ∈ A for all i.
Defender's strategy profile: π^D = {π_1^D, π_2^D, ..., π_T^D}, with π_i^D ∈ D for all i.
Attacker's payoff: U^A : S × A × D × S → R.
Defender's payoff: U^D : S × A × D × S → R.

Besides, we assume that ICPSs are equipped with intrusion detection systems [17] that can detect attacks in real time. Consequently, the defender knows the attacker's actions and the system states. Likewise, the attacker can probe the system to find out what the defender has done.

B. Actions

The actions the attacker can adopt are all the possible atomic attacks against the system (e.g., structured query language injection, buffer overflow, spoofing, etc.), as well as the no-operation (NOP) action. Each atomic attack is associated with a vulnerability in the system, and the probability of success of an attack action a, denoted p(a), can be computed using (1). The defender's actions include all the security countermeasures (e.g., installing patches, monitoring, restarting, etc.) and the NOP action.

Note that only part of the action set is available to a player in each game stage. Denote the possible actions of the attacker and the defender in state t as A_t and D_t, respectively; then A_t ⊂ A, ∪_{t=1}^{T} A_t = A, D_t ⊂ D, ∪_{t=1}^{T} D_t = D.

C. States and State Transitions

A state of the stochastic game is defined as S_i = <o_1, o_2, ..., o_H>, where H is the total number of system devices and o_i indicates the current operation mode of the i-th device. o_i is a binary value: "1" means that the device has been compromised by the attacker and "0" means that it has not. Consequently, the total number of game states is T = 2^H.

Suppose that the attacker's action set in game state t is A_t = {A_{t,1}, A_{t,2}, ...} and the defender's action set is D_t = {D_{t,1}, D_{t,2}, ...}, and that the attacker tries to take action A_{t,i} to compromise host h while the defender plans to take defense action D_{t,j}. Then the probability of a successful state transition from S_t to S_k is:

p(S_k | S_t, A_{t,i}, D_{t,j}) = ε(A_{t,i}, D_{t,j}),    (5)

where S_k indicates the state in which the attacker has compromised h, and ε(A_{t,i}, D_{t,j}) is the probability of success of attack A_{t,i}.

D. Payoffs

In each game stage, the payoff of the attacker equals the benefit minus the cost. Here we propose a time-based unified quantification method to compute the payoffs. Denote the attack-defense action pair in state t as (A_{t,i}, D_{t,j}); then the attacker's payoff in state t is given by:

U_t^A(A_{t,i}, D_{t,j}) = ε(A_{t,i}, D_{t,j}) R(A_{t,i}) + T(D_{t,j}) − C(A_{t,i}),    (6)

where C(A_{t,i}) is the amount of time the attacker needs to implement A_{t,i}, T(D_{t,j}) is the amount of time the defender needs to perform D_{t,j}, and R(A_{t,i}) represents the time to recover (TTR) the compromised device after the attack.

Practically speaking, the defender gains little from defending against attacks by deploying security measures. Here we assume that the payoff U_t^D of the defender is proportional to the negative of the attacker's payoff, that is:

U_t^D(A_{t,i}, D_{t,j}) = −η U_t^A(A_{t,i}, D_{t,j}),    (7)

where η > 0. For brevity, we can set η to 1. The game then becomes a zero-sum game, i.e., the payoff of the attacker plus that of the defender equals zero.
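To make the time-based quantification concrete, the following sketch evaluates (6) and (7) for one game stage. The numbers are hypothetical and only illustrate the common unit of measurement (minutes); they are not taken from the case study.

def attacker_payoff(eps, recover_time, defense_time, attack_time):
    """Stage payoff of the attacker, U_t^A in (6): expected recovery time imposed
    on the defender, plus the defender's own action time, minus the attacker's
    action time (all quantities in minutes)."""
    return eps * recover_time + defense_time - attack_time

def defender_payoff(u_attacker, eta=1.0):
    """Stage payoff of the defender, U_t^D in (7); eta = 1 yields a zero-sum game."""
    return -eta * u_attacker

# Hypothetical stage: the attack succeeds with probability 0.86, forces 10 min of
# recovery and costs the attacker 5.2 min, while the chosen countermeasure keeps
# the defender busy for 8 min.
u_a = attacker_payoff(eps=0.86, recover_time=10.0, defense_time=8.0, attack_time=5.2)
print(u_a, defender_payoff(u_a))   # 11.4 -11.4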


With the help of the TTR metric, we can quantify the payoffs in both the cyber and the physical layers in a unified way, and thereby build a cross-layer security game model for ICPSs. In the cyber layer of an ICPS, R(A_{t,i}) is the amount of time it takes to bring the compromised host back to a normal state, typically by restarting the host or switching to a backup device. In the physical layer, if the attacker has successfully attacked a sensor or an actuator by taking action A_{t,i}, then R(A_{t,i}) is defined as:

R(A_{t,i}) = R_c(A_{t,i}) + R_p(A_{t,i}),    (8)

where R_c(A_{t,i}) is the TTR for the compromised device in the physical layer and R_p(A_{t,i}) is the TTR for the physical process under control.

When a sensor or an actuator has been compromised by an attacker, the evolution of the physical process depends on the intrinsic system dynamics and the injected attack signals. Since we usually do not know the injected attack signals in advance, it is difficult to obtain the value of R_p(A_{t,i}). Therefore, we consider the worst-case situation in order to quantify R_p(A_{t,i}) when the attacker has not yet disturbed the physical process. According to [18], the MIN and MAX attacks described in (9) are usually the most effective attacks for an attacker who wants to disrupt a plant but does not know the system dynamics:

MIN attack: y_i = y_i^min, u_i = u_i^min;  MAX attack: y_i = y_i^max, u_i = u_i^max.    (9)

In (9), y denotes the sensor measurements and u the control commands. Consequently, we can compute the TTRs for the physical process under MIN and MAX attacks by means of numerical simulation. Afterwards, the larger of the two TTRs is assigned as the value of R_p(A_{t,i}).

When the attacker has injected attack signals into the control system, the physical process can be described as a discrete-time linear time-invariant system with unknown inputs [19]:

x_{k+1} = A x_k + B u_k + B_a u'_k + w_k,    (10a)
y_k = C x_k + v_k,    (10b)

where x is the system state, u'_k is the injected attack signal, B_a is the matrix associated with the attack signals, and A, B and C are matrices representing the dynamics of the physical process. The work in [19] designed an unbiased minimum-variance state estimator for the system described in (10) to estimate the system states while attacks are in progress. This estimator has the following form:

x̂_{k+1} = A x̂_k + B u_k + L_{k+1} [y_{k+1} − C A x̂_k − C B u_k],    (11)

where x̂_k is the estimate of x_k and L_{k+1} is a gain parameter. Based on this estimator, we can estimate x_k during the period when the physical process is under attack and the defender is recovering the compromised sensor or actuator, and finally compute the value of R_p(A_{t,i}).

E. Strategy Profiles

A strategy profile consists of the actions a player adopts in each game state. In each state t, the player takes actions according to a probability distribution. The attacker's and the defender's strategies in state t can then be represented as π_t^A = {π_t^A(A_{t,1}), π_t^A(A_{t,2}), ..., π_t^A(A_{t,I})} and π_t^D = {π_t^D(D_{t,1}), π_t^D(D_{t,2}), ..., π_t^D(D_{t,J})}, respectively, where π_t indicates the probability of choosing a specific action in state t, with 0 ≤ π_t^A(A_{t,i}) ≤ 1, 0 ≤ π_t^D(D_{t,j}) ≤ 1, Σ_{i=1}^{I} π_t^A(A_{t,i}) = 1 and Σ_{j=1}^{J} π_t^D(D_{t,j}) = 1.

Since the attacker and the defender choose actions in a stochastic manner, the strategies they adopt are called mixed (or stochastic) strategies [6]. If ∃ i such that π_t^A(A_{t,i}) = 1 and π_t^A(A_{t,l}) = 0 for all l ≠ i, then the attack strategy π_t^A is called a pure strategy (for the defender, correspondingly, ∃ j such that π_t^D(D_{t,j}) = 1 and π_t^D(D_{t,l}) = 0 for all l ≠ j). Obviously, a pure strategy is a special case of a mixed strategy.

In a game problem, each player wants to maximize his/her total payoff by finding the optimal strategy profile, which is characterized by the Nash equilibrium [6]. In a Nash equilibrium, no player wants to deviate unilaterally from the strategy profile. Denote π_{t,i}^A as a possible mixed strategy for the attacker in state t (π_{t,j}^D for the defender); then a mixed strategy pair (π_{t,*}^A, π_{t,*}^D) is a Nash equilibrium solution if:

∀ π_{t,i}^A:  U_t^A(π_{t,*}^A, π_{t,*}^D) ≥ U_t^A(π_{t,i}^A, π_{t,*}^D),
∀ π_{t,j}^D:  U_t^D(π_{t,*}^A, π_{t,*}^D) ≥ U_t^D(π_{t,*}^A, π_{t,j}^D),    (12)

where U denotes the expected payoff under the given mixed strategies. The optimal attack and defense strategy profiles are then π_*^A = {π_{1,*}^A, π_{2,*}^A, ..., π_{T,*}^A} and π_*^D = {π_{1,*}^D, π_{2,*}^D, ..., π_{T,*}^D}, respectively.

F. Game Solution

A stochastic game is the combination of a set of matrix games and a Markov decision process [20]. Therefore, given the attacker's action set A_t = {A_{t,1}, A_{t,2}, ..., A_{t,I}} and the defender's action set D_t = {D_{t,1}, D_{t,2}, ..., D_{t,J}} in state t, the game in state t can be regarded as a matrix game G_t. G_t is an I × J matrix whose elements are the payoffs gained or lost when the players take the corresponding actions. The element in the i-th row and j-th column of G_t is given by:

g_{i,j}^t = U_t^A(A_{t,i}, D_{t,j}) + Σ_{k=1}^{T} p(S_k | S_t, A_{t,i}, D_{t,j}) V_k,    (13)

where V_k is the value of G_k [20]. In our case, V_k is the expected payoff of the attacker in a Nash equilibrium.

In the matrix game G_t, the attacker is the row player and the defender is the column player. Accordingly, at game state t the players choose a row i and a column j, and the payoffs for the attacker and the defender are g_{i,j}^t and −g_{i,j}^t, respectively. Note that G_t is a two-player zero-sum game, so a pure-strategy Nash equilibrium is a saddle point of the matrix G_t [6]. If no saddle point exists in G_t, then the optimal strategy is a mixed strategy.
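For a zero-sum matrix game without a saddle point, the value and an optimal mixed strategy can be computed with a standard linear program. The sketch below is an illustration of that per-state solving step (assuming NumPy and SciPy are available; it is not the authors' implementation); the formal procedure actually used in this paper is given by (14) and Algorithm 1 below.

import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(G):
    """Value and optimal mixed strategy of the row player (the attacker) for a
    zero-sum matrix game G (rows = attack actions, columns = defense actions).
    The defender's optimal mix is the row solution of the game -G transposed."""
    G = np.asarray(G, dtype=float)
    n_rows, n_cols = G.shape
    # Decision variables: [x_1, ..., x_n, v]; maximize the game value v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                   # linprog minimizes, so use -v
    # For every defender column j:  v - sum_i x_i * G[i, j] <= 0.
    A_ub = np.hstack([-G.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # The row probabilities sum to one.
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]                   # game value, mixed strategy

# A 2 x 2 stage game without a saddle point: the optimal attacker mix is
# (0.6, 0.4) and the game value is 1.0.
value, mix = solve_matrix_game([[3.0, -1.0], [-2.0, 4.0]])
print(round(value, 3), np.round(mix, 3))

Repeating this per-state solution while updating the state values, as in Algorithm 1 below, yields the optimal profiles for all states.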


In such a situation, the solution of G_t is obtained by solving the following linear programming problem:

∂U_t^A(π_t^A, π_t^D) / ∂π_t^A(A_{t,i}) = 0,  i = 1, 2, ..., I − 1,
∂U_t^D(π_t^A, π_t^D) / ∂π_t^D(D_{t,j}) = 0,  j = 1, 2, ..., J − 1,    (14)

with the constraints Σ_{i=1}^{I} π_t^A(A_{t,i}) = 1 and Σ_{j=1}^{J} π_t^D(D_{t,j}) = 1.

To solve the proposed stochastic game and obtain the optimal strategies for all game states, we adapt the algorithm proposed in [21] to our case. The principle of that algorithm is to iteratively update V_t until the deviation between two iterations is less than a predefined threshold δ: |V_t^{r+1} − V_t^r| < δ. The pseudo code is given in Algorithm 1. Using Algorithm 1, we can obtain the optimal defense strategy profile π_*^D, as well as the optimal attack strategy profile π_*^A.

Algorithm 1 Compute the optimal strategy profile.
Input: A stochastic game G
Output: Optimal attack-defense strategy profile (π_*^A, π_*^D)
1: r ← 0
2: Randomly initialize V as V^0 = {V_1^0, V_2^0, ..., V_T^0}
3: repeat
4:    for each G_t ∈ G = {G_1, G_2, ..., G_T} do
5:        ∀ g_{i,j}^t ∈ G_t, replace V_k in (13) with V_k^r
6:        Solve the matrix game G_t and obtain (π_{t,*}^{A,r}, π_{t,*}^{D,r})
7:        ∀ V_t^{r+1} ∈ V^{r+1}, V_t^{r+1} ← U_t^A(π_{t,*}^{A,r}, π_{t,*}^{D,r})
8:    end for
9:    r ← r + 1
10: until ∀ t ∈ {1, 2, ..., T}, |V_t^{r+1} − V_t^r| < δ
11: for each G_t ∈ G = {G_1, G_2, ..., G_T} do
12:    Solve the game and obtain (π_{t,*}^A, π_{t,*}^D)
13: end for
14: return π_*^A = {π_{t,*}^A}, π_*^D = {π_{t,*}^D}

IV. REINFORCEMENT LEARNING

In general, one of the most difficult tasks of a game problem is to accurately specify the model parameters [4], because we usually do not have enough domain knowledge. Algorithm 1 can be used to solve the game problem only when both the attacker and the defender know all the model parameters completely. Therefore, in this section a Q-learning algorithm is devised to help the players learn the optimal strategies without accurately knowing the model parameters.

Q-learning is a model-free learning algorithm that belongs to the category of reinforcement learning [22]. The goal of Q-learning is to find an action sequence that generates the maximal cumulative reward in a trial-and-error manner. The quality of each action is assessed through feedback from the environment, known as the reward. In the case of attacks against ICPSs, the attacker can probe the system to find out what actions the defender has taken, so both the attacker and the defender know each other's actions. Therefore, by applying Q-learning to the stochastic game, the players can approximate the optimal strategy profiles by iteratively updating a Q-function.

Suppose the current game state is t and the game transits to state k under the attack-defense action pair (A_{t,i}, D_{t,j}); then the Q-function for the attacker can be defined as:

Q^A(S_t, A_{t,i}, D_{t,j}) ← (1 − α) Q^A(S_t, A_{t,i}, D_{t,j}) + α [r(S_t, A_{t,i}, D_{t,j}) + β W_k],    (15)

where the learning rate α indicates how fast the player updates the Q-function with new reward information, β is a discount factor, r(S_t, A_{t,i}, D_{t,j}) represents the reward of the attacker, and W_k is the value of the matrix game Q^A(S_k).

Empirically, setting α = 1 means that the player focuses only on the immediate and future rewards, thus losing the previously learnt knowledge and possibly causing divergence. On the contrary, α = 0 means the algorithm has no ability to learn. Hence α is usually chosen as a tradeoff between aggressiveness and conservativeness. The reward r(S_t, A_{t,i}, D_{t,j}) of the attacker in our game model is defined as the attacker's payoff U_t^A(A_{t,i}, D_{t,j}). Since the proposed game model is a zero-sum game, we only need to specify one Q-function, and the Q-function of the defender is Q^D(S_t, A_{t,i}, D_{t,j}) = −Q^A(S_t, A_{t,i}, D_{t,j}). Finally, the optimal defense strategy is the one that maximizes the value of Q^D(S_k), i.e.:

π_{k,*}^D = arg max_{π_k^D} min_{a∈A_k} Σ_{d∈D_k} π_k^D(d) Q^D(S_k, a, d).    (16)

The solution of (16) can also be obtained by means of linear programming, as for (14).
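The update (15) can be sketched as follows. This is an illustration only, with arbitrary default values of α and β; the successor-state value W and the maximin strategy of (16) would be computed by solving the corresponding zero-sum matrix game, for example with the linear programming sketch given in Section III-F.

import numpy as np

def minimax_q_update(Q_state, i, j, reward, W_next, alpha=0.5, beta=0.9):
    """One application of (15) to the attacker's Q-matrix of the current state.
    Q_state is the |A_t| x |D_t| Q-matrix, (i, j) is the observed attack/defense
    action pair, and W_next is the value of the successor state's matrix game."""
    Q_state[i, j] = (1.0 - alpha) * Q_state[i, j] + alpha * (reward + beta * W_next)
    return Q_state

# Toy usage: a state with two attack and two defense actions, an observed
# attacker reward of 11.4 and a successor-state value of 2.8 (beta = 1 as in
# the case study below); only the visited entry moves towards 11.4 + 2.8.
Q_s = np.zeros((2, 2))
minimax_q_update(Q_s, i=0, j=1, reward=11.4, W_next=2.8, alpha=0.5, beta=1.0)
print(Q_s)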


V. CASE STUDY

In this section, we implement our game model on a simulated simplified Tennessee-Eastman (STE) process control system and evaluate the experimental results.

A. Experiment Setup

The STE system [23] is a chemical reactor plant that has been widely adopted as a testbed in fault diagnosis studies. Fig. 2 illustrates the architecture of the hardware-in-the-loop simulation testbed, in which the STE process is simulated on an agent host. As shown in Fig. 2, AH1 and AH2 have access to the control network, and the devices in the control network communicate with the controllers using the Modbus protocol, a widely used industrial protocol. An attacker tries to penetrate into the system from the corporate network and manipulate the sensors/actuators to disrupt the STE process.

Fig. 2. Structure of the STE simulation testbed. (Legend: AH: Administration Host; HMI: Human Machine Interface; ES: Engineering Station; PC: Pressure Controller; FC: Flow Controller; CC: Composition Controller; FS: Flow Sensor; PS: Pressure Sensor; IS: Ingredient Sensor; VA: Valve; CAN: Controller Area Network.)

The STE testbed has several vulnerabilities, as summarized in Table II. The "Exploitability" column of the table gives the vulnerability exploitation probability computed with (1), where the parameters S_AV, S_AC and S_AU are acquired from the CVE (Common Vulnerabilities & Exposures) database by querying the CVE ID. V3 is a Modbus protocol vulnerability: Modbus lacks an authentication mechanism, which allows unauthenticated attackers to send arbitrary messages. V5 is a SQL (Structured Query Language) database vulnerability that only exists in Siemens WinCC (Windows Control Center), an industrial supervisory software package.

TABLE II
VULNERABILITY INFORMATION

No. | Host       | CVE ID        | Exploitability | Description
V1  | AH1, AH2   | CVE-1999-0547 | 1.0            | File transfer protocol rhost vulnerability allows unauthorized users to access the service
V2  | AH2        | CVE-2006-2421 | 1.0            | SSH (Secure Shell) buffer overflow vulnerability allows attackers to execute arbitrary code
V3  | ES         | Not indexed   | 0.86           | Modbus authentication vulnerability allows illegal users to send forged messages
V4  | HMI        | CVE-2002-0965 | 1.0            | Oracle TNS (Transparent Network Substrate) listener allows denial of service attacks
V5  | IS, PS, FS | CVE-2013-3957 | 1.0            | Arbitrary SQL command execution in Siemens WinCC

By exploiting the vulnerabilities listed in Table II, the attacker can infiltrate into the system. Here we assume that the attack target is the sensors/actuators and that the attacker will not waste time attacking a device that cannot help him/her move towards this target. Consequently, when building the stochastic game model, we find that most game states are infeasible. Table III lists the feasible game states of interest, and Table IV enumerates all the attacker's possible actions. The available attack action set in each game state is: A_1 = {a_1, a_2, a_3, a_9}, A_2 = {a_4, a_5, a_9}, A_3 = {a_4, a_5, a_9}, A_4 = {a_6, a_7, a_8, a_9}, A_5 = {a_4, a_9}, A_6 = ∅, A_7 = ∅, A_8 = ∅. In addition, the TTRs for each successful attack action are presented in Table V.

TABLE III
IMPORTANT GAME STATES

State | Description             | State | Description
S1    | Normal system operation | S5    | Root privilege on HMI
S2    | User privilege on AH1   | S6    | Manipulation on IS
S3    | User privilege on AH2   | S7    | Manipulation on FS
S4    | User privilege on ES    | S8    | Manipulation on PS

TABLE IV
ATTACK ACTIONS

Action | Description                                 | Cost (min)
a1     | Exploit V1 to acquire user privilege on AH1 | 4.5
a2     | Exploit V1 to acquire user privilege on AH2 | 3.5
a3     | Exploit V2 to acquire user privilege on AH2 | 2.8
a4     | Exploit V3 to disguise as ES                | 5.2
a5     | Exploit V4 to acquire root privilege on HMI | 3.0
a6     | Exploit V5 to manipulate the reading of IS  | 7.0
a7     | Exploit V5 to manipulate the reading of FS  | 7.5
a8     | Exploit V5 to manipulate the reading of PS  | 8.0
a9     | No operation                                | 0

TABLE V
TIME TO RECOVERY

Action | Rc (min) | Rp (min)
a1     | 7.8      | –
a2     | 7.5      | –
a3     | 6.0      | –
a4     | 5.6      | –
a5     | 6.7      | –
a6     | 4.0      | 6.0
a7     | 5.4      | 7.8
a8     | 4.7      | 4.7

According to Table III and Table IV, the attacker's game state transition graph can be drawn, as shown in Fig. 3. The attacker first needs to exploit the vulnerabilities of AH1 or AH2 to obtain access to the control network, then tries to attack the devices in the control network and disguise as a legitimate host to communicate with the controllers, and finally seeks to compromise IS, FS or PS to disrupt the STE process.

Fig. 3. Attacker's game state transition graph.

In contrast to the attack actions, the defender's actions are listed in Table VI. The available defense action set in each state is: D_1 = {d_1, d_2, d_3, d_4, d_12}, D_2 = {d_5, d_6, d_7, d_8, d_12}, D_3 = {d_5, d_6, d_7, d_8, d_12}, D_4 = {d_9, d_10, d_11, d_12}, D_5 = {d_5, d_6, d_12}, D_6 = ∅, D_7 = ∅, D_8 = ∅.

TABLE VI
DEFENSE ACTIONS

Action | Description                                      | Cost (min)
d1     | Patch V1 on AH1                                  | 8.4
d2     | Patch V1 on AH2                                  | 7.9
d3     | Close SSH on AH2                                 | 1.2
d4     | Patch V2 on AH2                                  | 7.5
d5     | Encrypt Modbus packets                           | 9.5
d6     | Employ access control on ES                      | 8.0
d7     | Patch V4 on HMI                                  | 7.5
d8     | Employ access control on the Oracle TNS listener | 9.0
d9     | Encrypt the messages between CC and IS           | 9.3
d10    | Encrypt the messages between FC and FS           | 10.7
d11    | Encrypt the messages between PC and PS           | 12.0
d12    | No operation                                     | 0
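For readers who wish to reproduce the setup, the data of Tables IV-VI can be encoded directly for use with the payoff and matrix-game sketches given earlier. The dictionaries below are only an illustrative encoding: the variable names are ours, and the NOP action a9/d12 is assigned zero cost and zero recovery time.

attack_cost = {"a1": 4.5, "a2": 3.5, "a3": 2.8, "a4": 5.2, "a5": 3.0,
               "a6": 7.0, "a7": 7.5, "a8": 8.0, "a9": 0.0}            # Table IV
recovery_time = {"a1": 7.8, "a2": 7.5, "a3": 6.0, "a4": 5.6, "a5": 6.7,
                 "a6": 4.0 + 6.0, "a7": 5.4 + 7.8, "a8": 4.7 + 4.7,
                 "a9": 0.0}                                            # R = Rc + Rp, Table V
defense_cost = {"d1": 8.4, "d2": 7.9, "d3": 1.2, "d4": 7.5, "d5": 9.5,
                "d6": 8.0, "d7": 7.5, "d8": 9.0, "d9": 9.3, "d10": 10.7,
                "d11": 12.0, "d12": 0.0}                               # Table VI

# Feasible (non-terminal) states and the actions available in each of them.
attack_actions = {"S1": ["a1", "a2", "a3", "a9"], "S2": ["a4", "a5", "a9"],
                  "S3": ["a4", "a5", "a9"], "S4": ["a6", "a7", "a8", "a9"],
                  "S5": ["a4", "a9"]}
defense_actions = {"S1": ["d1", "d2", "d3", "d4", "d12"],
                   "S2": ["d5", "d6", "d7", "d8", "d12"],
                   "S3": ["d5", "d6", "d7", "d8", "d12"],
                   "S4": ["d9", "d10", "d11", "d12"],
                   "S5": ["d5", "d6", "d12"]}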


B. Simulation and Result Analysis

In order to demonstrate the effectiveness of the proposed approach, two experiment scenarios are designed: 1) solve the game to obtain the optimal attack/defense strategy profiles under the assumption that all the model parameters are known in advance; and 2) use the Q-learning algorithm to acquire the optimal attack/defense strategy profiles.

1) Experiment 1 – Computation of the optimal strategy profiles: Algorithm 1 is used to generate the optimal attack/defense strategy profiles, and the results are presented in Table VII and Table VIII. The threshold parameter in Algorithm 1 is set to δ = 0.1. Since there are only 8 game states in this experiment scenario, Algorithm 1 converges to the Nash equilibrium after just a few iterations.

According to Table VII and Table VIII, the mixed strategy of the attacker in game state S1 is π_{1,*}^A = {0.4869, 0.4647, 0.0484, 0}, and the mixed strategy of the defender in S1 is π_{1,*}^D = {0.3063, 0.3561, 0.3376, 0, 0}. This means that the attacker in S1 is most likely to take attack action a_1, and the defender should take defense action d_2 with the highest probability. In game state S5, both the attacker and the defender adopt pure strategies.

TABLE VII
OPTIMAL ATTACK STRATEGY PROFILE

Action | S1     | S2     | S3     | S4     | S5
a1     | 0.4869 | –      | –      | –      | –
a2     | 0.4647 | –      | –      | –      | –
a3     | 0.0484 | –      | –      | –      | –
a4     | –      | 0.4261 | 0.4261 | –      | 1
a5     | –      | 0.5739 | 0.5739 | –      | –
a6     | –      | –      | –      | 0.2753 | –
a7     | –      | –      | –      | 0.3146 | –
a8     | –      | –      | –      | 0.4100 | –
a9     | 0      | 0      | 0      | 0      | 0

TABLE VIII
OPTIMAL DEFENSE STRATEGY PROFILE

Action | S1     | S2     | S3     | S4     | S5
d1     | 0.3063 | –      | –      | –      | –
d2     | 0.3561 | –      | –      | –      | –
d3     | 0.3376 | –      | –      | –      | –
d4     | 0      | –      | –      | –      | –
d5     | –      | 0      | 0      | –      | 0
d6     | –      | 0.4994 | 0.4994 | –      | 1
d7     | –      | 0.5006 | 0.5006 | –      | –
d8     | –      | 0      | 0      | –      | –
d9     | –      | –      | –      | 0.2481 | –
d10    | –      | –      | –      | 0.3925 | –
d11    | –      | –      | –      | 0.3595 | –
d12    | 0      | 0      | 0      | 0      | 0

The expected payoffs of the attacker and the defender in each game state are presented in Table IX. Since the game is zero-sum, the defender's payoff is the negative of the attacker's. In Table IX, the game states S6, S7 and S8 are end states, which means that the attacker takes the NOP action in these states, so the payoff values there are 0.

TABLE IX
PAYOFF OF THE ATTACKER AND DEFENDER IN EACH GAME STATE

Game state | Payoff of the attacker | Payoff of the defender
S1         | 12.6490                | -12.6490
S2         | 9.5322                 | -9.5322
S3         | 9.5322                 | -9.5322
S4         | 11.3393                | -11.3393
S5         | 2.8000                 | -2.8000
S6         | 0                      | 0
S7         | 0                      | 0
S8         | 0                      | 0

2) Experiment 2 – Q-learning: In this experiment, we first set all the Q-function values to zero and then repeat the game many times to generate the optimal strategy profiles. The discount factor β is set to β = 1.

Fig. 4 shows the convergence process of W_1 under different configurations of the learning rate α. From Fig. 4 we see that α does not have a significant impact on the final value of W_1, but the game converges faster with a higher learning rate. Besides, Fig. 4 also shows that W_1 approaches 12.7050 during the learning process, which is in accordance with the payoff value obtained using Algorithm 1 when all the model parameters are known in advance (see the second row of Table IX).

Fig. 4. Value of W_1 with different learning rates (α = 0.2, 0.3, 0.5, 0.9) in state S1.

Fig. 5 shows the learning process of the defense strategy in game state S1 with α = 0.9. In Fig. 5, d_5 is not drawn because it always equals zero during the learning process. Finally, the learnt optimal mixed defense strategy is π_{1,*}^D = {0.3060, 0.3559, 0.3378, 0, 0}, which is in accordance with the result shown in Table VIII. In other words, this experiment scenario demonstrates that the optimal defense strategy profile can be obtained through the proposed Q-learning algorithm.

Fig. 5. Evolution process of the defense strategy in state S1.

C. Further Discussion

Table X gives a comparison of the proposed approach with some related works. Since these related works use entirely different experiment environments and attack scenarios, we cannot directly compare their quantitative results; Table X therefore only reports qualitative criteria.

TABLE X
COMPARISON OF THE PROPOSED APPROACH WITH RELATED WORKS

Approaches                    | Ours | [4] | [7] | [8] | [9] | [10] | [11]
Cyber layer                   | ✓    | ✓   | ✓   | ✓   | ×   | ×    | ✓
Physical layer                | ✓    | ✓   | ✓   | ×   | ✓   | ✓    | ✓
Expert knowledge              | ×    | ✓   | ✓   | ✓   | ✓   | ✓    | ×
Self-learning                 | ✓    | ×   | ×   | ×   | ×   | ×    | ✓
Attack propagation            | ✓    | ×   | ✓   | ×   | ×   | ×    | ×
Unified payoff quantification | ✓    | ×   | ×   | ×   | ×   | ×    | ×

According to Table X, we can see that:
1) Most previous studies resort to security experts with extensive domain knowledge to specify the model parameters. In other words, they do not have a self-learning ability with which to derive the optimal defense strategy.
2) The existing approaches largely overlook the attack propagation process in ICPS networks, and are thus unable to make dynamic decisions while cyber-attacks are in progress.
3) Some researchers have paid special attention to the cyber-physical interaction in ICPSs, but they have not provided a unified game model, due to the lack of a unified cross-layer payoff quantification method.

In our work, the attack propagation issues are taken into consideration, and a cross-layer security game model is proposed whose parameters are obtained through quantitative vulnerability analysis and unified payoff quantification. Furthermore, we have utilized Q-learning to derive the optimal defense strategy profile even when the game model parameters are not accurately known.

When the proposed security decision-making approach is applied to a real-world ICPS, four major steps are needed:
1) Vulnerability scanning. All the known vulnerabilities present in an ICPS can be found with open-source tools such as MulVAL [24]. MulVAL can automatically scan system vulnerabilities and construct an attack graph to characterize the attack propagation process. Then, the probability of success of each attack action that exploits a specific vulnerability can be computed by (1).
2) Payoff quantification. The attack cost C, the defense cost T and the TTRs R_c for compromised devices are usually obtained from technical reports (e.g., [25]) or security consultants. As for the TTR R_p of the physical process, when the decision-making is carried out offline, R_p can be computed based on (9); when the decision-making is conducted dynamically while cyber-attacks are in progress, R_p is estimated through (11).
3) Game state pruning. In each game state, the available action set of a player is a subset of all the possible actions, so many transitions between game states are infeasible, and these infeasible game states can be pruned to reduce the computational complexity. Afterwards, the optimal strategy profiles can be computed using Algorithm 1.
4) Iterative learning. When part of the game model parameters cannot be acquired, the proposed Q-learning algorithm should be used to produce the optimal strategy profiles. For example, the Modbus protocol vulnerability V3 is not indexed by many vulnerability databases, so we cannot obtain an accurate value for the probability of successful vulnerability exploitation on V3.

In summary, the proposed approach is practical and can be applied to real ICPSs without excessive domain knowledge or human effort.

VI. CONCLUSION

Existing security measures for ICPSs lack the active decision-making ability to defend against highly organized cyber-attacks. Therefore, this paper proposes a game-theoretic approach to security decision-making in ICPSs. In view of the problem that previous studies impractically assume that all the game model parameters can be obtained from experts, we first detail the parameter specification process based on quantitative vulnerability analysis and time-based unified payoff quantification, and then make use of reinforcement learning to derive the optimal defense strategy profile when full knowledge of the game parameters is unavailable. A case study on a simulation testbed demonstrates the effectiveness of the proposed game-theoretic decision-making approach. This approach takes both the cyber and physical layers of ICPSs into consideration and can generate the optimal defense strategy profile by modeling the attack-defense interaction in ICPSs; it could be applied to real-world ICPSs to facilitate active defense and minimize the system losses caused by cyber-attacks.

In future work, we plan to extend our approach to the scenario where unknown vulnerabilities are present in the system and the players only have incomplete information about each other's actions.

REFERENCES

[1] A. W. Colombo, S. Karnouskos, O. Kaynak, Y. Shi, and S. Yin, “Industrial cyberphysical systems: A backbone of the fourth industrial revolution,” IEEE Industrial Electronics Magazine, vol. 11, no. 1, pp. 6–16, Mar. 2017.


[2] M. Wolf and D. Serpanos, “Safety and security in cyber-physical systems and internet-of-things systems,” Proc. IEEE, vol. 106, no. 1, pp. 9–20, Jan. 2018.
[3] S. McLaughlin, C. Konstantinou, X. Wang, L. Davi, A. R. Sadeghi, M. Maniatakos, and R. Karri, “The cybersecurity landscape in industrial control systems,” Proc. IEEE, vol. 104, no. 5, pp. 1039–1057, May 2016.
[4] Q. Zhu and T. Basar, “Game-theoretic methods for robustness, security, and resilience of cyberphysical control systems: Games-in-games principle for optimal cross-layer resilient control systems,” IEEE Control Systems, vol. 35, no. 1, pp. 46–65, Feb. 2015.
[5] Y. Jiang and S. Yin, “Recursive total principle component regression based fault detection and its application to vehicular cyber-physical systems,” IEEE Transactions on Industrial Informatics, vol. 14, no. 4, pp. 1415–1423, Apr. 2018.
[6] C. T. Do, N. H. Tran, C. Hong, C. A. Kamhoua, K. A. Kwiat, E. Blasch, S. Ren, N. Pissinou, and S. S. Iyengar, “Game theory for cyber security and privacy,” ACM Comput. Surv., vol. 50, no. 2, pp. 30:1–30:37, May 2017.
[7] C. Hankin, “Game theory and industrial control systems,” in Semantics, Logics, and Calculi, pp. 178–190, Springer International Publishing, 2016.
[8] Q. Feng, H. Cai, Z. Chen, X. Zhao, and Y. Chen, “Using game theory to optimize allocation of defensive resources to protect multiple chemical facilities in a city against terrorist attacks,” Journal of Loss Prevention in the Process Industries, vol. 43, Supplement C, pp. 614–628, 2016.
[9] G. Chen, Z. Y. Dong, D. J. Hill, and Y. S. Xue, “Exploring reliable strategies for defending power systems against targeted attacks,” IEEE Transactions on Power Systems, vol. 26, no. 3, pp. 1000–1009, Aug. 2011.
[10] Y. Yuan, H. Yuan, L. Guo, H. Yang, and S. Sun, “Resilient control of networked control system under DoS attacks: A unified game approach,” IEEE Transactions on Industrial Informatics, vol. 12, no. 5, pp. 1786–1794, Oct. 2016.
[11] H. Niu and S. Jagannathan, “Optimal defense and control of dynamic systems modeled as cyber-physical systems,” The Journal of Defense Modeling and Simulation, vol. 12, no. 4, pp. 423–438, 2015.
[12] Y. Jiang, S. Yin, and O. Kaynak, “Data-driven monitoring and safety control of industrial cyber-physical systems: Basics and beyond,” IEEE Access, vol. 6, pp. 47374–47384, 2018.
[13] Y. Jiang, K. Li, and S. Yin, “Cyber-physical system based factory monitoring and fault diagnosis framework with plant-wide performance optimization,” in 2018 IEEE Industrial Cyber-Physical Systems (ICPS), pp. 240–245, May 2018.
[14] A. A. Cárdenas, S. Amin, Z.-S. Lin, Y.-L. Huang, C.-Y. Huang, and S. Sastry, “Attacks against process control systems: Risk assessment, detection, and response,” in Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security (ASIACCS ’11), pp. 355–366, New York, NY, USA: ACM, 2011.
[15] P. Mell, K. Scarfone, and S. Romanosky, “Common vulnerability scoring system,” IEEE Security & Privacy, vol. 4, no. 6, pp. 85–89, Nov. 2006.
[16] N. Poolsappasit, R. Dewri, and I. Ray, “Dynamic security risk management using Bayesian attack graphs,” IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 1, pp. 61–74, Jan. 2012.
[17] S. Ntalampiras, “Detection of integrity attacks in cyber-physical critical infrastructures using ensemble modeling,” IEEE Transactions on Industrial Informatics, vol. 11, no. 1, pp. 104–111, Feb. 2015.
[18] Y.-L. Huang, A. A. Cárdenas, S. Amin, Z.-S. Lin, H.-Y. Tsai, and S. Sastry, “Understanding the physical and economic consequences of attacks on control systems,” International Journal of Critical Infrastructure Protection, vol. 2, no. 3, pp. 73–83, 2009.
[19] K. Huang, C. Zhou, Y. C. Tian, S. H. Yang, and Y. Qin, “Assessing the physical impact of cyber-attacks on industrial cyber-physical systems,” IEEE Transactions on Industrial Electronics, vol. 65, no. 10, pp. 8153–8162, 2018.
[20] A. M. Fink, “Equilibrium in a stochastic n-person game,” Journal of Science of the Hiroshima University, vol. 28, no. 1, pp. 89–93, 1964.
[21] K. Sallhammar, B. E. Helvik, and S. J. Knapskog, “On stochastic modeling for integrated security and dependability evaluation,” Journal of Networks, vol. 1, no. 5, pp. 31–42, 2006.
[22] H. Wang, T. Huang, X. Liao, H. Abu-Rub, and G. Chen, “Reinforcement learning in energy trading game among smart microgrids,” IEEE Transactions on Industrial Electronics, vol. 63, no. 8, pp. 5109–5119, Aug. 2016.
[23] N. L. Ricker, “Model predictive control of a continuous, nonlinear, two-phase reactor,” Journal of Process Control, vol. 3, no. 2, pp. 109–123, 1993.
[24] X. Ou, W. F. Boyer, and M. A. McQueen, “A scalable approach to attack graph generation,” in Proceedings of the 13th ACM Conference on Computer and Communications Security, pp. 336–345, ACM, 2006.
[25] N. Falliere, L. O. Murchu, and E. Chien, “W32.Stuxnet dossier,” White paper, Symantec Corp., Security Response, vol. 5, no. 6, pp. 1–29, 2011.

Kaixing Huang received the B.S. and Ph.D. degrees in control science and engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2012 and 2018, respectively. His research interests include security control of industrial control systems and game theory.

Chunjie Zhou received the B.S., M.S. and Ph.D. degrees in control theory and control engineering from the Huazhong University of Science and Technology, Wuhan, China, in 1988, 1991 and 2001, respectively. He is currently a Professor with the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His research interests include safety and security control of industrial control systems, the theory and application of networked control systems, and artificial intelligence.

Yuanqing Qin received the B.S. degree in electrical engineering from the Shandong University of Technology, Zibo, China, in 2000, and the M.S. and Ph.D. degrees in control theory and control engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2003 and 2007, respectively. He is currently a Lecturer with the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His research interests include networked control systems and artificial intelligence.

Weixun Tu received the B.S. degree in automation from Xidian University, Xi'an, China, in 2016. He is currently working toward the M.S. degree in control science and control engineering at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His research interests include networked control systems and artificial intelligence.
