You are on page 1of 16

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL.

18, 2023 1349

A Game-Theoretic Method for Defending Against


Advanced Persistent Threats in Cyber Systems
Lefeng Zhang , Tianqing Zhu , Farookh Khadeer Hussain, Member, IEEE, Dayong Ye ,
and Wanlei Zhou, Senior Member, IEEE

Abstract— Advanced persistent threats (APTs) are one of


today’s major threats to cyber security. Highly determined
attackers along with novel and evasive exfiltration techniques
mean APT attacks elude most intrusion detection and prevention
systems. The result has been significant losses for governments,
organizations, and commercial entities. Intriguingly, despite
greater efforts to defend against APTs in recent times, frequent
upgrades in defense strategies are not leading to increased
security and protection. In this paper, we demonstrate this
phenomenon in an appropriately designed APT rivalry game that
captures the interactions between attackers and defenders. What Fig. 1. Evolution of attacker and defender in APT attacks.
is shown is that the defender’s strategy adjustments actually
leave useful information for the attackers, and thus intelligent
and rational attackers can improve themselves by analyzing this
information. Hence, a critical part of one’s defense strategy must threats (APTs) [1], [2] have become a prominent threat to
be finding a suitable time to adjust one’s strategy to ensure cyber security [3]. Different from traditional cyber attacks,
attackers learn the least possible information. Another challenge today’s APTs are long-term, stealthy attacks performed by
for defenders is determining how to make the best use of one’s proficient and well-funded adversaries. The goal of an APT
resources to achieve a satisfactory defense level. In support of
these efforts, we figured out the optimal timings of a player’s attack is not only limited to sabotaging critical infrastructures
strategy adjustment in terms of information leakage, which form and/or exfiltrating critical information [4], but also includes
a family of Nash equilibria. Moreover, two learning mechanisms staying undetected for further exfiltration and/or sabotage [5].
are proposed to help defenders find an appropriate defense Moreover, one salient characteristic of APT attacks is that
level and allocate their resources reasonably. One is based on the threats keep changing. An increasing security budget does
adversarial bandits, and the other is based on deep reinforcement
learning. Experimental simulations show the rationales behind not effectively protect an organization from compromise [6],
the game and the optimality of the equilibria. The results also [7]. The key reason is that, through frequent interactions, the
demonstrate that players indeed have the ability to improve attackers and defenders learn from each other, making the
themselves by learning from past experiences, which shows the attackers more sophisticated and stronger than ever before.
necessity of specifying optimal strategy adjustment timings when It is important to note that the defenders are not incompe-
defending against APTs.
tent, it is continuous learning through a long and determined
Index Terms— Game theory, cyber security, advanced persis- APT campaign means the attackers become more and more
tent attack. competitive with every interaction. More importantly, the
I. I NTRODUCTION attackers continuously collect information about their target,
adapting to the defender’s resistance efforts by sustained

W ITH the fast evolution and frequent innovation in


cyber attack/defense techniques, advanced persistent
analysis and learning [8], [9]. Although the defenders are able
to successfully detect/prevent some attempts at intrusion by
Manuscript received 7 March 2022; revised 5 November 2022; accepted dynamically adjusting their defense strategies and periodically
1 December 2022. Date of publication 15 December 2022; date of current upgrading their defense policies, these adjustments may leave
version 1 February 2023. This work was supported by the Australian Research useful information for the attackers [10], [11]. For example,
Council (ARC), Australia, through the ARC Discovery Project under Grant
DP200100946. The associate editor coordinating the review of this manuscript if an attacker gets detected and prevented from today’s defense
and approving it for publication was Dr. Nils Ole Tippenhauer. (Corresponding method, they may start to analyze the situation and develop a
author: Tianqing Zhu.) novel attack variant to break through that limitation [12].
Lefeng Zhang, Tianqing Zhu, and Dayong Ye are with the Centre
for Cyber Security and Privacy and the School of Computer Science, Figure 1 demonstrates the process. In the beginning, the
University of Technology Sydney, Ultimo, NSW 2007, Australia (e-mail: attacker has limited knowledge about the defender. When
lefeng.zhang@student.uts.edu.au; Tianqing.Zhu@uts.edu.au; dayong.ye@ the defender employs a new defense strategy, some pieces
uts.edu.au).
Farookh Khadeer Hussain is with the School of Computer Science, of information are left for the attacker. This provides the
University of Technology Sydney, Ultimo, NSW 2007, Australia (e-mail: attacker with a chance to learn more about the defender’s
Farookh.Hussain@uts.edu.au). defense strategy – even if that new defense strategy leads to a
Wanlei Zhou is with the Faculty of Data Science, City University of Macau,
Taipei, Macau, China (e-mail: wlzhou@cityu.edu.mo). better defense level. In turn, developments in the attacker’s
Digital Object Identifier 10.1109/TIFS.2022.3229595 infiltration techniques will motivate the defender to devise
1556-6021 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
1350 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 18, 2023

more advanced methods of protection. From this standpoint, • Theoretical proofs and analyses of the proposed game
it is important for the defender to choose a time to make their and mechanisms. We derived the necessary conditions for
strategy adjustment when they can ensure the least information which the game has a family of equilibria with respect to
will be learnt by the attacker. the players’ rationality. We also derive the regret bounds
Today, many techniques have been adopted to model the of our learning mechanism.
frequent interactions between attackers and defenders in APT • Experimental simulations that show the existence and
campaigns. Of these, game theory is one of the most powerful optimality of the equilibria under different game settings.
choices [13], [14]. Game theory provides an efficient way to Further experiments demonstrate the effectiveness of the
investigate how to defend against APT attacks, and, as such, proposed learning mechanisms.
many game-theoretic models have been devised that describe The rest of this paper is organized as follows. In Section II,
competition between an attacker and a defender [15], [16] [17]. we review the related literature on APTs and game theory.
However, existing solutions generally use a single, highly- In Sections III and IV, we introduce the background knowl-
abstracted game to characterize the interactions between the edge used in this paper and formally defines the problems to
two opponents, and such games do not tend to reflect the be addressed. In Section V, we detail the proposed APT game
attack-defense processes in real APT scenarios. First, most and the learning mechanisms. Section VI presents theoretical
of existing games have pre-specified strategy set for play- analyses of the game-theoretic properties. Section VII contains
ers, which does not consider the notion that attackers and experimental results and related discussions, and Section VIII
defenders can learn. Second, many existing models have a concludes the paper.
clear criterion of winning where each player tries to win
during the course of the game. However, in long-term attack
II. R ELATED W ORK
and defense interactions in APTs, it is hard to specify who
actually wins. Thus far, it has been challenging to quantify A. Game Theory in APTs
how much a player can learn during the course of an APT Game theory is the study of strategic interactions between
attack. Additionally, how to take advantage of past informa- rational decision makers [18]. Since APTs involve dynamic
tion and thus better defend against APT attacks is another strategy adjustments by an attacker and a defender, game
challenge. theory is a useful tool for analyzing their behavior. Much
In this paper, we proposed a two-player APT rivalry game effort has been devoted to studying game theory with APT
to analyze the interactions between attackers and defenders. attacks [19], [20]. In this section, we review the state-of-the-
Rather than attempting to characterize the players’ behaviors art literature according to different game models they adopted.
through a single game, we adopted a game-theoretic model to 1) Cyber Deception Games: Cyber deception is a good way
capture how much information each player can learn from to deceive and mislead attackers in APTs. Wan et al. [21]
their opponent’s strategy adjustments and policy upgrades. developed a cyber deception game based on hypergame theory
Different from previous time-based game models, we consider to mislead decision-making on the part of the attacker. They
that time itself imposes a cost on both players, even they extracted subgames from different stages of a cyber kill chain
do not take any new actions. In addition, the players in and modeled players’ perceived uncertainty, which was shown
our model do not adjust their strategy at arbitrary time, they to influence their understanding of the game. The proposed
must consider the trade-off between time cost and information hypergames allow players view and analyze the original game
leakage. Our assumptions profoundly reflect today’s attack- differently according to asymmetric information. Therefore,
defense campaigns in APTs. That is, the adversary cannot the defender is able to create defensive deception strategies
be eliminated; instead, it can only be suppressed by more to manipulate an attacker’s belief, which is influential to the
advanced techniques [6]. We also propose two reinforcement attacker’s decision-making.
learning methods that help the defender to learn from past Ye et al. [22] proposed a dynamic security game to mislead
experiences and to select optimal defense levels for the future. attackers by hiding or providing inaccurate system informa-
These methods prescribe that the defender should quickly tion. In their game, the defender decides whether to deploy
respond to the detected sabotage activities and set optimal a honeypot or disconnect some systems, the attacker tries to
defense levels. When faced with multiple defense points, the select target configurations that maximize its expected attack-
defender must determine a reasonable plan for allocating the ing gain. They adopted differentially private techniques to
resources available. Our proposed mechanisms are capable of perturb the accurate number of systems and obfuscate system
addressing these issues. configuration. Through this way, the influence of changing
The main contributions of this paper therefore include: system number and system configuration can be limited, and
• An APT rivalry game that considers the information dis- the cyber system is more resistant against attackers’ reasoning.
closed from a player’s strategy adjustments. The defender Tian et al. [23] formulated a honeypot game to defend
and attacker are able to figure out the best timing for against APT attack for the industrial Internet of Things. They
strategy adjustments based on the solution of the game. considered the irrationality of players and employed prospect
• Two learning mechanisms based on reinforcement learn- theory to describe rational behavior. In the proposed game,
ing to help defenders find the optimal defense levels as the attacker and the defender strategically choose periods of
well as appropriate resource allocations during a game attack and defense to maximize their expected utility. Prelec’s
against attackers. probability weighting function is used to characterize players’
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: GAME-THEORETIC METHOD FOR DEFENDING APTs IN CYBER SYSTEMS 1351

utility when they have incomplete information about their 5) Bayesian Games: Huang and Zhu [19] proposed a multi-
opponent. An evolutionary stable strategy is discovered by stage Bayesian game to capture the incomplete information
checking the eigenvalues of a Jacobi utility matrix. resulted from the attacker’s deceptive actions and multi-phase
2) FlipIt Games: FilpIt game is a useful model to model the movement. They developed a conjugate based method to
interactions in APTs as the attacker and defender takeover the incorporate Bayesian belief update and the computation of
system alternatively [24]. Zhang and Zhu [16] used the FlipIt equilibrium strategy. Their game boils down to backward
game to capture the strategic interactions between an APT dynamic programming and is solved under the assumption
attacker and defender. A local game was played out across of beta-binomial conjugate prior on users’ type. With their
distributed nodes so as to model a device being occupied by a framework, the defender is able to compute the perfect Nash
single defender and a single attacker, while a global game equilibrium and thus to predict the attacker’s behaviors. How-
was played out over the network to describe the complex ever, although the proposed method leads to computationally
interactions among multiple defenders and multiple attackers. tractable equilibrium, their assumptions may be too strong in
Cyber insurance was employed to mitigate the risks of being real-world APT scenarios.
attacked to the cyber system. It was also used to explore the Pawlick et al. [28] studied a three-player game that involves
design of incentive compatible insurance contracts by solving a cloud defender, an APT attacker and a could-controlled
the welfare maximization problem. device. The attacker and defender compete for the control
3) Resource Allocation Games: Yang et al. [25] consid- of the cloud, and the outcome of the competition determines
ered the repair of systems when an APT attack is detected the actual type of the cloud. A signaling game is further
in an organization. They formulated an APT repair game established based on the cloud status, if the device believes
to find effective resource allocation strategies and thus can the message it receives is from a compromised cloud, it will
mitigate the potential loss of the victim organization. Their follow the received commands. Other wise the device will rely
game falls into the differential Nash game category where on its own autonomous lower-level control. The solution of a
the attacker tries to maximize its potential benefit and the game is a Gestalt equilibrium, which is a combination of a
defender aims to minimize the potential loss. The potential perfect Bayesian equilibrium and Nash equilibrium.
equilibria are obtained by solving potential systems, and 6) Trading Games: Hu et al. [29] considered the insider
numeric simulations are used to demonstrate the relationship threat where the insiders trade valuable information to APT
between potential equilibria and the Nash equilibria. attackers for monetary benefits. They developed a two-player
Xiao et al. [26] presented an APT detection game based on game model to characterize the behavior of attacker and
cumulative prospect theory. Specifically, the attacker chooses defender with the interferer of insiders. The proposed model
an attack interval during which to launch an attack, and the quantitatively specifies the cost of players incurred over a long
defender chooses a scan interval to protect the cyber system. time-span, and gives the instantaneous net profit of insiders
The proposed model incorporates probability weighting distor- using market price dynamic. In this game, the optimal strategy
tion and a framing effect on a subjective attacker and defender. of each player is derived by solving differential equations.
To find the optimal defense strategy, they developed a policy, 7) Sequential Games in Extensive Form: Rass et al. [30]
called a hill-climbing detection scheme, to increase uncertainty leveraged extensive form game to describe an APT attack.
and mislead the attacker. An initialization method based on an In the proposed model, the game starts from a root of a tree
experience value accelerates the learning speed. and each tree node corresponds to a piece of information on
4) Information Tracking Games: Moothedath et al. [4] pro- which player is currently in. Extensive form games explicitly
posed a dynamic information flow tracking game to monitor present key information about the attacker and defender.
the data flow and control flow within a network, and thus to In addition, one player’s uncertain movement is captured by
detect APT attacks. They used a graph model to capture the information set, which highly resembles the latent movement
interaction among different nodes in the network, where the of the attacker in APTs. The proposed model are able to
attacker aims to reach a specific node and the defender tries to handle the inherent uncertainty of APTs caused by different
detect the threat in an efficient way. In each stage of the game, risk assessments, adversarial incentives and system states.
the defender decides the tag status of a flow and specifies the 8) Colonel Blotto Game: Zhu et al. [31] used Colonel Blotto
security policies a flow should follow. The best response of the game to model players’ competition on different defense
players can be derived by employing a shortest path algorithm. points. They extended classic single defender/attacker scenario
Sengupta et al. [27] pointed out that there are useful inherent to the setting where multiple defenders/attackers are involved.
information contained in the multi-stage stealthy movement In their game model, defenders and attackers distribute their
of APT attackers. They proposed a general-sum Markov resources on each battlefield and compete for the control of
game along with a vulnerability scoring method to figure a cyber system. A multi-agent deep reinforcement learning
out the utility of players in each stage of the game. They method is proposed to find the Nash equilibrium of the game.
developed a system attack graph to represent the states of Similarly, Gupta et al. [32] developed a three-stage Colonel
the game. The attacker’s actions are modeled based on real- Blotto Game to explore the flow of information between
world APT attacks, and the defender’s actions are allocations defender and APT attacker. Their game allows defenders to
of security resources. The authors showed that the Stackelberg add battlefields and transfer resources to each other. The
equilibrium with some strong assumptions is in fact the Nash attacker finally fights against defenders on two different bat-
equilibrium of the general-sum Markov Game. tlefields. The game runs in an incomplete information setting,

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
1352 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 18, 2023

the derived equilibrium demonstrates that players’ extra infor- • Threat: APT attacks are accurately aimed, well designed
mation may not necessarily lead to a better performance. and can cause irreparable damage to the targeted victim.
Even worse, they are long-term threats and hard to
B. Discussion of Related Works detect or eliminate. These have raised serious threats to
The above works have two common limitations. First, most companies, organizations and even nation entities who
of these works ignore the learning ability of attackers and want to keep their data and secrets secure.
defenders. They only consider players’ gain/loss when players In fact, attack and defense in the context of APT is a long
adjust their attack/defense strategy. As mentioned, adjusting and evolutionary process, where the attacker and defender
one’s strategy can leak important information that can make endeavor to explore their opponent’s strategies and maximize
the attacker and/or the defender stronger through iterative their strategic benefits. The National Institute of Standards
learning [13]. Second, the majority of works have a clear and Technology (NIST) [33] has demonstrated that APT
win/loss criterion. However, as mentioned in [12], real APT attackers may keep themselves updated with changes in the
attacks are long-standing threats accompanied by novelty in targeted network, continuously collecting and analyzing useful
attack variants. Therefore, it is not reasonable to specify the information about the victim organization while staying unde-
winner of a long-term campaign or discuss how to win under tected. Therefore, it is crucial to consider potential risks and
the framework of an APT game model. The only practical estimate the influence of undetected behaviors when modeling
outcome is to learn from past experiences so as to reduce interactions between the attacker and defender.
future losses as much as possible.
To this end, we considered the information leaks incurred B. Game Theory
when an attacker/defender makes a strategy adjustment. From Game theory is the study of mathematical models of conflict
this, we developed an APT rivalry game based on evolutionary and cooperation between rational decision-makers [34]. Game-
game theory to find the optimal timing for players to make theoretic solutions have been widely applied across many
a change. In our model, we do not define the criteria of fields of artificial intelligence [35], including distributed con-
winning for each player, instead, we assume that the defender trol systems [36] and multi-agent systems [37], [38].
will periodically obtain some gains or suffer some losses In this paper, we take advantage of a dynamic timing
as long as the threat of APT attack is not fully eliminated. game called the war of attrition [39]. This game characterizes
Therefore, in our solution, at an equilibrium time point, the timing of strategy adjustments noting that the attacker
the defender determines a new defense strategy using rein- and defender are trying to reduce information leaks to their
forcement learning techniques based on previous actions and opponent. Originally formulated to study animal conflicts
gains/losses. and genetic evolution [40], the classic war of attrition game
works as follow. Suppose players 1 and 2 are competing for
III. P RELIMINARIES a single resource. Both players can drop out at any time,
A. Advanced Persistent Threats (APTs) if one player drops, the other player wins (i.e., obtains the
resource) and the game ends immediately. The valuations of
Advanced persistent threats are a major threat to today’s
player 1 and player 2 are v 1 and v 2 respectively, representing
cyber systems. Different from a regular cyber attack, APTs are
the benefit they will obtain if they win the resource. Besides,
often performed by groups of adversaries with sophisticated
each player suffers a cost c for fighting per unit of time.
levels of expertise and significant resources. The characteris-
We consider the mixed-strategy Nash equilibrium of the game,
tics of APTs in terms of their definitions follow.
where the strategy of each player corresponds to a distribution
• Advanced: The APT attackers are usually well-funded by
Ti (i = 1, 2) over drop-out times. Each player decides their
organizations or even governments, which allows them to drop-out time according to distribution Ti .
tailor advanced tools and methods to generate opportuni- Definition 1 (Nash Equilibrium [41]): An action profile a ∗
ties to achieve their specific objective. A distinguishing is a Nash equilibrium if each ai∗ is the best response of a−i ∗ .
character of APTs is the novelty of the attack methods, 
That is, ∀i ∈ [n], a ∈ A,
which usually include but are not limited to combinations
of multiple different attack vectors, such as phishing, u i (ai∗ , a−i

) ≥ u i (ai , a−i

). (1)
malware, and viruses. Some attack techniques may never For the war of attrition game, a player’s utility is the
have even been seen before, which guarantees its success. difference between their valuation (if wins) and the cost for
• Persistent: The goal of an APT attack not only involves fighting. Namely, if the competition ends at time t, the winner
stealing sensitive data and undermining critical compo- gets u i = v i − c · t, and the other player gets −c · t. In a Nash
nents of the targeted victim. One important aim of an equilibrium, both players are indifferent between dropping out
APT attack is to stay undetected and position oneself at some time t and waiting to drop out at t + dt. That is, at an
for future post-exfiltration or post-impediment. Unlike equilibrium point, for player i ,
traditional attacks, which often employ “smash and grab”
dT j (t)
style tactics for mere financial gain, APT attackers are vi · − c · dt = 0, (2)
much more patient and persistent. They usually follow 1 − T j (t)
a “low and slow” strategy, gradually expanding their where v i · dT j (t)/(1 − T j (t)) is the possible gain of player i
foothold across the whole network. when it persists in the game and c · dt is the cost of staying

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: GAME-THEORETIC METHOD FOR DEFENDING APTs IN CYBER SYSTEMS 1353

TABLE I We note that the term “information leakage” in the paper


M AIN N OTATIONS U SED IN THE PAPER refers to “what the defender/attacker can know by observing
their opponent’s new strategy adjustment”, which enables them
to improve their strategy or policy. We do not use information
leakage to portray players’ previously unseen cognition. The
reason is that the players do not fully know each other, they
cannot tell which information is new to their opponent.
In addition, knowing when to make a strategy adjustment,
it is also necessary to explore ways for determining what is an
appropriate defense level. There should be a tradeoff between
resource consumption and system security, so determining an
proper resource allocation is another factor to consider.
Formally, we consider the continuous-time rivalry game on
a cyber network between an APT attacker and a defender. The
attacker tries to invade a network and stay undetected, while
the defender endeavors to secure every terminal point within
in. The ratio T j (t)/(1 − T j (t)) represents the probability that the network and minimize the losses caused by the APT attack.
player j drops out at time t, which is also known as the hazard ca denotes the average cost to an attacker per unit of time, i.e.,
rate of player j in the context of evolution theory. the cost generated by the attacker analyzing, moving around,
By integrating, the unique mixed-strategy equilibrium of the and perpetrating malicious activity. Similarly, the defender
war of attrition game is also consumes time, money, and computational resources in
defending against an APT attack via activities such as log
Ti (t) = 1 − e−ct /v j , i = 1, 2. (3)
monitoring or anomaly detection. cd denotes the average cost
In the context of APTs, a strategy adjustment of the to a defender per unit of time. To characterize the information
attacker/defender corresponds to a player’s drop out, which leaks incurred in a strategy adjustment, v a and v d denote the
will leave a piece of information that values v i to their value of the information learned by an attacker/defender from
opponent. We use the war of attrition model to figure out the a strategy adjustment by their opponent.
optimal strategy adjustment timing in terms of information The goal of the game is to find the optimal timing for
leakage. However, the valuations of players are private infor- strategy adjustments that leaks the least information. The aim
mation of the attacker/defender in APT campaigns. Therefore, of the learning method is to find an effective way for the
we consider a more complex scenario where the valuations defender to select an optimal defense strategy from past losses
of players are drawn from different distributions, and discuss and experience.
player’s expected utility when forming a Nash equilibrium.
In addition, we extend the game to the situation where players B. Design Rationale of the APT Rivalry Game
are inertial (i.e., irrational with some probability), making it The course of the APT rivalry game is illustrated in
comparable to an APT attack. We also derive the necessary Figure 2. There are two players P = {attacker, defender}
conditions for which a family of equilibria exists. The main in the game. From the starting point T = 0 (e.g., the
notations used in the paper are summarized in Table I. reconnaissance stage of the attack), the attacker and defender
consume resources to perform attack/defense activities. The
IV. P ROBLEM D EFINITION AND G AME D ESIGN cost per unit of time for each player is ca and cd , respectively.
At each instant, the players decide whether to adjust or change
A. Problem Statement their strategy. One player’s strategy adjustment leaves a piece
This research mainly addresses two issues associated with of useful information to its opponent, which can be used
APT attacks. The first is to develop a game-theoretic model for in the opponent’s subsequent strategy making. For example,
APT attacks to help attackers and defenders find the optimal at time T = T0 , the defender updates a blacklist of URLs. The
times to adjust their strategies – with optimal being defined as attacker, whose list is outdated, will soon realize this fact and
the time when the least information possible will be leaked to so she learns new information. The value of this information
avoid developing a strong opponent. The second is an effective is v a . During the course of the game, each player can freely
method for the defender to select an optimal defense level, choose the timing of when they adjust their strategy. After
as well as find a reasonable resource allocation. one player’s strategy adjustment, the APT rivalry game goes
To defend against APTs, it is essential to find an optimal to the next stage and thus the cost ca and cd also change.
timing scheme for strategy adjustment in terms of informa- This is because attacking or defending against different strate-
tion leakage. On the one hand, frequently making strategy gies of the opponent possibly leads to different resource
adjustment not only results in huge resource consumption consumption.
but may also impart information to one’s attackers. On the In the proposed game, we consider the optimal timing of
other hand, an outdated defense strategy will make the system players’ strategy adjustment in terms of information leakage.
more vulnerable to novel attack variants. Therefore, a balance Here, the strategy adjustment refers to the players’ routine
between the two is required. strategy upgrades but does not necessarily involve emergency

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
1354 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 18, 2023

NIST. Moreover, the potential loss V is also treated as a


random variable as the defender is unaware of the attacker’s
stealthy goal. The distribution of L(v) is formulated by the
defender based on the value of the cyber system.
Similarly, from the attacker’s perspective, the value of
information v a can be estimated in the same way, with each
random variable has different interpretations, e.g., X r is the
success rate of the attacker, V ∼ L(v) is the potential gain if
the targeted system is infiltrated.
Given I (X r ; X b ) and distribution L(v), the value of infor-
Fig. 2. APT rivalry game between attacker and defender. mation v d in fact follows a distribution I ◦ V , where I is a
scalar. To improve legibility, we denote F = I ◦ V and assume
actions. For example, if a defender finds an unknown program that the values v a and v d are drawn from distributions Fa and
running within his system, he should take remedial action Fd , with density denoted as f a and f d , respectively. Fa and
immediately, even if these actions will leak some side infor- Fd are assumed to be differentiable from [0, h] → [0, 1].
mation to the attacker.
In this paper, we assume that the attacker and defender are
B. The Equilibrium Strategy Adjustment Timing
risk neutral and aim to maximize their utility with respect to
an optimal timing scheme for strategy adjustment. In this section, we formulate the utility of attacker and
defender, as well as deriving the sufficient conditions for the
V. T HE S OLUTION OF THE G AME AND L EARNING existence of equilibria in the APT rivalry game.
We first normalize the players’ costs incurred in the game
A. The Evaluation of Players’ Information Leakage
to ca = cd = 1, which is without loss of generality, since for
In the APT rivalry game, we use v a and v d to capture nonnegative cost c and valuation v, it is only the ratio c/v that
the value of information learned by an attacker/defender from influences the equilibria. When players have different costs and
a strategy adjustment by their opponent. However, in APT valuations, we can interpret v as players’ cost-valuation ratio.
campaigns, it is challenging to evaluate the actual values v a The result of the game will not change. Then, the equilibria
and v d due to the uncertainty and asymmetry [42] of APTs. Ti (v i ) (i = a, d) that specifies when the players adjust their
Sometimes, even the players themselves cannot foresee the strategies is a function of valuation v i .
effect of their new tactic [43]. To address this issue, we employ Lemma 1 (Monotonicity and Differentiability of Ti (v i )):
metrics from information theory to quantify the information For Ti (v i ) to be an equilibrium of the game, Ti (v i ) should
leakage incurred during player’s strategy adjustment, which be strictly increasing and continuously differentiable.
serves as the basis for the computing v a and v d . The intuition of Lemma 1 is that the player with larger
From the defender’s perspective, let X b be the random vari- v i sticks to their current strategy longer as waiting longer
able that captures the attacker’s strategy adjustment detected may yield more valuable information about their opponent.
by the defender, and X r represents the risk that the cyber In addition, it cannot be the case that Ti is constant on some
system is compromised. Then the mutual information between interval [v i , v i ] and then increasing after v i . Assuming not,
X b and X r is defined as for some v i ≤ v i , T0 = Ti (v i ) = Ti (v i ) > 0. That means
I (X r ; X b ) = H (X r ) + H (X b ) − H (X r , X b ), (4) Pr[Ti (v i ) = T0 ] > 0, i.e., player i would make an adjustment
 with a discrete probability at T0 . But, if this were the case,
where H (X) = − x Pr[X = x] · log(Pr[X = x]) is the the opponent player j would never make an adjustment within
Shannon entropy of X. [T0 − , T0 ] given a small , which contradicts T j (v j )’s
The mutual information I (X r ; X b ) measures the reduction continuity. Therefore, Ti must be strictly increasing. For a
of entropy in X r caused by knowing X b . In other words, it rep- complete proof, we refer readers to the study of concession
resents the defender’s reduced uncertainty about the system’s games [44], [45].
risk by observing the attacker’s new strategy adjustment. According to Lemma 1, we can ignore the possibility
To estimate the value of v d , let random variable that both players will adjust their strategies simultaneously.
V ∼ L(v) denote the loss of the defender if the cyber system Assume that the attacker follows the equilibrium ta ∼ Ta (v a )
is compromised. Then the value of information learned by the to make a strategy adjustment. Then, if the defender adjusts
defender is modeled to be his strategy at time z, the expected utility of the defender will
be
v d = V · I (X r ; X b ). (5)  z
We consider X b and X r to be random variables because the u d (z; v d ) = (v d − ta )d(Fa (ϕa (ta ))) − z(1 − Fa (ϕa (z))),
0
adjustment of attacker vectors involves changes on multiple (6)
defense points, the defender cannot capture all of them exactly.
In practice, X b is formulated by the defender’s observation where ϕi (·) = Ti−1 (·) is the inverse function of Ti (·). So ϕi (t)
of the attacker’s movement, and the corresponding risk can represents the valuation of a player who makes a strategy
be estimated from cyber security authorities like FireEye and adjustment at time t. The second term can also be written

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: GAME-THEORETIC METHOD FOR DEFENDING APTs IN CYBER SYSTEMS 1355

as (z + v a )(1 − Fa (ϕa (z))), which incorporates the attacker’s TABLE II


valuation, both formulations leads to the same equilibria for J OINT D ISTRIBUTION OF X b AND X r
the defender as long as the attacker is assumed to follow her
equilibrium strategy. To simply the notation, we sometimes
write ϕi (t) as ϕi for i = a, d.
In Equation 6, the first term and second term are the
defender’s expected utility when the attacker makes a strategy
adjustment at ta < z and ta > z, respectively. Differentiating
Equation 6 with respect to z, we have the following equation
∂u(z; v d )
= v d Fa (ϕa (z))ϕa (z) − (1 − Fa (ϕa (z))). (7)
∂z To see this, note that in the settings where there players are
Set the derivative to zero, and we have inertial, there is no longer a set of equilibrium. To see this,
v d Fa (ϕa (z))ϕa (z) = 1 − Fa (ϕa (z)). (8) we rearrange Equation 10 and have

Similarly, from the perspective of the attacker, we have F̂a (ϕa ) F̂d (ϕd )
= . (12)
v a Fd (ϕd (z))ϕd (z) = 1 − Fd (ϕd (z)). (9) ϕa (1 − F̂a (ϕa )) ϕd (1 − F̂d (ϕd ))
And by integrating, we have
Let ϕa (t), ϕd (t) denote the pair of valuation functions for
the attacker and defender, then we have the main theorem of Ya (ϕa ) = Yd (ϕd ) + k, (13)
the paper, as follows:
Theorem 1: (The equilibria of the APT rivalry game) where Yi satisfies
Suppose ϕa (t), ϕd (t) is a solution to the differential equation  h  ϕ
 F̂i (x) F̂i (x)
Yi (ϕ) = dx < − dx
ϕd (t)Fa (ϕa (t))ϕa (t) = 1 − Fa (ϕa (t)), ϕ x(1 − F̂i (x)) h ϕ(1 − F̂i (x))
(10)
ϕa (t)Fd (ϕd (t))ϕd (t) = 1 − Fd (ϕd (t)). 1 1 − F̂i (ϕ)
= ln . (14)
with conditions limt →∞ ϕa (t) = limt →∞ ϕd (t) = h and ϕ 1 − F̂i (h)
min{ϕa (t), ϕd (t)} = 0, then ϕa (t), ϕd (t) forms an equilib-
According to Equation 11, F̂i (·) < 1 and Yi (ϕ) converges
rium of the APT rivalry game.
as ϕ → h. Based on the fact that Yi (ϕ) is a strict monotone
Theorem 1 characterizes the conditions where the APT
function, Equation 13 has a unique solution. In other words,
rivalry game has a family of equilibria. In practice, the solution
there is a unique equilibrium when players become inertial.
of the system is determined by different parameter estimations
of the players. We consider the defender and attacker’s long-
term competition. Every time they make a strategy adjustment, D. A Case Study of Equilibrium Computation
the parameters in the equation change, as does the solution of In this section, we presented a case study to illustrate how
the next equilibrium timing. to find the equilibria in real-world APT scenario, including the
We leave the complete proof of the theorem to Section VI. evaluation of parameters and the solve of Nash equilibria.
Suppose a cyber system has three risk levels, namely X r =
C. The Equilibrium With Inertial Players {low, medium, high}. The attacker makes frequent queries on
In this section, we explore the change of equilibria when the target’s publicly accessible repositories in an effort to
the attacker and defender are inertial. In the context of APTs, collect domain/routing information and locate the websites
it is common that the attacker and/or defender cannot get that have high-risk vulnerabilities. The potential threats related
complete information about the movements of their opponents to the attacker’s strategy are cross-site scripting (XSS), SQL
during the reconnaissance, monitoring, etc. The asymmetry of injections (SQLI) and DNS tunneling attacks (DNST), X b =
information may influence the rationality of a player in APT {XSS, SQLI, DNST}. The defender estimates the joint dis-
rivalry games [23], [46], whom we refer to as an inertial player. tribution of risks and potential attacks based on the system
Hence, consider there is a probability p > 0 such that a and well-known vulnerability databases like NIST National
player is inertial and thus will never make any attack/defense Vulnerability Database (NVD) [8]. The joint distribution of
strategy adjustments during the game. The player’s strategy X b and X r are given in Table II.
adjustment time, viewed from its opponent’s perspective, Based on the table, the defender can compute the mutual
would then satisfy information I (X r ; X b ) = 0.02. Consider a simple case where
the loss of the defender when the system is compromised fol-
F̂i (ϕi (t)) = (1 − p)Fi (ϕi (t)). (11)
lows a uniform distribution on interval [0, 50], V ∼ U (0, 50).
Simply replace Fa and Fd with F̂a and F̂d in Theorem 1, Then we have Fd (v) = v. Similarly, suppose Fa (v) = v, and
and we have the equilibrium ϕ̂a (t), ϕ̂d (t) for inertial players. the differential equation system in Theorem 1 becomes
Theorem 2 (The Equilibrium of APT Game With Inertial 
Players): There exist a unique equilibrium of the APT rivalry ϕd (t)ϕa (t) = 1 − ϕa (t),
(15)
game when players are inertial with probability p > 0. ϕa (t)ϕd (t) = 1 − ϕd (t).

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
1356 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 18, 2023

Rearrange the order, and we have In this paper, we formulated the defender’s strategy adjust-
dϕa ϕa · (1 − ϕa ) ment problem as a learning problem. We proposed two rein-
= . (16) forcement learning based mechanisms to select the optimal
dϕd ϕd · (1 − ϕd )
defense level as well as specifying how to allocate the
Integrate the above equation, and we have the general defender’s resources across these different defense points.
solution 2) The Adversarial Bandit Solution: A natural solution
ln(1/ϕd − 1) = ln(1/ϕa − 1) + k, (17) to the strategy adjustment problem is to interpret it as an
adversarial bandit problem, where an adversary (the attacker)
The above equation defines a set of equilibria with a has a complete control of the loss suffered by the defender
constant k. When k = 0, by symmetry, we have ϕa = and the defender only gets feedback from the defense actions
ϕd , therefore, Equation 15 boils down to ϕi ϕi = 1 − ϕi . it had taken before.
By integrating and inverting, we have, for both the attacker In the adversarial bandit setting, li1 , . . . lit denote the
and the defender, defense levels selected by the defender from time 1 to t.
Ti (v i ) = −v i − ln(1 − v i ), (18) Let x i1 (1) . . . , x it (t) be the sequence of losses suffered by
the defender. At time t, the defender computes the loss he
which is the symmetric equilibrium of the game. has suffered from his previous overall defense level lit−1 , and
then decides his new defense level based on the knowledge
E. Resource Allocation With Reinforcement Learning it obtained so far, i.e., the history {li1 , x i1 (1), . . . , lit−1 , x it−1
In previous sections, we discussed when to make an (t − 1)}. Intuitively, the loss on defense point Pi is computed
attack/defense strategy adjustment in terms of information by x Pi (t) = |ρid (t) − ρia (t)|. That is, if ρid (t) < ρia (t), the
leakage and derived the strategy adjustment timing that forms defender has failed to defend Pi and suffers ρia (t) − ρid (t).
a set of equilibria for both players. In the following sections, If ρid (t) ≥ ρia (t), the defender has succeed in his defense
we further explore how to make a strategy adjustment when but has wasted ρid (t) − ρia (t) resources, which might have
the equilibrium timing is known. Without loss of generality, been used on P j to further improve the security of the system.

the methods are presented from the perspective of the defender. The overall loss is therefore x it (t) = n1 ni=1 x Pi (t). Assume
1) Learning Problem Formulation: To defend against APT that the K defense levels are common knowledge to the
attacks, multiple defense methods should be implemented, attacker and defender. The attacker specifies a loss vector
such as firewalls and anomaly detection. It is not always (x 1 (t), . . . , x K (t)) by distributing its resources across the set
reasonable to exhaust every defense method to its utmost level of defense points P, where x k (t) ∈ [0, 1] represents the loss
as that would lead to significant consumption of the defender’s of the defender if its overall defense level falls to k ∈ [K ].
resources. The defender generally distributes its recourses For any time T > 0 and any mechanism M that maps from
across these “battlefields”, leading to a comprehensive defense the defender’s history {li1 , x i1 (1), . . . , lit−1 , x it−1 (t − 1)} to the
level over the full cyber system. In this paper, we assume set of defense levels L = {1, . . . , K }, the total loss generated
that there are P = {Pi , . . . , Pn } vulnerable defense points. by the mechanism along the time horizon T is
ρid ∈ [0, 1] is the specific defense level with respect to

T
Pi , which represents the defender’s resource commitment on G M (T ) = x t j (t). (19)
Pi . For example, if Pi is the anti-virus scanning frequency, t =1
then ρid = 1 indicates that the defender sets the highest
possible scanning frequency to protect his system. Similarly, In addition, for a best single defense level j ∗ , the loss it
the attacker also has attack level ρia ∈ [0, 1] on these defense leads to is
points, which reflects the attacker’s resource distribution over 
T

P. If ρid ≥ ρia , we say that the defender successfully defends G min (T ) = min x j ∗ (t). (20)
j
against one attack attempt in terms of Pi . t =1
The defender’s overall defense level is represented by k In this paper, we consider the (weak) regret of the mecha-
discrete numbers L = {l1 , . . . , l K }, where lk = {k ∈ Z|k · nism M, which is the gap between G M (T ) and G min (T ),

n/K ≤ n1 ni=1 ρid < (k + 1) · n/K }. Each strategy adjustment
R(T ) = G M (T ) − G min (T ). (21)
on a single defense point will eventually be reflected in the
overall defense level. Regret is a good way to measure the performance of a learning
The strategy adjustment of the defender follows. At any algorithm. It represents the loss suffered by the defender
time t that forms an equilibrium of the APT rivalry game, the relative to the optimal fixed defense level.
defender can modify or change a series of his defense methods, We assume that the defender is risk-neutral and, thus, the
such as the firewall level, the frequency of anomaly pattern objective of the defender is to obtain a bounded expected
detection, etc. These adjustments will change the defense level regret. The exponential-weight algorithm for exploration and
ρid on each defense point Pi , and further change the overall exploitation (Exp3 ) [47] with losses is used to find the best sin-
defense level of the cyber system, say from li to l j with i, j ∈ gle defense level. The original Exp3 algorithm was developed
[K ]. Different overall defense levels will result in different with respect to players’ rewards. However, we reformulated
losses to the defender, which influences the defender’s next this algorithm to derive the regret bound in terms of the
strategy adjustment in return. defender’s losses. The details are shown in Algorithm 1.

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: GAME-THEORETIC METHOD FOR DEFENDING APTs IN CYBER SYSTEMS 1357

Algorithm 1 Mechanism M1 : Exp3 Algorithm With Losses a neural network to approximate the Q-table, which is much
Input: the defense levels L = {l1 , . . . , l K }, parameter γ ∈ easier to maintain.
[0, 1]. In an APT attack, the defender may not be aware of the
Output: defense levels lit for each t. attack model. As a model-free reinforcement learning tech-
1: for i ∈ [K ] do nique, a DQN can be employed to derive the optimal defense
2: initialize weight wi (1) ← 1 for defense level li . policy (i.e., the ρid and lit ) in a Markov decision process
3: end for (MDP). To avoid ambiguity, in this section, the strategy of
4: for t = 1, 2, . . . do the defender indicate the defender’s resource allocation across
5: Set probability different defense points, while in previous sections strategy
has referred to the actual defense methods the defender has
wi (t) γ
pi (t) = (1 − γ )  K + , i = 1, . . . , K . used to protect the system.
j =1 w j (t)
K In the proposed DQN solution, the defender’s strategy in
current round t is based on his strategies and losses in the last
6: draw a defense level lit randomly according to the
τ rounds. The state of the defender is st = {ot −τ , . . . , ot −1 },
distribution of pi (t).
where ot − j = {ρid (t − j )}, {ρia (t − j )} for i ∈ [K ], j ∈ [τ ].
7: observe the loss x it (t).
That is, a state consists of a series of observations, and each
8: for j ∈ [K ] do
observation is made up of the defender’s specific defense levels
9: update the weight for defense level l j by
over P and the attacker’s attack levels over P in a single round.
I{ j = i t }x j (t) The defender’s actions are defined as the defense levels ρid (t)
x̂ j (t) = ;
p j (t) available for selection. We assume that each defense point Pi
γ x̂ j (t) has an importance factor Ii , which indicates how important Pi
w j (t + 1) = w j (t) exp(− ). is in protecting the cyber system.
K  The reward of the defender
in round t is therefore r (t) = ni=i Ii · sign(ρid (t) − ρia (t)).
10: end for
We also employ a new strategy-based sampling approach to
11: end for
select samples from the experience replay set. The existing
sampling approaches, such as random sampling [49] and
gradient-based [50] sampling are not suitable for the proposed
APT game as the defender may have a large number of
The algorithm takes as its input the set of defense levels L. available actions. To avoid overfitting, defenders should avoid
The parameter γ controls the trade-off between exploration selecting the samples that have similar defense levels. There-
(trying out a new defense level to find the optimal one) fore, the similarity between strategies must be considered.
and exploitation (playing the best defense level so far to Given the experience replay set D, Euclidean distance is
have minimized loss). The algorithm starts by assigning each used to measure the similarity between strategies. Specifically,
defense level a weight wi (1) = 1 (Line 1-3). At each time step the defender first computes the average of the specific defense
t, it computes a probability mass for each defense level and levels within D, denoted as ρ̄ d ( j ),
draws a particular one accordingly (Line 5-6). The defender
|D|
then adjusts his defense strategy following the selected defense 1  d
level. The next time that it is necessary to make a strategy ρ̄ d ( j ) = ρ ( j ), ρ d ( j ) = (ρid ( j ))i=1
K
. (22)
|D|
adjustment, the defender estimate the loss he suffered, and j =1
updates the weight for each defense level (Line 7-11). In the
weight update process (Line 9), the algorithm uses the esti- Then, for each candidate defense level ρ d ( j ), the defender
mated loss rather than the exact one, which aims to punish the computes the Euclidean distance between ρ d ( j ) and ρ̄ d ( j ),
defense levels that lead to a large loss. A detailed analysis of
Ed(ρi , ρ̄) = ||ρ d ( j ) − ρ̄ d ( j )||2. (23)
the regret is presented in Section VI.
3) The Deep Q-Network Solution: The previous solution Finally, the defender selects m samples that have the largest
endeavors to find the best fixed defense level, but it does not distance according to the order of Ed(ρ d ( j ), ρ̄( j )).
provide a detailed resource allocation scheme for a specific The DQN based learning mechanism is shown in
defense point. This issue is addressed in the second solution. Algorithm 2. In each round t, the defender adopts an -greedy
The second solution takes advantage of deep reinforcement approach to select a defense strategy (Line 4). The defender
learning, where the knowledge of the defender is stored in then achieves the selected defense levels by adjusting a
a deep neural network. Compared to regular reinforcement series of defense methods at each defense point. Here, the
learning, deep Q-networks (DQNs) [48] can handle more reward is also estimated and a new observation is obtained
complex problems with a large number of states or actions. (Line 5-6). The defender then updates the observation for his
In regular reinforcement learning, an agent maintains a Q-table state (Line 10), and puts the sample with the updated state
to track changes in the state-action pairs Q(s, a). However, into the experience replay set (Line 11). Finally, the defender
there is no limit on how much the Q-table can expand, and so selects m samples with which to train the DQN by minimizing
it will become hard to update with a large number of states the loss function defined by the difference between target
or actions. By contrast, deep reinforcement learning adopts Q-values and learned Q-values (Line 12-17).

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on March 01,2024 at 10:07:16 UTC from IEEE Xplore. Restrictions apply.
1358 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 18, 2023

Algorithm 2 Mechanism M2: Deep Q-Network
Input: the defender's state, weight θ of the deep Q-network, experience replay set D.
Output: a trained DQN.
1: initialize the weight θ and the Q-values of each defense strategy randomly.
2: set D = ∅.
3: for t = 1, 2, . . . do
4:   With probability ε, randomly select a defense strategy ρ^d(t) = {ρ_i^d(t)}_{i=1}^{K}; otherwise select ρ^d(t) = arg max_ρ Q(s_t, ρ; θ).
5:   adjust the defense methods on the defense points P = {P_i}_{i=1}^{K} to achieve the selected defense strategy.
6:   obtain the reward r(t) and the new observation o_{t+1}.
7:   input (s_t, ρ^d(t)) to the DQN.
8:   obtain the output vector Q(s_t) from the DQN.
9:   update the state s_{t+1} ← s_t ∪ {o_{t+1}} − {o_{t−τ}}.
10:  put the sample (s_t, ρ^d(t), r(t), s_{t+1}) into D.
11:  select m samples from D.
12:  for j ∈ [m] do
13:    Q_j ← r(j) + η · max_ρ Q(s_{j+1}, ρ; θ)
14:  end for
15:  update the weight θ by gradient descent on the loss function (1/m) Σ_{j=1}^{m} [Q_j − Q(s_j, ρ^d(t); θ)]².
16: end for

4) Discussion and Comparison of the Two Solutions: The adversarial-bandit-based solution allows the defender to learn a defense level quickly by only accessing its incurred losses. However, we note that in such a setting the adversary (attacker) is generally assumed to be oblivious, i.e., the losses x_i(t) are generated without considering the past actions of the learner (defender). Nevertheless, the results obtained in this paper can be generalized to the non-oblivious setting without changing the main analysis. In comparison, the DQN-based solution considers the past τ observations of the defender, which involve the previously detected stealthy movements and activities on the part of the attacker. Therefore, this method may be more suitable for defenders with the ability to monitor or track the attacker's actions.

VI. THEORETICAL ANALYSIS

This section presents a theoretical analysis of the proposed APT defense framework, including the game-theoretic properties it provides and the regret bounds it guarantees with respect to different defense strategies.

A. Proofs of Theorems

We first present the complete proof of Theorem 1, i.e., that any solution satisfying the differential Equation 10 forms a Nash equilibrium of the APT rivalry game.

Theorem 1 Proof: We first prove the optimality of (ϕ_a(t), ϕ_d(t)). Substituting Equation 10 into Equation 7, we have

$$\frac{\partial u(z; v^d)}{\partial z} = \frac{1 - F_a(\phi_a(z))}{\phi_d(z)} \cdot \left(v^d - \phi_d(z)\right). \tag{24}$$

By the previous argument, ϕ_a(t) is strictly increasing. Therefore, for any 0 < z < h, 1 − F_a(ϕ_d(z)) > 0. That means

$$\frac{\partial u(z; v^d)}{\partial z}\left(v^d - \phi_d(z)\right) \ge 0. \tag{25}$$

In addition, the above equality holds if and only if v^d = ϕ_d(z); otherwise, the inequality is strictly positive. Hence, the defender's best response to the attacker is to choose a z^{d∗} such that ϕ_d(z^{d∗}) = v^d. A symmetric argument demonstrates that ϕ_a(z^{a∗}) = v^a is indeed the best response for the attacker.

Next, we consider the endpoints of the equilibrium functions. First, observe that it cannot be the case that ϕ_a(0) = ϕ_d(0) = 0. Otherwise, both players would have a positive discrete probability of strategy adjustment at T = 0. If this were the case, each player would have an incentive to wait infinitesimally so as to gain at an extremely small cost. For similar reasons, the attacker and defender must have the same maximum strategy adjustment time t_max. That is, at the upper endpoint,

$$\phi_d(t_{\max}) = \phi_a(t_{\max}) = h. \tag{26}$$

To complete the proof, we show that ϕ_d(t_max) = ϕ_a(t_max) = h holds for some t_max ∈ (0, ∞). Suppose this were not the case, which implies that ϕ_a(t) and ϕ_d(t) are bounded away from h. Then, consider any interval [v, w] ⊂ (0, ∞). Since ϕ_a(t) is increasing, according to Equation 10 we have

$$\frac{\phi_a(v)\,F_d(\phi_d(t))\,\phi_d(t)}{1 - F_d(\phi_a(t))} > 1 > \frac{\phi_d(w)\,F_d(\phi_d(t))\,\phi_d(t)}{1 - F_d(\phi_a(t))}. \tag{27}$$

Integrating over the interval [v, w], it follows that

$$\phi_a(w)\,\ln\!\left(\frac{1 - F_d(\phi_a(v))}{1 - F_d(\phi_a(w))}\right) > w - v > \phi_a(v)\,\ln\!\left(\frac{1 - F_d(\phi_a(v))}{1 - F_d(\phi_a(w))}\right). \tag{28}$$

If ϕ_a(t) and ϕ_d(t) are bounded away from h, then the left expression is bounded, which contradicts the fact that w in the middle expression can be arbitrarily large.

Then suppose ϕ_d(t̄^d) = h. From the second inequality, we have that when w → t̄^d, the right expression becomes unbounded. This implies t̄^d = ∞. By a symmetric argument for the attacker, we have t̄^a = ∞. The above analysis shows

$$\lim_{t \to \infty} \phi_a(t) = \lim_{t \to \infty} \phi_d(t) = h. \tag{29}$$

Therefore, Equation 26 holds. This completes the proof. □

We then explore the change of the defender's regret led by mechanism M1 and derive the upper bound of the regret as a function of the number of learning rounds.

Theorem 3: Mechanism M1 guarantees that

$$\mathbb{E}[G_{\mathcal{M}_1}] - G_{\min} \le \frac{K \ln K}{\gamma} + (e - 2)\gamma T, \tag{30}$$

for any K > 0, T > 0 and γ ∈ [0, 1], where e = 2.718 . . . is Euler's number.
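Here, with our shorthand reading of the notation (consistent with how these quantities are used in the proof that follows, where the losses are taken to lie in [0, 1]),

$$G_{\mathcal{M}_1} = \sum_{t=1}^{T} x_{i_t}(t), \qquad G_{\min} = \min_{i \in \{1,\dots,K\}} \sum_{t=1}^{T} x_i(t),$$

where i_t denotes the defense level chosen by mechanism M1 in round t and x_i(t) is the loss of defense level i at round t, so that E[G_{M1}] − G_min is exactly the expected regret bounded in Theorems 3 and 4.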


Proof: Let Φ(t) = Σ_{i=1}^{K} w_i(t) denote the sum of the weights at time t. Suppose the loss sequence generated by mechanism M1 is {l_{i_1}, . . . , l_{i_T}}. Then let us consider the change of the total weight between time t and t + 1:

$$\begin{aligned}
\frac{\Phi(t+1)}{\Phi(t)} &= \sum_{i=1}^{K} \frac{w_i(t+1)}{\Phi(t)} \\
&= \sum_{i=1}^{K} \frac{w_i(t)}{\Phi(t)} \exp\!\left(-\frac{\gamma \hat{x}_i(t)}{K}\right) \\
&= \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma} \exp\!\left(-\frac{\gamma \hat{x}_i(t)}{K}\right) \\
&\le \sum_{i=1}^{K} \frac{p_i(t)}{1-\gamma} \left[1 - \frac{\gamma \hat{x}_i(t)}{K} + (e-2)\left(\frac{\gamma \hat{x}_i(t)}{K}\right)^{2}\right] \\
&\le 1 - \frac{\gamma/K}{1-\gamma}\, x_{i_t}(t) + \frac{(e-2)\gamma^{2}/K^{2}}{1-\gamma} \sum_{i=1}^{K} \hat{x}_i(t). 
\end{aligned} \tag{31}$$

The first three equalities are obtained from the definitions of w_i(t) and p_i(t). The first inequality uses the fact that e^{−x} ≤ 1 − x + (e − 2)x² for x ∈ [0, ∞). The last inequality is based on the observations that

$$\sum_{i=1}^{K} p_i(t)\hat{x}_i(t) = p_{i_t}(t)\,\frac{x_{i_t}(t)}{p_{i_t}(t)} = x_{i_t}(t); \tag{32}$$

$$\sum_{i=1}^{K} p_i(t)\hat{x}_i(t)^{2} = p_{i_t}(t)\,\frac{x_{i_t}(t)}{p_{i_t}(t)}\,\hat{x}_{i_t}(t) \le \hat{x}_{i_t}(t) \le \sum_{i=1}^{K} \hat{x}_i(t). \tag{33}$$

We then take the logarithm of Equation 31 and use the inequality e^{x} ≥ 1 + x to obtain

$$\ln\frac{\Phi(t+1)}{\Phi(t)} \le -\frac{\gamma/K}{1-\gamma}\, x_{i_t}(t) + \frac{(e-2)\gamma^{2}/K^{2}}{1-\gamma} \sum_{i=1}^{K} \hat{x}_i(t). \tag{34}$$

By induction, we can write

$$\ln\frac{\Phi(T+1)}{\Phi(1)} \le -\frac{\gamma/K}{1-\gamma}\, G_{\mathcal{M}_1} + \frac{(e-2)\gamma^{2}/K^{2}}{1-\gamma} \sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t). \tag{35}$$

On the other hand, for any defense level l_j, we have

$$\ln\frac{\Phi(T+1)}{\Phi(1)} \ge \ln\frac{w_j(T+1)}{\Phi(1)} = -\frac{\gamma}{K}\sum_{t=1}^{T} \hat{x}_j(t) - \ln K. \tag{36}$$

Combining the two bounds in Equation 35 and Equation 36, we have

$$G_{\mathcal{M}_1} - \frac{(e-2)\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t) \le (1-\gamma)\sum_{t=1}^{T} \hat{x}_j(t) + \frac{(1-\gamma)K\ln K}{\gamma}. \tag{37}$$

Note that x̂_i(t) is an unbiased estimator of x_i(t), namely

$$\mathbb{E}_{i_t \sim p(t)}\!\left[\hat{x}_i(t)\right] = p_i(t)\cdot \frac{x_i(t)}{p_i(t)} + \left(1 - p_i(t)\right)\cdot 0 = x_i(t). \tag{38}$$

So, taking the expectation of both sides in Inequality 37, we have

$$\begin{aligned}
\mathbb{E}[G_{\mathcal{M}_1}] - G_{\min} &\le \frac{(1-\gamma)K\ln K}{\gamma} + \frac{(e-2)\gamma}{K}\sum_{t=1}^{T}\sum_{i=1}^{K} x_i(t) \\
&\le \frac{(1-\gamma)K\ln K}{\gamma} + \frac{(e-2)\gamma}{K}\cdot TK \\
&\le \frac{K\ln K}{\gamma} + (e-2)\gamma T. 
\end{aligned} \tag{39}$$

This completes the proof. □

Based on the above theorem, we can derive the upper bound of the expected regret by minimizing the two error terms in Equation 39.

Theorem 4: The expected regret of mechanism M1 is at most 2√(e − 2) √(T K ln K).

Proof: There are two error terms in Equation 39. One corresponds to the learning inaccuracy—the larger γ is, the higher (e − 2)γT becomes. The other relates to the learning overhead—the smaller γ is, the higher K ln K/γ becomes. Choosing

$$\gamma = \sqrt{\frac{K \ln K}{(e-2)T}}$$

to equalize the two error terms, we have

$$\mathbb{E}[G_{\mathcal{M}_1}] - G_{\min} \le 2\sqrt{e-2}\,\sqrt{T K \ln K}. \tag{40}$$

That is, the average loss of mechanism M1 approaches the average loss of the best fixed defense level at a rate of O(1/√T). This completes the proof. □

B. Discussions and the Security Insights of the Theorems

Theorem 1 captures the nature of the informational asymmetry in the APT rivalry game. It proves the existence of a family of equilibria characterized by differential equations. From a security standpoint, this theorem demonstrates that the players' strategy adjustment timing does matter in terms of information leakage, which helps to explain why constantly upgrading security policies may not always adequately defend an organization from being compromised [6], [7]. Theorem 1 provides suggestions for the defender's routine practical strategy upgrades by indicating the optimal strategy adjustment timing in the light of the Nash equilibrium.

Meanwhile, Theorems 3 and 4 state that the proposed defending algorithm leads to an expected regret bound of O(√T). Regret measures how much the defender regrets, in hindsight, not having followed the algorithm's advice. The security interpretation of this regret bound is that it limits the gap between the real loss suffered by the defender and the ideal loss in theory. The regret grows only at a sublinear rate of O(√T) if the defender distributes his or her resources following the algorithm's recommendation. These theorems provide a theoretical guarantee for the defender to dynamically adjust the defense strategy against APTs.
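To make the analysis above concrete, the weight-and-probability update that Theorems 3 and 4 analyze can be sketched as follows, in the style of the Exp3 adversarial-bandit algorithm [47]. This is a minimal illustrative sketch under our own assumptions: the function names and the loss oracle observe_loss are placeholders rather than the paper's code, and the defense levels are simply indexed 1, . . . , K with losses normalized to [0, 1].

```python
# Exp3-style sketch of the exponential-weighting update analyzed in
# Theorems 3 and 4 (weights w_i, mixed probabilities p_i, importance-weighted
# loss estimates; cf. Equations 31 and 38).
import math
import random

def run_m1_style_bandit(K: int, T: int, observe_loss):
    # Tuned exploration rate from Theorem 4, capped at 1 for small horizons.
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 2) * T)))
    w = [1.0] * K            # initial weights w_i(1) = 1
    total_loss = 0.0
    for t in range(1, T + 1):
        s = sum(w)
        # Mix exponential weights with uniform exploration of probability gamma.
        p = [(1 - gamma) * w[i] / s + gamma / K for i in range(K)]
        i_t = random.choices(range(K), weights=p)[0]   # chosen defense level
        x = observe_loss(i_t, t)                       # only the chosen level's loss is revealed
        total_loss += x
        x_hat = x / p[i_t]                             # unbiased estimate (Eq. 38)
        w[i_t] *= math.exp(-gamma * x_hat / K)         # exponential weight update (cf. Eq. 31)
    return total_loss

# Example usage with a toy loss oracle (illustration only):
# losses = [[random.random() for _ in range(10)] for _ in range(50_000)]
# run_m1_style_bandit(K=10, T=50_000, observe_loss=lambda i, t: losses[t - 1][i])
```

For instance, with K = 10 defense levels and T = 50,000 rounds (the scale used in the experiments below), the tuned rate is γ ≈ 0.025 and the bound in Equation 40 evaluates to roughly 1.8 × 10³, i.e., an average per-round regret of about 0.036 for losses in [0, 1].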


Fig. 3. Fully rational ( p = 0) defender’s expected utility under equilibrium with respect to different valuation distributions.

VII. EXPERIMENT AND ANALYSIS

In this section, we evaluate the performance of the proposed methods. We first examine the game-theoretic properties of the APT rivalry game, investigating the existence of Nash equilibria with different valuation distributions and rationalities for each of the players. Then we look into the learning mechanisms to see whether the defender can learn a near-optimal defense strategy using the proposed learning scheme. The experimental results as well as the corresponding analyses are presented in each subsection.

A. Experiments for the APT Game

1) Experimental Setup: We conducted the experiments with three representative distributions from which the attacker and defender draw their valuations independently. The three distributions are given in Equation 41, where F_1 is a uniform distribution with support [0, 1], F_2 is an exponential distribution with parameter λ = 1 on [0, ∞), and F_3 is a distribution with probability density f_X(x) = 2x, x ∈ [0, 1]. The densities of F_2 and F_3 are monotonically decreasing and monotonically increasing on their supports, respectively, which represents the players' different beliefs in the probability of getting low/high-value information.

$$F_1(v) = v, \quad v \in [0, 1];$$
$$F_2(v) = 1 - e^{-v}, \quad v \in [0, \infty);$$
$$F_3(v) = v^{2}, \quad v \in [0, 1]. \tag{41}$$

For a clear and simple presentation, we only considered symmetric equilibria in the experiment. The symmetric equilibria derived from Theorem 1 with respect to the three distributions are given in Equation 42:

$$T_1(v) = -v - \ln(1 - v), \quad v \in [0, 1];$$
$$T_2(v) = \tfrac{1}{2} v^{2}, \quad v \in [0, \infty);$$
$$T_3(v) = -2v + \ln\!\left(\frac{1+v}{1-v}\right), \quad v \in [0, 1]. \tag{42}$$

The results are shown in Figures 3 to 5.
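As a quick numerical sanity check on Equation 42 (a minimal sketch of our own, not part of the paper's experimental code), the closed-form timings can be evaluated to confirm that each T_i(v) starts at zero and is increasing on its support, which is the qualitative property visualized in Figure 3(a):

```python
# Evaluate the symmetric equilibrium timings of Equation 42 on a grid and
# check that each one is non-decreasing on its support (illustrative check only;
# T2's unbounded support is truncated to [0, 1) here).
import math

def T1(v): return -v - math.log(1 - v)                  # uniform valuations F1
def T2(v): return 0.5 * v * v                           # exponential valuations F2
def T3(v): return -2 * v + math.log((1 + v) / (1 - v))  # increasing-density valuations F3

grid = [i / 1000 for i in range(0, 1000)]               # v in [0, 0.999] to avoid v = 1
for name, T in (("T1", T1), ("T2", T2), ("T3", T3)):
    values = [T(v) for v in grid]
    increasing = all(b >= a for a, b in zip(values, values[1:]))
    print(name, "non-decreasing on grid:", increasing, "| T(0.5) =", round(T(0.5), 4))
```

For example, T_1(0.5) ≈ 0.193, T_2(0.5) = 0.125 and T_3(0.5) ≈ 0.099, and all three functions grow with the player's valuation v.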


Fig. 4. Inertial ( p = 0.1, 0.5, 0.9) defender’s expected utility under equilibrium with respect to valuations sampled from distribution F1 .

Fig. 5. Inertial ( p = 0.1, 0.5, 0.9) defender’s expected utility under equilibrium with respect to valuations sampled from distribution F3 .

2) Results and Analyses: Figure 3 illustrates the change of the defender's expected utility when the attacker and defender are fully rational. Figure 3(a) depicts the distributions of the strategy adjustment timing indicated by the Nash equilibria. Overall, the three timing functions increase on their supports, which is consistent with the properties of T_i(v) in Lemma 1. Figures 3(b) to 3(d) show the defender's expected utility in 100 rounds of the APT rivalry game, given that the players' valuations were drawn from F_1(v), F_2(v), F_3(v), respectively. From the figures, we can see that the defender's expected utility in the equilibrium is always higher than that with the random strategy. The expected utility with the random strategy fluctuates around 0, while the utility with the equilibrium strategy is strictly positive. In addition, although the density of the exponential distribution F_2(v) decreases on [0, ∞), this provides a chance for the defender to get high-value information from his opponent's strategy adjustment. Therefore, the defender's expected utility when the players' valuations are drawn from the exponential distribution is higher than that with the other two distributions.

Figure 4 demonstrates the defender's expected utility when the players are partially rational, given that their valuations were drawn from the uniform distribution F_1(v). Figure 4(a) shows the variation tendency of the equilibrium timing T_1(v) with different parameters p. From the figure, we observe that when the players are fully rational (p = 0), the equilibrium timing goes to infinity as v approaches 1. In the symmetric equilibrium, this indicates that, if the defender thinks the next strategy adjustment of its opponent will leave valuable information, then he will wait for a longer period of time before making this strategy adjustment. However, as the players become inertial, the probability that they never make a strategy adjustment increases. As a result, the strategy adjustment timing in the equilibrium goes down as p increases. Note that when p = 0.1, 0.5 and 0.9, the value of T_1(v; p) at v = 1 is no longer infinity. This indicates that if the players are inertial, they may also make a strategy adjustment immediately, even if waiting for a while is likely to provide them with valuable information about their opponent.

Figures 4(b) to 4(d) present the expected utility of the defender under different rationality settings. In these figures, the expected utility with the equilibrium strategy is still higher than that with the random strategy. In addition, although the value of the expected utility in equilibrium increases with an increase of p, these values can only be obtained under the circumstance that the players happen to be rational. Therefore, the overall expected utility still decreases as the players become more and more inertial.

Figure 5 shows the results when the players' valuations were drawn from the increasing-density distribution F_3(v). Figure 5(a) describes the change of the equilibrium timing T_3(v; p) with respect to different values of p. Figures 5(b) to 5(d) illustrate the detailed expected utility of the defender after 100 rounds of the APT rivalry game. The result patterns are similar to those in Figure 4. From the figures, we can see that, although the distributions F_1(v), F_2(v), F_3(v) vary in their supports, the equilibrium timings T_1(v), T_2(v), T_3(v) fall into similar patterns. In addition, the defender's expected utility is always higher with the Nash equilibria, showing that the Nash solution leads to the optimal result in the proposed APT rivalry game.

B. Experiments for the Learning Mechanisms

1) Experimental Setup: We also conducted experiments to investigate whether the defender can learn a near-optimal defense strategy through the proposed learning mechanisms. For mechanism M1, we assumed that there are |P| = 10 defense points, across which the defender and attacker can allocate their defense/attack resources. The total amount of their resources was fixed, and the resources spent on each defense point were represented as real numbers ρ_i^a, ρ_i^d ∈ [0, 1]. In the experiment, we set different numbers of defense levels |L| and varied the amount of defense resources used by the defender. We then looked into how the defender's regret changed when mechanism M1 was used to select the defense levels, and we also examined the robustness of mechanism M1 when the defender had limited resources with which to defend against the attacker.

For mechanism M2, we assumed that the defender and the attacker had a certain amount of defense/attack resources, which could be freely allocated to each defense point. In this scenario, the defender not only cared about the overall defense level of the system, but also needed to pay attention to the specific resource allocation for each defense point. To this end, we varied the defender's resources to see the corresponding performance of the proposed mechanism. We also varied the number of defense points and the distribution of the importance factors to check the mechanism's robustness. The results are presented in Figures 6 to 8.

2) Results and Analyses: Figure 6 illustrates the rationale of mechanism M1 with the number of defense levels set to |L| = 10. Figure 6(a) depicts the defender's cumulative loss incurred by using mechanism M1 and the optimal cumulative loss in hindsight. From the figure, we can see that the regret gap between mechanism M1 and the optimal strategy increases as more rounds are added to the game. However, the rate of increase drops, as shown in Figure 6(b). Figure 6(b) describes the change of the defender's regret as well as the theoretical regret bound. The theoretical regret bound is a linear function of the game round T, while the actual regret incurred by mechanism M1 stays strictly below this bound. In addition, the rate of increase of the actual regret is slower than the slope of the linear function.

To thoroughly explore this mechanism, we traced the variation tendency of each defense level's weight during the course of mechanism M1. As shown in Figure 6(c), at first, the weights of the 10 defense levels are all equal and normalized to [0, 1]. As the game goes on, the mechanism iteratively updates the weight of each defense level, calibrating the weights according to the losses incurred by choosing them. After 50,000 rounds, the weight of the best fixed defense level approaches 1, while the weights of the other defense levels decrease to 0.

In Figure 6(d), we changed the resources of the defender to see whether the proposed mechanism still worked in such a circumstance. As APT attackers are always well-prepared and well-funded, we fixed the attacker's resources and gradually reduced the resources available to the defender. Figure 6(d) shows the regret incurred with the defender's resources at 100%, 80%, 50%, and 20% of the attacker's. The figure shows that with fewer available defense resources, the regret of the defender increases, meaning that the defender suffers more due to a lack of resources. Note that with only 20% of the attacker's resources, the defender still reaches a barely satisfactory regret, which is twice as much as that with 100% resources. This is due to the oblivious assumption on the attacker in the adversarial bandit setting. If the attacker keeps track of the defender's strategy and estimates the importance he has attached to each defense point, the defender will suffer heavily as a result of reduced resources (Figure 8(a)).


Fig. 6. Performance of mechanism M1 with |L| = 10 actions and different defense resources of the defender.

Fig. 7. Performance of mechanism M1 with varying defense levels L.

Fig. 8. Performance of mechanism M2 with varying defense resources, defense points and importance values.

Figure 7 demonstrates how the regret changes when the number of defense levels increases. In APT attacks, there are multiple vulnerable points that need to be protected, which may lead to many defense levels being available for selection. Therefore, we increased the number of defense levels |L| to see whether mechanism M1 still worked. Figure 7(a) and Figure 7(b) present the regret incurred when |L| = 20 and |L| = 50, respectively, showing that increasing the number of defense levels does not change the relationship between the theoretical regret bound and the actual regret. However, as the number of defense levels grows, the defender's actual regret increases. Figure 7(c) captures the increasing tendency of the regret bound and the actual regret at T = 10,000 as |L| increases from 10 to 50. The increase in actual regret can be explained by the fact that the mechanism explores more defense levels given more possible choices. In addition, the regret bound grows at a rate of O(|L| log |L|), which agrees with the theoretical analysis in Equation 39. Figure 7(d) shows the trends of the actual regret incurred and the average regret for each defense level. From the figure, we can observe that although the actual regret doubled or tripled (solid lines), the average regret on each specific defense level increases quite slowly (dashed lines). This again shows that the increase in actual regret is caused by the increase in the number of defense levels, demonstrating the robustness of the proposed mechanism.

The experimental results relating to the second mechanism M2 are presented in Figure 8. Figure 8(a) illustrates the defender's accumulated reward as the number of game rounds grows. In the figure, the defender's reward increases as the game goes on, showing the effectiveness of mechanism M2. In this experiment, we assumed that the attacker employs a memory to record the past resource allocations of the defender, and that she has the ability to infer which defense points are more important to her opponent. As shown in Figure 8(a), reducing the defender's resources drastically influences his reward, which provides a stark contrast to Figure 6(d). When the defender's resources sit at 60% of the attacker's, it seems that the defender cannot competently defend against the attacks launched by his opponent. Figure 8(b) shows the change in the defender's reward when the number of defense points varies. With an increase in the number of defense points, the defender's reward also grows. This is because the enlarged "battlefields" provide more opportunities for the defender to win and gain. Note that this experiment was conducted under the assumption that the defender has enough defense resources. Without this assumption, the defender's reward may not increase.


Figure 8(c) considers the scenario where the defender may attach different importance values to different defense points. In this experiment, we changed the distribution of the importance factors used to compute the rewards, and examined the learning ability of the proposed mechanism. We set |P| = 5 defense points, and the total amount of importance factors was 15. In the uniform condition, the importance factors of the defense points are I = {3, 3, 3, 3, 3}, while those in the increased and skewed conditions were I = {1, 2, 3, 4, 5} and I = {1, 1, 1, 6, 6}, respectively. From the figure, we can see that the mechanism learns fast when the importance factors are uniformly distributed. In the other two conditions, the mechanism took some rounds to figure out the important defense points before starting to win. In addition, in the uniform condition, the defender received the highest reward. This is because, in the other conditions, the attacker and defender allocated more resources to the important defense points. This made for stiff competition, thus reducing the defender's chance to win.

In Figure 8(d), we considered the fact that winning or losing a specific defense point may not be deterministic; it could possibly be probabilistic, depending on the resources spent by both players. For example, if the defender allocated 3 resources to a defense point and the attacker spent 7, then the defender had a probability of 3/(3 + 7) = 0.3 of winning that point. Figure 8(d) shows that the defender receives more rewards in the deterministic setting. This is because, in the deterministic setting, the mechanism tries to improve the selected strategy when the players tie with each other on a specific defense point, while, in the probabilistic setting, the same strategy may lead to a positive reward and thus deceive the training process. We also observed that the reward when the players have 8 resources increases faster than that with 10 resources. The reason is that, given a fixed number of defense points, the network with 8 resources has fewer output combinations. Therefore, the mechanism learns faster, and the accumulated rewards increase earlier.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we developed an APT rivalry game to describe the interactions between attackers and defenders. The proposed game considers the information leaks incurred when players adjust their strategies, and gives an explanation as to why players often become stronger during the course of an APT campaign. We derived the necessary conditions under which a family of equilibria holds and showed that these equilibria actually imply the optimal timing for each player to make a strategy adjustment. In addition, we proposed two learning mechanisms to help defenders find the best defense levels as well as the optimal resource allocation. The results from the experimental games show that the equilibria of the game lead to higher utility and that the proposed mechanisms can indeed learn from the players' past experiences. In future work, we plan to explore possible cooperation between attackers and defenders in the context of bargaining games, which may provide new insights into ways to defend against APTs.

REFERENCES

[1] M. K. Daly, "Advanced persistent threat," Usenix, vol. 4, no. 4, pp. 2013–2016, 2009.
[2] C. Tankard, "Advanced persistent threats and how to monitor and deter them," Netw. Secur., vol. 2011, no. 8, pp. 16–19, 2011.
[3] A. Lemay, J. Calvet, F. Menet, and J. M. Fernandez, "Survey of publicly available reports on advanced persistent threat actors," Comput. Secur., vol. 72, pp. 26–59, Jan. 2018.
[4] S. Moothedath et al., "A game-theoretic approach for dynamic information flow tracking to detect multistage advanced persistent threats," IEEE Trans. Autom. Control, vol. 65, no. 12, pp. 5248–5263, Dec. 2020.
[5] L.-X. Yang, P. Li, X. Yang, and Y. Y. Tang, "A risk management approach to defending against the advanced persistent threat," IEEE Trans. Dependable Secure Comput., vol. 17, no. 6, pp. 1163–1172, Nov. 2020.
[6] E. Cole, Advanced Persistent Threat: Understanding the Danger and How to Protect Your Organization, 1st ed. Oxford, U.K.: Syngress Publishing, 2012.
[7] S. Quintero-Bonilla and A. Martín del Rey, "A new proposal on the advanced persistent threat: A survey," Appl. Sci., vol. 10, no. 11, p. 3874, Jun. 2020.
[8] P. Mell, K. Scarfone, and S. Romanosky, "Common vulnerability scoring system," IEEE Secur. Privacy, vol. 4, no. 6, pp. 85–89, Nov./Dec. 2006.
[9] M. Ussath, D. Jaeger, F. Cheng, and C. Meinel, "Advanced persistent threats: Behind the scenes," in Proc. Annu. Conf. Inf. Sci. Syst. (CISS), Mar. 2016, pp. 181–186.
[10] Z. Xu, S. Ray, P. Subramanyan, and S. Malik, "Malware detection using machine learning based analysis of virtual memory access patterns," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 169–174.
[11] D. Tosh, S. Sengupta, C. A. Kamhoua, and K. A. Kwiat, "Establishing evolutionary game models for cyber security information exchange (CYBEX)," J. Comput. Syst. Sci., vol. 98, pp. 27–52, Dec. 2018.
[12] A. Alshamrani, S. Myneni, A. Chowdhary, and D. Huang, "A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities," IEEE Commun. Surveys Tuts., vol. 21, no. 2, pp. 1851–1877, 2nd Quart., 2019.
[13] C. T. Do et al., "Game theory for cyber security and privacy," ACM Comput. Surv., vol. 50, no. 2, pp. 1–37, May 2017, doi: 10.1145/3057268.
[14] L. Zhang, T. Zhu, P. Xiong, W. Zhou, and P. S. Yu, "More than privacy: Adopting differential privacy in game-theoretic mechanism design," ACM Comput. Surv., vol. 54, no. 7, pp. 1–37, Jul. 2021.
[15] S. Rass, S. König, and E. Panaousis, "Cut-the-rope: A game of stealthy intrusion," in Decision and Game Theory for Security, T. Alpcan, Y. Vorobeychik, J. S. Baras, and G. Dán, Eds. Cham, Switzerland: Springer, 2019, pp. 404–416.
[16] R. Zhang and Q. Zhu, "FlipIn: A game-theoretic cyber insurance framework for incentive-compatible cyber risk management of Internet of Things," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 2026–2041, 2020.
[17] V. Matta, M. Di Mauro, M. Longo, and A. Farina, "Cyber-threat mitigation exploiting the birth–death–immigration model," IEEE Trans. Inf. Forensics Security, vol. 13, no. 12, pp. 3137–3152, Dec. 2018.
[18] T. Zhu, D. Ye, W. Wang, W. Zhou, and P. Yu, "More than privacy: Applying differential privacy in key areas of artificial intelligence," IEEE Trans. Knowl. Data Eng., vol. 34, no. 6, pp. 2824–2843, Jun. 2021.
[19] L. Huang and Q. Zhu, "Analysis and computation of adaptive defense strategies against advanced persistent threats for cyber-physical systems," in Decision and Game Theory for Security, L. Bushnell, R. Poovendran, and T. Başar, Eds. Cham, Switzerland: Springer, 2018, pp. 205–226.
[20] R. Kumar, S. Singh, and R. Kela, "Analyzing advanced persistent threats using game theory: A critical literature review," in Critical Infrastructure Protection XV. Cham, Switzerland: Springer, 2022, pp. 45–69.
[21] Z. Wan, J.-H. Cho, M. Zhu, A. H. Anwar, C. A. Kamhoua, and M. P. Singh, "Foureye: Defensive deception against advanced persistent threats via hypergame theory," IEEE Trans. Netw. Service Manage., vol. 19, no. 1, pp. 112–129, Mar. 2022.
[22] D. Ye, T. Zhu, S. Shen, and W. Zhou, "A differentially private game theoretic approach for deceiving cyber adversaries," IEEE Trans. Inf. Forensics Security, vol. 16, pp. 569–584, 2021.
[23] W. Tian, M. Du, X. Ji, G. Liu, Y. Dai, and Z. Han, "Honeypot detection strategy against advanced persistent threats in industrial Internet of Things: A prospect theoretic game," IEEE Internet Things J., vol. 8, no. 24, pp. 17372–17381, Dec. 2021.
[24] M. van Dijk, A. Juels, A. Oprea, and R. L. Rivest, "FLIPIT: The game of 'stealthy takeover,'" J. Cryptol., vol. 26, no. 4, pp. 655–713, Oct. 2013.
[25] L.-X. Yang, P. Li, Y. Zhang, X. Yang, Y. Xiang, and W. Zhou, "Effective repair strategy against advanced persistent threat: A differential game approach," IEEE Trans. Inf. Forensics Security, vol. 14, no. 7, pp. 1713–1728, Jul. 2018.
[26] L. Xiao, D. Xu, N. B. Mandayam, and H. V. Poor, "Attacker-centric view of a detection game against advanced persistent threats," IEEE Trans. Mobile Comput., vol. 17, no. 11, pp. 2512–2523, Nov. 2018.
[27] S. Sengupta, A. Chowdhary, D. Huang, and S. Kambhampati, "General sum Markov games for strategic detection of advanced persistent threats using moving target defense in cloud networks," in Proc. Int. Conf. Decis. Game Theory Secur., 2019, pp. 492–512.
[28] J. Pawlick, S. Farhang, and Q. Zhu, "Flip the cloud: Cyber-physical signaling games in the presence of advanced persistent threats," in Decision and Game Theory for Security. Cham, Switzerland: Springer, 2015, pp. 289–308.


[29] P. Hu, H. Li, H. Fu, D. Cansever, and P. Mohapatra, "Dynamic defense strategy against advanced persistent threat with insiders," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2015, pp. 747–755.
[30] S. Rass, S. König, and S. Schauer, "Defending against advanced persistent threats using game-theory," PLoS ONE, vol. 12, no. 1, pp. 1–43, Jan. 2017.
[31] T. Zhu, D. Ye, Z. Cheng, W. Zhou, and P. S. Yu, "Learning games for defending advanced persistent threats in cyber systems," IEEE Trans. Syst., Man, Cybern., Syst., early access, Oct. 19, 2022, doi: 10.1109/TSMC.2022.3211866.
[32] A. Gupta, T. Başar, and G. A. Schwartz, "A three-stage Colonel Blotto game: When to provide more information to an adversary," in Decision and Game Theory for Security. Cham, Switzerland: Springer, 2014, pp. 216–233.
[33] C. Paulsen, "Glossary of key information security terms," Nat. Inst. Standards Technol., Gaithersburg, MD, USA, Tech. Rep. NIST IR 7298, 2018.
[34] R. B. Myerson, Game Theory: Analysis of Conflict. Cambridge, MA, USA: Harvard Univ. Press, 1991.
[35] A. S. Chivukula, X. Yang, W. Liu, T. Zhu, and W. Zhou, "Game theoretical adversarial deep learning with variational adversaries," IEEE Trans. Knowl. Data Eng., vol. 33, no. 11, pp. 3568–3581, Nov. 2021.
[36] H. Jiang, U. V. Shanbhag, and S. P. Meyn, "Distributed computation of equilibria in misspecified convex stochastic Nash games," IEEE Trans. Autom. Control, vol. 63, no. 2, pp. 360–371, Feb. 2018.
[37] L. Zhao, Q. Wang, Q. Zou, Y. Zhang, and Y. Chen, "Privacy-preserving collaborative deep learning with unreliable participants," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 1486–1500, 2019.
[38] L. Zhang, T. Zhu, P. Xiong, W. Zhou, and P. Yu, "A robust game-theoretical federated learning framework with joint differential privacy," IEEE Trans. Knowl. Data Eng., early access, Jan. 4, 2022, doi: 10.1109/TKDE.2021.3140131.
[39] J. Bulow and P. Klemperer, "The generalized war of attrition," Amer. Econ. Rev., vol. 89, no. 1, pp. 175–189, Mar. 1999.
[40] J. M. Smith, "The theory of games and the evolution of animal conflicts," J. Theor. Biol., vol. 47, no. 1, pp. 209–221, Sep. 1974.
[41] J. F. Nash Jr., "Equilibrium points in n-person games," Proc. Nat. Acad. Sci. USA, vol. 36, no. 1, pp. 48–49, 1950.
[42] K. C. Nguyen, T. Alpcan, and T. Basar, "Security games with incomplete information," in Proc. IEEE Int. Conf. Commun., Jun. 2009, pp. 1–6.
[43] D. Pavlovic, "Gaming security by obscurity," in Proc. New Secur. Paradigms Workshop, 2011, pp. 125–140.
[44] D. Fudenberg and J. Tirole, "A theory of exit in duopoly," Econometrica, vol. 54, no. 4, pp. 943–960, 1986.
[45] D. Myatt, "Instant exit from the war of attrition," Economics Group, Nuffield College, University of Oxford, Oxford, U.K., Economics Papers 9922, 1999.
[46] A. Sanjab, W. Saad, and T. Basar, "A game of drones: Cyber-physical security of time-critical UAV applications with cumulative prospect theory perceptions and valuations," IEEE Trans. Commun., vol. 68, no. 11, pp. 6990–7006, Nov. 2020.
[47] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM J. Comput., vol. 32, no. 1, pp. 48–77, 2002.
[48] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[49] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[50] H. Yin and S. Pan, "Knowledge transfer for deep reinforcement learning with hierarchical experience replay," in Proc. AAAI, Feb. 2017, vol. 31, no. 1, pp. 1–7.

Lefeng Zhang received the B.Eng. and M.Eng. degrees from the Zhongnan University of Economics and Law, China, in 2016 and 2019, respectively. He is currently pursuing the Ph.D. degree with the University of Technology Sydney, Australia. His research interests include game theory and privacy preserving.

Tianqing Zhu received the B.Eng. and M.Eng. degrees from Wuhan University, China, in 2000 and 2004, respectively, and the Ph.D. degree in computer science from Deakin University, Australia, in 2014. She was a Lecturer at the School of Information Technology, Deakin University, from 2014 to 2018. She is currently an Associate Professor with the School of Computer Science, University of Technology Sydney, Australia. Her research interests include privacy preserving, data mining, and network security.

Farookh Khadeer Hussain (Member, IEEE) is currently an Associate Professor with the School of Software, University of Technology Sydney. He is also an Associate Member of the Advanced Analytics Institute and a Core Member of the Centre for Artificial Intelligence. His research interests include trust-based computing, cloud of things, blockchains, and machine learning. He has published widely in these areas in top journals, such as Future Generation Computer Systems (FGCS), The Computer Journal, Journal of Computer and System Sciences (JCSS), the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, and the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS.

Dayong Ye received the M.Sc. and Ph.D. degrees in computer science from the University of Wollongong, Wollongong, NSW, Australia, in 2009 and 2013, respectively. He is currently a Research Fellow of cyber-security with the University of Technology Sydney, Ultimo, NSW, Australia. His research interests include differential privacy, privacy preservation, and multi-agent systems.

Wanlei Zhou (Senior Member, IEEE) received the B.Eng. and M.Eng. degrees in computer science and engineering from the Harbin Institute of Technology, Harbin, China, in 1982 and 1984, respectively, the Ph.D. degree in computer science and engineering from The Australian National University, Canberra, Australia, in 1991, and the D.Sc. degree from Deakin University, Australia, in 2002. He is currently the Vice Rector (Academic Affairs) and the Dean of the Faculty of Data Science, City University of Macau, Macau, SAR, China. Before joining the City University of Macau, he held various positions, including the Head of the School of Computer Science, University of Technology Sydney, Australia; the Alfred Deakin Professor; the Chair of Information Technology; the Associate Dean; and the Head of the School of Information Technology, Deakin University. He has served as a Lecturer at the University of Electronic Science and Technology of China, China; Monash University, Melbourne, Australia; and the National University of Singapore, Singapore; and a System Programmer at HP, USA. He has published more than 400 papers in refereed international journals and international conferences proceedings, including many articles in IEEE transactions and journals. His research interests include security, privacy, and distributed computing.

