Deception in Supervisory Control
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 67, NO. 2, FEBRUARY 2022

Abstract—The use of deceptive strategies is important for an agent that attempts not to reveal his intentions in an adversarial environment. We consider a setting in which a supervisor provides a reference policy and expects an agent to follow the reference policy and perform a task. The agent may instead follow a different deceptive policy to achieve a different task. We model the environment and the behavior of the agent with a Markov decision process, represent the tasks of the agent and the supervisor with reachability specifications, and study the synthesis of optimal deceptive policies for such agents. We also study the synthesis of optimal reference policies that prevent deceptive strategies of the agent and achieve the supervisor's task with high probability. We show that the synthesis of optimal deceptive policies has a convex optimization problem formulation, while the synthesis of optimal reference policies requires solving a nonconvex optimization problem. We also show that the synthesis of optimal reference policies is NP-hard.

Index Terms—Computational complexity, deception, Markov decision processes (MDPs), supervisory control.

Manuscript received February 15, 2020; revised November 1, 2020; accepted January 20, 2021. Date of publication February 9, 2021; date of current version January 28, 2022. This work was supported in part by the Air Force Research Laboratory under Grant FA9550-19-1-0169, in part by the Air Force Office of Scientific Research under Grant FA9550-19-1-0005, and in part by the Defense Advanced Research Projects Agency under Grant D19AP00004. Recommended by Associate Editor A. Girard. (Corresponding author: Mustafa O. Karabag.)

Mustafa O. Karabag is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78705 USA (e-mail: karabag@utexas.edu).

Melkior Ornik was with the University of Texas at Austin, Austin, TX 78705 USA. He is now with the Department of Aerospace Engineering and the Coordinated Science Laboratory, The University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: mornik@illinois.edu).

Ufuk Topcu is with the Department of Aerospace Engineering and Engineering Mechanics and the Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78705 USA (e-mail: utopcu@utexas.edu).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TAC.2021.3057991.

Digital Object Identifier 10.1109/TAC.2021.3057991

I. INTRODUCTION

DECEPTION is present in many fields that involve two parties, at least one of which is performing a task that is undesirable to the other party. Examples include cyber systems [1], [2], autonomous vehicles [3], warfare strategy [4], and robotics [5]. We consider a setting with a supervisor and an agent, where the supervisor provides a reference policy to the agent and expects the agent to achieve a task by following the reference policy. However, the agent aims to achieve another task that is potentially malicious toward the supervisor and follows a different deceptive policy. We study the synthesis of deceptive policies for such agents and the synthesis of reference policies for such supervisors that try to prevent deception besides achieving a task.

In the described supervisory control setting, the agent's deceptive policy is misleading in the sense that the agent follows his own policy, but convinces the supervisor that he follows the reference policy. Misleading acts result in plausibly deniable outcomes [6]. Hence, the agent's misleading behavior should have plausible outcomes for the supervisor. In detail, the supervisor has an expectation of the probabilities of the possible events. The agent should manipulate these probabilities such that he achieves his task while closely adhering to the supervisor's expectations.

We measure the closeness between the reference policy and the agent's policy by the Kullback–Leibler (KL) divergence. The KL divergence, also called relative entropy, is a measure of dissimilarity between two probability distributions [7]. The KL divergence quantifies the extra information needed to encode a posterior distribution using the information of a given prior distribution. We remark that this interpretation matches the definition of plausibility: the posterior distribution is plausible if the KL divergence between the distributions is low.

We use a Markov decision process (MDP) to represent the stochastic environment and reachability specifications to represent the supervisor's and the agent's tasks. We formulate the synthesis of optimal deceptive policies as an optimization problem that minimizes the KL divergence between the distributions of paths under the agent's policy and the reference policy subject to the agent's task specification. In order to preempt the agent's deceptive policies, the supervisor may aim to design its reference policy such that any deviations from the reference policy that achieve some malicious task do not have a plausible explanation. We formulate the synthesis of optimal reference policies as a maximin optimization problem, where the supervisor's optimal policy is the one that maximizes the KL divergence between itself and the agent's deceptive policy subject to the supervisor's task constraints.

The agent's problem, the synthesis of optimal deceptive policies, and the supervisor's problem, the synthesis of optimal reference policies, lead to the following questions: Is it computationally tractable to synthesize an optimal deceptive policy? Is it computationally tractable to synthesize an optimal reference policy? We show that, given the supervisor's policy, the agent's problem reduces to a convex optimization problem, which can be solved efficiently. On the other hand, the supervisor's problem results in a nonconvex optimization problem even when the agent uses a predetermined policy. We show that the supervisor's problem is NP-hard. We propose the duality approach and the alternating direction method of multipliers (ADMM) [8] to locally solve the supervisor's optimization problem. We also give a relaxation of the problem that is a linear program.
0018-9286 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Macau Univ of Science and Technology. Downloaded on July 10,2022 at 07:20:33 UTC from IEEE Xplore. Restrictions apply.
KARABAG et al.: DECEPTION IN SUPERVISORY CONTROL 739
The setting described in this article can be considered as a probabilistic discrete event system under probabilistic supervisory control [9], [10]. The probabilistic supervisor induces an explicit probability distribution over the language generated by the system by random disablement of the events. The supervisory control model considered in this article is similar in that the reference policy induces an explicit probability distribution over the paths of the MDP. Different from [9] and [10], we consider that the random disablement is done by the agent, and the supervisor is only responsible for providing the explicit random disablement strategy.

Similar to our approach, Bakshi and Prabhakaran [11] used the KL divergence as a proxy for the plausibility of messages in broadcast channels. Although we use the KL divergence for the same purpose, the context of this article differs from [11]. In the context of transition systems, the authors of [12] and [13] used the metric proposed in this article, the KL divergence between the distributions of paths under the agent's policy and the reference policy, for inverse reinforcement learning. In addition to the contextual difference, the proposed method of this article differs from that in [12] and [13]. We work in a setting with known transition dynamics and provide a convex optimization problem to synthesize the optimal policy, while the authors of [12] and [13] work with unknown dynamics and use sampling-based gradient descent to synthesize the optimal policy. Entropy maximization for MDPs [14] is a special case of the deception problem, where the reference policy follows every possible path with equal probability. One can synthesize optimal deceptive policies by maximizing the entropy of the agent's path distribution minus the cross-entropy of the supervisor's path distribution relative to the agent's. For the synthesis of optimal deceptive policies, we use a method similar to [14] as we represent the objective function over transition probabilities. However, our proofs for the existence and synthesis of the optimal deceptive policies significantly differ from the results of [14]. In particular, Savas et al. [14] restrict attention to stationary policies without optimality guarantees, whereas we prove the optimality of stationary policies for the deception problem.

A related concept to deception is probabilistic system opacity, which was introduced in [15]. Two hidden Markov models (HMMs) are pairwise probabilistically opaque if the likelihood ratio between the HMMs is a nonzero finite number for every infinite observation sequence. This article is related to [15] in that two HMMs are not pairwise probabilistically opaque if the KL divergence between the distributions of observation sequences is infinite. While Keroglou and Hadjicostis [15] provide a method to check whether two HMMs with irreducible Markov chains are pairwise probabilistically opaque, we propose an optimization problem for MDPs that quantifies the deceptiveness of the system induced by the agent. Deception is also interpreted as the exploitation of an adversary's inaccurate beliefs on the agent's behavior [16], [17]. The work in [16] focuses on generating unexpected behavior conflicting with the beliefs of the adversary, and the work in [17] focuses on generating noninferable behavior leading to inaccurate belief distributions. On the other hand, the deceptive policy that we present generates behavior that is closest to the beliefs of the other party in order to hide the agent's malicious intentions.

We explore the synthesis of optimal reference policies, which, to the best of our knowledge, has not been discussed before. We propose to use the ADMM to synthesize the optimal reference policies. Similarly, Fu et al. [18] also use the ADMM for the synthesis of optimal policies for MDPs. While we use the same method, the objective functions differ: the work in [18] is concerned with the average reward case, whereas we use the ADMM to optimize the KL divergence between the distributions of paths. In addition, we use the ADMM to solve a decomposable minimax problem, which, to the best of our knowledge, has not been explored before.

We remark that a preliminary conference version [19] of this article focuses on the synthesis of deceptive policies. In addition to the contents of [19], this version contains the NP-hardness result on the synthesis of optimal reference policies, duality and ADMM methods for the synthesis of locally optimal reference policies, and an additional numerical example (see Sections V-A–V-C and VI-C). We also provide proofs of our results, which were omitted from [19].

The rest of this article is organized as follows. Section II provides necessary theoretical background. In Section III, the agent's and the supervisor's problems are presented. Section IV explains the synthesis of optimal deceptive policies. In Section V, we give the NP-hardness result on the synthesis of optimal reference policies. We derive the optimization problem to synthesize the optimal reference policy and give the ADMM algorithm to solve the optimization problem. In this section, we also give a relaxed problem that relies on a linear program for the synthesis of optimal reference policies. We present numerical examples in Section VI and conclude with suggestions for future work in Section VII. We discuss the optimal deceptive policies under nondeterministic reference policies in Appendix A. We provide the proofs for the technical results in Appendix B.

II. PRELIMINARIES

The set {x = (x_1, ..., x_n) | x_i ≥ 0} is denoted by R^n_+. The set {1, ..., n} is denoted by [n]. The indicator function 1_y(x) of an element y is defined as 1_y(x) = 1 if x = y and 0 otherwise. The characteristic function I_C(x) of a set C is defined as I_C(x) = 0 if x ∈ C and ∞ otherwise. The projection Proj_C(x) of a variable x to a set C is equal to arg min_{y ∈ C} ||x − y||_2^2. A Bernoulli random variable with parameter p is denoted by Ber(p).

The set K is a convex cone if for all x, y ∈ K and a, b ≥ 0, we have ax + by ∈ K. For the convex cone K, K^* = {y | y^T x ≥ 0, ∀x ∈ K} denotes the dual cone. The exponential cone is denoted by K_exp = {(x_1, x_2, x_3) | x_2 exp(x_1/x_2) ≤ x_3, x_2 > 0} ∪ {(x_1, 0, x_3) | x_1 ≤ 0, x_3 ≥ 0}, and it can be shown that K^*_exp = {(x_1, x_2, x_3) | −x_1 exp(x_2/x_1 − 1) ≤ x_3, x_1 < 0} ∪ {(0, x_2, x_3) | x_2 ≥ 0, x_3 ≥ 0}.

Definition 1: Let Q_1 and Q_2 be discrete probability distributions with a countable support X. The KL divergence between Q_1 and Q_2 is

    KL(Q_1 || Q_2) = Σ_{x ∈ X} Q_1(x) log( Q_1(x) / Q_2(x) ).

We define Q_1(x) log(Q_1(x)/Q_2(x)) to be 0 if Q_1(x) = 0, and ∞ if Q_1(x) > 0 and Q_2(x) = 0. The data processing inequality states that any transformation T : X → Y satisfies

    KL(Q_1 || Q_2) ≥ KL(T(Q_1) || T(Q_2)).   (1)

Remark 1: The KL divergence is frequently defined with logarithm to base 2 in information theory. However, we use the natural logarithm for clarity of representation in the optimization problems. The base change does not change the results.
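As a concrete illustration of Definition 1 and inequality (1) (our addition, not part of the original article), the conventions 0·log(0/q) = 0 and KL = ∞ when Q_2(x) = 0 < Q_1(x) can be sketched in a few lines of Python; the function names are ours:

```python
import math

def kl_divergence(q1, q2):
    """KL(Q1 || Q2) for distributions given as dicts over a countable support.

    Uses the conventions of Definition 1: a term with Q1(x) = 0 contributes 0,
    and the divergence is infinite if Q1(x) > 0 while Q2(x) = 0.
    """
    total = 0.0
    for x, p in q1.items():
        if p == 0.0:
            continue  # 0 * log(0 / q) is defined to be 0
        q = q2.get(x, 0.0)
        if q == 0.0:
            return math.inf  # Q1-positive event impossible under Q2
        total += p * math.log(p / q)  # natural log, as in Remark 1
    return total

def push_forward(dist, t):
    """Distribution of T(X) when X ~ dist, for a deterministic map t."""
    out = {}
    for x, p in dist.items():
        out[t(x)] = out.get(t(x), 0.0) + p
    return out
```

Pushing both distributions through any deterministic map can only decrease the divergence, so `kl_divergence(q1, q2) >= kl_divergence(push_forward(q1, t), push_forward(q2, t))` for every map `t`, matching the data processing inequality (1).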
A. Markov Decision Processes

An MDP is a tuple M = (S, A, P, s_0), where S is a finite set of states, A is a finite set of actions, P : S × A × S → [0, 1] is the transition probability function, and s_0 is the initial state. A(s) denotes the set of available actions at state s, where Σ_{q ∈ S} P(s, a, q) = 1 for all a ∈ A(s). The successor states of state s are denoted by Succ(s), where a state q is in Succ(s) if and only if there exists an action a such that P(s, a, q) > 0. State s is absorbing if P(s, a, s) = 1 for all a ∈ A(s).

The history h_t at time t is a sequence of states and actions such that h_t = s_0 a_0 s_1 ... s_{t−1} a_{t−1} s_t. The set of all histories at time t is H_t. A policy for M is a sequence π = μ_0 μ_1 ..., where each μ_t : H_t × A → [0, 1] is a function such that Σ_{a ∈ A(s_t)} μ_t(h_t, a) = 1 for all h_t ∈ H_t. A stationary policy is a sequence π = μμ..., where μ : S × A → [0, 1] is a function such that Σ_{a ∈ A(s)} μ(s, a) = 1 for every s ∈ S. The set of all policies for M is denoted by Π(M), and the set of all stationary policies for M is denoted by Π^{St}(M). For notational simplicity, we use P_{s,a,q} for P(s, a, q) and π_{s,a} for μ(s, a) if π = μμ..., i.e., π is stationary.

A stationary policy π for M induces a Markov chain M^π = (S, P^π), where S is the finite set of states and P^π : S × S → [0, 1] is the transition probability function such that P^π(s, q) = Σ_{a ∈ A(s)} P(s, a, q) π(s, a) for all s, q ∈ S. A state q is accessible from a state s if there exists an n ≥ 0 such that the probability of reaching q from s in n steps is positive.

The agent may deviate from the reference policy to follow a different policy π^A. In this setting, both the agent and the supervisor know the environment, i.e., the components of M.

While the agent operates in M, the supervisor observes the transitions, but not the actions of the agent, to detect any deviations from the reference policy. An agent that does not want to be detected must use a deceptive policy π^A that limits the deviations from the reference policy π^S and achieves ♦R_A with high probability.

We use the KL divergence to measure the deviation from the supervisor's policy. Recall that Γ^{π^S}_M and Γ^{π^A}_M are the distributions of paths under π^S and π^A, respectively. We consider KL(Γ^{π^A}_M || Γ^{π^S}_M) as a proxy for the agent's deviations from the reference policy.

The perspective of information theory provides two motivations for the choice of the KL divergence. The obvious motivation is that this value corresponds to the amount of information bits that the reference policy lacks while encoding the agent's path distribution. By limiting the deviations from the reference policy, we aim to make the agent's behavior easily explainable by the reference policy. Sanov's theorem [7] provides the second motivation. We note that satisfying the agent's objective with high probability is a rare event under the supervisor's policy. By minimizing the KL divergence between the policies, we make the agent's policy mimic the rare event that satisfies the agent's objective and is most probable under the supervisor's policy.

Formally, let π^* be a solution to

    inf_{π^A ∈ Π(M)}  KL( Γ^{π^A}_M || Γ^{π^S}_M )   (2a)

    subject to  Pr^{π^A}_M( s_0 ⊨ ♦R_A ) ≥ ν^A.   (2b)
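The reachability constraint (2b) evaluates Pr^{π^A}_M(s_0 ⊨ ♦R_A) on the Markov chain induced by a stationary policy. A minimal Python sketch of this computation on a toy MDP follows; the data structures and names are ours, not from the article:

```python
def induced_chain(P, policy):
    """Markov chain P^pi(s, q) = sum_a P(s, a, q) * pi(s, a) induced by a
    stationary policy. P[s][a][q] and policy[s][a] are nested dicts."""
    chain = {}
    for s, actions in P.items():
        row = {}
        for a, successors in actions.items():
            weight = policy[s].get(a, 0.0)
            for q, prob in successors.items():
                row[q] = row.get(q, 0.0) + weight * prob
        chain[s] = row
    return chain

def reach_probability(chain, target, iterations=1000):
    """Probability of eventually reaching `target` from each state,
    computed by iterating the reachability fixed-point equations."""
    values = {s: (1.0 if s in target else 0.0) for s in chain}
    for _ in range(iterations):
        values = {
            s: 1.0 if s in target
            else sum(p * values[q] for q, p in row.items())
            for s, row in chain.items()
        }
    return values
```

For instance, on an MDP where action a reaches an absorbing goal with probability 0.9 and action b with probability 0.1, a uniform stationary policy attains Pr(s_0 ⊨ ♦goal) = 0.5.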
Assume that under the agent's policy π^A, there exists a path that visits a state in C^cl \ R_A and leaves C^cl \ R_A with positive probability. In this case, the KL divergence is infinite since an event that happens with probability zero under the supervisor's policy happens with a positive probability under the agent's policy. Hence, C^cl \ R_A must also be a closed set under π^A. Furthermore, since the agent cannot leave C^cl \ R_A, and the probability of satisfying ♦R_A is zero upon entering C^cl \ R_A, the agent should choose the same policy as the supervisor to minimize the KL divergence between the distributions of paths. Note that for the recurrent states in R_A, i.e., C^cl ∩ R_A, the second claim is trivially satisfied by the first claim.

For all s ∈ S \ (C^cl ∪ R_A), s is transient in M^S, and the agent's policy must eventually stop visiting s, since otherwise we have infinite divergence. Furthermore, we have the following proposition.

Proposition 1: If the optimal value of Problem 1 is finite and the optimal policy is π^A, the state–action occupation measure x^{π^A}_{s,a} is finite for all s ∈ S \ (C^cl ∪ R_A) and a ∈ A(s).

The occupation measures are bounded for the states at which the agent's policy may differ from the supervisor's policy. Since the occupation measures are bounded, stationary policies suffice for the synthesis of optimal deceptive policies [23].

Proposition 2: For any policy π^A ∈ Π(M) that satisfies Pr^{π^A}_M(s_0 ⊨ ♦R_A) ≥ ν^A, there exists a stationary policy π^{A,St} ∈ Π(M) that satisfies Pr^{π^{A,St}}_M(s_0 ⊨ ♦R_A) ≥ ν^A and

    KL( Γ^{π^{A,St}}_M || Γ^{π^S}_M ) ≤ KL( Γ^{π^A}_M || Γ^{π^S}_M ).

Sketch of Proof for Proposition 2: Assume that the KL divergence between the path distributions is finite. Note that the occupation measures of π^A are finite for all s ∈ S_d = S \ (C^cl ∪ R_A).

When the reference policy is stationary, we may transform M into a semi-infinite MDP. The semi-infinite MDP shares the same states with M, but has a continuous action space such that, for all states, every randomized action of M is an action of the semi-infinite MDP. Also, the states that belong to R_A and C^cl are absorbing in the semi-infinite MDP.

Let X^S_s be the successor state distribution at state s under the reference policy in the semi-infinite MDP. At state s ∈ S_d, an action a with successor state distribution X_{s,a} has cost KL(X_{s,a} || X^S_s). The cost is 0 for the other states that do not belong to S_d. Consider an optimization problem that minimizes the expected cost subject to reaching R_A with probability at least ν^A. The result of this optimization problem shares the same value with the result of Problem 1. This problem is a constrained cost minimization for an MDP, where the only decision variables are the state–action occupation measures. An optimal policy can be characterized by the state–action occupation measures.

The occupation measures must be finite for all s ∈ S_d, as we showed in Proposition 1. Since every finite occupation measure vector of S_d can also be achieved by a stationary policy, there exists a stationary policy that shares the same occupation measures with an optimal policy [23]. Hence, this stationary policy is also optimal.

Now, assume that the stationary optimal policy π^* is randomized. Let π^*_s be the action distribution and X^{π^*}_s be the successor state distribution at state s under π^*. Note that at state s, there exists an action a^* that has P(s, a^*, q) = X^{π^*}_s(q) since the action space is convex for the semi-infinite MDP. Also, due to the convexity of the KL divergence, we have

    ∫_{Δ^{|A(s)|}} KL( X_{s,a} || X^S_s ) dπ^*_s(a) ≥ KL( X^{π^*}_s || X^S_s )

where Δ^{|A(s)|} is the |A(s)|-dimensional probability simplex. Hence, deterministically taking action a^* is optimal for state s. By generalizing this argument to all s ∈ S_d, we conclude that there exists an optimal stationary deterministic policy for the semi-infinite MDP. Without loss of generality, we assume that π^* is stationary deterministic.

We note that the stationary deterministic policy π^* of the semi-infinite MDP corresponds to a stationary randomized policy for the original MDP M. Hence, the proposition holds.

We denote the set of states for which the agent's policy can differ from the supervisor's policy by S_d = S \ (C^cl ∪ R_A). We solve the following optimization problem to compute the occupation measures of an optimal deceptive policy:

    inf  Σ_{s ∈ S_d} Σ_{a ∈ A(s)} Σ_{q ∈ Succ(s)} x^A_{s,a} P_{s,a,q} log( ( Σ_{a' ∈ A(s)} x^A_{s,a'} P_{s,a',q} ) / ( π^S_{s,q} Σ_{a' ∈ A(s)} x^A_{s,a'} ) )   (4a)

    subject to

    x^A_{s,a} ≥ 0   ∀s ∈ S_d, ∀a ∈ A(s)   (4b)

    Σ_{a ∈ A(s)} x^A_{s,a} − Σ_{q ∈ S_d} Σ_{a ∈ A(q)} x^A_{q,a} P_{q,a,s} = 1_{s_0}(s)   ∀s ∈ S_d   (4c)

    Σ_{q ∈ R_A} ( Σ_{s ∈ S_d} Σ_{a ∈ A(s)} x^A_{s,a} P_{s,a,q} + 1_{s_0}(q) ) ≥ ν^A   (4d)

where π^S_{s,q} is the transition probability from s to q under π^S and the decision variables are x^A_{s,a} for all s ∈ S_d and a ∈ A(s). The objective function (4a) is obtained by reformulating the KL divergence between the path distributions as the sum of the KL divergences between the successor state distributions for every time step (see Lemma 3 in Appendix B). Constraint (4c) encodes the feasible policies, and constraint (4d) represents the task constraint.

Proposition 3: The optimization problem given in (4) is a convex optimization problem that shares the same optimal value with Problem 1. Furthermore, there exists a policy π ∈ Π^{St}(M) that attains the optimal value of (4).

The optimization problem given in (4) gives the optimal state–action occupation measures for the agent. One can synthesize the optimal deceptive policy π^A using the relationship x^A_{s,a} = π^A_{s,a} Σ_{a' ∈ A(s)} x^A_{s,a'} for all s ∈ S_d and π^A_{s,a} = π^S_{s,a} for the other states.

The optimization problem given in (4) can be considered as a constrained MDP problem with an infinite action space [23] and a nonlinear cost function. This equivalence follows from the fact that for every randomized policy for M there exists a deterministic policy that incurs the same cost on the infinite-action MDP. Since there exists a deterministic optimal policy for the infinite-action MDP, we can represent the objective function and constraints of Problem 1 with the occupancy measures. However, we remark that (4) is a convex nonlinear optimization problem, whereas constrained MDPs are often modeled with a linear cost function and solved using linear optimization methods.

Remark 2: The methods provided in this section can be generalized to task constraints that are co-safe LTL specifications.
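The reformulation behind (4a) expresses the KL divergence between path distributions as an occupancy-weighted sum of per-state successor KL divergences. The identity can be checked numerically on the simplest nontrivial case, a single transient state s_0 that self-loops with probability r and moves to an absorbing goal with probability 1 − r; this sketch and its names are ours, not the article's:

```python
import math

def path_kl_truncated(r_agent, r_ref, horizon=2000):
    """Direct KL between the path distributions of two chains on {s0, g}:
    from s0, stay at s0 with probability r and move to absorbing g with
    probability 1 - r. The path with n self-loops has probability
    r**n * (1 - r); we sum the KL series until the terms underflow."""
    total = 0.0
    for n in range(horizon):
        p_a = (r_agent ** n) * (1 - r_agent)
        p_s = (r_ref ** n) * (1 - r_ref)
        if p_a == 0.0 or p_s == 0.0:
            break
        total += p_a * math.log(p_a / p_s)
    return total

def occupancy_kl(r_agent, r_ref):
    """Occupancy-weighted per-state KL, the form used in objective (4a):
    expected visits to s0 times the KL of the successor distributions."""
    eta = 1.0 / (1.0 - r_agent)  # expected number of visits to s0
    step_kl = (r_agent * math.log(r_agent / r_ref)
               + (1 - r_agent) * math.log((1 - r_agent) / (1 - r_ref)))
    return eta * step_kl
```

On, e.g., r_agent = 0.3 and r_ref = 0.6 the two quantities agree to numerical precision, consistent with the per-time-step decomposition (Lemma 3 in Appendix B).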
problem. In Section V-B, we describe the dualization-based procedure to locally solve the optimization problem given in (5). As an alternative to solving the dual problem, we give an algorithm based on the ADMM in Section V-C. Finally, we present a relaxation of the problem in Section V-D that relies on solving a linear program.

A. Complexity of the Synthesis of Optimal Reference Policies

In this section, we show that the synthesis of an optimal reference policy is NP-hard, whereas the feasibility problem for reference policies can be solved in polynomial time.

Finding a feasible policy under multiple reachability constraints has polynomial complexity in the number of states and actions for a given MDP. When the target states are absorbing, the complexity of the problem is also polynomial in the number of constraints [22]. This result follows from the fact that a feasible policy can be synthesized with a linear program, where the numbers of variables and constraints are polynomial in the number of states, actions, and task constraints.

Matsui transformed the set partition problem into the decision version of an instance of linear multiplicative programming and thereby proved the NP-hardness of linear multiplicative programming [25]. In the proof of Proposition 5, we give an instance of Problem 2 whose decision problem reduces to the decision problem of the instance of linear multiplicative programming that Matsui provided.

While a feasible reference policy can be synthesized in polynomial time, finding an optimal reference policy is NP-hard even when the target states are absorbing. The hardness proof follows from a reduction to Problem 2 of an instance of linear multiplicative programming, which minimizes the product of two variables subject to linear inequality constraints. Formally, we have the following result.

Proposition 5: Problem 2 is NP-hard even under Assumption 1.

Sketch of Proof for Proposition 5: The set partition problem can be reduced to linear multiplicative programming [25], and the resulting linear multiplicative program can be encoded as an instance of Problem 2. Since the set partition problem is NP-hard, Problem 2 is NP-hard.

In more detail, the set partition problem [26], [27] is NP-hard and is the following.

Instance: An m × n 0–1 matrix M satisfying n > m.

Question: Is there a 0–1 vector x satisfying Σ_{j ∈ [n]: M_{ij} = 1} x_j = 1 for all i ∈ [m]?

Linear multiplicative programming minimizes the product of two variables subject to linear inequality constraints and is NP-hard [25]. Let M be an m × n 0–1 matrix with n ≥ m and n ≥ 5, and p = n^{n^4}. The problem to find the optimal value of

    min  (2p^{4n} − p + 2p^{2n} x_0 + y_0)(2p^{4n} − p − 2p^{2n} x_0 + y_0)   (6a)

    subject to

    x_0 = Σ_{i=1}^{n} p^i x_i   (6b)

    y_0 = Σ_{i=1}^{n} Σ_{j=1}^{n} p^{i+j} y_{ij}   (6c)

    0 ≤ x_i ≤ 1,  y_{ii} = x_i   ∀i ∈ [n]   (6d)

    0 ≤ y_{ij} ≤ 1   ∀i, j ∈ [n], i ≠ j   (6e)

    x_i ≥ y_{ij},  x_j ≥ y_{ij},  y_{ij} ≥ x_i + x_j − 1   ∀i, j ∈ [n]   (6f)

    Σ_{j ∈ [n]: M_{ij} = 1} x_j = 1   ∀i ∈ [m]   (6g)

where the decision variables are x_i for all i ∈ [n] and y_{ij} for all i, j ∈ [n], is NP-hard. In detail, Matsui [25] proved that the optimal value of (6) is less than or equal to 4p^{8n} if and only if there exists a 0–1 solution for x_1, ..., x_n satisfying (6f). Since the decision problem of (6) corresponds to solving the set partition problem, (6) is NP-hard.

We can construct an MDP with a size polynomial in n and choose a polynomial number of specifications in n such that the optimal value of Problem 2 is

    max  (1/2) log[ 1 / ( (2p^{4n} − p + 2p^{2n} x_0 + y_0)(2p^{4n} − p − 2p^{2n} x_0 + y_0) ) ] + (1/2) log[ 4C^2 (n^2 + n + 1)^2 ]   (7a)

    subject to (6b)–(6g)   (7b)

where C is a constant depending on n. Due to the result given in [25], the optimal value of (7) is greater than or equal to −log(4p^{8n})/2 + log(4C^2 (n^2 + n + 1)^2)/2 if and only if there exists a 0–1 solution for x_1, ..., x_n satisfying (6f). Since the decision problem of (7) corresponds to solving the set partition problem, (7) is NP-hard.

Since the number of states, actions, and task constraints is polynomial in n and (7) synthesizes an optimal reference policy, the synthesis of optimal reference policies is NP-hard.

We remark that the hardness of Problem 2 is due to the nonconvexity of the KL objective function, since the feasibility problem can be solved in polynomial time.

B. Dualization-Based Approach for the Synthesis of Optimal Reference Policies

Observing that Slater's condition [28] is satisfied and strong duality holds for the optimization problem given in (4), one may solve (5) by replacing the inner problem with the dual of (4), with x^S_{s,a} as additional decision variables and (5c)–(5f) as additional constraints. In this section, we describe the dualization-based approach for the synthesis of locally optimal reference policies.

The optimization problem given in (4) has the following conic optimization representation:

    min_y  c^T y   (8a)
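The set partition Question at the core of the reduction above can be checked mechanically. The following brute-force Python sketch (ours, exponential in n, for illustration only) decides the Question on small matrices:

```python
from itertools import product

def is_partition(M, x):
    """Check the set partition condition: sum of x[j] over the columns j
    with M[i][j] == 1 equals 1 for every row i."""
    return all(
        sum(xj for mij, xj in zip(row, x) if mij == 1) == 1
        for row in M
    )

def has_partition(M):
    """Brute-force the decision question over all 0-1 vectors."""
    n = len(M[0])
    return any(is_partition(M, x) for x in product((0, 1), repeat=n))
```

For example, M = [[1, 1, 0], [0, 1, 1]] admits the solution x = (1, 0, 1), while M = [[1, 1], [1, 0], [0, 1]] forces x_0 = x_1 = 1 and therefore violates the first row's condition, so no solution exists.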
reference policy and takes action down with a high probability to reach the red state.

We note that the reference policy is restrictive in this case; as can be seen in Fig. 3(a), it follows almost a deterministic path. Under such a reference policy, even the policy that is synthesized via Problem 1 is easy to detect. To observe the effect of the reference policy on the deceptive policy, we consider a different reference policy, as shown in Fig. 5(a), which satisfies ♦r with probability 10^{−3}. When the reference policy is not as restrictive, the deceptive policy becomes hard to detect. Formally, the value of the KL divergence reduces to 1.462.

B. Detection of a Deceptive Agent

In this example, by comparing the KL divergence with some common metrics to synthesize deceptive policies, we show how the choice of the KL divergence helps with preventing detection.

this rare event in the following way. Let Γ^{π^S}_M be the probability distribution of paths under the reference policy. We create two conditional probability distributions Γ^{π^S}_{M,+} and Γ^{π^S}_{M,−}, which are the distributions of paths under the reference policy given that the paths satisfy φ_A and do not satisfy φ_A, respectively. We sample from Γ^{π^S}_{M,−} with probability 0.9 and from Γ^{π^S}_{M,+} with probability 0.1.

In addition to the randomly generated MDP, we use a different MDP to show that the deceptive policy can help patrolling without being detected. The MDP models a region in the northeast of San Francisco. The map of the region is given in Fig. 6, where each intersection is represented with a state and each road is represented with an action. We design the reference policy to represent the average driver behavior. We obtain the traffic density data from Google Maps [35] and synthesize the reference policy by fitting a stationary policy to the data. The aim of the agent is to patrol the intersection at which the highest number of crimes happens. Formally, the agent's policy reaches the
Fig. 6. Map of a region from the northeast of San Francisco. The green dot indicates the intersection at which the highest number of crimes happened. The data are from [34]. The dots on the map represent the states of the MDP and the arrows represent the available actions. The initial state is chosen uniformly randomly among the blue states, and the red states are absorbing. The agent aims to patrol the green state.

Fig. 8. (a) KL divergence between the agent's policy and the reference policy. The curve "Best" refers to the case that the agent's policy is the best deceptive policy against the reference policy synthesized during the ADMM algorithm. The curve "Actual" refers to the case that the agent's policy is the policy synthesized during the ADMM algorithm. (b) Heatmaps of the occupation measures for the reference policy, i.e., the z^{S,k} parameters of Algorithm 1. The value of a state is the expected number of visits to the state.
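Fig. 8(b) visualizes occupation measures, and the KL objectives in this paper (e.g., Lemma 3 and the objective in (15a)) are written directly in terms of such occupation measures: for stationary policies, the KL divergence between path distributions equals a sum over states of the expected number of visits times a KL divergence between successor distributions. The identity can be sanity-checked on a toy chain. This is a minimal sketch with hypothetical numbers (a single transient state with two absorbing successors, not an MDP from the paper):

```python
import math

# Hypothetical three-state chain: state 0 is transient; states 1 and 2 are
# absorbing. Each row is the successor distribution of state 0 under a
# stationary policy (illustrative numbers, not taken from the paper).
P_A = [0.2, 0.6, 0.2]  # agent's policy:    stay, go to 1, go to 2
P_S = [0.6, 0.1, 0.3]  # reference policy:  stay, go to 1, go to 2

# Occupation measure of the transient state: expected number of visits.
x_A = 1.0 / (1.0 - P_A[0])

# Occupation-measure form of the KL divergence between path distributions:
# (expected visits) times (KL of the successor distributions).
kl_occupation = x_A * sum(p * math.log(p / q) for p, q in zip(P_A, P_S))

# Direct check by enumerating paths "stay k times, then absorb in state j".
kl_paths = 0.0
for k in range(200):  # 0.2**200 is negligible, so truncation is harmless
    for j in (1, 2):
        prob = (P_A[0] ** k) * P_A[j]  # path probability under the agent
        log_ratio = k * math.log(P_A[0] / P_S[0]) + math.log(P_A[j] / P_S[j])
        kl_paths += prob * log_ratio

print(f"occupation-measure form: {kl_occupation:.6f}")
print(f"path enumeration:        {kl_paths:.6f}")
```

The two quantities agree, matching the occupation-measure rewriting used in the proofs of Lemma 3 and Proposition 3.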
KARABAG et al.: DECEPTION IN SUPERVISORY CONTROL 749
value, it causes suboptimality for the reference policy against the best deceptive policy.

The reference policy gradually gets away from the red states, as shown in Fig. 8(b). Based on this observation, we expect that the relaxed problem given in Section V-D provides useful reference policies for the original problem. This expectation is indeed verified numerically: the reference policy synthesized via the relaxed problem has a KL divergence of 0.150, which is equal to the limit value of the ADMM algorithm.

VII. CONCLUSION

We considered the problem of deception under a supervisor that provides a reference policy. We modeled the problem using MDPs and reachability specifications and proposed to use KL divergence for the synthesis of optimal deceptive policies. We showed that an optimal deceptive policy is stationary and that its synthesis requires solving a convex optimization problem. We also considered the synthesis of optimal reference policies that easily prevent deception. We showed that this problem is NP-hard. We proposed a method based on the ADMM to compute a locally optimal solution and provided an approximation that can be modeled as a linear program.

In subsequent work, we aim to extend the deception problem to a multiagent setting, where multiple malicious agents need to cooperate. Furthermore, it would be interesting to consider the case where a malicious agent first needs to detect the other malicious agents before cooperation. We also aim to study the scenario where the supervisor needs to learn the specification of the agent for the synthesis of the reference policy.

APPENDIX A
SYNTHESIS OF OPTIMAL DECEPTIVE POLICIES UNDER NONDETERMINISTIC REFERENCE POLICIES

In Problem 1, we assume that the supervisor provides an explicit (possibly probabilistic) reference policy to the agent. It is possible that the reference policy is instead nondeterministic: the supervisor disallows some actions at every state, and the agent is allowed to take the other actions. We refer the interested readers to [36] for the formal definition of nondeterministic policies. A permissible policy is a policy that takes an allowed action at every time step. The nondeterministic reference policy represents a set Π^S of policies that are permissible. The supervisor is indifferent between the policies in Π^S. As in Section III, we assume that the supervisor observes the transitions, but not the actions, of the agent.

We define the optimal deceptive policy as the policy that minimizes the KL divergence to any permissible policy subject to the task constraint of the agent. Under this definition, the optimal deceptive policy mimics the rare event that satisfies the agent's task and that is most probable under one of the permissible policies. Formally, we solve the following problem for the synthesis of optimal deceptive policies under nondeterministic reference policies.

Problem 3 (Synthesis of optimal deceptive policies under nondeterministic reference policies): Given an MDP M, a reachability specification ♦R^A, a probability threshold ν^A, and a set Π^S of permissible policies, solve

  inf_{π^A ∈ Π(M), π^S ∈ Π^S} KL(Γ^{π^A}_M || Γ^{π^S}_M)    (14a)

  subject to Pr^{π^A}_M(s_0 |= ♦R^A) ≥ ν^A.    (14b)

If the optimal value is attainable, find a policy π^A that is a solution to (14).

Similar to Assumption 2, we assume that for every state in MDP M, the set of allowed actions is fixed. For state s ∈ S, we denote the set of allowed actions with A^S(s) and the possible successor state distributions with Λ^S(s). Under this assumption, we can identify the maximal end components of M for the permissible policies. If a maximal end component is closed, i.e., the maximum probability of leaving the end component under the permissible policies is 0, then the states of the maximal end component belong to C^cl. If a maximal end component is open, i.e., the maximum probability of leaving the end component under the permissible policies is 1, then there exists an optimal policy that eventually leaves the end component, since the agent can guarantee the same objective value by following a policy that leaves the end component and takes permissible actions after leaving it.

While there exists an optimal policy that eventually leaves the open end components, the optimal policy may have infinite occupation measures at these states. We make the following assumption to ensure the boundedness of the occupation measures for these states.

Assumption 4: For all s ∈ S \ (C^cl ∪ R^A) and a ∈ A(s), the occupation measure x^A_{s,a} is upper bounded by θ.

As in the proof of Proposition 2, we can define a semi-infinite MDP with an optimal cost equal to the optimal value of Problem 3. In detail, at state s ∈ S_d, an action a with successor state distribution X_{s,a} has cost min_{X^S_s ∈ Λ^S(s)} KL(X_{s,a} || X^S_s), where X^S_s is the distribution of successor states under the reference policy. We note that min_{X^S_s ∈ Λ^S(s)} KL(X_{s,a} || X^S_s) is a convex function of X_{s,a}, since the KL divergence is jointly convex in its arguments [28]. Since the occupation measures are bounded for all states in S_d = S \ (C^cl ∪ C^A) due to Assumption 4 and the costs are convex functions of the policy parameters, there exists a stationary deterministic optimal policy for the semi-infinite MDP [23]. Consequently, there exists a stationary randomized optimal policy for Problem 3 under Assumption 4.

Given that the optimal policy is stationary on M, we compute the state–action occupation measures of the optimal policy by solving the following optimization problem:

  inf Σ_{s∈S_d} Σ_{a∈A(s)} Σ_{q∈Succ(s)} x^A_{s,a} P_{s,a,q} log( (Σ_{a′∈A(s)} x^A_{s,a′} P_{s,a′,q}) / (Σ_{a∈A^S(s)} π^S_{s,a} P_{s,a,q} Σ_{a′∈A(s)} x^A_{s,a′}) )    (15a)

  subject to (4b)–(4d)
  x^A_{s,a} ≤ θ, ∀ s ∈ S_d, a ∈ A(s)    (15b)

  Σ_{a∈A^S(s)} π^S_{s,a} = 1, ∀ s ∈ S_d    (15c)

where the decision variables are x^A_{s,a} for all s ∈ S_d and a ∈ A(s), and π^S_{s,a} for all s ∈ S_d and a ∈ A^S(s).

We consider a partitioning of paths according to the number of times s appears in a path. By the data processing inequality given in (1), we have that

  KL(Γ^{π^A}_M || Γ^{π^S}_M) ≥ KL(N^{π^A}_s || N^{π^S}_s)

i.e., the KL divergence between the path distributions is lower bounded by the KL divergence between the distributions of the number of visits to s. Therefore, it suffices to prove the following claim: Given Pr^{π^S}_M(s_0 |= ♦s) > 0, Pr^{π^S}_M(s |= ♦s) ∈ [0, 1), and d* > 0, there exists an M_s such that for all π^A that satisfy E[N^{π^A}_s] > M_s, we have KL(N^{π^A}_s || N^{π^S}_s) > d*.

Define a random variable N̂^{π^A}_s such that Pr(N̂^{π^A}_s = i) = p̂_i. We have

  KL(N^{π^A}_s || N^{π^S}_s) ≥ (p_0 − 1) H(N̂^{π^A}_s) + Σ_{i=1}^∞ p̂_i log((l^S)^{i−1}(1 − l^S))    (17a)
  = (p_0 − 1) H(N̂^{π^A}_s) + Σ_{i=1}^∞ p̂_i [(i − 1) log(l^S) + log(1 − l^S)]    (17c)
  …
  = M^A_s KL(Ber(1/M̂^A_s) || Ber(1 − l^S)).    (17e)

Now, assume that M^A_s ≥ c/(1 − l^S), where c > 1 is a constant. In this case, we have

  KL(N^{π^A}_s || N^{π^S}_s)    (18a)
  ≥ M^A_s KL(Ber(1/M̂^A_s) || Ber(1 − l^S))    (18b)
  ≥ M^A_s KL(Ber(1/M^A_s) || Ber(1 − l^S))    (18c)
  ≥ M^A_s KL(Ber((1 − l^S)/c) || Ber(1 − l^S))    (18d)

since M̂^A_s > M^A_s and, for a variable x such that x ≥ 1/(1 − l^S), the value of KL(Ber(1/x) || Ber(1 − l^S)) is increasing in x.

Note that KL(Ber((1 − l^S)/c) || Ber(1 − l^S)) is a positive constant; hence, for a sufficiently large expected number of visits, the KL divergence between the distributions of the number of visits to this state is greater than the constant. Since the KL divergence between the path distributions is lower bounded by the KL divergence for these states, the finiteness of the KL divergence between the path distributions implies the boundedness of the occupation measure under the agent's policy for every transient state of the supervisor.

The distribution of k-length path fragments for M under the policy π is denoted by Γ^π_{M,k}.

Proof of Lemma 3: For MDP M, denote the set of k-length path fragments by Ξ_k and the probability of the k-length path fragment ξ_k = s_0 s_1 … s_k under the stationary policy π by Pr^π(ξ_k). We have Pr^π(ξ_k) = Π_{t=0}^{k−1} Σ_{a∈A(s_t)} P_{s_t,a,s_{t+1}} π_{s_t,a}. Consequently, we have

  KL(Γ^{π^A}_{M,k} || Γ^{π^S}_{M,k}) = Σ_{ξ_k∈Ξ_k} Pr^{π^A}(ξ_k) log( Pr^{π^A}(ξ_k) / Pr^{π^S}(ξ_k) )
  = Σ_{ξ_k∈Ξ_k} Pr^{π^A}(ξ_k) Σ_{t=0}^{k−1} log( (Σ_{a′∈A(s_t)} P_{s_t,a′,s_{t+1}} π^A_{s_t,a′}) / (Σ_{a′∈A(s_t)} P_{s_t,a′,s_{t+1}} π^S_{s_t,a′}) )
  = Σ_{t=0}^{k−1} Σ_{ξ_k∈Ξ_k} Pr^{π^A}(ξ_k) Σ_{s∈S_d} 1_s(s_t) Σ_{q∈Succ(s)} 1_q(s_{t+1}) log( (Σ_{a′∈A(s)} P_{s,a′,q} π^A_{s,a′}) / (Σ_{a′∈A(s)} P_{s,a′,q} π^S_{s,a′}) )
  = Σ_{t=0}^{k−1} Σ_{s∈S_d} Pr^{π^A}(s_t = s) Σ_{q∈Succ(s)} Σ_{a∈A(s)} P_{s,a,q} π^A_{s,a} log( (Σ_{a′∈A(s)} P_{s,a′,q} π^A_{s,a′}) / (Σ_{a′∈A(s)} P_{s,a′,q} π^S_{s,a′}) ).

If KL(Γ^{π^A}_M || Γ^{π^S}_M) is finite, we have

  KL(Γ^{π^A}_M || Γ^{π^S}_M) = lim_{k→∞} KL(Γ^{π^A}_{M,k} || Γ^{π^S}_{M,k})
  = Σ_{s∈S_d} Σ_{q∈Succ(s)} Σ_{a∈A(s)} P_{s,a,q} x^A_{s,a} log( (Σ_{a′∈A(s)} P_{s,a′,q} x^A_{s,a′}) / (π^S_{s,q} Σ_{a′∈A(s)} x^A_{s,a′}) ).

Finally, since P_{s,a,q} is zero for all q ∉ Succ(s) and we defined 0 log 0 = 0, we can safely replace Succ(s) with S.

Proof of Proposition 3: Assume that KL(Γ^{π^A}_M || Γ^{π^S}_M) is finite under the stationary policies π^A and π^S. The objective function of the problem given in (2) is equal to

  Σ_{s∈S_d} Σ_{q∈Succ(s)} Σ_{a∈A(s)} P_{s,a,q} x^A_{s,a} log( (Σ_{a′∈A(s)} P_{s,a′,q} x^A_{s,a′}) / (π^S_{s,q} Σ_{a′∈A(s)} x^A_{s,a′}) )

due to Lemma 3. Constraints (4b) and (4c) define the stationary policies that make the states in S_d have valid and finite occupation measures, and constraint (4d) encodes the reachability constraint.

Note that

  Σ_{q∈S} Σ_{a∈A(s)} P_{s,a,q} x^A_{s,a} log( (Σ_{a′∈A(s)} P_{s,a′,q} x^A_{s,a′}) / (π^S_{s,q} Σ_{a′∈A(s)} x^A_{s,a′}) )

is the KL divergence between [Σ_{a∈A(s)} P_{s,a,q} x^A_{s,a}]_{q∈Succ(s)} and [π^S_{s,q} Σ_{a′∈A(s)} x^A_{s,a′}]_{q∈Succ(s)}, which is convex in the x^A_{s,a} variables. Since the objective function of (4) is a sum of convex functions and the constraints are affine, (4) is a convex optimization problem.

We now show that there exists a stationary policy on M that achieves the optimal value of (1). By Proposition 1, we have that for all s ∈ S_d, the occupation measures must be bounded. We may apply the constraints x^A_{s,a} ≤ M_s for all s in S_d and a in A(s) without changing the optimal value of (4). After this modification, since the objective function is a continuous function of the x^A_{s,a} values and the feasible space is compact, there exists a set of occupation measure values and, consequently, a stationary policy that achieves the optimal value of (4).

Proof of Proposition 4: The condition P_{s,a,q} > 0 for all s ∈ S_d, a ∈ A(s), and q ∈ Succ(s) implies that Σ_{a∈A(s)} x^S_{s,a} P_{s,a,q} is strictly positive for all q ∈ Succ(s). Note that for the states q ∉ Succ(s), we have Σ_{a∈A(s)} x^A_{s,a} P_{s,a,q} = 0. We also note that, by Assumption 3, the occupation measures are bounded for all s ∈ S_d under π^S. Hence, the objective function of (5) is bounded and jointly continuous in x^S_{s,a} and x^A_{s,a}.

Since we showed that there exists a policy that attains the optimal value of Problem 1, we may represent the optimization problem given in (5) as

  sup_{x^S} min_{x^A} f(x^S, x^A)
  subject to x^S ∈ X^S and x^A ∈ X^A.

Note that X^S and X^A are compact spaces, since the occupation measures are bounded for all state–action pairs. Given that X^A is a compact space, the function f(x^S) = min_{x^A} f(x^S, x^A) is a continuous function of x^S [38]. The optimal value of sup_{x^S} f(x^S) is attained. Consequently, there exists a policy π^S that achieves the optimal value of (5).

REFERENCES

[1] T. E. Carroll and D. Grosu, "A game theoretic investigation of deception in network security," Secur. Commun. Netw., vol. 4, no. 10, pp. 1162–1172, 2011.
[2] M. H. Almeshekah and E. H. Spafford, "Cyber security deception," in Cyber Deception. Berlin, Germany: Springer, 2016, pp. 23–50.
[3] W. McEneaney and R. Singh, "Deception in autonomous vehicle decision making in an adversarial environment," in Proc. AIAA Guid., Navigat., Control Conf. Exhib., 2005, Art. no. 6152.
[4] M. Lloyd, The Art of Military Deception. Barnsley, U.K.: Pen and Sword, 2003.
[5] J. Shim and R. C. Arkin, "A taxonomy of robot deception and its benefits in HRI," in Proc. Int. Conf. Syst., Man, Cybern., 2013, pp. 2328–2335.
[6] R. Doody, "Lying and Denying," 2018. [Online]. Available: http://www.mit.edu/%7Erdoody/LyingMisleading.pdf
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley, 2012.
[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[9] V. Pantelic, S. M. Postma, and M. Lawford, "Probabilistic supervisory control of probabilistic discrete event systems," IEEE Trans. Autom. Control, vol. 54, no. 8, pp. 2013–2018, Aug. 2009.
[10] M. Lawford and W. Wonham, "Equivalence preserving transformations for timed transition models," IEEE Trans. Autom. Control, vol. 40, no. 7, pp. 1167–1179, Jul. 1995.
[11] M. Bakshi and V. M. Prabhakaran, "Plausible deniability over broadcast channels," IEEE Trans. Inf. Theory, vol. 64, no. 12, pp. 7883–7902, Dec. 2018.
[12] A. Boularias, J. Kober, and J. Peters, "Relative entropy inverse reinforcement learning," in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011, pp. 182–189.
[13] S. Levine and P. Abbeel, "Learning neural network policies with guided policy search under unknown dynamics," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 1071–1079.
[14] Y. Savas, M. Ornik, M. Cubuktepe, M. O. Karabag, and U. Topcu, "Entropy maximization for Markov decision processes under temporal logic constraints," IEEE Trans. Autom. Control, vol. 65, no. 4, pp. 1552–1567, Apr. 2020.
[15] C. Keroglou and C. N. Hadjicostis, "Probabilistic system opacity in discrete event systems," Discrete Event Dyn. Syst., vol. 28, no. 2, pp. 289–314, 2018.
[16] M. Ornik and U. Topcu, "Deception in optimal control," in Proc. 56th Annu. Allerton Conf. Commun., Control, Comput., 2018, pp. 821–828.
[17] M. O. Karabag, M. Ornik, and U. Topcu, "Least inferable policies for Markov decision processes," in Proc. Amer. Control Conf., 2019, pp. 1224–1231.
[18] J. Fu, S. Han, and U. Topcu, "Optimal control in Markov decision processes via distributed optimization," in Proc. 54th Annu. Conf. Decis. Control, 2015, pp. 7462–7469.
[19] M. O. Karabag, M. Ornik, and U. Topcu, "Optimal deceptive and reference policies for supervisory control," in Proc. 58th Conf. Decis. Control, 2019, pp. 1323–1330.
[20] J. Neyman and E. S. Pearson, "On the problem of the most efficient tests of statistical hypotheses," Philos. Trans. Roy. Soc. London, Ser. A, vol. 231, no. 694–706, pp. 289–337, 1933.
[21] C. Baier and J.-P. Katoen, Principles of Model Checking. Cambridge, MA, USA: MIT Press, 2008.
[22] K. Etessami, M. Kwiatkowska, M. Y. Vardi, and M. Yannakakis, "Multi-objective model checking of Markov decision processes," in Proc. Int. Conf. Tools Algorithms Construction Anal. Syst., 2007, pp. 50–65.
[23] E. Altman, Constrained Markov Decision Processes. Boca Raton, FL, USA: CRC Press, 1999.
[24] O. Kupferman and R. Lampert, "On the construction of fine automata for safety properties," in Proc. Int. Symp. Autom. Technol. Verification Anal., 2006, pp. 110–124.
[25] T. Matsui, "NP-hardness of linear multiplicative programming and related problems," J. Glob. Optim., vol. 9, no. 2, pp. 113–119, 1996.
[26] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA, USA: Freeman, 1979.
[27] R. M. Karp, "Reducibility among combinatorial problems," in Complexity of Computer Computations. Berlin, Germany: Springer, 1972, pp. 85–103.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[29] Y. Wang, W. Yin, and J. Zeng, "Global convergence of ADMM in nonconvex nonsmooth optimization," J. Sci. Comput., vol. 78, no. 1, pp. 29–63, 2019.
[30] M. Hong, Z.-Q. Luo, and M. Razaviyayn, "Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems," SIAM J. Optim., vol. 26, no. 1, pp. 337–364, 2016.
[31] M. Grant and S. Boyd, CVX: MATLAB Software for Disciplined Convex Programming, version 2.1, 2014. [Online]. Available: http://cvxr.com/cvx
[32] MOSEK ApS, The MOSEK Optimization Toolbox for MATLAB Manual, Version 8.1, 2017. [Online]. Available: http://docs.mosek.com/8.1/toolbox/index.html
[33] A. Wächter and L. T. Biegler, "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," Math. Program., vol. 106, no. 1, pp. 25–57, 2006.
[34] S. Alamdari, E. Fata, and S. L. Smith, "Persistent monitoring in discrete environments: Minimizing the maximum weighted latency between observations," Int. J. Robot. Res., vol. 33, no. 1, pp. 138–154, 2014.
[35] Google, "Map of San Francisco," Accessed: Jan. 25, 2019. [Online]. Available: https://www.google.com/maps/@37.789463,-122.4068681,16.98z
[36] M. M. Fard and J. Pineau, "Non-deterministic policies in Markovian decision processes," J. Artif. Intell. Res., vol. 40, pp. 1–24, 2011.
[37] K. Conrad, "Probability distributions and maximum entropy," Entropy, vol. 6, no. 452, pp. 1–10, 2004.
[38] F. H. Clarke, "Generalized gradients and applications," Trans. Amer. Math. Soc., vol. 205, pp. 247–262, 1975.

Mustafa O. Karabag received the B.S. degree in electrical and electronics engineering from Bogazici University, Istanbul, Turkey, in 2017. He is currently working toward the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA. His research interests include developing theory and algorithms for noninferable, deceptive planning in adversarial environments.

Melkior Ornik received the Ph.D. degree from the University of Toronto, Toronto, ON, Canada, in 2017. He is currently an Assistant Professor with the Department of Aerospace Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana–Champaign, Champaign, IL, USA. His research interests include developing theory and algorithms for learning and planning of autonomous systems operating in uncertain, complex, and changing environments, and in scenarios where only limited knowledge of the system is available.

Ufuk Topcu received the Ph.D. degree from the University of California, Berkeley, Berkeley, CA, USA, in 2008. He held a postdoctoral research position with the California Institute of Technology, Pasadena, CA, USA, until 2012. He was a Research Assistant Professor with the University of Pennsylvania, Philadelphia, PA, USA, until 2015. He is currently an Associate Professor with the University of Texas at Austin, Austin, TX, USA. His research interests include design and verification of autonomous systems through novel connections between formal methods, learning theory, and controls. Dr. Topcu is the recipient of the Antonio Ruberti Young Researcher Prize, the NSF CAREER Award, and the Air Force Young Investigator Award.