
A novel decision making approach with safe reinforcement learning guided by pSTL online monitor

Anonymous Author(s)
ABSTRACT

The Cyber Physical System (CPS) has broad application prospects in safety-critical systems such as autonomous driving, aerospace, etc. Research on secure CPS controllers is currently a hot and challenging topic. Existing rule-based CPS controllers suffer from poor scalability, depend on domain experts' manual design, and are typically unable to adapt to unknown environments. On the other hand, control methods based on Deep Reinforcement Learning (DRL) have powerful advantages in handling high-dimensional states and uncertain environments. However, they ignore the potential losses and costs that agents may suffer during the learning process and thus cannot effectively ensure the safety of CPS. To address the safety issues of reinforcement learning control methods, this paper proposes an innovative safe DRL framework guided by past Signal Temporal Logic online monitoring (POM-SRL). First, we use pSTL to describe the safety requirements that the RL agent needs to obey during the exploration process and monitor them in real-time. Based on the quantitative evaluation of the online monitoring results, we further improve the experience replay mechanism and the reward reshaping algorithm in DRL, which improves the learning efficiency of the agent and the credibility of its decision-making. Experimental results show that the proposed method achieves significant improvements in convergence speed and safety compared to the traditional DRL algorithm.

KEYWORDS

CPS, Signal Temporal Logic, Safe Reinforcement Learning, Online Monitor

ACM Reference Format:
Anonymous Author(s). 2023. A novel decision making approach with safe reinforcement learning guided by pSTL online monitor. In Proceedings of 46th International Conference on Software Engineering (ICSE 2024). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

CPS [Negri et al. 2017] is a complex system that integrates physical hardware, IoT, and computing resources to achieve interaction between the computational process and the physical world. Traditional control algorithms rely on precise, expertly designed mathematical models, such as differential equations describing the behavior of physical systems. However, it is difficult to obtain mathematical models for stochastic factors in the system, making it challenging to apply these methods on a large scale. Reinforcement learning (RL) agents use "trial-and-error" learning to find the policy with the maximum cumulative reward [Toro Icarte et al. 2018], and RL offers end-to-end learning and the ability to make decisions in uncertain environments [Kaelbling et al. 1996]. Therefore, with the deepening of AI research, RL control algorithms bring new opportunities for the development of safety-critical CPS such as autonomous driving and aerospace.

However, reinforcement learning algorithms can be seen as a black box, with extremely complex and non-interpretable relationships between inputs and outputs [Kirsch et al. 2022; Selim et al. 2022]. Therefore, when faced with uncertain scenarios and complex tasks, the safety of CPS cannot be guaranteed [Ivanov et al. 2019; Tran et al. 2019]. If reinforcement learning methods are directly applied in the real physical world and allowed to explore randomly without safety constraints in uncertain environments, the process may involve a large number of trial-and-error actions and dangerous movements, which could result in serious collision accidents. Existing research on ensuring the safety of reinforcement learning agents mainly follows two directions: changing the optimization objective and modifying the exploration process [García and Fernández 2015]. The first requires modeling the safety criteria to be considered and transforming them into an optimization problem, namely asking whether the system can obtain high cumulative rewards and find the optimal policy that satisfies the safety constraints [Shani et al. 2020]. The second modifies the agent's exploration process, prevents the agent from repeatedly entering dangerous states, and makes full use of experience that satisfies the safety constraints, improving the learning speed of the agent [Noothigattu et al. 2019].

Formal Methods (FM) are an effective way of verifying the safety of software using rigorous mathematical definitions [Sirjani et al. 2020b]. To address the safety issues of CPS, existing research has modeled the system and verified with formal methods whether it satisfies the safety specifications. The verification methods for CPS can be divided into two categories: model checking and runtime verification (RV). Model checking relies on mathematical methods to model the system abstractly [Sirjani et al. 2020a]. However, it is difficult to model high-dimensional and non-convex AI components with mathematical formulas, which may cause the state space explosion problem, so the approach has limitations. On the other hand, runtime verification is a lightweight method for ensuring the safety of CPS. It monitors the system's execution trace in real-time to verify whether it satisfies the safety requirements [Kojchev et al. 2020; Sánchez et al. 2019]. It can alleviate the exponential state explosion caused by the growing state space of the system; moreover, some traces can only be obtained through real-time interaction with the environment,
and so can only be verified through runtime verification. STL can express continuous real-valued signals and is widely used in various fields; pSTL is an extension of STL [Maler and Nickovic 2004]. Therefore, a natural idea for ensuring the safety of reinforcement learning controllers in safety-critical systems is to describe the safety requirements that the agent must follow with pSTL, monitor the system's traces online, and improve the training process of the reinforcement learning agent both by optimizing objectives and by modifying exploration processes, leveraging the runtime verification results. The main contributions include:

• We propose a framework of safe reinforcement learning guided by online monitoring of pSTL specifications and detail the basic process for online monitoring of CPS systems with RL control components.
• We integrate runtime verification technology and reinforcement learning algorithms. We improve Deep Reinforcement Learning (DRL) algorithms with the robust semantics of pSTL by improving the experience replay mechanism and modifying the reward shaping algorithm.
• We validate the proposed method in a car-following scenario in autonomous driving. The results show that our method achieves significant improvements in convergence speed, decision safety, and resistance to random signal interference compared to the traditional DDPG algorithm.

2 PRELIMINARIES

2.1 Reinforcement Learning

The core idea of RL is that the agent continuously interacts with the environment and optimizes its policy through environment feedback, so that it learns the strategy that maximizes the cumulative reward in an uncertain environment. First, the agent observes the current environment state and makes exploratory decisions; the actions taken by the agent further affect the environment state and yield corresponding rewards or punishments. Finally, the agent continuously optimizes its policy based on these evaluations, adapts to changes in the environment, and ultimately obtains the optimal strategy that maximizes the long-term cumulative reward.

The Markov Decision Process (MDP) is the most essential mathematical model of RL and characterizes the interactive learning process between the agent and the environment very well. An MDP satisfies the Markov property, which means that the state transition at a future time 𝑡_𝑛 depends only on the state at the current time 𝑡_{𝑛−1} and is independent of the states at the previous times 𝑡_1 ∼ 𝑡_{𝑛−2}.

Definition 2.1 (Markov Decision Process). An MDP can be represented as a tuple M = (𝑆, 𝐴, 𝑃, 𝛾, 𝑅), where:
• 𝑆: a set of observation states;
• 𝐴: a finite set of actions;
• 𝑃 : 𝑆 × 𝐴 × 𝑆 → [0, 1]: the state transition probability function, where 𝑃(𝑠′ | 𝑠, 𝑎) is the probability of transitioning from state 𝑠 to state 𝑠′ by taking action 𝑎;
• 𝛾: the discount factor;
• 𝑅 : 𝑆 × 𝐴 × 𝑆 → ℝ: the reward function, where 𝑟 = 𝑅(𝑠, 𝑎, 𝑠′) is the immediate reward/punishment that the environment gives the agent when it takes action 𝑎 in state 𝑠 and transitions to state 𝑠′.

2.2 Past Signal Temporal Logic

STL can describe the linear temporal properties of continuous real-valued signals in CPS. pSTL is the fragment of STL in which only past temporal operators are used (no future operators). Before presenting the syntax and semantics of pSTL, it is necessary to clarify the concept of signals.

Definition 2.2 (Signal). Let T ⊆ ℝ⁺ be a finite or infinite set of time instants. Given a time instant 𝑡 ∈ T, a signal x ∈ X can be regarded as a function from T to a set of real values X, i.e. a signal can be viewed as a time series of values that evolve over time.

Given a time point 𝑡 ∈ T and a time interval [𝑎, 𝑏], the syntax of pSTL is defined as follows:

𝜙 := 𝜇 | ¬𝜙 | 𝜙 ∧ 𝜑 | 𝜙 ∨ 𝜑 | A_{[𝑎,𝑏]} 𝜙 | P_{[𝑎,𝑏]} 𝜙 | 𝜙 S_{[𝑎,𝑏]} 𝜑    (1)

where 𝜇 is an atomic proposition of the form 𝑓(x(𝑡)) > 0; ¬, ∧, ∨ are the standard boolean operators; and S_{[𝑎,𝑏]} (since), P_{[𝑎,𝑏]} (previously), and A_{[𝑎,𝑏]} (always) are past-tense temporal operators over the time interval [𝑎, 𝑏].

The time horizon of a formula 𝜙 (denoted hrz(𝜙)) indicates the number of time steps that need to be taken into consideration. It depends on the structure of the formula and can be computed recursively:

hrz(𝜇) = 0
hrz(¬𝜙) = hrz(𝜙)
hrz(𝜙₁ ∨ 𝜙₂) = max(hrz(𝜙₁), hrz(𝜙₂))
hrz(𝜙₁ ∧ 𝜙₂) = max(hrz(𝜙₁), hrz(𝜙₂))
hrz(P_{[𝑎,𝑏]} 𝜙) = 𝑏 + hrz(𝜙)
hrz(A_{[𝑎,𝑏]} 𝜙) = 𝑏 + hrz(𝜙)
hrz(𝜙₁ S_{[𝑎,𝑏]} 𝜙₂) = 𝑏 + max(hrz(𝜙₁), hrz(𝜙₂))    (2)

Definition 2.3 (Boolean Semantics of pSTL). Let 𝜔 be a system trace and 𝑡 a time instant. We denote that 𝜔 satisfies the formula 𝜙 at time 𝑡 as (𝜔, 𝑡) |= 𝜙.

(𝜔, 𝑡) |= 𝜇 ⇔ 𝑓(x(𝑡)) > 0
(𝜔, 𝑡) |= 𝜙 ∧ 𝜑 ⇔ (𝜔, 𝑡) |= 𝜙 ∧ (𝜔, 𝑡) |= 𝜑
(𝜔, 𝑡) |= 𝜙 ∨ 𝜑 ⇔ (𝜔, 𝑡) |= 𝜙 ∨ (𝜔, 𝑡) |= 𝜑
(𝜔, 𝑡) |= ¬𝜙 ⇔ ¬((𝜔, 𝑡) |= 𝜙)
(𝜔, 𝑡) |= P_{[𝑎,𝑏]} 𝜙 ⇔ ∃𝑡′ ∈ 𝐼(𝑡, [𝑎, 𝑏]). (𝜔, 𝑡′) |= 𝜙
(𝜔, 𝑡) |= A_{[𝑎,𝑏]} 𝜙 ⇔ ∀𝑡′ ∈ 𝐼(𝑡, [𝑎, 𝑏]). (𝜔, 𝑡′) |= 𝜙
(𝜔, 𝑡) |= 𝜙 S_{[𝑎,𝑏]} 𝜑 ⇔ ∃𝑡′ ∈ 𝐼(𝑡, [𝑎, 𝑏]). (𝜔, 𝑡′) |= 𝜑 ∧ ∀𝑡″ ∈ [𝑡′, 𝑡]. (𝜔, 𝑡″) |= 𝜙

where 𝐼(𝑡, [𝑎, 𝑏]) = [𝑡 − 𝑏, 𝑡 − 𝑎] ∩ [0, 𝑡].

Definition 2.4 (Robust Semantics of pSTL). Define the robustness function 𝜌(𝜙, 𝜔, 𝑡), representing the quantitative valuation of the
system trace 𝜔 at time 𝑡 for a pSTL formula 𝜙:

𝜌(𝜇, 𝜔, 𝑡) = 𝑓(x(𝑡))
𝜌(¬𝜙, 𝜔, 𝑡) = −𝜌(𝜙, 𝜔, 𝑡)
𝜌(𝜙 ∧ 𝜓, 𝜔, 𝑡) = min(𝜌(𝜙, 𝜔, 𝑡), 𝜌(𝜓, 𝜔, 𝑡))
𝜌(𝜙 ∨ 𝜓, 𝜔, 𝑡) = max(𝜌(𝜙, 𝜔, 𝑡), 𝜌(𝜓, 𝜔, 𝑡))
𝜌(𝜙 S_{[𝑎,𝑏]} 𝜓, 𝜔, 𝑡) = sup_{𝑡′ ∈ 𝐼(𝑡,[𝑎,𝑏])} min( 𝜌(𝜓, 𝜔, 𝑡′), inf_{𝑡″ ∈ [𝑡′,𝑡]} 𝜌(𝜙, 𝜔, 𝑡″) )
𝜌(P_{[𝑎,𝑏]} 𝜙, 𝜔, 𝑡) = sup_{𝑡′ ∈ 𝐼(𝑡,[𝑎,𝑏])} 𝜌(𝜙, 𝜔, 𝑡′)
𝜌(A_{[𝑎,𝑏]} 𝜙, 𝜔, 𝑡) = inf_{𝑡′ ∈ 𝐼(𝑡,[𝑎,𝑏])} 𝜌(𝜙, 𝜔, 𝑡′)

The robustness value of pSTL also reflects the result of the qualitative semantics: 𝜌(𝜙, 𝜔, 𝑡) ≥ 0 indicates that the trajectory 𝜔 satisfies 𝜙 at time 𝑡, i.e., (𝜔, 𝑡) |= 𝜙, while 𝜌(𝜙, 𝜔, 𝑡) < 0 indicates that 𝜔 does not satisfy 𝜙 at time 𝑡.
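As a concrete illustration, the robust semantics above can be sketched for discretely sampled traces (unit time step). The nested-tuple formula encoding and the names below are illustrative assumptions, not the paper's implementation; the since operator follows the same recursive pattern and is omitted for brevity:

```python
# Formula encoding (an assumption of this sketch):
#   ("mu", f)        rho = f(x(t))
#   ("not", p), ("and", p, q), ("or", p, q)
#   ("P", a, b, p)   sup over t' in I(t, [a, b]) = [t-b, t-a] ∩ [0, t]
#   ("A", a, b, p)   inf over the same window

def robustness(phi, trace, t):
    op = phi[0]
    if op == "mu":                 # atomic predicate f(x(t)) > 0
        return phi[1](trace[t])
    if op == "not":
        return -robustness(phi[1], trace, t)
    if op == "and":
        return min(robustness(phi[1], trace, t), robustness(phi[2], trace, t))
    if op == "or":
        return max(robustness(phi[1], trace, t), robustness(phi[2], trace, t))
    if op in ("P", "A"):           # past-time window I(t, [a, b])
        a, b, sub = phi[1], phi[2], phi[3]
        vals = [robustness(sub, trace, u)
                for u in range(max(0, t - b), max(0, t - a) + 1)]
        return max(vals) if op == "P" else min(vals)
    raise ValueError(f"unknown operator: {op}")

# "Speed stayed below 30 at every instant of the last 3 steps."
speed_ok = ("A", 0, 3, ("mu", lambda x: 30.0 - x))
trace = [10.0, 12.0, 28.0, 25.0, 31.0]
print(robustness(speed_ok, trace, 3))        # min of 30 - x over t' = 0..3 -> 2.0
print(robustness(speed_ok, trace, 4) >= 0)   # violated at t = 4 -> False
```

A positive result quantifies the margin by which the trace satisfies the property, matching the sign convention stated above.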

3 PROPOSED APPROACH
RL agents may make incorrect decisions when facing uncertain and
complex environments, which can lead to serious safety incidents.
Therefore, during the learning and exploration processes, it is important to ensure that agents obey formal safety specifications tailored to the specific domain. The online monitoring of pSTL is a technique that checks in real-time whether a CPS satisfies given safety specifications. Thus, we propose an innovative safe reinforcement learning framework guided by pSTL online monitoring (POM-SRL), which provides a new approach for ensuring the safety of RL agents.

Figure 1: The framework of POM-SRL

3.1 An Overview of POM-SRL

Figure 1 shows the framework of POM-SRL, a general architecture combining runtime verification technology and reinforcement learning algorithms.

First, the RL agent receives states from the real world, analyzes them to make decisions, and receives rewards/punishments from the environment. Based on the real-valued reward, the agent updates its policy to obtain a higher cumulative return. The function used to estimate the expected long-term reward is called the value function, while the policy function maps states to actions and can be either deterministic or stochastic; it is usually denoted as 𝜋(𝑎 | 𝑠). DRL fits these functions with neural networks (NN), leveraging the powerful ability of NNs to process high-dimensional data, and evaluates the quality of the fit with a loss function. In addition, independently and identically distributed data is an important prerequisite for the effectiveness of an NN, so the experience replay mechanism is introduced to break the correlations between data. The goal of DRL algorithms is therefore to optimize the black-box policy and minimize the loss function so as to maximize the cumulative reward.

pSTL is used to declare the safety specifications that the RL agent must obey during the training process, and a safe monitor runs in real-time to check them. On the one hand, when errors or faults occur in the system, we can use various means, such as danger warnings, human intervention, and enforced execution, to avoid or minimize possible damage and ensure the safety and correctness of the system. On the other hand, the quantitative evaluation results can be used to further improve the experience replay mechanism and the reward reshaping algorithm in DRL, so as to find a policy that maximizes the satisfaction of the pSTL specifications.

Algorithm 1 shows the monitoring process of our method. The inputs of the algorithm consist of two parts: a pSTL safety specification 𝜙 and the RL-CPS system S. First, the pSTL formula is parsed and its syntax tree 𝑆𝑇 is automatically constructed. Then, the safe monitor keeps listening to the system to get the signal traces 𝑋𝑛 = 𝑠₁, 𝑠₂, ... at time 𝑡. Based on the syntax tree of the pSTL formula, the algorithm recursively calculates the quantitative valuation (denoted as robustness) of the system trace. On the basis of the robustness, we can determine whether the current state meets the safety specification, and further feed the verification result back to the CPS system to enforce safe actions or improve the DRL algorithm.

Algorithm 1 Online Monitoring(𝜙, S)
Input: pSTL formula 𝜙, system S.
Output: Monitor S and quantitatively evaluate the formula 𝜙.
1: procedure OnlineMonitoring(𝜙, S)
2:   Initialize: 𝑡 = SystemTime(), 𝑆𝑇 = SyntaxTree(𝜙)
3:   while true do
4:     (𝑋𝑛, 𝑡) ← WaitData(S)
5:     (𝑟𝑜𝑏𝑢𝑠𝑡, 𝑡) = ComputeRobust(𝑋𝑛, 𝑡, 𝑆𝑇)
6:     Output(𝑟𝑜𝑏𝑢𝑠𝑡, 𝑡)
7:   end while
8: end procedure
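The monitoring loop of Algorithm 1 can be sketched as follows, assuming discrete-time sampling. Here `compute_robust` stands in for the recursive evaluation over the formula's syntax tree, and the class and example property are illustrative, not the paper's implementation:

```python
from collections import deque

def compute_robust(window):
    # Placeholder property: "headway distance stayed above 5 over the window"
    # -> inf over the buffered samples of (d - 5).
    return min(d - 5.0 for d in window)

class SafeMonitor:
    def __init__(self, horizon):
        # Only hrz(phi)+1 samples are ever needed, so a bounded deque keeps
        # the monitor's memory footprint constant.
        self.window = deque(maxlen=horizon + 1)

    def step(self, sample):
        """Feed one new signal sample; return the robustness at this instant."""
        self.window.append(sample)
        return compute_robust(self.window)

monitor = SafeMonitor(horizon=2)
for t, d in enumerate([9.0, 8.0, 6.5, 4.0]):
    rho = monitor.step(d)
    print(t, rho, "SAFE" if rho >= 0 else "VIOLATION")
```

Each iteration mirrors steps 3-7 of Algorithm 1 (wait for data, compute robustness, output the result); feeding the verdict back to the CPS would branch on the sign of the returned robustness.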
3.2 Safety Optimization Problems

To convert pSTL specifications into RL optimization problems, we introduce the definition of the historical Markov decision process (h-MDP), as well as two policy optimization problems for finding a control policy 𝜋 that enforces the desired specification 𝜙: maximizing the probability of satisfaction (MPS) and maximizing the expected robustness degree (MERD).

RL agents must satisfy formal safety specifications, which often constrain the temporal behavior of the agent. For example, if a specification demands that the agent access area B only after accessing area A, whether the system should move towards B depends on whether it has previously accessed A. However, according to the definition of an MDP, the action generated by the agent in the next state is based only on the current state and is independent of the history states, i.e., 𝑃(𝑆_{𝑡+1} | 𝑆_𝑡) = 𝑃(𝑆_{𝑡+1} | 𝑆_1, ..., 𝑆_𝑡). Therefore, in order to resolve the conflict between the historical dependence of safety specifications and the Markov property of the decision-making process, we define a history Markov decision process (h-MDP), which allows the agent to consider the trajectory of the last ℎ states when making decisions, where ℎ is calculated through the time horizon formula (2).

Definition 3.1 (h-MDP). Let Φ be a pSTL formula with formula horizon ℎ = hrz(Φ). The ℎ-history state of the agent at time 𝑡 is defined as 𝑠ₜʰ = 𝑠_{𝑡−ℎ+1:𝑡}, which includes the current state and the preceding ℎ − 1 states. Given a Markov decision process 𝑀 = (𝑆, 𝐴, 𝑃, 𝛾, 𝑅), the h-MDP is defined as 𝑀ʰ = (𝑆ʰ, 𝐴, 𝑃, 𝛾, 𝑅ʰ), where:
• 𝑆ʰ: the set of historical states 𝑠ₜʰ = 𝑠_{𝑡−ℎ+1:𝑡} of the agent;
• 𝐴: the set of all actions that the agent can take;
• 𝑃: the probabilistic transition relation;
• 𝛾: the discount factor;
• 𝑅ʰ: the reward function, which depends on the historical trajectory.

To ensure that RL agents are constrained by formal safety properties when making decisions, that is, to find a policy 𝜋 enforcing the desired specification Φ, we propose two optimization problems. The first attempts to maximize the probability of satisfying Φ; we call it Maximizing Probability of Satisfaction (MPS). The second, Maximizing Expected Robustness Degree (MERD), is based on the robust semantics of pSTL.

Definition 3.2 (Maximizing Probability of Satisfaction). Let Φ be a pSTL formula, and ℎ = hrz(Φ) its time horizon. Given an h-MDP model 𝑀ʰ = (𝑆ʰ, 𝐴, 𝑃, 𝛾, 𝑅ʰ), find a control policy such that:

𝜋₁* = argmax_𝜋 𝑃(𝑠ₜʰ |= Φ)    (3)

where 𝑃(𝑠ₜʰ |= Φ) is the probability of 𝑠ₜʰ satisfying Φ under policy 𝜋.

Definition 3.3 (Maximizing Expected Robustness Value). Let Φ be a pSTL formula, and ℎ = hrz(Φ) its time horizon. Given an h-MDP model 𝑀ʰ = (𝑆ʰ, 𝐴, 𝑃, 𝛾, 𝑅ʰ), the objective is to find a control policy that maximizes:

𝜋₂* = argmax_𝜋 𝐸_𝜋(𝜌(𝑠ₜʰ, Φ, 𝑡))    (4)

where 𝐸_𝜋(𝜌(𝑠ₜʰ, Φ, 𝑡)) is the expected robustness value of the actions taken by policy 𝜋 under the history state set 𝑠ₜʰ.

3.3 Robustness-Based Safe Reinforcement Learning

In this paper, we propose two approaches to improving the safety of RL controllers with the robust semantics of pSTL, which gives a quantitative evaluation of the agent's performance: modifying the optimization objective and improving the exploration process. The first incorporates the robust semantics of the safety properties into the reward function: when the agent takes safe/dangerous actions, it receives real-valued rewards/penalties, enabling it to learn the optimal policy that satisfies the safety specifications. The second evaluates the quality of historical experiences based on the robustness of the pSTL property Φ, enabling the agent to learn a policy that satisfies the safety constraints as quickly as possible. It also helps the agent avoid repeatedly entering dangerous situations and improves the convergence speed.

3.3.1 Reward Shaping. Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning. Existing methods, such as potential-based reward shaping, define a potential function over states. If the agent moves from a state with lower potential to a state with higher potential, it receives a positive reward; otherwise, it is punished, which speeds up the learning process. The most common form of reward shaping is additive, defined as 𝑅* = 𝑅 + 𝐹, where 𝑅 is the original reward function, 𝐹 is the potential-based shaping function, and 𝑅* is the reshaped reward function.

In safety-critical systems, the behavior of the agent must be constrained by formal safety properties, and there are strict temporal relationships between actions that should be reflected in the agent's goals, specifically in the discounted cumulative return. For instance, an autonomous vehicle must turn on its turn signal before changing lanes, then move to the left, confirm safety, accelerate to pass the front vehicle, and finally return to the original lane. However, such temporal behavior is difficult to describe directly in the reward function. As defined in Definition 3.1, the reward obtained by the agent for completing an action depends not only on the current state but also on the historical trajectory. pSTL can describe such temporal properties well, and its robust semantics can quantify how well an RL agent's historical trajectory satisfies the safety properties.

Therefore, we propose a Robustness-based Reward Shaping algorithm based on the robustness obtained by online monitoring of the CPS. The potential function 𝐹 is defined as the robustness of the historical trajectory state 𝑆ʰ with respect to the pSTL formula 𝜙. However, reward reshaping may change the optimal policy, so the obtained robustness values need to be appropriately weighted. Therefore, the reward for reinforcement learning is defined as:

𝑅* = 𝑅 + 𝜂𝜌(𝜙, 𝑠ₜ, 𝑡)    (5)
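The shaping rule of Eq. (5), applied to the h-step history state of Definition 3.1, can be sketched as follows. The values of ETA and H and the example safety property are illustrative assumptions, not values from the paper:

```python
from collections import deque

ETA = 0.1   # eta: weight of the robustness term in Eq. (5)
H = 3       # h = hrz(Phi): length of the history window

def rho(history):
    # Example safety property: headway distance stayed above 5 over the
    # whole h-step history (inf over the window).
    return min(d - 5.0 for d in history)

def shaped_reward(base_reward, history):
    """R* = R + eta * rho, as in Eq. (5)."""
    return base_reward + ETA * rho(history)

# Each step, the newest state is appended and the oldest drops out,
# so the shaping term always covers exactly the last h states.
history = deque(maxlen=H)
for base_r, distance in [(1.0, 9.0), (1.0, 7.0), (1.0, 4.0)]:
    history.append(distance)
    print(shaped_reward(base_r, history))
```

As the headway shrinks below the safe threshold, the robustness term turns negative and the shaped reward drops, which is exactly the penalty behavior the reward-reshaping mechanism relies on.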
where 𝜂 is a parameter used to adjust the weight of the robustness value in the reward function.

After introducing the Robustness-based Reward Shaping algorithm, the h-MDP (𝑆ʰ, 𝐴, 𝑃, 𝛾, 𝑅ʰ) is updated to (𝑆ʰ, 𝐴, 𝑃, 𝛾, 𝑅*), where 𝑅* = 𝑅 + 𝜂𝜌(𝜙, 𝑠ₜ, 𝑡). Through this method, the optimization goal of the agent is transformed into finding a control policy that maximizes the expected robustness value of the safety constraints, i.e. MERD as defined in Definition 3.3, while completing the basic task. The reward obtained by the agent now consists of two parts: the reward for completing its basic goals in the specific scenario, and the robustness of the historical trajectory state signal 𝑆ʰ with respect to the pSTL safety property. By the robust semantics of pSTL, the more the agent's behavior satisfies the safety constraints, the larger the robustness value obtained by online monitoring, and the larger the corresponding reward. Conversely, if the agent is in a dangerous situation, the robustness value is a large negative number, and the agent receives a correspondingly large punishment. The objective function of the policy network can therefore be defined as:

𝐽(𝜃) = 𝔼_{𝐴ₜ ∼ 𝜋_𝜃(𝑎ₜ|𝑠ₜ)} [ Σ_{𝑡=1}^{𝑇} 𝑅* ] = Σ_{𝑡=1}^{𝑇} Σ_{𝑎ₜ ∈ 𝐴ₜ} 𝜋_𝜃(𝑎ₜ | 𝑠ₜ) · 𝑅*(𝑠ₜ, 𝑎ₜ, 𝑠_{𝑡+1})    (6)

The parameters of the policy network are updated using the gradient:

∇_𝜃 𝐽(𝜃) = Σ_{𝑡=1}^{𝑇} ∇_𝜃 log 𝜋_𝜃(𝑎ₜ | 𝑠ₜ) · 𝑅*(𝑠ₜ, 𝑎ₜ, 𝑠_{𝑡+1})    (7)

where 𝜋_𝜃(𝑎ₜ | 𝑠ₜ) is the policy network, responsible for selecting actions based on the current state 𝑠ₜ, and 𝑇 is the number of iterations.

3.3.2 Priority Experience Replay. Traditional DRL algorithms store historical experience trajectories in an experience replay buffer and then randomly draw a small subset of samples to train the neural network. However, different experience trajectories have different importance for model training; if the sampling probability of key experience samples is low, the convergence and performance of the algorithm may suffer. To address this problem, existing methods such as Prioritized Experience Replay (PER) sort samples by their TD error, so that more important samples are selected with higher priority. However, this cannot guarantee that the sampled experience satisfies the safety properties or has a higher long-term cumulative return. With the robust semantics of pSTL, we can quantify how strongly a given historical experience generated by the RL agent satisfies a pSTL formula as a real number; in other words, the robustness reflects the safety of the experience to a certain extent. Therefore, this section proposes a Robustness-based Priority Experience Replay (RPER) mechanism based on the robustness obtained by pSTL online monitoring, so that the agent learns faster and trains more efficiently by preferring experiences that satisfy the safety constraints and yield higher cumulative rewards.

The experience replay mechanism extracts samples for training the RL value network or policy network by randomly sampling a number of tuples, which ensures that each experience is uniformly sampled, mitigating the non-stationary distribution of states and preventing the agent from falling into a local optimum. However, random sampling does not take the importance of experiences into account. When the agent encounters a sample with good performance during exploration and stores it in the replay buffer, it may not be replayed for a long time, increasing the training time of the algorithm.

In this article, the priority of experience sample 𝑖 is set as 𝑝ᵢ = 1/rank(𝑖), where rank(𝑖) is a comprehensive measure of the importance of the experience based on three indicators: TD error, reward, and robustness. This improves computational efficiency and ensures that the agent's decision-making complies with the safety constraints defined by pSTL while obtaining the maximum reward.

The TD error is the difference between the value predicted by the online network and the actual state value, where the actual state value is the sum of the immediate reward and the maximum action value of the next state, representing the expected future return of the agent. The larger the TD error of a sample, the more the agent needs to learn from it to optimize its policy, and therefore the more important the experience is considered:

|𝛿ᵢ| = |𝑄(𝑠ₜ, 𝑎ₜ; 𝜃) − (𝑟ₜ + 𝛾 · max_𝑎 𝑄_target(𝑠_{𝑡+1}, 𝑎; 𝜃′))|    (8)

In RL algorithms, the environment gives a reward or punishment reflecting the quality of the current policy to help the agent learn how to achieve the preset objectives. If an experience tuple receives a high reward, it is significant for learning the current policy and should be trained on more frequently. The robustness-based reward reshaping algorithm introduced in the previous section reflects the formal safety constraints in the reward function, so the immediate reward also partly reflects the safety of the agent's state. When using reward as the priority indicator for experience tuples, priority allocation may be biased by tuples with very large or very small rewards. Therefore, the reward is normalized and mapped to a standard normal distribution. Let the fixed size of the experience replay pool be 𝑁, with 𝑛 experiences stored, where 𝑛 ≤ 𝑁. The total return of the collected experience is 𝑅 = Σᵢⁿ 𝑟ᵢ, and 𝑅̄ is the average of the 𝑛 reward values. This paper sorts the experience tuples based on the normalized reward ℛᵢ:

ℛᵢ = (𝑟ᵢ − 𝑅̄) / 𝜎
𝜎 = sqrt( Σ_{𝑖=1}^{𝑛} (𝑟ᵢ − 𝑅̄)² / 𝑛 )
𝑅̄ = Σ_{𝑖=1}^{𝑛} 𝑟ᵢ / 𝑛    (9)

Formal methods are often seen as a way to increase the trustworthiness of safety-critical systems, and techniques such as runtime verification can be used to monitor the execution of CPS. The robustness degree of pSTL quantifies how strongly a given
trajectory satisfies an STL formula as a real number rather than the degree of satisfaction of the historical state set 𝑠𝑡ℎ for the safety
just providing a yes or no answer. With the online monitoring constraint 𝜙 in real-time after the agent takes an action, and returns
mechanism of pSTL, it is possible to determine whether the signal the robustness value 𝜌 (𝑠𝑡ℎ , 𝜙, 𝑡). Environment uses the improved
trajectory generated by the RL agent satisfies the pSTL specifica- reward shaping algorithm to provide a reward score for the agent
tion. The quantitative semantics of pSTL can help us quantify the and updates the next state 𝑠𝑡 +1 , thus obtaining an experience tuple
robustness of the signal trajectory of experience 𝑒𝑖 , denoted as (𝑠𝑡ℎ , 𝑎𝑡 , 𝑟𝑡 , 𝑠𝑡 +1, 𝜌𝑡 ). After obtaining the experience tuple, the priority
𝜌𝑖 (𝜙, 𝑠𝑖 , 𝑖). The larger the robustness value 𝜌𝑖 (𝜙, 𝑠𝑖 , 𝑖) obtained from of the sample is calculated according to Equation 10 and placed in
online monitoring of the RL agent at the current time, the more the the experience replay pool.
experience satisfies the pSTL safety specification, and thus should Therefore, the optimization objective of the POM-SDDPG agent
be given a higher priority. is to find an optimal policy 𝜋 that maximizes the state-value func-
Therefore, in this paper, the importance of experience is measured by three indicators: TD error, reward, and robustness value. The ranking of experience i can be represented as:

rank(i) = κ_1 |δ_i| + κ_2 R_i + κ_3 ρ_i(φ, s_i, i)   (10)

where κ_1, κ_2, κ_3 are hyperparameters representing the weights assigned to TD error, reward, and robustness when calculating the priority of experience.

3.3.3 POM-SDDPG. In this section, we take the Deep Deterministic Policy Gradient (DDPG) algorithm as an example to illustrate how the POM-SRL framework proposed in this paper guides the improvement of DRL algorithms and enhances the safety of the RL controller in the CPS. Figure 2 shows the process of the safe DDPG algorithm based on the pSTL online monitoring mechanism (denoted as POM-SDDPG below). The system engineer designs the pSTL safety specification φ and uses the safe online monitor to observe the current environment state s_t in real time, outputting a quantitative evaluation value ρ(φ, s_t, t). According to the definition of h-MDP, the agent makes decisions based on the observed environment state s_t^h and outputs an action a_t, where h is the formula range of the pSTL specification φ.

Figure 2: POM-SDDPG

Algorithm 2 presents the basic idea of the POM-SDDPG algorithm. Firstly, inspired by the Deep Q-learning (DQN) algorithm, in order to address the problem of unstable Q-values during training, the Actor and Critic networks are each initialized together with a target network of the same structure and parameters. The Actor is the policy network μ(s_t | θ_μ), with the corresponding target policy network μ'(s_t | θ_μ'), responsible for selecting the current action a_t based on the current historical environment state s_t^h. The safe online monitor calculates the robustness value ρ(φ, s_t, t), which is folded into the reward R* of the optimal action-value function Q^{π*}(s, g, a); that is, R* incorporates the online monitoring robustness value. Thus, this policy has two optimization objectives: on the one hand, the agent aims to efficiently accomplish its intended task, for instance, a self-driving car completing a lane-changing overtaking maneuver; on the other hand, the agent aims to satisfy the safety constraints described by the designed pSTL formula as much as possible during the motion process, that is, to maximize the expected robustness value defined in Definition 3.3.

Q^{π*}(s, g, a) = E_π[ Σ_{t=0} γ^t R*(s_t, a_t) | s_t = s, a_t = a ]   (11)

The policy network is updated through gradient descent, where Q*(s, a | θ_Q) is computed by the Critic network.

∇_{θ_μ} J(θ_μ) = (1/N) Σ_i^N ∇_a Q*(s, a | θ_Q)|_{s=s_i, a=π*} ∇_{θ_μ} π*|_{s_i}   (12)

The Critic is the value network Q(s, a | θ_Q), and its corresponding target value network is Q'(s, a | θ_Q'). It evaluates the Q-value of the deterministic policy a_t = μ(s_t^h | θ_μ) generated by the Actor and provides gradient information to the Actor. The target value network, based on the RPER mechanism and the target Actor network's output μ'(s_{t+1} | θ_μ') for the next state s_{t+1}, calculates the Q'(s_{t+1}, a_{t+1} | θ_Q') part of the target Q-value. The main value network then computes the current target Q-value as y_i = r* + γ Q'(s', a' | θ_Q'). The parameters of the value network θ_Q are updated by minimizing the following loss function with the Adam optimizer:

L = (1/N) Σ_i^N (Q*(s, a | θ_Q) − y_i)²   (13)

The value network is updated through gradient descent:

∇_{θ_Q} L = E[ (r* + γ Q'(s', μ'(s' | θ_μ') | θ_Q') − Q(s, a | θ_Q)) ∇_{θ_Q} Q(s, a | θ_Q) ]   (14)

Similar to the DQN algorithm, DDPG periodically copies the parameters of its Actor and Critic networks to their corresponding target networks, as Eq. 15 shows:

θ_μ' = τ θ_μ + (1 − τ) θ_μ'
θ_Q' = τ θ_Q + (1 − τ) θ_Q'   (15)
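As a concrete illustration, the soft update of Eq. 15 can be sketched in a few lines of Python. The parameters are plain lists of floats here for brevity; in the paper's PyTorch setting the same loop would run over the networks' tensors, and the value of τ below is an assumption for the example, not a value from the paper.

```python
def soft_update(online_params, target_params, tau=0.01):
    """Polyak averaging of Eq. 15: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_tgt
            for w, w_tgt in zip(online_params, target_params)]

# With tau = 0.1, the target parameters move one tenth of the way
# toward the online parameters at every update.
theta_mu = [1.0, 2.0]
theta_mu_tgt = [0.0, 0.0]
theta_mu_tgt = soft_update(theta_mu, theta_mu_tgt, tau=0.1)
```

A small τ keeps the target networks slowly moving averages of the online networks, which is what stabilizes the bootstrapped target y_i.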
A novel decision making approach with safe reinforcement learning guided by pSTL online monitor ICSE 2024, April 2024, Lisbon, Portugal

Algorithm 2 POM-SDDPG
Input: Maximum training step T; initialized experience replay buffer H; initialized online policy network μ(s_t | θ_μ) and target policy network μ'(s_t | θ_μ'); initialized online value network Q(s, a | θ_Q) and target value network Q'(s, a | θ_Q');
Output: Trained policy network μ(s_t | θ_μ) and value network Q(s, a | θ_Q)
1: for t = 0 to T do
2:   s ← S_init;
3:   while s is not a terminal state do
4:     Execute action a_t = μ(s_t^h | θ_μ) and obtain new state s_{t+1}
5:     OnlineMonitor(S, φ) → ρ_t(φ, s_t, t)
6:     Obtain robustness-based reward: r_t* = R(s_t, a_t) + η ρ(φ, s_t, t)
7:     Calculate TD error |δ_t| = |r_t + γ Q_target(s_t, argmax_a Q(s_t, a; θ')) − Q(s_{t−1}, a_{t−1}; θ)|
8:     Calculate standard normalized reward R_t = √((r_t − R̄)²) / σ
9:     Obtain the ranking of experience tuples rank(t) = κ_1 |δ_t| + κ_2 R_t + κ_3 ρ_t(φ, s_t, t)
10:    p_t = 1 / rank(t)
11:    Store transition (s_t, a_t, r_t, s_{t+1}, ρ_t) with priority p_t in H
12:    Sample a minibatch of transitions (s_t, a_t, r_t, s_{t+1}, ρ_t) from H with probabilities proportional to their priority
13:    Calculate target Q-value y_i = r_t* + γ Q'(s', a' | θ_Q')
14:    Update main policy network: ∇_{θ_μ} J(θ_μ) = (1/N) Σ_i^N ∇_a Q*(s, a | θ_Q)|_{s=s_i, a=π*} ∇_{θ_μ} π*|_{s_i}
15:    Update main evaluation network as Eq. 14
16:    Update target networks: θ_μ' = τ θ_μ + (1 − τ) θ_μ'; θ_Q' = τ θ_Q + (1 − τ) θ_Q'
17:  end while
18: end for

4 EVALUATION
This chapter discusses the implementation of a deep reinforcement learning method based on pSTL online monitoring and guidance, using the vehicle-following scenario as an example. Firstly, an autonomous driving behavior decision-making method is implemented under uncertain environments based on the DDPG algorithm. Then, the safety constraints that must be maintained in the vehicle-following scenario are expressed using pSTL, the online monitoring methods proposed in this paper are used to verify the RL agent online, and the impact of random signal disturbances in the real world on safety verification is analyzed. Finally, based on the obtained online verification results, the DDPG algorithm is further improved to provide more reliable decision-making for this scenario and to improve the learning efficiency of the agent.

4.1 Platform and Parameters
The software and hardware environment relied upon for algorithm training and validation in this paper is shown in Table 1. The operating system is Linux, version Ubuntu 20.04; the Python version is 3.7; the CARLA simulator version is 0.9.11; the CPU is an Intel Core-i7; the deep learning framework is PyTorch 1.7; and the memory size is 16GB.

Table 1: Software and hardware environment of the experimental equipment

Tools and Environments      Version
Linux                       Ubuntu 20.04
Python                      3.7
CARLA                       0.9.11
CPU                         Intel Core-i7
Deep Learning Framework     PyTorch 1.7
Memory                      16GB

CARLA [Dosovitskiy et al. 2017] is an open-source autonomous driving simulator that uses a client-server architecture, supports the Windows and Linux operating systems, and provides a series of Python APIs for researchers to carry out secondary development. Users control the main participants in the generated scene through the client, add sensor and camera information, and compute control signals such as throttle, brake, and steering angle of the vehicle. In addition, CARLA supports importing map files such as OpenDRIVE [Dupuis et al. 2010] and provides related API interfaces to help users build complex simulation scenarios, with high customizability and scalability.

Figure 3: Vehicle Following in CARLA

CARLA-Gym is a reinforcement learning development environment based on the CARLA simulator under Python. It provides interfaces for observing the environment, calculating reward functions, and setting termination conditions for the interaction between the agent and the environment. In this paper, the interaction between the deep reinforcement learning module and the CARLA simulator is implemented on top of CARLA-Gym. In addition, during the training process of the reinforcement learning algorithm, the safety online monitor first parses the pSTL formula and generates the formula syntax tree. It then receives the real-time state information of the agent from the CARLA simulator, computes the verification result, and provides it to the deep reinforcement learning module, thus achieving better learning efficiency and a faster convergence rate. The framework for deploying the reinforcement learning algorithm on the CARLA simulator in this paper is shown in Figure 4.
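For illustration, the per-step bookkeeping of Algorithm 2 (lines 6–10) — the robustness-shaped reward, the normalized reward, the ranking of Eq. 10, and the resulting sampling priority — can be sketched as below. The weight values κ_1, κ_2, κ_3, η and the running statistics R̄, σ are assumed inputs chosen for the example; note that the weights must keep rank(t) positive for p_t = 1/rank(t) to be a valid priority.

```python
import math

def shaped_reward(r, rho, eta=0.5):
    """Algorithm 2, line 6: r* = R(s_t, a_t) + eta * rho(phi, s_t, t)."""
    return r + eta * rho

def experience_priority(td_error, r, rho, r_mean, r_std,
                        k1=0.5, k2=0.3, k3=0.2):
    """Lines 8-10: rank(t) = k1*|delta_t| + k2*R_t + k3*rho_t, p_t = 1/rank(t).
    R_t = sqrt((r - r_mean)^2) / sigma, i.e. simply |r - r_mean| / sigma."""
    r_norm = math.sqrt((r - r_mean) ** 2) / r_std
    rank = k1 * abs(td_error) + k2 * r_norm + k3 * rho
    return 1.0 / rank

p = experience_priority(td_error=-2.0, r=1.0, rho=0.5,
                        r_mean=0.0, r_std=2.0)
# rank = 0.5*2.0 + 0.3*0.5 + 0.2*0.5 = 1.25, so p = 0.8
```

Sampling minibatches with probability proportional to p_t (line 12) then favors transitions that are surprising (large TD error), well rewarded, or informative about the safety constraint (large robustness).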

Figure 4: The RL algorithm deployed on the CARLA platform

4.2 Vehicle Following
This section verifies the feasibility of the proposed method on the vehicle-following scenario. The selection of this scenario is based on two reasons: firstly, the vehicle-following scenario is the most common scenario in autonomous driving research, with a wide range of applications on highways, urban roads, and rural roads; secondly, it is the easiest autonomous driving scenario to study, and is easy to describe and demonstrate.

The schematic diagram of the vehicle-following scenario is shown in Figure 5. This scenario mainly consists of three dynamic entities, where the red vehicle represents the autonomous driving vehicle Ego, which is the main subject of behavior decision-making. Veh_f and Veh_b represent the front and rear vehicles in the same lane as Ego. t represents the current time, t_0 represents the initial time, and d_f and d_b represent the distances between the Ego vehicle and the front and rear vehicles, which can be obtained through distance sensors. v, a, and α respectively represent the speed, acceleration, and steering angle of the Ego vehicle.

Figure 5: Vehicle Following

4.2.1 pSTL Safety Specification. In this scenario, the system designer hopes that the Ego vehicle can drive efficiently and safely and maintain a safe distance from other vehicles. Specifically, the minimum safe distance d_f^safe to maintain with the front vehicle is such that even if the front vehicle suddenly brakes, the Ego vehicle can stop safely behind it, that is, d_f > d_f^safe. Generally speaking, the safe distance between Ego and the rear vehicle is mainly controlled by Veh_b. However, if the Ego vehicle finds that the distance to the rear vehicle is too close, it can also take measures to increase the distance, thus maintaining the property d_b > d_b^safe while keeping a safe distance from the front vehicle and accelerating. The signal disturbance in the system originates from sensors or actuators, so a disturbance with threshold ε is introduced on the speed of the Ego vehicle. The degree of satisfaction of the safety constraints in the system over the past [0, T] is observed.

Based on the above derivation, the pSTL constraints that need to be followed in the following scenario are defined as follows:

A_{[0,T]}((d_f > d_f^safe) ∧ p(d_b > d_b^safe))   (16)

where d_f^safe = v_ε · t_react + v_ε² / (2a_max) − v_f² / (2a_max), and d_b^safe = (v_max² − v_b²) / (2a_max) − (v_ε · t_react + (v_max² − v_ε²) / (2a_max)), which means that in the past T seconds, the system always satisfies that the distance between the Ego vehicle and the front and rear vehicles is greater than the minimum safe distance. The derivation of the above minimum safe distance formulas can be found in detail in the article [?].

4.2.2 Reward Design. Rewarding the behavior of the agent can be done in two ways. One is to reward or punish the agent only when the episode is over. Although this method can more accurately evaluate whether the agent has achieved its goal, the learning and training process is extremely slow, because the agent is "blind" during the learning process and does not know whether the action taken is correct. The other way is to immediately give the agent instant rewards or punishments, which can significantly speed up the training process. Therefore, based on the proposed idea of using the pSTL online monitoring robustness value for reward reshaping, this paper gives timely feedback to the agent, and the design of the reward function is constrained by the given formalized pSTL safety constraints. Next, the design of the reward function in the vehicle-following scenario is introduced in detail.

Firstly, we hope that the autonomous driving vehicle does not exceed the maximum speed limit v_max during the driving process, but at the same time, we hope that it can drive efficiently. Therefore, the reward function for the vehicle speed is as follows:

R_v = { exp(−(v_max − v)²),  v ≤ v_max
      { −1,                  v > v_max   (17)

Secondly, in the vehicle-following scenario, we should encourage the autonomous driving vehicle to drive as close to the center of the road as possible. The reward function for the position of the vehicle in the lane is shown in Equation 18, where d_mid represents the distance between the center axis of the vehicle and the center line of the lane.

R_l = exp(−d_mid)   (18)

If a serious safety accident such as a collision occurs to the autonomous driving vehicle, a large punishment should be given:

R_c = { −1000,  if collide
      { 0,      else   (19)

Therefore, in the vehicle-following autonomous driving experiment scenario, this paper defines the reward function of the agent at time t as R = R_v + R_l + R_c.
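To make Eqs. (16)–(19) concrete, a minimal Python sketch of the two safe-distance terms and the composite reward R = R_v + R_l + R_c follows. All numeric arguments in the example are assumptions chosen for illustration, not values from the paper.

```python
import math

def d_f_safe(v_eps, v_f, t_react, a_max):
    # Front-gap bound of Eq. 16: react at the (disturbed) ego speed v_eps,
    # then both vehicles brake at a_max.
    return v_eps * t_react + v_eps ** 2 / (2 * a_max) - v_f ** 2 / (2 * a_max)

def d_b_safe(v_eps, v_b, v_max, t_react, a_max):
    # Rear-gap bound of Eq. 16 against a rear vehicle closing at up to v_max.
    return (v_max ** 2 - v_b ** 2) / (2 * a_max) - (
        v_eps * t_react + (v_max ** 2 - v_eps ** 2) / (2 * a_max))

def reward(v, d_mid, collided, v_max=15.0):
    r_v = math.exp(-(v_max - v) ** 2) if v <= v_max else -1.0  # Eq. 17
    r_l = math.exp(-d_mid)                                     # Eq. 18
    r_c = -1000.0 if collided else 0.0                         # Eq. 19
    return r_v + r_l + r_c

# At the speed limit, centered in the lane, no collision: R = 1 + 1 + 0 = 2.
total = reward(v=15.0, d_mid=0.0, collided=False)
```

Because a collision dominates the other terms by three orders of magnitude, the shaped reward only fine-tunes behavior among trajectories that already avoid the −1000 penalty.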

In addition, in the previous section we defined pSTL constraints for maintaining a safe distance from the front and rear vehicles in the following scenario and set the time range T to the previous 2 seconds. If the autonomous driving vehicle has maintained a safe distance from the front and rear vehicles over the past 2 seconds, the quantitative evaluation semantics of pSTL can measure the degree to which the current vehicle state satisfies the constraints: the larger the obtained robustness value, the smaller the probability of a collision. Therefore, this paper uses the robustness value ρ(φ, s_t, t) = OnlineMonitoring(φ, s_t, t) as an evaluation of the collision probability of the agent and uses it to obtain the immediate reward.

4.3 Analysis
Figure 6 shows the signal monitoring results of the autonomous driving following scenario during one training process. The simulation time is set to 20s, and the speed signal of the vehicle is sampled every 1s. The maximum speed of all vehicles is set to v_max = 15 m/s, and the maximum acceleration is a_max = 8 m/s². At time t_0, the distance between the vehicles is 15m, and they gradually accelerate. After about 6s, the speed of the vehicles stabilizes between 8 m/s and 10 m/s and fluctuates up and down. Based on the speed of the vehicles and the definition of the minimum safe distance, the minimum safe distances d_f^safe and d_b^safe that need to be maintained at each time can be obtained. When the minimum safe distance is less than 0, it means that a collision accident will not occur at the current speed.

Figure 6: Signals in the scenario. (a) Minimum safe distance; (b) the distance between the front and rear cars

The online monitoring results of the formula Φ = A_{[0,T]}((d_f > d_f^safe) ∧ p(d_b > d_b^safe)) when applying disturbance signals with disturbance thresholds of 1, 5, and 10 to the distance sensors of the vehicles are shown in Figure 7. When no disturbance signal is applied, the autonomous driving vehicle can always maintain a safe distance from the front and rear vehicles. When ε = 1, that is, when a small disturbance is applied, the Boolean verification result is not affected, but the robustness value becomes smaller. At 12s, the signal disturbance with a disturbance threshold of 10 causes the system trajectory to fail to satisfy the safety constraints, while the disturbance signal with a disturbance threshold of 15 causes the constraints to be violated already at 6s. Therefore, we can draw the conclusion that the effect of disturbance on Boolean verification is mainly reflected in the time offset: the larger the disturbance threshold, the earlier the time at which the Boolean semantics is no longer satisfied. For the robustness semantics, the larger the disturbance threshold, the worse the impact on the robustness value, that is, the smaller the degree of satisfaction of the formula.

Figure 7: The result of online monitoring. (a) Boolean satisfaction; (b) robustness

This paper trained the traditional DDPG algorithm, the RPER-SDDPG algorithm based on the improved robustness-value prioritized experience replay mechanism, the RRS-SDDPG algorithm based on the improved robustness-value reward reshaping algorithm, and the POM-SDDPG algorithm based on pSTL online monitoring guidance for 5000 rounds, with a maximum time step of 10000 during each round of training, and compared the simulation results. For convenience of observation, samples from adjacent 20 rounds were randomly sampled and smoothed by Gaussian smoothing. Figure 8 shows the comparison of rewards per round for each algorithm. It can be seen that as the training rounds increase, the cumulative reward obtained by the agent increases and finally tends to a stable value.

From Figure 8(a), it can be seen that RPER-SDDPG reaches a stable state at close to 1500 rounds, significantly faster than the traditional DDPG algorithm, which gradually stabilizes at close to 3000 rounds, indicating that RPER-SDDPG significantly improves the training speed. Figure 8(b) shows the difference between DDPG and RRS-SDDPG. It can be seen that the cumulative reward of RRS-SDDPG is around 1500, because the agent complies with the safety constraints in the later stages of training and the robustness value is positive, giving the agent a positive reward. In addition, RRS-SDDPG reaches a stable state around 2000 rounds, which not only makes the agent perform better and comply with the safety constraints, but also accelerates the learning of the agent to some extent. Figure 8(c) shows the combination of the above two methods; it can be seen that the proposed POM-SDDPG makes the agent learn better strategies in a shorter time. From the comparison of the summarized curves in Figure 8(d), it can be seen that the traditional DDPG algorithm has large fluctuations and slow convergence, while the proposed method has smaller fluctuations, which reflects the improvement of the safety of the agent during the training process.
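The disturbance experiment above can be reproduced in miniature with a sliding-window monitor for the "always in the past [0, T]" pattern of formula (16): the quantitative robustness is the worst margin over the last T samples, the Boolean verdict is its sign, and an additive sensor disturbance ε simply shifts every margin downward. This is an illustrative sketch under those assumptions, not the paper's monitor implementation.

```python
from collections import deque

class PastAlwaysMonitor:
    """rho_t = min over the last `window` samples of the per-sample margin
    min(d_f - d_f_safe, d_b - d_b_safe) - eps; Boolean verdict = rho_t > 0."""

    def __init__(self, window):
        self.margins = deque(maxlen=window)   # window = T in samples

    def step(self, d_f, d_f_safe, d_b, d_b_safe, eps=0.0):
        self.margins.append(min(d_f - d_f_safe, d_b - d_b_safe) - eps)
        rho = min(self.margins)               # quantitative robustness
        return rho, rho > 0                   # (robustness, Boolean verdict)

m = PastAlwaysMonitor(window=2)
r1 = m.step(d_f=20.0, d_f_safe=15.0, d_b=12.0, d_b_safe=10.0)           # (2.0, True)
r2 = m.step(d_f=20.0, d_f_safe=15.0, d_b=12.0, d_b_safe=10.0, eps=5.0)  # (-3.0, False)
```

A larger ε both lowers ρ and, once a margin turns negative, flips the Boolean verdict at an earlier sample, matching the trend described for Figure 7.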

Figure 8: Cumulative Return. (a) Comparison of DDPG and RPER-SDDPG; (b) comparison of DDPG and RRS-SDDPG; (c) comparison of DDPG and POM-SDDPG; (d) summary

5 RELATED WORK
It is challenging to generate safe controllers with DRL algorithms in safety-critical CPS systems [Tran et al. 2019; Wu et al. 2018]. Garcia et al. provide a definition of safe reinforcement learning [García and Fernández 2015], in which it is important to ensure reasonable system performance during the learning and deployment processes. SRL is divided into two categories: modifying the optimization criteria and improving the exploration process of agents.

The first category of methods involves incorporating safety constraints, or introducing relevant factors such as risk, into the objective or reward function [Akametalu et al. 2014; Ellingsen et al. 2009]. Chow et al. [Chow et al. 2018] introduced the classical Lyapunov function from control theory into the process of DRL to ensure safety during learning. Achiam et al. solve a constrained optimization problem to find parameters that satisfy safety constraints, which to some extent ensures that the adopted data meets the safety constraints [Achiam et al. 2017]. Modeling risk factors in safety constraints and transforming them into optimization problems is an effective method to ensure agent safety. However, these methods lack a systematic description and solution and may not perform satisfactorily.

The methods that improve the exploration process of agents include integrating external knowledge, exploring risk metrics, and constraining the safe action space [Kovalenko et al. 2019; Mu et al. 2022]. The article [Lütjens et al. 2019] provides risk-related prior knowledge to the agent, reducing its random exploration and avoiding risks. Abbeel et al. introduce a "teacher" mechanism to demonstrate safe "state-action" pairs to the agent, thereby minimizing the harm to the agent during exploration [Abbeel et al. 2010]. Although the above methods compensate for safety to some extent, they have not been mathematically proven and cannot guarantee the safety of CPS.

Formal methods (FM) are a powerful tool for ensuring system safety [Wing 1990; Woodcock et al. 2009]. In recent years, scholars have introduced formal methods into SRL [Leeb and Lynch 2005; Phan et al. 2020]. Mirchevska et al. proposed combining formal methods with DRL to ensure that only safe operations are executed [Mirchevska et al. 2018]. The article [Bozkurt et al. 2020] proposes a model-free reinforcement learning algorithm for learning an optimal policy that satisfies Linear Temporal Logic [De Giacomo and Vardi 2013] safety requirements: the given temporal properties are transformed into a finite deterministic Büchi automaton, and a robust reward function is defined according to the obtained automaton. Aksaray et al. [Aksaray et al. 2016] studied the problem of learning the optimal policy for an agent with unknown stochastic dynamics that satisfies STL specifications and solved the STL control policy generation problem using an approximation method. Rong et al. [Rong and Luan 2020] described traffic rules and driving experience with LTL and introduced a hierarchical structure to solve safety issues in planning problems. By using formal safety constraints as the learning goal of the agent, the safety of the agent's decision-making process can be partially guaranteed during learning. However, these methods still have many shortcomings, such as the difficulty of approximate solving.

Runtime Verification [Bartocci et al. 2018] is a technique for monitoring the execution of software, which detects violations of properties at runtime and responds to the program's incorrect behaviors. Mallozzi et al. proposed the WISEML framework, referred to as a safety envelope, which prevents reinforcement learning agents from executing incorrect actions through online monitoring [Mallozzi et al. 2019]. Fulton et al. proposed a provably safe RL approach that combines theorem proving and runtime verification to ensure the safety of agents: once a safety constraint is violated, the agent gives up some efficiency to learn a safer control policy [Fulton and Platzer 2018]. While runtime verification helps agents detect errors during execution, it also limits their exploration of unknown spaces and their learning efficiency. Therefore, we innovatively propose a framework of safe reinforcement learning guided by pSTL online monitoring, which improves the algorithm's convergence speed while enhancing the trustworthiness of its decisions.

6 CONCLUSION
We propose a novel decision-making approach with safe reinforcement learning guided by a pSTL online monitor. Firstly, we use pSTL to describe the safety constraints that the DRL agent needs to consider in the learning and training process. Then, an efficient online monitoring algorithm is used to quantitatively evaluate the system trajectory generated by the agent, considering the effect of random disturbance signals on the safety verification results. The robustness semantics can quantify the degree to which the agent's system trajectory satisfies the safety constraints. Based on this robustness value, this paper further improves the experience replay mechanism and the reward reshaping algorithm in the deep reinforcement learning algorithm and proposes the online-monitoring-guided deep reinforcement learning method to improve the learning efficiency and decision credibility of the agent. In addition, we take the classic scenario of vehicle following in autonomous driving as an example to verify the theoretical methods, and through experimental comparative analysis we verify the advantages of the proposed method over the traditional DDPG algorithm. The results show that the proposed deep reinforcement learning method based on online monitoring guidance achieves significant improvements in robustness to adversarial signal interference, convergence rate, and learning effect compared to traditional DDPG.

REFERENCES
Pieter Abbeel, Adam Coates, and Andrew Y Ng. 2010. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research

29, 13 (2010), 1608–1639.
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. 2017. Constrained policy optimization. In International Conference on Machine Learning. 22–31.
Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control. IEEE, 1424–1431.
Derya Aksaray, Austin Jones, Zhaodan Kong, Mac Schwager, and Calin Belta. 2016. Q-learning for robust satisfaction of signal temporal logic specifications. In 2016 IEEE 55th Conference on Decision and Control (CDC). 6565–6570.
Ezio Bartocci, Yliès Falcone, Adrian Francalanza, and Giles Reger. 2018. Introduction to runtime verification. Lectures on Runtime Verification: Introductory and Advanced Topics (2018), 1–33.
Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos, and Miroslav Pajic. 2020. Control synthesis from Linear Temporal Logic specifications using model-free reinforcement learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA). 10349–10355.
Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. 2018. A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems 31 (2018), 1–10.
Giuseppe De Giacomo and Moshe Y Vardi. 2013. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI'13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. Association for Computing Machinery, 854–860.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Conference on Robot Learning. PMLR, 1–16.
Marius Dupuis, Martin Strobl, and Hans Grezlikowski. 2010. OpenDRIVE 2010 and beyond – status and future of the de facto standard for the description of road networks. In Proc. of the Driving Simulation Conference Europe. 231–242.
Christian Ellingsen, Eyal Dassau, Howard Zisser, Benyamin Grosman, Matthew W Percival, Lois Jovanovič, and Francis J Doyle III. 2009. Safety constraints in an artificial pancreatic β cell: an implementation of model predictive control with insulin on board. Journal of Diabetes Science and Technology 3, 3 (2009), 536–544.
Nathan Fulton and André Platzer. 2018. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In Proceedings of the AAAI Conference on Artificial Intelligence. 1–8.
Javier García and Fernando Fernández. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1 (2015), 1437–1480.
Radoslav Ivanov, James Weimer, Rajeev Alur, George J Pappas, and Insup Lee. 2019. Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control. 169–178.
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4 (1996), 237–285.
Louis Kirsch, Sebastian Flennerhag, Hado van Hasselt, Abram Friesen, Junhyuk Oh, and Yutian Chen. 2022. Introducing symmetries to black box meta reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 7202–7210.
Stefan Kojchev, Emil Klintberg, and Jonas Fredriksson. 2020. A safety monitoring concept for fully automated driving. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). 1–7.
Ilya Kovalenko, Daria Ryashentseva, Birgit Vogel-Heuser, Dawn Tilbury, and Kira Barton. 2019. Dynamic resource task negotiation to enable product agent exploration in multi-agent manufacturing systems. IEEE Robotics and Automation Letters 4, 3 (2019), 2854–2861.
Gunter Leeb and Nancy Lynch. 2005. Proving safety properties of the Steam Boiler Controller: Formal methods for industrial applications: A case study. Formal Methods for Industrial Applications: Specifying and Programming the Steam Boiler Control (2005), 318–338.
Björn Lütjens, Michael Everett, and Jonathan P How. 2019. Safe reinforcement learning with model uncertainty estimates. In 2019 International Conference on Robotics and Automation (ICRA). 8662–8668.
Oded Maler and Dejan Nickovic. 2004. Monitoring temporal properties of continuous signals. In Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems. Springer, Berlin, Heidelberg, 152–166.
Piergiuseppe Mallozzi, Ezequiel Castellano, Patrizio Pelliccione, Gerardo Schneider, and Kenji Tei. 2019. A runtime monitoring framework to enforce invariants on reinforcement learning agents exploring complex environments. In 2019 IEEE/ACM 2nd International Workshop on Robotics Software Engineering (RoSE). 5–12.
Branka Mirchevska, Christian Pek, Moritz Werling, Matthias Althoff, and Joschka Boedecker. 2018. High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). 2156–2162.
Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, and Edward Grefenstette. 2022. Improving intrinsic exploration with language abstractions. arXiv preprint arXiv:2202.08938 (2022).
Elisa Negri, Luca Fumagalli, and Marco Macchi. 2017. A review of the roles of digital twin in CPS-based production systems. Procedia Manufacturing 11 (2017), 939–948.
Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R Varshney, Murray Campbell, Moninder Singh, and Francesca Rossi. 2019. Teaching AI agents ethical values using reinforcement learning and policy orchestration. IBM Journal of Research and Development 63, 4/5 (2019), 2–1.
Dung T Phan, Radu Grosu, Nils Jansen, Nicola Paoletti, Scott A Smolka, and Scott D Stoller. 2020. Neural simplex architecture. In NASA Formal Methods: 12th International Symposium, NFM 2020, Moffett Field, CA, USA, May 11–15, 2020, Proceedings 12. Springer, 97–114.
Jikun Rong and Nan Luan. 2020. Safe reinforcement learning with policy-guided planning for autonomous driving. In 2020 IEEE International Conference on Mechatronics and Automation (ICMA). 320–326.
César Sánchez, Gerardo Schneider, Wolfgang Ahrendt, Ezio Bartocci, Domenico Bianculli, Christian Colombo, Yliès Falcone, Adrian Francalanza, Srđan Krstić, João M Lourenço, et al. 2019. A survey of challenges for runtime verification from advanced application domains (beyond software). Formal Methods in System Design 54 (2019), 279–335.
Mahmoud Selim, Amr Alanwar, Shreyas Kousik, Grace Gao, Marco Pavone, and Karl H Johansson. 2022. Safe reinforcement learning using black-box reachability analysis. IEEE Robotics and Automation Letters 7, 4 (2022), 10665–10672.
Lior Shani, Yonathan Efroni, and Shie Mannor. 2020. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence. 5668–5675.
Marjan Sirjani, Edward A Lee, and Ehsan Khamespanah. 2020a. Model checking software in cyberphysical systems. In 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). 1017–1026.
Marjan Sirjani, Edward A Lee, and Ehsan Khamespanah. 2020b. Verification of cyberphysical systems. Mathematics 8, 7 (2020), 1068.
Rodrigo Toro Icarte, Toryn Q Klassen, Richard Valenzano, and Sheila A McIlraith. 2018. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 452–461.
Hoang-Dung Tran, Feiyang Cai, Manzanas Lopez Diego, Patrick Musau, Taylor T Johnson, and Xenofon Koutsoukos. 2019. Safety verification of cyber-physical systems with reinforcement learning control. ACM Transactions on Embedded Computing Systems (TECS) 18, 5s (2019), 1–22.
Jeannette M Wing. 1990. A specifier's introduction to formal methods. Computer 23, 9 (1990), 8–22.
Jim Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. 2009. Formal methods: Practice and experience. ACM Computing Surveys (CSUR) 41, 4 (2009), 1–36.
Jun Wu, Shibo Luo, Shen Wang, and Hongkai Wang. 2018. NLES: A novel lifetime extension scheme for safety-critical cyber-physical systems using SDN and NFV. IEEE Internet of Things Journal 6, 2 (2018), 2463–2475.
