which can only be verified through runtime verification. STL can express continuous real-valued signals and is widely used in various fields, and pSTL is an extension of STL [Maler and Nickovic 2004]. Therefore, a natural idea for ensuring the safety of reinforcement learning controllers in safety-critical systems is to describe the safety requirements that the agent must follow with pSTL, monitor the system's traces online, and improve the training process of the reinforcement learning agent, both by optimizing its objectives and by modifying its exploration process, by leveraging the runtime verification results. The main contributions include:

• We propose a framework of safe reinforcement learning guided by online monitoring of pSTL specifications and detail the basic process for online monitoring of CPS systems with RL control components.
• We integrate runtime verification technology and reinforcement learning algorithms. We improve Deep Reinforcement Learning (DRL) algorithms with the robust semantics of pSTL by improving the experience replay mechanism and modifying the reward shaping algorithm.
• We validate the proposed method on the car-following scenario in autonomous driving. The results show that our method achieves significant improvements in convergence speed, decision safety, and resistance to random signal interference compared with the traditional DDPG algorithm.

2 PRELIMINARIES

2.1 Reinforcement Learning
The core idea of an RL algorithm is that the agent continuously interacts with the environment and optimizes its policy through the environment's feedback, so that it can learn the strategy that maximizes the cumulative reward in an uncertain environment. Firstly, the agent obtains the currently observed environment state and makes exploratory decisions; the actions taken by the agent then affect the environment state and yield corresponding rewards or punishments. Finally, the agent continuously optimizes its policy based on these evaluation results, adapts to changes in the environment, and ultimately obtains the optimal strategy that maximizes the long-term cumulative reward.

The Markov Decision Process (MDP) is the most essential mathematical model of RL and characterizes the interactive learning process between the agent and the environment very well. An MDP is a model that satisfies the Markov property, which means that the state transition at future time 𝑡𝑛 only depends on the state at the current time 𝑡𝑛−1 and is independent of the states at previous times 𝑡1 ∼ 𝑡𝑛−2.

Definition 2.1 (Markov Decision Process). An MDP can be represented as a quintuple M = (𝑆, 𝐴, 𝑃, 𝛾, 𝑅), where:
• 𝑆: a set of observation states;
• 𝐴: a finite set of actions;
• 𝑃 : 𝑆 × 𝐴 × 𝑆 → [0, 1]: the state transition probability matrix, where 𝑃(𝑠′ | 𝑠, 𝑎) is the probability of transitioning from state 𝑠 to state 𝑠′ by taking action 𝑎;
• 𝛾: discount factor;
• 𝑅 : 𝑆 × 𝐴 × 𝑆 → R: reward function, where 𝑟 = 𝑅(𝑠, 𝑎, 𝑠′) represents the immediate reward/punishment that the environment gives to the agent when the agent takes action 𝑎 in state 𝑠 and transitions to state 𝑠′.

2.2 Past Signal Temporal Logic
STL can be used to describe the linear temporal properties of continuous real-valued signals in CPS systems. pSTL is a fragment of STL in which only past temporal operators are used (no future operators). Before presenting the syntax and semantics of pSTL, it is necessary to clarify the concept of signals.

Definition 2.2 (Signal). Let T ⊆ R+ be a finite or infinite set of time instants. Given a time instant 𝑡 ∈ T, a signal x ∈ X can be regarded as a function from T to a set of real values X, i.e., a signal can be viewed as a time series of values that evolve over time.

Given a time point 𝑡 ∈ T and a time interval [𝑎, 𝑏], the syntax of pSTL is defined as follows:

    𝜙 := 𝜇 | ¬𝜙 | 𝜙 ∧ 𝜑 | 𝜙 ∨ 𝜑 | A[𝑎,𝑏] 𝜙 | P[𝑎,𝑏] 𝜙 | 𝜙 S[𝑎,𝑏] 𝜑    (1)

where 𝜇 is an atomic proposition of the form 𝑓(x(𝑡)) > 0; ¬, ∧, ∨ are the standard Boolean logical operators; and S[𝑎,𝑏] (since), P[𝑎,𝑏] (previously), and A[𝑎,𝑏] (always) are operators that express past tense over the time interval [𝑎, 𝑏].

The time horizon of a formula 𝜙 (denoted hrz(𝜙)) indicates the number of past time steps that need to be taken into consideration. It depends on the structure of the formula and can be computed recursively:

    hrz(𝜇) = 0
    hrz(¬𝜙) = hrz(𝜙)
    hrz(𝜙1 ∨ 𝜙2) = max(hrz(𝜙1), hrz(𝜙2))
    hrz(𝜙1 ∧ 𝜙2) = max(hrz(𝜙1), hrz(𝜙2))
    hrz(P[𝑎,𝑏] 𝜙) = 𝑏 + hrz(𝜙)
    hrz(A[𝑎,𝑏] 𝜙) = 𝑏 + hrz(𝜙)
    hrz(𝜙1 S[𝑎,𝑏] 𝜙2) = 𝑏 + max(hrz(𝜙1), hrz(𝜙2))    (2)

Definition 2.3 (Boolean Semantics of pSTL). Let 𝜔 be a system trace at time 𝑡, and let 𝑠 be a signal in 𝜔. We denote that 𝜔 satisfies the formula 𝜙 at time 𝑡 as (𝜔, 𝑡) |= 𝜙.

    (𝜔, 𝑡) |= 𝜇 ⇔ 𝑓(x(𝑡)) > 0
    (𝜔, 𝑡) |= 𝜙 ∧ 𝜑 ⇔ (𝜔, 𝑡) |= 𝜙 ∧ (𝜔, 𝑡) |= 𝜑
    (𝜔, 𝑡) |= 𝜙 ∨ 𝜑 ⇔ (𝜔, 𝑡) |= 𝜙 ∨ (𝜔, 𝑡) |= 𝜑
    (𝜔, 𝑡) |= ¬𝜙 ⇔ ¬((𝜔, 𝑡) |= 𝜙)
    (𝜔, 𝑡) |= P[𝑎,𝑏] 𝜙 ⇔ ∃𝑡′ ∈ 𝐼(𝑡, [𝑎, 𝑏]). (𝜔, 𝑡′) |= 𝜙
    (𝜔, 𝑡) |= A[𝑎,𝑏] 𝜙 ⇔ ∀𝑡′ ∈ 𝐼(𝑡, [𝑎, 𝑏]). (𝜔, 𝑡′) |= 𝜙
    (𝜔, 𝑡) |= 𝜙 S[𝑎,𝑏] 𝜑 ⇔ ∃𝑡′ ∈ 𝐼(𝑡, [𝑎, 𝑏]). (𝜔, 𝑡′) |= 𝜑 ∧ ∀𝑡′′ ∈ (𝑡′, 𝑡]. (𝜔, 𝑡′′) |= 𝜙

where 𝐼(𝑡, [𝑎, 𝑏]) = [𝑡 − 𝑏, 𝑡 − 𝑎] ∩ [0, 𝑡].

Definition 2.4 (Robust Semantics of pSTL). Define the robustness function 𝜌𝑓(𝜙, 𝜔, 𝑡), representing the quantitative valuation of the
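The horizon computation in Eq. (2) maps directly onto a recursive walk over the formula's syntax tree. The following sketch is only illustrative: the nested-tuple encoding and the function name are our own choices, not part of the paper.

```python
# Illustrative sketch of hrz(phi) from Eq. (2); formulas are encoded as nested
# tuples, e.g. ("A", a, b, sub) for A_[a,b] sub and ("S", a, b, phi1, phi2).

def hrz(phi):
    op = phi[0]
    if op == "mu":                       # atomic proposition f(x(t)) > 0
        return 0
    if op == "not":
        return hrz(phi[1])
    if op in ("and", "or"):
        return max(hrz(phi[1]), hrz(phi[2]))
    if op in ("P", "A"):                 # previously / always over [a, b]
        return phi[2] + hrz(phi[3])
    if op == "S":                        # phi1 since phi2 over [a, b]
        return phi[2] + max(hrz(phi[3]), hrz(phi[4]))
    raise ValueError(f"unknown operator {op}")

# P_[0,3]( A_[0,2]( mu ) ) needs 3 + 2 = 5 past time steps.
phi = ("P", 0, 3, ("A", 0, 2, ("mu", lambda x: x)))
print(hrz(phi))   # 5
```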
3 PROPOSED APPROACH
RL agents may make incorrect decisions when facing uncertain and complex environments, which can lead to serious safety incidents. Therefore, during the learning and exploration processes, it is important to ensure that agents obey formal safety specifications tailored to specific domains. Online monitoring of pSTL is a technique that monitors whether a CPS system satisfies given safety specifications in real time. Thus, we propose an innovative safe reinforcement learning framework guided by pSTL online monitoring (POM-SRL), which provides a new approach for ensuring the safety of RL agents.

3.1 An Overview of POM-SRL
Figure 1 shows the framework of POM-SRL, a general architecture combining runtime verification technology and reinforcement learning algorithms.

Figure 1: The framework of POM-SRL

Firstly, the RL agent receives states from the real world, analyzes them to make decisions, and receives rewards/punishments from the environment. Based on the real-valued reward, the agent further updates its policy to obtain a higher cumulative return. The function used to estimate the expected long-term rewards is called the value function, while the policy function maps states to actions and can be either deterministic or stochastic; it is usually denoted as 𝜋(𝑎 | 𝑠). DRL fits these functions with neural networks (NNs), leveraging the powerful ability of NNs to process high-dimensional data, and evaluates the quality of the fit with a loss function. In addition, independent and identically distributed data is an important prerequisite for the effectiveness of an NN, so the experience replay mechanism is introduced to break the correlation between data. Therefore, the goal of DRL algorithms is to optimize the black-box policy and minimize the loss function so as to maximize the cumulative reward.

pSTL is used to declare the safety specifications that the RL agent needs to obey during the training process, and a safe monitor runs in real time to detect violations. On the one hand, when errors or faults occur in the system, we can use various means to avoid or minimize possible damage to the system, such as danger warnings, human intervention, and enforced execution, to ensure the safety and correctness of the system. On the other hand, the quantitative evaluation results can be used to further improve the experience replay mechanism and the reward reshaping algorithm in DRL, so as to find a policy that maximizes the satisfaction of the pSTL specifications.

Algorithm 1 shows the monitoring process of our method. The inputs of the algorithm consist of two parts: a pSTL safety specification 𝜙 and the RL-CPS system S. Firstly, the pSTL formula is parsed and its syntax tree 𝑆𝑇 is automatically constructed. Then, the safe monitor keeps listening to the system to obtain the signal trace 𝑋𝑛 = 𝑠1, 𝑠2, ... at time 𝑡. Based on the syntax tree of the pSTL formula, the algorithm recursively calculates the quantitative valuation (denoted as robustness) of the system trace. On the basis of the robustness, we can determine whether the current state meets the safety specification, and further feed the verification result back to the CPS system to enforce safety actions or to improve the DRL algorithms.

Algorithm 1 Online Monitoring(𝜙, S)
Input: pSTL formula 𝜙, system S.
Output: Monitor S and quantitatively evaluate the formula 𝜙.
1: procedure OnlineMonitoring(𝜙, S)
2:     Initialize: 𝑡 = SystemTime(), 𝑆𝑇 = SyntaxTree(𝜙)
3:     while true do
4:         (𝑋𝑛, 𝑡) ← WaitData(S)
5:         (𝑟𝑜𝑏𝑢𝑠𝑡, 𝑡) = ComputeRobust(𝑋𝑛, 𝑡, 𝑆𝑇)
6:         Output(𝑟𝑜𝑏𝑢𝑠𝑡, 𝑡)
7:     end while
8: end procedure
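To make Algorithm 1 concrete, the sketch below shows one way ComputeRobust can be realized as a recursive walk over the formula (encoded as nested tuples, as in the earlier sketch) together with a monitoring loop that consumes one sample per step. The names and the assumption of the standard min/max quantitative semantics for the past operators are ours; the paper's implementation may differ.

```python
# Illustrative sketch of the online monitor: robustness of a pSTL formula over a
# finite, uniformly sampled trace (discrete step of 1), assuming the standard
# min/max quantitative semantics of the operators in Definition 2.3.

def robustness(phi, trace, t):
    op = phi[0]
    if op == "mu":
        return phi[1](trace[t])                 # rho = f(x(t))
    if op == "not":
        return -robustness(phi[1], trace, t)
    if op == "and":
        return min(robustness(phi[1], trace, t), robustness(phi[2], trace, t))
    if op == "or":
        return max(robustness(phi[1], trace, t), robustness(phi[2], trace, t))
    a, b = phi[1], phi[2]
    win = range(max(0, t - b), max(0, t - a) + 1)        # I(t, [a, b])
    if op == "P":                                        # previously
        return max(robustness(phi[3], trace, k) for k in win)
    if op == "A":                                        # always in the past window
        return min(robustness(phi[3], trace, k) for k in win)
    if op == "S":                                        # phi1 since phi2
        return max(
            min(robustness(phi[4], trace, k),
                min((robustness(phi[3], trace, j) for j in range(k + 1, t + 1)),
                    default=float("inf")))
            for k in win)
    raise ValueError(f"unknown operator {op}")

# Monitoring loop: "the gap to the front vehicle stayed above 10 m for the last 3 steps".
phi = ("A", 0, 3, ("mu", lambda d: d - 10.0))
trace = []
for sample in [15.0, 14.0, 12.5, 11.0, 9.5]:             # stands in for WaitData(S)
    trace.append(sample)
    t = len(trace) - 1
    print(t, robustness(phi, trace, t))                  # negative value => violation
```

A negative robustness value signals a violation of the specification, and this value is what the framework feeds back to the DRL module, as described above.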
where 𝜂 is a corresponding parameter used to adjust the weight of the robustness value in the reward function.

After introducing the robustness-based reward shaping algorithm, the h-MDP (𝑆ℎ, 𝐴, 𝑃, 𝛾, 𝑅ℎ) is updated to (𝑆ℎ, 𝐴, 𝑃, 𝛾, 𝑅∗), where 𝑅∗ = 𝑅 + 𝜂𝜌(𝜙, 𝑠𝑡, 𝑡). Through this method, the optimization goal of the agent is transformed into finding a control policy that maximizes the expected robustness value of the safety constraints, i.e., the MERD defined in Definition 3.3, while completing the basic task. At this point, the reward obtained by the agent consists of two parts: one is the reward obtained by the RL agent for completing its basic goals in a specific scenario, and the other is the robustness of the historical trajectory state signal 𝑆ℎ with respect to the pSTL safety property. According to the robust semantics of pSTL, the more the agent's behavior satisfies the safety constraints, the larger the robustness value obtained by online monitoring, and the larger the corresponding reward the agent receives. On the contrary, if the agent is in a dangerous situation, the robustness value will be a large negative value, and the agent will receive a correspondingly large punishment. Therefore, the cumulative objective of the policy network can be defined as:

    J(𝜃) = E_{𝑎𝑡∼𝜋𝜃(𝑎𝑡|𝑠𝑡)} [ Σ𝑡=1..𝑇 𝑅∗ ] = Σ𝑡=1..𝑇 Σ𝑎𝑡∈𝐴𝑡 𝜋𝜃(𝑎𝑡 | 𝑠𝑡) · 𝑅∗(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1)    (6)

The parameters of the policy network are updated using gradient descent as follows:

    ∇𝜃 J(𝜃) = Σ𝑡=1..𝑇 ∇𝜃 log 𝜋𝜃(𝑎𝑡 | 𝑠𝑡) · 𝑅∗(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1)    (7)

where 𝜋𝜃(𝑎𝑡 | 𝑠𝑡) is the policy network, responsible for selecting actions based on the current state 𝑠𝑡, and 𝑇 is the number of iterations.

3.3.2 Priority Experience Replay. Traditional DRL algorithms store the historical experience trajectories in an experience replay buffer and then randomly take a small subset of samples to train the neural network. However, different experience trajectories have different importance for model training. If the sampling probability of key experience samples is low, the convergence and performance of the algorithm may suffer. To address this problem, existing methods such as Prioritized Experience Replay (PER) sort the importance of samples based on their TD error, so that the selected samples have higher priority. However, this method cannot guarantee that the sampled experience satisfies safety properties or has a higher long-term cumulative return. With the robust semantics of pSTL, we can quantify how strongly a given historical experience generated by the RL agent satisfies a pSTL formula as a real number. In other words, the robustness reflects the safety of the experience to a certain extent. Therefore, a Robustness-based Priority Experience Replay (RPER) mechanism, based on the robustness obtained by pSTL online monitoring, is proposed in this section, so that the agent can learn faster and improve training efficiency by learning policies that satisfy the safety constraints and yield higher cumulative rewards.

The experience replay mechanism extracts samples for training the RL value network or policy network by randomly sampling a certain number of tuples, which ensures that each experience is uniformly sampled, thus mitigating the problem of non-stationary state distributions and preventing the agent from falling into a local optimum. However, random sampling does not take the importance of experiences into account. When the agent encounters a sample with good performance during exploration and stores it in the experience replay buffer, it may not be replayed for a long time, increasing the training time of the algorithm.

In this article, the priority of experience sample 𝑖 is set as 𝑝𝑖 = 1 / rank(𝑖). Here, rank(𝑖) is a comprehensive measure of the importance of the experience based on three indicators: TD error, reward, and robustness. This improves computational efficiency and ensures that the decision-making process of the agent complies with the safety constraints defined by pSTL while obtaining the maximum reward.

The TD error is the difference between the predicted value and the actual return received by the agent in a particular state. It is computed as the difference between the online network's state estimate and the actual state value, where the actual state value is the sum of the immediate reward and the maximum action value of the next state, representing the expected future return of the agent. The larger the TD error of a sample, the more the agent needs to learn from it to optimize its policy, and therefore the more important the experience is considered to be.

    |𝛿𝑖| = | 𝑄(𝑠𝑡, 𝑎𝑡; 𝜃) − (𝑟𝑡 + 𝛾 · max𝑎 𝑄target(𝑠𝑡+1, 𝑎; 𝜃′)) |    (8)

In RL algorithms, the environment gives a reward or punishment for the quality of the current policy to help the agent learn how to achieve the preset objectives. If an experience tuple receives a high reward, this indicates its significance for learning the current policy, and it should be trained on more frequently. The robustness-based reward reshaping algorithm introduced in the previous section reflects the formal safety constraints in the reward function, so the immediate reward also partly reflects the safety of the agent's state. When using the reward as a priority indicator for experience tuples, the priority allocation may be biased because some experience tuples have very large or very small rewards. Therefore, the reward is normalized and mapped to the range of a standard normal distribution. Let the fixed size of the experience replay pool be 𝑁, with 𝑛 experiences stored, where 𝑛 ≤ 𝑁. The total return of the collected experience is 𝑅 = Σ𝑖=1..𝑛 𝑟𝑖, and 𝑅̄ is the average of the 𝑛 reward values. This paper sorts the experience tuples based on the normalized reward R𝑖:

    R𝑖 = √((𝑟𝑖 − 𝑅̄)²) / 𝜎
    𝜎 = √( Σ𝑖=1..𝑛 (𝑟𝑖 − 𝑅̄)² / 𝑛 )
    𝑅̄ = ( Σ𝑖=1..𝑛 𝑟𝑖 ) / 𝑛    (9)

Formal methods are often seen as a way to increase the trustworthiness of safety-critical systems. Techniques such as runtime verification can be used to monitor the execution of CPS systems. The robustness degree of pSTL quantifies how strongly a given
trajectory satisfies an STL formula as a real number rather than just providing a yes or no answer. With the online monitoring mechanism of pSTL, it is possible to determine whether the signal trajectory generated by the RL agent satisfies the pSTL specification. The quantitative semantics of pSTL can help us quantify the robustness of the signal trajectory of experience 𝑒𝑖, denoted as 𝜌𝑖(𝜙, 𝑠𝑖, 𝑖). The larger the robustness value 𝜌𝑖(𝜙, 𝑠𝑖, 𝑖) obtained from online monitoring of the RL agent at the current time, the more strongly the experience satisfies the pSTL safety specification, and thus the higher the priority it should be given.

Therefore, in this paper, the importance of an experience is measured by three indicators: TD error, reward, and robustness value. The ranking of experience 𝑖 can be represented as:

    rank(𝑖) = 𝜅1 |𝛿𝑖| + 𝜅2 R𝑖 + 𝜅3 𝜌𝑖(𝜙, 𝑠𝑖, 𝑖)    (10)

where 𝜅1, 𝜅2, 𝜅3 are hyperparameters representing the weights assigned to the TD error, reward, and robustness when calculating the priority of an experience.

3.3.3 POM-SDDPG. In this section, we take the Deep Deterministic Policy Gradient (DDPG) algorithm as an example to illustrate how to use the POM-SRL framework proposed in this paper to guide the improvement of DRL algorithms and enhance the safety of the RL controller in the CPS. Figure 2 shows the process of the safe DDPG algorithm based on the pSTL online monitoring mechanism (denoted as POM-SDDPG below). The system engineer needs to design the pSTL safety specification 𝜙 and use the safe online monitor to observe the current environment state 𝑠𝑡 in real time, outputting a quantitative evaluation value 𝜌(𝜙, 𝑠𝑡, 𝑡). According to the definition of the h-MDP, the agent makes decisions based on the observed environment state 𝑠𝑡ℎ and outputs an action 𝑎𝑡, where ℎ is the formula range of the pSTL specification 𝜙.

Figure 2: POM-SDDPG

Algorithm 2 presents the basic idea of the POM-SDDPG algorithm. Firstly, inspired by the Deep Q-learning (DQN) algorithm, and in order to address the problem of unstable Q-values during training, Actor and Critic networks together with target copies that share the same network structure and parameters are initialized. The Actor is the policy network 𝜇(𝑠𝑡 | 𝜃𝜇), with the corresponding target policy network 𝜇′(𝑠𝑡 | 𝜃𝜇′), responsible for selecting the current action 𝑎𝑡 based on the current historical environment state 𝑠𝑡ℎ. The safe online monitor calculates the degree of satisfaction of the historical state set 𝑠𝑡ℎ with respect to the safety constraint 𝜙 in real time after the agent takes an action, and returns the robustness value 𝜌(𝑠𝑡ℎ, 𝜙, 𝑡). The environment uses the improved reward shaping algorithm to provide a reward score for the agent and updates the next state 𝑠𝑡+1, thus yielding an experience tuple (𝑠𝑡ℎ, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1, 𝜌𝑡). After obtaining the experience tuple, the priority of the sample is calculated according to Equation 10 and the tuple is placed in the experience replay pool.

Therefore, the optimization objective of the POM-SDDPG agent is to find an optimal policy 𝜋 that maximizes the state-value function 𝑄𝜋∗(𝑠, 𝑔, 𝑎), where 𝑅∗ incorporates the online monitoring robustness value. Thus, the policy has two optimization objectives: on the one hand, the agent aims to efficiently accomplish its intended task, for instance, a self-driving car completing a lane-changing overtaking maneuver; on the other hand, the agent aims to satisfy the safety constraints described by the designed pSTL formula as much as possible during the motion process, that is, to maximize the expected robustness value defined in Definition 3.3.

    𝑄𝜋∗(𝑠, 𝑔, 𝑎) = E𝜋 [ Σ𝑡=0..∞ 𝛾^𝑡 𝑅∗(𝑠𝑡, 𝑎𝑡) | 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎 ]    (11)

The policy network is updated through gradient descent, where 𝑄∗(𝑠, 𝑎 | 𝜃𝑄) is computed by the Critic network:

    ∇𝜃𝜇 J(𝜃𝜇) = (1/𝑁) Σ𝑖 ∇𝑎 𝑄∗(𝑠, 𝑎 | 𝜃𝑄)|𝑠=𝑠𝑖, 𝑎=𝜋∗(𝑠𝑖) · ∇𝜃𝜇 𝜋∗|𝑠𝑖    (12)

The Critic is the value network 𝑄(𝑠, 𝑎 | 𝜃𝑄), and its corresponding target value network is 𝑄′(𝑠, 𝑎 | 𝜃𝑄′). It evaluates the Q-value of the deterministic policy 𝑎𝑡 = 𝜇(𝑠𝑡ℎ | 𝜃𝜇) generated by the Actor and provides gradient information to the Actor. The target value network, based on the RPER mechanism and the target Actor network's output 𝑠𝑡+1 and 𝜇′(𝑠𝑡+1 | 𝜃𝜇′), calculates the 𝑄′(𝑠𝑡+1, 𝑎𝑡+1 | 𝜃𝑄′) part of the target Q-value. The main value network then computes the current target 𝑄 value as 𝑦𝑖 = 𝑟∗ + 𝛾𝑄′(𝑠′, 𝑎′ | 𝜃𝑄′) based on this 𝑄′ part. The parameters of the value network 𝜃𝑄 are updated by minimizing the following loss function using the Adam optimizer:

    𝐿 = (1/𝑁) Σ𝑖 ( 𝑄∗(𝑠, 𝑎 | 𝜃𝑄) − 𝑦𝑖 )²    (13)

    ∇𝜃𝑄 𝐿 = E[ ( 𝑟 + 𝛾 𝑄′(𝑠′, 𝜇′(𝑠′ | 𝜃𝜇′) | 𝜃𝑄′) − 𝑄(𝑠, 𝑎 | 𝜃𝑄) ) · ∇𝜃𝑄 𝑄(𝑠, 𝑎 | 𝜃𝑄) ]    (14)

Similar to the DQN algorithm, DDPG periodically copies the parameters of its Actor and Critic networks to their corresponding target networks, as Eq. 15 shows:

    𝜃𝜇′ = 𝜏𝜃𝜇 + (1 − 𝜏)𝜃𝜇′
    𝜃𝑄′ = 𝜏𝜃𝑄 + (1 − 𝜏)𝜃𝑄′    (15)
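A minimal PyTorch sketch of one POM-SDDPG update step is given below. The network sizes, hyperparameter values, and helper names are illustrative assumptions rather than the paper's implementation; only the use of the shaped reward 𝑅∗ = 𝑅 + 𝜂𝜌, the target value of Eq. (13), the deterministic policy gradient of Eq. (12), the soft update of Eq. (15), and the priority of Eq. (10) follow the text.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: rho is the robustness value returned by the pSTL
# online monitor and stored with each experience tuple.

class Actor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def rank(td_error, norm_reward, rho, k=(1.0, 1.0, 1.0)):
    """rank(i) of Eq. (10); a sample is stored with priority p_i = 1 / rank(i)."""
    return k[0] * abs(td_error) + k[1] * norm_reward + k[2] * rho

def update_step(actor, critic, actor_t, critic_t, opt_a, opt_c, batch,
                eta=0.5, gamma=0.99, tau=0.005):
    s, a, r, s_next, rho = batch                        # tensors sampled by RPER
    r_star = r + eta * rho                              # shaped reward R* = R + eta * rho
    with torch.no_grad():                               # target Q value of Eq. (13)
        y = r_star + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    opt_c.zero_grad()
    critic_loss.backward()
    opt_c.step()

    actor_loss = -critic(s, actor(s)).mean()            # deterministic policy gradient, Eq. (12)
    opt_a.zero_grad()
    actor_loss.backward()
    opt_a.step()

    for tgt, src in ((actor_t, actor), (critic_t, critic)):   # soft update, Eq. (15)
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```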
4 EVALUATION
This section discusses the implementation of a deep reinforcement learning method based on pSTL online monitoring and guidance, using the vehicle following scenario as an example. Firstly, an autonomous driving behavior decision-making method under uncertain environments is implemented based on the DDPG algorithm. Then, the safety constraints that must be maintained in the vehicle following scenario are expressed using pSTL, the online monitoring methods proposed in this paper are used to verify the RL agent online, and the impact of random signal disturbances in the real world on safety verification is analyzed. Finally, based on the obtained online verification results, the DDPG algorithm is further improved to provide more reliable decision-making for this scenario and to improve the learning efficiency of the agent.

4.1 Platform and Parameters
The software and hardware environment relied upon for algorithm training and validation in this paper is shown in Table 1. The operating system is Linux (Ubuntu 20.04); the Python version is 3.7; the CARLA simulator version is 0.9.11; the CPU is an Intel Core i7; the deep learning framework is PyTorch 1.7; and the memory size is 16 GB.

Figure 3: Vehicle Following in CARLA

CARLA-Gym is a reinforcement learning development environment based on the CARLA simulator under the Python environment. It provides interfaces for observing, calculating reward functions, and setting termination conditions for the interaction between the agent and the environment. In this paper, the interaction between the deep reinforcement learning module and the CARLA simulator is implemented based on CARLA-Gym. In addition, during the training process of the reinforcement learning algorithm, the safety online monitor first parses the pSTL formula and generates the formula syntax tree. Then, it receives the agent's state information from the CARLA simulator in real time, verifies it, and provides the verification result to the deep reinforcement learning module, thus achieving better learning efficiency and a faster convergence rate. The framework for deploying the reinforcement learning algorithm on the CARLA simulator in this paper is shown in Figure 4.

Figure 4: The CARLA platform deployed RL algorithm
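The wiring just described can be summarized by the following training-loop sketch. All identifiers (env, agent, monitor, and their methods) are placeholders of ours, not the actual CARLA-Gym or paper API; the sketch only illustrates how the monitor's robustness value enters both the shaped reward and the priority of the stored sample.

```python
import collections
import random

# Sketch of the training-loop wiring between a gym-style environment, the DRL
# agent, and the pSTL online monitor (the real CARLA-Gym interface may differ).

def sample(buffer, k=32):
    """Priority-proportional sampling from the replay buffer."""
    total = sum(p for p, _ in buffer)
    weights = [p / total for p, _ in buffer]
    return random.choices([e for _, e in buffer], weights=weights,
                          k=min(k, len(buffer)))

def train(env, agent, monitor, episodes=100, horizon=20, eta=0.5):
    buffer = []                                         # list of (priority, experience)
    for _ in range(episodes):
        state = env.reset()
        history = collections.deque(maxlen=horizon)     # s_t^h of the h-MDP
        done = False
        while not done:
            history.append(state)
            action = agent.act(list(history))
            next_state, reward, done, _ = env.step(action)
            rho = monitor.robustness(list(history))     # pSTL online monitor
            shaped = reward + eta * rho                 # R* = R + eta * rho
            exp = (list(history), action, shaped, next_state, rho)
            buffer.append((max(agent.priority(exp), 1e-3), exp))   # RPER, Eq. (10)
            agent.learn(sample(buffer))
            state = next_state
```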
4.2 Vehicle Following
This section verifies the feasibility of the proposed method on the vehicle following scenario. The selection of this scenario is mainly based on the following two reasons: firstly, the vehicle following scenario is the most common scenario in autonomous driving research, with a wide range of applications on highways, urban roads, and rural roads. Secondly, the vehicle following scenario is the easiest scenario to study in autonomous driving, and it is easy to describe and demonstrate.

The schematic diagram of the vehicle following scenario is shown in Figure 5. This scenario mainly consists of three dynamic entities, where the red vehicle represents the autonomous driving vehicle Ego, which is the main subject of behavior decision-making. 𝑉𝑒ℎ𝑓 and 𝑉𝑒ℎ𝑏 represent the front and rear vehicles in the same lane as Ego. 𝑡 represents the current time, 𝑡0 represents the initial time, and 𝑑𝑓 and 𝑑𝑏 represent the distances between the Ego vehicle and the front and rear vehicles, which can be obtained through distance sensors. 𝑣, 𝑎, and 𝛼 respectively represent the speed, acceleration, and steering angle of the Ego vehicle.

Figure 5: Vehicle Following

4.2.1 pSTL Safety Specification. In this scenario, the system designer hopes that the Ego vehicle can drive efficiently and safely and maintain a safe distance from other vehicles. Specifically, the minimum safe distance to maintain from the front vehicle 𝑑𝑓 is such that even if the front vehicle suddenly brakes, the Ego vehicle can stop safely behind it, that is, 𝑑𝑓 > 𝑑𝑓𝑠𝑎𝑓𝑒. Generally speaking, the safe distance between Ego and the rear vehicle is mainly controlled by 𝑉𝑒ℎ𝑏. However, if the Ego vehicle finds that the distance to the rear vehicle is too close, it can also take measures to increase the distance, thus maintaining the property 𝑑𝑏 > 𝑑𝑏𝑠𝑎𝑓𝑒 while keeping a safe distance from the front vehicle and accelerating. The source of signal disturbance in the system comes from sensors or actuators, so a disturbance with threshold 𝜀 is introduced to the speed of the Ego vehicle. The degree of satisfaction of the safety constraints in the system over the past [0, T] is observed,

where 𝑑𝑓𝑠𝑎𝑓𝑒 = 𝑣𝜀 · 𝑡𝑟𝑒𝑎𝑐𝑡 + 𝑣𝜀² / (2𝑎𝑚𝑎𝑥) − 𝑣𝑓² / (2𝑎𝑚𝑎𝑥) and 𝑑𝑏𝑠𝑎𝑓𝑒 = (𝑣𝑚𝑎𝑥² − 𝑣𝑏²) / (2𝑎𝑚𝑎𝑥) − (𝑣𝜀 · 𝑡𝑟𝑒𝑎𝑐𝑡 + (𝑣𝑚𝑎𝑥² − 𝑣𝜀²) / (2𝑎𝑚𝑎𝑥)), which means that in the past T seconds, the system always satisfies that the distance between the Ego vehicle and the front and rear vehicles is greater than the minimum safe distance. The derivation of the above minimum safe distance formula can be seen in detail in the article [?].

4.2.2 Reward Design. Rewarding the behavior of the agent can be done in two ways. One is to reward or punish the agent only when the episode is over. Although this method can more accurately evaluate whether the agent has achieved its goal, the learning and training process is extremely slow, because the agent is "blind" during the learning process and does not know whether the actions it takes are correct. The other way is to immediately give the agent instant rewards or punishments, which can significantly speed up the training process. Therefore, based on the proposed idea of using the pSTL online monitoring robustness value in the reward reshaping algorithm, this paper gives timely feedback to the agent, and the design of the reward function is constrained by the given formalized pSTL safety constraints. Next, the design of the reward function in the vehicle following scenario is introduced in detail.

Firstly, we hope that the autonomous driving vehicle does not exceed the maximum speed limit 𝑣𝑚𝑎𝑥 during the driving process, but at the same time, we hope that it can drive efficiently. Therefore, the reward function for the vehicle speed is as follows:

    𝑅𝑣 = exp(−(𝑣𝑚𝑎𝑥 − 𝑣)²)  if 𝑣 ≤ 𝑣𝑚𝑎𝑥
    𝑅𝑣 = −1                 if 𝑣 > 𝑣𝑚𝑎𝑥    (17)

Secondly, in the vehicle following scenario, we should encourage the autonomous driving vehicle to drive as close to the center of the road as possible. The reward function for the position of the vehicle in the lane is shown in Equation 18, where 𝑑𝑚𝑖𝑑 represents the distance between the center axis of the vehicle and the center line of the lane.

    𝑅𝑙 = exp(−𝑑𝑚𝑖𝑑)    (18)

If a serious safety accident such as a collision occurs with the autonomous driving vehicle, a large punishment should be given:

    𝑅𝑐 = −1000  if a collision occurs
    𝑅𝑐 = 0      otherwise    (19)

Therefore, in the vehicle following autonomous driving scenario, this paper gives the reward function of the agent at time t as 𝑅 = 𝑅𝑣 + 𝑅𝑙 + 𝑅𝑐. In addition, in the previous section, we defined pSTL constraints to maintain a safe distance from the front and rear vehicles in the following scenario and set the time range T to the previous 2 seconds. If the autonomous driving vehicle can main-
REFERENCES
29, 13 (2010), 1608–1639.
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. 2017. Constrained policy optimization. In International Conference on Machine Learning. 22–31.
Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control. IEEE, 1424–1431.
Derya Aksaray, Austin Jones, Zhaodan Kong, Mac Schwager, and Calin Belta. 2016. Q-learning for robust satisfaction of signal temporal logic specifications. In 2016 IEEE 55th Conference on Decision and Control (CDC). 6565–6570.
Ezio Bartocci, Yliès Falcone, Adrian Francalanza, and Giles Reger. 2018. Introduction to runtime verification. Lectures on Runtime Verification: Introductory and Advanced Topics (2018), 1–33.
Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos, and Miroslav Pajic. 2020. Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA). 10349–10355.
Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. 2018. A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems 31 (2018), 1–10.
Giuseppe De Giacomo and Moshe Y Vardi. 2013. Linear temporal logic and linear dynamic logic on finite traces. In IJCAI'13 Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. Association for Computing Machinery, 854–860.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Conference on Robot Learning. PMLR, 1–16.
Marius Dupuis, Martin Strobl, and Hans Grezlikowski. 2010. OpenDRIVE 2010 and beyond–status and future of the de facto standard for the description of road networks. In Proc. of the Driving Simulation Conference Europe. 231–242.
Christian Ellingsen, Eyal Dassau, Howard Zisser, Benyamin Grosman, Matthew W Percival, Lois Jovanovič, and Francis J Doyle III. 2009. Safety constraints in an artificial pancreatic 𝛽 cell: an implementation of model predictive control with insulin on board. Journal of Diabetes Science and Technology 3, 3 (2009), 536–544.
Nathan Fulton and André Platzer. 2018. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In Proceedings of the AAAI Conference on Artificial Intelligence. 1–8.
Javier García and Fernando Fernández. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1 (2015), 1437–1480.
Radoslav Ivanov, James Weimer, Rajeev Alur, George J Pappas, and Insup Lee. 2019. Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control. 169–178.
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4 (1996), 237–285.
Louis Kirsch, Sebastian Flennerhag, Hado van Hasselt, Abram Friesen, Junhyuk Oh, and Yutian Chen. 2022. Introducing symmetries to black box meta reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 7202–7210.
Stefan Kojchev, Emil Klintberg, and Jonas Fredriksson. 2020. A safety monitoring concept for fully automated driving. 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC) (2020), 1–7.
Ilya Kovalenko, Daria Ryashentseva, Birgit Vogel-Heuser, Dawn Tilbury, and Kira Barton. 2019. Dynamic resource task negotiation to enable product agent exploration in multi-agent manufacturing systems. IEEE Robotics and Automation Letters 4, 3 (2019), 2854–2861.
Gunter Leeb and Nancy Lynch. 2005. Proving safety properties of the Steam Boiler Controller: Formal methods for industrial applications: A case study. Formal Methods for Industrial Applications: Specifying and Programming the Steam Boiler Control (2005), 318–338.
Björn Lütjens, Michael Everett, and Jonathan P How. 2019. Safe reinforcement learning with model uncertainty estimates. In 2019 International Conference on Robotics and Automation (ICRA). 8662–8668.
Oded Maler and Dejan Nickovic. 2004. Monitoring Temporal Properties of Continuous Signals. In Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems. Springer, Berlin, Heidelberg, 152–166.
Piergiuseppe Mallozzi, Ezequiel Castellano, Patrizio Pelliccione, Gerardo Schneider, and Kenji Tei. 2019. A runtime monitoring framework to enforce invariants on reinforcement learning agents exploring complex environments. In 2019 IEEE/ACM 2nd International Workshop on Robotics Software Engineering (RoSE). 5–12.
Branka Mirchevska, Christian Pek, Moritz Werling, Matthias Althoff, and Joschka Boedecker. 2018. High-level Decision Making for Safe and Reasonable Autonomous Lane Changing using Reinforcement Learning. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). 2156–2162.
Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, and Edward Grefenstette. 2022. Improving intrinsic exploration with language abstractions. arXiv preprint arXiv:2202.08938 (2022).
Elisa Negri, Luca Fumagalli, and Marco Macchi. 2017. A review of the roles of digital twin in CPS-based production systems. Procedia Manufacturing 11 (2017), 939–948.
Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R Varshney, Murray Campbell, Moninder Singh, and Francesca Rossi. 2019. Teaching AI agents ethical values using reinforcement learning and policy orchestration. IBM Journal of Research and Development 63, 4/5 (2019), 2–1.
Dung T Phan, Radu Grosu, Nils Jansen, Nicola Paoletti, Scott A Smolka, and Scott D Stoller. 2020. Neural simplex architecture. In NASA Formal Methods: 12th International Symposium, NFM 2020, Moffett Field, CA, USA, May 11–15, 2020, Proceedings 12. Springer, 97–114.
Jikun Rong and Nan Luan. 2020. Safe reinforcement learning with policy-guided planning for autonomous driving. In 2020 IEEE International Conference on Mechatronics and Automation (ICMA). 320–326.
César Sánchez, Gerardo Schneider, Wolfgang Ahrendt, Ezio Bartocci, Domenico Bianculli, Christian Colombo, Yliès Falcone, Adrian Francalanza, Srđan Krstić, João M Lourenço, et al. 2019. A survey of challenges for runtime verification from advanced application domains (beyond software). Formal Methods in System Design 54 (2019), 279–335.
Mahmoud Selim, Amr Alanwar, Shreyas Kousik, Grace Gao, Marco Pavone, and Karl H Johansson. 2022. Safe reinforcement learning using black-box reachability analysis. IEEE Robotics and Automation Letters 7, 4 (2022), 10665–10672.
Lior Shani, Yonathan Efroni, and Shie Mannor. 2020. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence. 5668–5675.
Marjan Sirjani, Edward A Lee, and Ehsan Khamespanah. 2020a. Model checking software in cyberphysical systems. In 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). 1017–1026.
Marjan Sirjani, Edward A Lee, and Ehsan Khamespanah. 2020b. Verification of cyber-physical systems. Mathematics 8, 7 (2020), 1068.
Rodrigo Toro Icarte, Toryn Q Klassen, Richard Valenzano, and Sheila A McIlraith. 2018. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 452–461.
Hoang-Dung Tran, Feiyang Cai, Manzanas Lopez Diego, Patrick Musau, Taylor T Johnson, and Xenofon Koutsoukos. 2019. Safety verification of cyber-physical systems with reinforcement learning control. ACM Transactions on Embedded Computing Systems (TECS) 18, 5s (2019), 1–22.
Jeannette M Wing. 1990. A specifier's introduction to formal methods. Computer 23, 9 (1990), 8–22.
Jim Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. 2009. Formal methods: Practice and experience. ACM Computing Surveys (CSUR) 41, 4 (2009), 1–36.
Jun Wu, Shibo Luo, Shen Wang, and Hongkai Wang. 2018. NLES: A novel lifetime extension scheme for safety-critical cyber-physical systems using SDN and NFV. IEEE Internet of Things Journal 6, 2 (2018), 2463–2475.