
Transportation Research Part C 134 (2022) 103452


Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with risk awareness
Guofa Li a, Yifan Yang a, Shen Li b, Xingda Qu a,*, Nengchao Lyu c, Shengbo Eben Li d

a Institute of Human Factors and Ergonomics, College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen 518060, China
b Department of Civil Engineering, Tsinghua University, Beijing 100084, China
c Intelligent Transportation Systems Research Center, Wuhan University of Technology, Wuhan 430063, China
d State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China

A R T I C L E I N F O

Keywords: Driving safety; Driving risk; Autonomous vehicle; Driver assistance system; Reinforcement learning

A B S T R A C T

Driving safety is the most important element that needs to be considered for autonomous vehicles (AVs). To ensure driving safety, we proposed a lane change decision-making framework based on deep reinforcement learning to find a risk-aware driving decision strategy with the minimum expected risk for autonomous driving. Firstly, a probabilistic-model based risk assessment method was proposed to assess the driving risk using position uncertainty and distance-based safety metrics. Then, a risk-aware decision making algorithm was proposed to find a strategy with the minimum expected risk using deep reinforcement learning. Finally, our proposed methods were evaluated in CARLA in two scenarios (one with static obstacles and one with dynamically moving vehicles). The results show that our proposed methods can generate robust safe driving strategies and achieve better driving performances than previous methods.

1. Introduction

Vehicles are a fundamental tool of transportation. However, according to a US National Highway Traffic Safety Administration (NHTSA) report (NHTSA, 2019), there were about 50 thousand fatal crashes and more than 3 million injury crashes each year, most of which were caused by driver errors (Shirazi and Morris, 2017). The situation is similar in China, with driver errors accounting for more than 90% of all crashes (Li et al., 2021b). With the rapid development of intelligent transportation systems (ITSs) in recent years, numerous driving safety applications have been applied in advanced driver assistance systems (ADASs) and autonomous vehicles (AVs) to help drivers make decisions for collision avoidance.
The existing methods for driving decision making can be mainly divided into three categories: motion planning based methods (Tu et al., 2019; Tahir et al., 2020; Lee and Kum, 2019; Wang et al., 2019), risk assessment based methods (Noh, 2019; Kim and Kum, 2018; Yu et al., 2018; Shin et al., 2019), and learning based methods, including both supervised learning (Codevilla et al., 2018; Xu et al., 2017) and reinforcement learning (Wang et al., 2018; Shi et al., 2019; Long et al., 2018; Moghadam and Elkaim, 2019; Li et al., 2019).

* Corresponding author at: Institute of Human Factors and Ergonomics, 3688 Nanhai Avenue, Shenzhen University, Shenzhen, Guangdong
Province 518060, China.
E-mail addresses: hanshan198@gmail.com (G. Li), lvan0619@qq.com (Y. Yang), sli299@tsinghua.edu.cn (S. Li), quxd@szu.edu.cn (X. Qu), lnc@whut.edu.cn (N. Lyu), lishbo@tsinghua.edu.cn (S.E. Li).

https://doi.org/10.1016/j.trc.2021.103452
Received 7 April 2021; Received in revised form 1 September 2021; Accepted 24 October 2021
Available online 18 November 2021
0968-090X/© 2021 Elsevier Ltd. All rights reserved.

1.1. Motion planning based methods

The traditional motion planning based methods are often inspired by robot motion planning algorithms, e.g., A* or artificial potential field (APF). Tu et al. (2019) proposed a hybrid A* based motion planning method using a heuristic function. The performance of the proposed method was further improved by combining it with optimization algorithms. Huang et al. (2020) used APF to generate the drivable area, and then employed a local current comparison method to generate the collision-free path. However, the abovementioned methods heavily depend on how graphs are generated (without violating physical constraints) and do not fully consider vehicle dynamics, so the generated path may be incompatible with the dynamics (Tahir et al., 2020; Lee and Kum, 2019; Wang et al., 2019).
Some researchers developed another solution by taking vehicle dynamics into consideration. Tahir et al. (2020) proposed a
heuristic method in a multi-agent environment, decomposing and accelerating the solution of motion planning using non-linear
programming. Lee and Kum (2019) proposed a predictive occupancy map to generate candidate routes, and then chose the route
with the minimum risk as the motion path. Focusing on inevitable collision situations during driving, Wang et al. (2019) used model
predictive control for motion planning to avoid crashes or mitigate crash severities. A limitation of these methods is that some motion constraints used in motion planning are typically non-linear and non-convex, which can make the planning problem NP-hard (non-deterministic polynomial-time hard) (Wang et al., 2019).

1.2. Risk assessment based methods

Risk assessment based methods usually adopt the following two steps to guarantee driving safety (Li et al., 2021d): 1) assessing the
risk of the current driving state, 2) formulating a sequential action strategy based on the risk assessment results to maintain safety. This
hierarchical solution makes the risk assessment based methods easier to modularize in practical applications in AVs and ADASs (Ali et al., 2012; Noh, 2019). Currently, two categories of methods are typically used for risk assessment: the deterministic
approach and the probabilistic approach.
The deterministic approach is a binary prediction approach which only estimates whether a potential collision will happen or not.
Traditionally, researchers use specific safety metrics to evaluate the risk, such as time to collision (TTC), time to brake (TTB) or time
headway (THW) (Glaser et al., 2010; Kim and Kum, 2015; Bosnak and Skrjanc, 2017) and assess the potential risk by comparing the
obtained safety metric values with pre-defined thresholds. These methods impose almost no computational burden and can assess the threat risk accurately in longitudinal driving in single-lane scenarios (Kim and Kum, 2018). However, their performance in multi-lane scenarios is mostly unsatisfactory and the uncertainty of the input data is not considered, making the derived strategies impractical for real-world applications (Noh and An, 2018).
Therefore, probabilistic approaches have been proposed to address the uncertainty problem mentioned above (Noh, 2019; Kim and Kum, 2018; Yu et al., 2018; Shin et al., 2019). The probabilistic approaches usually utilize a probabilistic description to model the risk level by using the temporal and spatial relationships between vehicles and the uncertainties of the input data. Noh (2019) and Noh and An (2018) incorporated regular metrics (e.g., TTC) into risk probability assessment using a Bayesian model, and then developed rule-based expertise to control the subject vehicle at intersections and on highways. Yu et al. (2018) modeled the vehicles as particles, propagated the particles through their kinematics, and used the distribution of the particles during propagation as the risk distribution for collision avoidance. Shin et al. (2019) introduced vehicle-to-vehicle communication to predict remote vehicle positions with uncertainties, and used the number of collision cases within the uncertainty boundaries to assess the risk probability. A common limitation of the probabilistic approaches is that they only use expert knowledge to generate rule-based actions, which may fail to make correct decisions under environmental disturbances and neglects the learning characteristics of human drivers (Li et al., 2021a; Li et al., 2021f).

1.3. Learning based methods

With the development of deep learning technologies (Li et al., 2020a; Li et al., 2021c; Li et al., 2021g), researchers began to address
the decision-making problem using learning based methods, which can be further categorized into supervised learning methods or
reinforcement learning methods. In the supervised learning research area, Codevilla et al. (2018) proposed a conditional imitation learning method to generate driving policies as a chauffeur to handle sensorimotor coordination. The continuous responses to navigational commands helped make decisions to successfully avoid obstacles in the experiments. Xu et al. (2017) used the long short-term memory architecture to encode the instantaneous monocular camera observations and previous vehicle states, and their network was then trained to imitate drivers' realistic behaviors based on a large-scale video dataset. These methods work in an end-to-end manner, benefiting from the most significant advantage of supervised learning: the relationship between sensor inputs and model outputs can be directly mapped by the developed network. Moreover, supervised learning methods can make the networks generate realistic behaviors like human drivers (Xu et al., 2017).
The current autonomous driving technologies based on deep learning rely heavily on large amounts of data for training and testing (Paden et al., 2016; Bojarski et al., 2017; Duan et al., 2020). However, the collection of driving data is time-consuming and expensive. It is even more difficult to collect driving data in crash or near-crash scenarios, but these dangerous scenarios are of critical importance for the development of autonomous driving technologies. Thus, the current autonomous driving technologies are strongly limited by the data collection problem. Even the current largest naturalistic driving datasets (e.g., KITTI, BDD100K) lack crash or near-crash samples (Geiger et al., 2012; Yu et al., 2018), making it difficult for current AVs to handle these dangerous situations (Grigorescu et al., 2020; Kiran et al., 2021). Learning to drive only in situations that are not dangerous cannot effectively promote the development of AVs.
Following the major breakthroughs of deep reinforcement learning (DRL) in recent years (Mnih et al., 2015; Hasselt et al., 2015; Schaul et al., 2016; Duan et al., 2021), researchers have started to apply DRL to driving decision making problems in autonomous driving (Shin et al., 2019; Long et al., 2018; Li et al., 2019; Ye et al., 2019; Zhu et al., 2020). DRL based methods can greatly decrease the heavy reliance on large amounts of data because they do not need labeled driving data for training (Zhu et al., 2018; Moghadam and Elkaim, 2019; Hoel et al., 2020). Instead, they learn and enhance their driving knowledge and skills via trial-and-error, which means that DRL based methods can be used in crash or near-crash scenarios to help AVs avoid crashes (Kiran et al., 2021). Wang et al. (2018) proposed a reinforcement learning based approach to train the agent to learn an automated lane change behavior so that it can intelligently make a lane change for collision avoidance under unforeseen scenarios. Long et al. (2018) proposed a DRL based system-level scheme for multiple agents to plan their own collision-free actions without observing other agents' states and intents. Moghadam and Elkaim (2019) introduced DRL into a hierarchical architecture to make sequential tactical decisions (e.g., lane change) for AVs to avoid collisions, and then the tactical decision was converted to low-level actions for vehicle control. Unlike supervised learning methods, DRL based methods can compensate for the high cost of data collection in dangerous scenarios by training models in virtual simulation environments with affordable trial-and-error.
However, many DRL based decision-making methods do not fully take driving risk into consideration (Long et al., 2018; Zhu et al.,
2018; Moghadam and Elkaim, 2019), leading to risk insensitive driving strategies that could make the vehicle unsafe (Ma et al., 2020).
For improvement, Kahn et al. (2017) proposed an uncertainty-aware reinforcement learning algorithm to learn obstacle avoidance
strategies by using uncertainty to estimate driving risk. Bouton et al. (2019) proposed a probabilistic guarantee based reinforcement
learning strategy for autonomous driving at intersections by using a desired probabilistic specification expressed with linear temporal
logic to constrain the actions of the agent. Similarly, Mokhtari and Wagner (2021) developed a risk based decision-making framework
that integrates risk based path planning with reinforcement learning based control for safe driving. Although risk assessment has already been considered in some DRL based decision-making methods, risk-aware decision making for autonomous driving in lane change scenarios still needs to be further investigated (Chen et al., 2021).

1.4. Contributions

To solve the above-mentioned risk insensitivity problem in lane change scenarios, risk assessment and DRL were comprehensively considered in this study to prompt the agent to learn a strategy with the minimum expected risk. Firstly, a quantitative method based on position uncertainty using Bayesian theory was proposed to assess the driving risk. Next, DRL was used to learn a strategy with the minimum risk. Finally, the effectiveness of our proposed methods was evaluated in various scenarios in CARLA (Car Learning to Act).
The main contributions of this study include:

(1) A new probabilistic-model based risk assessment has been proposed to quantify the driving risk.
(2) An innovative driving strategy with the minimum expected risk has been proposed for safe driving.
(3) Driving risk awareness has been innovatively incorporated into DRL based driving decisions to make the agent capable of
finding a risk sensitive driving strategy.

These contributions can be adopted in the development of AVs to help drivers reduce the potential risk in multiple scenarios, which would substantially improve driving safety.

1.5. Paper organization

The remainder of this paper is organized as follows. Section 2 presents the problem statement and previous DRL solutions. Section 3 and Section 4 describe the proposed risk assessment method and the risk-aware driving decision strategy, respectively. The experiments and obtained results are presented and discussed in Section 5 and Section 6, respectively. Finally, the conclusions of this paper are presented in Section 7.

2. Problem statement and previous solutions

In a DRL based decision making strategy, an agent acts in a stochastic environment by choosing a series of actions over a sequence of time steps, and then learns from the feedback (i.e., rewards) to maximize a cumulative reward. This process is usually modeled as a Markov decision process (MDP), which can be written as:

$$\mathcal{M} \overset{\text{def}}{=} \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle \tag{1}$$

where $\mathcal{S}$ denotes a finite set of states, $\mathcal{A}$ denotes a finite set of actions, $\mathcal{P}$ denotes the state transition probability, $\mathcal{R}$ denotes the reward space, and $\overset{\text{def}}{=}$ denotes equality by definition.

Typically, a stochastic policy $\pi_\theta(a|s) \overset{\text{def}}{=} \pi(a|s;\theta)$ is used to generate a series of actions $a$ ($a \in \mathcal{A}$) according to the specific environment state $s$ ($s \in \mathcal{S}$), where $\pi(a|s;\theta)$ denotes the probability of the action taken from the policy $\pi$. Through the interaction between
the agent and the environment, a trajectory (also called state-action samples) can be generated and denoted as $\{s_0, a_0, s_1, r_1, \ldots, s_t, a_t, s_{t+1}, r_{t+1}, \ldots\}$, and the solution of the MDP is to find the best policy $\pi^*(s)$ which maximizes the expected cumulative reward:

$$\pi^*(s) = \arg\max_\pi E_\pi\left\{ \sum_{i=0}^{+\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s \right\} \tag{2}$$

where $\gamma \in [0,1]$ denotes the discount factor, which controls the weight of the future reward, $\pi^*(s)$ denotes the best policy, and $r_{t+i}$ denotes the reward at time $t+i$, which is calculated by the pre-defined reward function $r(s_t, a_t, s_{t+1})$.
In order to solve Eq. (2) and find the best policy $\pi^*(a|s)$, the Q-value function can be introduced to guide the policy improvement and represent the policy itself, which is defined as:

$$q_\pi(s, a) \overset{\text{def}}{=} E_\pi\left\{ \sum_{i=0}^{+\infty} \gamma^i r_{t+i} \,\Big|\, s_t = s, a_t = a \right\} \tag{3}$$

where $q_\pi(s,a)$ denotes the expected cumulative reward when starting from state $s$, taking action $a$, and then following policy $\pi$. $\pi(s) = \arg\max_a q_\pi(s,a)$ is used to improve the policy $\pi$. Therefore, the problem of Eq. (2) can be converted to finding the best policy $\pi^*$ that maximizes $q_\pi(s,a)$, i.e., $q_{\pi^*}(s,a) = \max_\pi q_\pi(s,a)$.

2.1. Deep Q-network (DQN)

In deep Q-network (Mnih et al., 2015), two neural networks are used to approximate the Q-value function $q_\pi(s,a)$: one called Q_target and the other called Q_online. Q_online is used to generate the trajectory $(s_t, a_t, s_{t+1}, r_{t+1})$, which is stored in memory $M$ and later randomly sampled to update the network in order to reduce the correlation of the data. Q_target is introduced to provide the target Q-value that updates Q_online using the temporal difference (TD) error, improving the stability of the network. The loss function is defined as:

$$L \overset{\text{def}}{=} E_{(s,a,r,s')\sim M}\left\{ \left(y - Q(s,a;\theta)\right)^2 \right\}, \qquad y \overset{\text{def}}{=} r + \gamma \max_{a'} Q(s', a'; \theta') \tag{4}$$

where $(s, a, r, s')$ represents a sample from memory $M$, $\theta$ denotes the weights of Q_online, and $\theta'$ denotes the weights of Q_target.
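For concreteness, the following minimal PyTorch sketch shows how the TD target and loss of Eq. (4) can be computed for a sampled mini-batch; the tensor layout, the `dqn_loss` helper name, and the terminal-state masking via `dones` are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    """DQN loss of Eq. (4) for a sampled mini-batch.

    `batch` is assumed to hold tensors: states (B, state_dim), actions (B,),
    rewards (B,), next_states (B, state_dim), dones (B,).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta): value of the action actually taken, from the online network.
    q_sa = q_online(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # y = r + gamma * max_a' Q(s', a'; theta'): bootstrapped target from the target network.
    # The (1 - dones) factor zeroes the bootstrap at terminal states, a common practical detail.
    with torch.no_grad():
        max_next_q = q_target(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q

    # Mean squared TD error, as in Eq. (4).
    return F.mse_loss(q_sa, y)
```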

2.2. Double deep Q-network (DDQN)

By using memory replay to reduce the correlation of the data and two different Q-value networks to improve stability, DQN can approximate $q_{\pi^*}(s,a)$. However, the Q-value can be overestimated in certain conditions, which makes the Q-value of a suboptimal action higher than that of the optimal one. Specifically, the max operation in DQN (see Eq. (4)) uses the same network in both selection (i.e., selecting $a'$ according to $Q(\theta')$) and evaluation (i.e., evaluating $y$ through $Q(\theta')$). This makes DQN more likely to select an overestimated value. Therefore, Hasselt et al. (2015) proposed double DQN to solve the overestimation problem. Double DQN decouples the selection from the evaluation, i.e., selecting $a'$ according to $Q(\theta)$ but evaluating $y$ through $Q(\theta')$. Therefore, the improved loss can be written as:

$$L \overset{\text{def}}{=} E_{(s,a,r,s')\sim M}\left\{ \left(y - Q(s,a;\theta)\right)^2 \right\}, \qquad y \overset{\text{def}}{=} r + \gamma Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta'\right) \tag{5}$$
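Relative to the previous sketch, only the target changes under double DQN; the helper below illustrates the decoupled selection/evaluation of Eq. (5) under the same assumed tensor conventions.

```python
import torch

def double_dqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """Decoupled target of Eq. (5): select a' with the online network,
    evaluate the selected action with the target network."""
    with torch.no_grad():
        best_actions = q_online(next_states).argmax(dim=1, keepdim=True)   # selection: Q(theta)
        eval_q = q_target(next_states).gather(1, best_actions).squeeze(1)  # evaluation: Q(theta')
        return rewards + gamma * (1.0 - dones) * eval_q
```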

2.3. Dueling deep Q-network (DulDQN)

Traditionally, DQN uses a single-stream output to denote the Q-value that is determined by the state and action together. However, in some cases (Wang et al., 2016), the action does not affect the environment, and the Q-value only depends on the state. Dueling DQN (Wang et al., 2016) decomposes the representation of the Q-value into a state value and an action advantage, named the dueling architecture, denoted as:

$$Q(s, a; \theta) \overset{\text{def}}{=} V(s; \theta) + A(s, a; \theta) \tag{6}$$

where $V(s;\theta)$ denotes the value of state $s$, and $A(s,a;\theta)$ denotes the advantage of taking action $a$ in state $s$.

The dueling architecture is particularly useful in situations where the actions do not affect the environment, as it helps the agent recognize the valuable states. However, Eq. (6) suffers from an identifiability problem (Wang et al., 2016), i.e., $V$ and $A$ cannot be recovered from a given $Q$. In dueling DQN, a constraint (i.e., $E_{a\sim\pi(s)}[A(s,a;\theta)] = 0$) is used to solve this problem, and the final equation is written as follows:

$$Q(s, a; \theta) \overset{\text{def}}{=} V(s; \theta) + \left( A(s, a; \theta) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta) \right) \tag{7}$$

where $|\mathcal{A}|$ denotes the size of the action space.
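The mean-subtracted aggregation of Eq. (7) can be expressed as a small network head; the following PyTorch sketch is illustrative, with `feature_dim` standing for the output size of the shared layers (an assumption, since the DulDQN layer sizes are not detailed here).

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture of Eq. (7): Q = V + (A - mean_a' A)."""

    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s; theta)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(s, a; theta)

    def forward(self, features):
        v = self.value(features)                             # shape (B, 1)
        a = self.advantage(features)                         # shape (B, |A|)
        # Subtracting the mean advantage resolves the identifiability issue of Eq. (6).
        return v + a - a.mean(dim=1, keepdim=True)
```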

2.4. Deep Q-network with prioritized replay (PRDQN)

The above DQN based methods use a uniform sampling strategy in memory replay. However, in the memory pool, the samples with small TD errors are easy to learn, which is not the case for the samples with higher TD errors. Therefore, Schaul et al. (2016) proposed prioritized replay for DQN with the idea that the samples with higher TD errors should be preferentially replayed for learning. The priority is defined as:

$$P(i) \overset{\text{def}}{=} \frac{p_i^a}{\sum_k p_k^a} \tag{8}$$

where $P(i)$ is the sampling probability of sample $i$, $p_i$ denotes the TD error of sample $i$, and $a$ is a pre-defined coefficient.

Fig. 1. The frameworks of (a) DQN, (b) DDQN, (c) DulDQN, and (d) PRDQN.
Given that uniform sampling for experience replay is no longer guaranteed after introducing the priority, importance sampling weights should be introduced to adjust the corresponding gradients when updating the network for bias elimination, defined as follows:

$$w_i \overset{\text{def}}{=} \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta \tag{9}$$

where $N$ is the replay size, and $\beta$ is a pre-defined coefficient.
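A minimal NumPy sketch of the prioritized sampling of Eq. (8) and the importance weights of Eq. (9) is given below; the exponent values and the max-normalization of the weights are common practical choices rather than values reported in the paper.

```python
import numpy as np

def per_sample(td_errors, batch_size, a=0.6, beta=0.4, eps=1e-6):
    """Sample indices with the priorities of Eq. (8) and return the
    importance weights of Eq. (9). `a`, `beta`, and `eps` are illustrative values."""
    priorities = (np.abs(td_errors) + eps) ** a    # p_i^a
    probs = priorities / priorities.sum()          # P(i), Eq. (8)
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)

    n = len(td_errors)
    weights = (1.0 / (n * probs[idx])) ** beta     # w_i, Eq. (9)
    weights /= weights.max()                       # normalize for stable gradient scaling
    return idx, weights
```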


The frameworks of the above-introduced DQN-based algorithms including DQN, DDQN, DulDQN, and PRDQN are shown in Fig. 1.

3. Proposed methods with risk assessment

In this section, our driving decision making strategy with risk awareness is introduced. Firstly, we introduce how to define the
driving risk. Then, the driving risk combined with DRL methods is used to find a strategy with the minimum driving risk. The solution
framework is shown in Fig. 2.

3.1. Risk assessment

Unlike the methods that only predict the binary potential risk (occur or not), our risk assessment method will estimate the concrete
probabilities at different risk levels. The three risk levels used in this study are defined as follows:
$$\Omega \overset{\text{def}}{=} \{\text{dangerous}, \text{attentive}, \text{safe}\} \overset{\text{def}}{=} \{D, A, S\} \tag{10}$$

where $\Omega$ is the set of risk levels. The scores of 2, 1, and 0 are assigned to $D$, $A$, and $S$ in $\Omega$, respectively, to describe the risk levels. Hence, the risk level $\tau$ is defined as:

$$\tau \in \Omega \overset{\text{def}}{=} \{2, 1, 0\} \tag{11}$$

Fig. 2. The solution framework of our proposed methods.


To model the risk based on nondeterministic theories, we take the relative position $d$ and the uncertainty $\sigma$ into consideration and use safety-metric-based distributions to calculate the conditional probability of different risk levels. Bayesian inference is then used to assess the risk level in each given state.

The safety-metric-based distributions using the relative position $d$ can be calculated as follows:

$$P(d \mid \tau = D) \overset{\text{def}}{=} \begin{cases} 1, & \text{if } d < d_D \\ e^{-\frac{\Delta d_D^2}{2\sigma^2}}, & \text{otherwise} \end{cases}$$

$$P(d \mid \tau = A) \overset{\text{def}}{=} e^{-\frac{\Delta d_A^2}{2\sigma^2}} \tag{12}$$

$$P(d \mid \tau = S) \overset{\text{def}}{=} \begin{cases} 1, & \text{if } d > d_S \\ e^{-\frac{\Delta d_S^2}{2\sigma^2}}, & \text{otherwise} \end{cases}$$

$$\Delta d_i \overset{\text{def}}{=} |d - d_i|, \quad i \in \Omega$$

where $d$ denotes the relative distance between the host vehicle (HV) and other vehicles (OVs), and $d_D$, $d_A$, and $d_S$ are the pre-defined thresholds for driving risk assessment. Fig. 3 shows a qualitative presentation of Eq. (12). The illustration shows that a state with a shorter relative distance corresponds to a higher potential risk, and vice versa. The main hyper-parameters in the risk model (i.e., $d_D$, $d_A$, $d_S$, and $\sigma$) control the smoothness of the different risk curves (see Fig. 3). The values of these hyper-parameters are determined by tuning them to obtain smooth risk curves with reasonable distributions, the details of which can be found in Li et al. (2021).
Using Bayes' theorem, the posterior probability of a specific risk level $\tau$ can be calculated as

$$P(\tau \mid d) = \frac{P(d \mid \tau) \cdot P(\tau)}{\sum_{\tau\in\Omega} P(\tau) \cdot P(d \mid \tau)} \tag{13}$$

where $P(\tau|d)$ denotes the probability of a specific risk level in the given state $d$, $P(d|\tau)$ is the conditional probability from Eq. (12), and $P(\tau)$ is the prior probability of each risk level. In this study, it is assumed that the different risk levels have the same prior probability, with the constraint $\sum_{\tau\in\Omega} P(\tau) = 1$.

3.2. Decision making with minimum risk

To find a strategy with the minimum risk for safe driving, the risk assessment results should be introduced into the DRL based methods. However, the discrete risk assessment results (i.e., $P(\tau|d)$) described above cannot be directly introduced into the DRL methods. To solve this problem, a continuous risk coefficient $\varepsilon$ is defined with respect to the risk level $\tau$ as follows:

$$\varepsilon \overset{\text{def}}{=} E(\tau) = \sum_{\tau\in\Omega} \tau \cdot P(\tau \mid d) = \sum_{\tau\in\{2,1\}} \tau \cdot P(\tau \mid d) \tag{14}$$

where $\tau$ is the discrete risk level described in Eq. (11), and $\varepsilon$ denotes the expectation of the assessed risk.
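To make Eqs. (12)-(14) concrete, the following sketch computes the conditional likelihoods, the Bayesian posterior with uniform priors, and the expected risk ε for a given relative distance; the threshold and spread values are placeholders, since the paper only states that they are tuned to yield smooth risk curves.

```python
import numpy as np

# Placeholder thresholds and spread; the paper tunes these to obtain smooth risk curves.
D_D, D_A, D_S, SIGMA = 5.0, 15.0, 30.0, 5.0

def likelihoods(d):
    """Conditional probabilities P(d | tau) of Eq. (12), ordered [dangerous, attentive, safe]."""
    gauss = lambda delta: np.exp(-delta ** 2 / (2 * SIGMA ** 2))
    p_d = 1.0 if d < D_D else gauss(abs(d - D_D))
    p_a = gauss(abs(d - D_A))
    p_s = 1.0 if d > D_S else gauss(abs(d - D_S))
    return np.array([p_d, p_a, p_s])

def expected_risk(d):
    """Posterior P(tau | d) of Eq. (13) with uniform priors, then epsilon of Eq. (14)."""
    lik = likelihoods(d)
    prior = np.full(3, 1.0 / 3.0)
    posterior = lik * prior / np.sum(lik * prior)
    scores = np.array([2.0, 1.0, 0.0])   # risk scores for D, A, S (Eq. (11))
    return float(scores @ posterior)      # epsilon = E[tau]
```

For example, under these placeholder thresholds expected_risk(40.0) is close to 0 (safe), while expected_risk(4.0) approaches 2 (dangerous).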
By using this continuous value to quantify the driving risk, the following Eq. (15) is used to represent a strategy with the minimum expected risk:

$$\pi^*(s) \overset{\text{def}}{=} \arg\min_\pi E_\pi\left\{ \sum_{i=0}^{+\infty} \gamma^i \varepsilon_{t+i} \,\Big|\, s_t = s \right\} \tag{15}$$

Fig. 3. A qualitative illustration of the prior distribution in Eq. (12).

An equivalent transformation can be written as follows:

$$\pi^*(s) \overset{\text{def}}{=} \arg\max_\pi E_\pi\left[ \sum_{i=0}^{+\infty} \gamma^i \left(\max\varepsilon - \varepsilon_{t+i}\right) \,\Big|\, s_t = s \right] \tag{16}$$

where $\max\varepsilon$ denotes the defined maximum risk value. Considering the definitions in Eq. (11), we obtain $\max\varepsilon = 2$. Compared with Eq. (2), Eq. (16) has the same form if we set $r_{t+i} = \max\varepsilon - \varepsilon_{t+i}$, which means that the best policy with the minimum expected risk can be found using DRL-based methods. The corresponding Q-value function is defined as:

$$Q_\pi(s, a) = E_\pi\left[ \sum_{i=0}^{+\infty} \gamma^i \left(\max\varepsilon - \varepsilon_{t+i}\right) \,\Big|\, s_t = s, a_t = a \right] \tag{17}$$

4. Collision avoidance decision-making

4.1. State space and action space

There are generally two kinds of methods to define state space in AVs. One is based on the fully known numerical information, e.g.,
the vehicle dynamics. The other is based on the raw sensor information, also known as end-to-end information. However, the state
space based on raw sensor information requires a huge amount of exploration to collect various states for DRL to train the agent. Besides, it suffers from transfer and deployment problems when using DRL (Pan et al., 2017). Therefore, we use the numerical information to denote the environment state.
To make correct decisions while driving based on the relationship between the HV and other vehicles (OVs), the relative distance (both longitudinal and lateral), yaw angle, and yaw rate are comprehensively considered for decision learning and inference. The state of the relationship between the HV and OVs can be written as:

$$
\begin{aligned}
VAO_i &= [i_i, lo_i, la_i, yaw, \Delta lo_i, \Delta la_i, \Delta yaw] \\
s &= [VAO_1, VAO_2, \ldots, VAO_n], \quad i_i \in \{0, 1\}, \quad 0 \le i \le n
\end{aligned}
\tag{18}
$$

where $i_i$ denotes whether there is another vehicle in lane $i$ within the perception range (+50 m, −5 m), $n$ is the number of lanes, $lo$ and $la$ denote the relative distances between the HV and the obstacle in the longitudinal and lateral directions respectively, $\Delta lo$ and $\Delta la$ are the corresponding change rates of $lo$ and $la$, and $yaw$ and $\Delta yaw$ denote the vehicle yaw angle and yaw rate. Fig. 4 shows an illustration of
the state defined above.
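As an illustration of how the state of Eq. (18) could be assembled in code, the sketch below builds s from per-lane observations; the input format and the zero-filling of lanes without a perceived vehicle are assumptions, not details given in the paper.

```python
import numpy as np

def build_state(lane_observations, yaw, yaw_rate):
    """Assemble the state vector of Eq. (18) from per-lane observations.

    `lane_observations` is assumed to be a list with one entry per lane, either
    None (no vehicle within the perception range) or a tuple
    (lo, la, d_lo, d_la) of relative distances and their change rates.
    """
    vaos = []
    for obs in lane_observations:
        if obs is None:
            vaos.append([0.0] * 7)                        # indicator i_i = 0, features zeroed
        else:
            lo, la, d_lo, d_la = obs
            vaos.append([1.0, lo, la, yaw, d_lo, d_la, yaw_rate])
    return np.concatenate(vaos).astype(np.float32)        # s = [VAO_1, ..., VAO_n]
```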
For autonomous driving using our proposed methods, both steering action for lateral control and throttle action for longitudinal
control are considered. To prevent over-conservative behaviors of the DRL agent (e.g., taking too many brake actions), the brake action
is left to human drivers or other ADAS applications for the final safety decision. Although the brake action is not considered in this
study, the proposed methods can still work well because of the consideration of driving risks. The final steering action at time t with a
pre-defined throttle can be written as:
$$a_t \overset{\text{def}}{=} [LTL_t, LTS_t, S_t, RTS_t, RTL_t] \tag{19}$$

where $LTL$ and $RTL$ denote the left turn and right turn with a large numerical value, $LTS$ and $RTS$ denote the left turn and right turn with a small numerical value, and $S$ is the straight driving action without steering.

Fig. 4. Illustration of the vehicle state definition.


Since DQN based methods only support discrete actions (e.g., Eq. (19)), the performance of AVs driven by DQN based agents will be
very unstable, which may make drivers and passengers feel uncomfortable. Therefore, an exponential moving averaging strategy is
introduced to smooth the action, denoted as follows:
$$a_t^* \overset{\text{def}}{=} a_{t-1} + \gamma\,(a_t - a_{t-1}) \tag{20}$$

where $a_t^*$ denotes the action after smoothing, $\gamma$ is a pre-defined constant that controls the smoothness (not the discount factor in Eq. (2)), and $a_t$ and $a_{t-1}$ denote the actions generated by the DQN based agents at time $t$ and $t-1$.
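The smoothing of Eq. (20) is a one-line exponential moving average; a minimal sketch follows, where the value of the smoothing constant is illustrative.

```python
def smooth_action(a_prev, a_new, gamma=0.5):
    """Exponential moving average of Eq. (20): a*_t = a_{t-1} + gamma * (a_t - a_{t-1}).

    `gamma` here is the smoothing constant of Eq. (20), not the discount factor;
    0.5 is an illustrative value.
    """
    return a_prev + gamma * (a_new - a_prev)
```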

4.2. Reward function

The target of reinforcement learning is to find a strategy that can maximize a pre-designed reward function. However, in this paper,
we want to find a strategy which can minimize the expected driving risk (please see Eq. (15)). Therefore, a negative sign was added to
the assessed driving risk so that our goal would be changed to a maximization problem. To encourage the agent to learn a robust
strategy, the reward would usually be a positive value with practical meanings (Duan et al., 2020). Therefore, the maximum risk value
maxε is added to the assessed driving risk (with the negative sign, i.e., − εt ) to describe the redundant risk space, indicating the safety
level of the corresponding strategy. Thus, in order to find a minimum risk strategy, the risk assessment result ε should be considered in
the first place, as described in Eq. (16). Therefore, the reward of risk is written as:
$$r_{risk} \overset{\text{def}}{=} \max\varepsilon - \varepsilon_t \tag{21}$$

where $r_{risk}$ denotes the reward from the risk assessment, $\varepsilon_t$ denotes the risk assessment result at time $t$, and $\max\varepsilon$ denotes the defined maximum risk value.
Besides, other reward schemes are included when planning vehicle behavior in order to follow traffic rules and human driving
habits, as follows.
Traffic rule: Illegal lane change is one of the dangerous driving behaviors (Li et al., 2021b). Traditionally, researchers set a binary
penalty to teach AVs not to make an illegal lane change (Shi et al., 2019). However, this may not make the AVs realize that the actions
that bring the AV close to lane invasion should also be punished. In our proposed methods, a soft penalty is used to address this problem:

$$r_{invasion} \overset{\text{def}}{=} -e^{-\frac{(la_{ld} - la_{hv})^2}{2\sigma^2}} \tag{22}$$

where $la_{ld}$ and $la_{hv}$ denote the lateral positions of the lane boundary and of the HV, respectively. The smaller the relative distance between the HV and the lane boundary, the greater the penalty.
Human driving habits: Usually, human drivers will drive the car in the center of the lane for driving safety. Therefore, driving in the
center of the lane should be regarded as an incentive, which is defined as follows:
$$r_{center} \overset{\text{def}}{=} e^{-\frac{(la_{center} - la_{hv})^2}{2\sigma^2}} \tag{23}$$

where $la_{center}$ denotes the current lane center.


$r_{center}$ encourages the vehicle to follow the lane center, but it does not convey that the vehicle should not drive on the lane marking. To make the agent learn the lane marking rules, $r_{invasion}$ is further designed to penalize the agent for driving close to the lane boundary. Thus, the agent can learn how to drive in a lane by considering $r_{center}$ together with $r_{invasion}$.
Besides, the HV needs to be encouraged to keep within road boundaries and survive (i.e., no collision) in the environment as long as
possible, which can be written as
$$r_{exist} \overset{\text{def}}{=} \begin{cases} 0.1, & \text{if exist} \\ -1, & \text{otherwise} \end{cases} \tag{24}$$

where exist means that neither a collision nor running out of the road boundaries occurs. The values 0.1 and −1 in $r_{exist}$ are commonly used in previous studies (Long et al., 2018; Li et al., 2020).
According to Eqs. (21), (22), (23), and (24), we obtain $r_{risk} \in [0, 2]$, $r_{invasion} \in [-1, 0]$, $r_{center} \in [0, 1]$, and $r_{exist} \in \{-1, 0.1\}$. The upper bound of $r_{risk}$ is 2, which is twice the upper bound (in absolute value) of $r_{invasion}$, $r_{center}$, and $r_{exist}$. Given that reducing risk is more important than the other objectives for autonomous driving (Bouton et al., 2019) and the sub-reward functions are designed with comparable ranges to avoid significant range difference problems, the weights of the sub-reward functions in this study are all set to 1, which is a commonly accepted simplification in reinforcement learning studies (Mnih et al., 2015; Qi et al., 2019). Therefore, by considering all the above elements, the complete reward function is defined as:

$$r \overset{\text{def}}{=} r_{risk} + r_{invasion} + r_{center} + r_{exist} \tag{25}$$


Eqs. (15), (16), and (17) are the elements that combine risk assessment with reinforcement learning, expressing that we want to generate a driving strategy with the minimum expected risk. Once we have Eqs. (15), (16), and (17), the driving strategy can be found using reinforcement learning algorithms via the reward settings in Eq. (25), which consider the driving risk assessment. The reward function used to train the original DRL methods compared in this study (i.e., DQN, DDQN, DulDQN, and PRDQN) is defined as:

$$r \overset{\text{def}}{=} r_{invasion} + r_{center} + r_{exist} \tag{26}$$
It should be noted that the reward function is only used in the training phase to teach an agent how to drive. We acknowledge that the performance of the different methods may be further improved if these hyper-parameters (e.g., weights) are optimized using optimization algorithms (e.g., the genetic algorithm). However, for a fair comparison, the same hyper-parameters of the risk model and reward functions are used in this study to examine the effectiveness of the different methods in the lane change scenarios.
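Putting Eqs. (21)-(25) together, a minimal sketch of the full reward used to train the risk-aware agents is shown below; the σ value and the lane-geometry inputs are illustrative placeholders, and dropping the r_risk term recovers the baseline reward of Eq. (26).

```python
import numpy as np

def reward(eps_t, la_hv, la_boundary, la_center, alive, sigma=0.5, max_eps=2.0):
    """Total reward of Eq. (25) from the sub-rewards of Eqs. (21)-(24).

    `sigma` and the lateral-position inputs are illustrative; in practice they
    would come from the lane geometry reported by CARLA.
    """
    r_risk = max_eps - eps_t                                                # Eq. (21)
    r_invasion = -np.exp(-(la_boundary - la_hv) ** 2 / (2 * sigma ** 2))    # Eq. (22)
    r_center = np.exp(-(la_center - la_hv) ** 2 / (2 * sigma ** 2))         # Eq. (23)
    r_exist = 0.1 if alive else -1.0                                        # Eq. (24)
    return r_risk + r_invasion + r_center + r_exist                         # Eq. (25), unit weights
```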

4.3. Training details

In our proposed methods, DQN-based algorithms are introduced to help find an optimal policy with the minimum expected risk. Fully connected layers are utilized to approximate the Q-value function. The details of the network are shown in Table 1.
In our network, batch normalization (BN) (Ioffe and Szegedy, 2015) is utilized to normalize the neural output before activation. The idea of BN consists of two parts: 1) standard normalization is used to normalize the output to a Gaussian distribution, which prevents all the outputs from falling into the negative range and causing the gradient to vanish after ReLU; 2) since the operation in 1) destroys the raw distribution, two learnable parameters are used to restore the output. The corresponding algorithm is introduced in Algorithm 1.

Algorithm 1: Batch normalization.

Input: Values of x over a mini-batch: B = {x_1, …, x_m}; parameters to be learned: γ, β
Output: {y_i = BN_{γ,β}(x_i)}

μ_B = (1/m) Σ_{i=1}^{m} x_i                    // mini-batch mean
σ_B^2 = (1/m) Σ_{i=1}^{m} (x_i − μ_B)^2        // mini-batch variance
x̂_i = (x_i − μ_B) / sqrt(σ_B^2 + ϵ)           // normalization
y_i = γ·x̂_i + β = BN_{γ,β}(x_i)               // scale and shift
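A compact PyTorch sketch of the Q-value network in Table 1, with batch normalization applied before each ReLU activation as described above, could look as follows; the state dimension is left as an argument since it depends on the number of lanes in Eq. (18).

```python
import torch.nn as nn

def build_q_network(state_dim, n_actions=5):
    """Fully connected Q-network following Table 1, with batch normalization
    inserted before each ReLU activation (Algorithm 1)."""
    hidden_sizes = [32, 108, 296, 108, 32]     # layer widths from Table 1
    layers, in_dim = [], state_dim
    for h in hidden_sizes:
        layers += [nn.Linear(in_dim, h), nn.BatchNorm1d(h), nn.ReLU()]
        in_dim = h
    layers.append(nn.Linear(in_dim, n_actions))  # linear output over the 5 steering actions
    return nn.Sequential(*layers)
```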

In order to reduce the variance when updating the network, several techniques are used to train the network, including a warmup learning rate strategy, a large mini-batch size (i.e., 256), gradient clipping by norm, and soft network updating.
Warmup learning rate strategy: In the early stages of the training process, the optimizer usually has a large variance, making the network updates unsteady. Therefore, a small learning rate is first used to train the network in the warmup learning rate strategy. Then, the learning rate goes back to the normal value for training. In our experimental settings, the learning rate is set to 0.01 as the initial value and recovers to 0.1 after 50 episodes.
Gradient clipping: To avoid gradient explosion, gradient clipping with normalization is used. For any gradient in layer i, the gradient
after clipping is defined as:
$$grad_i^* \overset{\text{def}}{=} grad_i \cdot \frac{clipnorm}{\max\left(norm(grad_i),\, clipnorm\right)} \tag{27}$$

where $grad_i$ and $grad_i^*$ denote the raw gradient and the gradient after clipping in layer $i$; $norm$ denotes the standard deviation calculation; and $clipnorm$ is a pre-defined coefficient which denotes the standard deviation after clipping. To reduce the variance of the updates, $clipnorm$ is set to 0.1.
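A minimal sketch of the clipping rule of Eq. (27), applied per parameter tensor and reading norm(·) as the gradient norm, is shown below; PyTorch's built-in torch.nn.utils.clip_grad_norm_ offers a similar global-norm variant.

```python
import torch

def clip_gradients(network, clipnorm=0.1):
    """Per-parameter gradient clipping in the spirit of Eq. (27):
    grad* = grad * clipnorm / max(norm(grad), clipnorm)."""
    for p in network.parameters():
        if p.grad is not None:
            g_norm = p.grad.norm()
            # Scale down only when the norm exceeds clipnorm; otherwise leave unchanged.
            p.grad.mul_(clipnorm / torch.clamp(g_norm, min=clipnorm))
```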
Soft network updating: Rather than using the hard network updating, soft updating is used to copy the weights from the online
network to the target network, which can be written as:

Table 1
Details of the Q-value network.
Layers Hidden units Activation

I 32 ReLu
II 108 ReLu
III 296 ReLu
IV 108 ReLu
V 32 ReLu
Output 5 Linear


$$\theta_{target} \overset{\text{def}}{=} (1 - \eta) \cdot \theta_{target} + \eta \cdot \theta_{online} \tag{28}$$

where $\theta_{target}$ and $\theta_{online}$ respectively denote the weights of the target network and the online network, and $\eta$ is a small value that affects the speed of the target network updating; it is set to 0.01.
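The soft update of Eq. (28) can be implemented as a short parameter-wise blend; the sketch below assumes PyTorch modules for the two networks.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, eta=0.01):
    """Soft target-network update of Eq. (28):
    theta_target <- (1 - eta) * theta_target + eta * theta_online."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - eta).add_(eta * o_param)
```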
All these sub-rewards are considered throughout the training phase without being turned off. It may seem that $r_{center}$ and $r_{invasion}$ conflict with the lane change behavior. However, the motivation for a lane change comes from the requirement of collision avoidance, because $r_{risk}$ will be much greater than these two rewards in the scenarios requiring collision avoidance (braking was not considered for collision avoidance in this study). Therefore, even if we take these two rewards into consideration in the lane change process, the agent can still learn lane change behavior because of the comprehensive consideration in our designed reward function.
The algorithms developed in this paper are based on DQN. All the values of the hyper-parameters except the learning rate are determined by recommendations from the existing literature (Mnih et al., 2015). To determine the learning rate, we first test all the DQN based methods on Cart Pole, a well-known task in OpenAI Gym (Brockman et al., 2016), to examine whether the DQN based methods work successfully. Given that all the methods successfully pass the Cart Pole test, we then apply them to the pre-designed lane change scenarios in CARLA. DQN is first used to adjust the learning rate to find the best initial value, which is then applied to the other methods. The source code and the parameter details of our proposed methods and the compared methods can be found at https://github.com/YoungYoung619/reinforcement-learning-based-driving-decision-in-Carla.
The corresponding pseudo-code of our proposed methods is described in Algorithm 2. Those DRL methods with the prefix RA are
our proposed methods. It should be noted that the purpose of training is to teach an agent how to drive the vehicle. The agent trained
from the training phase would be directly used in evaluation to examine its effectiveness. For fair comparison, the improved methods
and the original DRL methods are trained and evaluated in the same way. These measures can help draw reliable and convincing
conclusions.

Algorithm 2: The driving decision making framework (training).

Input: DQN-based algorithm m ∈ [RA-DQN, RA-DDQN, RA-DulDQN, RA-PRDQN], M = 2000 (the number of episodes in the training phase)
Output: the best policy π*(s)
Randomly initialize the online network Q(s, a; θ), the target network Q(s, a; θ'), memory D with capacity N, and the environment E.
For episode = 1, …, M do
    Initialize the initial state s_1 from E
    For t = 1, …, T do
        Select a_t = argmax_a Q(s_t, a; θ) with probability (1 − ϵ); otherwise select a_t randomly
        Execute action a_t in E, calculate r_{t+1} using Eq. (25), and observe s_{t+1}
        Store [s_t, a_t, r_{t+1}, s_{t+1}] in D
        Sample a mini-batch B of transitions from D
        Use B to update the online network Q(s, a; θ) with batch normalization and gradient clipping
        Use θ' = (1 − η)·θ' + η·θ to update the target network Q(s, a; θ') (Eq. (28))
    End for
End for
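The following compact Python sketch mirrors the training loop of Algorithm 2, reusing the helper functions sketched earlier in this section (dqn_loss, clip_gradients, soft_update); the Gym-like environment wrapper around the CARLA scenario and the replay-buffer interface are assumptions, not the released implementation.

```python
import random
import torch

def train(env, q_online, q_target, optimizer, replay,
          n_episodes=2000, batch_size=256, gamma=0.99, epsilon=0.1):
    """Compact training loop mirroring Algorithm 2.

    `env` is assumed to expose a Gym-like reset()/step() interface wrapping the
    CARLA lane change scenario, and `replay` a buffer with add()/sample()/__len__();
    dqn_loss, clip_gradients, and soft_update are the helpers sketched above.
    """
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection from the online network (eval mode for BN)
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                q_online.eval()
                with torch.no_grad():
                    q = q_online(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1))
            next_state, reward, done, _ = env.step(action)   # reward follows Eq. (25)
            replay.add(state, action, reward, next_state, done)
            state = next_state

            if len(replay) >= batch_size:
                q_online.train()
                batch = replay.sample(batch_size)
                loss = dqn_loss(q_online, q_target, batch, gamma)
                optimizer.zero_grad()
                loss.backward()
                clip_gradients(q_online, clipnorm=0.1)       # Eq. (27)
                optimizer.step()
                soft_update(q_target, q_online, eta=0.01)    # Eq. (28)
```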

5. Experiments

Currently, most of the DRL based driving decision-making methods are trained and tested on simulators (Ye et al., 2019; Zhu et al.,
2020; Duan et al., 2020; Kiran et al., 2021), due to the unaffordable trial-and-error cost in real world experiments (Grigorescu et al.,
2020). Therefore, our proposed methods are also trained and tested on a professional and well-accepted simulator called Car Learning
to Act (CARLA) (Dosovitskiy et al., 2017). In our experiments, two driving scenarios were proposed to evaluate the performance of our
proposed methods.

Fig. 5. An illustration for the scenario I in our experiments.


Scenario I (static vehicles): To collect training samples of our proposed methods, a random number of static vehicles (10 ~ 26) are
randomly placed on a 420 m long straight road to serve as static obstacles in the training phase. The HV should drive forward safely to
avoid collisions with any of the static vehicles. See Fig. 5 for more details. Four vehicles (including the HV) are located on a road with 4
sections, each of which is named as an interval section with one static vehicle. The location of the static vehicle in each interval section
is randomly specified using the Gaussian based position sampling probability method, and the lane choices of all the vehicles
(including the HV and OVs) are randomly initialized. Therefore, the situation in which two vehicles located in parallel in two lanes block the road does not happen in our experiments.
However, the longest straight road we can find in the available maps in CARLA is 420 m, which is not long enough to provide
convincing evaluation results on the effectiveness of the proposed methods. Therefore, we evaluated the proposed methods on the 420
m-long road for 100 times with random scenarios (i.e., 100 episodes) in the evaluation phase. In each of the episodes, the positions of
26 vehicles (including the HV) were randomly set using the rules illustrated in Fig. 5. Therefore, each of the examined methods was tested 100 times on the 420 m road with 26 randomly distributed vehicles. This setting requires the vehicle to change lanes about 4 times per 100 m, which is a very high lane change frequency for on-road driving. In the testing experiments, the total traveled distance of the autonomous vehicle in this scenario is 42 km and the total number of lane changes is 1600. Another lane change study using the same 420 m road in CARLA can be found in Chen et al. (2020).
Scenario II (moving vehicles): The initial position strategy of the vehicles is the same as in scenario I, but all the vehicles except the HV are set in the autopilot mode provided by CARLA, and the speed limit is 30 m/s. The HV should drive safely without colliding with any of the dynamically moving vehicles. In the evaluation phase, 100 random scenarios are sampled to evaluate our proposed methods.
Brief information about the HV dynamics used in CARLA is shown in Table 2. The effectiveness of our proposed methods is evaluated by the driving distance before collision, which reflects the agent's ability to avoid obstacles. Besides, the risk assessment results are also recorded, denoting the risk level while the agent controls the HV.

6. Results and discussion

6.1. Quantitative results

The quantitative results when using different methods in scenario-I are shown in Fig. 6 and Table 3. In Fig. 6, the baseline means a
random action strategy, denoting the difficulty level of the experimental scenario. The methods with prefix RA denote the proposed
risk awareness based methods. The episode number in Fig. 6 means the number of evaluation experiments. In Table 3, score (μ) and
score (σ) denote the average score (i.e., the traveled distance) and the corresponding standard deviation in the examined scenarios,
respectively. nCs denotes the number of collisions with vehicles or road boundaries (e.g., curbs) occurring in the experiments.
The presented results show that the proposed methods attain better scores than the original methods. Specifically, the average scores of RA-DQN, RA-DDQN, RA-DulDQN, and RA-PRDQN are 399.6, 400.8, 192.0, and 412.0, improved by 93.8%, 124.4%, 105.5%, and 30.5% compared with the corresponding original methods, respectively. The standard deviations of the scores when using RA-DQN, RA-DDQN, and RA-PRDQN are 53.0, 24.7, and 23.8, decreased by 60.4%, 80.4%, and 82.8% compared with the corresponding original methods, respectively, indicating that the proposed methods with risk awareness achieve more stable performance. Similarly, the performance of our proposed methods on nCs shows similar trends. Specifically, after introducing the risk awareness strategy, the nCs decrease from 89, 93, and 42 when using DQN, DDQN, and PRDQN, to 30, 43, and 6 when using the improved RA-DQN, RA-DDQN, and RA-PRDQN. However, when using RA-DulDQN, the results on the standard deviation of the scores and the nCs do not show much improvement over the traditional DulDQN.
As indicated by the comparison between the proposed methods and the original methods, the improvement comes from the introduction of the risk awareness strategy. After applying the risk awareness strategy to the Q-value, the agent becomes aware of the dangerous actions that may lead the vehicle into collisions at each running step. Therefore, the Q-value with risk awareness describes the potential danger of the driving state more accurately, leading to better performance of the corresponding agent. Besides, unlike traditional methods that punish the agent only when a collision occurs (Shi et al., 2019), introducing risk awareness in our proposed methods causes the agent to be punished whenever an action that could lead to a riskier situation is taken. Therefore, the

Table 2
Brief information of the HV dynamics.
Parameter Description Value

Max rpm The maximum RPM of the vehicle engine 5000.0


Moi The moment of inertia of the vehicle’s engine. 1.0
Clutch strength The clutch strength of the vehicle in Kgm2 /s. 10.0
Final ratio The fixed ratio from transmission to wheels. 4.0
Mass The mass of the vehicle in Kg. 1000.0
Drag coefficient Drag coefficient of the vehicle’s chassis. 0.3
Steering curve Curve that indicates the maximum steering for a specific forward speed. –
Tire friction A scalar value that indicates the friction of the wheel. 2.0
Damping rate Damping rate of the wheel. 0.25
Max brake torque Maximum brake torque in Nm. 1500.0
Max handbrake torque Maximum handbrake torque in Nm. 3000.0


Fig. 6. The scores when using different methods in scenario-I (RA means risk assessment based methods, and the shadow area denotes the standard
deviation of the score).

Table 3
The metrics when using different methods in scenario-I. (Δ: the relative change rate. RA means risk assessment based methods.)
Method Score (μ) Δ(%) Score (σ) Δ(%) nCs Δ(%)

Baseline 16.7 – 7.2 – 100 –


DQN 206.1 – 134.0 – 89 –
DDQN 178.6 – 126.2 – 93 –
DulDQN 93.4 – 79.9 – 100 –
PRDQN 315.6 – 138.7 – 42 –
RA-DQN 399.6 93.8↑ 53.0 60.4↓ 30 66.3↓
RA-DDQN 400.8 124.4↑ 24.7 80.4↓ 43 53.8↓
RA-DulDQN 192.0 105.5↑ 138.9 78.1↑ 98 2.0↓
RA-PRDQN 412.0 30.5↑ 23.8 82.8↓ 6 85.7↓

proposed methods will be better in reflecting the potential risk in complex driving scenarios, and more adaptive to taking reasonable
actions to avoid near-collisions and reach a more stable performance.
The quantitative results in scenario II when using each of the examined methods are shown in Table 4 and Fig. 7. Similar trends to those in scenario I are observed. The average scores of RA-DQN, RA-DDQN, RA-DulDQN, and RA-PRDQN are 303.3, 223.5, 51.7, and 399.0, improved by 175.4%, 43.0%, 58.1%, and 76.1% compared with the corresponding original methods, respectively. The standard deviations of the scores when using RA-DDQN and RA-PRDQN are 102.6 and 63.6, decreased by 16.0% and 57.2% compared with the corresponding original methods, respectively. Besides, the nCs decrease from 100, 90, and 68 when using DQN, DDQN, and PRDQN, to 45, 78, and 12 when using the improved RA-DQN, RA-DDQN, and RA-PRDQN. In general, all the risk awareness based decision-making strategies except RA-DulDQN have better scores than the original methods in scenario II with dynamically moving traffic.
It is not surprising that PRDQN and RA-PRDQN perform much better than the others, because PRDQN based methods can deal with hard samples for learning (Schaul et al., 2016). Therefore, PRDQN based methods are more adaptive to our test scenarios, which are challenging with frequent lane changes and dense obstacles. The results also show that RA-DulDQN does not achieve an obvious improvement, which may be attributed to the fact that DulDQN mainly works in environments where actions do not affect the states (Wang et al., 2016; Qi et al., 2019). However, in our cases, most of the unsuitable actions can directly change the environment states and lead to collisions, making the DulDQN related methods unsuitable for the examined driving scenarios.
In summary, the presented quantitative results show that most of our improved methods achieve superior performance compared with the original methods in both the static obstacle scenario and the scenario with dynamically moving vehicles, demonstrating the

Table 4
The metrics when using different methods in scenario-II. (Δ: the relative change rate. RA means risk assessment based methods.)
Method Score (μ) Δ(%) Score (σ) Δ(%) nCs Δ(%)

Baseline 19.1 – 10.6 – 100 –


DQN 110.1 – 94.8 – 100 –
DDQN 156.3 – 122.2 – 90 –
DulDQN 32.7 – 28.2 – 100 –
PRDQN 226.6 – 148.6 – 68 –
RA-DQN 303.3 175.4↑ 135.3 42.7↑ 45 55.0↓
RA-DDQN 223.5 43.0↑ 102.6 16.0↓ 78 13.3↓
RA-DulDQN 51.7 58.1↑ 37.4 32.6↑ 100 0.0
RA-PRDQN 399.0 76.1↑ 63.6 57.2↓ 12 82.4↓


Fig. 7. The scores when using different methods in scenario-II. (RA means risk assessment based methods, and the shadow area denotes the
standard deviation of the score).

robustness of our proposed methods. Specifically, the RA-PRDQN method achieves the best performance among the improved
methods, hence, RA-PRDQN is used for qualitative results presentation in the following subsection.

6.2. Qualitative results

Some failure cases of PRDQN in scenario I are shown in Fig. 8. The results show that the PRDQN agent tends to make mistakes in lane change behavior, especially when the longitudinal distance between the front vehicle in the current lane and the lag vehicle in the target lane is short. In addition, the illustrated trajectories show that the motion path of PRDQN has obvious fluctuations, which suggests its instability.
The motion path and the corresponding risk assessment results when using the improved RA-PRDQN in scenario I are shown in Fig. 9. The illustrated results show that the agent can complete the entire driving task safely and steadily. The illustrated motion path is smoother than that of PRDQN because, during the experiment, the agent does not take redundant actions (e.g., making many left-turn or right-turn actions when driving straight) that could increase the risk and the complexity of the motion path, as such redundant actions are caught by our risk awareness strategy. Besides, in the places where the PRDQN agent collides with another vehicle, our improved RA-PRDQN can control the HV to safely avoid the obstacles without collisions.
In Fig. 9, the highlighted segments (red rectangle) show that the risk level continuously increases when the HV gets close to an obstacle, which is reasonable in naturalistic real-world driving. If the HV kept going straight without changing lanes, a collision would inevitably occur. However, due to the newly added risk awareness module in the deep reinforcement learning algorithms, the agent has learned to take correct actions to prevent a collision with the obstacle, making the potential risk level go back to the normal levels (attentive or safe), as shown in the green rectangle in Fig. 9.
The motion path and the corresponding risk assessment results when using the improved RA-PRDQN in scenario II are shown in Fig. 10. Similar to the results in Fig. 9, the motion path in scenario II with dynamically moving vehicles also shows smooth trajectories, and the HV makes lane changes when necessary to avoid collisions.

6.3. Difficulties of our examined scenarios

Our test scenarios include a static vehicle scenario and a moving vehicle scenario. We admit that the static scenario is easier than

Fig. 8. Some failure cases of PRDQN in scenario-I.


Fig. 9. The motion path and risk assessment results when using RA-PRDQN in scenario-I.

Fig. 10. The motion path and risk assessment result of RA-PRDQN in scenario-II.

the moving scenario. However, due to the high lane change frequency, the static scenario is still challenging. Table 5 shows a comparison of the simulated driving parameters between our research work and previous studies (Wang et al., 2018; Mirchevska et al., 2018; Ye et al., 2020; Duan et al., 2020; Chen et al., 2020). From Table 5, we can tell that the simulated driving task in our study is more challenging than those in the other studies. In the static scenario, the agent needs to make about 16 lane changes in each evaluation episode (see Fig. 9). In the moving scenario, the agent makes about 6 lane changes (see Fig. 10) at a relatively high speed (roughly between 10 and 20 m/s). Therefore, the examined lane change scenarios in the present study are more challenging compared with those in the previous studies. In these challenging scenarios, the success rate of our proposed RA-PRDQN is about 88-94% in the examined dense traffic flow with high lane change frequency, whereas the success rates of previous DRL based methods in dense lane change scenarios are usually around 90% (Saxena et al., 2020). Therefore, the success rate of our proposed RA-PRDQN should be within a reasonable and acceptable range.

6.4. Causations of the failure cases

Although our proposed RA-PRDQN has greatly improved the motion path stability and the quantitative performance, there are still some failure cases. As shown in Table 3 and Table 4, there are six collisions in scenario I and eight collisions in scenario II when using our proposed RA-PRDQN. Similar to the results presented in Fig. 8, most of the failure cases were caused by the short longitudinal


Table 5
Comparison of the scenario difficulties.
Method Total experiment distance Total obstacle number Mean barrier spacing

Wang et al., 2018 1000 m 8 125 m


Mirchevska et al., 2018 1255 m * 10 50 * 10 25 m
Ye et al., 2020 800 m – –
Duan et al., 2020 800–1000 m 2 450 m
Chen et al., 2020 420 m 2 200 m
Ours 420 m * 100 25 * 100 16 m

distance between the front vehicle in the current lane and the lag vehicle in the target lane. One of the causes of these failures is the imperfection of the sampling probability method used for the position initialization of the HV and OVs. Although the used sampling probability method (as shown in Fig. 5) can ensure that the situation with two vehicles located in parallel in two lanes blocking the road does not happen, there can be situations where two vehicles are in different lanes with a very short longitudinal distance (i.e., gap) that is not enough for the HV to execute a lane change maneuver. In this case, a collision cannot be avoided because the trained agent aims to arrive at the destination to complete the driving task. In our future work, the sampling probability method for position initialization will be improved to avoid this problem, based on which the performance of our proposed methods should be enhanced.
Another important reason leading to these failures is that the actions considered in this study for collision avoidance only cover the steering behavior. However, braking behavior is also very important for controlling the distance to the front and lag vehicles for collision avoidance (Li et al., 2019; Li et al., 2020). The lack of speed adaptation by braking may greatly contribute to those failure cases. To implement more realistic behavioral actions for autonomous driving, our future work will enrich the action space by including the braking behavior for longitudinal control together with the steering behaviors for lateral control.
The third possible cause of these failures may be that the driving style exhibited by the designed RA-PRDQN resembles an aggressive driving style. Given that driving style affects drivers' lane change decisions (Li et al., 2017), future efforts should focus on addressing this problem by considering driving style preferences in the designed decision making strategies.
Although we have introduced an exponential moving averaging strategy to smooth the motion path, and the generated trajectories (as shown in Figs. 9 and 10) are smoother than those generated by PRDQN (as shown in Fig. 8), the smoothness of the trajectories generated by RA-PRDQN is still unsatisfactory and needs to be further improved. The main reason for this problem is the discrete action space, which is a common limitation of DQN based algorithms (Li et al., 2020b; Li et al., 2021e). In our future work, we will develop continuous action space based DRL algorithms (e.g., DDPG, deep deterministic policy gradient) to solve this problem for safe and smooth lane changes.
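For reference, exponential moving averaging of the commanded steering (or of the resulting path points) can be implemented in a few lines; the sketch below is a generic illustration with an assumed smoothing factor, not the exact filter used in this study.

```python
class ExponentialMovingAverage:
    """Exponential moving average filter for smoothing a scalar command stream,
    e.g., the steering output of a discrete-action policy. The smoothing factor
    alpha (closer to 1 means less smoothing) is an assumed value."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self._value = None

    def update(self, new_value):
        if self._value is None:          # first sample initializes the filter
            self._value = new_value
        else:
            self._value = self.alpha * new_value + (1.0 - self.alpha) * self._value
        return self._value


# Usage sketch: apply the filter to the raw steering command at every control step.
smoother = ExponentialMovingAverage(alpha=0.3)
# smoothed_steer = smoother.update(raw_steer)
```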

7. Conclusion

In this paper, deep reinforcement learning algorithms combined with risk assessment functions are proposed to find an optimal driving strategy with the minimum expected risk. The experiment results show that most of the proposed methods can generate a series of actions to minimize the driving risk and prevent the host vehicle from collisions, both in scenarios with crowded static obstacles and in scenarios with dynamically moving vehicles. The best performing method among those examined is the risk-aware prioritized replay DQN (RA-PRDQN), with the following features: 1) when the HV tends to drive out of the road boundary, the policy corrects the HV to return to the driving lane; 2) the agent encourages the HV to drive in the center of the lane; 3) when the potential risk level is high, the strategy generates a series of actions to reduce the risk level and avoid potential near-collisions; 4) the agent generates correct steering decisions for lane changing when a potential collision obstacle exists. The proposed methods could be further improved by refining the sampling probability function for vehicle position initialization, adding braking behavior for speed control, considering the influence of driving style, deploying continuous action space based DRL algorithms, and optimizing the hyper-parameters of the proposed methods.

CRediT authorship contribution statement

Guofa Li: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Funding acquisition. Yifan Yang:
Methodology, Data curation, Validation, Visualization, Software, Formal analysis, Writing – original draft. Shen Li: Conceptualization,
Methodology, Writing – review & editing. Xingda Qu: Supervision, Writing – review & editing. Nengchao Lyu: Writing – original
draft. Shengbo Eben Li: Conceptualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.


Acknowledgement

This study is supported by the National Natural Science Foundation of China (grant number: 51805332), and the Shenzhen
Fundamental Research Fund (grant number: JCYJ20190808142613246, 20200803015912001).

References

Ali, M., Falcone, P., Sjoberg, J., 2012. Threat assessment design under driver parameter uncertainty. In: Proceedings of 2012 IEEE Conference on Decision and Control
(CDC), pp. 6315–6320.
Bojarski, M., Yeres, P., Choromanska, A., Choromanski, K., Firner, B., Jackel, L., Muller, U., 2017. Explaining how a deep neural network trained with end-to-end
learning steers a car. arXiv preprint arXiv:1704.07911.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W., 2016. OpenAI Gym, arXiv preprint arXiv:1606.01540.
Bosnak, M., Skrjanc, I., 2017. Efficient Time-To-Collision Estimation for a Braking Supervision System with LIDAR. In: Proceedings of 2017 IEEE International
Conference on Cybernetics, pp. 1–6.
Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M.J., Tumova, J. 2019. Reinforcement learning with probabilistic guarantees for autonomous
driving. arXiv preprint arXiv:1904.07189.
Chen, D., Jiang, L., Wang, Y., Li, Z., 2020. Autonomous driving using safe reinforcement learning by incorporating a regret-based human lane-changing decision
model. In: Proceedings of the 2020 American Control Conference (ACC), pp. 4355–4361.
Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A., 2018. End-to-end Driving via Conditional Imitation Learning. arXiv:1710.02410.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V., 2017. CARLA: An Open Urban Driving Simulator. arXiv:1711.03938.
Chen, Y., Li, G., Li, S., Wang, W., Li, S.E., Cheng, B., 2021. Exploring behavioral patterns of lane change maneuvers for human-like autonomous driving. IEEE Trans.
Intell. Transp. Syst. https://doi.org/10.1109/TITS.2021.3127491.
Duan, J., Guan, Y., Li, S.E., Ren, Y., Cheng, B., 2021. Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors. IEEE
Trans. Neural Networks Learn. Syst. https://doi.org/10.1109/TNNLS.2021.3082568.
Duan, J., Li, S.E., Guan, Y., Sun, Q., Cheng, B., 2020. Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data.
IET Intel. Transport Syst. 14 (5), 297–305.
Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for autonomous driving? the kitti vision benchmark suite. In: Proceedings of the 2012 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361.
Glaser, S., Vanholme, B., Mammar, S., Gruyer, D., Nouveliere, L., 2010. Maneuver-Based Trajectory Planning for Highly Autonomous Vehicles on Real Road With
Traffic and Driver Interaction. IEEE Trans. Intell. Transp. Syst. 11 (3), 589–606.
Grigorescu, S., Trasnea, B., Cocias, T., Macesanu, G., 2020. A survey of deep learning techniques for autonomous driving. J. Field Rob. 37 (3), 362–386.
Hasselt, H., Guez, A., Silver, D., 2015. Deep reinforcement learning with double q-learning. arXiv:1509.06461.
Hoel, C.-J., Driggs-Campbell, K., Wolff, K., Laine, L., Kochenderfer, M.J., 2020. Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for
Autonomous Driving. IEEE Trans. Intell. Veh. 5 (2), 294–305.
Huang, Y., Ding, H., Zhang, Y., Wang, H., Cao, D., Xu, N., Hu, C., 2020. A motion planning and tracking framework for autonomous vehicles based on artificial
potential field elaborated resistance network approach. IEEE Trans. Ind. Electron. 67 (2), 1376–1386.
Kahn, G., Villaflor, A., Pong, V., Abbeel, P., Levine, S. 2017. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182.
Kiran, B.R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A.A., Yogamani, S., Pérez, P., 2021. Deep reinforcement learning for autonomous driving: A survey. IEEE
Trans. Intell. Transp. Syst. https://doi.org/10.1109/TITS.2021.3054625.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine
Learning (ICML), pp. 448–456.
Kim, J., Kum, D., 2018. Collision Risk Assessment Algorithm via Lane-Based Probabilistic Motion Prediction of Surrounding Vehicles. IEEE Trans. Intell. Transp. Syst.
19 (9), 2965–2976.
Kim, J.-H., Kum, D.-S., 2015. Threat prediction algorithm based on local path candidates and surrounding vehicle trajectory predictions for automated driving
vehicles. In: Proceedings of the 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 1220–1225.
Lee, K., Kum, D., 2019. Collision Avoidance/Mitigation System: Motion Planning of Autonomous Vehicle via Predictive Occupancy Map. IEEE Access 7, 52846–52857.
Li, G., Chen, Y., Cao, D., Qu, X., Cheng, B., Li, K., 2021a. Extraction of descriptive driving patterns from driving data using unsupervised algorithms. Mech. Syst. Sig.
Process. 156, 107589. https://doi.org/10.1016/j.ymssp.2020.107589.
Li, G., Lai, W., Sui, X., Li, X., Qu, X., Zhang, T., Li, Y., 2020. Influence of traffic congestion on driver behavior in post-congestion driving. Accid. Anal. Prev. 141,
105508. https://doi.org/10.1016/j.aap.2020.105508.
Li, G., Li, S.E., Cheng, B., Green, P., 2017. Estimation of driving style in naturalistic highway traffic using maneuver transition probabilities. Transportat. Res. Part C:
Emerging Technol. 74, 113–125.
Li, G., Liao, Y., Guo, Q., Shen, C., Lai, W., 2021b. Traffic crash characteristics in Shenzhen, China from 2014 to 2016. Int. J. Environ. Res. Public Health 18 (3), 1176.
https://doi.org/10.3390/ijerph18031176.
Li, G., Yang, Y., Qu, X., 2020a. Deep learning approaches on pedestrian detection in hazy weather. IEEE Trans. Ind. Electron. 67 (10), 8889–8899.
Li, G., Yang, Y., Qu, X., Cao, D., Li, K., 2021c. A deep learning based image enhancement approach for autonomous driving at night. Knowl.-Based Syst. 213, 106617.
https://doi.org/10.1016/j.knosys.2020.106617.
Li, G., Yang, Y., Zhang, T., Qu, X., Cao, D., Cheng, B., Li, K., 2021d. Risk assessment based collision avoidance decision-making for autonomous vehicles in multi-
scenarios. Transport. Res. Part C: Emerg. Technol. 122, 102820. https://doi.org/10.1016/j.trc.2020.102820.
Li, G., Li, S., Li, S., Qu, X., 2021e. Continuous decision-making for autonomous driving at intersections using deep deterministic policy gradient. IET Intel. Transport
Syst. https://doi.org/10.1049/itr2.12107.
Li, G., Yang, L., Li, S., Luo, X., Qu, X., Paul, G., 2021f. Human-like decision-making of artificial drivers in intelligent transportation systems: an end-to-end driving
behavior prediction approach. IEEE Intell. Transp. Syst. Mag. https://doi.org/10.1109/MITS.2021.3085986.
Li, G., Li, S., Li, S., Qin, Y., Cao, D., Qu, X., Cheng, B., 2020b. Deep reinforcement learning enabled decision-making for autonomous driving at intersections.
Automotive Innovation 3 (4), 374–385.
Li, G., Yan, W., Li, S., Qu, X., Chu, W., Cao, D., 2021g. A temporal-spatial deep learning approach for driver distraction detection based on EEG signals. IEEE Trans.
Autom. Sci. Eng. https://doi.org/10.1109/TASE.2021.3088897.
Li, G., Wang, Y., Zhu, F., Sui, X., Wang, N., Qu, X., Green, P., 2019. Drivers’ visual scanning behavior at signalized and unsignalized intersections: A naturalistic
driving study in China. J. Saf. Res. 71, 219–229.
Li, M., Wu, L., Wang, J., Ammar, H., 2019. Multi-View Reinforcement Learning. In: Conference on Neural Information Processing Systems (NeurIPS), pp. 2304–2312.
Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., Pan, J., 2018. Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning.
arXiv:1709.10082.
Ma, X., Zhang, Q., Xia, L., Zhou, Z., Yang, J., Zhao, Q., 2020. Distributional Soft Actor Critic for Risk Sensitive Learning. arXiv:2004.14547.
Mirchevska, B., Pek, C., Werling, M., Althoff, M., Boedecker, J., 2018. High-level decision making for safe and reasonable autonomous lane changing using
reinforcement learning. In: Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2156–2162.


Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C.,
Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D., 2015. Human-level control through deep reinforcement learning. Nature 518
(7540), 529–533.
Moghadam, M., Elkaim, G.H., 2019. A Hierarchical Architecture for Sequential Decision-Making in Autonomous Driving using Deep Reinforcement Learning. arXiv:
1906.08464.
Mokhtari, K., Wagner, A.R., 2021. Don’t Get Yourself into Trouble! Risk-aware Decision-Making for Autonomous Vehicles. arXiv preprint arXiv:2106.04625.
NHTSA, 2019. Traffic Safety Facts 2017 (DOT HS 812 806). National Highway Traffic Safety Administration, U.S. Department of Transportation, Washington, DC, U.S.
Noh, S., 2019. Decision-Making Framework for Autonomous Driving at Road Intersections: Safeguarding Against Collision, Overly Conservative Behavior, and
Violation Vehicles. IEEE Trans. Ind. Electron. 66 (4), 3275–3286.
Noh, S., An, K., 2018. Decision-Making Framework for Automated Driving in Highway Environments. IEEE Trans. Intell. Transp. Syst. 19 (1), 58–71.
Paden, B., Cap, M., Yong, S.Z., Yershov, D., Frazzoli, E., 2016. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell.
Veh. 1 (1), 33–55.
Pan, X., You, Y., Wang, Z., Lu, C., 2017. Virtual to Real Reinforcement Learning for Autonomous Driving. arXiv:1704.03952.
Qi, X., Luo, Y., Wu, G., Boriboonsomsin, K., Barth, M., 2019. Deep reinforcement learning enabled self-learning control for energy efficient driving. Transport. Res.
Part C: Emerg. Technol. 99, 67–81.
Saxena, D.M., Bae, S., Nakhaei, A., Fujimura, K., Likhachev, M., 2020. Driving in dense traffic with model-free reinforcement learning. In: Proceedings of the 2020
IEEE International Conference on Robotics and Automation (ICRA), pp. 5385–5392.
Schaul, T., Quan, J., Antonoglou, I., Silver, D., 2016. Prioritized experience replay. In: International Conference on Learning Representations (ICLR), pp. 3240–3248.
Shi, T., Wang, P., Cheng, X., Chan, C., 2019. Driving decision and control for autonomous lane change based on deep reinforcement learning. arXiv:1904.10171.
Shin, D., Kim, B., Yi, K., Carvalho, A., Borrelli, F., 2019. Human-Centered Risk Assessment of an Automated Vehicle Using Vehicular Wireless Communication. IEEE
Trans. Intell. Transp. Syst. 20 (2), 667–681.
Shirazi, M.S., Morris, B.T., 2017. Looking at intersections: A survey of intersection monitoring, behavior and safety analysis of recent studies. IEEE Trans. Intell.
Transp. Syst. 18 (1), 4–24.
Tahir, H., Syed, M.N., Baroudi, U., 2020. Heuristic Approach for Real-Time Multi-Agent Trajectory Planning Under Uncertainty. IEEE Access 8, 3812–3826.
Tu, K., Yang, S., Zhang, H., Wang, Z., 2019. Hybrid A* Based Motion Planning for Autonomous Vehicles in Unstructured Environment. In: IEEE ISCAS, pp. 1–4.
Wang, H., Huang, Y., Khajepour, A., Zhang, Y., Rasekhipour, Y., Cao, D., 2019. Crash Mitigation in Motion Planning for Autonomous Vehicles. IEEE Trans. Intell.
Transp. Syst. 20 (9), 3313–3323.
Wang, P., Chan, C.-Y., de La Fortelle, A., 2018. A Reinforcement Learning Based Approach for Automated Lane Change Maneuvers. In: Proceedings of the 2018 IEEE
Intelligent Vehicles Symposium (IV), pp. 1379–1384.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N., 2016. Dueling network architectures for deep reinforcement learning. In: International
Conference on Machine Learning (ICML), pp. 1995–2003.
Xu, H., Gao, Y., Yu, F., Darrell, T., 2017. End-to-End Learning of Driving Models from Large-Scale Video Datasets. In: Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 3530–3538.
Ye, F., Cheng, X., Wang, P., Chan, C.Y., Zhang, J., 2020. Automated lane change strategy using proximal policy optimization-based deep reinforcement learning. In:
Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 1746–1752.
Ye, Y., Zhang, X., Sun, J., 2019. Automated vehicle’s behavior decision making using deep reinforcement learning and high-fidelity simulation environment.
Transport. Res. Part C: Emerg. Technol. 107, 155–170.
Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., Darrell, T., 2018. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv
preprint arXiv:1805.04687.
Zhu, M., Wang, X., Wang, Y., 2018. Human-like autonomous car-following model with deep reinforcement learning. Transport. Res. Part C: Emerg. Technol. 97,
348–368.
Zhu, M., Wang, Y., Pu, Z., Hu, J., Wang, X., Ke, R., 2020. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving.
Transport. Res. Part C: Emerg. Technol. 117, 102662. https://doi.org/10.1016/j.trc.2020.102662.
