
Attention-Based Highway Safety Planner for Autonomous Driving via Deep Reinforcement Learning

Guoxi Chen, Student Member, IEEE, Ya Zhang, Senior Member, IEEE, and Xinde Li, Senior Member, IEEE

Abstract—In this article, motion planning for autonomous driving on highways is studied. A high-level motion planning controller with a discrete action space is designed based on the deep Q network (DQN). An occupancy-grid-based state representation aimed at specific scenarios is proposed, and a novel attention mechanism named external spatial attention (ESA) is then designed for the occupancy grid to improve the network performance. Considering both computational complexity and interpretability, a lightweight data-driven safety layer consisting of two-dimensional linear biased support vector machines (2D-LBSVM) is proposed to improve safety. The advantages of this controller and the role of each module are illustrated by experiments. In addition, the superior performance of the occupancy grid state and the interpretability of the safety layer are further analyzed.

Index Terms—Autonomous vehicles, deep reinforcement learning, attention, safety layer.

I. INTRODUCTION

AUTONOMOUS vehicles have broad application prospects [1]. As an important direction, motion planning for autonomous vehicles is a vast and long-researched area, and has received extensive attention [2]. In this field, safety and mobility efficiency are two important criteria, and achieving high-speed, collision-free driving is particularly challenging [3].

Motion planning for autonomous vehicles mainly includes two types of strategies, model-based and model-free. Scheffe et al. [4] provided a real-time-capable model predictive control algorithm based on a nonlinear single-track vehicle model and Pacejka's magic tire formula for racing. Xing et al. [5] proposed a joint Cartesian-Frenet model predictive control planning method based on a locally linear intermediary connection which is updated along with the solving iterations. Although model-based controllers have achieved satisfactory performance, their control performance depends on precise knowledge of the model, which is hardly satisfied in practical applications. The intelligent driver model (IDM) [6] and minimizing overall braking induced by lane change (MOBIL) [7] are classical rule-based model-free controllers. However, their dependence on prior knowledge and preset parameters restricts their performance in complex scenarios. In recent years, with the development of deep learning and reinforcement learning, deep reinforcement learning (DRL) has provided a new model-free solution by treating the motion planning of autonomous driving as a Markov decision process (MDP) [2], and it has been widely studied.

In DRL, the state can be raw sensor data or condensed abstracted data. The former, such as images or LIDAR, provides the benefit of finer contextual information, while using the latter reduces the complexity of the state space [8]. There have been many works on DRL-based motion planning with condensed abstracted data. By collecting the information of the ego vehicle and the 4 closest agents to compose the state, Yu et al. [9] modeled the continuously changing topology of vehicles with a dynamic coordination graph, and proposed two basic learning approaches to coordinate the driving maneuvers of a group of vehicles. Using the information of the 8 closest agents, Alizadeh et al. [10] developed a deep reinforcement learning agent that yielded consistent performance in a variety of dynamic and uncertain traffic scenarios. Similar state selection methods can be seen in [11], [3] and [12]. Directly using a fixed number of surrounding vehicles may lose permutation invariance and ignores vehicle density, while adopting an occupancy grid can alleviate this problem [13]. Wang et al. [14] achieved efficient lane changing behavior by using a 4 × 45 occupancy grid and combining high-level lateral decision-making with low-level rule-based trajectory modification. Saxena et al. [15] used an occupancy grid to find a safe spot for autonomous vehicles amongst dense traffic on roads. Isele [16] transformed the bird's eye view into a grid in Cartesian coordinates, in which each vehicle in the grid was represented by its heading angle, velocity, and an indicator term. Leurent [17] provided an occupancy grid construction in highway-env, which ignores the information of the ego vehicle. In practical applications, the information of vehicles behind the ego vehicle on the current lane is redundant, while the lateral position information of the ego vehicle is valuable. However, these differences are not reflected in the above studies. How to represent the state as condensed abstracted data is one topic of this article.

Manuscript received 21 April 2023; revised 12 June 2023 and 13 July 2023; accepted 9 August 2023. Date of publication 11 August 2023; date of current version 17 January 2024. This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0112700, in part by the National Natural Science Foundation (NNSF) of China under Grants 61973082 and 62233003, and in part by the Natural Science Foundation of Jiangsu Province of China under Grant BK20202006. The review of this article was coordinated by Dr. Zeeshan Kaleem. (Corresponding author: Ya Zhang.) The authors are with the School of Automation, Southeast University, Nanjing 210096, China, and also with the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing 210096, China (e-mail: 230218763@seu.edu.cn; yazhang@seu.edu.cn; xindeli@seu.edu.cn). Digital Object Identifier 10.1109/TVT.2023.3304530

An appropriate neural network structure plays an important role in DRL. With the development of deep learning, attention mechanisms are believed to genuinely improve network performance [18]. As self attention [19] has been verified to be effective, attention mechanisms for different problems have been proposed [20], [21]. In intelligent transportation, Tian et al. [22] proposed a fast and accurate object detector by using a self attention mechanism in YOLOv3. Cai et al. [23] proposed graph attention-based networks to implicitly model interactions and designed deep Q-learning to train the network end-to-end for autonomous vehicles. In these studies, the attention mechanism is used to process states consisting of raw sensor data. Chen et al. [24] designed an attention layer to extract the most important local information. They concatenated the information of the ego vehicle and the surrounding vehicles as the state, and applied the attention weight to extract the similarity between the ego vehicle and the surrounding vehicles. However, such attention weights cannot be directly used on an occupancy grid.

As a model-free controller, DRL maximizes long-term reward but cannot necessarily prevent actions that cause instant large negative rewards [2], which means the safety of the controller needs to be discussed additionally. There are three methods to ensure safety: adjusting the reward function, designing an additional prediction network, and designing safety rules. It is an intuitive scheme to improve the safety rate by adjusting the reward function. Hu et al. [25] proposed a reward function to comprehensively consider the impact of backward target types, safety clearance, and vehicle roll stability on rear collisions. Tang et al. [12] designed an aggressive ego vehicle and a conservative ego vehicle respectively by adjusting the weights of the reward function. These studies adjusted the reward function without an optimization method, which consumes a lot of computing resources. Wen et al. [26] proposed a parallel constrained policy optimization method including three learning frameworks, which are used to approximate the policy function, the value function and a novel risk function. Zhang et al. [27] combined the Lyapunov method and the classical Soft Actor-Critic algorithm to develop a policy that could theoretically guarantee that the state trajectory always stays in a safe area. They also used a neural network as an evaluator of future collisions. It is a good idea to design an additional prediction network to help the DRL agent judge whether the ego vehicle will collide in the future. However, the training coupling between DRL and the prediction network increases the training difficulty, and an additional neural network also increases the computational burden. More importantly, it may not match the current control algorithm and lacks interpretability. Ali et al. [28] provided a safe deep reinforcement learning framework for automated driving, including a heuristic safety rule based on common driving practice and a dynamically-learned safety module based on the prediction of a recurrent neural network. Wang et al. [14] formulated a safety rule according to the action frequency and safety distance to prevent the DRL agent from taking dangerous actions. Simple safety rules are indeed effective, but they have poor applicability and rely on expert knowledge.

In this article, the motion planning of autonomous vehicles is treated as an MDP and solved based on the deep Q network (DQN) algorithm. An occupancy grid that includes the lane information of the ego vehicle and is free of redundant information is used as the state. Inspired by external attention (EA) [21], a novel attention mechanism named external spatial attention (ESA) is designed for the occupancy grid, which describes the contribution of surrounding vehicles at the same position to the action value function under different environments. A lightweight data-driven safety layer is further proposed, which is composed of multiple two-dimensional linear biased support vector machines (2D-LBSVM). It is interpretable and highly matched with the proposed control algorithm.

The main contributions of this article are summarized as follows.
1) The occupancy grid is used as the state and redundant information is removed, which ensures the stability of the training process and enhances the generalization ability.
2) A novel attention mechanism, ESA, is proposed to significantly improve the training speed and performance for the occupancy grid.
3) A lightweight data-driven safety layer is proposed, which is composed of multiple 2D-LBSVMs. Prior knowledge is used to reduce the complexity and degree of the classifiers. The relationship between the decision boundary and the safety rules proposed in [28] is further explained.

The rest of this article is organized as follows. In Section II, the driving scenario is introduced in detail, including the coordinate system and the vehicle model used for simulation. Section III provides the ESA-DQN with safety layer for safe motion planning on highways. In Section IV, simulations and ablation experiments are given to illustrate the effectiveness of the proposed method. The conclusion and outlook are presented in Section V.

II. PROBLEM FORMULATION

A. Driving Scenario

The driving scenario considered in this article is an autonomous driving environment on a highway with equal lane width l. All vehicles can only move forward.

The rightward direction is assumed to be the travel direction of the vehicles. The downward and rightward directions are the positive directions of the y-axis and x-axis, respectively. The centerline of the uppermost lane (the leftmost lane, for vehicles) is used as the origin of the y-axis. The x-coordinates involved in this article are all relative positions, which means that the origin of x is unimportant. The coordinate system is shown in Fig. 1.

In this scenario, the goal of the ego vehicle is to complete lane changing and overtaking, i.e., to find a driving strategy that is both safe (no collision) and efficient (high speed).


Fig. 1. Coordinate system of the scenario.

TABLE I: TERMINAL SCENARIO

In this scenario, there are three terminal states, listed in Table I. The following assumptions are made in this scenario.

Assumption 1: The ego vehicle can accurately perceive the information (speed and position) of itself and the surrounding vehicles, i.e., the environment is fully observable to the ego vehicle.

Assumption 2: The ego vehicle and the vehicle in front of it that may collide in the near future will not change lanes in the same direction at the same time.

Assumption 3: All vehicles do not change lanes continuously.

Assumption 4: The expected speed of the surrounding vehicles is less than the maximum speed.

Remark 1: Assumption 1 ensures that the ego vehicle can make reasonable decisions by correctly perceiving the surrounding environment. Assumption 2 is based on traffic rules: vehicles turn on the cornering lamp before changing lanes, so the surrounding vehicles can obtain their intention, and thus two vehicles with collision risk will not change lanes in the same direction at the same time. Assumption 3 is also based on traffic rules. Assumption 4 provides a condition under which the ego vehicle can change lanes and overtake.

B. Vehicle Kinematics

In this scenario, the vehicle model is only used to build the simulation environment. The kinematic bicycle model can be expressed as [29]:

ẋ = v cos(ψ + β),   (1)
ẏ = v sin(ψ + β),   (2)
v̇ = a,   (3)
ψ̇ = (v / l_r) sin β,   (4)
β = tan⁻¹( (l_r / (l_f + l_r)) tan δ ),   (5)

where x and y denote the longitudinal and lateral coordinates in the Frenet frame shown in Fig. 1, respectively, and l_f and l_r represent the distances from the center of the vehicle to the front and rear axles. ψ is the heading angle, β is the sideslip angle, and v is the speed. The control inputs a and δ given by the controller are the acceleration and steering angle, respectively.
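To make the simulation step concrete, the kinematics (1)–(5) can be integrated forward in time as sketched below. This is a minimal illustration assuming simple forward-Euler integration; the function name, axle lengths and time step are placeholders rather than values from the paper.

```python
import math

def bicycle_step(x, y, v, psi, a, delta, lf=2.5, lr=2.5, dt=0.1):
    """One forward-Euler step of the kinematic bicycle model (1)-(5).

    x, y: position in the Frenet frame, v: speed, psi: heading angle,
    a: acceleration command, delta: steering angle command,
    lf, lr: distances from the vehicle center to the front and rear axles.
    """
    beta = math.atan((lr / (lf + lr)) * math.tan(delta))   # sideslip angle (5)
    x_dot = v * math.cos(psi + beta)                       # (1)
    y_dot = v * math.sin(psi + beta)                       # (2)
    v_dot = a                                              # (3)
    psi_dot = (v / lr) * math.sin(beta)                    # (4)
    return (x + x_dot * dt, y + y_dot * dt, v + v_dot * dt, psi + psi_dot * dt)
```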
C. Surrounding Vehicles

Define the closest vehicle in front of the ego vehicle as the leading vehicle and the closest vehicle behind the ego vehicle as the following vehicle. Similar to [3] and [12], the surrounding vehicles' driving strategy is based on IDM and MOBIL.

As the longitudinal motion controller of a vehicle, IDM outputs the acceleration command a through interaction with the leading vehicle, which can be expressed as [6]:

a = a_max [ 1 − (v / v*)⁴ − (d* / d)² ],   (6)
d* = d_0 + v (T + τ Δv),   (7)

where a_max and v are the maximum acceleration and the speed of the controlled vehicle, respectively, and d and Δv are the distance and relative speed between the controlled vehicle and its leading vehicle, respectively. v* is the target speed, d_0 is the minimum safety distance, T is the desired time gap, and τ is an adjustable parameter for safety.

As the lateral controller of the vehicle, MOBIL determines whether the vehicle changes lanes or not. If the target lane satisfies the following conditions

ã_n ≥ b_safe,   (8)
(ā_c − ã_c) + p ((ā_n − ã_n) + (ā_o − ã_o)) ≥ Δa_th,   (9)

then the MOBIL controller provides the target lane and the steering controller designs the corresponding steering angle. Here ā_c and ã_c are the accelerations of the controlled vehicle before and after the lane change, respectively, ā_n and ã_n are the accelerations of the new following vehicle before and after the lane change, respectively, and ā_o and ã_o are the accelerations of the old following vehicle before and after the lane change. b_safe is the maximum braking imposed on a vehicle during deceleration, which is designed to avoid a collision with the new following vehicle after the lane change. p ∈ [0, 1] is the politeness coefficient, which represents the proportion of influence of the surrounding vehicles before and after the lane change: p = 0 means that a surrounding vehicle adopts a more aggressive strategy and does not consider the impact on the vehicles behind when changing lanes; conversely, p = 1 means that a surrounding vehicle is more courteous and considerate of the impact on other traffic participants, and it may not change lanes frequently. Δa_th is an adjustable threshold that motivates the controlled vehicle to change lane.

The steering controller is a proportional controller which can be expressed as follows [12]:

δ = arcsin( l w_r / (2v) ),   (10)
w_r = K_ψ (ψ* − ψ),   (11)
ψ* = arcsin( K_lat Δ_lat / v ),   (12)

where K_lat and K_ψ are the steering controller parameters, l is the lane width, and Δ_lat is the lateral position of the vehicle with respect to the lane centerline. When the MOBIL controller provides a lane-change signal, Δ_lat changes accordingly. If the controlled vehicle is required to keep its lane, the steering controller makes it move along the centerline.
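The following sketch illustrates how (6)–(9) could be evaluated in code. All parameter values are placeholders, and the MOBIL incentive is written in the usual "acceleration gain" form, with the pre-change (bar) and post-change (tilde) accelerations of the controlled vehicle, the new follower and the old follower passed in as precomputed arguments.

```python
def idm_acceleration(v, v_target, gap, dv, a_max=3.0, d0=5.0, T=1.5, tau=0.5):
    """IDM longitudinal command (6)-(7): v is the controlled vehicle's speed,
    gap and dv the distance and relative speed to its leading vehicle."""
    d_star = d0 + v * (T + tau * dv)                         # desired gap (7)
    return a_max * (1.0 - (v / v_target) ** 4 - (d_star / max(gap, 1e-3)) ** 2)  # (6)

def mobil_lane_change_ok(a_c_before, a_c_after, a_n_before, a_n_after,
                         a_o_before, a_o_after,
                         b_safe=-4.0, p=0.3, delta_a_th=0.2):
    """MOBIL test (8)-(9): safety check on the new follower plus an
    incentive test weighted by the politeness coefficient p."""
    if a_n_after < b_safe:                                   # safety criterion (8)
        return False
    gain = (a_c_after - a_c_before) + p * (
        (a_n_after - a_n_before) + (a_o_after - a_o_before))
    return gain >= delta_a_th                                # incentive criterion (9)
```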


This article focuses on designing a model-free control strategy for the ego vehicle to travel efficiently and safely when the surrounding vehicles follow the IDM and MOBIL controllers, on the basis of Assumptions 1–4.

III. CONTROLLER DESIGN

Deep reinforcement learning is deployed as the model-free controller of the ego vehicle. In this article, ESA-DQN with a lightweight data-driven safety layer is proposed as a high-level controller, and then speed and steering controllers are used to map the action output of ESA-DQN into the corresponding acceleration signal a and steering angle signal δ.

In this section, the observation space is introduced and a state space based on an occupancy grid is constructed. Then, the action space and its mapping to the speed and steering control signals are proposed. After that, DQN is deployed to estimate the action value function, and ESA is introduced to improve the network performance. Finally, the safety layer is proposed to increase the safety rate of the ego vehicle.

A. MDP

In this article, the motion planning of autonomous driving is treated as an MDP including the observation space, state space, action space and reward, which are illustrated in detail as follows.

1) State: Based on Assumption 1, the observation space of the ego vehicle in this scenario includes the speed and position of itself and of its surrounding vehicles within a certain range.

Since this scenario is fully observable, the state space contains the same information as the observation space. Considering the fitting performance of neural networks, an occupancy grid is used to describe the state.

References [11] and [10] modeled the state as discrete values with specific ranks to describe an approximate range, which is a crude description of the state. Reference [28] used the information of the six surrounding vehicles to form the state, which may not fully reflect the environment of the ego vehicle. Reference [12] applied the features of the four nearest vehicles to form a column vector as the state, while it did not consider the following vehicles. In fact, when changing lanes, the ego vehicle needs to pay attention to the following vehicle of the target lane. Reference [24] also concatenated the information of the ego vehicle and the surrounding vehicles as the state. These methods ignore the lane quantity and vehicle density. Human drivers often pay attention to all vehicles in a certain range in front of and behind the ego vehicle rather than to a fixed number of vehicles.

To keep permutation invariance, the vehicles within D_f meters in front of and D_b meters behind the ego vehicle are converted into an occupancy grid. Each grid cell, if occupied, contains the normalized relative position and relative speed. According to the scenario, the feature of the following vehicle on the current lane is treated as redundant information and removed. Compared with [17], the longitudinal position of the ego vehicle is also ignored, but the lateral position, i.e., the lane information, is retained. Experiments and theoretical analysis will show that the lane information is important. In addition, the information of the vehicles behind the ego vehicle is ignored. The detailed state space is given as follows.

The state space S = {s_1, s_2, ...} is composed of three-dimensional occupancy grids s_t ∈ R^{H×W×C}, where H represents the number of lanes involved in the state. According to Assumption 2, only the current lane, the left lane and the right lane of the ego vehicle are necessary, while the information of other lanes is superfluous in most cases, and thus H = 3. W = W_f + W_b, where W_f = ⌈D_f / l_g⌉ and W_b = ⌈D_b / l_g⌉, indicates that the distance within the observation range is divided into W grid cells of length l_g. ⌈·⌉ is the ceiling function: ⌈a⌉ gives the smallest integer greater than or equal to a. C = 4 is the dimension of the vehicle feature space, which is described further below. l_g, D_f and D_b are parameters to be designed. For convenience, they are selected such that W_f and W_b are integers and W_f > W_b. In addition, in order to fully describe the vehicle information without introducing too many dimensions, the selection of l_g depends on the vehicle length, the safe distance and the vehicle speed. Ideally, the parameters are chosen so that each cell contains at most one vehicle.

For the ego vehicle, define F_e^t = [0, y_e^t / D_f, v_{x_e}^t / v_max, v_{y_e}^t / v_max]^T as the feature vector of the ego vehicle at time t, where y_e^t, v_{x_e}^t and v_{y_e}^t are the lateral coordinate, longitudinal velocity, and lateral velocity of the ego vehicle, respectively. Compared with [17], although the relative position with x-coordinate equal to 0 is also used, the lane information y_e^t / D_f is additionally introduced to prevent the vehicle from making illegal lane changes. Define F_{a_k}^t = [x_{s_k}^t / D_f, y_{s_k}^t / D_f, v_{x_{s_k}}^t / v_max, v_{y_{s_k}}^t / v_max]^T as the feature vector of surrounding vehicle k at time t, where x_{s_k}^t, y_{s_k}^t, v_{x_{s_k}}^t and v_{y_{s_k}}^t are the relative longitudinal position, relative lateral position, relative longitudinal velocity, and relative lateral velocity of surrounding vehicle k, respectively. k satisfies 1 ≤ k ≤ m(t), where m(t) represents the time-varying number of all vehicles in the field of vision except the following vehicle in the current lane. Simple normalization based on dividing by the maximum velocity is used since it ensures a stable reinforcement learning process.

Then, the elements s_t(i, j, :) (0 ≤ i ≤ H − 1, 0 ≤ j ≤ W − 1) of the state are composed of the information about the ego vehicle and the surrounding vehicles, which can be written mathematically as

s_t(i, j, :) = { F_e^t,  if I;   F_{a_k}^t,  if II;   0_4,  otherwise },   (13)

where 0_4 denotes a 4-dimensional zero vector, and I denotes (j == W_b − 1) ∧ (i == 1), representing the position of the ego vehicle in the occupancy grid. II is (j == ⌊x_{s_i}^t / l_g⌋ + W_b) ∧ (i == ⌊y_{s_i}^t / l + 3/2⌋) ∧ ¬(x_{s_i}^t < 0 ∧ ⌊y_{s_i}^t / l + 3/2⌋ == 1), representing the positions of the surrounding vehicles in the occupancy grid, except for the vehicles behind the ego vehicle in the same lane as the ego vehicle, which have little influence on the decision-making of the ego vehicle in this scenario. If multiple surrounding vehicles are in the same cell, the information of the vehicle closest to the ego vehicle is used as the feature of that cell. == compares whether the two sides are the same: a == b returns 1 iff a = b. ∧ and ¬ are logical operators: a ∧ b is true iff a and b are both true, and ¬a is true iff a is false. ⌊·⌋ is the floor function: ⌊a⌋ gives the largest integer less than or equal to a.

TABLE II: ACTION MAPPING TABLE

2) Action: A high-level discrete action is first given by the DRL controller and is then mapped into the control signals a and δ by the speed and steering controllers.

First, the speed and steering controllers are designed. A proportional controller is constructed for speed control,

a = K_p (v_r − v),   (14)

where K_p is a parameter of the speed controller and v_r is the reference speed, which is mapped from the action of the high-level controller. The steering controller follows the form and parameters of formula (10).

Then, following [17] and [30], the high-level discrete action space A is defined to include five actions, i.e., lane change left (LLC), keep state, lane change right (RLC), accelerate and decelerate. These actions determine the reference values Δ_lat and v_r in the steering and speed controllers, and the specific mapping is shown in Table II, where δ_vr is an adjustable parameter representing the variation of the reference speed and p is the number of lanes. The steering controller prohibits illegal lane-changing commands, and the speed controller forbids acceleration commands exceeding the maximum speed.
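Because the contents of Table II are not reproduced here, the sketch below only illustrates the general shape of such a mapping, assuming each lane-change action shifts the target lane by one and each longitudinal action shifts the reference speed by δ_vr; the exact entries of the paper's Table II may differ.

```python
def map_action(action, lane_index, v_r, d_vr=5.0, n_lanes=4, v_max=40.0):
    """Illustrative mapping of the five discrete actions to controller setpoints,
    following the ordering used in the text: 0 = LLC, 1 = keep state, 2 = RLC,
    3 = accelerate, 4 = decelerate. Returns (target lane index, reference speed);
    Delta_lat is then the offset between the target lane centerline and the vehicle.
    """
    if action == 0:                        # lane change left, clamped to a legal lane
        lane_index = max(lane_index - 1, 0)
    elif action == 2:                      # lane change right
        lane_index = min(lane_index + 1, n_lanes - 1)
    elif action == 3:                      # accelerate, capped at the maximum speed
        v_r = min(v_r + d_vr, v_max)
    elif action == 4:                      # decelerate
        v_r = v_r - d_vr
    return lane_index, v_r
```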
3) Reward: The reward function combines a reward r_1 for high speed and a safety reward r_2, which are defined as follows:

r_1 = clip( (v cos ψ − v_min) / (v_max − v_min), 0, 1 ),   (15)

where (v cos ψ − v_min) / (v_max − v_min) is the normalized longitudinal speed, rewarding the ego vehicle for keeping a high speed along the lane, and clip(·) limits it to the interval [0, 1];

r_2 = { −1,  Collision;   0,  Others }.   (16)

Thus, the total reward function is

r = { 0,  Out of lane;   (λ r_1 + r_2 + 1) / (1 + λ),  Others },   (17)

where λ is the weight of the high-speed reward. Such a normalization of rewards generally performs better in deep reinforcement learning. Considering that the surrounding vehicles follow the same fixed strategy, the state transition probability is stationary but unknown.
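The reward computation of (15)–(17) can be summarized in a few lines; the speed bounds and λ below are placeholders.

```python
import math

def reward(v, psi, collided, off_lane, v_min=20.0, v_max=40.0, lam=1.0):
    """Total reward (17), built from the speed reward (15) and safety reward (16)."""
    if off_lane:
        return 0.0
    r1 = (v * math.cos(psi) - v_min) / (v_max - v_min)
    r1 = min(max(r1, 0.0), 1.0)                 # clip to [0, 1] as in (15)
    r2 = -1.0 if collided else 0.0              # collision penalty (16)
    return (lam * r1 + r2 + 1.0) / (1.0 + lam)  # normalized combination (17)
```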
B. External Space Attention-Based Deep Q Network

In this subsection, the detailed architecture of ESA-DQN is proposed and its role is illustrated.

Q-learning is a classical discrete reinforcement learning algorithm that learns the optimal action value function Q*(s, a), defined as the maximum expected return when being in state s, taking action a, and then following the optimal policy π(s, a). The corresponding mathematical description is [11]

Q*(s, a) = max_{π(s,a)} E[ Σ_{i≥t} γ^{i−t} r_i(s_i, a_i) | π(s, a), s_t = s, a_t = a ],   (18)

where γ ∈ (0, 1] is a discount factor, s_t and a_t denote the state and action at time t, respectively, and r_t(s_t, a_t) denotes the reward at time t. Furthermore, based on the Bellman equation, the action value function can be obtained by the following update [31]:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ].   (19)

Intuitively, if Q*(s_t, a_t) is obtained, then the optimal policy at state s_t is to select an action a_t maximizing Q*(s_t, a_t).

DQN is a neural network implementation of Q-learning that approximates the action-value function with a neural network with weights θ, i.e., it finds an action-value function Q(s_t, a_t; θ) ≈ Q*(s_t, a_t). The loss function is [32]

L(θ) = E_M [ ( r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ) )² ],   (20)

where Q(s, a; θ⁻) is the target action-value function, which is periodically updated from the latest action-value function Q(s_t, a_t; θ) to ensure stability. The trade-off between exploration and exploitation is handled by an ε-greedy strategy. The training process adopts a stochastic gradient descent algorithm, and mini-batches of size B of experiences (s_t, a_t, r_t, s_{t+1}) are drawn from an experience replay memory. The activation function is ReLU. In DQN, the optimal strategy π*(s, a) is

π*(s, a) = argmax_{a ∈ A} Q(s, a; θ).   (21)
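A compact PyTorch sketch of the loss (20) and the ε-greedy version of the policy (21) is shown below. Here q_net and q_target stand for any Q-network and its periodically-updated target copy; terminal-state masking and the replay buffer are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, q_target, batch, gamma=0.8):
    """TD loss of (20) on a minibatch of transitions (s, a, r, s')."""
    s, a, r, s_next = batch                              # states, actions, rewards, next states
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1) # Q(s, a; theta)
    with torch.no_grad():
        q_next = q_target(s_next).max(dim=1).values      # max_a' Q(s', a'; theta^-)
    return F.mse_loss(q_sa, r + gamma * q_next)

def select_action(q_net, s, eps, n_actions=5):
    """Epsilon-greedy version of the greedy policy (21)."""
    if torch.rand(1).item() < eps:
        return torch.randint(n_actions, (1,)).item()
    with torch.no_grad():
        return int(q_net(s.unsqueeze(0)).argmax(dim=1).item())
```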

The neural network is designed as follows. The essential network consists of a single convolutional layer without a pooling layer and two fully connected layers. A novel attention module is added to the essential network to improve its expressive power and generalization ability.

Attention mechanisms are believed to improve network performance and are especially popular in computer vision.


Fig. 2. EA versus ESA. (a) External attention. (b) External space attention.

Fig. 3. Driving scenarios. (a) Scenario I. (b) Scenario II.

However, most research uses the attention mechanism in networks with images as the state in motion planning environments [22], [23]. Although images and occupancy grids have similar forms, they have different meanings. It is unreasonable to directly apply attention mechanisms designed for computer vision [20], [21] to motion planning.

The self attention mechanism [19] used in [24] learns attention weights between the feature vectors of the ego vehicle and the surrounding vehicles. This method is also not suitable for an occupancy grid.

EA [21], which learns weights from the training dataset independently of the input, is considered to outperform self attention. Inspired by it, in this article ESA is designed to improve the ego vehicle's perception of the surrounding environment, connecting the state with the essential network.

EA, as shown in Fig. 2(a), learns the correlation among pixels with an external learnable matrix. The correlation between vehicles in an occupancy grid is far more affected by distance than that between pixels in an image. Therefore, a 3 × 3 convolution kernel without pooling layer or bias is used as the external learnable matrix, which acts as a memory of the experience replay. In addition, when normalizing, different from EA, only the zeroth dimension, which represents the vehicle features, needs to be normalized. The external learnable matrix is treated as the attention weight, and the final output is the Hadamard product of the attention weight and the input features. The detailed structure of ESA is shown in Fig. 2(b), where G@H × 3 × 3 and H@G × 3 × 3 denote G learnable convolution kernels of size H × 3 × 3 and H learnable convolution kernels of size G × 3 × 3, respectively. ESA enhances the ego vehicle's perception of the current environment. The scenario in Fig. 3 is used to illustrate the significance of ESA.

Consider surrounding vehicle 1, i.e., the green one, in Fig. 3. Its importance in the action-value function should be determined by its own surrounding vehicles. In Scenario I, there is a vehicle following surrounding vehicle 1, and thus its contribution to the action-value function Q(s, a; θ) is limited. However, as shown in Scenario II, there is no vehicle following surrounding vehicle 1, and it then makes a large contribution to the action-value function Q(s, a; θ), especially for Action 0. ESA learns the contribution of surrounding vehicles at the same position to the action value function under different environments.

Similar to [32], the whole network is optimized by the gradient descent algorithm.
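The exact layout of the ESA block is given in Fig. 2(b), which is not reproduced here, so the sketch below is only one plausible reading of the description: two bias-free 3 × 3 convolutions act as the external learnable memory, the intermediate map is normalized along the vehicle-feature dimension only, and the result is combined with the input by a Hadamard product. The channel counts and the use of a softmax for the normalization are assumptions.

```python
import torch
import torch.nn as nn

class ESASketch(nn.Module):
    """Sketch of an external spatial attention block, assuming the occupancy grid
    is fed as a C-channel map of size H x W. Two bias-free 3x3 convolutions act
    as the external learnable memory (the two kernel banks of Fig. 2(b))."""

    def __init__(self, in_channels=4, mem_channels=16):
        super().__init__()
        self.mk = nn.Conv2d(in_channels, mem_channels, 3, padding=1, bias=False)
        self.mv = nn.Conv2d(mem_channels, in_channels, 3, padding=1, bias=False)

    def forward(self, x):                     # x: (B, C, H, W) occupancy grid
        attn = self.mk(x)
        attn = torch.softmax(attn, dim=1)     # normalize the feature dimension only
        attn = self.mv(attn)                  # back to the input channel count
        return x * attn                       # Hadamard product with the input
```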
C. Safety Layer

Overemphasizing the collision penalty would make the agent give up attempting to accelerate and increase the difficulty of learning efficient actions. On the contrary, too small a punishment would make the agent aggressive, even unsafe. Safety cannot be completely ensured by increasing the collision penalty in the reward function, and repeated training is required to tune the safety rate by adjusting the reward function [12]. Adding additional networks to predict whether the ego vehicle will collide in the future increases the training difficulty and lacks interpretability [26]. Simple safety rules are intuitive and practical, but they depend on expert knowledge [28]. To reduce the collision rate while ensuring high efficiency, a lightweight data-driven safety layer is proposed to judge the action chosen by ESA-DQN. Its training is independent of and parallel to the training process of the ESA-DQN. The safety layer consists of a classifier for every potentially dangerous action, and each classifier consists of several elementary units. Giving attention to both interpretability and expressiveness, the LBSVM classifier is designed as the elementary unit to evaluate the action selected by the ESA-DQN.

As a popular discriminative classifier, the support vector machine (SVM) has an interpretable solution for linearly separable problems, which is very suitable for this scenario.

The standard linear SVM solves the following optimization problem [33]:

min_{w, b, ξ}  (1/2) ‖w‖² + C Σ_{i=1}^{N} ξ_i,
s.t.  y_i (w · x_i + b) ≥ 1 − ξ_i,
      ξ_i ≥ 0,  i = 1, 2, ..., N,   (22)

Fig. 4. Schematic diagram of 2D-LBSVM.

where w and b are the parameters of the classification hyperplane, and x_i and y_i are the feature vector and label of the samples, respectively. In this scenario, dangerous samples are labeled as True, while safe samples are labeled as False. N is the size of the training set. ξ_i are slack variables, which are introduced to allow some classification errors.

Such an SVM classifier (22) treats positive and negative samples equally. However, in this scenario, a negative sample only means that the action executed in the current state turned out to be safe, without considering the decisions of the surrounding vehicles. If both positive and negative labels occur in some similar states, the decisions of the other vehicles in such states affect the label. Thus, the penalty for misclassifying positive samples should be much larger.

In addition, the decision boundary of LBSVM is designed to be far away from the positive samples. The schematic diagram of the 2D-LBSVM is shown in Fig. 4. The black solid line represents the decision boundary, the red points are the positive samples, and the blue points are the negative samples. The distance from the red support vectors to the decision boundary is 1 + M, which is greater than the distance from the blue support vectors to the decision boundary.

Mathematically, LBSVM solves the following optimization problem:

min_{w, b, ξ}  (1/2) ‖w‖² + C_{+1} Σ_{i=1}^{N_{+1}} ξ_i + C_{−1} Σ_{j=1}^{N_{−1}} ξ_j,
s.t.  w · x_i + b ≥ 1 + M − ξ_i,  when y_i = 1,
      w · x_j + b ≤ −1 + ξ_j,  when y_j = −1,
      ξ_i ≥ 0,  i = 1, 2, ..., N_{+1},
      ξ_j ≥ 0,  j = 1, 2, ..., N_{−1},   (23)

where M > 0 represents the deviation of the decision boundary from the positive samples, and C_{+1} ≫ C_{−1} implies that the classification errors of negative samples are unimportant. N_{+1} and N_{−1} are the numbers of positive and negative samples, respectively. Compared with neural networks, LBSVM only needs a small amount of data and training resources. Its training is decoupled from the DRL, which does not increase the training difficulty. Next, LBSVM (23) is used as a component of the classifier for each action.
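Since (23) is an ordinary convex program, it can be solved with any QP solver; the sketch below instead minimizes the equivalent biased hinge loss by subgradient descent, which keeps the example self-contained. The constants M, C_{+1} and C_{−1} are placeholders.

```python
import numpy as np

def train_lbsvm(X, y, M=1.0, C_pos=100.0, C_neg=1.0, lr=1e-3, epochs=2000):
    """Subgradient descent on the biased hinge-loss form of (23).

    X: (N, 2) feature matrix (relative position, relative speed), y in {+1, -1},
    where +1 marks a dangerous (positive) sample. C_pos >> C_neg makes
    misclassified positives far more costly, and M pushes the boundary away
    from the positive class.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = X @ w + b
        pos_viol = (y == 1) & (margins < 1.0 + M)    # positives inside the 1 + M margin
        neg_viol = (y == -1) & (margins > -1.0)      # negatives inside the unit margin
        grad_w = w - C_pos * X[pos_viol].sum(axis=0) + C_neg * X[neg_viol].sum(axis=0)
        grad_b = -C_pos * pos_viol.sum() + C_neg * neg_viol.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def is_dangerous(w, b, x):
    """Classify a (state, action) feature as dangerous if it lies on the positive side."""
    return float(np.dot(w, x) + b) >= 0.0
```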
In order to make the classifier more suitable for the control algorithm, the training samples of the classifier are obtained from the training process of ESA-DQN. The rule for labeling samples is first introduced, and then the composition of the classifier for each action is proposed. The dimension of LBSVM (23) is further reduced according to the scenario.

Based on historical attempts, an action a_t that causes s_{t+1} to become Terminal II is recorded as a dangerous action under state s_t. Note that deceleration has little impact on the collision; as a consequence, a_t == 4 means that executing a_{t−1} in state s_{t−1} was dangerous. Similarly, a_{t−1} == 4 ∧ a_t == 4 means that executing a_{t−2} in state s_{t−2} was dangerous. The nearest non-decelerating action is always traced back and held responsible for the final collision. This action and its corresponding state are marked as a positive sample.

Since Action 4 should not be evaluated as a dangerous action, only the other four actions are equipped with a classifier.
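The labeling rule can be summarized as: when an episode ends in Terminal II, walk backwards and blame the most recent non-decelerating action. A minimal sketch, assuming Action 4 denotes deceleration:

```python
def label_collision_episode(states, actions, collided):
    """Return (state, action) pairs labeled as positive (dangerous) samples.

    states[k], actions[k] are the state and action at step k of one episode;
    collided indicates the episode ended in Terminal II. The most recent
    non-decelerating action (!= 4) before the collision is blamed.
    """
    positives = []
    if collided:
        for k in range(len(actions) - 1, -1, -1):
            if actions[k] != 4:            # skip trailing deceleration actions
                positives.append((states[k], actions[k]))
                break
    return positives
```

All other visited (state, action) pairs serve as negative samples, as in Algorithm 1.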
vehicles that may be involved in a collision. Note that the safety
where M > 0 represents the deviation of the decision boundary of Action 0 and 2 depends on the vehicles in the target lane and
from the positive samples, C+1  C−1 implies that the classi- the current lane, and need similar criteria. In order to make full
fication errors of negative samples are unimportant. N+1 and use of the data, the samples of Action 0 and 2 are combined.
N−1 are the size of the positive samples and negative samples. Table III is used to describe the composition and input of the
Compared with neural networks, LBSVM only needs a small classifier for different actions.
amount of data and training resources. Its training is decoupled For convenience, a softmax layer is added after the output of
from DRL, which does not increase the training difficulty. Next, ESA-DQN. When the selected action is rejected by the safety
LBSVM (23) will be used as a component of classifier for each layer, the corresponding Q value is modified to −1, so that other
action. actions can still be setected by argmax function. In this section,
In order to make the classifier more suitable for the control the relationship between state, ESA-DQN and lightweight safety
algorithm, the training samples of the classifier are obtained from layer is shown in Fig. 5.
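The interaction between the ESA-DQN scores and the safety layer described above can be sketched as follows: every LBSVM attached to the currently preferred action must declare it safe, otherwise that action's score is overwritten with −1 and the argmax is re-evaluated. The classifier interface follows the LBSVM sketch given earlier; the data layout of the feature inputs is an assumption.

```python
import numpy as np

def safe_action(q_values, classifiers, features):
    """Pick the highest-scoring action accepted by the safety layer.

    q_values:    softmax scores of ESA-DQN for the 5 actions.
    classifiers: dict action -> list of (w, b) LBSVM units (Action 4 has none).
    features:    dict action -> list of 2-D inputs, one per classifier unit.
    """
    scores = np.array(q_values, dtype=float)
    for _ in range(len(scores)):
        a = int(scores.argmax())
        units = classifiers.get(a, [])
        dangerous = any(np.dot(w, x) + b >= 0.0
                        for (w, b), x in zip(units, features.get(a, [])))
        if not dangerous:
            return a
        scores[a] = -1.0          # reject and fall back to the next-best action
    return 4                       # defensive fallback: decelerate (never rejected)
```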


Fig. 5. Structure diagram of the whole high-level controller.

For a large MDP such as this scenario, the state space cannot be sufficiently sampled, and convergence cannot be guaranteed theoretically [34]. An efficient solution is to approximate Q*(s_t, a_t) with a learnable neural network [35]. In this article, a more suitable neural network is designed for this scenario, and a residual connection technique similar to [36] is used to stabilize the algorithm. The DRL also contains a number of important innovations, such as experience replay and a fixed target Q-network, to stabilize the learning algorithm [32]. A safety layer is used to prevent the agent from taking unsafe actions, alleviating the safety risks brought by the approximation. Based on these ideas, the proposed ESA-DQN with safety layer can converge to an approximately optimal solution. It is summarized in Algorithm 1.

Algorithm 1: ESA-DQN With Safety Layer.
  Training for ESA-DQN:
  Initialization:
    Initialize replay memory D to capacity N.
    Initialize Q(s, a; θ) with random weights θ.
    Initialize Q(s, a; θ⁻) with random weights θ⁻.
    Initialize the ESA module with random weights.
  For timestep = 1, ..., M do
    Obtain state s_t according to (13).
    Select a random action a_t with probability ε, otherwise select a_t according to (21).
    Execute a_t and observe the next state s_{t+1} and reward r_t.
    Store (s_t, a_t, r_t, s_{t+1}) in the replay memory buffer D.
    If s_{t+1} is Terminal III:
      Reset environment.
    Elif s_{t+1} is Terminal II:
      Reset environment.
      Store positive samples based on the labeling rule in training set T.
    Elif s_{t+1} is Terminal I:
      Reset environment.
      Store negative samples (s_t, a_t) in training set T.
    Else:
      Store negative samples (s_t, a_t) in training set T.
    End if
    If timestep > minibatch size B:
      Randomly sample B transitions from D.
      Perform a gradient descent step on the loss (20).
      Update Q(s, a; θ⁻) = Q(s, a; θ) every 50 timesteps.
  End for
  Training for the safety layer:
    Take the positive samples of each action and randomly take the same number of negative samples from T to train the classifier of each action based on (23).

Remark 2: Compared with adopting an additional recurrent neural network [28] or modifying the reward function and retraining [12], the introduction of an extra LBSVM is a lightweight approach. It does not necessitate retraining the neural network. Owing to the scarce number of collision samples, the problem scale addressed by the SVM is comparatively small. Furthermore, adopting a linear boundary further reduces the computational complexity. In practical experiments, the time consumption introduced by this module is less than 10 seconds, which is negligible compared to the training time of the neural network.

IV. EXPERIMENTS

In this section, firstly, the environment of the experiments and the hyperparameters of the network are described. Secondly, the performance of ESA-DQN with safety layer is evaluated through several simulation experiments, and the results are compared with the Soft Actor-Critic (SAC) and DQN of [12]. The effect of the ESA mechanism is demonstrated through experimental results. Furthermore, a comparative analysis of the performance and computational resource consumption of our proposed method and of alternative strategies that incorporate additional prediction networks is presented. The algorithm's robustness to observation noise, various environments and different observation resolutions is further discussed. Finally, some related issues including the selection of the state and the safety layer are discussed.

A. Experimental Setup

The highway-env environment provided by [17] is adopted as the driving scenario. Some parameter modifications are made to enable comparison with [12]; they are given in detail in Table IV.

ESA-DQN is built on the framework of stable-baselines3 [37] and PyTorch 1.12 [38]. Its hyperparameters are shown in Table V. ESA-DQN and the safety layer are trained respectively on a desktop computer equipped with an Intel i7-10700 CPU, 32 GB memory, and an Nvidia GeForce RTX 3060Ti GPU with 8 GB memory.

B. Experimental Results

In the same training process with M = 8 × 10^5 timesteps, the performance is tested over 100 episodes, and the experimental comparison results are shown in Table VI. The safety rate and the average speed over 100 episodes are used as the evaluation indexes. Under the condition of similar average speed, the safety rate of Algorithm 1 is greatly improved compared with that of the Aggressive SAC in [12]. Under the condition of similar safety rate, Algorithm 1 achieves a higher average speed than

the Conservative SAC in [12]. Compared with the simple DQN algorithm, the safety rate and the average speed are both improved.

TABLE IV: DRIVING SCENARIO
TABLE V: HYPERPARAMETERS OF ESA-DQN WITH SAFETY LAYER
TABLE VI: SIMULATION RESULTS

The advantage of Algorithm 1 is mainly attributed to the state selection and the safety layer. The occupancy grid improves the generalization ability of the deep network and increases the estimation accuracy of the Q value function, which is discussed further below. The additional safety layer improves the safety rate of the ego vehicle without significantly decreasing the speed, since it does not hinder the ego vehicle's exploration of high-speed driving.

The work [12] provides a method to obtain an aggressive agent or a conservative agent by adjusting the reward function. However, adjusting the reward and retraining consumes huge computing resources. In addition, increasing the collision penalty may make the ego vehicle unwilling to accelerate. As can be seen from the results of [12], if a high safety rate is obtained by adjusting the reward, the speed is significantly reduced; in other words, it is difficult to achieve a good balance between safety and efficiency. In this article, the data collected during the exploration of ESA-DQN are used to train the safety layer independently of the ESA-DQN training. Considering that the safety layer is composed of several 2D-LBSVM classifiers, its consumption of computing resources is almost negligible compared with the training process of the DRL.

A lower control frequency than in [12] is adopted. Different from continuous reinforcement learning controllers, a high control frequency is not necessary for a discrete high-level controller. In addition, the discount factor γ is designed by considering the control frequency and the speed of the surrounding vehicles. Too large a γ causes the agent to attribute a collision many steps later to the initial acceleration behavior, which is inappropriate for this scenario. Conversely, too small a γ often leaves the ego vehicle in a hopeless situation: although a collision has not yet occurred, it cannot be prevented even if the ego vehicle decelerates continuously.

In the next subsection, the effect of ESA is highlighted by discussing different training timesteps.

C. Ablation Experiment

In this subsection, ablation experiments are deployed to illustrate the role of each module. Since ESA is an improvement of EA in this scenario, in addition to comparing the performance with and without the ESA module, we also include a comparison between ESA-DQN and EA-DQN, which is designed by using the EA module. Different training timesteps are tested with the same random seed and environment, and the results are shown in Table VII, Fig. 6, Fig. 7 and Fig. 8.


TABLE VII: ABLATION EXPERIMENT

Fig. 6. Comparison results based on the reward.

Fig. 7. Comparison after introducing the ESA.

Fig. 8. Comparison after introducing the safety layer.

Fig. 6 presents the comparison results based on the reward (17). It is observed that the reward of ESA-DQN with safety layer is higher than those of the other algorithms, especially when the number of training steps is less than 8 × 10^4. Moreover, after about 8 × 10^4 timesteps of training, ESA-DQN obtains an approximately optimal reward, while similar results require 4 × 10^5 timesteps of training for DQN. Similarly, after 8 × 10^4 timesteps of training, ESA-DQN with safety layer performs approximately optimally, while similar results require 8 × 10^5 timesteps of training for DQN with safety layer. However, as the number of training timesteps increases, DQN with safety layer begins to learn that vehicles in different positions carry varying levels of importance, and the performance of the two approaches becomes comparable. Hence, the ESA module improves the training efficiency. Additionally, Table VII provides the numerical results underlying the reward curves, including the safety rate and the average speed. The safety rates and average speeds corresponding to the approximately optimal rewards under the different algorithms are shown in bold in the table. Table VII and Fig. 7 also show that ESA helps DQN understand more quickly the different effects that surrounding vehicles at similar locations have on the action value function in different scenarios.

From Table VII, if the average speed of the ego vehicle exceeds 30 m/s with a satisfactory safety rate, the controller is performing approximately optimally and can perform lane-changing and overtaking behaviors to improve efficiency without compromising safety. After 8 × 10^4 timesteps of training, the safety rate of ESA-DQN with safety layer stabilizes above 0.97 and the average speed remains above 30 m/s, which means the controller performs approximately optimally. However, under identical training steps, DQN with safety layer only achieves a safety rate of 0.93. Moreover, in the early training, it is observed from the table that in most cases the safety rate of DQN with safety layer is larger than that of ESA-DQN with safety layer at a similar average speed.

If the EA module is used, the expressive performance of the network decreases. This is because the EA mechanism focuses on learning the dependency between unrelated vectors in the grid, which leads to confusion in the network's representation and has a


TABLE VIII: WHETHER LANE INFORMATION IS AVAILABLE OR NOT

Fig. 9. Operational mechanism of ESA.

negative impact on performance. The safety layer greatly improves the safety rate without significantly affecting the average speed and without retraining.

The average speed and safety rate of the controller with and without the safety layer, when the lane information is available or unavailable, are compared in Table VIII, Fig. 6 and Fig. 8. The controller may issue an illegal lane-changing command if it does not recognize the lane information. Although this command is rejected by the steering controller, the ego vehicle misses the timely and effective control action. Therefore, although the data in the occupancy grid are relative positions and relative velocities, the introduction of lane information is necessary.

Random action with safety layer is also tested. The average speed is 25.22 m/s and the safety rate is 0.94. This shows that the reinforcement learning algorithm enables the ego vehicle to learn strategies that improve the average speed.

In the following, we demonstrate how the ESA mechanism helps the agent learn in the early stages.

When a scene such as the one shown in Fig. 9(a) occurs, the visualization of the ESA mechanism is shown in Fig. 9(b), where darker shades represent higher attention weights. In this case, although the vehicle circled in red also affects the decision-making of the ego vehicle, it appears to be overshadowed by the attention mechanism due to the influence of the vehicle circled in blue. If the vehicle circled in blue is removed from the observation, the attention map is as shown in Fig. 9(c), and the impact of the vehicle circled in red becomes evident. The influence of the other vehicles on the ego vehicle is much smaller since they are at least 90 meters away from it.

Fig. 10. Time consumption comparison.

TABLE IX: COMPARISON WITH THE PERFORMANCE OF THE SOLUTION WITH AN ADDITIONAL PREDICTION NETWORK

The approach using an additional deep network for prediction, such as the one proposed in [28], is also compared. In this problem, since the observation is an occupancy grid, ConvLSTM [39] is used as the basic unit of the prediction network. The prediction of a collision two steps ahead is used to determine whether the current reward should be penalized with a negative value. The number of parameters of the prediction network is set to half that of the main DRL network. The time consumption of the prediction network and the main DRL network is shown in Fig. 10, and the performance comparison is shown in Table IX. From the experimental results, it is evident that the prediction network offers the agent an early negative reward, giving it the ability to anticipate future collisions. However, due to the coupled learning of the two networks, the improvement in the safety rate is not as apparent as with the approach proposed in this article. In addition, even though the DRL training process handles the task of data collection, which consumes more time, the training time of the ConvLSTM is still not negligible compared to the main training process. When its parameter count is half that of the main DRL network, using a prediction network results in a 36.47% increase in computational cost.

Next, the robustness of the algorithm is tested when there are random noises in the observations and when the strategy parameters of the surrounding vehicles (i.e., the politeness parameter p) are different. We test four different settings: p = 1, p = 0, p = 0.5, and assigning a unique p to each surrounding vehicle, where p is a random number that follows a uniform distribution. The results are shown in Table X and Table XI.

We add a random number that obeys a uniform distribution to the observation to simulate noise pollution. We consider three noise intensities, whose energies are expected to account for 5%, 10% and 30% of the observation signal, respectively. When the observation is corrupted by noise, the control performance may decrease. Nevertheless, the proposed algorithm demonstrates stronger noise resistance than the normal DQN due to

the safety margin and the remarkable generalization capability of the simple linear safety layer.

TABLE X: PERFORMANCE TESTING WITH OBSERVATIONS CONTAINING NOISES
TABLE XI: PERFORMANCE TESTING WITH DIFFERENT STRATEGY PARAMETERS FOR SURROUNDING VEHICLES

Like many DRL algorithms, the performance of the algorithm is significantly influenced by changes in the environment. Fortunately, the LBSVM mitigates this impact, and the performance of the algorithm remains satisfactory. These experiments demonstrate that the proposed algorithm has good generalization ability.
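For reproducibility, the noise test can be implemented by adding zero-mean uniform noise whose expected energy is a chosen fraction of the mean observation energy; the sketch below is one way to do this and is not taken verbatim from the paper.

```python
import numpy as np

def add_uniform_noise(obs, energy_ratio=0.05, rng=None):
    """Corrupt an observation with zero-mean uniform noise whose expected
    energy is energy_ratio times the mean signal energy (5%, 10%, 30% in the tests)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_energy = np.mean(obs ** 2)
    amp = np.sqrt(3.0 * energy_ratio * signal_energy)   # Var of U(-a, a) is a^2 / 3
    return obs + rng.uniform(-amp, amp, size=obs.shape)
```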
We vary the resolution of observation by changing lg to test the method may cause that the space position of the surrounding
algorithm’s dependence on hyperparameters of the observation. vehicles and their position in the state vector of the ego vehicle
The results are presented in Table XII. cannot be one-to-one corresponding.
It is more intuitive to use occupancy grid as the state to display Consider Fig. 11(a), (b) and (c). Assume that the speed of
the spatial information of the observation to the agent than surrounding vehicles (blue) is the same, and the longitudinal
surrounding vehicles’ information. As for resolution, a small position of vehicles in the same column of different scenario is
value of lg means a finer spatial partitioning, which increases also the same. In these cases, if the speeds of all the vehicles are
the number of the neural network parameters. A large value the same, their state vectors are similar. However, for Scenario I,
of lg leads to a coarser spatial partitioning, which may result it is a reasonable choice to keep going straight, while for Scenario
in multiple vehicles occupying the same grid. In such cases, II, going straight is dangerous due to the high speed of ego
D. Discussion

State: DQN is used to fit the action-value function. It judges the value of an action according to the current state, and it is used to handle situations where the number of states is infinite. Thus, the network must have generalization ability; in other words, the Q value of an action obtained from similar states should be similar. The occupancy grid is therefore used instead of stacking the information of the surrounding vehicles, which improves the generalization ability of the model.

If the information of the surrounding vehicles is stacked as in [12], the order of the information may be ambiguous. The work [17] provides a solution by sorting the vehicles by the absolute value of their distance from the ego vehicle, which may be feasible. However, this method may break the one-to-one correspondence between the spatial positions of the surrounding vehicles and their positions in the state vector of the ego vehicle.

Consider Fig. 11(a), (b), and (c). Assume that the speed of the surrounding vehicles (blue) is the same, and that the longitudinal positions of the vehicles in the same column of the different scenarios are also the same. In these cases, if the speeds of all the vehicles are the same, their state vectors are similar. However, for Scenario I it is reasonable to keep going straight, while for Scenario II going straight is dangerous due to the high speed of the ego vehicle. In Scenario III, it is wise to choose LLC. If only Scenario I and Scenario II appear in the training set, it may be difficult for the ego vehicle to make an optimal decision for Scenario III, because it is difficult for a Q function approximated by a neural network to output significantly different Q values for similar states.

From the standpoint of the training process, if the ego vehicle obtains a large penalty due to LLC in Scenario I, the corresponding weights will also change according to the gradient descent algorithm. However, in Scenario III, the input features corresponding to these weights belong to the vehicle in the right lane. The mismatch between the input feature vector and the spatial position of the vehicle makes it difficult for the network to converge.
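The ordering issue can be made concrete with a toy example (illustrative only; the variable names and feature layout are not those of the actual state): when neighbours are sorted by absolute distance, a small change in one gap reshuffles which slot of the stacked state vector a given physical vehicle occupies, so nearly identical scenes produce differently ordered inputs.

def stacked_state(ego_x, neighbours, k=2):
    """neighbours: list of (position, speed); keep the k closest, sorted by |x - ego_x|."""
    ranked = sorted(neighbours, key=lambda veh: abs(veh[0] - ego_x))[:k]
    return [feat for veh in ranked for feat in (veh[0] - ego_x, veh[1])]

ahead, behind = (12.0, 22.0), (-11.0, 25.0)        # (position, speed)
print(stacked_state(0.0, [ahead, behind]))         # [-11.0, 25.0, 12.0, 22.0]
print(stacked_state(0.0, [ahead, (-13.0, 25.0)]))  # [12.0, 22.0, -13.0, 25.0]
# The leading vehicle moves from slot 1 to slot 0 although the scene barely changed.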


The work [17] provides the occupancy grid to obtain a similar state form. Compared with it, the introduction of lane information enables the ego vehicle to make the correct decision, rather than an illegal lane change, when it is in the leftmost or rightmost lane.

In addition, the information of the vehicles behind the ego vehicle in the current lane is removed, since it is redundant for an ego vehicle traveling at a higher speed.

Safety layer: The work [28] provides a handcrafted safety module as follows to prohibit dangerous actions:

dmin = (vav − vfront) × tmin, (24)

where vav and vfront represent the speeds of the ego vehicle and the leading vehicle on the same lane, respectively. When |xav − xleading| < dmin, the ego vehicle applies the maximum deceleration, where xav and xleading represent the horizontal coordinates of the ego vehicle and the leading vehicle in the same lane, respectively. Such a safety rule requires dmin and tmin to be selected in advance, which depends on expert knowledge.
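For clarity, rule (24) can be read as the following check; the function name and argument order are ours for illustration and are not taken from [28]:

def rule_based_brake(x_av, v_av, x_leading, v_leading, t_min):
    """Handcrafted safety rule of Eq. (24): brake hard when the gap is below dmin."""
    d_min = (v_av - v_leading) * t_min        # Eq. (24), with vfront = v_leading
    return abs(x_av - x_leading) < d_min      # True -> command maximum deceleration

Here tmin must still be chosen by an expert, which is the limitation addressed next.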
Taking the classifier of Action 3 (acceleration) as an example, according to the collected safe and dangerous samples, the decision boundary of the 2D-LBSVM is

−(117.9/120) × (xleading − xav) − (72.28/33) × (vleading − vav) + 7.31 = 0. (25)

Simplifying it, we obtain

xleading − xav = 2.23 × (vav − vleading) + 7.44, (26)

which is equivalent to the safety rule (24) with dmin = 7.44 and tmin = 2.23. Compared with a preset value, this is a data-driven selection method, which is more consistent with the proposed control algorithm. The decision boundaries of the other action classifiers are summarized in Table XIII.
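The learned boundary can thus serve as a drop-in replacement for the handcrafted check. The sketch below evaluates (25) directly and notes the rearrangement that yields (26); the sign convention for the dangerous side is an assumption, since the classifier labels are not reproduced here.

def acceleration_is_dangerous(x_av, v_av, x_leading, v_leading):
    """Learned 2D-LBSVM boundary for Action 3 (acceleration), Eq. (25)."""
    gap = x_leading - x_av
    dv = v_leading - v_av
    return -(117.9 / 120.0) * gap - (72.28 / 33.0) * dv + 7.31 > 0.0  # assumed: positive side = dangerous

# Rearranging Eq. (25) recovers the rule form of Eq. (24):
#   gap = (120/117.9) * ((72.28/33) * (v_av - v_leading) + 7.31)
#       ≈ 2.23 * (v_av - v_leading) + 7.44,
# i.e. a data-driven choice of tmin ≈ 2.23 and dmin ≈ 7.44 instead of expert presets.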
TABLE XIII
DECISION BOUNDARY

Most of the safety actions that are wrongly classified avoid collision only through continuous deceleration and fortunate maneuvers of the surrounding vehicles. It is helpful to define them as dangerous actions in order to trace back to the unreasonable action, so as to avoid continuous deceleration and protect the actuator.

V. CONCLUSION

This article proposes an ESA-DQN based high-level controller with a lightweight safety layer for autonomous vehicles in the motion planning scenario. The role of each module is analyzed, the training results of the safety layer are explained, and the relationship between the handcrafted safety module and the safety layer is clarified. In this article, the surrounding vehicles are modeled as rule-abiding traffic participants; if their behavior were aggressive, it would pose a greater challenge to the control of the ego vehicle, which is a challenging issue we could explore in the future.

REFERENCES

[1] A. Haydari and Y. Yılmaz, "Deep reinforcement learning for intelligent transportation systems: A survey," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 1, pp. 11–32, Jan. 2022.
[2] S. Aradi, "Survey of deep reinforcement learning for motion planning of autonomous vehicles," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 2, pp. 740–759, Feb. 2022.
[3] Z. Cao et al., "Highway exiting planner for automated vehicles using reinforcement learning," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 2, pp. 990–1000, Feb. 2021.
[4] P. Scheffe, T. M. Henneken, M. Kloock, and B. Alrifaee, "Sequential convex programming methods for real-time optimal trajectory planning in autonomous vehicle racing," IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 661–672, Jan. 2023.
[5] X. Xing, B. Zhao, C. Han, D. Ren, and H. Xia, "Vehicle motion planning with joint Cartesian–Frenét MPC," IEEE Robot. Automat. Lett., vol. 7, no. 4, pp. 10738–10745, Oct. 2022.
[6] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Phys. Rev. E, vol. 62, pp. 1805–1824, Aug. 2000.
[7] A. Kesting, M. Treiber, and D. Helbing, "General lane-changing model MOBIL for car-following models," Transp. Res. Rec., vol. 1999, no. 1, pp. 86–94, 2007. [Online]. Available: https://doi.org/10.3141/1999-10
[8] B. R. Kiran et al., "Deep reinforcement learning for autonomous driving: A survey," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4909–4926, Jun. 2022.
[9] C. Yu et al., "Distributed multiagent coordinated learning for autonomous driving in highways based on dynamic coordination graphs," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 2, pp. 735–748, Feb. 2020.
[10] A. Alizadeh, M. Moghadam, Y. Bicer, N. K. Ure, U. Yavas, and C. Kurtulus, "Automated lane change decision making using deep reinforcement learning in dynamic and uncertain highway environment," in Proc. IEEE Intell. Transp. Syst. Conf., 2019, pp. 1399–1404.
[11] C.-J. Hoel, K. Wolff, and L. Laine, "Automated speed and lane change decision making using deep reinforcement learning," in Proc. 21st IEEE Intell. Transp. Syst. Conf., 2018, pp. 2148–2155.
[12] X. Tang, B. Huang, T. Liu, and X. Lin, "Highway decision-making and motion planning for autonomous driving via soft actor-critic," IEEE Trans. Veh. Technol., vol. 71, no. 5, pp. 4706–4717, May 2022.
[13] E. Leurent, "A survey of state-action representations for autonomous driving," 2018. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01908175
[14] J. Wang, Q. Zhang, D. Zhao, and Y. Chen, "Lane change decision-making through deep reinforcement learning with rule-based constraints," in Proc. IEEE Int. Joint Conf. Neural Netw., 2019, pp. 1–6.
[15] D. M. Saxena, S. Bae, A. Nakhaei, K. Fujimura, and M. Likhachev, "Driving in dense traffic with model-free reinforcement learning," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 5385–5392.
[16] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, "Navigating occluded intersections with autonomous vehicles using deep reinforcement learning," in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 2034–2039.
[17] E. Leurent, "An environment for autonomous driving decision-making," 2015. [Online]. Available: https://github.com/eleurent/highway-env
[18] G. Brauwers and F. Frasincar, "A general survey on attention mechanisms in deep learning," IEEE Trans. Knowl. Data Eng., vol. 35, no. 4, pp. 3279–3298, Apr. 2023.


[19] A. Vaswani et al., "Attention is all you need," in Proc. Conf. Workshop Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[20] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[21] M.-H. Guo, Z.-N. Liu, T.-J. Mu, and S.-M. Hu, "Beyond self-attention: External attention using two linear layers for visual tasks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 5436–5447, May 2023.
[22] D. Tian et al., "SA-YOLOV3: An efficient and accurate object detector using self-attention mechanism for autonomous driving," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 5, pp. 4099–4110, May 2022.
[23] P. Cai, H. Wang, Y. Sun, and M. Liu, "DQ-GAT: Towards safe and efficient autonomous driving with deep Q-learning and graph attention networks," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 21102–21112, Nov. 2022.
[24] L. Chen, Y. He, Q. Wang, W. Pan, and Z. Ming, "Joint optimization of sensing, decision-making and motion-controlling for autonomous vehicles: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 71, no. 5, pp. 4642–4654, May 2022.
[25] W. Hu et al., "A rear anti-collision decision-making methodology based on deep reinforcement learning for autonomous commercial vehicles," IEEE Sensors J., vol. 22, no. 16, pp. 16370–16380, Aug. 2022.
[26] L. Wen, J. Duan, S. E. Li, S. Xu, and H. Peng, "Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization," in Proc. IEEE 23rd Int. Conf. Intell. Transp. Syst., 2020, pp. 1–7.
[27] L. Zhang, R. Zhang, T. Wu, R. Weng, M. Han, and Y. Zhao, "Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 12, pp. 5435–5444, Dec. 2021.
[28] A. Baheri, S. Nageshrao, H. E. Tseng, I. Kolmanovsky, A. Girard, and D. Filev, "Deep reinforcement learning with enhanced safety for autonomous highway driving," in Proc. IEEE Intell. Veh. Symp. (IV), 2020, pp. 1550–1555.
[29] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, "Kinematic and dynamic vehicle models for autonomous driving control design," in Proc. IEEE Intell. Veh. Symp. (IV), 2015, pp. 1094–1099.
[30] M. Mukadam, A. Cosgun, A. Nakhaei, and K. Fujimura, "Tactical decision making for lane changing with deep reinforcement learning," in Proc. Conf. Workshop Neural Inf. Process. Syst., 2017, pp. 1–7.
[31] K. Liu, H. Zhang, Y. Zhang, and C. Sun, "False data injection attack detection in cyber–physical systems with unknown parameters: A deep reinforcement learning approach," IEEE Trans. Cybern., vol. 53, no. 11, pp. 7115–7125, Nov. 2023.
[32] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[33] J. Pang, X. Pu, and C. Li, "A hybrid algorithm incorporating vector quantization and one-class support vector machine for industrial anomaly detection," IEEE Trans. Ind. Informat., vol. 18, no. 12, pp. 8786–8796, Dec. 2022.
[34] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, pp. 279–292, 1992.
[35] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, "Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems," Automatica, vol. 113, 2020, Art. no. 108759.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[37] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, "Stable-baselines3: Reliable reinforcement learning implementations," J. Mach. Learn. Res., vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/20-1364.html
[38] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. 33rd Adv. Neural Inf. Process. Syst., 2019, pp. 8026–8037.
[39] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 802–810.

Guoxi Chen (Student Member, IEEE) received the B.E. degree in automation in 2021 from Southeast University, Nanjing, China, where he is currently working toward the Ph.D. degree in control engineering. His research interests include autonomous driving, reinforcement learning, and network security.

Ya Zhang (Senior Member, IEEE) received the B.S. degree in applied mathematics from the China University of Mining and Technology, Xuzhou, China, in 2004, and the Ph.D. degree in control engineering from Southeast University, Nanjing, China, in 2010. Since 2010, she has been with Southeast University, where she is currently a Professor with the School of Automation. Her research interests include multi-agent systems, reinforcement learning, and network security.

Xinde Li (Senior Member, IEEE) received the Ph.D. degree in control theory and control engineering from the Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2007. He joined the School of Automation, Southeast University, Nanjing, China, where he is currently a Professor and Ph.D. Supervisor. His research interests include information fusion, object recognition, computer vision, and intelligent robots.

