1, JANUARY 2024
Abstract—In this article, motion planning for autonomous driving on highways is studied. A high-level motion planning controller with a discrete action space is designed based on the deep Q network (DQN). An occupancy-grid-based state representation aimed at specific scenarios is proposed, and a novel attention mechanism named external spatial attention (ESA) is then designed for the occupancy grid to improve the network performance. Considering both computational complexity and interpretability, a lightweight data-driven safety layer consisting of two-dimensional linear biased support vector machines (2D-LBSVM) is proposed to improve safety. The advantages of this controller and the role of each module are illustrated by experiments. In addition, the superior performance of the occupancy grid state and the interpretability of the safety layer are further analyzed.

Index Terms—Autonomous vehicles, deep reinforcement learning, attention, safety layer.

I. INTRODUCTION

AUTONOMOUS vehicles have broad application prospects [1]. As an important direction, motion planning for autonomous vehicles is a vast and long-researched area, and has received extensive attention [2]. In this field, safety and travel efficiency are two important criteria, and achieving high-speed, collision-free driving is particularly challenging [3].

Motion planning for autonomous vehicles mainly includes two types of strategies: model-based and model-free. Scheffe et al. [4] provided a real-time-capable model predictive control algorithm according to a nonlinear single-track vehicle model and Pacejka's magic tire formula for racing. Xing et al. [5] proposed a joint Cartesian-Frenet model predictive control planning method based on a locally linear intermediary connection which is updated along with solving iterations. Although the model-based controller has achieved satisfactory performance, its control performance depends on precise knowledge of the model, which is hardly satisfied in practical applications. The intelligent driver model (IDM) [6] and minimizing overall braking induced by lane change (MOBIL) [7] are classical rule-based model-free controllers. However, their dependence on prior knowledge and preset parameters restricts their performance in complex scenarios. In recent years, with the development of deep learning and reinforcement learning, deep reinforcement learning (DRL) provides a new model-free solution by treating the motion planning of autonomous driving as a Markov decision process (MDP) [2], which has been widely studied.

In DRL, the state can be raw sensor data or condensed abstracted data. The former, such as images or LIDAR, provides the benefit of finer contextual information, while using the latter reduces the complexity of the state space [8]. There have been many works on DRL-based motion planning with condensed abstracted data. By composing the state from the information of the ego vehicle and the 4 closest agents, Yu et al. [9] modeled the continuously changing topology of vehicles by using a dynamic coordination graph, and proposed two basic learning approaches to coordinate the driving maneuvers for a group of vehicles. Using the information of the 8 closest agents, Ali Alizadeh et al. [10] developed a deep reinforcement learning agent that yielded consistent performance in a variety of dynamic and uncertain traffic scenarios. Similar state selection methods can be seen in [11], [3] and [12]. Directly using the information of a fixed number of surrounding vehicles might lose permutation invariance and ignores the vehicle density, while adopting an occupancy grid can alleviate this problem [13]. Wang et al. [14] achieved an efficient lane changing behavior by using a 4 × 45 occupancy grid and combining a high-level lateral decision-making and low-level rule-based trajectory modification. Saxena et al. [15] used an occupancy grid to find a safe spot for autonomous vehicles to execute amongst dense traffic on roads. Isele [16] transformed the bird's eye view into a grid in Cartesian coordinates, in which each vehicle in the grid was represented by its heading angle, velocity, and an indicator term. Leurent [17] provided an occupancy grid construction in highway-env, which ignored the information of the ego vehicle. In practical applications, the information of vehicles behind the ego vehicle on the current lane is redundant, while the lateral position information of the ego vehicle is valuable. However, these differences are not reflected in the above studies. How to represent the state as condensed abstracted data is one topic of this article.

Manuscript received 21 April 2023; revised 12 June 2023 and 13 July 2023; accepted 9 August 2023. Date of publication 11 August 2023; date of current version 17 January 2024. This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0112700, in part by the National Natural Science Foundation (NNSF) of China under Grants 61973082 and 62233003, and in part by the Natural Science Foundation of Jiangsu Province of China under Grant BK20202006. The review of this article was coordinated by Dr. Zeeshan Kaleem. (Corresponding author: Ya Zhang.)
The authors are with the School of Automation, Southeast University, Nanjing 210096, China, and also with the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing 210096, China (e-mail: 230218763@seu.edu.cn; yazhang@seu.edu.cn; xindeli@seu.edu.cn).
Digital Object Identifier 10.1109/TVT.2023.3304530
0018-9545 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Fengchia University. Downloaded on January 18,2024 at 09:04:17 UTC from IEEE Xplore. Restrictions apply.
CHEN et al.: ATTENTION-BASED HIGHWAY SAFETY PLANNER FOR AUTONOMOUS DRIVING VIA DEEP REINFORCEMENT LEARNING 163
164 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 73, NO. 1, JANUARY 2024
Fig. 1. Coordinate system of the scenario.

In this scenario, there are three terminal states listed in Table I. The following assumptions are made in this scenario.

Assumption 1: The ego vehicle can accurately perceive the information (speed and position) of itself and surrounding vehicles, i.e., the environment is fully observable to the ego vehicle.

Assumption 2: The ego vehicle and the vehicle in front of it that may collide in the near future will not change lanes in the same direction at the same time.

Assumption 3: No vehicle changes lanes continuously.

Assumption 4: The expected speed of surrounding vehicles is less than the maximum speed.

Remark 1: Assumption 1 ensures that the ego vehicle can make reasonable decisions by correctly perceiving the surrounding environment. Assumption 2 is based on traffic rules. A vehicle turns on its turn signal before changing lanes, so surrounding vehicles can infer its intention. Thus, two vehicles with collision risk will not change lanes in the same direction at the same time. Assumption 3 is also based on traffic rules. Assumption 4 provides the ego vehicle with a condition for lane changing and overtaking.

B. Vehicle Kinematics

In this scenario, the vehicle model is only used to build the simulation environment. The kinematic bicycle model can be expressed as [29]:

$$\dot{x} = v \cos(\psi + \beta), \tag{1}$$
$$\dot{y} = v \sin(\psi + \beta), \tag{2}$$
$$\dot{v} = a, \tag{3}$$
$$\dot{\psi} = \frac{v}{l_r} \sin\beta, \tag{4}$$
$$\beta = \tan^{-1}\left(\frac{l_r}{l_f + l_r} \tan\delta\right), \tag{5}$$

where $x$ and $y$ denote the longitudinal and lateral coordinates in the Frenet frame as shown in Fig. 1, respectively. $l_f$ and $l_r$ represent the distance from the center of the vehicle to the front and rear axles. $\psi$ is the heading angle, $\beta$ is the sideslip angle, and $v$ is the speed. The control inputs $a$ and $\delta$ given by the controller are the acceleration and steering angle, respectively.

C. Surrounding Vehicles

Define the closest vehicle in front of the ego vehicle as the leading vehicle, and the closest vehicle behind the ego vehicle as the following vehicle. Similar to [3] and [12], the surrounding vehicles' driving strategy is based on the IDM and MOBIL.

As the longitudinal motion controller of the vehicle, IDM outputs the acceleration control command $a$ through interaction with the leading vehicle, which can be expressed as [6]:

$$a = a_{\max}\left[1 - \left(\frac{v}{v^*}\right)^4 - \left(\frac{d^*}{d}\right)^2\right], \tag{6}$$
$$d^* = d_0 + v(T + \tau \Delta v), \tag{7}$$

where $a_{\max}$ and $v$ are the maximum acceleration and the speed of the controlled vehicle, respectively. $d$ and $\Delta v$ are the distance and relative speed between the controlled vehicle and its leading vehicle, respectively. $v^*$ is the target speed, $d_0$ is the minimum safety distance, $T$ is the desired time gap, and $\tau$ is an adjustable parameter for safety.

As the lateral controller of the vehicle, MOBIL determines whether the vehicle changes lanes or not. If the target lane satisfies the following conditions

$$\tilde{a}_n \geq b_{\text{safe}}, \tag{8}$$
$$(\bar{a}_c - \tilde{a}_c) + p\big((\bar{a}_n - \tilde{a}_n) + (\bar{a}_o - \tilde{a}_o)\big) \geq \Delta a_{\text{th}}, \tag{9}$$

then the MOBIL controller provides the target lane and the steering controller designs the corresponding steering angle. Here, $\bar{a}_c$ and $\tilde{a}_c$ are the accelerations of the controlled vehicle before and after the lane change, respectively. $\bar{a}_n$ and $\tilde{a}_n$ are the accelerations of the new following vehicle before and after the lane change, respectively, and $\bar{a}_o$ and $\tilde{a}_o$ are the accelerations of the old following vehicle before and after the lane change. $b_{\text{safe}}$ is the maximum braking imposed on a vehicle during deceleration, which is designed to avoid collision with the new following vehicle after the lane change. $p \in [0, 1]$ is the politeness coefficient, which represents the proportion of the influence of surrounding vehicles before and after the lane change. $p = 0$ means that the surrounding vehicles adopt a more aggressive strategy and do not consider the impact on the vehicles behind when changing lanes. Conversely, $p = 1$ means that the surrounding vehicles are more courteous and considerate of the impact on other traffic participants, and they may not change lanes frequently. $\Delta a_{\text{th}}$ is an adjustable threshold that motivates the controlled vehicle to change lanes.

The steering controller is a proportional controller which can be expressed as follows [12]:

$$\delta = \arcsin\left(\frac{l w_r}{2v}\right), \tag{10}$$
$$w_r = K_\psi (\psi^* - \psi), \tag{11}$$
$$\psi^* = \arcsin\left(\frac{K_{\text{lat}} \Delta_{\text{lat}}}{v}\right), \tag{12}$$

where $K_{\text{lat}}$ and $K_\psi$ are the steering controller parameters, $l$ is the lane-width, and $\Delta_{\text{lat}}$ is the lateral position of the vehicle with respect to the lane center-line. When the MOBIL controller
provides a lane change signal, $\Delta_{\text{lat}}$ changes accordingly. If the controlled vehicle is required to keep its lane, the steering controller will make the controlled vehicle move along the centerline.

This article focuses on designing a model-free control strategy for the ego vehicle to travel efficiently and safely when the surrounding vehicles follow the IDM and MOBIL controllers, on the basis of Assumptions 1–4.

III. CONTROLLER DESIGN

Deep reinforcement learning is deployed as a model-free controller of the ego vehicle. In this article, ESA-DQN with a lightweight data-driven safety layer is proposed as a high-level controller, and then speed and steering controllers are used to map the action output of ESA-DQN into the corresponding acceleration signal $a$ and steering angle signal $\delta$.

In this section, the observation space is introduced, and the state space using an occupancy grid is applied. Then, the action space and its mapping laws to the speed and steering control signals are proposed. After that, DQN is deployed to estimate the action value function, and the ESA is introduced to improve the network performance. Finally, the safety layer is proposed to increase the safety rate of the ego vehicle.

A. MDP

In this article, the motion planning of autonomous driving is treated as an MDP including observation space, state space, action space and reward, which will be illustrated in detail as follows.

1) State: Based on Assumption 1, in this scenario, the observation space of the ego vehicle includes the speed and position of itself and its surrounding vehicles in a certain range.

Since this scenario is fully observable, the state space contains the same information as the observation space. Considering the fitting performance of the neural network, an occupancy grid is used to describe the state.

[11] and [10] modeled the state as discrete values with specific rank to describe an approximate range, which is a crude description of the state. [28] used the information of the six surrounding vehicles to form the state, which may not fully reflect the environment of the ego vehicle. [12] applied the features of the four vehicles nearest to the ego vehicle to form a column vector as the state space, while it did not consider the following vehicles. In fact, when changing lanes, the ego vehicle needs to pay attention to the following vehicles of the target lane. [24] also concatenated the information of the ego vehicle and surrounding vehicles as the state. These methods ignored the lane quantity and vehicle density. Human drivers often pay attention to all vehicles in a certain range in front of and behind the ego vehicle rather than a certain number of vehicles.

To keep the permutation invariance, the vehicles within $D_f$ meters in front of and $D_b$ meters behind the ego vehicle are converted into the occupancy grid. Each spatial grid, if occupied, contains the normalized relative position and relative speed. According to the scenario, the feature of the following vehicle on the current lane is treated as redundant information and removed. Compared with [17], the longitudinal position of the ego vehicle is also ignored, but the lateral position, i.e., lane information, is retained. Experiments and theoretical analysis will show that lane information is important. In addition, the information of the vehicles behind the ego vehicle is ignored. The detailed state space is given as follows.

The state space $S = \{s_1, s_2, \ldots\}$ is composed of three-dimensional occupancy grids $s_t \in \mathbb{R}^{H \times W \times C}$, where $H$ represents the number of lanes involved in the state. According to Assumption 2, only the current lane, left lane and right lane of the ego vehicle are necessary, while the information of other lanes is superfluous in most cases, and thus $H = 3$. $W = W_f + W_b$, where $W_f = \lceil D_f / l_g \rceil$ and $W_b = \lceil D_b / l_g \rceil$, indicates that the distance within the observation range can be divided into $W$ grids, and the length of each grid is $l_g$. $\lceil \cdot \rceil$ is the ceil function; $\lceil a \rceil$ gives the smallest integer greater than or equal to $a$. $C = 4$ is the dimension of the vehicle feature space, which will be further described in the following. $l_g$, $D_f$ and $D_b$ are parameters to be designed. For convenience, they are selected so that $W_f$ and $W_b$ are integers and $W_f > W_b$. In addition, in order to fully describe the vehicle information without introducing too many dimensions, the selection of $l_g$ depends on the vehicle length, safe distance and the vehicle speed. It is ideal to choose parameters so that each cell contains at most one vehicle.

For the ego vehicle, define

$$F_e^t = \left[0, \frac{y_e^t}{D_f}, \frac{v_{x_e}^t}{v_{\max}}, \frac{v_{y_e}^t}{v_{\max}}\right]^T$$

as the feature vector of the ego vehicle at time $t$, where $y_e^t$, $v_{x_e}^t$, $v_{y_e}^t$ are the lateral coordinate, longitudinal velocity, and lateral velocity of the ego vehicle, respectively. Compared with [17], although the relative position with x-coordinate being 0 is also used, the lane information $y_e^t / D_f$ is additionally introduced to prevent the vehicle from making illegal lane changes. Define

$$F_{a_k}^t = \left[\frac{x_{s_k}^t}{D_f}, \frac{y_{s_k}^t}{D_f}, \frac{v_{x_{s_k}}^t}{v_{\max}}, \frac{v_{y_{s_k}}^t}{v_{\max}}\right]^T$$

as the feature vector of surrounding vehicle $k$ at time $t$, where $x_{s_k}^t$, $y_{s_k}^t$, $v_{x_{s_k}}^t$ and $v_{y_{s_k}}^t$ are the relative longitudinal position, relative lateral position, relative longitudinal velocity, and relative lateral velocity of the surrounding vehicle $k$, respectively. $k$ satisfies $1 \leq k \leq m(t)$, and $m(t)$ represents the time-varying number of all vehicles in the field of vision except the following vehicles in the current lane. Simple normalization based on dividing by the maximum velocity is used since it can ensure a stable reinforcement learning process.

Then, the elements $s_t(i, j, :)$ $(0 \leq i \leq H - 1,\ 0 \leq j \leq W - 1)$ in the state are composed of information about the ego vehicle and surrounding vehicles, which can be written mathematically as

$$s_t(i, j, :) = \begin{cases} F_e^t, & \text{I}, \\ F_{a_k}^t, & \text{II}, \\ 0_4, & \text{others}, \end{cases} \tag{13}$$

where $0_4$ denotes a 4-dimensional zero vector, I denotes $(j == W_b - 1) \wedge (i == 1)$ representing the position of the ego vehicle in the occupancy grid. II is $(j == x_{s_i}^t / l_g + W_b) \wedge (i == y_{s_i}^t + \frac{3}{2} l) \wedge (\neg(x_{s_i}^t < 0 \wedge y_{s_i}^t + \frac{3}{2} l == 1))$ representing the position of the surrounding vehicles in the occupancy grid, except
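The grid construction described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_occupancy_grid`, the dictionary-based vehicle records, the floor-based column binning, the row index obtained by rounding the lateral offset to whole lanes, and the rule that the nearest vehicle wins when two vehicles share a cell are all choices made here for illustration.

```python
import math

def build_occupancy_grid(ego, others, d_front, d_back, cell_len, lane_width, v_max):
    """Return an H x W x 4 nested list with H = 3 lanes (assumed layout)."""
    w_f = math.ceil(d_front / cell_len)   # W_f: number of cells ahead of the ego vehicle
    w_b = math.ceil(d_back / cell_len)    # W_b: number of cells behind the ego vehicle
    height, width, channels = 3, w_f + w_b, 4
    grid = [[[0.0] * channels for _ in range(width)] for _ in range(height)]

    # Ego vehicle: row 1 (current lane), column W_b - 1; its feature keeps the
    # lateral coordinate (lane information) and sets the longitudinal entry to 0.
    grid[1][w_b - 1] = [0.0, ego["y"] / d_front, ego["vx"] / v_max, ego["vy"] / v_max]

    nearest = {}  # (row, col) -> |dx| of the vehicle currently stored in that cell
    for veh in others:
        dx, dy = veh["x"] - ego["x"], veh["y"] - ego["y"]
        row = 1 + round(dy / lane_width)          # assumption: one row per lane offset
        col = math.floor(dx / cell_len) + w_b     # assumption: floor-based binning
        if row not in (0, 1, 2) or not 0 <= col < width:
            continue                              # outside the observed lanes or range
        if row == 1 and dx < 0:
            continue                              # follower in the current lane is dropped
        if (row, col) in nearest and nearest[(row, col)] <= abs(dx):
            continue                              # keep the vehicle nearest to the ego
        nearest[(row, col)] = abs(dx)
        grid[row][col] = [dx / d_front, dy / d_front,
                          (veh["vx"] - ego["vx"]) / v_max,
                          (veh["vy"] - ego["vy"]) / v_max]
    return grid

# Hypothetical numbers, chosen only to exercise the function.
ego = {"x": 100.0, "y": 4.0, "vx": 25.0, "vy": 0.0}
others = [
    {"x": 112.0, "y": 4.0, "vx": 20.0, "vy": 0.0},  # leading vehicle, current lane
    {"x": 90.0, "y": 4.0, "vx": 24.0, "vy": 0.0},   # follower, current lane (dropped)
    {"x": 92.0, "y": 8.0, "vx": 27.0, "vy": 0.0},   # behind, adjacent lane (kept)
    {"x": 200.0, "y": 0.0, "vx": 22.0, "vy": 0.0},  # beyond D_f (ignored)
]
grid = build_occupancy_grid(ego, others, d_front=60.0, d_back=20.0,
                            cell_len=5.0, lane_width=4.0, v_max=30.0)
```

Because every vehicle is written to the cell determined by its own position, the resulting tensor does not depend on the order in which `others` is listed, which is the permutation invariance argued for in the text.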
Fig. 2. EA versus ESA. (a) External attention. (b) External spatial attention.

Fig. 3. Driving scenarios. (a) Scenario I. (b) Scenario II.

However, most studies use the attention mechanism in networks that take images as the state in the motion planning environment [22], [23]. Although images and occupancy grids have similar forms, they have different meanings. It is unreasonable to directly apply an attention mechanism designed for computer vision [20], [21] to motion planning.

The self attention mechanism [19] in [24] learns the corresponding attention weights between the feature vectors of the ego vehicle and surrounding vehicles. This method is also not suitable for an occupancy grid.

EA [21], which learns weights from the training dataset independently of the input, is considered to outperform self attention. Inspired by it, in this article ESA is designed to improve the ego vehicle's perception of the surrounding environment, and it connects the state with the essential network.

The EA, as shown in Fig. 2(a), learns the correlation among pixel points with an external learnable matrix. The correlation between vehicles in an occupancy grid is far more affected by distance than that between pixels in an image. So a 3 × 3 convolution kernel without pooling layer and bias is used as the external learning matrix, which acts as a memory of the experience replay. In addition, when normalizing, different from EA, it is only required to normalize the zeroth dimension, which represents the features of vehicles. The external learnable matrix is treated as the attention weight, and the final output is the Hadamard product of the attention weight and the input features. The detailed structure of ESA is shown in Fig. 2(b), where G@H × 3 × 3 and H@G × 3 × 3 mean G learnable convolution kernels of size H × 3 × 3 and H learnable convolution kernels of size G × 3 × 3, respectively. ESA enhances the ego vehicle's perception of the current environment. The scenario in Fig. 3 is used to illustrate the significance of ESA.

Consider the surrounding vehicle 1, i.e., the green one, in Fig. 3. Its importance in the action-value function should be determined by its surrounding vehicles. In Scenario I, there is a vehicle following surrounding vehicle 1, and then its contribution to the action-value function $Q(s, a; \theta)$ is limited. However, as shown in Scenario II, there is no vehicle following surrounding vehicle 1, so it will make a huge contribution to the action-value function $Q(s, a; \theta)$, especially for Action 0. ESA learns the contribution of surrounding vehicles at the same position to the action value function under different environments.

Similar to [32], the whole network is optimized by the gradient descent algorithm.

C. Safety Layer

Overemphasis on the collision penalty would make the agent give up on attempting to accelerate and increase the difficulty of learning efficient actions. On the contrary, a small punishment would make the agent aggressive, even unsafe. Safety cannot be completely ensured by increasing the collision penalty in the reward function, and repeated training is required to ensure the safety rate by adjusting the reward function [12]. Adding additional networks to predict whether the ego vehicle will collide or not in the future increases the training difficulty and lacks interpretability [26]. Simple safety rules are intuitive and practical, but they depend on expert knowledge [28]. To reduce the collision rate while ensuring high efficiency, a lightweight data-driven safety layer is proposed to judge the action chosen by ESA-DQN. Its training is independent of and parallel to the training process of the ESA-DQN. The safety layer consists of classifiers for every dangerous action, and each classifier consists of several elementary units. Giving attention to both interpretability and expressiveness, the LBSVM classifier is designed as the elementary unit to evaluate the action selected by the ESA-DQN.

As a popular discriminative classifier, the support vector machine (SVM) has a solution with interpretability for linearly separable problems, which is very suitable for this scenario.

The standard linear SVM is used to solve the following optimization problem [33]

$$\begin{aligned} \min\ & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \\ \text{s.t.}\ & y_i (w \cdot x_i + b) \geq 1 - \xi_i, \\ & \xi_i \geq 0, \quad i = 1, 2, \ldots, N, \end{aligned} \tag{22}$$
TABLE III
CLASSIFIER COMPOSITION OF EACH ACTION
where $w$ and $b$ are the parameters of the classification hyperplane. $x_i$ and $y_i$ are the feature vector and label of the samples, respectively. In this scenario, dangerous samples are labeled as True, while safe samples are labeled as False. $N$ is the size of the training set. $\xi_i$ are slack variables, which are introduced to allow some classification errors.

Such an SVM classifier (22) is fair for positive and negative samples. However, in this scenario, a negative sample only means that the action executed in the current state is safe, without considering the decisions of surrounding vehicles. If there are positive and negative labels in some similar states, the decisions of other vehicles in such states will affect the label. Thus, the penalty for positive sample classification errors should be much larger.

In addition, the decision boundary of LBSVM is designed to be far away from the positive samples. The schematic diagram of the 2D-LBSVM is shown in Fig. 4. The black solid line represents the decision boundary. The red points are the positive samples, and the blue points are the negative samples. The distance from the red support vectors to the decision boundary is $1 + M$, which is greater than the distance from the blue support vectors to the decision boundary.

Mathematically, the LBSVM solves the following optimization problem

$$\begin{aligned} \min\ & \frac{1}{2}\|w\|^2 + C_{+1} \sum_{i=1}^{N_{+1}} \xi_i + C_{-1} \sum_{j=1}^{N_{-1}} \xi_j, \\ \text{s.t.}\ & w \cdot x_i + b \geq 1 + M - \xi_i, \quad \text{when } y_i = 1, \\ & w \cdot x_j + b \leq -1 + \xi_j, \quad \text{when } y_j = -1, \\ & \xi_i \geq 0, \quad i = 1, 2, \ldots, N_{+1}, \\ & \xi_j \geq 0, \quad j = 1, 2, \ldots, N_{-1}, \end{aligned} \tag{23}$$

where $M > 0$ represents the deviation of the decision boundary from the positive samples, and $C_{+1} \gg C_{-1}$ implies that the classification errors of negative samples are unimportant. $N_{+1}$ and $N_{-1}$ are the numbers of positive samples and negative samples. Compared with neural networks, LBSVM only needs a small amount of data and training resources. Its training is decoupled from DRL, which does not increase the training difficulty. Next, LBSVM (23) will be used as a component of the classifier for each action.

In order to make the classifier more suitable for the control algorithm, the training samples of the classifier are obtained from the training process of ESA-DQN. The rule for labeling samples is first introduced, and then the composition of the classifier for each action is proposed. The dimension of LBSVM (23) is further reduced according to the scenario.

Based on historical attempts, the action $a_t$ that causes $s_{t+1}$ to become Terminal II is recorded as a dangerous action under state $s_t$. Note that deceleration has little impact on the collision; as a consequence, $a_t == 4$ means that executing $a_{t-1}$ in the state $s_{t-1}$ is dangerous. Similarly, $a_{t-1} == 4 \wedge a_t == 4$ means executing $a_{t-2}$ in the state $s_{t-2}$ is dangerous. The nearest non-decelerating action is always traced back as being responsible for the final collision. This action and its corresponding state are marked as a positive sample.

Since Action 4 should not be evaluated as a dangerous action, only the other four actions are equipped with a classifier.

There are some problems such as high dimensionality, poor interpretability and easy overfitting if the states are directly used as the input of the classifier. So, the input features are selected according to the scenario.

In longitudinal control (Action 1 and Action 3), the ego vehicle can only collide with the leading vehicle. Thus, 2D-LBSVM classifiers are used to judge Action 1 and Action 3, respectively, and the inputs are the relative position $x_{s_{lc}}^t$ and relative speed $v_{x_{s_{lc}}}^t$ of the leading vehicle on the current lane.

In the process of changing lanes (Action 0 or Action 2), surrounding vehicles that may collide with the ego vehicle include the leading vehicle on the current lane (LVCL), the leading vehicle on the target lane (LVTL), and the following vehicle of the target lane (FVTL). Thus, three 2D-LBSVM classifiers are used to judge Action 0 and Action 2, respectively, and the inputs are the relative longitudinal positions $(x_{s_{lc}}^t, x_{s_{lt}}^t, x_{s_{ft}}^t)$ and relative speeds $(v_{x_{s_{lc}}}^t, v_{x_{s_{lt}}}^t, v_{x_{s_{ft}}}^t)$ of the three types of vehicles that may be involved in a collision. Note that the safety of Action 0 and 2 depends on the vehicles in the target lane and the current lane, and the two actions need similar criteria. In order to make full use of the data, the samples of Action 0 and 2 are combined. Table III describes the composition and input of the classifier for different actions.

For convenience, a softmax layer is added after the output of ESA-DQN. When the selected action is rejected by the safety layer, the corresponding Q value is modified to −1, so that other actions can still be selected by the argmax function. The relationship between the state, ESA-DQN and the lightweight safety layer is shown in Fig. 5.
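The labeling rule and the action-rejection step of the safety layer can be illustrated with a short Python sketch. It is written under several assumptions that are not taken from the article: the function names, the representation of an elementary unit as a `(vehicle_key, weights, bias)` tuple, the convention that a unit flags danger when $w \cdot x + b \geq 0$, the rule that an action is rejected if any of its units flags danger, and the numeric weights, which are hypothetical rather than trained values.

```python
import math

DECELERATE = 4  # Action 4 in the text: never judged by a classifier

def softmax(values):
    """Numerically stable softmax, so every score lies in (0, 1)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def is_dangerous(unit, features):
    """One linear unit on (relative position, relative speed) of one vehicle."""
    key, (w_pos, w_speed), bias = unit
    if key not in features:        # assumption: an absent vehicle poses no risk
        return False
    rel_pos, rel_speed = features[key]
    return w_pos * rel_pos + w_speed * rel_speed + bias >= 0.0

def select_safe_action(q_values, features, classifiers):
    """Argmax over softmax scores; a rejected action gets score -1 and is skipped."""
    scores = softmax(q_values)
    for _ in range(len(scores)):
        action = max(range(len(scores)), key=lambda a: scores[a])
        units = classifiers.get(action, [])
        if not any(is_dangerous(u, features) for u in units):
            return action
        scores[action] = -1.0      # below every softmax output, so argmax moves on
    return DECELERATE              # fallback if every action was rejected

def nearest_non_decelerating(trajectory):
    """Trace back from a collision to the last (state, action) with action != 4."""
    for state, action in reversed(trajectory):
        if action != DECELERATE:
            return state, action
    return None

# Hypothetical weights: danger grows as the gap shrinks and the closing speed rises.
W, B = (-0.1, -0.2), 1.0
classifiers = {0: [("LVCL", W, B), ("LVTL", W, B)], 1: [("LVCL", W, B)]}
features = {"LVCL": (50.0, 0.0), "LVTL": (8.0, -6.0)}
chosen = select_safe_action([2.0, 1.0, 0.5, 0.1, 0.0], features, classifiers)
positive = nearest_non_decelerating([("s0", 1), ("s1", 0), ("s2", 4), ("s3", 4)])
```

In this toy setting the lane change (Action 0) has the highest score but is rejected because the vehicle ahead in the target lane is close and approaching, so the next-best action is returned; the trace-back skips the two trailing decelerations and blames the lane change taken in `"s1"`.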
TABLE IV
DRIVING SCENARIO
TABLE VI
SIMULATION RESULT
Conservative SAC in [12]. Compared with the simple DQN algorithm, the safety rate and average speed are both improved.

The advantage of Algorithm 1 is mainly attributed to the state selection and safety layer. Occupancy grid improves the

C. Ablation Experiment

In this subsection, ablation experiments will be deployed to illustrate the role of each module. Since ESA is an improvement of EA in this scenario, in addition to comparing the performance difference with or without the ESA module, we also include a comparison between ESA-DQN and EA-DQN, which is designed by using the EA module. Different training timesteps are tested in the same random seed and environment, and the results are shown in Table VII, Fig. 6, Fig. 7 and Fig. 8.
TABLE VII
ABLATION EXPERIMENT
TABLE VIII
WHETHER LANE INFORMATION IS AVAILABLE OR NOT
TABLE IX
COMPARISON WITH PERFORMANCE OF THE SOLUTION WITH ADDITIONAL
PREDICTION NETWORK
TABLE X
PERFORMANCE TESTING WITH OBSERVATIONS CONTAINING NOISES
TABLE XI
PERFORMANCE TESTING WITH DIFFERENT STRATEGY PARAMETERS FOR
SURROUNDING VEHICLES
Fig. 11. Driving scenarios for comparison. (a) Scenario I. (b) Scenario II. (c) Scenario III.
TABLE XII
PERFORMANCE TESTING WITH DIFFERENT lg
D. Discussion
State: DQN is used to fit the action value function. It manages
to judge the value of an action according to current state, and it is
used to solve the situation where the number of states is infinite.
Thus, the network must have the generalization ability. In other
words, the Q value of an action obtained from similar states is
the safety margin and the remarkable generalization capability similar. Occupancy grid is used instead of stacking the infor-
of the simple linear safety layer. mation of surrounding vehicles to improve the generalization
ability of the model.
Like many DRL algorithms, the performance of the algorithm is significantly influenced by changes in the environment. Fortunately, LBSVM mitigates this impact, and the performance of the algorithm remains satisfactory. These experiments demonstrate that the proposed algorithm has good generalization ability.
We vary the resolution of the observation by changing l_g to test the algorithm’s dependence on the hyperparameters of the observation. The results are presented in Table XII.
Using an occupancy grid as the state presents the spatial information of the observation to the agent more intuitively than a list of the surrounding vehicles’ information. As for resolution, a small value of l_g means a finer spatial partitioning, which increases the number of neural network parameters. A large value of l_g leads to a coarser spatial partitioning, which may result in multiple vehicles occupying the same grid cell. In such cases, selecting the features of the vehicle nearest to the ego vehicle as a representative does not substantially influence the decision-making process. The crudest approach is to store only the features of the vehicle closest to the ego vehicle in each lane, which loses information about the other surrounding vehicles and leads to short-sighted decisions by the agent. Our experiments show that the algorithm’s performance is satisfactory over a fairly wide, appropriate range of l_g.
If the information of surrounding vehicles is stacked as in [12], the order of the information may be ambiguous. The work [17] provides a solution by sorting the vehicles by the absolute value of their distance from the ego vehicle, which may be feasible. However, with this method the spatial positions of the surrounding vehicles and their positions in the state vector of the ego vehicle may no longer be in one-to-one correspondence.
Consider Fig. 11(a), (b) and (c). Assume that the speed of the surrounding vehicles (blue) is the same, and that the longitudinal position of the vehicles in the same column is also the same across the different scenarios. In these cases, if the speeds of all the vehicles are the same, their state vectors are similar. However, for Scenario I, it is a reasonable choice to keep going straight, while for Scenario II, going straight is dangerous due to the high speed of the ego vehicle. In Scenario III, it is wise to choose LLC. If only Scenario I and Scenario II appear in the training set, it may be difficult for the ego vehicle to make an optimal decision for Scenario III. It is difficult for a Q function approximated by neural networks to output significantly different Q values in similar states.
From the standpoint of the training process, if the ego vehicle obtains a large penalty due to LLC in Scenario I, the corresponding weight will also change according to the gradient descent
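The ordering issue described for distance-sorted state vectors can be illustrated with a minimal sketch. It assumes each surrounding vehicle is described by a hypothetical tuple `(dx, dy, v)` (longitudinal offset, lateral offset, speed relative to the ego frame) and that sorting uses the Euclidean distance; neither the feature layout nor the function name `distance_sorted_state` is taken from [17].

```python
import math


def distance_sorted_state(vehicles):
    """Flatten (dx, dy, v) tuples into one vector, nearest vehicle first.

    The tuple layout and the Euclidean sort key are illustrative
    assumptions, not the exact encoding of any cited work.
    """
    ordered = sorted(vehicles, key=lambda veh: math.hypot(veh[0], veh[1]))
    return [value for veh in ordered for value in veh]


# Scene A: the left-lane vehicle (dy = -4 m) is the nearest one.
scene_a = [(30.0, 0.0, 20.0), (10.0, -4.0, 20.0)]
# Scene B: the same left-lane vehicle is now farther than the one ahead.
scene_b = [(30.0, 0.0, 20.0), (40.0, -4.0, 20.0)]

state_a = distance_sorted_state(scene_a)
state_b = distance_sorted_state(scene_b)

# The first slot of the state vector holds a left-lane vehicle in scene A
# but a same-lane vehicle in scene B, so a slot index does not identify
# a fixed spatial position.
print(state_a[:3], state_b[:3])
```

In this sketch the network input at a given index switches between lanes as distances change, which is one way to read the loss of one-to-one correspondence noted above.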
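The resolution trade-off discussed earlier in this section for l_g can be sketched as follows. This is only an illustration under stated assumptions: the function `build_grid`, the observation range `x_range`, the lane count, and the per-vehicle tuple `(dx, lane, v)` are hypothetical and do not reproduce the paper's exact state definition. When several vehicles fall into one cell, the sketch keeps the one nearest to the ego vehicle, as the text suggests.

```python
import math


def build_grid(vehicles, l_g, n_lanes=3, x_range=60.0):
    """Build a lane-by-cell occupancy grid with longitudinal cell length l_g.

    vehicles: iterable of (dx, lane, v), with dx the longitudinal offset
    from the ego vehicle in metres. All parameter names and defaults are
    illustrative assumptions.
    """
    n_cols = math.ceil(2 * x_range / l_g)
    grid = [[None] * n_cols for _ in range(n_lanes)]
    for dx, lane, v in vehicles:
        if not (-x_range <= dx < x_range) or not (0 <= lane < n_lanes):
            continue  # outside the observed region
        col = int((dx + x_range) // l_g)
        current = grid[lane][col]
        # Keep only the vehicle nearest to the ego vehicle in each cell.
        if current is None or abs(dx) < abs(current[0]):
            grid[lane][col] = (dx, v)
    return grid


def occupied(grid):
    """Count the cells that hold a vehicle."""
    return sum(cell is not None for row in grid for cell in row)


vehicles = [(12.0, 1, 20.0), (18.0, 1, 25.0), (-25.0, 0, 22.0)]
coarse = build_grid(vehicles, l_g=10.0)  # 12 cells per lane
fine = build_grid(vehicles, l_g=5.0)     # 24 cells per lane

# With the coarse grid, the two lane-1 vehicles share a cell and only the
# nearer one is kept; the fine grid separates them at the cost of a
# larger input to the network.
print(occupied(coarse), occupied(fine))
```

A smaller `l_g` doubles the number of cells here, which mirrors the growth in network parameters mentioned in the text.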
[19] A. Vaswani et al., “Attention is all you need,” in Proc. Conf. Workshop Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[20] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[21] M.-H. Guo, Z.-N. Liu, T.-J. Mu, and S.-M. Hu, “Beyond self-attention: External attention using two linear layers for visual tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 5436–5447, May 2023.
[22] D. Tian et al., “SA-YOLOV3: An efficient and accurate object detector using self-attention mechanism for autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 5, pp. 4099–4110, May 2022.
[23] P. Cai, H. Wang, Y. Sun, and M. Liu, “DQ-GAT: Towards safe and efficient autonomous driving with deep Q-learning and graph attention networks,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 21102–21112, Nov. 2022.
[24] L. Chen, Y. He, Q. Wang, W. Pan, and Z. Ming, “Joint optimization of sensing, decision-making and motion-controlling for autonomous vehicles: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., vol. 71, no. 5, pp. 4642–4654, May 2022.
[25] W. Hu et al., “A rear anti-collision decision-making methodology based on deep reinforcement learning for autonomous commercial vehicles,” IEEE Sensors J., vol. 22, no. 16, pp. 16370–16380, Aug. 2022.
[26] L. Wen, J. Duan, S. E. Li, S. Xu, and H. Peng, “Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization,” in Proc. IEEE 23rd Int. Conf. Intell. Transp. Syst., 2020, pp. 1–7.
[27] L. Zhang, R. Zhang, T. Wu, R. Weng, M. Han, and Y. Zhao, “Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 12, pp. 5435–5444, Dec. 2021.
[28] A. Baheri, S. Nageshrao, H. E. Tseng, I. Kolmanovsky, A. Girard, and D. Filev, “Deep reinforcement learning with enhanced safety for autonomous highway driving,” in Proc. IEEE Intell. Veh. Symp. (IV), 2020, pp. 1550–1555.
[29] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in Proc. IEEE Intell. Veh. Symp. (IV), 2015, pp. 1094–1099.
[30] M. Mukadam, A. Cosgun, A. Nakhaei, and K. Fujimura, “Tactical decision making for lane changing with deep reinforcement learning,” in Proc. Conf. Workshop Neural Inf. Process. Syst., 2017, pp. 1–7.
[31] K. Liu, H. Zhang, Y. Zhang, and C. Sun, “False data injection attack detection in cyber–physical systems with unknown parameters: A deep reinforcement learning approach,” IEEE Trans. Cybern., vol. 53, no. 11, pp. 7115–7125, Nov. 2023.
[32] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[33] J. Pang, X. Pu, and C. Li, “A hybrid algorithm incorporating vector quantization and one-class support vector machine for industrial anomaly detection,” IEEE Trans. Ind. Informat., vol. 18, no. 12, pp. 8786–8796, Dec. 2022.
[34] C. J. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, pp. 279–292, 1992.
[35] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, 2020, Art. no. 108759.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[37] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” J. Mach. Learn. Res., vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/20-1364.html
[38] A. Paszke et al., “Pytorch: An imperative style, high-performance deep learning library,” in Proc. 33rd Adv. Neural Inf. Process. Syst., 2019, pp. 8026–8037.
[39] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-C. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 802–810.
Guoxi Chen (Student Member, IEEE) received the B.E. degree in automation in 2021 from Southeast University, Nanjing, China, where he is currently working toward the Ph.D. degree in control engineering. His research interests include autonomous driving, reinforcement learning, and network security.
Ya Zhang (Senior Member, IEEE) received the B.S. degree in applied mathematics from the China University of Mining and Technology, Xuzhou, China, in 2004, and the Ph.D. degree in control engineering from Southeast University, Nanjing, China, in 2010. Since 2010, she has been with Southeast University, where she is currently a Professor with the School of Automation. Her research interests include multi-agent systems, reinforcement learning, and network security.
Xinde Li (Senior Member, IEEE) received the Ph.D. degree in control theory and control engineering from the Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2007. He joined the School of Automation, Southeast University, Nanjing, China, where he is currently a Professor and the Ph.D. Supervisor. His research interests include information fusion, object recognition, computer vision, and intelligent robots.