Cognitive Radar Using Reinforcement Learning in Automotive Applications

Pengfei Liu, Yimin Liu*, Member, IEEE, Tianyao Huang, Yuxiang Lu, and Xiqin Wang

P. Liu, Y. Liu, T. Huang and X. Wang are with the Department of Electronic Engineering, Tsinghua University, Beijing, China (e-mail: liupf16@mails.tsinghua.edu.cn; yiminliu@tsinghua.edu.cn; huangtianyao@tsinghua.edu.cn; wangxq ee@tsinghua.edu.cn). Y. Lu is with Beijing Radar Research Institute, Beijing, China (email: luyuxiang92@163.com). *Corresponding author.

Abstract—The concept of cognitive radar (CR) enables radar systems to achieve intelligent adaption to a changeable environment with feedback facility from receiver to transmitter. However, the implementation of CR in a fast-changing environment usually requires a well-known environmental model. In our work, we stress the learning ability of CR in an unknown environment using a combination of CR and reinforcement learning (RL), called RL-CR. Less or no model of the environment is required. We also apply the general RL-CR to a specific problem of automotive radar spectrum allocation to mitigate mutual interference. Using RL-CR, each vehicle can autonomously choose a frequency subband according to its own observation of the environment. Since radar's single observation is quite limited compared to the overall information of the environment, a long short-term memory (LSTM) network is utilized so that radar can decide the next transmitted subband by aggregating its observations over time. Compared with centralized spectrum allocation approaches, our approach has the advantage of reducing communication between vehicles and the control center. It also outperforms some other distributive frequency subband selecting policies in reducing interference under certain circumstances.

Index Terms—cognitive radar, reinforcement learning, automotive, interference, spectrum allocation

Fig. 1. Block diagram of conventional CR.

Fig. 2. Block diagram of RL-CR. The two feedback paths, for the observation and reward, are distinguished by solid and dashed lines, respectively. The former is before the transmission and the latter is after the transmission.

I. INTRODUCTION

THE concept of cognitive radar (CR) offers a way for radar systems to achieve intelligent adaption to a changeable environment (targets are also taken as part of the environment). CR was first formally put forward in [1]. The CR framework proposed in [1] is manifested in the block diagram in Fig. 1. First, radar senses the environment by sending electromagnetic signals. Then, a parameterized environmental model is produced by the analyzer or from the prior knowledge. A frequently used model is the state-space equation, which depicts the dynamics of the environment. Based on it, the predictor makes probabilistic predictions on what the environment will evolve into using methods such as Bayesian filtering. Finally, the transmitter designs new waveforms according to the feedback from the receiver. Working in such a closed loop, radar can continuously adjust its transmission corresponding to a dynamic environment.

Inspired by [1], subsequent research has applied CR in diverse applications, such as target detection [2-4], ranging [5], tracking [6-9], recognition [10, 11] and radar spectrum sharing [12, 13]. Transmitter waveform design and receiver cognition to the environment are the two main focuses of this research. On one hand, various criteria for waveform design have been developed, such as minimizing the minimum mean square error (MMSE) [6] or various types of Cramér-Rao lower bound (CRLB) for target tracking [7, 9], maximizing signal-to-noise ratio (SNR) [2] or mutual information (MI) [4] for target detection, and maximizing class distance [10, 11] for target recognition. On the other hand, receiver cognition has been implemented in different ways. Some research follows the procedures in [1] (or Fig. 1). For instance, in [2, 6, 7, 9], a state-space equation is formulated to represent the relationship between environment states in two adjacent time steps so that receiver prediction can be executed by the Bayesian tracking approach. However, it requires considerable prior knowledge to model the dynamics of the environment well and thus, model mismatch is an underlying problem. In others, extra transmissions or sensors are deployed to perceive the environment, and no modeling or predicting is involved. For instance, in [5], radar transmits an extra pulse to acquire the target impulse response, according to which the waveform of the next pulse is designed. In [13], a communication receiver senses the occupied frequency bands and then a radar transmitter uses this
information to select bands with minimum interfering energy. However, such methods are only applicable to situations where the environment does not change much between successive time steps.

In the face of an environment with scarce knowledge about its dynamics, learning ability appears crucial to the realization of CR. In [14-16], reinforcement learning (RL) is adopted for radar to control its transmission. RL is a machine learning method in which an agent learns to make decisions corresponding to an unknown environment [17]. Enlightened by this research, we examine closely the combination of RL and CR, which we refer to as RL-CR. The block diagram of RL-CR is shown in Fig. 2. First, the receiver processes radar echoes to some high-level observation, such as the target range or velocity. Then, a learner in the transmitter outputs the next transmission by taking the current observation as input using a policy function, π(·). After the transmission acts on the environment or is received and processed by the receiver, the learner obtains a reward signal as an immediate evaluation for its last transmission, based on which it adjusts the policy. In this way, radar can learn by interacting with the environment despite scarce knowledge about it. Compared to the conventional CR, RL-CR differs in several ways. First, two types of feedback are included in RL-CR. One is from receiver to transmitter and it updates the transmitter with recent environmental information to decide the next transmission. The other is the reward signal, which continuously gives assessments on the last transmission so that radar can learn a better policy. Second, RL-CR aims to learn the relationship between the current observation and next transmission based on its interaction with the environment, rather than explicitly modeling the environment and making predictions using statistical inference. The dynamics of the environment and the design of waveforms are learned in the form of a policy function that maps observation to transmission.

In this paper, we show how to specialize the conceptual idea of RL-CR to a particular problem: mitigating mutual interference among automotive radars. It is a problem of wide concern as currently the population and bandwidth demand of automotive radars are both on the rise while the radio resource allocated for them is quite limited. For instance, the frequency range of 76-77 GHz is allocated to automotive usage in most countries [18]. Current solutions to the interference problem can be divided into two types. One is interference canceling (IC), i.e. to eliminate received interference using signal processing algorithms. Typical methods include pre-FFT and post-FFT [19]. The other is interference avoidance (IA), i.e., to prevent the radar and interferers from using similar waveforms or frequency bands. One IA method is to use randomized waveform parameters, such as staggered pulse repetition frequency (PRF) [19] and random chirp types [20]. Another is spectrum allocation, i.e., to assign different frequency bands to each radar, and this is taken as one of the most promising interference mitigation techniques [21]. Some centralized spectrum allocation approaches have been proposed. In [22], frequency bands are assigned to each car without overlap, and the allocated bandwidth is determined by minimizing the largest ranging error of all. In [23], parameters related to time, frequency and power assigned to each radar are calculated by a graph coloring algorithm. However, centralized allocation requires the control center to know the overall information such as the geometrical distribution of all vehicles. Hence, continuous communication needs to be established between vehicles and the control center. The former send their positions to the latter, which in turn broadcasts the allocation result to each vehicle.

This paper aims to develop a distributive spectrum allocation approach that is more effective and flexible than a centralized one, but with more challenges. On one hand, the geometrical distribution of vehicles changes fast. On the other hand, each vehicle only observes part of the distribution. By using RL-CR, each radar autonomously chooses a frequency subband using its own observations with barely any communication with others or the control center. Moreover, because radar's single observation is quite partial compared to the overall information, a long short-term memory (LSTM) neural network is employed in the RL-CR learner so the radar can decide the next transmitted subband by aggregating its observations over time [24].

The main innovations are summarized as follows:
• We propose a conceptual combination of CR and RL, RL-CR, to enable radar to learn to choose transmission in an unknown dynamic environment.
• We demonstrate how the general RL-CR is specified to achieve distributive spectrum allocation among automotive radar to mitigate mutual interference.
• We develop an algorithm to train one network shared by all cars with stability.
• We build a simulation environment to model road scenarios where different allocation approaches can be trained and tested, and verify the advantage of our approach by comparing it with two contrasting ones.

The rest of the paper is organized as follows. In Section II, the problem of automotive radar spectrum allocation using RL-CR is formulated. In Section III, the network architecture used in RL-CR is constructed and a training algorithm is developed. In Section IV, simulation results and relevant discussion are provided. In Section V, conclusions are drawn.

Notations: Throughout this paper, we use superscript {·}^i to represent Car i and subscript {·}_t to represent time step t. A variable with superscript {·}^{i,j} stands for a value between Car i and Car j. Bold lower case letters denote row vectors. We use (·, ..., ·) to represent a tuple and use [·, ..., ·] to represent a cascade of multiple row vectors.

II. AUTOMOTIVE RADAR SPECTRUM ALLOCATION USING RL-CR

The approach we propose applies to a series of scenes concerning the problem of distributive resource allocation among dynamic sensor networks. On one hand, the network is time-varying, such as the geometrical distribution of the sensors. On the other hand, each sensor selects resources autonomously using its local observations. In this work, we choose the problem of automotive radar spectrum allocation as an example to implement and verify the approach. Extension to other applications can be made in a similar way.
Fig. 3. Problem scenario.

We consider a simplified scenario, shown in Fig. 3, in which cars are traveling on two lanes in different traffic directions. Each car is equipped with one long-range radar (LRR) on its front and one short range radar (SRR) on its back. The LRR, for distance up to 250 m, is used to provide a forward-looking view for applications such as adaptive cruise control (ACC) and collision mitigation systems (CMS) [25]. The SRR, for distance within 30 m, is used to detect obstacles for applications such as lane change assistance (LCA) and assisted parking system (APS) [25].

Suppose that the total frequency band for automotive radar usage is divided into M subbands and that there are N cars on a certain section of the road. The N cars are indexed by 1, 2, ..., N. At each time step, the LRR and SRR on one car choose the same subband to transmit and receive. The time interval between two transmissions is T. As the number of subbands is smaller than the number of cars, i.e. N > M, more than one car will inevitably collide into the same subband, causing interference. In this paper, we only focus on reducing LRR interference, as SRR interference is usually not a concern [20] (SRR is mainly used for obstacle detection that does not require range or velocity measurement).

In the following, we will concretize the elements in RL-CR shown in Fig. 2 in terms of automotive interference.

A. Observation

It is easy for a car to choose the best subband if it can fully observe the environment containing the positions and chosen subbands of all the others. However, only part of the environment is observable. The observation Car i acquires at time step t is

o^i_t = (u^i_{t-1}, r^i_{t-1}, I^i_{t-1}, p^i_t, p̄^i_t),   (1)

where
• u^i_{t-1}: last subband Car i transmitted;
• r^i_{t-1}: last reward Car i received;
• I^i_{t-1}: interference power at last time step;
• p^i_t: position of Car i;
• p̄^i_t: positions of cars in front of Car i.

The vector p̄^i_t contains two components, which are the positions of the nearest cars in front of Car i in the same and different lane, respectively. These two positions can be acquired by the LRR, which measures the distance and velocity of the front targets within its maximal detection range. As the two targets travel in opposite directions, it is easy to determine which target belongs to which lane. Although radar measurement may fail at times due to interference, p̄^i_t can be inferred using recent measurements of the positions and velocities of the two front cars.
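As a concrete illustration of (1), the sketch below packs one car's observation into a flat numeric vector; the function name and the example sensor values are illustrative assumptions rather than part of the paper.

import numpy as np

def build_observation(last_subband, last_reward, last_interference,
                      own_position, front_same_lane, front_other_lane):
    """Assemble the observation o^i_t of Eq. (1) as a flat vector.

    All arguments are plain numbers measured (or inferred) by Car i itself:
    the subband index and reward of the previous transmission, the
    interference power received at the previous time step, the car's own
    position, and the positions of the nearest front cars in the same and
    the other lane.
    """
    return np.array([last_subband, last_reward, last_interference,
                     own_position, front_same_lane, front_other_lane],
                    dtype=np.float32)

# Hypothetical numbers for a single time step of one car.
o_t = build_observation(last_subband=2, last_reward=1.0,
                        last_interference=3.2e-9, own_position=120.0,
                        front_same_lane=155.0, front_other_lane=180.0)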
B. Reward

The reward design relates to the desired performance. In our problem, we use interference-to-noise ratio (INR) as a metric for whether a transmission is successful. If the INR is below a certain threshold, it is taken as a success. Otherwise, it is a failure. Our desired performance is the success rate, which equals the number of successes divided by the total transmissions. Hence, the reward is set as 1 if the INR is below a predefined threshold and as 0 otherwise. In the following, the reward expression will be derived.

The interference received by the LRR comes from two parts. One is from the LRR on cars that travel in the opposite direction in the different lane, and the other is from the SRR on cars in the same lane. Interference between two LRRs in different lanes is given as

J = P_L G A_e g / (4π(L² + d²)) · [p_r(θ(d))]²  if d > d_0,  and  J = 0  if d ≤ d_0,   (2)

where
• P_L: transmitting power of LRR;
• G: antenna gain;
• A_e: effective area;
• L: vertical distance between two lanes;
• d: horizontal distance between two LRRs;
• θ: radiation direction between two LRRs;
• p_r(·): normalized antenna pattern;
• g: decaying factor of propagation of radar signal.

As (2) shows, the interference power decreases as the inverse square law with the distance between two radars [26]. The decaying factor g represents the effect of shadowing between cars caused by the road environment such as obstacles. The antenna pattern p_r(·) is taken into consideration, which indicates that the transmitting or receiving power is dependent on the angle from one radar to another. An illustration is provided in Fig. 4. The angle can be written as a function of d:

θ(d) = arctan(L/d).   (3)

For simplicity, we are only concerned with the main lobe of the antenna because the side lobe is negligibly low by comparison. In other words, the radiant power beyond the width of the main lobe is approximately taken as 0. Hence, almost no interference is caused if two incoming cars in different lanes are located within d_0, shown in Fig. 3. The minimum interference distance is determined as

d_0 = L / tan(φ/2),   (4)

where φ is the width of the main lobe.

Likewise, the interference caused by an SRR in the same lane is

K = P_S G A_e g / (4πd²),   (5)
Fig. 4. Illustration of radar antenna pattern.

where P_S is the SRR transmitting power. The antenna pattern is omitted here because the angle is approximately 0 in the same lane.

Equations (2) and (5) give the interference an LRR receives from an LRR in the different lane and an SRR in the same lane, respectively. Let d^{i,j}_t denote the horizontal distance between Car i and Car j at time step t. By replacing d with d^{i,j}_t in (2) and (5), we obtain J^{i,j}_t and K^{i,j}_t, which represent the interference received by the LRR on Car i from the LRR and SRR on Car j, respectively. Thereby, the total interference with the LRR on Car i at time step t can be expressed as

I^i_t = Σ_{j∈Φ^i_t} J^{i,j}_t + Σ_{j∈Ψ^i_t} K^{i,j}_t,   (6)

where Φ^i_t and Ψ^i_t are sets of cars that transmit the same subband as Car i in different and the same lanes, respectively. Then, the reward is derived as

r^i_t = rect(I^i_t / (N_0 η)),   (7)

where N_0 is the noise power, η is the predefined threshold of INR, and the function rect(·) is defined as

rect(x) = 1 if 0 ≤ x < 1, and rect(x) = 0 otherwise.
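The following sketch evaluates (2)-(7) numerically for one car; the constants are placeholders in the spirit of Table I (Section IV), and which cars share a subband is supplied by the caller.

import math

# Assumed constants (illustrative; Table I lists the values used in the paper).
P_L_G_Ae_over_4pi = 1.0e4   # P_L * G * A_e / (4*pi) for the LRR
P_S_G_Ae_over_4pi = 1.0e3   # P_S * G * A_e / (4*pi) for the SRR
g = 0.1                     # propagation decaying factor
L = 10.0                    # vertical distance between the two lanes (m), assumed
phi = 0.5                   # main-lobe width (rad), assumed
N0 = 1.0e-12                # noise power, assumed
eta = 10.0                  # INR threshold, assumed

d0 = L / math.tan(phi / 2.0)            # Eq. (4): minimum interference distance

def p_r(theta):
    """Normalized antenna pattern: 1 inside the main lobe, 0 outside (side lobes neglected)."""
    return 1.0 if abs(theta) <= phi / 2.0 else 0.0

def lrr_interference(d):
    """Eq. (2): interference from an LRR in the other lane at horizontal distance d."""
    if d <= d0:
        return 0.0
    theta = math.atan(L / d)            # Eq. (3)
    return P_L_G_Ae_over_4pi * g / (L**2 + d**2) * p_r(theta)**2

def srr_interference(d):
    """Eq. (5): interference from an SRR in the same lane at distance d."""
    return P_S_G_Ae_over_4pi * g / d**2

def reward(total_interference):
    """Eq. (7): reward is 1 when the INR is below the threshold, else 0."""
    return 1 if total_interference / (N0 * eta) < 1 else 0

# Eq. (6) for one car: sum over same-subband cars in the other lane and the same lane.
I_total = sum(lrr_interference(d) for d in [60.0, 200.0]) + srr_interference(35.0)
r = reward(I_total)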
C. Learner

The learner utilizes RL to learn how to select subbands in light of its local observations. First, we recap RL and its commonly used algorithm, Q-learning. Then, we show how to apply RL to our problem.

In RL, an agent learns how to choose actions by receiving rewards from an unknown environment [17]. Let s_t, a_t and r_t denote the state of the environment, the action of the agent and the reward it receives at time step t. At each time step, action a_t is determined by environment state s_t following a policy π, which is a mapping from the state space to action space, to maximize a discounted sum of future rewards:

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...,   (8)

where γ ∈ [0, 1] is the discounting factor. The factor γ reflects how much we consider the influence of the current action on the future. An extreme example is γ = 0, which corresponds to the case where the agent aims to maximize the immediate reward r_t. The Q-function is defined as the expectation of G_t after taking action a_t under environment state s_t following policy π:

Q^π(s_t, a_t) = E{G_t | s_t, a_t, π},   (9)

where the expectation is taken over the probabilistic sequence, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, a_{t+2}, r_{t+2}, ..., following policy π. Learning the optimal policy equals finding the optimal Q-function:

Q*(s_t, a_t) = max_π Q^π(s_t, a_t).   (10)

Then, the best action can be determined by the optimal Q-function:

a*_t = arg max_{a'} Q*(s_t, a').   (11)

The Q-learning algorithm provides an iterative way to estimate the optimal Q-function without an explicit model of the environment. Each iteration is based on an experience of the agent, which is represented by a quadruple, (s_t, a_t, r_t, s_{t+1}). The iteration is performed as [17]:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t [ r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ],   (12)

where α_t is the learning step size. As the iteration in (12) is limited to cases where the state and action space are low dimensional and discrete, a neural network is usually used to approximate the Q-function [27]. The network is referred to as Q-network in the following.
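For reference, the sketch below spells out the tabular iteration (12) on a toy problem with a few discrete states and actions; the stand-in environment is not the radar scenario, only an illustration of the update rule.

import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

def toy_step(s, a):
    """Stand-in environment: random next state, reward 1 when the action matches the state."""
    return int(rng.integers(n_states)), float(a == s % n_actions)

s = 0
for _ in range(1000):
    a = int(np.argmax(Q[s]))                          # greedy action, for illustration only
    s_next, r = toy_step(s, a)
    # Eq. (12): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next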
In our problem, as s_t is not fully observable, using a single observation to represent the environment state is inadequate. To construct a more complete environment state, each car aggregates its historical observations:

s^i_t = [o^i_1, o^i_2, ..., o^i_t],   (13)

where s^i_t is the constructed environment state by Car i, and o^i_t is defined in (1). Here, we use an LSTM recurrent neural network to approximate the Q-function since LSTM is capable of memorizing the past by maintaining a hidden state [28]. The Q-network for Car i is denoted as Q^i(o^i_t, h^i_{t-1}, u^i_t; w^i), where w^i is the network parameter, u^i_t is the chosen subband index by Car i at time step t and h^i_{t-1} is the hidden state. The hidden state is what the network extracts from past observations o^i_1, o^i_2, ..., o^i_{t-1}. Using the LSTM network, the constructed environment state by Car i is equally represented by a fixed-size vector:

s^i_t = [o^i_t, h^i_{t-1}].   (14)

Although there are multiple cars, we do not assign each to a different network. Instead, we only train one to be shared by all cars, i.e., Q(o^i_t, h^i_{t-1}, u^i_t; w). Two main reasons account for choosing such a method. First, suppose N networks are already trained for N particular cars; if one is replaced by
a new car, then the new combination of N cars needs to be trained again for the cars to work together; whereas, training one shared network can apply to arbitrary N cars. Second, it can also reduce the amount of network parameters. Although the network is identical for all cars, the cars will make different decisions because they receive different observations and maintain different hidden states. Furthermore, we can add a car index in the observation to make the network specialized for each car [24]. Hence, (1) is replaced by

o^i_t = (u^i_{t-1}, r^i_{t-1}, I^i_{t-1}, p^i_t, p̄^i_t, i).   (15)

To update the network parameter, the loss function is defined as

L(w) = Σ_i E_{(o^i_t, h^i_{t-1}, u^i_t, r^i_t, o^i_{t+1}, h^i_t)} { [ y^i − Q(o^i_t, h^i_{t-1}, u^i_t; w) ]² },   (16)

where y^i is the target value defined as

y^i = r^i_t + γ max_{u'} Q(o^i_{t+1}, h^i_t, u'; w^-),   (17)

where w^- represents the network parameter before updating. As Bellman's equation shows an equality between the optimal Q-function and its target value [17], the loss function is defined to minimize the error between the Q-function and its target value over all cars' experiences. Then, the gradient descent step is performed as

w = w^- − β ∂L(w)/∂w |_{w = w^-},   (18)

where β is the learning rate.
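One possible way to evaluate the target (17) and loss (16) with a frozen copy of the parameters w^- is sketched below; q_net stands for any Q-network with the interface assumed here (observation and recurrent state in, per-subband Q values and new state out), so the details are an implementation assumption rather than the paper's prescription.

import torch
import torch.nn.functional as F

def td_loss(q_net, q_net_frozen, batch, gamma):
    """Eqs. (16)-(17): squared error between Q(o_t, h_{t-1}, u_t; w) and the target y.

    `batch` is a dict with tensors "obs", "action", "reward", "next_obs" and the
    recurrent states "hidden" and "next_hidden" in whatever form q_net consumes.
    `q_net_frozen` holds the parameters w^- used before the update.
    """
    q_all, _ = q_net(batch["obs"], batch["hidden"])             # Q values for every subband
    q_taken = q_all.gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                       # target uses w^-, no gradient
        q_next, _ = q_net_frozen(batch["next_obs"], batch["next_hidden"])
        y = batch["reward"] + gamma * q_next.max(dim=1).values  # Eq. (17)
    return F.mse_loss(q_taken, y)                               # Eq. (16), averaged over the batch

# One gradient step as in Eq. (18), with learning rate beta:
# optimizer = torch.optim.SGD(q_net.parameters(), lr=beta)
# optimizer.zero_grad(); td_loss(q_net, q_net_frozen, batch, gamma).backward(); optimizer.step()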

III. NETWORK ARCHITECTURE AND TRAINING

A. Network Architecture

The Q-network architecture used in our problem is shown in Fig. 5. The input is the current observation o^i_t, which first goes through a fully connected layer (FCL). Then, the output of the FCL is fed into the LSTM layer, which is followed by an FCL to generate the final output. Each element of the output represents the Q-function value corresponding to each subband. The hidden state h^i_{t-1} is extracted from the past observations and it evolves as new observations come in. Fig. 6 shows the unfolded representation of the Q-network. At time t, the network combines the current observation o^i_t and hidden state h^i_{t-1} to yield the chosen subband u^i_t. Then h^i_{t-1} evolves into h^i_t by incorporating o^i_t. In this way, the subband at each time step is determined by both the present and previous observations.

Fig. 5. The Q-network architecture used in the problem. The black square stands for a time-step delay.

Fig. 6. Unfolded representation of the Q-network, showing how a subband is determined combining the current and past observations.
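A minimal PyTorch sketch of the FCL-LSTM-FCL structure of Fig. 5 follows; the layer sizes are placeholders (Table II in Section IV lists the sizes actually used), and a single LSTM cell stands in for the stacked LSTM layers.

import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """FCL -> LSTM -> FCL, outputting one Q value per subband (Fig. 5)."""

    def __init__(self, obs_dim, num_subbands, hidden_size=30):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim, hidden_size)        # input FCL
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)   # recurrent core keeping h_{t-1}
        self.fc_out = nn.Linear(hidden_size, num_subbands)  # one output per subband

    def forward(self, obs, hidden):
        """obs: (batch, obs_dim); hidden: tuple (h, c) from the previous time step."""
        x = torch.relu(self.fc_in(obs))
        h, c = self.lstm(x, hidden)
        q = self.fc_out(h)                                   # Q(o_t, h_{t-1}, u; w) for every u
        return q, (h, c)

    def initial_hidden(self, batch_size):
        zeros = torch.zeros(batch_size, self.lstm.hidden_size)
        return (zeros, zeros.clone())

# Example: one car, a 7-dimensional observation (one number per field of Eq. (15),
# two for the front-car positions) and M = 3 subbands.
net = LSTMQNetwork(obs_dim=7, num_subbands=3)
h = net.initial_hidden(batch_size=1)
q_values, h = net(torch.zeros(1, 7), h)
subband = int(q_values.argmax(dim=1))                        # greedy choice as in Eq. (11)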
B. Network Training

The network training is centralized, indicating that all cars' experiences need to be gathered to train the shared network. However, the execution is distributive, which means once the trained network is loaded into each car in the beginning, they use it to choose subbands according to their own local observations, without any communication required.

A technique named experience replay combined with batch learning is used to train the Q-network [27]. During the training, each experience, e^i_t = (o^i_t, h^i_{t-1}, u^i_t, r^i_t, o^i_{t+1}, h^i_t), is stored in memory. Before updating the Q-network, a batch of experiences is drawn from memory. Then, gradient descent is performed over the batch. A batch is formed as (19) shows:

[ e^{i_1}_{t_1}   e^{i_1}_{t_1+1}   ···   e^{i_1}_{t_1+P-1} ]
[ e^{i_2}_{t_2}   e^{i_2}_{t_2+1}   ···   e^{i_2}_{t_2+P-1} ]
[     ...              ...         ...         ...         ]
[ e^{i_K}_{t_K}   e^{i_K}_{t_K+1}   ···   e^{i_K}_{t_K+P-1} ],   (19)

where the kth row is a sequence of P successive experiences from the i_k-th agent. The car indices i_k and the start time of each sequence t_k are randomly picked. A batch includes both sequential and randomized experiences. The former is for the LSTM network training that needs sequential samples. The latter is to increase training stability because randomization breaks the correlation of experiences [29].
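Forming the batch in (19) might look like the sketch below; the memory layout (a per-car list of experiences in time order) is an assumed implementation detail.

import random

def sample_batch(memory, K, P, rng=random):
    """Form the K x P batch of Eq. (19).

    `memory` is assumed to be a dict: car index -> list of experiences in time order,
    where each experience is the tuple (o_t, h_{t-1}, u_t, r_t, o_{t+1}, h_t).
    Each of the K rows is a run of P consecutive experiences from one randomly
    chosen car, starting at a randomly chosen time step.
    """
    batch = []
    cars = [i for i, seq in memory.items() if len(seq) >= P]
    for _ in range(K):
        i = rng.choice(cars)                       # random car index i_k
        t = rng.randrange(len(memory[i]) - P + 1)  # random start time t_k
        batch.append(memory[i][t:t + P])           # P successive experiences
    return batch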
To train the Q-network usually takes many episodes of experiences. An episode is a sequence of experiences starting from an initial state and ending when the terminal state occurs, or after a certain number of time steps if there is no terminal state.

During the training, the ε-greedy policy [17] is adopted. At each time step, each car chooses a random subband with small exploring probability ε. Otherwise, it chooses the one that maximizes the Q-function, i.e.

u^i_t = arg max_{u'} Q(o^i_t, h^i_{t-1}, u'; w)  with probability 1 − ε,  and  u^i_t = a random subband  with probability ε.   (20)

The ε-greedy policy enables the agent to explore beyond the current Q-function to see if there is a better policy [17].

Moreover, we use a learning rate β (defined in (18)) and exploring probability ε that vary with the episode number. Before training, β and ε are assigned an initial value. Every N_d episodes, β and ε are multiplied by a decaying coefficient ranging from 0 to 1. When β and ε decrease below the predefined minimum values, they remain unchanged until the training terminates.
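The ε-greedy choice (20) and the episode-wise decay of β and ε could be implemented as sketched below; the initial values, decay factor and minimums are placeholders, since the paper does not report the exact schedule.

import random

def epsilon_greedy_subband(q_values, num_subbands, epsilon):
    """Eq. (20): greedy subband with probability 1 - epsilon, a random subband otherwise."""
    if random.random() < epsilon:
        return random.randrange(num_subbands)
    return max(range(num_subbands), key=lambda u: q_values[u])

def decay(value, episode, every_n=100, factor=0.95, minimum=0.01):
    """Multiply `value` by `factor` every `every_n` episodes, but never below `minimum`."""
    return max(minimum, value * factor ** (episode // every_n))

# Assumed initial values and schedule, for illustration only.
epsilon = decay(1.0, episode=2500)
beta = decay(0.01, episode=2500, minimum=1e-4)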
The algorithm to train the shared LSTM network in our problem is summarized in Algorithm 1.

Algorithm 1: Training the LSTM network in the automotive radar spectrum allocation problem
  Initialize the LSTM network parameter w^- and clear the memory
  for each episode do
      Initialize the hidden state h^i_0 and observation o^i_1 for each car
      for each time step t do
          for each Car i do
              Feed observation o^i_t and hidden state h^i_{t-1} to the network, and obtain the new hidden state h^i_t and the network output
              Choose a subband according to the ε-greedy policy as (20)
              Get reward r^i_t
              Obtain new observation o^i_{t+1}
              Store the experience into the memory
          end
      end
      Form a batch from the memory as (19)
      Apply gradient descent to the loss function over the batch with respect to w as (18)
  end

IV. SIMULATIONS

In this section, we use simulations to evaluate our approach and compare it with two other contrasting ones. In the following, first, we describe the simulation setup in detail. Then we introduce two contrasting approaches. Last, we provide the simulation results along with corresponding discussion.

A. Simulation Setup

The simulated scenario is constructed as Fig. 3 shows. The flow of traffic is modeled by a truncated exponential distribution [30]. The distance l between any two adjacent cars satisfies the following distribution:

p(l) = λ · (1/ρ) exp(−l/ρ)  for d_min ≤ l ≤ d_max,  and  p(l) = 0  otherwise,   (21)

where ρ is the intensity parameter, and λ is a normalizing coefficient to assure that the integral of p(l) equals 1. The intensity parameter ρ reflects the traffic density. Cars in each lane are assumed to travel at constant velocity, i.e., v1 and v2, respectively. The detailed scenario settings are shown in TABLE I.

TABLE I
SCENARIO SETTINGS
Notation  Description                                    Value
d0        Minimum interference distance (m)              40
T         Time interval between two transmissions (s)    0.1
PL        LRR's transmitting power                       } jointly set so that
PS        SRR's transmitting power                       } P_L G_t A_e / (4π) = 10^4 and
Ae        Effective area                                 } P_S G_t A_e / (4π) = 10^3
Gt        Antenna gain                                   }
g         Decaying coefficient                           0.1
v1        Velocity of cars on Lane 1 (m/s)               30
v2        Velocity of cars on Lane 2 (m/s)               -25
so that the probability of sensing an unoccupied subband,
defined as a success, is high. This is quite similar to our
problem. In the original Myopic policy, an order is given
IV. S IMULATIONS
to all M subbands, i.e. O = (m1 , m2 , ..., mM ), which is a
In this section, we use simulations to evaluate our approach rearrangement of (1, 2, ..., M ). For a subband mk (k ∈ [1, M ]),
and compare it with two other contrasting ones. In the fol- let (mk )+O denote the next one in the order (if mk is the last
lowing, first, we describe the simulation setup in detail. Then subband in the order, (mk )+ O refers to the first one), i.e.
we introduce two contrasting approaches. Last, we provide the
simulation results along with corresponding discussion. (mk )+
O = mmod(k+1,M) . (22)
According to Myopic policy, the subband chosen at time step
A. Simulation Setup t is
The simulated scenario is constructed as Fig. 3 shows.
 i
ut−1 if subband uit−1 is unoccupied
The flow of traffic is modeled by a truncated exponential uit = i + . (23)
(ut−1 )O otherwise
Fig. 7. Learning curves when the number of cars and subbands are 6 and 3, respectively: (a) average normalized loss and (b) average success rate over all cars, versus the number of episodes.

Considering that there are multiple users (cars) in our problem, we apply some randomization to the original Myopic policy. A car keeps using the subband if it results in a success. Otherwise, it randomly chooses one from all M subbands with equal probability:

u^i_t = u^i_{t-1}  if I^i_t / N_0 < η,  and  u^i_t = a random subband  otherwise.   (24)

The modified Myopic policy turns out to perform better than the original one in our problem. Suppose two cars select the same subband and cause interference. In the original one, if the next subbands in their respective orders are the same, they will again collide into the same subband. By applying randomization, such occasions will be effectively reduced.
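The randomized variant (24) is a one-line rule, sketched below with the INR test made explicit.

import random

def modified_myopic_subband(last_subband, last_inr, eta, num_subbands):
    """Eq. (24): keep the previous subband after a success, otherwise pick one uniformly at random."""
    if last_inr < eta:                      # last transmission succeeded
        return last_subband
    return random.randrange(num_subbands)   # uniform over all M subbands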
C. Results and discussions

In Fig. 7, we plot the learning curves in the scenarios where there are 6 cars and 3 subbands. Fig. 7 (a) and (b) show the average normalized loss and success rate over all cars versus the number of learning episodes, respectively. In each subfigure, the darker line indicates the average loss or success rate over 100 episodes. The loss and success rate gradually go down and rise up, respectively, and stabilize in the end. The learning curves reflect that an effective policy can be learned with stability using our approach.

We compare our approach with the contrasting ones under different scenarios, in which the number of cars N differs while the traffic density parameter ρ remains constant. In Fig. 8, we plot success rates achieved by the three approaches versus the number of subbands. Each subfigure corresponds to a specific number of cars. Obviously, the success rate increases with the number of available subbands. Our approach achieves the highest success rate, whereas Random policy has the lowest. When there is only one available subband, all three approaches are equal. The gap between success rates achieved by our approach and Myopic policy is more significant when there are only a few subbands. As the number of subbands increases, Myopic policy catches up with our approach. The two approaches both achieve success near 1 when the number of subbands is close to the number of cars.

In Fig. 9, we plot the success rate improvement by the RL-CR approach over the two contrasting ones, respectively. In both subfigures, each curve corresponds to a certain number of cars. Compared to both contrasting approaches, the improvement first increases with the number of subbands and decreases afterwards. In the left subfigure, the improvement relative to Random policy has a maximum ranging from 22% to 35%. In the right, the improvement relative to Myopic policy reaches its peak, which is approximately 8% to 12%, at 2 or 3 subbands. Then, it gradually drops to 0 after the peak, indicating that the gap between our approach and Myopic policy narrows when subbands increase. From Fig. 9, it can be affirmed that our approach has a more significant advantage over the two contrasting approaches when the number of subbands is much fewer than the number of cars. Fig. 8 and Fig. 9 both demonstrate the potential of the RL-CR approach in situations where resources are relatively scarce compared to the number of radars, which is a widespread radar problem especially with respect to radio resources [12].

Studying the performance of our approach versus the varying number of subbands is instructive for deciding how many subbands the whole band should be divided into. More subbands guarantee less interference among all cars. However, less bandwidth is allocated to each subband, which lowers radar ranging precision and resolution. Hence, we use Figs. 8 and 9 to balance between interference and ranging performance (or other bandwidth-related performance). For example, considering the scenario of 14 cars, we divide the whole band into 4 subbands, because adding one more subband brings little elevation in success rate, whereas the bandwidth of each subband would be reduced by 20%.

Furthermore, we compare the three approaches under different traffic density parameters. In Fig. 10, we plot success rates
Fig. 8. Success rates of the three subband selecting approaches (RL-CR, Myopic, Random) versus the number of subbands under different numbers of cars: (a) N = 4, (b) N = 6, (c) N = 8, (d) N = 10, (e) N = 12, (f) N = 14.
Fig. 9. Success rate improvement by the RL-CR approach over Random and Myopic policy, versus the number of subbands, for 4 to 14 cars: (a) RL-CR v. Random, (b) RL-CR v. Myopic.

Fig. 10. Success rate of the three subband selecting approaches versus the traffic density parameter (m^-1) under different numbers of subbands: (a) M = 2, (b) M = 3, (c) M = 4, (d) M = 5.
Fig. 11. How each car selects subbands using the RL-CR approach when the numbers of cars and subbands are 4 and 2, respectively (four snapshots; markers distinguish success/failure and Subband 1/Subband 2; positions in meters).

of the three approaches versus the traffic density parameter under different numbers of subbands while the number of cars is fixed to 10. It is apparent that the success rate decreases with the traffic density parameter ρ increasing for all three approaches. When the number of subbands is 2 or 3, the success rate of our approach improves steadily by approximately 10% over Myopic policy within the range of traffic density parameter considered. The improvement is less significant when the number of subbands is 4 or 5, which again demonstrates that our approach has a more obvious advantage when subbands are fewer. Moreover, the curves corresponding to RL-CR and Myopic policy are flatter in the last two subfigures, which manifests that our approach, as well as Myopic policy, is less influenced by the traffic density parameter when there are more subbands.

Finally, we take several snapshots from an episode to see how each car selects subbands using the learned policy, shown in Fig. 11. In this case, the numbers of cars and subbands are 4 and 2, respectively. In the first snapshot, because the two cars in red are located within the minimum interference distance d_0, they can choose the same subband without causing interference, while the other two cars use the other subband. In Snapshot 2, as the cars in green get closer, their interference increases, which results in a failure. Then, the cars adjust their subband choices and a better outcome is yielded, although there is still a failure, shown in Snapshot 3. Finally, in Snapshot 4, the cars in each lane use different subbands and no interference is caused among them.

The brief analysis on the learned policy shows that due to the limitation that each car only has its own local observations, executing the learned policy distributively may not achieve the best allocation results. However, the learned policy can contribute to increasing the number of successes among all cars when failures happen.

V. CONCLUSION

In this paper, we combine RL with CR to enable radar to learn how to choose transmission in an unknown dynamic environment. We also examine the application of RL-CR for mitigating interference among automotive radars. Using RL-CR, we realize a distributive spectrum allocation approach in which each radar selects frequency subbands according to its local observations. Considering a single radar observation is quite partial and inadequate, we use an LSTM network so that radar can learn how to integrate observations over time. As there are multiple cars in our problem, we train one network for all cars to share instead of assigning each one a network. Moreover, we construct a simulation environment to model the road scenario; the network is trained and tested in this environment. We compare our approach with two contrasting ones, i.e., Random and Myopic policies, under different numbers of cars and subbands. Our approach outperforms the others especially when subbands are much fewer than cars. The relationship between performance and the traffic density parameter is also investigated. Our approach is less sensitive to the traffic density parameter when there are more subbands. Finally, an example is provided of how cars choose subbands using the trained network in a simple case.

Our contributions are the proposal of combining RL and CR to adapt radar transmission to an unknown dynamic environment and the explanation of how RL-CR is implemented in the problem of automotive radar spectrum allocation. The simulation model is rather simplified to demonstrate the feasibility and potential of the proposed approach. Future work will focus on two aspects. One is to construct a simulation model that is closer to reality and to test the approach in real practice. The other is to improve the generalization capability of our approach, since now a network is trained and applied for one specific scenario where the numbers of cars and subbands are constant. More effort will be put into generalizing our approach for scenarios with variable numbers of cars and subbands.

REFERENCES

[1] S. Haykin, "Cognitive Radar: A Way of the Future," IEEE Signal Processing Magazine, vol. 23, no. 1, pp. 30-40, 2006.
[2] F. Xin, J. Wang, and B. Wang, "Optimal Waveform Design Considering Target Impulse Response Prediction," in Proceeding of the 11th World Congress on Intelligent Control and Automation. IEEE, 2014, pp. 2496-2499.
[3] X. Zhang and C. Cui, "Signal Detection for Cognitive Radar," Electronics Letters, vol. 49, no. 8, pp. 559-560, 2013.
[4] Y. Chen, Y. Nijsure, C. Yuen, Y. H. Chew, Z. Ding, and S. Boussakta, "Adaptive Distributed MIMO Radar Waveform Optimization Based on Mutual Information," IEEE Transactions on Aerospace and Electronic Systems, vol. 49, no. 2, pp. 1374-1385, 2013.
[5] P. Liu, Y. Liu, and X. Wang, "A Cognitive Radar Approach for Extended Target Ranging," in Radar Conference, 2017.
[6] F. Dai, H. Liu, P. Wang, and S. Xia, "Adaptive Waveform Design for Range-Spread Target Tracking," Electronics Letters, vol. 46, no. 11, pp. 793-794, 2010.
[7] S. Haykin, Y. Xue, and P. Setoodeh, "Cognitive Radar: Step toward Bridging the Gap between Neuroscience and Engineering," Proceedings of the IEEE, vol. 100, no. 11, pp. 3102-3130, 2012.
[8] R. A. Romero and N. A. Goodman, "Cognitive Radar Network: Cooperative Adaptive Beamsteering for Integrated Search-and-Track Application," IEEE Transactions on Aerospace and Electronic Systems, vol. 49, no. 2, pp. 915-931, 2013.
[9] K. Bell, C. Baker, G. Smith, J. Johnson, and M. Rangaswamy, "Cognitive Radar Framework for Target Detection and Tracking," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 8, pp. 1427-1439, 2015.
[10] N. A. Goodman, P. R. Venkata, and M. A. Neifeld, "Adaptive Waveform Design and Sequential Hypothesis Testing for Target Recognition With Active Sensors," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 1, pp. 105-113, 2007.
[11] Y. Wei, H. Meng, Y. Liu, and X. Wang, "Extended Target Recognition in Cognitive Radar Networks," Sensors, vol. 10, no. 11, pp. 10181-10197, 2010.
[12] P. Stinco, M. S. Greco, F. Gini, and B. Himed, "Cognitive Radars in Spectrally Dense Environments," IEEE Aerospace and Electronic Systems Magazine, vol. 31, no. 10, pp. 20-27, 2016.
[13] D. Cohen, K. V. Mishra, and Y. C. Eldar, "Spectrum Sharing Radar: Coexistence via Xampling," IEEE Transactions on Aerospace and Electronic Systems, vol. 54, no. 3, pp. 1279-1296, 2018.
[14] Z. Feng, D. Zhou, and Y. Geng, "Target Tracking in Interference Environments Reinforcement Learning and Design for Cognitive Radar Soft Processing," in Congress on Image and Signal Processing, 2008.
[15] S. Haykin, "Cognition is the Key to the Next Generation of Radar Systems," in IEEE Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop, 2009.
[16] L. O. Wabeke and W. A. J. Nel, "Utilizing Q-Learning to Allow a Radar to Choose Its Transmit Frequency, Adapting to Its Environment," in International Workshop on Cognitive Information Processing, 2010.
[17] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[18] M. Goppelt, H. L. Blöcher, and W. Menzel, "Automotive Radar - Investigation of Mutual Interference Mechanisms," Advances in Radio Science, vol. 8, no. 4, pp. 55-60, 2010.
[19] G. M. Brooker, "Mutual Interference of Millimeter-Wave Radar Systems," IEEE Transactions on Electromagnetic Compatibility, vol. 49, no. 1, pp. 170-181, 2007.
[20] T. N. Luo, C. H. E. Wu, and Y. J. E. Chen, "A 77-GHz CMOS Automotive Radar Transceiver With Anti-Interference Function," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 12, pp. 3247-3255, 2013.
[21] M. Kunert, "More Safety for All by Radar Interference Mitigation," Robert Bosch GmbH, Tech. Rep. 248231, Dec. 2012.
[22] R. Hang, Y. Liu, H. Meng, and X. Wang, "Sharing for Safety: The Bandwidth Allocation among Automotive Radars," in Signal and Information Processing, 2017.
[23] J. Khoury, R. Ramanathan, M. C. Dan, R. Smith, and T. Campbell, "RadarMAC: Mitigating Radar Interference in Self-Driving Cars," in IEEE International Conference on Sensing, Communication, and Networking, 2016, pp. 1-9.
[24] M. Hausknecht and P. Stone, "Deep Recurrent Q-Learning for Partially Observable MDPs," Computer Science, pp. 29-37, 2015.
[25] J. Hasch, E. Topak, R. Schnabel, T. Zwick, R. Weigel, and C. Waldschmidt, "Millimeter-Wave Technology for Automotive Radar Sensors in the 77 GHz Frequency Band," IEEE Transactions on Microwave Theory and Techniques, vol. 60, no. 3, pp. 845-860, 2012.
[26] A. Alhourani, R. J. Evans, S. Kandeepan, B. Moran, and H. Eltom, "Stochastic Geometry Methods for Modeling Automotive Radar Interference," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 2, pp. 333-344, 2018.
[27] M. Riedmiller, "Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method," in European Conference on Machine Learning, 2005, pp. 317-328.
[28] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-Level Control through Deep Reinforcement Learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[30] S. Liu, L. Ying, and R. Srikant, "Throughput-Optimal Opportunistic Scheduling in the Presence of Flow-Level Dynamics," IEEE/ACM Transactions on Networking, vol. 19, no. 4, pp. 1057-1070, 2011.
[31] Q. Zhao, B. Krishnamachari, and K. Liu, "On Myopic Sensing for Multi-Channel Opportunistic Access: Structure, Optimality, and Performance," IEEE Transactions on Wireless Communications, vol. 7, no. 12, pp. 5431-5440, 2008.