Professional Documents
Culture Documents
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
Abstract—Resource allocation has a direct and profound im- to guarantee the diverse quality-of-service (QoS) requirements
pact on the performance of vehicle-to-everything (V2X) networks. in vehicular networks, such as extremely large capacity, ultra
In this paper, we develop a hybrid architecture consisting of reliability, and low latency [8]. To address such issues, efficient
centralized decision making and distributed resource sharing
(the C-Decision scheme) to maximize the long-term sum rate resource allocation for spectrum sharing becomes necessary
of all vehicles. To reduce the network signaling overhead, each in the V2X scenario. Existing works on spectrum sharing in
vehicle uses a deep neural network to compress its observed vehicular networks can be mainly categorized into two classes:
information that is thereafter fed back to the centralized decision centralized schemes [9]–[12] and distributed approaches [13],
making unit. The centralized decision unit employs a deep Q- [14]. For the centralized schemes, decisions are usually made
network to allocate resources and then sends the decision results
to all vehicles. We further adopt a quantization layer for each centrally at a given node, such as the head in a cluster or the
vehicle that learns to quantize the continuous feedback. In base station (BS) in a given coverage area. Novel graph-based
addition, we devise a mechanism to balance the transmission of resource allocation schemes have been proposed in [9] and
vehicle-to-vehicle (V2V) links and vehicle-to-infrastructure (V2I) [10] to maximize the vehicle-to-infrastructure (V2I) capacity,
links. To further facilitate distributed spectrum sharing, we also exploiting the slow fading statistics of channel state informa-
propose a distributed decision making and spectrum sharing
architecture (the D-Decision scheme) for each V2V link. Through tion (CSI). In [11], an interference hyper-graph based resource
extensive simulation results, we demonstrate that the proposed C- allocation scheme has been developed in the non-orthogonal
Decision and D-Decision schemes can both achieve near-optimal multiple access (NOMA)-integrated V2X scenario with the
performance and are robust to feedback interval variations, input distance, channel gain, and interference known in each vehicle-
noise, and feedback noise. to-vehicle (V2V) and V2I group. In [12], a segmentation
Index Terms—Vehicular networks, deep reinforcement learn- medium access control (MAC) protocol has been proposed
ing, spectrum sharing, binary feedback. in large-scale V2X networks, where the location information
of vehicles is updated. In these schemes, the decision making
I. I NTRODUCTION node needs to acquire accurate CSI, interference information
ONNECTING vehicles on the road as a dynamic com- of all the V2V links, and each V2V link’s transmit power to
C munication network, commonly known as vehicle-to-
everything (V2X) networks, is gradually becoming a reality
make spectrum sharing decisions. However, reporting all such
information from each V2V link to the decision making node
to make our daily experience on wheels safer and more poses a heavy burden on the feedback links, and even becomes
convenient [1]. V2X enabled coordination among vehicles, infeasible in practice.
pedestrians, and other entities on the road can alleviate traffic As for distributed schemes [13], [14], each V2V link makes
congestion, improve road safety, in addition to providing its own decision with partial or little knowledge of other V2V
ubiquitous infotainment services [2]–[5]. Recently, the 3rd links. In [13], a distributed shuffling based Hopcroft-Karp
generation partnership project (3GPP) begins to support V2X algorithm has been devised to handle the subchannel allocation
services in the long-term evolution (LTE) [6] and further the in V2V communications with one-bit CSI broadcasting. In
fifth generation (5G) mobile communication networks [7]. [14], the spatio-temporal traffic pattern has been exploited for
Cross-industry alliance has also been founded, such as the 5G distributed load-aware resource allocation for V2V commu-
automotive association (5GAA), to push development, testing, nications with slowly varying channel information. In these
and deployment of V2X technologies. methods, V2V links may exchange partial or none channel
information with their neighbors before making a decision.
However, each V2V link can only observe partial information
A. Related Work of its surrounding environment since it is geographically apart
Due to high mobility of vehicles and complicated time- from other V2V links in the V2X scenario. This may leave
varying communication environments, it is very challenging some channels overly congested while others underutilized,
leading to substantial performance degradation.
Liang Wang is with the Key Laboratory of Modern Teaching Technology,
Ministry of Education, Xi’an 710062, China, and also with the School of Notably, the above works usually rely on some levels of
Computer Science, Shaanxi Normal University, Xi’an 710119, China (e-mail: channel information, such as channel gain, interference, loca-
wangliang@snnu.edu.cn). tions and so on. This kind of channel information is usually
Hao Ye, Le Liang and Geoffrey Ye Li are with the School of Electrical and
Computer Engineering, Georgia Institute of Technology, Atlanta, GA, 30332 hard to obtain perfectly in practical wireless communication
USA (e-mail: {yehao, lliang}@gatech.edu; liye@ece.gatech.edu). systems, which is even challenging in the V2X scenario. For-
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
tunately, machine learning enables wireless communications its own right but rather the end-to-end system performance
systems to learn their surroundings and feed critical infor- matters. To further reduce feedback overhead, we adopt a
mation back to the BS for resource allocation. In particular, quantization layer in each vehicle’s DNN and learn how to
reinforcement learning (RL) can make decisions to maximize quantize the continuous feedback. Besides, to further facilitate
long-term return in the sequential decision problems, which distributed spectrum sharing and thus improve the scalability
has gained great success in various applications, such as in vehicular networks, we devise a distributed spectrum shar-
AlphaGo [15]. Inspired by its remarkable performance, the ing architecture to let each V2V link make its own decision
wireless community is increasingly interested in leveraging locally. The contributions of this paper are summarized as
machine learning for the physical layer and resource allocation follows.
design [16]–[24]. In particular, machine learning for future • We leverage the power of DNN and RL to devise a
vehicular networks has been discussed in [25] and [26]. In centralized decision making and distributed implemen-
[27], each V2V link is treated as an agent to ensure the tation architecture for vehicular spectrum sharing that
latency constraint is satisfied while minimizing interference maximizes the long-term sum rate of all vehicles. We
to V2I link transmission. In [28], a multi-agent RL-based use a weighted sum rate reward to balance V2I and V2V
spectrum sharing scheme has been proposed to promote the performance dynamically.
payload delivery rate of V2V links while improving the sum • We exploit the DNN at each vehicle to compress local
capacity of V2I links. A dynamic RL scheduling algorithm has observations, which is further augmented by a quantized
been developed to solve the network traffic and computation layer, to reduce network signaling overhead while achiev-
offloading problems in vehicular networks [29]. ing desirable performance.
• We also develop a distributed decision making architec-
B. Motivation and Contribution ture that allows spectrum sharing decisions to be made
at each vehicle locally and binary feedback is designed
In this paper, we exploit the data-driven capability of learn- for signaling overhead reduction.
ing methods to optimize the spectrum sharing performance • Based on extensive computer simulations, we demon-
and meanwhile reduce network signaling overhead in highly strate both of the proposed architectures can achieve near-
dynamic V2X networks. Conceivably, it is challenging, if not optimal performance and are robust to feedback interval
impossible, for the BS to acquire accurate global channel variations, input noise, and feedback noise. In addition,
state information (CSI) in such high-mobility environments, the optimal number of continuous feedback and feedback
which changes fast and meanwhile dictates spectrum sharing bits for each V2V link are presented that strike a balance
performance. What is worse, the signaling overhead for all between signaling overhead and performance loss.
vehicles to transmit their respective local CSI to the BS also
The rest of this paper is organized as follows. The system
puts a heavy burden on the scarce spectrum resources. The
model is presented in Section II. Then, the BS aided spectrum
feedback overhead also increases linearly with number of the
sharing architecture, including distributed CSI compression
available channels and the vehicles. To tackle this issue, in
and feedback, centralized resource allocation and quantized
this paper, we try to make a clever spectrum sharing decision
feedback, is introduced in Section III. The distributed decision
in the dynamic V2X networks scenario automatically.
making and spectrum sharing architecture is discussed in
In particular, to fully exploit the advantages of both cen- Section IV. Simulation results are presented in Section V.
tralized and distributed schemes while alleviating the require- Finally, conclusions are drawn in Section VI.
ment on CSI for spectrum sharing in vehicular networks,
we propose an RL-based resource allocation scheme with
learned feedback. In particular, we devise a distributed CSI II. S YSTEM M ODEL
compression and centralized decision making architecture to We consider a vehicular communication network with N
maximize the sum rate of all V2V links in the long run. In cellular users (CUs) and K pairs of coexisting device-to-
this architecture, each V2V link first observes the state of device (D2D) users, where all devices are equipped with a
its surrounding channels and adopts a deep neural network single antenna. Let K = {1, 2, ..., K} and N = {1, 2, ..., N }
(DNN) to learn what to feed back to the decision making unit, denote the sets of all D2D pairs and CUs, respectively. Each
such as the BS, instead of sending all observed information pair of D2D users exchange important and short messages,
directly, which greatly alleviates the communication overhead such as safety-related information via establishing a V2V link
of feeding the local CSI to the BS. To maximize the long- while each CU uses a V2I link to support bandwidth-intensive
term sum rate of all links, the BS then adopts deep RL to applications, such as social networking and video streaming.
allocate spectrum for all V2V links automatically. Notably, In order to ensure the QoS of the CUs, we assume all V2I
from a global perspective, we optimize the CSI compression links are assigned orthogonal radio resources. Without loss of
and resource allocation jointly. In other words, we adopt a generality, we assume that each CU occupies one channel for
decision-oriented compression scheme to mitigate the harmful its uplink transmission. To improve the spectrum utilization
impact of CSI feedback, where the compression depends on efficiency, all V2V links share the spectrum resource with V2I
the specific resource allocation result. Here, the compression links. Therefore, N is also referred to as the channel set.
DNNs at local vehicles and the DQN at the BS are trained Denote the channel power gain from the n-th CU to the
jointly. This implies the compression is not important in BS on the n-th channel, i.e., the n-th V2I link, by gn [n].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
Let hk,B [n] represent the cross channel power gain from the III. BS D ECISION BASED S PECTRUM S HARING
transmitter of the k-th V2V link to the BS on the n-th channel. A RCHITECTURE
The received signal-to-interference-plus-noise-ratio (SINR) of As shown in Fig. 1, we adopt the deep RL approach
the n-th V2I link can be expressed as for resource allocation. In this section, we first design the
Pnc gn [n] DNN architecture of each V2V link and the deep Q-network
γnc [n] = PK , (1) (DQN) for centralized control at the BS, respectively. Then,
k=1 ρk [n]Pkd hk,B [n] + σ 2
we propose the centralized decision making and distributed
where Pnc and Pkd refer to the transmit powers of the n-th spectrum sharing architecture, termed C-Decision scheme. Fi-
V2I link and the k-th D2D pair, respectively, σ 2 represents nally, we introduce the binary feedback design for information
the noise power, and ρk [n] ∈ {0, 1} is the channel allocation compression.
indicator with ρk [n] = 1 if the k-th D2D user pair chooses
the n-th channel and ρk [n] = 0 otherwise.P We assume each
N A. V2V DNN Design
D2D pair only occupies one channel, i.e., n=1 ρk [n] ≤ 1.
Then, the capacity of the n-th V2I link on the n-th channel Here, we discuss the DNN at each V2V link to com-
can be written as press local observation for feedback. As shown in Fig.
1, each V2V link k first observes its surroundings, and
Cnc [n] = B log2 (1 + γnc [n]) , (2) obtains its transmission power, the current channel gains
where B denotes the channel bandwidth. and interference powers of all channels, which are de-
Similarly, hk [n] denotes the channel power gain of the k-th noted as hk = (hk [1] , ..., hk [n] , ..., hk [N ]) and Ik =
V2V link on the n-th channel. Meanwhile, hl,k [n] denotes the (Ik [1] , ..., Ik [n] , ..., Ik [N ]), respectively. Here, Ik [n] refers
cross channel power gain from the transmitter of the l-th D2D to the aggregated interference powers at the k-th V2V link on
pair to the receiver of the k-th D2D pair on the n-th channel. the n-th channel as shown in (4). To consider the impact of
Denote the cross channel power gain from the n-th CU to the V2V links on V2I links, the observation of the k-th V2V also
receiver of the k-th D2D pair on the n-th channel by gn,k [n]. needs to include the cross channel gain from the k-th V2V
Then, the SINR of the k-th V2V link over the n-th channel link to all V2I links, such as hk,B [n] , ∀n ∈ N . Then, the
can be written as observation of the k-th V2V can be written as
ρk [n]Pkd hk [n] ok = hk , Ik , Pkd , hk,B ,
(6)
γkd [n] = , (3)
Ik [n] + σ 2
where hk,B = (hk,B [1] , ..., hk,B [n] , ..., hk,B [N ]). Here, the
where the interference power for the k-th V2V link Ik [n] is channel information hk can be accurately estimated by the
K
X receiver of the k-th V2V link and we assume it is also available
Ik [n] = ρl [n]Pld hl,k [n] + Pnc gn,k [n] . (4) at the transmitter through delay-free feedback [30]. Similarly,
l6=k the received interference power over all channels Ik can be
PK measured at the k-th V2V receiver. Each V2V transmitter
In (4), the terms l6=k ρl [n]Pld hl,k [n] and Pnc gn,k [n] refer to knows its transmit power Pkd . Besides, the vector hk,B can
the interference of the other V2V links and the V2I link on be estimated at the BS and then broadcast to all V2V links in
the n-th channel, respectively. Hence the capacity of the k-th its coverage, which incurs a small signaling overhead [28].
V2V link on the n-th channel can be written as Then, the local observation, ok , is compressed using the
Ckd [n] = B log2 1 + γkd [n] . DNN at each V2V link. The compressed information, bk ,
(5)
which is the output of the DNN, is fed back to the DQN
In the V2X networks, a naive distributed approach will at the BS. To limit overhead on information feedback, each
allow each V2V link to select a channel independently such V2V link only reports the compressed information vector,
that its own data rate is maximized. However, local rate max- bk , instead of ok to the BS. Here, bk = {bk,j } is also
imization often leads to suboptimal global performance due known as the feedback vector of the k-th V2V link and the
to the interference among different V2V links. On the other term bk,j , ∀j ∈ {1, 2, ..., Nk } refers to the j-th feedback
hand, the BS in the V2X scenario has enough computational element of the k-th V2V, where Nk denotes the number
and storage resources to achieve efficient resource allocation. of feedback learned by the k-th V2V link. All V2V links
With the help of machine learning, we propose a centralized aim at maximizing their global sum rate in the long run
decision making scheme based on compressed information while minimizing the feedback information bk . Therefore, the
learned by each individual V2V link distributively. parameters of the DNNs at all V2V links and those of the
In order to achieve this goal, each V2V link first learns DQN will be jointly determined to maximize the sum rate of
to compress local observations, including the channel gain, the whole V2X network.
the observed interference from other V2V links and V2I link,
transmit power, etc., and then feeds the compressed informa-
tion back to the BS. According to feedback information from B. Deep Q-Network at the BS
all V2V links, the BS will make optimal decisions for all V2V To make a proper resource sharing decision, we introduce
links using RL. Then, the BS broadcasts the decision result to the deep RL architecture at the BS as shown in Fig. 1. In
all V2V links. order to maximize the long-term sum rate of all links, we
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
PN
VW99 where Ckd = n=1 Ckd [n] refers to the capacity of the k-th
VW9,
V2V on all the channels. Besides, λc and λd are nonnegative
weights to balance the performance of V2I links and V2V
VW'11 links.
kWK99 nWK9, The solution of the RL problem is related to the concept
of policy π (a, s), which defines the probabilities of choosing
each action in A when observing a state in S. The goal of
NWK'11 learning is to find an optimal policy π ∗ to maximize the
KWK'11 expected return Gt from anyPinitial state s0 . The expected
%6'41
%6 ∞ k
return is defined as Gt = k=0 γ Rt+k+1 , which is the
'HFLVLRQ NWK9,
)HHGEDFN
cumulative discounted return with a discount factor γ.
KWK99 To solve this problem, we resort to the Q-learning [31],
which is a well-known effective approach to tackle the RL
Fig. 1. Neural network architecture for V2V links and the BS in the C- problem, due to its model-free property where p (s0 , r|s, a) is
Decision scheme. not required a priori. Q-learning is based on the idea of action-
value function qπ (s, a) = Eπ [Gt |St = s, At = a] for a given
policy π, which means the expected return when the agent
resort to the RL technique by treating the BS as the agent. starts from the state s, takes action a, and thereafter follows the
In the RL, an agent interacts with its surroundings, named policy π. The optimal action-value function q ∗ (s, a) under the
as the environment, via taking actions, and then observes a optimal policy π ∗ satisfies the well-known Bellman optimality
corresponding numerical reward from the environment. The equations [32], which can be approached through an iterative
agent’s goal is to find optimal actions so that the expected update method:
sum of rewards is maximized. Mathematically, the RL can
be modelled as a Markov decision process (MDP). At each Q (St , At ) ← Q (St , At ) + α [Rt+1
discrete time slot t, the agent observes the current state St of
i (10)
+γ max Q (St+1 , a) − Q (St , At ) ,
the environment from the state space S and then chooses an a
action At from the action space A and one time step later
where α is the step-size parameter. Besides, the choice of
obtains a reward Rt+1 . Then, the environment evolves to the
action At in state St follows some exploratory policies, such
next state St+1 , with the transition probability p (s0 , r|s, a) ,
as the -greedy policy. For better understanding, the -greedy
Pr {St+1 = s0 , Rt+1 = r|St = s, At = a}.
policy can be expressed as
The BS treats all the learned feedback as the current state
s of the agent’s environment, which can be expressed as: (
arg max Q (s, a), with probability 1 − ;
A← a (11)
s = {b1 , b2 , ..., bK } , ∀k ∈ K. (7) a random action, with probability .
Then, the action of the BS is to determine the values of the Here, is also known as the exploration rate in the RL
channel indicators, ρk [n], for each V2V link. Thus, we define literature. Furthermore, it has been shown in [32] that with
the action a of the BS as a variant of the stochastic approximation conditions on α and
the assumption that all the state-action pairs continue to be
a = {ρ1 , ..., ρk , ..., ρK } , ∀k ∈ K, (8)
updated, Q converges with probability 1 to the optimal action-
where ρk = {ρk [n]} , ∀n ∈ N refers to the channel allocation value function q ∗ .
vector for the k-th V2V link. However, in many practical problems, the state and action
Finally, we design the reward for the BS, which is very space can be extremely large, which prevents storing all action-
crucial to the performance of RL. To maximize the long-term value functions in a tabular form. As a result, it is common
sum rate of V2V links while ensuring the QoS of V2I links in to adopt function approximation to estimate these action-value
the V2X scenario, we need to devise a mechanism to consider functions. Moreover, by doing so, we can generalize action-
the transmissions of V2V links and V2I links simultaneously. value functions from limited seen state-action pairs to to a
As we know, the V2V links usually carry the safety-critical much larger space.
messages, such as vehicle’s speed and emergency vehicle In [33], a DNN parameterized by θ is employed to represent
warning on the road, while the V2I links often support the the action-value function, thus called as DQN. DQN adopts
entertainment services [28]. Thus, we should guarantee the the -greedy policy to explore the state space and store the
transmission of V2V links as the primal target while making transition tuple (St , At , Rt+1 , St+1 ) in a replay memory (also
sure that the impact of V2V transmission on the V2I links can known as the replay buffer) at each time step. The replay
be tolerable and adjustable to some specific applications. To memory accumulates agent’s experiences over many episodes
this end, we model the reward of the BS as of the MDP. At each time step, a mini-batch of experiences
N K
D are uniformly sampled from the replay memory, called
R = λc
X
Cnc [n] + λd
X
Ckd , (9) experience replay, to update the network parameters θ with
n=1 k=1
variants of stochastic gradient descent method to minimize
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
the squared errors shown as follows: Details of the training framework for the C-Decision scheme
Xh i2 are provided in Algorithm 1. We define Ot = {otk } , ∀k ∈
Rt+1 + γ max Q St+1 , a; θ − − Q (St , At ; θ) ,
a
K as the observations of all V2Vs at the time step t ∈
t∈D {1, 2, ..., T }, where otk refers to the observation of the k-th
(12)
V2V at the time step t. Then, we can express the estimation
where θ − is the parameter set of a target Q-network, which is
of the return also known as the approximate target value [33]
duplicated from the training Q-network parameter set θ, and
as
fixed for a couple of updates with the aim of further improving
yt = Rt+1 + γ max Q Ot+1 , a; θ − ,
(13)
the stability of DQN. Besides, experience replay improves a
sample efficiency via repeatedly sampling experiences from
where Rt+1 and Q Ot+1 , a; θ − represent the reward of all
the replay memory and also breaks correlation in successive
links and the Q function of the target DQN with parameters θ −
updates, which also stabilizes the learning process.
under the next observation Ot+1 and the action a, respectively.
Then, the updating process for the BS DQN can be written as
C. Centralized Control and Distributed Transmission Archi- [33], [34]:
tecture
X ∂Q (Ot , at ; θ)
θ ←θ+β [yt − Q (Ot , at ; θ)] , (14)
Algorithm 1 Training algorithm for the C-Decision scheme ∂θ
t∈D
Input: the DNN model for each V2V, the DQN model for
the BS, the V2X environment simulator where β is the step size in one gradient iteration.
Output: the DNN for each V2V, the optimal control policy As for the testing phase, at each time step t, each V2V
π ∗ represented by a DQN Q with parameters θ adopts its observation otk as the input of the trained DNN to
obtain its learned feedback btk , and then sends it to the BS.
1: Initialize all DNNs and DQN models respectively
After that, the BS takes {btk } as the input of its trained DQN to
2: for episode l = 1, ..., Ltrain do
generate the decision result at , and broadcasts at to all V2Vs.
3: Start the V2X environment simulator, generate
Finally, each V2V chooses the specific channel indicated by
vehicles, V2V links and V2I links
at to transmit.
4: Initialize the beginning observations O0
Here, note that in Algorithm 1, the BS needs to feed the
5: Initialize the policy π randomly
error gradient in Eq. (14) back to each V2V link over the air
6: for time-step t = 1, ..., T do
in the training process, which nonetheless does not have to
7: Each V2V adopts the current observation otk as the
happen very frequently. Meanwhile, as for the testing period,
input of its DNN to learn the feedback btk , and
each V2V link just uses the trained DNN to compress its local
sends it to BS
observation and send the compressed CSI to the BS. On the
8: BS takes st = {btk } as the input of its DQN Q
other hand, one may propose to train the DNNs and DQN
9: BS chooses at according to st using some policy
separately to mitigate the overhead in the training process.
π derived from Q, e.g., -greedy strategy as
However, the separate training of DNNs and DQN for the
in (11)
proposed scheme is challenging since it is not immediately
10: BS broadcasts the action at to every V2V
clear what the appropriate training objective is for an opti-
11: Each V2V takes action based on at , and gets
mized end-to-end system performance.
the reward Rt+1 and the next observation ot+1 k
12: Save the data {Ot , at , Rt+1 , Ot+1 } into the replay
buffer B D. Spectrum Sharing with Binary Feedback
13: Sample a mini-batch of data D from B uniformly In order to further reduce feedback overhead, we propose
14: Use the data in D to train all V2Vs’ DNNs and a framework to quantize the V2V links’ real-valued feedback
BS’s DQN together as in (14) into several binary digits. In other words, we try to constrain
15: Each V2V updates its observation otk ← ot+1 k bk,j ∈ {−1, 1} , ∀k ∈ K, ∀j ∈ {1, 2, ..., Nk }. The binarization
16: Update the target network: θ − ← θ every Nu steps procedure can help force the neural networks to learn efficient
17: end for representations of the feedback information compared to the
18: end for standard floating-point layer. In other words, a binary layer can
make each V2V compress its observation more efficiently.
In this part, the architecture for the C-Decision scheme The binary quantization process consists of two steps [35].
is shown in Fig. 1. Each V2V link first observes its local The first step is to generate the required number of continuous
environment and then adopts a DNN to compress the observed feedback values in the continuous interval [−1, 1], which is
information into several real numbers, which are finally fed also equal to the desired number of the binary feedback. Then,
back to the BS for centralized decision making. The BS the second step takes the outputs of the first step as its input
takes the feedback information of all V2V links as the input, to produce the desired number of discrete feedback in the set
utilizes the DQN to perform Q-learning to decide the channel {−1, 1} for each output real-valued feedback of the first step.
allocation for all V2V links, and broadcasts its decision. For the first step, we adopt a fully-connected layer with
Finally, each V2V link chooses the BS-allocated channel for tanh activations, defined as tanh (x) = 1+e2−2x − 1, where
its transmission. we term this layer as the pre-binary layer. Here, the input
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
b (x) = (15)
−1, x < 0.
However, the gradient of this function is not continuous, chal- %6
%6$JJUHJDWLRQ'11
lenging the back propagation procedure for DNN training. As a
remedy to this, we adopt the identity function in the backward kWK'HFLVLRQ'41 )HHGEDFN
NWK9,
$*,
pass, which is known as the straight-through estimator [36]. KWK99 'HFLVLRQ
Combining these two steps together, we can express the full
binary feedback function as
KWK&RPSUHVVLRQ'11
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
Algorithm 2 Training algorithm for the D-Decision scheme better understanding, we provide related parameters and their
Input: the Compression DNN and Decision DQN for each corresponding settings in Table I. In addition, we list the
V2V, the Aggregation DNN for the BS, the V2X corresponding channel models for both V2V and V2I links
environment simulator respectively in Table II.
Output: the Compression DNN, the optimal policy πk∗
represented by the Decision DQN Qk with parameters TABLE I
θ k for each V2V, the Aggregation DNN for the BS S IMULATION PARAMETERS
1: Initialize all DNNs and DQNs models respectively Parameters Typical values
2: for episode l = 1, ..., Ltrain do Number of V2I links N 4
Number of V2V links K 4
3: Start the V2X environment simulator, generate Carrier frequency 2 GHz
vehicles, V2V links and V2I links Normalized Channel Bandwidth 1
4: Initialize the beginning observations O0 BS antenna height 25 m
5: Initialize the policy πk for each V2V randomly BS antenna gain 8 dBi
BS receive noise figure 5 dB
6: for time-step t = 1, ..., T do Vehicle antenna height 1.5 m
7: Each V2V adopts the current observation otk as the Vehicle antenna gain 3 dBi
input of its Compression DNN to learn the Vehicle receive noise figure 9 dB
feedback btk , and sends it to BS Vehicle speed randomly in [10, 15] km/h
Vehicle drop and mobility model Urban case of A.1.2 in [6]
8: BS takes {btk } as the input of its Aggregation V2I transmit power Pnc 23 dBm
DNN, and generates the AGI φt V2V transmit power Pkd 10 dBm
9: BS broadcasts the AGI φt to every V2V
10: Each V2V takes stk = {otk , φt } as the input of its The specific architecture of DNNs and BS DQN under the
Decision DQN Qk C-Decision scheme are summarized in Table III, where Nk
11: Each V2V chooses atk according to stk using some to refers the number of feedback for each V2V link and
policy πk derived from Qk , e.g., -greedy strategy FC denotes the fully connected (FC) layer respectively. In
as in (11) addition, the number of neurons in the output layer of the BS
12: Each V2V takes action atk , and gets the reward DQN is set as 256, which refers to all the possible channel
Rt+1 and the next observation ot+1k
allocations for all V2V links under current simulation setting.
13: Save the data {Ot , at , Rt+1 , Ot+1 } into the Besides, the settings for the DNNs and DQNs under the D-
buffer B Decision scheme are listed in Table IV. In general, we choose
14: Sample a mini-batch of data D from B uniformly the values of these parameters to balance training complexity
15: Use the data in D to train all V2Vs’ Compression and performance through a trial and error approach, in addition
DNNs and Decision DQNs and BS’s Aggregation to drawing lessons from some existing studies, such as [27],
DNN together as in (14) [28]. Firstly, the number of inputs for the DNN at each V2V
16: Each V2V updates its observation otk ← ot+1 k
link is determined by the number of local observations at
17: Each V2V updates its target network: θ − k ← θk
a given V2V link, which further depends on the number of
every Nu steps the available channels and which kinds of information related
18: end for to the spectrum sharing. Then, the number of outputs at
19: end for the DQN is determined by the number of possible resource
allocation options, which also relies on the number of available
spectrum resources. Then, as for the number of hidden layers
V. S IMULATION R ESULTS at the DNNs and the DQN, we have drawn lessons from the
In this section, we conduct extensive simulation to verify parameter selection of some existing works, such as [27], [28],
the performance of the proposed schemes. In particular, we and choose this number as 3. Besides, we choose the number
provide the simulation settings in Part A, and evaluate the of neurons in each hidden layer at the DNNs and the DQN
training performance of the C-Decision scheme in Part B. according to the principle where this number should be bigger
Then, we assess the testing performance under the real-valued than the number of inputs, which is known as parameter tuning
feedback and binary feedback in Parts C and D respectively. in the RL field. Then, we run multiple simulation trials under
Besides, we demonstrate the impacts of V2I and V2V links different numbers of neurons in the hidden layers of all DNNs
weights on the performance in Part E and the robustness of the and the DQN, and choose the setting with better performance
proposed scheme in Part F, respectively. Finally, we show the while taking the computational complexity into account to
training and testing performance of the D-Decision scheme in strike a balance between complexity and performance.
Part G. We use the rectified linear unit (ReLU) activation function
for both DNN and DQNs, defined as f (x) = max (0, x). Here,
the activation function of output layers in DNNs and DQNs is
A. Simulation Settings set as a linear function. Besides, the RMSProp optimizer [38]
The simulation scenario follows the urban case in Annex is adopted to update the network parameters with a learning
A of [6]. The simulation area size is 1, 299 m × 750 m, rate of 0.001. The loss function is set as the Huber loss [39].
where the BS is located in the center of this area. For We choose the weights λc = 0.1 and λd = 1 for V2I and
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
TABLE II
C HANNEL MODELS FOR V2I AND V2V LINKS
TABLE III
A RCHITECTURE FOR DNN AND BS DQN IN THE C-D ECISION SCHEME
Training Loss
0.005
TABLE IV
A RCHITECTURE FOR DNN S AND DQN S IN THE D-D ECISION SCHEME 0.004
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
Optimal Scheme
C-Decision achieve very similar performance. Particularly, the mini-batch
Average of C-Decision
80% Random Action size D = 512 achieves the best performance, which is good
Average of Random Action
enough considering the computational overhead in the training
70% process and the gained performance.
To demonstrate the performance of the proposed C-Decision
60%
scheme further, we compare the C-Decision scheme with the
method adapted from [28] termed as the MARL method.
50%
Besides, we choose the traditional approach for spectrum
40%
sharing adapted from [40], denoted as the SS-SOLEN scheme.
0 250 500 750 1000 1250 1500 1750 2000 The ARP performance of C-Decision, MARL and SS-SOLEN
Number of Testing Episodes
schemes are shown as below:
(a) Normalized return comparison
TABLE V
100% P ERFORMANCE C OMPARISON
80%
We can see that the proposed C-Decision scheme achieves
the best performance, meanwhile the SS-SOLEN also obtains
70%
quite good performance. However, the SS-SOLEN scheme
requires the perfect CSI of all V2V and V2I links at a
60%
centralized decision-making unit.
For better comparison, we can also adopt an alternative
50% Batchsize = 256
Batchsize = 512
approach, where we make a small change to the system model
Batchsize = 1000 in the C-Decision scheme, take the hk,B [n] as the input of
40%
0 1 2 3 4 5 6 7 8 9 the BS DQN directly, and term this scheme as the CSI-BS
Number of Real-valued Feedback
method. Besides, we provide the ARP performance of the CSI-
(b) ARP performance with real-valued feedback BS method and the proposed C-Decision scheme in Table VI:
Fig. 4. Performance evaluation for the C-Decision scheme with real-valued
feedback.
TABLE VI
ARP COMPARISON BETWEEN CSI-BS AND C-D ECISION SCHEMES
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
10
100%
1.0
90%
0.8
Average Return Percentage
80%
Empirical CDF
0.6
70%
Optimal λc = 0.1
0.4
60% Real FB λc = 0.1
Binary FB λc = 0.1
0.2 Optimal λc = 0.5
50% Batchsize = 256
Batchsize = 512
Real FB λc = 0.5
Batchsize = 1000 Binary FB λc = 0.5
40% 0.0
0 5 10 15 20 25 30 35 40 45 10 20 30 40 50 60
Number of Feedback Bits V2I Sum Rate per Step/[bps/Hz]
(a) V2I sum rate comparison
Fig. 5. The ARP performance for the C-Decision scheme with binary
feedback.
1.0
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
11
80%
noise compared with the real-valued feedback scheme.
70%
G. Performance Evaluation for the D-Decision Scheme
60%
Fig. 9 evaluates the training process of the D-Decision
scheme. Here, we choose D = 512, Nk = 3 and Ngr = 16,
50%
respectively. In particular, the training loss for the 1st V2V
Real Feedback = 3 in Fig. 9 (a) first decreases very slowly with some jitters
Binary Feedback = 36
40% with an increasing Ltrain , and then drops almost linearly, and
-20 -15 -10 -5 0 5 10 15 20 25 30
Noise to Input value ratio /[dB]
finally becomes nearly unchanged with the further increase
of Ltrain . The average return per episode under the D-
(a) The ARP performance under the noisy input
Decision scheme in Fig. 9 (b) first increases quickly with
the increase of Ltrain , and then increases slowly, and finally
100%
gradually converges despite some fluctuations, which shows
90%
the stability of the training process. Besides, we observe that
Ltrain = 10, 000 under the D-Decision scheme is much bigger
Average Return Percentage
80%
than Ltrain = 2, 000 under the C-Decision scheme, which
indicates that the D-Decision scheme converges more slowly
70%
than the C-Decision scheme. To train the whole neural network
well, we set Ltrain = 10, 000 under the D-Decision scheme.
60%
Besides, the exploration rate is linearly annealed from 1
to 0.01 over the beginning 8, 000 episodes and then keeps
50%
constant.
Real Feedback = 3 Then, the testing performance of the D-Decision scheme
Binary Feedback = 36
40%
with the increasing number of AGI values is shown in Fig.
-20 -15 -10 -5 0 5 10 15 20 25 30
Noise to Feedback value ratio/[dB]
10. In particular, Fig. 10 (a) illustrates the ARP performance
with the increasing number of real-valued AGI Ngr . Here,
(b) The ARP performance under the noisy feedback
we set the number of real-valued feedback which each V2V
Fig. 8. Impact of noise on the ARP performance. transmits to the BS as 3 as indicated by Fig. 4 (b). The
APR first increases with increasing Ngr , and then keeps nearly
unchanged with the further increase of Ngr . Especially, the
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
12
100%
0.5 90%
0.3 70%
60%
0.2
50%
0.1
Batchsize = 512
40%
0 2000 4000 6000 8000 10000 0 2 4 6 8 10 12 14 16 18 20
Number of Training Episodes Number of Real-valued Aggregated Global Information
(a) Training loss of the 1st V2V (a) ARP with real-valued aggregated global information
100%
11000
90%
10000
Average Return per Episode
8000 70%
7000 60%
6000 50%
Batchsize = 512
5000 40%
0 2000 4000 6000 8000 10000 0 10 20 30 40 50 60 70 80 90 100
Number of Training Episodes Number of Aggregated Global Information Bits
(b) Average return per episode (b) ARP with binary aggregated global information
Fig. 9. Training performance evaluation for the D-Decision scheme. Fig. 10. The ARP performance with two kinds of aggregated global
information.
ARP nearly achieves its maximal value 96% when Ngr = 16. when Ngb = 36. Similarly, compared with the C-Decision
In other words, the BS only needs 16 real-valued AGI to scheme with binary feedback, the D-Decision scheme with
represent the real-valued feedback of all V2V links to achieve the binary feedback only incurs 4% ARP degradation, which,
96% of the optimal performance. Furthermore, even when however, can be implemented in a fully distributed manner.
Ngr = 2, the ARP can still reach 90%, which is suitable for
the bandwidth-constrained broadcast channel of the BS. Com- VI. C ONCLUSION
pared with the C-Decision scheme, the D-Decision scheme In this paper, we proposed a novel C-Decision architecture
only incurs 2% ARP degradation. However, it can achieve to allow distributed V2V links to share spectrum efficiently
the fully distributed decision making and spectrum sharing, with the aid of the BS in V2X scenario and also devised
which is very appealing in the V2X scenario. In addition, the an approach to binarize the continuous feedback. To further
computational complexity for decision making under the D- facilitate distributed decision making, we have developed a
Decision scheme is greatly reduced compared with that under D-Decision scheme for each V2V link to make its own
the C-Decision scheme, which can further facilitate the fully decision locally and also designed the binary procedure for
distributed spectrum sharing in the V2X scenario. this scheme. Simulation results demonstrated that the num-
Besides, the testing performance of the D-Decision scheme ber of real-valued feedback can be quite small to achieve
with the binary AGI is evaluated in Fig. 10 (b). Here, we near-optimal performance. Meanwhile, the D-Decision scheme
choose the number of feedback bits as 36 for each V2V link can also gain near-optimal performance and enable a fully
and the number of real-valued AGI Ngr = 16. In Fig. 10 (b), distributed decision making, which is more appealing to the
the ARP first increases with the increasing number of AGI V2X networks. Besides, the quantization of the feedback or
bits Ngb , and then becomes nearly unchanged with the further AGI incurs small performance loss with an acceptable number
increase of Ngb . In particular, the APR reaches 90% when of bits under both schemes. Our proposed scheme is quite
Ngb = 80. Meanwhile, the APR is very close to 90% even immune to the variation of feedback interval, input noise, and
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
13
feedback noise respectively, which validates the robustness of [23] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode
the proposed scheme. In the future, we will investigate joint selection and resource management for green fog radio access networks,”
IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
power control and spectrum sharing issue in this scenario. [24] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep learning based wireless re-
source allocation with application to vehicular networks,” arXiv preprint
arXiv:1907.03289, 2019.
R EFERENCES [25] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, “Machine learning
for vehicular networks: Recent advances and application examples,”
[1] H. Seo, K. Lee, S. Yasukawa, Y. Peng, and P. Sartori, “LTE evolution IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94–101, Jun. 2018.
for vehicle-to-everything services,” IEEE Commun. Mag., vol. 54, no. 6, [26] L. Liang, H. Ye, and G. Y. Li, “Toward intelligent vehicular networks:
pp. 22–28, Jun. 2016. A machine learning framework,” IEEE Internet Things J., vol. 6, no. 1,
[2] S. Chen, J. Hu, Y. Shi, Y. Peng, J. Fang, R. Zhao, and L. Zhao, “Vehicle- pp. 124–135, Feb. 2019.
to-everything (V2X) services supported by LTE-based systems and 5G,” [27] H. Ye, G. Y. Li, and B. F. Juang, “Deep reinforcement learning
IEEE Commun. Standards Mag., vol. 1, no. 2, pp. 70–76, 2017. based resource allocation for V2V communications,” IEEE Trans. Veh.
[3] L. Liang, H. Peng, G. Y. Li, and X. Shen, “Vehicular communications: Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
A physical layer perspective,” IEEE Trans. Veh. Technol., vol. 66, no. 12, [28] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks
pp. 10 647–10 659, Dec. 2017. based on multi-agent reinforcement learning,” IEEE J. Sel. Areas Com-
[4] Y. Lin, Z. Cai, X. Wang, and F. Hao, “Incentive mechanisms for mun., vol. 37, no. 10, pp. 2282–2292, Oct. 2019.
crowdblocking rumors in mobile social networks,” IEEE Trans. Veh. [29] Y. Wang, K. Wang, H. Huang, T. Miyazaki, and S. Guo, “Traffic and
Technol., vol. 68, no. 9, pp. 9220–9232, Sep. 2019. computation co-offloading with reinforcement learning in fog computing
[5] H. Peng and L. Liang and X. Shen and G. Y. Li, “Vehicular communica- for industrial applications,” IEEE Trans. Ind. Informat., vol. 15, no. 2,
tions: A network layer perspective,” IEEE Trans. Veh. Technol., vol. 68, pp. 976–986, Feb. 2019.
no. 2, pp. 1064–1078, Feb. 2019. [30] Y. S. Nasir and D. Guo, “Deep reinforcement learning for dis-
[6] 3rd Generation Partnership Project, “Technical spefication group radio tributed dynamic power allocation in wireless networks,” arXiv preprint
access network: Study on LTE-based V2X services,” 3GPP, TR 36.885 arXiv:1808.00490, 2018.
V14.0.0, Jun. 2016. [31] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8,
[7] ——, “Study on enhancement of 3GPP support for 5G V2X services,” no. 3-4, pp. 279–292, Feb. 1992.
3GPP, TR 22.886 V15.1.0, Mar. 2017. [32] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
[8] C. Guo, L. Liang, and G. Y. Li, “Resource allocation for low-latency Cambridge, MA, USA: MIT Press, 2018.
vehicular communications: An effective capacity perspective,” IEEE J. [33] V. Mnih et al., “Human-level control through deep reinforcement learn-
Sel. Areas Commun., vol. 37, no. 4, pp. 905–917, Apr. 2019. ing,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[9] L. Liang, G. Y. Li, and W. Xu, “Resource allocation for D2D-enabled [34] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
vehicular communications,” IEEE Trans. Commun., vol. 65, no. 7, pp. with double Q-learning,” in Proc. 30th AAAI Conf., Feb. 2016, pp. 2094–
3186–3197, Jul. 2017. 2100.
[10] L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu, “Graph-based resource [35] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen,
sharing in vehicular communication,” IEEE Trans. Wireless Commun., S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compres-
vol. 17, no. 7, pp. 4579–4592, Jul. 2018. sion with recurrent neural networks,” arXiv preprint arXiv:1511.06085,
[11] C. Chen, B. Wang, and R. Zhang, “Interference hypergraph-based 2015.
resource allocation (IHG-RA) for NOMA-integrated V2X networks,” [36] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating
IEEE Internet Things J., vol. 6, no. 1, pp. 161–170, Feb. 2019. gradients through stochastic neurons for conditional computation,” arXiv
[12] C. Han, M. Dianati, Y. Cao, F. Mccullough, and A. Mouzakitis, preprint arXiv:1308.3432, 2013.
“Adaptive network segmentation and channel allocation in large-scale [37] Y. Bultitude and T. Rautiainen, “IST-4-027756 WINNER II d1. 1.2 v1.
V2X communication networks,” IEEE Trans. Commun., vol. 67, no. 1, 2 WINNER II channel models.”
pp. 405–416, Jan. 2019. [38] S. Ruder, “An overview of gradient descent optimization algorithms,”
[13] B. Bai, W. Chen, K. B. Letaief, and Z. Cao, “Low complexity outage arXiv preprint arXiv:1609.04747, 2016.
optimal distributed channel allocation for vehicle-to-vehicle communi- [39] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical
cations,” IEEE J. Sel. Areas Commun., vol. 29, no. 1, pp. 161–172, Jan. Learning: Data Mining, Inference, and Prediction. New York, NY,
2011. USA: Springer Science & Business Media, 2009.
[14] M. I. Ashraf, M. Bennis, C. Perfecto, and W. Saad, “Dynamic proximity- [40] W. Sun, E. G. Ström, F. Brännström, K. C. Sou, and Y. Sui, “Radio re-
aware resource allocation in vehicle-to-vehicle (V2V) communications,” source management for D2D-based V2V communication,” IEEE Trans.
in Proc. IEEE Globecom Workshops (GC Wkshps), Dec. 2016, pp. 1–6. Veh. Technol., vol. 65, no. 8, pp. 6636–6650, Aug. 2016.
[15] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot et al., “Mastering the game of Go with deep neural networks
and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
[16] T. O’Shea and J. Hoydis, “An introduction to deep learning for the
physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp.
563–575, Dec. 2017.
[17] Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, “Deep learning in physical layer
communications,” IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99,
Apr. 2019. Liang Wang (M’17) received the B.S. degree in
[18] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel telecommunications engineering and the Ph.D. de-
estimation and signal detection in OFDM systems,” IEEE Wireless gree in communication and information systems
Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018. from Xidian University, Xi’an, China, in 2009 and
[19] F. A. Aoudia and J. Hoydis, “End-to-end learning of communications 2015, respectively, and was a visiting scholar from
systems without a channel model,” arXiv preprint arXiv:1804.02276, 2018 to 2019 at Georgia Institute of Technology,
2018. Atlanta, GA, USA. He is currently an Associate Pro-
[20] C. Jiang, H. Zhang, Y. Ren, Z. Han, K. Chen, and L. Hanzo, “Ma- fessor with the Key Laboratory of Modern Teaching
chine learning paradigms for next-generation wireless networks,” IEEE Technology, and also with the School of Computer
Wireless Commun., vol. 24, no. 2, pp. 98–105, Apr. 2017. Science, Shaanxi Normal University, Xi’an, China.
[21] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, His research interests focus on dynamic spectrum
“Intelligent 5G: When cellular networks meet artificial intelligence,” access, energy-efficient transmission, applications of reinforcement learning,
IEEE Wireless Commun., vol. 24, no. 5, pp. 175–183, Oct. 2017. and robust design in wireless communications networks.
[22] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforce-
ment learning for dynamic multichannel access in wireless networks,”
IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun.
2018.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2020.2979124, IEEE
Transactions on Communications
14
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.