Resource Allocation For Low-Latency NOMA-V2X Networks Using Reinforcement Learning

IEEE INFOCOM 2021 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) | 978-1-6654-0443-3/21/$31.00 ©2021 IEEE | DOI: 10.1109/INFOCOMWKSHPS51825.2021.9484529
IEEE INFOCOM WKSHPS: GI 2021: IEEE Global Internet Symposium
Huiyi Ding
Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
E-mail: hyding@eee.hku.hk

Ka-Cheong Leung*
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China
E-mail: kcleung@ieee.org
* Corresponding author.

Abstract—With the development of the Internet of Things (IoT), vehicle-to-everything (V2X) plays an essential role in wireless communication networks. Vehicular communications face tremendous challenges in guaranteeing low-latency transmission of safety-critical information due to the dynamic channels caused by high mobility. To handle these challenges, non-orthogonal multiple access (NOMA) has been considered a promising candidate for future V2X networks. However, how to organize multiple transmission links with suitable resource allocation remains an open issue. In this paper, we investigate the problem of resource allocation for low-latency NOMA-integrated V2X (NOMA-V2X) communication networks. First, a cross-layer optimization problem is formulated to consider user scheduling and power allocation jointly while satisfying the quality-of-service (QoS) requirements, including the delay requirements, rate demands, and power constraints. To cope with the limited time-varying channel information, a machine learning based resource allocation algorithm is proposed to find solutions. Specifically, reinforcement learning is applied to learn the dynamic channel information for reducing the transmission delay. The numerical results indicate that our proposed algorithm can significantly reduce the system delay compared with other methods while satisfying the QoS requirements, so as to tackle the congestion issues for V2X communications.

Index Terms—Machine Learning, NOMA, Reinforcement Learning, Resource Allocation, Vehicular Networks, V2X

I. INTRODUCTION

Vehicle-to-everything (V2X) has attracted significant attention due to the rapid development of vehicular technologies for beyond 5th generation (B5G) communications. The number of vehicular applications has been increasing greatly, including autonomous driving, advanced driver assistance, and in-vehicle entertainment services. With the help of cellular networks and device-to-device communications, cellular-V2X (C-V2X) [1] enables diverse V2X communication modes simultaneously, including vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), vehicle-to-pedestrian (V2P), and vehicle-to-cloud (V2C).

To ensure improved quality of experience (QoE) of C-V2X, B5G networks require ultra-reliable low-latency communications (URLLC). The resource allocation and mode selection for V2X communication have been discussed in [2], where the orthogonal multiple access (OMA) technique is used to assign different spectra to individual users. However, limited radio resources cause significant competition among vehicles and may degrade network performance. The stringent latency and reliability requirements for V2X communications are hard to achieve through the existing OMA technologies. The successful realization of URLLC entails the advent of new technological concepts. Besides, inter-user interference always exists in OMA networks due to the carrier frequency offset (CFO) caused by the Doppler effect of moving vehicles, thereby limiting the system capacity.

To break through the restriction of limited frequency resources and reduce inter-user interference, non-orthogonal multiple access (NOMA) has been considered as a potential technique for vehicular networks, since it can support connections with higher throughput, lower latency, and higher reliability than OMA [3]. In particular, NOMA applies superposition coding (SC) at the transmitters for user multiplexing. It also uses successive interference cancellation (SIC) at the receivers to mitigate the inter-user interference. Hence, multiple users can transmit concurrently with different power levels on the same frequency domain in NOMA networks so as to avoid the interference caused by the CFO effect. Resource allocation is vital to improve the benefits of using NOMA. Specifically, due to the multiplexed characteristics of SC and SIC, the resource allocation for one user may affect the throughput for other users in NOMA-integrated V2X (NOMA-V2X) communication networks. Hence, there is a need to design a suitable resource allocation method to fully utilize the benefits of the NOMA technique.

Many existing studies focus on resource allocation in NOMA networks, especially on optimization-based approaches. In [4], an interference-hypergraph-based resource allocation algorithm has been devised, where the users with smaller interference are multiplexed. However, the system transmission rate is affected by the interference as well as the channel gain. The proposed fixed power allocation limits the performance gain of the NOMA-V2X networks. Therefore, there is no guarantee on the system transmission rate for the proposed heuristic method, which has only considered the minimization of the interference. The authors in [5] proposed a power allocation scheme to maximize the system achievable transmission rate while satisfying the predefined target threshold of each NOMA user in the vehicular networks. However,
the approach does not consider the user scheduling order, so it fails to satisfy the varying quality-of-service (QoS) requirements of different users. A mixed centralized-distributed scheme was employed in [6] to formulate the resource allocation problem. However, the proposed decentralized algorithm may cause severe interference with greedy transmitters. A common feature of optimization-based studies is that they require intensive computation to perform the reallocation under time-varying network conditions, such as fading and mobility. Hence, the aforementioned algorithms may incur huge computational and communication overhead during the reallocation.

Besides, the above studies focus only on physical-layer performance optimization, where the resource allocation policy adapts to the channel information only. They do not consider the stringent delay requirement for high-level packet transmission. A typical assumption is that the data flow is delay insensitive with an infinite backlog at the transmitter side. In that case, the spectrum resource may be allocated to a user with a good physical channel condition that has no packet to transmit in that period. Hence, resource allocation under this assumption cannot accurately capture the behavior of NOMA-V2X networks, leading to wasted spectrum resources and unsuccessful packet transmissions. It is difficult to formulate the latency and reliability constraints directly in model-based optimization methods, since there may not be closed-form expressions relating the optimization objective and the constraints. Although some approaches were proposed to deal with delay-aware resource allocation, the latency was modeled as a long-term time-average delay for the system in the optimization-based approach [7]. Those approaches have to know the statistical information on the channel state and the arrival data rate, while this prior knowledge is expensive to get or even unavailable.

To overcome this problem, machine learning (ML) has attracted some attention, providing systems with the ability to learn and improve automatically through experience. It is an efficient technique for solving complex problems so as to guarantee the latency and the reliability of the system with dynamic channel state information (CSI). Due to the absence of perfectly labeled data, reinforcement learning (RL) becomes a potential approach for resource allocation in NOMA-V2X communication networks. One of the advantages of RL approaches is that they can operate efficiently without relying on predefined models that characterize the environmental statistics: they train the agent to take actions based on the cumulative reward obtained from the environmental feedback. The authors in [8] applied RL to solve resource allocation problems for vehicle platoon networks. However, only the V2V mode is considered in that work with the number of multiplexed vehicular users less than two, thereby under-utilizing the spectrum resources in the system. In [9], RL is applied to solve the resource allocation problems for vehicular networks. Although two-user multiplexing is considered in this study, the SIC technique has not been considered to reduce the inter-user interference and enhance the benefits of employing NOMA.

Motivated by the aforementioned difficulties, we investigate the low-latency resource allocation for NOMA-V2X communication networks, aiming at packet-level delay minimization. Specifically, we devise a scheme to consider user scheduling and power allocation jointly, and formulate it as a mixed-integer programming problem to minimize the total system delay while satisfying other QoS requirements, including the rate and the power constraints for all users. In order to solve the complex problem efficiently, the RL approach is applied to determine the user scheduling order, which is then employed for power allocation by solving the transformed feasible problem.

The paper is organized as follows. In Section II, we describe the model of the system and formulate the optimization problem. The proposed RL-based algorithm is discussed in Section III. Performance evaluation is given in Section IV. Finally, Section V gives the conclusion.

Fig. 1. A NOMA-V2X System.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. V2X Communications

As illustrated in Fig. 1, a typical V2X communication system is considered in this work, consisting of one base station (BS) and vehicular users. In V2X applications, vehicular communications include the case in which messages are transmitted directly to the BS, such as updates of location information or traffic data. Moreover, communication may take place among different vehicles, such as the dissemination of congestion messages. Following the typical representation, we classify the V2X communications into two main types in our proposed scenario, namely, V2I communication and V2V communication. In our model, one BS with a single antenna serves $M$ uplink V2I users, denoted by $\mathcal{C} = \{1, 2, \ldots, C_{m-1}, C_m, C_{m+1}, \ldots, C_M\}$. Besides, $N$ V2V links reuse the spectrum with the V2I links, denoted by $\mathcal{D} = \{1, 2, \ldots, D_{n-1}, D_n, D_{n+1}, \ldots, D_N\}$. Each V2V link consists of one transmitter and one receiver. Each transmitter or receiver is assumed to be equipped with a single antenna, which can be extended to multiple antennas, similar to [10]. Every transmitter maintains a buffer queue to store the packets to
be transmitted. Time is divided into equal-sized transmission periods. The transmission packet load for vehicular user $i$ is $B_i$ in each transmission period. Due to the mobility of vehicles, semi-persistent scheduling (SPS) [11] is applied in the V2X communication networks. The resources are pre-allocated by the BS to the vehicles every transmission period. The system is assumed to operate in a certain transmission period with $K$ time slots, denoted by $\mathcal{K} = \{1, 2, \ldots, k-1, k, k+1, \ldots, K\}$, where the time interval $[k, k+1)$ is referred to as time slot $k$. At the beginning of each transmission period, the BS decides the scheduling order and is responsible for the power allocation. The scheduling order affects both the system throughput and the delay. Scheduling all users in the same time slot induces serious inter-user interference and may even be impractical for the hardware design of the SIC technique. Besides, the system transmission rate is affected by the power allocation. Hence, the user scheduling and power allocation schemes need to be carefully designed.

B. NOMA-V2X Networks

NOMA is a promising technique to improve the spectrum efficiency of wireless networks. By applying NOMA, multiple users can transmit data simultaneously on the same spectrum with different power levels. In particular, SIC is applied at the receivers to remove part of the inter-user interference. As indicated in [12], the optimal decoding order in NOMA transmission is the descending order of the channel gain, which achieves the maximum system capacity. Specifically, in order to decode the own signal of each user at the receiver side, the signals from other users with stronger channel gains are decoded first and removed from the combined signals, while the remaining signals from users with smaller channel gains are regarded as interference.

In order to improve the reliability and reduce the delay of the system, NOMA is applied in the V2X communication networks. In the proposed NOMA-V2X system model, the channel between each transmitter and receiver suffers from Rayleigh fading and path loss with time-varying CSI. The channel coefficient $H_{i,k}$ for link $i$ in time slot $k$ is defined as:

$$H_{i,k} = \left| G_{i,k} (L_{i,k})^{-z} \right|^2 \tag{1}$$

where $G_{i,k}$ and $z$ denote the Rayleigh fading and the path loss exponent between transceivers for each link, respectively. The distance between each transmitter and receiver is defined as:

$$L_{i,k} = d_{i,k} + v_{i,k} T_i^a \tag{2}$$

where $d_{i,k}$, $v_{i,k}$, and $T_i^a$ denote the initial distance at the start of a transmission period, the velocity, and the access delay for user $i$, respectively.

At each transmission period, a binary indicator, denoted as $X_{i,k}$, is the scheduling policy decided by the BS for link $i$ in time slot $k$. In particular, $X_{i,k} = 1$ when the link is scheduled; otherwise, $X_{i,k} = 0$. After applying NOMA, part of the inter-user interference can be canceled by using the SIC technique. The remaining interference to $C_m$ caused by other V2I users at time slot $k$ is:

$$I_{CC} = \sum_{C_i \in \mathcal{C}'} X_{C_i,k} P_{C_i,k} H_{C_i,k} \tag{3}$$

where $\mathcal{C}'$ is the set of V2I users with $H_{C_i,k} < H_{C_m,k}$, and $P_{C_i,k}$ denotes the transmit power for V2I user $C_i$ at time slot $k$. The interference to $C_m$ caused by other V2V links at time slot $k$ is:

$$I_{DC} = \sum_{D_i \in \mathcal{D}'} X_{D_i,k} P_{D_i,k} H_{D_i,k} \tag{4}$$

where $\mathcal{D}'$ is the set of V2V links with $H_{D_i,k} < H_{C_m,k}$, and $P_{D_i,k}$ denotes the transmit power for V2V link $D_i$ at time slot $k$. Thus, the received signal-to-interference-plus-noise ratio (SINR) for V2I user $C_m$ at time slot $k$ can be expressed as:

$$\Gamma_{C_m,k} = \frac{X_{C_m,k} P_{C_m,k} H_{C_m,k}}{I_{CC} + I_{DC} + N_0} \tag{5}$$

where $P_{C_m,k}$ and $N_0$ denote the transmit power for V2I user $C_m$ and the Gaussian noise at time slot $k$, respectively.

Similarly, the SINR for V2V link $D_n$ at time slot $k$ is written as:

$$\Gamma_{D_n,k} = \frac{X_{D_n,k} P_{D_n,k} H_{D_n,k}}{I_{CD} + I_{DD} + N_0} \tag{6}$$

where $P_{D_n,k}$ denotes the transmit power for V2V link $D_n$ at time slot $k$. The interference to $D_n$ caused by other V2I users at time slot $k$ is:

$$I_{CD} = \sum_{C_i \in \mathcal{C}''} X_{C_i,k} P_{C_i,k} H_{C_i,k} \tag{7}$$

where $\mathcal{C}''$ is the set of V2I users with $H_{C_i,k} < H_{D_n,k}$. The interference to $D_n$ caused by other V2V links at time slot $k$ is:

$$I_{DD} = \sum_{D_i \in \mathcal{D}''} X_{D_i,k} P_{D_i,k} H_{D_i,k} \tag{8}$$

where $\mathcal{D}''$ is the set of V2V links with $H_{D_i,k} < H_{D_n,k}$.

According to the Shannon-Hartley theorem [13], the achievable rates obtained by V2I user $C_m$ and V2V link $D_n$ at time slot $k$ can be calculated as:

$$R_{C_m,k} = \log_2 (1 + \Gamma_{C_m,k}) \tag{9}$$

$$R_{D_n,k} = \log_2 (1 + \Gamma_{D_n,k}) \tag{10}$$

Thus, the total system transmission rate in one transmission period can be expressed as:

$$R_{tot} = \sum_{C_m \in \mathcal{C}} \sum_{k \in \mathcal{K}} R_{C_m,k} + \sum_{D_n \in \mathcal{D}} \sum_{k \in \mathcal{K}} R_{D_n,k} \tag{11}$$

The delay for user $i$ is defined as $T_i = T_i^a + T_i^t$, where the access delay is $T_i^a = \sum_{k \in \mathcal{K}} k X_{i,k}$ and the transmission delay is $T_i^t = B_i / R_{i,k}$, where $B_i$ is the size of each transmission load. Therefore, the total system delay in one transmission period is given as:

$$T_{tot} = \sum_{C_m \in \mathcal{C}} T_{C_m} + \sum_{D_n \in \mathcal{D}} T_{D_n} \tag{12}$$
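As a rough illustration of (1) and (3)-(12), the following Python sketch evaluates the channel gain, the SINR after SIC, the rate, and the delay for a toy time slot. It is a sketch under stated assumptions rather than the authors' implementation: the function names, the dictionary-based link representation, and all numerical values are hypothetical. The V2I and V2V interference sums are merged into one loop because both sets are defined by the same comparison of channel gains, and the delay is expressed in normalized units.

```python
import math


def channel_gain(fading, distance, z):
    """Eq. (1): H = |G * L^(-z)|^2 for fading G, distance L, path loss exponent z."""
    return abs(fading * distance ** (-z)) ** 2


def slot_sinr(link, gains, powers, scheduled, noise):
    """Eqs. (3)-(8): SINR of `link` in one time slot after SIC.

    Scheduled links with a smaller channel gain than `link` remain as
    interference; stronger ones are assumed to be decoded and removed.
    """
    if link not in scheduled:  # X_{i,k} = 0
        return 0.0
    residual = sum(
        powers[j] * gains[j]
        for j in scheduled
        if j != link and gains[j] < gains[link]
    )
    return powers[link] * gains[link] / (residual + noise)


def slot_rate(link, gains, powers, scheduled, noise):
    """Eqs. (9)-(10): achievable rate log2(1 + SINR)."""
    return math.log2(1.0 + slot_sinr(link, gains, powers, scheduled, noise))


def link_delay(slot, load, rate):
    """T_i = T_i^a + T_i^t with access delay k and transmission delay B_i / R_{i,k}."""
    return slot + load / rate


# Hypothetical toy slot: two V2I users and one V2V link share the spectrum.
gains = {"C1": 1.0, "C2": 0.5, "D1": 0.25}
powers = {"C1": 1.0, "C2": 1.0, "D1": 1.0}
scheduled = ["C1", "C2", "D1"]
noise = 0.25

rates = {i: slot_rate(i, gains, powers, scheduled, noise) for i in scheduled}
total_rate = sum(rates.values())  # one-slot contribution to Eq. (11)
# Eq. (12), assuming every link is scheduled in slot k = 2 with load B_i = 3.
total_delay = sum(link_delay(2, 3.0, rates[i]) for i in scheduled)
print(rates, total_rate, total_delay)
```

With these toy values every link reaches a rate of 1 bit/s/Hz, since each weaker link only adds interference to the stronger ones. Changing `scheduled` shows how the scheduling decision alone changes both the rate and the delay.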
C. Problem Formulation

Compared with traditional cellular communications, the newly proposed C-V2X applications, such as sensor monitoring and autonomous vehicle platooning, require tighter latency and higher reliability [14]. Considering the stringent requirements in NOMA-V2X communication networks, we aim to propose an algorithm to minimize the system latency while satisfying the reliability requirements. In particular, the reliability requirements can be expressed as the transmission rate and power constraints. A transmission rate larger than the threshold with an acceptable transmission power guarantees the successful transmission for each user. According to (11) and (12), the system performance is determined by the user scheduling order and the power allocation. Therefore, we formulate a joint user scheduling and power allocation scheme as an optimization problem, which can be expressed as:

$$
\begin{aligned}
\min_{\mathbf{X}, \mathbf{P}} \quad & T_{tot} \\
\text{s.t.} \quad & C1: 0 \le P_i \le P_i^{max}, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C2: R_i \ge R_i^{th}, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C3: \sum_{k \in \mathcal{K}} X_{i,k} = 1, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C4: \sum_{i \in \mathcal{C} \cup \mathcal{D}} X_{i,k} \le X^{max}, \quad k \in \mathcal{K}
\end{aligned}
\tag{13}
$$

where $P_i^{max}$ and $R_i^{th}$ denote the maximum transmit power and the rate threshold of successful transmission, respectively. In (13), C1 limits the peak transmit power for each user. C2 maintains the successful transmission for each user. C3 and C4 are the binary constraints for the user scheduling variables, which make sure that each user is scheduled during a transmission period while the total number of multiplexed users satisfies the limit imposed by the SIC hardware design.

III. RL-BASED RESOURCE ALLOCATION SCHEME

The aforementioned optimization problem (13) is an NP-hard mixed-integer programming problem, which is difficult to solve with conventional optimization-based methods. Although some algorithms, such as the branch-and-bound method, can approximately solve the problem, the computational complexity is still high. Besides, such optimization-based algorithms can only solve the problem with known CSI for the $K$ time slots. However, that information cannot be accurately predicted in real situations. In order to overcome these challenges, ML can be applied to find solutions for resource allocation. Specifically, RL is a suitable candidate to deal with imperfect information, since it can interact with a dynamic channel environment and learn both the behaviors of users and the environmental conditions online so as to maximize the system utility.

Hence, we reformulate the optimization problem as an RL task described by a four-tuple $\phi = \{\mathbf{a}, \mathbf{s}, r, \pi\}$, where $\mathbf{a}$, $\mathbf{s}$, $r$, and $\pi$ denote the action space, state space, reward, and action policy, respectively. In state $s(t)$, the action $a(t)$ taken by the agent is chosen by the policy $\pi$ based on the collective information of the environment. The reward $r(t)$ is obtained by taking the specific action, and the environment evolves into the next state $s(t+1)$. The specific settings are discussed in the following:

1) State: In this work, the BS is treated as the agent. The state of the environment is characterized by the CSI of the NOMA-V2X communication network. The environment state for link $i$ at episode $t$ is thus defined by $s_i(t) = [H_{i,1}(t), \ldots, H_{i,K}(t)] \in \mathbf{s}$, which reflects the environment dynamics. The size of the state space is denoted as $N_s$. Specifically, we assume that the BS has limited knowledge about imperfect CSI with path loss information, due to the mobility of vehicles in the NOMA-V2X communication networks.

2) Action: Due to the unpredictable environmental changes, the BS needs to take the appropriate action for the user scheduling and power allocation based on the observed channel information. Based on (13), for a given user scheduling order, the optimal power allocation can be obtained by solving a convex problem:

$$
\begin{aligned}
\min_{\mathbf{P}} \quad & T_{tot} \\
\text{s.t.} \quad & C1: 0 \le P_i \le P_i^{max}, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C2: R_i \ge R_i^{th}, \quad i \in \mathcal{C} \cup \mathcal{D}
\end{aligned}
\tag{14}
$$

The convexity can be proved in a way similar to [15], and the proof is omitted here due to space constraints. Thus, instead of formulating it in the action space, we convert it into a penalty to ensure that the QoS requirements of the system can be satisfied. To deal with the discrete problem of user scheduling, we apply the Q-learning framework to learn the NOMA-V2X communication networks. Q-learning seeks to find the best actions from the action space with given state information by creating and updating a Q-table. The action for each link $i$ at episode $t$ is denoted as $a_i(t) = [X_{i,1}(t), \ldots, X_{i,K}(t)] \in \mathbf{a}$, which is chosen according to the $\epsilon$-greedy policy $\pi$ as:

$$
\pi: \quad a(t) =
\begin{cases}
\arg\max_{a'(t)} Q(s(t), a'(t)), & \text{with probability } \epsilon \\
\text{random action } a'(t), & \text{with probability } 1 - \epsilon
\end{cases}
\tag{15}
$$

Specifically, the $\epsilon$-greedy policy here takes the greedy action $a(t)$ that maximizes the Q-value with probability $\epsilon$, which guarantees the performance of the policy. Besides, a random action is chosen with probability $1 - \epsilon$ to ensure sufficient exploration of the problem for finding the globally optimal policy.

3) Reward: The reward is regarded as the feedback from the environment. The agent has a higher probability of selecting an action that brings a larger reward value. In this work, the aim of the BS consists of reducing the delay and increasing the transmission rate. According to the characteristics of the system model (13), the reward function at episode $t$ is defined as:

$$r(t) = -T_{tot}(t) + c \cdot \sum_{i \in \mathcal{C} \cup \mathcal{D}} I\big(R_i(t) - R_i^{th}(t)\big) \tag{16}$$

where the first term is a utility function to minimize the delay, and the second term is the penalty part, which ensures the requirements on the system transmission rates, where $I(\cdot)$ is a sign function and it equals one when the rate requirement
is satisfied. After obtaining the reward, the Q-table in the Q-learning model will be updated by:

$$Q(s(t), a(t)) \leftarrow Q(s(t), a(t))(1 - \alpha) + \alpha \Big[ r(t) + \gamma \max_{a'(t)} Q(s(t+1), a'(t)) \Big] \tag{17}$$

where $\alpha$ and $\gamma$ denote the learning rate and the reward decay, respectively. The details of the proposed RL-based resource allocation algorithm, denoted as RL-RA, are shown in Algorithm 1.

Algorithm 1 RL-RA
1: Initialize state space, action space, and Q-table
2: Set $E_{max}$, $\epsilon$, $\alpha$, and $\gamma$
3: for each episode $e < E_{max}$ do
4:   if in random-choice policy then
5:     Randomly choose action $\mathbf{X}$
6:   end if
7:   if in $\epsilon$-greedy policy then
8:     Choose action $\mathbf{X}$ with maximum Q-value
9:   end if
10:  Get $\mathbf{P}$ by solving the convex problem (14)
11:  Observe the state transition; update Q-table
12: end for
13: Output: $\mathbf{X}$ and $\mathbf{P}$

Specifically, Step 1 initializes the variables for the NOMA-V2X network and the RL framework. In Steps 4-9, the action for user scheduling is taken by the BS following the policy $\pi$. Step 10 shows the power allocation approach. The updating of environmental variables is illustrated in Step 11. The computational complexity of RL-RA is $O((N + M) K N_e)$, where $N_e$ denotes the number of episodes. To show the convergence of the algorithm, we show that the following relation:

$$f(\mathbf{X}^*, \mathbf{P}^*) \overset{(A)}{\geq} f(\mathbf{X}, \mathbf{P}^*) \overset{(B)}{\geq} f(\mathbf{X}, \mathbf{P}) \tag{18}$$

always holds, where $f$, $\mathbf{X}^*$, and $\mathbf{P}^*$ denote the original optimization problem, the optimal user scheduling, and the optimal power allocation, respectively. Inequality (A) follows from the convergence results on RL approaches [16], while inequality (B) holds from the definition of convex optimization [17]. We, therefore, conclude that the solution found by Algorithm 1 converges, since the difference in values between the calculated solution and the true optimal one is bounded.

IV. PERFORMANCE EVALUATION

In this section, we illustrate the performance of the proposed algorithm using Python 3.6. We consider a NOMA-V2X communication network with one BS located at the cell center. The cell radius is set as 50 meters to fulfill the requirement of vehicular communications, similar to [14]. The path loss exponent and the noise power density are set as three and −174 dBm/Hz, respectively. The numbers of V2I and V2V links are set as four and six, respectively, and the maximum speed for each user is 60 km/h, which follows [5]. The maximum transmit powers for V2I and V2V users are set as 4 watts and 2 watts, respectively. As far as the Q-learning parameters are concerned, the reward decay $\gamma$, the learning rate $\alpha$, and the greedy probability coefficient $\epsilon$ are set as 0.5, 0.1, and 0.95, respectively. We compare the total transmission delay of our proposed algorithm (RL-RA) with two commonly used baseline resource allocation algorithms, namely: A1) the heuristic algorithm proposed in [5], which focuses on the power allocation with random user scheduling, denoted as RANDOM-US; and A2) the method proposed in [6], which concentrates on the physical layer optimization in order to maximize the system transmission rate, denoted as MAX-RATE.

[Figure: plot titled "Delay versus episodes"; total delay (ms) against episodes (0 to 100) for RL-RA, MAX-RATE, and RANDOM-US.]
Fig. 2. Total system delay comparison among various algorithms.

Fig. 2 illustrates the achieved total system delay of different algorithms. It can be seen that our proposed RL-RA algorithm outperforms both the MAX-RATE and RANDOM-US methods with a smaller total delay. The performance of the RL-RA algorithm improves and finally converges as the number of episodes increases. The total system delay for the MAX-RATE algorithm is even larger than that of RANDOM-US, since the MAX-RATE algorithm only considers the physical layer optimization for the system transmission rate. It therefore tends to multiplex a smaller number of users so as to reduce the interference and maximize the transmission rate, thereby causing a larger system delay. Besides, some spectrum resources may be allocated to users without packets to send. Therefore, the trade-off between system throughput and total delay should be considered carefully in the resource allocation scheme.

Fig. 3 shows the total system delay of our proposed RL-RA algorithm for various numbers of V2V users in the system. It is observed that the total system delay is reduced with increasing episodes and finally converges for all the cases in the RL-RA algorithm. Besides, the total system delay for the two-V2V system is less than those of the four-V2V and six-V2V systems. It shows that the number of total users affects the user scheduling order.
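The learning procedure whose convergence is discussed here combines the action selection rule (15) with the Q-table update (17). A minimal tabular sketch in Python is given below, assuming integer-indexed states and actions and a list-of-lists Q-table. The function names and the toy numbers are hypothetical and are not taken from the authors' code. Following the convention of this paper, the greedy action is taken with probability ε and a random action otherwise.

```python
import random


def choose_action(q_row, epsilon, rng):
    """Policy (15): greedy with probability epsilon, uniformly random otherwise."""
    if rng.random() < epsilon:
        # Index of the largest Q-value in this state's row.
        return max(range(len(q_row)), key=q_row.__getitem__)
    return rng.randrange(len(q_row))


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Update (17): blend the old Q-value with the bootstrapped target."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] = (1 - alpha) * q_table[state][action] + alpha * target


# Settings reported in Section IV: gamma = 0.5, alpha = 0.1, epsilon = 0.95.
GAMMA, ALPHA, EPSILON = 0.5, 0.1, 0.95

q_table = [[0.0, 0.0], [0.0, 2.0]]  # hypothetical table: 2 states, 2 actions
rng = random.Random(0)
action = choose_action(q_table[1], EPSILON, rng)
# Hypothetical transition: reward -4.0 (a total delay of 4) from state 0 to state 1.
q_update(q_table, state=0, action=1, reward=-4.0, next_state=1,
         alpha=ALPHA, gamma=GAMMA)
print(action, q_table[0][1])
```

In RL-RA the reward passed to the update would come from (16) after the power allocation step. Here it is a fixed placeholder so that the update rule can be checked in isolation.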
[Figure: plot titled "Delay for NOMA-V2X with RL"; total delay (ms) against episodes (0 to 100) for the Two-V2V, Four-V2V, and Six-V2V settings.]
Fig. 3. Total system delay with different number of users.

[Figure: plot titled "Delay for NOMA-V2X with RL"; total delay (ms) against episodes (0 to 100) for EPSILON = 0.85, 0.9, and 0.95.]
Fig. 4. Total system delay with different learning settings.

Besides, it can be seen that the convergence speed is faster for the system with a smaller number of V2V users. Indeed, the system with a smaller number of V2V users has a smaller action space and thereby converges faster.

We further investigate the impact of the greedy probability coefficient $\epsilon$ in the learning setting of the RL-RA algorithm. As shown in Fig. 4, the convergence speed increases with $\epsilon$. Indeed, $\epsilon$ relates to the action choice. A larger value of $\epsilon$ means that there is a larger probability of choosing the greedy action with the maximum reward. Otherwise, a random action is chosen to explore the environment. The greedy action ensures the performance of the policy based on the known environment, and the random action encourages the exploration of the state space for the unknown environment.

V. CONCLUSION

In this work, we study a low-latency NOMA-V2X network that supports both V2I and V2V communications. To deal with the delay requirements, we formulate a mixed-integer programming optimization problem for the resource allocation scheme. In order to deal with the complex cross-layer problem with dynamic channel information, we propose an RL-based resource allocation algorithm to jointly consider the user scheduling and power allocation schemes. Performance evaluation is conducted to demonstrate that the proposed RL-RA algorithm can significantly reduce the total system delay while satisfying the rate and power QoS requirements, compared with other baseline methods. In the future, we plan to extend our work to multi-cell systems to analyze the complex scenarios among vehicles across cells.

REFERENCES

[1] M. Gonzalez-Martín, M. Sepulcre, R. Molina-Masegosa, and J. Gozalvez, “Analytical Models of the Performance of C-V2X Mode 4 Vehicular Communications,” IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1155–1166, Dec. 2019.
[2] X. Li, L. Ma, R. Shankaran, Y. Xu, and M. A. Orgun, “Joint Power Control and Resource Allocation Mode Selection for Safety-related V2X Communication,” IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 7970–7986, Jun. 2019.
[3] D. Zhang, Y. Liu, L. Dai, A. K. Bashir, A. Nallanathan, and B. Shim, “Performance Analysis of Decentralized V2X System with FD-NOMA,” in 2019 IEEE 90th Vehicular Technology Conference, Sep. 2019.
[4] C. Chen, B. Wang, and R. Zhang, “Interference Hypergraph-Based Resource Allocation (IHG-RA) for NOMA-Integrated V2X Networks,” IEEE Internet of Things Journal, vol. 6, no. 1, pp. 161–170, Oct. 2019.
[5] H. Xiao, Y. Chen, S. Ouyang, and A. T. Chronopoulos, “Power Control for Clustering Car-Following V2X Communication System With Non-Orthogonal Multiple Access,” IEEE Access, vol. 7, pp. 68160–68171, May 2019.
[6] B. Di, L. Song, Y. Li, and G. Y. Li, “Non-Orthogonal Multiple Access for High-Reliable and Low-Latency V2X Communications in 5G Systems,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 10, pp. 2383–2397, Jul. 2017.
[7] H. Ding and K.-C. Leung, “Cross-Layer Resource Allocation in NOMA Systems with Dynamic Traffic Arrivals,” in 2020 IEEE Wireless Communications and Networking Conference (WCNC), May 2020.
[8] S. Xu, C. Guo, and Z. Li, “NOMA Enabled Resource Allocation for Vehicle Platoon-Based Vehicular Networks,” in 2019 IEEE Globecom Workshops (GC Wkshps), Dec. 2019.
[9] Y. Xu, C. Yang, M. Hua, and W. Zhou, “Deep Deterministic Policy Gradient (DDPG)-Based Resource Allocation Scheme for NOMA Vehicular Communications,” IEEE Access, vol. 8, pp. 18797–18807, Jan. 2020.
[10] J. Ding and J. Cai, “Two-Side Coalitional Matching Approach for Joint MIMO-NOMA Clustering and BS Selection in Multi-Cell MIMO-NOMA Systems,” IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2006–2021, 2020.
[11] Third Generation Partnership Project (3GPP), “Technical Specification Group Radio Access Network,” Study on LTE-Based V2X Services, 3GPP TR 36.885 V0.5.0, Release 14, Feb. 2016.
[12] L. Dai, B. Wang, Y. Yuan, S. Han, C. I, and Z. Wang, “Non-orthogonal Multiple Access for 5G: Solutions, Challenges, Opportunities, and Future Research Trends,” IEEE Communications Magazine, vol. 53, no. 9, pp. 74–81, Sep. 2015.
[13] J. R. Pierce, “An Introduction to Information Theory: Symbols, Signals and Noise,” Courier Corporation, vol. 2, Nov. 1980.
[14] M. Patra, R. Thakur, and C. S. R. Murthy, “Improving Delay and Energy Efficiency of Vehicular Networks Using Mobile Femto Access Points,” IEEE Transactions on Vehicular Technology, vol. 66, no. 2, pp. 1496–1505, May 2017.
[15] Z. Yang, W. Xu, C. Pan, Y. Pan, and M. Chen, “On the Optimality of Power Allocation for NOMA Downlinks With Individual QoS Constraints,” IEEE Communications Letters, vol. 21, no. 7, pp. 1649–1652, Mar. 2017.
[16] S. Singh, T. Jaakkola, and M. L. Littman, “Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms,” Machine Learning, vol. 38, no. 3, pp. 287–308, Mar. 2000.
[17] K. Wang, W. Liang, Y. Yuan, Y. Liu, Z. Ma, and Z. Ding, “User Clustering and Power Allocation for Hybrid Non-Orthogonal Multiple Access Systems,” IEEE Transactions on Vehicular Technology, vol. 68, no. 12, pp. 12052–12065, Oct. 2019.
