Abstract—With the development of the Internet of Things (IoT), vehicle-to-everything (V2X) plays an essential role in wireless communication networks. Vehicular communications face tremendous challenges in guaranteeing low-latency transmission of safety-critical information because of the dynamic channels caused by high mobility. To handle these challenges, non-orthogonal multiple access (NOMA) has been considered a promising candidate for future V2X networks. However, how to organize multiple transmission links with suitable resource allocation remains an open issue. In this paper, we investigate the problem of resource allocation for low-latency NOMA-integrated V2X (NOMA-V2X) communication networks. First, a cross-layer optimization problem is formulated to consider user scheduling and power allocation jointly while satisfying the quality-of-service (QoS) requirements, including the delay requirements, rate demands, and power constraints. To cope with the limited time-varying channel information, a machine-learning-based resource allocation algorithm is proposed to find solutions. Specifically, reinforcement learning is applied to learn the dynamic channel information for reducing the transmission delay. The numerical results indicate that our proposed algorithm can significantly reduce the system delay compared with other methods while satisfying the QoS requirements, so as to tackle the congestion issues for V2X communications.

Index Terms—Machine Learning, NOMA, Reinforcement Learning, Resource Allocation, Vehicular Networks, V2X

I. INTRODUCTION

Vehicle-to-everything (V2X) has attracted significant attention due to the rapid development of vehicular technologies for beyond 5th generation (B5G) communications. The number of vehicular applications is increasing rapidly, including autonomous driving, advanced driver assistance, and in-vehicle entertainment services. With the help of cellular networks and device-to-device communications, cellular-V2X (C-V2X) [1] enables diverse V2X communication modes simultaneously, including vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), vehicle-to-pedestrian (V2P), and vehicle-to-cloud (V2C).

To ensure improved quality of experience (QoE) of C-V2X, B5G networks require ultra-reliable low-latency communications (URLLC). The resource allocation and mode selection for V2X communication have been discussed in [2], where the orthogonal multiple access (OMA) technique is used to assign different spectra to individual users. However, limited radio resources cause significant competition among vehicles and may degrade network performance. The stringent latency and reliability requirements for V2X communications are hard to achieve through the existing OMA technologies. The successful realization of URLLC entails the advent of new technological concepts. Besides, inter-user interference always exists in OMA networks due to the carrier frequency offset (CFO) caused by the Doppler effect of moving vehicles, thereby limiting the system capacity.

To break through the restriction of limited frequency resources and reduce inter-user interference, non-orthogonal multiple access (NOMA) has been considered a potential technique for vehicular networks, since it can support connections with higher throughput, lower latency, and higher reliability than OMA [3]. In particular, NOMA applies superposition coding (SC) at the transmitters for user multiplexing. It also uses successive interference cancellation (SIC) at the receivers to mitigate the inter-user interference. Hence, multiple users can transmit concurrently with different power levels on the same frequency resources in NOMA networks so as to avoid the interference caused by the CFO effect. Resource allocation is vital to realizing the benefits of NOMA. Specifically, due to the multiplexed characteristics of SC and SIC, the resource allocation for one user may affect the throughput of other users in NOMA-integrated V2X (NOMA-V2X) communication networks. Hence, there is a need to design a suitable resource allocation method to fully utilize the benefits of the NOMA technique.

Many existing studies focus on resource allocation in NOMA networks, especially optimization-based approaches. In [4], an interference-hypergraph-based resource allocation algorithm has been devised, where the users with smaller interference are multiplexed. However, the system transmission rate is affected by the interference as well as the channel gain. The proposed fixed power allocation limits the performance gain of the NOMA-V2X networks. Therefore, there is no guarantee on the system transmission rate for the proposed heuristic method, which has only considered the minimization of the interference. The authors in [5] proposed a power allocation scheme to maximize the system achievable transmission rate while satisfying the predefined target thresh-

* Corresponding author.
Authorized licensed use limited to: The University of Hong Kong Libraries. Downloaded on August 08,2021 at 00:50:28 UTC from IEEE Xplore. Restrictions apply.
IEEE INFOCOM WKSHPS: GI 2021: IEEE Global Internet Symposium
be transmitted. Time is divided into equal-sized transmission periods. The transmission packet load for vehicular user $i$ is $B_i$ in each transmission period. Due to the mobility of vehicles, semi-persistent scheduling (SPS) [11] is applied in the V2X communication networks. The resources are pre-allocated by the BS to the vehicles every transmission period. The system is assumed to operate in a certain transmission period with $K$ time slots, denoted by $\mathcal{K} = \{1, 2, ..., k-1, k, k+1, ..., K\}$, where the time interval $[k, k+1)$ is referred to as time slot $k$. At the beginning of each transmission period, the BS decides the scheduling order and is responsible for the power allocation. The scheduling order affects both system throughput and delay. Scheduling all users in the same time slot induces serious inter-user interference and may even be impractical for the hardware design of the SIC technique. Besides, the system transmission rate is affected by power allocation. Hence, the user scheduling and power allocation schemes need to be carefully designed.

B. NOMA-V2X Networks

NOMA is a promising technique to improve the spectrum efficiency of wireless networks. By applying NOMA, multiple users can transmit data simultaneously on the same spectrum with different power levels. In particular, SIC is applied at the receivers to remove part of the inter-user interference. As indicated in [12], the optimal decoding order in NOMA transmission is the descending order of channel gain, which achieves the maximum system capacity. Specifically, in order to decode each user's own signal at the receiver side, the signals from other users with stronger channel gain are decoded first and removed from the combined signal, while the remaining signals from users with smaller channel gain are regarded as interference.

In order to improve the reliability and reduce the delay of the system, NOMA is applied in the V2X communication networks. In the proposed NOMA-V2X system model, the channel between each transmitter and receiver suffers from Rayleigh fading and path loss with time-varying CSI. The channel coefficient $H_{i,k}$ for link $i$ in time slot $k$ is defined as:

$$H_{i,k} = \left| G_{i,k} (L_{i,k})^{-z} \right|^2 \tag{1}$$

where $G_{i,k}$ and $z$ denote the Rayleigh fading and the path loss exponent between transceivers for each link, respectively. The distance between each transmitter and receiver is defined as:

$$L_{i,k} = d_{i,k} + v_{i,k} T_i^a \tag{2}$$

where $d_{i,k}$, $v_{i,k}$, and $T_i^a$ denote the initial distance at the start of a transmission period, the velocity, and the access delay for user $i$, respectively.

At each transmission period, a binary indicator, denoted as $X_{i,k}$, is the scheduling policy decided by the BS for link $i$ in time slot $k$. In particular, $X_{i,k} = 1$ when the link is scheduled; otherwise, $X_{i,k} = 0$. After applying NOMA, part of the inter-user interference can be canceled by using the SIC technique. The remaining interference to $C_m$ caused by other V2I users at time slot $k$ is:

$$I_{CC} = \sum_{C_i \in \mathcal{C}'} X_{C_i,k} P_{C_i,k} H_{C_i,k} \tag{3}$$

where $\mathcal{C}'$ is the set of V2I users with $H_{C_i,k} < H_{C_m,k}$, and $P_{C_i,k}$ denotes the transmit power for V2I user $C_i$ at time slot $k$. The interference to $C_m$ caused by other V2V links at time slot $k$ is:

$$I_{DC} = \sum_{D_i \in \mathcal{D}'} X_{D_i,k} P_{D_i,k} H_{D_i,k} \tag{4}$$

where $\mathcal{D}'$ is the set of V2V links with $H_{D_i,k} < H_{C_m,k}$, and $P_{D_i,k}$ denotes the transmit power for V2V link $D_i$ at time slot $k$. Thus, the received signal-to-interference-plus-noise ratio (SINR) for V2I user $C_m$ at time slot $k$ can be expressed as:

$$\Gamma_{C_m,k} = \frac{X_{C_m,k} P_{C_m,k} H_{C_m,k}}{I_{CC} + I_{DC} + N_0} \tag{5}$$

where $P_{C_m,k}$ and $N_0$ denote the transmit power for V2I user $C_m$ and the Gaussian noise at time slot $k$, respectively. Similarly, the SINR for V2V link $D_n$ at time slot $k$ is written as:

$$\Gamma_{D_n,k} = \frac{X_{D_n,k} P_{D_n,k} H_{D_n,k}}{I_{CD} + I_{DD} + N_0} \tag{6}$$

where $P_{D_n,k}$ denotes the transmit power for V2V link $D_n$ at time slot $k$. The interference to $D_n$ caused by other V2I users at time slot $k$ is:

$$I_{CD} = \sum_{C_i \in \mathcal{C}''} X_{C_i,k} P_{C_i,k} H_{C_i,k} \tag{7}$$

where $\mathcal{C}''$ is the set of V2I users with $H_{C_i,k} < H_{D_n,k}$. The interference to $D_n$ caused by other V2V links at time slot $k$ is:

$$I_{DD} = \sum_{D_i \in \mathcal{D}''} X_{D_i,k} P_{D_i,k} H_{D_i,k} \tag{8}$$

where $\mathcal{D}''$ is the set of V2V links with $H_{D_i,k} < H_{D_n,k}$.

According to the Shannon-Hartley theorem [13], the achievable rates of V2I user $C_m$ and V2V link $D_n$ at time slot $k$ can be calculated as:

$$R_{C_m,k} = \log_2 (1 + \Gamma_{C_m,k}) \tag{9}$$

$$R_{D_n,k} = \log_2 (1 + \Gamma_{D_n,k}) \tag{10}$$

Thus, the total system transmission rate in one transmission period can be expressed as:

$$R_{tot} = \sum_{C_m \in \mathcal{C}} \sum_{k \in \mathcal{K}} R_{C_m,k} + \sum_{D_n \in \mathcal{D}} \sum_{k \in \mathcal{K}} R_{D_n,k} \tag{11}$$

The delay for user $i$ is defined as $T_i = T_i^a + T_i^t$, where the access delay is $T_i^a = \sum_{k \in \mathcal{K}} k X_{i,k}$ and the transmission delay is $T_i^t = B_i / R_{i,k}$, where $B_i$ is the size of each transmission load. Therefore, the total system delay in one transmission period is given as:

$$T_{tot} = \sum_{C_m \in \mathcal{C}} T_{C_m} + \sum_{D_n \in \mathcal{D}} T_{D_n} \tag{12}$$
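As a rough illustration of how equations (1) to (12) fit together, the following Python sketch evaluates the total rate and total delay for one fixed schedule. It is only a sketch under stated assumptions: the link names, gains, powers, loads, and the noise value are hypothetical, every link is assumed to be scheduled in exactly one slot, and V2I users and V2V links are handled by the same function because (5) and (6) have the same form.

```python
import math


def channel_gain(fading, distance, z):
    """Eq. (1): H = |G * L^(-z)|^2."""
    return abs(fading * distance ** (-z)) ** 2


def sinr(link, k, sched, power, gain, noise):
    """Eqs. (3)-(8): after SIC, only co-scheduled links with a weaker gain interfere."""
    if sched[link] != k:
        return 0.0  # X_{i,k} = 0, so the link is silent in slot k
    interference = sum(
        power[j] * gain[j][k]
        for j in sched
        if j != link and sched[j] == k and gain[j][k] < gain[link][k]
    )
    return power[link] * gain[link][k] / (interference + noise)


def rate(link, k, sched, power, gain, noise):
    """Eqs. (9)-(10): Shannon rate of the link in slot k."""
    return math.log2(1.0 + sinr(link, k, sched, power, gain, noise))


def delay(link, sched, power, gain, noise, load):
    """T_i = T_i^a + T_i^t with T_i^a = k and T_i^t = B_i / R_{i,k}."""
    k = sched[link]  # the single slot in which the link is scheduled
    return k + load[link] / rate(link, k, sched, power, gain, noise)


# Hypothetical toy instance: two V2I users (C1, C2), one V2V link (D1), K = 2 slots.
SLOTS = (1, 2)
NOISE = 1.0
sched = {"C1": 1, "D1": 1, "C2": 2}            # link -> scheduled slot k
power = {"C1": 1.0, "D1": 1.0, "C2": 1.0}      # transmit power P_i
load = {"C1": 4.0, "D1": 4.0, "C2": 4.0}       # packet load B_i
gain = {                                        # channel gain H_{i,k}
    "C1": {1: 6.0, 2: 1.0},
    "D1": {1: 1.0, 2: 0.5},
    "C2": {1: 1.0, 2: 3.0},
}

# Eq. (11): sum of rates over all links and slots; eq. (12): sum of per-link delays.
total_rate = sum(rate(i, k, sched, power, gain, NOISE) for i in sched for k in SLOTS)
total_delay = sum(delay(i, sched, power, gain, NOISE, load) for i in sched)
print(total_rate, total_delay)
```

In this toy instance, C1 and D1 share slot 1: C1 has the larger gain and therefore still sees D1 as interference, whereas D1 sees no residual interference, which mirrors the decoding order described above.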
C. Problem Formulation

Compared with traditional cellular communications, the newly proposed C-V2X applications, such as sensor monitoring and autonomous vehicle platooning, require tighter latency and higher reliability [14]. Considering the stringent requirements in NOMA-V2X communication networks, we aim to propose an algorithm to minimize the system latency while satisfying the reliability requirements. In particular, the reliability requirements can be expressed as transmission rate and power constraints. A transmission rate larger than the threshold with an acceptable transmission power guarantees successful transmission for each user. According to (11) and (12), system performance is determined by the user scheduling order and power allocation. Therefore, we formulate joint user scheduling and power allocation as an optimization problem, which can be expressed as:

$$\begin{aligned}
\min_{X, P} \quad & T_{tot} \\
\text{s.t.} \quad & C1: 0 \le P_i \le P_i^{max}, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C2: R_i \ge R_i^{th}, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C3: \sum_{k \in \mathcal{K}} X_{i,k} = 1, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C4: \sum_{i \in \mathcal{C} \cup \mathcal{D}} X_{i,k} \le X^{max}, \quad k \in \mathcal{K}
\end{aligned} \tag{13}$$

where $P_i^{max}$ and $R_i^{th}$ denote the maximum transmit power and the rate threshold of successful transmission, respectively. In (13), C1 limits the peak transmit power of each user. C2 maintains successful transmission for each user. C3 and C4 are the constraints on the binary user scheduling variables, which ensure that each user is scheduled during a transmission period while the total number of multiplexed users stays within the limit imposed by the SIC hardware design.

III. RL-BASED RESOURCE ALLOCATION SCHEME

The aforementioned optimization problem (13) is an NP-hard mixed-integer programming problem, which is difficult to solve by conventional optimization-based methods. Although some algorithms, such as the branch-and-bound method, can approximately solve the problem, the computational complexity is still high. Besides, such optimization-based algorithms can only solve the problem with known CSI for the $K$ time slots. However, that information cannot be accurately predicted in real situations. In order to overcome these challenges, ML can be applied to find solutions for resource allocation. Specifically, RL is a suitable candidate to deal with imperfect information: it can interact with a dynamic channel environment and learn both the behaviors of users and the environmental conditions online so as to maximize the system utility.

Hence, we reformulate the optimization problem as an RL task described by the four-tuple $\phi = \{a, s, r, \pi\}$, where $a$, $s$, $r$, and $\pi$ denote the action space, state space, reward, and action policy, respectively. In state $s(t)$, the action $a(t)$ taken by the agent is chosen by the policy $\pi$ based on the collective information of the environment. The reward $r(t)$ is obtained by taking the specific action, and the environment evolves into the next state $s(t+1)$. The specific settings are discussed in the following:

1) State: In this work, the BS is treated as the agent. The state of the environment is characterized by the CSI of the NOMA-V2X communication network. The environment state for link $i$ at episode $t$ is thus defined by $s_i(t) = [H_{i,1}(t), ..., H_{i,K}(t)] \in s$, which reflects the environment dynamics. The size of the state space is denoted as $N_s$. Specifically, we assume that the BS has limited knowledge about imperfect CSI with path loss information, due to the mobility of vehicles in the NOMA-V2X communication networks.

2) Action: Due to the unpredictable environmental changes, the BS needs to take the appropriate action for user scheduling and power allocation based on the observed channel information. Based on (13), the optimal power allocation can be obtained by solving a convex problem for the given user scheduling order:

$$\begin{aligned}
\min_{P} \quad & T_{tot} \\
\text{s.t.} \quad & C1: 0 \le P_i \le P_i^{max}, \quad i \in \mathcal{C} \cup \mathcal{D} \\
& C2: R_i \ge R_i^{th}, \quad i \in \mathcal{C} \cup \mathcal{D}
\end{aligned} \tag{14}$$

The convexity can be proved in a way similar to [15], and the proof is omitted here because of space constraints. Thus, instead of formulating it in the action space, we convert it into a penalty to ensure that the QoS requirements of the system can be satisfied. To deal with the discrete problem of user scheduling, we apply the Q-learning framework to learn the NOMA-V2X communication networks. Q-learning seeks to find the best actions from the action space for the given state information by creating and updating a Q-table. The action for each link $i$ at episode $t$ is denoted as $a_i(t) = [X_{i,1}(t), ..., X_{i,K}(t)] \in a$, which is chosen according to the $\epsilon$-greedy policy $\pi$ as:

$$\pi: \begin{cases}
a(t) = \arg\max_{a'(t)} Q(s(t), a'(t)), & \text{with probability } \epsilon \\
a(t) = \text{random action } a'(t), & \text{with probability } 1 - \epsilon
\end{cases} \tag{15}$$

Specifically, the $\epsilon$-greedy policy here takes the greedy action $a(t)$ that maximizes the Q-value with probability $\epsilon$, which guarantees the performance of the policy. Besides, a random action is chosen with probability $1 - \epsilon$ to ensure full exploration of the problem for finding the globally optimal policy.

3) Reward: The reward is regarded as the feedback from the environment. The agent has a higher probability of selecting an action that brings a larger reward value. In this work, the aim of the BS consists of delay reduction and transmission rate increment. According to the characteristics of the system model (13), the reward function at episode $t$ is defined as:

$$r(t) = -T_{tot}(t) + c \cdot \sum_{i \in \mathcal{C} \cup \mathcal{D}} I\big(R_i(t) - R_i^{th}(t)\big) \tag{16}$$

where the first term is a utility function to minimize the delay, and the second term is a penalty part, which ensures the requirements on the system transmission rates, where $I(\cdot)$ is
learning model will be updated by:

$$Q(s(t), a(t)) \leftarrow Q(s(t), a(t))(1 - \alpha) + \alpha \left[ r(t) + \gamma \max_{a'(t)} Q(s(t+1), a'(t)) \right] \tag{17}$$

where $\alpha$ and $\gamma$ denote the learning rate and the reward decay, respectively. The details of the proposed RL-based resource allocation algorithm, denoted as RL-RA, are shown in Algorithm 1. Specifically, Step 1 initializes the variables for the NOMA-V2X network and the RL framework. In Steps 4-9, the action is taken by the BS for user scheduling following the policy $\pi$. Step 10 shows the power allocation approach. The updating of environmental variables is illustrated in Step 11. The computational complexity of RL-RA is $O((N + M) K N_e)$, where $N_e$ denotes the number of episodes. To show the convergence of the algorithm, we show that the following relation:

$$f(X^*, P^*) \overset{(A)}{\ge} f(X, P^*) \overset{(B)}{\ge} f(X, P) \tag{18}$$

always holds, where $f$, $X^*$, and $P^*$ denote the original optimization problem, the optimal user scheduling, and the optimal power allocation, respectively. Inequality (A) follows from the convergence results on RL approaches [16], while inequality (B) holds from the definition of convex optimization [17]. We, therefore, conclude that the solution found by Algorithm 1 converges, since the difference in values between the calculated solution and the true optimal one is bounded.

IV. PERFORMANCE EVALUATION

In this section, we illustrate the performance of the proposed algorithm using Python 3.6. We consider a NOMA-V2X communication network with one BS located at the cell center. The cell radius is set to 50 meters to fulfill the requirement of vehicular communications, similar to [14]. The path loss exponent and the noise power spectral density are set to 3 and −174 dBm/Hz, respectively. We set the numbers of V2I and V2V links to four and six, respectively, and the maximum speed of each user is 60 km/h, which follows [5]. The maximum transmit powers for V2I and V2V users are set to 4 watts and 2 watts, respectively. As far as the Q-learning parameters are concerned, the reward decay $\gamma$, the learning rate $\alpha$, and the greedy probability coefficient $\epsilon$ are set to 0.5, 0.1, and 0.95, respectively. We compare the total transmission delay of our proposed algorithm (RL-RA) with two commonly used baseline resource allocation algorithms, namely: A1) the heuristic algorithm proposed in [5], which focuses on power allocation with random user scheduling, denoted as RANDOM-US; and A2) the method proposed in [6], which concentrates on physical layer optimization in order to maximize the system transmission rate, denoted as MAX-RATE.

Fig. 2 illustrates the achieved total system delay of the different algorithms. It can be seen that our proposed RL-RA algorithm outperforms both the MAX-RATE and RANDOM-US methods with a smaller total delay. The performance of the RL-RA algorithm improves and finally converges as the number of episodes increases. The total system delay of the MAX-RATE algorithm is even larger than that of RANDOM-US, since the MAX-RATE algorithm only considers physical layer optimization of the system transmission rate. Hence, it chooses a smaller number of multiplexed users to reduce the interference in order to maximize the transmission rate, thereby causing a larger system delay. Besides, some spectrum resources may be allocated to users without packets being sent. Therefore, the trade-off between system throughput and total delay should be considered carefully in the resource allocation scheme.

Fig. 3 shows the total system delay with various configurations of the number of V2V users in the system for our proposed RL-RA algorithm. It is observed that the total system delay decreases with increasing episodes and finally converges in all cases for the RL-RA algorithm. Besides, the total system delay of the two-V2V system is less than that of the four-V2V system and the six-V2V system. This shows that the number of total users affects the user scheduling order.
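To make the learning settings of this evaluation concrete, the following Python sketch applies the $\epsilon$-greedy rule (15) and the update rule (17) with $\gamma = 0.5$, $\alpha = 0.1$, and $\epsilon = 0.95$ to a toy single-link scheduling task. It is a minimal sketch and not Algorithm 1: the environment has a single state, interference is ignored, all numeric values other than the three learning parameters are hypothetical, and the form of $I(\cdot)$ and the sign of $c$ in (16) are assumptions (a unit penalty when the rate is below the threshold, with a negative coefficient).

```python
import math
import random

# Learning settings from Section IV: reward decay, learning rate, greedy probability.
GAMMA, ALPHA, EPSILON = 0.5, 0.1, 0.95

# Hypothetical single-link toy environment (illustrative values only).
GAINS = [0.2, 1.5, 0.9, 3.0]   # channel gain H_{i,k} in each of K = 4 slots
POWER, NOISE = 1.0, 0.1        # transmit power P_i and noise N_0
LOAD, RATE_TH = 8.0, 2.0       # packet load B_i and rate threshold R_i^th
PENALTY = -10.0                # coefficient c in (16), assumed to be negative


def reward(slot):
    """Reward in the spirit of (16): negative delay plus a rate penalty."""
    rate = math.log2(1.0 + POWER * GAINS[slot] / NOISE)  # (9) without interference
    delay = (slot + 1) + LOAD / rate                     # T_i = T_i^a + T_i^t
    violated = 1.0 if rate < RATE_TH else 0.0            # assumed form of I(.)
    return -delay + PENALTY * violated


def choose_action(q_row, rng):
    """Policy (15): greedy with probability EPSILON, random otherwise."""
    if rng.random() < EPSILON:
        return max(range(len(q_row)), key=q_row.__getitem__)
    return rng.randrange(len(q_row))


def train(episodes=3000, seed=0):
    """Tabular Q-learning on the single toy state."""
    rng = random.Random(seed)
    q_row = [0.0] * len(GAINS)  # one Q-value per candidate slot
    for _ in range(episodes):
        a = choose_action(q_row, rng)
        target = reward(a) + GAMMA * max(q_row)                # next state = same state
        q_row[a] = (1.0 - ALPHA) * q_row[a] + ALPHA * target   # update (17)
    return q_row


q_values = train()
best_slot = max(range(len(q_values)), key=q_values.__getitem__)
print("learned slot index:", best_slot)
print("Q-values:", [round(q, 2) for q in q_values])
```

Note that $\epsilon$ here is the probability of the greedy choice, as in (15), which is the reverse of the more common convention in which $\epsilon$ is the exploration probability. The learned slot is the one that balances a short access delay against a high rate, rather than the slot with the largest gain.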
Besides, it can be seen that the convergence speed is faster for the system with a smaller number of V2V users. Indeed, the system with a smaller number of V2V users has a smaller action space and thereby converges faster.

We further investigate the impact of the greedy probability coefficient $\epsilon$ in the learning setting of the RL-RA algorithm. As shown in Fig. 4, it can be observed that the convergence speed increases with the increment of $\epsilon$. Actually, $\epsilon$ relates to the action choice. A larger value of $\epsilon$ means that there is a large probability of choosing a greedy action with the maximum reward. Otherwise, a random action will be chosen to explore the environment. The greedy action ensures the performance of the policy based on the known environment, and the random action encourages state space exploration of the unknown environment.

Fig. 4. Total system delay with different learning settings. [Plot not reproduced; horizontal axis: Episodes.]

V. CONCLUSION

In this work, we study a low-latency NOMA-V2X network to support both V2I and V2V communications. To deal with the delay requirements, we formulate the mixed integer programming optimization problem for the resource alloca-

[5] ... for Clustering Car-Following V2X Communication System With Non-Orthogonal Multiple Access,” IEEE Access, vol. 7, pp. 68160–68171, May 2019.
[6] B. Di, L. Song, Y. Li, and G. Y. Li, “Non-Orthogonal Multiple Access for High-Reliable and Low-Latency V2X Communications in 5G Systems,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 10, pp. 2383–2397, Jul. 2017.
[7] H. Ding and K.-C. Leung, “Cross-Layer Resource Allocation in NOMA Systems with Dynamic Traffic Arrivals,” in 2020 IEEE Wireless Communications and Networking Conference (WCNC), May 2020.
[8] S. Xu, C. Guo, and Z. Li, “NOMA Enabled Resource Allocation for Vehicle Platoon-Based Vehicular Networks,” in 2019 IEEE Globecom Workshops (GC Wkshps), Dec. 2019.
[9] Y. Xu, C. Yang, M. Hua, and W. Zhou, “Deep Deterministic Policy Gradient (DDPG)-Based Resource Allocation Scheme for NOMA Vehicular Communications,” IEEE Access, vol. 8, pp. 18797–18807, Jan. 2020.
[10] J. Ding and J. Cai, “Two-Side Coalitional Matching Approach for Joint MIMO-NOMA Clustering and BS Selection in Multi-Cell MIMO-NOMA Systems,” IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2006–2021, 2020.
[11] Third Generation Partnership Project (3GPP), “Technical Specification Group Radio Access Network,” Study on LTE-Based V2X Services, 3GPP TR 36.885 V0.5.0, Release 14, Feb. 2016.
[12] L. Dai, B. Wang, Y. Yuan, S. Han, C. I, and Z. Wang, “Non-orthogonal Multiple Access for 5G: Solutions, Challenges, Opportunities, and Future Research Trends,” IEEE Communications Magazine, vol. 53, no. 9, pp. 74–81, Sep. 2015.
[13] J. R. Pierce, “An Introduction to Information Theory: Symbols, Signals and Noise,” Courier Corporation, vol. 2, Nov. 1980.
[14] M. Patra, R. Thakur, and C. S. R. Murthy, “Improving Delay and Energy Efficiency of Vehicular Networks Using Mobile Femto Access Points,” IEEE Transactions on Vehicular Technology, vol. 66, no. 2, pp. 1496–1505, May 2017.
[15] Z. Yang, W. Xu, C. Pan, Y. Pan, and M. Chen, “On the Optimality of Power Allocation for NOMA Downlinks With Individual QoS Constraints,” IEEE Communications Letters, vol. 21, no. 7, pp. 1649–1652, Mar. 2017.
[16] S. Singh, T. Jaakkola, and M. L. Littman, “Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms,” Machine Learning, vol. 38, no. 3, pp. 287–308, Mar. 2000.
[17] K. Wang, W. Liang, Y. Yuan, Y. Liu, Z. Ma, and Z. Ding, “User Clustering and Power Allocation for Hybrid Non-Orthogonal Multiple Access Systems,” IEEE Transactions on Vehicular Technology, vol. 68, no. 12, pp. 12052–12065, Oct. 2019.