
Heterogeneous Machine-Type Communications in Cellular Networks:
Random Access Optimization by Deep Reinforcement Learning

Ziqi Chen
University of New South Wales
CSIRO Data61
Email: ziqi.chen@student.unsw.edu.au

David B. Smith
CSIRO Data61
The Australian National University
Email: David.Smith@data61.csiro.au

Abstract—One of the significant challenges for managing machine-to-machine (M2M) communication in cellular networks, such as LTE-A, is the overload of the radio access network due to very many machine type communication devices (MTCDs) requesting access in burst traffic. This problem can be addressed well by applying an access class barring (ACB) mechanism to regulate the number of MTCDs simultaneously participating in random access (RA). In this regard, here we present a novel deep reinforcement learning algorithm, first for dynamically adjusting the ACB factor in a uniform priority network. The algorithm is then further enhanced to accommodate heterogeneous MTCDs with different quality of service (QoS) requirements. Simulation results show that the ACB factor controlled by the proposed algorithm coincides with the theoretical optimum in a uniform priority network, and achieves higher access probability, as well as lower delay, for each priority class when there are heterogeneous QoS requirements.

I. INTRODUCTION

Machine-to-machine (M2M) communications, also known as Machine-Type Communications (MTC), feature a wide range and large number of autonomous devices communicating with little or no human intervention, and are an essential part of the Internet of Things (IoT) and future wireless systems [1]. To satisfy versatile massive M2M traffic characteristics, from best-effort applications like water/gas metering systems and environmental monitoring, to ultra-reliable ones such as healthcare, public safety and mission-critical industry, enhancing cellular networks is a promising means to accommodate such heterogeneous requirements. However, as current LTE-A networks have predominantly been used to support Human-to-Human (H2H) communications, there are problems for MTCDs accessing LTE networks. Due to an anticipated massive number of MTCDs, each carrying a small amount of data to be transmitted, simultaneous transmission attempts from these MTCDs require large-scale synchronization and will result in data traffic congestion in the radio access network (RAN) [2], [3]. M2M communications can cause signaling congestion in several ways: an external event can trigger very many MTCDs to become active from an idle state and access the network at once, or a massive number of scheduled MTCDs can periodically request network access to report data.

Each MTCD performs a random access (RA) procedure for initial uplink access and to synchronize with the base station (BS/eNodeB) in LTE-A. Recent studies have shown that the actual RA procedure is not efficient for managing massive simultaneous access, as the physical random access channel (PRACH) suffers from a large number of MTCDs competing for resources [4], [5]. Consequently, several methods have been proposed to control the congestion and provide better network performance.

In LTE-A, 3GPP suggests the use of access class barring (ACB), and subsequently extended access barring (EAB), for congestion control in the PRACH. ACB is a probability-based solution that limits the number of MTCDs requesting network access and performing RA simultaneously. The optimal value of the ACB factor that best reduces congestion and access delay, when the BS knows the number of backlogged users, was determined in [6]. In [7], the effectiveness of an ACB method in highly congested environments was evaluated according to key performance indicators such as delay and energy consumption. A Markov-chain-based traffic-load estimation scheme according to network collision status was developed in [8]. In [9], the authors proposed a dynamic RACH preamble allocation scheme based on the ACB factor.

As MTCDs have a diverse range of applications, quality of service (QoS) requirements in M2M communications are highly variable. Therefore, meeting the growing range of QoS requirements for MTC devices is an urgent area for research. However, little attention has been given to QoS provisions where MTCDs with different requirements are treated accordingly. QoS requirements, for various M2M services, include delay requirements as a primary concern. As MTCDs can have various QoS requirements, a multiple access class barring (MACB) scheme is proposed in [10], which assigns distinct access probabilities to different classes of MTCDs requiring different service levels. Similar mechanisms are presented in [11] and [12], but the parameters are adjusted according to estimated traffic analysed with partial information of the network.

Reinforcement learning (RL) is a type of machine learning technique that mimics the fundamental way in which humans learn. A reinforcement learning agent evolves by interacting with the environment through observing it, taking actions and receiving immediate reward feedback, while the goal of the agent is to select actions that maximize cumulative future rewards.



Deep reinforcement learning (DRL) is an enhanced version of RL that uses a deep neural network (DNN) to approximate the cumulative future discounted reward of each action, so that the agent can make accurate predictions and decisions that lead to good overall performance in more complicated problems [13], [14]. DRL can be used to solve RACH overload problems because the DRL agent uses network metrics directly to explore the network condition and make adequate control decisions dynamically for optimal long-term performance.
Therefore, here we present a deep reinforcement learning-based access class barring factor controlling algorithm, which meets diverse QoS requirements for MTCDs and reduces overall delay. Our main contributions are:

∙ For the first time, a deep reinforcement learning (DRL)-based ACB factor controlling algorithm, in M2M communications, is presented for an LTE-A network.

∙ For the case of MTCDs having uniform QoS requirements in a cell, a single ACB factor dynamically adjusted by a DRL agent is applied to all users. A state space, action space and reward function is defined for the DRL agent, to optimize the total long-term number of MTCDs successful in random access (RA) procedures. A DNN is applied as a function approximator of each action-value.

∙ For the case of MTCDs having different QoS requirements in a cell, we create a DRL agent that dynamically adjusts multiple ACB factors applied to MTCDs according to differing priority classes. We define the state space, action space and reward function for the DRL agent, and then maximize the success probability of MTCDs under heterogeneous delay requirements.

∙ Simulation results prove the effectiveness of the proposed algorithm for RA success probability, average delay and user QoS requirements satisfaction.

II. SYSTEM MODEL

A. Random Access Procedure

In this section we introduce the random access procedure of LTE-A networks. To schedule uplink data transmissions to the eNB, User Equipments (UEs) should succeed in the random access procedure. The random access procedure operates in two modes: contention-free and contention-based. Contention-free mode provides low-latency services for high priority users in situations such as downlink data arrival, handover or positioning. Contention-based mode, which is the standard mode for network access, is used by regular UEs to synchronize with the eNB (or BS) when changing Radio Resource Control state from idle to connected, recovering from link failure or sending scheduling requests. Here we only focus on contention-based random access, which consists of four steps in each random access opportunity (RAO). The four steps for random access, summarized in Fig. 1, are:

1) Each UE randomly selects a sequence called a preamble from a pool known to both UEs and BSs. There are up to 64 orthogonal preambles, generated from a Zadoff-Chu sequence, available to the MTCDs. The transmission of a preamble serves as a request for a dedicated time-frequency resource block in the upcoming scheduled transmission in step 3.

2) The eNB acknowledges all the preambles that it has successfully received with a random access response (RAR), which contains identification of the detected preambles and an uplink grant for the step 3 message, msg3.

3) The UE, after receiving its corresponding RAR within the random access response window W_RAR, transmits msg3 including its ID on the PUSCH. When two UEs have selected the same preamble in step 1, both will be granted the same time-frequency resource block for msg3 uplink transmission, and a collision will happen.

4) The eNB broadcasts the contention resolution, which contains the IDs of UEs whose msg3 is successfully decoded. There will be no response to collided msg3 transmissions, and those UEs choosing collided preambles are declared failed in the contention resolution.

Fig. 1. LTE-A contention-based random access (RA) procedure.

In this paper, we only consider how MTCDs compete for dedicated preambles amongst themselves. Therefore we suppose separate resources are allocated to M2M traffic and H2H traffic.

B. Access Class Barring

Access Class Barring (ACB) is a method to redistribute the network access requests of UEs through time to alleviate random access channel pressure, by regulating the number of access requests per RAO. ACB is applied to the UEs before they perform the RA procedure. The eNodeB broadcasts an ACB factor P_ACB ∈ [0, 1] as part of the system information before each RAO. Before each RAO, an MTC device, which has not yet connected to the network, generates a random number q ∈ [0, 1]. If q < P_ACB, then the requested packet will be sent. Otherwise, the MTC device stays silent and waits for the access barring time T_barring. This process is repeated until the UE generates a value q lower than P_ACB and sends the preamble. If more than one MTC device selects the same preamble, then a collision will occur at the eNodeB. We assume that when a collision happens, the eNodeB will not be able to decode the collided Step 3 messages (msg3's), and thus none of the collided MTC devices succeed in such an access channel. Whenever a user fails in one random access channel, it will try to send the sequence again after a backoff time T_BO.
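To make the barring rule concrete, the following is a minimal Python sketch of the per-RAO ACB check described above. The function names are our own, and the default of 54 contention-based preambles anticipates the configuration given later (M = 54 of the 64 preambles); this is an illustrative sketch, not part of the 3GPP specification.

    import random

    def acb_check(p_acb):
        """One ACB check: the device draws q ~ U[0, 1] and may proceed only if q < P_ACB."""
        return random.random() < p_acb

    def attempt_preamble(p_acb, num_preambles=54):
        """One RAO from a backlogged device's point of view: pass the ACB check,
        then pick one contention-based preamble uniformly at random.
        Returns the chosen preamble index, or None if the device is barred
        (in which case it waits T_barring and retries at a later RAO)."""
        if not acb_check(p_acb):
            return None
        return random.randrange(num_preambles)

    # Example: with P_ACB = 0.5, roughly half of 1000 backlogged devices transmit
    # a preamble in this RAO; devices picking the same preamble collide at the eNodeB.
    choices = [attempt_preamble(0.5) for _ in range(1000)]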
TABLE I
3GPP TRAFFIC MODELS [15]

Characteristics     Traffic Model 1                  Traffic Model 2
Number of MTCDs     1000, 3000, 5000, 10000, 30000   1000, 3000, 5000, 10000, 30000
Arrival dist.       Uniform dist. over T             Beta(3,4) dist. over T
Dist. period (T)    60 seconds                       10 seconds

TABLE II
CLASSIFICATION OF MTC TRAFFIC

Class name       Application example      QoS requirement
High priority    Seismic alarm / E-care   Extremely strict delay
Low priority     Consumer electronics     Medium delay
Scheduled        Smart meters             Delay tolerant

C. Network Configuration

To evaluate congestion solution proposals, 3GPP TR 37.868 [15] defines two different traffic models (see Table I) for the evaluation of network performance with M2M communications. This gives examples of how each type of MTCD re-establishes connections with the eNB in typical scenarios. Traffic model 1 can be viewed as a typical scenario in which M2M devices access the network uniformly over a period of time, i.e., in a non-synchronized manner. Traffic model 2 can be considered as an extreme scenario in which a large number of M2M devices access the network in a highly synchronized manner, e.g., after an application alarm that activates them.

In traffic model 2, each MTC device is activated at time 0 ≤ t ≤ T with probability g(t), following a beta distribution with parameters α = 3, β = 4,

    g(t) = t^(α−1) (T − t)^(β−1) / (T^(α+β−1) B(α, β))    (1)

where B(·) is the beta function.

According to the RA procedure, the eNB does not know MTCD identifiers until msg3's are successfully received. Thus, to allocate distinguishing RA resources with respect to different QoS requirements, we classify MTCDs into 3 categories: high priority, low priority and scheduled, as shown in Table II, and assign them different resources through separate ACB factors.

The high priority category includes public safety devices, healthcare applications, etc., whose traffic features a low frequency of occurrence, extremely short delay constraints and high channel access success rate requirements. This type of traffic is best represented by traffic model 2, in which an unpredicted event may trigger thousands of MTCDs.

The low priority category includes consumer electronics, factory management sensors, etc., which feature looser delay constraints and medium channel access success rate requirements. The activation distribution can be represented by traffic model 1, in which the devices are uniformly activated without a burst.

The scheduled priority category contains delay tolerant devices such as smart meters, which report data periodically to the eNB. A large number of these devices report data to the eNB during a short period, e.g., every half hour, and this burst of MTC traffic is the main factor causing the RACH overload. They fit traffic model 2, in which tens of thousands of them are periodically activated at the same time.

Network performance is evaluated in a single cell environment, where the high priority, low priority and scheduled categories co-exist in the cell. Thus the network is subjected to different access intensities: for the high priority category, we consider that in each event-triggered burst 10,000 MTCDs are activated with a Beta(3, 4) distribution over 10 s; for the low priority category, we consider 4,000 MTCDs whose access attempts are distributed uniformly over time, with an arrival rate of 400 per second; scheduled devices follow traffic model 2, and each scheduled burst contains 30,000 MTCDs, which are activated with a Beta(3, 4) distribution over 10 s. In the simulation period, high priority burst traffic seldom happens, while scheduled priority MTCDs cause periodic burst traffic more often.

There is 1 RAO every 5 ms and M = 54 out of the 64 available preambles are used for contention-based RA. Under these conditions, the system offers 200 RAOs per second; preambleTransMax, which is the maximum allowable number of msg1 transmissions, is set to 10. The eNB broadcasts an ACB factor P_ACB as part of the system information before each random access opportunity (RAO). In each random access channel, an MTC device, which has not yet connected to the network, generates a random number q ∈ [0, 1]. If q < P_ACB, then the requested packet will be sent. Otherwise, the MTC device stays silent and waits for T_barring until the next RAO, in which both the new activations in the next slot and the backlogged users will perform an ACB check before transmission. We apply an exponential barring time, T_barring = (0.7 + 0.6 × rand) × 2^N_barring, where rand = U[0, 1) and N_barring is the number of failed ACB checks. When the RA attempt of a UE fails, we also apply an exponential backoff policy, where the backoff time depends on the number of preamble transmissions P_t attempted previously, T_BO = U(0, 10 × 2^(P_t − 1)). When P_t > 10, the network is declared unavailable by the UE and the problem is reported to the upper layer.

We are interested in estimating the total time it takes for the eNodeB to collect each user's data. If a preamble is successfully transmitted, the actual user data will then be transmitted without contention on the PUSCH via scheduled transmissions that take a fixed time. Therefore, the time for all the MTC devices to successfully transmit Step 1 preamble sequences dominates the total delay. We define the delay D_i as the total number of random access opportunities before MTCD i is successfully connected to the network after the RA procedure.
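As a small illustration of the configuration above, the sketch below generates traffic-model-2 activation times from the Beta(3, 4) distribution of Eq. (1) and evaluates the barring and backoff timers. It is our own simplified sketch (function names, units and binning are assumptions), not the simulator used for the results in this paper.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def activation_times(n_devices, T=10.0):
        """Traffic model 2: activation instants drawn from a Beta(3, 4) density scaled to [0, T], cf. Eq. (1)."""
        return T * rng.beta(3.0, 4.0, size=n_devices)

    def barring_time(n_barring):
        """Exponential barring time after n_barring failed ACB checks:
        T_barring = (0.7 + 0.6 * rand) * 2**N_barring, with rand ~ U[0, 1)."""
        return (0.7 + 0.6 * rng.random()) * 2 ** n_barring

    def backoff_time(p_t):
        """Exponential backoff after a failed RA attempt with p_t preamble
        transmissions so far: T_BO ~ U(0, 10 * 2**(p_t - 1))."""
        return rng.uniform(0.0, 10.0 * 2 ** (p_t - 1))

    # Example: one scheduled burst of 30,000 MTCDs over 10 s, binned into
    # 5 ms RAOs (200 RAOs per second) to obtain per-RAO arrival counts.
    edges = np.linspace(0.0, 10.0, 2001)
    arrivals_per_rao, _ = np.histogram(activation_times(30_000), bins=edges)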
III. DEEP REINFORCEMENT LEARNING BASED ACB FACTOR CONTROLLING ALGORITHM

A. Optimising a single ACB factor in a uniform priority network

In this section, we present the DRL-based ACB factor controlling algorithm with respect to uniform priority. In networks where the QoS of MTCDs is not an important concern, or where broadcasting multiple ACB factors is impractical, we implement the DRL-based algorithm with a single ACB factor. As all MTCDs in the cell are treated uniformly, the optimization problem is simply to maximize the total number of MTCDs successful in RA over a particular period.

If the system knows the number of request packets waiting to be transmitted in each RAO, then the theoretical optimal ACB factor P*_ACB can be derived numerically to maximize the number of MTCDs successful in each RA procedure [6]. The expected number of successful preamble transmissions K when N MTCDs are prepared to request network access is:

    E[K | N = n] = n P_ACB (1 − P_ACB/M)^(n−1)    (2)

where P_ACB is the ACB factor, M is the number of available preambles and n represents the number of request packets waiting to be transmitted in the current RAO. By taking the derivative of (2), the maximum expected K is achieved when P_ACB = min(1, M/n).

Remark 1: The theoretical optimal ACB factor with respect to M available preambles and n access requests is P*_ACB = min(1, M/n).

However, in practice, the eNB cannot acquire the exact number of MTCDs requesting packet transmission in each RAO. The information it has is limited to the number of successful transmissions and the number of preambles collided, as well as the time each received packet has backed off, obtained by examining those MTCDs successful in finishing RA. Thus, the eNB can only estimate the upcoming traffic based on such limited information. Moreover, there is an inherent trade-off in choosing the ACB factor P_ACB. When P_ACB is too large, many preambles will be transmitted in each RAO, and there will be a large number of collisions on most of the preambles. On the other hand, when P_ACB is too small, very few users will be able to pass the ACB check and transmit their preambles, resulting in fewer collisions but leaving network resources under-utilised.

Therefore, we present a deep reinforcement learning (DRL) algorithm, which learns from experience and chooses the best action according to its estimated future reward. We first present a generalized form of DRL. DRL comprises two phases: an offline DNN training phase and an online reinforcement learning phase. The offline training phase takes in data observed from randomly chosen actions and trains the DNN to fit the correlation between a state-action pair (s, a) and the corresponding value function Q(s, a), which represents the expected cumulative discounted reward of being in state s and taking action a. The value function Q(s, a) is given as:

    Q(s, a) = E[r + μ max_{a′} Q(s′, a′) | s, a]    (3)

where r is the reward and μ is the discount factor. The offline DNN training phase needs to accumulate enough samples of experience at each epoch k, e_k = (s_k, a_k, r_k, s_{k+1}), which is composed of state, action, immediate reward and state transition, and updates the DNN with experiences randomly drawn from the pool of stored experiences. The use of experience memory in this offline procedure can smooth out learning and avoid oscillations or divergence in the parameters.

Then, based on the offline-built DNN, deep Q-learning is adopted to further improve the online control of the dynamic ACB factor. In each decision epoch t_k, the DRL agent derives the estimated Q value from the DNN with the input of the current state s_k and each available action a_k. Then we apply the ε-greedy policy to select the execution action a_k. More specifically, with probability (1 − ε) we follow the greedy policy and select the action with the highest Q value, and with probability ε we select a random action.

After taking an action, the DRL agent observes another experience e_k and stores it into the experience memory D. After that, the DRL agent updates the weight parameters θ of the DNN with N_B samples from the experience memory D every T epochs to avoid oscillation. In our implementation, for the DNN construction, we employ a feed-forward neural network that has one hidden layer of fully-connected units with 10 neurons. We set the mini-batch capacity N_B = 32 and the reward discount factor μ = 0.9.

Since DRL employs a DNN as a function approximator for the action-value, it can deal with a large or continuous state space and action space, which is very suitable for continuous management of the dynamic ACB factor. The state space, action space and reward function are defined as follows:

State Space: The state space of the DRL agent consists of 4 components: the number of MTCDs successfully accessed to the network through an RA procedure during the last RAO, N_s; the number of preambles detected as collided during the last RAO, N_c; the average delay D_avg of successfully accessed MTCDs during the last RAO; and the currently broadcast ACB rate P_ACB.

    s = {N_s, N_c, D_avg, P_ACB}    (4)

Action space: The action space of the DRL agent is the ACB factor P_ACB′ to be broadcast prior to the upcoming RAO.

    a = P_ACB′    (5)

Reward: The reward needs to represent the objective of the algorithm, which is maximizing the total number of MTCDs successful in RA procedures in each RAO. Thus, we define the immediate reward that the DRL agent receives as the number of successful accesses after performing action a in the upcoming RAO, N_s′.

The complete DRL-based ACB factor controlling algorithm is presented in Algorithm 1.
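For illustration, below is a minimal sketch of how the single-factor agent just described (and listed as Algorithm 1 below) could be realized as a small deep Q-network in PyTorch, using the state of Eq. (4), a discretized set of candidate ACB factors as actions, one hidden layer of 10 neurons, N_B = 32 and μ = 0.9. The action discretization, learning rate and environment interface are our own assumptions rather than details given in this paper.

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Discretized action set: candidate ACB factors the agent may broadcast (assumption).
    ACTIONS = [0.05 * i for i in range(1, 21)]       # 0.05, 0.10, ..., 1.00
    STATE_DIM = 4                                    # (N_s, N_c, D_avg, P_ACB), Eq. (4)

    q_net = nn.Sequential(nn.Linear(STATE_DIM, 10), nn.ReLU(), nn.Linear(10, len(ACTIONS)))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                    # experience memory D
    GAMMA, BATCH, EPS = 0.9, 32, 0.1                 # mu, N_B, epsilon

    def select_action(state):
        """Epsilon-greedy selection over the estimated Q values."""
        if random.random() < EPS:
            return random.randrange(len(ACTIONS))
        with torch.no_grad():
            return int(q_net(torch.tensor(state).float()).argmax())

    def train_step():
        """One mini-batch update towards the target y = r + mu * max_a' Q(s', a')."""
        if len(replay) < BATCH:
            return
        s, a, r, s2 = zip(*random.sample(replay, BATCH))
        s, s2 = torch.tensor(s).float(), torch.tensor(s2).float()
        a, r = torch.tensor(a).long(), torch.tensor(r).float()
        with torch.no_grad():
            target = r + GAMMA * q_net(s2).max(dim=1).values
        pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In an online loop mirroring Algorithm 1, the selected index would be mapped to ACTIONS[a] and broadcast as the ACB factor before the next RAO, the observed transition (s, a, r, s') appended to replay, and train_step() invoked every T decision epochs.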
Algorithm 1: Deep Reinforcement Learning based ACB controlling algorithm

Offline:
1: Load past state transition profile experiences into the experience memory D;
2: Pre-train the value function DNN with experiences e_k = (s_k, a_k, r_k, s_{k+1}) from memory D;

Online:
1: for each decision epoch t_k do
2:   With probability ε, randomly select an action; otherwise a_k = arg max_a Q(s_k, a; θ), where Q(s_k, a; θ) is estimated by the DNN;
3:   Execute action a_k;
4:   Feed the action into the network;
5:   Observe the reward r_k and the new state s_{k+1};
6:   Store the state transition (s_k, a_k, r_k, s_{k+1}) in D;
7:   for every T epochs do
8:     Randomly sample a mini-batch of experiences (s_k, a_k, r_k, s_{k+1}) of size N_B;
9:     Target y_k = r_k + μ max_{a′} Q(s_{k+1}, a′; θ);
10:    Update the DNN's weights θ with the loss function (y_k − Q(s_k, a_k; θ))^2;
11:   end for
12: end for

B. Optimizing multiple ACB factors in a network of heterogeneous priorities

Here we present the DRL agent with respect to heterogeneous priority classes. In a network where the high priority, low priority and scheduled categories possess distinct QoS requirements, different ACB factors should be applied to each of them, in order to meet their respective delay constraints. The goal of the optimization problem is to maximize the mean success access probability over the categories:

    {P_ACB^1′, P_ACB^2′, P_ACB^3′} = arg max_a (N_s^1/N_t^1 + N_s^2/N_t^2 + N_s^3/N_t^3)/3    (6)

where N_s^1, N_s^2 and N_s^3 are the numbers of received packets that successfully complete RA within the delay constraints of the high priority, low priority and scheduled categories respectively, and N_t^1, N_t^2 and N_t^3 are the numbers of all received network access requests of the high priority, low priority and scheduled categories respectively.

The state space, action space and reward function for optimizing multiple ACB factors with heterogeneous priorities are defined as follows:

State Space: The state space of the DRL agent consists of 4 components: the number of MTCDs of each category successfully accessed to the network through the RA procedure during the last RAO, N_s; the number of preambles detected as collided during the last RAO, N_c; the average delay D_avg of successfully accessed MTCDs of each category during the last RAO; and the currently broadcast ACB rates P_ACB.

    s = {N_s, N_c, D_avg, P_ACB}    (7)

where N_s = {N_s^1, N_s^2, N_s^3} represents the success number of each category, and D_avg = {D_avg^1, D_avg^2, D_avg^3} represents the average delay of successfully accessed MTCDs of each category.

Action space: We let the DRL agent decide on the ACB factors P_ACB^1, P_ACB^2 and P_ACB^3 simultaneously. Specifically, the action space of the DRL agent is these 3 ACB rates to be broadcast prior to the upcoming RAO.

    a = {P_ACB^1′, P_ACB^2′, P_ACB^3′}    (8)

Reward: The reward needs to represent the objective of the algorithm, which is maximizing the probabilities of MTCDs successfully completing the RA procedure within their delay constraints. Overall, we define the immediate reward that the DRL agent receives as the average ratio between the number of received packets that successfully completed the RA procedure within the delay constraints and the number of all received network access requests for each category:

    r = (N_p^1/N_s^1 + N_p^2/N_s^2 + N_p^3/N_s^3)/3    (9)

where N_p^1, N_p^2 and N_p^3 are the numbers of received packets that successfully completed the RA procedure within the delay constraints. Then, with the new state space, action space and reward function, the DRL-based algorithm presented in Algorithm 1 is modified to produce 3 ACB factors simultaneously controlling the 3 separate categories.

IV. PERFORMANCE EVALUATION

In this section, we present the simulation settings and results of the proposed deep reinforcement learning based ACB controlling algorithm. Firstly, we implement our deep reinforcement learning based ACB controlling algorithm with respect to the single ACB factor introduced in Section III-A. After offline training of the DNN and online reinforcement learning, the learned ACB factor, compared with the theoretical optimum ACB factor and the dynamic ACB controlling scheme [6] over a period of 10 seconds containing one scheduled burst, is shown in Fig. 2. The proposed algorithm adjusts the ACB factor to be very close to the theoretical optimum, with a root-mean-square error (RMSE) of 0.047, while the dynamic ACB scheme makes inaccurate adjustments in transition periods, with an RMSE of 0.177, much higher than that of our scheme.

Fig. 2. Theoretical optimum ACB and dynamic ACB from [6], compared with the proposed DRL-based algorithm for ACB (ACB factor p versus random access opportunity).

Then, we implement the multiple ACB factor assignment algorithm in a cell where the high priority, low priority and scheduled categories co-exist. We define the outage condition for MTC classes in terms of a maximum-delay QoS requirement. For high priority devices, an access delay over 1 s is considered an outage. For low priority devices, an access delay over 5 s is considered failed, and for scheduled devices, the access delay cannot exceed 10 s.
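To connect these outage thresholds with the metrics reported next, the following small sketch (our own, with hypothetical data) computes the per-class access success probability P_s and average delay from a list of simulated access delays.

    # Per-class maximum allowable access delay in seconds (outage thresholds above).
    DELAY_LIMIT = {"high": 1.0, "low": 5.0, "scheduled": 10.0}

    def class_metrics(delays_s, attempted, cls):
        """Access success probability P_s and average delay (ms) for one class.

        delays_s:  access delays (seconds) of devices that completed the RA procedure,
        attempted: total number of devices of this class that requested access,
        cls:       'high', 'low' or 'scheduled'.
        """
        ok = [d for d in delays_s if d <= DELAY_LIMIT[cls]]   # within the QoS delay bound
        p_s = len(ok) / attempted if attempted else 0.0
        d_avg_ms = 1000.0 * sum(ok) / len(ok) if ok else float("nan")
        return p_s, d_avg_ms

    # Example (hypothetical numbers): 3 of 4 high-priority devices finish within 1 s.
    print(class_metrics([0.2, 0.4, 0.9, 1.8], attempted=4, cls="high"))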
TABLE III
ACCESS SUCCESS PROBABILITY AND ACCESS DELAY. RA = RANDOM ACCESS, DRL = DEEP REINFORCEMENT LEARNING, P_s = SUCCESS PROBABILITY, D_avg = AVERAGE DELAY IN MS.

                    High priority        Low priority         Scheduled
                    P_s      D_avg       P_s      D_avg       P_s      D_avg
Prior. RA [16]      0.985    77          0.939    1862        0.792    3937
DRL ACB             0.997    200         0.941    1642        0.853    3027

Fig. 3. Temporal distribution of MTCD activation, access and success number when the DRL algorithm is activated (number per RAO of newly activated, requested and successful accesses for the high priority and scheduled classes, versus random access opportunity).

We consider the following metrics to evaluate the performance of the proposed algorithm. Access delay is defined as the time delay between the first activation and the completion of the RA procedure for MTCDs that successfully access the network. Access success probability, P_s, is defined as the probability of successfully completing the random access procedure within the maximum allowable access delay.

The simulation results are shown in Table III, where the average delay is given in milliseconds. By applying the DRL-based ACB controlling algorithm, the success probability is very high in all MTCD classes, and when compared with the prioritized RA scheme [16], the DRL algorithm achieves both a higher success rate and a lower average delay.

To illustrate the effect of the DRL-ACB algorithm with heterogeneous classes, we obtain the number of newly activated MTCDs of both the high priority class and the scheduled class, the number of initial access attempts of each category, as well as their respective numbers of successful accesses in each RAO. The results are shown in Fig. 3, where we observe that the dynamic ACB factors applied to different classes cause different probabilities of access. The high priority class receives a higher ACB factor, allowing its devices to succeed faster in the RA procedure, while when scheduled MTCDs are activated, some of them are rejected due to a low ACB factor, ensuring better performance for higher priority MTCDs. When the number of scheduled MTCDs is even greater, causing overload of the RA channel, the ACB factor is properly adjusted to prevent the majority of preambles from colliding.

V. CONCLUSIONS

In this paper, the radio access network (RAN) overload issue in cellular networks, such as LTE-A, has been addressed. A deep reinforcement learning based ACB factor controlling algorithm for solving the RAN overload problem in a 3GPP LTE-A network has been proposed, which is also likely to be applicable to future cellular networks. This algorithm utilizes the different traffic characteristics of heterogeneous MTC services to allocate ACB factors to each priority class in order to satisfy different QoS requirements. Through neural network-based reinforcement learning, it has been shown that the proposed algorithm assigns ACB factors very close to the theoretical optimum, and achieves high reliability and shorter delay for, e.g., critical and smart meter services, compared to other schemes. It also differentiates between high and low priority traffic and provides the optimal ACB factors to simultaneously satisfy distinct requirements.

REFERENCES

[1] Z. Dawy, W. Saad, A. Ghosh, J. G. Andrews, and E. Yaacoub, "Toward massive machine type cellular communications," IEEE Wireless Communications, vol. 24, no. 1, pp. 120–128, 2017.
[2] H. S. Dhillon, H. Huang, and H. Viswanathan, "Wide-area wireless communication challenges for the Internet of Things," IEEE Communications Magazine, vol. 55, no. 2, pp. 168–174, 2017.
[3] S.-Y. Lien, K.-C. Chen, and Y. Lin, "Toward ubiquitous massive accesses in 3GPP machine-to-machine communications," IEEE Communications Magazine, vol. 49, no. 4, 2011.
[4] L. Ferdouse, A. Anpalagan, and S. Misra, "Congestion and overload control techniques in massive M2M systems: a survey," Transactions on Emerging Telecommunications Technologies, vol. 28, no. 2, 2017.
[5] A. Laya, L. Alonso, and J. Alonso-Zarate, "Is the random access channel of LTE and LTE-A suitable for M2M communications? A survey of alternatives," IEEE Communications Surveys and Tutorials, vol. 16, no. 1, pp. 4–16, 2014.
[6] S. Duan, V. Shah-Mansouri, and V. W. Wong, "Dynamic access class barring for M2M communications in LTE networks," in Globecom Workshops (GC Wkshps), 2013 IEEE, pp. 4747–4752, IEEE, 2013.
[7] I. Leyva-Mayorga, L. Tello-Oquendo, V. Pla, J. Martinez-Bauset, and V. Casares-Giner, "Performance analysis of access class barring for handling massive M2M traffic in LTE-A networks," in Communications (ICC), 2016 IEEE International Conference on, pp. 1–6, IEEE, 2016.
[8] H. He, Q. Du, H. Song, W. Li, Y. Wang, and P. Ren, "Traffic-aware ACB scheme for massive access in machine-to-machine networks," in Communications (ICC), 2015 IEEE International Conference on, pp. 617–622, IEEE, 2015.
[9] H.-Y. Hwang, S.-M. Oh, C. Lee, J. H. Kim, and J. Shin, "Dynamic RACH preamble allocation scheme," in Information and Communication Technology Convergence (ICTC), 2015 International Conference on, pp. 770–772, IEEE, 2015.
[10] N. Zangar, S. Gharbi, and M. Abdennebi, "Service differentiation strategy based on MACB factor for M2M communications in LTE-A networks," in Consumer Communications & Networking Conference (CCNC), 2016 13th IEEE Annual, pp. 693–698, IEEE, 2016.
[11] U. Phuyal, A. T. Koc, M.-H. Fong, and R. Vannithamby, "Controlling access overload and signaling congestion in M2M networks," in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pp. 591–595, IEEE, 2012.
[12] N. Li, C. Cao, and C. Wang, "Dynamic resource allocation and access class barring scheme for delay-sensitive devices in machine to machine (M2M) communications," Sensors, vol. 17, no. 6, p. 1407, 2017.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[14] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[15] 3GPP, "TR 37.868: Study on RAN improvements for machine type communications."
[16] J.-P. Cheng, C.-H. Lee, and T.-M. Lin, "Prioritized random access with dynamic access barring for RAN overload in 3GPP LTE-A networks," in GLOBECOM Workshops (GC Wkshps), 2011 IEEE, pp. 368–372, IEEE, 2011.
