
A Reinforcement Learning Based Medium Access Control Method for LoRa Networks


Xu Huang, Jie Jiang, Shuang-Hua Yang, Yulong Ding*
Department of Computer Science and Engineering
Southern University of Science and Technology
Shenzhen, China
huangx2016@mail.sustech.edu.cn, {jiangj, yangsh, dingyl}@sustech.edu.cn

*Corresponding author

Abstract—LoRa is a low-power, long-range network technology that is widely used in power-sensitive and maintenance-free Internet of Things applications. LoRa only defines the physical layer protocol, while LoRaWAN is the medium access control (MAC) layer protocol above it. However, simply using ALOHA in LoRaWAN leads to a high packet collision rate when the number of end-devices in the network is large, since many end-devices send packets to the gateway at the same time. To solve this, we present a reinforcement learning (RL) based multiple access method for LoRaWAN, which lets end-devices decide when to transmit data based on the environment and reduces the packet collision rate. A comparison between the RL method and ALOHA is also included in the paper, showing that the RL method has a lower packet collision rate.

Index Terms—LoRa, multi-agent, reinforcement learning, decentralized

I. INTRODUCTION

LoRaWAN is a popular technology in IoT. Due to its low-power and long-range characteristics, it is widely used in air quality monitoring, smart lighting, remote meter reading, smart agriculture and smart industry.

However, as the number of connected end-devices grows, the collision rate increases, since LoRaWAN uses ALOHA to handle the multiple access problem. This is a challenge that not only LoRaWAN but also the other technologies used in IoT have to deal with as the number of connected devices increases. Laya et al. [1] point out that current wireless communication networks are built for human-type communication (HTC) rather than machine-type communication (MTC). MTC requires low delay to enable the control of massive numbers of connected devices. Besides, most MTC messages are periodic reporting messages, for example remote meter readings.

To avoid collisions, ALOHA is one option; it is the simplest random access protocol. In ALOHA, each end-device sends as soon as it has data to send, and resends after a random time if the transmission fails. There are two kinds of ALOHA: slotted ALOHA and unslotted ALOHA. The difference between them is that all devices in slotted ALOHA are time synchronized, so they can only transmit at the beginning of each slot. ALOHA works well when the number of connected devices is small, which makes it suitable for LoRaWAN, but when massive numbers of devices are connected together and the traffic load is heavy, the communication efficiency is low and a packet has to be retransmitted many times due to collisions. Our experiments in Section IV confirm this inefficiency of ALOHA. Furthermore, more retransmissions mean more energy consumption, which is unacceptable for power-critical applications. As for CSMA/CA, it does achieve a lower collision rate compared to ALOHA, which is why it is used in IEEE 802.11. However, similar to ALOHA, CSMA/CA also performs badly when the number of connected end-devices increases and the traffic load is high. What is worse, CSMA/CA requires continuous channel monitoring to prevent collisions, which is energy inefficient [2].

To reduce the collision rate, several authors have proposed different solutions. To et al. [3] put CSMA into LoRa and used NS-3 to simulate and compare CSMA and ALOHA in LoRaWAN. In simulation, CSMA achieves a lower collision rate and a higher packet delivery rate while only slightly increasing the energy consumption compared to ALOHA. Although CSMA improves LoRa performance in terms of collision rate, the collision rate still increases rapidly when the number of end-devices exceeds 2000 [3]. In addition, the hidden terminal problem is not considered in [3]; it will increase the collision rate, because an end-device may not know whether the gateway is talking to another end-device.

Some authors propose methods not for LoRa specifically but for the whole IoT area. Alonso-Zarate et al. [4] introduce Distributed Queuing Collision Avoidance (DQCA) as a MAC protocol for WLAN. Each end-device in DQCA has two queues, a contention resolution queue (CRQ) and a data transmission queue (DTQ). End-devices use the CRQ to decide their position in the DTQ and then transmit data according to that position. [4] claims near-optimum performance: in simulation, DQCA achieves a throughput of 2.5 Mbps, while the throughput of legacy 802.11 is 1.3 Mbps. Meanwhile, it is a decentralized protocol, and the gateway does not decide when a device should transmit.

Tinka et al. [5] present an ALOHA-based scheduling algorithm and a reservation-based scheduling algorithm. The main differences between them are that the reservation-based algorithm builds a bidirectional link while the ALOHA-based algorithm builds a unidirectional link, and that in the reservation-based algorithm end-devices have knowledge of their two-hop neighbors. These differences make the reservation-based scheduling algorithm perform better than the ALOHA-based one, especially in bidirectional communication.

Balevi et al. [2] introduce ALOHA-NOMA, which combines ALOHA with non-orthogonal multiple access (NOMA). In NOMA, the gateway can use successive interference cancellation (SIC) to resolve collisions if the transmission power levels of the sources differ, which means different devices can transmit data at the same time. [2] claims that the maximum throughput of ALOHA-NOMA grows with a greater-than-linear slope as the number of active transmitters increases. However, [2] does not use simulations or experiments to prove it.

Nearly all the methods proposed so far are based on ALOHA and make greater or smaller progress. But the end-devices in these methods all use a fixed strategy, which means they cannot adapt to a dynamic environment, and it is doubtful whether they can achieve a high throughput and a low collision rate in a dynamic network. Therefore, we want to apply reinforcement learning to this problem.

We plan to use reinforcement learning to let each end-device decide by itself when to transmit packets, which reduces the workload of the gateway and ensures energy efficiency and a low collision rate. Reinforcement learning studies how to choose actions according to the environment so as to maximize the reward [6]. In a LoRa scenario there are many end-devices connected to the gateway, which means many end-devices compete for the channel to send data, so the problem becomes one of multi-agent reinforcement learning (MARL) in multiplayer stochastic games.

In [7], each rational agent keeps observing and tracking the other agents' behaviors and adjusts its actions dynamically. For multiplayer games, there are two kinds of games, zero-sum games and general-sum games. In zero-sum games, each agent tries to maximize its own reward and minimize the reward of the other agents, as in chess, while in general-sum games the agents cooperate with each other and try to maximize the reward of the entire system. Obviously, end-devices in a LoRa network sending data over a shared single channel form a general-sum cooperative multiplayer game. Our goal is to use MARL to train the end-devices so that they can negotiate and cooperate to make better use of the channel resources and reduce collisions as much as possible.

The contributions of this paper are two-fold. Firstly, a scalable, energy-efficient reinforcement learning based MAC protocol for LoRaWAN is proposed to reduce packet collisions. In addition, a frame structure suitable for the proposed protocol is introduced and studied in simulation.

The paper is organized as follows. Section II describes the RL based method and its corresponding frame structure. Section III introduces the simulation hyperparameters. A simulation and comparison between the proposed RL based method and ALOHA is included in Section IV. Finally, the paper is concluded in Section V.

II. METHODS

A. RL Based Access Method

The goal of the proposed method is to apply reinforcement learning to train the end-devices, making them sense the environment and gain basic knowledge of their neighbors and the nearby traffic, so that they can decide independently when to transmit data according to the dynamic environment.

1) Model Description: Similar to LoRaWAN, a star topology is used in this study, and our simulation is based on it. We assume that all end-devices in the network are time synchronized, so that continuous time can be divided into time slots; the slots are grouped into superframes that repeat over time, and each transmission occupies exactly one time slot. Fig. 1 shows the proposed frame structure.

Fig. 1. The proposed frame structure (each frame consists of l slots S1, S2, ..., Sl-1, Sl, and the frames repeat over time)

In reinforcement learning, agent, action, state and reward are the most important concepts. By taking actions, agents transit from one state to another and receive a corresponding reward, which can be positive or negative. Agents explore and change the environment by taking actions, and learn to take the appropriate actions to maximize their rewards.

There are two actions for each agent: sleep and send. As for the state, since the superframe of l consecutive slots repeats indefinitely, we define one state for each slot in the superframe. In each state, an end-device chooses either sleep or send. A transmission is only successful when exactly one end-device chooses send while all the others choose sleep. If the transmission succeeds, the agent receives a positive reward; otherwise it receives a negative reward. Choosing sleep also yields a negative reward, but the punishment is not as heavy as for a transmission failure, because sleeping saves energy. If an agent transmits data successfully in a time slot, it sleeps for l time slots before its next transmission, which simulates periodic packet generation.
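As an illustration of this per-slot model, the following minimal sketch (our own example, not the authors' simulator code) evaluates one time slot: the transmission succeeds only if exactly one device chooses send, and the reward values are those later listed in TABLE I.

from typing import Dict

# Reward values as in TABLE I.
R_SUCCESS, R_FAILURE, R_SLEEP = 100, -20, -5

def slot_rewards(actions: Dict[int, str]) -> Dict[int, int]:
    """actions maps end-device id -> 'send' or 'sleep' for one time slot."""
    senders = [dev for dev, a in actions.items() if a == "send"]
    rewards = {}
    for dev, a in actions.items():
        if a == "sleep":
            rewards[dev] = R_SLEEP        # sleeping costs a small penalty
        elif len(senders) == 1:
            rewards[dev] = R_SUCCESS      # exactly one sender: success
        else:
            rewards[dev] = R_FAILURE      # two or more senders: collision
    return rewards

# Example: devices 0 and 1 collide while device 2 sleeps.
print(slot_rewards({0: "send", 1: "send", 2: "sleep"}))   # {0: -20, 1: -20, 2: -5}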
After enough exploration, the agents acquire knowledge of the different states and of the action probabilities in each state. For example, if the channel is more likely to be busy in a slot, the probability of choosing sleep in that slot is higher, while if the channel is more likely to be free, the probability of choosing send is higher. The system converges when every end-device has found a time slot and occupies it to transmit data in each superframe. Fig. 2 shows the state transition diagram of the proposed method.

Fig. 2. State transition diagram of the proposed method (states S1, S2, ..., Sl plus a success state Succ; Si moves to Si+1 when a transmission fails, Si moves to Succ when a transmission succeeds, and Succ returns to Si after sleeping l slots before the next transmission)
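Read one way, the transitions summarized in the Fig. 2 caption can be written as a small helper. This is our own interpretation of the diagram, and the wrap-around from Sl back to S1 is an assumption rather than something the paper states.

def next_state(state, outcome, l, owned_slot=None):
    # One possible reading of Fig. 2 (sketch): states are slot indices 1..l
    # plus the success state "Succ".
    if state == "Succ":
        return owned_slot          # after sleeping l slots, reuse the owned slot
    if outcome == "success":
        return "Succ"              # Si -> Succ when the transmission succeeds
    return state % l + 1           # Si -> Si+1 when it fails (wrap-around assumed)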

2) Learning Algorithm: LoRa devices mostly have poor computing capability and cannot run complex algorithms such as deep Q-networks (DQN). For the MARL problem, Bowling and Veloso [8] proposed the policy hill climbing (PHC) algorithm, see Algorithm 1, which is a simple modification of Q-learning that supports multi-agent systems. Unlike other variants of Q-learning such as minimax-Q, Nash Q-learning and friend-or-foe Q-learning, which need to know the actions taken by all the agents (unacceptable in a wireless network due to the hidden terminal problem), the only information the PHC algorithm needs is the action taken by the agent itself. PHC maintains two tables: a Q table and a π table. The former stores the rewards of the states the agent has explored, while the latter stores the probability of each action in each state. In each state, the agent chooses an action according to these probabilities and receives the corresponding reward. If that action yields the largest reward, its probability increases; otherwise it decreases. To speed up convergence, if an end-device transmits data successfully in a time slot, the probability of the action send in that slot is set to one. Meanwhile, roulette wheel selection, rather than a separate exploration-exploitation schedule, is used to choose actions, so the action selection is much more stable after the system has converged.

Algorithm 1 Policy hill climbing (PHC) algorithm
1: Initialize Qi(s, ai) ← 0 and πi(s, ai) ← 1/|Ai|; choose the learning rates α, δ and the discount factor γ
2: for each iteration do
3:   Select action ac in the current state s based on a mixed exploration-exploitation strategy
4:   Take action ac and observe the reward r and the subsequent state s'
5:   Qi(s, ac) ← Qi(s, ac) + α[r + γ max_ai Qi(s', ai) − Qi(s, ac)]
6:   πi(s, ai) ← πi(s, ai) + Δ_s,ai for every ai ∈ Ai, where
       Δ_s,ai = −δ_s,ai                 if ai ≠ argmax_a' Qi(s, a'),
       Δ_s,ai = Σ_aj≠ai δ_s,aj          otherwise                        (1)
       δ_s,ai = min(πi(s, ai), δ / (|Ai| − 1))                            (2)
7: end for
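As a concrete illustration, the sketch below implements the PHC update of Algorithm 1 for a single end-device, together with the roulette wheel action selection described above. It is our own minimal example rather than the authors' simulator code; the state is the slot index, the actions are send and sleep, and the default hyperparameter values follow TABLE I.

import random

class PHCAgent:
    """Minimal policy hill climbing (PHC) agent for one end-device (sketch)."""

    def __init__(self, n_slots, actions=("send", "sleep"),
                 alpha=0.8, delta=0.8, gamma=0.2):
        # One state per slot in the superframe; alpha, delta, gamma as in TABLE I.
        self.actions = list(actions)
        self.alpha, self.delta, self.gamma = alpha, delta, gamma
        self.Q = {s: {a: 0.0 for a in self.actions} for s in range(n_slots)}
        self.pi = {s: {a: 1.0 / len(self.actions) for a in self.actions}
                   for s in range(n_slots)}

    def choose_action(self, state):
        # Roulette wheel selection: sample an action according to pi(state, .).
        r, acc = random.random(), 0.0
        for a in self.actions:
            acc += self.pi[state][a]
            if r <= acc:
                return a
        return self.actions[-1]

    def update(self, state, action, reward, next_state):
        # Line 5 of Algorithm 1: Q-learning update.
        best_next = max(self.Q[next_state].values())
        self.Q[state][action] += self.alpha * (
            reward + self.gamma * best_next - self.Q[state][action])
        # Line 6, Eqs. (1)-(2): move probability mass towards the greedy action.
        greedy = max(self.Q[state], key=self.Q[state].get)
        deltas = {a: min(self.pi[state][a], self.delta / (len(self.actions) - 1))
                  for a in self.actions}
        for a in self.actions:
            if a == greedy:
                self.pi[state][a] += sum(d for b, d in deltas.items() if b != a)
            else:
                self.pi[state][a] -= deltas[a]

# Example of one interaction in slot-state 3 of a 30-slot superframe.
agent = PHCAgent(n_slots=30)
act = agent.choose_action(3)
agent.update(3, act, reward=-20 if act == "send" else -5, next_state=4)

The convergence shortcut described above (fixing the probability of send to one in a slot after a successful transmission there) could be added by simply overwriting self.pi[state] after a success.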
B. Slotted ALOHA

LoRaWAN uses pure ALOHA, also known as unslotted ALOHA, rather than slotted ALOHA. Slotted ALOHA nevertheless performs better than unslotted ALOHA, because end-devices can only transmit at the beginning of a slot, which avoids collisions across slot boundaries. We choose slotted ALOHA rather than unslotted ALOHA as the baseline for the RL based method because both slotted ALOHA and the proposed method require time synchronization, which makes the comparison fair and straightforward. In slotted ALOHA, when an end-device fails to transmit a data packet, it retransmits after a random time, and the range of this random backoff grows exponentially with the number of retransmissions. For a fair comparison, after a successful transmission an end-device in slotted ALOHA sleeps for the same time before its next transmission as it would under the proposed method.
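The exponential backoff just described can be sketched as follows. The base and the cap of the backoff window are placeholders chosen for illustration; the paper does not state the exact window sizes.

import random

def backoff_slots(n_failures, base=2, max_window=1024):
    # After the k-th consecutive failure, wait a uniformly random number of
    # slots drawn from a window that doubles with every failure (capped).
    window = min(base ** n_failures, max_window)
    return random.randrange(window)

# Example: the expected waiting time grows with each consecutive failure.
for k in range(1, 5):
    print(k, backoff_slots(k))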

III. IMPLEMENTATION SETUP

We use a Python-based simulator to measure the convergence speed and the collision rate of the proposed method and of slotted ALOHA. The superframe size is chosen as 30 slots. We also test the convergence speed for different network sizes, in order to find a network size that balances convergence speed and throughput.

A. Hyperparameters

The simulation hyperparameters are given in TABLE I.

TABLE I
SIMULATION HYPERPARAMETERS

Parameter                        Value
α (Q learning rate)              0.8
δ (policy learning rate)         0.8
γ (discount factor)              0.2
transmission success reward      100
transmission failure reward      -20
sleep reward                     -5
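Section IV evaluates both protocols by the per-superframe Success End-devices Count and collision rate. The paper does not spell out how the collision rate is normalized, so the sketch below (our own illustration, not the authors' code) counts collided slots relative to slots with at least one transmission; the convergence criterion follows Section II, namely that the success count equals the number of end-devices in the network.

def superframe_metrics(frame_actions):
    """frame_actions: one dict per slot, mapping end-device id -> 'send' or 'sleep'.
    Returns (success_end_devices_count, collision_rate) for one superframe."""
    successes, busy_slots, collided_slots = set(), 0, 0
    for slot in frame_actions:
        senders = [dev for dev, a in slot.items() if a == "send"]
        if not senders:
            continue
        busy_slots += 1
        if len(senders) == 1:
            successes.add(senders[0])      # exactly one sender: successful slot
        else:
            collided_slots += 1            # two or more senders: collision
    collision_rate = collided_slots / busy_slots if busy_slots else 0.0
    # Convergence: every end-device owns a slot, i.e. the success count equals
    # the number of end-devices in the network.
    return len(successes), collision_rate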
IV. EXPERIMENTS AND RESULTS ANALYSIS

Fig. 3. Success End-devices Count and Collision Rate of Slotted ALOHA with 20 End-devices
Fig. 4. Success End-devices Count and Collision Rate of RL Based Method with 20 End-devices
Fig. 5. Success End-devices Count and Collision Rate of Slotted ALOHA with 30 End-devices
Fig. 6. Success End-devices Count and Collision Rate of RL Based Method with 30 End-devices

The first part of the experiments verifies that the throughput and collision rate of the RL based method are better than those of slotted ALOHA during training. When the number of end-devices is 20, Fig. 3 shows the number of superframes slotted ALOHA needs to become stable and its collision rate during this process, where the Success End-devices Count is the number of end-devices that transmit data successfully in a superframe. From Fig. 3, it takes about 350 superframes for this count to reach 20, the number of end-devices in the network, which means the system needs about 350 superframes to converge. In fact, the number of superframes slotted ALOHA takes to converge is not stable, ranging from 100 to 1000 across different simulation runs due to the random retransmission intervals. In addition, since end-devices in slotted ALOHA have no knowledge of the environment and do not know whether the channel is busy or free, there is no downward trend in the collision rate, only fluctuations; that is why the collision rate of slotted ALOHA around the 125th superframe is high. What is worse, such collisions make end-devices that have already found a slot abandon it and search for a new one. As for the RL based method, Fig. 4 shows that it performs better than slotted ALOHA: it takes only about 17 superframes to converge, and there is a clear downward trend in the collision rate. Moreover, the number of superframes the system takes to converge is much more stable than for slotted ALOHA, varying only from 15 to 25.

When the number of end-devices is 30, both slotted ALOHA and the RL based method perform worse than before. Fig. 5 and Fig. 6 show the corresponding results. With slotted ALOHA, the system cannot converge within 2000 superframes. It is extremely hard for every end-device to find a time slot when the number of end-devices equals the number of slots in the superframe; in fact, the closer these two values, the harder it is for the system to converge, which holds for both slotted ALOHA and the RL based method. The RL based method, although it performs worse than before, can still converge. Although the fluctuations in the collision rate are much greater than before, there is still a clear downward trend. From Fig. 6, the system converges after about 100 superframes; however, the result is unstable, and sometimes the system cannot converge within 2000 superframes, just like slotted ALOHA.

We also test the relationship between the number of end-devices and the number of superframes needed to converge. Since the system hardly converges when the number of end-devices is close to the number of slots in the superframe, we limit the maximum number of end-devices in slotted ALOHA to 20. Due to the randomness involved, each configuration is simulated 30 times to obtain the mean and standard deviation. Fig. 7, TABLE II and TABLE III show the results. When the number of end-devices is small, for example smaller than 9, there is no obvious difference between slotted ALOHA and the RL based method. However, the performance differs greatly once the number of end-devices goes beyond 13. For slotted ALOHA, the mean and standard deviation deteriorate when the number of end-devices exceeds 16, whereas for the RL based method this happens only beyond 23. This means the proposed method can support a larger network than slotted ALOHA, and that it reaches the balance between convergence speed and throughput at 23 end-devices.

Fig. 7. Relationship between the Number of End-devices and Converge Superframes

TABLE II
MEAN OF THE NUMBER OF CONVERGE SUPERFRAMES
(slotted ALOHA is simulated only up to 20 end-devices)

End-devices   RL      ALOHA   |  End-devices   RL      ALOHA
 1            1.00    1.00    |  16            12.4    26.3
 2            1.00    1.00    |  17            15.2    92.6
 3            1.00    1.00    |  18            15.9    137.6
 4            1.00    1.00    |  19            17.0    408.3
 5            1.00    1.03    |  20            21.6    412.1
 6            1.03    1.83    |  21            25.0    -
 7            1.13    1.53    |  22            28.6    -
 8            1.40    1.97    |  23            32.4    -
 9            1.57    3.37    |  24            168.1   -
10            1.93    3.03    |  25            47.4    -
11            2.37    4.33    |  26            123.1   -
12            2.9     11.13   |  27            332.0   -
13            3.47    7.17    |  28            330.7   -
14            6.43    11.1    |  29            608.4   -
15            8.93    25.9    |  30            879.9   -

TABLE III
STANDARD DEVIATION OF THE NUMBER OF CONVERGE SUPERFRAMES
(slotted ALOHA is simulated only up to 20 end-devices)

End-devices   RL      ALOHA   |  End-devices   RL      ALOHA
 1            0       0       |  16            4.55    32.5
 2            0       0       |  17            5.95    176.6
 3            0       0       |  18            3.01    371.2
 4            0       0       |  19            3.69    720.3
 5            0       0.18    |  20            5.23    581.0
 6            0.18    2.57    |  21            6.45    -
 7            0.35    1.04    |  22            9.25    -
 8            0.62    1.38    |  23            9.95    -
 9            0.56    6.26    |  24            498.5   -
10            0.69    2.37    |  25            17.3    -
11            0.61    4.29    |  26            357.0   -
12            0.88    23.1    |  27            666.4   -
13            1.78    5.44    |  28            667.2   -
14            4.01    17.8    |  29            814.9   -
15            4.02    32.5    |  30            869.1   -

V. CONCLUSIONS

In this paper, we present a decentralized reinforcement learning based method to improve multiple access performance, in particular the collision rate. We use the PHC algorithm, which is based on Q-learning, to train the end-devices so that they gain knowledge of the channel and take appropriate actions to maximize their reward, in other words, to transmit data as quickly as possible while keeping a low collision rate. Compared to slotted ALOHA, the proposed method performs better in terms of collision rate and convergence speed, especially with a large number of end-devices. However, the proposed method still has limitations: the result is not stable when the number of end-devices is close to the number of slots in the superframe. Besides, the end-devices could also listen to the channel to detect whether it is free or busy, which would save more energy than sending data directly. Further work could focus on improving the stability and on adding channel listening to the method to further reduce the collision rate.

ACKNOWLEDGMENT

This research is supported by the National Natural Science Foundation of China under Grant Nos. 61873119 and 61911530247, and the Science and Technology Innovation Commission of Shenzhen under Grant No. KQJSCX20180322151418232.

REFERENCES

[1] A. Laya, C. Kalalas, F. Vazquez-Gallego, L. Alonso, and J. Alonso-Zarate, "Goodbye, ALOHA!" IEEE Access, vol. 4, pp. 2029–2044, 2016.
[2] E. Balevi, F. T. Al Rabee, and R. D. Gitlin, "ALOHA-NOMA for massive machine-to-machine IoT communication," in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–5.
[3] T.-H. To and A. Duda, "Simulation of LoRa in NS-3: Improving LoRa performance with CSMA," in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–7.
[4] J. Alonso-Zarate, C. Verikoukis, E. Kartsakli, A. Cateura, and L. Alonso, "A near-optimum cross-layered distributed queuing protocol for wireless LAN," IEEE Wireless Communications, vol. 15, no. 1, pp. 48–55, 2008.
[5] A. Tinka, T. Watteyne, and K. Pister, "A decentralized scheduling algorithm for time synchronized channel hopping," in International Conference on Ad Hoc Networks. Springer, 2010, pp. 201–216.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[7] H. M. Schwartz, Multi-Agent Machine Learning: A Reinforcement Approach. John Wiley & Sons, 2014.
[8] M. Bowling and M. Veloso, "Multiagent learning using a variable learning rate," Artificial Intelligence, vol. 136, no. 2, pp. 215–250, 2002.

