
University of Tripoli
Faculty of Engineering
Electrical and Electronic Engineering Department

5G Networks Optimization Analysis Study Using Deep Reinforcement Learning

A project report submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Electronics and Communication Engineering

Prepared by:
Mahmoud Mohammed Abu Qamar

Supervised by:
Eng. Khaled Elgdamsi

Fall 2022
Libya-Tripoli
Dedication

To the pure hands that cleared the thorns of the path before us

and drew our future with lines of hope and joy;

to those whom we can never thank enough or fully repay;

to those who are the reason for what we are today:

to our fathers and mothers, our brothers and sisters, our friends, and everyone who stood by us and supported us.

May Allah preserve and honor them; to them we owe all our appreciation and respect.


Acknowledgment

First of all, we give our thanks to "Allah" for all his blessings, and for
giving us strength and ability to complete this project.

We are highly indebted to our supervisor "Eng. Khaled Elgdamsi" for his guidance, constant supervision, and patience, as well as for providing the necessary information regarding the project and for his support in completing it.

We would like to express our special gratitude and thanks to all our
teachers who taught us and gave us the knowledge and motivation
throughout our educational careers.

ABSTRACT
The demand for wireless access is increasing explosively. 5G network data rates are growing exponentially and show a trend toward diversity and heterogeneity, further raising the rate requirements for voice calls, which makes forecasting the network traffic volume face many challenges.

By studying the actual performance of a 5G network in which beamforming, power control, and interference coordination act jointly to enhance communication performance, this project formulates the joint design of beamforming, power control, and interference coordination as a non-convex problem whose objective is to maximize the signal-to-interference-plus-noise ratio (SINR), and solves it using deep reinforcement learning (DRL) algorithms.

This study proposes an algorithm for voice bearers and data bearers in the sub-6 GHz and millimeter-wave (mmWave) frequency bands, respectively. Using the reported coordinates of the users served by the network and the greedy nature of deep Q-learning to predict the future rewards of actions, the algorithm improves performance as evaluated by the SINR and the sum-rate capacity.

The simulation findings demonstrate that our technique outperforms the link adaptation industry standards for sub-6 GHz voice bearers in realistic cellular environments, and that the algorithm approaches the maximum sum-rate capacity for data bearers operating in the mmWave frequency band. The results show that the algorithm can effectively improve the accuracy of 5G traffic prediction.

The model can therefore be used for 5G traffic prediction to support decision making.

ABSTRACT (ARABIC)

The demand for wireless network access is increasing rapidly, which makes the data rates of the 5G network grow at an accelerating pace and further raises the rate requirements of voice calls; this makes forecasting the network traffic volume face many challenges.

By studying the actual performance of the 5G network, which jointly performs beamforming, power control, and interference coordination to enhance the performance of the communication network, this project works on the design of beamforming, power control, and interference coordination to obtain the maximum signal-to-interference-plus-noise ratio (SINR) using deep reinforcement learning (DRL) algorithms.

This study proposes an algorithm for voice bearers and data bearers in the sub-6 GHz and millimeter-wave (mmWave) frequency bands, respectively, using the reported coordinates of the users served by the network and the greedy nature of deep Q-learning to predict the future rewards of actions; performance, as evaluated by the SINR and the total data-transfer capacity, is thereby improved by the algorithm.

The simulation results show that our approach outperforms the industry link-adaptation standards for sub-6 GHz voice bearers in realistic cellular environments.

The algorithm approaches the maximum sum-rate capacity for data bearers operating in the mmWave frequency band.

The results show that the algorithm can effectively improve the accuracy of 5G traffic prediction.

The model can be used for forecasting fifth-generation network traffic to support decision making.

‫‪5‬‬
Contents:
ABSTRACT…………………………………………………………………………………...4
ABSTRACT (ARABIC)……………………………………………………………………...5
Contents……………………………………………………………...…………………….......6

List of Figures ……………………………………………………………………..………….8

List of Tables …………………………………………………………...……………………10

List of Abbreviations…………………………………………………………………………11

CHAPTER 1................................................................................................................ 12
INTRODUCTION ................................................................................................................... 12
1.1 Introduction on wireless communication system ...................................................... 13
1.2 A Brief History of Wireless Communication ............................................................ 14
1.3 Deep Reinforcement Learning for 5G Network ........................................................ 16
1.4 Project Objectives...................................................................................................... 18
1.5 Project Outlines ......................................................................................................... 18

CHAPTER 2................................................................................................................ 19
OVERVIEW ON 5G WITH DEEP REINFORCEMENT LEARNING ................................. 19
2.1 5G Network ............................................................................................................... 20
2.2 Deep Q-Network Concept ......................................................................................... 21
2.3 Joint Beamforming (JBF) .......................................................................................... 22
2.4 Power Control (PC) ................................................................................................... 23
2.5 Interference Coordination (IC) .................................................................................. 24
2.6 Voice Bearer .............................................................................................................. 24
2.6.1 Fixed Power Allocation (FPA) ............................................................................ 25

2.6.2 Tabular Technique ............................................................................................... 25

2.7 Data Bearer ................................................................................................................ 26

CHAPTER 3 ................................................................................................................ 27
Deep Reinforcement Learning to Optimize the 5G Network .................................................. 27
3.1 Models for Networks, Systems, and Channels............................................................... 28
3.1.1 Network Model ........................................................................................................ 28
3.1.2 System Model .......................................................................................................... 29
3.1.3 Channel Model ........................................................................................................ 30
3.2 Problem Formulation...................................................................................................... 31
3.3 An Introduction on Deep Reinforcement Learning........................................................ 31
3.4 Deep Reinforcement Learning in Voice PC and IC…………………………………...35
3.4.1 FPA .......................................................................................................................... 35
3.4.2 Tabular RL............................................................................................................... 36
3.4.3 Deep Reinforcement Learning Approach ................................................................ 37
3.5 Deep Reinforcement Learning in mmWave Beamforming PC and IC .......................... 39
3.5.1 Proposed Algorithm................................................................................................. 39
3.5.2 Brute Force .............................................................................................................. 41

CHAPTER 4 ................................................................................................................ 42
SIMULATION RESULTS AND DISCUSSIONS .................................................................. 42
4.1 Performance Measures ................................................................................................... 43
4.1.1 Convergence ............................................................................................................ 43
4.1.2 Coverage .................................................................................................................. 43
4.1.3 Sum Rate capacity ................................................................................................... 44
4.2 Simulation and Results ................................................................................................... 45
4.2.1 Setup Configurations ............................................................................................... 49
4.2.2 JB-PCIC Algorithm Flowchart ................................................................................ 55
4.3 Simulation Analysis for JB-PCIC Algorithm................................................................. 55
4.3.1 Simulation Analysis with The Basic Environment Parameters ............................... 56
4.3.2 Simulation Analysis 1.............................................................................................. 60
4.3.3 Simulation Analysis 2.............................................................................................. 63
4.3.4 Simulation Analysis 3.............................................................................................. 66
4.3.5 Simulation Analysis 4.............................................................................................. 69
4.3.6 Simulation Analysis 5.............................................................................................. 72
4.3.7 Simulation Analysis 6.............................................................................................. 75

CHAPTER 5 ................................................................................................................ 80
CONCLUSION AND THE FUTURE WORK........................................................................ 80
5.1 Conclusions .................................................................................................................... 81
5.2 Future Work ................................................................................................................... 82
REFERENCES ........................................................................................................................ 83
APPENDICES ......................................................................................................................... 85
Appendix A .......................................................................................................................... 86

List of Figures:
Figure 1.1: Wireless Communication System ......................................................................... 13

Figure 1.2: Cellular System Architecture ................................................................................ 15

Figure 1.3: Simple Deep Q-Network ....................................................................................... 17

Figure 2.5: Deep Q-Network [4].............................................................................................. 22

Figure 3. 1: The interaction of the agent as well as the environment in reinforcement learning
.................................................................................................................................................. 32
Figure 3. 2: Downlink joint beamforming, power control, and interference coordination
Module ..................................................................................................................................... 39

Figure 4. 1: Simulation flowchart for Data and Voice bearers ................................................ 48


Figure 4. 2: Binary Encoding of Actions for Beamforming, Power Control, and Interference
Coordination in Different Bearer Types .............................................. 50
Figure 4. 3: Proposed algorithm flowchart .............................................................................. 55
Figure 4. 4: CCDF of SINR_eff for JB-PCIC algorithm as a function of M ............................ 56
Figure 4. 5: The Normalized Convergence Time for The JB-PCIC Algorithm versus M ....... 57
Figure 4. 6: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M. ................................................................................ 58
Figure 4. 7: Sum-Rate Capacity of The Convergence Episode as a Function of The M. ........ 59
Figure 4. 8: The CCDF of γ_eff^voice for three different voice algorithms ............................ 59
Figure 4. 9: CCDF of SINR_eff for JB-PCIC algorithm as a function of M............................ 61
Figure 4. 10: The Normalized Convergence Time for The JB-PCIC Algorithm versus M..... 61
Figure 4. 11: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M ................................................................................. 62
Figure 4. 12: Sum-Rate Capacity of The Convergence Episode as a Function of The M. ...... 62
Figure 4. 13: The CCDF of γ_eff^voice for three different voice algorithms .......................... 63
Figure 4. 14: CCDF of SINR_eff for JB-PCIC algorithm as a function of M .......................... 64
Figure 4. 15: The Normalized Convergence Time for The JB-PCIC Algorithm versus M..... 64
Figure 4. 16: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M................................................................................. 65
Figure 4. 17: Sum-Rate Capacity of The Convergence Episode as a Function of The M ....... 65
Figure 4. 18: The CCDF of γ_eff^voice for three different voice algorithms .......................... 66
Figure 4. 19: CCDF of SINR_eff for JB-PCIC algorithm as a function of M .......................... 67
Figure 4. 20: The Normalized Convergence Time for The JB-PCIC Algorithm versus M..... 67
Figure 4. 21: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M................................................................................. 68
Figure 4. 22: The Normalized Convergence Time for the JB-PCIC Algorithm versus M ...... 68
Figure 4. 23: The CCDF of γ_eff^voice for three different voice algorithms .......................... 69
Figure 4. 24: CCDF of SINR_eff for JB-PCIC algorithm as a function of M .......................... 70
Figure 4. 25: The Normalized Convergence Time for The JB-PCIC Algorithm versus M..... 70
Figure 4. 26: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M................................................................................. 71
Figure 4. 27: Sum-Rate Capacity of The Convergence Episode as a Function of The M ....... 71
Figure 4. 28: The CCDF of γ_eff^voice for three different voice algorithms ......................... 72
Figure 4. 29: CCDF of SINR_eff for JB-PCIC algorithm as a function of M.......................... 73
Figure 4. 30: The Normalized Convergence Time for The JB-PCIC Algorithm versus M..... 73
Figure 4. 31: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M................................................................................. 74
Figure 4. 32: Sum-Rate Capacity of The Convergence Episode as a Function of The M ....... 74
Figure 4. 33: The CCDF of γ_eff^voice for three different voice algorithms .......................... 75
Figure 4. 34: CCDF of SINR_eff for JB-PCIC algorithm as a function of M.......................... 76
Figure 4. 35: The Normalized Convergence Time for The JB-PCIC Algorithm versus M..... 77
Figure 4. 36: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M................................................................................. 77
Figure 4. 37: Sum-Rate Capacity of The Convergence Episode as a Function of The M ....... 78
Figure 4. 38: The CCDF of γ_eff^voice for three different voice algorithms .......................... 78

List of Tables:
Table 1. 1: Comparison of All Generations of Mobile Technologies...................................... 15

Table 4. 1: Reinforcement Learning Hyperparameter ............................................................. 49


Table 4. 2: Simulation State S .................................................................................................. 49
Table 4. 3: The Power Control, And Interference Coordination Commands For Voice Bearer
.................................................................................................................................................. 50
Table 4. 4: The Joint Beamforming, Power Control, And Interference Coordination ............ 51
Table 4. 5: Radio Environment Parameters ............................................................................. 54
Table 4. 6: Input And Output Parameters ................................................................................ 55
Table 4. 7: Radio Environment Parameters .............................................................................. 60
Table 4. 8: Radio Environment Parameters .............................................................................. 63
Table 4. 9: Radio Environment Parameters .............................................................................. 66
Table 4. 10: Radio Environment Parameters ............................................................................ 69
Table 4. 11: Radio Environment Parameters ............................................................................ 72
Table 4. 12: Radio Environment Parameters ............................................................................ 75

List of Abbreviations:
2G Second Generation
3G Third Generation
4G Fourth Generation
5G Fifth Generation
AAS Advanced Antenna System
BS Base Station
CDMA Code Division Multiple Access
DQN Deep Q-Network
DRL Deep Reinforcement Learning
eNB Enhanced Node B
FDMA Frequency Division Multiple Access
gNB Next Generation Node B
GSM Global System for Mobile Communication
IC Interference Coordination
JBF Joint Beamforming
M Number Of Antennas
NOMA Non-Orthogonal Multiple Access
NR New Radio
OFDMA Orthogonal Frequency Division Multiple Access
PC Power Control
QoS Quality-Of-Service
RF Radio Frequency
RL Reinforcement Learning
SGD Stochastic Gradient Descent
SINR Signal to Interference plus Noise Ratio
TDMA Time Division Multiple Access
UE User Equipment
WAN Wide Area Network

CHAPTER 1

INTRODUCTION

1.1 Introduction on wireless communication system:

Wireless communication enables voice and data transmission without cables or wires. It uses
electromagnetic signals to send data to devices, providing increased flexibility and mobility
compared to wired communication. This technology has transformed communication and
information access.

Wireless technology has advanced, enabling faster and longer data transmission. It's now an
essential part of our lives, connecting us via smartphones, tablets, laptops, and wearables. It
also benefits businesses and industries by enhancing efficiency and productivity.

Wireless communication has also led to the development of new devices and applications, such
as IoT (Internet of Things) devices, smart homes, and autonomous vehicles. These devices and
applications rely on wireless communication to function properly, and their popularity is only
expected to grow in the coming years.

Overall, wireless communication is a vital technology that has transformed the way we live and work, and its impact on our daily lives is only set to increase in the future.

In a communication system, information is sent from a transmitter to a receiver over a short or


long distance. This is crucial for voice and data transmission, TV broadcasting, and internet
connectivity. The range can vary from a few meters (e.g., TV remote control) to thousands of
kilometers using wireless technologies like satellite communication.

Wireless communication has revolutionized the way we communicate and has made it possible
to transmit information across vast distances without the need for physical connections. The
block diagram of a wireless communication system, as shown in Fig 1.1, typically consists of
several components, including a transmitter, a receiver, an antenna, and a communication
channel. The transmitter converts the information into a signal that can be transmitted
wirelessly through the antenna, which receives the signal and passes it on to the receiver.

Figure 1.1: Wireless Communication System

1.2 A Brief History of Wireless Communication:

The history of wireless communication is a fascinating one that spans over a century. It all
began in the early 20th century when the first wireless transmitters were introduced. These
early transmitters utilized a form of radio communication known as radiotelegraphy, which
involved the use of Morse code or other coded signals to transmit information.

As technology progressed, wireless communication evolved to allow for the transmission of


voice and music using modulation. This advancement led to the medium being referred to as
radio. Wireless transmitters use electromagnetic waves to transmit voice, data, video, or signals
over a communication path, making it possible for people to communicate with each other
across great distances.

In the early 1970s, the groundwork for modern wireless networking was laid with the launch
of the ALOHA system in Hawaii. The network, which was technically a wide area network
(WAN), relied on ultra-high frequency signals to broadcast data among the islands. The
technology that underpinned the ALOHA system played a crucial role in the creation of
Ethernet in 1973 and was instrumental in the development of 802.11, the first wireless standard.

Wireless communication is now an essential part of daily life, connecting people worldwide
through various cellular networks. In the past, two primary standards, GSM and CDMA,
dominated the industry, but with the introduction of 4G/LTE and especially 5G, the distinctions
between these technologies have become less clear. Consequently, older GSM and CDMA
networks are being phased out as they are becoming outdated. The newer 5G networks offer
faster speeds and greater bandwidth, ensuring high-quality wireless connectivity no matter the
location [1].

Modern cellular networks are typically defined in terms of which generation of wireless
standard is supported. Here's a look at the different types of cellular networks:

• 2G. This first major wave of cellular technology adoption was introduced in 1991, with
speeds limited to 50 Kbps.
• 3G. Third-generation networks began to appear in 2001. 3G offered increased
bandwidth and signal quality over 2G and provided a peak speed of 7.2 Mbps.
• 4G/LTE. Fourth-generation wireless and LTE began to appear in 2009 as successors to
3G. As opposed to the 2G and 3G standards, the International Telecommunication
Union specified a strict minimum data rate for 4G. To be considered 4G/LTE, the

cellular networks have to transmit and receive at 100 Mbps.
• 5G. Fifth-generation wireless was first introduced as a technical standard in 2016, and
carriers began to deploy it in 2019. 5G provides more bandwidth than its predecessors,
data speeds that can range as high as 20 Gbps and ultra-low latency -- five milliseconds
or less. These networks can either be public or private 5G, and the standard has fueled
a variety of new business cases, among them autonomous automobiles and
sophisticated industrial control systems.

The cellular system architecture is shown in Fig 1.2, and a comparison between the mobile network generations is given in Table 1.1.

Figure 1.2: Cellular System Architecture

Table 1. 1: Comparison of All Generations of Mobile Technologies [1]

Feature          | 1G                          | 2G                          | 3G                         | 4G                                           | 5G
Start deployment | 1970-1980                   | 1990-2004                   | 2004-2010                  | 2010-2015                                    | 2015
Data bandwidth   | 2 kbps                      | 64 kbps                     | 2 Mbps                     | 1 Gbps                                       | Higher than 1 Gbps
Technology       | Analog cellular technology  | Digital cellular technology | CDMA 2000                  | Wi-Max, LTE, Wi-Fi                           | NR
Service          | Voice                       | Voice, SMS and data         | High-quality audio, video  | Dynamic information access, wearable devices | Dynamic information access, wearable devices with AI capabilities
Multiplexing     | FDMA                        | TDMA, CDMA                  | CDMA                       | OFDMA                                        | NOMA
Switching        | Circuit                     | Circuit, packet             | Packet                     | All packet                                   | All packet
Core network     | PSTN                        | PSTN                        | Packet N/W                 | Internet                                     | Internet

1.3 Deep Reinforcement Learning for 5G Network:

The arrival of the 5G network marks a significant milestone in telecommunications. This


technology will revolutionize our lives, work, and communication by offering numerous
benefits that transform business operations. 5G is capable of supporting diverse scenarios
across different industries, driving innovation and progress. Exciting applications include
intelligent security systems, high-definition video streaming, telemedicine, smart home
automation, autonomous vehicles, and augmented reality. 5G stands out by meeting the unique
communication requirements of each scenario, such as high-speed and low-latency for security
systems or reliability for telemedicine. It also offers enhanced mobility, advanced billing
options, and greater policy control. 5G plays a crucial role in shaping our connected and
digitalized future, fostering efficiency, sustainability, and connectivity. With unmatched speed,
reliability, and flexibility, 5G redefines our way of living, working, and interacting, bringing
endless possibilities and a bright future.

Wireless communication is a fundamental part of our daily lives, and it is constantly evolving
to meet the growing demand for faster and more reliable connectivity. To address this demand,
researchers and engineers have developed innovative techniques such as Deep Q-Network
(DQN) or Deep Reinforcement Learning (DRL), which has shown great potential in improving
the End-to-End (E2E) connectivity between users (UEs) and Base Stations (BSs).

DQN works by analyzing feedback information from UEs and using this data to optimize the
connection between the devices. The algorithm is designed to learn and adapt to new situations,
making it an excellent tool for handling complex wireless communication scenarios. Figure 1.3
provides a visual representation of how the input, hidden, and output layers of a simple DQN
operate together to improve connectivity.

DQN is revolutionizing wireless communication by optimizing connections, reducing latency,


and improving bandwidth utilization. It will play a vital role as more devices connect and the
need for faster connectivity grows. DQN is a game-changing technology that improves wireless
communication by learning, adapting, and optimizing connectivity to meet increasing
demands. Deep Reinforcement Learning (DRL) revolutionizes the optimization of Joint
Beamforming (JBF), Power Control (PC), and Interference Coordination (IC) in 5G networks.
By using DRL, feedback states can be dynamically modified to maximize the Signal to Interference plus Noise Ratios (SINRs) for voice and data bearers across different frequency bands
(sub-6GHz and mmWave). This leads to improved network performance, including higher
throughput, lower latency, and greater reliability, meeting the requirements of emerging 5G
applications and services. DRL enables operators to achieve network efficiency, scalability,
and flexibility for a connected and digital future [2].

Figure 1.3: Simple Deep Q-Network

1.4 Project Objectives:

• To study voice bearer algorithms and survey the common techniques in telecommunication systems alongside the DRL algorithms proposed for optimizing wireless networks.
• To improve voice-call and data reliability and maximize SINR by using advanced signal processing techniques, such as beamforming, to enhance the desired signal and reduce interference.
• To optimize the voice and data problems using Python and TensorFlow, collecting wireless network performance data to train a deep neural network (DNN) that predicts optimal network settings (a minimal training sketch follows this list).
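As a hedged illustration of the last objective only, the sketch below trains a small Keras DNN on placeholder arrays; the names network_features and target_settings, and all shapes, are illustrative assumptions rather than data or models from this project.

```python
import numpy as np
import tensorflow as tf

# Hypothetical training data: each row holds measured network features
# (e.g., UE coordinates, received power, SINR) and a corresponding
# "good" setting observed or computed offline. Shapes are illustrative only.
network_features = np.random.rand(1000, 4).astype("float32")
target_settings = np.random.rand(1000, 1).astype("float32")

# A small fully connected network, as one possible DNN architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),  # predicted network setting
])
model.compile(optimizer="adam", loss="mse")
model.fit(network_features, target_settings, epochs=10, batch_size=32, verbose=0)

# Predict a setting for a new measurement vector.
print(model.predict(network_features[:1]))
```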

1.5 Project Outlines:

• Chapter two presents a thorough theoretical overview of all project components. It will
cover the fundamental principles, theories, and concepts that form the project's
foundation. The chapter will also explain the equations and assumptions used during
the project's development.
• In the third chapter, readers will find a detailed description of the equations,
assumptions, methodology, tools, and techniques used in the project. This chapter
explains the technical aspects and provides a clear understanding of how the project
was successfully completed.
• Chapter four explores the project's data, including the challenges faced. It offers an
overview of collected data, analyses, and insights, providing valuable information on
the project's impact.
• Finally, in the fifth chapter, readers will find a conclusion and discussion of future
work.

CHAPTER 2

OVERVIEW ON 5G WITH DEEP


REINFORCEMENT LEARNING

2.1 5G Network:

The fifth-generation (5G) network is expected to offer a significant improvement in operational


performance by increasing spectral efficiency, providing higher data rates, and reducing latency,
as shown in Figure 2.1. Furthermore, it should offer a superior user experience comparable
to that of fixed networks while still maintaining full mobility and coverage. This is especially
important for the massive deployment of Internet of Things (IoT) devices, which require low
energy consumption, equipment cost, and network deployment and operation cost. To meet
these requirements, 5G needs to support a wide variety of applications and services, including
those that demand high bandwidth and low latency, such as virtual and augmented reality,
telemedicine, and autonomous vehicles. As such, the development and deployment of 5G
networks is expected to have a significant impact on various industries and society as a whole.

Figure 2. 1: IMT-2020 (5th generation) Spider Chart [3]

The modern and advanced wireless communication technology, also known as the New Radio
network (NR) or 5G network, employs a technique called non-orthogonal multiple access
(NOMA) to manage multiple users on the same resource block. The NOMA technique is a

significant breakthrough in wireless communication technology and is based on the key
concept of serving multiple users in a single orthogonal resource block. This approach helps to
improve the efficiency of resource utilization and increase the capacity of the network,
providing faster data transfer rates and better connectivity.

The NOMA technique utilizes a unique multiple access technique, which is different from
conventional orthogonal multiple access (OMA) techniques. In the NOMA technique, the
signal of each user is assigned a different power level, allowing multiple users to share the
same frequency band and time slot simultaneously. This approach enables the NR to support a
massive number of devices, which is essential for the successful implementation of the Internet
of Things (IoT) and other emerging technologies.

Overall, the NOMA technique is a promising development that has the potential to
revolutionize wireless communication technology. Its implementation in the NR has
significantly increased the network's capacity, reliability, and efficiency, paving the way for a
new era of wireless communication [3].

The most significant components of the 5G network analyzed in this project are the following:

• Joint Beamforming (JBF).


• Power Control (PC)
• Interference Coordination (IC).

To perform the optimization based on feedback from UEs, DRL is used to ensure that data and
voice bearers have the highest possible SINR and dependability.

2.2 Deep Q-Network Concept:

Deep Q-learning is a powerful approach in reinforcement learning that allows training the agent
to interact with its environment and make decisions based on the feedback it receives. One of
the key aspects of this approach is the use of a neural network to approximate the Q-value
function. The Q-value function determines the expected future rewards for all possible actions
that the agent can take in a given state. By using a neural network to approximate this function,
the agent can efficiently learn the best action to take in any given state.

In reinforcement learning, the agent's objective is to maximize its total reward across an
episode, which is a sequence of states that begins with an initial state and ends with a terminal
state. The agent's actions in each state lead to rewards that could be positive or negative. By

learning from experience, the agent develops a strategy or policy that enables it to take the best
action in each state to maximize its cumulative reward.

The training process involves the agent repeatedly interacting with the environment,
performing actions, and observing the resulting rewards. The agent's experience is then used
to update the Q-value function approximation, which in turn informs the agent's policy. This
process continues until the agent's policy converges, and it has learned to make optimal
decisions in all possible states.

Deep Q-learning has been successfully applied in various real-world applications, including
gaming, robotics, and finance. This approach has shown promise in solving complex problems,
where the state space is large and the optimal policy is not known a priori. However, it is worth
noting that deep Q-learning is still an active area of research, and there are still many challenges
that need to be addressed, such as the problem of overestimation of the Q-value function and
the issue of stability in the learning process. Nonetheless, deep Q-learning remains a key
technique in the field of reinforcement learning and is likely to continue to be an area of active
research for years to come [4].

Figure 2.4: Deep Q-Network [4]

To provide the best service to users, the algorithm optimizes the JBF, PC, and IC with the help of the DQN in order to improve the voice and data bearers.
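As a minimal, hedged sketch of this idea (not the project's exact model), a Q-network simply maps an observed state vector to one Q-value per available action, and the agent picks the action with the largest predicted value; the dimensions below are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

STATE_DIM = 4    # e.g., UE coordinates and reported measurements (illustrative)
NUM_ACTIONS = 8  # e.g., power-control / beamforming commands (illustrative)

# Q-network: state in, one Q-value per action out.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(NUM_ACTIONS),
])

state = np.random.rand(1, STATE_DIM).astype("float32")
q_values = q_network(state).numpy()          # predicted future rewards per action
greedy_action = int(np.argmax(q_values))     # exploit: choose the best-looking action
print(greedy_action, q_values)
```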

2.3 Joint Beamforming (JBF):

Beamforming is a radio frequency (RF) management technique that has become increasingly
important in recent years as wireless networks continue to grow in complexity and demand.
This technology enables access points to dynamically adjust their radio signal beams to focus
on specific clients or devices, rather than broadcasting signals in all directions, which is a less
efficient use of wireless spectrum. By focusing the signal, beamforming can significantly
improve the strength and quality of the connection between the access point and the client
device, resulting in faster and more reliable data transmission speeds.

Beamforming is often considered a subset of Advanced Antenna Systems (AAS), which


includes other smart antenna technologies that can optimize the direction, polarization, and
power of radio signals. Unlike traditional antennas, which transmit and receive radio signals in
all directions, smart antennas can dynamically adjust their beam patterns to focus on specific
clients or devices, depending on the location and orientation of those devices.

There are two main types of beamforming: static and dynamic. Static beamforming involves
configuring the antennas in advance to focus on a specific location or direction, while dynamic
beamforming adjusts the antenna beam on-the-fly in response to changes in the network
environment, such as the movement of client devices. Dynamic beamforming is typically more
effective than static beamforming because it can adapt to changing network conditions in real-
time.

One of the key benefits of beamforming is that it can significantly improve the SNR
performance of wireless networks. SNR is a measure of the strength of the desired signal (i.e.,
the data being transmitted) relative to the level of background noise in the wireless
environment. By focusing the radio signal on the client device, beamforming can reduce the
level of noise and interference in the wireless environment, resulting in a stronger and more
reliable connection.

Overall, beamforming is a powerful technology that can help businesses and organizations
achieve faster and more reliable wireless connectivity. As wireless networks continue to evolve
and demand increases, beamforming is likely to become an increasingly important tool for
optimizing wireless performance and improving the user experience [5].

2.4 Power Control (PC):

Wireless communication is a vital aspect of modern society, and the demand for high-speed
connectivity is increasing exponentially. In this context, uplink power control plays a crucial
role in maintaining the quality of wireless communication. Uplink power control involves the
regulation of the transmit power of UE or mobile devices, which can be either increased or
decreased based on the requirements of the system. The primary goal of uplink power control
is to ensure that the SNR or bit error rate (BER) at the base station, gNB, or eNB meets the
desired level of performance.
In wireless systems, the transmit power is increased to improve the SNR or BER at the base
station, gNB, or eNB. This increase in transmit power is necessary to maintain high-quality
connectivity, especially in areas where the signal strength is low. Conversely, the transmit
power is decreased to minimize co-channel interference, which can occur when multiple
devices transmit on the same frequency at the same time.

Power control is an essential technique used in wireless communication systems to manage the
transmit power of mobile devices. By optimizing the transmit power, power control can
improve the efficiency and reliability of wireless communication. Moreover, power control is
crucial in ensuring that the available spectrum is utilized efficiently, and interference is
minimized, leading to better overall performance [6].

2.5 Interference Coordination (IC):

The advent of the 5G New Radio (NR) technology has opened up exciting prospects for the
implementation of innovative inter-cell interference coordination (ICIC) mechanisms. These
mechanisms aim to serve two primary objectives. Firstly, they aim to maximize the advantages
of ICIC while adhering to the principles of radio resource management (RRM) that govern 5G
networks. Secondly, they aim to facilitate the smooth implementation of new services and
deployment scenarios that are made possible by 5G. The ICIC mechanisms that are devised for
5G NR will play a critical role in ensuring that interference between cells is managed efficiently
and that network resources are utilized optimally. By enabling effective inter-cell
communication and reducing interference, these mechanisms will pave the way for the
seamless implementation of diverse services, such as high-speed internet, augmented reality,
and the Internet of Things (IoT). As such, they represent an exciting opportunity for network
operators and service providers to deliver enhanced value to their customers and drive growth
in the 5G ecosystem [7].

2.6 Voice Bearer:

In this section, focused on voice communication, we conduct an in-depth analysis of the primary methods utilized. Specifically, we explore and compare the Fixed Power Allocation (FPA) method, the voice tabular approach, and the DQN algorithm. These methods are
crucial in determining how voice is transmitted and received in communication systems.
Through our evaluation of these techniques, we can gain a better understanding of their

strengths and limitations, which can ultimately inform the development of more efficient and
effective communication systems.

2.6.1 Fixed Power Allocation (FPA):

In a large-scale MIMO system, a fixed power allocation algorithm that relies on a time-shift
pilot has been developed. The primary objective of this algorithm is to optimize the channel
capacity of the edge terminal. This is achieved by establishing an optimization goal based on
the time-shift pilot system. To implement the algorithm, a previous study on fixed power
allocation is taken as a reference. The algorithm uses this study as a basis to allocate power
efficiently in the system, in order to improve the performance of the edge terminal. Through
this method, the system can enhance its overall capacity and improve the quality of the
connection for the edge terminal. Therefore, the fixed power allocation algorithm based on
time-shift pilot has been proposed as a promising solution for improving the performance of
large-scale MIMO systems [8].

2.6.2 Tabular Technique:

Tabular methods refer to problems in which the state and action spaces are small enough for approximate value functions to be represented as arrays and tables. The aim of reinforcement learning is to find a solution to the following equation, called the Bellman equation:

V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s' \in S,\, r} p(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right] \qquad (2.1)

Where:
• π: Policy.
• a: Action.
• s: State.
• s’: Next state
• γ: Discount factor.
• r: Reward signal.
• p: Transition probability.
What we mean by solving the Bellman equation is to find the optimal policy that maximizes
the State Value function.

Since an analytical solution is hard to obtain, iterative methods are used to compute the optimal policy. The optimal state and action value functions are denoted as follows:

v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{*}(s') \right] \qquad (2.2)

q_{*}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_{*}(s', a') \right] \qquad (2.3)

Each state's value is computed using the values of the neighboring states as input (regardless of whether those values are accurate yet). After a value for one state has been computed, we move on to another state and repeat the procedure there (taking into account any new values computed for previous states). The sweep is repeated until the sum of the changes across all states is less than a predetermined threshold [9].
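A minimal value-iteration sketch of this sweep is given below, assuming a toy MDP described by a table p[s][a] of (probability, next state, reward) triples; this structure and the numbers are hypothetical, not the project's environment.

```python
import numpy as np

def value_iteration(p, num_states, num_actions, gamma=0.9, tol=1e-6):
    """Iteratively apply the Bellman optimality update (Eq. 2.2) until the
    total change across states falls below a predetermined threshold."""
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            # Evaluate every action using the current estimates of neighboring states.
            action_values = [
                sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in p[s][a])
                for a in range(num_actions)
            ]
            new_v = max(action_values)
            delta += abs(new_v - V[s])
            V[s] = new_v
        if delta < tol:
            return V

# Tiny 2-state, 2-action example: p[s][a] = [(probability, next_state, reward), ...]
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
print(value_iteration(p, num_states=2, num_actions=2))
```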

2.7 Data Bearer:

This study evaluates the complementary cumulative distribution function (CCDF) of the SINR for different numbers of antennas (M). Using this approach, we investigate the normalized convergence of the proposed DQN algorithm, which is called the JB-PCIC algorithm. By analyzing the impact of the number of antennas on the achievable SINR and the normalized transmit power, we then compare the performance of the optimal (brute-force) algorithm with the proposed DQN algorithm. Moreover, we evaluate the sum-rate capacity of the convergence episode in relation to the number of antennas. The outcomes reveal the relationship between the number of antennas and the performance of the proposed algorithm, and provide insights into the potential benefits of utilizing the JB-PCIC algorithm in various scenarios, depending on the number of antennas available.
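As a small, hedged sketch of how such a CCDF can be computed (assuming sinr_db holds SINR samples collected from simulation episodes; the values below are placeholders only):

```python
import numpy as np

def empirical_ccdf(samples):
    """Return (x, P[X > x]) pairs for the empirical CCDF of a 1-D sample array."""
    x = np.sort(np.asarray(samples))
    # Fraction of samples strictly above each sorted value.
    ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)
    return x, ccdf

# Hypothetical SINR samples (dB) from repeated episodes; placeholder data only.
sinr_db = np.random.normal(loc=10.0, scale=5.0, size=1000)
x, ccdf = empirical_ccdf(sinr_db)
print(x[:3], ccdf[:3])
```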

CHAPTER 3

Deep Reinforcement Learning to


Optimize the 5G Network

3.1 Models for Networks, Systems, and Channels:
This section provides a comprehensive description of the adopted network, system, and channel
models. The network model encompasses the architecture and design of the interconnected
components, ensuring efficient data flow and reliable communication. The system model
outlines the overall structure and functioning of the entire system, including hardware and
software components. Finally, the channel model explains the characteristics and behavior of
the communication channel, enabling us to analyze and optimize data transmission
performance. All of the equations considered in this project are obtained from [10].

3.1.1 Network Model:

We consider a multi-access downlink orthogonal frequency division multiplexing (OFDM) network with L BSs.

The network consists of a serving base station (the ȴth BS) and at least one interfering base station (the bth BS). We use a downlink scenario in which a BS transmits to a single user (UE). The UEs are distributed at random throughout their service area, whereas the BSs have an inter-site distance of R. Users are associated with the BS that serves them based on their distance from one another, and one BS may only serve one user at a time. The cell radius r is greater than R/2 to allow for coverage overlap, which is required to measure the total interference in the area. Data bearers employ mmWave frequency bands, while voice bearers operate on sub-6 GHz frequency bands.

For the data bearers, analog beamforming is used to compensate for the greater propagation loss caused by the higher center frequency.

3.1.2 System Model:

Each BS uses a uniform linear array (ULA) with M antennas, while each UE has only one antenna; therefore, the received signal at the UE served by the ȴth BS can be expressed as:

y_ȴ = h_{ȴ,ȴ}^{*} f_{ȴ,ȴ}\, x_ȴ + \sum_{b \neq ȴ} h_{ȴ,b}^{*} f_{ȴ,b}\, x_b + n_ȴ \qquad (3.1)

Where:

h_{ȴ,ȴ}, h_{ȴ,b}: the M×1 channel vectors connecting the user served by the ȴth BS with the ȴth BS and the bth BS, respectively.

x_ȴ, x_b: the transmitted signals from the ȴth and bth BSs.

f_{ȴ,ȴ}, f_{ȴ,b}: the adopted downlink (DL) M×1 beamforming vectors at the ȴth BS and the bth BS.

n_ȴ: the receiver noise at the user's location, drawn from a complex Normal distribution with zero mean and variance σ², i.e., CN(0, σ²).

The beamforming weights of each beamforming vector are implemented using constant-modulus phase shifters, [f_ȴ]_m = e^{jθ_m}, ȴ = 1, 2, 3, …, L. A beamsteering-based beamforming codebook Ƒ is employed to select each beamforming vector, with the codebook's n-th element defined as:

f_n \equiv a(\theta_n) = \frac{1}{\sqrt{M}} \left[ 1, e^{j k d \cos(\theta_n)}, \ldots, e^{j k d (M-1) \cos(\theta_n)} \right]^{T} \qquad (3.2)

Where:

• k: The wave number.
• d: The antenna spacing.
• θ_n: The steering angle, taken from the quantized set [0 : π]/M (the interval [0, π] divided into M steps).
• a(θ_n): The array steering vector in the direction of θ_n.
• M: The number of antennas.
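A short numpy sketch of this beamsteering codebook (Eq. 3.2) is given below, assuming half-wavelength antenna spacing so that kd = π; the function and variable names are illustrative.

```python
import numpy as np

def beamsteering_codebook(M, kd=np.pi):
    """Build the M beamsteering vectors of Eq. (3.2): f_n = a(theta_n)/sqrt(M),
    with steering angles theta_n uniformly quantizing [0, pi]."""
    thetas = np.arange(M) * np.pi / M          # quantized steering angles
    m = np.arange(M).reshape(-1, 1)            # antenna index 0..M-1
    # Columns are the codebook vectors a(theta_n) / sqrt(M).
    F = np.exp(1j * kd * m * np.cos(thetas)) / np.sqrt(M)
    return thetas, F

thetas, F = beamsteering_codebook(M=8)
print(F.shape)   # (8, 8): one M x 1 beamforming vector per steering angle
```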

3.1.3 Channel Model:

Using the narrow-band geometric channel model for analyzing and designing the mmWave system, the DL channel between the user served by BS ȴ and BS b can be expressed as:

h_{ȴ,b} = \frac{\sqrt{M}}{\rho_{ȴ,b}} \sum_{p=1}^{N_{ȴ,b}} \alpha_{ȴ,b}^{p}\, a^{*}(\theta_{ȴ,b}^{p}) \qquad (3.3)

Where:

• α_{ȴ,b}^{p}: The complex gain of the pth path.
• θ_{ȴ,b}^{p}: The angle of departure (AoD) of the pth path.
• a(θ_{ȴ,b}^{p}): The array response vector associated with the AoD.
• N_{ȴ,b}: The number of channel paths, which is normally small in mmWave channels compared to sub-6 GHz channels [11].
• ρ_{ȴ,b}: The path loss between BS b and the user serviced in the region of BS ȴ.

Keep in mind that in (3.3) the channel model accounts for both line-of-sight (LOS) and non-line-of-sight (NLOS) propagation, where in the NLOS case N_{ȴ,b} = 1.

Then the received downlink power as measured by the UE over a number of physical resource blocks (PRBs) at a specific time t is what we refer to as P_UE(t), so:

P_{UE}^{ȴ,b}(t) = P_{TX,b}(t) \left| h_{ȴ,b}^{*}(t)\, f_{b}(t) \right|^{2} \qquad (3.4)

where P_{TX,b} represents the PRB transmit power coming from BS b. The received SINR for the UE serviced in BS ȴ at time step t is then determined as follows:

\gamma^{ȴ}(t) = \frac{P_{TX,ȴ}(t) \left| h_{ȴ,ȴ}^{*}(t)\, f_{ȴ}(t) \right|^{2}}{\sigma_{n}^{2} + \sum_{b \neq ȴ} P_{TX,b}(t) \left| h_{ȴ,b}^{*}(t)\, f_{b}(t) \right|^{2}} \qquad (3.5)
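As a hedged numpy sketch of Eqs. (3.4) and (3.5), with randomly generated channel vectors used purely as placeholders:

```python
import numpy as np

def received_power(p_tx, h, f):
    """Eq. (3.4): received power = P_TX * |h^* f|^2 for one BS/beam pair."""
    return p_tx * np.abs(np.vdot(h, f)) ** 2   # vdot conjugates the first argument

def sinr(p_tx_serving, h_serving, f_serving, interferers, noise_var):
    """Eq. (3.5): SINR at the UE; `interferers` is a list of (p_tx, h, f) tuples."""
    signal = received_power(p_tx_serving, h_serving, f_serving)
    interference = sum(received_power(p, h, f) for p, h, f in interferers)
    return signal / (noise_var + interference)

# Placeholder M=8 channels and a unit-norm beam (illustrative values only).
M = 8
rng = np.random.default_rng(0)
h_s = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)
h_i = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)
f = np.ones(M) / np.sqrt(M)
print(10 * np.log10(sinr(1.0, h_s, f, [(1.0, h_i, f)], noise_var=1e-3)), "dB")
```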

3.2 Problem Formulation:
In order to increase the users' achievable sum rate, it is required to jointly optimize the beamforming vectors and the transmit powers at the L BSs.

The combined optimization problem for JBF, PC, and IC is expressed as:

\max_{P_{TX,j}(t),\, f_{j}(t),\, \forall j} \;\; \sum_{j \in \{1, 2, \ldots, L\}} \gamma^{j}(t)

subject to:

P_{TX,j}(t) \in \rho, \; \forall j
f_{j}(t) \in Ƒ, \; \forall j
\gamma^{j}(t) \geq \gamma_{\mathrm{target}} \qquad (3.6)

where γ_target indicates the target SINR of the DL transmission, and ρ and Ƒ are the set of transmit powers and the beamforming codebook, respectively.

Since the first two constraints are not convex, this problem is a non-convex optimization problem. In order to identify the most suitable P_{TX,ȴ} and f_ȴ for the UE it is serving at time t, the ȴth BS attempts to solve this problem. The optimal solution to this problem is found through an exhaustive search over this space (i.e., the brute-force "optimal" baseline).
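A hedged sketch of such an exhaustive (brute-force) search is shown below, assuming a small hypothetical set of candidate transmit powers and the beamsteering codebook of Eq. (3.2); it is an illustration of the search only, not the project's implementation.

```python
import itertools
import numpy as np

def evaluate_sinr(p_tx, f, h_serving, interference_plus_noise):
    """Single-cell view of Eq. (3.5) with the interference term held fixed (illustrative)."""
    return p_tx * np.abs(np.vdot(h_serving, f)) ** 2 / interference_plus_noise

# Candidate sets (hypothetical values): transmit powers rho and codebook columns F.
M = 8
rng = np.random.default_rng(1)
h_serving = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)
powers = np.array([0.25, 0.5, 1.0, 2.0])                      # rho (watts, illustrative)
thetas = np.arange(M) * np.pi / M
F = np.exp(1j * np.pi * np.outer(np.arange(M), np.cos(thetas))) / np.sqrt(M)

# Exhaustive search over every (power, beam) pair; optimal but costly as the space grows.
best = max(
    ((p, n, evaluate_sinr(p, F[:, n], h_serving, interference_plus_noise=1e-3))
     for p, n in itertools.product(powers, range(M))),
    key=lambda t: t[2],
)
print("best power:", best[0], "best beam index:", best[1], "SINR:", best[2])
```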

3.3 An Introduction on Deep Reinforcement Learning:


Reinforcement learning is a machine learning approach that allows an agent to learn what
action to perform in an interactive environment to maximize its predicted future reward [12].
The interaction between the agent and the environment is shown in Figure 3.1. DRL, in particular, takes advantage of deep neural networks' ability to learn better representations than hand-constructed features and to serve as a universal function approximator.

Figure 3. 1: The interaction of the agent as well as the environment in reinforcement learning

The basic elements of reinforcement learning include:

• Agent: the decision-making entity that interacts with the environment.


• Environment: the external system in which the agent operates.
• State: the current situation of the environment that the agent observes, the state 𝑠𝑠𝑡𝑡 ∈ 𝑆𝑆.
• Action: the decision made by the agent to influence the environment, the action 𝑎𝑎𝑡𝑡 ∈ 𝐴𝐴.
• Reward: the feedback signal 𝑟𝑟𝑠𝑠,𝑠𝑠′ ,𝑎𝑎 (𝑡𝑡, 𝑞𝑞) that the agent receives from the environment
after taking an action. The reward signal is determined after the agent performs action
a while in state 𝑠𝑠 at time step t and transfers to next state 𝑠𝑠′. The parameter 𝑞𝑞 ∈ {0,1} is
the bearer selector, which is a binary parameter used to differentiate between voice and
data bearers.
• Policy: π is the strategy used by the agent to determine the next action based on the
current state.
• State action value function: the state action value function under a given policy 𝜋𝜋 is
defined as 𝑄𝑄𝜋𝜋 (𝑠𝑠, 𝑎𝑎). It is the predicted discounted reward when beginning in state 𝑠𝑠 and
performing policy action 𝑎𝑎.

These elements interact, and their interaction is guided by the aim of optimizing the future
discounted reward for every action taken by the agent that causes the environment to change
state. The policy describes the agent's interaction with the state.

If Q_π(s, a) is updated at each time step t, it will converge to the optimum state-action value function Q_π*(s, a) as t → ∞ [11]. This, however, may not be easy to achieve. As a result, a function approximator is employed, in line with [13]. As shown in Figure 3.1, we create a neural network with weights Θ_t at time step t, define θ_t := Θ_t, and thereby obtain a function approximator Q_π(s, a, θ_t) ≈ Q_π*(s, a). This neural network-based function approximator is referred to as the Deep Q-Network (DQN) [13]. An essential component of neural networks is the activation function, a non-linear function that computes the hidden layer values. The sigmoid function σ : x → 1/(1 + e^{-x}) [14] is a popular option for the activation function. This DQN is trained by adjusting θ at each time step t in order to reduce the mean-squared error loss L_t(θ_t):

L_t(\theta_t) := \mathbb{E}_{s,a}\left[ \left( y_t - Q_\pi(s, a, \theta_t) \right)^{2} \right] \qquad (3.7)

where y_t := \mathbb{E}\left[ r_{s,s',a} + \gamma \max_{a'} Q_\pi(s', a', \theta_{t-1}) \mid s_t, a_t \right] represents the estimated target value at time step t when the current state and action are s and a, respectively.

"Online learning" refers to the process of interacting with the environment and the DQN to
generate a prediction, compare it to the real answer, and suffer a loss x1. In online learning,
UEs provide data to the serving BS, which then sends it to the central site for DQN training.
This information represents the current condition of our network environment 𝑆𝑆. In the DQN,
we set the dimension of the input layer to be equal to the number of states S. The output layer's
dimension is equal to the number of actions 𝐴𝐴. Also chose a minimal depth for the hidden layer
dimension since depth has the important effect on computational cost, where the dimension of
the width follows [15].

The weights θ_t in the DQN are modified after every iteration in time t using the stochastic gradient descent (SGD) method on a minibatch of data during the training phase of the DQN. SGD starts with a random initial value of θ and iteratively updates θ with a step size η > 0 as follows:

\theta_{t+1} = \theta_t - \eta \nabla L_t(\theta_t) \qquad (3.8)

"Experience replay" [32] improves DQN training. The experience replay buffer D stores the
experiences at each time step t. An experience et is defined as:

𝑒𝑒𝑡𝑡 = (𝑠𝑠𝑡𝑡 , 𝑎𝑎𝑡𝑡 , 𝑟𝑟𝑠𝑠,𝑠𝑠′ ,𝑎𝑎 (𝑡𝑡, 𝑞𝑞), 𝑠𝑠 ′ ) (3. 9)

We randomly select samples of experience from this buffer and execute minibatch training on the DQN. This method has the advantages of stability and of avoiding convergence to local minima [28]. Since the current parameters of the DQN differ from those used to create the samples in D, the use of off-policy learning techniques is also allowed. Therefore, the DQN estimates the state-action value function as:

Q_{\pi}^{*}(s_t, a_t) = \mathbb{E}_{s'}\left[ r_{s,s',a} + \gamma \max_{a'} Q_{\pi}^{*}(s', a') \mid s_t, a_t \right] \qquad (3.10)

In this case, γ : 0 < γ < 1 is the discount factor that defines the significance of the predicted future rewards, s′ is the next state, and a′ is the following action. Using the DQN, we aim to discover a solution that maximizes the state-action value function Q_π*(s_t, a_t).
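A hedged TensorFlow sketch of this training loop, combining the replay buffer of Eq. (3.9), the target and loss of Eq. (3.7), and the SGD step of Eq. (3.8), is shown below; the dimensions, hyperparameters, and random transitions are placeholder assumptions, and for simplicity the sketch reuses the current network to form the target rather than a separate θ_{t-1} copy.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS, GAMMA = 4, 8, 0.99

q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation="sigmoid", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(NUM_ACTIONS),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)   # step size eta in Eq. (3.8)
replay_buffer = deque(maxlen=10000)                       # experience replay buffer D

def store(s, a, r, s_next):
    """Store one experience e_t = (s, a, r, s') as in Eq. (3.9)."""
    replay_buffer.append((s, a, r, s_next))

def train_step(batch_size=32):
    """One minibatch SGD update of theta toward the target y_t of Eq. (3.7)."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next = map(np.array, zip(*batch))
    # Target: r + gamma * max_a' Q(s', a'), estimated with the current network.
    q_next = q_network(s_next.astype("float32")).numpy()
    y = (r + GAMMA * q_next.max(axis=1)).astype("float32")
    with tf.GradientTape() as tape:
        q = q_network(s.astype("float32"))
        q_sa = tf.reduce_sum(q * tf.one_hot(a, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_sa))        # mean-squared error L_t
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))

# Fill the buffer with placeholder transitions and run one update.
for _ in range(64):
    store(np.random.rand(STATE_DIM), np.random.randint(NUM_ACTIONS),
          np.random.rand(), np.random.rand(STATE_DIM))
train_step()
```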

For policy selection, Q-learning is an off-policy reinforcement learning algorithm. An off-policy algorithm finds a near-optimal policy even when actions are selected based on an arbitrary exploratory policy [12]. As a result, we adopt a near-greedy action selection policy. This policy operates in two modes:

1. Exploration: at every time step t, the agent performs different actions at random in order to find an effective action a_t.
2. Exploitation: based on previous experience, the agent selects the action at time step t that optimizes the state-action value function Q_π(s, a, θ_t).

Under this policy, the agent explores with a probability of ε and exploits with a probability of 1 − ε, where ε : 0 < ε < 1 is a hyperparameter that adjusts the trade-off between exploration and exploitation. Due to this trade-off, this approach is also known as the ε-greedy action selection policy.

This strategy is known to have a linear regret in t (regret being the loss of one-time step's
opportunity) [16].
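A brief sketch of this ε-greedy selection is shown below, assuming q_network is the DQN sketched earlier; epsilon and the state vector are placeholders.

```python
import numpy as np

def epsilon_greedy_action(q_network, state, epsilon, num_actions):
    """Explore with probability epsilon, otherwise exploit the current Q-estimates."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)              # exploration: random action
    q_values = q_network(state.reshape(1, -1).astype("float32")).numpy()
    return int(np.argmax(q_values))                        # exploitation: greedy action

# Example call with a placeholder state vector and epsilon = 0.1:
# action = epsilon_greedy_action(q_network, np.random.rand(4), 0.1, 8)
```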

The UEs move at speed v at each time step t, and the agent executes a specific action a_t from its current state s_t. The agent receives the reward r_{s,s',a}(t, q) and advances to the next state s' = s_{t+1}. The span of time during which an interaction between the agent and the environment occurs is referred to as an episode.

Each episode lasts T time steps. An episode is considered to have converged if the desired aim was achieved within those T time steps.

The UE coordinates are extremely important in our DQN implementation. The network's
performance increases when UE coordinates are sent back to the network and utilized to make
intelligent decisions. Therefore, UE coordinates must be a component of the DRL state space.
3.4 Deep Reinforcement Learning in Voice PC and IC:
This section describes the voice power control and interference coordination reinforcement learning method, as well as the baseline solutions against which the proposed solution is assessed. First, we discuss the fixed power allocation technique, which is the industry standard today; then we build the suggested approach using tabular and deep Q-learning implementations. Finally, we compare these three methods.

3.4.1 FPA:

In a communication system, power allocation refers to the distribution of power across multiple
channels or subcarriers to maximize the overall system performance. Fixed power allocation is
a technique that involves assigning a fixed amount of power to each channel or subcarrier.

The main idea behind fixed power allocation is to allocate power uniformly across all channels
or subcarriers. This means that each channel or subcarrier is allocated the same amount of
power, regardless of its channel gain or noise level. This is a simple and easy-to-implement
technique that is widely used in many communication systems.

The equation for fixed power allocation is given by:

P_{TX,b}(t) = P_{BS}^{max} - 10\log_{10}\!\left(N_{PRB}\right) + 10\log_{10}\!\left(N_{PRB,b}(t)\right) \qquad (3.11)

Where:

• P_BS^max : the maximum transmit power of the BS.

• N_PRB : the total number of physical resource blocks (PRBs) in the BS.

• N_PRB,b : the number of PRBs accessible to the UE in the b-th BS.

For example, if there are 10 subcarriers and the total power available is 100 watts, then each
subcarrier will be allocated 10 watts of power.

The advantage of fixed power allocation is that it is easy to implement and does not require
any feedback or channel state information. However, it may not always be the most efficient
technique, as some channels or subcarriers may require more power than others to achieve the
desired performance. In such cases, dynamic power allocation techniques may be more
appropriate.
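As a worked illustration of (3.11), the following is a minimal Python sketch of fixed power allocation in the dB domain; the numeric values in the example are illustrative only and are not taken from the project's configuration.

import math

def fpa_tx_power_dbm(p_bs_max_dbm, n_prb_total, n_prb_ue):
    """Fixed power allocation per (3.11): split the BS power evenly across all PRBs,
    then scale by the number of PRBs granted to the UE."""
    return p_bs_max_dbm - 10 * math.log10(n_prb_total) + 10 * math.log10(n_prb_ue)

# Example: 46 dBm (~40 W) BS power, 100 PRBs in total, 10 PRBs granted to the UE
print(fpa_tx_power_dbm(46.0, 100, 10))   # -> 36 dBm allocated to this UE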

The BS fixes its transmit power in this standard algorithm and only modifies the modulation
and coding schemes of the transmission. This is referred to as "link adaptation." Link adaptation
depends on the reports provided back to the BS by the UE (i.e., the SINR and received
power). Since the BS transmit power is fixed, the connection is adapted depending on either
periodic or aperiodic measurement feedback from the voice UE to the serving BS [17].

3.4.2 Tabular RL:

Tabular reinforcement learning is a type of reinforcement learning algorithm that works with
discrete state and action spaces. In this approach, the agent maintains a table of values that
represent the expected rewards for each possible action in each possible state.

The agent uses this table to select the action with the highest expected reward in each state,
based on a policy that is either deterministic or stochastic. The agent then updates the table of
values based on the observed rewards and transitions to a new state, using a learning rule that
is based on the principle of temporal difference (TD) learning.

The update equation for tabular reinforcement learning is given by:

Q_{\pi}(s_t, a_t) = (1 - \alpha)\, Q_{\pi}(s_t, a_t) + \alpha \left( r_{s,s',a} + \gamma \max_{a'} Q_{\pi}(s', a') \right) \qquad (3.12)

Where:

• Q_π(s_t, a_t) : the state-action value of action a in state s.
• α : the learning rate.
• r_{s,s',a} : the observed reward.
• γ : the discount factor.
• s' : the next state.
• a' : the action selected in s'.

This equation updates the expected reward for the current state-action pair based on the
difference between the observed reward and the expected reward for the next state-action pair,
discounted by 𝛾𝛾. The learning rate controls the rate at which the table of values is updated,
while the discount factor controls the importance of future rewards relative to immediate
rewards [9].
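A minimal Python sketch of the tabular update in (3.12) follows; encoding states and actions as integer indices, and the sizes chosen for the example table, are assumptions made only for illustration.

import numpy as np

def q_update(Q, s, a, r, s_prime, alpha=0.1, gamma=0.995):
    """Tabular TD update per (3.12): blend the old estimate with the bootstrapped target."""
    target = r + gamma * np.max(Q[s_prime])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Example with 8 discrete states and 16 actions (sizes borrowed from Table 4.1)
Q = np.zeros((8, 16))
Q = q_update(Q, s=0, a=3, r=1.0, s_prime=1)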

3.4.3 Deep Reinforcement Learning Approach:

DRL is the proposed algorithm that handles power control and interference coordination
without requiring the UE to provide explicit power control or interference coordination
commands. Depending on the number of states and the depth of the DQN, using a DQN can
have a lower computational cost than tabular Q-learning [18]. The fundamental DRL steps are
as follows:

• Run an optimization operation for time step t.

• Select a joint beamforming, power control, and interference coordination action.

• Examine the effect on the effective SINR γ_eff^ȴ(t).

• Reward the agent depending on its effect on γ_eff^ȴ(t) and its distance to γ_target
or γ_min.

• Train the DQN according to the outcomes.

The power control for the serving BS b is explained as follows:

P_{TX,b}(t) = \min\!\left(P_{BS}^{max},\; P_{TX,b}(t-1) + PC_b(t)\right) \qquad (3.13)

We add a new requirement for interference coordination on the interfering BS-b as follows:

P_{TX,ȴ}(t) = \min\!\left(P_{BS}^{max},\; P_{TX,ȴ}(t-1) + IC(t)\right) \qquad (3.14)

where the role of a BS (serving vs. interfering) may change based on the UE
being served. The IC and PC instructions are identical, but the BS role distinguishes
one as an interferer (requiring coordination) and the other as a server (requiring power control).
As illustrated in the proposed algorithm, the PCIC technique is simulated using deep Q-learning.
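A minimal Python sketch of the clipped power updates in (3.13) and (3.14); the dB bookkeeping and the parameter names are illustrative assumptions, not values from the project.

def update_tx_power_db(p_prev_db, command_db, p_max_db):
    """Apply a PC (serving BS) or IC (interfering BS) command in dB,
    clipped at the maximum BS power as in (3.13) and (3.14)."""
    return min(p_max_db, p_prev_db + command_db)

# Example: serving BS raises its power by 1 dB, interfering BS lowers its power by 3 dB
p_serving = update_tx_power_db(p_prev_db=40.0, command_db=+1.0, p_max_db=46.0)
p_interf  = update_tx_power_db(p_prev_db=42.0, command_db=-3.0, p_max_db=46.0)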

The algorithm uses a methodology that takes into account various factors in
optimizing the performance of voice communication systems. Specifically, the approach
employs three different voice algorithms, each of which utilizes an effective SINR γ_eff^ȴ(t)
that considers coding gain. Additionally, the adaptive code rate β is determined based on
another measure of SINR, γ^ȴ(t), ensuring that the system is always operating at optimal
capacity.

To ensure that voice communication is reliable and high quality, the algorithm employs an
adaptive multi-rate (AMR) codec with quadrature phase shift keying (QPSK) modulation.

QPSK is an efficient modulation scheme that is well suited for voice communication since
voice carriers typically do not require high data speeds. This choice of modulation allows for
the use of a narrow bandwidth, which is desirable in a voice communication system.
The quantity that is optimized in this approach is the effective SINR γ_eff^ȴ(t). By optimizing
this parameter, the algorithm is able to ensure that voice communication is always reliable and
of high quality. This approach is particularly effective in scenarios where the quality of voice
communication is of utmost importance, such as in emergency response systems or military
operations. Overall, the proposed approach is an effective way to optimize voice
communication systems and ensure high-quality communication.

The efficiency and run-time complexity of an algorithm are crucial factors to consider when
designing a communication system. The algorithm utilizes two different techniques, namely
FPA and tabular Q-learning PCIC, to optimize voice communication. The FPA has a time
complexity of O(1), which means that it takes a constant amount of time to run regardless of
the input size. On the other hand, the tabular Q-learning PCIC has a higher time complexity
that grows with the sizes of the state and action sets [19]. Here, S and A are the state and action
sets for voice bearers, respectively, and their sizes can impact the overall efficiency of the algorithm.

In addition to optimizing the voice communication performance, the proposed algorithm also
takes into account the overhead due to transmission over the backhaul. In particular, one of the
L BSs in the algorithm acts as a central hub for the surrounding BSs. As a result, the overhead
due to transmission to this central location for the N_UE UEs in the service area has a time
complexity of O(g·L·N_UE), where g is the number of measurements sent by any given UE during
a particular time step t [20]. Minimizing this overhead is crucial to improve the overall
efficiency of the communication system, and the proposed algorithm is designed to achieve
this goal.

Overall, the proposed algorithm is a comprehensive approach that considers various factors in
optimizing the performance of voice communication systems. By leveraging efficient
techniques like FPA and tabular Q-learning PCIC, and minimizing the overhead due to
transmission over the backhaul, the algorithm is able to provide reliable and high-quality voice
communication services. These features are particularly important in scenarios where voice
communication is critical, such as in emergency response systems or military operations.

The proposed algorithm is designed to address a specific problem in communication systems,


which is formulated mathematically as equation (3.6). Specifically, the algorithm is able to
solve equation (3.6) and optimize the performance of voice communication systems in various
scenarios.

Figure 3. 2: Downlink joint beamforming, power control, and interference coordination module
(block diagram: the joint module takes the target SINR γ_target and the reported SINR γ^ȴ(t),
and issues the power control command PC_ȴ(t) and beamforming index f^ȴ(t) toward the serving
base station ȴ, and the interference coordination command IC_b(t) toward the interfering base
station b).

3.5 Deep Reinforcement Learning in mmWave Beamforming, PC and IC:
In this particular section an in-depth description of the proposed methods is provided, along
with an evaluation of their efficacy in improving the system's performance. The suggested
methods aim to address the issues arising from UE movement and optimize the system's
performance using a reinforcement learning (RL) based algorithm. The changes in the SINR
serve as a metric for assessing the effectiveness of the proposed methods. SINR, being a crucial
factor in evaluating the quality of service provided by a communication system, allows for
accurate evaluation of the performance of the proposed methods and identification of any areas
that require further improvement. This evaluation involves analyzing the impact of UE
movement and RL-based algorithm optimization operations on SINR.

3.5.1 JB-PCIC Algorithm:

The algorithm in this research study is a DRL based approach that is specifically designed to
optimize the beamforming vectors and transmit powers at the base stations. The objective of
this algorithm is to maximize the function outlined in equation (3.6) in an efficient and effective
manner. By using a command register that utilizes a string of bits, multiple operations can be
conducted concurrently, which improves the speed and overall performance of the system. The
DRL algorithm used in this research study is designed to operate in real-time and can efficiently

allocate the necessary resources required for optimal system performance. The proposed
algorithm's efficiency is demonstrated through extensive simulations, which show that it
outperforms existing methods in terms of convergence speed and overall system performance.

The process of beamforming, which is critical for improving the performance of wireless
communication systems, involves the selection of a beamforming vector. This vector directs
the transmission of radio waves from the transmitter to the receiver, with the objective of
achieving the maximum possible signal strength at the receiver. The selection of the
beamforming vector (𝑓𝑓𝑛𝑛 ) is a crucial step in the beamforming process, and several methods
can be used for this purpose. One of the commonly used methods is the circular increment or
decrement approach, where the agent progresses through the beamforming codebook in
circular increments (n + 1) or decrements (n -1). This approach is effective in selecting the
optimal beamforming vector, as it ensures that all possible vectors in the codebook are
considered. Once the optimal beamforming vector is selected, it is used to transmit data from
the transmitter to the receiver, resulting in improved system performance.

f_n(t) : \; n = n \pm 1 \qquad (3.15)
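The circular stepping through the codebook in (3.15) can be sketched as follows in Python; the modulo wrap-around at the codebook edges is an assumption implied by the word "circular" above, and the codebook size used in the example is illustrative.

def step_codebook_index(n, direction, codebook_size):
    """Move to the next or previous beamforming vector index, wrapping circularly."""
    assert direction in (+1, -1)
    return (n + direction) % codebook_size    # n <- n +/- 1 (mod |F|)

# Example: stepping up from the last index of a size-8 codebook wraps back to 0
print(step_codebook_index(7, +1, 8))   # -> 0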

The transmission from BSs b and ȴ to the mobile user is considered, with γ^ȴ denoting the SINR
received due to BS ȴ. A record is kept of how γ^ȴ varies as the beamforming vector is adjusted.
In estimating γ_eff^ȴ for the data bearers, γ_eff^ȴ = γ^ȴ, i.e., a coding gain of unity is applied.

When the beamforming vector for a particular UE is selected, the agent additionally performs
power control of that beam by adjusting the transmit power of the BS to this UE (or the
interference coordination of other BSs). Equations (3.13) and (3.14), which define the set P,
control the selection of the transmit power. These equations provide a range of feasible power
levels that can be transmitted to the UE. The agent then chooses a transmit power level from
this set that is both feasible and provides the best system performance. This process of selecting
the optimal transmit power level for each beamforming vector is repeated for each UE in the
network, ensuring that the system operates at its maximum efficiency.

The algorithm in this study has demonstrated significantly faster run times for deep
reinforcement learning compared to the upper bound algorithm across all antenna sizes M.
Additionally, reporting the user equipment (UE) coordinates, including the longitude and
latitude, to the base station (BS) instead of the channel state information has reduced the
reporting overhead from M complex-valued elements to only two real-valued coordinates and
its received signal-to-interference-plus-noise ratio (SINR). Assuming that reporting M
complex-valued elements results in an overhead of 2M, reporting the UE coordinates provides
a reduction gain of 1-1/M in overhead.

The algorithm is referred to as the JB-PCIC (joint beamforming, power control, and
interference coordination) algorithm.

3.5.2 Brute Force:

The baseline methodology, referred to as "Brute Force Beamforming and PCIC," utilizes an
exhaustive search within the space P × Ƒ to optimize the SINR for each base station. This
technique provides the upper performance limit for joint SINR optimization. The size of the
power set P is independent of the number of antennas M in the uniform linear array (ULA),
while the size of Ƒ scales linearly with M. While the brute force approach works effectively for
low values of M and a limited number of base stations L, it becomes problematic when M is
large due to the algorithm's runtime complexity, which is O((|P||Ƒ|)^L) = O(M^L). Therefore,
the search time becomes a significant concern for large M. In contrast, the proposed
algorithm's runtime complexity is much lower than O(M^L), making it more efficient when M
is large and search time is a concern.
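A minimal Python sketch of the exhaustive search over P × Ƒ for L = 2 base stations; the evaluate_sinr function is a placeholder assumption standing in for the channel model, which is not shown in the text.

from itertools import product

def brute_force_pcic(power_set, codebook, evaluate_sinr):
    """Try every (power, beam) pair for both BSs and keep the best joint SINR.
    Complexity is O((|P||F|)^L), with L = 2 in this sketch."""
    best, best_choice = float("-inf"), None
    for p_b, f_b, p_l, f_l in product(power_set, codebook, power_set, codebook):
        sinr = evaluate_sinr(p_b, f_b, p_l, f_l)   # placeholder channel/SINR model
        if sinr > best:
            best, best_choice = sinr, (p_b, f_b, p_l, f_l)
    return best, best_choice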

CHAPTER 4

SIMULATION RESULTS AND DISCUSSIONS
4.1 Performance Measures:
In the following section, the primary objective is to delve into the various performance metrics
that have been implemented to evaluate and compare the effectiveness of the algorithms.
Thorough examination will be conducted on the benchmarks that have been established to
measure the efficiency, accuracy, and reliability of the methods. By utilizing these performance
metrics, the performance of the algorithms can be accurately assessed and areas where
improvements can be made can be identified.

4.1.1 Convergence:

The convergence metric, denoted by ζ, can be described as the episode or time step at which the
intended SINR is achieved for all UEs in the network for the entire duration T. As the number
of antennas M in the ULA increases, the convergence time ζ is likely to increase as well. This is
because with more antennas the system can achieve higher SINR levels, but it may also take
more time for all UEs to reach the desired level.

It should be noted that convergence as a function of M is not a significant concern in voice


applications, since only single antennas are used. However, in other applications such as data
transmission, where high data rates are essential, multiple antennas may be used to improve
the SINR. Therefore, it is crucial to measure the convergence episode for various M values to
determine the optimal number of antennas required for the desired performance.

To obtain a more comprehensive understanding of the convergence behavior, consider the


aggregated percentile convergence episode for many random seeds. This approach can provide
insight into the statistical distribution of convergence episodes and help identify potential
outliers that could affect the overall system performance. By carefully analyzing the
convergence metric, it's possible to evaluate the effectiveness of the algorithms and optimize
the system design for enhanced performance.

4.1.2 Coverage:

To gain a better understanding of the behavior of the effective SINR γ_eff^ȴ, utilizing a
complementary cumulative distribution function (CCDF) approach is possible [21]. This
involves repeating the simulation multiple times while varying the random seed used to drop
users into the network. By altering the random seed, the placement of users within the network

can be adjusted to observe how this impacts the convergence episode.
The CCDF of γ_eff^ȴ can provide us with valuable information about the statistical distribution
of convergence episodes. It enables us to identify the probability of convergence occurring at
a particular time step or episode, given that it has not occurred up until that point. The CCDF
approach can be thought of as a complement to the cumulative distribution function (CDF),
which provides information on the probability of convergence occurring at or before a specific
time step.
By generating the CCDF of γ_eff^ȴ through multiple simulations, insight into how the
metric varies across different scenarios can be gained, and any potential outliers or unusual
behavior can be identified. This can help fine-tune the algorithms and system design for
improved performance and reliability.

In essence, the CCDF approach provides a complementary view of the distribution of
convergence episodes. By repeating the simulation with different random seeds and observing
how users are dropped into the network, a more comprehensive understanding of the metric and
its behavior under varying conditions can be obtained.

4.1.3 Sum Rate capacity:

The average sum rate capacity is computed by utilizing the effective SINRs as follows:

C = \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in \{ȴ,\, b\}} \log_2\!\left(1 + \gamma_{eff}^{\,j}(t)\right) \qquad (4.1)

The data rate served by the network is a crucial performance metric that indicates the amount
of data that can be transmitted over the network within a given period. To evaluate the
network's data rate, the algorithm computes the maximum sum-rate capacity resulting from
computing equation (4.1) over several episodes. This metric provides an accurate estimation of
the network's total data-carrying capacity under varying network conditions and interference
scenarios.
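Equation (4.1) can be evaluated directly from the logged per-time-step effective SINRs, as in the following minimal numpy sketch; the array shape and values are illustrative assumptions.

import numpy as np

def sum_rate_capacity(sinr_eff_linear):
    """Average sum-rate per (4.1): sinr_eff_linear has shape (T, num_BSs),
    with effective SINR given in linear scale (not dB)."""
    return np.mean(np.sum(np.log2(1.0 + sinr_eff_linear), axis=1))

# Example: T = 4 time steps, two BSs (the serving BS and BS b)
sinr = np.array([[3.0, 1.0], [4.0, 2.0], [5.0, 1.5], [6.0, 2.5]])
print(sum_rate_capacity(sinr))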

It should be noted that the maximum sum-rate capacity is a function of several variables,
including the number of users, the available bandwidth, and the SNR of the system. By
computing this metric over many episodes, a statistical representation of the network's
performance can be obtained and areas where improvements can be made can be identified.

Furthermore, the maximum sum-rate capacity can be utilized to optimize network resource
allocation and improve the Quality of Service (QoS) for all users. By maximizing the
network's data rate capacity, the system can be configured to handle large volumes of data
traffic efficiently while maintaining high throughput and low latency.

Computing the maximum sum-rate capacity over many episodes yields a reliable measure of
the network's data rate capacity and identifies opportunities for performance optimization. This
metric is a critical performance indicator that can help network operators make informed
decisions about resource allocation and improve the overall network performance.

4.2 Simulation and Results:


This section is dedicated to evaluating the performance of the proposed solutions, which are
based on reinforcement learning (RL) techniques. The aim is to assess the effectiveness of these
solutions in terms of the performance measures that were introduced and discussed in the
previous section.

In this project, Python programming language is used for the implementation of the deep
learning models in the field of communication engineering. With its rich set of libraries and
frameworks, Python provides us with a powerful toolset to train and evaluate the models
efficiently. Popular deep learning libraries such as TensorFlow and Keras are utilized to build
and train neural networks for tasks such as signal classification, speech recognition, and image
segmentation. Additionally, Python allows us to easily preprocess and manipulate large
datasets of complex data, enabling us to extract meaningful insights from the data. By
leveraging Python's flexibility and versatility, accurate and reliable results can be achieved in
the deep learning project.
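For context, the following is a minimal Keras sketch of a Q-network consistent with the hyperparameters later listed in Table 4.1 (8 state inputs, 16 actions, two hidden layers of width H = 24); it is an illustrative reconstruction under those assumptions, not the project's exact model, and the learning rate is an assumed value.

from tensorflow import keras

def build_dqn(num_states=8, num_actions=16, width=24, depth=2, lr=0.001):
    """Small fully connected Q-network: state vector in, one Q-value per action out."""
    model = keras.Sequential()
    model.add(keras.layers.Dense(width, activation="relu", input_shape=(num_states,)))
    for _ in range(depth - 1):
        model.add(keras.layers.Dense(width, activation="relu"))
    model.add(keras.layers.Dense(num_actions, activation="linear"))
    # Mean-squared error against the TD targets of (3.10); lr is an assumed value
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

model = build_dqn()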

Here is a summary of what will be discussed in this chapter:

• Firstly, in this study, the setup configuration of the simulation was adopted to ensure
that the results are reliable and replicable. The simulation setup involves defining the
parameters and variables that are relevant to the problem under investigation. Careful
selection of simulation parameters, such as the number of nodes, transmission power,
and data rate, was made to represent realistic scenarios in the field of communication
engineering. Multiple iterations of the simulation were also conducted to ensure the
stability and consistency of the results. By adopting a well-defined and carefully
selected simulation setup, the performance of the proposed algorithm can be
confidently analyzed, and meaningful conclusions can be drawn. Furthermore, the
results can be compared with other studies that adopt similar simulation setups to
evaluate the generalizability of the findings. Overall, the adoption of a standardized
simulation setup is crucial for ensuring the validity and reliability of the results and
advancing the state-of-the-art in the field of communication engineering.

• Secondly, in this demonstration, standard and realistic simulation parameters for voice
and data bearers will be showcased. The simulation results consist of five figures that
provide valuable insights into the performance of various techniques in the field of
voice communication. One of these figures represents a comparison between two
commonly used techniques in the field. Four additional figures analyze various aspects
of the performance of the proposed algorithm compared to a standard technique, as a function
of the number of antennas M. The first figure presents the CCDF of γ_eff, a parameter that characterizes
the signal quality, for both the proposed algorithm and the standard technique. The
second figure presents the normalized convergence time for both techniques. The third
figure presents the achievable SINR and normalized transmit power for both the
optimal and JB-PCIC algorithms as a function of "M". Finally, the fourth figure
presents the sum-rate capacity of the convergence episode, also as a function of "M".
These figures provide a comprehensive overview of the performance of the proposed
algorithm and enable us to identify areas for improvement and future research.

• Thirdly, to demonstrate the strength of the proposed algorithm and its ability to adapt
to changing environments, the parameters for the project will be changed and observed
how the algorithm performs under these new conditions. By adjusting the parameters,
such as the speed of UEs and the number of UEs in the algorithm, the algorithm's ability
to find the best solution can be evaluated even when the environment states are
changing. Introducing new datasets that vary in complexity and size, as well as new
tasks that require different levels of accuracy and speed, will allow for observation of
how the algorithm performs under these different scenarios and determine its overall
robustness and adaptability. Through these experiments, the aim is to demonstrate the

strength of the proposed algorithm and its ability to provide optimal solutions in diverse
and dynamic environments.

• Fourth, discussing the changes in the results figures of the simulation, the high
convergence similarity between the JB-PCIC algorithm and the optimal result is
demonstrated under the changes in environment states. The simulation involved
running multiple experiments with different configurations to analyze the
performance of the proposed algorithm. The results figures for each experiment are
presented, and the observed changes are discussed. A comparison is made between
these changes and the hypothesis, explaining the underlying factors that contributed to
them. Statistical analyses are also presented to support the findings and evaluate the
significance of the observed changes. Overall, the discussion of the changes in the
results figures provides valuable insights into the behavior of the proposed algorithm
and its performance in different scenarios. It helps identify areas for further
improvement and guides towards the most effective solutions for the problem at hand.

Figure 4. 1: Simulation flowchart for data and voice bearers. (Voice-bearer branch: set the
environment parameters, run the fixed power allocation, tabular, and JB-PCIC algorithms, save
the generated data in Figure1.txt, Figure2.txt, and Figure3.txt, and plot the CCDF of SINR_eff
for the three voice algorithms. Data-bearer branch: for M = [4, 8, 16, 32, 64], run the JB-PCIC
algorithm and the optimal brute-force method, save the generated data in Figures_M.txt and
Figures_Optimal_M.txt, then plot the CCDF of the JB-PCIC algorithm, the normalized
convergence time, the achievable SINR and normalized transmit power for the brute force and
JB-PCIC algorithms, and the sum-rate capacity of the convergence episode, all as a function of M.)

4.2.1 Setup Configurations:

In the urban cellular environment, users are evenly dispersed across the network region. The
UEs move at speed v, while experiencing both log-normal shadow fading and small-scale
fading. The radius of the cell is r, and the inter-site distance is R = 1.5r.

The effective SINR targets were configured as follows:

• γ_target^voice = 3 dB.
• γ_target^data = γ_0^data + 10 log10(M) dB.     (4.2)

where γ_0^data is a constant threshold set to γ_0^data = 5 dB. Moreover, the minimum
SINR (SINR_min) is set to −3 dB. Table 4.1 shows the RL hyperparameters.
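A small Python sketch of the target thresholds in (4.2), using the values stated above (3 dB voice target, γ_0^data = 5 dB, SINR_min = −3 dB); it simply evaluates the formula for the antenna counts used later.

import math

GAMMA_TARGET_VOICE_DB = 3.0
GAMMA_0_DATA_DB = 5.0
SINR_MIN_DB = -3.0

def gamma_target_data_db(M):
    """Data-bearer target SINR per (4.2): constant threshold plus the beamforming gain 10*log10(M)."""
    return GAMMA_0_DATA_DB + 10 * math.log10(M)

print([round(gamma_target_data_db(M), 1) for M in (4, 8, 16, 32, 64)])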

Table 4. 1: Reinforcement Learning Hyperparameter

Parameter                                           Value
Discount factor (γ)                                 0.995
Initial exploration rate (ε)                        1.00
Number of states (S)                                8
Deep Q-Network width (H)                            24
Exploration rate decay (d)                          0.9995
Minimum exploration rate for voice (ε_min^voice)    0.15
Minimum exploration rate for data (ε_min^data)      0.10
Number of actions (A)                               16
Deep Q-Network depth                                2

The simulation state S is set up as shown in the following Table 4.2.

Table 4. 2: Simulation State S

State            Environment Value
s_t^0, s_t^1     UE_ȴ (x(t), y(t))
s_t^2, s_t^3     UE_b (x(t), y(t))
s_t^4            P_TX,ȴ(t)
s_t^5            P_TX,b(t)
s_t^6            f_n^ȴ(t)
s_t^7            f_n^b(t)

Where x, y are the Cartesian coordinates (longitude and latitude) of the given UE.

The sets Ƒ and P are both chosen to have a cardinality that is a power of two, which determines
the actions A. This allows a register to be used to create the binary encoding of the actions,
as observed in Figure 4.2. Since the combined beamforming, power control, and interference
coordination instructions can be obtained using bitwise AND, masks, and shifting, the encoding
shown in Table 4.3 and Table 4.4 was decided upon.

• q = 0 (voice bearer): the action register a_t ∈ A contains the fields [ IC_ȴ(t) | PC_b(t) ].

• q = 1 (data bearer): the action register a_t ∈ A contains the fields
[ f_n^b(t) | f_n^ȴ(t) | IC_ȴ(t) | PC_b(t) ].

Figure 4. 2: Binary encoding of actions for beamforming, power control, and interference
coordination in different bearer types
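A minimal Python sketch of decoding the action register with bitwise masks and shifts, following the field layout sketched above and the code columns of Tables 4.3 and 4.4; the exact bit ordering within the register is an assumption made for illustration.

def decode_action(a, q):
    """Split the binary action register into its fields.
    Voice (q = 0): bits [1:0] -> PC_b code, bits [3:2] -> IC_l code (cf. Table 4.3).
    Data  (q = 1): bit 0 -> PC_b, bit 1 -> IC_l, bit 2 -> beam step of BS l,
                   bit 3 -> beam step of BS b (cf. Table 4.4).
    The bit ordering here is assumed, not taken from the project's code."""
    if q == 0:
        pc_b = a & 0b11            # two-bit power-control code for serving BS b
        ic_l = (a >> 2) & 0b11     # two-bit interference-coordination code for BS l
        return pc_b, ic_l
    pc_b   = a & 0b1
    ic_l   = (a >> 1) & 0b1
    beam_l = (a >> 2) & 0b1
    beam_b = (a >> 3) & 0b1
    return pc_b, ic_l, beam_l, beam_b

print(decode_action(0b1101, q=1))   # -> (1, 0, 1, 1)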

Table 4. 3: The Power Control and Interference Coordination Commands for Voice Bearer

Voice Bearer (q = 0)

Action      Code    Task
a[0,1]      00      Decrease the BS b transmit power by 3 dB
a[0,1]      01      Decrease the BS b transmit power by 1 dB
a[0,1]      10      Increase the BS b transmit power by 1 dB
a[0,1]      11      Increase the BS b transmit power by 3 dB
a[2,3]      00      Decrease the BS ȴ transmit power by 3 dB
a[2,3]      01      Decrease the BS ȴ transmit power by 1 dB
a[2,3]      10      Increase the BS ȴ transmit power by 1 dB
a[2,3]      11      Increase the BS ȴ transmit power by 3 dB

Table 4. 4: The Joint Beamforming, Power Control, and Interference Coordination Commands
for Data Bearer

Data Bearer (q = 1)

Action    Code    Task
a[0]      0       Decrease the BS b transmit power by 1 dB
a[0]      1       Increase the BS b transmit power by 1 dB
a[1]      0       Decrease the BS ȴ transmit power by 1 dB
a[1]      1       Increase the BS ȴ transmit power by 1 dB
a[2]      0       Step down the beamforming codebook index of BS ȴ
a[2]      1       Step up the beamforming codebook index of BS ȴ
a[3]      0       Step down the beamforming codebook index of BS b
a[3]      1       Step up the beamforming codebook index of BS b

Conclusively, it can be inferred that the set of transmit power offsets is P = {±1, ±3} dB.
The selection of these values is justified by:

• conforming to industry standards [22], which specify integers for power increments.
• preserving the problem formulation's non-convexity by keeping the restrictions
discrete.

The actions to increase and reduce the BS transmit powers are carried out as shown in (3.13)
and (3.14), respectively.

The 3 dB power steps are provided exclusively for voice to compensate for the lack of
beamforming, aligning with industry requirements for packetized voice carriers [22].

The JB-PCIC algorithms utilize a two-tiered reward system. The first tier considers the
significance of the executed action, while the second tier evaluates whether the desired SINR
was achieved or if it fell below the minimum threshold. To facilitate this, a function p(·) is
introduced, which maps the provided code to a specific element of P: p(00) = −3, p(01) = −1,
p(10) = 1, and p(11) = 3.

Subsequently, the received SINRs resulting from the encoded actions are determined, denoted
γ^b_{a[0],a[3]} and γ^ȴ_{a[1],a[2]}, corresponding to BSs b and ȴ, respectively.

The combined reward for both voice and data bearers are written as follows:

r_{s,s',a}(t, q) = \big[\, p\!\left(a_{[0,1]}(t)\right) - p\!\left(a_{[2,3]}(t)\right) \big](1 - q) + \big[\, \gamma^{b}_{a_{[0]},a_{[3]}} + \gamma^{ȴ}_{a_{[1]},a_{[2]}} \big]\, q \qquad (4.3)

Where:

• γ^b_{a[0],a[3]} : the received SINR from BS b.

• γ^ȴ_{a[1],a[2]} : the received SINR from BS ȴ.

• q : equal to 0 for voice bearers and 1 for data bearers.
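A minimal Python sketch of the reward in (4.3), using the p(·) mapping defined above; the SINR arguments and the example action codes are placeholders for illustration.

# Mapping from two-bit power codes to dB offsets, as defined for p(.)
P_CODE = {0b00: -3, 0b01: -1, 0b10: 1, 0b11: 3}

def reward(a01_code, a23_code, sinr_b_db, sinr_l_db, q):
    """Combined reward per (4.3): a power-offset term for voice bearers (q = 0)
    and a received-SINR term for data bearers (q = 1)."""
    voice_term = (P_CODE[a01_code] - P_CODE[a23_code]) * (1 - q)
    data_term = (sinr_b_db + sinr_l_db) * q
    return voice_term + data_term

# Example voice-bearer step: +3 dB on serving BS b, -3 dB on interfering BS
print(reward(0b11, 0b00, sinr_b_db=0.0, sinr_l_db=0.0, q=0))   # -> 6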

The largest reward per time step t is provided to the agent when a data bearer performs joint
power control and beamforming and a voice bearer performs power control and interference
coordination.

If any of the constraints in the problem formulation becomes inactive, the agent terminates the
episode.

At this instance, the RL agent is rewarded with r_{s,s',a}(t, q). According to the proposed
algorithm, either a penalty r_min or a maximum reward r_max is applied, depending on whether
the minimum γ_min was violated or γ_target was achieved.

A minibatch sample size of N_mb = 32 training instances is employed in the simulation.
H = √((A + 2) · N_mb) = 24 can be used to determine the width of the DQN, where A = 16.

The following Table 4.5 lists the basic radio environment parameters of the simulation.

Table 4. 5: Radio Environment Parameters

Parameter                                                    Value
BS maximum transmit power (P_TX,BS^max)                      40 watts
Antenna gain (G_TX)                                          11 dB
Maximum number of UEs per BS (N_UE^max)                      10
Number of transmit antennas for voice bearer (M_voice)       1
Number of transmit antennas for data bearer (M_data)         {4, 8, 16, 32, 64}
Downlink frequency band for voice bearer                     2.1 GHz
Downlink frequency band for data bearer                      28 GHz
Cell radius for voice bearer (r_voice)                       350 m
Cell radius for data bearer (r_data)                         150 m
Inter-site distance for voice bearer (R_voice)               525 m
Inter-site distance for data bearer (R_data)                 225 m
Number of multipaths for voice bearer (N_p^voice)            15
Number of multipaths for data bearer (N_p^data)              4
UE movement speed (v)                                        2 km/h
Radio frame duration (T)                                     10 ms

4.2.2 JB-PCIC Algorithm Flowchart:

The flowchart of the proposed DRL algorithm is shown in Figure 4.3, and Table 4.6 lists the
input and output parameters of the proposed algorithm.

Table 4. 6: Input and Output Parameters

Input     The downlink received SINR measured by the UE

Output    Sequence of beamforming, power control, and interference coordination commands
          to solve the problem formulation

Figure 4. 3: Proposed algorithm flowchart

4.3 Simulation Analysis for JB-PCIC Algorithm:


This section presents the analysis of the proposed algorithm using Python code to run the
simulation, starting with the basic radio environment parameters and then enhancing the
algorithm to optimize the solution.

4.3.1 Simulation Analysis with The Basic Environment Parameters:

In this section, the results of the simulation are observed, focusing on the basic radio
environment parameters presented in Table 4.5 for data and voice bearers, along with a
corresponding discussion.

Figure 4. 4: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4.4 shows the probability of achieving γ_eff as the number of antennas M increases;
as M increases, the probability also increases, since γ_eff depends on the beamforming
antenna gain.

Figure 4. 5: The Normalized Convergence Time for The JB-PCIC Algorithm versus M.

Figure 4.5 demonstrates the convergence time required to achieve γ_eff as M changes. When M
is small, the impact of the constant threshold γ_0^data is dominant and the convergence time is
small. However, as M increases, the number of episodes required for convergence increases,
with minimal effect of γ_0^data, as shown in (4.2). This is due to the longer time required for
the agent to search through a grid of beams of size Ƒ. Moreover, this causes the agent to spend a
longer time meeting the target SINR.

Figure 4. 6: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M.

Figure 4.6 displays the relative performance of the JB-PCIC agent compared to the brute-force
"optimal" performance. The transmit power P_TX is nearly equal to the maximum, and the
achieved SINR is proportional to the antenna size M. It can be observed that the performance
gap in both the transmit power of the base stations and the SINR diminishes across all values of
M. This occurs due to the DQN's ability to estimate the function that determines the upper limit
of performance.

The JB-PCIC SINR curve shows a drop at M = 32, because the agent obtained a DNN function
that only approximately estimates the optimal SINR; with further learning iterations the agent
obtains a DNN function closer to the optimal.

Figure 4. 7: Sum-Rate Capacity of The Convergence Episode as a Function of M.

In Figure 4.7, the sum-rate capacity of both the agent and the upper limit of performance is
observed. The DNN function exhibits a drop at M = 32, which is remedied by changing the
stated parameters, as observed at a later stage.

Figure 4. 8: The CCDF of γ_eff^voice for three different voice algorithms

Figure 4.8 presents the coverage CCDF, i.e., the probability of achieving γ_eff, for the three
voice PCIC algorithms at the same episode. The FPA algorithm has the worst performance,
especially at the cell edge. The tabular technique with Q-learning has better performance than
the FPA algorithm. However, the JB-PCIC algorithm, which utilizes deep Q-learning, achieves
a higher reward than the tabular technique, because deep Q-learning converges to a better
solution than Q-learning. Furthermore, as the user approaches a BS, SINR_eff increases, so the
three voice algorithms obtain almost similar results.

4.3.2 Simulation Analysis and Results case 1:

All settings are configured as shown in Table 4.5, and the number of users per cell is increased
as shown in Table 4.7 to observe whether the JB-PCIC algorithm performs more accurately for
both data and voice bearers.

Table 4. 7: Radio Environment Parameters

Parameter      Value
N_UE^max       50

Figure 4. 9: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4. 10: The Normalized Convergence Time for The JB-PCIC Algorithm versus M.

Figure 4. 11: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M

As depicted in Figure 4.11, an improvement in the SINR plot is observed, where the SINR of
JB-PCIC closely follows the optimal solution.

Figure 4. 12: Sum-Rate Capacity of The Convergence Episode as a Function of The M.

Also, in Figure 4.12, the sum-rate capacity result of the JB-PCIC algorithm has improved.

Figure 4. 13: The CCDF of γ_eff^voice for three different voice algorithms

Hence, the improvement in this simulation case can be identified in Figures 4.11 and 4.12,
while the remaining figures show results similar to the previous case.

4.3.3 Simulation Analysis and Results case 2:

In this simulation, all parameters were adjusted as specified in Table 4.5. Additionally, the
movement speed of the UE was changed according to the values provided in Table 4.8 to
observe the improvement in results.

Table 4. 8: Radio Environment Parameters

Parameter      Value
v              8 km/h

Figure 4. 14: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4. 15: The Normalized Convergence Time for The JB-PCIC Algorithm versus M

Figure 4. 16: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M

Figure 4. 17: Sum-Rate Capacity of The Convergence Episode as a Function of The M

Figure 4. 18: The CCDF of γ_eff^voice for three different voice algorithms

As a result, this simulation yields a higher reward for the JB-PCIC algorithm, especially in
Figures 4.16 and 4.17, than the previous simulation, because the algorithm has learned and
optimized the solution. The remaining figures show results similar to the previous simulation.

4.3.4 Simulation Analysis and Results case 3:

In this scenario, all parameters were modified as indicated in Table 4.5, and the maximum
transmitted power was altered according to Table 4.9 to observe the effectiveness of the
outcomes.

Table 4. 9: Radio Environment Parameters

Parameter      Value
P_TX^max       60 watts

Figure 4. 19: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4. 20: The Normalized Convergence Time for The JB-PCIC Algorithm versus M

Figure 4. 21: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M

Figure 4. 22: Sum-Rate Capacity of The Convergence Episode as a Function of M

Figure 4. 23: The CCDF of γ_eff^voice for three different voice algorithms

As a result, changing P_TX^max leads to a degradation in results, as depicted in Figure 4.19,
especially when M is set to 64. The probability of SINR_eff is lower compared to the previous
simulation, as shown in Figure 4.20. Regarding Figure 4.21, the SINR of JB-PCIC decreases
when M is set to 64, as the increased P_TX requires the agent to optimize the DNN function for
a better solution. Similarly, Figure 4.22 follows the same trend, with no changes observed in
Figure 4.23.

4.3.5 Simulation Analysis and Results case 4:

The same parameter changes as in simulation analysis 3 are applied, and the number of users
per cell is increased as shown in Table 4.10.

Table 4. 10: Radio Environment Parameters

Parameters     Value
P_TX^max       60 watts
N_UE^max       50

Figure 4. 24: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4. 25: The Normalized Convergence Time for The JB-PCIC Algorithm versus M

Figure 4. 26: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M

Figure 4. 27: Sum-Rate Capacity of The Convergence Episode as a Function of The M

Figure 4. 28: The CCDF of γ_eff^voice for three different voice algorithms

As a result, the outcomes are the same as in simulation analysis 3, with minor deterioration in
Figures 4.26 and 4.27 at M = 32.

4.3.6 Simulation Analysis and Results case 5:

The antenna gain is increased relative to simulation analysis 4 to observe its effect. The
parameters are shown in Table 4.11.

Table 4. 11: Radio Environment Parameters

Parameter      Value
P_TX           60 watts
N_UE^max       50
G_TX           18 dBi

Figure 4. 29: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4. 30: The Normalized Convergence Time for The JB-PCIC Algorithm versus M

Figure 4. 31: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M

Figure 4. 32: Sum-Rate Capacity of The Convergence Episode as a Function of The M

Figure 4. 33: The CCDF of γ_eff^voice for three different voice algorithms

For the results of this simulation, the deterioration increased: the agent exhibits distortion when
M is set to 64 in Figure 4.29. In addition, Figure 4.31 shows a degradation in SINR and P_TX
due to the increase in G_TX, and the same applies to Figure 4.32. However, Figures 4.30 and
4.33 show the same results as in the previous simulations.

As a result, the agent must optimize the DNN function to track changes in states in order to
eliminate this deterioration.

4.3.7 Simulation Analysis and Results case 6:

In this scenario, the speed of user v is increased using the same parameters as in simulation
analysis 5, as shown in Table 4.12. The purpose is to optimize the DNN function of the
proposed JB-PCIC algorithm and observe the results.

Table 4. 12: Radio Environment Parameters

Parameter      Value
P_TX           60 watts
N_UE^max       50
G_TX           18 dBi
v              8 km/h

Figure 4. 34: CCDF of SINR_eff for the JB-PCIC algorithm as a function of M

Figure 4. 35: The Normalized Convergence Time for The JB-PCIC Algorithm versus M

Figure 4. 36: Achievable SINR and The Normalized Transmit Power for The Brute Force and
JB-PCIC Algorithms as a Function of M

Figure 4. 37: Sum-Rate Capacity of The Convergence Episode as a Function of The M

Figure 4. 38: The CCDF of γ_eff^voice for three different voice algorithms

In this section, the agent focuses on optimizing the deep neural network (DNN) function to
achieve the highest level of performance. The agent's goal is to ensure that the outcomes
clearly demonstrate superior performance without any degradation caused by changes in
states, such as variations in transmitted power and antenna gain. By carefully considering and
addressing these factors, the agent aims to enhance the overall efficiency and effectiveness of
the DNN, resulting in improved outcomes. Figure 4.34 illustrates an increasing probability of
SINR as M increases. Additionally, Figure 4.35 demonstrates a decrease in convergence time
when M is set to 32 compared to the other simulations. The results depicted in Figure 4.36 show
that the maximum power is attained across all M values, with similar SINR outcomes observed
for JB-PCIC and the optimal method. Notably, the sum-rate capacity for JB-PCIC in
Figure 4.37 remains consistent with the optimal solution. Similarly, Figure 4.38 yields results
comparable to the preceding simulations.

CHAPTER 5

CONCLUSION AND FUTURE WORK
5.1 Conclusions:
DRL for 5G networks addresses the challenges faced by 5G networks in optimizing
beamforming, power control and interference coordination. The study proposes a novel
approach utilizing DRL techniques to tackle these complex optimization problems.

The project highlights the importance of efficient resource allocation in 5G networks,
considering factors such as spectral efficiency, power consumption and interference
management. By leveraging DRL, it demonstrates the ability to learn optimal policies and
make adaptive decisions in real time, leading to significant improvements in network
performance.

DRL enables advanced network analytics and prediction capabilities. By leveraging DNN,
telecommunication companies can analyze massive amounts of data generated by network
devices, user interactions and network traffic patterns. This enables them to gain insights into
network performance, anticipate demand fluctuations and proactively optimize network
resources. Furthermore, the simulation results show that:

• The project focuses on optimizing joint beamforming, power control and interference
coordination in a 5G setting. It mandates that the UE share its coordinates and received SINR
with the BS every millisecond.
• The proposed algorithm eliminates the necessity of channel state information, thus
eliminating the need for channel estimation and the associated training sequences.
• In addition, incorporating the UE's coordinates reduces the overall feedback required
from the UE since explicit commands regarding beamforming vector changes, power
control or interference coordination are no longer necessary.
• DRL can optimize resource allocation and network efficiency, by applying DRL
techniques, telecommunication network can dynamically allocate resources such as
power, bandwidth and spectrum to meet varying demands while minimizing
interference and maximizing throughput.
• As the environment states change, the agent improves the DNN function to obtain the best
performance for the 5G network, optimizing the solution and increasing the SINR of the
system.

In general, these results improved network performance, reduced congestion and enhanced
overall efficiency.

5.2 Future Work:
The future work of this project is summarized as follows:

• Integrating DRL with edge computing in 5G networks can enable real-time decision
making and reduce latency. Future work can explore the development of efficient DRL
models that can be deployed at the network edge, enabling distributed intelligence and
enhancing the overall performance of 5G systems.
• Beamforming is a crucial technology in 5G to enhance signal transmission and
reception. DRL can be leveraged to optimize beamforming techniques by learning the
optimal beamforming weights based on channel conditions, user locations and traffic
patterns. Future research can focus on developing DRL-based beamforming algorithms
that can adapt to dynamic network conditions and provide significant performance gains.
• DRL can play a vital role in enhancing the security of 5G networks. Future work
explores the application of DRL techniques for intrusion detection, anomaly detection
and security threat prediction in 5G systems. This includes developing DRL models
that can analyze network traffic patterns, detect malicious activities and proactively
protect the network against cyber threats.
• DRL can be utilized to optimize the Quality of Service (QoS) in 5G, ensuring that
different types of services receive the required performance levels. Future research can
focus on enabling dynamic QoS management in 5G networks.

REFERENCES
[1] https://www.techtarget.com/searchmobilecomputing/definition/wireless, visited on
26/4/2023.

[2] Xiaorong Zhu., XU LIU., An End-to-End Network Slicing Algorithm Based on Deep Q-
Learning for 5G Network.

[3] https://www.etsi.org/technologies/5G, visited on 26/4/2023.

[4] https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python,
visited on 28/4/2023.

[5] Madiha Showkat, Javid A. Sheikh, and Arshid Iqbal Khan, "A New Beamforming Technique
in 5G Environment for Reliable Transmission" (2018).

[6] L. Yun and D. Messerschmitt, "Power control for variable quality of service in cellular
systems."

[7] https://www.rfwireless-world.com/5G/5G-NR-Uplink-Power-Control.html, visited on


3/5/2023.

[8] Zijiao Guo. Fixed Power Allocation Algorithm Based on Time-Shift Pilot in Massive
MIMO System (2020).

[9] PPT - Introduction to Reinforcement Learning PowerPoint Presentation, free download -


ID:5985265 (slideserve.com), visited on 4/5/2023.

[10] Faris B. Mismar, Brian L. Evans, and Ahmed Alkhateeb, "Deep Reinforcement Learning for
5G Networks: Joint Beamforming, Power Control, and Interference Coordination" (2020).

[11] T. Rappaport, F. Gutierrez, E. Ben-Dor, J. Murdock, Y. Qiao, and J. Tamir, Broadband


millimeter-wave propagation measurements and models using adaptive-beam antennas for
outdoor urban cellular communications (2013).

[12] Pieter Abbeel and John Schulman. Deep Reinforcement Learning: An Introduction by
Richard S. Sutton and Andrew G. Barto.

[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D.Wierstra, and M.


Riedmiller. Playing Atari with Deep Reinforcement Learning (2013).

[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 1st ed. Cambridge, MA, USA:
The MIT Press (2016).

[15] G.-B. Huang, “Learning capability and storage capacity of two-hiddenlayer feedforward
networks,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 274–281, Mar. 2003.

[16] D. Silver, Advanced Topics – Reinforcement Learning, (2015).

[17] Ruiyuan Li, Ying-Chang Liang, and Bin Ning. A New Approach to Subcarrier and
Power Allocation for Fixed-Power Coordinated Direct and Relay Systems (2015).

[18] "Deep Reinforcement Learning for Dynamic Power Allocation and Interference
Management in Wireless Networks" by Zappone et al (2017).

[19] F. B. Mismar, J. Choi, and B. L. Evans, “A Framework for Automated Cellular Network
Tuning with Reinforcement Learning,” IEEE Transactions on Communications, vol. 67, no.
10, pp. 7152–7167, oct 2019.

[20] 3GPP, “NR; Physical channels and modulation,” 3rd Generation Partnership Project
(3GPP), TS 38.211, Jun. 2018.

[21] T. Bai and R. W. Heath Jr., “Coverage and Rate Analysis for Millimeter- Wave Cellular
Networks,” IEEE Transactions on Wireless Communications, (2015).

[22] 3GPP, “Evolved Universal Terrestrial Radio Access Physical layer procedures,” (2015).

APPENDICES

Appendix A: Python code for Deep Reinforcement Learning for
5G Networks: Joint Beamforming, Power Control, and
Interference Coordination for Data Bearer.

The GitHub link for the files of the Python code:

GitHub - farismismar/Deep-Reinforcement-Learning-for-5G-Networks: Code for my


publication: Deep Reinforcement Learning for 5G Networks: Joint Beamforming, Power
Control, and Interference Coordination. Paper accepted for publication to IEEE Transactions
on Communications.
