Multi-Agent DRL Based PA in Multi-User Cellular Networks
Abstract-The model-driven power allocation (PA) algorithms in wireless cellular networks with the interfering multiple-access channel (IMAC) have been investigated for decades. Nowadays, data-driven model-free machine learning-based approaches are rapidly developing in this field, and among them deep reinforcement learning (DRL) has proved to be of great potential. Different from supervised learning, DRL takes advantage of exploration and exploitation to maximize the objective function under certain constraints. In this paper, we propose a two-step training framework. First, with off-line learning in a simulated environment, a deep Q network (DQN) is trained with the deep Q learning (DQL) algorithm, which is well designed to be consistent with this PA issue. Second, the DQN is further fine-tuned with real data in an on-line training procedure. The simulation results show that the proposed DQN achieves the highest averaged sum-rate, compared to the ones with present standard DQL training. With different user densities, our DQN outperforms benchmark algorithms, and thus a good generalization ability is verified.

Index Terms-Deep reinforcement learning, deep Q learning, interfering multiple-access channel, power allocation.

I. INTRODUCTION

Data traffic in wireless communication networks has experienced explosive growth in recent decades and will keep rising in the future. The user density is greatly increasing, resulting in critical demand for more capacity and spectral efficiency. Therefore, both intra-cell and inter-cell interference management are significant for improving the overall capacity of a cellular network system. The problem of maximizing a generic sum-rate is studied in this paper; it is non-convex, NP-hard and cannot be solved efficiently.

Various model-driven algorithms have been proposed in the literature for power allocation (PA) problems, such as fractional programming (FP) [1], weighted MMSE (WMMSE) [2] and some others [3], [4]. Excellent performance can be observed through theoretical analysis and numerical simulations, but serious obstacles are faced in practical deployments [5]. First, these techniques highly rely on tractable mathematical models, which are imperfect in real communication scenarios with specific user distributions, geographical environments, etc. Second, the computational complexities of these algorithms are high.

In recent years, machine learning (ML)-based approaches have been rapidly developed in wireless communications [6]. These algorithms are usually model-free, and are compliant with optimizations in practical communication scenarios. Additionally, with the development of graphics processing units (GPUs) and specialized chips, the executions can be both fast and energy-efficient, which provides a solid foundation for massive applications.

Two main branches of ML, supervised learning and reinforcement learning (RL) [7], are briefly introduced here. With supervised learning, a deep neural network (DNN) is trained to approximate some given optimal (or suboptimal) objective algorithm, and this has been realized in several applications [8]-[10]. However, the target algorithm is usually unavailable, and the performance of the DNN can be bounded by its supervisor. Therefore, RL has received widespread attention, due to its nature of interacting with an unknown environment through exploration and exploitation. The Q learning method is the most well-studied RL algorithm, and it is exploited to cope with PA in [11]-[13], and some others [14]. The DNN trained with Q learning is called a deep Q network (DQN), and it is proposed to address the curse of dimensionality and the lack of generalization. In [15], distributed dynamic downlink power allocation in single-user cellular networks with a deep reinforcement learning (DRL) approach is studied.

In this paper, we extend the same system-level optimization problem [15] but with multiple users, and an interfering multiple-access channel (IMAC) scenario is considered. The design of our DQN model is discussed and introduced. Simulation results show that the DQN outperforms the present standard DQNs and the benchmark algorithms. The contributions of this work are summarized as follows:

• A model-free two-step training framework is proposed. The DQN is first off-line trained with a DRL algorithm in simulated scenarios. Second, the learned DQN can be further dynamically optimized in real communication scenarios, with the aid of transfer learning.

• The PA problem using deep Q learning (DQL) is discussed, then a DQN-enabled approach is proposed, trained with the current sum-rate as the reward function, including no future rewards. The input features are well designed to help the DQN get closer to the optimal solution.

• After centralized training, the proposed DQN is tested
Authorized licensed use limited to: ULAKBIM UASL - KOC UNIVERSITY. Downloaded on May 09,2021 at 21:11:29 UTC from IEEE Xplore. Restrictions apply.
by distributed execution. The averaged sum-rate of the DQN outperforms the model-driven algorithms, and it also shows good generalization ability in a series of benchmark simulation tests.

The remainder of this paper is organized as follows. Section II outlines the PA problem in the wireless cellular network with IMAC. In Section III, our proposed DQN is introduced in detail. Then, this DQN is tested in distinct scenarios, along with benchmark algorithms, and the simulation results are analyzed in Section IV. Conclusions and discussion are given in Section V.

II. SYSTEM MODEL

Fig. 1. An illustrative example of a multi-user cellular network with 9 cells. In each cell, a BS serves 2 users simultaneously.

We consider power allocation in the cellular network with IMAC. In a communication system with N cells, at the center of each cell a base station (BS) simultaneously serves K users with a shared frequency band. A simple network example is shown in Fig. 1. At time slot t, the independent channel gain between the n-th BS and the user k in cell j is denoted by

    g^t_{n,j,k} = |h^t_{n,j,k}|^2 \beta_{n,j,k},    (1)

where h^t_{n,j,k} is the small-scale complex flat fading element, and \beta_{n,j,k} is the large-scale fading component, taking account of both geometric attenuation and shadow fading. According to the Jakes model [16], the small-scale flat fading is modeled as a first-order complex Gauss-Markov process

    h^t_{n,j,k} = \rho h^{t-1}_{n,j,k} + n^t_{n,j,k},    (2)

where h^0_{n,j,k} ~ CN(0, 1) and n^t_{n,j,k} ~ CN(0, 1 - \rho^2). The correlation \rho is determined by \rho = J_0(2\pi f_d T_s), where J_0(\cdot) is the zero-order Bessel function of the first kind, f_d is the maximum Doppler frequency, and T_s is the time interval between adjacent instants. Therefore, the signal-to-interference-plus-noise ratio (SINR) of this link can be described by

    \gamma^t_{n,k} = g^t_{n,n,k} p^t_{n,k} / ( \sum_{k' \neq k} g^t_{n,n,k} p^t_{n,k'} + \sum_{n' \in D_n} g^t_{n',n,k} \sum_j p^t_{n',j} + \sigma^2 ),    (3)

where D_n is the set of interfering cells around the n-th cell, p^t_{n,k} is the emitting power of the BS, and \sigma^2 denotes the additive noise power. With normalized bandwidth, the downlink rate of this link is given as

    C^t_{n,k} = \log_2(1 + \gamma^t_{n,k}).    (4)

The optimization target is to maximize this generic sum-rate objective function under the maximum power constraint, and it is formulated as

    \max_{p^t} \sum_n \sum_k C^t_{n,k}
    s.t.  0 \le p^t_{n,k} \le P_max,  \forall n, k,    (5)

where p^t := {p^t_{n,k}, \forall n, k}. This problem is non-convex and NP-hard, so we propose a data-driven learning algorithm based on the DQN model in the following section.

III. DEEP Q NETWORK

A. Background

Q learning is one of the most popular RL algorithms, aiming to deal with Markov decision process (MDP) problems [17]. At time instant t, by observing state s^t \in S, the agent takes action a^t \in A and interacts with the environment; it then gets the reward r^t, and the next state s^{t+1} is obtained. The notations A and S are the action set and the state set, respectively. Since S can be continuous, the DQN is proposed to combine Q learning with a flexible DNN to settle the infinite state space. The cumulative discounted reward function is given as

    R^t = \sum_{\tau=0}^{\infty} \gamma^\tau r^{t+\tau+1},    (6)

where \gamma \in [0, 1) is a discount factor that trades off the importance of immediate and future rewards, and r denotes the reward. Under a certain policy \pi, the Q function of the agent with an action a in state s is given as

    Q_\pi(s, a; \theta) = E_\pi[ R^t | s^t = s, a^t = a ],    (7)

where \theta denotes the DQN parameters, and E[\cdot] is the expectation operator. Q learning concerns how agents ought to interact with an unknown environment so as to maximize the Q function. The maximization of (7) is equivalent to the Bellman optimality equation [18], and it is described as

    y^t = r^t + \gamma \max_{a'} Q(s^{t+1}, a'; \theta^t),    (8)

    \theta^{t+1} = \theta^t + \eta ( y^t - Q(s^t, a^t; \theta^t) ) \nabla Q(s^t, a^t; \theta^t),    (9)
where \eta is the learning rate. This update resembles stochastic gradient descent, gradually updating the current value Q(s^t, a^t; \theta^t) towards the target y^t. The experience data of the agent is stored as the tuple (s^t, a^t, r^t, s^{t+1}). The DQN is trained with recorded batch data randomly sampled from the experience replay memory, which is a first-in first-out queue.
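The target (8) and parameter update (9) can be sketched as follows, with a linear function approximator standing in for the DQN. The linear model and all names here are simplifying assumptions made only to keep the example self-contained; the paper itself uses a DNN trained on mini-batches from the replay queue.

```python
import numpy as np

def q_values(theta, s):
    """Linear stand-in for the DQN: Q(s, a; theta) = theta[a] . s, one value per action."""
    return theta @ s

def td_step(theta, transition, gamma, eta):
    """One semi-gradient step of (8)-(9) on a single replayed transition (s, a, r, s')."""
    s, a, r, s_next = transition
    y = r + gamma * np.max(q_values(theta, s_next))  # target y^t, eq. (8)
    td_error = y - q_values(theta, s)[a]
    theta = theta.copy()
    theta[a] += eta * td_error * s                   # for a linear Q, grad wrt theta[a] is s, eq. (9)
    return theta
```

In the actual training loop, such steps are applied to randomly sampled batches from the first-in first-out replay memory rather than to single transitions.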
B. Discussion on DRL

In many applications such as playing video games [17], where the current strategy has a long-term impact on the cumulative reward, the DQN achieves remarkable results and beats humans. However, the discount factor is suggested to be zero in this static PA problem. The DQL aims to maximize the Q function. We let \gamma = 0, and the maximization of (7) is given as

    \max Q = \max_{a \in A} E_\pi[ r^t | s^t = s, a^t = a ].    (10)

In this PA problem, clearly s = g^t and a = p^t. Then we let r^t = C^t and get

    \max Q = \max_{0 \le p^t \le P_max} E_\pi[ C^t | g^t, p^t ].    (11)

In the execution period the policy is deterministic, and thus (11) can be written as

    \max_{0 \le p^t \le P_max} C^t(g^t, p^t),    (12)

which is an equivalent form of (5). In this inference process we assume that \gamma = 0 and r^t = C^t, indicating that the optimal solution to (5) is identical to that of (7) under these two conditions.

Fig. 2. The solution of the DQN is determined by the CSI g^t, along with the downlink rate C^{t-1} and transmitting power p^{t-1}.

As shown in Fig. 2, it is well known that the optimal solution p^{t*} of (5) is only determined by the current CSI g^t, and the sum-rate C^t is calculated with (g^t, p^t). Theoretically the optimal power p^{t*} can be obtained using a DQN with input being just g^t. In fact, the performance of this designed DQN is poor, since the problem is non-convex and the optimal point is hard to find. Therefore, we propose to utilize two more auxiliary features: C^{t-1} and p^{t-1}. Since the channel can be modeled as a first-order Markov process, the solution of the last time period can help the DQN get closer to the optimum, and (12) can be rewritten accordingly.

In the proposed two-step framework, the off-line trained DQN is then dynamically fine-tuned in real scenarios. Since the practical wireless communication system is dynamic and influenced by unknown issues, the data-driven algorithm is regarded as a promising technique. We just discuss the two-step framework here, and the first training step is the main focus of the remainder of this manuscript.

In a certain cellular network, each BS-user link is regarded as an agent, and thus a multi-agent system is studied. However, multi-agent training is difficult since it needs much more learning data, training time and DNN parameters. Therefore, centralized training is considered, and only one agent is trained by using all agents' experience replay memory. Then, this agent's learned policy is shared in the distributed execution period. For our designed DQN, the components of the replay memory are introduced as follows.

1) State: The state design for a certain agent (n, k) is important, since the full environment information is redundant and irrelevant elements must be removed. The agent is assumed to have the corresponding perfect instant CSI information in (3), and we define the logarithmic normalized interferer set \Gamma^t_{n,k} as

    \Gamma^t_{n,k} := \log( g^t_{n',n,k} / g^t_{n,n,k} ),  n' \in D_n,    (14)
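The reward r^t = C^t that drives (10)-(12) is just the downlink sum-rate of Section II. A minimal numerical sketch of the channel update (2) and the objective (3)-(5), under the simplifying assumption of a single user per cell so that the intra-cell term of (3) vanishes; all function names are illustrative:

```python
import numpy as np

def fading_step(h, rho, rng):
    """One Gauss-Markov update (2): h^t = rho * h^{t-1} + n^t,
    with n^t ~ CN(0, 1 - rho^2); rho itself equals J_0(2*pi*f_d*T_s)."""
    scale = np.sqrt((1.0 - rho ** 2) / 2.0)  # per real/imaginary component
    noise = rng.normal(scale=scale, size=h.shape) + 1j * rng.normal(scale=scale, size=h.shape)
    return rho * h + noise

def sum_rate(g, p, noise_power):
    """Sum-rate objective (5) for a toy single-user-per-cell case (K = 1).

    g[m, n]: gain from BS m to the user of cell n, cf. (1).
    p[n]:    transmit power of BS n (linear scale).
    """
    total = 0.0
    for n in range(g.shape[0]):
        signal = g[n, n] * p[n]
        interference = sum(g[m, n] * p[m] for m in range(g.shape[0]) if m != n)
        sinr = signal / (interference + noise_power)  # SINR, cf. (3)
        total += np.log2(1.0 + sinr)                  # per-link rate, cf. (4)
    return total
```

With \gamma = 0, this quantity evaluated at the current slot is exactly the reward every agent receives.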
cardinality of the state, i.e., the input dimension for the DQN, is |S| = 3 I_c.

2) Action: In (5) the downlink power is a continuous variable and is only constrained by the maximum power constraint, but the action space of the DQN must be finite. The possible positive emitting power is quantized exponentially into |A| - 1 levels, along with a zero-power level which represents no transmitted signal. Therefore, the allowed power set is given as

    A = { 0, P_min, P_min (P_max / P_min)^{1/(|A|-2)}, ..., P_max },    (15)

where P_min is the positive minimum emitting power.

3) Reward: In some manuscripts the reward function is elaborately designed to improve the agent's transmitting rate and also mitigate the interference influence. However, most of these reward functions are suboptimal approaches to the target function of (5). In our paper, the C^t is directly used as the reward function, and it is shared by all agents. In the training simulations with small or medium scale cellular networks, this simple method proves to be feasible.

IV. SIMULATION RESULTS

A. Simulation Configuration

A cellular network with N = 25 cells is simulated. At the center of each cell, a BS is deployed to synchronously serve K = 4 users, which are located uniformly at random within the cell range r \in [R_min, R_max], where R_min = 0.01 km and R_max = 1 km are the inner space and half the cell-to-cell distance, respectively. The small-scale fading is simulated to be Rayleigh distributed, and the Jakes model is adopted with Doppler frequency f_d = 10 Hz and time period T_s = 20 ms. According to the LTE standard, the large-scale fading is modeled as \beta = -120.9 - 37.6 log10(d) + 10 log10(z) dB, where z is a log-normal random variable with standard deviation 8 dB, and d is the transmitter-to-receiver distance (km). The AWGN power \sigma^2 is -114 dBm, and the emitting power constraints P_min and P_max are 5 and 38 dBm, respectively.

A four-layer feed-forward neural network (FNN) is chosen as the DQN, and the neuron numbers of the two hidden layers are 128 and 64, respectively. The activation function of the output layer is linear, and the rectified linear unit (ReLU) is adopted in the two hidden layers. The cardinality of adjacent cells is |D_n| = 18, \forall n; the first I_c = 16 interferers remain, and the input dimension is 48. The power level number |A| = 10, and thus the output dimension is 10.

In the off-line training period, the DQN is first randomly initialized and then trained epoch by epoch. In the first 100 episodes, the agents only take actions stochastically; they then follow the adaptive \epsilon-greedy learning strategy [18] to step into the following exploring period. In each episode, the large-scale fading is invariant, and thus the number of training episodes must be large enough to overcome the generalization problem. There are 50 time slots per episode, and the DQN is trained with 256 random samples from the experience replay memory every 10 time slots. The Adam algorithm [19] is adopted as the optimizer in our paper, and the learning rate \eta exponentially decays from 10^-3 to 10^-4. All training hyper-parameters are listed in Tab. I for better illustration. In the following simulations, any change to these default hyper-parameters will be clarified.

TABLE I
HYPER-PARAMETER SETUP OF DQN TRAINING

Time slots per episode: 50       Initial \eta: 10^-3
Observe episode number: 100      Final \eta: 10^-4
Explore episode number: 9900     Initial \epsilon: 0.2
Train interval: 10               Final \epsilon: 10^-4
Memory size: 50000               Batch size: 256

The FP algorithm, WMMSE algorithm, maximum PA and random PA schemes are treated as benchmarks to evaluate our proposed DQN-based algorithm. The perfect CSI of the current moment is assumed to be known for all schemes. The simulation code is available at https://github.com/mengxiaomao/PA_ICC.

B. Discount Factor

In this subsection, the sum-rate performance of different discount factors \gamma is studied.¹ We set \gamma \in {0.0, 0.1, 0.3, 0.7, 0.9}, and the average rate C̄ over the training period is shown in Fig. 3. At the same time slot, the values of C̄ with higher \gamma \in {0.7, 0.9} are obviously lower than the rest with lower \gamma values. The trained DQNs are then tested in three cellular networks with different cell numbers N. As shown in Fig. 4, the DQN with \gamma = 0.0 achieves the highest C̄ score, while the lowest average rate is obtained by the one with the highest \gamma value. The simulation result proves that a non-zero \gamma has a negative influence on the sum-rate performance, which is consistent with the analysis in III-B. Therefore, a zero discount factor is proposed for this static PA issue.

C. Algorithm Comparison

The DQN trained with zero \gamma is used, and the four benchmark algorithms stated before are tested as comparisons. In a real cellular network, the user density changes over time, and the DQN must have good generalization ability against this issue. In the training period, the user number per cell K is 4, but in the testing period K is assumed to be in the set {1, 2, 4, 6}. The averaged simulation results are obtained after 500 repeats. As shown in Fig. 5, the DQN achieves the highest C̄ in all testing scenarios. Although the DQN is trained with K = 4, it still outperforms the other algorithms in the other cases. We also note that the gap between the random/maximum PA schemes and the remaining optimization algorithms increases when K becomes larger. This can be mainly attributed to the intra-cell interference getting stronger with the increased user number.

¹ The present standard DQL is the one with a typical \gamma value in the range [0.7, 0.9] or even higher, while our proposed DQL has a zero \gamma value.
TABLE II
AVERAGE TIME COST PER EXECUTION T̄ (sec).

Algorithm    DQN        FP         WMMSE
CPU          3.10e-4    4.80e-3

Fig. 5. The average rate C̄ versus user number per cell. Five power allocation schemes are tested.
V. CONCLUSIONS

The distributed PA problem in the cellular network with IMAC has been investigated, and the data-driven model-free DQL has been applied to solve this issue. To be consistent with the PA optimization target, the current sum-rate is used as the reward function, including no future rewards. With our proposed discount factor \gamma = 0, the DQN simply works as an estimator to predict the current sum-rate under all power levels with a certain CSI. Simulation results show that the DQN trained with zero \gamma achieves a higher average sum-rate than the standard DQL with positive \gamma values. Then, in a series of distinct scenarios, the proposed DQN outperforms the benchmark algorithms, also indicating that our designed DQN has good generalization abilities. In the two-step training framework, we have realized the off-line centralized learning with simulated communication networks, and the learned DQN is tested by distributed executions. In our future work, the on-line learning will be further studied to accommodate real scenarios with specific user distributions and geographical environments.

VI. ACKNOWLEDGMENTS

REFERENCES

[11] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, "A machine learning approach for power allocation in hetnets considering QoS," CoRR, vol. abs/1803.06760, 2018. [Online]. Available: http://arxiv.org/abs/1803.06760

[12] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, "A reinforcement learning approach to power control and rate adaptation in cellular networks," in 2017 IEEE Int. Conf. Commun. (ICC), May 2017, pp. 1-7.

[13] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, "Learning radio resource management in RANs: Framework, opportunities, and challenges," IEEE Commun. Mag., vol. 56, no. 9, pp. 138-145, Sep. 2018.

[14] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and H. V. Poor, "Two-dimensional anti-jamming mobile communication based on reinforcement learning," IEEE Trans. Veh. Technol., pp. 1-1, 2018.

[15] Y. S. Nasir and D. Guo, "Deep reinforcement learning for distributed dynamic power allocation in wireless networks," CoRR, vol. abs/1808.00490, 2018. [Online]. Available: http://arxiv.org/abs/1808.00490

[16] P. Dent, G. E. Bottomley, and T. Croft, "Jakes fading model revisited," Electron. Lett., vol. 29, no. 13, pp. 1162-1163, Jun. 1993.

[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, Feb. 2015.

[18] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980