Multi-Agent DRL Based PA in Multi-User Cellular Networks
Abstract-The model-driven power allocation (PA) algorithms in wireless cellular networks with the interfering multiple-access channel (IMAC) have been investigated for decades. Nowadays, data-driven model-free machine learning-based approaches are rapidly developing in this field, and among them deep reinforcement learning (DRL) has proved to be of great potential. Different from supervised learning, DRL takes advantage of exploration and exploitation to maximize the objective function under certain constraints. In this paper, we propose a two-step training framework. First, with off-line learning in a simulated environment, a deep Q network (DQN) is trained with the deep Q learning (DQL) algorithm, which is well designed to be consistent with this PA issue. Second, the DQN is further fine-tuned with real data in an on-line training procedure. The simulation results show that the proposed DQN achieves the highest averaged sum-rate, compared to the ones with present standard DQL training. With different user densities, our DQN outperforms benchmark algorithms, and thus a good generalization ability is verified.

Index Terms-Deep reinforcement learning, deep Q learning, interfering multiple-access channel, power allocation.

I. INTRODUCTION

Data traffic in wireless communication networks has experienced explosive growth in recent decades and will keep rising in the future. The user density is greatly increasing, resulting in critical demand for more capacity and spectral efficiency. Therefore, both intra-cell and inter-cell interference management are significant for improving the overall capacity of a cellular network system. The problem of maximizing a generic sum-rate is studied in this paper; it is non-convex, NP-hard and cannot be solved efficiently.

Various model-driven algorithms have been proposed in the literature for power allocation (PA) problems, such as fractional programming (FP) [1], weighted MMSE (WMMSE) [2] and some others [3], [4]. Excellent performance can be observed through theoretical analysis and numerical simulations, but serious obstacles are faced in practical deployments [5]. First, these techniques highly rely on tractable mathematical models, which are imperfect in real communication scenarios with specific user distributions, geographical environments, etc. Second, the computational complexities of these algorithms are high.

In recent years, machine learning (ML)-based approaches have been rapidly developed in wireless communications [6]. These algorithms are usually model-free, and are compliant with optimizations in practical communication scenarios. Additionally, with the development of graphics processing units (GPUs) and specialized chips, the executions can be both fast and energy-efficient, which provides a solid foundation for massive applications.

Two main branches of ML, supervised learning and reinforcement learning (RL) [7], are briefly introduced here. With supervised learning, a deep neural network (DNN) is trained to approximate some given optimal (or suboptimal) objective algorithm, and this has been realized in several applications [8]-[10]. However, the target algorithm is usually unavailable, and the performance of the DNN can be bounded by its supervisor. Therefore, RL has received widespread attention, due to its nature of interacting with an unknown environment through exploration and exploitation. The Q learning method is the most well-studied RL algorithm, and it is exploited to cope with PA in [11]-[13], and some others [14]. The DNN trained with Q learning is called a deep Q network (DQN), and it is proposed to address the curse of dimensionality and the lack of generalization. In [15], distributed dynamic downlink power allocation in single-user cellular networks with a deep reinforcement learning (DRL) approach is studied.

In this paper, we extend the same system-level optimization problem [15] but with multiple users, and an interfering multiple-access channel (IMAC) scenario is considered. The design of our DQN model is discussed and introduced. Simulation results show that the DQN outperforms the present standard DQNs and the benchmark algorithms. The contributions of this work are summarized as follows:

• A model-free two-step training framework is proposed. The DQN is first off-line trained with a DRL algorithm in simulated scenarios. Second, the learned DQN can be further dynamically optimized in real communication scenarios, with the aid of transfer learning.

• The PA problem using deep Q learning (DQL) is discussed, then a DQN-enabled approach is proposed, trained with the current sum-rate as the reward function, including no future rewards. The input features are well designed to help the DQN get closer to the optimal solution.

• After centralized training, the proposed DQN is tested
Authorized licensed use limited to: ULAKBIM UASL - KOC UNIVERSITY. Downloaded on May 09,2021 at 21:11:29 UTC from IEEE Xplore. Restrictions apply.
by distributed execution. The averaged sum-rate of the DQN outperforms the model-driven algorithms, and it also shows good generalization ability in a series of benchmark simulation tests.

The remainder of this paper is organized as follows. Section II outlines the PA problem in the wireless cellular network with IMAC. In Section III, our proposed DQN is introduced in detail. Then, this DQN is tested in distinct scenarios, along with benchmark algorithms, and the simulation results are analyzed in Section IV. Conclusions and discussion are given in Section V.

II. SYSTEM MODEL

Fig. 1. An illustrative example of a multi-user cellular network with 9 cells. In each cell, a BS serves 2 users simultaneously.

We consider power allocation in the cellular network with IMAC. In a communication system with N cells, at the center of each cell a base station (BS) simultaneously serves K users with a shared frequency band. A simple network example is shown in Fig. 1. At time slot t, the independent channel gain between the n-th BS and the user k in cell j is denoted by

    g^t_{n,j,k} = |h^t_{n,j,k}|^2 \beta_{n,j,k},    (1)

where h^t_{n,j,k} is the small-scale complex flat fading element, and \beta_{n,j,k} is the large-scale fading component, taking account of both geometric attenuation and shadow fading. According to the Jakes model [16], the small-scale flat fading is modeled as a first-order complex Gauss-Markov process

    h^t_{n,j,k} = \rho h^{t-1}_{n,j,k} + n^t_{n,j,k},    (2)

where h^0_{n,j,k} ~ CN(0, 1) and n^t_{n,j,k} ~ CN(0, 1 - \rho^2). The correlation \rho is determined by \rho = J_0(2\pi f_d T_s), where J_0(\cdot) is the zero-order Bessel function of the first kind, f_d is the maximum Doppler frequency, and T_s is the time interval between adjacent instants. Therefore, the signal-to-interference-plus-noise ratio (SINR) of this link can be described by

    \gamma^t_{n,k} = g^t_{n,n,k} p^t_{n,k} / ( \sum_{k' \neq k} g^t_{n,n,k} p^t_{n,k'} + \sum_{n' \in D_n} g^t_{n',n,k} \sum_j p^t_{n',j} + \sigma^2 ),    (3)

where D_n is the set of interfering cells around the n-th cell, p^t_{n,k} is the emitting power of the BS, and \sigma^2 denotes the additive noise power. With normalized bandwidth, the downlink rate of this link is given as

    C^t_{n,k} = \log_2(1 + \gamma^t_{n,k}).    (4)

The optimization target is to maximize this generic sum-rate objective function under the maximum power constraint, and it is formulated as

    \max_{p^t} \sum_n \sum_k C^t_{n,k}
    s.t.  0 \le p^t_{n,k} \le P_max,  \forall n, k,    (5)

where p^t := {p^t_{n,k}, \forall n, k}. This problem is non-convex and NP-hard, so we propose a data-driven learning algorithm based on the DQN model in the following section.

III. DEEP Q NETWORK

A. Background

Q learning is one of the most popular RL algorithms, aiming to deal with Markov decision process (MDP) problems [17]. At time instant t, by observing state s^t \in S, the agent takes action a^t \in A and interacts with the environment; it then gets the reward r^t, and the next state s^{t+1} is obtained. The notations A and S are the action set and the state set, respectively. Since S can be continuous, the DQN is proposed to combine Q learning with a flexible DNN to settle the infinite state space. The cumulative discounted reward function is given as

    R^t = \sum_{\tau=0}^{\infty} \gamma^\tau r^{t+\tau+1},    (6)

where \gamma \in [0, 1) is a discount factor that trades off the importance of immediate and future rewards, and r denotes the reward. Under a certain policy \pi, the Q function of the agent with an action a in state s is given as

    Q_\pi(s, a; \theta) = E_\pi[ R^t | s^t = s, a^t = a ],    (7)

where \theta denotes the DQN parameters, and E[\cdot] is the expectation operator. Q learning concerns how agents ought to interact with an unknown environment so as to maximize the Q function. The maximization of (7) is equivalent to the Bellman optimality equation [18], and it is described as

    y^t = r^t + \gamma \max_{a'} Q(s^{t+1}, a'; \theta^t),    (8)

    \theta^{t+1} = \theta^t + \eta ( y^t - Q(s^t, a^t; \theta^t) ) \nabla Q(s^t, a^t; \theta^t),    (9)
where \eta is the learning rate. This update resembles stochastic gradient descent, gradually updating the current value Q(s^t, a^t; \theta^t) towards the target y^t. The experience data of the agent is stored as the tuple (s^t, a^t, r^t, s^{t+1}). The DQN is trained with recorded batch data randomly sampled from the experience replay memory, which is a first-in first-out queue.
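The target (8) and parameter update (9) can be sketched as follows, with a linear function approximator standing in for the DQN. The linear model and all names here are simplifying assumptions made only to keep the example self-contained; the paper itself uses a DNN trained on mini-batches from the replay queue.

```python
import numpy as np

def q_values(theta, s):
    """Linear stand-in for the DQN: Q(s, a; theta) = theta[a] . s, one value per action."""
    return theta @ s

def td_step(theta, transition, gamma, eta):
    """One semi-gradient step of (8)-(9) on a single replayed transition (s, a, r, s')."""
    s, a, r, s_next = transition
    y = r + gamma * np.max(q_values(theta, s_next))  # target y^t, eq. (8)
    td_error = y - q_values(theta, s)[a]
    theta = theta.copy()
    theta[a] += eta * td_error * s                   # for a linear Q, grad wrt theta[a] is s, eq. (9)
    return theta
```

In the actual training loop, such steps are applied to randomly sampled batches from the first-in first-out replay memory rather than to single transitions.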
B. Discussion on DRL

In many applications such as playing video games [17], where the current strategy has a long-term impact on the cumulative reward, the DQN achieves remarkable results and beats humans. However, the discount factor is suggested to be zero in this static PA problem. The DQL aims to maximize the Q function. We let \gamma = 0, and the maximization of (7) is given as

    \max Q = \max_{a \in A} E_\pi[ r^t | s^t = s, a^t = a ].    (10)

In this PA problem, clearly s = g^t and a = p^t. Then we let r^t = C^t and get

    \max Q = \max_{0 \le p^t \le P_max} E_\pi[ C^t | g^t, p^t ].    (11)

In the execution period the policy is deterministic, and thus (11) can be written as

    \max_{0 \le p^t \le P_max} C^t(g^t, p^t),    (12)

which is an equivalent form of (5). In this inference process we assume that \gamma = 0 and r^t = C^t, indicating that the optimal solution to (5) is identical to that of (7) under these two conditions.

Fig. 2. The solution of the DQN is determined by the CSI g^t, along with the downlink rate C^{t-1} and transmitting power p^{t-1}.

As shown in Fig. 2, it is well known that the optimal solution p^{t*} of (5) is only determined by the current CSI g^t, and the sum-rate C^t is calculated with (g^t, p^t). Theoretically the optimal power p^{t*} can be obtained using a DQN with input being just g^t. In fact, the performance of this designed DQN is poor, since the problem is non-convex and the optimal point is hard to find. Therefore, we propose to utilize two more auxiliary features: C^{t-1} and p^{t-1}. Since the channel can be modeled as a first-order Markov process, the solution of the last time period can help the DQN get closer to the optimum, and (12) can be rewritten accordingly.

In the proposed two-step framework, the off-line trained DQN is then dynamically fine-tuned in real scenarios. Since the practical wireless communication system is dynamic and influenced by unknown issues, the data-driven algorithm is regarded as a promising technique. We just discuss the two-step framework here, and the first training step is the main focus of the remainder of this manuscript.

In a certain cellular network, each BS-user link is regarded as an agent, and thus a multi-agent system is studied. However, multi-agent training is difficult since it needs much more learning data, training time and DNN parameters. Therefore, centralized training is considered, and only one agent is trained by using all agents' experience replay memory. Then, this agent's learned policy is shared in the distributed execution period. For our designed DQN, the components of the replay memory are introduced as follows.

1) State: The state design for a certain agent (n, k) is important, since the full environment information is redundant and irrelevant elements must be removed. The agent is assumed to have the corresponding perfect instant CSI information in (3), and we define the logarithmic normalized interferer set \Gamma^t_{n,k} as

    \Gamma^t_{n,k} := \log( g^t_{n',n,k} / g^t_{n,n,k} ),  n' \in D_n,    (14)
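The reward r^t = C^t that drives (10)-(12) is just the downlink sum-rate of Section II. A minimal numerical sketch of the channel update (2) and the objective (3)-(5), under the simplifying assumption of a single user per cell so that the intra-cell term of (3) vanishes; all function names are illustrative:

```python
import numpy as np

def fading_step(h, rho, rng):
    """One Gauss-Markov update (2): h^t = rho * h^{t-1} + n^t,
    with n^t ~ CN(0, 1 - rho^2); rho itself equals J_0(2*pi*f_d*T_s)."""
    scale = np.sqrt((1.0 - rho ** 2) / 2.0)  # per real/imaginary component
    noise = rng.normal(scale=scale, size=h.shape) + 1j * rng.normal(scale=scale, size=h.shape)
    return rho * h + noise

def sum_rate(g, p, noise_power):
    """Sum-rate objective (5) for a toy single-user-per-cell case (K = 1).

    g[m, n]: gain from BS m to the user of cell n, cf. (1).
    p[n]:    transmit power of BS n (linear scale).
    """
    total = 0.0
    for n in range(g.shape[0]):
        signal = g[n, n] * p[n]
        interference = sum(g[m, n] * p[m] for m in range(g.shape[0]) if m != n)
        sinr = signal / (interference + noise_power)  # SINR, cf. (3)
        total += np.log2(1.0 + sinr)                  # per-link rate, cf. (4)
    return total
```

With \gamma = 0, this quantity evaluated at the current slot is exactly the reward every agent receives.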
cardinality of the state, i.e., the input dimension for the DQN, is |S| = 3 I_c.

2) Action: In (5) the downlink power is a continuous variable and is only constrained by the maximum power constraint, but the action space of the DQN must be finite. The possible positive emitting power is quantized exponentially into |A| - 1 levels, along with a zero-power level which represents no transmitted signal. Therefore, the allowed power set is given as

    A = { 0, P_min, P_min (P_max / P_min)^{1/(|A|-2)}, ..., P_max },    (15)

where P_min is the positive minimum emitting power.

3) Reward: In some manuscripts the reward function is elaborately designed to improve the agent's transmitting rate and also mitigate the interference influence. However, most of these reward functions are suboptimal approaches to the target function of (5). In our paper, the C^t is directly used as the reward function, and it is shared by all agents. In the training simulations with small or medium scale cellular networks, this simple method proves to be feasible.

IV. SIMULATION RESULTS

A. Simulation Configuration

A cellular network with N = 25 cells is simulated. At the center of each cell, a BS is deployed to synchronously serve K = 4 users, which are located uniformly at random within the cell range r \in [R_min, R_max], where R_min = 0.01 km and R_max = 1 km are the inner space and half the cell-to-cell distance, respectively. The small-scale fading is simulated to be Rayleigh distributed, and the Jakes model is adopted with Doppler frequency f_d = 10 Hz and time period T_s = 20 ms. According to the LTE standard, the large-scale fading is modeled as \beta = -120.9 - 37.6 log10(d) + 10 log10(z) dB, where z is a log-normal random variable with standard deviation 8 dB, and d is the transmitter-to-receiver distance (km). The AWGN power \sigma^2 is -114 dBm, and the emitting power constraints P_min and P_max are 5 and 38 dBm, respectively.

A four-layer feed-forward neural network (FNN) is chosen as the DQN, and the neuron numbers of the two hidden layers are 128 and 64, respectively. The activation function of the output layer is linear, and the rectified linear unit (ReLU) is adopted in the two hidden layers. The cardinality of adjacent cells is |D_n| = 18, \forall n; the first I_c = 16 interferers remain, and the input dimension is 48. The power level number |A| = 10, and thus the output dimension is 10.

In the off-line training period, the DQN is first randomly initialized and then trained epoch by epoch. In the first 100 episodes, the agents only take actions stochastically; they then follow the adaptive \epsilon-greedy learning strategy [18] to step into the following exploring period. In each episode, the large-scale fading is invariant, and thus the number of training episodes must be large enough to overcome the generalization problem. There are 50 time slots per episode, and the DQN is trained with 256 random samples from the experience replay memory every 10 time slots. The Adam algorithm [19] is adopted as the optimizer in our paper, and the learning rate \eta exponentially decays from 10^-3 to 10^-4. All training hyper-parameters are listed in Tab. I for better illustration. In the following simulations, any change to these default hyper-parameters will be clarified.

TABLE I
HYPER-PARAMETER SETUP OF DQN TRAINING

Time slots per episode: 50       Initial \eta: 10^-3
Observe episode number: 100      Final \eta: 10^-4
Explore episode number: 9900     Initial \epsilon: 0.2
Train interval: 10               Final \epsilon: 10^-4
Memory size: 50000               Batch size: 256

The FP algorithm, WMMSE algorithm, maximum PA and random PA schemes are treated as benchmarks to evaluate our proposed DQN-based algorithm. The perfect CSI of the current moment is assumed to be known for all schemes. The simulation code is available at https://github.com/mengxiaomao/PA_ICC.

B. Discount Factor

In this subsection, the sum-rate performance of different discount factors \gamma is studied.¹ We set \gamma \in {0.0, 0.1, 0.3, 0.7, 0.9}, and the average rate C̄ over the training period is shown in Fig. 3. At the same time slot, the values of C̄ with higher \gamma \in {0.7, 0.9} are obviously lower than the rest with lower \gamma values. The trained DQNs are then tested in three cellular networks with different cell numbers N. As shown in Fig. 4, the DQN with \gamma = 0.0 achieves the highest C̄ score, while the lowest average rate is obtained by the one with the highest \gamma value. The simulation result proves that a non-zero \gamma has a negative influence on the sum-rate performance, which is consistent with the analysis in III-B. Therefore, a zero discount factor is proposed for this static PA issue.

C. Algorithm Comparison

The DQN trained with zero \gamma is used, and the four benchmark algorithms stated before are tested as comparisons. In a real cellular network, the user density changes over time, and the DQN must have good generalization ability against this issue. In the training period, the user number per cell K is 4, but in the testing period K is assumed to be in the set {1, 2, 4, 6}. The averaged simulation results are obtained after 500 repeats. As shown in Fig. 5, the DQN achieves the highest C̄ in all testing scenarios. Although the DQN is trained with K = 4, it still outperforms the other algorithms in the other cases. We also note that the gap between the random/maximum PA schemes and the remaining optimization algorithms increases when K becomes larger. This can be mainly attributed to the intra-cell interference getting stronger with the increased user number.

¹ The present standard DQL is the one with a typical \gamma value in the range [0.7, 0.9] or even higher, while our proposed DQL has a zero \gamma value.
TABLE II
AVERAGE TIME COST PER EXECUTION T̄ (sec).

Algorithm    DQN        FP         WMMSE
CPU          3.10e-4    4.80e-3

Fig. 5. The average rate C̄ versus user number per cell. Five power allocation schemes are tested.
V. CONCLUSIONS

The distributed PA problem in the cellular network with IMAC has been investigated, and the data-driven model-free DQL has been applied to solve this issue. To be consistent with the PA optimization target, the current sum-rate is used as the reward function, including no future rewards. With our proposed discount factor \gamma = 0, the DQN simply works as an estimator to predict the current sum-rate under all power levels with a certain CSI. Simulation results show that the DQN trained with zero \gamma achieves a higher average sum-rate than the standard DQL with positive \gamma values. Then, in a series of distinct scenarios, the proposed DQN outperforms the benchmark algorithms, also indicating that our designed DQN has good generalization abilities. In the two-step training framework, we have realized the off-line centralized learning with simulated communication networks, and the learned DQN is tested by distributed executions. In our future work, the on-line learning will be further studied to accommodate real scenarios with specific user distributions and geographical environments.

VI. ACKNOWLEDGMENTS

REFERENCES

[11] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, "A machine learning approach for power allocation in hetnets considering QoS," CoRR, vol. abs/1803.06760, 2018. [Online]. Available: http://arxiv.org/abs/1803.06760

[12] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, "A reinforcement learning approach to power control and rate adaptation in cellular networks," in 2017 IEEE Int. Conf. Commun. (ICC), May 2017, pp. 1-7.

[13] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, "Learning radio resource management in RANs: Framework, opportunities, and challenges," IEEE Commun. Mag., vol. 56, no. 9, pp. 138-145, Sep. 2018.

[14] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and H. V. Poor, "Two-dimensional anti-jamming mobile communication based on reinforcement learning," IEEE Trans. Veh. Technol., pp. 1-1, 2018.

[15] Y. S. Nasir and D. Guo, "Deep reinforcement learning for distributed dynamic power allocation in wireless networks," CoRR, vol. abs/1808.00490, 2018. [Online]. Available: http://arxiv.org/abs/1808.00490

[16] P. Dent, G. E. Bottomley, and T. Croft, "Jakes fading model revisited," Electron. Lett., vol. 29, no. 13, pp. 1162-1163, Jun. 1993.

[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, Feb. 2015.

[18] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980