

Reinforcement Learning-Based NOMA Power Allocation in the Presence of Smart Jamming

Liang Xiao, Senior Member, IEEE, Yanda Li, Canhuang Dai, Huaiyu Dai, Fellow, IEEE, and H. Vincent Poor, Fellow, IEEE

Abstract—Nonorthogonal multiple access (NOMA) systems are vulnerable to jamming attacks, especially smart jammers who apply programmable and smart radio devices such as software-defined radios to flexibly control their jamming strategy according to the ongoing NOMA transmission and the radio environment. In this paper, the power allocation of a base station in a NOMA system equipped with multiple antennas contending with a smart jammer is formulated as a zero-sum game, in which the base station as the leader first chooses the transmit power on multiple antennas, while the jammer as the follower selects the jamming power to interrupt the transmission of the users. A Stackelberg equilibrium of the antijamming NOMA transmission game is derived, and conditions assuring its existence are provided to disclose the impact of multiple antennas and radio channel states. A reinforcement learning-based power control scheme is proposed for the downlink NOMA transmission without being aware of the jamming and radio channel parameters. The Dyna architecture, which formulates a learned world model from the real antijamming transmission experience, and the hotbooting technique, which exploits experiences in similar scenarios to initialize the quality values, are used to accelerate the learning speed of the Q-learning-based power allocation, and thus improve the communication efficiency of the NOMA transmission in the presence of smart jammers. Simulation results show that the proposed scheme can significantly increase the sum data rates of the users, and thus their utilities, compared with the standard Q-learning-based strategy.

Index Terms—Nonorthogonal multiple access (NOMA), smart jamming, power allocation, game theory, reinforcement learning.

I. INTRODUCTION

BY USING power-domain multiplexing with successive interference cancellation, non-orthogonal multiple access (NOMA), an important candidate for 5G cellular communications, can significantly improve both the outage performance and the user fairness compared with orthogonal multiple access systems [1]. Multiple-input and multiple-output (MIMO) NOMA transmission can achieve even higher spectral efficiency. For instance, the MIMO NOMA system proposed in [2] applies zero-forcing based beamforming and user pairing to reduce the cluster size of the cellular system and the interference, and thus significantly increases the capacity. The MIMO NOMA scheme developed in [3] provides fast transmission of small-size packets while meeting the required outage performance. The MIMO-NOMA system implements successive interference cancellation to detect and remove the signals of the weaker users according to downlink cooperative transmission protocols such as those in [4] and [5].

NOMA transmission is vulnerable to jamming attacks, as an attacker can use programmable and smart radio devices such as software-defined radios to flexibly control the jamming power according to the ongoing communication [6]–[9]. As an extreme case, all the downlink users in a NOMA system can be simultaneously blocked if a jammer sends strong jamming power on the frequency at which the users send their signals. By reducing the signal-to-interference-plus-noise ratio (SINR) of the user signals, jammers can significantly decrease the data rates of the NOMA transmission and even result in denial-of-service attacks.

Game theoretic study of anti-jamming communications in wireless networks, such as [9]–[13], has provided insights into the defense against smart jammers. In this paper, the anti-jamming transmissions of a base station (BS) in a MIMO NOMA system are formulated as a zero-sum Stackelberg game. In this game, the transmit power of the BS for each user is first determined on each antenna, and then a smart jammer chooses the jamming power in an attempt to interrupt the NOMA transmission at a low jamming cost. The Stackelberg equilibrium (SE) of the anti-jamming MIMO NOMA transmission game is derived, showing that the BS tends to allocate more power to the strong user but has to satisfy the minimum rate demand of the weak users under a full jamming-power attack of the smart jammer. Conditions under which the SE exists in the game are provided to disclose the impact of the radio channel states and the number of antennas on the communication efficiency of NOMA transmissions.

Manuscript received June 12, 2017; revised September 29, 2017 and November 30, 2017; accepted December 3, 2017. Date of publication December 12, 2017; date of current version April 16, 2018. This work was supported in part by the National Natural Science Foundation of China under Grants 61671396 and 91638204, in part by the U.S. National Science Foundation under Grants CCF-1420575, CNS-1456793, ECCS-1307949, and EARS-1444009, and in part by the open research fund of the National Mobile Communications Research Laboratory, Southeast University (No. 2018D08). The review of this paper was coordinated by Dr. Y. Gao. (Corresponding author: Liang Xiao.)
L. Xiao is with the Department of Communication Engineering, Xiamen University, Xiamen 361005, China, and also with the National Mobile Communications Research Laboratory, Southeast University, China (e-mail: lxiao@xmu.edu.cn).
Y. Li and C. Dai are with the Department of Communication Engineering, Xiamen University, Xiamen 361005, China (e-mail: 694201889@qq.com; daich_xmu@163.com).
H. Dai is with the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC 27695 USA (e-mail: huaiyu_dai@ncsu.edu).
H. V. Poor is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVT.2017.2782726


Reinforcement learning techniques such as Q-learning can derive the optimal strategy with probability one if all the feasible actions are repeatedly sampled over all the states of the Markov decision process [14]. Therefore, we propose a Q-learning based power allocation strategy that chooses the transmit power based on the observed state of the radio environment and the jamming power, and on a quality function or Q-function that describes the discounted long-term reward for each state-action pair. This scheme is applied by the BS to derive the optimal policy for multiple users in the dynamic anti-jamming MIMO NOMA game without being aware of the channel model and the jamming model. A hotbooting Q-learning based power allocation exploits the experiences in similar anti-jamming NOMA transmission scenarios to set the initial value of the quality function for each state-action pair, and thus accelerates the convergence of the standard Q-learning algorithm, which always uses zero as the initial Q-function value. Meanwhile, the Dyna architecture, which formulates a learned world model from real experience, can be used to increase the learning speed to derive the optimal policy [15]. The Dyna algorithm based power allocation scheme emulates the planning and reactions from hypothetical experiences and requires extra computation for updating the Q-function values. Simulation results show that the proposed power allocation schemes enable the BS to improve the sum data rate of the users in the downlink NOMA transmission against jamming.

The contributions of this work can be summarized as follows:
1) We formulate an anti-jamming MIMO NOMA transmission game and derive the SEs of the game.
2) We propose a hotbooting Q-learning based power allocation scheme for multiple users in the dynamic NOMA transmission game against a smart jammer without knowing the jamming and channel models.
3) The Dyna architecture and hotbooting techniques are applied to accelerate the learning speed and thus enhance the anti-jamming NOMA communication efficiency.

The remainder of the paper is organized as follows: The related work is reviewed in Section II. The anti-jamming NOMA transmission game is formulated in Section III, and the SE of the game is derived in Section IV. Two reinforcement learning based power allocation schemes are proposed for the dynamic NOMA transmission game in Sections V and VI, respectively. Simulation results are provided in Section VII, and conclusions are drawn in Section VIII.

II. RELATED WORK

Power allocation is critical for NOMA transmissions. For instance, the low-complexity scheme proposed in [16] improves the ergodic capacity for a given total transmit power constraint. The NOMA power allocation proposed in [17] increases the sum data rates by pairing users with distinctive channel conditions. The joint power allocation and user pairing method proposed in [4] employs the minorization-maximization algorithm to further improve the sum data rates in the multiple-input and single-output NOMA system. The power allocation method for multicarrier NOMA transmission designed in [18] exploits the heterogeneity of the service quality demands to perform the successive interference cancellation and reduce the power consumption.

User fairness is one of the key challenges in NOMA power allocation. The power allocation based on proportional fairness proposed in [19] improves both the communication efficiency and the user fairness in a two-user NOMA system. A bisection-based iterative NOMA power allocation developed in [20] enhances user fairness with partial channel state information. The cognitive radio inspired power allocation developed in [5] satisfies the quality-of-service requirements of all the users in the NOMA system. A NOMA system that pairs users with independent and identically distributed random channel states was investigated in [21] to reduce the outage probabilities of each user. However, most existing NOMA power allocation strategies do not provide jamming resistance, especially against smart jamming.

Game theoretic study of anti-jamming communications provides insights into the design of the power allocation [9]–[11], [22]–[26]. For example, the power allocation game formulated in [9] provides a closed-form solution to improve the ergodic capacity of MIMO transmission under smart jamming. The MIMO transmission game formulated in [10] studies the power allocation strategy for the artificial noise against a smart attacker who can choose active eavesdropping or jamming. The anti-jamming transmission in a peer-to-peer network was formulated as a Bayesian game in [11], in which each jammer is identified based on the attacks in the previous time slots. The stochastic communication game analyzed in [25] proposes an intrusion detection system to assist the transmitter and increase the network capacity against both eavesdropping and jamming.

Learning based attacker type identification developed in [27] improves the transmission capacity under an uncertain type of attacks in time-varying radio channels for the stochastic secure transmission game. The Q-learning based power control proposed in [28] resists smart jamming in cooperative cognitive radio networks. The hotbooting learning technique proposed in [29] exploits historical experiences to initialize the Q values and thus accelerates the convergence rate of Q-learning. Model learning based on tree structures as proposed in [30] further improves the sample efficiency in stochastic environments and thus the convergence rate of the learning process.

The Q-learning based power control strategy proposed in [31] improves the secure capacity of MIMO transmission against smart attacks. Compared with our previous work in [31], we consider the anti-jamming MIMO NOMA transmission and improve the game model by incorporating the sum data rates and the minimum data rates required by each user in the utility function. In addition, a fast Q-based power allocation algorithm that combines the hotbooting technique and the Dyna architecture is proposed to improve the communication efficiency compared with the standard Q-learning based scheme as developed in [31].

III. ANTI-JAMMING MIMO NOMA TRANSMISSION GAME

Fig. 1. NOMA transmission against a smart jammer.

As shown in Fig. 1, we consider a time-slotted NOMA system consisting of M mobile users, each equipped with N_R antennas to receive the signals from a BS against a smart jammer. The BS uses N_T transmit antennas to send the signal denoted by x_m^k to User m, with m = 1, 2, ..., M and k = 1, 2, 3, .... A superimposed N_T-dimensional signal vector x^k = \sum_{n=1}^{M} x_n^k is sent at time k.

The smart jammer uses N_J antennas to send a jamming signal denoted by z_J with the power constraint p_J^k = E[z_J^T z_J]. The jamming power p_J^k ∈ [0, P_J] is chosen based on the ongoing downlink transmit power, where P_J is the maximum jamming power.

Let g_{j,i,m}^k denote the channel power gain from the i-th antenna of the BS to the j-th antenna of User m, and let H_{B,m}^k = [g_{j,i,m}^k]_{1≤j≤N_T, 1≤i≤N_R} denote the channel matrix between the BS and User m. The time index k is omitted if no confusion occurs. Without loss of generality, the channel conditions of the users are sorted as ||H_{B,M}|| > ··· > ||H_{B,1}||, where ||X|| denotes the Frobenius norm of a matrix X; i.e., the user rank follows the order of the channel power gains, indicating that User M may stay in the cell-center area while User 1 is at the cell edge. More specifically, the MIMO channel can be viewed as a bundle of independent sub-channels, and thus the channel gain matrices of the users can be ordered by the squared Frobenius norm, according to [4] and [32]. Similarly, H_{J,m}^k denotes the channel matrix between the jammer and User m. Each channel matrix is assumed to have independent and identically distributed (i.i.d.) complex Gaussian elements, i.e., H_{μ,m} ∼ CN(0, σ_{μ,m} I), with μ = B, J and m = 1, 2, ..., M, and the i-th largest eigenvalue of H_{μ,m} H_{μ,m}^H is denoted by h_{m,i}^μ. Thus, the signal received by User m is given by

$$\mathbf{y}_m = \mathbf{H}_{B,m} \sum_{n=1}^{M} \mathbf{x}_n + \mathbf{H}_{J,m} \mathbf{z}_J + \mathbf{n}_m, \quad m = 1, 2, \ldots, M, \qquad (1)$$

where the noise vector n_m consists of N_R additive normalized zero-mean Gaussian noise variables for User m. We assume that the channel gains and the receiver noise are independent of each other and from user to user.

Let θ_m^k be the power allocation coefficient for User m, i.e., the allocated transmit power is P_T θ_m^k. A BS has difficulty obtaining the instantaneous channel state information accurately in time in the time-varying wireless environment, and thus it equally allocates the transmit power P_T over the N_T antennas, as mentioned in [16]. On the other hand, the BS can obtain statistical results regarding the channel states of the users via feedback from the users and decide the signal decoding order accordingly. Thus the transmit covariance matrix of User m, denoted by Q_m, is given by

$$\mathbf{Q}_m = \mathbb{E}\left[\mathbf{x}_m \mathbf{x}_m^H\right] = \frac{\theta_m P_T}{N_T} \mathbf{I}_{N_T}, \qquad (2)$$

where the superscript H denotes the conjugate transpose. In this game, the power allocation vector chosen by the BS is denoted by θ^k = [θ_m^k]_{1≤m≤M} for all the M users, with the total power constraint \sum_{1≤m≤M} θ_m^k = 1. For simplicity, let Ω denote the space of all the available power allocation vectors, with θ ∈ Ω.

On the other hand, without knowing the downlink channel power gains of the users, the jammer has to equally allocate the jamming power, with the jamming covariance matrix given by

$$\mathbf{Q}_J = \mathbb{E}\left[\mathbf{z}_J \mathbf{z}_J^H\right] = \frac{p_J}{N_J} \mathbf{I}_{N_J}. \qquad (3)$$

Effective precoding and successive interference cancellation (SIC) designs for MIMO-NOMA systems have been studied in works such as [4] and [5]. More specifically, the downlink NOMA transmission scheme developed in [4] increases the sum data rate under the decodability constraint according to a simplified SIC method that orders the users based on the norm of the channel gain vector. The signal alignment scheme developed in [5] decomposes the multi-user MIMO-NOMA scenario into multiple separate single-antenna NOMA channels, and orders the users based on the large-scale channel fading gain with user pairing.

In this work, we apply the successive interference cancellation scheme that orders the users according to the channel power gains under jamming, similar to [16] and [33]. The decoding order according to the Frobenius norms of the channel matrices is optimal if the jamming power is not strong enough to change the SIC order, i.e., for any two users j and m, h_m^B / h_m^J ≥ h_j^B / h_j^J holds if m > j. Nevertheless, if this assumption does not hold because the jamming power received by a user is much stronger than that received by the user who decodes afterwards, the NOMA system ranks the decoding order according to the channel power gain normalized by the jamming power in the SIC, i.e., User m decodes before User j if h_m^B / h_m^J ≥ h_j^B / h_j^J with m > j, if not otherwise specified.

According to the singular value decomposition of the channel matrix H_{μ,m}, μ = B, J, and the assumption of Gaussian distributed signals, the achievable data rate of User m, denoted by R_m, depends on the SINR of the signal and is given by [16],


[33] as, ∀1 ≤ m ≤ M − 1,

$$R_m = \log_2 \det\left(\mathbf{I} + \left(\mathbf{I} + \mathbf{H}_{B,m}\left(\sum_{n=m+1}^{M} \mathbf{Q}_n\right)\mathbf{H}_{B,m}^H + \mathbf{H}_{J,m}\mathbf{Q}_J\mathbf{H}_{J,m}^H\right)^{-1}\mathbf{H}_{B,m}\mathbf{Q}_m\mathbf{H}_{B,m}^H\right)$$
$$= \sum_{i=1}^{N_R} \log_2\left(1 + \frac{N_J \theta_m P_T h_{m,i}^B}{N_T N_J + N_J \sum_{n=m+1}^{M} \theta_n P_T h_{m,i}^B + N_T p_J h_{m,i}^J}\right). \qquad (4)$$

Applying interference cancellation as illustrated in [33], User M with the best channel condition can subtract the signals of all the other M − 1 users from the superimposed signal x. Thus the data rate of User M, denoted by R_M, is given by

$$R_M = \log_2 \det\left(\mathbf{I} + \left(\mathbf{I} + \mathbf{H}_{J,M}\mathbf{Q}_J\mathbf{H}_{J,M}^H\right)^{-1}\mathbf{H}_{B,M}\mathbf{Q}_M\mathbf{H}_{B,M}^H\right)$$
$$= \sum_{i=1}^{N_R} \log_2\left(1 + \frac{N_J \theta_M P_T h_{M,i}^B}{N_T N_J + N_T p_J h_{M,i}^J}\right). \qquad (5)$$

Note that the received jamming signal is viewed as noise and does not change the decoding sequence according to the successive interference cancellation strategy in [33].

By exploiting the difference of the channel power gains between the users, the BS usually increases the power allocation coefficients of the weak users to satisfy the quality of service (QoS) required by each user, such as the minimum data rate demand denoted by R_0. Therefore, the NOMA transmission has to ensure min(R_1, ..., R_M) ≥ R_0.
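As a concrete illustration of (4) and (5), the following Python sketch evaluates the per-user rates for a given allocation vector under the equal per-antenna transmit power and equal jamming power allocations above. The function name and the eigenvalue array layout (h_B, h_J) are illustrative choices, not notation fixed by the paper.

```python
import numpy as np

def user_rates(theta, h_B, h_J, P_T, p_J, N_T, N_J):
    """Per-user achievable rates R_1..R_M per (4)-(5).

    theta : (M,) allocation coefficients summing to 1, users sorted so that
            User M (the last entry) has the strongest channel and applies SIC.
    h_B, h_J : (M, N_R) arrays of eigenvalues h^B_{m,i} and h^J_{m,i}.
    """
    M = len(theta)
    R = np.zeros(M)
    for m in range(M):
        residual = theta[m + 1:].sum()   # interference from users m+1..M; zero for User M
        sinr = (N_J * theta[m] * P_T * h_B[m]) / (
            N_T * N_J + N_J * residual * P_T * h_B[m] + N_T * p_J * h_J[m])
        R[m] = np.log2(1.0 + sinr).sum()  # sum over the N_R eigen-subchannels
    return R
```

For the last user the residual interference term vanishes and the expression reduces to (5), mirroring the interference cancellation at User M.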
The jammer aims to interrupt the NOMA transmission and make at least one user's data rate less than the quality of service requirement, i.e., R_m ≤ R_0 for some 1 ≤ m ≤ M, as captured by the first term in (6). If it fails to do that, the jammer's goal is to reduce the overall data rate with less jamming power and to avoid being detected, as indicated in (6). Let γ denote the jamming cost, including the jamming power consumption and the risk of being detected, which depends on the intrusion detection methods used by the BS.

The anti-jamming NOMA transmissions are formulated as a zero-sum Stackelberg game denoted by G, in which the BS as the leader first chooses the power allocation coefficient vector θ = [θ_m]_{1≤m≤M} to improve the system throughput under the user rate constraint, and a smart jammer as the follower chooses the jamming power in an attempt to interrupt the NOMA transmission at a low jamming cost.

The utility of the BS, denoted by u, depends on the sum data rate of the M users and the jamming cost, and is given by

$$u = \mathbb{I}\left(\min_{1 \le m \le M} R_m \ge R_0\right)\sum_{m=1}^{M} R_m + \gamma p_J, \qquad (6)$$

where I(·) is the indicator function that takes value 1 if the event is true and 0 otherwise, and the transmission fails if the QoS is not satisfied. In summary, we have considered an anti-jamming NOMA transmission game given by G = ⟨{B, J}, {θ, p_J}, {u, −u}⟩. For ease of reference, we also summarize our commonly used notation in Table I.

TABLE I
SUMMARY OF SYMBOLS AND NOTATION

M: Number of users
N_{T/R}: Number of antennas of the BS/RX
N_J: Number of jamming antennas
x^k: Superimposed transmit signal for the users
H_{μ,m}^k: Channel matrix
h_{m,i}^μ: i-th eigenvalue of the channel matrix
σ_{μ,m}: Channel matrix covariance
θ^k = [θ_m^k]_{1≤m≤M}: Power allocation coefficient vector for the users
Ω: Space of feasible power allocation vectors
P_J: Maximum jamming power
u: Utility of the BS
γ: Cost coefficient of the jammer
ξ: Space of feasible SINR states
R_0: Minimum data rate required by a user
L: Number of the SINR quantization levels
I: Number of emulated environments
T: Size of training data
Φ: Observation counter
τ: Modeled reward
τ′: Reward record
Ψ: State transition probability

IV. SE OF THE NOMA TRANSMISSION GAME

At a Stackelberg equilibrium of the anti-jamming NOMA transmission game, the BS as the leader first chooses its power allocation strategy for all downlink users to maximize its utility given by (6), considering the response of the smart jammer as the follower. Then the jammer as the follower decides the jamming power to minimize the utility u based on the observed ongoing transmission. The SE strategies of the NOMA transmission game, denoted by (θ*, p_J*(θ*)), are given by definition as

$$p_J^*(\boldsymbol{\theta}) = \arg\min_{0 \le p_J \le P_J} u(\boldsymbol{\theta}, p_J) \qquad (7)$$
$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta} \in \Omega} u(\boldsymbol{\theta}, p_J^*(\boldsymbol{\theta})). \qquad (8)$$

In the following, we first consider the transmission game for a BS with N_T × N_R MIMO that serves M users, and then the specific case with 2 users, to evaluate the SE strategies under different anti-jamming NOMA transmission scenarios.

Lemma 1: The anti-jamming N_T × N_R NOMA transmission game with M users has a unique SE (θ*, p_J*) given by (9) and (10) below, if h_{m,i}^B h_{M,i}^J < h_{M,i}^B h_{m,i}^J, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, and (11) holds:

$$\sum_{i=1}^{N_R} \log_2\left(1 + \frac{N_J \theta_m^* P_T h_{m,i}^B}{N_T N_J + N_J\left(1 - \sum_{n=1}^{m} \theta_n^*\right)P_T h_{m,i}^B + N_T P_J h_{m,i}^J}\right) = R_0, \quad \forall 1 \le m \le M-1 \qquad (9)$$

$$\sum_{i=1}^{N_R}\left(\sum_{m=1}^{M-1} \frac{N_T N_J \theta_m^* P_T h_{m,i}^J h_{m,i}^B}{\left(N_T N_J + N_J \sum_{n=m+1}^{M} \theta_n^* P_T h_{m,i}^B + N_T p_J^* h_{m,i}^J\right)\left(N_T N_J + N_J \sum_{n=m}^{M} \theta_n^* P_T h_{m,i}^B + N_T p_J^* h_{m,i}^J\right)} + \frac{N_J\left(1 - \sum_{n=1}^{M-1} \theta_n^*\right)P_T h_{M,i}^B h_{M,i}^J}{\left(N_J + p_J^* h_{M,i}^J\right)\left(N_T N_J + N_J\left(1 - \sum_{n=1}^{M-1} \theta_n^*\right)P_T h_{M,i}^B + N_T p_J^* h_{M,i}^J\right)}\right) = \gamma \ln 2 \qquad (10)$$

$$P_T \sum_{i=1}^{N_R}\left(\sum_{m=1}^{M-1} \frac{N_T \theta_m^* h_{m,i}^J h_{m,i}^B}{\left(N_T + \sum_{n=m+1}^{M} \theta_n^* P_T h_{m,i}^B\right)\left(N_T + \sum_{n=m}^{M} \theta_n^* P_T h_{m,i}^B\right)} + \frac{\left(1 - \sum_{n=1}^{M-1} \theta_n^*\right) h_{M,i}^B h_{M,i}^J}{N_T + \left(1 - \sum_{n=1}^{M-1} \theta_n^*\right)P_T h_{M,i}^B}\right) > N_J \gamma \ln 2 \qquad (11)$$

Proof: See Appendix A. □

According to (9) and (10), the SE of the anti-jamming NOMA transmission game depends on the QoS of the transmission (R_0), the number of antennas (N_T and N_R), the channel conditions of the weak users, and the maximum jamming power (P_J). If the jammer has good channel conditions to the weaker users, the BS allocates more power to the strongest user (i.e., User M) to improve the sum data rates, and meets the QoS of the weaker users. If the total transmit power is too low for the transmission, the jammer applies full jamming power to block all the M users in terms of the transmission QoS.
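To make the leader-follower structure of (6)-(8) concrete, the sketch below evaluates the utility (6) and approximates the jammer's best response (7) by an exhaustive search over a discretized jamming-power grid. It reuses the user_rates helper from the sketch in Section III; the grid resolution is an illustrative choice.

```python
import numpy as np

def bs_utility(theta, h_B, h_J, P_T, p_J, N_T, N_J, R_0, gamma):
    """Utility (6): sum rate counted only if every user meets R_0, plus gamma * p_J."""
    R = user_rates(theta, h_B, h_J, P_T, p_J, N_T, N_J)
    return float(R.min() >= R_0) * R.sum() + gamma * p_J

def jammer_best_response(theta, h_B, h_J, P_T, P_J, N_T, N_J, R_0, gamma, grid=200):
    """Follower step (7): the jammer minimizes u over p_J in [0, P_J]."""
    powers = np.linspace(0.0, P_J, grid)
    utils = [bs_utility(theta, h_B, h_J, P_T, p, N_T, N_J, R_0, gamma)
             for p in powers]
    return powers[int(np.argmin(utils))]
```

The leader step (8) can then be approximated by scoring each candidate θ in a quantized Ω with bs_utility at the corresponding best-response jamming power.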


Fig. 2. Performance of the anti-jamming transmission game for the MIMO NOMA system with M = 2, N_T = N_R = 3, N_J = 2, σ_{B,1} = 10 dB, σ_{B,2} = 20 dB, σ_{J,1} = σ_{J,2} = 12 dB, P_J = 20 W, R_0 = 1 bps/Hz and γ = 0.5. (a) P_T = 20 W. (b) P_T = 40 W.

Corollary 1: The anti-jamming N_T × N_R NOMA transmission game G with M users has a unique SE (θ*, 0), where θ* is given by, ∀1 ≤ m ≤ M − 1,

$$\sum_{i=1}^{N_R} \log_2\left(1 + \frac{\theta_m^* P_T h_{m,i}^B}{N_T + \left(1 - \sum_{n=1}^{m} \theta_n^*\right)P_T h_{m,i}^B}\right) = R_0, \qquad (18)$$

if h_{m,i}^B h_{M,i}^J < h_{M,i}^B h_{m,i}^J, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, and

$$P_T \sum_{i=1}^{N_R}\left(\sum_{m=1}^{M-1} \frac{N_T \theta_m^* h_{m,i}^J h_{m,i}^B}{\left(N_T + \sum_{n=m+1}^{M} \theta_n^* P_T h_{m,i}^B\right)\left(N_T + \sum_{n=m}^{M} \theta_n^* P_T h_{m,i}^B\right)} + \frac{\left(1 - \sum_{n=1}^{M-1} \theta_n^*\right) h_{M,i}^B h_{M,i}^J}{N_T + \left(1 - \sum_{n=1}^{M-1} \theta_n^*\right)P_T h_{M,i}^B}\right) < N_J \gamma \ln 2. \qquad (19)$$

Proof: See Appendix B. □

If the jamming cost is high or the jamming channel experiences serious fading, as shown in (19), a smart jammer will keep silent to reduce the power consumption. In addition, as the jamming power at the SE also decreases with N_T, MIMO significantly improves the jamming resistance.

Corollary 2: The anti-jamming N_T × N_R NOMA transmission game G with M = 2 users has a unique SE (θ_1*, p_J*) given by (12) and (13) below, if h_{1,i}^B h_{2,i}^J < h_{1,i}^J h_{2,i}^B, ∀1 ≤ i ≤ N_R, and (14) holds:

$$\sum_{i=1}^{N_R} \log_2\left(1 + \frac{N_J \theta_1^* P_T h_{1,i}^B}{N_T N_J + N_J(1 - \theta_1^*)P_T h_{1,i}^B + N_T P_J h_{1,i}^J}\right) = R_0 \qquad (12)$$

$$\sum_{i=1}^{N_R}\left(\frac{N_T N_J \theta_1^* P_T h_{1,i}^J h_{1,i}^B}{\left(N_T N_J + N_J(1 - \theta_1^*)P_T h_{1,i}^B + N_T p_J^* h_{1,i}^J\right)\left(N_T N_J + N_J P_T h_{1,i}^B + N_T p_J^* h_{1,i}^J\right)} + \frac{N_J(1 - \theta_1^*)P_T h_{2,i}^B h_{2,i}^J}{\left(N_J + p_J^* h_{2,i}^J\right)\left(N_T N_J + N_J(1 - \theta_1^*)P_T h_{2,i}^B + N_T p_J^* h_{2,i}^J\right)}\right) = \gamma \ln 2 \qquad (13)$$

$$P_T \sum_{i=1}^{N_R}\left(\frac{N_T \theta_1^* h_{1,i}^J h_{1,i}^B}{\left(N_T + (1 - \theta_1^*)P_T h_{1,i}^B\right)\left(N_T + P_T h_{1,i}^B\right)} + \frac{(1 - \theta_1^*) h_{2,i}^B h_{2,i}^J}{N_T + (1 - \theta_1^*)P_T h_{2,i}^B}\right) > N_J \gamma \ln 2 \qquad (14)$$

Proof: See Appendix C. □

As a concrete example, the utility of the BS in Fig. 2 is maximized if the rate demand of the weak user is satisfied under full jamming power (i.e., (12)). Moreover, the feasible allocation coefficient of the strong user increases with the total transmit power P_T, and the jamming power at the SE decreases with it.

Fig. 3. Performance of the anti-jamming transmission game for the MIMO NOMA system at the SE with M = 2, N_J = 3, σ_{J,1} = σ_{J,2} = 12 dB, P_T = P_J = 20 W, R_0 = 1 bps/Hz and γ = 0.5. (a) Average SINR of users. (b) Sum capacity of users.

Fig. 3 presents the jamming power and the sum data rate of the 2 users versus the channel power gain of User 1, σ_{B,1}. More specifically, the average SINR of the users increases with σ_{B,1} and with the number of transmit antennas N_T, because the jammer will fail to interrupt the user experience and decreases the jamming power to reduce its cost. Therefore, the sum data rate of the M users improves with σ_{B,1} and N_T.

Corollary 3: The anti-jamming N_T × N_R NOMA transmission game G with M = 2 users has a unique SE (θ_1*, p_J*) given by (15)–(16) below, if h_{1,i}^B h_{2,i}^J > h_{1,i}^J h_{2,i}^B + (h_{2,i}^B − h_{1,i}^B) N_J / p_J*, ∀1 ≤ i ≤ N_R, and (17) holds:

$$\sum_{i=1}^{N_R} \log_2\left(1 + \frac{N_J(1 - \theta_1^*)P_T h_{2,i}^B}{N_T N_J + N_T p_J^* h_{2,i}^J}\right) = R_0 \qquad (15)$$

$$\sum_{i=1}^{N_R}\left(\frac{N_T N_J \theta_1^* P_T h_{1,i}^J h_{1,i}^B}{\left(N_T N_J + N_J(1 - \theta_1^*)P_T h_{1,i}^B + N_T p_J^* h_{1,i}^J\right)\left(N_T N_J + N_J P_T h_{1,i}^B + N_T p_J^* h_{1,i}^J\right)} + \frac{N_J(1 - \theta_1^*)P_T h_{2,i}^B h_{2,i}^J}{\left(N_J + p_J^* h_{2,i}^J\right)\left(N_T N_J + N_J(1 - \theta_1^*)P_T h_{2,i}^B + N_T p_J^* h_{2,i}^J\right)}\right) = \gamma \ln 2 \qquad (16)$$

$$P_T \sum_{i=1}^{N_R}\left(\frac{N_T \theta_1^* h_{1,i}^J h_{1,i}^B}{\left(N_T + (1 - \theta_1^*)P_T h_{1,i}^B\right)\left(N_T + P_T h_{1,i}^B\right)} + \frac{(1 - \theta_1^*) h_{2,i}^B h_{2,i}^J}{N_T + (1 - \theta_1^*)P_T h_{2,i}^B}\right) > N_J \gamma \ln 2 \qquad (17)$$

Proof: See Appendix D. □

If the jamming channel condition to the stronger user (i.e., User 2) is good, as shown in the given condition, the sum data rate increases when more power is allocated to the weak user (i.e., User 1). On the other hand, the smart jammer, as a follower observing the ongoing transmission, will apply its full jamming power to interrupt the communication of the stronger user if the transmit power is not high enough, as in (15).
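The SE conditions above lend themselves to direct numerical evaluation. The sketch below first checks the silent-jammer condition (19) of Corollary 1, then solves (12) for θ_1* by bisection under full jamming power; the bisection is valid under the assumption, evident from (12), that R_1 increases monotonically in θ_1. Function names and array layouts follow the earlier sketches and are illustrative; (15) can be solved analogously with the roles of the two users swapped.

```python
import numpy as np

def jammer_stays_silent(theta, h_B, h_J, P_T, N_T, N_J, gamma):
    """True if (19) holds, so the SE of Corollary 1 is (theta*, 0)."""
    M = len(theta)
    lhs = 0.0
    for m in range(M - 1):
        tail_next = theta[m + 1:].sum()   # sum_{n=m+1}^{M} theta_n
        tail_curr = theta[m:].sum()       # sum_{n=m}^{M} theta_n
        lhs += np.sum(N_T * theta[m] * h_J[m] * h_B[m]
                      / ((N_T + tail_next * P_T * h_B[m])
                         * (N_T + tail_curr * P_T * h_B[m])))
    t_M = theta[-1]                        # equals 1 - sum_{n=1}^{M-1} theta_n
    lhs += np.sum(t_M * h_B[-1] * h_J[-1] / (N_T + t_M * P_T * h_B[-1]))
    return P_T * lhs < N_J * gamma * np.log(2.0)

def weak_user_rate(theta1, h_B1, h_J1, P_T, P_J, N_T, N_J):
    """Left side of (12): R_1 under full jamming power P_J."""
    sinr = (N_J * theta1 * P_T * h_B1) / (
        N_T * N_J + N_J * (1.0 - theta1) * P_T * h_B1 + N_T * P_J * h_J1)
    return np.log2(1.0 + sinr).sum()

def solve_theta1(h_B1, h_J1, P_T, P_J, N_T, N_J, R_0, tol=1e-6):
    """Bisection on (12); monotonicity of R_1 in theta1 guarantees convergence."""
    lo, hi = 0.0, 1.0
    if weak_user_rate(hi, h_B1, h_J1, P_T, P_J, N_T, N_J) < R_0:
        return None  # R_0 infeasible even with all power given to User 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weak_user_rate(mid, h_B1, h_J1, P_T, P_J, N_T, N_J) < R_0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```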


V. NOMA POWER ALLOCATION BASED ON HOTBOOTING Q-LEARNING

The repeated interactions between a BS and a smart jammer can be formulated as a dynamic anti-jamming NOMA transmission game. The optimal downlink power allocation depends on the locations of the users, the radio channel states and the jamming parameters in each time slot, which are challenging for a BS to estimate accurately. In addition, the power allocation strategy of the BS has an impact on the jamming strategy and on the received SINR of the users, and thus the anti-jamming NOMA transmission process can be formulated as a finite Markov decision process. Therefore, the NOMA system can apply Q-learning, a widely-used model-free reinforcement learning technique, in the power allocation without being aware of the instantaneous channel state information at the base station. Instead of relying on knowledge of the instantaneous channel state information, this NOMA system uses the statistical channel state information and the transmission experience to determine the power allocation.

The proposed power allocation depends on the quality function or Q-function, i.e., the discounted expected reward for each state-action pair. The state at time k that reflects the environment dynamics, denoted by s^k, is chosen as the last received SINR of each user, i.e., s^k = [SINR_m^{k-1}]_{1≤m≤M} ∈ ξ, where ξ is the space of all the possible SINR vectors. More specifically, each user quantizes the received SINR into one of L levels for simplicity, and sends the quantized value back to the BS as the observed state for the next time slot. Therefore, the BS uses this M-dimensional vector of quantized SINRs as feedback information to update its Q-function in the Q-learning based NOMA system.

For simplicity, the feasible action set of the BS in the power allocation, i.e., each transmit power coefficient, is quantized into L non-zero levels denoted by θ_m^k ∈ {l/L}_{0≤l≤L}, ∀1 ≤ m ≤ M, and the power allocation strategy of the BS is denoted by θ^k = [θ_m^k]_{1≤m≤M} ∈ Ω. The transmit signal of the M users is sent with the power allocation given by θ^k.

Note that the random exploration of standard Q-learning at the beginning of the game, due to the all-zero Q-value initialization, usually requires exponentially more data than the optimal policy [34]. Therefore, we propose a hotbooting technique, which initializes the Q-values based on training data obtained in advance from large-scale experiments in similar scenarios. As we will see, the hotbooting Q-learning based power allocation can decrease the useless random explorations in an initial environment, and thus accelerate the learning speed of the Q-learning scheme in the dynamic game and improve the communication efficiency of the anti-jamming NOMA transmission.

The preparation stage of the hotbooting technique, as presented in Algorithm 1, performs I anti-jamming MIMO NOMA transmission experiments before the start of the game. The number of experiments I is chosen as a tradeoff between the convergence rate and the risk of overfitting to a few special scenarios. In each experiment, the BS randomly selects the transmit power for the M users at a given state in similar radio transmission scenarios against jammers, and observes the resulting transmission status and the received utility. Let Q(s, θ) denote the quality or Q-function of the BS for system state s and action θ, which is the expected discounted long-term reward observed by the BS. The discount factor, denoted by δ ∈ [0, 1], is chosen according to the uncertainty regarding the future gains. As shown in Algorithm 2, at each time slot the BS updates both the Q-function and the value function, given by

$$Q\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right) \leftarrow (1 - \alpha)Q\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right) + \alpha\left(u\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right) + \delta V\left(\mathbf{s}^{k+1}\right)\right), \qquad (20)$$

where the learning rate α ∈ (0, 1] represents the weight of the current experience in the learning process. The value function V(s) is the maximum of Q(s, θ) over all the available power allocation vectors, given by

$$V\left(\mathbf{s}^k\right) = \max_{\boldsymbol{\theta} \in \Omega} Q\left(\mathbf{s}^k, \boldsymbol{\theta}\right). \qquad (21)$$

The resulting Q-function, denoted by Q*, is used to initialize the Q-values in the hotbooting Q-learning based scheme.

Algorithm 1: Hotbooting preparation.
1: Q(s, θ) = 0, V(s) = 0, ∀s ∈ ξ, ∀θ ∈ Ω, and s^0 = 0
2: for i = 1, 2, . . . , I do
3:   Emulate a similar environment
4:   for k = 1, 2, . . . , K do
5:     Choose action θ^k at random
6:     Obtain the utility u^k and the received SINR of each user, [SINR_m^k]_{1≤m≤M}
7:     s^{k+1} = [SINR_m^k]_{1≤m≤M}
8:     Q*(s^k, θ^k) = (1 − α)Q*(s^k, θ^k) + α(u^k + δV*(s^{k+1}))
9:     V*(s^k) = max_{θ∈Ω} Q*(s^k, θ)
10:   end for
11: end for

Algorithm 2: Hotbooting Q-learning based NOMA power allocation.
1: Set Q(s, θ) = Q*(s, θ), ∀s ∈ ξ, ∀θ ∈ Ω, and s^0 = 0
2: for k = 1, 2, 3, . . . do
3:   Choose θ^k via (22)
4:   for m = 1, 2, . . . , M do
5:     Allocate power θ_m^k P_T for the signal to User m
6:   end for
7:   Send the superimposed signal x^k over the N_T antennas
8:   Observe [SINR_m^k]_{1≤m≤M} and the utility u^k
9:   s^{k+1} = [SINR_m^k]_{1≤m≤M}
10:  Update Q(s^k, θ^k) via (20)
11:  Update V(s^k) via (21)
12: end for

During the learning process summarized in Algorithm 2, the tradeoff between exploitation and exploration has an important impact on the convergence performance.


Therefore, according to the ε-greedy policy, the power allocation coefficient vector θ^k is chosen based on the system state s^k and the Q-function, and is given by

$$\Pr\left(\boldsymbol{\theta}^k = \hat{\boldsymbol{\theta}}\right) = \begin{cases} 1 - \varepsilon, & \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}' \in \Omega} Q\left(\mathbf{s}^k, \boldsymbol{\theta}'\right) \\ \dfrac{\varepsilon}{|\Omega| - 1}, & \text{o.w.} \end{cases} \qquad (22)$$

The BS with our proposed RL based power allocation scheme does not know the instantaneous channel state information, the jamming power, or the jamming channel gains, which are required by global optimization algorithms such as [35]. In this way, the BS learns the jamming strategy from the anti-jamming NOMA transmission history and derives the optimal transmit power allocation strategy that improves the long-term anti-jamming communication efficiency.
−1
user at last time slot denoted by SINRk1≤m ≤M as the state s ,
k
pends on the size of the action-state space |ξ × Ω|, which
k −1
exponentially increases with the number of users and the i.e., s = [SINRm ]1≤m ≤M . The power allocation strategy θ k
k

quantization levels of the power allocation coefficients. The is also chosen according to the system state and the Q-function
channel time variation on the other hand, requires much faster based on the ε-greedy policy as shown in (22).
learning rate of the power allocation. Therefore, we further pro- Meanwhile, the Dyna-Q based algorithm also uses the real
pose a fast Q-learning based power allocation scheme that com- experiences of the NOMA transmission at time k to build a
bines the hotbooting technique and the Dyna architecture to hypothetical anti-jamming NOMA transmission model. More
improve the communication efficiency of Algorithm 2 in dy- specifically, the real experience of the NOMA transmission at
namic radio environments in the following section. time k is saved as an experience record for the corresponding
state-action pair, i.e., (sk , θ k ). The experience record stored at
the BS consists of the occurrence counter vector of the state-
VI. NOMA POWER ALLOCATION BASED ON action pairs denoted by Φ, the occurrence counter vector of the
FAST Q-LEARNING next state denoted by Φ , the reward record τ  , and the modeled
The Dyna architecture formulates a learned world model from reward τ . The hypothetical experience is generated according
the real anti-jamming NOMA transmission experiences to ac- to the real experience (Φ , Φ, τ  , τ, Ψ). As shown in Algorithm
celerate the learning speed of Q-learning in dynamic radio en- 3, the counter vector of the next state Φ is updated according to
vironments. More specifically, we apply the Dyna architecture the anti-jamming NOMA transmission at time k, i.e.,
that emulates the planning and reactions from hypothetical ex-
periences to improve the anti-jamming efficiency in the fast Φ (sk , θ k , sk +1 ) = Φ (sk , θ k , sk +1 ) + 1. (23)
Q-learning based NOMA power allocation algorithm. The al-
gorithm is summarized in Algorithm 3, and the main structure The occurrence counter vector in the hypothetical experience Φ
is illustrated in Fig. 4. defined as the sum of Φ over all the feasible next state is then


The occurrence counter vector Φ in the hypothetical experience, defined as the sum of Φ′ over all the feasible next states, is then updated with

$$\Phi\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right) = \sum_{\mathbf{s}^{k+1} \in \xi} \Phi'\left(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}\right). \qquad (24)$$

The reward record from the real experience, τ′, in this time slot is the utility of the BS at time k, given by

$$\tau'\left(\mathbf{s}^k, \boldsymbol{\theta}^k, \Phi\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right)\right) = u\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right). \qquad (25)$$

Based on the reward record τ′ in (25) and the state transition from s^k to s^{k+1}, we can formulate a hypothetical anti-jamming NOMA transmission model to generate several hypothetical NOMA transmission experiences, in which the reward to the BS, denoted by τ, is defined as the utility of the BS averaged over all the previous real experiences and is given by

$$\tau\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right) = \frac{1}{\Phi\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right)} \sum_{n=1}^{\Phi\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right)} \tau'\left(\mathbf{s}^k, \boldsymbol{\theta}^k, n\right). \qquad (26)$$

The transition probability from s^k to s^{k+1} in the hypothetical experience, denoted by Ψ, is given by (23) and (24) as

$$\Psi\left(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}\right) = \frac{\Phi'\left(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}\right)}{\Phi\left(\mathbf{s}^k, \boldsymbol{\theta}^k\right)}. \qquad (27)$$

The Q-function is updated D more times with the Dyna architecture, in which D hypothetical experiences are generated according to the world model Ψ learned from the previous real experience (Φ′, Φ, τ′, τ, Ψ). More specifically, the system state, denoted by ŝ^d, and the hypothetical action, denoted by θ̂^d, in the d-th additional update are generated based on the transition probability Ψ(ŝ^d, θ̂^d, ŝ^{d+1}) given in (27), i.e.,

$$\Pr\left(\hat{\mathbf{s}}^{d+1} \mid \hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d\right) = \Psi\left(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d, \hat{\mathbf{s}}^{d+1}\right), \qquad (28)$$

where (ŝ^0, θ̂^0) are randomly chosen from ξ and Ω, respectively. The Q-function is updated in the d-th hypothetical experience, with d = 1, 2, ..., D, by the following:

$$Q\left(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d\right) \leftarrow (1 - \alpha)Q\left(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d\right) + \alpha\left(\tau\left(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d\right) + \delta V\left(\hat{\mathbf{s}}^{d+1}\right)\right) \qquad (29)$$

$$V\left(\hat{\mathbf{s}}^d\right) = \max_{\hat{\boldsymbol{\theta}} \in \Omega} Q\left(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}\right). \qquad (30)$$

As shown in Algorithm 3, the proposed NOMA power allocation algorithm requires additional memory in each time slot to store the experience record (Φ′, Φ, τ′, τ, Ψ). Compared with the standard Q-learning based power allocation in Algorithm 2, this scheme performs D more updates of the Q-functions.
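A compact sketch of the world model (23)-(27) and the planning loop (28)-(30) is given below. It is meant to plug into the QPowerAllocator class from the Section V sketch, with actions stored as indices into the agent's action list; the DynaModel class name and the rollout restart rule are illustrative choices rather than details fixed by Algorithm 3.

```python
import random
from collections import defaultdict

class DynaModel:
    """Learned world model of (23)-(27): counters, reward averages, transitions."""
    def __init__(self):
        self.next_count = defaultdict(int)     # Phi'(s, a, s'), eq. (23)
        self.count = defaultdict(int)          # Phi(s, a), eq. (24)
        self.reward_sum = defaultdict(float)   # running sum behind tau, eq. (26)
        self.successors = defaultdict(set)     # observed next states per (s, a)

    def record(self, s, a, reward, s_next):
        """Store one real experience; reward is the utility u^k (eq. (25))."""
        self.next_count[(s, a, s_next)] += 1
        self.count[(s, a)] += 1
        self.reward_sum[(s, a)] += reward
        self.successors[(s, a)].add(s_next)

    def sample(self, s, a):
        """Draw s' with probability Psi (27); return the modeled reward tau (26)."""
        succ = list(self.successors[(s, a)])
        weights = [self.next_count[(s, a, n)] for n in succ]
        s_next = random.choices(succ, weights=weights, k=1)[0]
        tau = self.reward_sum[(s, a)] / self.count[(s, a)]
        return tau, s_next

def planning_updates(agent, model, D):
    """D hypothetical Q updates (28)-(30), run after each real transmission."""
    visited = list(model.count.keys())
    if not visited:
        return
    s, a = random.choice(visited)              # hypothetical start, cf. Alg. 3 line 17
    for _ in range(D):
        tau, s_next = model.sample(s, a)
        agent.update(s, a, tau, s_next)        # (29)-(30) reuse the update of (20)-(21)
        onward = [sa for sa in visited if sa[0] == s_next]
        s, a = random.choice(onward) if onward else random.choice(visited)
```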
However, the proposed fast Q-based power allocation algorithm applies the Dyna architecture to increase the convergence speed of the reinforcement learning process, and thus improve the NOMA communication efficiency in the presence of smart jamming.

Fig. 5. Performance of the power control schemes in the dynamic MIMO NOMA transmission game in the presence of smart jamming, with M = 3, N_T = N_R = 5, N_J = 3, σ_{B,1} = 10 dB, σ_{B,2} = 11 dB, σ_{B,3} = 20 dB, σ_{J,1} = σ_{J,2} = 11 dB, σ_{J,3} = 12 dB, P_T = P_J = 20 W, R_0 = 1 bps/Hz and γ = 0.5. (a) Average SINR. (b) Sum data rates. (c) Utility of the BS.

VII. SIMULATION RESULTS

In this section, we evaluate the performance of the MIMO NOMA power allocation scheme for M = 3 users in the dynamic anti-jamming communication game via simulations. If not specified otherwise, we set N_T = N_R = 5, σ_{B,1} = 10 dB, σ_{B,2} = 11 dB, σ_{B,3} = 20 dB, P_T = 20 W, R_0 = 1 bps/Hz, γ = 0.5, α = 0.2, δ = 0.7, ε = 0.9, and I = 200.


In the simulations, a smart jammer applies Q-learning to determine the jamming power according to the current downlink transmit power, with N_J = 3, σ_{J,1} = σ_{J,2} = 11 dB, σ_{J,3} = 12 dB, and P_J = 20 W.

Fig. 6. Performance of the power control schemes in the dynamic MIMO NOMA transmission game in the presence of smart jamming versus the channel gain of User 1, with M = 2, N_T = N_R = 5, N_J = 3, σ_{J,1} = σ_{J,2} = 12 dB, P_T = P_J = 20 W, R_0 = 2 bps/Hz and γ = 0.5. (a) Average SINR. (b) Sum data rates. (c) Utility of the BS.

Fig. 7. Performance of the power control schemes in the dynamic MIMO NOMA transmission game in the presence of smart jamming versus the number of the RX antennas, with M = 2, N_T = 10, σ_{B,1} = 12 dB, σ_{B,2} = 20 dB, σ_{J,1} = σ_{J,2} = 11 dB, P_T = P_J = 20 W, R_0 = 2 bps/Hz and γ = 0.5. (a) Average SINR. (b) Sum data rates. (c) Utility of the BS.

As shown in Fig. 5, the Q-learning based power allocation scheme significantly increases the jamming resistance of the NOMA transmissions, which can be further improved by the hotbooting technique and the Dyna architecture. In the benchmark Q-learning based OMA system, the BS equally allocates the frequency and time resources to each user and chooses the transmit power according to the current state and the Q-function, similarly to Algorithm 2.


According to [36], NOMA transmission outperforms OMA with a lower outage probability, which is verified by the simulation results in Fig. 5. More specifically, the Q-learning based NOMA system can improve the communication efficiency against the jamming attack compared to the OMA system. For instance, the Q-learning based anti-jamming NOMA system exceeds the Q-learning based OMA system with 8.7% higher SINR, 42.8% higher sum data rate, and 7.1% higher utility of the BS at the 1000-th time slot.

The NOMA transmission performance can be further improved by the hotbooting technique. For instance, the NOMA system with hotbooting Q-learning achieves 12% higher SINR, 10% higher sum data rate, and 9% higher utility at the 1000-th time slot. The reason is that the hotbooting technique can formulate the emulated experiences in dynamic radio environments to effectively reduce the exploration trials, and thus significantly improve the convergence speed of Q-learning and the jamming resistance efficiency. The power allocation performance can be further improved by the Dyna architecture with hypothetical experiences. For instance, the average SINR, the sum data rate, and the utility of the proposed fast Q-learning algorithm are 11%, 10%, and 7% higher at the 1000-th time slot, compared with the hotbooting Q-learning scheme.

The performance of the proposed power allocation scheme in the 5 × 5 MIMO NOMA system with two users is evaluated with varying channel gains of User 1 in Fig. 6. Both the sum capacity of the users and the utility of the BS increase with the channel gain σ_{B,1}. For instance, as the channel gain σ_{B,1} increases from 4 to 12 dB, the average SINR, the sum data rate and the utility of the Q-learning based power allocation increase by 9%, 14% and 17%, respectively. If σ_{B,1} = 12 dB, the fast Q-based power allocation achieves the best anti-jamming communication performance and exceeds the standard Q-learning based strategy with 9% higher SINR, 14% higher sum capacity, and 22% higher utility.

Fig. 7 shows that the NOMA transmission efficiency improves with the number of receive antennas for N_T = 10 transmit antennas. The proposed power allocation schemes have strong resistance against smart jamming even with a large number of jamming antennas N_J. For instance, our proposed scheme can improve the average SINR, the sum data rate and the user utility of the 10 × 8 NOMA system against a jammer with two antennas by 10%, 12%, and 15%, respectively, compared with the benchmark strategy.

VIII. CONCLUSION

In this paper, we have formulated an anti-jamming MIMO NOMA transmission game, in which the BS determines the transmit power to improve its utility based on the sum data rate of the users, and the smart jammer as the follower in the Stackelberg game chooses the jamming power in an attempt to interrupt the ongoing transmission at a low jamming cost. The SE of the game has been derived, and conditions assuring its existence have been provided, showing how the anti-jamming communication efficiency of NOMA systems increases with the number of antennas and the channel power gains. A fast Q-based NOMA power allocation scheme that combines the hotbooting technique and the Dyna architecture has been proposed for the dynamic game to accelerate the learning, and thus improve the communication efficiency against smart jamming. As shown in the simulation results, the proposed NOMA power allocation scheme can significantly improve the average SINR, the sum data rate and the utility of the BS soon after the start of the game. For example, the sum data rate of the three users in the 5 × 5 MIMO NOMA anti-jamming transmission game increases by 128% after 1000 time slots, which is 21% higher than that of the standard Q-learning based scheme.

We have analyzed a simplified jamming scenario for MIMO NOMA systems in this work. A future direction of our work is to extend the theoretical analysis of the NOMA transmission to more practical scenarios with smart jamming, in which a jammer uses programmable radio devices to flexibly choose among multiple jamming policies.

APPENDIX A
PROOF OF LEMMA 1

According to (6), if min[R_m]_{1≤m≤M} ≥ R_0 for ∀0 ≤ p_J ≤ P_J, we have

$$u = \sum_{i=1}^{N_R}\left(\sum_{m=1}^{M-1} \log_2\left(1 + \frac{N_J \theta_m P_T h_{m,i}^B}{N_T N_J + N_J\left(1 - \sum_{n=1}^{m} \theta_n\right)P_T h_{m,i}^B + N_T p_J h_{m,i}^J}\right) + \log_2\left(1 + \frac{N_J\left(1 - \sum_{n=1}^{M-1} \theta_n\right)P_T h_{M,i}^B}{N_T N_J + N_T p_J h_{M,i}^J}\right)\right) + \gamma p_J, \qquad (31)$$

and thus ∂²u(θ, p_J)/∂p_J² > 0, i.e., u(θ, p_J) is convex in p_J. By (11), we have ∂u(θ, p_J)/∂p_J |_{p_J=0} < 0. Thus, u(θ, p_J) is minimized at ∂u(θ, p_J)/∂p_J = 0, and by (7), p_J* is given by (10). Otherwise, if ∃ p_J′ ≤ P_J that satisfies min[R_m]_{1≤m≤M} < R_0, by (7), p_J* ∈ [p_J′, P_J].

According to the assumption that the channel conditions of User k are stronger than those of User j (k > j), we have h_{k,i}^B > h_{j,i}^B, ∀1 ≤ i ≤ N_R. In addition, if h_{m,i}^B h_{M,i}^J < h_{M,i}^B h_{m,i}^J, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, by (31) we have ∂u(θ_m, p_J*)/∂θ_m < 0, i.e.,

$$\frac{\partial u(\boldsymbol{\theta}, p_J^*)}{\partial \theta_m} = \sum_{i=1}^{N_R} \frac{N_J P_T\left(N_T p_J^*\left(h_{m,i}^B h_{M,i}^J - h_{M,i}^B h_{m,i}^J\right) - N_J P_T h_{m,i}^B h_{M,i}^B \sum_{n=m+1}^{M-1} \theta_n - N_T N_J\left(h_{M,i}^B - h_{m,i}^B\right)\right)}{\ln 2\left(N_T N_J + N_T p_J^* h_{m,i}^J + N_J\left(1 - \sum_{n=1}^{m} \theta_n\right)P_T h_{m,i}^B\right)\left(N_T N_J + N_T p_J^* h_{M,i}^J + N_J\left(1 - \sum_{n=1}^{M-1} \theta_n\right)P_T h_{M,i}^B\right)} < 0, \quad \forall 1 \le m \le M-1. \qquad (32)$$

Clearly, u monotonically decreases with θ_m, ∀1 ≤ m ≤ M − 1. Meanwhile, if R_m(θ, P_J) < R_0, ∀1 ≤ m ≤ M − 1, then p_J* = P_J and u(θ, p_J*) = 0. Therefore, u is maximized if R_m(θ, P_J) = R_0, ∀1 ≤ m ≤ M − 1, i.e., θ* is given by (9), and thus (8) holds. Therefore, the SE (θ*, p_J*) is given by (9)–(10).

APPENDIX B
PROOF OF COROLLARY 1

Similar to the proof of Lemma 1, if min[R_m]_{1≤m≤M} ≥ R_0 for ∀0 ≤ p_J ≤ P_J, by (31), ∂²u(θ, p_J)/∂p_J² > 0. By (19), we have ∂u(θ, p_J)/∂p_J |_{p_J=0} > 0. Thus u(θ, p_J) is minimized at p_J = 0, and we have p_J* = 0 by (7). If h_{m,i}^B h_{M,i}^J < h_{M,i}^B h_{m,i}^J, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, we have (32), and thus θ* is given by (18). Therefore, if h_{m,i}^B h_{M,i}^J < h_{M,i}^B h_{m,i}^J, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, and (19) holds, we have the SE (θ*, 0).


APPENDIX C
PROOF OF COROLLARY 2

By (33),

$$u = \sum_{i=1}^{N_R}\left(\log_2\left(1 + \frac{N_J \theta_1 P_T h_{1,i}^B}{N_T N_J + N_J(1 - \theta_1)P_T h_{1,i}^B + N_T p_J h_{1,i}^J}\right) + \log_2\left(1 + \frac{N_J(1 - \theta_1)P_T h_{2,i}^B}{N_T N_J + N_T p_J h_{2,i}^J}\right)\right) + \gamma p_J, \qquad (33)$$

if h_{1,i}^B h_{2,i}^J < h_{1,i}^J h_{2,i}^B, ∀1 ≤ i ≤ N_R, and (14) holds, we have ∂u(θ_1, p_J)/∂θ_1 < 0, i.e.,

$$\frac{\partial u(\boldsymbol{\theta}, p_J^*)}{\partial \theta_1} = \sum_{i=1}^{N_R} \frac{N_J P_T\left(N_T p_J^*\left(h_{1,i}^B h_{2,i}^J - h_{2,i}^B h_{1,i}^J\right) - N_T N_J\left(h_{2,i}^B - h_{1,i}^B\right)\right)}{\ln 2\left(N_T N_J + N_T p_J^* h_{1,i}^J + N_J(1 - \theta_1)P_T h_{1,i}^B\right)\left(N_T N_J + N_T p_J^* h_{2,i}^J + N_J(1 - \theta_1)P_T h_{2,i}^B\right)} < 0. \qquad (34)$$

Thus u monotonically increases with θ_2, 0 ≤ θ_2 ≤ 1. On the other hand, it is clear that R_1(θ_1, p_J*) = 0 if θ_1 = 0, and that R_1(θ_1, p_J*) monotonically increases with θ_1. Thus u is maximized if R_1 = R_0, i.e., θ_1* is given by (12), and (8) is satisfied with θ_1*. Therefore, the SE (θ_1*, p_J*) is given by (12) and (13).

APPENDIX D
PROOF OF COROLLARY 3

By (32), if h_{1,i}^B h_{2,i}^J > h_{1,i}^J h_{2,i}^B + (h_{2,i}^B − h_{1,i}^B) N_J / p_J*, ∀1 ≤ i ≤ N_R, and (17) holds, we have ∂u(θ_1, p_J)/∂θ_1 > 0. It is clear that R_2(θ_1, p_J*) = 0 if θ_1 = 1, and that R_2(θ_1, p_J*) monotonically decreases with θ_1. Thus u is maximized if R_2 = R_0, i.e., θ_1* is given by (15). Therefore, the SE (θ_1*, p_J*) is given by (15)–(16).

REFERENCES

[1] L. Dai, B. Wang, Y. Yuan, S. Han, I. Chih-Lin, and Z. Wang, "Non-orthogonal multiple access for 5G: Solutions, challenges, opportunities, and future research trends," IEEE Commun. Mag., vol. 53, no. 9, pp. 74–81, Sep. 2015.
[2] B. Kimy et al., "Non-orthogonal multiple access in a downlink multiuser beamforming system," in Proc. IEEE Mil. Commun. Conf., San Diego, CA, USA, Nov. 2013, pp. 1278–1283.
[3] Z. Ding, L. Dai, and H. V. Poor, "MIMO-NOMA design for small packet transmission in the Internet of Things," IEEE Access, vol. 4, pp. 1393–1405, Apr. 2016.
[4] M. F. Hanif, Z. Ding, T. Ratnarajah, and G. K. Karagiannidis, "A minorization-maximization method for optimizing sum rate in the downlink of non-orthogonal multiple access systems," IEEE Trans. Signal Process., vol. 64, no. 1, pp. 76–88, Jan. 2016.
[5] Z. Ding, R. Schober, and H. V. Poor, "A general MIMO framework for NOMA downlink and uplink transmission based on signal alignment," IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 4438–4454, Jun. 2016.
[6] K. Firouzbakht, G. Noubir, and M. Salehi, "On the performance of adaptive packetized wireless communication links under jamming," IEEE Trans. Wireless Commun., vol. 13, no. 7, pp. 3481–3495, Jul. 2014.
[7] Q. Wang, K. Ren, P. Ning, and S. Hu, "Jamming-resistant multiradio multichannel opportunistic spectrum access in cognitive radio networks," IEEE Trans. Veh. Technol., vol. 65, no. 10, pp. 8331–8344, Oct. 2016.
[8] Q. Yan et al., "Jamming resilient communication using MIMO interference cancellation," IEEE Trans. Inf. Forensics Security, vol. 11, no. 7, pp. 1486–1499, Jul. 2016.
[9] X. Zhou, D. Niyato, and A. Hjorungnes, "Optimizing training-based transmission against smart jamming," IEEE Trans. Veh. Technol., vol. 60, no. 6, pp. 2644–2655, Jul. 2011.
[10] A. Mukherjee and A. L. Swindlehurst, "Jamming games in the MIMO wiretap channel with an active eavesdropper," IEEE Trans. Signal Process., vol. 61, no. 1, pp. 82–91, Jan. 2013.
[11] A. Garnaev, Y. Liu, and W. Trappe, "Anti-jamming strategy versus a low-power jamming attack when intelligence of adversary's attack type is unknown," IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 1, pp. 49–56, Mar. 2016.
[12] Y. Gwon, S. Dastangoo, C. Fossa, and H. T. Kung, "Competing mobile network game: Embracing antijamming and jamming strategies with reinforcement learning," in Proc. IEEE Conf. Commun. Netw. Security, Washington, DC, USA, Oct. 2013, pp. 28–36.
[13] F. Slimeni, B. Scheers, Z. Chtourou, and V. L. Nir, "Jamming mitigation in cognitive radio networks using a modified Q-learning algorithm," in Proc. IEEE Int. Conf. Mil. Commun. Inf. Syst., Cracow, Poland, May 2015, pp. 1–7.
[14] E. R. Gomes and R. Kowalczyk, "Dynamic analysis of multiagent Q-learning with ε-greedy exploration," in Proc. ACM Annu. Int. Conf. Mach. Learn., Montreal, QC, Canada, Jun. 2009, pp. 369–376.
[15] R. S. Sutton, "Dyna, an integrated architecture for learning, planning, and reacting," ACM SIGART Bull., vol. 2, no. 4, pp. 160–163, Aug. 1991.
[16] Q. Sun, S. Han, I. Chih-Lin, and Z. Pan, "On the ergodic capacity of MIMO NOMA systems," IEEE Wireless Commun. Lett., vol. 4, no. 4, pp. 405–408, Aug. 2015.
[17] Z. Ding, P. Fan, and H. V. Poor, "Impact of user pairing on 5G nonorthogonal multiple-access downlink transmissions," IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6010–6023, Aug. 2016.
[18] Z. Wei, D. W. K. Ng, and J. Yuan, "Power-efficient resource allocation for MC-NOMA with statistical channel state information," in Proc. IEEE Global Commun. Conf., Washington, DC, USA, Dec. 2016, pp. 1–7.
[19] F. Liu, P. Mähönen, and M. Petrova, "Proportional fairness-based user pairing and power allocation for non-orthogonal multiple access," in Proc. IEEE Annu. Int. Symp. Pers., Indoor, Mobile Radio Commun., Hong Kong, Aug. 2015, pp. 1127–1131.
[20] S. Timotheou and I. Krikidis, "Fairness for non-orthogonal multiple access in 5G systems," IEEE Signal Process. Lett., vol. 22, no. 10, pp. 1647–1651, Oct. 2015.
[21] J. A. Oviedo and H. R. Sadjadpour, "A fair power allocation approach to NOMA in multi-user SISO systems," IEEE Trans. Veh. Technol., vol. 66, no. 9, pp. 7974–7985, Sep. 2017, doi: 10.1109/TVT.2017.2689000.
[22] L. Xiao, J. Liu, Q. Li, N. B. Mandayam, and H. V. Poor, "User-centric view of jamming games in cognitive radio networks," IEEE Trans. Inf. Forensics Security, vol. 10, no. 12, pp. 2578–2590, Dec. 2015.


[23] X. Tang, P. Ren, Y. Wang, Q. Du, and L. Sun, "Securing wireless transmission against reactive jamming: A Stackelberg game framework," in Proc. IEEE Global Commun. Conf., San Diego, CA, USA, Dec. 2015, pp. 1–6.
[24] Y. Wu, B. Wang, K. J. R. Liu, and T. C. Clancy, "Anti-jamming games in multi-channel cognitive radio networks," IEEE J. Sel. Areas Commun., vol. 30, no. 1, pp. 4–15, Jan. 2012.
[25] A. Garnaev, M. Baykal-Gursoy, and H. V. Poor, "A game theoretic analysis of secret and reliable communication with active and passive adversarial modes," IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2155–2163, Mar. 2016.
[26] L. Xiao, N. B. Mandayam, and H. V. Poor, "Prospect theoretic analysis of energy exchange among microgrids," IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 63–72, Jan. 2015.
[27] X. He, H. Dai, P. Ning, and R. Dutta, "A stochastic multi-channel spectrum access game with incomplete information," in Proc. IEEE Int. Conf. Commun., London, U.K., Jun. 2015, pp. 4799–4804.
[28] L. Xiao, Y. Li, J. Liu, and Y. Zhao, "Power control with reinforcement learning in cooperative cognitive radio networks against jamming," Springer J. Supercomput., vol. 71, no. 9, pp. 3237–3257, Sep. 2015.
[29] X. Xiao, C. Dai, Y. Li, C. Zhou, and L. Xiao, "Energy trading game for microgrids using reinforcement learning," in Proc. EAI Int. Conf. Game Theory Netw., Knoxville, TN, USA, May 2017, pp. 131–140.
[30] K. S. Hwang, W. C. Jiang, Y. J. Chen, and W. H. Wang, "Model-based indirect learning method based on Dyna-Q architecture," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Manchester, U.K., Oct. 2013, pp. 2540–2544.
[31] Y. Li, L. Xiao, H. Dai, and H. V. Poor, "Game theoretic study of protecting MIMO transmissions against smart attacks," in Proc. IEEE Int. Conf. Commun., May 2017.
[32] C. Wang, J. Chen, and Y. Chen, "Power allocation for a downlink non-orthogonal multiple access system," IEEE Wireless Commun. Lett., vol. 5, no. 5, pp. 532–535, Aug. 2016.
[33] Z. Ding, Z. Yang, P. Fan, and H. V. Poor, "On the performance of non-orthogonal multiple access in 5G systems with randomly deployed users," IEEE Signal Process. Lett., vol. 21, no. 12, pp. 1501–1505, Dec. 2014.
[34] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
[35] H. H. Kha, H. D. Tuan, and H. H. Nguyen, "Fast global optimal power allocation in wireless networks by local DC programming," IEEE Trans. Wireless Commun., vol. 11, no. 2, pp. 510–515, Feb. 2012.
[36] J. Cui, Z. Ding, and P. Fan, "A novel power allocation scheme under outage constraints in NOMA systems," IEEE Signal Process. Lett., vol. 23, no. 9, pp. 1226–1230, Sep. 2016.

Liang Xiao (M'09–SM'13) received the B.S. degree in communication engineering from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2000, the M.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2003, and the Ph.D. degree in electrical engineering from Rutgers University, New Brunswick, NJ, USA, in 2009. She is currently a Professor with the Department of Communication Engineering, Xiamen University, Fujian, China. She was a Visiting Professor with Princeton University, Princeton, NJ, USA, Virginia Tech, Blacksburg, VA, USA, and the University of Maryland, College Park, MD, USA. She has served as an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY and a Guest Editor of the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING. She is the recipient of the Best Paper Award of the 2016 INFOCOM Big Security Workshop and 2017 ICC.

Yanda Li received the B.S. degree in communication engineering from Xiamen University, Xiamen, China, in 2015, where he is currently working toward the M.S. degree with the Department of Communication Engineering. His research interests include network security and wireless communications.

Canhuang Dai received the B.S. degree in communication engineering from Xiamen University, Xiamen, China, in 2017, where he is currently working toward the M.S. degree with the Department of Communication Engineering. His research interests include smart grid and wireless communications.

Huaiyu Dai (F'17) received the B.E. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, USA, in 2002. He was with Bell Labs, Lucent Technologies, Holmdel, NJ, in summer 2000, and with AT&T Labs-Research, Middletown, NJ, in summer 2001. He is currently a Professor of electrical and computer engineering with NC State University, Raleigh, NC, USA. His research interests are in the general areas of communication systems and networks, advanced signal processing for digital communications, and communication theory and information theory. His current research focuses on networked information processing and crosslayer design in wireless networks, cognitive radio networks, network security, and associated information-theoretic and computation-theoretic analysis.
Dr. Dai has served as an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS, IEEE TRANSACTIONS ON SIGNAL PROCESSING, and IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS. Currently he is an Area Editor in charge of wireless communications for the IEEE TRANSACTIONS ON COMMUNICATIONS. He coedited two special issues of EURASIP journals on distributed signal processing techniques for wireless sensor networks, and on multiuser information theory and related applications, respectively. He cochaired the Signal Processing for Communications Symposium of IEEE Globecom 2013, the Communications Theory Symposium of IEEE ICC 2014, and the Wireless Communications Symposium of IEEE Globecom 2014. He was a co-recipient of best paper awards at the 2010 IEEE International Conference on Mobile Ad-hoc and Sensor Systems (MASS 2010), the 2016 IEEE INFOCOM BIGSECURITY Workshop, and the 2017 IEEE International Conference on Communications (ICC 2017).

H. Vincent Poor (S'72–M'77–SM'82–F'87) received the Ph.D. degree in electrical engineering and computer science from Princeton University, Princeton, NJ, USA, in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990, he has been on the faculty at Princeton University, where he is the Michael Henry Strater University Professor of Electrical Engineering. During 2006–2016, he served as Dean of Princeton's School of Engineering and Applied Science. He has also held visiting appointments at several other institutions, most recently at Berkeley and Cambridge. His research interests are in the areas of information theory and signal processing, and their applications in wireless networks, energy systems and related fields. Among his publications in these areas is the recent book Information Theoretic Security and Privacy of Information Systems (Cambridge University Press, 2017).
Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences, and is a foreign member of the Chinese Academy of Sciences, the Royal Society and other national and international academies. Recent recognition of his work includes the 2017 IEEE Alexander Graham Bell Medal, Honorary Professorships from Peking University and Tsinghua University, both conferred in 2017, and a D.Sc. honoris causa from Syracuse University awarded in 2017.
