Professional Documents
Culture Documents
Reinforcement Learning-Based NOMA Power Allocation in The Presence of Smart Jamming
Reinforcement Learning-Based NOMA Power Allocation in The Presence of Smart Jamming
0018-9545 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
3378 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 67, NO. 4, APRIL 2018
number of antennas on the communication efficiency of NOMA algorithm to further improve the sum data rates in the multiple-
transmissions. input and single-output NOMA system. The power allocation
Reinforcement learning techniques, such as Q-learning can method for multicarrier NOMA transmission designed in [18]
derive the optimal strategy with probability one, if all the feasible exploits the heterogeneity of service quality demand to perform
actions are repeatedly sampled over all the states in the Markov the successive interference cancellation and reduce power con-
decision process [14]. Therefore, we propose a Q-learning based sumption.
power allocation strategy that chooses the transmit power based User fairness is one of the key challenges in the NOMA
on the observed state of the radio environment and the jam- power allocation. The power allocation based on proportional
ming power and a quality function or Q-function that describes fairness proposed in [19] improves both the communication
the discount long-term reward for each state-action pair. This efficiency and user fairness in a two-user NOMA system. A
scheme is applied for the BS to derive the optimal policy for bisection-based iterative NOMA power allocation developed in
multiple users in the dynamic anti-jamming MIMO NOMA [20] enhances user fairness with partial channel state informa-
game without being aware of the channel model and the jam- tion. The cognitive radio inspired power allocation developed
ming model. A hotbooting Q-learning based power allocation in [5] satisfies the quality-of-service requirements for all the
exploits the experiences in similar anti-jamming NOMA trans- users in the NOMA system. A NOMA system that pairs users
mission scenarios to set the initial value of the quality function with independent and identically distributed random channel
for each state-action pair and thus accelerates the convergence states was investigated in [21] to reduce the outage probabilities
of the standard Q-learning algorithm that always uses zero as for each user. However, most existing NOMA power allocation
the initial Q-function value. Meanwhile, the Dyna architecture, strategies do not provide jamming resistance, especially against
which formulates a learned world model from real experience, smart jamming.
can be used to increase the learning speed to derive the opti- Game theoretic study of anti-jamming communications pro-
mal policy [15]. The Dyna algorithm based power allocation vides insights into the design of the power allocation [9]–[11],
scheme emulates the planning and reactions from hypotheti- [22]–[26]. For example, the power allocation game formulated
cal experiences and requires extra computation for updating in [9] provides a closed-form solution to improve the ergodic ca-
Q-function values. Simulation results show that the proposed pacity of MIMO under smart jamming. The MIMO transmission
power allocation schemes enable the BS to improve the sum data game formulated in [10] studies the power allocation strategy on
rate for the users in the downlink NOMA transmission against the artificial noise against a smart attacker who can choose ac-
jamming. tive eavesdropping or jamming. The anti-jamming transmission
The contributions of this work can be summarized as in a peer-to-peer network was formulated as a Bayesian game
follows: in [11], in which each jammer is identified based on the attack
1) We formulate an anti-jamming MIMO NOMA transmis- in the previous time slots. The stochastic communication game
sion game and derive the SEs of the game. as analyzed in [25] proposes an intrusion detection system to
2) We propose a hotbooting Q-learning based power allo- assist the transmitter and increase the network capacity against
cation scheme for multiple users in the dynamic NOMA both eavesdropping and jamming.
transmission game against a smart jammer without know- Learning based attacker type identification developed in [27]
ing the jamming and channel models. improves the transmission capacity under an uncertain type of
3) The Dyna architecture and hotbooting techniques are ap- attacks in the time-varying radio channels for the stochastic se-
plied to accelerate the learning speed and thus enhance cure transmission game. The Q-learning based power control
the anti-jamming NOMA communication efficiency. proposed in [28] resists smart jamming in cooperative cog-
The remainder of the paper is organized as follows: The re- nitive radio networks. The hotbooting learning technique as
lated work is reviewed in Section II. The anti-jamming NOMA proposed in [29] exploits the historical experiences to initial-
transmission game is formulated in Section III, and the SE of ize the Q values and thus accelerates the convergence rate of
the game is derived in Section IV. Two reinforcement learning Q-learning. Model learning based on tree structures as pro-
based power allocation schemes are proposed for the dynamic posed in [30] further improves the sample efficiency in stochas-
NOMA transmission game in Sections V and VI, respectively. tic environments and thus the convergence rate of the learning
Simulation results are provided in Section VII, and conclusions process.
are drawn in Section VIII. The Q-learning based power control strategy proposed in [31]
improves the secure capacity of the MIMO transmission against
smart attacks. Compared with our previous work in [31], we
II. RELATED WORK consider the anti-jamming MIMO NOMA transmission and im-
Power allocation is critical for NOMA transmissions. For prove the game model by incorporating the sum data rates and
instance, the low-complexity scheme as proposed in [16] im- the minimum data rates required by each user in the utility
proves the ergodic capacity for a given total transmit power function. In addition, a fast Q-based power allocation algorithm
constraint. The NOMA power allocation as proposed in [17] that combines the hotbooting technique and Dyna architecture
increases the sum data rates by pairing users with distinctive is proposed to improve the communication efficiency com-
channel conditions. The joint power allocation and user pairing pared with the standard Q-learning based scheme as developed
method proposed in [4] employs the minorization-maximization in [31].
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
XIAO et al.: REINFORCEMENT LEARNING-BASED NOMA POWER ALLOCATION IN THE PRESENCE OF SMART JAMMING 3379
sent at time k. On the other hand, without knowing the downlink channel
The smart jammer uses NJ antennas to send jamming signal power gain of the users, the jammer has to equally allocate
denoted by zJ with the power constraint pkJ = E[zTJ zJ ]. The the jamming power, with the jamming power covariance matrix
jamming power pkJ ∈ [0, PJ ] is chosen based on the ongoing given by
downlink transmit power, where PJ is the maximum jamming
power. pJ
k
Let gj,i,m denote the channel power gain from the i-th an- QJ = E zJ zH
J = IN . (3)
NJ J
tenna of the BS to the j-th antenna of User m, and HkB ,m =
k
[gj,i,m ]1≤j ≤N T ,1≤i≤N R denote the channel matrix between the
BS and User m. The time index k is omitted if no confusion The effective precoding and successive interference cancel-
occurs. Without loss of generality, the channel conditions of lation design for the MIMO-NOMA system have been studied
users are sorted as ||HB ,M || > . . . > ||HB ,1 ||, where ||X|| de- in works such as [4] and [5]. More specifically, the downlink
notes the Frobenius norm of a matrix X, i.e., the user rank NOMA transmission scheme developed in [4] increases the sum
follows the order of channel power gains, indicating that User data rate under the decodability constraint according to a sim-
M may stay in the cell central area while User 1 is at the plified SIC method that orders the users based on the norm of
cell-edge. More specifically, the MIMO channel can be viewed the channel gain vector. The signal alignment scheme developed
as a bundle of independent sub-channels and thus the chan- in [5] decomposes the multi-user MIMO-NOMA scenario into
nel gain matrices of the users can be ordered by the squared multiple separate single-antenna NOMA channels, and orders
Frobenius norm, according to [4] and [32]. Similarly, HkJ,m the users based on the large-scale channel fading gain with user
denotes the channel matrix between the jammer and User m. pairing.
Each channel matrix is assumed to have independently and In this work, we apply the successive interference cancellation
identically distributed (i.i.d.) complex Gaussian distributed ele- scheme that orders the users according to the channel power
ments, i.e., Hμ,m ∼ CN (0, σμ,m I), with μ = B, J and m = gains under jamming, similar to [16] and [33]. The decoding
1, 2, . . . , M , and the i-th largest eigenvalue of Hμ,m HH order according to the Frobenius norms of the channel matrices
μ,m
is denoted by hμm ,i . Thus, the signal received by User m is is optimal if the jamming power is not strong enough to change
m /hm ≥ hj /hj
the SIC order, i.e., for any two users j and m, hB J B J
given by
holds if m > j. Nevertheless, if this assumption does not hold
M
as the jamming power received by a user is much stronger than
ym = HB ,m xn + HJ,m zJ + nm , m = 1, 2, . . . , M, that received by the user who decodes afterwards, the NOMA
n =1
system ranks the decoding order according to the channel power
(1)
gain normalized by the jamming power in SIC, i.e., User m
where the noise vector nm consists of NR additive normalized decodes before User j, if hB m /hm ≥ hj /hj with m > j, if not
J B J
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
3380 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 67, NO. 4, APRIL 2018
[33] as ∀1 ≤ m ≤ M − 1 TABLE I
SUMMARY OF SYMBOLS AND NOTATION
M
Rm = log2 det I + I + HB ,m Qn HH
B ,m M Number of users
n =m +1
−1 NT / R Number of antennas of BS/RX
NJ Number of jamming antennas
+ HJ,m QJ HH
J,m HB ,m Qm HH
B ,m xk Superimposed transmit signal for users
H kμ , m Channel matrix
h μm , i i-th eigenvalue of channel matrix
NR
σμ , m
Channel matrix covariance
= log2
θk = θm k Power allocation coefficient vector for users
i=1 1≤m ≤M
Ω Space of feasible power allocation vectors
NJ θ m P T h B PJ Maximum jamming power
m ,i
1+ M . u Utility of the BS
NT NJ + N J n =m +1 θn PT hB J
m ,i + NT pJ hm ,i
γ Cost coefficient of the jammer
ξ Feasible set of jamming power
(4) R0 Minimum data rate required by a user
L Number of the SINR quantization levels
Applying interference cancellation as illustrated in [33], User I Number of emulated environments
M with the best channel condition can subtract the signals of T Size of training data
Φ Observation counter
all the other M − 1 users from the superimposed signal x. Thus τ Modeled reward
the data rate of User M denoted by RM is given by τ Reward record
−1 Ψ State transition probability
RM = log2 det I + I+HJ,M QJ HH J,M H B ,M QM HH
B ,M
QoS is not satisfied. In summary, we have considered an anti-
NR
NJ θ M P T h B
M ,i
= log2 1 + . (5) jamming NOMA transmission game given by G = {B, J},
i=1
NT NJ + NT pJ hJM ,i {θ, pJ }, {u, −u} . For ease of reference, we also summarize
Note that the received jamming signal is viewed as noise, and our commonly used notation in Table I.
does not change the decoding sequence according to the suc-
cessive interference cancellation strategy in [33]. IV. SE OF THE NOMA TRANSMISSION GAME
By exploiting the difference of the channel power gains be- At a Stackelberg equilibrium of the anti-jamming NOMA
tween the users, the BS usually increases the power allocation transmission game, the BS as the leader chooses its power allo-
coefficient for weak users to satisfy the quality of service (QoS) cation strategy for all downlink users first to maximize its utility
required by the user, such as the minimum data rate demand of given by (6) considering the response of the smart jammer as the
the user denoted by R0 . Therefore, the NOMA transmission has follower. Then the jammer as the follower decides the jamming
to ensure min(R1 , . . . , RM ) ≥ R0 . power to minimize the utility u based on the observed ongo-
The jammer aims to interrupt the NOMA transmission and ing transmission. The SE strategies of the NOMA transmission
make at least one user’s data rate less than the quality of service game denoted by (θ ∗ , p∗J (θ ∗ )) are given by definition as
requirement, i.e., Rm ≤ R0 , for 1 ≤ m ≤ M , as shown in the
first term in (6). If failing to do that, the jammer has a goal to p∗J (θ) = arg min u(θ, pJ ) (7)
0≤p J ≤P J
reduce the overall data rate with less jamming power and avoid
being detected, as indicated in (6). Let γ denote the jamming θ ∗ = arg max u(θ, p∗J (θ)). (8)
θ∈Ω
cost, including the jamming power consumption and the risk to
be detected, which depends on the intrusion detection methods In the following, we first consider the transmission game for the
used by the BS. BS with NT × NR MIMO that serves M users, and then the
The anti-jamming NOMA transmissions are formulated as a specific case with 2 users to evaluate the SE strategies under
zero-sum Stackelberg game denoted by G, in which the BS as different anti-jamming NOMA transmission scenarios.
the leader first chooses the power allocation coefficient vector Lemma 1: The anti-jamming NT × NR NOMA transmis-
θ = [θm ]1≤m ≤M to improve the system throughput under the sion game with M users has a unique SE (θ ∗ , p∗J ) given by (9)
user rate constraint, and a smart jammer as the follower chooses and (10), shown at the bottom of the next page, if hB J
m ,i hM ,i <
the jamming power in an attempt to interrupt the NOMA trans- hM ,i hm ,i , ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ NR and (11), shown at
B J
mission at a low jamming cost. the bottom of the next page, holds.
The utility of the BS denoted by u depends on the sum data Proof: See Appendix A.
rate of the M users and the jamming cost, and is given by According to (9) and (10), the SE of the anti-jamming NOMA
transmission game depends on the QoS in the transmission (R0 ),
M
the number of antennas (NT and NR ), the channel condition of
u=I min Rm ≥ R0 Rm + γpJ , (6) the weak users and the maximum jamming power (PJ ). If the
1≤m ≤M
m =1
jammer has good channel conditions to the weaker users, the
where I(·) is the indicator function that takes value 1 if the BS allocates more power to the strongest user (i.e., User M ) to
event is true and 0 otherwise, and the transmission fails if the improve the sum data rates, and meets the QoS of the weaker
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
XIAO et al.: REINFORCEMENT LEARNING-BASED NOMA POWER ALLOCATION IN THE PRESENCE OF SMART JAMMING 3381
Fig. 2. Performance of the anti-jamming transmission game for the MIMO NOMA system with M = 2, N T = N R = 3, N J = 2, σ B , 1 = 10 dB, σ B , 2 = 20
dB, σ J, 1 = σ J, 2 = 12 dB, P J = 20 W, R 0 = 1 bps/Hz and γ = 0.5. (a) P T = 20 W. (b) P T = 40 W.
users. If the total transmit power is low for the transmission, the Proof: See Appendix B.
jammer applies full jamming power to block all the M users in If the jamming cost is high or the jamming channel expe-
terms of the transmission QoS. riences serious fading as shown in Eq. (19), a smart jammer
Corollary 1: The anti-jamming NT × NR NOMA transmis- will keep silent to reduce the power consumption. In addition,
sion game G with M users has a unique SE (θ ∗ , 0), where θ ∗ as the jamming power at SE also decreases with NT , MIMO
is given by ∀1 ≤ m ≤ M − 1 significantly improves the jamming resistance.
Corollary 2: The anti-jamming NT × NR NOMA transmis-
NR ∗
θm PT hB
log2 1 + m ,i
= R0 , (18) sion game G with M = 2 users has a unique SE (θ1∗ , p∗J ) given
i=1
NT + (1 − m ∗ B
n =1 θn ) PT hm ,i by (12) and (13), shown at the bottom of the next page, if
hB1,i h2,i < h1,i h2,i , ∀1 ≤ i ≤ NR and (14), shown at the bot-
J J B
m ,i hM ,i < hM ,i hm ,i , ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ NR and
if hB J B J
tom of the next page, holds.
⎛ Proof: See Appendix C.
M −1
⎜
NR
As a concrete example, the utility of the BS in Fig. 2 is maxi-
PT ⎝ mized if the rate demand of the weak user is satisfied under full
i=1 m =1
jamming power (i.e., (12)). Moreover, the feasible allocation
∗ J
NT θ m hm ,i hB coefficient for the strong user increases with the total trans-
M m ,i mission power PT and the jamming power at the SE decreases
NT + n =m +1 θn PT hm ,i NT + M
∗ B ∗ B
n =m θn PT hm ,i with it.
⎞
−1 ∗ B J
Fig. 3 presents the jamming power and the sum data rate of
1− M θ
n =1 n h h
M ,i M ,i ⎟
2 users in terms of the channel power gain of User 1, σB ,1 .
+ M −1 ∗ ⎠ < NJ γ ln 2. (19) More specifically, the average SINR of users increases with
NT + 1 − n =1 θn PT hM ,i B
σB ,1 and the number of transmission antennas NT , because the
NR ∗
NJ θ m PT hBm ,i
log2 1+ m = R0 , ∀1 ≤ m ≤ M − 1 (9)
i=1
NT NJ + NJ (1 − n =1 θn∗ ) PT hB J
m ,i + NT PJ hm ,i
⎛
NR
M −1 ∗
NT NJ θ m PT hJm ,i hB
⎝ m ,i
M ∗
∗ J
i=1 m =1 NT NJ + NJ
∗ B J
n =m +1 θn PT hm ,i + NT pJ hm ,i NT NJ + N J M ∗ B
n =m θn PT hm ,i + NT pJ hm ,i
−1 ∗ ⎞
NJ 1 − M B J
n =1 θn PT hM ,i hM ,i
+ M −1 ∗ ⎠ = γ ln 2 (10)
∗
NJ + pJ hM ,i NT NJ + NJ 1 − n =1 θn PT hM ,i + NT pJ hM ,i
J B ∗ J
⎛ −1 ∗ B J ⎞
NR
M −1 ∗ J
NT θ m hm ,i hB 1− M n =1 θn hM ,i hM ,i
⎝ m ,i + ⎠ > NJ γ ln 2
PT M ∗ M ∗
M −1 ∗
i=1 m =1 NT +
B
n =m +1 θn PT hm ,i NT + n =m θn PT hm ,i B NT + 1 − n =1 θn PT hM ,i B
(11)
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
3382 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 67, NO. 4, APRIL 2018
Fig. 3. Performance of the anti-jamming transmission game for the MIMO NOMA system at the SE with M = 2, N J = 3, σ J, 1 = σ J, 2 = 12 dB, P T = P J = 20
W, R 0 = 1 bps/Hz and γ = 0.5. (a) Average SINR of users. (b) Sum capacity of users.
jammer will fail to interrupt the user experience and decrease Proof: See Appendix D.
the jamming power to reduce the cost. Therefore, the sum data If the jamming channel condition to the stronger user (i.e.,
rate of the M users improves with σB ,1 and NT . User 2) is high as shown in the given condition, the sum data
Corollary 3: The anti-jamming NT × NR NOMA trans- rate increases with more power allocated for the weak user (i.e.,
mission game G with M = 2 users has a unique SE User 1). On the other hand, the smart jammer, as a follower
(θ1∗ , p∗J ) given by (15)–(16), shown at the bottom of this observing the ongoing transmission will apply its full jamming
∗
hB1,i h2,i > h1,i h2,i + (h2,i − h1,i )NJ /pJ , ∀1 ≤
J J B B B
page, if power to interrupt the communication of the stronger user if its
i ≤ NR and (17), shown at the bottom of this page, holds. transmit power is not high enough as in (15).
NR
NJ θ1∗ PT hB 1,i
log2 1 + = R0 (12)
i=1
NT NJ + NJ (1 − θ1∗ )PT hB 1,i + NT PJ h1,i
J
⎛
NR
NT NJ θ1∗ PT hJ1,i hB
⎝ 1,i
∗ B ∗
NT NJ + NJ (1 − θ1 )PT h1,i + NT pJ h1,i NT NJ + NJ PT hB
J + N p ∗ hJ
i=1 1,i T J 1,i
⎞
NJ (1 − θ1∗ )PT hB J
2,i h2,i
+ ⎠ = γ ln 2 (13)
NJ + p∗J hJ2,i NT NJ + NJ (1 − θ1∗ )PT hB 2,i + N p ∗ hJ
T J 2,i
⎛ ⎞
NR
NT θ1∗ hJ1,i hB (1 − θ ∗ B J
)h h
PT ⎝ 1,i
+ 1 2,i 2,i ⎠ > NJ γ ln 2 (14)
i=1
∗
NT + (1 − θ1 )PT h1,i NT + PT h1,i
B B N T + (1 − θ1∗ )PT hB 2,i
NR
NJ (1 − θ1∗ )PT hB2,i
log2 1 + = R0 (15)
i=1
NT NJ + NT p∗J hJ2,i
⎛
NR
NT NJ θ1∗ PT hJ1,i hB
⎝ 1,i
∗ ∗
NT NJ + NJ (1 − θ1 )PT h1,i + NT pJ h1,i NT NJ + NJ PT hB
B J ∗ J
i=1 1,i + NT pJ h1,i
⎞
NJ (1 − θ1∗ )PT hB J
2,i h2,i
+ ⎠ = γ ln 2 (16)
NJ + p∗J hJ2,i NT NJ + NJ (1 − θ1∗ )PT hB 2,i + N p ∗ hJ
T J 2,i
⎛ ⎞
NR
NT θ1∗ hJ1,i hB (1 − θ ∗ B J
)h h
PT ⎝ 1,i
+ 1 2,i 2,i ⎠ > NJ γ ln 2 (17)
i=1
∗
NT + (1 − θ1 )PT h1,i NT + PT h1,i
B B N T + (1 − θ1∗ )PT hB 2,i
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
XIAO et al.: REINFORCEMENT LEARNING-BASED NOMA POWER ALLOCATION IN THE PRESENCE OF SMART JAMMING 3383
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
3384 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 67, NO. 4, APRIL 2018
quantization levels of the power allocation coefficients. The is also chosen according to the system state and the Q-function
channel time variation on the other hand, requires much faster based on the ε-greedy policy as shown in (22).
learning rate of the power allocation. Therefore, we further pro- Meanwhile, the Dyna-Q based algorithm also uses the real
pose a fast Q-learning based power allocation scheme that com- experiences of the NOMA transmission at time k to build a
bines the hotbooting technique and the Dyna architecture to hypothetical anti-jamming NOMA transmission model. More
improve the communication efficiency of Algorithm 2 in dy- specifically, the real experience of the NOMA transmission at
namic radio environments in the following section. time k is saved as an experience record for the corresponding
state-action pair, i.e., (sk , θ k ). The experience record stored at
the BS consists of the occurrence counter vector of the state-
VI. NOMA POWER ALLOCATION BASED ON action pairs denoted by Φ, the occurrence counter vector of the
FAST Q-LEARNING next state denoted by Φ , the reward record τ , and the modeled
The Dyna architecture formulates a learned world model from reward τ . The hypothetical experience is generated according
the real anti-jamming NOMA transmission experiences to ac- to the real experience (Φ , Φ, τ , τ, Ψ). As shown in Algorithm
celerate the learning speed of Q-learning in dynamic radio en- 3, the counter vector of the next state Φ is updated according to
vironments. More specifically, we apply the Dyna architecture the anti-jamming NOMA transmission at time k, i.e.,
that emulates the planning and reactions from hypothetical ex-
periences to improve the anti-jamming efficiency in the fast Φ (sk , θ k , sk +1 ) = Φ (sk , θ k , sk +1 ) + 1. (23)
Q-learning based NOMA power allocation algorithm. The al-
gorithm is summarized in Algorithm 3, and the main structure The occurrence counter vector in the hypothetical experience Φ
is illustrated in Fig. 4. defined as the sum of Φ over all the feasible next state is then
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
XIAO et al.: REINFORCEMENT LEARNING-BASED NOMA POWER ALLOCATION IN THE PRESENCE OF SMART JAMMING 3385
updated with
Φ(sk , θ k ) = Φ (sk , θ k , sk +1 ). (24)
s k + 1 ∈ξ
The reward record from the real experience τ in this time slot
is the utility of the BS at time k given by
τ sk , θ k , Φ(sk , θ k ) = u(sk , θ k ). (25)
Based on the reward record τ in (25) and the state transition
from sk to sk +1 , we can formulate a hypothetical anti-jamming
NOMA transmission model to generate several hypothetical
NOMA transmission experiences, in which the reward to the
BS denoted by τ is defined as the utility of the BS averaged over
all the previous real experiences and is given by
Φ(s k ,θ k )
1
τ (s , θ ) =
k k
τ sk , θ k , n . (26)
Φ(sk , θ k ) n =1
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
3386 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 67, NO. 4, APRIL 2018
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
XIAO et al.: REINFORCEMENT LEARNING-BASED NOMA POWER ALLOCATION IN THE PRESENCE OF SMART JAMMING 3387
the transmit power according to the current state and the Q SE of the game has been derived, and conditions assuring its
function, similarly to Algorithm 2. existence have been provided, showing how the anti-jamming
According to [36], the NOMA transmission outperforms communication efficiency of NOMA systems increases with
OMA with less outage probability, which is verified by the the number of antennas and the channel power gains. A fast
simulation results in Fig. 5. More specifically, the Q-learning Q-based NOMA power allocation scheme that combines the
based NOMA system can improve the communication effi- hotbooting technique and Dyna architecture is proposed for a
ciency against jamming attack compared to the OMA system. dynamic game to accelerate the learning and thus improve the
For instance, the Q-learning based anti-jamming NOMA system communication efficiency against smart jamming. As shown in
exceeds the Q-learning based OMA system with 8.7% higher the simulation results, the proposed NOMA power allocation
SINR, 42.8% higher sum data rate, and 7.1% higher utility of scheme can significantly improve the average SINR, the sum
the BS at the 1000-th time slot. data rate and the utility of the BS soon after the start of the
The NOMA transmission performance can be further im- game. For example, the sum data rate of three users in the 5×
proved by the hotbooting technique. For instance, the NOMA 5 MIMO NOMA anti-jamming transmission game increases by
system with hotbooting Q-learning achieves 12% higher SINR, 128% after 1000 time slots, which is 21% higher than that of
10% higher sum data rate, and 9% higher utility, at the 1000-th the standard Q-learning based scheme.
time slot. The reason is that the hotbooting technique can We have analyzed a simplified jamming scenario for MIMO
formulate the emulated experiences in dynamic radio envi- NOMA systems in this work. A future direction of our work
ronments to effectively reduce the exploration trials and thus is to extend the theoretical analysis of the NOMA transmission
significantly improve the convergence speed of Q-learning to more practical scenarios with smart jamming, in which a
and the jamming resistance efficiency. The power allocation jammer uses programmable radio devices to flexibly choose
performance can be further improve by the Dyna architec- multiple jamming policies.
ture with hypothetical experiences. For instance, the average
SINR, the sum data rate, and the utility of the proposed fast
Q-learning algorithm are 11%, 10%, and 7% higher at the APPENDIX A
1000-th time slot, compared with the hotbooting Q-learning PROOF OF LEMMA 1
scheme.
According to (6), if min[Rm ]1≤m ≤M ≥ R0 for ∀0 ≤ pJ ≤
The performance of the proposed power allocation scheme
PJ , we have (31) as shown at the top of the next page,
in the 5 × 5 MIMO NOMA system with two users is evaluated
and thus ∂ 2 u(θ, pJ )/∂pJ 2 > 0, i.e., that u(θ, pJ ) is convex
with the varying channel gains of User 1 in Fig. 6. Both the
in terms of pJ . By (11), we have ∂u(θ, pJ )/∂pJ |p J =0 < 0.
sum capacity of users and the utility of the BS increase with
Thus, u(θ, pJ ) is minimized at ∂u(θ, pJ )/∂pJ = 0, and by Eq.
the channel gain σB ,1 . For instance, as the channel gain σB ,1
(7), p∗J is given by (10). Otherwise, if ∃pJ ≤ PJ that satisfies
increases from 4 to 12 dB, the average SINR, the sum data rate
min[Rm ]1≤m ≤M < R0 , by Eq. (7), p∗J ∈ [pJ , PJ ].
and the utility of the Q-learning based power allocation increase
According to the assumption that the channel condi-
by 9%, 14% and 17%, respectively. If σB ,1 = 12 dB, the fast
tions of User k are stronger than User j (k > j), we
Q-based power allocation achieves the best anti-jamming com-
have hB k ,i > hj,i ∀1 ≤ j < k ≤ NR . In addition, if hm ,i hM ,i <
B B J
munication and performance exceeds the standard Q-learning
hM ,i hm ,i , ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ NR , by (31), we have
B J
based strategy with 9% higher SINR, 14% higher sum capacity,
∂u(θm , p∗J )/∂θm < 0, i.e., (32) as shown at the top of the next
and 22% higher utility.
page. Clearly, u monotonically decreases with θm , ∀1 ≤ m ≤
Fig. 7 shows that the NOMA transmission efficiency improves
M − 1. Meanwhile, if Rm (θ, PJ ) < R0 ∀1 ≤ m ≤ M − 1,
with the number of the receive antennas with NT = 10 transmit
p∗J = PJ and u(θ, p∗J ) = 0. Therefore, u is maximized if
antennas. The proposed power allocation schemes have strong
Rm (θ, PJ ) = R0 , ∀1 ≤ m ≤ M − 1, i.e., that θ ∗ is given by
resistance against smart jamming even with a large number
(9) and thus (8) holds. Therefore, the SE (θ ∗ , p∗J ) is given by
of jamming antennas NJ . For instance, our proposed scheme
(9)–(10).
can improve the average SINR, the sum data rate and the user
utility of the 10 × 8 NOMA system against a jammer with two
antennas by 10%, 12%, and 15%, respectively, compared with APPENDIX B
the benchmark strategy. PROOF OF COROLLARY 1
Similar to the proof of Lemma 1, if min[Rm ]1≤m ≤M ≥ R0
VIII. CONCLUSION for ∀0 ≤ pJ ≤ PJ , by (31), ∂ 2 u(θ, pJ )/∂pJ 2 > 0. By (19),
In this paper, we have formulated an anti-jamming MIMO we have ∂u(θ, pJ )/∂pJ |p J =0 > 0. Thus u(θ, pJ ) is minimized
NOMA transmission game, in which the BS determines the at pJ = 0, and we have p∗J = 0 by Eq. (7). If hB J
m ,i hM ,i <
transmit power to improve its utility based on the sum data hM ,i hm ,i , ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ NR , we have (32) and
B J
rate of the users, and the smart jammer as the follower in the thus θ ∗ given by (18). Therefore, if hB
m ,i hM ,i < hM ,i hm ,i , ∀1 ≤
J B J
Stackelberg game chooses the jamming power at an attempt to m ≤ M − 1, ∀1 ≤ i ≤ NR and (19) holds, we have the SE
interrupt the ongoing transmission at a low jamming cost. The (θ ∗ , 0).
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
3388 IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 67, NO. 4, APRIL 2018
⎛ ⎛ ⎞⎞
−1
NR
M −1
N J θm PT h B NJ 1− M B
n = 1 θn PT hM ,i
u= ⎝ log2 1+
m ,i
+ log2 ⎝1 + ⎠⎠ + γpJ ,
i= 1 m =1
NT NJ + NJ (1 − m θ
n=1 n ) P B J
T hm ,i +NT pJ hm ,i NT NJ +NT pJ hJM ,i
(31)
∗ B hJ − hB hJ − B hB M −1 − B − hB
∗
∂u(θ, pJ )
NR N J P T N T p J h m ,i M ,i M ,i m ,i N J P T h m ,i M ,i n =m + 1 θ n N T N J h M ,i m ,i
= m < 0,
∂θm ∗ J B ∗ M −1
i= 1 ln 2 NT NJ + NT pJ hm ,i +NJ (1− n = 1 θn)PT hm ,i NT NJ + NT pJ hM ,i + NJ 1− n = 1 θn PT hB
J
M ,i
∀1 ≤ m ≤ M − 1. (32)
NR
N J θ1 PT h B
1,i N J (1 − θ1 ) PT h B
2,i
u= log2 1+ + log2 1+ + γpJ , (33)
i= 1
N T N J + N J (1 − θ1 ) P T h B J
1,i + NT pJ h1,i NT NJ + NT pJ hJ2,i
∂u(θ, p∗J )
NR NJ PT NT p∗J hB 1,i h2,i − h2,i h1,i − NT NJ h2,i − h1,i
J B J B B
= < 0. (34)
∂θ1 ∗ J NT NJ + NT p∗J hJ2,i + NJ (1 − θ1 ) PT hB
i= 1 ln 2 NT NJ + NT pJ h1,i + NJ (1 − θ1 ) PT h1,i
B
2,i
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.
XIAO et al.: REINFORCEMENT LEARNING-BASED NOMA POWER ALLOCATION IN THE PRESENCE OF SMART JAMMING 3389
[23] X. Tang, P. Ren, Y. Wang, Q. Du, and L. Sun, “Securing wireless transmis- Canhuang Dai received the B.S. degree in communi-
sion against reactive jamming: A Stackelberg game framework,” in Proc. cation engineering from Xiamen University, Xiamen,
IEEE Global Commun. Conf., San Diego, CA, USA, Dec. 2015, pp. 1–6. China, in 2017, where he is currently working toward
[24] Y. Wu, B. Wang, K. J. R. Liu, and T. C. Clancy, “Anti-jamming games the M.S. degree with the Department of Communica-
in multi-channel cognitive radio networks,” IEEE J. Sel. Areas Commun., tion Engineering. His research interests include smart
vol. 30, no. 1, pp. 4–15, Jan. 2012. grid and wireless communications.
[25] A. Garnaev, M. Baykal-Gursoy, and H. V. Poor, “A game theoretic analysis
of secret and reliable communication with active and passive adversarial
modes,” IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2155–2163,
Mar. 2016.
[26] L. Xiao, N. B. Mandayam, and H. V. Poor, “Prospect theoretic analysis
of energy exchange among microgrids,” IEEE Trans. Smart Grid, vol. 6,
no. 1, pp. 63–72, Jan. 2015. Huaiyu Dai (F’17) received the B.E. and M.S. de-
[27] X. He, H. Dai, P. Ning, and R. Dutta, “A stochastic multi-channel spec- grees in electrical engineering from Tsinghua Univer-
trum access game with incomplete information,” in Proc. IEEE Int. Conf. sity, Beijing, China, in 1996 and 1998, respectively,
Commun., London, U.K., Jun. 2015, pp. 4799–4804. and the Ph.D. degree in electrical engineering from
[28] L. Xiao, Y. Li, J. Liu, and Y. Zhao, “Power control with reinforce- Princeton University, Princeton, NJ, USA, in 2002.
ment learning in cooperative cognitive radio networks against jamming,” He was with Bell Labs, Lucent Technologies,
Springer J. Supercomput., vol. 71, no. 9, pp. 3237–3257, Sep. 2015. Holmdel, NJ, in summer 2000, and with AT&T Labs-
[29] X. Xiao, C. Dai, Y. Li, C. Zhou, and L. Xiao, “Energy trading game for Research, Middletown, NJ, in summer 2001. He is
microgrids using reinforcement learning,” in Proc. EAI Int. Conf. Game currently a Professor of electrical and computer en-
Theory Netw., Knoxville, TN, USA, May 2017, pp. 131–140. gineering with NC State University, Raleigh, NC,
[30] K. S. Hwang, W. C. Jiang, Y. J. Chen, and W. H. Wang, “Model-based USA. His research interests are in the general areas
indirect learning method based on Dyna-Q architecture,” in Proc. IEEE of communication systems and networks, advanced signal processing for dig-
Int. Conf. Systems, Man, Cybern., Manchester, U.K., Oct. 2013, pp. 2540– ital communications, and communication theory and information theory. His
2544. current research focuses on networked information processing and crosslayer
[31] Y. Li, L. Xiao, H. Dai, and H. V. Poor, “Game theoretic study of protecting design in wireless networks, cognitive radio networks, network security, and
MIMO transmissions against smart attacks,” in Proc. IEEE Int. Conf. associated information-theoretic and computation-theoretic analysis.
Commun., London, U.K., May 2017. Dr. Dai has served as an Editor of the IEEE TRANSACTIONS ON COMMUNICA-
[32] C. Wang, J. Chen, and Y. Chen, “Power allocation for a downlink non- TIONS, IEEE TRANSACTIONS ON SIGNAL PROCESSING, and IEEE TRANSACTIONS
orthogonal multiple access system,” IEEE Wireless Commun. Lett., vol. 5, ON WIRELESS COMMUNICATIONS. Currently he is an Area Editor in charge of
no. 5, pp. 532–535, Aug. 2016. wireless communications for the IEEE TRANSACTIONS ON COMMUNICATIONS.
[33] Z. Ding, Z. Yang, P. Fan, and H. V. Poor, “On the performance of non- He coedited two special issues of EURASIP Journals on Distributed Signal Pro-
orthogonal multiple access in 5G systems with randomly deployed users,” cessing Techniques for Wireless Sensor Networks, and on multiuser information
IEEE Signal Process. Lett., vol. 21, no. 12, pp. 1501–1505, Dec. 2014. theory and related applications, respectively. He cochaired the Signal Processing
[34] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: for Communications Symposium of IEEE Globecom 2013, the Communications
A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996. Theory Symposium of IEEE ICC 2014, and the Wireless Communications Sym-
[35] H. H. Kha, H. D. Tuan, and H. H. Nguyen, “Fast global optimal power posium of IEEE Globecom 2014. He was a co-recipient of best paper awards
allocation in wireless networks by local DC programming,” IEEE Trans. at 2010 IEEE International Conference on Mobile Ad-hoc and Sensor Systems
Wireless Commun., vol. 11, no. 2, pp. 510–515, Feb. 2012. (MASS 2010), 2016 IEEE INFOCOM BIGSECURITY Workshop, and 2017
[36] J. Cui, Z. Ding, and P. Fan, “A novel power allocation scheme under outage IEEE International Conference on Communications (ICC 2017).
constraints in NOMA systems,” IEEE Signal Process. Lett., vol. 23, no. 9,
pp. 1226–1230, Sep. 2016.
H. Vincent Poor (S’72–M’77–SM’82–F’87) re-
ceived the Ph.D. degree in electrical engineering and
computer science from Princeton University, Prince-
Liang Xiao (M’09–SM’13) received the B.S. degree
ton, NJ, USA, in 1977. From 1977 until 1990, he was
in communication engineering from the Nanjing Uni-
on the faculty of the University of Illinois at Urbana-
versity of Posts and Telecommunications, Nanjing,
Champaign. Since 1990, he has been on the faculty at
China, in 2000, the M.S. degree in electrical engi-
neering from Tsinghua University, Beijing, China, in Princeton University, where he is the Michael Henry
Strater University Professor of Electrical Engineer-
2003, and the Ph.D. degree in electrical engineering
ing. During 2006–2016, he served as Dean of Prince-
from Rutgers University, New Brunswick, NJ, USA,
ton’s School of Engineering and Applied Science. He
in 2009. She is currently a Professor with the Depart-
ment of Communication Engineering, Xiamen Uni- has also held visiting appointments at several other
institutions, most recently at Berkeley and Cambridge. His research interests are
versity, Fujian, China. She was a Visiting Professor
in the areas of information theory and signal processing, and their applications
with Princeton University, Princeton, NJ, USA, Vir-
in wireless networks, energy systems and related fields. Among his publications
ginia Tech, Blacksburg, VA, USA, and University of Maryland, College Park,
MD, USA. She has served as an Associate Editor of theIEEE TRANSACTIONS in these areas is the recent book Information Theoretic Security and Privacy of
Information Systems (Cambridge University Press, 2017).
INFORMATION FORENSICS AND SECURITY and guest editor of theIEEE JOURNAL
Dr. Poor is a member of the National Academy of Engineering and the Na-
OF SELECTED TOPICS IN SIGNAL PROCESSING. She is the recipient of the Best
tional Academy of Sciences, and is a foreign member of the Chinese Academy
Paper Award for 2016 INFOCOM Big Security WS and 2017 ICC.
of Sciences, the Royal Society and other national and international academies.
Recent recognition of his work includes the 2017 IEEE Alexander Graham
Bell Medal, Honorary Professorships from Peking University and Tsinghua
Yanda Li received the B.S. degree in communica- University, both conferred in 2017, and a D.Sc. honoris causa from Syracuse
tion engineering from Xiamen University, Xiamen, University awarded in 2017.
China, in 2015, where he is currently working toward
the M.S. degree with the Department of Commu-
nication Engineering. His research interests include
network security and wireless communications.
Authorized licensed use limited to: Universitatsbibliothek Erlangen Nurnberg. Downloaded on January 31,2023 at 09:55:21 UTC from IEEE Xplore. Restrictions apply.