Deep Reinforcement Learning for Enhancing the Secrecy of a MU-MISO UOWC Network

2023 IEEE Global Communications Conference: Optical Networks and Systems
Elmehdi Illi, Emna Baccour, Marwa Qaraqe, and Mounir Hamdi
College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar.
emails: elmehdi.illi@ieee.org, {ebaccourepbesaid, mqaraqe, mhamdi}@hbku.edu.qa
GLOBECOM 2023 - 2023 IEEE Global Communications Conference | 979-8-3503-1090-0/23/$31.00 ©2023 IEEE | DOI: 10.1109/GLOBECOM54140.2023.10437117

Abstract: In this paper, we propose a Deep Reinforcement Learning (DRL) framework to optimize the secrecy performance of a Multi-User (MU) Multiple-Input Single-Output (MISO) Underwater Optical Wireless Communication (UOWC) system. The network consists of several light-emitting diodes connected with various underwater users through optical beams. The legitimate transmission is threatened by several eavesdroppers attempting to overhear the confidential message sent to each user. Thus, digital precoding is employed to cancel the inter-user interference and maximize the per-user secrecy rate and, consequently, the secrecy sum rate (SSR). Leveraging the developed DRL algorithm, the MU-MISO precoding matrix is optimized to enhance the system's SSR. Numerical results show the superiority of the proposed DRL framework compared to the baseline zero-forcing and random precoding schemes, even with corrupted Channel State Information (CSI) at the transmitter due to seawater dynamics and estimation errors.

I. INTRODUCTION

Underwater Optical Wireless Communication (UOWC) technology has witnessed a remarkable evolution over the past few years. It has been gaining credit for reaching the targeted high data rates thanks to the huge amount of bandwidth available in the visible light spectrum [1]. In addition, UOWC exhibits several other advantages, such as its low power consumption and inherent security when operating with narrow light beams. Nonetheless, in spite of these benefits, several challenges impede the wide deployment of UOWC, such as oceanic turbulence, pointing errors, and thermal noise. To remedy this, spatial diversity through multiple light-emitting diodes (LEDs) and photodetectors (PDs) has been widely advocated as an efficient way to boost the performance of UOWC systems.

Physical Layer Security (PLS) has received significant attention as a means of securing futuristic networks with minimal overhead. PLS focuses on securing transmission using physical layer parameters such as channel coding, multiantenna diversity, and channel fading, without the need for complex encryption methods [2]. Though PLS shows promise in RF communications, its implementation in Optical Wireless Communications (OWC) often remains overlooked. Despite the inherent security of OWC due to narrow optical beams, eavesdropping attacks can still occur in several ways, such as positioning an eavesdropper in the beam divergence region [3].

Despite the broad inspection of PLS over RF wireless networks, the literature contains only a handful of works on UOWC networks. For instance, works such as [4] analyzed experimentally the secrecy of a UOWC system. On the other hand, sporadic works evaluated the secrecy of OWC networks in indoor environments, i.e., visible light communications, or in terrestrial free-space optics [2], [3], [5], [6]. Additionally, other works dealt with the optimization/analysis of precoding techniques for UOWC Multiple-Input Multiple-Output (MIMO) systems, such as [7], [8]. Interestingly, baseline precoding techniques, such as Zero-Forcing (ZF), have shown acceptable performance in the case of perfect CSI acquisition at the transmitter [9]. Nonetheless, obtaining perfect CSI at the transmitter from the various receivers through feedback channels is far from feasible due to the limited feedback resolution and the channel time decorrelation.

Deep Reinforcement Learning (DRL) has emerged as a viable tool to solve complex and dynamic optimization problems in wireless communication systems. Such a family of learning algorithms relies on interacting with the environment and taking actions, based on which rewards and penalties are received to assess the accuracy of these actions [10]. Thus, the system learns to optimize its policy over time and to maximize/minimize the objective function of interest through a learned optimal policy. Accordingly, DRL has been applied to problems such as optimizing the beamforming and precoding matrices in multi-user (MU)-MIMO networks [11].

To this end, we propose in this work a DRL framework to maximize the achievable Secrecy Sum Rate (SSR) of a MU-MISO UOWC system subject to turbulence-induced fading. Multiple LEDs at the transmitter serve several underwater nodes, where we consider the presence of the same number of eavesdroppers overhearing the legitimate messages. The DRL algorithm, based on Deep Neural Networks (DNN), optimizes the precoding matrix under a peak optical power constraint to maximize the network's SSR, given only outdated and noise-corrupted CSI observations at the transmitter. To the best of our knowledge, the current work is the first to optimize a MU-MISO UOWC network's PLS in the presence of underwater turbulence, CSI aging, and estimation errors.

II. SYSTEM AND CHANNEL MODEL

A. System Model

Let us consider a UOWC system. A transmitter ($T$) with $K$ LEDs illuminates the area underneath where $N$ legitimate
underwater sensors $(U_n)_{n=1,\ldots,N}$ are placed. It is considered that the number of LEDs is at least the number of sensors (users), i.e., $K \geq N$. In the meantime, $N$ eavesdroppers $(E_n)_{n=1,\ldots,N}$ are attempting to illegitimately overhear and decode the confidential messages, whereby each eavesdropper ($E_n$) is attempting to overhear the signal of a legitimate user ($U_n$). The received optical signal at $U_n$ and $E_n$ from the $k$th optical aperture is expressed as [12]

$$y_V^{(n,k)} = R\, h_{g,V}^{(n,k)} h_{u,V}^{(n,k)} x_n + z_{V,n}, \quad V \in \{U, E\}, \tag{1}$$

for $k = 1, \ldots, K$ and $n = 1, \ldots, N$, where $R$ is the photodetectors' responsivity,

$$h_{g,V}^{(n,k)} = \begin{cases} \dfrac{A(m+1) \cos^m\left(\phi_V^{(n,k)}\right) \cos\left(\Psi_V^{(n,k)}\right)}{2\pi \left(d_V^{(n,k)}\right)^2 \exp\left(c\, d_V^{(n,k)}\right)}, & 0 \leq \Psi_V^{(n,k)} \leq \Psi_c \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

is the geometric loss of the channel, where $A$ is the receivers' photodetector area, $m = -\ln 2 / \ln \cos \Phi_{1/2}$ is the Lambertian emission order with $\Phi_{1/2}$ being the half-power angle of the transmitter, $\phi_V^{(n,k)}$ is the angle between the transmitter's normal axis and the receiver, $\Psi_V^{(n,k)}$ is the incidence angle at the $n$th malign/benign node from the $k$th LED, $\Psi_c$ is the receiver's field of view, $d_V^{(n,k)}$ is the distance between the $k$th transmit LED and the $n$th receiver ($V \in \{U, E\}$), and $c$ is the medium's extinction coefficient (in m$^{-1}$), encompassing both the absorption and scattering effects. Each user's legitimate information signal $x_n$ is intensity-modulated and fulfills both peak amplitude and non-negativity constraints as $0 \leq x_n \leq A_m$, with $A_m = \eta \mu I_{DC}$, where $\eta$, $I_{DC}$, and $\mu$ represent the LEDs' electrical-to-optical conversion ratio, DC component, and modulation index, respectively [6]. Furthermore, $h_{u,V}^{(n,k)}$ is the underwater channel loss due to water turbulence. To this end, the received signal vector at the $N$ malign/benign users can be expressed as follows

$$\mathbf{y}_V = R\, \mathbf{H}_V \mathbf{x} + \mathbf{z}_V, \quad V \in \{U, E\}, \tag{3}$$

where $\mathbf{H}_V = \left[\mathbf{h}_{V,1}^T, \mathbf{h}_{V,2}^T, \ldots, \mathbf{h}_{V,N}^T\right]^T$ is the $N \times K$ channel matrix between the $K$ optical apertures and the $N$ legitimate/illegitimate users, $\mathbf{h}_{V,n} = \left[h_V^{(n,1)}, \ldots, h_V^{(n,K)}\right]$ is the channel vector between the $K$ transmit LEDs and the $n$th legitimate/illegitimate receiver with $h_V^{(n,k)} = h_{g,V}^{(n,k)} h_{u,V}^{(n,k)}$, and the superscript $T$ stands for the transpose of a vector/matrix. In addition, $\mathbf{x} = [x_1, x_2, \ldots, x_N]$ is the intensity-modulated data symbol vector of the $N$ users, and $\mathbf{z}_V = [z_{V,1}, z_{V,2}, \ldots, z_{V,N}]$ is the zero-mean additive white Gaussian noise vector, whereby the variance of each of its elements is $\sigma_{V,n}^2$, composed of thermal and shot noise powers as [13]

$$\sigma_{V,n}^2 = B_w \left( q \left[ 2 R \eta \left\| \mathbf{h}_{V,n} \right\|_\infty I_{DC} + 4\pi A R \chi_a (1 - \cos \Psi) \right] + i_a^2 \right), \tag{4}$$

where $i_a$ is the pre-amplifier noise current density, $q$ is the elementary charge, $B_w$ is the receiver's bandwidth, $\|\cdot\|_\infty$ is the $L_\infty$ norm, and $\chi_a$ is the ambient light photocurrent. It is worth highlighting that each receiver will receive an aggregate signal from all transmitting LEDs, whereby inter-user interference arises from the reception at each user $V_n$ of undesired signals of other users $V_i$ ($i \neq n$), as manifested by (3). To this end, precoding is incorporated as an efficient way to cancel/reduce the inter-user interference effects at each node, by which the transmit symbols vector is designed as follows: $\mathbf{s} = \mathbf{W}\mathbf{x}$, where $\mathbf{W} = \left[\mathbf{w}_1^T, \mathbf{w}_2^T, \ldots, \mathbf{w}_N^T\right]$ is the $K \times N$ precoding matrix and $\mathbf{w}_n = [w_{n,1}, w_{n,2}, \ldots, w_{n,K}]$ is the precoding vector of the $n$th user.

B. Channel Model and Statistics

The LogNormal (LN) distribution represents the underwater turbulence effects well [14], where $h_{u,V}^{(n,k)} = e^{2 X_V^{(n,k)}}$, with $X_V^{(n,k)}$ being a Gaussian random variable of mean $\mu_V$ and variance $\sigma_V^2$ ($\forall n, k$), i.e., independent and identically-distributed turbulence channels. Also, to keep a constant average power, we set $\mu_V = -\sigma_V^2$, where $\sigma_V^2 = \frac{1}{4} \log\left(1 + \sigma_I^2\right)$ and $\sigma_I^2$ is the Rytov variance. The turbulence-induced fading is subject to spatiotemporal decorrelation properties caused by sea motion. The authors in [15] provided an analytical model to express the temporal correlation coefficient between two turbulence-induced fading observations $h_{u,V}^{(n,k,t)}$ and $h_{u,V}^{(n,k,t+\tau)}$, at time instants $t$ and $t + \tau$, as

$$\rho = \frac{B(L, \tau)}{\sigma_V^2}, \tag{5}$$

where

$$B(L, \tau) = 8\pi^2 k^2 L \int_0^\infty x\, \Phi_n(x)\, J_0(x v \tau) \left[ 1 - \frac{\sin\left(x^2 L / k\right)}{x^2 L / k} \right] \mathrm{d}x \tag{6}$$

is the channel temporal covariance, where $k = 2\pi/\lambda$ is the wave number with $\lambda$ denoting the wavelength, $L$ is the receiver's depth underwater, $v$ is the velocity of the ocean waves, $\Phi_n(\cdot)$ is the power spectrum function, given by [15, Eq. (1)], and $J_0(\cdot)$ is the 0th-order Bessel function of the first kind.

III. SECRECY RATE ANALYSIS

Intensity Modulation and Direct Detection (IM/DD) is the widely adopted signaling/detection scheme in UOWC due to its low implementation cost and simplicity. Under the optical signal's non-negativity constraint, it has been demonstrated that the traditional Shannon capacity formula with Gaussian signaling is unsuitable for UOWC channels. To this end, upper and lower bound expressions for the channel capacity (i.e., maximal achievable rate) are considered when subject to both peak and average allowed optical and electrical signal powers as [6]

$$R_{V,lb}^{(n)} \leq R_V^{(n)} \leq R_{V,ub}^{(n)}, \quad (V \in \{U, E\}), \tag{7}$$

with
$$R_{V,lb}^{(n)} = \frac{1}{2} \log \left( \frac{\delta_{V,n} \sum_{p=1}^{N} \left( \mathbf{h}_{V,n} \mathbf{w}_p^T \right)^2 + 1}{\xi_{V,n} \sum_{p=1, p \neq n}^{N} \left( \mathbf{h}_{V,n} \mathbf{w}_p^T \right)^2 + 1} \right), \tag{8}$$

$$R_{V,ub}^{(n)} = \frac{1}{2} \log \left( \frac{\xi_{V,n} \sum_{p=1}^{N} \left( \mathbf{h}_{V,n} \mathbf{w}_p^T \right)^2 + 1}{\delta_{V,n} \sum_{p=1, p \neq n}^{N} \left( \mathbf{h}_{V,n} \mathbf{w}_p^T \right)^2 + 1} \right), \tag{9}$$

$\delta_{V,n} = \frac{R^2 \exp(2 q_x)}{2\pi \exp(1) \sigma_{V,n}^2}$, and $\xi_{V,n} = \frac{R^2 \sigma_x^2}{\sigma_{V,n}^2}$, where $q_x$ is the maximal transmit signal entropy and $\sigma_x^2$ is the average electrical power, both expressed in closed form in terms of $A_m$, the Gamma function $\Gamma(\cdot)$, and the incomplete Gamma function $\gamma_{inc}(\cdot, \cdot)$ [6]. Therefore, it yields from (8) and (9) that the secrecy rate (SR) of the $n$th legitimate link can be lower-bounded as

$$R_s^{(n)} \geq \frac{1}{2} \left[ \log \left( \frac{\delta_{U,n} \sum_{p=1}^{N} \left( \mathbf{h}_{U,n} \mathbf{w}_p^T \right)^2 + 1}{\xi_{U,n} \sum_{p=1, p \neq n}^{N} \left( \mathbf{h}_{U,n} \mathbf{w}_p^T \right)^2 + 1} \right) - \log \left( \frac{\xi_{E,n} \sum_{p=1}^{N} \left( \mathbf{h}_{E,n} \mathbf{w}_p^T \right)^2 + 1}{\delta_{E,n} \sum_{p=1, p \neq n}^{N} \left( \mathbf{h}_{E,n} \mathbf{w}_p^T \right)^2 + 1} \right) \right]. \tag{11}$$

IV. PROBLEM FORMULATION AND PROPOSED DRL FRAMEWORK

In this section, we detail the considered problem of achievable SSR maximization of the considered UOWC network. Then, we provide the proposed DRL framework to solve it.

A. SSR Maximization Problem Formulation

The main aim is to maximize the considered MU-MISO UOWC network's achievable SSR, subject to eavesdropping attacks. In particular, one can infer from (11) that the per-user achievable SR depends essentially on the legitimate and wiretap channel gains, along with the choice of the precoding matrix $\mathbf{W}$. Thus, the maximum of the sum of individual SRs can be attained by properly optimizing $\mathbf{W}$. We can formulate the SSR maximization problem as follows

$$\max_{\mathbf{W}} \sum_{n=1}^{N} R_s^{(n)} \tag{12a}$$

$$\text{(C1)}: A_m \left\| \mathbf{w}_k \right\|_\infty \leq \eta I_{DC}, \quad \forall k = 1, \ldots, K, \tag{12b}$$

$$\text{(C2)}: \mathbf{H}_E \mathbf{W} = \mathbf{0}, \tag{12c}$$

where $\mathbf{w}_k = [w_{k,1}, w_{k,2}, \ldots, w_{k,N}]$ is the precoding vector corresponding to the $k$th LED. The above SSR maximization problem is conditioned by two constraints, namely:

1) (C1): referring to the peak optical power allowed at the output of each of the $K$ transmit LEDs. Given that $A_m = \eta \mu I_{DC}$, it reduces to

$$\text{(C1)}: \left\| \mathbf{w}_k \right\|_\infty \leq \frac{1}{\mu}, \quad \forall k = 1, \ldots, K, \tag{13}$$

2) (C2): the second constraint ensures that the precoding matrix lies in the null space of the eavesdroppers' channel matrix, i.e., guaranteeing a null signal-to-noise ratio at the eavesdroppers.

B. Zero-Forcing Precoder

The Zero-Forcing precoder is among the well-known precoding techniques, aiming to create parallel streams and cancel inter-user interference. It is based on building the precoding matrix as the pseudo-inverse of the channel matrix between the different LEDs and the various legitimate receivers, as follows [9]

$$\mathbf{W}_{ZF} = \mathbf{H}_U^T \left( \mathbf{H}_U \mathbf{H}_U^T \right)^{-1}. \tag{14}$$

As a result, by plugging the above ZF precoding matrix into (11), one obtains

$$R_{s,ZF}^{(n)} \geq \frac{1}{2} \left[ \log \left( \delta_{U,n} + 1 \right) - \log \left( \frac{\xi_{E,n} \sum_{p=1}^{N} \left( \mathbf{h}_{E,n} \mathbf{w}_p^T \right)^2 + 1}{\delta_{E,n} \sum_{p=1, p \neq n}^{N} \left( \mathbf{h}_{E,n} \mathbf{w}_p^T \right)^2 + 1} \right) \right]. \tag{15}$$

C. Proposed DRL Framework

The problem at hand, given by (12a)-(12c), is an NP-hard one [2]. Furthermore, it has been shown that the ZF precoding technique exhibits two main limitations, namely:

1) The ZF precoding matrix in (14) is not the optimal one maximizing the sum rate or SSR of a MU-MISO/MIMO network [9]. In fact, finding the optimal precoding weights maximizing (12a) is a non-convex optimization problem,

2) ZF is based on creating orthogonal streams across the different users to avoid inter-user interference [9]. Nonetheless, the presence of imperfect and/or outdated CSI leads to inevitable performance degradation as the orthogonality between users is broken.

The emergence of DRL has stimulated its employment to solve non-convex optimization problems, e.g., optimizing the precoding/beamforming matrix in a MU-MISO/MIMO network. The DRL agent interacts with the environment, takes actions from the tunable system parameters, and gets rewards based on its choices. Thus, it learns from such interactions and rewards the optimal policy to maximize/minimize the objective function in question. Capitalizing on this, we propose the use of a DRL framework for solving the problem in (12a)-(12c). Generally, reinforcement learning (RL) represents an approach for solving Markov decision process (MDP) problems, represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \varrho)$, where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, and $\mathcal{P}$ represent the respective sets of possible environment states, actions, rewards, and state transition probability values, while $\varrho$ stands for the
discount factor. We detail in the sequel the proposed DRL design and components.

1) Agent and Environment: The environment consists of the considered UOWC system, represented by the transmitter's $K$ LEDs and the $N$ legitimate users and eavesdroppers. The single RL agent, established at the transmitter, interacts with the environment by taking sequential actions to optimize the precoding matrix $\mathbf{W}$ so as to maximize the achievable SSR. At a given discrete time step of index $t \in \mathbb{N}^*$, the environment, being in state $s_t \in \mathcal{S}$, receives an action $a_t \in \mathcal{A}$ generated by the agent, grants a reward $r_t \triangleq r(a_t, s_t) \in \mathcal{R}$, and moves to the upcoming state (next training time step) $s_{t+1}$ with a probability $P(s_t, a_t, s_{t+1})$ from $\mathcal{P}$. Thus, the agent undergoes the learning process throughout several time steps. We define each time step as the optimization of the precoding vector corresponding to a given LED with respect to the $N$ users. Consequently, each block of $K$ successive time steps represents an episode, constituting the optimization of the whole precoding matrix $\mathbf{W}$. Furthermore, it is worth highlighting that the cumulative reward counter, computed by accumulating the rewards of each episode's steps, is reset to zero at the beginning of each episode. To this end, the agent is trained to learn by leveraging its experience over a total of $T_{max}$ episodes of $K$ time steps each through taking actions and getting corresponding rewards. The agent aims at finding the optimal policy $\Pi$ maximizing the future expected reward, expressed by the classical Bellman equation as [10]

$$Q_\Pi(s, a) = \mathbb{E} \left[ \sum_{t \geq 1} \varrho^t r_t \,\middle|\, s_0 = s, a_0 = a, \Pi \right], \tag{16}$$

where $r_t$ is the reward function detailed in the sequel. Also, the discount factor $\varrho$ represents the significance of subsequent per-episode rewards on the expected global one. Therefore, the optimal policy is the set of actions maximizing the Q-value, given by (16), as follows: $\Pi^* = \arg\max_a Q_\Pi(s, a)$.

2) States and Actions: The proposed DRL algorithm optimizes the precoding matrix $\mathbf{W}$ sequentially. At each time step $t$, the taken action $a_t$ consists of the precoding matrix's current row of $N$ elements, i.e., $a_t = \mathbf{w}_{\mathrm{mod}(t,K)}^{(t)}$, where $\mathbf{w}_k^{(t)}$ is the precoding vector corresponding to the $k$th LED at time slot $t$. We consider that the channel realization remains constant throughout each episode of $K$ time slots, while it changes from one episode to another according to the temporal correlation model, detailed in (5), where each two successive episodes are $\tau$ seconds apart. Due to the channel decorrelation over time, the CSI estimation at the transmitter undergoes aging effects. Therefore, at each episode of index $m$, the agent has access to only an outdated version of the channel matrix, from the previous episode, corrupted by estimation errors, i.e.,

$$\hat{\mathbf{H}}_V^{(m(t)-1)} = \mathbf{H}_V^{(m(t)-1)} + \mathbf{e}_V^{(m(t))} \quad (V \in \{U, E\}), \tag{17}$$

where the notation $m(t)$ signifies the dependence of the current episode index $m$ on the current global time step index $t$, and the element at the $n$th row and $k$th column of $\mathbf{e}_V^{(m(t))}$, denoted by $e_{V,n,k}^{(m(t))}$, is a zero-mean Gaussian estimation error with standard deviation $\sigma_e$. To this end, the current state of the environment is manifested by the outdated and imperfect CSI of both legitimate and wiretap links, and the previous time step's precoding matrix and achievable SSR, as $s_t = \left\{ \hat{\mathbf{H}}_U^{(m(t)-1)}, \hat{\mathbf{H}}_E^{(m(t)-1)}, \mathbf{W}^{(t-1)}, R_S^{(t-1)} \right\}$, with

$$R_S^{(t)} \triangleq \sum_{n=1}^{N} R_s^{(n,t)}. \tag{18}$$

3) Rewards: The agent's aim is to maximize the achievable SSR, given by (18), under two constraints, as shown in (12a)-(12c). Therefore, the reward is directly linked with the achievable SSR. Also, the constraint (C1) in (12b) can be ensured at each time step through the normalization

$$\frac{\mathbf{w}_k^{(t)}}{\max_{j=1,\ldots,K} \left\| \mathbf{w}_j^{(t)} \right\|_\infty}, \quad \text{if } \left\| \mathbf{w}_k^{(t)} \right\|_\infty > \frac{1}{\mu}. \tag{19}$$

Furthermore, to make sure that the SSR (i) exceeds the baseline ZF one and (ii) fulfills (C2), two penalty terms are subtracted from the reward, namely the SSR of the ZF scheme and a constant $\beta^{(t)}$, as

$$r_t = R_S^{(t)} - R_{S,ZF}^{(t)} - \beta^{(t)}, \tag{20}$$

where $\beta^{(t)}$ is a penalty term included at time step $t$ if the constraint on $\mathbf{W}$ to match the null space of $\mathbf{H}_E$ is not respected (i.e., violation of C2). Also, we have $R_{S,ZF}^{(t)} = \sum_{n=1}^{N} R_{S,ZF}^{(n,t)}$ as the achievable SSR of the ZF scheme, where $R_{S,ZF}^{(n,t)}$ can be computed by incorporating (14), evaluated using the imperfect CSI matrix $\hat{\mathbf{H}}_U^{(m(t)-1)}$, into (11).

4) DRL Algorithm: Due to the channel dynamics in the UOWC channel, we opt for using a DNN-enabled DRL framework to reach the optimal policy. Without loss of generality, we make use of the Proximal Policy Optimization (PPO) algorithm, as was elaborated in [10]. PPO is a policy-based approach suitable for stochastic policies with either a discrete or continuous action space. In the considered scenario, we assume that the precoding coefficients $w_{k,n}$ are set through discrete actions. The pseudocode of the considered PPO algorithm is detailed in Algorithm 1. The initial phase consists of initializing two networks with identical sets of weights, which establishes two PPO policies. Then, the training process starts by generating samples from the policy, i.e., $\{s_t, a_t, r_t\}$, with fixed parameters ($\theta_p$) over different episodes, where such samples are saved in an experience memory $\mathcal{D}$ (lines 9-13). A policy gradient estimator is evaluated at the end of each episode as follows

$$\hat{E}_t^{(s)} = \sum_{j=0}^{\infty} (\varrho \zeta)^j \kappa_{t+j}^{(s)}, \tag{21}$$

where

$$\kappa_t^{(s)} = r_t + \varrho V(s_{t+1}, \theta) - V(s_t, \theta), \tag{22}$$

and $V(s_t, \theta)$ is the expected reward obtained by averaging (16) over all possible actions with the state $s_t$. Afterward, a random mini-batch from the experience memory is sampled
(lines 15-16). We highlight that the learning rate $\zeta$ adjusts the bias-variance trade-off. Then, the main policy $\Pi(\theta)$ is updated by finding the $\theta$ value maximizing

$$L^{(clip)}(\theta) = \mathbb{E} \left[ \min \left( p_t(\theta) \hat{E}_t^{(s)}, \ \mathrm{clip}\left( p_t(\theta), 1 - \varepsilon, 1 + \varepsilon \right) \hat{E}_t^{(s)} \right) \right], \tag{23}$$

which is a function of the samples $\{s_t, a_t, r_t\}$, generated according to $\Pi(\theta_p)$ for each batch's sample, where $p_t(\theta) = \frac{\Pi(a_t | s_t, \theta)}{\Pi(a_t | s_t, \theta_p)}$ is the policy probability ratio. A noteworthy process component is the clip function, which restricts $p_t(\theta)$ to range between $1 - \varepsilon$ and $1 + \varepsilon$, where $\varepsilon$ is the clip range. Finally, both policies are synchronized by replacing $\theta_p$ with $\theta$.

Algorithm 1: PPO Secure Precoding.
Data: $\zeta$ (learning rate), $\varrho$ (discount factor), $T_{max}$
Result: $\theta_p$ (DNN optimal weights)
1  begin
2    Random setting of the DNN weights (i.e., $\theta$) to get $\Pi(\theta)$ \\ Initialization
3    $\theta_p \leftarrow \theta$
4    for $m \leftarrow 1$ to $T_{max}$ do
5      \\ loop over episodes
6      for $k \leftarrow 1$ to $K$ do
7        \\ loop over time steps
8        $t \leftarrow (m-1)K + k$ \\ global time index
9        $s_t \leftarrow \left\{ \hat{\mathbf{H}}_U^{(m-1)}, \hat{\mathbf{H}}_E^{(m-1)}, \mathbf{W}^{(t-1)}, R_S^{(t-1)} \right\}$
10       $a_t \leftarrow \mathbf{w}$ \\ according to $\Pi(\theta_p)$
11       compute $r_t$ using (20)
12       $s_{t+1} \leftarrow \left\{ \hat{\mathbf{H}}_U^{(m-1)}, \hat{\mathbf{H}}_E^{(m-1)}, \mathbf{W}^{(t)}, R_S^{(t)} \right\}$
13       Save $(s_t, a_t, r_t, s_{t+1})$ in the experience memory $\mathcal{D}$
14     end
15     Evaluate $\hat{E}_t^{(s)}$ using (21) and (22).
16     Sample a mini-batch of $(s_n, a_n, r_n, s_{n+1})$ from $\mathcal{D}$
17     $\theta \leftarrow \arg\max_y L^{(clip)}(y)$, \\ given by (23)
18     $\theta_p \leftarrow \theta$ \\ Updating the previous policy
19   end
20 end

V. NUMERICAL RESULTS

In this section, we provide illustrative numerical results to showcase the performance of the proposed DRL framework for maximizing the achievable SSR. Unless otherwise stated, the considered network and DRL parameters' values are set as follows: $A_m = 25$ dBm, $K = 6$, $N = 2$, where the default Cartesian coordinates (in meters) of the different LEDs and users are set as $x_{T_k} = 3 + 0.2(k - 1)$ for $k = 1, \ldots, K$, $y_{T_k} = 4$, $z_{T_k} = 5$ ($\forall k$), $x_U = \{2.25, 3.56\}$, $x_E = \{4, 0.4\}$, $y_U = \{3, 4\}$, $y_E = \{2.5, 4.5\}$, $z_U = z_E = 0$. It is worth noting that the angle between the $k$th LED and the $n$th benign/malign node is evaluated as $\phi_V^{(n,k)} = \arccos\left( \frac{z_{T_k} - z_{V_n}}{d_V^{(n,k)}} \right)$, with $\Psi_V^{(n,k)} = \phi_V^{(n,k)}$, and

$$d_V^{(n,k)} = \sqrt{(x_{T_k} - x_{V_n})^2 + (y_{T_k} - y_{V_n})^2 + (z_{T_k} - z_{V_n})^2}. \tag{24}$$

Also, we set $A = \pi r^2$ with $r = 1$ cm, $\mu = 0.2$, $n_{ref} = 1.5$, $\Psi_c = \pi/6$, $\Phi_{1/2} = \pi/10$, $q = 1.6 \times 10^{-19}$ C, $\chi_a = 10.93$ A/(m$^2$·Sr), $i_a = 5 \times 10^{-12}$ A/Hz$^{1/2}$, $\eta = 0.44$ W/A, $R = 0.54$ A/W, and $c = 0.3$ m$^{-1}$ [6]. In addition, we set $\sigma_e^2 = 10^{-5}$, $\rho = 0.7$, and $\tau = 0.4$ ms, corresponding to an ocean wave velocity of $v = 1$ m/s and $\lambda = 532$ nm (green wavelength). Furthermore, the DRL parameters are set as $\varrho = 0.99$, $\beta^{(t)} = 2$ ($\forall t$), $\zeta = 10^{-3}$, $\varepsilon = 0.2$, and $T_{max} = 10^5$. Lastly, it is worth mentioning that the presented results are smoothed through a moving average over 500 episodes.

Fig. 1: Achievable SSR vs. the number of episodes for various N values.

Fig. 1 presents the achievable SSR of the considered system for different values of the number of users ($N$). In particular, we set $N = 2$ with the default coordinate values, $N = 3$ with $x_U = \{2.25, 3.56, 3\}$, $x_E = \{4, 0.4, 0.6\}$, $y_U = \{3, 4, 4.5\}$, $y_E = \{2.5, 4.5, 1.25\}$, and $N = 4$ with $x_U = \{2.25, 3.56, 3, 2\}$, $x_E = \{4, 0.4, 0.6, 2\}$, $y_U = \{3, 4, 4.5, 2\}$, $y_E = \{2.5, 4.5, 1.25, 2.75\}$, while $x_T$, $y_T$, $z_T$, $z_U$, and $z_E$ are set by default. It can be noted that the proposed DRL scheme yields an increasing secrecy behavior at initial episodes, i.e., the learning phase, before reaching convergence. This is due to the fact that the RL agent initially takes random actions on the precoding coefficients, producing either a lower received signal power and/or a higher inter-user interference at each legitimate user, or a higher signal power and/or lower interference at each eavesdropper, as manifested by the first and second terms of (11), respectively. Consequently, this leads to a lower achievable SSR. Nonetheless, the system sequentially learns how to fulfill the problem's constraints and maximize its reward. Importantly, the DRL scheme reaches convergence for $N = 2$ after 10000 episodes and after 40000 episodes for $N = 3$ and 4, due to the high complexity of solving the problem with an increased number of users. In addition, the DRL framework yields a better achievable SSR compared
to its ZF and random precoding counterparts, whereby the respective gain vs. the ZF one is 245%, 240%, and 435%, for $N = 2$, 3, and 4, respectively, while it outperforms the random precoding scheme by 968%, 701%, and 791% for the same respective $N$ values. Indeed, this is due to the fact that the ZF precoder's performance is jeopardized by error floors in the presence of CSI imperfections, as demonstrated in [9].

Fig. 2: Achievable SSR over the number of episodes for various values of ρ.

In Fig. 2, we show the proposed scheme's secrecy performance for different values of the channel correlation $\rho$. The considered $\rho$ values are 0.7, 0.47, and 0.25, corresponding to $(v, \tau) = (1$ m/s, 0.4 ms$)$, $(2$ m/s, 0.3 ms$)$, and $(2$ m/s, 0.4 ms$)$. The performance curves show that the proposed DRL framework yields a notable secrecy improvement versus the baseline ZF, even when the outdated CSI is 25%-correlated with the actual one. The proposed scheme's SSR yields an average gain of 237%, 253%, and 263%, for $\rho = 0.7$, 0.47, and 0.25, respectively, compared to its ZF counterpart.

Fig. 3: Achievable SSR over the number of episodes for various values of Am in dBm.

The system's secrecy is shown in Fig. 3 for different values of the peak optical power $A_m$ with $\rho = 0.5$. Once more, the proposed DRL scheme reveals a significant achievable SSR boost compared to the considered baseline schemes, for which the higher the peak optical power, the better the secrecy performance. In addition, the DRL scheme reaches an average achievable SSR of 9.17, 7.26, and 4.73 Nats/s/Hz for $A_m = 25$, 15, and 5 dBm, respectively. Furthermore, the DRL exhibits an average gain of 253%, 213%, and 204% with respect to the ZF for the above respective $A_m$ values, while it surpasses the random precoding scheme by 970%, 676%, and 499%, for the same $A_m$ values, in the same order.

VI. CONCLUSION

In this paper, we proposed a DRL framework to enhance the secrecy performance of a MU-MISO UOWC system subject to oceanic turbulence-induced fading, channel aging, and estimation errors. The proposed DRL algorithm aims at optimizing the system's precoding matrix by leveraging the outdated and imperfect CSI observations at the transmitter of the legitimate and wiretap channels. The provided results showed the superiority of the proposed DRL scheme against the baseline ZF, even with limited CSI, for different numbers of LEDs and users, and channel decorrelation levels over time.

ACKNOWLEDGMENT

This research was sponsored in part by the NATO Science for Peace and Security Programme under grant SPS G5797.

REFERENCES

[1] E. Illi et al., “Physical layer security of a dual-hop regenerative mixed RF/UOW system,” IEEE Trans. Sust. Comp., vol. 6, no. 1, pp. 90–104, 2021.
[2] A. Mostafa and L. Lampe, “Physical-layer security for MISO visible light communication channels,” IEEE J. Sel. Areas Commun., vol. 33, no. 9, pp. 1806–1818, 2015.
[3] F. J. Lopez-Martinez, G. Gomez, and J. M. Garrido-Balsells, “Physical-layer security in free-space optical communications,” IEEE Photonics J., vol. 7, no. 2, pp. 1–14, Apr. 2015.
[4] J. Zhang et al., “Secure and noise-resistant underwater wireless optical communication based on spectrum spread and encrypted OFDM modulation,” Opt. Express, vol. 30, no. 10, pp. 17140–17155, May 2022.
[5] L. Yin and H. Haas, “Physical-layer security in multiuser visible light communication networks,” IEEE J. Sel. Areas Commun., vol. 36, no. 1, pp. 162–174, Jan. 2018.
[6] M. A. Arfaoui et al., “Secrecy performance of multi-user MISO VLC broadcast channels with confidential messages,” IEEE Trans. Wireless Commun., vol. 17, no. 11, pp. 7789–7800, Nov. 2018.
[7] G. Huang et al., “A zero-forcing precoder and mode modulation aided orbital angular momentum multiplexing transceiver for underwater transmissions,” in 2021 IEEE/CIC Intern. Conf. Communications China (ICCC), 2021, pp. 1–5.
[8] A. Amantayeva et al., “Multiuser MIMO for underwater visible light communication,” in 2018 International Conference on Computing and Network Communications (CoCoNet), 2018, pp. 164–168.
[9] T. Yoo and A. Goldsmith, “On the optimality of multiantenna broadcast scheduling using zero-forcing beamforming,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 528–541, Mar. 2006.
[10] R. Hamdi et al., “LoRa-RL: Deep reinforcement learning for resource management in hybrid energy LoRa wireless networks,” IEEE Internet Things J., vol. 9, no. 9, pp. 6458–6476, 2022.
[11] Q. Hu et al., “Joint deep reinforcement learning and unfolding: Beam selection and precoding for mmWave multiuser MIMO with lens arrays,” IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2289–2304, 2021.
[12] H. Shen et al., “Rate-maximized zero-forcing beamforming for VLC multiuser MISO downlinks,” IEEE Photonics J., vol. 8, no. 1, pp. 1–13, 2016.
[13] T. V. Pham and A. T. Pham, “Coordination/cooperation strategies and optimal zero-forcing precoding design for multi-user multi-cell VLC networks,” IEEE Trans. Commun., vol. 67, no. 6, pp. 4240–4251, 2019.
[14] M. Uysal et al., “SLIPT for underwater visible light communications: Performance analysis and optimization,” IEEE Trans. Wireless Commun., vol. 20, no. 10, pp. 6715–6728, 2021.
[15] S. Tang, X. Zhang, and Y. Dong, “Temporal statistics of irradiance in moving turbulent ocean,” in 2013 MTS/IEEE OCEANS - Bergen.
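As an illustration of the geometric channel loss in (2), combined with the distance and angle expressions in (24), the following sketch evaluates the LED-to-receiver gain for the default coordinates of Section V. It is a minimal reading of those formulas, not the authors' simulator: the function names, the tuple layout of the coordinates, and the helper `lambertian_order` are assumptions made for this example, and turbulence is ignored.

```python
import math


def lambertian_order(half_power_angle):
    """Lambertian emission order m = -ln 2 / ln cos(Phi_1/2)."""
    return -math.log(2.0) / math.log(math.cos(half_power_angle))


def geometric_gain(led, rx, area, half_power_angle, fov, extinction):
    """Geometric loss of one LED-receiver link, following (2) and (24).

    led, rx: (x, y, z) coordinates in meters (hypothetical layout).
    Returns 0.0 when the incidence angle falls outside the field of view.
    """
    dx, dy, dz = (led[i] - rx[i] for i in range(3))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)  # Eq. (24)
    phi = math.acos(dz / distance)  # angle to the LED's normal axis
    psi = phi                       # incidence angle, set equal to phi
    if not 0.0 <= psi <= fov:
        return 0.0
    m = lambertian_order(half_power_angle)
    numerator = area * (m + 1.0) * math.cos(phi) ** m * math.cos(psi)
    denominator = 2.0 * math.pi * distance ** 2 * math.exp(extinction * distance)
    return numerator / denominator


# Default values quoted in Section V: r = 1 cm, Phi_1/2 = pi/10,
# Psi_c = pi/6, c = 0.3 1/m, K = 6 LEDs at height 5 m.
AREA = math.pi * 0.01 ** 2
leds = [(3.0 + 0.2 * k, 4.0, 5.0) for k in range(6)]
user_1 = (2.25, 3.0, 0.0)
gains_user_1 = [
    geometric_gain(led, user_1, AREA, math.pi / 10, math.pi / 6, 0.3)
    for led in leds
]
```

With these defaults every LED sees the first user inside the field of view, and the gain shrinks as the LED moves away from the user, which is why the rows of the channel matrix are not identical across LEDs.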
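The choice $\mu_V = -\sigma_V^2$ in Section II-B can be checked in one line using the moment-generating function of a Gaussian variable, $\mathbb{E}[e^{sX}] = \exp(s\mu + s^2\sigma^2/2)$. The derivation below is a worked verification added for clarity; it uses only the symbols already defined for the lognormal turbulence model.

```latex
\begin{align}
\mathbb{E}\left[h_{u,V}^{(n,k)}\right]
  &= \mathbb{E}\left[e^{2X_V^{(n,k)}}\right]
   = \exp\left(2\mu_V + \frac{2^2 \sigma_V^2}{2}\right)
   = \exp\left(2\mu_V + 2\sigma_V^2\right) \\
  &= \exp\left(-2\sigma_V^2 + 2\sigma_V^2\right) = 1
  \quad \text{for } \mu_V = -\sigma_V^2 .
\end{align}
```

The turbulence therefore redistributes the received power around its mean without changing the average, so the average link budget is set by the geometric loss alone.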

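The ZF baseline in (14) and its sensitivity to imperfect CSI can be sketched in a few lines. The example below builds the pseudo-inverse precoder for $N = 2$ users and $K = 6$ LEDs, then measures the residual inter-user interference when the precoder is computed from a noisy channel estimate in the spirit of (17). Everything here is illustrative: the channel values are made-up numbers of order one rather than physical gains, the helper names are hypothetical, the closed-form inverse only covers the two-user case, and channel aging is not modeled.

```python
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def inverse_2x2(m):
    """Closed-form inverse, sufficient for the N = 2 user case."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("singular matrix")
    return [[d / det, -b / det], [-c / det, a / det]]


def zf_precoder(h_u):
    """W_ZF = H^T (H H^T)^-1 as in (14); h_u is N x K, result is K x N."""
    h_t = transpose(h_u)
    return matmul(h_t, inverse_2x2(matmul(h_u, h_t)))


def noisy_csi(h, sigma_e, rng):
    """Add zero-mean Gaussian estimation errors of std sigma_e to every entry."""
    return [[g + rng.gauss(0.0, sigma_e) for g in row] for row in h]


def leakage(h_true, w):
    """Largest off-diagonal magnitude of H W, i.e. residual interference."""
    eff = matmul(h_true, w)
    n = len(eff)
    return max(abs(eff[i][j]) for i in range(n) for j in range(n) if i != j)


# Illustrative 2 x 6 channel matrix (assumed values, not from the paper).
H_U = [[1.0, 0.8, 0.6, 0.4, 0.3, 0.2],
       [0.2, 0.3, 0.5, 0.7, 0.9, 1.0]]
rng = random.Random(0)
W_perfect = zf_precoder(H_U)
leak_perfect = leakage(H_U, W_perfect)
leak_noisy = leakage(H_U, zf_precoder(noisy_csi(H_U, 0.05, rng)))
```

With perfect CSI the effective channel `H_U W` is the identity up to rounding, whereas the precoder built from the noisy estimate leaves a nonzero off-diagonal term. This is the loss of orthogonality that the second ZF limitation in Section IV-C refers to.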
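The estimator in (21)-(22) and the clipped objective in (23) can be illustrated numerically without a neural network. The sketch below assumes a finite episode, so that the infinite sum in (21) is truncated at the last stored step, and uses the standard PPO form of the surrogate, which takes the minimum of the unclipped and clipped terms. The function names and the list-based interface are assumptions for this example; `zeta` plays the role of the parameter $\zeta$ that weights the residuals in (21).

```python
def td_residuals(rewards, values, discount):
    """kappa_t = r_t + discount * V(s_{t+1}) - V(s_t), as in (22).

    values must hold one more entry than rewards (value of the final state).
    """
    return [r + discount * values[t + 1] - values[t]
            for t, r in enumerate(rewards)]


def advantage_estimates(rewards, values, discount, zeta):
    """Truncated version of (21): sum over j of (discount * zeta)^j * kappa_{t+j}."""
    kappa = td_residuals(rewards, values, discount)
    estimates = [0.0] * len(kappa)
    running = 0.0
    # Backward recursion: E_t = kappa_t + (discount * zeta) * E_{t+1}.
    for t in reversed(range(len(kappa))):
        running = kappa[t] + discount * zeta * running
        estimates[t] = running
    return estimates


def clipped_surrogate(ratios, advantages, eps):
    """Sample mean of min(p * E, clip(p, 1 - eps, 1 + eps) * E)."""
    terms = []
    for p, e in zip(ratios, advantages):
        clipped = min(max(p, 1.0 - eps), 1.0 + eps)
        terms.append(min(p * e, clipped * e))
    return sum(terms) / len(terms)


# Two-step toy episode with zero value estimates.
toy_advantages = advantage_estimates([1.0, 1.0], [0.0, 0.0, 0.0], 0.5, 1.0)
# A ratio above 1 + eps with positive advantage is capped at 1.2;
# a ratio below 1 - eps with negative advantage is held at 0.8.
toy_objective = clipped_surrogate([1.5, 0.5], [1.0, -1.0], 0.2)
```

In the toy episode the second step has advantage 1 and the first step has 1 + 0.5 × 1 = 1.5. The objective evaluates to (1.2 − 0.8) / 2 = 0.2, showing how clipping removes the incentive to move the probability ratio outside the range set by the clip parameter.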