Abstract—We introduce a trainable coded modulation scheme that enables joint optimization of the bit-wise mutual information (BMI) through probabilistic and geometric shaping over AWGN and Rayleigh block fading (RBF) channels. On the AWGN channel and when considering a code rate compatible with PAS, results show that the proposed scheme achieves performance competitive with QAM shaped with the MB distribution using the extended PAS architecture. For a lower code rate that PAS does not support, the proposed scheme achieves higher rates than geometric shaping and unshaped QAM. On the RBF channel, joint probabilistic and geometric shaping leads to higher rates than QAM shaped with MB as well as geometric shaping.

GLOBECOM 2020 - 2020 IEEE Global Communications Conference | 978-1-7281-8298-8/20/$31.00 ©2020 IEEE | DOI: 10.1109/GLOBECOM42002.2020.9348032

The rest of this paper is organized as follows. Section II details the architecture of the proposed system as well as the joint probabilistic and geometric shaping optimization algorithm. The numerical results are presented in Section III. Section IV concludes this paper.

II. JOINT GEOMETRIC AND PROBABILISTIC SHAPING

Let us denote by n the code length and by r the code rate. It is assumed that r = k/m, where 2^m is the modulation order, and k ∈ {1, ..., m−1}. We denote by q = n/m the number of channel symbols per codeword, with n assumed to be divisible by m. The overall architecture of the proposed system is shown in Fig. 1. A source generates independent and uniformly distributed bits c̃_I, which are fed to a matcher that implements a DM algorithm [7]. Given a target shaping distribution p_θ of dimension 2^k provided by an NN with trainable parameters denoted by θ, the matcher maps c̃_I to a bit vector c_I = [b_I^(1) ... b_I^(q)] of length rn, where b_I^(i), i ∈ {1, ..., q}, are bit vectors of dimension k distributed according to p_θ (note that the distribution p_θ is over the bit vectors themselves and not over the individual bits). Because of the shaping redundancy introduced by the matcher, c̃_I has a smaller length than rn. The matched information bits c_I are fed to a channel encoder that generates a vector of parity bits c_P = [b_P^(1) ... b_P^(q)] of length (1−r)n, where b_P^(i), i ∈ {1, ..., q}, are bit vectors of dimension m−k. Assuming a systematic code, the codeword c = [c_I c_P] is mapped to a vector of channel symbols x ∈ C^q by a mapper and according to a constellation C̃_θ also provided by the NN with trainable parameters θ. The constellation C̃_θ consists of a set of 2^m points in the complex plane, numbered from 0 to 2^m − 1. The mapping operation will be explained in detail in Section II-A.

The role of the DM algorithm is to reversibly map an incoming stream of independent and uniformly distributed bits to a stream of matched bits, such that, once modulated to channel symbols, the probabilities of occurrence of the channel symbols follow p_θ. On the receiver side, a differentiable demapper computes log-likelihood ratios (LLRs) l̃ ∈ R^n from the received symbols y ∈ C^q. The LLRs are fed to a decoding algorithm that reconstructs the matched information bits. Because DM is reversible, the unmatched information bits can be retrieved.

A. Mapper architecture

The mapper maps each codeword c = [c_I c_P] to a vector of channel symbols x ∈ C^q by mapping each bit vector [b_I^(i) b_P^(i)], i ∈ {1, ..., q}, of length m to a channel symbol x ∈ C according to the constellation C̃_θ. The key idea behind the mapper architecture is to partition the constellation C̃_θ into 2^(m−k) sub-constellations, each of size 2^k, i.e., C̃_θ = [C̃_θ^(1), ..., C̃_θ^(2^(m−k))]. All sub-constellations are normalized,

    C_θ^(i) = C̃_θ^(i) / sqrt( Σ_{x ∈ C̃_θ^(i)} p_θ(x) |x|^2 )        (1)

to form the normalized constellation C_θ = [C_θ^(1), ..., C_θ^(2^(m−k))]. In order to map a bit vector [b_I^(i) b_P^(i)], i ∈ {1, ..., q}, to a channel symbol, the sub-constellation C_θ^(j) such that j has b_P^(i) as binary representation is chosen. Next, the channel symbol x_k ∈ C_θ^(j) such that k has b_I^(i) as binary representation is selected to modulate the bit vector [b_I^(i) b_P^(i)], as illustrated in Fig. 1. Note that the bit vectors b_I^(i) are shaped according to p_θ, whereas the parity bits b_P^(i) are not, as channel encoding does not preserve shaping. The parity bits are assumed to be uniform and independent and identically distributed (i.i.d.) [1]. Therefore, using the proposed mapper, the constellation consists of a set of 2^(m−k) sub-constellations, all probabilistically shaped according to p_θ, but with possibly different geometries. The normalization (1) ensures the power constraint E[|x|^2] = 1.

This approach is similar to PAS [1], in which k = m − 2 and the signs of the constellation point components are selected according to the two parity bits, leading to symmetric constellations. Our approach can be seen as a generalization of PAS, where k can take any value in the range from 1 to m − 1, and the sub-constellations can be freely optimized, subject only to a power constraint.

B. System training

At training, only the mapper, channel, and demapper are considered, as channel encoding and decoding are not required
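To make the partitioning and normalization concrete, here is a minimal NumPy sketch of the mapper logic described above. This is not the authors' implementation: the sizes m = 4 and k = 2, the random initial constellation, and the helper `map_bits` are illustrative assumptions; only the sub-constellation structure and the normalization (1) follow the text.

```python
import numpy as np

m, k = 4, 2                        # bits per channel use; k shaped bits (illustrative sizes)
n_sub, sub_size = 2 ** (m - k), 2 ** k

rng = np.random.default_rng(0)
# Unnormalized constellation C~_theta: one row of complex points per sub-constellation.
C_tilde = rng.normal(size=(n_sub, sub_size)) + 1j * rng.normal(size=(n_sub, sub_size))
# Shaping distribution p_theta over the 2^k points of each sub-constellation.
logits = rng.normal(size=sub_size)
p_theta = np.exp(logits) / np.exp(logits).sum()

# Normalization (1): scale each sub-constellation to unit average power under p_theta.
scale = np.sqrt((p_theta * np.abs(C_tilde) ** 2).sum(axis=1, keepdims=True))
C = C_tilde / scale

def map_bits(b_parity, b_info):
    """Parity bits select the sub-constellation, shaped information
    bits select the point inside it (hypothetical helper)."""
    j = int("".join(map(str, b_parity)), 2)      # sub-constellation index
    point = int("".join(map(str, b_info)), 2)    # point index inside it
    return C[j, point]

x = map_bits([0, 1], [1, 0])       # one length-m bit vector -> one channel symbol
```

Because the same p_θ is shared by every row of `C`, all sub-constellations end up probabilistically shaped identically while their geometries may differ, which is exactly the structure the mapper enforces.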
Authorized licensed use limited to: Raytheon Technologies. Downloaded on May 18,2021 at 13:53:22 UTC from IEEE Xplore. Restrictions apply.
to perform probabilistic and geometric shaping on the BMI. The system considered for training is shown in Fig. 2. Joint probabilistic and geometric shaping consists of jointly optimizing p_θ and C_θ to maximize the BMI, i.e., to find the vector of parameters that solves

    max_θ R(θ)        (2)

    subject to Σ_{x ∈ C_θ^(i)} p_θ(x) |x|^2 = 1        (3)

Fig. 2: System architecture at training

where (3) is an average power constraint, ensured by the normalization step performed by the mapper. Optimizing on the BMI is relevant as it is an achievable rate for BICM systems [6]. The BMI is defined as

    R(θ) := [ H_θ(X) − Σ_{i=1}^m H_θ(B_i|Y) ]^+        (4)

where [·]^+ := max(·, 0), X is the random variable corresponding to the channel input, Y the random variable corresponding to the channel output, and B_1, ..., B_m the random variables corresponding to the bits transmitted in a single channel use (which consist of information and parity bits). The first term in (4) is the entropy of the channel input X,

    H_θ(X) = − Σ_{x ∈ C_θ^(1)} p_θ(x) log(p_θ(x)) + (m − k)        (5)

and depends only on the shaping distribution p_θ. In (5), any sub-constellation C_θ^(i), i ∈ {1, ..., 2^(m−k)}, could be used in the sum as they all share the same probabilistic shaping p_θ. The second term in (4) is the sum of the transmitted bit entropies conditioned on the channel output Y, i.e.,

    H_θ(B_i|Y) = − E_{b_i,y}[ log(p(b_i|y)) ]        (6)

where p(b_i|y) is the posterior probability of b_i given y. Note that this conditional entropy depends on θ through the distribution of Y and B_i.

Finding a local solution to (2)-(3) is done by performing stochastic gradient descent (SGD) on a loss function that serves as a proxy for R. As noticed in [5], the BMI is closely related to the total binary cross-entropy (BCE) defined by

    L̂(θ) := − Σ_{i=1}^m E_{y,b_i}[ log(p̃(b_i|y)) ]        (7)

where p̃(b_i|y) is the posterior probability of b_i given y approximated by the demapper. One can rewrite L̂ as

    L̂(θ) = Σ_{i=1}^m H_θ(B_i|Y) + Σ_{i=1}^m E_y[ D_KL( p(b_i|y) || p̃(b_i|y) ) ]        (8)
          = H_θ(X) − R(θ) + Σ_{i=1}^m E_y[ D_KL( p(b_i|y) || p̃(b_i|y) ) ]        (9)

where D_KL is the Kullback–Leibler (KL) divergence. Because we are performing probabilistic shaping, training on L̂, as done in, e.g., [5], would lead to the minimization of H_θ(X), which is not desired. To avoid this issue, we train on the loss

    L(θ) := L̂(θ) − H_θ(X)        (10)
          = − R(θ) + Σ_{i=1}^m E_y[ D_KL( p(b_i|y) || p̃(b_i|y) ) ]        (11)

Therefore, by minimizing L, one maximizes the BMI. Moreover, assuming the demapper is trainable, it would be jointly optimized with the transmitter to minimize its KL divergence to the true posterior distribution of B_i given Y.

A challenge of optimizing the shaping distribution with SGD is to compute the gradient of L with respect to (w.r.t.) p_θ. This difficulty was addressed in [8] by leveraging the Gumbel-Softmax trick [9] to implement a trainable sampling mechanism as the source of information bits. However, the Gumbel-Softmax trick requires extra hyper-parameters to be set, and can lead to numerical instabilities at training. In this work, we avoid the need of implementing a trainable sampling mechanism by implementing the loss in a different manner at training. More precisely, let us denote by h the channel state vector, which captures all the random elements of the channel, e.g., the noise realization of an AWGN channel or the fading coefficient and noise realization of a fading channel. Then, without loss of generality, (6) can be rewritten as

    H(B_i|Y) = − E_{b_i,x,h}[ log(p(b_i|y(x,h))) ]
             = − E_h[ Σ_{j=1}^{2^(m−k)} Σ_{x ∈ C_θ^(j)} Σ_{b_i ∈ {0,1}} p(b_i|x) p_θ(x) · log(p(b_i|y(x,h))) ]        (12)

where the last equality comes from the fact that h is independent of x and b_i. In (12), p(b_i|x) is set by the labeling of the constellation point, and equals either 0 or 1. More precisely, each point x of the constellation C_θ is associated with a bit vector of size m. In the rest of this work, the labeling of the constellation is set to natural labeling, i.e., the first element of C_θ is associated with 0, the second one with 1, etc. In other words, the bit vector associated with each constellation point is fixed; however, the points can freely move within the complex plane during the optimization process, being only subject to the power constraint (3). When optimizing the constellation geometry C_θ, i.e., the point positions, on the BMI, their
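The rewriting in (12), sampling h while summing explicitly over x and b_i, can be sketched numerically. The toy setup below is an assumption for illustration only: small m and k, a random normalized constellation, a uniform p_θ, and an exact AWGN posterior standing in for the demapper p̃; the symbol prior is taken as p_θ(x)/2^(m−k), absorbing the uniform parity bits, and entropies are in bits.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 2                                   # toy sizes (illustrative assumption)
n_sub, sub_size = 2 ** (m - k), 2 ** k

# Random normalized constellation and uniform shaping distribution (stand-ins).
C = rng.normal(size=(n_sub, sub_size)) + 1j * rng.normal(size=(n_sub, sub_size))
p_theta = np.full(sub_size, 1.0 / sub_size)
C /= np.sqrt((p_theta * np.abs(C) ** 2).sum(axis=1, keepdims=True))

points = C.reshape(-1)                        # natural labeling: point s <-> integer s
prior = np.tile(p_theta, n_sub) / n_sub       # symbol prior: shaped bits x uniform parity
labels = (np.arange(2 ** m)[:, None] >> np.arange(m)[::-1]) & 1   # bit labels, MSB first

# Entropy of the channel input, eq. (5), in bits.
H_X = -(p_theta * np.log2(p_theta)).sum() + (m - k)

def posterior_bit1(y, N0):
    """Exact AWGN posterior p(b_i = 1 | y) for all i, used here as the demapper p~."""
    lik = prior * np.exp(-np.abs(y - points) ** 2 / N0)
    lik /= lik.sum()
    return labels.T @ lik

# Estimate the loss: sample B channel (noise) realizations h, but take the
# inner expectation over x and b_i explicitly by transmitting every point.
B, N0 = 100, 0.1
bce = 0.0
for _ in range(B):
    noise = np.sqrt(N0 / 2) * (rng.normal(size=points.shape)
                               + 1j * rng.normal(size=points.shape))
    y = points + noise
    for s in range(points.size):
        p1 = posterior_bit1(y[s], N0)
        p_ok = np.where(labels[s] == 1, p1, 1.0 - p1)   # posterior of the true bits
        bce -= prior[s] * np.log2(np.maximum(p_ok, 1e-12)).sum()
L = bce / B - H_X                             # loss (10); -L estimates the BMI
```

Since every constellation point enters each batch with its prior as an explicit weight, the gradient of this estimate w.r.t. p_θ is exact, which is the point of avoiding a sampled source.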
Fig. 3: Architectures of the NNs used for evaluation. (a) NN for the PS-GS scheme. (b) Demapper for the mismatched RBF channel. (c) Extension of the PAS architecture.
corresponding labeling is considered as it impacts the BMI value. As a result, joint optimization of the constellation geometry and labeling is achieved.

The key idea to enable optimization of the shaping distribution is to sample h to estimate the outer expectation in (12), but to explicitly implement the inner expectation over x and b_i. This is not as in, e.g., [5], where both the channel and the source are sampled. This trick avoids the need for a trainable sampling mechanism as the gradient of L is correctly computed with respect to p_θ. In practice, at training, each batch example consists of sampling a channel realization h, and transmitting all the points forming the constellation C_θ, as shown in Fig. 2. The loss L is then estimated by

    L(θ) ≈ − H_θ(X) − Σ_{i=1}^m (1/B) Σ_{l=1}^B Σ_{j=1}^{2^(m−k)} Σ_{x ∈ C_θ^(j)} p_θ(x) · log( p̃(b_i | y(x, h^(l))) )        (13)

where B is the batch size, and h^(l) is the l-th sampled channel realization.

III. NUMERICAL RESULTS

To assess the performance of the proposed scheme, referred to as PS-GS as it performs joint probabilistic and geometric shaping, we compared it to the conventional unshaped QAM, to an optimized geometric shaping (GS) baseline with no probabilistic shaping, and to QAM shaped according to a MB distribution by PAS, referred to as MB-QAM. For fairness, the PAS architecture [1] was extended with an NN that computes the MB shaping from the SNR, in order to find an optimized MB distribution for each SNR value (see Section III-A). The GS baseline consists of an NN-based mapper with uniform probability of occurrence of the points, as in, e.g., [5].

To evaluate the BMI achieved by the different schemes, no DM algorithm was used, and the information bit vectors were sampled according to the shaping distributions (or to the uniform distribution in the cases of uniform GS and uniform QAM). This is equivalent to assuming the use of a perfect DM algorithm. AWGN and mismatched RBF channels were considered (see Sections III-B and III-C, respectively). The number of bits per channel use was set to m = 6 and the SNR was defined as 1/N_0, where N_0 is the noise spectral density. The SE was defined as the number of information bits transmitted per channel use, formally SE := H(X) − m(1 − r).

Regarding the proposed PS-GS approach, two code rates were considered, 1/2 and 2/3. The code rate constrains the partitioning of the constellation C̃_θ in the mapper as well as the size of the shaping distribution p_θ (see Section II). To see how gains in BMI translate into gains in bit error rate (BER), a standard IEEE 802.11n code was used, with a length of n = 1944 bits and a code rate of 2/3. This code rate was chosen as it is compatible with PAS for m = 6. However, PAS cannot operate with a code rate of 1/2. A conventional sum-product belief-propagation decoder with 100 iterations was leveraged for channel decoding.

Regarding the proposed scheme, the NN that generates the shaping distribution p_θ and the constellation C̃_θ was made of two dense layers with 64 units each and hyperbolic tangent activation function, followed by two parallel dense layers with linear activation, one with 2^k units to generate the shaping distribution, and the other with 2^(m+1) units to generate the real and imaginary parts of the constellation points. This architecture is shown in Fig. 3a. When considering the AWGN channel, the true posterior distribution on bits was implemented by the demapper. For the mismatched RBF channel, as no efficient implementation of the optimal demapper is known to the best of our knowledge, the demapper was implemented by an NN whose architecture is shown in Fig. 3b. The NN-based demapper was jointly optimized with the transmitter. In Fig. 3b, ŷ is the equalized received symbol and ĥ the linear minimum mean square error (LMMSE) channel estimate. For fairness, an NN-based demapper was leveraged for all considered schemes when considering the mismatched RBF channel, including the QAM baseline for which the transmitter includes no trainable parameters.

Training was done with batches of size 1000, using the Adam optimizer [10] with the learning rate set to 10^−3, and by uniformly sampling the SNR from the range from 0 to 20 dB for the AWGN channel, and from the range from 5 to 25 dB for the mismatched RBF channel. For all the schemes
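As a rough sketch of the transmitter NN of Fig. 3a, the forward pass below uses untrained random weights. The scalar SNR input and the softmax on the distribution head are assumptions (the text specifies only the layer sizes and activations, and a distribution head must output a valid probability vector); all weight values are placeholders.

```python
import numpy as np

m, k = 6, 4                      # paper's m = 6; k = 4 corresponds to code rate 2/3
rng = np.random.default_rng(2)

def dense(n_in, n_out):
    # Random Glorot-style initialization; placeholder weights, not trained values.
    return rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)

W1, b1 = dense(1, 64)            # two hidden dense layers, 64 units each, tanh
W2, b2 = dense(64, 64)
Wp, bp = dense(64, 2 ** k)       # head 1: logits of the shaping distribution p_theta
Wc, bc = dense(64, 2 ** (m + 1)) # head 2: real and imaginary parts of the 2^m points

def shaping_network(snr_db):
    """Map an SNR value (assumed input) to (p_theta, unnormalized constellation)."""
    h = np.tanh(np.tanh(np.array([[snr_db]]) @ W1 + b1) @ W2 + b2)
    logits = (h @ Wp + bp).ravel()
    p_theta = np.exp(logits - logits.max())
    p_theta /= p_theta.sum()                          # softmax (assumed) over 2^k outcomes
    re_im = (h @ Wc + bc).ravel()
    C_tilde = re_im[: 2 ** m] + 1j * re_im[2 ** m :]  # unnormalized constellation C~_theta
    return p_theta, C_tilde

p_theta, C_tilde = shaping_network(10.0)
```

The two heads share the hidden layers, so one network adapts both the shaping distribution and the constellation geometry to the SNR at which it is evaluated.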
that required training, the best of five seeds is shown.

A. Extending PAS with deep learning

B. AWGN channel

Considering the AWGN channel, the true posterior distribution was implemented in the demapper. Therefore, the KL divergence in (11) is null, and the loss function L equals the BMI R up to the sign. The BMIs achieved by the compared schemes were estimated by Monte Carlo sampling of (11) and are shown in Fig. 4a. The code rate is indicated in parentheses in the legend. One can see that PS-GS with a code rate of 2/3 and MB-QAM achieve essentially the same BMI, followed by PS-GS with a code rate of 1/2. The lower performance of PS-GS with a code rate of 1/2 can be explained by the higher number of sub-constellations into which C̃_θ must be partitioned by the mapper due to the lower code rate. Indeed, the higher the number of sub-constellations into which the constellation C̃_θ must be partitioned, the stronger the constraint, as all sub-constellations must share the same probabilities of occurrence of points. Note that conventional PAS does not operate at a code rate of 1/2 with m set to 6. However, PS-

[Fig. 4 (plots): (a) BMI vs. Es/N0 [dB] for Capacity, PS-GS(2/3), MB-QAM(2/3), PS-GS(1/2), GS, and QAM. (b) SE; PS-GS and MB-QAM adapt the SE according to the SNR. Remaining panel: BER.]
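To illustrate the SE definition SE := H(X) − m(1 − r) behind Fig. 4b, a short numeric check with the paper's m = 6 and r = 2/3; the non-uniform distribution used here is an arbitrary example, not a trained p_θ.

```python
import numpy as np

m, r = 6, 2 / 3                         # paper's modulation order and one code rate
k = int(round(r * m))                   # r = k/m  ->  k = 4 shaped bits per symbol

# Uniform shaping: H(X) = k + (m - k) = m, so SE = m - m(1 - r) = m * r = 4 bits.
p_uniform = np.full(2 ** k, 2.0 ** -k)
H_uniform = -(p_uniform * np.log2(p_uniform)).sum() + (m - k)
se_uniform = H_uniform - m * (1 - r)

# Any non-uniform p_theta lowers H(X), hence the SE: the rate cost of shaping.
p_shaped = np.array([0.4, 0.3, 0.2, 0.1] * (2 ** k // 4)) / (2 ** k // 4)
H_shaped = -(p_shaped * np.log2(p_shaped)).sum() + (m - k)
se_shaped = H_shaped - m * (1 - r)
```

This also shows why PS-GS and MB-QAM can trade SE against SNR: adjusting p_θ with the SNR moves H(X), and the SE follows directly from the definition above.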
[Fig. 5a (plot): BMI lower bound [bit / channel use] for PS-GS(2/3), MB-QAM(2/3), PS-GS(1/2), GS, and QAM.]

Because of space restriction, only the BMI and SE are shown for the mismatched RBF channel. Fig. 5a shows the lower bounds on the BMI achieved by the compared schemes