
Chapter 5

Differential Entropy and Gaussian Channels

Po-Ning Chen, Professor

Department of Electrical and Computer Engineering

National Chiao Tung University

Hsin Chu, Taiwan 30010, R.O.C.


Continuous sources

Model: $\{X_t \in \mathcal{X},\ t \in \mathcal{I}\}$

- Discrete sources: both $\mathcal{X}$ and $\mathcal{I}$ are discrete.
- Continuous sources:
  - Discrete-time continuous sources: $\mathcal{X}$ is continuous; $\mathcal{I}$ is discrete.
  - Waveform sources: both $\mathcal{X}$ and $\mathcal{I}$ are continuous.

We have so far examined information measures and their operational characterization for discrete-time discrete-alphabet systems. In this chapter, we turn our focus to discrete-time continuous-alphabet (real-valued) sources.
Information content of continuous sources

Relation between discrete-alphabet and continuous-alphabet sources

- A discrete-alphabet source can be regarded as a discrete approximation (quantization) of a continuous-alphabet source.
- As the quantization becomes finer and finer, a discrete-alphabet source is expected to approach a continuous-alphabet source.
- We thus require an extension of the information measure (i.e., entropy) from discrete sources to continuous sources.

Entropy: $H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\log_2 P_X(x)$ bits.

Example 5.1 Consider a real-valued random variable $X$ that is uniformly distributed on the unit interval, i.e., with pdf given by
$$f_X(x)=\begin{cases}1 & \text{if } x\in[0,1);\\ 0 & \text{otherwise.}\end{cases}$$
Given a positive integer $m$, we can discretize $X$ by uniformly quantizing it into $m$ levels by partitioning the support of $X$ into equal-length segments of size $\Delta = \frac{1}{m}$ ($\Delta$ is called the quantization step-size) such that
$$q_m(X) = \frac{i}{m}, \quad \text{if } \frac{i-1}{m}\le X < \frac{i}{m},$$
for $1\le i\le m$. Then the entropy of the quantized random variable $q_m(X)$ is given by
$$H(q_m(X)) = -\sum_{i=1}^{m}\frac{1}{m}\log_2\frac{1}{m} = \log_2 m \quad\text{(in bits).}$$
Since the entropy $H(q_m(X))$ of the quantized version of $X$ is a lower bound to the entropy of $X$ (as $q_m(X)$ is a function of $X$) and satisfies in the limit
$$\lim_{m\to\infty} H(q_m(X)) = \lim_{m\to\infty}\log_2 m = \infty,$$
we obtain that the entropy of $X$ is infinite. □

- The above example indicates that to compress a continuous source without incurring any loss or distortion requires an infinite number of bits.
- Thus, when studying continuous sources, the entropy measure is limited in its effectiveness and the introduction of a new measure is necessary.
- Such a new measure is obtained upon close examination of the entropy of a uniformly quantized real-valued random variable minus the quantization accuracy (as the accuracy increases without bound); the idea can be picked up from the next lemma.
Lemma 5.2 Consider a real-valued random variable $X$ with support $[a,b)$ and pdf $f_X$ such that

(i) $f_X \log_2 f_X$ is (Riemann-)integrable, and
(ii) $\int_a^b f_X(x)\log_2 f_X(x)\,dx$ is finite.

Then a uniform quantization of $X$ with an $n$-bit accuracy (i.e., with a quantization step-size of $\Delta = 2^{-n}$) yields an entropy approximately equal to
$$-\int_a^b f_X(x)\log_2 f_X(x)\,dx + n \quad\text{bits}$$
for $n$ sufficiently large. In other words,
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int_a^b f_X(x)\log_2 f_X(x)\,dx,$$
where $q_n(X)$ is the uniformly quantized version of $X$ with quantization step-size $\Delta = 2^{-n}$.
Proof:

Step 1: Mean-value theorem.

Let $\Delta = 2^{-n}$ be the quantization step-size, and let
$$t_i \triangleq \begin{cases} a + i\Delta, & i = 0,1,\dots,j-1,\\ b, & i = j,\end{cases}
\qquad\text{where}\qquad j = \left\lceil \frac{b-a}{\Delta}\right\rceil.$$
From the mean-value theorem, we can choose $x_i \in [t_{i-1}, t_i]$ for $1\le i\le j$ such that
$$p_i \triangleq \int_{t_{i-1}}^{t_i} f_X(x)\,dx = f_X(x_i)(t_i - t_{i-1}) = f_X(x_i)\,\Delta.$$

Step 2: Definition of $h^{(n)}(X)$.

Let
$$h^{(n)}(X) \triangleq -\sum_{i=1}^{j}\left[f_X(x_i)\log_2 f_X(x_i)\right]\Delta = -\sum_{i=1}^{j}\left[f_X(x_i)\log_2 f_X(x_i)\right]2^{-n}.$$
Since $f_X(x)\log_2 f_X(x)$ is (Riemann-)integrable,
$$h^{(n)}(X) \to -\int_a^b f_X(x)\log_2 f_X(x)\,dx \quad\text{as } n\to\infty.$$
Therefore, given any $\varepsilon > 0$, there exists $N$ such that for all $n > N$,
$$\left|\,-\int_a^b f_X(x)\log_2 f_X(x)\,dx - h^{(n)}(X)\right| < \varepsilon.$$
Step 3: Computation of $H(q_n(X))$. The entropy of the (uniformly) quantized version of $X$, $q_n(X)$, is given by
$$H(q_n(X)) = -\sum_{i=1}^{j} p_i\log_2 p_i
= -\sum_{i=1}^{j}\bigl(f_X(x_i)\Delta\bigr)\log_2\bigl(f_X(x_i)\Delta\bigr)
= -\sum_{i=1}^{j}\bigl(f_X(x_i)2^{-n}\bigr)\log_2\bigl(f_X(x_i)2^{-n}\bigr),$$
where the $p_i$'s are the probabilities of the different values of $q_n(X)$.

Step 4: $H(q_n(X)) - h^{(n)}(X)$. From Steps 2 and 3,
$$H(q_n(X)) - h^{(n)}(X) = -\sum_{i=1}^{j}\left[f_X(x_i)2^{-n}\right]\log_2(2^{-n})
= n\sum_{i=1}^{j}\int_{t_{i-1}}^{t_i} f_X(x)\,dx
= n\int_a^b f_X(x)\,dx = n.$$
Hence, we have that for $n > N$,
$$-\int_a^b f_X(x)\log_2 f_X(x)\,dx + n - \varepsilon \;<\; H(q_n(X)) = h^{(n)}(X) + n \;<\; -\int_a^b f_X(x)\log_2 f_X(x)\,dx + n + \varepsilon,$$
yielding that $\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int_a^b f_X(x)\log_2 f_X(x)\,dx$. □
Lemma 5.2 actually holds not only for support $[a,b)$ but for any support $S_X$.

Theorem 5.3 [38, Theorem 1] (Due to Rényi) For any real-valued random variable with pdf $f_X$, if
$$-\sum_{i} p_i\log_2 p_i$$
is finite, where the $p_i$'s are the probabilities of the different values of the uniformly quantized $q_n(X)$ over support $S_X$, then
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int_{S_X} f_X(x)\log_2 f_X(x)\,dx,$$
provided the integral on the right-hand side exists.

This suggests that $-\int_{S_X} f_X(x)\log_2 f_X(x)\,dx$ could be a good information measure for continuous-alphabet sources.
Differential entropy

Definition 5.4 (Differential entropy) The differential entropy (in bits) of a continuous random variable $X$ with pdf $f_X$ and support $S_X$ is defined as
$$h(X) \triangleq -\int_{S_X} f_X(x)\log_2 f_X(x)\,dx = E[-\log_2 f_X(X)],$$
when the integral exists.

Example 5.5 A continuous random variable $X$ with support $S_X = [0,1)$ and pdf $f_X(x) = 2x$ for $x\in S_X$ has differential entropy equal to
$$-\int_0^1 2x\log_2(2x)\,dx = \left[\frac{x^2}{2}\bigl(\log_2 e - 2\log_2(2x)\bigr)\right]_0^1 = \frac{1}{2\ln 2} - \log_2(2) = -0.278652 \text{ bits.}$$

Next, the quantized version $q_n(X)$ is given by
$$q_n(X) = \frac{i}{2^n}, \quad\text{if } \frac{i-1}{2^n}\le X < \frac{i}{2^n},$$
for $1\le i\le 2^n$. Hence,
$$\Pr\!\left[q_n(X) = \frac{i}{2^n}\right] = \frac{2i-1}{2^{2n}},$$
which yields
$$H(q_n(X)) = -\sum_{i=1}^{2^n}\frac{2i-1}{2^{2n}}\log_2\frac{2i-1}{2^{2n}}
= \frac{1}{2^{2n}}\left[-\sum_{i=1}^{2^n}(2i-1)\log_2(2i-1) + 2^{2n}\log_2\bigl(2^{2n}\bigr)\right].$$

 n   H(q_n(X))        H(q_n(X)) - n
 1   0.811278 bits    -0.188722 bits
 2   1.748999 bits    -0.251000 bits
 3   2.729560 bits    -0.270440 bits
 4   3.723726 bits    -0.276275 bits
 5   4.722023 bits    -0.277977 bits
 6   5.721537 bits    -0.278463 bits
 7   6.721399 bits    -0.278600 bits
 8   7.721361 bits    -0.278638 bits
 9   8.721351 bits    -0.278648 bits
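The table can be reproduced with a few lines of numerical code; the following is a minimal sketch (assuming Python with numpy is available) that evaluates $H(q_n(X))$ from the cell probabilities above and compares $H(q_n(X)) - n$ with the differential entropy of Example 5.5.

```python
import numpy as np

# Numerical check of the table: for f_X(x) = 2x on [0, 1),
# Pr[q_n(X) = i/2^n] = (2i - 1)/2^(2n), so H(q_n(X)) can be summed directly
# and compared with h(X) + n, where h(X) = 1/(2 ln 2) - 1 = -0.278652 bits.
h_X = 1 / (2 * np.log(2)) - 1          # differential entropy of Example 5.5, in bits

for n in range(1, 10):
    i = np.arange(1, 2**n + 1)
    p = (2 * i - 1) / 2**(2 * n)       # quantizer cell probabilities
    H = -np.sum(p * np.log2(p))        # entropy of q_n(X) in bits
    print(f"n={n}: H(q_n(X)) = {H:.6f},  H(q_n(X)) - n = {H - n:.6f}")

print(f"h(X) = {h_X:.6f} bits")        # limit of H(q_n(X)) - n
```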
Example 5.7 (Differential entropy of a uniformly distributed random variable) Let $X$ be a continuous random variable that is uniformly distributed over the interval $(a,b)$, where $b > a$; i.e., its pdf is given by
$$f_X(x) = \begin{cases}\frac{1}{b-a} & \text{if } x\in(a,b);\\ 0 & \text{otherwise.}\end{cases}$$
So its differential entropy is given by
$$h(X) = -\int_a^b \frac{1}{b-a}\log_2\frac{1}{b-a}\,dx = \log_2(b-a) \text{ bits.}$$

Note that if $(b-a) < 1$ in the above example, then $h(X)$ is negative.

Example 5.8 (Differential entropy of a Gaussian random variable) Let $X\sim\mathcal{N}(\mu,\sigma^2)$; i.e., $X$ is a Gaussian (or normal) random variable with finite mean $\mu$, variance $\mathrm{Var}(X)=\sigma^2>0$ and pdf
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
for $x\in\mathbb{R}$. Then its differential entropy is given by
$$h(X) = \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx
= \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}E[(X-\mu)^2]
= \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{1}{2}\log_2 e
= \frac{1}{2}\log_2(2\pi e\sigma^2) \text{ bits.} \tag{5.1.1}$$

- Note that for a Gaussian random variable, its differential entropy is only a function of its variance $\sigma^2$ (it is independent of its mean $\mu$).
- This is similar to the differential entropy of a uniform random variable, which only depends on the difference $(b-a)$ but not on the mean $(a+b)/2$.
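As a quick sanity check of (5.1.1), one can estimate $E[-\log_2 f_X(X)]$ by Monte Carlo; the sketch below (Python/numpy, with arbitrarily chosen $\mu$ and $\sigma$) matches the closed form and illustrates that the value does not depend on $\mu$.

```python
import numpy as np

# Monte Carlo sanity check of (5.1.1): h(X) = E[-log2 f_X(X)] for X ~ N(mu, sigma^2).
# The values of mu and sigma below are arbitrary illustrative choices.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
h_mc = np.mean(-log_pdf) / np.log(2)                 # empirical E[-log2 f_X(X)]
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

print(h_mc, h_closed)   # both approximately 3.047 bits; independent of mu
```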
Properties of differential entropy

Joint and conditional differential entropies, divergence and mutual information

Definition 5.9 (Joint differential entropy) If $X^n = (X_1,X_2,\dots,X_n)$ is a continuous random vector of size $n$ (i.e., a vector of $n$ continuous random variables) with joint pdf $f_{X^n}$ and support $S_{X^n}\subseteq\mathbb{R}^n$, then its joint differential entropy is defined as
$$h(X^n) \triangleq -\int_{S_{X^n}} f_{X^n}(x_1,\dots,x_n)\log_2 f_{X^n}(x_1,\dots,x_n)\,dx_1\cdots dx_n = E[-\log_2 f_{X^n}(X^n)]$$
when the $n$-dimensional integral exists.

Definition 5.10 (Conditional differential entropy) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y}\subseteq\mathbb{R}^2$ such that the conditional pdf of $Y$ given $X$, given by
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)},$$
is well defined for all $(x,y)\in S_{X,Y}$, where $f_X$ is the marginal pdf of $X$. Then the conditional differential entropy of $Y$ given $X$ is defined as
$$h(Y|X) \triangleq -\int_{S_{X,Y}} f_{X,Y}(x,y)\log_2 f_{Y|X}(y|x)\,dx\,dy = E[-\log_2 f_{Y|X}(Y|X)],$$
when the integral exists.

Chain rule for differential entropy:
$$h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).$$
Definition 5.11 (Divergence or relative entropy) Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X\subseteq S_Y\subseteq\mathbb{R}$. Then the divergence (or relative entropy or Kullback-Leibler distance) between $X$ and $Y$ is written as $D(X\|Y)$ or $D(f_X\|f_Y)$ and defined by
$$D(X\|Y) \triangleq \int_{S_X} f_X(x)\log_2\frac{f_X(x)}{f_Y(x)}\,dx = E\!\left[\log_2\frac{f_X(X)}{f_Y(X)}\right]$$
when the integral exists. The definition carries over similarly in the multivariate case: for $X^n = (X_1,\dots,X_n)$ and $Y^n = (Y_1,\dots,Y_n)$ two random vectors with joint pdfs $f_{X^n}$ and $f_{Y^n}$, respectively, and supports satisfying $S_{X^n}\subseteq S_{Y^n}\subseteq\mathbb{R}^n$, the divergence between $X^n$ and $Y^n$ is defined as
$$D(X^n\|Y^n) \triangleq \int_{S_{X^n}} f_{X^n}(x_1,\dots,x_n)\log_2\frac{f_{X^n}(x_1,\dots,x_n)}{f_{Y^n}(x_1,\dots,x_n)}\,dx_1\cdots dx_n$$
when the integral exists.

Note that $D(q_n(X)\|q_n(Y)) \to D(X\|Y)$ as $n\to\infty$ for continuous $X$ and $Y$.
Definition 5.12 (Mutual information) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y}\subseteq\mathbb{R}^2$. Then the mutual information between $X$ and $Y$ is defined by
$$I(X;Y) \triangleq D(f_{X,Y}\|f_X f_Y) = \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy,$$
assuming the integral exists, where $f_X$ and $f_Y$ are the marginal pdfs of $X$ and $Y$, respectively.

For $n$ and $m$ sufficiently large,
$$\begin{aligned}
I(q_n(X);q_m(Y)) &= H(q_n(X)) + H(q_m(Y)) - H(q_n(X),q_m(Y))\\
&\approx [h(X)+n] + [h(Y)+m] - [h(X,Y)+n+m]\\
&= h(X) + h(Y) - h(X,Y)\\
&= \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy.
\end{aligned}$$
Hence,
$$\lim_{n,m\to\infty} I(q_n(X);q_m(Y)) = h(X) + h(Y) - h(X,Y).$$
This justifies using identical notations for both $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ in the discrete and continuous cases, as opposed to the distinct notations of $H(\cdot)$ for entropy and $h(\cdot)$ for differential entropy.
Lemma 5.14 The following properties hold for the information measures of continuous systems.

1. Non-negativity of divergence: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X\subseteq S_Y\subseteq\mathbb{R}$. Then
$$D(f_X\|f_Y)\ge 0$$
with equality iff $f_X(x) = f_Y(x)$ for all $x\in S_X$ except in a set of $f_X$-measure zero (i.e., $X = Y$ almost surely).

2. Non-negativity of mutual information: For any two continuous random variables $X$ and $Y$,
$$I(X;Y)\ge 0$$
with equality iff $X$ and $Y$ are independent.

3. Conditioning never increases differential entropy: For any two continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X|Y)\le h(X)$$
with equality iff $X$ and $Y$ are independent.

4. Chain rule for differential entropy: For a continuous random vector $X^n = (X_1,X_2,\dots,X_n)$,
$$h(X_1,X_2,\dots,X_n) = \sum_{i=1}^{n} h(X_i|X_1,X_2,\dots,X_{i-1}),$$
where $h(X_i|X_1,X_2,\dots,X_{i-1}) \triangleq h(X_1)$ for $i = 1$.

5. Chain rule for mutual information: For a continuous random vector $X^n = (X_1,X_2,\dots,X_n)$ and a random variable $Y$ with joint pdf $f_{X^n,Y}$ and well-defined conditional pdfs $f_{X_i,Y|X^{i-1}}$, $f_{X_i|X^{i-1}}$ and $f_{Y|X^{i-1}}$ for $i = 1,\dots,n$, we have that
$$I(X_1,X_2,\dots,X_n;Y) = \sum_{i=1}^{n} I(X_i;Y|X_{i-1},\dots,X_1),$$
where $I(X_i;Y|X_{i-1},\dots,X_1) \triangleq I(X_1;Y)$ for $i = 1$.

6. Data processing inequality: For continuous random variables $X$, $Y$ and $Z$ such that $X\to Y\to Z$, i.e., $X$ and $Z$ are conditionally independent given $Y$ (cf. Appendix B),
$$I(X;Y)\ge I(X;Z).$$
7. Independence bound for differential entropy: For a continuous random vector $X^n = (X_1,X_2,\dots,X_n)$,
$$h(X^n)\le\sum_{i=1}^{n} h(X_i)$$
with equality iff all the $X_i$'s are independent of each other.

8. Invariance of differential entropy under translation: For continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X+c) = h(X) \text{ for any constant } c\in\mathbb{R}, \quad\text{and}\quad h(X+Y|Y) = h(X|Y).$$
The result also generalizes to the multivariate case: for two continuous random vectors $X^n = (X_1,\dots,X_n)$ and $Y^n = (Y_1,\dots,Y_n)$ with joint pdf $f_{X^n,Y^n}$ and well-defined conditional pdf $f_{X^n|Y^n}$,
$$h(X^n + c^n) = h(X^n)$$
for any constant $n$-tuple $c^n = (c_1,c_2,\dots,c_n)\in\mathbb{R}^n$, and
$$h(X^n + Y^n|Y^n) = h(X^n|Y^n),$$
where the addition of two $n$-tuples is performed component-wise.

9. Differential entropy under scaling: For any continuous random variable $X$ and any non-zero real constant $a$,
$$h(aX) = h(X) + \log_2|a|$$
(a numerical illustration is sketched after Property 11 below).

10. Joint differential entropy under linear mapping: Consider the random (column) vector $X = (X_1,X_2,\dots,X_n)^T$ with joint pdf $f_{X^n}$, where $T$ denotes transposition, and let $Y = (Y_1,Y_2,\dots,Y_n)^T$ be a random (column) vector obtained from the linear transformation $Y = AX$, where $A$ is an invertible (non-singular) $n\times n$ real-valued matrix. Then
$$h(Y) = h(Y_1,Y_2,\dots,Y_n) = h(X_1,X_2,\dots,X_n) + \log_2|\det(A)|,$$
where $\det(A)$ is the determinant of the square matrix $A$.
11. Joint differential entropy under nonlinear mapping: Consider the random (column) vector $X = (X_1,X_2,\dots,X_n)^T$ with joint pdf $f_{X^n}$, and let $Y = (Y_1,Y_2,\dots,Y_n)^T$ be a random (column) vector obtained from the nonlinear transformation
$$Y = g(X) \triangleq (g_1(X_1),g_2(X_2),\dots,g_n(X_n))^T,$$
where each $g_i:\mathbb{R}\to\mathbb{R}$ is a differentiable function, $i = 1,2,\dots,n$. Then
$$h(Y) = h(Y_1,Y_2,\dots,Y_n) = h(X_1,\dots,X_n) + \int_{\mathbb{R}^n} f_{X^n}(x_1,\dots,x_n)\log_2|\det(J)|\,dx_1\cdots dx_n,$$
where $J$ is the $n\times n$ Jacobian matrix given by
$$J \triangleq \begin{pmatrix}
\frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n}\\
\frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial g_n}{\partial x_1} & \frac{\partial g_n}{\partial x_2} & \cdots & \frac{\partial g_n}{\partial x_n}
\end{pmatrix}.$$
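Properties such as 9 above are easy to probe numerically. The following sketch (Python/numpy; the exponential test distribution and the scale $a = 3$ are arbitrary illustrative choices) estimates $h(X)$ and $h(aX)$ by Monte Carlo and checks that they differ by $\log_2|a|$.

```python
import numpy as np

# Monte Carlo sketch of Property 9 (scaling): h(aX) = h(X) + log2|a|.
# X ~ Exp(1) is an arbitrary test distribution; a > 0 is an arbitrary scale.
rng = np.random.default_rng(1)
a = 3.0
x = rng.exponential(1.0, size=1_000_000)
y = a * x                                                # pdf of Y = aX is f_X(y/a)/|a|

h_x = np.mean(-np.log(np.exp(-x))) / np.log(2)           # E[-log2 f_X(X)], f_X(x) = e^{-x}
h_y = np.mean(-np.log(np.exp(-y / a) / a)) / np.log(2)   # f_Y(y) = e^{-y/a}/a

print(h_x, h_y, h_x + np.log2(a))   # h_y ≈ h_x + log2(a) ≈ 1.443 + 1.585
```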
Theorem 5.18 (Joint differential entropy of the multivariate Gaussian) If $X\sim\mathcal{N}_n(\mu,K_X)$ is a Gaussian random vector with mean vector $\mu$ and (positive-definite) covariance matrix $K_X$, then its joint differential entropy is given by
$$h(X) = h(X_1,X_2,\dots,X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right]. \tag{5.2.1}$$
In particular, in the univariate case of $n = 1$, (5.2.1) reduces to (5.1.1).

Proof:
Without loss of generality we assume that $X$ has a zero mean vector, since its differential entropy is invariant under translation by Property 8 of Lemma 5.14:
$$h(X) = h(X-\mu);$$
so we assume that $\mu = 0$.

Since the (positive-definite) covariance matrix $K_X$ is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square ($n\times n$) orthogonal matrix $A$ (i.e., satisfying $A^T = A^{-1}$) such that $AK_XA^T$ is a diagonal matrix whose entries are given by the eigenvalues of $K_X$. As a result, the linear transformation $Y = AX\sim\mathcal{N}_n\!\left(0,AK_XA^T\right)$ is a Gaussian vector with the diagonal covariance matrix $K_Y = AK_XA^T$ and has therefore independent components. Thus
$$\begin{aligned}
h(Y) &= h(Y_1,Y_2,\dots,Y_n)\\
&= h(Y_1) + h(Y_2) + \cdots + h(Y_n) &\text{(5.2.2)}\\
&= \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Y_i)\right] &\text{(5.2.3)}\\
&= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Y_i)\right]\\
&= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\det(K_Y)\right] &\text{(5.2.4)}\\
&= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] &\text{(5.2.5)}\\
&= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], &\text{(5.2.6)}
\end{aligned}$$
where (5.2.2) follows by the independence of the random variables $Y_1,\dots,Y_n$ (e.g., see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds since the matrix $K_Y$ is diagonal and hence its determinant is given by the product of its diagonal entries, and (5.2.5) holds since
$$\det(K_Y) = \det\!\left(AK_XA^T\right) = \det(A)\det(K_X)\det(A^T) = \det(A)^2\det(K_X) = \det(K_X),$$
where the last equality holds since $(\det(A))^2 = 1$, as the matrix $A$ is orthogonal ($A^T = A^{-1}\Rightarrow\det(A) = \det(A^T) = 1/\det(A)$; thus, $\det(A)^2 = 1$).

Now invoking Property 10 of Lemma 5.14 and noting that $|\det(A)| = 1$ yield that
$$h(Y_1,Y_2,\dots,Y_n) = h(X_1,X_2,\dots,X_n) + \underbrace{\log_2|\det(A)|}_{=0} = h(X_1,X_2,\dots,X_n).$$
We therefore obtain using (5.2.6) that
$$h(X_1,X_2,\dots,X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],$$
hence completing the proof. □
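A small numerical cross-check of (5.2.1): the sketch below (Python/numpy, with an arbitrarily chosen $2\times 2$ covariance matrix) compares the closed form with a Monte Carlo estimate of $E[-\log_2 f_X(X)]$.

```python
import numpy as np

# Sketch: check (5.2.1) for an arbitrary 2x2 positive-definite K_X against a
# Monte Carlo estimate of E[-log2 f_X(X)] for X ~ N_2(0, K_X).
rng = np.random.default_rng(2)
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])
n = K.shape[0]

h_closed = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

x = rng.multivariate_normal(np.zeros(n), K, size=500_000)
Kinv = np.linalg.inv(K)
log_pdf = (-0.5 * np.sum((x @ Kinv) * x, axis=1)
           - 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(K)))
h_mc = np.mean(-log_pdf) / np.log(2)

print(h_closed, h_mc)   # both approximately 4.32 bits
```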
An important fact: Among all real-valued size-$n$ random vectors (of support $\mathbb{R}^n$) with identical mean vector and covariance matrix, the Gaussian random vector has the largest differential entropy. The proof of this fact requires the following inequality.

Corollary 5.19 (Hadamard's inequality) For any real-valued $n\times n$ positive-definite matrix $K = [K_{i,j}]_{i,j=1,\dots,n}$,
$$\det(K)\le\prod_{i=1}^{n} K_{i,i}$$
with equality iff $K$ is a diagonal matrix, where the $K_{i,i}$ are the diagonal entries of $K$.

Theorem 5.20 (Maximal differential entropy for real-valued random vectors) Let $X = (X_1,X_2,\dots,X_n)^T$ be a real-valued random vector with support $S_{X^n} = \mathbb{R}^n$, mean vector $\mu$ and covariance matrix $K_X$. Then
$$h(X_1,X_2,\dots,X_n)\le\frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], \tag{5.2.11}$$
with equality iff $X$ is Gaussian; i.e., $X\sim\mathcal{N}_n(\mu,K_X)$.

Proof: We will present the proof in two parts: the scalar or univariate case, and the multivariate case.

(i) Scalar case ($n = 1$): For a real-valued random variable with support $S_X = \mathbb{R}$, mean $\mu$ and variance $\sigma^2$, let us show that
$$h(X)\le\frac{1}{2}\log_2\left(2\pi e\sigma^2\right), \tag{5.2.12}$$
with equality iff $X\sim\mathcal{N}(\mu,\sigma^2)$.
For a Gaussian random variable $Y\sim\mathcal{N}(\mu,\sigma^2)$, using the non-negativity of divergence, we can write
$$\begin{aligned}
0 &\le D(X\|Y)\\
&= \int_{\mathbb{R}} f_X(x)\log_2\frac{f_X(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}}\,dx\\
&= -h(X) + \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx\\
&= -h(X) + \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\underbrace{\int_{\mathbb{R}}(x-\mu)^2 f_X(x)\,dx}_{=\,\sigma^2}\\
&= -h(X) + \frac{1}{2}\log_2\left(2\pi e\sigma^2\right).
\end{aligned}$$
Thus
$$h(X)\le\frac{1}{2}\log_2\left(2\pi e\sigma^2\right),$$
with equality iff $X = Y$ (almost surely); i.e., $X\sim\mathcal{N}(\mu,\sigma^2)$.
(ii) Multivariate case ($n > 1$): As in the proof of Theorem 5.18, we can use an orthogonal square matrix $A$ (i.e., satisfying $A^T = A^{-1}$ and hence $|\det(A)| = 1$) such that $AK_XA^T$ is diagonal. Therefore, the random vector generated by the linear map
$$Z = AX$$
will have a covariance matrix given by $K_Z = AK_XA^T$ and hence have uncorrelated (but not necessarily independent) components.

Thus
$$\begin{aligned}
h(X) &= h(Z) - \underbrace{\log_2|\det(A)|}_{=0} &\text{(5.2.13)}\\
&= h(Z_1,Z_2,\dots,Z_n)\\
&\le\sum_{i=1}^{n} h(Z_i) &\text{(5.2.14)}\\
&\le\sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Z_i)\right] &\text{(5.2.15)}\\
&= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Z_i)\right]\\
&= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_Z)\right] &\text{(5.2.16)}\\
&= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] &\text{(5.2.17)}\\
&= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],
\end{aligned}$$
where (5.2.13) holds by Property 10 of Lemma 5.14 and since $|\det(A)| = 1$, (5.2.14) follows from Property 7 of Lemma 5.14, (5.2.15) follows from (5.2.12) (the scalar case above), (5.2.16) holds since $K_Z$ is diagonal, and (5.2.17) follows from the fact that $\det(K_Z) = \det(K_X)$ (as $A$ is orthogonal).
Finally, equality is achieved in both (5.2.14) and (5.2.15) iff the random variables $Z_1,Z_2,\dots,Z_n$ are Gaussian and independent of each other, or equivalently iff $X\sim\mathcal{N}_n(\mu,K_X)$. □

Observation 5.21 The following two results can also be shown (the proof is left as an exercise):

1. Among all continuous random variables admitting a pdf with support the interval $(a,b)$, where $b > a$ are real numbers, the uniformly distributed random variable maximizes differential entropy.

2. Among all continuous random variables admitting a pdf with support the interval $[0,\infty)$ and finite mean $\mu$, the exponential distribution with parameter (or rate parameter) $\lambda = 1/\mu$ maximizes differential entropy.

A systematic approach to finding distributions that maximize differential entropy subject to various support and moment constraints can be found in [13, 52].
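The maximal-entropy statements can be illustrated by comparing closed-form differential entropies of a few zero-mean, unit-variance densities; a brief sketch (Python/numpy) follows.

```python
import numpy as np

# Sketch: closed-form differential entropies (in bits) of three zero-mean,
# unit-variance densities, illustrating that the Gaussian is the largest
# (Theorem 5.20).  The Laplace/uniform parameters are chosen so that Var = 1.
ln2 = np.log(2)

h_gauss   = 0.5 * np.log2(2 * np.pi * np.e)   # N(0, 1)
h_uniform = np.log2(np.sqrt(12.0))            # Uniform(-sqrt(3), sqrt(3))
b = 1 / np.sqrt(2)                            # Laplace scale with variance 2*b^2 = 1
h_laplace = (1 + np.log(2 * b)) / ln2

print(h_gauss, h_laplace, h_uniform)   # ≈ 2.047 > 1.943 > 1.792 bits
```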
AEP for continuous memoryless sources

The extension of the AEP theorem from the discrete case to the continuous case is not based on counting (which always yields infinity for continuous sources), but on volume measuring.

Theorem 5.22 (AEP for continuous memoryless sources) Let $\{X_i\}_{i=1}^{\infty}$ be a continuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random variables) with pdf $f_X(\cdot)$ and differential entropy $h(X)$. Then
$$-\frac{1}{n}\log_2 f_{X^n}(X_1,\dots,X_n)\to E[-\log_2 f_X(X)] = h(X)\quad\text{in probability.}$$
Proof: The proof is an immediate result of the law of large numbers (e.g., see Theorem 3.3). □
Definition 5.23 (Typical set) For $\delta > 0$ and any given $n$, define the typical set for the above continuous source as
$$\mathcal{F}_n(\delta)\triangleq\left\{x^n\in\mathbb{R}^n:\left|-\frac{1}{n}\log_2 f_{X^n}(x_1,\dots,x_n) - h(X)\right| < \delta\right\}.$$

Definition 5.24 (Volume) The volume of a set $A\subseteq\mathbb{R}^n$ is defined as
$$\mathrm{Vol}(A)\triangleq\int_{A} dx_1\cdots dx_n.$$

Theorem 5.25 (Consequence of the AEP for continuous memoryless sources) For a continuous memoryless source $\{X_i\}_{i=1}^{\infty}$ with differential entropy $h(X)$, the following hold.

1. For $n$ sufficiently large, $P_{X^n}\{\mathcal{F}_n(\delta)\} > 1-\delta$.
2. $\mathrm{Vol}(\mathcal{F}_n(\delta))\le 2^{n(h(X)+\delta)}$ for all $n$.
3. $\mathrm{Vol}(\mathcal{F}_n(\delta))\ge(1-\delta)\,2^{n(h(X)-\delta)}$ for $n$ sufficiently large.

Proof: The proof is quite analogous to the corresponding theorem for discrete memoryless sources (Theorem 3.4) and is left as an exercise. □
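Item 1 of Theorem 5.25 can be visualized by simulation. The sketch below (Python/numpy; the Gaussian source and the values of $n$ and $\delta$ are illustrative choices) estimates the probability that $-\frac{1}{n}\log_2 f_{X^n}(X^n)$ falls within $\delta$ of $h(X)$.

```python
import numpy as np

# Sketch of Theorem 5.25(1) for a Gaussian memoryless source: estimate the
# probability that -(1/n) log2 f_{X^n}(X^n) lies within delta of h(X).
rng = np.random.default_rng(3)
sigma, n, delta, trials = 1.0, 500, 0.1, 10_000

h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)       # h(X) from (5.1.1)

x = rng.normal(0.0, sigma, size=(trials, n))
log2_pdf = (-0.5 * np.log2(2 * np.pi * sigma**2)
            - x**2 / (2 * sigma**2) * np.log2(np.e))
emp = -np.mean(log2_pdf, axis=1)                     # -(1/n) log2 f_{X^n}(x^n)

print(np.mean(np.abs(emp - h) < delta))              # well above 1 - delta here; -> 1 as n grows
```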
Capacity for the discrete memoryless Gaussian channel

Definition 5.26 (Discrete-time continuous memoryless channels) Consider a discrete-time channel with continuous input and output alphabets given by $\mathcal{X}\subseteq\mathbb{R}$ and $\mathcal{Y}\subseteq\mathbb{R}$, respectively, and described by a sequence of $n$-dimensional transition (conditional) pdfs $\{f_{Y^n|X^n}(y^n|x^n)\}_{n=1}^{\infty}$ that govern the reception of $y^n = (y_1,y_2,\dots,y_n)\in\mathcal{Y}^n$ at the channel output when $x^n = (x_1,x_2,\dots,x_n)\in\mathcal{X}^n$ is sent as the channel input.

The channel (without feedback) is said to be memoryless with a given (marginal) transition pdf $f_{Y|X}$ if its sequence of transition pdfs $f_{Y^n|X^n}$ satisfies
$$f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i) \tag{5.4.1}$$
for every $n = 1,2,\dots$, $x^n\in\mathcal{X}^n$ and $y^n\in\mathcal{Y}^n$.
Average cost constraint $(t,P)$ on any input $n$-tuple $x^n = (x_1,x_2,\dots,x_n)$ transmitted over the channel: we require that
$$\frac{1}{n}\sum_{i=1}^{n} t(x_i)\le P, \tag{5.4.2}$$
where $t(\cdot)$ is a given non-negative real-valued function describing the cost for transmitting an input symbol, and $P$ is a given positive number representing the maximal average amount of available resources per input symbol.

Definition 5.27 The capacity (or capacity-cost function) of a discrete-time continuous memoryless channel with input average cost constraint $(t,P)$ is denoted by $C(P)$ and defined as
$$C(P)\triangleq\sup_{F_X:\,E[t(X)]\le P} I(X;Y)\quad\text{(in bits/channel use)} \tag{5.4.3}$$
where the supremum is over all input distributions $F_X$.
Property A.4 (Properties of the supremum)

3. If $-\infty < \sup A < \infty$, then $(\forall\,\varepsilon>0)(\exists\,a_0\in A)\ a_0 > \sup A - \varepsilon$.
(The existence of $a_0\in(\sup A-\varepsilon,\sup A]$ for any $\varepsilon>0$ under the condition $|\sup A| < \infty$ is called the approximation property for the supremum.)

Lemma 5.28 (Concavity of capacity) If $C(P)$ as defined in (5.4.3) is finite for any $P > 0$, then it is concave, continuous and strictly increasing in $P$.

Proof: Fix $P_1 > 0$ and $P_2 > 0$. Since $C(P)$ is finite for any $P > 0$, by the 3rd property in Property A.4 there exist two input distributions $F_{X_1}$ and $F_{X_2}$ such that for all $\varepsilon > 0$,
$$I(X_i;Y_i)\ge C(P_i)-\varepsilon \tag{5.4.4}$$
and
$$E[t(X_i)]\le P_i, \tag{5.4.5}$$
where $X_i$ denotes the input with distribution $F_{X_i}$ and $Y_i$ is the corresponding channel output, for $i = 1,2$.
(We do not know whether the capacity is achievable or not at the moment!)
Now, for $0\le\lambda\le 1$, let $X_\lambda$ be a random variable with distribution
$$F_{X_\lambda} = \lambda F_{X_1} + (1-\lambda)F_{X_2}.$$
Then by (5.4.5),
$$E_{X_\lambda}[t(X)] = \lambda E_{X_1}[t(X)] + (1-\lambda)E_{X_2}[t(X)]\le\lambda P_1 + (1-\lambda)P_2. \tag{5.4.6}$$
Furthermore,
$$\begin{aligned}
C(\lambda P_1 + (1-\lambda)P_2) &= \sup_{F_X:\,E[t(X)]\le\lambda P_1+(1-\lambda)P_2} I(F_X,f_{Y|X})\\
&\ge I(F_{X_\lambda},f_{Y|X})\\
&\ge\lambda I(F_{X_1},f_{Y|X}) + (1-\lambda)I(F_{X_2},f_{Y|X})\\
&= \lambda I(X_1;Y_1) + (1-\lambda)I(X_2;Y_2)\\
&\ge\lambda\left[C(P_1)-\varepsilon\right] + (1-\lambda)\left[C(P_2)-\varepsilon\right],
\end{aligned}$$
where the first inequality holds by (5.4.6), the second inequality follows from the concavity of the mutual information with respect to its first argument (cf. Lemma 2.46), and the third inequality follows from (5.4.4).
Letting $\varepsilon\downarrow 0$ yields that
$$C(\lambda P_1 + (1-\lambda)P_2)\ge\lambda C(P_1) + (1-\lambda)C(P_2),$$
and hence $C(P)$ is concave in $P$.
Finally, it can directly be seen by definition that $C(\cdot)$ is non-decreasing, which, together with its concavity, implies that it is continuous and strictly increasing. □

- The most commonly used cost function is the power cost function, $t(x) = x^2$, resulting in the average power constraint $P$ for each transmitted input $n$-tuple:
$$\frac{1}{n}\sum_{i=1}^{n} x_i^2\le P. \tag{5.4.7}$$
- For better understanding, we only focus on the discrete-time memoryless (additive white) Gaussian (noise) channel with average input power constraint $P$:
$$Y_i = X_i + Z_i,\quad\text{for } i = 1,2,\dots, \tag{5.4.8}$$
where $Y_i$, $X_i$ and $Z_i$ are the channel output, input and noise at time $i$, $\{Z_i\}_{i=1}^{\infty}$ is i.i.d. Gaussian with mean zero and variance $\sigma^2$, and $\{X_i\}_{i=1}^{\infty}$ is independent of $\{Z_i\}_{i=1}^{\infty}$.
Derivation of the capacity $C(P)$ for the discrete-time memoryless (additive white) Gaussian (noise) channel with average input power constraint $P$:

- For Gaussian distributed $Z$, we obtain
$$\begin{aligned}
I(X;Y) &= h(Y) - h(Y|X)\\
&= h(Y) - h(X+Z|X) &\text{(5.4.9)}\\
&= h(Y) - h(Z|X) &\text{(5.4.10)}\\
&= h(Y) - h(Z) &\text{(5.4.11)}\\
&= h(Y) - \frac{1}{2}\log_2\left(2\pi e\sigma^2\right), &\text{(5.4.12)}
\end{aligned}$$
where (5.4.9) follows from (5.4.8), (5.4.10) holds since differential entropy is invariant under translation (see Property 8 of Lemma 5.14), (5.4.11) follows from the independence of $X$ and $Z$, and (5.4.12) holds since $Z\sim\mathcal{N}(0,\sigma^2)$ is Gaussian (see (5.1.1)).
- Now since $Y = X + Z$, we have that
$$E[Y^2] = E[X^2] + E[Z^2] + 2E[X]E[Z] = E[X^2] + \sigma^2 + 2E[X]\cdot 0\le P + \sigma^2,$$
since the input in (5.4.3) is constrained to satisfy $E[X^2]\le P$.
- Thus the variance of $Y$ satisfies
$$\mathrm{Var}(Y)\le E[Y^2]\le P + \sigma^2,$$
and
$$h(Y)\le\frac{1}{2}\log_2\left(2\pi e\,\mathrm{Var}(Y)\right)\le\frac{1}{2}\log_2\left(2\pi e(P+\sigma^2)\right),$$
where the first inequality follows by Theorem 5.20 since $Y$ is real-valued (with support $\mathbb{R}$).
- Noting that equality holds in the first inequality above iff $Y$ is Gaussian, and in the second inequality iff $\mathrm{Var}(Y) = P+\sigma^2$, we obtain that choosing the input as $X\sim\mathcal{N}(0,P)$ yields $Y\sim\mathcal{N}(0,P+\sigma^2)$ and hence maximizes $I(X;Y)$ over all inputs satisfying $E[X^2]\le P$.
- Thus, the capacity of the discrete-time memoryless Gaussian channel with input average power constraint $P$ and noise variance (or power) $\sigma^2$ is given by
$$C(P) = \frac{1}{2}\log_2\left(2\pi e(P+\sigma^2)\right) - \frac{1}{2}\log_2\left(2\pi e\sigma^2\right) = \frac{1}{2}\log_2\left(1+\frac{P}{\sigma^2}\right). \tag{5.4.13}$$
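The following sketch (Python/numpy, with illustrative $P$ and $\sigma^2$) evaluates (5.4.13) and confirms it with a Monte Carlo estimate of $I(X;Y) = E\bigl[\log_2\frac{f_{Y|X}(Y|X)}{f_Y(Y)}\bigr]$ under the capacity-achieving Gaussian input $X\sim\mathcal{N}(0,P)$.

```python
import numpy as np

# Sketch: evaluate (5.4.13) and compare with a Monte Carlo estimate of I(X;Y)
# for X ~ N(0,P), Y = X + Z, Z ~ N(0,sigma2); here f_Y is N(0, P + sigma2).
rng = np.random.default_rng(4)
P, sigma2, N = 2.0, 1.0, 1_000_000

C = 0.5 * np.log2(1 + P / sigma2)

x = rng.normal(0.0, np.sqrt(P), size=N)
z = rng.normal(0.0, np.sqrt(sigma2), size=N)
y = x + z
log2_f_y_given_x = (-0.5 * np.log2(2 * np.pi * sigma2)
                    - (y - x)**2 / (2 * sigma2) * np.log2(np.e))
log2_f_y = (-0.5 * np.log2(2 * np.pi * (P + sigma2))
            - y**2 / (2 * (P + sigma2)) * np.log2(np.e))
I_mc = np.mean(log2_f_y_given_x - log2_f_y)

print(C, I_mc)   # both approximately 0.792 bits/channel use
```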
Definition 5.30 (Fixed-length data transmission code) Given positive integers $n$ and $M$, and a discrete-time memoryless Gaussian channel with input average power constraint $P$, a fixed-length data transmission code (or block code) $\mathcal{C}_n = (n,M)$ for this channel with blocklength $n$ and rate $\frac{1}{n}\log_2 M$ message bits per channel symbol (or channel use) consists of:

1. $M$ information messages intended for transmission.

2. An encoding function
$$f:\{1,2,\dots,M\}\to\mathbb{R}^n$$
yielding real-valued codewords $c_1 = f(1)$, $c_2 = f(2)$, $\dots$, $c_M = f(M)$, where each codeword $c_m = (c_{m1},\dots,c_{mn})$ is of length $n$ and satisfies the power constraint $P$:
$$\frac{1}{n}\sum_{i=1}^{n} c_{mi}^2\le P,$$
for $m = 1,2,\dots,M$.

3. A decoding function $g:\mathbb{R}^n\to\{1,2,\dots,M\}$.
Theorem 5.31 (Shannon's coding theorem for the memoryless Gaussian channel) Consider a discrete-time memoryless Gaussian channel with input average power constraint $P$, channel noise variance $\sigma^2$ and capacity $C(P)$ as given by (5.4.13).

Forward part (achievability): For any $\varepsilon\in(0,1)$, there exist $0 < \gamma < 2\varepsilon$ and a sequence of data transmission block codes $\{\mathcal{C}_n = (n,M_n)\}_{n=1}^{\infty}$ satisfying
$$\frac{1}{n}\log_2 M_n > C(P)-\gamma$$
with each codeword $c = (c_1,c_2,\dots,c_n)$ in $\mathcal{C}_n$ satisfying
$$\frac{1}{n}\sum_{i=1}^{n} c_i^2\le P \tag{5.4.14}$$
such that the probability of error $P_e(\mathcal{C}_n) < \varepsilon$ for sufficiently large $n$.

Converse part: If for any sequence of data transmission block codes $\{\mathcal{C}_n = (n,M_n)\}_{n=1}^{\infty}$ whose codewords satisfy (5.4.14) we have that
$$\liminf_{n\to\infty}\frac{1}{n}\log_2 M_n > C(P),$$
then the code's probability of error $P_e(\mathcal{C}_n)$ is bounded away from zero for all $n$ sufficiently large.
Proof of the forward part: The theorem holds trivially when $C(P) = 0$, because we can choose $M_n = 1$ for every $n$ and have $P_e(\mathcal{C}_n) = 0$. Hence, we assume without loss of generality that $C(P) > 0$.

Step 0:
- Take a positive $\gamma$ satisfying $\gamma < \min\{2\varepsilon, C(P)\}$.
- Pick $\zeta > 0$ small enough such that $2[C(P)-C(P-\zeta)] < \gamma$, where the existence of such a $\zeta$ is assured by the strictly increasing property of $C(P)$. Hence, we have
$$C(P-\zeta)-\frac{\gamma}{2} > C(P)-\gamma > 0.$$
- Choose $M_n$ to satisfy
$$C(P-\zeta)-\frac{\gamma}{2} > \frac{1}{n}\log_2 M_n > C(P)-\gamma,$$
for which the choice should exist for all sufficiently large $n$.
- Take $\delta = \gamma/8$.
- Let $F_X$ be the distribution that achieves $C(P-\zeta)$, where $C(\cdot)$ is given by (5.4.13). In this case, $F_X$ is the Gaussian distribution with mean zero and variance $P-\zeta$ and admits a pdf $f_X$. Hence, $E[X^2] = P-\zeta < P$ and $I(X;Y) = C(P-\zeta)$.
Step 1: Random coding with average power constraint.

Randomly draw $M_n$ codewords according to the pdf $f_{X^n}$ with
$$f_{X^n}(x^n) = \prod_{i=1}^{n} f_X(x_i).$$
By the law of large numbers, each randomly selected codeword $c_m = (c_{m1},\dots,c_{mn})$ satisfies
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} c_{mi}^2 = E[X^2] = P-\zeta < P$$
for $m = 1,2,\dots,M_n$.
Step 2: Code construction.

- For the $M_n$ selected codewords $\{c_1,\dots,c_{M_n}\}$, replace the codewords that violate the power constraint (i.e., (5.4.14)) by an all-zero codeword $\mathbf{0}$.
- Define the encoder as
$$f_n(m) = c_m\quad\text{for } 1\le m\le M_n.$$
- Given a received output sequence $y^n$, the decoder $g_n(\cdot)$ is given by
$$g_n(y^n) = \begin{cases}
m, & \text{if } (c_m,y^n)\in\mathcal{F}_n(\delta)\ \text{and}\ (\forall\,m'\ne m)\ (c_{m'},y^n)\notin\mathcal{F}_n(\delta),\\
\text{arbitrary}, & \text{otherwise,}
\end{cases}$$
where the set
$$\mathcal{F}_n(\delta)\triangleq\Bigl\{(x^n,y^n)\in\mathcal{X}^n\times\mathcal{Y}^n:
\Bigl|-\tfrac{1}{n}\log_2 f_{X^nY^n}(x^n,y^n)-h(X,Y)\Bigr| < \delta,\
\Bigl|-\tfrac{1}{n}\log_2 f_{X^n}(x^n)-h(X)\Bigr| < \delta,\
\text{and}\ \Bigl|-\tfrac{1}{n}\log_2 f_{Y^n}(y^n)-h(Y)\Bigr| < \delta\Bigr\}.$$

Note that $\mathcal{F}_n(\delta)$ is generated by
$$f_{X^nY^n}(x^n,y^n) = \prod_{i=1}^{n} f_{XY}(x_i,y_i),$$
where $f_{X^nY^n}(x^n,y^n)$ is the joint input-output pdf realized when the memoryless Gaussian channel (with $n$-fold transition pdf
$$f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i)\,)$$
is driven by the input $X^n$ with pdf
$$f_{X^n}(x^n) = \prod_{i=1}^{n} f_X(x_i)$$
(where $f_X$ achieves $C(P-\zeta)$).

Step 3: Conditional probability of error.


Let m = m(Cn ) denote the conditional error probability given codeword
m is transmitted (with respect to code Cn ).
Dene  $
n
n 1
E0  x X :
n
x2i > P .
n i=1
Then
 Mn 

m(Cn ) fY n|X n (y n|cm) dy n + fY n|X n (y n |cm ) dy n ,
y n Fn (|cm ) y n Fn (|cm )
m =1
m =m

where
Fn (|xn)  {y n Y n : (xn, y n ) Fn()} .
Capacity for discrete memoryless Gaussian channel I: 5-51

By taking expectation with respect to the mth codeword-selecting distribu-


tion fX n (cm), we obtain

E[m ] = fX n (cm)m(Cn ) dcm
cmX 
n

= fX n (cm )m(Cn) dcm + fX n (cm)m (Cn) dcm


cmX E0 cm X n E0c
n

fX n (cm ) dcm + fX n (cm )m(Cn) dcm
cm E0 cm X n
 
PX n (E0) + fX n (cm)fY n|X n (y n |cm) dy n dcm
cm X n y n Fn (|cm )
 Mn 

+ fX n (cm)fY n|X n (y n |cm ) dy n dcm .
cm X n y n Fn (|cm )
m =1
m =m
= PX n (E0) + PX n,Y n (Fnc ())
Mn  
+ fX n,Y n (cm , y n ) dy n dcm . (5.4.15)
cm X n y n Fn (|cm )
m =1
m =m
- The additional term $P_{X^n}(\mathcal{E}_0)$ in (5.4.15) copes with the errors due to the all-zero codeword replacement; it will be less than $\delta$ for all sufficiently large $n$ by the law of large numbers.
- Finally, by carrying out a procedure similar to the proof of the channel coding theorem for discrete channels (cf. Theorem 4.10), we obtain:
$$\begin{aligned}
E[P_e(\mathcal{C}_n)] &\le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta))
+ M_n\,2^{n(h(X,Y)+\delta)}\,2^{-n(h(X)-\delta)}\,2^{-n(h(Y)-\delta)}\\
&\le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta))
+ 2^{n(C(P-\zeta)-4\delta)}\,2^{-n(I(X;Y)-3\delta)}\\
&= P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) + 2^{-n\delta}.
\end{aligned}$$
Accordingly, we can make the average probability of error, $E[P_e(\mathcal{C}_n)]$, less than $3\delta = 3\gamma/8 < 3\varepsilon/4 < \varepsilon$ for all sufficiently large $n$. □
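The random-coding argument can be mimicked numerically. The sketch below (Python/numpy) is a simplified experiment: it draws a Gaussian codebook at a rate below capacity and uses nearest-neighbour (maximum-likelihood) decoding in place of the typical-set decoder of the proof, and it omits the all-zero replacement of power-violating codewords; all parameter values are illustrative.

```python
import numpy as np

# Simplified random-coding experiment in the spirit of the forward part.
rng = np.random.default_rng(5)
P, sigma2, n, R, trials = 1.0, 1.0, 100, 0.125, 200
C = 0.5 * np.log2(1 + P / sigma2)                 # 0.5 bit/channel use
M = int(2 ** (n * R))                             # number of messages at rate R < C

codebook = rng.normal(0.0, np.sqrt(P), size=(M, n))   # i.i.d. Gaussian codewords

errors = 0
for _ in range(trials):
    m = rng.integers(M)                           # uniformly chosen message
    y = codebook[m] + rng.normal(0.0, np.sqrt(sigma2), size=n)
    m_hat = np.argmin(np.sum((codebook - y) ** 2, axis=1))   # ML (nearest-neighbour) decoding
    errors += (m_hat != m)

print(f"R = {R} < C = {C}, empirical block error rate = {errors / trials}")
```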
Proof of the converse part: Consider an $(n,M_n)$ block data transmission code satisfying the power constraint
$$\frac{1}{n}\sum_{i=1}^{n} c_i^2\le P$$
with encoding function
$$f_n:\{1,2,\dots,M_n\}\to\mathcal{X}^n$$
and decoding function
$$g_n:\mathcal{Y}^n\to\{1,2,\dots,M_n\}.$$
Since the message $W$ is uniformly distributed over $\{1,2,\dots,M_n\}$, we have
$$H(W) = \log_2 M_n.$$
Since $W\to X^n = f_n(W)\to Y^n$ forms a Markov chain (as $Y^n$ only depends on $X^n$), we obtain by the data processing lemma that
$$I(W;Y^n)\le I(X^n;Y^n).$$

We can also bound $I(X^n;Y^n)$ by $nC(P)$ as follows:
$$\begin{aligned}
I(X^n;Y^n) &\le\sup_{F_{X^n}:\,(1/n)\sum_{i=1}^{n}E[X_i^2]\le P} I(X^n;Y^n)\\
&\le\sup_{F_{X^n}:\,(1/n)\sum_{i=1}^{n}E[X_i^2]\le P}\ \sum_{j=1}^{n} I(X_j;Y_j) &\text{(by Theorem 2.21)}\\
&= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sup_{F_{X^n}:\,(\forall i)\,E[X_i^2]\le P_i}\ \sum_{j=1}^{n} I(X_j;Y_j)\\
&\le\sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sum_{j=1}^{n}\ \sup_{F_{X^n}:\,(\forall i)\,E[X_i^2]\le P_i} I(X_j;Y_j)\\
&= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sum_{j=1}^{n}\ \sup_{F_{X_j}:\,E[X_j^2]\le P_j} I(X_j;Y_j)\\
&= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sum_{j=1}^{n} C(P_j)
= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ n\cdot\frac{1}{n}\sum_{j=1}^{n} C(P_j)\\
&\le\sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ n\,C\!\left(\frac{1}{n}\sum_{j=1}^{n} P_j\right) &\text{(by concavity of } C(P))\\
&= nC(P).
\end{aligned}$$

Consequently, recalling that $P_e(\mathcal{C}_n)$ is the average error probability incurred by guessing $W$ from observing $Y^n$ via the decoding function $g_n:\mathcal{Y}^n\to\{1,2,\dots,M_n\}$, we get
$$\begin{aligned}
\log_2 M_n &= H(W)\\
&= H(W|Y^n) + I(W;Y^n)\\
&\le H(W|Y^n) + I(X^n;Y^n)\\
&\le\underbrace{h_b(P_e(\mathcal{C}_n)) + P_e(\mathcal{C}_n)\log_2(|\mathcal{W}|-1)}_{\text{Fano's inequality}} + nC(P)\\
&\le 1 + P_e(\mathcal{C}_n)\log_2(M_n-1) + nC(P) \qquad\text{(by the fact that } (\forall\,t\in[0,1])\ h_b(t)\le 1)\\
&< 1 + P_e(\mathcal{C}_n)\log_2 M_n + nC(P),
\end{aligned}$$
which implies that
$$P_e(\mathcal{C}_n) > 1-\frac{C(P)}{(1/n)\log_2 M_n}-\frac{1}{\log_2 M_n}.$$
So if
$$\liminf_{n\to\infty}\frac{1}{n}\log_2 M_n > C(P),$$
then there exist $\rho > 0$ and an integer $N$ such that for $n\ge N$,
$$\frac{1}{n}\log_2 M_n > C(P)+\rho.$$
Hence, for $n\ge N_0\triangleq\max\{N,2/\rho\}$,
$$P_e(\mathcal{C}_n)\ge 1-\frac{C(P)}{C(P)+\rho}-\frac{1}{n(C(P)+\rho)}\ge\frac{\rho}{2(C(P)+\rho)}. \qquad\Box$$
Theorem 5.32 (Gaussian noise minimizes the capacity of additive-noise channels) Every discrete-time continuous memoryless channel with additive noise (admitting a pdf) of mean zero and variance $\sigma^2$ and input average power constraint $P$ has its capacity $C(P)$ lower bounded by the capacity of the memoryless Gaussian channel with identical input constraint and noise variance:
$$C(P)\ge\frac{1}{2}\log_2\left(1+\frac{P}{\sigma^2}\right).$$

Proof:
- Let $f_{Y|X}$ and $f_{Y_g|X_g}$ denote the transition pdfs of the additive-noise channel and the Gaussian channel, respectively, where both channels satisfy the input average power constraint $P$.
- Let $Z$ and $Z_g$ respectively denote their zero-mean noise variables of identical variance $\sigma^2$.
Then we have
$$\begin{aligned}
&I(f_{X_g},f_{Y|X}) - I(f_{X_g},f_{Y_g|X_g})\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_Z(y-x)}{f_Y(y)}\,dy\,dx
-\int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_{Z_g}(y-x)\log_2\frac{f_{Z_g}(y-x)}{f_{Y_g}(y)}\,dy\,dx\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_Z(y-x)}{f_Y(y)}\,dy\,dx
-\int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_{Z_g}(y-x)}{f_{Y_g}(y)}\,dy\,dx\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_Z(y-x)f_{Y_g}(y)}{f_{Z_g}(y-x)f_Y(y)}\,dy\,dx\\
&\ge\int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)(\log_2 e)\left[1-\frac{f_{Z_g}(y-x)f_Y(y)}{f_Z(y-x)f_{Y_g}(y)}\right]dy\,dx\\
&= (\log_2 e)\left[1-\int_{\mathcal{Y}}\frac{f_Y(y)}{f_{Y_g}(y)}\left(\int_{\mathcal{X}} f_{X_g}(x)f_{Z_g}(y-x)\,dx\right)dy\right] = 0,
\end{aligned}$$
with equality holding in the inequality iff $f_Y(y)/f_{Y_g}(y) = f_Z(y-x)/f_{Z_g}(y-x)$ for all $x$.
(The second equality above holds because $\log_2 f_{Z_g}(y-x)$ and $\log_2 f_{Y_g}(y)$ are quadratic in $y-x$ and $y$, respectively, so their expectations depend on the noise only through its first and second moments, which are identical for $Z$ and $Z_g$; the inequality uses $\log_2 t\ge(\log_2 e)(1-1/t)$.)
Therefore,
$$\begin{aligned}
\frac{1}{2}\log_2\left(1+\frac{P}{\sigma^2}\right) &= \sup_{F_X:\,E[X^2]\le P} I(F_X,f_{Y_g|X_g})\\
&= I(f_{X_g},f_{Y_g|X_g})\\
&\le I(f_{X_g},f_{Y|X})\\
&\le\sup_{F_X:\,E[X^2]\le P} I(F_X,f_{Y|X})\\
&= C(P). \qquad\Box
\end{aligned}$$
The water-filling principle

Theorem 5.34 (Capacity of uncorrelated parallel Gaussian channels) The capacity of $k$ uncorrelated parallel Gaussian channels under an overall input power constraint $P$ is given by
$$C(P) = \sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\sigma_i^2}\right),$$
where $\sigma_i^2$ is the noise variance of channel $i$,
$$P_i = \max\{0,\theta-\sigma_i^2\},$$
and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} P_i = P$. This capacity is achieved by a tuple of independent Gaussian inputs $(X_1,X_2,\dots,X_k)$, where $X_i\sim\mathcal{N}(0,P_i)$ is the input to channel $i$, for $i = 1,2,\dots,k$.

[Figure: water-filling of the total power $P = P_1 + P_2 + P_4$ over the noise levels $\sigma_1^2,\sigma_2^2,\sigma_3^2,\sigma_4^2$; channel 3, whose noise level $\sigma_3^2$ exceeds the water level $\theta$, receives no power.]
Proof:
- By definition,
$$C(P) = \sup_{F_{X^k}:\,\sum_{i=1}^{k}E[X_i^2]\le P} I(X^k;Y^k).$$
- Since the noise random variables $Z_1,\dots,Z_k$ are independent of each other,
$$\begin{aligned}
I(X^k;Y^k) &= h(Y^k)-h(Y^k|X^k)\\
&= h(Y^k)-h(Z^k+X^k|X^k)\\
&= h(Y^k)-h(Z^k|X^k)\\
&= h(Y^k)-h(Z^k)\\
&= h(Y^k)-\sum_{i=1}^{k} h(Z_i)\\
&\le\sum_{i=1}^{k} h(Y_i)-\sum_{i=1}^{k} h(Z_i)\\
&\le\sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\sigma_i^2}\right).
\end{aligned}$$
- Equalities hold above if all the inputs $X_i$ are independent of each other with each $X_i\sim\mathcal{N}(0,P_i)$ such that $\sum_{i=1}^{k} P_i = P$.
- Thus the problem is reduced to finding the power allotment that maximizes the overall capacity subject to the equality constraint $\sum_{i=1}^{k} P_i = P$ and the inequality constraints $P_i\ge 0$, $i = 1,\dots,k$.
- By using the Lagrange multiplier technique and verifying the KKT conditions (see Example B.22 in Appendix B.10), the maximizer $(P_1,\dots,P_k)$ of
$$\max\left\{\sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\sigma_i^2}\right)+\sum_{i=1}^{k}\lambda_i P_i-\lambda\left(\sum_{i=1}^{k} P_i-P\right)\right\}$$
can be found by taking the derivative of the above expression (with respect to $P_i$) and setting it to zero, which yields
$$\lambda_i = -\frac{1}{2\ln(2)}\cdot\frac{1}{P_i+\sigma_i^2}+\lambda\ \begin{cases} = 0, & \text{if } P_i > 0;\\ \ge 0, & \text{if } P_i = 0.\end{cases}$$
Hence,
$$\begin{cases} P_i = \theta-\sigma_i^2, & \text{if } P_i > 0;\\ P_i\ge\theta-\sigma_i^2, & \text{if } P_i = 0,\end{cases}
\qquad\text{(equivalently, } P_i = \max\{0,\theta-\sigma_i^2\}\text{),}$$
where $\theta\triangleq\log_2 e/(2\lambda)$ is chosen to satisfy $\sum_{i=1}^{k} P_i = P$. □
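The water-filling allotment of Theorem 5.34 is easy to compute by a bisection search on the water level $\theta$; a minimal sketch (Python/numpy; the noise levels and power budget in the example are assumed values) follows.

```python
import numpy as np

def water_filling(noise_vars, P, tol=1e-12):
    """Water-filling of Theorem 5.34: P_i = max(0, theta - sigma_i^2), with the
    water level theta found by bisection so that sum_i P_i = P."""
    noise_vars = np.asarray(noise_vars, dtype=float)
    lo, hi = noise_vars.min(), noise_vars.max() + P     # theta lies in [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if np.sum(np.maximum(0.0, mid - noise_vars)) > P:
            hi = mid
        else:
            lo = mid
    theta = (lo + hi) / 2
    powers = np.maximum(0.0, theta - noise_vars)
    capacity = 0.5 * np.sum(np.log2(1 + powers / noise_vars))
    return theta, powers, capacity

# Example with four noise levels (assumed values, echoing the figure above).
theta, powers, C = water_filling([1.0, 0.5, 3.0, 1.5], P=2.0)
print(theta, powers, C)      # the noisiest channel may receive zero power
```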
Observation 5.35
- Although it seems reasonable to allocate more power to less noisy channels in order to optimize the channel capacity according to the water-filling principle, the Gaussian inputs required for achieving channel capacity do not fit digital communication systems in practice.
- One may wonder what the optimal power allocation scheme becomes when the channel inputs are dictated to be practically discrete in values, such as binary phase-shift keying (BPSK), quadrature phase-shift keying (QPSK), or 16-quadrature-amplitude modulation (16-QAM).
- Surprisingly, a different procedure from the water-filling principle results under certain conditions.
- Indeed, the optimal power allocation for parallel AWGN channels with inputs constrained to be discrete has been established, resulting in a new graphical power-allocation interpretation called the mercury/water-filling principle [31].
- This research concludes that when the total transmission power $P$ is small, the strategy that maximizes the capacity follows approximately the equal signal-to-noise ratio (SNR) principle.
- Very recently, it was found that when the additive noise is no longer Gaussian, the mercury adjustment fails to interpret the optimal power allocation scheme. For this reason, the graphical interpretation of the optimal power allocation principle is simply named the two-phase water-filling principle in [48].
- We end this observation by emphasizing that the true measure for a digital communication system is no doubt the effective transmission rate subject to an acceptably low transmission error rate (e.g., an overall bit error rate of $10^{-5}$). In order to make the analysis tractable, researchers adopt the channel capacity as the design criterion instead, in the hope of obtaining a simple reference scheme for practical systems. Further study is thus required to examine the feasibility of such an approach (even though the water-filling principle is certainly a theoretically interesting and beautiful result).
Capacity of correlated parallel Gaussian channels

Theorem 5.36 (Capacity of correlated parallel Gaussian channels) The capacity of $k$ correlated parallel Gaussian channels with positive-definite noise covariance matrix $K_Z$ under overall input power constraint $P$ is given by
$$C(P) = \sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\lambda_i}\right),$$
where $\lambda_i$ is the $i$-th eigenvalue of $K_Z$,
$$P_i = \max\{0,\theta-\lambda_i\},$$
and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} P_i = P$. This capacity is achieved by a tuple of zero-mean Gaussian inputs $(X_1,X_2,\dots,X_k)$ with covariance matrix $K_X$ having the same eigenvectors as $K_Z$, where the $i$-th eigenvalue of $K_X$ is $P_i$, for $i = 1,2,\dots,k$.

Proof:
- In correlated parallel Gaussian channels, the input power constraint becomes
$$\sum_{i=1}^{k} E[X_i^2] = \mathrm{tr}(K_X)\le P,$$
where $\mathrm{tr}(\cdot)$ denotes the trace of the $k\times k$ matrix $K_X$.
- Since in each channel the input and noise variables are independent of each other, we have
$$I(X^k;Y^k) = h(Y^k)-h(Y^k|X^k) = h(Y^k)-h(Z^k+X^k|X^k) = h(Y^k)-h(Z^k|X^k) = h(Y^k)-h(Z^k).$$
- Since $h(Z^k)$ is not determined by the input, determining the system's capacity reduces to maximizing $h(Y^k)$ over all possible inputs $(X_1,\dots,X_k)$ satisfying the power constraint.
- Now observe that the covariance matrix of $Y^k$ is equal to
$$K_Y = K_X + K_Z,$$
which implies by Theorem 5.20 that the differential entropy of $Y^k$ is upper bounded by
$$h(Y^k)\le\frac{1}{2}\log_2\left[(2\pi e)^k\det(K_X+K_Z)\right],$$
with equality iff $Y^k$ is Gaussian. It remains to find out whether we can find inputs $(X_1,\dots,X_k)$ satisfying the power constraint which achieve the above upper bound and maximize it.

- As in the proof of Theorem 5.18, we can orthogonally diagonalize $K_Z$ as
$$K_Z = A\Lambda A^T,$$
where $AA^T = I_k$ (and thus $\det(A)^2 = 1$), $I_k$ is the $k\times k$ identity matrix, and $\Lambda$ is a diagonal matrix with positive diagonal components consisting of the eigenvalues of $K_Z$ (as $K_Z$ is positive definite). Then
$$\begin{aligned}
\det(K_X+K_Z) &= \det(K_X+A\Lambda A^T)\\
&= \det(AA^TK_XAA^T+A\Lambda A^T)\\
&= \det(A)\det(A^TK_XA+\Lambda)\det(A^T)\\
&= \det(A^TK_XA+\Lambda)\\
&= \det(B+\Lambda),
\end{aligned}$$
where $B\triangleq A^TK_XA$.
- Since for any two matrices $C$ and $D$, $\mathrm{tr}(CD) = \mathrm{tr}(DC)$, we have that
$$\mathrm{tr}(B) = \mathrm{tr}(A^TK_XA) = \mathrm{tr}(AA^TK_X) = \mathrm{tr}(I_kK_X) = \mathrm{tr}(K_X).$$

- Thus the capacity problem is further transformed to maximizing $\det(B+\Lambda)$ subject to $\mathrm{tr}(B)\le P$.
- By observing that $B+\Lambda$ is positive definite (because $\Lambda$ is positive definite) and using Hadamard's inequality given in Corollary 5.19, we have
$$\det(B+\Lambda)\le\prod_{i=1}^{k}(B_{ii}+\lambda_i),$$
where $\lambda_i$ is the entry of the matrix $\Lambda$ located at the $i$-th row and $i$-th column, which is exactly the $i$-th eigenvalue of $K_Z$.
- Thus, the maximum value of $\det(B+\Lambda)$ under $\mathrm{tr}(B)\le P$ is realized by a diagonal matrix $B$ (to achieve equality in Hadamard's inequality) with
$$\sum_{i=1}^{k} B_{ii} = P\quad\text{and each } B_{ii}\ge 0.$$

Thus,
$$C(P) = \frac{1}{2}\log_2\left[(2\pi e)^k\det(B+\Lambda)\right]-\frac{1}{2}\log_2\left[(2\pi e)^k\det(\Lambda)\right]
= \sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{B_{ii}}{\lambda_i}\right).$$
Finally, as in the proof of Theorem 5.34, we obtain a water-filling allotment for the optimal diagonal elements of $B$:
$$B_{ii} = \max\{0,\theta-\lambda_i\},$$
where $\theta$ is chosen to satisfy $\sum_{i=1}^{k} B_{ii} = P$. □
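For correlated channels the same bisection applies after an eigen-decomposition of $K_Z$, as in the proof above; the following sketch (Python/numpy, with an assumed $3\times 3$ noise covariance) returns the capacity and the per-eigenmode power allotment.

```python
import numpy as np

def correlated_capacity(K_Z, P, iters=200):
    """Theorem 5.36: water-fill the power P over the eigenvalues of K_Z."""
    lam = np.linalg.eigvalsh(np.asarray(K_Z, dtype=float))   # eigenvalues of K_Z
    lo, hi = lam.min(), lam.max() + P
    for _ in range(iters):                                   # bisection on theta
        mid = (lo + hi) / 2
        if np.sum(np.maximum(0.0, mid - lam)) > P:
            hi = mid
        else:
            lo = mid
    powers = np.maximum(0.0, (lo + hi) / 2 - lam)
    return 0.5 * np.sum(np.log2(1 + powers / lam)), powers

# Illustrative 3x3 noise covariance (assumed values).
K_Z = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.2, 0.3],
                [0.2, 0.3, 0.8]])
C, powers = correlated_capacity(K_Z, P=3.0)
print(C, powers)
```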
Non-Gaussian discrete-time memoryless channels

- If a discrete-time channel has additive but non-Gaussian memoryless noise and an input power constraint, then it is often hard to calculate its capacity. Hence, we introduce an upper bound and a lower bound on the capacity of such a channel (we assume that the noise admits a pdf).

Definition 5.37 (Entropy power) For a continuous random variable $Z$ with (well-defined) differential entropy $h(Z)$ (measured in bits), its entropy power is denoted by $Z_e$ and defined as
$$Z_e\triangleq\frac{1}{2\pi e}\,2^{2h(Z)}.$$

Lemma 5.38 For a discrete-time continuous-alphabet memoryless additive-noise channel with input power constraint $P$ and noise variance $\sigma^2$, its capacity satisfies
$$\frac{1}{2}\log_2\frac{P+\sigma^2}{\sigma^2}\le C(P)\le\frac{1}{2}\log_2\frac{P+\sigma^2}{Z_e}. \tag{5.7.1}$$
Proof: The lower bound in (5.7.1) is already proved in Theorem 5.32. The upper bound follows from
$$I(X;Y) = h(Y)-h(Z)\le\frac{1}{2}\log_2\left[2\pi e(P+\sigma^2)\right]-\frac{1}{2}\log_2\left[2\pi e\,Z_e\right]. \qquad\Box$$
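For a concrete (non-Gaussian) noise, the bounds in (5.7.1) are simple to evaluate; the sketch below (Python/numpy) uses additive noise uniform on $[-a,a]$ as an assumed example, for which $\mathrm{Var}(Z) = a^2/3$ and $h(Z) = \log_2(2a)$.

```python
import numpy as np

# Sketch: the bounds (5.7.1) for additive noise uniform on [-a, a] (an assumed
# example), using Var(Z) = a^2/3, h(Z) = log2(2a) and Z_e = 2^(2 h(Z))/(2 pi e).
P, a = 1.0, 1.0
var_Z = a**2 / 3
h_Z = np.log2(2 * a)
Z_e = 2 ** (2 * h_Z) / (2 * np.pi * np.e)

lower = 0.5 * np.log2((P + var_Z) / var_Z)
upper = 0.5 * np.log2((P + var_Z) / Z_e)
print(Z_e, var_Z, lower, upper)   # Z_e < Var(Z), so lower <= upper
```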
- The entropy power of $Z$ can be viewed as the variance of a Gaussian random variable with the same differential entropy as $Z$:
$$Z_e = \frac{1}{2\pi e}\,2^{2h(Z)} = \mathrm{Var}(Z),\quad\text{if } Z\text{ is Gaussian.}$$
- Whenever two independent Gaussian random variables, $Z_1$ and $Z_2$, are added, the power (variance) of the sum is equal to the sum of the powers (variances) of $Z_1$ and $Z_2$. This relationship can then be written as
$$2^{2h(Z_1+Z_2)} = 2^{2h(Z_1)}+2^{2h(Z_2)},$$
or equivalently
$$\mathrm{Var}(Z_1+Z_2) = \mathrm{Var}(Z_1)+\mathrm{Var}(Z_2).$$
- (Entropy-power inequality) However, when the two independent random variables are non-Gaussian, the relationship becomes
$$2^{2h(Z_1+Z_2)}\ge 2^{2h(Z_1)}+2^{2h(Z_2)}, \tag{5.7.2}$$
or equivalently
$$Z_e(Z_1+Z_2)\ge Z_e(Z_1)+Z_e(Z_2). \tag{5.7.3}$$
This reveals that the sum of two independent random variables may introduce more (differential) entropy power than the sum of the individual entropy powers, except in the Gaussian case.
Capacity of the band-limited white Gaussian channel

[Figure: Waveform channel — the input X(t) is added to the noise Z(t) and the sum is passed through the filter H(f) to produce the output Y(t).]

The output waveform is given by
$$Y(t) = (X(t)+Z(t))*h(t),\quad t\ge 0,$$
where $*$ represents the convolution operation.¹

¹ Recall that the convolution between two signals $a(t)$ and $b(t)$ is defined as $a(t)*b(t) = \int_{-\infty}^{\infty} a(\tau)b(t-\tau)\,d\tau$.
- (Time-unlimited) $X(t)$ is the channel input waveform with average power constraint
$$\lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} E[X^2(t)]\,dt\le P \tag{5.8.1}$$
and bandwidth $W$ cycles per second or Hertz (Hz); i.e., its spectrum or Fourier transform
$$X(f)\triangleq\mathcal{F}[X(t)] = \int_{-\infty}^{+\infty} X(t)e^{-j2\pi ft}\,dt = 0$$
for all frequencies $|f| > W$, where $j = \sqrt{-1}$ is the imaginary unit.
- $Z(t)$ is the noise waveform of a zero-mean stationary white Gaussian process with power spectral density $N_0/2$; i.e., its power spectral density $\mathrm{PSD}_Z(f)$, which is the Fourier transform of the process covariance (equivalently, correlation) function $K_Z(\tau)\triangleq E[Z(s)Z(s+\tau)]$, $s,\tau\in\mathbb{R}$, is given by
$$\mathrm{PSD}_Z(f) = \mathcal{F}[K_Z(t)] = \int_{-\infty}^{+\infty} K_Z(t)e^{-j2\pi ft}\,dt = \frac{N_0}{2}\quad\forall\,f.$$
- $h(t)$ is the impulse response of an ideal bandpass filter with cutoff frequencies at $\pm W$ Hz:
$$H(f) = \mathcal{F}[h(t)] = \begin{cases}1 & \text{if } -W\le f\le W,\\ 0 & \text{otherwise.}\end{cases}$$
- Note that we can write the channel output as
$$Y(t) = X(t)+\tilde{Z}(t),$$
where $\tilde{Z}(t)\triangleq Z(t)*h(t)$ is the filtered noise waveform.
- To determine the capacity (in bits per second) of this continuous-time band-limited white Gaussian channel with parameters $P$, $W$ and $N_0$, we convert it to an equivalent discrete-time channel with power constraint $P$ by using the well-known sampling theorem (due to Nyquist, Kotelnikov and Shannon), which states that sampling a band-limited signal of bandwidth $W$ every $1/(2W)$ seconds (i.e., at a rate of $2W$ samples per second) is sufficient to reconstruct the signal from its samples.
- Since $X(t)$, $\tilde{Z}(t)$ and $Y(t)$ are all band-limited to $[-W,W]$, we can thus represent these signals by their samples taken $\frac{1}{2W}$ seconds apart and model the channel by a discrete-time channel described by
$$Y_n = X_n+\tilde{Z}_n,\quad n = 1,2,\dots,$$
where $X_n\triangleq X\!\left(\frac{n}{2W}\right)$ are the input samples and $\tilde{Z}_n$ and $Y_n$ are the corresponding samples of the noise $\tilde{Z}(t)$ and output $Y(t)$ signals, respectively.
- Since $\tilde{Z}(t)$ is a filtered version of $Z(t)$, which is a zero-mean stationary Gaussian process, we obtain that $\tilde{Z}(t)$ is also zero-mean, stationary and Gaussian. This directly implies that the samples $\tilde{Z}_n$, $n = 1,2,\dots$, are zero-mean, identically distributed Gaussian random variables.
- Note that
$$\mathrm{PSD}_{\tilde{Z}}(f) = \mathrm{PSD}_Z(f)\,|H(f)|^2 = \begin{cases}\frac{N_0}{2} & \text{if } -W\le f\le W,\\ 0 & \text{otherwise.}\end{cases}$$
Hence, the covariance function of the filtered noise process is
$$K_{\tilde{Z}}(\tau) = E[\tilde{Z}(s)\tilde{Z}(s+\tau)] = \mathcal{F}^{-1}[\mathrm{PSD}_{\tilde{Z}}(f)] = N_0W\,\mathrm{sinc}(2W\tau),\quad\tau\in\mathbb{R},$$
which implies
$$E[\tilde{Z}_n^2] = K_{\tilde{Z}}(0) = N_0W.$$
Moreover, since $K_{\tilde{Z}}(k/(2W)) = N_0W\,\mathrm{sinc}(k) = 0$ for every non-zero integer $k$, the samples $\tilde{Z}_n$ are uncorrelated and hence (being jointly Gaussian) independent.
- We conclude that the capacity of the band-limited white Gaussian channel in bits per channel use is given, using (5.4.13), by
$$\frac{1}{2}\log_2\left(1+\frac{P}{N_0W}\right)\quad\text{bits/channel use.}$$
- Given that we are using the channel (with inputs $X_n$) every $\frac{1}{2W}$ seconds, we obtain that the capacity in bits/second of the band-limited white Gaussian channel is given by
$$C(P) = \frac{\frac{1}{2}\log_2\left(1+\frac{P}{N_0W}\right)}{\frac{1}{2W}} = W\log_2\left(1+\frac{P}{N_0W}\right)\quad\text{bits/second,} \tag{5.8.2}$$
where $\frac{P}{N_0W}$ is typically referred to as the signal-to-noise ratio (SNR).
- (5.8.2) is achieved by zero-mean i.i.d. Gaussian $\{X_n\}$ with $E[X_n^2] = P$, which can be obtained by sampling a zero-mean, stationary Gaussian $X(t)$ with
$$\mathrm{PSD}_X(f) = \begin{cases}\frac{P}{2W} & \text{if } -W\le f\le W,\\ 0 & \text{otherwise.}\end{cases}$$
Examining this $X(t)$ confirms that it satisfies (5.8.1):
$$\frac{1}{T}\int_{-T/2}^{T/2} E[X^2(t)]\,dt = E[X^2(t)] = K_X(0) = P\,\mathrm{sinc}(2W\cdot 0) = P.$$
Example 5.39 (Telephone line channel) Suppose telephone signals are band-limited to 4 kHz. Given an SNR of 40 decibels (dB), i.e., $10\log_{10}\frac{P}{N_0W} = 40$ dB, then from (5.8.2) we calculate that the capacity of the telephone line channel (when modeled via the band-limited white Gaussian channel) is given by
$$4000\log_2(1+10000) = 53151.4\ \text{bits/second} = 51.906\ \text{Kbits/second.}$$
By increasing the bandwidth to 1.2 MHz, the capacity becomes
$$1200000\log_2(1+10000) = 15945428\ \text{bits/second} = 15.207\ \text{Mbits/second.}$$
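The figures in Example 5.39 follow directly from (5.8.2); a short computation (Python/numpy) reproduces them, with the Kbit/Mbit conversions on the slide using binary prefixes (1 K = 1024).

```python
import numpy as np

# Reproducing the numbers of Example 5.39 from (5.8.2).
snr_db = 40.0
snr = 10 ** (snr_db / 10)                 # = 10000

for W in (4_000, 1_200_000):              # bandwidth in Hz
    C = W * np.log2(1 + snr)              # bits/second
    print(W, C, C / 1024, C / 1024**2)    # bits/s, Kbits/s, Mbits/s
```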
Observation 5.41 (Band-limited colored Gaussian channel) If the above band-limited channel has stationary colored (non-white) additive Gaussian noise, then it can be shown (e.g., see [19]) that the capacity of this channel becomes
$$C(P) = \int_{-W}^{W}\max\left\{0,\frac{1}{2}\log_2\frac{\theta}{\mathrm{PSD}_Z(f)}\right\}df,$$
where $\theta$ is the solution of
$$P = \int_{-W}^{W}\max\left\{0,\theta-\mathrm{PSD}_Z(f)\right\}df.$$
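The water level $\theta$ of Observation 5.41 can be found numerically by discretizing the band and bisecting on $\theta$; the sketch below (Python/numpy) uses an assumed noise spectrum purely for illustration.

```python
import numpy as np

# Sketch of Observation 5.41: numerically water-pour over frequency for an
# assumed colored noise spectrum PSD_Z(f), then evaluate C(P) by quadrature.
W, P = 4_000.0, 1.0e-2
f = np.linspace(-W, W, 4001)
df = f[1] - f[0]
psd = 1e-6 * (1 + (f / W) ** 2)            # assumed noise spectrum (arbitrary shape)

lo, hi = psd.min(), psd.max() + P / (2 * W)
for _ in range(200):                        # bisection on the water level theta
    mid = (lo + hi) / 2
    if np.sum(np.maximum(0.0, mid - psd)) * df > P:
        hi = mid
    else:
        lo = mid
theta = (lo + hi) / 2

C = np.sum(np.maximum(0.0, 0.5 * np.log2(theta / psd))) * df
print(theta, C)   # water level and capacity in bits/second
```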
Capacity of band-limited white Gaussian channel I: 5-79

(a) The spectrum of PSDZ (f ) where the horizontal line repre-


sents , the level at which water rises to.

(b) The input spectrum that achieves capacity.


Water-pouring for the band-limited colored Gaussian channel.
Capacity of the time-limited Gaussian channel

The system is given by
$$Y(t) = X(t)+Z(t),$$
where the zero-mean noise process $Z(t)$ has power spectral density $\mathrm{PSD}_Z(f)$.

Notes:
- We cannot use the sampling theorem in this case (as we did previously) because $X(t)$, $Y(t)$ and $Z(t)$ are not band-limited but actually time-limited to $[0,T)$.
- $Z(t)$ is no longer assumed white (so the noise samples are not i.i.d.).

Lemma 5.42 Any real-valued function $v(t)$ defined over $[0,T)$ can be decomposed into
$$v(t) = \sum_{i=1}^{\infty} v_i\phi_i(t),$$
where the real-valued functions $\{\phi_i(t)\}_{i=1}^{\infty}$ are any orthonormal set (which can span the space with respect to $v(t)$) of functions on $[0,T)$, namely
$$\int_0^T\phi_i(t)\phi_j(t)\,dt = \begin{cases}1, & \text{if } i = j;\\ 0, & \text{if } i\ne j,\end{cases}$$
and
$$v_i = \int_0^T\phi_i(t)v(t)\,dt.$$

Notes:
- The sampling theorem employs the sinc function as the orthonormal set to span the space of band-limited signals.
- The sinc orthonormal set, however, cannot be used for time-limited signals. We therefore have to introduce the Karhunen-Loève expansion.

Lemma 5.43 (Karhunen-Loève expansion) Given a stationary random process $Z(t)$ with correlation function
$$K_Z(\tau) = E[Z(t)Z(t+\tau)],$$
let $\{\phi_i(\tau)\}_{i=1}^{\infty}$ and $\{\lambda_i\}_{i=1}^{\infty}$ be the eigenfunctions and eigenvalues of $K_Z(\tau)$, satisfying
$$\int_0^T K_Z(\tau-s)\phi_i(s)\,ds = \lambda_i\phi_i(\tau),\quad 0\le\tau\le T.$$
Then the expansion coefficients $\{Z_i\}_{i=1}^{\infty}$ of $Z(t)$ with respect to the orthonormal functions $\{\phi_i(t)\}_{i=1}^{\infty}$ are uncorrelated. In addition, if $Z(t)$ is Gaussian, then $\{Z_i\}_{i=1}^{\infty}$ are independent Gaussian random variables.
Proof:
$$\begin{aligned}
E[Z_iZ_j] &= E\left[\int_0^T\phi_i(t)Z(t)\,dt\int_0^T\phi_j(s)Z(s)\,ds\right]\\
&= \int_0^T\int_0^T\phi_i(t)\phi_j(s)E[Z(t)Z(s)]\,dt\,ds\\
&= \int_0^T\int_0^T\phi_i(t)\phi_j(s)K_Z(s-t)\,dt\,ds\\
&= \int_0^T\phi_i(s)\left(\int_0^T K_Z(s-t)\phi_j(t)\,dt\right)ds\\
&= \int_0^T\phi_i(s)\bigl(\lambda_j\phi_j(s)\bigr)\,ds\\
&= \begin{cases}\lambda_i, & \text{if } i = j;\\ 0, & \text{if } i\ne j.\end{cases}
\end{aligned} \qquad\Box$$
The system can then be equivalently transformed to
$$Y_i = X_i+Z_i,\quad i = 1,2,3,\dots,$$
where
$$X_i = \int_0^T X(t)\phi_i(t)\,dt,\qquad Y_i = \int_0^T Y(t)\phi_i(t)\,dt,$$
and $\left\{Z_i = \int_0^T Z(t)\phi_i(t)\,dt\right\}_{i=1}^{\infty}$ are independent Gaussian with $E[Z_i^2] = \lambda_i$.

Since $Z(t) = \sum_{i=1}^{\infty} Z_i\phi_i(t)$ is assumed stationary, we have
$$\begin{aligned}
K_Z(\tau) &= E[Z(t)Z(t+\tau)]\\
&= E\left[\sum_{i=1}^{\infty} Z_i\phi_i(t)\sum_{k=1}^{\infty} Z_k\phi_k(t+\tau)\right]\\
&= \sum_{i=1}^{\infty}\sum_{k=1}^{\infty} E[Z_iZ_k]\,\phi_i(t)\phi_k(t+\tau)\\
&= \sum_{i=1}^{\infty}\lambda_i\phi_i(t)\phi_i(t+\tau).
\end{aligned}$$
Thus,
$$K_Z(0) = E[Z^2(t)] = \sum_{i=1}^{\infty}\lambda_i\phi_i^2(t),$$
which implies
$$E[Z^2(t)] = \frac{1}{T}\int_0^T E[Z^2(t)]\,dt = \frac{1}{T}\sum_{i=1}^{\infty}\lambda_i\int_0^T\phi_i^2(t)\,dt = \frac{1}{T}\sum_{i=1}^{\infty}\lambda_i.$$

Using a technique similar to the water-filling power allocation scheme, we have that the maximizer of $\max_{X^n:\,\mathrm{tr}(K_X)\le nP} I(X^n;Y^n)$ is a zero-mean independent Gaussian distributed $X^n$ with $E[X_i^2] = P_i$ satisfying
$$P_i = \max\{0,\theta-\lambda_i\}\quad\text{and}\quad\sum_{i=1}^{n} P_i = \text{constant (the total power budget).}$$
Consequently,
$$\begin{aligned}
\max_{X^n:\,\mathrm{tr}(K_X)\le nP} I(X^n;Y^n)
&= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X+K_Z)\right]-\frac{1}{2}\log_2\left[(2\pi e)^n\det(K_Z)\right]\\
&= \sum_{i=1}^{n}\frac{1}{2}\log_2\left(1+\frac{P_i}{\lambda_i}\right),
\end{aligned}$$
which is achieved by zero-mean independent Gaussian $\{X_i\}_{i=1}^{\infty}$ with $E[X_i^2] = P_i$.

Since $X(t) = \sum_{i=1}^{\infty} X_i\phi_i(t)$, we have
$$\begin{aligned}
K_X(\tau) &= E[X(t)X(t+\tau)]\\
&= E\left[\sum_{i=1}^{\infty} X_i\phi_i(t)\sum_{k=1}^{\infty} X_k\phi_k(t+\tau)\right]\\
&= \sum_{i=1}^{\infty}\sum_{k=1}^{\infty} E[X_iX_k]\,\phi_i(t)\phi_k(t+\tau)\\
&= \sum_{i=1}^{\infty} P_i\phi_i(t)\phi_i(t+\tau).
\end{aligned}$$
Thus,
$$K_X(0) = E[X^2(t)] = \sum_{i=1}^{\infty} P_i\phi_i^2(t),$$
which implies
$$P\ge E[X^2(t)] = \frac{1}{T}\int_0^T E[X^2(t)]\,dt = \frac{1}{T}\sum_{i=1}^{\infty} P_i\int_0^T\phi_i^2(t)\,dt = \frac{1}{T}\sum_{i=1}^{\infty} P_i.$$
Key Notes

- Differential entropy and its operational meaning in quantization efficiency
- Maximal differential entropy of the Gaussian source, among all sources with the same mean and variance
- The mismatch in properties of entropy and differential entropy
- Relative entropy and mutual information of continuous systems
- Capacity-cost function and its proof
- Calculation of the capacity-cost function for specific channels:
  - Memoryless additive Gaussian channels
  - Uncorrelated and correlated parallel Gaussian channels
  - Water-filling scheme (graphical interpretation)
  - Gaussian band-limited waveform channels
  - Gaussian time-limited waveform channels
- Interpretation of entropy power (provides an upper bound on the capacity of non-Gaussian channels)
- Operational characteristics of the entropy-power inequality
