
Chapter 5

Differential Entropy and Gaussian Channels

Po-Ning Chen, Professor

Department of Electrical and Computer Engineering

National Chiao Tung University

Hsin Chu, Taiwan 30010, R.O.C.


Continuous sources

Model: $\{X_t \in \mathcal{X},\ t \in \mathcal{I}\}$

- Discrete sources: both $\mathcal{X}$ and $\mathcal{I}$ are discrete.
- Continuous sources:
  - Discrete-time continuous sources: $\mathcal{X}$ is continuous; $\mathcal{I}$ is discrete.
  - Waveform sources: both $\mathcal{X}$ and $\mathcal{I}$ are continuous.

We have so far examined information measures and their operational characterization for discrete-time discrete-alphabet systems. In this chapter, we turn our focus to discrete-time continuous-alphabet (real-valued) sources.
Information content of continuous sources

Relation between discrete-alphabet and continuous-alphabet sources

- A discrete-alphabet source can be regarded as a discrete approximation (quantization) of a continuous-alphabet source.
- As the quantization becomes finer and finer, a discrete-alphabet source is expected to approach a continuous-alphabet source.
- We thus require an extension of the information measure (i.e., entropy) from discrete sources to continuous sources.

Entropy: $H(X) \triangleq -\sum_{x\in\mathcal{X}} P_X(x)\log_2 P_X(x)$ bits.

Example 5.1 Consider a real-valued random variable $X$ that is uniformly distributed on the unit interval, i.e., with pdf given by
$$f_X(x)=\begin{cases}1 & \text{if } x\in[0,1);\\ 0 & \text{otherwise.}\end{cases}$$
Given a positive integer $m$, we can discretize $X$ by uniformly quantizing it into $m$ levels by partitioning the support of $X$ into equal-length segments of size $\Delta = \frac{1}{m}$ ($\Delta$ is called the quantization step-size) such that
$$q_m(X) = \frac{i}{m}, \quad \text{if } \frac{i-1}{m}\le X < \frac{i}{m},$$
for $1\le i\le m$. Then the entropy of the quantized random variable $q_m(X)$ is given by
$$H(q_m(X)) = -\sum_{i=1}^{m}\frac{1}{m}\log_2\frac{1}{m} = \log_2 m \quad\text{(in bits).}$$
Since the entropy $H(q_m(X))$ of the quantized version of $X$ is a lower bound to the entropy of $X$ (as $q_m(X)$ is a function of $X$) and satisfies in the limit
$$\lim_{m\to\infty} H(q_m(X)) = \lim_{m\to\infty}\log_2 m = \infty,$$
we obtain that the entropy of $X$ is infinite. □

- The above example indicates that to compress a continuous source without incurring any loss or distortion requires an infinite number of bits.
- Thus, when studying continuous sources, the entropy measure is limited in its effectiveness and the introduction of a new measure is necessary.
- Such a new measure is obtained upon close examination of the entropy of a uniformly quantized real-valued random variable minus the quantization accuracy (as the accuracy increases without bound); the idea can be picked up from the next lemma.
Lemma 5.2 Consider a real-valued random variable $X$ with support $[a,b)$ and pdf $f_X$ such that

(i) $f_X \log_2 f_X$ is (Riemann-)integrable, and
(ii) $\int_a^b f_X(x)\log_2 f_X(x)\,dx$ is finite.

Then a uniform quantization of $X$ with an $n$-bit accuracy (i.e., with a quantization step-size of $\Delta = 2^{-n}$) yields an entropy approximately equal to
$$-\int_a^b f_X(x)\log_2 f_X(x)\,dx + n \quad\text{bits}$$
for $n$ sufficiently large. In other words,
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int_a^b f_X(x)\log_2 f_X(x)\,dx,$$
where $q_n(X)$ is the uniformly quantized version of $X$ with quantization step-size $\Delta = 2^{-n}$.
Proof:

Step 1: Mean-value theorem.

Let $\Delta = 2^{-n}$ be the quantization step-size, and let
$$t_i \triangleq \begin{cases} a + i\Delta, & i = 0,1,\dots,j-1,\\ b, & i = j,\end{cases}
\qquad\text{where}\qquad j = \left\lceil \frac{b-a}{\Delta}\right\rceil.$$
From the mean-value theorem, we can choose $x_i \in [t_{i-1}, t_i]$ for $1\le i\le j$ such that
$$p_i \triangleq \int_{t_{i-1}}^{t_i} f_X(x)\,dx = f_X(x_i)(t_i - t_{i-1}) = f_X(x_i)\,\Delta.$$

Step 2: Definition of $h^{(n)}(X)$.

Let
$$h^{(n)}(X) \triangleq -\sum_{i=1}^{j}\left[f_X(x_i)\log_2 f_X(x_i)\right]\Delta = -\sum_{i=1}^{j}\left[f_X(x_i)\log_2 f_X(x_i)\right]2^{-n}.$$
Since $f_X(x)\log_2 f_X(x)$ is (Riemann-)integrable,
$$h^{(n)}(X) \to -\int_a^b f_X(x)\log_2 f_X(x)\,dx \quad\text{as } n\to\infty.$$
Therefore, given any $\varepsilon > 0$, there exists $N$ such that for all $n > N$,
$$\left|\,-\int_a^b f_X(x)\log_2 f_X(x)\,dx - h^{(n)}(X)\right| < \varepsilon.$$
Step 3: Computation of $H(q_n(X))$. The entropy of the (uniformly) quantized version of $X$, $q_n(X)$, is given by
$$H(q_n(X)) = -\sum_{i=1}^{j} p_i\log_2 p_i
= -\sum_{i=1}^{j}\bigl(f_X(x_i)\Delta\bigr)\log_2\bigl(f_X(x_i)\Delta\bigr)
= -\sum_{i=1}^{j}\bigl(f_X(x_i)2^{-n}\bigr)\log_2\bigl(f_X(x_i)2^{-n}\bigr),$$
where the $p_i$'s are the probabilities of the different values of $q_n(X)$.

Step 4: $H(q_n(X)) - h^{(n)}(X)$. From Steps 2 and 3,
$$H(q_n(X)) - h^{(n)}(X) = -\sum_{i=1}^{j}\left[f_X(x_i)2^{-n}\right]\log_2(2^{-n})
= n\sum_{i=1}^{j}\int_{t_{i-1}}^{t_i} f_X(x)\,dx
= n\int_a^b f_X(x)\,dx = n.$$
Hence, we have that for $n > N$,
$$-\int_a^b f_X(x)\log_2 f_X(x)\,dx + n - \varepsilon \;<\; H(q_n(X)) = h^{(n)}(X) + n \;<\; -\int_a^b f_X(x)\log_2 f_X(x)\,dx + n + \varepsilon,$$
yielding that $\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int_a^b f_X(x)\log_2 f_X(x)\,dx$. □
Lemma 5.2 actually holds not only for support $[a,b)$ but for any support $S_X$.

Theorem 5.3 [38, Theorem 1] (Due to Rényi) For any real-valued random variable with pdf $f_X$, if
$$-\sum_{i} p_i\log_2 p_i$$
is finite, where the $p_i$'s are the probabilities of the different values of the uniformly quantized $q_n(X)$ over support $S_X$, then
$$\lim_{n\to\infty}\left[H(q_n(X)) - n\right] = -\int_{S_X} f_X(x)\log_2 f_X(x)\,dx,$$
provided the integral on the right-hand side exists.

This suggests that $-\int_{S_X} f_X(x)\log_2 f_X(x)\,dx$ could be a good information measure for continuous-alphabet sources.
Differential entropy

Definition 5.4 (Differential entropy) The differential entropy (in bits) of a continuous random variable $X$ with pdf $f_X$ and support $S_X$ is defined as
$$h(X) \triangleq -\int_{S_X} f_X(x)\log_2 f_X(x)\,dx = E[-\log_2 f_X(X)],$$
when the integral exists.

Example 5.5 A continuous random variable $X$ with support $S_X = [0,1)$ and pdf $f_X(x) = 2x$ for $x\in S_X$ has differential entropy equal to
$$-\int_0^1 2x\log_2(2x)\,dx = \left[\frac{x^2}{2}\bigl(\log_2 e - 2\log_2(2x)\bigr)\right]_0^1 = \frac{1}{2\ln 2} - \log_2(2) = -0.278652 \text{ bits.}$$

Next, the quantized version $q_n(X)$ is given by
$$q_n(X) = \frac{i}{2^n}, \quad\text{if } \frac{i-1}{2^n}\le X < \frac{i}{2^n},$$
for $1\le i\le 2^n$. Hence,
$$\Pr\!\left[q_n(X) = \frac{i}{2^n}\right] = \frac{2i-1}{2^{2n}},$$
which yields
$$H(q_n(X)) = -\sum_{i=1}^{2^n}\frac{2i-1}{2^{2n}}\log_2\frac{2i-1}{2^{2n}}
= \frac{1}{2^{2n}}\left[-\sum_{i=1}^{2^n}(2i-1)\log_2(2i-1) + 2^{2n}\log_2\bigl(2^{2n}\bigr)\right].$$

 n   H(q_n(X))        H(q_n(X)) - n
 1   0.811278 bits    -0.188722 bits
 2   1.748999 bits    -0.251000 bits
 3   2.729560 bits    -0.270440 bits
 4   3.723726 bits    -0.276275 bits
 5   4.722023 bits    -0.277977 bits
 6   5.721537 bits    -0.278463 bits
 7   6.721399 bits    -0.278600 bits
 8   7.721361 bits    -0.278638 bits
 9   8.721351 bits    -0.278648 bits
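The table can be reproduced with a few lines of numerical code; the following is a minimal sketch (assuming Python with numpy is available) that evaluates $H(q_n(X))$ from the cell probabilities above and compares $H(q_n(X)) - n$ with the differential entropy of Example 5.5.

```python
import numpy as np

# Numerical check of the table: for f_X(x) = 2x on [0, 1),
# Pr[q_n(X) = i/2^n] = (2i - 1)/2^(2n), so H(q_n(X)) can be summed directly
# and compared with h(X) + n, where h(X) = 1/(2 ln 2) - 1 = -0.278652 bits.
h_X = 1 / (2 * np.log(2)) - 1          # differential entropy of Example 5.5, in bits

for n in range(1, 10):
    i = np.arange(1, 2**n + 1)
    p = (2 * i - 1) / 2**(2 * n)       # quantizer cell probabilities
    H = -np.sum(p * np.log2(p))        # entropy of q_n(X) in bits
    print(f"n={n}: H(q_n(X)) = {H:.6f},  H(q_n(X)) - n = {H - n:.6f}")

print(f"h(X) = {h_X:.6f} bits")        # limit of H(q_n(X)) - n
```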
Example 5.7 (Differential entropy of a uniformly distributed random variable) Let $X$ be a continuous random variable that is uniformly distributed over the interval $(a,b)$, where $b > a$; i.e., its pdf is given by
$$f_X(x) = \begin{cases}\frac{1}{b-a} & \text{if } x\in(a,b);\\ 0 & \text{otherwise.}\end{cases}$$
So its differential entropy is given by
$$h(X) = -\int_a^b \frac{1}{b-a}\log_2\frac{1}{b-a}\,dx = \log_2(b-a) \text{ bits.}$$

Note that if $(b-a) < 1$ in the above example, then $h(X)$ is negative.

Example 5.8 (Differential entropy of a Gaussian random variable) Let $X\sim\mathcal{N}(\mu,\sigma^2)$; i.e., $X$ is a Gaussian (or normal) random variable with finite mean $\mu$, variance $\mathrm{Var}(X)=\sigma^2>0$ and pdf
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
for $x\in\mathbb{R}$. Then its differential entropy is given by
$$h(X) = \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx
= \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}E[(X-\mu)^2]
= \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{1}{2}\log_2 e
= \frac{1}{2}\log_2(2\pi e\sigma^2) \text{ bits.} \tag{5.1.1}$$

- Note that for a Gaussian random variable, its differential entropy is only a function of its variance $\sigma^2$ (it is independent of its mean $\mu$).
- This is similar to the differential entropy of a uniform random variable, which only depends on the difference $(b-a)$ but not on the mean $(a+b)/2$.
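As a quick sanity check of (5.1.1), one can estimate $E[-\log_2 f_X(X)]$ by Monte Carlo; the sketch below (Python/numpy, with arbitrarily chosen $\mu$ and $\sigma$) matches the closed form and illustrates that the value does not depend on $\mu$.

```python
import numpy as np

# Monte Carlo sanity check of (5.1.1): h(X) = E[-log2 f_X(X)] for X ~ N(mu, sigma^2).
# The values of mu and sigma below are arbitrary illustrative choices.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
h_mc = np.mean(-log_pdf) / np.log(2)                 # empirical E[-log2 f_X(X)]
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

print(h_mc, h_closed)   # both approximately 3.047 bits; independent of mu
```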
Properties of differential entropy

Joint and conditional differential entropies, divergence and mutual information

Definition 5.9 (Joint differential entropy) If $X^n = (X_1,X_2,\dots,X_n)$ is a continuous random vector of size $n$ (i.e., a vector of $n$ continuous random variables) with joint pdf $f_{X^n}$ and support $S_{X^n}\subseteq\mathbb{R}^n$, then its joint differential entropy is defined as
$$h(X^n) \triangleq -\int_{S_{X^n}} f_{X^n}(x_1,\dots,x_n)\log_2 f_{X^n}(x_1,\dots,x_n)\,dx_1\cdots dx_n = E[-\log_2 f_{X^n}(X^n)]$$
when the $n$-dimensional integral exists.

Definition 5.10 (Conditional differential entropy) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y}\subseteq\mathbb{R}^2$ such that the conditional pdf of $Y$ given $X$, given by
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)},$$
is well defined for all $(x,y)\in S_{X,Y}$, where $f_X$ is the marginal pdf of $X$. Then the conditional differential entropy of $Y$ given $X$ is defined as
$$h(Y|X) \triangleq -\int_{S_{X,Y}} f_{X,Y}(x,y)\log_2 f_{Y|X}(y|x)\,dx\,dy = E[-\log_2 f_{Y|X}(Y|X)],$$
when the integral exists.

Chain rule for differential entropy:
$$h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).$$
Definition 5.11 (Divergence or relative entropy) Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X\subseteq S_Y\subseteq\mathbb{R}$. Then the divergence (or relative entropy or Kullback-Leibler distance) between $X$ and $Y$ is written as $D(X\|Y)$ or $D(f_X\|f_Y)$ and defined by
$$D(X\|Y) \triangleq \int_{S_X} f_X(x)\log_2\frac{f_X(x)}{f_Y(x)}\,dx = E\!\left[\log_2\frac{f_X(X)}{f_Y(X)}\right]$$
when the integral exists. The definition carries over similarly in the multivariate case: for $X^n = (X_1,\dots,X_n)$ and $Y^n = (Y_1,\dots,Y_n)$ two random vectors with joint pdfs $f_{X^n}$ and $f_{Y^n}$, respectively, and supports satisfying $S_{X^n}\subseteq S_{Y^n}\subseteq\mathbb{R}^n$, the divergence between $X^n$ and $Y^n$ is defined as
$$D(X^n\|Y^n) \triangleq \int_{S_{X^n}} f_{X^n}(x_1,\dots,x_n)\log_2\frac{f_{X^n}(x_1,\dots,x_n)}{f_{Y^n}(x_1,\dots,x_n)}\,dx_1\cdots dx_n$$
when the integral exists.

Note that $D(q_n(X)\|q_n(Y)) \to D(X\|Y)$ as $n\to\infty$ for continuous $X$ and $Y$.
Definition 5.12 (Mutual information) Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y}\subseteq\mathbb{R}^2$. Then the mutual information between $X$ and $Y$ is defined by
$$I(X;Y) \triangleq D(f_{X,Y}\|f_X f_Y) = \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy,$$
assuming the integral exists, where $f_X$ and $f_Y$ are the marginal pdfs of $X$ and $Y$, respectively.

For $n$ and $m$ sufficiently large,
$$\begin{aligned}
I(q_n(X);q_m(Y)) &= H(q_n(X)) + H(q_m(Y)) - H(q_n(X),q_m(Y))\\
&\approx [h(X)+n] + [h(Y)+m] - [h(X,Y)+n+m]\\
&= h(X) + h(Y) - h(X,Y)\\
&= \int_{S_{X,Y}} f_{X,Y}(x,y)\log_2\frac{f_{X,Y}(x,y)}{f_X(x)f_Y(y)}\,dx\,dy.
\end{aligned}$$
Hence,
$$\lim_{n,m\to\infty} I(q_n(X);q_m(Y)) = h(X) + h(Y) - h(X,Y).$$
This justifies using identical notations for both $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ in the discrete and continuous cases, as opposed to the distinct notations of $H(\cdot)$ for entropy and $h(\cdot)$ for differential entropy.
Lemma 5.14 The following properties hold for the information measures of continuous systems.

1. Non-negativity of divergence: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X\subseteq S_Y\subseteq\mathbb{R}$. Then
$$D(f_X\|f_Y)\ge 0$$
with equality iff $f_X(x) = f_Y(x)$ for all $x\in S_X$ except in a set of $f_X$-measure zero (i.e., $X = Y$ almost surely).

2. Non-negativity of mutual information: For any two continuous random variables $X$ and $Y$,
$$I(X;Y)\ge 0$$
with equality iff $X$ and $Y$ are independent.

3. Conditioning never increases differential entropy: For any two continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X|Y)\le h(X)$$
with equality iff $X$ and $Y$ are independent.

4. Chain rule for differential entropy: For a continuous random vector $X^n = (X_1,X_2,\dots,X_n)$,
$$h(X_1,X_2,\dots,X_n) = \sum_{i=1}^{n} h(X_i|X_1,X_2,\dots,X_{i-1}),$$
where $h(X_i|X_1,X_2,\dots,X_{i-1}) \triangleq h(X_1)$ for $i = 1$.

5. Chain rule for mutual information: For a continuous random vector $X^n = (X_1,X_2,\dots,X_n)$ and a random variable $Y$ with joint pdf $f_{X^n,Y}$ and well-defined conditional pdfs $f_{X_i,Y|X^{i-1}}$, $f_{X_i|X^{i-1}}$ and $f_{Y|X^{i-1}}$ for $i = 1,\dots,n$, we have that
$$I(X_1,X_2,\dots,X_n;Y) = \sum_{i=1}^{n} I(X_i;Y|X_{i-1},\dots,X_1),$$
where $I(X_i;Y|X_{i-1},\dots,X_1) \triangleq I(X_1;Y)$ for $i = 1$.

6. Data processing inequality: For continuous random variables $X$, $Y$ and $Z$ such that $X\to Y\to Z$, i.e., $X$ and $Z$ are conditionally independent given $Y$ (cf. Appendix B),
$$I(X;Y)\ge I(X;Z).$$
7. Independence bound for differential entropy: For a continuous random vector $X^n = (X_1,X_2,\dots,X_n)$,
$$h(X^n)\le\sum_{i=1}^{n} h(X_i)$$
with equality iff all the $X_i$'s are independent of each other.

8. Invariance of differential entropy under translation: For continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X+c) = h(X) \text{ for any constant } c\in\mathbb{R}, \quad\text{and}\quad h(X+Y|Y) = h(X|Y).$$
The result also generalizes to the multivariate case: for two continuous random vectors $X^n = (X_1,\dots,X_n)$ and $Y^n = (Y_1,\dots,Y_n)$ with joint pdf $f_{X^n,Y^n}$ and well-defined conditional pdf $f_{X^n|Y^n}$,
$$h(X^n + c^n) = h(X^n)$$
for any constant $n$-tuple $c^n = (c_1,c_2,\dots,c_n)\in\mathbb{R}^n$, and
$$h(X^n + Y^n|Y^n) = h(X^n|Y^n),$$
where the addition of two $n$-tuples is performed component-wise.

9. Differential entropy under scaling: For any continuous random variable $X$ and any non-zero real constant $a$,
$$h(aX) = h(X) + \log_2|a|$$
(a numerical illustration is sketched after Property 11 below).

10. Joint differential entropy under linear mapping: Consider the random (column) vector $X = (X_1,X_2,\dots,X_n)^T$ with joint pdf $f_{X^n}$, where $T$ denotes transposition, and let $Y = (Y_1,Y_2,\dots,Y_n)^T$ be a random (column) vector obtained from the linear transformation $Y = AX$, where $A$ is an invertible (non-singular) $n\times n$ real-valued matrix. Then
$$h(Y) = h(Y_1,Y_2,\dots,Y_n) = h(X_1,X_2,\dots,X_n) + \log_2|\det(A)|,$$
where $\det(A)$ is the determinant of the square matrix $A$.
11. Joint differential entropy under nonlinear mapping: Consider the random (column) vector $X = (X_1,X_2,\dots,X_n)^T$ with joint pdf $f_{X^n}$, and let $Y = (Y_1,Y_2,\dots,Y_n)^T$ be a random (column) vector obtained from the nonlinear transformation
$$Y = g(X) \triangleq (g_1(X_1),g_2(X_2),\dots,g_n(X_n))^T,$$
where each $g_i:\mathbb{R}\to\mathbb{R}$ is a differentiable function, $i = 1,2,\dots,n$. Then
$$h(Y) = h(Y_1,Y_2,\dots,Y_n) = h(X_1,\dots,X_n) + \int_{\mathbb{R}^n} f_{X^n}(x_1,\dots,x_n)\log_2|\det(J)|\,dx_1\cdots dx_n,$$
where $J$ is the $n\times n$ Jacobian matrix given by
$$J \triangleq \begin{pmatrix}
\frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \cdots & \frac{\partial g_1}{\partial x_n}\\
\frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \cdots & \frac{\partial g_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial g_n}{\partial x_1} & \frac{\partial g_n}{\partial x_2} & \cdots & \frac{\partial g_n}{\partial x_n}
\end{pmatrix}.$$
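Properties such as 9 above are easy to probe numerically. The following sketch (Python/numpy; the exponential test distribution and the scale $a = 3$ are arbitrary illustrative choices) estimates $h(X)$ and $h(aX)$ by Monte Carlo and checks that they differ by $\log_2|a|$.

```python
import numpy as np

# Monte Carlo sketch of Property 9 (scaling): h(aX) = h(X) + log2|a|.
# X ~ Exp(1) is an arbitrary test distribution; a > 0 is an arbitrary scale.
rng = np.random.default_rng(1)
a = 3.0
x = rng.exponential(1.0, size=1_000_000)
y = a * x                                                # pdf of Y = aX is f_X(y/a)/|a|

h_x = np.mean(-np.log(np.exp(-x))) / np.log(2)           # E[-log2 f_X(X)], f_X(x) = e^{-x}
h_y = np.mean(-np.log(np.exp(-y / a) / a)) / np.log(2)   # f_Y(y) = e^{-y/a}/a

print(h_x, h_y, h_x + np.log2(a))   # h_y ≈ h_x + log2(a) ≈ 1.443 + 1.585
```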
Theorem 5.18 (Joint differential entropy of the multivariate Gaussian) If $X\sim\mathcal{N}_n(\mu,K_X)$ is a Gaussian random vector with mean vector $\mu$ and (positive-definite) covariance matrix $K_X$, then its joint differential entropy is given by
$$h(X) = h(X_1,X_2,\dots,X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right]. \tag{5.2.1}$$
In particular, in the univariate case of $n = 1$, (5.2.1) reduces to (5.1.1).

Proof:
Without loss of generality we assume that $X$ has a zero mean vector, since its differential entropy is invariant under translation by Property 8 of Lemma 5.14:
$$h(X) = h(X-\mu);$$
so we assume that $\mu = 0$.

Since the (positive-definite) covariance matrix $K_X$ is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square ($n\times n$) orthogonal matrix $A$ (i.e., satisfying $A^T = A^{-1}$) such that $AK_XA^T$ is a diagonal matrix whose entries are given by the eigenvalues of $K_X$. As a result, the linear transformation $Y = AX\sim\mathcal{N}_n\!\left(0,AK_XA^T\right)$ is a Gaussian vector with the diagonal covariance matrix $K_Y = AK_XA^T$ and has therefore independent components. Thus
$$\begin{aligned}
h(Y) &= h(Y_1,Y_2,\dots,Y_n)\\
&= h(Y_1) + h(Y_2) + \cdots + h(Y_n) &\text{(5.2.2)}\\
&= \sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Y_i)\right] &\text{(5.2.3)}\\
&= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Y_i)\right]\\
&= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\det(K_Y)\right] &\text{(5.2.4)}\\
&= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] &\text{(5.2.5)}\\
&= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], &\text{(5.2.6)}
\end{aligned}$$
where (5.2.2) follows by the independence of the random variables $Y_1,\dots,Y_n$ (e.g., see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds since the matrix $K_Y$ is diagonal and hence its determinant is given by the product of its diagonal entries, and (5.2.5) holds since
$$\det(K_Y) = \det\!\left(AK_XA^T\right) = \det(A)\det(K_X)\det(A^T) = \det(A)^2\det(K_X) = \det(K_X),$$
where the last equality holds since $(\det(A))^2 = 1$, as the matrix $A$ is orthogonal ($A^T = A^{-1}\Rightarrow\det(A) = \det(A^T) = 1/\det(A)$; thus, $\det(A)^2 = 1$).

Now invoking Property 10 of Lemma 5.14 and noting that $|\det(A)| = 1$ yield that
$$h(Y_1,Y_2,\dots,Y_n) = h(X_1,X_2,\dots,X_n) + \underbrace{\log_2|\det(A)|}_{=0} = h(X_1,X_2,\dots,X_n).$$
We therefore obtain using (5.2.6) that
$$h(X_1,X_2,\dots,X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],$$
hence completing the proof. □
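A small numerical cross-check of (5.2.1): the sketch below (Python/numpy, with an arbitrarily chosen $2\times 2$ covariance matrix) compares the closed form with a Monte Carlo estimate of $E[-\log_2 f_X(X)]$.

```python
import numpy as np

# Sketch: check (5.2.1) for an arbitrary 2x2 positive-definite K_X against a
# Monte Carlo estimate of E[-log2 f_X(X)] for X ~ N_2(0, K_X).
rng = np.random.default_rng(2)
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])
n = K.shape[0]

h_closed = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

x = rng.multivariate_normal(np.zeros(n), K, size=500_000)
Kinv = np.linalg.inv(K)
log_pdf = (-0.5 * np.sum((x @ Kinv) * x, axis=1)
           - 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(K)))
h_mc = np.mean(-log_pdf) / np.log(2)

print(h_closed, h_mc)   # both approximately 4.32 bits
```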
An important fact: Among all real-valued size-$n$ random vectors (of support $\mathbb{R}^n$) with identical mean vector and covariance matrix, the Gaussian random vector has the largest differential entropy. The proof of this fact requires the following inequality.

Corollary 5.19 (Hadamard's inequality) For any real-valued $n\times n$ positive-definite matrix $K = [K_{i,j}]_{i,j=1,\dots,n}$,
$$\det(K)\le\prod_{i=1}^{n} K_{i,i}$$
with equality iff $K$ is a diagonal matrix, where the $K_{i,i}$ are the diagonal entries of $K$.

Theorem 5.20 (Maximal differential entropy for real-valued random vectors) Let $X = (X_1,X_2,\dots,X_n)^T$ be a real-valued random vector with support $S_{X^n} = \mathbb{R}^n$, mean vector $\mu$ and covariance matrix $K_X$. Then
$$h(X_1,X_2,\dots,X_n)\le\frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right], \tag{5.2.11}$$
with equality iff $X$ is Gaussian; i.e., $X\sim\mathcal{N}_n(\mu,K_X)$.

Proof: We will present the proof in two parts: the scalar or univariate case, and the multivariate case.

(i) Scalar case ($n = 1$): For a real-valued random variable with support $S_X = \mathbb{R}$, mean $\mu$ and variance $\sigma^2$, let us show that
$$h(X)\le\frac{1}{2}\log_2\left(2\pi e\sigma^2\right), \tag{5.2.12}$$
with equality iff $X\sim\mathcal{N}(\mu,\sigma^2)$.
For a Gaussian random variable $Y\sim\mathcal{N}(\mu,\sigma^2)$, using the non-negativity of divergence, we can write
$$\begin{aligned}
0 &\le D(X\|Y)\\
&= \int_{\mathbb{R}} f_X(x)\log_2\frac{f_X(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}}\,dx\\
&= -h(X) + \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right]dx\\
&= -h(X) + \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\underbrace{\int_{\mathbb{R}}(x-\mu)^2 f_X(x)\,dx}_{=\,\sigma^2}\\
&= -h(X) + \frac{1}{2}\log_2\left(2\pi e\sigma^2\right).
\end{aligned}$$
Thus
$$h(X)\le\frac{1}{2}\log_2\left(2\pi e\sigma^2\right),$$
with equality iff $X = Y$ (almost surely); i.e., $X\sim\mathcal{N}(\mu,\sigma^2)$.
(ii) Multivariate case ($n > 1$): As in the proof of Theorem 5.18, we can use an orthogonal square matrix $A$ (i.e., satisfying $A^T = A^{-1}$ and hence $|\det(A)| = 1$) such that $AK_XA^T$ is diagonal. Therefore, the random vector generated by the linear map
$$Z = AX$$
will have a covariance matrix given by $K_Z = AK_XA^T$ and hence have uncorrelated (but not necessarily independent) components.

Thus
$$\begin{aligned}
h(X) &= h(Z) - \underbrace{\log_2|\det(A)|}_{=0} &\text{(5.2.13)}\\
&= h(Z_1,Z_2,\dots,Z_n)\\
&\le\sum_{i=1}^{n} h(Z_i) &\text{(5.2.14)}\\
&\le\sum_{i=1}^{n}\frac{1}{2}\log_2\left[2\pi e\,\mathrm{Var}(Z_i)\right] &\text{(5.2.15)}\\
&= \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\mathrm{Var}(Z_i)\right]\\
&= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_Z)\right] &\text{(5.2.16)}\\
&= \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] &\text{(5.2.17)}\\
&= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X)\right],
\end{aligned}$$
where (5.2.13) holds by Property 10 of Lemma 5.14 and since $|\det(A)| = 1$, (5.2.14) follows from Property 7 of Lemma 5.14, (5.2.15) follows from (5.2.12) (the scalar case above), (5.2.16) holds since $K_Z$ is diagonal, and (5.2.17) follows from the fact that $\det(K_Z) = \det(K_X)$ (as $A$ is orthogonal).
Finally, equality is achieved in both (5.2.14) and (5.2.15) iff the random variables $Z_1,Z_2,\dots,Z_n$ are Gaussian and independent of each other, or equivalently iff $X\sim\mathcal{N}_n(\mu,K_X)$. □

Observation 5.21 The following two results can also be shown (the proof is left as an exercise):

1. Among all continuous random variables admitting a pdf with support the interval $(a,b)$, where $b > a$ are real numbers, the uniformly distributed random variable maximizes differential entropy.

2. Among all continuous random variables admitting a pdf with support the interval $[0,\infty)$ and finite mean $\mu$, the exponential distribution with parameter (or rate parameter) $\lambda = 1/\mu$ maximizes differential entropy.

A systematic approach to finding distributions that maximize differential entropy subject to various support and moment constraints can be found in [13, 52].
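The maximal-entropy statements can be illustrated by comparing closed-form differential entropies of a few zero-mean, unit-variance densities; a brief sketch (Python/numpy) follows.

```python
import numpy as np

# Sketch: closed-form differential entropies (in bits) of three zero-mean,
# unit-variance densities, illustrating that the Gaussian is the largest
# (Theorem 5.20).  The Laplace/uniform parameters are chosen so that Var = 1.
ln2 = np.log(2)

h_gauss   = 0.5 * np.log2(2 * np.pi * np.e)   # N(0, 1)
h_uniform = np.log2(np.sqrt(12.0))            # Uniform(-sqrt(3), sqrt(3))
b = 1 / np.sqrt(2)                            # Laplace scale with variance 2*b^2 = 1
h_laplace = (1 + np.log(2 * b)) / ln2

print(h_gauss, h_laplace, h_uniform)   # ≈ 2.047 > 1.943 > 1.792 bits
```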
AEP for continuous memoryless sources

The extension of the AEP theorem from the discrete case to the continuous case is not based on counting (which always yields infinity for continuous sources), but on volume measuring.

Theorem 5.22 (AEP for continuous memoryless sources) Let $\{X_i\}_{i=1}^{\infty}$ be a continuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random variables) with pdf $f_X(\cdot)$ and differential entropy $h(X)$. Then
$$-\frac{1}{n}\log_2 f_{X^n}(X_1,\dots,X_n)\to E[-\log_2 f_X(X)] = h(X)\quad\text{in probability.}$$
Proof: The proof is an immediate result of the law of large numbers (e.g., see Theorem 3.3). □
Definition 5.23 (Typical set) For $\delta > 0$ and any given $n$, define the typical set for the above continuous source as
$$\mathcal{F}_n(\delta)\triangleq\left\{x^n\in\mathbb{R}^n:\left|-\frac{1}{n}\log_2 f_{X^n}(x_1,\dots,x_n) - h(X)\right| < \delta\right\}.$$

Definition 5.24 (Volume) The volume of a set $A\subseteq\mathbb{R}^n$ is defined as
$$\mathrm{Vol}(A)\triangleq\int_{A} dx_1\cdots dx_n.$$

Theorem 5.25 (Consequence of the AEP for continuous memoryless sources) For a continuous memoryless source $\{X_i\}_{i=1}^{\infty}$ with differential entropy $h(X)$, the following hold.

1. For $n$ sufficiently large, $P_{X^n}\{\mathcal{F}_n(\delta)\} > 1-\delta$.
2. $\mathrm{Vol}(\mathcal{F}_n(\delta))\le 2^{n(h(X)+\delta)}$ for all $n$.
3. $\mathrm{Vol}(\mathcal{F}_n(\delta))\ge(1-\delta)\,2^{n(h(X)-\delta)}$ for $n$ sufficiently large.

Proof: The proof is quite analogous to the corresponding theorem for discrete memoryless sources (Theorem 3.4) and is left as an exercise. □
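Item 1 of Theorem 5.25 can be visualized by simulation. The sketch below (Python/numpy; the Gaussian source and the values of $n$ and $\delta$ are illustrative choices) estimates the probability that $-\frac{1}{n}\log_2 f_{X^n}(X^n)$ falls within $\delta$ of $h(X)$.

```python
import numpy as np

# Sketch of Theorem 5.25(1) for a Gaussian memoryless source: estimate the
# probability that -(1/n) log2 f_{X^n}(X^n) lies within delta of h(X).
rng = np.random.default_rng(3)
sigma, n, delta, trials = 1.0, 500, 0.1, 10_000

h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)       # h(X) from (5.1.1)

x = rng.normal(0.0, sigma, size=(trials, n))
log2_pdf = (-0.5 * np.log2(2 * np.pi * sigma**2)
            - x**2 / (2 * sigma**2) * np.log2(np.e))
emp = -np.mean(log2_pdf, axis=1)                     # -(1/n) log2 f_{X^n}(x^n)

print(np.mean(np.abs(emp - h) < delta))              # well above 1 - delta here; -> 1 as n grows
```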
Capacity for the discrete memoryless Gaussian channel

Definition 5.26 (Discrete-time continuous memoryless channels) Consider a discrete-time channel with continuous input and output alphabets given by $\mathcal{X}\subseteq\mathbb{R}$ and $\mathcal{Y}\subseteq\mathbb{R}$, respectively, and described by a sequence of $n$-dimensional transition (conditional) pdfs $\{f_{Y^n|X^n}(y^n|x^n)\}_{n=1}^{\infty}$ that govern the reception of $y^n = (y_1,y_2,\dots,y_n)\in\mathcal{Y}^n$ at the channel output when $x^n = (x_1,x_2,\dots,x_n)\in\mathcal{X}^n$ is sent as the channel input.

The channel (without feedback) is said to be memoryless with a given (marginal) transition pdf $f_{Y|X}$ if its sequence of transition pdfs $f_{Y^n|X^n}$ satisfies
$$f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i) \tag{5.4.1}$$
for every $n = 1,2,\dots$, $x^n\in\mathcal{X}^n$ and $y^n\in\mathcal{Y}^n$.
Average cost constraint $(t,P)$ on any input $n$-tuple $x^n = (x_1,x_2,\dots,x_n)$ transmitted over the channel: we require that
$$\frac{1}{n}\sum_{i=1}^{n} t(x_i)\le P, \tag{5.4.2}$$
where $t(\cdot)$ is a given non-negative real-valued function describing the cost for transmitting an input symbol, and $P$ is a given positive number representing the maximal average amount of available resources per input symbol.

Definition 5.27 The capacity (or capacity-cost function) of a discrete-time continuous memoryless channel with input average cost constraint $(t,P)$ is denoted by $C(P)$ and defined as
$$C(P)\triangleq\sup_{F_X:\,E[t(X)]\le P} I(X;Y)\quad\text{(in bits/channel use)} \tag{5.4.3}$$
where the supremum is over all input distributions $F_X$.
Property A.4 (Properties of the supremum)

3. If $-\infty < \sup A < \infty$, then $(\forall\,\varepsilon>0)(\exists\,a_0\in A)\ a_0 > \sup A - \varepsilon$.
(The existence of $a_0\in(\sup A-\varepsilon,\sup A]$ for any $\varepsilon>0$ under the condition $|\sup A| < \infty$ is called the approximation property for the supremum.)

Lemma 5.28 (Concavity of capacity) If $C(P)$ as defined in (5.4.3) is finite for any $P > 0$, then it is concave, continuous and strictly increasing in $P$.

Proof: Fix $P_1 > 0$ and $P_2 > 0$. Since $C(P)$ is finite for any $P > 0$, by the 3rd property in Property A.4 there exist two input distributions $F_{X_1}$ and $F_{X_2}$ such that for all $\varepsilon > 0$,
$$I(X_i;Y_i)\ge C(P_i)-\varepsilon \tag{5.4.4}$$
and
$$E[t(X_i)]\le P_i, \tag{5.4.5}$$
where $X_i$ denotes the input with distribution $F_{X_i}$ and $Y_i$ is the corresponding channel output, for $i = 1,2$.
(We do not know whether the capacity is achievable or not at the moment!)
Now, for $0\le\lambda\le 1$, let $X_\lambda$ be a random variable with distribution
$$F_{X_\lambda} = \lambda F_{X_1} + (1-\lambda)F_{X_2}.$$
Then by (5.4.5),
$$E_{X_\lambda}[t(X)] = \lambda E_{X_1}[t(X)] + (1-\lambda)E_{X_2}[t(X)]\le\lambda P_1 + (1-\lambda)P_2. \tag{5.4.6}$$
Furthermore,
$$\begin{aligned}
C(\lambda P_1 + (1-\lambda)P_2) &= \sup_{F_X:\,E[t(X)]\le\lambda P_1+(1-\lambda)P_2} I(F_X,f_{Y|X})\\
&\ge I(F_{X_\lambda},f_{Y|X})\\
&\ge\lambda I(F_{X_1},f_{Y|X}) + (1-\lambda)I(F_{X_2},f_{Y|X})\\
&= \lambda I(X_1;Y_1) + (1-\lambda)I(X_2;Y_2)\\
&\ge\lambda\left[C(P_1)-\varepsilon\right] + (1-\lambda)\left[C(P_2)-\varepsilon\right],
\end{aligned}$$
where the first inequality holds by (5.4.6), the second inequality follows from the concavity of the mutual information with respect to its first argument (cf. Lemma 2.46), and the third inequality follows from (5.4.4).
Letting $\varepsilon\downarrow 0$ yields that
$$C(\lambda P_1 + (1-\lambda)P_2)\ge\lambda C(P_1) + (1-\lambda)C(P_2),$$
and hence $C(P)$ is concave in $P$.
Finally, it can directly be seen by definition that $C(\cdot)$ is non-decreasing, which, together with its concavity, implies that it is continuous and strictly increasing. □

- The most commonly used cost function is the power cost function, $t(x) = x^2$, resulting in the average power constraint $P$ for each transmitted input $n$-tuple:
$$\frac{1}{n}\sum_{i=1}^{n} x_i^2\le P. \tag{5.4.7}$$
- For better understanding, we only focus on the discrete-time memoryless (additive white) Gaussian (noise) channel with average input power constraint $P$:
$$Y_i = X_i + Z_i,\quad\text{for } i = 1,2,\dots, \tag{5.4.8}$$
where $Y_i$, $X_i$ and $Z_i$ are the channel output, input and noise at time $i$, $\{Z_i\}_{i=1}^{\infty}$ is i.i.d. Gaussian with mean zero and variance $\sigma^2$, and $\{X_i\}_{i=1}^{\infty}$ is independent of $\{Z_i\}_{i=1}^{\infty}$.
Derivation of the capacity $C(P)$ for the discrete-time memoryless (additive white) Gaussian (noise) channel with average input power constraint $P$:

- For Gaussian distributed $Z$, we obtain
$$\begin{aligned}
I(X;Y) &= h(Y) - h(Y|X)\\
&= h(Y) - h(X+Z|X) &\text{(5.4.9)}\\
&= h(Y) - h(Z|X) &\text{(5.4.10)}\\
&= h(Y) - h(Z) &\text{(5.4.11)}\\
&= h(Y) - \frac{1}{2}\log_2\left(2\pi e\sigma^2\right), &\text{(5.4.12)}
\end{aligned}$$
where (5.4.9) follows from (5.4.8), (5.4.10) holds since differential entropy is invariant under translation (see Property 8 of Lemma 5.14), (5.4.11) follows from the independence of $X$ and $Z$, and (5.4.12) holds since $Z\sim\mathcal{N}(0,\sigma^2)$ is Gaussian (see (5.1.1)).
- Now since $Y = X + Z$, we have that
$$E[Y^2] = E[X^2] + E[Z^2] + 2E[X]E[Z] = E[X^2] + \sigma^2 + 2E[X]\cdot 0\le P + \sigma^2,$$
since the input in (5.4.3) is constrained to satisfy $E[X^2]\le P$.
- Thus the variance of $Y$ satisfies
$$\mathrm{Var}(Y)\le E[Y^2]\le P + \sigma^2,$$
and
$$h(Y)\le\frac{1}{2}\log_2\left(2\pi e\,\mathrm{Var}(Y)\right)\le\frac{1}{2}\log_2\left(2\pi e(P+\sigma^2)\right),$$
where the first inequality follows by Theorem 5.20 since $Y$ is real-valued (with support $\mathbb{R}$).
- Noting that equality holds in the first inequality above iff $Y$ is Gaussian, and in the second inequality iff $\mathrm{Var}(Y) = P+\sigma^2$, we obtain that choosing the input as $X\sim\mathcal{N}(0,P)$ yields $Y\sim\mathcal{N}(0,P+\sigma^2)$ and hence maximizes $I(X;Y)$ over all inputs satisfying $E[X^2]\le P$.
- Thus, the capacity of the discrete-time memoryless Gaussian channel with input average power constraint $P$ and noise variance (or power) $\sigma^2$ is given by
$$C(P) = \frac{1}{2}\log_2\left(2\pi e(P+\sigma^2)\right) - \frac{1}{2}\log_2\left(2\pi e\sigma^2\right) = \frac{1}{2}\log_2\left(1+\frac{P}{\sigma^2}\right). \tag{5.4.13}$$
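The following sketch (Python/numpy, with illustrative $P$ and $\sigma^2$) evaluates (5.4.13) and confirms it with a Monte Carlo estimate of $I(X;Y) = E\bigl[\log_2\frac{f_{Y|X}(Y|X)}{f_Y(Y)}\bigr]$ under the capacity-achieving Gaussian input $X\sim\mathcal{N}(0,P)$.

```python
import numpy as np

# Sketch: evaluate (5.4.13) and compare with a Monte Carlo estimate of I(X;Y)
# for X ~ N(0,P), Y = X + Z, Z ~ N(0,sigma2); here f_Y is N(0, P + sigma2).
rng = np.random.default_rng(4)
P, sigma2, N = 2.0, 1.0, 1_000_000

C = 0.5 * np.log2(1 + P / sigma2)

x = rng.normal(0.0, np.sqrt(P), size=N)
z = rng.normal(0.0, np.sqrt(sigma2), size=N)
y = x + z
log2_f_y_given_x = (-0.5 * np.log2(2 * np.pi * sigma2)
                    - (y - x)**2 / (2 * sigma2) * np.log2(np.e))
log2_f_y = (-0.5 * np.log2(2 * np.pi * (P + sigma2))
            - y**2 / (2 * (P + sigma2)) * np.log2(np.e))
I_mc = np.mean(log2_f_y_given_x - log2_f_y)

print(C, I_mc)   # both approximately 0.792 bits/channel use
```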
Definition 5.30 (Fixed-length data transmission code) Given positive integers $n$ and $M$, and a discrete-time memoryless Gaussian channel with input average power constraint $P$, a fixed-length data transmission code (or block code) $\mathcal{C}_n = (n,M)$ for this channel with blocklength $n$ and rate $\frac{1}{n}\log_2 M$ message bits per channel symbol (or channel use) consists of:

1. $M$ information messages intended for transmission.

2. An encoding function
$$f:\{1,2,\dots,M\}\to\mathbb{R}^n$$
yielding real-valued codewords $c_1 = f(1)$, $c_2 = f(2)$, $\dots$, $c_M = f(M)$, where each codeword $c_m = (c_{m1},\dots,c_{mn})$ is of length $n$ and satisfies the power constraint $P$:
$$\frac{1}{n}\sum_{i=1}^{n} c_{mi}^2\le P,$$
for $m = 1,2,\dots,M$.

3. A decoding function $g:\mathbb{R}^n\to\{1,2,\dots,M\}$.
Theorem 5.31 (Shannon's coding theorem for the memoryless Gaussian channel) Consider a discrete-time memoryless Gaussian channel with input average power constraint $P$, channel noise variance $\sigma^2$ and capacity $C(P)$ as given by (5.4.13).

Forward part (achievability): For any $\varepsilon\in(0,1)$, there exist $0 < \gamma < 2\varepsilon$ and a sequence of data transmission block codes $\{\mathcal{C}_n = (n,M_n)\}_{n=1}^{\infty}$ satisfying
$$\frac{1}{n}\log_2 M_n > C(P)-\gamma$$
with each codeword $c = (c_1,c_2,\dots,c_n)$ in $\mathcal{C}_n$ satisfying
$$\frac{1}{n}\sum_{i=1}^{n} c_i^2\le P \tag{5.4.14}$$
such that the probability of error $P_e(\mathcal{C}_n) < \varepsilon$ for sufficiently large $n$.

Converse part: If for any sequence of data transmission block codes $\{\mathcal{C}_n = (n,M_n)\}_{n=1}^{\infty}$ whose codewords satisfy (5.4.14) we have that
$$\liminf_{n\to\infty}\frac{1}{n}\log_2 M_n > C(P),$$
then the code's probability of error $P_e(\mathcal{C}_n)$ is bounded away from zero for all $n$ sufficiently large.
Proof of the forward part: The theorem holds trivially when $C(P) = 0$, because we can choose $M_n = 1$ for every $n$ and have $P_e(\mathcal{C}_n) = 0$. Hence, we assume without loss of generality that $C(P) > 0$.

Step 0:
- Take a positive $\gamma$ satisfying $\gamma < \min\{2\varepsilon, C(P)\}$.
- Pick $\zeta > 0$ small enough such that $2[C(P)-C(P-\zeta)] < \gamma$, where the existence of such a $\zeta$ is assured by the strictly increasing property of $C(P)$. Hence, we have
$$C(P-\zeta)-\frac{\gamma}{2} > C(P)-\gamma > 0.$$
- Choose $M_n$ to satisfy
$$C(P-\zeta)-\frac{\gamma}{2} > \frac{1}{n}\log_2 M_n > C(P)-\gamma,$$
for which the choice should exist for all sufficiently large $n$.
- Take $\delta = \gamma/8$.
- Let $F_X$ be the distribution that achieves $C(P-\zeta)$, where $C(\cdot)$ is given by (5.4.13). In this case, $F_X$ is the Gaussian distribution with mean zero and variance $P-\zeta$ and admits a pdf $f_X$. Hence, $E[X^2] = P-\zeta < P$ and $I(X;Y) = C(P-\zeta)$.
Step 1: Random coding with average power constraint.

Randomly draw $M_n$ codewords according to the pdf $f_{X^n}$ with
$$f_{X^n}(x^n) = \prod_{i=1}^{n} f_X(x_i).$$
By the law of large numbers, each randomly selected codeword $c_m = (c_{m1},\dots,c_{mn})$ satisfies
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} c_{mi}^2 = E[X^2] = P-\zeta < P$$
for $m = 1,2,\dots,M_n$.
Step 2: Code construction.

- For the $M_n$ selected codewords $\{c_1,\dots,c_{M_n}\}$, replace the codewords that violate the power constraint (i.e., (5.4.14)) by an all-zero codeword $\mathbf{0}$.
- Define the encoder as
$$f_n(m) = c_m\quad\text{for } 1\le m\le M_n.$$
- Given a received output sequence $y^n$, the decoder $g_n(\cdot)$ is given by
$$g_n(y^n) = \begin{cases}
m, & \text{if } (c_m,y^n)\in\mathcal{F}_n(\delta)\ \text{and}\ (\forall\,m'\ne m)\ (c_{m'},y^n)\notin\mathcal{F}_n(\delta),\\
\text{arbitrary}, & \text{otherwise,}
\end{cases}$$
where the set
$$\mathcal{F}_n(\delta)\triangleq\Bigl\{(x^n,y^n)\in\mathcal{X}^n\times\mathcal{Y}^n:
\Bigl|-\tfrac{1}{n}\log_2 f_{X^nY^n}(x^n,y^n)-h(X,Y)\Bigr| < \delta,\
\Bigl|-\tfrac{1}{n}\log_2 f_{X^n}(x^n)-h(X)\Bigr| < \delta,\
\text{and}\ \Bigl|-\tfrac{1}{n}\log_2 f_{Y^n}(y^n)-h(Y)\Bigr| < \delta\Bigr\}.$$

Note that $\mathcal{F}_n(\delta)$ is generated by
$$f_{X^nY^n}(x^n,y^n) = \prod_{i=1}^{n} f_{XY}(x_i,y_i),$$
where $f_{X^nY^n}(x^n,y^n)$ is the joint input-output pdf realized when the memoryless Gaussian channel (with $n$-fold transition pdf
$$f_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} f_{Y|X}(y_i|x_i)\,)$$
is driven by the input $X^n$ with pdf
$$f_{X^n}(x^n) = \prod_{i=1}^{n} f_X(x_i)$$
(where $f_X$ achieves $C(P-\zeta)$).

Step 3: Conditional probability of error.


Let m = m(Cn ) denote the conditional error probability given codeword
m is transmitted (with respect to code Cn ).
Dene  $
n
n 1
E0  x X :
n
x2i > P .
n i=1
Then
 Mn 

m(Cn ) fY n|X n (y n|cm) dy n + fY n|X n (y n |cm ) dy n ,
y n Fn (|cm ) y n Fn (|cm )
m =1
m =m

where
Fn (|xn)  {y n Y n : (xn, y n ) Fn()} .
Capacity for discrete memoryless Gaussian channel I: 5-51

By taking expectation with respect to the mth codeword-selecting distribu-


tion fX n (cm), we obtain

E[m ] = fX n (cm)m(Cn ) dcm
cmX 
n

= fX n (cm )m(Cn) dcm + fX n (cm)m (Cn) dcm


cmX E0 cm X n E0c
n

fX n (cm ) dcm + fX n (cm )m(Cn) dcm
cm E0 cm X n
 
PX n (E0) + fX n (cm)fY n|X n (y n |cm) dy n dcm
cm X n y n Fn (|cm )
 Mn 

+ fX n (cm)fY n|X n (y n |cm ) dy n dcm .
cm X n y n Fn (|cm )
m =1
m =m
= PX n (E0) + PX n,Y n (Fnc ())
Mn  
+ fX n,Y n (cm , y n ) dy n dcm . (5.4.15)
cm X n y n Fn (|cm )
m =1
m =m
- The additional term $P_{X^n}(\mathcal{E}_0)$ in (5.4.15) copes with the errors due to the all-zero codeword replacement; it will be less than $\delta$ for all sufficiently large $n$ by the law of large numbers.
- Finally, by carrying out a procedure similar to the proof of the channel coding theorem for discrete channels (cf. Theorem 4.10), we obtain:
$$\begin{aligned}
E[P_e(\mathcal{C}_n)] &\le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta))
+ M_n\,2^{n(h(X,Y)+\delta)}\,2^{-n(h(X)-\delta)}\,2^{-n(h(Y)-\delta)}\\
&\le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta))
+ 2^{n(C(P-\zeta)-4\delta)}\,2^{-n(I(X;Y)-3\delta)}\\
&= P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) + 2^{-n\delta}.
\end{aligned}$$
Accordingly, we can make the average probability of error, $E[P_e(\mathcal{C}_n)]$, less than $3\delta = 3\gamma/8 < 3\varepsilon/4 < \varepsilon$ for all sufficiently large $n$. □
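The random-coding argument can be mimicked numerically. The sketch below (Python/numpy) is a simplified experiment: it draws a Gaussian codebook at a rate below capacity and uses nearest-neighbour (maximum-likelihood) decoding in place of the typical-set decoder of the proof, and it omits the all-zero replacement of power-violating codewords; all parameter values are illustrative.

```python
import numpy as np

# Simplified random-coding experiment in the spirit of the forward part.
rng = np.random.default_rng(5)
P, sigma2, n, R, trials = 1.0, 1.0, 100, 0.125, 200
C = 0.5 * np.log2(1 + P / sigma2)                 # 0.5 bit/channel use
M = int(2 ** (n * R))                             # number of messages at rate R < C

codebook = rng.normal(0.0, np.sqrt(P), size=(M, n))   # i.i.d. Gaussian codewords

errors = 0
for _ in range(trials):
    m = rng.integers(M)                           # uniformly chosen message
    y = codebook[m] + rng.normal(0.0, np.sqrt(sigma2), size=n)
    m_hat = np.argmin(np.sum((codebook - y) ** 2, axis=1))   # ML (nearest-neighbour) decoding
    errors += (m_hat != m)

print(f"R = {R} < C = {C}, empirical block error rate = {errors / trials}")
```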
Proof of the converse part: Consider an $(n,M_n)$ block data transmission code satisfying the power constraint
$$\frac{1}{n}\sum_{i=1}^{n} c_i^2\le P$$
with encoding function
$$f_n:\{1,2,\dots,M_n\}\to\mathcal{X}^n$$
and decoding function
$$g_n:\mathcal{Y}^n\to\{1,2,\dots,M_n\}.$$
Since the message $W$ is uniformly distributed over $\{1,2,\dots,M_n\}$, we have
$$H(W) = \log_2 M_n.$$
Since $W\to X^n = f_n(W)\to Y^n$ forms a Markov chain (as $Y^n$ only depends on $X^n$), we obtain by the data processing lemma that
$$I(W;Y^n)\le I(X^n;Y^n).$$

We can also bound $I(X^n;Y^n)$ by $nC(P)$ as follows:
$$\begin{aligned}
I(X^n;Y^n) &\le\sup_{F_{X^n}:\,(1/n)\sum_{i=1}^{n}E[X_i^2]\le P} I(X^n;Y^n)\\
&\le\sup_{F_{X^n}:\,(1/n)\sum_{i=1}^{n}E[X_i^2]\le P}\ \sum_{j=1}^{n} I(X_j;Y_j) &\text{(by Theorem 2.21)}\\
&= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sup_{F_{X^n}:\,(\forall i)\,E[X_i^2]\le P_i}\ \sum_{j=1}^{n} I(X_j;Y_j)\\
&\le\sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sum_{j=1}^{n}\ \sup_{F_{X^n}:\,(\forall i)\,E[X_i^2]\le P_i} I(X_j;Y_j)\\
&= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sum_{j=1}^{n}\ \sup_{F_{X_j}:\,E[X_j^2]\le P_j} I(X_j;Y_j)\\
&= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ \sum_{j=1}^{n} C(P_j)
= \sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ n\cdot\frac{1}{n}\sum_{j=1}^{n} C(P_j)\\
&\le\sup_{(P_1,P_2,\dots,P_n):\,(1/n)\sum_{i=1}^{n}P_i = P}\ n\,C\!\left(\frac{1}{n}\sum_{j=1}^{n} P_j\right) &\text{(by concavity of } C(P))\\
&= nC(P).
\end{aligned}$$

Consequently, recalling that $P_e(\mathcal{C}_n)$ is the average error probability incurred by guessing $W$ from observing $Y^n$ via the decoding function $g_n:\mathcal{Y}^n\to\{1,2,\dots,M_n\}$, we get
$$\begin{aligned}
\log_2 M_n &= H(W)\\
&= H(W|Y^n) + I(W;Y^n)\\
&\le H(W|Y^n) + I(X^n;Y^n)\\
&\le\underbrace{h_b(P_e(\mathcal{C}_n)) + P_e(\mathcal{C}_n)\log_2(|\mathcal{W}|-1)}_{\text{Fano's inequality}} + nC(P)\\
&\le 1 + P_e(\mathcal{C}_n)\log_2(M_n-1) + nC(P) \qquad\text{(by the fact that } (\forall\,t\in[0,1])\ h_b(t)\le 1)\\
&< 1 + P_e(\mathcal{C}_n)\log_2 M_n + nC(P),
\end{aligned}$$
which implies that
$$P_e(\mathcal{C}_n) > 1-\frac{C(P)}{(1/n)\log_2 M_n}-\frac{1}{\log_2 M_n}.$$
So if
$$\liminf_{n\to\infty}\frac{1}{n}\log_2 M_n > C(P),$$
then there exist $\rho > 0$ and an integer $N$ such that for $n\ge N$,
$$\frac{1}{n}\log_2 M_n > C(P)+\rho.$$
Hence, for $n\ge N_0\triangleq\max\{N,2/\rho\}$,
$$P_e(\mathcal{C}_n)\ge 1-\frac{C(P)}{C(P)+\rho}-\frac{1}{n(C(P)+\rho)}\ge\frac{\rho}{2(C(P)+\rho)}. \qquad\Box$$
Theorem 5.32 (Gaussian noise minimizes the capacity of additive-noise channels) Every discrete-time continuous memoryless channel with additive noise (admitting a pdf) of mean zero and variance $\sigma^2$ and input average power constraint $P$ has its capacity $C(P)$ lower bounded by the capacity of the memoryless Gaussian channel with identical input constraint and noise variance:
$$C(P)\ge\frac{1}{2}\log_2\left(1+\frac{P}{\sigma^2}\right).$$

Proof:
- Let $f_{Y|X}$ and $f_{Y_g|X_g}$ denote the transition pdfs of the additive-noise channel and the Gaussian channel, respectively, where both channels satisfy the input average power constraint $P$.
- Let $Z$ and $Z_g$ respectively denote their zero-mean noise variables of identical variance $\sigma^2$.
Then we have
$$\begin{aligned}
&I(f_{X_g},f_{Y|X}) - I(f_{X_g},f_{Y_g|X_g})\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_Z(y-x)}{f_Y(y)}\,dy\,dx
-\int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_{Z_g}(y-x)\log_2\frac{f_{Z_g}(y-x)}{f_{Y_g}(y)}\,dy\,dx\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_Z(y-x)}{f_Y(y)}\,dy\,dx
-\int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_{Z_g}(y-x)}{f_{Y_g}(y)}\,dy\,dx\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)\log_2\frac{f_Z(y-x)f_{Y_g}(y)}{f_{Z_g}(y-x)f_Y(y)}\,dy\,dx\\
&\ge\int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x)f_Z(y-x)(\log_2 e)\left[1-\frac{f_{Z_g}(y-x)f_Y(y)}{f_Z(y-x)f_{Y_g}(y)}\right]dy\,dx\\
&= (\log_2 e)\left[1-\int_{\mathcal{Y}}\frac{f_Y(y)}{f_{Y_g}(y)}\left(\int_{\mathcal{X}} f_{X_g}(x)f_{Z_g}(y-x)\,dx\right)dy\right] = 0,
\end{aligned}$$
with equality holding in the inequality iff $f_Y(y)/f_{Y_g}(y) = f_Z(y-x)/f_{Z_g}(y-x)$ for all $x$.
(The second equality above holds because $\log_2 f_{Z_g}(y-x)$ and $\log_2 f_{Y_g}(y)$ are quadratic in $y-x$ and $y$, respectively, so their expectations depend on the noise only through its first and second moments, which are identical for $Z$ and $Z_g$; the inequality uses $\log_2 t\ge(\log_2 e)(1-1/t)$.)
Therefore,
$$\begin{aligned}
\frac{1}{2}\log_2\left(1+\frac{P}{\sigma^2}\right) &= \sup_{F_X:\,E[X^2]\le P} I(F_X,f_{Y_g|X_g})\\
&= I(f_{X_g},f_{Y_g|X_g})\\
&\le I(f_{X_g},f_{Y|X})\\
&\le\sup_{F_X:\,E[X^2]\le P} I(F_X,f_{Y|X})\\
&= C(P). \qquad\Box
\end{aligned}$$
The water-filling principle

Theorem 5.34 (Capacity of uncorrelated parallel Gaussian channels) The capacity of $k$ uncorrelated parallel Gaussian channels under an overall input power constraint $P$ is given by
$$C(P) = \sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\sigma_i^2}\right),$$
where $\sigma_i^2$ is the noise variance of channel $i$,
$$P_i = \max\{0,\theta-\sigma_i^2\},$$
and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} P_i = P$. This capacity is achieved by a tuple of independent Gaussian inputs $(X_1,X_2,\dots,X_k)$, where $X_i\sim\mathcal{N}(0,P_i)$ is the input to channel $i$, for $i = 1,2,\dots,k$.

[Figure: water-filling of the total power $P = P_1 + P_2 + P_4$ over the noise levels $\sigma_1^2,\sigma_2^2,\sigma_3^2,\sigma_4^2$; channel 3, whose noise level $\sigma_3^2$ exceeds the water level $\theta$, receives no power.]
Proof:
- By definition,
$$C(P) = \sup_{F_{X^k}:\,\sum_{i=1}^{k}E[X_i^2]\le P} I(X^k;Y^k).$$
- Since the noise random variables $Z_1,\dots,Z_k$ are independent of each other,
$$\begin{aligned}
I(X^k;Y^k) &= h(Y^k)-h(Y^k|X^k)\\
&= h(Y^k)-h(Z^k+X^k|X^k)\\
&= h(Y^k)-h(Z^k|X^k)\\
&= h(Y^k)-h(Z^k)\\
&= h(Y^k)-\sum_{i=1}^{k} h(Z_i)\\
&\le\sum_{i=1}^{k} h(Y_i)-\sum_{i=1}^{k} h(Z_i)\\
&\le\sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\sigma_i^2}\right).
\end{aligned}$$
- Equalities hold above if all the inputs $X_i$ are independent of each other with each $X_i\sim\mathcal{N}(0,P_i)$ such that $\sum_{i=1}^{k} P_i = P$.
- Thus the problem is reduced to finding the power allotment that maximizes the overall capacity subject to the equality constraint $\sum_{i=1}^{k} P_i = P$ and the inequality constraints $P_i\ge 0$, $i = 1,\dots,k$.
- By using the Lagrange multiplier technique and verifying the KKT conditions (see Example B.22 in Appendix B.10), the maximizer $(P_1,\dots,P_k)$ of
$$\max\left\{\sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\sigma_i^2}\right)+\sum_{i=1}^{k}\lambda_i P_i-\lambda\left(\sum_{i=1}^{k} P_i-P\right)\right\}$$
can be found by taking the derivative of the above expression (with respect to $P_i$) and setting it to zero, which yields
$$\lambda_i = -\frac{1}{2\ln(2)}\cdot\frac{1}{P_i+\sigma_i^2}+\lambda\ \begin{cases} = 0, & \text{if } P_i > 0;\\ \ge 0, & \text{if } P_i = 0.\end{cases}$$
Hence,
$$\begin{cases} P_i = \theta-\sigma_i^2, & \text{if } P_i > 0;\\ P_i\ge\theta-\sigma_i^2, & \text{if } P_i = 0,\end{cases}
\qquad\text{(equivalently, } P_i = \max\{0,\theta-\sigma_i^2\}\text{),}$$
where $\theta\triangleq\log_2 e/(2\lambda)$ is chosen to satisfy $\sum_{i=1}^{k} P_i = P$. □
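The water-filling allotment of Theorem 5.34 is easy to compute by a bisection search on the water level $\theta$; a minimal sketch (Python/numpy; the noise levels and power budget in the example are assumed values) follows.

```python
import numpy as np

def water_filling(noise_vars, P, tol=1e-12):
    """Water-filling of Theorem 5.34: P_i = max(0, theta - sigma_i^2), with the
    water level theta found by bisection so that sum_i P_i = P."""
    noise_vars = np.asarray(noise_vars, dtype=float)
    lo, hi = noise_vars.min(), noise_vars.max() + P     # theta lies in [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if np.sum(np.maximum(0.0, mid - noise_vars)) > P:
            hi = mid
        else:
            lo = mid
    theta = (lo + hi) / 2
    powers = np.maximum(0.0, theta - noise_vars)
    capacity = 0.5 * np.sum(np.log2(1 + powers / noise_vars))
    return theta, powers, capacity

# Example with four noise levels (assumed values, echoing the figure above).
theta, powers, C = water_filling([1.0, 0.5, 3.0, 1.5], P=2.0)
print(theta, powers, C)      # the noisiest channel may receive zero power
```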
Observation 5.35
- Although it seems reasonable to allocate more power to less noisy channels in order to optimize the channel capacity according to the water-filling principle, the Gaussian inputs required for achieving channel capacity do not fit digital communication systems in practice.
- One may wonder what the optimal power allocation scheme becomes when the channel inputs are dictated to be practically discrete in values, such as binary phase-shift keying (BPSK), quadrature phase-shift keying (QPSK), or 16-quadrature-amplitude modulation (16-QAM).
- Surprisingly, a different procedure from the water-filling principle results under certain conditions.
- Indeed, the optimal power allocation for parallel AWGN channels with inputs constrained to be discrete has been established, resulting in a new graphical power-allocation interpretation called the mercury/water-filling principle [31].
- This research concludes that when the total transmission power $P$ is small, the strategy that maximizes the capacity follows approximately the equal signal-to-noise ratio (SNR) principle.
- Very recently, it was found that when the additive noise is no longer Gaussian, the mercury adjustment fails to interpret the optimal power allocation scheme. For this reason, the graphical interpretation of the optimal power allocation principle is simply named the two-phase water-filling principle in [48].
- We end this observation by emphasizing that the true measure for a digital communication system is no doubt the effective transmission rate subject to an acceptably low transmission error rate (e.g., an overall bit error rate of $10^{-5}$). In order to make the analysis tractable, researchers adopt the channel capacity as the design criterion instead, in the hope of obtaining a simple reference scheme for practical systems. Further study is thus required to examine the feasibility of such an approach (even though the water-filling principle is certainly a theoretically interesting and beautiful result).
Capacity of correlated parallel Gaussian channels

Theorem 5.36 (Capacity of correlated parallel Gaussian channels) The capacity of $k$ correlated parallel Gaussian channels with positive-definite noise covariance matrix $K_Z$ under overall input power constraint $P$ is given by
$$C(P) = \sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{P_i}{\lambda_i}\right),$$
where $\lambda_i$ is the $i$-th eigenvalue of $K_Z$,
$$P_i = \max\{0,\theta-\lambda_i\},$$
and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} P_i = P$. This capacity is achieved by a tuple of zero-mean Gaussian inputs $(X_1,X_2,\dots,X_k)$ with covariance matrix $K_X$ having the same eigenvectors as $K_Z$, where the $i$-th eigenvalue of $K_X$ is $P_i$, for $i = 1,2,\dots,k$.

Proof:
- In correlated parallel Gaussian channels, the input power constraint becomes
$$\sum_{i=1}^{k} E[X_i^2] = \mathrm{tr}(K_X)\le P,$$
where $\mathrm{tr}(\cdot)$ denotes the trace of the $k\times k$ matrix $K_X$.
- Since in each channel the input and noise variables are independent of each other, we have
$$I(X^k;Y^k) = h(Y^k)-h(Y^k|X^k) = h(Y^k)-h(Z^k+X^k|X^k) = h(Y^k)-h(Z^k|X^k) = h(Y^k)-h(Z^k).$$
- Since $h(Z^k)$ is not determined by the input, determining the system's capacity reduces to maximizing $h(Y^k)$ over all possible inputs $(X_1,\dots,X_k)$ satisfying the power constraint.
- Now observe that the covariance matrix of $Y^k$ is equal to
$$K_Y = K_X + K_Z,$$
which implies by Theorem 5.20 that the differential entropy of $Y^k$ is upper bounded by
$$h(Y^k)\le\frac{1}{2}\log_2\left[(2\pi e)^k\det(K_X+K_Z)\right],$$
with equality iff $Y^k$ is Gaussian. It remains to find out whether we can find inputs $(X_1,\dots,X_k)$ satisfying the power constraint which achieve the above upper bound and maximize it.

- As in the proof of Theorem 5.18, we can orthogonally diagonalize $K_Z$ as
$$K_Z = A\Lambda A^T,$$
where $AA^T = I_k$ (and thus $\det(A)^2 = 1$), $I_k$ is the $k\times k$ identity matrix, and $\Lambda$ is a diagonal matrix with positive diagonal components consisting of the eigenvalues of $K_Z$ (as $K_Z$ is positive definite). Then
$$\begin{aligned}
\det(K_X+K_Z) &= \det(K_X+A\Lambda A^T)\\
&= \det(AA^TK_XAA^T+A\Lambda A^T)\\
&= \det(A)\det(A^TK_XA+\Lambda)\det(A^T)\\
&= \det(A^TK_XA+\Lambda)\\
&= \det(B+\Lambda),
\end{aligned}$$
where $B\triangleq A^TK_XA$.
- Since for any two matrices $C$ and $D$, $\mathrm{tr}(CD) = \mathrm{tr}(DC)$, we have that
$$\mathrm{tr}(B) = \mathrm{tr}(A^TK_XA) = \mathrm{tr}(AA^TK_X) = \mathrm{tr}(I_kK_X) = \mathrm{tr}(K_X).$$

- Thus the capacity problem is further transformed to maximizing $\det(B+\Lambda)$ subject to $\mathrm{tr}(B)\le P$.
- By observing that $B+\Lambda$ is positive definite (because $\Lambda$ is positive definite) and using Hadamard's inequality given in Corollary 5.19, we have
$$\det(B+\Lambda)\le\prod_{i=1}^{k}(B_{ii}+\lambda_i),$$
where $\lambda_i$ is the entry of the matrix $\Lambda$ located at the $i$-th row and $i$-th column, which is exactly the $i$-th eigenvalue of $K_Z$.
- Thus, the maximum value of $\det(B+\Lambda)$ under $\mathrm{tr}(B)\le P$ is realized by a diagonal matrix $B$ (to achieve equality in Hadamard's inequality) with
$$\sum_{i=1}^{k} B_{ii} = P\quad\text{and each } B_{ii}\ge 0.$$

Thus,
$$C(P) = \frac{1}{2}\log_2\left[(2\pi e)^k\det(B+\Lambda)\right]-\frac{1}{2}\log_2\left[(2\pi e)^k\det(\Lambda)\right]
= \sum_{i=1}^{k}\frac{1}{2}\log_2\left(1+\frac{B_{ii}}{\lambda_i}\right).$$
Finally, as in the proof of Theorem 5.34, we obtain a water-filling allotment for the optimal diagonal elements of $B$:
$$B_{ii} = \max\{0,\theta-\lambda_i\},$$
where $\theta$ is chosen to satisfy $\sum_{i=1}^{k} B_{ii} = P$. □
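For correlated channels the same bisection applies after an eigen-decomposition of $K_Z$, as in the proof above; the following sketch (Python/numpy, with an assumed $3\times 3$ noise covariance) returns the capacity and the per-eigenmode power allotment.

```python
import numpy as np

def correlated_capacity(K_Z, P, iters=200):
    """Theorem 5.36: water-fill the power P over the eigenvalues of K_Z."""
    lam = np.linalg.eigvalsh(np.asarray(K_Z, dtype=float))   # eigenvalues of K_Z
    lo, hi = lam.min(), lam.max() + P
    for _ in range(iters):                                   # bisection on theta
        mid = (lo + hi) / 2
        if np.sum(np.maximum(0.0, mid - lam)) > P:
            hi = mid
        else:
            lo = mid
    powers = np.maximum(0.0, (lo + hi) / 2 - lam)
    return 0.5 * np.sum(np.log2(1 + powers / lam)), powers

# Illustrative 3x3 noise covariance (assumed values).
K_Z = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.2, 0.3],
                [0.2, 0.3, 0.8]])
C, powers = correlated_capacity(K_Z, P=3.0)
print(C, powers)
```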
Non-Gaussian discrete-time memoryless channels

- If a discrete-time channel has additive but non-Gaussian memoryless noise and an input power constraint, then it is often hard to calculate its capacity. Hence, we introduce an upper bound and a lower bound on the capacity of such a channel (we assume that the noise admits a pdf).

Definition 5.37 (Entropy power) For a continuous random variable $Z$ with (well-defined) differential entropy $h(Z)$ (measured in bits), its entropy power is denoted by $Z_e$ and defined as
$$Z_e\triangleq\frac{1}{2\pi e}\,2^{2h(Z)}.$$

Lemma 5.38 For a discrete-time continuous-alphabet memoryless additive-noise channel with input power constraint $P$ and noise variance $\sigma^2$, its capacity satisfies
$$\frac{1}{2}\log_2\frac{P+\sigma^2}{\sigma^2}\le C(P)\le\frac{1}{2}\log_2\frac{P+\sigma^2}{Z_e}. \tag{5.7.1}$$
Proof: The lower bound in (5.7.1) is already proved in Theorem 5.32. The upper bound follows from
$$I(X;Y) = h(Y)-h(Z)\le\frac{1}{2}\log_2\left[2\pi e(P+\sigma^2)\right]-\frac{1}{2}\log_2\left[2\pi e\,Z_e\right]. \qquad\Box$$
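For a concrete (non-Gaussian) noise, the bounds in (5.7.1) are simple to evaluate; the sketch below (Python/numpy) uses additive noise uniform on $[-a,a]$ as an assumed example, for which $\mathrm{Var}(Z) = a^2/3$ and $h(Z) = \log_2(2a)$.

```python
import numpy as np

# Sketch: the bounds (5.7.1) for additive noise uniform on [-a, a] (an assumed
# example), using Var(Z) = a^2/3, h(Z) = log2(2a) and Z_e = 2^(2 h(Z))/(2 pi e).
P, a = 1.0, 1.0
var_Z = a**2 / 3
h_Z = np.log2(2 * a)
Z_e = 2 ** (2 * h_Z) / (2 * np.pi * np.e)

lower = 0.5 * np.log2((P + var_Z) / var_Z)
upper = 0.5 * np.log2((P + var_Z) / Z_e)
print(Z_e, var_Z, lower, upper)   # Z_e < Var(Z), so lower <= upper
```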
- The entropy power of $Z$ can be viewed as the variance of a Gaussian random variable with the same differential entropy as $Z$:
$$Z_e = \frac{1}{2\pi e}\,2^{2h(Z)} = \mathrm{Var}(Z),\quad\text{if } Z\text{ is Gaussian.}$$
- Whenever two independent Gaussian random variables, $Z_1$ and $Z_2$, are added, the power (variance) of the sum is equal to the sum of the powers (variances) of $Z_1$ and $Z_2$. This relationship can then be written as
$$2^{2h(Z_1+Z_2)} = 2^{2h(Z_1)}+2^{2h(Z_2)},$$
or equivalently
$$\mathrm{Var}(Z_1+Z_2) = \mathrm{Var}(Z_1)+\mathrm{Var}(Z_2).$$
- (Entropy-power inequality) However, when the two independent random variables are non-Gaussian, the relationship becomes
$$2^{2h(Z_1+Z_2)}\ge 2^{2h(Z_1)}+2^{2h(Z_2)}, \tag{5.7.2}$$
or equivalently
$$Z_e(Z_1+Z_2)\ge Z_e(Z_1)+Z_e(Z_2). \tag{5.7.3}$$
This reveals that the sum of two independent random variables may introduce more (differential) entropy power than the sum of the individual entropy powers, except in the Gaussian case.
Capacity of the band-limited white Gaussian channel

[Figure: Waveform channel — the input X(t) is added to the noise Z(t) and the sum is passed through the filter H(f) to produce the output Y(t).]

The output waveform is given by
$$Y(t) = (X(t)+Z(t))*h(t),\quad t\ge 0,$$
where $*$ represents the convolution operation.¹

¹ Recall that the convolution between two signals $a(t)$ and $b(t)$ is defined as $a(t)*b(t) = \int_{-\infty}^{\infty} a(\tau)b(t-\tau)\,d\tau$.
- (Time-unlimited) $X(t)$ is the channel input waveform with average power constraint
$$\lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} E[X^2(t)]\,dt\le P \tag{5.8.1}$$
and bandwidth $W$ cycles per second or Hertz (Hz); i.e., its spectrum or Fourier transform
$$X(f)\triangleq\mathcal{F}[X(t)] = \int_{-\infty}^{+\infty} X(t)e^{-j2\pi ft}\,dt = 0$$
for all frequencies $|f| > W$, where $j = \sqrt{-1}$ is the imaginary unit.
- $Z(t)$ is the noise waveform of a zero-mean stationary white Gaussian process with power spectral density $N_0/2$; i.e., its power spectral density $\mathrm{PSD}_Z(f)$, which is the Fourier transform of the process covariance (equivalently, correlation) function $K_Z(\tau)\triangleq E[Z(s)Z(s+\tau)]$, $s,\tau\in\mathbb{R}$, is given by
$$\mathrm{PSD}_Z(f) = \mathcal{F}[K_Z(t)] = \int_{-\infty}^{+\infty} K_Z(t)e^{-j2\pi ft}\,dt = \frac{N_0}{2}\quad\forall\,f.$$
- $h(t)$ is the impulse response of an ideal bandpass filter with cutoff frequencies at $\pm W$ Hz:
$$H(f) = \mathcal{F}[h(t)] = \begin{cases}1 & \text{if } -W\le f\le W,\\ 0 & \text{otherwise.}\end{cases}$$
- Note that we can write the channel output as
$$Y(t) = X(t)+\tilde{Z}(t),$$
where $\tilde{Z}(t)\triangleq Z(t)*h(t)$ is the filtered noise waveform.
- To determine the capacity (in bits per second) of this continuous-time band-limited white Gaussian channel with parameters $P$, $W$ and $N_0$, we convert it to an equivalent discrete-time channel with power constraint $P$ by using the well-known sampling theorem (due to Nyquist, Kotelnikov and Shannon), which states that sampling a band-limited signal of bandwidth $W$ every $1/(2W)$ seconds (i.e., at a rate of $2W$ samples per second) is sufficient to reconstruct the signal from its samples.
- Since $X(t)$, $\tilde{Z}(t)$ and $Y(t)$ are all band-limited to $[-W,W]$, we can thus represent these signals by their samples taken $\frac{1}{2W}$ seconds apart and model the channel by a discrete-time channel described by
$$Y_n = X_n+\tilde{Z}_n,\quad n = 1,2,\dots,$$
where $X_n\triangleq X\!\left(\frac{n}{2W}\right)$ are the input samples and $\tilde{Z}_n$ and $Y_n$ are the corresponding samples of the noise $\tilde{Z}(t)$ and output $Y(t)$ signals, respectively.
- Since $\tilde{Z}(t)$ is a filtered version of $Z(t)$, which is a zero-mean stationary Gaussian process, we obtain that $\tilde{Z}(t)$ is also zero-mean, stationary and Gaussian. This directly implies that the samples $\tilde{Z}_n$, $n = 1,2,\dots$, are zero-mean, identically distributed Gaussian random variables.
- Note that
$$\mathrm{PSD}_{\tilde{Z}}(f) = \mathrm{PSD}_Z(f)\,|H(f)|^2 = \begin{cases}\frac{N_0}{2} & \text{if } -W\le f\le W,\\ 0 & \text{otherwise.}\end{cases}$$
Hence, the covariance function of the filtered noise process is
$$K_{\tilde{Z}}(\tau) = E[\tilde{Z}(s)\tilde{Z}(s+\tau)] = \mathcal{F}^{-1}[\mathrm{PSD}_{\tilde{Z}}(f)] = N_0W\,\mathrm{sinc}(2W\tau),\quad\tau\in\mathbb{R},$$
which implies
$$E[\tilde{Z}_n^2] = K_{\tilde{Z}}(0) = N_0W.$$
Moreover, since $K_{\tilde{Z}}(k/(2W)) = N_0W\,\mathrm{sinc}(k) = 0$ for every non-zero integer $k$, the samples $\tilde{Z}_n$ are uncorrelated and hence (being jointly Gaussian) independent.
- We conclude that the capacity of the band-limited white Gaussian channel in bits per channel use is given, using (5.4.13), by
$$\frac{1}{2}\log_2\left(1+\frac{P}{N_0W}\right)\quad\text{bits/channel use.}$$
- Given that we are using the channel (with inputs $X_n$) every $\frac{1}{2W}$ seconds, we obtain that the capacity in bits/second of the band-limited white Gaussian channel is given by
$$C(P) = \frac{\frac{1}{2}\log_2\left(1+\frac{P}{N_0W}\right)}{\frac{1}{2W}} = W\log_2\left(1+\frac{P}{N_0W}\right)\quad\text{bits/second,} \tag{5.8.2}$$
where $\frac{P}{N_0W}$ is typically referred to as the signal-to-noise ratio (SNR).
- (5.8.2) is achieved by zero-mean i.i.d. Gaussian $\{X_n\}$ with $E[X_n^2] = P$, which can be obtained by sampling a zero-mean, stationary Gaussian $X(t)$ with
$$\mathrm{PSD}_X(f) = \begin{cases}\frac{P}{2W} & \text{if } -W\le f\le W,\\ 0 & \text{otherwise.}\end{cases}$$
Examining this $X(t)$ confirms that it satisfies (5.8.1):
$$\frac{1}{T}\int_{-T/2}^{T/2} E[X^2(t)]\,dt = E[X^2(t)] = K_X(0) = P\,\mathrm{sinc}(2W\cdot 0) = P.$$
Example 5.39 (Telephone line channel) Suppose telephone signals are band-limited to 4 kHz. Given an SNR of 40 decibels (dB), i.e., $10\log_{10}\frac{P}{N_0W} = 40$ dB, then from (5.8.2) we calculate that the capacity of the telephone line channel (when modeled via the band-limited white Gaussian channel) is given by
$$4000\log_2(1+10000) = 53151.4\ \text{bits/second} = 51.906\ \text{Kbits/second.}$$
By increasing the bandwidth to 1.2 MHz, the capacity becomes
$$1200000\log_2(1+10000) = 15945428\ \text{bits/second} = 15.207\ \text{Mbits/second.}$$
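The figures in Example 5.39 follow directly from (5.8.2); a short computation (Python/numpy) reproduces them, with the Kbit/Mbit conversions on the slide using binary prefixes (1 K = 1024).

```python
import numpy as np

# Reproducing the numbers of Example 5.39 from (5.8.2).
snr_db = 40.0
snr = 10 ** (snr_db / 10)                 # = 10000

for W in (4_000, 1_200_000):              # bandwidth in Hz
    C = W * np.log2(1 + snr)              # bits/second
    print(W, C, C / 1024, C / 1024**2)    # bits/s, Kbits/s, Mbits/s
```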
Observation 5.41 (Band-limited colored Gaussian channel) If the above band-limited channel has stationary colored (non-white) additive Gaussian noise, then it can be shown (e.g., see [19]) that the capacity of this channel becomes
$$C(P) = \int_{-W}^{W}\max\left\{0,\frac{1}{2}\log_2\frac{\theta}{\mathrm{PSD}_Z(f)}\right\}df,$$
where $\theta$ is the solution of
$$P = \int_{-W}^{W}\max\left\{0,\theta-\mathrm{PSD}_Z(f)\right\}df.$$
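The water level $\theta$ of Observation 5.41 can be found numerically by discretizing the band and bisecting on $\theta$; the sketch below (Python/numpy) uses an assumed noise spectrum purely for illustration.

```python
import numpy as np

# Sketch of Observation 5.41: numerically water-pour over frequency for an
# assumed colored noise spectrum PSD_Z(f), then evaluate C(P) by quadrature.
W, P = 4_000.0, 1.0e-2
f = np.linspace(-W, W, 4001)
df = f[1] - f[0]
psd = 1e-6 * (1 + (f / W) ** 2)            # assumed noise spectrum (arbitrary shape)

lo, hi = psd.min(), psd.max() + P / (2 * W)
for _ in range(200):                        # bisection on the water level theta
    mid = (lo + hi) / 2
    if np.sum(np.maximum(0.0, mid - psd)) * df > P:
        hi = mid
    else:
        lo = mid
theta = (lo + hi) / 2

C = np.sum(np.maximum(0.0, 0.5 * np.log2(theta / psd))) * df
print(theta, C)   # water level and capacity in bits/second
```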
Capacity of band-limited white Gaussian channel I: 5-79

(a) The spectrum of PSDZ (f ) where the horizontal line repre-


sents , the level at which water rises to.

(b) The input spectrum that achieves capacity.


Water-pouring for the band-limited colored Gaussian channel.
Capacity of the time-limited Gaussian channel

The system is given by
$$Y(t) = X(t)+Z(t),$$
where the zero-mean noise process $Z(t)$ has power spectral density $\mathrm{PSD}_Z(f)$.

Notes:
- We cannot use the sampling theorem in this case (as we did previously) because $X(t)$, $Y(t)$ and $Z(t)$ are not band-limited but actually time-limited to $[0,T)$.
- $Z(t)$ is no longer assumed white (so the noise samples are not i.i.d.).

Lemma 5.42 Any real-valued function $v(t)$ defined over $[0,T)$ can be decomposed into
$$v(t) = \sum_{i=1}^{\infty} v_i\phi_i(t),$$
where the real-valued functions $\{\phi_i(t)\}_{i=1}^{\infty}$ are any orthonormal set (which can span the space with respect to $v(t)$) of functions on $[0,T)$, namely
$$\int_0^T\phi_i(t)\phi_j(t)\,dt = \begin{cases}1, & \text{if } i = j;\\ 0, & \text{if } i\ne j,\end{cases}$$
and
$$v_i = \int_0^T\phi_i(t)v(t)\,dt.$$

Notes:
- The sampling theorem employs the sinc function as the orthonormal set to span the space of band-limited signals.
- The sinc orthonormal set, however, cannot be used for time-limited signals. We therefore have to introduce the Karhunen-Loève expansion.

Lemma 5.43 (Karhunen-Loève expansion) Given a stationary random process $Z(t)$ with correlation function
$$K_Z(\tau) = E[Z(t)Z(t+\tau)],$$
let $\{\phi_i(\tau)\}_{i=1}^{\infty}$ and $\{\lambda_i\}_{i=1}^{\infty}$ be the eigenfunctions and eigenvalues of $K_Z(\tau)$, satisfying
$$\int_0^T K_Z(\tau-s)\phi_i(s)\,ds = \lambda_i\phi_i(\tau),\quad 0\le\tau\le T.$$
Then the expansion coefficients $\{Z_i\}_{i=1}^{\infty}$ of $Z(t)$ with respect to the orthonormal functions $\{\phi_i(t)\}_{i=1}^{\infty}$ are uncorrelated. In addition, if $Z(t)$ is Gaussian, then $\{Z_i\}_{i=1}^{\infty}$ are independent Gaussian random variables.
Proof:
$$\begin{aligned}
E[Z_iZ_j] &= E\left[\int_0^T\phi_i(t)Z(t)\,dt\int_0^T\phi_j(s)Z(s)\,ds\right]\\
&= \int_0^T\int_0^T\phi_i(t)\phi_j(s)E[Z(t)Z(s)]\,dt\,ds\\
&= \int_0^T\int_0^T\phi_i(t)\phi_j(s)K_Z(s-t)\,dt\,ds\\
&= \int_0^T\phi_i(s)\left(\int_0^T K_Z(s-t)\phi_j(t)\,dt\right)ds\\
&= \int_0^T\phi_i(s)\bigl(\lambda_j\phi_j(s)\bigr)\,ds\\
&= \begin{cases}\lambda_i, & \text{if } i = j;\\ 0, & \text{if } i\ne j.\end{cases}
\end{aligned} \qquad\Box$$
The system can then be equivalently transformed to
$$Y_i = X_i+Z_i,\quad i = 1,2,3,\dots,$$
where
$$X_i = \int_0^T X(t)\phi_i(t)\,dt,\qquad Y_i = \int_0^T Y(t)\phi_i(t)\,dt,$$
and $\left\{Z_i = \int_0^T Z(t)\phi_i(t)\,dt\right\}_{i=1}^{\infty}$ are independent Gaussian with $E[Z_i^2] = \lambda_i$.

Since $Z(t) = \sum_{i=1}^{\infty} Z_i\phi_i(t)$ is assumed stationary, we have
$$\begin{aligned}
K_Z(\tau) &= E[Z(t)Z(t+\tau)]\\
&= E\left[\sum_{i=1}^{\infty} Z_i\phi_i(t)\sum_{k=1}^{\infty} Z_k\phi_k(t+\tau)\right]\\
&= \sum_{i=1}^{\infty}\sum_{k=1}^{\infty} E[Z_iZ_k]\,\phi_i(t)\phi_k(t+\tau)\\
&= \sum_{i=1}^{\infty}\lambda_i\phi_i(t)\phi_i(t+\tau).
\end{aligned}$$
Thus,
$$K_Z(0) = E[Z^2(t)] = \sum_{i=1}^{\infty}\lambda_i\phi_i^2(t),$$
which implies
$$E[Z^2(t)] = \frac{1}{T}\int_0^T E[Z^2(t)]\,dt = \frac{1}{T}\sum_{i=1}^{\infty}\lambda_i\int_0^T\phi_i^2(t)\,dt = \frac{1}{T}\sum_{i=1}^{\infty}\lambda_i.$$

Using a technique similar to the water-filling power allocation scheme, we have that the maximizer of $\max_{X^n:\,\mathrm{tr}(K_X)\le nP} I(X^n;Y^n)$ is a zero-mean independent Gaussian distributed $X^n$ with $E[X_i^2] = P_i$ satisfying
$$P_i = \max\{0,\theta-\lambda_i\}\quad\text{and}\quad\sum_{i=1}^{n} P_i = \text{constant (the total power budget).}$$
Consequently,
$$\begin{aligned}
\max_{X^n:\,\mathrm{tr}(K_X)\le nP} I(X^n;Y^n)
&= \frac{1}{2}\log_2\left[(2\pi e)^n\det(K_X+K_Z)\right]-\frac{1}{2}\log_2\left[(2\pi e)^n\det(K_Z)\right]\\
&= \sum_{i=1}^{n}\frac{1}{2}\log_2\left(1+\frac{P_i}{\lambda_i}\right),
\end{aligned}$$
which is achieved by zero-mean independent Gaussian $\{X_i\}_{i=1}^{\infty}$ with $E[X_i^2] = P_i$.

Since $X(t) = \sum_{i=1}^{\infty} X_i\phi_i(t)$, we have
$$\begin{aligned}
K_X(\tau) &= E[X(t)X(t+\tau)]\\
&= E\left[\sum_{i=1}^{\infty} X_i\phi_i(t)\sum_{k=1}^{\infty} X_k\phi_k(t+\tau)\right]\\
&= \sum_{i=1}^{\infty}\sum_{k=1}^{\infty} E[X_iX_k]\,\phi_i(t)\phi_k(t+\tau)\\
&= \sum_{i=1}^{\infty} P_i\phi_i(t)\phi_i(t+\tau).
\end{aligned}$$
Thus,
$$K_X(0) = E[X^2(t)] = \sum_{i=1}^{\infty} P_i\phi_i^2(t),$$
which implies
$$P\ge E[X^2(t)] = \frac{1}{T}\int_0^T E[X^2(t)]\,dt = \frac{1}{T}\sum_{i=1}^{\infty} P_i\int_0^T\phi_i^2(t)\,dt = \frac{1}{T}\sum_{i=1}^{\infty} P_i.$$
Key Notes

- Differential entropy and its operational meaning in quantization efficiency
- Maximal differential entropy of the Gaussian source, among all sources with the same mean and variance
- The mismatch in properties of entropy and differential entropy
- Relative entropy and mutual information of continuous systems
- Capacity-cost function and its proof
- Calculation of the capacity-cost function for specific channels:
  - Memoryless additive Gaussian channels
  - Uncorrelated and correlated parallel Gaussian channels
  - Water-filling scheme (graphical interpretation)
  - Gaussian band-limited waveform channels
  - Gaussian time-limited waveform channels
- Interpretation of entropy power (provides an upper bound on the capacity of non-Gaussian channels)
- Operational characteristics of the entropy-power inequality
