Volume II
by
Po-Ning Chen
List of Figures

2.1 The spectrums of {A_n}: ū(θ) = sup-spectrum of A_n; u(θ) = inf-spectrum of A_n; Ū_{1⁻} = lim_{δ↑1} Ū_δ . . . . 8
2.2 The spectrum h̄(θ) for Example 2.7 . . . . 22
2.3 The spectrum ī(θ) for Example 2.7 . . . . 22
2.4 The limiting spectrums of (1/n) h_{X^n}(Z^n) for Example 2.8 . . . . 23
2.5 The possible limiting spectrums of (1/n) i_{X^n,Y^n}(X^n; Y^n) for Example 2.8 . . . . 24
3.1 Behavior of the probability of block decoding error as block length n goes to infinity for an arbitrary source X . . . . 35
3.2 Illustration of the generalized AEP Theorem: the sets T_n(δ; γ), T_n[H̄(X) + γ] and T_n[H(X) − γ]
4.1 {X_t}_{t∈I} (I = (0, 1)) is an independent random process with P_{X_t}(0) = 1 − P_{X_t}(1) = t, and is also independent of the selector Z, where X_t is outputted if Z = t. The source generator at each time instance is temporally independent . . . . 72
4.2 The ultimate CDF of −(1/n) log P_{X^n}(X^n): Pr{h_b(Z) ≤ t} . . . . 74
5.1 The ultimate CDFs of −(1/n) log P_{N^n}(N^n) . . . . 89
5.2 The ultimate CDF of the normalized information density for Example 5.14 (Case B) . . . . 91
5.3 The communication system . . . . 94
5.4 The simulated communication system . . . . 94
7.1 λ̄_{Z,f(Z)}(D + γ) > δ implies sup{θ : λ̄_{Z,f(Z)}(θ) ≤ δ} ≤ D + γ . . . . 115
7.2 The CDF of (1/n) ρ_n(Z^n, f_n(Z^n)) for the probability-of-error distortion measure . . . . 117
8.1 The geometric meaning of Sanov's theorem . . . . 127
8.2 The divergence view on hypothesis testing . . . . 134
8.3 Function of (/6)u h(u) . . . . 168
8.4 The Berry-Esseen constant as a function of the sample size n. The sample size n is plotted in log-scale . . . . 170
9.1 BSC with crossover probability ε and input distribution (p, 1 − p) . . . . 187
9.2 Random coding exponent for the BSC with crossover probability 0.2. Also plotted is s* = arg sup_{0≤s≤1} [−sR + E_0(s)]. R_cr = 0.056633 . . . . 188
9.3 Expurgated exponent (solid line) and random coding exponent (dashed line) for the BSC with crossover probability 0.2 (over the range (0, 0.192745)) . . . . 195
9.4 Expurgated exponent (solid line) and random coding exponent (dashed line) for the BSC with crossover probability 0.2 (over the range (0, 0.006)) . . . . 196
9.5 Partitioning exponent (thick line), random coding exponent (thin line) and expurgated exponent (thin line) for the BSC with crossover probability 0.2 . . . . 201
9.6 (a) The shaded area is 𝒜_m^c; (b) the shaded area is 𝒜_{m,m′}^c . . . . 247
9.7 The asymptotic behavior of ρ̄_X(R) for R_p(X) < R < R_0(X) . . . . 248
9.8 General curve of ρ̄_X(R) . . . . 248
9.9 Function of sup_{s>0} [−sR − s log(2s(1 − e^{−1/(2s)}))] . . . . 249
9.10 Function of sup_{s>0} [−sR − s log((2 + e^{−1/s})/4)] . . . . 249
10.1 The multi-access channel . . . . 253
10.2 Distributed detection with n senders. Each observation Y_i may come from one of two categories. The final decision T ∈ {H_0, H_1} . . . . 254
10.3 Distributed detection in o_n . . . . 260
10.4 Bayes error probabilities associated with g and g̃ . . . . 266
10.5 Upper and lower bounds on e_NP(·) in Case B . . . . 275
Chapter 1
Introduction
This volume covers advanced topics in information theory. The mathematical background on which these topics are based can be found in Appendices A and B of Volume I.
1.1 Notations
Here, we clarify some of the easily confused notations used in Volume II of the
lecture notes.
For a random variable X, we use P_X to denote its distribution. For convenience, we will use interchangeably the two expressions for the probability of X = x, i.e., Pr{X = x} and P_X(x). Similarly, for the probability of a set characterized through an inequality, such as f(x) < a, its probability mass will be expressed by either

P_X{x ∈ 𝒳 : f(x) < a}

or

Pr{f(X) < a}.

In the second expression, we view f(X) as a new random variable defined through X and a function f(·).
Obviously, the above expressions can be applied to any legitimate function f(·) defined over 𝒳, including any probability function P_X(·) (or log P_X(·)) of a random variable X. Therefore, the next two expressions denote the probability of f(x) = P_X(x) < a evaluated under the distribution P_X:

P_X{x ∈ 𝒳 : f(x) < a} = P_X{x ∈ 𝒳 : P_X(x) < a}

and

Pr{f(X) < a} = Pr{P_X(X) < a}.
As a result, if we write

P_{X,Y}{ (x, y) ∈ 𝒳 × 𝒴 : log [ P_{X,Y}(x, y) / (P_X(x) P_Y(y)) ] < a } = Pr{ log [ P_{X,Y}(X, Y) / (P_X(X) P_Y(Y)) ] < a },

it means that we define a new function

f(x, y) ≜ log [ P_{X,Y}(x, y) / (P_X(x) P_Y(y)) ]

in terms of the joint distribution P_{X,Y} and its two marginal distributions, and are concerned with the probability of f(x, y) < a, where x and y have joint distribution P_{X,Y}.
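The equivalence of the set-based and the random-variable expressions is easy to check numerically. In the sketch below, the distribution of X and the function f are arbitrary illustrative choices (here f(x) = −log P_X(x), one of the legitimate functions mentioned above):

```python
import math
import random

# Illustrative distribution of X and a legitimate function f on the alphabet.
P_X = {0: 0.5, 1: 0.25, 2: 0.25}
f = lambda x: -math.log(P_X[x])      # here f(x) = -log P_X(x)
a = 1.0

# Set-based expression: P_X{x : f(x) < a}, summing masses over the set.
set_based = sum(p for x, p in P_X.items() if f(x) < a)

# Random-variable expression: Monte-Carlo estimate of Pr{f(X) < a},
# viewing f(X) as a new random variable defined through X.
random.seed(0)
draws = random.choices(list(P_X), weights=list(P_X.values()), k=100_000)
pr_estimate = sum(f(x) < a for x in draws) / len(draws)

print(set_based)                     # exactly 0.5: only x = 0 has f(x) < 1
print(abs(pr_estimate - set_based) < 0.01)
```

Both expressions name the same number; the second is merely an expectation over realizations of X.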
Chapter 2
Generalized Information Measures for
Arbitrary System Statistics
In Volume I of the lecture notes, we show that the entropy, defined by

H(X) ≜ −Σ_{x∈𝒳} P_X(x) log P_X(x) = E[ −log P_X(X) ] nats,

of a discrete random variable X is a measure of the average amount of uncertainty in X. An extension of this definition to a sequence of random variables X_1, X_2, …, X_n, … is the entropy rate, which is given by

lim_{n→∞} (1/n) H(X^n) = lim_{n→∞} (1/n) E[ −log P_{X^n}(X^n) ],

assuming the limit exists. The above quantities have an operational significance established via Shannon's coding theorems when the stochastic systems under consideration satisfy certain regularity conditions, such as stationarity and ergodicity [3, 5]. However, in more complicated situations, such as when the systems are non-stationary or time-varying in nature, these information rates may no longer exist and lose their operational significance. This results in the need to establish new entropy measures which appropriately characterize the operational limits of arbitrary stochastic systems.
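Under the regularity conditions the limit exists; for instance, for an i.i.d. source the block entropy is exactly n times the single-letter entropy. A brute-force check (the binary distribution below is an arbitrary illustrative choice):

```python
import math
from itertools import product

P = [0.7, 0.3]                                   # illustrative binary source
H1 = -sum(p * math.log(p) for p in P)            # single-letter entropy (nats)

def block_entropy(n):
    """H(X^n) = -sum over x^n of P(x^n) log P(x^n), by brute force."""
    total = 0.0
    for xs in product(range(len(P)), repeat=n):
        pr = math.prod(P[x] for x in xs)
        total -= pr * math.log(pr)
    return total

# For an i.i.d. source, (1/n) H(X^n) = H(X) for every n, so the entropy
# rate trivially exists and equals H(X).
print(all(abs(block_entropy(n) / n - H1) < 1e-9 for n in (1, 2, 3, 4)))
```

It is precisely when such per-block structure is absent that the limit can fail to exist, motivating the measures developed in this chapter.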
Let us begin with the model of arbitrary system statistics. In general, there are two indices for observations: a time index and a space index. When a sequence of observations is denoted by X_1, X_2, …, X_n, …, the subscript i of X_i can be treated as either a time index or a space index, but not both. Hence, when a sequence of observations is a function of both time and space, the notation X_1, X_2, …, X_n, … is by no means sufficient; therefore, a new model for a time-varying multiple-sensor system, such as

X_1^(n), X_2^(n), …, X_t^(n), …,

where t is the time index and n is the space or position index (or vice versa), becomes significant.
When blockwise compression of such a source (with block length n) is considered, the same question as for the compression of an i.i.d. source arises:

what is the minimum compression rate (bits per source sample) for which the error can be made arbitrarily small as the block length goes to infinity?   (2.0.1)
To answer the question, information theorists have to find a sequence of data compression codes, one for each block length n, and investigate whether the decompression error goes to zero as n approaches infinity. However, unlike the simple source models considered in Volume I, the arbitrary source for each block length n may exhibit distinct statistics at each respective sample, i.e.,

n = 1 :  X_1^(1)
n = 2 :  X_1^(2), X_2^(2)
n = 3 :  X_1^(3), X_2^(3), X_3^(3)
n = 4 :  X_1^(4), X_2^(4), X_3^(4), X_4^(4)    (2.0.2)
⋮
and the statistics of X_1^(4) could be different from those of X_1^(1), X_1^(2) and X_1^(3). Since it is the most general model for the question in (2.0.1), and the system statistics can be arbitrarily defined, it is named the arbitrary statistics system.
In notation, the triangular array of random variables in (2.0.2) is often denoted by a boldface letter as

X ≜ {X^n}_{n=1}^∞,

where X^n ≜ (X_1^(n), X_2^(n), …, X_n^(n)); for convenience, the above statement is sometimes abbreviated as

X ≜ { X^n = (X_1^(n), X_2^(n), …, X_n^(n)) }_{n=1}^∞.
In this chapter, we will first introduce a new concept for defining information measures for arbitrary system statistics, and then analyze their algebraic properties in detail. In the next chapter, we will utilize the new measures to establish general source coding theorems for arbitrary finite-alphabet sources.
2.1 Spectrum and Quantile

Definition 2.1 (inf/sup-spectrum) If {A_n}_{n=1}^∞ is a sequence of random variables, then its inf-spectrum u(·) and its sup-spectrum ū(·) are defined by

u(θ) ≜ liminf_{n→∞} Pr{A_n ≤ θ}

and

ū(θ) ≜ limsup_{n→∞} Pr{A_n ≤ θ}.

In other words, u(θ) and ū(θ) are respectively the liminf and the limsup of the cumulative distribution function (CDF) of A_n. Note that, by definition, the CDF of A_n, namely Pr{A_n ≤ θ}, is non-decreasing and right-continuous; however, for u(θ) and ū(θ), only the non-decreasing property remains.¹

Definition 2.2 (quantile of inf/sup-spectrum) For any 0 ≤ δ ≤ 1, the quantiles² U_δ and Ū_δ of the sup-spectrum and the inf-spectrum are defined by

U_δ ≜ sup{θ : ū(θ) ≤ δ}

and

Ū_δ ≜ sup{θ : u(θ) ≤ δ},

respectively. It follows from the above definitions that U_δ and Ū_δ are right-continuous and non-decreasing in δ. Note that the supremum of an empty set is defined to be −∞.

¹ It is pertinent to also point out that even if we do not require right-continuity as a fundamental property of a CDF, the spectrums u(·) and ū(·) are not necessarily legitimate CDFs of (conventional real-valued) random variables, since there might exist cases where the probability mass escapes to infinity (cf. [1, pp. 346]). A necessary and sufficient condition for u(·) and ū(·) to be conventional CDFs (without requiring right-continuity) is that the sequence of distribution functions of A_n is tight [1, pp. 346]. Tightness is actually guaranteed if the alphabet of A_n is finite.
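For finite n, the spectrum and its quantile can be explored empirically by estimating the CDF of A_n. The sketch below (with A_n a normalized Bernoulli sum, an arbitrary illustrative choice of ours) approximates sup{θ : CDF(θ) ≤ δ} on a grid:

```python
import math
import random

random.seed(1)
n, trials = 1000, 2000
# A_n = (1/n) * (number of successes among n Bernoulli(0.3) trials).
samples = [sum(random.random() < 0.3 for _ in range(n)) / n
           for _ in range(trials)]

def cdf(theta):                       # empirical estimate of Pr{A_n <= theta}
    return sum(a <= theta for a in samples) / len(samples)

def quantile(delta):                  # sup{theta on a grid : cdf(theta) <= delta}
    grid = [i / 1000 for i in range(1001)]
    ok = [t for t in grid if cdf(t) <= delta]
    return max(ok) if ok else -math.inf

# By the law of large numbers A_n concentrates at 0.3, so the quantile sits
# just below 0.3 for small delta and just above it for delta close to 1.
print(quantile(0.05), quantile(0.95))
```

When the A_n concentrate, the quantile as a function of δ is nearly flat; the interesting behavior of U_δ and Ū_δ arises when the distributions of A_n keep oscillating, as in the examples later in this chapter.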
Based on the above definitions, the liminf in probability U of {A_n}_{n=1}^∞ [4], which is defined as the largest extended real number such that for all ε > 0,

lim_{n→∞} Pr[A_n ≤ U − ε] = 0,

satisfies⁴

U = lim_{δ↓0} U_δ = U_0.
² Generally speaking, one can define the quantile in four different ways:

Ū_δ ≜ sup{θ : liminf_{n→∞} Pr[A_n ≤ θ] ≤ δ},
Ū_δ^− ≜ sup{θ : liminf_{n→∞} Pr[A_n ≤ θ] < δ},
Ū_δ^+ ≜ sup{θ : liminf_{n→∞} Pr[A_n < θ] ≤ δ},
Ū_δ^{+−} ≜ sup{θ : liminf_{n→∞} Pr[A_n < θ] < δ}.

The general relation of these four quantities is Ū_δ^− ≤ Ū_δ ≤ Ū_δ^+ and Ū_δ^− ≤ Ū_δ^{+−} ≤ Ū_δ^+. In fact, Ū_δ = Ū_δ^+. To see this, suppose Ū_δ^+ > Ū_δ and let η ≜ Ū_δ^+ − Ū_δ > 0. Then

liminf_{n→∞} Pr[A_n ≤ Ū_δ + η/4] ≤ liminf_{n→∞} Pr[A_n < Ū_δ + η/2] ≤ δ,

which implies Ū_δ ≥ Ū_δ + η/4 and violates the definition of Ū_δ. A similar argument gives Ū_δ^− = Ū_δ^{+−}. In the lecture notes, we will use these equivalent forms of Ū_δ interchangeably for convenience. The same conclusions hold for the quantiles U_δ of the sup-spectrum; in particular, U_0 = U.
Also, the limsup in probability Ū of {A_n}_{n=1}^∞ (cf. [4]), defined as the smallest extended real number such that for all ε > 0,

lim_{n→∞} Pr[A_n ≥ Ū + ε] = 0,

is exactly⁵

Ū = lim_{δ↑1} Ū_δ,   where the limit is taken over δ ∈ [0, 1).

Remark that U_δ ≤ Ū_δ always; moreover, when U_δ = Ū_δ for all δ in [0, 1], the sequence of random variables A_n converges in distribution to a random variable A, provided the sequence of A_n is tight.

For a better understanding of the quantities defined above, we depict them in Figure 2.1.
2.2 Properties of quantile
Lemma 2.3 Consider two random sequences {A_n}_{n=1}^∞ and {B_n}_{n=1}^∞. Let ū(·) and u(·) be respectively the sup- and inf-spectrums of {A_n}_{n=1}^∞. Similarly, let v̄(·) and v(·) denote respectively the sup- and inf-spectrums of {B_n}_{n=1}^∞. Define the quantiles
⁴ The equality of lim_{δ↓0} U_δ and U can be proved by contradiction. Suppose lim_{δ↓0} U_δ ≥ U + ε for some ε > 0. Then ū(U + ε/2) ≤ δ for arbitrarily small δ > 0, which immediately implies ū(U + ε/2) = 0, contradicting the definition of U.
⁵ Since 1 = lim_{n→∞} Pr{A_n < Ū + ε} ≤ liminf_{n→∞} Pr{A_n ≤ Ū + ε} = u(Ū + ε), it is straightforward that Ū_δ ≤ Ū + ε for every δ < 1, and hence lim_{δ↑1} Ū_δ ≤ Ū. The equality of Ū and lim_{δ↑1} Ū_δ can be proved by contradiction. Suppose Ū − lim_{δ↑1} Ū_δ ≥ ε for some ε > 0. Then u(Ū − ε/2) = 1. Accordingly, by the minimality of Ū,

1 > liminf_{n→∞} Pr{A_n < Ū − ε/4} ≥ liminf_{n→∞} Pr{A_n ≤ Ū − ε/2} = u(Ū − ε/2) = 1,

and we obtain the desired contradiction.

Figure 2.1: The spectrums of {A_n}_{n=1}^∞: ū(θ) = sup-spectrum of A_n; u(θ) = inf-spectrum of A_n; Ū_{1⁻} = lim_{δ↑1} Ū_δ.
U_δ and Ū_δ for {A_n}_{n=1}^∞; also define the quantiles V_δ and V̄_δ for {B_n}_{n=1}^∞. Now let (u + v)‾(·) and (u + v)(·) denote the sup- and inf-spectrums of the sum sequence {A_n + B_n}_{n=1}^∞, i.e.,

(u + v)‾(θ) ≜ limsup_{n→∞} Pr{A_n + B_n ≤ θ}

and

(u + v)(θ) ≜ liminf_{n→∞} Pr{A_n + B_n ≤ θ}.

Again, define (U + V)_δ and (U + V)‾_δ as the corresponding quantiles. Then the following statements hold.

1. U_δ and Ū_δ are non-decreasing in δ, and U_δ ≤ Ū_δ.

2. lim_{δ↓0} U_δ = U_0 and lim_{δ↓0} Ū_δ = Ū_0.

3. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

(U + V)_{δ+γ} ≥ U_δ + V_γ,   (2.2.1)

and

(U + V)‾_{δ+γ} ≥ U_δ + V̄_γ.   (2.2.2)

4. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

(U + V)_δ ≤ U_{δ+γ} + V̄_{(1−γ)},   (2.2.3)

and

(U + V)‾_δ ≤ Ū_{δ+γ} + V̄_{(1−γ)}.   (2.2.4)
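Inequality (2.2.1) has a finite-sample analogue for empirical quantiles, which can be spot-checked numerically. The Gaussian samples below are an arbitrary illustrative choice of ours:

```python
import math
import random

random.seed(2)
N = 5000
A = [random.gauss(0, 1) for _ in range(N)]
B = [random.gauss(1, 2) for _ in range(N)]
S = [a + b for a, b in zip(A, B)]      # the "sum sequence"

def quantile(xs, delta):
    """sup{theta : empirical CDF of xs at theta <= delta} (finite-sample)."""
    xs = sorted(xs)
    k = int(delta * len(xs))           # the k-th smallest has CDF k/N <= delta
    return xs[k - 1] if k >= 1 else -math.inf

delta, gamma = 0.2, 0.3
lhs = quantile(A, delta) + quantile(B, gamma)
rhs = quantile(S, delta + gamma)
print(lhs <= rhs)                      # mirrors (U+V)_{delta+gamma} >= U_delta + V_gamma
```

The key step is the same union bound used in the proof below: the event {A + B ≤ a + b} is contained in {A ≤ a} ∪ {B ≤ b}.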
Proof: Property 1 follows directly from the definitions of U_δ and Ū_δ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing.

Property 2 can be proved by contradiction as follows. Suppose lim_{δ↓0} U_δ > U_0 + ε for some ε > 0. Then for any δ > 0,

ū(U_0 + ε/2) ≤ δ.

Since the above inequality holds for every δ > 0, and ū(·) is a non-negative function, we obtain ū(U_0 + ε/2) = 0, which contradicts the definition of U_0. We can prove lim_{δ↓0} Ū_δ = Ū_0 in a similar fashion.
To show (2.2.1), we observe that for ε > 0,

limsup_{n→∞} Pr{A_n + B_n ≤ U_δ + V_γ − 2ε}
 ≤ limsup_{n→∞} ( Pr{A_n ≤ U_δ − ε} + Pr{B_n ≤ V_γ − ε} )
 ≤ limsup_{n→∞} Pr{A_n ≤ U_δ − ε} + limsup_{n→∞} Pr{B_n ≤ V_γ − ε}
 ≤ δ + γ,

which, by the definition of (U + V)_{δ+γ}, yields

(U + V)_{δ+γ} ≥ U_δ + V_γ − 2ε.

The proof is completed by noting that ε can be made arbitrarily small.
Similarly, we note that for ε > 0,

liminf_{n→∞} Pr{A_n + B_n ≤ U_δ + V̄_γ − 2ε}
 ≤ liminf_{n→∞} ( Pr{A_n ≤ U_δ − ε} + Pr{B_n ≤ V̄_γ − ε} )
 ≤ limsup_{n→∞} Pr{A_n ≤ U_δ − ε} + liminf_{n→∞} Pr{B_n ≤ V̄_γ − ε}
 ≤ δ + γ,

which, by the definition of (U + V)‾_{δ+γ} and arbitrarily small ε, proves (2.2.2).
To show (2.2.3), we first observe that (2.2.3) trivially holds when γ = 1 (and δ = 0). It remains to prove its validity for γ < 1. Remark from (2.2.1), applied to the sequences {A_n + B_n} and {−B_n}, that

(U + V)_δ + (−V)_γ ≤ (U + V − V)_{δ+γ} = U_{δ+γ}.

Hence,

(U + V)_δ ≤ U_{δ+γ} − (−V)_γ.

(Note that the case γ = 1 is not allowed here, because it results in U_1 = (−V)_1 = ∞, and the subtraction between two infinite terms is undefined. That is why we need to exclude the case γ = 1 in the subsequent proof.) The proof is then completed by showing that

(−V)_γ ≥ −V̄_{(1−γ)}.   (2.2.5)

By definition, the sup-spectrum of {−B_n} satisfies

(−v)‾(θ) ≜ limsup_{n→∞} Pr{−B_n ≤ θ} = 1 − liminf_{n→∞} Pr{B_n < −θ}.

Then

V̄_{(1−γ)} ≜ sup{θ : v(θ) ≤ 1 − γ}
 = sup{θ : liminf_{n→∞} Pr[B_n ≤ θ] ≤ 1 − γ}
 = sup{θ : liminf_{n→∞} Pr[B_n < θ] ≤ 1 − γ}   (cf. footnote 2)
 = sup{−θ : liminf_{n→∞} Pr[B_n < −θ] ≤ 1 − γ}
 = sup{−θ : 1 − limsup_{n→∞} Pr[−B_n ≤ θ] ≤ 1 − γ}
 = sup{−θ : limsup_{n→∞} Pr[−B_n ≤ θ] ≥ γ}
 = −inf{θ : limsup_{n→∞} Pr[−B_n ≤ θ] ≥ γ}
 ≥ −sup{θ : limsup_{n→∞} Pr[−B_n ≤ θ] ≤ γ}
 = −sup{θ : (−v)‾(θ) ≤ γ}
 = −(−V)_γ.
Finally, to show (2.2.4), we again note that it is trivially true for γ = 1. We then observe from (2.2.2) that

(U + V)‾_δ + (−V)_γ ≤ (U + V − V)‾_{δ+γ} = Ū_{δ+γ}.

Hence,

(U + V)‾_δ ≤ Ū_{δ+γ} − (−V)_γ.

Using (2.2.5), we have the desired result. □
2.3 Generalized information measures

In Definitions 2.1 and 2.2, if we let the random variable A_n equal the normalized entropy density⁶

(1/n) h_{X^n}(X^n) ≜ −(1/n) log P_{X^n}(X^n)

of an arbitrary source

X = { X^n = (X_1^(n), X_2^(n), …, X_n^(n)) }_{n=1}^∞,
we obtain two generalized entropy measures for X: the sup-entropy rate H̄(X) and the inf-entropy rate H(X).

In concept, we can imagine that if the random variable (1/n) h_{X^n}(X^n) exhibits a limiting distribution, then the sup-entropy rate is the right margin of the support of that limiting distribution. For example, suppose that the limiting distribution of (1/n) h_{X^n}(X^n) is positive over (−2, 2), and zero otherwise. Then H̄(X) = 2. Similarly, the inf-entropy rate is the left margin of the support of the limiting distribution, which is −2 for the same example.
Analogously, for an arbitrary channel

W = (Y|X) = { W^n = (W_1^(n), …, W_n^(n)) }_{n=1}^∞,

or more specifically P_W = P_{Y|X} = { P_{Y^n|X^n} }_{n=1}^∞, with input X and output Y, if we replace A_n in Definitions 2.1 and 2.2 by the normalized information density

(1/n) i_{X^n W^n}(X^n; Y^n) = (1/n) i_{X^n,Y^n}(X^n; Y^n) ≜ (1/n) log [ P_{X^n,Y^n}(X^n, Y^n) / (P_{X^n}(X^n) P_{Y^n}(Y^n)) ],

we get the inf/sup-information rates, denoted respectively by I(X; Y) and
⁶ The random variable h_{X^n}(X^n) ≜ −log P_{X^n}(X^n) is named the entropy density. Hence, the normalized entropy density is equal to h_{X^n}(X^n) divided (or normalized) by the block length n.
Ī(X; Y). Similarly, for an arbitrary source X and another arbitrary source X̂ defined over the same alphabet, we can replace A_n in Definitions 2.1 and 2.2 by the normalized log-likelihood ratio

(1/n) d_{X^n}(X^n ‖ X̂^n) ≜ (1/n) log [ P_{X^n}(X^n) / P_{X̂^n}(X^n) ]

to obtain the inf/sup-divergence rates, denoted respectively by D(X ‖ X̂) and D̄(X ‖ X̂), with

D_δ(X ‖ X̂) = quantile of the sup-spectrum of (1/n) d_{X^n}(X^n ‖ X̂^n),
D̄_δ(X ‖ X̂) = quantile of the inf-spectrum of (1/n) d_{X^n}(X^n ‖ X̂^n).

The above replacements are summarized in Tables 2.1-2.3.
2.4 Properties of generalized information measures

In this section, we introduce the properties of the general information measures defined in the previous section. We begin with the generalization of a simple property: I(X; Y) = H(Y) − H(Y|X).

By taking δ = 0 and letting γ ↓ 0 in (2.2.1) and (2.2.3), we obtain

(U + V) ≥ U_0 + lim_{γ↓0} V_γ = U + V

and

(U + V) ≤ lim_{γ↓0} U_γ + lim_{γ↓0} V̄_{(1−γ)} = U + V̄,

which means that the liminf in probability of the sequence of random variables {A_n + B_n} is upper-bounded by the liminf in probability of A_n plus the limsup in probability of B_n, and is lower-bounded by the sum of the liminfs in probability of A_n and B_n. This fact is used in [5] to show that

I(X; Y) + H(Y|X) ≤ H(Y) ≤ I(X; Y) + H̄(Y|X),
12
Entropy Measures

system: arbitrary source X
normalized entropy density: (1/n) h_{X^n}(X^n) ≜ −(1/n) log P_{X^n}(X^n)
entropy sup-spectrum: h̄(θ) ≜ limsup_{n→∞} Pr{ (1/n) h_{X^n}(X^n) ≤ θ }
entropy inf-spectrum: h(θ) ≜ liminf_{n→∞} Pr{ (1/n) h_{X^n}(X^n) ≤ θ }
δ-inf-entropy rate: H_δ(X) ≜ sup{θ : h̄(θ) ≤ δ}
δ-sup-entropy rate: H̄_δ(X) ≜ sup{θ : h(θ) ≤ δ}
sup-entropy rate: H̄(X) ≜ lim_{δ↑1} H̄_δ(X)
inf-entropy rate: H(X) ≜ H_0(X)

Table 2.1: Generalized entropy measures, where δ ∈ [0, 1].
or equivalently,

H(Y) − H̄(Y|X) ≤ I(X; Y) ≤ H(Y) − H(Y|X).

Other properties of the generalized information measures are summarized in the next lemma.
Lemma 2.4 For a finite alphabet 𝒳, the following statements hold.

1. H_δ(X) ≥ 0 and H̄_δ(X) ≥ 0. (The same non-negativity holds for Ī_δ(X; Y), I_δ(X; Y), D̄_δ(X ‖ X̂), and D_δ(X ‖ X̂).)

2. I_δ(X; Y) = I_δ(Y; X) and Ī_δ(X; Y) = Ī_δ(Y; X).

3. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1,

I_δ(X; Y) ≤ H_{δ+γ}(Y) − H_γ(Y|X),   (2.4.1)

I_δ(X; Y) ≤ H̄_{δ+γ}(Y) − H̄_γ(Y|X),   (2.4.2)
Mutual Information Measures

system: arbitrary channel P_W = P_{Y|X} with input X and output Y
normalized information density: (1/n) i_{X^n W^n}(X^n; Y^n) ≜ (1/n) log [ P_{X^n,Y^n}(X^n, Y^n) / (P_{X^n}(X^n) P_{Y^n}(Y^n)) ]
information sup-spectrum: ī(θ) ≜ limsup_{n→∞} Pr{ (1/n) i_{X^n W^n}(X^n; Y^n) ≤ θ }
information inf-spectrum: i(θ) ≜ liminf_{n→∞} Pr{ (1/n) i_{X^n W^n}(X^n; Y^n) ≤ θ }
δ-inf-information rate: I_δ(X; Y) ≜ sup{θ : ī(θ) ≤ δ}
δ-sup-information rate: Ī_δ(X; Y) ≜ sup{θ : i(θ) ≤ δ}
sup-information rate: Ī(X; Y) ≜ lim_{δ↑1} Ī_δ(X; Y)
inf-information rate: I(X; Y) ≜ I_0(X; Y)

Table 2.2: Generalized mutual information measures, where δ ∈ [0, 1].
Ī_γ(X; Y) ≤ H̄_{δ+γ}(Y) − H_δ(Y|X),   (2.4.3)

I_{δ+γ}(X; Y) ≥ H_δ(Y) − H̄_{(1−γ)}(Y|X),   (2.4.4)

and

Ī_{δ+γ}(X; Y) ≥ H̄_δ(Y) − H̄_{(1−γ)}(Y|X).   (2.4.5)

(Note that the case of (δ, γ) = (1, 0) holds for (2.4.1) and (2.4.2), and the case of (δ, γ) = (0, 1) holds for (2.4.3), (2.4.4) and (2.4.5).)
4. 0 ≤ H_δ(X) ≤ H̄_δ(X) ≤ H̄(X) ≤ log |𝒳|.

5. I_δ(X, Y; Z) ≥ I_δ(X; Z) and Ī_δ(X, Y; Z) ≥ Ī_δ(X; Z).

Divergence Measures

system: arbitrary sources X and X̂ defined over a common alphabet
normalized log-likelihood ratio: (1/n) d_{X^n}(X^n ‖ X̂^n) ≜ (1/n) log [ P_{X^n}(X^n) / P_{X̂^n}(X^n) ]
divergence sup-spectrum: d̄(θ) ≜ limsup_{n→∞} Pr{ (1/n) d_{X^n}(X^n ‖ X̂^n) ≤ θ }
divergence inf-spectrum: d(θ) ≜ liminf_{n→∞} Pr{ (1/n) d_{X^n}(X^n ‖ X̂^n) ≤ θ }
δ-inf-divergence rate: D_δ(X ‖ X̂) ≜ sup{θ : d̄(θ) ≤ δ}
δ-sup-divergence rate: D̄_δ(X ‖ X̂) ≜ sup{θ : d(θ) ≤ δ}
sup-divergence rate: D̄(X ‖ X̂) ≜ lim_{δ↑1} D̄_δ(X ‖ X̂)
inf-divergence rate: D(X ‖ X̂) ≜ D_0(X ‖ X̂)

Table 2.3: Generalized divergence measures, where δ ∈ [0, 1].
Proof: Property 1 holds because

Pr{ −(1/n) log P_{X^n}(X^n) < 0 } = 0

and because, for every γ > 0,

Pr{ (1/n) log [ P_{X^n}(X^n) / P_{X̂^n}(X^n) ] < −γ }
 = P_{X^n}{ x^n ∈ 𝒳^n : (1/n) log [ P_{X^n}(x^n) / P_{X̂^n}(x^n) ] < −γ }
 = Σ_{x^n ∈ 𝒳^n : P_{X^n}(x^n) < P_{X̂^n}(x^n) e^{−nγ}}  P_{X^n}(x^n)
 ≤ Σ_{x^n ∈ 𝒳^n : P_{X^n}(x^n) < P_{X̂^n}(x^n) e^{−nγ}}  P_{X̂^n}(x^n) e^{−nγ}
 ≤ e^{−nγ} Σ_{x^n ∈ 𝒳^n} P_{X̂^n}(x^n) = e^{−nγ},   (2.4.6)

and, by following the same procedure as in (2.4.6),

Pr{ (1/n) log [ P_{X^n,Y^n}(X^n, Y^n) / (P_{X^n}(X^n) P_{Y^n}(Y^n)) ] < −γ } ≤ e^{−nγ}.
Property 2 is an immediate consequence of the definition.

To show the inequalities in property 3, we first remark that

(1/n) h_{Y^n}(Y^n) = (1/n) i_{X^n,Y^n}(X^n; Y^n) + (1/n) h_{X^n,Y^n}(Y^n|X^n),

where

(1/n) h_{X^n,Y^n}(Y^n|X^n) ≜ −(1/n) log P_{Y^n|X^n}(Y^n|X^n).

With this fact, and for 0 < δ ≤ 1, 0 < γ < 1 and δ + γ ≤ 1, (2.4.1) follows directly from (2.2.1); (2.4.2) and (2.4.3) follow from (2.2.2); (2.4.4) follows from (2.2.3); and (2.4.5) follows from (2.2.4). (Note that in (2.4.1), if δ = 0 and γ = 1, then the right-hand side would be a difference between two infinite terms, which is undefined; hence, such a case is excluded by the condition γ < 1. For the same reason, we also exclude the cases of δ = 0 and γ = 1 elsewhere.) Now when 0 < δ ≤ 1 and γ = 0, we can confirm the validity of (2.4.1), (2.4.2) and (2.4.3), again by (2.2.1) and (2.2.2), and also examine the validity of (2.4.4) and (2.4.5) by directly substituting these values. The validity of (2.4.1) and (2.4.2) for (δ, γ) = (1, 0), and the validity of (2.4.3), (2.4.4) and (2.4.5) for (δ, γ) = (0, 1), can be checked by directly replacing δ and γ with the respective numbers.

Property 4 follows from the facts that H̄_δ(·) is non-decreasing in δ, H_δ(X) ≤ H̄_δ(X) ≤ H̄(X), and H̄(X) ≤ log |𝒳|. The last inequality can be proved as follows:

Pr{ (1/n) h_{X^n}(X^n) ≤ log |𝒳| + γ }
 = 1 − P_{X^n}{ x^n ∈ 𝒳^n : (1/n) log [ P_{X^n}(x^n) / (1/|𝒳|)^n ] < −γ }
 ≥ 1 − e^{−nγ},

where the last step can be obtained by the same procedure as in (2.4.6). Therefore, h(log |𝒳| + γ) = 1 for any γ > 0, which indicates H̄(X) ≤ log |𝒳|.
Property 5 can be proved using the fact that

(1/n) i_{X^n,Y^n,Z^n}(X^n, Y^n; Z^n) = (1/n) i_{X^n,Z^n}(X^n; Z^n) + (1/n) i_{X^n,Y^n,Z^n}(Y^n; Z^n | X^n).

By applying (2.2.1) with γ = 0, and observing I(Y; Z|X) ≥ 0, we obtain the desired result. □
Lemma 2.5 (data processing lemma) Fix δ ∈ [0, 1]. Suppose that, for every n, X_1^n and X_3^n are conditionally independent given X_2^n. Then

I_δ(X_1; X_3) ≤ I_δ(X_1; X_2).
Proof: By property 5 of Lemma 2.4, we get

I_δ(X_1; X_3) ≤ I_δ(X_1; X_2, X_3) = I_δ(X_1; X_2),

where the equality holds because, by the conditional independence of X_1^n and X_3^n given X_2^n,

(1/n) log [ P_{X_1^n,X_2^n,X_3^n}(x_1^n, x_2^n, x_3^n) / (P_{X_1^n}(x_1^n) P_{X_2^n,X_3^n}(x_2^n, x_3^n)) ] = (1/n) log [ P_{X_1^n,X_2^n}(x_1^n, x_2^n) / (P_{X_1^n}(x_1^n) P_{X_2^n}(x_2^n)) ].

□
Lemma 2.6 (optimality of independent inputs) Fix δ ∈ [0, 1). Consider a finite-alphabet channel with

P_{W^n}(y^n|x^n) = P_{Y^n|X^n}(y^n|x^n) = ∏_{i=1}^n P_{Y_i|X_i}(y_i|x_i)

for all n. For any input X and its corresponding output Y,

I_δ(X; Y) ≤ I_δ(X̃; Ỹ) = I(X̃; Ỹ),

where Ỹ is the output due to X̃, an independent process with the same first-order statistics as X, i.e., P_{X̃^n}(x^n) = ∏_{i=1}^n P_{X_i}(x_i).
Proof: First, we observe that

(1/n) log [ P_{W^n}(y^n|x^n) / P_{Y^n}(y^n) ] + (1/n) log [ P_{Y^n}(y^n) / P_{Ỹ^n}(y^n) ] = (1/n) log [ P_{W^n}(y^n|x^n) / P_{Ỹ^n}(y^n) ].

In other words,

(1/n) log [ P_{X^n W^n}(x^n, y^n) / (P_{X^n}(x^n) P_{Y^n}(y^n)) ] + (1/n) log [ P_{Y^n}(y^n) / P_{Ỹ^n}(y^n) ] = (1/n) log [ P_{X^n W^n}(x^n, y^n) / (P_{X^n}(x^n) P_{Ỹ^n}(y^n)) ].
By evaluating the above terms under P_{X^n W^n} = P_{X^n,Y^n} and defining

z̄(θ) ≜ limsup_{n→∞} P_{X^n W^n}{ (x^n, y^n) ∈ 𝒳^n × 𝒴^n : (1/n) log [ P_{X^n W^n}(x^n, y^n) / (P_{X^n}(x^n) P_{Ỹ^n}(y^n)) ] ≤ θ }

and

Z_δ(X; Ỹ) ≜ sup{θ : z̄(θ) ≤ δ},

we obtain from (2.2.1) (with γ = 0) that

Z_δ(X; Ỹ) ≥ I_δ(X; Y) + D(Y ‖ Ỹ) ≥ I_δ(X; Y),

since D(Y ‖ Ỹ) ≥ 0 by property 1 of Lemma 2.4. Next, note that⁷

Pr{ (1/n) Σ_{i=1}^n [ log ( P_{X_i W_i}(X_i, Y_i) / (P_{X_i}(X_i) P_{Y_i}(Y_i)) ) − E_{X_i W_i}[ log ( P_{X_i W_i}(X_i, Y_i) / (P_{X_i}(X_i) P_{Y_i}(Y_i)) ) ] ] > γ } → 0
for any γ > 0, where the convergence to zero follows from Chebyshev's inequality and the finiteness of the channel alphabets (or, more directly, the finiteness of the individual variances). Consequently, z̄(θ) = 1 for

θ > liminf_{n→∞} (1/n) Σ_{i=1}^n E_{X_i W_i}[ log ( P_{X_i W_i}(X_i, Y_i) / (P_{X_i}(X_i) P_{Y_i}(Y_i)) ) ] = liminf_{n→∞} (1/n) Σ_{i=1}^n I(X_i; Y_i),

and z̄(θ) = 0 for θ < liminf_{n→∞} (1/n) Σ_{i=1}^n I(X_i; Y_i), which implies

Z_δ(X; Ỹ) = liminf_{n→∞} (1/n) Σ_{i=1}^n I(X_i; Y_i)

for any δ ∈ [0, 1). Similarly, we can show that

I(X̃; Ỹ) = I_δ(X̃; Ỹ) = liminf_{n→∞} (1/n) Σ_{i=1}^n I(X_i; Y_i)

for any δ ∈ [0, 1). Accordingly,

I(X̃; Ỹ) = Z_δ(X; Ỹ) ≥ I_δ(X; Y).   □
⁷ The claim can be justified as follows:

P_{Y_1}(y_1) = Σ_{y_2^n ∈ 𝒴^{n−1}} Σ_{x^n ∈ 𝒳^n} P_{X^n}(x^n) P_{W^n}(y^n|x^n)
 = Σ_{x_1 ∈ 𝒳} P_{X_1}(x_1) P_{W_1}(y_1|x_1) Σ_{y_2^n ∈ 𝒴^{n−1}} Σ_{x_2^n ∈ 𝒳^{n−1}} P_{X_2^n|X_1}(x_2^n|x_1) P_{W_2^n}(y_2^n|x_2^n)
 = Σ_{x_1 ∈ 𝒳} P_{X_1}(x_1) P_{W_1}(y_1|x_1) Σ_{y_2^n ∈ 𝒴^{n−1}} Σ_{x_2^n ∈ 𝒳^{n−1}} P_{X_2^n W_2^n|X_1}(x_2^n, y_2^n|x_1)
 = Σ_{x_1 ∈ 𝒳} P_{X_1}(x_1) P_{W_1}(y_1|x_1).

Hence, for a channel with P_{W^n}(y^n|x^n) = ∏_{i=1}^n P_{W_i}(y_i|x_i), the output marginal depends only on the respective input marginal.
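This claim can be mirrored with a toy memoryless channel: two input processes with identical first-order marginals but different memory induce the same output marginals. All numbers below are illustrative choices of ours:

```python
W = [[0.9, 0.1], [0.1, 0.9]]          # per-letter channel W(y|x), a BSC(0.1)

def output_marginal_y1(joint):
    """P_{Y_1}(y_1) for a 2-letter input with joint dict {(x1, x2): prob}."""
    m = [0.0, 0.0]
    for (x1, _x2), pr in joint.items():
        for y1 in (0, 1):
            m[y1] += pr * W[x1][y1]   # summing W(y2|x2) over y2 gives 1
    return m

# Independent input with P(X1=1) = 0.3 and P(X2=1) = 0.6 ...
indep = {(a, b): (0.7, 0.3)[a] * (0.4, 0.6)[b] for a in (0, 1) for b in (0, 1)}
# ... and a correlated input with the SAME first-order marginals.
corr = {(0, 0): 0.38, (0, 1): 0.32, (1, 0): 0.02, (1, 1): 0.28}

print(output_marginal_y1(indep), output_marginal_y1(corr))  # equal up to rounding
```

This is exactly why replacing X by its independent version X̃ leaves the output marginals P_{Y_i} (and hence P_{Ỹ^n}) unchanged.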
2.5 Examples for the computation of general information measures

Let the alphabet be binary, 𝒳 = 𝒴 = {0, 1}, and let every output be given by

Y_i^(n) = X_i^(n) ⊕ Z_i^(n),

where ⊕ represents the modulo-2 addition operation, and Z is an arbitrary binary random process independent of X. Assume that X is a Bernoulli uniform input, i.e., an i.i.d. random process with equiprobable marginal distribution. Then it can be derived that the resultant Y is also Bernoulli uniform distributed, no matter what distribution Z has.
To compute I_δ(X; Y), we start from the bounds of Lemma 2.4:

I_δ(X; Y) ≥ H_0(Y) − H̄_{(1−δ)}(Y|X)   (2.5.1)

and

I_δ(X; Y) ≤ H̄_{δ+γ}(Y) − H̄_γ(Y|X),   (2.5.2)

where 0 ≤ δ ≤ 1, 0 ≤ γ ≤ 1 and δ + γ ≤ 1. Note that the lower bound in (2.5.1) and the upper bound in (2.5.2) are respectively equal to −∞ and ∞ for δ = 0 and for δ + γ = 1, which become trivial bounds; hence, we further restrict δ > 0 for the lower bound and δ + γ < 1 for the upper bound.

Thus for δ ∈ [0, 1),

I_δ(X; Y) ≤ inf_{0<γ<1−δ} [ H̄_{δ+γ}(Y) − H̄_γ(Y|X) ].

By the symmetry of the channel, H̄_γ(Y|X) = H̄_γ(Z); hence

I_δ(X; Y) ≤ inf_{0<γ<1−δ} [ H̄_{δ+γ}(Y) − H̄_γ(Z) ] ≤ inf_{0<γ<1−δ} [ log(2) − H̄_γ(Z) ],

where the last step follows from property 4 of Lemma 2.4. Since log(2) − H̄_γ(Z) is non-increasing in γ,

I_δ(X; Y) ≤ log(2) − lim_{γ↑(1−δ)} H̄_γ(Z).
On the other hand, the lower bound in (2.5.1) can be evaluated from the fact that Y is Bernoulli uniform distributed, so that H_0(Y) = log(2). We thus obtain, for δ ∈ (0, 1],

I_δ(X; Y) ≥ log(2) − H̄_{(1−δ)}(Z),

and

I_0(X; Y) = lim_{δ↓0} I_δ(X; Y) = log(2) − H̄(Z).

To summarize,

log(2) − H̄_{(1−δ)}(Z) ≤ I_δ(X; Y) ≤ log(2) − lim_{γ↑(1−δ)} H̄_γ(Z).
In fact, the information sup-spectrum can be computed directly:

ī(θ) ≜ limsup_{n→∞} Pr{ (1/n) log [ P_{Y^n|X^n}(Y^n|X^n) / P_{Y^n}(Y^n) ] ≤ θ }
 = limsup_{n→∞} Pr{ (1/n) log P_{Z^n}(Z^n) − (1/n) log P_{Y^n}(Y^n) ≤ θ }
 = limsup_{n→∞} Pr{ (1/n) log P_{Z^n}(Z^n) + log(2) ≤ θ }
 = limsup_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) ≥ log(2) − θ }
 = 1 − liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) < log(2) − θ }.
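When Z is itself i.i.d. Bernoulli(p), the channel is a BSC and the normalized information density log(2) + (1/n) log P_{Z^n}(Z^n) concentrates at log(2) − h_b(p). A simulation sketch (all parameters are illustrative choices of ours):

```python
import math
import random

random.seed(4)
p, n, trials = 0.2, 3000, 300
h_b = -p * math.log(p) - (1 - p) * math.log(1 - p)   # binary entropy (nats)

def info_density():
    """(1/n) log [P_{Y^n|X^n}/P_{Y^n}] = log 2 + (1/n) log P_{Z^n}(Z^n)."""
    log_pz = sum(math.log(p if random.random() < p else 1 - p)
                 for _ in range(n))
    return math.log(2) + log_pz / n

vals = [info_density() for _ in range(trials)]
mean = sum(vals) / trials
# With i.i.d. Z the spectrum is a unit step at log 2 - h_b(p), which is the
# BSC capacity under uniform input.
print(abs(mean - (math.log(2) - h_b)) < 0.01)
```

The examples that follow concern noise processes Z for which this concentration fails, so that ī(θ) has a genuinely non-degenerate shape.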
Hence, for δ ∈ (0, 1),

I_δ(X; Y) = sup{ θ : ī(θ) ≤ δ }
 = sup{ θ : 1 − liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) < log(2) − θ } ≤ δ }
 = sup{ θ : liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) < log(2) − θ } ≥ 1 − δ }
 = log(2) − inf{ θ : liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) < θ } ≥ 1 − δ }
 = log(2) − sup{ θ : liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) < θ } < 1 − δ }
 ≤ log(2) − sup{ θ : liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) ≤ θ } < 1 − δ }
 = log(2) − lim_{γ↑(1−δ)} H̄_γ(Z).

Also, for δ ∈ (0, 1),

I_δ(X; Y) ≥ sup{ θ : limsup_{n→∞} Pr{ (1/n) log [ P_{X^n,Y^n}(X^n, Y^n) / (P_{X^n}(X^n) P_{Y^n}(Y^n)) ] < θ } < δ }   (2.5.3)
 = log(2) − sup{ θ : liminf_{n→∞} Pr{ −(1/n) log P_{Z^n}(Z^n) ≤ θ } ≤ 1 − δ }
 = log(2) − H̄_{(1−δ)}(Z),

where (2.5.3) follows from the fact described in footnote 2. Therefore,

log(2) − H̄_{(1−δ)}(Z) ≤ I_δ(X; Y) ≤ log(2) − lim_{γ↑(1−δ)} H̄_γ(Z).
Figure 2.2: The spectrum h̄(θ) for Example 2.7.

Figure 2.3: The spectrum ī(θ) for Example 2.7.
Example 2.7 Suppose Z is, with probability λ, an all-zero sequence and, with probability 1 − λ, an i.i.d. sequence with each individual component being one with probability p. Then the sequence of random variables (1/n) h_{Z^n}(Z^n) converges to 0 and to h_b(p) with respective masses λ and 1 − λ, where h_b(p) ≜ −p log p − (1 − p) log(1 − p) is the binary entropy function. The resulting h̄(θ) is depicted in Figure 2.2. From (2.5.3), we obtain

I_δ(X; Y) = log(2) − h_b(p)  for 0 ≤ δ < 1 − λ,   and   I_δ(X; Y) = log(2)  for 1 − λ ≤ δ < 1.
Example 2.8 If

Z = { Z^n = (Z_1^(n), …, Z_n^(n)) }_{n=1}^∞

is a non-stationary binary independent sequence with

Pr{ Z_i^(n) = 0 } = 1 − Pr{ Z_i^(n) = 1 } = p_i,

then, by the uniform boundedness (in i) of the variance of the random variable log P_{Z_i^(n)}(Z_i^(n)), namely

Var[ log P_{Z_i^(n)}(Z_i^(n)) ] ≤ E[ ( log P_{Z_i^(n)}(Z_i^(n)) )² ] ≤ sup_{0<p_i<1} [ p_i (log p_i)² + (1 − p_i)(log(1 − p_i))² ] < log(2),
we have, by Chebyshev's inequality,

Pr{ | −(1/n) log P_{Z^n}(Z^n) − (1/n) Σ_{i=1}^n H(Z_i^(n)) | > γ } → 0

for any γ > 0. Therefore, H̄_{(1−δ)}(Z) is equal to

H̄_{(1−δ)}(Z) = H̄(Z) = limsup_{n→∞} (1/n) Σ_{i=1}^n H(Z_i^(n)) = limsup_{n→∞} (1/n) Σ_{i=1}^n h_b(p_i)

for δ ∈ (0, 1], and to infinity for δ = 0, where h_b(p_i) ≜ −p_i log(p_i) − (1 − p_i) log(1 − p_i).
Consequently,

I_δ(X; Y) = log(2) − H̄(Z) = log(2) − limsup_{n→∞} (1/n) Σ_{i=1}^n h_b(p_i)  for δ ∈ [0, 1),

and I_1(X; Y) = ∞. This result is illustrated in Figures 2.4 and 2.5.

Figure 2.4: The limiting spectrums of (1/n) h_{Z^n}(Z^n) for Example 2.8; the clustering points are H(Z) and H̄(Z).

Figure 2.5: The possible limiting spectrums of (1/n) i_{X^n,Y^n}(X^n; Y^n) for Example 2.8; the clustering points are log(2) − H̄(Z) and log(2) − H(Z).
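The gap between the clustering points in Figure 2.4 is easy to reproduce numerically. The dyadic-block pattern for p_i below is an illustrative choice of ours, not taken from the text; it makes the liminf and the limsup of (1/n) Σ h_b(p_i) differ:

```python
import math

def h_b(p):                            # binary entropy function (nats)
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def p_of(i):
    """p_i switches between 0.5 and 0.01 on dyadic blocks [2^k, 2^{k+1})."""
    return 0.5 if int(math.log2(i)) % 2 == 0 else 0.01

N = 1 << 16
running, cesaro = 0.0, []
for i in range(1, N + 1):
    running += h_b(p_of(i))
    cesaro.append(running / i)         # (1/n) * sum_{i<=n} h_b(p_i)

tail = cesaro[N // 4:]
# The Cesaro mean keeps oscillating, so its liminf and limsup differ, and
# I_delta(X;Y) = log 2 - limsup is strictly below log 2 - liminf.
print(round(min(tail), 2), round(max(tail), 2))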
2.6 Rényi's information measures

In this section, we introduce alternative generalizations of information measures. They are respectively named Rényi's entropy, Rényi's mutual information and Rényi's divergence.

It is known that Rényi's information measures can be used to provide exponentially tight error bounds for data compression block coding and hypothesis testing systems with i.i.d. statistics. Campbell also found an operational characterization for variable-length data compression codes [2]. Here, we only provide their definitions, and will introduce their operational meanings in the subsequent chapters.
Definition 2.9 (Rényi's entropy) For α > 0, the Rényi entropy⁸ of order α is defined by

H(X; α) ≜ (1/(1 − α)) log Σ_{x∈𝒳} [P_X(x)]^α,  for α ≠ 1;
H(X; 1) ≜ lim_{α→1} H(X; α) = −Σ_{x∈𝒳} P_X(x) log P_X(x).
Definition 2.10 (Rényi's divergence) For α > 0, the Rényi divergence of order α is defined by

D(X ‖ X̂; α) ≜ (1/(α − 1)) log Σ_{x∈𝒳} [P_X(x)]^α [P_X̂(x)]^{1−α},  for α ≠ 1;
D(X ‖ X̂; 1) ≜ lim_{α→1} D(X ‖ X̂; α) = Σ_{x∈𝒳} P_X(x) log [ P_X(x) / P_X̂(x) ].

⁸ The Rényi entropy of order α is usually denoted by H_α(X); we write H(X; α) here to avoid confusion with the quantile subscripts used in this chapter.
One extension of the mutual information to order α is based on

I(X; Y) = Σ_{x∈𝒳} Σ_{y∈𝒴} P_X(x) P_{Y|X}(y|x) log [ P_{Y|X}(y|x) / P_Y(y) ] = min_{P_Ỹ} Σ_{x∈𝒳} Σ_{y∈𝒴} P_X(x) P_{Y|X}(y|x) log [ P_{Y|X}(y|x) / P_Ỹ(y) ].

The latter equality can be proved by taking the derivative of

Σ_{x∈𝒳} Σ_{y∈𝒴} P_X(x) P_{Y|X}(y|x) log [ P_{Y|X}(y|x) / P_Ỹ(y) ] + λ ( Σ_{y∈𝒴} P_Ỹ(y) − 1 )

with respect to P_Ỹ(y), and letting it be zero, to obtain that the minimizing output distribution satisfies

P_Ỹ(y) = Σ_{x∈𝒳} P_X(x) P_{Y|X}(y|x).

The other extension is a direct generalization of

I(X; Y) = D(P_{X,Y} ‖ P_X × P_Y) = min_{P_Ỹ} D(P_{X,Y} ‖ P_X × P_Ỹ)

to Rényi's divergence.
Definition 2.11 (type-I Rényi's mutual information) For α > 0, the type-I Rényi mutual information of order α is defined by

I(X; Y; α) ≜ min_{P_Ỹ} (1/(α − 1)) Σ_{x∈𝒳} P_X(x) log Σ_{y∈𝒴} [P_{Y|X}(y|x)]^α [P_Ỹ(y)]^{1−α},  if α ≠ 1;
I(X; Y; 1) ≜ lim_{α→1} I(X; Y; α) = I(X; Y),

where the minimization is taken over all P_Ỹ under fixed P_X and P_{Y|X}.
Definition 2.12 (type-II Rényi's mutual information) For α > 0, the type-II Rényi mutual information of order α is defined by

J(X; Y; α) ≜ min_{P_Ỹ} D(P_{X,Y} ‖ P_X × P_Ỹ; α)
 = (α/(α − 1)) log Σ_{y∈𝒴} [ Σ_{x∈𝒳} P_X(x) [P_{Y|X}(y|x)]^α ]^{1/α},  for α ≠ 1;
J(X; Y; 1) ≜ lim_{α→1} J(X; Y; α) = I(X; Y).
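The closed form in Definition 2.12 can be spot-checked against the α → 1 limit. The BSC with uniform input below is an illustrative choice of ours:

```python
import math

eps = 0.1                              # illustrative BSC crossover probability
P_X = [0.5, 0.5]
P_YgX = [[1 - eps, eps], [eps, 1 - eps]]

def J(alpha):
    """Type-II Renyi mutual information via the closed form of Def. 2.12."""
    inner = [sum(P_X[x] * P_YgX[x][y] ** alpha for x in (0, 1)) ** (1 / alpha)
             for y in (0, 1)]
    return (alpha / (alpha - 1)) * math.log(sum(inner))

# Shannon mutual information for comparison (the alpha -> 1 limit).
P_Y = [sum(P_X[x] * P_YgX[x][y] for x in (0, 1)) for y in (0, 1)]
I = sum(P_X[x] * P_YgX[x][y] * math.log(P_YgX[x][y] / P_Y[y])
        for x in (0, 1) for y in (0, 1))

print(abs(J(1.0001) - I) < 1e-3, abs(J(0.9999) - I) < 1e-3)
```

For this channel I(X; Y) = log 2 − h_b(ε), and J(X; Y; α) approaches it smoothly from both sides of α = 1.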
Some elementary properties of Rényi's entropy and divergence are summarized below.

Lemma 2.13 For a finite alphabet 𝒳, the following statements hold.

1. 0 ≤ H(X; α) ≤ log |𝒳|; the first equality holds if, and only if, X is deterministic, and the second equality holds if, and only if, X is uniformly distributed over 𝒳.

2. H(X; α) is strictly decreasing in α unless X is uniformly distributed over its support {x ∈ 𝒳 : P_X(x) > 0}.

3. lim_{α↓0} H(X; α) = log |{x ∈ 𝒳 : P_X(x) > 0}|.

4. lim_{α→∞} H(X; α) = −log max_{x∈𝒳} P_X(x).

5. D(X ‖ X̂; α) ≥ 0.

6. D(X ‖ X̂; α) = ∞ if, and only if, either

{x ∈ 𝒳 : P_X(x) > 0 and P_X̂(x) > 0} = ∅,

or

{x ∈ 𝒳 : P_X(x) > 0} ⊄ {x ∈ 𝒳 : P_X̂(x) > 0}  for α ≥ 1.

7. lim_{α↓0} D(X ‖ X̂; α) = −log P_X̂{x ∈ 𝒳 : P_X(x) > 0}.

8. If P_X(x) > 0 implies P_X̂(x) > 0, then

lim_{α→∞} D(X ‖ X̂; α) = max_{x∈𝒳 : P_X(x)>0} log [ P_X(x) / P_X̂(x) ].

9. I(X; Y; α) ≥ J(X; Y; α) for 0 < α < 1, and I(X; Y; α) ≤ J(X; Y; α) for α > 1.
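Several of these properties are easy to spot-check numerically; the distribution below is an illustrative choice of ours:

```python
import math

P = [0.5, 0.3, 0.2]                    # illustrative distribution

def renyi_entropy(alpha):
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in P)
    return math.log(sum(p ** alpha for p in P)) / (1 - alpha)

# Property 3: alpha -> 0 gives the log of the support size.
print(abs(renyi_entropy(1e-6) - math.log(3)) < 1e-3)
# Property 4: alpha -> infinity gives -log max_x P_X(x).
print(abs(renyi_entropy(200.0) + math.log(0.5)) < 0.05)
# Property 2: decreasing in alpha for a non-uniform distribution.
print(renyi_entropy(0.5) > renyi_entropy(1.0) > renyi_entropy(2.0))
```

The limits in properties 3 and 4 are why H(X; α) interpolates between the max-entropy (log of support size) and the min-entropy of X.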
Lemma 2.14 (data processing lemma for type-I Rényi's mutual information) Fix α > 0. If X → Y → Z, then I(X; Z; α) ≤ I(X; Y; α).
Bibliography

[1] P. Billingsley, Probability and Measure, 2nd ed. New York, NY: John Wiley and Sons, 1995.

[2] L. L. Campbell, "A coding theorem and Rényi's entropy," Information and Control, vol. 8, pp. 423-429, 1965.

[3] R. M. Gray, Entropy and Information Theory. New York: Springer-Verlag, 1990.

[4] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. on Information Theory, vol. IT-39, no. 3, pp. 752-772, May 1993.

[5] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Information Theory, vol. IT-40, no. 4, pp. 1147-1157, Jul. 1994.
Chapter 3
General Lossless Data Compression
Theorems
In Volume I of the lecture notes, we learned that the entropy rate

lim_{n→∞} (1/n) H(X^n)

is the minimum data compression rate (nats per source symbol) achieving arbitrarily small data compression error for block coding of a stationary-ergodic source. We also mentioned that, in more complicated situations where the source becomes non-stationary, the quantity lim_{n→∞} (1/n) H(X^n) may not exist and can no longer be used to characterize the source compression. This results in the need to establish a new entropy measure which appropriately characterizes the operational limits of arbitrary stochastic systems, which was done in the previous chapter.
The role of a source code is to represent the output of a source efficiently. Specifically, a source code design should minimize the source description rate of the code subject to a fidelity criterion constraint. One commonly used fidelity criterion constraint is to place an upper bound on the probability of decoding error $P_e$. If $P_e$ can be made arbitrarily small, we obtain a traditional (almost) error-free source coding system¹. Lossy data compression codes are a larger class of codes in the sense that the fidelity criterion used in the coding scheme is a general distortion measure. In this chapter, we only demonstrate the bounded-error data compression theorems for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources. The general lossy data compression theorems will be introduced in subsequent chapters.

¹Recall that a completely error-free data compression is required only for variable-length codes. A lossless data compression block code only dictates that the compression error can be made arbitrarily small, i.e., asymptotically error-free.
3.1 Fixed-length data compression codes for arbitrary sources

Equipped with the general information measures, we herein demonstrate a generalized Asymptotic Equipartition Property (AEP) theorem and establish expressions for the minimum achievable (fixed-length) coding rate of an arbitrary source $X$. Throughout the following derivation, we make the implicit assumption that the source alphabet $\mathcal{X}$ is finite².
Definition 3.1 (cf. Definition 4.2 and its associated footnote in Volume I) An $(n, M)$ block code for data compression is a set

$$\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_M\}$$

consisting of $M$ sourcewords³ of block length $n$ (and a binary-indexing codeword for each sourceword $c_i$); each sourceword represents a group of source symbols of length $n$.
Definition 3.2 Fix $\varepsilon \in [0,1]$. $R \ge 0$ is an $\varepsilon$-achievable data compression rate for a source $X$ if there exists a sequence of block data compression codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with

$$\limsup_{n\to\infty}\frac1n\log M_n \le R,$$

and

$$\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon,$$

where $P_e(\mathcal{C}_n) \triangleq \Pr\{X^n \notin \mathcal{C}_n\}$ is the probability of decoding error.

The infimum of all $\varepsilon$-achievable data compression rates for $X$ is denoted by $T_\varepsilon(X)$.
Lemma 3.3 (Lemma 1.5 in [3]) Fix a positive integer $n$. There exists an $(n, M_n)$ source block code $\mathcal{C}_n$ for $P_{X^n}$ such that its error probability satisfies

$$P_e(\mathcal{C}_n) \le \Pr\left\{\frac1n h_{X^n}(X^n) > \frac1n\log M_n\right\}.$$

²Actually, the theorems introduced below also apply to sources with countable alphabets. We assume finite alphabets in order to avoid uninteresting cases (such as $\bar H(X) = \infty$).

Proof: Since $(1/n)h_{X^n}(x^n) \le (1/n)\log M_n$ is equivalent to $P_{X^n}(x^n) \ge 1/M_n$, we have

$$1 \ge \sum_{x^n\in\mathcal{X}^n:\,(1/n)h_{X^n}(x^n)\le(1/n)\log M_n} P_{X^n}(x^n) \ge \sum_{x^n\in\mathcal{X}^n:\,(1/n)h_{X^n}(x^n)\le(1/n)\log M_n} \frac{1}{M_n} = \left|\left\{x^n\in\mathcal{X}^n : \frac1n h_{X^n}(x^n) \le \frac1n\log M_n\right\}\right|\cdot\frac{1}{M_n}.$$

Therefore, $\left|\{x^n\in\mathcal{X}^n : (1/n)h_{X^n}(x^n) \le (1/n)\log M_n\}\right| \le M_n$. We can then choose a code

$$\mathcal{C}_n \supseteq \left\{x^n\in\mathcal{X}^n : \frac1n h_{X^n}(x^n) \le \frac1n\log M_n\right\}$$

with $|\mathcal{C}_n| = M_n$ and

$$P_e(\mathcal{C}_n) = 1 - P_{X^n}(\mathcal{C}_n) \le \Pr\left\{\frac1n h_{X^n}(X^n) > \frac1n\log M_n\right\}. \qquad \Box$$
Lemma 3.4 (Lemma 1.6 in [3]) Every $(n, M_n)$ source block code $\mathcal{C}_n$ for $P_{X^n}$ satisfies

$$P_e(\mathcal{C}_n) \ge \Pr\left\{\frac1n h_{X^n}(X^n) > \frac1n\log M_n + \gamma\right\} - \exp\{-n\gamma\},$$

for every $\gamma > 0$.

Proof: It suffices to prove that

$$1 - P_e(\mathcal{C}_n) = \Pr\{X^n\in\mathcal{C}_n\} < \Pr\left\{\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + \exp\{-n\gamma\}.$$

Clearly,

$$\begin{aligned}
\Pr\{X^n\in\mathcal{C}_n\} &= \Pr\left\{X^n\in\mathcal{C}_n \text{ and } \frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + \Pr\left\{X^n\in\mathcal{C}_n \text{ and } \frac1n h_{X^n}(X^n) > \frac1n\log M_n + \gamma\right\}\\
&\le \Pr\left\{\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + \Pr\left\{X^n\in\mathcal{C}_n \text{ and } \frac1n h_{X^n}(X^n) > \frac1n\log M_n + \gamma\right\}\\
&= \Pr\left\{\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + \sum_{x^n\in\mathcal{C}_n} P_{X^n}(x^n)\,\mathbf{1}\!\left\{\frac1n h_{X^n}(x^n) > \frac1n\log M_n + \gamma\right\}\\
&= \Pr\left\{\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + \sum_{x^n\in\mathcal{C}_n} P_{X^n}(x^n)\,\mathbf{1}\!\left\{P_{X^n}(x^n) < \frac{1}{M_n}\exp\{-n\gamma\}\right\}\\
&< \Pr\left\{\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + |\mathcal{C}_n|\cdot\frac{1}{M_n}\exp\{-n\gamma\}\\
&= \Pr\left\{\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right\} + \exp\{-n\gamma\}. \qquad \Box
\end{aligned}$$
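Both lemmas can be checked numerically on a small memoryless source. The sketch below is illustrative only; the Bernoulli parameter, blocklength, code size, and margin $\gamma$ are arbitrary choices, not values from the text.

```python
import itertools
import math

# Illustrative check of Lemmas 3.3 and 3.4 on an i.i.d. Bernoulli(p) source.
p, n, M, gamma = 0.2, 10, 64, 0.05
seqs = list(itertools.product((0, 1), repeat=n))

def prob(x):
    """P_{X^n}(x^n) for the i.i.d. Bernoulli(p) source."""
    k = sum(x)
    return (p ** k) * ((1 - p) ** (n - k))

def tail(threshold):
    """Pr[(1/n) h_{X^n}(X^n) > threshold]."""
    return sum(prob(x) for x in seqs if -math.log(prob(x)) / n > threshold)

# The best (n, M) code keeps the M most probable sourcewords.
codebook = sorted(seqs, key=prob, reverse=True)[:M]
pe = 1.0 - sum(prob(x) for x in codebook)

upper = tail(math.log(M) / n)                                  # Lemma 3.3
lower = tail(math.log(M) / n + gamma) - math.exp(-n * gamma)   # Lemma 3.4
assert lower - 1e-12 <= pe <= upper + 1e-12
```

The optimal error probability is squeezed between the two spectrum-based bounds, which is exactly the mechanism used in Theorem 3.5 below.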
We now apply Lemmas 3.3 and 3.4 to prove a general source coding theorem for block codes.
Theorem 3.5 (general source coding theorem) For any source $X$,

$$T_\varepsilon(X) = \lim_{\delta\uparrow(1-\varepsilon)} \bar H_\delta(X).$$

Proof:

1. Forward part (achievability): $T_\varepsilon(X) \le \lim_{\delta\uparrow(1-\varepsilon)} \bar H_\delta(X)$.

We need to prove the existence of a sequence of block codes $\{\mathcal{C}_n\}_{n\ge1}$ such that, for every $\gamma>0$,

$$\limsup_{n\to\infty}\frac1n\log|\mathcal{C}_n| \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) + \gamma \quad\text{and}\quad \limsup_{n\to\infty}P_e(\mathcal{C}_n) \le \varepsilon.$$

By Lemma 3.3, such codes exist with $(1/n)\log M_n = \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) + \gamma$. Therefore,

$$\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \limsup_{n\to\infty}\Pr\left\{\frac1n h_{X^n}(X^n) > \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) + \gamma\right\} = 1 - \liminf_{n\to\infty}\Pr\left\{\frac1n h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) + \gamma\right\} \le 1-(1-\varepsilon) = \varepsilon,$$

where the last inequality follows from

$$\lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) = \sup\left\{\theta : \liminf_{n\to\infty}\Pr\left[\frac1n h_{X^n}(X^n)\le\theta\right] < 1-\varepsilon\right\}. \tag{3.1.1}$$

2. Converse part: $T_\varepsilon(X) \ge \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X)$.

Assume, for the sake of contradiction, that $T_\varepsilon(X) < \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X)$; then there exists $\gamma>0$ with $T_\varepsilon(X) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) - 4\gamma$. By definition of $T_\varepsilon(X)$, there exists a sequence of codes $\{\mathcal{C}_n\}_{n\ge1}$ with

$$\limsup_{n\to\infty}\frac1n\log|\mathcal{C}_n| \le T_\varepsilon(X) + \gamma \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) - 3\gamma \tag{3.1.2}$$

and

$$\limsup_{n\to\infty}P_e(\mathcal{C}_n) \le \varepsilon. \tag{3.1.3}$$

(3.1.2) implies that

$$\frac1n\log|\mathcal{C}_n| \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) - 2\gamma$$

for all sufficiently large $n$. Hence, for those $n$ satisfying the above inequality, and also by Lemma 3.4,

$$P_e(\mathcal{C}_n) \ge \Pr\left\{\frac1n h_{X^n}(X^n) > \frac1n\log|\mathcal{C}_n| + \gamma\right\} - e^{-n\gamma} \ge \Pr\left\{\frac1n h_{X^n}(X^n) > \left(\lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) - 2\gamma\right) + \gamma\right\} - e^{-n\gamma}.$$

Therefore,

$$\limsup_{n\to\infty}P_e(\mathcal{C}_n) \ge 1 - \liminf_{n\to\infty}\Pr\left\{\frac1n h_{X^n}(X^n) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X) - \gamma\right\} > \varepsilon,$$

where the last inequality follows from (3.1.1). Thus, a contradiction to (3.1.3) is obtained. $\Box$
A few remarks are made based on the previous theorem.

Note that when $\varepsilon = 0$, $\lim_{\delta\uparrow 1}\bar H_\delta(X) = \bar H(X)$. Hence, the above theorem generalizes the block source coding theorem in [4], which states that the minimum achievable fixed-length source coding rate of any finite-alphabet source is $\bar H(X)$.

Consider the special case where $-(1/n)\log P_{X^n}(X^n)$ converges in probability to a constant $H$, which holds for all information stable sources⁴. In this case, both the inf- and sup-spectrums of $X$ degenerate to a unit step function:

$$u(\theta) = \begin{cases}1, & \text{if } \theta > H;\\ 0, & \text{if } \theta < H,\end{cases}$$

where $H$ is the source entropy rate. Thus, $\bar H_\delta(X) = \sup\{R : u(R) < 1\} = H$ for every $\delta\in[0,1)$, and consequently $T_\varepsilon(X) = H$ for every $\varepsilon\in[0,1)$. Therefore, the relationship between the code rate and the ultimate optimal error probability is also clearly defined. We further explore this case in the next example.
⁴A source $X = \left\{X^n = \left(X_1^{(n)},\ldots,X_n^{(n)}\right)\right\}_{n=1}^{\infty}$ is said to be information stable if $H(X^n) = E[-\log P_{X^n}(X^n)] > 0$ for all $n$, and

$$\lim_{n\to\infty}\Pr\left\{\left|\frac{-\log P_{X^n}(X^n)}{H(X^n)} - 1\right| > \gamma\right\} = 0,$$

for every $\gamma > 0$. By the definition, any stationary-ergodic source with finite $n$-fold entropy is information stable; hence, information stability can be viewed as a generalized source model for stationary-ergodic sources.
Example 3.6 Consider a binary source $X$ where each $X_n$ is Bernoulli($\Theta$) distributed, and $\Theta$ is a random variable defined over $(0,1)$. This is a stationary but nonergodic source [1]. We can view the source as a mixture of Bernoulli($\theta$) processes, where the parameter $\theta \in (0,1)$ and $\Theta$ has distribution $P_\Theta$.

Figure 3.1: Behavior of the probability of block decoding error as block length $n$ goes to infinity for an arbitrary source $X$. (The figure plots the rate $R$ against the two thresholds $\bar H_0(X)$ and $\bar H(X)$: below $\bar H_0(X)$, $P_e^{(n)} \to 1$ (i.o.); between the thresholds, $P_e^{(n)}$ is lower bounded (i.o. in $n$); above $\bar H(X)$, $P_e^{(n)} \to 0$.)
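The nonergodicity in Example 3.6 can be seen numerically. For a uniform mixing distribution $P_\Theta$, the mixture probability has the closed form $P_{X^n}(x^n) = k!(n-k)!/(n+1)!$, where $k$ is the number of ones, so the normalized self-information can be computed exactly. The Monte-Carlo sketch below (sample sizes, seed, and the restriction of $\theta$ to $(0.01, 0.99)$ are arbitrary choices) shows that the spectrum does not collapse to a single point but tracks $h_b(\theta)$ of the realized parameter:

```python
import math
import random

# Example 3.6 with Theta ~ uniform: P_{X^n}(x^n) = k!(n-k)!/(n+1)!.
random.seed(1)
n = 2000

def norm_self_info(k, n):
    """-(1/n) log P_{X^n}(x^n) for the uniform Bernoulli mixture."""
    logp = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    return -logp / n

def h_b(t):
    """Binary entropy in nats."""
    return -t * math.log(t) - (1 - t) * math.log(1 - t)

samples = []
for _ in range(200):
    theta = random.uniform(0.01, 0.99)
    k = sum(random.random() < theta for _ in range(n))
    samples.append((norm_self_info(k, n), h_b(theta)))

# Normalized self-information stays close to h_b(theta) per realization...
max_gap = max(abs(a - b) for a, b in samples)
assert max_gap < 0.15
# ...yet varies widely across realizations: a non-degenerate spectrum.
spread = max(a for a, _ in samples) - min(a for a, _ in samples)
assert spread > 0.3
```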
We close this section by remarking that the definition we adopt for the achievable data compression rate is slightly different from, but equivalent to, the one used in [4, Def. 8]. The definition in [4] also yields the same result, which was separately proved by Steinberg and Verdú as a direct consequence of Theorem 10(a) (or Corollary 3) in [5]. To be precise, they showed that $T_\varepsilon(X)$, denoted by $T_e(\varepsilon, X)$ in [5], is equal to $R_v(2\varepsilon)$ (cf. Def. 17 in [5]). By a simple derivation, we obtain:

$$\begin{aligned}
T_e(\varepsilon, X) &= R_v(2\varepsilon)\\
&= \inf\left\{\theta : \limsup_{n\to\infty}P_{X^n}\left[-\frac1n\log P_{X^n}(X^n) > \theta\right] \le \varepsilon\right\}\\
&= \inf\left\{\theta : \liminf_{n\to\infty}P_{X^n}\left[-\frac1n\log P_{X^n}(X^n) \le \theta\right] \ge 1-\varepsilon\right\}\\
&= \sup\left\{\theta : \liminf_{n\to\infty}P_{X^n}\left[-\frac1n\log P_{X^n}(X^n) \le \theta\right] < 1-\varepsilon\right\}\\
&= \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X).
\end{aligned}$$

Note that Theorem 10(a) in [5] is a lossless data compression theorem for arbitrary sources, which the authors obtain as a byproduct of their results on finite-precision resolvability theory. Specifically, they proved $T_0(X) = S(X)$ [4, Thm. 1] and $S(X) = \bar H(X)$ [4, Thm. 3], where $S(X)$ is the resolvability⁵ of an arbitrary source $X$. Here, we establish Theorem 3.5 in a different and more direct way.
3.2 Generalized AEP theorem
For discrete memoryless sources, the data compression theorem is proved by choosing the codebook $\mathcal{C}_n$ to be the weakly typical set and applying the Asymptotic Equipartition Property (AEP), which states that $(1/n)h_{X^n}(X^n)$ converges to $H(X)$ with probability one (and hence in probability). The AEP, which implies that the probability of the typical set is close to one for sufficiently large $n$, also holds for stationary-ergodic sources. It is, however, invalid for more general sources, e.g., nonstationary or nonergodic sources. We herein demonstrate a generalized AEP theorem.

⁵The resolvability, which is a measure of randomness for random variables, will be introduced in subsequent chapters.
Theorem 3.8 (generalized asymptotic equipartition property for arbitrary sources) Fix $\varepsilon\in[0,1)$. Given an arbitrary source $X$, define

$$T_n[R] \triangleq \left\{x^n\in\mathcal{X}^n : -\frac1n\log P_{X^n}(x^n) \le R\right\}.$$

Then for any $\delta > 0$, the following statements hold.

1. $$\liminf_{n\to\infty}\Pr\left\{X^n\in T_n[\bar H_\varepsilon(X)-\delta]\right\} \le \varepsilon. \tag{3.2.1}$$

2. $$\liminf_{n\to\infty}\Pr\left\{X^n\in T_n[\bar H_\varepsilon(X)+\delta]\right\} > \varepsilon. \tag{3.2.2}$$

3. The number of elements in

$$T_n(\delta;\varepsilon) \triangleq T_n[\bar H_\varepsilon(X)+\delta]\setminus T_n[\bar H_\varepsilon(X)-\delta],$$

denoted by $|T_n(\delta;\varepsilon)|$, satisfies

$$|T_n(\delta;\varepsilon)| \le \exp\{n(\bar H_\varepsilon(X)+\delta)\}, \tag{3.2.3}$$

where the operation $\mathcal{A}\setminus\mathcal{B}$ between two sets $\mathcal{A}$ and $\mathcal{B}$ is defined by $\mathcal{A}\setminus\mathcal{B} \triangleq \mathcal{A}\cap\mathcal{B}^c$, with $\mathcal{B}^c$ denoting the complement of $\mathcal{B}$.

4. There exist $\nu = \nu(\delta) > 0$ and a subsequence $\{n_j\}_{j=1}^{\infty}$ such that

$$|T_{n_j}(\delta;\varepsilon)| > \nu\,\exp\{n_j(\bar H_\varepsilon(X)-\delta)\}. \tag{3.2.4}$$

Proof: (3.2.1) and (3.2.2) follow from the definition of $\bar H_\varepsilon(X)$. For (3.2.3), we have

$$1 \ge \sum_{x^n\in T_n(\delta;\varepsilon)} P_{X^n}(x^n) \ge \sum_{x^n\in T_n(\delta;\varepsilon)}\exp\{-n(\bar H_\varepsilon(X)+\delta)\} = |T_n(\delta;\varepsilon)|\,\exp\{-n(\bar H_\varepsilon(X)+\delta)\}.$$
Figure 3.2: Illustration of the generalized AEP theorem: $T_n(\delta;\varepsilon) \triangleq T_n[\bar H_\varepsilon(X)+\delta]\setminus T_n[\bar H_\varepsilon(X)-\delta]$.

For (3.2.4), (3.2.2) implies that there exist $\nu = \nu(\delta) > 0$ and $N_1$ such that for all $n > N_1$,

$$\Pr\left\{T_n[\bar H_\varepsilon(X)+\delta]\right\} > \varepsilon + 2\nu.$$

Furthermore, (3.2.1) implies that, for the previously chosen $\nu$, there exists a subsequence $\{n'_j\}_{j=1}^{\infty}$ such that

$$\Pr\left\{T_{n'_j}[\bar H_\varepsilon(X)-\delta]\right\} < \varepsilon + \nu.$$

Therefore, for all $n'_j > N_1$,

$$\nu < \Pr\left\{T_{n'_j}[\bar H_\varepsilon(X)+\delta]\setminus T_{n'_j}[\bar H_\varepsilon(X)-\delta]\right\} \le \left|T_{n'_j}[\bar H_\varepsilon(X)+\delta]\setminus T_{n'_j}[\bar H_\varepsilon(X)-\delta]\right|\exp\{-n'_j(\bar H_\varepsilon(X)-\delta)\} = \left|T_{n'_j}(\delta;\varepsilon)\right|\exp\{-n'_j(\bar H_\varepsilon(X)-\delta)\}.$$

The desired subsequence $\{n_j\}_{j=1}^{\infty}$ is then defined by letting $n_1$ be the first $n'_j > N_1$, $n_2$ the second $n'_j > N_1$, etc. $\Box$
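For a concrete sanity check, consider an i.i.d. Bernoulli source, for which $\bar H_\varepsilon(X)$ collapses to the entropy $H$ for every $\varepsilon\in[0,1)$. Both the size bound (3.2.3) and the mass of the typical set can then be evaluated exactly via binomial coefficients; the parameters below are illustrative:

```python
import math

# Exact evaluation of the typical set of Theorem 3.8 for i.i.d. Bernoulli(p).
p, n, delta = 0.3, 400, 0.05
H = -p * math.log(p) - (1 - p) * math.log(1 - p)   # entropy in nats

size, prob = 0, 0.0
for k in range(n + 1):
    # every x^n with k ones has the same normalized self-information
    si = -(k * math.log(p) + (n - k) * math.log(1 - p)) / n
    if H - delta < si <= H + delta:
        c = math.comb(n, k)
        size += c
        prob += c * (p ** k) * ((1 - p) ** (n - k))

assert size <= math.exp(n * (H + delta))   # size bound (3.2.3)
assert prob > 0.9                          # the typical set carries the mass
```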
With the illustration depicted in Figure 3.2, we can clearly deduce that Theorem 3.8 is indeed a generalized version of the AEP, since:

The set

$$T_n(\delta;\varepsilon) \triangleq T_n[\bar H_\varepsilon(X)+\delta]\setminus T_n[\bar H_\varepsilon(X)-\delta] = \left\{x^n\in\mathcal{X}^n : \left|-\frac1n\log P_{X^n}(x^n) - \bar H_\varepsilon(X)\right| \le \delta\right\}$$

is nothing but the weakly typical set.

(3.2.1) and (3.2.2) imply that $q_n \triangleq \Pr\{T_n(\delta;\varepsilon)\} > 0$ infinitely often in $n$.

(3.2.3) and (3.2.4) imply that the number of sequences in $T_n(\delta;\varepsilon)$ (the dashed region in Figure 3.2) is approximately equal to $\exp\{n\bar H_\varepsilon(X)\}$, and the probability of each sequence in $T_n(\delta;\varepsilon)$ can be estimated by $q_n\exp\{-n\bar H_\varepsilon(X)\}$.
In particular, if $X$ is a stationary-ergodic source, then $\bar H_\varepsilon(X)$ is independent of $\varepsilon\in[0,1)$ and $\bar H_\varepsilon(X) = \bar H(X) = H(X)$, the entropy rate of the source; hence, Theorem 3.8 reduces to the conventional AEP.

3.3 Variable-length codes and Rényi's entropy

Instead of the average codeword length, consider evaluating a lossless variable-length code by the cost

$$L(t) \triangleq \frac1t\log\left(\sum_{x\in\mathcal{X}} P_X(x)\,e^{t\,\ell(c_x)}\right),$$

where $t$ is a chosen positive constant, $P_X$ is the distribution of source $X$, $c_x$ is the binary codeword for source symbol $x$, and $\ell(\cdot)$ is the length of the binary codeword. The criterion for optimality now becomes: a code is said to be optimal if its cost $L(t)$ is the smallest among all possible codes.
The physical meaning of the aforementioned cost function is roughly discussed below. In the limit $t\to 0$, $L(0) = \sum_{x\in\mathcal{X}} P_X(x)\ell(c_x)$, which is the average codeword length. In the case of $t\to\infty$, $L(\infty) = \max_{x\in\mathcal{X}}\ell(c_x)$, which is the maximum codeword length over all binary codewords. As you may have noticed, a longer codeword has a larger weight in the sum defining $L(t)$. In other words, the minimization of $L(t)$ is equivalent to the minimization of $\sum_{x\in\mathcal{X}} P_X(x)e^{t\ell(c_x)}$; hence, the weight for codeword $c_x$ is $e^{t\ell(c_x)}$. For the minimization operation, it is obvious that events with smaller weight are preferable, since they contribute less to the sum.

Therefore, with the minimization of $L(t)$, codewords with long lengths become less likely. In practice, systems with long codewords usually introduce complexity in encoding and decoding, and hence are somewhat infeasible. Consequently, the new criterion is, to some extent, more suitable for practical considerations.
3.3.2 Source coding theorem for Rényi's entropy

Theorem 3.9 (source coding theorem for Rényi's entropy) The minimum cost $L(t)$ attainable by uniquely decodable codes⁶ is the Rényi entropy of order $1/(1+t)$, i.e.,

$$H\!\left(X;\frac{1}{1+t}\right) = \frac{1+t}{t}\log\left(\sum_{x\in\mathcal{X}} P_X^{1/(1+t)}(x)\right).$$

Example 3.10 Given a source $X$, suppose we want to design an optimal lossless code with $\max_{x\in\mathcal{X}}\ell(c_x)$ being smallest. This corresponds to the limit $t\to\infty$, so the cost of the optimal code is $H(X;0) = \log|\mathcal{X}|$.
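Theorem 3.9 can be spot-checked in an idealized base-2 setting with real-valued codeword lengths satisfying the Kraft equality (a sketch under these assumptions; the distribution and the value of $t$ are arbitrary choices, and integer-length codes would incur up to one extra bit):

```python
import math

# Campbell-style cost L(t) = (1/t) log2 sum_x P(x) 2^{t l(x)} is minimized,
# over real-valued Kraft-tight lengths, at the Renyi entropy of order 1/(1+t).
P = [0.5, 0.25, 0.125, 0.125]
t = 2.0
alpha = 1.0 / (1.0 + t)

renyi = (1.0 / (1.0 - alpha)) * math.log2(sum(p ** alpha for p in P))

def cost(lengths):
    return (1.0 / t) * math.log2(sum(p * 2 ** (t * l) for p, l in zip(P, lengths)))

# Optimal lengths: l*(x) = -log2( P(x)^alpha / sum_y P(y)^alpha ).
Z = sum(p ** alpha for p in P)
opt = [-math.log2(p ** alpha / Z) for p in P]
assert abs(sum(2 ** -l for l in opt) - 1.0) < 1e-9   # Kraft equality
assert abs(cost(opt) - renyi) < 1e-9                 # cost attains the bound

# Any other Kraft-tight assignment costs at least as much, e.g. Shannon lengths.
other = [1.0, 2.0, 3.0, 3.0]
assert cost(other) >= renyi - 1e-9
```

Note how the optimal lengths are tilted away from the Shannon lengths $-\log_2 P(x)$: large $t$ penalizes long codewords, so the assignment flattens toward the uniform one, consistent with Example 3.10.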
3.4 Entropy of English

The compression of English text is not only a practical application but also an interesting research topic.

One of the main problems in compressing English text is that its statistical model is unclear; therefore, its entropy rate cannot be computed directly. To estimate the data compression bound of English text, various stochastic approximations to English have been proposed. One can then design a code, according to the estimated stochastic model, to compress English text. Naturally, the better the stochastic approximation, the better the resulting compression.

⁶For its definition, please refer to Section 4.3.1 of Volume I of the lecture notes.

Assumption 3.11 For data compression, we assume that the English text contains only the 26 letters and the space symbol. In other words, upper-case letters are treated the same as their lower-case counterparts, and special symbols, such as punctuation, are ignored.
3.4.1 Markov estimate of the entropy rate of English text

According to the source coding theorem, the first thing that a data compression code designer shall do is to estimate the (sup-)entropy rate of the English text. One can start by modeling the English text as a Markov source and computing the entropy rate according to the estimated Markov statistics.

Zero-order Markov approximation: It has been shown that the frequency of letters in English is far from uniform; for example, the most common letter, E, has $P_{\text{empirical}}(\mathrm{E}) \approx 0.13$, but the least common letters, Q and Z, have $P_{\text{empirical}}(\mathrm{Q}) \approx P_{\text{empirical}}(\mathrm{Z}) \approx 0.001$. Therefore, the zero-order Markov approximation apparently does not fit our need in estimating the entropy rate of the English text.

1st-order Markov approximation: The frequency of pairs of letters is also far from uniform; the most common pair, TH, has frequency about 0.037. It is fun to know that Q is always followed by U.

A higher-order approximation is possible; however, the database may become too large to handle. For example, a 2nd-order approximation requires $27^3 = 19683$ entries, and one may need millions of samples to make an accurate estimate of the probabilities.

Here are some examples of Markov approximations to English from Shannon's original paper. Note that each sequence of English letters is generated according to the approximated statistics.
1. Zero-order Markov approximation: The symbols are drawn independently with equiprobable distribution.
Example: XFOML RXKHRJEFJUJ ZLPWCFWKCYJ
EFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD
2. 1st-order Markov approximation: The symbols are drawn independently; the frequency of letters matches the 1st-order statistics of English text.
Example: OCRO HLI RGWR NMIELWIS EU LL
NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
3. 2nd-order Markov approximation: The frequency of pairs of letters matches that of English text.
Example: ON IE ANTSOUTINYS ARE T INCTORE ST BE
S DEAMY ACHIN D ILONASIVE TUCOOWE AT
TEASONARE FUSO TIZIN AN DY TOBE SEACE CTISBE
4. 3rd-order Markov approximation: The frequency of triples of letters matches that of English text.
Example: IN NO IST LAT WHEY CRATICT FROURE BERS
GROCID PONDENOME OF DEMONSTURES OF THE
REPTAGIN IS REGOACTIONA OF CRE
5. 4th-order Markov approximation: The frequency of quadruples of letters matches that of English text.
Example: THE GENERATED JOB PROVIDUAL BETTER
TRAND THE DISPLAYED CODE, ABOVERY UPONDULTS
WELL THE CODE RST IN THESTICAL IT DO HOCK
BOTHE MERG.
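The letter-level approximations above can be sketched programmatically: estimate pair frequencies from a text sample, then draw each next letter conditioned on the current one. The sample string below is a made-up stand-in for a real corpus, so the output merely illustrates the mechanism:

```python
import random
from collections import Counter, defaultdict

# Build an empirical bigram model from a tiny illustrative sample and
# generate new text from it, in the spirit of Shannon's approximations.
random.seed(0)
sample = "the theory of the communication of information in the channel "
alphabet = set(sample)

pairs = Counter(zip(sample, sample[1:]))
nxt = defaultdict(list)
for (a, b), cnt in pairs.items():
    nxt[a].extend([b] * cnt)          # empirical conditional frequencies

out, ch = [], "t"
for _ in range(40):
    out.append(ch)
    ch = random.choice(nxt[ch])       # draw the next letter given this one
generated = "".join(out)
```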
Another way to simulate the randomness of English text is to use word
approximation.
1. 1st-order word approximation.
Example: REPRESENTING AND SPEEDILY IS AN GOOD APT OR
COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE
TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MES
SAGE HAD BE THESE.
2. 2nd-order word approximation.
Example: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF
WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
From the above results, it is apparent that the approximations resemble English more and more closely as the order increases.

Using the above models, we can then compute the empirical entropy rate of English text.
order of the letter approximation model | entropy rate
zero-order                              | log2(27) = 4.76 bits per letter
1st-order                               | 4.03 bits per letter
4th-order                               | 2.8 bits per letter
One final remark on the Markov estimate of English statistics is that the results are not only useful in compression but also helpful in decryption.
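The zero- and first-order estimates in the table can be reproduced in miniature from any text sample. The snippet below uses a toy sample, so it only illustrates the computation, not the quoted figures:

```python
import math
from collections import Counter

# Zero-order estimate H0 (equiprobable 27-symbol alphabet) versus a
# first-order estimate H1 from empirical letter frequencies; the sample
# text is an illustrative placeholder, far too short for a real estimate.
text = "it is a truth universally acknowledged that a single man in "

freq = Counter(text)
n = len(text)
H0 = math.log2(27)
H1 = -sum(c / n * math.log2(c / n) for c in freq.values())

assert abs(H0 - 4.755) < 0.01     # matches the table's zero-order entry
assert 0 < H1 < H0                # nonuniform frequencies lower the estimate
```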
3.4.2 Gambling estimate of the entropy rate of English text

In this section, we will show that a good gambler is also a good data compressor!

A) Sequential gambling

Given an observed sequence of letters,

$$x_1, x_2, \ldots, x_k,$$

a gambler needs to bet on the next letter $X_{k+1}$ (which is, at this point, a random variable) with all the money in hand. It is not necessary for him to put all the money on the same outcome (there are 27 of them, i.e., 26 letters plus space). For example, he is allowed to place part of his money on one possible outcome and the rest of the money on another. The only constraint is that he should bet all his money.

Let $b(x_{k+1}|x_1,\ldots,x_k)$ be the proportion of his money bet on the letter $x_{k+1}$, and assume that at first, the amount of money the gambler has is 1. Then

$$\sum_{x_{k+1}\in\mathcal{X}} b(x_{k+1}|x_1,\ldots,x_k) = 1,$$

and

$$(\forall\,x_{k+1}\in\{\text{a},\text{b},\ldots,\text{z},\text{SPACE}\})\qquad b(x_{k+1}|x_1,\ldots,x_k) \ge 0.$$

When the next letter appears, the gambler is paid 27 times the bet on that letter.
Let $S_n$ be the wealth of the gambler after $n$ bets. Then

$$S_1 = 27\,b_1(x_1),\qquad S_2 = 27\,b_2(x_2|x_1)\,S_1,\qquad S_3 = 27\,b_3(x_3|x_1,x_2)\,S_2,\qquad\ldots$$

so that

$$S_n = 27^n\prod_{k=1}^{n} b_k(x_k|x_1,\ldots,x_{k-1}) = 27^n\,b(x_1,\ldots,x_n),$$

where

$$b(x^n) = b(x_1,\ldots,x_n) \triangleq \prod_{k=1}^{n} b_k(x_k|x_1,\ldots,x_{k-1}).$$
We now wish to show that high values of $S_n$ lead to high data compression. Specifically, if a gambler with some gambling policy attains wealth $S_n$, the description of the data can be shortened by up to $\log_2 S_n$ bits.
Lemma 3.12 If a sequential gambling policy results in wealth $S_n$, then there exists a data compression code for the English-text source $X^n$ which yields an average codeword length smaller than

$$\log_2(27) - \frac1n E[\log_2 S_n] + \frac2n \text{ bits},$$

where $X^n$ represents the random variables of the $n$ bet outcomes.
Proof:

1. Ordering: Index the English letters as

$$\mathrm{index}(\text{a}) = 0,\quad \mathrm{index}(\text{b}) = 1,\quad \mathrm{index}(\text{c}) = 2,\quad\ldots,\quad \mathrm{index}(\text{z}) = 25,\quad \mathrm{index}(\text{SPACE}) = 26.$$

For any two sequences $x^n$ and $\tilde x^n$ in $\mathcal{X}^n$, where $\mathcal{X} \triangleq \{\text{a},\text{b},\ldots,\text{z},\text{SPACE}\}$, we say $x^n \le \tilde x^n$ if

$$\sum_{i=1}^{n}\mathrm{index}(x_i)\,27^{\,i-1} \le \sum_{i=1}^{n}\mathrm{index}(\tilde x_i)\,27^{\,i-1}.$$

2. Shannon-Fano-Elias coder: Apply the Shannon-Fano-Elias coder (cf. Section 4.3.3 of Volume I of the lecture notes) to the gambling policy $b(x^n)$, viewed as a probability distribution, according to the ordering defined in step 1. Then the codeword length for $x^n$ is

$$\left\lceil\log_2\frac{1}{b(x^n)}\right\rceil + 1 \text{ bits} \ \le\ -\log_2 b(x^n) + 2 \text{ bits}.$$

3. Data compression: Now observe that

$$E[\log_2 S_n] = E[\log_2(27^n\,b(X^n))] = n\log_2(27) + E[\log_2 b(X^n)].$$

Hence, the average codeword length is upper bounded by

$$\frac1n\left(\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\left(-\log_2 b(x^n)\right) + 2\right) = -\frac1n E[\log_2 b(X^n)] + \frac2n = \log_2(27) - \frac1n E[\log_2 S_n] + \frac2n. \qquad\Box$$
According to the concept behind the source coding theorem, the entropy rate of the English text should be upper bounded by the average codeword length of any (variable-length) code, which in turn is bounded above by

$$\log_2(27) - \frac1n E[\log_2 S_n] + \frac2n.$$

Equipped with the proportional gambling model, one can bound the entropy rate of English text via a properly designed gambling policy. An experiment using the book Jefferson the Virginian by Dumas Malone as the database resulted in an estimate of 1.34 bits per letter for the entropy rate of English.
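The link between log-wealth and compression can be simulated: with proportional betting $b(x) = P(x)$ on an i.i.d. 27-symbol source, the quantity $\log_2 27 - (1/n)E[\log_2 S_n]$ approaches the source entropy in bits. The Zipf-like distribution, sample sizes, and seed below are illustrative stand-ins, not English statistics:

```python
import math
import random

# Monte-Carlo sketch: proportional betting on an i.i.d. 27-symbol source.
random.seed(7)
A = 27
w = [1.0 / (i + 1) for i in range(A)]           # Zipf-like, illustrative
P = [x / sum(w) for x in w]
H = -sum(p * math.log2(p) for p in P)           # entropy in bits/symbol

n, trials = 2000, 50
total = 0.0
for _ in range(trials):
    xs = random.choices(range(A), weights=P, k=n)
    total += sum(math.log2(A * P[x]) for x in xs)   # log2 of wealth S_n

bound = math.log2(A) - total / (trials * n)     # Lemma 3.12 without the 2/n
assert abs(bound - H) < 0.1
```

With proportional betting the bound is tight in expectation, which is why a good sequential predictor of English doubles as a good compressor.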
3.5 Lempel-Ziv code revisited

In Section 4.3.4 of Volume I of the lecture notes, we introduced the famous Lempel-Ziv coder and stated that the coder is universally good for stationary sources. In this section, we will establish the concept behind this claim.

For simplicity, we assume that the source alphabet is binary, i.e., $\mathcal{X} = \{0,1\}$. The optimality of the Lempel-Ziv code can actually be extended to any stationary source with finite alphabet.
Lemma 3.13 The number $c(n)$ of distinct strings in the Lempel-Ziv parsing of a binary sequence of length $n$ satisfies

$$\sqrt{2n} - 1 \le c(n) \le \frac{2n}{\log_2 n},$$

where the upper bound holds for $n \ge 2^{13}$, and the lower bound is valid for every $n$.

Proof: The upper bound can be proved as follows. For fixed $n$, the number of distinct strings is maximized when all the phrases are as short as possible. Hence, in the extreme case,

$$n = \underbrace{1 + 2 + 2 + 3 + 3 + 3 + 3 + \cdots}_{c(n)\text{ of them}},$$

which implies that

$$\sum_{j=1}^{k+1} j\,2^j \ \ge\ n\ \ge\ \sum_{j=1}^{k} j\,2^j, \tag{3.5.1}$$

where $k$ is the integer satisfying

$$2^{k+1} - 1 > c(n) \ge 2^k - 1. \tag{3.5.2}$$

Observe that

$$\sum_{j=1}^{k+1} j\,2^j = k\,2^{k+2} + 2 \qquad\text{and}\qquad \sum_{j=1}^{k} j\,2^j = (k-1)\,2^{k+1} + 2.$$

Now from (3.5.1), we obtain for $k \ge 7$ (which will be justified later by $n \ge 2^{13}$),

$$n \le k\,2^{k+2} + 2 < 2^{2(k-1)} \qquad\text{and}\qquad n \ge (k-1)\,2^{k+1} + 2 \ge (k-1)\,2^{k+1}.$$

The proof of the upper bound is then completed by noting that the maximum $c(n)$ for fixed $n$ satisfies

$$c(n) < 2^{k+1} - 1 \le 2^{k+1} \le \frac{n}{k-1} \le \frac{2n}{\log_2(n)}.$$

Again for the lower bound, we note that the number of distinct strings is minimized when all the phrases are as long as possible. Hence, in the extreme case,

$$n = \underbrace{1 + 2 + 3 + 4 + \cdots}_{c(n)\text{ of them}} = \sum_{j=1}^{c(n)} j = \frac{c(n)[c(n)+1]}{2} \le \frac{[c(n)+1]^2}{2},$$

which implies that

$$c(n) \ge \sqrt{2n} - 1.$$

Note that when $n \ge 2^{13}$, $c(n) \ge \sqrt{2n} - 1 \ge 2^7 - 1$, which implies that the assumption $k \ge 7$ in (3.5.2) is valid for $n \ge 2^{13}$. $\Box$
The condition $n \ge 2^{13}$ corresponds to compressing a binary file larger than 1K bytes, which, in practice, is a frequently encountered situation. Since what concerns us is the asymptotic optimality of the coder as $n$ goes to infinity, the condition $n \ge 2^{13}$ certainly becomes insignificant in such consideration.
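Lemma 3.13 is easy to probe empirically with an incremental parsing that cuts a phrase as soon as it has not been seen before, in the Lempel-Ziv style (the sequence length and seed below are arbitrary choices):

```python
import math
import random

# Count the number c(n) of parsed phrases and check Lemma 3.13's bounds.
def lz_phrase_count(bits):
    phrases, cur = set(), ""
    for b in bits:
        cur += b
        if cur not in phrases:      # cut the phrase once it is new
            phrases.add(cur)
            cur = ""
    return len(phrases) + (1 if cur else 0)   # count a trailing partial phrase

random.seed(3)
n = 1 << 14                          # n >= 2**13, so both bounds apply
bits = "".join(random.choice("01") for _ in range(n))
c = lz_phrase_count(bits)
assert math.sqrt(2 * n) - 1 <= c <= 2 * n / math.log2(n)
```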
Lemma 3.14 (entropy upper bound by a function of its mean) A nonnegative integer-valued source $X$ with mean $\mu$ and entropy $H(X)$ satisfies

$$H(X) \le (\mu+1)\log_2(\mu+1) - \mu\log_2\mu.$$

Proof: The lemma follows directly from the result that the geometric distribution maximizes the entropy of a nonnegative integer-valued source with given mean, which is proved as follows.

For the geometric distribution with mean $\mu$,

$$P_Z(z) = \frac{1}{1+\mu}\left(\frac{\mu}{1+\mu}\right)^{z}, \quad\text{for } z = 0,1,2,\ldots,$$

its entropy is

$$H(Z) = -\sum_{z=0}^{\infty} P_Z(z)\log_2 P_Z(z) = \sum_{z=0}^{\infty} P_Z(z)\left[\log_2(1+\mu) + z\log_2\frac{1+\mu}{\mu}\right] = \sum_{z=0}^{\infty} P_X(z)\left[\log_2(1+\mu) + z\log_2\frac{1+\mu}{\mu}\right],$$

where the last equality holds for any nonnegative integer-valued source $X$ with mean $\mu$ (both sides equal $(\mu+1)\log_2(\mu+1) - \mu\log_2\mu$). So,

$$H(X) - H(Z) = \sum_{x=0}^{\infty} P_X(x)\left[-\log_2 P_X(x) + \log_2 P_Z(x)\right] = -\sum_{x=0}^{\infty} P_X(x)\log_2\frac{P_X(x)}{P_Z(x)} = -D(P_X\|P_Z) \le 0,$$

with equality if, and only if, $X \sim Z$. $\Box$
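Lemma 3.14 can be spot-checked on any nonnegative integer-valued distribution; the one below is an arbitrary example:

```python
import math

# H(X) <= (mu+1) log2(mu+1) - mu log2(mu) for an illustrative distribution.
P = {0: 0.4, 1: 0.3, 2: 0.2, 5: 0.1}
mu = sum(x * p for x, p in P.items())
H = -sum(p * math.log2(p) for p in P.values())
bound = (mu + 1) * math.log2(mu + 1) - mu * math.log2(mu)
assert H <= bound
```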
Before the introduction of the main theorems, we address some notation used in their proofs.

Given the source sequence

$$x_{-(k-1)},\ldots,x_{-1},x_0,x_1,\ldots,x_n,$$

suppose $x_1,\ldots,x_n$ is Lempel-Ziv-parsed into $c$ distinct strings, $y_1,\ldots,y_c$. Let $\nu_i$ be the location of the first bit of $y_i$, i.e.,

$$y_i = (x_{\nu_i},\ldots,x_{\nu_{i+1}-1}).$$

1. Define $s_i = (x_{\nu_i-k},\ldots,x_{\nu_i-1})$ as the $k$ bits preceding $y_i$.

2. Define $c_{\ell,s}$ as the number of strings among $y_1,\ldots,y_c$ with length $\ell$ and preceding state $s$.

3. Define $Q_k$ as the $k$th-order Markov approximation of the stationary source $X$, i.e.,

$$Q_k(x_1,\ldots,x_n|x_0,\ldots,x_{-(k-1)}) \triangleq P_{X_1^n|X_{-(k-1)}^0}(x_1,\ldots,x_n|x_{-(k-1)},\ldots,x_0),$$

where $P_{X_1^n|X_{-(k-1)}^0}$ is the true (stationary) distribution of the source.

For ease of understanding, these notations are graphically illustrated in Figure 3.3. It is easy to verify that

$$\sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s} = c \qquad\text{and}\qquad \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} \ell\,c_{\ell,s} = n. \tag{3.5.3}$$

Lemma 3.15 For any Lempel-Ziv parsing of the source sequence $x_1\cdots x_n$, we have

$$\log_2 Q_k(x_1,\ldots,x_n|s) \le -\sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\log_2 c_{\ell,s}. \tag{3.5.4}$$
Figure 3.3: Notation used in the Lempel-Ziv coder: each parsed phrase $y_i$ is preceded by its state $s_i$, the $k$ bits immediately before it.
Proof: By the $k$th-order Markov property of $Q_k$,

$$\begin{aligned}
\log_2 Q_k(x_1,\ldots,x_n|s) &= \log_2 Q_k(y_1,\ldots,y_c|s)\\
&= \log_2\left(\prod_{i=1}^{c} P_{X_{\nu_i}^{\nu_{i+1}-1}|X_{-(k-1)}^{0}}(y_i|s_i)\right)\\
&= \sum_{i=1}^{c}\log_2 P_{X_{\nu_i}^{\nu_{i+1}-1}|X_{-(k-1)}^{0}}(y_i|s_i)\\
&= \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k}\ \sum_{i\,:\,\nu_{i+1}-\nu_i=\ell\ \text{and}\ s_i=s}\log_2 P(y_i|s_i)\\
&= \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\left(\frac{1}{c_{\ell,s}}\sum_{i\,:\,\nu_{i+1}-\nu_i=\ell\ \text{and}\ s_i=s}\log_2 P(y_i|s_i)\right)\\
&\le \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\log_2\left(\sum_{i\,:\,\nu_{i+1}-\nu_i=\ell\ \text{and}\ s_i=s}\frac{1}{c_{\ell,s}}P(y_i|s_i)\right) \tag{3.5.5}\\
&\le \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\log_2\left(\frac{1}{c_{\ell,s}}\right), \tag{3.5.6}
\end{aligned}$$

where $P(y_i|s_i)$ abbreviates $P_{X_{\nu_i}^{\nu_{i+1}-1}|X_{-(k-1)}^{0}}(y_i|s_i)$, (3.5.5) follows from Jensen's inequality and the concavity of $\log_2(\cdot)$, and (3.5.6) holds since the probability sum of distinct strings is no greater than one. $\Box$
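Ziv's inequality (Lemma 3.15) can be verified numerically in the memoryless case $k = 0$, where $Q_0$ is simply the product of single-letter probabilities and the state $s$ plays no role; the bias $p$, the length $n$, and the seed are arbitrary choices:

```python
import math
import random
from collections import Counter

# Check log2 Q_0(parsed portion) <= -sum_l c_l log2 c_l on a random sequence.
random.seed(11)
p, n = 0.3, 5000
bits = [1 if random.random() < p else 0 for _ in range(n)]

phrases, cur, consumed = set(), (), []
for b in bits:
    cur += (b,)
    if cur not in phrases:          # distinct-phrase parsing rule
        phrases.add(cur)
        consumed.extend(cur)        # bits covered by complete phrases
        cur = ()

c_l = Counter(len(ph) for ph in phrases)     # phrases per length l
log_q = sum(math.log2(p if b == 1 else 1 - p) for b in consumed)
rhs = -sum(cnt * math.log2(cnt) for cnt in c_l.values())
assert log_q <= rhs
```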
Theorem 3.16 Fix a stationary source $X$. Given any observations $x_1, x_2, x_3, \ldots$,

$$\limsup_{n\to\infty}\frac{c\log_2 c}{n} \le H(X_{k+1}|X_k,\ldots,X_1),$$

for any integer $k$, where $c = c(n)$ is the number of distinct Lempel-Ziv-parsed strings of $x_1, x_2, \ldots, x_n$.
Proof: Lemma 3.15 gives that

$$-\log_2 Q_k(x_1,\ldots,x_n|s) \ \ge\ \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\log_2 c_{\ell,s} = \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\log_2\frac{c_{\ell,s}}{c} + \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k} c_{\ell,s}\log_2 c = c\sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k}\frac{c_{\ell,s}}{c}\log_2\frac{c_{\ell,s}}{c} + c\log_2 c. \tag{3.5.7}$$

Denote by $L$ and $S$ the random variables with joint distribution

$$P_{L,S}(\ell,s) = \frac{c_{\ell,s}}{c},$$

for which the sum-to-one property of the distribution is justified by (3.5.3). Also from (3.5.3), we have

$$E[L] = \sum_{\ell=1}^{n}\sum_{s\in\mathcal{X}^k}\ell\,\frac{c_{\ell,s}}{c} = \frac{n}{c}.$$

Therefore, by the independence bound for entropy (cf. Theorem 2.19 in Volume I of the lecture notes) and Lemma 3.14, we get

$$\begin{aligned}
H(L,S) &\le H(L) + H(S)\\
&\le (E[L]+1)\log_2(E[L]+1) - E[L]\log_2 E[L] + \log_2|\mathcal{X}|^k\\
&= \left[\left(\frac{n}{c}+1\right)\log_2\left(\frac{n}{c}+1\right) - \frac{n}{c}\log_2\frac{n}{c}\right] + k\\
&= \log_2\left(\frac{n}{c}+1\right) + \frac{n}{c}\log_2\left(\frac{n/c+1}{n/c}\right) + k\\
&= \log_2\left(\frac{n}{c}+1\right) + \frac{n}{c}\log_2\left(1+\frac{c}{n}\right) + k,
\end{aligned}$$

which, together with Lemma 3.13 ($c \le 2n/\log_2 n$) and the fact that $u\mapsto u\log_2(1/u+1)$ is increasing, implies that for $n \ge 2^{13}$,

$$\frac{c}{n}H(L,S) \le \frac{c}{n}\log_2\left(\frac{n}{c}+1\right) + \log_2\left(1+\frac{c}{n}\right) + \frac{c}{n}\,k \le \frac{2}{\log_2 n}\log_2\left(\frac{\log_2 n}{2}+1\right) + \log_2\left(1+\frac{2}{\log_2 n}\right) + \frac{2}{\log_2 n}\,k,$$

which vanishes as $n\to\infty$. Finally, we can rewrite (3.5.7) as

$$\frac{c\log_2 c}{n} \le -\frac1n\log_2 Q_k(x_1,\ldots,x_n|s) + \frac{c}{n}H(L,S).$$

As a consequence, by taking the expectation with respect to $X_{-(k-1)}^{n}$ on both sides of the above inequality, we obtain

$$\limsup_{n\to\infty}\frac{c\log_2 c}{n} \le \limsup_{n\to\infty}\left(-\frac1n E\left[\log_2 Q_k(X_1,\ldots,X_n|X_{-(k-1)},\ldots,X_0)\right]\right) = H(X_{k+1}|X_k,\ldots,X_1). \qquad\Box$$
Theorem 3.17 (main result) Let $\ell(x_1,\ldots,x_n)$ be the Lempel-Ziv codeword length of an observed sequence $x_1,\ldots,x_n$, which is drawn from a stationary source $X$. Then

$$\limsup_{n\to\infty}\frac1n\ell(x_1,\ldots,x_n) \le \lim_{n\to\infty}\frac1n H(X^n).$$

Proof: Let $c(n)$ be the number of parsed distinct strings; then

$$\frac1n\ell(x_1,\ldots,x_n) = \frac1n c(n)\left(\log_2 c(n)+1\right) \le \frac1n c(n)\left(\log_2 c(n)+2\right) = \frac{c(n)\log_2 c(n)}{n} + 2\,\frac{c(n)}{n}.$$

From Lemma 3.13, we have

$$\limsup_{n\to\infty}\frac{c(n)}{n} \le \limsup_{n\to\infty}\frac{2}{\log_2 n} = 0.$$

From Theorem 3.16, we have for any integer $k$,

$$\limsup_{n\to\infty}\frac{c(n)\log_2 c(n)}{n} \le H(X_{k+1}|X_k,\ldots,X_1).$$

Hence,

$$\limsup_{n\to\infty}\frac1n\ell(x_1,\ldots,x_n) \le H(X_{k+1}|X_k,\ldots,X_1)$$

for any integer $k$. The theorem is completed by applying Theorem 4.10 in Volume I of the lecture notes, which states that for a stationary source $X$, the entropy rate always exists and is equal to

$$\lim_{n\to\infty}\frac1n H(X^n) = \lim_{k\to\infty} H(X_{k+1}|X_k,\ldots,X_1). \qquad\Box$$
We summarize the discussion of this section in the next corollary.

Corollary 3.18 The Lempel-Ziv coder asymptotically achieves the entropy rate of any (unknown) stationary source.

The Lempel-Ziv code is often used in practice to compress data which cannot be characterized by a simple statistical model, such as English text or computer source code. It is simple to implement and has an asymptotic rate approaching the entropy rate (if it exists) of the source, which is known to be the lower bound on the lossless data compression code rate. This code can be used without knowledge of the source distribution, provided the source is stationary. Some well-known examples of its implementation are the compress program in UNIX and the arc program in DOS, which typically compress ASCII text files by about a factor of 2.
Bibliography
[1] F. Alajaji and T. Fuja, "A communication channel modeled on contagion," IEEE Trans. on Information Theory, vol. IT-40, no. 6, pp. 2035–2041, Nov. 1994.

[2] P.-N. Chen and F. Alajaji, "Generalized source coding theorems and hypothesis testing," Journal of the Chinese Institute of Engineers, vol. 21, no. 3, pp. 283–303, May 1998.

[3] T. S. Han, Information-Spectrum Methods in Information Theory (in Japanese). Tokyo: Baifukan Press, 1998.

[4] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. on Information Theory, vol. IT-39, no. 3, pp. 752–772, May 1993.

[5] Y. Steinberg and S. Verdú, "Simulation of random processes and rate-distortion theory," IEEE Trans. on Information Theory, vol. IT-42, no. 1, pp. 63–86, Jan. 1996.
Chapter 4
Measure of Randomness for Stochastic
Processes
In the previous chapter, it was shown that the sup-entropy rate is indeed the minimum lossless data compression ratio achievable by block codes. Hence, finding an optimal block code becomes a well-defined mission, since for any source with a well-formulated statistical model, the sup-entropy rate can be computed, and this quantity can be used as a criterion to evaluate the optimality of a designed block code.

In recent work, Verdú and Han [2] found that, other than being the minimum lossless data compression ratio, the sup-entropy rate actually has another operational meaning, called resolvability. In this chapter, we will explore this new concept in detail.
4.1 Motivation for resolvability: measure of randomness of random variables

In simulations of statistical communication systems, the generation of random variables by a computer algorithm is essential. The computer usually has access to a basic random experiment (through a predefined Application Programming Interface) which generates equally likely random values, such as rand(), which generates a real number uniformly distributed over (0,1). Conceptually, random variables with complex models are more difficult to generate by computer than random variables with simple models. The question is how to quantify the complexity of generating random variables by computers. One way to define such a complexity measure is:

Definition 4.1 The complexity of generating a random variable is defined as the number of random bits that the most efficient algorithm requires in order to generate the random variable by a computer that has access to an equally likely random experiment.
To understand the above definition quantitatively, a simple example is demonstrated below.

Example 4.2 Consider the generation of the random variable with probability masses $P_X(1) = 1/4$, $P_X(0) = 1/2$, and $P_X(-1) = 1/4$. An algorithm is written as:

Flip-a-fair-coin;            \\ one random bit
If Head, then output 0;
else {
    Flip-a-fair-coin;        \\ a second random bit
    If Head, then output 1;
    else output -1;
}

On the average, the above algorithm requires 1.5 coin flips, and in the worst case, 2 coin flips are necessary. Therefore, the complexity measure can take two fundamental forms: worst-case or average-case over the range of outcomes of the random variable. Note that we did not show in the above example that the algorithm is the most efficient one in the sense of using the minimum number of random bits; however, it is indeed an optimal algorithm because it achieves the lower bound on the minimum number of random bits. Later, we will show that such a bound on the average minimum number of random bits required to generate a random variable is the entropy, which is exactly 1.5 bits in the above example. As for the worst-case bound, a new terminology, resolution, will be introduced. As a result, the above algorithm also achieves the lower bound of the worst-case complexity, which is the resolution of the random variable.
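A direct simulation of the algorithm in Example 4.2 confirms both the 1.5-flip average and the 2-flip worst case (the sample size and seed are arbitrary):

```python
import random

# Simulate the two-flip generator for P(0)=1/2, P(1)=P(-1)=1/4.
def generate(rng):
    """Return (outcome, number of coin flips used)."""
    if rng.random() < 0.5:          # first flip: Head -> output 0
        return 0, 1
    # second flip decides between +1 and -1
    return (1, 2) if rng.random() < 0.5 else (-1, 2)

rng = random.Random(5)
outcomes = [generate(rng) for _ in range(200000)]
avg_flips = sum(f for _, f in outcomes) / len(outcomes)
p0 = sum(1 for x, _ in outcomes if x == 0) / len(outcomes)

assert abs(avg_flips - 1.5) < 0.01          # average-case complexity
assert max(f for _, f in outcomes) == 2     # worst-case complexity
assert abs(p0 - 0.5) < 0.01                 # correct distribution
```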
4.2 Notations and definitions regarding resolvability

Definition 4.3 (M-type) For any positive integer M, a probability distribution P is said to be M-type if
$$P(\omega) \in \left\{0, \frac{1}{M}, \frac{2}{M}, \ldots, 1\right\} \quad \text{for all } \omega \in \Omega.$$

Definition 4.4 (resolution of a random variable) The resolution¹ R(X) of a random variable X is the minimum $\log(M)$ such that $P_X$ is M-type. If $P_X$ is not M-type for any integer M, then $R(X) = \infty$.
As revealed previously, a random source needs to be resolved (meaning it can be generated by a computer algorithm with access to equally probable random experiments). As anticipated, a random variable with finite resolution is resolvable by computer algorithms. Yet, it is possible that the resolution of a random variable is infinite. A quick example is the random variable X with distribution $P_X(0) = 1/\pi$ and $P_X(1) = 1 - 1/\pi$. (X is not M-type for any finite M.) In such a case, one can alternatively choose another computer-resolvable random variable, which resembles the true source within some acceptable range, to simulate the original one.

One criterion that can be used as a measure of resemblance of two random variables is the variational distance. For the same example as in the above paragraph, choose a random variable $\tilde{X}$ with distribution $P_{\tilde{X}}(0) = 1/3$ and $P_{\tilde{X}}(1) = 2/3$. Then $\|X - \tilde{X}\| \approx 0.03$, and $\tilde{X}$ is 3-type, which is computer-resolvable².
Definition 4.5 (variational distance) The variational distance (or $\ell_1$-distance) between two distributions P and Q defined on a common measurable space $(\Omega, \mathcal{F})$ is
$$\|P - Q\| \triangleq \sum_{\omega \in \Omega} |P(\omega) - Q(\omega)|.$$

¹ If the base of the logarithmic operation is 2, the resolution is measured in bits; if the natural logarithm is taken, nats becomes the basic measurement unit of resolution.

² A program that generates an M-type random variable for any M satisfying $\log_2(M)$ being a positive integer is straightforward. A program that generates the 3-type $\tilde{X}$ is as follows (in C language).
    even = False;
    while (1) {
        Flip a fair coin;                       /* one random bit */
        if (Head) {
            if (even == True) { output 0; even = False; }
            else              { output 1; even = True;  }
        } else {
            if (even == True) even = False;
            else              even = True;
        }
    }
(Note that an alternative way to formulate the variational distance is
$$\|P - Q\| = 2 \sup_{E \in \mathcal{F}} |P(E) - Q(E)| = 2 \sum_{x \in \mathcal{X}\,:\,P(x) \ge Q(x)} [P(x) - Q(x)].$$
These two definitions are actually equivalent.)
Definition 4.6 (ε-achievable resolution) Fix ε ≥ 0. R is an ε-achievable resolution for input X if for all γ > 0, there exists $\tilde{X}$ satisfying
$$R(\tilde{X}) < R + \gamma \quad \text{and} \quad \|X - \tilde{X}\| < \varepsilon.$$

The ε-achievable resolution reveals the possibility that one can choose another computer-resolvable random variable whose variational distance to the true source is within an acceptable range ε.

Next we define the ε-achievable resolution rate for a sequence of random variables, which is an extension of the ε-achievable resolution defined for a single random variable. Such an extension is analogous to extending the entropy of a single source to the entropy rate of a random source sequence.
Definition 4.7 (ε-achievable resolution rate) Fix ε ≥ 0 and input X. R is an ε-achievable resolution rate³ for input X if for every γ > 0, there exists $\tilde{X}$ satisfying
$$\frac{1}{n} R(\tilde{X}^n) < R + \gamma \quad \text{and} \quad \|X^n - \tilde{X}^n\| < \varepsilon,$$
for all sufficiently large n.
Definition 4.8 (ε-resolvability for X) Fix ε > 0. The ε-resolvability for input X, denoted by $S_\varepsilon(X)$, is
$$S_\varepsilon(X) \triangleq \min\left\{ R : (\forall\,\gamma > 0)\ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \text{ and } \|X^n - \tilde{X}^n\| < \varepsilon \right\}.$$
³ Note that our definition of resolution rate is different from its original form (cf. Definition 7 in [2] and the statements following Definition 7 of the same paper for its modified definition for a specific input X), which involves an arbitrary channel model W. Readers may treat our definition as a special case of theirs over the identity channel.
Here, we define
$$S_\varepsilon(X) \triangleq \min\left\{ R : (\forall\,\gamma > 0)(\exists\,\tilde{X}\text{ and }N)(\forall\,n > N)\ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \text{ and } \|X^n - \tilde{X}^n\| < \varepsilon \right\}.$$
A similar convention will be applied throughout the rest of this chapter.
Definition 4.9 (resolvability for X) The resolvability for input X, denoted by S(X), is
$$S(X) \triangleq \lim_{\varepsilon \downarrow 0} S_\varepsilon(X).$$

From the definition, the ε-resolvability is obviously non-increasing in ε. Hence, the resolvability can also be defined using the supremum operation as
$$S(X) = \sup_{\varepsilon > 0} S_\varepsilon(X).$$

The resolvability is pertinent to the worst-case complexity measure for random variables (cf. Example 4.2 and the discussion following it). With the entropy function, information theorists also define the ε-mean-resolvability and mean-resolvability for input X, which characterize the average-case complexity of random variables.
Definition 4.10 (ε-mean-achievable resolution rate) Fix ε ≥ 0. R is an ε-mean-achievable resolution rate for input X if for all γ > 0, there exists $\tilde{X}$ satisfying
$$\frac{1}{n} H(\tilde{X}^n) < R + \gamma \quad \text{and} \quad \|X^n - \tilde{X}^n\| < \varepsilon,$$
for all sufficiently large n.
Definition 4.11 (ε-mean-resolvability for X) Fix ε > 0. The ε-mean-resolvability for input X, denoted by $\bar{S}_\varepsilon(X)$, is
$$\bar{S}_\varepsilon(X) \triangleq \min\left\{ R : (\forall\,\gamma>0)(\exists\,\tilde{X}\text{ and }N)(\forall\,n>N)\ \frac{1}{n}H(\tilde{X}^n) < R + \gamma \text{ and } \|X^n - \tilde{X}^n\| < \varepsilon \right\}.$$
Definition 4.12 (mean-resolvability for X) The mean-resolvability for input X, denoted by $\bar{S}(X)$, is
$$\bar{S}(X) \triangleq \lim_{\varepsilon \downarrow 0} \bar{S}_\varepsilon(X) = \sup_{\varepsilon > 0} \bar{S}_\varepsilon(X).$$
The only difference between resolvability and mean-resolvability is that the former employs the resolution function, while the latter replaces it by the entropy function. Since entropy is the minimum average codeword length for uniquely decodable codes, an interpretation of mean-resolvability is that the new random variable $\tilde{X}$ can be resolved by realizing the optimal variable-length code for it; one can think of the probability mass of each outcome of $\tilde{X}$ as approximately $2^{-\ell}$, where $\ell$ is the length of its codeword.
For any two distributions P and Q and any δ > 0,
$$\begin{aligned}
\|P - Q\| &= 2 \sum_{x \in \mathcal{X}:\, \log[P(x)/Q(x)] \ge 0} [P(x) - Q(x)] \\
&= 2 \left( \sum_{x:\, \log[P(x)/Q(x)] > \delta} [P(x) - Q(x)] + \sum_{x:\, \delta \ge \log[P(x)/Q(x)] \ge 0} [P(x) - Q(x)] \right) \\
&\le 2 \left( \sum_{x:\, \log[P(x)/Q(x)] > \delta} P(x) + \sum_{x:\, \delta \ge \log[P(x)/Q(x)] \ge 0} P(x)\left(1 - \frac{Q(x)}{P(x)}\right) \right) \\
&\le 2 \left( P\left\{x \in \mathcal{X} : \log\frac{P(x)}{Q(x)} > \delta\right\} + \sum_{x:\, \delta \ge \log[P(x)/Q(x)] \ge 0} P(x)\,\log\frac{P(x)}{Q(x)} \right) \quad \text{(by the fundamental inequality)} \\
&\le 2 \left( P\left\{x \in \mathcal{X} : \log\frac{P(x)}{Q(x)} > \delta\right\} + \delta\, P\left\{x \in \mathcal{X} : \delta \ge \log\frac{P(x)}{Q(x)} \ge 0\right\} \right) \\
&\le 2 \left( P\left\{x \in \mathcal{X} : \log\frac{P(x)}{Q(x)} > \delta\right\} + \delta \right). \qquad \Box
\end{aligned}$$
Lemma 4.15
$$P_{\tilde{X}^n}\left\{ x^n \in \mathcal{X}^n : -\frac{1}{n}\log P_{\tilde{X}^n}(x^n) \le \frac{1}{n}R(\tilde{X}^n) \right\} = 1,$$
for every n.

Proof: By definition of $R(\tilde{X}^n)$,
$$P_{\tilde{X}^n}(x^n) \ge \exp\{-R(\tilde{X}^n)\}$$
for all $x^n \in \mathcal{X}^n$ with positive probability. Hence, for all such $x^n \in \mathcal{X}^n$,
$$-\frac{1}{n}\log P_{\tilde{X}^n}(x^n) \le \frac{1}{n}R(\tilde{X}^n).$$
The lemma then holds. □
Theorem 4.16 The resolvability for input X is equal to its sup-entropy rate, i.e.,
$$S(X) = \bar{H}(X).$$

Proof:

1. $S(X) \ge \bar{H}(X)$.

It suffices to show that $S(X) < \bar{H}(X)$ contradicts Lemma 4.15.

Suppose $S(X) < \bar{H}(X)$. Then there exists ε > 0 such that
$$S(X) + \varepsilon < \bar{H}(X).$$
Let
$$\mathcal{T}_0 \triangleq \left\{ x^n \in \mathcal{X}^n : -\frac{1}{n}\log P_{X^n}(x^n) \ge S(X) + \varepsilon \right\}.$$
By definition of $\bar{H}(X)$,
$$\limsup_{n\to\infty} P_{X^n}(\mathcal{T}_0) > 0.$$
Therefore, there exists δ > 0 such that
$$\limsup_{n\to\infty} P_{X^n}(\mathcal{T}_0) > \delta,$$
which immediately implies
$$P_{X^n}(\mathcal{T}_0) > \delta \quad \text{infinitely often in } n.$$
Select $0 < \gamma < \min\{\delta^2, 1\}$ and choose $\tilde{X}^n$ to satisfy
$$\frac{1}{n}R(\tilde{X}^n) < S(X) + \frac{\varepsilon}{2} \quad \text{and} \quad \|X^n - \tilde{X}^n\| < \gamma$$
for sufficiently large n. Define
$$\mathcal{T}_1 \triangleq \left\{ x^n \in \mathcal{X}^n : P_{X^n}(x^n) > 0 \text{ and } \frac{\left|P_{X^n}(x^n) - P_{\tilde{X}^n}(x^n)\right|}{P_{X^n}(x^n)} \le \sqrt{\gamma} \right\}.$$
Then
$$\begin{aligned}
P_{X^n}(\mathcal{T}_1^c) &= P_{X^n}\left\{ x^n \in \mathcal{X}^n : P_{X^n}(x^n) = 0 \text{ or } \left|P_{X^n}(x^n) - P_{\tilde{X}^n}(x^n)\right| > \sqrt{\gamma}\, P_{X^n}(x^n) \right\} \\
&\le P_{X^n}\{ x^n \in \mathcal{X}^n : P_{X^n}(x^n) = 0 \} + P_{X^n}\left\{ x^n \in \mathcal{X}^n : \left|P_{X^n}(x^n) - P_{\tilde{X}^n}(x^n)\right| > \sqrt{\gamma}\, P_{X^n}(x^n) \right\} \\
&= P_{X^n}\left\{ x^n \in \mathcal{X}^n : \left|P_{X^n}(x^n) - P_{\tilde{X}^n}(x^n)\right| > \sqrt{\gamma}\, P_{X^n}(x^n) \right\} \\
&= \sum_{x^n \in \mathcal{X}^n:\ P_{X^n}(x^n) < (1/\sqrt{\gamma})\left|P_{X^n}(x^n) - P_{\tilde{X}^n}(x^n)\right|} P_{X^n}(x^n) \\
&\le \sum_{x^n \in \mathcal{X}^n} \frac{1}{\sqrt{\gamma}}\left|P_{X^n}(x^n) - P_{\tilde{X}^n}(x^n)\right| \\
&= \frac{\|X^n - \tilde{X}^n\|}{\sqrt{\gamma}} < \frac{\gamma}{\sqrt{\gamma}} = \sqrt{\gamma} < \delta.
\end{aligned}$$
Consider that
$$P_{X^n}(\mathcal{T}_1 \cap \mathcal{T}_0) \ge P_{X^n}(\mathcal{T}_0) - P_{X^n}(\mathcal{T}_1^c) > \delta - \sqrt{\gamma} > 0, \qquad (4.3.1)$$
which holds infinitely often in n; and every $x_0^n$ in $\mathcal{T}_1 \cap \mathcal{T}_0$ satisfies
$$P_{\tilde{X}^n}(x_0^n) \ge (1 - \sqrt{\gamma})\, P_{X^n}(x_0^n)$$
and
$$\begin{aligned}
-\frac{1}{n}\log P_{\tilde{X}^n}(x_0^n) &\ge -\frac{1}{n}\log P_{X^n}(x_0^n) + \frac{1}{n}\log\frac{1}{1+\sqrt{\gamma}} \\
&\ge (S(X) + \varepsilon) + \frac{1}{n}\log\frac{1}{1+\sqrt{\gamma}} \\
&\ge S(X) + \frac{\varepsilon}{2},
\end{aligned}$$
for $n > (2/\varepsilon)\log(1+\sqrt{\gamma})$. Consequently, for infinitely many n,
$$\begin{aligned}
P_{\tilde{X}^n}\left\{ x^n \in \mathcal{X}^n : -\frac{1}{n}\log P_{\tilde{X}^n}(x^n) > S(X) + \frac{\varepsilon}{2} \right\}
&\ge P_{\tilde{X}^n}(\mathcal{T}_1 \cap \mathcal{T}_0) \\
&\ge (1 - \gamma^{1/2})\, P_{X^n}(\mathcal{T}_1 \cap \mathcal{T}_0) \\
&> 0,
\end{aligned}$$
which contradicts the result of Lemma 4.15, because $(1/n)R(\tilde{X}^n) < S(X) + \varepsilon/2$ forces this probability to be zero.
2. $S(X) \le \bar{H}(X)$.

It suffices to show, for arbitrary γ > 0, the existence of $\tilde{X}$ such that
$$\lim_{n\to\infty} \|X^n - \tilde{X}^n\| = 0$$
and $\tilde{X}^n$ is an M-type distribution with
$$M = \left\lceil \exp\left\{ n\left(\bar{H}(X) + \gamma\right) \right\} \right\rceil.$$
Let $\tilde{X}^n = \tilde{X}^n(X^n)$ be uniformly distributed over a set
$$\mathcal{C} \triangleq \{ U_j \in \mathcal{X}^n : j = 1, \ldots, M \},$$
which is drawn randomly (independently) according to $P_{X^n}$. For μ > 0, define
$$\mathcal{T} \triangleq \left\{ x^n \in \mathcal{X}^n : -\frac{1}{n}\log P_{X^n}(x^n) > \bar{H}(X) + \gamma + \frac{\mu}{n} \right\}.$$
For each $\mathcal{C}$ chosen, we obtain from Lemma 4.14 that
$$\begin{aligned}
\|X^n - \tilde{X}^n\| &\le 2\mu + 2\, P_{\tilde{X}^n}\left\{ x^n \in \mathcal{X}^n : \log\frac{P_{\tilde{X}^n}(x^n)}{P_{X^n}(x^n)} > \mu \right\} \\
&= 2\mu + 2\, P_{\tilde{X}^n}\left\{ x^n \in \mathcal{C} : \log\frac{1/M}{P_{X^n}(x^n)} > \mu \right\} \quad (\text{since } P_{\tilde{X}^n}(\mathcal{C}^c) = 0) \\
&\le 2\mu + 2\, P_{\tilde{X}^n}\left\{ x^n \in \mathcal{C} : -\frac{1}{n}\log P_{X^n}(x^n) > \bar{H}(X) + \gamma + \frac{\mu}{n} \right\} \\
&= 2\mu + 2\, P_{\tilde{X}^n}(\mathcal{C} \cap \mathcal{T}) \\
&= 2\mu + \frac{2}{M}\left| \mathcal{C} \cap \mathcal{T} \right|.
\end{aligned}$$
Since $\mathcal{C}$ is chosen randomly, we can take the expectation (with respect to the random $\mathcal{C}$) of the above inequality to obtain
$$E\left[ \|X^n - \tilde{X}^n\| \right] \le 2\mu + \frac{2}{M} E\left[ |\mathcal{C} \cap \mathcal{T}| \right].$$
Observe that each $U_j$ is either in $\mathcal{T}$ or not, and contributes weight 1/M when it is in $\mathcal{T}$. From the i.i.d. assumption on $\{U_j\}_{j=1}^M$, we can then evaluate $(1/M)E[|\mathcal{C} \cap \mathcal{T}|]$ by⁴
$$\frac{1}{M} E\left[ |\mathcal{C} \cap \mathcal{T}| \right]
= \sum_{k=0}^{M} \frac{k}{M}\binom{M}{k} P_{X^n}[\mathcal{T}]^{k}\, P_{X^n}[\mathcal{T}^c]^{M-k}
= \frac{1}{M}\left( M\, P_{X^n}[\mathcal{T}] \right)
= P_{X^n}[\mathcal{T}].$$
Hence,
$$\limsup_{n\to\infty} E\left[ \|X^n - \tilde{X}^n\| \right] \le 2\mu + 2\limsup_{n\to\infty} P_{X^n}[\mathcal{T}] = 2\mu,$$
which implies
$$\limsup_{n\to\infty} E\left[ \|X^n - \tilde{X}^n\| \right] = 0 \qquad (4.3.2)$$
since μ can be chosen arbitrarily small. (4.3.2) therefore guarantees the existence of the desired $\tilde{X}$. □
Theorem 4.17 For any X,
$$\bar{S}(X) = \limsup_{n\to\infty} \frac{1}{n} H(X^n).$$

Proof:
⁴ Readers may imagine that there are cases where $|\mathcal{C}\cap\mathcal{T}| = M$, $|\mathcal{C}\cap\mathcal{T}| = M-1$, …, $|\mathcal{C}\cap\mathcal{T}| = 1$ and $|\mathcal{C}\cap\mathcal{T}| = 0$, occurring respectively with probability $P_{X^n}^M(\mathcal{T})$, $\binom{M}{M-1}P_{X^n}^{M-1}(\mathcal{T})P_{X^n}(\mathcal{T}^c)$, …, $\binom{M}{1}P_{X^n}(\mathcal{T})P_{X^n}^{M-1}(\mathcal{T}^c)$ and $P_{X^n}^M(\mathcal{T}^c)$, and with the averaged quantity (i.e., $(1/M)|\mathcal{C}\cap\mathcal{T}|$) being M/M, (M-1)/M, …, 1/M and 0.
1. $\bar{S}(X) \le \limsup_{n\to\infty}(1/n)H(X^n)$.

It suffices to prove that $\bar{S}_\varepsilon(X) \le \limsup_{n\to\infty}(1/n)H(X^n)$ for every ε > 0. This is equivalent to showing that for all γ > 0, there exists $\tilde{X}$ such that
$$\frac{1}{n}H(\tilde{X}^n) < \limsup_{n\to\infty}\frac{1}{n}H(X^n) + \gamma \quad \text{and} \quad \|X^n - \tilde{X}^n\| < \varepsilon$$
for all sufficiently large n. This can be trivially achieved by letting $\tilde{X} = X$, since for all sufficiently large n,
$$\frac{1}{n}H(X^n) < \limsup_{n\to\infty}\frac{1}{n}H(X^n) + \gamma \quad \text{and} \quad \|X^n - X^n\| = 0.$$
2. $\bar{S}(X) \ge \limsup_{n\to\infty}(1/n)H(X^n)$.

Observe that $\bar{S}(X) \ge \bar{S}_\varepsilon(X)$ for any $0 < \varepsilon < 1/2$. Then for any γ > 0 and all sufficiently large n, there exists $\tilde{X}^n$ such that
$$\frac{1}{n}H(\tilde{X}^n) < \bar{S}(X) + \gamma \qquad (4.3.3)$$
and
$$\|X^n - \tilde{X}^n\| < \varepsilon.$$
Using the fact [1, p. 33] that $\|X^n - \tilde{X}^n\| \le \varepsilon \le 1/2$ implies
$$\left| H(X^n) - H(\tilde{X}^n) \right| \le -\varepsilon \log\frac{\varepsilon}{|\mathcal{X}|^n},$$
and (4.3.3), we obtain
$$\frac{1}{n}H(X^n) - \varepsilon\log|\mathcal{X}| + \frac{\varepsilon}{n}\log\varepsilon \le \frac{1}{n}H(\tilde{X}^n) < \bar{S}(X) + \gamma,$$
which implies that
$$\limsup_{n\to\infty}\frac{1}{n}H(X^n) - \varepsilon\log|\mathcal{X}| \le \bar{S}(X) + \gamma.$$
Since γ and ε can be taken arbitrarily small, we have
$$\bar{S}(X) \ge \limsup_{n\to\infty}\frac{1}{n}H(X^n). \qquad \Box$$
4.4 Resolvability and source coding

In the previous chapter, we proved that the lossless data compression rate for block codes is lower bounded by $\bar{H}(X)$. We also showed in Section 4.3 that $\bar{H}(X)$ is the resolvability for source X. We can therefore conclude that resolvability is equal to the minimum lossless data compression rate for block codes. We will justify their equivalence directly in this section.

As explained in the AEP theorem for memoryless sources, the set $\mathcal{T}_n(\delta)$ contains approximately $e^{nH(X)}$ elements, and the probability of the source sequences falling outside $\mathcal{T}_n(\delta)$ eventually goes to 0. Therefore, we can binary-index the source sequences in $\mathcal{T}_n(\delta)$ with
$$\log\left\lceil e^{nH(X)} \right\rceil \approx nH(X) \text{ nats},$$
and encode the source sequences outside $\mathcal{T}_n(\delta)$ by a unique default binary codeword, which results in an asymptotically zero probability of decoding error. This is indeed the main idea of Shannon's source coding theorem for block codes.

By further exploring the above concept, we find that the key is actually the existence of a set $\mathcal{A}_n = \{x_1^n, x_2^n, \ldots, x_M^n\}$ with $M \approx e^{nH(X)}$ and $P_{X^n}(\mathcal{A}_n^c) \to 0$. Thus, if we can find such a typical set, Shannon's source coding theorem for block codes can actually be generalized to more general sources, such as non-stationary sources. Furthermore, extension of the theorems to codes of some specific types becomes feasible.
Definition 4.18 (minimum source compression rate for fixed-length codes) R is an ε-achievable source compression rate for fixed-length codes if there exists a sequence of sets $\{\mathcal{A}_n\}_{n=1}^\infty$ with $\mathcal{A}_n \subseteq \mathcal{X}^n$ such that
$$\limsup_{n\to\infty}\frac{1}{n}\log|\mathcal{A}_n| \le R \quad \text{and} \quad \limsup_{n\to\infty} P_{X^n}[\mathcal{A}_n^c] \le \varepsilon.$$
The minimum of all such rates is denoted by $T_\varepsilon(X)$, and $T(X) \triangleq \lim_{\varepsilon\downarrow 0} T_\varepsilon(X)$.
Definition 4.20 (minimum source compression rate for variable-length codes) R is an achievable source compression rate for variable-length codes if there exists a sequence of error-free prefix codes $\{\mathcal{C}_n\}_{n=1}^\infty$ such that
$$\limsup_{n\to\infty}\frac{1}{n}\bar{\ell}_n \le R,$$
where $\bar{\ell}_n$ is the average codeword length of $\mathcal{C}_n$. $\bar{T}(X)$ is the minimum of all such rates.
Recall that for a single source, the measure of its uncertainty is the entropy. Although the entropy can also be used to characterize the overall uncertainty of a random sequence X, source coding is concerned more with its average entropy. So far, we have seen four expressions of average entropy:
$$\limsup_{n\to\infty}\frac{1}{n}H(X^n) \triangleq \limsup_{n\to\infty}\frac{1}{n}\sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n);$$
$$\liminf_{n\to\infty}\frac{1}{n}H(X^n) \triangleq \liminf_{n\to\infty}\frac{1}{n}\sum_{x^n\in\mathcal{X}^n} -P_{X^n}(x^n)\log P_{X^n}(x^n);$$
$$\bar{H}(X) \triangleq \inf\left\{\theta : \limsup_{n\to\infty} P_{X^n}\left[ -\frac{1}{n}\log P_{X^n}(X^n) > \theta \right] = 0\right\};$$
$$\underline{H}(X) \triangleq \sup\left\{\theta : \limsup_{n\to\infty} P_{X^n}\left[ -\frac{1}{n}\log P_{X^n}(X^n) < \theta \right] = 0\right\}.$$
If
$$\lim_{n\to\infty}\frac{1}{n}H(X^n) = \limsup_{n\to\infty}\frac{1}{n}H(X^n) = \liminf_{n\to\infty}\frac{1}{n}H(X^n),$$
then $\lim_{n\to\infty}(1/n)H(X^n)$ is named the entropy rate of the source. $\bar{H}(X)$ and $\underline{H}(X)$ are called the sup-entropy rate and inf-entropy rate, which were introduced in Section 2.3.

Next we will prove that $T(X) = S(X) = \bar{H}(X)$ and $\bar{T}(X) = \bar{S}(X) = \limsup_{n\to\infty}(1/n)H(X^n)$ for a source X. The operational characterizations of $\liminf_{n\to\infty}(1/n)H(X^n)$ and $\underline{H}(X)$ will be introduced in Chapter 6.
Theorem 4.21 (equality of resolvability and minimum source coding rate for fixed-length codes)
$$T(X) = S(X) = \bar{H}(X).$$

Proof: The equality of S(X) and $\bar{H}(X)$ is already given in Theorem 4.16. Also, $T(X) = \bar{H}(X)$ can be obtained from Theorem 3.5 by letting ε = 0. Here, we provide an alternative proof of T(X) = S(X).
1. $T(X) \le S(X)$.

If we can show that, for any fixed ε, $T_\varepsilon(X) \le S_{2\varepsilon}(X)$, then the proof is completed. This claim is proved as follows.

By definition of $S_{2\varepsilon}(X)$, we know that for any γ > 0, there exist $\tilde{X}$ and N such that for n > N,
$$\frac{1}{n}R(\tilde{X}^n) < S_{2\varepsilon}(X) + \gamma \quad \text{and} \quad \|X^n - \tilde{X}^n\| < 2\varepsilon.$$
Let $\mathcal{A}_n \triangleq \{ x^n : P_{\tilde{X}^n}(x^n) > 0 \}$. Since $(1/n)R(\tilde{X}^n) < S_{2\varepsilon}(X) + \gamma$,
$$|\mathcal{A}_n| \le \exp\{R(\tilde{X}^n)\} < \exp\{n(S_{2\varepsilon}(X) + \gamma)\}.$$
Therefore,
$$\limsup_{n\to\infty}\frac{1}{n}\log|\mathcal{A}_n| \le S_{2\varepsilon}(X) + \gamma.$$
Also,
$$2\varepsilon > \|X^n - \tilde{X}^n\| = 2\sup_{E\subseteq\mathcal{X}^n} \left| P_{X^n}(E) - P_{\tilde{X}^n}(E) \right| \ge 2\left| P_{X^n}(\mathcal{A}_n^c) - P_{\tilde{X}^n}(\mathcal{A}_n^c) \right| = 2P_{X^n}(\mathcal{A}_n^c),$$
since $P_{\tilde{X}^n}(\mathcal{A}_n^c) = 0$. Hence, $\limsup_{n\to\infty} P_{X^n}(\mathcal{A}_n^c) \le \varepsilon$.

Thus $S_{2\varepsilon}(X) + \gamma$ is one of the rates that satisfy the conditions of the minimum ε-source compression rate, and therefore $T_\varepsilon(X) \le S_{2\varepsilon}(X) + \gamma$ for any γ > 0.
2. $T(X) \ge S(X)$.

Similarly, if we can show that, for any fixed ε, $S_{3\varepsilon}(X) \le T_\varepsilon(X)$, then the proof is completed. This claim can be proved as follows.

Fix γ > 0. By definition of $T_\varepsilon(X)$, there exist $\{\mathcal{A}_n\}_{n=1}^\infty$ and N such that for n > N,
$$\frac{1}{n}\log|\mathcal{A}_n| < T_\varepsilon(X) + \gamma \quad \text{and} \quad P_{X^n}(\mathcal{A}_n^c) < \varepsilon + \gamma.$$
Choose $M_n$ to satisfy
$$\exp\{n(T_\varepsilon(X) + 2\gamma)\} \le M_n \le \exp\{n(T_\varepsilon(X) + 3\gamma)\}.$$
Also select one element $x_0^n$ from $\mathcal{A}_n^c$. Define a new random variable $\tilde{X}^n$ as follows:
$$P_{\tilde{X}^n}(x^n) = \begin{cases} 0, & \text{if } x^n \notin \{x_0^n\} \cup \mathcal{A}_n; \\[4pt] \dfrac{k(x^n)}{M_n}, & \text{if } x^n \in \{x_0^n\} \cup \mathcal{A}_n, \end{cases}$$
where
$$k(x^n) \triangleq \begin{cases} \lfloor M_n P_{X^n}(x^n) \rfloor, & \text{if } x^n \in \mathcal{A}_n; \\[4pt] M_n - \displaystyle\sum_{\tilde{x}^n \in \mathcal{A}_n} k(\tilde{x}^n), & \text{if } x^n = x_0^n. \end{cases}$$
It can then be easily verified that $\tilde{X}^n$ satisfies the following four properties:

(a) $\tilde{X}^n$ is $M_n$-type;

(b) $P_{\tilde{X}^n}(x_0^n) \ge P_{X^n}(\mathcal{A}_n^c)$, since $x_0^n \in \mathcal{A}_n^c$ and the mass removed by the floor operation on $\mathcal{A}_n$ is absorbed by $x_0^n$; moreover $P_{\tilde{X}^n}(x_0^n) \le P_{X^n}(\mathcal{A}_n^c) + |\mathcal{A}_n|/M_n < (\varepsilon + \gamma) + |\mathcal{A}_n|/M_n$;

(c) for all $x^n \in \mathcal{A}_n$,
$$\left| P_{\tilde{X}^n}(x^n) - P_{X^n}(x^n) \right| = \frac{\left| \lfloor M_n P_{X^n}(x^n) \rfloor - M_n P_{X^n}(x^n) \right|}{M_n} \le \frac{1}{M_n};$$

(d) $P_{\tilde{X}^n}(\mathcal{A}_n) + P_{\tilde{X}^n}(x_0^n) = 1$.

Consequently,
$$\frac{1}{n}R(\tilde{X}^n) \le T_\varepsilon(X) + 3\gamma,$$
and
$$\begin{aligned}
\|X^n - \tilde{X}^n\| &= \sum_{x^n\in\mathcal{X}^n} \left| P_{\tilde{X}^n}(x^n) - P_{X^n}(x^n) \right| \\
&\le \sum_{x^n\in\mathcal{A}_n} \left| P_{\tilde{X}^n}(x^n) - P_{X^n}(x^n) \right| + \left( P_{\tilde{X}^n}(x_0^n) + P_{X^n}(x_0^n) \right) + \sum_{x^n\in\mathcal{A}_n^c\setminus\{x_0^n\}} P_{X^n}(x^n) \\
&\le \sum_{x^n\in\mathcal{A}_n} \frac{1}{M_n} + P_{\tilde{X}^n}(x_0^n) + \sum_{x^n\in\mathcal{A}_n^c} P_{X^n}(x^n) \\
&\le \frac{|\mathcal{A}_n|}{M_n} + \left( P_{X^n}(\mathcal{A}_n^c) + \frac{|\mathcal{A}_n|}{M_n} \right) + P_{X^n}(\mathcal{A}_n^c) \\
&\le 2\,\frac{\exp\{n(T_\varepsilon(X)+\gamma)\}}{\exp\{n(T_\varepsilon(X)+2\gamma)\}} + 2(\varepsilon+\gamma) \\
&= 2e^{-n\gamma} + 2(\varepsilon+\gamma) \\
&\le 3(\varepsilon+\gamma), \qquad \text{for } n \ge -\frac{1}{\gamma}\log\frac{\varepsilon+\gamma}{2}.
\end{aligned}$$
Since $T_\varepsilon(X) + 3\gamma$ is one of the achievable $3(\varepsilon+\gamma)$-resolution rates, and $S_{3(\varepsilon+\gamma)}(X)$ is the smallest of such quantities,
$$S_{3(\varepsilon+\gamma)}(X) \le T_\varepsilon(X) + 3\gamma.$$
The proof is completed by noting that γ can be made arbitrarily small. □
This theorem tells us that the minimum source compression rate for fixed-length codes is the resolvability, which in turn is equal to the sup-entropy rate.

Theorem 4.22 (equality of mean-resolvability and minimum source coding rate for variable-length codes)
$$\bar{T}(X) = \bar{S}(X) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).$$

Proof: The equality of $\bar{S}(X)$ and $\limsup_{n\to\infty}(1/n)H(X^n)$ is already given in Theorem 4.17.
1. $\bar{S}(X) \le \bar{T}(X)$.

Definition 4.20 states that there exists, for all γ > 0 and all sufficiently large n, an error-free variable-length code whose average codeword length $\bar{\ell}_n$ satisfies
$$\frac{1}{n}\bar{\ell}_n < \bar{T}(X) + \gamma.$$
Moreover, the fundamental source coding lower bound for a uniquely decodable code (cf. Theorem 4.18 of Volume I of the lecture notes) is
$$H(X^n) \le \bar{\ell}_n.$$
Thus, by letting $\tilde{X} = X$, we obtain $\|X^n - \tilde{X}^n\| = 0$ and
$$\frac{1}{n}H(\tilde{X}^n) = \frac{1}{n}H(X^n) < \bar{T}(X) + \gamma,$$
which concludes that $\bar{T}(X)$ is an achievable ε-mean-resolution rate of X for any ε > 0, i.e.,
$$\bar{S}(X) = \lim_{\varepsilon\downarrow 0}\bar{S}_\varepsilon(X) \le \bar{T}(X).$$
2. $\bar{T}(X) \le \bar{S}(X)$.

Observe that $\bar{S}_\varepsilon(X) \le \bar{S}(X)$ for $0 < \varepsilon < 1/2$. Hence, by taking ε small enough that $\varepsilon\log_2|\mathcal{X}| < \gamma$, for all sufficiently large n there exists $\tilde{X}^n$ such that
$$\frac{1}{n}H(\tilde{X}^n) < \bar{S}(X) + \gamma \quad \text{and} \quad \|X^n - \tilde{X}^n\| < \varepsilon. \qquad (4.4.1)$$
On the other hand, Theorem 4.22 of Volume I of the lecture notes proves the existence of an error-free prefix code for $X^n$ with average codeword length $\bar{\ell}_n$ satisfying
$$\bar{\ell}_n \le H(X^n) + 1 \text{ (bits)}.$$
By the fact [1, p. 33] that $\|X^n - \tilde{X}^n\| \le \varepsilon \le 1/2$ implies
$$\left| H(X^n) - H(\tilde{X}^n) \right| \le -\varepsilon\log_2\frac{\varepsilon}{|\mathcal{X}|^n},$$
and (4.4.1), we obtain
$$\begin{aligned}
\frac{1}{n}\bar{\ell}_n &\le \frac{1}{n}H(X^n) + \frac{1}{n} \\
&\le \frac{1}{n}H(\tilde{X}^n) + \varepsilon\log_2|\mathcal{X}| - \frac{\varepsilon}{n}\log_2\varepsilon + \frac{1}{n} \\
&\le \bar{S}(X) + \gamma + \varepsilon\log_2|\mathcal{X}| - \frac{\varepsilon}{n}\log_2\varepsilon + \frac{1}{n} \\
&\le \bar{S}(X) + 2\gamma,
\end{aligned}$$
if $n > (1 - \varepsilon\log_2\varepsilon)/(\gamma - \varepsilon\log_2|\mathcal{X}|)$. Since γ can be made arbitrarily small, $\bar{T}(X) \le \bar{S}(X)$. □
Again, the above theorem tells us that the minimum source compression rate for variable-length codes is the mean-resolvability, and the mean-resolvability is exactly $\limsup_{n\to\infty}(1/n)H(X^n)$.

Note that $\limsup_{n\to\infty}(1/n)H(X^n) \le \bar{H}(X)$, which follows straightforwardly from the fact that the mean of the random variable $-(1/n)\log P_{X^n}(X^n)$ is no greater than the right margin of its support. Also note that for a stationary-ergodic source, all these quantities are equal, i.e.,
$$T(X) = S(X) = \bar{H}(X) = \bar{T}(X) = \bar{S}(X) = \limsup_{n\to\infty}\frac{1}{n}H(X^n).$$
We end this chapter by computing these quantities for a specific example.
Example 4.23 Consider a binary random source $X_1, X_2, \ldots$, where $\{X_i\}_{i=1}^\infty$ are independent random variables with individual distributions
$$P_{X_i}(0) = Z_i \quad \text{and} \quad P_{X_i}(1) = 1 - Z_i,$$
where $\{Z_i\}_{i=1}^\infty$ are pairwise independent with common uniform marginal distribution over (0, 1).

You may imagine that the source is formed by selecting from infinitely many binary number generators, as shown in Figure 4.1. The selection process Z is independent at each time instance.
It can be shown that such a source is not stationary. Nevertheless, by means of an argument similar to the AEP theorem, we can show that
$$-\frac{\log P_X(X_1) + \log P_X(X_2) + \cdots + \log P_X(X_n)}{n} \to h_b(Z) \text{ in probability},$$
where $h_b(a) \triangleq -a\log_2(a) - (1-a)\log_2(1-a)$ is the binary entropy function. To compute the ultimate average entropy rate in terms of the random variable $h_b(Z)$, it is required that
$$-\frac{\log P_X(X_1) + \log P_X(X_2) + \cdots + \log P_X(X_n)}{n} \to h_b(Z) \text{ in mean},$$
[Figure 4.1 appears here.]

Figure 4.1: Source generator: $\{X_t\}_{t\in I}$ ($I = (0,1)$) is an independent random process with $P_{X_t}(0) = 1 - P_{X_t}(1) = t$, and is also independent of the selector Z, where $X_t$ is output if $Z = t$. The source generator at each time instance is independent temporally.
which is a stronger result than convergence in probability. By the fundamental properties of convergence, convergence in probability implies convergence in mean provided the sequence of random variables is uniformly integrable, which is true for $-(1/n)\sum_{i=1}^n \log P_X(X_i)$:
$$\begin{aligned}
\sup_{n>0} E\left[ \left| \frac{1}{n}\sum_{i=1}^n \log P_X(X_i) \right| \right]
&\le \sup_{n>0} \frac{1}{n}\sum_{i=1}^n E\left[ |\log P_X(X_i)| \right] \\
&= \sup_{n>0} E\left[ |\log P_X(X)| \right], \quad \text{because the } \{X_i\}_{i=1}^n \text{ are i.i.d.} \\
&= E\left[ |\log P_X(X)| \right] \\
&= E\left[ E\left[ |\log P_X(X)| \,\middle|\, Z \right] \right] \\
&= \int_0^1 E\left[ |\log P_X(X)| \,\middle|\, Z = z \right] dz \\
&= \int_0^1 \left( z|\log(z)| + (1-z)|\log(1-z)| \right) dz \\
&\le \int_0^1 \log(2)\, dz = \log(2).
\end{aligned}$$
We therefore have
$$E\left[ -\frac{1}{n}\log P_{X^n}(X^n) \right] \to E[h_b(Z)] \quad \text{since} \quad E\left[ \left| -\frac{1}{n}\log P_{X^n}(X^n) - h_b(Z) \right| \right] \to 0.$$
Consequently,
$$\limsup_{n\to\infty}\frac{1}{n}H(X^n) = E[h_b(Z)] = 0.5 \text{ nats, or } 0.721348 \text{ bits}.$$
However, it can be shown that the ultimate cumulative distribution function of $-(1/n)\log P_{X^n}(X^n)$ is $\Pr[h_b(Z) \le t]$ for $t \in [0, \log(2)]$ (cf. Figure 4.2). The sup-entropy rate of X is therefore log(2) nats, or 1 bit (the right margin of the ultimate CDF of $-(1/n)\log P_{X^n}(X^n)$). Hence, for this non-stationary source, the minimum average codeword lengths for variable-length codes and fixed-length codes are different: 0.721348 bit and 1 bit, respectively.
[Figure 4.2 appears here.]

Figure 4.2: The ultimate CDF of $-(1/n)\log P_{X^n}(X^n)$: $\Pr\{h_b(Z) \le t\}$, rising from 0 at t = 0 to 1 at t = log(2) nats.
Bibliography

[1] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, 1981.

[2] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. on Information Theory, vol. IT-39, no. 3, pp. 752-772, May 1993.

[3] D. E. Knuth and A. C. Yao, "The complexity of random number generation," in Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity, New York: Academic Press, 1976.
Chapter 5

Channel Coding Theorems and Approximations of Output Statistics for Arbitrary Channels

The Shannon channel capacity in Volume I of the lecture notes is derived under the assumption that the channel is memoryless. With moderate modification of the proof, this result can be extended to stationary-ergodic channels, for which the capacity formula becomes the maximization of the mutual information rate:
$$\lim_{n\to\infty}\sup_{X^n}\frac{1}{n}I(X^n; Y^n).$$
Yet, for more general channels, such as non-stationary or non-ergodic channels, a more general expression for the channel capacity needs to be derived.
5.1 General models for channels

The channel transition probability in its most general form is denoted by $W \triangleq \{W^n = P_{Y^n|X^n}\}_{n=1}^\infty$, abbreviated W for convenience. Similarly, the input and output random processes are respectively denoted by X and Y.
5.2 Variations of capacity formulas for arbitrary channels

5.2.1 Preliminaries

Now, similar to the definitions of the sup- and inf-entropy rates for a sequence of sources, the sup- and inf-(mutual-)information rates are respectively defined¹ through the sup-spectrum of the normalized information density,
$$\bar{i}(\theta) \triangleq \limsup_{n\to\infty}\Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le \theta \right\},$$
where
$$i_{X^n W^n}(x^n; y^n) \triangleq \log\frac{P_{Y^n|X^n}(y^n|x^n)}{P_{Y^n}(y^n)}$$
is the information density.
In 1994, Verdú and Han [2] showed that the channel capacity in its most general form is
$$C = \sup_X \underline{I}(X; Y).$$
In their proof, they showed the achievability part by means of Feinstein's Lemma, and provided a new proof for the converse part. In this section, we will not adopt the original proof of Verdú and Han for the converse theorem. Instead, we will use a newer and tighter bound [1] established by Poor and Verdú in 1995.
Definition 5.1 (fixed-length data transmission code) An (n, M) fixed-length data transmission code for channel input alphabet $\mathcal{X}^n$ and output alphabet $\mathcal{Y}^n$ consists of

1. M informational messages intended for transmission;

¹ In the paper of Verdú and Han [2], these two quantities are defined by
$$\bar{I}(X; Y) \triangleq \inf\left\{ \theta : (\forall\,\gamma > 0)\ \limsup_{n\to\infty} P_{X^n W^n}\left[ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) > \theta + \gamma \right] = 0 \right\}$$
and
$$\underline{I}(X; Y) \triangleq \sup\left\{ \theta : (\forall\,\gamma > 0)\ \limsup_{n\to\infty} P_{X^n W^n}\left[ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \theta - \gamma \right] = 0 \right\}.$$
The above definitions are in fact equivalent to ours.
2. an encoding function
$$f : \{1, 2, \ldots, M\} \to \mathcal{X}^n;$$

3. a decoding function
$$g : \mathcal{Y}^n \to \{1, 2, \ldots, M\},$$
which is (usually) a deterministic rule that assigns a guess to each possible received vector.

The channel inputs in $\{x^n \in \mathcal{X}^n : x^n = f(m) \text{ for some } 1 \le m \le M\}$ are the codewords of the data transmission code.
Definition 5.2 (average probability of error) The average probability of error for a $\mathcal{C}_n = (n, M)$ code with encoder f(·) and decoder g(·), transmitted over the channel $Q_{Y^n|X^n}$, is defined as
$$P_e(\mathcal{C}_n) = \frac{1}{M}\sum_{i=1}^M \lambda_i,$$
where
$$\lambda_i \triangleq \sum_{y^n \in \mathcal{Y}^n:\ g(y^n) \neq i} Q_{Y^n|X^n}(y^n | f(i)).$$
Under the criterion of average probability of error, all of the codewords are treated equally; namely, the prior probabilities of the selected M codewords are uniform.
Lemma 5.3 (Feinstein's Lemma) Fix a positive integer n. For every γ > 0 and input distribution $P_{X^n}$ on $\mathcal{X}^n$, there exists an (n, M) block code for the transition probability $P_{W^n} = P_{Y^n|X^n}$ whose average error probability $P_e(\mathcal{C}_n)$ satisfies
$$P_e(\mathcal{C}_n) < \Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \frac{1}{n}\log M + \gamma \right\} + e^{-n\gamma}.$$
Proof:

Step 1: Notations. Define
$$\mathcal{G} \triangleq \left\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \frac{1}{n}\, i_{X^n W^n}(x^n; y^n) \ge \frac{1}{n}\log M + \gamma \right\}.$$
Let $\varepsilon \triangleq e^{-n\gamma} + P_{X^n W^n}(\mathcal{G}^c)$.
Feinstein's Lemma obviously holds if ε ≥ 1, because then
$$P_e(\mathcal{C}_n) \le 1 \le \Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \frac{1}{n}\log M + \gamma \right\} + e^{-n\gamma}.$$
So we assume ε < 1, which immediately results in
$$P_{X^n W^n}(\mathcal{G}^c) < \varepsilon < 1,$$
or equivalently,
$$P_{X^n W^n}(\mathcal{G}) > 1 - \varepsilon > 0.$$
Therefore,
$$P_{X^n}(\mathcal{A}) \triangleq P_{X^n}\left\{ x^n \in \mathcal{X}^n : P_{Y^n|X^n}(\mathcal{G}_{x^n}|x^n) > 1 - \varepsilon \right\} > 0,$$
where $\mathcal{G}_{x^n} \triangleq \{ y^n \in \mathcal{Y}^n : (x^n, y^n) \in \mathcal{G} \}$, because if $P_{X^n}(\mathcal{A}) = 0$, then
$$\left( \forall\, x^n \text{ with } P_{X^n}(x^n) > 0 \right)\quad P_{Y^n|X^n}(\mathcal{G}_{x^n}|x^n) \le 1 - \varepsilon$$
$$\Rightarrow \quad \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\, P_{Y^n|X^n}(\mathcal{G}_{x^n}|x^n) = P_{X^n W^n}(\mathcal{G}) \le 1 - \varepsilon,$$
a contradiction.
Step 2: Encoder. Choose an $x_1^n$ in $\mathcal{A}$. (Recall that $P_{X^n}(\mathcal{A}) > 0$.) Define $\mathcal{D}_1 = \mathcal{G}_{x_1^n}$. (Then $P_{Y^n|X^n}(\mathcal{D}_1|x_1^n) > 1 - \varepsilon$.)

Next choose, if possible, a point $x_2^n \in \mathcal{X}^n$ with replacement (i.e., $x_2^n$ may be identical to $x_1^n$) for which
$$P_{Y^n|X^n}\left( \mathcal{G}_{x_2^n} \setminus \mathcal{D}_1 \,\middle|\, x_2^n \right) > 1 - \varepsilon,$$
and define $\mathcal{D}_2 \triangleq \mathcal{G}_{x_2^n} \setminus \mathcal{D}_1$.

Continue in this way for codeword i: choose $x_i^n$ to satisfy
$$P_{Y^n|X^n}\left( \mathcal{G}_{x_i^n} \setminus \bigcup_{j=1}^{i-1}\mathcal{D}_j \,\middle|\, x_i^n \right) > 1 - \varepsilon,$$
and define $\mathcal{D}_i \triangleq \mathcal{G}_{x_i^n} \setminus \bigcup_{j=1}^{i-1}\mathcal{D}_j$.

Repeat the above codeword-selection procedure until either M codewords have been selected or all the points in $\mathcal{A}$ have been exhausted.

Step 3: Decoder. Define the decoding rule as
$$g(y^n) = \begin{cases} i, & \text{if } y^n \in \mathcal{D}_i; \\ \text{arbitrary}, & \text{otherwise}. \end{cases}$$
Step 4: Probability of error. For all selected codewords, the error probability given that codeword i is transmitted, $\lambda_{e|i}$, satisfies
$$\lambda_{e|i} \le P_{Y^n|X^n}(\mathcal{D}_i^c | x_i^n) < \varepsilon.$$
(Note that $(\forall\, i)\ P_{Y^n|X^n}(\mathcal{D}_i|x_i^n) > 1 - \varepsilon$ by Step 2.) Therefore, if we can show that the codeword-selection procedure above does not terminate before M, then
$$P_e(\mathcal{C}_n) = \frac{1}{M}\sum_{i=1}^M \lambda_{e|i} < \varepsilon.$$
Step 5: Claim. The codeword-selection procedure in Step 2 does not terminate before M.

Proof of the claim: We prove it by contradiction. Suppose the procedure terminates before M, say at N < M. Define the set
$$\mathcal{D} \triangleq \bigcup_{i=1}^N \mathcal{D}_i \subseteq \mathcal{Y}^n.$$
Consider the probability
$$P_{X^n W^n}(\mathcal{G}) = P_{X^n W^n}\left[ \mathcal{G} \cap (\mathcal{X}^n \times \mathcal{D}) \right] + P_{X^n W^n}\left[ \mathcal{G} \cap (\mathcal{X}^n \times \mathcal{D}^c) \right]. \qquad (5.2.1)$$
Since for any $y^n \in \mathcal{G}_{x_i^n}$,
$$\frac{P_{Y^n}(y^n)}{P_{Y^n|X^n}(y^n|x_i^n)} \le \frac{1}{M}\, e^{-n\gamma},$$
we have
$$P_{Y^n}(\mathcal{D}_i) \le P_{Y^n}(\mathcal{G}_{x_i^n}) \le \frac{1}{M}\, e^{-n\gamma}\, P_{Y^n|X^n}(\mathcal{G}_{x_i^n}|x_i^n) \le \frac{1}{M}\, e^{-n\gamma}.$$
So the first term on the right-hand side of (5.2.1) can be upper bounded by
$$P_{X^n W^n}\left[ \mathcal{G} \cap (\mathcal{X}^n \times \mathcal{D}) \right] \le P_{X^n W^n}(\mathcal{X}^n \times \mathcal{D}) = P_{Y^n}(\mathcal{D}) = \sum_{i=1}^N P_{Y^n}(\mathcal{D}_i) \le N\,\frac{1}{M}\, e^{-n\gamma} = \frac{N}{M}\, e^{-n\gamma}.$$
As for the second term on the right-hand side of (5.2.1), we can upper bound it by
$$\begin{aligned}
P_{X^n W^n}\left[ \mathcal{G} \cap (\mathcal{X}^n \times \mathcal{D}^c) \right] &= \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\, P_{Y^n|X^n}(\mathcal{G}_{x^n} \cap \mathcal{D}^c | x^n) \\
&= \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\, P_{Y^n|X^n}\left( \mathcal{G}_{x^n} \setminus \bigcup_{i=1}^N \mathcal{D}_i \,\middle|\, x^n \right) \\
&\le \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,(1 - \varepsilon) = 1 - \varepsilon,
\end{aligned}$$
where the last step follows since for all $x^n \in \mathcal{X}^n$,
$$P_{Y^n|X^n}\left( \mathcal{G}_{x^n} \setminus \bigcup_{i=1}^N \mathcal{D}_i \,\middle|\, x^n \right) \le 1 - \varepsilon.$$
(Otherwise we could find the (N+1)-th codeword.)

Consequently, $P_{X^n W^n}(\mathcal{G}) \le (N/M)e^{-n\gamma} + (1 - \varepsilon)$. By definition of ε,
$$P_{X^n W^n}(\mathcal{G}) = 1 - \varepsilon + e^{-n\gamma} \le \frac{N}{M}\, e^{-n\gamma} + (1 - \varepsilon),$$
which implies $N \ge M$, a contradiction. □
Lemma 5.4 (Poor-Verdú Lemma [1]) Suppose X and Y are random variables, where X takes values in a finite (or countably infinite) set
$$\mathcal{X} = \{x_1, x_2, x_3, \ldots\}.$$
The minimum probability of error $P_e$ in estimating X from Y satisfies
$$P_e \ge (1 - \alpha)\, P_{X,Y}\left\{ (x, y) \in \mathcal{X}\times\mathcal{Y} : P_{X|Y}(x|y) \le \alpha \right\}$$
for each $\alpha \in [0, 1]$.

Proof: It is known that the minimum-error-probability estimate e(y) of X upon receiving y is
$$e(y) = \arg\max_{x\in\mathcal{X}} P_{X|Y}(x|y). \qquad (5.2.2)$$
Therefore, the error probability incurred in testing among the values of X is
given by
$$\begin{aligned}
1 - P_e &= \Pr\{X = e(Y)\} \\
&= \int_{\mathcal{Y}} \left( \sum_{x\,:\,x = e(y)} P_{X|Y}(x|y) \right) dP_Y(y) \\
&= \int_{\mathcal{Y}} \left( \max_{x\in\mathcal{X}} P_{X|Y}(x|y) \right) dP_Y(y) \\
&= \int_{\mathcal{Y}} \left( \max_{x\in\mathcal{X}} f_x(y) \right) dP_Y(y) \\
&= E\left[ \max_{x\in\mathcal{X}} f_x(Y) \right],
\end{aligned}$$
where $f_x(y) \triangleq P_{X|Y}(x|y)$. Let $h_j(y)$ be the j-th element of the reordering of $\{f_{x_1}(y), f_{x_2}(y), f_{x_3}(y), \ldots\}$ according to descending element values. In other words,
$$h_1(y) \ge h_2(y) \ge h_3(y) \ge \cdots$$
and
$$\{h_1(y), h_2(y), h_3(y), \ldots\} = \{f_{x_1}(y), f_{x_2}(y), f_{x_3}(y), \ldots\}.$$
Then
$$1 - P_e = E[h_1(Y)]. \qquad (5.2.3)$$
For any $\alpha \in [0, 1]$, we can write
$$P_{X,Y}\{(x, y) \in \mathcal{X}\times\mathcal{Y} : f_x(y) > \alpha\} = \int_{\mathcal{Y}} P_{X|Y}\{x \in \mathcal{X} : f_x(y) > \alpha\}\, dP_Y(y).$$
Observe that
$$\begin{aligned}
P_{X|Y}\{x \in \mathcal{X} : f_x(y) > \alpha\} &= \sum_{x\in\mathcal{X}} P_{X|Y}(x|y)\cdot \mathbf{1}[f_x(y) > \alpha] \\
&= \sum_{x\in\mathcal{X}} f_x(y)\cdot \mathbf{1}[f_x(y) > \alpha] \\
&= \sum_{j=1}^\infty h_j(y)\cdot \mathbf{1}[h_j(y) > \alpha],
\end{aligned}$$
where $\mathbf{1}(\cdot)$ is the indicator function². Therefore,
$$\begin{aligned}
P_{XY}\{(x, y)\in\mathcal{X}\times\mathcal{Y} : f_x(y) > \alpha\} &= \int_{\mathcal{Y}} \left( \sum_{j=1}^\infty h_j(y)\cdot\mathbf{1}[h_j(y) > \alpha] \right) dP_Y(y) \\
&\ge \int_{\mathcal{Y}} h_1(y)\cdot\mathbf{1}(h_1(y) > \alpha)\, dP_Y(y) \\
&= E\left[ h_1(Y)\cdot\mathbf{1}(h_1(Y) > \alpha) \right].
\end{aligned}$$
It remains to relate $E[h_1(Y)\cdot\mathbf{1}(h_1(Y) > \alpha)]$ to $E[h_1(Y)]$, which is exactly $1 - P_e$. For any $\alpha \in [0, 1]$ and any random variable U with $\Pr\{0 \le U \le 1\} = 1$,
$$U \le \alpha + (1 - \alpha)\, U\cdot\mathbf{1}(U > \alpha)$$
holds with probability one. This can be easily proved by upper-bounding U by α when $0 \le U \le \alpha$, and by $\alpha + (1-\alpha)U$ otherwise. Thus
$$E[U] \le \alpha + (1 - \alpha)\, E[U\cdot\mathbf{1}(U > \alpha)].$$
By letting $U = h_1(Y)$, together with (5.2.3), we finally obtain
$$\begin{aligned}
(1 - \alpha)\, P_{XY}\{(x, y)\in\mathcal{X}\times\mathcal{Y} : f_x(y) > \alpha\} &\ge (1 - \alpha)\, E\left[ h_1(Y)\cdot\mathbf{1}[h_1(Y) > \alpha] \right] \\
&\ge E[h_1(Y)] - \alpha \\
&= (1 - P_e) - \alpha.
\end{aligned}$$
Rearranging yields $P_e \ge (1-\alpha)\, P_{X,Y}\{(x,y) : f_x(y) \le \alpha\}$, as claimed. □
Corollary 5.5 Every $\mathcal{C}_n = (n, M)$ code satisfies
$$P_e(\mathcal{C}_n) \ge \left( 1 - e^{-n\gamma} \right)\Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le \frac{1}{n}\log M - \gamma \right\}$$
for every γ > 0, where $X^n$ places probability mass 1/M on each codeword, and $P_e(\mathcal{C}_n)$ denotes the error probability of the code.
Proof: Taking $\alpha = e^{-n\gamma}$ in Lemma 5.4, and replacing X and Y in Lemma 5.4

² That is, if $f_x(y) > \alpha$ is true, $\mathbf{1}(\cdot) = 1$; else, it is zero.
by their n-fold counterparts, i.e., $X^n$ and $Y^n$, we obtain
$$\begin{aligned}
P_e(\mathcal{C}_n) &\ge \left(1 - e^{-n\gamma}\right) P_{X^n W^n}\left\{ (x^n, y^n) \in \mathcal{X}^n\times\mathcal{Y}^n : P_{X^n|Y^n}(x^n|y^n) \le e^{-n\gamma} \right\} \\
&= \left(1 - e^{-n\gamma}\right) P_{X^n W^n}\left\{ (x^n, y^n) \in \mathcal{X}^n\times\mathcal{Y}^n : \frac{P_{X^n|Y^n}(x^n|y^n)}{1/M} \le \frac{e^{-n\gamma}}{1/M} \right\} \\
&= \left(1 - e^{-n\gamma}\right) P_{X^n W^n}\left\{ (x^n, y^n) \in \mathcal{X}^n\times\mathcal{Y}^n : \frac{P_{X^n|Y^n}(x^n|y^n)}{P_{X^n}(x^n)} \le M e^{-n\gamma} \right\} \\
&= \left(1 - e^{-n\gamma}\right) P_{X^n W^n}\left\{ (x^n, y^n) \in \mathcal{X}^n\times\mathcal{Y}^n : \frac{1}{n}\log\frac{P_{X^n|Y^n}(x^n|y^n)}{P_{X^n}(x^n)} \le \frac{1}{n}\log M - \gamma \right\} \\
&= \left(1 - e^{-n\gamma}\right)\Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le \frac{1}{n}\log M - \gamma \right\}. \qquad \Box
\end{aligned}$$
Definition 5.6 (ε-achievable rate) Fix $\varepsilon \in [0, 1]$. $R \ge 0$ is an ε-achievable rate if there exists a sequence of $\mathcal{C}_n = (n, M_n)$ channel block codes such that
$$\liminf_{n\to\infty}\frac{1}{n}\log M_n \ge R \quad \text{and} \quad \limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.$$

Definition 5.7 (ε-capacity $C_\varepsilon$) The ε-capacity $C_\varepsilon$ is the supremum of all ε-achievable rates. It is straightforward from the definition that $C_\varepsilon$ is non-decreasing in ε, and $C_1 = \log|\mathcal{X}|$.
Definition 5.8 (capacity C) The channel capacity C is defined as the supremum of the rates that are ε-achievable for all $\varepsilon \in [0, 1]$. It follows immediately from the definition³ that $C = \inf_{0\le\varepsilon\le 1} C_\varepsilon = \lim_{\varepsilon\downarrow 0} C_\varepsilon = C_0$, and that C is the supremum of all rates R for which there exists a sequence of $\mathcal{C}_n = (n, M_n)$ channel block codes such that
$$\liminf_{n\to\infty}\frac{1}{n}\log M_n \ge R \quad \text{and} \quad \limsup_{n\to\infty} P_e(\mathcal{C}_n) = 0.$$

³ The proof of $C_0 = \lim_{\varepsilon\downarrow 0} C_\varepsilon$ follows by a diagonalization argument: a rate achievable for every ε > 0 admits a single code sequence with $P_e(\mathcal{C}_n) \to 0$, for otherwise a contradiction to the definition of $C_0$ is obtained.
5.2.2 ε-capacity

Theorem 5.9 (ε-capacity) For 0 < ε < 1, the ε-capacity $C_\varepsilon$ for arbitrary channels satisfies
$$\sup_X \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) \le C_\varepsilon \le \sup_X \underline{I}_\varepsilon(X; Y),$$
where
$$\underline{I}_\delta(X; Y) \triangleq \sup\left\{ R : \limsup_{n\to\infty}\Pr\left[ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le R \right] \le \delta \right\}.$$
The upper and lower bounds in the above inequality meet except at the points of discontinuity of $\varepsilon \mapsto \sup_X \underline{I}_\varepsilon(X; Y)$.
Proof:

1. $C_\varepsilon \ge \sup_X \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y)$.

Fix input X. It suffices to show the existence of $\mathcal{C}_n = (n, M_n)$ data transmission codes with rate
$$\lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) - \gamma < \frac{1}{n}\log M_n < \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) - \frac{\gamma}{2}$$
and probability of decoding error satisfying
$$P_e(\mathcal{C}_n) \le \varepsilon$$
for every γ > 0. (If such codes exist, then $\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon$ and $\liminf_{n\to\infty}(1/n)\log M_n \ge \lim_{\gamma\downarrow 0}\lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) = \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y)$.)
From Lemma 5.3, there exists a $\mathcal{C}_n = (n, M_n)$ code whose error probability satisfies
$$\begin{aligned}
P_e(\mathcal{C}_n) &< \Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \frac{1}{n}\log M_n + \frac{\gamma}{4} \right\} + e^{-n\gamma/4} \\
&\le \Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \left( \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) - \frac{\gamma}{2} \right) + \frac{\gamma}{4} \right\} + e^{-n\gamma/4} \\
&= \Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) - \frac{\gamma}{4} \right\} + e^{-n\gamma/4}, \qquad (5.2.4)
\end{aligned}$$
where (5.2.4) is at most ε for all sufficiently large n because
$$\lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) \le \sup\left\{ R : \limsup_{n\to\infty}\Pr\left[ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) \le R \right] < \varepsilon \right\},$$
or, more specifically,
$$\limsup_{n\to\infty}\Pr\left\{ \frac{1}{n}\, i_{X^n W^n}(X^n; Y^n) < \lim_{\delta\uparrow\varepsilon}\underline{I}_\delta(X; Y) - \frac{\gamma}{4} \right\} < \varepsilon.$$
Hence, the proof of the direct part is completed.
2. $C_\varepsilon \le \sup_X \underline{I}_\varepsilon(X; Y)$.

Suppose that there exists a sequence of $\mathcal{C}_n = (n, M_n)$ codes with $\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon$ and rate strictly larger than $\sup_X \underline{I}_\varepsilon(X; Y) + 2\gamma$ for some γ > 0. Since the above inequality holds for every X, it certainly holds when taking the input $\tilde{X}^n$ which places probability mass $1/M_n$ on each codeword, i.e.,
$$\frac{1}{n}\log M_n > \underline{I}_\varepsilon(\tilde{X}; \tilde{Y}) + 2\gamma, \qquad (5.2.5)$$
where $\tilde{Y}$ is the channel output due to channel input $\tilde{X}$. Then from Corollary 5.5, the error probability of the code satisfies
$$\begin{aligned}
P_e(\mathcal{C}_n) &\ge \left(1 - e^{-n\gamma}\right)\Pr\left\{ \frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; \tilde{Y}^n) \le \frac{1}{n}\log M_n - \gamma \right\} \\
&\ge \left(1 - e^{-n\gamma}\right)\Pr\left\{ \frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; \tilde{Y}^n) \le \underline{I}_\varepsilon(\tilde{X}; \tilde{Y}) + \gamma \right\},
\end{aligned}$$
where the last inequality follows from (5.2.5). Taking the limsup of both sides, we have
$$\limsup_{n\to\infty} P_e(\mathcal{C}_n) \ge \limsup_{n\to\infty}\Pr\left\{ \frac{1}{n}\, i_{\tilde{X}^n W^n}(\tilde{X}^n; \tilde{Y}^n) \le \underline{I}_\varepsilon(\tilde{X}; \tilde{Y}) + \gamma \right\} > \varepsilon,$$
and the desired contradiction is obtained. □
86
5.2.3 General Shannon capacity

Theorem 5.10 (general Shannon capacity) The channel capacity $C$ for arbitrary channels satisfies
$$C = \sup_X \underline{I}(X;Y).$$

Proof: Observe that
$$C = C_0 = \lim_{\varepsilon\downarrow 0} C_\varepsilon.$$
Hence, from Theorem 5.9, we note that for $\varepsilon \in (0,1)$,
$$C_\varepsilon \ge \sup_X \lim_{\delta\uparrow\varepsilon} \underline{I}_\delta(X;Y) \ge \sup_X \underline{I}_0(X;Y) = \sup_X \underline{I}(X;Y),$$
so that $C \ge \sup_X \underline{I}(X;Y)$. It remains to show that $C \le \sup_X \underline{I}(X;Y)$.

Suppose that there exists a sequence of $\mathcal{C}_n = (n, M_n)$ codes with rate strictly larger than $\sup_X \underline{I}(X;Y)$ and error probability tending to $0$ as $n \to \infty$. Let the ultimate code rate for this code be $\sup_X \underline{I}(X;Y) + 3\gamma$ for some $\gamma > 0$. Then for sufficiently large $n$,
$$\frac{1}{n}\log M_n > \sup_X \underline{I}(X;Y) + 2\gamma.$$
Since the above inequality holds for every $X$, it certainly holds if we take the input $\hat{X}^n$ which places probability mass $1/M_n$ on each codeword, i.e.,
$$\frac{1}{n}\log M_n > \underline{I}(\hat{X}; \hat{Y}) + 2\gamma, \qquad (5.2.6)$$
where $\hat{Y}$ is the channel output due to channel input $\hat{X}$. Then from Corollary 5.5, the error probability of the code satisfies
$$P_e(\mathcal{C}_n) \ge \left(1 - e^{-n\gamma}\right)\Pr\left[\frac{1}{n} i_{\hat{X}^n W^n}(\hat{X}^n; \hat{Y}^n) \le \frac{1}{n}\log M_n - \gamma\right]$$
$$\ge \left(1 - e^{-n\gamma}\right)\Pr\left[\frac{1}{n} i_{\hat{X}^n W^n}(\hat{X}^n; \hat{Y}^n) \le \underline{I}(\hat{X}; \hat{Y}) + \gamma\right], \qquad (5.2.7)$$
where the last inequality follows from (5.2.6). By assumption, $P_e(\mathcal{C}_n)$ vanishes as $n \to \infty$, but (5.2.7) cannot vanish by the definition of $\underline{I}(\hat{X}; \hat{Y})$; thereby a desired contradiction is obtained. $\Box$
5.2.4 Strong capacity

Definition 5.11 (strong capacity) Define the strong converse capacity (or strong capacity) $C_{SC}$ as the infimum of the rates $R$ such that for all channel block codes $\mathcal{C}_n = (n, M_n)$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log M_n \ge R,$$
we have
$$\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 1.$$

Theorem 5.12 (general strong capacity)
$$C_{SC} = \sup_X \bar{I}(X;Y).$$

5.2.5 Examples

With these general capacity formulas, we can now compute the channel capacity for some nonstationary or nonergodic channels, and analyze their properties.
Example 5.13 (Shannon capacity) Let the input and output alphabets be $\{0, 1\}$, and let every output $Y_i$ be given by
$$Y_i = X_i \oplus N_i,$$
where $\oplus$ represents the modulo-2 addition operation. Assume the input process $X$ and the noise process $N$ are independent.

A general relation between the entropy rate and mutual information rate can be derived from (2.4.2) and (2.4.4) as
$$\underline{H}(Y) - \bar{H}(Y|X) \le \underline{I}(X;Y) \le \underline{H}(Y) - \underline{H}(Y|X).$$
Since $N^n$ is completely determined from $Y^n$ under the knowledge of $X^n$,
$$\bar{H}(Y|X) = \bar{H}(N).$$
Indeed, this channel is a symmetric channel. Therefore, a uniform input yields a uniform output (Bernoulli with parameter $1/2$), and $\underline{H}(Y) = \bar{H}(Y) = \log(2)$ nats. We thus have
$$C = \log(2) - \bar{H}(N) \text{ nats.}$$
We then compute the channel capacity for the next two cases of noise.
Case A) If $N$ is a nonstationary binary independent sequence with
$$\Pr\{N_i = 1\} = p_i,$$
then the variance of the random variable $\log P_{N_i}(N_i)$ is uniformly bounded in $i$, namely,
$$\mathrm{Var}[\log P_{N_i}(N_i)] \le E\left[(\log P_{N_i}(N_i))^2\right] \le \sup_{0 < p_i < 1}\left[ p_i (\log p_i)^2 + (1 - p_i)(\log(1 - p_i))^2 \right] \le \log(2),$$
so we have by Chebyshev's inequality,
$$\Pr\left\{\left| -\frac{1}{n}\log P_{N^n}(N^n) - \frac{1}{n}\sum_{i=1}^n H(N_i) \right| > \gamma \right\} \to 0,$$
for any $\gamma > 0$. Therefore, $\bar{H}(N) = \limsup_{n\to\infty} (1/n)\sum_{i=1}^n h_b(p_i)$. Consequently,
$$C = \log(2) - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n h_b(p_i) \text{ nats/channel usage.}$$
This result is illustrated in Figure 5.1.

[Figure 5.1: The ultimate CDFs of $-(1/n)\log P_{N^n}(N^n)$; the cluster points of these CDFs lie between $\underline{H}(N)$ and $\bar{H}(N)$.]

Case B) If $N$ has the same distribution as the source process in Example 4.23, then $\bar{H}(N) = \log(2)$ nats, which yields zero channel capacity.
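The Case A formula can be checked numerically. The sketch below is an illustration, not part of the text: the crossover pattern $p_i$ is an arbitrary assumption chosen so that the running average of $h_b(p_i)$ oscillates, and the limsup is estimated over a finite horizon.

```python
import math

def hb(p):
    """Binary entropy in nats; hb(0) = hb(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Illustrative nonstationary crossover pattern (an assumption, not from the
# text): p_i = 0.1 when floor(log2(i)) is even, else p_i = 0.4, so the
# running average of h_b(p_i) oscillates between two cluster points.
def p_seq(i):
    return 0.1 if (i.bit_length() - 1) % 2 == 0 else 0.4

N = 1 << 16
avg, run = [], 0.0
for i in range(1, N + 1):
    run += hb(p_seq(i))
    avg.append(run / i)

# Estimate limsup_n (1/n) sum h_b(p_i) over a long tail window; then
# C = log(2) - limsup, in nats.
limsup_est = max(avg[N // 16:])
C_est = math.log(2) - limsup_est
print(limsup_est, C_est)
```

Because the averages are convex combinations of $h_b(0.1)$ and $h_b(0.4)$, the estimate lies between those two values, and the resulting capacity is strictly positive.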
Example 5.14 (strong capacity) Continue from Example 5.13. To obtain the information rates of channel $W$, we first derive the relation between the CDF of the information density and that of the noise process, with respect to the uniform input that maximizes the information rate (in this case, $P_{X^n}(x^n) = P_{Y^n}(y^n) = 2^{-n}$):
$$\Pr\left[\frac{1}{n}\log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right]
= \Pr\left[\frac{1}{n}\log P_{N^n}(N^n) - \frac{1}{n}\log P_{Y^n}(Y^n) \le \theta\right]$$
$$= \Pr\left[\frac{1}{n}\log P_{N^n}(N^n) + \log(2) \le \theta\right]
= \Pr\left[-\frac{1}{n}\log P_{N^n}(N^n) \ge \log(2) - \theta\right]$$
$$= 1 - \Pr\left[-\frac{1}{n}\log P_{N^n}(N^n) < \log(2) - \theta\right]. \qquad (5.2.8)$$
Case A) From (5.2.8), it is obvious that
$$C_{SC} = \log(2) - \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n h_b(p_i) \text{ nats/channel usage.}$$

Case B) From (5.2.8) and also from Figure 5.2,
$$C_{SC} = \log(2) \text{ nats/channel usage.}$$
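The chain of equalities in (5.2.8) rests on the output being exactly uniform under a uniform input. This can be verified by brute force for a short block; the i.i.d. Bernoulli($p$) noise below is an illustrative stand-in (an assumption, not from the text) for the general noise process.

```python
import itertools, math

n, p = 6, 0.3  # illustrative blocklength and noise parameter

def P_N(noise):               # i.i.d. Bernoulli(p) noise probability
    k = sum(noise)
    return (p ** k) * ((1 - p) ** (n - k))

# Output distribution under a uniform input: P_Y(y) = sum_x 2^{-n} P_N(y xor x).
def P_Y(y):
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        noise = tuple(a ^ b for a, b in zip(x, y))
        total += (0.5 ** n) * P_N(noise)
    return total

# Check P_Y is uniform, and that the normalized information density equals
# log 2 + (1/n) log P_N(noise), as used in (5.2.8).
x = (1, 0, 1, 1, 0, 0)
noise = (0, 1, 1, 0, 0, 0)
y = tuple(a ^ b for a, b in zip(x, noise))
info_density = (1 / n) * math.log(P_N(noise) / P_Y(y))
identity_rhs = math.log(2) + (1 / n) * math.log(P_N(noise))
print(P_Y(y), info_density, identity_rhs)
```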
In Case B of Example 5.14, we have derived the ultimate CDF of the normalized information density, which is depicted in Figure 5.2. This limiting CDF is called the spectrum of the normalized information density.

In Figure 5.2, it has been shown that the channel capacity is 0 nats/channel usage, and the strong capacity is $\log(2)$ nats/channel usage. Hence, the operational meaning of the two margins has been clearly revealed. The question is: what is the operational meaning of the function values between 0 and $\log(2)$? The answer actually follows Definition 5.7 of a new capacity-related quantity.

In practice, it may not be easy to design a block code that transmits information with (asymptotically) no error through a very noisy channel at a rate equal to the channel capacity. However, if we admit some errors in transmission, such as an error probability bounded above by 0.001, we have a better chance of coming up with a feasible block code.
Example 5.15 ($\varepsilon$-capacity) Continue from Case B of Example 5.13. Let the spectrum of the normalized information density be $\underline{i}(\theta)$. Then the $\varepsilon$-capacity of this channel is actually the inverse function of $\underline{i}(\theta)$, i.e.,
$$C_\varepsilon = \underline{i}^{-1}(\varepsilon).$$

[Figure 5.2: The ultimate CDF of the normalized information density for Example 5.14, Case B); the CDF rises from 0 at $\theta = 0$ to 1 at $\theta = \log(2)$ nats.]

Note that the Shannon capacity can be written as
$$C = \lim_{\varepsilon\downarrow 0} C_\varepsilon;$$
and in general, the strong capacity satisfies
$$C_{SC} \ge \lim_{\varepsilon\uparrow 1} C_\varepsilon.$$
However, the above inequality holds with equality in this example.
5.3 Structures of good data transmission codes

The channel capacity for a discrete memoryless channel is known to be
$$C = \max_{P_X} I(P_X, Q_{Y|X}).$$
Let $P_X^*$ be the optimizer of the above maximization operation. Then
$$C = \max_{P_X} I(P_X, Q_{Y|X}) = I(P_X^*, Q_{Y|X}).$$
Here, the performance of the code is taken to be the average error probability, namely
$$P_e(\mathcal{C}_n) = \frac{1}{M}\sum_{i=1}^{M} P_e(\mathcal{C}_n \mid x_i^n),$$
if the codebook is $\mathcal{C}_n = \{x_1^n, x_2^n, \ldots, x_M^n\}$
. Due to the random coding argument, a deterministic good code with arbitrarily small error probability and rate less than the channel capacity must exist. The question is: what is the relationship between the good code and the optimizer $P_X^*$? It is widely believed that if the code is good (with rate close to capacity and low error probability), then the output statistics $P_{\hat{Y}^n}$, due to the equally likely codewords, must approximate the output distribution, denoted by $P_{Y^n}$, due to the input distribution achieving the channel capacity. This fact is actually reflected in the next theorem.
Theorem 5.16 For any channel $W^n = P_{Y^n|X^n}$ with finite input alphabet and capacity $C$ that satisfies the strong converse (i.e., $C = C_{SC}$), the following statement holds.

Fix $\varepsilon > 0$ and a sequence of $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ block codes with
$$\frac{1}{n}\log M_n \ge C - \varepsilon/2,$$
and vanishing error probability (i.e., the error probability approaches zero as the blocklength $n$ tends to infinity). Then,
$$\frac{1}{n}\left\| P_{\hat{Y}^n} - P_{Y^n} \right\| \le \varepsilon \quad \text{for all sufficiently large } n,$$
where $\hat{Y}^n$ is the output due to the block code and $Y^n$ is the output due to the $X^n$ that satisfies
$$I(X^n; Y^n) = \max_{\tilde{X}^n} I(\tilde{X}^n; \tilde{Y}^n).$$
To be specific,
$$P_{\hat{Y}^n}(y^n) = \sum_{x^n \in \mathcal{C}_n} P_{\hat{X}^n}(x^n)\, P_{W^n}(y^n|x^n) = \sum_{x^n \in \mathcal{C}_n} \frac{1}{M_n}\, P_{W^n}(y^n|x^n)$$
and
$$P_{Y^n}(y^n) = \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, P_{W^n}(y^n|x^n).$$

Note that the above theorem holds for arbitrary channels, not restricted to only discrete memoryless channels.
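The quantity compared in Theorem 5.16 can be computed exactly for toy parameters. The sketch below is an illustration only (the codebook, blocklength, and crossover probability are arbitrary assumptions): it compares the output distribution induced by two equally likely codewords over a BSC($p$) against the uniform output induced by the capacity-achieving i.i.d. uniform input.

```python
import itertools

n, p = 4, 0.1
codebook = [(0, 0, 0, 0), (1, 1, 1, 1)]   # illustrative (n, M=2) code

def W(y, x):   # BSC(p) transition probability for n uses
    d = sum(a != b for a, b in zip(x, y))
    return (p ** d) * ((1 - p) ** (n - d))

outputs = list(itertools.product((0, 1), repeat=n))
# Code-induced output: codewords used with equal probability 1/M.
P_hat = {y: sum(W(y, x) for x in codebook) / len(codebook) for y in outputs}
# Capacity-achieving input for the BSC is i.i.d. uniform, so P_Y* is uniform.
P_star = {y: 0.5 ** n for y in outputs}

# Variational distance ||P_hat - P_star|| = sum_y |P_hat(y) - P_star(y)|.
vd = sum(abs(P_hat[y] - P_star[y]) for y in outputs)
print(vd)
```

For this tiny repetition code the distance is visibly nonzero; Theorem 5.16 concerns the regime where rate approaches capacity and the normalized distance becomes small.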
One may ask whether a result in the spirit of the above theorem can be proved for the input statistics rather than the output statistics. The answer is negative. Hence, the statement that the input statistics of any good code must approximate those that maximize the mutual information is erroneously taken for granted. (However, we do not rule out the possibility of the existence of good codes that approximate those that maximize the mutual information.) To see this, simply consider the normalized entropy of $X^n$ versus that of $\hat{X}^n$ (which is uniformly distributed over the codewords) for discrete memoryless channels:
$$\frac{1}{n}H(X^n) - \frac{1}{n}H(\hat{X}^n) = \left[\frac{1}{n}H(X^n|Y^n) + \frac{1}{n}I(X^n; Y^n)\right] - \frac{1}{n}\log(M_n)$$
$$= \left[H(X|Y) + I(X;Y)\right] - \frac{1}{n}\log(M_n) = \left[H(X|Y) + C\right] - \frac{1}{n}\log(M_n).$$
A good code with vanishing error probability exists for $(1/n)\log(M_n)$ arbitrarily close to $C$; hence, we can find a good code sequence satisfying
$$\lim_{n\to\infty}\left[\frac{1}{n}H(X^n) - \frac{1}{n}H(\hat{X}^n)\right] = H(X|Y).$$
Since the term $H(X|Y)$ is in general positive (a quick example is the BSC with crossover probability $p$, which yields $H(X|Y) = -p\log(p) - (1-p)\log(1-p)$), the two input distributions do not necessarily resemble each other.
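The BSC value of the gap quoted above can be checked directly from the joint distribution. A minimal sketch, computing $H(X|Y)$ for a uniform input through a BSC($p$) and comparing it with the binary entropy $h_b(p)$ in nats:

```python
import math

p = 0.1
# Joint distribution of (X, Y) for a uniform input through a BSC(p).
joint = {(x, y): 0.5 * (p if x != y else 1 - p) for x in (0, 1) for y in (0, 1)}
P_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

# H(X|Y) = -sum_{x,y} P(x,y) log P(x|y), in nats.
H_X_given_Y = -sum(pr * math.log(pr / P_y[y]) for (x, y), pr in joint.items())
hb_p = -p * math.log(p) - (1 - p) * math.log(1 - p)
print(H_X_given_Y, hb_p)
```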
5.4 Approximations of output statistics: resolvability for channels

5.4.1 Motivations

The discussion of the previous section somewhat motivates the necessity of finding an equally-distributed (over a subset of the input alphabet) input distribution that generates output statistics close to the output due to the input that maximizes the mutual information. Since such approximations are usually performed by computers, it is natural to connect approximations of the input and output statistics with the concept of resolvability.

5.4.2 Notations and definitions of resolvability for channels

In a data transmission system as shown in Figure 5.3, suppose that the source, channel and output are respectively denoted by
$$X^n \triangleq (X_1, \ldots, X_n), \quad W^n \triangleq (W_1, \ldots, W_n),$$
and
$$Y^n \triangleq (Y_1, \ldots, Y_n),$$
where $W_i$ has distribution $P_{Y_i|X_i}$.
To simulate the behavior of the channel, a computer-generated input may be necessary, as shown in Figure 5.4. As stated in Chapter 4, such a computer-generated input is based on an algorithm formed by a few basic uniform random experiments, and hence has finite resolution. Our goal is to find a good computer-generated input $\tilde{X}^n$ such that the corresponding output $\tilde{Y}^n$ is very close to the true output $Y^n$.

[Figure 5.3: The communication system: the true source $\ldots, X_3, X_2, X_1$ feeds the true channel $P_{Y^n|X^n}$, producing the true output $\ldots, Y_3, Y_2, Y_1$.]

[Figure 5.4: The simulated communication system: the computer-generated source $\ldots, \tilde{X}_3, \tilde{X}_2, \tilde{X}_1$ feeds the true channel $P_{Y^n|X^n}$, producing the corresponding output $\ldots, \tilde{Y}_3, \tilde{Y}_2, \tilde{Y}_1$.]
Definition 5.17 ($\varepsilon$-resolvability for input $X$ and channel $W$) Fix $\varepsilon > 0$, and suppose that the (true) input random variable and (true) channel statistics are $X$ and $W = P_{Y|X}$, respectively. Then the $\varepsilon$-resolvability is
$$S_\varepsilon(X, W) \triangleq \min\left\{ R : (\forall\, \gamma > 0)(\exists\, \tilde{X} \text{ and } N)(\forall\, n > N)\ \frac{1}{n}R(\tilde{X}^n) < R + \gamma \text{ and } \|\tilde{Y}^n - Y^n\| < \varepsilon \right\},$$
where $P_{\tilde{Y}^n} = P_{\tilde{X}^n} P_{W^n}$. (The definitions of the resolution $R(\cdot)$ and the variational distance $\|(\cdot) - (\cdot)\|$ are given in Definitions 4.4 and 4.5.)

Note that if we take the channel $W^n$ to be an identity channel for all $n$, namely $\mathcal{Y}^n = \mathcal{X}^n$ and $P_{Y^n|X^n}(y^n|x^n)$ is either 1 or 0, then the resolvability for input $X$ and channel $W$ reduces to the source resolvability for $X$:
$$S_\varepsilon(X, W_{\text{Identity}}) = S_\varepsilon(X).$$
Similar reductions can be applied to all the following definitions.
Definition 5.18 (mean-$\varepsilon$-resolvability for input $X$ and channel $W$) Fix $\varepsilon > 0$, and suppose that the (true) input random variable and (true) channel statistics are respectively $X$ and $W$. Then the mean-$\varepsilon$-resolvability is
$$\bar{S}_\varepsilon(X, W) \triangleq \min\left\{ R : (\forall\, \gamma > 0)(\exists\, \tilde{X} \text{ and } N)(\forall\, n > N)\ \frac{1}{n}H(\tilde{X}^n) < R + \gamma \text{ and } \|\tilde{Y}^n - Y^n\| < \varepsilon \right\},$$
where $P_{\tilde{Y}^n} = P_{\tilde{X}^n} P_{W^n}$ and $P_{Y^n} = P_{X^n} P_{W^n}$.
Definition 5.19 (resolvability and mean-resolvability for input $X$ and channel $W$) The resolvability and mean-resolvability for input $X$ and $W$ are defined respectively as
$$S(X, W) \triangleq \sup_{\varepsilon > 0} S_\varepsilon(X, W) \quad \text{and} \quad \bar{S}(X, W) \triangleq \sup_{\varepsilon > 0} \bar{S}_\varepsilon(X, W).$$

Definition 5.20 (resolvability and mean-resolvability for channel $W$) The resolvability and mean-resolvability for channel $W$ are defined respectively as
$$S(W) \triangleq \sup_X S(X, W) \quad \text{and} \quad \bar{S}(W) \triangleq \sup_X \bar{S}(X, W).$$
5.4.3 Results on resolvability and mean-resolvability for channels

Theorem 5.21
$$S(W) = C_{SC} = \sup_X \bar{I}(X;Y).$$
It is a somewhat reasonable inference that if no computer algorithm can produce the desired good output statistics within the number of random nats specified, then all codes at this rate should be bad codes.

Theorem 5.22
$$\bar{S}(W) = C = \sup_X \underline{I}(X;Y).$$
The operational relation between the resolvability (or mean-resolvability) and the capacities for channels is not yet clear. This could be an interesting problem to think about.
Bibliography

[1] H. V. Poor and S. Verdú, "A lower bound on the probability of error in multihypothesis testing," IEEE Trans. on Information Theory, vol. IT-41, no. 6, pp. 1992-1994, Nov. 1995.

[2] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Information Theory, vol. IT-40, no. 4, pp. 1147-1157, Jul. 1994.
Chapter 6

Optimistic Shannon Coding Theorems for Arbitrary Single-User Systems

The conventional definitions of the source coding rate and channel capacity require the existence of reliable codes for all sufficiently large blocklengths. Alternatively, if it is required that good codes exist for infinitely many blocklengths, then optimistic definitions of source coding rate and channel capacity are obtained. In this chapter, formulas for the optimistic minimum achievable fixed-length source coding rate and the optimistic minimum $\varepsilon$-achievable source coding rate for arbitrary finite-alphabet sources are established. The expressions for the optimistic capacity and the optimistic $\varepsilon$-capacity of arbitrary single-user channels are also provided. The expressions of the optimistic source coding rate and capacity are examined for the class of information stable sources and channels, respectively. Finally, examples for the computation of optimistic capacity are presented.
6.1 Motivations

The conventional definition of the minimum achievable fixed-length source coding rate $T(X)$ (or $T_0(X)$) for a source $X$ (cf. Definition 3.2) requires the existence of reliable source codes for all sufficiently large blocklengths. Alternatively, if it is required that reliable codes exist for infinitely many blocklengths, a new, more optimistic definition of source coding rate (denoted by $\bar{T}(X)$) is obtained [11]. Similarly, the optimistic capacity $\bar{C}$ is defined by requiring the existence of reliable channel codes for infinitely many blocklengths, as opposed to the definition of the conventional channel capacity $C$ [12, Definition 1].

This concept of optimistic source coding rate and capacity has recently been investigated by Verdú et al. for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources and single-user channels [11, 12]. More specifically, they establish an additional operational characterization for the optimistic minimum achievable source coding rate ($\bar{T}(X)$ for source $X$) by demonstrating that, for a given source, the classical statement of the source-channel separation theorem$^1$ holds for every channel if $\bar{T}(X) = T(X)$ [11]. In a dual fashion, they also show that for channels with $\bar{C} = C$, the classical separation theorem holds for every source. They also conjecture that $\bar{T}(X)$ and $\bar{C}$ do not seem to admit a simple expression.

In this chapter, we demonstrate that $\bar{T}(X)$ and $\bar{C}$ do indeed have a general formula. The key to these results is the application of the generalized sup-information rate introduced in Chapter 2 to the existing proofs by Verdú and Han [12] of the direct and converse parts of the conventional coding theorems. We also provide a general expression for the optimistic minimum $\varepsilon$-achievable source coding rate and the optimistic $\varepsilon$-capacity.

For the generalized sup/inf-information/entropy rates which will play a key role in proving our optimistic coding theorems, readers may refer to Chapter 2.
6.2 Optimistic source coding theorems

In this section, we provide the optimistic source coding theorems. They are shown based on two new bounds due to Han [7] on the error probability of a source code as a function of its size. Interestingly, these bounds constitute the natural counterparts of the upper bound provided by Feinstein's Lemma and the Verdú-Han lower bound [12] on the error probability of a channel code. Furthermore, we show that for information stable sources, the formula for $\bar{T}(X)$ reduces to
$$\bar{T}(X) = \liminf_{n\to\infty} \frac{1}{n} H(X^n).$$
This is in contrast to the expression for $T(X)$, which is known to be
$$T(X) = \limsup_{n\to\infty} \frac{1}{n} H(X^n).$$
The above result leads us to observe that for sources that are both stationary and information stable, the classical separation theorem is valid for every channel.

In [11], Vembu et al. characterize the sources for which the classical separation theorem holds for every channel. They demonstrate that for a given source $X$, the separation theorem holds for every channel if its optimistic minimum achievable source coding rate ($\bar{T}(X)$) coincides with its conventional (or pessimistic) minimum achievable source coding rate ($T(X)$); i.e., if $\bar{T}(X) = T(X)$.

We herein establish a general formula for $\bar{T}(X)$. We prove that for any source $X$,
$$\bar{T}(X) = \lim_{\varepsilon\uparrow 1} \bar{H}_\varepsilon(X).$$
We also provide the general expression for the optimistic minimum $\varepsilon$-achievable source coding rate. We show these results based on two new bounds due to Han (one upper bound and one lower bound) on the error probability of a source code [7, Chapter 1]. The upper bound (Lemma 3.3) is the counterpart of Feinstein's Lemma for channel codes, while the lower bound (Lemma 3.4) is the counterpart of the Verdú-Han lower bound on the error probability of a channel code ([12, Theorem 4]). As in the case of the channel coding bounds, both source coding bounds (Lemmas 3.3 and 3.4) hold for arbitrary sources and for arbitrary fixed blocklengths.

$^1$By the classical statement of the source-channel separation theorem, we mean the following. Given a source $X$ with (conventional) source coding rate $T(X)$ and a channel $W$ with capacity $C$, then $X$ can be reliably transmitted over $W$ if $T(X) < C$. Conversely, if $T(X) > C$, then $X$ cannot be reliably transmitted over $W$. By reliable transmissibility of the source over the channel, we mean that there exists a sequence of joint source-channel codes such that the decoding error probability vanishes as the blocklength $n \to \infty$ (cf. [11]).
Definition 6.1 An $(n, M)$ fixed-length source code for $X^n$ is a collection of $M$ $n$-tuples $\mathcal{C}_n = \{c_1^n, \ldots, c_M^n\}$. The error probability of the code is
$$P_e(\mathcal{C}_n) \triangleq \Pr\left[ X^n \notin \mathcal{C}_n \right].$$
Definition 6.2 (optimistic $\varepsilon$-achievable source coding rate) Fix $0 < \varepsilon < 1$. $R \ge 0$ is an optimistic $\varepsilon$-achievable rate if, for every $\gamma > 0$, there exists a sequence of $(n, M_n)$ fixed-length source codes $\mathcal{C}_n$ such that
$$\limsup_{n\to\infty} \frac{1}{n}\log M_n \le R + \gamma$$
and
$$\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.$$
The infimum of all optimistic $\varepsilon$-achievable source coding rates for source $X$ is denoted by $\bar{T}_\varepsilon(X)$, and we refer to $\bar{T}(X) = \lim_{\varepsilon\downarrow 0} \bar{T}_\varepsilon(X) = \bar{T}_0(X)$ as the optimistic source coding rate.
We can then use Lemmas 3.3 and 3.4 (in a similar fashion to the general source coding theorem) to prove the general optimistic (fixed-length) source coding theorems.

Theorem 6.3 (optimistic minimum $\varepsilon$-achievable source coding rate formula) For any source $X$,
$$\bar{T}_\varepsilon(X) = \lim_{\delta\uparrow(1-\varepsilon)} \bar{H}_\delta(X).$$

Definition 6.4 (information stable sources) A source $X$ is said to be information stable if
$$\lim_{n\to\infty} \Pr\left\{ \left| \frac{h_{X^n}(X^n)}{H(X^n)} - 1 \right| > \gamma \right\} = 0 \quad \text{for every } \gamma > 0,$$
where $H(X^n) = E[h_{X^n}(X^n)]$ is the entropy of $X^n$.

Lemma 6.5 Every information stable source $X$ satisfies
$$\bar{T}(X) = \liminf_{n\to\infty} \frac{1}{n} H(X^n).$$
Proof:

1. $\bar{T}(X) \ge \liminf_{n\to\infty} (1/n) H(X^n)$:

Fix $\gamma > 0$ arbitrarily small. Using the fact that $h_{X^n}(X^n)$ is a non-negative bounded random variable for a finite alphabet, we can write the normalized block entropy as
$$\frac{1}{n}H(X^n) = E\left[\frac{1}{n}h_{X^n}(X^n)\right]
= E\left[\frac{1}{n}h_{X^n}(X^n)\,\mathbf{1}\left\{0 \le \frac{1}{n}h_{X^n}(X^n) \le \lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X) + \gamma\right\}\right]$$
$$+\; E\left[\frac{1}{n}h_{X^n}(X^n)\,\mathbf{1}\left\{\frac{1}{n}h_{X^n}(X^n) > \lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X) + \gamma\right\}\right]. \qquad (6.2.1)$$
From the definition of $\lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X)$, the probability of the event in the second indicator has liminf zero; since $(1/n)h_{X^n}(X^n)$ is bounded, the second expectation vanishes along a subsequence, while the first expectation never exceeds $\lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X) + \gamma$. As $\gamma$ is arbitrary, this yields
$$\bar{T}(X) = \lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X) \ge \liminf_{n\to\infty}\frac{1}{n}H(X^n).$$

2. $\bar{T}(X) \le \liminf_{n\to\infty} (1/n) H(X^n)$:
Fix $\gamma > 0$. For infinitely many $n$,
$$\Pr\left\{\frac{h_{X^n}(X^n)}{H(X^n)} - 1 > \gamma\right\}
= \Pr\left\{\frac{1}{n}h_{X^n}(X^n) > (1+\gamma)\left(\frac{1}{n}H(X^n)\right)\right\}$$
$$\ge \Pr\left\{\frac{1}{n}h_{X^n}(X^n) > (1+\gamma)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \gamma\right)\right\}.$$
Since $X$ is information stable, we obtain that
$$\liminf_{n\to\infty} \Pr\left\{\frac{1}{n}h_{X^n}(X^n) > (1+\gamma)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \gamma\right)\right\} = 0.$$
By the definition of $\lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X)$, this implies
$$\bar{T}(X) = \lim_{\varepsilon\uparrow 1}\bar{H}_\varepsilon(X) \le (1+\gamma)\left(\liminf_{n\to\infty}\frac{1}{n}H(X^n) + \gamma\right).$$
The proof is completed by noting that $\gamma$ can be made arbitrarily small. $\Box$
It is worth pointing out that if the source $X$ is both information stable and stationary, the above lemma yields
$$\bar{T}(X) = T(X) = \lim_{n\to\infty} \frac{1}{n} H(X^n).$$
This implies that given a stationary and information stable source $X$, the classical separation theorem holds for every channel.
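The gap that Lemma 6.5 closes can be made concrete numerically. The sketch below is an illustration only: the nonstationary independent source (its parameter pattern is an arbitrary assumption, not from the text) is information stable, yet $(1/n)H(X^n)$ oscillates, so $\bar{T}(X) = \liminf_n (1/n)H(X^n)$ is strictly smaller than $T(X) = \limsup_n (1/n)H(X^n)$.

```python
import math

def hb(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

# Illustrative independent nonstationary source (an assumption, not from the
# text): P{X_i = 1} = 0.05 when floor(log2(i)) is even, else 0.45.
def p_i(i):
    return 0.05 if (i.bit_length() - 1) % 2 == 0 else 0.45

N = 1 << 16
avg, run = [], 0.0
for i in range(1, N + 1):
    run += hb(p_i(i))
    avg.append(run / i)

tail = avg[N // 16:]
T_bar_est = min(tail)   # estimates liminf (1/n)H(X^n): the optimistic rate
T_est = max(tail)       # estimates limsup (1/n)H(X^n): the conventional rate
print(T_bar_est, T_est)
```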
6.3 Optimistic channel coding theorems

In this section, we state without proof the general expressions for the optimistic $\varepsilon$-capacity $\bar{C}_\varepsilon$ and the optimistic capacity
$$\bar{C} = \sup_X \bar{I}_0(X;Y).$$
We close this section by proving the formula of $\bar{C}$ for information stable channels.
Definition 6.6 (optimistic $\varepsilon$-achievable rate) Fix $0 < \varepsilon < 1$. $R \ge 0$ is an optimistic $\varepsilon$-achievable rate if there exists a sequence of $\mathcal{C}_n = (n, M_n)$ channel block codes such that
$$\liminf_{n\to\infty} \frac{1}{n}\log M_n \ge R$$
and
$$\liminf_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon.$$

Definition 6.7 (optimistic $\varepsilon$-capacity $\bar{C}_\varepsilon$) The optimistic $\varepsilon$-capacity $\bar{C}_\varepsilon$ is defined as the supremum of all optimistic $\varepsilon$-achievable rates. It is straightforward from the definition that $\bar{C}_\varepsilon$ is non-decreasing in $\varepsilon$, and that $\bar{C}_1 = \log|\mathcal{A}|$.

Definition 6.8 (optimistic capacity $\bar{C}$) The optimistic channel capacity $\bar{C}$ is defined as the supremum of the rates that are optimistically achievable for all $\varepsilon \in [0,1]$. It follows immediately from the definition that $\bar{C} = \inf_{0\le\varepsilon\le 1} \bar{C}_\varepsilon = \lim_{\varepsilon\downarrow 0} \bar{C}_\varepsilon = \bar{C}_0$, and that $\bar{C}$ is the supremum of all the rates $R$ for which there exists a sequence of $\mathcal{C}_n = (n, M_n)$ channel block codes such that
$$\liminf_{n\to\infty} \frac{1}{n}\log M_n \ge R,$$
and
$$\liminf_{n\to\infty} P_e(\mathcal{C}_n) = 0.$$
Theorem 6.9 (optimistic $\varepsilon$-capacity formula) Fix $0 < \varepsilon < 1$. The optimistic $\varepsilon$-capacity $\bar{C}_\varepsilon$ satisfies
$$\sup_X \lim_{\delta\uparrow\varepsilon} \bar{I}_\delta(X;Y) \le \bar{C}_\varepsilon \le \sup_X \bar{I}_\varepsilon(X;Y). \qquad (6.3.1)$$
Note that actually
$$\bar{C} = \lim_{\varepsilon\downarrow 0}\bar{C}_\varepsilon = \sup_X \bar{I}_0(X;Y).$$

We next investigate the expression of $\bar{C}$ for information stable channels. The expression for the capacity of information stable channels is already known (cf. for example [11]):
$$C = \liminf_{n\to\infty} C_n, \quad \text{where } C_n \triangleq \sup_{X^n} \frac{1}{n} I(X^n; Y^n).$$
We prove a dual formula for $\bar{C}$.
Definition 6.11 (information stable channels [6, 8]) A channel $W$ is said to be information stable if there exists an input process $X$ such that $0 < C_n < \infty$ for $n$ sufficiently large, and
$$\limsup_{n\to\infty} \Pr\left\{ \left| \frac{i_{X^n W^n}(X^n; Y^n)}{n C_n} - 1 \right| > \gamma \right\} = 0$$
for every $\gamma > 0$.
Lemma 6.12 Every information stable channel $W$ satisfies
$$\bar{C} = \limsup_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n).$$
Proof:

1. $\bar{C} \le \limsup_{n\to\infty} \sup_{X^n} (1/n) I(X^n; Y^n)$:

By using a similar argument as in the proof of [12, Theorem 8, property h)], we have
$$\bar{I}_0(X;Y) \le \limsup_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n).$$
Hence,
$$\bar{C} = \sup_X \bar{I}_0(X;Y) \le \limsup_{n\to\infty} \sup_{X^n} \frac{1}{n} I(X^n; Y^n).$$
2. $\bar{C} \ge \limsup_{n\to\infty} \sup_{X^n} (1/n) I(X^n; Y^n)$:

Suppose $\tilde{X}$ is the input process that makes the channel information stable. Fix $\gamma > 0$. Then for infinitely many $n$ (namely, those $n$ for which $C_n \ge \limsup_{n\to\infty} C_n - \gamma$),
$$P_{\tilde{X}^n W^n}\left[\frac{1}{n} i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n) \le (1-\gamma)\left(\limsup_{n\to\infty} C_n - \gamma\right)\right]
\le P_{\tilde{X}^n W^n}\left[\frac{i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n)}{n} < (1-\gamma)\, C_n\right]$$
$$= P_{\tilde{X}^n W^n}\left[\frac{i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n)}{n C_n} - 1 < -\gamma\right].$$
Since the channel is information stable, we get that
$$\liminf_{n\to\infty} P_{\tilde{X}^n W^n}\left[\frac{1}{n} i_{\tilde{X}^n W^n}(\tilde{X}^n; Y^n) \le (1-\gamma)\left(\limsup_{n\to\infty} C_n - \gamma\right)\right] = 0.$$
By the definition of $\bar{C}$, the above immediately implies that
$$\bar{C} = \sup_X \bar{I}_0(X;Y) \ge \bar{I}_0(\tilde{X}; Y) \ge (1-\gamma)\left(\limsup_{n\to\infty} C_n - \gamma\right).$$
Finally, the proof is completed by noting that $\gamma$ can be made arbitrarily small. $\Box$
Observations:

- It is known that for discrete memoryless channels, the optimistic capacity $\bar{C}$ is equal to the (conventional) capacity $C$ [12, 5]. The same result holds for modulo-$q$ additive noise channels with stationary ergodic noise. However, in general, $\bar{C} \ge C$ since $\bar{I}_0(X;Y) \ge \underline{I}(X;Y)$ [3, 4].

- Remark that Theorem 11 in [11] holds if, and only if,
$$\sup_X \underline{I}(X;Y) = \sup_X \bar{I}_0(X;Y).$$
Furthermore, note that, if $\bar{C} = C$ and there exists an input distribution $P_X$ that achieves $C$, then $P_X$ also achieves $\bar{C}$.
6.4 Examples

We provide four examples to illustrate the computation of $C$ and $\bar{C}$. The first two examples present information stable channels for which $\bar{C} > C$. The third example shows an information unstable channel for which $\bar{C} = C$. These examples indicate that information stability is neither necessary nor sufficient to ensure that $\bar{C} = C = \sup_X \underline{I}(X;Y)$.
6.4.1 Information stable channels

Example 6.13 Consider a nonstationary channel $W$ such that at odd blocklengths $n = 1, 3, 5, \ldots$, $W^n$ is the $n$-fold product of the transition distribution of a binary symmetric channel with crossover probability $1/8$ (BSC(1/8)), and at even blocklengths $n = 2, 4, 6, \ldots$, $W^n$ is the $n$-fold product of the distribution of a BSC(1/4). It can be easily verified that this channel is information stable. Since the channel is symmetric, a Bernoulli(1/2) input achieves $C_n = \sup_{X^n} (1/n) I(X^n; Y^n)$; thus
$$C_n = \begin{cases} 1 - h_b(1/8), & \text{for } n \text{ odd;} \\ 1 - h_b(1/4), & \text{for } n \text{ even,} \end{cases}$$
where $h_b(a) \triangleq -a \log_2 a - (1-a)\log_2(1-a)$ is the binary entropy function. Therefore, $C = \liminf_{n\to\infty} C_n = 1 - h_b(1/4)$ and $\bar{C} = \limsup_{n\to\infty} C_n = 1 - h_b(1/8) > C$.
Example 6.14 Here we use the information stable channel provided in [11, Section III] to show that $\bar{C} > C$. Let $\mathcal{N}$ be the set of all positive integers. Define the set $\mathcal{J}$ as
$$\mathcal{J} \triangleq \{ n \in \mathcal{N} : 2^{2i+1} \le n < 2^{2i+2},\ i = 0, 1, 2, \ldots \}
= \{2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, \ldots, 63, 128, 129, \ldots, 255, \ldots\}.$$
Consider the following nonstationary symmetric channel $W$. At times $n \in \mathcal{J}$, $W_n$ is a BSC(0), whereas at times $n \notin \mathcal{J}$, $W_n$ is a BSC(1/2). Put $W^n = W_1 W_2 \cdots W_n$. Here again $C_n$ is achieved by a Bernoulli(1/2) input $\tilde{X}^n$. We then obtain
$$C_n = \frac{1}{n}\sum_{i=1}^n I(\tilde{X}_i; Y_i) = \frac{1}{n}\left[ J(n)\cdot 1 + (n - J(n))\cdot 0 \right] = \frac{J(n)}{n},$$
where $J(n) \triangleq |\mathcal{J} \cap \{1, 2, \ldots, n\}|$. It can be shown that
$$\frac{J(n)}{n} = \begin{cases} 1 - \dfrac{2}{3}\cdot\dfrac{2^{\lfloor \log_2 n \rfloor}}{n} + \dfrac{1}{3n}, & \text{for } \lfloor \log_2 n \rfloor \text{ odd;} \\[4pt] \dfrac{2}{3}\cdot\dfrac{2^{\lfloor \log_2 n \rfloor}}{n} - \dfrac{2}{3n}, & \text{for } \lfloor \log_2 n \rfloor \text{ even.} \end{cases}$$
Consequently, $C = \liminf_{n\to\infty} C_n = 1/3$ and $\bar{C} = \limsup_{n\to\infty} C_n = 2/3$.
6.4.2 Information unstable channels

Example 6.15 The Polya-contagion channel: Consider a discrete additive channel with binary input and output alphabet $\{0, 1\}$ described by
$$Y_i = X_i \oplus Z_i, \quad i = 1, 2, \ldots,$$
where $X_i$, $Y_i$ and $Z_i$ are respectively the $i$th input, $i$th output and $i$th noise, and $\oplus$ represents modulo-2 addition. Suppose that the input process is independent of the noise process. Also assume that the noise sequence $\{Z_n\}_{n\ge 1}$ is drawn according to the Polya contagion urn scheme [1, 9] as follows: an urn originally contains $R$ red balls and $B$ black balls with $R < B$; the noise process makes successive draws from the urn; after each draw, it returns to the urn $1 + \Delta$ balls of the same color as was just drawn ($\Delta > 0$). The noise sequence $\{Z_i\}$ corresponds to the outcomes of the draws from the Polya urn: $Z_i = 1$ if the $i$th ball drawn is red and $Z_i = 0$ otherwise. Let $\rho \triangleq R/(R+B)$ and $\delta \triangleq \Delta/(R+B)$. It is shown in [1] that the noise process $\{Z_i\}$ is stationary and nonergodic; thus the channel is information unstable.

From Lemma 2 and Section IV in [4, Part I], we obtain
$$1 - \bar{H}_{1-\varepsilon}(Z) \le C_\varepsilon \le 1 - \lim_{\delta'\uparrow(1-\varepsilon)} \bar{H}_{\delta'}(Z),$$
and
$$1 - \underline{H}_{1-\varepsilon}(Z) \le \bar{C}_\varepsilon \le 1 - \lim_{\delta'\uparrow(1-\varepsilon)} \underline{H}_{\delta'}(Z).$$
It has been shown [1] that $-(1/n)\log P_{Z^n}(Z^n)$ converges in distribution to the continuous random variable $V \triangleq h_b(U)$, where $U$ is beta-distributed with parameters $(\rho/\delta, (1-\rho)/\delta)$, and $h_b(\cdot)$ is the binary entropy function. Thus
$$\bar{H}_{1-\varepsilon}(Z) = \lim_{\delta'\uparrow(1-\varepsilon)} \bar{H}_{\delta'}(Z) = \underline{H}_{1-\varepsilon}(Z) = \lim_{\delta'\uparrow(1-\varepsilon)} \underline{H}_{\delta'}(Z) = F_V^{-1}(1-\varepsilon),$$
where $F_V(a) \triangleq \Pr\{V \le a\}$ is the cumulative distribution function of $V$, and $F_V^{-1}(\cdot)$ is its inverse [1]. Consequently,
$$C_\varepsilon = \bar{C}_\varepsilon = 1 - F_V^{-1}(1-\varepsilon),$$
and
$$C = \bar{C} = \lim_{\varepsilon\downarrow 0}\left[ 1 - F_V^{-1}(1-\varepsilon) \right] = 0.$$
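The quantile $F_V^{-1}(1-\varepsilon)$ has no simple closed form, but it can be approximated by Monte Carlo sampling of $V = h_b(U)$. The sketch below is an illustration with arbitrarily chosen urn parameters (an assumption, not from the text):

```python
import math, random

def hb2(u):
    """Binary entropy in bits."""
    return 0.0 if u in (0.0, 1.0) else -u * math.log2(u) - (1 - u) * math.log2(1 - u)

rho, delta = 0.4, 0.5          # illustrative urn parameters (rho < 1/2)
random.seed(1)
samples = sorted(hb2(random.betavariate(rho / delta, (1 - rho) / delta))
                 for _ in range(200_000))

def F_V_inv(q):                # empirical quantile of V = h_b(U)
    idx = min(len(samples) - 1, int(q * len(samples)))
    return samples[idx]

def C_eps(eps):                # C_eps = 1 - F_V^{-1}(1 - eps), in bits
    return 1 - F_V_inv(1 - eps)

print(C_eps(0.1), C_eps(0.5))
```

Since $h_b(U) \in [0,1]$ with full support, the empirical $C_\varepsilon$ decreases to 0 as $\varepsilon \downarrow 0$, consistent with $C = \bar{C} = 0$.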
Example 6.16 Let $\tilde{W}_1, \tilde{W}_2, \ldots$ consist of the channel in Example 6.14, and let $\hat{W}_1, \hat{W}_2, \ldots$ consist of the channel in Example 6.15. Define a new channel $W$ as follows:
$$W_{2i} = \tilde{W}_i \quad \text{and} \quad W_{2i-1} = \hat{W}_i \quad \text{for } i = 1, 2, \ldots.$$
As in the previous examples, the channel is symmetric, and a Bernoulli(1/2) input maximizes the inf/sup-information rates. Therefore, for a Bernoulli(1/2) input $X$, we have
$$\Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right]
= \begin{cases}
\Pr\left[\dfrac{1}{2i}\left(\log\dfrac{P_{\tilde{W}^i}(\tilde{Y}^i|\tilde{X}^i)}{P_{\tilde{Y}^i}(\tilde{Y}^i)} + \log\dfrac{P_{\hat{W}^i}(\hat{Y}^i|\hat{X}^i)}{P_{\hat{Y}^i}(\hat{Y}^i)}\right) \le \theta\right], & \text{if } n = 2i; \\[8pt]
\Pr\left[\dfrac{1}{2i+1}\left(\log\dfrac{P_{\tilde{W}^i}(\tilde{Y}^i|\tilde{X}^i)}{P_{\tilde{Y}^i}(\tilde{Y}^i)} + \log\dfrac{P_{\hat{W}^{i+1}}(\hat{Y}^{i+1}|\hat{X}^{i+1})}{P_{\hat{Y}^{i+1}}(\hat{Y}^{i+1})}\right) \le \theta\right], & \text{if } n = 2i+1;
\end{cases}$$
$$= \begin{cases}
1 - \Pr\left[-\dfrac{1}{i}\log P_{Z^i}(Z^i) < 1 - 2\theta + \dfrac{1}{i}J(i)\right], & \text{if } n = 2i; \\[8pt]
1 - \Pr\left[-\dfrac{1}{i+1}\log P_{Z^{i+1}}(Z^{i+1}) < 1 - \left(2 - \dfrac{1}{i+1}\right)\theta + \dfrac{1}{i+1}J(i)\right], & \text{if } n = 2i+1.
\end{cases}$$
The fact that $-(1/i)\log P_{Z^i}(Z^i)$ converges in distribution to the continuous random variable $V \triangleq h_b(U)$, where $U$ is beta-distributed with parameters $(\rho/\delta, (1-\rho)/\delta)$, and the fact that
$$\liminf_{n\to\infty}\frac{1}{n}J(n) = \frac{1}{3} \quad \text{and} \quad \limsup_{n\to\infty}\frac{1}{n}J(n) = \frac{2}{3}$$
imply that
$$\underline{i}(\theta) \triangleq \liminf_{n\to\infty} \Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] = 1 - F_V\left(\frac{5}{3} - 2\theta\right),$$
and
$$\bar{i}(\theta) \triangleq \limsup_{n\to\infty} \Pr\left[\frac{1}{n}\log\frac{P_{W^n}(Y^n|X^n)}{P_{Y^n}(Y^n)} \le \theta\right] = 1 - F_V\left(\frac{4}{3} - 2\theta\right).$$
Consequently,
$$\bar{C}_\varepsilon = \frac{5}{6} - \frac{1}{2}F_V^{-1}(1-\varepsilon) \quad \text{and} \quad C_\varepsilon = \frac{2}{3} - \frac{1}{2}F_V^{-1}(1-\varepsilon).$$
Thus
$$0 < C = \frac{1}{6} < \bar{C} = \frac{1}{3} < C_{SC} = \frac{5}{6} < \log_2|\mathcal{Y}| = 1.$$
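The final chain of values follows by elementary arithmetic from the $C_\varepsilon$ and $\bar{C}_\varepsilon$ formulas, using $F_V^{-1}(0) = 0$ and $F_V^{-1}(1) = 1$ (since $V = h_b(U)$, in bits, has full support on $[0,1]$). A quick exact check:

```python
from fractions import Fraction

# Endpoints of V = h_b(U) in bits: F_V^{-1}(0) = 0 and F_V^{-1}(1) = 1.
Fv_inv_0, Fv_inv_1 = Fraction(0), Fraction(1)

C     = Fraction(2, 3) - Fraction(1, 2) * Fv_inv_1   # lim_{eps -> 0} C_eps
C_bar = Fraction(5, 6) - Fraction(1, 2) * Fv_inv_1   # lim_{eps -> 0} C_bar_eps
C_SC  = Fraction(5, 6) - Fraction(1, 2) * Fv_inv_0   # eps -> 1 end of bar{C}_eps
print(C, C_bar, C_SC)
```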
Bibliography

[1] F. Alajaji and T. Fuja, "A communication channel modeled on contagion," IEEE Trans. on Information Theory, vol. IT-40, no. 6, pp. 2035-2041, Nov. 1994.

[2] P.-N. Chen and F. Alajaji, "Strong converse, feedback capacity and hypothesis testing," Proc. of CISS, Johns Hopkins Univ., Baltimore, Mar. 1995.

[3] P.-N. Chen and F. Alajaji, "Generalization of information measures," Proc. Int. Symp. Inform. Theory & Applications, Victoria, Canada, September 1996.

[4] P.-N. Chen and F. Alajaji, "Generalized source coding theorems and hypothesis testing," Journal of the Chinese Institute of Engineering, vol. 21, no. 3, pp. 283-303, May 1998.

[5] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic, New York, 1981.

[6] R. L. Dobrushin, "General formulation of Shannon's basic theorems of information theory," AMS Translations, vol. 33, pp. 323-438, AMS, Providence, RI, 1963.

[7] T. S. Han, Information-Spectrum Methods in Information Theory, (in Japanese), Baifukan Press, Tokyo, 1998.

[8] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Holden-Day, 1964.

[9] G. Pólya, "Sur quelques points de la théorie des probabilités," Ann. Inst. H. Poincaré, vol. 1, pp. 117-161, 1931.

[10] Y. Steinberg, "New converses in the theory of identification via channels," IEEE Trans. on Information Theory, vol. 44, no. 3, pp. 984-998, May 1998.

[11] S. Vembu, S. Verdú and Y. Steinberg, "The source-channel separation theorem revisited," IEEE Trans. on Information Theory, vol. 41, no. 1, pp. 44-54, Jan. 1995.

[12] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Information Theory, vol. IT-40, no. 4, pp. 1147-1157, Jul. 1994.
Chapter 7

Lossy Data Compression

In this chapter, a rate-distortion theorem for arbitrary (not necessarily stationary or ergodic) discrete-time finite-alphabet sources is shown. The expression of the minimum achievable fixed-length coding rate subject to a fidelity criterion will also be provided.

7.1 General lossy source compression for block codes

In this section, we consider the problem of source coding with a fidelity criterion for arbitrary (not necessarily stationary or ergodic) discrete-time finite-alphabet sources. We prove a general rate-distortion theorem by establishing the expression of the minimum achievable block coding rate subject to a fidelity criterion. We relax all the constraints on the source statistics and the distortion measure. In other words, the source is not restricted to a DMS, and the distortion measure can be arbitrary.

Definition 7.1 (lossy compression block code) Given a finite source alphabet $\mathcal{Z}$ and a finite reproduction alphabet $\hat{\mathcal{Z}}$, a block code for data compression of blocklength $n$ and size $M$ is a mapping $f_n(\cdot) : \mathcal{Z}^n \to \hat{\mathcal{Z}}^n$ that results in $\|f_n\| = M$ codewords of length $n$, where each codeword is a sequence of $n$ reproducing letters.

Definition 7.2 (distortion measure) A distortion measure $\rho_n(\cdot,\cdot)$ is a mapping
$$\rho_n : \mathcal{Z}^n \times \hat{\mathcal{Z}}^n \to \Re^+ \triangleq [0, \infty).$$
We can view the distortion measure as the cost of representing a source $n$-tuple $z^n$ by a reproduction $n$-tuple $f_n(z^n)$.
111
Similar to the general results for the entropy rate and the information rate, the average distortion measure no longer has nice properties, such as convergence in probability, in general. Instead, its ultimate probability mass may spread over some region. Hence, in order to derive the fundamental limit for the lossy data compression code rate, a different technique should be applied. This technique, which employs the spectrum of the normalized distortion density defined later, is in fact the same as that used for the general lossless source coding theorem and the general channel coding theorem.
Definition 7.3 (distortion inf-spectrum) Let $(Z,\hat Z)$ and $\{\rho_n(\cdot,\cdot)\}_{n\ge 1}$ be given. The distortion inf-spectrum $\underline{\Lambda}_{Z,\hat Z}(\theta)$ is defined by
$$\underline{\Lambda}_{Z,\hat Z}(\theta) \triangleq \liminf_{n\to\infty} \Pr\left\{\frac{1}{n}\rho_n(Z^n,\hat Z^n) \le \theta\right\}.$$
Definition 7.4 (distortion inf-spectrum for a lossy compression code $f$) Let $Z$ and $\{\rho_n(\cdot,\cdot)\}_{n\ge 1}$ be given. Let $f(\cdot) \triangleq \{f_n(\cdot)\}_{n=1}^\infty$ denote a sequence of (lossy) data compression codes. The distortion inf-spectrum $\underline{\Lambda}_{Z,f(Z)}(\theta)$ for $f(\cdot)$ is defined by
$$\underline{\Lambda}_{Z,f(Z)}(\theta) \triangleq \liminf_{n\to\infty} \Pr\left\{\frac{1}{n}\rho_n(Z^n, f_n(Z^n)) \le \theta\right\}.$$
Definition 7.5 Fix $D > 0$ and $1 > \varepsilon > 0$. $R$ is an $\varepsilon$-achievable data compression rate at distortion $D$ for a source $Z$ if there exists a sequence of data compression codes $f_n(\cdot)$ with
$$\limsup_{n\to\infty}\frac{1}{n}\log\|f_n\| \le R,$$
and
$$\sup\left\{\theta : \underline{\Lambda}_{Z,f(Z)}(\theta) \le \varepsilon\right\} \le D. \qquad (7.1.1)$$
Note that (7.1.1), which can be rewritten as
$$\inf\left\{\theta : \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}\rho_n(Z^n, f_n(Z^n)) > \theta\right\} < 1-\varepsilon\right\} \le D,$$
is equivalent to stating that the limsup of the probability of excessive distortion (i.e., distortion larger than $D$) is smaller than $1-\varepsilon$.

The infimum of the $\varepsilon$-achievable data compression rates at distortion $D$ for $Z$ is denoted by $T_\varepsilon(D, Z)$.
Theorem 7.6 (general data compression theorem) Fix $D>0$ and $1 > \varepsilon > 0$. Let $Z$ and $\{\rho_n(\cdot,\cdot)\}_{n\ge 1}$ be given. Then
$$R_\varepsilon(D) \le T_\varepsilon(D, Z) \le R_\varepsilon(D-\gamma)$$
for any $\gamma > 0$, where
$$R_\varepsilon(D) \triangleq \inf_{P_{\hat Z|Z}\,:\; \sup\{\theta\,:\,\underline{\Lambda}_{Z,\hat Z}(\theta)\le\varepsilon\}\le D} \bar I(Z;\hat Z),$$
where the infimum is taken over all conditional distributions $P_{\hat Z|Z}$ for which the joint distribution $P_{Z,\hat Z} = P_Z P_{\hat Z|Z}$ satisfies the distortion constraint.
Proof:

1. Forward part (achievability): $T_\varepsilon(D,Z)\le R_\varepsilon(D-\gamma)$, or equivalently $T_\varepsilon(D+\gamma,Z)\le R_\varepsilon(D)$.

Choose $\gamma > 0$. We will prove the existence of a sequence of data compression codes with
$$\limsup_{n\to\infty}\frac1n \log\|f_n\| \le R_\varepsilon(D)+2\gamma \quad\text{and}\quad \sup\{\theta: \underline\Lambda_{Z,f(Z)}(\theta)\le\varepsilon\} \le D+\gamma.$$

Step 1: Let $P_{\hat Z|Z}$ be the distribution achieving $R_\varepsilon(D)$.

Step 2: Let $R = R_\varepsilon(D)+2\gamma$. Choose $M = e^{nR}$ codewords of blocklength $n$ independently according to $P_{\hat Z^n}$, where
$$P_{\hat Z^n}(\hat z^n) = \sum_{z^n\in\mathcal Z^n} P_{Z^n}(z^n)\, P_{\hat Z^n|Z^n}(\hat z^n|z^n),$$
and denote the resulting random codebook by $\mathcal{C}_n$.

Step 3: For a given $\mathcal C_n$, we denote by $A(\mathcal C_n)$ the set of sequences $z^n\in\mathcal Z^n$ such that there exists $\hat z^n \in \mathcal C_n$ with
$$\frac1n\rho_n(z^n,\hat z^n)\le D+\gamma.$$

Step 4: Claim:
$$\limsup_{n\to\infty} E_{\hat Z^n}\left[P_{Z^n}(A^c(\mathcal C_n))\right] < 1-\varepsilon.$$
The proof of this claim will be provided as a lemma following this theorem. Therefore there exists a sequence of codebooks $\mathcal C_n$ such that
$$\limsup_{n\to\infty} P_{Z^n}(A^c(\mathcal C_n)) < 1-\varepsilon.$$
Step 5: Define a sequence of codes $\{f_n\}_{n\ge1}$ by
$$f_n(z^n) = \begin{cases}\arg\min_{\hat z^n \in \mathcal C_n} \rho_n(z^n,\hat z^n), & \text{if } z^n \in A(\mathcal C_n);\\ \mathbf 0, & \text{otherwise},\end{cases}$$
where $\mathbf 0$ is a fixed default $n$-tuple in $\hat{\mathcal Z}^n$. Then
$$\left\{z^n \in \mathcal Z^n : \frac1n\rho_n(z^n, f_n(z^n)) \le D+\gamma\right\} \supseteq A(\mathcal C_n),$$
since $z^n\in A(\mathcal C_n)$ implies there exists $\hat z^n\in\mathcal C_n$ such that $(1/n)\rho_n(z^n,\hat z^n)\le D+\gamma$, which by definition of $f_n$ implies that $(1/n)\rho_n(z^n,f_n(z^n))\le D+\gamma$.

Step 6: Consequently,
$$\underline\Lambda_{Z,f(Z)}(D+\gamma) = \liminf_{n\to\infty} P_{Z^n}\left\{z^n\in\mathcal Z^n: \frac1n\rho_n(z^n,f_n(z^n))\le D+\gamma\right\}
\ge \liminf_{n\to\infty} P_{Z^n}(A(\mathcal C_n)) = 1-\limsup_{n\to\infty}P_{Z^n}(A^c(\mathcal C_n)) > \varepsilon.$$
Hence,
$$\sup\{\theta: \underline\Lambda_{Z,f(Z)}(\theta)\le\varepsilon\} \le D+\gamma,$$
where the last step is clearly depicted in Figure 7.1. This proves the forward part.
2. Converse part: $T_\varepsilon(D,Z) \ge R_\varepsilon(D)$. Suppose $\{f_n\}_{n=1}^\infty$ is a sequence of data compression codes such that
$$\sup\{\theta: \underline\Lambda_{Z,f(Z)}(\theta)\le\varepsilon\} \le D;$$
we show that $\limsup_{n\to\infty}(1/n)\log\|f_n\| \ge R_\varepsilon(D)$. Let
$$P_{\hat Z^n|Z^n}(\hat z^n|z^n) \triangleq \begin{cases}1, &\text{if } \hat z^n = f_n(z^n);\\ 0, &\text{otherwise}.\end{cases}$$

[Figure 7.1: $\underline\Lambda_{Z,f(Z)}(D+\gamma) > \varepsilon$ implies $\sup\{\theta : \underline\Lambda_{Z,f(Z)}(\theta)\le\varepsilon\} \le D+\gamma$.]

Then evaluating the statistical properties of the random sequence $\{(1/n)\rho_n(Z^n,f_n(Z^n))\}_{n=1}^\infty$ under the distribution $P_{Z^n}$ is equivalent to evaluating those of the random sequence $\{(1/n)\rho_n(Z^n,\hat Z^n)\}_{n=1}^\infty$ under the distribution $P_{Z^n,\hat Z^n}$. Therefore,
$$R_\varepsilon(D) = \inf_{P_{\hat Z|Z}:\; \sup\{\theta:\underline\Lambda_{Z,\hat Z}(\theta)\le\varepsilon\}\le D} \bar I(Z;\hat Z)
\le \bar I(Z;\hat Z) \le \bar H(\hat Z) - \underline H(\hat Z|Z) \le \bar H(\hat Z) \le \limsup_{n\to\infty}\frac1n\log\|f_n\|,$$
where the second inequality follows from (2.4.3), and the third inequality follows from the fact that $\underline H(\hat Z|Z)\ge 0$. □
Lemma 7.7 (cf. proof of Theorem 7.6)
$$\limsup_{n\to\infty} E_{\hat Z^n}\left[P_{Z^n}(A^c(\mathcal C_n))\right] < 1-\varepsilon.$$
Proof:

Step 1: Let
$$D^{(\varepsilon)} \triangleq \sup\{\theta: \underline\Lambda_{Z,\hat Z}(\theta)\le\varepsilon\}.$$
Define
$$A^{(\gamma)}_{n,\varepsilon} \triangleq \left\{(z^n,\hat z^n): \frac1n\rho_n(z^n,\hat z^n)\le D^{(\varepsilon)}+\gamma \ \text{ and }\ \frac1n i_{Z^n,\hat Z^n}(z^n;\hat z^n)\le \bar I(Z;\hat Z)+\gamma\right\}.$$
Since
$$\liminf_{n\to\infty}\Pr\left\{\mathcal T \triangleq \left[\frac1n\rho_n(Z^n,\hat Z^n)\le D^{(\varepsilon)}+\gamma\right]\right\} > \varepsilon$$
and
$$\liminf_{n\to\infty}\Pr\left\{\mathcal E^c \triangleq \left[\frac1n i_{Z^n,\hat Z^n}(Z^n;\hat Z^n)\le\bar I(Z;\hat Z)+\gamma\right]\right\} = 1,$$
we have
$$\liminf_{n\to\infty}\Pr(A^{(\gamma)}_{n,\varepsilon}) = \liminf_{n\to\infty}\Pr(\mathcal T\cap\mathcal E^c) \ge \liminf_{n\to\infty}\Pr(\mathcal T) + \liminf_{n\to\infty}\Pr(\mathcal E^c) - 1 > \varepsilon + 1 - 1 = \varepsilon.$$
Step 2: Let $K(z^n,\hat z^n)$ be the indicator function of $A^{(\gamma)}_{n,\varepsilon}$:
$$K(z^n,\hat z^n)=\begin{cases}1,&\text{if }(z^n,\hat z^n)\in A^{(\gamma)}_{n,\varepsilon};\\0,&\text{otherwise}.\end{cases}$$

Step 3: By following an argument similar to step 4 of the achievability part of Theorem 8.3 of Volume I, we obtain
$$E_{\hat Z^n}\left[P_{Z^n}(A^c(\mathcal C_n))\right] = \sum_{\mathcal C_n} P_{\hat Z^n}(\mathcal C_n)\sum_{z^n\notin A(\mathcal C_n)} P_{Z^n}(z^n)
= \sum_{z^n\in\mathcal Z^n} P_{Z^n}(z^n)\sum_{\mathcal C_n:\, z^n\notin A(\mathcal C_n)} P_{\hat Z^n}(\mathcal C_n)$$
$$= \sum_{z^n\in\mathcal Z^n} P_{Z^n}(z^n)\left[1-\sum_{\hat z^n\in\hat{\mathcal Z}^n} P_{\hat Z^n}(\hat z^n)K(z^n,\hat z^n)\right]^M$$
$$\le \sum_{z^n\in\mathcal Z^n} P_{Z^n}(z^n)\left[1-e^{-n(\bar I(Z;\hat Z)+\gamma)}\sum_{\hat z^n\in\hat{\mathcal Z}^n} P_{\hat Z^n|Z^n}(\hat z^n|z^n)K(z^n,\hat z^n)\right]^M$$
$$\le 1-\sum_{z^n\in\mathcal Z^n}\sum_{\hat z^n\in\hat{\mathcal Z}^n} P_{Z^n}(z^n)P_{\hat Z^n|Z^n}(\hat z^n|z^n)K(z^n,\hat z^n) + \exp\left\{-e^{n(R-\bar I(Z;\hat Z)-\gamma)}\right\},$$
where the first inequality holds because on $A^{(\gamma)}_{n,\varepsilon}$ we have $P_{\hat Z^n}(\hat z^n)\ge P_{\hat Z^n|Z^n}(\hat z^n|z^n)e^{-n(\bar I(Z;\hat Z)+\gamma)}$, and the second inequality uses $(1-xy)^M\le 1-x+e^{-My}$ for $x,y\in[0,1]$ with $M=e^{nR}$; since $R-\bar I(Z;\hat Z)-\gamma = \gamma>0$, the double-exponential term vanishes. Therefore,
$$\limsup_{n\to\infty} E_{\hat Z^n}\left[P_{Z^n}(A^c(\mathcal C_n))\right] \le 1-\liminf_{n\to\infty}\Pr(A^{(\gamma)}_{n,\varepsilon}) < 1-\varepsilon. \qquad □$$
For the probability-of-error distortion measure $\rho_n\colon \mathcal Z^n\times\mathcal Z^n\to[0,\infty)$, namely,
$$\rho_n(z^n,\hat z^n)=\begin{cases}n,&\text{if }z^n\ne\hat z^n;\\0,&\text{otherwise},\end{cases}$$
we define a data compression code $f_n\colon\mathcal Z^n\to\mathcal Z^n$ based on a chosen (asymptotically) lossless fixed-length data compression codebook $\mathcal C_n\subseteq\mathcal Z^n$:
$$f_n(z^n)=\begin{cases}z^n,&\text{if }z^n\in\mathcal C_n;\\ \mathbf 0,&\text{if }z^n\notin\mathcal C_n,\end{cases}$$
where $\mathbf 0$ is some default element in $\mathcal Z^n$. Then $(1/n)\rho_n(z^n,f_n(z^n))$ is either $1$ or $0$, which results in a cumulative distribution function as shown in Figure 7.2. Consequently, for any $\theta\in[0,1)$,
$$\Pr\left\{\frac1n\rho_n(Z^n,f_n(Z^n))\le\theta\right\} = \Pr\{Z^n=f_n(Z^n)\}.$$

[Figure 7.2: The CDF of $(1/n)\rho_n(Z^n,f_n(Z^n))$ for the probability-of-error distortion measure: a step of height $\Pr\{Z^n=f_n(Z^n)\}$ at $0$, rising to $1$ at $1$.]
By comparing the (asymptotically) lossless and lossy fixed-length compression theorems under the probability-of-error distortion measure, we observe that
$$R_\varepsilon(D)=\inf_{P_{\hat Z|Z}:\,\sup\{\theta:\underline\Lambda_{Z,\hat Z}(\theta)\le\varepsilon\}\le D}\bar I(Z;\hat Z)
=\begin{cases}0,& D\ge 1;\\[4pt] \displaystyle\inf_{P_{\hat Z|Z}:\,\liminf_{n\to\infty}\Pr\{Z^n=\hat Z^n\}>\varepsilon}\bar I(Z;\hat Z),& D<1,\end{cases}$$
$$=\begin{cases}0,& D\ge1;\\[4pt] \displaystyle\inf_{P_{\hat Z|Z}:\,\limsup_{n\to\infty}\Pr\{Z^n\ne\hat Z^n\}<1-\varepsilon}\bar I(Z;\hat Z),& D<1,\end{cases}$$
where
$$\underline\Lambda_{Z,\hat Z}(\theta)=\begin{cases}\liminf_{n\to\infty}\Pr\{Z^n=\hat Z^n\},&0\le\theta<1;\\1,&\theta\ge1.\end{cases}$$
In particular, in the extreme case where $\varepsilon$ goes to one,
$$\bar H(Z)=\inf_{P_{\hat Z|Z}:\,\limsup_{n\to\infty}\Pr\{Z^n\ne\hat Z^n\}=0}\bar I(Z;\hat Z).$$
Therefore, in this case, the data compression theorem reduces (as expected) to the asymptotically lossless fixed-length data compression theorem.
Chapter 8
Hypothesis Testing
8.1 Error exponent and divergence
Divergence can be adopted as a measure of how similar two distributions are. This quantity has the operational meaning that it is the exponent of the type II error probability in hypothesis testing with a fixed test level. For rigor, the exponent is first defined below.

Definition 8.1 (exponent) A real number $a$ is said to be the exponent for a sequence of nonnegative quantities $\{a_n\}_{n\ge1}$ converging to zero, if
$$a=\lim_{n\to\infty}\left(-\frac1n\log a_n\right).$$
Operationally, the exponent is an index of the exponential rate of convergence of the sequence $\{a_n\}$: for any $\delta>0$,
$$e^{-n(a+\delta)}\le a_n\le e^{-n(a-\delta)}$$
for $n$ large enough.
Recall that in proving the channel coding theorem, the probability of decoding error for channel block codes can be made arbitrarily close to zero when the rate of the codes is less than the channel capacity. This result can be mathematically written as:
$$P_e(\mathcal C_n)\to0,\quad\text{as }n\to\infty,$$
provided $R=\limsup_{n\to\infty}(1/n)\log\|\mathcal C_n\|<C$, where $\mathcal C_n$ is the optimal code for blocklength $n$. From the theorem, we only know that the decoding error will vanish as the blocklength increases; it does not reveal how fast the decoding error approaches zero. In other words, we do not know the rate of convergence of the decoding error. Sometimes this information is very important, especially when deciding the blocklength sufficient to achieve some error bound.

The first step in investigating the rate of convergence of the decoding error is to compute its exponent, provided the decoding error decays to zero exponentially fast (it indeed does for memoryless channels). This exponent, as a function of the rate, is in fact called the channel reliability function, and will be discussed in the next chapter.

For hypothesis testing problems, the type II error probability at a fixed test level also decays to zero as the number of observations increases. As it turns out, its exponent is the divergence of the null hypothesis distribution against the alternative hypothesis distribution.
Lemma 8.2 (Stein's lemma) For a sequence of i.i.d. observations $X^n$, which is possibly drawn from either the null hypothesis distribution $P_{X^n}$ or the alternative hypothesis distribution $P_{\hat X^n}$, the type II error satisfies
$$(\forall\,\varepsilon\in(0,1))\quad \lim_{n\to\infty}-\frac1n\log\beta_n(\varepsilon)=D(P_X\|P_{\hat X}),$$
where $\beta_n(\varepsilon)=\min_{\alpha_n\le\varepsilon}\beta_n$, and $\alpha_n$ and $\beta_n$ represent the type I and type II errors, respectively.
Proof: [1. Forward Part]

In the forward part, we prove that there exists an acceptance region for the null hypothesis such that
$$\liminf_{n\to\infty}-\frac1n\log\beta_n(\varepsilon)\ge D(P_X\|P_{\hat X}).$$

Step 1 (divergence typical set): For any $\delta>0$, define the divergence typical set as
$$\mathcal A_n(\delta)\triangleq\left\{x^n: \left|\frac1n\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)}-D(P_X\|P_{\hat X})\right|<\delta\right\}.$$
Note that inside the divergence typical set,
$$P_{\hat X^n}(x^n)\le P_{X^n}(x^n)\,e^{-n(D(P_X\|P_{\hat X})-\delta)}.$$

Step 2 (computation of type I error): By the weak law of large numbers,
$$P_{X^n}(\mathcal A_n(\delta))\to1.$$
Hence
$$\alpha_n=P_{X^n}(\mathcal A^c_n(\delta))<\varepsilon$$
for sufficiently large $n$.

Step 3 (computation of type II error):
$$\beta_n(\varepsilon)=P_{\hat X^n}(\mathcal A_n(\delta))=\sum_{x^n\in\mathcal A_n(\delta)}P_{\hat X^n}(x^n)
\le\sum_{x^n\in\mathcal A_n(\delta)}P_{X^n}(x^n)\,e^{-n(D(P_X\|P_{\hat X})-\delta)}=e^{-n(D(P_X\|P_{\hat X})-\delta)}(1-\alpha_n).$$
Hence,
$$-\frac1n\log\beta_n(\varepsilon)\ge D(P_X\|P_{\hat X})-\delta-\frac1n\log(1-\alpha_n),$$
which implies
$$\liminf_{n\to\infty}-\frac1n\log\beta_n(\varepsilon)\ge D(P_X\|P_{\hat X})-\delta.$$
The above inequality is true for any $\delta>0$; therefore
$$\liminf_{n\to\infty}-\frac1n\log\beta_n(\varepsilon)\ge D(P_X\|P_{\hat X}).$$
[2. Converse Part]

In the converse part, we will prove that for any acceptance region $B_n$ for the null hypothesis satisfying the type I error constraint, i.e.,
$$\alpha_n(B_n)=P_{X^n}(B_n^c)\le\varepsilon,$$
its type II error $\beta_n(B_n)$ satisfies
$$\limsup_{n\to\infty}-\frac1n\log\beta_n(B_n)\le D(P_X\|P_{\hat X}).$$
Indeed,
$$\beta_n(B_n)=P_{\hat X^n}(B_n)\ge P_{\hat X^n}(B_n\cap\mathcal A_n(\delta))=\sum_{x^n\in B_n\cap\mathcal A_n(\delta)}P_{\hat X^n}(x^n)
\ge\sum_{x^n\in B_n\cap\mathcal A_n(\delta)}P_{X^n}(x^n)\,e^{-n(D(P_X\|P_{\hat X})+\delta)}$$
$$=e^{-n(D(P_X\|P_{\hat X})+\delta)}\,P_{X^n}(B_n\cap\mathcal A_n(\delta))
\ge e^{-n(D(P_X\|P_{\hat X})+\delta)}\left[1-P_{X^n}(B_n^c)-P_{X^n}(\mathcal A_n^c(\delta))\right]$$
$$\ge e^{-n(D(P_X\|P_{\hat X})+\delta)}\left[1-\varepsilon-P_{X^n}(\mathcal A_n^c(\delta))\right].$$
Hence,
$$-\frac1n\log\beta_n(B_n)\le D(P_X\|P_{\hat X})+\delta-\frac1n\log\left[1-\varepsilon-P_{X^n}(\mathcal A_n^c(\delta))\right],$$
which implies that
$$\limsup_{n\to\infty}-\frac1n\log\beta_n(B_n)\le D(P_X\|P_{\hat X})+\delta.$$
The above inequality is true for any $\delta>0$; therefore,
$$\limsup_{n\to\infty}-\frac1n\log\beta_n(B_n)\le D(P_X\|P_{\hat X}). \qquad □$$
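Stein's lemma can be checked numerically. The sketch below (an illustration, not from the text) takes Bernoulli hypotheses, computes the type II error $\beta_n = P_{\hat X^n}(\mathcal A_n(\delta))$ of the divergence-typical acceptance region exactly (the normalized log-likelihood ratio depends on $x^n$ only through its number of ones), and verifies that its exponent is sandwiched around $D(P_X\|P_{\hat X})$; the lower bound $D - \delta$ is exactly the one established in step 3 above.

```python
# Numeric check of Stein's lemma for Bernoulli hypotheses (a sketch):
# beta_n = P_{X^n hat}(A_n(delta)) is computed exactly and its exponent
# -(1/n) log beta_n is compared against D(P_X || P_X_hat).
from math import log, exp, lgamma

p, q = 0.7, 0.5          # null P_X = Bern(p), alternative P_X_hat = Bern(q)
D = p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))  # divergence in nats
delta = 0.02
n = 2000

def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# The normalized LLR depends on x^n only through k = number of ones in x^n.
log_beta_terms = []
for k in range(n + 1):
    llr = (k / n) * log(p / q) + (1 - k / n) * log((1 - p) / (1 - q))
    if abs(llr - D) < delta:  # x^n lies in the divergence typical set
        log_beta_terms.append(log_binom(n, k) + k * log(q) + (n - k) * log(1 - q))

m = max(log_beta_terms)
log_beta = m + log(sum(exp(t - m) for t in log_beta_terms))  # log-sum-exp
exponent = -log_beta / n
print(exponent)  # close to D = 0.0823 nat
assert D - delta <= exponent <= D + delta + 0.01
```

The lower bound on the exponent holds for every $n$ (it is the inequality $\beta_n \le e^{-n(D-\delta)}$ from the proof); the upper bound reflects the converse asymptotically.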
8.1.1 Composition of a sequence of i.i.d. observations

Stein's lemma gives the exponent of the type II error probability at a fixed test level. This exponent, which is the divergence of the null hypothesis distribution against the alternative hypothesis distribution, is independent of the type I error bound for i.i.d. observations.

Specifically, under the i.i.d. assumption, the probability of each sequence $x^n$ depends only on its composition, which is defined as the $|\mathcal X|$-dimensional vector
$$\left(\frac{\#1(x^n)}{n},\frac{\#2(x^n)}{n},\ldots,\frac{\#k(x^n)}{n}\right),$$
where $\mathcal X=\{1,2,\ldots,k\}$ and $\#i(x^n)$ is the number of occurrences of symbol $i$ in $x^n$. The probability of $x^n$ can therefore be written as
$$P_{X^n}(x^n)=P_X(1)^{\#1(x^n)}\,P_X(2)^{\#2(x^n)}\cdots P_X(k)^{\#k(x^n)}.$$
Note that $\#1(x^n)+\cdots+\#k(x^n)=n$. Since the composition of a sequence determines its probability, all sequences with the same composition have the same statistical properties, and hence should be treated identically in processing. Instead of manipulating the sequences of observations based on the typical-set-like concept, we may focus on their compositions. As it turns out, such an approach yields simpler proofs and better geometric explanations for theories under the i.i.d. assumption. (It should be pointed out that in cases where the composition alone cannot determine the probability, this viewpoint does not seem to be effective.)
Lemma 8.3 (polynomial bound on the number of compositions) The number of compositions increases polynomially fast in $n$, while the number of possible sequences increases exponentially fast.

Proof: Let $\mathcal P_n$ denote the set of all possible compositions for $x^n\in\mathcal X^n$. Since each numerator of the components of the $|\mathcal X|$-dimensional composition vector ranges from $0$ to $n$, it is obvious that $|\mathcal P_n|\le(n+1)^{|\mathcal X|}$, which increases polynomially with $n$. However, the number of possible sequences is $|\mathcal X|^n$, which is of exponential order with respect to $n$. □
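The contrast in Lemma 8.3 is easy to see numerically. The exact count of compositions of length-$n$ sequences over a $k$-letter alphabet is $\binom{n+k-1}{k-1}$ (a standard stars-and-bars fact, not stated in the text), which the sketch below compares against the bound $(n+1)^{|\mathcal X|}$ and the exponential sequence count:

```python
# Illustrating Lemma 8.3: compositions (types) grow polynomially in n,
# sequences grow exponentially.
from math import comb

k = 3  # alphabet size |X|
for n in (5, 10, 20, 40):
    num_compositions = comb(n + k - 1, k - 1)  # exact number of types
    num_sequences = k ** n                     # exponential count
    assert num_compositions <= (n + 1) ** k    # the bound used in the proof
    print(n, num_compositions, num_sequences)
```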
Each composition (an $|\mathcal X|$-dimensional vector) actually represents a possible probability mass function. In terms of the composition distribution, we can compute the exponent of the probability of the set of observations sharing the same composition.
Lemma 8.4 (probability of sequences of the same composition) The probability of the sequences of composition $\mathcal C$ with respect to the distribution $P_{X^n}$ satisfies
$$\frac{1}{(n+1)^{|\mathcal X|}}\,e^{-nD(P_{\mathcal C}\|P_X)}\le P_{X^n}(\mathcal C)\le e^{-nD(P_{\mathcal C}\|P_X)},$$
where $P_{\mathcal C}$ is the composition distribution for composition $\mathcal C$, and $\mathcal C$ (by abusing notation without ambiguity) is also used to represent the set of all sequences (in $\mathcal X^n$) of composition $\mathcal C$.

Proof: For any sequence $x^n$ of composition
$$\mathcal C=\left(\frac{\#1(x^n)}{n},\frac{\#2(x^n)}{n},\ldots,\frac{\#k(x^n)}{n}\right)=(P_{\mathcal C}(1),P_{\mathcal C}(2),\ldots,P_{\mathcal C}(k)),$$
where $k=|\mathcal X|$,
$$P_{X^n}(x^n)=\prod_{i=1}^k P_X(i)^{\#i(x^n)}=\prod_{x=1}^k P_X(x)^{nP_{\mathcal C}(x)}
=e^{n\sum_{x\in\mathcal X}P_{\mathcal C}(x)\log P_X(x)}
=e^{-n\left(D(P_{\mathcal C}\|P_X)+H(P_{\mathcal C})\right)}.$$
Similarly, we have
$$P_{\mathcal C}(x^n)=e^{-n\left[D(P_{\mathcal C}\|P_{\mathcal C})+H(P_{\mathcal C})\right]}=e^{-nH(P_{\mathcal C})}.\qquad(8.1.1)$$
($P_{\mathcal C}$ originally represents a one-dimensional distribution over $\mathcal X$. Here, without ambiguity, we reuse the notation $P_{\mathcal C}(\cdot)$ for its $n$-dimensional product extension, i.e., $P_{\mathcal C}(x^n)=\prod_{i=1}^n P_{\mathcal C}(x_i)$.) We can therefore bound the number of sequences in $\mathcal C$ as
$$1\ge\sum_{x^n\in\mathcal C}P_{\mathcal C}(x^n)=\sum_{x^n\in\mathcal C}e^{-nH(P_{\mathcal C})}=|\mathcal C|\,e^{-nH(P_{\mathcal C})},$$
which implies $|\mathcal C|\le e^{nH(P_{\mathcal C})}$. Consequently,
$$P_{X^n}(\mathcal C)=\sum_{x^n\in\mathcal C}P_{X^n}(x^n)=|\mathcal C|\,e^{-n(D(P_{\mathcal C}\|P_X)+H(P_{\mathcal C}))}
\le e^{nH(P_{\mathcal C})}e^{-n(D(P_{\mathcal C}\|P_X)+H(P_{\mathcal C}))}=e^{-nD(P_{\mathcal C}\|P_X)}.\qquad(8.1.2)$$
This ends the proof of the upper bound.
From (8.1.2), we observe that deriving a lower bound on $P_{X^n}(\mathcal C)$ is equivalent to finding a lower bound on $|\mathcal C|$, which requires the next inequality. For any composition $\mathcal C'$ with corresponding composition set $\mathcal C'$,
$$\frac{P_{\mathcal C}(\mathcal C)}{P_{\mathcal C}(\mathcal C')}
=\frac{|\mathcal C|}{|\mathcal C'|}\cdot\frac{\prod_{i\in\mathcal X}P_{\mathcal C}(i)^{nP_{\mathcal C}(i)}}{\prod_{i\in\mathcal X}P_{\mathcal C}(i)^{nP_{\mathcal C'}(i)}}
=\prod_{j\in\mathcal X}\frac{(nP_{\mathcal C'}(j))!}{(nP_{\mathcal C}(j))!}\cdot\prod_{i\in\mathcal X}P_{\mathcal C}(i)^{nP_{\mathcal C}(i)-nP_{\mathcal C'}(i)}$$
$$\ge\prod_{j\in\mathcal X}(nP_{\mathcal C}(j))^{nP_{\mathcal C'}(j)-nP_{\mathcal C}(j)}\prod_{i\in\mathcal X}P_{\mathcal C}(i)^{nP_{\mathcal C}(i)-nP_{\mathcal C'}(i)}
\qquad\left(\text{using the inequality }\frac{m!}{n!}\ge n^{m-n}\right)$$
$$=n^{\,n\sum_{j\in\mathcal X}(P_{\mathcal C'}(j)-P_{\mathcal C}(j))}=n^0=1.\qquad(8.1.3)$$
Hence,
$$1=\sum_{\mathcal C'}P_{\mathcal C}(\mathcal C')\le\sum_{\mathcal C'}\max_{\mathcal C''}P_{\mathcal C}(\mathcal C'')=\sum_{\mathcal C'}P_{\mathcal C}(\mathcal C)\qquad(8.1.4)$$
$$\le(n+1)^{|\mathcal X|}P_{\mathcal C}(\mathcal C)=(n+1)^{|\mathcal X|}|\mathcal C|\,e^{-nH(P_{\mathcal C})},\qquad(8.1.5)$$
where (8.1.4) and (8.1.5) follow from (8.1.3) and (8.1.1), respectively. Therefore, a lower bound on $|\mathcal C|$ is obtained as
$$|\mathcal C|\ge\frac{1}{(n+1)^{|\mathcal X|}}\,e^{nH(P_{\mathcal C})}.\qquad □$$
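The two-sided bound of Lemma 8.4 can be verified exactly for a small binary example, where each composition class is the set of sequences with a given number of ones (a sketch; $|\mathcal X| = 2$ here):

```python
# Verifying Lemma 8.4 for a Bernoulli source: for every composition class C,
# (n+1)^{-|X|} e^{-n D(P_C||P_X)} <= P_{X^n}(C) <= e^{-n D(P_C||P_X)}.
from math import log, exp, comb

def kl(p, q):  # D(Bern(p) || Bern(q)) in nats, with 0 log 0 = 0
    out = 0.0
    if p > 0:
        out += p * log(p / q)
    if p < 1:
        out += (1 - p) * log((1 - p) / (1 - q))
    return out

px, n = 0.3, 60
for k in range(n + 1):
    pc = k / n                                      # composition distribution
    prob = comb(n, k) * px**k * (1 - px)**(n - k)   # P_{X^n}(C)
    upper = exp(-n * kl(pc, px))
    lower = upper / (n + 1) ** 2                    # |X| = 2
    assert lower * (1 - 1e-9) <= prob <= upper * (1 + 1e-9)
print("bounds verified for all", n + 1, "composition classes")
```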
Theorem 8.5 (Sanov's theorem) Let $\mathcal E_n$ be the set consisting of all compositions over a finite alphabet $\mathcal X$ whose composition distribution belongs to $\mathcal P$. Fix a sequence of product distributions $P_{X^n}=\prod_{i=1}^n P_X$. Then
$$\liminf_{n\to\infty}-\frac1n\log P_{X^n}(\mathcal E_n)\ge\inf_{P\in\mathcal P}D(P\|P_X).$$
If, in addition, for every distribution $P$ in $\mathcal P$ there exists a sequence of composition distributions $P_{\mathcal C_1},P_{\mathcal C_2},P_{\mathcal C_3},\ldots\in\mathcal P$ such that $\limsup_{n\to\infty}D(P_{\mathcal C_n}\|P_X)=D(P\|P_X)$, then
$$\limsup_{n\to\infty}-\frac1n\log P_{X^n}(\mathcal E_n)\le\inf_{P\in\mathcal P}D(P\|P_X).$$

Proof:
$$P_{X^n}(\mathcal E_n)=\sum_{\mathcal C\in\mathcal E_n}P_{X^n}(\mathcal C)\le\sum_{\mathcal C\in\mathcal E_n}e^{-nD(P_{\mathcal C}\|P_X)}
\le\sum_{\mathcal C\in\mathcal E_n}e^{-n\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X)}
\le(n+1)^{|\mathcal X|}e^{-n\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X)}.$$
Let $\mathcal C^*\in\mathcal E_n$ satisfy
$$D(P_{\mathcal C^*}\|P_X)=\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X),$$
whose existence is substantiated by the fact that the number of compositions in $\mathcal E_n$ is finite. Then
$$P_{X^n}(\mathcal E_n)=\sum_{\mathcal C\in\mathcal E_n}P_{X^n}(\mathcal C)\ge\sum_{\mathcal C\in\mathcal E_n}\frac{1}{(n+1)^{|\mathcal X|}}e^{-nD(P_{\mathcal C}\|P_X)}
\ge\frac{1}{(n+1)^{|\mathcal X|}}e^{-nD(P_{\mathcal C^*}\|P_X)}.$$
Therefore,
$$\frac{1}{(n+1)^{|\mathcal X|}}e^{-n\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X)}\le P_{X^n}(\mathcal E_n)\le(n+1)^{|\mathcal X|}e^{-n\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X)},$$
which immediately gives
$$\liminf_{n\to\infty}-\frac1n\log P_{X^n}(\mathcal E_n)\ge\liminf_{n\to\infty}\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X)\ge\inf_{P\in\mathcal P}D(P\|P_X).$$
Now suppose that for $P\in\mathcal P$ there exists a sequence of composition distributions $P_{\mathcal C_1},P_{\mathcal C_2},P_{\mathcal C_3},\ldots\in\mathcal P$ such that $\limsup_{n\to\infty}D(P_{\mathcal C_n}\|P_X)=D(P\|P_X)$. Then
$$-\frac1n\log P_{X^n}(\mathcal E_n)\le\min_{\mathcal C\in\mathcal E_n}D(P_{\mathcal C}\|P_X)+\frac{|\mathcal X|\log(n+1)}{n}\le D(P_{\mathcal C_n}\|P_X)+\frac{|\mathcal X|\log(n+1)}{n},$$
which implies that
$$\limsup_{n\to\infty}-\frac1n\log P_{X^n}(\mathcal E_n)\le\limsup_{n\to\infty}D(P_{\mathcal C_n}\|P_X)=D(P\|P_X).$$
As the above inequality is true for every $P$ in $\mathcal P$,
$$\limsup_{n\to\infty}-\frac1n\log P_{X^n}(\mathcal E_n)\le\inf_{P\in\mathcal P}D(P\|P_X).\qquad □$$
The geometric explanation of Sanov's theorem is depicted in Figure 8.1. The figure shows that if we adopt the divergence as the measure of distance in the distribution space, and $\mathcal P$ is a set of distributions that can be approached by composition distributions, then for any distribution $P_X$ outside $\mathcal P$, we can measure the shortest distance from $P_X$ to $\mathcal P$ in terms of the divergence measure. This shortest distance, as it turns out, is exactly the exponent of $P_{X^n}(\mathcal E_n)$. The following example shows that Sanov's viewpoint is quite effective, especially for i.i.d. observations.

[Figure 8.1: The geometric meaning of Sanov's theorem: within the probability simplex, the set $\mathcal P$ and the point $P_X$ outside it, at divergence distance $\inf_{P\in\mathcal P}D(P\|P_X)$ from $\mathcal P$.]
Example 8.6 One wants to roughly estimate the probability that the average of the throws is greater than or equal to $4$ when tossing a fair die $n$ times. Observe that whether the requirement is satisfied depends only on the composition of the observations. Let $\mathcal E_n$ be the set of compositions which satisfy the requirement, and let $P_X$ be the distribution of a fair die throw. Then $\mathcal E_n$ can be written as
$$\mathcal E_n=\left\{\mathcal C:\sum_{i=1}^6 iP_{\mathcal C}(i)\ge4\right\}.$$
To minimize $D(P_{\mathcal C}\|P_X)$ for $\mathcal C\in\mathcal E_n$, we can use the Lagrange multiplier technique (since divergence is convex with respect to its first argument) with the constraints on $P_{\mathcal C}$ being
$$\sum_{i=1}^6 iP_{\mathcal C}(i)=k\quad\text{and}\quad\sum_{i=1}^6 P_{\mathcal C}(i)=1$$
for each attainable value $k\ge4$ of the mean. So the problem becomes minimizing
$$\sum_{i=1}^6 P_{\mathcal C}(i)\log\frac{P_{\mathcal C}(i)}{P_X(i)}+\lambda_1\left(\sum_{i=1}^6 iP_{\mathcal C}(i)-k\right)+\lambda_2\left(\sum_{i=1}^6 P_{\mathcal C}(i)-1\right).$$
By taking derivatives, we find that the minimizer should be of the form
$$P_{\mathcal C}(i)=\frac{e^{\lambda_1 i}}{\sum_{j=1}^6 e^{\lambda_1 j}},$$
where $\lambda_1$ is chosen to satisfy
$$\sum_{i=1}^6 iP_{\mathcal C}(i)=k.\qquad(8.1.6)$$
Since the above holds for all $k\ge4$, it suffices to take the smallest one as our solution, i.e., $k=4$. Finally, by solving (8.1.6) for $k=4$ numerically, the minimizer is
$$P_{\mathcal C}=(0.1031,0.1227,0.1461,0.1740,0.2072,0.2468),$$
and the exponent of the desired probability is $D(P_{\mathcal C}\|P_X)=0.0433$ nat. Consequently,
$$P_{X^n}(\mathcal E_n)\approx e^{-0.0433\,n}.$$
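The numerical step of solving (8.1.6) can be sketched as a bisection on the tilt parameter $\lambda_1$ (the tilted mean is monotone in $\lambda_1$), after which the exponent $D(P_{\mathcal C}\|P_X)$ is evaluated directly:

```python
# Solving (8.1.6) for k = 4: find lambda_1 so the tilted fair-die distribution
# has mean 4, then compute the exponent D(P_C || uniform) in nats.
from math import exp, log

def tilted(lam):
    w = [exp(lam * i) for i in range(1, 7)]  # P_C(i) proportional to e^{lam * i}
    s = sum(w)
    return [x / s for x in w]

def mean(p):
    return sum((i + 1) * pi for i, pi in enumerate(p))

lo, hi = 0.0, 1.0  # mean(tilted(0)) = 3.5 < 4 < mean(tilted(1))
for _ in range(100):  # bisection on the mean constraint
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean(tilted(mid)) < 4 else (lo, mid)

p = tilted((lo + hi) / 2)
divergence = sum(pi * log(6 * pi) for pi in p)  # D(P_C || P_X), P_X uniform
print([round(pi, 4) for pi in p], round(divergence, 4))
assert abs(divergence - 0.0433) < 1e-3
```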
8.1.2 Divergence typical set on compositions

In Stein's lemma, one of the most important steps is to define the divergence typical set, which is
$$\mathcal A_n(\delta)\triangleq\left\{x^n:\left|\frac1n\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)}-D(P_X\|P_{\hat X})\right|<\delta\right\}.$$
One of the key features of the divergence typical set is that its probability eventually tends to $1$ with respect to $P_{X^n}$. In terms of compositions, we can define another divergence typical set with the same feature as
$$\mathcal T_n(\delta)\triangleq\{x^n\in\mathcal X^n: D(P_{\mathcal C_{x^n}}\|P_X)\le\delta\},$$
where $\mathcal C_{x^n}$ represents the composition of $x^n$. The above typical set can be rewritten in terms of composition sets as
$$\mathcal T_n(\delta)\triangleq\{\mathcal C: D(P_{\mathcal C}\|P_X)\le\delta\}.$$
That $P_{X^n}(\mathcal T_n(\delta))\to1$ is justified by
$$1-P_{X^n}(\mathcal T_n(\delta))=\sum_{\mathcal C:\,D(P_{\mathcal C}\|P_X)>\delta}P_{X^n}(\mathcal C)
\le\sum_{\mathcal C:\,D(P_{\mathcal C}\|P_X)>\delta}e^{-nD(P_{\mathcal C}\|P_X)}\quad\text{(from Lemma 8.4)}$$
$$\le\sum_{\mathcal C:\,D(P_{\mathcal C}\|P_X)>\delta}e^{-n\delta}\le(n+1)^{|\mathcal X|}e^{-n\delta}\quad\text{(cf. Lemma 8.3)},$$
which vanishes as $n\to\infty$.
8.1.3 Universal source coding on compositions

It is sometimes not possible to know the true source distribution. In such cases, instead of finding the best lossless data compression code for a specific source distribution, one may turn to designing a code which is good for all possible candidate distributions (of some specific class). Here, "good" means that the average codeword length achieves the (unknown) source entropy for all sources with distributions in the considered class.

To be more precise, let
$$f_n\colon\mathcal X^n\to\bigcup_{i=1}^\infty\{0,1\}^i$$
be a fixed encoding function. Then for any possible source distribution $P_{X^n}$ in the considered class, the resultant average codeword length for this specific code satisfies
$$\frac1n\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\ell(f_n(x^n))\to H(X)$$
as $n$ goes to infinity, where $\ell(\cdot)$ represents the length of $f_n(x^n)$. Then $f_n$ is said to be a universal code for the source distributions in the considered class.

Example 8.7 (universal encoding using compositions) First, binary-index the compositions using $\lceil\log_2(n+1)^{|\mathcal X|}\rceil$ bits, and denote this binary index for composition $\mathcal C$ by $a(\mathcal C)$. Note that the number of compositions is at most $(n+1)^{|\mathcal X|}$. Let $\mathcal C_{x^n}$ denote the composition of $x^n$, i.e., $x^n\in\mathcal C_{x^n}$. For each composition $\mathcal C$, we know that the number of sequences $x^n$ in $\mathcal C$ is at most $2^{nH(P_{\mathcal C})}$ (here, $H(P_{\mathcal C})$ is measured in bits, i.e., the logarithmic base in the entropy computation is $2$; see the proof of Lemma 8.4). Hence, we can binary-index the elements in $\mathcal C$ using $\lceil nH(P_{\mathcal C})\rceil$ bits. Denote this binary index for an element of $\mathcal C$ by $b(x^n)$. Define a universal encoding function $f_n$ as
$$f_n(x^n)=\text{concatenation}\{a(\mathcal C_{x^n}),\,b(x^n)\}.$$
Then this encoding rule is a universal code for all i.i.d. sources.
Proof: All logarithms in entropy and divergence computations are taken to base $2$ throughout the proof.

For any i.i.d. distribution with marginal $P_X$, the associated average codeword length satisfies
$$\bar\ell_n=\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\ell(a(\mathcal C_{x^n}))+\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\ell(b(x^n))$$
$$\le\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\lceil\log_2(n+1)^{|\mathcal X|}\rceil+\sum_{x^n\in\mathcal X^n}P_{X^n}(x^n)\,\lceil nH(P_{\mathcal C_{x^n}})\rceil$$
$$\le|\mathcal X|\log_2(n+1)+2+\sum_{\mathcal C}P_{X^n}(\mathcal C)\,nH(P_{\mathcal C}).$$
Hence, the normalized average codeword length is upper-bounded by
$$\frac1n\bar\ell_n\le\frac{|\mathcal X|\log_2(n+1)+2}{n}+\sum_{\mathcal C}P_{X^n}(\mathcal C)H(P_{\mathcal C}).$$
Since the first term on the right-hand side vanishes as $n$ tends to infinity, only the second term contributes to the ultimate normalized average codeword length, which in turn can be bounded above using the typical set $\mathcal T_n(\delta)$ as
$$\sum_{\mathcal C}P_{X^n}(\mathcal C)H(P_{\mathcal C})
=\sum_{\mathcal C\in\mathcal T_n(\delta)}P_{X^n}(\mathcal C)H(P_{\mathcal C})+\sum_{\mathcal C\notin\mathcal T_n(\delta)}P_{X^n}(\mathcal C)H(P_{\mathcal C})$$
$$\le\max_{\mathcal C:\,D(P_{\mathcal C}\|P_X)\le\delta/\log(2)}H(P_{\mathcal C})+\sum_{\mathcal C:\,D(P_{\mathcal C}\|P_X)>\delta/\log(2)}2^{-nD(P_{\mathcal C}\|P_X)}H(P_{\mathcal C})\quad\text{(from Lemma 8.4)}$$
$$\le\max_{\mathcal C:\,D(P_{\mathcal C}\|P_X)\le\delta/\log(2)}H(P_{\mathcal C})+\sum_{\mathcal C:\,D(P_{\mathcal C}\|P_X)>\delta/\log(2)}e^{-n\delta}\log_2|\mathcal X|$$
$$\le\max_{\mathcal C:\,D(P_{\mathcal C}\|P_X)\le\delta/\log(2)}H(P_{\mathcal C})+(n+1)^{|\mathcal X|}e^{-n\delta}\log_2|\mathcal X|,$$
where the second term of the last expression vanishes as $n\to\infty$. (Note that when the base-2 logarithm is taken in the divergence instead of the natural logarithm, the threshold $\delta$ in $\mathcal T_n(\delta)$ should be replaced by $\delta/\log(2)$.) It remains to show that
$$\max_{\mathcal C:\,D(P_{\mathcal C}\|P_X)\le\delta/\log(2)}H(P_{\mathcal C})\le H(X)+\zeta(\varepsilon),$$
where $\zeta(\varepsilon)$ depends only on $\varepsilon$ and approaches zero as $\varepsilon\to0$.

Since $D(P_{\mathcal C}\|P_X)\le\delta/\log(2)$,
$$\|P_{\mathcal C}-P_X\|\le\sqrt{2\log(2)\,D(P_{\mathcal C}\|P_X)}\le\sqrt{2\delta},$$
where $\|P_{\mathcal C}-P_X\|$ represents the variational distance between $P_{\mathcal C}$ and $P_X$, which implies that for all $x\in\mathcal X$,
$$|P_{\mathcal C}(x)-P_X(x)|\le\sum_{x\in\mathcal X}|P_{\mathcal C}(x)-P_X(x)|=\|P_{\mathcal C}-P_X\|\le\sqrt{2\delta}.$$
Fix $0<\varepsilon<\min\left\{\min_{x:\,P_X(x)>0}P_X(x),\,2/e^2\right\}$ and choose $\delta=\varepsilon^2/8$. Then for all $x\in\mathcal X$, $|P_{\mathcal C}(x)-P_X(x)|\le\varepsilon/2$.

For those $x$ satisfying $P_X(x)\ge\varepsilon$, the following is true. From the mean value theorem, namely
$$|t\log t-s\log s|\le\max_{u\in[t,s]\cup[s,t]}\left|\frac{d(u\log u)}{du}\right|\,|t-s|,$$
we have
$$|P_{\mathcal C}(x)\log P_{\mathcal C}(x)-P_X(x)\log P_X(x)|
\le\max_{u\in[\varepsilon/2,\,1]}|1+\log u|\cdot\frac\varepsilon2
=\frac\varepsilon2\,\left|1+\log\frac\varepsilon2\right|,$$
since $P_{\mathcal C}(x)\ge P_X(x)-\varepsilon/2\ge\varepsilon-\varepsilon/2=\varepsilon/2$, and the last step follows from $\varepsilon<2/e^2$.

For those $x$ satisfying $P_X(x)=0$, we have $P_{\mathcal C}(x)<\varepsilon/2$; hence
$$|P_{\mathcal C}(x)\log P_{\mathcal C}(x)-P_X(x)\log P_X(x)|=|P_{\mathcal C}(x)\log P_{\mathcal C}(x)|\le\frac\varepsilon2\log\frac2\varepsilon,$$
since $\varepsilon<2/e^2<2/e$. Consequently,
$$|H(P_{\mathcal C})-H(X)|\le\frac1{\log(2)}\sum_{x\in\mathcal X}|P_{\mathcal C}(x)\log P_{\mathcal C}(x)-P_X(x)\log P_X(x)|
\le\frac{|\mathcal X|}{\log(2)}\max\left\{\frac\varepsilon2\log\frac2\varepsilon,\ \frac\varepsilon2\left|1+\log\frac\varepsilon2\right|\right\}\triangleq\zeta(\varepsilon).\qquad □$$
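The two-part code of Example 8.7 can be checked numerically for a binary alphabet: each codeword costs $\lceil\log_2(n+1)^{|\mathcal X|}\rceil$ bits for the type index plus $\lceil nH(P_{\mathcal C})\rceil$ bits for the index within the class, and the expected rate stays within $O(\log n/n)$ of the true (unknown to the encoder) entropy. A sketch:

```python
# Expected rate of the two-part universal code for Bernoulli(px) sources;
# the encoder never uses px, yet the rate approaches the entropy H(px).
from math import log, log2, exp, ceil, lgamma

def h2(p):  # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def expected_rate(px, n):
    total = 0.0
    for k in range(n + 1):
        # log P_{X^n}(C) via lgamma to avoid overflow/underflow
        log_p = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                 + (k * log(px) if k > 0 else 0.0)
                 + ((n - k) * log(1 - px) if k < n else 0.0))
        length = ceil(log2((n + 1) ** 2)) + ceil(n * h2(k / n))  # |X| = 2
        total += exp(log_p) * length
    return total / n

for px in (0.2, 0.5, 0.8):
    rate = expected_rate(px, 2000)
    print(px, round(rate, 4), round(h2(px), 4))
    assert rate <= h2(px) + 0.02  # overhead O(log n / n) above the entropy
```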
8.1.4 Likelihood ratio versus divergence

Recall that the Neyman–Pearson lemma indicates that the optimal test for two hypotheses ($H_0\colon P_{X^n}$ against $H_1\colon P_{\hat X^n}$) is of the form
$$\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)}\ \underset{H_1}{\overset{H_0}{\gtrless}}\ \tau\qquad(8.1.7)$$
for some threshold $\tau$. This is the likelihood ratio test, and the quantity $P_{X^n}(x^n)/P_{\hat X^n}(x^n)$ is called the likelihood ratio. If the logarithm is taken on both sides of (8.1.7), the test remains intact, and the resulting quantity becomes the log-likelihood ratio, which can be rewritten in terms of the divergences of the composition distribution against the hypothesis distributions. To be precise, let $\mathcal C_{x^n}$ be the composition of $x^n$, and let $P_{\mathcal C_{x^n}}$ be the corresponding composition distribution. Then
$$\log\frac{P_{X^n}(x^n)}{P_{\hat X^n}(x^n)}=\sum_{i=1}^n\log\frac{P_X(x_i)}{P_{\hat X}(x_i)}
=\sum_{a\in\mathcal X}[\#a(x^n)]\log\frac{P_X(a)}{P_{\hat X}(a)}
=\sum_{a\in\mathcal X}[nP_{\mathcal C_{x^n}}(a)]\log\frac{P_X(a)}{P_{\hat X}(a)}$$
$$=n\sum_{a\in\mathcal X}\left[P_{\mathcal C_{x^n}}(a)\log\frac{P_{\mathcal C_{x^n}}(a)}{P_{\hat X}(a)}-P_{\mathcal C_{x^n}}(a)\log\frac{P_{\mathcal C_{x^n}}(a)}{P_X(a)}\right]
=n\left[D(P_{\mathcal C_{x^n}}\|P_{\hat X})-D(P_{\mathcal C_{x^n}}\|P_X)\right].$$
Hence, (8.1.7) is equivalent to
$$D(P_{\mathcal C_{x^n}}\|P_{\hat X})-D(P_{\mathcal C_{x^n}}\|P_X)\ \underset{H_1}{\overset{H_0}{\gtrless}}\ \frac1n\log\tau.\qquad(8.1.8)$$
This equivalence means that for hypothesis testing under i.i.d. observations, it suffices to select the acceptance region based on compositions instead of individual observations. In other words, the optimal decision function can be defined as
$$\phi(\mathcal C)=\begin{cases}0,&\text{if composition }\mathcal C\text{ is classified as belonging to the null hypothesis according to (8.1.8)};\\1,&\text{otherwise}.\end{cases}$$
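The identity used above, $\log[P_{X^n}(x^n)/P_{\hat X^n}(x^n)] = n[D(P_{\mathcal C_{x^n}}\|P_{\hat X})-D(P_{\mathcal C_{x^n}}\|P_X)]$, holds exactly for every sequence. The sketch below verifies it on a made-up three-letter example (the alphabet and distributions are illustrative assumptions):

```python
# Verifying the log-likelihood-ratio / divergence identity on one sequence.
from math import log
from collections import Counter

P_X    = {'a': 0.5, 'b': 0.3, 'c': 0.2}  # null hypothesis (illustrative)
P_Xhat = {'a': 0.2, 'b': 0.3, 'c': 0.5}  # alternative hypothesis
x = list("aabacbccab")
n = len(x)

llr = sum(log(P_X[s] / P_Xhat[s]) for s in x)  # direct log-likelihood ratio

counts = Counter(x)
P_C = {s: counts[s] / n for s in P_X}          # composition distribution

def D(p, q):
    return sum(p[s] * log(p[s] / q[s]) for s in p if p[s] > 0)

via_divergence = n * (D(P_C, P_Xhat) - D(P_C, P_X))
assert abs(llr - via_divergence) < 1e-9
print(round(llr, 6))
```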
8.1.5 Exponent of Bayesian cost

Randomization of the decision $\phi(\cdot)$ can improve the resultant type II error under the Neyman–Pearson criterion of fixed test level; however, it cannot improve the Bayesian cost. The latter statement can be easily justified by noting that the contribution to the Bayesian cost of randomizing $\phi(\cdot)$ as
$$\phi(x^n)=\begin{cases}0,&\text{with probability }\mu;\\1,&\text{with probability }1-\mu,\end{cases}$$
satisfies
$$\pi_0\,\mu\,P_{X^n}(x^n)+\pi_1(1-\mu)P_{\hat X^n}(x^n)\ge\min\left\{\pi_0P_{X^n}(x^n),\ \pi_1P_{\hat X^n}(x^n)\right\}.$$
Therefore, assigning $x^n$ deterministically to either the null hypothesis or the alternative hypothesis improves the Bayesian cost for $\mu\in(0,1)$. Since under i.i.d. statistics all observations with the same composition have the same probability, the above statement is also true for decision functions defined on compositions.

Now suppose the acceptance region for the null hypothesis is chosen to be a set of compositions, denoted by
$$\mathcal A\triangleq\left\{\mathcal C: D(P_{\mathcal C}\|P_{\hat X})-D(P_{\mathcal C}\|P_X)>\tau'\right\}.$$
Then by Sanov's theorem, the exponent of the type II error $\beta_n$ is
$$\min_{\mathcal C\in\mathcal A}D(P_{\mathcal C}\|P_{\hat X}).$$
Similarly, the exponent of the type I error $\alpha_n$ is
$$\min_{\mathcal C\in\mathcal A^c}D(P_{\mathcal C}\|P_X).$$
Because whether or not an element $x^n$ with composition $\mathcal C_{x^n}$ is in $\mathcal A$ is determined by the difference of $D(P_{\mathcal C_{x^n}}\|P_{\hat X})$ and $D(P_{\mathcal C_{x^n}}\|P_X)$, it can be expected that as $n$ goes to infinity, there exists $P_{\tilde X}$ such that the exponents for the type I error and type II error are respectively $D(P_{\tilde X}\|P_X)$ and $D(P_{\tilde X}\|P_{\hat X})$ (cf. Figure 8.2). The solution for $P_{\tilde X}$ can be found using the Lagrange multiplier technique by reformulating the problem as minimizing $D(P_{\mathcal C}\|P_X)$ subject to the condition $D(P_{\mathcal C}\|P_{\hat X})=D(P_{\mathcal C}\|P_X)+\tau'$. By taking the derivative of
$$D(P_{\tilde X}\|P_X)+\lambda\left[D(P_{\tilde X}\|P_{\hat X})-D(P_{\tilde X}\|P_X)-\tau'\right]+\nu\left[\sum_{x\in\mathcal X}P_{\tilde X}(x)-1\right]$$
with respect to each $P_{\tilde X}(x)$, we have
$$\log\frac{P_{\tilde X}(x)}{P_X(x)}+1+\lambda\log\frac{P_X(x)}{P_{\hat X}(x)}+\nu=0.$$
Then the optimal $P_{\tilde X}(x)$ satisfies
$$P_{\tilde X}(x)=P_\lambda(x)\triangleq\frac{P_X^{\lambda}(x)P_{\hat X}^{1-\lambda}(x)}{\sum_{a\in\mathcal X}P_X^{\lambda}(a)P_{\hat X}^{1-\lambda}(a)}.$$

[Figure 8.2: The divergence view on hypothesis testing: the tilted family $\{P_\lambda\}$ traces a path between $P_{\hat X}$ and $P_X$ over the probability space; the exponents $D(P_{\tilde X}\|P_X)$ and $D(P_{\tilde X}\|P_{\hat X})$ are the divergence distances to the boundary point $P_{\tilde X}$ determined by $D(P_{\mathcal C}\|P_{\hat X})=D(P_{\mathcal C}\|P_X)+\tau'$.]

As $\lambda$ varies, $P_\lambda$ traces a path over the probability space. When $\lambda\to0$, $P_\lambda\to P_{\hat X}$; when $\lambda\to1$, $P_\lambda\to P_X$. Usually, $P_\lambda$ is named the tilted or twisted distribution. The value of $\lambda$ depends on $\tau'=(1/n)\log\tau$. It is known from detection theory that the best $\tau$ for Bayes testing is $\pi_1/\pi_0$, which is fixed. Therefore,
$$\tau'=\lim_{n\to\infty}\frac1n\log\frac{\pi_1}{\pi_0}=0,$$
which implies that the optimal exponent for the Bayes error is the minimum of $D(P_\lambda\|P_X)$ subject to $D(P_\lambda\|P_X)=D(P_\lambda\|P_{\hat X})$, namely the "midpoint" of the path between $P_X$ and $P_{\hat X}$ in the probability space. This quantity is called the Chernoff bound.
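The Chernoff-bound computation described above can be sketched numerically: walk the tilted family $P_\lambda \propto P_X^\lambda P_{\hat X}^{1-\lambda}$ and locate the point where the two divergences coincide (the distributions below are illustrative assumptions):

```python
# Bayes-error exponent (Chernoff information): find lambda where
# D(P_lambda || P_X) = D(P_lambda || P_X_hat) along the tilted family.
from math import log

P_X    = [0.7, 0.2, 0.1]
P_Xhat = [0.1, 0.3, 0.6]

def tilt(lam):
    w = [p**lam * q**(1 - lam) for p, q in zip(P_X, P_Xhat)]
    s = sum(w)
    return [x / s for x in w]

def D(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# g(lam) = D(P_lam||P_X) - D(P_lam||P_Xhat) is > 0 at lam=0 (P_0 = P_Xhat)
# and < 0 at lam=1 (P_1 = P_X), so bisection finds the crossing point.
lo, hi = 0.0, 1.0
for _ in range(200):
    mid = (lo + hi) / 2
    p = tilt(mid)
    lo, hi = (mid, hi) if D(p, P_X) - D(p, P_Xhat) > 0 else (lo, mid)

p = tilt((lo + hi) / 2)
chernoff = D(p, P_X)
assert abs(D(p, P_X) - D(p, P_Xhat)) < 1e-9
print(round(chernoff, 6))
```

At the crossing point both error exponents equal the Chernoff bound, which is strictly smaller than either one-sided divergence (Stein's exponents).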
8.2 Large deviations theory

Large deviations theory basically considers techniques for computing the exponent of an exponentially decaying probability. Such techniques are very useful in understanding the exponential behavior of error probabilities. You have already seen a few good examples in hypothesis testing. In this section, we will introduce its fundamental theories and techniques.
8.2.1 Tilted or twisted distributions

Suppose the probability $P_{X^n}(\mathcal A_n)$ of a set $\mathcal A_n$ decreases to zero exponentially fast, with exponent equal to $a>0$. Over the probability space, let $\mathcal P$ denote the set of those distributions $P_{\tilde X}$ for which $P_{\tilde X^n}(\mathcal A_n)$ exhibits zero exponent. Then, applying a concept similar to Sanov's theorem, we can expect that
$$a=\min_{P_{\tilde X}\in\mathcal P}D(P_{\tilde X}\|P_X).$$
Now suppose the minimizer of the above function occurs at $f(P_{\tilde X})=\tau$ for some constant $\tau$ and some differentiable function $f(\cdot)$ (e.g., $\mathcal P\triangleq\{P_{\tilde X}: f(P_{\tilde X})=\tau\}$). Then the minimizer should be of the form
$$(\forall\,a\in\mathcal X)\quad P_{\tilde X}(a)=\frac{P_X(a)\exp\left\{\lambda\,\dfrac{\partial f(P_{\tilde X})}{\partial P_{\tilde X}(a)}\right\}}{\displaystyle\sum_{a'\in\mathcal X}P_X(a')\exp\left\{\lambda\,\dfrac{\partial f(P_{\tilde X})}{\partial P_{\tilde X}(a')}\right\}}.$$
As a result, $P_{\tilde X}$ is the distribution obtained from $P_X$ by exponential twisting via the partial derivatives of the function $f(\cdot)$. Note that $P_{\tilde X}$ is usually written as $P_X^{(\lambda)}$, since it is generated by twisting $P_X$ with twist parameter $\lambda$.
8.2.2 Conventional twisted distribution

The conventional definition of the twisted distribution is based on the divergence function, i.e.,
$$f(P_{\tilde X})=D(P_{\tilde X}\|P_X)-D(P_{\tilde X}\|P_{\hat X}).\qquad(8.2.1)$$
Since
$$\frac{\partial D(P_{\tilde X}\|P_X)}{\partial P_{\tilde X}(a)}=\log\frac{P_{\tilde X}(a)}{P_X(a)}+1,$$
the twisted distribution (with respect to the $f(\cdot)$ defined in (8.2.1)) becomes
$$(\forall\,a\in\mathcal X)\quad P_{\tilde X}(a)=\frac{P_X(a)\exp\left\{\lambda\log\dfrac{P_{\hat X}(a)}{P_X(a)}\right\}}{\displaystyle\sum_{a'\in\mathcal X}P_X(a')\exp\left\{\lambda\log\dfrac{P_{\hat X}(a')}{P_X(a')}\right\}}
=\frac{P_X^{1-\lambda}(a)P_{\hat X}^{\lambda}(a)}{\displaystyle\sum_{a'\in\mathcal X}P_X^{1-\lambda}(a')P_{\hat X}^{\lambda}(a')}.$$
Note that for Bayes testing, the probability of the optimal acceptance regions for both hypotheses, evaluated under the optimal twisted distribution (i.e., $\lambda=1/2$), exhibits zero exponent.
8.2.3 Cramér's theorem

Consider a sequence of i.i.d. random variables $\{X_n\}$, and suppose that we are interested in the probability of the set
$$\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}.$$
Observe that $(X_1+\cdots+X_n)/n$ can be rewritten as $\sum_{a\in\mathcal X}aP_{\mathcal C}(a)$, where $P_{\mathcal C}$ is the composition distribution of the observations. Therefore, the set considered can be defined through the function $f(\cdot)$:
$$f(P_{\tilde X})\triangleq\sum_{a\in\mathcal X}aP_{\tilde X}(a),$$
whose partial derivative with respect to $P_{\tilde X}(a)$ is $a$. The resulting twisted distribution is
$$(\forall\,a\in\mathcal X)\quad P_X^{(\lambda)}(a)=\frac{P_X(a)e^{\lambda a}}{\sum_{a'\in\mathcal X}P_X(a')e^{\lambda a'}}.$$
So the exponent of $P_{X^n}\{(X_1+\cdots+X_n)/n\ge\delta\}$ is
$$\min_{P_{\tilde X}:\,f(P_{\tilde X})\ge\delta}D(P_{\tilde X}\|P_X)=\min_{P_X^{(\lambda)}:\,f(P_X^{(\lambda)})\ge\delta}D(P_X^{(\lambda)}\|P_X).$$
It should be pointed out that $\sum_{a'\in\mathcal X}P_X(a')e^{\lambda a'}$ is exactly the moment generating function $M_X(\lambda)=E[\exp\{\lambda X\}]$. We now derive this exponent directly for an i.i.d. sequence $\{X_i\}_{i=1}^\infty$ with mean $E[X]=\mu<\delta$ and bounded variance. As a result, its exponent equals $I_X(\delta)$ (cf. (8.2.2)).

A) Preliminaries: Observe that since $E[X]=\mu<\delta$ and $E[|X-\mu|^2]<\infty$,
$$\Pr\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}\to0\quad\text{as }n\to\infty.$$
Hence, we can compute its rate of convergence (to zero).
B) Upper bound on the probability: For any $\lambda>0$,
$$\Pr\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}=\Pr\{X_1+\cdots+X_n\ge n\delta\}
=\Pr\{\exp(\lambda(X_1+\cdots+X_n))\ge\exp(\lambda n\delta)\}$$
$$\le\frac{E[\exp(\lambda(X_1+\cdots+X_n))]}{\exp(\lambda n\delta)}
=\frac{E^n[\exp(\lambda X)]}{\exp(\lambda n\delta)}
=\left[\frac{M_X(\lambda)}{\exp(\lambda\delta)}\right]^n.$$
Hence,
$$\liminf_{n\to\infty}-\frac1n\log\Pr\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}\ge\lambda\delta-\log M_X(\lambda).$$
Since the above inequality holds for every $\lambda>0$, we have
$$\liminf_{n\to\infty}-\frac1n\log\Pr\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}\ge\sup_{\lambda>0}\left[\lambda\delta-\log M_X(\lambda)\right]=\lambda^*\delta-\log M_X(\lambda^*),$$
where $\lambda^*$ achieves the supremum; since $\delta>E[X]$, the supremum over $\lambda>0$ coincides with
$$\sup_{\lambda\in\mathbb R}\left[\lambda\delta-\log M_X(\lambda)\right]=I_X(\delta).$$
C) Lower bound on the probability: Define the twisted distribution of $P_X$ as
$$P_X^{(\lambda)}(x)\triangleq\frac{\exp\{\lambda x\}P_X(x)}{M_X(\lambda)}.$$
Let $X^{(\lambda)}$ be the random variable with distribution $P_X^{(\lambda)}$. Then from the i.i.d. property of $\{X_n\}$,
$$P_{X^n}^{(\lambda)}(x^n)\triangleq\frac{\exp\{\lambda(x_1+\cdots+x_n)\}P_{X^n}(x^n)}{M_X^n(\lambda)},$$
which implies
$$P_{X^n}(x^n)=M_X^n(\lambda)\exp\{-\lambda(x_1+\cdots+x_n)\}P_{X^n}^{(\lambda)}(x^n).$$
Note that $\{X_i^{(\lambda)}\}_{i=1}^\infty$ are also i.i.d. random variables. Then
$$\Pr\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}=\sum_{\{x^n:\,x_1+\cdots+x_n\ge n\delta\}}P_{X^n}(x^n)$$
$$=M_X^n(\lambda)\sum_{\{x^n:\,x_1+\cdots+x_n\ge n\delta\}}\exp\{-\lambda(x_1+\cdots+x_n)\}P_{X^n}^{(\lambda)}(x^n)$$
$$\ge M_X^n(\lambda)\sum_{\{x^n:\,n(\delta+\zeta)>x_1+\cdots+x_n\ge n\delta\}}\exp\{-\lambda(x_1+\cdots+x_n)\}P_{X^n}^{(\lambda)}(x^n),\quad\text{for }\zeta>0$$
$$\ge M_X^n(\lambda)\exp\{-n\lambda(\delta+\zeta)\}\sum_{\{x^n:\,n(\delta+\zeta)>x_1+\cdots+x_n\ge n\delta\}}P_{X^n}^{(\lambda)}(x^n),\quad\text{for }\lambda>0,$$
$$=M_X^n(\lambda)\exp\{-n\lambda(\delta+\zeta)\}\Pr\left\{\delta+\zeta>\frac{X_1^{(\lambda)}+\cdots+X_n^{(\lambda)}}{n}\ge\delta\right\}.$$
Choosing $\lambda=\lambda^*$ so that $E[X^{(\lambda^*)}]=\delta$, we obtain
$$\Pr\left\{\delta+\zeta>\frac{X_1^{(\lambda^*)}+\cdots+X_n^{(\lambda^*)}}{n}\ge\delta\right\}
=\Pr\left\{\zeta\sqrt{\frac{n}{\operatorname{Var}[X^{(\lambda^*)}]}}>\frac{(X_1^{(\lambda^*)}-\delta)+\cdots+(X_n^{(\lambda^*)}-\delta)}{\sqrt{n\operatorname{Var}[X^{(\lambda^*)}]}}\ge0\right\}$$
$$\ge\Pr\left\{B>\frac{(X_1^{(\lambda^*)}-\delta)+\cdots+(X_n^{(\lambda^*)}-\delta)}{\sqrt{n\operatorname{Var}[X^{(\lambda^*)}]}}\ge0\right\},\quad\text{for }n>B^2\operatorname{Var}[X^{(\lambda^*)}]/\zeta^2,$$
$$\to\Phi(B)-\Phi(0)\quad\text{as }n\to\infty,$$
where the last step follows from the fact that
$$\frac{(X_1^{(\lambda^*)}-\delta)+\cdots+(X_n^{(\lambda^*)}-\delta)}{\sqrt{n\operatorname{Var}[X^{(\lambda^*)}]}}$$
converges in distribution to the standard normal, whose CDF is $\Phi(\cdot)$. Since $B$ can be made arbitrarily large,
$$\Pr\left\{\delta+\zeta>\frac{X_1^{(\lambda^*)}+\cdots+X_n^{(\lambda^*)}}{n}\ge\delta\right\}\to\frac12.$$
Consequently,
$$\limsup_{n\to\infty}-\frac1n\log\Pr\left\{\frac{X_1+\cdots+X_n}{n}\ge\delta\right\}
\le\lambda^*(\delta+\zeta)-\log M_X(\lambda^*)
=\sup_{\lambda\in\mathbb R}\left[\lambda\delta-\log M_X(\lambda)\right]+\lambda^*\zeta
=I_X(\delta)+\lambda^*\zeta.$$
The derivation is completed by noting that $\zeta>0$ can be made arbitrarily small. □
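The exponent derived above can be checked for a concrete case. For a Bernoulli($p$) sequence, $I_X(\delta)=\sup_\lambda[\lambda\delta-\log M_X(\lambda)]$ has the closed form $D(\mathrm{Bern}(\delta)\|\mathrm{Bern}(p))$ (a standard fact, not stated in the text); the sketch below compares a grid-search supremum against that closed form:

```python
# Cramer exponent for Bernoulli(p): I_X(delta) = sup_lambda [lambda*delta
# - log M_X(lambda)] equals D(Bern(delta) || Bern(p)).
from math import log, exp

p, delta = 0.3, 0.5

def log_mgf(lam):
    return log(1 - p + p * exp(lam))

# coarse grid search over lambda in [0, 5)
I = max(lam * delta - log_mgf(lam) for lam in [i / 1000 for i in range(5000)])

closed_form = delta * log(delta / p) + (1 - delta) * log((1 - delta) / (1 - p))
print(round(I, 4), round(closed_form, 4))
assert abs(I - closed_form) < 1e-3
```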
In what follows, $\{Z_n\}_{n=1}^\infty$ will denote an infinite sequence of arbitrary random variables.
Definition 8.9 Define
$$\varphi_n(\theta)\triangleq\frac1n\log E[\exp\{\theta Z_n\}]\quad\text{and}\quad\bar\varphi(\theta)\triangleq\limsup_{n\to\infty}\varphi_n(\theta).$$
The sup-large deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^\infty$ is defined as
$$\bar I(x)\triangleq\sup_{\theta\in\mathbb{R}:\,\bar\varphi(\theta)>-\infty}\left[\theta x-\bar\varphi(\theta)\right].\qquad(8.3.1)$$
The range of the supremum operation in (8.3.1) is always non-empty since $\bar\varphi(0)=0$, i.e., $\{\theta\in\mathbb{R}:\bar\varphi(\theta)>-\infty\}\ne\emptyset$. Hence, $\bar I(x)$ is always defined.
With the above definition, the first extension theorem of Gärtner–Ellis can be proposed as follows.

Theorem 8.10 For $a,b\in\mathbb{R}$ and $a\le b$,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\inf_{x\in[a,b]}\bar I(x).$$
Proof: The proof follows directly from Theorem 8.13 by taking $h(x)=x$, and hence, we omit it. $\Box$
The bound obtained in the above theorem is not in general tight. This can easily be seen by noting that for an arbitrary random sequence $\{Z_n\}_{n=1}^\infty$, the exponent of $\Pr\{Z_n/n\le b\}$ is not necessarily convex in $b$, and therefore cannot be achieved by a convex (sup-)large deviation rate function. The next example further substantiates this argument.
Example 8.11 Suppose that $\Pr\{Z_n=0\}=1-e^{-2n}$ and $\Pr\{Z_n=-2n\}=e^{-2n}$. Then from Definition 8.9, we have
$$\varphi_n(\theta)\triangleq\frac1n\log E\left[e^{\theta Z_n}\right]=\frac1n\log\left[1-e^{-2n}+e^{-(\theta+1)2n}\right],$$
and
$$\bar\varphi(\theta)\triangleq\limsup_{n\to\infty}\varphi_n(\theta)=\begin{cases}0,&\text{for }\theta\ge-1;\\-2(\theta+1),&\text{for }\theta<-1.\end{cases}$$
Hence, $\{\theta\in\mathbb{R}:\bar\varphi(\theta)>-\infty\}=\mathbb{R}$ (i.e., the real line) and
$$\bar I(x)=\sup_{\theta\in\mathbb{R}}\left[\theta x-\bar\varphi(\theta)\right]=\sup_{\theta\in\mathbb{R}}\left[\theta x+2(\theta+1)\mathbf1\{\theta<-1\}\right]=\begin{cases}-x,&\text{for }-2\le x\le0;\\\infty,&\text{otherwise},\end{cases}$$
where $\mathbf1\{\cdot\}$ represents the indicator function of a set. Consequently, by Theorem 8.10,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\inf_{x\in[a,b]}\bar I(x)=\begin{cases}0,&\text{for }0\in[a,b];\\b,&\text{for }b\in[-2,0];\\-\infty,&\text{otherwise}.\end{cases}$$
The exponent of $\Pr\{Z_n/n\in[a,b]\}$ in the above example is indeed given by
$$\lim_{n\to\infty}\frac1n\log P_{Z_n}\left\{\frac{Z_n}{n}\in[a,b]\right\}=-\inf_{x\in[a,b]}I^*(x),$$
where
$$I^*(x)=\begin{cases}2,&\text{for }x=-2;\\0,&\text{for }x=0;\\\infty,&\text{otherwise}.\end{cases}\qquad(8.3.2)$$
Thus, the upper bound obtained in Theorem 8.10 is not tight.
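The gap between the convex bound and the true exponent can be checked numerically; the following small sketch (an illustrative addition) evaluates both for the two-point sequence above, working in the log domain to avoid underflow for large $n$.

```python
import math

def exact_exponent(a, b, n=2000.0):
    # (1/n) log Pr{Z_n/n in [a,b]} for Pr{Z_n=0}=1-e^{-2n}, Pr{Z_n=-2n}=e^{-2n}.
    if a <= 0 <= b:
        return math.log1p(-math.exp(-2 * n)) / n   # ~ 0
    if a <= -2 <= b:
        return -2.0                                # log(e^{-2n})/n exactly
    return float("-inf")

def ge_upper_bound(a, b):
    # -inf_{x in [a,b]} I_bar(x), with I_bar(x) = -x on [-2,0], infinity elsewhere.
    lo, hi = max(a, -2.0), min(b, 0.0)
    return hi if lo <= hi else float("-inf")

print(exact_exponent(-2, -1))   # -2.0 : true exponent
print(ge_upper_bound(-2, -1))   # -1.0 : Theorem 8.10 bound, strictly looser
```

For $[a,b]=[-2,-1]$ the true exponent is $-2$ while the convex bound only gives $-1$, matching the discussion above.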
As mentioned earlier, the looseness of the upper bound in Theorem 8.10 cannot be remedied by simply using a convex sup-large deviation rate function. Note that the true exponent (cf. (8.3.2)) of the above example is not a convex function. We then observe that the convexity of the sup-large deviation rate function arises simply because it is a pointwise supremum of a collection of affine functions (cf. (8.3.1)). In order to obtain a better bound that achieves a non-convex large deviation rate, the involvement of non-affine functionals seems necessary. As a result, a new extension of the Gärtner–Ellis theorem is established along this line.
Before introducing the non-affine extension of the Gärtner–Ellis upper bound, we define the twisted sup-large deviation rate function as follows.

Definition 8.12 Define
$$\varphi_n(\theta;h)\triangleq\frac1n\log E\left[\exp\left\{n\theta\,h\!\left(\frac{Z_n}{n}\right)\right\}\right]\quad\text{and}\quad\bar\varphi_h(\theta)\triangleq\limsup_{n\to\infty}\varphi_n(\theta;h),$$
where $h(\cdot)$ is a given real-valued continuous function. The twisted sup-large deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^\infty$ with respect to a real-valued continuous function $h(\cdot)$ is defined as
$$\bar J_h(x)\triangleq\sup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left[\theta\,h(x)-\bar\varphi_h(\theta)\right].\qquad(8.3.3)$$
Similar to $\bar I(x)$, the range of the supremum operation in (8.3.3) is not empty, and hence, $\bar J_h(\cdot)$ is always defined.
Theorem 8.13 Suppose that $h(\cdot)$ is a real-valued continuous function. Then for $a,b\in\mathbb{R}$ and $a\le b$,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\inf_{x\in[a,b]}\bar J_h(x).$$
Proof: The proof is divided into two parts. Part 1 proves the result under
$$[a,b]\cap\{x\in\mathbb{R}:\bar J_h(x)<\infty\}\ne\emptyset,\qquad(8.3.4)$$
and Part 2 verifies it under
$$[a,b]\subset\{x\in\mathbb{R}:\bar J_h(x)=\infty\}.\qquad(8.3.5)$$
Since either (8.3.4) or (8.3.5) is true, these two parts, together, complete the proof of this theorem.
Part 1: Assume $[a,b]\cap\{x\in\mathbb{R}:\bar J_h(x)<\infty\}\ne\emptyset$.

Define $J\triangleq\inf_{x\in[a,b]}\bar J_h(x)$. By assumption, $J<\infty$. Therefore,
$$[a,b]\subset\{x:\bar J_h(x)>J-\delta\}$$
for any $\delta>0$, and
$$\{x\in\mathbb{R}:\bar J_h(x)>J-\delta\}=\Big\{x\in\mathbb{R}:\sup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left[\theta h(x)-\bar\varphi_h(\theta)\right]>J-\delta\Big\}
\subset\bigcup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left\{x\in\mathbb{R}:\left[\theta h(x)-\bar\varphi_h(\theta)\right]>J-\delta\right\}.$$
Observe that $\bigcup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\{x\in\mathbb{R}:[\theta h(x)-\bar\varphi_h(\theta)]>J-\delta\}$ is a collection of (uncountably infinitely many) open sets that cover $[a,b]$, which is closed and bounded (and hence compact). By the Heine–Borel theorem, we can find a finite subcover such that
$$[a,b]\subset\bigcup_{i=1}^k\left\{x\in\mathbb{R}:\left[\theta_ih(x)-\bar\varphi_h(\theta_i)\right]>J-\delta\right\},$$
and $(\forall\,1\le i\le k)$ $\bar\varphi_h(\theta_i)<\infty$ (otherwise, the set $\{x:[\theta_ih(x)-\bar\varphi_h(\theta_i)]>J-\delta\}$ is empty). Hence,
$$\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le\sum_{i=1}^k\Pr\left\{\theta_i\,h\!\left(\frac{Z_n}{n}\right)-\bar\varphi_h(\theta_i)>J-\delta\right\}
=\sum_{i=1}^k\Pr\left\{n\theta_i\,h\!\left(\frac{Z_n}{n}\right)>n\bar\varphi_h(\theta_i)+n(J-\delta)\right\}$$
$$\le\sum_{i=1}^k\exp\left\{n\left[\varphi_n(\theta_i;h)-\bar\varphi_h(\theta_i)\right]-n(J-\delta)\right\},$$
where the last step follows from Markov's inequality. Since $k$ is a constant independent of $n$, and for each integer $i\in[1,k]$,
$$\limsup_{n\to\infty}\frac1n\log\left(\exp\left\{n\left[\varphi_n(\theta_i;h)-\bar\varphi_h(\theta_i)\right]-n(J-\delta)\right\}\right)=-(J-\delta),$$
we obtain
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-(J-\delta).$$
Since $\delta$ is arbitrary, the proof is completed.
Part 2: Assume $[a,b]\subset\{x\in\mathbb{R}:\bar J_h(x)=\infty\}$.

Observe that $[a,b]\subset\{x:\bar J_h(x)>L\}$ for any $L>0$. Following the same procedure as used in Part 1, we obtain
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-L.$$
Since $L$ can be taken arbitrarily large,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}=-\infty=-\inf_{x\in[a,b]}\bar J_h(x).\qquad\Box$$
The proof of the above theorem has implicitly used the condition $-\infty<\bar\varphi_h(\theta_i)<\infty$ to guarantee that $\limsup_{n\to\infty}[\varphi_n(\theta_i;h)-\bar\varphi_h(\theta_i)]=0$ for each integer $i\in[1,k]$. Note that when $\bar\varphi_h(\theta_i)=\infty$ (resp. $-\infty$), $\limsup_{n\to\infty}[\varphi_n(\theta_i;h)-\bar\varphi_h(\theta_i)]=-\infty$ (resp. $\infty$). This explains why the range of the supremum operation in (8.3.3) is taken to be $\{\theta\in\mathbb{R}:\bar\varphi_h(\theta)>-\infty\}$, instead of the whole real line.
As indicated in Theorem 8.13, a better upper bound can possibly be found by twisting the large deviation rate function around an appropriate (non-affine) functional on the real line. Such improvement is substantiated in the next example.
Example 8.14 Let us, again, investigate the $\{Z_n\}_{n=1}^\infty$ defined in Example 8.11. Take
$$h(x)=-\frac12(x+2)^2+1.$$
Then from Definition 8.12, we have
$$\varphi_n(\theta;h)\triangleq\frac1n\log E\left[\exp\{n\theta h(Z_n/n)\}\right]=\frac1n\log\left[e^{-n\theta}-e^{-n(\theta+2)}+e^{n(\theta-2)}\right],$$
and
$$\bar\varphi_h(\theta)\triangleq\limsup_{n\to\infty}\varphi_n(\theta;h)=\begin{cases}-\theta,&\text{for }\theta\le1;\\\theta-2,&\text{for }\theta>1.\end{cases}$$
Hence, $\{\theta\in\mathbb{R}:\bar\varphi_h(\theta)>-\infty\}=\mathbb{R}$ and
$$\bar J_h(x)\triangleq\sup_{\theta\in\mathbb{R}}\left[\theta h(x)-\bar\varphi_h(\theta)\right]=\begin{cases}-\dfrac12(x+2)^2+2,&\text{for }x\in[-4,0];\\\infty,&\text{otherwise}.\end{cases}$$
Consequently, by Theorem 8.13,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\inf_{x\in[a,b]}\bar J_h(x)
=\begin{cases}\max\left\{\dfrac{(a+2)^2}2,\dfrac{(b+2)^2}2\right\}-2,&\text{for }-4\le a<b\le0;\\[1mm]0,&\text{for }a<-4\le b\text{ or }a\le0<b;\\[1mm]-\infty,&\text{otherwise}.\end{cases}\qquad(8.3.6)$$
For $b\in(-2,0)$ and $a\in\left[-2-\sqrt{2b+4},\,b\right)$, the upper bound attained in the previous example is strictly smaller than that given in Example 8.11, and hence, an improvement is obtained. However, for $b\in(-2,0)$ and $a<-2-\sqrt{2b+4}$, the upper bound in (8.3.6) is actually looser. Accordingly, we combine the two upper bounds from Examples 8.11 and 8.14 to get
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\max\left\{\inf_{x\in[a,b]}\bar J_h(x),\ \inf_{x\in[a,b]}\bar I(x)\right\}
=\begin{cases}0,&\text{for }0\in[a,b];\\\dfrac12(b+2)^2-2,&\text{for }b\in[-2,0];\\-\infty,&\text{otherwise}.\end{cases}$$
A better bound on the exponent of $\Pr\{Z_n/n\in[a,b]\}$ is thus obtained. As a result, Theorem 8.13 can be further generalized as follows.
Theorem 8.15 For $a,b\in\mathbb{R}$ and $a\le b$,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\inf_{x\in[a,b]}\bar J(x),$$
where $\bar J(x)\triangleq\sup_{h\in\mathcal H}\bar J_h(x)$ and $\mathcal H$ is the set of all real-valued continuous functions.
Proof: By redefining $J\triangleq\inf_{x\in[a,b]}\bar J(x)$ in the proof of Theorem 8.13, and observing that
$$[a,b]\subset\{x\in\mathbb{R}:\bar J(x)>J-\delta\}\subset\bigcup_{h\in\mathcal H}\ \bigcup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left\{x\in\mathbb{R}:\left[\theta h(x)-\bar\varphi_h(\theta)\right]>J-\delta\right\},$$
the theorem holds under $[a,b]\cap\{x\in\mathbb{R}:\bar J(x)<\infty\}\ne\emptyset$. Similar modifications to the proof of Theorem 8.13 can be applied to the case of $[a,b]\subset\{x\in\mathbb{R}:\bar J(x)=\infty\}$. $\Box$
Example 8.16 Let us again study the $\{Z_n\}_{n=1}^\infty$ in Example 8.11 (also in Example 8.14). Suppose $c>1$. Take $h_c(x)=-c_1(x+c_2)^2+c$, where
$$c_1\triangleq\frac{c+1}{c_2^2}\quad\text{and}\quad c_2\triangleq\frac{2\sqrt{c+1}}{\sqrt{c+1}+\sqrt{c-1}}.$$
Then from Definition 8.12, we have
$$\varphi_n(\theta;h_c)\triangleq\frac1n\log E\left[\exp\left\{n\theta\,h_c\!\left(\frac{Z_n}{n}\right)\right\}\right]=\frac1n\log\left[e^{-n\theta}-e^{-n(\theta+2)}+e^{n(\theta-2)}\right],$$
and
$$\bar\varphi_{h_c}(\theta)\triangleq\limsup_{n\to\infty}\varphi_n(\theta;h_c)=\begin{cases}-\theta,&\text{for }\theta\le1;\\\theta-2,&\text{for }\theta>1.\end{cases}$$
Hence, $\{\theta\in\mathbb{R}:\bar\varphi_{h_c}(\theta)>-\infty\}=\mathbb{R}$ and
$$\bar J_{h_c}(x)=\sup_{\theta\in\mathbb{R}}\left[\theta h_c(x)-\bar\varphi_{h_c}(\theta)\right]=\begin{cases}-c_1(x+c_2)^2+c+1,&\text{for }x\in[-2c_2,0];\\\infty,&\text{otherwise}.\end{cases}$$
From Theorem 8.15,
$$\bar J(x)=\sup_{h\in\mathcal H}\bar J_h(x)\ge\max\left\{\liminf_{c\to\infty}\bar J_{h_c}(x),\ \bar I(x)\right\}=I^*(x),$$
where $I^*(\cdot)$ is the true exponent given in (8.3.2). Consequently,
$$\inf_{x\in[a,b]}\bar J(x)\ge\inf_{x\in[a,b]}I^*(x)=\begin{cases}0,&\text{if }0\in[a,b];\\2,&\text{if }-2\in[a,b]\text{ and }0\notin[a,b];\\\infty,&\text{otherwise},\end{cases}$$
and a tight upper bound is finally obtained!
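The limiting behavior of the family $\bar J_{h_c}$ can be checked numerically. The sketch below (an illustrative addition) uses the constants $c_1=(c+1)/c_2^2$ and $c_2=2\sqrt{c+1}/(\sqrt{c+1}+\sqrt{c-1})$ as reconstructed in Example 8.16; treat these closed forms as assumptions of the sketch.

```python
import math

def J_hc(x, c):
    # Twisted sup-rate for h_c(x) = -c1*(x+c2)^2 + c, per Example 8.16.
    c2 = 2 * math.sqrt(c + 1) / (math.sqrt(c + 1) + math.sqrt(c - 1))
    c1 = (c + 1) / c2 ** 2
    if -2 * c2 <= x <= 0:
        return -c1 * (x + c2) ** 2 + c + 1
    return float("inf")

# J_hc pins the two mass points: value 2 at x = -2 and 0 at x = 0 for every c,
# while growing without bound elsewhere as c increases.
for c in (2.0, 10.0, 100.0):
    print(c, J_hc(-2.0, c), J_hc(0.0, c), J_hc(-1.0, c))
```

This matches the claim that $\liminf_{c\to\infty}\bar J_{h_c}(x)$ equals $2$ at $x=-2$, $0$ at $x=0$, and $\infty$ otherwise, i.e., exactly $I^*(x)$ of (8.3.2).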
Theorem 8.13 gives us the upper bound on the limsup of $(1/n)\log\Pr\{Z_n/n\in[a,b]\}$. With the same technique, we can also obtain a parallel theorem for the quantity
$$\liminf_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}.$$
Definition 8.17 Define $\underline\varphi_h(\theta)\triangleq\liminf_{n\to\infty}\varphi_n(\theta;h)$, where $\varphi_n(\theta;h)$ was defined in Definition 8.12. The twisted inf-large deviation rate function of an arbitrary random sequence $\{Z_n\}_{n=1}^\infty$ with respect to a real-valued continuous function $h(\cdot)$ is defined as
$$\underline J_h(x)\triangleq\sup_{\theta\in\mathbb{R}:\,\underline\varphi_h(\theta)>-\infty}\left[\theta h(x)-\underline\varphi_h(\theta)\right].$$
Theorem 8.18 For $a,b\in\mathbb{R}$ and $a\le b$,
$$\liminf_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\le-\inf_{x\in[a,b]}\underline J(x),$$
where $\underline J(x)\triangleq\sup_{h\in\mathcal H}\underline J_h(x)$ and $\mathcal H$ is the set of all real-valued continuous functions.
8.3.2 Extension of Gärtner–Ellis lower bounds

The tightness of the upper bound given in Theorem 8.13 naturally relies on the validity of
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\}\ge-\inf_{x\in(a,b)}\bar J_h(x),\qquad(8.3.7)$$
which is an extension of the Gärtner–Ellis lower bound. The above inequality, however, is not in general true for all choices of $a$ and $b$ (cf. Case A of Example 8.21). It therefore becomes significant to find those $(a,b)$ within which the extended Gärtner–Ellis lower bound holds.
Definition 8.19 Define the sup-Gärtner–Ellis set with respect to a real-valued continuous function $h(\cdot)$ as
$$\bar{\mathcal D}_h\triangleq\bigcup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\bar{\mathcal D}(\theta;h),$$
where
$$\bar{\mathcal D}(\theta;h)\triangleq\left\{x\in\mathbb{R}:\limsup_{t\downarrow0}\frac{\bar\varphi_h(\theta+t)-\bar\varphi_h(\theta)}{t}\le h(x)\le\liminf_{t\downarrow0}\frac{\bar\varphi_h(\theta)-\bar\varphi_h(\theta-t)}{t}\right\}.$$
Let us briefly remark on the sup-Gärtner–Ellis set defined above. It is evident from its definition that $\bar{\mathcal D}_h$ is always defined for any real-valued function $h(\cdot)$. Furthermore, it can be derived that the sup-Gärtner–Ellis set reduces to
$$\bar{\mathcal D}_h\triangleq\bigcup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left\{x\in\mathbb{R}:\frac{\partial\bar\varphi_h(\theta)}{\partial\theta}=h(x)\right\},$$
if the derivative $\partial\bar\varphi_h(\theta)/\partial\theta$ exists for all $\theta$. Observe that the condition $h(x)=\partial\bar\varphi_h(\theta)/\partial\theta$ is exactly the equation for finding the $\theta$ that achieves $\bar J_h(x)$, which is obtained by taking the derivative of $[\theta h(x)-\bar\varphi_h(\theta)]$ with respect to $\theta$. This somehow hints that the sup-Gärtner–Ellis set is a collection of those points at which the exact sup-large deviation rate is achievable.
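The stationarity condition can be illustrated with a minimal sketch (an assumed example, not taken from the text): for $h(x)=x$ and $\bar\varphi(\theta)=\theta^2/2$ (the normalized log-MGF of sums of i.i.d. standard Gaussians), the equation $h(x)=\bar\varphi'(\theta)=\theta$ is solved by $\theta=x$, and the supremum in the rate function is attained there.

```python
def J(x, grid=None):
    # sup_theta [theta*x - theta^2/2] by grid search; the maximizer is theta = x,
    # giving J(x) = x^2/2 (the classical Gaussian rate function).
    grid = grid or [i / 1000.0 for i in range(-5000, 5001)]
    return max(t * x - t * t / 2.0 for t in grid)

print(J(1.5))  # 1.125 = 1.5^2/2, attained at theta = 1.5
```

Here every $x$ satisfies the derivative-matching condition for some $\theta$, so the corresponding sup-Gärtner–Ellis set is the whole real line, consistent with the remark above.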
We now state the main theorem of this section.

Theorem 8.20 Suppose that $h(\cdot)$ is a real-valued continuous function. Then if $(a,b)\subset\bar{\mathcal D}_h$,
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in\mathcal H_h(a,b)\right\}\ge-\inf_{x\in(a,b)}\bar J_h(x),$$
where $\mathcal H_h(a,b)\triangleq\{y\in\mathbb{R}:h(y)=h(x)\text{ for some }x\in(a,b)\}$.
Proof: Let $F_n(\cdot)$ denote the distribution function of $Z_n$. Define its extended twisted distribution around the real-valued continuous function $h(\cdot)$ as
$$dF_n^{(\theta;h)}(x)\triangleq\frac{\exp\{n\theta h(x/n)\}\,dF_n(x)}{E[\exp\{n\theta h(Z_n/n)\}]}=\frac{\exp\{n\theta h(x/n)\}\,dF_n(x)}{\exp\{n\varphi_n(\theta;h)\}}.$$
Let $Z_n^{(\theta;h)}$ be the random variable having $F_n^{(\theta;h)}(\cdot)$ as its probability distribution. Let
$$J\triangleq\inf_{x\in(a,b)}\bar J_h(x).$$
Then for any $\varepsilon>0$, there exists $v\in(a,b)$ with
$$\bar J_h(v)\le J+\varepsilon.$$
Now the continuity of $h(\cdot)$ implies that
$$B(v,\delta)\triangleq\{x\in\mathbb{R}:|h(x)-h(v)|<\delta\}\subset\mathcal H_h(a,b)$$
for some $\delta>0$. Also, $(a,b)\subset\bar{\mathcal D}_h$ ensures the existence of $\theta$ satisfying
$$\limsup_{t\downarrow0}\frac{\bar\varphi_h(\theta+t)-\bar\varphi_h(\theta)}{t}\le h(v)\le\liminf_{t\downarrow0}\frac{\bar\varphi_h(\theta)-\bar\varphi_h(\theta-t)}{t},$$
which in turn guarantees the existence of $t=t(\theta)>0$ satisfying
$$\frac{\bar\varphi_h(\theta+t)-\bar\varphi_h(\theta)}{t}\le h(v)+\frac\delta4\quad\text{and}\quad h(v)-\frac\delta4\le\frac{\bar\varphi_h(\theta)-\bar\varphi_h(\theta-t)}{t}.\qquad(8.3.8)$$
We then derive
$$\Pr\left\{\frac{Z_n}{n}\in\mathcal H_h(a,b)\right\}\ge\Pr\left\{\frac{Z_n}{n}\in B(v,\delta)\right\}=\Pr\left\{\left|h\!\left(\frac{Z_n}{n}\right)-h(v)\right|<\delta\right\}$$
$$=\int_{x\in\mathbb{R}:\,|h(x/n)-h(v)|<\delta}dF_n(x)
=\int_{x\in\mathbb{R}:\,|h(x/n)-h(v)|<\delta}\exp\left\{n\varphi_n(\theta;h)-n\theta h\!\left(\frac xn\right)\right\}dF_n^{(\theta;h)}(x)$$
$$\ge\exp\{n\varphi_n(\theta;h)-n\theta h(v)-n|\theta|\delta\}\int_{x\in\mathbb{R}:\,|h(x/n)-h(v)|<\delta}dF_n^{(\theta;h)}(x)$$
$$=\exp\{n\varphi_n(\theta;h)-n\theta h(v)-n|\theta|\delta\}\,\Pr\left\{\frac{Z_n^{(\theta;h)}}{n}\in B(v,\delta)\right\},$$
which implies
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in\mathcal H_h(a,b)\right\}
\ge-\left[\theta h(v)-\bar\varphi_h(\theta)\right]-|\theta|\delta+\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n^{(\theta;h)}}{n}\in B(v,\delta)\right\}$$
$$\ge-\bar J_h(v)-|\theta|\delta+\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n^{(\theta;h)}}{n}\in B(v,\delta)\right\}$$
$$\ge-J-\varepsilon-|\theta|\delta+\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n^{(\theta;h)}}{n}\in B(v,\delta)\right\}.$$
Since both $\delta$ and $\varepsilon$ can be made arbitrarily small, it remains to show
$$\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n^{(\theta;h)}}{n}\in B(v,\delta)\right\}=0.\qquad(8.3.9)$$
To show (8.3.9), we first note that
$$\Pr\left\{h\!\left(\frac{Z_n^{(\theta;h)}}{n}\right)\ge h(v)+\delta\right\}=\Pr\left\{e^{nt\,h(Z_n^{(\theta;h)}/n)}\ge e^{nt\,h(v)+nt\delta}\right\}$$
$$\le e^{-nt\,h(v)-nt\delta}\int_{\mathbb{R}}e^{nt\,h(x/n)}\,dF_n^{(\theta;h)}(x)
=e^{-nt\,h(v)-nt\delta}\int_{\mathbb{R}}e^{nt\,h(x/n)+n\theta h(x/n)-n\varphi_n(\theta;h)}\,dF_n(x)
=e^{-nt\,h(v)-nt\delta-n\varphi_n(\theta;h)+n\varphi_n(\theta+t;h)}.$$
Similarly,
$$\Pr\left\{h\!\left(\frac{Z_n^{(\theta;h)}}{n}\right)\le h(v)-\delta\right\}=\Pr\left\{e^{-nt\,h(Z_n^{(\theta;h)}/n)}\ge e^{-nt\,h(v)+nt\delta}\right\}$$
$$\le e^{nt\,h(v)-nt\delta}\int_{\mathbb{R}}e^{-nt\,h(x/n)}\,dF_n^{(\theta;h)}(x)
=e^{nt\,h(v)-nt\delta-n\varphi_n(\theta;h)+n\varphi_n(\theta-t;h)}.$$
Now by definition of the limsup,
$$\varphi_n(\theta+t;h)\le\bar\varphi_h(\theta+t)+\frac{t\delta}4\quad\text{and}\quad\varphi_n(\theta-t;h)\le\bar\varphi_h(\theta-t)+\frac{t\delta}4\qquad(8.3.10)$$
for sufficiently large $n$; and
$$\varphi_n(\theta;h)\ge\bar\varphi_h(\theta)-\frac{t\delta}4\qquad(8.3.11)$$
for infinitely many $n$. Hence, there exists a subsequence $n_1,n_2,n_3,\ldots$ such that for all $n_j$, (8.3.10) and (8.3.11) hold. Consequently, for all $j$,
$$\frac1{n_j}\log\Pr\left\{\frac{Z_{n_j}^{(\theta;h)}}{n_j}\notin B(v,\delta)\right\}
\le\frac1{n_j}\log\left(2\max\left\{e^{-n_jt\,h(v)-n_jt\delta-n_j\varphi_{n_j}(\theta;h)+n_j\varphi_{n_j}(\theta+t;h)},\,e^{n_jt\,h(v)-n_jt\delta-n_j\varphi_{n_j}(\theta;h)+n_j\varphi_{n_j}(\theta-t;h)}\right\}\right)$$
$$=\frac1{n_j}\log2+\max\left\{-t\,h(v)+\varphi_{n_j}(\theta+t;h),\ t\,h(v)+\varphi_{n_j}(\theta-t;h)\right\}-\varphi_{n_j}(\theta;h)-t\delta$$
$$\le\frac1{n_j}\log2+\max\left\{-t\,h(v)+\bar\varphi_h(\theta+t),\ t\,h(v)+\bar\varphi_h(\theta-t)\right\}-\bar\varphi_h(\theta)-\frac{t\delta}2$$
$$=\frac1{n_j}\log2+t\max\left\{\frac{\bar\varphi_h(\theta+t)-\bar\varphi_h(\theta)}{t}-h(v),\ h(v)-\frac{\bar\varphi_h(\theta)-\bar\varphi_h(\theta-t)}{t}\right\}-\frac{t\delta}2$$
$$\le\frac1{n_j}\log2-\frac{t\delta}4,\qquad(8.3.12)$$
where (8.3.12) follows from (8.3.8). The proof is then completed by obtaining
$$\liminf_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n^{(\theta;h)}}{n}\notin B(v,\delta)\right\}\le-\frac{t\delta}4,$$
which immediately guarantees the validity of (8.3.9). $\Box$
Next, we use an example to demonstrate that, by choosing the right $h(\cdot)$, we can completely characterize the exact (non-convex) sup-large deviation rate of a random sequence.

Example 8.21 Suppose that $Z_n=X_1+\cdots+X_n$, where $\{X_i\}_{i=1}^n$ are i.i.d. Gaussian random variables with mean $1$ and variance $1$ if $n$ is even, and with mean $-1$ and variance $1$ if $n$ is odd. Then the exact large deviation rate formula $I^*(x)$ that satisfies, for all $a<b$,
$$-\inf_{x\in[a,b]}I^*(x)\ge\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in[a,b]\right\}\ge\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in(a,b)\right\}\ge-\inf_{x\in(a,b)}I^*(x)$$
is
$$I^*(x)=\frac{(|x|-1)^2}2.\qquad(8.3.13)$$
Case A: $h(x)=x$.

For the affine $h(\cdot)$, $\varphi_n(\theta)=\theta+\theta^2/2$ when $n$ is even, and $\varphi_n(\theta)=-\theta+\theta^2/2$ when $n$ is odd. Hence, $\bar\varphi(\theta)=|\theta|+\theta^2/2$, and
$$\bar{\mathcal D}_h=\left(\bigcup_{\theta>0}\{v\in\mathbb{R}:v=1+\theta\}\right)\cup\left(\bigcup_{\theta<0}\{v\in\mathbb{R}:v=-1+\theta\}\right)\cup\{v\in\mathbb{R}:1\le v\le-1\}=(1,\infty)\cup(-\infty,-1),$$
the third set (contributed by $\theta=0$) being empty. Therefore, Theorem 8.20 cannot be applied to any $a$ and $b$ with $(a,b)\cap[-1,1]\ne\emptyset$.

By deriving
$$\bar I(x)=\sup_{\theta\in\mathbb{R}}\left[\theta x-\bar\varphi(\theta)\right]=\begin{cases}\dfrac{(|x|-1)^2}2,&\text{for }|x|>1;\\[1mm]0,&\text{for }|x|\le1,\end{cases}$$
we obtain for any $a\in(-\infty,-1)\cup(1,\infty)$,
$$\lim_{\zeta\downarrow0}\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in(a-\zeta,a+\zeta)\right\}\ge-\lim_{\zeta\downarrow0}\inf_{x\in(a-\zeta,a+\zeta)}\bar I(x)=-\frac{(|a|-1)^2}2,$$
which can be shown to be tight by Theorem 8.13 (or directly by (8.3.13)). Note that the above inequality does not hold for any $a\in(-1,1)$. To fill the gap, a different $h(\cdot)$ must be employed.
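Case A's rate function is easy to verify numerically. The sketch below (an illustrative addition) hard-codes $\bar\varphi(\theta)=|\theta|+\theta^2/2$ from the example and grid-maximizes; it recovers $(|x|-1)^2/2$ outside $[-1,1]$ and $0$ inside.

```python
def I_bar(x, grid=None):
    # sup_theta [theta*x - phi_bar(theta)] with phi_bar(theta) = |theta| + theta^2/2.
    grid = grid or [i / 1000.0 for i in range(-20000, 20001)]
    return max(t * x - (abs(t) + t * t / 2.0) for t in grid)

for x in (-3.0, -0.5, 0.0, 2.0):
    print(x, I_bar(x))  # 2.0, 0.0, 0.0, 0.5 respectively
```

The flat region $\bar I\equiv0$ on $[-1,1]$ is exactly where the convex rate function loses the oscillating-mean structure, which is why a non-affine $h$ is needed there.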
Case B: $h(x)=|x-a|$ for $-1<a<1$.
For $n$ even,
$$E\left[e^{n\theta h(Z_n/n)}\right]=E\left[e^{\theta|Z_n-na|}\right]
=\int_{-\infty}^{na}e^{-\theta x+\theta na}\frac1{\sqrt{2\pi n}}e^{-(x-n)^2/(2n)}dx+\int_{na}^{\infty}e^{\theta x-\theta na}\frac1{\sqrt{2\pi n}}e^{-(x-n)^2/(2n)}dx$$
$$=e^{n(\theta^2-2\theta+2\theta a)/2}\int_{-\infty}^{na}\frac1{\sqrt{2\pi n}}e^{-[x-n(1-\theta)]^2/(2n)}dx
+e^{n(\theta^2+2\theta-2\theta a)/2}\int_{na}^{\infty}\frac1{\sqrt{2\pi n}}e^{-[x-n(1+\theta)]^2/(2n)}dx$$
$$=e^{n(\theta^2-2\theta+2\theta a)/2}\,\Phi\big((\theta+a-1)\sqrt n\big)+e^{n(\theta^2+2\theta-2\theta a)/2}\,\Phi\big((\theta-a+1)\sqrt n\big),$$
where $\Phi(\cdot)$ represents the unit Gaussian cdf.

Similarly, for $n$ odd,
$$E\left[e^{n\theta h(Z_n/n)}\right]=e^{n(\theta^2+2\theta+2\theta a)/2}\,\Phi\big((\theta+a+1)\sqrt n\big)+e^{n(\theta^2-2\theta-2\theta a)/2}\,\Phi\big((\theta-a-1)\sqrt n\big).$$
Observe that for any $b\in\mathbb{R}$,
$$\lim_{n\to\infty}\frac1n\log\Phi(b\sqrt n)=\begin{cases}0,&\text{for }b\ge0;\\-\dfrac{b^2}2,&\text{for }b<0.\end{cases}$$
Hence,
$$\bar\varphi_h(\theta)=\begin{cases}-\dfrac{(|a|-1)^2}2,&\text{for }\theta<|a|-1;\\[1mm]\dfrac{\theta[\theta+2(1-|a|)]}2,&\text{for }|a|-1\le\theta<0;\\[1mm]\dfrac{\theta[\theta+2(1+|a|)]}2,&\text{for }\theta\ge0.\end{cases}$$
Therefore,
$$\bar{\mathcal D}_h=\left(\bigcup_{\theta>0}\{x\in\mathbb{R}:|x-a|=\theta+1+|a|\}\right)\cup\left(\bigcup_{\theta<0}\{x\in\mathbb{R}:|x-a|=\theta+1-|a|\}\right)$$
$$=(-\infty,\,a-1-|a|)\cup(a-1+|a|,\,a+1-|a|)\cup(a+1+|a|,\,\infty)$$
and
$$\bar J_h(x)=\begin{cases}\dfrac{(|x-a|-1+|a|)^2}2,&\text{for }a-1+|a|<x<a+1-|a|;\\[1mm]\dfrac{(|x-a|-1-|a|)^2}2,&\text{for }x>a+1+|a|\text{ or }x<a-1-|a|;\\[1mm]0,&\text{otherwise}.\end{cases}\qquad(8.3.14)$$
We then apply Theorem 8.20 to obtain
$$\lim_{\zeta\downarrow0}\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in(a-\zeta,a+\zeta)\right\}\ge-\lim_{\zeta\downarrow0}\inf_{x\in(a-\zeta,a+\zeta)}\bar J_h(x)
=-\lim_{\zeta\downarrow0}\frac{(\zeta-1+|a|)^2}2=-\frac{(|a|-1)^2}2.$$
Note that the above lower bound is valid for any $a\in(-1,1)$, and can be shown to be tight, again, by Theorem 8.13 (or directly by (8.3.13)).

Finally, by combining the results of Cases A) and B), the true large deviation rate of $\{Z_n\}_{n\ge1}$ is completely characterized. $\Box$
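Formula (8.3.14) can be sanity-checked in code; this sketch (an illustrative addition, using the reconstructed piecewise form as its assumption) confirms that the twisted rate at $x=a$ equals the true rate $(|a|-1)^2/2$ of (8.3.13).

```python
def J_h(x, a):
    # Twisted sup-rate for h(y) = |y - a|, per the piecewise form (8.3.14).
    m = abs(x - a)
    if a - 1 + abs(a) < x < a + 1 - abs(a):
        return (m - 1 + abs(a)) ** 2 / 2.0
    if x > a + 1 + abs(a) or x < a - 1 - abs(a):
        return (m - 1 - abs(a)) ** 2 / 2.0
    return 0.0

for a in (-0.5, 0.0, 0.7):
    # Rate at the target point x = a versus the true rate (|a|-1)^2/2:
    print(a, J_h(a, a), (abs(a) - 1) ** 2 / 2.0)
```

The agreement at $x=a$ for every $a\in(-1,1)$ is exactly why $h(x)=|x-a|$ fills the gap left by the affine choice of Case A.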
Remarks.

One of the problems in applying the extended Gärtner–Ellis theorems is the difficulty of choosing an appropriate real-valued continuous function (not to mention finding the optimal one in the sense of Theorem 8.15). From the previous example, we observe that the resultant $\bar J_h(x)$ is in fact equal to the lower convex contour (with respect to $h(\cdot)$) of $\min_{y\in\mathbb{R}:\,h(y)=h(x)}I^*(y)$. Indeed, if the lower convex contour of $\min_{y\in\mathbb{R}:\,h(y)=h(x)}I^*(y)$ equals $I^*(x)$ for some $x$ lying in the interior of $\bar{\mathcal D}_h$, we can apply Theorems 8.13 and 8.20 to establish the large deviation rate at this point. From the above example, we somehow sense that taking $h(x)=|x-a|$ is advantageous in characterizing the large deviation rate at $x=a$: with this choice, $\bar J_h(x)$ shapes like the lower convex contour of $\min\{I^*(x),I^*(2a-x)\}$.

One may also compare $\bar{\mathcal D}_h$ with the set
$$\bigcup\left\{(a,b):\limsup_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in\mathcal H_h(a,b)\right\}\ge-\inf_{x\in(a,b)}\bar J_h(x)\right\}.$$
To verify whether or not the above two sets are equal merits further investigation.
Modifying the proof of Theorem 8.20, we can also establish a lower bound for
$$\liminf_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in\mathcal H_h(a,b)\right\}.$$
Definition 8.22 Define the inf-Gärtner–Ellis set with respect to a real-valued continuous function $h(\cdot)$ as
$$\underline{\mathcal D}_h\triangleq\bigcup_{\theta\in\mathbb{R}:\,\underline\varphi_h(\theta)>-\infty}\underline{\mathcal D}(\theta;h),$$
where
$$\underline{\mathcal D}(\theta;h)\triangleq\left\{x\in\mathbb{R}:\limsup_{t\downarrow0}\frac{\underline\varphi_h(\theta+t)-\underline\varphi_h(\theta)}{t}\le h(x)\le\liminf_{t\downarrow0}\frac{\underline\varphi_h(\theta)-\underline\varphi_h(\theta-t)}{t}\right\}.$$
Theorem 8.23 Suppose that $h(\cdot)$ is a real-valued continuous function. Then if $(a,b)\subset\underline{\mathcal D}_h$,
$$\liminf_{n\to\infty}\frac1n\log\Pr\left\{\frac{Z_n}{n}\in\mathcal H_h(a,b)\right\}\ge-\inf_{x\in(a,b)}\underline J_h(x).$$
One of the important usages of large deviation rate functions is to find Varadhan's asymptotic integration formula for
$$\lim_{n\to\infty}\frac1n\log E\left[\exp\{\theta Z_n\}\right]$$
for a given random sequence $\{Z_n\}_{n=1}^\infty$. To be specific, it equals [4, Theorem 2.1.10]
$$\lim_{n\to\infty}\frac1n\log E\left[\exp\{\theta Z_n\}\right]=\sup_{x\in\mathbb{R}:\,I(x)<\infty}\left[\theta x-I(x)\right],$$
if
$$\lim_{L\to\infty}\limsup_{n\to\infty}\frac1n\log\left(\int_{\{x\in\mathbb{R}:\,\theta x\ge nL\}}\exp\{\theta x\}\,dP_{Z_n}(x)\right)=-\infty.$$
The above result can also be extended using the same idea as applied to the Gärtner–Ellis theorem.

Theorem 8.24 If
$$\lim_{L\to\infty}\limsup_{n\to\infty}\frac1n\log\left(\int_{\{x\in\mathbb{R}:\,\theta h(x/n)\ge L\}}\exp\left\{n\theta h\!\left(\frac xn\right)\right\}dP_{Z_n}(x)\right)=-\infty,$$
then
$$\limsup_{n\to\infty}\frac1n\log E\left[\exp\left\{n\theta h\!\left(\frac{Z_n}{n}\right)\right\}\right]=\sup_{x\in\mathbb{R}:\,\bar J_h(x)<\infty}\left[\theta h(x)-\bar J_h(x)\right],$$
and
$$\liminf_{n\to\infty}\frac1n\log E\left[\exp\left\{n\theta h\!\left(\frac{Z_n}{n}\right)\right\}\right]=\sup_{x\in\mathbb{R}:\,\underline J_h(x)<\infty}\left[\theta h(x)-\underline J_h(x)\right].$$
Proof: This can be obtained by modifying the proofs of Lemmas 2.1.7 and 2.1.8 in [4]. $\Box$
We close the section by remarking that the result of the above theorem can be reformulated as
$$\bar J_h(x)=\sup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left[\theta h(x)-\bar\varphi_h(\theta)\right]$$
and
$$\bar\varphi_h(\theta)=\sup_{x\in\mathbb{R}:\,\bar J_h(x)<\infty}\left[\theta h(x)-\bar J_h(x)\right],$$
which is an extension of the Legendre–Fenchel transform pair. A similar conclusion applies to $\underline J_h(x)$ and $\underline\varphi_h(\theta)$.
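The transform-pair structure is easy to see on a toy case (an illustrative addition, not from the text): for the self-dual example $\varphi(\theta)=\theta^2/2$, transforming twice returns the original function, which a finite-grid Legendre transform reproduces to grid accuracy.

```python
def legendre(f, grid):
    # Discrete Legendre transform: (Lf)(y) = sup_{x in grid} [x*y - f(x)].
    return lambda y: max(x * y - f(x) for x in grid)

grid = [i / 100.0 for i in range(-500, 501)]
phi = lambda t: t * t / 2.0
I = legendre(phi, grid)        # rate function: I(x) = x^2/2
phi_back = legendre(I, grid)   # transforming back recovers phi
print(I(1.0), phi_back(1.0))   # both ~ 0.5
```

For non-convex rate functions (like the twisted ones above) the double transform returns only the convex hull, which is precisely why the twisted pair is stated with a fixed $h(\cdot)$ rather than as a plain Legendre pair.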
8.3.3 Properties of (twisted) sup- and inf-large deviation rate functions

Property 8.25 Let $\bar I(x)$ and $\underline I(x)$ be the sup- and inf-large deviation rate functions of an infinite sequence of arbitrary random variables $\{Z_n\}_{n=1}^\infty$, respectively. Denote $m_n=(1/n)E[Z_n]$. Let $\bar m\triangleq\limsup_{n\to\infty}m_n$ and $\underline m\triangleq\liminf_{n\to\infty}m_n$. Then

1. $\bar I(x)$ and $\underline I(x)$ are both convex.
2. $\bar I(x)$ is continuous over $\{x\in\mathbb{R}:\bar I(x)<\infty\}$. Likewise, $\underline I(x)$ is continuous over $\{x\in\mathbb{R}:\underline I(x)<\infty\}$.
3. $\bar I(x)$ attains its minimum value $0$ for $\underline m\le x\le\bar m$.
4. $\underline I(x)\ge0$; but $\underline I(x)$ does not necessarily attain its minimum value at $x=\bar m$ and $x=\underline m$.

Proof:
1. $\bar I(x)$ is the pointwise supremum of a collection of affine functions, and is therefore convex. A similar argument applies to $\underline I(x)$.
2. A finite convex function on the real line is continuous, and hence the property holds.
3.&4. The proofs follow immediately from Property 8.26 by taking $h(x)=x$. $\Box$

Since the twisted sup-/inf-large deviation rate functions are not necessarily convex, a few properties of the sup-/inf-large deviation rate functions do not hold for general twisted rate functions.
Property 8.26 Suppose that $h(\cdot)$ is a real-valued continuous function. Let $\bar J_h(x)$ and $\underline J_h(x)$ be the corresponding twisted sup- and inf-large deviation rate functions, respectively. Denote $m_n(h)\triangleq E[h(Z_n/n)]$. Let
$$\bar m_h\triangleq\limsup_{n\to\infty}m_n(h)\quad\text{and}\quad\underline m_h\triangleq\liminf_{n\to\infty}m_n(h).$$
Then

1. $\bar J_h(x)\ge0$, with equality if $\underline m_h\le h(x)\le\bar m_h$.
2. $\underline J_h(x)\ge0$, but $\underline J_h(x)$ does not necessarily attain its minimum value at those $x$ with $h(x)=\bar m_h$ or $h(x)=\underline m_h$.
Proof:
1. For all $x\in\mathbb{R}$,
$$\bar J_h(x)\triangleq\sup_{\theta\in\mathbb{R}:\,\bar\varphi_h(\theta)>-\infty}\left[\theta h(x)-\bar\varphi_h(\theta)\right]\ge0\cdot h(x)-\bar\varphi_h(0)=0.$$
By Jensen's inequality,
$$\exp\{n\varphi_n(\theta;h)\}=E\left[\exp\{n\theta h(Z_n/n)\}\right]\ge\exp\{n\theta E[h(Z_n/n)]\}=\exp\{n\theta m_n(h)\},$$
which is equivalent to $\theta m_n(h)\le\varphi_n(\theta;h)$. After taking the limsup and liminf of both sides of the above inequality, we obtain:

for $\theta\ge0$,
$$\theta\bar m_h\le\bar\varphi_h(\theta)\qquad(8.3.15)$$
and
$$\theta\underline m_h\le\underline\varphi_h(\theta)\le\bar\varphi_h(\theta);\qquad(8.3.16)$$
for $\theta<0$,
$$\theta\underline m_h\le\bar\varphi_h(\theta)\qquad(8.3.17)$$
and
$$\theta\bar m_h\le\underline\varphi_h(\theta)\le\bar\varphi_h(\theta).\qquad(8.3.18)$$
(8.3.15) and (8.3.18) imply $\bar J_h(x)=0$ for those $x$ satisfying $h(x)=\bar m_h$, and (8.3.16) and (8.3.17) imply $\bar J_h(x)=0$ for those $x$ satisfying $h(x)=\underline m_h$. For $x\in\{x:\underline m_h\le h(x)\le\bar m_h\}$,
$$\theta h(x)-\bar\varphi_h(\theta)\le\theta\bar m_h-\bar\varphi_h(\theta)\le0\quad\text{for }\theta\ge0$$
and
$$\theta h(x)-\bar\varphi_h(\theta)\le\theta\underline m_h-\bar\varphi_h(\theta)\le0\quad\text{for }\theta<0.$$
Hence, by taking the supremum over $\{\theta\in\mathbb{R}:\bar\varphi_h(\theta)>-\infty\}$, we obtain the desired result.
2. The non-negativity of $\underline J_h(x)$ can be proved in the same way as that of $\bar J_h(x)$. From Case A of Example 8.21, we have $\bar m=1$, $\underline m=-1$, and
$$\underline\varphi(\theta)=\begin{cases}-\theta+\dfrac{\theta^2}2,&\text{for }\theta\ge0;\\[1mm]\theta+\dfrac{\theta^2}2,&\text{for }\theta<0.\end{cases}$$
Therefore,
$$\underline I(x)=\sup_{\theta\in\mathbb{R}}\left[\theta x-\underline\varphi(\theta)\right]=\begin{cases}\dfrac{(x+1)^2}2,&\text{for }x\ge0;\\[1mm]\dfrac{(x-1)^2}2,&\text{for }x<0,\end{cases}$$
for which $\underline I(-1)=\underline I(1)=2$ and $\min_{x\in\mathbb{R}}\underline I(x)=\underline I(0)=1/2$. Consequently, $\underline I(x)$ (i.e., $\underline J_h(x)$ with $h(x)=x$) neither equals zero nor attains its minimum value at $x=\bar m_h$ and $x=\underline m_h$. $\Box$
8.4 Probabilistic subexponential behavior

In the previous section, we demonstrated the use of large deviations techniques to compute the exponent of an exponentially decaying probability. However, no effort so far has been placed on studying its subexponential behavior. Observe that the two sequences $a_n=(1/n)\exp\{-2n\}$ and $b_n=(1/\sqrt n)\exp\{-2n\}$ have the same exponent, but contain different subexponential terms. Such subexponential terms actually affect their respective rates of convergence when $n$ is not large. Let us begin with the introduction of (a variation of) the Berry–Esseen theorem [5, Sec. XVI.5], which is later applied to evaluate the subexponential detail of a desired probability.
8.4.1 Berry–Esseen theorem for compound i.i.d. sequences

The Berry–Esseen theorem [5, Sec. XVI.5] states that the distribution of the sum of independent zero-mean random variables $\{X_i\}_{i=1}^n$, normalized by the standard deviation of the sum, differs from the Gaussian distribution by at most $C\,r_n/s_n^3$, where $s_n^2$ and $r_n$ are respectively the sums of the marginal variances and the marginal absolute third moments, and $C$ is an absolute constant. Specifically, for every $a\in\mathbb{R}$,
$$\left|\Pr\left\{\frac1{s_n}(X_1+\cdots+X_n)\le a\right\}-\Phi(a)\right|\le C\,\frac{r_n}{s_n^3},\qquad(8.4.1)$$
where $\Phi(\cdot)$ represents the unit Gaussian cdf. The striking feature of this theorem is that the upper bound depends only on the variance and the absolute third moment, and hence can provide a good asymptotic estimate based on only the first three moments. The absolute constant $C$ is commonly taken to be $6$ [5, Sec. XVI.5, Thm. 2]. When $\{X_i\}_{i=1}^n$ are identically distributed, in addition to independent, the absolute constant can be reduced to $3$, and has been reported to be improved down to $2.05$ [5, Sec. XVI.5, Thm. 1].
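The theorem can be exercised on an exactly computable case (an illustrative addition): for i.i.d. symmetric $\pm\tfrac12$ steps, $\sigma=\tfrac12$ and $E|X|^3=\tfrac18$, so the classical $C=3$ bound reads $3/\sqrt n$, and the exact binomial cdf can be scanned against $\Phi$.

```python
import math
from math import comb

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_deviation(n):
    # X_i = +-1/2 equiprobable, so S_n = K - n/2 with K ~ Binomial(n, 1/2)
    # and s_n = sqrt(n)/2.  Scan the lattice for sup_y |H_n(y) - Phi(y)|.
    s_n = math.sqrt(n) / 2.0
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        a = (k - n / 2.0) / s_n
        worst = max(worst, abs(cdf - normal_cdf(a)))  # just below the jump at a
        cdf += comb(n, k) / 2.0 ** n
        worst = max(worst, abs(cdf - normal_cdf(a)))  # at the jump
    return worst

n = 100
print(max_deviation(n), 3.0 / math.sqrt(n))  # empirical gap vs the C = 3 bound
```

At $n=100$ the true deviation (about $0.04$, driven by the lattice jump at the origin) sits well below the $0.3$ bound, illustrating that the constant, not the $1/\sqrt n$ rate, is the conservative part.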
The samples that concern us in this section actually consist of two i.i.d. subsequences (the whole is therefore named a compound i.i.d. sequence). Hence, we need to build a Berry–Esseen theorem for compound i.i.d. samples. We begin with the introduction of the smoothing lemma.

Lemma 8.27 Fix the band-limited filtering function
$$v_T(x)\triangleq\frac1\pi\cdot\frac{1-\cos(Tx)}{Tx^2}.$$
For any cumulative distribution function $H(\cdot)$ on the real line $\mathbb{R}$,
$$\sup_{x\in\mathbb{R}}|\Delta_T(x)|\ge\frac\eta2-\frac6{\pi\sqrt{2\pi}\,T}\,h\!\left(\frac{T\sqrt{2\pi}\,\eta}2\right),$$
where
$$\Delta_T(t)\triangleq\int_{-\infty}^{\infty}\left[H(t-x)-\Phi(t-x)\right]v_T(x)\,dx,\qquad\eta\triangleq\sup_{x\in\mathbb{R}}|H(x)-\Phi(x)|,$$
and
$$h(u)\triangleq\begin{cases}\displaystyle u\int_u^{\infty}\frac{1-\cos x}{x^2}\,dx=\frac\pi2u+1-\cos u-u\int_0^u\frac{\sin x}{x}\,dx,&\text{if }u\ge0;\\[1mm]0,&\text{otherwise}.\end{cases}$$
Proof: The right-continuity of the cdf $H(\cdot)$ and the continuity of the unit Gaussian cdf $\Phi(\cdot)$ together indicate the right-continuity of $|H(x)-\Phi(x)|$, which in turn implies the existence of $x_0\in\mathbb{R}$ satisfying
$$\text{either }\eta=|H(x_0)-\Phi(x_0)|\quad\text{or}\quad\eta=\lim_{x\uparrow x_0}|H(x)-\Phi(x)|>|H(x_0)-\Phi(x_0)|.$$
We then distinguish between three cases:

Case A) $\eta=H(x_0)-\Phi(x_0)$;
Case B) $\eta=\Phi(x_0)-H(x_0)$;
Case C) $\eta=\lim_{x\uparrow x_0}|H(x)-\Phi(x)|>|H(x_0)-\Phi(x_0)|$.
Case A) $\eta=H(x_0)-\Phi(x_0)$. In this case, we note that for $s>0$,
$$H(x_0+s)-\Phi(x_0+s)\ge H(x_0)-\left[\Phi(x_0)+\frac s{\sqrt{2\pi}}\right]\qquad(8.4.2)$$
$$=\eta-\frac s{\sqrt{2\pi}},\qquad(8.4.3)$$
where (8.4.2) follows from the monotonicity of $H(\cdot)$ and $\sup_{x\in\mathbb{R}}\Phi'(x)=1/\sqrt{2\pi}$. Consequently,
$$H\!\left(x_0+\frac{\sqrt{2\pi}\,\eta}2-x\right)-\Phi\!\left(x_0+\frac{\sqrt{2\pi}\,\eta}2-x\right)\ge\eta-\frac1{\sqrt{2\pi}}\left(\frac{\sqrt{2\pi}\,\eta}2-x\right)=\frac\eta2+\frac x{\sqrt{2\pi}},$$
for $|x|<\sqrt{2\pi}\,\eta/2$. Together with the fact that $H(x)-\Phi(x)\ge-\eta$ for all $x\in\mathbb{R}$, we obtain
$$\sup_{x\in\mathbb{R}}|\Delta_T(x)|\ge\Delta_T\!\left(x_0+\frac{\sqrt{2\pi}\,\eta}2\right)
=\int_{-\infty}^{\infty}\left[H\!\left(x_0+\frac{\sqrt{2\pi}\,\eta}2-x\right)-\Phi\!\left(x_0+\frac{\sqrt{2\pi}\,\eta}2-x\right)\right]v_T(x)\,dx$$
$$\ge\int_{|x|<\sqrt{2\pi}\eta/2}\left(\frac\eta2+\frac x{\sqrt{2\pi}}\right)v_T(x)\,dx+\int_{|x|\ge\sqrt{2\pi}\eta/2}(-\eta)\,v_T(x)\,dx$$
$$=\frac\eta2\int_{|x|<\sqrt{2\pi}\eta/2}v_T(x)\,dx-\eta\int_{|x|\ge\sqrt{2\pi}\eta/2}v_T(x)\,dx,\qquad(8.4.4)$$
161
tion, v
T
(). The quantity of
_
[
[x[
2/2]
v
T
(x)dx can be derived as follows:
_
[
[x[
2/2]
v
T
(x)dx = 2
_
2/2
v
T
(x)dx
=
2
2/2
1 cos(Tx)
Tx
2
dx
=
2
_
T
2/2
1 cos(x)
x
2
dx
=
4
T
2
h
_
T
2
2
_
.
Continuing from (8.4.4),
sup
xT
[
T
(x)[
2
_
1
4
T
2
h
_
T
2
2
__
_
4
T
2
h
_
T
2
2
__
=
1
2
6
T
2
h
_
T
2
2
_
.
Case B) $\eta=\Phi(x_0)-H(x_0)$. Similar to Case A), we first derive, for $s>0$,
$$\Phi(x_0-s)-H(x_0-s)\ge\left[\Phi(x_0)-\frac s{\sqrt{2\pi}}\right]-H(x_0)=\eta-\frac s{\sqrt{2\pi}},$$
and then obtain
$$\Phi\!\left(x_0-\frac{\sqrt{2\pi}\,\eta}2-x\right)-H\!\left(x_0-\frac{\sqrt{2\pi}\,\eta}2-x\right)\ge\eta-\frac1{\sqrt{2\pi}}\left(\frac{\sqrt{2\pi}\,\eta}2+x\right)=\frac\eta2-\frac x{\sqrt{2\pi}},$$
for $|x|<\sqrt{2\pi}\,\eta/2$. Together with the fact that $\Phi(x)-H(x)\ge-\eta$ for all $x\in\mathbb{R}$, we obtain
$$\sup_{x\in\mathbb{R}}|\Delta_T(x)|\ge-\Delta_T\!\left(x_0-\frac{\sqrt{2\pi}\,\eta}2\right)
\ge\int_{|x|<\sqrt{2\pi}\eta/2}\left(\frac\eta2-\frac x{\sqrt{2\pi}}\right)v_T(x)\,dx+\int_{|x|\ge\sqrt{2\pi}\eta/2}(-\eta)\,v_T(x)\,dx$$
$$=\frac\eta2\int_{|x|<\sqrt{2\pi}\eta/2}v_T(x)\,dx-\eta\int_{|x|\ge\sqrt{2\pi}\eta/2}v_T(x)\,dx
=\frac\eta2-\frac6{\pi\sqrt{2\pi}\,T}\,h\!\left(\frac{T\sqrt{2\pi}\,\eta}2\right).$$
Case C) $\eta=\lim_{x\uparrow x_0}|H(x)-\Phi(x)|>|H(x_0)-\Phi(x_0)|$.

In this case, we claim that for $s>0$, either
$$H(x_0+s)-\Phi(x_0+s)\ge\eta-\frac s{\sqrt{2\pi}}\quad\text{or}\quad\Phi(x_0-s)-H(x_0-s)\ge\eta-\frac s{\sqrt{2\pi}}$$
holds, because if it were not true, then letting $s\downarrow0$ in
$$H(x_0+s)-\Phi(x_0+s)<\eta-\frac s{\sqrt{2\pi}}\quad\text{and}\quad\Phi(x_0-s)-H(x_0-s)<\eta-\frac s{\sqrt{2\pi}}$$
would give $\lim_{s\downarrow0}H(x_0+s)-\Phi(x_0)\le\eta$ and $\Phi(x_0)-\lim_{s\downarrow0}H(x_0-s)\le\eta$ with equality forced at $x_0$, which immediately yields $\eta=|H(x_0)-\Phi(x_0)|$, contradicting the premise of the current case. Following this claim, the desired result can be proved using the procedure of either Case A) or Case B); hence, we omit it. $\Box$
Lemma 8.28 For any cumulative distribution function $H(\cdot)$ with characteristic function $f_H(\cdot)$,
$$\eta\le\frac1\pi\int_{-T}^{T}\left|f_H(\omega)-e^{-\omega^2/2}\right|\frac{d\omega}{|\omega|}+\frac{12}{\pi\sqrt{2\pi}\,T}\,h\!\left(\frac{T\sqrt{2\pi}\,\eta}2\right),$$
where $\eta$ and $h(\cdot)$ are defined in Lemma 8.27.
Proof: Observe that
T
(t) in Lemma 8.27 is nothing but a convolution of v
T
()
and H(x) (x). The characteristic function (or Fourier transform) of v
T
() is
T
() =
_
1
[[
T
, if [[ T;
0, otherwise.
By the Fourier inversion theorem [5, sec. XV.3],
d (
T
(x))
dx
=
1
2
_
e
jx
_
H
() e
(1/2)
2
_
T
()d.
Integrating with respect to x, we obtain
T
(x) =
1
2
_
T
T
e
jx
_
H
() e
(1/2)
2
_
j
T
()d.
Accordingly,
sup
xT
[
T
(x)[ = sup
xT
1
2
_
T
T
e
jx
_
H
() e
(1/2)
2
_
j
T
()d
sup
xT
1
2
_
T
T
e
jx
_
H
() e
(1/2)
2
_
j
T
()
d
= sup
xT
1
2
_
T
T
H
() e
(1/2)
2
[
T
()[
d
[[
sup
xT
1
2
_
T
T
H
() e
(1/2)
2
d
[[
=
1
2
_
T
T
H
() e
(1/2)
2
d
[[
.
Together with the result in Lemma 8.27, we nally have
1
_
T
T
H
() e
(1/2)
2
d
[[
+
12
T
2
h
_
T
2
2
_
.
2
164
Theorem 8.29 (BerryEsseen theorem for compound i.i.d. sequences)
Let Y
n
=
n
i=1
X
i
be the sum of independent random variables, among which
X
i
d
i=1
are identically Gaussian distributed, and X
i
n
i=d+1
are identically dis
tributed but not necessarily Gaussian. Denote the meanvariance pair of X
1
and
X
d+1
by (,
2
) and ( ,
2
), respectively. Dene
E
_
[X
1
[
3
, E
_
[X
d+1
[
3
, and s
2
n
= Var[Y
n
] =
2
d +
2
(n d).
Also denote the cdf of (Y
n
E[Y
n
])/s
n
by H
n
(). Then for all y 1,
[H
n
(y) (y)[ C
n,d
2
(n d 1)
_
2(n d) 3
2
_
2
s
n
,
where C
n,d
is the unique positive number satisfying
6
C
n,d
h(C
n,d
) =
_
2(n d) 3
2
_
12(n d 1)
_
6
2
_
3
2
_
3/2
+
9
2(11 6
2)
n d
_
,
provided that n d 3.
Proof: From Lemma 8.28,
_
T
T
d
_
s
n
_
nd
_
s
n
_
e
(1/2)
2
d
[[
+
12
T
2
h
_
T
2
2
_
, (8.4.5)
where () and () are respectively the characteristic functions of (X
1
) and
(X
d+1
). Observe that the integrand satises
d
_
s
n
_
nd
_
s
n
_
e
(1/2)
2
nd
_
s
n
_
e
(1/2)(
2
(nd)/s
2
n
)
2
e
(1/2)(
2
d/s
2
n
)
2
(n d)
_
s
n
_
e
(1/2)(
2
/s
2
n
)
2
e
(1/2)(
2
d/s
2
n
)
2
nd1
, (8.4.6)
(n d)
_
_
s
n
_
_
1
2
2s
2
n
2
_
_
1
2
2s
2
n
2
_
e
(1/2)(
2
/s
2
n
)
2
_
e
(1/2)(
2
d/s
2
n)
2
nd1
(8.4.7)
where the quantity in (8.4.6) requires [5, sec. XVI. 5, (5.5)] that
_
s
n
_
and
e
(1/2)(
2
/s
2
n
)
2
.
165
By equation (26.5) in [1], we upperbound the rst and second terms in the braces
of (8.4.7) respectively by
_
s
n
_
1 +
2
2s
2
n
6s
3
n
[[
3
and
1
2
2s
2
n
2
e
(1/2)(
2
/s
2
n
)
2
4
8s
4
n
4
.
Continuing the derivation of (8.4.7),
d
_
s
n
_
nd
_
s
n
_
e
(1/2)
2
(n d)
_
6s
3
n
[[
3
+
4
8s
4
n
4
_
e
(1/2)(
2
d/s
2
n
)
2
nd1
. (8.4.8)
It remains to choose that bounds both [ (/s
n
)[ and exp(1/2) (
2
/s
2
n
)
2
from above.
From the elementary property of characteristic functions,
_
s
n
_
1
2
2s
2
n
2
+
6s
3
n
[
3
[,
if
2
2s
2
n
2
1. (8.4.9)
For those [T, T] (which is exactly the range of integration operation in
(8.4.5)), we can guarantee the validity of (8.4.9) by dening
T
2
s
n
_
2(n d) 3
n d 1
_
,
and obtain
2
2s
2
n
2
2s
2
n
T
2
6
2
2
_
2(n d) 3
n d 1
_
2
1
2
_
2(n d) 3
n d 1
_
2
1,
for n d 3. Hence, for [[ T,
_
s
n
_
1 +
_
2
2s
2
n
2
+
6s
3
n
[
3
[
_
exp
_
2
2s
2
n
2
+
6s
3
n
[
3
[
_
exp
_
2
2s
2
n
2
+
6s
3
n
T
2
_
= exp
_
_
2
2s
2
n
T
6s
3
n
_
2
_
= exp
_
(3
2)(n d)
6(n d 1)
2
s
2
n
2
_
.
166
We can then choose
exp
_
(3
2)(n d)
6(n d 1)
2
s
2
n
2
_
.
The above selected is apparently an upper bound of exp (1/2) (
2
/s
2
n
)
2
.
By taking the chosen into (8.4.8), the integration part in (8.4.5) becomes
_
T
T
d
_
s
n
_
(nd)
_
s
n
_
e
(1/2)
2
d
[[
_
T
T
(n d)
_
6s
3
n
2
+
4
8s
4
n
[[
3
_
(8.4.10)
exp
_
_
3
2
d + (3
2)
2
(n d)
_
2
6s
2
n
_
d
_
T
T
(n d)
_
6s
3
n
2
+
4
8s
4
n
[[
3
_
exp
_
_
(3
2)
2
d + (3
2)
2
(n d)
_
2
6s
2
n
_
d
=
_
T
T
(n d)
_
6s
3
n
2
+
4
8s
4
n
[[
3
_
exp
_
(3
2)
6
2
_
d
=
_
(n d)
_
6s
3
n
2
+
4
8s
4
n
[[
3
_
exp
_
(3
2)
6
2
_
d
=
(n d)
2
s
2
n
2
s
n
_
3
6
6(3
2)
3/2
+
9
2(11 6
2)
3
s
n
_
1
T
_
2(n d) 3
n d 1
__
3
6
6(3
2)
3/2
+
9
2(11 6
2)
1
n d
_
, (8.4.11)
where the last inequality follows respectively from
(n d)
2
s
2
n
,
/(
2
s
n
) = (1/T)(
n d.
Taking (8.4.11) into (8.4.5), we nally obtain
1
T
_
2(n d) 3
n d 1
__
3
6
6(3
2)
3/2
+
9
2(11 6
2)
1
n d
_
+
12
T
2
h
_
T
2
2
_
,
167
1
0.5
0
0.5
1
1.5
2
2.5
0 1 2 3 4 5
6
u h(u)
u
Figure 8.3: Function of (/6)u h(u).
or equivalently,
6
u h(u)
2
_
2(n d) 3
12(n d 1)
_
6
2(3
2)
3/2
+
9
2(11 6
2)
n d
_
,
(8.4.12)
for u T
2
C
n,d
2
T
2
= C
n,d
2
(n d 1)
(2(n d) 3
2)
2
s
n
.
2
8.4.2 BerryEsseen Theorem with a samplesize depen
dent multiplicative coecient for i.i.d. sequence
By letting d = 0, the BerryEsseen inequality for i.i.d. sequences can also be
readily obtained from Theorem 8.29.
168
Corollary 8.30 (BerryEsseen theorem with a samplesize dependent
multiplicative coecient for i.i.d. sequence) Let Y
n
=
n
i=1
X
i
be the sum
of independent random variables with common marginal distribution. Denote
the marginal mean and variance by ( ,
2
). Dene E
_
[X
1
[
3
. Also
denote the cdf of (Y
n
n )/(
n ) by H
n
(). Then for all y 1,
[H
n
(y) (y)[ C
n
2(n 1)
_
2n 3
2
_
3
n
,
where C
n
is the unique positive solution of
6
u h(u) =
_
2n 3
2
_
12(n 1)
_
6
2
_
3
2
_
3/2
+
9
2(11 6
2)
n
_
,
provided that n 3.
Let us briey remark on the previous corollary. We observe from numericals
2
that the quantity
C
n
2
(n 1)
_
2n 3
2
_
is decreasing in n, and ranges from 3.628 to 1.627 (cf. Figure 8.4.) Numerical
result shows that it lies below 2 when n 9, and is smaller than 1.68 as n 100.
In other words, we can upperbound this quantity by 1.68 as n 100, and
therefore, establish a better estimate of the BerryEsseen constant than that in
[5, sec. XVI. 5, Thm. 1].
2
We can upperbound C
n
by the unique positive solution D
n
of
6
u h(u) =
6
_
_
6
2
_
3
2
_
3/2
+
9
2(11 6
2)
n
_
_
,
which is strictly decreasing in n. Hence,
C
n
2
(n 1)
_
2n 3
2
_ E
n
D
n
2
(n 1)
_
2n 3
2
_,
and the righthandside of the above inequality is strictly decreasing (since both D
n
and
(n 1)/(2n 3
(n 1)
(2n 3
2)
n
Figure 8.4: The BerryEsseen constant as a function of the sample size
n. The sample size n is plotted in logscale.
8.4.3 Probability bounds using BerryEsseen inequality
We now investigate an upper probability bound for the sum of a compound
i.i.d. sequence.
Basically, the approach used here is the large deviation technique, which is
considered the technique of computing the exponent of an exponentially decayed
probability. Other than the large deviations, the previously derived BerryEsseen
theorem is applied to evaluate the subexponential detail of the desired probability.
With the help of these two techniques, we can obtain a good estimate of the
destined probability. Some notations used in large deviations are introduced
here for the sake of completeness.
Let $Y_n = \sum_{i=1}^n X_i$ be the sum of independent random variables. Denote the distribution of $Y_n$ by $F_n(\cdot)$. The twisted distribution with parameter $\theta$ corresponding to $F_n(\cdot)$ is defined by
$$dF_n^{(\theta)}(y) \;\triangleq\; \frac{e^{\theta y}\,dF_n(y)}{\int_{-\infty}^{\infty} e^{\theta y}\,dF_n(y)} \;=\; \frac{e^{\theta y}\,dF_n(y)}{M_n(\theta)}, \qquad (8.4.13)$$
where $M_n(\theta)$ is the moment generating function corresponding to the distribution $F_n(\cdot)$. Let $Y_n^{(\theta)}$ be a random variable having $F_n^{(\theta)}(\cdot)$ as its probability distribution. Analogously, we can twist the distribution of $X_i$ with parameter $\theta$, yielding its twisted counterpart $X_i^{(\theta)}$. A basic large-deviations result is that
$$Y_n^{(\theta)} = \sum_{i=1}^n X_i^{(\theta)},$$
and $\{X_i^{(\theta)}\}_{i=1}^n$ still forms a compound i.i.d. sequence.
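As a quick numerical illustration of (8.4.13), the following sketch twists an arbitrary discrete distribution (the support, probabilities, and parameter below are illustrative choices, not taken from the text) and checks that the twisted law is again a probability distribution whose mean equals the derivative of $\log M_n(\theta)$:

```python
import math

# Twisting (exponential tilting) of a discrete distribution, cf. (8.4.13).
# The support, probabilities, and theta are arbitrary illustrative choices.
support = [-2.0, -1.0, 0.0, 1.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.3, 0.1]
theta = 0.5

# Moment generating function M(theta) = E[e^{theta X}].
M = sum(p * math.exp(theta * x) for x, p in zip(support, probs))

# Twisted distribution: dF^(theta)(x) = e^{theta x} dF(x) / M(theta).
twisted = [p * math.exp(theta * x) / M for x, p in zip(support, probs)]
assert abs(sum(twisted) - 1.0) < 1e-12   # still a probability distribution

# The twisted mean equals d/d(theta) of log M(theta); check by finite differences.
mean_twisted = sum(x * q for x, q in zip(support, twisted))
logM = lambda t: math.log(sum(p * math.exp(t * x) for x, p in zip(support, probs)))
h = 1e-6
assert abs(mean_twisted - (logM(theta + h) - logM(theta - h)) / (2 * h)) < 1e-6
print(round(mean_twisted, 4))  # → 1.1737
```

For a sum $Y_n$ of independent terms, $M_n(\theta)$ factors into the per-term moment generating functions, which is exactly why twisting $F_n$ is the same as twisting each $X_i$ individually.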
Lemma 8.31 (probability upper bound) Let $Y_n = \sum_{i=1}^n X_i$ be the sum of independent random variables, among which $\{X_i\}_{i=1}^d$ are identically Gaussian distributed with mean $\mu > 0$ and variance $\sigma^2$, and $\{X_i\}_{i=d+1}^n$ have common distribution equal to that of $\min\{X_i, 0\}$. Denote the variance and the absolute third central moment of the twisted random variable $X_{d+1}^{(\theta)}$ by $\sigma^2(\theta)$ and $\rho(\theta)$, respectively. Also define $s_n^2(\theta) \triangleq \mathrm{Var}\bigl[Y_n^{(\theta)}\bigr]$ and $M_n(\theta) \triangleq E\bigl[e^{\theta Y_n}\bigr]$. Then
$$\Pr\{Y_n \le 0\} \;\le\; B\!\left(d,(n-d);\mu^2,\sigma^2\right) \;\triangleq\; \begin{cases} A_{n,d}\,M_n(\theta), & \text{when } E[Y_n] \ge 0;\\ 1 - B_{n,d}\,M_n(\theta), & \text{otherwise},\end{cases}$$
where
$$A_{n,d} \;\triangleq\; \min\left\{\frac{1}{\sqrt{2\pi}\,|\theta|\,s_n(\theta)} + C_{n,d}\,\frac{4(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt{2}\right]\sigma^2(\theta)\,s_n(\theta)},\; 1\right\},$$
$$B_{n,d} \;\triangleq\; \frac{1}{\sqrt{2\pi}\,|\theta|\,s_n(\theta)}\,\exp\left\{-\sqrt{2\pi}\,|\theta|\,s_n(\theta)\left[1-\Phi(0) + \frac{4C_{n,d}\,(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt{2}\right]\sigma^2(\theta)\,s_n(\theta)} + \frac{1}{\sqrt{2\pi}\,|\theta|\,s_n(\theta)}\right]\right\},$$
and $\theta$ is the unique solution of $\partial M_n(\theta)/\partial\theta = 0$, provided that $n - d \ge 3$. ($A_{n,d}$, $B_{n,d}$ and $\theta$ are all functions of $\mu^2$ and $\sigma^2$; we drop these arguments to simplify the notation.)
Proof: We derive the upper bound of $\Pr(Y_n \le 0)$ under two different situations: $E[Y_n] \ge 0$ and $E[Y_n] < 0$.

Case A) $E[Y_n] \ge 0$. Denote the cdf of $Y_n$ by $F_n(\cdot)$. Choose $\theta$ to satisfy $\partial M_n(\theta)/\partial\theta = 0$. By the convexity of the moment generating function $M_n(\theta)$ with respect to $\theta$ and the non-negativity of $E[Y_n] = \left.\partial M_n(\theta)/\partial\theta\right|_{\theta=0}$, the $\theta$ satisfying $\partial M_n(\theta)/\partial\theta = 0$ uniquely exists and is non-positive. Denote the distribution of $\left(Y_n^{(\theta)} - E[Y_n^{(\theta)}]\right)/s_n(\theta)$ by $H_n(\cdot)$. Then from (8.4.13), we have
$$\Pr(Y_n \le 0) = \int_{-\infty}^{0} dF_n(y) = M_n(\theta)\int_{-\infty}^{0} e^{-\theta y}\,dF_n^{(\theta)}(y) = M_n(\theta)\int_{-\infty}^{0} e^{-\theta s_n(\theta)\,y}\,dH_n(y). \qquad (8.4.14)$$
Integrating by parts on (8.4.14) with
$$\nu(dy) \;\triangleq\; -\theta s_n(\theta)\,e^{-\theta s_n(\theta)\,y}\,dy,$$
and then applying Theorem 8.29 yields
$$\begin{aligned}\int_{-\infty}^{0} e^{-\theta s_n(\theta)\,y}\,dH_n(y) &= \int_{-\infty}^{0}\left[H_n(0)-H_n(y)\right]\nu(dy)\\ &\le \int_{-\infty}^{0}\left[\Phi(0)-\Phi(y) + C_{n,d}\,\frac{4(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\theta)\,s_n(\theta)}\right]\nu(dy)\\ &= e^{\theta^2 s_n^2(\theta)/2}\,\Phi\!\left(-|\theta|\, s_n(\theta)\right) + C_{n,d}\,\frac{4(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\theta)\,s_n(\theta)} \qquad (8.4.15)\\ &\le \frac{1}{\sqrt{2\pi}\,|\theta|\,s_n(\theta)} + C_{n,d}\,\frac{4(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\theta)\,s_n(\theta)},\end{aligned}$$
where (8.4.15) holds by, again, applying integration by parts, and the last inequality follows from the fact that, for $u > 0$,
$$1-\Phi(u) = \int_u^\infty \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt \;\le\; \int_u^\infty \frac{1}{\sqrt{2\pi}}\left(1+\frac{1}{t^2}\right)e^{-t^2/2}\,dt \;=\; \frac{1}{\sqrt{2\pi}\,u}\,e^{-u^2/2}.$$
On the other hand, $\int_{-\infty}^{0} e^{-\theta s_n(\theta)\,y}\,dH_n(y) \le 1$ can be established by observing that
$$M_n(\theta)\int_{-\infty}^{0} e^{-\theta s_n(\theta)\,y}\,dH_n(y) = \Pr\{Y_n \le 0\} = \Pr\left\{e^{\theta Y_n} \ge 1\right\} \le E\left[e^{\theta Y_n}\right] = M_n(\theta).$$
Finally, Case A concludes with
$$\Pr(Y_n \le 0) \le \min\left\{\frac{1}{\sqrt{2\pi}\,|\theta|\,s_n(\theta)} + \frac{4C_{n,d}\,(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\theta)\,s_n(\theta)},\; 1\right\} M_n(\theta).$$

Case B) $E[Y_n] < 0$. By following a procedure similar to Case A, together with the observation that $\theta > 0$ for $E[Y_n] < 0$, we obtain
$$\begin{aligned}\Pr(Y_n > 0) &= \int_{0^+}^{\infty} dF_n(y)\\ &= M_n(\theta)\int_{0^+}^{\infty} e^{-\theta s_n(\theta)\,y}\,dH_n(y)\\ &\ge M_n(\theta)\int_{0^+}^{\lambda} e^{-\theta s_n(\theta)\,y}\,dH_n(y), \quad \text{for some } \lambda > 0 \text{ specified later}\\ &\ge M_n(\theta)\,e^{-\theta s_n(\theta)\lambda}\int_{0^+}^{\lambda} dH_n(y)\\ &= M_n(\theta)\,e^{-\theta s_n(\theta)\lambda}\left[H_n(\lambda)-H_n(0)\right]\\ &\ge M_n(\theta)\,e^{-\theta s_n(\theta)\lambda}\left[\Phi(\lambda)-\Phi(0) - C_{n,d}\,\frac{4(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\theta)\,s_n(\theta)}\right].\end{aligned}$$
The proof is completed by letting $\lambda$ be the solution of
$$\Phi(\lambda) = \Phi(0) + C_{n,d}\,\frac{4(n-d-1)\,\rho(\theta)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\theta)\,s_n(\theta)} + \frac{1}{\sqrt{2\pi}\,|\theta|\,s_n(\theta)}. \qquad\Box$$
The upper bound obtained in Lemma 8.31 in fact depends only upon the ratio $\mu/\sigma$, or equivalently upon $\gamma \triangleq (1/2)(\mu^2/\sigma^2)$. Therefore, we can rephrase Lemma 8.31 as the next corollary.
Corollary 8.32 Let $Y_n = \sum_{i=1}^n X_i$ be the sum of independent random variables, among which $\{X_i\}_{i=1}^d$ are identically Gaussian distributed with mean $\mu > 0$ and variance $\sigma^2$, and $\{X_i\}_{i=d+1}^n$ have common distribution equal to that of $\min\{X_i, 0\}$. Let $\gamma \triangleq (1/2)(\mu^2/\sigma^2)$. Then
$$\Pr\{Y_n \le 0\} \le B\!\left(d,(n-d);\gamma\right),$$
where
$$B\!\left(d,(n-d);\gamma\right) \triangleq \begin{cases} A_{n,d}(\gamma)\,M_n(\gamma), & \text{if } \dfrac{d}{n} \ge \dfrac{(1/\sqrt{2\pi})\,e^{-\gamma} - \sqrt{2\gamma}\,\Phi(-\sqrt{2\gamma})}{\sqrt{2\gamma} + (1/\sqrt{2\pi})\,e^{-\gamma} - \sqrt{2\gamma}\,\Phi(-\sqrt{2\gamma})};\\[6pt] 1 - B_{n,d}(\gamma)\,M_n(\gamma), & \text{otherwise},\end{cases}$$
$$A_{n,d}(\gamma) \triangleq \min\left\{\frac{1}{\sqrt{2\pi}\,\eta\,s_n(\gamma)} + \frac{4C_{n,d}\,(n-d-1)\,\rho(\gamma)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\gamma)\,s_n(\gamma)},\; 1\right\},$$
$$B_{n,d}(\gamma) \triangleq \frac{1}{\sqrt{2\pi}\,\eta\,s_n(\gamma)}\,\exp\left\{-\sqrt{2\pi}\,\eta\,s_n(\gamma)\left[\frac{1}{2} + \frac{4C_{n,d}\,(n-d-1)\,\rho(\gamma)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\gamma)\,s_n(\gamma)} + \frac{1}{\sqrt{2\pi}\,\eta\,s_n(\gamma)}\right]\right\},$$
$$M_n(\gamma) \triangleq e^{-n\eta\sqrt{2\gamma}}\,e^{d\eta^2/2}\left[\Phi(\eta-\sqrt{2\gamma})\,e^{\eta^2/2} + e^{\eta\sqrt{2\gamma}}\,\Phi(\sqrt{2\gamma})\right]^{n-d},$$
$$\sigma^2(\gamma) \triangleq \frac{d}{n-d} - \frac{nd}{(n-d)^2}\,2\gamma + \frac{n}{n-d}\cdot\frac{1}{1+\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma})},$$
$$\begin{aligned}\rho(\gamma) \triangleq\; & \frac{n}{(n-d)\left[1+\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma})\right]}\Bigg[1 - \frac{d(n+d)}{(n-d)^2}\,2\gamma + 2\left(\frac{n^2}{(n-d)^2}\,2\gamma + 2\right)e^{-d(2n-d)\,2\gamma/(n-d)^2}\\ &\quad - \frac{d}{n-d}\left(\frac{n^2}{(n-d)^2}\,2\gamma + 3\right)\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma}) - \frac{2n}{n-d}\left(\frac{n^2}{(n-d)^2}\,2\gamma + 3\right)\sqrt{2\pi\gamma}\,e^{\gamma/2}\,\frac{n}{n-d}\Bigg],\end{aligned}$$
$$s_n^2(\gamma) \triangleq n\left[\frac{d}{n-d}\,2\gamma + \frac{1}{1+\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma})}\right],$$
and $\eta$ is the unique positive solution of
$$e^{\eta^2/2}\,\Phi(\eta-\sqrt{2\gamma}) = \frac{1}{\sqrt{2\pi}}\cdot\frac{1-d/n}{d/n}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma}),$$
provided that $n - d \ge 3$.
Proof: The notation used in this proof follows that of Lemma 8.31.

Let $\eta \triangleq -\theta\sigma$, so that $\theta\mu = -\eta\sqrt{2\gamma}$ and $\theta^2\sigma^2/2 = \eta^2/2$, and
$$M_{X_{d+1}}(\theta) = \Phi(\sqrt{2\gamma}) + e^{-\eta\sqrt{2\gamma}+\eta^2/2}\,\Phi(\eta-\sqrt{2\gamma}).$$
Therefore, the moment generating function of $Y_n = \sum_{i=1}^n X_i$ becomes
$$M_n(\theta) = M_{X_1}^d(\theta)\,M_{X_{d+1}}^{n-d}(\theta) = e^{-n\eta\sqrt{2\gamma}}\,e^{d\eta^2/2}\left[\Phi(\sqrt{2\gamma})\,e^{\eta\sqrt{2\gamma}} + e^{\eta^2/2}\,\Phi(\eta-\sqrt{2\gamma})\right]^{n-d}. \qquad(8.4.16)$$
Since $\theta$ and $\eta$ are in one-to-one correspondence, solving $\partial M_n(\theta)/\partial\theta = 0$ is equivalent to solving $\partial M_n(\eta)/\partial\eta = 0$, which from (8.4.16) results in
$$e^{\eta^2/2}\,\Phi(\eta-\sqrt{2\gamma}) = \frac{1}{\sqrt{2\pi}}\cdot\frac{1-d/n}{d/n}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma}). \qquad(8.4.17)$$
As it turns out, the solution of the above equation depends only on $\gamma$ and the ratio $d/n$.

Next we derive $\sigma^2(\theta)$, $\rho(\theta)$ and $s_n^2(\theta)$ evaluated at $\theta = -\eta/\sigma$, which we denote by $\sigma^2(\gamma)$, $\rho(\gamma)$ and $s_n^2(\gamma)$. By replacing $e^{\eta^2/2}\,\Phi(\eta-\sqrt{2\gamma})$ with the right-hand side of (8.4.17), we obtain
$$\sigma^2(\gamma) = \frac{d}{n-d} - \frac{nd}{(n-d)^2}\,2\gamma + \frac{n}{n-d}\cdot\frac{1}{1+\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma})}, \qquad(8.4.18)$$
$$\begin{aligned}\rho(\gamma) = \;& \frac{n}{(n-d)\left[1+\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma})\right]}\Bigg[1 - \frac{d(n+d)}{(n-d)^2}\,2\gamma + 2\left(\frac{n^2}{(n-d)^2}\,2\gamma + 2\right)e^{-d(2n-d)\,2\gamma/(n-d)^2}\\ &\quad - \frac{d}{n-d}\left(\frac{n^2}{(n-d)^2}\,2\gamma + 3\right)\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma}) - \frac{2n}{n-d}\left(\frac{n^2}{(n-d)^2}\,2\gamma + 3\right)\sqrt{2\pi\gamma}\,e^{\gamma/2}\,\frac{n}{n-d}\Bigg], \qquad(8.4.19)\end{aligned}$$
and
$$s_n^2(\gamma) = n\left[\frac{d}{n-d}\,2\gamma + \frac{1}{1+\sqrt{2\pi\gamma}\,e^{\gamma}\,\Phi(-\sqrt{2\gamma})}\right]. \qquad(8.4.20)$$
Taking (8.4.18), (8.4.19), (8.4.20) and $\theta = -\eta/\sigma$ into $A_{n,d}$ and $B_{n,d}$ of Lemma 8.31, we obtain
$$A_{n,d} = \min\left\{\frac{1}{\sqrt{2\pi}\,\eta\,s_n(\gamma)} + \frac{4C_{n,d}\,(n-d-1)\,\rho(\gamma)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\gamma)\,s_n(\gamma)},\; 1\right\}$$
and
$$B_{n,d} = \frac{1}{\sqrt{2\pi}\,\eta\,s_n(\gamma)}\,\exp\left\{-\sqrt{2\pi}\,\eta\,s_n(\gamma)\left[\frac{1}{2} + \frac{4C_{n,d}\,(n-d-1)\,\rho(\gamma)}{\left[2(n-d)-3\sqrt 2\right]\sigma^2(\gamma)\,s_n(\gamma)} + \frac{1}{\sqrt{2\pi}\,\eta\,s_n(\gamma)}\right]\right\},$$
which depend only on $\gamma$ and $d/n$, as desired. Finally, a simple derivation yields
$$E[Y_n] = d\,E[X_1] + (n-d)\,E[X_{d+1}] = \sigma\left[d\sqrt{2\gamma} + (n-d)\left(\sqrt{2\gamma}\,\Phi(-\sqrt{2\gamma}) - \frac{1}{\sqrt{2\pi}}\,e^{-\gamma}\right)\right],$$
and hence,
$$E[Y_n] \ge 0 \iff \frac{d}{n} \ge \frac{(1/\sqrt{2\pi})\,e^{-\gamma} - \sqrt{2\gamma}\,\Phi(-\sqrt{2\gamma})}{\sqrt{2\gamma} + (1/\sqrt{2\pi})\,e^{-\gamma} - \sqrt{2\gamma}\,\Phi(-\sqrt{2\gamma})}. \qquad\Box$$
We close this section by remarking that the parameters in the upper bounds of Corollary 8.32, although they have complex expressions, are easily evaluated by computer once their formulas are established.

8.5 Generalized Neyman-Pearson Hypothesis Testing

The general expression for the Neyman-Pearson type-II error exponent subject to a constant bound on the type-I error has been proved for arbitrary observations. In this section, we state the results in terms of the inf/sup-divergence rates.
Theorem 8.33 (Neyman-Pearson type-II error exponent for a fixed test level [3]) Consider a sequence of random observations which is assumed to have a probability distribution governed by either $P_X$ (null hypothesis) or $P_{\widehat X}$ (alternative hypothesis). Then the type-II error exponent satisfies
$$\bar D_{\alpha}(X\|\widehat X) \;\le\; \limsup_{n\to\infty} -\frac{1}{n}\log\beta_n^*(\alpha) \;\le\; \lim_{\delta\downarrow 0}\bar D_{\alpha+\delta}(X\|\widehat X)$$
and
$$\underline D_{\alpha}(X\|\widehat X) \;\le\; \liminf_{n\to\infty} -\frac{1}{n}\log\beta_n^*(\alpha) \;\le\; \lim_{\delta\downarrow 0}\underline D_{\alpha+\delta}(X\|\widehat X),$$
where $\beta_n^*(\alpha)$ represents the minimum type-II error probability subject to a fixed type-I error bound $\alpha \in [0,1)$, and $\bar D_{\alpha}(X\|\widehat X)$ and $\underline D_{\alpha}(X\|\widehat X)$ denote the quantiles at level $\alpha$ of the sup- and inf-spectra of the normalized log-likelihood ratio $(1/n)\log\left(dP_{X^n}/dP_{\widehat X^n}\right)(X^n)$.
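In the special case of i.i.d. observations, the divergence spectrum collapses to the single point $D(P\|\widehat P)$, so the inf- and sup-divergence rates coincide and the theorem reduces to the classical Stein exponent $D(P\|\widehat P)$. A small sketch (the two Bernoulli parameters are arbitrary illustrative choices):

```python
import math

# Null hypothesis P and alternative P_hat on {0,1} (arbitrary illustrative values).
p, q = 0.3, 0.6            # P(1) = p under H0, P_hat(1) = q under H1

# D(P || P_hat): the Stein exponent of the minimum type-II error for i.i.d. data.
D = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Per-sample log-likelihood ratio log(P(X)/P_hat(X)) under the null hypothesis.
llr1, llr0 = math.log(p / q), math.log((1 - p) / (1 - q))
mean = p * llr1 + (1 - p) * llr0
var = p * llr1 ** 2 + (1 - p) * llr0 ** 2 - mean ** 2

assert abs(mean - D) < 1e-12   # the spectrum of the normalized sum centers at D
assert var > 0                 # variance of the normalized sum is var/n -> 0,
                               # so the spectrum collapses to the point D
print(round(D, 4))  # → 0.1838
```

Since the variance of the normalized sum vanishes as $n \to \infty$, every quantile of the spectrum equals $D(P\|\widehat P)$, which is why all four bounds in the theorem coincide in the i.i.d. case.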
The general formula for the Neyman-Pearson type-II error exponent subject to an exponential test level has also been proved in terms of the inf/sup-divergence rates.

Theorem 8.34 (Neyman-Pearson type-II error exponent for an exponential test level) Fix $s \in (0,1)$. It is possible to choose decision regions for a binary hypothesis testing problem with arbitrary datawords of blocklength $n$ (which are governed by either the null hypothesis distribution $P_X$ or the alternative hypothesis distribution $P_{\widehat X}$) such that
$$\liminf_{n\to\infty} -\frac{1}{n}\log\beta_n \ge \underline D(\widetilde X^{(s)}\|\widehat X) \quad\text{and}\quad \limsup_{n\to\infty} -\frac{1}{n}\log\alpha_n \ge \bar D^{(1)}(\widetilde X^{(s)}\|X), \qquad(8.5.1)$$
or
$$\liminf_{n\to\infty} -\frac{1}{n}\log\alpha_n \ge \underline D(\widetilde X^{(s)}\|X) \quad\text{and}\quad \limsup_{n\to\infty} -\frac{1}{n}\log\beta_n \ge \bar D^{(1)}(\widetilde X^{(s)}\|\widehat X), \qquad(8.5.2)$$
where $\widetilde X^{(s)}$ exhibits the tilted distributions $\left\{P^{(s)}_{X^n}\right\}_{n=1}^{\infty}$ defined by
$$dP^{(s)}_{X^n}(x^n) \triangleq \frac{1}{\Lambda_n(s)}\exp\left\{-s\log\frac{dP_{\widehat X^n}}{dP_{X^n}}(x^n)\right\}dP_{\widehat X^n}(x^n),$$
and
$$\Lambda_n(s) \triangleq \int_{\mathcal X^n}\exp\left\{-s\log\frac{dP_{\widehat X^n}}{dP_{X^n}}(x^n)\right\}dP_{\widehat X^n}(x^n).$$
Here, $\alpha_n$ and $\beta_n$ are the type-I and type-II error probabilities, respectively, and $\bar D^{(1)}(\cdot\|\cdot)$ denotes the largest quantile of the sup-divergence spectrum.
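Letter by letter, the tilted distribution $P^{(s)}_{X^n}$ is proportional to $P_{\widehat X}^{1-s}\,P_X^{s}$, interpolating between the two hypotheses. A one-letter sketch (the alphabet and probabilities are arbitrary illustrative choices):

```python
# One-letter analogue of the tilted distribution in Theorem 8.34:
# P^(s)(x) is proportional to P_hat(x)^(1-s) * P(x)^s.
# Alphabet size and the two distributions are arbitrary illustrative choices.
P = [0.5, 0.3, 0.2]        # null-hypothesis distribution
P_hat = [0.2, 0.3, 0.5]    # alternative-hypothesis distribution
s = 0.4

Lam = sum(ph ** (1 - s) * p ** s for p, ph in zip(P, P_hat))  # normalizer Lambda(s)
tilted = [ph ** (1 - s) * p ** s / Lam for p, ph in zip(P, P_hat)]

assert abs(sum(tilted) - 1.0) < 1e-12
assert 0 < Lam <= 1.0   # Lambda(s) <= 1 follows from Jensen's inequality

# Endpoints of the interpolation: s = 0 recovers P_hat, s = 1 recovers P.
for s_end, target in ((0.0, P_hat), (1.0, P)):
    Z = sum(ph ** (1 - s_end) * p ** s_end for p, ph in zip(P, P_hat))
    t = [ph ** (1 - s_end) * p ** s_end / Z for p, ph in zip(P, P_hat)]
    assert all(abs(a - b) < 1e-12 for a, b in zip(t, target))
```

The parameter $s$ thus trades the two error exponents off against each other, exactly as the pair of bounds in the theorem suggests.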
Proof: For ease of notation, we use $\widetilde X$ to represent $\widetilde X^{(s)}$, and write
$$d_{\widetilde X^n\|X^n}(x^n) \triangleq \log\frac{dP_{\widetilde X^n}}{dP_{X^n}}(x^n) \quad\text{and}\quad d_{\widetilde X^n\|\widehat X^n}(x^n) \triangleq \log\frac{dP_{\widetilde X^n}}{dP_{\widehat X^n}}(x^n)$$
for the log-likelihood ratios of the tilted distribution against the two hypotheses. We only prove (8.5.1); (8.5.2) can be similarly demonstrated.

By the definition of $dP^{(s)}_{X^n}(\cdot)$, we have
$$\frac{1}{s}\left[\frac{1}{n}\,d_{\widetilde X^n\|\widehat X^n}(\widetilde X^n)\right] + \frac{1}{1-s}\left[\frac{1}{n}\,d_{\widetilde X^n\|X^n}(\widetilde X^n)\right] = \frac{1}{s(1-s)}\left[-\frac{1}{n}\log\Lambda_n(s)\right]. \qquad(8.5.3)$$

Let $\lambda \triangleq \limsup_{n\to\infty} -(1/n)\log\Lambda_n(s)$. Then, for any $\delta > 0$, there exists $N_0$ such that for all $n > N_0$,
$$-\frac{1}{n}\log\Lambda_n(s) < \lambda + \delta.$$
From (8.5.3), which in fact holds pointwise since both log-likelihood ratios are functions of $\log\left(dP_{X^n}/dP_{\widehat X^n}\right)$, the liminf-spectrum of the normalized log-likelihood ratio against the alternative hypothesis satisfies
$$\begin{aligned}\underline d_{\widetilde X^n\|\widehat X^n}(t) &\triangleq \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}\,d_{\widetilde X^n\|\widehat X^n}(\widetilde X^n) \le t\right\}\\ &= \liminf_{n\to\infty}\Pr\left\{\frac{1}{1-s}\left[\frac{1}{n}\,d_{\widetilde X^n\|X^n}(\widetilde X^n)\right] \ge \frac{1}{s(1-s)}\left[-\frac{1}{n}\log\Lambda_n(s)\right] - \frac{t}{s}\right\}\\ &= \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}\,d_{\widetilde X^n\|X^n}(\widetilde X^n) \ge -\frac{1-s}{s}\,t + \frac{1}{s}\left[-\frac{1}{n}\log\Lambda_n(s)\right]\right\}\\ &\ge \liminf_{n\to\infty}\Pr\left\{\frac{1}{n}\,d_{\widetilde X^n\|X^n}(\widetilde X^n) > -\frac{1-s}{s}\,t + \frac{\lambda+\delta}{s}\right\}\\ &= 1 - \limsup_{n\to\infty}\Pr\left\{\frac{1}{n}\,d_{\widetilde X^n\|X^n}(\widetilde X^n) \le -\frac{1-s}{s}\,t + \frac{\lambda+\delta}{s}\right\}\\ &= 1 - \bar d_{\widetilde X^n\|X^n}\!\left(-\frac{1-s}{s}\,t + \frac{\lambda+\delta}{s}\right).\end{aligned}$$
Thus,
$$\begin{aligned}\underline D(\widetilde X\|\widehat X) &= \sup\left\{t : \underline d_{\widetilde X^n\|\widehat X^n}(t) = 0\right\}\\ &\le \sup\left\{t : 1 - \bar d_{\widetilde X^n\|X^n}\!\left(\frac{\lambda+\delta}{s} - \frac{1-s}{s}\,t\right) \le 0\right\}\\ &= \sup\left\{\frac{1}{1-s}(\lambda+\delta) - \frac{s}{1-s}\,\tilde t \;:\; \bar d_{\widetilde X^n\|X^n}(\tilde t\,) \ge 1\right\}\\ &= \frac{1}{1-s}(\lambda+\delta) - \frac{s}{1-s}\inf\left\{\tilde t : \bar d_{\widetilde X^n\|X^n}(\tilde t\,) \ge 1\right\}\\ &= \frac{1}{1-s}(\lambda+\delta) - \frac{s}{1-s}\sup\left\{\tilde t : \bar d_{\widetilde X^n\|X^n}(\tilde t\,) < 1\right\}\\ &= \frac{1}{1-s}(\lambda+\delta) - \frac{s}{1-s}\,\bar D^{(1)}(\widetilde X\|X).\end{aligned}$$
Finally, choosing the acceptance region for the null hypothesis as
$$\mathcal A_n \triangleq \left\{x^n \in \mathcal X^n : \frac{1}{n}\log\frac{dP_{\widetilde X^n}}{dP_{\widehat X^n}}(x^n) \ge \underline D(\widetilde X\|\widehat X)\right\},$$
we obtain
$$\beta_n = P_{\widehat X^n}\left\{\frac{1}{n}\log\frac{dP_{\widetilde X^n}}{dP_{\widehat X^n}}(\widehat X^n) \ge \underline D(\widetilde X\|\widehat X)\right\} \le \exp\left\{-n\,\underline D(\widetilde X\|\widehat X)\right\}$$
by the change-of-measure bound $P_{\widehat X^n}\!\left\{dP_{\widetilde X^n}/dP_{\widehat X^n} \ge e^{nc}\right\} \le e^{-nc}$,
and
$$\begin{aligned}\alpha_n &= P_{X^n}\left\{\frac{1}{n}\log\frac{dP_{\widetilde X^n}}{dP_{\widehat X^n}}(X^n) < \underline D(\widetilde X\|\widehat X)\right\}\\ &\le P_{X^n}\left\{\frac{1}{n}\log\frac{dP_{\widetilde X^n}}{dP_{\widehat X^n}}(X^n) < \frac{1}{1-s}(\lambda+\delta) - \frac{s}{1-s}\,\bar D^{(1)}(\widetilde X\|X)\right\}\\ &= P_{X^n}\left\{\frac{1}{1-s}\left[\frac{1}{n}\log\frac{dP_{\widetilde X^n}}{dP_{X^n}}(X^n)\right] - \frac{1}{s(1-s)}\left[-\frac{1}{n}\log\Lambda_n(s)\right] > \frac{1}{1-s}\,\bar D^{(1)}(\widetilde X\|X) - \frac{\lambda+\delta}{s(1-s)}\right\}\\ &= P_{X^n}\left\{\frac{1}{n}\log\frac{dP_{\widetilde X^n}}{dP_{X^n}}(X^n) > \bar D^{(1)}(\widetilde X\|X) - \frac{1}{s}\left[\frac{1}{n}\log\Lambda_n(s)\right] - \frac{\lambda+\delta}{s}\right\},\end{aligned}$$
where the last two equalities use (8.5.3) as a pointwise identity. Along a subsequence $\{n_k\}$ for which $-(1/n_k)\log\Lambda_{n_k}(s) > \lambda - \delta$ (such a subsequence exists by the definition of $\lambda$), the threshold in the last probability exceeds $\bar D^{(1)}(\widetilde X\|X) - 2\delta/s$, so that the same change-of-measure bound, now under $P_{X^n}$, yields
$$\alpha_{n_k} \le P_{X^{n_k}}\left\{\frac{1}{n_k}\log\frac{dP_{\widetilde X^{n_k}}}{dP_{X^{n_k}}}(X^{n_k}) > \bar D^{(1)}(\widetilde X\|X) - \frac{2\delta}{s}\right\} \le \exp\left\{-n_k\left[\bar D^{(1)}(\widetilde X\|X) - \frac{2\delta}{s}\right]\right\}.$$
Since $\delta > 0$ is arbitrary, consequently,
$$\liminf_{n\to\infty} -\frac{1}{n}\log\beta_n \ge \underline D(\widetilde X^{(s)}\|\widehat X) \quad\text{and}\quad \limsup_{n\to\infty} -\frac{1}{n}\log\alpha_n \ge \bar D^{(1)}(\widetilde X^{(s)}\|X). \qquad\Box$$
Bibliography

[1] P. Billingsley, Probability and Measure, 2nd edition. New York, NY: John Wiley and Sons, 1995.

[2] J. A. Bucklew, Large Deviation Techniques in Decision, Simulation, and Estimation. New York: Wiley, 1990.

[3] P.-N. Chen, "General formulas for the Neyman-Pearson type-II error exponent subject to fixed and exponential type-I error bounds," IEEE Trans. Inform. Theory, vol. 42, no. 1, pp. 316-323, January 1996.

[4] J.-D. Deuschel and D. W. Stroock, Large Deviations. San Diego: Academic Press, 1989.

[5] W. Feller, An Introduction to Probability Theory and its Applications, 2nd edition. New York: John Wiley and Sons, 1970.
Chapter 9

Channel Reliability Function

The channel coding theorem (cf. Chapter 5) shows that block codes with arbitrarily small probability of block decoding error exist at any code rate smaller than the channel capacity $C$. However, the result says nothing about the relation between the rate of convergence of the error probability and the code rate (specifically, for rates less than $C$). For example, if $C = 1$ bit per channel usage, and there are two reliable codes with rates 0.5 and 0.25 bits per channel usage respectively, should one adopt the latter code even though it yields a lower information transmission rate? The answer is affirmative if a higher error exponent is required.

In this chapter, we first re-prove the channel coding theorem for DMCs, using an alternative and more delicate method that describes the dependence between blocklength and the probability of decoding error.
9.1 Random-coding exponent

Definition 9.1 (channel block code) An $(n, M)$ code for channel $W^n$ with input alphabet $\mathcal X$ and output alphabet $\mathcal Y$ is a pair of mappings
$$f : \{1, 2, \ldots, M\} \to \mathcal X^n \quad\text{and}\quad g : \mathcal Y^n \to \{1, 2, \ldots, M\}.$$
Its average error probability is given by
$$P_e(n, M) \triangleq \frac{1}{M}\sum_{m=1}^{M}\;\sum_{y^n :\, g(y^n) \ne m} W^n\!\left(y^n \mid f(m)\right).$$
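Definition 9.1 can be evaluated by brute force for toy examples. The sketch below computes $P_e(n, M)$ for a two-codeword repetition code over a BSC under maximum-likelihood decoding (the code and crossover probability are illustrative choices, not taken from the text):

```python
import itertools
import math

eps = 0.2                                  # BSC crossover probability (arbitrary)
n, M = 3, 2
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}    # f : {1, ..., M} -> {0, 1}^n

def W(y, x):
    """Product channel W^n(y|x) for the BSC."""
    return math.prod(eps if yi != xi else 1 - eps for yi, xi in zip(y, x))

def g(y):
    """Maximum-likelihood decoder (ties broken toward the smaller index)."""
    return max(sorted(codebook), key=lambda m: W(y, codebook[m]))

# P_e(n, M) = (1/M) * sum_m sum_{y^n : g(y^n) != m} W^n(y^n | f(m)).
P_e = sum(W(y, codebook[m])
          for m in codebook
          for y in itertools.product((0, 1), repeat=n)
          if g(y) != m) / M
print(round(P_e, 4))  # → 0.104  (majority vote: 3*eps^2*(1-eps) + eps^3)
```

Exhaustive enumeration over $\mathcal Y^n$ is of course only feasible for tiny $n$; the rest of the chapter is precisely about how $P_e$ behaves as $n$ grows.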
Definition 9.2 (channel reliability function [1]) For any $R \ge 0$, define the channel reliability function
$$E(R) \triangleq \limsup_{n\to\infty} -\frac{1}{n}\log P_e(n, M_n),$$
where the code sizes $\{M_n\}_{n\ge1}$ satisfy
$$R \le \liminf_{n\to\infty}\frac{1}{n}\log M_n. \qquad(9.1.1)$$
From the above definition, it is obvious that $E(R)$ is non-increasing in $R$.

Definition 9.3 (random coding exponent) The random coding exponent for a DMC with generic transition probability $P_{Y|X}$ is defined by
$$E_r(R) \triangleq \max_{0\le s\le 1}\,\max_{P_X}\left[-sR - \log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_X(x)\,P_{Y|X}^{1/(1+s)}(y|x)\right)^{1+s}\right].$$

Theorem 9.4 (random coding exponent) For a DMC with generic transition probability $P_{Y|X}$,
$$E(R) \ge E_r(R) \quad\text{for } R \ge 0.$$
Proof: The theorem holds trivially at $R = 0$ since $E(0) = \infty$. It remains to prove the theorem for $R > 0$.

As in the proof of the channel coding theorem, the codewords are randomly selected according to some distribution $P_{\widetilde X^n}$. For fixed $R > 0$, choose a sequence $\{M_n\}_{n\ge1}$ with $R = \lim_{n\to\infty}(1/n)\log M_n$.

Step 1: Maximum-likelihood decoder. Let $c_1, \ldots, c_{M_n} \in \mathcal X^n$ denote the set of $n$-tuple block codewords selected, and let the decoding partition for symbol $m$ (namely, the set of channel outputs that are classified to $m$) be
$$\mathcal U_m \triangleq \left\{y^n : P_{Y^n|X^n}(y^n|c_m) > P_{Y^n|X^n}(y^n|c_{m'}),\ \text{for all } m' \ne m\right\}.$$
Those channel outputs that are on the boundary, i.e., such that
$$\text{for some } m \text{ and } m', \quad P_{Y^n|X^n}(y^n|c_m) = P_{Y^n|X^n}(y^n|c_{m'}),$$
will be arbitrarily assigned to either $m$ or $m'$.

Step 2: Property of the indicator function for $s \ge 0$. Let $\phi_m$ be the indicator function of $\mathcal U_m$. Then¹, for all $s \ge 0$,
$$1 - \phi_m(y^n) \le \left[\sum_{m' \ne m,\; 1\le m'\le M_n}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/(1+s)}\right]^s.$$

¹ $\left(\sum_i a_i^{1/(1+s)}\right)^s$ is no less than 1 when one of the $a_i$ is no less than 1.
Step 3: Probability of error given that codeword $c_m$ is transmitted. Let $P_{e|m}$ denote the probability of error given that codeword $c_m$ is transmitted. Then
$$\begin{aligned} P_{e|m} &\triangleq \sum_{y^n \notin\, \mathcal U_m} P_{Y^n|X^n}(y^n|c_m)\\ &= \sum_{y^n\in\mathcal Y^n} P_{Y^n|X^n}(y^n|c_m)\left[1-\phi_m(y^n)\right]\\ &\le \sum_{y^n\in\mathcal Y^n} P_{Y^n|X^n}(y^n|c_m)\left[\sum_{m'\ne m,\;1\le m'\le M_n}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/(1+s)}\right]^s\\ &= \sum_{y^n\in\mathcal Y^n} P_{Y^n|X^n}^{1/(1+s)}(y^n|c_m)\left[\sum_{m'\ne m,\;1\le m'\le M_n} P_{Y^n|X^n}^{1/(1+s)}(y^n|c_{m'})\right]^s.\end{aligned}$$
Step 4: Expectation of $P_{e|m}$.
$$\begin{aligned} E[P_{e|m}] &\le E\left[\sum_{y^n\in\mathcal Y^n} P_{Y^n|X^n}^{1/(1+s)}(y^n|c_m)\left(\sum_{m'\ne m,\;1\le m'\le M_n} P_{Y^n|X^n}^{1/(1+s)}(y^n|c_{m'})\right)^s\right]\\ &= \sum_{y^n\in\mathcal Y^n} E\left[P_{Y^n|X^n}^{1/(1+s)}(y^n|c_m)\left(\sum_{m'\ne m,\;1\le m'\le M_n} P_{Y^n|X^n}^{1/(1+s)}(y^n|c_{m'})\right)^s\right]\\ &= \sum_{y^n\in\mathcal Y^n} E\left[P_{Y^n|X^n}^{1/(1+s)}(y^n|c_m)\right] E\left[\left(\sum_{m'\ne m,\;1\le m'\le M_n} P_{Y^n|X^n}^{1/(1+s)}(y^n|c_{m'})\right)^s\right],\end{aligned}$$
where the last step follows because the codewords $\{c_m\}_{1\le m\le M_n}$ are independent random variables.

Step 5: Jensen's inequality for $0 \le s \le 1$, and bounds on the probability of error. By Jensen's inequality, for $0 \le s \le 1$,
$$E[t^s] \le \left(E[t]\right)^s.$$
Therefore,
$$\begin{aligned} E[P_{e|m}] &\le \sum_{y^n\in\mathcal Y^n} E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right] E\left[\left(\sum_{m'\ne m,\;1\le m'\le M_n} P^{1/(1+s)}_{Y^n|X^n}(y^n|c_{m'})\right)^s\right]\\ &\le \sum_{y^n\in\mathcal Y^n} E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right]\left(E\left[\sum_{m'\ne m,\;1\le m'\le M_n} P^{1/(1+s)}_{Y^n|X^n}(y^n|c_{m'})\right]\right)^s\\ &= \sum_{y^n\in\mathcal Y^n} E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right]\left(\sum_{m'\ne m,\;1\le m'\le M_n} E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_{m'})\right]\right)^s.\end{aligned}$$
Since the codewords are selected with identical distribution, the expectations $E\bigl[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_{m'})\bigr]$ are the same for each $m'$. Hence,
$$\begin{aligned} E[P_{e|m}] &\le \sum_{y^n\in\mathcal Y^n} E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right]\left((M_n-1)\,E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right]\right)^s\\ &= (M_n-1)^s\sum_{y^n\in\mathcal Y^n}\left(E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right]\right)^{1+s}\\ &\le M_n^s\sum_{y^n\in\mathcal Y^n}\left(E\left[P^{1/(1+s)}_{Y^n|X^n}(y^n|c_m)\right]\right)^{1+s}\\ &= M_n^s\sum_{y^n\in\mathcal Y^n}\left(\sum_{x^n\in\mathcal X^n} P_{\widetilde X^n}(x^n)\,P^{1/(1+s)}_{Y^n|X^n}(y^n|x^n)\right)^{1+s}.\end{aligned}$$
Since the upper bound on $E[P_{e|m}]$ no longer depends on $m$, the expected average error probability $E[P_e]$ can certainly be bounded by the same bound, namely,
$$E[P_e] \le M_n^s\sum_{y^n\in\mathcal Y^n}\left(\sum_{x^n\in\mathcal X^n} P_{\widetilde X^n}(x^n)\,P^{1/(1+s)}_{Y^n|X^n}(y^n|x^n)\right)^{1+s}, \qquad(9.1.2)$$
which immediately implies the existence of an $(n, M_n)$ code satisfying
$$P_e(n, M_n) \le M_n^s\sum_{y^n\in\mathcal Y^n}\left(\sum_{x^n\in\mathcal X^n} P_{\widetilde X^n}(x^n)\,P^{1/(1+s)}_{Y^n|X^n}(y^n|x^n)\right)^{1+s}.$$
By using the fact that $P_{\widetilde X^n}$ and $P_{Y^n|X^n}$ are product distributions with identical marginals, and taking logarithms on both sides of the above inequality, we obtain
$$-\frac{1}{n}\log P_e(n, M_n) \ge -\frac{s}{n}\log M_n - \log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_{\widetilde X}(x)\,P^{1/(1+s)}_{Y|X}(y|x)\right)^{1+s},$$
which implies
$$\liminf_{n\to\infty} -\frac{1}{n}\log P_e(n, M_n) \ge -sR - \log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_{\widetilde X}(x)\,P^{1/(1+s)}_{Y|X}(y|x)\right)^{1+s}.$$
The proof is completed by noting that the lower bound holds for any $s \in [0,1]$ and any $P_{\widetilde X}$. $\Box$
9.1.1 The properties of the random coding exponent

We reformulate $E_r(R)$ to facilitate the understanding of its behavior.

Definition 9.5 (random-coding exponent) The random coding exponent for a DMC with generic distribution $P_{Y|X}$ is defined by
$$E_r(R) \triangleq \sup_{0\le s\le 1}\left[-sR + E_0(s)\right],$$
where
$$E_0(s) \triangleq \sup_{P_X}\left[-\log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_X(x)\,P^{1/(1+s)}_{Y|X}(y|x)\right)^{1+s}\right].$$
Based on this reformulation, the properties of $E_r(R)$ can be deduced from an analysis of the function $E_0(s)$.

Lemma 9.6 (properties of $E_r(R)$)

1. $E_r(R)$ is convex and non-increasing in $R$; hence, it is a continuous function of $R \in [0, \infty)$, strictly decreasing wherever it is positive.

2. $E_r(C) = 0$, where $C$ is the channel capacity, and $E_r(C - \epsilon) > 0$ for all $0 < \epsilon < C$.

3. There exists $R_{cr}$ such that for $0 < R < R_{cr}$, the slope of $E_r(R)$ is $-1$.
Proof:

1. $E_r(R)$ is convex and non-increasing, since it is the supremum of affine functions $-sR + E_0(s)$ with non-positive slope $-s$. Hence, $E_r(R)$ is continuous for $R > 0$ and strictly decreasing wherever it is positive.

2. $-sR + E_0(s)$ equals zero at $R = E_0(s)/s$ for $0 < s \le 1$. Hence, by the continuity of $E_r(R)$, $E_r(R) = 0$ for $R = \sup_{0<s\le1} E_0(s)/s$, and $E_r(R) > 0$ for $R < \sup_{0<s\le1} E_0(s)/s$. The property is then justified by
$$\begin{aligned}\sup_{0<s\le1}\frac{1}{s}E_0(s) &= \sup_{0<s\le1}\,\sup_{P_X}\left[-\frac{1}{s}\log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_X(x)\,P^{1/(1+s)}_{Y|X}(y|x)\right)^{1+s}\right]\\ &= \sup_{P_X}\,\sup_{0<s\le1}\left[-\frac{1}{s}\log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_X(x)\,P^{1/(1+s)}_{Y|X}(y|x)\right)^{1+s}\right]\\ &= \sup_{P_X}\,\lim_{s\downarrow0}\left[-\frac{1}{s}\log\sum_{y\in\mathcal Y}\left(\sum_{x\in\mathcal X} P_X(x)\,P^{1/(1+s)}_{Y|X}(y|x)\right)^{1+s}\right]\\ &= \sup_{P_X} I(X;Y) = C,\end{aligned}$$
where the third equality holds because the bracketed quantity is non-increasing in $s$.

3. The intersection of $-sR + E_0(s)$ and $-R + E_0(1)$ occurs at
$$R_s = \frac{E_0(1) - E_0(s)}{1-s},$$
and below $R_s$, $-sR + E_0(s) \le -R + E_0(1)$. Therefore,
$$R_{cr} = \inf_{0\le s<1}\frac{E_0(1)-E_0(s)}{1-s}.$$
So below $R_{cr}$, the slope of $E_r(R)$ remains $-1$. $\Box$
Example 9.7 For the BSC with crossover probability $\epsilon$, the random coding exponent becomes
$$E_r(R) = \max_{0\le p\le1}\,\max_{0\le s\le1}\left\{-sR - \log\left[\left(p(1-\epsilon)^{1/(1+s)} + (1-p)\,\epsilon^{1/(1+s)}\right)^{1+s} + \left(p\,\epsilon^{1/(1+s)} + (1-p)(1-\epsilon)^{1/(1+s)}\right)^{1+s}\right]\right\},$$
where $(p, 1-p)$ is the input distribution.

Note that the input distribution achieving $E_r(R)$ is uniform, i.e., $p = 1/2$. Hence,
$$E_0(s) = s\log 2 - (1+s)\log\left(\epsilon^{1/(1+s)} + (1-\epsilon)^{1/(1+s)}\right)$$
Figure 9.1: BSC with crossover probability $\epsilon$ and input distribution $(p, 1-p)$.
and
$$R_{cr} = \inf_{0\le s<1}\frac{E_0(1)-E_0(s)}{1-s} = \lim_{s\to1}\frac{\partial E_0(s)}{\partial s} = \log 2 - \log\left(\sqrt\epsilon + \sqrt{1-\epsilon}\right) + \frac{\sqrt\epsilon\,\log\epsilon + \sqrt{1-\epsilon}\,\log(1-\epsilon)}{2\left(\sqrt\epsilon + \sqrt{1-\epsilon}\right)}.$$
The random coding exponent for $\epsilon = 0.2$ is depicted in Figure 9.2.
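The closed forms in Example 9.7 are easily evaluated numerically (natural logarithms throughout; the grid search over $s$ is a sketch, not an exact optimization):

```python
import math

eps = 0.2

def E0(s):
    # E_0(s) = s*log(2) - (1+s)*log(eps^{1/(1+s)} + (1-eps)^{1/(1+s)}), uniform input.
    return s * math.log(2) - (1 + s) * math.log(
        eps ** (1 / (1 + s)) + (1 - eps) ** (1 / (1 + s)))

def Er(R):
    # Random coding exponent: sup over 0 <= s <= 1 of -s*R + E_0(s) (grid search).
    return max(-s * R + E0(s) for s in (i / 1000 for i in range(1001)))

# Channel capacity of the BSC (in nats): C = log(2) - H_b(eps).
C = math.log(2) + eps * math.log(eps) + (1 - eps) * math.log(1 - eps)

# Critical rate from the closed form of Example 9.7.
a, b = math.sqrt(eps), math.sqrt(1 - eps)
R_cr = (math.log(2) - math.log(a + b)
        + (a * math.log(eps) + b * math.log(1 - eps)) / (2 * (a + b)))

assert abs(C - 0.192745) < 1e-5       # value quoted in Figure 9.2
assert abs(R_cr - 0.056633) < 1e-5    # value quoted in Figure 9.2
assert Er(C) < 1e-6 and Er(R_cr) > 0  # E_r vanishes at capacity, positive below it
```

The assertions reproduce the numerical values $C = 0.192745$ and $R_{cr} = 0.056633$ annotated on Figure 9.2.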
9.2 Expurgated exponent

The proof of the random coding exponent is based on the random coding argument, which selects codewords independently according to some input distribution. Since the codeword selection is unbiased, the "good" codewords (i.e., those with small $P_{e|m}$) and the "bad" codewords (i.e., those with large $P_{e|m}$) contribute equally to the overall average error probability. Therefore, if, to some extent, we can "expurgate" the contribution of the bad codewords, a better bound on the channel reliability function may be found.

Instead of randomly selecting $M_n$ codewords, the expurgated approach first draws $2M_n$ codewords to form a codebook $\mathcal C_{2M_n}$, and sorts these codewords in ascending order of $P_{e|m}(\mathcal C_{2M_n})$, the error probability given that codeword $m$ is transmitted. After that, it chooses the first $M_n$ codewords (those whose $P_{e|m}(\mathcal C_{2M_n})$ is smaller) to form a new codebook $\mathcal C_{M_n}$. It can be expected that for $1 \le m \le M_n$ and $M_n < m' \le 2M_n$,
$$P_{e|m}(\mathcal C_{M_n}) \le P_{e|m}(\mathcal C_{2M_n}) \le P_{e|m'}(\mathcal C_{2M_n}),$$
provided that the maximum-likelihood decoder is always employed at the receiver side. Hence, a better codebook is obtained.
Figure 9.2: Random coding exponent for the BSC with crossover probability 0.2 ($C = 0.192745$). Also plotted is $s^* = \arg\sup_{0\le s\le1}\left[-sR + E_0(s)\right]$. $R_{cr} = 0.056633$.
To further reduce the contribution of the bad codewords when computing the expected average error probability under random codeword selection, a better bound in terms of Lyapounov's inequality² is introduced. Note that by Lyapounov's inequality,
$$E^{1/\rho}\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right] \le E\left[P_{e|m}(\mathcal C_{2M_n})\right] \quad\text{for } 0 < \rho \le 1.$$

Similar to the random coding argument, we need to claim the existence of one good code whose error probability vanishes at least at the rate of the expurgated exponent. Recall that the random coding argument reaches this conclusion via the simple fact that $E[P_e] \le e^{-nE_r(R)}$ (cf. (9.1.2)) implies the existence of a code with $P_e \le e^{-nE_r(R)}$. However, a different argument is adopted to establish the existence of such good codes for the expurgated exponent.

Lemma 9.8 (existence of good codes for the expurgated exponent) There exists an $(n, M_n)$ block code $\mathcal C_{M_n}$ with
$$P_{e|m}(\mathcal C_{M_n}) \le 2^{1/\rho}\,E^{1/\rho}\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right], \qquad(9.2.1)$$
where $\mathcal C_{2M_n}$ is randomly selected according to some distribution $P_{\widetilde X^n}$.

² Lyapounov's inequality: $E^{1/\beta}\left[|X|^{\beta}\right] \le E^{1/\alpha}\left[|X|^{\alpha}\right]$ for $0 < \beta \le \alpha$.
Proof: Let the random codebook $\mathcal C_{2M_n}$ be denoted by
$$\mathcal C_{2M_n} = \{c_1, c_2, \ldots, c_{2M_n}\}.$$
Let $\phi(\cdot)$ be the indicator function of the set
$$\left\{t \ge 0 : t < 2^{1/\rho}\,E^{1/\rho}\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right]\right\},$$
for some $\rho > 0$. Hence, $\phi\left(P_{e|m}(\mathcal C_{2M_n})\right) = 1$ if and only if
$$P_{e|m}(\mathcal C_{2M_n}) < 2^{1/\rho}\,E^{1/\rho}\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right].$$
By Markov's inequality,
$$\begin{aligned} E\left[\sum_{1\le m\le 2M_n}\phi\left(P_{e|m}(\mathcal C_{2M_n})\right)\right] &= \sum_{1\le m\le 2M_n} E\left[\phi\left(P_{e|m}(\mathcal C_{2M_n})\right)\right]\\ &= 2M_n\,E\left[\phi\left(P_{e|m}(\mathcal C_{2M_n})\right)\right]\\ &= 2M_n\Pr\left\{P_{e|m}(\mathcal C_{2M_n}) < 2^{1/\rho}\,E^{1/\rho}\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right]\right\}\\ &= 2M_n\Pr\left\{P^{\rho}_{e|m}(\mathcal C_{2M_n}) < 2\,E\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right]\right\}\\ &\ge 2M_n\left(1 - \frac{E\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right]}{2\,E\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right]}\right) = M_n.\end{aligned}$$
Therefore, there exists at least one codebook such that
$$\sum_{1\le m\le 2M_n}\phi\left(P_{e|m}(\mathcal C_{2M_n})\right) \ge M_n.$$
Hence, by selecting $M_n$ codewords from this codebook with $\phi\left(P_{e|m}(\mathcal C_{2M_n})\right) = 1$, a new codebook is formed, and it is obvious that (9.2.1) holds for this new codebook. $\Box$
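The counting step in the proof above is just Markov's inequality, and is easy to see empirically. In the sketch below, arbitrary random scores stand in for the conditional error probabilities $P_{e|m}(\mathcal C_{2M_n})$:

```python
import random

random.seed(7)
rho = 0.5
M = 1000                    # target codebook size; draw 2M candidate "codewords"
# Stand-ins for P_e|m(C_2M): any nonnegative random scores work for the counting.
scores = [random.random() ** 2 for _ in range(2 * M)]

# Threshold 2^{1/rho} * E^{1/rho}[P^rho], with the expectation replaced by the
# empirical mean over the 2M draws.
threshold = (2 * sum(x ** rho for x in scores) / (2 * M)) ** (1 / rho)
good = [x for x in scores if x < threshold]     # phi(P_e|m) = 1

# Markov's inequality guarantees at least M of the 2M scores fall below the
# threshold; keeping the M smallest forms the "expurgated" codebook.
assert len(good) >= M
expurgated = sorted(scores)[:M]
assert max(expurgated) < threshold
```

Note that the guarantee $\#\{m : \phi = 1\} \ge M$ is deterministic given the empirical mean, which mirrors how the lemma extracts one concrete good codebook from a random ensemble.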
Definition 9.9 (expurgated exponent) The expurgated exponent for a DMC with generic distribution $P_{Y|X}$ is defined by
$$E_{ex}(R) \triangleq \sup_{s\ge1}\,\sup_{P_X}\left\{-sR - s\log\sum_{x\in\mathcal X}\sum_{x'\in\mathcal X} P_X(x)\,P_X(x')\left[\sum_{y\in\mathcal Y}\sqrt{P_{Y|X}(y|x)\,P_{Y|X}(y|x')}\right]^{1/s}\right\}.$$

Theorem 9.10 (expurgated exponent) For a DMC with generic transition probability $P_{Y|X}$,
$$E(R) \ge E_{ex}(R) \quad\text{for } R \ge 0.$$
Proof: The theorem holds trivially at $R = 0$ since $E(0) = \infty$. It remains to prove the theorem for $R > 0$.

Randomly select $2M_n$ codewords according to some distribution $P_{\widetilde X^n}$. For fixed $R > 0$, choose a sequence $\{M_n\}_{n\ge1}$ with $R = \lim_{n\to\infty}(1/n)\log M_n$.

Step 1: Maximum-likelihood decoder. Let $c_1, \ldots, c_{2M_n} \in \mathcal X^n$ denote the set of $n$-tuple block codewords selected, and let the decoding partition for symbol $m$ (namely, the set of channel outputs that are classified to $m$) be
$$\mathcal U_m = \left\{y^n : P_{Y^n|X^n}(y^n|c_m) > P_{Y^n|X^n}(y^n|c_{m'}),\ \text{for all } m' \ne m\right\}.$$
Those channel outputs that are on the boundary, i.e., such that for some $m$ and $m'$, $P_{Y^n|X^n}(y^n|c_m) = P_{Y^n|X^n}(y^n|c_{m'})$, will be arbitrarily assigned to either $m$ or $m'$.
Step 2: Property of the indicator function for $s = 1$. Let $\phi_m$ be the indicator function of $\mathcal U_m$. Then, for all $s > 0$,
$$1 - \phi_m(y^n) \le \left[\sum_{m'\ne m}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/(1+s)}\right]^s.$$
(Note that this step is the same as for the random coding exponent, except that only $s = 1$ will be used.) By taking $s = 1$, we have
$$1 - \phi_m(y^n) \le \sum_{m'\ne m}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/2}.$$
Step 3: Probability of error given that codeword $c_m$ is transmitted. Let $P_{e|m}(\mathcal C_{2M_n})$ denote the probability of error given that codeword $c_m$ is transmitted. Then
$$\begin{aligned} P_{e|m}(\mathcal C_{2M_n}) &\triangleq \sum_{y^n\notin\,\mathcal U_m} P_{Y^n|X^n}(y^n|c_m)\\ &= \sum_{y^n\in\mathcal Y^n} P_{Y^n|X^n}(y^n|c_m)\left[1-\phi_m(y^n)\right]\\ &\le \sum_{y^n\in\mathcal Y^n} P_{Y^n|X^n}(y^n|c_m)\sum_{m'\ne m,\;1\le m'\le 2M_n}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/2}\\ &= \sum_{m'\ne m,\;1\le m'\le 2M_n}\;\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}.\end{aligned}$$
Step 4: A standard inequality for $s' = 1/\rho \ge 1$. It is known that for any $0 < \rho = 1/s' \le 1$,
$$\left(\sum_i a_i\right)^{\rho} \le \sum_i a_i^{\rho}$$
for any non-negative sequence³ $\{a_i\}$. Hence,
$$P^{\rho}_{e|m}(\mathcal C_{2M_n}) \le \sum_{m'\ne m,\;1\le m'\le 2M_n}\left[\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right]^{\rho}.$$

³ Proof: Let $p_i \triangleq a_i/\sum_k a_k$ and $f(\rho) \triangleq \left(\sum_i a_i\right)^{\rho}\!/\sum_j a_j^{\rho} = 1/\sum_i p_i^{\rho}$. Since $\partial\left(\sum_i p_i^{\rho}\right)/\partial\rho = \sum_i \log(p_i)\,p_i^{\rho} \le 0$, the denominator $\sum_i p_i^{\rho}$ is non-increasing in $\rho$; together with $f(1) = 1$, this gives $f(\rho) \le 1$ for $0 < \rho \le 1$. $\Box$
Step 5: Expectation of $P^{\rho}_{e|m}(\mathcal C_{2M_n})$.
$$\begin{aligned} E\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right] &\le E\left[\sum_{m'\ne m,\;1\le m'\le 2M_n}\left(\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right)^{\rho}\right]\\ &= \sum_{m'\ne m,\;1\le m'\le 2M_n} E\left[\left(\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right)^{\rho}\right]\\ &\le 2M_n\,E\left[\left(\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right)^{\rho}\right].\end{aligned}$$
Step 6: Application of Lemma 9.8. From Lemma 9.8, there exists one codebook of size $M_n$ such that
$$\begin{aligned} P_{e|m}(\mathcal C_{M_n}) &\le 2^{1/\rho}\,E^{1/\rho}\left[P^{\rho}_{e|m}(\mathcal C_{2M_n})\right]\\ &\le 2^{1/\rho}\left(2M_n\,E\left[\left(\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right)^{\rho}\right]\right)^{1/\rho}\\ &= (4M_n)^{1/\rho}\,E^{1/\rho}\left[\left(\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right)^{\rho}\right]\\ &= (4M_n)^{s'}\,E^{s'}\left[\left(\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|c_m)\,P_{Y^n|X^n}(y^n|c_{m'})}\right)^{1/s'}\right]\\ &= (4M_n)^{s'}\left\{\sum_{x^n\in\mathcal X^n}\sum_{(x')^n\in\mathcal X^n} P_{\widetilde X^n}(x^n)\,P_{\widetilde X^n}\bigl((x')^n\bigr)\left[\sum_{y^n\in\mathcal Y^n}\sqrt{P_{Y^n|X^n}(y^n|x^n)\,P_{Y^n|X^n}\bigl(y^n|(x')^n\bigr)}\right]^{1/s'}\right\}^{s'}.\end{aligned}$$
By using the fact that $P_{\widetilde X^n}$ and $P_{Y^n|X^n}$ are product distributions with identical marginals, taking logarithms on both sides of the above inequality, and then taking the liminf (noting that $\lim_{n\to\infty}(1/n)\log(4M_n) = R$), the proof is completed by the fact that the resulting lower bound holds for any $s' \ge 1$ and any $P_{\widetilde X}$. $\Box$
As it turns out, the expurgated exponent improves upon the random coding exponent only at low rates; it is even worse than the random coding exponent at higher rates. One possible reason for its being worse at higher rates may be the replacement of the indicator-function bound
$$1 - \phi_m(y^n) \le \left[\sum_{m'\ne m}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/(1+s)}\right]^s$$
by
$$1 - \phi_m(y^n) \le \sum_{m'\ne m}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/2}.$$
Note that
$$\min_{s>0}\left[\sum_{m'\ne m}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/(1+s)}\right]^s \le \sum_{m'\ne m}\left(\frac{P_{Y^n|X^n}(y^n|c_{m'})}{P_{Y^n|X^n}(y^n|c_m)}\right)^{1/2}.$$
The improvement of the expurgated bound over the random coding bound at low rates may also reveal that the contribution of bad codes under the unbiased random coding argument is larger at low rates; hence, suppressing the weight of the bad codes yields a better result.
9.2.1 The properties of the expurgated exponent

The analysis of $E_{ex}(R)$ is very similar to that of $E_r(R)$. We first rewrite its formula.

Definition 9.11 (expurgated exponent) The expurgated exponent for a DMC with generic distribution $P_{Y|X}$ is defined by
$$E_{ex}(R) \triangleq \max_{s\ge1}\left[-sR + E_x(s)\right],$$
where
$$E_x(s) \triangleq \max_{P_X}\left\{-s\log\sum_{x\in\mathcal X}\sum_{x'\in\mathcal X} P_X(x)\,P_X(x')\left[\sum_{y\in\mathcal Y}\sqrt{P_{Y|X}(y|x)\,P_{Y|X}(y|x')}\right]^{1/s}\right\}.$$
The properties of $E_{ex}(R)$ can be deduced from an analysis of the function $E_x(s)$.

Lemma 9.12 (properties of $E_{ex}(R)$)

1. $E_{ex}(R)$ is non-increasing in $R$.

2. $E_{ex}(R)$ is convex in $R$. (Note that the first two properties imply that $E_{ex}(R)$ is strictly decreasing wherever it is positive.)

3. There exists a critical rate such that for $R$ above it, the slope of $E_{ex}(R)$ is $-1$.

Proof:

1. The slope of $E_{ex}(R)$ equals the maximizing $s^*$ for $E_{ex}(R)$ times $-1$. Therefore, the slope of $E_{ex}(R)$ is non-positive, which implies the desired property.

2. Similar to the proof of the same property for the random coding exponent.

3. Similar to the proof of the same property for the random coding exponent. $\Box$
Example 9.13 For the BSC with crossover probability $\epsilon$, the expurgated exponent becomes
$$E_{ex}(R) = \max_{0\le p\le1}\,\max_{s\ge1}\left\{-sR - s\log\left[p^2 + 2p(1-p)\left(2\sqrt{\epsilon(1-\epsilon)}\right)^{1/s} + (1-p)^2\right]\right\},$$
where $(p, 1-p)$ is the input distribution. Note that the input distribution achieving $E_{ex}(R)$ is uniform, i.e., $p = 1/2$. The expurgated exponent, together with the random coding exponent, for $\epsilon = 0.2$ is depicted in Figures 9.3 and 9.4.
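The two exponents of Examples 9.7 and 9.13 are easily compared numerically, reproducing the behavior shown in Figures 9.3 and 9.4 (grid searches over $s$; natural logarithms throughout):

```python
import math

eps = 0.2
gamma = 2 * math.sqrt(eps * (1 - eps))   # Bhattacharyya sum of the BSC (= 0.8)

def E0(s):
    # Gallager function for uniform input (Example 9.7).
    return s * math.log(2) - (1 + s) * math.log(
        eps ** (1 / (1 + s)) + (1 - eps) ** (1 / (1 + s)))

def Ex(s):
    # Expurgated function for uniform input (Example 9.13 with p = 1/2,
    # where p^2 + (1-p)^2 = 2p(1-p) = 1/2).
    return -s * math.log(0.5 + 0.5 * gamma ** (1 / s))

def Er(R):
    return max(-s * R + E0(s) for s in (i / 500 for i in range(501)))

def Eex(R):
    return max(-s * R + Ex(s) for s in (1 + i / 10 for i in range(2000)))

# At low rate the expurgated exponent is strictly better (cf. Figure 9.4) ...
assert Eex(0.001) > Er(0.001)
# ... while at higher rates it is the weaker of the two bounds (cf. Figure 9.3).
assert Eex(0.1) < Er(0.1)
```

The crossover between the two curves is the numerical counterpart of the discussion following Theorem 9.10.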
9.3 Partitioning bound: an upper bound for channel reliability

In the previous sections, two lower bounds on the channel reliability function, the random coding bound and the expurgated bound, were discussed. In this section, we begin to introduce upper bounds on the channel reliability function for DMCs, which include the partitioning bound, the sphere-packing bound and the straight-line bound.

The partitioning bound relies heavily on the theory of binary hypothesis testing. This is because the receiver end can be modeled as
$$H_0 : c_m = \text{codeword transmitted} \qquad\text{versus}\qquad H_1 : c_m \ne \text{codeword transmitted},$$
where $m$ is the final decision made by the receiver upon receipt of the channel outputs. The channel decoding error given that codeword $m$ is transmitted then becomes the type-II error, which can be computed using the theory of binary hypothesis testing.
Definition 9.14 (partitioning bound) For a DMC with marginal $P_{Y|X}$,
$$E_p(R) \triangleq \max_{P_X}\;\min_{P_{\widetilde Y|X}\,:\; I(P_X,\, P_{\widetilde Y|X}) \le R} D\!\left(P_{\widetilde Y|X}\,\middle\|\,P_{Y|X}\,\middle|\,P_X\right).$$
One way to explain the above definition of the partitioning bound is as follows. For a given i.i.d. source with marginal $P_X$, transmission over a dummy channel with rate $R$ and dummy capacity $I(P_X, P_{\widetilde Y|X}) \le R$ is expected to have poor performance (note that we hope the code rate to be smaller than capacity). We can therefore expect that those "bad noise patterns" whose composition transition probability $P_{\widetilde Y|X}$ satisfies $I(P_X, P_{\widetilde Y|X}) \le R$ will contaminate the codewords more seriously than other noise patterns. Hence, for any codebook with rate $R$, the probability of decoding error is lower bounded by the probability of these bad noise patterns, since the bad noise patterns are anticipated to unrecoverably distort the codewords. Accordingly, the channel reliability function is upper bounded by the exponent of the probability of these bad noise patterns.

Sanov's theorem already reveals that for any set consisting of all elements with the same compositions, the exponent equals the minimum divergence of the composition distribution against the true distribution. This is indeed the case for the set of bad noise patterns. Therefore, the exponent of the probability of these bad noise patterns is given by
$$\min_{P_{\widetilde Y|X}\,:\; I(P_X,\, P_{\widetilde Y|X}) \le R} D\!\left(P_{\widetilde Y|X}\,\middle\|\,P_{Y|X}\,\middle|\,P_X\right).$$
We can then choose the best input source to yield the partitioning exponent.

The partitioning exponent can easily be reformulated in terms of divergences. This is because the mutual information can actually be written in the form of a divergence:
divergence.
I(P
X
, P
e
Y [X
)
x
y
P
X
(x)P
e
Y [X
(y[x) log
P
X
e
Y
(x, y)
P
X
(x)P
e
Y
(y)
=
x
P
X
(x)
y
P
e
Y [X
(y[x) log
P
e
Y [X
(y[x)
P
e
Y
(y)
=
x
P
X
(x)D(P
e
Y [X
P
e
Y
[X = x)
= D(P
e
Y [X
P
e
Y
[P
X
)
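The identity just derived, $I(P_X, P_{\widetilde Y|X}) = D(P_{\widetilde Y|X}\|P_{\widetilde Y}|P_X)$, can be confirmed numerically by computing the mutual information via entropies instead (the input distribution and channel below are arbitrary illustrative choices):

```python
import math

# Arbitrary input distribution and channel transition matrix.
P_X = [0.4, 0.6]
P_Y_X = [[0.7, 0.2, 0.1],   # P(y | x = 0)
         [0.1, 0.3, 0.6]]   # P(y | x = 1)

P_Y = [sum(P_X[x] * P_Y_X[x][y] for x in range(2)) for y in range(3)]

# Mutual information computed as I(X;Y) = H(Y) - H(Y|X).
H_Y = -sum(p * math.log(p) for p in P_Y)
H_Y_given_X = -sum(P_X[x] * sum(p * math.log(p) for p in P_Y_X[x])
                   for x in range(2))
I = H_Y - H_Y_given_X

# Conditional divergence D(P_{Y|X} || P_Y | P_X) = sum_x P_X(x) D(P_{Y|X=x} || P_Y).
D_cond = sum(P_X[x] * sum(P_Y_X[x][y] * math.log(P_Y_X[x][y] / P_Y[y])
                          for y in range(3))
             for x in range(2))

assert abs(I - D_cond) < 1e-12
print(round(I, 4))
```

The two computations agree term by term after regrouping the logarithms, which is exactly the derivation displayed above.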
Hence, the partitioning exponent can be written as
$$E_p(R) \triangleq \max_{P_X}\;\min_{P_{\widetilde Y|X}\,:\; D(P_{\widetilde Y|X}\|P_{\widetilde Y}|P_X) \le R} D\!\left(P_{\widetilde Y|X}\,\middle\|\,P_{Y|X}\,\middle|\,P_X\right).$$
In order to apply the theory of hypothesis testing, the partitioning bound needs to be further reformulated.

Observation 9.15 (partitioning bound) If $P_{\widetilde X}$ and $P_{\widetilde Y|X}$ are the distributions that achieve $E_p(R)$, and $P_{\widetilde Y}(y) = \sum_{x\in\mathcal X} P_{\widetilde X}(x)\,P_{\widetilde Y|X}(y|x)$, then
$$E_p(R) = \max_{P_X}\;\min_{P_{Y'|X}\,:\; D(P_{Y'|X}\|P_{\widetilde Y}|P_X) \le R} D\!\left(P_{Y'|X}\,\middle\|\,P_{Y|X}\,\middle|\,P_X\right).$$
In addition, the minimizing distribution $P_{Y'|X}$ that achieves $E_p(R)$ is a tilted distribution between $P_{Y|X}$ and $P_{\widetilde Y}$, i.e.,
$$E_p(R) = \max_{P_X} D\!\left(P_{Y_\lambda|X}\,\middle\|\,P_{Y|X}\,\middle|\,P_X\right),$$
where $P_{Y_\lambda|X}$ is a tilted distribution⁴ between $P_{Y|X}$ and $P_{\widetilde Y}$, and $\lambda$ is the solution of
$$D\!\left(P_{Y_\lambda|X}\,\middle\|\,P_{\widetilde Y}\,\middle|\,P_X\right) = R.$$
Theorem 9.16 (partitioning bound) For a DMC with marginal $P_{Y|X}$ and any $\delta > 0$ arbitrarily small,
$$E(R+\delta) \le E_p(R).$$

⁴ $P_{Y_\lambda|X}(y|x) \triangleq \dfrac{P_{\widetilde Y}^{\lambda}(y)\,P_{Y|X}^{1-\lambda}(y|x)}{\sum_{y'\in\mathcal Y} P_{\widetilde Y}^{\lambda}(y')\,P_{Y|X}^{1-\lambda}(y'|x)}$.
Proof:

Step 1: Hypothesis testing. Let the code size be $M_n = \lceil 2e^{nR}\rceil \le e^{n(R+\delta)}$ for $n > \ln 2/\delta$, and let $\{c_1, \ldots, c_{M_n}\}$ be the best codebook. (Note that using a code of smaller size should yield a smaller error probability, and hence a larger exponent. So, if this larger exponent is upper bounded by $E_p(R)$, the error exponent for code size $\exp\{n(R+\delta)\}$ is also bounded above by $E_p(R)$.)

Suppose that $P_{\widetilde X}$ and $P_{\widetilde Y|X}$ are the distributions achieving $E_p(R)$. Define $P_{\widetilde Y}(y) = \sum_{x\in\mathcal X} P_{\widetilde X}(x)\,P_{\widetilde Y|X}(y|x)$. Then for any (mutually disjoint) decoding partition $\{\mathcal U_1, \mathcal U_2, \ldots, \mathcal U_{M_n}\}$ at the output site,
$$\sum_{1\le m\le M_n} P_{\widetilde Y^n}(\mathcal U_m) = 1,$$
which implies the existence of $m$ satisfying
$$P_{\widetilde Y^n}(\mathcal U_m) \le \frac{2}{M_n}.$$
(Note that there are at least $M_n/2$ codewords satisfying this condition.)

Construct the binary hypothesis testing problem
$$H_0 : P_{\widetilde Y^n}(\cdot) \quad\text{against}\quad H_1 : P_{Y^n|X^n}(\cdot|c_m),$$
and choose the acceptance region for the null hypothesis as $\mathcal U_m^c$ (i.e., the acceptance region for the alternative hypothesis is $\mathcal U_m$). The type-I error is then
$$\alpha_n \triangleq \Pr(\mathcal U_m|H_0) = P_{\widetilde Y^n}(\mathcal U_m) \le \frac{2}{M_n},$$
and the type-II error is
$$\beta_n \triangleq \Pr(\mathcal U_m^c|H_1) = P_{Y^n|X^n}(\mathcal U_m^c|c_m),$$
which is exactly $P_{e|m}$, the probability of decoding error given that codeword $c_m$ is transmitted. We therefore have
$$P_{e|m} \ge \min_{\alpha_n \le 2/M_n}\beta_n \ge \min_{\alpha_n \le e^{-nR}}\beta_n \qquad\left(\text{since } \frac{2}{M_n} \le e^{-nR}\right).$$
Then from Theorem 8.34, the best exponent$^5$ for $P_{e|m}$ is
$$\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^n D\big(P_{Y^\lambda,i}\big\|P_{Y|X}(\cdot|a_i)\big), \qquad (9.3.1)$$
where $a_i$ is the $i$th component of $c_m$, $P_{Y^\lambda,i}$ is the tilted distribution between $P_{Y|X}(\cdot|a_i)$ and $\tilde{P}_Y$, and $\lambda$ satisfies
$$\liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^n D\big(P_{Y^\lambda,i}\big\|\tilde{P}_Y\big) = R.$$
If we denote the composition distribution with respect to $c_m$ by $P^{(c_m)}$, and observe that the number of different tilted distributions (i.e., $P_{Y^\lambda,i}$) is at most $|\mathcal{X}|$, the quantity in (9.3.1) can be further upper bounded by
$$\begin{aligned}
\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^n D\big(P_{Y^\lambda,i}\big\|P_{Y|X}(\cdot|a_i)\big)
&\le \limsup_{n\to\infty}\sum_{x}P^{(c_m)}(x)\,D\big(P_{Y^\lambda,x}\big\|P_{Y|X}(\cdot|x)\big)\\
&\le \max_{P_X} D\big(P_{Y^\lambda|X}\big\|P_{Y|X}\big|P_X\big)\\
&= D\big(P_{Y^\lambda|X}\big\|P_{Y|X}\big|\tilde{P}_X\big)\\
&= E_p(R). \qquad (9.3.2)
\end{aligned}$$
Besides, $D\big(P_{Y^\lambda|X}\big\|\tilde{P}_Y\big|\tilde{P}_X\big) = R$.

$^5$ According to Theorem 8.34, we need to consider the quantile function of the CDF of
$$\frac{1}{n}\log\frac{dP_{Y^n}}{dP_{Y^n|X^n=c_m}}(Y^n).$$
Since both $P_{Y^n|X^n}(\cdot|c_m)$ and $\tilde{P}_{Y^n}$ are product distributions, $P_{Y^n}$ is a product distribution, too. Therefore, by a reasoning similar to that of Example 5.13, the CDF becomes a staircase of unit step functions (see Figure 2.4), and its largest quantile becomes (9.3.1).

Step 2: A good code with rate R. Among the $M_n = \lceil 2e^{nR}\rceil$ codewords, there are at least $M_n/2$ codewords satisfying
$$\tilde{P}_{Y^n}(\mathcal{U}_m) \le \frac{2}{M_n},$$
and hence their $P_{e|m}$ all satisfy (9.3.2). Consequently, if
$$B \triangleq \{m : m \text{ satisfies the condition of Step 1}\},$$
then
$$P_e = \frac{1}{M_n}\sum_{1\le m\le M_n}P_{e|m} \ge \frac{1}{M_n}\sum_{m\in B}P_{e|m} \ge \frac{1}{M_n}\sum_{m\in B}\min_{m\in B}P_{e|m} = \frac{1}{M_n}|B|\min_{m\in B}P_{e|m} \ge \frac{1}{2}\min_{m\in B}P_{e|m} = \frac{1}{2}P_{e|m'},$$
where $m'$ is the minimizer. This implies
$$E(R+\varepsilon) \le \limsup_{n\to\infty}-\frac{1}{n}\log P_e \le \limsup_{n\to\infty}-\frac{1}{n}\log P_{e|m'} \le E_p(R). \qquad\Box$$
The partitioning upper bound can be reformulated into a shape similar to the random coding lower bound.

Lemma 9.17
$$E_p(R) = \max_{P_X}\max_{s\ge 0}\left\{-sR - \log\sum_{y}\left[\sum_{x}P_X(x)\,P_{Y|X}(y|x)^{1/(1+s)}\right]^{1+s}\right\} = \max_{s\ge 0}\left[-sR + E_0(s)\right].$$

Recall that the random coding exponent is $\max_{0<s\le 1}[-sR + E_0(s)]$, and the optimizer $s^*$ is the slope of $E_r(R)$ times $-1$. Hence, the channel reliability $E(R)$ satisfies
$$\max_{0<s\le 1}\left[-sR + E_0(s)\right] \le E(R) \le \max_{s\ge 0}\left[-sR + E_0(s)\right],$$
and for optimizer $s^* \in (0,1]$ (i.e., $R_{cr} \le R \le C$), the upper bound meets the lower bound.
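The meeting of the two bounds can be checked numerically. The following is a minimal sketch (the helper names `E0` and `exponent` are ours, not the text's), assuming a BSC with uniform input, which is optimal by the channel's symmetry; it evaluates $\max_s[-sR+E_0(s)]$ on a grid for the channel of Figure 9.5.

```python
import math

def E0(s, eps):
    # Gallager's E0 for a BSC with uniform input (optimal by symmetry)
    inner = 0.5 * eps ** (1.0 / (1.0 + s)) + 0.5 * (1.0 - eps) ** (1.0 / (1.0 + s))
    return -math.log(2.0 * inner ** (1.0 + s))

def exponent(R, eps, s_max, step=5e-4):
    # grid search for max over 0 <= s <= s_max of [-s R + E0(s)]
    best, s = 0.0, 0.0
    while s <= s_max:
        best = max(best, -s * R + E0(s, eps))
        s += step
    return best

eps = 0.2
# capacity in nats: C = log 2 - h_b(eps)
C = math.log(2) + eps * math.log(eps) + (1 - eps) * math.log(1 - eps)
R = 0.5 * C
Er  = exponent(R, eps, s_max=1.0)    # random coding (lower) exponent
Epr = exponent(R, eps, s_max=60.0)   # partitioning (upper) exponent
print(Er <= Epr)  # the lower bound never exceeds the upper bound
```

Since the search grid for $s\le 1$ is a subset of the grid for $s\le 60$, the printed comparison holds deterministically; both maxima vanish as $R$ approaches $C$.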
200
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.192745
Figure 9.5: Partitioning exponent (thick line), random coding exponent
(thin line) and expurgated exponent (thin line) for BSC with crossover
probability 0.2.
9.4 Sphere-packing exponent: an upper bound on the channel reliability

9.4.1 The problem of sphere-packing

For a given space $\mathcal{A}$ and a given distance measure $d(\cdot,\cdot)$ for elements in $\mathcal{A}$, a ball or sphere centered at $a$ with radius $r$ is defined as
$$\{b \in \mathcal{A} : d(b,a) \le r\}.$$
The problem of sphere-packing is to find the minimum radius if $M$ spheres need to be packed into the space $\mathcal{A}$. Its dual problem is to find the maximum number of balls with fixed radius $r$ that can be packed into the space $\mathcal{A}$.
9.4.2 Relation of sphere-packing and coding

Finding the best codebook, i.e., the one yielding the minimum decoding error, is one of the main research issues in communications. Roughly speaking, if two codewords are similar, they are more vulnerable to noise. Hence, a good codebook should be a set of codewords that look very different from one another. Mathematically, such codeword resemblance can be modeled by a distance function. We can then say that if the distance between two codewords is large, they are more different, and hence more robust to noise or interference. Accordingly, a good codebook becomes a set of codewords whose minimum pairwise distance is largest. This is exactly the sphere-packing problem.
Example 9.18 (Hamming distance versus BSC) For the BSC, the source alphabet and output alphabet are both $\{0,1\}^n$. The Hamming distance between two elements of $\{0,1\}$ is given by
$$d_H(x,\hat{x}) = \begin{cases} 0, & \text{if } x = \hat{x};\\ 1, & \text{if } x \ne \hat{x}.\end{cases}$$
Its extension to $n$-tuples is
$$d_H(x^n,\hat{x}^n) = \sum_{i=1}^n d_H(x_i,\hat{x}_i).$$
It is known that the best decoding rule is the maximum likelihood decoder, i.e.,
$$\varphi(y^n) = m, \quad\text{if } P_{Y^n|X^n}(y^n|c_m) \ge P_{Y^n|X^n}(y^n|c_{m'}) \text{ for all } m'.$$
Since for the BSC with crossover probability $\varepsilon < 1/2$,
$$P_{Y^n|X^n}(y^n|c_m) = \varepsilon^{d_H(y^n,c_m)}(1-\varepsilon)^{n-d_H(y^n,c_m)} = (1-\varepsilon)^n\left(\frac{\varepsilon}{1-\varepsilon}\right)^{d_H(y^n,c_m)}, \qquad (9.4.1)$$
the best decoding rule can be rewritten as
$$\varphi(y^n) = m, \quad\text{if } d_H(y^n,c_m) \le d_H(y^n,c_{m'}) \text{ for all } m'.$$
As a result, if two codewords are too close in Hamming distance, fewer output bits can be used to identify which of them was transmitted, which results in poor performance.
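The equivalence between maximum likelihood and minimum Hamming distance decoding stated above is easy to verify exhaustively. A small sketch (the three-word codebook is a hypothetical example of ours; any $\varepsilon < 1/2$ works):

```python
import itertools

eps, n = 0.2, 4                                   # crossover < 1/2, blocklength
codebook = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 1, 1, 0)]

def d_H(a, b):                                    # Hamming distance
    return sum(x != y for x, y in zip(a, b))

def lik(y, c):                                    # P_{Y^n|X^n}(y|c), via (9.4.1)
    return eps ** d_H(y, c) * (1 - eps) ** (n - d_H(y, c))

# For every channel output, the ML choice and the minimum-distance choice
# attain the same likelihood (they can differ only in how a tie is broken).
agree = all(
    lik(y, max(codebook, key=lambda c: lik(y, c)))
    == lik(y, min(codebook, key=lambda c: d_H(y, c)))
    for y in itertools.product((0, 1), repeat=n)
)
print(agree)  # True
```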
From the above example, two observations can be made. First, if the distance measure between codewords can be written as a function of the channel transition probability, as in (9.4.1), one may relate the probability of decoding error to the distances between codewords. Secondly, the coding problem can in some cases be reduced to a sphere-packing problem. As a consequence, the solution of the sphere-packing problem can be used to characterize the error probability of channels. This is confirmed by the next theorem, which shows that an upper bound on the channel reliability can be obtained in terms of the largest minimum radius among $M$ disjoint spheres in a code space.
Theorem 9.19 Let $\mu_n(\cdot,\cdot)$ be the Bhattacharyya distance$^6$ between two elements of $\mathcal{X}^n$. Denote by $d_{n,M}$ the largest minimum distance among $M$ selected codewords of length $n$. (Obviously, the largest minimum radius among $M$ disjoint spheres in a code space is half of $d_{n,M}$.) Then
$$\limsup_{n\to\infty}-\frac{1}{n}\log P_e(n,R) \le \limsup_{n\to\infty}\frac{1}{n}\,d_{n,M=e^{nR}}.$$

Proof:

Step 1: Hypothesis testing. For any code $\{c_0, c_1, \ldots, c_{M-1}\}$ and any pair of distinct indices $(\hat{m},m)$, consider the binary hypothesis test of $P_{Y^n|X^n}(\cdot|c_{\hat{m}})$ against $P_{Y^n|X^n}(\cdot|c_m)$, and let $\mathcal{A}_{\hat{m},m}$ denote the acceptance region for the former. Then
$$P_{e|\hat{m}} \ge P_{Y^n|X^n}\big(\mathcal{A}_{\hat{m},m}^c\big|c_{\hat{m}}\big) \ge \exp\left\{-\left[D\big(P_{Y^n,\lambda}\big\|P_{Y^n|X^n}(\cdot|c_{\hat{m}})\big) + o(n)^7\right]\right\}$$
$^6$ The Bhattacharyya distance (for channel $P_{Y^n|X^n}$) between elements $x^n$ and $\hat{x}^n$ is defined by
$$\mu_n(x^n,\hat{x}^n) \triangleq -\log\sum_{y^n\in\mathcal{Y}^n}\sqrt{P_{Y^n|X^n}(y^n|x^n)\,P_{Y^n|X^n}(y^n|\hat{x}^n)}.$$

$^7$ This is a special notation, introduced in 1909 by E. Landau. Basically, there are two of them: one is the little-o notation and the other is the big-O notation. They are respectively defined as follows, based on the assumption that $g(x) > 0$ for all $x$ in some interval containing $a$.

Definition 9.20 (little-o notation) The notation $f(x) = o(g(x))$ as $x \to a$ means that
$$\lim_{x\to a}\frac{f(x)}{g(x)} = 0.$$

Definition 9.21 (big-O notation) The notation $f(x) = O(g(x))$ as $x \to a$ means that
$$\limsup_{x\to a}\frac{|f(x)|}{g(x)} < \infty.$$
and
$$P_{e|m} \ge P_{Y^n|X^n}\big(\mathcal{A}_{\hat{m},m}\big|c_m\big) \ge \exp\left\{-\left[D\big(P_{Y^n,\lambda}\big\|P_{Y^n|X^n}(\cdot|c_m)\big) + o(n)\right]\right\},$$
where $P_{Y^n,\lambda}$ is the tilted distribution between $P_{Y^n|X^n}(\cdot|c_{\hat{m}})$ and $P_{Y^n|X^n}(\cdot|c_m)$ satisfying
$$D\big(P_{Y^n,\lambda}\big\|P_{Y^n|X^n}(\cdot|c_{\hat{m}})\big) = D\big(P_{Y^n,\lambda}\big\|P_{Y^n|X^n}(\cdot|c_m)\big) = \mu_n(c_{\hat{m}},c_m),$$
and $D(\cdot\|\cdot)$ is the divergence (or Kullback–Leibler divergence). We thus have
$$P_{e|\hat{m}} + P_{e|m} \ge 2e^{-\mu_n(c_{\hat{m}},c_m)+o(n)}.$$
Note that the above inequality holds for any $\hat{m}$ and $m$ with $\hat{m}\ne m$.
Step 2: Largest minimum distance. By the definition of $d_{n,M}$, there exists an $(\hat{m},m)$ pair for the above code such that
$$\mu_n(c_{\hat{m}},c_m) \le d_{n,M},$$
which implies
$$\big(\exists\,(\hat{m},m)\big)\qquad P_{e|\hat{m}} + P_{e|m} \ge 2e^{-d_{n,M}+o(n)}.$$
Step 3: Probability of error. Suppose we have found the optimal code of size $2M = e^{n(R+\log 2/n)}$, which minimizes the error probability. Index the codewords in ascending order of $P_{e|m}$, namely
$$P_{e|0} \le P_{e|1} \le \cdots \le P_{e|2M-1}.$$
Form two new codebooks as
$$\mathcal{C}_1 = \{c_0, c_1, \ldots, c_{M-1}\} \quad\text{and}\quad \mathcal{C}_2 = \{c_M, \ldots, c_{2M-1}\}.$$
Then, from Steps 1 and 2, there exists at least one pair of codewords $(c_{\hat{m}}, c_m)$ in $\mathcal{C}_1$ such that
$$P_{e|\hat{m}} + P_{e|m} \ge 2e^{-d_{n,M}+o(n)}.$$
Since for all $c_i$ in $\mathcal{C}_2$,
$$P_{e|i} \ge \max\{P_{e|m}, P_{e|\hat{m}}\},$$
we have
$$P_{e|i} + P_{e|j} \ge P_{e|\hat{m}} + P_{e|m} \ge 2e^{-d_{n,M}+o(n)}$$
for any $c_i$ and $c_j$ in $\mathcal{C}_2$. Accordingly,
$$\begin{aligned}
P_e(n, R+\log(2)/n) &= \frac{1}{8M^2}\sum_{i=0}^{2M-1}\sum_{j=0}^{2M-1}\left(P_{e|i}+P_{e|j}\right)\\
&\ge \frac{1}{8M^2}\sum_{i=M}^{2M-1}\sum_{\substack{j=M\\ j\ne i}}^{2M-1}\left(P_{e|i}+P_{e|j}\right)\\
&\ge \frac{1}{8M^2}\sum_{i=M}^{2M-1}\sum_{\substack{j=M\\ j\ne i}}^{2M-1}\left(P_{e|\hat{m}}+P_{e|m}\right)\\
&\ge \frac{1}{8M^2}\sum_{i=M}^{2M-1}\sum_{\substack{j=M\\ j\ne i}}^{2M-1} 2e^{-d_{n,M}+o(n)}\\
&= \frac{M-1}{4M}\,e^{-d_{n,M}+o(n)},
\end{aligned}$$
which immediately implies the desired result. $\Box$
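For a memoryless channel, the Bhattacharyya distance of footnote 6 is additive across coordinates, so $\mu_n$ reduces to a per-letter sum. A small check of this for a BSC (the helper names are ours):

```python
import itertools, math

eps = 0.2
P = [[1 - eps, eps], [eps, 1 - eps]]          # BSC transition matrix P[x][y]

def mu_n(xn, zn):
    # -log sum_{y^n} sqrt(P(y^n|x^n) P(y^n|z^n)), by brute-force enumeration
    total = 0.0
    for yn in itertools.product((0, 1), repeat=len(xn)):
        p = math.prod(P[x][y] for x, y in zip(xn, yn))
        q = math.prod(P[z][y] for z, y in zip(zn, yn))
        total += math.sqrt(p * q)
    return -math.log(total)

mu1 = -math.log(2 * math.sqrt(eps * (1 - eps)))   # single-letter distance
# additivity: mu_n equals (Hamming distance) * mu1 for the BSC
print(abs(mu_n((0, 0, 1), (1, 0, 1)) - mu1) < 1e-9)  # True
```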
Since, according to the above theorem, the largest minimum distance can be used to formulate an upper bound on the channel reliability, this quantity becomes essential. We therefore introduce its general formula in the next subsection.
9.4.3 The largest minimum distance of block codes

The ultimate capabilities and limitations of error-correcting codes are quite important, especially for code designers who want to estimate the relative efficacy of a designed code. In fairly general situations, this information is closely related to the largest minimum distance of the codes [16]. For example, for a binary block code employing the Hamming distance, the error-correcting capability of the code is half of the minimum distance among codewords. Hence, knowledge of the largest minimum distance can be considered a reference for the optimal error-correcting capability of codes.

The problem of the largest minimum distance can be described as follows. Over a given code alphabet, and a given measurable function on the distance between two code symbols, determine the asymptotic ratio of the largest minimum distance attainable among $M$ selected codewords to the code blocklength $n$, as $n$ tends to infinity, subject to a fixed rate $R \triangleq \log(M)/n$.
Research on this problem has been carried out for years. In the past, only bounds on this ratio were established. The most well-known bound on this problem is the Varshamov–Gilbert lower bound, which is usually derived through a combinatorial approximation under the assumption that the code alphabet is finite and the measure on the distance between code letters is symmetric [14]. If the size of the code alphabet, $q$, is an even power of a prime satisfying $q \ge 49$, and the distance measure is the Hamming distance, a better lower bound can be obtained through the construction of algebraic-geometric codes [11, 19], an idea first proposed by Goppa. Later, Zinoviev and Litsyn proved that a lower bound better than the Varshamov–Gilbert bound is actually possible for any $q \ge 46$ [21]. Other improvements of the bounds can be found in [9, 15, 20].

In addition to combinatorial techniques, some researchers have also applied probabilistic and analytical methodologies to this problem. For example, by means of the random coding argument with expurgation, the Varshamov–Gilbert bound in its most general form can be established simply by using the Chebyshev inequality ([3] or cf. Appendix A); restrictions on the code alphabet (such as finite, countable, ...) and the distance measure (such as additive, symmetric, bounded, ...) are then no longer necessary for the validity of its proof.
As shown in Chapter 5, channels without statistical assumptions such as memorylessness, information stability, stationarity, causality, ergodicity, etc., are successfully handled by employing the notions of liminf in probability and limsup in probability of the information spectrum. As a consequence, the channel capacity $C$ is shown to equal the supremum, over all input processes, of the input-output inf-information rate, defined as the liminf in probability of the normalized information density [12]. More specifically, given a channel $W = \{W^n = P_{Y^n|X^n}\}_{n=1}^\infty$,
$$C = \sup_X\sup\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\Pr\left[\frac{1}{n}\,i_{X^nW^n}(X^n;Y^n) < a\right] = 0\right\},$$
where $X^n$ and $Y^n$ are respectively the $n$-fold input process drawn from
$$X \triangleq \left\{X^n = \left(X_1^{(n)},\ldots,X_n^{(n)}\right)\right\}_{n=1}^\infty$$
and the corresponding output process induced by $X^n$ via the channel $W^n = P_{Y^n|X^n}$, and
$$\frac{1}{n}\,i_{X^nW^n}(x^n;y^n) \triangleq \frac{1}{n}\log\frac{P_{Y^n|X^n}(y^n|x^n)}{P_{Y^n}(y^n)}$$
is the normalized information density. If the conventional definition of channel capacity, which requires the existence of reliable block codes for all sufficiently large blocklengths, is replaced by the requirement that reliable codes exist for infinitely many blocklengths, a new optimistic definition of capacity $\bar{C}$ is obtained [12]. Its information-spectrum expression is then given by [8, 7]
$$\bar{C} = \sup_X\sup\left\{a\in\mathbb{R} : \liminf_{n\to\infty}\Pr\left[\frac{1}{n}\,i_{X^nW^n}(X^n;Y^n) < a\right] = 0\right\}.$$
Inspired by this probabilistic methodology, together with a random-coding scheme with expurgation, a spectrum formula for the largest minimum distance of deterministic block codes under generalized distance functions$^8$ (not necessarily additive, symmetric, or bounded) is established in this work. As revealed in the formula, the largest minimum distance is completely determined by the ultimate statistical characteristics of the normalized distance function evaluated under a properly chosen random-code generating distribution. Interestingly, the new formula has a form analogous to the general information-spectrum expressions of the channel capacity and the optimistic channel capacity. This somehow confirms the connection between the problem of designing a reliable code for a given channel and that of finding a code with sufficiently large distance among codewords, when the distance function is properly defined in terms of the channel statistics.
With the help of the new formula, we characterize a minor class of distance metrics for which the ultimate largest minimum distance among codewords can be derived. Although these distance functions may be of secondary interest, they shed some light on the determination of the largest minimum distance for a more general class of distance functions. Discussions on the general properties of the new formula will follow.
We next derive a general Varshamov–Gilbert lower bound directly from the new distance-spectrum formula. Some remarks regarding its properties are given. A sufficient condition under which the general Varshamov–Gilbert bound is tight, as well as examples demonstrating its strict inferiority to the distance-spectrum formula, is also provided.
Finally, we demonstrate that the new formula can be used to derive the known lower bounds for a few specific block coding schemes of general interest, such as constant-weight codes and codes that correct arbitrary noise. Transformation of the asymptotic distance determination problem into an alternative problem setting over a graph, for a possible improvement of these known bounds, is also addressed.

$^8$ Conventionally, a distance or metric [13][18, pp. 139] should satisfy the properties of i) nonnegativity; ii) being zero if, and only if, two points coincide; iii) symmetry; and iv) the triangle inequality. The derivation in this paper, however, is applicable to any measurable function defined over the code alphabets. Since none of the above four properties is assumed, the measurable function on the distance between two code letters is therefore termed a generalized distance function. One can, for example, apply our formula to a situation where the code alphabet is a distribution space and the distance measure is the Kullback–Leibler divergence. For simplicity, we will abbreviate the generalized distance function simply as the distance function in the remainder of the article.
A) Distance-spectrum formula on the largest minimum distance of block codes

We first introduce some notation. The $n$-tuple code alphabet is denoted by $\mathcal{X}^n$. For any two elements $\hat{x}^n$ and $x^n$ in $\mathcal{X}^n$, we use $\mu_n(\hat{x}^n,x^n)$ to denote the $n$-fold measure on the distance of these two elements. A codebook with blocklength $n$ and size $M$ is represented by
$$\mathcal{C}_{n,M} \triangleq \left\{c_0^{(n)}, c_1^{(n)}, c_2^{(n)}, \ldots, c_{M-1}^{(n)}\right\},$$
where $c_m^{(n)} \triangleq (c_{m1}, c_{m2}, \ldots, c_{mn})$, and each $c_{mk}$ belongs to $\mathcal{X}$. We define the minimum distance
$$d_m(\mathcal{C}_{n,M}) \triangleq \min_{\substack{0\le\hat{m}\le M-1\\ \hat{m}\ne m}}\mu_n\left(c_{\hat{m}}^{(n)}, c_m^{(n)}\right),$$
and the largest minimum distance
$$d_{n,M} \triangleq \max_{\mathcal{C}_{n,M}}\min_{0\le m\le M-1} d_m(\mathcal{C}_{n,M}).$$
Note that there is no assumption on the code alphabet $\mathcal{X}$ or on the sequence of functions $\{\mu_n(\cdot,\cdot)\}_{n\ge 1}$.

Based on the above definitions, the problem considered in this paper becomes to find the limit, as $n\to\infty$, of $d_{n,M}/n$ under a fixed rate $R = \log(M)/n$. Since the quantity is investigated as $n$ goes to infinity, it is justified to take $M = e^{nR}$ as an integer.
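For tiny parameters, $d_{n,M}$ can be computed by exhaustive search over all codebooks, which is useful as a sanity check on the asymptotic formulas developed below. A brute-force sketch for the binary alphabet under the Hamming distance (the helper names are ours):

```python
import itertools

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def largest_min_distance(n, M):
    # d_{n,M}: max over all size-M binary codebooks of the minimum pairwise distance
    words = list(itertools.product((0, 1), repeat=n))
    best = 0
    for book in itertools.combinations(words, M):
        dmin = min(hamming(a, b) for a, b in itertools.combinations(book, 2))
        best = max(best, dmin)
    return best

print(largest_min_distance(3, 2))  # 3: the antipodal pair (000, 111)
print(largest_min_distance(3, 4))  # 2: the even-weight code {000, 011, 101, 110}
```

The search is exponential in both $n$ and $M$, which is precisely why the spectrum formula below, rather than enumeration, is needed for asymptotics.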
The concept of our method is similar to that of the random coding technique employed for the channel reliability function [2]. Each codeword is assumed to be selected independently of all others from $\mathcal{X}^n$ through a generic distribution $P_{X^n}$. Then the distance between codewords $c_{\hat{m}}^{(n)}$ and $c_m^{(n)}$ becomes a random variable, and so does $d_m(\mathcal{C}_{n,M})$. For clarity, we will use $D_m$ to denote the random variable corresponding to $d_m(\mathcal{C}_{n,M})$. Also note that $\{D_m\}_{m=0}^{M-1}$ are identically distributed. We therefore have the following lemma.
Lemma 9.22 Fix a triangular-array random process
$$X = \left\{X^n = \left(X_1^{(n)},\ldots,X_n^{(n)}\right)\right\}_{n=1}^\infty.$$
Let $D_m = D_m(X^n)$, $0\le m\le M-1$, be defined for the random codebook of blocklength $n$ and size $M$, where each codeword is drawn independently according to the distribution $P_{X^n}$. Then:

1. for any $\delta > 0$, there exists a universal constant $\gamma = \gamma(\delta) \in (0,1)$ (independent of the blocklength $n$) and a codebook sequence $\{\mathcal{C}_{n,\gamma M}\}_{n\ge 1}$ such that
$$\frac{1}{n}\min_{0\le m\le \gamma M-1} d_m(\mathcal{C}_{n,\gamma M}) > \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\Pr\left[\frac{1}{n}D_m > a\right] = 0\right\} - \delta \qquad (9.4.2)$$
for infinitely many $n$;

2. for any $\delta > 0$, there exists a universal constant $\gamma = \gamma(\delta) \in (0,1)$ (independent of the blocklength $n$) and a codebook sequence $\{\mathcal{C}_{n,\gamma M}\}_{n\ge 1}$ such that
$$\frac{1}{n}\min_{0\le m\le \gamma M-1} d_m(\mathcal{C}_{n,\gamma M}) > \inf\left\{a\in\mathbb{R} : \liminf_{n\to\infty}\Pr\left[\frac{1}{n}D_m > a\right] = 0\right\} - \delta \qquad (9.4.3)$$
for sufficiently large $n$.
Proof: We will only prove (9.4.2); (9.4.3) can be proved by simply following the same procedure.

Define
$$\bar{L}_X(R) \triangleq \inf\left\{a : \limsup_{n\to\infty}\Pr\left[\frac{1}{n}D_m > a\right] = 0\right\}.$$
Let $\mathbf{1}(\mathcal{A})$ be the indicator function of a set $\mathcal{A}$, and let
$$\Phi_m \triangleq \mathbf{1}\left\{\frac{1}{n}D_m > \bar{L}_X(R) - \delta\right\}.$$
By definition of $\bar{L}_X(R)$,
$$\limsup_{n\to\infty}\Pr\left[\frac{1}{n}D_m > \bar{L}_X(R) - \delta\right] > 0.$$
Let
$$2\gamma \triangleq \limsup_{n\to\infty}\Pr\left[\frac{1}{n}D_m > \bar{L}_X(R) - \delta\right].$$
Then for infinitely many $n$,
$$E\left[\sum_{m=0}^{M-1}\Phi_m\right] = \sum_{m=0}^{M-1}E[\Phi_m] > \gamma M,$$
which implies that, among all possible selections, there exists (for infinitely many $n$) a codebook $\mathcal{C}_{n,M}$ in which $\gamma M$ codewords satisfy $\Phi_m = 1$, i.e.,
$$\frac{1}{n}d_m(\mathcal{C}_{n,M}) > \bar{L}_X(R) - \delta$$
for at least $\gamma M$ codewords in the codebook $\mathcal{C}_{n,M}$. The collection of these $\gamma M$ codewords is a desired codebook for the validity of (9.4.2). $\Box$
Our second lemma concerns the spectrum of $(1/n)D_m$.

Lemma 9.23 Let each codeword be independently selected through the distribution $P_{X^n}$. Suppose that $\hat{X}^n$ is independent of, and has the same distribution as, $X^n$. Then
$$\Pr\left[\frac{1}{n}D_m > a\right] \ge \left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\right]\right\}^M.$$
Proof: Let $C_m^{(n)}$ denote the $m$th randomly selected codeword. From the definition of $D_m$, we have
$$\begin{aligned}
\Pr\left[\frac{1}{n}D_m > a\,\middle|\,C_m^{(n)}\right]
&= \Pr\left[\min_{\substack{0\le\hat{m}\le M-1\\ \hat{m}\ne m}}\frac{1}{n}\mu_n\left(C_{\hat{m}}^{(n)},C_m^{(n)}\right) > a\,\middle|\,C_m^{(n)}\right]\\
&= \prod_{\substack{0\le\hat{m}\le M-1\\ \hat{m}\ne m}}\Pr\left[\frac{1}{n}\mu_n\left(C_{\hat{m}}^{(n)},C_m^{(n)}\right) > a\,\middle|\,C_m^{(n)}\right] \qquad (9.4.4)\\
&= \left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\,\middle|\,X^n\right]\right\}^{M-1},
\end{aligned}$$
where (9.4.4) holds because
$$\left\{\frac{1}{n}\mu_n\left(C_{\hat{m}}^{(n)},C_m^{(n)}\right)\right\}_{\substack{0\le\hat{m}\le M-1\\ \hat{m}\ne m}}$$
is conditionally independent given $C_m^{(n)}$. Hence,
$$\begin{aligned}
\Pr\left[\frac{1}{n}D_m > a\right]
&= \int_{\mathcal{X}^n}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,x^n\right) > a\right]\right\}^{M-1} dP_{X^n}(x^n)\\
&\ge \int_{\mathcal{X}^n}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,x^n\right) > a\right]\right\}^{M} dP_{X^n}(x^n)\\
&= E_{X^n}\left[\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\,\middle|\,X^n\right]\right\}^M\right]\\
&\ge E_{X^n}^M\left[\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\,\middle|\,X^n\right]\right] \qquad (9.4.5)\\
&= \left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\right]\right\}^M,
\end{aligned}$$
where (9.4.5) follows from Lyapounov's inequality [1, page 76], i.e., $E^{1/M}[U^M] \ge E[U]$ for a nonnegative random variable $U$. $\Box$
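The inequality of Lemma 9.23 can be probed by Monte Carlo simulation. The following sketch (our own, with arbitrarily chosen parameters) uses i.i.d. uniform binary codewords and the Hamming distance; for this symmetric setup the true gap between the two sides is a factor of $\Pr[(1/n)\mu_n > a]$, so the comparison is robust to sampling error.

```python
import random

random.seed(1)
n, M, a, trials = 6, 3, 0.25, 20000

def hamming(x, y):
    return sum(u != v for u, v in zip(x, y))

def rand_word():
    return tuple(random.randint(0, 1) for _ in range(n))

# left side: Pr[(1/n) D_m > a], with D_m the distance from codeword 0
# to its nearest neighbour among the other M-1 random codewords
hits = 0
for _ in range(trials):
    book = [rand_word() for _ in range(M)]
    Dm = min(hamming(book[0], book[m]) for m in range(1, M))
    hits += Dm / n > a
lhs = hits / trials

# right side: {Pr[(1/n) mu_n(Xhat, X) > a]}^M  (exponent M, as in the lemma)
pair_hits = sum(hamming(rand_word(), rand_word()) / n > a for _ in range(trials))
rhs = (pair_hits / trials) ** M

print(lhs >= rhs)  # the lemma's lower bound, up to Monte Carlo error
```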
We are now ready to prove the main theorem of the paper. For simplicity, throughout the article, $\hat{X}^n$ and $X^n$ are used specifically to denote two independent random variables having the common distribution $P_{X^n}$.

Theorem 9.24 (distance-spectrum formula)
$$\sup_X\bar{\Lambda}_X(R) \ge \limsup_{n\to\infty}\frac{d_{n,M}}{n} \ge \sup_X\bar{\Lambda}_X(R+\varepsilon) \qquad (9.4.6)$$
and
$$\sup_X\underline{\Lambda}_X(R) \ge \liminf_{n\to\infty}\frac{d_{n,M}}{n} \ge \sup_X\underline{\Lambda}_X(R+\varepsilon) \qquad (9.4.7)$$
for every $\varepsilon > 0$, where
$$\bar{\Lambda}_X(R) \triangleq \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\right]\right\}^M = 0\right\}$$
and
$$\underline{\Lambda}_X(R) \triangleq \inf\left\{a\in\mathbb{R} : \liminf_{n\to\infty}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\right]\right\}^M = 0\right\}.$$
Proof:

1. Lower bound. Observe that in Lemma 9.22, the rate decreases only by the amount $-\log(\gamma)/n$ when employing the code $\mathcal{C}_{n,\gamma M}$. Also note that for any $\varepsilon > 0$, $-\log(\gamma)/n < \varepsilon$ for sufficiently large $n$. These observations, together with Lemma 9.23, imply the validity of the lower bound.

2. Upper bound. Again, we will only prove (9.4.6), since (9.4.7) can be proved by simply following the same procedure.

To show that the upper bound of (9.4.6) holds, it suffices to prove the existence of $X$ such that
$$\bar{\Lambda}_X(R) \ge \limsup_{n\to\infty}\frac{d_{n,M}}{n}.$$
Let $X^n$ be uniformly distributed over one of the optimal codes $\mathcal{C}_{n,M}$. (By optimal, we mean that the code has the largest minimum distance among all codes of the same size.) Define
$$\lambda_n \triangleq \frac{1}{n}\min_{0\le m\le M-1}d_m(\mathcal{C}_{n,M}) \quad\text{and}\quad \bar{\lambda} \triangleq \limsup_{n\to\infty}\lambda_n.$$
Then for any $\delta > 0$,
$$\lambda_n > \bar{\lambda} - \delta \quad\text{for infinitely many } n. \qquad (9.4.8)$$
For those $n$ satisfying (9.4.8),
$$\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{\lambda}-\delta\right] \ge \Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) \ge \lambda_n\right] \ge \Pr\left[\hat{X}^n \ne X^n\right] = 1 - \frac{1}{M},$$
which implies
$$\limsup_{n\to\infty}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{\lambda}-\delta\right]\right\}^M \ge \limsup_{n\to\infty}\left(1-\frac{1}{M}\right)^M = \limsup_{n\to\infty}\left(1-\frac{1}{e^{nR}}\right)^{e^{nR}} = e^{-1} > 0.$$
Consequently, $\bar{\Lambda}_X(R) \ge \bar{\lambda} - \delta$. Since $\delta$ can be made arbitrarily small, the upper bound holds. $\Box$
Observe that $\sup_X\bar{\Lambda}_X(R)$ is nonincreasing in $R$, and hence its number of discontinuities is countable. This fact implies that
$$\sup_X\bar{\Lambda}_X(R) = \lim_{\varepsilon\downarrow 0}\sup_X\bar{\Lambda}_X(R+\varepsilon)$$
except for countably many $R$. A similar argument applies to $\sup_X\underline{\Lambda}_X(R)$. We can then rephrase the above theorem as in the next corollary.

Corollary 9.25
$$\limsup_{n\to\infty}\frac{d_{n,M}}{n} = \sup_X\bar{\Lambda}_X(R) \qquad\left(\text{resp. }\liminf_{n\to\infty}\frac{d_{n,M}}{n} = \sup_X\underline{\Lambda}_X(R)\right)$$
except possibly at the points of discontinuity of $\sup_X\bar{\Lambda}_X(R)$ (resp. $\sup_X\underline{\Lambda}_X(R)$), which are countable.
From the above theorem (or corollary), we can characterize the largest minimum distance of deterministic block codes in terms of the distance spectrum; we thus name it the distance-spectrum formula. For convenience, $\bar{\Lambda}_X(\cdot)$ and $\underline{\Lambda}_X(\cdot)$ will respectively be called the sup-distance-spectrum function and the inf-distance-spectrum function in the remainder of the article.
We conclude this section by remarking that the distance-spectrum formula obtained above indeed has a form analogous to the information-spectrum formulas of the channel capacity and the optimistic channel capacity. Furthermore, by taking the distance metric to be the $n$-fold Bhattacharyya distance [2, Definition 5.8.3], an upper bound on the channel reliability [2, Theorem 10.6.1] can be obtained, i.e.,
$$\limsup_{n\to\infty}-\frac{1}{n}\log P_e(n, M=e^{nR}) \le \sup_X\inf\left\{a : \limsup_{n\to\infty}\left\{\Pr\left[-\frac{1}{n}\log\sum_{y^n\in\mathcal{Y}^n}P_{Y^n|X^n}^{1/2}\left(y^n\middle|\hat{X}^n\right)P_{Y^n|X^n}^{1/2}(y^n|X^n) > a\right]\right\}^M = 0\right\}$$
and
$$\liminf_{n\to\infty}-\frac{1}{n}\log P_e(n, M=e^{nR}) \le \sup_X\inf\left\{a : \liminf_{n\to\infty}\left\{\Pr\left[-\frac{1}{n}\log\sum_{y^n\in\mathcal{Y}^n}P_{Y^n|X^n}^{1/2}\left(y^n\middle|\hat{X}^n\right)P_{Y^n|X^n}^{1/2}(y^n|X^n) > a\right]\right\}^M = 0\right\},$$
where $P_{Y^n|X^n}$ is the $n$-dimensional channel transition distribution from the code alphabet $\mathcal{X}^n$ to the channel output alphabet $\mathcal{Y}^n$, and $P_e(n,M)$ is the average probability of error for the optimal channel code of blocklength $n$ and size $M$. Note that the formula of the above channel reliability bound is quite different from those formulated in terms of the exponents of the information spectrum (cf. [6, Sec. V] and [17, equation (14)]).
B) Determination of the largest minimum distance for a class of distance functions

In this section, we present a minor class of distance functions for which the optimizing input $X$ for the distance-spectrum function can be characterized, and thereby the ultimate largest minimum distance among codewords can be established in terms of the distance-spectrum formula.

A simple example for which the largest minimum distance can be derived from the new formula is the probability-of-error distortion measure, defined as
$$\mu_n(\hat{x}^n,x^n) = \begin{cases} 0, & \text{if } \hat{x}^n = x^n;\\ n, & \text{if } \hat{x}^n \ne x^n.\end{cases}$$
It can be easily shown that
$$\sup_X\bar{\Lambda}_X(R) \le \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\sup_{X^n}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\right]\right\}^M = 0\right\} = \begin{cases} 1, & \text{for } 0\le R < \log|\mathcal{X}|;\\ 0, & \text{for } R \ge \log|\mathcal{X}|,\end{cases}$$
and the upper bound can be achieved by letting $X$ be uniformly distributed over the code alphabet. Similarly,
$$\sup_X\underline{\Lambda}_X(R) = \begin{cases} 1, & \text{for } 0\le R < \log|\mathcal{X}|;\\ 0, & \text{for } R \ge \log|\mathcal{X}|.\end{cases}$$
Another example for which the optimizer $X$ of the distance-spectrum function can be characterized is the separable distance function defined below.

Definition 9.26 (separable distance functions)
$$\mu_n(\hat{x}^n,x^n) \triangleq f_n\left(\left|g_n(\hat{x}^n) - g_n(x^n)\right|\right),$$
where $f_n(\cdot)$ and $g_n(\cdot)$ are real-valued functions.
Next, we derive the basis for finding one of the optimizing distributions for
$$\sup_{X^n}\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > a\right]$$
under separable distance functions.

Lemma 9.27 Define $G(\eta) \triangleq \inf_{U}\Pr\left\{\left|U-\hat{U}\right| \le \eta\right\}$, where the infimum is taken over all $(\hat{U},U)$ pairs that are independent and identically distributed on $[0,1]$. Then for $1/j \le \eta < 1/(j-1)$,
$$G(\eta) = \frac{1}{j}, \qquad j = 2, 3, 4, \ldots$$

Proof: For any such $(\hat{U},U)$ pair,
$$\begin{aligned}
\Pr\left\{\left|U-\hat{U}\right| \le \eta\right\}
&\ge \Pr\left\{\left|U-\hat{U}\right| \le \frac{1}{j}\right\}\\
&\ge \sum_{i=0}^{j-2}\Pr\left\{\left(U,\hat{U}\right)\in\left[\frac{i}{j},\frac{i+1}{j}\right)^2\right\} + \Pr\left\{\left(U,\hat{U}\right)\in\left[\frac{j-1}{j},1\right]^2\right\}\\
&= \sum_{i=0}^{j-2}\left(\Pr\left\{U\in\left[\frac{i}{j},\frac{i+1}{j}\right)\right\}\right)^2 + \left(\Pr\left\{U\in\left[\frac{j-1}{j},1\right]\right\}\right)^2\\
&\ge \frac{1}{j},
\end{aligned}$$
where the last inequality holds because the $j$ probabilities sum to 1, so the sum of their squares is at least $1/j$ (Cauchy–Schwarz). Achievability of $G(\eta)$ by the uniform distribution over
$$\left\{0, \frac{1}{j-1}, \frac{2}{j-1}, \ldots, \frac{j-2}{j-1}, 1\right\}$$
can be easily verified, and hence is omitted here. $\Box$
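The omitted achievability step can be checked exactly with rational arithmetic: for $U$ uniform on the $j$ points $\{0, 1/(j-1), \ldots, 1\}$ and $1/j \le \eta < 1/(j-1)$, adjacent points are $1/(j-1) > \eta$ apart, so only the diagonal pairs match, giving probability exactly $1/j$. A sketch with exact fractions (helper name ours):

```python
from fractions import Fraction

def match_prob(points, eta):
    # Pr[|U - Uhat| <= eta] for U, Uhat i.i.d. uniform on `points`
    N = len(points)
    hits = sum(abs(p - q) <= eta for p in points for q in points)
    return Fraction(hits, N * N)

for j in (2, 3, 4, 7):
    grid = [Fraction(i, j - 1) for i in range(j)]   # {0, 1/(j-1), ..., 1}
    assert match_prob(grid, Fraction(1, j)) == Fraction(1, j)
print("G(eta) = 1/j attained for the tested j")
```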
Lemma 9.28 For $1/j \le \eta < 1/(j-1)$,
$$\frac{1}{j} \le \inf_{U_n}\Pr\left\{\left|U_n-\hat{U}_n\right| \le \eta\right\} \le \frac{1}{j} + \frac{1}{\eta n + 0.5},$$
where the infimum is taken over all $(\hat{U}_n, U_n)$ pairs that are independent and identically distributed on
$$\left\{0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1\right\},$$
and $j = 2, 3, 4, \ldots$ etc.

Proof: The lower bound follows immediately from Lemma 9.27.

To prove the upper bound, let $U_n$ be uniformly distributed over
$$\left\{0, \frac{\ell}{n}, \frac{2\ell}{n}, \ldots, \frac{k\ell}{n}, 1\right\},$$
where $\ell = \lfloor \eta n\rfloor + 1$ and $k$ is an integer satisfying
$$\frac{\eta n}{\eta n/(j-1)+1} - 1 \le k < \frac{\eta n}{\eta n/(j-1)+1}.$$
(Note that $\eta n/(j-1) + 1 \ge \lfloor \eta n/(j-1)\rfloor + 1$.) Then, since adjacent support points are more than $\eta$ apart,
$$\inf_{U_n}\Pr\left\{\left|U_n-\hat{U}_n\right| \le \eta\right\} \le \Pr\left\{\hat{U}_n = U_n\right\} = \frac{1}{k+2} \le \frac{1}{\dfrac{1}{1/(j-1)+1/(\eta n)}+1} = \frac{1}{j} + \frac{(j-1)^2}{j^2}\cdot\frac{1}{\eta n + (j-1)/j} \le \frac{1}{j} + \frac{1}{\eta n + 0.5}.$$
$\Box$
Based on the above lemmas, we can then proceed to compute the asymptotic largest minimum distance among codewords in the following examples. It should be pointed out that in these examples our objective is not to solve any related problem of practical interest, but simply to demonstrate the computation of the distance-spectrum function for general readers.
Example 9.29 Assume that the $n$-tuple code alphabet is $\{0,1\}^n$. Let the $n$-fold distance function be defined as
$$\mu_n(\hat{x}^n,x^n) \triangleq \left|u(\hat{x}^n) - u(x^n)\right|,$$
where $u(x^n)$ represents the number of 1s in $x^n$. Then
$$\begin{aligned}
\sup_X\bar{\Lambda}_X(R)
&\le \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\sup_{X^n}\left\{\Pr\left[\frac{1}{n}\left|u\left(\hat{X}^n\right)-u(X^n)\right| > a\right]\right\}^M = 0\right\}\\
&= \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{1-\inf_{X^n}\Pr\left[\frac{1}{n}\left|u\left(\hat{X}^n\right)-u(X^n)\right| \le a\right]\right\}^M = 0\right\}\\
&\le \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{1-\inf_{U}\Pr\left[\left|U-\hat{U}\right| \le a\right]\right\}^{\exp\{nR\}} = 0\right\}\\
&= \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{1-\frac{1}{\lceil 1/a\rceil}\right\}^{\exp\{nR\}} = 0\right\}\\
&= 0.
\end{aligned}$$
Hence, the asymptotic largest minimum distance among block codewords is zero. This conclusion is not surprising, because a code with nonzero minimum distance must consist of codewords with pairwise distinct Hamming weights, and the total number of available weights is only $n+1$.
Example 9.30 Assume that the code alphabet is binary. Define
$$\mu_n(\hat{x}^n,x^n) \triangleq \log_2\left(\left|g_n(\hat{x}^n)-g_n(x^n)\right| + 1\right),$$
where
$$g_n(x^n) = x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + \cdots + x_1\cdot 2 + x_0.$$
Then
$$\begin{aligned}
\sup_X\bar{\Lambda}_X(R)
&\le \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\sup_{X^n}\left\{\Pr\left[\frac{1}{n}\log_2\left(\left|g_n\left(\hat{X}^n\right)-g_n(X^n)\right|+1\right) > a\right]\right\}^M = 0\right\}\\
&\le \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{1-\inf_{U}\Pr\left[\left|U-\hat{U}\right| \le \frac{2^{na}-1}{2^n-1}\right]\right\}^{\exp\{nR\}} = 0\right\}\\
&= \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{1-\frac{1}{\left\lceil (2^n-1)/(2^{na}-1)\right\rceil}\right\}^{\exp\{nR\}} = 0\right\}\\
&= 1 - \frac{R}{\log 2}. \qquad (9.4.9)
\end{aligned}$$
By taking $X^n$ to be uniformly distributed over $\{0,1\}^n$ (so that $g_n(X^n)$ is uniformly distributed over $\{0,1,\ldots,2^n-1\}$), Lemma 9.28 with $\eta = (2^{na}-1)/(2^n-1)$ yields
$$\bar{\Lambda}_X(R) \ge \inf\left\{a\in\mathbb{R} : \limsup_{n\to\infty}\left\{1-\frac{1}{\left\lceil (2^n-1)/(2^{na}-1)\right\rceil}-\frac{1}{2^{na}-0.5}\right\}^{\exp\{nR\}} = 0\right\} = 1 - \frac{R}{\log 2}.$$
This proves the achievability of (9.4.9).
C) General properties of the distance-spectrum function

We next address some general functional properties of $\bar{\Lambda}_X(R)$ and $\underline{\Lambda}_X(R)$.

Lemma 9.31

1. $\bar{\Lambda}_X(R)$ and $\underline{\Lambda}_X(R)$ are nonincreasing and right-continuous functions of $R$.

2.
$$\bar{\Lambda}_X(R) \ge \bar{D}_0(X) \triangleq \limsup_{n\to\infty}\operatorname{ess\,inf}\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) \qquad (9.4.10)$$
$$\underline{\Lambda}_X(R) \ge \underline{D}_0(X) \triangleq \liminf_{n\to\infty}\operatorname{ess\,inf}\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) \qquad (9.4.11)$$
where $\operatorname{ess\,inf}$ represents the essential infimum$^9$. In addition, equality holds in (9.4.10) and (9.4.11), respectively, when
$$R > \bar{R}_0(X) \triangleq \limsup_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\hat{X}^n = X^n\right\} \qquad (9.4.12)$$
and
$$R > \underline{R}_0(X) \triangleq \liminf_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\hat{X}^n = X^n\right\}, \qquad (9.4.13)$$
provided that
$$\left(\forall\,\hat{x}^n\in\mathcal{X}^n\right)\quad \min_{x^n\in\mathcal{X}^n}\mu_n\left(\hat{x}^n,x^n\right) = \mu_n\left(x^n,x^n\right) = \text{constant}. \qquad (9.4.14)$$

3.
$$\bar{\Lambda}_X(R) \le \bar{D}_p(X) \qquad (9.4.15)$$
for $R > \bar{R}_p(X)$; and
$$\underline{\Lambda}_X(R) \le \underline{D}_p(X) \qquad (9.4.16)$$
for $R > \underline{R}_p(X)$, where
$$\bar{D}_p(X) \triangleq \limsup_{n\to\infty}\frac{1}{n}E\left[\mu_n\left(\hat{X}^n,X^n\right)\middle|\mu_n\left(\hat{X}^n,X^n\right)<\infty\right]$$
$$\underline{D}_p(X) \triangleq \liminf_{n\to\infty}\frac{1}{n}E\left[\mu_n\left(\hat{X}^n,X^n\right)\middle|\mu_n\left(\hat{X}^n,X^n\right)<\infty\right]$$
$$\bar{R}_p(X) \triangleq \limsup_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\mu_n\left(\hat{X}^n,X^n\right)<\infty\right\}$$
and
$$\underline{R}_p(X) \triangleq \liminf_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\mu_n\left(\hat{X}^n,X^n\right)<\infty\right\}.$$
In addition, equality holds in (9.4.15) and (9.4.16), respectively, when $R = \bar{R}_p(X)$ and $R = \underline{R}_p(X)$, provided that $\left[\mu_n\left(\hat{X}^n,X^n\right)\middle|\mu_n\left(\hat{X}^n,X^n\right)<\infty\right]$ has the large-deviation type of behavior, i.e., for all $\delta > 0$,
$$\liminf_{n\to\infty}-\frac{1}{n}\log L_n > 0, \qquad (9.4.17)$$
where
$$L_n \triangleq \Pr\left\{\frac{1}{n}\left(Y_n - E[Y_n|Y_n<\infty]\right) \le -\delta\,\middle|\,Y_n<\infty\right\},$$
and $Y_n \triangleq \mu_n\left(\hat{X}^n,X^n\right)$.

4. For $0 < R < \bar{R}_p(X)$, $\bar{\Lambda}_X(R) = \infty$. Similarly, for $0 < R < \underline{R}_p(X)$, $\underline{\Lambda}_X(R) = \infty$.

$^9$ For a given random variable $Z$, its essential infimum is defined as $\operatorname{ess\,inf} Z \triangleq \sup\{z : \Pr[Z \ge z] = 1\}$.
Proof: Again, only the proof concerning $\bar{\Lambda}_X(R)$ will be provided; the properties of $\underline{\Lambda}_X(R)$ can be proved similarly.

1. Property 1 follows by definition.
2. (9.4.10) can be proved as follows. Let
$$\bar{e}_n(X) \triangleq \operatorname{ess\,inf}\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right),$$
so that $\bar{D}_0(X) = \limsup_{n\to\infty}\bar{e}_n(X)$. Observe that for any $\delta > 0$ and for infinitely many $n$,
$$\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{D}_0(X)-2\delta\right]\right\}^M \ge \left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{e}_n(X)-\delta\right]\right\}^M = 1.$$
Therefore, $\bar{\Lambda}_X(R) \ge \bar{D}_0(X)-2\delta$ for arbitrarily small $\delta > 0$. This completes the proof of (9.4.10).

To prove the equality condition for (9.4.10), it suffices to show that for any $\delta > 0$,
$$\bar{\Lambda}_X(R) \le \bar{D}_0(X)+\delta \qquad (9.4.18)$$
for
$$R > \limsup_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\hat{X}^n = X^n\right\}.$$
By the assumption on the range of $R$, there exists $\gamma > 0$ such that
$$R > -\frac{1}{n}\log\Pr\left\{\hat{X}^n = X^n\right\} + \gamma \quad\text{for sufficiently large } n. \qquad (9.4.19)$$
Then for those $n$ satisfying $\bar{e}_n(X) \le \bar{D}_0(X)+\delta/2$ and (9.4.19) (of which there are sufficiently many),
$$\begin{aligned}
\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{D}_0(X)+\delta\right]\right\}^M
&\le \left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{e}_n(X)+\frac{\delta}{2}\right]\right\}^M\\
&\le \left\{\Pr\left[\hat{X}^n \ne X^n\right]\right\}^M \qquad (9.4.20)\\
&= \left\{1 - \Pr\left[\hat{X}^n = X^n\right]\right\}^M,
\end{aligned}$$
where (9.4.20) holds because of (9.4.14). Consequently,
$$\limsup_{n\to\infty}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > \bar{D}_0(X)+\delta\right]\right\}^M = 0,$$
which immediately implies (9.4.18).
3. (9.4.15) holds trivially if $\bar{D}_p(X) = \infty$; thus, without loss of generality, we assume that $\bar{D}_p(X) < \infty$. (9.4.15) can then be proved by observing that for any $\delta > 0$ and sufficiently large $n$,
$$\begin{aligned}
\left\{\Pr\left[\frac{1}{n}Y_n > (1+\delta)^2\bar{D}_p(X)\right]\right\}^M
&\le \left\{\Pr\left[\frac{1}{n}Y_n > (1+\delta)\frac{1}{n}E[Y_n|Y_n<\infty]\right]\right\}^M\\
&= \left(\Pr\left\{Y_n > (1+\delta)E[Y_n|Y_n<\infty]\right\}\right)^M\\
&= \Big(\Pr\{Y_n=\infty\} + \Pr\{Y_n<\infty\}\,\Pr\left\{Y_n > (1+\delta)E[Y_n|Y_n<\infty]\,\middle|\,Y_n<\infty\right\}\Big)^M\\
&\le \left(\Pr\{Y_n=\infty\} + \Pr\{Y_n<\infty\}\frac{1}{1+\delta}\right)^M \qquad (9.4.21)\\
&= \left(1 - \frac{\delta}{1+\delta}\Pr\{Y_n<\infty\}\right)^M,
\end{aligned}$$
where (9.4.21) follows from Markov's inequality. Consequently, for $R > \bar{R}_p(X)$,
$$\limsup_{n\to\infty}\left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > (1+\delta)^2\bar{D}_p(X)\right]\right\}^M = 0.$$
To prove that equality holds in (9.4.15) at $R = \bar{R}_p(X)$, it suffices to show the achievability of $\bar{D}_p(X)$ by $\bar{\Lambda}_X(R)$ as $R\downarrow\bar{R}_p(X)$, since $\bar{\Lambda}_X(R)$ is right-continuous. This can be shown as follows. For any $\delta > 0$, we note from (9.4.17) that there exists $\varepsilon = \varepsilon(\delta) > 0$ such that for sufficiently large $n$,
$$\Pr\left\{\frac{1}{n}Y_n \le \frac{1}{n}E[Y_n|Y_n<\infty]-\delta\,\middle|\,Y_n<\infty\right\} \le e^{-n\varepsilon}.$$
Therefore, for infinitely many $n$,
$$\begin{aligned}
\left\{\Pr\left[\frac{1}{n}Y_n > \bar{D}_p(X)-2\delta\right]\right\}^M
&\ge \left\{\Pr\left[\frac{1}{n}Y_n > \frac{1}{n}E[Y_n|Y_n<\infty]-\delta\right]\right\}^M\\
&= \Big(\Pr\{Y_n=\infty\} + \Pr\{Y_n<\infty\}\,\Pr\left\{\frac{1}{n}Y_n > \frac{1}{n}E[Y_n|Y_n<\infty]-\delta\,\middle|\,Y_n<\infty\right\}\Big)^M\\
&= \left(1 - \Pr\left\{\frac{1}{n}Y_n \le \frac{1}{n}E[Y_n|Y_n<\infty]-\delta\,\middle|\,Y_n<\infty\right\}\Pr\{Y_n<\infty\}\right)^M\\
&\ge \left(1 - e^{-n\varepsilon}\Pr\{Y_n<\infty\}\right)^M.
\end{aligned}$$
Accordingly, for $\bar{R}_p(X) < R < \bar{R}_p(X)+\varepsilon$,
$$\limsup_{n\to\infty}\left\{\Pr\left[\frac{1}{n}Y_n > \bar{D}_p(X)-2\delta\right]\right\}^M > 0,$$
which in turn implies that
$$\bar{\Lambda}_X(R) \ge \bar{D}_p(X)-2\delta.$$
This completes the proof of the achievability of (9.4.15) as $R\downarrow\bar{R}_p(X)$.
4. This is an immediate consequence of
$$\left(\forall\,L > 0\right)\quad \left\{\Pr\left[\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) > L\right]\right\}^M \ge \left\{1 - \Pr\left[\mu_n\left(\hat{X}^n,X^n\right) < \infty\right]\right\}^M.$$
$\Box$
Remarks.

A weaker condition for (9.4.12) and (9.4.13) is
$$R > \limsup_{n\to\infty}\frac{1}{n}\log|\mathcal{S}_n| \quad\text{and}\quad R > \liminf_{n\to\infty}\frac{1}{n}\log|\mathcal{S}_n|,$$
where $\mathcal{S}_n \triangleq \{x^n\in\mathcal{X}^n : P_{X^n}(x^n) > 0\}$. This indicates the expected result that when the rate is larger than $\log|\mathcal{X}|$, the asymptotic largest minimum distance among codewords remains at its smallest value $\sup_X\bar{D}_0(X)$ (resp. $\sup_X\underline{D}_0(X)$), which is usually zero.
Based on Lemma 9.31, the general relation between $\bar{\Lambda}_X(R)$ and the spectrum of $(1/n)\mu_n(\hat{X}^n,X^n)$ can be illustrated as in Figure 9.7, which shows that $\bar{\Lambda}_X(R)$ lies asymptotically between
$$\operatorname{ess\,inf}\frac{1}{n}\mu_n\left(\hat{X}^n,X^n\right) \quad\text{and}\quad \frac{1}{n}E\left[\mu_n\left(\hat{X}^n,X^n\right)\middle|\mu_n\left(\hat{X}^n,X^n\right)<\infty\right]$$
for $\bar{R}_p(X) < R < \bar{R}_0(X)$. Similar remarks can be made on $\underline{\Lambda}_X(R)$. On the other hand, the general curve of $\bar{\Lambda}_X(R)$ (similarly for $\underline{\Lambda}_X(R)$) can be plotted as shown in Figure 9.8. To summarize, we remark that under fairly general situations,
$$\bar{\Lambda}_X(R)\begin{cases} = \infty & \text{for } 0 < R < \bar{R}_p(X);\\ = \bar{D}_p(X) & \text{at } R = \bar{R}_p(X);\\ \in\left(\bar{D}_0(X),\bar{D}_p(X)\right] & \text{for } \bar{R}_p(X) < R < \bar{R}_0(X);\\ = \bar{D}_0(X) & \text{for } R \ge \bar{R}_0(X).\end{cases}$$
A simple universal upper bound on the largest minimum distance among block codewords is the Plotkin bound. Its usual expression is given in [10], for which a straightforward generalization (cf. Appendix B) is
$$\sup_X\limsup_{n\to\infty}\frac{1}{n}E\left[\mu_n\left(\hat{X}^n,X^n\right)\right].$$
Property 3 of Lemma 9.31, however, provides a slightly better form of the general Plotkin bound.

We now, based on Lemma 9.31, calculate the distance-spectrum function in the following examples. The first example deals with the case of an infinite code alphabet, and the second example derives the distance-spectrum function under an unbounded generalized distance measure.
Example 9.32 (continuous code alphabet) Let
\[
\mathcal{X} = [0,1),
\]
and let the marginal distance metric be
\[
\rho(x_1,x_2) \triangleq \min\left\{\,1-|x_1-x_2|,\ |x_1-x_2|\,\right\}.
\]
Note that this metric is nothing but treating $[0,1)$ as a circle ($0$ and $1$ are glued together) and then measuring the shorter distance between two positions. Also, the additivity property is assumed for the $n$-fold distance function, i.e.,
\[
\rho_n(\hat{x}^n,x^n) \triangleq \sum_{i=1}^{n}\rho(\hat{x}_i,x_i).
\]
Using the product $X$ of uniform distributions over $\mathcal{X}$, the sup-distance-spectrum function becomes
\[
\Lambda_X(R) = \inf\left\{a\in\Re : \limsup_{n\to\infty}\left[1-\Pr\left\{\frac{1}{n}\sum_{i=1}^{n}\rho(\hat{X}_i,X_i)\le a\right\}\right]^M = 0\right\}.
\]
By Cramér's theorem [4],
\[
\lim_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\frac{1}{n}\sum_{i=1}^{n}\rho(\hat{X}_i,X_i)\le a\right\} = I_X(a),
\]
where
\[
I_X(a) \triangleq \sup_{t>0}\left\{-ta-\log E\left[e^{-t\rho(\hat{X},X)}\right]\right\}
\]
is the large deviation rate function.^{10} Since $I_X(a)$ is convex in $a$, there exists a supporting line to it satisfying
\[
I_X(a) = -t_a a - \log E\left[e^{-t_a\rho(\hat{X},X)}\right],
\]
which implies
\[
a = -s_a I_X(a) - s_a\log E\left[e^{-\rho(\hat{X},X)/s_a}\right]
\quad\text{for } s_a \triangleq 1/t_a.
\]
Replacing $I_X(a)$ by $R$ (equivalently, $a$ by $I_X^{-1}(R)$) yields
\begin{align}
I_X^{-1}(R) &= -sR - s\log E\left[e^{-\rho(\hat{X},X)/s}\right] \tag{9.4.22}\\
&= \sup_{s>0}\left\{-sR - s\log E\left[e^{-\rho(\hat{X},X)/s}\right]\right\},\notag
\end{align}
where^{11} the last step follows from the observation that (9.4.22) is also a supporting line to the convex $I_X^{-1}(R)$. Consequently, since $\rho(\hat{X},X)$ is uniformly distributed over $[0,1/2]$ so that $E[e^{-\rho(\hat{X},X)/s}] = 2s(1-e^{-1/2s})$,
\begin{align}
\Lambda_X(R) &= \inf\{a\in\Re : I_X(a) < R\}\notag\\
&= \sup_{s>0}\left\{-sR - s\log\left[2s\left(1-e^{-1/2s}\right)\right]\right\}, \tag{9.4.23}
\end{align}
^{10} We take the range of the supremum to be $\{t>0\}$ (instead of $\{t\in\Re\}$, as the conventional large deviation rate function does) since what concerns us here is the exponent of the cumulative probability mass.

^{11} One may notice the analogy between the expression of the large deviation rate function $I_X(a)$ and that of the error exponent function [2, Thm. 4.6.4] (or the channel reliability exponent function [2, Thm. 10.1.5]). Here, we demonstrate in Example 9.32 the basic procedure of
which is plotted in Figure 9.9.

Also from Lemma 9.31, we can easily compute the marginal points of the distance-spectrum function as follows:
\begin{align*}
R_0(X) &= -\log\Pr\{\hat{X}=X\} = \infty;\\
D_0(X) &= \operatorname{ess\,inf}\rho(\hat{X},X) = 0;\\
R_p(X) &= -\log\Pr\{\rho(\hat{X},X)<\infty\} = 0;\\
D_p(X) &= E[\rho(\hat{X},X)\mid\rho(\hat{X},X)<\infty] = \tfrac{1}{4}.
\end{align*}
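As a quick numerical sanity check (our own illustration, not part of the original text), the closed form (9.4.23) can be evaluated by a grid search over $s$. The sketch assumes the uniform-product distribution of this example, under which $E[e^{-\rho(\hat{X},X)/s}] = 2s(1-e^{-1/2s})$; the helper name `gilbert_circle` is ours.

```python
import math

def gilbert_circle(R, s_grid=None):
    """Evaluate sup_{s>0} { -s R - s log[ 2 s (1 - e^{-1/(2s)}) ] } over a grid,
    i.e., the distance-spectrum formula (9.4.23) for the circular metric on
    [0, 1) with uniform random coding."""
    if s_grid is None:
        # s from 10^-2 up to 10^4, geometrically spaced
        s_grid = [10 ** (k / 100.0) for k in range(-200, 401)]
    best = 0.0
    for s in s_grid:
        mgf = 2 * s * (1 - math.exp(-1 / (2 * s)))  # E[e^{-rho/s}], rho ~ Unif[0, 1/2]
        best = max(best, -s * R - s * math.log(mgf))
    return best
```

Consistently with the marginal points above, the value approaches $D_p(X)=1/4$ as $R\downarrow 0$ and decreases in $R$.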
Example 9.33 Under the case that
\[
\mathcal{X} = \{0,1\}
\]
and $\rho_n(\cdot,\cdot)$ is additive with marginal distance metric $\rho(0,0)=\rho(1,1)=0$, $\rho(0,1)=1$ and $\rho(1,0)=\infty$, the sup-distance-spectrum function is obtained using the product of uniform (on $\mathcal{X}$) distributions as:
\begin{align}
\Lambda_X(R) &= \inf\{a\in\Re : I_X(a)<R\}\notag\\
&= \sup_{s>0}\left\{-sR - s\log\frac{2+e^{-1/s}}{4}\right\}, \tag{9.4.24}
\end{align}
where
\[
I_X(a) \triangleq \sup_{t>0}\left\{-ta-\log E\left[e^{-t\rho(\hat{X},X)}\right]\right\}
= \sup_{t>0}\left\{-ta-\log\frac{2+e^{-t}}{4}\right\}.
\]
This curve is plotted in Figure 9.10. It is worth noting that there exists a region
obtaining
\[
\inf\left\{a\in\Re : \sup_{t>0}\left[-ta-\log E\left[e^{-tZ}\right]\right] < R\right\}
= \sup_{s>0}\left\{-sR-s\log E\left[e^{-Z/s}\right]\right\}
\]
for a random variable $Z$, so that readers do not have to refer to the literature on the error exponent function or the channel reliability exponent function for the validity of the above equality. This equality will be used later in Examples 9.33 and 9.36, and also in equations (9.4.27) and (9.4.29).
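The equality in this footnote can also be checked numerically for a concrete $Z$. The following sketch is entirely our own illustration: it takes $Z\sim\mathrm{Bernoulli}(1/2)$ and uses coarse grids to stand in for the suprema, comparing the quantile of the rate function with the dual supremum form.

```python
import math

def I_rate(a):
    """sup_{t>0} [-t a - log E e^{-t Z}] for Z ~ Bernoulli(1/2), coarse t-grid."""
    return max(-(t / 100.0) * a - math.log((1 + math.exp(-t / 100.0)) / 2)
               for t in range(1, 2001))

def quantile(R):
    """inf{a : I(a) < R} by bisection (I is non-increasing in a)."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if I_rate(mid) < R:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def dual(R):
    """sup_{s>0} { -s R - s log E e^{-Z/s} } on a coarse s-grid."""
    return max(-(s / 100.0) * R - (s / 100.0) * math.log((1 + math.exp(-100.0 / s)) / 2)
               for s in range(1, 2001))
```

Both evaluations agree up to the grid resolution, as the footnote's equality predicts.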
in which the sup-distance-spectrum function is infinite. This is justified by deriving
\begin{align*}
R_0(X) &= -\log\Pr\{\hat{X}=X\} = \log 2;\\
D_0(X) &= \operatorname{ess\,inf}\rho(\hat{X},X) = 0;\\
R_p(X) &= -\log\Pr\{\rho(\hat{X},X)<\infty\} = \log\tfrac{4}{3};\\
D_p(X) &= E[\rho(\hat{X},X)\mid\rho(\hat{X},X)<\infty] = \tfrac{1}{3}.
\end{align*}
One can draw the same conclusion by simply taking the derivative of (9.4.24) with respect to $s$ and observing that the derivative
\[
-R-\log\left(2+e^{-1/s}\right)-\frac{e^{-1/s}}{s\left(2+e^{-1/s}\right)}+\log 4
\]
is always positive when $R < \log(4/3)$. Therefore, when $0 < R < \log(4/3)$, the distance-spectrum function is infinite.
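A quick numerical check of (9.4.24) and of its infinite region (again our own illustration, not from the text): the objective $-sR - s\log[(2+e^{-1/s})/4]$ grows without bound in $s$ when $R<\log(4/3)$, decays to $-\infty$ when $R>\log(4/3)$, and at $R=\log(4/3)$ its large-$s$ limit is $D_p(X)=1/3$.

```python
import math

def objective(R, s):
    # The function maximized in (9.4.24): -sR - s*log((2 + e^{-1/s})/4).
    return -s * R - s * math.log((2 + math.exp(-1 / s)) / 4)
```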
From the above two examples, it is natural to ask whether the formula for the largest minimum distance can be simplified to the quantile function^{12} of the large deviation rate function (cf. (9.4.23) and (9.4.24)), especially when the distance functional is symmetric and additive. Note that the quantile function of the large deviation rate function is exactly the well-known Varshamov-Gilbert lower bound (cf. the next section). This inquiry then becomes finding the answer to an open question: under what conditions is the Varshamov-Gilbert lower bound tight? Some insight on this inquiry will be discussed in the next section.

D) General Varshamov-Gilbert lower bound

In this section, a general Varshamov-Gilbert lower bound will be derived directly from the distance-spectrum formulas. Conditions under which this lower bound is tight will then be explored.
Lemma 9.34 (large deviation formulas for $\Lambda_X(R)$ and $\bar{\Lambda}_X(R)$)
\[
\Lambda_X(R) = \inf\left\{a\in\Re : \bar{\omega}_X(a) < R\right\}
\quad\text{and}\quad
\bar{\Lambda}_X(R) = \inf\left\{a\in\Re : \omega_X(a) < R\right\},
\]
^{12} Note that the usual definition [1, page 190] of the quantile of a non-decreasing function $F(\cdot)$ is $\sup\{\theta : F(\theta) < \delta\}$. Here we adopt its dual definition for a non-increasing function $I(\cdot)$, namely $\inf\{a : I(a) < R\}$. Remark that if $F(\cdot)$ is strictly increasing (resp. $I(\cdot)$ is strictly decreasing), then the quantile is nothing but the inverse of $F(\cdot)$ (resp. $I(\cdot)$).
where $\bar{\omega}_X(a)$ and $\omega_X(a)$ are respectively the sup- and the inf-large-deviation spectrums of $(1/n)\rho_n(\hat{X}^n,X^n)$, defined as
\[
\bar{\omega}_X(a) \triangleq \limsup_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\frac{1}{n}\rho_n(\hat{X}^n,X^n)\le a\right\}
\]
and
\[
\omega_X(a) \triangleq \liminf_{n\to\infty}-\frac{1}{n}\log\Pr\left\{\frac{1}{n}\rho_n(\hat{X}^n,X^n)\le a\right\}.
\]
Proof: We will only provide the proof regarding $\Lambda_X(R)$. All the properties of $\bar{\Lambda}_X(R)$ can be proved by following similar arguments.

Define $\lambda^* \triangleq \inf\{a\in\Re : \bar{\omega}_X(a) < R\}$. Then for any $\delta>0$,
\[
\bar{\omega}_X(\lambda^*+\delta) < R \tag{9.4.25}
\]
and
\[
\bar{\omega}_X(\lambda^*-\delta) \ge R. \tag{9.4.26}
\]
Inequality (9.4.25) ensures the existence of $\gamma=\gamma(\delta)>0$ such that, for sufficiently large $n$,
\[
\Pr\left\{\frac{1}{n}\rho_n(\hat{X}^n,X^n) \le \lambda^*+\delta\right\} \ge e^{-n(R-\gamma)},
\]
which in turn implies
\[
\limsup_{n\to\infty}\left[\Pr\left\{\frac{1}{n}\rho_n(\hat{X}^n,X^n) > \lambda^*+\delta\right\}\right]^M
\le \limsup_{n\to\infty}\left(1-e^{-n(R-\gamma)}\right)^{e^{nR}} = 0.
\]
Hence, $\Lambda_X(R) \le \lambda^*+\delta$. On the other hand, (9.4.26) implies the existence of a subsequence $\{n_j\}_{j=1}^{\infty}$ satisfying
\[
\lim_{j\to\infty}-\frac{1}{n_j}\log\Pr\left\{\frac{1}{n_j}\rho_{n_j}(\hat{X}^{n_j},X^{n_j}) \le \lambda^*-\delta\right\} \ge R,
\]
which in turn implies
\[
\limsup_{n\to\infty}\left[\Pr\left\{\frac{1}{n}\rho_n(\hat{X}^n,X^n) > \lambda^*-\delta\right\}\right]^M
\ge \limsup_{j\to\infty}\left(1-e^{-n_jR}\right)^{e^{n_jR}} = e^{-1} > 0.
\]
Accordingly, $\Lambda_X(R) \ge \lambda^*-\delta$. Since $\delta$ is arbitrary, the lemma therefore holds. $\Box$
The above lemma confirms that the distance-spectrum function $\Lambda_X(\cdot)$ (resp. $\bar{\Lambda}_X(\cdot)$) is exactly the quantile of the sup- (resp. inf-) large-deviation spectrum of $\{(1/n)\rho_n(\hat{X}^n,X^n)\}_{n\ge 1}$. Thus, if the large deviation spectrum is known, so is the distance-spectrum function.
By the generalized Gärtner-Ellis upper bound derived in [5, Thm. 2.1], we obtain
\[
\bar{\omega}_X(a) \ge \inf_{\{x\,:\,x\le a\}} I_X(x) = I_X(a)
\quad\text{and}\quad
\omega_X(a) \ge \inf_{\{x\,:\,x\le a\}} \bar{I}_X(x) = \bar{I}_X(a),
\]
where the equalities follow from the convexity (and hence continuity and strict decrease) of
\[
I_X(x) \triangleq \sup_{\theta<0}\left[\theta x - \varphi_X(\theta)\right]
\quad\text{and}\quad
\bar{I}_X(x) \triangleq \sup_{\theta<0}\left[\theta x - \bar{\varphi}_X(\theta)\right],
\]
and
\[
\varphi_X(\theta) \triangleq \liminf_{n\to\infty}\frac{1}{n}\log E\left[e^{\theta\rho_n(\hat{X}^n,X^n)}\right]
\quad\text{and}\quad
\bar{\varphi}_X(\theta) \triangleq \limsup_{n\to\infty}\frac{1}{n}\log E\left[e^{\theta\rho_n(\hat{X}^n,X^n)}\right].
\]
Based on these observations, the relation between the distance-spectrum expression and the Varshamov-Gilbert bound can be described as follows.

Corollary 9.35
\[
\sup_X \Lambda_X(R) \ge \sup_X G_X(R)
\quad\text{and}\quad
\sup_X \bar{\Lambda}_X(R) \ge \sup_X \bar{G}_X(R),
\]
where
\[
G_X(R) \triangleq \inf\{a\in\Re : I_X(a) < R\}
= \sup_{s>0}\left\{-sR - s\,\varphi_X(-1/s)\right\} \tag{9.4.27}
\]
and
\begin{align}
\bar{G}_X(R) &\triangleq \inf\left\{a\in\Re : \bar{I}_X(a) < R\right\} \tag{9.4.28}\\
&= \sup_{s>0}\left\{-sR - s\,\bar{\varphi}_X(-1/s)\right\}. \tag{9.4.29}
\end{align}
Some remarks on the Varshamov-Gilbert bound obtained above are given below.

Remarks.
One can easily see from [2, page 400], where the Varshamov-Gilbert bound is given under the Bhattacharyya distance and a finite code alphabet, that
\[
\sup_X G_X(R) \quad\text{and}\quad \sup_X \bar{G}_X(R)
\]
are generalizations of the conventional Varshamov-Gilbert bound.
Since $0 \le \exp\{-\rho_n(\hat{x}^n,x^n)/s\} \le 1$ for $s>0$, the function $\exp\{-\rho_n(\cdot,\cdot)/s\}$ is always integrable. Hence, (9.4.27) and (9.4.29) can be evaluated for any non-negative measurable function $\rho_n(\cdot,\cdot)$. In addition, no assumption on the alphabet space $\mathcal{X}$ is needed in deriving the lower bound. Its full generality can be displayed using, again, Examples 9.32 and 9.33, which result in exactly the same curves as shown in Figures 9.9 and 9.10.
Observe that $G_X(R)$ and $\bar{G}_X(R)$ are both the pointwise supremum of a collection of affine functions; hence they are both convex, which immediately implies their continuity and strictly decreasing property on the interiors of their domains, i.e., $\{R : G_X(R) < \infty\}$ and $\{R : \bar{G}_X(R) < \infty\}$. However, as pointed out in [5], $\bar{\omega}_X(\cdot)$ and $\omega_X(\cdot)$ are not necessarily convex, which in turn hints at the possibility of non-convex $\Lambda_X(\cdot)$ and $\bar{\Lambda}_X(\cdot)$. This clearly indicates that the Varshamov-Gilbert bound is not tight whenever the asymptotic largest minimum distance among codewords is non-convex.
An immediate improvement from [5] to the Varshamov-Gilbert bound is to employ the twisted large deviation rate function (instead of $I_X(\cdot)$)
\[
J_{X,h}(x) \triangleq \sup_{\{\theta\in\Re\,:\,\bar{\varphi}_X(\theta;h)>-\infty\}}
\left[\theta\, h(x) - \bar{\varphi}_X(\theta;h)\right],
\]
which yields a potentially non-convex Varshamov-Gilbert-type bound, where $h(\cdot)$ is a continuous real-valued function and
\[
\bar{\varphi}_X(\theta;h) \triangleq \limsup_{n\to\infty}\frac{1}{n}\log E\left[e^{n\theta\, h(\rho_n(\hat{X}^n,X^n)/n)}\right].
\]
The question of how to find a proper $h(\cdot)$ for such an improvement is beyond the scope of this paper and hence is deferred for further study.
We now demonstrate that $\Lambda_X(R) > G_X(R)$ by a simple example.
Example 9.36 Assume a binary code alphabet $\mathcal{X}=\{0,1\}$ and the $n$-fold Hamming distance $d_n(\hat{x}^n,x^n) = \sum_{i=1}^{n}d(\hat{x}_i,x_i)$. Define a measurable function as follows:
\[
\rho_n(\hat{x}^n,x^n) \triangleq
\begin{cases}
0, & \text{if } 0 \le d_n(\hat{x}^n,x^n) < n\alpha;\\
n\alpha, & \text{if } n\alpha \le d_n(\hat{x}^n,x^n) < 2n\alpha;\\
\infty, & \text{if } 2n\alpha \le d_n(\hat{x}^n,x^n),
\end{cases}
\]
where $0<\alpha<1/2$ is a universal constant. Let $X$ be the product of uniform distributions over $\mathcal{X}$, and let $Y_i \triangleq d(\hat{X}_i,X_i)$ for $1\le i\le n$.
Then
\begin{align*}
\varphi_X(-1/s)
&= \liminf_{n\to\infty}\frac{1}{n}\log E\left[e^{-\rho_n(\hat{X}^n,X^n)/s}\right]\\
&= \liminf_{n\to\infty}\frac{1}{n}\log\left[\Pr\left\{0\le\frac{d_n(\hat{X}^n,X^n)}{n}<\alpha\right\}
+\Pr\left\{\alpha\le\frac{d_n(\hat{X}^n,X^n)}{n}<2\alpha\right\}e^{-n\alpha/s}\right]\\
&= \liminf_{n\to\infty}\frac{1}{n}\log\left[\Pr\left\{0\le\frac{Y_1+\cdots+Y_n}{n}<\alpha\right\}
+\Pr\left\{\alpha\le\frac{Y_1+\cdots+Y_n}{n}<2\alpha\right\}e^{-n\alpha/s}\right]\\
&= \max\left\{-I_Y(\alpha),\ -I_Y(2\alpha)-\frac{\alpha}{s}\right\},
\end{align*}
where $I_Y(a) \triangleq \sup_{t>0}\left\{-ta-\log[(1+e^{-t})/2]\right\}$ (which is exactly the large deviation rate function of $(1/n)\sum_{i=1}^{n}Y_i$). Hence,
\begin{align*}
G_X(R)
&= \sup_{s>0}\left\{-sR - s\max\left\{-I_Y(\alpha),\ -I_Y(2\alpha)-\frac{\alpha}{s}\right\}\right\}\\
&= \sup_{s>0}\min\left\{\,s\left[I_Y(\alpha)-R\right],\ s\left[I_Y(2\alpha)-R\right]+\alpha\,\right\}\\
&=
\begin{cases}
\infty, & \text{for } 0\le R< I_Y(2\alpha);\\[1mm]
\dfrac{\alpha\left(I_Y(\alpha)-R\right)}{I_Y(\alpha)-I_Y(2\alpha)}, & \text{for } I_Y(2\alpha)\le R< I_Y(\alpha);\\[1mm]
0, & \text{for } R\ge I_Y(\alpha).
\end{cases}
\end{align*}
We next derive $\Lambda_X(R)$. Since
\[
\Pr\left\{\frac{1}{n}\rho_n(\hat{X}^n,X^n)\le a\right\}
=
\begin{cases}
\Pr\left\{0\le\dfrac{Y_1+\cdots+Y_n}{n}<\alpha\right\}, & 0\le a<\alpha;\\[2mm]
\Pr\left\{0\le\dfrac{Y_1+\cdots+Y_n}{n}<2\alpha\right\}, & \alpha\le a<\infty,
\end{cases}
\]
we obtain
\[
\Lambda_X(R)=
\begin{cases}
\infty, & \text{if } 0\le R< I_Y(2\alpha);\\
\alpha, & \text{if } I_Y(2\alpha)\le R< I_Y(\alpha);\\
0, & \text{if } I_Y(\alpha)\le R.
\end{cases}
\]
Consequently, $\Lambda_X(R) > G_X(R)$ for $I_Y(2\alpha) < R < I_Y(\alpha)$.
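The gap between $\Lambda_X$ and $G_X$ in this example is easy to exhibit numerically (our own sketch; it uses the standard closed form $I_Y(a)=\log 2 - H(a)$ for the lower tail of Bernoulli$(1/2)$, with $H$ the natural-log binary entropy, and evaluates both quantities at the midpoint of the gap).

```python
import math

def H(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def I_Y(a):
    """Large deviation rate of the Bernoulli(1/2) lower tail: log 2 - H(a)."""
    return math.log(2) - H(a)

alpha = 0.1
I1, I2 = I_Y(alpha), I_Y(2 * alpha)   # I_Y(alpha) > I_Y(2*alpha)
R = (I1 + I2) / 2                      # a rate strictly inside the gap
G = alpha * (I1 - R) / (I1 - I2)       # VG-type bound of Example 9.36 at R
Lam = alpha                            # sup-distance-spectrum value at R
```

At this midpoint the bound gives $G = \alpha/2$, strictly below $\Lambda_X(R)=\alpha$.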
One of the problems that remains open in combinatorial coding theory is the tightness of the asymptotic Varshamov-Gilbert bound for binary codes under the Hamming distance [3, page vii]. As mentioned in Section I, it is already known that the asymptotic Varshamov-Gilbert bound is in general not tight, e.g., for algebraic-geometric codes with large code alphabet size under the Hamming distance. Example 9.36 provides another example confirming the untightness of the asymptotic Varshamov-Gilbert bound, here for a simple binary code and a quantized Hamming measure.
By the generalized Gärtner-Ellis lower bound derived in [5, Thm. 2.1], we conclude that $\bar{\omega}_X(a) = I_X(a)$ (or, equivalently, $\Lambda_X(R) = G_X(R)$) if
\[
\left(\forall\, a\in\left(D_0(X),D_p(X)\right)\right)\left(\exists\,\theta<0\right)\quad
a\in\left[\limsup_{t\downarrow 0}\frac{\varphi_X(\theta+t)-\varphi_X(\theta)}{t},\
\liminf_{t\downarrow 0}\frac{\varphi_X(\theta)-\varphi_X(\theta-t)}{t}\right]. \tag{9.4.30}
\]
Note that although (9.4.30) guarantees that $\Lambda_X(R) = G_X(R)$, it does not by any means ensure the tightness of the Varshamov-Gilbert bound. An additional assumption needs to be made, which is summarized in the next observation.
Observation 9.37 If there exists an $\tilde{X}$ such that
\[
\sup_X \Lambda_X(R) = \Lambda_{\tilde{X}}(R)
\]
and (9.4.30) holds for $\tilde{X}$, then the asymptotic Varshamov-Gilbert lower bound is tight.

The problem in applying the above observation is the difficulty of finding the optimizing process $\tilde{X}$. However, it does provide an alternative way to prove the tightness of the asymptotic Varshamov-Gilbert bound. Instead of finding an upper bound to meet the lower bound, one could show that (9.4.30) holds for a fairly general class of random-code generating processes, in which the optimizing process surely lies.
9.4.4 Elias bound: a single-letter upper bound formula on the largest minimum distance for block codes

As stated in the previous chapter, the number of compositions for an i.i.d. source is of polynomial order. Therefore, we can divide any block code $\mathcal{C}$ into a polynomial number of constant-composition subcodes. To be more precise,
\[
(n+1)^{|\mathcal{X}|}\max_{C_i}|\mathcal{C}\cap C_i|
\ \ge\ |\mathcal{C}| = \sum_{C_i}|\mathcal{C}\cap C_i|
\ \ge\ \max_{C_i}|\mathcal{C}\cap C_i|, \tag{9.4.31}
\]
where $C_i$ represents a set of $x^n$ with the same composition. As a result of (9.4.31), the rate of the code $\mathcal{C}$ should be equal to that of $\mathcal{C}\cap C_{i^*}$, where $C_{i^*}$ is the maximizer of (9.4.31). We can then conclude that any block code has a constant-composition subcode of the same rate, and the largest minimum distance for block codes should be upper bounded by that for constant-composition block codes.
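The pigeonhole step (9.4.31) is easy to verify by brute force. The sketch below is our own illustration: it splits an arbitrary binary block code into constant-composition subcodes and checks the sandwich inequality.

```python
from itertools import product
from collections import Counter

n, alphabet = 6, (0, 1)
# An arbitrary block code: the first 40 binary words of length n.
code = list(product(alphabet, repeat=n))[:40]

# Group codewords by composition (the empirical count of each symbol).
subcodes = {}
for w in code:
    key = tuple(sorted(Counter(w).items()))
    subcodes.setdefault(key, []).append(w)
largest = max(len(s) for s in subcodes.values())
```

Since there are at most $(n+1)^{|\mathcal{X}|}$ compositions, the largest subcode carries the full rate to first order in the exponent.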
The idea of the Elias bound for i.i.d. sources is to upper bound the largest minimum distance among codewords by the average distance among the codewords lying on a spherical surface. Note that the minimum distance among all codewords is no larger than the minimum distance among the codewords on a spherical surface, which in turn is upper bounded by the corresponding average distance.
By observing that points located on a spherical surface in general have constant joint composition with the center, we can model the spherical surface in terms of a joint composition.^{13} To be more precise, the spherical surface centered at $x^n$ with radius $r$ can be represented by
\[
\mathcal{A}_{x^n,r} \triangleq \{\hat{x}^n : d(\hat{x}^n,x^n) = r\}.
\]
^{13} For some distance measures, points on the spherical surface may have more than one joint
(Recall that the sphere is modeled as $\{\hat{x}^n : d(\hat{x}^n,x^n)\le r\}$.) For an additive distance measure,
\[
d(\hat{x}^n,x^n) \triangleq \sum_{\ell=1}^{n}d(\hat{x}_\ell,x_\ell)
= \sum_{i=1}^{K}\sum_{j=1}^{K}\#ij(\hat{x}^n,x^n)\,d(i,j),
\]
where $\#ij(\hat{x}^n,x^n)$ represents the number of positions $\ell$ with $(\hat{x}_\ell,x_\ell)=(i,j)$, and $\{n_{ij}\}_{1\le i\le K,\,1\le j\le K}$ is a fixed joint composition.
Example 9.38 Assume a binary alphabet. Suppose $n=4$ and the marginal composition is $\{\#0,\#1\}=\{2,2\}$. Then the set of all words with such composition is
\[
\{0011,\,0101,\,0110,\,1010,\,1100,\,1001\},
\]
which has $4!/(2!\,2!)=6$ elements. Choose the joint composition to be
\[
\{\#00,\#01,\#10,\#11\}=\{1,1,1,1\}.
\]
Then the spherical surface centered at $(0011)$ is
\[
\{0101,\,1001,\,0110,\,1010\},
\]
which has $[2!/(1!\,1!)]\cdot[2!/(1!\,1!)]=4$ elements.
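Example 9.38 can be reproduced by direct enumeration (a sketch of ours, not from the text):

```python
from itertools import permutations
from collections import Counter

# All length-4 binary words with marginal composition {#0, #1} = {2, 2}.
words = sorted(set(permutations("0011")))

center = ('0', '0', '1', '1')
# The chosen joint composition {#00, #01, #10, #11} = {1, 1, 1, 1}.
target = {('0', '0'): 1, ('0', '1'): 1, ('1', '0'): 1, ('1', '1'): 1}

# Words sharing that joint composition with the center: the spherical surface.
surface = [w for w in words if Counter(zip(w, center)) == target]
```

The enumeration returns the 6 constant-composition words and the 4-element surface stated above.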
Lemma 9.39 (average number of composition-constant codewords on a spherical surface) Fix a composition $\{n_i\}_{1\le i\le K}$ for the code alphabet $\{1,2,\ldots,K\}$.
compositions with the center. For example, if $d(0,1)=d(0,2)=1$ for a ternary alphabet, then both $0011$ and $0022$ lie on the spherical surface $\{\hat{x}^4 : d(0000,\hat{x}^4)=2\}$; however, they have different joint compositions with the center. In this subsection, our concern is distance measures for which points on a spherical surface in general have constant joint composition with the centers.
The average number of the $M_n$ codewords of blocklength $n$ on a spherical surface defined over a joint composition $\{n_{ij}\}_{1\le i\le K,\,1\le j\le K}$ (with $\sum_{j=1}^{K}n_{ij}=\sum_{j=1}^{K}n_{ji}=n_i$) is equal to
\[
\bar{T} = M_n\,\frac{\left(\prod_{i=1}^{K}n_i!\right)^2}{n!\prod_{i=1}^{K}\prod_{j=1}^{K}n_{ij}!}.
\]
Proof:

Step 1: Number of words. The number of words that have composition $\{n_i\}_{1\le i\le K}$ is $N \triangleq n!/\prod_{i=1}^{K}n_i!$. The number of words on the spherical surface centered at $x^n$ (itself having composition $\{n_i\}_{1\le i\le K}$),
\[
\mathcal{A}(x^n) \triangleq \{\hat{x}^n\in\mathcal{X}^n : \#ij(\hat{x}^n,x^n)=n_{ij}\},
\]
is equal to
\[
J = \prod_{i=1}^{K}\frac{n_i!}{\prod_{j=1}^{K}n_{ij}!}.
\]
Note that since $\sum_{i=1}^{K}n_{ij}=n_j$, $\mathcal{A}(x^n)$ is equivalent to
\[
\mathcal{A}(x^n) = \left\{\hat{x}^n\in\mathcal{X}^n : \#i\{\hat{x}^n\}=n_i \text{ and } \#ij(\hat{x}^n,x^n)=n_{ij}\right\}.
\]
Step 2: Average. Consider all spherical surfaces centered at words with composition $\{n_i\}_{1\le i\le K}$. There are $N$ of them, which we index by $r$ in a fixed order, i.e., $\mathcal{A}_1,\mathcal{A}_2,\ldots,\mathcal{A}_r,\ldots,\mathcal{A}_N$. Form a codebook $\mathcal{C}$ with $M_n$ codewords, each having composition $\{n_i\}_{1\le i\le K}$. Because each codeword shares joint composition $\{n_{ij}\}$ with $J$ words of composition $\{n_i\}_{1\le i\le K}$, it lies on $J$ spherical surfaces among $\mathcal{A}_1,\ldots,\mathcal{A}_N$. In other words, if $T_r$ denotes the number of codewords on $\mathcal{A}_r$, i.e., $T_r \triangleq |\mathcal{C}\cap\mathcal{A}_r|$, then
\[
\sum_{r=1}^{N}T_r = M_n J.
\]
Hence, the average number of codewords on one spherical surface satisfies
\[
\bar{T} = \frac{M_n J}{N}
= M_n\,\frac{\left(\prod_{i=1}^{K}n_i!\right)^2}{n!\prod_{i=1}^{K}\prod_{j=1}^{K}n_{ij}!}. \qquad\Box
\]
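Lemma 9.39 can be checked exhaustively for the tiny case of Example 9.38 (a brute-force sketch of ours; `on_surface` is a helper name we introduce):

```python
from itertools import permutations
from collections import Counter

words = sorted(set(permutations("0011")))  # composition {2, 2}; N = 6 surfaces
target = {('0', '0'): 1, ('0', '1'): 1, ('1', '0'): 1, ('1', '1'): 1}  # each n_ij = 1

def on_surface(w, center):
    """True when w shares the chosen joint composition with the center."""
    return Counter(zip(w, center)) == target

code = words  # a codebook consisting of all M_n = 6 constant-composition words
T = [sum(on_surface(w, c) for w in code) for c in words]  # T_r for each surface
avg = sum(T) / len(words)
# Closed form: M_n (prod n_i!)^2 / (n! prod n_ij!) = 6 * (2!*2!)^2 / (4! * 1) = 4.
```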
Observation 9.40 (Reformulating $\bar{T}$)
\[
a(n)\,e^{n(R-I(X;\hat{X}))} < \bar{T} < b(n)\,e^{n(R-I(X;\hat{X}))},
\]
where $P_{X\hat{X}}$ is the composition distribution corresponding to $\{n_{ij}\}_{1\le i\le K,\,1\le j\le K}$ defined in the previous lemma, and $a(n)$ and $b(n)$ are both polynomial functions of $n$.
Proof: We can write $\bar{T}$ as
\[
\bar{T} = M_n\cdot\frac{n!}{\prod_{i=1}^{K}\prod_{j=1}^{K}n_{ij}!}
\cdot\left(\frac{\prod_{i=1}^{K}n_i!}{n!}\right)^2.
\]
By Stirling's approximation,^{14} we obtain
\[
\lambda_1(n)\,e^{nH(X,\hat{X})} < \frac{n!}{\prod_{i=1}^{K}\prod_{j=1}^{K}n_{ij}!} < \lambda_2(n)\,e^{nH(X,\hat{X})}
\]
and
\[
\lambda_1(n)\,e^{nH(X)} < \frac{n!}{\prod_{i=1}^{K}n_i!} < \lambda_2(n)\,e^{nH(X)},
\]
where^{15} $\lambda_1(n) = 6^{-K^2}n^{(1-K^2)/2}$ and $\lambda_2(n) = 6\sqrt{n}$. Hence, with $M_n = e^{nR}$,
\[
\frac{\lambda_1(n)}{\lambda_2^2(n)}\,e^{n(R+H(X,\hat{X})-H(X)-H(\hat{X}))}
< \bar{T} <
\frac{\lambda_2(n)}{\lambda_1^2(n)}\,e^{n(R+H(X,\hat{X})-H(X)-H(\hat{X}))}.
\]
Since both marginal compositions equal $\{n_i\}$, $H(X)=H(\hat{X})$, and the exponent equals $n(R-I(X;\hat{X}))$.
.
^{14} Stirling's approximation:
\[
\sqrt{2\pi n}\left(\frac{n}{e}\right)^n < n! < \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1+\frac{1}{12n-1}\right).
\]
^{15} Using the simplified Stirling approximation
\[
\sqrt{n}\left(\frac{n}{e}\right)^n < n! < 6\sqrt{n}\left(\frac{n}{e}\right)^n,
\]
we have
\[
\frac{\sqrt{n}\,(n/e)^n}{\prod_{i=1}^{K}\prod_{j=1}^{K}6\sqrt{n_{ij}}\,(n_{ij}/e)^{n_{ij}}}
< \frac{n!}{\prod_{i=1}^{K}\prod_{j=1}^{K}n_{ij}!}
< \frac{6\sqrt{n}\,(n/e)^n}{\prod_{i=1}^{K}\prod_{j=1}^{K}\sqrt{n_{ij}}\,(n_{ij}/e)^{n_{ij}}},
\]
i.e.,
\[
\frac{\sqrt{n}}{6^{K^2}\prod_{i,j}\sqrt{n_{ij}}}\cdot\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}}
< \frac{n!}{\prod_{i,j}n_{ij}!}
< \frac{6\sqrt{n}}{\prod_{i,j}\sqrt{n_{ij}}}\cdot\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}}.
\]
Since $n^{(1-K^2)/2}\le \sqrt{n}/\prod_{i,j}\sqrt{n_{ij}}$ and $\prod_{i,j}\sqrt{n_{ij}}\ge 1$,
$\Box$
This observation tells us that if we choose $M_n$ sufficiently large, say $R > I(X;\hat{X})$, then $\bar{T}$ goes to infinity. We can then conclude that there exists a spherical surface such that $T_r = |\mathcal{C}\cap\mathcal{A}_r|$ goes to infinity. Since the minimum distance among all codewords must be upper bounded by the average distance among codewords on a spherical surface $\mathcal{A}_r$, an upper bound on the minimum distance can be established via the next theorem.
Theorem 9.41 For a Hamming-like distance, which is defined by
\[
d(i,j)=
\begin{cases}
0, & \text{if } i=j,\\
A, & \text{if } i\ne j,
\end{cases}
\]
the asymptotic largest minimum distance satisfies
\[
d(R) \le \max_{P_X}\ \min_{\substack{P_{\hat{X}\tilde{X}}\,:\,P_{\hat{X}}=P_{\tilde{X}}=P_X,\\ I(\hat{X};\tilde{X})<R}}
\sum_{i=1}^{K}\sum_{j=1}^{K}\sum_{k=1}^{K}
\frac{P_{\hat{X}\tilde{X}}(i,k)\,P_{\hat{X}\tilde{X}}(j,k)}{P_X(k)}\,d(i,j).
\]
Proof:

Step 1: Notations. Choose a composition-constant code $\mathcal{C}$ with rate $R$ and composition $\{n_i\}_{1\le i\le K}$, and also choose a joint composition $\{n_{ij}\}_{1\le i\le K,\,1\le j\le K}$ whose marginal compositions are both $\{n_i\}_{1\le i\le K}$ and which satisfies $I(\hat{X};\tilde{X})<R$, where $P_{\hat{X}\tilde{X}}$ is its joint composition distribution. From Observation 9.40,
the above derivation can be continued as
\[
6^{-K^2}n^{(1-K^2)/2}\cdot\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}}
< \frac{n!}{\prod_{i,j}n_{ij}!}
< 6\sqrt{n}\cdot\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}},
\]
i.e.,
\[
\lambda_1(n)\cdot\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}}
< \frac{n!}{\prod_{i,j}n_{ij}!}
< \lambda_2(n)\cdot\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}}.
\]
Finally,
\begin{align*}
\frac{n^n}{\prod_{i,j}n_{ij}^{\,n_{ij}}}
&= e^{\,n\log n-\sum_{i=1}^{K}\sum_{j=1}^{K}n_{ij}\log n_{ij}}
= e^{\,\sum_{i,j}n_{ij}\log n-\sum_{i,j}n_{ij}\log n_{ij}}\\
&= e^{-\sum_{i,j}n_{ij}\log(n_{ij}/n)}
= e^{-n\sum_{i,j}P_{X\hat{X}}(i,j)\log P_{X\hat{X}}(i,j)}
= e^{\,nH(X,\hat{X})}.
\end{align*}
if $I(\hat{X};\tilde{X})<R$, then $\bar{T}$ increases exponentially with respect to $n$. We can then find a spherical surface $\mathcal{A}_r$ such that the number of codewords on it, namely $T_r = |\mathcal{C}\cap\mathcal{A}_r|$, increases exponentially in $n$, where the spherical surface is defined on the joint composition $\{n_{ij}\}_{1\le i\le K,\,1\le j\le K}$.

Define
\[
\bar{d}_r \triangleq
\frac{\displaystyle\sum_{x^n\in\mathcal{C}\cap\mathcal{A}_r}\ \sum_{\hat{x}^n\in\mathcal{C}\cap\mathcal{A}_r,\ \hat{x}^n\ne x^n} d(\hat{x}^n,x^n)}{T_r(T_r-1)}.
\]
Then it is obvious that the minimum distance among codewords is upper bounded by $\bar{d}_r$.
Step 2: Upper bound on $\bar{d}_r$.
\begin{align*}
\bar{d}_r
&= \frac{\displaystyle\sum_{x^n\in\mathcal{C}\cap\mathcal{A}_r}\ \sum_{\hat{x}^n\in\mathcal{C}\cap\mathcal{A}_r,\ \hat{x}^n\ne x^n} d(\hat{x}^n,x^n)}{T_r(T_r-1)}\\
&= \frac{\displaystyle\sum_{x^n\in\mathcal{C}\cap\mathcal{A}_r}\ \sum_{\hat{x}^n\in\mathcal{C}\cap\mathcal{A}_r,\ \hat{x}^n\ne x^n}
\left[\sum_{i=1}^{K}\sum_{j=1}^{K}\#ij(\hat{x}^n,x^n)\,d(i,j)\right]}{T_r(T_r-1)}.
\end{align*}
Let $1_{ij|\ell}(\hat{x}^n,x^n)=1$ if $\hat{x}_\ell=i$ and $x_\ell=j$, and $0$ otherwise. Then
\begin{align*}
\bar{d}_r
&= \frac{\displaystyle\sum_{x^n}\ \sum_{\hat{x}^n\ne x^n}
\left[\sum_{i=1}^{K}\sum_{j=1}^{K}\left(\sum_{\ell=1}^{n}1_{ij|\ell}(\hat{x}^n,x^n)\right)d(i,j)\right]}{T_r(T_r-1)}\\
&= \sum_{i=1}^{K}\sum_{j=1}^{K}
\left[\sum_{\ell=1}^{n}\ \sum_{x^n}\ \sum_{\hat{x}^n\ne x^n}1_{ij|\ell}(\hat{x}^n,x^n)\right]
\frac{d(i,j)}{T_r(T_r-1)},
\end{align*}
where all codeword sums run over $\mathcal{C}\cap\mathcal{A}_r$.
Define
\[
T_{i|\ell} \triangleq \left|\{x^n\in\mathcal{C}\cap\mathcal{A}_r : x_\ell = i\}\right|
\quad\text{and}\quad
\gamma_{i|\ell} \triangleq \frac{T_{i|\ell}}{T_r}.
\]
Then $\sum_{i=1}^{K}T_{i|\ell} = T_r$. (Hence, $\{\gamma_{i|\ell}\}_{i=1}^{K}$ is a probability mass function.) In addition, let $\beta_{j|\ell}$ be $1$ when the $\ell$-th component of the center of the spherical surface $\mathcal{A}_r$ is $j$, and $0$ otherwise. Then
\[
\sum_{\ell=1}^{n}T_{i|\ell}\,\beta_{j|\ell} = T_r\,n_{ij},
\]
which is equivalent to
\[
\sum_{\ell=1}^{n}\gamma_{i|\ell}\,\beta_{j|\ell} = n_{ij}.
\]
Therefore,
\begin{align*}
\bar{d}_r
&= \sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{\ell=1}^{n}T_{i|\ell}T_{j|\ell}\right]\frac{d(i,j)}{T_r(T_r-1)}
\qquad(\text{the diagonal terms } i=j \text{ do not matter since } d(i,i)=0)\\
&= \frac{T_r}{T_r-1}\sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{\ell=1}^{n}\gamma_{i|\ell}\gamma_{j|\ell}\right]d(i,j)\\
&\le \frac{T_r}{T_r-1}
\max_{\substack{\gamma\,:\,\sum_{i=1}^{K}\gamma_{i|\ell}=1\ \forall\ell,\\ \sum_{\ell=1}^{n}\gamma_{i|\ell}\beta_{j|\ell}=n_{ij}}}
\ \sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{\ell=1}^{n}\gamma_{i|\ell}\gamma_{j|\ell}\right]d(i,j).
\end{align*}
Step 3: Claim. The optimizer of the above maximization is
\[
\gamma^*_{i|\ell} = \sum_{k=1}^{K}\frac{n_{ik}}{n_k}\,\beta_{k|\ell}.
\]
(The maximizer can actually be found via Lagrange multiplier techniques. For simplicity, a direct derivation is provided in the following step.)

proof: For any $\gamma$ satisfying the constraints,
\[
\sum_{\ell=1}^{n}\gamma^*_{i|\ell}\gamma_{j|\ell}
= \sum_{\ell=1}^{n}\left[\sum_{k=1}^{K}\frac{n_{ik}}{n_k}\beta_{k|\ell}\right]\gamma_{j|\ell}
= \sum_{k=1}^{K}\frac{n_{ik}}{n_k}\left[\sum_{\ell=1}^{n}\beta_{k|\ell}\gamma_{j|\ell}\right]
= \sum_{k=1}^{K}\frac{n_{ik}\,n_{jk}}{n_k},
\]
and the same value results when $\gamma$ is replaced by $\gamma^*$ (which is easily checked to satisfy the constraints).
By the fact that
\begin{align*}
\sum_{\ell=1}^{n}\gamma_{i|\ell}\gamma_{j|\ell}-\sum_{\ell=1}^{n}\gamma^*_{i|\ell}\gamma^*_{j|\ell}
&= \sum_{\ell=1}^{n}\left(\gamma_{i|\ell}\gamma_{j|\ell}-\gamma^*_{i|\ell}\gamma^*_{j|\ell}\right)\\
&= \sum_{\ell=1}^{n}\left(\gamma_{i|\ell}\gamma_{j|\ell}-\gamma_{i|\ell}\gamma^*_{j|\ell}-\gamma^*_{i|\ell}\gamma_{j|\ell}+\gamma^*_{i|\ell}\gamma^*_{j|\ell}\right)\\
&= \sum_{\ell=1}^{n}\left(\gamma_{i|\ell}-\gamma^*_{i|\ell}\right)\left(\gamma_{j|\ell}-\gamma^*_{j|\ell}\right)
= \sum_{\ell=1}^{n}c_{i|\ell}\,c_{j|\ell},
\end{align*}
where $c_{i|\ell}\triangleq\gamma_{i|\ell}-\gamma^*_{i|\ell}$ (the second equality uses $\sum_{\ell}\gamma_{i|\ell}\gamma^*_{j|\ell}=\sum_{\ell}\gamma^*_{i|\ell}\gamma_{j|\ell}=\sum_{\ell}\gamma^*_{i|\ell}\gamma^*_{j|\ell}$, a consequence of the previous computation), we obtain
\begin{align*}
&\sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{\ell=1}^{n}\gamma_{i|\ell}\gamma_{j|\ell}\right]d(i,j)
-\sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{\ell=1}^{n}\gamma^*_{i|\ell}\gamma^*_{j|\ell}\right]d(i,j)\\
&\quad= \sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{\ell=1}^{n}c_{i|\ell}c_{j|\ell}\right]d(i,j)
= \sum_{\ell=1}^{n}\left[\sum_{i=1}^{K}\sum_{j=1}^{K}c_{i|\ell}c_{j|\ell}\,d(i,j)\right]\\
&\quad= \sum_{\ell=1}^{n}A\left[\sum_{i=1}^{K}\sum_{j=1,\,j\ne i}^{K}c_{i|\ell}c_{j|\ell}\right]
= \sum_{\ell=1}^{n}A\left[\sum_{i=1}^{K}\sum_{j=1}^{K}c_{i|\ell}c_{j|\ell}-\sum_{i=1}^{K}c^2_{i|\ell}\right]\\
&\quad= \sum_{\ell=1}^{n}A\left[\left(\sum_{i=1}^{K}c_{i|\ell}\right)\left(\sum_{j=1}^{K}c_{j|\ell}\right)-\sum_{i=1}^{K}c^2_{i|\ell}\right]
= -\sum_{\ell=1}^{n}A\sum_{i=1}^{K}c^2_{i|\ell}\ \le\ 0,
\end{align*}
since $\sum_{i=1}^{K}c_{i|\ell}=1-1=0$ for every $\ell$.
Accordingly,
\begin{align*}
\bar{d}_r
&\le \frac{T_r}{T_r-1}\sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{k=1}^{K}\frac{n_{ik}\,n_{jk}}{n_k}\right]d(i,j)\\
&= \frac{T_r}{T_r-1}\,n\sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{k=1}^{K}\frac{P_{\hat{X}\tilde{X}}(i,k)\,P_{\hat{X}\tilde{X}}(j,k)}{P_X(k)}\right]d(i,j),
\end{align*}
which implies that
\[
\frac{1}{n}\,\bar{d}_r \le \frac{T_r}{T_r-1}\sum_{i=1}^{K}\sum_{j=1}^{K}\left[\sum_{k=1}^{K}\frac{P_{\hat{X}\tilde{X}}(i,k)\,P_{\hat{X}\tilde{X}}(j,k)}{P_X(k)}\right]d(i,j). \tag{9.4.32}
\]
Step 4: Final step. Since (9.4.32) holds for any joint composition satisfying the constraints, and $T_r$ goes to infinity as $n$ goes to infinity, we can take the minimum over $P_{\hat{X}\tilde{X}}$ and obtain a better bound. However, the marginal composition is chosen so that the composition-constant subcode has the same rate as the optimal code, which is unknown to us. Hence, we take a pessimistic viewpoint and take the maximum over all $P_X$, which completes the proof. $\Box$

The upper bound in the above theorem is called the Elias bound and will be denoted by $d_E(R)$.
9.4.5 Gilbert bound and Elias bound for the Hamming distance and binary alphabet

The (general) Gilbert bound is
\[
G(R) \triangleq \sup_X \sup_{s>0}\left[-sR - s\,\varphi_X(s)\right],
\]
where $\varphi_X(s) \triangleq \limsup_{n\to\infty}(1/n)\log E[\exp\{-\rho_n(\hat{X}^n,X^n)/s\}]$, and $\hat{X}^n$ and $X^n$ are independent. For a binary alphabet,
\[
\Pr\{\rho(\hat{X},X)\le y\} =
\begin{cases}
0, & y<0,\\
\Pr\{\hat{X}=X\}, & 0\le y<1,\\
1, & y\ge 1.
\end{cases}
\]
Hence,
\[
G(R) = \sup_{0<p<1}\ \sup_{s>0}\left\{-sR - s\log\left[p^2+(1-p)^2+2p(1-p)e^{-1/s}\right]\right\},
\]
where $P_X(0)=p$ and $P_X(1)=1-p$. By taking the derivative with respect to $p$, the optimizer can be found to be $p=1/2$. Therefore,
\[
G(R) = \sup_{s>0}\left\{-sR - s\log\frac{1+e^{-1/s}}{2}\right\}.
\]
The Elias bound for the Hamming distance and binary alphabet is equal to
\begin{align*}
d_E(R)
&\triangleq \max_{P_X}\ \min_{\substack{P_{\hat{X}\tilde{X}}\,:\,P_{\hat{X}}=P_{\tilde{X}}=P_X,\\ I(\hat{X};\tilde{X})<R}}
\sum_{i=1}^{K}\sum_{j=1}^{K}\sum_{k=1}^{K}\frac{P_{\hat{X}\tilde{X}}(i,k)\,P_{\hat{X}\tilde{X}}(j,k)}{P_X(k)}\,d(i,j)\\
&= \max_{P_X}\ \min_{\substack{P_{\hat{X}\tilde{X}}\,:\,P_{\hat{X}}=P_{\tilde{X}}=P_X,\\ I(\hat{X};\tilde{X})<R}}
2\sum_{k=1}^{K}\frac{P_{\hat{X}\tilde{X}}(0,k)\,P_{\hat{X}\tilde{X}}(1,k)}{P_X(k)}\\
&= \max_{0<p<1}\
\min_{\left\{0\le a\le p\,:\,a\log\frac{a}{p^2}+2(p-a)\log\frac{p-a}{p(1-p)}+(1-2p+a)\log\frac{1-2p+a}{(1-p)^2}=R\right\}}
2\left[\frac{a(p-a)}{p}+\frac{(p-a)(1-2p+a)}{1-p}\right],
\end{align*}
where $P_X(0)=p$ and $P_{\hat{X}\tilde{X}}(0,0)=a$. Since
\[
f(p,a) \triangleq 2\left[\frac{a(p-a)}{p}+\frac{(p-a)(1-2p+a)}{1-p}\right]
\]
is non-increasing with respect to $a$ (on $a\ge p^2$, the range of interest), by denoting the solution of
\[
a\log\frac{a}{p^2}+2(p-a)\log\frac{p-a}{p(1-p)}+(1-2p+a)\log\frac{1-2p+a}{(1-p)^2}=R
\]
by $a_p$ (a function of $p$), we have
\[
d_E(R) = \max_{0<p<1} f\left(p,\min\{a_p,p\}\right).
\]
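For the binary/Hamming case, at $p=1/2$ the two bounds reduce to the classical forms: the Gilbert (Varshamov-Gilbert) relative distance $\delta$ solves $H(\delta)=\log 2-R$, while the Elias bound evaluates to $2\delta(1-\delta)$. The sketch below is our own (natural logarithms throughout, as in this chapter); the reduction to these closed forms at $p=1/2$ is a standard simplification we assume rather than re-derive here.

```python
import math

def Hn(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def delta_gv(R):
    """Solve H(delta) = log 2 - R for delta in [0, 1/2] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(200):
        mid = (lo + hi) / 2
        if Hn(mid) < math.log(2) - R:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

R = 0.2
d = delta_gv(R)
gilbert = d               # Gilbert lower bound on (minimum distance)/n
elias = 2 * d * (1 - d)   # Elias upper bound evaluated at p = 1/2
```

As expected, the lower bound sits strictly below the upper bound at intermediate rates.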
9.4.6 Bhattacharyya distance and expurgated exponent

The Gilbert bound for an additive distance measure $\rho_n(x^n,\hat{x}^n)=\sum_{i=1}^{n}\rho(x_i,\hat{x}_i)$ can be written as
\[
G(R) = \max_{s>0}\ \max_{P_X}\left\{-sR-s\log\left[\sum_{x\in\mathcal{X}}\sum_{x'\in\mathcal{X}}P_X(x)P_X(x')\,e^{-\rho(x,x')/s}\right]\right\}. \tag{9.4.33}
\]
Recall that the expurgated exponent for a DMC with generic distribution $P_{Y|X}$ is defined by
\[
E_{\mathrm{ex}}(R) \triangleq \max_{s\ge 1}\ \max_{P_X}\left\{-sR-s\log\sum_{x\in\mathcal{X}}\sum_{x'\in\mathcal{X}}P_X(x)P_X(x')
\left[\sum_{y}\sqrt{P_{Y|X}(y|x)\,P_{Y|X}(y|x')}\right]^{1/s}\right\},
\]
which can be reformulated as
\[
E_{\mathrm{ex}}(R) \triangleq \max_{s\ge 1}\ \max_{P_X}\left\{-sR-s\log\left[\sum_{x\in\mathcal{X}}\sum_{x'\in\mathcal{X}}P_X(x)P_X(x')\,
e^{\left(\log\sum_{y\in\mathcal{Y}}\sqrt{P_{Y|X}(y|x)\,P_{Y|X}(y|x')}\right)/s}\right]\right\}. \tag{9.4.34}
\]
Comparing (9.4.33) and (9.4.34), we can easily find that the expurgated lower bound (to the reliability function) is nothing but the Gilbert lower bound (to the minimum distance among codewords) with
\[
\rho(x,x') \triangleq -\log\sum_{y}\sqrt{P_{Y|X}(y|x)\,P_{Y|X}(y|x')}.
\]
This is called the Bhattacharyya distance for the channel $P_{Y|X}$.
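The Bhattacharyya distance is straightforward to compute for a concrete channel. In this sketch (our own illustration; the channel is represented as a nested dict `{x: {y: prob}}`), a binary symmetric channel with crossover $\varepsilon$ gives $\rho(x,x)=0$ and $\rho(0,1)=\rho(1,0)=-\log(2\sqrt{\varepsilon(1-\varepsilon)})$, i.e., a Hamming-like distance, consistent with the equidistant-channel remark below.

```python
import math

def bhattacharyya(ch, x1, x2):
    """rho(x1, x2) = -log sum_y sqrt(P(y|x1) P(y|x2))."""
    ys = set(ch[x1]) | set(ch[x2])
    return -math.log(sum(math.sqrt(ch[x1].get(y, 0.0) * ch[x2].get(y, 0.0))
                         for y in ys))

eps = 0.1
bsc = {0: {0: 1 - eps, 1: eps},
       1: {0: eps, 1: 1 - eps}}
```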
Note that the channel reliability function $E(R)$ for a DMC satisfies
\[
E_{\mathrm{ex}}(R) \le E(R) \le \limsup_{n\to\infty}\frac{1}{n}\,d_{n,M=e^{nR}};
\]
also note that
\[
G(R) \le \limsup_{n\to\infty}\frac{1}{n}\,d_{n,M=e^{nR}},
\]
and that for an optimizer $s^*\ge 1$,
\[
E_{\mathrm{ex}}(R) = G(R).
\]
We conclude that the channel reliability function for a DMC is solved (not only at high rates but also at low rates) if the Gilbert bound is tight for the Bhattacharyya distance! This again confirms the significance of the question of when the Gilbert bound is tight.

A final remark in this subsection concerns the resemblance between the Bhattacharyya distance and the Hamming distance. The Bhattacharyya distance is in general not necessarily a Hamming-like distance measure. A channel that results in a Hamming-like Bhattacharyya distance is called an equidistant channel. For example, any binary-input channel is an equidistant channel.
9.5 Straight-line bound

It is conjectured that the channel reliability function is convex. Therefore, any non-convex upper bound can be improved by convexifying it. This is the main idea of the straight-line bound.

Definition 9.42 (list decoder) A list decoder decodes the output of a noisy channel to a list of candidates (possible inputs), and an error occurs only when the correct transmitted codeword is not in the list.
Definition 9.43 (maximal error probability for list decoders)
\[
P_{e,\max}(n,M,L) \triangleq \min_{\mathcal{C}\ \text{with } L \text{ candidates for the decoder}}\ \max_{1\le m\le M} P_{e|m},
\]
where $n$ is the blocklength and $M$ is the code size.

Definition 9.44 (average error probability for list decoders)
\[
\bar{P}_e(n,M,L) \triangleq \min_{\mathcal{C}\ \text{with } L \text{ candidates for the decoder}}\ \frac{1}{M}\sum_{1\le m\le M} P_{e|m},
\]
where $n$ is the blocklength and $M$ is the code size.
Lemma 9.45 (lower bound on the average error) For a DMC,
\[
\bar{P}_e(n,M,1) \ge \bar{P}_e(n_1,M,L)\,P_{e,\max}(n_2,L+1,1),
\]
where $n = n_1+n_2$.
Proof:

Step 1: Partition of codewords. Partition each codeword into two parts: a prefix of length $n_1$ and a suffix of length $n_2$. Likewise, divide each output observation into a prefix part of length $n_1$ and a suffix part of length $n_2$.

Step 2: Average error probability. Let $c_1,\ldots,c_M$ be the optimal codewords, and $\mathcal{U}_1,\ldots,\mathcal{U}_M$ be the corresponding output (decoding) partitions. Denote the prefix parts of the codewords by $\tilde{c}_1,\ldots,\tilde{c}_M$ and the suffix parts by $\hat{c}_1,\ldots,\hat{c}_M$. Similar notation applies to the outputs $y^n$. Then
\begin{align*}
\bar{P}_e(n,M,1)
&= \frac{1}{M}\sum_{m=1}^{M}\ \sum_{y^n\notin\mathcal{U}_m}P_{Y^n|X^n}(y^n|c_m)\\
&= \frac{1}{M}\sum_{m=1}^{M}\ \sum_{y^n\notin\mathcal{U}_m}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m)\,P_{Y^{n_2}|X^{n_2}}(\hat{y}^{n_2}|\hat{c}_m)\\
&= \frac{1}{M}\sum_{m=1}^{M}\ \sum_{\tilde{y}^{n_1}\in\mathcal{Y}^{n_1}}\ \sum_{\hat{y}^{n_2}\notin\mathcal{U}_m(\tilde{y}^{n_1})}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m)\,P_{Y^{n_2}|X^{n_2}}(\hat{y}^{n_2}|\hat{c}_m),
\end{align*}
where
\[
\mathcal{U}_m(\tilde{y}^{n_1}) \triangleq \left\{\hat{y}^{n_2}\in\mathcal{Y}^{n_2} : \text{the concatenation of } \tilde{y}^{n_1}\text{ and }\hat{y}^{n_2}\text{ belongs to }\mathcal{U}_m\right\}.
\]
Therefore,
\[
\bar{P}_e(n,M,1)
= \frac{1}{M}\sum_{m=1}^{M}\ \sum_{\tilde{y}^{n_1}\in\mathcal{Y}^{n_1}}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m)
\sum_{\hat{y}^{n_2}\notin\mathcal{U}_m(\tilde{y}^{n_1})}
P_{Y^{n_2}|X^{n_2}}(\hat{y}^{n_2}|\hat{c}_m).
\]
Step 3: $\left|\{1\le m\le M : P_{e|m}(\mathcal{C}) < P_{e,\max}(n,L+1,1)\}\right| \le L$.

Claim: For any code of size $M > L$, the number of indexes $m$ satisfying
\[
P_{e|m} < P_{e,\max}(n,L+1,1)
\]
is at most $L$.

proof: Suppose the claim is not true. Then there exist at least $L+1$ codewords satisfying $P_{e|m}(\mathcal{C}) < P_{e,\max}(n,L+1,1)$. Form a new code $\mathcal{C}'$ of size $L+1$ from these $L+1$ codewords. The error probability given each transmitted codeword is upper bounded by its corresponding original $P_{e|m}$, which implies that $\max_{1\le m\le L+1}P_{e|m}(\mathcal{C}')$ is strictly less than $P_{e,\max}(n,L+1,1)$, contradicting its definition.
Step 4: Continue from Step 2. Treat $\{\mathcal{U}_m(\tilde{y}^{n_1})\}_{m=1}^{M}$ as decoding partitions at the output site for observations of length $n_2$, and denote the resultant error probability given each transmitted codeword by $P_{e|m}(\tilde{y}^{n_1})$. Then
\begin{align*}
\bar{P}_e(n,M,1)
&= \frac{1}{M}\sum_{m=1}^{M}\ \sum_{\tilde{y}^{n_1}\in\mathcal{Y}^{n_1}}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m)\,P_{e|m}(\tilde{y}^{n_1})\\
&\ge \frac{1}{M}\sum_{m\notin\mathcal{L}}\ \sum_{\tilde{y}^{n_1}\in\mathcal{Y}^{n_1}}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m)\,P_{e,\max}(n_2,L+1,1),
\end{align*}
where $\mathcal{L}$ is the set of indexes for which
\[
P_{e|m}(\tilde{y}^{n_1}) < P_{e,\max}(n_2,L+1,1).
\]
Note that $|\mathcal{L}|\le L$ by the claim in Step 3, so that $|\mathcal{L}^c|\ge M-L$. We finally obtain
\[
\bar{P}_e(n,M,1) \ge P_{e,\max}(n_2,L+1,1)\,\frac{1}{M}\sum_{m\notin\mathcal{L}}\ \sum_{\tilde{y}^{n_1}\in\mathcal{Y}^{n_1}}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m).
\]
By partitioning $\mathcal{Y}^{n_1}$ into $|\mathcal{L}^c|$ mutually disjoint decoding sets, and using for each output prefix a decoding candidate list of size at most $L$,
\[
\frac{1}{M}\sum_{m\notin\mathcal{L}}\ \sum_{\tilde{y}^{n_1}\in\mathcal{Y}^{n_1}}
P_{Y^{n_1}|X^{n_1}}(\tilde{y}^{n_1}|\tilde{c}_m) \ge \bar{P}_e(n_1,M,L).
\]
Combining this with the previous inequality completes the proof. $\Box$
Theorem 9.46 (straight-line bound)
\[
E\left(\lambda R_1+(1-\lambda)R_2\right) \le \lambda E_p(R_1)+(1-\lambda)E_{sp}(R_2).
\]
Proof: By applying the previous lemma with $\lambda = n_1/n$, $\log(M/L)/n_1 = R_1$ and $(\log L)/n_2 = R_2$, we can upper bound $\bar{P}_e(n_1,M,L)$ by the partitioning exponent, and $P_{e,\max}(n_2,L+1,1)$ by the sphere-packing exponent. $\Box$
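Mechanically, the straight-line bound just connects a point on one exponent bound to a point on another along a chord. The sketch below is entirely our own, with toy stand-ins for $E_p$ and $E_{sp}$ (the true exponents are not computed here); it only illustrates the combination rule of Theorem 9.46.

```python
def straight_line(E_p, E_sp, R1, R2, lam):
    """Combine a low-rate bound at R1 and a sphere-packing bound at R2:
    rate lam*R1 + (1-lam)*R2 receives the exponent upper bound
    lam*E_p(R1) + (1-lam)*E_sp(R2), per Theorem 9.46."""
    R = lam * R1 + (1 - lam) * R2
    E = lam * E_p(R1) + (1 - lam) * E_sp(R2)
    return R, E

# Hypothetical convex stand-ins for the two exponent bounds.
E_p = lambda R: (1.0 - R) ** 2
E_sp = lambda R: max(0.0, 0.5 - R)

R, E = straight_line(E_p, E_sp, 0.1, 0.4, 0.5)
```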
Figure 9.6: (a) The shaded area is $\mathcal{U}_{c_m}$; (b) the shaded area is $\mathcal{A}_{c_m,m'}$. (Both panels depict the conditional output densities $P_{Y^n|X^n}(\cdot|c_m)$ and $P_{Y^n|X^n}(\cdot|c_{m'})$.)
Figure 9.7: $\Lambda_X(R)$ asymptotically lies between $\operatorname{ess\,inf}(1/n)\rho_n(\hat{X}^n,X^n)$ and $(1/n)E[\rho_n(\hat{X}^n,X^n)\mid\rho_n(\hat{X}^n,X^n)<\infty]$ for $R_p(X)<R<R_0(X)$. (The sketch shows the probability mass of $(1/n)\rho_n(\hat{X}^n,X^n)$.)
Figure 9.8: General curve of $\Lambda_X(R)$, with the marked points $(R_p(X),D_p(X))$ and $(R_0(X),D_0(X))$.
Figure 9.9: The function $\sup_{s>0}\left\{-sR-s\log\left[2s\left(1-e^{-1/2s}\right)\right]\right\}$, decreasing from $1/4$ as $R$ increases.
Figure 9.10: The function $\sup_{s>0}\left\{-sR-s\log\left[\left(2+e^{-1/s}\right)/4\right]\right\}$, infinite for $0<R<\log(4/3)$, equal to $1/3$ at $R=\log(4/3)$, and reaching $0$ by $R=\log 2$.
Bibliography

[1] P. Billingsley, Probability and Measure, 2nd ed. New York, NY: John Wiley and Sons, 1995.
[2] R. E. Blahut, Principles and Practice of Information Theory. Reading, MA: Addison-Wesley, 1988.
[3] V. Blinovsky, Asymptotic Combinatorial Coding Theory. Kluwer Academic Publishers, 1997.
[4] J. A. Bucklew, Large Deviation Techniques in Decision, Simulation, and Estimation. New York: Wiley, 1990.
[5] P.-N. Chen, "Generalization of Gärtner-Ellis theorem," submitted to IEEE Trans. Inform. Theory, Feb. 1998.
[6] P.-N. Chen and F. Alajaji, "Generalized source coding theorems and hypothesis testing," Journal of the Chinese Institute of Engineers, vol. 21, no. 3, pp. 283-303, May 1998.
[7] P.-N. Chen and F. Alajaji, "On the optimistic capacity of arbitrary channels," in Proc. 1998 IEEE Int. Symp. Inform. Theory, Cambridge, MA, USA, Aug. 16-21, 1998.
[8] P.-N. Chen and F. Alajaji, "Optimistic Shannon coding theorems for arbitrary single-user systems," IEEE Trans. Inform. Theory, vol. 45, no. 7, pp. 2623-2629, Nov. 1999.
[9] T. Ericson and V. A. Zinoviev, "An improvement of the Gilbert bound for constant weight codes," IEEE Trans. Inform. Theory, vol. IT-33, no. 5, pp. 721-723, Sep. 1987.
[10] R. G. Gallager, Information Theory and Reliable Communication. John Wiley & Sons, 1968.
[11] G. van der Geer and J. H. van Lint, Introduction to Coding Theory and Algebraic Geometry. Basel: Birkhäuser, 1988.
[12] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. Inform. Theory, vol. 40, no. 4, pp. 1147-1157, Jul. 1994.
[13] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis. New York: Dover Publications, Inc., 1970.
[14] J. H. van Lint, Introduction to Coding Theory, 2nd ed. New York: Springer-Verlag, 1992.
[15] S. N. Litsyn and M. A. Tsfasman, "A note on lower bounds," IEEE Trans. Inform. Theory, vol. IT-32, no. 5, pp. 705-706, Sep. 1986.
[16] J. K. Omura, "On general Gilbert bounds," IEEE Trans. Inform. Theory, vol. IT-19, no. 5, pp. 661-666, Sep. 1973.
[17] H. V. Poor and S. Verdú, "A lower bound on the probability of error in multihypothesis testing," IEEE Trans. Inform. Theory, vol. 41, no. 6, pp. 1992-1994, Nov. 1995.
[18] H. L. Royden, Real Analysis, 3rd ed. New York: Macmillan Publishing Company, 1988.
[19] M. A. Tsfasman and S. G. Vladut, Algebraic-Geometric Codes. Netherlands: Kluwer Academic Publishers, 1991.
[20] S. G. Vladut, "An exhaustion bound for algebraic-geometric modular codes," Probl. Info. Trans., vol. 23, pp. 22-34, 1987.
[21] V. A. Zinoviev and S. N. Litsyn, "Codes that exceed the Gilbert bound," Problemy Peredachi Informatsii, vol. 21, no. 1, pp. 105-108, 1985.
Chapter 10

Information Theory of Networks

In this chapter, we consider the theory of communications among many (three or more) terminals, which is usually called a network. Several kinds of variations on this research topic have been formed. What follow are some of them:

Example 10.1 (multiaccess channels) This is a standard problem in distributed source coding (data compression). In this channel, two or more senders send information to a common receiver. In order to reduce the overall transmitted rate on the network, data compression is applied independently at each sender. As shown in Figure 10.1, we want to find lossless compressors $f_1$, $f_2$, and $f_3$ such that $R_1+R_2+R_3$ is minimized. In this research topic, some other interesting problems to study are, for example, "how do the various senders cooperate with each other, i.e., what are the relations among the $f_i$'s?" and "what is the relation between the rates and the interference among senders?"
Example 10.2 (broadcast channel) Another standard problem in network information theory, which is of great research interest, is the broadcast channel, such as a satellite TV network, which consists of one sender and many receivers. It is usually assumed that each receiver receives the same broadcast information corrupted by different noise.
Example 10.3 (distributed detection) This is a variation of the multi-access channel in which the information of each sender comes from observing the same target. The target may belong to one of several categories. Upon receipt of the compressed data from these senders, the receiver must decide, under some optimization criterion, which category is currently being observed.

Perhaps the simplest model of distributed detection involves only two categories, usually named the null hypothesis and the alternative hypothesis. The criterion is chosen to be the Bayes cost or the Neyman-Pearson type II error probability subject to a fixed bound on the type I error probability. This system is depicted in Figure 10.2.
[Figure 10.1: The multi-access channel. Terminals 1, 2 and 3 apply compressors $f_1$, $f_2$ and $f_3$ with rates $R_1$, $R_2$ and $R_3$ to Inform$_1$, Inform$_2$ and Inform$_3$, respectively; the compressed data travel over a noiseless network (such as a wired net) to Terminal 4, which globally decompresses them to recover Inform$_1$, Inform$_2$ and Inform$_3$.]
Example 10.4 (some other examples)

1) Relay channel. This channel consists of one source sender, one destination receiver, and several intermediate sender-receiver pairs that act as relays to facilitate the communication between the source sender and the destination receiver.

2) Interference channel. Several senders and several receivers communicate simultaneously over a common channel, where interference among them may degrade performance.

3) Two-way communication channel. Instead of the conventional one-way channel, two terminals communicate in a two-way fashion (full duplex).
10.1 Lossless data compression over distributed sources for block codes

10.1.1 Full decoding of the original sources
In this section, we first consider the simplest model of distributed (correlated) sources, which consists of only two random variables, and then extend the result to models consisting of three or more random variables.







[Figure 10.2: Distributed detection with $n$ senders: observations $Y_1, \ldots, Y_n$ are quantized by $g_1, \ldots, g_n$ into $U_1, \ldots, U_n$, and the fusion center forms the global decision $T(U_1, \ldots, U_n)$. Each observation $Y_i$ may come from one of two categories; the final decision $T \in \{H_0, H_1\}$.]
Definition 10.5 (independent encoders among distributed sources) There are several sources $X_1, X_2, \ldots, X_m$ (which may or may not be independent), respectively obtained by $m$ terminals. Before each terminal transmits its local source to the receiver, a block encoder $f_i$ with rate $R_i$ and block length $n$ is applied:
$$f_i : \mathcal{X}_i^n \to \{1, 2, \ldots, 2^{nR_i}\}.$$
It is assumed that there is no conspiracy among the block encoders.
Definition 10.6 (global decoder for independently compressed sources) A global decoder $g(\cdot)$ recovers the original sources after receiving all the independently compressed sources, i.e.,
$$g : \{1, \ldots, 2^{nR_1}\} \times \cdots \times \{1, \ldots, 2^{nR_m}\} \to \mathcal{X}_1^n \times \cdots \times \mathcal{X}_m^n.$$
Definition 10.7 (probability of error) The probability of error is defined as
$$P_e(n) \triangleq \Pr\{g(f_1(X_1^n), \ldots, f_m(X_m^n)) \neq (X_1^n, \ldots, X_m^n)\}.$$
Definition 10.8 (achievable rates) A rate tuple $(R_1, \ldots, R_m)$ is said to be achievable if there exists a sequence of block codes such that
$$\limsup_{n\to\infty} P_e(n) = 0.$$
Definition 10.9 (achievable rate region for distributed sources) The achievable rate region for distributed sources is the set of all achievable rate tuples.
Observation 10.10 The achievable rate region is convex.

Proof: For any two achievable rate tuples $(R_1, \ldots, R_m)$ and $(\bar R_1, \ldots, \bar R_m)$, we can find two sequences of codes that achieve them. Therefore, by proper randomization (time sharing) between these two sequences of codes, we can achieve the rate tuple
$$\lambda (R_1, \ldots, R_m) + (1 - \lambda)(\bar R_1, \ldots, \bar R_m)$$
for any $\lambda \in [0, 1]$. $\Box$
Theorem 10.11 (Slepian-Wolf) For distributed sources consisting of two random variables $X_1$ and $X_2$, the achievable region is
$$R_1 \ge H(X_1|X_2), \qquad R_2 \ge H(X_2|X_1), \qquad R_1 + R_2 \ge H(X_1, X_2).$$
Proof:

1. Achievability Part: We need to show that for any $(R_1, R_2)$ satisfying the constraints, a sequence of code pairs for $X_1$ and $X_2$ with asymptotically zero error probability exists.

Step 1: Random coding. For each source sequence $x_1^n$, randomly assign it an index in $\{1, 2, \ldots, 2^{nR_1}\}$ according to a uniform distribution. This index is the encoding output $f_1(x_1^n)$. Let $\mathcal{B}_i = \{x^n \in \mathcal{X}_1^n : f_1(x^n) = i\}$. Similarly, form $f_2(\cdot)$ by randomly assigning indices, and let $\mathcal{C}_j = \{x^n \in \mathcal{X}_2^n : f_2(x^n) = j\}$. Upon receipt of the two indices $(i, j)$ from the two encoders, the decoder $g(\cdot)$ is defined by
$$g(i,j) = \begin{cases} (x_1^n, x_2^n), & \text{if } |(\mathcal{B}_i \times \mathcal{C}_j) \cap T_n(\delta; X_1^n, X_2^n)| = 1 \\ & \text{and } (x_1^n, x_2^n) \in (\mathcal{B}_i \times \mathcal{C}_j) \cap T_n(\delta; X_1^n, X_2^n); \\ \text{arbitrary}, & \text{otherwise}, \end{cases}$$
where $T_n(\delta; X_1^n, X_2^n)$ is the weakly joint typical set of $X_1^n$ and $X_2^n$.
Step 2: Error probability. A decoding error happens for the source sequence pair $(x_1^n, x_2^n)$ in the following cases:

(Case $\mathcal{E}_0$) $(x_1^n, x_2^n) \notin T_n(\delta; X_1^n, X_2^n)$;

(Case $\mathcal{E}_1$) $(\exists\, \hat x_1^n \neq x_1^n)$ $f_1(\hat x_1^n) = f_1(x_1^n)$ and $(\hat x_1^n, x_2^n) \in T_n(\delta; X_1^n, X_2^n)$;

(Case $\mathcal{E}_2$) $(\exists\, \hat x_2^n \neq x_2^n)$ $f_2(\hat x_2^n) = f_2(x_2^n)$ and $(x_1^n, \hat x_2^n) \in T_n(\delta; X_1^n, X_2^n)$;

(Case $\mathcal{E}_3$) $(\exists\, (\hat x_1^n, \hat x_2^n) \neq (x_1^n, x_2^n))$ $f_1(\hat x_1^n) = f_1(x_1^n)$, $f_2(\hat x_2^n) = f_2(x_2^n)$ and $(\hat x_1^n, \hat x_2^n) \in T_n(\delta; X_1^n, X_2^n)$.
Hence,
$$P_e(n) \le \Pr\{\mathcal{E}_0 \cup \mathcal{E}_1 \cup \mathcal{E}_2 \cup \mathcal{E}_3\} \le \Pr(\mathcal{E}_0) + \Pr(\mathcal{E}_1) + \Pr(\mathcal{E}_2) + \Pr(\mathcal{E}_3),$$
and hence,
$$E[P_e(n)] \le E[\Pr(\mathcal{E}_0)] + E[\Pr(\mathcal{E}_1)] + E[\Pr(\mathcal{E}_2)] + E[\Pr(\mathcal{E}_3)],$$
where the expectation is taken over the uniformly random coding scheme. It remains to calculate the error for each event.

(Case $\mathcal{E}_0$) By the AEP, $E[\Pr(\mathcal{E}_0)] \to 0$.
(Case $\mathcal{E}_1$)
$$\begin{aligned}
E[\Pr(\mathcal{E}_1)] &= \sum_{x_1^n \in \mathcal{X}_1^n} \sum_{x_2^n \in \mathcal{X}_2^n} P_{X_1^n X_2^n}(x_1^n, x_2^n) \sum_{\substack{\hat x_1^n \in \mathcal{X}_1^n :\, \hat x_1^n \neq x_1^n,\\ (\hat x_1^n, x_2^n) \in T_n(\delta; X_1^n, X_2^n)}} \Pr\{f_1(\hat x_1^n) = f_1(x_1^n)\} \\
&= \sum_{x_1^n} \sum_{x_2^n} P_{X_1^n X_2^n}(x_1^n, x_2^n) \sum_{\substack{\hat x_1^n \neq x_1^n,\\ (\hat x_1^n, x_2^n) \in T_n(\delta; X_1^n, X_2^n)}} 2^{-nR_1} \quad \text{(by the uniformly random coding scheme)} \\
&\le \sum_{x_1^n} \sum_{x_2^n} P_{X_1^n X_2^n}(x_1^n, x_2^n) \sum_{\hat x_1^n :\, (\hat x_1^n, x_2^n) \in T_n(\delta; X_1^n, X_2^n)} 2^{-nR_1} \\
&\le \sum_{x_1^n} \sum_{x_2^n} P_{X_1^n X_2^n}(x_1^n, x_2^n) \sum_{\hat x_1^n :\, \left| -\frac1n \log_2 P_{X_1^n|X_2^n}(\hat x_1^n | x_2^n) - H(X_1|X_2) \right| < 2\delta} 2^{-nR_1} \\
&= \sum_{x_1^n} \sum_{x_2^n} P_{X_1^n X_2^n}(x_1^n, x_2^n)\, 2^{-nR_1} \left| \left\{ \hat x_1^n \in \mathcal{X}_1^n : \left| -\frac1n \log_2 P_{X_1^n|X_2^n}(\hat x_1^n|x_2^n) - H(X_1|X_2) \right| < 2\delta \right\} \right| \\
&\le \sum_{x_1^n} \sum_{x_2^n} P_{X_1^n X_2^n}(x_1^n, x_2^n)\, 2^{n[H(X_1|X_2)+2\delta]}\, 2^{-nR_1} \;=\; 2^{-n[R_1 - H(X_1|X_2) - 2\delta]}.
\end{aligned}$$
Hence, we need $R_1 > H(X_1|X_2)$ (with $\delta$ chosen small enough) to make $E[\Pr(\mathcal{E}_1)] \to 0$.
(Case $\mathcal{E}_2$) Same as Case $\mathcal{E}_1$: we need $R_2 > H(X_2|X_1)$ to make $E[\Pr(\mathcal{E}_2)] \to 0$.
(Case $\mathcal{E}_3$) By replacing $\Pr\{f_1(\hat x_1^n) = f_1(x_1^n)\}$ with
$$\Pr\{f_1(\hat x_1^n) = f_1(x_1^n),\ f_2(\hat x_2^n) = f_2(x_2^n)\} = 2^{-n(R_1+R_2)}$$
and using the fact that the number of elements in $T_n(\delta; X_1^n, X_2^n)$ is less than $2^{n(H(X_1,X_2)+\delta)}$ in the proof of Case $\mathcal{E}_1$, we obtain that $R_1 + R_2 > H(X_1, X_2)$ implies $E[\Pr(\mathcal{E}_3)] \to 0$.
2. Converse Part: We have to prove that if a sequence of code pairs for $X_1$ and $X_2$ has asymptotically zero error probability, then its rate pair $(R_1, R_2)$ satisfies the constraints.

Step 1: $R_1 + R_2 \ge H(X_1, X_2)$. Apparently, to losslessly compress the sources $(X_1, X_2)$ requires at least $H(X_1, X_2)$ bits.

Step 2: $R_1 \ge H(X_1|X_2)$. In the extreme case of $R_2 > \log_2 |\mathcal{X}_2|$ (no compression at all for $X_2$), we can follow the proof of the lossless data compression theorem to show that $R_1 \ge H(X_1|X_2)$.

Step 3: $R_2 \ge H(X_2|X_1)$. A similar argument as in Step 2 applies. $\Box$
Corollary 10.12 Given several (correlated) discrete memoryless sources $X_1, \ldots, X_m$ which are obtained at different terminals (and are to be encoded independently), the achievable code rate region satisfies
$$\sum_{i \in I} R_i \ge H(X_I \,|\, X_{L \setminus I})$$
for every index set $I \subseteq L \triangleq \{1, 2, \ldots, m\}$, where $X_I$ represents $(X_{i_1}, X_{i_2}, \ldots)$ for $\{i_1, i_2, \ldots\} = I$.
10.1.2 Partial decoding of the original sources

In the previous subsection, the receiver intends to fully reconstruct all the original transmitted information $X_1, \ldots, X_m$. In some situations, the receiver may only want to reconstruct part of the original information, say $X_i$ for $i \in I \subseteq \{1, \ldots, m\}$, or $X_I$ for short. Since $X_1, \ldots, X_m$ are in general assumed to be dependent, the remaining information, $X_i$ for $i \notin I$, should be helpful in the reconstruction of $X_I$. Accordingly, this remaining information is usually named the side information for lossless data compression.

Because the intention of the receiver differs, the definitions of the decoder mapping and the error probability must be modified.
Definition 10.13 (reconstructed information and side information) Let $L \triangleq \{1, 2, \ldots, m\}$ and let $I$ be any proper subset of $L$. Denote by $X_I$ the sources $X_i$ for $i \in I$; similar notation applies to $X_{L \setminus I}$.

In data compression with side information, $X_I$ is the information that needs to be reconstructed, and $X_{L \setminus I}$ is the side information.
Definition 10.14 (independent encoders among distributed sources) There are several sources $X_1, X_2, \ldots, X_m$ (which may or may not be independent), respectively obtained by $m$ terminals. Before each terminal transmits its local source to the receiver, a block encoder $f_i$ with rate $R_i$ and block length $n$ is applied:
$$f_i : \mathcal{X}_i^n \to \{1, 2, \ldots, 2^{nR_i}\}.$$
It is assumed that there is no conspiracy among the block encoders.
Definition 10.15 (global decoder for independently compressed sources) A global decoder $g(\cdot)$ recovers the sources $X_I$ after receiving all the independently compressed sources, i.e.,
$$g : \{1, \ldots, 2^{nR_1}\} \times \cdots \times \{1, \ldots, 2^{nR_m}\} \to \mathcal{X}_I^n.$$
Definition 10.16 (probability of error) The probability of error is defined as
$$P_e(n) \triangleq \Pr\{g(f_1(X_1^n), \ldots, f_m(X_m^n)) \neq X_I^n\}.$$
Definition 10.17 (achievable rates) A rate tuple $(R_1, \ldots, R_m)$ is said to be achievable if there exists a sequence of block codes such that
$$\limsup_{n\to\infty} P_e(n) = 0.$$

Definition 10.18 (achievable rate region for distributed sources) The achievable rate region for distributed sources is the set of all achievable rate tuples.

Observation 10.19 The achievable rate region is convex.
Theorem 10.20 For distributed sources with two random variables $X_1$ and $X_2$, let $X_1$ be the reconstructed information and $X_2$ the side information. The boundary function $R_1(R_2)$ of the achievable region is
$$R_1(R_2) = \min_{Z \,:\, X_1 \to X_2 \to Z \ \text{and}\ I(X_2; Z) \le R_2} H(X_1|Z).$$

The above result can be rewritten as
$$R_1 \ge H(X_1|Z) \quad \text{and} \quad R_2 \ge I(X_2; Z)$$
for some $Z$ satisfying the Markov chain $X_1 \to X_2 \to Z$. Actually, $Z$ can be viewed as the coding output of $X_2$, received by the decoder and used by the receiver as side information to reconstruct $X_1$. Hence, $I(X_2; Z)$ is the transmission rate from sender $X_2$ to the receiver. Among all $f_2(X_2) = Z$ with the same transmission rate $I(X_2; Z)$, the one that minimizes $H(X_1|Z)$ yields the minimum compression rate for $X_1$.
10.2 Distributed detection

Instead of reconstructing the original information, the decoder of a multiple-source system may only want to classify the sources into one of finitely many categories. This problem is usually named distributed detection.

A distributed or decentralized detection system consists of a number of observers (or data collecting units) and one or more data fusion centers. Each observer is coupled with a local data processor and communicates with the data fusion center through network links. The fusion center combines all the compressed information received from the observers and attempts to classify the source of the observations into one of finitely many categories.
Definition 10.21 (distributed system $\mathcal{S}_n$) A distributed detection system $\mathcal{S}_n$, as depicted in Figure 10.3, consists of $n$ geographically dispersed sensors, noiseless one-way communication links, and a fusion center. Each sensor makes an observation (denoted by $Y_i$) of a random source, quantizes $Y_i$ into an $m$-ary message $U_i = g_i(Y_i)$, and then transmits $U_i$ to the fusion center. Upon receipt of $(U_1, \ldots, U_n)$, the fusion center makes a global decision $T(U_1, \ldots, U_n)$ about the nature of the random source.







[Figure 10.3: Distributed detection in $\mathcal{S}_n$: observations $Y_1, \ldots, Y_n$ are quantized by $g_1, \ldots, g_n$ into $U_1, \ldots, U_n$, and the fusion center forms the global decision $T(U_1, \ldots, U_n)$.]
The optimal design of $\mathcal{S}_n$ entails choosing the quantizers $g_1, \ldots, g_n$ and a global decision rule $T$ so as to optimize a given performance index. In this section, we consider binary hypothesis testing under the (classical) Neyman-Pearson and Bayesian formulations. The first formulation dictates minimization of the type II error probability subject to an upper bound on the type I error probability, while the second stipulates minimization of the Bayes error probability, computed according to the prior probabilities of the two hypotheses.

The joint optimization of the entities $g_1, \ldots, g_n$ and $T$ in $\mathcal{S}_n$ is a hard computational task, except in trivial cases (such as when the observations $Y_i$ lie in a set of size no greater than $m$). The complexity of the problem can only be reduced by introducing additional statistical structure in the observations. For example, it has been shown that whenever $Y_1, \ldots, Y_n$ are independent given each hypothesis, an optimal solution can be found in which $g_1, \ldots, g_n$ are threshold-type functions of the local likelihood ratio (possibly with some randomization for Neyman-Pearson testing). Still, we should note that optimization of $g_1, \ldots, g_n$ over the class of threshold-type likelihood-ratio quantizers is prohibitively complex when $n$ is large.

Of equal importance are situations where the statistical model exhibits spatial symmetry in the form of permutation invariance with respect to the sensors. A natural question to ask in such cases is:

Question: Does a symmetric optimal solution exist in which the quantizers $g_i$ are identical?
If so, then the optimal system design is considerably simplified. The answer is clearly negative for cases where the sensor observations are highly dependent; as an extreme example, take $Y_1 = \cdots = Y_n = Y$ with probability 1 under each hypothesis, and note that any two identical quantizers lead to a redundancy.

The general problem is as follows. System $\mathcal{S}_n$ is used for testing $H_0 : P$ versus $H_1 : Q$, where $P$ and $Q$ are one-dimensional marginals of the i.i.d. data $Y_1, \ldots, Y_n$. As $n$ tends to infinity, both the minimum type II error probability $\beta_n(\alpha)$ (as a function of the type I error probability bound $\alpha$) and the Bayes error probability $\gamma_n(\pi)$ (as a function of the prior probability $\pi$ of $H_0$) vanish at an exponential rate. It thus becomes legitimate to adopt a measure of asymptotic performance based on the error exponents
$$e_{NP}(\alpha) \triangleq \lim_{n\to\infty} -\frac1n \log \beta_n(\alpha), \qquad e_B(\pi) \triangleq \lim_{n\to\infty} -\frac1n \log \gamma_n(\pi).$$
It was shown by Tsitsiklis [3] that, under certain assumptions on the hypotheses $P$ and $Q$, it is possible to achieve the same error exponents using identical quantizers. Thus if $\bar\beta_n(\alpha)$, $\bar\gamma_n(\pi)$, $\bar e_{NP}(\alpha)$ and $\bar e_B(\pi)$ are the counterparts of $\beta_n(\alpha)$, $\gamma_n(\pi)$, $e_{NP}(\alpha)$ and $e_B(\pi)$ under the constraint that the quantizers $g_1, \ldots, g_n$ are identical, then
$$(\forall\, \alpha \in (0,1)) \quad \bar e_{NP}(\alpha) = e_{NP}(\alpha)$$
and
$$(\forall\, \pi \in (0,1)) \quad \bar e_B(\pi) = e_B(\pi).$$
(Of course, for all $n$, $\bar\beta_n(\alpha) \ge \beta_n(\alpha)$ and $\bar\gamma_n(\pi) \ge \gamma_n(\pi)$.) This result provides some justification for restricting attention to identical quantizers when designing a system consisting of a large number of sensors.
Here we will focus on two issues. The first issue is the exact asymptotics of the minimum error probabilities achieved by the absolutely optimal and the best identical-quantizer systems. Note that equality of the error exponents of $\beta_n(\alpha)$ and $\bar\beta_n(\alpha)$ does not in itself guarantee that for any given $n$, the values of $\beta_n(\alpha)$ and $\bar\beta_n(\alpha)$ are in any sense close. In particular, the ratio $\beta_n(\alpha)/\bar\beta_n(\alpha)$ may vanish at a subexponential rate, and thus the best identical-quantizer system may be vastly inferior to the absolutely optimal system. (The same argument can be made for $\gamma_n(\pi)/\bar\gamma_n(\pi)$.)

From numerical simulations of Bayes testing in $\mathcal{S}_n$, the ratio $\gamma_n(\pi)/\bar\gamma_n(\pi)$ is (apparently) bounded from below by a positive constant which is, in many cases, reasonably close to unity (but not necessarily one; cf. Example 10.32). This simulation result is substantiated by using large deviations techniques to prove that $\gamma_n(\pi)/\bar\gamma_n(\pi)$ is, indeed, always bounded from below (it is, of course, upper bounded by unity). For Neyman-Pearson testing, an additional regularity condition is required in order for the ratio $\beta_n(\alpha)/\bar\beta_n(\alpha)$ to be lower-bounded in $n$. In either case, the optimal system essentially consists of almost identical quantizers, and is thus only marginally different from the best identical-quantizer system.
Definition 10.22 (observational model) Each sensor observation $Y = Y_i$ takes values in the measurable space $(\Omega, \mathcal{B})$. The distribution of $Y$ under the null ($H_0$) and alternative ($H_1$) hypotheses is denoted by $P$ and $Q$, respectively.

Assumption 10.23 $P$ and $Q$ are mutually absolutely continuous, i.e., $P \equiv Q$.

Under Assumption 10.23, the (pre-quantization) log-likelihood ratio
$$X(y) \triangleq \log \frac{dP}{dQ}(y)$$
is well-defined for $y \in \Omega$ and is a.s. finite. (Since $P \equiv Q$, "almost surely" and "almost everywhere" are understood as being under both $P$ and $Q$.) In $\mathcal{S}_n$, the variable $X(Y_i)$ will also be denoted by $X_i$.
Assumption 10.24 Every measurable $m$-ary partition of $\Omega$ contains an atom over which $X = \log(dP/dQ)$ is not almost everywhere constant.

The objective of the above assumption is to guarantee that no trivial quantization, in which no compression is applied to the observations, can occur.
Definition 10.25 (divergence) The (Kullback-Leibler, informational) divergence, or relative entropy, of $P$ relative to $Q$ is defined by
$$D(P\|Q) \triangleq E_P[X] = \int \log \frac{dP}{dQ}(y)\, dP(y).$$

Lemma 10.26 On the convex domain consisting of distribution pairs $(P, Q)$ with the property $P \equiv Q$, the functional $D(P\|Q)$ is nonnegative and convex in $(P, Q)$.
Lemma 10.27 (Neyman-Pearson type II error exponent at fixed test level) The optimal Neyman-Pearson type II error exponent in testing $P$ versus $Q$ at any level $\alpha \in (0, 1)$, based on the i.i.d. observations $Y_1, \ldots, Y_n$, is $D(P\|Q)$.
Definition 10.28 (moment generating function of the log-likelihood ratio) $\Lambda(s)$ is the moment generating function of $X$ under $Q$:
$$\Lambda(s) \triangleq E_Q[\exp\{sX\}] = \int \left(\frac{dP}{dQ}(y)\right)^{s} dQ(y).$$
Lemma 10.29 (concavity of $\Lambda(s)$)

1. For fixed $s \in [0, 1]$, $\Lambda(s)$ is a finite-valued concave functional of the pair $(P, Q)$ with the property $P \equiv Q$.

2. For fixed $(P, Q)$ with $P \equiv Q$, $\Lambda(s)$ is finite and convex in $s \in [0, 1]$.

The last property, together with the fact that $\Lambda(0) = \Lambda(1) = 1$, guarantees that $\Lambda(s)$ has a minimum value which is less than or equal to unity, achieved at some $s^* \in (0, 1)$.

Definition 10.30 (Chernoff exponent) We define the Chernoff exponent
$$C(P, Q) \triangleq -\log \Lambda(s^*) = -\log\left(\min_{s \in (0,1)} \Lambda(s)\right).$$
Lemma 10.31 The Chernoff exponent coincides with the Bayes error exponent.
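Definition 10.30 lends itself to direct numeric computation: since $\Lambda(s)$ is convex on $[0,1]$, a fine grid search finds its minimum. The sketch below (the function name `chernoff_exponent` is ours; the two distribution pairs are the quantizer outputs that appear in Example 10.32 further down) computes the exponents in natural-log units:

```python
import math

def chernoff_exponent(P, Q, grid=20000):
    """C(P,Q) = -log min_{s in (0,1)} Lambda(s), where
    Lambda(s) = sum_u Q(u) * (P(u)/Q(u))**s.  Lambda is convex in s,
    so a fine grid search over (0,1) approximates the minimum well."""
    best = min(
        sum(q * (p / q) ** s for p, q in zip(P, Q))
        for s in (i / grid for i in range(1, grid))
    )
    return -math.log(best)

# Quantizer output distributions from Example 10.32:
P1, Q1 = [1/12, 11/12], [1/3, 2/3]   # quantizer g'
P2, Q2 = [1/3, 2/3], [2/3, 1/3]      # quantizer g''

print(chernoff_exponent(P1, Q1))  # ~0.0534
print(chernoff_exponent(P2, Q2))  # ~0.0589 = (1/2) log(9/8)
```

Matching the values quoted in Example 10.32, the second pair has the larger exponent, which is why the best identical-quantizer system eventually prefers $g''$.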
Example 10.32 (counterexample to $\gamma_n(\pi)/\bar\gamma_n(\pi) \to 1$) Consider a ternary observation space $\Omega = \{a_1, a_2, a_3\}$ with binary quantization. The two hypotheses are assumed equally likely, with

    y             a_1     a_2     a_3
    P(y)          1/12    1/4     2/3
    Q(y)          1/3     1/3     1/3
    (dP/dQ)(y)    1/4     3/4     2

There are only two nontrivial deterministic LRQs: $g'$, which partitions $\Omega$ into $\{a_1\}$ and $\{a_2, a_3\}$; and $g''$, which partitions $\Omega$ into $\{a_1, a_2\}$ and $\{a_3\}$. The corresponding output distributions and log-likelihood ratios $X'(\cdot)$ and $X''(\cdot)$ are given by

    u         1           2
    P'(u)     1/12        11/12
    Q'(u)     1/3         2/3
    X'(u)     -log 4      log(11/8)

and

    u         1           2
    P''(u)    1/3         2/3
    Q''(u)    2/3         1/3
    X''(u)    -log 2      log 2

(10.2.1)

The Chernoff exponents satisfy
$$C(P'', Q'') = \frac12 \log\frac98 = 0.0589 > 0.0534 = C(P', Q'),$$
and thus for $n$ sufficiently large, the best identical-quantizer system $\mathcal{S}_n$ employs $g''$ on all $n$ sensors (the other choice yields a suboptimal error exponent and thus eventually a higher value of $\bar\gamma_n(\pi)$).

We will now show by contradiction that if $\mathcal{S}_n$ is an absolutely optimal system consisting of deterministic LRQs, then for all even values of $n$, at least one of the quantizers in $\mathcal{S}_n$ must be $g'$.

Assume the contrary, i.e., $\mathcal{S}_n$ is such that for all $i \le n$, $g_i = g''$. Now consider the problem of optimizing the quantizer $g_n$ in $\mathcal{S}_n$ subject to the constraint that each of the remaining quantizers $g_1, \ldots, g_{n-1}$ equals $g''$. It is known that either $g_n = g'$ or $g_n = g''$ is optimal. Our assumption about $\mathcal{S}_n$ then implies that $g_n = g''$ is, in fact, optimal.
To see why this cannot be so if $n$ is even, consider the Bayes error probability for this problem. Writing $u_1^n$ for $(u_1, \ldots, u_n)$, and letting $a \wedge b$ denote $\min(a, b)$, we have
$$\begin{aligned}
\gamma_n\!\left(\tfrac12\right) &= \frac12 \sum_{u_1^n \in \{1,2\}^n} \left[P''(u_1^{n-1})\, P_{g_n}(u_n)\right] \wedge \left[Q''(u_1^{n-1})\, Q_{g_n}(u_n)\right] \\
&= \frac12 \sum_{u_1^{n-1} \in \{1,2\}^{n-1}} \sum_{u_n \in \{1,2\}} \left[P''(u_1^{n-1})\, P_{g_n}(u_n)\right] \wedge \left[Q''(u_1^{n-1})\, Q_{g_n}(u_n)\right] \\
&= \sum_{u_1^{n-1} \in \{1,2\}^{n-1}} \left[\frac{P''(u_1^{n-1}) + Q''(u_1^{n-1})}{2}\right] \zeta\!\left(\frac{P''(u_1^{n-1})}{P''(u_1^{n-1}) + Q''(u_1^{n-1})}\right) \quad (10.2.2)
\end{aligned}$$
where $\zeta(\cdot)$ represents the Bayes error probability function of the $n$th sensor/quantizer pair (note that in this equation, the argument of $\zeta(\cdot)$ is just the posterior probability of $H_0$ given $u_1^{n-1}$).
We note from (10.2.1) that the log-likelihood ratio
$$\log \frac{P''(u_1^{n-1})}{Q''(u_1^{n-1})} = X''(u_1) + \cdots + X''(u_{n-1})$$
can also be expressed as $(2\, l_{n-1}(u) - n + 1)(\log 2)$, where $l_{n-1}(u)$ is the number of 2's in $u_1^{n-1}$. Now $l_{n-1}(U)$ is a binomial random variable under either hypothesis, and we can rewrite (10.2.2) as
$$\gamma_n\!\left(\tfrac12\right) = \sum_{l=0}^{n-1} \binom{n-1}{l} \left[\frac{2^l + 2^{n-l-1}}{2 \cdot 3^{n-1}}\right] \zeta\!\left(\frac{2^{2l-n+1}}{2^{2l-n+1} + 1}\right). \quad (10.2.3)
$$
The two candidates for $\zeta$ are $\zeta'$ and $\zeta''$, given by
$$\zeta'(\pi) = \left[\frac{\pi}{12} \wedge \frac{1-\pi}{3}\right] + \left[\frac{11\pi}{12} \wedge \frac{2(1-\pi)}{3}\right]$$
$$\zeta''(\pi) = \left[\frac{\pi}{3} \wedge \frac{2(1-\pi)}{3}\right] + \left[\frac{2\pi}{3} \wedge \frac{1-\pi}{3}\right]$$
and shown in Figure 10.4. Note that $\zeta'(\pi) = \zeta''(\pi)$ for $\pi \le 1/3$, $\pi = 4/7$ and $\pi \ge 4/5$. Thus the critical values of $l$ in (10.2.3) are those for which $2^{2l-n+1}/(2^{2l-n+1}+1)$ lies in the union of $(1/3, 4/7)$ and $(4/7, 4/5)$.
If $n$ is odd, then the range of $2l - n + 1$ in (10.2.3) comprises the even integers between $-n+1$ and $n-1$ inclusive. The only critical value of $l$ is $(n-1)/2$, for which the posterior probability of $H_0$ is $1/2$. Since $\zeta'(1/2) = 3/8 > 1/3 = \zeta''(1/2)$, the optimal choice is $g''$.

If $n$ is even, then $2l - n + 1$ ranges over all odd integers between $-n+1$ and $n-1$ inclusive. Here the only critical value of $l$ is $n/2$, which makes the posterior probability of $H_0$ equal to $2/3$. Since $\zeta'(2/3) = 5/18 < 1/3 = \zeta''(2/3)$, $g'$ is optimal.
We thus obtain the required contradiction, together with the inequality
$$\bar\gamma_{2k}\!\left(\tfrac12\right) - \gamma_{2k}\!\left(\tfrac12\right) \ge \binom{2k-1}{k} \left[\frac{2^k + 2^{k-1}}{2 \cdot 3^{2k-1}}\right] \left[\zeta''\!\left(\tfrac23\right) - \zeta'\!\left(\tfrac23\right)\right] = \frac18 \binom{2k-1}{k} \left(\frac29\right)^{k}.$$
Stirling's formula gives $(4\pi k)^{-1/2} \exp\{k \log 4\}(1 + o(1))$ for the binomial coefficient, and thus
$$\liminf_{k\to\infty} \left[\bar\gamma_{2k}\!\left(\tfrac12\right) - \gamma_{2k}\!\left(\tfrac12\right)\right] \sqrt{k} \left(\frac98\right)^{k} \ge \frac{1}{16\sqrt{\pi}}. \quad (10.2.4)
$$
[Figure 10.4: Bayes error probabilities $\zeta'(\pi)$ and $\zeta''(\pi)$ associated with $g'$ and $g''$; the two curves coincide for $\pi \le 1/3$, at $\pi = 4/7$, and for $\pi \ge 4/5$.]
Since $(9/8)^k = \exp\{2k\, C(P'', Q'')\}$, we immediately deduce from (10.2.4) that
$$\limsup_{k\to\infty} \frac{\gamma_{2k}(1/2)}{\bar\gamma_{2k}(1/2)} < 1.$$
A finer approximation to
$$\bar\gamma_{2k}\!\left(\tfrac12\right) = Q''\{X''(U_1) + \cdots + X''(U_{2k}) > 0\} + \frac12\, Q''\{X''(U_1) + \cdots + X''(U_{2k}) = 0\},$$
using [2, Theorem 1] and Stirling's formula for the first and second summands, respectively, yields
$$\lim_{k\to\infty} \bar\gamma_{2k}\!\left(\tfrac12\right) \sqrt{k} \left(\frac98\right)^{k} = \frac{3}{2\sqrt{\pi}}.$$
From (10.2.4) we then obtain
$$\limsup_{k\to\infty} \frac{\gamma_{2k}(1/2)}{\bar\gamma_{2k}(1/2)} \le \frac{23}{24}.$$
10.2.1 Neyman-Pearson testing in parallel distributed detection

The equality (and finiteness) of $\bar e_{NP}(\alpha)$ and $e_{NP}(\alpha)$ was established in Theorem 2 of [3] under the assumption that $E_P[X^2] < \infty$. Actually, the proof only utilized the following weaker condition (with $\varepsilon = 0$) on the post-quantization log-likelihood ratio.
Assumption 10.33 (boundedness assumption) There exists $\varepsilon \ge 0$ for which
$$\sup_{g \in \mathcal{G}_m} E_P\!\left[|X_g|^{2+\varepsilon}\right] < \infty, \quad (10.2.5)$$
where $\mathcal{G}_m$ is the set of all possible $m$-ary quantizers.
Let us briefly examine the above assumption. For an arbitrary randomized quantizer $g$, let $p_u = P_g(u)$ and $q_u = Q_g(u)$. Then
$$E_P\!\left[|X_g|^{2+\varepsilon}\right] = \sum_{u=1}^{m} p_u \left|\log \frac{p_u}{q_u}\right|^{2+\varepsilon}.$$
Our first observation is that the negative part $X_g^-$ of $X_g$ has bounded $(2+\varepsilon)$-th moment under $P$, and thus Assumption 10.33 is equivalent to
$$\sup_g E_P\!\left[(X_g^+)^{2+\varepsilon}\right] < \infty.$$
Indeed, we have
$$E_P\!\left[(X_g^-)^{2+\varepsilon}\right] = \sum_{u \,:\, p_u < q_u} p_u \left|\log \frac{p_u}{q_u}\right|^{2+\varepsilon} = \sum_{u \,:\, p_u < q_u} p_u \left(\log \frac{q_u}{p_u}\right)^{2+\varepsilon},$$
and using the inequality $|\log x^{1/(2+\varepsilon)}| \le x^{1/(2+\varepsilon)} - 1$ for $x \ge 1$, we obtain
$$E_P\!\left[(X_g^-)^{2+\varepsilon}\right] \le (2+\varepsilon)^{2+\varepsilon} \sum_{u \,:\, p_u < q_u} p_u \left[\left(\frac{q_u}{p_u}\right)^{1/(2+\varepsilon)} - 1\right]^{2+\varepsilon} \le (2+\varepsilon)^{2+\varepsilon} \sum_{u=1}^{m} q_u \le (2+\varepsilon)^{2+\varepsilon}.$$
Theorem 10.34 Assumption 10.33 is equivalent to the condition
$$\sup_{\pi \in \mathcal{T}_2} E_P\!\left[|X_\pi|^{2+\varepsilon}\right] < \infty, \quad (10.2.6)$$
where $\mathcal{T}_2$ is the set of all possible binary log-likelihood-ratio quantizers.
Proof: Assumption 10.33 clearly implies (10.2.6). To prove the converse, let $\pi_t$ be the LRP (likelihood-ratio partition) in $\mathcal{T}_2$ defined by
$$\pi_t \triangleq \{(-\infty, t],\ (t, \infty)\}, \quad (10.2.7)$$
and let $p(t) = P\{X > t\}$, $q(t) = Q\{X > t\}$. By (10.2.6), there exists $b < \infty$ such that for all $t \ge 1$,
$$E_P\!\left[|X_{\pi_t}|^{2+\varepsilon}\right] = (1 - p(t)) \left|\log \frac{1-p(t)}{1-q(t)}\right|^{2+\varepsilon} + p(t) \left|\log \frac{p(t)}{q(t)}\right|^{2+\varepsilon} \le b. \quad (10.2.8)$$

Consider now an arbitrary deterministic $m$-ary quantizer $g$ with output pmfs $(p_1, \ldots, p_m)$ and $(q_1, \ldots, q_m)$, and let the maximum of $p_u |\log(p_u/q_u)|^{2+\varepsilon}$ subject to $p_u \ge q_u$ be achieved at $u = u^*$; then
$$E_P\!\left[|X_g|^{2+\varepsilon}\right] \le m\, p_{u^*} \left|\log \frac{p_{u^*}}{q_{u^*}}\right|^{2+\varepsilon} + (2+\varepsilon)^{2+\varepsilon}.$$
To see that $p_{u^*} |\log(p_{u^*}/q_{u^*})|^{2+\varepsilon}$ is bounded from above if (10.2.8) holds, note that for a given $p_{u^*} = P\{Y \in C_{u^*}\}$, the value of $q_{u^*} = Q\{Y \in C_{u^*}\}$ is smallest when the cell $C_{u^*}$ is an upper level set of $X$; that is, there exist $t$ and $\theta \in (0, 1]$ such that
$$p_{u^*} = \theta\, P\{X = t\} + p(t), \qquad q_{u^*} \ge \theta\, Q\{X = t\} + q(t).$$
If $P\{X = t\} = 0$, then
$$p_{u^*} \left|\log \frac{p_{u^*}}{q_{u^*}}\right|^{2+\varepsilon} \le p(t) \left|\log \frac{p(t)}{q(t)}\right|^{2+\varepsilon},$$
where by virtue of (10.2.8), the r.h.s. is upper-bounded by $b$. Otherwise, the r.h.s. is of the form
$$\left(p(t) + \theta\, P\{X = t\}\right) \left|\log \frac{p(t) + \theta\, P\{X = t\}}{q(t) + \theta\, Q\{X = t\}}\right|^{2+\varepsilon}$$
for some $\theta \in (0, 1]$. For $\theta = 1$, this can again be bounded using (10.2.6): take the LRP consisting of the intervals $(-\infty, t)$ and $[t, \infty)$, or use $\pi_{t'}$ with $t' \uparrow t$ and a simple continuity argument. Then the log-sum inequality can be applied, together with the concavity of $f(t) = t^{1/(2+\varepsilon)}$, to show that the same bound $b$ is valid for $\theta \in (0, 1)$.

If $g$ is a randomized quantizer, then the probabilities $p_{u^*}$ and $q_{u^*}$ defined previously are expressible as $\sum_k \lambda_k p^{(k)}$ and $\sum_k \lambda_k q^{(k)}$. Here $k$ ranges over a finite index set, and each pair $(p^{(k)}, q^{(k)})$ is derived from a deterministic quantizer. Again, using the log-sum inequality and the concavity of $f(t) = t^{1/(2+\varepsilon)}$, one can obtain $p_{u^*} |\log(p_{u^*}/q_{u^*})|^{2+\varepsilon} \le b$. $\Box$
Observation 10.35 The boundedness assumption is equivalent to
$$\limsup_{t\to\infty} E_P\!\left[|X_{\pi_t}|^{2+\varepsilon}\right] < \infty, \quad (10.2.9)$$
where $\pi_t$ is defined in (10.2.7).
Theorem 10.36 If Assumption 10.33 holds, then for all $\alpha \in (0, 1)$,
$$-\frac1n \log \beta_n(\alpha) = D_m + O(n^{-1/2})$$
and
$$-\frac1n \log \bar\beta_n(\alpha) = D_m + O(n^{-1/2}). \quad \Box$$
As an immediate corollary, we have $\bar e_{NP}(\alpha) = e_{NP}(\alpha) = D_m$, which is Theorem 2 in [3]. The stated result sharpens this equality by demonstrating that the finite-sample error exponents of the absolutely optimal and the best identical-quantizer systems converge at a rate $O(n^{-1/2})$. It also motivates the following observations.

The first observation concerns the accuracy, or tightness, of the $O(n^{-1/2})$ convergence factor. Although the upper and lower bounds on $\bar\beta_n(\alpha)$ in Theorem 10.36 are based on a suboptimal null acceptance region, examples in the context of centralized testing show that in general, the $O(n^{-1/2})$ rate cannot be improved upon. At the same time, it is rather unlikely that the ratio $\beta_n(\alpha)/\bar\beta_n(\alpha)$ decays as fast as $\exp\{-c\sqrt{n}\}$; we expect that $\beta_n(\alpha)/\bar\beta_n(\alpha)$ is indeed bounded from below.
The second observation is about the composition of an optimal quantizer set $(g_1, \ldots, g_n)$ for $\mathcal{S}_n$. The proof of Theorem 10.36 also yields an upper bound on the number of quantizers that are at least $\delta$-distant from the deterministic LRQ that achieves $e_{NP}(\alpha)$. Specifically, if $K_n(\delta)$ is the number of indices $i$ for which $D(P_{g_i}\|Q_{g_i}) < D_m - \delta$ (where $\delta > 0$), then
$$\frac{K_n(\delta)}{n} = O(n^{-1/2}).$$
Thus in an optimal system, most quantizers will be essentially identical to the one that achieves $e_{NP}(\alpha)$. (This conclusion can be significantly strengthened if an additional assumption is made in the case of Neyman-Pearson testing [1].)
In the remainder of this section we discuss the asymptotics of Neyman-Pearson testing in situations where Assumption 10.33 does not hold. By the remark following the proof of Theorem 10.34, this condition is violated if and only if
$$\limsup_{t\to\infty} E_P\!\left[X_{\pi_t}^2\right] = \infty, \quad (10.2.10)$$
where $\pi_t$ is the binary LRP defined by $\pi_t = \{(-\infty, t],\ (t, \infty)\}$.

We now distinguish between three cases.

Case A. $\limsup_{t\to\infty} E_P[X_{\pi_t}] = \infty$.

Case B. $0 < \limsup_{t\to\infty} E_P[X_{\pi_t}] < \infty$.

Case C. $\limsup_{t\to\infty} E_P[X_{\pi_t}] = 0$ and $\limsup_{t\to\infty} E_P[X_{\pi_t}^2] = \infty$.
Example Let the observation space be the unit interval $(0, 1]$ with its Borel field. For $a > 0$, define the distributions $P$ and $Q$ by
$$P\{Y \le y\} = y, \qquad Q\{Y \le y\} = \exp\left\{\frac{a+1}{a}\left(1 - \frac{1}{y^a}\right)\right\}.$$
The pdf of $Q$ is strictly increasing in $y$, and thus the likelihood ratio $(dP/dQ)(y)$ is strictly decreasing in $y$. Hence the event $\{X > t\}$ can also be written as $\{Y < y_t\}$, where $y_t \to 0$ as $t \to \infty$. Using this equivalence, we can examine the limiting behavior of $E_P[X_{\pi_t}]$ and $E_P[X_{\pi_t}^2]$ to obtain:

a. $a > 1$: $\lim_{t\to\infty} E_P[X_{\pi_t}] = \lim_{t\to\infty} E_P[X_{\pi_t}^2] = \infty$ (Case A);

b. $a = 1$: $\lim_{t\to\infty} E_P[X_{\pi_t}] = 2$, $\lim_{t\to\infty} E_P[X_{\pi_t}^2] = \infty$ (Case B);

c. $1/2 < a < 1$: $\lim_{t\to\infty} E_P[X_{\pi_t}] = 0$, $\lim_{t\to\infty} E_P[X_{\pi_t}^2] = \infty$ (Case C);

d. $a \le 1/2$: $\lim_{t\to\infty} E_P[X_{\pi_t}^2] < \infty$ (Assumption 10.33 is satisfied). $\Box$
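Item b can be checked numerically. Since $\{X > t\} = \{Y < y_t\}$, we may parameterize the partition $\pi_t$ by the cut point $y = y_t$, so that $p(t) = y$ and $q(t) = \exp\{2(1 - 1/y)\}$ for $a = 1$. The sketch below (the helper name `moments` is ours; $q$ is kept in the log domain to avoid underflow) evaluates the first two moments of $X_{\pi_t}$ as $y \to 0$:

```python
import math

def moments(y):
    """First and second moments of X_{pi_t} under P, for a = 1."""
    p = y                             # p(t) = P{Y < y_t} = y
    log_q = 2 * (1 - 1 / y)           # ln q(t), since Q{Y < y} = exp(2(1 - 1/y))
    q = math.exp(log_q)               # may underflow to 0.0; only used via 1 - q
    v0 = math.log((1 - p) / (1 - q))  # value of X_{pi_t} on {X <= t}
    v1 = math.log(p) - log_q          # value of X_{pi_t} on {X > t}
    m1 = (1 - p) * v0 + p * v1
    m2 = (1 - p) * v0**2 + p * v1**2
    return m1, m2

for y in (1e-2, 1e-4, 1e-6):
    print(y, moments(y))  # first moment -> 2, second moment -> infinity (Case B)
```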
In Case A, the error exponents $e_{NP}(\alpha)$ and $\bar e_{NP}(\alpha)$ are both infinite. This result is neither difficult to prove nor surprising, considering that $E_P[X_g] = D(P_g\|Q_g)$ can be made arbitrarily large by choice of the quantizer $g$.

Theorem 10.37 (result for Case A) If $\limsup_{t\to\infty} E_P[X_{\pi_t}] = \infty$, then for all $m \ge 2$ and $\alpha \in (0, 1)$,
$$e_{NP}(\alpha) = \bar e_{NP}(\alpha) = \infty. \quad \Box$$
We now turn to Case B, which is more interesting. Using the notation $p(t) = P\{X > t\}$ and $q(t) = Q\{X > t\}$ introduced earlier, we have
$$E_P[X_{\pi_t}] = (1 - p(t)) \log \frac{1-p(t)}{1-q(t)} + p(t) \log \frac{p(t)}{q(t)}.$$
The first summand on the r.h.s. clearly tends to zero (as $t \to \infty$, which is understood throughout), hence the lim sup of the second summand $p(t)\log[p(t)/q(t)]$ is greater than zero. Since $p(t)$ tends to zero, both $\log[p(t)/q(t)]$ and $p(t)\log^2[p(t)/q(t)]$ have lim sup equal to infinity. Thus in particular, (10.2.10) always holds in Case B.

A separate argument (which we omit) reveals that the centralized error exponent $D(P\|Q) = E_P[X]$ is also infinite in Case B. Yet unlike Case A, the decentralized error exponent $\bar e_{NP}(\alpha)$ obtained here is not infinite. Quite surprisingly, if this exponent exists, then it must depend on the test level $\alpha$. This is stated in the following theorem.
Theorem 10.38 (result for Case B) Consider hypothesis testing with $m$-ary quantization, where $m \ge 2$. If
$$0 < \limsup_{t\to\infty} E_P[X_{\pi_t}] < \infty, \quad (10.2.11)$$
then there exist:

1. an increasing sequence of integers $\{n_k,\ k \in \mathbb{N}\}$ and a function $L : (0,1) \to (0,\infty)$ which is monotonically increasing to infinity, such that
$$\liminf_{k\to\infty} -\frac{1}{n_k} \log \bar\beta_{n_k}(\alpha) \ge \max\{L(\alpha),\, D_m\};$$

2. a function $M : (0,1) \to (0,\infty)$ which is monotonically increasing to infinity and is such that
$$\limsup_{n\to\infty} -\frac1n \log \beta_n(\alpha) \le M(\alpha).$$
Proof:

(i) Lower bound. As was argued in the proof of Theorem 10.36, an error exponent equal to $D_m$ can be achieved using identical quantizers; hence one part of the bound follows immediately. In what follows, we construct a sequence of identical-quantizer detection schemes with finite-sample error exponent asymptotically exceeding $L(\alpha)$, where $L(\alpha)$ increases to infinity as $\alpha$ tends to unity.

Let $\mu \triangleq \limsup_{t\to\infty} E_P[X_{\pi_t}]$, so that $0 < \mu < \infty$ by (10.2.11). By subsequence selection, we obtain a sequence of LRPs $\{\pi_{t_k}\}$ with the property that $\mu_k \triangleq E_P[X_{\pi_{t_k}}]$ converges to $\mu$ as $k \to \infty$. (We eliminate $t$ from all subscripts to simplify the notation.) Letting $p_k = p(t_k)$, $q_k = q(t_k)$, $\lambda_k = \log[(1-p_k)/(1-q_k)]$ and $\theta_k = \log(p_k/q_k)$, we can write
$$\mu_k = (1 - p_k)\lambda_k + p_k\theta_k,$$
where by the discussion preceding this theorem, $p_k$ and $\lambda_k$ tend to zero and $\theta_k$ increases to infinity.

Fix $c > 0$ and assume w.l.o.g. that $\theta_k > \theta_{k-1} + (1/c)$. Consider a system consisting of $n_k$ sensors, where $n_k = \lceil c\,\theta_k \rceil$, and let each sensor employ the same binary LRQ with LRP $\pi_k$. This choice is clearly suboptimal (since $m \ge 2$), but it suffices for our purposes.
Define the set $\mathcal{A}_k \subseteq \{1, 2\}^{n_k}$ by
$$\mathcal{A}_k = \{(u_1, \ldots, u_{n_k}) : \text{at least one } u_j \text{ equals } 2\}.$$
Recalling that $U_j = 2$ if, and only if, the observed log-likelihood ratio is larger than $t_k$, we have
$$P_{\pi_k}(\mathcal{A}_k) = 1 - (1 - p_k)^{n_k},$$
and thus
$$\lim_{k\to\infty} \left(1 - P_{\pi_k}(\mathcal{A}_k)\right) = \lim_{k\to\infty} (1-p_k)^{n_k} = \lim_{k\to\infty} (1-p_k)^{c\,\theta_k} = \left[\lim_{k\to\infty} (1-p_k)^{\theta_k}\right]^{c} = \exp\{-c\mu\},$$
where the last equality follows from the fact that $\lambda_k \to 0$ and $p_k\theta_k \to \mu$ as $k \to \infty$.
Thus given any ε > 0, for all sufficiently large k the set A_k is admissible (albeit not necessarily optimal) as a null acceptance region for testing at level α = exp{−λθ} + ε. For this value of α, we have

    β_{n_k}(α) ≤ Q(A_k)
              = 1 − (1 − q_k)^{n_k}
              = 1 − (1 − p_k exp{−θ_k})^{n_k}
              = n_k p_k exp{−θ_k} − [n_k(n_k − 1)/2!] p_k² exp{−2θ_k} + · · · + (−1)^{n_k+1} p_k^{n_k} exp{−n_k θ_k}
              ≤ n_k p_k exp{−θ_k} + n_k² p_k² exp{−2θ_k} + · · ·

Summing the geometric series, we obtain

    β_{n_k}(α) ≤ n_k p_k exp{−θ_k} / (1 − n_k p_k exp{−θ_k}).

The r.h.s. denominator tends to unity because θ_k → ∞ and n_k p_k → λθ as k → ∞. Since θ_k/n_k → 1/θ, we conclude that

    liminf_{k→∞} −(1/n_k) log β_{n_k}(α) ≥ 1/θ = λ / log(1/(α − ε)).

As ε > 0 and θ > 0 were chosen arbitrarily, it follows that

    liminf_{k→∞} −(1/n_k) log β_{n_k}(α) ≥ λ / log(1/α)

for all α ∈ (0, 1). The lower bound in statement (i) of the theorem is obtained by taking L(α) ≜ λ / log(1/α).
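The geometric-series step above is easy to check numerically. The sketch below is illustrative only (it is not part of the proof): for x playing the role of p_k exp{−θ_k}, the alternating binomial expansion of 1 − (1 − x)^n is dominated by the geometric sum nx/(1 − nx) whenever nx < 1.

```python
# Numerical check of the geometric-series bound used in part (i):
# 1 - (1 - x)^n <= n*x / (1 - n*x) whenever n*x < 1.
# Here x stands for p_k * exp(-theta_k).

def exact_tail(x: float, n: int) -> float:
    """Exact value of 1 - (1 - x)^n, the bound Q(A_k) on the type-II error."""
    return 1.0 - (1.0 - x) ** n

def geometric_majorant(x: float, n: int) -> float:
    """Sum of the geometric series n*x + (n*x)^2 + ... = n*x / (1 - n*x)."""
    return n * x / (1.0 - n * x)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        for x in (1e-7, 1e-6, 1e-5):
            assert n * x < 1.0
            assert exact_tail(x, n) <= geometric_majorant(x, n)
    print("geometric-series bound holds on all test points")
```

The exact tail is even below nx (Bernoulli's inequality); the geometric majorant only loosens this further, which is why the exponent is unaffected.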
(ii) Upper bound. Consider an optimal detection scheme for S_n, with the same setup as in the proof of Theorem 10.36. Recall in particular that the fusion center employs a randomized test with log-likelihood threshold τ_n and randomization constant γ_n.

For η to be an upper bound on the error exponent of β_n(α), it suffices that nη be greater than τ_n and that the events

    { Σ_{i=1}^n X_{g_i} ≤ nη }  and  { Σ_{i=1}^n X_{g_i} ≥ τ_n }

have significant overlap under P_g. Indeed, if η > τ_n/n is such that, for some ζ > 0 and all sufficiently large n,

    γ_n P_g{ Σ_{i=1}^n X_{g_i} = τ_n } + P_g{ nη ≥ Σ_{i=1}^n X_{g_i} > τ_n } ≥ ζ > 0,      (10.2.12)

then β_n(α) ≥ ζ exp{−nη}, as required.
The threshold τ_n is rather difficult to determine, so we use an indirect method for finding η. We have

    P_g{ Σ_{i=1}^n X_{g_i} > nη } ≤ P_g{ Σ_{i=1}^n |X_{g_i}| > nη } ≤ (1/η) sup_g E_P[|X_g|],

where the supremum is over all m-ary LRQs g and the last bound follows from the Markov inequality. We claim that the supremum in this relationship is finite. This is because the negative part of X_g has bounded expectation under P (see the discussion following Assumption 10.33), and the proof of Theorem 10.34 can easily be modified to show that λ′ ≜ sup_g E_P[|X_g|] is finite iff λ = limsup_{t→∞} E_P[X_t] is (which is our current hypothesis). Thus

    P_g{ Σ_{i=1}^n X_{g_i} > nη } ≤ λ′/η.

Now let δ > 0 and η = λ′/[(1 − δ)(1 − α)], so that P_g{ Σ_{i=1}^n X_{g_i} > nη } ≤ (1 − δ)(1 − α) < 1 − α. The Neyman–Pearson lemma immediately yields nη > τ_n. Also, using

    γ_n P_g{ Σ_{i=1}^n X_{g_i} = τ_n } + P_g{ Σ_{i=1}^n X_{g_i} > τ_n } = 1 − α

and a simple contradiction, we obtain (10.2.12). Thus the chosen value of η is an upper bound on limsup_{n→∞} −(1/n) log β_n(α). Since δ > 0 can be made arbitrarily small, we conclude that

    M(α) ≜ λ′/(1 − α)

is also an upper bound. □
From Theorem 10.38 we conclude that in Case B, the error exponent e_NP(α) must lie between the bounds L(α) and M(α) whenever it exists as a limit. (Since λ′ ≥ max{λ, D_m} and 1 − α ≤ log(1/α), the inequality M(α) ≥ L(α) is indeed true.) These bounds are shown in Figure 10.5.
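The relation between the two bounds can be illustrated numerically. In the sketch below, the values of λ and λ′ are hypothetical (the theorem only requires 0 < λ ≤ λ′ < ∞); the comparison M(α) ≥ L(α) rests on the elementary inequality 1 − α ≤ log(1/α).

```python
import math

# Hypothetical values for illustration only; any 0 < LAM <= LAM_PRIME works.
LAM, LAM_PRIME = 1.0, 1.5

def L_bound(alpha: float, lam: float = LAM) -> float:
    """Lower bound L(alpha) = lam / log(1/alpha); increases to infinity as alpha -> 1."""
    return lam / math.log(1.0 / alpha)

def M_bound(alpha: float, lam_prime: float = LAM_PRIME) -> float:
    """Upper bound M(alpha) = lam' / (1 - alpha); increases to infinity as alpha -> 1."""
    return lam_prime / (1.0 - alpha)

if __name__ == "__main__":
    grid = [i / 1000.0 for i in range(1, 1000)]
    # M dominates L everywhere on (0, 1) ...
    assert all(M_bound(a) >= L_bound(a) for a in grid)
    # ... and both bounds blow up as alpha approaches 1, as in Figure 10.5.
    assert L_bound(0.999) > L_bound(0.5) and M_bound(0.999) > M_bound(0.5)
```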
Theorem 10.39 (result for Case C) In Case C,

    e̲_NP(α) = ē_NP(α) = D(P‖Q).
Theorem 10.40 (result under boundedness assumption) Let ν ≤ 1 satisfy (10.2.5). If ν ≤ 1/2, or if ν > 1/2 and the observation space is finite, then

    β_n(α)/β̃_n(α) ≥ exp{−c′(α, ν) n^{1−ν}}.

In particular, if (10.2.5) holds for ν = 1, then the ratio β_n(α)/β̃_n(α) is bounded from below.
Figure 10.5: Upper and lower bounds on e_NP(α) in Case B.

10.2.2 Bayes testing in parallel distributed detection systems

We now turn to the asymptotic study of optimal Bayes detection in S_n. The prior probabilities of H_0 and H_1 are denoted by π and 1 − π, respectively; the probability of error of the absolutely optimal system is denoted by P_n(π), and the probability of error of the best identical-quantizer system is denoted by P̃_n(π).

In our analysis, we will always assume that S_n employs deterministic m-ary LRQs represented by LRPs t_1, . . . , t_n. This is clearly sufficient because randomization is of no help in Bayes testing.

Theorem 10.41 In Bayes testing with m-ary quantization,

    liminf_{n→∞} P_n(π)/P̃_n(π) > 0                                   (10.2.13)

for all π ∈ (0, 1).
10.3 Capacity region of multiple access channels

Definition 10.42 (discrete memoryless multiple access channel) A discrete memoryless multiple access channel consists of several senders (X_1, X_2, . . . , X_m) and one receiver Y, which are defined over the finite alphabets 𝒳_1, 𝒳_2, . . . , 𝒳_m and 𝒴, respectively. Also given is the transition probability P_{Y|X_1,X_2,...,X_m}.
For simplicity, we will focus on the system with only two senders. The block code for this simple multiple access channel is defined below.

Definition 10.43 (block code for multiple access channels) An (n, M_1, M_2) block code for a multiple access channel has block length n and rates R_1 = (1/n) log_2 M_1 and R_2 = (1/n) log_2 M_2, one for each sender, with encoders

    f_1 : {1, . . . , M_1} → 𝒳_1^n  and  f_2 : {1, . . . , M_2} → 𝒳_2^n.

Upon receipt of the channel output, the decoder is a mapping

    g : 𝒴^n → {1, . . . , M_1} × {1, . . . , M_2}.
Theorem 10.44 (capacity region of memoryless multiple access channel) The capacity region of a memoryless multiple access channel is the closure of the convex hull of the set

    { (R_1, R_2) ∈ [0, ∞)² : R_1 ≤ I(X_1; Y|X_2), R_2 ≤ I(X_2; Y|X_1)
                             and R_1 + R_2 ≤ I(X_1, X_2; Y) },

where the mutual informations are evaluated under independent inputs X_1 and X_2.

Proof: Again, the achievability part is proved using random-coding arguments and a typical-set decoder. The converse part is based on Fano's inequality. □
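As a concrete illustration of the three bounds, consider the noiseless binary adder channel Y = X_1 + X_2 (a standard example, not taken from this section). The sketch below evaluates the pentagon bounds for independent Bernoulli inputs; since the channel is deterministic, H(Y|X_1, X_2) = 0 and each bound reduces to an output entropy.

```python
import math
from collections import defaultdict

def entropy(dist) -> float:
    """Shannon entropy in bits of a {value: probability} dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)

def adder_mac_bounds(p1: float = 0.5, p2: float = 0.5):
    """Pentagon bounds (I(X1;Y|X2), I(X2;Y|X1), I(X1,X2;Y)) for the
    noiseless binary adder MAC Y = X1 + X2 with independent inputs
    P(X1 = 1) = p1 and P(X2 = 1) = p2."""
    p_y = defaultdict(float)
    for x1 in (0, 1):
        for x2 in (0, 1):
            prob = (p1 if x1 else 1.0 - p1) * (p2 if x2 else 1.0 - p2)
            p_y[x1 + x2] += prob
    # Deterministic channel: given X2, the map X1 -> Y is one-to-one,
    # so I(X1;Y|X2) = H(X1), and symmetrically for X2.
    i1 = entropy({0: 1.0 - p1, 1: p1})   # I(X1;Y|X2)
    i2 = entropy({0: 1.0 - p2, 1: p2})   # I(X2;Y|X1)
    i_sum = entropy(p_y)                 # I(X1,X2;Y) = H(Y)
    return i1, i2, i_sum
```

With uniform inputs this gives I(X_1;Y|X_2) = I(X_2;Y|X_1) = 1 bit but a sum-rate bound of only 1.5 bits, so the sum-rate constraint is the one that cuts the corner off the rectangle.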
10.4 Degraded broadcast channel

Definition 10.45 (broadcast channel) A broadcast channel consists of one input alphabet 𝒳 and two (or more) output alphabets 𝒴_1 and 𝒴_2. The noise is defined by the conditional probability P_{Y_1,Y_2|X}(y_1, y_2|x).

Example 10.46 Examples of broadcast channels are:

- the Cable Television (CATV) network;
- a lecturer in a classroom;
- Code Division Multiple Access channels.
Definition 10.47 (degraded broadcast channel) A broadcast channel is said to be degraded if

    P_{Y_1,Y_2|X}(y_1, y_2|x) = P_{Y_1|X}(y_1|x) P_{Y_2|Y_1}(y_2|y_1).

It can be verified that when X → Y_1 → Y_2 forms a Markov chain, i.e., P_{Y_2|Y_1,X}(y_2|y_1, x) = P_{Y_2|Y_1}(y_2|y_1), a degraded broadcast channel results. This indicates that the "parallel" broadcast channel degrades to a "serial" broadcast channel, in which the channel output Y_2 can obtain information about the channel input X only through the previous channel output Y_1.
Definition 10.48 (block code for broadcast channel) A block code for a broadcast channel consists of one encoder f(·, ·) and two (or more) decoders g_i(·):

    f : {1, . . . , 2^{nR_1}} × {1, . . . , 2^{nR_2}} → 𝒳^n,

and

    g_1 : 𝒴_1^n → {1, . . . , 2^{nR_1}},
    g_2 : 𝒴_2^n → {1, . . . , 2^{nR_2}},

where each decoder observes only its own channel output.
Definition 10.49 (error probability) Let the message index random variables be W_1 and W_2, namely W_1 ∈ {1, . . . , 2^{nR_1}} and W_2 ∈ {1, . . . , 2^{nR_2}}. Then the probability of error is defined as

    P_e ≜ Pr{ W_1 ≠ g_1(Y_1^n) or W_2 ≠ g_2(Y_2^n) },

where Y_1^n and Y_2^n are the channel outputs due to the input f(W_1, W_2).
Theorem 10.50 (capacity region for degraded broadcast channel) The capacity region of a memoryless degraded broadcast channel is the convex hull of the closure of

    ⋃_U { (R_1, R_2) : R_1 ≤ I(X; Y_1|U) and R_2 ≤ I(U; Y_2) },

where the union is taken over all U satisfying U → X → (Y_1, Y_2) with alphabet size |𝒰| ≤ min{|𝒳|, |𝒴_1|, |𝒴_2|}.
Note that U → X → Y_1 → Y_2 is equivalent to

    P_{U,X,Y_1,Y_2}(u, x, y_1, y_2)
        = P_U(u) P_{X|U}(x|u) P_{Y_1,Y_2|X,U}(y_1, y_2|x, u)
        = P_U(u) P_{X|U}(x|u) P_{Y_1,Y_2|X}(y_1, y_2|x)
        = P_U(u) P_{X|U}(x|u) P_{Y_1|X}(y_1|x) P_{Y_2|Y_1}(y_2|y_1).
Example 10.51 (capacity region for degraded BSC) Suppose P_{Y_1|X} and P_{Y_2|Y_1} are BSCs with crossover probabilities ε_1 and ε_2, respectively. Then the capacity region can be parameterized through β as:

    R_1 ≤ h_b(β ⋆ ε_1) − h_b(ε_1)
    R_2 ≤ 1 − h_b(β ⋆ (ε_1(1 − ε_2) + (1 − ε_1)ε_2)),

where a ⋆ b ≜ a(1 − b) + (1 − a)b, P_{X|U}(0|1) = P_{X|U}(1|0) = β, and U ∈ {0, 1}.
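The parameterized boundary of Example 10.51 is straightforward to evaluate numerically. The sketch below follows the formulas above; `h_b` is the binary entropy function and the helper `star` is the binary convolution a(1 − b) + (1 − a)b, i.e., the crossover probability of two cascaded BSCs.

```python
import math

def h_b(p: float) -> float:
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def star(a: float, b: float) -> float:
    """Binary convolution: crossover probability of two cascaded BSCs."""
    return a * (1.0 - b) + (1.0 - a) * b

def degraded_bsc_rates(beta: float, eps1: float, eps2: float):
    """Boundary rate pair (R1, R2) of the degraded-BSC broadcast region,
    parameterized by the auxiliary crossover beta = P_{X|U}(0|1) = P_{X|U}(1|0)."""
    r1 = h_b(star(beta, eps1)) - h_b(eps1)
    r2 = 1.0 - h_b(star(beta, star(eps1, eps2)))
    return r1, r2
```

At β = 1/2 the pair degenerates to (1 − h_b(ε_1), 0), i.e., all of the rate goes to the better receiver; at β = 0 it degenerates to (0, 1 − h_b(ε_1 ⋆ ε_2)), the capacity of the end-to-end cascade.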
Example 10.52 (capacity region for degraded AWGN channel) The channel is modeled as

    Y_1 = X + N_1  and  Y_2 = Y_1 + N_2,

where the noise powers of N_1 and N_2 are σ_1² and σ_2², respectively. Then the capacity region under input power constraint S is given by

    R_1 ≤ (1/2) log_2 (1 + βS/σ_1²)
    R_2 ≤ (1/2) log_2 (1 + (1 − β)S / (βS + σ_1² + σ_2²)),

for any β ∈ [0, 1].
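A quick numerical sketch of Example 10.52; the symbols match the formulas above, and the specific values S = 10, σ_1² = 1, σ_2² = 4 are chosen only for illustration.

```python
import math

def degraded_awgn_rates(beta: float, S: float, sigma1_sq: float, sigma2_sq: float):
    """Boundary rate pair (R1, R2) of the degraded AWGN broadcast channel
    for power-split parameter beta in [0, 1]."""
    r1 = 0.5 * math.log2(1.0 + beta * S / sigma1_sq)
    r2 = 0.5 * math.log2(1.0 + (1.0 - beta) * S
                         / (beta * S + sigma1_sq + sigma2_sq))
    return r1, r2

if __name__ == "__main__":
    S, s1, s2 = 10.0, 1.0, 4.0
    # beta = 1: all power to the stronger user, so R2 = 0.
    r1, r2 = degraded_awgn_rates(1.0, S, s1, s2)
    assert abs(r1 - 0.5 * math.log2(1.0 + S / s1)) < 1e-12 and r2 == 0.0
    # beta = 0: all power to the weaker user, so R1 = 0.
    r1, r2 = degraded_awgn_rates(0.0, S, s1, s2)
    assert r1 == 0.0
```

Note that the power βS allotted to the stronger user appears as interference in the weaker user's SNR, which is exactly the superposition-coding trade-off.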
10.5 Gaussian multiple terminal channels

In the previous chapter, we dealt with the Gaussian channel with possibly several inputs, and found that the best power distribution among the inputs follows the water-filling scheme. In that problem, the encoder is defined as

    f : ℝ^m → ℝ,

which means that all the inputs are observed by the same terminal and hence can be utilized in a centralized fashion. For Gaussian multiple terminal channels, however, the inputs are observed by different terminals in a distributed fashion, and hence need to be encoded separately. In other words, the encoder now becomes

    f_1 : ℝ → ℝ
    f_2 : ℝ → ℝ
        .
        .
        .
    f_m : ℝ → ℝ.

So we now have m (independent) transmitters and one receiver in the system. The system can be modeled as

    Y = Σ_{i=1}^m X_i + N.
Theorem 10.53 (capacity region for AWGN multiple access channel) Suppose each transmitter has (constant) power constraint S_i, and let I denote a subset of {1, 2, . . . , m}. Then the capacity region is

    { (R_1, . . . , R_m) : (∀ I)  Σ_{i∈I} R_i ≤ (1/2) log_2 (1 + Σ_{i∈I} S_i / σ²) },

where σ² is the noise power of N.

Note that S_i/σ² is the signal-to-noise ratio (SNR) of each terminal. So if one can afford an unlimited number of terminals, each with fixed transmitting power S, one can make the channel capacity as large as desired. On the other hand, if the total power of the system is fixed, then distributing the power among m terminals can only reduce the capacity, due to the distributed encoding.
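The closing observation can be checked directly from the sum-rate bound of Theorem 10.53. The sketch below is illustrative only: with a fixed per-terminal power the sum-rate bound grows without limit in m, whereas with a fixed total power the bound does not depend on how the power is split among the terminals (the loss relative to centralized encoding comes from the inability to correlate the inputs, not from this bound).

```python
import math

def awgn_mac_sum_rate_bound(powers, sigma_sq: float) -> float:
    """Sum-rate bound (1/2) log2(1 + sum_i S_i / sigma^2) in bits per use,
    obtained by taking I = {1, ..., m} in Theorem 10.53."""
    return 0.5 * math.log2(1.0 + sum(powers) / sigma_sq)

if __name__ == "__main__":
    sigma_sq = 1.0
    # Fixed per-terminal power S = 1: the bound is unbounded in m.
    caps = [awgn_mac_sum_rate_bound([1.0] * m, sigma_sq) for m in (1, 10, 100)]
    assert caps[0] < caps[1] < caps[2]
    # Fixed total power: splitting it among m terminals changes nothing.
    total = 10.0
    single = awgn_mac_sum_rate_bound([total], sigma_sq)
    for m in (2, 5, 10):
        assert abs(awgn_mac_sum_rate_bound([total / m] * m, sigma_sq) - single) < 1e-12
```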