You are on page 1of 293

Lecture Notes in Information Theory

Volume II
by
Po-Ning Chen

and Fady Alajaji

Department of Communications Engineering


National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 300
Republic of China
Email: poning@cc.nctu.edu.tw

Department of Mathematics & Statistics,


Queens University, Kingston, ON K7L 3N6, Canada
Email: fady@polya.mast.queensu.ca
September 23, 2009
c _ Copyright by
Po-Ning Chen

and Fady Alajaji

September 23, 2009


Preface
The reliable transmission of information bearing signals over a noisy commu-
nication channel is at the heart of what we call communication. Information
theoryfounded by Claude E. Shannon in 1948provides a mathematical frame-
work for the theory of communication; it describes the fundamental limits to how
eciently one can encode information and still be able to recover it with neg-
ligible loss. This course will examine the basic concepts of this theory. What
follows is a tentative list of topics to be covered.
1. Volume I:
(a) Fundamentals of source coding (data compression): Discrete memo-
ryless sources, entropy, redundancy, block encoding, variable-length
encoding, Kraft inequality, Shannon code, Human code.
(b) Fundamentals of channel coding: Discrete memoryless channels, mu-
tual information, channel capacity, coding theorem for discrete mem-
oryless channels, weak converse, channel capacity with output feed-
back, the Shannon joint source-channel coding theorem.
(c) Source coding with distortion (rate distortion theory): Discrete mem-
oryless sources, rate-distortion function and its properties, rate-dis-
tortion theorem.
(d) Other topics: Information measures for continuous random variables,
capacity of discrete-time and band-limited continuous-time Gaussian
channels, rate-distortion function of the memoryless Gaussian source,
encoding of discrete sources with memory, capacity of discrete chan-
nels with memory.
(e) Fundamental backgrounds on real analysis and probability (Appen-
dix): The concept of set, supremum and maximum, inmum and
minimum, boundedness, sequences and their limits, equivalence, pro-
bability space, random variable and random process, relation between
a source and a random process, convergence of sequences of random
variables, ergodicity and laws of large numbers, central limit theorem,
concavity and convexity, Jensens inequality.
ii
2. Volumn II:
(a) General information measure: Information spectrum and Quantile
and their properties, Renyis informatino measures.
(b) Advanced topics of losslesss data compression: Fixed-length lossless
data compression theorem for arbitrary channels, Variable-length loss-
less data compression theorem for arbitrary channels, entropy of En-
glish, Lempel-Ziv code.
(c) Measure of randomness and resolvability: Resolvability and source
coding, approximation of output statistics for arbitrary channels.
(d) Advanced topics of channel coding: Channel capacity for arbitrary
single-user channel, optimistic Shannon coding theorem, strong ca-
pacity, -capacity.
(e) Advanced topics of lossy data compressing
(f) Hypothesis testing: Error exponent and divergence, large deviations
theory, Berry-Esseen theorem.
(g) Channel reliability: Random coding exponent, expurgated exponent,
partitioning exponent, sphere-packing exponent, the asymptotic lar-
gest minimum distance of block codes, Elias bound, Varshamov-Gil-
bert bound, Bhattacharyya distance.
(h) Information theory of networks: Distributed detection, data com-
pression over distributed source, capacity of multiple access channels,
degraded broadcast channel, Gaussian multiple terminal channels.
As shown from the list, the lecture notes are divided into two volumes. The
rst volume is suitable for a 12-week introductory course as that given at the
department of mathematics and statistics in Queens university of Canada. It
also meets the need of a fundamental course for undergraduates as that given at
the department of computer science and information engineering in national Chi
Nan university of Taiwan. For a 18-week graduate course as given in department
of communications engineering, National Chiao-Tung university of Taiwan, the
lecturer can selectively add advanced topics covered in the second volume to
enrich the lecture content, and provide a more complete and advanced view on
information theory to students.
The authors are very much indebted to all the people who provide valuable
comments to these lecture notes. Special thanks are devoted to Prof. Yunghsiang
S. Han with the department of computer science and information engineering in
national Chi Nan university of Taiwan for his enthusiasm to test these lecture
notes at his school, and bringing to the authors with valuable suggestions.
iii
Notes to readers. In these notes, all the assumptions, claims, conjectures,
corollaries, denitions, examples, exercises, lemmas, observations, properties,
and theorems are numbered under the same counter for ease of their searching.
For example, the lemma that immediately follows Theorem 2.1 will be numbered
as Lemma 2.2, instead of Lemma 2.1.
In addition, you may obtain the latest version of the lecture notes from
http://shannon.cm.nctu.edu.tw. Interested readers are welcome to return com-
ments to poning@faculty.nctu.edu.tw.
iv
Acknowledgements
Thanks are given to our families for their full support during the period of
our writing these lecture notes.
v
Table of Contents
Chapter Page
List of Tables x
List of Figures xi
1 Introduction 1
1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Generalized Information Measures for Arbitrary System Statis-
tics 3
2.1 Spectrum and Quantile . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Properties of quantile . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Generalized information measures . . . . . . . . . . . . . . . . . . 11
2.4 Properties of generalized information measures . . . . . . . . . . . 12
2.5 Examples for the computation of general information measures . . 19
2.6 Renyis information measures . . . . . . . . . . . . . . . . . . . . 24
3 General Lossless Data Compression Theorems 28
3.1 Fixed-length data compression codes for arbitrary sources . . . . . 29
3.2 Generalized AEP theorem . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Variable-length lossless data compression codes
that minimizes the exponentially weighted codeword length . . . . 38
3.3.1 Criterion for optimality of codes . . . . . . . . . . . . . . . 38
3.3.2 Source coding theorem for Renyis entropy . . . . . . . . . 39
3.4 Entropy of English . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 Markov estimate of entropy rate of English text . . . . . . 40
3.4.2 Gambling estimate of entropy rate of English text . . . . . 42
A) Sequential gambling . . . . . . . . . . . . . . . . . . . . 42
3.5 Lempel-Ziv code revisited . . . . . . . . . . . . . . . . . . . . . . 44
vi
4 Measure of Randomness for Stochastic Processes 53
4.1 Motivation for resolvability : measure of randomness of random
variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Notations and denitions regarding to resolvability . . . . . . . . 54
4.3 Operational meanings of resolvability and mean-resolvability . . . 58
4.4 Resolvability and source coding . . . . . . . . . . . . . . . . . . . 65
5 Channel Coding Theorems and Approximations of Output Sta-
tistics for Arbitrary Channels 76
5.1 General models for channels . . . . . . . . . . . . . . . . . . . . . 76
5.2 Variations of capacity formulas for arbitrary channels . . . . . . . 76
5.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 -capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.3 General Shannon capacity . . . . . . . . . . . . . . . . . . 87
5.2.4 Strong capacity . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 Structures of good data transmission codes . . . . . . . . . . . . . 91
5.4 Approximations of output statistics: resolvability for channels . . 93
5.4.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.2 Notations and denitions of resolvability for channels . . . 93
5.4.3 Results on resolvability and mean-resolvability for channels 95
6 Optimistic Shannon Coding Theorems for Arbitrary Single-User
Systems 98
6.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Optimistic source coding theorems . . . . . . . . . . . . . . . . . 99
6.3 Optimistic channel coding theorems . . . . . . . . . . . . . . . . . 102
6.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4.1 Information stable channels . . . . . . . . . . . . . . . . . 106
6.4.2 Information unstable channels . . . . . . . . . . . . . . . . 106
7 Lossy Data Compression 111
7.1 General lossy source compression for block codes . . . . . . . . . . 111
8 Hypothesis Testing 119
8.1 Error exponent and divergence . . . . . . . . . . . . . . . . . . . . 119
8.1.1 Composition of sequence of i.i.d. observations . . . . . . . 122
8.1.2 Divergence typical set on composition . . . . . . . . . . . . 128
8.1.3 Universal source coding on compositions . . . . . . . . . . 128
8.1.4 Likelihood ratio versus divergence . . . . . . . . . . . . . . 132
8.1.5 Exponent of Bayesian cost . . . . . . . . . . . . . . . . . . 133
8.2 Large deviations theory . . . . . . . . . . . . . . . . . . . . . . . . 135
vii
8.2.1 Tilted or twisted distribution . . . . . . . . . . . . . . . . 135
8.2.2 Conventional twisted distribution . . . . . . . . . . . . . . 135
8.2.3 Cramers theorem . . . . . . . . . . . . . . . . . . . . . . . 136
8.2.4 Exponent and moment generating function: an example . . 137
8.3 Theories on Large deviations . . . . . . . . . . . . . . . . . . . . . 140
8.3.1 Extension of G artner-Ellis upper bounds . . . . . . . . . . 140
8.3.2 Extension of G artner-Ellis lower bounds . . . . . . . . . . 147
8.3.3 Properties of (twisted) sup- and inf-large deviation rate
functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.4 Probabilitic subexponential behavior . . . . . . . . . . . . . . . . 159
8.4.1 Berry-Esseen theorem for compound i.i.d. sequence . . . . 159
8.4.2 Berry-Esseen Theorem with a sample-size dependent mul-
tiplicative coecient for i.i.d. sequence . . . . . . . . . . . 168
8.4.3 Probability bounds using Berry-Esseen inequality . . . . . 170
8.5 Generalized Neyman-Pearson Hypothesis Testing . . . . . . . . . 176
9 Channel Reliability Function 181
9.1 Random-coding exponent . . . . . . . . . . . . . . . . . . . . . . 181
9.1.1 The properties of random coding exponent . . . . . . . . . 185
9.2 Expurgated exponent . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.2.1 The properties of expurgated exponent . . . . . . . . . . . 193
9.3 Partitioning bound: an upper bounds for channel reliability . . . . 194
9.4 Sphere-packing exponent: an upper bound of the channel reliability201
9.4.1 Problem of sphere-packing . . . . . . . . . . . . . . . . . . 201
9.4.2 Relation of sphere-packing and coding . . . . . . . . . . . 201
9.4.3 The largest minimum distance of block codes . . . . . . . . 205
A) Distance-spectrum formula on the largest minimum
distance of block codes . . . . . . . . . . . . . . . 208
B) Determination of the largest minimum distance for a
class of distance functions . . . . . . . . . . . . . 214
C) General properties of distance-spectrum function . . . . 219
D) General Varshamov-Gilbert lower bound . . . . . . . . 227
9.4.4 Elias bound: a single-letter upper bound formula on the
largest minimum distance for block codes . . . . . . . . . . 233
9.4.5 Gilbert bound and Elias bound for Hamming distance and
binary alphabet . . . . . . . . . . . . . . . . . . . . . . . . 241
9.4.6 Bhattacharyya distance and expurgated exponent . . . . . 242
9.5 Straight line bound . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10 Information Theory of Networks 252
10.1 Lossless data compression over distributed sources for block codes 253
10.1.1 Full decoding of the original sources . . . . . . . . . . . . . 253
viii
10.1.2 Partial decoding of the original sources . . . . . . . . . . . 258
10.2 Distributed detection . . . . . . . . . . . . . . . . . . . . . . . . . 260
10.2.1 Neyman-Pearson testing in parallel distributed detection . 267
10.2.2 Bayes testing in parallel distributed detection systems . . . 274
10.3 Capacity region of multiple access channels . . . . . . . . . . . . . 275
10.4 Degraded broadcast channel . . . . . . . . . . . . . . . . . . . . . 276
10.5 Gaussian multiple terminal channels . . . . . . . . . . . . . . . . 278
ix
List of Tables
Number Page
2.1 Generalized entropy measures where [0, 1]. . . . . . . . . . . . 13
2.2 Generalized mutual information measures where [0, 1]. . . . . 14
2.3 Generalized divergence measures where [0, 1]. . . . . . . . . . 15
x
List of Figures
Number Page
2.1 The asymptotic CDFs of a sequence of random variables, A
n

n=1
.
u() = sup-spectrum of A
n
; u

() = inf-spectrum of A
n
; U

1
=
lim
1
U


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 The spectrum h() for Example 2.7. . . . . . . . . . . . . . . . . . 22
2.3 The spectrum

i() for Example 2.7. . . . . . . . . . . . . . . . . . 22
2.4 The limiting spectrums of (1/n)h
Z
n(Z
n
) for Example 2.8 . . . . . 23
2.5 The possible limiting spectrums of (1/n)i
X
n
,Y
n(X
n
; Y
n
) for Ex-
ample 2.8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Behavior of the probability of block decoding error as block length
n goes to innity for an arbitrary source X. . . . . . . . . . . . . 35
3.2 Illustration of generalized AEP Theorem. T
n
(; ) T
n
[

H

(X) +
] T
n
[

H

(X) ] is the dashed region. . . . . . . . . . . . . . . . 37


3.3 Notations used in Lempel-Ziv coder. . . . . . . . . . . . . . . . . 48
4.1 Source generator: X
t

tI
(I = (0, 1)) is an independent random
process with P
Xt
(0) = 1 P
Xt
(1) = t, and is also independent of
the selector Z, where X
t
is outputted if Z = t. Source generator
of each time instance is independent temporally. . . . . . . . . . . 72
4.2 The ultimate CDF of (1/n) log P
X
n(X
n
): Prh
b
(Z) t. . . . . 74
5.1 The ultimate CDFs of (1/n) log P
N
n(N
n
). . . . . . . . . . . . . . 89
5.2 The ultimate CDF of the normalized information density for Ex-
ample 5.14-Case B). . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 The communication system. . . . . . . . . . . . . . . . . . . . . . 94
5.4 The simulated communication system. . . . . . . . . . . . . . . . 94
7.1
Z,f (Z)
(D + ) > sup[ :
Z,f (Z)
() ] D + . . . . . . . . 115
7.2 The CDF of (1/n)
n
(Z
n
, f
n
(Z
n
)) for the probability-of-error dis-
tortion measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.1 The geometric meaning for Sanovs theorem. . . . . . . . . . . . . 127
xi
8.2 The divergence view on hypothesis testing. . . . . . . . . . . . . . 134
8.3 Function of (/6)u h(u). . . . . . . . . . . . . . . . . . . . . . . 168
8.4 The Berry-Esseen constant as a function of the sample size n. The
sample size n is plotted in log-scale. . . . . . . . . . . . . . . . . . 170
9.1 BSC channel with crossover probability and input distribution
(p, 1 p). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.2 Random coding exponent for BSC with crossover probability 0.2.
Also plotted is s

= arg sup
0s1
[sR E
0
(s)]. R
cr
= 0.056633. . 188
9.3 Expurgated exponent (solid line) and random coding exponent
(dashed line) for BSC with crossover probability 0.2 (over the
range of (0, 0.192745)). . . . . . . . . . . . . . . . . . . . . . . . . 195
9.4 Expurgated exponent (solid line) and random coding exponent
(dashed line) for BSC with crossover probability 0.2 (over the
range of (0, 0.006)). . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.5 Partitioning exponent (thick line), random coding exponent (thin
line) and expurgated exponent (thin line) for BSC with crossover
probability 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
9.6 (a) The shaded area is |
c
m
; (b) The shaded area is /
c
m,m
. . . . . . 247
9.7

X
(R) asymptotically lies between ess inf(1/n)(

X
n
, X
n
) and
(1/n)E[
n
(

X
n
, X
n
)[
n
(

X
n
, X
n
)] for

R
p
(X) < R <

R
0
(X). . . . . 248
9.8 General curve of

X
(R). . . . . . . . . . . . . . . . . . . . . . . . 248
9.9 Function of sup
s>0
_
sR s log
_
2s
_
1 e
1/2s
__
. . . . . . . . . . . 249
9.10 Function of sup
s>0
_
sR s log
__
2 + e
1/s
_
/4
_
. . . . . . . . . . . 249
10.1 The multi-access channel. . . . . . . . . . . . . . . . . . . . . . . 253
10.2 Distributed detection with n senders. Each observations Y
i
may
come from one of two categories. The nal decision T H
0
, H
1
. 254
10.3 Distributed detection in o
n
. . . . . . . . . . . . . . . . . . . . . . 260
10.4 Bayes error probabilities associated with g and g. . . . . . . . . . 266
10.5 Upper and lower bounds on e

NP
() in Case B. . . . . . . . . . . . 275
xii
Chapter 1
Introduction
This volume will talk about some advanced topics in information theory. The
mathematical background on which these topics are based can be found in Ap-
pendices A and B of Volume I.
1.1 Notations
Here, we clarify some of the easily confused notations used in Volume II of the
lecture notes.
For a random variable X, we use P
X
to denote its distribution. For con-
venience, we will use interchangeably the two expressions for the probability of
X = x, i.e., PrX = x and P
X
(x). Similarly, for the probability of a set char-
acterizing through an inequality, such as f(x) < a, its probability mass will be
expressed by either
P
X
x A : f(x) < a
or
Pr f(X) < a .
In the second expression, we view f(X) as a new random variable dened through
X and a function f().
Obviously, the above expressions can be applied to any legitimate function
f() dened over A, including any probability function P

X
() (or log P

X
(x)) of a
random variable

X. Therefore, the next two expressions denote the probability
of f(x) = P

X
(x) < a evaluated under distribution P
X
:
P
X
x A : f(x) < a = P
X
x A : P

X
(x) < a
and
Pr f(X) < a = Pr P

X
(X) < a .
1
As a result, if we write
P
X,Y
_
(x, y) A : log
P

X,

Y
(x, y)
P

X
(x)P

Y
(y)
< a
_
= Pr
_
log
P

X,

Y
(X, Y )
P

X
(X)P

Y
(Y )
< a
_
,
it means that we dene a new function
f(x, y) log
P

X,

Y
(x, y)
P

X
(x)P

Y
(y)
in terms of the joint distribution P

X,

Y
and its two marginal distributions, and
concern the probability of f(x, y) < a where x and y have distribution P
X,Y
.
2
Chapter 2
Generalized Information Measures for
Arbitrary System Statistics
In Volume I of the lecture notes, we show that the entropy, dened by
H(X)

x.
P
X
(x) log P
X
(x) = E
X
[log P
X
(X)] nats,
of a discrete random variable X is a measure of the average amount of uncertainty
in X. An extension denition of entropy to a sequence of random variables
X
1
, X
2
, . . . , X
n
, . . . is the entropy rate, which is given by
lim
n
1
n
H(X
n
) = lim
n
1
n
E [log P
X
n(X
n
)] ,
assuming the limit exists. The above quantities have an operational signicance
established via Shannons coding theorems when the stochastic systems under
consideration satisfy certain regularity conditions, such as stationarity and ergod-
icity [3, 5]. However, in more complicated situations such as when the systems
are non-stationary or with time-varying nature, these information rates are no
longer valid and lose their operational signicance. This results in the need to
establish new entropy measures which appropriately characterize the operational
limits of arbitrary stochastic systems.
Let us begin with the model of arbitrary system statistics. In general, there
are two indices for observations: time index and space index. When a sequence
of observations is denoted by X
1
, X
2
, . . . , X
n
, . . ., the subscript i of X
i
can be
treated as either a time index or a space index, but not both. Hence, when a
sequence of observations are functions of both time and space, the notation of
X
1
, X
2
, . . . , X
n
, . . ., is by no means sucient; and therefore, a new model for a
time-varying multiple-sensor system, such as
X
(n)
1
, X
(n)
2
, . . . , X
(n)
t
, . . . ,
3
where t is the time index and n is the space or position index (or vice versa),
becomes signicant.
When block-wise compression of such source (with block length n) is consid-
ered, same question as to the compression of i.i.d. source arises:
what is the minimum compression rate (bits per source
sample) for which the error can be made arbitrarily small
as the block length goes to innity?
(2.0.1)
To answer the question, information theorists have to nd a sequence of data
compression codes for each block length n and investigate if the decompression
error goes to zero as n approaches innity. However, unlike those simple source
models considered in Volume I, the arbitrary source for each block length n may
exhibit distinct statistics at respective sample, i.e.,
n = 1 : X
(1)
1
n = 2 : X
(2)
1
, X
(2)
2
n = 3 : X
(3)
1
, X
(3)
2
, X
(3)
3
n = 4 : X
(4)
1
, X
(4)
2
, X
(4)
3
, X
(4)
4
(2.0.2)
.
.
.
and the statistics of X
(4)
1
could be dierent from X
(1)
1
, X
(2)
1
and X
(3)
1
. Since it is
the most general model for the question in (2.0.1), and the system statistics can
be arbitrarily dened, it is therefore named arbitrary statistics system.
In notation, the triangular array of random variables in (2.0.2) is often de-
noted by a boldface letter as
X X
n

n=1
,
where X
n

_
X
(n)
1
, X
(n)
2
, . . . , X
(n)
n
_
; for convenience, the above statement is
sometimes briefed as
X
_
X
n
=
_
X
(n)
1
, X
(n)
2
, . . . , X
(n)
n
__

n=1
.
In this chapter, we will rst introduce a new concept on dening information
measures for arbitrary system statistics and then, analyze in detail their algebraic
properties. In the next chapter, we will utilize the new measures to establish
general source coding theorems for arbitrary nite-alphabet sources.
4
2.1 Spectrum and Quantile
Denition 2.1 (inf/sup-spectrum) If A
n

n=1
is a sequence of random vari-
ables, then its inf-spectrum u() and its sup-spectrum u() are dened by
u() liminf
n
PrA
n
,
and
u() limsup
n
PrA
n
.
In other words, u() and u() are respectively the liminf and the limsup of
the cumulative distribution function (CDF) of A
n
. Note that by denition, the
CDF of A
n
PrA
n
is non-decreasing and right-continuous. However,
for u() and u(), only the non-decreasing property remains.
1
Denition 2.2 (quantile of inf/sup-spectrum) For any 0 1, the
1
It is pertinent to also point out that even if we do not require right-continuity as a funda-
mental property of a CDF, the spectrums u() and u() are not necessarily legitimate CDFs of
(conventional real-valued) random variables since there might exist cases where the probabi-
lity mass escapes to innity (cf. [1, pp. 346]). A necessary and sucient condition for u()
and u() to be conventional CDFs (without requiring right-continuity) is that the sequence
of distribution functions of A
n
is tight [1, pp. 346]. Tightness is actually guaranteed if the
alphabet of A
n
is nite.
5
quantiles
2
U

and

U

of the sup-spectrum and the inf-spectrum are dened by


3
U

sup : u()
and

sup : u() ,
respectively. It follows from the above denitions that U

and

U

are right-
continuous and non-decreasing in . Note that the supremum of an empty set is
dened to be .
Based on the above denitions, the liminf in probability U of A
n

n=1
[4],
which is dened as the largest extended real number such that for all > 0,
lim
n
Pr[A
n
U ] = 0,
satises
4
U = lim
0
U

= U
0
.
2
Generally speaking, one can dene quantile in four dierent ways:

sup : liminf
n
Pr[A
n
]

sup : liminf
n
Pr[A
n
] <

U
+

sup : liminf
n
Pr[A
n
< ]

U
+

sup : liminf
n
Pr[A
n
< ] < .
The general relation of these four quantities are

U



U
+



U
+

. Obviously,

U



U
+

and

U



U
+

by their denitions. It remains to show that



U
+

.
Suppose

U
+

>

U

+ for some > 0. Then by denition of



U
+

, liminf
n
Pr[A
n
<

+] < , which implies liminf


n
Pr[A
n


U

+/2] liminf
n
Pr[A
n
<

U

+] <
and violates the denition of

U

. This completes the proof of



U
+

.
It is worth noting that

U

= lim

. Their equality can be proved by rst observing


that

U

lim

by their denitions, and then assuming that


_

_
lim

> 0.
Then

U

<

U

+/2 (

) /2 implies that u((

) /2) > for arbitrarily close to


from below, which in turn implies u((

) /2) , contradicting to the denition of



U

.
In the lecture notes, we will interchangeably use

U

and lim

for convenience.
The nal note is that

U
+

and

U
+

will not be used in dening our general information


measures. They are introduced only for mathematical interests.
3
Note that the usual denition of the quantile function () of a non-decreasing function
F() is slightly dierent from our denition [1, pp. 190], where () sup : F() < .
Remark that if F() is strictly increasing, then the quantile is nothing but the inverse of F():
() = F
1
().
4
It is obvious from their denitions that
lim
0
U

U
0
U.
6
Also, the limsup in probability

U (cf. [4]), dened as the smallest extended real
number such that for all > 0,
lim
n
Pr[A
n


U + ] = 0,
is exactly
5

U = lim
1

= sup : u() < 1,


Straightforwardly by their denitions,
U U



U
for [0, 1).
Remark that U

and

U

always exist. Furthermore, if U

=

U

for all in
[0, 1], the sequence of random variables A
n
converges in distribution to a random
variable A, provided the sequence of A
n
is tight.
For a better understanding of the quantities dened above, we depict them
in Figure 2.1.
2.2 Properties of quantile
Lemma 2.3 Consider two random sequences, A
n

n=1
and B
n

n=1
. Let u()
and u() be respectively the sup- and inf-spectrums of A
n

n=1
. Similarly, let
v() and v() denote respectively the sup- and inf-spectrums of B
n

n=1
. Dene
The equality of lim
0
U

and U can be proved by contradiction by rst assuming


lim
0
U

U > 0.
Then u(U + /2) for arbitrarily small > 0, which immediately implies u(U + /2) = 0,
contradicting to the denition of U.
5
Since 1 = lim
n
PrA
n
<

U + lim
n
PrA
n


U + = u(

U +), it is straight-
forward that

U sup : u() < 1 = lim


1

.
The equality of

U and lim
1

U

can be proved by contraction by rst assuming that




U lim
1

> 0.
Then 1 u(

U/2) > for arbitrarily close to 1, which implies u(

U/2) = 1. Accordingly,
by
1 liminf
n
PrA
n
<

U /4 liminf
n
PrA
n


U /2 = u(

U /2) = 1,
we obtain the desired contradiction.
7
-
6
u() u()
U

U

0
1
U
1

U
0

U

Figure 2.1: The asymptotic CDFs of a sequence of random variables,


A
n

n=1
. u() = sup-spectrum of A
n
; u

() = inf-spectrum of A
n
; U

1
=
lim
1
U


.
U

and

U

be the quantiles of the sup- and inf-spectrums of A


n

n=1
; also dene
V

and

V

be the quantiles of the sup- and inf-spectrums of B


n

n=1
.
Now let (u + v)() and (u + v)() denote the sup- and inf-spectrums of sum
sequence A
n
+ B
n

n=1
, i.e.,
(u + v)() limsup
n
PrA
n
+ B
n
,
and
(u + v)() liminf
n
PrA
n
+ B
n
.
Again, dene (U + V )

and (U + V )

be the quantiles with respect to (u + v)()


and (u + v)().
Then the following statements hold.
1. U

and

U

are both non-decreasing and right-continuous functions of for


[0, 1].
2. lim
0
U

= U
0
and lim
0

U

=

U
0
.
3. For 0, 0, and + 1,
(U + V )
+
U

+ V

, (2.2.1)
8
and
(U + V )
+
U

+

V

. (2.2.2)
4. For 0, 0, and + 1,
(U + V )

U
+
+

V
(1)
, (2.2.3)
and
(U + V )



U
+
+

V
(1)
. (2.2.4)
Proof: The proof of property 1 follows directly from the denitions of U

and

and the fact that the inf-spectrum and the sup-spectrum are non-decreasing
in .
The proof of property 2 can be proved by contradiction as follows. Suppose
lim
0
U

> U
0
+ for some > 0. Then for any > 0,
u(U
0
+ /2) .
Since the above inequality holds for every > 0, and u() is a non-negative
function, we obtain u(U
0
+ /2) = 0, which contradicts to the denition of U
0
.
We can prove lim
0

U

=

U
0
in a similar fashion.
To show (2.2.1), we observe that for > 0,
limsup
n
PrA
n
+ B
n
U

+V

2
limsup
n
(PrA
n
U

+ PrB
n
V

)
limsup
n
PrA
n
U

+ limsup
n
PrB
n
V


+ ,
which, by denition of (U + V )
+
, yields
(U + V )
+
U

+V

2.
The proof is completed by noting that can be made arbitrarily small.
Similarly, we note that for > 0,
liminf
n
PrA
n
+ B
n
U

+

V

2
liminf
n
_
PrA
n
U

+ PrB
n


V


_
limsup
n
PrA
n
U

+ liminf
n
PrB
n


V


+ ,
9
which, by denition of (U + V )
+
and arbitrarily small , proves (2.2.2).
To show (2.2.3), we rst observe that (2.2.3) trivially holds when = 1 (and
= 0). It remains to prove its validity under < 1. Remark from (2.2.1) that
(U + V )

+ (V )

(U + V V )
+
= U
+
.
Hence,
(U + V )

U
+
(V )

.
(Note that the case = 1 is not allowed here because it results in U
1
= (V )
1
=
, and the subtraction between two innite terms is undened. That is why we
need to exclude the case of = 1 for the subsequent proof.) The proof is then
completed by showing that
(V )



V
(1)
. (2.2.5)
By denition,
(v)() limsup
n
Pr B
n

= 1 liminf
n
Pr B
n
<
Then

V
(1)
sup : v() 1
= sup : liminf
n
Pr[B
n
] 1
sup : liminf
n
Pr[B
n
< ] < 1 (cf. footnote 1)
= sup : liminf
n
Pr[B
n
< ] < 1
= sup : 1 limsup
n
Pr[B
n
] < 1
= sup : limsup
n
Pr[B
n
] >
= inf : limsup
n
Pr[B
n
] >
= sup : limsup
n
Pr[B
n
]
= sup : (v)()
= (V )

.
Finally, to show (2.2.4), we again note that it is trivially true for = 1. We then
observe from (2.2.2) that (U + V )

+ (V )

(U + V V )
+
=

U
+
. Hence,
(U + V )



U
+
(V )

.
Using (2.2.5), we have the desired result. 2
10
2.3 Generalized information measures
In Denitions 2.1 and 2.2, if we let the random variable A
n
equal the normalized
entropy density
6
1
n
h
X
n(X
n
)
1
n
log P
X
n(X
n
)
of an arbitrary source
X =
_
X
n
=
_
X
(n)
1
, X
(n)
2
, . . . , X
(n)
n
__

n=1
,
we obtain two generalized entropy measures for X, i.e.,
-inf-entropy rate H

(X) = quantile of sup-spectrum of


1
n
h
X
n(X
n
)
-sup-entropy rate

H

(X) = quantile of inf-spectrum of


1
n
h
X
n(X
n
).
Note that the inf-entropy-rateH(X) and the sup-entropy-rate

H(X) introduced
in [4] are special cases of the -inf/sup-entropy rate measures:
H(X) =H
0
(X), and

H(X) = lim
1

(X).
In concept, we can image that if the random variable (1/n)h(X
n
) exhibits a
limiting distribution, then the sup-entropy rate is the right-margin of the support
of that limiting distribution. For example, suppose that the limiting distribution
of (1/n)h
X
n(X
n
) is positive over (2, 2), and zero, otherwise. Then

H(X) = 2.
Similarly, the inf-entropy rate is the left margin of the support of the limiting
random variable limsup
n
(1/n)h
X
n(X
n
), which is 2 for the same example.
Analogously, for an arbitrary channel
W = (Y [X) =
_
W
n
=
_
W
(n)
1
, . . . , W
(n)
n
__

n=1
,
or more specically, P
W
= P
Y[X
= P
Y
n
[X
n

n=1
with input X and output Y , if
we replace A
n
in Denitions 2.1 and 2.2 by the normalized information density
1
n
i
X
n
W
n(X
n
; Y
n
) =
1
n
i
X
n
,Y
n(X
n
; Y
n
)
1
n
log
P
X
n
,Y
n(X
n
, Y
n
)
P
X
n(X
n
)P
Y
n(Y
n
)
,
we get the -inf/sup-information rates, denoted respectively by I

(X; Y ) and

(X; Y ), as:
6
The random variable h
X
n(X
n
) log P
X
n(X
n
) is named the entropy density. Hence,
the normalized entropy density is equal to h
X
n(X
n
) divided or normalized by the blocklength
n.
11
I

(X; Y ) = quantile of sup-spectrum of


1
n
i
X
n
W
n(X
n
; Y
n
)

(X; Y ) = quantile of inf-spectrum of


1
n
i
X
n
W
n(X
n
; Y
n
).
Similarly, for a simple hypothesis testing system with arbitrary observation
statistics for each hypothesis,
_
H
0
: P
X
H
1
: P

X
we can replace A
n
in Denitions 2.1 and 2.2 by the normalized log-likelihood ratio
1
n
d
X
n(X
n
|

X
n
)
1
n
log
P
X
n(X
n
)
P

X
n
(X
n
)
to obtain the -inf/sup-divergence rates, denoted respectively by D

(X|

X) and

(X|

X), as
D

(X|

X) = quantile of sup-spectrum of
1
n
d
X
n(X
n
|

X
n
)

(X|

X) = quantile of inf-spectrum of
1
n
d
X
n(X
n
|

X
n
).
The above replacements are summarized in Tables 2.12.3.
2.4 Properties of generalized information measures
In this section, we will introduce the properties regarding the general information
measured dened in the previous section. We begin with the generalization of a
simple property I(X; Y ) = H(Y ) H(Y [X).
By taking = 0 and letting 0 in (2.2.1) and (2.2.3), we obtain
(U + V ) U
0
+ lim
0
V

U + V
and
(U + V ) lim
0
U

+ lim
0

V
(1)
= U +

V ,
which mean that the liminf in probability of a sequence of random variables
A
n
+B
n
is upper bounded by the liminf in probability of A
n
plus the limsup in
probability of B
n
, and is lower bounded by the sum of the liminfs in probability
of A
n
and B
n
. This fact is used in [5] to show that
I(X; Y ) +H(Y [X) H(Y ) I(X; Y ) +

H(Y [X),
12
Entropy Measures
system arbitrary source X
norm. entropy density
1
n
h
X
n(X
n
)
1
n
log P
X
n(X
n
)
entropy sup-spectrum

h() limsup
n
Pr
_
1
n
h
X
n(X
n
)
_
entropy inf-spectrum h() liminf
n
Pr
_
1
n
h
X
n(X
n
)
_
-inf-entropy rate H

(X) sup :

h()
-sup-entropy rate

H

(X) sup : h()


sup-entropy rate

H(X) lim
1

H

(X)
inf-entropy rate H(X) H
0
(X)
Table 2.1: Generalized entropy measures where [0, 1].
or equivalently,
H(Y )

H(Y [X) I(X; Y ) H(Y ) H(Y [X).
Other properties of the generalized information measures are summarized in
the next Lemma.
Lemma 2.4 For nite alphabet A, the following statements hold.
1.

H

(X) 0 for [0, 1].


(This property also applies to H

(X),

I

(X; Y ), I

(X; Y ),

D

(X|

X),
and D

(X|

X).)
2. I

(X; Y ) = I

(Y ; X) and

I

(X; Y ) =

I

(Y ; X) for [0, 1].


3. For 0 < 1, 0 < 1 and + 1,
I

(X; Y ) H
+
(Y ) H

(Y [X), (2.4.1)
I

(X; Y )

H
+
(Y )

H

(Y [X), (2.4.2)
13
Mutual Information Measures
system
arbitrary channel P
W
= P
Y[X
with input
X and output Y
norm. information density
1
n
i
X
n
W
n(X
n
; Y
n
)

1
n
log
P
X
n
,Y
n(X
n
, Y
n
)
P
X
n(X
n
) P
Y
n(Y
n
)
information sup-spectrum

i() limsup
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
)
_
information inf-Spectrum i() liminf
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
)
_
-inf-information rate I

(X; Y ) sup :

i()
-sup-information rate

I

(X; Y ) sup : i()


sup-information rate

I(X; Y ) lim
1

I

(X; Y )
inf-information rate I(X; Y ) I
0
(X; Y )
Table 2.2: Generalized mutual information measures where [0, 1].

(X; Y )

H
+
(Y ) H

(Y [X), (2.4.3)
I
+
(X; Y ) H

(Y )

H
(1)
(Y [X), (2.4.4)
and

I
+
(X; Y )

H

(Y )

H
(1)
(Y [X). (2.4.5)
(Note that the case of (, ) = (1, 0) holds for (2.4.1) and (2.4.2), and the
case of (, ) = (0, 1) holds for (2.4.3), (2.4.4) and (2.4.5).)
4. 0 H

(X)

H

(X) log [A[ for [0, 1), where each X


(n)
i
takes values
in A for i = 1, . . . , n and n = 1, 2, . . ..
5. I

(X, Y ; Z) I

(X; Z) for [0, 1].


14
Divergence Measures
system arbitrary sources X and

X
A
n
: norm. log-likelihood ratio
1
n
d
X
n(X
n
|

X
n
)
1
n
log
P
X
n(X
n
)
P

X
n
(X
n
)
divergence sup-spectrum

d() limsup
n
Pr
_
1
n
d
X
n(X
n
|

X
n
)
_
divergence inf-spectrum d() liminf
n
Pr
_
1
n
d
X
n(X
n
|

X
n
)
_
-inf-divergence rate D

(X|

X) sup :

d()
-sup-divergence rate

D

(X|

X) sup : d()
sup-divergence rate

D(X|

X) lim
1

D

(X|

X)
inf-divergence rate D(X|

X) D
0
(X|

X)
Table 2.3: Generalized divergence measures where [0, 1].
Proof: Property 1 holds because
Pr
_

1
n
log P
X
n(X
n
) < 0
_
= 0,
Pr
_
1
n
log
P
X
n(X
n
)
P

X
n
(X
n
)
<
_
= P
X
n
_
x
n
A
n
:
1
n
log
P
X
n(x
n
)
P

X
n
(x
n
)
<
_
=

_
x
n
.
n
: P
X
n(x
n
)<P

X
n
(x
n
)e
n

P
X
n(x
n
)

_
x
n
.
n
: P
X
n(x
n
)<P

X
n
(x
n
)e
n
_
P

X
n
(x
n
)e
n
e
n

_
x
n
.
n
: P
X
n(x
n
)<P

X
n
(x
n
)e
n
_
P

X
n
(x
n
)
e
n
, (2.4.6)
and, by following the same procedure as (2.4.6),
Pr
_
1
n
log
P
X
n
,Y
n(X
n
, Y
n
)
P
X
n(X
n
)P
Y
n(Y
n
)
<
_
e
n
.
15
Property 2 is an immediate consequence of the denition.
To show the inequalities in property 3, we rst remark that
1
n
h
Y
n(Y
n
) =
1
n
i
X
n
,Y
n(X
n
; Y
n
) +
1
n
h
X
n
,Y
n(Y
n
[X
n
),
where
1
n
h
X
n
,Y
n(Y
n
[X
n
)
1
n
log P
Y
n
[X
n(Y
n
[X
n
).
With this fact and for 0 < 1, 0 < < 1 and + 1, (2.4.1) follows
directly from (2.2.1); (2.4.2) and (2.4.3) follow from (2.2.2); (2.4.4) follows from
(2.2.3); and (2.4.5) follows from (2.2.4). (Note that in (2.4.1), if = 0 and = 1,
then the right-hand-side would be a dierence between two innite terms, which
is undened; hence, such case is therefore excluded by the condition that < 1.
For the same reason, we also exclude the cases of = 0 and = 1.) Now when
0 < 1 and = 0, we can conrm the validity of (2.4.1), (2.4.2) and (2.4.3),
again, by (2.2.1) and (2.2.2), and also examine the validity of (2.4.4) and (2.4.5)
by directly taking these values inside. The validity of (2.4.1) and (2.4.2) for
(, ) = (1, 0), and the validity of (2.4.3), (2.4.4) and (2.4.5) for (, ) = (0, 1)
can be checked by directly replacing and with the respective numbers.
Property 4 follows from the facts that

H

() is non-decreasing in ,

H

(X)

H(X), and

H(X) log [A[. The last inequality can be proved as follows.
Pr
_
1
n
h
X
n(X
n
) log [A[ +
_
= 1 P
X
n
_
x
n
A
n
:
1
n
log
P
X
n(X
n
)
1/[A[
n
<
_
1 e
n
,
where the last step can be obtained by using the same procedure as (2.4.6).
Therefore, h(log [A[ + ) = 1 for any > 0, which indicates

H(X) log [A[.
Property 5 can be proved using the fact that
1
n
i
X
n
,Y
n
,Z
n(X
n
, Y
n
; Z
n
) =
1
n
i
X
n
,Z
n(X
n
; Z
n
) +
1
n
i
X
n
,Y
n
,Z
n(Y
n
; Z
n
[X
n
).
By applying (2.2.1) with = 0, and observing I(Y ; Z[X) 0, we obtain the
desired result. 2
Lemma 2.5 (data processing lemma) Fix [0, 1]. Suppose that for every
n, X
n
1
and X
n
3
are conditionally independent given X
n
2
. Then
I

(X
1
; X
3
) I

(X
1
; X
2
).
16
Proof: By property 5 of Lemma 2.4, we get
I

(X
1
; X
3
) I

(X
1
; X
2
, X
3
) = I

(X
1
; X
2
),
where the equality holds because
1
n
log
P
X
n
1
,X
n
2
,X
n
3
(x
n
1
, x
n
2
, x
n
3
)
P
X
n
1
(x
n
1
)P
X
n
2
,X
n
3
(x
n
2
, x
n
3
)
=
1
n
log
P
X
n
1
,X
n
2
(x
n
1
, x
n
2
)
P
X
n
1
(x
n
1
)P
X
n
2
(x
n
2
)
.
2
Lemma 2.6 (optimality of independent inputs) Fix [0, 1). Consider
a nite-alphabet channel with P
W
n(y
n
[x
n
) = P
Y
n
[X
n(y
n
[x
n
) =

n
i=1
P
Y
i
[X
i
(y
i
[x
i
)
for all n. For any input X and its corresponding output Y ,
I

(X; Y ) I

(

X;

Y ) = I(

X;

Y ),
where

Y is the output due to

X, which is an independent process with the same
rst order statistics as X, i.e., P
X
n
(x
n
) =

n
i=1
P
X
i
(x
i
).
Proof: First, we observe that
1
n
log
P
W
n(y
n
[x
n
)
P
Y
n(y
n
)
+
1
n
log
P
Y
n(y
n
)
P
Y
n
(y
n
)
=
1
n
log
P
W
n(y
n
[x
n
)
P
Y
n
(y
n
)
.
In other words,
1
n
log
P
X
n
W
n(x
n
, y
n
)
P
X
n(x
n
)P
Y
n(y
n
)
+
1
n
log
P
Y
n(y
n
)
P
Y
n
(y
n
)
=
1
n
log
P
X
n
W
n(x
n
, y
n
)
P
X
n
(x
n
)P
Y
n
(y
n
)
.
By evaluating the above terms under P
X
n
W
n = P
X
n
,Y
n and dening
z() limsup
n
P
X
n
W
n
_
(x
n
, y
n
) A
n

n
:
1
n
log
P
X
n
W
n(x
n
, y
n
)
P
X
n
(x
n
)P
Y
n
(y
n
)

_
and
Z

(

X;

Y ) sup : z() ,
we obtain from (2.2.1) (with = 0) that
Z

(

X;

Y ) I

(X; Y ) + D(Y |

Y ) I

(X; Y ),
since D(Y |

Y ) 0 by property 1 of Lemma 2.4.


17
Now by the independence and marginal being equal to X of

X, we know
that the induced

Y is also independent and has the same marginal distribution
7
as Y . Hence,
Pr
_

1
n
n

i=1
_
log
P
X
i
W
i
(X
i
, Y
i
)
P
X
i
(X
i
)P
Y
i
(Y
i
)
E
X
i
W
i
_
log
P
X
i
W
i
(X
i
, Y
i
)
P
X
i
(X
i
)P
Y
i
(Y
i
)
__

>
_
= Pr
_

1
n
n

i=1
_
log
P
X
i
W
i
(X
i
, Y
i
)
P
X
i
(X
i
)P
Y
i
(Y
i
)
E
X
i
W
i
_
log
P
X
i
W
i
(X
i
, Y
i
)
P
X
i
(X
i
)P
Y
i
(Y
i
)
__

>
_
0,
for any > 0, where the convergence to zero follows from the Chebyshevs
inequality and the niteness of the channel alphabets (or more directly, the
niteness of individual variance). Consequently, z() = 1 for
> liminf
n
1
n
n

i=1
E
X
i
W
i
_
log
P
X
i
W
i
(X
i
, Y
i
)
P
X
i
(X
i
)P
Y
i
(Y
i
)
_
= liminf
n
1
n
n

i=1
I(X
i
; Y
i
);
and z() = 0 for < liminf
n
(1/n)

n
i=1
I(X
i
; Y
i
), which implies
Z(

X;

Y ) =Z

(

X;

Y ) = liminf
n
1
n
n

i=1
I(X
i
; Y
i
)
for any [0, 1). Similarly, we can show that
I(

X;

Y ) = I

(

X;

Y ) = liminf
n
1
n
n

i=1
I(X
i
; Y
i
)
for any [0, 1). Accordingly,
I(

X;

Y ) =Z(

X;

Y ) I

(X; Y ).
2
7
The claim can be justied as follows.
P
Y1
(y
1
) =

y
n
2
Y
n1

x
n
X
n
P
X
n(x
n
)P
W
n(y
n
[x
n
)
=

x1X
P
X1
(x
1
)P
W1
(y
1
[x
1
)

y
n
2
Y
n1

x
n
2
X
n1
P
X
n
2
|X1
(x
n
2
[x
1
)P
W
n
2
(y
n
2
[x
n
2
)
=

x1X
P
X1
(x
1
)P
W1
(y
1
[x
1
)

y
n
2
Y
n1

x
n
2
X
n1
P
X
n
2
W
n
2
|X1
(x
n
2
, y
n
2
[x
1
)
=

x1X
P
X1
(x
1
)P
W1
(y
1
[x
1
).
Hence, for a channel with P
W
n(y
n
[x
n
) =

n
i=1
P
Wi
(y
i
[x
i
), the output marginal only depends
on the respective input marginal.
18
2.5 Examples for the computation of general information
measures
Let the alphabet be binary A = = 0, 1, and let every output be given by
Y
(n)
i
= X
(n)
i
Z
(n)
i
where represents the modulo-2 addition operation, and Z is an arbitrary binary
random process, independent of X. Assume that X is a Bernoulli uniform input,
i.e., an i.i.d. random process with equal-probably marginal distribution. Then
it can be derived that the resultant Y is also Bernoulli uniform distributed, no
matter what distribution Z has.
To compute I

(X; Y ) for [0, 1) ( I


1
(X; Y ) = is known), we use the
results of property 3 in Lemma 2.4:
I

(X; Y ) H
0
(Y )

H
(1)
(Y [X), (2.5.1)
and
I

(X; Y )

H
+
(Y )

H

(Y [X). (2.5.2)
where 0 < 1, 0 < 1 and + 1. Note that the lower bound in (2.5.1)
and the upper bound in (2.5.2) are respectively equal to and for = 0
and + = 1, which become trivial bounds; hence, we further restrict > 0
and + < 1 respectively for the lower and upper bounds.
Thus for [0, 1),
I

(X; Y ) inf
0<1
_

H
+
(Y )

H

(Y [X)
_
.
By the symmetry of the channel,

H

(Y [X) =

H

(Z), which is independent of


X. Hence,
I

(X; Y ) inf
0<1
_

H
+
(Y )

H

(Z)
_
inf
0<1
_
log(2)

H

(Z)
_
,
where the last step follows from property 4 of Lemma 2.4. Since log(2)

H

(Z)
is non-increasing in ,
I

(X; Y ) log(2) lim


(1)

(Z).
On the other hand, we can derive the lower bound to I

(X; Y ) in (2.5.1) by
the fact that Y is Bernoulli uniform distributed. We thus obtain for (0, 1],
I

(X; Y ) log(2)

H
(1)
(Z),
19
and
I
0
(X; Y ) = lim
0
I

(X; Y ) log(2) lim


1

(Z) = log(2)

H(Z).
To summarize,
log(2)

H
(1)
(Z) I

(X; Y ) log(2) lim


(1)

(Z) for (0, 1)


and
I(X; Y ) = I
0
(X; Y ) = log(2)

H(Z).
An alternative method to compute I

(X; Y ) is to derive its corresponding


sup-spectrum in terms of the inf-spectrum of the noise process. Under the equally
likely Bernoulli input X, we can write

i() limsup
n
Pr
_
1
n
log
P
Y
n
[X
n(Y
n
[X
n
)
P
Y
n(Y
n
)

_
= limsup
n
Pr
_
1
n
log P
Z
n(Z
n
)
1
n
log P
Y
n(Y
n
)
_
= limsup
n
Pr
_
1
n
log P
Z
n(Z
n
) log(2)
_
= limsup
n
Pr
_

1
n
log P
Z
n(Z
n
) log(2)
_
= 1 liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) < log(2)
_
.
20
Hence, for (0, 1),
I

(X; Y ) = sup :

i()
= sup
_
: 1 liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) < log(2)
_

_
= sup
_
: liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) < log(2)
_
1
_
= sup
_
(log(2) ) : liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) <
_
1
_
= log(2) + sup
_
: liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) <
_
1
_
= log(2) inf
_
: liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) <
_
1
_
= log(2) sup
_
: liminf
n
Pr
_

1
n
log P
Z
n(Z
n
) <
_
< 1
_
log(2) sup
_
: liminf
n
Pr
_

1
n
log P
Z
n(Z
n
)
_
< 1
_
= log(2) lim
(1)

(Z).
Also, for (0, 1),
I

(X; Y ) sup
_
: limsup
n
Pr
_
1
n
log
P
X
n
,Y
n(X
n
, Y
n
)
P
X
n(X
n
)P
Y
n(Y
n
)
<
_
<
_
(2.5.3)
= log(2) sup
_
: liminf
n
Pr
_

1
n
log P
Z
n(Z
n
)
_
1
_
= log(2)

H
(1)
(Z),
where (2.5.3) follows from the fact described in footnote 2. Therefore,
log(2)

H
(1)
(Z) I

(X; Y ) log(2) lim


(1)

(Z) for (0, 1).


By taking 0, we obtain
I(X; Y ) = I
0
(X; Y ) = log(2)

H(Z).
Based on this result, we can now compute I

(X; Y ) for some specic exam-


ples.
Example 2.7 Let Z be an all-zero sequence with probability and Bernoulli(p)
with probability 1 , where Bernoulli(p) represents a binary Bernoulli process
21
-
0 h
b
(p)
t
t
d
d
6
0

1
Figure 2.2: The spectrum h() for Example 2.7.
-
1 h
b
(p) 1
t
t
d
d
6
0
1
1
Figure 2.3: The spectrum

i() for Example 2.7.
with individual component being one with probability p. Then the sequence of
random variables (1/n)h
Z
n(Z
n
) converges to 0 and h
b
(p) with respective masses
and 1 , where h
b
(p) p log p (1 p) log(1 p) is the binary entropy
function. The resulting h() is depicted in Figure 2.2. From (2.5.3), we obtain

i() as shown in Figure 2.3.


Therefore,
I

(X; Y ) =
_
1 h
b
(p), if 0 < < 1 ;
1, if 1 < 1.
Example 2.8 If
Z =
_
Z
n
=
_
Z
(n)
1
, . . . , Z
(n)
n
__

n=1
is a non-stationary binary independent sequence with
Pr
_
Z
(n)
i
= 0
_
= 1 Pr
_
Z
(n)
i
= 1
_
= p
i
,
then by the uniform boundedness (in i) of the variance of random variable
22
log P
Z
(n)
i
_
Z
(n)
i
_
, namely
Var
_
log P
Z
(n)
i
_
Z
(n)
i
__
E
_
_
log P
Z
(n)
i
_
Z
(n)
i
__
2
_
sup
0<p
i
<1
_
p
i
(log p
i
)
2
+ (1 p
i
)(log(1 p
i
))
2

< log(2),
we have (by Chebyshevs inequality)
Pr
_

1
n
log P
Z
n(Z
n
)
1
n
n

i=1
H
_
Z
(n)
i
_

>
_
0,
for any > 0. Therefore,

H
(1)
(Z) is equal to

H
(1)
(Z) =

H(Z) = limsup
n
1
n
n

i=1
H
_
Z
(n)
i
_
= limsup
n
1
n
n

i=1
h
b
(p
i
)
for (0, 1], and innity for = 0, where h
b
(p
i
) = p
i
log(p
i
)(1p
i
) log(1p
i
).
Consequently,
I

(X; Y ) =
_

_
1

H(Z) = 1 limsup
n
1
n
n

i=1
h
b
(p
i
), for [0, 1),
, for = 1.
This result is illustrated in Figures 2.4 and 2.5.
-

H(Z)

H(Z)
clustering points
Figure 2.4: The limiting spectrums of (1/n)h
Z
n(Z
n
) for Example 2.8
23
-

log(2)

H(Z) log(2) H(Z)
clustering points
Figure 2.5: The possible limiting spectrums of (1/n)i
X
n
,Y
n(X
n
; Y
n
) for
Example 2.8.
2.6 Renyis information measures
In this section, we introduce alternative generalizations of information measures.
They are respectively named Renyis entropy, Renyis mutual information and
Renyis divergence.
It is known that Renyis information measures can be used to provide expo-
nentially tight error bounds for data compression block coding and hypothesis
testing systems with i.i.d. statistics. Cambell also found its operational charac-
terization for data compression variable-length codes [2]. Here, we only provide
their denitions, and will introduce their operational meanings in the subsequent
chapters.
Denition 2.9 (Renyis entropy) For > 0, the Renyis entropy
8
of order
is dened by:
H(X; )
_

_
1
1
log
_

x.
[P
X
(x)]

_
, for ,= 1;
lim
1
H(X; ) =

x.
P
X
(x) log P
X
(x), for = 1.
Denition 2.10 (Renyis divergence) For > 0, the Renyis divergence of
order is dened by:
D(X|

X; )
_

_
1
1
log
_

x.
_
P

X
(x)P
1

X
(x)
_
_
, for ,= 1;
lim
1
D(X|

X; ) =

x.
P
X
(x) log
P
X
(x)
P

X
(x)
, for = 1.
8
The Renyis entropy is usually denoted by H

(X); however, in order not to confuse with


the -inf/sup-entropy rate H

(X) and

H

(X), we adopt the notation H(X; ) for Renyis


entropy. Same interpretations apply to other Renyis measures.
24
There are two possible Renyis extensions for mutual information. One is
based on the observation of
I(X; Y ) =

x.

y
P
X
(x)P
Y [X
(y[x) log
P
Y [X
(y[x)
P
Y
(y)
= min
P

x.

y
P
X
(x)P
Y [X
(y[x) log
P
Y [X
(y[x)
P

Y
(y)
.
The latter equality can be proved by taking the derivative of

x.

y
P
X
(x)P
Y [X
(y[x) log
P
Y [X
(y[x)
P

Y
(y)
+
_

y
P

Y
(y) 1
_
with respective to P

Y
(y), and letting it be zero to obtain that the minimization
output distribution satises
P

Y
(y) =

x.
P
X
(x)P
Y [X
(y[x).
The other extension is a direct generalization of
I(X; Y ) = D(P
X,Y
|P
X
P
Y
) = min
P

Y
D(P
X,Y
|P
X
P

Y
)
to Renyis divergence.
Denition 2.11 (type-I Renyis mutual information) For > 0, the type-
I Renyis mutual information of order is dened by:
I(X; Y ; )
_

_
min
P

Y
1
1

x.
P
X
(x) log
_

y
_
P

Y [X
(y[x)P
1

Y
(y)
_
_
, if ,= 1;
lim
1
I(X; Y ; ) = I(X; Y ), if = 1,
where the minimization is taken over all P

Y
under xed P
X
and P
Y [X
.
Denition 2.12 (type-II Renyis mutual information) For > 0, the
type-II Renyis mutual information of order is dened by:
J(X; Y ; ) min
P

Y
D(P
X,Y
|P
X
P

Y
; )
=
_

1
log
_
_

y
_

x.
P
X
(x)P

Y [X
(y[x)
_
1/
_
_
, for ,= 1;
lim
1
J(X; Y ; ) = I(X; Y ), for = 1.
25
Some elementary properties of Renyis entropy and divergence are summa-
rized below.
Lemma 2.13 For nite alphabet A, the following statements hold.
1. 0 H(X; ) log [A[; the rst equality holds if, and only if, X is de-
terministic, and the second equality holds if, and only if, X is uniformly
distributed over A.
2. H(X; ) is strictly decreasing in unless X is uniformly distributed over
its support x A : P
X
(x) > 0.
3. lim
0
H(X; ) = log [x A : P
X
(x) > 0[.
4. lim

H(X; ) = log max


x.
P
X
(x).
5. D(X|

X; ) 0 with equality holds if, and only if, P
X
= P

X
.
6. D(X|

X; ) = if, and only if, either
x A : P
X
(x) > 0 and P

X
(x) > 0 =
or
x A : P
X
(x) > 0 , x A : P

X
(x) > 0 for 1.
7. lim
0
D(X|

X; ) = log P

X
x A : P
X
(x) > 0.
8. If P
X
(x) > 0 implies P

X
(x) > 0, then
lim

D(X|

X; ) = max
_
x. : P

X
(x)>0
_
log
P
X
(x)
P

X
(x)
.
9. I(X; Y ; ) J(X; Y ; ) for 0 < < 1, and I(X; Y ; ) J(X; Y ; ) for
> 1.
Lemma 2.14 (data processing lemma for type-I Renyis mutual infor-
mation) Fix > 0. If X Y Z, then I(X; Y ; ) I(X; Z; ).
26
Bibliography
[1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: John
Wiley and Sons, 1995.
[2] L. L. Cambell, A coding theorem and Renyis entropy, Informat. Contr.,
vol. 8, pp. 423429, 1965.
[3] R. M. Gray, Entropy and Information Theory, Springer-Verlag, New York,
1990.
[4] T. S. Han and S. Verd u, Approximation theory of output statistics, IEEE
Trans. on Information Theory, vol. IT39, no. 3, pp. 752772, May 1993.
[5] S. Verd u and T. S. Han, A general formula for channel capacity, IEEE
Trans. on Information Theory, vol. IT40, no. 4, pp. 11471157, Jul. 1994.
27
Chapter 3
General Lossless Data Compression
Theorems
In Volume I of the lecture notes, we already know that the entropy rate
lim
n
1
n
H(X
n
)
is the minimum data compression rate (nats per source symbol) for arbitrarily
small data compression error for block coding of the stationary-ergodic source.
We also mentioned that for a more complicated situations where the sources
becomes non-stationary, the quantity lim
n
(1/n)H(X
n
) may not exist, and
can no longer be used to characterize the source compression. This results in
the need to establish a new entropy measure which appropriately characterizes
the operational limits of arbitrary stochastic systems, which was done in the
previous chapter.
The role of a source code is to represent the output of a source eciently.
Specically, a source code design is to minimize the source description rate of
the code subject to a delity criterion constraint. One commonly used delity
criterion constraint is to place an upper bound on the probability of decoding
error P
e
. If P
e
is made arbitrarily small, we obtain a traditional (almost) error-
free source coding system
1
. Lossy data compression codes are a larger class
of codes in the sense that the delity criterion used in the coding scheme is a
general distortion measure. In this chapter, we only demonstrate the bounded-
error data compression theorems for arbitrary (not necessarily stationary ergodic,
information stable, etc.) sources. The general lossy data compression theorems
will be introduced in subsequent chapters.
1
Recall that only for variable-length codes, a complete error-free data compression is re-
quired. A lossless data compression block codes only dictates that the compression error can
be made arbitrarily small, or asymptotically error-free.
28
3.1 Fixed-length data compression codes for arbitrary
sources
Equipped with the general information measures, we herein demonstrate a gen-
eralized Asymptotic Equipartition Property (AEP) Theorem and establish ex-
pressions for the minimum -achievable (xed-length) coding rate of an arbitrary
source X.
Here, we have made an implicit assumption in the following derivation, which
is the source alphabet A is nite
2
.
Denition 3.1 (cf. Denition 4.2 and its associated footnote in volume
I) An (n, M) block code for data compression is a set
(
n
c
1
, c
2
, . . . , c
M

consisting of M sourcewords
3
of block length n (and a binary-indexing codeword
for each sourceword c
i
); each sourceword represents a group of source symbols
of length n.
Denition 3.2 Fix [0, 1]. R is an -achievable data compression rate for
a source X if there exists a sequence of block data compression codes (
n
=
(n, M
n
)

n=1
with
limsup
n
1
n
log M
n
R,
and
limsup
n
P
e
( (
n
) ,
where P
e
( (
n
) Pr (X
n
/ (
n
) is the probability of decoding error.
The inmum of all -achievable data compression rate for X is denoted by
T

(X).
Lemma 3.3 (Lemma 1.5 in [3]) Fix a positive integer n. There exists an
(n, M
n
) source block code (
n
for P
X
n such that its error probability satises
P
e
( (
n
) Pr
_
1
n
h
X
n(X
n
) >
1
n
log M
n
_
.
2
Actually, the theorems introduced also apply for sources with countable alphabets. We
assume nite alphabets in order to avoid uninteresting cases (such as

H

(X) = ) that might


arise with countable alphabets.
3
In Denition 4.2 of volume I, the (n, M) block data compression code is dened by M
codewords, where each codeword represents a group of sourcewords of length n. However, we
can actually pick up one source symbol from each group, and equivalently dene the code
using these M representative sourcewords. Later, it will be shown that this viewpoint does
facilitate the proving of the general source coding theorem.
29
Proof: Observe that
1

x
n
.
n
: (1/n)h
X
n(x
n
)(1/n) log Mn
P
X
n(x
n
)

x
n
.
n
: (1/n)h
X
n(x
n
)(1/n) log Mn
1
Mn

x
n
A
n
:
1
n
h
X
n(x
n
)
1
n
log M
n

1
Mn
.
Therefore, [x
n
A
n
: (1/n)h
X
n(x
n
) (1/n) log M
n
[ M
n
. We can then
choose a code
(
n

_
x
n
A
n
:
1
n
h
X
n(x
n
)
1
n
log M
n
_
with [ (
n
[ = M
n
and
P
e
( (
n
) = 1 P
X
n (
n
Pr
_
1
n
h
X
n(X
n
) >
1
n
log M
n
_
.
2
Lemma 3.4 (Lemma 1.6 in [3]) Every (n, M
n
) source block code (
n
for P
X
n
satises
P
e
( (
n
) Pr
_
1
n
h
X
n(X
n
) >
1
n
log M
n
+
_
expn,
for every > 0.
Proof: It suces to prove that
1 P
e
( (
n
) = Pr X
n
(
n
< Pr
_
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+ expn.
30
Clearly,
Pr X
n
(
n
= Pr
_
X
n
(
n
and
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+Pr
_
X
n
(
n
and
1
n
h
X
n(X
n
) >
1
n
log M
n
+
_
Pr
_
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+Pr
_
X
n
(
n
and
1
n
h
X
n(X
n
) >
1
n
log M
n
+
_
= Pr
_
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+

x
n
( n
P
X
n(x
n
) 1
_
1
n
h
X
n(x
n
) >
1
n
log M
n
+
_
= Pr
_
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+

x
n
( n
P
X
n(x
n
) 1
_
P
X
n(x
n
) <
1
M
n
expn
_
< Pr
_
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+[ (
n
[
1
M
n
expn
= Pr
_
1
n
h
X
n(X
n
)
1
n
log M
n
+
_
+ expn.
2
We now apply Lemmas 3.3 and 3.4 to prove a general source coding theorems
for block codes.
Theorem 3.5 (general source coding theorem) For any source X,
T

(X) =
_
_
_
lim
(1)

(X), for [0, 1);


0, for = 1.
Proof: The case of = 1 follows directly from its denition; hence, the proof
only focus on the case of [0, 1).
1. Forward part (achievability): T

(X) lim
(1)

H

(X)
We need to prove the existence of a sequence of block codes (
n

n1
such that
for every > 0,
limsup
n
1
n
log [ (
n
[ lim
(1)

(X) + and limsup


n
P
e
( (
n
) .
31
Lemma 3.3 ensures the existence (for any > 0) of a source block code
(
n
= (n, M
n
= expn(lim
(1)

H

(X) + )|) with error probability


P
e
( (
n
) Pr
_
1
n
h
X
n(X
n
) >
1
n
log M
n
_
Pr
_
1
n
h
X
n(X
n
) > lim
(1)

(X) +
_
.
Therefore,
limsup
n
P
e
( (
n
) limsup
n
Pr
_
1
n
h
X
n(X
n
) > lim
(1)

(X) +
_
= 1 liminf
n
Pr
_
1
n
h
X
n(X
n
) lim
(1)

(X) +
_
1 (1 ) = ,
where the last inequality follows from
lim
(1)

(X) = sup
_
: liminf
n
Pr
_
1
n
h
X
n(X
n
)
_
< 1
_
. (3.1.1)
2. Converse part: T

(X) lim
(1)

H

(X)
Assume without loss of generality that lim
(1)

H

(X) > 0. We will prove


the converse by contradiction. Suppose that T

(X) < lim


(1)

H

(X). Then
( > 0) T

(X) < lim


(1)

H

(X) 4. By denition of T

(X), there exists


a sequence of codes (
n
such that
limsup
n
1
n
log [ (
n
[
_
lim
(1)

(X) 4
_
+ (3.1.2)
and
limsup
n
P
e
( (
n
) . (3.1.3)
(3.1.2) implies that
1
n
log [ (
n
[ lim
(1)

(X) 2
for all suciently large n. Hence, for those n satisfying the above inequality and
also by Lemma 3.4,
P
e
( (
n
) Pr
_
1
n
h
X
n(X
n
) >
1
n
log [ (
n
[ +
_
e
n
Pr
_
1
n
h
X
n(X
n
) >
_
lim
(1)

(X) 2
_
+
_
e
n
.
32
Therefore,
limsup
n
P
e
( (
n
) 1 liminf
n
Pr
_
1
n
h
X
n(X
n
) lim
(1)

(X)
_
> ,
where the last inequality follows from (3.1.1). Thus, a contradiction to (3.1.3)
is obtained. 2
A few remarks are made based on the previous theorem.
Note that as = 0, lim
(1)

H

(X) =

H(X). Hence, the above theorem
generalizes the block source coding theorem in [4], which states that the
minimum achievable xed-length source coding rate of any nite-alphabet
source is

H(X).
Consider the special case where (1/n) log P
X
n(X
n
) converges in proba-
bility to a constant H, which holds for all information stable sources
4
. In
this case, both the inf- and sup-spectrums of X degenerate to a unit step
function:
u() =
_
1, if > H;
0, if < H,
where H is the source entropy rate. Thus,

H

(X) = H for all [0, 1).


Hence, general source coding theorem reduces to the conventional source
coding theorem.
More generally, if (1/n) log P
X
n(X
n
) converges in probability to a random
variable Z whose cumulative distribution function (cdf) is F
Z
(), then the
minimum achievable data compression rate subject to decoding error being
no greater than is
lim
(1)

(X) = sup R : F
Z
(R) < 1 .
Therefore, the relationship between the code rate and the ultimate optimal
error probability is also clearly dened. We further explore the case in the
next example.
4
A source X =
_
X
n
=
_
X
(n)
1
, . . . , X
(n)
n
__

n=1
is said to be information stable if H(X
n
) =
E [log P
X
n(x
n
)] > 0 for all n, and
lim
n
Pr
_

log P
X
n(x
n
)
H(X
n
)
1

>
_
= 0,
for every > 0. By the denition, any stationary-ergodic source with nite n-fold entropy is
information stable; hence, it can be viewed a generalized source model for stationary-ergodic
sources.
33
Example 3.6 Consider a binary source X with each X
n
is Bernoulli()
distributed, where is a random variable dened over (0, 1). This is a sta-
tionary but non-ergodic source [1]. We can view the source as a mixture of
Bernoulli() processes where the parameter = (0, 1), and has distri-
bution P

[1, Corollary 1]. Therefore, it can be shown by ergodic decom-


position theorem (which states that any stationary source can be viewed
as a mixture of stationary-ergodic sources) that (1/n) log P
X
n(X
n
) con-
verges in probability to a random variable Z = h
b
() [1], where h
b
(x)
xlog
2
(x)(1x) log
2
(1x) is the binary entropy function. Consequently,
the cdf of Z is F
Z
(z) = Prh
b
() z; and the minimum achievable xed-
length source coding rate with compression error being no larger than
is
supR : F
Z
(R) < 1 = supR : Prh
b
() R < 1 .
From the above example, or from Theorem 3.5, it shows that the strong
converse theorem (which states that codes with rate below entropy rate
will ultimately have decompression error approaching one, cf. Theorem 4.6
of Volume I of the lecture notes) does not hold in general. However, one
can always claim the weak converse statement for arbitrary sources.
Theorem 3.7 (weak converse theorem) For any block code sequence
of ultimate rate R <

H(X), the probability of block decoding failure P
e
cannot be made arbitrarily small. In other words, there exists > 0 such
that P
e
is lower bounded by innitely often in block length n.
The possible behavior of the probability of block decompression error of
an arbitrary source is depicted in Figure 3.1. As shown in the Figure,
there exist two bounds, denoted by

H(X) and

H
0
(X), where

H(X) is
the tight bound for lossless data compression rate. In other words, it is
possible to nd a sequence of block codes with compression rate larger than

H(X) and the probability of decoding error is asymptotically zero. When


the data compression rate lies between

H(X) and

H
0
(X), the minimum
probability of decoding error achievable is bounded below by a positive
absolute constant in (0, 1) innitely often in blocklength n. In the case that
the data compression rate is less than

H
0
(X), the probability of decoding
error of all codes will eventually go to 1 (for n innitely often). This
fact tells the block code designer that all codes with long blocklength are
bad when data compression rate is smaller than

H
0
(X). From the strong
converse theorem, the two bounds in Figure 3.1 coincide for memoryless
sources. In fact, these two bounds coincide even for stationary-ergodic
sources.
34
-

H
0
(X)

H(X)
P
e
n (i.o.)
1
P
e
n
0
P
e
is lower
bounded (i.o. in n)
R
Figure 3.1: Behavior of the probability of block decoding error as block
length n goes to innity for an arbitrary source X.
We close this section by remarking that the denition that we adopt for the
-achievable data compression rate is slightly dierent from, but equivalent to,
the one used in [4, Def. 8]. The denition in [4] also brings the same result,
which was separately proved by Steinberg and Verd u as a direct consequence of
Theorem 10(a) (or Corollary 3) in [5]. To be precise, they showed that T

(X),
denoted by T
e
(, X) in [5], is equal to

R
v
(2) (cf. Def. 17 in [5]). By a simple
derivation, we obtain:
T
e
(, X) =

R
v
(2)
= inf
_
: limsup
n
P
X
n
_

1
n
log P
X
n(X
n
) >
_

_
= inf
_
: liminf
n
P
X
n
_

1
n
log P
X
n(X
n
)
_
1
_
= sup
_
: liminf
n
P
X
n
_

1
n
log P
X
n(X
n
)
_
< 1
_
= lim
(1)

(X).
Note that Theorem 10(a) in [5] is a lossless data compression theorem for ar-
bitrary sources, which the authors show as a by-product of their results on
nite-precision resolvability theory. Specically, they proved T
0
(X) = S(X) [4,
Thm. 1] and S(X) =

H(X) [4, Thm. 3], where S(X) is the resolvability
5
of
an arbitrary source X. Here, we establish Theorem 3.5 in a dierent and more
direct way.
3.2 Generalized AEP theorem
For discrete memoryless sources, the data compression theorem is proved by
choosing the codebook (
n
to be the weakly -typical set and applying the Asymp-
5
The resolvability, which is a measure of randomness for random variables, will be intro-
duced in subsequent chapters.
35
totic Equipartition Property (AEP) which states that (1/n)h
X
n(X
n
) converges
to H(X) with probability one (and hence in probability). The AEP which
implies that the probability of the typical set is close to one for suciently large
n also holds for stationary-ergodic sources. It is however invalid for more gen-
eral sources e.g., non-stationary, non-ergodic sources. We herein demonstrate
a generalized AEP theorem.
Theorem 3.8 (generalized asymptotic equipartition property for arbi-
trary sources) Fix [0, 1). Given an arbitrary source X, dene
T
n
[R]
_
x
n
A
n
:
1
n
log P
X
n(x
n
) R
_
.
Then for any > 0, the following statements hold.
1.
liminf
n
Pr
_
T
n
[

H

(X) ]
_
(3.2.1)
2.
liminf
n
Pr
_
T
n
[

H

(X) + ]
_
> (3.2.2)
3. The number of elements in
T
n
(; ) T
n
[

H

(X) + ] T
n
[

H

(X) ],
denoted by [T
n
(; )[, satises
[T
n
(; )[ exp
_
n(

H

(X) + )
_
, (3.2.3)
where the operation /B between two sets / and B is dened by /B
/ B
c
with B
c
denoting the complement set of B.
4. There exists = () > 0 and a subsequence n
j

j=1
such that
[T
n
(; )[ > exp
_
n
j
(

H

(X) )
_
. (3.2.4)
Proof: (3.2.1) and (3.2.2) follows from the denitions. For (3.2.3), we have
1

x
n
Tn(;)
P
X
n(x
n
)

x
n
Tn(;)
exp
_
n(

H

(X) + )
_
= [T
n
(; )[ exp
_
n(

H

(X) + )
_
.
36
T
n
[

H

(X) ] T
n
[

H

(X) + ]
T
n
(; )
Figure 3.2: Illustration of generalized AEP Theorem. T
n
(; )
T
n
[

H

(X) + ] T
n
[

H

(X) ] is the dashed region.


It remains to show (3.2.4). (3.2.2) implies that there exist = () > 0 and
N
1
such that for all n > N
1
,
Pr
_
T
n
[

H

(X) + ]
_
> + 2.
Furthermore, (3.2.1) implies that for the previously chosen , there exist a sub-
sequence n
t
j

j=1
such that
Pr
_
T
n

j
[

H

(X) ]
_
< + .
Therefore, for all n
t
j
> N
1
,
< Pr
_
T
n

j
[

H

(X) + ] T
n

j
[

H

(X) ]
_
<

T
n

j
[

H

(X) + ] T
n

j
[

H

(X) ]

exp
_
n
t
j
(

H

(X) )
_
=

T
n

j
(; )

exp
_
n
t
j
(

H

(X) )
_
.
The desired subsequence n
j

j=1
is then dened as n
1
is the rst n
t
j
> N
1
, and
n
2
is the second n
t
j
> N
1
, etc. 2
With the illustration depicted in Figure 3.2, we can clearly deduce that The-
orem 3.8 is indeed a generalized version of the AEP since:
The set
T
n
(; ) T
n
[

H

(X) + ] T
n
[

H

(X) ]
=
_
x
n
A
n
:

1
n
log P
X
n(x
n
)

H

(X)


_
is nothing but the weakly -typical set.
37
(3.2.1) and (3.2.2) imply that q
n
PrT
n
(; ) > 0 innitely often in n.
(3.2.3) and (3.2.4) imply that the number of sequences in T
n
(; ) (the
dashed region) is approximately equal to exp
_
n

H

(X)
_
, and the probabi-
lity of each sequence in T
n
(; ) can be estimated by q
n
exp
_
n

H

(X)
_
.
In particular, if X is a stationary-ergodic source, then

H

(X) is indepen-
dent of [0, 1) and,

H

(X) =H

(X) = H for all [0, 1), where H is


the source entropy rate
H = lim
n
1
n
E [log P
X
n(X
n
)] .
In this case, (3.2.1)-(3.2.2) and the fact that

H

(X) =H

(X) for all


[0, 1) imply that the probability of the typical set T
n
(; ) is close to one
(for n suciently large), and (3.2.3) and (3.2.4) imply that there are about
e
nH
typical sequences of length n, each with probability about e
nH
. Hence
we obtain the conventional AEP.
The general source coding theorem can also be proved in terms of the
generalized AEP theorem. For details, readers can refer to [2].
3.3 Variable-length lossless data compression codes
that minimizes the exponentially weighted codeword
length
3.3.1 Criterion for optimality of codes
In the usual discussion of the source coding theorem, one chooses the criterion
to minimize the average codeword length. Implicit in the use of average code
length as a criterion of performance is the assumption that cost varies linearly
with codeword length. This is not always the case. In some papers, another
cost/penalty function of codeword length is introduced which implies that the
cost is an exponential function of codeword length. Obviously, linear cost is a
limiting case of the exponential cost function.
One of the exponential cost function introduced is:
L(t)
1
t
log
_

x.
P
X
(x)e
t(cx)
_
,
where t is a chosen positive constant, P
X
is the distribution function of source
X, c
x
is the binary codeword for source symbol x, and () is the length of the
38
binary codeword. The criterion for optimal code now becomes that a code is
said to be optimal if its cost L(t) is the smallest among all possible codes.
The physical meaning of the aforementioned cost function is roughly dis-
cussed below. When t 0, L(0) =

x.
P
X
(x)(c
x
) which is the average
codeword length. In the case of t , L() = max
x.
(c
x
), which is the
maximum codeword length for all binary codewords. As you may have noticed,
longer codeword length has larger weight in the sum of L(t). In other words,
the minimization of L(t) is equivalent to the minimization of

x.
P
X
(x)e
t(cx)
;
hence, the weight for codeword c
x
is e
t(cx)
. For the minimization operation, it
is obvious that events with smaller weight is more preferable since it contributes
less in the sum of L(t).
Therefore, with the minimization of L(t), it is less likely to have codewords
with long code lengths. In practice, system with long codewords usually intro-
duce complexity in encoding and decoding, and hence is somewhat non-feasible.
Consequently, the new criterion, to some extent, is more suitable to physical
considerations.
3.3.2 Source coding theorem for Renyis entropy
Theorem 3.9 (source coding theorem for Renyis entropy) The mini-
mum cost L(t) attainable for uniquely decodable codes
6
is the Renyis entropy
of order 1/(1 + t), i.e.,
H
_
X;
1
(1 + t)
_
=
1 + t
t
log
_

x.
P
1/(1+t)
X
(x)
_
.
Example 3.10 Given a source X. If we want to design an optimal lossless code
with max
x.
(c
x
) being smallest, the cost of the optimal code is H(X; 0) =
log [A[.
3.4 Entropy of English
The compression of English text is not only a practical application but also an
interesting research topic.
One of the main problem in the compression of English text (in principle)
is that its statistical model is unclear. Therefore, its entropy rate cannot be
immediately computable. To estimate the data compression bound of English
text, various stochastic approximations to English have been proposed. One can
6
For its denition, please refer to Section 4.3.1 of Volume I of the lecture notes.
39
then design a code, according to the estimated stochastic model, to compress
English text. It is obvious that the better the stochastic approximation, the
better the approximation.
Assumption 3.11 For data compression, we assume that the English text con-
tains only 26 letters and the space symbol. In other words, the upper case letter
is treated the same as its lower case counterpart, and special symbols, such as
punctuation, will be ignored.
3.4.1 Markov estimate of entropy rate of English text
According to the source coding theorem, the rst thing that a data compression
code designer shall do is to estimate the (-sup) entropy rate of the English text.
One can start from modeling the English text as a Markov source, and compute
the entropy rate according to the estimated Markov statistics.
zero-order Markov approximation. It has been shown that the frequency
of letters in English is far from uniform; for example, the most common
letter, E, has P
empirical
(E) 0.13 but the least common letters, Q and Z,
have P
empirical
(Q) P
empirical
(Z) 0.001. Therefore, zero-order Markov
approximation apparently does not t our need in estimating the entropy
rate of the English text.
1st-order Markov approximate. The frequency of pairs of letters is also far
from uniform; the most common pair TH has frequency about 0.037. It is
fun to know that Q is always followed by U.
A higher order approximation is possible. However, the database may be
too large to be handled. For example, a 2nd-order approximation requires 27
3
=
19683 entries, and one may need millions of samples to make an accurate estimate
of its probability.
Here are some examples of Markov approximations to English from Shannons
original paper. Note that the sequence of English letters is generated according
to the approximated statistics.
1. Zero-order Markov approximation: The symbols are drawn independently
with equiprobable distribution.
Example: XFOML RXKHRJEFJUJ ZLPWCFWKCYJ
EFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD
40
2. 1st-order Markov approximation: The symbols are drawn independently.
Frequency of letters matches the 1st-order Markov approximation of En-
glish text.
Example: OCRO HLI RGWR NMIELWIS EU LL
NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
3. 2nd-order Markov approximation: Frequency of pairs of letters matches
English text.
Example: ON IE ANTSOUTINYS ARE T INCTORE ST BE
S DEAMY ACHIN D ILONASIVE TUCOOWE AT
TEASONARE FUSO TIZIN AN DY TOBE SEACE CTISBE
4. 3rd-order Markov approximation: Frequency of triples of letters matches
English text.
Example: IN NO IST LAT WHEY CRATICT FROURE BERS
GROCID PONDENOME OF DEMONSTURES OF THE
REPTAGIN IS REGOACTIONA OF CRE
5. 4th-order Markov approximation: Frequency of quadruples of letters ma-
tches English text.
Example: THE GENERATED JOB PROVIDUAL BETTER
TRAND THE DISPLAYED CODE, ABOVERY UPONDULTS
WELL THE CODE RST IN THESTICAL IT DO HOCK
BOTHE MERG.
Another way to simulate the randomness of English text is to use word-
approximation.
1. 1st-order word approximation.
Example: REPRESENTING AND SPEEDILY IS AN GOOD APT OR
COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE
TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MES-
SAGE HAD BE THESE.
2. 2nd-order word approximation.
Example: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF
WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
41
From the above results, it is obvious that the approximations get closer and
closer to resembling English for higher order approximation.
Using the above model, we can then compute the empirical entropy rate of
English text.
order of the letter approximation model entropy rate
zero order log
2
27 = 4.76 bits per letter
1st order 4.03 bits per letter
4th order 2.8 bits per letter
One nal remark on Markov estimate of English statistics is that the results
are not only useful in compression but also helpful in decryption.
3.4.2 Gambling estimate of entropy rate of English text
In this section, we will show that a good gambler is also a good data compressor!
A) Sequential gambling
Given an observed sequence of letters,
x
1
, x
2
, . . . , x
k
,
a gambler needs to bet on the next letter X
k+1
(which is now a random variable)
with all the money in hand. It is not necessary for him to put all the money
on the same outcome (there are 27 of them, i.e., 26 letters plus space). For
example, he is allowed to place part of his money on one possible outcome and
the rest of the money on another. The only constraint is that he should bet all
his money.
Let b(x
k+1
[x
1
, . . . , x
k
) be the ratio of his money, which be bet on the letter
x
k+1
, and assume that at rst, the amount of money that the gambler has is 1.
Then

x
k+1
.
b(x
k+1
[x
1
, . . . , x
k
) = 1,
and
( x
k+1
a, b, . . . , z, SPACE) b(x
k+1
[x
1
, . . . , x
k
) 0.
When the next letter appears, the gambler will be paid 27 times the bet on the
letter.
42
Let S
n
be the wealth of the gambler after n bets. Then
S
1
= 27 b
1
(x
1
)
S
2
= 27 [b
2
(x
2
[x
1
) S
1
]
S
3
= 27 [b
3
(x
3
[x
1
, x
2
) S
2
]
.
.
.
S
n
= 27
n

k=1
b
k
(x
k
[x
1
, . . . , x
k1
)
= 27
n
b(x
1
, . . . , x
n
),
where
b(x
n
) = b(x
1
, . . . , x
n
) =
n

k=1
b
k
(x
k
[x
1
, . . . , x
k1
).
We now wish to show that high value of S
n
lead to high data compression.
Specically, if a gambler with some gambling policy yields wealth S
n
, the data
can be saved up to log
2
S
n
bits.
Lemma 3.12 If a proportional gambling policy results in wealth E[log
2
S
n
],
there exists a data compression code for English-text source X
n
which yields
average codeword length being smaller than
log
2
(27)
1
n
E [log
2
S
n
] +
2
n
bits,
where X
n
represents the random variables of the n bet outcomes.
Proof:
1. Ordering : Index the English letter as
index(a) = 0
index(b) = 1
index(c) = 2
.
.
.
index(z) = 25
index(SPACE) = 26.
For any two sequences x
n
and x
n
in A a, b, . . . , z, SPACE, we say
x
n
x
n
if
n

i=1
index(x
i
) 27
i1

i=1
index( x
i
) 27
i1
.
43
2. Shannon-Fano-Elias coder : Apply Shannon-Fano-Elias coder (cf. Sec-
tion 4.3.3 of Volume I of the lecture notes) to the gambling policy b(x
n
)
according to the ordering dened in step 1. Then the codeword length for
x
n
is
(log
2
(b(x
n
))| + 1) bits.
3. Data compression : Now observe that
E [log
2
S
n
] = E [log
2
(27
n
b(x
n
))]
= nlog
2
(27) + E[log
2
b(X
n
)].
Hence, the average codeword length

is upper bounded by


1
n
_

x
n
.
n
P
X
n(x
n
) log
2
b(x
n
) + 2
_
=
1
n
E [log
2
b(X
n
)] +
2
n
= log
2
(27)
1
n
E[log
2
S
n
] +
2
n
.
2
According to the concept behind the source coding theorem, the entropy rate
of the English text should be upper bounded by the average codeword length of
any (variable-length) code, which in turns should be bounded above by
log
2
(27)
1
n
E[log
2
S
n
] +
2
n
.
Equipped with the proportional gambling model, one can nd the bound of
the entropy rate of English text by a properly designed gambling policy. An
experiment using the book, Jeerson the Virginian by Dumas Malone, as the
database resulted in an estimate of 1.34 bits per letter for the entropy rate of
English.
3.5 Lempel-Ziv code revisited
In Section 4.3.4 of Volume I of the lecture notes, we have introduced the famous
Lempel-Ziv coder, and states that the coder is universally good for stationary
sources. In this section, we will establish the concept behind it.
For simplicity, we assume that the source alphabet is binary, i.e., A = 0, 1.
The optimality of the Lempel-Ziv code can actually be extended to any station-
ary source with nite alphabet.
44
Lemma 3.13 The number c(n) of distinct strings in the Lempel-Ziv parsing of
a binary sequence satises

2n 1 c(n)
2n
log
2
n
,
where the upper bound holds for n 2
13
, and the lower bound is valid for every
n.
Proof: The upper bound can be proved as follows.
For xed n, the number of distinct strings is maximized when all the phrases
are as short as possible. Hence, in the extreme case,
n =
c(n) of them
..
1 + 2 + 2 + 3 + 3 + 3 + 3 + ,
which implies that
k+1

j=1
j2
j
n
k

j=1
j2
j
, (3.5.1)
where k is the integer satisfying
2
k+1
1 > c(n) 2
k
1. (3.5.2)
Observe that
k+1

j=1
j2
j
= k2
k+2
+ 2 and
k

j=1
j2
j
= (k 1)2
k+1
+ 2.
Now from (3.5.1), we obtain for k 7 (which will be justied later by n 2
13
),
n k2
k+2
+ 2 < 2
2(k1)
and n (k 1)2
k+1
+ 2 (k 1)2
k+1
.
The proof of the upper bound is then completed by noting that the maximum
c(n) for xed n should satisfy
c(n) < 2
k+1
1 2
k+1

n
k 1

2n
log
2
(n)
.
Again for the lower bound, we note that the number of distinct strings is
minimized when all the phrases are as long as possible. Hence, in the extreme
case,
n =
c(n) of them
..
1 + 2 + 3 + 4 +
c(n)

j=1
j =
c(n)[c(n) + 1]
2

[c(n) + 1]
2
2
,
45
which implies that
c(n)

2n 1.
Note that when n 2
13
, c(n) 2
7
1, which implies that the assumption of
k 7 in (3.5.2) is valid for n 2
13
. 2
The condition n 2
13
is equivalent to compressing a binary le of size larger
than 1K bytes, which, in practice, is a frequently encountered situation. Since
what concerns us is the asymptotic optimality of the coder as n goes to innity,
n 2
13
certainly becomes insignicant in such consideration.
Lemma 3.14 (entropy upper bound by a function of its mean) A non-
negative integer-valued source X with mean and entropy H(X) satises
H(X) ( + 1) log
2
( + 1) log
2
.
Proof: The lemma follows directly from the result that the geometric distri-
bution maximizes the entropy of non-negative integer-valued source with given
mean, which is proved as follows.
For geometric distribution with mean ,
P
Z
(z) =
1
1 +
_

1 +
_
z
, for z = 0, 1, 2, . . . ,
its entropy is
H(Z) =

z=0
P
Z
(z) log
2
P
Z
(z)
=

z=0
P
Z
(z)
_
log
2
(1 + ) + z log
2
1 +

_
= log
2
(1 + ) + log
2
1 +

z=0
P
X
(z)
_
log
2
(1 + ) + z log
2
1 +

_
,
where the last equality holds for any non-negative integer-valued source X with
mean . So,
H(X) H(Z) =

x=0
P
X
(x)[log
2
P
X
(x) + log
2
P
Z
(x)]
=

x=0
P
X
(x) log
2
P
Z
(x)
P
X
(x)
= D(P
X
|P
Z
) 0,
46
with equality holds if, and only if, X Z. 2
Before the introduction of the main theorems, we address some notations
used in their proofs.
Give the source
x
(k1)
, . . . , x
1
, x
0
, x
1
, . . . , x
n
,
and suppose x
1
, . . . , x
n
is Lempel-Ziv-parsed into c distinct strings, y
1
, . . . , y
c
.
Let
i
be the location of the rst bit of y
i
, i.e.,
y
i
x

i
, . . . , x

i+1
1
.
1. Dene
s
i
= x

i
k
, . . . , x

i
1
as the k bits preceding y
i
.
2. Dene c
,s
be the number of strings in y
1
, . . . , y
c
with length and pro-
ceeding state s.
3. Dene Q
k
be the k-th order Markov approximation of the stationary source
X, i.e.,
Q
k
(x
1
, . . . , x
n
[x
0
, . . . , x
(k1)
) P
X
n
1
[X
0
(k1)
(x
n
, . . . , x
1
[x
0
, . . . , x
(k1)
),
where P
X
n
1
[X
0
(k1)
is the true (stationary) distribution of the source.
For ease of understanding, these notations are graphically illustrated in Fig-
ure 3.3. It is easy to verify that
n

=1

s.
k
c
,s
= c, and
n

=1

s.
k
c
,s
= n. (3.5.3)
Lemma 3.15 For any Lempel-Ziv parsing of the source x
1
. . . x
n
, we have
log
2
Q
k
(x
1
, . . . , x
n
[s)
n

=1
c
,s
log
2
c
,s
. (3.5.4)
47
s
1
..
x
(k1)
, . . . , x
1
(x
1
, . . . ,
s
2
..
x

2
(k1)
, . . . , x

2
1
. .
y
1
)(x

2
, . . . ,
s
3
..
x

3
(k1)
, . . . , x

3
1
. .
y
2
)
. . . (x

c1
, . . . ,
s
c
..
x
c(k1)
, . . . , x
c1
. .
y
c1
)(x
c
, . . . , x
n
. .
y
c
)
Figure 3.3: Notations used in Lempel-Ziv coder.
Proof: By the k-th order Markov property of Q
k
,
log
2
Q
k
(x
1
, . . . , x
n
[s)
= log
2
Q(y
1
, . . . , y
c
[x)
= log
2
_
c

i=1
P
X

i+1

i
1
[X
0
(k1)
(y
i
[s
i
)
_
=
c

i=1
log
2
P
X

i+1

i
1
[X
0
(k1)
(y
i
[s
i
)
=
n

=1

s.
k
_
_

i :
i+1

i
= and s
i
=s
log
2
P
X

1
[X
0
(k1)
(y
i
[s
i
)
_
_
=
n

=1

s.
k
c
,s
_
_
1
c
,s

i :
i+1

i
= and s
i
=s
log
2
P
X

1
[X
0
(k1)
(y
i
[s
i
)
_
_

=1

s.
k
c
,s
log
2

i :
i+1

i
= and s
i
=s
_
1
c
,s
P
X

1
[X
0
(k1)
(y
i
[s
i
)
_
(3.5.5)

=1

s.
k
c
,s
log
2
_
1
c
,s
_
(3.5.6)
where (3.5.5) follows from Jensens inequality and the concavity of log
2
(), and
(3.5.6) holds since probability sum is no greater than one. 2
Theorem 3.16 Fix a stationary source X. Given any observations x
1
, x
2
, x
3
,
. . .,
limsup
n
c log
2
c
n
H(X
k+1
[X
k
, . . . , X
1
),
for any integer k, where c = c(n) is the number of distinct Lempel-Ziv parsed
strings of x
1
, x
2
, . . . , x
n
.
48
Proof: Lemma 3.15 gives that
log
2
Q
k
(x
1
, . . . , x
n
[s)
n

=1

s.
k
c
,s
log
2
c
,s
=
n

=1

s.
k
c
,s
log
2
c
,s
c
+
n

=1

s.
k
c
,s
log
2
c
= c
n

=1

s.
k
c
,s
c
log
2
c
,s
c
c log
2
c. (3.5.7)
Denote by L and S the random variables with distribution
P
L,S
(, s) =
c
,s
c
,
for which the sum-to-one property of the distribution is justied by (3.5.3). Also,
from (3.5.3), we have
E[L] =
n

=1

s.
k

c
,s
c
=
n
c
.
Therefore, by independent bound for entropy (cf. Theorem 2.19 in Volume I of
the lecture notes) and Lemma 3.14, we get
H(L, S) H(L) + H(S)
(E[L] + 1) log
2
(E[L] + 1) E[L] log
2
E[L] + log
2
[A[
k
=
__
n
c
+ 1
_
log
2
_
n
c
+ 1
_

n
c
log
2
n
c
_
+ k
=
_
log
2
_
n
c
+ 1
_
+
n
c
log
2
_
n/c + 1
n/c
__
+ k
= log
2
_
n
c
+ 1
_
+
n
c
log
2
_
1 +
c
n
_
+ k,
which, together with Lemma 3.13, implies that for n 2
13
,
c
n
H(L, S)
c
n
log
2
_
n
c
+ 1
_
+ log
2
_
1 +
c
n
_
+
c
n
k

2
log
2
n
log
2
_
n

2n 1
+ 1
_
+ log 2
_
1 +
2
log
2
n
_
+
2
log
2
n
k.
Finally, we can re-write (3.5.7) as
c log
2
c
n

1
n
log
2
Q
k
(x
1
, . . . , x
n
[x) +
c
n
H(L, S).
49
As a consequence, by taking the expectation value with respect to X
n
(k1)
on
both sides of the above inequality, we obtain
limsup
n
c log
2
c
n
limsup
n
1
n
E
_
log
2
Q
k
(X
1
, . . . , X
n
[X
(k1)
, . . . , X
0
)

= H(X
k+1
[X
k
, . . . , X
1
).
2
Theorem 3.17 (main result) Let (x
1
, . . . , x
n
) be the Lempel-Ziv codeword
length of an observatory sequence x
1
, . . . , x
n
, which is drawn from a stationary
source X. Then
limsup
n
1
n
(x
1
, . . . , x
n
) lim
n
1
n
H(X
n
).
Proof: Let c(n) be the number of the parsed distinct strings, then
1
n
(x
1
, . . . , x
n
) =
1
n
c(n) (log
2
c(n)| + 1)

1
n
c(n) (log
2
c(n) + 2)
=
c(n) log
2
c(n)
n
+ 2
c(n)
n
.
From Lemma 3.13, we have
limsup
n
c(n)
n
limsup
n
2
log
2
n
= 0.
From Theorem 3.16, we have for any integer k,
limsup
n
c(n) log
2
c(n)
n
H(X
k+1
[X
k
, . . . , X
1
).
Hence,
limsup
n
1
n
(x
1
, . . . , x
n
) H(X
k+1
[X
k
, . . . , X
1
)
for any integer k. The theorem is completed by applying Theorem 4.10 in Volume
I of the lecture notes, which states that for a stationary source X, its entropy
rate always exists and is equal to
lim
n
1
n
H(X
n
) = lim
k
H(X
k+1
[X
k
, . . . , X
1
).
2
We conclude the discussion in the section into the next corollary.
50
Corollary 3.18 The Lempel-Ziv coders asymptotically achieves the entropy
rate of any (unknown) stationary source.
The Lempel-Ziv code is often used in practice to compress data which cannot
be characterized in a simple statistical model, such as English text or computer
source code. It is simple to implement, and has an asymptotic rate approaching
the entropy rate (if it exists) of the source, which is known to be the lower
bound of the lossless data compression code rate. This code can be used without
knowledge of the source distribution provided the source is stationary. Some
well-known examples of its implementation are the compress program in UNIX
and the arc program in DOS, which typically compresses ASCII text les by
about a factor of 2.
51
Bibliography
[1] F. Alajaji and T. Fuja, A communication channel modeled on contagion,
IEEE Trans. on Information Theory, vol. IT40, no. 6, pp. 20352041,
Nov. 1994.
[2] P.-N. Chen and F. Alajaji, Generalized source coding theorems and hy-
pothesis testing, Journal of the Chinese Institute of Engineering, vol. 21,
no. 3, pp. 283-303, May 1998.
[3] T. S. Han, Information-Spectrum Methods in Information Theory, (in
Japanese), Baifukan Press, Tokyo, 1998.
[4] T. S. Han and S. Verd u, Approximation theory of output statistics, IEEE
Trans. on Information Theory, vol. IT39, no. 3, pp. 752772, May 1993.
[5] Y. Steinberg and S. Verd u, Simulation of random processes and rate-
distortion theory, IEEE Trans. on Information Theory, vol. IT42, no. 1,
pp. 6386, Jan. 1996.
52
Chapter 4
Measure of Randomness for Stochastic
Processes
In the previous chapter, it is shown that the sup-entropy rate is indeed the
minimum lossless data compression ratio achievable for block codes. Hence, to
nd an optimal block code becomes a well-dened mission since for any source
with well-formulated statistical model, the sup-entropy rate can be computed
and such quantity can be used as a criterion to evaluate the optimality of the
designed block code.
In the very recent work of Verd u and Han [2], they found that, other than
the minimum lossless data compression ratio, the sup-entropy rate actually has
another operational meaning, which is called resolvability. In this chapter, we
will explore the new concept in details.
4.1 Motivation for resolvability : measure of randomness
of random variables
In simulations of statistical communication systems, generation of random vari-
ables by a computer algorithm is very essential. The computer usually has the
access to a basic random experiment (through pre-dened Application Program-
ing Interface), which generates equally likely random values, such as rand
( )
that
generates a real number uniformly distributed over (0, 1). Conceptually, random
variables with complex models are more dicult to generate by computerw than
random variables with simple models. Question is how to quantify the com-
plexity of generating the random variables by computers. One way to dene
such complexity measurement is:
Denition 4.1 The complexity of generating a random variable is dened as
the number of random bits that the most ecient algorithm requires in order
53
to generate the random variable by computer that has the access to a equally
likely random experiment.
To understand the above denition quantitatively, a simple example is de-
monstrated below.
Example 4.2 Consider the generation of the random variable with probability
masses P
X
(1) = 1/4, P
X
(0) = 1/2, and P
X
(1) = 1/4. An algorithm is written
as:
Flip-a-fair-coin; \\ one random bit
If Head, then output 0;
else

Flip-a-fair-coin; \\ one random bit


If Head, then output 1;
else output 1;

On the average, the above algorithm requires 1.5 coin ips, and in the worst-
case, 2 coin ips are necessary. Therefore, the complexity measure can take two
fundamental forms: worst-case or average-case over the range of outcomes of
the random variables. Note that we did not show in the above example that
the algorithm is the most ecient one in the sense of using minimum number of
random bits; however, it is indeed an optimal algorithm because it achieves the
lower bound of the minimum number of random bits. Later, we will show that
such bound for average minimum number of random bits required for generat-
ing the random variables is the entropy, which is exactly 1.5 bits in the above
example. As for the worse-case bound, a new terminology, resolution, will be
introduced. As a result, the above algorithm also achieves the lower bound of
the worst-case complexity, which is the resolution of the random variable.
4.2 Notations and denitions regarding to resolvability
Denition 4.3 (M-type) For any positive integer M, a probability distribu-
tion P is said to be M-type if
P()
_
0,
1
M
,
2
M
, . . . , 1
_
for all .
54
Denition 4.4 (resolution of a random variable) The resolution
1
R(X) of
a random variable X is the minimum log(M) such that P
X
is M-type. If P
X
is
not M-type for any integer M, then R(X) = .
As revealed previously, a random source needs to be resolved (meaning, can
be generated by a computer algorithm with access to equal-probable random
experiments). As anticipated, a random variable with nite resolution is resolv-
able by computer algorithms. Yet, it is possible that the resolution of a random
variable is innity. A quick example is the random variable X with distribution
P
X
(0) = 1/ and P
X
(1) = 1 1/. (X does not belong to any M-type for -
nite M.) In such case, one can alternatively choose another computer-resolvable
random variable, which resembles the true source within some acceptable range,
to simulate the original one.
One criterion that can be used as a measure of resemblance of two random
variables is the variational distance. As for the same example in the above
paragraph, choose a random variable

X with distribution P
e
X
(0) = 1/3 and
P
e
X
(1) = 2/3. Then |X

X| 0.03, and

X is 3-type, which is computer-
resolvable
2
.
Denition 4.5 (variational distance) The variational distance (or
1
dis-
tance) between tow distributions P and Q dened on common measurable space
(, T) is
|P Q|

[P() Q()[.
1
If the base of the logarithmic operation is 2, the resolution is measured in bits; however,
if natural logarithm is taken, nats becomes the basic measurement unit of resolution.
2
A program that generates M-type random variable for any M satisfying log
2
(M) being a
positive integer is straightforward. A program that generates the 3-type

X is as follows (in C
language).
even = False;
while (1)
Flip-a-fair-coin; one random bit
if (Head)
if (even==True) output 0; even=False;
else output 1; even = True;

else
if (even==True) even=False;
else even=True;

55
(Note that an alternative way to formulate the variational distance is:
|P Q| = 2 sup
ET
[P(E) Q(E)[ = 2

_
x. : P(x)Q(x)
_
[P(x) Q(x)].
This two denitions are actually equivalent.)
Denition 4.6 (-achievable resolution) Fix 0. R is an -achievable
resolution for input X if for all > 0, there exists

X satises
R(

X) < R + and |X

X| < .
-achievable resolution reveals the possibility that one can choose another
computer-resolvable random variable whose variational distance to the true sour-
ce is within an acceptable range, .
Next we dene the -achievable resolution rate for a sequence of random vari-
ables, which is an extension of -achievable resolution dened for single random
variable. Such extension is analogous to extending entropy for a single source to
entropy rate for a random source sequence.
Denition 4.7 (-achievable resolution rate) Fix 0 and input X. R is
an -achievable resolution rate
3
for input X if for every > 0, there exists

X
satises
1
n
R(

X
n
) < R + and |X
n


X
n
| < ,
for all suciently large n.
Denition 4.8 (-resolvability for X) Fix > 0. The -resolvability for in-
put X, denoted by S

(X), is the minimum -achievable resolution rate of the


same input, i.e.,
S

(X) min
_
R : ( > 0)(

X and N)( n > N)


1
n
R(

X
n
) < R + and |X
n


X
n
| <
_
.
3
Note that our denition of resolution rate is dierent from its original form (cf. Denition
7 in [2] and the statements following Denition 7 of the same paper for its modied Denition
for specic input X), which involves an arbitrary channel model W. Readers may treat our
denition as a special case of theirs over identity channel.
56
Here, we dene S

(X) using the minimum instead of a more general in-


mum operation is simply because S

(X) indeed belongs to the range of the


minimum operation, i.e.,
S

(X)
_
R : ( > 0)(

X and N)( n > N)
1
n
R(

X
n
) < R + and |X
n


X
n
| <
_
.
Similar convention will be applied throughout the rest of this chapter.
Denition 4.9 (resolvability for X) The resolvability for input X, denoted
by S(X), is
S(X) lim
0
S

(X).
From the denition of -resolvability, it is obvious non-increasing in . Hence,
the resolvability can also be dened using supremum operation as:
S(X) sup
>0
S

(X).
The resolvability is pertinent to the worse-case complexity measure for ran-
dom variables (cf. Example 4.2, and the discussion following it). With the en-
tropy function, the information theorists also dene the -mean-resolvability and
mean-resolvability for input X, which characterize the average-case complexity
of random variables.
Denition 4.10 (-mean-achievable resolution rate) Fix 0. R is an
-mean-achievable resolution rate for input X if for all > 0, there exists

X
satises
1
n
H(

X
n
) < R + and |X
n


X
n
| < ,
for all suciently large n.
Denition 4.11 (-mean-resolvability for X) Fix > 0. The -mean-re-
solvability for input X, denoted by

S

(X), is the minimum -mean achievable


resolution rate for the same input, i.e.,

(X) min
_
R : ( > 0)(

X and N)( n > N)
1
n
H(

X
n
) < R + and |X
n


X
n
| <
_
.
57
Denition 4.12 (mean-resolvability for X) The mean-resolvability for in-
put X, denoted by

S(X), is

S(X) lim
0

(X) = sup
>0

(X).
The only dierence between resolvability and mean-resolvability is that the
former employs resolution function, while the latter replaces it by entropy func-
tion. Since entropy is the minimum average codeword length for uniquely de-
codable codes, an explanation for mean-resolvability is that the new random
variable

X can be resolvable through realizing the optimal variable-length code
for it. You can think of the probability mass of each outcome of

X is approxi-
mately 2

where is the codeword length of the optimal lossless variable-length


code for

X. Such probability mass can actually be generated by ipping fair
coins times, and the average number of fair coin ipping for this outcome is
indeed 2

. As you may expect, the mean-resolvability is shown to be the


average complexity of a random variable.
4.3 Operational meanings of resolvability and mean-re-
solvability
The operational meanings for the resolution and entropy (a new operational
meaning for entropy other than the one from source coding theorem) follow the
next theorem.
Theorem 4.13 For a single random variable X,
1. the worse-case complexity is lower-bounded by its resolution R(X) [2];
2. the average-case complexity is lower-bounded by its entropy H(X), and is
upper-bounded by entropy H(X) plus 2 bits [3].
Next, we reveal the operational meanings for resolvability and mean-resolva-
bility in source coding. We begin with some useful lemmas that are useful in
characterizing the resolvability.
Lemma 4.14 (bound on variational distance) For every > 0,
|P Q| 2 + 2 P
X
_
x A : log
P(x)
Q(x)
>
_
.
58
Proof:
|P Q|
= 2

_
x. : P(x)Q(x)
_
[P(x) Q(x)]
= 2

_
x. : log[P(x)/Q(x)]0
_
[P(x) Q(x)]
= 2
_
_
_

_
x. : log[P(x)/Q(x)]>
_
[P(x) Q(x)]
+

_
x. : log[P(x)/Q(x)]0
_
[P(x) Q(x)]
_
_
_
2
_
_
_

_
x. : log[P(x)/Q(x)]>
_
P(x)
+

_
x. : log[P(x)/Q(x)]0
_
P(x)
_
1
Q(x)
P(x)
_
_
_
_
2
_
P
_
x A : log
P(x)
Q(x)
>
_
+

_
x. : log[P(x)/Q(x)]0
_
P(x)
_
log
P(x)
Q(x)
_
_
_
_
(by fundamental inequality)
2
_
_
_
P
_
x A : log
P(x)
Q(x)
>
_
+

_
x. : log[P(x)/Q(x)]0
_
P(x)
_
_
_
= 2
_
P
_
x A : log
P(x)
Q(x)
>
_
+ P
X
_
x A : log
P(x)
Q(x)
0
__
= 2
_
P
_
x A : log
P(x)
Q(x)
>
_
+
_
.
59
2
Lemma 4.15
P
e
X
n
_
x
n
A
n
:
1
n
log P
e
X
n
(x
n
)
1
n
R(

X
n
)
_
= 1,
for every n.
Proof: By denition of R(

X
n
),
P
e
X
n
(x
n
) expR(

X
n
)
for all x
n
A
n
. Hence, for all x
n
A
n
,

1
n
log P
e
X
n
(x
n
)
1
n
R(

X
n
).
The lemma then holds. 2
Theorem 4.16 The resolvability for input X is equal to its sup-entropy rate,
i.e.,
S(X) =

H(X).
Proof:
1. S(X)

H(X).
It suces to show that S(X) <

H(X) contradicts to Lemma 4.15.
Suppose S(X) <

H(X). Then there exists > 0 such that
S(X) + <

H(X).
Let
T
0

_
x
n
A
n
:
1
n
log P
X
n(x
n
) S(X) +
_
.
By denition of

H(X),
limsup
n
P
X
n(T
0
) > 0.
Therefore, there exists > 0 such that
limsup
n
P
X
n(T
0
) > ,
60
which immediately implies
P
X
n(T
0
) >
innitely often in n.
Select 0 < < min
2
, 1 and observe that S

(X) S(X), we can choose

X
n
to satisfy
1
n
R(

X
n
) < S(X) +

2
and |X
n


X
n
| <
for suciently large n. Dene
T
1
x
n
A
n
: P
X
n(x
n
) > 0
and

P
X
n(x
n
) P
e
X
n
(x
n
)

P
X
n(x
n
)
_
.
Then
P
X
n(T
c
1
) = P
X
n x
n
A
n
: P
X
n(x
n
) = 0
or

P
X
n(x
n
) P
e
X
n
(x
n
)

>

P
X
n(x
n
)
_
P
X
n x
n
A
n
: P
X
n(x
n
) = 0
+P
X
n
_
x
n
A
n
:

P
X
n(x
n
) P
e
X
n
(x
n
)

>

P
X
n(x
n
)
_
= P
X
n
_
x
n
A
n
:

P
X
n(x
n
) P
e
X
n
(x
n
)

>

P
X
n(x
n
)
_
=

_
x
n
.
n
: P
X
n(x
n
)<(1/

)[P
X
n(x
n
)P
e
X
n
(x
n
)[
_
P
X
n(x
n
)

x
n
.
n
1

[P
X
n(x
n
) P
e
X
n
(x
n
)[

.
Consider that
P
X
n(T
1
T
0
) P
X
n(T
0
) P
X
n(T
c
1
)

> 0, (4.3.1)
which holds innitely often in n; and every x
n
0
in T
1
T
0
satises
P
e
X
n
(x
n
0
) (1

)P
X
n(x
n
0
)
and

1
n
log P
e
X
n
(x
n
0
)
1
n
log P
X
n(x
n
0
) +
1
n
log
1
1 +

(S(X) + ) +
1
n
log
1
1 +

S(X) +

2
,
61
for n > (2/) log(1 +

). Therefore, for those n that (4.3.1) holds,


P
e
X
n
_
x
n
A
n
:
1
n
log P
e
X
n
(x
n
) >
1
n
R(

X
n
)
_
P
e
X
n
_
x
n
A
n
:
1
n
log P
e
X
n
(x
n
) > S(X) +

2
_
P
e
X
n
(T
1
T
0
)
(1
1/2
)P
X
n(T
1
T
0
)
> 0,
which contradicts to the result of Lemma 4.15.
2. S(X)

H(X).
It suces to show the existence of

X for arbitrary > 0 such that
lim
n
|X
n


X
n
| = 0
and

X
n
is an M-type distribution with
M =
_
exp
_
n(

H(X) + )
__
.
Let

X
n
=

X
n
(X
n
) be uniformly distributed over a set
( U
j
A
n
: j = 1, . . . , M
which drawn randomly (independently) according to P
X
n. Dene
T
_
x
n
A
n
:
1
n
log P
X
n(x
n
) >

H(X) + +

n
_
.
For each ( chosen, we obtain from Lemma 4.14 that
|X
n


X
n
|
2 + 2 P
e
X
n
_
x
n
A
n
: log
P
e
X
n
(x
n
)
P
X
n(x
n
)
>
_
= 2 + 2 P
e
X
n
_
x
n
( : log
1/M
P
X
n(x
n
)
>
_
(since P
e
X
n
((
c
) = 0)
= 2 + P
e
X
n
_
x
n
( :
1
n
log P
X
n(x
n
) >

H(X) + +

n
_
= 2 + P
e
X
n
(( T)
= 2 +
1
M
[( T[ .
62
Since ( is chosen randomly, we can take the expectation values (w.r.t. the
random () of the above inequality to obtain:
E

_
|X
n


X
n
|
_
2 +
1
M
E

[[( T[] .
Observe that each U
j
is either in T or not in T, and will contribute weight
1/M when it is in T. From the i.i.d. assumption of U
j

M
j=1
, we can then
evaluate (1/M)E

[[( T[] by
4
1
M
E

[[( T[]
= P
M
X
n[T] +
M 1
M
P
M1
X
n [T]P
X
n[T
c
] + +
1
M
P
X
n[T]P
M1
X
n [T
c
]
=
1
M
_
MP
M
X
n[T] + (M 1)P
M1
X
n [T]P
X
n[T
c
]
+ + P
X
n[T]P
M1
X
n [T
c
]
_
=
1
M
(MP
X
n[T])
= P
X
n[T].
Hence,
limsup
n
E

_
|X
n


X
n
|
_
2 + limsup
n
P
X
n[T] = 2,
which implies
limsup
n
E

_
|X
n


X
n
|
_
= 0 (4.3.2)
since can be chosen arbitrarily small. (4.3.2) therefore guarantees the
existence of the desired

X.
2
Theorem 4.17 For any X,

S(X) = limsup
n
1
n
H(X
n
).
Proof:
4
Readers may imagine that there are cases: where [( T[ = M, [( T[ = M 1, . . .,
[(T[ = 1 and [(T[ = 0, respectively with drawing probability P
M
X
n(T), P
M1
X
n (T)P
X
n(T
c
),
. . ., P
X
n(T)P
M1
X
n (T
c
) and P
M
X
n(T
c
), and with expectation quantity (i.e., (1/M)[(T[) being
M/M, (M 1)/M, . . ., 1/M and 0.
63
1.

S(X) limsup
n
(1/n)H(X
n
).
It suces to prove that

S

(X) limsup
n
(1/n)H(X
n
) for every > 0.
This is equivalent to show that for all > 0, there exists

X such that
1
n
H(

X
n
) < limsup
n
1
n
H(X
n
) +
and
|X
n


X
n
| <
for suciently large n. This can be trivially achieved by letting

X = X,
since for suciently many n,
1
n
H(X
n
) < limsup
n
1
n
H(X
n
) +
and
|X
n
X
n
| = 0.
2.

S(X) limsup
n
(1/n)H(X
n
).
Observe that

S(X)

S

(X) for any 0 < < 1/2. Then for any > 0
and all suciently large n, there exists

X
n
such that
1
n
H(

X
n
) <

S(X) + (4.3.3)
and
|X
n


X
n
| < .
Using the fact [1, pp. 33] that |X
n


X
n
| 1/2 implies
[H(X
n
) H(

X
n
)[ log
[A[
n

,
and (4.3.3), we obtain
1
n
H(X
n
) log [A[ +
1
n
log
1
n
H(

X
n
) <

S(X) + ,
which implies that
limsup
n
1
n
H(X
n
) log [A[ <

S(X) + .
Since and can be taken arbitrarily small, we have

S(X) limsup
n
1
n
H(X
n
).
2
64
4.4 Resolvability and source coding
In the previous chapter, we have proved that the lossless data compression rate
for block codes is lower bounded by

H(X). We also show in Section 4.3 that

H(X) is also the resolvability for source X. We can therefore conclude that
resolvability is equal to the minimum lossless data compression rate for block
codes. We will justify their equivalence directly in this section.
As explained in AEP theorem for memoryless source, the set T
n
() contains
approximately e
nH(X)
elements, and the probability for source sequences being
outside T
n
() will eventually goes to 0. Therefore, we can binary-index those
source sequences in T
n
() by
log
_
e
nH(X)
_
= nH(X) nats,
and encode the source sequences outside T
n
() by a unique default binary code-
word, which results in an asymptotically zero probability of decoding error. This
is indeed the main idea for Shannons source coding theorem for block codes.
By further exploring the above concept, we found that the key is actually the
existence of a set /
n
= x
n
1
, x
n
2
, . . . , x
n
M
with M e
nH(X)
and P
X
n(/
c
n
) 0.
Thus, if we can nd such typical set, the Shannons source coding theorem for
block codes can actually be generalized to more general sources, such as non-
stationary sources. Furthermore, extension of the theorems to codes of some
specic types becomes feasible.
Denition 4.18 (minimum -source compression rate for xed-length
codes) R is the -source compression rate for xed-length codes if there exists
a sequence of sets /
n

n=1
with /
n
A
n
such that
limsup
n
1
n
log [/
n
[ R and limsup
n
P
X
n[/
c
n
] .
T

(X) is the minimum of all such rates.


Note that the denition of T

(X) is equivalent to the one in Denition 3.2.


Denition 4.19 (minimum source compression rate for xed-length
codes) T(X) represents the minimum source compression rate for xed-length
codes, which is dened as:
T(X) lim
0
T

(X).
65
Denition 4.20 (minimum source compression rate for variable-length
codes) R is an achievable source compression rate for variable-length codes if
there exists a sequence of error-free prex codes (
n

n=1
such that
limsup
n
1
n

n
R
where
n
is the average codeword length of (
n
.

T(X) is the minimum of all
such rates.
Recall that for a single source, the measure of its uncertainty is entropy.
Although the entropy can also be used to characterize the overall uncertainty of
a random sequence X, the source coding however concerns more on the average
entropy of it. So far, we have seen four expressions of average entropy:
limsup
n
1
n
H(X
n
) limsup
n
1
n

x
n
.
n
P
X
n(x
n
) log P
X
n(x
n
);
liminf
n
1
n
H(X
n
) liminf
n
1
n

x
n
.
n
P
X
n(x
n
) log P
X
n(x
n
);

H(X) inf
T
_
: limsup
n
P
X
n
_

1
n
log P
X
n(X
n
) >
_
= 0
_
;
H(X) sup
T
_
: limsup
n
P
X
n
_

1
n
log P
X
n(X
n
) <
_
= 0
_
.
If
lim
n
1
n
H(X
n
) = limsup
n
1
n
H(X
n
) = liminf
n
1
n
H(X
n
),
then lim
n
(1/n)H(X
n
) is named the entropy rate of the source.

H(X) and
H(X) are called the sup-entropy rate and inf-entropy rate, which were introduced
in Section 2.3.
Next we will prove that T(X) = S(X) =

H(X) and

T(X) =

S(X) =
limsup
n
(1/n)H(X
n
) for a source X. The operational characterization of
liminf
n
(1/n)H(X
n
) andH(X) will be introduced in Chapter 6.
Theorem 4.21 (equality of resolvability and minimum source coding
rate for xed-length codes)
T(X) = S(X) =

H(X).
Proof: Equality of S(X) and

H(X) is already given in Theorem 4.16. Also,
T(X) =

H(X) can be obtained from Theorem 3.5 by letting = 0. Here, we
provide an alternative proof for T(X) = S(X).
66
1. T(X) S(X).
If we can show that, for any xed, T

(X) S
2
(X), then the proof is
completed. This claim is proved as follows.
By denition of S
2
(X), we know that for any > 0, there exists

X
and N such that for n > N,
1
n
R(

X
n
) < S
2
(X) + and |X
n


X
n
| < 2.
Let /
n

_
x
n
: P
e
X
n
(x
n
) > 0
_
. Since (1/n)R(

X
n
) < S
2
(X) + ,
[/
n
[ expR(

X
n
) < expn(S
2
(X) + ).
Therefore,
limsup
n
1
n
log [/
n
[ S
2
(X) + .
Also,
2 > |X
n


X
n
| = 2 sup
E.
n
[P
X
n(E) P
e
X
n
(E)[
2[P
X
n(/
c
n
) P
e
X
n
(/
c
n
)[
= 2P
X
n(/
c
n
), (sinceP
e
X
n
(/
c
n
) = 0).
Hence, limsup
n
P
X
n(/
c
n
) .
Since S
2
(X) + is just one of the rates that satisfy the conditions
of the minimum -source compression rate, and T

(X) is the smallest


one of such rates,
T

(X) S
2
(X) + for any > 0.
2. T(X) S(X).
Similarly, if we can show that, for any xed, T

(X) S
3
(X), then the
proof is completed. This claim can be proved as follows.
Fix > 0. By denition of T

(X), we know that for any > 0, there


exists N and a sequence of sets /
n

n=1
such that for n > N,
1
n
log [/
n
[ < T

(X) + and P
X
n(/
c
n
) < + .
67
Choose M
n
to satisfy
expn(T

(X) + 2) M
n
expn(T

(X) + 3).
Also select one element x
n
0
from /
c
n
. Dene a new random variable

X
n
as follows:
P
e
X
n
(x
n
) =
_
_
_
0, if x
n
, x
n
0
/
n
;
k(x
n
)
M
n
, if x
n
x
n
0
/
n
,
where
k(x
n
)
_
_
_
M
n
P
X
n(x
n
)|, if x
n
/
n
;
M
n

x
n
,n
k(x
n
), if x
n
= x
n
0
.
It can then be easily veried that

X
n
satises the next four properties:
(a)

X
n
is M
n
-type;
(b) P
e
X
n
(x
n
0
) P
X
n(/
c
n
) < + , since x
n
0
/
c
n
;
(c) for all x
n
/
n
,

P
e
X
n
(x
n
) P
X
n(x
n
)

=
M
n
P
X
n(x
n
)|
M
n
P
X
n(x
n
)
1
M
n
.
(d) P
e
X
n
(/
n
) + P
e
X
n
(x
n
0
) = 1.
Consequently,
1
n
R(

X
n
) T

(X) + 3,
68
and
|X
n


X
n
| =

x
n
,n

P
e
X
n
(x
n
) P
X
n(x
n
)

P
e
X
n
(x
n
0
) P
X
n(x
n
0
)

x
n
,
c
n
x
n
0

P
e
X
n
(x
n
) P
X
n(x
n
)

x
n
,n

P
e
X
n
(x
n
) P
X
n(x
n
)

+
_
P
e
X
n
(x
n
0
) + P
X
n(x
n
0
)

x
n
,
c
n
x
n
0

P
e
X
n
(x
n
) P
X
n(x
n
)

x
n
,n
1
M
n
+ P
e
X
n
(x
n
0
) + P
X
n(x
n
0
)
+

x
n
,
c
n
x
n
0

P
X
n(x
n
)
=
[/
n
[
M
n
+ P
e
X
n
(x
n
0
) +

x
n
,
c
n
P
X
n(x
n
)

expn(T

(X) + )
expn(T

(X) + 2)
+ ( + ) + P
X
n(/
c
n
)
e
n
+ ( + ) + ( + )
3( + ), for n log( + )/.
Since T

(X) is just one of the rates that satisfy the conditions of 3(+
)-resolvability, and S
3(+)
(X) is the smallest one of such quantities,
S
3(+)
(X) T

(X).
The proof is completed by noting that can be made arbitrarily
small.
2
This theorem tells us that the minimum source compression ratio for xed-
length code is the resolvability, which in turn is equal to the sup-entropy rate.
Theorem 4.22 (equality of mean-resolvability and minimum source
coding rate for variable-length codes)

T(X) =

S(X) = limsup
n
1
n
H(X
n
).
Proof: Equality of

S(X) and limsup
n
(1/n)H(X
n
) is already given in The-
orem 4.17.
69
1.

S(X)

T(X).
Denition 4.20 states that there exists, for all > 0 and all suciently
large n, an error-free variable-length code whose average codeword length

n
satises
1
n

n
<

T(X) + .
Moreover, the fundamental source coding lower bound for a uniquely de-
codable code (cf. Theorem 4.18 of Volume I of the lecture notes) is
H(X
n
)
n
.
Thus, by letting

X = X, we obtain |X
n


X
n
| = 0 and
1
n
H(

X
n
) =
1
n
H(X
n
) <

T(X) + ,
which concludes that

T(X) is an -achievable mean-resolution rate of X
for any > 0, i.e.,

S(X) = lim
0

(X)

T(X).
2.

T(X)

S(X).
Observe that

S

(X)

S(X) for 0 < < 1/2. Hence, by taking satis-
fying 2 log [A[ > > log [A[ and for all suciently large n, there exists

X
n
such that
1
n
H(

X
n
) <

S(X) +
and
|X
n


X
n
| < . (4.4.1)
On the other hand, Theorem 4.22 of Volume I of the lecture notes proves
the existence of an error-free prex code for X
n
with average codeword
length
n
satises

n
H(X
n
) + 1 (bits).
By the fact [1, pp. 33] that |X
n


X
n
| 1/2 implies
[H(X
n
) H(

X
n
)[ log
2
[A[
n

,
and (4.4.1), we obtain
1
n

n

1
n
H(X
n
) +
1
n

1
n
H(

X
n
) + log
2
[A[
1
n
log
2
+
1
n


S(X) + + log
2
[A[
1
n
log
2
+
1
n


S(X) + 2,
70
if n > (1 log
2
)/( log
2
[A[). Since can be made arbitrarily small,

S(X) is an achievable source compression rate for variable-length codes;


and hence,

T(X)

S(X).
2
Again, the above theorem tells us that the minimum source compression ratio
for variable-length code is the mean-resolvability, and the mean-resolvability is
exactly limsup
n
(1/n)H(X
n
).
Note that limsup
n
(1/n)H(X
n
)

H(X), which follows straightforwardly
by the fact that the mean of the random variable (1/n) log P
X
n(X
n
) is no
greater than its right margin of the support. Also note that for stationary-
ergodic source, all these quantities are equal, i.e.,
T(X) = S(X) =

H(X) =

T(X) =

S(X) = limsup
n
1
n
H(X
n
).
We end this chapter by computing these quantities for a specic example.
Example 4.23 Consider a binary random source X
1
, X
2
, . . . where X
i

i=1
are
independent random variables with individual distribution
P
X
i
(0) = Z
i
and P
X
i
(1) = 1 Z
i
,
where Z
i

i=1
are pair-wise independent with common uniform marginal distri-
bution over (0, 1).
You may imagine that the source is formed by selecting from innitely many
binary number generators as shown in Figure 4.1. The selecting process Z is
independent for each time instance.
It can be shown that such source is not stationary. Nevertheless, by means
of similar argument as AEP theorem, we can show that:

log P
X
(X
1
) + log P
X
(X
2
) + + log P
X
(X
n
)
n
h
b
(Z) in probability,
where h
b
(a) a log
2
(a) (1 a) log
2
(1 a) is the binary entropy function.
To compute the ultimate average entropy rate in terms of the random variable
h
b
(Z), it requires that

log P
X
(X
1
) + log P
X
(X
2
) + + log P
X
(X
n
)
n
h
b
(Z) in mean,
71
source X
t
t I
source X
t
2
source X
t
1
.
.
.
.
.
.
.
.
.
-
-
-
Selector
?
Z
-
. . . , X
2
, X
1
.
.
.
Source Generator
Figure 4.1: Source generator: X
t

tI
(I = (0, 1)) is an independent
random process with P
Xt
(0) = 1 P
Xt
(1) = t, and is also independent
of the selector Z, where X
t
is outputted if Z = t. Source generator of
each time instance is independent temporally.
which is a stronger result than convergence in probability. With the fundamental
properties for convergence, convergence-in-probability implies convergence-in-
mean provided the sequence of random variables is uniformly integrable, which
72
is true for (1/n)

n
i=1
log P
X
(X
i
):
sup
n>0
E
_

1
n
n

i=1
log P
X
(X
i
)

_
sup
n>0
1
n
n

i=1
E [[log P
X
(X
i
)[]
= sup
n>0
E [[log P
X
(X)[] , because of i.i.d. of X
i

n
i=1
= E [[log P
X
(X)[]
= E
_
E
_
[log P
X
(X)[

Z
__
=
_
1
0
E
_
[log P
X
(X)[

Z = z
_
dz
=
_
1
0
_
z[ log(z)[ + (1 z)[ log(1 z)[
_
dz

_
1
0
log(2)dz = log(2).
We therefore have:

E
_

1
n
log P
X
n(X
n
)
_
E[h
b
(Z)]

E
_

1
n
log P
X
n(X
n
) h
b
(Z)

_
0.
Consequently,
limsup
n
1
n
H(X
n
) = E[h
b
(Z)] = 0.5 nats or 0.721348 bits.
However, it can be shown that the ultimate cumulative distribution function of
(1/n) log P
X
n(X
n
) is Pr[h
b
(Z) t] for t [0, log(2)] (cf. Figure 4.2).
The sup-entropy rate of X should be log(2) nats or 1 bit (which is the
right-margin of the ultimate CDF of (1/n) log P
X
n(X
n
)). Hence, for this un-
stationary source, the minimum average codeword length for xed-length codes
and variable-length codes are dierent, which are 0.859912 bit and 1 bit, respec-
tively.
73
0
1
0 log(2) nats
Figure 4.2: The ultimate CDF of (1/n) log P
X
n(X
n
): Prh
b
(Z) t.
74
Bibliography
[1] I. Csiszar and J. K orner, Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic, New York, 1981.
[2] T. S. Han and S. Verd u, Approximation theory of output statistics, IEEE
Trans. on Information Theory, vol. IT39, no. 3, pp. 752772, May 1993.
[3] D. E. Knuth and A. C. Yao, The complexity of random number genera-
tion, in Proceedings of Symposium on New Directions and Recent Results
in Algorithms and Complexity. New York: Academic Press, 1976.
75
Chapter 5
Channel Coding Theorems and
Approximations of Output Statistics for
Arbitrary Channels
The Shannon channel capacity in Volume I of the lecture notes is derived under
the assumption that the channel is memoryless. With moderate modication of
the proof, this result can be extended to stationary-ergodic channels for which
the capacity formula becomes the maximization of the mutual information rate:
lim
n
sup
X
n
1
n
I(X
n
; Y
n
).
Yet, for more general channels, such as non-stationary or non-ergodic channels,
a more general expression for channel capacity needs to be derived.
5.1 General models for channels
The channel transition probability in its most general form is denoted by W
n
=
P
Y
n
[X
n

n=1
, which is abbreviated by W for convenience. Similarly, the input
and output random processes are respectively denoted by X and Y .
5.2 Variations of capacity formulas for arbitrary channels
5.2.1 Preliminaries
Now, similar to the denitions of sup- and inf- entropy rates for sequence of
sources, the sup- and inf- (mutual-)information rates are respectively dened
76
by
1

I(X; Y ) sup : i() < 1


and
I(X; Y ) sup :

i() 0,
where
i() liminf
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
)
_
is the inf-spectrum of the normalized information density,

i() limsup
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
)
_
is the sup-spectrum of the normalized information density, and
i
X
n
W
n(x
n
; y
n
) log
P
Y
n
[X
n(y
n
[x
n
)
P
Y
n(y
n
)
is the information density.
In 1994, Verd u and Han [2] have shown that the channel capacity in its most
general form is
C sup
X
I(X; Y ).
In their proof, they showed the achievability part in terms of Feinsteins Lemma,
and provide a new proof for the converse part. In this section, we will not adopt
the original proof of Verd u and Han in the converse theorem. Instead, we will
use a new and tighter bound [1] established by Poor and Verd u in 1995.
Denition 5.1 (xed-length data transmission code) An (n, M) xed-
length data transmission code for channel input alphabet A
n
and output alpha-
bet
n
consists of
1. M informational messages intended for transmission;
1
In the paper of Verd u and Han [2], these two quantities are dened by:

I(X; Y ) inf

_
: ( > 0) limsup
n
P
X
n
W
n
_
1
n
i
X
n
W
n(X
n
; Y
n
) > +
_
= 0
_
and
I(X; Y ) sup

_
: ( > 0) limsup
n
P
X
n
W
n
_
1
n
i
X
n
W
n(X
n
; Y
n
) < +
_
= 0
_
.
The above denitions are in fact equivalent to ours.
77
2. an encoding function
f : 1, 2, . . . , M A
n
;
3. a decoding function
g :
n
1, 2, . . . , M,
which is (usually) a deterministic rule that assigns a guess to each possible
received vector.
The channel inputs in x
n
A
n
: x
n
= f(m) for some 1 m M are the
codewords of the data transmission code.
Denition 5.2 (average probability of error) The average probability of
error for a (
n
= (n, M) code with encoder f() and decoder g() transmitted
over channel Q
Y
n
[X
n is dened as
P
e
( (
n
) =
1
M
M

i=1

i
,
where

i


_
y
n

n
: g(y
n
),=i
_
Q
Y
n
[X
n(y
n
[f(i)).
Under the criterion of average probability of error, all of the codewords are
treated equally, namely the prior probability of the selected M codewords are
uniformly distributed.
Lemma 5.3 (Feinsteins Lemma) Fix a positive n. For every > 0 and
input distribution P
X
n on A
n
, there exists an (n, M) block code for the transition
probability P
W
n = P
Y
n
[X
n that its average error probability P
e
( (
n
) satises
P
e
( (
n
) < Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) <
1
n
log M +
_
+ e
n
.
Proof:
Step 1: Notations. Dene
(
_
(x
n
, y
n
) A
n

n
:
1
n
i
X
n
W
n(x
n
; y
n
)
1
n
log M +
_
.
Let e
n
+ P
X
n
W
n((
c
).
78
The Feinsteins Lemma obviously holds if 1, because then
P
e
( (
n
) 1 Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) <
1
n
log M +
_
+ e
n
.
So we assume < 1, which immediately results in
P
X
n
W
n((
c
) < < 1,
or equivalently,
P
X
n
W
n(() > 1 > 0.
Therefore,
P
X
n(/ x
n
A
n
: P
Y
n
[X
n((
x
n[x
n
) > 1 ) > 0,
where (
x
n y
n

n
: (x
n
, y
n
) (, because if P
X
n(/) = 0,
( x
n
with P
X
n(x
n
) > 0) P
Y
n
[X
n((
x
n[x
n
) 1

x
n
.
n
P
X
n(x
n
)P
Y
n
[X
n((
x
n[x
n
) = P
X
n
W
n(() 1 .
Step 2: Encoder. Choose an x
n
1
in / (Recall that P
X
n(/) > 0.) Dene
1
=
(
x
n
1
. (Then P
Y
n
[X
n(
1
[x
n
1
) > 1 .)
Next choose, if possible, a point x
n
2
A
n
without replacement (i.e., x
n
2
can
be identical to x
n
1
) for which
P
Y
n
[X
n
_
(
x
n
2

x
n
2
_
> 1 ,
and dene
2
(
x
n
2

1
.
Continue in the following way as for codeword i: choose x
n
i
to satisfy
P
Y
n
[X
n
_
(
x
n
i

i1
_
j=1

x
n
i
_
> 1 ,
and dene
i
(
x
n
i

i1
j=1

j
.
Repeat the above codeword selecting procedure until either M codewords
have been selected or all the points in / have been exhausted.
Step 3: Decoder. Dene the decoding rule as
(y
n
) =
_
i, if y
n

i
arbitraryy, otherwise.
79
Step 4: Probability of error. For all selected codewords, the error probabi-
lity given codeword i is transmitted,
e[i
, satises

e[i
P
Y
n
[X
n(
c
i
[x
n
i
) < .
(Note that ( i) P
Y
n
[X
n(
i
[x
n
i
) 1 by step 2.) Therefore, if we can show
that the above codeword selecting procedures will not terminate before M,
then
P
e
( (
n
) =
1
M
M

i=1

e[i
< .
Step 5: Claim. The codeword selecting procedure in step 2 will not terminate
before M.
proof: We will prove it by contradiction.
Suppose the above procedure terminates before M, say at N < M. Dene
the set
T
N
_
i=1

i

n
.
Consider the probability
P
X
n
W
n(() = P
X
n
W
n[( (A
n
T)] + P
X
n
W
n[( (A
n
T
c
)]. (5.2.1)
Since for any y
n
(
x
n
i
,
P
Y
n(y
n
)
P
Y
n
[X
n(y
n
[x
n
i
)
M e
n
,
we have
P
Y
n(
i
) P
Y
n((
x
n
i
)

1
M
e
n
P
Y
n
[X
n((
x
n
i
[x
n
i
)

1
M
e
n
.
So the rst term of the right hand side in (5.2.1) can be upper bounded by
P
X
n
W
n[( (A
n
T)] P
X
n
W
n(A
n
T)
= P
Y
n(T)
=
N

i=1
P
Y
n(
i
)
N
1
M
e
n
=
N
M
e
n
.
80
As for the second term of the right hand side in (5.2.1), we can upper
bound it by
P
X
n
W
n[( (A
n
T
c
)] =

x
n
.
n
P
X
n(x
n
)P
Y
n
[X
n((
x
n T
c
[x
n
)
=

x
n
.
n
P
X
n(x
n
)P
Y
n
[X
n
_
(
x
n
N
_
i=1

x
n
_

x
n
.
n
P
X
n(x
n
)(1 ) (1 ),
where the last step follows since for all x
n
A
n
,
P
Y
n
[X
n
_
(
x
n
N
_
i=1

x
n
_
1 .
(Because otherwise we could nd the (N + 1)-th codeword.)
Consequently, P
X
n
W
n(() (N/M)e
n
+ (1 ). By denition of (,
P
X
n
W
n(() = 1 + e
n

N
M
e
n
+ (1 ),
which implies N M, a contradiction. 2
Lemma 5.4 (Poor-Verd u Lemma [1]) Suppose X and Y are random vari-
ables, where X taking values on a nite (or coutably innite) set
A = x
1
, x
2
, x
3
, . . . .
The minimum probability of error P
e
in estimating X from Y satises
P
e
(1 )P
X,Y
_
(x, y) A : P
X[Y
(x[y)
_
for each [0, 1].
Proof: It is known that the minimum-error-probability estimate e(y) of X when
receiving y is
e(y) = arg max
x.
P
X[Y
(x[y). (5.2.2)
Therefore, the error probability incurred in testing among the values of X is
81
given by
1 P
e
= PrX = e(Y )
=
_

_

_
x : x=e(y)
_
P
X[Y
(x[y)
_
dP
Y
(y)
=
_

_
max
x.
P
X[Y
(x[y)
_
dP
Y
(y)
=
_

_
max
x.
f
x
(y)
_
dP
Y
(y)
= E
_
max
x.
f
x
(Y )
_
,
where f
x
(y) P
X[Y
(x[y). Let h
j
(y) be the j-th element in the re-ordering set
of f
x
1
(y), f
x
2
(y), f
x
3
(y), . . . according to ascending element values. In other
words,
h
1
(y) h
2
(y) h
3
(y)
and
h
1
(y), h
2
(y), h
3
(y), . . . = f
x
1
(y), f
x
2
(y), f
x
3
(y), . . ..
Then
1 P
e
= E[h
1
(Y )]. (5.2.3)
For any [0, 1], we can write
P
X,Y
(x, y) A : f
x
(y) > =
_

P
X[Y
x A : f
x
(y) > dP
Y
(y).
Observe that
P
X[Y
x A : f
x
(y) > =

x.
P
X[Y
(x[y) 1[f
x
(y) > ]
=

x.
f
x
(y) 1[f
x
(y) > ]
=

j=1
h
j
(y) 1[h
j
(y) > ],
82
where 1() is the indicator function
2
. Therefore,
P
XY
(x, y) A : f
x
(y) > =
_

j=1
h
j
(y) 1[h
j
(y) > ]
_
dP
Y
(y)

h
1
(y) 1(h
1
(y) > )dP
Y
(y)
= E[h
1
(Y ) 1(h
1
(Y ) > )].
It remains to relate E[h
1
(Y ) 1(h
1
(Y ) > )] with E[h
1
(Y )], which is exactly
1 P
e
. For any [0, 1] and any random variable U with Pr0 U 1 = 1,
U + (1 ) U 1(U > )
holds with probability one. This can be easily proved by upper-bounding U in
terms of when 0 U , and + (1 )U, otherwise. Thus
E[U] + (1 )E[U 1(U > )].
By letting U = h
1
(Y ), together with (5.2.3), we nally obtain
(1 )P
XY
(x, y) A : f
x
(y) > (1 )E[h
1
(Y ) 1[h
1
(Y ) > ]]
E[h
1
(Y )]
= (1 P
e
)
= (1 ) P
e
.
2
Corollary 5.5 Every (
n
= (n, M) code satises
P
e
( (
n
)
_
1 e
n
_
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
)
1
n
log M
_
for every > 0, where X
n
places probability mass 1/M on each codeword, and
P
e
( (
n
) denotes the error probability of the code.
Proof: Taking = e
n
in Lemma 5.4, and replacing X and Y in Lemma 5.4
2
I.e., if f
x
(y) > is true, 1() = 1; else, it is zero.
83
by its n-fold counterparts, i.e., X
n
and Y
n
, we obtain
P
e
( (
n
)
_
1 e
n
_
P
X
n
W
n
_
(x
n
, y
n
) A
n

n
: P
X
n
[Y
n(x
n
[y
n
) e
n

=
_
1 e
n
_
P
X
n
W
n
_
(x
n
, y
n
) A
n

n
:
P
X
n
[Y
n(x
n
[y
n
)
1/M

e
n
1/M
_
=
_
1 e
n
_
P
X
n
W
n
_
(x
n
, y
n
) A
n

n
:
P
X
n
[Y
n(x
n
[y
n
)
P
X
n(x
n
)

e
n
1/M
_
=
_
1 e
n
_
P
X
n
W
n [(x
n
, y
n
) A
n

n
:
1
n
log
P
X
n
[Y
n(x
n
[y
n
)
P
X
n(x
n
)

1
n
log M
_
=
_
1 e
n
_
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
)
1
n
log M
_
.
2
Denition 5.6 (-achievable rate) Fix [0, 1]. R 0 is an -achievable
rate if there exists a sequence of (
n
= (n, M
n
) channel block codes such that
liminf
n
1
n
log M
n
R
and
limsup
n
P
e
( (
n
) .
Denition 5.7 (-capacity C

) Fix [0, 1]. The supremum of -achievable


rates is called the -capacity, C

.
It is straightforward for the denition that C

is non-decreasing in , and
C
1
= log [A[.
Denition 5.8 (capacity C) The channel capacity C is dened as the supre-
mum of the rates that are -achievable for all [0, 1]. It follows immediately
from the denition
3
that C = inf
01
C

= lim
0
C

= C
0
and that C is the
3
The proof of C
0
= lim
0
C

can be proved by contradiction as follows.


Suppose C
0
+ 2 < lim
0
C

for some > 0. For any positive integer j, and by deni-


tion of C
1/j
, there exists N
j
and a sequence of (
n
= (n, M
n
) code such that for n > N
j
,
(1/n) log M
n
> C
1/j
> C
0
+ and P
e
( (
n
) < 2/j. Construct a sequence of codes

(
n
=
(n,

M
n
) as:

(
n
= (
n
, if max
j1
i=1
N
i
n < max
j
i=1
N
i
. Then limsup
n
(1/n) log

M
n
C
0
+
and liminf
n
P
e
(

(
n
) = 0, which contradicts to the denition of C
0
.
84
supremum of all the rates R for which there exists a sequence of (
n
= (n, M
n
)
channel block codes such that
liminf
n
1
n
log M
n
R,
and
limsup
n
P
e
( (
n
) = 0.
5.2.2 -capacity
Theorem 5.9 (-capacity) For 0 < < 1, the -capacity C

for arbitrary
channels satises
sup
X
lim

(X; Y ) C

sup
X
I

(X; Y ).
The upper and lower bounds of the above inequality meet except at the points
of discontinuity of C

, of which, there are, at most, countably many.


Proof:
1. C

sup
X
lim

(X; Y ).
Fix input X. It suces to show the existence of (
n
= (n, M
n
) data
transmission code with rate
lim

(X; Y ) <
1
n
log M
n
< lim

(X; Y )

2
and probability of decoding error satisfying
P
e
( (
n
)
for every > 0. (Because if such code exists, then limsup
n
P
e
( (
n
)
and liminf
n
(1/n) log M
n
lim

(X; Y ) , which implies C


lim

(X; Y ).)
From Lemma 5.3, there exists an (
n
= (n, M
n
) code whose error probabi-
lity satises
P
e
( (
n
) < Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) <
1
n
log M
n
+

4
_
+ e
n/4
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) <
_
lim

(X; Y )

2
_
+

4
_
+ e
n/4
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) < lim

(X; Y )

4
_
+ e
n/4
, (5.2.4)
85
where (5.2.4) holds for all suciently large n because
lim

(X; Y ) sup
_
R : Pr
_
1
n
i
W
n
W
n(X
n
; Y
n
) R
_
<
_
,
or more specically,
limsup
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) < lim

(X; Y )

4
_
< .
Hence, the proof of the direct part is completed.
2. C

sup
X
I

(X; Y ).
Suppose that there exists a sequence of (
n
= (n, M
n
) codes with rate
strictly larger than sup
X
I

(X; Y ) and limsup


n
P
e
( (
n
) . Let the
ultimate code rate for this code be sup
X
I

(X; Y ) + 3 for some > 0.


Then for suciently large n,
1
n
log M
n
> sup
X
I

(X; Y ) + 2.
Since the above inequality holds for every X, it certainly holds if taking
input

X
n
which places probability mass 1/M
n
on each codeword, i.e.,
1
n
log M
n
> I

(

X;

Y ) + 2, (5.2.5)
where

Y is the channel output due to channel input

X. Then from Corol-
lary 5.5, the error probability of the code satises
P
e
( (
n
)
_
1 e
n
_
Pr
_
1
n
i

X
n
W
n
(

X
n
;

Y
n
)
1
n
log M
n

_
1 e
n
_
Pr
_
1
n
i

X
n
W
n
(

X
n
;

Y
n
) I

(

X;

Y ) +
_
,
where the last inequality follows from (5.2.5), which by taking the limsup
of both sides, we have
limsup
n
P
e
( (
n
) limsup
n
Pr
_
1
n
i

X
n
W
n
(

X
n
;

Y
n
) I

(

X;

Y ) +
_
> ,
and a desired contradiction is obtained. 2
86
5.2.3 General Shannon capacity
Theorem 5.10 (general Shannon capacity) The channel capacity C for ar-
bitrary channel satises
C = sup
X
I(X; Y ).
Proof: Observe that
C = C
0
= lim
0
C

.
Hence, from Theorem 5.9, we note that for (0, 1),
C sup
X
lim

(X; Y ) sup
X
I
0
(X; Y ) = sup
X
I(X; Y ).
It remains to show that C sup
X
I(X; Y ).
Suppose that there exists a sequence of (
n
= (n, M
n
) codes with rate strictly
larger than sup
X
I(X; Y ) and error probability tends to 0 as n . Let the
ultimate code rate for this code be sup
X
I(X; Y ) + 3 for some > 0. Then
for suciently large n,
1
n
log M
n
> sup
X
I(X; Y ) + 2.
Since the above inequality holds for every X, it certainly holds if taking input

X
n
which places probability mass 1/M
n
on each codeword, i.e.,
1
n
log M
n
> I(

X;

Y ) + 2, (5.2.6)
where

Y is the channel output due to channel input

X. Then from Corollary
5.5, the error probability of the code satises
P
e
( (
n
)
_
1 e
n
_
Pr
_
1
n
i

X
n
W
n
(

X
n
;

Y
n
)
1
n
log M
n

_

_
1 e
n
_
Pr
_
1
n
i

X
n
W
n
(

X
n
;

Y
n
) I(

X;

Y ) +
_
, (5.2.7)
where the last inequality follows from (5.2.6). Since, by assumption, P
e
( (
n
)
vanishes as n , but (5.2.7) cannot vanish by denition of I(

X;

Y ), thereby
a desired contradiction is obtained. 2
87
5.2.4 Strong capacity
Denition 5.11 (strong capacity) Dene the strong converse capacity (or
strong capacity) C
SC
as the inmum of the rates R such that for all channel
block codes (
n
= (n, M
n
) with
liminf
n
1
n
log M
n
R,
we have
liminf
n
P
e
( (
n
) = 1.
Theorem 5.12 (general strong capacity)
C
SC
sup
X

I(X; Y ).
5.2.5 Examples
With these general capacity formulas, we can now compute the channel capac-
ity for some of the non-stationary or non-ergodic channels, and analyze their
properties.
Example 5.13 (Shannon capacity) Let the input and output alphabets be
0, 1, and let every output Y
i
be given by:
Y
i
= X
i
N
i
,
where represents modulo-2 addition operation. Assume the input process
X and the noise process N are independent.
A general relation between the entropy rate and mutual information rate can
be derived from (2.4.2) and (2.4.4) as:
H(Y )

H(Y [X) I(X; Y )

H(Y )

H(Y [X).
Since N
n
is completely determined from Y
n
under the knowledge of X
n
,

H(Y [X) =

H(N).
Indeed, this channel is a symmetric channel. Therefore, an uniform input yields
uniform output (Bernoulli with parameter (1/2)), and H(Y ) =

H(Y ) = log(2)
nats. We thus have
C = log(2)

H(N) nats.
We then compute the channel capacity for the next two cases of noises.
88
Case A) If N is a non-stationary binary independent sequence with
PrN
i
= 1 = p
i
,
then by the uniform boundedness (in i) of the variance of random variable
log P
N
i
(N
i
), namely,
Var[log P
N
i
(N
i
)] E[(log P
N
i
(N
i
))
2
]
sup
0<p
i
<1
_
p
i
(log p
i
)
2
+ (1 p
i
)(log(1 p
i
))
2

log(2),
we have by Chebyshevs inequality,
Pr
_

1
n
log P
N
n(N
n
)
1
n
n

i=1
H(N
i
)

>
_
0,
for any > 0. Therefore,

H(N) = limsup
n
(1/n)

n
i=1
h
b
(p
i
). Conse-
quently,
C = log(2) limsup
n
1
n
n

i=1
h
b
(p
i
) nats/channel usage.
This result is illustrated in Figures 5.1.
-

H(N)

H(N) cluster points
Figure 5.1: The ultimate CDFs of (1/n) log P
N
n(N
n
).
Case B) If N has the same distribution as the source process in Example 4.23,
then

H(N) = log(2) nats, which yields zero channel capacity.
Example 5.14 (strong capacity) Continue from Example 5.13. To show the
inf-information rate of channel W, we rst derive the relation of the CDF be-
tween the information density and noise process with respective to the uniform
89
input that maximizes the information rate (In this case, P
X
n(x
n
) = P
Y
n(y
n
) =
2
n
).
Pr
_
1
n
log
P
Y
n
[X
n(Y
n
[X
n
)
P
Y
n(Y
n
)

_
= Pr
_
1
n
log P
N
n(N
n
)
1
n
log P
Y
n(Y
n
)
_
= Pr
_
1
n
log P
N
n(N
n
) log(2)
_
= Pr
_

1
n
log P
N
n(N
n
) log(2)
_
= 1 Pr
_

1
n
log P
N
n(N
n
) < log(2)
_
. (5.2.8)
Case A) From (5.2.8), it is obvious that
C
SC
= 1 liminf
n
1
n
n

i=1
h
b
(p
i
).
Case B) From (5.2.8) and also from Figure 5.2,
C
SC
= log(2) nats/channel usage.
In Case B of Example 5.14, we have derived the ultimate CDF of the nor-
malized information density, which is depicted in Figure 5.2. This limiting CDF
is called spectrum of the normalized information density.
In Figure 5.2, it has been stated that the channel capacity is 0 nat/channnel
usage, and the strong capacity is log(2) nat/channel usage. Hence, the opera-
tional meaning of the two margins has been clearly revealed. Question is what
is the operational meaning of the function value between 0 and log(2)?. The
answer of this question actually follows Denition 5.7 of a new capacity-related
quantity.
In practice, it may not be easy to design a block code which transmits infor-
mation with (asymptotically) no error through a very noisy channel with rate
equals to channel capacity. However, if we admit some errors in transmission,
such as the error probability is bounded above by 0.001, we may have more
chance to come up with a feasible block code.
Example 5.15 (-capacity) Continue from case B of Example 5.13. Let the
spectrum of the normalized information density be i(). Then the -capacity of
this channel is actually the inverse function of i(), i.e.
C

= i
1
().
90
-
0 log(2) nats
1
Figure 5.2: The ultimate CDF of the normalized information density
for Example 5.14-Case B).
Note that the Shannon capacity can be written as:
C = lim
0
C

;
and in general, the strong capacity satises
C
SC
lim
1
C

.
However, equality of the above inequality holds in this example.
5.3 Structures of good data transmission codes
The channel capacity for discrete memoryless channel is shown to be:
C max
P
X
I(P
X
, Q
Y [X
).
Let P
X
be the optimizer of the above maximization operation. Then
C max
P
X
I(P
X
, Q
Y [X
) = I(P
X
, Q
Y [X
).
Here, the performance of the code is assumed to be the average error probability,
namely
P
e
( (
n
) =
1
M
M

i=1
P
e
( (
n
[x
n
i
),
91
if the codebook is (
n
x
n
1
, x
n
2
, . . . , x
n
M
. Due to the random coding argument,
a deterministic good code with arbitrarily small error probability and rate less
than channel capacity must exist. Question is what is the relationship between
the good code and the optimizer P
X
? It is widely believed that if the code is
good (with rate close to capacity and low error probability), then the output
statistics P
e
Y
n
due to the equally-likely codemust approximate the output
distribution, denoted by P
Y
n
, due to the input distribution achieving the channel
capacity.
This fact is actually reected in the next theorem.
Theorem 5.16 For any channel W
n
= (Y
n
[X
n
) with nite input alphabet and
capacity C that satises the strong converse (i.e., C = C
SC
), the following
statement holds.
Fix > 0 and a sequence of (
n
= (n, M
n
)

n=1
block codes with
1
n
log M
n
C /2,
and vanishing error probability (i.e., error probability approaches zero as block-
length n tends to innity.) Then,
1
n
|

Y
n
Y
n
| for all suciently large n,
where

Y
n
is the output due to the block code and Y
n
is the output due the X
n
that satises
I(X
n
; Y
n
) = max
X
n
I(X
n
; Y
n
).
To be specic,
P
e
Y
n
(y
n
) =

x
n
( n
P
e
X
n
(x
n
)P
W
n(y
n
[x
n
) =

x
n
( n
1
M
n
P
W
n(y
n
[x
n
)
and
P
Y
n
(y
n
) =

x
n
.
n
P
X
n
(x
n
)P
W
n(y
n
[x
n
).
Note that the above theorem holds for arbitrary channels, not restricted to
only discrete memoryless channels.
One may query that can a result in the spirit of the above theorem be
proved for the input statistics rather than the output statistics? The answer
is negative. Hence, the statement that the statistics of any good code must
approximate those that maximize the mutual information is erroneously taken
for granted. (However, we do not rule out the possibility of the existence of
92
good codes that approximate those that maximize the mutual information.) To
see this, simply consider the normalized entropy of X
n
versus that of

X
n
(which
is uniformly distributed over the codewords) for discrete memoryless channels:
1
n
H(X
n
)
1
n
H(

X
n
) =
_
1
n
H(X
n
[Y
n
) +
1
n
I(X
n
; Y
n
)
_

1
n
log(M
n
)
=
_
H(X[Y ) + I(X; Y )

1
n
log(M
n
)
=
_
H(X[Y ) + C

1
n
log(M
n
).
A good code with vanishing error probability exists for (1/n) log(M
n
) arbitrarily
close to C; hence, we can nd a good code sequence to satisfy
lim
n
_
1
n
H(X
n
)
1
n
H(

X
n
)
_
= H(X[Y ).
Since the term H(X[Y ) is in general positive, where a quick example is the BSC
with crossover probability p, which yields H(X[Y ) = p log(p)(1p) log(1p),
the two input distributions does not necessarily resemble to each other.
5.4 Approximations of output statistics: resolvability for
channels
5.4.1 Motivations
The discussion of the previous section somewhat motivates the necessity to nd
a equally-distributed (over a subset of input alphabet) input distribution that
generates the output statistics, which is close to the output due to the input
that maximizes the mutual information. Since such approximations are usually
performed by computers, it may be natural to connect approximations of the
input and output statistics with the concept of resolvability.
5.4.2 Notations and denitions of resolvability for chan-
nels
In a data transmission system as shown in Figure 5.3, suppose that the source,
channel and output are respectively denoted by
X
n
(X
1
, . . . , X
n
),
W
n
(W
1
, . . . , W
n
),
93
and
Y
n
(Y
1
, . . . , Y
n
),
where W
i
has distribution P
Y
i
[X
i
.
To simulate the behavior of the channel, a computer-generated input may
be necessary as shown in Figure 5.4. As stated in Chapter 4, such computer-
generated input is based on an algorithm formed by a few basic uniform random
experiments, which has nite resolution. Our goal is to nd a good computer-
generated input

X
n
such that the corresponding output

Y
n
is very close to the
true output Y
n
.
-
. . . , X
3
, X
2
, X
1
true source
P
Y
n
[X
n
true channel
-
. . . , Y
3
, Y
2
, Y
1
true output
Figure 5.3: The communication system.
-
. . . ,

X
3
,

X
2
,

X
1
computer-generated
source
P
Y
n
[X
n
true channel
-
. . . ,

Y
3
,

Y
2
,

Y
1
corresponding
output
Figure 5.4: The simulated communication system.
Denition 5.17 (-resolvability for input X and channel W) Fix >
0, and suppose that the (true) input random variable and (true) channel statistics
are X and W = (Y [X), respectively.
Then the -resolvability S

(X, W) for input X and channel W is dened


by:
S

(X, W) min
_
R : ( > 0)(

X and N)( n > N)
1
n
R(

X
n
) < R + and |Y
n


Y
n
| <
_
,
where P
e
Y
n
= P
e
X
n
P
W
n. (The denitions of resolution R() and variational dis-
tance |() ()| are given by Denitions 4.4 and 4.5.)
Note that if we take the channel W
n
to be an identity channel for all n,
namely A
n
=
n
and P
Y
n
[X
n(y
n
[x
n
) is either 1 or 0, then the -resolvability for
input X and channel W is reduced to source -resolvability for X:
S

(X, W
Identity
) = S

(X).
94
Similar reductions can be applied to all the following denitions.
Denition 5.18 (-mean-resolvability for input X and channel W) Fix
> 0, and suppose that the (true) input random variable and (true) channel
statistics are respectively X and W.
Then the -mean-resolvability

S

(X, W) for input X and channel W is


dened by:

(X, W) min
_
R : ( > 0)(

X and N)( n > N)
1
n
H(

X
n
) < R + and |Y
n


Y
n
| <
_
,
where P
e
Y
n
= P
e
X
n
P
W
n and P
Y
n = P
X
nP
W
n.
Denition 5.19 (resolvability and mean resolvability for input X and
channel W) The resolvability and mean-resolvability for input X and W are
dened respectively as:
S(X, W) sup
>0
S

(X, W) and

S(X, W) sup
>0

(X, W).
Denition 5.20 (resolvability and mean resolvability for channel W)
The resolvability and mean-resolvability for channel W are dened respectively
as:
S(W) sup
X
S(X, W),
and

S(W) sup
X

S(X, W).
5.4.3 Results on resolvability and mean-resolvability for
channels
Theorem 5.21
S(W) = C
SC
= sup
X

I(X; Y ).
It is somewhat a reasonable inference that if no computer algorithms can pro-
duce a desired good output statistics under the number of random nats specied,
then all codes should be bad codes for this rate.
Theorem 5.22

S(W) = C = sup
X
I(X; Y ).
95
It it not yet clear that the operational relation between the resolvability (or
mean-resolvability) and capacities for channels. This could be some interesting
problems to think of.
96
Bibliography
[1] H. V. Poor and S. Verd u, A lower bound on the probability of error in
multihypothesis Testing, IEEE Trans. on Information Theory, vol. IT41,
no. 6, pp. 19921994, Nov. 1995.
[2] S. Verd u and T. S. Han, A general formula for channel capacity, IEEE
Trans. on Information Theory, vol. IT40, no. 4, pp. 11471157, Jul. 1994.
97
Chapter 6
Optimistic Shannon Coding Theorems
for Arbitrary Single-User Systems
The conventional denitions of the source coding rate and channel capacity re-
quire the existence of reliable codes for all suciently large blocklengths. Alterna-
tively, if it is required that good codes exist for innitely many blocklengths, then
optimistic denitions of source coding rate and channel capacity are obtained.
In this chapter, formulas for the optimistic minimum achievable xed-length
source coding rate and the minimum -achievable source coding rate for arbi-
trary nite-alphabet sources are established. The expressions for the optimistic
capacity and the optimistic -capacity of arbitrary single-user channels are also
provided. The expressions of the optimistic source coding rate and capacity are
examined for the class of information stable sources and channels, respectively.
Finally, examples for the computation of optimistic capacity are presented.
6.1 Motivations
The conventional denition of the minimum achievable xed-length source cod-
ing rate T(X) (or T
0
(X)) for a source X (cf. Denition 3.2) requires the exis-
tence of reliable source codes for all suciently large blocklengths. Alternatively,
if it is required that reliable codes exist for innitely many blocklengths, a new,
more optimistic denition of source coding rate (denoted by T(X)) is obtained
[11]. Similarly, the optimistic capacity

C is dened by requiring the existence
of reliable channel codes for innitely many blocklengths, as opposed to the
denition of the conventional channel capacity C [12, Denition 1].
This concept of optimistic source coding rate and capacity has recently been
investigated by Verd u et.al for arbitrary (not necessarily stationary, ergodic,
information stable, etc.) sources and single-user channels [11, 12]. More speci-
cally, they establish an additional operational characterization for the optimistic
98
minimum achievable source coding rate (T(X) for source X) by demonstrating
that for a given channel, the classical statement of the source-channel separation
theorem
1
holds for every channel if T(X) = T(X) [11]. In a dual fashion, they
also show that for channels with

C = C, the classical separation theorem holds
for every source. They also conjecture that T(X) and

C do not seem to admit
a simple expression.
In this chapter, we demonstrate that T(X) and

C do indeed have a general
formula. The key to these results is the application of the generalized sup-
information rate introduced in Chapter 2 to the existing proofs by Verd u and
Han [12] of the direct and converse parts of the conventional coding theorems.
We also provide a general expression for the optimistic minimum -achievable
source coding rate and the optimistic -capacity.
For the generalized sup/inf-information/entropy rates which will play a key
role in proving our optimistic coding theorems, readers may refer to Chapter 2.
6.2 Optimistic source coding theorems
In this section, we provide the optimistic source coding theorems. They are
shown based on two new bounds due to Han [7] on the error probability of
a source code as a function of its size. Interestingly, these bounds constitute
the natural counterparts of the upper bound provided by Feinsteins Lemma
and the Verd u-Han lower bound [12] to the error probability of a channel code.
Furthermore, we show that for information stable sources, the formula for T(X)
reduces to
T(X) = liminf
n
1
n
H(X
n
).
This is in contrast to the expression for T(X), which is known to be
T(X) = limsup
n
1
n
H(X
n
).
The above result leads us to observe that for sources that are both stationary and
information stable, the classical separation theorem is valid for every channel.
In [11], Vembu et.al characterize the sources for which the classical separation
theorem holds for every channel. They demonstrate that for a given source X,
1
By the classical statement of the source-channel separation theorem, we mean the fol-
lowing. Given a source X with (conventional) source coding rate T(X) and channel W
with capacity C, then X can be reliably transmitted over W if T(X) < C. Conversely, if
T(X) > C, then X cannot be reliably transmitted over W. By reliable transmissibility of
the source over the channel, we mean that there exits a sequence of joint source-channel codes
such that the decoding error probability vanishes as the blocklength n (cf. [11]).
99
the separation theorem holds for every channel if its optimistic minimum achiev-
able source coding rate (T(X)) coincides with its conventional (or pessimistic)
minimum achievable source coding rate (T(X)); i.e., if T(X) = T(X).
We herein establish a general formula for T(X). We prove that for any source
X,
T(X) = lim
1
H

(X).
We also provide the general expression for the optimistic minimum -achievable
source coding rate. We show these results based on two new bounds due to Han
(one upper bound and one lower bound) on the error probability of a source
code [7, Chapter 1]. The upper bound (Lemma 3.3) consists of the counterpart
of Feinsteins Lemma for channel codes, while the lower bound (Lemma 3.4)
consists of the counterpart of the Verd u-Han lower bound on the error probability
of a channel code ([12, Theorem 4]). As in the case of the channel coding bounds,
both source coding bounds (Lemmas 3.3 and 3.4) hold for arbitrary sources and
for arbitrary xed blocklength.
Denition 6.1 An (n, M) xed-length source code for X
n
is a collection of M
n-tuples (
n
= c
n
1
, . . . , c
n
M
. The error probability of the code is
P
e
( (
n
) Pr [X
n
, (
n
] .
Denition 6.2 (optimistic -achievable source coding rate) Fix 0 < <
1. R 0 is an optimistic -achievable rate if, for every > 0, there exists a
sequence of (n, M
n
) xed-length source codes (
n
such that
limsup
n
1
n
log M
n
R
and
liminf
n
P
e
( (
n
) .
The inmum of all -achievable source coding rates for source X is denoted
by T

(X). Also dene T(X) sup


0<<1
T

(X) = lim
0
T

(X) = T
0
(X) as
the optimistic source coding rate.
We can then use Lemmas 3.3 and 3.4 (in a similar fashion to the general source
coding theorem) to prove the general optimistic (xed-length) source coding
theorems.
Theorem 6.3 (optimistic minimum -achievable source coding rate for-
mula) For any source X,
T

(X)
_
lim
(1)
H

(X), for [0, 1);


0, for = 1.
100
We conclude this section by examining the expression of T(X) for infor-
mation stable sources. It is already known (cf. for example [11]) that for an
information stable source X,
T(X) = limsup
n
1
n
H(X
n
).
We herein prove a parallel expression for T(X).
Denition 6.4 (information stable sources [11]) A source X is said to be
information stable if H(X
n
) > 0 for n suciently large, and h
X
n(X
n
)/H(X
n
)
converges in probability to one as n , i.e.,
limsup
n
Pr
_

h
X
n(X
n
)
H(X
n
)
1

>
_
= 0 > 0,
where H(X
n
) = E[h
X
n(X
n
)] is the entropy of X
n
.
Lemma 6.5 Every information source X satises
T(X) = liminf
n
1
n
H(X
n
).
Proof:
1. [T(X) liminf
n
(1/n)H(X
n
)]
Fix > 0 arbitrarily small. Using the fact that h
X
n(X
n
) is a non-negative
bounded random variable for nite alphabet, we can write the normalized block
entropy as
1
n
H(X
n
) = E
_
1
n
h
X
n(X
n
)
_
= E
_
1
n
h
X
n(X
n
) 1
_
0
1
n
h
X
n(X
n
) lim
1
H

(X) +
__
+ E
_
1
n
h
X
n(X
n
) 1
_
1
n
h
X
n(X
n
) > lim
1
H

(X) +
__
. (6.2.1)
From the denition of lim
1
H

(X), it directly follows that the rst term in the


right hand side of (6.2.1) is upper bounded by lim
1
H

(X) + , and that the


liminf of the second term is zero. Thus
T(X) = lim
1
H

(X) liminf
n
1
n
H(X
n
).
2. [T(X) liminf
n
(1/n)H(X
n
)]
101
Fix > 0. For innitely many n,
Pr
_
h
X
n(X
n
)
H(X
n
)
1 >
_
= Pr
_
1
n
h
X
n(X
n
) > (1 + )
_
1
n
H(X
n
)
__
Pr
_
1
n
h
X
n(X
n
) > (1 + )
_
liminf
n
1
n
H(X
n
) +
__
.
Since X is information stable, we obtain that
liminf
n
Pr
_
1
n
h
X
n(X
n
) > (1 + )
_
liminf
n
1
n
H(X
n
) +
__
= 0.
By the denition of lim
1
H

(X), the above implies that


T(X) = lim
1
H

(X) (1 + )
_
liminf
n
1
n
H(X
n
) +
_
.
The proof is completed by noting that can be made arbitrarily small. 2
It is worth pointing out that if the source X is both information stable and
stationary, the above Lemma yields
T(X) = T(X) = lim
n
1
n
H(X
n
).
This implies that given a stationary and information stable source X, the clas-
sical separation theorem holds for every channel.
6.3 Optimistic channel coding theorems
In this section, we state without proving the general expressions for the opti-
mistic -capacity
2
(

C

) and for the optimistic capacity (



C) of arbitrary single-
user channels. The proofs of these expressions are straightforward once the right
denition (of

I

(X; Y )) is made. They employ Feinsteins Lemma and the Poor-


Verd u bound, and follow the same arguments used in Theorem 5.10 to show the
general expressions of the conventional channel capacity
C = sup
X
I
0
(X; Y ) = sup
X
I(X; Y ),
2
The authors would like to point out that the expression of

C

was also separately obtained


in [10, Theorem 7].
102
and the conventional -capacity
sup
X
lim

(X; Y ) C

sup
X
I

(X; Y ).
We close this section by proving the formula of

C for information stable channels.
Denition 6.6 (optimistic -achievable rate) Fix 0 < < 1. R 0 is an
optimistic -achievable rate if there exists a sequence of (
n
= (n, M
n
) channel
block codes such that
liminf
n
1
n
log M
n
R
and
liminf
n
P
e
( (
n
) .
Denition 6.7 (optimistic -capacity

C

) Fix 0 < < 1. The supremum of


optimistic -achievable rates is called the optimistic -capacity,

C

.
It is straightforward for the denition that C

is non-decreasing in , and
C
1
= log [A[.
Denition 6.8 (optimistic capacity

C) The optimistic channel capacity

C
is dened as the supremum of the rates that are -achievable for all [0, 1]. It
follows immediately from the denition that C = inf
01
C

= lim
0
C

= C
0
and that C is the supremum of all the rates R for which there exists a sequence
of (
n
= (n, M
n
) channel block codes such that
liminf
n
1
n
log M
n
R,
and
liminf
n
P
e
( (
n
) = 0.
Theorem 6.9 (optimistic -capacity formula) Fix 0 < < 1. The opti-
mistic -capacity

C

satises
sup
X
lim

(X; Y )

C

sup
X

(X; Y ). (6.3.1)
Note that actually

C

= sup
X

(X; Y ), except possibly at the points of discon-


tinuities of sup
X

I

(X; Y ) (which are countable).


Theorem 6.10 (optimistic capacity formula) The optimistic capacity

C
satises

C = sup
X

I
0
(X; Y ).
103
We next investigate the expression of

C for information stable channels.
The expression for the capacity of information stable channels is already known
(cf. for example [11])
C = liminf
n
C
n
.
We prove a dual formula for

C.
Denition 6.11 (Information stable channels [6, 8]) A channel W is said
to be information stable if there exists an input process X such that 0 < C
n
<
for n suciently large, and
limsup
n
Pr
_

i
X
n
W
n(X
n
; Y
n
)
nC
n
1

>
_
= 0
for every > 0.
Lemma 6.12 Every information stable channel W satises

C = limsup
n
sup
X
n
1
n
I(X
n
; Y
n
).
Proof:
1. [

C limsup
n
sup
X
n(1/n)I(X
n
; Y
n
)]
By using a similar argument as in the proof of [12, Theorem 8, property h)],
we have

I
0
(X; Y ) limsup
n
sup
X
n
1
n
I(X
n
; Y
n
).
Hence,

C = sup
X

I
0
(X; Y ) limsup
n
sup
X
n
1
n
I(X
n
; Y
n
).
2. [

C limsup
n
sup
X
n(1/n)I(X
n
; Y
n
)]
Suppose

X is the input process that makes the channel information stable.
Fix > 0. Then for innitely many n,
P

X
n
W
n
_
1
n
i

X
n
W
n(

X
n
; Y
n
) (1 )(limsup
n
C
n
)
_
P

X
n
W
n
_
i

X
n
W
n(

X
n
; Y
n
)
n
< (1 )C
n
_
= P

X
n
W
n
_
i

X
n
W
n(

X
n
; Y
n
)
nC
n
1 <
_
.
104
Since the channel is information stable, we get that
liminf
n
P

X
n
W
n
_
1
n
i

X
n
W
n(

X
n
; Y
n
) (1 )(limsup
n
C
n
)
_
= 0.
By the denition of

C, the above immediately implies that

C = sup
X

I
0
(X; Y )

I
0
(

X; Y ) (1 )(limsup
n
C
n
).
Finally, the proof is completed by noting that can be made arbitrarily small.
2
Observations:
It is known that for discrete memoryless channels, the optimistic capacity

C
is equal to the (conventional) capacity C [12, 5]. The same result holds for
moduloq additive noise channels with stationary ergodic noise. However,
in general,

C C since

I
0
(X; Y ) I(X; Y ) [3, 4].
Remark that Theorem 11 in [11] holds if, and only if,
sup
X
I(X; Y ) = sup
X

I
0
(X; Y ).
Furthermore, note that, if

C = C and there exists an input distribution
P

X
that achieves C, then P

X
also achieves

C.
6.4 Examples
We provide four examples to illustrate the computation of C and

C. The rst two
examples present information stable channels for which

C > C. The third exam-
ple shows an information unstable channel for which

C = C. These examples in-
dicate that information stability is neither necessary nor sucient to ensure that

C = C or thereby the validity of the classical source-channel separation theorem.


The last example illustrates the situation where 0 < C <

C < C
SC
< log
2
[[,
where C
SC
is the channel strong capacity
3
. We assume in this section that all
logarithms are in base 2 so that C and

C are measured in bits.
3
The strong (or strong converse) capacity C
SC
(cf. [2]) is dened as the inmum of the num-
bers R for which for all (n, M
n
) codes with (1/n) log M
n
R, liminf
n
P
e
( (
n
) = 1. This def-
inition of C
SC
implies that for any sequence of (n, M
n
) codes with liminf
n
(1/n) log M
n
>
C
SC
, P
e
( (
n
) > 1 for every > 0 and for n suciently large. It is shown in [2] that
C
SC
= lim
1

C

= sup
X

I(X; Y ).
105
6.4.1 Information stable channels
Example 6.13 Consider a nonstationary channel W such that at odd time in-
stances n = 1, 3, , W
n
is the product of the transition distribution of a binary
symmetric channel with crossover probability 1/8 (BSC(1/8)), and at even time
instances n = 2, 4, 6, , W
n
is the product of the distribution of a BSC(1/4). It
can be easily veried that this channel is information stable. Since the channel
is symmetric, a Bernoulli(1/2) input achieves C
n
= sup
X
n(1/n)I(X
n
; Y
n
); thus
C
n
=
_
1 h
b
(1/8), for n odd;
1 h
b
(1/4), for n even,
where h
b
(a) a log
2
a (1 a) log
2
(1 a) is the binary entropy function.
Therefore, C = liminf
n
C
n
= 1 h
b
(1/4) and

C = limsup
n
C
n
= 1
h
b
(1/8) > C.
Example 6.14 Here we use the information stable channel provided in [11,
Section III] to show that

C > C. Let ^ be the set of all positive integers.
Dene the set as
n ^ : 2
2i+1
n < 2
2i+2
, i = 0, 1, 2, . . .
= 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, , 63, 128, 129, , 255, .
Consider the following nonstationary symmetric channel W. At times n ,
W
n
is a BSC(0), whereas at times n , , W
n
is a BSC(1/2). Put W
n
=
W
1
W
2
W
n
. Here again C
n
is achieved by a Bernoulli(1/2) input

X
n
.
We then obtain
C
n
=
1
n
n

i=1
I(

X
i
; Y
i
) =
1
n
[J(n) 1 + (n J(n)) 0] =
J(n)
n
,
where J(n) [ 1, 2, , n[. It can be shown that
J(n)
n
=
_

_
1
2
3

2
]log
2
n|
n
+
1
3n
, for log
2
n| odd;
2
3

2
]log
2
n|
n

2
3n
, for log
2
n| even.
Consequently, C = liminf
n
C
n
= 1/3 and

C = limsup
n
C
n
= 2/3.
6.4.2 Information unstable channels
Example 6.15 The Polya-contagion channel: Consider a discrete additive cha-
nnel with binary input and output alphabet 0, 1 described by
Y
i
= X
i
Z
i
, i = 1, 2, ,
106
where X
i
, Y
i
and Z
i
are respectively the i-th input, i-th output and i-th noise, and
represents modulo-2 addition. Suppose that the input process is independent
of the noise process. Also assume that the noise sequence Z
n

n1
is drawn
according to the Polya contagion urn scheme [1, 9] as follows: an urn originally
contains R red balls and B black balls with R < B; the noise just make successive
draws from the urn; after each draw, it returns to the urn 1 + balls of the
same color as was just drawn ( > 0). The noise sequence Z
i
corresponds to
the outcomes of the draws from the Polya urn: Z
i
= 1 if ith ball drawn is red
and Z
i
= 0, otherwise. Let R/(R + B) and /(R + B). It is shown in
[1] that the noise process Z
i
is stationary and nonergodic; thus the channel is
information unstable.
From Lemma 2 and Section IV in [4, Part I], we obtain
1

H
1
(X) C

1 lim
(1)

(X),
and
1 H
1
(X)

C

1 lim
(1)
H

(X).
It has been shown [1] that (1/n) log P
X
n(X
n
) converges in distribution to the
continuous random variable V h
b
(U), where U is beta-distributed with pa-
rameters (/, (1 )/), and h
b
() is the binary entropy function. Thus

H
1
(X) = lim
(1)

(X) =H
1
(X) = lim
(1)
H

(X) = F
1
V
(1 ),
where F
V
(a) PrV a is the cumulative distribution function of V , and
F
1
V
() is its inverse [1]. Consequently,
C

=

C

= 1 F
1
V
(1 ),
and
C =

C = lim
0
_
1 F
1
V
(1 )

= 0.
Example 6.16 Let

W
1
,

W
2
, . . . consist of the channel in Example 6.14, and let

W
1
,

W
2
, . . . consist of the channel in Example 6.15. Dene a new channel W as
follows:
W
2i
=

W
i
and W
2i1
=

W
i
for i = 1, 2, .
As in the previous examples, the channel is symmetric, and a Bernoulli(1/2)
input maximizes the inf/sup information rates. Therefore for a Bernoulli(1/2)
107
input X, we have
Pr
_
1
n
log
P
W
n(Y
n
[X
n
)
P
Y
n(Y
n
)

_
=
_

_
Pr
_
1
2i
_
log
P

W
i (Y
i
[X
i
)
P
Y
i (Y
i
)
+ log
P

W
i
(Y
i
[X
i
)
P
Y
i (Y
i
)
_

_
,
if n = 2i;
Pr
_
1
2i + 1
_
log
P

W
i (Y
i
[X
i
)
P
Y
i (Y
i
)
+ log
P

W
i+1
(Y
i+1
[X
i+1
)
P
Y
i+1(Y
i+1
)
_

_
,
if n = 2i + 1;
=
_

_
1 Pr
_

1
i
log P
Z
i (Z
i
) < 1 2 +
1
i
J(i)
_
,
if n = 2i;
1 Pr
_

1
i + 1
log P
Z
i+1(Z
i+1
) < 1
_
2
1
i + 1
_
+
1
i + 1
J(i)
_
,
if n = 2i + 1.
The fact that (1/i) log[P
Z
i (Z
i
)] converges in distribution to the continuous ran-
dom variable V h
b
(U), where U is beta-distributed with parameters (/, (1
)/), and the fact that
liminf
n
1
n
J(n) =
1
3
and limsup
n
1
n
J(n) =
2
3
imply that
i() liminf
n
Pr
_
1
n
log
P
W
n(Y
n
[X
n
)
P
Y
n(Y
n
)

_
= 1 F
V
_
5
3
2
_
,
and

i() limsup
n
Pr
_
1
n
log
P
W
n(Y
n
[X
n
)
P
Y
n(Y
n
)

_
= 1 F
V
_
4
3
2
_
.
Consequently,

=
5
6

1
2
F
1
V
(1 ) and C

=
2
3

1
2
F
1
V
(1 ).
Thus
0 < C =
1
6
<

C =
1
3
< C
SC
=
5
6
< log
2
[[ = 1.
108
Bibliography
[1] F. Alajaji and T. Fuja, A communication channel modeled on contagion,
IEEE Trans. on Information Theory, vol. IT40, no. 6, pp. 20352041,
Nov. 1994.
[2] P.-N. Chen and F. Alajaji, Strong converse, feedback capacity and hypoth-
esis testing, Proc. of CISS, John Hopkins Univ., Baltimore, Mar. 1995.
[3] P.-N. Chen and F. Alajaji, Generalization of information measures,
Proc. Int. Symp. Inform. Theory & Applications, Victoria, Canada, Septem-
ber 1996.
[4] P.-N. Chen and F. Alajaji, Generalized source coding theorems and hy-
pothesis testing, Journal of the Chinese Institute of Engineering, vol. 21,
no. 3, pp. 283-303, May 1998.
[5] I. Csiszar and J. K orner, Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic, New York, 1981.
[6] R. L. Dobrushin, General formulation of Shannons basic theorems of infor-
mation theory, AMS Translations, vol. 33, pp. 323-438, AMS, Providence,
RI, 1963.
[7] T. S. Han, Information-Spectrum Methods in Information Theory, (in
Japanese), Baifukan Press, Tokyo, 1998.
[8] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, 1964.
[9] G. Polya, Sur quelques points de la theorie des probabilites, Ann. Inst.
H. Poincarre, vol. 1, pp. 117-161, 1931.
[10] Y. Steinberg, New converses in the theory of identication via channels,
IEEE Trans. on Information Theory, vol. 44, no. 3, pp. 984998, May 1998.
109
[11] S. Vembu, S. Verd u and Y. Steinberg, The source-channel separation theo-
rem revisited, IEEE Trans. on Information Theory, vol. 41, no. 1, pp. 44
54, Jan. 1995.
[12] S. Verd u and T. S. Han, A general formula for channel capacity, IEEE
Trans. on Information Theory, vol. IT40, no. 4, pp. 11471157, Jul. 1994.
110
Chapter 7
Lossy Data Compression
In this chapter, a rate-distortion theorem for arbitrary (not necessarily stationary
or ergodic) discrete-time nite-alphabet sources is shown. The expression of the
minimum -achievable xed-length coding rate subject to a delity criterion will
also be provided.
7.1 General lossy source compression for block codes
In this section, we consider the problem of source coding with a delity criterion
for arbitrary (not necessarily stationary or ergodic) discrete-time nite-alphabet
sources. We prove a general rate-distortion theorem by establishing the expres-
sion of the minimum -achievable block coding rate subject to a delity criterion.
We will relax all the constraints on the source statistics and distortion measure.
In other words, the source is not restricted to DMS, and the distortion measure
can be arbitrary.
Denition 7.1 (lossy compression block code) Given a nite source alpha-
bet Z and a nite reproduction alphabet

Z, a block code for data compression
of blocklength n and size M is a mapping f
n
() : Z
n


Z
n
that results
in |f
n
| = M codewords of length n, where each codeword is a sequence of n
reproducing letters.
Denition 7.2 (distortion measure) A distortion measure
n
(, ) is a map-
ping

n
: Z
n


Z
n
1
+
[0, ).
We can view the distortion measure as the cost of representing a source n-tuple
z
n
by a reproduction n-tuple f
n
(z
n
).
111
Similar to the general results for entropy rate and information rate, the av-
erage distortion measure no longer has nice properties, such as convergence-in-
probability, in general. Instead, its ultimate probability may span over some
regions. Hence, in order to derive the fundamental limit for lossy data compres-
sion code rate, a dierent technique should be applied. This technique, which
employing the spectrum of the normalized distortion density dened later, is in
fact the same as that for general lossless source coding theorem and general
channel coding theorem.
Denition 7.3 (distortion inf-spectrum) Let (Z,

Z) and
n
(, )
n1
be
given. The distortion inf-spectrum
Z,

Z
() is dened by

Z,

Z
() liminf
n
Pr
_
1
n

n
(Z
n
,

Z
n
)
_
.
Denition 7.4 (distortion inf-spectrum for lossy compression code f)
Let Z and
n
(, )
n1
be given. Let f() f
n
()

n=1
denote a sequence of
(lossy) data compression codes. The distortion inf-spectrum
Z,f(Z)
() for f()
is dened by

Z,f(Z)
() liminf
n
Pr
_
1
n

n
(Z
n
, f
n
(Z
n
))
_
.
Denition 7.5 Fix D > 0 and 1 > > 0. R is a -achievable data compression
rate at distortion D for a source Z if there exists a sequence of data compression
codes f
n
() with
limsup
n
1
n
log |f
n
| R,
and
sup
_
:
Z,f(Z)
()

D. (7.1.1)
Note that (7.1.1), which can be re-written as:
inf
_
: limsup
n
Pr
_
1
n

n
(Z
n
, f
n
(Z
n
)) >
_
< 1
_
D,
is equivalent to stating that the limsup of the probability of excessive distortion
(i.e., distortion larger than D) is smaller than 1 .
The inmum -achievable data compression rate at distortion D for Z is
denoted by T

(D, Z).
112
Theorem 7.6 (general data compression theorem) Fix D > 0 and 1 >
> 0. Let Z and
n
(, )
n1
be given. Then
R

(D) T

(D, Z) R

(D ),
for any > 0, where
R

(D) inf
_
P

Z[Z
: sup
_
:
Z,

Z
()
_
D
_

I(Z;

Z),
where the inmum is taken over all conditional distributions P
Z[

Z
for which the
joint distribution P
Z,

Z
= P
Z
P
Z[

Z
satises the distortion constraint.
Proof:
1. Forward part (achievability): T

(D, Z) R

(D) or T

(D+, Z) R

(D).
Choose > 0. We will prove the existence of a sequence of data compression
codes with
limsup
n
1
n
log |f
n
| R

(D) + 2,
and
sup
_
:
Z,f(Z)
()

D + .
step 1: Let P

Z[Z
be the distribution achieving R

(D).
step 2: Let R = R

(D) + 2. Choose M = e
nR
codewords of blocklength n
independently according to P

Z
n, where
P

Z
n( z
n
) =

z
n
?
n
P
Z
n(z
n
)P

Z
n
[Z
n( z
n
[z
n
),
and denote the resulting random set by (
n
.
step 3: For a given (
n
, we denote by A((
n
) the set of sequences z
n
Z
n
such
that there exists z
n
(
n
with
1
n

n
(z
n
, z
n
) D + .
step 4: Claim:
limsup
n
E

Z
n[P
Z
n(A
c
((
n
))] < .
The proof of this claim will be provided as a lemma following this theorem.
Therefore there exists (a sequence of) (

n
such that
limsup
n
P
Z
n(A
c
((

n
)) < .
113
step 5: Dene a sequence of codes f
n

n1
by
f
n
(z
n
) =
_
arg min
z
n
(

n
(z
n
, z
n
), if z
n
A((

n
);
0, otherwise,
where 0 is a xed default n-tuple in

Z
n
.
Then
_
z
n
Z
n
:
1
n

n
(z
n
, f
n
(z
n
)) D +
_
A((

n
),
since ( z
n
A((

n
)) there exists z
n
(

n
such that (1/n)
n
(z
n
, z
n
) D+,
which by denition of f
n
implies that (1/n)
n
(z
n
, f
n
(z
n
)) D + .
step 6: Consequently,

(Z,f(Z))
(D + ) = liminf
n
P
Z
n
_
z
n
Z
n
:
1
n

n
(z
n
, f(z
n
)) D +
_
liminf
n
P
Z
n(A((

n
))
= 1 limsup
n
P
Z
n(A
c
((

n
))
> .
Hence,
sup
_
:
Z,f(Z)
()

D + ,
where the last step is clearly depicted in Figure 7.1.
This proves the forward part.
2. Converse part: T

(D, Z) R

(D). We show that for any sequence of encoders


f
n
()

n=1
, if
sup
_
:
Z,f(Z)
()

D,
then
limsup
n
1
n
log |f
n
| R

(D).
Let
P

Z
n
[Z
n( z
n
[z
n
)
_
1, if z
n
= f
n
(z
n
);
0, otherwise.
Then to evaluate the statistical properties of the random sequence
(1/n)
n
(Z
n
, f
n
(Z
n
)

n=1
114
-
6
D +

Z,f(Z)
(D + )

Z,f(Z)
()

sup
_
:
Z,f(Z)
()

Figure 7.1:
Z,f (Z)
(D + ) > sup[ :
Z,f (Z)
() ] D + .
under distribution P
Z
n is equivalent to evaluating those of the random sequence
(1/n)
n
(Z
n
,

Z
n
)

n=1
under distribution P
Z
n
,

Z
n. Therefore,
R

(D) inf
P

Z[Z
: sup[ :
Z,

Z
() ] D

I(Z;

Z)


I(Z;

Z)


H(

Z) H(

Z[Z)


H(

Z)
limsup
n
1
n
log |f
n
|,
where the second inequality follows from (2.4.3) (with 1 and = 0), and the
third inequality follows from the fact that H(

Z[Z) 0. 2
Lemma 7.7 (cf. Proof of Theorem 7.6)
limsup
n
E

Z
n[P
Z
n(A
c
((

n
))] < .
Proof:
step 1: Let
D
()
sup
_
:
Z,

Z
()
_
.
115
Dene
A
()
n,

_
(z
n
, z
n
) :
1
n

n
(z
n
, z
n
) D
()
+
and
1
n
i
Z
n
,

Z
n(z
n
, z
n
)

I(Z;

Z) +
_
.
Since
liminf
n
Pr
_
T
_
1
n

n
(Z
n
,

Z
n
) D
()
+
__
> ,
and
liminf
n
Pr
_
c
_
1
n
i
Z
n
,

Z
n(Z
n
;

Z
n
)

I(Z;

Z) +
__
= 1,
we have
liminf
n
Pr(A
()
n,
) = liminf
n
Pr(T c)
liminf
n
Pr(T) + liminf
n
Pr(c) 1
> + 1 1 = .
step 2: Let K(z
n
, z
n
) be the indicator function of A
()
n,
:
K(z
n
, z
n
) =
_
1, if (z
n
, z
n
) A
()
n,
;
0, otherwise.
step 3: By following a similar argument in step 4 of the achievability part of
Theorem 8.3 of Volume I, we obtain
E

Z
n[P
Z
n(A
c
((

n
))] =

n
P

Z
n((

n
)

z
n
,A((

n
)
P
Z
n(z
n
)
=

z
n
?
n
P
Z
n(z
n
)

n
:z
n
,A((

n
)
P

Z
n((

n
)
=

z
n
?
n
P
Z
n(z
n
)
_
_
1

z
n


?
n
P

Z
n
(y
n
)K(z
n
, z
n
)
_
_
M

z
n
?
n
P
Z
n(z
n
)
_
1 e
n(

I(Z;

Z)+)

z
n


?
n
P

Z
n
[Z
n( z
n
[z
n
)K(z
n
, z
n
)
_
_
M
1

z
n
?
n

z
n


?
n
P
Z
n(z
n
)P

Z
n
[Z
n(z
n
, z
n
)K(z
n
, z
n
)
+exp
_
e
n(RR(D))
_
.
116
Therefore,
limsup
n
E

Z
n[P
Z
n(A
n
((

n
))] 1 liminf
n
Pr(A
()
n,
)
< 1 .
2
For the probability-of-error distortion measure
n
: Z
n
Z
n
, namely,

n
(z
n
, z
n
) =
_
n, if z
n
,= z
n
;
0, otherwise,
we dene a data compression code f
n
: Z
n
Z
n
based on a chosen (asymptotic)
lossless xed-length data compression code book (
n
Z
n
:
f
n
(z
n
) =
_
z
n
, if z
n
(
n
;
0, if z
n
, (
n
,
where 0 is some default element in Z
n
. Then (1/n)
n
(z
n
, f
n
(z
n
)) is either 1 or
0 which results in a cumulative distribution function as shown in Figure 7.2.
Consequently, for any [0, 1),
Pr
_
1
n

n
(Z
n
, f
n
(Z
n
))
_
= Pr Z
n
= f
n
(Z
n
) .
- c
c s
s
0 1 D
1
PrZ
n
= f
n
(Z
n
)
Figure 7.2: The CDF of (1/n)
n
(Z
n
, f
n
(Z
n
)) for the probability-of-
error distortion measure.
117
By comparing the (asymptotic) lossless and lossy xed-length compression
theorems under the probability-of-error distortion measure, we observe that
R

() = inf
P

Z[Z
: sup[ :
Z,

Z
() ]

I(Z;

Z)
=
_

_
0, 1;
inf
P

Z[Z
: liminf
n
PrZ
n
=

Z
n
>

I(Z;

Z), < 1,
=
_

_
0, 1;
inf
P

Z[Z
: limsup
n
PrZ
n
,=

Z
n
1

I(Z;

Z), < 1,
where

Z,

Z
() =
_
liminf
n
Pr
_
Z
n
=

Z
n
_
, 0 < 1;
1, 1.
In particular, in the extreme case where goes to one,

H(Z) = inf
P

Z[Z
: limsup
n
PrZ
n
,=

Z
n
= 0

I(Z;

Z).
Therefore, in this case, the data compression theorem reduces (as expected) to
the asymptotic lossless xed-length data compression theorem.
118
Chapter 8
Hypothesis Testing
8.1 Error exponent and divergence
Divergence can be adopted as a measure of how similar two distributions are.
This quantity has the operational meaning that it is the exponent of the type II
error probability for hypothesis testing system of xed test level. For rigorous-
ness, the exponent is rst dened below.
Denition 8.1 (exponent) A real number a is said to be the exponent for a
sequence of non-negative quantities a
n

n1
converging to zero, if
a = lim
n
_

1
n
log a
n
_
.
In operation, exponent is an index of the exponential rate-of-convergence for
sequence a
n
. We can say that for any > 0,
e
n(a+)
a
n
e
n(a)
,
as n large enough.
Recall that in proving the channel coding theorem, the probability of decod-
ing error for channel block codes can be made arbitrarily close to zero when the
rate of the codes is less than channel capacity. This result can be mathematically
written as:
P
e
( (

n
) 0, as n ,
provided R = limsup
n
(1/n) log | (

n
| < C, where (

n
is the optimal code for
block length n. From the theorem, we only know the decoding error will vanish
as block length increases; but, it does not reveal that how fast the decoding error
approaches zero. In other words, we do not know the rate-of-convergence of the
119
decoding error. Sometimes, this information is very important, especially for
one to decide the sucient block length to achieve some error bound.
The rst step of investigating the rate-of-convergence of the decoding error
is to compute its exponent, if the decoding error decays to zero exponentially
fast (it indeed does for memoryless channels.) This exponent, as a function of
the rate, is in fact called the channel reliability function, and will be discussed
in the next chapter.
For the hypothesis testing problems, the type II error probability of xed test
level also decays to zero as the number of observations increases. As it turns
out, its exponent is the divergence of the null hypothesis distribution against
alternative hypothesis distribution.
Lemma 8.2 (Steins lemma) For a sequence of i.i.d. observations X
n
which
is possibly drawn from either null hypothesis distribution P
X
n or alternative
hypothesis distribution P

X
n
, the type II error satises
( (0, 1)) lim
n

1
n
log

n
() = D(P
X
|P

X
),
where

n
() = min
n

n
, and
n
and
n
represent the type I and type II errors,
respectively.
Proof: [1. Forward Part]
In the forward part, we prove that there exists an acceptance region for null
hypothesis such that
liminf
n

1
n
log
n
() D(P
X
|P

X
).
step 1: divergence typical set. For any > 0, dene divergence typical set
as
/
n
()
_
x
n
:

1
n
log
P
X
n(x
n
)
P

X
n
(x
n
)
D(P
X
|P

X
)

<
_
.
Note that in divergence typical set,
P

X
n
(x
n
) P
X
n(x
n
)e
n(D(P
X
|P

X
))
.
step 2: computation of type I error. By weak law of large number,
P
X
n(/
n
()) 1.
Hence,

n
= P
X
n(/
c
n
()) < ,
for suciently large n.
120
step 3: computation of type II error.

n
() = P

X
n
(/
n
())
=

x
n
,n()
P

X
n
(x
n
)

x
n
,n()
P
X
n(x
n
)e
n(D(P
X
|P

X
))
e
n(D(P
X
|P

X
))

x
n
,n()
P
X
n(x
n
)
e
n(D(P
X
|P

X
))
(1
n
).
Hence,

1
n
log
n
() D(P
X
|P

X
) +
1
n
log(1
n
),
which implies
liminf
n

1
n
log
n
() D(P
X
|P

X
) .
The above inequality is true for any > 0; therefore
liminf
n

1
n
log
n
() D(P
X
|P

X
).
[2. Converse Part]
In the converse part, we will prove that for any acceptance region for null
hypothesis B
n
satisfying the type I error constraint, i.e.,

n
(B
n
) = P
X
n(B
c
n
) ,
its type II error
n
(B
n
) satises
limsup
n

1
n
log
n
(B
n
) D(P
X
|P

X
).

n
(B
n
) = P

X
n
(B
n
) P

X
n
(B
n
/
n
())

x
n
Bn,n()
P

X
n
(x
n
)

x
n
Bn,n()
P
X
n(x
n
)e
n(D(P
X
|P

X
)+)
= e
n(D(P
X
|P

X
)+)
P
X
n(B
n
/
n
())
e
n(D(P
X
|P

X
)+)
[1 P
X
n(B
c
n
) P
X
n (/
c
n
())]
e
n(D(P
X
|P

X
)+)
[1
n
(B
n
) P
X
n (/
c
n
())]
e
n(D(P
X
|P

X
)+)
[1 P
X
n (/
c
n
())] .
121
Hence,

1
n
log
n
(B
n
) D(P
X
|P

X
) + +
1
n
log [1 P
X
n (/
c
n
())] ,
which implies that
limsup
n

1
n
log
n
(B
n
) D(P
X
|P

X
) + .
The above inequality is true for any > 0; therefore,
limsup
n

1
n
log
n
(B
n
) D(P
X
|P

X
).
2
8.1.1 Composition of sequence of i.i.d. observations
Steins lemma gives the exponent of the type II error probability for xed test
level. As a result, this exponent, which is the divergence of null hypothesis
distribution against alternative hypothesis distribution, is independent of the
type I error bound for i.i.d. observations.
Specically under i.i.d. environment, the probability for each sequence of x
n
depends only on its composition, which is dened as an [A[-dimensional vector,
and is of the form
_
#1(x
n
)
n
,
#2(x
n
)
n
, . . . ,
#k(x
n
)
n
_
,
where A = 1, 2, . . . , k, and #i(x
n
) is the number of occurrences of symbol i
in x
n
. The probability of x
n
is therefore can be written as
P
X
n(x
n
) = P
X
(1)
#1(x
n
)
P
X
(2)
#2(x
n
)
P
X
(k)
#k(x
n
)
.
Note that #1(x
n
) + + #k(x
n
) = n. Since the composition of a sequence
decides its probability deterministically, all sequences with the same composition
should have the same statistical property, and hence should be treated the same
when processing. Instead of manipulating the sequences of observations based
on the typical-set-like concept, we may focus on their compositions. As it turns
out, such approach yields simpler proofs and better geometrical explanations for
theories under i.i.d. environment. (It needs to be pointed out that for cases when
composition alone can not decide the probability, this viewpoint does not seem
to be eective.)
Lemma 8.3 (polynomial bound on number of composition) The num-
ber of compositions increases polynomially fast, while the number of possible
sequences increases exponentially fast.
122
Proof: Let T
n
denotes the set of all possible compositions for x
n
A
n
. Then
since each numerator of the components in [A[-dimensional composition vector
ranges from 0 to n, it is obvious that [T
n
[ (n + 1)
[.[
which increases polyno-
mially with n. However, the number of possible sequences is [A[
n
which is of
exponential order w.r.t. n. 2
Each composition (or [A[-dimensional vector) actually represents a possible
probability mass distribution. In terms of the composition distribution, we can
compute the exponent of the probability of those observations that belong to
the same composition.
Lemma 8.4 (probability of sequences of the same composition) The
probability of the sequences of composition ( with respect to distribution P
X
n
satises
1
(n + 1)
[.[
e
nD(P
C
|P
X
)
P
X
n(() e
nD(P
C
|P
X
)
,
where P
(
is the composition distribution for composition (, and ( (by abusing
notation without ambiguity) is also used to represent the set of all sequences (in
A
n
) of composition (.
Proof: For any sequence x
n
of composition
( =
_
#1(x
n
)
n
,
#2(x
n
)
n
, . . . ,
#k(x
n
)
n
_
= (P
(
(1), P
(
(2), . . . , P
(
(k)) ,
where k = [A[,
P
X
n(x
n
) =
k

i=1
P
X
(i)
#i(x
n
)
=
k

x=1
P
X
(x)
nP
C
(x)
=
k

x=1
e
nP
C
(x) log P
X
(x)
= e
n
P
xX
P
C
(x) log P
X
(x)
= e
n(
P
xX
P
C
(x) log[P
C
(x)/P
X
(x)]P
C
(x) log P
C
(x))
= e
n(D(P
C
|P
X
)+H(P
C
))
.
Similarly, we have
P
(
(x
n
) = e
n[D(P
C
|P
C
)+H(P
C
)]
= e
nH(P
C
)
. (8.1.1)
123
(P
(
originally represents a one-dimensional distribution over A. Here, without
ambiguity, we re-use the same notation of P
(
() to represent its n-dimensional
extension, i.e., P
(
(x
n
) =

n
i=1
P
(
(x
i
).) We can therefore bound the number of
sequence in ( as
1

x
n
(
P
(
(x
n
)
=

x
n
(
e
nH(P
C
)
= [([e
nH(P
C
)
,
which implies [([ e
nH(P
C
)
. Consequently,
P
X
n(() =

x
n
(
P
X
n(x
n
)
=

x
n
(
e
n(D(P
C
|P
X
)+H(P
C
))
= [([e
n(D(P
C
|P
X
)+H(P
C
))
e
nH(P
C
)
e
n(D(P
C
|P
X
)+H(P
C
))
= e
nD(P
C
|P
X
)
. (8.1.2)
This ends the proof of the upper bound.
From (8.1.2), we observe that to derive a lower bound of P
X
n(() is equiva-
lent to nd a lower bound of [([, which requires the next inequality. For any
composition C
t
, and its corresponding composition set (
t
,
P
(
(()
P
(
((
t
)
=
[([
[(
t
[

i.
P
(
(i)
nP
C
(i)

i.
P
(
(i)
nP
C
(i)
=
n!

j.
(n P
(
(j))!
n!

j.
(n P
(
(j))!

i.
P
(
(i)
nP
C
(i)

i.
P
(
(i)
nP
C
(i)
=

j.
(n P
(
(j))!
(n P
(
(j))!

i.
P
(
(i)
nP
C
(i)

i.
P
(
(i)
nP
C
(i)

j.
(n P
(
(j))
nP
C
(j)nP
C
(j)

i.
P
(
(i)
nP
C
(i)nP
C
(i)
using the inequality
m!
n!

n
m
n
n
;
= n
n
P
jX
(P
C
(j)P
C
(j))
= n
0
= 1. (8.1.3)
124
Hence,
1 =

P
(
((
t
)

max
(

P
(
((
tt
)
=

P
(
(() (8.1.4)
(n + 1)
[.[
P
(
(()
= (n + 1)
[.[
[([e
nH(P
C
)
, (8.1.5)
where (8.1.4) and (8.1.5) follow from (8.1.3) and (8.1.1), respectively. Therefore,
a lower bound of [([ is obtained as:
[([
1
(n + 1)
[.[
e
nH(P
C
)
.
2
Theorem 8.5 (Sanovs Theorem) Let c
n
be the set that consists of all com-
positions over nite alphabet A, whose composition distribution belongs to T.
Fix a sequence of product distribution P
X
n =

n
i=1
P
X
. Then,
liminf
n

1
n
log P
X
n(c
n
) inf
P
C
1
D(P|P
X
),
where P
(
is the composition distribution for composition (. If, in addition, for
every distribution P in T, there exists a sequence of composition distributions
P
(
1
, P
(
2
, P
(
3
, . . . T such that limsup
n
D(P
(n
|P
X
) = D(P|P
X
), then
limsup
n

1
n
log P
X
n(c
n
) inf
P1
D(P
(
|P
X
),
Proof:
P
X
n(c
n
) =

Ccn
P
X
n(()

Ccn
e
nD(P
C
|P
X
)

Ccn
max
Ccn
e
nD(P
C
|P
X
)
=

Ccn
e
nmin
CEn
D(P
C
|P
X
)
(n + 1)
[.[
e
nmin
CEn
D(P
C
|P
X
)
.
125
Let (

c
n
satisfying that
D(P
(
|P
X
) = min
(cn
D(P
(
|P
X
)
for which its existence can be substantiated by the fact that the number of
compositions in c
n
is nite.
P
X
n(c
n
) =

(cn
P
X
n(()

(cn
1
(n + 1)
[.[
e
nD(P
C
|P
X
)

1
(n + 1)
[.[
e
nD(P
C
|P
X
)
.
Therefore,
1
(n + 1)
[.[
e
nmin
CEn
D(P
C
|P
X
)
P
X
n(c
n
) (n + 1)
[.[
e
nmin
CEn
D(P
C
|P
X
)
,
which immediately gives
liminf
n

1
n
log P
X
n(c
n
) liminf
n
min
(cn
D(P
(
|P
X
) inf
P
C
1
D(P|P
X
),
Now suppose that for P T, there exists a sequence of composition dis-
tributions P
(
1
, P
(
2
, P
(
3
, . . . T such that limsup
n
D(P
(n
|P
X
) = D(P|P
X
),
then

1
n
log P
X
n(c
n
) min
(cn
D(P
(
|P
X
) D(P
(n
|P
X
),
which implies that
limsup
n

1
n
log P
X
n(c
n
) limsup
n
D(P
(n
|P
X
) = D(P|P
X
).
As the above inequality is true for every P in T,
limsup
n

1
n
log P
X
n(c
n
) inf
P1
D(P|P
X
).
2
The geometrical explanation for Sanovs theorem is depicted in Figure 8.1.
In the gure, it shows that if we adopt the divergence as the measure of distance
in distribution space (the outer triangle region in Figure 8.1), and T is a set of
distributions that can be approached by composition distributions, then for any
distribution P
X
outside T, we can measure the shortest distance from P
X
to T
in terms of the divergence measure. This shortest distance, as it turns out, is
indeed the exponent of P
X
(c
n
). The following examples show that the viewpoint
from Sanov is really eective, especially for i.i.d. observations.
126

@
@
@
@
@
@
@
@
@
@
@
@
T
u
P
X
u
P
1
u
P
2
C
C
C
C
C
C

inf
P1
D(P|P
X
)
Figure 8.1: The geometric meaning for Sanovs theorem.
Example 8.6 One wants to roughly estimate the probability that the average of
the throws is greater or equal to 4, when tossing a fair dice n times. He observes
that whether the requirement is satised only depends on the compositions of the
observations. Let c
n
be the set of compositions which satisfy the requirement.
Also let P
X
be the probability for tossing a fair dice. Then c
n
can be written as
c
n
=
_
( :
6

i=1
iP
(
(i) 4
_
.
To minimize D(P
(
|P
X
) for ( c
n
, we can use the Lagrange multiplier tech-
nique (since divergence is convex with respect to the rst argument.) with the
constraints on P
(
being:
6

i=1
iP
(
(i) = k and
6

i=1
P
(
(i) = 1
for k = 4, 5, 6, . . . , n. So it becomes to minimize:
6

i=1
P
(
(i) log
P
(
(i)
P
X
(i)
+
1
_
6

i=1
iP
(
(i) k
_
+
2
_
6

i=1
P
(
(i) 1
_
.
By taking the derivatives, we found that the minimizer should be of the form
P
(
(i) =
e

1
i

6
j=1
e

1
j
,
for
1
is chosen to satisfy
6

i=1
iP
(
(i) = k. (8.1.6)
127
Since the above is true for all k 4, it suces to take the smallest one as
our solution, i.e., k = 4. Finally, by solving (8.1.6) for k = 4 numerically, the
minimizer should be
P
(
= (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468),
and the exponent of the desired probability is D(P
(
|P
X
) = 0.0433 nat. Conse-
quently,
P
X
n(c
n
) e
0.0433 n
.
8.1.2 Divergence typical set on composition
In Steins lemma, one of the most important steps is to dene the divergence
typical set, which is
/
n
()
_
x
n
:

1
n
log
P
X
n(x
n
)
P

X
n
(x
n
)
D(P
X
|P

X
)

<
_
.
One of the key feature of the divergence typical set is that its probability eventu-
ally tends to 1 with respect to P
X
n. In terms of the compositions, we can dene
another divergence typical set with the same feature as
T
n
() x
n
A
n
: D(P
(
x
n
|P
X
) ,
where (
x
n represents the composition of x
n
. The above typical set is re-written
in terms of composition set as:
T
n
() ( : D(P
(
|P
X
) .
P
X
n(T
n
()) 1 is justied by
1 P
X
n(T
n
()) =

C : D(P
C
|P
X
)>
P
X
n(()

C : D(P
C
|P
X
)>
e
nD(P
C
|P
X
)
, from Lemma 8.4.

C : D(P
C
|P
X
)>
e
n
(n + 1)
[.[
e
n
, cf. Lemma 8.3.
8.1.3 Universal source coding on compositions
It is sometimes not possible to know the true source distribution. In such case,
instead of nding the best lossless data compression code for a specic source
128
distribution, one may turn to design a code which is good for all possible candi-
date distributions (of some specic class.) Here, good means that the average
codeword length achieves the (unknown) source entropy for all sources with the
distributions in the considered class.
To be more precise, let
f
n
: A
n

_
i=1
0, 1
i
be a xed encoding function. Then for any possible source distribution P
X
n in
the considered class, the resultant average codeword length for this specic code
satises
1
n

i=1
P
X
n(x
n
)(f
n
(x
n
)) H(X),
as n goes to innity, where () represents the length of f
n
(x
n
). Then f
n
is said
to be a universal code for the source distributions in the considered class.
Example 8.7 (universal encoding using composition) First, binary-index
the compositions using log
2
(n + 1)
[.[
bits, and denote this binary index for
composition ( by a((). Note that the number of compositions is at most (n +
1)
[.[
.
Let (
x
n denote the composition with respect to x
n
, i.e. x
n
(
x
n.
For each composition (, we know that the number of sequence x
n
in ( is
at most 2
nH(P
C
)
(Here, H(P
(
) is measured in bits. I.e., the logarithmic base
in entropy computation is 2. See the proof of Lemma 8.4). Hence, we can
binary-index the elements in ( using n H(P
(
) bits. Denote this binary index
for elements in ( by b((). Dene a universal encoding function f
n
as
f
n
(x
n
) = concatenationa((
x
n), b((
x
n).
Then this encoding rule is a universal code for all i.i.d. sources.
Proof: All the logarithmic operations in entropy and divergence are taken to be
base 2 throughout the proof.
For any i.i.d. distribution with marginal P
X
, its associated average codeword
length is

n
=

x
n
.
n
P
X
n(x
n
)(a((
x
n)) +

x
n
.
n
P
X
n(x
n
)(b((
x
n))

x
n
.
n
P
X
n(x
n
) log
2
(n + 1)
[.[
+

x
n
.
n
P
X
n(x
n
) n H(P
(
x
n
)
[A[ log
2
(n + 1) +

(
P
X
n(() n H(P
(
).
129
Hence, the normalized average code length is upper-bounded by
1
n

n

[A[ log
2
(n + 1)
n
+

(
P
X
n(()H(P
(
).
Since the rst term of the right hand side of the above inequality vanishes as n
tends to innity, only the second term has contribution to the ultimate normal-
ized average code length, which in turn can be bounded above using typical set
T
n
() as:

(
P
X
n(()H(P
(
)
=

(Tn()
P
X
n(()H(P
(
) +

(,Tn()
P
X
n(()H(P
(
)
max
( : D(P
C
|P
X
)/ log(2)
H(P
(
) +

( : D(P
C
|P
X
)>/ log(2)
P
X
n(()H(P
(
)
max
( : D(P
C
|P
X
)/ log(2)
H(P
(
) +

( : D(P
C
|P
X
)>/ log(2)
2
nD(P
C
|P
X
)
H(P
(
),
(From Lemma 8.4)
max
( : D(P
C
|P
X
)/ log(2)
H(P
(
) +

( : D(P
C
|P
X
)>/ log(2)
e
n
H(P
(
)
max
( : D(P
C
|P
X
)/ log(2)
H(P
(
) +

( : D(P
C
|P
X
)>/ log(2)
e
n
log
2
[A[
max
( : D(P
C
|P
X
)/ log(2)
H(P
(
) + (n + 1)
[.[
e
n
log
2
[A[,
where the second term of the last step vanishes as n . (Note that when
base-2 logarithm is taken in divergence instead of natural logarithm, the range
[0, ] in T
n
() should be replaced by [0, / log(2)].) It remains to show that
max
( : D(P
C
|P
X
)/ log(2)
H(P
(
) H(X) + (),
where () only depends on , and approaches zero as 0.
Since D(P
(
|P
X
) / log(2),
|P
(
P
X
|
_
2 log(2) D(P
(
|P
X
)

2,
where |P
(
P
X
| represents the variational distance of P
(
against P
X
, which
implies that for all x A,
[P
(
(x) P
X
(x)[

x.
[P
(
(x) P
X
(x)[
= |P
(
P
X
|

2.
130
Fix 0 < < min min
x : P
X
(x)>0
P
X
(x), 2/e
2
. Choose =
2
/8. Then for
all x A, [P
(
(x) P
X
(x)[ /2.
For those x satisfying P
X
(x) , the following is true. From mean value
theorem, namely,
[t log t s log s[

max
u[t,s][s,t]
d(u log u)
du

[t s[,
we have
[P
(
(x) log P
(
(x) P
X
(x) log P
X
(x)[

max
u[P
C
(x),P
X
(x)][P
X
(x),P
C
(x)]
d(u log u)
du

[P
(
(x) P
X
(x)[

max
u[P
C
(x),P
X
(x)][P
X
(x),P
C
(x)]
d(u log u)
du

max
u[/2,1]
d(u log u)
du

2
(Since P
X
(x) + /2 P
(
(x) P
X
(x) /2 /2 = /2
and P
X
(x) > /2.)
=

2
[1 + log(/2)[ ,
where the last step follows from < 2/e
2
.
For those x satisfying P
X
(x) = 0, P
(
(x) < /2. Hence,
[P
(
(x) log P
(
(x) P
X
(x) log P
X
(x)[ = [P
(
(x) log P
(
(x)[

2
log

2
,
since 2/e
2
< 2/e. Consequently,
[H(P
(
) H(X)[
1
log(2)

x.
[P
(
(x) log P
(
(x) P
X
(x) log P
X
(x)[

[A[
log(2)
max
_

2
log

2
,
[1 + log(/2)[
2
_

[A[
log(2)
max
_

2
log(2),

1 +
1
2
log(2)

_
().
2
131
8.1.4 Likelihood ratio versus divergence
Recall that the Neyman-Pearson lemma indicates that the optimal test for two
hypothesis (H
0
: P
X
n against H
1
: P

X
n
) is of the form
P
X
n(x
n
)
P

X
n
(x
n
)
>
<
(8.1.7)
for some . This is the likelihood ratio test, and the quantity P
X
n(x
n
)/P

X
n
(x
n
)
is called the likelihood ratio. If the logarithm operation is taken on both sides of
(8.1.7), the test remains intact; and the resultant quantity becomes log-likelihood
ratio, which can be re-written in terms of the divergence of the composition
distribution against the hypothesis distributions. To be precise, let (
x
n be the
composition of x
n
, and let P
(
x
n
be the corresponding composition distribution.
Then
log
P
X
n(x
n
)
P

X
n
(x
n
)
=
n

i=1
log
P
X
(x
i
)
P

X
(x
i
)
=

a.
[#a(x
n
)] log
P
X
(a)
P

X
(a)
=

a.
[nP
(
x
n
(a)] log
P
X
(a)
P

X
(a)
= n

a.
P
(
x
n
(a) log
P
X
(a)
P
(
x
n
(a)
P
(
x
n
(a)
P

X
(a)
= n
_

a.
P
(
x
n
(a) log
P
(
x
n
(a)
P

X
(a)

a.
P
(
x
n
(a) log
P
(
x
n
(a)
P
X
(a)
_
= n[D(P
(
x
n
|P

X
) D(P
(
x
n
|P
X
)]
Hence, (8.1.7) is equivalent to
D(P
(
x
n
|P

X
) D(P
(
x
n
|P
X
)
>
<
1
n
log . (8.1.8)
This equivalence means that for hypothesis testing under i.i.d. observations,
selection of the acceptance region suces to be made upon compositions instead
of observations. In other words, the optimal decision function can be dened as:
(() =
_
_
_
0, if composition ( is classied to belong to null hypothesis
according to (8.1.8);
1, otherwise.
132
8.1.5 Exponent of Bayesian cost
Randomization on decision () can improve the resultant type-II error for
Neyman-Pearson criterion of xed test level; however, it cannot improve Baye-
sian cost. The latter statement can be easily justied by noting the contribution
to Bayesian cost for randomizing () as
(x
n
) =
_
0, with probability ;
1, with probability 1 ;
satises

0
P
X
n(x
n
) +
1
(1 )P

X
n
(x
n
) min
0
P
X
n(x
n
),
1
P

X
n
(x
n
).
Therefore, assigning x
n
to either null hypothesis or alternative hypothesis im-
proves the Bayesian cost for (0, 1). Since under i.i.d. statistics, all observa-
tions with the same composition have the same probability, the above statement
is also true for decision function dened based on compositions.
Now suppose the acceptance region for null hypothesis is chosen to be a set
of compositions, denoted by
/ ( : D(P
(
|P

X
) D(P
(
|P
X
) >
t
.
Then by Sanovs theorem, the exponent of type II error,
n
, is
min
(,
D(P
(
|P

X
).
Similarly, the exponent of type I error,
n
, is
min
(,
c
D(P
(
|P
X
).
Because whether or not an element x
n
with composition (
x
n is in / is deter-
mined by the dierence of D(P
(
x
n
|P
X
) and D(P
(
x
n
|P

X
), it can be expected
that as n goes to innity, there exists P
e
X
such that the exponents for type I
error and type II error are respectively D(P
e
X
|P
X
) and D(P
e
X
|P

X
) (cf. Figure
8.2.) The solution for P
e
X
can be solved using Lagrange multiplier technique by
re-formulating the problem as minimizing D(P
(
|P

X
) subject to the condition
D(P
(
|P

X
) = D(P
(
|P
X
)
t
. By taking the derivative of
D(P
e
X
|P

X
) +
_
D(P
e
X
|P

X
) D(P
e
X
|P
X
)
t

+
_

x.
P
e
X
(x) 1
_
with respective to each P
e
X
(x), we have
log
P
e
X
(x)
P

X
(x)
+ 1 + log
P
X
(x)
P

X
(x)
+ = 0.
133
Then the optimal P
e
X
(x) satises:
P
e
X
(x) = P

(x)
P

X
(x)P
1

X
(x)

a.
P

X
(a)P
1

X
(a)
.
t t
P
X
P

X
D(P
e
X
|P
X
)
t
P
e
X
D(P
e
X
|P

X
)
D(P
(
|P
X
) = D(P
(
|P

X
)
t

Figure 8.2: The divergence view on hypothesis testing.


The geometrical explanation for P

is that it locates on the straight line


(in the sense of divergence measure) between P
X
and P

X
over the probability
space. When 0, P

X
; when 1, P

P
X
. Usually, P

is named
the tilted or twisted distribution. The value of is dependent on
t
= (1/n) log .
It is known from detection theory that the best for Bayes testing is
1
/
0
,
which is xed. Therefore,

t
= lim
n
1
n
log

1

0
= 0,
which implies that the optimal exponent for Bayes error is the minimum of
D(P

|P
X
) subject to D(P

|P
X
) = D(P

|P

X
), namely the mid-point of the line
segment (P
X
, P

X
) on probability space. This quantity is called the Cherno
bound.
134
8.2 Large deviations theory
The large deviations theory basically consider the technique of computing the
exponent of an exponentially decayed probability. Such technique is very useful
in understanding the exponential detail of error probabilities. You have already
seen a few good examples in hypothesis testing. In this section, we will introduce
its fundamental theories and techniques.
8.2.1 Tilted or twisted distribution
Suppose the probability of a set P
X
(/
n
) decreasing down to zero exponentially
fact, and its exponent is equal to a > 0. Over the probability space, let T denote
the set of those distributions P
e
X
satisfying P
e
X
(/
n
) exhibits zero exponent. Then
applying similar concept as Sanovs theorem, we can expect that
a = min
P
e
X
1
D(P
e
X
|P
X
).
Now suppose the minimizer of the above function happens at f(P
e
X
) = for some
constant and some dierentiable function f() (e.g., T P
e
X
: f(P
e
X
) = .)
Then the minimizer should be of the form
( a A) P
e
X
(a) =
P
X
(a)e

f(P
e
X
)
P
e
X
(a)

.
P
X
(a
t
)e

f(P
e
X
)
P
e
X
(a

)
.
As a result, P
e
X
is the resultant distribution from P
X
exponentially twisted via
the partial derivative of the function f(). Note that P
e
X
is usually written as
P
()
X
since it is generated by twisting P
X
with twisted factor .
8.2.2 Conventional twisted distribution
The conventional denition for twisted distribution is based on the divergence
function, i.e.,
f(P
e
X
) = D(P
e
X
|P

X
) D(P
e
X
|P
X
). (8.2.1)
Since
D(P
e
X
|P
X
)
P
e
X
(a)
= log
P
e
X
(a)
P
X
(a)
+ 1,
135
the twisted distribution (with respect to the f() dened in (8.2.1)) becomes
( a A) P
e
X
(a) =
P
X
(a)e
log
P

X
(a)
P
X
(a)

.
P
X
(a
t
)e
log
P

X
(a

)
P
X
(a

)
=
P
1
X
(a)P

X
(a)

.
P
1
X
(a)P

X
(a)
Note that for Bayes testing, the probability of the optimal acceptance regions for
both hypotheses evaluated under the optimal twisted distribution (i.e., = 1/2)
exhibits zero exponent.
8.2.3 Cramers theorem
Consider a sequence of i.i.d. random variables, X
n
, and suppose that we are
interested in the probability of the set
_
X
1
+ + X
n
n
>
_
.
Observe that (X
1
+ + X
n
)/n can be re-written as

a.
aP
(
(a).
Therefore, the set considered can be dened through function f():
f(P
e
X
)

a.
aP
e
X
(a),
and its partial derivative with respect to P
e
X
(a) is a. The resultant twisted
distribution is
( a A) P
()
X
(a) =
P
X
(a)e
a

.
P
X
(a
t
)e
a

.
So the exponent of P
X
n(X
1
+ + X
n
)/n > is
min
P
e
X
: D(P
e
X
|P
X
)>
D(P
e
X
|P
X
) = min
P
()
X
: D(P
()
X
|P
X
)>
D(P
()
X
|P
X
).
136
It should be pointed out that

.
P
X
(a
t
)e
a

is the moment generating


function of P
X
. The conventional Cramers result does not use the divergence.
Instead, it introduced the large deviation rate function, dened by
I
X
(x) sup
T
[x log M
X
()], (8.2.2)
where M
X
() E[expX] is the moment generating function of X. Using his
statement, the exponent of P
X
n(X
1
+ + X
n
)/n > is respectively lower-
and upper bounded by
inf
x
I
X
(x) and inf
x>
I
X
(x).
An example on how to obtain the exponent bounds is illustrated in the next
subsection.
8.2.4 Exponent and moment generating function: an ex-
ample
In this subsection, we re-derive the exponent of Pr(X
1
+ + X
n
)/n
using the moment generating function for i.i.d. random variables X
i

i=1
with
mean E[X] = < and bounded variance. As a result, its exponent equals
I
X
() (cf. (8.2.2)).
A) Preliminaries : Observe that since E[X] = < and E[[X [
2
] < ,
Pr
_
X
1
+ + X
n
n

_
0 as n .
Hence, we can compute its rate of convergence (to zero).
B) Upper bound of the probability :
Pr
_
X
1
+ + X
n
n

_
= Pr (X
1
+ + X
n
) n , for any > 0
= Pr exp ((X
1
+ + X
n
)) exp (n)

E [exp ((X
1
+ + X
n
))]
exp (n)
=
E
n
[exp (X))]
exp (n)
=
_
M
X
()
exp ()
_
n
.
137
Hence,
liminf
n

1
n
Pr
_
X
1
+ + X
n
n
>
_
log M
X
().
Since the above inequality holds for every > 0, we have
liminf
n

1
n
Pr
_
X
1
+ + X
n
n
>
_
max
>0
[ log M
X
()]
=

log M
X
(

),
where

> 0 is the optimizer of the maximum operation. (The positivity


of

can be easily veried by the concavity of the function log M


X
()
in , and it derivative at = 0 equals () which is strictly greater than
0.) Consequently,
liminf
n

1
n
Pr
_
X
1
+ + X
n
n
>
_

log M
X
(

)
= sup
T
[ log M
X
()] = I
X
().
C) Lower bound of the probability :
Dene the twisted distribution of P
X
as
P
()
X
(x)
expxP
X
(x)
M
X
()
.
Let X
()
be the random variable having distribution P
()
X
. Then from
i.i.d. property of X
n
,
P
()
X
n(x
n
)
exp(x
1
+ + x
n
)P
X
n(x
n
)
M
n
X
()
,
which implies
P
X
n(x
n
) = M
n
X
() exp(x
1
+ + x
n
)P
()
X
n(x
n
).
138
Note that X
()
i

i=1
are also i.i.d. random variables. Then
Pr
_
X
1
+ + X
n
n

_
=

[x
n
: x
1
++xnn]
P
X
n(x
n
)
=

[x
n
: x
1
++xnn]
M
n
X
() exp(x
1
+ + x
n
)P
()
X
n(x
n
)
= M
n
X
()

[x
n
: x
1
++xnn]
exp(x
1
+ + x
n
)P
()
X
n(x
n
)
M
n
X
()

[x
n
: n(+)>x
1
++xnn]
exp(x
1
+ + x
n
)P
()
X
n(x
n
),
for > 0
M
n
X
()

[x
n
: n(+)>x
1
++xnn]
expn( + )P
()
X
n(x
n
), for > 0.
= M
n
X
() expn( + )

[x
n
: n(+)>x
1
++xnn]
P
()
X
n(x
n
)
= M
n
X
() expn( + )Pr
_
+ >
X
()
1
+ + X
()
n
n

_
.
Choosing =

(the one that maximizes log M


X
()), and observing
that E[X
(

)
] = , we obtain:
Pr
_
+ >
X
()
1
+ + X
()
n
n

_
= Pr
_

n
Var[X
(

)
]
>
(X
()
1
) + + (X
()
n
)

nVar[X
(

)
]
0
_
Pr
_
B >
(X
()
1
) + + (X
()
n
)

nVar[X
(

)
]
0
_
,
for

n > BVar[X
(

)
]/
(B) (0) as n ,
where the last step follows from the fact that
(X
()
1
) + + (X
()
n
)

nVar[X
(

)
]
converges in distribution to standard normal distribution (). Since B
139
can be made arbitrarily large,
Pr
_
+ >
X
()
1
+ + X
()
n
n

_

1
2
.
Consequently,
limsup
n

1
n
log Pr
_
X
1
+ + X
n
n

_

( + ) log M
X
(

)
= sup
T
[ log M
X
()] +

= I
X
() +

.
The derivation is completed by noting that

is xed, and can be chosen


arbitrarily small.
8.3 Theories on Large deviations
In this section, we will derive inequalities on the exponent of the probability,
PrZ
n
/n [a, b], which is a slight extension of the G artner-Ellis theorem.
8.3.1 Extension of Gartner-Ellis upper bounds
Denition 8.8 In this subsection, Z
n

n=1
will denote an innite sequence of
arbitrary random variables.
Denition 8.9 Dene

n
()
1
n
log E [exp Z
n
] and () limsup
n

n
().
The sup-large deviation rate function of an arbitrary random sequence Z
n

n=1
is dened as

I(x) sup
T : ()>
[x ()] . (8.3.1)
The range of the supremum operation in (8.3.1) is always non-empty since
(0) = 0, i.e. 1 : () > ,= . Hence,

I(x) is always dened.
With the above denition, the rst extension theorem of G artner-Ellis can be
proposed as follows.
Theorem 8.10 For a, b 1 and a b,
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]

I(x).
140
Proof: The proof follows directly from Theorem 8.13 by taking h(x) = x, and
hence, we omit it. 2
The bound obtained in the above theorem is not in general tight. This can
be easily seen by noting that for an arbitrary random sequence Z
n

n=1
, the
exponent of Pr Z
n
/n b is not necessarily convex in b, and therefore, cannot
be achieved by a convex (sup-)large deviation rate function. The next example
further substantiates this argument.
Example 8.11 Suppose that PrZ
n
= 0 = 1 e
2n
, and PrZ
n
= 2n =
e
2n
. Then from Denition 8.9, we have

n
()
1
n
log E
_
e
Zn

=
1
n
log
_
1 e
2n
+ e
(+1)2n

,
and
() limsup
n

n
() =
_
0, for 1;
2( + 1), for < 1.
Hence, 1 : () > = 1 (i.e., the real line) and

I(x) = sup
T
[x ()]
= sup
T
[x + 2( + 1)1 < 1)]
=
_
x, for 2 x 0;
, otherwise,
where 1 represents the indicator function of a set. Consequently, by Theorem
8.10,
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]

I(x)
=
_
_
_
0, for 0 [a, b]
b, for b [2, 0]
, otherwise
.
The exponent of PrZ
n
/n [a, b] in the above example is indeed given by
lim
n
1
n
log P
Z
n
_
Z
n
n
[a, b]
_
= inf
x[a,b]
I

(x),
where
I

(x) =
_
_
_
2, for x = 2;
0, for x = 0;
, otherwise.
(8.3.2)
141
Thus, the upper bound obtained in Theorem 8.10 is not tight.
As mentioned earlier, the looseness of the upper bound in Theorem 8.10 can-
not be improved by simply using a convex sup-large deviation rate function. Note
that the true exponent (cf. (8.3.2)) of the above example is not a convex func-
tion. We then observe that the convexity of the sup-large deviation rate function
is simply because it is a pointwise supremum of a collection of ane functions
(cf. (8.3.1)). In order to obtain a better bound that achieves a non-convex large
deviation rate, the involvement of non-ane functionals seems necessary. As
a result, a new extension of the G artner-Ellis theorem is established along this
line.
Before introducing the non-ane extension of G artner-Ellis upper bound, we
dene the twisted sup-large deviation rate function as follows.
Denition 8.12 Dene

n
(; h)
1
n
log E
_
exp
_
n h
_
Z
n
n
___
and
h
() limsup
n

n
(; h),
where h() is a given real-valued continuous function. The twisted sup-large
deviation rate function of an arbitrary random sequence Z
n

n=1
with respect
to a real-valued continuous function h() is dened as

J
h
(x) sup
T :
h
()>
[ h(x)
h
()] . (8.3.3)
Similar to

I(x), the range of the supremum operation in (8.3.3) is not empty,
and hence,

J
h
() is always dened.
Theorem 8.13 Suppose that h() is a real-valued continuous function. Then
for a, b 1 and a b,
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]

J
h
(x).
Proof: The proof is divided into two parts. Part 1 proves the result under
[a, b]
_
x 1 :

J
h
(x) <
_
,= , (8.3.4)
and Part 2 veries it under
[a, b]
_
x 1 :

J
h
(x) =
_
. (8.3.5)
Since either (8.3.4) or (8.3.5) is true, these two parts, together, complete the
proof of this theorem.
142
Part 1: Assume [a, b]
_
x 1 :

J
h
(x) <
_
,= .
Dene J

inf
x[a,b]

J
h
(x). By assumption, J

< . Therefore,
[a, b]
_
x :

J
h
(x) > J


_
,
for any > 0, and
_
x 1 :

J
h
(x) > J


_
=
_
x 1 : sup
T :
h
()>
[h(x)
h
()] > J

_
T :
h
()>
x 1 : [h(x)
h
()] > J

.
Observe that
T :
h
()>
x 1 : [x
h
()] > J

is a col-
lection of (uncountably innite) open sets that cover [a, b] which is closed
and bounded (and hence compact). By the Heine-Borel theorem, we can
nd a nite subcover such that
[a, b]
k
_
i=1
x 1 : [
i
x
h
(
i
)] > J

,
and ( 1 i k)
h
(
i
) < (otherwise, the set x : [
i
x
h
(
i
)] > J

is empty, and can be removed). Also note ( 1 i k)


h
(
i
) > .
Consequently,
Pr
_
Z
n
n
[a, b]
_
Pr
_
Z
n
n

k
_
i=1
x :
i
h(x)
h
(
i
)] > J

i=1
Pr
_
h
_
Z
n
n
_

i

h
(
i
) > J

_
=
k

i=1
Pr
_
n h
_
Z
n
n
_

i
> n
h
(
i
) + n(J

)
_

i=1
exp n[
n
(
i
; h)
h
(
i
)] n(J

) ,
where the last step follows from Markovs inequality. Since k is a constant
independent of n, and for each integer i [1, k],
limsup
n
1
n
log (exp n[
n
(
i
; h)
h
(
i
)] n(J

)) = (J

),
143
we obtain
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
(J

).
Since is arbitrary, the proof is completed.
Part 2: Assume [a, b]
_
x 1 :

J
h
(x) =
_
.
Observe that [a, b]
_
x :

J
h
(x) > L
_
for any L > 0. Following the same
procedure as used in Part 1, we obtain
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
L.
Since L can be taken arbitrarily large,
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
= = inf
x[a,b]

J
h
(x).
2
The proof of the above theorem has implicitly used the condition <

h
(
i
) < to guarantee that limsup
n
[
n
(
i
; h)
h
(
i
)] = 0 for each inte-
ger i [1, k]. Note that when
h
(
i
) = (resp. ), limsup
n
[
n
(
i
; h)

h
(
i
)] = (resp. ). This explains why the range of the supremum opera-
tion in (8.3.3) is taken to be 1 :
h
() > , instead of the whole real
line.
As indicated in Theorem 8.13, a better upper bound can possibly be found
by twisting the large deviation rate function around an appropriate (non-ane)
functional on the real line. Such improvement is substantiated in the next ex-
ample.
Example 8.14 Let us, again, investigate the Z
n

n=1
dened in Example 8.11.
Take
h(x) =
1
2
(x + 2)
2
1.
Then from Denition 8.12, we have

n
(; h)
1
n
log E [exp nh(Z
n
/n)]
=
1
n
log [exp n exp n( 2) + exp n( + 2)] ,
and

h
() limsup
n

n
(; h) =
_
( + 2), for 1;
, for > 1.
144
Hence, 1 :
h
() > = 1 and

J
h
(x) sup
T
[h(x)
h
()] =
_

1
2
(x + 2)
2
+ 2, for x [4, 0];
, otherwise.
Consequently, by Theorem 8.13,
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]

J
h
(x)
=
_

_
min
_

(a + 2)
2
2
,
(b + 2)
2
2
_
2, for 4 a < b 0;
0, for a > 0 or b < 4;
, otherwise.
(8.3.6)
For b (2, 0) and a
_
2

2b 4, b
_
, the upper bound attained in the
previous example is strictly less than that given in Example 8.11, and hence,
an improvement is obtained. However, for b (2, 0) and a < 2

2b 4,
the upper bound in (8.3.6) is actually looser. Accordingly, we combine the two
upper bounds from Examples 8.11 and 8.14 to get
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
max
_
inf
x[a,b]

J
h
(x), inf
x[a,b]

I(x)
_
=
_

_
0, for 0 [a, b];
1
2
(b + 2)
2
2, for b [2, 0];
, otherwise.
A better bound on the exponent of PrZ
n
/n [a, b] is thus obtained. As a
result, Theorem 8.13 can be further generalized as follows.
Theorem 8.15 For a, b 1 and a b,
limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]

J(x),
where

J(x) sup
h1

J
h
(x) and H is the set of all real-valued continuous func-
tions.
Proof: By re-dening J

inf
x[a,b]

J(x) in the proof of Theorem 8.13, and
observing that
[a, b] x 1 :

J(x) > J

_
h1
_
T :
h
()>
x 1 : [h(x)
h
()] > J

,
145
the theorem holds under [a, b] x 1 :

J(x) < ,= . Similar modica-
tions to the proof of Theorem 8.13 can be applied to the case of [a, b] x
1 :

J(x) = . 2
Example 8.16 Let us again study the Z
n

n=1
in Example 8.11 (also in Ex-
ample 8.14). Suppose c > 1. Take h
c
(x) = c
1
(x + c
2
)
2
c, where
c
1

c +

c
2
1
2
and c
2

2

c + 1

c + 1 +

c 1
.
Then from Denition 8.12, we have

n
(; h
c
)
1
n
log E
_
exp
_
nh
c
_
Z
n
n
___
=
1
n
log [exp n exp n( 2) + exp n( + 2)] ,
and

hc
() limsup
n

n
(; h
c
) =
_
( + 2), for 1;
, for > 1.
Hence, 1 :
hc
() > = 1 and

J
hc
(x) = sup
T
[h
c
(x)
hc
()]
=
_
c
1
(x + c
2
)
2
+ c + 1, for x [2c
2
, 0];
, otherwise.
From Theorem 8.15,

J(x) = sup
h1

J
h
(x) maxliminf
c

J
hc
(x),

I(x) = I

(x),
where I

(x) is dened in (8.3.2). Consequently,


limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]

J(x)
inf
x[a,b]
I

(x)
=
_
_
_
0, if 0 [a, b];
2, if 2 [a, b] and 0 , [a, b];
, otherwise,
and a tight upper bound is nally obtained!
146
Theorem 8.13 gives us the upper bound on the limsup of (1/n) log PrZ
n
/n
[a, b]. With the same technique, we can also obtain a parallel theorem for the
quantity
liminf
n
1
n
log Pr
_
Z
n
n
[a, b]
_
.
Denition 8.17 Dene
h
() liminf
n

n
(; h), where
n
(; h) was de-
ned in Denition 8.12. The twisted inf-large deviation rate function of an arbi-
trary random sequence Z
n

n=1
with respect to a real-valued continuous function
h() is dened as
J
h
(x) sup
T :
h
()>
_
h(x)
h
()
_
.
Theorem 8.18 For a, b 1 and a b,
liminf
n
1
n
log Pr
_
Z
n
n
[a, b]
_
inf
x[a,b]
J(x),
where J(x) sup
h1
J
h
(x) and H is the set of all real-valued continuous func-
tions.
8.3.2 Extension of Gartner-Ellis lower bounds
The tightness of the upper bound given in Theorem 8.13 naturally relies on the
validity of
limsup
n
1
n
log Pr
_
Z
n
n
(a, b)
_
inf
x(a,b)

J
h
(x), (8.3.7)
which is an extension of the G artner-Ellis lower bound. The above inequality,
however, is not in general true for all choices of a and b (cf. Case A of Exam-
ple 8.21). It therefore becomes signicant to nd those (a, b) within which the
extended Gartner-Ellis lower bound holds.
Denition 8.19 Dene the sup-G artner-Ellis set with respect to a real-valued
continuous function h() as

T
h

_
T :
h
()>

T(; h)
147
where

T(; h)
_
x 1 : limsup
t0

h
( + t)
h
()
t
h(x) liminf
t0

h
()
h
( t)
t
_
.
Let us briey remark on the sup-G artner-Ellis set dened above. It is self-
explained in its denition that

T
h
is always dened for any real-valued function
h(). Furthermore, it can be derived that the sup-G artner-Ellis set is reduced to

T
h

_
T :
h
()>
x 1 :
t
h
() = h(x) ,
if the derivative
t
h
() exists for all . Observe that the condition h(x) =
t
h
()
is exactly the equation for nding the that achieves

J
h
(x), which is obtained
by taking the derivative of [h(x)
h
()]. This somehow hints that the sup-
G artner-Ellis set is a collection of those points at which the exact sup-large
deviation rate is achievable.
We now state the main theorem in this section.
Theorem 8.20 Suppose that h() is a real-valued continuous function. Then if
(a, b)

T
h
,
limsup
n
1
n
log Pr
_
Z
n
n

h
(a, b)
_
inf
x(a,b)

J
h
(x),
where

h
(a, b) y 1 : h(y) = h(x) for some x (a, b).
Proof: Let F
n
() denote the distribution function of Z
n
. Dene its extended
twisted distribution around the real-valued continuous function h() as
dF
(;h)
n
()
exp n h(x/n) dF
n
()
E[exp n h(Z
n
/n)]
=
exp n h(x/n) dF
n
()
exp n
n
(; h)
.
Let Z
(:h)
n
be the random variable having F
(;h)
n
() as its probability distribution.
Let
J

inf
x(a,b)

J
h
(x).
Then for any > 0, there exists v (a, b) with

J
h
(v) J

+ .
148
Now the continuity of h() implies that
B(v, ) x 1 : [h(x) h(v)[ <
h
(a, b)
for some > 0. Also, (a, b)

T
h
ensures the existence of satisfying
limsup
t0

h
( + t)
h
()
t
h(v) liminf
t0

h
()
h
( t)
t
,
which in turn guarantees the existence of t = t() > 0 satisfying

h
( + t)
h
()
t
h(v) +

4
and h(v)

4


h
()
h
( t)
t
. (8.3.8)
We then derive
Pr
_
Z
n
n

h
(a, b)
_
Pr
_
Z
n
n
B(v, )
_
= Pr
_

h
_
Z
n
n
_
h(v)

<
_
=
_
xT : [h(x/n)h(v)[<
dF
n
(x)
=
_
xT : [h(x/n)h(v)[<
exp
_
n
n
(; h) nh
_
x
n
__
dF
(;h)
n
(x)
exp n
n
(; h) nh(v) n[[
_
xT : [h(x/n)h(v)[<
dF
(;h)
n
(x)
= exp n
n
(; h) nh(v) n[[ Pr
_
Z
(;h)
n
n
B(v, )
_
,
which implies
limsup
n
1
n
log Pr
_
Z
n
n

h
(a, b)
_
[h(v)
h
()] [[ + limsup
n
1
n
log Pr
_
Z
(;h)
n
n
B(v, )
_
=

J
h
(v) [[ + limsup
n
1
n
log Pr
_
Z
(;h)
n
n
B(v, )
_
J

[[ + limsup
n
1
n
log Pr
_
Z
(;h)
n
n
B(v, )
_
.
149
Since both and can be made arbitrarily small, it remains to show
limsup
n
1
n
log Pr
_
Z
(;h)
n
n
B(v, )
_
= 0. (8.3.9)
To show (8.3.9), we rst note that
Pr
_
h
_
Z
(;h)
n
n
_
h(v) +
_
= Pr
_
e
nt h

Z
(;h)
n
/n

e
nt h(v)+n t
_
e
nt h(v)n t
_
T
e
nt h(x/n)
dF
(;h)
n
= e
nt h(v)n t
_
T
e
nt h(x/n)+n h(x/n)nn(;h)
dF
n
(x)
= e
nth(v)ntnn(;h)+nn(+t;h)
.
Similarly,
Pr
_
h
_
Z
(;h)
n
n
_
h(v)
_
= Pr
_
e
nt h

Z
(;h)
n
/n

e
nt h(v)+n t
_
e
nt h(v)n t
_
T
e
nt h(x/n)
dF
(;h)
n
= e
nt h(v)n t
_
T
e
nt h(x/n)+n h(x/n)nn(;h)
dF
n
(x)
= e
nth(v)ntnn(;h)+nn(t;h)
.
Now by denition of limsup,

n
( + t; h)
h
( + t) +
t
4
and
n
( t; h)
h
( t) +
t
4
(8.3.10)
for suciently large n; and

n
(; h)
h
()
t
4
(8.3.11)
150
for innitely many n. Hence, there exists a subsequence n
1
, n
2
, n
3
, . . . such
that for all n
j
, (8.3.10) and (8.3.11) hold. Consequently, for all j,
1
n
j
log Pr
_
Z
(;h)
n
j
n
j
, B(v, )
_

1
n
j
log
_
2 max
_
e
n
j
th(v)n
j
tn
j
n
j
(;h)+n
j
n
j
(+t;h)
,
e
n
j
th(v)n
j
tn
j
n
j
(;h)+n
j
n
j
(t;h)
__
=
1
n
j
log 2 + max
__
th(v) +
n
j
( + t; h), th(v) +
n
j
( t; h)
_

n
j
(; h) t

1
n
j
log 2 + max [th(v) +
h
( + t), th(v) +
h
( t)]
h
()
t
2
=
1
n
j
log 2
+t max
__

h
( + t)
h
()
t
h(v), h(v)

h
()
h
( t)
t
__

t
2

1
n
j
log 2
t
4
, (8.3.12)
where (8.3.12) follows from (8.3.8). The proof is then completed by obtaining
liminf
n
1
n
log Pr
_
Z
(;h)
n
n
, B(v, )
_

t
4
,
which immediately guarantees the validity of (8.3.9). 2
Next, we use an example to demonstrate that by choosing the right h(),
we can completely characterize the exact (non-convex) sup-large deviation rate

(x) for all x 1.


Example 8.21 Suppose Z
n
= X
1
+ +X
n
, where X
i

n
i=1
are i.i.d. Gaussian
random variables with mean 1 and variance 1 if n is even, and with mean 1
and variance 1 if n is odd. Then the exact large deviation rate formula

I

(x)
that satises for all a < b,
inf
x[a,b]

(x) limsup
n
1
n
log Pr
_
Z
n
n
[a, b]
_
limsup
n
1
n
log Pr
_
Z
n
n
(a, b)
_
inf
x(a,b)

(x)
151
is

(x) =
([x[ 1)
2
2
. (8.3.13)
Case A: h(x) = x.
For the ane h(),
n
() = +
2
/2 when n is even, and
n
() =
+
2
/2 when n is odd. Hence, () = [[ +
2
/2, and

T
h
=
_
_
>0
v 1 : v = 1 +
_
_
_
_
<0
v 1 : v = 1 +
_
_
v 1 : 1 v 1
= (1, ) (, 1).
Therefore, Theorem 8.20 cannot be applied to any a and b with (a, b)
[1, 1] ,= .
By deriving

I(x) = sup
T
x () =
_
_
_
([x[ 1)
2
2
, for [x[ > 1;
0, for [x[ 1,
we obtain for any a (, 1) (1, ),
lim
0
limsup
n
1
n
log Pr
_
Z
n
n
(a , a + )
_
lim
0
inf
x(a,a+)

I(x) =
([a[ 1)
2
2
,
which can be shown tight by Theorem 8.13 (or directly by (8.3.13)). Note
that the above inequality does not hold for any a (1, 1). To ll the
gap, a dierent h() must be employed.
Case B: h(x) = [x a[ for 1 < a < 1.
152
For n even,
E
_
e
nh(Zn/n)

= E
_
e
n[Zn/na[

=
_
na

e
x+na
1

2n
e
(xn)
2
/(2n)
dx +
_

na
e
xna
1

2n
e
(xn)
2
/(2n)
dx
= e
n(2+2a)/2
_
na

2n
e
[xn(1)]
2
/(2n)
dx
+e
n(+22a)/2
_

na
1

2n
e
[xn(1+)]
2
/(2n)
dx
= e
n(2+2a)/2

_
( + a 1)

n
_
+ e
n(+22a)/2

_
( a + 1)

n
_
,
where () represents the unit Gaussian cdf.
Similarly, for n odd,
E
_
e
nh(Zn/n)

= e
n(+2+2a)/2

_
( + a + 1)

n
_
+ e
n(22a)/2

_
( a 1)

n
_
.
Observe that for any b 1,
lim
n
1
n
log (b

n) =
_
_
_
0, for b 0;

b
2
2
, for b < 0.
Hence,

h
() =
_

([a[ 1)
2
2
, for < [a[ 1;
[ + 2(1 [a[)]
2
, for [a[ 1 < 0;
[ + 2(1 +[a[)]
2
, for 0.
Therefore,

T
h
=
_
_
>0
x 1 : [x a[ = + 1 +[a[
_
_
_
_
<0
x 1 : [x a[ = + 1 [a[
_
= (, a 1 [a[) (a 1 +[a[, a + 1 [a[) (a + 1 +[a[, )
153
and

J
h
(x) =
_

_
([x a[ 1 +[a[)
2
2
, for a 1 +[a[ < x < a + 1 [a[;
([x a[ 1 [a[)
2
2
, for x > a + 1 +[a[ or x < a 1 [a[;
0, otherwise.
(8.3.14)
We then apply Theorem 8.20 to obtain
lim
0
limsup
n
1
n
log Pr
_
Z
n
n
(a , a + )
_
lim
0
inf
x(a,a+)

J
h
(x)
= lim
0
( 1 +[a[)
2
2
=
([a[ 1)
2
2
.
Note that the above lower bound is valid for any a (1, 1), and can be
shown tight, again, by Theorem 8.13 (or directly by (8.3.13)).
Finally, by combining the results of Cases A) and B), the true large devi-
ation rate of Z
n

n1
is completely characterized.
2
Remarks.
One of the problems in applying the extended Gartner-Ellis theorems is the
diculty in choosing an appropriate real-valued continuous function (not to
mention the nding of the optimal one in the sense of Theorem 8.15). From
the previous example, we observe that the resultant

J
h
(x) is in fact equal to
the lower convex contour
1
(with respect to h()) of min
yT : h(y)=h(x)

I

(x).
Indeed, if the lower convex contour of min
yT : h(y)=h(x)

I

(x) equals

I

(x)
for some x lying in the interior of

T
h
, we can apply Theorems 8.13 and
8.20 to establish the large deviation rate at this point. From the above
example, we somehow sense that taking h(x) = [x a[ is advantageous
in characterizing the large deviation rate at x = a. As a consequence
of such choice of h(),

J
h
(x) will shape like the lower convex contour of
min

(x a),

I

(a x) in h(x) = [x a[. Hence, if a lies in



T
h
,

J
h
(a)
can surely be used to characterize the large deviation rate at x = a (as it
does in Case B of Example 8.21).
1
We dene that the lower convex contour of a function f() with respect to h() is the
largest g() satisfying that g(h(x)) f(x) for all x, and for every x, y and for all [0, 1],
g(h(x)) + (1 )g(h(y)) g( h(x) + (1 )h(y) ).
154
The assumptions required by the conventional Gartner-Ellis lower bound
[2, pp. 15] are
1. () = () = () exists;
2. () is dierentiable on its domain; and
3. (a, b) x 1 : x =
t
() for some .
The above assumptions are somewhat of limited use for arbitrary random
sequences, since they do not in general hold. For example, the condition
of () ,= () is violated in Example 8.21.
By using the limsup and liminf operators in our extension theorem, the
sup-Gartner-Ellis set is always dened without any requirement on the
log-moment generating functions. The sup-G artner-Ellis set also clearly
indicates the range in which the Gartner-Ellis lower bound holds. In other
words,

T
h
is a subset of the union of all (a, b) for which the Gartner-Ellis
lower bound is valid. This is concluded in the following equation:

T
h

_
_
(a, b) : limsup
n
1
n
log Pr
_
Z
n
n

h
(a, b)
_
inf
x(a,b)

J
h
(x)
_
.
To verify whether or not the above two sets are equal merits further inves-
tigation.
Modifying the proof of Theorem 8.20, we can also establish a lower bound
for
liminf
n
1
n
log Pr
_
Z
n
n

h
(a, b)
_
.
Denition 8.22 Dene the inf-Gartner-Ellis set with respect to a real-
valued continuous function h() as
T
h

_
T :
h
()>
T(; h)
where
T(; h)
_
x 1 : limsup
t0

h
( + t)
h
()
t
h(x)
liminf
t0

h
()
h
( t)
t
_
.
155
Theorem 8.23 Suppose that h() is a real-valued continuous function.
Then if (a, b) T
h
,
liminf
n
1
n
log Pr
_
Z
n
n

h
(a, b)
_
inf
x(a,b)
J
h
(x).
One of the important usages of the large deviation rate functions is to nd
the Varadhans asymptotic integration formula of
lim
n
1
n
log E [exp Z
n
]
for a given random sequence Z
n

n=1
. To be specic, it is equal [4, Theo-
rem 2.1.10] to
lim
n
1
n
log E [exp Z
n
] = sup
xT : I(x)<
[x I(x)],
if
lim
L
limsup
n
1
n
log
__
[xT : xL]
exp x dP
Zn
(x)
_
= .
The above result can also be extended using the same idea as applied to
the Gartner-Ellis theorem.
Theorem 8.24 If
lim
L
limsup
n
1
n
log
__
[xT : h(x)L]
exp h(x) dP
Zn
(x)
_
= ,
then
limsup
n
1
n
log E
_
exp
_
nh
_
Z
n
n
___
= sup
xT :

J
h
(x)<
[h(x)

J
h
(x)],
and
liminf
n
1
n
log
_
exp
_
nh
_
Z
n
n
___
= sup
xT : J
h
(x)<
[h(x) J
h
(x)].
Proof: This can be obtained by modifying the proofs of Lemmas 2.1.7
and 2.1.8 in [4]. 2
We close the section by remarking that the result of the above theorem
can be re-formulated as

J
h
(x) = sup
T :
h
()>
[h(x)
h
()]
156
and

h
() = sup
xT :

J
h
(x)<
[h(x)

J
h
(x)],
which is an extension of the Legendre-Fenchel transform pair. Similar
conclusion applies to J
h
(x) and
h
().
8.3.3 Properties of (twisted) sup- and inf-large deviation
rate functions.
Property 8.25 Let

I(x) and I(x) be the sup- and inf- large deviation rate func-
tions of an innite sequence of arbitrary random variables Z
n

n=1
, respectively.
Denote m
n
= (1/n)E[Z
n
]. Let m limsup
n
m
n
and m liminf
n
m
n
.
Then
1.

I(x) and I(x) are both convex.
2.

I(x) is continuous over x 1 :

I(x) < . Likewise, I(x) is continuous
over x 1 : I(x) < .
3.

I(x) gives its minimum value 0 at m x m.
4. I(x) 0. But I(x) does not necessary give its minimum value at both
x = m and x = m.
Proof:
1.

I(x) is the pointwise supremum of a collection of ane functions. Therefore,
it is convex. Similar argument can be applied to I(x).
2. A convex function on the real line is continuous everywhere on its domain
and hence the property holds.
3.&4. The proofs follow immediately from Property 8.26 by taking h(x) = x.2
Since the twisted sup/inf-large deviation rate functions are not necessarily
convex, a few properties of sup/inf-large deviation functions do not hold for
general twisted functions.
Property 8.26 Suppose that h() is a real-valued continuous function. Let

J
h
(x) and J
h
(x) be the corresponding twisted sup- and inf- large deviation rate
functions, respectively. Denote m
n
(h) E[h(Z
n
/n)]. Let
m
h
limsup
n
m
n
(h) and m
h
liminf
n
m
n
(h).
Then
157
1.

J
h
(x) 0, with equality holds if m
h
h(x) m
h
.
2. J
h
(x) 0, but J
h
(x) does not necessary give its minimum value at both
x = m
h
and x = m
h
.
Proof:
1. For all x 1,

J
h
(x) sup
T :
h
()>
[ h(x)
h
()] 0 h(x)
h
(0) = 0.
By Jensens inequality,
exp n
n
(; h) = E[exp n h(Z
n
/n)]
exp n E[h(Z
n
/n)]
= exp n m
n
(h) ,
which is equivalent to
m
n
(h)
n
(; h).
After taking the limsup and liminf of both sides of the above inequalities,
we obtain:
for 0,
m
h

h
() (8.3.15)
and
m
h

h
()
h
(); (8.3.16)
for < 0,
m
h

h
() (8.3.17)
and
m
h

h
()
h
(). (8.3.18)
(8.3.15) and (8.3.18) imply

J
h
(x) = 0 for those x satisfying h(x) = m
h
,
and (8.3.16) and (8.3.17) imply

J
h
(x) = 0 for those x satisfying h(x) = m
h
.
For x x : m
h
h(x) m
h
,
h(x)
h
() m
h

h
() 0 for 0
and
h(x)
h
() m
h

h
() 0 for < 0.
Hence, by taking the supremum over 1 :
h
() > , we obtain
the desired result.
158
2. The non-negativity of J
h
(x) can be similarly proved as

J
h
(x).
From Case A of Example 8.21, we have m = 1, m = 1, and
() =
_

_
+
1
2

2
, for 0
+
1
2

2
, for < 0
Therefore,
I(x) = sup
T
x () =
_

_
(x + 1)
2
2
, for x 0
(x 1)
2
2
, for x < 0,
.
for which I(1) = I(1) = 2 and min
xT
I(x) = I(0) = 1/2. Conse-
quently, I
h
(x) neither equals zero nor gives its minimum value at both
x = m
h
and x = m
h
. 2
8.4 Probabilitic subexponential behavior
In the previous section, we already demonstrated the usage of large deviations
techniques to compute the exponent of an exponentially decayed probability.
However, no eort so far is placed on studying its subexponential behavior. Ob-
serve that the two sequences, a
n
= (1/n) exp2n and b
n
= (1/

n) exp2n,
have the same exponent, but contain dierent subexponential terms. Such subex-
ponential terms actually aect their respective rate of convergence when n is not
large. Let us begin with the introduction of (a variation of) the Berry-Esseen
theorem [5, sec.XVI. 5], which is later applied to evaluate the subexponential
detail of a desired probability.
8.4.1 Berry-Esseen theorem for compound i.i.d. sequence
The Berry-Esseen theorem [5, sec.XVI. 5] states that the distribution of the sum
of independent zero-mean random variables X
i

n
i=1
, normalized by the standard
deviation of the sum, diers from the Gaussian distribution by at most C r
n
/s
3
n
,
where s
2
n
and r
n
are respectively sums of the marginal variances and the marginal
absolute third moments, and C is an absolute constant. Specically, for every
a 1,

Pr
_
1
s
n
(X
1
+ + X
n
) a
_
(a)

C
r
n
s
3
n
, (8.4.1)
where () represents the unit Gaussian cdf. The striking feature of this theorem
is that the upper bound depends only on the variance and the absolute third
159
moment, and hence, can provide a good asymptotic estimate based on only the
rst three moments. The absolute constant C is commonly 6 [5, sec. XVI. 5,
Thm. 2]. When X
n

n
i=1
are identically distributed, in addition to independent,
the absolute constant can be reduced to 3, and has been reported to be improved
down to 2.05 [5, sec. XVI. 5, Thm. 1].
The samples that we concern in this section actually consists of two i.i.d. se-
quences (and, is therefore named compound i.i.d. sequence.) Hence, we need to
build a Berry-Esseen theorem based on compound i.i.d. samples. We begin with
the introduction of the smoothing lemma.
Lemma 8.27 Fix the bandlimited ltering function
v
T
(x)
1 cos(Tx)
Tx
2
.
For any cumulative distribution function H() on the real line 1,
sup
xT
[
T
(x)[
1
2

6
T

2
h
_
T

2
2

_
,
where

T
(t)
_

[H(t x) (t x)] v
T
(x)dx, sup
xT
[H(x) (x)[ ,
and
h(u)
_

_
u
_

u
1 cos(x)
x
2
dx
=

2
u + 1 cos(u) u
_
u
0
sin(x)
x
dx, if u 0;
0, otherwise.
Proof: The right-continuity of the cdf H() and the continuity of the Gaussian
unit cdf () together indicate the right-continuity of [H(x) (x)[, which in
turn implies the existence of x
0
1 satisfying
either = [H(x
0
) (x
0
)[ or = lim
xx
0
[H(x) (x)[ > [H(x
0
) (x
0
)[ .
We then distinguish between three cases:
Case A) = H(x
0
) (x
0
)
Case B) = (x
0
) H(x
0
)
Case C) = lim
xx
0
[H(x) (x)[ > [H(x
0
) (x
0
)[ .
160
Case A) = H(x
0
) (x
0
). In this case, we note that for s > 0,
H(x
0
+ s) (x
0
+ s) H(x
0
)
_
(x
0
) +
s

2
_
(8.4.2)
=
s

2
, (8.4.3)
where (8.4.2) follows from sup
xT
[
t
(x)[ = 1/

2. Observe that (8.4.3)


implies
H
_
x
0
+

2
2
x
_

_
x
0
+

2
2
x
_

1

2
_

2
2
x
_
=

2
+
x

2
,
for [x[ <

2/2. Together with the fact that H(x) (x) for all
x 1, we obtain
sup
xT
[
T
(x)[

T
_
x
0
+

2
2

_
=
_

_
H
_
x
0
+

2
2
x
_

_
x
0
+

2
2
x
__
v
T
(x)dx
=
_
[[x[<

2/2]
_
H
_
x
0
+

2
2
x
_

_
x
0
+

2
2
x
__
v
T
(x)dx
+
_
[[x[

2/2]
_
H
_
x
0
+

2
2
x
_

_
x
0
+

2
2
x
__
v
T
(x)dx

_
[[x[<

2/2]
_

2
+
x

2
_
v
T
(x)dx +
_
[[x[

2/2]
() v
T
(x)dx
=
_
[[x[<

2/2]

2
v
T
(x)dx +
_
[[x[

2/2]
() v
T
(x)dx, (8.4.4)
where the last equality holds because of the symmetry of the ltering func-
161
tion, v
T
(). The quantity of
_
[
[x[

2/2]
v
T
(x)dx can be derived as follows:
_
[
[x[

2/2]
v
T
(x)dx = 2
_

2/2
v
T
(x)dx
=
2

2/2
1 cos(Tx)
Tx
2
dx
=
2

_

T

2/2
1 cos(x)
x
2
dx
=
4
T

2
h
_
T

2
2

_
.
Continuing from (8.4.4),
sup
xT
[
T
(x)[


2
_
1
4
T

2
h
_
T

2
2

__

_
4
T

2
h
_
T

2
2

__
=
1
2

6
T

2
h
_
T

2
2

_
.
Case B) = (x
0
) H(x
0
). Similar to case A), we rst derive for s > 0,
(x
0
s) H(x
0
s)
_
(x
0
)
s

2
_
H(x
0
) =
s

2
,
and then obtain
H
_
x
0

2
2
x
_

_
x
0

2
2
x
_

1

2
_

2
2
+ x
_
=

2

x

2
,
for [x[ <

2/2. Together with the fact that H(x) (x) for all
162
x 1, we obtain
sup
xT
[
T
(x)[

T
_
x
0

2
2

_

_
[[x[<

2/2]
_

2

x

2
_
v
T
(x)dx +
_
[[x[

2/2]
() v
T
(x)dx
=
_
[[x[<

2/2]

2
v
T
(x)dx +
_
[[x[

2/2]
() v
T
(x)dx
=
1
2

6
T

2
h
_
T

2
2

_
.
Case C) = lim
xx
0
[H(x) (x)[ > [H(x
0
) (x
0
)[.
In this case, we claim that for s > 0, either
H(x
0
+ s) (x
0
+ s)
s

2
or
(x
0
s) H(x
0
s)
s

2
holds, because if it were not true, then
_

_
H(x
0
+ s) (x
0
+ s) <
s

2
(x
0
s) H(x
0
s) <
s

_
_
_
lim
s0
H(x
0
+ s) (x
0
)
(x
0
) lim
s0
H(x
0
s)
which immediately gives us H(x
0
) = (x
0
) , contradicting to the
premise of the current case. Following this claim, the desired result can
be proved using the procedure of either case A) or case B); and hence, we
omit it. 2
163
Lemma 8.28 For any cumulative distribution function H() with characteristic
function
H
(),

1

_
T
T

H
() e
(1/2)
2

d
[[
+
12
T

2
h
_
T

2
2

_
,
where and h() are dened in Lemma 8.27.
Proof: Observe that
T
(t) in Lemma 8.27 is nothing but a convolution of v
T
()
and H(x) (x). The characteristic function (or Fourier transform) of v
T
() is

T
() =
_
1
[[
T
, if [[ T;
0, otherwise.
By the Fourier inversion theorem [5, sec. XV.3],
d (
T
(x))
dx
=
1
2
_

e
jx
_

H
() e
(1/2)
2
_

T
()d.
Integrating with respect to x, we obtain

T
(x) =
1
2
_
T
T
e
jx
_

H
() e
(1/2)
2
_
j

T
()d.
Accordingly,
sup
xT
[
T
(x)[ = sup
xT
1
2

_
T
T
e
jx
_

H
() e
(1/2)
2
_
j

T
()d

sup
xT
1
2
_
T
T

e
jx
_

H
() e
(1/2)
2
_
j

T
()

d
= sup
xT
1
2
_
T
T

H
() e
(1/2)
2

[
T
()[
d
[[
sup
xT
1
2
_
T
T

H
() e
(1/2)
2

d
[[
=
1
2
_
T
T

H
() e
(1/2)
2

d
[[
.
Together with the result in Lemma 8.27, we nally have

1

_
T
T

H
() e
(1/2)
2

d
[[
+
12
T

2
h
_
T

2
2

_
.
2
164
Theorem 8.29 (Berry-Esseen theorem for compound i.i.d. sequences)
Let Y
n
=

n
i=1
X
i
be the sum of independent random variables, among which
X
i

d
i=1
are identically Gaussian distributed, and X
i

n
i=d+1
are identically dis-
tributed but not necessarily Gaussian. Denote the mean-variance pair of X
1
and
X
d+1
by (,
2
) and ( ,
2
), respectively. Dene
E
_
[X
1
[
3

, E
_
[X
d+1
[
3

, and s
2
n
= Var[Y
n
] =
2
d +
2
(n d).
Also denote the cdf of (Y
n
E[Y
n
])/s
n
by H
n
(). Then for all y 1,
[H
n
(y) (y)[ C
n,d
2

(n d 1)
_
2(n d) 3

2
_


2
s
n
,
where C
n,d
is the unique positive number satisfying

6
C
n,d
h(C
n,d
) =

_
2(n d) 3

2
_
12(n d 1)

_

6
2
_
3

2
_
3/2
+
9
2(11 6

2)

n d
_
,
provided that n d 3.
Proof: From Lemma 8.28,

_
T
T

d
_

s
n
_

nd
_

s
n
_
e
(1/2)
2

d
[[
+
12
T

2
h
_
T

2
2

_
, (8.4.5)
where () and () are respectively the characteristic functions of (X
1
) and
(X
d+1
). Observe that the integrand satises

d
_

s
n
_

nd
_

s
n
_
e
(1/2)
2


nd
_

s
n
_
e
(1/2)(
2
(nd)/s
2
n
)
2

e
(1/2)(
2
d/s
2
n
)
2
(n d)


_

s
n
_
e
(1/2)(
2
/s
2
n
)
2

e
(1/2)(
2
d/s
2
n
)
2

nd1
, (8.4.6)
(n d)
_


_

s
n
_

_
1

2
2s
2
n

2
_

_
1

2
2s
2
n

2
_
e
(1/2)(
2
/s
2
n
)
2

_
e
(1/2)(
2
d/s
2
n)
2

nd1
(8.4.7)
where the quantity in (8.4.6) requires [5, sec. XVI. 5, (5.5)] that


_

s
n
_

and

e
(1/2)(
2
/s
2
n
)
2

.
165
By equation (26.5) in [1], we upperbound the rst and second terms in the braces
of (8.4.7) respectively by


_

s
n
_
1 +

2
2s
2
n



6s
3
n
[[
3
and

1

2
2s
2
n

2
e
(1/2)(
2
/s
2
n
)
2



4
8s
4
n

4
.
Continuing the derivation of (8.4.7),

d
_

s
n
_

nd
_

s
n
_
e
(1/2)
2

(n d)
_

6s
3
n
[[
3
+

4
8s
4
n

4
_
e
(1/2)(
2
d/s
2
n
)
2

nd1
. (8.4.8)
It remains to choose that bounds both [ (/s
n
)[ and exp(1/2) (
2
/s
2
n
)
2

from above.
From the elementary property of characteristic functions,


_

s
n
_

1

2
2s
2
n

2
+

6s
3
n
[
3
[,
if

2
2s
2
n

2
1. (8.4.9)
For those [T, T] (which is exactly the range of integration operation in
(8.4.5)), we can guarantee the validity of (8.4.9) by dening
T

2
s
n

_

2(n d) 3
n d 1
_
,
and obtain

2
2s
2
n


2
2s
2
n
T
2


6
2
2
_

2(n d) 3
n d 1
_
2

1
2
_

2(n d) 3
n d 1
_
2
1,
for n d 3. Hence, for [[ T,


_

s
n
_

1 +
_


2
2s
2
n

2
+

6s
3
n
[
3
[
_
exp
_


2
2s
2
n

2
+

6s
3
n
[
3
[
_
exp
_


2
2s
2
n

2
+

6s
3
n
T
2
_
= exp
_

_

2
2s
2
n

T
6s
3
n
_

2
_
= exp
_

(3

2)(n d)
6(n d 1)

2
s
2
n

2
_
.
166
We can then choose
exp
_

(3

2)(n d)
6(n d 1)

2
s
2
n

2
_
.
The above selected is apparently an upper bound of exp (1/2) (
2
/s
2
n
)
2
.
By taking the chosen into (8.4.8), the integration part in (8.4.5) becomes
_
T
T

d
_

s
n
_

(nd)
_

s
n
_
e
(1/2)
2

d
[[

_
T
T
(n d)
_

6s
3
n

2
+

4
8s
4
n
[[
3
_
(8.4.10)
exp
_

_
3
2
d + (3

2)
2
(n d)
_

2
6s
2
n
_
d

_
T
T
(n d)
_

6s
3
n

2
+

4
8s
4
n
[[
3
_
exp
_

_
(3

2)
2
d + (3

2)
2
(n d)
_

2
6s
2
n
_
d
=
_
T
T
(n d)
_

6s
3
n

2
+

4
8s
4
n
[[
3
_
exp
_

(3

2)
6

2
_
d
=
_

(n d)
_

6s
3
n

2
+

4
8s
4
n
[[
3
_
exp
_

(3

2)
6

2
_
d
=
(n d)
2
s
2
n



2
s
n

_
3

6
6(3

2)
3/2
+
9
2(11 6

2)


3



s
n
_

1
T
_

2(n d) 3
n d 1
__
3

6
6(3

2)
3/2
+
9
2(11 6

2)
1

n d
_
, (8.4.11)
where the last inequality follows respectively from
(n d)
2
s
2
n
,
/(
2
s
n
) = (1/T)(

2(n d) 3)/(n d 1),



3
,
and /s
n
1/

n d.
Taking (8.4.11) into (8.4.5), we nally obtain

1
T
_

2(n d) 3
n d 1
__
3

6
6(3

2)
3/2
+
9
2(11 6

2)
1

n d
_
+
12
T

2
h
_
T

2
2

_
,
167
-1
-0.5
0
0.5
1
1.5
2
2.5
0 1 2 3 4 5

6
u h(u)
u
Figure 8.3: Function of (/6)u h(u).
or equivalently,

6
u h(u)

2
_
2(n d) 3

12(n d 1)
_

6
2(3

2)
3/2
+
9
2(11 6

2)

n d
_
,
(8.4.12)
for u T

2/2. Observe that the continuous function (/6)uh(u) equals 0


at u = u
0
2.448, is negative for 0 < u < u
0
, and is positive for u > u
0
. Also
observe that (/6)u h(u) is monotonely increasing (up to innity) for u > u
0
(cf. Figure 8.3). Inequality (8.4.12) is thus equivalent to
u C
n,d
,
where C
n,d
(> u
0
) is dened in the statement of the theorem. The proof is
completed by
= u
2
T

2
C
n,d
2
T

2
= C
n,d
2

(n d 1)
(2(n d) 3

2)


2
s
n
.
2
8.4.2 Berry-Esseen Theorem with a sample-size depen-
dent multiplicative coecient for i.i.d. sequence
By letting d = 0, the Berry-Esseen inequality for i.i.d. sequences can also be
readily obtained from Theorem 8.29.
168
Corollary 8.30 (Berry-Esseen theorem with a sample-size dependent
multiplicative coecient for i.i.d. sequence) Let Y
n
=

n
i=1
X
i
be the sum
of independent random variables with common marginal distribution. Denote
the marginal mean and variance by ( ,
2
). Dene E
_
[X
1
[
3

. Also
denote the cdf of (Y
n
n )/(

n ) by H
n
(). Then for all y 1,
[H
n
(y) (y)[ C
n
2(n 1)

_
2n 3

2
_


3

n
,
where C
n
is the unique positive solution of

6
u h(u) =

_
2n 3

2
_
12(n 1)
_

6
2
_
3

2
_
3/2
+
9
2(11 6

2)

n
_
,
provided that n 3.
Let us briey remark on the previous corollary. We observe from numericals
2
that the quantity
C
n
2

(n 1)
_
2n 3

2
_
is decreasing in n, and ranges from 3.628 to 1.627 (cf. Figure 8.4.) Numerical
result shows that it lies below 2 when n 9, and is smaller than 1.68 as n 100.
In other words, we can upperbound this quantity by 1.68 as n 100, and
therefore, establish a better estimate of the Berry-Esseen constant than that in
[5, sec. XVI. 5, Thm. 1].
2
We can upperbound C
n
by the unique positive solution D
n
of

6
u h(u) =

6
_
_

6
2
_
3

2
_
3/2
+
9
2(11 6

2)

n
_
_
,
which is strictly decreasing in n. Hence,
C
n
2

(n 1)
_
2n 3

2
_ E
n
D
n
2

(n 1)
_
2n 3

2
_,
and the right-hand-side of the above inequality is strictly decreasing (since both D
n
and
(n 1)/(2n 3

2) are decreasing) in n, and ranges from E


3
= 4.1911, . . . ,E
9
= 2.0363,. . . ,
E
100
= 1.6833 to E

= 1.6266. If the property of strictly decreasingness is preferred, one can


use D
n
instead of C
n
in the Berry-Esseen inequality. Note that both C
n
and D
n
converges to
2.8831 . . . as n goes to innity.
169
0
1
1.68
2
3
3 10 20 30 40 50 75 100 150 200
C
n
2

(n 1)
(2n 3

2)
n
Figure 8.4: The Berry-Esseen constant as a function of the sample size
n. The sample size n is plotted in log-scale.
8.4.3 Probability bounds using Berry-Esseen inequality
We now investigate an upper probability bound for the sum of a compound
i.i.d. sequence.
Basically, the approach used here is the large deviation technique, which is
considered the technique of computing the exponent of an exponentially decayed
probability. Other than the large deviations, the previously derived Berry-Esseen
theorem is applied to evaluate the subexponential detail of the desired probability.
With the help of these two techniques, we can obtain a good estimate of the
destined probability. Some notations used in large deviations are introduced
here for the sake of completeness.
Let Y
n
=

n
i=1
X
i
be the sum of independent random variables. Denote the
distribution of Y
n
by F
n
(). The twisted distribution with parameter corre-
sponding to F
n
() is dened by
dF
()
n
(y)
e
y
dF
n
(y)
_

e
y
dF
n
( y)
=
e
y
dF
n
(y)
M
n
()
, (8.4.13)
where M
n
() is the moment generating function corresponding to the distribution
F
n
(). Let Y
()
n
be a random variable having F
()
n
() as its probability distribution.
Analogously, we can twist the distribution of X
i
with parameter , yielding its
170
twisted counterpart X
()
i
. A basic large deviation result is that
Y
()
n
=
n

i=1
X
()
i
,
and X
()
i

n
i=1
still forms a compound i.i.d. sequence.
Lemma 8.31 (probability upper bound) Let Y
n
=

n
i=1
X
i
be the sum of
independent random variables, among which X
i

d
i=1
are identically Gaussian
distributed with mean > 0 and variance
2
, and X
i

n
i=d+1
have common
distribution as minX
i
, 0. Denote the variance and the absolute third moment
of the twisted random variable X
()
d+1
by
2
() and (), respectively. Also dene
s
2
n
() Var[Y
()
n
] and M
n
() E
_
e
Yn

. Then
Pr Y
n
0 B
_
d, (n d);
2
,
2
_

_
A
n,d
M
n
(), when E[Y
n
] 0;
1 B
n,d
M
n
(), otherwise,
where
A
n,d
min
_
1

2[[s
n
()
+ C
n,d
4(n d 1) ()

_
2(n d) 3


2
()s
n
()
, 1
_
,
B
n,d

1
[[s
n
()
exp
_
[[s
n
()
1
_
(0) +
4C
n,d
(n d 1) ()

[2(n d) 3

2]
2
()s
n
()
+
1
[[s
n
()
__
,
and is the unique solution of M
n
()/ = 0, provided that n d 3. (A
n,d
,
B
n,d
and are all functions of
2
and
2
. Here, we drop them to simplify the
notations.)
Proof: We derive the upper bound of Pr (Y
n
0) under two dierent situations:
E[Y
n
] 0 and E[Y
n
] < 0.
Case A) E[Y
n
] 0. Denote the cdf of Y
n
by F
n
(). Choose to satisfy
M
n
()/ = 0. By the convexity of the moment generating function
M
n
() with respect to and the nonnegativity of E[Y
n
] = M
n
()/[
=0
,
the satisfying M
n
()/ = 0 uniquely exists and is non-positive. Denote
171
the distribution of
_
Y
()
n
E[Y
()
n
]
_
/s
n
() by H
n
(). Then from (8.4.13),
we have
Pr (Y
n
0) =
_
0

dF
n
(y)
= M
n
()
_
0

e
y
dF
()
n
(y)
= M
n
()
_
0

e
sn() y
dH
n
(y). (8.4.14)
Integrating by parts on (8.4.14) with
(dy) s
n
() e
sn() y
dy,
and then applying Theorem 8.29 yields
_
0

e
sn() y
dH
n
(y)
=
_
0

[H
n
(0) H
n
(y)] (dy)

_
0

_
(0) (y) + C
n,d
4(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
_
(dy)
=
_
0

[(0) (y)] (dy) + C


n,d
4(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
= e

2
s
2
n
()/2
( s
n
()) + C
n,d
4(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
(8.4.15)

2 [[ s
n
()
+ C
n,d
4(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
,
where (8.4.15) holds by, again, applying integration by part, and the last
inequality follows from for u 0,
(u) =
_
u

2
e
t
2
/2
dt

_
u

2
_
1 +
1
t
2
_
e
t
2
/2
dt =
1

2u
e
u
2
/2
.
On the other hand,
_
0

e
sn() y
dH
n
(y) 1 can be established by ob-
172
serving that
M
n
()
_
0

e
sn() y
dH
n
(y) = PrY
n
0
= Pr
_
e
Yn
1
_
E[e
Yn
] = M
n
().
Finally, Case A concludes to
Pr(Y
n
0)
min
_
1

2 [[ s
n
()
+
4C
n,d
(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
, 1
_
M
n
().
Case B) E[Y
n
] < 0.
By following a procedure similar to case A, together with the observation
that > 0 for E[Y
n
] < 0, we obtain
Pr(Y
n
> 0)
=
_

0
dF
n
(y)
= M
n
()
_

0
e
sn() y
dH
n
(y)
M
n
()
_

0
e
sn() y
dH
n
(y), for some > 0 specied later
M
n
()e
sn()
_

0
dH
n
(y)
= M
n
()e
sn()
[H() H(0)]
M
n
()e
sn()

_
() (0) C
n,d
4(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
_
.
The proof is completed by letting be the solution of
() = (0) + C
n,d
4(n d 1) ()

_
2(n d) 3

2
_

2
()s
n
()
+
1
[[s
n
()
.
2
The upper bound obtained in Lemma 8.31 in fact depends only upon the
ratio of (/) or (1/2)(
2
/
2
). Therefore, we can rephrase Lemma 8.31 as the
next corollary.
173
Corollary 8.32 Let Y
n
=

n
i=1
X
i
be the sum of independent random variables,
among which X
i

d
i=1
are identically Gaussian distributed with mean > 0
and variance
2
, and X
i

n
i=d+1
have common distribution as minX
i
, 0. Let
(1/2)(
2
/
2
). Then
Pr Y
n
0 B (d, (n d); )
where
B (d, (n d); )
_
_
_
A
n,d
()M
n
(), if
d
n
1

4
e

4(

2)
;
1 B
n,d
()M
n
(), otherwise,
A
n,d
() min
_
1

s
n
()
+
4C
n,d
(n d 1) ()

_
2(n d) 3


2
() s
n
()
, 1
_
,
B
n,d
()
1

s
n
()
exp
_


_
2

s
n
()

1
_
1
2
+
4C
n,d
(n d 1) ()

[2(n d) 3

2]
2
() s
n
()
+
1

s
n
()
__
,
M
n
() e
n
e
d
2
/2
_
() e

2
/2
+ e

_
_
2
__
nd
,

2
()
d
n d

nd
(n d)
2

2
+
n
n d
1
1 +

2e

2)
,
()
n
(n d)

[1 +

2e

2)]
_
1
d(n + d)
(n d)
2

2
+2
_
n
2
(n d)
2

2
+ 2
_
e
d(2nd)
2
/(nd)
2

d
n d
_
n
2
(n d)
2

2
+ 3
_

2e

(
_
2)

2n
n d
_
n
2
(n d)
2

2
+ 3
_

2e

2
/2

n
n d

__
,
s
2
n
() n
_

d
n d

2
+
1
1 +

2e

2)
_
,
and is the unique positive solution of
e
(1/2)
2
() =
1

2
_
1
d
n
_

d
n
e

(
_
2),
provided that n d 3.
174
Proof: The notations used in this proof follows those in Lemma 8.31.
Let (/) + =

2 + . Then the moment generating functions of


X
1
and X
d+1
minX
1
, 0 are respectively
M
X
1
() = e

2
/2
and M
X
d+1
() = () e

2
/2
+
_
_
2
_
.
Therefore, the moment generating function of Y
n
=

n
i=1
X
i
becomes
M
n
() = M
d
X
1
()M
nd
X
d+1
() = e
n
e
d
2
/2
_
() e

2
/2
+ e

_
_
2
__
nd
.
(8.4.16)
Since and have a one-to-one correspondence, solving M
n
()/ = 0 is
equivalent to solving M
n
()/ = 0, which from (8.4.16) results in
e
(1/2)
2
() =
1

2
_
1
d
n
_

d
n
e

(
_
2). (8.4.17)
As it turns out, the solution of the above equation depends only on .
Next we derive

2
()

2
()

=(

2)/
, ()
()

=(

2)/
and
s
2
n
()
s
2
n
()

=(

2)/
.
By replacing e
(1/2)
2
() with
1

2
_
1
d
n
_

d
n
e

(
_
2)
(cf. (8.4.17)), we obtain

2
() =
d
n d

nd
(n d)
2

2
+
n
n d
1
1 +

2e

2)
, (8.4.18)
() =
n
(n d)

[1 +

2e

2)]
_
1
d(n + d)
(n d)
2

2
+2
_
n
2
(n d)
2

2
+ 2
_
e
d(2nd)
2
/(nd)
2

d
n d
_
n
2
(n d)
2

2
+ 3
_

2e

(
_
2)

2n
n d
_
n
2
(n d)
2

2
+ 3
_

2e

2
/2

n
n d

__
,(8.4.19)
175
and
s
2
n
() = n
_

d
n d

2
+
1
1 +

2e

2)
_
. (8.4.20)
Taking (8.4.18), (8.4.19), (8.4.20) and = (

2)/ into A
n,d
and B
n,d
in
Lemma 8.31, we obtain:
A
n,d
= min
_
1

s
n
()
+ C
n,d
4(n d 1) ()

_
2(n d) 3


2
() s
n
()
, 1
_
and
B
n,d
=
1

s
n
()
exp
_


_
2

s
n
()

1
_
(0) +
4C
n,d
(n d 1) ()

[2(n d) 3

2]
2
() s
n
()
+
1

s
n
()
__
,
which depends only on and , as desired. Finally, a simple derivation yields
E[Y
n
] = dE[X
1
] + (n d)E[X
d+1
]
=
_
d
_
2 + (n d)
_
(1/

2)e

+
_
2(
_
2)
__
,
and hence,
E[Y
n
] 0
d
n
1

4
e

4(

2)
.
2
We close this section by remarking that the parameters in the upper bounds
of Corollary 8.32, although they have complex expressions, are easily computer-
evaluated, once their formulas are established.
8.5 Generalized Neyman-Pearson Hypothesis Testing
The general expression of the Neyman-Pearson type-II error exponent subject to
a constant bound on the type-I error has been proved for arbitrary observations.
In this section, we will state the results in terms of the -inf/sup-divergence
rates.
Theorem 8.33 (Neyman-Pearson type-II error exponent for a xed
test level [3]) Consider a sequence of random observations which is assumed to
176
have a probability distribution governed by either P
X
(null hypothesis) or P

X
(alternative hypothesis). Then, the type-II error exponent satises
lim

(X|

X) limsup
n

1
n
log

n
()

D

(X|

X)
lim

(X|

X) liminf
n

1
n
log

n
() D

(X|

X)
where

n
() represents the minimum type-II error probability subject to a xed
type-I error bound [0, 1).
The general formula for Neyman-Pearson type-II error exponent subject to an
exponential test level has also been proved in terms of the -inf/sup-divergence
rates.
Theorem 8.34 (Neyman-Pearson type-II error exponent for an expo-
nential test level) Fix s (0, 1) and [0, 1). It is possible to choose decision
regions for a binary hypothesis testing problem with arbitrary datawords of
blocklength n, (which are governed by either the null hypothesis distribution
P
X
or the alternative hypothesis distribution P

X
,) such that
liminf
n

1
n
log
n


D

(

X
(s)
|

X) and limsup
n

1
n
log
n
D
(1)
(

X
(s)
|X),
(8.5.1)
or
liminf
n

1
n
log
n
D

(

X
(s)
|

X) and limsup
n

1
n
log
n


D
(1)
(

X
(s)
|X),
(8.5.2)
where

X
(s)
exhibits the tilted distributions
_
P
(s)

X
n
_

n=1
dened by
dP
(s)

X
n
(x
n
)
1

n
(s)
exp
_
s log
dP
X
n
dP

X
n
(x
n
)
_
dP

X
n
(x
n
),
and

n
(s)
_
.
n
exp
_
s log
dP
X
n
dP

X
n
(x
n
)
_
dP

X
n
(x
n
).
Here,
n
and
n
are the type-I and type-II error probabilities, respectively.
Proof: For ease of notation, we use

X to represent

X
(s)
. We only prove (8.5.1);
(8.5.2) can be similarly demonstrated.
177
By denition of dP
(s)

X
n
(), we have
1
s
_
1
n
d
e
X
n
(

X
n
|

X
n
)
_
+
1
1 s
_
1
n
d
e
X
n
(

X
n
|X
n
)
_
=
1
s(1 s)
_
1
n
log
n
(s)
_
.
(8.5.3)
Let

limsup
n
(1/n) log
n
(s). Then, for any > 0, N
0
such that n >
N
0
,
1
n
log
n
(s) <

+ .
From (8.5.3),
d
e
X
n
|

X
n
()
liminf
n
Pr
_
1
n
d
e
X
n
(

X
n
|

X
n
)
_
= liminf
n
Pr
_

1
1 s
_
1
n
d
e
X
n
(

X
n
|X
n
)
_

1
s(1 s)
_
1
n
log
n
(s)
_


s
_
= liminf
n
Pr
_
1
n
d
e
X
n
(

X
n
|X
n
)
1 s
s

1
s
_
1
n
log
n
(s)
__
liminf
n
Pr
_
1
n
d
e
X
n
(

X
n
|X
n
) >
1 s
s

1
s



s
_
= 1 limsup
n
Pr
_
1
n
d
e
X
n
(

X
n
|X
n
)
1 s
s

1
s



s
_
= 1

d
e
X
n
|X
n
_

1 s
s

1
s



s
_
.
Thus,

(

X|

X) sup : d
e
X
n
|

X
n
()
sup
_
: 1

d
e
X
n
|X
n
_

1 s
s

1
s
(

+ )
_
<
_
= sup
_

1
1 s
(

+ )
s
1 s

t
:

d
e
X
n
|X
n
(
t
) > 1
_
=
1
1 s
(

+ )
s
1 s
inf
t
:

d
e
X
n
|X
n
(
t
) > 1
=
1
1 s
(

+ )
s
1 s
sup
t
:

d
e
X
n
|X
n
(
t
) 1
=
1
1 s
(

+ )
s
1 s
D
1
(

X|X).
Finally, choosing the acceptance region for null hypothesis as
_
x
n
A
n
:
1
n
log
dP
e
X
n
dP

X
n
(x
n
)

D

(

X|

X)
_
,
178
we obtain:

n
= P

X
n
_
1
n
log
dP
e
X
n
dP

X
n
(X
n
)

D

(

X|

X)
_
exp
_
n

D

(

X|

X)
_
,
and

n
= P
X
n
_
1
n
log
dP
e
X
n
dP

X
n
(X
n
) <

D

(

X|

X)
_
P
X
n
_
1
n
log
dP
e
X
n
dP

X
n
(X
n
) <
1
1 s
(

+ )
s
1 s
D
1
(

X|X)
_
= P
X
n
_
1
1 s
_
1
n
log
dP
e
X
n
dP
X
n
(X
n
)
_

1
s(1 s)
_
1
n
log
n
(s)
_
<

+
s(1 s)

1
1 s
D
1
(

X|X)
_
= P
X
n
_
1
n
log
dP
e
X
n
dP
X
n
(X
n
) > D
1
(

X|X) +
1
s
_


1
n
log
n
(s)
_
+

s
_
.
Then, for n > N
0
,

n
P
X
n
_
1
n
log
dP
e
X
n
dP
X
n
(X
n
) > D
1
(

X|X)
_
exp
_
nD
1
(

X|X)
_
.
Consequently,
liminf
n

1
n
log
n


D

(

X
(s)
|

X) and limsup
n

1
n
log
n
D
(1)
(

X
(s)
|X).
2
179
Bibliography
[1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: John
Wiley and Sons, 1995.
[2] James A. Bucklew. Large Deviation Techniques in Decision, Simulation,
and Estimation. Wiley, New York, 1990.
[3] P.-N. Chen, General formulas for the Neyman-Pearson type-II error
exponent subject to xed and exponential type-I error bound, IEEE
Trans. Info. Theory, vol. 42, no. 1, pp. 316323, November 1993.
[4] Jean-Dominique Deuschel and Daniel W. Stroock. Large Deviations. Aca-
demic Press, San Diego, 1989.
[5] W. Feller, An Introduction to Probability Theory and its Applications, 2nd
edition, New York, John Wiley and Sons, 1970.
180
Chapter 9
Channel Reliability Function
The channel coding theorem (cf. Chapter 5) shows that block codes with arbi-
trarily small probability of block decoding error exist at any code rate smaller
than the channel capacity C. However, the result mentions nothing on the re-
lation between the rate of convergence for the error probability and code rate
(specically for rate less than C.) For example, if C = 1 bit per channel usage,
and there are two reliable codes respectively with rates 0.5 bit per channel usage
and 0.25 bit per channel usage, is it possible for one to adopt the latter code even
if it results in lower information transmission rate? The answer is armative if
a higher error exponent is required.
In this chapter, we will rst re-prove the channel coding theorem for DMC,
using an alternative and more delicate method that describes the dependence
between block codelength and the probability of decoding error.
9.1 Random-coding exponent
Denition 9.1 (channel block code) An (n, M) code for channel W
n
with
input alphabet A and output alphabet is a pair of mappings
f : 1, 2, , M A
n
and g :
n
1, 2, , M.
Its average error probability is given by
P
e
(n, M)
.
=
1
M
M

m=1

y
n
:g(y
n
),=m
W
n
(y
n
[f(m)).
Denition 9.2 (channel reliability function [1]) For any R 0, dene the
channel reliability function E

(R) for a channel W as the largest scalar > 0


such that there exists a sequence of (n, M
n
) codes with
liminf
n

1
n
log P
e
(n, M
n
)
181
and
R liminf
n
1
n
log M
n
. (9.1.1)
From the above denition, it is obvious that E

(0) = ; hence; in deriving


the bounds on E

(R), we only need to consider the case of R > 0.


Denition 9.3 (random-coding exponent) The random coding exponent
for DMC with generic distribution P
Y [X
is dened by
E
r
(R) sup
0s1
sup
P
X
_
_
sR log

y
_

x.
P
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
_
_
.
Theorem 9.4 (random coding exponent) For DMC with generic transition
probability P
Y [X
,
E

(R) E
r
(R)
for R 0.
Proof: The theorem holds trivially at R = 0 since E

(0) = . It remains to
prove the theorem under R > 0.
Similar to the proof of channel capacity, the codeword is randomly selected
according to some distribution P
e
X
n
. For xed R > 0, choose a sequence of
M
n

n1
with R = lim
n
(1/n) log M
n
.
Step 1: Maximum likelihood decoder. Let c
1
, . . . , c
Mn
A
n
denote the
set of n-tuple block codewords selected, and let the decoding partition for
symbol m (namely, the set of channel outputs that classify to m) be
|
m
y
n
: P
Y
n
[X
n(y
n
[c
m
) > P
Y
n
[X
n(y
n
[c
m
), for all m
t
,= m.
Those channel outputs that are on the boundary, i.e.,
for some m and m
t
, P
Y
n
[X
n(y
n
[c
m
) = P
Y
n
[X
n(y
n
[c
m
),
will be arbitrarily assigned to either m or m
t
.
Step 2: Property of indicator function for s 0. Let
m
be the indicator
function of |
m
. Then
1
for all s 0,
1
m
(y
n
)
_

m

,=m,1m

Mn
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/(1+s)
_
s
.
1
_

i
a
1/(1+s)
i
_
s
is no less than 1 when one of a
i
is no less than 1.
182
Step 3: Probability of error given codeword c
m
is transmitted. Let
P
e[m
denote the probability of error given codeword c
m
is transmitted.
Then
P
e[m

y
n
,|m
P
Y
n
[X
n(y
n
[c
m
)
=

y
n

n
P
Y
n
[X
n(y
n
[c
m
)[1
m
(y
n
)]

y
n

n
P
Y
n
[X
n(y
n
[c
m
)
_

m

,=m,1m

Mn
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/(1+s)
_
s
=

y
n

n
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_

m

,=m,1m

Mn
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
s
.
Step 4: Expectation of P
e[m
.
E[P
e[m
]
E
_

y
n

n
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_

m

,=m,1m

Mn
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
s
_
=

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_

m

,=m,1m

Mn
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
s
_
=

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
E
__

m

,=m,1m

Mn
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
s
_
,
where the latter step follows because c
m

1mM
are independent random
variables.
Step 5: Jensens inequality for 0 s 1, and bounds on probability
of error. By Jensens inequality, when 0 s 1,
E[t
s
] (E[t])
s
.
183
Therefore,
E[P
e[m
]

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
E
__

m

,=m,1m

Mn
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
s
_

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
_
E
_

m

,=m,1m

Mn
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
__
s
=

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
_

m

,=m,1m

Mn
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
_
s
Since the codewords are selected with identical distribution, the expecta-
tions
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
should be the same for each m. Hence,
E[P
e[m
]

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
_

m

,=m,1m

Mn
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_
_
s
=

y
n

n
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
_ _
(M
n
1)E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
__
s
= (M
n
1)
s

y
n

n
_
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
__
1+s
M
s
n

y
n

n
_
E
_
P
1/(1+s)
Y
n
[X
n
(y
n
[c
m
)
__
1+s
= M
s
n

y
n

n
_

x
n
.
n
P
e
X
n
(x
n
)P
1/(1+s)
Y
n
[X
n
(y
n
[x
n
)
_
1+s
Since the upper bound of E[P
e[m
] is no longer dependent on m, the ex-
pected average error probability E[P
e
] can certainly be bounded by the
same bound, namely,
E[P
e
] M
s
n

y
n

n
_

x
n
.
n
P
e
X
n
(x
n
)P
1/(1+s)
Y
n
[X
n
(y
n
[x
n
)
_
1+s
, (9.1.2)
which immediately implies the existence of an (n, M
n
) code satisfying:
P
e
(n, M
n
) M
s
n

y
n

n
_

x
n
.
n
P
e
X
n
(x
n
)P
1/(1+s)
Y
n
[X
n
(y
n
[x
n
)
_
1+s
.
184
By using the fact that P
e
X
n
and P
Y
n
[X
n are product distributions with
identical marginal, and taking logarithmic operation on both sides of the
above inequality, we obtain:

1
n
log P
e
(n, M
n
)
s
n
log M
n
log

y
_

x.
P
e
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
,
which implies
liminf
n

1
n
log P
e
(n, M
n
) sR log

y
_

x.
P
e
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
.
The proof is completed by noting that the lower bound holds for any s
[0, 1] and any P
e
X
. 2
9.1.1 The properties of random coding exponent
We re-formulate E
r
(R) to facilitate the understanding of its behavior as follows.
Denition 9.5 (random-coding exponent) The random coding exponent
for DMC with generic distribution P
Y [X
is dened by
E
r
(R) sup
0s1
[sR + E
0
(s)] ,
where
E
0
(s) sup
P
X
_
_
log

y
_

x.
P
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
_
_
.
Based on the reformulation, the properties of E
r
(R) can be realized via the
analysis of function E
0
(s).
Lemma 9.6 (properties of E
r
(R))
1. E
r
(R) is convex and non-increasing in R; hence, it is a strict decreasing
and continuous function of R [0, ).
2. E
r
(C) = 0, where C is channel capacity, and E
r
(C ) > 0 for all 0 < <
C.
3. There exists R
cr
such that for 0 < R < R
cr
, the slope of E
r
(R) is 1.
185
Proof:
1. E
r
(R) is convex and non-increasing, since it is the supremum of ane
functions sR+E
0
(s) with non-positive slope s. Hence, E
r
(R) is strictly
decreasing and continuous for R > 0.
2. sR + E
0
(s) equals zero at R = E
0
(s)/s for 0 < s 1. Hence by the
continuity of E
r
(R), E
r
(R) = 0 for R = sup
0<s1
E
0
(s)/s, and E
r
(R) > 0
for R < sup
0<s1
E
0
(s)/s. The property is then justied by:
sup
0<s1
1
s
E
0
(s) = sup
0<s1
sup
P
X
_
_

1
s
log

y
_

x.
P
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
_
_
= sup
P
X
sup
0<s1
_
_

1
s
log

y
_

x.
P
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
_
_
= sup
P
X
lim
s0
_
_

1
s
log

y
_

x.
P
X
(x)P
1/(1+s)
Y [X
(y[x)
_
1+s
_
_
= sup
P
X
I(X; Y ) = C.
3. The intersection between sR + E
0
(s) and R + E
0
(1) is
R
s
=
E
0
(1) E
0
(s)
1 s
;
and below R
s
, sR + E
0
(s) R + E
0
(1). Therefore,
R
cr
= inf
0s1
E
0
(1) E
0
(s)
1 s
.
So below R
cr
, the slope of E
r
(R) remains 1. 2
Example 9.7 For BSC with crossover probability , the random coding expo-
nent becomes
E
r
(R) = max
0p1
max
0s1
_
sR log
_
_
p(1 )
1/(1+s)
+ (1 p)
1/(1+s)
_
(1+s)
+
_
p
1/(1+s)
+ (1 p)(1 )
1/(1+s)
_
(1+s)
__
,
where (p, 1 p) is the input distribution.
Note that the input distribution achieving E
r
(R) is uniform, i.e., p = 1/2.
Hence,
E
0
(s) = s log(2) (1 + s) log
_

1/(1+s)
+ (1 )
1/(1+s)
_
186

@
@
@
@
@
@
@
1 p
p
1
0
1
0
1
1

Figure 9.1: BSC channel with crossover probability and input distri-
bution (p, 1 p).
and
R
cr
= inf
0s1
E
0
(1) E
0
(s)
1 s
= lim
s1
E
0
(s)
s
= log(2) log
_
+

1
_
+

log() +

1 log(1 )
2(

1 )
.
The random coding exponent for = 0.2 is depicted in Figure 9.2.
9.2 Expurgated exponent
The proof of the random coding exponent is based on the argument of random
coding, which selects codewords independently according to some input distribu-
tion. Since the codeword selection is unbiased, the good codewords (i.e., with
small P
e[m
) and bad codewords (i.e., with large P
e[m
) contribute the same to
the overall average error probability. Therefore, if, to some extent, we can ex-
purgate the contribution of the bad codewords, a better bound on channel
reliability function may be found.
In stead of randomly selecting M
n
codewords, expurgated approach rst
draws 2M
n
codewords to form a codebook (
2Mn
, and sorts these codewords
in ascending order in terms of P
e[m
( (
2Mn
), which is the error probability given
codeword m is transmitted. After that, it chooses the rst M
n
codewords (whose
P
e[m
( (
2Mn
) is smaller) to form a new codebook, (
Mn
. It can be expected that
for 1 m M
n
and M
n
< m
t
2M
n
,
P
e[m
( (
Mn
) P
e[m
( (
2Mn
) P
e[m
( (
2Mn
),
provided that the maximum-likelihood decoder is always employed at the receiver
side. Hence, a better codebook is obtained.
187
-1
0
0.1
0 R
cr
0.08 0.12 0.16 0.192745
C
E
r
(R)
s

Figure 9.2: Random coding exponent for BSC with crossover proba-
bility 0.2. Also plotted is s

= arg sup
0s1
[sR E
0
(s)]. R
cr
=
0.056633.
To further reduce the contribution of the bad codewords when comput-
ing the expected average error probability under random coding selection, a
better bound in terms of Lyapounovs inequality
2
is introduced. Note that by
Lyapounovs inequality,
E
1/
_
P

e[m
( (
2Mn
)
_
E
_
P
e[m
( (
2Mn
)

.
for 0 < 1.
Similar to the random coding argument, we need to claim the existence of
one good code whose error probability vanishes at least at the rate of expurgated
exponent. Recall that the random coding exponent comes to this conclusion by
the simple fact that E[P
e
] e
nEr(R)
(cf. (9.1.2)) implies the existence of code
with P
e
e
nEr(R)
. However, a dierent argument is adopted for the existence
of such good codes for expurgated exponent.
Lemma 9.8 (existence of good code for expurgated exponent) There
2
Lyapounovs inequality:
E
1/
[[X[

] E
1/
[[X[

] for 0 < .
188
exists an (n, M
n
) block code (
Mn
with
P
e[m
( (
Mn
) 2
1/
E
1/
_
P

e[m
( (
2Mn
)
_
, (9.2.1)
where (
2Mn
is randomly selected according to some distribution P
e
X
n
.
Proof: Let the random codebook (
2Mn
be denoted by
(
2Mn
= c
1
, c
2
, . . . , c
2Mn
.
Let () be the indicator function of the set
t 1 : t < 2
1/
E
1/
[P

e[m
( (
2Mn
)],
for some > 0. Hence, ( P
e[m
( (
2Mn
) ) = 1 if
P
e[m
( (
2Mn
) < 2
1/
E
1/
[P

e[m
( (
2Mn
)].
By Markovs inequality,
E
_

1m2Mn
(P
e[m
( (
2Mn
))
_
=

1m2Mn
E
_
(P
e[m
( (
2Mn
))

= 2M
n
E
_
(P
e[m
( (
2Mn
))

= 2M
n
Pr
_
P
e[m
( (
2Mn
) < 2
1/
E
1/
[P

e[m
( (
2Mn
)]
_
= 2M
n
Pr
_
P

e[m
( (
2Mn
) < 2E[P

e[m
( (
2Mn
)]
_
2M
n
_
1
E[P

e[m
( (
2Mn
)]
2E[P

e[m
( (
2Mn
)]
_
= M
n
.
Therefore, there exist at lease one codebook such that

1m2Mn
(P
e[m
( (
2Mn
)) M
n
.
Hence, by selecting M
n
codewords from this codebook with (P
e[m
( (
2Mn
)) = 1,
a new codebook is formed, and it is obvious that (9.2.1) holds for this new
codebook. 2
189
Denition 9.9 (expurgated exponent) The expurgated exponent for DMC
with generic distribution P
Y [X
is dened by
E
ex
(R) sup
s1
sup
P
X
_
sR s log

x.

.
P
X
(x)P
X
(x
t
)
_

y
_
P
Y [X
(y[x)P
Y [X
(y[x
t
)
_
1/s
_
_
.
Theorem 9.10 (expurgated exponent) For DMC with generic transition
probability P
Y [X
,
E

(R) E
ex
(R)
for R 0.
Proof: The theorem holds trivially at R = 0 since E

(0) = . It remains to
prove the theorem under R > 0.
Randomly select 2M
n
codewords according to some distribution P
e
X
n
. For
xed R > 0, choose a sequence of M
n

n1
with R = lim
n
(1/n) log M
n
.
Step 1: Maximum likelihood decoder. Let c
1
, . . . , c
2Mn
A
n
denote the
set of n-tuple block codewords selected, and let the decoding partition for
symbol m (namely, the set of channel outputs that classify to m) be
|
m
= y
n
: P
Y
n
[X
n(y
n
[c
m
) > P
Y
n
[X
n(y
n
[c
m
), for all m
t
,= m.
Those channel outputs that are on the boundary, i.e.,
for some m and m
t
, P
Y
n
[X
n(y
n
[c
m
) = P
Y
n
[X
n(y
n
[c
m
),
will be arbitrarily assigned to either m or m
t
.
Step 2: Property of indicator function for s = 1. Let
m
be the indicator
function of |
m
. Then for all s > 0.
1
m
(y
n
)
_

m

,=m
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/(1+s)
_
s
.
(Note that this step is the same as random coding exponent, except only
s = 1 is considered.) By taking s = 1, we have
1
m
(y
n
)
_

m

,=m
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/2
_
.
190
Step 3: Probability of error given codeword c
m
is transmitted. Let
P
e[m
( (
2Mn
) denote the probability of error given codeword c
m
is transmit-
ted. Then
P
e[m
( (
2Mn
)

y
n
,|m
P
Y
n
[X
n(y
n
[c
m
)
=

y
n

n
P
Y
n
[X
n(y
n
[c
m
)[1
m
(y
n
)]

y
n

n
P
Y
n
[X
n(y
n
[c
m
)
_

m

,=m,1m

2Mn
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/2
_

y
n

n
P
Y
n
[X
n(y
n
[c
m
)
_

1m

2Mn
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/2
_
=

1m

2Mn
_

y
n

n
P
Y
n
[X
n(y
n
[c
m
)
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/2
_
=

1m

2Mn
_

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_
Step 4: Standard inequality for s
t
= 1/ 1. It is known that for any
0 < = 1/s
t
1, we have
_

i
a
i
_

i
a

i
,
for all non-negative sequence
3
a
i
.
3
Proof: Let f() = (

i
a

i
) /
_

j
a
j
_

. Then we need to show f() 1 when 0 < 1.


Let p
i
= a
i
/(

k
a
k
), and hence, a
i
= p
i
(

k
a
k
). Take it to the numerator of f():
f() =

i
p

i
(

k
a
k
)

j
a
j
)

i
p

i
Now since f()/ =

i
log(p
i
)p

i
0 (which implies f() is non-increasing in ) and
f(1) = 1, we have the desired result. 2
191
Hence,
P

e[m
( (
2Mn
)
_

1m

2Mn
_

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
__

1m

2Mn
_

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_

Step 5: Expectation of P

e[m
( (
2Mn
).
E[P

e[m
( (
2Mn
)]
E
_

1m

2Mn
_

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_

_
=

1m

2Mn
E
__

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_

_
= 2M
n
E
__

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_

_
.
Step 6: Lemma 9.8. From Lemma 9.8, there exists one codebook with size
M
n
such that
P
e[m
( (
Mn
)
2
1/
E
1/
[P

e[m
( (
2Mn
)]
2
1/
_
2M
n
E
__

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_

__
1/
= (4M
n
)
1/
E
1/
__

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_

_
= (4M
n
)
s

E
s

_
_
_

y
n

n
_
P
Y
n
[X
n(y
n
[c
m
)P
Y
n
[X
n(y
n
[c
m
)
_
1/s

_
_
= (4M
n
)
s

_
_

x
n
.
n

(x

)
n
.
n
P
e
X
n
(x
n
)P
e
X
n
((x
t
)
n
)

_

y
n

n
_
P
Y
n
[X
n(y
n
[x
n
)P
Y
n
[X
n(y
n
[(x
t
)
n
)
_
1/s

_
_
s

.
192
By using the fact that P
e
X
n
and P
Y
n
[X
n are product distributions with iden-
tical marginal, and taking logarithmic operation on both sides of the above
inequality, followed by taking the liminf operation, the proof is completed
by the fact that the lower bound holds for any s
t
1 and any P
e
X
. 2
As a result of the expurgated exponent obtained, it only improves the random
coding exponent at lower rate. It is even worse than random coding exponent
for rate higher R
cr
. One possible reason for its being worse at higher rate may
be due to the replacement of the indicator-function argument
1
m
(y
n
)
_

m

,=m
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/(1+s)
_
s
.
by
1
m
(y
n
)
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/2
.
Note that
min
s>0
_

m

,=m
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/(1+s)
_
s

,=m
_
P
Y
n
[X
n(y
n
[c
m
)
P
Y
n
[X
n(y
n
[c
m
)
_
1/2
.
The improvement of expurgated bound over random coding bound at lower
rate may also reveal the possibility that the contribution of bad codes under
unbiased random coding argument is larger at lower rate; hence, suppressing the
weights from bad codes yield better result.
9.2.1 The properties of expurgated exponent
Analysis of E
ex
(R) is very similar to that of E
r
(R). We rst re-write its formula.
Denition 9.11 (expurgated exponent) The expurgated exponent for DMC
with generic distribution P
Y [X
is dened by
E
ex
(R) max
s1
[sR + E
x
(s)] ,
where
E
x
(s) max
P
X
_
_
s log

x.

.
P
X
(x)P
X
(x
t
)
_

y
_
P
Y [X
(y[x)P
Y [X
(y[x
t
)
_
1/s
_
_
.
193
The properties of E
ex
(R) can be realized via the analysis of function E
x
(s)
as follows.
Lemma 9.12 (properties of E
ex
(R))
1. E
ex
(R) is non-increasing.
2. E
ex
(R) is convex in R. (Note that the rst two properties imply that E
ex
(R)
is strictly decreasing.)
3. There exists R
cr
such that for R > R
cr
, the slope of E
r
(R) is 1.
Proof:
1. Again, the slope of E
ex
(R) is equal to the maximizer s

for E
ex
(R) times
1. Therefore, the slope of E
ex
(R) is non-positive, which certainly implies
the desired property.
2. Similar to the proof of the same property for random coding exponent.
3. Similar to the proof of the same property for random coding exponent. 2
Example 9.13 For BSC with crossover probability , the expurgated exponent
becomes
E
ex
(R)
= max
1p0
max
s1
_
sR s log
_
p
2
+ 2p(1 p)
_
2
_
(1 )
_
1/s
+ (1 p)
2
__
where (p, 1 p) is the input distribution. Note that the input distribution
achieving E
ex
(R) is uniform, i.e., p = 1/2. The expurgated exponent, as well as
random coding exponent, for = 0.2 is depicted in Figures 9.3 and 9.4.
9.3 Partitioning bound: an upper bounds for channel re-
liability
In the previous sections, two lower bounds of the channel reliability, random
coding bound and expurgated bound, are discussed. In this section, we will start
to introduce the upper bounds of the channel reliability function for DMC, which
includes partitioning bound, sphere-packing bound and straight-line bound.
194
0
0.02
0.04
0.06
0.08
0.1
0.12
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.192745
C
Figure 9.3: Expurgated exponent (solid line) and random coding ex-
ponent (dashed line) for BSC with crossover probability 0.2 (over the
range of (0, 0.192745)).
The partitioning bound relies heavily on theories of binary hypothesis testing.
This is because the receiver end can be modeled as:
H
0
: c
m
= codeword transmitted
H
1
: c
m
,= codeword transmitted,
where m is the nal decision made by receiver upon receipt of some channel
outputs. The channel decoding error given that codeword m is transmitted
becomes the type II error, which can be computed using the theory of binary
hypothesis testing.
Denition 9.14 (partitioning bound) For a DMC with marginal P
Y [X
,
E
p
(R) max
P
X
min
P
e
Y |X
: I(P
X
,P
e
Y |X
)R
D(P
e
Y [X
|P
Y [X
[P
X
).
One way to explain the above denition of partitioning bound is as follows.
For a given i.i.d. source with marginal P
X
, to transmit it into a dummy channel
with rate R and dummy capacity I(P
X
, P
e
Y [X
) R is expected to have poor
performance (note that we hope the code rate to be smaller than capacity).
We can expect that those bad noise patterns whose composition transition
probability P
e
Y [X
satises I(P
X
, P
e
Y [X
) R will contaminate the codewords more
195
0.1
0.105
0.11
0 0.001 0.002 0.003 0.004 0.005 0.006
Figure 9.4: Expurgated exponent (solid line) and random coding ex-
ponent (dashed line) for BSC with crossover probability 0.2 (over the
range of (0, 0.006)).
seriously than other noise patterns. Therefore for any codebook with rate R, the
probability of decoding error is lower bounded by the probability of these bad
noise patterns, since the bad noise patterns anticipate to unrecoverably distort
the codewords. Accordingly, the channel reliability function is upper bounded
by the exponent of the probability of these bad noise patterns.
Sanovs theorem already reveal that for any set consisting of all elements with
the same compositions, its exponent is equal to the minimum divergence of the
composition distribution against the true distribution. This is indeed the case
for the set of bad noise patterns. Therefore, the exponent of the probability
of these bad noise patterns is given as:
min
P
e
Y |X
: I(P
X
,P
e
Y |X
)R
D(P
e
Y [X
|P
Y [X
[P
X
).
We can then choose the best input source to yield the partitioning exponent.
The partitioning exponent can be easily re-formulated in terms of divergences.
This is because the mutual information can actually be written in the form of
196
divergence.
I(P
X
, P
e
Y [X
)

x

y
P
X
(x)P
e
Y [X
(y[x) log
P
X
e
Y
(x, y)
P
X
(x)P
e
Y
(y)
=

x
P
X
(x)

y
P
e
Y [X
(y[x) log
P
e
Y [X
(y[x)
P
e
Y
(y)
=

x
P
X
(x)D(P
e
Y [X
|P
e
Y
[X = x)
= D(P
e
Y [X
|P
e
Y
[P
X
)
Hence, the partitioning exponent can be written as:
E
p
(R) max
P
X
min
P
e
Y |X
: D(P
e
Y |X
|P
e
Y
[P
X
)R
D(P
e
Y [X
|P
Y [X
[P
X
).
In order to apply the theory of hypothesis testing, the partitioning bound needs
to be further re-formulated.
Observation 9.15 (partitioning bound) If P
e
X
and P
e
Y [X
are distributions
that achieves E
p
(R), and P
e
Y
(y) =

x.
P
e
X
(x)P
e
Y [X
(y[x), then
E
p
(R) max
P
X
min
P

Y |X
: D(P

Y |X
|P
e
Y
[P
X
)R
D(P

Y [X
|P
Y [X
[P
X
).
In addition, the distribution P

Y [X
that achieves E
p
(R) is a tilted distribution
between P
Y [X
and P
e
Y
, i.e.,
E
p
(R) max
P
X
D(P
Y

[X
|P
Y [X
[P
X
),
where P
Y

[X
is a tilted distribution
4
between P
Y [X
and P
e
Y
, and is the solution
of
D(P
Y

|P
e
Y
[P
X
) = R.
Theorem 9.16 (partitioning bound) For a DMC with marginal P
Y [X
, for
any > 0 arbitrarily small,
E(R + ) E
p
(R).
4
P
Y

|X
(y[x)
P

e
Y
(y)P
1
Y |X
(y[x)

Y
P

e
Y
(y

)P
1
Y |X
(y

[x)
.
197
Proof:
Step 1: Hypothesis testing. Let code size M
n
= 2e
nR
| e
n(R+)
for n >
ln2/, and let c
1
, . . . , c
Mn
be the best codebook. (Note that using a
code with smaller size should yield a smaller error probability, and hence,
a larger exponent. So, if this larger exponent is upper bounded by E
p
(R),
the error exponent for code size expR + should be bounded above by
E
p
(R).)
Suppose that P
e
X
and P
e
Y [X
are the distributions achieving E
p
(R). Dene
P
e
Y
(y) =

x.
P
e
X
(x)P
e
Y [X
(y[x). Then for any (mutually-disjoint) decod-
ing partitions at the output site, |
1
, |
2
, . . . , |
Mn
,

1m

Mn
P
e
Y
n
(|
m
) = 1,
which implies the existence of m satisfying
P
e
Y
n
(|
m
)
2
M
n
.
(Note that there are at least M
n
/2 codewords satisfying this condition.)
Construct the binary hypothesis testing problem of
H
0
: P
e
Y
n
() against H
1
: P
Y
n
[X
n([c
m
),
and choose the acceptance region for null hypothesis as |
c
m
(i.e., the ac-
ceptance region for alternative hypothesis is |
m
), the type I error

n
Pr(|
m
[H
0
) = P
e
Y
n
(|
m
)
2
M
n
,
and the type II error

n
Pr(|
c
m
[H
1
) = P
Y
n
[X
n(|
c
m
[c
m
),
which is exactly, P
e[m
, the probability of decoding error given codeword c
m
is transmitted. We therefore have
P
e[m
min
n2/Mn

n
min
ne
nR

n
.
_
Since
2
M
n
e
nR
_
198
Then from Theorem 8.34, the best exponent
5
for P
e[m
is
limsup
n
1
n
n

i=1
D(P
Y
,i
|P
Y [X
([a
i
)), (9.3.1)
where a
i
is the i
th
component of c
m
, P
Y
,i
is the tilted distribution between
P
Y [X
([a
i
) and P
e
Y
, and
liminf
n
1
n
n

i=1
D(P
Y
,i
|P
e
Y
) = R.
If we denote the composition distribution with respect to c
m
as P
(cm)
, and
observe that the number of dierent tilted distributions (i.e., P
Y
,i
) is at
most [A[, the quantity of (9.3.1) can be further upper bounded by
limsup
n
1
n
n

i=1
D(P
Y
,i
|P
Y [X
([a
i
))
limsup
n

x.
P
(cm)
(x)D(P
Y
,x
|P
Y [X
([x))
max
P
X
D(P
Y

|P
Y [X
[P
X
)
= D(P
Y

|P
Y [X
[P
e
X
)
= E
p
(R). (9.3.2)
Besides, D(P
Y

|P
e
Y
[P
e
X
) = R.
Step 2: A good code with rate R. Within the M
n
2e
nR
codewords, there
are at least M
n
/2 codewords satisfying
P
e
Y
n
(|
m
)
2
M
n
,
and hence, their P
e[m
should all satisfy (9.3.2). Consequently, if
B m : those m satisfying step 1,
5
According to Theorem 8.34, we need to consider the quantile function of the CDF of
1
n
log
dP
Y
n

dP
Y
n
|X=cm
(Y
n
).
Since both P
Y
n
|X
([c
m
) and P
e
Y
n
are product distributions, P
Y
n

is a product distribution,
too. Therefore, by the similar reason as in Example 5.13, the CDF becomes a bunch of unit
functions (see Figure 2.4), and its largest quantile becomes (9.3.1).
199
then
P
e

1
M
n

1mMn
P
e[m

1
M
n

mB
P
e[m

1
M
n

mB
min
mB
P
e[m
=
1
M
n
[B[ min
mB
P
e[m

1
2
min
mB
P
e[m
=
1
2
P
e[m
, (m
t
is the minimizer.)
which implies
E(R + ) limsup
n

1
n
log P
e
limsup
n

1
n
log P
e[m
E
p
(R).
2
The partitioning upper bound can be re-formulated to shape similar as ran-
dom coding lower bound.
Lemma 9.17
E
p
(R) = max
P
X
max
s0
_
_
sR log

y
_

x.
P
X
(x)P
Y [X
(y[x)
1/(1+s)
_
1+s
_
_
= max
s0
[sR E
0
(s)] .
Recall that the random coding exponent is max
0<s1
[sR E
0
(s)], and the
optimizer s

is the slope of E
r
(R) times 1. Hence, the channel reliability E(R)
satises
max
0<s1
[sR E
0
(s)] E(R) max
s0
[sR E
0
(s)] ,
and for optimizer s

(0, 1] (i.e., R
cr
R C), the upper bound meets the
lower bound.
200
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.192745
Figure 9.5: Partitioning exponent (thick line), random coding exponent
(thin line) and expurgated exponent (thin line) for BSC with crossover
probability 0.2.
9.4 Sphere-packing exponent: an upper bound of the
channel reliability
9.4.1 Problem of sphere-packing
For a given space / and a given distance measures d(, ) for elements in /, a
ball or sphere centered at a with radius r is dened as
b / : d(b, a) r.
The problem of sphere-packing is to nd the minimum radius if M spheres need
to be packed into the space /. Its dual problem is to nd the maximum number
of balls with xed radius r that can be packed into space /.
9.4.2 Relation of sphere-packing and coding
To nd the best codebook which yields minimum decoding error is one of the
main research issues in communications. Roughly speaking, if two codewords are
similar, they should be more vulnerable to noise. Hence, a good codebook should
be a set of codewords, which look very dierent from others. In mathematics,
such codeword resemblance can be modeled as a distance function. We can
201
then say if the distance between two codewords is large, they are more dierent,
and more robust to noise or interference. Accordingly, a good code book becomes
a set of codewords whose minimum distance among codewords is largest. This
is exactly the sphere-packing problem.
Example 9.18 (Hamming distance versus BSC) For BSC, the source al-
phabet and output alphabet are both 0, 1
n
. The Hamming distance between
two elements in 0, 1 is given by
d
H
(x, x) =
_
0, if x = x;
1, if x ,= x.
Its extension denition to n-tuple is
d
H
(x
n
, x
n
) =
n

i=1
d
H
(x
i
, x
i
).
It is known that the best decoding rule is the maximum likelihood ratio
decoder, i.e.,
(y
n
) = m, if P
Y
n
[X
n(y
n
[c
m
) P
Y
n
[X
n(y
n
[c
m
) for all m
t
.
Since for BSC with crossover probability ,
P
Y
n
[X
n(y
n
[c
m
) =
d
H
(y
n
,cm)
(1 )
nd
H
(y
n
,cm)
= (1 )
n
_

1
_
d
H
(y
n
,cm)
. (9.4.1)
Therefore, the best decoding rule can be re-written as:
(y
n
) = m, if d
H
(y
n
, c
m
) d
H
(y
n
, c
m
) for all m
t
.
As a result, if two codewords are too close in Hamming distance, the number of
bits of outputs that can be used to classify its origin will be less, and therefore,
will result in a poor performance.
From the above example, two observations can be made. First, if the dis-
tance measure between codewords can be written as a function of the transition
probability of channel, such as (9.4.1), one may regard the probability of decod-
ing error with the distances between codewords. Secondly, the coding problem
in some cases can be reduced to a sphere-packing problem. As a consequence,
solution of the sphere-packing problem can be used to characterize the error pro-
bability of channels. This can be conrmed by the next theorem, which shows
that an upper bound on the channel reliability can be obtained in terms of the
largest minimum radius among M disjoint spheres in a code space.
202
Theorem 9.19 Let
n
(, ) be the Bhattacharya distance
6
between two elements
in A
n
. Denote by d
n,M
the largest minimum distance among M selected code-
words of length n. (Obviously, the largest minimum radius among M disjoint
spheres in a code space is half of d
n,M
.) Then
limsup
n

1
n
log Pe(n, R) limsup
n
1
n
d
n,M=e
nR.
Proof:
Step 1: Hypothesis testing. For any code
c
0
, c
1
, . . . , c
M1

given, we can form a maximum likelihood partitions at the output as


|
1
, |
2
, . . . , |
M
, which is known to be optimal. Let /
m,m
be the optimal
acceptance region for alternative hypothesis under equal prior for
testing H
0
: P
Y
n
[X
n([c
m
) against H
1
: P
Y
n
[X
n([c
m
),
and denote by P
e[m
the error probability given codeword m is transmitted.
Then
P
e[m
= P
Y
n
[X
n(|
c
m
[c
m
) P
Y
n
[X
n(/
c
m,m
[c
m
),
where the superscript c represents the set complementary operation
(cf. Figure 9.6). Consequently,
P
e[m
P
Y
n
[X
n(/
c
m,m
[c
m
) exp
_
D
_
P
Y
n

|P
Y
n
[X
n([c
m
)
_
+ o(n)
7
_
6
The Bhattacharya distance (for channels P
Y
n
|X
n) between elements x
n
and x
n
is dened
by

n
(x
n
, x
n
) log

y
n
Y
n
_
P
Y
n
|X
n(y
n
[x
n
)P
Y
n
|X
n(y
n
[ x
n
).
7
This is a special notation, introduced in 1909 by E. Landau. Basically, there are two of
them: one is little-o notation and the other is big-O notation. They are respectively dened
as follows, based on the assumption that g(x) > 0 for all x in some interval containing a.
Denition 9.20 (little-o notation) The notation f(x) = o(g(x)) as x a means that
lim
xa
f(x)
g(x)
= 0.
Denition 9.21 (big-O notation) The notation f(x) = O(g(x)) as x a means that
lim
xa

f(x)
g(x)

< .
203
and
P
e[ m
P
Y
n
[X
n(/
m,m
[c
m
) exp
_
D
_
P
Y
n

|P
Y
n
[X
n([c
m
)
_
+ o(n)
_
,
where P
Y
n

is the tilted distribution between P


Y
n
[X
n([c
m
) and P
Y
n
[X
n([c
m
)
with
D(P
Y
n

|P
Y
n
[X
n([c
m
)) = D(P
Y
n

|P
Y
n
[X
n([c
m
)) =
n
(c
m
, c
m
),
and D(|) is the divergence (or Kullback-Leibler divergence). We thus
have
P
e[ m
+ P
e[m
2e
n(c
m
,cm)+o(n)
.
Note that the above inequality holds for any m and m with m ,= m.
Step 2: Largest minimum distance. By the denition of d
n,M
, there exists
an ( m, m) pair for the above code such that

n
(c
m
, c
m
) d
n,M
,
which implies
( ( m, m)) P
e[ m
+ P
e[m
2e
d
n,M
+o(n)
.
Step 3: Probability of error. Suppose we have found the optimal code with
size 2M = e
n(R+log 2/n)
, which minimizes the error probability. Index the
codewords in ascending order of P
e[m
, namely
P
e[0
P
e[1
. . . P
e[2M1
.
Form two new codebooks as
(
1
= c
0
, c
1
, . . . , c
M1
and (
2
= c
M
, . . . , c
2M1
.
Then, from steps 1 and 2, there exists at least one pair of codewords
(c
m
, c
m
) in (
1
such that
P
e[ m
+ P
e[m
2e
d
n,M
+o(n)
.
Since for all c
i
in (
2
P
e[i
maxP
e[m
, P
e[ m
,
and hence,
P
e[i
+ P
e[j
P
e[ m
+ P
e[m
2e
d
n,M
+o(n)
204
for any c
i
and c
j
in (
2
. Accordingly,
P
e
(n, R + log(2)/n) =
1
8M
2
2M1

i=0
2M1

j=0
(P
e[i
+ P
e[j
)

1
8M
2
2M1

i=M
2M1

j=M,j,=i
(P
e[i
+ P
e[j
)

1
8M
2
2M1

i=M
2M1

j=M,j,=i
(P
e[ m
+ P
e[m
)

1
8M
2
2M1

i=M
2M1

j=M,j,=i
2e
d
n,M
+o(n)
=
M 1
4M
e
d
n,M
+o(n)
,
which immediately implies the desired result. 2
Since, according the above theorem, the largest minimum distance can be
used to formulate an upper bound on channel reliability, this quantity becomes
essential. We therefore introduce its general formula in the next subsection.
9.4.3 The largest minimum distance of block codes
The ultimate capabilities and limitations of error correcting codes are quite im-
portant, especially for code designers who want to estimate the relative ecacy
of the designed code. In fairly general situations, this information is closely re-
lated to the largest minimum distance of the codes [16]. One of the examples is
that for binary block code employing the Hamming distance, the error correcting
capability of the code is half of the minimum distance among codewords. Hence,
the knowledge of the largest minimum distance can be considered as a reference
of the optimal error correcting capability of codes.
The problem on the largest minimum distance can be described as follows.
Over a given code alphabet, and a given measurable function on the distance
between two code symbols, determine the asymptotic ratio, the largest minimum
distance attainable among M selected codewords divided by the code blocklength
n, as n tends to innity, subject to a xed rate R log(M)/n.
Research on this problem have been done for years. In the past, only bounds
on this ratio were established. The most well-known bound on this problem
is the Varshamov-Gilbert lower bound, which is usually derived in terms of a
combinatorial approximation under the assumption that the code alphabet is
205
nite and the measure on the distance between code letters is symmetric [14].
If the size of the code alphabet, q, is an even power of a prime, satisfying q 49,
and the distance measure is the Hamming distance, a better lower bound can
be obtained through the construction of the Algebraic-Geometric code [11, 19],
of which the idea was rst proposed by Goppa. Later, Zinoviev and Litsyn
proved that a better lower bound than the Varshamov-Gilbert bound is actually
possible for any q 46 [21]. Other improvements of the bounds can be found in
[9, 15, 20].
In addition to the combinatorial techniques, some researchers also apply the
probabilistic and analytical methodologies to this problem. For example, by
means of the random coding argument with expurgation, the Varshamov-Gilbert
bound in its most general form can be established by simply using the Chebyshev
inequality ([3] or cf. Appendix A), and restrictions on the code alphabet (such
as nite, countable, . . .) and the distance measure (such as additive, symmetric,
bounded, . . .) are no longer necessary for the validity of its proof.
As shown in the Chapter 5, channels without statistical assumptions such
as memoryless, information stability, stationarity, causality, and ergodicity, . . .,
etc. are successfully handled by employing the notions of liminf in probability
and limsup in probability of the information spectrum. As a consequence, the
channel capacity C is shown to equal the supremum, over all input processes,
of the input-output inf-information rate dened as the liminf in probability of
the normalized information density [12]. More specically, given a channel W =
W
n
= P
Y
n
[X
n

n=1
,
C = sup
X
sup a 1 :
limsup
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) < a
_
= 0
_
,
where X
n
and Y
n
are respectively the n-fold input process drawn from
X = X
n
= (X
(n)
1
, . . . , X
(n)
n
)

n=1
and the corresponding output process induced by X
n
via the channel W
n
=
P
Y
n
[X
n, and
1
n
i
X
n
W
n(x
n
; y
n
)
1
n
log
P
Y
n
[X
n(y
n
[x
n
)
P
Y
n(y
n
)
is the normalized information density. If the conventional denition of chan-
nel capacity, which requires the existence of reliable block codes for all su-
ciently large blocklengths, is replaced by that reliable codes exist for innitely
many blocklengths, a new optimistic denition of capacity

C is obtained [12]. Its
206
information-spectrum expression is then given by [8, 7]

C = sup
X
sup a 1 :
liminf
n
Pr
_
1
n
i
X
n
W
n(X
n
; Y
n
) < a
_
= 0
_
.
Inspired by such probabilistic methodology, together with random-coding
scheme with expurgation, a spectrum formula on the largest minimum distance
of deterministic block codes for generalized distance functions
8
(not necessarily
additive, symmetric and bounded) is established in this work. As revealed in the
formula, the largest minimum distance is completely determined by the ultimate
statistical characteristics of the normalized distance function evaluated under
a properly chosen random-code generating distribution. Interestingly, the new
formula has an analogous form to the general information-spectrum expressions
of the channel capacity and the optimistic channel capacity. This somehow
conrms the connection between the problem of designing a reliable code for a
given channel and that of nding a code with suciently large distance among
codewords, if the distance function is properly dened in terms of the channel
statistics.
With the help of the new formula, we characterize a minor class of distance
metrics for which the ultimate largest minimum distance among codewords can
be derived. Although these distance functions may be of secondary interest, it
sheds some light on the determination of largest minimum distance for a more
general class of distance functions. Discussions on the general properties of the
new formula will follow.
We next derive a general Varshamov-Gilbert lower bound directly from the
new distance-spectrum formula. Some remarks regarding to its properties are
given. A sucient condition under which the general Varshamov-Gilbert bound
is tight, as well as examples to demonstrate its strict inferiority to the distance-
spectrum formula, is also provided.
Finally, we demonstrate that the new formula can be used to derive the
known lower bounds for a few specic block coding schemes of general interests,
8
Conventionally, a distance or metric [13][18, pp. 139] should satisfy the properties of i)
non-negativity; ii) being zero if, and only if, two points coincide; iii) symmetry; and iv) triangle
inequality. The derivation in this paper, however, is applicable to any measurable function
dened over the code alphabets. Since none of the above four properties are assumed, the
measurable function on the distance between two code letters is therefore termed generalized
distance function. One can, for example, apply our formula to situation where the code
alphabet is a distribution space, and the distance measure is the Kullback-Leibler divergence.
For simplicity, we will abbreviate the generalized distance function simply as the distance
function in the remaining part of the article.
207
such as constant weight codes and the codes that corrects arbitrary noise. Trans-
formation of the asymptotic distance determination problem into an alternative
problem setting over a graph for a possible improvement of these known bounds
are also addressed.
A) Distance-spectrum formula on the largest minimum distance of
block codes
We rst introduce some notations. The n-tuple code alphabet is denoted by A
n
.
For any two elements x
n
and x
n
in A
n
, we use
n
( x
n
, x
n
) to denote the n-fold
measure on the distance of these two elements. A codebook with block length
n and size M is represented by
(
n,M

_
c
(n)
0
, c
(n)
1
, c
(n)
2
, . . . , c
(n)
M1
_
,
where c
(n)
m
(c
m1
, c
m2
, . . . , c
mn
), and each c
mk
belongs to A. We dene the
minimum distance
d
m
( (
n,M
) min
0 mM1
m=m

n
_
c
(n)
m
, c
(n)
m
_
,
and the largest minimum distance
d
n,M
max
(
n,M
min
0mM1
d
m
( (
n,M
).
Note that there is no assumption on the code alphabet A and the sequence of
the functions
n
(, )
n1
.
Based on the above denitions, the problem considered in this paper becomes
to nd the limit, as n , of d
n,M
/n under a xed rate R = log(M)/n. Since
the quantity is investigated as n goes to innity, it is justied to take M = e
nR
as integers.
The concept of our method is similar to that of the random coding technique
employed in the channel reliability function [2]. Each codeword is assumed to be
selected independently of all others from A
n
through a generic distribution P
X
n.
Then the distance between codewords c
(n)
m
and c
(n)
m
becomes a random variable,
and so does d
m
( (
n,M
). For clarity, we will use D
m
to denote the random variable
corresponding to d
m
( (
n,M
). Also note that D
m

M1
m=0
are identically distributed.
We therefore have the following lemma.
Lemma 9.22 Fix a triangular-array random process
X =
_
X
n
=
_
X
(n)
1
, . . . , X
(n)
n
__

n=1
.
208
Let D
m
= D
m
(X
n
), 0 m (M 1), be dened for the random codebook
of block length n and size M, where each codeword is drawn independently
according to the distribution P
X
n. Then
1. for any > 0, there exists a universal constant = () (0, 1) (in-
dependent of blocklength n) and a codebook sequence (
n,M

n1
such
that
1
n
min
0mM1
d
m
( (
n,M
)
> inf
_
a 1 : limsup
n
Pr
_
1
n
D
m
> a
_
= 0
_
, (9.4.2)
for innitely many n;
2. for any > 0, there exists a universal constant = () (0, 1) (in-
dependent of blocklength n) and a codebook sequence (
n,M

n1
such
that
1
n
min
0mM1
d
m
( (
n,M
)
> inf
_
a 1 : liminf
n
Pr
_
1
n
D
m
> a
_
= 0
_
, (9.4.3)
for suciently large n.
Proof: We will only prove (9.4.2). (9.4.3) can be proved by simply following the
same procedure.
Dene

L
X
(R) inf
_
a : limsup
n
Pr
_
1
n
D
m
> a
_
= 0
_
.
Let 1(/) be the indicator function of a set /, and let

m
1
_
1
n
D
m
>

L
X
(R)
_
.
By denition of

L
X
(R),
limsup
n
Pr
_
1
n
D
m
>

L
X
(R)
_
> 0.
Let 2 limsup
n
Pr
_
(1/n)D
m
>

L
X
(R)

. Then for innitely many n,


Pr
_
1
n
D
m
>

L
X
(R)
_
> .
209
For those n that satises the above inequality,
E
_
M1

m=0

m
_
=
M1

m=0
E [
m
] > M,
which implies that among all possible selections, there exist (for innite many
n) a codebook (
n,M
in which M codewords satisfy
m
= 1, i.e.,
1
n
d
m
( (
n,M
) >

L
X
(R)
for at least M codewords in the codebook (
n,M
. The collection of these M
codewords is a desired codebook for the validity of (9.4.2). 2
Our second lemma concerns the spectrum of (1/n)D
m
.
Lemma 9.23 Let each codeword be independently selected through the distri-
bution P
X
n. Suppose that

X
n
is independent of, and has the same distribution
as, X
n
. Then
Pr
_
1
n
D
m
> a
_

_
Pr
_
1
n

n
_

X
n
, X
n
_
> a
__
M
.
Proof: Let C
(n)
m
denote the m-th randomly selected codeword. From the de-
nition of D
m
, we have
Pr
_
1
n
D
m
> a

C
(n)
m
_
= Pr
_
min
0 mM1
m=m
1
n

n
_
C
(n)
m
, C
(n)
m
_
> a

C
(n)
m
_
=

0 mM1
m=m
Pr
_
1
n

n
_
C
(n)
m
, C
(n)
m
_
> a

C
(n)
m
_
(9.4.4)
=
_
Pr
_
1
n

n
(

X
n
, X
n
) > a

X
n
__
M1
,
where (9.4.4) holds because
_
1
n

n
_
C
(n)
m
, C
(n)
m
_
_
0 mM1
m=m
210
is conditionally independent given C
(n)
m
. Hence,
Pr
_
1
n
D
m
> a
_
=
_
.
n
_
Pr
_
1
n

n
(

X
n
, x
n
) > a
__
M1
dP
X
n(x
n
)

_
.
n
_
Pr
_
1
n

n
(

X
n
, x
n
) > a
__
M
dP
X
n(x
n
)
= E
X
n
_
_
Pr
_
1
n

n
_

X
n
, X
n
_
> a

X
n
__
M
_
E
M
X
n
_
Pr
_
1
n

n
_

X
n
, X
n
_
> a

X
n
__
(9.4.5)
=
_
Pr
_
1
n

n
_

X
n
, X
n
_
> a
__
M
,
where (9.4.5) follows from Lyapounovs inequality [1, page 76], i.e., E
1/M
[U
M
]
E[U] for a non-negative random variable U. 2
We are now ready to prove the main theorem of the paper. For simplicity,
throughout the article,

X
n
and X
n
are used specically to denote two indepen-
dent random variables having common distribution P
X
n.
Theorem 9.24 (distance-spectrum formula)
sup
X

X
(R) limsup
n
d
n,M
n
sup
X

X
(R + ) (9.4.6)
and
sup
X

X
(R) liminf
n
d
n,M
n
sup
X

X
(R + ) (9.4.7)
for every > 0, where

X
(R) inf a 1 :
limsup
n
_
Pr
_
1
n

n
(

X
n
, X
n
) > a
__
M
= 0
_
and

X
(R) inf a 1 :
liminf
n
_
Pr
_
1
n

n
(

X
n
, X
n
) > a
__
M
= 0
_
.
211
Proof:
1. lower bound. Observe that in Lemma 9.22, the rate only decreases by the
amount log()/n when employing a code (
n,M
. Also note that for any
> 0, log()/n < for suciently large n. These observations, together
with Lemma 9.23, imply the validity of the lower bound.
2. upper bound. Again, we will only prove (9.4.6), since (9.4.7) can be proved
by simply following the same procedure.
To show that the upper bound of (9.4.6) holds, it suces to prove the
existence of X such that

X
(R) limsup
n
d
n,M
n
.
Let X
n
be uniformly distributed over one of the optimal codes (

n,M
. (By
optimal, we mean that the code has the largest minimum distance among
all codes of the same size.) Dene

n

1
n
min
0mM1
d
m
( (

n,M
) and

limsup
n

n
.
Then for any > 0,

n
>

for innitely many n. (9.4.8)
For those n satisfying (9.4.8),
Pr
_
1
n

n
_

X
n
, X
n
_
>


_
Pr
_
1
n

n
_

X
n
, X
n
_

n
_
Pr
_

X
n
,= X
n
_
= 1
1
M
,
which implies
limsup
n
_
Pr
_
1
n

n
(

X
n
, X
n
) >


__
M
limsup
n
_
1
1
M
_
M
= limsup
n
_
1
1
e
nR
_
e
nR
= e
1
> 0.
Consequently,

X
(R)

. Since can be made arbitrarily small, the
upper bound holds. 2
212
Observe that sup
X

X
(R) is non-increasing in R, and hence, the number of
discontinuities is countable. This fact implies that
sup
X

X
(R) = lim
0
sup
X
(R + )
except for countably many R. Similar argument applies to sup
X

X
(R). We
can then re-phrase the above theorem as appeared in the next corollary.
Corollary 9.25
limsup
n
d
n,M
n
= sup
X

X
(R)
_
resp. liminf
n
d
n,M
n
= sup
X

X
(R)
_
except possibly at the points of discontinuities of
sup
X

X
(R) (resp. sup
X

X
(R)),
which are countable.
From the above theorem (or corollary), we can characterize the largest mini-
mum distance of deterministic block codes in terms of the distance spectrum. We
thus name it distance-spectrum formula. For convenience,

X
() and
X
() will
be respectively called the sup-distance-spectrum function and the inf-distance-
spectrum function in the remaining article.
We conclude this section by remarking that the distance-spectrum formula
obtained above indeed have an analogous form to the information-spectrum for-
mulas of the channel capacity and the optimistic channel capacity. Furthermore,
by taking the distance metric to be the n-fold Bhattacharyya distance [2, De-
nition 5.8.3], an upper bound on channel reliability [2, Theorem 10.6.1] can be
obtained, i.e.,
limsup
n

1
n
log P
e
(n, M = e
nR
)
sup
X
inf
_
a : limsup
n
_
Pr
_

1
n
log

y
n

n
P
1/2
Y
n
[X
n
(y
n
[

X
n
)P
1/2
Y
n
[X
n
(y
n
[X
n
) > a
__
M
= 0
_
213
and
liminf
n

1
n
log P
e
(n, M = e
nR
)
sup
X
inf
_
a : liminf
n
_
Pr
_

1
n
log

y
n

n
P
1/2
Y
n
[X
n
(y
n
[

X
n
)P
1/2
Y
n
[X
n
(y
n
[X
n
) > a
__
M
= 0
_
,
where P
Y
n
[X
n is the n-dimensional channel transition distribution from code
alphabet A
n
to channel output alphabet
n
, and P
e
(n, M) is the average proba-
bility of error for optimal channel code of blocklength n and size M. Note that
the formula of the above channel reliability bound is quite dierent from those
formulated in terms of the exponents of the information spectrum (cf. [6, Sec.
V] and [17, equation (14)]).
B) Determination of the largest minimum distance for a class of dis-
tance functions
In this section, we will present a minor class of distance functions for which the
optimization input X for the distance-spectrum function can be characterized,
and thereby, the ultimate largest minimum distance among codewords can be
established in terms of the distance-spectrum formula.
A simple example for which the largest minimum distance can be derived in
terms of the new formula is the probability-of-error distortion measure, which is
dened

n
( x
n
, x
n
) =
_
0, if x
n
= x
n
;
n, if x
n
,= x
n
.
It can be easily shown that
sup
X

X
(R)
inf a 1 :
limsup
n
sup
X
n
_
Pr
_
1
n

n
(

X
n
, X
n
) > a
__
M
= 0
_
=
_
1, for 0 R < log [A[;
0, for R log [A[,
and the upper bound can be achieved by letting X be uniformly distributed over
the code alphabet. Similarly,
sup
X

X
(R) =
_
1, for 0 R < log [A[;
0, for R log [A[.
214
Another example for which the optimizer X of the distance-spectrum func-
tion can be characterized is the separable distance function dened below.
Denition 9.26 (separable distance functions)

n
( x
n
, x
n
) f
n
([g
n
( x
n
) g
n
(x
n
)[),
where f
n
() and g
n
() are real-valued functions.
Next, we derive the basis for nding one of the optimization distributions for
sup
X
n
Pr
_
1
n

n
(

X
n
, X
n
) > a
_
under separable distance functionals.
Lemma 9.27 Dene G() inf
U
Pr
_

U U


_
, where the inmum is
taken over all (

U, U) pair having independent and identical distribution on [0, 1].


Then for j = 2, 3, 4, . . . and 1/j < 1/(j 1),
G() =
1
j
.
In addition, G() is achieved by uniform distribution over
_
0,
1
j 1
,
2
j 1
, . . . ,
j 2
j 1
, 1
_
.
Proof:
Pr
_

U U


_
Pr
_

U U


1
j
_

j2

i=0
Pr
_
_

U, U
_

_
i
j
,
i + 1
j
_
2
_
+Pr
_
_

U, U
_

_
j 1
j
, 1
_
2
_
=
j2

i=0
_
Pr
_
U
_
i
j
,
i + 1
j
___
2
+
_
Pr
_
U
_
j 1
j
, 1
___
2

1
j
.
215
Achievability of G() by uniform distribution over
_
0,
1
j 1
,
2
j 1
, . . . ,
j 2
j 1
, 1
_
can be easily veried, and hence, we omit here. 2
Lemma 9.28 For 1/j < 1/(j 1),
1
j
inf
Un
Pr
_

U
n
U
n


_

1
j
+
1
n + 0.5
,
where the inmum is taken over all (

U
n
, U
n
) pair having independent and iden-
tical distribution on
_
0,
1
n
,
2
n
, . . . ,
n 1
n
, 1
_
,
and j = 2, 3, 4, . . . etc.
Proof: The lower bound follows immediately from Lemma 9.27.
To prove the upper bound, let U

n
be uniformly distributed over
_
0,

n
,
2
n
, . . . ,
k
n
, 1
_
,
where = n| + 1 and k is an integer satisfying
n
n/(j 1) + 1
1 k <
n
n/(j 1) + 1
.
(Note that
n
j 1
+ 1
_
n
j 1
_
+ 1 .)
Then
inf
Un
Pr
_

U
n
U
n


_
Pr
_

n
U


_
=
1
k + 2

1
1
1/(j 1) + 1/n
+ 1
=
1
j
+
(j 1)
2
j
2
1
n + (j 1)/j

1
j
+
1
n + 0.5
.
216
2
Based on the above Lemmas, we can then proceed to compute the asymptotic
largest minimum distance among codewords of the following examples. It needs
to be pointed out that in these examples, our objective is not to attempt to
solve any related problems of practical interests, but simply to demonstrate the
computation of the distance spectrum function for general readers.
Example 9.29 Assume that the n-tuple code alphabet is 0, 1
n
. Let the n-fold
distance function be dened as

n
( x
n
, x
n
) [u( x
n
) u(x
n
)[
where u(x
n
) represents the number of 1s in x
n
. Then
sup
X

X
(R)
inf
_
a 1 : limsup
n
sup
X
n
_
Pr
_
1
n

u(

X
n
) u(X
n
)

> a
__
M
= 0
_
= inf
_
a 1 : limsup
n
_
1 inf
X
n
Pr
_
1
n

u(

X
n
) u(X
n
)

a
__
M
= 0
_
inf
_
a 1 : limsup
n
_
1 inf
U
Pr
_

U U

a
__
expnR
= 0
_
= inf
_
a 1 : limsup
n
_
1
1
1/a|
_
expnR
= 0
_
= 0.
Hence, the asymptotic largest minimum distance among block codewords is zero.
This conclusion is not surprising because the code with nonzero minimal distance
should contain the codewords of dierent Hamming weights and the whole num-
ber of such words is n + 1.
Example 9.30 Assume that the code alphabet is binary. Dene

n
( x
n
, x
n
) log
2
([g
n
( x
n
) g
n
(x
n
)[ + 1) ,
217
where
g
n
(x
n
) = x
n1
2
n1
+ x
n2
2
n2
+ + x
1
2 + x
0
.
Then
sup
X

X
(R)
inf
_
a 1 : limsup
n
sup
X
n
_
Pr
_
1
n
log
2
_

g
n
(

X
n
) g
n
(X
n
)

+ 1
_
> a
__
M
= 0
_
inf
_
a 1 : limsup
n
_
1 inf
U
Pr
_

U U


2
na
1
2
n
1
__
expnR
= 0
_
= inf
_

_
a 1 : limsup
n
_
_
_
_
1
1
_
2
n
1
2
na
1
_
_
_
_
_
expnR
= 0
_

_
= 1
R
log 2
. (9.4.9)
By taking X

to be the one under which U


K
g
n
(X
n
)/(2
n
1) has the distribu-
tion as used in the proof of the upper bound of Lemma 9.28, where K 2
n
1,
218
we obtain

(R)
inf
_
a 1 : limsup
n
_
_
_
_
1
1
_
2
n
1
2
na
1
_
1
K + 0.5
_
_
_
_
expnR
= 0
_

_
= inf
_
a 1 : limsup
n
_
_
_
_
1
1
_
2
n
1
2
na
1
_
1
2
n
0.5
_
_
_
_
expnR
= 0
_

_
= 1
R
log 2
.
This proved the achievability of (9.4.9).
C) General properties of distance-spectrum function
We next address some general functional properties of

X
(R) and
X
(R).
Lemma 9.31 1.

X
(R) and
X
(R) are non-increasing and right-continuous
functions of R.
2.

X
(R)

D
0
(X) limsup
n
ess inf
1
n

n
(

X
n
, X
n
) (9.4.10)

X
(R) D
0
(X) liminf
n
ess inf
1
n

n
(

X
n
, X
n
) (9.4.11)
where ess inf represents essential inmum
9
. In addition, equality holds for
(9.4.10) and (9.4.11) respectively when
R >

R
0
(X) limsup
n

1
n
log Pr

X
n
= X
n
(9.4.12)
9
For a given random variable Z, its essential inmum is dened as
ess inf Z supz : Pr[Z z] = 1.
219
and
R > R
0
(X) liminf
n

1
n
log Pr

X
n
= X
n
, (9.4.13)
provided that
( x
n
A
n
) min
x
n
.
n

n
( x
n
, x
n
) =
n
(x
n
, x
n
) = constant. (9.4.14)
3.

X
(R)

D
p
(X) (9.4.15)
for R >

R
p
(X); and

X
(R) D
p
(X) (9.4.16)
for R > R
p
(X), where

D
p
(X) limsup
n
1
n
E[
n
(

X
n
, X
n
)[
n
(

X
n
, X
n
) < ]
D
p
(X) liminf
n
1
n
E[
n
(

X
n
, X
n
)[
n
(

X
n
, X
n
) < ]

R
p
(X) limsup
n

1
n
log Pr
n
(

X
n
, X
n
) <
and
R
p
(X) liminf
n

1
n
log Pr
n
(

X
n
, X
n
) < .
In addition, equality holds for (9.4.15) and (9.4.16) respectively when R =

R
p
(X) and R = R
p
(X), provided that [
n
(

X
n
, X
n
)[
n
(

X
n
, X
n
) < ] has
the large deviation type of behavior, i.e., for all > 0,
liminf
n

1
n
log L
n
> 0, (9.4.17)
where
L
n
Pr
_
1
n
(Y
n
E[Y
n
[Y
n
< ])

Y
n
<
_
,
and Y
n

n
(

X
n
, X
n
).
4. For 0 < R <

R
p
(X),

X
(R) = . Similarly, for 0 < R < R
p
(X),

X
(R) = .
Proof: Again, only the proof regarding

X
(R) will be provided. The properties
of
X
(R) can be proved similarly.
1. Property 1 follows by denition.
220
2. (9.4.10) can be proved as follows. Let
e
n
(X) ess inf
1
n

n
(

X
n
, X
n
),
and hence,

D
0
(X) = limsup
n
e
n
(X). Observe that for any > 0 and
for innitely many n,
_
Pr
_
1
n

n
(

X
n
, X
n
) >

D
0
(X) 2
__
M

_
Pr
_
1
n

n
(

X
n
, X
n
) > e
n
(X)
__
M
= 1.
Therefore,

X
(R)

D
0
(X)2 for arbitrarily small > 0. This completes
the proof of (9.4.10).
To prove the equality condition for (9.4.10), it suces to show that for any
> 0,

X
(R)

D
0
(X) + (9.4.18)
for
R > limsup
n

1
n
log Pr
_

X
n
,= X
n
_
.
By the assumption on the range of R, there exists > 0 such that
R >
1
n
log Pr
_

X
n
,= X
n
_
+ for suciently large n. (9.4.19)
Then for those n satisfying e
n
(X
n
)

D
0
(X) +/2 and (9.4.19) (of which
there are suciently many)
_
Pr
_
1
n

n
(

X
n
, X
n
) >

D
0
(X) +
__
M

_
Pr
_
1
n

n
(

X
n
, X
n
) > e
n
(X
n
) +

2
__
M

_
Pr
_

X
n
,= X
n
__
M
(9.4.20)
=
_
1 Pr
_

X
n
= X
n
__
M
,
where (9.4.20) holds because (9.4.14). Consequently,
limsup
n
_
Pr
_
1
n

n
(

X
n
, X
n
) >

D
0
(X) +
__
M
= 0
which immediately implies (9.4.18).
221
3. (9.4.15) holds trivially if

D
p
(X) = . Thus, without loss of generality, we
assume that

D
p
(X) < . (9.4.15) can then be proved by observing that
for any > 0 and suciently large n,
_
Pr
_
1
n
Y
n
> (1 + )
2


D
p
(X)
__
M

_
Pr
_
1
n
Y
n
> (1 + )
1
n
E[Y
n
[Y
n
< ]
__
M
= (Pr Y
n
> (1 + ) E[Y
n
[Y
n
< ])
M
= (Pr Y
n
= + Pr Y
n
<
Pr Y
n
> (1 + ) E[Y
n
[Y
n
< ][ Y
n
< )
M

_
Pr Y
n
= + Pr Y
n
<
1
1 +
_
M
(9.4.21)
=
_
1

1 +
Pr Y
n
<
_
M
,
where (9.4.21) follows from Markovs inequality. Consequently, for R >

R
p
(X),
limsup
n
_
Pr
_
1
n

n
(

X
n
, X
n
) > (1 + )
2


D
p
(X)
__
M
= 0.
To prove the equality holds for (9.4.15) at R =

R
p
(X), it suces to show
the achievability of

X
(R) to

D
p
(X) by R

R
p
(X), since

X
(R) is right-
continuous. This can be shown as follows. For any > 0, we note from
(9.4.17) that there exists = () such that for suciently large n,
Pr
_
1
n
Y
n

1
n
E[Y
n
[Y
n
< ]

Y
n
<
_
e
n
.
222
Therefore, for innitely many n,
_
Pr
_
1
n
Y
n
>

D
p
(X) 2
__
M

_
Pr
_
1
n
Y
n
>
1
n
E[Y
n
[Y
n
< ]
__
M
= (Pr Y
n
= + Pr Y
n
<
Pr
_
1
n
Y
n
>
1
n
E[Y
n
[Y
n
< ]

Y
n
<
__
M
=
_
1 Pr
_
1
n
Y
n

1
n
E[Y
n
[Y
n
< ]

Y
n
<
_
Pr Y
n
< )
M

_
1 e
n
Pr Y
n
<
_
M
.
Accordingly, for

R
p
(X) < R <

R
p
(X) + ,
limsup
n
_
Pr
_
1
n
Y
n
>

D
p
(X) 2
__
M
> 0,
which in turn implies that

X
(R)

D
p
(X) 2.
This completes the proof of achievability of (9.4.15) by R

R
p
(X).
4. This is an immediate consequence of
( L > 0)
_
Pr
_
1
n

n
(

X
n
, X
n
) > L
__
M

_
1 Pr
_

n
(

X
n
, X
n
) <
__
M
.
2
Remarks.
A weaker condition for (9.4.12) and (9.4.13) is
R > limsup
n
1
n
log [o
n
[ and R > liminf
n
1
n
log [o
n
[,
where o
n
x
n
A
n
: P
X
n(x
n
) > 0. This indicates an expected result
that when the rate is larger than log [A[, the asymptotic largest minimum
distance among codewords remains at its smallest value sup
X

D
0
(X) (resp.
sup
X
D
0
(X)), which is usually zero.
223
Based on Lemma 9.31, the general relation between

X
(R) and the spec-
trum of (1/n)
n
(

X
n
, X
n
) can be illustrated as in Figure 9.7, which shows
that

X
(R) lies asymptotically within
ess inf
1
n
(

X
n
, X
n
)
and
1
n
E[
n
(

X
n
, X
n
)[
n
(

X
n
, X
n
)]
for

R
p
(X) < R <

R
0
(X). Similar remarks can be made on
X
(R). On
the other hand, the general curve of

X
(R) (similarly for
X
(R)) can be
plotted as shown in Figure 9.8. To summarize, we remark that under fairly
general situations,

X
(R)
_

_
= for 0 < R <

R
p
(X);
=

D
p
(X) at R = R
p
(X);
(

D
0
(X),

D
p
(X)] for R
p
(X) < R < R
0
(X);
=

D
0
(X) for R R
0
(X).
A simple universal upper bound on the largest minimum distance among
block codewords is the Plotkin bound. Its usual expression is given by [10]
for which a straightforward generalization (cf. Appendix B) is
sup
X
limsup
n
1
n
E[
n
(

X
n
, X
n
)].
Property 3 of Lemma 9.31, however, provides a slightly better form for the
general Plotkin bound.
We now, based on Lemma 9.31, calculate the distance-spectrum function of
the following examples. The rst example deals with the case of innite code
alphabet, and the second example derives the distance-spectrum function under
unbounded generalized distance measure.
Example 9.32 (continuous code alphabet) Let
A = [0, 1),
and let the marginal distance metric be
(x
1
, x
2
) = min 1 [x
1
x
2
[, [x
1
x
2
[ .
Note that the metric is nothing but treating [0, 1) as a circle (0 and 1 are
glued together), and then to measure the shorter distance between two posi-
tions. Also, the additivity property is assumed for the n-fold distance function,
i.e.,
n
( x
n
, x
n
)

n
i=1
( x
i
, x
i
).
224
Using the product X of uniform distributions over A, the sup-distance-
spectrum function becomes

X
(R) = inf
_
a 1 : limsup
n
_
1 Pr
_
1
n
n

i=1
(

X
i
, X
i
) a
__
M
= 0
_
_
_
.
By Cramer Theorem [4],
lim
n

1
n
Pr
_
1
n
n

i=1
(

X
i
, X
i
) a
_
= I
X
(a),
where
I
X
(a) sup
t>0
_
ta log E[e
t(

X,X)
]
_
is the large deviation rate function
10
. Since I
X
(a) is convex in a, there exists a
supporting line to it satisfying
I
X
(a) = t

a log E
_
e
t

(

X,X)
_
,
which implies
a = s

I
X
(a) s

log E
_
e
(

X,X)/s

_
for s

1/t

> 0; or equivalently, the inverse function of I


X
() is given as
I
1
X
(R) = s

R s

log E
_
e
(

X,X)/s

_
(9.4.22)
= sup
s>0
_
sR s log E
_
e
(

X,X)/s
__
,
where
11
the last step follows from the observation that (9.4.22) is also a support-
ing line to the convex I
1
X
(R). Consequently,

X
(R) = inf a 1 : I
X
(a) < R
= sup
s>0
_
sR s log
_
2s
_
1 e
1/2s
__
, (9.4.23)
10
We take the range of supremum to be [t > 0] (instead of [t 1] as conventional large
deviation rate function does) since what concerns us here is the exponent of the cumulative
probability mass.
11
One may notice the analog between the expression of the large deviation rate function
I
X
(a) and that of the error exponent function [2, Thm. 4.6.4] (or the channel reliability expo-
nent function [2, Thm. 10.1.5]). Here, we demonstrate in Example 9.32 the basic procedure of
225
which is plotted in Figure 9.9.
Also from Lemma 9.31, we can easily compute the marginal points of the
distance-spectrum function as follows.

R
0
(X) = log Pr

X = X = ;

D
0
(X) = ess inf(

X, X) = 0;

R
p
(X) = log Pr(

X, X) < = 0;

D
p
(X) = E[(

X, X)[(

X, X) < ] =
1
4
.
Example 9.33 Under the case that
A = 0, 1
and
n
(, ) is additive with marginal distance metric (0, 0) = (1, 1) = 0,
(0, 1) = 1 and (1, 0) = , the sup-distance-spectrum function is obtained
using the product of uniform (on A) distributions as:

X
(R) = inf a 1 : I
X
(a) < R
= sup
s>0
_
sR s log
2 + e
1/s
4
_
, (9.4.24)
where
I
X
(a) sup
t>0
_
ta log E[e
t(

X,X)
]
_
= sup
t>0
_
ta log
2 + e
t
4
_
.
This curve is plotted in Figure 9.10. It is worth noting that there exists a region
obtaining
inf
_
a 1 : sup
t>0
_
ta log E
_
e
tZ
_
< R
_
= sup
s>0
_
sRs log E
_
e
Z/s
__
,
for a random variable Z so that readers do not have to refer to literatures regarding to error
exponent function or channel reliability exponent function for the validity of the above equality.
This equality will be used later in Examples 9.33 and 9.36, and also, equations (9.4.27) and
(9.4.29).
226
that the sup-distance-spectrum function is innity. This is justied by deriving

R
0
(X) = log Pr

X = X = log 2;

D
0
(X) = ess inf(

X, X) = 0;

R
p
(X) = log Pr(

X, X) < = log
4
3
;

D
p
(X) = E[(

X, X)[(

X, X) < ] =
1
3
.
One can draw the same conclusion by simply taking the derivative of (9.4.24)
with respect to s, and obtaining that the derivative
R log
_
2 + e
1/s
_

e
1/s
s (2 + e
1/s
)
+ log(4)
is always positive when R < log(4/3). Therefore, when 0 < R < log(4/3), the
distance-spectrum function is innity.
From the above two examples, it is natural to question whether the formula
of the largest minimum distance can be simplied to the quantile function
12
of
the large deviation rate function (cf. (9.4.23) and (9.4.24)), especially when the
distance functional is symmetric and additive. Note that the quantile function
of the large deviation rate function is exactly the well-known Varshamov-Gilbert
lower bound (cf. the next section). This inquiry then becomes to nd the answer
of an open question: under what conditions is the Varshamov-Gilbert lower bound
tight? Some insight on this inquiry will be discussed in the next section.
D) General Varshamov-Gilbert lower bound
In this section, a general Varshamov-Gilbert lower bound will be derived directly
from the distance-spectrum formulas. Conditions under which this lower bound
is tight will then be explored.
Lemma 9.34 (large deviation formulas for

X
(R) and
X
(R))

X
(R) = inf
_
a 1 :

X
(a) < R
_
and

X
(R) = inf a 1 :
X
(a) < R
12
Note that the usual denition [1, page 190] of the quantile function of a non-decreasing
function F() is dened as: sup : F() < . Here we adopt its dual denition for a non-
increasing function I() as: infa : I(a) < R. Remark that if F() is strictly increasing (resp.
I() is strictly decreasing), then the quantile is nothing but the inverse of F() (resp. I()).
227
where

X
(a) and
X
(a) are respectively the sup- and the inf-large deviation
spectrums of (1/n)
n
(

X
n
, X
n
), dened as

X
(a) limsup
n

1
n
log Pr
_
1
n

n
(

X
n
, X
n
) a
_
and

X
(a) liminf
n

1
n
log Pr
_
1
n

n
(

X
n
, X
n
) a
_
.
Proof: We will only provide the proof regarding

X
(R). All the properties of

X
(R) can be proved by following similar arguments.
Dene

inf
_
a 1 :

X
(a) < R
_
. Then for any > 0,

X
(

+ ) < R (9.4.25)
and

X
(

) R. (9.4.26)
Inequality (9.4.25) ensures the existence of = () > 0 such that for suciently
large n,
Pr
_
1
n

n
(

X
n
, X
n
)

+
_
e
n(R)
,
which in turn implies
limsup
n
_
Pr
_
1
n

n
(

X
n
, X
n
) >

+
__
M
limsup
n
_
1 e
n(R)
_
e
nR
= 0.
Hence,

X
(R)

+ . On the other hand, (9.4.26) implies the existence of
subsequence n
j

j=1
satisfying
lim
j

1
n
j
log Pr
_
1
n
j

n
j
(

X
n
j
, X
n
j
)


_
R,
which in turn implies
limsup
n
_
Pr
_
1
n

n
(

X
n
, X
n
) >


__
M
limsup
j
_
1 e
n
j
R
_
e
n
j
R
= e
1
.
Accordingly,
X
(R)

. Since is arbitrary, the lemma therefore holds. 2
228
The above lemma conrms that the distance spectrum function

X
() (resp.

X
()) is exactly the quantile of the sup- (resp. inf-) large deviation spectrum
of (1/n)
n
(

X
n
, X
n
)
n>1
Thus, if the large deviation spectrum is known, so is
the distance spectrum function.
By the generalized Gartner-Ellis upper bound derived in [5, Thm. 2.1], we
obtain

X
(a) inf
[xa]
I
X
(x) = I
X
(a)
and

X
(a) inf
[xa]

I
X
(x) =

I
X
(a),
where the equalities follow from the convexity (and hence, continuity and strict
decreasing) of I
X
(x) sup
[<0]
[x
X
()] and

I
X
(x) sup
[<0]
[x
X
()],
and

X
() liminf
n
1
n
log E
_
e
n(

X
n
,X
n
)
_
and

X
() limsup
n
1
n
log E
_
e
n(

X
n
,X
n
)
_
.
Based on these observations, the relation between the distance-spectrum expres-
sion and the Varshamov-Gilbert bound can be described as follows.
Corollary 9.35
sup
X

X
(R) sup
X

G
X
(R) and sup
X

X
(R) sup
X
G
X
(R)
where

G
X
(R) inf a 1 : I
X
(a) < R
= sup
s>0
_
sR s
X
(1/s)
_
(9.4.27)
and
G
X
(R) inf
_
a 1 :

I
X
(a) < R
_
(9.4.28)
= sup
s>0
[sR s
X
(1/s)] . (9.4.29)
Some remarks on the Varshamov-Gilbert bound obtained above are given
below.
Remarks.
229
One can easily see from [2, page 400], where the Varshamov-Gilbert bound
is given under Bhattacharyya distance and nite code alphabet, that
sup
X

G
X
(R)
and
sup
X
G
X
(R)
are the generalization of the conventional Varshamov-Gilbert bound.
Since 0 exp
n
( x
n
, x
n
)/s 1 for s > 0, the function exp(, )/s
is always integrable. Hence, (9.4.27) and (9.4.29) can be evaluated under
any non-negative measurable function
n
(, ). In addition, no assumption
on the alphabet space A is needed in deriving the lower bound. Its full
generality can be displayed using, again, Examples 2.8 and 9.33, which
result in exactly the same curves as shown in Figures 9.9 and 9.10.
Observe that

G
X
(R) and G
X
(R) are both the pointwise supremum of
a collection of ane functions, and hence, they are both convex, which
immediately implies their continuity and strict decreasing property on the
interior of their domains, i.e.,
R :

G
X
(R) < and R : G
X
(R) < .
However, as pointed out in [5],

X
() and
X
() are not necessarily con-
vex, which in turns hints the possibility of yielding non-convex

X
() and

X
(). This clearly indicates that the Varshamov-Gilbert bound is not
tight whenever the asymptotic largest minimum distance among codewords
is non-convex.
An immediate improvement from [5] to the Varshamov-Gilbert bound
is to employ the twisted large deviation rate function (instead of

I
X
())

J
X,h
(x) sup
T :
X
(;h)>
[ h(x)
X
(; h)]
and yields a potentially non-convex Varshamov-Gilbert-type bound, where
h() is a continuous real-valued function, and

X
(; h) limsup
n
1
n
log E
_
e
nh(n(

X
n
,X
n
)/n)
_
.
Question of how to nd a proper h() for such improvement is beyond the
scope of this paper and hence is deferred for further study.
We now demonstrate that

X
(R) >

G
X
(R) by a simple example.
230
Example 9.36 Assume binary code alphabet A = 0, 1, and n-fold
Hamming distance
n
( x
n
, x
n
) =

n
i=1
( x
i
, x
i
). Dene a measurable func-
tion as follows:

n
( x
n
, x
n
)
_
_
_
0, if 0
n
( x
n
, x
n
) < n;
n, if n
n
( x
n
, x
n
) < 2n;
, if 2n
n
( x
n
, x
n
),
where 0 < < 1/2 is a universal constant. Let X be the product of
uniform distributions over A, and let Y
i
(

X
i
, X
i
) for 1 i n.
Then

X
(1/s)
= liminf
n
1
n
log E
_
e
n(

X
n
,X
n
)/s
_
= liminf
n
1
n
log
_
Pr
_
0

n
(

X
n
, X
n
)
n
<
_
+Pr
_


n
(

X
n
, X
n
)
n
< 2
_
e
n/s
_
= liminf
n
1
n
log
_
Pr
_
0
Y
1
+ + Y
n
n
<
_
+Pr
_

Y
1
+ + Y
n
n
< 2
_
e
n/s
_
= max
_
I
Y
(), I
Y
(2)
n
s
_
.
where I
Y
() sup
s>0
s log[(1 + e
s
)/2] (which is exactly the large
deviation rate function of (1/n)Y
n
). Hence,

G
X
(R)
= sup
s>0
_
sR s max
_
I
Y
(), I
Y
(2)
n
s
__
= sup
s>0
min s[I
Y
() R], s[I
Y
(2) R] +
=
_

_
, for 0 R < I
Y
(2);
(I
Y
() R)
I
Y
() I
Y
(2)
, for I
Y
(2) R < I
Y
();
0, for R I
Y
().
231
We next derive

X
(R). Since
Pr
_
1
n

n
(

X
n
, X
n
) a
_
=
_

_
Pr
_
0
Y
1
+ + Y
n
n
<
_
, 0 a < ;
Pr
_
0
Y
1
+ + Y
n
n
< 2
_
, a < 2,
we obtain

X
(R) =
_
_
_
, if 0 R < I
Y
(2);
, if I
Y
(2) R < I
Y
();
0, if I
Y
() R.
Consequently,

X
(R) >

G
X
(R) for I
Y
(2) < R < I
Y
().
One of the problems that remain open in the combinatorial coding theory
is the tightness of the asymptotic Varshamov-Gilbert bound for the binary
code and the Hamming distance [3, page vii]. As mentioned in Section I, it
is already known that the asymptotic Varshamov-Gilbert bound is in gen-
eral not tight, e.g., for algebraic-geometric code with large code alphabet
size and Hamming distance. Example 9.36 provides another example to
conrm the untightness of the asymptotic Varshamov-Gilbert bound for
simple binary code and quantized Hamming measure.
By the generalized Gartner-Ellis lower bound derived in [5, Thm. 2.1],
we conclude that

X
(R) = I
X
(R) (or equivalently

X
(R) =

G
X
(R))
if
_

D
0
(X),

D
p
(X)


_
<0
_
limsup
t0

X
( + t)
X
()
t
,
liminf
t0

X
()
X
( t)
t
_
. (9.4.30)
Note that although (9.4.30) guarantees that

X
(R) =

G
X
(R), it dose not
by any means ensure the tightness of the Varshamov-Gilbert bound. An
additional assumption needs to be made, which is summarized in the next
observation.
232
Observation 9.37 If there exists an

X such that
sup
X

X
(R) =

X
(R)
and (9.4.30) holds for

X, then the asymptotic Varshamov-Gilbert lower
bound is tight.
Problem in applying the above observation is the diculty in nding the
optimizing process X. However, it does provide an alternative to prove the
tightness of the asymptotic Varshamov-Gilbert bound. Instead of nding
an upper bound to meet the lower bound, one could show that (9.4.30)
holds for a fairly general class of random-code generating processes, which
the optimization process surely lies in.
9.4.4 Elias bound: a single-letter upper bound formula
on the largest minimum distance for block codes
As stated in the previous chapter, the number of compositions for i.i.d. source
is of the polynomial order. Therefore, we can divide any block code ( into a
polynomial number of constant-composition subcodes. To be more precise
(n + 1)
[.[
max
C
i

[ ( C
i
[ [ ( [ =

C
i

[ ( C
i
[ max
C
i

[ ( C
i
[. (9.4.31)
where C
i
represents a set of x
n
with the same composition. As a result of
(9.4.31), the rate of code ( should be equal to ( C

where C

is the maximizer
of (9.4.31). We can then conclude that any block code should have a constant
composition subcode with the same rate, and the largest minimum distance for
block codes should be upper bounded by that for constant composition block
codes.
The idea of the Elias bound for i.i.d. source is to upper bound the largest
minimum distance among codewords by the average distance among those code-
words on the spherical surface. Note that the minimum distance among code-
words within a spherical surface should be smaller than the minimum distance
among codewords on the spherical surface, which in terms is upper bounded by
its corresponding average distance.
By observing that points locate on the spherical surface in general have con-
stant joint composition with the center, we can model the set of spherical surface
in terms of joint composition
13
. To be more precise, the spherical surface cen-
tered at x
n
with radius r can be represented by
/
x
n
,r
x
n
: d(x
n
, x
n
) = r.
13
For some distance measures, points on the spherical surface may have more than one joint
233
(Recall that the sphere is modeled as x
n
: d(x
n
, x
n
) r.)
d(x
n
, x
n
)
n

=1
d(x

, x

)
=
K

i=1
K

j=1
#ij(x
n
, x
n
) d(i, j),
where #ij(x
n
, x
n
) represents the number of (x

, x

) pair that is equal to (i, j),


and A = 1, 2, . . . , K is the code alphabet. Hence, for any x
n
in /
x
n
,r
, it is
sucient for (x
n
, x
n
) to have constant joint composition. Therefore, /
x
n
,r
can
be re-dened as
/
x
n
,r
x
n
: #ij(x
n
, x
n
) = n
ij
,
where n
ij

1iK,1jK
is a xed joint composition.
Example 9.38 Assume a binary alphabet. Suppose n = 4 and the marginal
composition is #0, #1 = 2, 2. Then the set of all words with such compo-
sition is
0011, 0101, 0110, 1010, 1100, 1001
which has 4!/(2!2!) = 6 elements. Choose the joint composition to be
#00, #01, #10, #11 = 1, 1, 1, 1.
Then the set of the spherical surface centered at (0011) is
0101, 1001, 0110, 1010,
which has [2!/(1!1!)][2!/(1!1!)] = 4 elements.
Lemma 9.39 (average number of composition-constant codewords on
spherical surface) Fix a composition n
i

1iK
for code alphabet
1, 2, . . . , K.
compositions with the center. For example, if d(0, 1) = d(0, 2) = 1 for ternary alphabet, then
both 0011 and 0022 locate on the spherical surface
x
4
: d(0000, x
4
) = 2;
however, they have dierent joint composition with the center. In this subsection, what we
concern are the general distance measures, and hence, points on spherical surface in general
have constant joint composition with the centers.
234
The average number of M
n
codewords of blocklength n on a spherical surface
dened over a joint composition n
ij

1iK,1jK
(with

K
j=1
n
ij
=

K
j=1
n
ji
=
n
i
) is equal to

T = M
n
_
K

i=1
n
i
_
2
n!
K

i=1
K

j=1
n
ij
!
.
Proof:
Step 1: Number of words. The number of words that have a composition
n
i

1iK
is N n!/

K
i=1
n
i
!. The number of words on the spherical
surface centered at x
n
(having a composition n
i

1iK
)
/(x
n
) x
n
A
n
: #ij(x
n
, x
n
) = n
ij

is equal to
J =
K

i=1
n
i
!
K

j=1
n
ij
!
.
Note that since

K
i=1
= n
j
, /(x
n
) is equivalent to
/(x
n
) x
n
A
n
: #i x
n
= n
i
and #ij(x
n
, x
n
) = n
ij
.
Step 2: Average. Consider all spherical surfaces centered at words with com-
position n
i

1iK
. There are N of them, which we index by r in a xed
order, i.e., /
1
, /
2
, . . . , /
r
, . . . , /
N
. Form a code book ( with M
n
code-
words, each has composition n
i

1iK
. Because each codeword shares
joint composition n
ij
with J words with composition n
i

1iK
, it is on
J spherical surfaces among /
1
, . . . , /
N
. In other words, if T
r
denote the
number of codewords in /
r
, i.e., T
r
[ ( /
r
[, then
N

r=1
T
r
= M
n
J.
Hence, the average number of codewords on one spherical surface satises

T =
M
n
J
N
= M
n
_

K
i=1
n
i
_
2
n!

K
i=1

K
j=1
n
ij
!
.
2
235
Observation 9.40 (Reformulating

T)
a(n)e
n(RI(X;

X))
<

T < b(n)
2
1
(n)e
n(RI(X;

X))
.
where P
X

X
is the composition distribution to n
ij

1iK,1jK
dened in the
previous Lemma, and a(n) and b(n) are both polynomial functions of n.
Proof: We can write

T as

T = M
n
n!

K
i=1

K
j=1
n
ij
!

K
i=1
n
i
n!

K
i=1
n
i
n!
.
By the Stirlings approximation
14
, we obtain

1
(n)e
nH(X,

X)
<
n!

K
i=1

K
j=1
n
ij
!
<
2
(n)e
nH(X,

X)
,
and

1
(n)e
nH(X)
<
n!

K
i=1
n
i
!
<
2
(n)e
nH(X)
,
where
15

1
(n) = 6
2K
n
(12K)/2
and
2
(n) = 6

n, which are both polynomial


functions of n. Together with the fact that H(

X) = H(X) and M
n
= e
nR
, we
nally obtain

1
(n)
2
2
(n)e
n(R+H(X,

X)H(X)H(

X))
<

T <
2
(n)
2
1
(n)e
n(R+H(X,

X)H(X)H(

X))
.
14
Stirlings approximation:

2n
_
n
e
_
n
< n! <

2n
_
n
e
_
n
_
1 +
1
12n 1
_
.
15
Using a simplied Stirlings approximation as

n
_
n
e
_
n
< n! < 6

n
_
n
e
_
n
,
we have

n
_
n
e
_
n

K
i=1

K
j=1
6

n
ij
_
nij
e
_
nij
<
n!

K
i=1

K
j=1
n
ij
!
<
6

n
_
n
e
_
n

K
i=1

K
j=1

n
ij
_
nij
e
_
nij

K
i=1

K
j=1
n
ij
6
2K
n
n

K
i=1

K
j=1
n
nij
ij
<
n!

K
i=1

K
j=1
n
ij
!
<

K
i=1

K
j=1
n
ij
6n
n

K
i=1

K
j=1
n
nij
ij
.
Since
n
(12K)/2

K
i=1

K
j=1
n
ij

n,
236
2
This observations tell us that if we choose M
n
suciently large, say R >
I(X;

X), then

T goes to innity. We can then conclude that there exists a
spherical surface such that T
r
= [ ( /
r
[ goes to innity. Since the minimum
distance among all codewords must be upper bounded by the average distance
among codewords on spherical surface /
r
, an upper bound of the minimum
distance can be established via the next theorem.
Theorem 9.41 For Hamming(-like) distance, which is dened by
d(i, j) =
_
0, if i = j,
A, if i ,= j
,
d(R) max
P
X
min
P

X
e
X
: P

X
=P
e
X
=P
X
,I(

X;
e
X)<R
K

i=1
K

j=1
K

k=1
P

X
e
X
(i, k)P

X
e
X
(j, k)
P
X
(k)
d(i, j).
Proof:
Step 1: Notations. Choose a composition-constant code ( with rate R and
composition n
i

1iK
, and also choose a joint-composition
n
ij

1iK,1jK
whose marginal compositions are both n
i

1iK
, and satises I(

X;

X) <
R, where P

X
e
X
is its joint composition distribution. From Observation 9.40,
the above derivation can be continued as
6
2K
n
(12K)/2
n
n

K
i=1

K
j=1
n
nij
ij
<
n!

K
i=1

K
j=1
n
ij
!
< 6

n
n
n

K
i=1

K
j=1
n
nij
ij

1
(n)
n
n

K
i=1

K
j=1
n
nij
ij
<
n!

K
i=1

K
j=1
n
ij
!
<
2
(n)
n
n

K
i=1

K
j=1
n
nij
ij
.
Finally,
n
n

K
i=1

K
j=1
n
nij
ij
= e
nlog n
P
K
i=1
P
K
j=1
nij log nij
= e
P
K
i=1
P
K
j=1
nij log n
P
K
i=1
P
K
j=1
nij log nij
= e

P
K
i=1
P
K
j=1
nij log
n
i
j
n
= e
n
P
K
i=1
P
K
j=1
P
X

X
(i,j) log P
X

X
(i,j)
= e
nH(X,

X)
.
237
if I(

X;

X) < R,

T increases exponentially with respective to n. We can
then nd a spherical surface /
r
such that the number of codewords in it,
namely T
r
= [ ( /
r
[, increases exponentially in n, where the spherical
surface is dened on the joint composition n
ij

1iK,1jK
.
Dene

d
r

x
n
( ,r

x
n
( ,r, x
n
,=x
n
d(x
n
, x
n
)
T
r
(T
r
1)
.
Then it is obvious that the minimum distance among codewords is upper
bounded by

d
r
.
Step 2: Upper bound of

d
r
.

d
r

x
n
( ,r

x
n
( ,r, x
n
,=x
n
d(x
n
, x
n
)
T
r
(T
r
1)
=

x
n
( ,r

x
n
( ,r, x
n
,=x
n
_
K

i=1
K

j=1
#ij(x
n
, x
n
)d(i, j)
_
T
r
(T
r
1)
.
Let 1
ij[
(x
n
, x
n
) = 1 if x

= i and x

= j; and zero, otherwise. Then

d
r
=

x
n
( ,r

x
n
( ,r, x
n
,=x
n
_
K

i=1
K

j=1
_
n

=1
1
ij[
(x
n
, x
n
)
_
d(i, j)
_
T
r
(T
r
1)
=
K

i=1
K

j=1
_
_
n

=1

x
n
( ,r

x
n
( ,r, x
n
,=x
n
1
ij[
(x
n
, x
n
)
_
_
d(i, j)
T
r
(T
r
1)
Dene
T
i[
= [x
n
( /
r
: x

= i[, and
i[
=
T
i[
T
r
.
Then

K
i=1
T
i[
= T
r
. (Hence,
i[

K
i=1
is a probability mass function.) In
addition, suppose
j[
is 1 when the -th component of the center of the
spherical surface /
r
is j; and 0, otherwise. Then
n

=1
T
i[

j[
= T
r
n
ij
,
238
which is equivalent to
n

=1

i[

j[
= n
ij
.
Therefore,

d
r
=
K

i=1
K

j=1
_
n

=1
T
i[
T
j[
_
d(i, j)
T
r
(T
r
1)
=
T
r
T
r
1
K

i=1
K

j=1
_
n

=1

i[

j[
_
d(i, j)

T
r
T
r
1
max
:
P
K
i=1

i|
=1,
and P
n
=1

i|

j|
=n
ij

i=1
K

j=1
_
n

=1

i[

j[
_
d(i, j)
Step 3: Claim. The optimizer of the above maximization is

i[
=
K

k=1
n
ik
n
k

k[
.
(The maximizer of the above maximization can actually be solved in terms
of Lagrange multiplier techniques. For simplicity, a direct derivation of this
maximizer is provided in the following step.) proof: For any ,
n

=1

i[

j[
=
n

=1
_
K

k=1
n
ik
n
k

k[
_

j[
=
K

k=1
n
ik
n
k
_
n

=1

k[

j[
_
=
K

k=1
n
ik
n
jk
n
k
239
By the fact that
n

=1

i[

j[

=1

i[

j[
=
n

=1
(
i[

j[

i[

j[
)
=
n

=1
(
i[

j[

i[

j[
+

i[

j[

i[

j[
)
=
n

=1
(
i[

i[
)(
j[

j[
)
=
n

=1
c
i[
c
j[
,
where c
i[

i[

i[
, we obtain
K

i=1
K

j=1
_
n

=1

i[

j[
_
d(i, j)
K

i=1
K

j=1
_
n

=1

i[

j[
_
d(i, j)
=
K

i=1
K

j=1
_
n

=1
c
i[
c
j[
_
d(i, j)
=
n

=1
_
K

i=1
K

j=1
c
i[
c
j[
d(i, j)
_
=
n

=1
A
_
K

i=1
K

j=1,j,=i
c
i[
c
j[
_
=
n

=1
A
_
K

i=1
K

j=1
c
i[
c
j[

i=1
c
2
i[
_
=
n

=1
A
__
K

i=1
c
i[
__
K

j=1
c
j[
_

i=1
c
2
i[
_
=
n

=1
A
K

i=1
c
2
i[
0.
240
Accordingly,

d
r

T
r
T
r
1
n

=1
K

i=1
K

j=1
_
K

k=1
n
ik
n
jk
n
k
_
d(i, j)
=
T
r
T
r
1
n
n

=1
K

i=1
K

j=1
_
K

k=1
P

X
e
X
(i, k)P

X
e
X
(j, k)
P
X
(k)
_
d(i, j),
which implies that

d
r
n

T
r
T
r
1
n

=1
K

i=1
K

j=1
_
K

k=1
P

X
e
X
(i, k)P

X
e
X
(j, k)
P
X
(k)
_
d(i, j). (9.4.32)
Step 4: Final step. Since (9.4.32) holds for any joint composition satisfying
the constraints, and T
r
goes to innity as n goes to innity, we can take
the minimum of P

X
e
X
and yield a better bound. However, the marginal
composition is chosen so that the composition-constant subcode has the
same rate as the optimal code which is unknown to us. Hence, we choose
a pessimistic viewpoint to take the maximum over all P
X
, and complete
the proof. 2
The upper bound in the above theorem is called the Elias bound, and will
be denoted by d
E
(R).
9.4.5 Gilbert bound and Elias bound for Hamming dis-
tance and binary alphabet
The (general) Gilbert bound is
G(R) sup
X
sup
s>0
[sR s
X
(s)] ,
where
X
(s) limsup
n
(1/n) log E[exp
n
(

X
n
, X
n
)/s], and

X
n
X
n
are
independent. For binary alphabet,
Pr(

X, X) y =
_
_
_
0, y < 0
Pr(

X = X), 0 y < 1
1 , y 1.
Hence,
G(R) = sup
0<p<1
sup
s>0
_
sR s log
_
p
2
+ (1 p)
2
+ 2p(1 p)e
1/s
_
,
241
where P
X
(0) = p and P
X
(1) = 1 p. By taking the derivatives with respective
to p, the optimizer can be found to be p = 1/2. Therefore,
G(R) = sup
s>0
_
sR s log
1 + e
1/s
2
_
.
The Elias bound for Hamming distance and binary alphabet is equal to
d
E
(R) max
P
X
min
P

X
e
X
: P

X
=P
e
X
=P
X
,I(

X;
e
X)<R
K

i=1
K

j=1
K

k=1
P

X
e
X
(i, k)P

X
e
X
(j, k)
P
X
(k)
d(i, j)
= max
P
X
min
P

X
e
X
: P

X
=P
e
X
=P
X
,I(

X;
e
X)<R
2
K

k=1
P

X
e
X
(0, k)P

X
e
X
(1, k)
P
X
(k)
= max
0<p<1
min
n
0ap : a log
a
p
2
+2(pa) log
pa
p(1p)
+(12p+a) log
12p+a
(1p)
2
=R
o
2
_
a(p a)
p
+
(p a)(1 2p + a)
1 p
_
,
where P
X
(0) = p and P

X
e
X
(0, 0) = a. Since
f(p, a) 2
_
a(p a)
p
+
(p a)(1 2p + a)
1 p
_
,
is non-increasing with respect to a, by assuming that the solution for
a log
a
p
2
+ 2(p a) log
p a
p(1 p)
+ (1 2p + a) log
1 2p + a
(1 p)
2
= R
is a

p
which is a function of p,
d
E
(R) = max
0<p<1
f(p, mina

p
, p).
9.4.6 Bhattacharyya distance and expurgated exponent
The Gilbert bound for additive distance measure
n
(x
n
, x
n
) =

n
i=1
(x
i
, x
i
) can
be written as
G(R) = max
s>0
max
P
X
_
sR s log
_

x.

.
P
X
(x)P
X
(x
t
)e
(x,x

)/s
__
. (9.4.33)
Recall that the expurgated exponent for DMC with generic distribution P
Y [X
is dened by
E
ex
(R) max
s1
max
P
X
_
sR s log

x.

.
P
X
(x)P
X
(x
t
)
242
_

y
_
P
Y [X
(y[x)P
Y [X
(y[x
t
)
_
1/s
_
_
,
which can be re-formulated as:
E
ex
(R) max
s1
max
P
X
_
sR s log
_

x.

.
P
X
(x)P
X
(x
t
)
e
(log
P
yY

P
Y |X
(y[x)P
Y |X
(y[x

))/s
__
. (9.4.34)
Comparing (9.4.33) and (9.4.34), we can easily nd that the expurgated lower
bound (to reliability function) is nothing but the Gilbert lower bound (to mini-
mum distance among codewords) with
(x, x
t
) log

y
_
P
Y [X
(y[x)P
Y [X
(y[x
t
).
This is called the Bhattacharyya distance for channel P
Y [X
.
Note that the channel reliability function E(R) for DMC satises
E
ex
(R) E(R) limsup
n
1
n
d
n,M=e
n,R;
also note that
G(R)
1
n
d
n,M=e
n,R
and for optimizer s

1,
E
ex
(R) = G(R).
We conclude that the channel reliability function for DMC is solved (not only for
high rate but also for low rate, if the Gilbert bound is tight for Bhattacharyya
distance! This again conrms the signicance of the query on when the Gilbert
bound being tight!
The nal remark in this subsection regarding the resemblance of Bhattachar-
ya distance and Hamming distance. The Bhattacharyya distance is in general
not necessary a Hamming-like distance measure. The channel which results in
a Hamming-like Bhattacharyya distance is named equidistance channels. For
example, any binary input channel is a equidistance channel.
9.5 Straight line bound
It is conjectured that the channel reliability function is convex. Therefore, any
non-convex upper bound can be improved by making it convex. This is the main
idea of the straight line bound.
243
Denition 9.42 (list decoder) A list decoder decodes the outputs of a noisy
channel by a list of candidates (possible inputs), and an error occurs only when
the correct codeword transmitted is not in the list.
Denition 9.43 (maximal error probability for list decoder)
P
e,max
(n, M, L) min
( with L candidates for decoder
max
1mM
P
e[m
,
where n is the blocklength and M is the code size.
Denition 9.44 (average error probability for list decoder)
P
e
(n, M, L) min
( with L candidates for decoder
1
M

1mM
P
e[m
,
where n is the blocklength and M is the code size.
Lemma 9.45 (lower bound on average error) For DMC,
P
e
(n, M, 1) P
e
(n
1
, M, L)P
e,max
(n
2
, L + 1, 1),
where n = n
1
+ n
2
.
Proof:
Step 1: Partition of codewords. Partition each codeword into two parts: a
prex with length n
1
and a sux of length n
2
. Likewise, divide each output
observations into a prex part with length n
1
and a sux part with length
n
2
.
Step 2: Average error probability. Let
c
1
, . . . , c
M
be the optimal codewords, and |
1
, . . . , |
M
be the corresponding output
partitions. Denote the prex parts of the codewords as c
1
, . . . , c
M
and the
sux parts as c
1
, . . . , c
M
. Similar notations applies to outputs y
n
. Then
P
e
(n, M, 1)
=
1
M
M

m=1

y
n
,|m
P
Y
n
[X
n(y
n
[c
m
)
=
1
M
M

m=1

y
n
,|m
P
Y
n
1[X
n
1 ( y
n
1
[ c
m
)P
Y
n
2 [X
n
2 ( y
n
2
[ c
m
)
=
1
M
M

m=1

y
n
1
n
1

y
n
2,|m( y
n
1 )
P
Y
n
1[X
n
1 ( y
n
1
[ c
m
)P
Y
n
2[X
n
2 ( y
n
2
[ c
m
),
244
where
|
m
( y
n
1
) y
n
2

n
2
: concatenation of y
n
1
and y
n
2
|
m
.
Therefore,
P
e
(n, M, 1)
=
1
M
M

m=1

y
n
1
n
1
P
Y
n
1[X
n
1 ( y
n
1
[ c
m
)

y
n
2 ,|m( y
n
1 )
P
Y
n
2[X
n
2 ( y
n
2
[ c
m
)
Step 3: [1 m M : P
e[m
( ( ) < P
e,max
(n, L + 1, 1)[ L.
Claim: For any code with size M L, the number of m that satisfying
P
e[m
< P
e,max
(n, L + 1, 1)
is at most L.
proof: Suppose the claim is not true. Then there exist at least L + 1
codewords satisfying P
e[m
( ( ) < P
e,max
(n, L + 1, 1). Form a new code (
t
with size L + 1 by these L + 1 codewords. The error probability given
each transmitted codeword should be upper bounded by its corresponding
original P
e[m
, which implies that max
1mL+1
P
e[m
( (
t
) is strictly less than
P
e,max
(n, L + 1, 1), contradicted to its denition.
Step 4: Continue from step 2. Treat |
m
( y
n
1
)
M
m=1
as decoding partitions
at the output site for observations of length n
2
, and denote its resultant
error probability given each transmitted codeword by P
e[m
( y
n
1
). Then
P
e
(n, M, 1)
=
1
M
M

m=1

y
n
1
n
1
P
Y
n
1[X
n
1
( y
n
1
[ c
m
)

y
n
2,|m( y
n
1 )
P
Y
n
2[X
n
2
( y
n
2
[ c
m
)
=
1
M
M

m=1

y
n
1
n
1
P
Y
n
1[X
n
1
( y
n
1
[ c
m
)P
e[m
( y
n
1
)

1
M

m/

y
n
1
n
1
P
Y
n
1[X
n
1 ( y
n
1
[ c
m
)P
e,max
(n
2
, L + 1),
where L is the set of indexes satisfying
P
e[m
( y
n
1
) P
e,max
(n
2
, L + 1).
Note that [L[ M L + 1. We nally obtain
P
e
(n, M, 1) P
e,max
(n
2
, L + 1)
1
M

m/

y
n
1
n
1
P
Y
n
1 [X
n
1
( y
n
1
[ c
m
).
245
By partitioning
n
1
into [L[ mutually disjoint decoding sets, and using
(1, 2, . . . , M L) plus the codeword itself as a decoding candidate list
which is at most L,
1
M

m/

y
n
1
n
1
P
Y
n
1[X
n
1
( y
n
1
[ c
m
) P
e
(n
1
, M, L)
Combining this with the previous inequality completes the proof. 2
Theorem 9.46 (straight-line bound)
E(R
1
+ (1 )R
2
) E
p
(R
1
) + (1 )E
sp
(R
2
).
Proof: By applying the previous lemma with = n
1
/n, log(M/L)/n
1
= R
1
and
log L/n
2
= R
2
, we can upper bound P
e
(n
1
, M, L) by partitioning exponent, and
P
e,max
(n
2
, L + 1, 1) by sphere-packing exponent. 2
246
P
Y
n
|X
n(
.
/c
m
)
P
Y
n
|X
n(
.
/c
m
)
P
Y
n
|X
n(
.
/c
m
)
P
Y
n
|X
n(
.
/c
m
)
(a)
(b)
Figure 9.6: (a) The shaded area is |
c
m
; (b) The shaded area is /
c
m,m
.
247

X
(R)
-
ess inf(1/n)
n
(

X
n
, X
n
)
(1/n)E[(

X
n
, X
n
)[(

X
n
, X
n
) < ]
Probability mass
of (1/n)
n
(

X
n
, X
n
)

+
Figure 9.7:

X
(R) asymptotically lies between ess inf(1/n)(

X
n
, X
n
)
and (1/n)E[
n
(

X
n
, X
n
)[
n
(

X
n
, X
n
)] for

R
p
(X) < R <

R
0
(X).
R

X
(R)
-
6

c
p
p
p
p
p
p
p
p
p
s

R
p
(X)

D
p
(X)
c
p
p
p
s
p
p
p
p
p
p
p
p
p
p
p
c
p
p
p
s

R
0
(X)

D
0
(X)
Figure 9.8: General curve of

X
(R).
248
0
1/4
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
R
Figure 9.9: Function of sup
s>0
_
sR s log
_
2s
_
1 e
1/2s
__
.
0
1/3
0 log(4/3) log(2)
R
innite region
Figure 9.10: Function of sup
s>0
_
sR s log
__
2 + e
1/s
_
/4
_
.
249
Bibliography
[1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: John
Wiley and Sons, 1995.
[2] Richard E. Blahut. Principles and Practice of Information Theory. Addison
Wesley, Massachusetts, 1988.
[3] V. Blinovsky. Asymptotic Combinatorial Coding Theory. Kluwer Acad. Pub-
lishers, 1997.
[4] James A. Bucklew. Large Deviation Techniques in Decision, Simulation,
and Estimation. Wiley, New York, 1990.
[5] P.-N. Chen, Generalization of G artner-Ellis theorem, submitted to IEEE
Trans. Inform. Theory, Feb. 1998.
[6] P.-N. Chen and F. Alajaji, Generalized source coding theorems and hy-
pothesis testing, Journal of the Chinese Institute of Engineering, vol. 21,
no. 3, pp. 283-303, May 1998.
[7] P.-N. Chen and F. Alajaji, On the optimistic capacity of arbitrary chan-
nels, in proc. of the 1998 IEEE Int. Symp. Inform. Theory, Cambridge,
MA, USA, Aug. 1621, 1998.
[8] P.-N. Chen and F. Alajaji, Optimistic shannon coding theorems for ar-
bitrary single-user systems, IEEE Trans. Inform. Theory, vol. 45, no. 7,
pp. 26232629, Nov. 1999.
[9] T. Ericson and V. A. Zinoviev, An improvement of the Gilbert bound for
constant weight codes, IEEE Trans. Inform. Theory, vol. IT-33, no. 5,
pp. 721723, Sep. 1987.
[10] R. G. Gallager. Information Theory and Reliable Communications. John
Wiley & Sons, 1968.
[11] G. van der Geer and J. H. van Lint. Introduction to Coding Theory and
Algebraic Geometry. Birkhauser:Basel, 1988.
250
[12] S. Verd u and T. S. Han, A general formula for channel capacity, IEEE
Trans. on Information Theory, vol. IT40, no. 4, pp. 11471157, Jul. 1994.
[13] A. N. Kolmogorov and S. V. Fomin. Introductory Real Analysis. New
York:Dover Publications, Inc., 1970.
[14] J. H. van Lint, Introduction to Coding Theory. New York:Springer-Verlag,
2nd edition, 1992.
[15] S. N. Litsyn and M. A. Tsfasman, A note on lower bounds, IEEE
Trans. on Inform. Theory, vol. IT-32, no. 5, pp. 705706, Sep. 1986.
[16] J. K. Omura, On general Gilbert bounds, IEEE Trans. Inform. Theory,
vol. IT-19, no. 5, pp. 661666, Sep. 1973.
[17] H. V. Poor and S. Verd u, A lower bound on the probability of error in
multihypothesis testing, IEEE Trans. on Information Theory, vol. IT41,
no. 6, pp. 19921994, Nov. 1995.
[18] H. L. Royden. Real Analysis. New York:Macmillan Publishing Company,
3rd edition, 1988.
[19] M. A. Tsfasman and S. G. Vladut. Algebraic-Geometric Codes. Nether-
lands:Kluwer Academic Publishers, 1991.
[20] S. G. Vladut, An exhaustion bound for algebraic-geometric modular
codes, Probl. Info. Trans., vol. 23, pp. 2234, 1987.
[21] V. A. Zinoviev and S. N. Litsyn, Codes that exceed the Gilbert bound,
Problemy Peredachi Informatsii, vol. 21:1, pp. 105108, 1985.
251
Chapter 10
Information Theory of Networks
In this chapter, we consider the theory regarding to the communications among
many (more than three) terminals, which is usually named network. Several kind
of variations on this research topic are formed. What follow are some of them:
Example 10.1 (multi-access channels) This is a standard problem on dis-
tributed source coding (data compression). In this channel, two or more senders
send information to a common receiver. In order to reduce the overall trans-
mitted rate on network, data compressions are applied independently among
senders. As shown in Figure 10.1, we want to nd lossless compressors f
1
, f
2
,
and f
3
such that R
1
+ R
2
+ R
3
is minimized. In this research topic, some other
interesting problems to study are, for example, how do the various senders
cooperate with each other, i.e., relations of f
i
s? and What is the relation
between rates and interferences among senders?.
Example 10.2 (broadcast channel) Another standard problem on network
information theory, which is of great interest in research, is the broadcast chan-
nel, such as satellite TV networks, which consists of one sender and many re-
ceivers. It is usually assumed that each receiver receives the same broadcast
information with dierent noise.
Example 10.3 (distributed detection) This is a variation of the multiaccess
channel in which the information of each sender comes from observing the same
targets. The target could belong to one of several categories. Upon receipt of the
compressed data from these senders, the receiver should decide which category
is the current observed one under some optimization criterion.
Perhaps the simplest model of the distributed detection consists of only two
categories which usually name null hypothesis and alternative hypothesis. The
criterion is chosen to be the Bayesian cost or Neyman-Pearson type II error
probability subject to a xed bound on the type I error probability. This system
is depicted in Figure 10.2.
252
-
-
-
Inform
3
Inform
2
Inform
1
Terminal
3
compressor f
3
with rate R
3
Terminal
2
compressor f
2
with rate R
2
Terminal
1
compressor f
1
with rate R
1
-
-
-
-
-
-
Noiseless
network
(such as
wired net)
-
Terminal
4
globally
decom-
pressor
-
Inform
1
,
Inform
2
,
Inform
3
Figure 10.1: The multi-access channel.
Example 10.4 (some other examples)
1) Relay channel. This channel consists of one source sender, one destination
receiver, and several intermediate sender-receiver pairs that act as relays
to facilitate the communication between the source sender and destination
receiver.
2) Interference channel. Several senders and several receivers communicate si-
multaneously on common channel, where interference among them could
introduce degradation on performance.
3) Two-way communication channel. Instead of conventional one-way channel,
two terminals can communicate in a two-way fashion (full duplex).
10.1 Lossless data compression over distributed sources
for block codes
10.1.1 Full decoding of the original sources
In this section, we will rst consider the simplest model of the distributed (cor-
related) sources which consists only two random variables, and then, extend the
result to those models consisting of three or more random variables.
253
-
-
-
-
-
-
-
Fusion center
-
T(U
1
, . . . , U
n
)
g
n
g
2
g
1

Y
n
Y
2
Y
1
U
n
U
2
U
1
Figure 10.2: Distributed detection with n senders. Each observations Y
i
may come from one of two categories. The nal decision T H
0
, H
1
.
Denition 10.5 (independent encoders among distributed sources) T-
here are several sources
X
1
, X
2
, . . . , X
m
(may or may not be independent) which are respectively obtained by m termi-
nals. Before each terminal transmits its local source to the receiver, a block
encoder f
i
with rate R
1
and block length n is applied
f
i
: A
n
i
1, 2, . . . , 2
nR
i
.
It is assumed that there is no conspiracy among block encoders.
Denition 10.6 (global decoder for independently compressed source-
s) A global decoder g() will recover the original sources after receiving all the
independently compressed sources, i.e.,
g : 1, . . . , 2
R
1
1, . . . , 2
Rm
A
n
1
A
n
m
.
Denition 10.7 (probability of error) The probability of error is dened as
P
e
(n) Prg(f
1
(X
n
1
), . . . , f
m
(X
n
m
)) ,= (X
n
1
, . . . , X
n
m
).
Denition 10.8 (achievable rates) A rates (R
1
, . . . , R
m
) is said to be achiev-
able if there exists a sequence of block codes such that
limsup
n
P
e
(n) = 0.
Denition 10.9 (achievable rate region for distributed sources) The a-
chievable rate region for distributed sources is the set of all achievable rates.
254
Observation 10.10 The achievable rate region is convex.
Proof: For any two achievable rates (R
1
, . . . , R
m
) and (

R
1
, . . . ,

R
m
), we can nd
two sequences of codes to achieve them. Therefore, by properly randomization
among these two sequences of codes, we can achieve the rates
(R
1
, . . . , R
m
) + (1 )(

R
1
, . . . ,

R
m
)
for any [0, 1]. 2
Theorem 10.11 (Slepian-Wolf) For distributed sources consisting of two ran-
dom variables X
1
and X
2
, the achievable region is
R
1
H(X
1
[X
2
)
R
2
H(X
2
[X
1
)
R
1
+ R
2
H(X
1
, X
2
).
Proof:
1. Achievability Part: We need to show that for any (R
1
, R
2
) satisfying the
constraint, a sequence of code pairs for X
1
and X
2
with asymptotically zero
error probability exists.
Step 1: Random coding. For each source sequence x
n
1
, randomly assign it an
index in 1, 2, . . . , 2
nR
1
according to uniform distribution. This index is
the encoding output of f(x
n
1
). Let |
i
= x
n
A
n
1
: f
1
(x
n
) = i.
Similarly, form f
2
() by randomly assigning indexes. Let 1
i
= x
n

A
n
2
: f
1
(x
n
) = i.
Upon receipt of two indexes (i, j) from the two encoders, the decoder g()
is dened by
g(i, j) =
_
_
_
(x
n
1
, x
n
2
) if [(|
i
1
j
) T
n
(; X
n
1
, X
n
2
)[ = 1,
and (x
n
1
, x
n
2
) (|
i
1
j
) T
n
(; X
n
1
, X
n
2
);
arbitrary otherwise.
,
where T
n
(; X
n
1
, X
n
2
) is the weakly -joint typical set of X
n
1
and X
n
2
.
Step 2: Error probability. The error happens when the source sequence pairs
(x
n
1
, x
n
2
)
(Case c
0
) (x
n
1
, x
n
2
) , T
n
(; X
n
1
, X
n
2
);
(Case c
1
) ( x
n
1
,= x
n
1
) f
1
( x
n
1
) = f
1
(x
n
1
) and ( x
n
1
, x
n
2
) T
n
(; X
n
1
, X
n
2
);
255
(Case c
2
) ( x
n
2
,= x
n
2
) f
2
( x
n
2
) = f
2
(x
n
2
) and (x
n
1
, x
n
2
) T
n
(; X
n
1
, X
n
2
);
(Case c
3
) ( ( x
n
1
, x
n
2
) ,= (x
n
1
, x
n
2
)) f
1
( x
n
1
) = f
2
( x
n
2
), f
2
( x
n
2
) = f
2
(x
n
2
)
and
( x
n
1
, x
n
2
) T
n
(; X
n
1
, X
n
2
).
Hence,
P
e
(n) Prc
0
c
1
c
2
c
3

Pr(c
0
) + Pr(c
1
) + Pr(c
2
) + Pr(c
3
)
Hence,
E[P
e
(n)] E[Pr(c
0
)] + E[Pr(c
1
)] + E[Pr(c
2
)] + E[Pr(c
3
)],
where the expectation is taken over the uniformly random coding scheme.
It remains to calculate the error for each event.
(Case E
0
) By AEP, E[Pr(c
0
)] 0.
256
(Case E
1
)
E[Pr(c
1
)]
=

x
n
1
.
n
1

x
n
2
.
n
2
P
X
n
1
X
n
2
(x
n
1
, x
n
2
)

x
n
1
.
n
1
: x
n
1
,=x
n
1
,
( x
n
1
,x
n
2
)Tn(;X
n
1
,X
n
2
)

Pr(f
1
( x
n
1
) = f
1
(x
n
1
))
=

x
n
1
.
n
1

x
n
2
.
n
2
P
X
n
1
X
n
2
(x
n
1
, x
n
2
)

x
n
1
.
n
1
: x
n
1
,=x
n
1
( x
n
1
,x
n
2
)Tn(;X
n
1
,X
n
2
)

2
nR
1
(By uniformly random coding scheme.)

x
n
1
.
n
1

x
n
2
.
n
2
P
X
n
1
X
n
2
(x
n
1
, x
n
2
)

x
n
1
.
n
1
: ( x
n
1
,x
n
2
)Tn(;X
n
1
,X
n
2
)
2
nR
1

x
n
1
.
n
1

x
n
2
.
n
2
P
X
n
1
X
n
2
(x
n
1
, x
n
2
)

x
n
1
.
n
1
:

1
n
log
2
P
X
n
1
|X
n
2
( x
n
1
[x
n
2
)H(X
1
[X
2
)

<2
2
nR
1
=

x
n
1
.
n
1

x
n
2
.
n
2
P
X
n
1
X
n
2
(x
n
1
, x
n
2
) 2
nR
1

_
x
n
1
A
n
1
:

1
n
log
2
P
X
n
1
[X
n
2
( x
n
1
[x
n
2
) H(X
1
[X
2
)

< 2
_

x
n
1
.
n
1

x
n
2
.
n
2
P
X
n
1
X
n
2
(x
n
1
, x
n
2
)2
n[H(X
1
[X
2
)+2]
2
nR
1
= 2
n[R
1
H(X
1
[X
2
)2]
,
Hence, we need R
1
> H(X
1
[X
2
) to make E[Pr(c
1
)] 0.
(Case c
2
) Same as Case c
1
. we need R
2
> H(X
2
[X
1
) to make
E[Pr(c
2
)] 0.
(Case c
3
) By replacing Prf
1
( x
n
1
) = f
1
(x
n
) by
Prf
1
( x
n
1
) = f
1
(x
n
), f
2
( x
n
2
) = f
2
(x
n
2
)
and using the fact that the number of elements in T
n
(; X
n
1
, X
n
2
) is
less than 2
n(H(X
1
,X
2
)+)
in the proof of Case c
1
, we have R
1
+ R
2
>
H(X
1
, X
2
) implies E[Pr(c
3
)] 0.
2. Converse Part: We have to prove that if a sequence of code pairs for X
1
and
X
2
has asymptotically zero error probability, then its rate pair (R
1
, R
2
) satises
the constraint.
257
Step 1: R
1
+ R
2
H(X
1
, X
2
). Apparently, to losslessly compress sources
(X
1
, X
2
)
requires at least H(X
1
, X
2
) bits.
Step 2: R
1
H(X
1
[X
2
). At the extreme case of R
2
> log
2
[A
2
[ (no compression
at all for X
2
), we can follow the proof of lossless data compression theorem
to show that R
1
H(X
1
[X
2
).
Step 3: R
2
H(X
2
[X
1
). Similar argument as in step 2 can be applied. 2
Corollary 10.12 Given sequences of several (correlated) discrete memoryless
sources X
1
, . . . , X
m
which are obtained from dierent terminals (and are to be
encoded independently), the achievable code rate region satises

iI
R
i
H(X
I
[X
LI
),
for any index set I L 1, 2, . . . , m, where X
I
represents (X
i
1
, X
i
2
, . . .) for
i
1
, i
2
, . . . = I.
10.1.2 Partial decoding of the original sources
In the previous subsection, the receiver intends to fully reconstruct all the orig-
inal information transmitted, X
1
, . . . , X
m
. In some situations, the receiver may
only want to reconstruct part of the original information, say X
i
for i I
1, . . . , m or X
I
. Since it is in general assumed that X
1
, . . . , X
m
are dependent,
the remaining information, X
i
for i , I, should be helpful in the re-construction
of X
I
. Accordingly, these remain information are usually named the side infor-
mation for lossless data compression.
Due to the dierent intention of the receiver, the denitions of the decoder
mapping and error probability should be modied.
Denition 10.13 (reconstructed information and side information) Let
L 1, 2, . . . , m and I is any proper subset of L. Denote X
I
as the sources X
i
for i I, and similar notation is applied to X
LI
.
In the data compression with side-information, X
I
is the information needs
to be re-constructed, and X
LI
is the side-information.
Denition 10.14 (independent encoders among distributed sources) T-
here are several sources X
1
, X
2
, . . . , X
m
(may or may not be independent) which
258
are respectively obtained by m terminals. Before each terminal transmits its
local source to the receiver, a block encoder f
i
with rate R
i
and block length n
is applied
f
i
: A
n
i
1, 2, . . . , 2
nR
i
.
It is assumed that there is no conspiracy among block encoders.
Denition 10.15 (global decoder for independently compressed sourc-
es) A global decoder g() will recover the original sources after receiving all the
independently compressed sources, i.e.,
g : 1, . . . , 2
R
1
1, . . . , 2
Rm
A
n
I
.
Denition 10.16 (probability of error) The probability of error is dened
as
P
e
(n) Prg(f
I
(X
n
I
) ,= (X
n
I
).
Denition 10.17 (achievable rates) A rates
(R
1
, . . . , R
m
)
is said to be achievable if there exists a sequence of block codes such that
limsup
n
P
e
(n) = 0.
Denition 10.18 (achievable rate region for distributed sources) The
achievable rate region for distributed sources is the set of all achievable rates.
Observation 10.19 The achievable rate region is convex.
Theorem 10.20 For distributed sources with two random variable X
1
and X
2
,
let X
1
be the re-constructed information and X
2
be the side information, the
boundary function R
1
(R
2
) for the achievable region is
R
1
(R
2
) min
Z : X
1
X
2
Z and I(X
2
;Z)R
2

H(X
1
[Z)
The above result can be re-written as
R
1
H(X
1
[Z) and R
2
I(X
2
; Z),
for any X
1
X
2
Z. Actually, Z can be viewed as the coding outputs of
X
2
, received by the decoder, and is used by the receiver as a side information to
reconstruct X
1
. Hence, I(X
2
; Z) is the transmission rate from sender X
2
to the
receiver. For all f
2
(X
2
) = Z that has the same transmission rate I(X
2
; Z), the
one that minimize H(X
1
[Z) will yield the minimum compression rate for X
1
.
259
10.2 Distributed detection
Instead of re-construction of the original information, the decoder of a multiple
sources system may only want to classify the sources into one of nitely many
categories. This problem is usually named distributed detection.
A distributed or decentralized detection system consists of a number of ob-
servers (or data collecting units) and one or more data fusion centers. Each
observer is coupled with a local data processor and communicates with the data
fusion center through network links. The fusion center combines all compressed
information received from the observers and attempts to classify the source of
the observations into one of nitely many categories.
Denition 10.21 (distributed system o
n
) A distributed detection system
o
n
, as depicted in Fig. 10.3, consists of n geographically dispersed sensors, noise-
less one-way communication links, and a fusion center. Each sensor makes an
observation (denoted by Y
i
) of a random source, quantizes Y
i
into an m-ary mes-
sage U
i
= g
i
(Y
i
), and then transmits U
i
to the fusion center. Upon receipt of
(U
1
, . . . , U
n
), the fusion center makes a global decision T(U
1
, . . . , U
n
) about the
nature of the random source.
-
-
-
-
-
-
-
Fusion center
-
T(U
1
, . . . , U
n
)
g
n
g
2
g
1

Y
n
Y
2
Y
1
U
n
U
2
U
1
Figure 10.3: Distributed detection in o
n
.
The optimal design of o
n
entails choosing quantizers g
1
, . . . , g
n
and a global
decision rule T so as to optimize a given performance index. In this section,
we consider binary hypothesis testing under the (classical) Neyman-Pearson and
Bayesian formulations. The rst formulation dictates minimization of the type II
error probability subject to an upper bound on the type I error probability; while
the second stipulates minimization of the Bayes error probability, computed
according to the prior probabilities of the two hypotheses.
260
The joint optimization of entities g
1
, . . . , g
n
and T in o
n
is a hard compu-
tational task, except in trivial cases (such as when the observations Y
i
lie in
a set of size no greater than m). The complexity of the problem can only be
reduced by introducing additional statistical structure in the observations. For
example, it has been shown that whenever Y
1
, . . . , Y
n
are independent given each
hypothesis, an optimal solution can be found in which g
1
, . . . , g
n
are threshold-
type functions of the local likelihood ratio (possibly with some randomization for
Neyman-Pearson testing). Still, we should note that optimization of g
1
, . . . , g
n
over the class of threshold-type likelihood-ratio quantizers is prohibitively com-
plex when n is large.
Of equal importance are situations where the statistical model exhibits spatial
symmetry in the form of permutation invariance with respect to the sensors. A
natural question to ask in such cases is
Question: whether a symmetric optimal solution exists in which the quan-
tizers g
i
are identical?
If so, then the optimal system design is considerably simplied. The answer
is clearly negative for cases where sensor observations are highly dependent; as
an extreme example, take Y
1
= . . . = Y
n
= Y with probability 1 under each
hypothesis, and note that any two identical quantizers lead to a redundancy.
The general problem is as follows. System o
n
is used for testing H
0
: P
versus H
1
: Q, where P and Q are one-dimensional marginals of the i.i.d. data
Y
1
, . . . , Y
n
. As n tends to innity, both the minimum type II error probability

n
() (as function of the type I error probability bound ) and the Bayes error
probability

n
() (as function of the prior probability of H
0
) vanish at an
exponential rate. It thus becomes legitimate to adopt a measure of asymptotic
performance based on the error exponents
e

NP
() lim
n

1
n
log

n
()
e

B
() lim
n

1
n
log

n
() .
It was shown by Tsitsiklis [3] that, under certain assumptions on the hypothe-
ses P and Q, it is possible to achieve the same error exponents using identical
quantizers. Thus if

n
(),

n
(), e

NP
() and e

B
() are the counterparts of

n
(),

n
(), e

NP
() and e

B
() under the constraint that the quantizers g
1
, . . . , g
n
are
identical, then
( (0, 1)) e

NP
() = e

NP
()
and
( (0, 1)) e

B
() = e

B
() .
261
(Of course, for all n,

n
()

n
() and

n
()

n
().) This result provides
some justication for restricting attention to identical quantizers when designing
a system consisting of a large number of sensors.
Here we will focus on two issues. The rst issue is the exact asymptotics
of the minimum error probabilities achieved by the absolutely optimal and best
identical-quantizer systems. Note that equality in the error exponents of

n
()
and

n
() does not in itself guarantee that for any given n, the values of

n
()
and

n
() are in any sense close. In particular, the ratio

n
()/

n
() may vanish
at a subexponential rate, and thus the best identical-quantizer system may be
vastly inferior to the absolutely optimal system. (The same argument can be
made for

n
()/

n
()).
From numerical simulations of Bayes testing in o
n
, the ratio

n
()/

n
()
is (apparently) bounded from below by a positive constant which is, in many
cases, reasonably close to unity (but not necessarily one. Cf. Example 10.32).
This simulation result is substantiated by using large deviations techniques to
prove that

n
()/

n
() is, indeed, always bounded from below (it is, of course,
upper bounded by unity). For Neyman-Pearson testing, an additional regularity
condition is required in order for the ratio

n
()/

n
() to be lower-bounded in
n. In either case, the optimal system essentially consists of almost identical
quantizers, and is thus only marginally dierent from the best identical-quantizer
system.
Denition 10.22 (observational model) Each sensor observation Y = Y
i
takes values in the measurable space (, B). The distribution of Y under the
null (H
0
) and alternative (H
1
) hypotheses is denoted by P and Q, respectively.
Assumption 10.23 P and Q are mutually absolutely continuous, i.e., P Q.
Under Assumption 10.23, the (pre-quantization) log-likelihood ratio
X(y) log
dP
dQ
(y)
is well-dened for y and is a.s. nite. (Since P Q, almost surely and
almost everywhere are understood as under both P and Q.) In o
n
, the variable
X(Y
i
) will also be denoted as X
i
.
Assumption 10.24 Every measurable m-ary partition of contains an atom
over which X = log(dP/dQ) is not almost everywhere constant.
The objective of the above assumption is to guarantee that trivial quantiza-
tion in which no compression is applied on the observations cannot be obtained.
262
Denition 10.25 (divergence) The (Kullback-Leibler, informational) diver-
gence, or relative entropy, of P relative to Q is dened by
D(P|Q) E
P
[X] =
_
log
dP
dQ
(y) dP(y) .
Lemma 10.26 On the convex domain consisting of distribution pairs (P, Q)
with the property P Q, the functional D(P|Q) is nonnegative and convex
with respect to all (P, Q) pair with the property P Q.
Lemma 10.27 (Neyman-Pearson type II error exponent of xed test
level) The optimal Neyman-Pearson error exponent in testing P versus Q at
any level (0, 1) based on the i.i.d. observations Y
1
, . . . , Y
n
is D(P|Q).
Denition 10.28 (moment generation function of log-likelihood ratio)
() is the moment generation function of X under Q:
() E
Q
[expX] =
_ _
dP
dQ
(y)
_

dQ(y) .
Lemma 10.29 (concavity of ())
1. For xed [0, 1], () is a nite-valued concave functionals of the pair
(P, Q) with the property P Q.
2. For xed (P, Q) with P Q, () is nite and convex in [0, 1].
This last property, together with the fact that (0) = (1) = 1, guarantees
that () has a minimum value which is less than or equal to unity, achieved by
some

(0, 1).
Denition 10.30 (Cherno exponent) We dene the Cherno exponent
(P, Q) log (

) = log
_
min
(0,1)
()
_
.
Lemma 10.31 The Cherno exponent coincides with the Bayes error exponent.
Example 10.32 (counterexample to

n
()/

n
() 1) Consider a ternary
observation space = a
1
, a
2
, a
3
with binary quantization. The two hypotheses
are assumed equally likely, with
263
y a
1
a
2
a
3
P(y) 1/12 1/4 2/3
Q(y) 1/3 1/3 1/3
(dP/dQ)(y) 1/4 3/4 2
There are only two nontrivial deterministic LRQs: g, which partitions
into a
1
and a
2
, a
3
; and g, which partitions into a
1
, a
2
and a
3
. The
corresponding output distributions and log-likelihood ratios X

() and X

() are
given by
u 1 2
P

(u) 1/12 11/12
Q

(u) 1/3 2/3
X

(u) log 4 log(11/8)
and
u 1 2
P

(u) 1/3 2/3
Q

(u) 2/3 1/3
X

(u) log 2 log 2
(10.2.1)
(P

, Q

) =
1
2
log
9
8
= 0.0589 > 0.0534 = (P

, Q

) ,
and thus for n suciently large, the best identical-quantizer system o

n
employs
g on all n sensors (the other choice yields a suboptimal error exponent and thus
eventually a higher value of
n
()).
We will now show by contradiction that if o

n
is an absolutely optimal system
consisting of deterministic LRQs, then for all even values of n, at least one of
the quantizers in o

n
must be g.
Assume the contrary, i.e., o

n
is such that for all i n, g
i
= g. Now consider
the problem of optimizing the quantizer g
n
in o
n
subject to the constraint that
each of the remaining quantizers g
1
, . . . , g
n1
equals g. It is known that either
g
n
= g or g
n
= g is optimal. Our assumption about o

n
then implies that g
n
= g
is, in fact, optimal.
To see why this cannot be so if n is even, consider the Bayes error probability
for this problem. Writing u
n
1
for (u
1
, . . . , u
n
), we have

n
_
1
2
_
=
1
2

u
n
1
1,2
n
[P

(u
n1
1
)P
gn
(u
n
)] [Q

(u
n1
1
)Q
gn
(u
n
)]
=
1
2

u
n1
1
1,2
n1

u
n
1,2
[P

(u
n1
1
)P
gn
(u
n
)] [Q

(u
n1
1
)Q
gn
(u
n
)]
=

u
n1
1
1,2
n1
_
P

(u
n1
1
) + Q

(u
n1
1
)
2
_

_
P

(u
n1
1
)
P

(u
n1
1
) + Q

(u
n1
1
)
_
(10.2.2)
264
where () represents the Bayes error probability function of the nth
sensor/quantizer pair (note that in this equation, the argument of () is just
the posterior probability of H
0
given u
n1
1
).
We note from (10.2.1) that the log-likelihood ratio
log
P

(u
n1
1
)
Q

(u
n1
1
)
= X

(u
1
) + + X

(u
n1
)
can be also expressed as (2l
n1
(u) n + 1)(log 2), where l
n1
(u) is the number
of 2s in u
n1
1
. Now l
n1
(U) is a binomial variable under either hypothesis, and
we can rewrite (10.2.2) as

n
_
1
2
_
=
n1

l=0
_
n1
l
_
_
2
l
+ 2
nl1
2 3
n1
_

_
2
2ln+1
2
2ln+1
+ 1
_
. (10.2.3)
The two candidates for are and , given by
() =
_
1
12

1
3
(1 )
_
+
_
11
12

2
3
(1 )
_
() =
_
1
3

2
3
(1 )
_
+
_
2
3

1
3
(1 )
_
and shown in Figure 10.4. Note that () = () for 1/3, = 4/7 and
4/5. Thus the critical values of l in (10.2.3) are those for which 2
2ln+1
/(2
2ln+1
+
1) lies in the union of (1/3, 4/7) and (4/7, 4/5).
If n is odd, then the range of 2l n+1 in (10.2.3) comprises the even integers
between n +1 and n 1 inclusive. The only critical value of l is (n 1)/2, for
which the posterior probability of H
0
is 1/2. Since (1/2) = 3/8 > 1/3 = (1/2),
the optimal choice is g.
If n is even, then 2l n + 1 ranges over all odd integers between n + 1
and n 1 inclusive. Here the only critical value of l is n/2, which makes the
posterior probability of H
0
equal to 2/3. Since (2/3) = 5/18 < 1/3 = (2/3),
g is optimal.
We thus obtain the required contradiction, together with the inequality

2k
_
1
2
_

2k
_
1
2
_

_
2k1
k
_
_
2
k
+ 2
k1
2 3
2k1
_ _

_
2
3
_

_
2
3
__
=
1
8
_
2k1
k
_ _
2
9
_
k
.
Stirlings formula gives (4k)
1/2
expk log 4(1 + o(1)) for the binomial coef-
cient, and thus
liminf
k
_

2k
_
1
2
_

2k
_
1
2
__

k
_
9
8
_
k

1
16
. (10.2.4)
265
0
1
0 1 1/3 4/7 4/5
1/3

()
()
Figure 10.4: Bayes error probabilities associated with g and g.
Since (9/8)
k
= exp2k(P

, Q

) = exp2k
2
, we immediately deduce from
(10.2.4) that
limsup
k

2k
(1/2)

2k
(1/2)
< 1 .
A ner approximation to

2k
_
1
2
_
= QX

(U
1
) + +X

(U
2k
) > 0 +
1
2
QX

(U
1
) + +X

(U
2k
) = 0
using [2, Theorem 1] and Stirlings formula for the rst and second summands,
respectively, yields
lim
k

2k
_
1
2
_

k
_
9
8
_
k
=
3
2
.
From (10.2.4) we then obtain
limsup
k

2k
(1/2)

2k
(1/2)

23
24
.
266
10.2.1 Neyman-Pearson testing in parallel distributed d-
etection
The equality (and niteness) of e

NP
() and e

NP
() was established in Theorem 2
of [3] under the assumption that E
P
[X
2
] < . Actually, the proof only utilized
the following weaker condition for = 0 on the post-quantization log-likelihood
ratio.
Assumption 10.33 (boundedness assumption) There exists 0 for
which
sup
gm
E
P
[[X
g
[
2+
] < , (10.2.5)
where (
m
is the set of all possible m-ary quantizers.
Let us briey examine the above assumption. For an arbitrary randomized
quantizer g, let p
u
= P
g
(u) and q
u
= Q
g
(u). Then
E
P
[[X
g
[
2+
] =
m

u=1
p
u

log
p
u
q
u

2+
.
Our rst observation is that the negative part X

g
of X
g
has bounded (2+)-
th moment under P, and thus Assumption 10.33 is equivalent to
sup
g
E
P
[(X
+
g
)
2+
] < .
Indeed, we have
E
P
[[X

g
[
2+
] =

u : pu<qu
p
u

log
p
u
q
u

2+
=

u : pu<qu
p
u

log
q
u
p
u

2+
,
and using the inequality [ log x
1/(2+)
[ x
1/(2+)
1 for x 1, we obtain
E
P
[(X

g
)
2+
] (2 + )
2+

u : pu<qu
p
u
_
_
q
u
p
u
_
1/(2+)
1
_
2+
(2 + )
2+
m

u=1
q
u
(2 + )
2+
.
Theorem 10.34 Assumption 10.33 is equivalent to the condition
sup
T
2
E
P
[[X

[
2+
] < , (10.2.6)
where T
2
is the set of all possible binary log-likelihood ratio quantizers.
267
Proof: Assumption 10.33 clearly implies (10.2.6). To prove the converse, let
t
be the LRP in T
2
dened by

t

_
(, t] , (t, )
_
, (10.2.7)
and let p(t) = PX > t, q(t) = QX > t. By (10.2.6), there exists b <
such that for all t 1,
E
P
[[X
t
[
2+
] = (1 p(t))

log
1 p(t)
1 q(t)

2+
+ p(t)

log
p(t)
q(t)

2+
b . (10.2.8)
Consider now an arbitrary deterministic m-ary quantizer g with output pmfs
(p
1
, . . . , p
m
) and (q
1
, . . . , q
m
), and let the maximum of p
u
[ log(p
u
/q
u
)[
2+
subject
to p
u
q
u
be achieved at u = u

. Then, from the rst observation in this section,


it follows that
E
P
[[X
g
[
2+
] mp

log
p

2+
+ (2 + )
2+
.
To see that p

[log(p

/q

)[
2+
is bounded from above if (10.2.8) holds, note that
for a given p

= PY C

, the value of q

= QY C

can be bounded from


below using the Neyman-Pearson lemma. In particular, there exist t 1 and
[0, 1] such that
p

= PX = t + p(t) ,
q

QX = t + q(t) .
If PX = t = 0, then
p

log
p

2+
p(t)

log
p(t)
q(t)

2+
,
where by virtue of (10.2.8), the r.h.s. is upper-bounded by b. Otherwise, the
r.h.s. is of the form
_
p(t) + PX = t
_

log
p(t) + PX = t
q(t) + QX = t

2+
,
for some (0, 1]. For = 1, this can again be bounded using (10.2.6): take
t
t
consisting of intervals (, t) and [t, ), or use
t
and a simple continuity argu-
ment. Then the log-sum inequality can be applied together with the concavity
of f(t) = t
1/(2+)
to show that the same bound b is valid for (0, 1).
If g is a randomized quantizer, then the probabilities p

and q

dened previ-
ously will be expressible as

k

k
p
(k)
and

k

k
q
(k)
. Here k ranges over a nite
index set, and each pair (p
(k)
, q
(k)
) is derived from a deterministic quantizer.
Again, using the log-sum inequality and the concavity of f(t) = t
1/(2+)
, one can
obtain p

[log(p

/q

)[
2+
b. 2
268
Observation 10.35 The boundedness assumption is equivalent to
limsup
t
E
P
[[X
t
[
2+
] < , (10.2.9)
where
t
is dened in (10.2.7).
Theorem 10.36 If Assumption 10.33 holds, then for all (0, 1),

1
n
log

n
() = D
m
+ O(n
1/2
)
and

1
n
log

n
() = D
m
+ O(n
1/2
) . 2
As an immediate corollary, we have e

NP
() = e

NP
() = D
m
, which is The-
orem 2 in [3]. The stated result sharpens this equality by demonstrating that
the nite-sample error exponents of the absolutely optimal and best identical-
quantizer systems converge at a rate O(n
1/2
). It also motivates the following
observations.
The rst observation concerns the accuracy, or tightness, of the O(n
1/2
)
convergence factor. Although the upper and lower bounds on

n
() in Theo-
rem 10.36 are based on a suboptimal null acceptance region, examples in the
context of centralized testing show that in general, the O(n
1/2
) rate cannot be
improved on. At the same time, it is rather unlikely that the ratio

n
()/

n
()
decays as fast as expc
t

n, i.e., there probably exists a lower bound which is


tighter than what is implied by Theorem 10.36. As we shall see later in Theo-
rem 10.40, under some regularity assumptions on the mean and variance of the
post-quantization log-likelihood ratio computed w.r.t. the null distribution, the
ratio

n
()/

n
() is indeed bounded from below.
The second observation is about the composition of an optimal quantizer set
(g
1
, . . . , g
n
) for o
n
. The proof of Theorem 10.36 also yields an upper bound on
the number of quantizers that are at least distant from the deterministic LRQ
that achieves e

NP
(). Specically, if K
n
() is the number of indices i for which
D(P
g
i
|Q
g
i
) < D
m

(where > 0), then
K
n
()
n
= O(n
1/2
) .
Thus in an optimal system, most quantizers will be essentially identical to the
one that achieves e

NP
(). (This conclusion can be signicantly strengthened if
additional assumption is made in the case of Neyman-Pearson testing [1].)
269
In the remainder of this section we discuss the asymptotics of Neyman-
Pearson testing in situations where Assumption 10.33 does not hold. By the
remark following the proof of Theorem 10.34, this condition is violated if and
only if
limsup
t
E
P
[X
2
t
] = , (10.2.10)
where
t
is the binary LRP dened by

t
=
_
(, t] , (t, )
_
.
We now distinguish between three cases.
Case A. limsup
t
E
P
[X
t
] = .
Case B. 0 < limsup
t
E
P
[X
t
] < .
Case C. limsup
t
E
P
[X
t
] = 0 and limsup
t
E
P
[X
2
t
] = .
Example Let the observation space be the unit interval (0, 1] with its Borel
eld. For a > 0, dene the distributions P and Q by
PY y = y , QY y = exp
_
a + 1
a
_
1
1
y
a
__
.
The pdf of Q is strictly increasing in y, and thus the likelihood ratio (dP/dQ)(y)
is strictly decreasing in y. Hence the event X > t can also be written as
Y < y
t
, where y
t
0 as t . Using this equivalence, we can examine the
limiting behavior of E
P
[X
t
] and E
P
[X
2
t
] to obtain:
a. a > 1: lim
t
E
P
[X
t
] = lim
t
E
P
[X
2
t
] = (Case A)
b. a = 1: lim
t
E
P
[X
t
] = 2, lim
t
E
P
[X
2
t
] = (Case B)
c. 1/2 < a < 1: lim
t
E
P
[X
t
] = 0, lim
t
E
P
[X
2
t
] = (Case C)
d. a 1/2: lim
t
E
P
[X
2
t
] < (Assumption 10.33 is satised) .
2
In Case A, the error exponents e

NP
() and e

NP
() are both innite. This
result is neither dicult to prove nor surprising, considering that E
P
[X
g
] =
D(P
g
|Q
g
) can be made arbitrarily large by choice of the quantizer g.
270
Theorem 10.37 (result for Case A) If limsup
t
E
P
[X
t
] = , then for
all m 2 and (0, 1),
e

NP
() = e

NP
() = . 2
We now turn to case B, which is more interesting. Using the notation p(t) =
PX > t and q(t) = QX > t introduced earlier, we have
E
P
[X
t
] = (1 p(t)) log
1 p(t)
1 q(t)
+ p(t) log
p(t)
q(t)
.
The rst summand on the r.h.s. clearly tends to zero (as t , which is under-
stood throughout), hence the lim sup of the second summand p(t) log[p(t)/q(t)]
is greater than zero. Since p(t) tends to zero, both log[p(t)/q(t)] and
p(t) log
2
[p(t)/q(t)] have lim sup equal to innity. Thus in particular, (10.2.10)
always holds in Case B.
A separate argument (which we omit) reveals that the centralized error ex-
ponent D(P|Q) = E
P
[X] is also innite in Case B. Yet unlike Case A, the
decentralized error exponent e

NP
() obtained here is not innite. Quite surpris-
ingly, if this exponent exists, then it must depend on the test level . This is
stated in the following theorem.
Theorem 10.38 (result for Case B) Consider hypothesis testing with m-ary
quantization, where m 2. If
0 < limsup
t
E
P
[X
t
] < , (10.2.11)
then there exist:
1. an increasing sequence of integers n
k
, k N and a function L : (0, 1)
(0, ) which is monotonically increasing to innity, such that
liminf
k

1
n
k
log

n
k
() L() D
m
;
2. a function M : (0, 1) (0, ) which is monotonically increasing to
innity and is such that
limsup
n

1
n
log

n
() M() .
271
Proof:
(i) Lower bound. As was argued in the proof of Theorem 10.36, an error exponent
equal to D
m
can be achieved using identical quantizers; hence one part of the
bound follows immediately. In what follows, we construct a sequence of identical-
quantizer detection schemes with nite sample-error exponent almost always
exceeding L(), where L() increases to innity as tends to unity.
Let limsup
t
E
P
[X
t
], so that 0 < < by (10.2.11). By subsequence
selection, we obtain a sequence of LRPs
t
k
with the property that
k
E
P
[X

k
]
converges to as k . (We eliminate t from all subscripts to simplify the
notation.) Letting p
k
= p(t
k
), q
k
= q(t
k
),
k
= log[(1 p
k
)/(1 q
k
)] and

k
= log(p
k
/q
k
), we can write

k
= (1 p
k
)
k
+ p
k

k
,
where by the discussion preceding this theorem, p
k
and
k
tend to zero and
k
increases to innity.
Fix > 0 and assume w.l.o.g. that
k
>
k1
+ (1/). Consider a system
consisting of n
k
sensors, where n
k
=
k
|, and let each sensor employ the same
binary LRQ with LRP
k
. This choice is clearly suboptimal (since m 2), but
it suces for our purposes.
Dene the set /
k
1, 2
n
k
by
/
k
= (u
1
, . . . , u
n
k
) : at least one u
j
equals 2 .
Recalling that U
j
= 2 if, and only if, the observed log-likelihood ratio is larger
than t
k
, we have
P

(/
k
) = 1 (1 p
k
)
n
k
,
and thus
lim
k
(1 P

(/
k
)) = lim
k
(1 p
k
)
n
k
= lim
k
(1 p
k
)

k
=
_
lim
k
(1 p
k
)

k
_

= exp ,
where the last equality follows from the fact that
k
and p
k

k
as
k .
Thus given any > 0, for all suciently large k the set /
k
is admissible
(albeit not necessarily optimal) as a null acceptance region for testing at level
272
= exp + . For this value of , we have

n
k
() Q

(/
k
)
= 1 (1 q
k
)
n
k
= 1 (1 p
k
exp
k
)
n
k
= n
k
p
k
exp
k

n
k
(n
k
1)
2!
p
2
k
exp2
k

+ (1)
n
k
p
n
k
k
expn
k

n
k
p
k
exp
k
+ n
2
k
p
2
k
exp2
k
+
Summing the geometric series, we obtain

n
k
()
n
k
p
k
exp
k

1 n
k
p
k
exp
k

.
The r.h.s. denominator tends to unity because
k
and n
k
p
k
as
k . Since
k
/n
k
1/, we conclude that
liminf
k

1
n
k
log

n
k
()
1

=

log(1/) +
.
As > 0 and > 0 were chosen arbitrarily, it follows that
liminf
k

1
n
k
log

n
k
()

log(1/)
for all (0, 1). The lower bound in statement (i) of the theorem is obtained
by taking L() / log(1/).
(ii) Upper bound. Consider an optimal detection scheme for o
n
, with the same
setup as in the proof of Theorem 10.36. Recall in particular that the fusion center
employs a randomized test with log-likelihood threshold
n
and randomization
constant
n
.
For to be an upper bound on the error exponent of

n
(), it suces that n
be greater than
n
and such that the events
_
n
i=1
X
g
i
n
_
and
_
n
i=1
X
g
i

n
_
have signicant overlap under P
g
. Indeed, if >
n
/n is such that for all
suciently large n,

n
P
g
_

n
i=1
X
g
i
=
n
_
+ P
g
_
n

n
i=1
X
g
i
>
n
_
> 0 , (10.2.12)
then

n
() > expn, as required.
The threshold
n
is rather dicult to determine, so we use an indirect method
for nding . We have
P
g
_

n
i=1
X
g
i
> n
_
P
g
_

n
i=1
[X
g
i
[ > n
_

1

sup
gm
E
P
[[X
g
[] ,
273
where the last bound follows from the Markov inequality. We claim that the
supremum in this relationship is nite. This is because the negative part of
X
g
has bounded expectation under P (see the discussion following Assump-
tion 10.33), and the proof of Theorem 10.34 can be easily modied to show that

t
sup
gm
E
P
[[X
g
[] is nite i = limsup
t
E
P
[X
t
] is (which is our current
hypothesis). Thus
P
g
_

n
i=1
X
g
i
> n
_


t

.
Now let > 0 and =
t
/(1), so that P
g
_
n
i=1
X
g
i
> n
_
1.
The Neyman-Pearson lemma immediately yields n >
n
. Also, using

n
P
g
_

n
i=1
X
g
i
=
n
_
+ P
g
_

n
i=1
X
g
i
>
n
_
= 1
and a simple contradiction, we obtain (10.2.12). Thus the chosen value of is an
upper bound on limsup
n
(1/n) log

n
(). Since > 0 can be made arbitrarily
small, we have that
M()

t
1
is also an upper bound. 2
From Theorem 10.38 we conclude that in Case B, the error exponent e

NP
()
must lie between the bounds L() and M() whenever it exists as a limit. (Since

t
D
m
and 1 log(1/), the inequality M() L() is indeed true.)
These bounds are shown in Figure 10.5.
Theorem 10.39 (result for Case C) In Case C,
e

NP
() = e

NP
() = D(P|Q).
Theorem 10.40 (result under boundedness assumption) Let 1 sat-
isfy (10.2.5). If 1/2, or if > 1/2 and observation space is nite, then

n
()

n
()
expc
t
(, )n
1
2
.
In particular, if (10.2.5) holds for 1, then the ratio

n
()/

n
() is bounded
from below.
10.2.2 Bayes testing in parallel distributed detection sys-
tems
We now turn to the asymptotic study of optimal Bayes detection in o
n
. The
prior probabilities of H
0
and H
1
are denoted by and 1 , respectively, the
274
0 1
Upper bound
Lower bound

NP
()
D
m

t
Figure 10.5: Upper and lower bounds on e

NP
() in Case B.
probability of error of the absolutely optimal system is denoted by

n
(), and the
probability of error of the best identical-quantizer system is denoted by

n
().
In our analysis, we will always assume that o
n
employs deterministic m-
ary LRQs represented by LRPs
1
, . . . ,
n
. This is clearly sucient because
randomization is of no help in Bayes testing.
Theorem 10.41 In Bayes testing with m-ary quantization,
liminf
n

n
()

n
()
> 0 (10.2.13)
for all (0, 1).
10.3 Capacity region of multiple access channels
Denition 10.42 (discrete memoryless multiple access channel) A dis-
crete memoryless multiple access channel contains several senders
(X
1
, X
2
, . . . , X
m
)
and one receiver Y , which are respectively dened over nite alphabet (A
1
,A
2
,. . .)
and . Also given is the transition probability P
Y [X
1
,X
2
,...,Xm
.
275
For simplicity, we will focus on the system with only two senders. The block
code for this simple multiple access channel is dened below.
Denition 10.43 (block code for multiple access channels) A block code
(n, M
1
M
2
)
for multiple access channel has block length n and rates R
1
= (1/n) log
2
M
1
and
R
2
= (1/n) log
2
M
2
respectively for each sender as:
f
1
: 1, . . . , M
1
A
n
1
,
and
f
2
: 1, . . . , M
2
A
n
2
.
Upon receipt of the channel output, the decoder is a mapping
g :
n
1, . . . , M
1
1, . . . , M
2
.
Theorem 10.44 (capacity region of memoryless multiple access chan-
nel) The capacity region for memoryless multiple access channel is the convex
set of the set
_
(R
1
, R
2
) (1
+
0)
2
: R
1
I(X
1
; Y [X
2
), R
2
I(X
2
; Y [X
1
)
and R
1
+ R
2
I(X
1
, X
2
; Y ) .
Proof: Again, the achievability part is proved using random coding arguments
and typical-set decoder. The converse part is based on the Fanos inequality. 2
10.4 Degraded broadcast channel
Denition 10.45 (broadcast channel) A broadcast channel consists of one
input alphabet A and two (or more) output alphabets
1
and
2
. The noise is
dened by the conditional probability P
Y
1
,Y
2
[X
(y
1
, y
2
[x).
Example 10.46 Examples of broadcast channels are
Cable Television (CATV) network;
Lecturer in classroom;
Code Division Multiple Access channels.
276
Denition 10.47 (degraded broadcast channel) A broadcast channel is
said to be degraded if
P
Y
1
,Y
2
[X
(y
1
, y
2
[x) = P
Y
1
[X
(y
1
[x)P
Y
2
[Y
1
(y
2
[y
1
).
It can be veried that when X Y
1
Y
2
forms a Markov chain, in which
P
Y
2
[Y
1
,X
(y
2
[y
1
, x) = P
Y
2
[Y
1
(y
2
[y
1
), a degraded broadcast channel is resulted. This
indicates that the parallelly broadcast channel degrades to a serially broad-
cast channel, where the channel output Y
2
can only obtain information from
channel input X through the previous channel output Y
1
.
Denition 10.48 (block code for broadcast channel) A block code for
broadcast channel consists of one encoder f() and two (or more) decoders g
i
()
as
f : 1, . . . , 2
nR
1
1, . . . , 2
nR
2
A
n
,
and
g
1
: A
n
1, . . . , 2
nR
1
,
g
2
: A
n
1, . . . , 2
nR
2
.
Denition 10.49 (error probability) Let the source index random variable
be W
1
and W
2
, namely W
1
1, . . . , 2
nR
1
and W
2
1, . . . , 2
nR
2
. Then the
probability of error is dened as
P
e
PrW
1
,= g
1
[f(W
1
, W
2
)] or W
2
,= g
2
[f(W
1
, W
2
)].
Theorem 10.50 (capacity region for degraded broadcast channel) The
capacity region for memoryless degraded broadcast channel is the convex hull of
the closure of
_
U
(R
1
, R
2
) : R
1
I(X; Y
1
[U) and R
2
I(U; Y
2
) ,
where the union is taking over all U satisfying U X (Y
1
, Y
2
) with alphabet
size [|[ min[A[, [
1
[, [
2
[.
Note that U X Y
1
Y
2
is equivalent to
P
U,X,Y
1
,Y
2
(u, x, y
1
, y
2
) = P
U
(u)P
X[U
(x[u)P
Y
1
,Y
2
[X,U
(y
1
, y
2
[x, u)
= P
U
(u)P
X[U
(x[u)P
Y
1
,Y
2
[X
(y
1
, y
2
[x)
= P
U
(u)P
X[U
(x[u)P
Y
1
[X
(y
1
[x)P
Y
2
[Y
1
(y
2
[y
1
).
277
Example 10.51 (capacity region for degraded BSC) Suppose P
Y
1
[X
and
P
Y
2
[Y
1
are BSC with crossover
1
and
2
, respectively. Then the capacity region
can be parameterized through as:
R
1
h
b
(
1
) h
b
(
1
)
R
2
1 h
b
( (
1
(1
2
) + (1
1
)
2
)),
where P
X[U
(0[1) = P
X[U
(1[0) = and U 0, 1.
Example 10.52 (capacity region for degraded AWGN channel) The
channel is modeled as
Y
1
= X + N
1
and Y
2
= Y
1
+ N
2
,
where the noise power for N
1
and N
2
are
2
1
and
2
2
, respectively. Then the
capacity region for input power constraint S should satisfy
R
1

1
2
log
2
_
1 +
S

2
1
_
R
2

1
2
log
2
_
1 +
(1 )S
S +
2
1
+
2
2
_
,
for any [0, 1].
10.5 Gaussian multiple terminal channels
In the previous chapter, we have dealt with the Gaussian channel with possibly
several inputs, and obtained the the best power distribution of each input should
follow the water-lling scheme. In the problem, the encoder is dened as
f : 1
m
1,
which means that all the inputs are observed by the same terminal and hence,
can be utilized in a centralized fashion. However, for Gaussian multiple terminal
channels, these inputs are observed distributed by dierent channels, and hence,
need to be encoded separately. In other words, the encoder now becomes
f
1
: 1 1
f
2
: 1 1
.
.
.
f
m
: 1 1
278
So we have now m (independent) transmitters, and one receiver in the system.
The system can be modeled as
Y =
m

i=1
X
i
+ N.
Theorem 10.53 (capacity region for AWGN multiple access channel)
Suppose each transmitter has (constant) power constraint S
i
. Let I denote the
subset of 1, 2, . . . , m. Then the capacity region should be
_
(R
1
, . . . , R
m
) : ( I)

iI
R
i

1
2
log
2
_
1 +

iI
S
i

2
_
_
,
where
2
is the noise power of N.
Note that S
i
/
2
is the SNR ratio for each terminal. So if one can aord to
unlimited number of terminals with xed transmitting power S, he can make
the channel capacity as large as he desired. On the other hand, if we have the
constraint that the total power of the system is xed, then distributing the power
to m terminals can only reduce the capacity due to the distributed encoding.
279
Bibliography
[1] P.-N. Chen and A. Papamarcou, New asymptotic results in parallel dis-
tributed detection, IEEE Trans. Info. Theory, vol. 39, no. 6, pp. 18471863,
Nov. 1993.
[2] J. A. Fill and M. J. Wichura, The convergence rate for the strong law
of large numbers: General lattice distributions, Probab. Th. Rel. Fields,
vol. 81, pp. 189212, 1989.
[3] J. N. Tsitsiklis, Decentralized Detection by a large number of sensors,
Mathematics of Control, Signals and Systems, vol. 1, no. 2, pp. 167182,
1988.
280