Contents

Abstract

Part I: Fundamentals

1 Introduction
  1.1 Motivation for this Monograph
  1.2 Preview of this Monograph
  1.3 Fundamentals of Information Theory
      1.3.1 Notation
      1.3.2 Information-Theoretic Quantities
  1.4 The Method of Types
  1.5 Probability Bounds
      1.5.1 Basic Bounds
      1.5.2 Central Limit-Type Bounds

Part II: Point-to-Point Communication

3 Source Coding
  3.1 Lossless Source Coding: Non-Asymptotic Bounds
      3.1.1 An Achievability Bound
      3.1.2 A Converse Bound
  3.2 Lossless Source Coding: Asymptotic Expansions
  3.3 Second-Order Asymptotics of Lossless Source Coding via the Method of Types
  3.4 Lossy Source Coding: Non-Asymptotic Bounds
      3.4.1 An Achievability Bound
      3.4.2 A Converse Bound
  3.5 Lossy Source Coding: Asymptotic Expansions
  3.6 Second-Order Asymptotics of Lossy Source Coding via the Method of Types

4 Channel Coding
  4.1 Definitions and Non-Asymptotic Bounds
      4.1.1 Achievability Bounds
      4.1.2 A Converse Bound
  4.2 Asymptotic Expansions for Discrete Memoryless Channels

Part III: Network Information Theory
Abstract

This monograph presents a unified treatment of single- and multi-user problems in Shannon's information theory where we depart from the requirement that the error probability decays to zero as the blocklength grows. Instead, the error probabilities for various problems are bounded above by a non-vanishing constant and the spotlight is shone on achievable coding rates as functions of the growing blocklengths. This represents the study of asymptotic estimates with non-vanishing error probabilities.

In Part I, after reviewing the fundamentals of information theory, we discuss Strassen's seminal result for binary hypothesis testing, where the type-I error probability is non-vanishing and the rate of decay of the type-II error probability with growing number of independent observations is characterized. In Part II, we use this basic hypothesis testing result to develop second- and sometimes even third-order asymptotic expansions for point-to-point communication. Finally, in Part III, we consider network information theory problems for which the second-order asymptotics are known. These problems include some classes of channels with random state, the multiple-encoder distributed lossless source coding (Slepian-Wolf) problem and special cases of the Gaussian interference and multiple-access channels. Finally, we discuss avenues for further research.
Part I
Fundamentals
Chapter 1
Introduction
Claude E. Shannon's epochal "A Mathematical Theory of Communication" [141] marks the dawn of the digital age. In his seminal paper, Shannon laid the theoretical and mathematical foundations for the basis of all communication systems today. It is not an exaggeration to say that his work has had a tremendous impact in communications engineering and beyond, in fields as diverse as statistics, economics, biology and cryptography, just to name a few.

It has been more than 65 years since Shannon's landmark work was published. Along with impressive research advances in the field of information theory, numerous excellent books on various aspects of the subject have been written. The author's favorites include Cover and Thomas [33], Gallager [56], Csiszár and Körner [39], Han [67], Yeung [189] and El Gamal and Kim [49]. Is there sufficient motivation to consolidate and present another aspect of information theory systematically? It is the author's hope that the answer is in the affirmative.
To motivate why this is so, let us recapitulate two of Shannon's major contributions in his 1948 paper. First, Shannon showed that to reliably compress a discrete memoryless source (DMS) $X^n = (X_1,\ldots,X_n)$, where each $X_i$ has the same distribution as a common random variable $X$, it is sufficient to use $H(X)$ bits per source symbol in the limit of large blocklengths $n$, where $H(X)$ is the Shannon entropy of the source. By reliable, it is meant that the probability of incorrect decoding of the source sequence tends to zero as the blocklength $n$ grows. Second, Shannon showed that it is possible to reliably transmit a message $M \in \{1,\ldots,2^{nR}\}$ over a discrete memoryless channel (DMC) $W$ as long as the message rate $R$ is smaller than the capacity of the channel $C(W)$. Similarly to the source compression scenario, by reliable, one means that the probability of incorrectly decoding $M$ tends to zero as $n$ grows.
There is, however, substantial motivation to revisit the criterion of having error probabilities vanish asymptotically. To state Shannon's source compression result more formally, let us define $M^*(P^n,\varepsilon)$ to be the minimum code size for which the length-$n$ DMS $P^n$ is compressible to within an error probability $\varepsilon \in (0,1)$. Then, Theorem 3 of Shannon's paper [141], together with the strong converse for lossless source coding [49, Ex. 3.15], states that
$$\lim_{n\to\infty} \frac{1}{n} \log M^*(P^n,\varepsilon) = H(X). \qquad (1.1)$$
Analogously, defining $M^*_{\mathrm{ave}}(W^n,\varepsilon)$ to be the maximum code size for which the average error probability of transmission over the DMC $W^n$ is no larger than $\varepsilon$, one has
$$\lim_{n\to\infty} \frac{1}{n} \log M^*_{\mathrm{ave}}(W^n,\varepsilon) = C(W). \qquad (1.2)$$
In many practical communication settings, one does not have the luxury of being able to design an arbitrarily long code, so one must settle for a non-vanishing, and hence finite, error probability $\varepsilon$. In this finite blocklength and non-vanishing error probability setting, how close can one hope to get to the asymptotic limits $H(X)$ and $C(W)$? This is, in general, a difficult question because exact evaluations of $\log M^*(P^n,\varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n,\varepsilon)$ are intractable, apart from a few special sources and channels.
In the early years of information theory, Dobrushin [45], Kemperman [91] and, most prominently, Strassen [152] studied asymptotic expansions of the form
$$\log M^*(P^n,\varepsilon) = nH(X) + \sqrt{nV(X)}\,\Phi^{-1}(1-\varepsilon) + O(\log n), \qquad (1.3)$$
where $V(X)$ is known as the source dispersion or the varentropy, terms introduced by Kostina-Verdú [97] and Kontoyiannis-Verdú [95]. In (1.3), $\Phi^{-1}$ is the inverse of the Gaussian cumulative distribution function. Observe that the first-order term in the asymptotic expansion above, namely $H(X)$, coincides with the (first-order) fundamental limit shown by Shannon. From this expansion, one sees that if the error probability is fixed to $\varepsilon < \frac{1}{2}$, the extra rate above the entropy we have to pay for operating at finite blocklength $n$ with admissible error probability $\varepsilon$ is approximately $\sqrt{V(X)/n}\,\Phi^{-1}(1-\varepsilon)$. Thus, the quantity $V(X)$, which is a function of $P$ just like the entropy $H(X)$, quantifies how fast the rates of optimal source codes converge
to $H(X)$. Similarly, for well-behaved DMCs, under mild conditions, Strassen showed that the limiting statement in (1.2) may be refined to
$$\log M^*_{\mathrm{ave}}(W^n,\varepsilon) = nC(W) + \sqrt{nV(W)}\,\Phi^{-1}(\varepsilon) + O(\log n), \qquad (1.4)$$
where $V(W)$ is a channel parameter known as the channel dispersion, a term introduced by Polyanskiy-Poor-Verdú [123].¹ Thus the backoff from capacity at finite blocklengths $n$ and average error probability $\varepsilon$ is approximately $\sqrt{V(W)/n}\,\Phi^{-1}(1-\varepsilon)$.
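The two-term (Gaussian) approximations above are easy to evaluate numerically. The following Python sketch, under the assumption of a Bernoulli($p$) source, computes the entropy, the varentropy, and the resulting approximation to the optimal finite-blocklength rate; the function names and the bisection-based inverse Gaussian cdf are our own illustrative choices, not from the monograph.

```python
import math

def Phi(y):
    # Standard Gaussian cdf via the error function.
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def Phi_inv(eps, lo=-10.0, hi=10.0):
    # Inverse Gaussian cdf by bisection; adequate for eps in (0, 1).
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bernoulli_entropy(p):
    # Shannon entropy H(X) in bits.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_varentropy(p):
    # Varentropy V(X) = Var[-log2 P(X)] for a Bernoulli(p) source.
    return p * (1 - p) * (math.log2(p / (1 - p))) ** 2

def approx_rate(p, n, eps):
    # Two-term Gaussian approximation to (1/n) log M*(P^n, eps), cf. (1.3).
    H = bernoulli_entropy(p)
    V = bernoulli_varentropy(p)
    return H + math.sqrt(V / n) * Phi_inv(1 - eps)
```

For example, for a Bernoulli(0.11) source at blocklength $n = 1000$ and $\varepsilon = 10^{-3}$, the second term is positive, so the approximate optimal rate sits strictly above $H(X) \approx 0.5$ bits per symbol.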
1.1 Motivation for this Monograph
It turns out that Gaussian approximations (the first two terms of (1.3) and (1.4)) are good proxies to the true non-asymptotic fundamental limits ($\log M^*(P^n,\varepsilon)$ and $\log M^*_{\mathrm{ave}}(W^n,\varepsilon)$) at moderate blocklengths and moderate error probabilities for some channels and sources, as shown by Polyanskiy-Poor-Verdú [123] and Kostina-Verdú [97]. For error probabilities that are not too small (e.g., $\varepsilon \in [10^{-6}, 10^{-3}]$), the Gaussian approximation is often better than that provided by traditional error exponent or reliability function analysis [39, 56], where the code rate is fixed (below the first-order fundamental limit) and the exponential decay of the error probability is analyzed. Recent refinements to error exponent analysis using exact asymptotics [10, 11, 135] or saddlepoint approximations [137] are alternative proxies to the non-asymptotic fundamental limits. The accuracy of the Gaussian approximation in practical regimes of errors and finite blocklengths gives us motivation to study refinements to the first-order fundamental limits of other single- and multi-user problems in Shannon theory.
The study of asymptotic estimates with non-vanishing error probabilities, or more succinctly, fixed error asymptotics, also uncovers several interesting phenomena that are not observable from studies of first-order fundamental limits in single- and multi-user information theory [33, 49]. This analysis may give engineers deeper insight into the design of practical communication systems. A non-exhaustive list includes:

1. Shannon showed that separating the tasks of source and channel coding is optimal rate-wise [141]. As we see in Section 4.5.2 (and similarly to the case of error exponents [35]), this is not the case when the probability of excess distortion of the source is allowed to be non-vanishing.

2. Shannon showed that feedback does not increase the capacity of a DMC [142]. It is known, however, that variable-length feedback [125] and full output feedback [8] improve on the fixed error asymptotics of DMCs.
¹ Some of the results in [122, 123] were already announced by S. Verdú in his Shannon lecture at the 2007 International Symposium on Information Theory (ISIT) in Nice, France.
3. It is known that the entropy can be achieved universally for fixed-to-variable length almost lossless source coding of a DMS [192], i.e., the source statistics do not have to be known. The redundancy has also been studied for prefix-free codes [27]. In the fixed error setting (a setting complementary to [27]), it was shown by Kosut and Sankar [100, 101] that universality imposes a penalty in the third-order term of the asymptotic expansion in (1.3).

4. Han showed that the output from any source encoder at the optimal coding rate with asymptotically vanishing error appears almost completely random [68]. This is the so-called folklore theorem. Hayashi [75] showed that the analogue of the folklore theorem does not hold when we consider the second-order terms in asymptotic expansions (i.e., the second-order asymptotics).

5. Slepian and Wolf showed that separate encoding of two correlated sources incurs no loss rate-wise compared to the situation where side information is also available at all encoders [151]. As we shall see in Chapter 6, the fixed error asymptotics in the vicinity of a corner point of the polygonal Slepian-Wolf region suggests that side information at the encoders may be beneficial.
None of the aforementioned books [33, 39, 49, 56, 67, 189] focus exclusively on the situation where the error probabilities of various Shannon-theoretic problems are upper bounded by $\varepsilon \in (0,1)$ and asymptotic expansions or second-order terms are sought. This is what this monograph attempts to do.
1.2 Preview of this Monograph
This monograph is organized as follows: In the remaining parts of this chapter, we recap some quantities in information theory and results in the method of types [37, 39, 74], a particularly useful tool for the study of discrete memoryless systems. We also mention some probability bounds that will be used throughout the monograph. Most of these bounds are based on refinements of the central limit theorem, and are collectively known as Berry-Esseen theorems [17, 52]. In Chapter 2, our study of asymptotic expansions of the form (1.3) and (1.4) begins in earnest by revisiting Strassen's work [152] on binary hypothesis testing, where the probability of false alarm is constrained to not exceed a positive constant. We find it useful to revisit the fundamentals of hypothesis testing, as many information-theoretic problems such as source and channel coding are intimately related to hypothesis testing.
Part II of this monograph begins our study of information-theoretic problems, starting with lossless and lossy compression in Chapter 3. We emphasize, in the first part of this chapter, that (fixed-to-fixed length) lossless source coding and binary hypothesis testing are, in fact, the same problem, and so the asymptotic expansions developed in Chapter 2 may be directly employed for the purpose of lossless source coding. Lossy source coding, however, is more involved. We review the recent works in [86] and [97], where the authors independently derived asymptotic expansions for the logarithm of the minimum size of a source code that reproduces symbols up to a certain distortion, with some admissible probability of excess distortion. Channel coding is discussed in Chapter 4. In particular, we study the approximation in (1.4) for both discrete memoryless and Gaussian channels. We make it a point here to be precise about the third-order $O(\log n)$ term. We state conditions on the channel under which the coefficient of the $O(\log n)$ term can be determined exactly. This leads to some new insights concerning optimum codes for the channel coding problem. Finally, we marry source and channel coding in the study of source-channel transmission, where the probability of excess distortion in reproducing the source is non-vanishing.
Part III of this monograph contains a sparse sampling of fixed error asymptotic results in network information theory. The problems we discuss here have conclusive second-order asymptotic characterizations (analogous to the second terms in the asymptotic expansions in (1.3) and (1.4)). They include some channels with random state (Chapter 5), such as Costa's writing on dirty paper [30], mixed DMCs [67, Sec. 3.3], and quasi-static single-input multiple-output (SIMO) fading channels [18]. Under the fixed error setup, we also consider the second-order asymptotics of the Slepian-Wolf [151] distributed lossless source coding problem
[Figure 1.1: Dependence graph of the chapters in this monograph. An arrow from node s to t means that results and techniques in Chapter s are required to understand the material in Chapter t. Nodes: 1. Introduction; 2. Hypothesis Testing; 3. Source Coding; 4. Channel Coding; 5. Channels with State; 6. Slepian-Wolf; 7. Gaussian IC; 8. Gaussian A-MAC.]
(Chapter 6), the Gaussian interference channel (IC) in the strictly very strong interference regime [22] (Chapter 7), and the Gaussian multiple access channel (MAC) with degraded message sets (Chapter 8). The MAC with degraded message sets is also known as the cognitive [44] or asymmetric [72, 167, 128] MAC (A-MAC). Chapter 9 concludes with a brief summary of other results, together with open problems in this area of research. A dependence graph of the chapters in the monograph is shown in Fig. 1.1.
This area of information theory (fixed error asymptotics) is vast and, at the same time, rapidly expanding. The results described herein are not meant to be exhaustive and were somewhat dependent on the author's understanding of the subject and his preferences at the time of writing. However, the author has made it a point to ensure that the results herein are conclusive in nature. This means that the problem is solved in the information-theoretic sense, in that an operational quantity is equated to an information quantity. In terms of asymptotic expansions such as (1.3) and (1.4), by solved, we mean that either the second-order term is known or, better still, both the second- and third-order terms are known. Having articulated this, the author confesses that there are many relevant information-theoretic problems that can be considered solved in the fixed error setting, but have not found their way into this monograph either due to space constraints or because it was difficult to meld them seamlessly with the rest of the story.
1.3 Fundamentals of Information Theory
In this section, we review some basic information-theoretic quantities. As with every article published in Foundations and Trends in Communications and Information Theory, the reader is expected to have some background in information theory. Nevertheless, the only prerequisite required to appreciate this monograph is information theory at the level of Cover and Thomas [33]. We will also make extensive use of the method of types, for which excellent expositions can be found in [37, 39, 74]. To keep the exposition accessible to as wide an audience as possible, the measure-theoretic foundations of probability will not be needed.
1.3.1 Notation

The notation we use is reasonably standard and generally follows the books by Csiszár-Körner [39] and Han [67]. Random variables (e.g., $X$) and their realizations (e.g., $x$) are in upper and lower case respectively. Random variables that take on finitely many values have alphabets (supports) that are denoted by calligraphic font (e.g., $\mathcal{X}$). The cardinality of the finite set $\mathcal{X}$ is denoted as $|\mathcal{X}|$. Let the random vector $X^n$ be the vector of random variables $(X_1,\ldots,X_n)$. We use boldface $\mathbf{x} = (x_1,\ldots,x_n)$ to denote a realization of $X^n$. The set of all distributions (probability mass functions) supported on alphabet $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. The set of all conditional distributions (i.e., channels) with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ is denoted by $\mathcal{P}(\mathcal{Y}|\mathcal{X})$. The joint distribution induced by a marginal distribution $P \in \mathcal{P}(\mathcal{X})$ and a channel $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ is denoted as $P \times V$, i.e.,
$$(P \times V)(x,y) := P(x)\, V(y|x). \qquad (1.5)$$
The marginal output distribution induced by $P$ and $V$ is denoted as $PV$, i.e.,
$$PV(y) := \sum_{x \in \mathcal{X}} P(x)\, V(y|x). \qquad (1.6)$$
1.3.2 Information-Theoretic Quantities

Information-theoretic quantities are denoted in the usual way [39, 49]. All logarithms and exponential functions are to the base 2. The entropy of a discrete random variable $X$ with probability distribution $P \in \mathcal{P}(\mathcal{X})$ is denoted as
$$H(X) = H(P) := -\sum_{x \in \mathcal{X}} P(x) \log P(x). \qquad (1.7)$$
For the sake of clarity, we will sometimes make the dependence on the distribution $P$ explicit. Similarly, given a pair of random variables $(X,Y)$ with joint distribution $P \times V \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, the conditional entropy of $Y$ given $X$ is written as
$$H(Y|X) = H(V|P) := -\sum_{x \in \mathcal{X}} P(x) \sum_{y \in \mathcal{Y}} V(y|x) \log V(y|x). \qquad (1.8)$$
The mutual information is a measure of the correlation or dependence between random variables $X$ and $Y$. It is interchangeably denoted as
$$I(X;Y) := H(Y) - H(Y|X), \quad \text{or} \quad I(P,V) := H(PV) - H(V|P). \qquad (1.11)\text{--}(1.12)$$
Given three random variables $(X,Y,Z)$ with joint distribution $P \times V \times W$, where $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ and $W \in \mathcal{P}(\mathcal{Z}|\mathcal{X} \times \mathcal{Y})$, the conditional mutual information is
$$I(Y;Z|X) := H(Z|X) - H(Z|X,Y), \quad \text{or} \quad I(V,W|P) := \sum_{x \in \mathcal{X}} P(x)\, I\big(V(\cdot|x),\, W(\cdot|x,\cdot)\big). \qquad (1.13)\text{--}(1.14)$$
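The identities above are mechanical to verify numerically. The following Python sketch (helper names are ours) computes $H(P)$, $H(V|P)$, the output distribution $PV$, and the mutual information $I(P,V) = H(PV) - H(V|P)$ for a channel given as a row-stochastic matrix.

```python
import math

def H(P):
    # Entropy in bits of a pmf given as a list of probabilities, cf. (1.7).
    return -sum(p * math.log2(p) for p in P if p > 0)

def cond_H(V, P):
    # H(V|P) = sum_x P(x) H(V(.|x)), cf. (1.8).
    return sum(p * H(row) for p, row in zip(P, V))

def output_dist(P, V):
    # PV(y) = sum_x P(x) V(y|x), cf. (1.6).
    ny = len(V[0])
    return [sum(P[x] * V[x][y] for x in range(len(P))) for y in range(ny)]

def mutual_info(P, V):
    # I(P, V) = H(PV) - H(V|P), cf. (1.11)-(1.12).
    return H(output_dist(P, V)) - cond_H(V, P)

# Example: binary symmetric channel with crossover 0.1, uniform input.
P = [0.5, 0.5]
V = [[0.9, 0.1], [0.1, 0.9]]
```

For this example, `mutual_info(P, V)` recovers the familiar BSC value $1 - H_b(0.1) \approx 0.531$ bits.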
A particularly important quantity is the relative entropy (or Kullback-Leibler divergence [102]) between $P$ and $Q$, which are distributions on the same finite support set $\mathcal{X}$. It is defined as the expectation with respect to $P$ of the log-likelihood ratio $\log \frac{P(x)}{Q(x)}$, i.e.,
$$D(P\|Q) := \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}. \qquad (1.15)$$
Note that if there exists an $x \in \mathcal{X}$ for which $Q(x) = 0$ while $P(x) > 0$, then the relative entropy $D(P\|Q) = \infty$. If for every $x \in \mathcal{X}$, $Q(x) = 0$ implies $P(x) = 0$, we say that $P$ is absolutely continuous with respect to $Q$ and denote this relation by $P \ll Q$. In this case, the relative entropy is finite. It is well known that $D(P\|Q) \ge 0$, and equality holds if and only if $P = Q$. Additionally, the conditional relative entropy between $V, W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ given $P \in \mathcal{P}(\mathcal{X})$ is defined as
$$D(V\|W|P) := \sum_{x \in \mathcal{X}} P(x)\, D\big(V(\cdot|x)\,\|\,W(\cdot|x)\big). \qquad (1.16)$$
The mutual information is a special case of the relative entropy. In particular, we have
$$I(P,V) = D(P \times V \,\|\, P \times PV) = D(V\|PV|P). \qquad (1.17)$$
Furthermore, if $U_{\mathcal{X}}$ is the uniform distribution on $\mathcal{X}$, i.e., $U_{\mathcal{X}}(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$, we have
$$D(P\|U_{\mathcal{X}}) = -H(P) + \log |\mathcal{X}|. \qquad (1.18)$$
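A direct implementation of (1.15), with the absolute-continuity edge case handled explicitly, can be used to check the identity (1.18); the helper names below are our own.

```python
import math

def D(P, Q):
    # Relative entropy D(P||Q) in bits, cf. (1.15). Returns infinity when
    # P is not absolutely continuous with respect to Q.
    total = 0.0
    for p, q in zip(P, Q):
        if p > 0:
            if q == 0:
                return math.inf
            total += p * math.log2(p / q)
    return total

def H(P):
    # Entropy in bits, cf. (1.7).
    return -sum(p * math.log2(p) for p in P if p > 0)

P = [0.7, 0.2, 0.1]
U = [1 / 3, 1 / 3, 1 / 3]
# Identity (1.18): D(P || U_X) = -H(P) + log|X|.
```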
The definition of relative entropy $D(P\|Q)$ can be extended to the case where $Q$ is not necessarily a probability measure. In this case, non-negativity does not hold in general. An important property we exploit is the following: if $\mu$ denotes the counting measure (i.e., $\mu(A) = |A|$ for $A \subseteq \mathcal{X}$), then similarly to (1.18),
$$D(P\|\mu) = -H(P). \qquad (1.19)$$

1.4 The Method of Types
For finite alphabets, a particularly convenient tool in information theory is the method of types [37, 39, 74]. For a sequence $\mathbf{x} = (x_1,\ldots,x_n) \in \mathcal{X}^n$, in which $|\mathcal{X}|$ is finite, its type or empirical distribution is the probability mass function
$$P_{\mathbf{x}}(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i = x\}, \qquad x \in \mathcal{X}. \qquad (1.20)$$
Throughout, we use the notation $\mathbb{1}\{\text{clause}\}$ to mean the indicator function, i.e., this function equals 1 if the clause is true and 0 otherwise. The set of types formed from length-$n$ sequences in $\mathcal{X}$ is denoted as $\mathcal{P}_n(\mathcal{X})$. This is clearly a subset of $\mathcal{P}(\mathcal{X})$. The type class of $P$, denoted as $\mathcal{T}_P$, is the set of all sequences of length $n$ whose type is $P$, i.e.,
$$\mathcal{T}_P := \{\mathbf{x} \in \mathcal{X}^n : P_{\mathbf{x}} = P\}. \qquad (1.21)$$
It is customary to indicate the dependence of $\mathcal{T}_P$ on the blocklength $n$, but we suppress this dependence for the sake of conciseness throughout. For a sequence $\mathbf{x} \in \mathcal{T}_P$, the set of all sequences $\mathbf{y} \in \mathcal{Y}^n$ such that $(\mathbf{x},\mathbf{y})$ has joint type $P \times V$ is the $V$-shell, denoted as $\mathcal{T}_V(\mathbf{x})$. In other words,
$$\mathcal{T}_V(\mathbf{x}) := \{\mathbf{y} \in \mathcal{Y}^n : P_{\mathbf{x},\mathbf{y}} = P \times V\}. \qquad (1.22)$$
The conditional distribution $V$ is also known as the conditional type of $\mathbf{y}$ given $\mathbf{x}$. Let $\mathcal{V}_n(\mathcal{Y};P)$ be the set of all $V \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ for which the $V$-shell of a sequence of type $P$ is non-empty.

We will oftentimes find it useful to consider information-theoretic quantities of empirical distributions. All such quantities are denoted using hats. So, for example, the empirical entropy of a sequence $\mathbf{x} \in \mathcal{X}^n$ is denoted as
$$\hat{H}(\mathbf{x}) := H(P_{\mathbf{x}}). \qquad (1.23)$$
The empirical conditional entropy of $\mathbf{y} \in \mathcal{Y}^n$ given $\mathbf{x} \in \mathcal{X}^n$, where $\mathbf{y} \in \mathcal{T}_V(\mathbf{x})$, is denoted as
$$\hat{H}(\mathbf{y}|\mathbf{x}) := H(V|P_{\mathbf{x}}). \qquad (1.24)$$
The empirical mutual information of a pair of sequences $(\mathbf{x},\mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$ with joint type $P_{\mathbf{x},\mathbf{y}} = P_{\mathbf{x}} \times V$ is denoted as
$$\hat{I}(\mathbf{x} \wedge \mathbf{y}) := I(P_{\mathbf{x}}, V). \qquad (1.25)$$
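Computing the type of a concrete sequence and an empirical quantity such as $\hat{H}(\mathbf{x})$ is a one-liner; the sketch below (our own helper names) illustrates (1.20) and (1.23).

```python
from collections import Counter
import math

def type_of(x, alphabet):
    # Empirical distribution P_x of the sequence x, cf. (1.20).
    n = len(x)
    counts = Counter(x)
    return [counts[a] / n for a in alphabet]

def H(P):
    # Entropy in bits, cf. (1.7).
    return -sum(p * math.log2(p) for p in P if p > 0)

x = "aabab"
Px = type_of(x, "ab")  # the type of x: (3/5, 2/5)
emp_H = H(Px)          # empirical entropy, cf. (1.23)
```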
The following lemmas form the basis of the method of types. The proofs can be found in [37, 39].

Lemma 1.1 (Type Counting). The sets $\mathcal{P}_n(\mathcal{X})$ and $\mathcal{V}_n(\mathcal{Y};P)$ for $P \in \mathcal{P}_n(\mathcal{X})$ satisfy
$$|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|}, \quad \text{and} \quad |\mathcal{V}_n(\mathcal{Y};P)| \le (n+1)^{|\mathcal{X}||\mathcal{Y}|}. \qquad (1.26)\text{--}(1.27)$$
In fact, it is easy to check that $|\mathcal{P}_n(\mathcal{X})| = \binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$, but the polynomial upper bound in (1.26) usually suffices for our purposes in this monograph. This key property says that the number of types is polynomial in the blocklength $n$.
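The type-counting bound is simple to confirm by brute force for small parameters; the sketch below (function names ours) enumerates all types on a $k$-letter alphabet with denominator $n$ and compares the count against the closed form and the polynomial bound.

```python
from math import comb
from itertools import product

def num_types(n, k):
    # Brute-force count of P_n(X) for |X| = k: compositions of n into
    # k non-negative parts, each giving a pmf with denominator n.
    return sum(1 for c in product(range(n + 1), repeat=k) if sum(c) == n)

n, k = 8, 3
exact = num_types(n, k)                 # enumerated |P_n(X)|
closed_form = comb(n + k - 1, k - 1)    # stars-and-bars count
poly_bound = (n + 1) ** k               # the bound in (1.26)
```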
Lemma 1.2 (Size of Type Class). For a type $P \in \mathcal{P}_n(\mathcal{X})$, the type class $\mathcal{T}_P \subseteq \mathcal{X}^n$ satisfies
$$|\mathcal{P}_n(\mathcal{X})|^{-1} \exp\big(nH(P)\big) \le |\mathcal{T}_P| \le \exp\big(nH(P)\big). \qquad (1.28)$$
For a conditional type $V \in \mathcal{V}_n(\mathcal{Y};P)$ and a sequence $\mathbf{x} \in \mathcal{T}_P$, the $V$-shell $\mathcal{T}_V(\mathbf{x}) \subseteq \mathcal{Y}^n$ satisfies
$$|\mathcal{V}_n(\mathcal{Y};P)|^{-1} \exp\big(nH(V|P)\big) \le |\mathcal{T}_V(\mathbf{x})| \le \exp\big(nH(V|P)\big). \qquad (1.29)$$
This lemma says that, on the exponential scale,
$$|\mathcal{T}_P| \doteq \exp\big(nH(P)\big), \quad \text{and} \quad |\mathcal{T}_V(\mathbf{x})| \doteq \exp\big(nH(V|P)\big). \qquad (1.30)$$
For $\mathbf{x} \in \mathcal{X}^n$, let $Q^n(\mathbf{x}) := \prod_{i=1}^n Q(x_i)$ denote the $n$-fold product distribution. \hfill (1.31)

Lemma 1.3 (Probability of Sequences). If $\mathbf{x} \in \mathcal{T}_P$, then
$$Q^n(\mathbf{x}) = \exp\big(-n\big[D(P\|Q) + H(P)\big]\big), \qquad (1.32)$$
and if $\mathbf{y} \in \mathcal{T}_V(\mathbf{x})$, then
$$W^n(\mathbf{y}|\mathbf{x}) = \exp\big(-n\big[D(V\|W|P) + H(V|P)\big]\big). \qquad (1.33)$$
This, together with Lemma 1.2, leads immediately to the final lemma in this section.
Lemma 1.4 (Probability of Type Classes). For a type $P \in \mathcal{P}_n(\mathcal{X})$,
$$|\mathcal{P}_n(\mathcal{X})|^{-1} \exp\big(-nD(P\|Q)\big) \le Q^n(\mathcal{T}_P) \le \exp\big(-nD(P\|Q)\big). \qquad (1.34)$$
For a conditional type $V \in \mathcal{V}_n(\mathcal{Y};P)$ and a sequence $\mathbf{x} \in \mathcal{T}_P$, we have
$$|\mathcal{V}_n(\mathcal{Y};P)|^{-1} \exp\big(-nD(V\|W|P)\big) \le W^n(\mathcal{T}_V(\mathbf{x})|\mathbf{x}) \le \exp\big(-nD(V\|W|P)\big). \qquad (1.35)$$
The interpretation of this lemma is that the probability that a random i.i.d. (independently and identically distributed) sequence $X^n$ generated from $Q^n$ belongs to the type class $\mathcal{T}_P$ is exponentially small with exponent $D(P\|Q)$, i.e.,
$$Q^n(\mathcal{T}_P) \doteq \exp\big(-nD(P\|Q)\big). \qquad (1.36)$$
The bounds in (1.35) can be interpreted similarly.
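For a binary alphabet, the bounds in (1.34) can be checked exactly: the type $P = (k/n, 1-k/n)$ has $Q^n(\mathcal{T}_P) = \binom{n}{k} q^k (1-q)^{n-k}$. The sketch below (our own helper names; recall the monograph's convention that $\exp$ is to base 2) compares this exact probability with the two sides of (1.34).

```python
from math import comb, log2

def D2(p, q):
    # Binary divergence D(P||Q) in bits for P = (p, 1-p), Q = (q, 1-q).
    def t(a, b):
        return a * log2(a / b) if a > 0 else 0.0
    return t(p, q) + t(1 - p, 1 - q)

n, k, q = 20, 6, 0.25
p = k / n
prob_TP = comb(n, k) * q**k * (1 - q) ** (n - k)  # exact Q^n(T_P)
upper = 2 ** (-n * D2(p, q))                      # upper bound in (1.34)
lower = upper / (n + 1) ** 2                      # |P_n(X)| <= (n+1)^2 here
```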
1.5 Probability Bounds
In this section, we summarize some bounds on probabilities that we use extensively in the sequel. For a random variable $X$, we let $\mathrm{E}[X]$ and $\mathrm{Var}(X)$ be its expectation and variance, respectively. To emphasize that the expectation is taken with respect to a random variable $X$ with distribution $P$, we sometimes make this explicit by using a subscript, i.e., $\mathrm{E}_X$ or $\mathrm{E}_P$.
1.5.1 Basic Bounds

Let $X^n = (X_1,\ldots,X_n)$ be a collection of i.i.d. random variables, where each $X_i$ has zero mean and finite variance. The weak law of large numbers states that for every $\delta > 0$,
$$\lim_{n\to\infty} \Pr\bigg( \bigg| \frac{1}{n}\sum_{i=1}^n X_i \bigg| > \delta \bigg) = 0, \qquad (1.39)$$
i.e., $\frac{1}{n}\sum_{i=1}^n X_i$ converges to 0 in probability. This follows by applying Chebyshev's inequality to the random variable $\frac{1}{n}\sum_{i=1}^n X_i$. In fact, under mild conditions, the convergence to zero in (1.39) occurs exponentially fast. See, for example, Cramér's theorem in [43, Thm. 2.2.3].
[Figure 1.2: The standard Gaussian cumulative distribution function $\Phi(y)$ and its inverse $\Phi^{-1}(\varepsilon)$.]
1.5.2 Central Limit-Type Bounds

In preparation for the next result, we denote the probability density function (pdf) of a univariate Gaussian as
$$\mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}. \qquad (1.40)$$
We will also denote this as $\mathcal{N}(\mu,\sigma^2)$ if the argument $x$ is unnecessary. A standard Gaussian distribution is one in which the mean $\mu = 0$ and the standard deviation $\sigma = 1$. In the multivariate case, the pdf is
$$\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}, \qquad (1.41)$$
where $\mathbf{x} \in \mathbb{R}^k$. A standard multivariate Gaussian distribution is one in which the mean is $\mathbf{0}_k$ and the covariance is the identity matrix $\mathbf{I}_{k\times k}$.
For the univariate case, the cumulative distribution function (cdf) of the standard Gaussian is denoted as
$$\Phi(y) := \int_{-\infty}^{y} \mathcal{N}(x;0,1)\, \mathrm{d}x. \qquad (1.42)$$
Its inverse is defined as
$$\Phi^{-1}(\varepsilon) := \sup\{y \in \mathbb{R} : \Phi(y) \le \varepsilon\}, \qquad (1.43)$$
which evaluates to the usual inverse for $\varepsilon \in (0,1)$ and extends continuously to take values $\pm\infty$ for $\varepsilon$ outside $(0,1)$. These monotonically increasing functions are shown in Fig. 1.2.

If the scaling in front of the sum in the statement of the law of large numbers in (1.39) is $\frac{1}{\sqrt{n}}$ instead of $\frac{1}{n}$, the resultant random variable $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ converges in distribution to a Gaussian random variable. As
in Proposition 1.3, let $X^n$ be a collection of i.i.d. random variables where each $X_i$ has zero mean and finite variance $\sigma^2$.
Proposition 1.4 (Central Limit Theorem). For any $a \in \mathbb{R}$, we have
$$\lim_{n\to\infty} \Pr\bigg( \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^n X_i < a \bigg) = \Phi(a). \qquad (1.44)$$
In other words,
$$\frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^n X_i \xrightarrow{d} Z, \qquad (1.45)$$
where $\xrightarrow{d}$ means convergence in distribution and $Z$ is the standard Gaussian random variable.
Throughout the monograph, in the evaluation of the non-asymptotic bounds, we will use a more quantitative version of the central limit theorem known as the Berry-Esseen theorem [17, 52]. See Feller [54, Sec. XVI.5] for a proof.

Theorem 1.1 (Berry-Esseen Theorem (i.i.d. Version)). Assume that the third absolute moment is finite, i.e., $T := \mathrm{E}\big[|X_1|^3\big] < \infty$. For every $n \in \mathbb{N}$, we have
$$\sup_{a \in \mathbb{R}} \bigg| \Pr\bigg( \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^n X_i < a \bigg) - \Phi(a) \bigg| \le \frac{T}{\sigma^3 \sqrt{n}}. \qquad (1.46)$$

Remarkably, the Berry-Esseen theorem says that the convergence in the central limit theorem in (1.44) is uniform in $a \in \mathbb{R}$. Furthermore, the convergence of the distribution function of $\frac{1}{\sqrt{n}\,\sigma}\sum_{i=1}^n X_i$ to the Gaussian cdf occurs at a rate of $O(\frac{1}{\sqrt{n}})$. The constant of proportionality in the $O(\cdot)$ notation depends only on the variance and the third absolute moment, and not on any other statistics of the random variables.
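The uniform bound in (1.46) can be checked exactly for a simple discrete example. The sketch below (helper names ours) takes $X_i = \pm\frac{1}{2}$ equiprobable, so that $\sigma = \frac{1}{2}$, $T = \frac{1}{8}$ and $T/\sigma^3 = 1$, computes the exact cdf of the normalized sum from the binomial pmf, and evaluates the supremum on both sides of every jump point.

```python
import math
from math import comb

def Phi(y):
    # Standard Gaussian cdf.
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def sup_gap(n):
    # Exact sup-deviation in (1.46) for X_i = +-1/2 equiprobable.
    # The normalized sum takes the values a_k = (2k - n)/sqrt(n);
    # the cdf is a step function, so we check both sides of each jump.
    gap, below = 0.0, 0.0
    for k in range(n + 1):
        pmf = comb(n, k) / 2 ** n
        a = (2 * k - n) / math.sqrt(n)
        gap = max(gap, abs(below - Phi(a)), abs(below + pmf - Phi(a)))
        below += pmf
    return gap

# Berry-Esseen predicts sup_gap(n) <= T/(sigma^3 sqrt(n)) = 1/sqrt(n).
```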
There are many generalizations of the Berry-Esseen theorem. One which we will need is the relaxation of the assumption that the random variables are identically distributed. Let $X^n = (X_1,\ldots,X_n)$ be a collection of independent random variables where each random variable has zero mean, variance $\sigma_i^2 := \mathrm{E}[X_i^2] > 0$ and third absolute moment $T_i := \mathrm{E}\big[|X_i|^3\big] < \infty$. We respectively define the average variance and average third absolute moment as
$$\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2, \quad \text{and} \quad T := \frac{1}{n}\sum_{i=1}^n T_i. \qquad (1.47)$$

Theorem 1.2 (Berry-Esseen Theorem (General Version)). For every $n \in \mathbb{N}$, we have
$$\sup_{a \in \mathbb{R}} \bigg| \Pr\bigg( \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^n X_i < a \bigg) - \Phi(a) \bigg| \le \frac{6T}{\sigma^3 \sqrt{n}}. \qquad (1.48)$$
Observe that, as with the i.i.d. version of the Berry-Esseen theorem, the remainder term scales as $O(\frac{1}{\sqrt{n}})$.
The proof of the following theorem uses the Berry-Esseen theorem (among other techniques). This theorem is proved in Polyanskiy-Poor-Verdú [123, Lem. 47]. Together with its variants, this theorem is useful for obtaining third-order asymptotics for binary hypothesis testing and other coding problems with non-vanishing error probabilities.

Theorem 1.3. Assume the same setup as in Theorem 1.2. For any $\lambda \ge 0$, we have
$$\mathrm{E}\bigg[ \exp\bigg( -\sum_{i=1}^n X_i \bigg)\, \mathbb{1}\bigg\{ \sum_{i=1}^n X_i > \lambda \bigg\} \bigg] \le 2\bigg( \frac{\log 2}{\sqrt{2\pi}} + \frac{12T}{\sigma^2} \bigg) \frac{\exp(-\lambda)}{\sigma\sqrt{n}}. \qquad (1.49)$$
It is trivial to see that the expectation in (1.49) is upper bounded by $\exp(-\lambda)$. The additional factor of $(\sigma\sqrt{n})^{-1}$ is crucial in proving coding theorems with better third-order terms. Readers familiar with strong large deviation theorems or exact asymptotics (see, e.g., [23, Thms. 3.3 and 3.5] or [43, Thm. 3.7.4]) will notice that (1.49) is in the same spirit as the theorem by Bahadur and Ranga Rao [13]. There are two advantages of (1.49) compared to strong large deviation theorems. First, the bound is purely in terms of $\sigma^2$ and $T$, and second, one does not have to differentiate between lattice and non-lattice random variables. The disadvantage of (1.49) is that the constant is worse, but this will not concern us: as we focus on asymptotic results in this monograph, constants do not affect the main results.
For multi-terminal problems that we encounter in the latter parts of this monograph, we will require vector (or multidimensional) versions of the Berry-Esseen theorem. The following is due to Götze [63].

Theorem 1.4 (Vector Berry-Esseen Theorem I). Let $X_1^k,\ldots,X_n^k$ be independent $\mathbb{R}^k$-valued random vectors with zero mean. Let
$$S_n^k = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i^k. \qquad (1.50)$$
Assume that $S_n^k$ has the following statistics:
$$\mathrm{Cov}(S_n^k) = \mathrm{E}\big[ S_n^k (S_n^k)' \big] = \mathbf{I}_{k\times k}, \quad \text{and} \quad \xi := \frac{1}{n}\sum_{i=1}^n \mathrm{E}\big[ \|X_i^k\|_2^3 \big]. \qquad (1.51)$$
Let $Z^k$ be a standard Gaussian random vector, i.e., its distribution is $\mathcal{N}(\mathbf{0}_k, \mathbf{I}_{k\times k})$. Then, for all $n \in \mathbb{N}$, we have
$$\sup_{\mathscr{C} \in \mathscr{C}_k} \big| \Pr(S_n^k \in \mathscr{C}) - \Pr(Z^k \in \mathscr{C}) \big| \le \frac{c_k\, \xi}{\sqrt{n}}, \qquad (1.52)$$
where $\mathscr{C}_k$ is the family of all convex subsets of $\mathbb{R}^k$, and where $c_k$ is a constant that depends only on the dimension $k$.

Theorem 1.4 can be applied for random vectors that are independent but not necessarily identically distributed. The constant $c_k$ can be upper bounded by $400\, k^{1/4}$ if the random vectors are i.i.d., a result by Bentkus [15]. However, its precise value will not be of concern to us in this monograph. Observe that the scalar versions of the Berry-Esseen theorems (in Theorems 1.1 and 1.2) are special cases (apart from the constant) of the vector version, in which the family of convex subsets is restricted to the family of semi-infinite intervals $(-\infty, a)$.
We will frequently encounter random vectors with non-identity covariance matrices. The following modification of Theorem 1.4 is due to Watanabe-Kuzuoka-Tan [177, Cor. 29].

Corollary 1.1 (Vector Berry-Esseen Theorem II). Assume the same setup as in Theorem 1.4, except that $\mathrm{Cov}(S_n^k) = \mathbf{V}$, a positive definite matrix. Then, for all $n \in \mathbb{N}$, we have
$$\sup_{\mathscr{C} \in \mathscr{C}_k} \big| \Pr(S_n^k \in \mathscr{C}) - \Pr(Z^k \in \mathscr{C}) \big| \le \frac{c_k\, \xi}{\lambda_{\min}(\mathbf{V})^{3/2} \sqrt{n}}, \qquad (1.53)$$
where $\lambda_{\min}(\mathbf{V})$ denotes the smallest eigenvalue of $\mathbf{V}$.
Theorem 1.5 (Berry-Esseen Theorem for Functions of i.i.d. Random Vectors). Assume that X_1^k, ..., X_n^k are R^k-valued, zero-mean, i.i.d. random vectors with positive definite covariance Cov(X_1^k) and finite third absolute moment ξ := E[ ‖X_1^k‖_2^3 ]. Let f(x) be a vector-valued function from R^k to R^l that is also twice continuously differentiable in a neighborhood of x = 0. Let J ∈ R^{l×k} be the Jacobian matrix of f(x) evaluated at x = 0, i.e., its elements are

J_{ij} = ∂f_i(x)/∂x_j |_{x=0},  (1.54)

where i = 1, ..., l and j = 1, ..., k. Then, for every n ∈ N, we have

sup_{C ∈ C_l} | Pr( f( (1/n) Σ_{i=1}^n X_i^k ) ∈ C ) − Pr( Z^l ∈ C ) | ≤ c/√n,  (1.55)

where c > 0 is a finite constant, and Z^l is a Gaussian random vector in R^l with mean vector and covariance matrix respectively given as

E[Z^l] = f(0),  (1.56)

and

Cov(Z^l) = J Cov(X_1^k) J' / n.  (1.57)

In particular, Theorem 1.5 implies that

√n ( f( (1/n) Σ_{i=1}^n X_i^k ) − f(0) ) converges in distribution to N( 0, J Cov(X_1^k) J' ),

which is a canonical statement in the study of the multivariate delta method [174, Thm. 5.15].
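As a numerical illustration (not part of the original text; the function f below is a hypothetical smooth map chosen for the example), one can check the delta-method variance J Cov(X_1) J' by Monte Carlo. With independent ±1 coordinates, Cov(X_1) = I, and for f(v_1, v_2) = (v_1 + 1)(v_2 + 2) the Jacobian at 0 is J = [2, 1], so the limiting variance of √n (f(mean) − f(0)) should be J J' = 5:

```python
import math
import random
import statistics

def f(v1, v2):
    """A hypothetical smooth function of the sample mean (illustrative)."""
    return (v1 + 1.0) * (v2 + 2.0)

def delta_method_variance(n=200, trials=4000, seed=7):
    """Monte Carlo variance of sqrt(n) * (f(sample mean) - f(0)) for i.i.d.
    zero-mean +/-1 coordinates; should approach J Cov(X_1) J' = 5."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        m1 = sum(rng.choice((-1, 1)) for _ in range(n)) / n
        m2 = sum(rng.choice((-1, 1)) for _ in range(n)) / n
        vals.append(math.sqrt(n) * (f(m1, m2) - f(0.0, 0.0)))
    return statistics.pvariance(vals)
```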
Finally, we remark that Ingber-Wang-Kochman [87] used a result similar to that of Theorem 1.5 to derive second-order asymptotic results for various Shannon-theoretic problems. However, they analyzed the behavior of functions of distributions instead of functions of random vectors as in Theorem 1.5.
Chapter 2

Binary Hypothesis Testing

2.1 Non-Asymptotic Quantities and Their Properties

Consider the binary hypothesis test

H_0 : Z ~ P  versus  H_1 : Z ~ Q,  (2.1)

where P and Q are two probability distributions on the same space Z. We assume that the space Z is finite to keep the subsequent exposition simple. The notation in (2.1) means that under the null hypothesis H_0, the random variable Z is distributed as P ∈ P(Z), while under the alternative hypothesis H_1, it is distributed according to a different distribution Q ∈ P(Z). We would like to study the optimal performance of a hypothesis test in terms of the distributions P and Q.
There are several ways to measure the performance of a hypothesis test which, in precise terms, is a mapping φ from the observation space Z to [0, 1]. If the observation z is such that φ(z) is close to 0, the test favors the null hypothesis H_0. Conversely, φ(z) close to 1 means that the test favors the alternative hypothesis H_1 (or alternatively, rejects the null hypothesis H_0). If φ(z) ∈ {0, 1} for all z, the test is called deterministic; otherwise it is called randomized. Traditionally, there are three quantities that are of interest for a given test φ. The first is the probability of false alarm

P_FA := Σ_{z∈Z} φ(z) P(z) = E_P[ φ(Z) ].  (2.2)
The second is the probability of missed detection

P_MD := Σ_{z∈Z} ( 1 − φ(z) ) Q(z) = E_Q[ 1 − φ(Z) ].  (2.3)
The third is the probability of detection, which is one minus the probability of missed detection, i.e.,

P_D := Σ_{z∈Z} φ(z) Q(z) = E_Q[ φ(Z) ].  (2.4)
The probability of false alarm and the probability of missed detection are traditionally called the type-I and type-II errors respectively in the statistics literature. The probability of detection and the probability of false alarm are also called the power and the significance level respectively. The holy grail is, of course, to design a test such that P_FA = 0 while P_D = 1, but this is clearly impossible unless P and Q are mutually singular measures.

Since misses are usually more costly than false alarms, let us fix a number ε ∈ (0, 1) that represents a tolerable probability of false alarm (type-I error). Then define the smallest type-II error in the binary hypothesis test (2.1) with type-I error not exceeding ε, i.e.,
β_{1−ε}(P, Q) := inf_{φ : Z→[0,1]} { E_Q[ 1 − φ(Z) ] : E_P[ φ(Z) ] ≤ ε }.  (2.5)

Observe that E_P[ φ(Z) ] ≤ ε constrains the probability of false alarm to be no greater than ε. Thus, we are searching over all tests φ satisfying E_P[ φ(Z) ] ≤ ε such that the probability of missed detection is minimized. Intuitively, β_{1−ε}(P, Q) quantifies, in a non-asymptotic fashion, the performance of an optimal hypothesis test between P and Q.
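Since (2.5) is a linear program over φ, the optimal (Neyman-Pearson) test can be computed exactly for finite alphabets by a fractional-knapsack argument: spend the false-alarm budget ε on the symbols with the largest ratio Q(z)/P(z), randomizing on the boundary symbol. The following sketch (not part of the original text) implements this:

```python
def beta(eps, P, Q):
    """Optimal type-II error beta_{1-eps}(P, Q) of (2.5) for finite
    alphabets.  Maximizing E_Q[phi] subject to E_P[phi] <= eps is a
    fractional knapsack: take symbols in decreasing order of the
    "detection per unit false alarm" ratio Q(z)/P(z)."""
    def ratio(z):
        return float('inf') if P[z] == 0 else Q[z] / P[z]
    budget = eps      # remaining allowed false-alarm probability E_P[phi]
    detect = 0.0      # detection probability E_Q[phi] accumulated so far
    for z in sorted(P, key=ratio, reverse=True):
        if P[z] <= budget:          # take z fully: phi(z) = 1
            budget -= P[z]
            detect += Q[z]
        else:                       # randomize: phi(z) = budget / P(z)
            detect += Q[z] * budget / P[z]
            break
    return 1.0 - detect
```

For P = Q the optimal test can only guess blindly, so beta(eps, P, P) = 1 − eps, consistent with the positive definiteness of the hypothesis testing divergence in Lemma 2.1.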
A related quantity is the hypothesis testing divergence

D_h^ε(P‖Q) := log( (1−ε) / β_{1−ε}(P, Q) ).  (2.6)

This is a measure of the distinguishability of P from Q. As can be seen from (2.6), β_{1−ε}(P, Q) and D_h^ε(P‖Q) are simple functions of each other. We prefer to express the results in this monograph mostly in terms of D_h^ε(P‖Q) because it shares similar properties with the usual relative entropy D(P‖Q), as is evidenced by the following lemma.
Lemma 2.1 (Properties of D_h). The hypothesis testing divergence satisfies the positive definiteness condition [48, Prop. 3.2], i.e.,

D_h^ε(P‖Q) ≥ 0.  (2.7)

Equality holds if and only if P = Q. In addition, it also satisfies the data processing inequality [173, Lem. 1], i.e., for any channel W,

D_h^ε(PW‖QW) ≤ D_h^ε(P‖Q).  (2.8)
While the hypothesis testing divergence occurs naturally and frequently in coding problems, it is usually hard to analyze directly. Thus, we now introduce an equally important quantity. Define the information spectrum divergence D_s^ε(P‖Q) as

D_s^ε(P‖Q) := sup{ R ∈ R : P( { z ∈ Z : log( P(z)/Q(z) ) ≤ R } ) ≤ ε }.  (2.9)

Just as in information spectrum analysis [67], this quantity places the distribution of the log-likelihood ratio log( P(Z)/Q(Z) ) (where Z ~ P), and not just its expectation, in the most prominent role. See Fig. 2.1 for an interpretation of the definition in (2.9).
As we will see, the information spectrum divergence is intimately related to the hypothesis testing divergence (cf. Lemma 2.4). The former is, however, easier to compute. Note that if P and Q are product measures, then by virtue of the fact that log( P(Z)/Q(Z) ) is a sum of independent random variables, one can estimate the probability in (2.9) using various probability tail bounds. This we do in the following section.

Figure 2.1: Illustration of the information spectrum divergence D_s^ε(P‖Q), which is the largest point R for which the probability mass of the density of log( P(Z)/Q(Z) ) (where Z ~ P) to the left of R is no larger than ε.
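For finite alphabets no tail bound is needed: the CDF of the log-likelihood ratio under P is a step function, so D_s^ε(P‖Q) can be computed exactly by enumeration. A small sketch (not part of the original text), in nats:

```python
import math

def D_s(eps, P, Q):
    """Information spectrum divergence D_s^eps(P || Q) of (2.9) for finite
    distributions with P(z) > 0 implying Q(z) > 0.  The CDF of the
    log-likelihood ratio under P is a step function, so the supremum in
    (2.9) is the smallest LLR value at which the CDF first exceeds eps."""
    llrs = sorted(set(math.log(P[z] / Q[z]) for z in P if P[z] > 0))
    for r in llrs:
        mass = sum(P[z] for z in P
                   if P[z] > 0 and math.log(P[z] / Q[z]) <= r)
        if mass > eps:
            return r
    return llrs[-1]   # not reached for eps < 1
```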
We now state two useful properties of D_s^ε(P‖Q). The proofs of these lemmas are straightforward and can be found in [164, Sec. III.A].

Lemma 2.2 (Sifting from a convex combination). Let P ∈ P(Z) and Q = Σ_k λ_k Q_k be an at most countable convex combination of distributions Q_k ∈ P(Z) with nonnegative weights λ_k summing to one, i.e., Σ_k λ_k = 1. Then,

D_s^ε(P‖Q) ≤ inf_k [ D_s^ε(P‖Q_k) + log( 1/λ_k ) ].  (2.10)

In particular, Lemma 2.2 tells us that if there exists some λ > 0 such that Q(z) ≥ λ Q̃(z) for all z ∈ Z, then

D_s^ε(P‖Q) ≤ D_s^ε(P‖Q̃) + log( 1/λ ).  (2.11)
Lemma 2.3 (Symbol-wise relaxation of D_s). Let W and V be two channels from X to Y and let P ∈ P(X). Then,

D_s^ε(PW‖PV) ≤ sup_{x∈X} D_s^ε( W(·|x) ‖ V(·|x) ).  (2.12)
One can readily toggle between the hypothesis testing divergence and the information spectrum divergence because they satisfy the bounds in the following lemma. The proof of this lemma mimics that of [163, Lem. 12].

Lemma 2.4 (Relation between divergences). For every ε ∈ (0, 1) and every δ ∈ (0, 1−ε), we have

D_s^ε(P‖Q) − log( 1/(1−ε) ) ≤ D_h^ε(P‖Q)  (2.13)
≤ D_s^{ε+δ}(P‖Q) + log( 1/δ ).  (2.14)
Proof. The following proof is based on that for [163, Lem. 12]. For the lower bound in (2.13), consider the likelihood ratio test

φ(z) := 1{ log( P(z)/Q(z) ) ≤ γ },  where γ := D_s^ε(P‖Q) − δ'  (2.15)

for some small δ' > 0. This test clearly satisfies E_P[ φ(Z) ] ≤ ε by the definition of the information spectrum divergence. On the other hand,

E_Q[ 1 − φ(Z) ] = Σ_{z∈Z} Q(z) 1{ log( P(z)/Q(z) ) > γ }  (2.16)
≤ Σ_{z∈Z} P(z) exp(−γ) 1{ log( P(z)/Q(z) ) > γ }  (2.17)
≤ Σ_{z∈Z} P(z) exp(−γ)  (2.18)
≤ exp(−γ).  (2.19)

Consequently, β_{1−ε}(P, Q) ≤ exp(−γ), and so

D_h^ε(P‖Q) ≥ γ − log( 1/(1−ε) ) = D_s^ε(P‖Q) − δ' − log( 1/(1−ε) ).  (2.20)

Taking δ' ↓ 0 yields (2.13). For the upper bound in (2.14), let φ be any test satisfying E_P[ φ(Z) ] ≤ ε and let μ > 0. Then,

E_Q[ 1 − φ(Z) ] ≥ Σ_{z : P(z) ≤ μQ(z)} Q(z) ( 1 − φ(z) )  (2.21)-(2.22)
≥ (1/μ) Σ_{z : P(z) ≤ μQ(z)} P(z) ( 1 − φ(z) )  (2.23)
≥ (1/μ) [ Σ_z P(z) ( 1 − φ(z) ) − Σ_{z : P(z) > μQ(z)} P(z) ]  (2.24)
≥ (1/μ) [ 1 − ε − P( { z : log( P(z)/Q(z) ) > log μ } ) ],  (2.25)

where (2.25) follows because E_P[ φ(Z) ] ≤ ε. Now fix a small ξ > 0 and choose

log μ = D_s^{ε+δ}(P‖Q) + ξ.  (2.26)

By the definition of D_s^{ε+δ}(P‖Q), the probability P( { z : log( P(z)/Q(z) ) ≤ log μ } ) exceeds ε + δ, so the probability in (2.25) is upper bounded by 1 − ε − δ. Hence,

β_{1−ε}(P, Q) ≥ δ/μ,  i.e.,  D_h^ε(P‖Q) ≤ log( (1−ε) μ / δ ) ≤ D_s^{ε+δ}(P‖Q) + log( 1/δ ) + ξ.  (2.27)

Taking ξ ↓ 0 completes the proof of (2.14) and hence, the lemma.
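The sandwich in Lemma 2.4 can be checked numerically on a toy alphabet. The sketch below (not part of the original text) computes β_{1−ε} by the Neyman-Pearson fractional-knapsack argument, D_h^ε from (2.6), and D_s^ε from (2.9), and verifies (2.13)-(2.14):

```python
import math

def beta(eps, P, Q):
    """Optimal type-II error of (2.5) (Neyman-Pearson, fractional knapsack)."""
    def ratio(z):
        return float('inf') if P[z] == 0 else Q[z] / P[z]
    budget, detect = eps, 0.0
    for z in sorted(P, key=ratio, reverse=True):
        if P[z] <= budget:
            budget -= P[z]
            detect += Q[z]
        else:
            detect += Q[z] * budget / P[z]
            break
    return 1.0 - detect

def D_h(eps, P, Q):
    """Hypothesis testing divergence (2.6), in nats."""
    return math.log((1.0 - eps) / beta(eps, P, Q))

def D_s(eps, P, Q):
    """Information spectrum divergence (2.9) for a finite alphabet."""
    llrs = sorted(set(math.log(P[z] / Q[z]) for z in P if P[z] > 0))
    for r in llrs:
        if sum(P[z] for z in P
               if P[z] > 0 and math.log(P[z] / Q[z]) <= r) > eps:
            return r
    return llrs[-1]

# check the sandwich (2.13)-(2.14) on a toy example
P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
Q = {'a': 0.2, 'b': 0.3, 'c': 0.5}
eps, delta = 0.3, 0.1
lower = D_s(eps, P, Q) - math.log(1.0 / (1.0 - eps))
upper = D_s(eps + delta, P, Q) + math.log(1.0 / delta)
assert lower <= D_h(eps, P, Q) <= upper
```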
2.2 Asymptotic Expansions

In this section, we consider the asymptotic expansions of D_h^ε(P^(n)‖Q^(n)) and D_s^ε(P^(n)‖Q^(n)) when P^(n) and Q^(n) are product distributions, i.e.,

P^(n)(z) := Π_{i=1}^n P_i(z_i)  and  Q^(n)(z) := Π_{i=1}^n Q_i(z_i),  (2.28)
for all z = (z_1, ..., z_n) ∈ Z^n. The component distributions {(P_i, Q_i)}_{i=1}^n are not necessarily the same for each i. However, we do assume for the sake of simplicity that for each i, P_i ≪ Q_i, so D(P_i‖Q_i) < ∞. Let V(P‖Q) be the variance of the log-likelihood ratio between P and Q, i.e.,

V(P‖Q) := Σ_{z∈Z} P(z) ( log( P(z)/Q(z) ) − D(P‖Q) )².  (2.29)

This is also known as the relative entropy variance. Let the third absolute moment of the log-likelihood ratio between P and Q be

T(P‖Q) := Σ_{z∈Z} P(z) | log( P(z)/Q(z) ) − D(P‖Q) |³.  (2.30)
Define the averages

D_n := (1/n) Σ_{i=1}^n D(P_i‖Q_i),  (2.31)
V_n := (1/n) Σ_{i=1}^n V(P_i‖Q_i),  (2.32)

and

T_n := (1/n) Σ_{i=1}^n T(P_i‖Q_i).  (2.33)
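These single-letter quantities are straightforward to compute for finite alphabets. A small sketch (not part of the original text), working in nats:

```python
import math

def dvt(P, Q):
    """Relative entropy D(P||Q), relative entropy variance V(P||Q) of (2.29),
    and third absolute moment T(P||Q) of (2.30), in nats, assuming P << Q."""
    llr = {z: math.log(P[z] / Q[z]) for z in P if P[z] > 0}
    D = sum(P[z] * llr[z] for z in llr)
    V = sum(P[z] * (llr[z] - D) ** 2 for z in llr)
    T = sum(P[z] * abs(llr[z] - D) ** 3 for z in llr)
    return D, V, T
```

For example, P = (1/2, 1/2) and Q = (1/4, 3/4) give D = (1/2) log(4/3), V = (1/4) log² 3 and T = (1/8) log³ 3.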
The following non-asymptotic bound on the information spectrum divergence follows from the Berry-Esseen theorem.

Proposition 2.1 (Berry-Esseen bound for D_s). Assume V_n > 0. Then, for every ε ∈ (0, 1),

nD_n + √(nV_n) Φ^{-1}( ε − 6T_n/√(nV_n³) ) ≤ D_s^ε( P^(n) ‖ Q^(n) ) ≤ nD_n + √(nV_n) Φ^{-1}( ε + 6T_n/√(nV_n³) ).  (2.34)
Proof. Let Z^n be distributed according to P^(n). By using the product structure of P^(n) and Q^(n) in (2.28),

Pr( log( P^(n)(Z^n) / Q^(n)(Z^n) ) ≤ R ) = Pr( Σ_{i=1}^n log( P_i(Z_i)/Q_i(Z_i) ) ≤ R ).  (2.35)

By the Berry-Esseen theorem in Theorem 1.2, we have

| Pr( Σ_{i=1}^n log( P_i(Z_i)/Q_i(Z_i) ) ≤ R ) − Φ( (R − nD_n)/√(nV_n) ) | ≤ 6T_n/√(nV_n³).  (2.36)

Evaluating (2.36) at R = nD_n + √(nV_n) Φ^{-1}( ε − 6T_n/√(nV_n³) ) shows that the probability in (2.35) is at most ε, so this R is feasible for the supremum in (2.9); evaluating it at R = nD_n + √(nV_n) Φ^{-1}( ε + 6T_n/√(nV_n³) ) + ξ for any ξ > 0 shows that the probability exceeds ε, so no larger R is feasible.  (2.37)-(2.38)

This yields the two bounds in (2.34), which completes the proof.

Specializing Proposition 2.1 to identical component distributions yields the following.

Corollary 2.1. If (P_i, Q_i) = (P, Q) for all i, with V(P‖Q) > 0 and T(P‖Q) < ∞, then for every ε ∈ (0, 1),

D_s^ε(P^n‖Q^n) = nD(P‖Q) + √(nV(P‖Q)) Φ^{-1}(ε) + O(1).
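For i.i.d. binary components the log-likelihood-ratio sum is supported on only n + 1 points, so D_s^ε(P^n‖Q^n) can be computed exactly and compared with the Gaussian approximation nD + √(nV) Φ^{-1}(ε). A sketch (not part of the original text), in nats:

```python
import math
from math import comb

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    """Inverse of norm_cdf by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_Ds_product(eps, p, q, n):
    """Exact D_s^eps(P^n || Q^n) for i.i.d. Bernoulli P = (1-p, p) and
    Q = (1-q, q): the LLR sum takes only n+1 values, so the step CDF of
    (2.35) can be enumerated directly."""
    l1, l0 = math.log(p / q), math.log((1 - p) / (1 - q))
    pts = sorted((k * l1 + (n - k) * l0,
                  comb(n, k) * p ** k * (1 - p) ** (n - k))
                 for k in range(n + 1))
    cdf = 0.0
    for r, w in pts:
        cdf += w
        if cdf > eps:
            return r
    return pts[-1][0]

# Gaussian approximation n*D + sqrt(n*V) * PhiInv(eps) suggested by (2.34)
p, q, eps, n = 0.5, 0.25, 0.2, 60
l1, l0 = math.log(p / q), math.log((1 - p) / (1 - q))
D = p * l1 + (1 - p) * l0
V = p * (l1 - D) ** 2 + (1 - p) * (l0 - D) ** 2
approx = n * D + math.sqrt(n * V) * norm_ppf(eps)
exact = exact_Ds_product(eps, p, q, n)
```

At n = 60 the exact value and the Gaussian approximation already agree to within a fraction of a nat (the residual is the O(1) term, including lattice effects).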
In some applications, it is not possible to guarantee that V_n is uniformly bounded away from zero, as is needed for Proposition 2.1 to be useful. In this case, to obtain an upper bound on D_s^ε, we employ Chebyshev's inequality instead of the Berry-Esseen theorem. In the following proposition, which is usually good enough to establish strong converses, we again do not assume that the component distributions are the same.
Proposition 2.2 (Chebyshev bound for D_s). We have

D_s^ε( P^(n) ‖ Q^(n) ) ≤ nD_n + √( nV_n / (1−ε) ).  (2.39)
Proof. Write

D_s^ε( P^(n) ‖ Q^(n) ) = max{ D^−, D^+ },  (2.40)

where

D^− := sup{ R ≤ nD_n : P^(n)( { z ∈ Z^n : log( P^(n)(z)/Q^(n)(z) ) ≤ R } ) ≤ ε },  (2.41)
D^+ := sup{ R > nD_n : P^(n)( { z ∈ Z^n : log( P^(n)(z)/Q^(n)(z) ) ≤ R } ) ≤ ε }.  (2.42)

Clearly, D^− ≤ nD_n, so it remains to upper bound D^+. Let R > nD_n be fixed. By Chebyshev's inequality,

Pr( Σ_{i=1}^n log( P_i(Z_i)/Q_i(Z_i) ) ≤ R ) ≥ 1 − nV_n/(R − nD_n)².  (2.43)

Hence, if R is feasible for (2.42), we must have

1 − nV_n/(R − nD_n)² ≤ ε,  (2.44)

i.e.,

R ≤ nD_n + √( nV_n/(1−ε) ).  (2.45)

Thus, we see that the bound on D^+ dominates. This yields (2.39) as desired.
Now we would like an expansion for D_h^ε similar to that for D_s^ε in Corollary 2.1. The following was shown by Strassen [152, Thm. 3.1].

Proposition 2.3 (Asymptotics of D_h). Assume the conditions in Corollary 2.1. The following holds:

D_h^ε(P^n‖Q^n) = nD(P‖Q) + √(nV(P‖Q)) Φ^{-1}(ε) + (1/2) log n + O(1).  (2.46)

As a result, in the asymptotic setting for identical product distributions, D_h^ε(P^n‖Q^n) exceeds D_s^ε(P^n‖Q^n) by (1/2) log n, ignoring constant terms, i.e.,

D_h^ε(P^n‖Q^n) = D_s^ε(P^n‖Q^n) + (1/2) log n + O(1).  (2.47)
Proof. Let us first verify the upper bound. Let δ in the upper bound of Lemma 2.4 be chosen to be 1/√n. Now, for n large enough (so that 1/√n < 1−ε), combine this upper bound with Corollary 2.1 to obtain

D_h^ε(P^n‖Q^n) ≤ D_s^{ε+1/√n}(P^n‖Q^n) + (1/2) log n  (2.48)
= nD(P‖Q) + √(nV(P‖Q)) Φ^{-1}( ε + 1/√n ) + (1/2) log n + O(1).  (2.49)

Applying a Taylor expansion to Φ^{-1}(·) in the last step and noting that V(P‖Q) < ∞ because P ≪ Q yields the upper bound in (2.46).
The proof of the lower bound in (2.46) is slightly more involved. Observe that if we naïvely employed (2.13) to lower bound D_h^ε(P^n‖Q^n) with D_s^ε(P^n‖Q^n) − log( 1/(1−ε) ), the third-order term would be O(1) instead of the better (1/2) log n + O(1). The idea is to propose an appropriate test for D_h^ε and to use Theorem 1.3. Consider the likelihood ratio test

φ(z) := 1{ log( P^n(z)/Q^n(z) ) ≤ γ }.  (2.50)

Define σ² := V(P‖Q) and T := T(P‖Q). Also define the i.i.d. random variables U_i := log P(Z_i) − log Q(Z_i), 1 ≤ i ≤ n, each having variance σ² and third absolute moment T. Consider the expectation of 1 − φ(Z^n) under the distribution Q^n:

E_{Q^n}[ 1 − φ(Z^n) ] = Σ_z Q^n(z) 1{ log( P^n(z)/Q^n(z) ) > γ }  (2.51)
= Σ_z P^n(z) exp( −log( P^n(z)/Q^n(z) ) ) 1{ log( P^n(z)/Q^n(z) ) > γ }  (2.52)
= E_{P^n}[ exp( −Σ_{i=1}^n U_i ) 1{ Σ_{i=1}^n U_i > γ } ]  (2.53)
≤ 2 ( (log 2)/√(2π) + 12T/σ² ) exp(−γ) / (σ√n),  (2.54)

where (2.54) follows from Theorem 1.3. Now choose

γ := nD(P‖Q) + √(nV(P‖Q)) Φ^{-1}( ε − 6T(P‖Q)/√(nV(P‖Q)³) ).  (2.55)

By the Berry-Esseen theorem (Theorem 1.2), this choice ensures that

E_{P^n}[ φ(Z^n) ] = P^n( { z : Σ_{i=1}^n log( P(z_i)/Q(z_i) ) ≤ γ } ) ≤ ε,  (2.56)-(2.57)

so φ is feasible for (2.5) and β_{1−ε}(P^n, Q^n) ≤ E_{Q^n}[ 1 − φ(Z^n) ]. Combined with (2.54), this gives D_h^ε(P^n‖Q^n) ≥ γ + (1/2) log n + O(1). The proof is concluded by plugging (2.55) into this bound and Taylor expanding Φ^{-1}(·) around ε.
We remark that the lower bound in Proposition 2.3 can be achieved using deterministic tests, i.e., φ can be chosen to be an indicator function as in (2.50). Randomization is thus unnecessary. Also, one can relax the assumption that Q^n is a product probability measure; it can be an arbitrary product measure. These realizations are important in making the connection between hypothesis testing and lossless source coding, which we discuss in the next chapter.

A corollary of Proposition 2.3 is the Chernoff-Stein lemma [25], quantifying the error exponent of the probability of missed detection while keeping the probability of false alarm bounded above by ε.
Corollary 2.2 (Chernoff-Stein lemma). Assume the conditions in Corollary 2.1 and recall the definition of β_{1−ε} in (2.5). For every ε ∈ (0, 1),

lim_{n→∞} (1/n) log( 1/β_{1−ε}(P^n, Q^n) ) = lim_{n→∞} (1/n) D_h^ε(P^n‖Q^n) = D(P‖Q).  (2.58)
Part II

Point-to-Point Communication
Chapter 3
Source Coding
In this chapter, we revisit the fundamental problem of fixed-to-fixed length lossless and lossy source compression. Shannon, in his original paper [141] that launched the field of information theory, showed that the fundamental limit of compression of a discrete memoryless source (DMS) P is the entropy H(P). For the case of continuous sources, lossless compression is not possible and some distortion must be allowed. Shannon showed in [144] that the corresponding fundamental limit of compression of a memoryless source P up to distortion Δ ≥ 0, assuming a separable distortion measure d, is the rate-distortion function

R(P, Δ) := min_{W : E[d(X,X̂)] ≤ Δ} I(P, W),  (3.1)

where the minimization is over test channels W from X to X̂ such that the expected distortion (with X ~ P and X̂ the output of W) does not exceed Δ.
These first-order fundamental limits are attained as the number of realizations of the source (i.e., the blocklength of the source) tends to infinity. The strong converse for rate-distortion is also known and shown, for example, in [39, Ch. 7]. In the following, we present known non-asymptotic bounds for lossless and lossy source coding. We then fix the permissible error probability in the lossless case and the excess-distortion probability in the lossy case at some non-vanishing ε ∈ (0, 1). The non-asymptotic bounds are evaluated as n becomes large so as to obtain asymptotic expansions of the logarithm of the smallest achievable code size. These refined results provide an approximation of the extra code rate (beyond H(P) or R(P, Δ)) one must incur when operating in the finite blocklength regime. Finally, for both the lossless and lossy compression problems, we provide alternative proof techniques based on the method of types that are partially universal. The material in this chapter concerning lossless source coding is based on the seminal work by Strassen [152, Thm. 1.1]. The material on lossy source coding is based on more recent work by Ingber-Kochman [86] and Kostina-Verdú [97].
3.1 Lossless Source Coding: Non-Asymptotic Bounds

We now set up the almost lossless source coding problem formally. As mentioned, we only consider fixed-to-fixed length source coding in this monograph. A source is simply a probability mass function P on some finite alphabet X, or the associated random variable X with distribution P. See Fig. 3.1 for an illustration of the setup.
An (M, ε)-code for the source P ∈ P(X) consists of a pair of maps that includes an encoder f : X → {1, ..., M} and a decoder φ : {1, ..., M} → X such that the error probability

P( { x ∈ X : φ(f(x)) ≠ x } ) ≤ ε.  (3.2)

The number M is called the size of the code (f, φ).

Given a source P, we define the almost lossless source coding non-asymptotic fundamental limit as

M*(P, ε) := min{ M ∈ N : ∃ an (M, ε)-code for P }.  (3.3)
Figure 3.1: Illustration of the source coding setup: a source symbol x is compressed by the encoder f into an index m ∈ {1, ..., M}, from which the decoder φ produces a reconstruction.
3.1.1 An Achievability Bound

One of the take-home messages that we would like to convey in this section is that fixed-to-fixed length lossless source coding is nothing but binary hypothesis testing where the measures P and Q are chosen appropriately. In fact, a reasonable coding scheme for lossless source coding would simply be to encode a typical set of source symbols T ⊆ X, ignore the rest, and declare an error if the realized symbol from the source is not in T. In this way, one sees that

M*(P, ε) ≤ min_{T ⊆ X : P(X∖T) ≤ ε} |T|.  (3.4)
This bound can be stated in terms of β_{1−ε}(P, Q) or, equivalently, the hypothesis testing divergence D_h^ε(P‖Q) if we restrict the tests that define these quantities to be deterministic, and also allow Q to be an arbitrary measure (not necessarily a probability measure). Let λ be the counting measure, i.e.,

λ(A) := |A|,  A ⊆ X.  (3.5)

Proposition 3.1 (Source coding as hypothesis testing: Achievability). Let ε ∈ (0, 1) and P be any source with countable alphabet X. We have

M*(P, ε) ≤ β_{1−ε}(P, λ),  (3.6)

or, in terms of the hypothesis testing divergence (cf. (2.6)),

log M*(P, ε) ≤ −D_h^ε(P‖λ) − log( 1/(1−ε) ).  (3.7)

3.1.2 A Converse Bound
The converse bound we evaluate is also intimately connected to a divergence we introduced in the previous chapter, namely the information spectrum divergence where the distribution in the alternative hypothesis is chosen to be the counting measure λ.

Proposition 3.2 (Source coding as hypothesis testing: Converse). Let ε ∈ (0, 1) and P be any source with countable alphabet X. For any η ∈ (0, 1−ε), we have

log M*(P, ε) ≥ −D_s^{ε+η}(P‖λ) − log( 1/η ).  (3.8)

This statement is exactly Lemma 1.3.2 in Han's book [67]. Since the proof is short, we provide it for completeness.
Proof. By the definition of the information spectrum divergence, it is enough to establish that every (M, ε)-code for P must satisfy

ε + η ≥ P( { x : P(x) ≤ η/M } )  (3.9)

for any η ∈ (0, 1−ε). Let T := { x : P(x) ≤ η/M }, and let S := { x ∈ X : φ(f(x)) = x } be the set of correctly decoded symbols, so that |S| ≤ M and P(X∖S) ≤ ε. Then,

P(T) ≤ P(X∖S) + P(T∩S) ≤ ε + P(T∩S).  (3.10)

Furthermore,

P(T∩S) = Σ_{x∈T∩S} P(x) ≤ Σ_{x∈T∩S} η/M ≤ η|S|/M ≤ η.  (3.11)

Combining the last two displays establishes (3.9), which is equivalent to (3.8) because log( P(x)/λ(x) ) = log P(x).
3.2 Lossless Source Coding: Asymptotic Expansions

Now we assume that the source P^n is stationary and memoryless, i.e., a DMS. More precisely,

P^n(x) = Π_{i=1}^n P(x_i),  x ∈ X^n.  (3.12)

We assume throughout that P(x) > 0 for all x ∈ X. Shannon [141] showed that the minimum rate to achieve almost lossless compression of a DMS P is the entropy H(P). In this section as well as the next one, we derive finer evaluations of the fundamental compression limit by considering the asymptotic expansion of log M*(P^n, ε). To do so, we need to define another important quantity related to the source P.
Let the source dispersion of P be the variance of the self-information random variable −log P(X), i.e.,

V(P) := Var[ log( 1/P(X) ) ] = Σ_{x∈X} P(x) ( log( 1/P(x) ) − H(P) )².  (3.13)

Note that the expectation of the self-information is the entropy H(P). In Kontoyiannis-Verdú [95], V(P) is called the varentropy. If V(P) = 0, the source is either deterministic or uniform.

The two non-asymptotic theorems in the preceding section combine to give the following asymptotic expansion of the minimum code size M*(P^n, ε).
Theorem 3.1. If the source P ∈ P(X) satisfies V(P) > 0, then

log M*(P^n, ε) = nH(P) − √(nV(P)) Φ^{-1}(ε) − (1/2) log n + O(1).  (3.14)

Otherwise, we have

log M*(P^n, ε) = nH(P) + O(1).  (3.15)
Proof. For the direct part of (3.14) (upper bound), note that the term log( 1/(1−ε) ) in Proposition 3.1 is a constant, so we simply have to evaluate D_h^ε(P^n‖λ^n).¹ From Corollary 2.1 and the remark following Proposition 2.3 that the lower bound on D_h^ε(P^n‖Q^n) can be achieved using deterministic tests, we have

D_h^ε(P^n‖λ^n) = nD(P‖λ) + √(nV(P‖λ)) Φ^{-1}(ε) + (1/2) log n + O(1),  (3.16)

where

D(P‖λ) = −H(P)  and  V(P‖λ) = V(P).  (3.17)

This concludes the proof of the direct part in light of Proposition 3.1.

¹ Just to be pedantic, for any A ⊆ X^n, the measure λ^n(A) is defined as Σ_{x∈A} λ^n(x) = |A| with λ^n(x) = 1 for each x ∈ X^n. Hence, λ^n has the required product structure for the application of Corollary 2.1, for which the second argument of D_h^ε(P^n‖Q^n) is not restricted to product probability measures.
For the converse part of (3.14) (lower bound), choose η = 1/√n so that the term −log( 1/η ) in Proposition 3.2 gives us the −(1/2) log n term. Furthermore, by Proposition 2.1 and the simplifications in (3.17),

D_s^{ε+1/√n}(P^n‖λ^n) = −nH(P) + √(nV(P)) Φ^{-1}( ε + 1/√n ) + O(1).  (3.18)

A Taylor expansion of Φ^{-1}(·) completes the proof of (3.14).

For (3.15), note that the self-information −log P(X) takes on the value H(P) with probability one. In other words,

P^n( −log P^n(X^n) ≤ R ) = 1{ R ≥ nH(P) }.  (3.19)

The bounds on log M*(P^n, ε) in Propositions 3.1 and 3.2 and the relaxation to the information spectrum divergence (Lemma 2.4) yield (3.15).
The expansion in (3.14) in Theorem 3.1 appeared in early works by Yushkevich [190] (with a remainder of o(√n) in place of −(1/2) log n + O(1), and for Markov chains) and Strassen [152, Thm. 1.1] (in the form stated). It has since appeared in various other forms and levels of generality in Kontoyiannis [93], Hayashi [75], Kostina-Verdú [97], Nomura-Han [117] and Kontoyiannis-Verdú [95], among others.
As can be seen from the non-asymptotic bounds and the asymptotic evaluation, fixed-to-fixed length lossless source coding and binary hypothesis testing are virtually the same problem. Asymptotic expansions for D_h^ε and D_s^ε can be used directly to estimate the minimum code size M*(P^n, ε) for an ε-reliable lossless source code.
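To make the approximation in (3.14) concrete, the sketch below (not part of the original text) compares nH(P) − √(nV(P)) Φ^{-1}(ε) − (1/2) log n with the exact value of log M*(P^n, ε) for a Bernoulli source, computed by retaining the most probable sequences type by type (an optimal fixed-to-fixed length code keeps a highest-probability set, cf. (3.4)):

```python
import math
from math import comb

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    """Inverse of norm_cdf by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_logM(p, n, eps):
    """log M*(P^n, eps) in nats for a Bernoulli(p) DMS: keep the most
    probable sequences (grouped by type, i.e., number of ones) until the
    retained probability mass reaches 1 - eps."""
    # sort types by sequence probability p^k (1-p)^(n-k), descending
    types = sorted(range(n + 1),
                   key=lambda k: k * math.log(p) + (n - k) * math.log(1 - p),
                   reverse=True)
    mass, M = 0.0, 0
    for k in types:
        prob = p ** k * (1 - p) ** (n - k)
        count = comb(n, k)
        if mass + count * prob >= 1 - eps:
            M += math.ceil(((1 - eps) - mass) / prob)  # partial type class
            return math.log(M)
        mass += count * prob
        M += count
    return math.log(M)

p, eps, n = 0.11, 0.1, 400
H = -p * math.log(p) - (1 - p) * math.log(1 - p)
V = p * (1 - p) * math.log((1 - p) / p) ** 2
approx = n * H - math.sqrt(n * V) * norm_ppf(eps) - 0.5 * math.log(n)
exact = exact_logM(p, n, eps)
```

At n = 400 the normal approximation tracks the exact value to within a few nats (the residual O(1) term), while the gap between the exact value and nH(P) alone is more than ten nats.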
3.3 Second-Order Asymptotics of Lossless Source Coding via the Method of Types

Clearly, the coding scheme described in (3.4) is non-universal, i.e., the code depends on knowledge of the source distribution. In many applications, the exact source distribution is unknown, and hence has to be estimated a priori, or one has to design a source code that works well for any source distribution. It is a well-known application of the method of types that universal source codes achieve the lossless source coding error exponent [39, Thm. 2.15]. One then wonders whether there is any degradation in the asymptotic expansion of log M*(P^n, ε) if the encoder and decoder know less about the source. It turns out that the source dispersion term can be achieved with only the knowledge of H(P) and V(P). However, one has to work much harder to determine the third-order term. For conclusive results on the third-order term for fixed-to-variable length source coding, the reader is referred to the elegant work by Kosut and Sankar [100, 101]. The technique outlined in this section involves the method of types.

Let M_u*(P, ε) be the almost lossless source coding non-asymptotic fundamental limit where the source code (f, φ) is ignorant of the probability distribution of the source P, except for the entropy H(P) and the varentropy V(P).
Theorem 3.2. If the source P ∈ P(X) satisfies V(P) > 0, then

log M_u*(P^n, ε) ≤ nH(P) − √(nV(P)) Φ^{-1}(ε) + (|X| − 1) log n + O(1).  (3.20)

The proof we present here results in a third-order term that is likely to be far from optimal, but we present this proof to demonstrate the similarity to the classical proof of the fixed-to-fixed length source coding error exponent using the method of types [39, Thm. 2.15].
Proof of Theorem 3.2. Let X = {a_1, ..., a_d} without loss of generality. Set M, the size of the code, to be the smallest integer exceeding

exp( nH(P) − √(nV(P)) Φ^{-1}( ε − c/√n ) + (d−1) log(n+1) ),  (3.21)

for some finite constant c > 0 (given in Theorem 1.5). Let K be the set of sequences in X^n whose empirical entropy is no larger than

γ := (1/n) log M − (d−1) log(n+1)/n.  (3.22)

In other words,

K := ∪_{Q ∈ P_n(X) : H(Q) ≤ γ} T_Q.  (3.23)
Encode all sequences in K in a one-to-one way and sequences not in K arbitrarily. By the type counting lemma in (1.27) and Lemma 1.2 (size of type class), we have

|K| ≤ Σ_{Q ∈ P_n(X) : H(Q) ≤ γ} exp( nH(Q) ) ≤ (n+1)^{d−1} exp( nγ ) ≤ M,  (3.24)

so the number of sequences that can be reconstructed without error is at most M, as required. An error occurs if and only if the source sequence has empirical entropy exceeding γ, i.e., the error probability is

p := Pr( H(P_{X^n}) > γ )  (3.25)
= Pr( f( P_{X^n} − P ) > γ ),  where f(v) := H(P + v).  (3.26)-(3.27)
In (3.26) and (3.27), we regarded the type P_{X^n} and the true distribution P as vectors of length d = |X|, and H(w) = −Σ_j w_j log w_j is the entropy. Note that the argument of f(·) in (3.26) can be written as

P_{X^n} − P = (1/n) Σ_{i=1}^n [ 1{X_i = a_1} − P(a_1), ..., 1{X_i = a_d} − P(a_d) ]' =: (1/n) Σ_{i=1}^n U_i^d.  (3.28)
Since U_i^d := [ 1{X_i = a_1} − P(a_1), ..., 1{X_i = a_d} − P(a_d) ]' for i = 1, ..., n are zero-mean, i.i.d. random vectors, we may appeal to the Berry-Esseen theorem for functions of i.i.d. random vectors in Theorem 1.5. Indeed, we note that f(0) = H(P), the Jacobian of f evaluated at v = 0 is

J = [ log( 1/(e P(a_1)) ), ..., log( 1/(e P(a_d)) ) ],  (3.29)

and the (s, t) ∈ {1, ..., d}² element of the covariance matrix of U_1^d is

[ Cov(U_1^d) ]_{st} = P(a_s)( 1 − P(a_s) ) if s = t,  and  −P(a_s) P(a_t) if s ≠ t.  (3.30)

Since J Cov(U_1^d) J' = V(P), Theorem 1.5 yields

p ≤ 1 − Φ( √n ( γ − H(P) ) / √(V(P)) ) + c/√n,  (3.31)-(3.32)

where c is a finite positive constant (depending on P). By the choice of M and γ in (3.21)-(3.22), we see that p is no larger than ε.
This coding scheme is partially universal in the sense that H(P) and V(P) need to be known to determine M in (3.21) and the threshold γ in (3.22), but otherwise no other characteristic of the source P is required. This achievability proof strategy is rather general and can be applied to rate-distortion (cf. Section 3.6), channel coding, joint source-channel coding [87, 170], as well as multi-terminal problems [157] (cf. Section 6.3).
The point we would like to emphasize in this section is the following: In large deviations (error exponent) analysis of almost lossless source coding, the probability of error in (3.25) is evaluated using, for example, Sanov's theorem [39, Ex. 2.12], or refined versions of it [39, Ex. 2.7(c)]. In the above proof, the probability of error is instead estimated using the Berry-Esseen theorem (Theorem 1.5), since the deviation of the code rate from the first-order fundamental limit H(P) is of the order Θ(1/√n) instead of a constant. Essentially, the proof of Theorem 3.2 hinges on the fact that for a random vector X^n with distribution P^n, the entropy of the type P_{X^n}, namely the empirical entropy Ĥ(X^n), satisfies the following central limit relation:

√n ( Ĥ(X^n) − H(P) ) converges in distribution to N( 0, V(P) ).  (3.33)

Finally, we note that the technique to bound the probability in (3.26) is similar to that suggested by Kosut and Sankar [101, Lem. 1].
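The central limit relation (3.33) is easy to probe empirically. The sketch below (not part of the original text) draws i.i.d. blocks, computes the empirical entropy of each type, and checks that the variance of √n (Ĥ(X^n) − H(P)) is close to the varentropy V(P):

```python
import math
import random
import statistics

def entropy_fluctuations(P, n=400, trials=3000, seed=3):
    """Monte Carlo samples of sqrt(n) * (H(type of X^n) - H(P)), whose
    distribution should be close to N(0, V(P)) by (3.33).  In nats."""
    symbols = list(P)
    probs = [P[a] for a in symbols]
    H = -sum(q * math.log(q) for q in probs)
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        xs = rng.choices(symbols, weights=probs, k=n)
        emp = [xs.count(a) / n for a in symbols]
        Hn = -sum(t * math.log(t) for t in emp if t > 0)
        out.append(math.sqrt(n) * (Hn - H))
    return out

P = {'a': 0.6, 'b': 0.3, 'c': 0.1}
H = -sum(q * math.log(q) for q in P.values())
V = sum(q * (-math.log(q) - H) ** 2 for q in P.values())
samples = entropy_fluctuations(P)
```

The small negative bias of the sample mean reflects the well-known O(1/n) bias of the plug-in entropy estimator, which disappears at the √n scale.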
3.4 Lossy Source Coding: Non-Asymptotic Bounds

In the second half of this chapter, we consider the lossy source coding problem, where the source P ∈ P(X) does not have to be discrete. The setup is as in Fig. 3.1 and the reconstruction alphabet (which need not be the same as X) is denoted as X̂. For the lossy case, one considers a distortion measure d(x, x̂) between the source x ∈ X and its reconstruction x̂ ∈ X̂. This is simply a mapping from X × X̂ to the set of nonnegative real numbers.
We make the following simplifying assumptions throughout.

(i) There exists a Δ such that R(P, Δ), defined in (3.1), is finite.
(ii) The distortion measure is such that there exists a finite set E ⊆ X̂ such that E[ min_{x̂∈E} d(X, x̂) ] is finite.
(iii) For every x ∈ X, there exists an x̂ ∈ X̂ such that d(x, x̂) = 0.
(iv) The source P and the distortion measure d are such that the minimizing test channel W in the rate-distortion function in (3.1) is unique, and we denote it as W*.

These assumptions are not overly restrictive. Indeed, the most common distortion measures and sources, such as finite alphabet sources with the Hamming distortion d(x, x̂) = 1{x ≠ x̂} and Gaussian sources with quadratic distortion d(x, x̂) = (x − x̂)², satisfy these assumptions.
An (M, Δ, d, ε)-code for the source P ∈ P(X) consists of an encoder f : X → {1, ..., M} and a decoder φ : {1, ..., M} → X̂ such that the probability of excess distortion

P( { x ∈ X : d( x, φ(f(x)) ) > Δ } ) ≤ ε.  (3.34)

The number M is called the size of the code (f, φ).

Given a source P, define the lossy source coding non-asymptotic fundamental limit as

M*(P, Δ, d, ε) := min{ M ∈ N : ∃ an (M, Δ, d, ε)-code for P }.  (3.35)

In the following subsections, we present a non-asymptotic achievability bound and a corresponding converse bound, both of which we evaluate asymptotically in the next section.
3.4.1 An Achievability Bound

The non-asymptotic achievability bound is based on Shannon's random coding argument, and is due to Kostina-Verdú [97, Thm. 9]. The encoder is similar to the familiar joint typicality encoder [49, Ch. 2], with typicality defined in terms of the distortion measure. To state the bound compactly, define the distortion ball around x as

B_Δ(x) := { x̂ ∈ X̂ : d(x, x̂) ≤ Δ }.  (3.36)
Proposition 3.3 (Random Coding Bound). There exists an (M, Δ, d, ε)-code satisfying

ε ≤ inf_{P_X̂} E_X[ exp( −M P_X̂( B_Δ(X) ) ) ].  (3.37)

Proof. Fix P_X̂ and generate M codewords X̂(1), ..., X̂(M) i.i.d. according to P_X̂. The encoder maps x to an index m with d(x, x̂(m)) ≤ Δ if one exists (and to an arbitrary index otherwise), and the decoder outputs x̂(m). Averaged over the random codebook, the probability of excess distortion satisfies

Pr( d(X, X̂(m)) > Δ for all m ) = E_X[ Π_{m=1}^M Pr( d(X, X̂(m)) > Δ | X ) ]  (3.39)-(3.40)
= E_X[ Π_{m=1}^M ( 1 − P_X̂( B_Δ(X) ) ) ]  (3.41)
= E_X[ ( 1 − P_X̂( B_Δ(X) ) )^M ].  (3.42)

Applying the inequality (1 − x)^k ≤ e^{−kx} and minimizing over all possible choices of P_X̂ completes the proof.
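The relaxation (1 − x)^M ≤ e^{−Mx} used in the last step can be seen concretely in a toy instance (a sketch, not part of the original text; the uniform codebook is an illustrative, generally suboptimal, choice of P_X̂). With P_X̂ uniform on {0,1}^n and the normalized Hamming distortion, P_X̂(B_Δ(x)) is the same for every x by symmetry, so both (3.42) and the relaxed bound in (3.37) collapse to closed forms:

```python
import math

def ball_mass(n, delta):
    """P_Xhat(B_delta(x)) when P_Xhat is uniform on {0,1}^n and d is the
    normalized Hamming distortion; by symmetry it is the same for all x."""
    radius = math.floor(delta * n + 1e-12)   # guard against float roundoff
    return sum(math.comb(n, j) for j in range(radius + 1)) / 2 ** n

def excess_exact(n, M, delta):
    """Exact value of (3.42) for a codebook of M i.i.d. uniform codewords:
    E_X[(1 - P_Xhat(B_delta(X)))^M], which is x-independent here."""
    return (1.0 - ball_mass(n, delta)) ** M

def excess_bound(n, M, delta):
    """The relaxed bound (3.37): E_X[exp(-M * P_Xhat(B_delta(X)))]."""
    return math.exp(-M * ball_mass(n, delta))
```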
3.4.2 A Converse Bound

In order to state the converse bound, we need to introduce a quantity that is fundamental to rate-distortion theory. For discrete random variables with the Hamming distortion measure (d(x, x̂) = 1{x ≠ x̂}), it coincides with the self-information random variable which, as we have seen in Section 3.2, plays a key role in the asymptotic expansion of log M*(P^n, ε).

The tilted information of x [94, 97] for a given distortion measure d (whose dependence is suppressed) is defined as

ȷ(x; P, Δ) := −log E_X̂[ exp( λ*Δ − λ* d(x, X̂) ) ],  (3.43)

where X̂ is distributed as PW* and

λ* := −∂R(P, Δ')/∂Δ' |_{Δ'=Δ}.  (3.44)

The differentiability of the rate-distortion function with respect to Δ is guaranteed by the assumptions in Section 3.4. The term tilted information was introduced by Kostina and Verdú [97].
Example 3.1. Consider the Gaussian source X ~ P(x) = N(x; 0, σ²) with squared-error distortion measure d(x, x̂) = (x − x̂)². For this problem, simple calculations reveal that

ȷ(x; P, Δ) = (1/2) log( σ²/Δ ) + ( x²/σ² − 1 ) (log e)/2  (3.45)

if Δ ≤ σ², and 0 otherwise.
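Working in nats (so log e = 1), the expectation and variance of this tilted information can be checked by Monte Carlo (a sketch, not part of the original text); the expectation should be the rate-distortion function (1/2) log(σ²/Δ) and the variance should be 1/2:

```python
import math
import random
import statistics

def tilted_info(x, sigma2, delta):
    """Tilted information (3.45) of the Gaussian source with quadratic
    distortion, in nats (log e = 1); valid for delta <= sigma2."""
    return 0.5 * math.log(sigma2 / delta) + 0.5 * (x * x / sigma2 - 1.0)

rng = random.Random(11)
sigma2, delta = 4.0, 1.0
samples = [tilted_info(rng.gauss(0.0, math.sqrt(sigma2)), sigma2, delta)
           for _ in range(200000)]
mean = statistics.mean(samples)       # should be near 0.5 * log(4) = R(P, delta)
var = statistics.pvariance(samples)   # should be near 1/2 nats^2
```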
One important property of the tilted information of x is that its expectation is exactly equal to the rate-distortion function, i.e.,

R(P, Δ) = E_X[ ȷ(X; P, Δ) ].  (3.46)

For the Gaussian source with quadratic distortion, the equality above is easy to verify from Example 3.1. In view of the asymptotic expansion of lossless source coding in Theorem 3.1, we may expect that the variance of ȷ(X; P, Δ) characterizes the second-order asymptotics of rate-distortion. This is indeed so, as we will see in the following. Other properties of the tilted information are summarized in [34, Lem. 1.4] and [97, Properties 1 & 2].

Equipped with the definition of the tilted information, we are now ready to state the non-asymptotic converse bound that will turn out to be amenable to asymptotic analyses. This elegant bound was proved by Kostina-Verdú [97, Thm. 7].

Proposition 3.4 (Converse Bound for Lossy Compression). Fix γ > 0. Every (M, Δ, d, ε)-code must satisfy

ε ≥ Pr( ȷ(X; P, Δ) ≥ log M + γ ) − exp(−γ).  (3.47)

Observe that this is a generalization of Proposition 3.2 for the lossless case; in particular, it generalizes the bound in (3.9). It is also remarkably similar to the Verdú-Han information spectrum converse bound [169, Lem. 4] for channel coding (reviewed in (4.10) in Section 4.1.2). This is unsurprising, as channel coding and rate-distortion are duals in many ways. We refer the reader to [97, Thm. 7] for the proof of Proposition 3.4.
3.5 Lossy Source Coding: Asymptotic Expansions

As mentioned in the introduction of this chapter, the first-order fundamental limit for lossy source coding of stationary and memoryless sources P^n is the rate-distortion function R(P, Δ). We are interested in finer approximations of the non-asymptotic fundamental limit M*(P^n, Δ, d^(n), ε), where P^n is the distribution of a stationary, memoryless source X and the distortion measure d^(n) : X^n × X̂^n → [0, ∞) is separable, i.e.,

d^(n)(x, x̂) = (1/n) Σ_{i=1}^n d(x_i, x̂_i)  (3.48)

for any (x, x̂) ∈ X^n × X̂^n. Let the variance of the tilted information of X be termed the rate-dispersion function

V(P, Δ) := Var[ ȷ(X; P, Δ) ].  (3.49)

Example 3.2. Let us revisit the Gaussian source with quadratic distortion in Example 3.1. It is easy to verify that the variance of ȷ(X; P, Δ) is

V(P, Δ) = (log² e)/2  (3.50)

if Δ ≤ σ², and 0 otherwise. Hence, interestingly, the rate-dispersion function for the Gaussian source with quadratic distortion depends neither on the source variance σ² nor the distortion Δ if Δ ≤ σ². This is peculiar to the Gaussian source with quadratic distortion.
Theorem 3.3. If P and d satisfy the assumptions in Section 3.4 and, in addition, V(P, Δ) > 0 and E_{P×W*}[ d(X, X̂)⁹ ] < ∞, then

log M*(P^n, Δ, d^(n), ε) = nR(P, Δ) − √(nV(P, Δ)) Φ^{-1}(ε) + O(log n).  (3.51)

For the case of zero rate-dispersion function V(P, Δ) = 0, the reader is referred to [97, Thm. 12]. The condition E_{P×W*}[ d(X, X̂)⁹ ] < ∞ is a technical one, made to ensure that the third absolute moment of ȷ(X; P, Δ) is finite for the applicability of the Berry-Esseen theorem.
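For the Gaussian example, (3.51) suggests the back-of-the-envelope per-symbol rate R(P, Δ) + √(V(P, Δ)/n) Φ^{-1}(1−ε), dropping the O(log n)/n term. A sketch (not part of the original text), in nats:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    """Inverse of norm_cdf by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gaussian_rate_approx(n, eps, sigma2, delta):
    """Per-symbol rate (1/n) log M*(P^n, delta, d^(n), eps) suggested by
    (3.51) for the Gaussian source (nats), dropping the O(log n) term,
    using R(P, delta) = 0.5*log(sigma2/delta) and V(P, delta) = 1/2."""
    R = 0.5 * math.log(sigma2 / delta)
    V = 0.5
    return R + math.sqrt(V / n) * norm_ppf(1.0 - eps)
```

At ε = 0.01 the required rate at n = 100 exceeds R(P, Δ) by about 0.16 nats per symbol, and the excess shrinks like 1/√n.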
Proof sketch. For an i.i.d. source X^n, the tilted information single-letterizes because the optimal test channel in the rate-distortion formula also has the required product structure. Hence,

ȷ(X^n; P^n, Δ) = Σ_{i=1}^n ȷ(X_i; P, Δ).  (3.52)

Using the Berry-Esseen theorem, the probability in (3.47) can be lower bounded as

Pr( ȷ(X^n; P^n, Δ) ≥ log M + γ ) ≥ 1 − Φ( ( log M + γ − nR(P, Δ) ) / √(nV(P, Δ)) ) − ζ/√n,  (3.53)

where ζ is a function of the third absolute moment of ȷ(X; P, Δ), which is finite by the assumption that E_{P×W*}[ d(X, X̂)⁹ ] < ∞. Now set γ = (1/2) log n and M to the smallest integer larger than

exp( nR(P, Δ) − √(nV(P, Δ)) Φ^{-1}( ε + (ζ+1)/√n ) − (1/2) log n ).  (3.54)

By the non-asymptotic converse bound in Proposition 3.4, any code of size smaller than (3.54) has excess-distortion probability exceeding ε. This implies that the number of codewords must not be smaller than that stated in (3.54), concluding the converse proof.
For the direct part, we need a technical lemma [97, Lem. 2] relating the $P^n_{\hat{X}^*}$-probability of a distortion ball to the $\Delta$-tilted information.

Lemma 3.1. There exist constants $b, c, k > 0$ such that for all sufficiently large $n$,
$$ \Pr\bigg( \log \frac{1}{P^n_{\hat{X}^*}\big(B_\Delta(X^n)\big)} > \sum_{i=1}^n \jmath(X_i; P, \Delta) + b \log n + c \bigg) \le \frac{k}{n}. \qquad (3.55) $$

This lemma says that we can control the $P^n_{\hat{X}^*}$-probability of distortion balls centered at a random source sequence $X^n$ in terms of the $\Delta$-tilted information. Now define the random variable
$$ G_n := \log M - \sum_{i=1}^n \jmath(X_i; P, \Delta) - b \log n - c. \qquad (3.56) $$
Choose the distribution $P_{\hat{X}}$ in the non-asymptotic achievability bound in Proposition 3.3 to be the product distribution $(P_{\hat{X}^*})^n$. Applying Lemma 3.1, we find that
$$ \varepsilon' \le \mathbb{E}\Big[ e^{-M\,(P_{\hat{X}^*})^n(B_\Delta(X^n))} \Big] \qquad (3.57) $$
$$ \le \mathbb{E}\Big[ e^{-\exp(G_n)} \Big] + \frac{k}{n} \qquad (3.58) $$
$$ \le \Pr\Big( G_n \le \log\frac{\ln n}{2} \Big) + \Pr\Big( G_n > \log\frac{\ln n}{2} \Big)\,\frac{1}{\sqrt{n}} + \frac{k}{n}, \qquad (3.59) $$
where in the final step, we split the expectation into two parts depending on whether $G_n > \log\frac{\ln n}{2}$ or otherwise, using the bound $e^{-\exp(G_n)} \le e^{-(\ln n)/2} = n^{-1/2}$ in the former case. Since $G_n$ is a sum of i.i.d. random variables, the first probability can be evaluated using the Berry-Esseen theorem similarly to (3.53), and the second is bounded above by 1.
3.6 Second-Order Asymptotics of Lossy Source Coding via the Method of Types
In this final section of the chapter, we briefly comment on how Theorem 3.3 can be obtained by means of a technique that is based on the method of types. Of course, this technique only applies to discrete (finite alphabet) sources, so it is more restrictive than the Kostina-Verdú [97] method we presented. However, as with all results proved using the method of types, the analysis technique and the form of the result may be more insightful to some readers. The exposition in this section is due to Ingber and Kochman [86].

We make the simplifying assumption that the rate-distortion function $R(P, \Delta)$ is differentiable with respect to $\Delta$ (guaranteed by assumption (iv) in Section 3.4) and twice differentiable with respect to the probability mass function $P$. Ingber and Kochman [86] considered the fundamental quantity
$$ R'(x; P, \Delta) := \frac{\partial R(Q, \Delta)}{\partial Q(x)}\bigg|_{Q = P}. \qquad (3.60) $$
It can be shown [96, Thm. 2.2] that $R'(x; P, \Delta)$ and the $\Delta$-tilted information are related as follows:
$$ R'(x; P, \Delta) = \jmath(x; P, \Delta) - \log e. \qquad (3.61) $$
Hence, the expectation of $R'(X; P, \Delta)$ is the rate-distortion function $R(P, \Delta)$ up to a constant, and its variance is exactly the rate-dispersion function $V(P, \Delta)$ in (3.49).
A codeword $\hat{x}(m) \in \hat{\mathcal{X}}^n$ is simply an output of the decoder, $\varphi(m)$. The collection of all $M$ codewords forms the codebook. Given a codebook $\mathcal{C} = \{\hat{x}(1), \ldots, \hat{x}(M)\}$, we say that $x \in \mathcal{X}^n$ is $\Delta$-covered by $\mathcal{C}$ if there exists a codeword $\hat{x}(m) \in \mathcal{C}$ such that $d^{(n)}(x, \hat{x}(m)) \le \Delta$.
The analysis technique in [86] is based on the following lemma.

Lemma 3.2 (Type Covering). For every type $Q \in \mathcal{P}_n(\mathcal{X})$, there exists a codebook $\mathcal{C} := \{\hat{x}(1), \ldots, \hat{x}(M)\} \subset \hat{\mathcal{X}}^n$ of size $M$ and a function $g_1(|\mathcal{X}|, |\hat{\mathcal{X}}|)$ such that every $x \in T_Q$ is $\Delta$-covered by $\mathcal{C}$, and
$$ \frac{1}{n}\log M \le R(Q, \Delta) + g_1(|\mathcal{X}|, |\hat{\mathcal{X}}|)\,\frac{\log n}{n}. \qquad (3.62) $$
Furthermore, let the code size $M$ and a type $Q \in \mathcal{P}_n(\mathcal{X})$ be such that $\log M < nR(Q, \Delta)$. Then for every codebook $\mathcal{C} \subset \hat{\mathcal{X}}^n$ of size $M$, the fraction of $T_Q$ that is $\Delta$-covered by $\mathcal{C}$ is at most
$$ \exp\Big( -nR(Q, \Delta) + \log M + g_2(|\mathcal{X}|, |\hat{\mathcal{X}}|) \log n \Big) \qquad (3.63) $$
for some function $g_2(|\mathcal{X}|, |\hat{\mathcal{X}}|)$.
The achievability part of the lemma in (3.62) is a refined version of the type covering lemma by Berger [16, Sec. 6.2.1, Lem. 1]. A slightly weaker version of the lemma is also presented in Csiszár-Körner [39, Ch. 9] and was used by Marton [106] to find the error exponent for lossy source coding. The refinement comes about in the $O(\frac{\log n}{n})$ remainder term, which is required for analyzing the setting in which the excess-distortion probability is non-vanishing. The converse part in (3.63) is a corollary of Zhang-Yang-Wei [191, Lem. 3].

We now provide an alternative proof of Theorem 3.3 using the type covering lemma. The crux of the achievability argument is to use the type covering lemma to identify a set of sequences of size $M$ such that the set of sequences in $\mathcal{X}^n$ that it manages to $\Delta$-cover has probability approximately $1 - \varepsilon$, so the excess-distortion probability is roughly $\varepsilon$. The types of sequences in this set are denoted as $\mathcal{K}$ in the proof below. The $P^n$-probability of $\mathcal{K}$ can be estimated using a central limit relation, similar to the analysis in the proof of Theorem 3.2. The converse argument hinges on the fact that the codebook given by the achievability part of the type covering lemma is essentially optimal in terms of its size.
Proof sketch of Theorem 3.3. Roughly speaking, the idea in the achievability proof is to encode all sequences in $\mathcal{X}^n$ whose empirical rate-distortion function is no larger than some threshold. More specifically, encode (using codes prescribed by Lemma 3.2) sequences belonging to
$$ \mathcal{K} := \bigcup_{Q \in \mathcal{P}_n(\mathcal{X}) \,:\, R(Q, \Delta) \le \gamma} T_Q, \qquad (3.64) $$
where
$$ \gamma := R(P, \Delta) - \sqrt{\frac{V(P, \Delta)}{n}}\,\Phi^{-1}(\varepsilon). \qquad (3.65) $$
By (3.62) and the type counting lemma, the size of $\mathcal{K}$ satisfies the requirement in Theorem 3.3. The resultant probability of excess distortion is $\Pr\big( R(P_{X^n}, \Delta) > \gamma \big)$, where $P_{X^n} \in \mathcal{P}_n(\mathcal{X})$ is the (random) type of $X^n \in \mathcal{X}^n$. Similarly to (3.33) for the lossless case, the following central limit relation holds:
$$ \sqrt{n}\Big( R(P_{X^n}, \Delta) - R(P, \Delta) \Big) \xrightarrow{d} \mathcal{N}\big( 0, V(P, \Delta) \big). \qquad (3.66) $$
The above convergence can be verified by using the Berry-Esseen theorem for functions of i.i.d. random vectors (Theorem 1.5), per the proof of Theorem 3.2. Hence, the probability of excess distortion is roughly $\varepsilon$ and the achievability proof is complete.
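The central limit relation (3.66) can be checked empirically. The sketch below (illustrative code, not from [86]) uses a Bernoulli($q$) source with Hamming distortion, for which $R(Q, \Delta) = H(Q) - H(\Delta)$ when $\Delta \le \min(q, 1-q)$ and $V(P, \Delta) = q(1-q)\log_2^2\frac{1-q}{q}$; the scaled deviations $\sqrt{n}\big(R(P_{X^n}, \Delta) - R(P, \Delta)\big)$ should have empirical variance close to $V(P, \Delta)$:

```python
import math, random

def h2(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

random.seed(1)
q, Delta, n, trials = 0.2, 0.05, 1000, 2000
R = h2(q) - h2(Delta)                  # R(P, Delta) for a Bernoulli(q) source
scaled = []
for _ in range(trials):
    k = sum(random.random() < q for _ in range(n))
    R_hat = h2(k / n) - h2(Delta)      # empirical rate-distortion function R(P_{X^n}, Delta)
    scaled.append(math.sqrt(n) * (R_hat - R))

mean = sum(scaled) / trials
v_hat = sum((s - mean) ** 2 for s in scaled) / trials
V = q * (1 - q) * (math.log2((1 - q) / q)) ** 2   # rate-dispersion V(P, Delta)
print(v_hat, V)  # the two values should be close
```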
The converse part follows from the fact that we can lower bound the probability of the excess-distortion event $\mathcal{E} := \{ d^{(n)}(X^n, \hat{X}^n) > \Delta \}$ as
$$ \Pr(\mathcal{E}) \ge \Pr\big( \mathcal{E} \,\big|\, R(P_{X^n}, \Delta) > R + \delta_n \big)\,\Pr\big( R(P_{X^n}, \Delta) > R + \delta_n \big), \qquad (3.67) $$
where $R = \frac{1}{n}\log M$ is the code rate and $\delta_n$ is arbitrary. Now, by (3.63), if the realized type of the source is $Q \in \mathcal{P}_n(\mathcal{X})$ where $R(Q, \Delta) > R + \delta_n$, then the fraction of the type class $T_Q$ that is $\Delta$-covered is at most
$$ \exp\big( -nR(Q, \Delta) + nR + g_2 \log n \big) \le \exp\big( -n\delta_n + g_2 \log n \big). \qquad (3.68) $$
Since all sequences in a type class are equally likely (Lemma 1.3), the probability of no excess distortion conditioned on the event $\{ R(P_{X^n}, \Delta) > R + \delta_n \}$ is at most $\frac{1}{n}$ if $\delta_n := (g_2 + 1)\frac{\log n}{n}$. Thus
$$ \Pr(\mathcal{E}) \ge \Big( 1 - \frac{1}{n} \Big) \Pr\big( R(P_{X^n}, \Delta) > R + \delta_n \big). \qquad (3.69) $$
For $\log M = nR$ chosen to be as in (3.51) in Theorem 3.3, the probability on the right is at least $\varepsilon - O\big(\frac{1}{\sqrt{n}}\big)$ by a quantitative version of the convergence in distribution in (3.66).
Chapter 4
Channel Coding
This chapter presents fixed-error asymptotic results for point-to-point channel coding, which is perhaps the most fundamental problem in information theory. Shannon [141] showed that the maximum rate of transmission over a memoryless channel is the information capacity
$$ C(W) = \max_{P \in \mathcal{P}(\mathcal{X})} I(P, W). \qquad (4.1) $$
This first-order fundamental limit is attained as the number of channel uses (or blocklength) tends to infinity. Wolfowitz [180] showed the strong converse for a large class of memoryless channels, which intuitively means that for codes with rates above $C(W)$, the error probability necessarily tends to one. The contrapositive of this statement is that, even if we allow the error probability to be close to one (a strange requirement in practice), one cannot send more bits per channel use than what is prescribed by the information capacity in (4.1).
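The maximization in (4.1) can be carried out numerically by the classical Blahut-Arimoto algorithm (a standard tool, not part of this monograph; the code below is an illustrative sketch). For symmetric channels such as the BSC, it recovers the familiar closed form $C = 1 - h(\delta)$ bits:

```python
import math

def blahut_arimoto(W, iters=200):
    """Capacity (in bits) of a DMC given as an |X| x |Y| row-stochastic matrix W."""
    nx, ny = len(W), len(W[0])
    p = [1.0 / nx] * nx                       # start from the uniform input
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(nx)) for y in range(ny)]  # output dist.
        c = []
        for x in range(nx):
            # c_x = exp( D(W(.|x) || q) ), divergences in nats
            d = sum(W[x][y] * math.log(W[x][y] / q[y]) for y in range(ny) if W[x][y] > 0)
            c.append(math.exp(d))
        s = sum(p[x] * c[x] for x in range(nx))
        p = [p[x] * c[x] / s for x in range(nx)]  # multiplicative update
    return math.log2(s)                            # capacity in bits at convergence

d = 0.11
W = [[1 - d, d], [d, 1 - d]]                  # BSC(0.11)
C = blahut_arimoto(W)
h = -d * math.log2(d) - (1 - d) * math.log2(1 - d)
print(C, 1 - h)   # both ≈ 0.5 bits/use
```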
In the rest of this chapter, we revisit the problem of channel coding from the viewpoint of the error probability being non-vanishing. First, we define the channel coding problem as well as some important non-asymptotic fundamental limits. Next, we derive bounds on these limits. Some of these bounds are intimately linked to ideas in and quantities related to binary hypothesis testing. We then evaluate these bounds for large blocklengths while keeping the error probability (either maximum or average) bounded above by some constant $\varepsilon \in (0, 1)$. We only concern ourselves with two classes of channels, namely the discrete memoryless channel (DMC) and the additive white Gaussian noise (AWGN) channel. We present second- and even third-order asymptotic expansions for the logarithm of the non-asymptotic fundamental limits. The chapter is concluded with a discussion of source-channel transmission and the cost of separation.

The material in this chapter on point-to-point channel coding is based primarily on the works by Strassen [152], Hayashi [76], Polyanskiy-Poor-Verdú [123], Altuğ-Wagner [12], Tomamichel-Tan [164] and Tan-Tomamichel [159]. The material on joint source-channel coding is based on the works by Kostina-Verdú [99] and Wang-Ingber-Kochman [170].
4.1
We now set up the channel coding problem formally. A channel is simply a stochastic map $W$ from an input alphabet $\mathcal{X}$ to an output alphabet $\mathcal{Y}$. For the majority of the chapter, we assume that there are no cost constraints on the codewords; the necessary changes required for channels with cost constraints (such as the AWGN channel) will be pointed out. See Fig. 4.1 for an illustration of the setup.

An $(M, \varepsilon)_{\mathrm{ave}}$ code for the channel $W \in \mathcal{P}(\mathcal{Y}|\mathcal{X})$ consists of a message set $\mathcal{M} = \{1, \ldots, M\}$ and a pair of maps including an encoder $f : \{1, \ldots, M\} \to \mathcal{X}$ and a decoder $\varphi : \mathcal{Y} \to \{1, \ldots, M\}$ such that the average error probability satisfies
$$ \frac{1}{M} \sum_{m \in \mathcal{M}} W\big( \mathcal{Y} \setminus \varphi^{-1}(m) \,\big|\, f(m) \big) \le \varepsilon. \qquad (4.2) $$
An $(M, \varepsilon)_{\max}$ code is defined analogously, with the maximum (rather than the average) error probability over the messages required to be no larger than $\varepsilon$. The non-asymptotic fundamental limits are
$$ M^*_{\mathrm{ave}}(W, \varepsilon) := \max\big\{ M \in \mathbb{N} : \exists \text{ an } (M, \varepsilon)_{\mathrm{ave}} \text{ code for } W \big\}, \quad\text{and} \qquad (4.4) $$
$$ M^*_{\max}(W, \varepsilon) := \max\big\{ M \in \mathbb{N} : \exists \text{ an } (M, \varepsilon)_{\max} \text{ code for } W \big\}. \qquad (4.5) $$
In the following, we will evaluate these limits when $W$ assumes some structure, for example memorylessness and stationarity. Note that blocklength plays no role in the above definitions. In the sequel, we study the dependence of the fundamental limits on the blocklength $n$ by inserting the $n$-fold channel $W^n$ in place of $W$ in (4.4) and (4.5). Before we perform the evaluations, we state some bounds on $M$ and $\varepsilon$ for arbitrary channels $W$.
4.1.1
Achievability Bounds
In this section, we state three achievability bounds. We evaluate these bounds for memoryless channels in the following sections. The first is Feinstein's bound [53], stated in terms of the information spectrum divergence.
Proposition 4.1 (Feinstein's theorem). Let $\varepsilon \in (0, 1)$ and let $W$ be any channel from $\mathcal{X}$ to $\mathcal{Y}$. Then for any $\eta \in (0, \varepsilon)$, we have
$$ \log M^*_{\max}(W, \varepsilon) \ge \sup_{P \in \mathcal{P}(\mathcal{X})} D_s^{\varepsilon - \eta}\big( P \times W \,\big\|\, P \times PW \big) - \log\frac{1}{\eta}. \qquad (4.6) $$
The proof of this bound can be found in Han's book [67, Lem. 3.4.1] and uses a greedy approach for selecting codewords. The average error probability version of this bound can be proved in a more straightforward manner using threshold decoding; cf. [66, Thm. 1]. The following is a slight strengthening of Feinstein's theorem.
Proposition 4.2. There exists an $(M, \varepsilon)_{\max}$ code for $W$ such that for any $\gamma > 0$ and any input distribution $P \in \mathcal{P}(\mathcal{X})$,
$$ \varepsilon \le \Pr\bigg( \log\frac{W(Y|X)}{PW(Y)} \le \gamma \bigg) + M \sup_{x \in \mathcal{X}} \Pr\bigg( \log\frac{W(Y|x)}{PW(Y)} > \gamma \bigg), \qquad (4.7) $$
where the distribution of $(X, Y)$ is $P \times W$ in the first probability and the distribution of $Y$ is $PW$ in the second.
The proof of this bound can be found in [123, Thm. 21]. It uses a sequential random coding technique where each codeword is chosen at random based on previous choices. Feinstein's bound can be derived as a corollary to Proposition 4.2 by upper bounding the final probability in (4.7) by $\exp(-\gamma)$ and using the identification $\log M = \gamma - \log\frac{1}{\eta}$.

The previous two bounds are essentially threshold decoding bounds, i.e., we compare the likelihood ratio between the channel and the output distribution to a threshold $\gamma$. For the average probability of error setting, one can compare the likelihood ratios of codewords directly and use maximum likelihood decoding to obtain the following bound.
Proposition 4.3 (Random Coding Union (RCU) Bound). There exists an $(M, \varepsilon)_{\mathrm{ave}}$ code for $W$ such that for any input distribution $P \in \mathcal{P}(\mathcal{X})$,
$$ \varepsilon \le \mathbb{E}\bigg[ \min\bigg\{ 1,\; M \Pr\bigg( \log\frac{W(Y|\bar{X})}{PW(Y)} \ge \log\frac{W(Y|X)}{PW(Y)} \,\bigg|\, X, Y \bigg) \bigg\} \bigg], \qquad (4.8) $$
where $(X, \bar{X}, Y)$ is distributed as $P(x)P(\bar{x})W(y|x)$.

The proof of this bound can be found in [123, Thm. 16]. Note that the outer expectation is over $(X, Y)$ while the inner probability is over $\bar{X}$. Under certain conditions on a DMC and for any AWGN channel, one can use the RCU bound to prove the achievability of $+\frac{1}{2}\log n + O(1)$ for the third-order term in the asymptotic expansion of $\log M^*(W^n, \varepsilon)$. This is what we do in the subsequent sections.
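For a concrete sense of Proposition 4.3, the RCU bound admits a simple finite-blocklength evaluation for the BSC with i.i.d. uniform inputs (a standard computation in the spirit of [123]; the code below is an illustrative sketch, not from the monograph). Conditioned on $d(X^n, Y^n) = t$, the inner probability in (4.8) is $\Pr[d(\bar{X}^n, Y^n) \le t] = 2^{-n}\sum_{s \le t}\binom{n}{s}$:

```python
import math

def rcu_bsc(n, delta, logM):
    """Evaluate the RCU bound for BSC(delta) with i.i.d. uniform inputs.
    logM is log2 of the number of messages."""
    M1 = 2.0 ** logM - 1.0            # M - 1 competing codewords
    eps = 0.0
    cum = 0.0                          # running sum_{s<=t} C(n, s) / 2^n
    for t in range(n + 1):
        cum += math.comb(n, t) / 2.0 ** n
        pt = math.comb(n, t) * delta ** t * (1 - delta) ** (n - t)  # P(d(X,Y) = t)
        eps += pt * min(1.0, M1 * cum)
    return eps

# BSC(0.11), n = 200: a rate of 0.3 bits/use (below capacity ≈ 0.5) yields a
# small error bound, while a rate of 0.6 (above capacity) yields a large one.
print(rcu_bsc(200, 0.11, 0.3 * 200))
print(rcu_bsc(200, 0.11, 0.6 * 200))
```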
4.1.2
A Converse Bound
The only converse bound we will evaluate asymptotically appeared in different forms in the works by Verdú-Han [169, Lem. 4], Hayashi-Nagaoka [77, Lem. 4], Polyanskiy-Poor-Verdú [123, Sec. III-E] and Tomamichel-Tan [164, Prop. 6]. This converse bound relates channel coding to binary hypothesis testing. This relation, and its application to asymptotic converse theorems, can be traced back to early works by Shannon-Gallager-Berlekamp [146] and Wolfowitz [181]. The reader is referred to Dalai's work [42, App. B] for an excellent modern exposition on this topic.
Proposition 4.4 (Symbol-Wise Converse Bound). Let $\varepsilon \in (0, 1)$ and let $W$ be any channel from $\mathcal{X}$ to $\mathcal{Y}$. Then, for any $\delta \in (0, 1 - \varepsilon)$, we have
$$ \log M^*_{\mathrm{ave}}(W, \varepsilon) \le \inf_{Q \in \mathcal{P}(\mathcal{Y})} \sup_{x \in \mathcal{X}} D_s^{\varepsilon + \delta}\big( W(\cdot|x) \,\big\|\, Q \big) + \log\frac{1}{\delta}. \qquad (4.9) $$
If the codewords are constrained to belong to some set $A \subset \mathcal{X}$ (due to cost constraints, say), the supremum above is to be replaced by $\sup_{x \in A}$.
The first part of the proof is analogous to the meta-converse in [123, Thm. 27]. See also Wang-Colbeck-Renner [172] and Wang-Renner [173], which inspired the conceptually simpler proof technique presented below. The bound in (4.9) is a symbol-wise relaxation of the meta-converse [123, Thms. 28 and 31] and Hayashi-Nagaoka's converse [77, Lem. 4]. The maximization over symbols allows us to apply our converse bound to non-constant-composition codes for DMCs directly. With an appropriate choice of $Q$, it allows us to prove a $\frac{1}{2}\log n + O(1)$ upper bound for the third-order asymptotics for positive-dispersion DMCs (cf. Theorem 4.3).
We remark that, in our notation, the information spectrum converse bound in Verdú-Han [169, Lem. 4] takes the form
$$ \log M^*_{\mathrm{ave}}(W, \varepsilon) \le \sup_{P \in \mathcal{P}(\mathcal{X})} D_s^{\varepsilon + \delta}\big( P \times W \,\big\|\, P \times PW \big) + \log\frac{1}{\delta}, \qquad (4.10) $$
so it does not allow one to choose the output distribution $Q$. Observe the beautiful duality of the Verdú-Han converse with Feinstein's direct theorem. The bound in Hayashi-Nagaoka [77, Lem. 4] (stated for classical-quantum channels in their context) affords this freedom and is stated as
$$ \log M^*_{\mathrm{ave}}(W, \varepsilon) \le \inf_{Q \in \mathcal{P}(\mathcal{Y})} \sup_{P \in \mathcal{P}(\mathcal{X})} D_s^{\varepsilon + \delta}\big( P \times W \,\big\|\, P \times Q \big) + \log\frac{1}{\delta}. \qquad (4.11) $$
Hence, we see that the bound in Proposition 4.4 is essentially a symbol-wise relaxation of the Hayashi-Nagaoka converse bound [77, Lem. 4] (applying Lemma 2.3) as well as the meta-converse theorems in [123, Thms. 28 and 31].
Since the proof of Proposition 4.4 is short, we provide the details.
Proof of Proposition 4.4. Fix an $(M, \varepsilon)_{\mathrm{ave}}$ code for $W$ with message set $\mathcal{M}$ and an arbitrary output distribution $Q \in \mathcal{P}(\mathcal{Y})$. Let $M$ and $\hat{M}$ be the sent message and the estimated message, respectively. Starting from a uniform distribution $P_M$ over $\mathcal{M}$, the Markov chain $M \xrightarrow{\,f\,} X \xrightarrow{\,W\,} Y \xrightarrow{\,\varphi\,} \hat{M}$ induces the joint distribution
$$ P_{MXY\hat{M}} := P_M \times P_{X|M} \times W \times P_{\hat{M}|Y}, \qquad (4.12) $$
where $P_X = P$ and $Q_{\hat{M}}$ is the distribution induced by $\varphi$ applied to $Q_Y = Q$. Moreover, using the test $\delta(m, \hat{m}) := \mathbb{1}\{m \ne \hat{m}\}$, we see that
$$ \mathbb{E}_{P_{M\hat{M}}}\big[ \delta(M, \hat{M}) \big] = \Pr(M \ne \hat{M}) \le \varepsilon, \qquad (4.13) $$
where $(M, \hat{M}) \sim P_{M\hat{M}}$ above, and
$$ \mathbb{E}_{P_M \times Q_{\hat{M}}}\big[ \delta(M, \hat{M}) \big] = \sum_{(m, \hat{m}) \in \mathcal{M} \times \mathcal{M}} P_M(m)\,Q_{\hat{M}}(\hat{m})\,\mathbb{1}\{m \ne \hat{m}\} \qquad (4.14) $$
$$ = 1 - \sum_{m \in \mathcal{M}} Q_{\hat{M}}(m)\,P_M(m) \qquad (4.15) $$
$$ = 1 - \sum_{m \in \mathcal{M}} Q_{\hat{M}}(m)\,\frac{1}{M} \qquad (4.16) $$
$$ = 1 - \frac{1}{M}. \qquad (4.17) $$
Hence, $D_h^{\varepsilon}\big( P_{M\hat{M}} \,\big\|\, P_M \times Q_{\hat{M}} \big) \ge \log M + \log(1 - \varepsilon)$ per the definition of the hypothesis testing divergence. Finally, applying Lemmas 2.2 and 2.3 yields
$$ \sup_{x \in \mathcal{X}} D_s^{\varepsilon + \delta}\big( W(\cdot|x) \,\big\|\, Q \big) \ge D_s^{\varepsilon + \delta}\big( P \times W \,\big\|\, P \times Q \big) \qquad (4.18) $$
$$ \ge D_h^{\varepsilon}\big( P \times W \,\big\|\, P \times Q \big) - \log\frac{1 - \varepsilon}{\delta} \qquad (4.19) $$
$$ \ge \log M - \log\frac{1}{\delta}. \qquad (4.20) $$
4.2
In this section, we consider asymptotic expansions for DMCs. Recall that a DMC (without feedback) for blocklength $n$ is a channel $W^n \in \mathcal{P}(\mathcal{Y}^n | \mathcal{X}^n)$ where the input and output alphabets are finite and the channel law satisfies
$$ W^n(y|x) = \prod_{i=1}^n W(y_i | x_i), \qquad (x, y) \in \mathcal{X}^n \times \mathcal{Y}^n. \qquad (4.21) $$
Thus, the channel behaves in a stationary and memoryless manner. Shannon [141] found the maximum rate of reliable communication over a DMC and termed this rate the capacity $C(W)$ given in (4.1). In this section, we derive refinements of this fundamental limit of communication by characterizing the first three terms in the asymptotic expansion of $\log M^*(W^n, \varepsilon)$.

4.2.1

In the following, $C = C(W)$ denotes the capacity in (4.1) and
$$ \Pi := \big\{ P \in \mathcal{P}(\mathcal{X}) : I(P, W) = C \big\} \qquad (4.22) $$
is the set of capacity-achieving input distributions (CAIDs), respectively.1 The set of CAIDs is convex and compact in $\mathcal{P}(\mathcal{X})$. The unique [56, Cor. 2 to Thm. 4.5.2] capacity-achieving output distribution (CAOD) is denoted as $Q^*$ and $Q^* = PW$ for all $P \in \Pi$. Furthermore, it satisfies $Q^*(y) > 0$ for all $y \in \mathcal{Y}$ [56, Cor. 1 to Thm. 4.5.2], where we assume that all outputs are accessible.
Channel Dispersions
Recall from (2.29) that the variance of the log-likelihood ratio $\log\frac{P}{Q}$ under $P$ is known as the divergence variance, i.e.,
$$ V(P \| Q) := \sum_{x \in \mathcal{X}} P(x) \bigg( \log\frac{P(x)}{Q(x)} - D(P \| Q) \bigg)^2. \qquad (4.23) $$
We also define the conditional divergence variance $V(W \| Q | P) := \sum_x P(x)\,V\big( W(\cdot|x) \,\big\|\, Q \big)$ and the conditional information variance $V(P, W) := V\big( W \,\big\|\, PW \,\big|\, P \big)$. Define the unconditional information variance $U(P, W) := V\big( P \times W \,\big\|\, P \times PW \big)$. Note that
$$ V(P, W) = U(P, W) \qquad \text{for all } P \in \Pi \qquad (4.24) $$
[123, Lem. 62]. This is easy to verify because from [56, Thm. 4.5.1], we know that all $P \in \Pi$ (i.e., CAIDs) satisfy
$$ D\big( W(\cdot|x) \,\big\|\, PW \big) = C \qquad \forall\, x : P(x) > 0, \qquad (4.25) $$
$$ D\big( W(\cdot|x) \,\big\|\, PW \big) \le C \qquad \forall\, x : P(x) = 0. \qquad (4.26) $$
The channel dispersion [123, Def. 2] for $\varepsilon \in (0, 1) \setminus \{\frac{1}{2}\}$ is the following operational quantity:
$$ V_\varepsilon(W) := \liminf_{n \to \infty} \frac{1}{n} \bigg( \frac{\log M^*_{\mathrm{ave}}(W^n, \varepsilon) - nC(W)}{\Phi^{-1}(\varepsilon)} \bigg)^2. \qquad (4.27) $$
This operational quantity was shown [123, Eq. (223)] to be equal to2
$$ V_\varepsilon(W) := \begin{cases} V_{\min}(W) & \text{if } \varepsilon < \frac{1}{2} \\ V_{\max}(W) & \text{if } \varepsilon \ge \frac{1}{2}, \end{cases} \qquad (4.28) $$
where $V_{\min}(W)$ and $V_{\max}(W)$ denote the minimum and maximum of $V(P, W)$ over the CAIDs $P \in \Pi$.
A DMC $W$ is said to be singular if
$$ \frac{W(y|x')}{W(y|x)} \in \{0, 1, \infty\} \qquad (4.29) $$
for all $(x, x', y) \in \mathcal{X} \times \mathcal{X} \times \mathcal{Y}$. Intuitively, if a DMC is singular, checking feasibility is, in fact, optimum decoding. That is, given a codebook $\mathcal{C} := \{x(1), \ldots, x(M)\}$, we decide that $m \in \{1, \ldots, M\}$ is sent if, given the channel output $y$, it uniquely satisfies
$$ W^n\big( y \,\big|\, x(m) \big) = \prod_{i=1}^n W\big( y_i \,\big|\, x_i(m) \big) > 0. \qquad (4.30) $$
It is known [161] that if $W$ is singular, the capacity of $W$ equals its zero-undetected-error capacity.
Example 4.1. Consider the binary erasure channel $W$ with input alphabet $\mathcal{X} = \{0, 1\}$ and output alphabet $\mathcal{Y} = \{0, \mathrm{e}, 1\}$, where $\mathrm{e}$ is the erasure symbol. The channel transition probabilities of $W$ are given by
$$ W(y|0) = \begin{cases} 1 - \alpha_0 & y = 0 \\ \alpha_0 & y = \mathrm{e} \\ 0 & y = 1 \end{cases} \qquad\text{and}\qquad W(y|1) = \begin{cases} 0 & y = 0 \\ \alpha_1 & y = \mathrm{e} \\ 1 - \alpha_1 & y = 1. \end{cases} \qquad (4.31) $$
If $\alpha_0 = \alpha_1 = \alpha > 0$, then $W(\mathrm{e}|0)W(\mathrm{e}|1) > 0$ and $W(\mathrm{e}|0) = W(\mathrm{e}|1) = \alpha$, and so the channel is singular. If $\alpha_0 \ne \alpha_1$, the channel is non-singular.
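Condition (4.29) is easy to check mechanically. The following sketch (a hypothetical helper, not from the monograph) tests whether a channel matrix is singular — i.e., whenever two inputs can both produce an output $y$, they do so with equal probability — and reproduces the conclusion of Example 4.1:

```python
def is_singular(W, tol=1e-12):
    """Check condition (4.29): W(y|x')/W(y|x) in {0, 1, inf} for all x, x', y."""
    nx, ny = len(W), len(W[0])
    for x in range(nx):
        for xp in range(nx):
            for y in range(ny):
                if W[x][y] > 0 and W[xp][y] > 0 and abs(W[x][y] - W[xp][y]) > tol:
                    return False
    return True

def bec(a0, a1):
    # BEC of Example 4.1; output columns are 0, e, 1
    return [[1 - a0, a0, 0.0], [0.0, a1, 1 - a1]]

print(is_singular(bec(0.3, 0.3)))                      # True: equal erasure probs
print(is_singular(bec(0.2, 0.4)))                      # False
print(is_singular([[0.89, 0.11], [0.11, 0.89]]))       # False: BSC is non-singular
```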
Symmetry
We say a DMC is symmetric [56, pp. 94] if the channel outputs can be partitioned into subsets such that
within each subset, the matrix of transition probabilities satisfies the following: every row (resp. column) is
a permutation of every other row (resp. column).
4.2.2

Theorem 4.1. Let $W$ be a DMC with $V_\varepsilon = V_\varepsilon(W) > 0$. Then
$$ \log M^*_{\max}(W^n, \varepsilon) \ge nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + O(1). \qquad (4.32) $$
If, in addition, $W$ is non-singular, then
$$ \log M^*_{\mathrm{ave}}(W^n, \varepsilon) \ge nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \qquad (4.33) $$

Theorem 4.1 says that asymptotically, $\log M^*_{\max}(W^n, \varepsilon)$ is lower bounded by the Gaussian approximation $nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon)$ plus a constant term. In addition, under the non-singularity condition, one can say more, as shown in (4.33).
Bound                                          Third-Order Term
Feinstein + Const. Compo. (Thm. 4.2)           $-(|\mathcal{X}| - \frac{1}{2})\log n + O(1)$
Feinstein + i.i.d. (Rmk. 4.1)                  $-\frac{1}{2}\log n + O(1)$
Strengthened Feinstein + i.i.d. (Thm. 4.1)     $O(1)$
RCU + i.i.d. (Thm. 4.1)                        $+\frac{1}{2}\log n + O(1)$

Table 4.1: Comparison of the third-order terms achievable by using various achievability bounds (in Section 4.1.1) or requirements on the code (such as constant composition). The $+\frac{1}{2}\log n + O(1)$ that is achieved by evaluating the RCU bound holds only for the class of non-singular DMCs.
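To get a feel for the magnitudes in Table 4.1, the sketch below (illustrative, not from the monograph) evaluates the Gaussian approximation $\big(nC + \sqrt{nV}\,\Phi^{-1}(\varepsilon) + \text{third-order term}\big)/n$ in bits per channel use for a BSC(0.11) at $n = 1000$ and $\varepsilon = 10^{-3}$, with each of the four third-order terms (logarithms to base 2):

```python
import math

def Phi_inv(p):
    # standard normal quantile by bisection on math.erf
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rate_bits(n, eps, C, V, third):
    # (nC + sqrt(nV) Phi^{-1}(eps) + third-order term) / n
    return C + math.sqrt(V / n) * Phi_inv(eps) + third / n

d, eps, n = 0.11, 1e-3, 1000
C = 1 + d * math.log2(d) + (1 - d) * math.log2(1 - d)        # capacity, bits
V = d * (1 - d) * (math.log2((1 - d) / d)) ** 2              # dispersion, bits^2
X = 2                                                        # |X| for the BSC
rates = []
for name, third in [("RCU + i.i.d.", 0.5 * math.log2(n)),
                    ("Strengthened Feinstein + i.i.d.", 0.0),
                    ("Feinstein + i.i.d.", -0.5 * math.log2(n)),
                    ("Feinstein + const. compo.", -(X - 0.5) * math.log2(n))]:
    r = rate_bits(n, eps, C, V, third)
    rates.append(r)
    print(f"{name:32s} {r:.4f} bits/use")
```

The four approximations differ only in their $O(\log n / n)$ corrections, and all sit below capacity at this blocklength and error probability.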
Proof of (4.32). We specialize the strengthened version of Feinstein's result in Proposition 4.2. Choose $P_{X^n} = (P_X^*)^n$, the $n$-fold product of a CAID $P_X^*$, and abbreviate $V := V_\varepsilon(W)$. By the Berry-Esseen theorem, the first probability in (4.7) satisfies
$$ \Pr\bigg( \log\frac{W^n(Y^n|X^n)}{(P_X^* W)^n(Y^n)} \le \gamma \bigg) = \Pr\bigg( \sum_{i=1}^n \log\frac{W(Y_i|X_i)}{P_X^* W(Y_i)} \le \gamma \bigg) \qquad (4.34) $$
$$ \le \Phi\bigg( \frac{\gamma - nC}{\sqrt{nV}} \bigg) + \frac{6T}{\sqrt{nV^3}}, \qquad (4.35) $$
where $T$ is the third absolute moment of $\log\frac{W(Y|X)}{P_X^* W(Y)}$ and the variance is $U(P_X^*, W)$, which is equal to $V$ by (4.24). To bound the second probability in (4.7), we define, for each $x \in \mathcal{X}$,
$$ V_x := V\big( W(\cdot|x) \,\big\|\, P_X^* W \big), \quad\text{and} \qquad (4.36) $$
$$ T_x := \mathbb{E}\bigg[ \bigg| \log\frac{W(Y|x)}{P_X^* W(Y)} - D\big( W(\cdot|x) \,\big\|\, P_X^* W \big) \bigg|^3 \bigg]. \qquad (4.37) $$
Then, for any $x \in \mathcal{X}^n$, with $Y^n \sim (P_X^* W)^n$, a change of measure yields
$$ \Pr\bigg( \log\frac{W^n(Y^n|x)}{(P_X^* W)^n(Y^n)} > \gamma \bigg) = \mathbb{E}_{(P_X^* W)^n}\bigg[ \mathbb{1}\bigg\{ \log\frac{W^n(Y^n|x)}{(P_X^* W)^n(Y^n)} > \gamma \bigg\} \bigg] \qquad (4.38) $$
$$ = \mathbb{E}_{W^n(\cdot|x)}\bigg[ \exp\bigg( -\log\frac{W^n(Y^n|x)}{(P_X^* W)^n(Y^n)} \bigg) \mathbb{1}\bigg\{ \log\frac{W^n(Y^n|x)}{(P_X^* W)^n(Y^n)} > \gamma \bigg\} \bigg] \qquad (4.39) $$
$$ \le 2\bigg( \frac{\log 2}{\sqrt{2\pi}} + \frac{12\sum_{i=1}^n T_{x_i}}{\sum_{i=1}^n V_{x_i}} \bigg) \frac{\exp(-\gamma)}{\sqrt{\sum_{i=1}^n V_{x_i}}} \le \frac{\kappa\exp(-\gamma)}{\sqrt{n}}, \qquad (4.40) $$
where the finite constant $\kappa$ does not depend on $x$ or $n$. Now set
$$ \varepsilon_0 := \varepsilon - \frac{1}{\sqrt{n}}\bigg( \frac{6T}{\sqrt{V^3}} + \kappa \bigg), \qquad (4.41) $$
$$ \gamma := nC + \sqrt{nV}\,\Phi^{-1}(\varepsilon_0). \qquad (4.42) $$
Also choose $M = \lfloor \exp(\gamma) \rfloor$. Substituting these choices into the above bounds completes the proof of (4.32).
Remark 4.1. We remark that if we use Feinstein's theorem in Proposition 4.1 (instead of its strengthened version in Proposition 4.2), and the codebook is generated in an i.i.d. manner according to $(P_X^*)^n$, the third-order term would be $-\frac{1}{2}\log n + O(1)$. Indeed, let $\eta$ in Feinstein's theorem be $\frac{1}{\sqrt{n}}$. Then, the $(\varepsilon - \eta)$-information spectrum divergence can be expanded as
$$ D_s^{\varepsilon - \eta}\Big( (P_X^*)^n \times W^n \,\Big\|\, (P_X^*)^n \times (P_X^* W)^n \Big) = nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + O(1). \qquad (4.43) $$
This follows from the asymptotic expansion of $D_s^\varepsilon$ (Corollary 2.1) and the fact that $U(P_X^*, W) = V(P_X^*, W) = V_\varepsilon(W)$, similarly to (4.35). Coupled with the fact that $\log\frac{1}{\eta} = \frac{1}{2}\log n$, we see that the third-order term is (at least) $-\frac{1}{2}\log n + O(1)$.
In some applications, the codewords are required to satisfy an additive cost constraint of the form
$$ \frac{1}{n}\sum_{i=1}^n b\big( x_i(m) \big) \le \Gamma \qquad (4.44) $$
for some cost function $b : \mathcal{X} \to [0, \infty)$ and some cost constraint $\Gamma > 0$. In this case, if the type $P \in \mathcal{P}_n(\mathcal{X})$ of each codeword $x(m)$ is the same for all $m$ and it satisfies
$$ \mathbb{E}_P[b(X)] \le \Gamma, \qquad (4.45) $$
then the cost constraint in (4.44) is satisfied. This class of codes is called constant composition codes of type $P$. The Gaussian approximation can be achieved using constant composition codes. Constant composition coding was used by Hayashi for the DMC with additive cost constraints [76, Thm. 3]. He then used this result to prove the second-order asymptotics for the AWGN channel [76, Thm. 5] by discretizing the real line increasingly finely as the blocklength grows. It is more difficult to prove conclusive results on the third-order terms using a constant composition ensemble [98]; nonetheless, it is instructive to understand the technique.
Theorem 4.2. For every DMC $W$ with $V_\varepsilon(W) > 0$,
$$ \log M^*_{\max,\mathrm{cc}}(W^n, \varepsilon) \ge nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) - \Big( |\mathcal{X}| - \frac{1}{2} \Big)\log n + O(1), \qquad (4.46) $$
where $M^*_{\max,\mathrm{cc}}$ denotes the analogue of $M^*_{\max}$ restricted to constant composition codes.
Proof sketch of Theorem 4.2. We use Feinstein's theorem (Proposition 4.1). Choose a type $P \in \mathcal{P}_n(\mathcal{X})$ that satisfies
$$ \max_{x \in \mathcal{X}} \big| P(x) - P_X^*(x) \big| \le \frac{1}{n}. \qquad (4.47) $$
Then consider the input distribution in Feinstein's theorem to be $P_{X^n}(x)$, the uniform distribution over $T_P$, i.e.,
$$ P_{X^n}(x) = \frac{\mathbb{1}\{x \in T_P\}}{|T_P|}. \qquad (4.48) $$
Clearly, such a code is constant composition. Now we claim that
$$ P_{X^n} W^n(y) \le |\mathcal{P}_n(\mathcal{X})|\,(PW)^n(y) \qquad \text{for all } y \in \mathcal{Y}^n. \qquad (4.49) $$
Indeed, for $x \in T_P$,
$$ P_{X^n}(x) = \frac{1}{|T_P|} \le |\mathcal{P}_n(\mathcal{X})| \exp\big( -nH(P) \big) = |\mathcal{P}_n(\mathcal{X})|\,P^n(x), \qquad (4.50) $$
where the inequality follows from Lemma 1.2 and the final equality from Lemma 1.3. For $x \notin T_P$, (4.50) also holds as $P_{X^n}(x) = 0$. Multiplying (4.50) by $W^n(y|x)$ and summing over all $x$ yields (4.49). Let $\bar{x}$ be an arbitrary sequence in $T_P$, i.e., $\bar{x}$ is a sequence with type $P$. The $(\varepsilon - \eta)$-information spectrum divergence in Feinstein's theorem can be bounded as
$$ D_s^{\varepsilon - \eta}\big( P_{X^n} \times W^n \,\big\|\, P_{X^n} \times P_{X^n} W^n \big) = D_s^{\varepsilon - \eta}\big( W^n(\cdot|\bar{x}) \,\big\|\, P_{X^n} W^n \big) \qquad (4.51) $$
$$ \ge D_s^{\varepsilon - \eta}\big( W^n(\cdot|\bar{x}) \,\big\|\, (PW)^n \big) - \log|\mathcal{P}_n(\mathcal{X})| \qquad (4.52) $$
$$ \ge nI(P, W) + \sqrt{nV(P, W)}\,\Phi^{-1}\bigg( \varepsilon - \eta - \frac{6\,T(P, W)}{\sqrt{nV(P, W)^3}} \bigg) - \log|\mathcal{P}_n(\mathcal{X})|, \qquad (4.53) $$
where (4.51) follows from permutation invariance within a type class, and the change of output measure step in (4.52) uses the bound in (4.49) as well as the consequence of the sifting property of $D_s^\varepsilon$ in (2.11). Inequality (4.53) uses the lower bound in the Berry-Esseen bound on $D_s^\varepsilon$ in Proposition 2.1 with $T(P, W) := \sum_x P(x)\,T\big( W(\cdot|x) \,\big\|\, PW \big)$. Choose $\eta$ in Feinstein's theorem to be $\frac{1}{\sqrt{n}}$. In view of (4.47), the following continuity properties hold for some constants $c_1, c_2 > 0$:
$$ I(P, W) \ge I(P_X^*, W) - c_1 n^{-2}, \quad\text{and} \qquad (4.54) $$
$$ \sqrt{V(P, W)} \ge \sqrt{V(P_X^*, W)} - c_2 n^{-1}. \qquad (4.55) $$
The latter follows from the Lipschitzness of $P \mapsto \sqrt{V(P, W)}$ near $P_X^*$, which holds because $V_\varepsilon(W) > 0$. Combining these bounds with the type counting lemma in (1.27) and a Taylor expansion of $\Phi^{-1}(\cdot)$ in (4.53) concludes the proof.
We remark that if there are additive cost constraints on the codewords, the above proof goes through almost unchanged. The leading term in the asymptotic expansion in (4.46) would, of course, be the capacity-cost function [49, Sec. 3.3]. The analogues of $V_{\min}(W)$ and $V_{\max}(W)$ that define the dispersion (cf. (4.28)) would involve the maximum and minimum over the set of input distributions $P$ satisfying $\mathbb{E}_P[b(X)] \le \Gamma$. The third-order term remains unchanged. For more details, the reader is referred to [98].

In fact, the Gaussian approximation can be achieved with constant composition codes that are also partially universal. The only statistics of the DMC we need to know are the capacity and the dispersion. The idea is to compare the empirical mutual information $\hat{I}(x(m) \wedge y)$ of a codeword and the channel output to a threshold (that depends on the capacity and dispersion), similar to maximum mutual information decoding [38, 62]. This technique was delineated in the proof of Theorem 3.2 for lossless source coding. Essentially, in channel coding, it uses the fact that if $X^n$ is uniform over the type class $T_P$ and $Y^n$ is the corresponding channel output, the empirical mutual information $\hat{I}(X^n \wedge Y^n)$ satisfies the central limit relation
$$ \sqrt{n}\Big( \hat{I}(X^n \wedge Y^n) - I(P, W) \Big) \xrightarrow{d} \mathcal{N}\big( 0, V(P, W) \big). \qquad (4.56) $$

4.2.3
Theorem 4.3. For every DMC $W$ with $V_\varepsilon(W) > 0$,
$$ \log M^*_{\mathrm{ave}}(W^n, \varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \qquad (4.57) $$
If, in addition, $W$ is singular, then
$$ \log M^*_{\mathrm{ave}}(W^n, \varepsilon) \le nC + \sqrt{nV_\varepsilon}\,\Phi^{-1}(\varepsilon) + O(1). \qquad (4.58) $$
The claim in (4.57) is due to Tomamichel-Tan [164], and was proved concurrently by Moulin [113], while (4.58) is due to Altuğ-Wagner [12]. The case $V_\varepsilon(W) = 0$ was also treated in Tomamichel-Tan [164], but we focus on channels with $V_\varepsilon(W) > 0$. See [164, Fig. 1] for a summary of the best known upper bounds on $\log M^*_{\mathrm{ave}}(W^n, \varepsilon)$ for all classes of DMCs (regardless of the positivity of $V_\varepsilon(W)$).
Proof sketch of (4.57). Applying Proposition 4.4 to the channel $W^n$ yields
$$ \log M^*_{\mathrm{ave}}(W^n, \varepsilon) \le \min_{Q^{(n)}} \max_{x \in \mathcal{X}^n} D_s^{\varepsilon + \delta}\big( W^n(\cdot|x) \,\big\|\, Q^{(n)} \big) + \log\frac{1}{\delta}. \qquad (4.59) $$
In the following, we choose $\delta = \frac{1}{\sqrt{n}}$ and a suitable choice of $Q^{(n)} \in \mathcal{P}(\mathcal{Y}^n)$ to further upper bound the above. Symmetry considerations (see, e.g., [121, Sec. V]) allow us to restrict the search to distributions that are invariant under permutations of the $n$ channel uses.
Let $\xi > 0$ be a (sufficiently small) constant. Consider the following convex combination of product distributions:
$$ Q^{(n)}(y) := \frac{1}{2} \sum_{k \in \mathcal{K}} \frac{\exp\big( -\xi\|k\|_2^2 \big)}{F} \prod_{i=1}^n Q_k(y_i) + \frac{1}{2} \sum_{P_x \in \mathcal{P}_n(\mathcal{X})} \frac{1}{|\mathcal{P}_n(\mathcal{X})|} \prod_{i=1}^n P_x W(y_i), \qquad (4.60) $$
where $F$ is a normalizing constant chosen so that $\sum_{y \in \mathcal{Y}^n} Q^{(n)}(y) = 1$,
$$ Q_k(y) := Q^*(y) + \frac{k_y}{2\sqrt{n}}, \qquad (4.61) $$
and the index set $\mathcal{K}$ is defined as
$$ \mathcal{K} := \bigg\{ k = \{k_y\}_{y \in \mathcal{Y}} \in \mathbb{Z}^{|\mathcal{Y}|} \,:\, \sum_{y \in \mathcal{Y}} k_y = 0, \;\; k_y \ge -2\sqrt{n}\,Q^*(y) \bigg\}. \qquad (4.62) $$
See Fig. 4.2. The convex combination of output distributions induced by input types $(P_x W)^n$ and the optimal output distribution $(Q^*)^n$ (corresponding to $k = 0$) in $Q^{(n)}$ is inspired partly by Hayashi [76, Thm. 2]. What we have done in our choice of $Q_k$ is to uniformly quantize the simplex $\mathcal{P}(\mathcal{Y})$ along axis-parallel directions to form a net. The constraint that each $k$ belongs to $\mathcal{K}$ ensures that each $Q_k$ is a valid probability mass
[Figure 4.2 appears here.]

Figure 4.2: Illustration of the choice of $\{Q_k\}_{k \in \mathcal{K}}$ for $\mathcal{Y} = \{0, 1\}$. Note that all probability distributions lie on the line $Q(0) + Q(1) = 1$, adjacent net points are $\frac{1}{2\sqrt{n}}$ apart, and each element of the net is denoted by $Q_k$, where $k$ denotes some vector with integer elements.
function. It can be shown that $F < \infty$. Furthermore, one can verify that for any $Q \in \mathcal{P}(\mathcal{Y})$, there exists a $k \in \mathcal{K}$ such that
$$ \| Q - Q_k \|_2 \le \frac{1}{\sqrt{n}}, \qquad (4.63) $$
so the net we have constructed is $\frac{1}{\sqrt{n}}$-dense in $\mathcal{P}(\mathcal{Y})$.
Let us provide some intuition for the choice of $Q^{(n)}$. The first part of the convex combination is used to approximate output distributions induced by input types that are close to the set of CAIDs $\Pi$. We choose a weight for each element of the net that drops exponentially with the distance from the unique CAOD. This ensures that the normalization $F$ does not depend on $n$ even though the number of elements in the net increases with $n$. The smaller weights for types far from the CAIDs will later be compensated by the larger deviation of the corresponding mutual information from the capacity. This is achieved by the second part of the convex combination, which we use to match the input types far from the CAIDs. This partition of input types into those that are close to and far from $\Pi$ was also used by Strassen [152] in his proof of the second-order asymptotics for DMCs.
Now we just have to evaluate $D_s^{\varepsilon + \delta}\big( W^n(\cdot|x) \,\big\|\, Q^{(n)} \big)$ for all $x \in \mathcal{X}^n$. The idea is to partition input sequences depending on their distance from the set of CAIDs. For this, define
$$ \Gamma := \Big\{ P \in \mathcal{P}(\mathcal{X}) : \min_{P^* \in \Pi} \| P - P^* \|_2 \le \tau \Big\} \qquad (4.64) $$
for an appropriately chosen radius $\tau > 0$. For a sequence $x$ whose type $P_x$ lies outside $\Gamma$, the second part of the convex combination in (4.60) together with the Berry-Esseen bound on $D_s^\varepsilon$ yields
$$ D_s^{\varepsilon + \delta}\big( W^n(\cdot|x) \,\big\|\, Q^{(n)} \big) \le nI(P_x, W) + \sqrt{nV(P_x, W)}\,\Phi^{-1}(\varepsilon + \delta) + \log\big( 2|\mathcal{P}_n(\mathcal{X})| \big) + O(1). \qquad (4.66) $$
Since $I(P_x, W) \le C' < C$ (i.e., the first-order mutual information term is strictly bounded away from capacity), $V(P_x, W)$ is uniformly bounded [67, Rmk. 3.1.1] and the number of types is polynomial, the right-hand side of the preceding inequality is upper bounded by $nC' + O(\sqrt{n})$. This is smaller than the Gaussian approximation for all sufficiently large $n$ as $C' < C$.
Now, for sequences with types in $\Gamma$, we pick $Q_{k(x)}$ from the net that is closest to $P_x W$. Per Lemma 2.2, this gives
$$ D_s^{\varepsilon + \delta}\big( W^n(\cdot|x) \,\big\|\, Q^{(n)} \big) \le D_s^{\varepsilon + \delta}\big( W^n(\cdot|x) \,\big\|\, Q_{k(x)}^n \big) + \xi\|k(x)\|_2^2 + \log 2F. \qquad (4.67) $$
By the Berry-Esseen-type bound in Proposition 2.1, we have
$$ D_s^{\varepsilon + \delta}\big( W^n(\cdot|x) \,\big\|\, Q^{(n)} \big) \le nD\big( W \,\big\|\, Q_{k(x)} \,\big|\, P_x \big) + \sqrt{nV\big( W \,\big\|\, Q_{k(x)} \,\big|\, P_x \big)}\,\Phi^{-1}\Big( \varepsilon + \delta + \frac{\beta}{\sqrt{n}} \Big) + \xi\|k(x)\|_2^2 + \log 2F \qquad (4.68) $$
for some finite $\beta > 0$. By the $\frac{1}{\sqrt{n}}$-denseness of the net, the positivity of the CAOD, and the bound $D(Q' \| Q) \le \| Q' - Q \|_2^2 / \min_z Q(z)$ [40, Lem. 6.3], we can show that there exists a constant $q > 0$ such that
$$ D\big( W \,\big\|\, Q_{k(x)} \,\big|\, P_x \big) \le I(P_x, W) + \frac{q}{n}. \qquad (4.69) $$
Furthermore, by the Lipschitzness of $Q \mapsto \sqrt{V(W \| Q | P)}$, which follows from the fact that $Q^*(y) > 0$ for all $y \in \mathcal{Y}$, we have
$$ \sqrt{nV\big( W \,\big\|\, Q_{k(x)} \,\big|\, P_x \big)} \le \sqrt{nV(P_x, W)} + \lambda\sqrt{n}\,\big\| P_x W - Q_{k(x)} \big\|_2 \qquad (4.70) $$
for some constant $\lambda > 0$. It is known from Strassen's work [152, Eq. (4.41)] and continuity considerations that for all $P_x \in \Gamma$,
$$ I(P_x, W) \le C - a\theta^2 \quad\text{and}\quad \Big| \sqrt{V(P_x, W)} - \sqrt{V(P^*, W)} \Big| \le b\,\theta \qquad (4.71) $$
for some constants $a, b > 0$, where $P^*$ is the closest element in $\Pi$ to $P_x$ and $\theta$ is the corresponding Euclidean distance. Let $\|W\|_2$ be the spectral norm of $W$ (viewed as a matrix). By the construction of the net,
$$ \|k(x)\|_2 = 2\sqrt{n}\,\big\| Q_{k(x)} - Q^* \big\|_2 \qquad (4.72) $$
$$ \le 2\sqrt{n}\Big( \big\| Q_{k(x)} - P_x W \big\|_2 + \big\| P_x W - Q^* \big\|_2 \Big) \qquad (4.73) $$
$$ \le 2\sqrt{n}\Big( \frac{1}{\sqrt{n}} + \|W\|_2\,\theta \Big). \qquad (4.74) $$
Uniting (4.68), (4.69), (4.70) and (4.74) and using some simple algebra (with $\xi$ chosen sufficiently small) completes the proof.
As can be seen from the above proof, the net serves to approximate all possible output distributions so that, together with standard continuity arguments concerning information quantities, the remainder terms resulting from (4.69), (4.70) and (4.74) are all $O(1)$.

If we had chosen the more natural output distribution
$$ \tilde{Q}^{(n)}(y) = \sum_{P_x \in \mathcal{P}_n(\mathcal{X})} \frac{1}{|\mathcal{P}_n(\mathcal{X})|} \prod_{i=1}^n P_x W(y_i) \qquad (4.75) $$
in place of $Q^{(n)}$ in (4.60), an application of Lemma 2.2, the type counting lemma in (1.27), and continuity arguments shows that the third-order term would be $\big( |\mathcal{X}| - \frac{1}{2} \big)\log n + O(1)$. This upper bound on the third-order term was shown in the works by Strassen [152, Thm. 1.2] and Polyanskiy-Poor-Verdú [123, Eq. (279)]. The choice of output distribution in (4.75) is essentially due to Hayashi [76].
4.3
In this section, we consider discrete-time additive white Gaussian noise (AWGN) channels in which
$$ Y_i = X_i + Z_i \qquad (4.76) $$
for each time $i = 1, \ldots, n$. The noise $\{Z_i\}_{i=1}^n$ is a memoryless, stationary Gaussian process with zero mean and unit variance, so the channel can be expressed as
$$ W(y|x) = \mathcal{N}(y; x, 1) = \frac{1}{\sqrt{2\pi}}\,e^{-(y - x)^2/2}. \qquad (4.77) $$
This is perhaps the most important and well-studied channel in communication systems. In the case of Gaussian channels, we must impose a cost constraint on the codewords, namely for every $m$,
$$ \| f(m) \|_2^2 = \sum_{i=1}^n f_i(m)^2 \le n\,\mathsf{snr}, \qquad (4.78) $$
where $n$ is the blocklength, $\mathsf{snr}$ is the admissible power and $f_i(m)$ is the $i$-th coordinate of the $m$-th codeword. The capacity and the dispersion of the AWGN channel are
$$ \mathsf{C}(\mathsf{snr}) = \frac{1}{2}\log(1 + \mathsf{snr}), \qquad \mathsf{V}(\mathsf{snr}) = \frac{\mathsf{snr}(\mathsf{snr} + 2)}{2(\mathsf{snr} + 1)^2}, \qquad (4.79) $$
respectively.
The direct part of the following theorem was proved in Tan-Tomamichel [159] and the converse in Polyanskiy-Poor-Verdú [123, Thm. 54]. The second-order asymptotics (ignoring the third-order term) was proved concurrently with [123] by Hayashi [76, Thm. 5]. Hayashi showed the direct part using the second-order asymptotics for DMCs with cost constraints (similar to Theorem 4.2) and a quantization argument (also see [153]). The converse part was shown using the Hayashi-Nagaoka converse bound in (4.11) with the output distribution chosen to be the product CAOD.

Theorem 4.4. For every $\mathsf{snr} \in (0, \infty)$,
$$ \log M^*_{\mathrm{ave}}(W^n, \mathsf{snr}, \varepsilon) = n\mathsf{C}(\mathsf{snr}) + \sqrt{n\mathsf{V}(\mathsf{snr})}\,\Phi^{-1}(\varepsilon) + \frac{1}{2}\log n + O(1). \qquad (4.80) $$

For the AWGN channel, we see that the asymptotic expansion is known exactly up to the third order under the average error setting. The converse proof (upper bound of (4.80)) is simple and uses a specialization of Proposition 4.4 with the product CAOD.
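The capacity and dispersion in (4.79) (here in nats and nats², respectively) are easily checked by Monte Carlo: for a codeword on the power sphere, the per-letter information density $\log\big[W(Y|x)/Q_Y^*(Y)\big]$ with $Q_Y^* = \mathcal{N}(0, 1 + \mathsf{snr})$ has mean $\mathsf{C}(\mathsf{snr})$ and variance $\mathsf{V}(\mathsf{snr})$. The sketch below is illustrative and, for simplicity, assumes all letters of the codeword have magnitude $\sqrt{\mathsf{snr}}$:

```python
import math, random

def C_awgn(snr):
    # capacity in nats, cf. (4.79)
    return 0.5 * math.log(1 + snr)

def V_awgn(snr):
    # dispersion in nats^2, cf. (4.79)
    return snr * (snr + 2) / (2 * (snr + 1) ** 2)

random.seed(2)
snr, N = 1.0, 200_000
x = math.sqrt(snr)                      # per-letter power exactly snr
vals = []
for _ in range(N):
    z = random.gauss(0.0, 1.0)
    y = x + z
    # log [ N(y; x, 1) / N(y; 0, 1+snr) ], natural logs
    vals.append(0.5 * math.log(1 + snr) + 0.5 * (y * y / (1 + snr) - z * z))
m = sum(vals) / N
v = sum((s - m) ** 2 for s in vals) / N
print(m, C_awgn(snr))   # both ≈ 0.347 nats
print(v, V_awgn(snr))   # both ≈ 0.375 nats^2
```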
The achievability proof is, however, more involved and uses the RCU bound and Laplace's technique for approximating high-dimensional integrals [150, 162]. The main step establishes that if X^n is uniform on the power sphere {x : ‖x‖₂² = n·snr}, one has

    Pr( ⟨X^n, Y^n⟩ ∈ [b, b + Δ] | Y^n = y ) ≤ κΔ/√n,    (4.81)

where κ does not depend on b ∈ ℝ and on typical y, i.e., y such that ‖y‖₂² ≈ n(snr + 1). The estimate in (4.81) is not obvious as the inner product ⟨X^n, Y^n⟩ is not a sum of independent random variables, and so standard limit theorems (like those in Section 1.5) cannot be employed directly. The division by √n gives us the (1/2) log n beyond the Gaussian approximation.
If one is content with just the Gaussian approximation with an O(1) third-order term, one can evaluate the so-called κβ bound [123, Thm. 25]. See [123, Thm. 54] for the justification. The reader is also referred
to MolavianJazi–Laneman [112] for an elegant proof strategy using the central limit theorem for functions (Theorem 1.5) to prove the achievability part of Theorem 4.4 under the average error setting with an O(1) third-order term.
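As an illustration of how the normal approximation in (4.80) is used in practice, the following sketch evaluates nC(snr) + √(nV(snr)) Φ^{-1}(ε) + (1/2) log n for hypothetical parameters, inverting Φ by bisection via the standard-library error function (the function names are ours, not from the cited references):

```python
import math

def Phi(x):
    # Standard Gaussian cdf via math.erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def Phi_inv(eps, lo=-10.0, hi=10.0):
    # Inverse cdf by bisection; adequate for illustration
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def normal_approx_log_M(n, snr, eps):
    # Right-hand side of (4.80) without the O(1) term, in nats
    C = 0.5 * math.log(1 + snr)
    V = snr * (snr + 2) / (2 * (snr + 1) ** 2)
    return n * C + math.sqrt(n * V) * Phi_inv(eps) + 0.5 * math.log(n)

# hypothetical operating point: n = 2000, snr = 1, eps = 1e-3
print(normal_approx_log_M(2000, 1.0, 1e-3) / 2000)  # noticeably below C(1) ≈ 0.3466
```

For ε < 1/2 the second-order term is negative, so the achievable rate at finite blocklength sits strictly below capacity.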
It remains an open question whether (1/2) log n + O(1) is achievable under the maximum error probability setting.
We now sketch the proof of the converse (upper bound) in (4.80). First note that

    M*_ave(W^n, snr, ε) ≤ M*_ave,eq(W^{n+1}, snr, ε),    (4.82)

where M*_ave,eq(W^n, snr, ε) is similar to M*_ave(W^n, snr, ε), except that the codewords must satisfy the cost constraints with equality, i.e., ‖f(m)‖₂² = ‖x(m)‖₂² = n·snr. Since increasing the blocklength by 1 does not affect the asymptotic expansion up to the third order, it suffices to upper bound M*_ave,eq(W^n, snr, ε). By a specialization of Proposition 4.4,

    log M*_ave,eq(W^n, snr, ε) ≤ inf_{Q^(n)} sup_{‖x‖₂² = n·snr} D_s^{ε+η}( W^n(·|x) ‖ Q^(n) ) + log(1/η).    (4.83)

Take η = 1/√n so the final log term gives (1/2) log n. It remains to show that the (ε + η)-information spectrum divergence term is upper bounded by the Gaussian approximation plus at most a constant term.
For this purpose, we have to choose the output distribution Q^(n) ∈ P(ℝ^n). This choice is easy compared to the DMC case. We choose

    Q^(n)(y) = Π_{i=1}^n Q*_Y(y_i),  where  Q*_Y(y) = N(y; 0, snr + 1).    (4.84)
One can then check that for every x ∈ ℝ^n such that ‖x‖₂² = n·snr,

    E[ (1/n) Σ_{i=1}^n log( W(Y_i|x_i) / Q*_Y(Y_i) ) ] = C(snr),  and    (4.85)

    Var[ (1/n) Σ_{i=1}^n log( W(Y_i|x_i) / Q*_Y(Y_i) ) ] = V(snr)/n.    (4.86)
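The per-coordinate identity underlying (4.85) can be checked by direct numerical integration: for a coordinate x_i with x_i² = snr, one has E[log(W(Y_i|x_i)/Q*_Y(Y_i))] = C(snr). A minimal sketch using a midpoint Riemann sum (function names are ours):

```python
import math

def log_density_ratio_mean(x, snr, lo=-12.0, hi=12.0, steps=200000):
    # E_{Y ~ N(x,1)}[ log( N(Y; x, 1) / N(Y; 0, 1+snr) ) ] by a midpoint Riemann sum
    total, h = 0.0, (hi - lo) / steps
    for k in range(steps):
        y = lo + (k + 0.5) * h
        w = math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)  # density N(y; x, 1)
        # log N(y; x, 1) - log N(y; 0, 1+snr); the (1/2) log 2*pi terms cancel
        logratio = -0.5 * (y - x) ** 2 + 0.5 * y * y / (1 + snr) + 0.5 * math.log(1 + snr)
        total += w * logratio * h
    return total

snr = 1.0
x = math.sqrt(snr)  # a coordinate of a codeword on the power sphere
print(log_density_ratio_mean(x, snr), 0.5 * math.log(1 + snr))  # both ≈ 0.3466
```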
Then, by the Berry–Esseen-type bound in Proposition 2.1, we have

    D_s^{ε+η}( W^n(·|x) ‖ Q^(n) ) ≤ nC(snr) + √(nV(snr)) Φ^{-1}( ε + η + 6T/√(nV(snr)³) ),    (4.87)

where T < ∞ is related to the third absolute moments of log( W(Y|x_i)/Q*_Y(Y) ). A Taylor expansion of Φ^{-1}(·) concludes the proof of the converse.
Since the proof of the direct part is long, we only highlight some key ideas in the following steps. Details
can be found in [159].
Step 1: (Random coding distribution) Consider the following input distribution to be applied to the RCU bound:

    P_{X^n}(x) = δ{ ‖x‖₂² = n·snr } / A_n(√(n·snr)),    (4.88)

where δ{·} is the Dirac delta and A_n(r) = (2π^{n/2}/Γ(n/2)) r^{n−1} is the surface area of a sphere of radius r in ℝ^n. The power constraints are automatically satisfied with probability one. Let
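A standard way to realize the distribution in (4.88) in simulations is to normalize an i.i.d. Gaussian vector onto the sphere of radius √(n·snr); the sketch below (with hypothetical parameters) illustrates this:

```python
import math, random

def sample_power_sphere(n, snr, rng):
    # Uniform on {x : ||x||_2^2 = n*snr}: direction of an i.i.d. Gaussian vector
    # is uniform on the unit sphere, so rescale it to the target radius.
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    r = math.sqrt(n * snr)
    return [r * v / norm for v in g]

rng = random.Random(0)
x = sample_power_sphere(100, 1.0, rng)
print(sum(v * v for v in x))  # exactly n*snr = 100 up to floating-point rounding
```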
    q(x, y) := log( W^n(y|x) / (P_{X^n} W^n)(y) ),    (4.89)

where (X̄^n, X^n, Y^n) ∼ P_{X^n}(x̄) P_{X^n}(x) W^n(y|x). Let

    g(t, y) := Pr( q(X̄^n, Y^n) ≥ t | Y^n = y ).    (4.90)

Step 2: (Use of the RCU bound) In terms of these quantities, the RCU bound guarantees a code with M codewords whose average error probability ε′ satisfies

    ε′ ≤ E[ min{ 1, (M − 1) g( q(X^n, Y^n), Y^n ) } ].    (4.91)
Step 3: (A high-probability set) Now, we define a set of channel outputs with high probability:

    T := { y : (1/n)‖y‖₂² ∈ [snr + 1 − δ_n, snr + 1 + δ_n] }.    (4.95)

Steps 4 and 5: (Key estimate) For y ∈ T, define p(a, Δ | y) := Pr( q(X^n, Y^n) ∈ [a, a + Δ] | Y^n = y ). Since, for fixed norms, q(x, y) is monotone in the inner product ⟨x, y⟩, the estimate (4.81) yields

    p(a, Δ | y) ≤ κΔ/√n,    (4.98)

where κ > 0 is a constant. Then we apply Lemma 4.1 to each segment. This is modeled on the proof of
Theorem 1.3. Carrying out the calculations, we have

    g(t, y) ≤ Σ_{l=0}^∞ exp(−t − lΔ) p(t + lΔ, Δ | y)    (4.99)

         ≤ Σ_{l=0}^∞ exp(−t − lΔ) · κΔ/√n    (4.100)

         = ( κΔ / (1 − exp(−Δ)) ) · exp(−t)/√n    (4.101)

         =: ξ exp(−t)/√n.    (4.102)
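The geometric series evaluated in (4.101) can be checked numerically; the sketch below compares a partial sum of Σ_l κΔ e^{−lΔ} against the closed form κΔ/(1 − e^{−Δ}) for hypothetical values of κ and Δ:

```python
import math

def xi_constant(kappa, Delta, terms=1000):
    # Partial sum of sum_{l>=0} kappa*Delta*exp(-l*Delta)
    # versus the closed form kappa*Delta/(1 - exp(-Delta))
    partial = sum(kappa * Delta * math.exp(-l * Delta) for l in range(terms))
    closed = kappa * Delta / (1.0 - math.exp(-Delta))
    return partial, closed

print(xi_constant(2.0, 0.5))  # the two values agree
```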
Step 6: (Evaluation of RCU) We now have all the necessary ingredients to evaluate the RCU bound in (4.91). Consider,

    ε′ ≤ E[ min{ 1, M g( q(X^n, Y^n), Y^n ) } ]    (4.103)

      ≤ Pr(Y^n ∈ T^c) + E[ min{ 1, M g( q(X^n, Y^n), Y^n ) } | Y^n ∈ T ] · Pr(Y^n ∈ T).    (4.104)

The first term is O(1/√n) and the second can be bounded above by

    E[ min{ 1, Mξ exp(−q(X^n, Y^n))/√n } | Y^n ∈ T ] · Pr(Y^n ∈ T)    (4.105)

due to (4.102) with t = q(X^n, Y^n). We split the expectation into two parts depending on whether q(x, y) exceeds log(Mξ/√n):

    E[ min{ 1, Mξ exp(−q(X^n, Y^n))/√n } | Y^n ∈ T ]
      ≤ Pr( q(X^n, Y^n) ≤ log(Mξ/√n) | Y^n ∈ T )    (4.106)
        + E[ 1{ q(X^n, Y^n) > log(Mξ/√n) } · (Mξ/√n) exp(−q(X^n, Y^n)) | Y^n ∈ T ].    (4.107)

By a segmentation argument similar to the one leading to (4.102), with t = log(Mξ/√n), the second term can be bounded above by a constant multiple of 1/√n.
Now let Q*_Y(y) = N(y; 0, snr + 1) be the CAOD and Q_{Y^n}(y) = Π_{i=1}^n Q*_Y(y_i) its n-fold memoryless extension. In Step 1 of the proof of Lem. 61 in [123], Polyanskiy–Poor–Verdú showed that there exists a finite constant κ̃ > 0 such that

    sup_{y ∈ T} ( P_{X^n} W^n(y) / Q_{Y^n}(y) ) ≤ κ̃.    (4.108)

Thus, the first probability in (4.107) multiplied by Pr(Y^n ∈ T) can be upper bounded using (4.108), the Berry–Esseen theorem and the statistics in (4.85)–(4.86) by

    Φ( ( log(Mξ/√n) − nC(snr) ) / √(nV(snr)) ) + λ/√n,    (4.109)

where λ is a finite positive constant that depends only on snr.
    Channel                    | Third-Order Term   | Prefactor ϱ_n
    Nonsingular DMCs           | (1/2) log n + O(1) | Θ( n^{−(1+|E′(R)|)/2} )
    Singular, symmetric DMCs   | O(1)               | Θ( n^{−1/2} )
    AWGN                       | (1/2) log n + O(1) | Θ( n^{−(1+|E′(R)|)/2} )

Table 4.2: Comparison between the third-order term in the normal approximation and prefactors in the error exponents regime ϱ_n for various classes of channels. The reliability function [39, 56, 74] is denoted as E(R) and its derivative (if it exists) is E′(R). For the first row of the table, symmetry is not required for the third-order term to be equal to (1/2) log n + O(1) (cf. (4.33) and (4.57)).
Putting all the bounds together, we obtain

    ε′ ≤ Φ( ( log(Mξ/√n) − nC(snr) ) / √(nV(snr)) ) + c/√n,    (4.110)

where the finite constant c > 0 collects the 1/√n residuals above. Now choose M such that

    log M = nC(snr) + √(nV(snr)) Φ^{-1}(ε_n) + (1/2) log n − log(ξ),    (4.111)

where ε_n := ε − c/√n. This choice ensures that ε′ ≤ ε. By a Taylor expansion of Φ^{-1}(·), this completes the proof of the lower bound in (4.80).
4.4 Third-Order Terms versus Error Exponent Prefactors
We conclude our discussion on fixed error asymptotics for channel coding with a final remark. We have seen from Theorems 4.1 and 4.3 that the third-order term in the normal approximation for DMCs is given by (1/2) log n + O(1) (resp. O(1)) for nonsingular channels (resp. singular, symmetric channels). We have also seen from Theorem 4.4 that the third-order term for AWGN channels is (1/2) log n + O(1). These results are summarized in Table 4.2.
In another line of study, Altuğ–Wagner [10, 11] and Scarlett–Martinez–Guillén i Fàbregas [135] derived prefactors in the error exponents regime for DMCs. In a nutshell, the authors were concerned with finding a sequence ϱ_n such that, for high rates (i.e., rates above the critical rate),³

    ε*(W^n, ⌊exp(nR)⌋) ≍ ϱ_n exp(−nE(R)),    (4.112)

where ε*(W^n, M) is the smallest average error probability of a code for the channel W^n with M codewords, and E(R) is the reliability function (or error exponent) of the channel [39, 56, 74]. The results are also
summarized in Table 4.2. For the AWGN channel, it can be verified from Shannon's work on the error exponents for the AWGN channel [145] that the prefactor is the same as that for nonsingular, symmetric DMCs. Also see the work by Wiechman and Sason [178]. Table 4.2 suggests that there is a correspondence between third-order terms and prefactors. A precise relation between these two fundamental quantities is an interesting avenue for future research.
Figure 4.3: Transmission of a source over a channel: the source s^k is encoded to the channel input x^n, transmitted over W^n, received as y^n, and decoded to the reproduction ŝ^k.
4.5 Joint Source-Channel Transmission
We conclude our discussion on channel coding by putting together the results and techniques presented in
this and the previous chapter on (lossy and lossless) source coding. We consider the fundamental problem of
transmitting a memoryless source over a memoryless channel as shown in Fig. 4.3. Shannon showed [141, 144]
that as long as

    lim sup_{n→∞} ( k_n / n ) < C(W) / R(P, Δ),    (4.113)
where kn is the number of independent source symbols from P and n is the number of channel uses, the
probability of excess distortion can be arbitrarily small in the limit of large blocklengths. The ratio kn /n is
also known as the bandwidth expansion ratio. We summarize known fixed error probabilitytype results on
sourcechannel transmission in this section.
The source-channel transmission problem is formally defined as follows: A (d, Δ, ε)-code for a source S with distribution P ∈ P(S) over the channel W ∈ P(Y|X) is a pair of maps comprising an encoder f : S → X and a decoder φ : Y → S such that the probability of excess distortion satisfies

    Σ_{s∈S} P(s) W( { y : d(s, φ(y)) > Δ } | f(s) ) ≤ ε.    (4.114)
Again we assume there are no cost constraints on the channel inputs to simplify the exposition. If there are cost constraints, a natural coding strategy would involve constant composition codes as discussed in Theorem 4.2.
In the conventional fixed-to-fixed length setting, in which X and Y are n-fold Cartesian products of the input and output alphabets and S is the k-fold Cartesian product of the source alphabet, we may define the following: A (k, n, d^(k), Δ, ε)-code is simply a (d^(k), Δ, ε)-code for the source S^k with distribution P^k ∈ P(S^k) over the channel W^n ∈ P(Y^n|X^n) such that the probability of excess distortion measured according to d^(k) is no greater than ε.
The source-channel non-asymptotic fundamental limit we are interested in is defined as follows:

    k*(n, d^(k), Δ, ε) := max{ k ∈ ℕ : ∃ a (k, n, d^(k), Δ, ε)-code for (P^k, W^n) }.    (4.115)

This represents the maximum number of source symbols transmissible over the channel W^n such that the probability of excess distortion (at distortion level Δ) does not exceed ε. One is also interested in the maximum joint source-channel coding rate, which is the ratio between the number of source symbols and the number of channel uses, i.e.,

    R*(n, d^(k), Δ, ε) := k*(n, d^(k), Δ, ε) / n.    (4.116)
4.5.1
Asymptotic Expansion
Figure 4.4: A separation scheme: the source s^k is compressed by a source encoder f_s into a digital interface m, which a channel encoder f_c maps to x^n for transmission over W^n; from the channel output y^n, a channel decoder recovers m̂ and a source decoder produces the reproduction ŝ^k.

Theorem 4.5. Under suitable regularity conditions on the source P and channel W,

    R*(n, d^(k), Δ, ε) = C(W)/R(P, Δ) + √( V(P, W, Δ)/n ) Φ^{-1}(ε) + O( (log n)/n ),    (4.118)

where the rate dispersion is

    V(P, W, Δ) := ( V(W) + (C(W)/R(P, Δ)) V(P, Δ) ) / R(P, Δ)².    (4.119)
We will not prove this theorem here, as the main ideas, based on new non-asymptotic bounds, parallel those detailed for the asymptotic expansions in previous sections.
The intuition behind the result in Theorem 4.5 is perhaps more important. The non-asymptotic bounds that are evaluated very roughly say that a joint source-channel coding scheme with probability of excess distortion no larger than ε exists if and only if

    Pr( I_n < J_{k,n} ) ⪅ ε,    (4.120)

where the random variables I_n and J_{k,n} are defined as

    I_n := (1/n) log( W^n(Y^n|x) / (P_X W)^n(Y^n) ),  and  J_{k,n} := (1/n) ȷ(S^k; P^k, Δ),    (4.121)

and x has type P ∈ P_n(X) close to the set of CAIDs Π ⊂ P(X). The bound in (4.120) provides the intuition that erroneous transmission of the source occurs if and only if the information density random variable I_n of the channel is not large enough to support the information content of the source, represented by the Δ-tilted information J_{k,n}. We can now estimate the probability in (4.120) by using the central limit theorem for k + n independent random variables, and the fact that I_n − J_{k,n} has first- and second-order statistics
    E[ I_n − J_{k,n} ] = C(W) − (k/n) R(P, Δ),  and    (4.122)

    Var[ I_n − J_{k,n} ] = (1/n) V(W) + (k/n²) V(P, Δ).    (4.123)
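Plugging the statistics (4.122)–(4.123) into a Gaussian approximation of (4.120) gives a quick numerical estimate of the excess-distortion probability; all numbers below are hypothetical:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def jscc_error_estimate(n, k, C, V_ch, R, V_src):
    # Gaussian approximation of Pr(I_n < J_{k,n}) using (4.122)-(4.123)
    mean = C - (k / n) * R                    # E[I_n - J_{k,n}]
    var = V_ch / n + (k / n ** 2) * V_src     # Var[I_n - J_{k,n}]
    return Phi(-mean / math.sqrt(var))        # Pr(I_n - J_{k,n} < 0)

# hypothetical operating point: C = 0.5, V(W) = 0.25, R(P,D) = 1.0, V(P,D) = 0.6
print(jscc_error_estimate(1000, 450, 0.5, 0.25, 1.0, 0.6))
```

Doubling both k and n (fixed bandwidth expansion ratio) shrinks the estimate, reflecting the √n concentration.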
4.5.2 The Cost of Separation
In showing the seminal result in (4.113), Shannon used a separation scheme. That is, he first considers source compression to distortion level Δ using a source encoder f_s and subsequently, information transmission over the channel W^n using a channel encoder f_c. To decode, simply reverse the process by using a channel decoder φ_d and a source decoder φ_s. See Fig. 4.4, where m denotes the digital interface. While this idea of separation has guided the design of communication systems for decades and is first-order optimal in the limit of large blocklengths, it turns out that such a scheme is neither optimal from the error exponents⁴ perspective [35]
⁴ To be more precise, the suboptimality of separation in the error exponents regime occurs only when kR(P, Δ) < nC(W). In the other case, by analyzing the probability of no excess distortion, Wang–Ingber–Kochman [171] showed, somewhat surprisingly, that separation is optimal.
nor in the fixed error setting. What is the cost of separation when the error probability is allowed to be non-vanishing? By combining Theorem 3.3 (for rate-distortion) and Theorems 4.1–4.3 (for channel coding), one sees that there exists a sequence of (k, n, d^(k), Δ, ε)-codes for P^k and W^n satisfying
    kR(P, Δ) ≤ nC(W) + O(log n)
        + max_{ε_s + ε_c ≤ ε} { √(kV(P, Δ)) Φ^{-1}(ε_s) + √(nV(W)) Φ^{-1}(ε_c) }.    (4.124)
Inequality (4.124) suggests that we first compress the source up to distortion level Δ with excess distortion probability ε_s, then transmit the resultant bit string over the channel W^n with average error probability ε_c. In order for the end-to-end excess distortion probability to be no larger than ε, one has to design the source and channel codes so that ε_s + ε_c ≤ ε.
Because the maximum in (4.124) is no larger than the square root term in (4.117), separation is strictly suboptimal in the second-order asymptotic sense (unless either V(W) or V(P, Δ) vanishes). This is unsurprising: in the separation scheme, the source and channel error events are treated separately, while the (approximate) non-asymptotic bound in (4.120) suggests that treating the system jointly results in better performance in terms of both error and rate.
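The strict gap between separation and joint coding at second order can be seen numerically: for hypothetical dispersions, the best split ε_s + ε_c ≤ ε in (4.124) yields a more negative second-order term than the joint term √(nV(W) + kV(P, Δ)) Φ^{-1}(ε):

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Phi_inv(p, lo=-12.0, hi=12.0):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

def separation_term(k, Vs, n, Vc, eps, grid=999):
    # max over eps_s + eps_c = eps of sqrt(k*Vs)*Phi^{-1}(eps_s) + sqrt(n*Vc)*Phi^{-1}(eps_c)
    best = -float('inf')
    for i in range(1, grid):
        es = eps * i / grid
        best = max(best, math.sqrt(k * Vs) * Phi_inv(es)
                         + math.sqrt(n * Vc) * Phi_inv(eps - es))
    return best

def joint_term(k, Vs, n, Vc, eps):
    return math.sqrt(n * Vc + k * Vs) * Phi_inv(eps)

# hypothetical parameters
k, Vs, n, Vc, eps = 500, 0.6, 1000, 0.25, 0.01
print(separation_term(k, Vs, n, Vc, eps), joint_term(k, Vs, n, Vc, eps))
```

Both terms are negative for ε < 1/2; the separation term is strictly more negative, i.e., separation pays a larger backoff.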
Part III
Chapter 5
Channels with State
5.1 Random State Known at the Decoder
We warm up with the simple model shown in Fig. 5.1. Here there is a state distribution P_S ∈ P(S) on a finite alphabet S which generates an i.i.d. random state sequence, i.e., a discrete memoryless source (DMS). The channel W is a conditional probability distribution from X × S to Y. If the state process is i.i.d. and the channel is discrete, stationary and memoryless given the state, it is easy to see that the capacity is

    C_SI-D(W, P_S) = max_{P ∈ P(X)} I(X; Y|S) = max_{P ∈ P(X)} I(X; YS),    (5.1)

where the two maximizations coincide because X is independent of S.
Figure 5.1: Channel coding with random state known only at the decoder: the state s ∼ P_S affects the channel, and is available at the decoder but not the encoder.

In the spirit of this monograph, we
define M*_SI-D(W^n, P_{S^n}, ε) to be the maximum number of messages transmissible over the DMC W^n with i.i.d. state S^n ∼ P_{S^n} known at the decoder and with average error probability not exceeding ε ∈ (0, 1). We also let W_s(y|x) := W(y|x, s) denote the channel indexed by s ∈ S.
The following is due to Ingber and Feder [85].
Theorem 5.1. Assume that V(W_s) > 0 for all s ∈ S and that V_ε(W_s) does not depend¹ on ε ∈ (0, 1). Then,

    log M*_SI-D(W^n, P_{S^n}, ε) = nC_SI-D(W, P_S) + √( n V_SI-D(W, P_S) ) Φ^{-1}(ε) + O(log n),    (5.2)

where, with X ∼ P the CAID and i(X; Y|S) := log( W(Y|X, S)/(PW)(Y|S) ), the dispersion is

    V_SI-D(W, P_S) := Var[ i(X; Y|S) ]    (5.3)

    = E_S[ Var[ i(X; Y|S) | S ] ] + Var_S[ E[ i(X; Y|S) | S ] ]    (5.5)

    = E_S[ V(W_S) ] + Var_S[ C(W_S) ],    (5.6)

where (5.5) follows from the law of total variance with the conditional distribution (PW)(y|s) := Σ_x P(x) W(y|x, s), and (5.6) follows from the definitions of the capacity and dispersion of W_s.
The dispersion in (5.3) is intuitively pleasing: The term ES [V (WS )] represents the randomness of the
channels {Ws : s S} given the state; the term VarS [C(WS )] represents the randomness of the state.
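The decomposition in (5.6) is just the law of total variance; a toy two-state example with hypothetical per-state capacities and dispersions (in nats) makes this concrete:

```python
# Law of total variance for the dispersion in (5.6): a two-state toy example.
# All numbers are hypothetical.
ps = {0: 0.3, 1: 0.7}    # state distribution P_S
C = {0: 0.40, 1: 0.25}   # per-state capacities C(W_s)
V = {0: 0.10, 1: 0.20}   # per-state dispersions V(W_s)

E_V = sum(ps[s] * V[s] for s in ps)                 # E_S[V(W_S)]
E_C = sum(ps[s] * C[s] for s in ps)
Var_C = sum(ps[s] * (C[s] - E_C) ** 2 for s in ps)  # Var_S[C(W_S)]
V_SID = E_V + Var_C                                 # dispersion in (5.6)
print(V_SID)
```

The state randomness (Var_C) strictly inflates the dispersion beyond the average channel randomness (E_V) whenever the per-state capacities differ.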
5.2 Random State Known at the Encoder and Decoder
The next model we will study is similar to that in the previous section. However, here the i.i.d. state is known noncausally at both the encoder and the decoder. See Fig. 5.2. Again, let W ∈ P(Y|X × S) be a state-dependent discrete memoryless channel, stationary and memoryless given the state, and let P_S ∈ P(S) be a DMS. It is known [49, Sec. 7.4.1] that the capacity of this channel is

    C_SI-ED(W, P_S) = max_{P_{X|S}} I(X; Y|S).    (5.7)

¹ If the CAID of each W_s is unique, V_ε(W_s) does not depend on ε ∈ (0, 1).
Figure 5.2: Channel coding with random state known at both the encoder and the decoder.
In the spirit of this monograph, let M*_SI-ED(W^n, P_{S^n}, ε) be the maximum number of messages transmissible over the channel W^n with i.i.d. random state S^n ∼ P_{S^n} known at both encoder and decoder and with average error probability not exceeding ε ∈ (0, 1).
The following is due to Tomamichel and Tan [165].
Theorem 5.2. Let W satisfy the assumptions in Theorem 5.1. Then,

    log M*_SI-ED(W^n, P_{S^n}, ε) = nC_SI-ED(W, P_S) + √( n( E_S[V(W_S)] + Var_S[C(W_S)] ) ) Φ^{-1}(ε) + O(log n).    (5.8)

To sketch the proof, for a state sequence s = (s₁, …, s_n) with type P̄_s, define the empirical capacity and empirical dispersion

    Ĉ(s) := Σ_{s∈S} P̄_s(s) C(W_s) = (1/n) Σ_{i=1}^n C(W_{s_i}),  and    (5.9)

    V̂(s) := Σ_{s∈S} P̄_s(s) V(W_s) = (1/n) Σ_{i=1}^n V(W_{s_i}).    (5.10)

Since the state sequence is known at both terminals, conditioned on S^n = s the problem reduces to channel coding over a known (time-varying) channel, and the optimum error probability with M codewords satisfies

    ε*(W^n, M, s) = Φ( ( log M − nĈ(s) ) / √( n V̂(s) ) ) + O(1/√n),    (5.11)

where the implied constant in the O(·)-notation above is uniform over all strongly typical state types P̄_s. The optimum error probability when the state is random and i.i.d. is denoted as ε*(W^n, M) and it can be written as an average over the state sequence:

    ε*(W^n, M) = E[ ε*(W^n, M, S^n) ]    (5.12)

    = E[ Φ( ( log M − nĈ(S^n) ) / √( n V̂(S^n) ) ) ] + O(1/√n).    (5.13)
The analysis of (5.13) is facilitated by the following lemmas, whose proofs can be found in [165].

Lemma 5.1. The following holds uniformly in a ∈ ℝ:

    E[ Φ( ( a − nĈ(S^n) ) / √( n V̂(S^n) ) ) ] − E[ Φ( ( a − nĈ(S^n) ) / √( n E_S[V(W_S)] ) ) ] = O( (log n)/n ).    (5.14)
This lemma says that we can essentially replace the random quantity V̂(S^n) in (5.13) with the deterministic quantity E_S[V(W_S)]. The next step involves approximating Ĉ(S^n) in (5.13) with the true capacity C_SI-ED(W, P_S).

Lemma 5.2. The following holds uniformly in a ∈ ℝ:

    E[ Φ( ( a − nĈ(S^n) ) / √( n E_S[V(W_S)] ) ) ] = Φ( ( a − nC_SI-ED(W, P_S) ) / √( n( E_S[V(W_S)] + Var_S[C(W_S)] ) ) ) + O(1/√n).    (5.15)
The idea behind the proof of this lemma is as follows: From (5.9), one can write Ĉ(S^n) as an average of i.i.d. random variables C(W_{S_i}). The expectation in (5.15) can then be written as

    E[ Φ( ( a − nC_SI-ED(W, P_S) ) / √( n E_S[V(W_S)] ) − √( Var_S[C(W_S)] / E_S[V(W_S)] ) · J_n ) ]    (5.16)

where

    J_n := (1/√n) Σ_{i=1}^n E_i,  and  E_i := ( C(W_{S_i}) − C_SI-ED(W, P_S) ) / √( Var_S[C(W_S)] ).    (5.17)
Clearly, Ei are zeromean, unitvariance, i.i.d. random variables and thus Jn converges in distribution to a
standard Gaussian. Now, (5.15) can be established by using the fact that the convolution of two independent
Gaussians is a Gaussian, where the mean and variance are the sums of the constituent means and variances.
Combining Lemmas 5.1 and 5.2 with (5.11)(5.13) completes the proof.
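The convolution step can be sanity-checked via the identity E_J[Φ(a − bJ)] = Φ(a/√(1 + b²)) for J ∼ N(0, 1), which is exactly what Lemma 5.2 exploits once J_n is close to Gaussian; a minimal quadrature check (our own function names):

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def expected_Phi(a, b, lo=-10.0, hi=10.0, steps=200000):
    # E_J[ Phi(a - b*J) ] with J ~ N(0,1), by a midpoint Riemann sum
    total, h = 0.0, (hi - lo) / steps
    for k in range(steps):
        j = lo + (k + 0.5) * h
        total += Phi(a - b * j) * math.exp(-0.5 * j * j) / math.sqrt(2 * math.pi) * h
    return total

a, b = 0.7, 1.3
print(expected_Phi(a, b), Phi(a / math.sqrt(1 + b * b)))  # the two agree
```

This is the "convolution of two independent Gaussians" fact in analytical form: the inner Gaussian cdf and the outer Gaussian average merge into a single cdf whose variance is the sum of the constituents.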
Finally, we remark that by appropriate modifications to Lemmas 5.1 and 5.2, Theorem 5.2 can be generalized to the case where the distribution of the state sequence follows a time-homogeneous and ergodic Markov chain [165, Thm. 8].
5.3 Writing on Dirty Paper
Costa's writing on dirty paper result is probably one of the most surprising in network information theory. It is a special instance of the Gelfand–Pinsker problem [59], whose setup is shown in Fig. 5.3. In contrast to the previous two sections, here the state (usually assumed to be i.i.d.) is known noncausally at the encoder only. The capacity of the Gelfand–Pinsker channel is

    C_SI-E(W, P_S) = max_{P_{U|S}, f : U×S → X} [ I(U; Y) − I(U; S) ].    (5.18)
Figure 5.3: The Gelfand–Pinsker problem: the state s is available noncausally at the encoder only.

In Costa's dirty paper setting, the channel law is

    Y_i = X_i + S_i + Z_i,    i = 1, …, n,    (5.19)

where {Z_i} are i.i.d. standard Gaussian noises independent of the state sequence.
As usual, we assume that the codeword power is constrained to not exceed snr, i.e.,

    (1/n) Σ_{i=1}^n X_i² ≤ snr    (5.20)

with probability one. If the state is not known at either terminal, then the capacity of the channel is

    C_no-SI(W, P_S) = C( snr / (1 + inr) ).    (5.21)
If the state is known at both terminals, the decoder can simply subtract it off and the channel behaves like an AWGN channel with signal-to-noise ratio snr. Thus, the capacity is

    C_SI-ED(W, P_S) = C(snr).    (5.22)

Costa showed the surprising result [30] that knowledge of the state is not required at the decoder for the capacity to be C(snr)! In other words,

    C_SI-E(W, P_S) = C(snr).    (5.23)
The natural question, in the spirit of this monograph, is whether there is a degradation to higher-order terms in the asymptotic expansion of the logarithm of the maximum code size of the channel for a fixed average error probability (cf. the AWGN case in Theorem 4.4). Scarlett [134] and Jiang–Liu [89] showed the surprising result that there is no degradation up to the second-order dispersion term! Furthermore, Scarlett [134] showed that the state sequence only has to satisfy a very mild condition. In particular, it neither has to be Gaussian nor ergodic. The approach by Jiang–Liu [89] is via lattice coding [51]. The proof sketch below follows Scarlett's approach in [134].
Theorem 5.3. Assume that there exists some finite Γ > 0 such that

    Pr( (1/n)‖S^n‖₂² > Γ ) = O( (log n)/n ).    (5.24)

For any snr ∈ (0, ∞), the maximum code size for average error probability no larger than ε satisfies

    log M*_SI-E(W^n, P_{S^n}, snr, ε) = nC(snr) + √(nV(snr)) Φ^{-1}(ε) + O(log n).    (5.25)
The condition in (5.24) is mild. For example, if S^n is a zero-mean i.i.d. process and Γ is chosen to be larger than E[S₁²], then under the condition that E[S₁⁴] < ∞, the probability decays at least as fast as O(1/n) by Chebyshev's inequality, thus satisfying (5.24).
Before we sketch the proof of Theorem 5.3, let us recap Costa's proof of the DPC capacity in (5.23). He assumes S is Gaussian with some variance inr and chooses U = X + αS, where X ∼ N(0, snr) and S are independent. He then performs calculations which yield

    I(U; Y) = (1/2) log( (snr + inr + 1)(snr + α² inr) / ( snr·inr(1 − α)² + (snr + α² inr) ) ),    (5.26)

    I(U; S) = (1/2) log( (snr + α² inr) / snr ),  and    (5.27)

    I(U; Y) − I(U; S) = (1/2) log( snr(snr + inr + 1) / ( snr·inr(1 − α)² + (snr + α² inr) ) ).    (5.28)

Differentiating the final expression (5.28) with respect to α and setting it to zero shows that α* = snr/(snr + 1), independent of inr. Furthermore, the expression (5.28) evaluated at α* yields C(snr), which is, of course, also independent of inr. So the important thing to note here is that I(U; Y) − I(U; S) is independent of inr at the optimum α*, which is the weight of the minimum mean squared error estimate of X given X + Z.
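Costa's optimization is easy to verify numerically: at α* = snr/(snr + 1), the expression (5.28) equals C(snr) for every inr, and perturbing α only decreases the rate. A minimal sketch (all parameter values hypothetical):

```python
import math

def rate_gap(alpha, snr, inr):
    # I(U;Y) - I(U;S) from (5.28), in nats
    num = snr * (snr + inr + 1)
    den = snr * inr * (1 - alpha) ** 2 + (snr + alpha ** 2 * inr)
    return 0.5 * math.log(num / den)

snr = 1.5
alpha_star = snr / (snr + 1)          # Costa's optimal MMSE weight
C = 0.5 * math.log(1 + snr)           # C(snr)
for inr in [0.1, 1.0, 10.0, 100.0]:
    # the gap equals C(snr) regardless of the interference power inr
    assert abs(rate_gap(alpha_star, snr, inr) - C) < 1e-9
print("alpha* =", alpha_star, "achieves C(snr) =", C, "for all inr")
```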
Proof sketch of Theorem 5.3. The main ideas of the proof are sketched here. The converse follows from Theorem 4.4, so we only have to prove achievability. We start with some preliminary definitions.
The analogue of types for states taking values in Euclidean space, which we find helpful here, is the notion of power types [109]. Consider the power type class

    T_n(γ) := { s ∈ ℝ^n : γ ≤ (1/n)‖s‖₂² < γ + 1/n },    (5.29)

where γ = k/n for some nonnegative integer k. Intuitively, what we are doing is partitioning [0, ∞) into small intervals, each of length 1/n. For any sequence s ∈ T_n(γ), we say that its power type is γ, i.e., nγ ≤ ‖s‖₂² < nγ + 1. Thus, the normalized square of the ℓ₂ norm, quantized to the left endpoint of the interval [γ, γ + 1/n), is the power type of s. The set of all power types is denoted as P_n ⊂ [0, ∞).
Consider the following typical set of power types (also called typical types):

    P̃_n := P_n ∩ [0, Γ].    (5.30)

Thus, we are simply truncating those power types that are larger than Γ, the threshold in the statement of the theorem. Clearly, the size of the typical set of power types is |P̃_n| = ⌊Γn⌋ + 1 = Θ(n), which is polynomial in n. This is similar to the discrete case [39, Ch. 2]. Furthermore, by the assumption in (5.24),

    Pr( P_{S^n} ∉ P̃_n ) = O( (log n)/n ).    (5.31)
We use the first Θ(log n) symbols to transmit P_{S^n}, the state type. The rest of the n − Θ(log n) symbols are used to transmit the message. By using the theory of error exponents for the Gelfand–Pinsker problem [115] and the fact that the number of state types is polynomial, one can show that P_{S^n} can be decoded with error probability O(1/√n). The Θ(log n) symbols used to transmit the state type do not affect the dispersion term. In the following, with a slight abuse of notation, n refers to the remaining channel uses.
The decoder uses information density thresholding with respect to the joint distribution

    P^(γ)_{SUY}(s, u, y) := P^(γ)_S(s) P^(γ)_{U|S}(u|s) P^(γ)_{Y|SU}(y|s, u),    (5.32)

where γ indexes a power type, the state distribution is P^(γ)_S = N(0, γ), and the conditional distributions are P^(γ)_{U|S}(·|s) = N(αs, snr) and P^(γ)_{Y|SU}(·|s, u) = N(u + (1 − α)s, 1). The corresponding mutual informations induced by the joint distribution in (5.32) are denoted as I^(γ)(U; S) and I^(γ)(U; Y). The constant α > 0 is arbitrary for now.
With these preparations, we are ready to prove Theorem 5.3 and we divide the proof into several steps.
Step 1 (Codebook Generation): The number of auxiliary codewords for each type γ ∈ P̃_n is denoted as L^(γ). For each state type γ ∈ P̃_n and each message m ∈ {1, …, M}, generate a type-dependent codebook C^(γ) consisting of codewords {U^n(m, l) : m ∈ {1, …, M}, l ∈ {1, …, L^(γ)}}, where each codeword is drawn independently from

    P^(γ)_{U^n}(u) := δ{ ‖u‖₂² = n(snr + α²γ) } / A_n( √(n(snr + α²γ)) ).    (5.33)

That is, similar to the proof of Theorem 4.4, we uniformly generate codewords U^n(m, l) from a sphere in ℝ^n with radius depending on the type, namely √(n(snr + α²γ)).
Step 2 (Encoding): Given the state sequence S^n and message m, the encoder first calculates the type γ of S^n. If γ is not typical in the sense of (5.31), declare an error. The contribution to the overall error probability is given in (5.31), which is easily seen to not affect the second-order term in (5.25). If γ is typical, the encoder then proceeds to find an index l ∈ {1, …, L^(γ)} such that U^n(m, l) is typical in the sense that

    ‖U^n(m, l) − αS^n‖₂² ∈ [ n(snr − δ), n·snr ],    (5.34)

where δ > 0 is chosen to be a small constant. If there are multiple such l, choose the one with the smallest index. If there is none, declare an encoding error. The encoder transmits X^n := U^n(m, l) − αS^n. Clearly, the power constraint on X^n in (5.20) is satisfied with probability one.
Step 3 (Decoding): Given the channel output y and the state type γ, the decoder looks for a codeword u(m̂, l̂) ∈ C^(γ) whose information density

    q^(γ)( u(m̂, l̂), y ) := Σ_{i=1}^n log( P^(γ)_{Y|U}( y_i | u_i(m̂, l̂) ) / P^(γ)_Y(y_i) )    (5.35)

exceeds a threshold θ^(γ).
(5.38)
(5.39)
for some > 0 and all typical types Pn . The event Ep can be analyzed per Feinsteinstyle [53] threshold
decoding as follows:
Pr Ep E1c , PS n = Pr q ( ) (U n , Y n ) ( ) E1c , PS n =
n , Y n ) > ( ) E1c , PS n =
+ M L( ) Pr q ( ) (U
(5.40)
( )
(5.41)
where 2 := 1 + 1, then the second term in (5.40) decays as O( n1 ). So it remains to analyze the firstterm.
snr
, for any Pn and any s and
We do so using the BerryEsseen theorem and the fact that with = snr+1
c
u in the support of PS n ,U n ,Y n conditioned on E1 and PS n = ,
    E[ q^(γ)(U^n, Y^n) | S^n = s, U^n = u ] = nI^(γ)(U; Y) + O(1),  and    (5.42)

    Var[ q^(γ)(U^n, Y^n) | S^n = s, U^n = u ] = nV(snr) + O(1).    (5.43)

The proof is completed by noting that for α = snr/(snr + 1), the difference of mutual informations I^(γ)(U; Y) − I^(γ)(U; S) equals C(snr) for every power type γ (in fact, for every state variance), as we discussed prior to the start of this proof.
5.4 Mixed Channels
In this section, we consider state-dependent DMCs W ∈ P(Y|X × S) where the state sequence is random but fixed throughout the entire transmission block once it is determined at the start. This class of channels is known as mixed channels [67, Sec. 3.3]. The precise setup is as follows. Let S be a state random variable with a binary alphabet S = {0, 1} and let P_S be its distribution. We consider two DMCs, each indexed by a state s ∈ S. These DMCs are denoted as W₀ := W(·|·, 0) and W₁ := W(·|·, 1) and have capacities C(W₀) and C(W₁) respectively. Without loss of generality, we assume that C(W₀) ≤ C(W₁). We also assume that each of these channels has a unique CAID and the CAIDs coincide.² Their channel dispersions (cf. (4.27)–(4.28)) are denoted by V(W₀) and V(W₁) respectively. The dispersions are assumed to be positive and are independent of ε because the CAIDs are unique.
Before transmission begins, the entire state sequence S^n = (S, …, S) ∈ {0, 1}^n is determined. Note that the probability that the DMC in operation is W_s is λ_s := P_S(s). The realization of the state is known to neither the encoder nor the decoder. The probability of observing the sequence y ∈ Y^n given an input sequence x ∈ X^n is

    Pr(Y^n = y | X^n = x) = Σ_{s∈S} λ_s Π_{i=1}^n W_s(y_i|x_i) =: W^(n)_mix(y|x).    (5.44)
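The mixture in (5.44) is a convex combination of product channels but is not itself memoryless: conditioned on the state, errors across time are independent, but averaging over the state couples them. A toy example with two BSCs (hypothetical crossover probabilities):

```python
def bsc(p):
    # W(y|x) for a binary symmetric channel with crossover probability p
    return lambda y, x: (1 - p) if y == x else p

def W_mix(y_seq, x_seq, channels, lam):
    # Eq. (5.44): convex combination of product channels
    total = 0.0
    for s, W in enumerate(channels):
        prod = 1.0
        for y, x in zip(y_seq, x_seq):
            prod *= W(y, x)
        total += lam[s] * prod
    return total

channels = [bsc(0.05), bsc(0.20)]   # W_0, W_1 (hypothetical)
lam = [0.5, 0.5]                    # lambda_0, lambda_1
joint = W_mix([1, 1], [0, 0], channels, lam)    # two flips in a row
single = W_mix([1], [0], channels, lam)         # one flip
print(joint, single ** 2)  # joint > single^2: flips are positively correlated
```

The inequality joint > single² witnesses the memory: seeing one error makes the bad channel more likely, and hence a second error more likely.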
Let M*_mix(W^n, P_S, ε) denote the maximum number of messages transmissible over W^(n)_mix with average error probability not exceeding ε. The ε-capacity

    C_ε(W, P_S) := lim_{n→∞} (1/n) log M*_mix(W^n, P_S, ε)    (5.45)

depends on ε in general. To state C_ε(W, P_S) for binary state distributions, we consider three different cases:
Case (i): C(W₀) = C(W₁) and the relative magnitudes of ε and λ₀ are arbitrary;
Case (ii): C(W₀) < C(W₁) and ε < λ₀;
Case (iii): C(W₀) < C(W₁) and ε ≥ λ₀.
It is known [67, Sec. 3.3] that

    C_ε(W, P_S) = C(W₀) in Cases (i) and (ii),  and  C_ε(W, P_S) = C(W₁) in Case (iii).    (5.46)
A plot of the capacity is provided in Fig. 5.4.
² An example of this would be two binary symmetric channels. Both CAIDs are uniform distributions on {0, 1} and they are clearly unique.
Figure 5.4: Plot of the capacity C_ε(W, P_S) against ε for the case C(W₀) < C(W₁). The strong converse property [67, Sec. 3.5] holds iff C(W₀) = C(W₁), in which case C_ε(W, P_S) does not depend on ε.
The following theorem was proved for the special case of Gilbert–Elliott channels [50, 60, 116] by Polyanskiy–Poor–Verdú [124, Thm. 7], where W₀ and W₁ are binary symmetric channels so their CAIDs are uniform on X. The coefficient L(ε; W, P_S) ∈ ℝ in the asymptotic expansion

    log M*_mix(W^n, P_S, ε) = nC_ε(W, P_S) + √n · L(ε; W, P_S) + o(√n)    (5.47)

was sought. This coefficient is termed the second-order coding rate. In the following theorem, we state and prove a more general version of the result by Polyanskiy–Poor–Verdú [124, Thm. 7]. For a result imposing even less restrictive assumptions, we refer the reader to the work by Yagi and Nomura [185].
Theorem 5.4. Assume that each channel W_s, s ∈ S, has a unique CAID and the CAIDs coincide. In the various cases above, the second-order coding rate is given as follows:
Case (i): L(ε; W, P_S) is the solution l to the following equation:

    λ₀ Φ( l / √V(W₀) ) + λ₁ Φ( l / √V(W₁) ) = ε.    (5.48)

Case (ii):

    L(ε; W, P_S) = √V(W₀) Φ^{-1}( ε / λ₀ ).    (5.49)

Case (iii):

    L(ε; W, P_S) = √V(W₁) Φ^{-1}( (ε − λ₀) / λ₁ ).    (5.50)

If ε = λ₀, then L(ε; W, P_S) = −∞.
We observe that in Case (i), where the capacities C(W₀) and C(W₁) coincide (but not necessarily the dispersions), the second-order coding rate is a function of both dispersions V(W₀) and V(W₁), together with λ₀, λ₁ and ε. This function also involves two Gaussian cdfs, suggesting, in the proof, that we apply the central limit theorem twice. In the case where one capacity is strictly smaller than the other (Cases (ii) and (iii)), there is only one Gaussian cdf, which means that one of the two channels dominates the overall system behavior. Intuitively, for Case (ii), the first-order term is C(W₀) < C(W₁) and ε < λ₀, so the channel with the smaller capacity dominates the asymptotic behavior of the channel, resulting in the second-order term being solely dependent on V(W₀). In Case (iii), since ε ≥ λ₀, we can tolerate a higher error probability, so the channel with the larger capacity dominates the asymptotic behavior. Hence, L(ε; W, P_S) depends only on V(W₁).
The corresponding results for source coding, random number generation and Slepian–Wolf coding were derived by Nomura–Han [117, 118]. We only provide a proof sketch of Case (i) in Theorem 5.4 here.
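Equation (5.48) has no closed form in general, but since its left-hand side is increasing in l, it can be solved by bisection; when V(W₀) = V(W₁) =: V the solution reduces to √V Φ^{-1}(ε), which serves as a sanity check (parameter values hypothetical):

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def second_order_case_i(eps, lam0, V0, V1, lo=-50.0, hi=50.0):
    # Solve lam0*Phi(l/sqrt(V0)) + lam1*Phi(l/sqrt(V1)) = eps for l by bisection;
    # the left-hand side is strictly increasing in l.
    lam1 = 1.0 - lam0
    f = lambda l: lam0 * Phi(l / math.sqrt(V0)) + lam1 * Phi(l / math.sqrt(V1)) - eps
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# sanity check: equal dispersions V0 = V1 = 0.25 collapse to sqrt(V)*Phi^{-1}(eps)
l = second_order_case_i(0.1, 0.3, 0.25, 0.25)
print(l)  # should be 0.5 * Phi^{-1}(0.1) ≈ -0.6408
```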
Proof sketch of Case (i) in Theorem 5.4. For the direct part of Case (i), we specialize Feinstein's theorem (Proposition 4.1) with the input distribution chosen to be the n-fold product of the common CAID of W₀ and W₁, denoted as P ∈ P(X). Recall the definition of W^(n)_mix(y|x) in (5.44). By the law of total probability, the probability defining the (ε−)-information spectrum divergence simplifies as follows:

    p := Pr( log( W^(n)_mix(Y^n|X^n) / (P^n W^(n)_mix)(Y^n) ) ≤ R )    (5.51)

    = Σ_{s∈S} λ_s Pr( log( W_s^n(Y_s^n|X^n) / (P^n W^(n)_mix)(Y_s^n) ) ≤ R ),    (5.52)

where Y_s^n, s ∈ S, denotes the output of W_s^n when the input is X^n. Fix ξ > 0. Consider the probability indexed by s = 0 in (5.52):
    p₀ := Pr( log( W₀^n(Y₀^n|X^n) / (P × W₀)^n(Y₀^n) ) + log( (P × W₀)^n(Y₀^n) / (P^n W^(n)_mix)(Y₀^n) ) ≤ R )    (5.53)

    ≤ Pr( log( W₀^n(Y₀^n|X^n) / (P × W₀)^n(Y₀^n) ) + log( (P × W₀)^n(Y₀^n) / (P^n W^(n)_mix)(Y₀^n) ) ≤ R | Y₀^n ∈ A_ξ ) + Pr( Y₀^n ∈ A_ξ^c ),    (5.54)

where the set

    A_ξ := { y ∈ Y^n : log( (P × W₀)^n(y) / (P^n W^(n)_mix)(y) ) ≥ −ξ }.    (5.55)

Because Y₀^n ∼ (P × W₀)^n, we have Pr(A_ξ^c) ≤ exp(−ξ). This, together with the definition of A_ξ, implies that

    p₀ ≤ Pr( log( W₀^n(Y₀^n|X^n) / (P × W₀)^n(Y₀^n) ) ≤ R + ξ ) + exp(−ξ)    (5.56)
    ≤ Φ( ( R + ξ − nC(W₀) ) / √(nV(W₀)) ) + O(1/√n) + exp(−ξ),    (5.57)

where the final step follows from the i.i.d. version of the Berry–Esseen theorem (Theorem 1.1). The same technique can be used to upper bound the second probability in (5.52). Choosing the residual slack to be 1/√n and ξ = (1/2) log n results in

    p ≤ Σ_{s∈S} λ_s Φ( ( R + (1/2) log n − nC(W_s) ) / √(nV(W_s)) ) + O(1/√n).    (5.58)
Now we substitute this bound on p into the definition of the (ε−)-information spectrum divergence in Feinstein's theorem. We note that C(W₀) = C(W₁) and thus may solve for a lower bound of R. This then completes the direct part of Case (i) in (5.48). Notice that for Case (ii), all the derivations up to (5.58) hold verbatim. However, since C_ε(W, P_S) = C(W₀), we have that R = nC(W₀) + l√n + o(√n) for some l ∈ ℝ. By virtue of the fact that C(W₀) < C(W₁), the second term in (5.58) vanishes asymptotically and we recover (5.49), which involves only one Gaussian cdf.
For the converse part of Case (i), we appeal to the symbol-wise converse bound (Proposition 4.4). For a fixed x ∈ X^n and arbitrary output distribution Q^(n) ∈ P(Y^n), the probability that defines the (ε+)-information spectrum divergence can be written as

    q := Pr( log( W^(n)_mix(Y^n|x) / Q^(n)(Y^n) ) ≤ R )    (5.59)

    = Σ_{s∈S} λ_s Pr( log( W_s^n(Y_s^n|x) / Q^(n)(Y_s^n) ) ≤ R ),    (5.60)
where Y_s^n, s ∈ S, is the output of W_s^n given input x. Now choose the output distribution to be

    Q^(n)(y) := (1/2)( Q₀^(n)(y) + Q₁^(n)(y) ),    (5.61)

where for each s ∈ S,

    Q_s^(n)(y) := (1/|P_n(X)|) Σ_{P_x ∈ P_n(X)} Π_{i=1}^n (P_x W_s)(y_i).    (5.62)

Consequently,

    Q^(n)(y) ≥ (1/(2|P_n(X)|)) Π_{i=1}^n (P_x W_s)(y_i)    (5.63)
for any s S and type Px P(X ). By sifting out the type corresponding to x for channel W0 , the
probability in (5.60) corresponding to s = 0 can be lower bounded as
W0n (Y0n x)
q0 Pr log
log(2P
(X
))
.
(5.64)
n
(Px W0 )n (Y0n )
By separately considering types close to (Berry-Esseen) and far away (Chebyshev) from the CAID, similarly to the proof of Theorem 4.3 (or [76, Thm. 3]), we can show that (5.64) simplifies to
$$q_0 \ge \Phi\left(\frac{R-|\mathcal X|\log(2(n+1))-nC(W_0)}{\sqrt{nV(W_0)}}\right)-O\left(\frac{1}{\sqrt n}\right) \qquad (5.65)$$
uniformly for all $x\in\mathcal X^n$. The same calculation holds for the second probability in (5.60). By choosing $\lambda = 1/\sqrt n$, we can upper bound $R$ using Proposition 4.4, and the converse proof of Case (i) can be completed.
We observe that the crux of the above proof is to use the law of total probability to write the probabilities in the information spectrum divergences as convex combinations of constituent probabilities involving non-mixed channels. For the direct part, a change of output measure by conditioning on the event $Y_0^n\in\mathcal A$ in (5.54) is required. For the converse part, the proof proceeds in a manner similar to the converse proof of the second-order asymptotics for DMCs, upon choosing the auxiliary output measure $Q^{(n)}$ appropriately.
5.5 Quasi-Static Fading Channels
The final channel with state we consider in this chapter is the quasi-static single-input multiple-output (SIMO) channel with $r$ receive antennas. The term quasi-static means that the channel statistics (fading coefficients) remain constant during the transmission of each codeword, similarly to mixed channels. Yang, Durisi, Koch and Polyanskiy [186] derived asymptotic expansions for this channel model, which is described precisely as follows: For time $i=1,\dots,n$, the channel law is given as
$$\begin{bmatrix}Y_{i1}\\ \vdots\\ Y_{ir}\end{bmatrix} = \begin{bmatrix}H_1\\ \vdots\\ H_r\end{bmatrix}X_i + \begin{bmatrix}Z_{i1}\\ \vdots\\ Z_{ir}\end{bmatrix}, \qquad (5.66)$$
where $H^r := (H_1,\dots,H_r)'$ is the vector of (real-valued) i.i.d. fading coefficients, which are random but remain constant for all channel uses, and $\{Z_{ij}\}$ are i.i.d. noises distributed as $\mathcal N(0,1)$. In the theory of fading channels [18], the channel inputs and outputs are usually complex-valued but, to illustrate the key ideas, it is sufficient to consider real-valued channels and fading coefficients. In this section, we restrict our attention to the real-valued SIMO model in (5.66). The channel input $X^n$ must satisfy
$$\|X^n\|_2^2 = \sum_{i=1}^n X_i^2 \le n\,\mathrm{snr}. \qquad (5.67)$$
It can be shown that
$$\lim_{n\to\infty}\frac1n\log M^*_{\rm noSI}(W^n,P_{H^r},\mathrm{snr},\varepsilon) \qquad (5.68)$$
$$= \lim_{n\to\infty}\frac1n\log M^*_{\rm SI\text{-}ED}(W^n,P_{H^r},\mathrm{snr},\varepsilon) \qquad (5.69)$$
$$= C_\varepsilon(W,P_{H^r}). \qquad (5.70)$$
Observe that for a fixed value of $H^r=h$ (i.e., the channel state is not random), the expression $\mathsf C(\mathrm{snr}\|h\|_2^2)$ is simply the Shannon capacity of the channel. Beyond the first-order characterization, the refined asymptotic expansions are
$$\log M^*_{\rm noSI}(W^n,P_{H^r},\mathrm{snr},\varepsilon) = nC_\varepsilon(W,P_{H^r})+O(\log n), \quad\text{and} \qquad (5.71)$$
$$\log M^*_{\rm SI\text{-}ED}(W^n,P_{H^r},\mathrm{snr},\varepsilon) = nC_\varepsilon(W,P_{H^r})+O(\log n). \qquad (5.72)$$
The condition on the channel gain G is satisfied by many fading models of interest, including Rayleigh,
Rician and Nakagami.
Theorem 5.5 interestingly says that, in the quasi-static setting, the $\Theta(\sqrt n)$ dispersion terms that we usually see in asymptotic expansions are absent. This means that the $\varepsilon$-capacity is a good benchmark for the non-asymptotic fundamental limits, where $\varepsilon^*(W^n,P_{H^r},\mathrm{snr},M)$ denotes the smallest error probability with $M$ codewords and random channel gains. In fact, the above argument can be formalized using the following lemma, whose proof can be found in [186].
Lemma 5.3. Let $A$ be a random variable with zero mean, unit variance and finite third moment. Let $B$ be independent of $A$ with a twice continuously differentiable pdf. Then,
$$\Pr\left(A\le\sqrt n\,B\right) = \Pr(B\ge0)+O\left(\frac{1}{\sqrt n}\right). \qquad (5.75)$$
The approximation in (5.74) is then justified by taking
$$A = \sqrt{\mathsf V(\mathrm{snr}\|H^r\|_2^2)}\,Z, \quad\text{and}\quad B = \log M - n\,\mathsf C(\mathrm{snr}\|H^r\|_2^2). \qquad (5.76)$$
Finally, we remark that this quasi-static SIMO model is different from that in Section 5.2 in two significant ways: First, the state here is a continuous random variable and second, according to (5.66), the quasi-static scenario here implies that the state $H^r$ is constant throughout transmission and does not vary across time $i=1,\dots,n$. This explains the difference in second-order behavior vis-à-vis the result in Theorem 5.2. The distinction between this model and that in Section 5.4 on mixed channels with finitely many states is that the fading coefficients contained in $H^r$ are continuous random variables.
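As a numerical complement, the following sketch estimates the outage-style quantity $\sup\{\xi : \Pr[\mathsf C(\mathrm{snr}\|H^r\|_2^2) < \xi]\le\varepsilon\}$ by Monte Carlo. It assumes real $\mathcal N(0,1)$ fading coefficients (so $\|H^r\|_2^2$ is $\chi^2_r$-distributed) and illustrative values of $\mathrm{snr}$, $r$ and $\varepsilon$; none of these numbers come from the text.

```python
import math
import random

def C(snr):
    """Real-valued Gaussian capacity (nats per channel use)."""
    return 0.5 * math.log(1.0 + snr)

def eps_capacity(snr, r, eps, trials=200000, seed=1):
    """Monte Carlo estimate of sup{xi : Pr[C(snr * ||H^r||^2) < xi] <= eps}
    for i.i.d. real N(0,1) fading coefficients H_1, ..., H_r."""
    rng = random.Random(seed)
    rates = sorted(C(snr * sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(r)))
                   for _ in range(trials))
    return rates[int(eps * trials)]  # empirical eps-quantile of C(snr * ||H||^2)

c_out = eps_capacity(snr=10.0, r=4, eps=0.01)   # outage-limited rate at eps = 1%
c_fixed = C(10.0 * 4.0)                         # rate if ||H||^2 were fixed at E||H||^2 = r
# c_out is well below c_fixed: the 1% worst fades dictate the eps-capacity.
```

The gap between `c_out` and `c_fixed` shows why, in the quasi-static setting, the randomness of the single fade (rather than a $\sqrt n$ dispersion term) dominates the backoff from capacity.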
Chapter 6
Distributed Lossless Source Coding
$$R_1 \ge H(X_1|X_2), \quad R_2 \ge H(X_2|X_1), \quad R_1+R_2 \ge H(X_1,X_2). \qquad (6.1)$$
In this chapter, we analyze refinements to Slepian and Wolf's seminal result. Essentially, we fix a point $(R_1,R_2)$ on the boundary of the region in (6.1). We then find all possible second-order coding rate pairs $(L_1,L_2)\in\mathbb R^2$ such that there exist length-$n$ block codes of sizes $M_{jn}$, $j=1,2$, and error probabilities $\varepsilon_n$ such that
$$\log M_{jn}\le nR_j+\sqrt n\,L_j+o(\sqrt n), \quad\text{and}\quad \varepsilon_n\le\varepsilon+o(1). \qquad (6.2)$$
The latter condition means that the sequence of codes is reliable. We will see that if $(R_1,R_2)$ is a corner point, the set of all such $(L_1,L_2)$ is characterized in terms of a multivariate Gaussian cdf. This is the distinguishing feature compared to results in the previous chapters.
The material in this chapter is based on the work by Nomura and Han [118] and Tan and Kosut [157].
6.1
In this section, we set up the distributed lossless source coding problem formally and mention some known non-asymptotic bounds. Let $P_{X_1X_2}\in\mathcal P(\mathcal X_1\times\mathcal X_2)$ be a correlated source. See Fig. 6.1.
An $(M_1,M_2,\varepsilon)$-code for the correlated source $P_{X_1X_2}\in\mathcal P(\mathcal X_1\times\mathcal X_2)$ consists of a triplet of maps that includes two encoders $f_j:\mathcal X_j\to\{1,\dots,M_j\}$ for $j=1,2$ and a decoder $\varphi:\{1,\dots,M_1\}\times\{1,\dots,M_2\}\to\mathcal X_1\times\mathcal X_2$ such that the error probability
$$P_{X_1X_2}\big(\{(x_1,x_2)\in\mathcal X_1\times\mathcal X_2 : \varphi(f_1(x_1),f_2(x_2))\ne(x_1,x_2)\}\big)\le\varepsilon. \qquad (6.3)$$
The numbers $M_1$ and $M_2$ are called the sizes of the code.
[Figure 6.1: Encoders $f_1$ and $f_2$ map the source sequences $x_1$ and $x_2$ to the indices $m_1$ and $m_2$; the decoder outputs the estimate $(\hat x_1,\hat x_2)$.]
Define the entropy density vector
$$h_{X_1X_2}(x_1,x_2) := \left[\;\log\frac{1}{P_{X_1|X_2}(x_1|x_2)},\;\; \log\frac{1}{P_{X_2|X_1}(x_2|x_1)},\;\; \log\frac{1}{P_{X_1X_2}(x_1,x_2)}\;\right]'. \qquad (6.6)$$
6.2 Second-Order Asymptotics
We would like to make concrete statements about the performance of optimal codes with asymptotic error probabilities not exceeding $\varepsilon$ and blocklength $n$ tending to infinity. For this purpose, we assume that the source $P_{X_1X_2}$ is a DMMS, i.e.,
$$P_{X_1^nX_2^n}(x_1,x_2) = \prod_{i=1}^n P_{X_1X_2}(x_{1i},x_{2i}). \qquad (6.7)$$
As such, the alphabets $\mathcal X_j$, $j=1,2$, in the definition of an $(M_1,M_2,\varepsilon)$-code are replaced by their $n$-fold Cartesian products.
6.2.1
Unlike point-to-point problems, where the first-order fundamental limit is a single number (e.g., capacity for channel coding, rate-distortion function for lossy compression), for multi-terminal problems like the Slepian-Wolf problem there is a continuum of first-order fundamental limits. Hence, to define second-order quantities, we must center the analysis at a point $(R_1,R_2)$ on the boundary of the optimal rate region (in source coding scenarios) or capacity region (in channel coding settings). Subsequently, we can ask what the local second-order behavior of the system is in the vicinity of $(R_1,R_2)$. This is the essence of second-order asymptotics for multi-terminal problems. Note that for multi-terminal problems, we exclusively study second-order asymptotics; we do not go beyond this to study third-order asymptotics.
Fix a rate pair $(R_1,R_2)$ on the boundary of the optimal rate region given by (6.1). Let $(L_1,L_2)\in\mathbb R^2$ be called an achievable $(\varepsilon,R_1,R_2)$-second-order coding rate pair if there exists a sequence of $(M_{1n},M_{2n},\varepsilon_n)$-codes for the correlated source $P_{X_1^nX_2^n}$ such that the sequence of error probabilities does not exceed $\varepsilon$ asymptotically, i.e.,
$$\limsup_{n\to\infty}\varepsilon_n\le\varepsilon, \qquad (6.8)$$
and
$$\limsup_{n\to\infty}\frac{1}{\sqrt n}\left(\log M_{jn}-nR_j\right)\le L_j, \quad j=1,2. \qquad (6.9)$$
The set of all achievable $(\varepsilon,R_1,R_2)$-second-order coding rate pairs is denoted as $\mathcal L(\varepsilon;R_1,R_2)\subseteq\mathbb R^2$, the second-order coding rate region. Note that even though we term the elements of $\mathcal L(\varepsilon;R_1,R_2)$ as rates, they could be negative. This convention follows that in Hayashi's works [75, 76]. The number $L_j$ has units of bits per square-root source symbol.
Let us pause for a moment to understand the above definition, as it is a recurring theme in subsequent chapters on second-order asymptotics in network information theory. Slepian and Wolf [151] showed that there exists a sequence of codes for the (stationary, memoryless) correlated source $(X_1,X_2)$ whose error probabilities vanish asymptotically (i.e., $\varepsilon_n=o(1)$) and whose sizes $M_{jn}$ satisfy
$$\limsup_{n\to\infty}\frac1n\log M_{jn}\le R_j, \quad j=1,2, \qquad (6.10)$$
where the rates $R_1$ and $R_2$ satisfy the bounds in (6.1). Hence, the definition of a second-order coding rate pair in (6.9) is a refinement of the scaling of the code sizes in the Slepian-Wolf setting, centering the rate analysis at $(R_1,R_2)$ and analyzing deviations of order $\Theta(1/\sqrt n)$ from this first-order fundamental limit. In doing so, we allow the error probability to be non-vanishing per (6.8). This requirement is subtly different from that in the chapters on source and channel coding, where we are interested in approximating non-asymptotic fundamental limits like $\log M^*(P,\varepsilon)$ or $\log M^*_{\rm ave}(W,\varepsilon)$ and therein the error probabilities are constrained to be no larger than a non-vanishing $\varepsilon\in(0,1)$ for all blocklengths. Here we allow some slack (i.e., $\varepsilon_n\le\varepsilon+o(1)$). This turns out to be immaterial from the perspective of second-order asymptotics, as we are seeking to characterize a region of second-order rates $\mathcal L(\varepsilon;R_1,R_2)$ and we are not attempting to characterize higher-order (i.e., third-order) terms in an asymptotic expansion. The $o(1)$ slack affects the third-order asymptotics but, since we are not interested in this study for network information theory problems, we find it convenient to define $\mathcal L(\varepsilon;R_1,R_2)$ using (6.8)-(6.9), analogous to information spectrum analysis [67].
Before we state the main result of this chapter, let us consider the following bivariate generalization of the cdf of a Gaussian:
$$\Psi(t_1,t_2;\boldsymbol\mu,\boldsymbol\Sigma) := \int_{-\infty}^{t_1}\int_{-\infty}^{t_2}\mathcal N(\boldsymbol x;\boldsymbol\mu,\boldsymbol\Sigma)\,\mathrm d\boldsymbol x. \qquad (6.11)$$
Figure 6.2: Illustration of the different cases in Theorem 6.1, where $H_1=H(X_1)$ and $H_{2|1}=H(X_2|X_1)$, etc. The curve is a schematic of the boundary of the set of rate pairs $(R_1,R_2)$ achievable at blocklength $n$ with error probability $\varepsilon$, denoted $\mathcal R_{\rm SW}(n,\varepsilon)$.
The covariance matrix of the entropy density vector $h_{X_1X_2}(X_1,X_2)$ is
$$\mathbf V = \begin{bmatrix} V_{1|2} & \rho_{12,21}\sqrt{V_{1|2}V_{2|1}} & \rho_{1,12}\sqrt{V_{1|2}V_{1,2}}\\ \rho_{12,21}\sqrt{V_{1|2}V_{2|1}} & V_{2|1} & \rho_{2,12}\sqrt{V_{2|1}V_{1,2}}\\ \rho_{1,12}\sqrt{V_{1|2}V_{1,2}} & \rho_{2,12}\sqrt{V_{2|1}V_{1,2}} & V_{1,2} \end{bmatrix}. \qquad (6.13)$$
We also denote the diagonal entries as $V(X_1|X_2)=V_{1|2}$, $V(X_2|X_1)=V_{2|1}$ and $V(X_1,X_2)=V_{1,2}$. Define $\mathbf V_{1,12}$ (resp. $\mathbf V_{2,12}$) as the $2\times2$ submatrix indexed by the 1st (resp. 2nd) and 3rd entries of $\mathbf V$, i.e.,
$$\mathbf V_{1,12} := \begin{bmatrix} V_{1|2} & \rho_{1,12}\sqrt{V_{1|2}V_{1,2}}\\ \rho_{1,12}\sqrt{V_{1|2}V_{1,2}} & V_{1,2}\end{bmatrix}, \qquad (6.14)$$
and $\mathbf V_{2,12}$ is defined similarly.
6.2.2
The set $\mathcal L(\varepsilon;R_1,R_2)$ is characterized in the following result. This result was proved by Nomura and Han [118]. A slightly different form of this result was proved earlier by Tan and Kosut [157].
Theorem 6.1. Assume $\mathbf V$ is positive definite. Depending on $(R_1,R_2)$ (see Fig. 6.2), there are 5 cases, of which we state 3 explicitly:
Case (i): $R_1=H(X_1|X_2)$ and $R_2>H(X_2)$ (vertical boundary)
$$\mathcal L(\varepsilon;R_1,R_2) = \left\{(L_1,L_2) : L_1\ge\sqrt{V(X_1|X_2)}\,\Phi^{-1}(1-\varepsilon)\right\}. \qquad (6.15)$$
Case (ii): $R_1+R_2=H(X_1,X_2)$ and $H(X_1|X_2)<R_1<H(X_1)$ (diagonal face)
$$\mathcal L(\varepsilon;R_1,R_2) = \left\{(L_1,L_2) : L_1+L_2\ge\sqrt{V(X_1,X_2)}\,\Phi^{-1}(1-\varepsilon)\right\}. \qquad (6.16)$$
Case (iii): $R_1=H(X_1|X_2)$ and $R_2=H(X_2)$ (top-left corner point)
$$\mathcal L(\varepsilon;R_1,R_2) = \left\{(L_1,L_2) : \Psi(L_1,L_1+L_2;\mathbf 0,\mathbf V_{1,12})\ge 1-\varepsilon\right\}. \qquad (6.17)$$
Figure 6.3: Illustration of $\mathcal L(\varepsilon;R_1,R_2)$ in (6.17) for the source $P_{X_1X_2}$ in (6.18) with $\varepsilon=0.01, 0.1$ (axes: $L_1$ and $L_2$ in bits per square-root symbol). The regions are to the top right of the boundaries indicated.
The region $\mathcal L(\varepsilon;R_1,R_2)$ for Case (iii) is illustrated in Fig. 6.3 for a binary source $(X_1,X_2)$ with distribution
$$P_{X_1X_2}(x_1,x_2) = \begin{bmatrix}0.7 & 0.1\\ 0.1 & 0.1\end{bmatrix}. \qquad (6.18)$$
Note that $\mathcal L(\varepsilon;R_1,R_2)$ for other points on the boundary can be found by symmetry. For example, for the horizontal boundary, simply interchange the indices 1 and 2 in (6.15). The case in which $\mathbf V$ is not positive definite was dealt with in detail in [157].
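For concreteness, the entries of $\mathbf V_{1,12}$ for the source in (6.18) can be computed directly from the entropy density in (6.6). The following sketch does this numerically (entropies in bits) and checks that the submatrix is positive definite, as Theorem 6.1 assumes; the helper names are ours, not from the text.

```python
import math

# Joint pmf from (6.18); key = (x1, x2).
P = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}
P2 = {x2: sum(P[(x1, x2)] for x1 in (0, 1)) for x2 in (0, 1)}  # marginal of X2

def h_1g2(x1, x2):
    """First entry of the entropy density (6.6): log2 1/P(x1|x2)."""
    return -math.log2(P[(x1, x2)] / P2[x2])

def h_12(x1, x2):
    """Third entry of (6.6): log2 1/P(x1, x2)."""
    return -math.log2(P[(x1, x2)])

def mean(f):
    return sum(P[k] * f(*k) for k in P)

def cov(f, g):
    mf, mg = mean(f), mean(g)
    return sum(P[k] * (f(*k) - mf) * (g(*k) - mg) for k in P)

H_1g2 = mean(h_1g2)          # H(X1|X2)
H_12 = mean(h_12)            # H(X1, X2)
V_1g2 = cov(h_1g2, h_1g2)    # V(X1|X2): top-left entry of V_{1,12}
V_12 = cov(h_12, h_12)       # V(X1, X2): bottom-right entry
V_cross = cov(h_1g2, h_12)   # off-diagonal entry rho * sqrt(V(X1|X2) V(X1,X2))
det = V_1g2 * V_12 - V_cross ** 2   # > 0 (with V_1g2 > 0) iff V_{1,12} is positive definite
```

The nonzero off-diagonal entry `V_cross` is exactly what makes the bivariate cdf $\Psi$ in (6.17) non-factorizable for this source.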
6.2.3
Proof. The proof of the direct part specializes the non-asymptotic bound in Proposition 6.1 with the choice $\gamma = n^{1/4}$. Choose code sizes $M_{1n}$ and $M_{2n}$ to be the smallest integers satisfying
$$\log M_{jn}\ge nR_j+\sqrt n\,L_j+2n^{1/4}, \quad j=1,2, \qquad (6.19)$$
for some $(L_1,L_2)\in\mathbb R^2$. Substitute these choices into the probability in (6.4), denoted as $p$. The complementary probability $1-p$ is
$$1-p = \Pr\left[h_{X_1^nX_2^n}(X_1^n,X_2^n) < \begin{bmatrix} nR_1+\sqrt n\,L_1+n^{1/4}\\ nR_2+\sqrt n\,L_2+n^{1/4}\\ n(R_1+R_2)+\sqrt n(L_1+L_2)+3n^{1/4}\end{bmatrix}\right]. \qquad (6.20)$$
Recall that $h_{X_1^nX_2^n}(x_1,x_2)$ is the entropy density in (6.6) and that inequalities (like $<$) are applied elementwise. The three events in the probability above are
$$\mathcal A_1 := \left\{\frac1n\log\frac{1}{P_{X_1^n|X_2^n}(X_1^n|X_2^n)} < R_1+\frac{L_1}{\sqrt n}+n^{-3/4}\right\}, \qquad (6.21)$$
$$\mathcal A_2 := \left\{\frac1n\log\frac{1}{P_{X_2^n|X_1^n}(X_2^n|X_1^n)} < R_2+\frac{L_2}{\sqrt n}+n^{-3/4}\right\}, \quad\text{and} \qquad (6.22)$$
$$\mathcal A_{12} := \left\{\frac1n\log\frac{1}{P_{X_1^nX_2^n}(X_1^n,X_2^n)} < R_1+R_2+\frac{L_1+L_2}{\sqrt n}+3n^{-3/4}\right\}. \qquad (6.23)$$
As such, the probability in (6.20) is $\Pr(\mathcal A_1\cap\mathcal A_2\cap\mathcal A_{12})$.
Let us consider Case (i) in Theorem 6.1. In this case, $R_2>H(X_2)$ and $R_1+R_2>H(X_1,X_2)$. By the weak law of large numbers, $\Pr(\mathcal A_2)\to1$ and $\Pr(\mathcal A_{12})\to1$ as $n$ grows. In fact, these probabilities converge to one exponentially fast. Thus,
$$1-p \ge \Pr(\mathcal A_1)-\exp(-n\delta) \qquad (6.24)$$
for some $\delta>0$. Furthermore, because $R_1=H(X_1|X_2)$, $\Pr(\mathcal A_1)$ can be estimated using the Berry-Esseen theorem as
$$\Pr(\mathcal A_1) \ge \Phi\left(\frac{L_1}{\sqrt{V(X_1|X_2)}}\right)-O(n^{-1/4}), \qquad (6.25)$$
so that
$$1-p \ge \Phi\left(\frac{L_1}{\sqrt{V(X_1|X_2)}}\right)-O(n^{-1/4}). \qquad (6.26)$$
Coupled with the fact that $\exp(-\gamma)=\exp(-n^{1/4})$, the proof of the direct part of (6.15) is complete. The converse employs essentially the same technique. Case (ii) is also similar, with the exception that now $\Pr(\mathcal A_1)\to1$ and $\Pr(\mathcal A_2)\to1$, while $\Pr(\mathcal A_{12})$ is estimated using the Berry-Esseen theorem.
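The Berry-Esseen step (6.25) is easy to check empirically. The sketch below — a Monte Carlo with illustrative choices $n=400$ and $L_1=1$, working in bits throughout — simulates the conditional entropy density of the source (6.18) and compares the resulting $\Pr(\mathcal A_1)$-type probability with the Gaussian approximation $\Phi(L_1/\sqrt{V(X_1|X_2)})$.

```python
import math
import random

def Phi(x):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Source (6.18): pmf and conditional entropy density log2 1/P(x1|x2) (bits).
P = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}
h = {(0, 0): -math.log2(0.7 / 0.8), (1, 0): -math.log2(0.1 / 0.8),
     (0, 1): -math.log2(0.1 / 0.2), (1, 1): -math.log2(0.1 / 0.2)}
keys = list(P)
weights = [P[k] for k in keys]
vals = [h[k] for k in keys]
H = sum(P[k] * h[k] for k in P)              # H(X1|X2)
V = sum(P[k] * (h[k] - H) ** 2 for k in P)   # V(X1|X2)

def prob_A1(n, L1, trials=5000, seed=7):
    """Monte Carlo estimate of Pr[(1/n) log 1/P(X1^n|X2^n) < H + L1/sqrt(n)]."""
    rng = random.Random(seed)
    thresh = n * H + math.sqrt(n) * L1
    hits = sum(sum(rng.choices(vals, weights=weights, k=n)) < thresh
               for _ in range(trials))
    return hits / trials

n, L1 = 400, 1.0
mc = prob_A1(n, L1)
clt = Phi(L1 / math.sqrt(V))   # Gaussian approximation from (6.25)
```

Even at this moderate blocklength, the simulated probability and the central-limit prediction agree to within a few percent, which is the accuracy the $O(n^{-1/4})$ remainder in (6.25) suggests.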
We are left with Case (iii). In this case, only $\Pr(\mathcal A_2)\to1$. Thus, just as in (6.24), (6.20) can be estimated as
$$1-p \ge \Pr(\mathcal A_1\cap\mathcal A_{12})-\exp(-n\delta') \qquad (6.27)$$
for some $\delta'>0$. The probability can now be estimated using the multivariate Berry-Esseen theorem (Corollary 1.1) as
$$\Pr(\mathcal A_1\cap\mathcal A_{12}) \ge \Psi\left(L_1,L_1+L_2;\mathbf 0,\mathbf V_{1,12}\right)-O(n^{-1/4}). \qquad (6.28)$$
We complete the proof of (6.17) similarly to Case (i). The converse is completely analogous.
A couple of take-home messages are in order:
First, consider Case (i). In this case, we are operating far away from the constraint concerning the second rate and the sum rate constraint. This corresponds to the events $\mathcal A_2^c$ and $\mathcal A_{12}^c$. Thus, by the theory of large deviations, $\Pr(\mathcal A_2^c)$ and $\Pr(\mathcal A_{12}^c)$ both tend to zero exponentially fast. Essentially, for these two error events, we are in the error exponents regime.¹ The same holds true for Case (ii).
Second, consider Case (iii). This is the most interesting case for the second-order asymptotics of the Slepian-Wolf problem. We are operating at a corner point and are far away from the second rate constraint, i.e., in the error exponents regime for $\mathcal A_2^c$. The remaining two events $\mathcal A_1$ and $\mathcal A_{12}$ are, however, still in the central limit regime and hence their joint probability must be estimated using the multivariate Berry-Esseen theorem. Instead of the result being expressible in terms of a univariate Gaussian cdf $\Phi$ (which is the case for single-terminal problems in Part II of this monograph), the multivariate version $\Psi$ of the Gaussian cdf, parameterized by the (in general, full) covariance matrix $\mathbf V_{1,12}$ in (6.14), must be employed. Compared to the cooperative case, where $X_2^n$ (resp. $X_1^n$) is available to encoder 1 (resp. encoder 2), we see from the result in Case (iii) that Slepian-Wolf coding, in general, incurs a rate loss over the case where side-information is available to all terminals. Indeed, when side-information is available at all terminals, the matrix $\mathbf V_{1,12}$ that characterizes $\mathcal L(\varepsilon;R_1,R_2)$ in Case (iii) would be diagonal [157], since the source coding problems involving $X_1$ and $X_2$ are now independent of each other. In other words, in this case, there exists a sequence of codes with error probabilities $\varepsilon_n$ satisfying (6.8) and sizes $(M_{1n},M_{2n})$ satisfying
$$\log M_{1n} \le nH(X_1|X_2)+\sqrt{nV(X_1|X_2)}\,\Phi^{-1}(\varepsilon)+o(\sqrt n), \qquad (6.29)$$
$$\log M_{2n} \le nH(X_2|X_1)+\sqrt{nV(X_2|X_1)}\,\Phi^{-1}(\varepsilon)+o(\sqrt n). \qquad (6.30)$$
6.3 Second-Order Asymptotics via the Method of Types
Just as in Section 3.3 (second-order asymptotics of lossless data compression via the method of types), we can show that codes that do not necessarily have full knowledge of the source statistics (i.e., partially universal source codes) can achieve the second-order coding rate region $\mathcal L(\varepsilon;R_1,R_2)$. However, the coding scheme does require knowledge of the entropies together with the pair of second-order rates $(L_1,L_2)$ we would like to achieve. We illustrate the achievability proof technique for Case (iii) of Theorem 6.1, in which $R_1=H(X_1|X_2)$ and $R_2=H(X_2)$.
The code construction is based on Cover's random binning idea [32] and the decoding strategy is similar to minimum empirical entropy decoding [38, 39]. Fix $(L_1,L_2)\in\mathcal L(\varepsilon;R_1,R_2)$, where $\mathcal L(\varepsilon;R_1,R_2)$ is given in (6.17). Also fix code sizes $M_{1n}$ and $M_{2n}$ satisfying (6.19). For each $j=1,2$, uniformly and independently assign each sequence $x_j\in\mathcal X_j^n$ to one of $M_{jn}$ bins labeled as $\mathcal B_j(m_j)$, $m_j\in\{1,\dots,M_{jn}\}$. The bin assignments are revealed to all parties. To send $x_j\in\mathcal X_j^n$, encoder $j$ transmits its bin index $m_j$.
¹ Of course, the error exponents for the Slepian-Wolf problem are known [36, 58], but any exponential bound suffices for our purposes here.
The decoder, upon receipt of the bin indices $(m_1,m_2)\in\{1,\dots,M_{1n}\}\times\{1,\dots,M_{2n}\}$, finds a pair of sequences $(\hat x_1,\hat x_2)\in\mathcal B_1(m_1)\times\mathcal B_2(m_2)$ satisfying
$$\hat H(\hat x_1,\hat x_2) := \begin{bmatrix}\hat H(\hat x_1|\hat x_2)\\ \hat H(\hat x_2|\hat x_1)\\ \hat H(\hat x_1,\hat x_2)\end{bmatrix} \le \begin{bmatrix}\gamma_1\\ \gamma_2\\ \gamma_{12}\end{bmatrix} =: \gamma, \qquad (6.32)\text{--}(6.35)$$
where $\hat H(\cdot)$ denotes empirical (conditional) entropies.
By the multivariate central limit theorem,
$$\sqrt n\left(\hat H(X_1^n,X_2^n)-\mathbf H(X_1,X_2)\right)\xrightarrow{\;d\;}\mathcal N(\mathbf 0,\mathbf V), \qquad (6.40)$$
where $\mathbf H(X_1,X_2):=(H(X_1|X_2),H(X_2|X_1),H(X_1,X_2))'$. This is the multidimensional analogue of (3.33) for almost lossless source coding. Thus, by the same argument as that in (6.27)-(6.28) (ignoring the second entry in $\hat H(X_1^n,X_2^n)$ because $R_2=H(X_2)>H(X_2|X_1)$), one has
$$\Pr(\mathcal E_0) \le \varepsilon+O(n^{-1/4}). \qquad (6.41)$$
Furthermore, by using the method of types, we may verify that
Furthermore, by using the method of types, we may verify that
X
X
Pr(E1 )
PX1n X2n (x1 , x2 )
x1 ,x2
x1 ,x2 )
1 6=x1 :H(
x
x1 x2 )1
1 6=x1 :H(
x
X
x1 ,x2
x1 ,x2
x1 ,x2
1 B1 (1)
Pr x
x1 x2 )1
1 6=x1 :H(
x
(6.42)
1 B1 (1)
Pr x
(6.43)
1
M1n
(6.44)
1 TV (x2 )
V Vn (X2 ;Px2 ): x
H(V Px2 )1
78
1
M1n
(6.45)
X
x1 ,x2
V Vn (X2 ;Px2 ):
H(V Px2 )1
X1n X2n
x1 ,x2
(x1 , x2 )
V Vn (X2 ;Px2 ):
H(V Px2 )1
(6.46)
(6.47)
(6.48)
where in (6.44) we used the uniformity of the binning, in (6.45) we partitioned the set of sequences $\bar x_1$ into conditional types given $x_2$, and in (6.46) we used the fact that $|\mathcal T_V(x_2)|\le\exp(nH(V|P_{x_2}))$ (cf. Lemma 1.2). Finally, the type counting lemma and the choices of $\gamma_1$ and $M_{1n}$ were used in (6.48). The same calculation can be performed for $\Pr(\mathcal E_2)$ and $\Pr(\mathcal E_{12})$. Thus, asymptotically, the error probability is no larger than $\varepsilon$, as desired.
6.4
In the preceding sections, we were solely concerned with deviations of order $\Theta(1/\sqrt n)$ away from the first-order fundamental limit $(R_1,R_2)$. However, one may also be interested in other metrics that quantify backoffs from particular first-order fundamental limits. Here we mention three other quantities that have appeared in the literature.
6.4.1 Weighted Sum-Rates
For constants $\alpha,\beta\ge0$, the minimum value of $\alpha R_1+\beta R_2$ over asymptotically achievable $(R_1,R_2)$ is called the optimal weighted sum-rate. Of particular interest is the case $\alpha=\beta=1$, corresponding to the standard sum-rate $R_1+R_2$, but other cases may be important as well, e.g., if transmitting from encoder 1 is more costly than transmitting from encoder 2. Because of the polygonal shape of the optimal region described by the Slepian-Wolf region in (6.1), the optimal weighted sum-rate is always achieved at (at least) one of the two corner points, and the optimal rate is given by
$$R^*_{\rm sum}(\alpha,\beta) := \min\left\{\alpha H(X_1|X_2)+\beta H(X_2),\ \alpha H(X_1)+\beta H(X_2|X_1)\right\}. \qquad (6.49)$$
One can then define $J\in\mathbb R$ to be an achievable $(\varepsilon,\alpha,\beta)$-weighted second-order coding rate if there exists a sequence of $(M_{1n},M_{2n},\varepsilon_n)$-codes for the correlated source $P_{X_1^nX_2^n}$ such that the error probability condition in (6.8) holds and
$$\limsup_{n\to\infty}\frac{1}{\sqrt n}\left(\alpha\log M_{1n}+\beta\log M_{2n}-nR^*_{\rm sum}(\alpha,\beta)\right)\le J.$$
6.4.2 Dispersion-Angle Pairs
One can also imagine approaching a point $(R_1,R_2)$ on the boundary at a fixed angle of approach $\theta\in[0,2\pi)$. Let $(F,\theta)$ be called an achievable $(\varepsilon,R_1,R_2)$-dispersion-angle pair if there exists a sequence of $(M_{1n},M_{2n},\varepsilon_n)$-codes for the correlated source $P_{X_1^nX_2^n}$ such that the error probability condition in (6.8) holds and
$$\limsup_{n\to\infty}\frac{1}{\sqrt n}\left(\log M_{1n}-nR_1\right)\le -F\cos\theta, \quad\text{and} \qquad (6.51)$$
$$\limsup_{n\to\infty}\frac{1}{\sqrt n}\left(\log M_{2n}-nR_2\right)\le -F\sin\theta. \qquad (6.52)$$
Clearly, dispersion-angle pairs $(F,\theta)$ are in one-to-one correspondence with second-order coding rate pairs $(L_1,L_2)$. The minimum such $F$ for a given $\theta$, denoted as $F(\varepsilon,\theta;R_1,R_2)$, measures the speed of approach to $(R_1,R_2)$ at an angle $\theta$. This fundamental quantity $F(\varepsilon,\theta;R_1,R_2)$ was also characterized in [157].
6.4.3 Global Approaches
Authors of early works on second-order asymptotics in multi-terminal systems [84, 111, 156] considered global rate regions, meaning that they were concerned with quantifying the sizes $(M_{1n},M_{2n})$ of length-$n$ block codes with error probability not exceeding $\varepsilon$. Such sizes are called $(n,\varepsilon)$-achievable. In the Slepian-Wolf context, a result by Tan-Kosut [156] states that $(M_{1n},M_{2n})$ are $(n,\varepsilon)$-achievable if and only if
$$\log M_{1n} \ge nH(X_1|X_2)+\cdots$$
Chapter 7
A Special Class of Gaussian Interference Channels
(7.1)
for some $P_Q$, $P_{X_1|Q}$ and $P_{X_2|Q}$, where $Q$ is known as the time-sharing random variable. In the Gaussian case, which Carleial [22] studied, the above region can be written more explicitly as
$$R_1\le\mathsf C(\mathrm{snr}_1) \quad\text{and}\quad R_2\le\mathsf C(\mathrm{snr}_2), \qquad (7.2)$$
where $\mathrm{snr}_j$ is the signal-to-noise ratio of the direct channel from sender $j$ to receiver $j$ and the Gaussian capacity function is defined as $\mathsf C(\mathrm{snr}):=\frac12\log(1+\mathrm{snr})$. See Fig. 7.2 for an illustration of the capacity region, and the monograph by Shang and Chen [140] for further discussions on Gaussian interference channels. Carleial's result is surprising because it appears that interference does not reduce the capacity of the constituent channels, since $\mathsf C(\mathrm{snr}_j)$ is the capacity of the $j$th channel. In Carleial's own words [22],
Very strong interference is as innocuous as no interference at all.
Similarly to the discrete case in (7.1), the (first-order optimal) achievability proof strategy for the Gaussian case involves first decoding the interference, subtracting it off from the received channel output and, finally, reliably decoding the intended message. The VSI condition ensures that the rate constraints in (7.2), representing the requirement for the second decoding steps to succeed, dominate.
In this chapter, we make a slightly stronger assumption compared to that made by Carleial [22]. We assume that the inequalities that define the VSI condition are strict; we call this the strictly VSI (SVSI) condition.
[Figure 7.1: The interference channel: encoders $f_1$ and $f_2$ map the messages $m_1$ and $m_2$ to the inputs $x_1$ and $x_2$; the channel $W$ produces the outputs $y_1$ and $y_2$, from which the decoders form the estimates $\hat m_1$ and $\hat m_2$.]
7.1
Let us now state the Gaussian IC problem. The Gaussian IC is defined by the following input-output relation:
$$Y_{1i} = g_{11}X_{1i}+g_{12}X_{2i}+Z_{1i}, \qquad (7.3)$$
$$Y_{2i} = g_{21}X_{1i}+g_{22}X_{2i}+Z_{2i}, \qquad (7.4)$$
where $i=1,\dots,n$, the $g_{jk}$ are the channel gains from sender $k$ to receiver $j$, and $Z_{1i}\sim\mathcal N(0,1)$ and $Z_{2i}\sim\mathcal N(0,1)$ are independent noise components.¹ Thus, the channel from $(x_1,x_2)$ to $(y_1,y_2)$ is
$$W(y_1,y_2|x_1,x_2) = \frac{1}{2\pi}\exp\left(-\frac12\left\|\begin{bmatrix}y_1\\y_2\end{bmatrix}-\begin{bmatrix}g_{11}&g_{12}\\g_{21}&g_{22}\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix}\right\|_2^2\right). \qquad (7.5)$$
¹ The independence assumption between $Z_{1i}$ and $Z_{2i}$ was not made in Carleial's work [22] (i.e., $Z_{1i}$ and $Z_{2i}$ may be correlated), but we need this assumption for the analyses here. It is well known that the capacity region of any general IC depends only on the marginals [49, Ch. 6], but it is, in general, not true that the set of achievable second-order rates $\mathcal L(\varepsilon;R_1,R_2)$, defined in (7.17), has the same property. This will become clear in the proof of Theorem 7.1 in the text following (7.32).
Let $W_1$ and $W_2$ denote the marginals of $W$. The channel acts in a stationary, memoryless way, so
$$W^n(y_1,y_2|x_1,x_2) = \prod_{i=1}^n W(y_{1i},y_{2i}|x_{1i},x_{2i}). \qquad (7.6)$$
It will be convenient to make the dependence of the code on the blocklength explicit right away. We define an $(n,M_1,M_2,S_1,S_2,\varepsilon)$-code for the Gaussian IC as four maps that consist of two encoders $f_j:\{1,\dots,M_j\}\to\mathbb R^n$, $j=1,2$, and two decoders $\varphi_j:\mathbb R^n\to\{1,\dots,M_j\}$ such that the following power constraints² are satisfied:
$$\|f_j(m_j)\|_2^2 = \sum_{i=1}^n f_{ji}(m_j)^2 \le nS_j, \qquad (7.7)$$
and, denoting $\mathcal D_{m_1,m_2}:=\{(y_1,y_2):\varphi_1(y_1)=m_1\text{ and }\varphi_2(y_2)=m_2\}$ as the decoding region for $(m_1,m_2)$, the average error probability satisfies
$$\frac{1}{M_1M_2}\sum_{m_1=1}^{M_1}\sum_{m_2=1}^{M_2}W^n\left(\mathbb R^n\times\mathbb R^n\setminus\mathcal D_{m_1,m_2}\,\middle|\,f_1(m_1),f_2(m_2)\right)\le\varepsilon. \qquad (7.8)$$
In (7.7), $S_1$ and $S_2$ are the admissible powers of the codewords $f_1(m_1)$ and $f_2(m_2)$. The signal-to-noise ratios of the direct channels are
$$\mathrm{snr}_1 := g_{11}^2 S_1, \quad\text{and}\quad \mathrm{snr}_2 := g_{22}^2 S_2, \qquad (7.9)$$
and the interference-to-noise ratios of the cross channels are
$$\mathrm{inr}_1 := g_{12}^2 S_2, \quad\text{and}\quad \mathrm{inr}_2 := g_{21}^2 S_1. \qquad (7.10)$$
We say that the Gaussian IC $W$, together with the transmit powers $(S_1,S_2)$, is in the VSI regime if the signal- and interference-to-noise ratios satisfy
$$\mathrm{snr}_1 \le \frac{\mathrm{inr}_2}{1+\mathrm{snr}_2}, \quad\text{and} \qquad (7.11)$$
$$\mathrm{snr}_2 \le \frac{\mathrm{inr}_1}{1+\mathrm{snr}_1}. \qquad (7.12)$$
The Gaussian IC is in the SVSI regime if the inequalities in (7.11)-(7.12) are strict. Intuitively, the VSI (or SVSI) assumption means that the cross channel gains $g_{12}$ and $g_{21}$ are much stronger (larger) than the direct gains $g_{11}$ and $g_{22}$ for given transmit powers $S_1$ and $S_2$.
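A tiny sketch makes the definition concrete; the gains and powers below are arbitrary examples, not values from the text.

```python
def in_svsi(g11, g12, g21, g22, S1, S2):
    """Check the strict inequalities of (7.11)-(7.12) for given real channel
    gains g_jk (sender k -> receiver j) and admissible powers S1, S2."""
    snr1, snr2 = g11 ** 2 * S1, g22 ** 2 * S2   # direct-channel SNRs, (7.9)
    inr1, inr2 = g12 ** 2 * S2, g21 ** 2 * S1   # cross-channel INRs, (7.10)
    return snr1 < inr2 / (1.0 + snr2) and snr2 < inr1 / (1.0 + snr1)

# Cross gains 10x the direct gains: comfortably in the SVSI regime.
assert in_svsi(1.0, 10.0, 10.0, 1.0, 1.0, 1.0)
# All gains equal: the interference is not strong enough.
assert not in_svsi(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
```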
We now state non-asymptotic bounds that are evaluated asymptotically later. The proofs of these bounds are standard; see [66] or [103].
Proposition 7.1 (Achievability bound for the IC). Fix any input distributions $P_{X_1^n}$ and $P_{X_2^n}$ whose supports satisfy the power constraints in (7.7), i.e., $\|X_j^n\|_2^2\le nS_j$ with probability one. For every $n\in\mathbb N$, every $\gamma>0$ and any choice of (conditional) output distributions $Q_{Y_1^n|X_2^n}$, $Q_{Y_2^n|X_1^n}$, $Q_{Y_1^n}$ and $Q_{Y_2^n}$, there exists an $(n,M_1,M_2,S_1,S_2,\varepsilon)$-code for the IC such that
$$\varepsilon \le \Pr\left[\log\frac{W_1^n(Y_1^n|X_1^n,X_2^n)}{Q_{Y_1^n|X_2^n}(Y_1^n|X_2^n)}\le\log M_1+n\gamma \;\text{ or }\; \log\frac{W_2^n(Y_2^n|X_1^n,X_2^n)}{Q_{Y_2^n|X_1^n}(Y_2^n|X_1^n)}\le\log M_2+n\gamma \;\text{ or }\;\cdots\right]+\kappa\exp(-n\gamma), \qquad (7.13)$$
where the two elided events are analogous and involve the unconditional output distributions $Q_{Y_1^n}$ and $Q_{Y_2^n}$, and $\kappa := \sum_{k=1}^2\sum_{j=1}^2\kappa_{jk}$ with the $\kappa_{jk}$ being the suprema of the ratios of the induced output distributions to the chosen output distributions, i.e.,
$$\kappa_{11} := \sup_{x_2,y_1}\frac{P_{Y_1^n|X_2^n}(y_1|x_2)}{Q_{Y_1^n|X_2^n}(y_1|x_2)}, \qquad \kappa_{21} := \sup_{x_1,y_2}\frac{P_{Y_2^n|X_1^n}(y_2|x_1)}{Q_{Y_2^n|X_1^n}(y_2|x_1)}, \qquad (7.14)$$
$$\kappa_{12} := \sup_{y_1}\frac{P_{Y_1^n}(y_1)}{Q_{Y_1^n}(y_1)}, \qquad \kappa_{22} := \sup_{y_2}\frac{P_{Y_2^n}(y_2)}{Q_{Y_2^n}(y_2)}. \qquad (7.15)$$
This is a generalization of the average error version of Feinstein's lemma [53] (Proposition 4.1). Notice that we have the freedom to choose the output distributions, at the cost of having to control the ratios $\kappa_{jk}$ of the induced output distributions and our choice of output distributions.
Proposition 7.2 (Converse bound for the IC). For every $n\in\mathbb N$, every $\gamma>0$ and any choice of (conditional) output distributions $Q_{Y_1^n|X_2^n}$ and $Q_{Y_2^n|X_1^n}$, every $(n,M_1,M_2,S_1,S_2,\varepsilon)$-code for the IC must satisfy
$$\varepsilon \ge \Pr\left[\log\frac{W_1^n(Y_1^n|X_1^n,X_2^n)}{Q_{Y_1^n|X_2^n}(Y_1^n|X_2^n)}\le\log M_1-n\gamma \;\text{ or }\; \log\frac{W_2^n(Y_2^n|X_1^n,X_2^n)}{Q_{Y_2^n|X_1^n}(Y_2^n|X_1^n)}\le\log M_2-n\gamma\right]-2\exp(-n\gamma) \qquad (7.16)$$
for some input distributions $P_{X_1^n}$ and $P_{X_2^n}$ whose supports satisfy the power constraints in (7.7).
Observe the following features of the non-asymptotic converse, which is a generalization of the ideas of Verdú-Han [169, Lem. 4] and Hayashi-Nagaoka [77, Lem. 4]: First, there are only two error events, compared to the four in the achievability bound. The SVSI assumption allows us to eliminate two error events in the direct bound, so the two bounds match in the second-order sense. Second, we are free to choose output distributions without any penalty (cf. the achievability bound in Proposition 7.1). Third, the intuition behind this bound is in line with the SVSI assumption, namely that decoder 1 knows the codeword $X_2^n$ and vice versa. Indeed, the proof of Proposition 7.2 uses this genie-aided idea.
7.2 Second-Order Asymptotics
Similarly to the study of the second-order asymptotics for the Slepian-Wolf problem, we are interested in deviations of order $O(1/\sqrt n)$ from the boundary of the capacity region of the Gaussian IC under the SVSI assumption. This motivates the following definition.
Let $(R_1,R_2)$ be a point on the boundary of the capacity region in (7.2). Let $(L_1,L_2)\in\mathbb R^2$ be called an achievable $(\varepsilon,R_1,R_2)$-second-order coding rate pair if there exists a sequence of $(n,M_{1n},M_{2n},S_1,S_2,\varepsilon_n)$-codes for the Gaussian IC such that
$$\limsup_{n\to\infty}\varepsilon_n\le\varepsilon, \quad\text{and}\quad \liminf_{n\to\infty}\frac{1}{\sqrt n}\left(\log M_{jn}-nR_j\right)\ge L_j \qquad (7.17)$$
for $j=1,2$. The set of all achievable $(\varepsilon,R_1,R_2)$-second-order coding rate pairs is denoted as $\mathcal L(\varepsilon;R_1,R_2)\subseteq\mathbb R^2$. The intuition behind this definition is exactly analogous to the Slepian-Wolf case.
Define $V_j := \mathsf V(\mathrm{snr}_j)$ where, recall from (4.79),
$$\mathsf V(\mathrm{snr}) = \frac{\mathrm{snr}(\mathrm{snr}+2)}{2(\mathrm{snr}+1)^2}\log^2 e \qquad (7.18)$$
is the Gaussian dispersion function.
Figure 7.2: Illustration of the different cases in Theorem 7.1. For brevity, we write $C_j = \mathsf C(\mathrm{snr}_j)$ for $j=1,2$.
Theorem 7.1. Let the Gaussian IC $W$, together with the transmit powers $(S_1,S_2)$, be in the SVSI regime. Depending on $(R_1,R_2)$ (see Fig. 7.2), there are 3 different cases:
Case (i): $R_1=\mathsf C(\mathrm{snr}_1)$ and $R_2<\mathsf C(\mathrm{snr}_2)$ (vertical boundary)
$$\mathcal L(\varepsilon;R_1,R_2) = \left\{(L_1,L_2) : L_1\le\sqrt{V_1}\,\Phi^{-1}(\varepsilon)\right\}. \qquad (7.19)$$
Case (ii): $R_1<\mathsf C(\mathrm{snr}_1)$ and $R_2=\mathsf C(\mathrm{snr}_2)$ (horizontal boundary)
$$\mathcal L(\varepsilon;R_1,R_2) = \left\{(L_1,L_2) : L_2\le\sqrt{V_2}\,\Phi^{-1}(\varepsilon)\right\}. \qquad (7.20)$$
Case (iii): $R_1=\mathsf C(\mathrm{snr}_1)$ and $R_2=\mathsf C(\mathrm{snr}_2)$ (corner point)
$$\mathcal L(\varepsilon;R_1,R_2) = \left\{(L_1,L_2) : \Phi\left(\frac{-L_1}{\sqrt{V_1}}\right)\Phi\left(\frac{-L_2}{\sqrt{V_2}}\right)\ge 1-\varepsilon\right\}. \qquad (7.21)$$
A proof sketch of this result is provided in Section 7.3. The region $\mathcal L(\varepsilon;R_1,R_2)$ for Case (iii) is sketched in Fig. 7.3 for the symmetric case in which $V_1=V_2$.
A few remarks are in order: First, for Case (i), $\mathcal L(\varepsilon;R_1,R_2)$ depends only on $\varepsilon$ and $V_1$. Note that $\sqrt{V_1}\,\Phi^{-1}(\varepsilon)$ is the optimum (maximum) second-order coding rate of the AWGN channel (Theorem 4.4) from $X_1$ to $Y_1$ when there is no interference, i.e., $g_{12}=0$ in (7.3). The fact that user 2's parameters do not feature in (7.19) is because $R_2<\mathsf C(\mathrm{snr}_2)$. This implies that channel 2 operates in the large deviations (error exponents) regime, so the second constraint in (7.2) does not feature in the second-order analysis, since the error probability of decoding message 2 is exponentially small. An analogous observation was also made for the Slepian-Wolf problem in Chapter 6.
Second, notice that for Case (iii), $\mathcal L(\varepsilon;R_1,R_2)$ is a function of $\varepsilon$ and both $V_1$ and $V_2$, as we are operating at rates near the corner point of the capacity region. Both constraints in the capacity region in (7.2) are active. We provide an intuitive reasoning for the result in (7.21). Let $\mathcal G_j$ denote the event that message $j=1,2$ is decoded correctly. The error probability criterion in (7.8) can be rewritten as
$$\Pr\left(\mathcal G_1\cap\mathcal G_2\right)\ge 1-\varepsilon. \qquad (7.22)$$
Assuming independence of the events $\mathcal G_1$ and $\mathcal G_2$, which is generally not true in an IC because of interfering signals,
$$\Pr(\mathcal G_1)\Pr(\mathcal G_2)\ge 1-\varepsilon. \qquad (7.23)$$
Given that the number of messages for codebook $j$ satisfies
$$M_{jn} = \exp\left(nR_j+\sqrt n\,L_j+o(\sqrt n)\right), \qquad (7.24)$$
Figure 7.3: Illustration of the region $\mathcal L(\varepsilon;R_1,R_2)$ in Case (iii) with $\varepsilon=10^{-3}$, for $V_1=V_2=2$ and $V_1=V_2=3$ (axes: $L_1$ and $L_2$ in nats per square-root channel use). The regions are to the bottom left of the boundaries indicated.
the optimum probability of correct detection satisfies (cf. Theorem 4.4)
$$\Pr(\mathcal G_j) = \Phi\left(\frac{-L_j}{\sqrt{V_j}}\right)+o(1), \qquad (7.25)$$
which then (heuristically) justifies (7.21). The proof makes the steps from (7.22)-(7.25) rigorous. Since $V_1$ and $V_2$ are the dispersions of the Gaussian channels without interference, this is the second-order analogue of Carleial's result for Gaussian ICs in the VSI regime [22], because the dispersions are not affected. Note that no cross dispersion terms are present in (7.21), unlike the Slepian-Wolf problem, where the correlation of two different entropy densities appears in the characterization of $\mathcal L(\varepsilon;R_1,R_2)$ for corner points $(R_1,R_2)$.
Finally, it is somewhat surprising that in the converse, even though we must ensure that the codewords $X_1^n$ and $X_2^n$ are independent, we do not need to leverage the wringing technique invented by Ahlswede [3], which was used to prove that the discrete memoryless MAC admits a strong converse. This is thanks to Gaussianity, which allows us to show that the first- and second-order statistics of a certain set of information densities in (7.29)-(7.30) are independent of $x_1$ and $x_2$ belonging to their respective power spheres.
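The corner-point region (7.21) is easy to trace numerically. The sketch below — reproducing the flavor of Fig. 7.3 with the illustrative values $V_1=V_2=2$ and $\varepsilon=10^{-3}$ — computes, for a given $L_1$, the largest $L_2$ on the boundary. $\Phi$ and its inverse are implemented from scratch so that the snippet is self-contained; the function names are ours.

```python
import math

def Phi(x):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Inverse Gaussian cdf by bisection (p strictly inside (0, 1))."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

def L2_boundary(L1, V1, V2, eps):
    """Largest L2 with Phi(-L1/sqrt(V1)) * Phi(-L2/sqrt(V2)) >= 1 - eps, per (7.21).
    Returns None when no L2 works, i.e. when Phi(-L1/sqrt(V1)) < 1 - eps."""
    p1 = Phi(-L1 / math.sqrt(V1))
    if p1 <= 1.0 - eps:
        return None
    return -math.sqrt(V2) * Phi_inv((1.0 - eps) / p1)

L2b = L2_boundary(-6.0, 2.0, 2.0, 1e-3)
# Both second-order rates must be sufficiently negative; sweeping L1 and
# recording L2b traces the boundary sketched in Fig. 7.3.
```

Because the constraint is a product of two univariate cdfs, making $L_1$ more negative buys room for $L_2$, and the trade-off curve bends exactly as in the figure.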
7.3 Proof Sketch of Theorem 7.1
The proof of Theorem 7.1 is somewhat long and tedious, so we only sketch the key steps and refer the reader to [103] for the detailed calculations.
Proof. We begin with the converse. We may assume, using the same argument as that for the point-to-point AWGN channel (cf. the Yaglom map trick [28, Ch. 9, Thm. 6] in the proof of Theorem 4.4), that all the codewords $x_j(m_j)$ satisfy $\|x_j(m_j)\|_2^2=nS_j$, $j=1,2$. Choose the auxiliary output distributions in Proposition 7.2 to be
$$Q_{Y_1^n|X_2^n}(y_1|x_2) = \prod_{i=1}^n\mathcal N\left(y_{1i};\,g_{12}x_{2i},\,g_{11}^2S_1+1\right), \quad\text{and} \qquad (7.26)$$
$$Q_{Y_2^n|X_1^n}(y_2|x_1) = \prod_{i=1}^n\mathcal N\left(y_{2i};\,g_{21}x_{1i},\,g_{22}^2S_2+1\right). \qquad (7.27)$$
These are the output distributions induced when the input distributions $P_{X_1^n}$ and $P_{X_2^n}$ are $n$-fold products of $\mathcal N(0,S_1)$ and $\mathcal N(0,S_2)$, respectively. Fix any achievable $(\varepsilon,R_1,R_2)$-second-order coding rate pair $(L_1,L_2)$, i.e., $(L_1,L_2)\in\mathcal L(\varepsilon;R_1,R_2)$. Then, for every $\xi>0$, every sequence of $(n,M_{1n},M_{2n},S_1,S_2,\varepsilon_n)$-codes satisfies
$$\log M_{jn}\ge nR_j+\sqrt n(L_j-\xi), \quad j=1,2, \qquad (7.28)$$
for $n$ large enough.
for n large enough. To keep our notation succinct, define the information densities
  j1(x1, x2, Y1^n) := log [ W1^n(Y1^n | x1, x2) / Q_{Y1^n|X2^n}(Y1^n | x2) ] = Σ_{i=1}^n log [ W1(Y1i | x1i, x2i) / Q_{Y1|X2}(Y1i | x2i) ],    (7.29)

and

  j2(x1, x2, Y2^n) := log [ W2^n(Y2^n | x1, x2) / Q_{Y2^n|X1^n}(Y2^n | x1) ] = Σ_{i=1}^n log [ W2(Y2i | x1i, x2i) / Q_{Y2|X1}(Y2i | x1i) ].    (7.30)
Let Cj := C(snr_j) for j = 1, 2. For any pair of vectors (x1, x2) satisfying ‖x_j‖₂² = nS_j,

  E[ ( j1(x1, x2, Y1^n), j2(x1, x2, Y2^n) )ᵀ ] = n (C1, C2)ᵀ,   and    (7.31)

  Cov[ ( j1(x1, x2, Y1^n), j2(x1, x2, Y2^n) )ᵀ ] = n [ V1, 0 ; 0, V2 ].    (7.32)
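The per-letter statistics in (7.31)–(7.32) are governed by the Gaussian capacity and dispersion functions, C(x) = ½ log(1 + x) and V(x) = x(x + 2)/(2(1 + x)²) in nats. A minimal numerical sketch (the snr values below are arbitrary illustrations):

```python
import math

def C(x):
    # Gaussian capacity (nats)
    return 0.5 * math.log(1.0 + x)

def V(x):
    # Gaussian dispersion (nats^2); equals (1 - (1+x)^-2)/2,
    # so it increases from 0 toward 1/2
    return x * (x + 2.0) / (2.0 * (1.0 + x) ** 2)

snr1, snr2 = 1.0, 2.0      # arbitrary example values
C1, C2 = C(snr1), C(snr2)
V1, V2 = V(snr1), V(snr2)
assert 0.0 < V1 < V2 < 0.5  # V is increasing and bounded by 1/2
```

The bound V(x) < 1/2 is what makes the √n back-off terms in (7.21) finite for any positive snr.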
Importantly, notice that the covariance matrix in (7.32) is diagonal. This is due to the independence of the noises Z1i and Z2i, and it is the crux of the converse proof for the corner-point case in (7.21).
Now let λ := n^{−3/4} in the probability in the non-asymptotic converse bound in (7.16). We denote this probability as p. By the law of total probability, the complementary probability 1 − p can be written as

  1 − p = ∫∫ Pr[ ( j1(x1, x2, Y1^n), j2(x1, x2, Y2^n) )ᵀ > ( log M1n − n^{1/4}, log M2n − n^{1/4} )ᵀ ] dP_{X1^n}(x1) dP_{X2^n}(x2).    (7.33)
By (7.28), for large enough n, the inner probability can be bounded as

  Pr[ ( j1(x1, x2, Y1^n), j2(x1, x2, Y2^n) )ᵀ > ( log M1n − n^{1/4}, log M2n − n^{1/4} )ᵀ ]

  ≥ Pr[ ( j1(x1, x2, Y1^n), j2(x1, x2, Y2^n) )ᵀ > ( nR1 + √n(L1 − 2γ), nR2 + √n(L2 − 2γ) )ᵀ ]    (7.34)

  ≥ Ψ( √n(C1 − R1) − L1 + 2γ, √n(C2 − R2) − L2 + 2γ; 0, [ V1, 0 ; 0, V2 ] ) − ψ/√n    (7.35)

  = Π_{j=1}^{2} Φ( [ √n(Cj − Rj) − Lj + 2γ ] / √Vj ) − ψ/√n,    (7.36)
where (7.35) is an application of the multivariate Berry–Esseen theorem (Corollary 1.1) and ψ is a finite constant. Here Ψ denotes the bivariate generalization of the Gaussian cdf Φ, defined in (6.11). Equality (7.36) holds because the covariance matrix in (7.35) is diagonal by the calculation in (7.32). Since the bound in (7.36) does not depend on x1 and x2 as long as ‖x_j‖₂² = nS_j, we have
  1 − p ≥ Π_{j=1}^{2} Φ( [ √n(Cj − Rj) − Lj + 2γ ] / √Vj ) − ψ/√n.    (7.37)
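The step from (7.35) to (7.36) uses the fact that a bivariate Gaussian cdf with a diagonal covariance matrix factorizes into a product of univariate cdfs. A quick numerical check of this fact, with Φ implemented via the error function:

```python
import math

def Phi(z):
    # standard Gaussian cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Psi_diag(z1, z2, V1, V2):
    # Psi(z1, z2; 0, diag(V1, V2)): with a diagonal covariance the two
    # coordinates are independent, so the joint cdf is a product of marginals
    return Phi(z1 / math.sqrt(V1)) * Phi(z2 / math.sqrt(V2))

# at the origin each marginal contributes 1/2, so the joint cdf is 1/4
assert abs(Psi_diag(0.0, 0.0, 0.3, 0.7) - 0.25) < 1e-12
# a very large second argument reduces Psi to the first marginal,
# mirroring how the j = 2 factor in (7.37) converges to one in Case (i)
assert abs(Psi_diag(1.0, 60.0, 1.0, 1.0) - Phi(1.0)) < 1e-12
```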
In Case (i), R1 = C1 and R2 < C2, so the term corresponding to j = 2 in the above product converges to one and we have

  1 − p ≥ Φ( (−L1 + 2γ)/√V1 ) − η_n,    (7.38)

where η_n → 0 as n → ∞. Thus Proposition 7.2 yields

  1 − ε_n ≤ Φ( (−L1 + 2γ)/√V1 ) + η_n.    (7.39)

Taking the lim sup on both sides yields

  1 − ε ≤ Φ( (−L1 + 2γ)/√V1 ),    (7.40)

which, upon applying Φ⁻¹, gives

  L1 ≤ √V1 · Φ⁻¹(ε) + 2γ.    (7.41)
Since γ > 0 is arbitrarily small, we may take γ ↓ 0 to complete the proof of the converse part for Case (i). For Case (ii), swap the indices 1 and 2 in the above calculation. For Case (iii), the analysis up to (7.36) applies. However, now Rj = Cj for both j = 1, 2, so both Φ(·) factors in (7.36) are numbers strictly between 0 and 1. Consequently, we have
  1 − p ≥ Φ( (−L1 + 2γ)/√V1 ) · Φ( (−L2 + 2γ)/√V2 ) − ψ/√n.    (7.42)
The rest of the arguments are similar to those for Case (i).
For the direct part, similarly to the single-user case in (4.88), we choose the input distributions

  P_{Xj^n}(xj) = δ( ‖xj‖₂ − √(nSj) ) / A_n(√(nSj)),   j = 1, 2,    (7.43)

where δ(·) is the Dirac delta and A_n(r) is the surface area of a sphere in R^n with radius r. Clearly, the power constraints are satisfied with probability one. We choose the conditional output distributions Q_{Y1^n|X2^n} and Q_{Y2^n|X1^n} as in (7.26) and (7.27), and the output distributions Q_{Y1^n} and Q_{Y2^n} to be the n-fold products of

  Q_{Y1}(y1) := N( y1; 0, g11² S1 + g12² S2 + 1 ),   and    (7.44)

  Q_{Y2}(y2) := N( y2; 0, g21² S1 + g22² S2 + 1 ).    (7.45)
With these choices of auxiliary output distributions, one can show the following technical lemma concerning the ratios of the induced (conditional) output distributions to the chosen (conditional) output distributions in Proposition 7.1. This is the multiterminal analogue of (4.108) for the point-to-point AWGN channel, and it allows us to replace the inconvenient induced output distributions (which are present in standard Feinstein-type achievability bounds, for example [66]) with the convenient Q_{Y1^n|X2^n} and Q_{Y1^n} without too much degradation in the error probability.
Lemma 7.1. Let Q_{Y1^n}, Q_{Y2^n}, Q_{Y1^n|X2^n} and Q_{Y2^n|X1^n} be defined as the n-fold products of those in (7.44), (7.45), (7.26) and (7.27), respectively. Then there exists a finite constant κ such that each of the ratios in (7.14)–(7.15) is uniformly bounded by κ as n grows. Hence, their sum over j, k ∈ {1, 2} is also uniformly bounded.
The proof of this lemma can be found in [103] and [112].
Because X1^n and X2^n are uniform on their respective power spheres, it is not straightforward to analyze the behavior of the random vector

  B := ( B11, B21, B12, B22 )ᵀ,   where
  B11 := log [ W1^n(Y1^n | X1^n, X2^n) / Q_{Y1^n|X2^n}(Y1^n | X2^n) ],
  B21 := log [ W2^n(Y2^n | X1^n, X2^n) / Q_{Y2^n|X1^n}(Y2^n | X1^n) ],
  B12 := log [ W1^n(Y1^n | X1^n, X2^n) / Q_{Y1^n}(Y1^n) ],
  B22 := log [ W2^n(Y2^n | X1^n, X2^n) / Q_{Y2^n}(Y2^n) ],    (7.46)
which is present in (7.13). Note that, owing to the product structure of the chosen output distributions, B can be written as a sum of per-letter terms; these terms are dependent, however, because the coordinates of the spherically distributed codewords are dependent. To analyze the probabilistic behavior of B for large n, we leverage a technique by MolavianJazi and Laneman [112]. The basic ideas are as follows. Let Tj^n ~ N(0_n, I_{n×n}) for j = 1, 2 be standard Gaussian random vectors that are independent of each other and of the noises Zj^n. Note that the input distributions in (7.43) allow us to write Xji as

  Xji = √(nSj) · Tji / ‖Tj^n‖₂,   i = 1, …, n.    (7.47)

Indeed, ‖Xj^n‖₂² = nSj with probability one from the random code construction and (7.47). Now consider the length-10 random vector Ui := ( {Uj1i}_{j=1}^{4}, {Uj2i}_{j=1}^{4}, U9i, U10i ), where
  U11i := 1 − Z1i²,           U21i := g11 √S1 · T1i Z1i,    (7.48)
  U31i := g12 √S2 · T2i Z1i,    U41i := g11 g12 √(S1 S2) · T1i T2i,    (7.49)
  U12i := 1 − Z2i²,           U22i := g22 √S2 · T2i Z2i,    (7.50)
  U32i := g21 √S1 · T1i Z2i,    U42i := g21 g22 √(S1 S2) · T1i T2i,    (7.51)
  U9i := T1i² − 1,            U10i := T2i² − 1.    (7.52)
Clearly, the Ui are i.i.d. across channel uses. Furthermore, E[U1] = 0 and E[‖U1‖₂³] is finite. The covariance matrix of U1 can also be computed. Define the functions f11, f12 : R¹⁰ → R by

  f11(u) := snr1 · u11 + 2 u21 / √(1 + u9),    (7.53)

and

  f12(u) := (snr1 + inr1) u11 + 2 u21 / √(1 + u9) + 2 u31 / √(1 + u10) + 2 u41 / ( √(1 + u9) √(1 + u10) ),    (7.54)
for user 1, and analogously for user 2. Then, through some algebra, one sees that B11 and B12 can be written as

  B11 = n C(snr1) + [ n / (2(1 + snr1)) ] · f11( (1/n) Σ_{i=1}^n Ui ),   and    (7.55)

  B12 = n C(snr1 + inr1) + [ n / (2(1 + snr1 + inr1)) ] · f12( (1/n) Σ_{i=1}^n Ui ).    (7.56)
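The identity (7.55) can be verified mechanically for fixed (non-random) vectors. The numbers below are arbitrary, and the computation uses the channel model Y1i = g11 X1i + g12 X2i + Z1i, in which the g12 X2i term cancels inside the ratio defining B11:

```python
import math

def f11(u11, u21, u9, snr1):
    # cf. (7.53)
    return snr1 * u11 + 2.0 * u21 / math.sqrt(1.0 + u9)

n, S1, g11 = 8, 2.0, 0.7
snr1 = g11 * g11 * S1
T1 = [0.3, -1.2, 0.5, 2.0, -0.4, 1.1, -0.9, 0.6]   # arbitrary values
Z1 = [0.1, 0.4, -1.5, 0.2, 0.8, -0.3, 1.0, -0.7]
nrm = math.sqrt(sum(t * t for t in T1))
X1 = [math.sqrt(n * S1) * t / nrm for t in T1]       # cf. (7.47)

# direct evaluation of B11 = sum_i log W1(Y1i|x1i,x2i)/Q_{Y1|X2}(Y1i|x2i)
C1 = 0.5 * math.log(1.0 + snr1)
B11 = sum(-0.5 * z * z + (g11 * x + z) ** 2 / (2.0 * (1.0 + snr1)) + C1
          for x, z in zip(X1, Z1))

# the same quantity via (7.55): n*C1 + n/(2(1+snr1)) * f11(empirical means)
u11 = sum(1.0 - z * z for z in Z1) / n
u21 = sum(g11 * math.sqrt(S1) * t * z for t, z in zip(T1, Z1)) / n
u9 = sum(t * t - 1.0 for t in T1) / n
B11_via_f = n * C1 + n / (2.0 * (1.0 + snr1)) * f11(u11, u21, u9, snr1)
assert abs(B11 - B11_via_f) < 1e-9
```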
The other random variables in the B vector can be expressed similarly.
From (7.55)–(7.56), we are able to see the essence of the MolavianJazi–Laneman [112] technique. The information densities Bjk, j, k = 1, 2, were initially difficult to analyze because the input random vectors Xj^n in (7.43) are uniform on power spheres. This choice of input distributions results in codewords Xj^n whose coordinates are dependent, so standard limit theorems do not readily apply. By defining the higher-dimensional random vectors Ui and the appropriate functions fjk, one sees that B can be expressed as a function of a sum of i.i.d. random vectors. One may then consider a Taylor expansion of the differentiable functions fjk around the mean 0 to approximate B with a sum of i.i.d. random vectors. Through this analysis, one can rigorously show that

  (1/√n) ( B − n ( C(snr1), C(snr2), C(snr1 + inr1), C(snr2 + inr2) )ᵀ )  →d  N( 0, [ V1, 0, ∗, ∗ ; 0, V2, ∗, ∗ ; ∗, ∗, ∗, ∗ ; ∗, ∗, ∗, ∗ ] ),    (7.57)

where the entries marked as ∗ are finite and inconsequential for the subsequent analyses. Recall also that Vj = V(snrj) for j = 1, 2. In fact, the rate of convergence to Gaussianity in (7.57) can be quantified by means of Theorem 1.5.
With these preparations, we are ready to evaluate the probability in the direct bound in (7.13), which we denote as p. We consider all three cases in tandem. Fix (L1, L2) ∈ L(ε; R1, R2). Let the number of codewords in the j-th codebook be

  Mjn = ⌊ exp( nRj + √n Lj − 2 n^{1/4} ) ⌋    (7.58)

for j = 1, 2. It is clear that

  lim inf_{n→∞} (1/√n) ( log Mjn − nRj ) ≥ Lj.    (7.59)
Also let λ := n^{−3/4}, so that nλ = n^{1/4}. With these choices, the complementary probability 1 − p can be expressed as

  1 − p = Pr( B > t ),    (7.60)

where, in view of (7.13), the threshold vector is

  t := ( log M1n + n^{1/4}, log M2n + n^{1/4}, log(M1n M2n) + n^{1/4}, log(M1n M2n) + n^{1/4} )ᵀ,    (7.61)

which, by (7.58), is componentwise upper bounded as

  t ≤ ( nR1 + √n L1 − n^{1/4}, nR2 + √n L2 − n^{1/4}, n(R1 + R2) + √n(L1 + L2), n(R1 + R2) + √n(L1 + L2) )ᵀ.    (7.62)
Since the expectations of B12 and B22, namely n C(snr1 + inr1) and n C(snr2 + inr2), are strictly larger than the sum-rate thresholds of order n(R1 + R2) (cf. (7.61)), by standard Chernoff bounding techniques,

  1 − p ≥ Pr[ ( B11, B21 )ᵀ > ( nR1 + √n L1 − n^{1/4}, nR2 + √n L2 − n^{1/4} )ᵀ ] − 2 exp(−δn)    (7.65)

for some δ > 0. Just as in the converse, one can then analyze this probability for the various cases using the convergence to Gaussianity in (7.57). This completes the proof of the direct part.
Chapter 8
[Figure 8.1: Block diagram of the Gaussian AMAC. Encoder f1 maps (m1, m2) to x1, encoder f2 maps m2 to x2, the channel W outputs y, and the decoder produces the estimates (m̂1, m̂2).]
8.1
  Yi = X1i + X2i + Zi,    (8.2)

where i = 1, …, n and Zi ~ N(0, 1) is white Gaussian noise. The channel gains are set to unity without loss of generality. Thus, the channel transition law is

  W(y | x1, x2) = (1/√(2π)) exp( −(y − x1 − x2)² / 2 ).    (8.3)
The channel operates in a stationary and memoryless manner.
We define an (n, M1, M2, S1, S2, ε)-code for the Gaussian AMAC, which includes two encoders f1 : {1, …, M1} × {1, …, M2} → R^n and f2 : {1, …, M2} → R^n, and a decoder φ : R^n → {1, …, M1} × {1, …, M2}, such that the following power constraints are satisfied:

  ‖f1(m1, m2)‖₂² = Σ_{i=1}^n f1i(m1, m2)² ≤ nS1,   and    (8.4)

  ‖f2(m2)‖₂² = Σ_{i=1}^n f2i(m2)² ≤ nS2.    (8.5)
[Plot: capacity region boundary in the (R1, R2)-plane (nats/use), with the three cases (i)–(iii) of Theorem 8.1 and the approach directions v and v′ marked; boundary curves shown for ρ = 0, 1/3, 2/3.]

Figure 8.2: Capacity region of a Gaussian AMAC where S1 = S2 = 1. The three cases of Theorem 8.1 are illustrated. Each ρ ∈ (0, 1] corresponds to a trapezoid of rate pairs achievable by a unique input distribution N(0, Σ(ρ)). However, coding with a fixed input distribution is insufficient to achieve all angles of approach to a boundary point, as there are regions within the capacity region not contained in the trapezoid parametrized by ρ. For ρ = 2/3, one can approach the corner point in the direction indicated by the vector v using the fixed input distribution N(0, Σ(2/3)), but the same is not true of the direction indicated by v′, since the approach is from outside the trapezoid.
The average error probability is required to satisfy

  (1/(M1 M2)) Σ_{m1=1}^{M1} Σ_{m2=1}^{M2} W^n( D^c_{m1,m2} | f1(m1, m2), f2(m2) ) ≤ ε.    (8.6)

As with the Gaussian IC discussed in the previous chapter, D_{m1,m2} denotes the decoding region for the message pair (m1, m2), and Sj represents the admissible power for the j-th user.
The following non-asymptotic bounds are easily derived. They are analogues of the bounds by Feinstein [53] and Verdú–Han [169] (or Hayashi–Nagaoka [77]). See Boucheron–Salamatian [20] for the proofs of similar results.
Proposition 8.1 (Achievability bound for the AMAC). Fix any input joint distribution P_{X1^n X2^n} whose support satisfies the power constraints in (8.4)–(8.5), i.e., ‖Xj^n‖₂² ≤ nSj with probability one. For every n ∈ N, every λ > 0, any choice of output distributions Q_{Y^n|X2^n} and Q_{Y^n}, and any two sets A1 ⊆ X2^n × Y^n and A12 ⊆ Y^n, there exists an (n, M1, M2, S1, S2, ε)-code for the AMAC such that

  ε ≤ Pr[ log ( W^n(Y^n | X1^n, X2^n) / Q_{Y^n|X2^n}(Y^n | X2^n) ) ≤ log M1 + nλ   or
          log ( W^n(Y^n | X1^n, X2^n) / Q_{Y^n}(Y^n) ) ≤ log(M1 M2) + nλ ]
      + Pr[ (X2^n, Y^n) ∉ A1 ] + Pr[ Y^n ∉ A12 ] + κ exp(−nλ),    (8.7)

where κ = κ1 + κ12 and

  κ1 := sup_{(x2, y) ∈ A1} [ P_{Y^n|X2^n}(y | x2) / Q_{Y^n|X2^n}(y | x2) ],
  κ12 := sup_{y ∈ A12} [ P_{Y^n}(y) / Q_{Y^n}(y) ],    (8.8)

with P_{Y^n|X2^n} and P_{Y^n} denoting the (conditional) output distributions induced by P_{X1^n X2^n} and W^n.
Again notice that our freedom to choose Q_{Y^n|X2^n} and Q_{Y^n} results in the need to control κ1 and κ12, which are the maximum values of the ratios of the densities induced by the code to the chosen output densities. The maxima are restricted to the typical values of (x2, y) and y indicated by the chosen sets A1 and A12.
Proposition 8.2 (Converse bound for the AMAC). For every n ∈ N, every λ > 0, and any choice of output distributions Q_{Y^n|X2^n} and Q_{Y^n}, every (n, M1, M2, S1, S2, ε)-code for the AMAC must satisfy

  ε ≥ Pr[ log ( W^n(Y^n | X1^n, X2^n) / Q_{Y^n|X2^n}(Y^n | X2^n) ) ≤ log M1 − nλ   or
          log ( W^n(Y^n | X1^n, X2^n) / Q_{Y^n}(Y^n) ) ≤ log(M1 M2) − nλ ] − 2 exp(−nλ),    (8.9)

for some input joint distribution P_{X1^n X2^n} whose support satisfies the power constraints in (8.4)–(8.5).
8.2 Second-Order Asymptotics
As in the previous chapters on multiterminal problems, given a point (R1, R2) on the boundary of the capacity region, we are interested in characterizing the set of all (L1, L2) pairs for which there exists a sequence of (n, M1n, M2n, S1, S2, εn)-codes such that

  lim inf_{n→∞} (1/√n) ( log Mjn − nRj ) ≥ Lj,   j = 1, 2,   and   lim sup_{n→∞} εn ≤ ε.    (8.10)

We denote this set as L(ε; R1, R2) ⊆ R².
8.2.1 Preliminary Definitions
Before we can state the main results, we need to define a few more fundamental quantities. For a pair of rates (R1, R2), the rate vector is

  R := ( R1, R1 + R2 )ᵀ.    (8.11)

The input distribution used to achieve a point on the boundary characterized by some ρ ∈ [0, 1] is a 2-dimensional Gaussian distribution with zero mean and covariance matrix

  Σ(ρ) := [ S1, ρ√(S1 S2) ; ρ√(S1 S2), S2 ].    (8.12)
The corresponding mutual information vector is given by

  I(ρ) = ( I1(ρ), I12(ρ) )ᵀ := ( C( S1(1 − ρ²) ), C( S1 + S2 + 2ρ√(S1 S2) ) )ᵀ.    (8.13)
Let

  V(x, y) := log²e · x(y + 2) / ( 2(x + 1)(y + 1) )    (8.14)

be the Gaussian cross-dispersion function, and note that V(x) := V(x, x) is the Gaussian dispersion function defined previously in (4.79). For fixed 0 ≤ ρ ≤ 1, define the information-dispersion matrix

  V(ρ) := [ V1(ρ), V1,12(ρ) ; V1,12(ρ), V12(ρ) ],    (8.15)

where the elements are

  V1(ρ) := V( S1(1 − ρ²) ),    (8.16)
  V1,12(ρ) := V( S1(1 − ρ²), S1 + S2 + 2ρ√(S1 S2) ),    (8.17)
  V12(ρ) := V( S1 + S2 + 2ρ√(S1 S2) ).    (8.18)
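A small numerical sketch of (8.14)–(8.18), working in nats (i.e., with the log² e factor equal to one); for ρ ∈ (0, 1) the resulting matrix is symmetric and positive definite:

```python
import math

def Vxy(x, y):
    # Gaussian cross-dispersion (8.14), in nats^2
    return x * (y + 2.0) / (2.0 * (x + 1.0) * (y + 1.0))

def dispersion_matrix(rho, S1, S2):
    # information-dispersion matrix V(rho), cf. (8.15)-(8.18)
    a = S1 * (1.0 - rho * rho)                    # argument of V1
    b = S1 + S2 + 2.0 * rho * math.sqrt(S1 * S2)  # argument of V12
    return [[Vxy(a, a), Vxy(a, b)], [Vxy(a, b), Vxy(b, b)]]

V = dispersion_matrix(0.5, 1.0, 1.0)
assert V[0][1] == V[1][0]                        # symmetric
assert V[0][0] > 0.0 and V[1][1] > 0.0
assert V[0][0] * V[1][1] - V[0][1] ** 2 > 0.0    # nonsingular for rho in (0, 1)
```

As noted later in the chapter, the determinant degenerates as ρ → 1, which is why Case (iii) of Theorem 8.1 needs separate treatment.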
Let (X1, X2) ~ P_{X1 X2} = N(0, Σ(ρ)) and define Q_{Y|X2} and Q_Y to be the Gaussian distributions induced by P_{X1 X2} and W, namely

  Q_{Y|X2}(y | x2) := N( y; x2 (1 + ρ√(S1/S2)), 1 + S1(1 − ρ²) ),   and    (8.19)

  Q_Y(y) := N( y; 0, 1 + S1 + S2 + 2ρ√(S1 S2) ).    (8.20)
It should be noted that the random variables (X1, X2) and the densities Q_{Y|X2} and Q_Y all depend on ρ; this dependence is suppressed throughout the chapter. The mutual information vector I(ρ) and the information-dispersion matrix V(ρ) are the mean vector and conditional covariance matrix of the information density vector

  j(x1, x2, y) := ( j1(x1, x2, y), j12(x1, x2, y) )ᵀ = ( log [ W(y | x1, x2) / Q_{Y|X2}(y | x2) ], log [ W(y | x1, x2) / Q_Y(y) ] )ᵀ.    (8.21)

That is, we can write I(ρ) and V(ρ) as

  I(ρ) = E[ j(X1, X2, Y) ],   and    (8.22)

  V(ρ) = E[ Cov( j(X1, X2, Y) | X1, X2 ) ],    (8.23)
with (X1, X2, Y) ~ P_{X1 X2} × W. We also need a generalization of the Φ⁻¹(·) function. Define the "inverse image" of Ψ(z1, z2; 0, Σ) as

  Ψ⁻¹(Σ, ε) := { (z1, z2) ∈ R² : Ψ(z1, z2; 0, Σ) ≥ 1 − ε }.    (8.24)

An illustration of this set is provided in Fig. 8.3. Observe that for ε < 1/2, the set lies entirely within the third quadrant of the R² plane; this represents backoffs from the first-order fundamental limits.
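Membership in the set (8.24) can be checked numerically. The bivariate Gaussian cdf Ψ reduces to a one-dimensional integral, evaluated below with a simple midpoint rule (a sketch, not a production-quality quadrature):

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Psi(z1, z2, V):
    # bivariate Gaussian cdf Psi(z1, z2; 0, V), via the conditional
    # decomposition Pr(Z1<=a, Z2<=b) = int_{-inf}^{a} phi(t) Phi((b-rt)/sqrt(1-r^2)) dt
    s1, s2 = math.sqrt(V[0][0]), math.sqrt(V[1][1])
    r = V[0][1] / (s1 * s2)
    a, b = z1 / s1, z2 / s2
    lo, steps = -8.0, 4000
    h = (a - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        total += phi(t) * Phi((b - r * t) / math.sqrt(1.0 - r * r))
    return total * h

def in_Psi_inv(z1, z2, V, eps):
    # membership in the set (8.24)
    return Psi(z1, z2, V) >= 1.0 - eps

V = [[1.0, 0.3], [0.3, 1.0]]
assert in_Psi_inv(-0.5, -0.5, V, 0.9)
assert not in_Psi_inv(-3.0, -3.0, V, 0.1)
```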
[Plot: boundaries of Ψ⁻¹(V(ρ), ε) in the (z1, z2)-plane, for ε = 0.5 and ε = 0.995, each with ρ = 0.01 and ρ = 0.80.]

Figure 8.3: Illustration of the set Ψ⁻¹(V(ρ), ε), where V(ρ) is defined in (8.15). The regions are to the bottom left of the boundaries indicated.
8.2.2
Here we provide inner and outer bounds on C(n, ε), defined to be the set of (R1, R2) pairs such that there exist codebooks of length n and rates at least R1 and R2 yielding an average error probability not exceeding ε. Let ḡ(ρ, ε, n) and g(ρ, ε, n) be arbitrary functions of ρ, ε and n for now, and define the inner and outer regions

  R_in(n, ε) := ∪_{0 ≤ ρ ≤ 1} { (R1, R2) : (R1, R1 + R2)ᵀ ∈ I(ρ) + (1/√n) Ψ⁻¹(V(ρ), ε) + ḡ(ρ, ε, n) 1 },    (8.28)

  R_out(n, ε) := ∪_{−1 ≤ ρ ≤ 1} { (R1, R2) : (R1, R1 + R2)ᵀ ∈ I(ρ) + (1/√n) Ψ⁻¹(V(ρ), ε) + g(ρ, ε, n) 1 }.    (8.29)

Lemma 8.1 asserts that R_in(n, ε) ⊆ C(n, ε) ⊆ R_out(n, ε) for appropriate choices of the remainder terms ḡ and g; see [138] for the precise statement.
Lemma 8.1 serves as a stepping stone to establish the local behavior of first-order optimal codes near a boundary point. A proof sketch of the lemma is provided in Section 8.3.1.

We remark that even though the union for the outer bound in (8.27) is taken over ρ ∈ [−1, 1], only the values ρ ∈ [0, 1] will play a role in establishing the local asymptotics in Section 8.2.3, since negative values of ρ are not even first-order optimal, i.e., they fail to achieve a point on the boundary of the capacity region. We do not claim that the remainder terms in (8.28)–(8.29) are uniform in the limiting value ε of {εn}_{n∈N}; such uniformity will not be required in establishing our main local result below. On the other hand, it is crucial that values of ρ varying with n are handled.
8.2.3
To characterize L(ε; R1, R2), we need yet another definition, which is a feature we have not encountered thus far in this monograph. Define

  D(ρ) = ( D1(ρ), D12(ρ) )ᵀ := ( ∂I1(ρ)/∂ρ, ∂I12(ρ)/∂ρ )ᵀ    (8.30)

to be the derivative of the mutual information vector with respect to ρ, where the individual derivatives are given by

  ∂I1(ρ)/∂ρ = −ρ S1 / ( 1 + S1(1 − ρ²) ),   and    (8.31)

  ∂I12(ρ)/∂ρ = √(S1 S2) / ( 1 + S1 + S2 + 2ρ√(S1 S2) ).    (8.32)
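The closed forms (8.31)–(8.32) follow from differentiating (8.13) in nats; a quick finite-difference check:

```python
import math

def I1(rho, S1):
    # I_1(rho) = C(S1 (1 - rho^2)), cf. (8.13), in nats
    return 0.5 * math.log(1.0 + S1 * (1.0 - rho * rho))

def I12(rho, S1, S2):
    # I_12(rho) = C(S1 + S2 + 2 rho sqrt(S1 S2)), cf. (8.13)
    return 0.5 * math.log(1.0 + S1 + S2 + 2.0 * rho * math.sqrt(S1 * S2))

def D1(rho, S1):
    # closed-form derivative (8.31)
    return -rho * S1 / (1.0 + S1 * (1.0 - rho * rho))

def D12(rho, S1, S2):
    # closed-form derivative (8.32)
    return math.sqrt(S1 * S2) / (1.0 + S1 + S2 + 2.0 * rho * math.sqrt(S1 * S2))

S1, S2, rho, h = 1.0, 2.0, 0.5, 1e-6
assert abs((I1(rho + h, S1) - I1(rho - h, S1)) / (2 * h) - D1(rho, S1)) < 1e-8
assert abs((I12(rho + h, S1, S2) - I12(rho - h, S1, S2)) / (2 * h)
           - D12(rho, S1, S2)) < 1e-8
# on the curved part of the boundary: D1 < 0 and D12 > 0
assert D1(rho, S1) < 0.0 < D12(rho, S1, S2)
```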
Note that ρ ∈ (0, 1] parametrizes the strictly concave part of the boundary (the part of the boundary where R2 > 0.2 in Fig. 8.2), and in this interval we have D1(ρ) < 0 and D12(ρ) > 0.
Furthermore, for a vector v = (v1, v2)ᵀ ∈ R², we define the downset of v as

  v⁻ := { (w1, w2) ∈ R² : w1 ≤ v1, w2 ≤ v2 }.    (8.33)
We are now in a position to state our main result, whose proof is sketched in Section 8.3.2.

Theorem 8.1 (Local Second-Order Rates). Depending on (R1, R2) (see Fig. 8.2), we have the following three cases:

Case (i): R1 = I1(0) and R1 + R2 < I12(0) (vertical segment of the boundary corresponding to ρ = 0):

  L(ε; R1, R2) = { (L1, L2) : L1 ≤ √(V1(0)) Φ⁻¹(ε) }.    (8.34)

Case (ii): R1 = I1(ρ) and R1 + R2 = I12(ρ) (curved segment of the boundary corresponding to 0 < ρ < 1):

  L(ε; R1, R2) = { (L1, L2) : ( L1, L1 + L2 )ᵀ ∈ ∪_{β ∈ R} [ β D(ρ) + Ψ⁻¹(V(ρ), ε) ] }.    (8.35)

Case (iii): R1 = 0 and R1 + R2 = I12(1) (point on the vertical axis corresponding to ρ = 1):

  L(ε; R1, R2) = { (L1, L2) : ( L1, L1 + L2 )ᵀ ∈ ∪_{β ≤ 0} [ β D(1) + ( 0, √(V12(1)) Φ⁻¹(ε) )⁻ ] }.    (8.36)
See Fig. 8.4 for an illustration of L(ε; R1, R2) in (8.35) and of the set of (L1, L2) such that (L1, L1 + L2)ᵀ belongs to Ψ⁻¹(V(ρ), ε), i.e., G Ψ⁻¹(V(ρ), ε), where G = [1, 0; −1, 1] is the invertible matrix that transforms the coordinate system from (L1, L1 + L2)ᵀ to (L1, L2)ᵀ. In other words, G Ψ⁻¹(V(ρ), ε) is the same set as that in (8.35) neglecting the union and setting β = 0. It can be seen that G Ψ⁻¹(V(ρ), ε) is a strict subset

[Plot: the second-order region in the (L1, L2)-plane (nats/√use), showing G Ψ⁻¹(V(ρ), ε) and the larger half-space L(ε; R1, R2).]

Figure 8.4: Illustration of the set L(ε; R1, R2) in (8.35) with S1 = S2 = 1, ρ = 1/2 and ε = 0.1. The set corresponding to β = 0 in (8.35) is denoted as G Ψ⁻¹(V(ρ), ε). Regions are to the bottom-left of the boundaries.
of L(ε; R1, R2). In fact, L(ε; R1, R2) is a half-space in R² for any (R1, R2) on the boundary of the capacity region corresponding to ρ < 1, so L(ε; R1, R2) in (8.35) can alternatively be written as

  L(ε; R1, R2) = { (L1, L2) : L2 ≤ a L1 + b },    (8.37)

where the slope and intercept are respectively defined as

  a := ( D12(ρ) − D1(ρ) ) / D1(ρ),   and    (8.38)

  b := sup{ b̄ ∈ R : ∃ L1 ∈ R such that (L1, (a + 1) L1 + b̄) ∈ Ψ⁻¹(V(ρ), ε) }.    (8.39)

8.2.4
Observe that in Case (i), the second-order region is characterized simply by a scalar dispersion term V1(0) and the inverse of the Gaussian cdf, Φ⁻¹. On this part of the boundary, there is effectively only a single rate constraint, namely that on R1, since we are operating far away from the sum-rate constraint. This results in a large-deviations-type event for the sum-rate constraint, which has no bearing on the second-order asymptotics. This is similar to observations made in Chapters 6 and 7.

Cases (ii)–(iii) are more interesting, and their proofs do not follow from standard techniques. As in Case (iii) of Theorem 6.1, the second-order asymptotics for Case (ii) depend on the dispersion matrix V(ρ) and the bivariate Gaussian cdf, since both rate constraints are active at a point on the boundary parametrized by ρ ∈ (0, 1). However, the expression containing Ψ⁻¹ alone (i.e., the expression obtained by setting β = 0 in (8.35)) corresponds to considering only the unique input distribution N(0, Σ(ρ)) achieving the point (R1, R2) = (I1(ρ), I12(ρ) − I1(ρ)). From Fig. 8.2, this is not sufficient to achieve all second-order coding rates, since there are nonempty regions within the capacity region that are not contained in the trapezoid of rate pairs achievable using the single Gaussian N(0, Σ(ρ)).
Thus, to achieve all (L1, L2) pairs in L(ε; R1, R2), we must allow the sequence of input distributions to vary with the blocklength n. This is manifested in the D(ρ) term. Roughly speaking, our proof strategy for the direct part involves random coding with a sequence of input distributions that are uniform on two spheres with correlation coefficient ρn = ρ + O(1/√n) between them. By a Taylor expansion, the resulting mutual information vector satisfies

  I(ρn) ≈ I(ρ) + (ρn − ρ) D(ρ).    (8.40)

Since ρn − ρ = O(1/√n), the gradient term (ρn − ρ) D(ρ) also contributes to the second-order behavior, together with the traditional Gaussian approximation term Ψ⁻¹(V(ρ), ε).
For the converse, we consider an arbitrary sequence of codes with rate pairs {(R1n, R2n)}_{n∈N} converging to (R1, R2) = (I1(ρ), I12(ρ) − I1(ρ)) with second-order behavior given by (8.10). From the global result, we know that (R1n, R1n + R2n)ᵀ lies in the outer region for some sequence {ρn}_{n∈N}. We then establish, using the definition of the second-order coding rates in (8.10), that ρn = ρ + O(1/√n). Finally, by the Bolzano–Weierstrass theorem, we may pass to a convergent subsequence of {ρn} (if necessary), thus establishing the converse.
A similar discussion holds for Case (iii); the main differences are that the covariance matrix is singular, and that the union in (8.36) is taken over β ≤ 0 only, since ρn can only approach one from below.
8.3 Proof Sketches

8.3.1 Proof Sketch of Lemma 8.1
Proof. Because the proof is rather lengthy, we focus only on the case where ρn → ρ̄ ∈ (−1, 1); the main ideas are already present here. The case where ρn → ±1 is omitted, and the reader is referred to [138] for the details.
The converse proof is split into several steps for clarity. In the first three steps, we perform a series of reductions to simplify the problem, specifically to simplify the evaluation of the probability in the non-asymptotic converse bound in Proposition 8.2.
Step 1: (Reduction from Maximal to Equal Power Constraints) As usual, by the Yaglom map trick [28, Ch. 9, Thm. 6], it suffices to consider codes such that the power inequalities in (8.4)–(8.5) hold with equality. See the argument in the proof of the converse for the asymptotic expansion of the AWGN channel (Theorem 4.4).
Step 2: (Reduction from Average to Maximal Error Probability) Using arguments similar to [119, Sec. 3.4.4], it suffices to prove the converse for the maximal (rather than average) error probability.² This is shown by starting with an average-error code and constructing a maximal-error code as follows: (i) keep only the fraction 1/√n of user 2's messages with the smallest error probabilities (averaged over user 1's messages); (ii) for each of user 2's surviving messages, keep only the fraction 1/√n of user 1's messages with the smallest error probabilities.
Step 3: (Correlation Type Classes) Define I0 := {0} and Ik := ((k − 1)/n, k/n] for k = 1, …, n, and let I−k := −Ik. Consider the correlation type classes (or simply type classes)

  Tn(k) := { (x1, x2) : ⟨x1, x2⟩ / ( ‖x1‖₂ ‖x2‖₂ ) ∈ Ik },    (8.41)

where k = −n, …, n. The total number of type classes is 2n + 1, which is polynomial in n, analogously to the finite alphabet case (cf. the type counting lemma). Using an argument similar to that for the asymmetric broadcast channel in [39, Lem. 16.2], together with the fact that we are considering the maximal error probability, so all message pairs (m1, m2) have error probabilities not exceeding ε (cf. Step 2), it suffices to consider codes for which all pairs (x1, x2) lie in a single type class, say the one indexed by k. This results in a rate loss in R1 and R2 of only O(log n / n). We define ρ̂ := k/n according to the type class indexed by k in (8.41).
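The quantization in (8.41) maps the empirical correlation coefficient of a codeword pair to the index k of the interval Ik containing it. The helper below is a hypothetical illustration of this bookkeeping:

```python
import math

def type_index(x1, x2, n):
    # empirical correlation and its quantization bin, cf. (8.41):
    # k is the unique integer with <x1,x2>/(||x1|| ||x2||) in ((k-1)/n, k/n]
    ip = sum(a * b for a, b in zip(x1, x2))
    r = ip / (math.sqrt(sum(a * a for a in x1))
              * math.sqrt(sum(b * b for b in x2)))
    if r == 0.0:
        return 0          # I_0 = {0}
    k = math.ceil(n * abs(r))
    return k if r > 0 else -k  # I_{-k} := -I_k

n = 4
assert type_index([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], n) == n
assert type_index([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], n) == 0
```

Since the index ranges over {−n, …, n}, there are 2n + 1 classes, matching the polynomial type-counting bound in the text.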
Step 4: (Approximation of Empirical Moments by True Moments) The value of ρ used in the single-letter information densities in (8.21) is arbitrary, and is chosen here to be ρ̂. Using the definition of Tn(k) and the information densities in (8.21), we can show that the first and second moments of Σ_{i=1}^n j(x1i, x2i, Yi) are approximately given by I(ρ̂) and V(ρ̂) respectively, i.e.,

  ‖ E[ (1/n) Σ_{i=1}^n j(x1i, x2i, Yi) ] − I(ρ̂) ‖ ≤ ν1/n,   and    (8.42)

  ‖ Cov[ (1/√n) Σ_{i=1}^n j(x1i, x2i, Yi) ] − V(ρ̂) ‖ ≤ ν2/n,    (8.43)

for some ν1 > 0 and ν2 > 0 not depending on ρ̂. The expectation and covariance above are taken with respect to W^n(· | x1, x2). Roughly speaking, (8.42) and (8.43) hold because all pairs of vectors in Tn(k) have approximately the same empirical correlation coefficient, so the expectation and covariance of the appropriately normalized information density vectors are also close to a representative mutual information vector and dispersion matrix, respectively.
Step 5: (Evaluation of the Non-Asymptotic Converse Bound in Proposition 8.2) Let Rn := (R1n, R1n + R2n)ᵀ (where Rjn = (1/n) log Mjn) be the rate vector consisting of the non-asymptotic rates (R1n, R2n). Additionally, let

  F := { log [ W^n(Y^n | X1^n, X2^n) / Q_{Y^n|X2^n}(Y^n | X2^n) ] ≤ log M1n − nλ },   and    (8.44)

  G := { log [ W^n(Y^n | X1^n, X2^n) / Q_{Y^n}(Y^n) ] ≤ log(M1n M2n) − nλ }    (8.45)

² This argument is not valid for the standard MAC, but is possible here due to the partial cooperation (i.e., user 1 knowing both messages). It is well known that the capacity regions of the MAC under the average and maximal error probability criteria are different, an observation first made by Dueck [46].
be the two error events within the probability in (8.9). We then have

  Pr(F ∪ G) = 1 − Pr(F^c ∩ G^c)    (8.46)
            = 1 − E_{X1^n, X2^n}[ Pr(F^c ∩ G^c | X1^n, X2^n) ].    (8.47)

In particular, using the definition of j(x1, x2, y) in (8.21) and the fact that Q_{Y^n|X2^n} and Q_{Y^n} are product distributions, the conditional probability in (8.47) can be bounded as

  Pr(F^c ∩ G^c | X1^n = x1, X2^n = x2)
  = Pr[ (1/n) Σ_{i=1}^n j(x1i, x2i, Yi) > Rn − λ 1 ]    (8.48)
  ≤ Pr[ (1/n) Σ_{i=1}^n ( j(x1i, x2i, Yi) − E[ j(x1i, x2i, Yi) ] ) > Rn − I(ρ̂) − (λ + ν1/n) 1 ],    (8.49)
where (8.49) follows from the approximation of the empirical expectation in (8.42). In the rest of this global converse proof, λ is set to (log n)/(2n), so that exp(−nλ) = 1/√n in the non-asymptotic converse bound in (8.9). Applying the multivariate Berry–Esseen theorem (Corollary 1.1) to (8.49) yields
  Pr(F^c ∩ G^c | X1^n = x1, X2^n = x2)
  ≤ Ψ( √n ( I1(ρ̂) + λ + ν1/n − R1n ), √n ( I12(ρ̂) + λ + ν1/n − (R1n + R2n) ); 0, Cov[ (1/√n) Σ_{i=1}^n j(x1i, x2i, Yi) ] ) + ψ(ρ̂)/√n,    (8.50)
where ψ(ρ̂) is a constant. By Taylor expanding the continuously differentiable function (z1, z2, V) ↦ Ψ(z1, z2; 0, V), and using the approximation of the empirical covariance in (8.43) together with the fact that det(V(ρ̄)) > 0 for ρ̄ ∈ (−1, 1), we obtain

  Pr(F^c ∩ G^c | X1^n = x1, X2^n = x2)
  ≤ Ψ( √n ( I1(ρ̂) + λ + ν1/n − R1n ), √n ( I12(ρ̂) + λ + ν1/n − (R1n + R2n) ); 0, V(ρ̂) ) + β(ρ̂) (log n)/√n,    (8.51)
where β(ρ̂) is a constant. It should be noted that ψ(ρ̂), β(ρ̂) → ∞ as ρ̂ → ±1, since V(ρ̂) becomes singular as ρ̂ → ±1. Despite this non-uniformity, we conclude from (8.9), (8.47) and (8.51) that any (n, ε)-code with codewords (x1, x2) all belonging to Tn(k) must have rates (R1n, R2n) that satisfy

  ( R1n, R1n + R2n )ᵀ ∈ I(ρ̂) + (1/√n) Ψ⁻¹( V(ρ̂), ε + 2/√n + β(ρ̂) (log n)/√n ).    (8.52)

We immediately obtain the global converse bound on the (n, ε)-capacity region (the outer bound in (8.27) of Lemma 8.1) by employing the approximation

  Ψ⁻¹( V(ρ̂), ε + c (log n)/√n ) ⊆ Ψ⁻¹( V(ρ̂), ε ) + h( V(ρ̂), ε, c ) ( (log n)/√n ) 1,    (8.53)

where c > 0 is an arbitrary finite constant and h(V(ρ̂), ε, c) is finite for ρ̂ ≠ ±1. The details of the approximation in (8.53) are omitted, and can be found in [138].
We now provide a proof sketch of the achievability part of Lemma 8.1 (the inner bound in (8.27)). At a high level, we adopt the strategy of drawing random codewords on appropriate power spheres, similar to the coding strategies for AWGN channels (Section 4.3) and for the Gaussian IC with strictly very strong interference (Chapter 7). We then analyze the ensemble behavior of this random code.
Step 1: (Random-Coding Ensemble) Let ρ ∈ [0, 1] be a fixed correlation coefficient. The ensemble will be defined in such a way that, with probability one, each codeword pair falls into the set

  Dn(ρ) := { (x1, x2) : ‖x1‖₂² = nS1, ‖x2‖₂² = nS2, ⟨x1, x2⟩ = nρ√(S1 S2) }.    (8.54)
This means that the power constraints in (8.4)–(8.5) are satisfied with equality and the empirical correlation between each codeword pair is also exactly ρ. We use superposition coding, in which the codewords are generated according to

  ( X2^n(m2), {X1^n(m1, m2)}_{m1=1}^{M1} )_{m2=1}^{M2}  ~  Π_{m2=1}^{M2} [ P_{X2^n}( x2(m2) ) Π_{m1=1}^{M1} P_{X1^n|X2^n}( x1(m1, m2) | x2(m2) ) ]    (8.55)
for codeword distributions P_{X2^n} and P_{X1^n|X2^n}. We choose the codeword distributions to be

  P_{X2^n}(x2) ∝ δ{ ‖x2‖₂² = nS2 },   and    (8.56)

  P_{X1^n|X2^n}(x1 | x2) ∝ δ{ ‖x1‖₂² = nS1, ⟨x1, x2⟩ = nρ√(S1 S2) },    (8.57)

where δ{·} is the Dirac delta, and P_{X^n}(x) ∝ δ{x ∈ A} means that P_{X^n}(x) = δ{x ∈ A}/c, with the normalization constant c > 0 chosen such that ∫_A P_{X^n}(x) dx = 1. In other words, each x2(m2) is drawn uniformly from the (n − 1)-sphere with radius √(nS2), and, for each m2, each x1(m1, m2) is drawn uniformly from the set of all x1 satisfying the power and correlation-coefficient constraints with equality. These distributions clearly ensure that the codeword pairs belong to Dn(ρ) with probability one.
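One concrete way to realize draws satisfying the constraints in (8.56)–(8.57), a hedged sketch rather than the construction from [138], is to place x2 on its sphere via a normalized Gaussian and then build x1 from the direction of x2 plus an orthogonal unit vector, which meets both constraints in Dn(ρ) exactly:

```python
import math
import random

def on_sphere(n, S, rng):
    # uniform draw on the sphere of radius sqrt(nS), cf. (8.56)
    t = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(n * S) / math.sqrt(sum(v * v for v in t))
    return [r * v for v in t]

def correlated_codeword(x2, S1, S2, rho, rng):
    # draw x1 with ||x1||^2 = n*S1 and <x1, x2> = n*rho*sqrt(S1*S2), cf. (8.57),
    # by combining the direction of x2 with an orthogonal unit vector
    n = len(x2)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d = sum(gi * x2i for gi, x2i in zip(g, x2)) / sum(v * v for v in x2)
    u = [gi - d * x2i for gi, x2i in zip(g, x2)]    # orthogonal to x2
    nu = math.sqrt(sum(v * v for v in u))
    u = [v / nu for v in u]                          # unit norm
    a = rho * math.sqrt(S1 / S2)                     # coefficient along x2
    b = math.sqrt(n * S1 * (1.0 - rho * rho))        # coefficient along u
    return [a * x2i + b * ui for x2i, ui in zip(x2, u)]

rng = random.Random(1)
n, S1, S2, rho = 64, 1.0, 1.0, 0.5
x2 = on_sphere(n, S2, rng)
x1 = correlated_codeword(x2, S1, S2, rho, rng)
assert abs(sum(v * v for v in x1) - n * S1) < 1e-8
assert abs(sum(a * b for a, b in zip(x1, x2)) - n * rho * math.sqrt(S1 * S2)) < 1e-8
```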
Step 2: (Evaluation of the Non-Asymptotic Achievability Bound in Proposition 8.1) We now need to identify typical sets of (x2, y) and y such that the maximum values κ1 and κ12 of the ratios of the densities, defined in (8.8), are uniformly bounded on these sets. For this purpose, we leverage the following lemma.
Lemma 8.2. Consider the setup of Proposition 8.1, where the output distributions are chosen to be Q_{Y^n|X2^n} := (Q_{Y|X2})^n and Q_{Y^n} := (Q_Y)^n, with Q_{Y|X2} and Q_Y induced by P_{X1 X2} := N(0, Σ(ρ)) and W as in (8.19)–(8.20), and the input joint distribution P_{X1^n X2^n} is described by (8.56)–(8.57). Then there exist sets A1 ⊆ X2^n × Y^n and A12 ⊆ Y^n (depending on n and ρ) such that the following

  max_{ρ ∈ [0,1]} max{ κ1, κ12 } ≤ κ,   and    (8.58)

  max_{ρ ∈ [0,1]} max{ Pr[ (X2^n, Y^n) ∉ A1 ], Pr[ Y^n ∉ A12 ] } ≤ exp(−ξn)    (8.59)

hold for all n > n0, where κ < ∞, ξ > 0 and n0 ∈ N are constants not depending on ρ.

The proof of this technical lemma is omitted and can be found in [138]. It extends and refines ideas in Polyanskiy–Poor–Verdú's proof of the dispersion of AWGN channels [123, Thm. 54 & Lem. 61].
Note that the uniformity in ρ of the bounds κ and exp(−ξn) in (8.58)–(8.59) is crucial for handling ρ varying with n, as is required in Lemma 8.1.
Equipped with Lemma 8.2, we now apply the multivariate Berry–Esseen theorem (Corollary 1.1) to estimate the probability in the non-asymptotic achievability bound in Proposition 8.1. This computation is similar to that sketched in the converse proof, with ν1 = ν2 = 0. This concludes the achievability proof of Lemma 8.1.
8.3.2 Proof Sketch of Theorem 8.1
Proof. We begin with the converse proof. We only prove the result in Case (ii), because Case (i) is standard (it follows from the single-user case in Theorem 4.4) and Case (iii) is similar to Case (ii).
Step 1: (Passage to a Convergent Subsequence) Fix a correlation coefficient ρ ∈ (0, 1], and consider any sequence of (n, M1n, M2n, S1, S2, εn)-codes satisfying (8.10), with associated rates {(R1n, R2n)}_{n∈N}. As required by the definition of second-order rate pairs (L1, L2) ∈ L(ε; R1, R2), these codes must satisfy

  lim inf_{n→∞} Rjn ≥ Rj,   j = 1, 2,    (8.60)

  lim inf_{n→∞} (1/√n) ( log Mjn − nRj ) ≥ Lj,   j = 1, 2,   and    (8.61)

  lim sup_{n→∞} εn ≤ ε,    (8.62)

for some (R1, R2) on the boundary parametrized by ρ, i.e., R1 = I1(ρ) and R1 + R2 = I12(ρ). The first-order optimality condition in (8.60) is not explicitly required by (8.10), but it is implied by (8.61). Letting Rn := (R1n, R1n + R2n)ᵀ be the non-asymptotic rate vector, we have, from the global converse bound in (8.27), that there exists a (possibly non-unique) sequence {ρn}_{n∈N} ⊆ [−1, 1] such that

  Rn ∈ I(ρn) + (1/√n) Ψ⁻¹( V(ρn), ε ) + g(ρn, ε, n) 1.    (8.63)
Since we used the lim inf for the rates and the lim sup for the error probability in the conditions (8.60)–(8.62), we may pass to a convergent (but otherwise arbitrary) subsequence of {ρn}_{n∈N}, say indexed by {nl}_{l∈N}. Recalling that the lim inf (resp. lim sup) is the infimum (resp. supremum) of all subsequential limits, any converse result associated with this subsequence also applies to the original sequence. Note that at least one convergent subsequence is guaranteed to exist, since [−1, 1] is compact.

For the sake of clarity, we avoid explicitly writing the subscript l. However, it should be understood that asymptotic notations such as O(·) and o(·) are taken with respect to the convergent subsequence.
Step 2: (Establishing the Convergence of ρn to ρ) Although g(ρn, ε, n) in (8.63) depends on ρn, we know from the global bounds on the (n, ε)-capacity region (Lemma 8.1) that it is o(1/√n) both when ρn → ±1 and when ρn → ρ̄ ∈ (−1, 1). Hence,

  Rn ∈ I(ρn) + (1/√n) Ψ⁻¹( V(ρn), ε ) + o(1/√n) 1.    (8.64)
We claim that the result in (8.64) allows us to conclude that every sequence {ρn}_{n∈N} that serves to parametrize an outer bound on the non-asymptotic rates in (8.63) converges to ρ. Indeed, since the boundary of the capacity region is curved and uniquely parametrized by ρ for ρ ∈ (0, 1], if ρn does not converge to ρ, then for some δ > 0 and all sufficiently large n either I1(ρn) ≤ I1(ρ) − δ or I12(ρn) ≤ I12(ρ) − δ. Combining this with (8.64), we deduce that

  R1n ≤ I1(ρ) − δ/2,    (8.65)

or the analogous bound for R1n + R2n, for sufficiently large n. This, in turn, contradicts the convergence of (R1n, R2n) to (R1, R2) implied by (8.10).
Step 3: (Establishing the Convergence Rate of ρn to ρ) Because each entry of I(ρ) is twice continuously differentiable, a Taylor expansion yields

  I(ρn) = I(ρ) + D(ρ)(ρn − ρ) + O( (ρn − ρ)² ) 1.    (8.66)

In the case that |ρn − ρ| = ω(1/√n), it is not difficult to show that (8.64) and (8.66) imply

  Rn ∈ I(ρ) + D(ρ)(ρn − ρ) + o(ρn − ρ) 1.    (8.67)

Since the first entry of D(ρ) is negative and the second entry is positive, (8.67) states that L1 = +∞ (i.e., a large addition to R1) is possible only if L1 + L2 = −∞ (i.e., a large backoff from R1 + R2), and L1 + L2 = +∞ only if L1 = −∞. This is due to the fact that we only consider second-order deviations from the boundary of the capacity region of order 1/√n. Neglecting these degenerate cases, as they are already captured by Theorem 8.1 (cf. Fig. 8.4), in the remainder we focus on the case where ρn − ρ = O(1/√n).
Step 4: (Completion of the Proof) Assuming now that $\lambda_n - \lambda = O(1/\sqrt{n})$, we can use the Bolzano–Weierstrass theorem to conclude that there exists a (further) subsequence indexed by $\{n_k\}_{k\in\mathbb{N}}$ (say) such that $\sqrt{n_k}\,(\lambda_{n_k} - \lambda) \to \beta$ for some $\beta \in \mathbb{R}$. Then, for the blocklengths indexed by $n_k$, by combining (8.64) and (8.66), we have
$$\sqrt{n_k}\,\big(\mathbf{R}_{n_k} - \mathbf{I}(\lambda)\big) \in \beta\,\mathbf{D}(\lambda) + \Psi^{-1}(\mathbf{V}(\lambda), \varepsilon) + o(1)\,\mathbf{1}. \qquad (8.68)$$
Moreover, from the definition of $(\varepsilon, R_1, R_2)$-achievable second-order coding rates, $\liminf_{k\to\infty} \sqrt{n_k}\,(R_{j,n_k} - R_j) \ge L_j$ for $j = 1, 2$. In other words, for all $\delta > 0$, there exists an integer $K_j$ such that $\sqrt{n_k}\,(R_{j,n_k} - R_j) \ge L_j - \delta$ for all $k \ge K_j$. Thus, for all $k \ge \max\{K_1, K_2\}$, we may lower-bound the components of the vector on the left of (8.68) by $L_1 - \delta$ and $L_1 + L_2 - 2\delta$, respectively. There also exists an integer $K_3$ such that the $o(1)$ terms are upper-bounded by $\delta$ for all $k \ge K_3$. We conclude that any pair of $(\varepsilon, R_1, R_2)$-achievable second-order coding rates $(L_1, L_2)$ must satisfy
$$\begin{bmatrix} L_1 - 2\delta \\ L_1 + L_2 - 3\delta \end{bmatrix} \in \bigcup_{\beta \in \mathbb{R}} \Big\{ \beta\,\mathbf{D}(\lambda) + \Psi^{-1}(\mathbf{V}(\lambda), \varepsilon) \Big\}. \qquad (8.69)$$
Finally, since $\delta > 0$ is arbitrary, we can take $\delta \downarrow 0$, thus completing the converse proof for Case (ii).
The achievability part is similar to the converse part, yet simpler. Specifically, we can simply choose
$$\lambda_n := \lambda + \frac{\beta}{\sqrt{n}}, \qquad (8.70)$$
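To make the union in (8.69) concrete, the following sketch estimates the bivariate Gaussian CDF $\Psi(\mathbf{z}; \mathbf{V})$ by Monte Carlo and tests whether a candidate vector $(L_1, L_1+L_2)$ lies in $\bigcup_\beta \{\beta\,\mathbf{D}(\lambda) + \Psi^{-1}(\mathbf{V}(\lambda), \varepsilon)\}$ over a finite grid of $\beta$. The values of $\mathbf{D}(\lambda)$ and $\mathbf{V}(\lambda)$ are hypothetical placeholders (not computed for any actual Gaussian MAC), and the convention $\Psi^{-1}(\mathbf{V}, \varepsilon) := \{\mathbf{z} : \Pr(\mathbf{Z} \le -\mathbf{z}) \ge 1-\varepsilon\}$ for $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{V})$ is an assumption chosen so that the scalar case reduces to $z \le \sqrt{V}\,\Phi^{-1}(\varepsilon)$:

```python
import math
import random

def psi(z, V, n_samples=20_000, seed=1):
    """Monte Carlo estimate of Psi(z; V) = P(Z1 <= z1, Z2 <= z2) for
    Z ~ N(0, V), sampled via a Cholesky factor of the 2x2 covariance V."""
    a = math.sqrt(V[0][0])
    b = V[1][0] / a
    c = math.sqrt(V[1][1] - b * b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if a * g1 <= z[0] and b * g1 + c * g2 <= z[1]:
            hits += 1
    return hits / n_samples

def in_second_order_region(L, D, V, eps, betas):
    """Test whether L = (L1, L1 + L2) lies in the union over beta of
    { beta * D + Psi_inv(V, eps) }, where Psi_inv(V, eps) is taken to be
    { z : Psi(-z; V) >= 1 - eps } (a downward-closed set)."""
    return any(
        psi((-(L[0] - beta * D[0]), -(L[1] - beta * D[1])), V) >= 1 - eps
        for beta in betas
    )

# Hypothetical placeholders for the quantities in (8.69): a derivative
# vector D(lambda) with negative first and positive second entry (Step 3)
# and a dispersion matrix V(lambda).
D = (-0.5, 0.3)
V = [[1.0, 0.4], [0.4, 2.0]]
eps = 0.1
betas = [k / 2 for k in range(-10, 11)]  # finite grid standing in for beta in R

print(in_second_order_region((-5.0, -5.0), D, V, eps, betas))  # deep backoff
print(in_second_order_region((3.0, 3.0), D, V, eps, betas))    # too optimistic
```

With these placeholder numbers, deep backoffs (large negative second-order rates) fall inside the region while optimistic pairs do not; in an actual computation, $\mathbf{D}(\lambda)$ and $\mathbf{V}(\lambda)$ would come from the mutual information and dispersion expressions of this chapter.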
8.4 Difficulties in the Fixed Error Analysis of the MAC
We conclude this chapter by discussing the difficulties in performing fixed error probability analysis for the discrete memoryless or Gaussian MACs (with non-degraded message sets).
First, it is known that the capacity region of the MAC depends on whether one adopts the average or the maximal error probability criterion; the capacity regions are, in general, different [46]. In Step 2 of the converse proof, we performed an important reduction from the average to the maximal error probability criterion. This is one obstacle for any (global or local) converse proof for fixed error analysis of the MAC.
Second, in the characterization of the capacity region of the discrete memoryless MAC, one needs to involve an auxiliary time-sharing random variable $Q$ [49, Sec. 4.5]. At the time of writing, there does not appear to be a principled and unified way to introduce such a variable in strong converse proofs (unlike weak converse proofs [49]).
Finally, for the discrete memoryless MAC, one needs to take the convex closure of the union over input distributions $P_{X_1|Q}, P_{X_2|Q}$ for a given time-sharing distribution $P_Q$ [49, Sec. 4.5]. In the absence of the degraded message sets (or asymmetry) assumption, one needs to develop a converse technique, possibly related to the wringing technique of Ahlswede [3], to assert that the given codeword pairs are almost independent (or almost orthogonal in the Gaussian case). By leveraging the degraded message sets assumption, we circumvented this requirement in this chapter, but for the general MAC, it is not clear whether the wringing technique yields a redundancy term that matches the best known inner bound to the second-order region [112, 136].
Chapter 9
In this monograph, we compiled a list of conclusive fixed error results in information theory. We began our discussion with a review of binary hypothesis testing and used the asymptotic expansions of the information spectrum divergence and hypothesis testing divergence for product measures to derive similar asymptotic expansions for the minimum code size in lossless data compression. Lossy data compression and channel coding were discussed in detail next. These subjects culminated in our derivation of an asymptotic expansion for the source-channel coding rate. We then analyzed various channel models whose behaviors are governed by random states.
In this monograph, we also discussed a small collection of problems in multi-user information theory [49], where we were interested in quantifying the optimal speed at which rate pairs converge towards a fixed point on the boundary of the (first-order) capacity region in the channel coding case, or the optimal rate region in the source coding case. We saw three examples of problems in network information theory where conclusive results can be obtained in the second-order sense. These included the distributed lossless source coding (Slepian–Wolf) problem, as well as some special classes of Gaussian multiple-access and interference channels.
We conclude our treatment of fixed error asymptotics in information theory by mentioning related works
in the literature.
9.1.1 Channel Coding
Early works on fixed error asymptotics in channel coding by Dobrushin [45], Kemperman [91] and Strassen [152] were discussed in Chapter 4. The interest in asymptotic expansions was revived in recent years by the works of Hayashi [76] and Polyanskiy–Poor–Verdú [123]. Before these prominent works came to the fore, Baron–Khojastepour–Baraniuk [14] considered the rate of convergence to channel capacity for simple channel models such as the binary symmetric channel.
In this monograph, we did not discuss channels with feedback or variable-length terminations, both of which are important in practical communication systems. Polyanskiy–Poor–Verdú [125] studied various incremental redundancy schemes and derived several asymptotic expansions. Generally, the $\Theta(\sqrt{n})$ dispersion term is not present, showing that channels with feedback perform much better than those without feedback, an observation that is also corroborated by a more traditional error exponent analysis [21, 73]. Williamson–Chen–Wesel [179] showed that their proposed reliability-based decoding schemes for variable-length coding with feedback can achieve higher rates than [125]. Altuğ–Wagner [8] showed that full output feedback improves the second-order term in the asymptotic expansion for channel coding if $V_{\min} < V_{\max}$. Tan–Moulin [158] considered the second-order asymptotics of erasure and list decoding. This analysis is the fixed error probability analogue of Forney's analysis of erasure and list decoding from the error exponents perspective [55]. Erasure decoding is intimately connected to decision feedback or automatic repeat request (ARQ) schemes, as the declaration of an erasure event at the decoder can inform the encoder to resend the erased information bits.
Shkel–Tan–Draper [147, 148] considered the unequal error protection of message classes and related the asymptotic expansions for this problem to lossless joint source-channel coding [149]. Moulin [113] studied the asymptotics of the channel coding problem up to the fourth-order term using strong large deviation techniques [43, Thm. 3.7.4]. Matthews [108] made an interesting observation concerning the relation of the non-asymptotic channel coding converse (Proposition 4.4) to so-called non-signaling codes in quantum information. He demonstrated efficient linear programming-based algorithms to evaluate the converse for DMCs.
Other (rather more unconventional) works on fixed error asymptotics for point-to-point communication include Riedl–Coleman–Singer's analysis of queuing channels [130], Polyanskiy–Poor–Verdú's analysis of the minimum energy for sending $k$ bits over Gaussian channels with and without feedback [126], and Ingber–Zamir–Feder's analysis of the infinite constellations problem [88].
9.1.2 Random Number Generation and Channel Resolvability
The problem of intrinsic randomness is that of generating uniform bits from a given source, while source resolvability is the dual problem, i.e., that of approximating an arbitrary source with uniform bits [67, Ch. 2], [71]. These problems were treated from the fixed approximation error (in terms of the variational distance) perspective by Hayashi [75] and Nomura–Han [117]. An interesting observation made by Hayashi in [75] is that the folklore theorem¹ posed by Han [68] does not hold for the variational distance criterion. This is interesting because the first-order fundamental limit for source coding and intrinsic randomness is the same, namely the entropy rate [67, Ch. 2] (at least for sources that satisfy the Shannon–McMillan–Breiman theorem). Thus, the violation of the folklore theorem for variational distance appears to be distillable only from the study of second- and not first-order asymptotics, demonstrating the additional insight one can glean from studying higher-order terms in asymptotic expansions.
The channel resolvability problem consists of approximating the output statistics of an arbitrary channel given uniform bits at the input [67, Ch. 6], [71]. It is particularly useful for proving the strong converse of the identification problem [5]. Watanabe and Hayashi [176] considered the channel resolvability problem, proving a second-order coding theorem under an information radius condition not dissimilar to what is known for channel coding [56, Thm. 4.5.1].
9.1.3 Channels with State
For channels with random state, Watanabe–Kuzuoka–Tan [177] and Yassaee–Aref–Gohari [188] derived the best non-asymptotic bounds for the Gelfand–Pinsker problem, improving on those by Verdú [168]. With these bounds, one can easily derive achievable second-order coding rates by appealing to various Berry–Esseen theorems. The technique in [177] is based on channel resolvability [71] and channel simulation [41], while that in [188] is based on an elegant coding scheme known as the stochastic likelihood decoder (also known as the pretty good measurement in quantum information), which is also applicable to other multi-terminal problems such as multiple-description coding and the Berger–Tung problem [49]. Scarlett [134] also considered the second-order asymptotics for the discrete memoryless Gelfand–Pinsker problem and used ideas in Section 5.2 to evaluate the best known achievable second-order coding rates based on constant composition codes.
Polyanskiy [120] derived the second-order asymptotics for the compound channel, where the channel state is non-random, in contrast to the models studied in Chapter 5. Similar to the Gaussian MAC with degraded message sets, the second-order term depends on the variance of the channel information density and the derivatives of the mutual informations. Finally, Hoydis et al. [82, 83] considered block-fading MIMO channels. In contrast to Section 5.5, here the channel matrix is not quasi-static, and so the analysis is somewhat more involved and requires the use of random matrix theory.

¹The folklore theorem [68] of Han states that the output from any source encoder working at the optimal coding rate with asymptotically vanishing probability of error looks almost completely random.
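The Berry–Esseen theorems invoked earlier in this subsection can be illustrated numerically. The following standalone sketch (not drawn from [177] or [188]) computes the exact Kolmogorov distance between the CDF of a standardized Binomial(n, p) sum and the standard normal CDF, and checks it against the classical i.i.d. Berry–Esseen bound $C\rho/(\sigma^3\sqrt{n})$; the constant $C = 0.56$ is one admissible universal constant:

```python
import math

def binom_cdf_standardized_gap(n, p):
    """Kolmogorov distance between the CDF of the standardized
    Binomial(n, p) and the standard normal CDF, evaluated on both
    sides of every jump (where the supremum is attained)."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))

    def phi(x):
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        x = (k - mu) / sigma
        gap = max(gap, abs(cdf - phi(x)))   # just below the jump at k
        cdf += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        gap = max(gap, abs(cdf - phi(x)))   # just above the jump at k
    return gap

n, p = 400, 0.2
q = 1 - p
gap = binom_cdf_standardized_gap(n, p)
# Classical i.i.d. Berry-Esseen bound C * rho / (sigma^3 * sqrt(n));
# for Bernoulli(p) summands, rho / sigma^3 = (p^2 + q^2) / sqrt(p * q).
bound = 0.56 * (p**2 + q**2) / math.sqrt(p * q) / math.sqrt(n)
print(f"Kolmogorov gap = {gap:.4f}, Berry-Esseen bound = {bound:.4f}")
assert gap <= bound
```

The gap decays as $\Theta(1/\sqrt{n})$ for this lattice distribution, which is exactly the order of the residual terms absorbed into the second-order expansions above.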
9.1.4 Multi-Terminal Problems
The advances in the second-order asymptotics for multi-terminal problems have been modest. Early works include those by Sarvotham–Baron–Baraniuk [131, 132] and He et al. [80] for the single-encoder Slepian–Wolf problem. However, unlike our treatment in Chapter 6, there is only one source to be compressed, and full side-information is available at the decoder.
Other authors [84, 111, 112, 136] also considered inner bounds to the $(n, \varepsilon)$-rate regions (also called global achievability regions) for the discrete memoryless and Gaussian MACs, but it appears that conclusive results are much harder to derive without further assumptions on the channel model. These are multi-terminal channel coding analogues of the corresponding discussion for Slepian–Wolf coding in Section 6.4.3. See further discussions in Section 9.2.3.
9.1.5 Moderate Deviations and Exact Asymptotics
The study of second-order coding rates is intimately related to moderate deviations analysis. In the former, the error probability is bounded above by a nonzero constant and the optimal rates converge to the first-order fundamental limit with speed $\Theta(1/\sqrt{n})$. In the latter, the error probability decays to zero subexponentially and the optimal rates converge to the first-order fundamental limit more slowly than $\Theta(1/\sqrt{n})$. The dispersion also appears in the solution of the moderate deviations analysis because the second derivative of the error exponent (reliability function) is inversely proportional to the dispersion. The study of moderate deviations in information theory started with the work by Altuğ–Wagner [9] and Polyanskiy–Verdú [127] on channel coding. Sason [133], Tan [154] and Tan–Watanabe–Hayashi [160] considered the binary hypothesis testing, rate-distortion and lossless joint source-channel coding counterparts, respectively.
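To give a feel for the $\Theta(1/\sqrt{n})$ speed quoted above, the following sketch evaluates the two-term normal approximation $R \approx C - \sqrt{V/n}\,Q^{-1}(\varepsilon)$ (third-order terms omitted) for a binary symmetric channel, where $C$ is the capacity and $V$ the channel dispersion; the crossover probability 0.11 and error probability $10^{-3}$ are illustrative choices:

```python
import math

def h2(p):
    """Binary entropy function (bits)."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def q_inv(eps):
    """Inverse of the Gaussian tail Q(x) = P(N(0,1) > x), by bisection."""
    def q(x):
        return 0.5 * math.erfc(x / math.sqrt(2))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if q(mid) > eps else (lo, mid)
    return (lo + hi) / 2

def bsc_normal_approx(p, eps, n):
    """Two-term normal approximation to the maximal coding rate of a
    BSC(p) at blocklength n and error probability eps (bits/channel use)."""
    C = 1 - h2(p)                                   # capacity
    V = p * (1 - p) * math.log2((1 - p) / p) ** 2   # dispersion
    return C - math.sqrt(V / n) * q_inv(eps)

p, eps = 0.11, 1e-3
for n in (100, 1000, 10000):
    print(f"n = {n:6d}: R ~ {bsc_normal_approx(p, eps, n):.4f} bits/use")
print(f"capacity = {1 - h2(p):.4f} bits/use")
```

Doubling the blocklength shrinks the backoff from capacity only by a factor of $\sqrt{2}$, which is the $\Theta(1/\sqrt{n})$ convergence discussed in the paragraph above.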
In Section 4.4, we mentioned efforts by Altuğ–Wagner [10, 11] and Scarlett–Martinez–Guillén i Fàbregas [135] to derive the exact asymptotics for channel coding. The authors were motivated to find the prefactors in the error exponents regime for various classes of DMCs. Scarlett–Martinez–Guillén i Fàbregas [137] recently demonstrated that results concerning second-order coding rates, moderate deviations, large deviations, and even exact asymptotics may be unified through the use of so-called saddlepoint approximations.
9.2 Open Problems and Challenges Ahead
Clearly, there are many avenues of further research, some of which we mention here. We also highlight some
challenges we foresee.
9.2.1 Universal Codes
In Section 3.3, we analyzed a partially universal source code that achieves the source dispersion (varentropy). The source code only requires knowledge of the entropy and the varentropy. The channel dispersion can also be achieved using partially universal channel codes, as discussed in the paragraph above (4.56). However, the third-order terms are much more difficult to quantify. It would be interesting to pursue research on the third-order asymptotics of source and channel coding for fully universal codes to understand the loss of performance due to universality. Work along these lines for fixed-to-variable-length lossless source coding has been carried out by Kosut and Sankar [100, 101].
9.2.2 Side-Information Problems
Watanabe–Kuzuoka–Tan [177] and Yassaee–Aref–Gohari [188] derived the best known achievability bounds for side-information problems, including the Wyner–Ahlswede–Körner (WAK) problem [7, 182] and the Wyner–Ziv problem [184]. However, non-asymptotic converses are difficult to derive for such problems, which involve auxiliary random variables. Even when they can be derived, the asymptotic evaluation of such converses appears to be formidable.
Because a second-order converse implies the strong converse, it is useful to first understand the techniques involved in obtaining a strong converse. To the best of the author's knowledge, there are only three approaches that may be used to obtain strong converses for network problems whose first-order (capacity) characterization involves auxiliary random variables. The first is the information spectrum method [67]. For example, Boucheron and Salamatian [20, Lem. 2] provide a non-asymptotic converse bound for the asymmetric broadcast channel. However, the bound is neither computable nor amenable to good approximations for large or even moderate blocklengths $n$, as one has to perform an exhaustive search over the space of all $n$-letter auxiliary random variables. The second is the entropy and image size characterization technique [6] based on the blowing-up lemma [6, 107]. (Also see the monograph [129] for a thorough description of this technique.) This has been used to prove the strong converse for the WAK problem [6], the asymmetric broadcast channel [6], the Gelfand–Pinsker problem [166] and the Gray–Wyner problem [64]. However, the use of the blowing-up approach to obtain second-order converse bounds is not straightforward. The third method involves a change-of-measure argument, and was used in the work of Kelly and Wagner [90, Thm. 2] to prove an upper bound on the error exponent for WAK coding. Again, it does not appear, at first glance, that this argument is amenable to second-order analysis.
A problem similar to side-information problems such as Gelfand–Pinsker, Wyner–Ziv and WAK is the multi-terminal statistical inference problem studied by Han and Amari [69], among others. Asymptotic expansions with non-vanishing type-II error probability may be derivable in some settings (using established techniques) if the first-order characterization is conclusively known and there are no auxiliary random variables, e.g., the problem of multi-terminal detection with zero-rate compression [139].
9.2.3 Multi-Terminal Converses
The study of second-order asymptotics for multi-terminal problems is in its infancy, and the problems described in this monograph form only the tip of a large iceberg. The primary difficulty is our inability to deal, in a systematic and principled way, with auxiliary random variables in the (strong) converse part. Thus, genuinely new non-asymptotic converses need to be developed, and these converses have to be amenable to asymptotic evaluations in the presence of auxiliary random variables. As an example, for the degraded broadcast channel, the usual non-intuitive identification of the auxiliary random variable by Gallager [57] (see [49, Thm. 5.2]) for proving the weak converse does not suffice, as the strong converse is implied by a second-order converse. Other possible techniques, such as information spectrum analysis [20] or the blowing-up lemma [6], were highlighted in the previous section. Their limitations were also discussed. For the discrete memoryless MAC, a strong converse was proved by Ahlswede [3], but his wringing technique does not seem to be amenable to second-order refinements, as discussed in Section 8.4.
In contrast to the single-user setting, constant composition codes may be beneficial even in the absence of cost constraints for discrete memoryless multi-user problems. This is because there does not exist an analogue of the relation in (4.24), where the unconditional information variance is equal to the conditional information variance for all CAIDs. Scarlett–Martinez–Guillén i Fàbregas [136] provided the best known inner bounds to the $(n, \varepsilon)$-rate region for the discrete memoryless MAC. Tan–Kosut [157] also showed that conditionally constant composition codes outperform i.i.d. codes for the asymmetric broadcast channel when the error probability is non-vanishing. It would be fruitful to continue pursuing research in the direction of constant composition codes for multi-user problems (e.g., the interference channel) to exploit their full potential.
9.2.4 Information-Theoretic Security
Finally, we mention that within the realm of information-theoretic security [19, 104], there are several partial results in the fixed error and leakage setting. Yassaee–Aref–Gohari [187, Thm. 4] used a general random binning procedure, called output statistics of random binning, to derive a second-order achievability bound for the wiretap channel [183], improving on earlier work by Tan [153]. The constraints pertain to the error probability of the legitimate receiver in decoding the message and the leakage rate to the eavesdropper measured in terms of the variational distance. However, in [187], there were no converse results, even for the less noisy (or even degraded) case where there are no auxiliary random variables.
The most conclusive work thus far in information-theoretic security pertains to the secret key agreement model [4], where the second-order asymptotics were derived by Hayashi–Tyagi–Watanabe [78]. Interestingly, the non-asymptotic converse bound relates the size of the key to the hypothesis testing divergence, similarly to some point-to-point problems discussed in this monograph. The non-asymptotic direct bound is derived based on the information spectrum slicing technique (e.g., [67, Thm. 1.9.1]). The author believes that the fixed error and fixed leakage analysis for the wiretap channel, leveraging the secret key result, may lead to new insights into the design of secure communication systems at the physical layer. For converse theorems, the development of novel strong converse techniques for the wiretap channel appears to be necessary; there are recent results on this for degraded wiretap channels using the information spectrum method [155] and active hypothesis testing [79].
Acknowledgements
Even though this monograph bears only my name, many parts of it are works of other information theorists
and the remaining parts germinated from my collaborations with my coauthors. I sincerely thank my coauthors for educating me on information theory, ensuring I was productive and, most importantly, making
research fun. My first work on fixed error asymptotics was in collaboration with Oliver Kosut while we
were both at MIT. We had a wonderful collaboration on the second-order asymptotics of the Slepian–Wolf
problem, discussed in Chapter 6. Upon my return to Singapore, I had the tremendous pleasure of working
with Marco Tomamichel on several projects that led to some of the key results in Chapters 4 and 5. I thank
Sy-Quoc Le and Mehul Motani for the collaboration that led to results concerning Gaussian interference
channels in Chapter 7. Jonathan Scarlett and I had numerous discussions on various topics in information
theory, including a thread that led to the results on Gaussian MAC with degraded message sets in Chapter 8.
I have also had the distinguished honor of collaborating on the topic of fixed error asymptotics with Stark
Draper, Masahito Hayashi, Shigeaki Kuzuoka, Pierre Moulin, Yanina Shkel, and Shun Watanabe.
In addition to my collaborators, I have had many interactions with other colleagues on this exciting topic,
including Yücel Altuğ, Yuval Kochman, Shaowei Lin, Alfonso Martinez, Ebrahim MolavianJazi, Lalitha
Sankar, and Da Wang. I thank them tremendously for sharing their insights on various problems.
I would like to express my deepest gratitude to Jonathan Scarlett for reading through the first draft
of this monograph, providing me with constructive comments, spotting typos, and helping to fix egregious
errors. Special thanks also goes out to Stark Draper, Silas Fong, Ebrahim MolavianJazi, Mehul Motani,
Mark Wilde and Lav Varshney for proofreading parts of later versions of the monograph.
I am deeply indebted to my academic mentors Professor Alan Willsky, Professor Stark Draper and Dr. Cédric Févotte for teaching me how to write in a clear, concise and yet precise manner. Any parts of this
monograph that violate these ideals are, of course, down to my personal inadequacies.
I am especially grateful to the National University of Singapore (NUS) for providing me with the ideal
environment to pursue my academic dreams. This work is supported by NUS startup grants R263000A98750 (FoE) and R263000A98133 (ODPRT).
I am very grateful to Editor Professor Yury Polyanskiy as well as the two anonymous reviewers for their
extensive and constructive suggestions during the revision process. One reviewer, in particular, suggested
the unambiguous and succinct title of this monograph.
Finally, this monograph, and my research that led to it, would not have been possible without the
constant love and support of my family, especially my wife Huili, and my son Oliver.
Bibliography
[1] R. Ahlswede. Multi-way communication channels. In Proceedings of the International Symposium on Information Theory, pages 23–51, Tsahkadsor, Armenia, Sep 1971.
[2] R. Ahlswede. The capacity of a channel with two senders and two receivers. Annals of Probability, 2(5):805–814, 1974.
[3] R. Ahlswede. An elementary proof of the strong converse theorem for the multiple access channel. Journal of Combinatorics, Information & System Sciences, 7(3):216–230, 1982.
[4] R. Ahlswede and I. Csiszár. Common randomness in information theory and cryptography—I: Secret sharing. IEEE Transactions on Information Theory, 39(4):1121–1132, 1993.
[5] R. Ahlswede and G. Dueck. Identification via channels. IEEE Transactions on Information Theory, 35(1):15–29, 1989.
[6] R. Ahlswede, P. Gács, and J. Körner. Bounds on conditional probabilities with applications in multi-user communication. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 34(3):157–177, 1976.
[7] R. Ahlswede and J. Körner. Source coding with side information and a converse for the degraded broadcast channel. IEEE Transactions on Information Theory, 21(6):629–637, 1975.
[8] Y. Altuğ and A. B. Wagner. Feedback can improve the second-order coding performance in discrete memoryless channels. In Proceedings of the International Symposium on Information Theory, pages 2361–2365, Honolulu, HI, Jul 2014.
[9] Y. Altuğ and A. B. Wagner. Moderate deviations in channel coding. IEEE Transactions on Information Theory, 60(8):4417–4426, 2014.
[10] Y. Altuğ and A. B. Wagner. Refinement of the random coding bound. IEEE Transactions on Information Theory, 60(10):6005–6023, Oct 2014.
[11] Y. Altuğ and A. B. Wagner. Refinement of the sphere-packing bound: Asymmetric channels. IEEE Transactions on Information Theory, 60(3):1592–1614, 2014.
[12] Y. Altuğ and A. B. Wagner. The third-order term in the normal approximation for singular channels. In Proceedings of the International Symposium on Information Theory, pages 1897–1901, Honolulu, HI, Jul 2014. arXiv:1309.5126 [cs.IT].
[13] R. R. Bahadur and R. Ranga Rao. On deviations of the sample mean. Annals of Mathematical Statistics, 31(4):1015–1027, 1960.
[14] D. Baron, M. A. Khojastepour, and R. G. Baraniuk. How quickly can we approach channel capacity? In Proceedings of the Asilomar Conference on Signals, Systems and Computers, pages 1096–1100, Monterey, CA, Nov 2004.
[15] V. Bentkus. On the dependence of the Berry–Esseen bound on dimension. Journal of Statistical Planning and Inference, 113:385–402, 2003.
[16] T. Berger. Rate-Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, Englewood Cliffs, NJ, 1971.
[17] A. C. Berry. The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941.
[18] E. Biglieri, J. Proakis, and S. Shamai (Shitz). Fading channels: information-theoretic and communications aspects. IEEE Transactions on Information Theory, 44(6):2619–2692, 1998.
[19] M. Bloch and J. Barros. Physical-Layer Security: From Information Theory to Security Engineering. Cambridge University Press, 2011.
[20] S. Boucheron and M. R. Salamatian. About priority encoding transmission. IEEE Transactions on Information Theory, 46(2):699–705, 2000.
[21] M. V. Burnashev. Information transmission over a discrete channel with feedback. Problems of Information Transmission, 12(4):10–30, 1976.
[22] A. B. Carleial. A case where interference does not reduce capacity. IEEE Transactions on Information Theory, 21:569–570, 1975.
[23] N. R. Chaganty and J. Sethuraman. Strong large deviation and local limit theorems. Annals of Probability, 21(3):1671–1690, 1993.
[24] L. H. Y. Chen and Q.-M. Shao. Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli, 13(2):581–599, 2007.
[25] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.
[26] H.-F. Chong, M. Motani, H. K. Garg, and H. El Gamal. On the Han–Kobayashi region for the interference channel. IEEE Transactions on Information Theory, 54(7):3188–3195, 2008.
[27] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36(3):453–471, 1990.
[28] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag, 2003.
[29] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. McGraw-Hill Science/Engineering/Math, 2nd edition, 2003.
[30] M. Costa. Writing on dirty paper. IEEE Transactions on Information Theory, 29(3):439–441, 1983.
[31] T. Cover. Broadcast channels. IEEE Transactions on Information Theory, 18(1):2–14, 1972.
[32] T. M. Cover. A proof of the data compression theorem of Slepian and Wolf for ergodic sources. IEEE Transactions on Information Theory, 21(3):226–228, 1975.
[33] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[34] I. Csiszár. On an extremum problem of information theory. Studia Sci. Math. Hungarica, 9(1):57–71, 1974.
[35] I. Csiszár. Joint source-channel error exponent. Problems of Control and Information Theory, 9:315–328, 1980.
[36] I. Csiszár. Linear codes for sources and source networks: Error exponents, universal coding. IEEE Transactions on Information Theory, 28(4), 1982.
[37] I. Csiszár. The method of types. IEEE Transactions on Information Theory, 44(6):2505–2523, 1998.
[38] I. Csiszár and J. Körner. Graph decomposition: A new key to coding theorems. IEEE Transactions on Information Theory, 27:5–11, 1981.
[39] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.
[40] I. Csiszár and Z. Talata. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Transactions on Information Theory, 52(3):1007–1016, 2006.
[41] P. Cuff. Distributed channel synthesis. IEEE Transactions on Information Theory, 59(11):7071–7096, 2013.
[42] M. Dalai. Lower bounds on the probability of error for classical and classical-quantum channels. IEEE Transactions on Information Theory, 59(12):8027–8056, 2013.
[43] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, 2nd edition, 1998.
[44] N. Devroye, P. Mitran, and V. Tarokh. Achievable rates in cognitive radio channels. IEEE Transactions on Information Theory, 52(5):1813–1827, 2006.
[45] R. L. Dobrushin. Mathematical problems in the Shannon theory of optimal coding of information. In Proc. 4th Berkeley Symp. Math., Statist., Probabil., pages 211–252, 1961.
[46] G. Dueck. Maximal error capacity regions are smaller than average error capacity regions for multi-user channels. Problems of Control and Information Theory, 7(1):11–19, 1978.
[47] G. Dueck. The strong converse coding theorem for the multiple-access channel. Journal of Combinatorics, Information & System Sciences, 6(3):187–196, 1981.
[48] F. Dupuis, L. Kraemer, P. Faist, J. M. Renes, and R. Renner. Generalized entropies. In Proceedings of the XVIIth International Congress on Mathematical Physics, 2012.
[49] A. El Gamal and Y.-H. Kim. Network Information Theory. Cambridge University Press, Cambridge, U.K., 2012.
[50] E. O. Elliott. Estimates of error rates for codes on burst-noise channels. The Bell System Technical Journal, 42:1977–1997, Sep 1963.
[51] U. Erez, S. Litsyn, and R. Zamir. Lattices which are good for (almost) everything. IEEE Transactions on Information Theory, 51(10):3401–3416, 2005.
[52] C.-G. Esseen. On the Liapunoff limit of error in the theory of probability. Arkiv för matematik, astronomi och fysik, A28(1):1–19, 1942.
[53] A. Feinstein. A new basic theorem of information theory. IEEE Transactions on Information Theory, 4(4):2–22, 1954.
[54] W. Feller. An Introduction to Probability Theory and Its Applications. John Wiley and Sons, 2nd edition, 1971.
[55] G. D. Forney. Exponential error bounds for erasure, list, and decision feedback schemes. IEEE Transactions on Information Theory, 14:206–220, 1968.
[56] R. G. Gallager. Information Theory and Reliable Communication. Wiley, New York, 1968.
[57] R. G. Gallager. Capacity and coding for degraded broadcast channels. Problems of Information Transmission, 10(3):3–14, 1974.
[58] R. G. Gallager. Source coding with side information and universal coding. Technical report, MIT LIDS, 1976.
[59] S. Gelfand and M. Pinsker. Coding for channel with random parameters. Problems of Control and Information Theory, 9(1):19–31, 1980.
[60] E. N. Gilbert. Capacity of burst-noise channels. The Bell System Technical Journal, 39:1253–1265, Sep 1960.
[61] A. J. Goldsmith and P. P. Varaiya. Capacity of fading channels with channel side information. IEEE Transactions on Information Theory, 43(6):1986–1992, 1997.
[62] V. D. Goppa. Nonprobabilistic mutual information without memory. Problems of Control and Information Theory, 4:97–102, 1975.
[63] F. Götze. On the rate of convergence in the multivariate CLT. Annals of Probability, 19(2):721–739, 1991.
[64] W. Gu and M. Effros. A strong converse for a collection of network source coding problems. In Proceedings of the International Symposium on Information Theory, pages 2316–2320, Seoul, S. Korea, Jun–Jul 2009.
[65] E. Haim, Y. Kochman, and U. Erez. A note on the dispersion of network problems. In Convention of Electrical and Electronics Engineers in Israel, pages 1–9, Eilat, Nov 2012.
[66] T. S. Han. An information-spectrum approach to capacity theorems for the general multiple-access channel. IEEE Transactions on Information Theory, 44(7):2773–2795, 1998.
[67] T. S. Han. Information-Spectrum Methods in Information Theory. Springer Berlin Heidelberg, Feb 2003.
[68] T. S. Han. Folklore in source coding: Information-spectrum approach. IEEE Transactions on Information Theory, 51(2):747–753, 2005.
[69] T. S. Han and S.-I. Amari. Statistical inference under multiterminal data compression. IEEE Transactions on Information Theory, 44(6):2300–2324, 1998.
[70] T. S. Han and K. Kobayashi. A new achievable rate region for the interference channel. IEEE Transactions on Information Theory, 27(1):49–60, 1981.
[71] T. S. Han and S. Verdú. Approximation theory of output statistics. IEEE Transactions on Information Theory, 39(3):752–772, 1993.
[72] E. A. Haroutunian. Error probability lower bound for the multiple-access communication channels. Problems of Information Transmission, 11(2):22–36, 1975.
[73] E. A. Haroutunian. A lower bound on the probability of error for channels with feedback. Problems of Information Transmission, 13(2):37–48, 1977.
[74] E. A. Haroutunian, M. E. Haroutunian, and A. N. Harutyunyan. Reliability criteria in information theory and statistical hypothesis testing. Foundations and Trends in Communications and Information Theory, 4(2–3):97–263, 2007.
[75] M. Hayashi. Second-order asymptotics in fixed-length source coding and intrinsic randomness. IEEE Transactions on Information Theory, 54(10):4619–4637, 2008.
[76] M. Hayashi. Information spectrum approach to second-order coding rate in channel coding. IEEE Transactions on Information Theory, 55(11):4947–4966, 2009.
[77] M. Hayashi and H. Nagaoka. General formulas for capacity of classical-quantum channels. IEEE Transactions on Information Theory, 49(7):1753–1768, 2003.
[78] M. Hayashi, H. Tyagi, and S. Watanabe. Secret key agreement: General capacity and second-order asymptotics. In Proceedings of the International Symposium on Information Theory, pages 1136–1140, Honolulu, HI, Jul 2014.
[79] M. Hayashi, H. Tyagi, and S. Watanabe. Strong converse for degraded wiretap channel via active hypothesis testing. In Proceedings of Allerton Conference on Communication, Control, and Computing, Monticello, IL, Oct 2014.
[80] D.-K. He, L. A. Lastras-Montaño, E.-H. Yang, A. Jagmohan, and J. Chen. On the redundancy of Slepian–Wolf coding. IEEE Transactions on Information Theory, 55(12):5607–5627, 2009.
[81] W. Hoeffding and H. Robbins. The central limit theorem for dependent random variables. Duke Mathematical Journal, 15(3):773–780, 1948.
[82] J. Hoydis, R. Couillet, and P. Piantanida. Bounds on the second-order coding rate of the MIMO Rayleigh block-fading channel. In Proceedings of the International Symposium on Information Theory, pages 1526–1530, Istanbul, Turkey, Jul 2013. arXiv:1303.3400 [cs.IT].
[83] J. Hoydis, R. Couillet, P. Piantanida, and M. Debbah. A random matrix approach to the finite blocklength regime of MIMO fading channels. In Proceedings of the International Symposium on Information Theory, pages 2181–2185, Cambridge, MA, Jul 2012.
[84] Y.-W. Huang and P. Moulin. Finite blocklength coding for multiple access channels. In Proceedings of the International Symposium on Information Theory, pages 831–835, Cambridge, MA, Jul 2012.
[85] A. Ingber and M. Feder. Finite blocklength coding for channels with side information at the receiver. In Proceedings of the Convention of Electrical and Electronics Engineers in Israel, pages 798–802, Eilat, Nov 2010.
[86] A. Ingber and Y. Kochman. The dispersion of lossy source coding. In Proceedings of the Data Compression Conference (DCC), pages 53–62, Snowbird, UT, Mar 2011. arXiv:1102.2598 [cs.IT].
[87] A. Ingber, D. Wang, and Y. Kochman. Dispersion theorems via second order analysis of functions of distributions. In Proceedings of Conference on Information Sciences and Systems, pages 1–6, Princeton, NJ, Mar 2012.
[88] A. Ingber, R. Zamir, and M. Feder. Finite dimensional infinite constellations. IEEE Transactions on Information Theory, 59(3):1630–1656, 2013.
[89] J. Jiang and T. Liu. On dispersion of modulo lattice additive noise channels. In Proceedings of the International Symposium on Wireless Communication Systems, pages 241–245, Aachen, Germany, Nov 2011.
[90] B. Kelly and A. Wagner. Reliability in source coding with side information. IEEE Transactions on Information Theory, 58(8):5086–5111, 2012.
[91] J. H. B. Kemperman. Studies in Coding Theory I. Technical report, University of Rochester, NY, 1962.
[92] G. Keshet, Y. Steinberg, and N. Merhav. Channel coding in the presence of side information. Foundations and Trends in Communications and Information Theory, 4(6):445–486, 2007.
[93] I. Kontoyiannis. Second-order noiseless source coding theorems. IEEE Transactions on Information Theory, 43(4):1339–1341, 1997.
[94] I. Kontoyiannis. Pointwise redundancy in lossy data compression and universal lossy data compression. IEEE Transactions on Information Theory, 46:136–152, 2000.
[95] I. Kontoyiannis and S. Verdú. Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Transactions on Information Theory, 60(2):777–795, 2014.
[96] V. Kostina. Lossy Data Compression: Nonasymptotic Fundamental Limits. PhD thesis, Princeton University, 2013.
[97] V. Kostina and S. Verdú. Fixed-length lossy compression in the finite blocklength regime. IEEE Transactions on Information Theory, 58(6):3309–3338, 2012.
[98] V. Kostina and S. Verdú. Channels with cost constraints: strong converse and dispersion. In Proceedings of the International Symposium on Information Theory, pages 1734–1738, Istanbul, Turkey, Jul 2013. arXiv:1401.5124 [cs.IT].
[99] V. Kostina and S. Verdú. Lossy joint source-channel coding in the finite blocklength regime. IEEE Transactions on Information Theory, 59(5):2545–2575, 2013.
[100] O. Kosut and L. Sankar. Universal fixed-to-variable source coding in the finite blocklength regime. In Proceedings of the International Symposium on Information Theory, pages 649–653, Istanbul, Turkey, Jul 2013.
[101] O. Kosut and L. Sankar. New results on third-order coding rate for universal fixed-to-variable source coding. In Proceedings of the International Symposium on Information Theory, pages 2689–2693, Honolulu, HI, Jul 2014.
[102] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[103] S.-Q. Le, V. Y. F. Tan, and M. Motani. A case where interference does not affect the channel dispersion. IEEE Transactions on Information Theory, 61(5), May 2015.
[104] Y. Liang, H. V. Poor, and S. Shamai (Shitz). Information-theoretic security. Foundations and Trends in Communications and Information Theory, 5(4–5):355–580, 2008.
[105] H. H. J. Liao. Multiple access channels. PhD thesis, University of Hawaii, Honolulu, 1972.
[106] K. Marton. Error exponent for source coding with a fidelity criterion. IEEE Transactions on Information Theory, 20(2):197–199, 1974.
[107] K. Marton. A simple proof of the blowing-up lemma. IEEE Transactions on Information Theory, 32(3):445–446, 1986.
[108] W. Matthews. A linear program for the finite block length converse of Polyanskiy–Poor–Verdú via nonsignaling codes. IEEE Transactions on Information Theory, 58(2):7036–7044, 2012.
[109] N. Merhav. Universal decoding for memoryless Gaussian channels with a deterministic interference. IEEE Transactions on Information Theory, 39(4):1261–1269, 1993.
[110] S. Miyake and F. Kanaya. Coding theorems on correlated general sources. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E78-A(9):1063–1070, 1995.
[111] E. MolavianJazi and J. N. Laneman. Simpler achievable rate regions for multiaccess with finite blocklength. In Proceedings of the International Symposium on Information Theory, pages 36–40, Cambridge, MA, Jul 2012.
[112] E. MolavianJazi and J. N. Laneman. A finite-blocklength perspective on Gaussian multiaccess channels. Submitted to the IEEE Transactions on Information Theory, 2014. arXiv:1309.2343 [cs.IT].
[113] P. Moulin. The log-volume of optimal codes for memoryless channels, within a few nats. Submitted to the IEEE Transactions on Information Theory, Nov 2013. arXiv:1311.0181 [cs.IT].
[114] P. Moulin and J. A. O'Sullivan. Information-theoretic analysis of information hiding. IEEE Transactions on Information Theory, 49(3):563–593, 2003.
[115] P. Moulin and Y. Wang. Capacity and random-coding exponents for channel coding with side information. IEEE Transactions on Information Theory, 53(4):1326–1347, 2007.
[116] M. Mushkin and I. Bar-David. Capacity and coding for the Gilbert–Elliott channels. IEEE Transactions on Information Theory, 35(6):1277–1290, 1989.
[117] R. Nomura and T. S. Han. Second-order resolvability, intrinsic randomness, and fixed-length source coding for mixed sources: Information spectrum approach. IEEE Transactions on Information Theory, 59(1):1–16, 2013.
[118] R. Nomura and T. S. Han. Second-order Slepian–Wolf coding theorems for non-mixed and mixed sources. IEEE Transactions on Information Theory, 60(9):5553–5572, 2014.
[119] Y. Polyanskiy. Channel coding: Non-asymptotic fundamental limits. PhD thesis, Princeton University, 2010.
[120] Y. Polyanskiy. On dispersion of compound DMCs. In Proceedings of Allerton Conference on Communication, Control, and Computing, pages 26–32, Monticello, IL, Oct 2013.
[121] Y. Polyanskiy. Saddle point in the minimax converse for channel coding. IEEE Transactions on Information Theory, 59(5):2576–2595, 2013.
[122] Y. Polyanskiy, H. V. Poor, and S. Verdú. New channel coding achievability bounds. In Proceedings of the International Symposium on Information Theory, pages 1763–1767, Toronto, ON, Jul 2008.
[123] Y. Polyanskiy, H. V. Poor, and S. Verdú. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory, 56(5):2307–2359, 2010.
[124] Y. Polyanskiy, H. V. Poor, and S. Verdú. Dispersion of the Gilbert–Elliott channel. IEEE Transactions on Information Theory, 57(4):1829–1848, 2011.
[125] Y. Polyanskiy, H. V. Poor, and S. Verdú. Feedback in the non-asymptotic regime. IEEE Transactions on Information Theory, 57(8):4903–4925, 2011.
[126] Y. Polyanskiy, H. V. Poor, and S. Verdú. Minimum energy to send k bits through the Gaussian channel with and without feedback. IEEE Transactions on Information Theory, 57(8):4880–4902, 2011.
[127] Y. Polyanskiy and S. Verdú. Channel dispersion and moderate deviations limits for memoryless channels. In Proceedings of Allerton Conference on Communication, Control, and Computing, pages 1334–1339, Monticello, IL, Oct 2010.
[128] V. V. Prelov. Transmission over a multiple-access channel with a special source hierarchy. Problems of Information Transmission, 20(4):3–10, 1984.
[129] M. Raginsky and I. Sason. Concentration of measure inequalities in information theory, communications and coding. Foundations and Trends in Communications and Information Theory, 10(1–2):1–247, 2013.
[130] T. J. Riedl, T. P. Coleman, and A. C. Singer. Finite blocklength achievable rates for queuing timing channels. In Proceedings of the IEEE Information Theory Workshop, pages 200–204, Paraty, Brazil, Oct 2011.
[131] S. Sarvotham, D. Baron, and R. G. Baraniuk. Non-asymptotic performance of symmetric Slepian–Wolf coding. In Proceedings of Conference on Information Sciences and Systems, Baltimore, MD, Mar 2005.
[132] S. Sarvotham, D. Baron, and R. G. Baraniuk. Variable-rate universal Slepian–Wolf coding with feedback. In Proceedings of Asilomar Conference on Signals, Systems and Computers, pages 8–12, Pacific Grove, CA, Nov 2005.
[133] I. Sason. Moderate deviations analysis of binary hypothesis testing. In Proceedings of the International Symposium on Information Theory, pages 821–825, Cambridge, MA, Jul 2012.
[134] J. Scarlett. On the dispersion of dirty paper coding. In Proceedings of the International Symposium on Information Theory, pages 2282–2286, Honolulu, HI, Jul 2014. arXiv:1309.6200 [cs.IT].
[135] J. Scarlett, A. Martinez, and A. Guillén i Fàbregas. A derivation of the asymptotic random-coding prefactor. In Proceedings of Allerton Conference on Communication, Control, and Computing, pages 956–961, Monticello, IL, Oct 2013. arXiv:1306.6203 [cs.IT].
[136] J. Scarlett, A. Martinez, and A. Guillén i Fàbregas. Second-order rate region of constant-composition codes for the multiple-access channel. In Proceedings of Allerton Conference on Communication, Control, and Computing, pages 588–593, Monticello, IL, Oct 2013. arXiv:1303.6167 [cs.IT].
[137] J. Scarlett, A. Martinez, and A. Guillén i Fàbregas. The saddlepoint approximation: Unified random coding asymptotics for fixed and varying rates. In Proceedings of the International Symposium on Information Theory, pages 1892–1896, Honolulu, HI, Jul 2014. arXiv:1402.3941 [cs.IT].
[138] J. Scarlett and V. Y. F. Tan. Second-order asymptotics for the Gaussian MAC with degraded message sets. IEEE Transactions on Information Theory, 2014.
[139] H. M. H. Shalaby and A. Papamarcou. Multiterminal detection with zero-rate data compression. IEEE Transactions on Information Theory, 38(2):254–267, 1992.
[140] X. Shang and B. Chen. Two-user Gaussian interference channels: An information theoretic point of view. Foundations and Trends in Communications and Information Theory, 10(3):247–378, 2013.
[141] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948.
[142] C. E. Shannon. The zero error capacity of a noisy channel. IRE Transactions on Information Theory, 2(3):8–19, 1956.
[143] C. E. Shannon. Channels with side information at the transmitter. IBM J. Res. Develop., 2:289–293, 1958.
[144] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., pages 142–163, 1959.
[145] C. E. Shannon. Probability of error for optimal codes in a Gaussian channel. The Bell System Technical Journal, 38:611–656, 1959.
[146] C. E. Shannon, R. G. Gallager, and E. R. Berlekamp. Lower bounds to error probability for coding on discrete memoryless channels, I–II. Information and Control, 10:65–103 and 522–552, 1967.
[147] Y. Y. Shkel, V. Y. F. Tan, and S. C. Draper. Converse bounds for assorted codes in the finite blocklength regime. In Proceedings of the International Symposium on Information Theory, pages 1720–1724, Istanbul, Turkey, Jul 2013.
[148] Y. Y. Shkel, V. Y. F. Tan, and S. C. Draper. Achievability bounds for unequal message protection at finite block lengths. In Proceedings of the International Symposium on Information Theory, pages 2519–2523, Honolulu, HI, Jul 2014. arXiv:1405.0891 [cs.IT].
[149] Y. Y. Shkel, V. Y. F. Tan, and S. C. Draper. On mismatched unequal error protection for finite blocklength joint source-channel coding. In Proceedings of the International Symposium on Information Theory, pages 1692–1696, Honolulu, HI, Jul 2014.
[150] Z. Shun and P. McCullagh. Laplace approximation of high dimensional integrals. Journal of the Royal Statistical Society, Series B (Methodology), 57(4):749–760, 1995.
[151] D. Slepian and J. K. Wolf. Noiseless coding of correlated information sources. IEEE Transactions on Information Theory, 19(4):471–480, 1973.
[152] V. Strassen. Asymptotische Abschätzungen in Shannons Informationstheorie. In Trans. Third Prague Conf. Inf. Theory, pages 689–723, Prague, 1962. http://www.math.cornell.edu/pmlut/strassen.pdf.
[153] V. Y. F. Tan. Achievable second-order coding rates for the wiretap channel. In IEEE International Conference on Communication Systems, pages 65–69, Singapore, Nov 2012.
[154] V. Y. F. Tan. Moderate-deviations of lossy source coding for discrete and Gaussian sources. In Proceedings of the International Symposium on Information Theory, pages 920–924, Cambridge, MA, Jul 2012. arXiv:1111.2217 [cs.IT].
[155] V. Y. F. Tan and M. Bloch. Information spectrum approach to strong converse theorems for degraded wiretap channels. In Proceedings of Allerton Conference on Communication, Control, and Computing, Monticello, IL, Oct 2014. arXiv:1406.6758 [cs.IT].
[156] V. Y. F. Tan and O. Kosut. The dispersion of Slepian–Wolf coding. In Proceedings of the International Symposium on Information Theory, pages 915–919, Cambridge, MA, Jul 2012.
[157] V. Y. F. Tan and O. Kosut. On the dispersions of three network information theory problems. IEEE Transactions on Information Theory, 60(2):881–903, 2014.
[158] V. Y. F. Tan and P. Moulin. Second-order capacities for erasure and list decoding. In Proceedings of the International Symposium on Information Theory, pages 1887–1891, Honolulu, HI, Jul 2014. arXiv:1402.4881 [cs.IT].
[159] V. Y. F. Tan and M. Tomamichel. The third-order term in the normal approximation for the AWGN channel. IEEE Transactions on Information Theory, 60(5), May 2015.
[160] V. Y. F. Tan, S. Watanabe, and M. Hayashi. Moderate deviations for joint source-channel coding of systems with Markovian memory. In Proceedings of the International Symposium on Information Theory, pages 1687–1691, Honolulu, HI, Jul 2014.
[161] I. E. Telatar. Multiaccess communications with decision feedback. PhD thesis, Massachusetts Institute of Technology, 1992.
[162] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, Mar 1986.
[163] M. Tomamichel and M. Hayashi. A hierarchy of information quantities for finite block length analysis of quantum tasks. IEEE Transactions on Information Theory, 59(11):7693–7710, 2013.
[164] M. Tomamichel and V. Y. F. Tan. A tight upper bound for the third-order asymptotics of most discrete memoryless channels. IEEE Transactions on Information Theory, 59(11):7041–7051, 2013.
[165] M. Tomamichel and V. Y. F. Tan. Second-order coding rates for channels with state. IEEE Transactions on Information Theory, 60(8):4427–4448, 2014.
[166] H. Tyagi and P. Narayan. The Gelfand–Pinsker channel: Strong converse and upper bound for the reliability function. In Proceedings of the International Symposium on Information Theory, pages 1954–1957, Seoul, S. Korea, Jul 2009. arXiv:0910.0653 [cs.IT].
[167] E. C. van der Meulen. Some recent results on the asymmetric multiple-access channel. In Proceedings of the 2nd joint Swedish–Soviet International Workshop on Information Theory, 1985.
[168] S. Verdú. Non-asymptotic achievability bounds in multiuser information theory. In Proceedings of Allerton Conference on Communication, Control, and Computing, pages 1–8, Monticello, IL, Oct 2012.
[169] S. Verdú and T. S. Han. A general formula for channel capacity. IEEE Transactions on Information Theory, 40(4):1147–1157, 1994.
[170] D. Wang, A. Ingber, and Y. Kochman. The dispersion of joint source-channel coding. In Proceedings of Allerton Conference on Communication, Control, and Computing, pages 180–187, Monticello, IL, Oct 2011. arXiv:1109.6310 [cs.IT].
[171] D. Wang, A. Ingber, and Y. Kochman. A strong converse for joint source-channel coding. In Proceedings of the International Symposium on Information Theory, pages 2117–2121, Cambridge, MA, Jul 2012.
[172] L. Wang, R. Colbeck, and R. Renner. Simple channel coding bounds. In Proceedings of the International Symposium on Information Theory, pages 1804–1808, Seoul, S. Korea, Jul 2009. arXiv:0901.0834 [cs.IT].
[173] L. Wang and R. Renner. One-shot classical-quantum capacity and hypothesis testing. Physical Review Letters, 108:200501, May 2012.
[174] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.
[175] L. Wasserman, M. Kolar, and A. Rinaldo. Berry–Esseen bounds for estimating undirected graphs. Electronic Journal of Statistics, 8:1188–1224, 2014.
[176] S. Watanabe and M. Hayashi. Strong converse and second-order asymptotics of channel resolvability. In Proceedings of the International Symposium on Information Theory, pages 1882–1886, Honolulu, HI, Jul 2014.
[177] S. Watanabe, S. Kuzuoka, and V. Y. F. Tan. Non-asymptotic and second-order achievability bounds for coding with side-information. IEEE Transactions on Information Theory, 61(4):1574–1605, Apr 2015.
[178] G. Wiechman and I. Sason. An improved sphere-packing bound for finite-length codes over symmetric memoryless channels. IEEE Transactions on Information Theory, 54(5):1962–1990, 2009.
[179] A. R. Williamson, T.-Y. Chen, and R. D. Wesel. Reliability-based error detection for feedback communication with low latency. In Proceedings of the International Symposium on Information Theory, pages 2552–2556, Istanbul, Turkey, Jul 2013. arXiv:1305.4560 [cs.IT].
[180] J. Wolfowitz. The coding of messages subject to chance errors. Illinois Journal of Mathematics, 1(4):591–606, 1957.
[181] J. Wolfowitz. Coding Theorems of Information Theory. Springer-Verlag, New York, 3rd edition, 1978.
[182] A. D. Wyner. On source coding with side information at the decoder. IEEE Transactions on Information Theory, 21(3):294–300, 1975.
[183] A. D. Wyner. The wire-tap channel. The Bell System Technical Journal, 54:1355–1387, 1975.
[184] A. D. Wyner and J. Ziv. The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory, 22(1):1–10, 1976.
[185] H. Yagi and R. Nomura. Channel dispersion for well-ordered mixed channels decomposed into memoryless channels. In Proceedings of the International Symposium on Information Theory and Its Applications, Melbourne, Australia, Oct 2014.
[186] W. Yang, G. Durisi, T. Koch, and Y. Polyanskiy. Quasi-static MIMO fading channels at finite blocklength. IEEE Transactions on Information Theory, 60(7):4232–4265, 2014.
[187] M. H. Yassaee, M. R. Aref, and A. Gohari. Non-asymptotic output statistics of random binning and its applications. In Proceedings of the International Symposium on Information Theory, pages 1849–1853, Istanbul, Turkey, Jul 2013. arXiv:1303.0695 [cs.IT].
[188] M. H. Yassaee, M. R. Aref, and A. Gohari. A technique for deriving one-shot achievability results in network information theory. In Proceedings of the International Symposium on Information Theory, pages 1287–1291, Istanbul, Turkey, Jul 2013. arXiv:1303.0696 [cs.IT].
[189] R. Yeung. A First Course in Information Theory. Springer, 2002.
[190] A. A. Yushkevich. On limit theorems connected with the concept of entropy of Markov chains. Uspekhi Matematicheskikh Nauk, 5(57):177–180, 1953.
[191] Z. Zhang, E.-H. Yang, and V. K. Wei. The redundancy of source coding with a fidelity criterion: Known statistics. IEEE Transactions on Information Theory, 43(1):71–91, 1997.
[192] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.