ECE 7680: Lecture 7 – Channel Capacity
Reading:
1. Read Chapter 6, which has some interesting things in it.
2. Read Chapter 8.
C = max_{p(x)} I(X; Y),

where the maximum is taken over all possible input distributions p(x). □
Example 1 Consider the noiseless binary channel, with error-free transmission. For every x we put in, we get out a y with no equivocation: C = 1 bit, which occurs when p(x) = (1/2, 1/2). □
Example 3 Noisy typewriter, with crossover probability 1/2 and 26 input symbols: each letter is received either as itself or as the next letter, each with probability 1/2. One way to think about this: we can use every other symbol, getting 13 symbols through error-free, or log 13 bits per use. Or we can compute

C = max H(Y) − H(Y|X) = log 26 − 1 = log 13. □

For the binary symmetric channel (BSC) with crossover probability p, the same kind of computation gives

C = 1 − H(p).
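As a sanity check, the two ways of counting for the noisy typewriter agree; a quick sketch in Python (nothing assumed beyond the 26-symbol typewriter with crossover 1/2):

```python
from math import log2

# Noisy typewriter: 26 inputs, each received as itself or the next letter
# with probability 1/2 each, so H(Y|X) = 1 bit.
C_formula = log2(26) - 1   # C = max H(Y) - H(Y|X) = log 26 - 1
C_subset = log2(13)        # use every other letter: 13 noise-free symbols

assert abs(C_formula - C_subset) < 1e-12
print(C_formula)  # ≈ 3.70 bits per channel use
```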
More insight on the BSC from Lucky, p. 62. Suppose that we have a sequence of
information that we send:
X^n = 10110100011011100101111001011000
Y^n = 10110000011011100111111001111000
E^n = 00000100000000000010000000100000
If we knew the error sequence E^n, we could recover X^n from Y^n exactly. How much information does this sequence represent? We can think of it as being generated by a Bernoulli source with probability p (an error with probability p representing a crossover).
Since we don't know the error sequence, its entropy H(p) per bit represents the amount by which the information we receive drops: we thus have

1 − H(p)

bits of useful information left over for every bit of information received. □
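This accounting is easy to check numerically. A minimal sketch; the choice p = 3/32 (the empirical error rate of the E^n sequence above) is just an illustration:

```python
from math import log2

def H(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

p = 3 / 32        # empirical crossover rate of the example error sequence
C = 1 - H(p)      # useful bits left per received bit
print(C)          # ≈ 0.55
```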
To maximize this, we must find max H(Y). There are three possible Y outcomes, but we cannot achieve H(Y) = log 3 for any input distribution. We have to work a bit harder. Let P(X = 1) = π. Then

P(Y = 0) = P(X = 0)(1 − α) = (1 − π)(1 − α)
P(Y = e) = P(X = 0)α + P(X = 1)α = α
P(Y = 1) = P(X = 1)(1 − α) = π(1 − α)

Then

H(Y) = −[(1 − π)(1 − α) log((1 − π)(1 − α)) + α log α + π(1 − α) log(π(1 − α))]
     = H(α) + (1 − α)H(π).
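The grouping identity H(Y) = H(α) + (1 − α)H(π) can be spot-checked numerically; in this sketch the values α = 0.3 and π = 0.4 are arbitrary test points:

```python
from math import log2

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

alpha, pi = 0.3, 0.4   # erasure probability and P(X = 1); arbitrary

# P(Y=0), P(Y=e), P(Y=1) as derived above
probs = [(1 - pi) * (1 - alpha), alpha, pi * (1 - alpha)]
H_Y = -sum(p * log2(p) for p in probs)   # H(Y) computed directly

assert abs(H_Y - (H2(alpha) + (1 - alpha) * H2(pi))) < 1e-12
```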
Then

I(X; Y) = H(Y) − H(Y|X) = H(α) + (1 − α)H(π) − H(α) = (1 − α)H(π),

which is maximized by π = 1/2, giving the capacity of the erasure channel: C = 1 − α.
Some properties of the channel capacity:
1. C ≥ 0, since I(X; Y) ≥ 0.
2. C ≤ log |X|, since C = max I(X; Y) ≤ max H(X) = log |X|.
3. C ≤ log |Y|, for the same reason.
4. I(X; Y) is a continuous function of p(x).
5. I(X; Y) is a concave function of p(x), so it has a maximum.
Symmetric channels
We now consider a specific class of channels for which the capacity is fairly easy to compute, the symmetric channels.
A channel can be characterized by a transition matrix such as

           [ .3  .2  .5 ]
p(y|x) =   [ .5  .3  .2 ]  = P
           [ .2  .5  .3 ]
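For a symmetric channel every row of P is a permutation of every other row (and likewise the columns), and the standard result, used here without re-deriving it, is that C = log |Y| − H(row), achieved by the uniform input distribution. A quick check on the matrix above:

```python
from math import log2

row = [0.3, 0.2, 0.5]   # any row of P; the others are permutations of it

H_row = -sum(p * log2(p) for p in row)
C = log2(len(row)) - H_row   # C = log|Y| - H(row), uniform input achieves it

print(C)  # ≈ 0.0995 bits per channel use
```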
For each typical input n-sequence, there are approximately 2nH(Y |X) possible Y
sequences, each of them more or less equally likely by the AEP. In order to reliably
detect them, we want to ensure that no two X sequences produce the same Y
sequence. The total number of possible (typical) Y sequences is ≈ 2nH(Y ) . This has
to be divided into sets of size 2nH(Y |X) , corresponding to the different X sequences.
The total number of disjoint sets is therefore approximately 2n(H(Y )−H(Y |X)) =
2nI(X;Y ) . Hence we can send at most ≈ 2nI(X;Y ) distinguishable sequences of length
n.
Definition 3 A discrete channel, denoted by (X, p(y|x), Y), consists of an input set X, an output set Y, and a collection of probability mass functions p(y|x), one for each x ∈ X.
The extension of a discrete memoryless channel (DMC) is the channel (X^n, p(y^n|x^n), Y^n), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),  k = 1, 2, . . . , n.

In other words, the distribution of the channel output given a sequence of inputs and prior outputs depends only on the current input (memoryless channel). □
When we talk about data transmission through a channel, the issue of coding
arises. We define a code as follows:
Definition 4 An (M, n) code for the channel (X, p(y|x), Y) consists of:
1. An index set {1, . . . , M} (which represents the set of messages W)
2. A function X n : {1, 2, . . . , M } → X n yielding codewords X n (1), X n (2), . . . , X n (M ).
3. A decoding function g : Y n → {1, 2, . . . , M }.
□
In other words, the code takes symbols from W , and encodes them to produce
a sequence of n symbols in X .
The conditional probability of error given that message i was sent is defined as

λ_i = Pr(g(Y^n) ≠ i | X^n = X^n(i)).

In other words, if the message symbol is i, but the decoded symbol is not i, then we have an error. This can be written using the indicator function I(·) as

λ_i = Pr(g(Y^n) ≠ i | X^n = X^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i).
The maximal probability of error for an (M, n) code is

λ^(n) = max_i λ_i,

and the rate R of an (M, n) code is

R = (log M)/n  bits per transmission. □
Example 6 Suppose M = 4, and codewords are n = 4 bits long. Then every input
symbol supplies 2 bits of information, and takes 4 symbol transmissions to send.
The rate is R = 2/4 = 1/2. □
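The arithmetic of Example 6, spelled out (M and n taken from the example):

```python
from math import log2

M, n = 4, 4              # 4 messages, codewords n = 4 bits long
R = log2(M) / n          # rate in bits per transmission
assert R == 0.5
```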
2. |A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}. (The number of jointly typical sequences is close to the number of representations determined by the entropy of (X, Y).)
3. If (X̃^n, Ỹ^n) ∼ p(x^n)p(y^n), that is, X̃^n and Ỹ^n are independent with the same marginals as p(x^n, y^n), then

Pr((X̃^n, Ỹ^n) ∈ A_ε^(n)) ≤ 2^{−n(I(X;Y)−3ε)}.
Proof By the WLLN,

−(1/n) log p(X^n) → −E log p(X) = H(X),

so that, by the definition of convergence in probability, for any ε > 0 there is an n_1 such that for n > n_1,

Pr(|−(1/n) log p(X^n) − H(X)| > ε) < ε/3.

Similar statements can be made about WLLN convergence for the Y data and the joint (X, Y) data: there is an n_2 and an n_3 such that for n > n_2,

Pr(|−(1/n) log p(Y^n) − H(Y)| > ε) < ε/3,

and for n > n_3,

Pr(|−(1/n) log p(X^n, Y^n) − H(X, Y)| > ε) < ε/3.

So for n > max(n_1, n_2, n_3), the probability of the union of the three sets must be < ε. Hence the probability of the set A_ε^(n) must be > 1 − ε.
The second part of the proof is proved exactly as for typical sequences:

1 = Σ_{x^n, y^n} p(x^n, y^n) ≥ Σ_{(x^n,y^n) ∈ A_ε^(n)} p(x^n, y^n)
  ≥ |A_ε^(n)| 2^{−n(H(X,Y)+ε)},

so that

|A_ε^(n)| ≤ 2^{n(H(X,Y)+ε)}.
For the third part, we note first that for n sufficiently large,

Pr(A_ε^(n)) ≥ 1 − ε,

so we can write

1 − ε ≤ Σ_{(x^n,y^n) ∈ A_ε^(n)} p(x^n, y^n) ≤ |A_ε^(n)| 2^{−n(H(X,Y)−ε)},

from which

|A_ε^(n)| ≥ (1 − ε) 2^{n(H(X,Y)−ε)}.

Now if X̃^n and Ỹ^n are independent with the same marginals as p(x^n, y^n), then

Pr((X̃^n, Ỹ^n) ∈ A_ε^(n)) = Σ_{(x^n,y^n) ∈ A_ε^(n)} p(x^n) p(y^n)
                          ≤ 2^{n(H(X,Y)+ε)} · 2^{−n(H(X)−ε)} · 2^{−n(H(Y)−ε)}
                          = 2^{−n(I(X;Y)−3ε)}

(since there are approximately 2^{n(H(X,Y)+ε)} terms in the sum). Similar arguments can be used to derive the last inequality. □
There are about 2^{nH(X)} typical X sequences, about 2^{nH(Y)} typical Y sequences, and only about 2^{nH(X,Y)} jointly typical sequences. This means that if we choose a
typical X sequence and independently choose a typical Y sequence (without regard
to the X sequence), in not all cases will the sequence (X n , Y n ) so chosen be jointly
typical. In fact, from the last part of the theorem, the probability that the sequences
chosen independently will be jointly typical is about 2−nI(X;Y ) . This means that we
would have to try (at random) about 2nI(X;Y ) sequence pairs before we choose a
jointly typical pair. Thinking now in terms of fixing Y and choosing X at random,
this suggests that there are about 2nI(X;Y ) distinguishable sequences in X.
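This counting argument can be made concrete for a small BSC example. The sketch below takes X uniform on {0,1}^n and Y = X ⊕ Z with Z ~ Bernoulli(0.1); for independently drawn uniform sequences, joint typicality depends only on the Hamming distance, so the probability of an independent pair landing in A_ε^(n) can be computed exactly and compared against the 2^{−n(I−3ε)} bound (n = 100 and ε = 0.1 are arbitrary choices):

```python
from math import comb, log2

def H2(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

pz = 0.1                    # BSC crossover: X ~ Bern(1/2), Y = X xor Z
I = 1 - H2(pz)              # I(X;Y) for this pair
n, eps = 100, 0.1

# For uniform x^n, -(1/n)log p(x^n) = H(X) = 1 exactly (same for y^n), so
# only the joint-typicality condition bites; it depends only on d = d_H(x, y).
def jointly_typical(d):
    val = 1 + (d / n) * (-log2(pz)) + (1 - d / n) * (-log2(1 - pz))
    return abs(val - (1 + H2(pz))) <= eps   # H(X,Y) = 1 + H(pz)

# If x^n and y^n are independent and uniform, d_H(x, y) ~ Binomial(n, 1/2).
p_indep = sum(comb(n, d) for d in range(n + 1) if jointly_typical(d)) / 2**n

print(p_indep, 2 ** (-n * (I - 3 * eps)))  # far below the upper bound
```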
Theorem 3 (The channel coding theorem) All rates below capacity C are achievable. That is, for every ε > 0 and rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0.
Conversely, any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have R ≤ C.
Note that by the statement of the theorem, we only state that “there exists” a
sequence of codes. The proof of the theorem is not constructive, and hence does
not tell us how to find the codes.
Also note that the theorem is asymptotic. To obtain an arbitrarily small prob-
ability of error, the block length may have to be infinitely long.
Being the big proof of the semester, it will take some time. As we grind through,
don’t forget to admire the genius of the guy who was able to conceive of such an
intellectual tour de force. Aren't you glad you weren't assigned to prove this as homework!
Proof We prove that rates R < C are achievable. The converse will wait.
Fix p(x). The first important step is to generate a (2^{nR}, n) code at random using the distribution p(x). (The idea of using a random code was one of Shannon's key contributions.) Specifically, we draw 2^{nR} codewords according to

p(x^n) = ∏_{i=1}^{n} p(x_i).
1. The codebook C is generated at random as just described, and is revealed to both the sender and the receiver.
2. A message w is chosen (say, uniformly over {1, 2, . . . , 2^{nR}}), and the wth codeword X^n(w) (corresponding to the wth row of the C matrix) is sent over the channel.
3. The receiver receives the sequence Y^n according to the distribution

   P(y^n | x^n(w)) = ∏_{i=1}^{n} p(y_i | x_i(w))
(the fact that the probabilities are multiplied together is a consequence of the
memoryless nature of the channel).
4. Based on the received sequence, the receiver tries to determine which message
was sent. In the interest of making the decision procedure tractable for this
proof, a typical set decoding procedure is used. The receiver decides that
message Ŵ was sent if
• (X^n(Ŵ), Y^n) is jointly typical: in other words, we could expect this to happen, and
• There is no other index k such that (X^n(k), Y^n) ∈ A_ε^(n). If some other codeword is also jointly typical with the received sequence, the receiver cannot tell which symbol should be decoded, and declares an error.
5. If W ≠ Ŵ, then there is a decoding error.
The next key step is to find the average probability of error, averaged over all codewords in the codebook and averaged over all codebooks C. This is a key
idea, because of the following observation: If the probability of error obtained by
averaging over random codebooks can be shown to be made arbitrarily small (as
we demonstrate below), then that means that there is at least one fixed codebook
that has a small average probability of error. So the random codebook idea and
the average over random codebooks is simply a mechanism to prove that there does
exist a good code.
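The random-codebook experiment runs easily at toy scale. The sketch below draws a fresh random codebook each trial for a BSC and estimates the error probability averaged over codebooks and messages; for simplicity it uses minimum-distance decoding in place of typical-set decoding (a decoding rule that can only do better), and all parameter values are arbitrary:

```python
import random

random.seed(0)
p = 0.05          # BSC crossover; C = 1 - H(0.05), about 0.71 bits/use
n, M = 16, 8      # block length, messages: R = log2(8)/16 ≈ 0.19 < C
trials = 2000

def bsc(x):
    """Pass a bit list through the BSC."""
    return [b ^ (random.random() < p) for b in x]

def dist(a, b):
    return sum(u != v for u, v in zip(a, b))

errors = 0
for _ in range(trials):
    # fresh random codebook each trial: averaging over codebooks AND messages
    code = [[random.getrandbits(1) for _ in range(n)] for _ in range(M)]
    w = random.randrange(M)         # message chosen uniformly
    y = bsc(code[w])                # received sequence
    w_hat = min(range(M), key=lambda i: dist(code[i], y))
    errors += (w_hat != w)

print(errors / trials)  # small: the average random codebook is already good
```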
P(E) = Σ_C P(C) P_e^(n)(C)
     = Σ_C P(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)
     = (1/2^{nR}) Σ_C Σ_{w=1}^{2^{nR}} P(C) λ_w(C)
By the symmetry of the code construction, the average probability of error averaged
over all codes does not depend on the particular index that was sent, so we can
assume without loss of generality that w = 1.
Define the event

E_i = {(X^n(i), Y^n) ∈ A_ε^(n)},  i ∈ {1, 2, . . . , 2^{nR}}.

That is, E_i is the event that Y^n is jointly typical with the ith codeword. An error occurs if
• E1c occurs: Y n is not jointly typical with the first codeword, or
• E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}} occurs (a wrong codeword is jointly typical with the received sequence).
Let P(E) denote Pr(E | W = 1). We have

P(E) = P(E_1^c ∪ E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}})
     ≤ P(E_1^c) + Σ_{i=2}^{2^{nR}} P(E_i).

By the joint AEP, P(E_1^c) ≤ ε for n sufficiently large. And the probability that Y^n and X^n(i) are jointly typical (for i ≠ 1) is ≤ 2^{−n(I(X;Y)−3ε)}. Hence

P(E) ≤ ε + Σ_{i=2}^{2^{nR}} 2^{−n(I(X;Y)−3ε)}
     = ε + (2^{nR} − 1) 2^{−n(I(X;Y)−3ε)}
     ≤ ε + 2^{−n(I(X;Y)−3ε−R)}
     ≤ 2ε

for n sufficiently large, provided that R < I(X; Y) − 3ε.
1. Choose p(x) in the random code construction to be the input distribution that achieves capacity, so that the condition R < I(X; Y) becomes R < C.
2. Since the average probability of error, averaged over all codebooks, is small, there exists at least one codebook C* with a small average probability of error: P_e^(n)(C*) ≤ 2ε. (How to find it?! Is exhaustive search an option?)
3. Observe that the best half of the codewords must have a probability of error less than 4ε (otherwise we would not be able to achieve the average less than 2ε). We can throw away the worst half of the codewords, giving us a rate

R′ = R − 1/n,

which asymptotically is negligible.
Hence we can conclude that for the best codebook, the maximal probability of error (not the average) is ≤ 4ε. □
Whew!
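The expurgation step at the end ("throw away the worst half") is just a Markov-inequality count, which a few lines make concrete; the λ values below are made up for illustration:

```python
eps = 0.01
# hypothetical per-codeword error probabilities with average <= 2*eps
lam = [0.001, 0.002, 0.039, 0.005, 0.018, 0.003, 0.004, 0.008]
assert sum(lam) / len(lam) <= 2 * eps

# At most half the codewords can have error probability >= 4*eps;
# otherwise those alone would push the average above 2*eps.
bad = [x for x in lam if x >= 4 * eps]
assert len(bad) <= len(lam) // 2

# Keep the best half: every survivor has error probability < 4*eps.
best = sorted(lam)[: len(lam) // 2]
assert max(best) < 4 * eps
```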
The proof used a random-coding argument to make the math tractable. In practice, random codes are never used. Good, useful codes are a matter of considerable practical interest.