
ECE 7680

Lecture 7 – Channel Capacity


Objective: To define channel capacity and prove its fundamental properties.

Reading:
1. Read Chapter 6, which has some interesting things in it.
2. Read Chapter 8.

Channel capacity definition and examples


We are now ready to talk about the fundamental concept of the capacity of a
channel. This is a measure of how much information per channel usage we can get
through a channel.
Definition 1 The information channel capacity is defined as the maximum
mutual information,

C = max_{p(x)} I(X; Y ),

where the maximum is taken over all possible input distributions p(x). □
Example 1 Consider the noiseless channel, with error-free transmission. For every
x we put in, we get out a y with no equivocation: C = 1 bit, which occurs when
p(x) = (1/2, 1/2). □

Example 2 Noisy channel with non-overlapping outputs. C = 1 bit, achieved when p(x) = (1/2, 1/2). □

Example 3 Noisy typewriter, with crossover probability 1/2 and 26 input symbols. One
way to think about this: we can use every other symbol (13 of them), each of which gets
through unambiguously, so we can send log 13 bits per use. Or we can compute directly:

C = max I(X; Y ) = max[H(Y ) − H(Y |X)] = max H(Y ) − 1 = log 26 − 1 = log 13. □

Example 4 The BSC, with crossover probability p.


I(X; Y ) = H(Y ) − H(Y |X)
         = H(Y ) − Σ_x p(x) H(Y |X = x)
         = H(Y ) − Σ_x p(x) H(p)
         = H(Y ) − H(p)
         ≤ 1 − H(p),

with equality when the input distribution (and hence the output distribution) is uniform. So

C = 1 − H(p).
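A quick numerical check of this formula (a minimal Python sketch of my own; the function names are not from the notes):

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p = {p:.2f}  ->  C = {bsc_capacity(p):.4f} bits per use")
```

As expected, the capacity falls from 1 bit at p = 0 to 0 bits at p = 1/2, where the output is independent of the input.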

More insight on the BSC from Lucky, p. 62. Suppose that we have a sequence of
information that we send:

X^n = 10110100011011100101111001011000

and the received sequence

Y^n = 10110000011011100111111001111000

The error sequence is therefore

E^n = 00000100000000000010000000100000

If we knew this error sequence, we could recover X^n exactly. How much information does
the error sequence represent? We can think of it as being generated by a Bernoulli
source with parameter p (an error, i.e. a crossover, occurring with probability p),
which produces H(p) bits per symbol. Since we do not know this sequence, that
uncertainty is exactly the amount by which the information we receive is reduced. We thus have

1 − H(p)

bits of useful information left over for every bit received.
□

Example 5 Binary erasure channel: Probability 1 − α of correct, α of erasure.

I(X; Y ) = H(Y ) − H(Y |X) = H(Y ) − H(α)

To maximize this, we must find max H(Y ). There are three possible Y outcomes,
but we cannot achieve H(Y ) = log 3 for any input distribution, so we have to work a bit
harder. Let P (X = 1) = π. Then

P (Y = 0) = P (X = 0)(1 − α) = (1 − π)(1 − α)

P (Y = e) = P (X = 0)α + P (X = 1)α = α

P (Y = 1) = P (X = 1)(1 − α) = π(1 − α).

Then

H(Y ) = −[(1 − π)(1 − α) log((1 − π)(1 − α)) + α log α + π(1 − α) log(π(1 − α))] = H(α) + (1 − α)H(π).

Then

I(X; Y ) = H(α) + (1 − α)H(π) − H(α) = (1 − α)H(π)

This is maximized when π = 1/2, giving C = 1 − α. □
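A short sketch (my own, not from the notes) that maximizes I(X; Y ) = (1 − α)H(π) over π numerically and confirms that the maximum is at π = 1/2 with C = 1 − α:

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)          # avoid log(0) at the endpoints
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bec_mutual_information(pi, alpha):
    """I(X;Y) = (1 - alpha) H(pi) for the binary erasure channel."""
    return (1 - alpha) * binary_entropy(pi)

alpha = 0.3
pis = np.linspace(0.0, 1.0, 1001)
i_vals = bec_mutual_information(pis, alpha)
best = pis[np.argmax(i_vals)]
print(f"maximizing pi ~ {best:.3f},  C ~ {i_vals.max():.4f},  1 - alpha = {1 - alpha:.4f}")
```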


The channel capacity has the following properties:
1. C ≥ 0. (why?)
2. C ≤ log |X |. (why?)

3. C ≤ log |Y|.
4. I(X; Y ) is a continuous function of p(x).
5. I(X; Y ) is a concave function of p(x), so the maximum in the definition of C exists and is attained. (A numerical sketch exploiting this concavity follows.)
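Because I(X; Y ) is concave in p(x) and the set of input distributions is convex, the capacity can also be computed numerically. Below is a minimal sketch (my own, not from the notes) of the Blahut–Arimoto iteration, a standard method for this maximization; the function name and parameters are my choices, and it is checked against the BSC, where C = 1 − H(p).

```python
import numpy as np

def channel_capacity(W, iters=200):
    """Blahut-Arimoto iteration: maximize I(X;Y) over the input distribution p(x).
    W[x, y] = p(y|x) is the channel transition matrix."""
    eps = 1e-300
    nx, ny = W.shape
    r = np.full(nx, 1.0 / nx)                       # start from the uniform input distribution
    for _ in range(iters):
        q = r[:, None] * W                          # r(x) p(y|x)
        q /= np.maximum(q.sum(axis=0, keepdims=True), eps)   # posterior q(x|y)
        s = (W * np.log(np.maximum(q, eps))).sum(axis=1)     # exponent of the update
        r = np.exp(s - s.max())
        r /= r.sum()                                # updated input distribution
    joint = r[:, None] * W                          # p(x, y)
    py = joint.sum(axis=0)                          # output distribution p(y)
    return float((joint * np.log2(np.maximum(W, eps) / np.maximum(py, eps))).sum())

# Check against the BSC with crossover probability 0.1: C = 1 - H(0.1) ~ 0.531 bits
W_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(channel_capacity(W_bsc))
```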

Symmetric channels
We now consider a specific class of channels for which the capacity is fairly easy to
compute: the symmetric channels.
A channel can be characterized by a transition matrix such as

            [ 0.3  0.2  0.5 ]
p(y|x) =    [ 0.5  0.3  0.2 ]  = P
            [ 0.2  0.5  0.3 ]

The indexing is x for rows, y for columns: P_{x,y} = p(y|x).


Definition 2 A channel is said to be symmetric if the rows of the channel transition
matrix p(y|x) are permutations of each other, and the columns are permutations
of each other. A channel is said to be weakly symmetric if every row of the transition
matrix p(·|x) is a permutation of every other row, and all the column sums
Σ_x p(y|x) are equal. □

Theorem 1 For a weakly symmetric channel,

C = log |Y| − H(row of transition matrix).

This is achieved by a (discrete) uniform distribution over the input alphabet.


To see this, let r denote a row of the transition matrix. Then

I(X; Y ) = H(Y ) − H(Y |X) = H(Y ) − H(r) ≤ log |Y| − H(r).

Equality holds if the output distribution is uniform; for a weakly symmetric channel, the uniform input distribution achieves this, since the equal column sums make p(y) = (1/|X |) Σ_x p(y|x) the same for every y.
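As a quick check of the theorem, the example transition matrix above gives C = log 3 − H(0.5, 0.3, 0.2). A small sketch (variable names are mine):

```python
import numpy as np

# The symmetric channel used in the example above: rows of P are p(y|x).
P = np.array([[0.3, 0.2, 0.5],
              [0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3]])

H_row = -(P[0] * np.log2(P[0])).sum()        # entropy of one row (rows are permutations)
C = np.log2(P.shape[1]) - H_row              # C = log|Y| - H(row of transition matrix)
print(f"C = {C:.4f} bits per use")           # ~ 0.0995
```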

A closer look at capacity


The rationale for the coding theorem: “for large block lengths, every channel looks
like the noisy typewriter. The channel has a subset of inputs that produce
essentially disjoint sequences at the output.”

For each typical input n-sequence, there are approximately 2^{nH(Y|X)} possible Y
sequences, each of them more or less equally likely by the AEP. In order to reliably
detect them, we want to ensure that no two X sequences produce the same Y
sequence. The total number of possible (typical) Y sequences is ≈ 2^{nH(Y)}. This has
to be divided into sets of size 2^{nH(Y|X)}, corresponding to the different X sequences.
The total number of disjoint sets is therefore approximately 2^{n(H(Y)−H(Y|X))} =
2^{nI(X;Y)}. Hence we can send at most ≈ 2^{nI(X;Y)} distinguishable sequences of length
n.
Definition 3 A discrete channel, denoted by (X , p(y|x), Y), consists of an input
set X , an output set Y, and a collection of probability mass functions p(y|x), one
for each x ∈ X .
The nth extension of a discrete memoryless channel (DMC) is the channel (X^n, p(y^n|x^n), Y^n),
where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k).

In other words, the distribution of the channel output given the entire sequence of inputs and prior
outputs depends only on the current input (memoryless channel). □
When we talk about data transmission through a channel, the issue of coding
arises. We define a code as follows:
Definition 4 An (M, n) code for the channel (X , p(y|x), Y) consists of:
1. An index set {1, 2, . . . , M } (the set of messages W ).
2. An encoding function X^n : {1, 2, . . . , M } → X^n, yielding codewords X^n(1), X^n(2), . . . , X^n(M ).
3. A decoding function g : Y^n → {1, 2, . . . , M }.
□

In other words, the code takes symbols from W , and encodes them to produce
a sequence of n symbols in X .
The probability of error is defined as

λ_i = Pr(g(Y^n) ≠ i | X^n = X^n(i)).

In other words, if message i is sent but the decoder output is not i, then we
have an error. This can be written using the indicator function I(·) as

λ_i = Pr(g(Y^n) ≠ i | X^n = X^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i).
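To make the definition concrete, here is a small sketch (entirely my own construction) that computes λ_i exactly for a toy (M, n) = (2, 3) repetition code over a BSC(p), using a minimum-Hamming-distance decoder of my choosing rather than the typical-set decoder used later in the proof:

```python
import numpy as np
from itertools import product

p = 0.1
codewords = np.array([[0, 0, 0],
                      [1, 1, 1]])

def g(y):
    """Decode y^n to the index of the nearest codeword (minimum Hamming distance)."""
    return int(np.argmin((codewords != y).sum(axis=1)))

lam = np.zeros(len(codewords))
for i, x in enumerate(codewords):
    for y in product([0, 1], repeat=3):                 # sum over all y^n
        y = np.array(y)
        flips = int((y != x).sum())
        prob = p**flips * (1 - p)**(3 - flips)          # p(y^n | x^n(i))
        lam[i] += prob * (g(y) != i)                    # indicator I(g(y^n) != i)

print(lam)   # each entry = P(2 or 3 flips) = 3 p^2 (1-p) + p^3 = 0.028
```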

In our development, it will be convenient to deal with the maximal probability
of error. If the maximal probability of error can be shown to go to zero, then
clearly the other probabilities of error go to zero as well. The maximal probability of error is

λ^{(n)} = max_i λ_i.

The average probability of error is

P_e^{(n)} = (1/M) Σ_{i=1}^{M} λ_i.

Definition 5 The rate of an (M, n) code is

R = (log M)/n  bits per transmission.
□
Example 6 Suppose M = 4, and codewords are n = 4 bits long. Then every input
symbol supplies 2 bits of information, and takes 4 symbol transmissions to send.
The rate is R = 2/4 = 1/2. □

Definition 6 A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n)
codes such that the maximal probability of error λ^{(n)} tends to 0 as n → ∞. □

Definition 7 The capacity of a DMC is the supremum of all achievable rates. □


This is a different definition from the “information” channel capacity of a DMC
presented above. What we will show (this is Shannon’s theorem) is that the two
definitions are equivalent.
The implication of this definition of capacity is that for an achievable rate,
the probability of error tends to zero as the block length gets large. Since the
capacity is the supremum of the achievable rates, for every rate below capacity
the probability of error goes to zero as the block length grows.

Jointly typical sequences


Recall the definition of a typical sequence: it is the sequence we expect (probabilis-
tically) to occur, given the statistics of the source. We had a theorem regarding
the approximate number of typical sequences, the probability of a typical sequence,
and so forth. In this section we generalize this.
All along this semester we have used the notion of representing sequences of
random variables as vectors. For example, we can represent the r.v.’s (X_1, X_2)
by a vector-valued random variable X. In a sense, this is all that we are doing
with jointly typical sequences. We treat the pair of sequences (x^n, y^n) as if it were
simply a single sequence z^n, and ask: for the set of sequences z^n, what are the typical
sequences?
Definition 8 The set A_ε^{(n)} of jointly typical sequences {(x^n, y^n)} with respect to
the distribution p(x, y) is the set of sequences with empirical entropies (statistical
averages) close to the true entropies:

A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
                 |−(1/n) log p(x^n) − H(X)| < ε,
                 |−(1/n) log p(y^n) − H(Y)| < ε,
             and |−(1/n) log p(x^n, y^n) − H(X, Y)| < ε }.
□
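To make the definition concrete, here is a small Python sketch (my own; it assumes the pair sequence is drawn i.i.d. from a joint pmf pxy with strictly positive entries, and all names are hypothetical) that tests membership in A_ε^{(n)}:

```python
import numpy as np

def is_jointly_typical(x, y, pxy, eps):
    """Test whether (x^n, y^n) lies in the jointly typical set A_eps^(n).

    x, y : integer arrays of length n, indexing into pxy
    pxy  : joint pmf with pxy[a, b] = p(X=a, Y=b); assumed strictly positive
    eps  : typicality tolerance
    """
    n = len(x)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    HX = -(px * np.log2(px)).sum()
    HY = -(py * np.log2(py)).sum()
    HXY = -(pxy * np.log2(pxy)).sum()
    ex = -np.log2(px[x]).sum() / n            # empirical -(1/n) log p(x^n)
    ey = -np.log2(py[y]).sum() / n            # empirical -(1/n) log p(y^n)
    exy = -np.log2(pxy[x, y]).sum() / n       # empirical -(1/n) log p(x^n, y^n)
    return abs(ex - HX) < eps and abs(ey - HY) < eps and abs(exy - HXY) < eps

# Example: draw an i.i.d. pair sequence from a BSC(0.1) with uniform input
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])
rng = np.random.default_rng(0)
idx = rng.choice(4, size=1000, p=pxy.ravel())
x, y = np.unravel_index(idx, pxy.shape)
print(is_jointly_typical(x, y, pxy, eps=0.1))
```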
From this definition, we can conclude similar sorts of things about jointly typical
sequences as we can about typical sequences:

Theorem 2 (Joint AEP).


1. Pr((X^n, Y^n) ∈ A_ε^{(n)}) → 1 as n → ∞. (As the block length gets large, the
probability that we observe a jointly typical sequence approaches 1.)

2. |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}. (The number of jointly typical sequences is close to the
number of representations determined by the entropy of (X, Y ).)

3. If (X̃^n, Ỹ^n) ∼ p(x^n)p(y^n), that is, X̃^n and Ỹ^n are independent with the same
marginals as p(x^n, y^n), then

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y)−3ε)}.

For sufficiently large n,

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≥ (1 − ε) 2^{−n(I(X;Y)+3ε)}.

Proof By the WLLN,

−(1/n) log p(X^n) → −E log p(X) = H(X),

so that, by the definition of convergence in probability, for any ε > 0 there is an n1
such that for n > n1,

Pr( |−(1/n) log p(X^n) − H(X)| > ε ) < ε/3.

Similar WLLN statements hold for the Y sequences and for the joint (X, Y ) sequences
(the z^n sequences above): there is an n2 and an n3 such that for n > n2,

Pr( |−(1/n) log p(Y^n) − H(Y)| > ε ) < ε/3,

and for n > n3,

Pr( |−(1/n) log p(X^n, Y^n) − H(X, Y)| > ε ) < ε/3.

So for n > max(n1, n2, n3), the probability of the union of the three exceptional sets must be
< ε. Hence the probability of the set A_ε^{(n)} must be > 1 − ε.
The second part is proved exactly as for typical sequences:

1 = Σ_{x^n, y^n} p(x^n, y^n) ≥ Σ_{(x^n,y^n)∈A_ε^{(n)}} p(x^n, y^n) ≥ |A_ε^{(n)}| 2^{−n(H(X,Y)+ε)},

so that

|A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}.

For the third part, we note first that for n sufficiently large,

Pr(A_ε^{(n)}) ≥ 1 − ε,

so we can write

1 − ε ≤ Σ_{(x^n,y^n)∈A_ε^{(n)}} p(x^n, y^n) ≤ |A_ε^{(n)}| 2^{−n(H(X,Y)−ε)},

from which

|A_ε^{(n)}| ≥ (1 − ε) 2^{n(H(X,Y)−ε)}.

Now if X̃ and Ỹ are independent with the same marginals as X and Y ,

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) = Σ_{(x^n,y^n)∈A_ε^{(n)}} p(x^n) p(y^n)
                           ≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
                           = 2^{−n(I(X;Y)−3ε)}

(since there are at most 2^{n(H(X,Y)+ε)} terms in the sum). A similar argument, using the
lower bound on |A_ε^{(n)}| derived above, gives the last inequality. □
There are about 2^{nH(X)} typical X sequences, about 2^{nH(Y)} typical Y sequences,
and only about 2^{nH(X,Y)} jointly typical sequences. This means that if we choose a
typical X sequence and independently choose a typical Y sequence (without regard
to the X sequence), the pair (X^n, Y^n) so chosen will not, in general, be jointly
typical. In fact, from the last part of the theorem, the probability that the sequences
chosen independently will be jointly typical is about 2^{−nI(X;Y)}. This means that we
would have to try (at random) about 2^{nI(X;Y)} sequence pairs before we choose a
jointly typical pair. Thinking now in terms of fixing Y^n and choosing X sequences at random,
this suggests that there are about 2^{nI(X;Y)} distinguishable X sequences.
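As a worked illustration of these counts (the channel and block length are my own illustrative choices): for a BSC(0.1) with uniform input, H(X) = H(Y) = 1 and H(X, Y) = 1 + H(0.1) ≈ 1.469 bits.

```python
import numpy as np

n, p = 100, 0.1
Hp = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
HX, HY, HXY = 1.0, 1.0, 1.0 + Hp
I = HX + HY - HXY                      # = 1 - H(p) ~ 0.531 bits

print(f"typical X sequences   ~ 2^{n * HX:.0f}")
print(f"typical Y sequences   ~ 2^{n * HY:.0f}")
print(f"jointly typical pairs ~ 2^{n * HXY:.1f}")
print(f"P(independently chosen typical pair is jointly typical) ~ 2^-{n * I:.1f}")
```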

The channel coding theorem and its proof


We are now ready for the big theorem of the semester.

Theorem 3 (The channel coding theorem) All rates below capacity C are achievable.
That is, for every rate R < C, there exists a sequence of (2^{nR}, n)
codes with maximum probability of error λ^{(n)} → 0.
Conversely, any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 must have R ≤ C.
Note that the theorem only asserts that such a sequence of codes exists. The proof of the theorem is not constructive, and hence does
not tell us how to find the codes.
Also note that the theorem is asymptotic. To obtain an arbitrarily small probability
of error, the block length may have to be arbitrarily long.
Being the big proof of the semester, it will take some time. As we grind through,
don’t forget to admire the genius of the guy who was able to conceive of such an
intellectual tour de force. Aren’t you glad you weren’t assigned to prove this as
homework!
Proof We prove that rates R < C are achievable. The converse will wait.
Fix p(x). The first important step is to generate a (2^{nR}, n) code at random
using the distribution p(x). (The idea of using a random code was one of Shannon’s
key contributions.) Specifically, we draw 2^{nR} codewords according to

p(x^n) = Π_{i=1}^{n} p(x_i).

Now we will list these codewords as the rows of a matrix

        [ x_1(1)       x_2(1)       . . .   x_n(1)      ]
        [ x_1(2)       x_2(2)       . . .   x_n(2)      ]
C =     [     .              .                   .      ]
        [ x_1(2^{nR})  x_2(2^{nR})  . . .   x_n(2^{nR}) ]

Each element in the matrix is independently generated according to p(x). The
probability that we generate any particular code is

Pr(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^{n} p(x_i(w)).

The communication process is accomplished as follows:


1. A random code C is generated as described. The code is made available (prior
to transmission) to both the sender and the receiver. Also, both sender and
receiver are assumed to know the channel transition matrix p(y|x).
2. A message W is chosen by the sender according to a uniform distribution,

Pr(W = w) = 2^{−nR},   w = 1, 2, . . . , 2^{nR},

and the wth codeword X^n(w) (corresponding to the wth row of the matrix C)
is sent over the channel.
3. The receiver receives the sequence Y^n according to the distribution

P(y^n | x^n(w)) = Π_{i=1}^{n} p(y_i | x_i(w))

(the fact that the probabilities multiply is a consequence of the
memoryless nature of the channel).
4. Based on the received sequence, the receiver tries to determine which message
was sent. In the interest of making the decision procedure tractable for this
proof, a typical-set decoding procedure is used (a toy simulation of this
procedure is sketched after this list). The receiver decides that
message Ŵ was sent if
• (X^n(Ŵ), Y^n) is jointly typical (in other words, we could expect this to
happen), and
• there is no other index k such that (X^n(k), Y^n) ∈ A_ε^{(n)}. If there is such a k, the
receiver cannot tell which message should be decoded, and declares an
error.
5. If W ≠ Ŵ, then there is a decoding error.
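Here is the toy simulation referred to in step 4: a self-contained Python sketch (all names and parameter values are my own) that generates a random codebook, sends a uniformly chosen message over a BSC(0.1) at rate R = 0.2 < C ≈ 0.53, and decodes with a simplified typicality test. For a BSC with uniform input, joint typicality of (x^n, y^n) essentially reduces to the observed flip fraction being within ε of p. At this small block length decoding still fails occasionally; the theorem is about n → ∞.

```python
import numpy as np

rng = np.random.default_rng(1)
n, R, p, eps = 50, 0.2, 0.1, 0.1
M = int(np.ceil(2 ** (n * R)))                    # 2^{nR} = 1024 messages

codebook = rng.integers(0, 2, size=(M, n))        # step 1: random (2^{nR}, n) code, p(x) uniform
w = rng.integers(M)                               # step 2: message chosen uniformly
y = codebook[w] ^ (rng.random(n) < p)             # step 3: BSC(p) flips each bit with prob. p

# step 4: simplified typical-set decoding -- accept codewords whose empirical crossover
# fraction with y is within eps of p; declare an error unless exactly one codeword fits
flips = (codebook ^ y).mean(axis=1)
hits = np.flatnonzero(np.abs(flips - p) < eps)
w_hat = int(hits[0]) if hits.size == 1 else None

print("decoded correctly:", w_hat == w)           # step 5: error if w_hat != w
```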
The next key step is to find the average probability of error, averaged over
all codewords in the codebook and over all codebooks C. This is a key
idea, because of the following observation: if the probability of error averaged
over random codebooks can be made arbitrarily small (as
we demonstrate below), then there must be at least one fixed codebook
with a small average probability of error. So the random codebook, and
the average over random codebooks, is simply a mechanism to prove that a good code
exists.

P(E) = Σ_C P(C) P_e^{(n)}(C)
     = Σ_C P(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C)
     = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C P(C) λ_w(C).

By the symmetry of the code construction, the average probability of error averaged
over all codes does not depend on the particular index that was sent, so we can
assume without loss of generality that w = 1.
Define the events

E_i = { (X^n(i), Y^n) ∈ A_ε^{(n)} },   i ∈ {1, 2, . . . , 2^{nR}}.

That is, E_i is the event that Y^n is jointly typical with the ith codeword. An error
occurs if
• E_1^c occurs: Y^n is not jointly typical with the first codeword, or
• E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}} occurs (a wrong codeword is jointly typical with the
received sequence).
Let P(E) denote Pr(E | W = 1). Then we have

P(E) = P(E_1^c ∪ E_2 ∪ E_3 ∪ · · · ∪ E_{2^{nR}})
     ≤ P(E_1^c) + Σ_{i=2}^{2^{nR}} P(E_i)

(the union bound).


Finally we can use the typicality properties. By the AEP,

P(E_1^c) ≤ ε

for n sufficiently large. And the probability that Y^n and X^n(i) are jointly typical
(for i ≠ 1) is ≤ 2^{−n(I(X;Y)−3ε)}, since X^n(i) is independent of Y^n. Hence

P(E) ≤ ε + Σ_{i=2}^{2^{nR}} 2^{−n(I(X;Y)−3ε)}
     = ε + (2^{nR} − 1) 2^{−n(I(X;Y)−3ε)}
     ≤ ε + 2^{−n(I(X;Y)−3ε−R)}
     ≤ 2ε

if n is sufficiently large and R < I(X; Y) − 3ε.


Hence, if R < I(X; Y), we can choose ε and n so that the average probability of
error, averaged over codebooks and codewords, is less than 2ε.
We are not quite there yet, in light of all the averaging we had to do. Here are
the final touch-ups:
1. In the proof, choose p(x) to be the distribution that achieves capacity. Then
the condition R < I(X; Y) can be replaced by the achievability condition R < C.

2. Since the average probability of error, averaged over all codebooks, is small,
there exists at least one codebook C∗ with a small average probability of
error: P_e^{(n)}(C∗) ≤ 2ε. (How to find it?! Is exhaustive search an option?)

3. Observe that the best half of the codewords must each have a probability of error less
than 4ε (otherwise we could not achieve an average less than 2ε).
We can throw away the worst half of the codewords, giving us a rate

R′ = R − 1/n,

and the loss of 1/n is asymptotically negligible.

Hence we can conclude that for the best codebook, the maximal probability of error (not
just the average) is ≤ 4ε. □
Whew!
The proof used a random-coding argument to make the math tractable. In practice,
random codes are never used. Good, useful codes are a matter of considerable
interest and practical importance.
