
Channel Capacity

In this chapter we find the maximum number of distinguishable signals for n uses of a communication channel. This number grows exponentially with n, and the exponent is known as the channel capacity.

1
Channel Capacity

If two different input sequences may give rise to the same output sequence, the inputs are confusable.

We need to reconstruct the input sequences at the output with a negligible probability of error.

By mapping the source into the appropriate “widely spaced” input sequences to the channel, we can transmit a message with very low probability of error and reconstruct the source message at the output.

The maximum rate at which this can be done is called the capacity of the channel.
2
Channel Capacity

Discrete Channel: We define a discrete channel to be a system consisting of an input alphabet X and output alphabet Y and a probability transition matrix p(y|x) that expresses the probability of observing the output symbol y given that we send the symbol x.

The channel is said to be memoryless if the probability distribution of the output depends only on the input at that time and is conditionally independent of previous channel inputs or outputs.

3
Channel Capacity

Channel Capacity: We define the “information” channel capacity of a discrete memoryless channel as

C = max I(X; Y),

where the maximum is taken over all possible input distributions p(x).

The operational definition of channel capacity is the highest rate, in bits per channel use, at which information can be sent with arbitrarily low probability of error.

Shannon’s second theorem establishes that the information channel capacity is equal to the operational channel capacity. Thus, we drop the word “information” in most discussions of channel capacity.
4
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noiseless Binary Channel
The binary input is reproduced exactly at the output, as shown in the figure.

In this case, any transmitted bit is received without error. Hence, one error-free bit can be transmitted per use of the channel.

We can also calculate the information capacity C = max I(X; Y) = 1 bit, which is achieved by using p(x) = (1/2, 1/2).
5
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noisy Channel with Nonoverlapping Outputs
This channel has two possible outputs corresponding to each of the two inputs, as shown in the figure. The channel appears to be noisy, but really is not.

FIGURE: Noisy channel with nonoverlapping outputs. C = 1 bit.


6
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noisy Typewriter

In this case the channel input is either received unchanged at the output with probability ½ or is transformed into the next letter with probability ½, as shown in the figure. If the input has 26 symbols and we use every alternate input symbol, we can transmit one of 13 symbols without error with each transmission. Hence, the capacity of this channel is log 13 bits per transmission.

We can also calculate the information capacity C = max I(X; Y) = max (H(Y) − H(Y|X)) = max H(Y) − 1 = log 26 − 1 = log 13, achieved by using p(x) distributed uniformly over all the inputs.
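As a quick numeric check of this value (a minimal Python sketch; the alphabet size 26 comes from the example above):

import math

alphabet_size = 26
# Each input maps to two equally likely outputs, so H(Y|X) = 1 bit.
capacity = math.log2(alphabet_size) - 1
print(capacity, math.log2(13))   # both print 3.700439718141092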

7
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Noisy Typewriter

8
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Symmetric Channel
An example of a BSC is shown in the following figure. This is a binary channel in which the input symbols are complemented with probability p.

Figure: Binary symmetric channel. C = 1 − H(p) bits.


9
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Symmetric Channel

We can write the mutual information as

I(X; Y) = H(Y) − H(Y|X)
        = H(Y) − Σ_x p(x) H(Y|X = x)
        = H(Y) − Σ_x p(x) H(p)
        = H(Y) − H(p)
        ≤ 1 − H(p),
10
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Symmetric Channel

where the last inequality follows because Y is a binary random variable. Equality is achieved when the input distribution is uniform. Hence, the information capacity of a binary symmetric channel with parameter p is C = 1 − H(p) bits.
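A minimal Python sketch of this formula (the crossover probability 0.11 is just an illustrative value, not one used in the text):

import math

def binary_entropy(p):
    # H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    # C = 1 - H(p) bits per channel use
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.11))   # ≈ 0.5 bits per channel use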

11
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
The analog of the BSC in which some bits are lost (rather than corrupted) is the binary erasure channel. In this channel, a fraction α of the bits are erased. The receiver knows which bits have been erased. The binary erasure channel has two inputs and three outputs, as shown in the figure.

12
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
Capacity of the binary erasure channel:

C = max I(X; Y) = max (H(Y) − H(Y|X)) = max H(Y) − H(α)
13
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
We may think that H(Y) is log 3, but we cannot achieve this by any choice of input distribution p(x). Let E be the event {Y = e}. Using the expansion

H(Y) = H(Y, E) = H(E) + H(Y|E),

and letting Pr(X = 1) = π, we have

H(Y) = H(α) + (1 − α)H(π).

14
Channel Capacity
EXAMPLES OF CHANNEL CAPACITY
Binary Erasure Channel
Hence the capacity is

C = max_π [H(α) + (1 − α)H(π)] − H(α) = (1 − α) max_π H(π) = 1 − α,

where the capacity is achieved by π = 1/2.
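The maximization over π can also be checked numerically; the sketch below (with an illustrative erasure probability α = 0.3) sweeps π and confirms that I(X; Y) = (1 − α)H(π) peaks at π = 1/2 with value 1 − α:

import math

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alpha = 0.3
best_value, best_pi = max(((1 - alpha) * H(pi), pi) for pi in (i / 1000 for i in range(1001)))
print(best_value, best_pi)   # ≈ 0.7 (= 1 - alpha) at pi = 0.5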

15
Channel Capacity
Symmetric Channel
Consider a channel with the following transition matrix:

All the rows of the probability transition matrix are permutations of each other, and so are the columns. Such a channel is said to be symmetric. Another example of a symmetric channel is one of the form

16
Channel Capacity
Symmetric Channel
Another example of a symmetric channel is one of the form

Y = X + Z (mod c),

where Z has some distribution on the integers {0, 1, 2, . . . , c − 1}, X has the same alphabet as Z, and Z is independent of X.

In both cases we can find the channel capacity as follows. Letting r be a row of the transition matrix, we have

I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(r) ≤ log |Y| − H(r),

17
Channel Capacity
Symmetric Channel
with equality if the output distribution is uniform. But p(x) = 1/|X| achieves a uniform distribution on Y, as seen from

p(y) = Σ_x p(y|x) p(x) = (1/|X|) Σ_x p(y|x) = c/|X| = 1/|Y|,

where c is the sum of the entries in one column of the probability transition matrix.

Thus, the channel with the above transition matrix has the capacity

C = max I(X; Y) = log |Y| − H(r),

and C is achieved by a uniform distribution on the input.
18


Channel Capacity
Symmetric Channel
Definition A channel is said to be symmetric if the rows of the
channel transition matrix p(y|x) are permutations of each other and
the columns are permutations of each other. A channel is said to be
weakly symmetric if every row of the transition matrix p(·|x) is a
permutation of every other row and all the column sums are equal.
For example, the channel with transition matrix

is weakly symmetric but not symmetric.

19
Channel Capacity
Symmetric Channel
Theorem: For a weakly symmetric channel,

C = log |Y| − H(row of transition matrix),

and this is achieved by a uniform distribution on the input alphabet.
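A minimal Python sketch of this formula; the transition matrix below is an illustrative weakly symmetric example chosen here, not necessarily the one shown on the slide:

import math

def weakly_symmetric_capacity(P):
    # C = log2|Y| - H(row), valid when every row is a permutation of every other
    # row and all column sums are equal (weak symmetry).
    row = P[0]
    H_row = -sum(p * math.log2(p) for p in row if p > 0)
    return math.log2(len(row)) - H_row

P = [[1/3, 1/6, 1/2],
     [1/3, 1/2, 1/6]]
print(weakly_symmetric_capacity(P))   # log2(3) - H(1/3, 1/6, 1/2) ≈ 0.126 bits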

20
Channel Capacity
Properties of Channel Capacity

1. C ≥ 0, since I(X; Y) ≥ 0.

2. C ≤ log |X|, since C = max I(X; Y) ≤ max H(X) = log |X|.

3. C ≤ log |Y|, for the same reason.

4. I(X; Y) is a continuous function of p(x).

5. I(X; Y) is a concave function of p(x) (Theorem 2.7.4).

Since I(X; Y) is a concave function over a closed convex set, a local maximum is a global maximum. The maximum can then be found by standard nonlinear optimization techniques such as gradient search.

21
Channel Capacity
Properties of Channel Capacity

Some of the methods that can be used include the following:

• Constrained maximization using calculus and the Kuhn–Tucker conditions.
• The Frank–Wolfe gradient search algorithm.
• An iterative algorithm developed by Arimoto [25] and Blahut [65] (sketched below).

In general, there is no closed-form solution for the capacity. But for many simple channels it is possible to calculate the capacity using properties such as symmetry.
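A minimal sketch of the Arimoto–Blahut iteration (my own illustrative implementation, not code from the text); P[x][y] holds p(y|x), and the BSC used in the example is an assumption chosen so the answer is easy to check against C = 1 − H(0.11) ≈ 0.5 bits:

import math

def blahut_arimoto(P, iters=200):
    # Alternating maximization for C = max over p(x) of I(X; Y); returns (capacity in bits, p(x)).
    nx, ny = len(P), len(P[0])
    p = [1.0 / nx] * nx                      # start from the uniform input distribution
    for _ in range(iters):
        q = [sum(p[x] * P[x][y] for x in range(nx)) for y in range(ny)]   # output distribution
        # c[x] = exp( D( p(.|x) || q ) ), computed in nats
        c = [math.exp(sum(P[x][y] * math.log(P[x][y] / q[y])
                          for y in range(ny) if P[x][y] > 0)) for x in range(nx)]
        Z = sum(p[x] * c[x] for x in range(nx))
        p = [p[x] * c[x] / Z for x in range(nx)]
    return math.log2(Z), p

print(blahut_arimoto([[0.89, 0.11], [0.11, 0.89]]))   # (≈ 0.5, [0.5, 0.5])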

22
Channel Capacity
Preview of the Channel Coding Theorem
Now we prove Shannon’s second theorem, which gives an
operational meaning to the definition of capacity as the number of
bits we can transmit reliably over the channel.

Why can we transmit C bits of information over a channel?

The basic idea is that for large block lengths, every channel looks
like the noisy typewriter channel (Figure 7.4) and the channel has a
subset of inputs that produce essentially disjoint sequences at the
output.

23
Channel Capacity
Preview of the Channel Coding Theorem
For each (typical) input n-sequence, there are approximately 2^{nH(Y|X)} possible Y sequences, all of them equally likely (Figure 7.7). We should ensure that no two X sequences produce the same Y output sequence. Otherwise, we will not be able to decide which X sequence was sent.

The total number of possible (typical) Y sequences is ≈ 2^{nH(Y)}. This set has to be divided into sets of size 2^{nH(Y|X)} corresponding to the different input X sequences. The total number of disjoint sets is less than or equal to 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)}. Hence, we can send at most ≈ 2^{nI(X;Y)} distinguishable sequences of length n.
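As a rough numeric illustration (using a BSC with crossover 0.11 as the channel, an assumption made only for this example):

import math

n, p = 1000, 0.11
H_p = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
I = 1 - H_p                 # I(X; Y) for a BSC with a uniform input
print(n * I)                # ≈ 500: about 2**500 distinguishable inputs of length 1000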
24
Channel Capacity
Some Definitions
Suppose a message W, drawn from the index set {1, 2, . . . , M}, results in the signal X^n(W) being sent by the transmitter, which is received by the receiver as a random sequence Y^n ∼ p(y^n|x^n).

The receiver then guesses the index W by an appropriate decoding rule Ŵ = g(Y^n). The receiver makes an error if Ŵ is not the same as the index W that was transmitted.
25
Channel Capacity
Some Definitions

Discrete Channel: A discrete channel, denoted by (X, p(y|x), Y), consists of two finite sets X and Y and a collection of probability mass functions p(y|x), one for each x ∈ X, such that for every x and y, p(y|x) ≥ 0, and for every x, Σ_y p(y|x) = 1, with the interpretation that X is the input and Y is the output of the channel.

Definition The nth extension of the discrete memoryless channel (DMC) is the channel (X^n, p(y^n|x^n), Y^n), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),   k = 1, 2, . . . , n.

26
Channel Capacity
Some Definitions

If the channel is used without feedback [i.e., if the input symbols do not depend on the past output symbols, namely, p(x_k | x^{k−1}, y^{k−1}) = p(x_k | x^{k−1})], the channel transition function for the nth extension of the discrete memoryless channel reduces to

p(y^n | x^n) = ∏_{i=1}^{n} p(y_i | x_i).
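A minimal Python sketch of this product formula (the BSC transition matrix and the particular sequences are illustrative choices, not from the text):

def sequence_likelihood(P, x_seq, y_seq):
    # p(y^n | x^n) = product over i of p(y_i | x_i) for a DMC used without feedback
    prob = 1.0
    for x, y in zip(x_seq, y_seq):
        prob *= P[x][y]
    return prob

P = [[0.9, 0.1],   # P[x][y] = p(y|x) for a BSC with crossover 0.1
     [0.1, 0.9]]
print(sequence_likelihood(P, [0, 1, 1, 0], [0, 1, 0, 0]))   # 0.9 * 0.9 * 0.1 * 0.9 = 0.0729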

27
Channel Capacity
Some Definitions
Definition An (M, n) code for the channel (X, p(y|x), Y) consists of the following:

1. An index set {1, 2, . . . , M}.

2. An encoding function X^n : {1, 2, . . . , M} → X^n, yielding codewords x^n(1), x^n(2), . . . , x^n(M). The set of codewords is called the codebook.

3. A decoding function
g : Yn → {1, 2, . . . , M},
which is a deterministic rule that assigns a guess to each
possible received vector.

28
Channel Capacity
Some Definitions
Conditional Probability of Error: Let

λ_i = Pr(g(Y^n) ≠ i | X^n = x^n(i)) = Σ_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i)

be the conditional probability of error given that index i was sent, where I(·) is the indicator function.

Maximal Probability of Error: The maximal probability of error λ^{(n)} for an (M, n) code is defined as

λ^{(n)} = max_{i ∈ {1, 2, . . . , M}} λ_i.
29
Channel Capacity
Some Definitions
Average Probability of Error: The (arithmetic) average probability of error P_e^{(n)} for an (M, n) code is defined as

P_e^{(n)} = (1/M) Σ_{i=1}^{M} λ_i.

Note that if the index W is chosen according to a uniform distribution over the set {1, 2, . . . , M}, and X^n = x^n(W), then

P_e^{(n)} = Pr(W ≠ g(Y^n))

(i.e., P_e^{(n)} is the probability of error). Also, obviously,

P_e^{(n)} ≤ λ^{(n)}.
30
Channel Capacity
Some Definitions
Rate: The rate R of an (M, n) code is

R = (log M) / n bits per transmission.

Achievable rate: A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal probability of error λ^{(n)} tends to 0 as n → ∞.

Later, we write (2^{nR}, n) codes to mean (⌈2^{nR}⌉, n) codes. This will simplify the notation.

31
Channel Capacity
Some Definitions
Capacity: The capacity of a channel is the supremum of all achievable
rates.

Thus, rates less than capacity yield arbitrarily small probability of error
for sufficiently large block lengths.

32
Channel Capacity
Some Definitions
Jointly Typical Sequences

Definition The set A_ε^{(n)} of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x, y) is the set of n-sequences with empirical entropies ε-close to the true entropies:

A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
    | −(1/n) log p(x^n) − H(X) | < ε,
    | −(1/n) log p(y^n) − H(Y) | < ε,
    | −(1/n) log p(x^n, y^n) − H(X, Y) | < ε },
33
Channel Capacity
Some Definitions

where

p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i).
34
Theorem 7.6.1 (Joint AEP) Let (X^n, Y^n) be sequences of length n drawn i.i.d. according to p(x^n, y^n) = ∏_{i=1}^{n} p(x_i, y_i). Then:

1. Pr((X^n, Y^n) ∈ A_ε^{(n)}) → 1 as n → ∞.

2. |A_ε^{(n)}| ≤ 2^{n(H(X,Y) + ε)}.

3. If (X̃^n, Ỹ^n) ∼ p(x^n)p(y^n) [i.e., X̃^n and Ỹ^n are independent with the same marginals as p(x^n, y^n)], then

   Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y) − 3ε)},

   and, for sufficiently large n,

   Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.
35
Proof

1. We show, using the law of large numbers (LLN), that −(1/n) log p(X^n) → H(X), −(1/n) log p(Y^n) → H(Y), and −(1/n) log p(X^n, Y^n) → H(X, Y) in probability, so

Pr((X^n, Y^n) ∈ A_ε^{(n)}) → 1 as n → ∞.
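A small empirical check of the law-of-large-numbers step (a Python sketch with an illustrative Bernoulli(0.3) source, not an example from the text):

import math, random

p1 = 0.3                                            # P(X = 1) for the illustrative source
H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
n = 100000
x = [random.random() < p1 for _ in range(n)]
empirical = -sum(math.log2(p1 if xi else 1 - p1) for xi in x) / n
print(H, empirical)   # -(1/n) log2 p(X^n) concentrates around H(X) ≈ 0.881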

36
2. To prove the second part of the theorem, we have

1 ≥ Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n, y^n) ≥ |A_ε^{(n)}| 2^{−n(H(X,Y) + ε)},

and hence

|A_ε^{(n)}| ≤ 2^{n(H(X,Y) + ε)}.
37
3. Now if X̃^n and Ỹ^n are independent but have the same marginals as X^n and Y^n, then

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) = Σ_{(x^n, y^n) ∈ A_ε^{(n)}} p(x^n) p(y^n) ≤ 2^{n(H(X,Y) + ε)} 2^{−n(H(X) − ε)} 2^{−n(H(Y) − ε)}
38
and hence

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y) − 3ε)}.
39
By similar arguments to the upper bound above, we can also show that for n sufficiently large,

Pr((X̃^n, Ỹ^n) ∈ A_ε^{(n)}) ≥ (1 − ε) 2^{−n(I(X;Y) + 3ε)}.
40
Channel Coding Theorem
Theorem 7.7.1

(Channel coding theorem) For a discrete memoryless channel, all rates below capacity C are achievable. Specifically, for every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0.

Conversely, any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 must have R ≤ C.

41
Channel Coding Theorem
Theorem 7.7.1

Proof: We prove that rates R < C are achievable.

Fix a distribution p(x). Generate a (2^{nR}, n) code at random according to the distribution p(x). Specifically, we generate 2^{nR} codewords independently according to the distribution

p(x^n) = ∏_{i=1}^{n} p(x_i).

We generate the 2^{nR} codewords as the rows of a matrix:

42
Channel Coding Theorem
Theorem 7.7.1

Each entry in this matrix is generated i.i.d. according to p(x). Thus, the probability that we generate a particular code C is

Pr(C) = ∏_{w=1}^{2^{nR}} ∏_{i=1}^{n} p(x_i(w)).

43
Channel Coding Theorem

44
Channel Coding Theorem
Theorem 7.7.1

6. Now consider jointly typical decoding. In jointly typical decoding, the receiver declares that the index Ŵ was sent if the following conditions are satisfied:

• (X^n(Ŵ), Y^n) is jointly typical, and
• there is no other index W′ ≠ Ŵ such that (X^n(W′), Y^n) ∈ A_ε^{(n)}.

If no such Ŵ exists or if there is more than one such index, an error is declared.

45
Channel Coding Theorem
Theorem 7.7.1

Probability of error: We calculate the average over all codes generated at random according to the distribution (7.62). By the symmetry of the code construction, the average probability of error does not depend on the particular index that was sent.

There are two different sources of error when we use jointly typical decoding: either the output Y^n is not jointly typical with the transmitted codeword, or there is some other codeword that is jointly typical with Y^n.
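A small simulation sketch of the random-coding idea (my own illustration: it uses minimum-Hamming-distance decoding, which is maximum likelihood for the BSC, rather than joint-typicality decoding, and all the parameter values are illustrative assumptions):

import random

def simulate_random_code(n=40, R=0.25, p=0.11, trials=30):
    # Random codebook over a BSC(p); since R < C = 1 - H(0.11) ≈ 0.5, the empirical
    # error rate should typically be small (and shrinks as n grows).
    M = 2 ** int(n * R)
    errors = 0
    for _ in range(trials):
        book = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
        w = random.randrange(M)                                   # message to send
        y = [bit ^ (random.random() < p) for bit in book[w]]      # pass codeword through the BSC
        w_hat = min(range(M), key=lambda m: sum(a != b for a, b in zip(book[m], y)))
        errors += (w_hat != w)
    return errors / trials

print(simulate_random_code())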

46
Channel Coding Theorem
Theorem 7.7.1

47
Channel Coding Theorem
Theorem 7.7.1

48
Channel Coding Theorem
Theorem 7.7.1

49
Channel Coding Theorem
Theorem 7.7.1

50
Channel Coding Theorem
Theorem 7.7.1

51
Zero-error Codes


52
Zero-error Codes

53
Channel Coding Theorem
Theorem 7.7.1

54
Universal Codes and Channel Capacity

We define the redundancy of the code as the difference between the expected length of the code and the lower limit for the expected length:

R(p, l) = Σ_x p(x) l(x) − H(p) = Σ_x p(x) log ( p(x) / 2^{−l(x)} ) = D(p ‖ q),

i.e., the relative entropy between p and q,
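A minimal numeric sketch of this quantity (the two distributions are illustrative choices, not from the text):

import math

def redundancy(p, q):
    # Expected extra length in bits/symbol when source p is coded with lengths
    # l(x) = -log2 q(x):  sum_x p(x) l(x) - H(p) = D(p || q)
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]   # true source distribution
q = [1/3, 1/3, 1/3]     # distribution the code lengths were designed for
print(redundancy(p, q))  # ≈ 0.085 bits per symbol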
55
Universal Codes and Channel Capacity

where q(x) = 2^{−l(x)} is the distribution that corresponds to the codeword lengths l(x).

We want to find a code without knowing the true distribution p; thus, we define the minimax redundancy as

R* = min_q max_p D(p ‖ q).

This minimax redundancy is achieved by a distribution q that is at the “center” of the information ball containing the distributions p, that is, the distribution q whose maximum distance from any of the distributions p is minimized (Figure 13.1).
56
Universal Codes and Channel Capacity

To find the distribution q that is as close as possible to all the possible p_θ in relative entropy, consider the following channel:

θ → p_θ(x) → X
58
Universal Codes and Channel Capacity

This is a channel with the rows of the transition matrix equal to the p_θ’s, the possible distributions of the source.

59
Universal Codes and Channel Capacity

60
Universal Codes and Channel Capacity

61
Universal Codes and Channel Capacity

Example:

62
Universal Codes and Channel Capacity

We will show that the minimax redundancy R* is equal to the capacity of this channel, and the corresponding optimal coding distribution is the output distribution of this channel induced by the capacity-achieving input distribution. The capacity of this channel is given by

C = max_{π(θ)} I(θ; X),

where I(θ; X) is the mutual information between the channel input θ and the output X.
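A small numeric sketch of R* = min_q max_θ D(p_θ ‖ q) for two Bernoulli sources (the parameter values 0.2 and 0.8 are illustrative assumptions, not taken from the text):

import math

def D(p, q):
    # Relative entropy in bits between Bernoulli(p) and Bernoulli(q)
    return sum(a * math.log2(a / b) for a, b in ((p, q), (1 - p, 1 - q)) if a > 0)

thetas = [0.2, 0.8]
grid = [i / 1000 for i in range(1, 1000)]
R_star, q_star = min((max(D(t, q) for t in thetas), q) for q in grid)
print(R_star, q_star)   # ≈ 0.278 bits at q = 0.5, i.e. 1 - H(0.2)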

63
Universal Codes and Channel Capacity

64
Universal Codes and Channel Capacity

Proof: Let π(θ) be an input distribution on θ ∈ {1, 2, . . . , m}, and let the induced output distribution be q_π:

q_π(x) = Σ_θ π(θ) p_θ(x).

Then for any distribution q on the output, we have

Σ_θ π(θ) D(p_θ ‖ q) = Σ_θ π(θ) D(p_θ ‖ q_π) + D(q_π ‖ q) ≥ Σ_θ π(θ) D(p_θ ‖ q_π)

65
Universal Codes and Channel Capacity

Proof (continued): For any distribution q on the output, we have

Σ_θ π(θ) D(p_θ ‖ q) ≥ Σ_θ π(θ) D(p_θ ‖ q_π)

for all q, with equality iff q = q_π.

66
Universal Codes and Channel Capacity

Proof (continued): Thus, for all q,

max_θ D(p_θ ‖ q) ≥ Σ_θ π(θ) D(p_θ ‖ q) ≥ Σ_θ π(θ) D(p_θ ‖ q_π) = I(π; X),

and therefore

R* = min_q max_θ D(p_θ ‖ q) = max_π I(π; X) = C,

where the minimum over q is achieved when q = q_π for the capacity-achieving π.

67
