
# Chapter 7: Channel Capacity

Peng-Hua Wang
National Taipei University

Chapter Outline
Chap. 7 Channel Capacity
7.1 Examples of Channel Capacity
7.2 Symmetric Channels
7.3 Properties of Channel Capacity
7.4 Preview of the Channel Coding Theorem
7.5 Definitions
7.6 Jointly Typical Sequences
7.7 Channel Coding Theorem
7.8 Zero-Error Codes
7.9 Fano’s Inequality and the Converse to the Coding Theorem

Peng-Hua Wang, April 16, 2012

Information Theory, Chap. 7 - p. 2/62

7.1 Examples of Channel Capacity


Channel Model

■ Operational channel capacity: the number of bits needed to represent the maximum number of distinguishable signals for n uses of a communication channel.
◆ If in n transmissions we can send M signals without error, the channel capacity is log M/n bits per transmission.
■ Information channel capacity: the maximum mutual information between channel input and output.
■ Operational channel capacity is equal to information channel capacity.
◆ This equality is a fundamental theorem and central success of information theory.

Channel Capacity

Definition 1 (Discrete Channel) A system consisting of an input alphabet X, an output alphabet Y, and a probability transition matrix p(y|x).

■ Operational definition of channel capacity: the highest rate in bits per channel use at which information can be sent.

Definition 2 (Channel capacity) The "information" channel capacity of a discrete memoryless channel is

C = max_{p(x)} I(X;Y)

where the maximum is taken over all possible input distributions p(x).

■ Shannon's second theorem: the information channel capacity is equal to the operational channel capacity.
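As a sanity check on Definition 2, I(X;Y) for a finite channel can be computed directly from an input distribution and a transition matrix. A minimal sketch in plain Python (the function name and example values are ours, not from the slides):

```python
import math

def mutual_information(px, P):
    """I(X;Y) in bits for input distribution px and channel matrix P,
    with P[x][y] = p(y|x)."""
    ny = len(P[0])
    # Output distribution p(y) = sum_x p(x) p(y|x)
    py = [sum(px[x] * P[x][y] for x in range(len(px))) for y in range(ny)]
    return sum(px[x] * P[x][y] * math.log2(P[x][y] / py[y])
               for x in range(len(px)) for y in range(ny)
               if px[x] > 0 and P[x][y] > 0)

# Noiseless binary channel (Y = X): I(X;Y) = H(X), maximized at (1/2, 1/2)
print(mutual_information([0.5, 0.5], [[1, 0], [0, 1]]))  # 1.0
```

The capacity C is then the maximum of this quantity over all input distributions px.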

Example 1: Noiseless Binary Channel

p(Y=0) = p(X=0) = π0,  p(Y=1) = p(X=1) = π1 = 1 − π0

I(X;Y) = H(Y) − H(Y|X) = H(Y) ≤ 1

with equality iff π0 = π1 = 1/2. Hence C = 1 bit per transmission.
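The maximization over π0 can be checked numerically. A small sketch (grid granularity is our choice):

```python
import math

def H(p):  # binary entropy in bits
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# For the noiseless binary channel, I(X;Y) = H(Y) = H(pi0); a grid search
# over pi0 confirms the maximum of 1 bit at pi0 = 1/2.
best = max((H(pi0 / 100), pi0 / 100) for pi0 in range(101))
print(best)  # (1.0, 0.5)
```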

Example 2: Noisy Channel with Non-Overlapping Outputs

p(X=0) = π0,  p(X=1) = π1 = 1 − π0,  p = 1/2,  q = 1/3

p(Y=1) = π0 p,  p(Y=2) = π0 (1−p),  p(Y=3) = π1 q,  p(Y=4) = π1 (1−q)

I(X;Y) = H(Y) − H(Y|X) = H(Y) − π0 H(p) − π1 H(q)

Noisy Typewriter

(figure: the noisy typewriter channel)

Noisy Typewriter

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − Σ_x p(x) H(Y|X=x)
       = H(Y) − Σ_x p(x) H(1/2)
       = H(Y) − H(1/2) = H(Y) − 1
       ≤ log 26 − 1 = log 13

C = max_{p(x)} I(X;Y) = log 13
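This value can be verified numerically: with a uniform input over the 26 letters and each input mapping to two outputs with probability 1/2 each, the mutual information comes out to log2(13) bits. A sketch:

```python
import math

# Noisy typewriter: input i goes to output i or i+1 (mod 26),
# each with probability 1/2.
n = 26
px = [1.0 / n] * n
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    P[i][i] = 0.5
    P[i][(i + 1) % n] = 0.5

py = [sum(px[x] * P[x][y] for x in range(n)) for y in range(n)]
I = sum(px[x] * P[x][y] * math.log2(P[x][y] / py[y])
        for x in range(n) for y in range(n) if P[x][y] > 0)
print(I, math.log2(13))  # both ~3.7004 bits
```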

Binary Symmetric Channel (BSC)

(figure: the binary symmetric channel with crossover probability p)

Binary Symmetric Channel (BSC)

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − Σ_x p(x) H(Y|X=x)
       = H(Y) − Σ_x p(x) H(p)
       = H(Y) − H(p)
       ≤ 1 − H(p)

C = max_{p(x)} I(X;Y) = 1 − H(p)
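The formula C = 1 − H(p) can be checked by sweeping the input distribution. A sketch (the crossover value 0.1 and grid size are our choices):

```python
import math

def H2(p):  # binary entropy in bits
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(pi0, p):
    """I(X;Y) = H(Y) - H(p) for a BSC with crossover p and Pr[X=0] = pi0."""
    py0 = pi0 * (1 - p) + (1 - pi0) * p   # Pr[Y = 0]
    return H2(py0) - H2(p)

p = 0.1
C = max(bsc_mi(pi0 / 1000, p) for pi0 in range(1001))
print(C, 1 - H2(p))  # both ~0.5310; the maximum is at pi0 = 1/2
```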

Binary Erasure Channel

(figure: the binary erasure channel with erasure probability α)

Binary Erasure Channel

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − Σ_x p(x) H(Y|X=x)
       = H(Y) − Σ_x p(x) H(α)
       = H(Y) − H(α)

Since H(Y) = (1 − α) H(π0) + H(α), we get I(X;Y) = (1 − α) H(π0) and

C = max_{p(x)} I(X;Y) = 1 − α
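Again the closed form can be verified by a sweep over the input distribution. A sketch (erasure probability 0.25 is our choice):

```python
import math

def H2(p):  # binary entropy in bits
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bec_mi(pi0, a):
    """I(X;Y) = H(Y) - H(a) = (1-a) H(pi0) for a BEC with erasure prob a."""
    return (1 - a) * H2(pi0)

a = 0.25
C = max(bec_mi(pi0 / 1000, a) for pi0 in range(1001))
print(C, 1 - a)  # both 0.75
```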

7.3 Properties of Channel Capacity

Properties of Channel Capacity

■ C ≥ 0.
■ C ≤ log |X|.
■ C ≤ log |Y|.
■ I(X;Y) is a continuous function of p(x).
■ I(X;Y) is a concave function of p(x).
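Because I(X;Y) is concave in p(x), the maximization defining C is a well-behaved optimization. For channels without a closed-form capacity it is commonly computed with the Blahut-Arimoto algorithm, a standard method not covered in these slides; a minimal sketch:

```python
import math

def blahut_arimoto(P, iters=200):
    """Capacity (bits) of the DMC with transition matrix P[x][y] = p(y|x),
    via the standard Blahut-Arimoto iteration (not from these slides)."""
    nx, ny = len(P), len(P[0])
    r = [1.0 / nx] * nx                            # current input distribution
    for _ in range(iters):
        # q(x|y): posterior proportional to r(x) p(y|x)
        q = [[0.0] * nx for _ in range(ny)]
        for y in range(ny):
            s = sum(r[x] * P[x][y] for x in range(nx))
            for x in range(nx):
                q[y][x] = r[x] * P[x][y] / s if s > 0 else 0.0
        # r(x): proportional to exp( sum_y p(y|x) log q(x|y) )
        w = [math.exp(sum(P[x][y] * math.log(q[y][x])
                          for y in range(ny) if P[x][y] > 0))
             for x in range(nx)]
        Z = sum(w)
        r = [wx / Z for wx in w]
    # Evaluate I(X;Y) at the final input distribution
    py = [sum(r[x] * P[x][y] for x in range(nx)) for y in range(ny)]
    return sum(r[x] * P[x][y] * math.log2(P[x][y] / py[y])
               for x in range(nx) for y in range(ny)
               if r[x] > 0 and P[x][y] > 0)

# For the BSC with p = 0.1 this should match 1 - H(0.1) ~ 0.5310
print(blahut_arimoto([[0.9, 0.1], [0.1, 0.9]]))
```

On symmetric channels like this BSC the uniform input is already optimal, so the iteration converges immediately; on asymmetric channels it converges to the capacity-achieving input distribution.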

7.4 Preview of the Channel Coding Theorem

Preview of the Channel Coding Theorem

■ For each input n-sequence, there are approximately 2^{nH(Y|X)} possible Y sequences.
■ The total number of possible (typical) Y sequences is 2^{nH(Y)}.
■ This set has to be divided into sets of size 2^{nH(Y|X)} corresponding to the different input X sequences.
■ The total number of disjoint sets is less than or equal to 2^{nH(Y)}/2^{nH(Y|X)} = 2^{n(H(Y)−H(Y|X))} = 2^{nI(X;Y)}.
■ Hence we can send at most 2^{nI(X;Y)} distinguishable sequences of length n.

Example

■ 6 typical sequences for X^n, 4 typical sequences for Y^n.
■ 12 typical sequences for (X^n, Y^n).
■ For every X^n, there are 2^{nH(X,Y)}/2^{nH(X)} = 2^{nH(Y|X)} = 2 typical Y^n, e.g., for X^n = 001100 ⇒ Y^n = 010100, 101011.

Example

■ Since we have 2^{nH(Y)} = 4 typical Y^n in total, how many typical X^n can these typical Y^n be assigned to? 2^{nH(Y)}/2^{nH(Y|X)} = 2^{n(H(Y)−H(Y|X))} = 2^{nI(X;Y)} = 2, e.g., 001100 and 101000 as codewords.
■ Can we assign more typical X^n? No. For some received Y^n, we could not determine which X^n was sent: if we use 001100, 101101, and 101000 as codewords, we can't determine which codeword was sent when we receive 101011.

7.5 Definitions

Communication Channel

■ Message W ∈ {1, 2, ..., M}.
■ Encoder: input W, output X^n(W) ∈ X^n, where n is the length of the signal. We transmit the signal by using the channel n times, sending one symbol of the signal each time.
■ Channel: input X^n, output Y^n with distribution p(y^n|x^n).
■ Decoder: input Y^n, output Ŵ = g(Y^n), where g(Y^n) is a deterministic decoding rule.
■ If Ŵ ≠ W, an error occurs.

Definitions

Definition 3 (Discrete Channel) A discrete channel, denoted by (X, p(y|x), Y), consists of two finite sets X and Y and a collection of probability mass functions p(y|x).

■ X is the input alphabet and Y the output alphabet; for every input x ∈ X, Σ_y p(y|x) = 1.

Definition 4 (Discrete Memoryless Channel, DMC) The nth extension of the discrete memoryless channel is the channel (X^n, p(y^n|x^n), Y^n) where

p(y_k|x^k, y^{k−1}) = p(y_k|x_k),  k = 1, 2, ..., n.

■ Without feedback: p(x_k|x^{k−1}, y^{k−1}) = p(x_k|x^{k−1}). For the nth extension of the DMC without feedback:

p(y^n|x^n) = Π_{i=1}^n p(y_i|x_i).

Definitions

Definition 5 ((M, n) code) An (M, n) code for the channel (X, p(y|x), Y) consists of the following:

1. An index set {1, 2, ..., M}.
2. An encoding function X^n : {1, 2, ..., M} → X^n. The codewords are x^n(1), x^n(2), ..., x^n(M). The set of codewords is called the codebook.
3. A decoding function g : Y^n → {1, 2, ..., M}.

Definitions

Definition 6 (Conditional probability of error)

λ_i = Pr(g(Y^n) ≠ i | X^n = x^n(i)) = Σ_{y^n : g(y^n) ≠ i} p(y^n|x^n(i)) = Σ_{y^n} p(y^n|x^n(i)) I(g(y^n) ≠ i)

where I(·) is the indicator function.

Definitions

Definition 7 (Maximal probability of error) λ^(n) = max_{i ∈ {1, 2, ..., M}} λ_i

Definition 8 (Average probability of error) P_e^(n) = (1/M) Σ_{i=1}^M λ_i

■ The decoding error is

Pr(g(Y^n) ≠ W) = Σ_{i=1}^M Pr(W = i) Pr(g(Y^n) ≠ i | W = i).

If the index W is chosen uniformly from {1, 2, ..., M}, then P_e^(n) = Pr(g(Y^n) ≠ W).

Definitions

Definition 9 (Rate) The rate R of an (M, n) code is

R = (log M)/n bits per transmission.

Definition 10 (Achievable rate) A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal probability of error λ^(n) tends to 0 as n → ∞.

Definition 11 (Channel capacity) The capacity of a channel is the supremum of all achievable rates.

7.6 Jointly Typical Sequences

Definitions

Definition 12 (Jointly typical sequences) The set A_ǫ^(n) of jointly typical sequences {(x^n, y^n)} with respect to the distribution p(x, y) is defined by

A_ǫ^(n) = { (x^n, y^n) ∈ X^n × Y^n :
    |−(1/n) log p(x^n) − H(X)| < ǫ,
    |−(1/n) log p(y^n) − H(Y)| < ǫ,
    |−(1/n) log p(x^n, y^n) − H(X,Y)| < ǫ }

where p(x^n, y^n) = Π_{i=1}^n p(x_i, y_i).

Joint AEP

Theorem 1 (Joint AEP) Let (X^n, Y^n) be sequences of length n drawn i.i.d. according to p(x^n, y^n). Then:

1. Pr((X^n, Y^n) ∈ A_ǫ^(n)) → 1 as n → ∞.
2. |A_ǫ^(n)| ≤ 2^{n(H(X,Y)+ǫ)}.
3. If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n) [i.e., X̃^n and Ỹ^n are independent with the same marginals], then

Pr((X̃^n, Ỹ^n) ∈ A_ǫ^(n)) ≤ 2^{−n(I(X;Y)−3ǫ)}.

Also, for sufficiently large n,

Pr((X̃^n, Ỹ^n) ∈ A_ǫ^(n)) ≥ (1 − ǫ) 2^{−n(I(X;Y)+3ǫ)}.

Joint AEP

Theorem 2 (Joint AEP, part 1) Pr((x^n, y^n) ∈ A_ǫ^(n)) → 1 as n → ∞.

Proof. Given ǫ > 0, define events A, B, C as

A ≜ { X^n : |−(1/n) log p(X^n) − H(X)| ≥ ǫ }
B ≜ { Y^n : |−(1/n) log p(Y^n) − H(Y)| ≥ ǫ }
C ≜ { (X^n, Y^n) : |−(1/n) log p(X^n, Y^n) − H(X,Y)| ≥ ǫ }

Joint AEP

Then, by the weak law of large numbers, there exist n1, n2, n3 such that

Pr(A) < ǫ/3 ∀n > n1,  Pr(B) < ǫ/3 ∀n > n2,  Pr(C) < ǫ/3 ∀n > n3.

Thus, for all n > max{n1, n2, n3},

Pr((x^n, y^n) ∈ A_ǫ^(n)) = Pr(A^c ∩ B^c ∩ C^c) = 1 − Pr(A ∪ B ∪ C) ≥ 1 − (Pr(A) + Pr(B) + Pr(C)) ≥ 1 − ǫ.  ∎

Joint AEP

Theorem 3 (Joint AEP, part 2) |A_ǫ^(n)| ≤ 2^{n(H(X,Y)+ǫ)}.

Proof.

1 = Σ_{(x^n, y^n)} p(x^n, y^n) ≥ Σ_{(x^n, y^n) ∈ A_ǫ^(n)} p(x^n, y^n) ≥ |A_ǫ^(n)| 2^{−n(H(X,Y)+ǫ)}.

Thus |A_ǫ^(n)| ≤ 2^{n(H(X,Y)+ǫ)}.  ∎

Joint AEP

Theorem 4 (Joint AEP, part 3) If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n) [i.e., X̃^n and Ỹ^n are independent with the same marginals], then

Pr((X̃^n, Ỹ^n) ∈ A_ǫ^(n)) ≤ 2^{−n(I(X;Y)−3ǫ)}.

Also, for sufficiently large n,

Pr((X̃^n, Ỹ^n) ∈ A_ǫ^(n)) ≥ (1 − ǫ) 2^{−n(I(X;Y)+3ǫ)}.

Proof. For the upper bound,

Pr((X̃^n, Ỹ^n) ∈ A_ǫ^(n)) = Σ_{(x^n, y^n) ∈ A_ǫ^(n)} p(x^n) p(y^n)
≤ 2^{n(H(X,Y)+ǫ)} 2^{−n(H(X)−ǫ)} 2^{−n(H(Y)−ǫ)} = 2^{−n(I(X;Y)−3ǫ)}.

For sufficiently large n, Pr(A_ǫ^(n)) ≥ 1 − ǫ, and therefore

1 − ǫ ≤ Σ_{(x^n, y^n) ∈ A_ǫ^(n)} p(x^n, y^n) ≤ |A_ǫ^(n)| 2^{−n(H(X,Y)−ǫ)},

so |A_ǫ^(n)| ≥ (1 − ǫ) 2^{n(H(X,Y)−ǫ)}. Hence

Pr((X̃^n, Ỹ^n) ∈ A_ǫ^(n)) = Σ_{(x^n, y^n) ∈ A_ǫ^(n)} p(x^n) p(y^n)
≥ (1 − ǫ) 2^{n(H(X,Y)−ǫ)} 2^{−n(H(X)+ǫ)} 2^{−n(H(Y)+ǫ)} = (1 − ǫ) 2^{−n(I(X;Y)+3ǫ)}.  ∎

Joint AEP: Conclusion

■ There are about 2^{nH(X)} typical X sequences, and about 2^{nH(Y)} typical Y sequences.
■ There are about 2^{nH(X,Y)} jointly typical sequences.
■ If we randomly choose a pair of a typical X^n and a typical Y^n, the probability that the pair is jointly typical is about 2^{−nI(X;Y)}.

7.7 Channel Coding Theorem

Channel Coding Theorem

Theorem 5 (Channel coding theorem) For every rate R < C, there exists a sequence of (2^{nR}, n) codes with maximum probability of error λ^(n) → 0. Conversely, any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have R ≤ C.

■ We have to prove two parts:
◆ R < C → achievable.
◆ Achievable → R ≤ C.
■ Main ideas:
◆ Random encoding (random code).
◆ Jointly typical decoding.

Random Code

■ Generate a (2^{nR}, n) code at random according to the distribution p(x) (fixed). That is, the 2^{nR} codewords have the distribution

p(x^n) = Π_{i=1}^n p(x_i).

■ A particular code C is the matrix with the 2^{nR} codewords as its rows:

C = [ x_1(1)      x_2(1)      ...  x_n(1)
      x_1(2)      x_2(2)      ...  x_n(2)
      ...
      x_1(2^{nR}) x_2(2^{nR}) ...  x_n(2^{nR}) ]

■ The code C is revealed to both sender and receiver. Both are also assumed to know the channel transition matrix p(y|x).

Random Code

■ There are (|X|^n)^{2^{nR}} different codes.
■ The probability of a particular code C is

Pr(C) = Π_{w=1}^{2^{nR}} Π_{i=1}^n p(x_i(w)).

Transmission and Channel

■ A message W is chosen according to a uniform distribution: Pr[W = w] = 2^{−nR}, w = 1, 2, ..., 2^{nR}.
■ The wth codeword X^n(w), corresponding to the wth row of C, is sent over the channel.
■ The receiver receives a sequence Y^n according to the distribution

P(y^n|x^n(w)) = Π_{i=1}^n p(y_i|x_i(w)).

That is, the DMC is used n times.

Jointly Typical Decoding

■ The receiver declares that the message Ŵ was sent if
◆ (X^n(Ŵ), Y^n) is jointly typical, and
◆ there is no other W′ ≠ Ŵ such that (X^n(W′), Y^n) is jointly typical.
■ If no such Ŵ exists, or if there is more than one, an error is declared (Ŵ = 0).
■ There is a decoding error if Ŵ ≠ W. Let E be the event [Ŵ ≠ W].
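The random-coding scheme can be simulated on a small scale. Below is a toy sketch for a BSC with crossover 0.1 (so C ≈ 0.53 bits) at rate R = 0.2; for tractability it uses minimum-Hamming-distance decoding as a stand-in for jointly typical decoding, and all parameters (n, R, seed) are our choices:

```python
import random

random.seed(7)
n, R, p, trials = 40, 0.2, 0.1, 300
M = 2 ** int(n * R)                       # 2^{nR} = 256 codewords

# One random codebook: every symbol i.i.d. Bernoulli(1/2), as in the proof
book = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]

errors = 0
for _ in range(trials):
    w = random.randrange(M)                           # uniform message
    y = [b ^ (random.random() < p) for b in book[w]]  # BSC(p) output
    # Minimum-Hamming-distance decoding, a stand-in for typicality decoding
    w_hat = min(range(M),
                key=lambda i: sum(a != b for a, b in zip(book[i], y)))
    errors += (w_hat != w)

print(errors / trials)  # small, since R = 0.2 < C ~ 0.53
```

Increasing n (with M = 2^{nR}) drives the empirical error rate toward zero, which is exactly what the achievability proof below establishes.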

Proof of R < C → Achievable

■ The average probability of error, averaged over all codewords in the codebook and over all codebooks, is

Pr(E) = Σ_C Pr(C) P_e^(n)(C) = Σ_C Pr(C) (1/2^{nR}) Σ_{w=1}^{2^{nR}} λ_w(C) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C)

◆ λ_w(C) is defined for jointly typical decoding.
◆ By the symmetry of the code construction, Σ_C Pr(C) λ_w(C) does not depend on w.

Proof of R < C → Achievable

■ Therefore,

Pr(E) = (1/2^{nR}) Σ_{w=1}^{2^{nR}} Σ_C Pr(C) λ_w(C) = Σ_C Pr(C) λ_w(C) for any w = Σ_C Pr(C) λ_1(C) = Pr(E|W = 1).

■ Define E_i = {(X^n(i), Y^n) is a jointly typical pair} for i = 1, 2, ..., 2^{nR}, where Y^n is the channel output when the first codeword X^n(1) was sent. Then a decoding error is declared under
◆ E_1^c: the transmitted codeword and the received sequence are not jointly typical, or
◆ E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}}: a wrong codeword is jointly typical with the received sequence.

Proof of R < C → Achievable

■ Y^n is the channel output when the first codeword X^n(1) was sent.
■ E_1^c: the transmitted codeword and the received sequence are not jointly typical.
■ E_2, E_3, ..., E_{2^{nR}}: wrong codewords that are jointly typical with the received sequence.

Proof of R < C → Achievable

■ The average error:

Pr(E|W=1) = P(E_1^c ∪ E_2 ∪ E_3 ∪ ... ∪ E_{2^{nR}} | W=1) ≤ P(E_1^c|W=1) + Σ_{i=2}^{2^{nR}} P(E_i|W=1)

■ By the joint AEP,

P(E_1^c|W=1) ≤ ǫ for n sufficiently large,
P(E_i|W=1) ≤ 2^{−n(I(X;Y)−3ǫ)} for i ≠ 1, since X^n(i) is independent of Y^n (which was generated by X^n(1)).

Proof of R < C → Achievable

■ We have

Pr(E|W=1) ≤ ǫ + (2^{nR} − 1) 2^{−n(I(X;Y)−3ǫ)} ≤ ǫ + 2^{nR} 2^{−n(I(X;Y)−3ǫ)} = ǫ + 2^{−n(I(X;Y)−R−3ǫ)}

■ If I(X;Y) − R − 3ǫ > 0, then 2^{−n(I(X;Y)−R−3ǫ)} < ǫ for n sufficiently large, and Pr(E|W=1) ≤ 2ǫ.
■ So far, we have proved: for any ǫ, if R < I(X;Y) − 3ǫ and n is sufficiently large, then the average decoding error Pr(E) = Pr(E|W=1) < 2ǫ.
■ What do we need? If R < C, the maximum error probability λ^(n) → 0. (We are almost there.)

Proof of R < C → Achievable, Final Part

■ Choose p(x) such that I(X;Y) achieves the channel capacity C, i.e., choose p(x) such that I(X;Y) is maximized. Then the condition R < I(X;Y) − 3ǫ can be replaced by the achievability condition R < C − 3ǫ.
■ Since the average probability of error over codebooks is less than 2ǫ, there exists at least one codebook C* such that Pr(E|C*) < 2ǫ.
◆ C* can be found by an exhaustive search over all codes.
■ Since W is chosen uniformly, we have

Pr(E|C*) = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i(C*) ≤ 2ǫ

which implies that the maximal error probability of the better half of the codewords is less than 4ǫ.
◆ Analogy: if there are 10 students and their average score is 40, then at least half of them score no more than 80.

Proof of R < C → Achievable, Final Part

■ We throw away the worst half of the codewords in the best codebook C*. The new code has a maximal probability of error less than 4ǫ.
■ The rate of the new code is R − 1/n: we have constructed a (2^{nR}/2, n), i.e., a (2^{n(R−1/n)}, n), code.
■ Summary: if R − 1/n < C − 3ǫ for any ǫ, then λ^(n) ≤ 4ǫ for n sufficiently large.

7.8 Zero-Error Codes

No Error → R ≤ C

■ Assume that we have a (2^{nR}, n) code with zero probability of error: Pr(g(Y^n) = W) = 1, i.e., W is determined by Y^n, so H(W|Y^n) = 0.
◆ To obtain a strong bound, assume that W is uniformly distributed over {1, 2, ..., 2^{nR}}.
■ Then

nR = H(W) = H(W|Y^n) + I(W;Y^n) = I(W;Y^n)
   ≤ I(X^n;Y^n)  (data processing inequality, W → X^n(W) → Y^n)
   ≤ Σ_{i=1}^n I(X_i;Y_i)  (see next page)
   ≤ nC  (definition of channel capacity)

■ That is, no error → R ≤ C.

No Error → R ≤ C

Lemma 1 Let Y^n be the result of passing X^n through a discrete memoryless channel of capacity C. Then for all p(x^n),

I(X^n;Y^n) ≤ nC.

Proof.

I(X^n;Y^n) = H(Y^n) − H(Y^n|X^n)
           = H(Y^n) − Σ_{i=1}^n H(Y_i|Y_1, ..., Y_{i−1}, X^n)
           = H(Y^n) − Σ_{i=1}^n H(Y_i|X_i)  (definition of DMC)
           ≤ Σ_{i=1}^n H(Y_i) − Σ_{i=1}^n H(Y_i|X_i) = Σ_{i=1}^n I(X_i;Y_i) ≤ nC  ∎

7.9 Fano's Inequality and the Converse to the Coding Theorem

Fano's Inequality

Theorem 6 (Fano's inequality) Let X and W have the same sample space X = {1, 2, ..., M} and joint p.m.f. p(x, w). Let

P_e = Pr[X ≠ W] = Σ_{x ∈ X} Σ_{w ≠ x} p(x, w).

Then

P_e log(M − 1) + H(P_e) ≥ H(X|W)

where H(P_e) = −P_e log P_e − (1 − P_e) log(1 − P_e).
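Before the proof, the inequality is easy to check numerically on an arbitrary joint p.m.f. A sketch (the alphabet size and random seed are our choices):

```python
import math, random

random.seed(0)
M = 4
raw = [[random.random() for _ in range(M)] for _ in range(M)]
tot = sum(map(sum, raw))
p = [[v / tot for v in row] for row in raw]     # random joint pmf p(x, w)

Pe = sum(p[x][w] for x in range(M) for w in range(M) if x != w)
pw = [sum(p[x][w] for x in range(M)) for w in range(M)]   # marginal of W
HXgW = sum(p[x][w] * math.log2(pw[w] / p[x][w])           # H(X|W)
           for x in range(M) for w in range(M) if p[x][w] > 0)
HPe = -Pe * math.log2(Pe) - (1 - Pe) * math.log2(1 - Pe)

print(HXgW, "<=", Pe * math.log2(M - 1) + HPe)   # Fano's bound holds
```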

Fano's Inequality

Proof. We will prove that H(X|W) − H(P_e) − P_e log(M − 1) ≤ 0. Write

H(X|W) = Σ_x Σ_{w≠x} p(x, w) log(1/p(x|w)) + Σ_x Σ_{w=x} p(x, w) log(1/p(x|w))

−H(P_e) = P_e log P_e + (1 − P_e) log(1 − P_e)
        = Σ_x Σ_{w≠x} p(x, w) log P_e + Σ_x Σ_{w=x} p(x, w) log(1 − P_e)

−P_e log(M − 1) = Σ_x Σ_{w≠x} p(x, w) log(1/(M − 1))

and add the three terms together.

Fano's Inequality

Proof (cont.)

H(X|W) − P_e log(M − 1) − H(P_e)
= Σ_x Σ_{w≠x} p(x, w) log( P_e / ((M − 1) p(x|w)) ) + Σ_x Σ_{w=x} p(x, w) log( (1 − P_e) / p(x|w) )
≤ log[ Σ_x Σ_{w≠x} p(x, w) P_e / ((M − 1) p(x|w)) + Σ_x Σ_{w=x} p(x, w) (1 − P_e) / p(x|w) ]  (by concavity of the logarithm)
= log[ (P_e/(M − 1)) Σ_x Σ_{w≠x} p(w) + (1 − P_e) Σ_x Σ_{w=x} p(w) ]
= log[P_e + (1 − P_e)] = 0  ∎

Fano's Inequality

Corollary 1 Let P_e = Pr[X ≠ W]. Then:

1. P_e log M + H(P_e) ≥ H(X|W).
2. 1 + P_e log M ≥ H(X|W).
3. If X → Y → X̂ and P_e = Pr[X ≠ X̂], then H(P_e) + P_e log M ≥ H(X|X̂) ≥ H(X|Y).

Remark. 1 and 2 follow from H(X|W) ≤ P_e log(M − 1) + H(P_e) ≤ P_e log M + H(P_e) ≤ P_e log M + 1. The second inequality in 3 can be obtained by the data processing inequality.

Data Processing Inequality

Lemma 2 (Data processing inequality) If X → Y → Z, then I(X;Z) ≤ I(X;Y).

Proof.

I(X;Z) − I(X;Y) = H(X) − H(X|Z) − [H(X) − H(X|Y)] = H(X|Y) − H(X|Z)
= Σ_x Σ_y Σ_z p(x, y, z) log(1/p(x|y)) − Σ_x Σ_y Σ_z p(x, y, z) log(1/p(x|z))
= Σ_x Σ_y Σ_z p(x, y, z) log( p(x|z)/p(x|y) )
≤ log[ Σ_x Σ_y Σ_z p(x, y, z) p(x|z)/p(x|y) ]  (by concavity of the logarithm)

Proof (cont.) Since X → Y → Z, we have p(x, y, z) = p(x, y) p(z|y) = p(x|y) p(y) p(z|y), so

p(x, y, z) p(x|z)/p(x|y) = p(y) p(z|y) p(x|z) = p(y, z) p(x|z).

Therefore

Σ_x Σ_y Σ_z p(x, y, z) p(x|z)/p(x|y) = Σ_x Σ_z p(x|z) Σ_y p(y, z) = Σ_x Σ_z p(x|z) p(z) = Σ_x Σ_z p(x, z) = 1,

and hence I(X;Z) − I(X;Y) ≤ log 1 = 0.  ∎
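The lemma is easy to verify numerically on a concrete Markov chain. A sketch for binary alphabets (the transition values are our choices):

```python
import math

def mi(pab):
    """Mutual information (bits) of a joint pmf given as a nested list."""
    pa = [sum(row) for row in pab]
    pb = [sum(pab[i][j] for i in range(len(pab))) for j in range(len(pab[0]))]
    return sum(v * math.log2(v / (pa[i] * pb[j]))
               for i, row in enumerate(pab) for j, v in enumerate(row) if v > 0)

# A binary Markov chain X -> Y -> Z:
px = [0.3, 0.7]
pygx = [[0.8, 0.2], [0.25, 0.75]]   # p(y|x)
pzgy = [[0.6, 0.4], [0.1, 0.9]]     # p(z|y): Z depends on Y only

pxy = [[px[x] * pygx[x][y] for y in range(2)] for x in range(2)]
py = [sum(pxy[x][y] for x in range(2)) for y in range(2)]
pxz = [[sum(pxy[x][y] * pzgy[y][z] for y in range(2)) for z in range(2)]
       for x in range(2)]
pyz = [[py[y] * pzgy[y][z] for z in range(2)] for y in range(2)]

print(mi(pxz), mi(pxy), mi(pyz))   # I(X;Z) is the smallest of the three
```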

Data Processing Inequality (Summary)

Lemma 3

1. If X → Y → Z, then H(X|Y) ≤ H(X|Z) and

I(X;Z) ≤ I(X;Y),  I(X;Z) ≤ I(Y;Z).

2. If X → Y → Z → W, then

I(X;W) ≤ I(X;Z) and I(X;W) ≤ I(Y;Z).

Achievable → R ≤ C

Theorem 7 (Converse to the channel coding theorem) Any sequence of (2^{nR}, n) codes with λ^(n) → 0 must have R ≤ C.

Proof.

■ For a fixed encoding rule X^n(W) and a fixed decoding rule Ŵ = g(Y^n), we have W → X^n(W) → Y^n → Ŵ.
■ For each n, let W be drawn according to a uniform distribution over {1, 2, ..., 2^{nR}}.
■ Since W has a uniform distribution,

P_e^(n) = Pr[W ≠ Ŵ] = (1/2^{nR}) Σ_{i=1}^{2^{nR}} λ_i.

Achievable → R ≤ C

Proof (cont.)

nR = H(W)  (W is uniformly distributed)
   = H(W|Ŵ) + I(W;Ŵ)
   ≤ 1 + P_e^(n) nR + I(W;Ŵ)  (Fano's inequality)
   ≤ 1 + P_e^(n) nR + I(X^n;Y^n)  (data processing inequality)
   ≤ 1 + P_e^(n) nR + nC  (Lemma 1)

⇒ P_e^(n) ≥ 1 − C/R − 1/(nR)

That is, if R > C, the probability of error is larger than a positive constant for sufficiently large n; the error probability can't be made arbitrarily small.  ∎
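The converse bound gives concrete numbers. A sketch for a BSC (the crossover, rate, and blocklength below are our choices, not from the slides):

```python
import math

def H2(p):  # binary entropy in bits
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# BSC with crossover 0.1, so C = 1 - H(0.1) ~ 0.531 bits per use.
# Attempting rate R = 0.8 > C with blocklength n = 1000:
C = 1 - H2(0.1)
R, n = 0.8, 1000
Pe_lower = 1 - C / R - 1 / (n * R)
print(Pe_lower)  # ~0.335: at least a third of messages are decoded wrongly
```

No amount of clever coding escapes this: the bound depends only on C, R, and n.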