
Elements of Information Theory, Second Edition: Solutions Manual

Thomas M. Cover and Joy A. Thomas

September 22, 2006

COPYRIGHT 2006 Thomas Cover and Joy Thomas. All rights reserved.

Contents

1 Introduction  7
2 Entropy, Relative Entropy and Mutual Information  9
3 The Asymptotic Equipartition Property  49
4 Entropy Rates of a Stochastic Process  61
5 Data Compression  97
6 Gambling and Data Compression  139

Preface

The problems in the book, "Elements of Information Theory, Second Edition", were chosen from the problems used during the course at Stanford. Most of the solutions here were prepared by the graders and instructors of the course. We would particularly like to thank Prof. John Gill, David Evans, Jim Roche, Laura Ekroot and Young Han Kim for their help in preparing these solutions.

Most of the problems in the book are straightforward, and we have included hints in the problem statement for the difficult problems. In some cases, the solutions include extra material of interest (for example, the problem on coin weighing on Pg. 12).

We would appreciate any comments, suggestions and corrections to this Solutions Manual.

Tom Cover
Durand 121, Information Systems Lab
Stanford University
Stanford, CA 94305
Ph. 415-723-4505, FAX: 415-723-8473
Email: cover@isl.stanford.edu

Joy Thomas
Stratify
701 N Shoreline Avenue
Mountain View, CA 94043
Ph. 650-210-2722, FAX: 650-988-2159
Email: jat@stratify.com


Chapter 1 Introduction


Chapter 2 Entropy, Relative Entropy and Mutual Information

1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.

(a) Find the entropy H(X) in bits. The following expressions may be useful:

sum_{n=0}^inf r^n = 1/(1 - r),   sum_{n=0}^inf n r^n = r/(1 - r)^2.

(b) A random variable X is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form, "Is X contained in the set S?" Compare H(X) to the expected number of questions required to determine X.

Solution:

(a) The number X of tosses till the first head appears has the geometric distribution with parameter p = 1/2, where P(X = n) = p q^{n-1}, n in {1, 2, ...}. Hence the entropy of X is

H(X) = - sum_{n=1}^inf p q^{n-1} log(p q^{n-1})
     = - [ sum_{n=0}^inf p q^n log p + sum_{n=0}^inf n p q^n log q ]
     = (-p log p)/(1 - q) - (p q log q)/p^2
     = (-p log p - q log q)/p
     = H(p)/p bits.

If p = 1/2, then H(X) = 2 bits.
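The closed form H(X) = H(p)/p is easy to sanity-check numerically. The following short Python sketch (the function names are mine, not from the text) sums the series directly and compares it with the closed form:

```python
import math

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p), in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def geometric_entropy(p, tol=1e-15):
    """Entropy (bits) of P(X = n) = p q^(n-1), n = 1, 2, ..., by direct summation."""
    q = 1 - p
    h, term = 0.0, p          # term = p * q**(n-1)
    while term > tol:
        h -= term * math.log2(term)
        term *= q
    return h

# The closed form H(X) = H(p)/p from the solution, checked numerically.
for p in (0.5, 0.3, 0.9):
    assert abs(geometric_entropy(p) - binary_entropy(p) / p) < 1e-9

assert abs(geometric_entropy(0.5) - 2.0) < 1e-9   # p = 1/2 gives 2 bits
```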

(b) Intuitively, it seems clear that the best questions are those that have equally likely chances of receiving a yes or a no answer. Consequently, one possible guess is that the most "efficient" series of questions is: Is X = 1? If not, is X = 2? If not, is X = 3? etc., with a resulting expected number of questions equal to sum_{n=1}^inf n(1/2^n) = 2. This should reinforce the intuition that H(X) is a measure of the uncertainty of X. Indeed in this case, the entropy is exactly the same as the average number of questions needed to define X, and in general E(# of questions) >= H(X).

This problem has an interpretation as a source coding problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then the set of questions in the above procedure can be written as a collection of (X, Y) pairs: (1, 1), (2, 01), (3, 001), etc. In fact, this intuitively derived code is the optimal (Huffman) code minimizing the expected number of questions.

2. Entropy of functions. Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y) if

(a) Y = 2^X?
(b) Y = cos X?

Solution: Let y = g(x). Then

p(y) = sum_{x: y = g(x)} p(x).

Consider any set of x's that map onto a single y. For this set,

sum_{x: y = g(x)} p(x) log p(x) <= sum_{x: y = g(x)} p(x) log p(y) = p(y) log p(y),

since log is a monotone increasing function and p(x) <= sum_{x: y = g(x)} p(x) = p(y). Extending this argument to the entire range of X (and Y), we obtain

H(X) = - sum_x p(x) log p(x)
     = - sum_y sum_{x: y = g(x)} p(x) log p(x)
     >= - sum_y p(y) log p(y)
     = H(Y),

with equality iff g is one-to-one with probability one.

(a) Y = 2^X is one-to-one, and hence the entropy, which is just a function of the probabilities (and not the values of a random variable), does not change, i.e., H(X) = H(Y).

(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that H(X) >= H(Y), with equality if cosine is one-to-one on the range of X.
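A small numerical illustration of H(g(X)) <= H(X) (my own example, not from the text): take X uniform on {0, 1, 2, 3} and compare a one-to-one map with a two-to-one map.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def pushforward(px, g):
    """Distribution of Y = g(X) given the distribution of X."""
    py = defaultdict(float)
    for x, p in px.items():
        py[g(x)] += p
    return py

px = {x: 0.25 for x in range(4)}            # X uniform on {0, 1, 2, 3}

h_x = entropy(px)                            # 2 bits
h_2x = entropy(pushforward(px, lambda x: 2 ** x))   # one-to-one: still 2 bits
# cos(pi*x) maps 0, 2 -> 1 and 1, 3 -> -1: two-to-one, entropy drops to 1 bit
h_cos = entropy(pushforward(px, lambda x: round(math.cos(math.pi * x), 6)))

assert h_2x == h_x and h_cos < h_x
```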

3. Minimum entropy. What is the minimum value of H(p1, ..., pn) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's which achieve this minimum.

Solution: We wish to find all probability vectors p = (p1, p2, ..., pn) which minimize

H(p) = - sum_i pi log pi.

Now -pi log pi >= 0, with equality iff pi = 0 or 1. Hence the only possible probability vectors which minimize H(p) are those with pi = 1 for some i and pj = 0, j != i. There are n such vectors, i.e., (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1), and the minimum value of H(p) is 0.

4. Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:

H(X, g(X)) =(a) H(X) + H(g(X) | X) =(b) H(X);
H(X, g(X)) =(c) H(g(X)) + H(X | g(X)) >=(d) H(g(X)).

Thus H(g(X)) <= H(X).

Solution: Entropy of functions of a random variable.

(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence H(g(X)|X) = sum_x p(x) H(g(X)|X = x) = sum_x 0 = 0.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) >= 0, with equality iff X is a function of g(X), i.e., g(.) is one-to-one. Hence H(X, g(X)) >= H(g(X)).

Combining parts (b) and (d), we obtain H(X) >= H(g(X)).

5. Zero conditional entropy. Show that if H(Y|X) = 0, then Y is a function of X, i.e., for all x with p(x) > 0, there is only one possible value of y with p(x, y) > 0.

Solution: Zero conditional entropy. Assume that there exists an x, say x0, and two different values of y, say y1 and y2, such that p(x0, y1) > 0 and p(x0, y2) > 0. Then p(x0) >= p(x0, y1) + p(x0, y2) > 0, and p(y1|x0) and p(y2|x0) are not equal to 0 or 1. Thus

H(Y|X) = - sum_x p(x) sum_y p(y|x) log p(y|x)
       >= p(x0) ( -p(y1|x0) log p(y1|x0) - p(y2|x0) log p(y2|x0) )
       > 0,

since -t log t >= 0 for 0 <= t <= 1, and is strictly positive for t not equal to 0 or 1. Therefore the conditional entropy H(Y|X) is 0 if and only if Y is a function of X.

6. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X, Y and Z such that

(a) I(X; Y | Z) < I(X; Y),
(b) I(X; Y | Z) > I(X; Y).

Solution: Conditional mutual information vs. unconditional mutual information.

(a) The last corollary to Theorem 2.8.1 in the text states that if X -> Y -> Z, that is, if p(x, y | z) = p(x | z) p(y | z), then I(X; Y) >= I(X; Y | Z). Equality holds if and only if I(X; Z) = 0, i.e., X and Z are independent. A simple example of random variables satisfying the inequality conditions above is: X is a fair binary random variable, Y = X and Z = Y. In this case we have that I(X; Y) = H(X) - H(X | Y) = H(X) = 1 and I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = 0. So that I(X; Y | Z) < I(X; Y).

(b) This example is also given in the text. Let X, Y be independent fair binary random variables and let Z = X + Y. In this case we have that I(X; Y) = 0 and I(X; Y | Z) = H(X | Z) = 1/2. So I(X; Y | Z) > I(X; Y). Note that in this case X, Y, Z are not Markov.

7. Coin weighing. Suppose one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance.

(a) Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter.
(b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins?

Solution: Coin weighing.

(a) For n coins, there are 2n + 1 possible situations or "states":

• One of the n coins is heavier.
• One of the n coins is lighter.
• They are all of equal weight.

Each weighing has three possible outcomes: equal, left pan heavier, or right pan heavier. Hence with k weighings, there are 3^k possible outcomes and hence we can distinguish between at most 3^k different "states". Hence 2n + 1 <= 3^k, or n <= (3^k - 1)/2.

Looking at it from an information theoretic viewpoint, each weighing gives at most log 3 bits of information. There are 2n + 1 possible "states", with a maximum entropy of log(2n + 1) bits. Hence in this situation, one would require at least log(2n + 1)/log 3 weighings to extract enough information for determination of the odd coin.

(b) There are many solutions to this problem. We will give one which is based on the ternary number system. We may express the numbers {-12, ..., -1, 0, 1, ..., 12} in a ternary number system with alphabet {-1, 0, 1}. For example, the number 8 is (-1, 0, 1), i.e., (-1)3^0 + (0)3^1 + (1)3^2 = 8. We form the matrix with the representation of the positive numbers as its columns:

       1   2   3   4   5   6   7   8   9  10  11  12
3^0    1  -1   0   1  -1   0   1  -1   0   1  -1   0    Sigma1 = 0
3^1    0   1   1   1  -1  -1  -1   0   0   0   1   1    Sigma2 = 2
3^2    0   0   0   0   1   1   1   1   1   1   1   1    Sigma3 = 8

Note that the row sums are not all zero. We can negate some columns to make the row sums zero. For example, negating columns 7, 9, 11 and 12, we obtain

       1   2   3   4   5   6   7   8   9  10  11  12
3^0    1  -1   0   1  -1   0  -1  -1   0   1   1   0    Sigma1 = 0
3^1    0   1   1   1  -1  -1   1   0   0   0  -1  -1    Sigma2 = 0
3^2    0   0   0   0   1   1  -1   1  -1   1  -1  -1    Sigma3 = 0

Now place the coins on the balance according to the following rule: For weighing #i, place coin n

• On left pan, if n_i = -1.
• Aside, if n_i = 0.
• On right pan, if n_i = 1.

The result of each weighing is 0 if both pans are equal, -1 if the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings give the ternary expansion of the index of the odd coin. If the expansion is the same as the expansion in the matrix, it indicates that the coin is heavier. If the expansion is of the opposite sign, the coin is lighter. For example, (0, -1, -1) indicates (0)3^0 + (-1)3^1 + (-1)3^2 = -12, hence coin #12 is heavy; (1, 0, -1) indicates #8 is light; (0, 0, 0) indicates no odd coin.

• Aside: Why does this scheme work? It is a single error correcting Hamming code for the ternary alphabet (discussed in Section 8.11 in the book). Here are some details. First note a few properties of the matrix above that was used for the scheme. All the columns are distinct and no two columns add to (0, 0, 0). Also, if any coin is heavier, it will produce the sequence of weighings that matches its column in the matrix. If it is lighter, it produces the negative of its column as a sequence of weighings. Combining all these facts, we can see that any single odd coin will produce a unique sequence of weighings, and that the coin can be determined from the sequence. Hence the outcome of the three weighings will find the odd coin if any and tell whether it is heavy or light.
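The whole 12-coin scheme can be checked by exhaustive simulation. This sketch (my own, under the sign conventions stated above) rebuilds the negated matrix from balanced-ternary digits and verifies that all 25 states produce distinct outcomes:

```python
def balanced_ternary(n, digits=3):
    """Balanced-ternary digits of n (least significant first), alphabet {-1, 0, 1}."""
    out = []
    for _ in range(digits):
        r = n % 3
        if r == 2:
            r = -1
        out.append(r)
        n = (n - r) // 3
    return tuple(out)

# Columns of the weighing matrix: balanced ternary of 1..12,
# with columns 7, 9, 11, 12 negated so that every row sums to zero.
cols = {}
for c in range(1, 13):
    d = balanced_ternary(c)
    if c in (7, 9, 11, 12):
        d = tuple(-x for x in d)
    cols[c] = d

assert all(sum(cols[c][i] for c in cols) == 0 for i in range(3))  # rows balance

# Outcome of weighing i: +1 if right pan heavier, -1 if left pan heavier, 0 if equal.
# A heavy odd coin produces its own column; a light one produces its negation.
outcomes = {(0, 0, 0): "all equal"}
for c in cols:
    for sign, tag in ((1, "heavy"), (-1, "light")):
        o = tuple(sign * x for x in cols[c])
        assert o not in outcomes          # every state gives a distinct outcome
        outcomes[o] = f"coin {c} {tag}"

assert len(outcomes) == 25                # 2*12 + 1 states distinguished
```

Running it confirms, for instance, that the outcome (1, 0, -1) decodes to "coin 8 light", matching the worked example in the text.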

A few additional remarks. One of the questions that many of you had was whether the bound derived in part (a) was actually achievable. For example, can one distinguish 13 coins in 3 weighings? No, not with a scheme like the one above. Yes, under the assumptions under which the bound was derived. The bound did not prohibit the division of coins into halves, neither did it disallow the existence of another coin known to be normal. Under both these conditions, it is possible to find the odd coin of 13 coins in 3 weighings. You could try modifying the above scheme to these cases.

8. Drawing with and without replacement. An urn contains r red, w white, and b black balls. Which has higher entropy, drawing k >= 2 balls from the urn with replacement or without replacement? Set it up and show why. (There is both a hard way and a relatively simple way to do this.)

Solution: Drawing with and without replacement. Intuitively, it is clear that if the balls are drawn with replacement, the number of possible choices for the i-th ball is larger, and therefore the conditional entropy is larger. But computing the conditional distributions is slightly involved. It is easier to compute the unconditional entropy.

• With replacement. In this case the conditional distribution of each draw is the same for every draw. Thus

X_i = red   with prob. r/(r + w + b),
      white with prob. w/(r + w + b),
      black with prob. b/(r + w + b),

and therefore

H(X_i | X_{i-1}, ..., X_1) = H(X_i) = log(r + w + b) - (r log r + w log w + b log b)/(r + w + b).

• Without replacement. The unconditional probability of the i-th ball being red is still r/(r + w + b), etc. Thus the unconditional entropy H(X_i) is still the same as with replacement. The conditional entropy H(X_i | X_{i-1}, ..., X_1) is less than the unconditional entropy, and therefore the entropy of drawing without replacement is lower.

9. A metric. A function rho(x, y) is a metric if for all x, y:

• rho(x, y) >= 0,
• rho(x, y) = rho(y, x),
• rho(x, y) = 0 if and only if x = y,
• rho(x, y) + rho(y, z) >= rho(x, z).

(a) Show that rho(X, Y) = H(X|Y) + H(Y|X) is a metric.
(b) Verify that rho(X, Y) can also be expressed as

rho(X, Y) = H(X) + H(Y) - 2 I(X; Y)
          = H(X, Y) - I(X; Y)
          = 2 H(X, Y) - H(X) - H(Y).

Solution: A metric.

(a) Let rho(X, Y) = H(X|Y) + H(Y|X). Then

• Since conditional entropy is always >= 0, rho(X, Y) >= 0.
• The symmetry of the definition implies that rho(X, Y) = rho(Y, X).
• By problem 2.6, it follows that H(Y|X) is 0 iff Y is a function of X, and H(X|Y) is 0 iff X is a function of Y. Thus rho(X, Y) is 0 iff X and Y are functions of each other, and therefore are equivalent up to a reversible transformation. If we say that X = Y if there is a one-to-one function mapping from X to Y, then the third property is also satisfied.
• Consider three random variables X, Y and Z. Then

H(X|Y) + H(Y|Z) >= H(X|Y, Z) + H(Y|Z)
                 = H(X, Y|Z)
                 = H(X|Z) + H(Y|X, Z)
                 >= H(X|Z),

from which it follows that rho(X, Y) + rho(Y, Z) >= rho(X, Z). Note that the inequality is strict unless X -> Y -> Z forms a Markov chain and Y is a function of X and Z.

(b) Since H(X|Y) = H(X) - I(X; Y), the first equation follows. The second relation follows from the first equation and the fact that H(X, Y) = H(X) + H(Y) - I(X; Y). The third follows on substituting I(X; Y) = H(X) + H(Y) - H(X, Y).

10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn according to probability mass functions p1(.) and p2(.) over the respective alphabets X1 = {1, 2, ..., m} and X2 = {m + 1, ..., n}. Let

X = X1 with probability alpha,
    X2 with probability 1 - alpha.

(a) Find H(X) in terms of H(X1), H(X2) and alpha.
(b) Maximize over alpha to show that 2^H(X) <= 2^H(X1) + 2^H(X2) and interpret using the notion that 2^H(X) is the effective alphabet size.

Solution: Entropy of a disjoint mixture. We can do this problem by writing down the definition of entropy and expanding the various terms. Instead, we will use the algebra of entropies for a simpler proof.

Since X1 and X2 have disjoint support sets, we can write

X = X1 with probability alpha,
    X2 with probability 1 - alpha.

Define a function of X,

theta = f(X) = 1 when X = X1,
               2 when X = X2.

Then, as in problem 1, we have

H(X) = H(X, f(X)) = H(theta) + H(X|theta)
     = H(theta) + p(theta = 1) H(X|theta = 1) + p(theta = 2) H(X|theta = 2)
     = H(alpha) + alpha H(X1) + (1 - alpha) H(X2),

where H(alpha) = -alpha log alpha - (1 - alpha) log(1 - alpha).

11. A measure of correlation. Let X1 and X2 be identically distributed, but not necessarily independent. Let

rho = 1 - H(X2 | X1)/H(X1).

(a) Show rho = I(X1; X2)/H(X1).
(b) Show 0 <= rho <= 1.
(c) When is rho = 0?
(d) When is rho = 1?

Solution: A measure of correlation. X1 and X2 are identically distributed and rho = 1 - H(X2|X1)/H(X1).

(a) rho = (H(X1) - H(X2|X1))/H(X1)
        = (H(X2) - H(X2|X1))/H(X1)   (since H(X1) = H(X2))
        = I(X1; X2)/H(X1).

(b) Since 0 <= H(X2|X1) <= H(X2) = H(X1), we have

0 <= H(X2|X1)/H(X1) <= 1,

so 0 <= rho <= 1.

(c) rho = 0 iff I(X1; X2) = 0 iff X1 and X2 are independent.

(d) rho = 1 iff H(X2|X1) = 0 iff X2 is a function of X1. By symmetry, X1 is a function of X2, i.e., X1 and X2 have a one-to-one relationship.

12. Example of joint entropy. Let p(x, y) be given by

          Y = 0    Y = 1
X = 0      1/3      1/3
X = 1       0       1/3

Find

(a) H(X), H(Y).
(b) H(X | Y), H(Y | X).
(c) H(X, Y).
(d) H(Y) - H(Y | X).
(e) I(X; Y).
(f) Draw a Venn diagram for the quantities in (a) through (e).

Solution: Example of joint entropy.

(a) H(X) = (2/3) log(3/2) + (1/3) log 3 = 0.918 bits = H(Y).
(b) H(X|Y) = (1/3) H(X|Y = 0) + (2/3) H(X|Y = 1) = 0.667 bits = H(Y|X).
(c) H(X, Y) = 3 x (1/3) log 3 = 1.585 bits.
(d) H(Y) - H(Y|X) = 0.251 bits.
(e) I(X; Y) = H(Y) - H(Y|X) = 0.251 bits.
(f) See Figure 2.1.

13. Inequality. Show ln x >= 1 - 1/x for x > 0.

Solution: Inequality. Using the Remainder form of the Taylor expansion of ln(x) about x = 1, we have for some c between 1 and x

ln(x) = ln(1) + (1/t)|_{t=1} (x - 1) + (-1/t^2)|_{t=c} (x - 1)^2 / 2 <= x - 1,

since the second term is always negative.

Hence, letting y = 1/x, we obtain

-ln y <= (1/y) - 1, i.e., ln y >= 1 - 1/y,

with equality iff y = 1.

[Figure 2.1: Venn diagram to illustrate the relationships of entropy and relative entropy: H(X), H(Y), H(X|Y), H(Y|X), I(X;Y).]

14. Entropy of a sum. Let X and Y be random variables that take on values x1, x2, ..., xr and y1, y2, ..., ys, respectively. Let Z = X + Y.

(a) Show that H(Z|X) = H(Y|X). Argue that if X, Y are independent, then H(Y) <= H(Z) and H(X) <= H(Z). Thus the addition of independent random variables adds uncertainty.
(b) Give an example of (necessarily dependent) random variables in which H(X) > H(Z) and H(Y) > H(Z).
(c) Under what conditions does H(Z) = H(X) + H(Y)?

Solution: Entropy of a sum.

(a) Z = X + Y. Hence p(Z = z | X = x) = p(Y = z - x | X = x). Thus

H(Z|X) = sum_x p(x) H(Z|X = x)
       = - sum_x p(x) sum_z p(Z = z | X = x) log p(Z = z | X = x)
       = - sum_x p(x) sum_y p(Y = z - x | X = x) log p(Y = z - x | X = x)
       = sum_x p(x) H(Y|X = x)
       = H(Y|X).

If X and Y are independent, then H(Y|X) = H(Y). Since I(X; Z) >= 0, we have H(Z) >= H(Z|X) = H(Y|X) = H(Y). Similarly we can show that H(Z) >= H(X).

(b) Consider the following joint distribution for X and Y. Let

X = -Y = 1 with probability 1/2,
           0 with probability 1/2.

Then H(X) = H(Y) = 1, but Z = 0 with prob. 1 and hence H(Z) = 0.

(c) We have

H(Z) <= H(X, Y) <= H(X) + H(Y),

because Z is a function of (X, Y) and H(X, Y) = H(X) + H(Y|X) <= H(X) + H(Y). We have equality iff (X, Y) is a function of Z and H(Y) = H(Y|X), i.e., X and Y are independent.

15. Data processing. Let X1 -> X2 -> X3 -> ... -> Xn form a Markov chain in this order, i.e., let

p(x1, x2, ..., xn) = p(x1) p(x2|x1) ... p(xn|x_{n-1}).

Reduce I(X1; X2, ..., Xn) to its simplest form.

Solution: Data processing. By the chain rule for mutual information,

I(X1; X2, ..., Xn) = I(X1; X2) + I(X1; X3|X2) + ... + I(X1; Xn|X2, ..., X_{n-1}).

By the Markov property, the past and the future are conditionally independent given the present, and hence all terms except the first are zero. Therefore

I(X1; X2, ..., Xn) = I(X1; X2).

16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks down to k < n states, and then fans back to m > k states. Thus X1 -> X2 -> X3, i.e., p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2), for all x1 in {1, 2, ..., n}, x2 in {1, 2, ..., k}, x3 in {1, 2, ..., m}.

(a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving that I(X1; X3) <= log k.
(b) Evaluate I(X1; X3) for k = 1, and conclude that no dependence can survive such a bottleneck.

Solution: Bottleneck.

(a) From the data processing inequality, and the fact that entropy is maximum for a uniform distribution, we get

I(X1; X3) <= I(X1; X2) = H(X2) - H(X2|X1) <= H(X2) <= log k.

Thus, the dependence between X1 and X3 is limited by the size of the bottleneck. That is, I(X1; X3) <= log k.

(b) For k = 1, I(X1; X3) <= log 1 = 0, and since I(X1; X3) >= 0, I(X1; X3) = 0. Thus, for k = 1, X1 and X3 are independent.
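The bottleneck bound I(X1; X3) <= log k can be checked on a randomly generated chain. The following sketch (my own construction, not from the text) builds an arbitrary chain with a k = 2 bottleneck and verifies the mutual information never exceeds 1 bit:

```python
import math
import random

random.seed(1)

def normalized(v):
    s = sum(v)
    return [x / s for x in v]

n, k, m = 5, 2, 4   # X1 has n states, bottleneck of k states, X3 fans out to m
p1 = normalized([random.random() for _ in range(n)])
p21 = [normalized([random.random() for _ in range(k)]) for _ in range(n)]
p32 = [normalized([random.random() for _ in range(m)]) for _ in range(k)]

# Joint distribution of (X1, X3), marginalizing out the bottleneck X2.
joint = [[sum(p1[a] * p21[a][b] * p32[b][c] for b in range(k))
          for c in range(m)] for a in range(n)]
pa = [sum(row) for row in joint]
pc = [sum(joint[a][c] for a in range(n)) for c in range(m)]

mi = sum(joint[a][c] * math.log2(joint[a][c] / (pa[a] * pc[c]))
         for a in range(n) for c in range(m) if joint[a][c] > 0)

assert 0.0 <= mi <= math.log2(k) + 1e-12   # I(X1; X3) <= log k = 1 bit
```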

17. Pure randomness and bent coins. Let X1, X2, ..., Xn denote the outcomes of independent flips of a bent coin. Thus Pr{Xi = 1} = p, Pr{Xi = 0} = 1 - p, where p is unknown. We wish to obtain a sequence Z1, Z2, ..., ZK of fair coin flips from X1, X2, ..., Xn. Toward this end let f : X^n -> {0, 1}* (where {0, 1}* = {Lambda, 0, 1, 00, 01, ...} is the set of all finite length binary sequences) be a mapping f(X1, X2, ..., Xn) = (Z1, Z2, ..., ZK), where Zi ~ Bernoulli(1/2), and K may depend on (X1, ..., Xn). In order that the sequence Z1, Z2, ... appear to be fair coin flips, the map f from bent coin flips to fair flips must have the property that all 2^k sequences (Z1, Z2, ..., Zk) of a given length k have equal probability (possibly 0), for k = 1, 2, .... For example, for n = 2, the map f(01) = 0, f(10) = 1, f(00) = f(11) = Lambda (the null string) has the property that Pr{Z1 = 1|K = 1} = Pr{Z1 = 0|K = 1} = 1/2.

Give reasons for the following inequalities:

nH(p) =(a) H(X1, ..., Xn)
      >=(b) H(Z1, Z2, ..., ZK, K)
      =(c) H(K) + H(Z1, Z2, ..., ZK | K)
      =(d) H(K) + E(K)
      >=(e) E(K).

Thus no more than nH(p) fair coin tosses can be derived from (X1, ..., Xn), on the average. Exhibit a good map f on sequences of length 4.

Solution: Pure randomness and bent coins.

(a) Since X1, X2, ..., Xn are i.i.d. with probability of Xi = 1 being p, the entropy H(X1, X2, ..., Xn) is nH(p).
(b) Z1, ..., ZK is a function of X1, X2, ..., Xn, and since the entropy of a function of a random variable is less than the entropy of the random variable, H(Z1, ..., ZK) <= H(X1, X2, ..., Xn). K is a function of Z1, Z2, ..., ZK, so its conditional entropy given Z1, Z2, ..., ZK is 0. Hence H(Z1, Z2, ..., ZK, K) = H(Z1, ..., ZK) + H(K|Z1, Z2, ..., ZK) = H(Z1, Z2, ..., ZK) <= H(X1, X2, ..., Xn).
(c) Follows from the chain rule for entropy.
(d) By assumption, Z1, Z2, ..., ZK are pure random bits (given K), with entropy 1 bit per symbol. Hence

H(Z1, Z2, ..., ZK | K) = sum_k p(K = k) H(Z1, Z2, ..., Zk | K = k) = sum_k p(k) k = EK.

(e) Follows from the non-negativity of discrete entropy.

Since we do not know p, the only way to generate pure random bits is to use the fact that all sequences with the same number of ones are equally likely. For example, for n = 4, the sequences 0001, 0010, 0100 and 1000 are equally likely and can be used to generate 2 pure random bits. An example of a mapping to generate random bits is

0000 -> Lambda     0011 -> 00     1010 -> 0     1110 -> 11
0001 -> 00         0110 -> 01     0101 -> 1     1101 -> 10
0010 -> 01         1100 -> 10                   1011 -> 01
0100 -> 10         1001 -> 11                   0111 -> 00
1000 -> 11                                      1111 -> Lambda

The resulting expected number of bits is

EK = 4 p q^3 x 2 + 4 p^2 q^2 x 2 + 2 p^2 q^2 x 1 + 4 p^3 q x 2
   = 8 p q^3 + 10 p^2 q^2 + 8 p^3 q.

For example, for p ≈ 1/2, the expected number of pure random bits is close to 1.625. This is substantially less than the 4 pure random bits that could be generated if p were exactly 1/2.
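The defining property of the length-4 map, that outputs of each length are equiprobable for every bias p, can be verified exhaustively. A short Python sketch (the table is the one above; the helper names are mine):

```python
from itertools import product
from collections import defaultdict

# The length-4 extractor from the solution: sequences with the same number of
# ones are equally likely, so each class is spread uniformly over outputs.
f = {
    "0000": "", "1111": "",
    "0001": "00", "0010": "01", "0100": "10", "1000": "11",
    "0011": "00", "0110": "01", "1100": "10", "1001": "11",
    "1110": "11", "1101": "10", "1011": "01", "0111": "00",
    "1010": "0", "0101": "1",
}

def check(p):
    """For an arbitrary bias p, outputs of each length k must be equiprobable."""
    by_len = defaultdict(lambda: defaultdict(float))
    for bits in product("01", repeat=4):
        x = "".join(bits)
        prob = p ** x.count("1") * (1 - p) ** x.count("0")
        by_len[len(f[x])][f[x]] += prob
    for k, dist in by_len.items():
        if k == 0:
            continue
        vals = list(dist.values())
        assert len(dist) == 2 ** k                    # all 2^k strings occur
        assert all(abs(v - vals[0]) < 1e-12 for v in vals)   # equiprobable

for p in (0.1, 0.3, 0.5, 0.8):
    check(p)
```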

We will now analyze the efficiency of this scheme of generating random bits for long sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm that we will use is the obvious extension of the above method: use the fact that all sequences with the same number of ones are equally likely.

Consider all sequences with k ones. There are (n choose k) such sequences, all equally likely. If (n choose k) were a power of 2, then we could generate log (n choose k) pure random bits from such a set. However, in the general case, (n choose k) is not a power of 2, and the best we can do is divide the set of (n choose k) elements into subsets of sizes which are powers of 2. The largest set would have size 2^{floor(log (n choose k))} and could be used to generate floor(log (n choose k)) random bits. We could divide the remaining elements into the largest set which is a power of 2, etc.

The worst case occurs when (n choose k) = 2^{l+1} - 1, in which case the subsets would be of sizes 2^l, 2^{l-1}, 2^{l-2}, ..., 1. Let l = floor(log (n choose k)). Then at least half of the elements belong to a set of size 2^l and would generate l random bits, at least 1/4-th belong to a set of size 2^{l-1} and generate l - 1 random bits, etc. On the average, the number of bits generated is

E[K | k 1's in sequence] >= (1/2) l + (1/4)(l - 1) + ... + (1/2^l)(1)
                          = l - (1/4)(1 + 2/2 + 3/4 + ... + (l - 1)/2^{l-2})
                          >= l - 1,

since the infinite series (1/4)(1 + 2/2 + 3/4 + ...) sums to 1. Hence the fact that (n choose k) is not a power of 2 will cost at most 1 bit on the average in the number of random bits that are produced.

Instead of analyzing the scheme exactly, we will just find a lower bound on the number of random bits generated. The expected number of pure random bits produced by this algorithm is

EK >= sum_{k=0}^n (n choose k) p^k q^{n-k} ( floor(log (n choose k)) - 1 )
   >= sum_{k=0}^n (n choose k) p^k q^{n-k} ( log (n choose k) - 2 )
   = sum_{k=0}^n (n choose k) p^k q^{n-k} log (n choose k) - 2
   >= sum_{n(p-eps) <= k <= n(p+eps)} (n choose k) p^k q^{n-k} log (n choose k) - 2.

Now for sufficiently large n, the probability that the number of 1's in the sequence is close to np is near 1 (by the weak law of large numbers). For such sequences, k/n is close to p and hence there exists a delta such that

(n choose k) >= 2^{n(H(k/n) - delta)} >= 2^{n(H(p) - 2 delta)},

using Stirling's approximation for the binomial coefficients and the continuity of the entropy function. If we assume that n is large enough so that the probability that n(p - eps) <= k <= n(p + eps) is greater than 1 - eps, then we see that

EK >= (1 - eps) n (H(p) - 2 delta) - 2,

which is very good since nH(p) is an upper bound on the number of pure random bits that can be produced from the bent coin sequence.

18. World Series. The World Series is a seven-game series that terminates as soon as either team wins four games. Let X be the random variable that represents the outcome of a World Series between teams A and B; possible values of X are AAAA, BABABAB, and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7. Assuming that A and B are equally matched and that the games are independent, calculate H(X), H(Y), H(Y|X), and H(X|Y).

Solution: World Series. Two teams play until one of them has won 4 games.

There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4.
There are 8 = 2 (4 choose 3) World Series with 5 games. Each happens with probability (1/2)^5.
There are 20 = 2 (5 choose 3) World Series with 6 games. Each happens with probability (1/2)^6.
There are 40 = 2 (6 choose 3) World Series with 7 games. Each happens with probability (1/2)^7.

The probability of a 4 game series (Y = 4) is 2 (1/2)^4 = 1/8.
The probability of a 5 game series (Y = 5) is 8 (1/2)^5 = 1/4.
The probability of a 6 game series (Y = 6) is 20 (1/2)^6 = 5/16.
The probability of a 7 game series (Y = 7) is 40 (1/2)^7 = 5/16.

H(X) = sum p(x) log(1/p(x))
     = 2 (1/16) log 16 + 8 (1/32) log 32 + 20 (1/64) log 64 + 40 (1/128) log 128
     = 5.8125 bits.

H(Y) = sum p(y) log(1/p(y))
     = (1/8) log 8 + (1/4) log 4 + (5/16) log(16/5) + (5/16) log(16/5)
     = 1.924 bits.
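The counting and the two entropies just computed can be confirmed by brute-force enumeration. This sketch (my own, not part of the original solution) weights every length-7 win/loss string equally and truncates it at the clinching game:

```python
import math
from itertools import product

# Enumerate all equally likely length-7 outcome strings and truncate each
# at the moment one team reaches 4 wins; each full string has weight 2^-7.
px = {}
for games in product("AB", repeat=7):
    for g in range(4, 8):
        if max(games[:g].count("A"), games[:g].count("B")) == 4:
            x = "".join(games[:g])
            px[x] = px.get(x, 0.0) + 0.5 ** 7
            break

def H(dist):
    """Entropy in bits of a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values())

py = {}                     # distribution of the series length Y
for x, p in px.items():
    py[len(x)] = py.get(len(x), 0.0) + p

assert len(px) == 2 + 8 + 20 + 40      # 70 possible series outcomes
assert abs(H(px) - 5.8125) < 1e-9      # H(X)
assert abs(H(py) - 1.924) < 1e-3       # H(Y)
```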

Y is a deterministic function of X, so if you know X there is no randomness in Y. Or, H(Y|X) = 0. Since H(X) + H(Y|X) = H(X, Y) = H(Y) + H(X|Y), it is easy to determine

H(X|Y) = H(X) + H(Y|X) - H(Y) = 5.8125 + 0 - 1.924 = 3.889 bits.

19. Infinite entropy. This problem shows that the entropy of a discrete random variable can be infinite. Let A = sum_{n=2}^inf (n log^2 n)^{-1}. (It is easy to show that A is finite by bounding the infinite sum by the integral of (x log^2 x)^{-1}.) Show that the integer-valued random variable X defined by Pr(X = n) = (A n log^2 n)^{-1} for n = 2, 3, ..., has H(X) = +infinity.

Solution: Infinite entropy. By definition, p_n = Pr(X = n) = 1/(A n log^2 n) for n >= 2. Therefore

H(X) = - sum_{n=2}^inf p(n) log p(n)
     = - sum_{n=2}^inf (1/(A n log^2 n)) log (1/(A n log^2 n))
     = sum_{n=2}^inf log(A n log^2 n) / (A n log^2 n)
     = sum_{n=2}^inf (log A + log n + 2 log log n) / (A n log^2 n)
     = log A + sum_{n=2}^inf 1/(A n log n) + sum_{n=2}^inf (2 log log n)/(A n log^2 n).

The first term is finite. For base 2 logarithms, all the elements in the sum in the last term are nonnegative. (For any other base, the terms of the last sum eventually all become positive.) So all we have to do is bound the middle sum, which we do by comparing with an integral:

sum_{n=2}^inf 1/(A n log n) > integral_2^inf 1/(A x log x) dx = K ln ln x |_2^inf = +infinity.

We conclude that H(X) = +infinity.

20. Run length coding. Let X1, X2, ..., Xn be (possibly dependent) binary random variables. Suppose one calculates the run lengths R = (R1, R2, ...) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2, 2). Compare H(X1, X2, ..., Xn), H(R) and H(Xn, R). Show all equalities and inequalities, and bound all the differences.

Solution: Run length coding. Since the run lengths are a function of X1, X2, ..., Xn, H(R) <= H(X). Any Xi together with the run lengths determines the entire sequence X1, X2, ..., Xn. Hence

H(X1, X2, ..., Xn) = H(Xi, R).

Moreover,

H(Xi, R) = H(R) + H(Xi|R) <= H(R) + H(Xi) <= H(R) + 1.

21. Markov's inequality for probabilities. Let p(x) be a probability mass function. Prove, for all d >= 0, that

Pr{p(X) <= d} log(1/d) <= H(X).

Solution: Markov's inequality applied to entropy.

Pr{p(X) < d} log(1/d) = sum_{x: p(x) < d} p(x) log(1/d)
                      <= sum_{x: p(x) < d} p(x) log(1/p(x))
                      <= sum_x p(x) log(1/p(x))
                      = H(X).

22. Logical order of ideas. Ideas have been developed in order of need, and then generalized if necessary. Reorder the following ideas, strongest first, implications following:

(a) Chain rule for I(X1, ..., Xn; Y), chain rule for D(p(x1, ..., xn)||q(x1, x2, ..., xn)), and chain rule for H(X1, X2, ..., Xn).
(b) D(f||g) >= 0, Jensen's inequality, I(X; Y) >= 0.

Solution: Logical ordering of ideas. The following orderings are subjective.

(a) Since I(X; Y) = D(p(x, y)||p(x)p(y)) is a special case of relative entropy, it is possible to derive the chain rule for I from the chain rule for D. Since H(X) = I(X; X), it is possible to derive the chain rule for H from the chain rule for I. It is also possible to derive the chain rule for I from the chain rule for H, as was done in the notes.

(b) In class, Jensen's inequality was used to prove the non-negativity of D. The inequality I(X; Y) >= 0 followed as a special case of the non-negativity of D.

23. Conditional mutual information. Consider a sequence of n binary random variables X1, X2, ..., Xn. Each sequence with an even number of 1's has probability 2^{-(n-1)}, and each sequence with an odd number of 1's has probability 0. Find the mutual informations

I(X1; X2), I(X2; X3|X1), ..., I(X_{n-1}; Xn | X1, ..., X_{n-2}).

Solution: Conditional mutual information. Each sequence of length n with an even number of 1's is equally likely and has probability 2^{−(n−1)}. Any n−1 or fewer of these random variables are independent. Thus, for k ≤ n−1,

I(X_{k−1}; Xk|X1, X2, ..., X_{k−2}) = 0.

However, given X1, X2, ..., X_{n−2}, we know that once we know either X_{n−1} or Xn we know the other. Thus,

I(X_{n−1}; Xn|X1, ..., X_{n−2}) = H(Xn|X1, ..., X_{n−2}) − H(Xn|X1, ..., X_{n−1}) = 1 − 0 = 1 bit.

24. Average entropy. Let H(p) = −p log₂ p − (1−p) log₂(1−p) be the binary entropy function.
(a) Evaluate H(1/4) using the fact that log₂ 3 ≈ 1.585. Hint: You may wish to consider an experiment with four equally likely outcomes, one of which is more interesting than the others.
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly in the range 0 ≤ p ≤ 1.
(c) (Optional) Calculate the average entropy H(p1, p2, p3) where (p1, p2, p3) is a uniformly distributed probability vector. Generalize to dimension n.

Solution: Average entropy.
(a) We can generate two bits of information by picking one of four equally likely alternatives. This selection can be made in two steps. First we decide whether the first outcome occurs. Since this has probability 1/4, the information generated is H(1/4). If not the first outcome, then we select one of the three remaining outcomes; with probability 3/4, this produces log₂ 3 bits of information. Thus

H(1/4) + (3/4) log₂ 3 = 2,

and so H(1/4) = 2 − (3/4) log₂ 3 = 2 − (.75)(1.585) = 0.811 bits.
(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1, then the average entropy (in nats) is

−∫₀¹ [p ln p + (1−p) ln(1−p)] dp = −2 ∫₀¹ x ln x dx = −2 [ (x²/2) ln x − x²/4 ]₀¹ = 1/2.

Therefore the average entropy is (1/2) log₂ e = 1/(2 ln 2) = 0.721 bits.

(c) Choosing a uniformly distributed probability vector (p1, p2, p3) is equivalent to choosing a point (p1, p2) uniformly from the triangle 0 ≤ p1 ≤ 1, 0 ≤ p2 ≤ 1 − p1. The probability density function has the constant value 2 because the area of the triangle is 1/2. So the average entropy H(p1, p2, p3) is

−2 ∫₀¹ ∫₀^{1−p1} [p1 ln p1 + p2 ln p2 + (1−p1−p2) ln(1−p1−p2)] dp2 dp1.

After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.

25. Venn diagrams. There isn't really a notion of mutual information common to three random variables. Here is one attempt at a definition: Using Venn diagrams, we can see that the mutual information common to three random variables X, Y and Z can be defined by

I(X; Y; Z) = I(X; Y) − I(X; Y|Z).

This quantity is symmetric in X, Y and Z, despite the preceding asymmetric definition. Unfortunately, I(X; Y; Z) is not necessarily nonnegative. Find X, Y and Z such that I(X; Y; Z) < 0, and prove the following two identities:
(a) I(X; Y; Z) = H(X, Y, Z) − H(X) − H(Y) − H(Z) + I(X; Y) + I(X; Z) + I(Y; Z)
(b) I(X; Y; Z) = H(X, Y, Z) − H(X, Y) − H(Y, Z) − H(X, Z) + H(X) + H(Y) + H(Z)

The first identity can be understood using the Venn diagram analogy for entropy and mutual information. The second identity follows easily from the first.
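The closed forms in parts (a) and (b) are easy to confirm numerically. A small sketch (mine, not part of the manual), using a midpoint rule for the average in part (b):

```python
from math import log, log2

def H2(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# part (b): average of H2(p) for p uniform on [0, 1], via the midpoint rule
N = 100000
avg = sum(H2((k + 0.5) / N) for k in range(N)) / N
exact = 1 / (2 * log(2))   # 1/2 nat expressed in bits ≈ 0.7213

print(H2(0.25), avg, exact)
```

The first printed value agrees with part (a), H(1/4) = 2 − (3/4) log₂ 3 ≈ 0.811 bits, and the numeric average agrees with 1/(2 ln 2) ≈ 0.7213 bits.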
Solution: Venn diagrams. To show the first identity,

I(X; Y; Z) = I(X; Y) − I(X; Y|Z)                                  (by definition)
           = I(X; Y) − (I(X; Y, Z) − I(X; Z))                     (by the chain rule)
           = I(X; Y) + I(X; Z) − I(X; Y, Z)
           = I(X; Y) + I(X; Z) − (H(X) + H(Y, Z) − H(X, Y, Z))
           = I(X; Y) + I(X; Z) − H(X) + H(X, Y, Z) − (H(Y) + H(Z) − I(Y; Z))
           = I(X; Y) + I(X; Z) + I(Y; Z) + H(X, Y, Z) − H(X) − H(Y) − H(Z).

To show the second identity, simply substitute for I(X; Y), I(X; Z) and I(Y; Z) using equations like I(X; Y) = H(X) + H(Y) − H(X, Y). These two identities show that I(X; Y; Z) is a symmetric (but not necessarily nonnegative) function of three random variables. For an example with I(X; Y; Z) < 0, let X and Y be independent fair bits and let Z = X ⊕ Y; then I(X; Y) = 0 but I(X; Y|Z) = 1, so I(X; Y; Z) = −1.

26. Another proof of non-negativity of relative entropy. In view of the fundamental nature of the result D(p||q) ≥ 0, we will give another proof.
(a) Show that ln x ≤ x − 1 for 0 < x < ∞.
(b) Justify the following steps:

−D(p||q) = Σ_x p(x) ln(q(x)/p(x)) ≤ Σ_x p(x) (q(x)/p(x) − 1) ≤ 0.

(c) What are the conditions for equality?

i.48) 1 1 for 0 < x < ∞ . There are many ways to prove this..52) (2.53) p(x) x∈A q (x) − (c) What are the conditions for equality? We have equality in the inequality ln t ≤ t − 1 if and only if t = 1 . (b) We let A be the set of x such that p(x) > 0 . x − 1 − ln x ≥ 1 − 1 − ln 1 = 0 which gives us the desired inequality.46) (2. i. the second step follows from the inequality ln t ≤ t − 1 . Relative Entropy and Mutual Information (b) Justify the following steps: −D (p||q ) = ≤ p(x) ln x q (x) p(x) (2. and the last step from the fact that the q (A) ≤ 1 and p(A) = 1 . −De (p||q ) = ≤ = ≤ 0 x∈A (2. we will give another proof. when x = 1 . Thus the condition for equality is that p(x) = q (x) for all x .47) p(x) x q (x) −1 p(x) ≤ 0 (c) What are the conditions for equality? Solution: Another proof of non-negativity of relative entropy. Therefore f (x) ≥ f (1) .. Let f (x) = x − 1 − ln x (2.e. The function has a local minimum at the point where f (x) = 0 . the third step from expanding the sum. Then f (x) = 1 − x and f (x) = x 2 > 0 . (a) Show that ln x ≤ x − 1 for 0 < x < ∞ . . Therefore a local minimum of the function is also a global minimum. and therefore f (x) is strictly convex. Therefore we have equality in step 2 of the chain iﬀ q (x)/p(x) = 1 for all x ∈ A . In view of the fundamental nature of the result D (p||q ) ≥ 0 . This implies that p(x) = q (x) for all x .45) (2.49) p(x)ln x∈A q (x) p(x) q (x) −1 p(x) p(x) x∈A (2.28 Entropy.50) (2.51) (2.e. Equality occurs only if x = 1 . The easiest is using calculus. The ﬁrst step follows from the deﬁnition of D . and we have equality in the last step as well.

27. Grouping rule for entropy. Let p = (p1, p2, ..., pm) be a probability distribution on m elements, i.e., pi ≥ 0 and Σ_{i=1}^m pi = 1. Define a new distribution q on m−1 elements as q1 = p1, q2 = p2, ..., q_{m−2} = p_{m−2}, and q_{m−1} = p_{m−1} + pm, i.e., the distribution q is the same as p on {1, 2, ..., m−2}, and the probability of the last element in q is the sum of the last two probabilities of p. Show that

H(p) = H(q) + (p_{m−1} + pm) H( p_{m−1}/(p_{m−1}+pm), pm/(p_{m−1}+pm) ).

Solution:

H(p) = −Σ_{i=1}^m pi log pi
     = −Σ_{i=1}^{m−2} pi log pi − p_{m−1} log p_{m−1} − pm log pm
     = −Σ_{i=1}^{m−2} pi log pi − p_{m−1} log[p_{m−1}/(p_{m−1}+pm)] − pm log[pm/(p_{m−1}+pm)] − (p_{m−1}+pm) log(p_{m−1}+pm)
     = H(q) − p_{m−1} log[p_{m−1}/(p_{m−1}+pm)] − pm log[pm/(p_{m−1}+pm)]
     = H(q) + (p_{m−1}+pm) H₂( p_{m−1}/(p_{m−1}+pm), pm/(p_{m−1}+pm) ),

where H₂(a, b) = −a log a − b log b.

28. Mixing increases entropy. Show that the entropy of the probability distribution (p1, ..., pi, ..., pj, ..., pm) is less than the entropy of the distribution (p1, ..., (pi+pj)/2, ..., (pi+pj)/2, ..., pm). Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.

Solution: Mixing increases entropy. This problem depends on the convexity of the log function. Let

P1 = (p1, ..., pi, ..., pj, ..., pm),
P2 = (p1, ..., (pi+pj)/2, ..., (pj+pi)/2, ..., pm).

Then, by the log sum inequality,

H(P2) − H(P1) = −2 ((pi+pj)/2) log((pi+pj)/2) + pi log pi + pj log pj
              = −(pi+pj) log((pi+pj)/2) + pi log pi + pj log pj
              ≥ 0.
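Both facts are easy to verify on a concrete distribution. The sketch below (mine; the example distribution is arbitrary) checks the grouping identity exactly and the mixing inequality strictly:

```python
from math import log2

def H(p):
    # entropy in bits
    return -sum(x * log2(x) for x in p if x > 0)

p = [0.3, 0.25, 0.2, 0.15, 0.1]
s = p[-2] + p[-1]
q = p[:-2] + [s]                             # merge the last two masses
lhs = H(p)
rhs = H(q) + s * H([p[-2] / s, p[-1] / s])   # grouping rule

mixed = [(p[0] + p[1]) / 2, (p[0] + p[1]) / 2] + p[2:]   # average two masses
print(lhs, rhs, H(mixed))
```

The first two printed values coincide (the grouping rule is an identity), and the third exceeds them (mixing strictly increases entropy whenever the two averaged masses differ).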

Thus, H(P2) ≥ H(P1).

29. Inequalities. Let X, Y and Z be joint random variables. Prove the following inequalities and find conditions for equality.
(a) H(X, Y|Z) ≥ H(X|Z)
(b) I(X, Y; Z) ≥ I(X; Z)
(c) H(X, Y, Z) − H(X, Y) ≤ H(X, Z) − H(X)
(d) I(X; Z|Y) ≥ I(Z; Y|X) − I(Z; Y) + I(X; Z)

Solution: Inequalities.
(a) Using the chain rule for conditional entropy,

H(X, Y|Z) = H(X|Z) + H(Y|X, Z) ≥ H(X|Z),

with equality iff H(Y|X, Z) = 0, that is, when Y is a function of X and Z.
(b) Using the chain rule for mutual information,

I(X, Y; Z) = I(X; Z) + I(Y; Z|X) ≥ I(X; Z),

with equality iff I(Y; Z|X) = 0, that is, when Y and Z are conditionally independent given X.
(c) Using first the chain rule for entropy and then the definition of conditional mutual information,

H(X, Y, Z) − H(X, Y) = H(Z|X, Y) = H(Z|X) − I(Y; Z|X) ≤ H(Z|X) = H(X, Z) − H(X),

with equality iff I(Y; Z|X) = 0, that is, when Y and Z are conditionally independent given X.
(d) Using the chain rule for mutual information,

I(X; Z|Y) + I(Z; Y) = I(X, Y; Z) = I(Z; Y|X) + I(X; Z),

and therefore I(X; Z|Y) = I(Z; Y|X) − I(Z; Y) + I(X; Z). We see that this inequality is actually an equality in all cases.

30. Maximum entropy. Find the probability mass function p(x) that maximizes the entropy H(X) of a non-negative integer-valued random variable X subject to the constraint

EX = Σ_{n=0}^∞ n p(n) = A

for a fixed value A > 0. Evaluate this maximum H(X).
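Inequalities (a)-(c) and the equality in (d) can be stress-tested on random joint distributions of three binary variables. A sketch (mine, not from the manual) that computes every entropy from the joint pmf:

```python
import random
from math import log2
from collections import defaultdict

def H(vals):
    return -sum(v * log2(v) for v in vals if v > 0)

def marginal(p, axes):
    m = defaultdict(float)
    for k, v in p.items():
        m[tuple(k[a] for a in axes)] += v
    return list(m.values())

rng = random.Random(0)
violations = 0
for _ in range(500):
    w = [rng.random() for _ in range(8)]
    t = sum(w)
    p = {(x, y, z): w[4 * x + 2 * y + z] / t
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}
    Hx, Hy, Hz = (H(marginal(p, (i,))) for i in (0, 1, 2))
    Hxy, Hxz, Hyz = (H(marginal(p, a)) for a in ((0, 1), (0, 2), (1, 2)))
    Hxyz = H(p.values())
    eps = 1e-9
    if Hxyz - Hz < Hxz - Hz - eps:                      # (a) H(X,Y|Z) >= H(X|Z)
        violations += 1
    if (Hxy + Hz - Hxyz) < (Hx + Hz - Hxz) - eps:       # (b) I(X,Y;Z) >= I(X;Z)
        violations += 1
    if Hxyz - Hxy > Hxz - Hx + eps:                     # (c)
        violations += 1
    lhs_d = Hxy + Hyz - Hy - Hxyz                       # I(X;Z|Y)
    rhs_d = (Hxz + Hxy - Hx - Hxyz) - (Hy + Hz - Hyz) + (Hx + Hz - Hxz)
    if abs(lhs_d - rhs_d) > eps:                        # (d) holds with equality
        violations += 1
print(violations)
```

No violations occur, and (d) is confirmed to hold with equality, as derived above.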

Solution: Maximum entropy. Recall that, for any two probability distributions {pi} and {qi},

−Σ_{i=0}^∞ pi log pi ≤ −Σ_{i=0}^∞ pi log qi.

Let qi = α β^i. Then

−Σ_{i=0}^∞ pi log pi ≤ −Σ_{i=0}^∞ pi log qi = −log(α) Σ_{i=0}^∞ pi − log(β) Σ_{i=0}^∞ i pi = −log α − A log β.

Notice that the final right hand side expression is independent of {pi}, and that the inequality holds for all α, β such that

Σ_{i=0}^∞ α β^i = 1 = α/(1−β).

The constraint on the expected value also requires that

Σ_{i=0}^∞ i α β^i = A = αβ/(1−β)².

Combining the two constraints, we have

αβ/(1−β)² = (α/(1−β)) · (β/(1−β)) = β/(1−β) = A,

which implies that

β = A/(A+1),   α = 1/(A+1).

Plugging these values into the expression for the maximum entropy,

−log α − A log β = (A+1) log(A+1) − A log A.

So the entropy maximizing distribution is

pi = (1/(A+1)) (A/(A+1))^i.

The general form of the distribution, pi = α β^i, can be obtained either by guessing or by Lagrange multipliers, where

F(pi, λ1, λ2) = −Σ_{i=0}^∞ pi log pi + λ1 (Σ_{i=0}^∞ pi − 1) + λ2 (Σ_{i=0}^∞ i pi − A)

is the function whose gradient we set to 0. To complete the argument with Lagrange multipliers, it is necessary to show that the local maximum is the global maximum. One possible argument is based on the fact that −H(p) is convex; it has only one local minimum and no local maxima, and therefore the Lagrange multiplier method actually gives the global maximum for H(p).

31. Conditional entropy. Under what conditions does H(X|g(Y)) = H(X|Y)?

Solution: (Conditional entropy). If H(X|g(Y)) = H(X|Y), then H(X) − H(X|g(Y)) = H(X) − H(X|Y), i.e., I(X; g(Y)) = I(X; Y). This is the condition for equality in the data processing inequality. From the derivation of the inequality, we have equality iff X → g(Y) → Y forms a Markov chain. Hence H(X|g(Y)) = H(X|Y) iff X → g(Y) → Y. This condition includes many special cases, such as g being one-to-one, and X and Y being independent. However, these two special cases do not exhaust all the possibilities.

32. Fano. We are given the following joint distribution on (X, Y):

  Y\X    1      2      3
   a    1/6    1/12   1/12
   b    1/12   1/6    1/12
   c    1/12   1/12   1/6

Let X̂(Y) be an estimator for X (based on Y) and let Pe = Pr{X̂(Y) ≠ X}.
(a) Find the minimum probability of error estimator X̂(Y) and the associated Pe.
(b) Evaluate Fano's inequality for this problem and compare.
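The geometric solution and its closed-form entropy can be verified numerically. A sketch (mine; A = 3 is an arbitrary example value, and the series is truncated where the tail is negligible):

```python
from math import log2

A = 3.0
alpha, beta = 1 / (A + 1), A / (A + 1)
p = [alpha * beta ** i for i in range(2000)]   # geometric pmf, truncated
H = -sum(x * log2(x) for x in p)
mean = sum(i * x for i, x in enumerate(p))
closed = (A + 1) * log2(A + 1) - A * log2(A)

print(H, mean, closed)
```

The computed entropy matches (A+1) log(A+1) − A log A and the mean matches A, confirming both the constraint and the claimed maximum.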

Solution:
(a) From inspection we see that the minimum probability of error estimator is

X̂(y) = 1 if y = a;  2 if y = b;  3 if y = c.

Hence the associated Pe is the sum of P(1, b), P(1, c), P(2, a), P(2, c), P(3, a) and P(3, b):

Pe = 6 · (1/12) = 1/2.

(b) From Fano's inequality we know

Pe ≥ (H(X|Y) − 1)/log|X|.

Here,

H(X|Y) = H(X|Y = a) Pr{y = a} + H(X|Y = b) Pr{y = b} + H(X|Y = c) Pr{y = c}
       = H(1/2, 1/4, 1/4) (Pr{y = a} + Pr{y = b} + Pr{y = c})
       = H(1/2, 1/4, 1/4)
       = 1.5 bits.

Hence

Pe ≥ (1.5 − 1)/log 3 = 0.316.

Our estimator X̂(Y) is not very close to Fano's bound in this form. If X̂ ∈ X, as it does here, we can use the stronger form of Fano's inequality to get

Pe ≥ (H(X|Y) − 1)/log(|X| − 1) = (1.5 − 1)/log 2 = 1/2.

Therefore our estimator X̂(Y) is actually quite good.

33. Fano's inequality. Let Pr(X = i) = pi, i = 1, 2, ..., m, and let p1 ≥ p2 ≥ p3 ≥ ··· ≥ pm. The minimal probability of error predictor of X is X̂ = 1, with resulting probability of error Pe = 1 − p1. Maximize H(p) subject to the constraint 1 − p1 = Pe to find a bound on Pe in terms of H. This is Fano's inequality in the absence of conditioning.
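The numbers in this example — H(X|Y) = 1.5 bits and Pe = 1/2 — are easy to recompute from the joint table. A sketch (mine, not from the manual):

```python
from math import log2

# joint pmf of the problem: rows y = a, b, c; columns x = 1, 2, 3
P = {('a', 1): 1/6,  ('a', 2): 1/12, ('a', 3): 1/12,
     ('b', 1): 1/12, ('b', 2): 1/6,  ('b', 3): 1/12,
     ('c', 1): 1/12, ('c', 2): 1/12, ('c', 3): 1/6}

pY = {y: sum(P[(y, x)] for x in (1, 2, 3)) for y in 'abc'}
HXgY = -sum(pr * log2(pr / pY[y]) for (y, x), pr in P.items())

# the MAP estimator picks argmax_x P(x, y) for each y; its error probability:
Pe = 1 - sum(max(P[(y, x)] for x in (1, 2, 3)) for y in 'abc')
print(HXgY, Pe)
```

This confirms H(X|Y) = 1.5 bits and Pe = 1/2, and one can check that the strong form of Fano's bound, (H(X|Y) − 1)/log(|X| − 1) = 1/2, is met with equality here.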

Solution: (Fano's inequality.) The minimal probability of error predictor when there is no information is X̂ = 1, the most probable value of X. The probability of error in this case is Pe = 1 − p1. Hence if we fix Pe, we fix p1. We maximize the entropy of X for a given Pe to obtain an upper bound on the entropy for a given Pe:

H(p) = −p1 log p1 − Σ_{i=2}^m pi log pi
     = −p1 log p1 − Pe Σ_{i=2}^m (pi/Pe) log((pi/Pe) · Pe)
     = −p1 log p1 − Pe log Pe − Pe Σ_{i=2}^m (pi/Pe) log(pi/Pe)
     = H(Pe) + Pe H(p2/Pe, p3/Pe, ..., pm/Pe)
     ≤ H(Pe) + Pe log(m−1),

since the maximum of H(p2/Pe, p3/Pe, ..., pm/Pe) is attained by a uniform distribution. Hence any X that can be predicted with a probability of error Pe must satisfy

H(X) ≤ H(Pe) + Pe log(m−1),

which is the unconditional form of Fano's inequality. We can weaken this inequality to obtain an explicit lower bound for Pe:

Pe ≥ (H(X) − 1)/log(m−1).

34. Entropy of initial conditions. Prove that H(X0|Xn) is non-decreasing with n for any Markov chain.

Solution: Entropy of initial conditions. For a Markov chain, by the data processing theorem, we have

I(X0; X_{n−1}) ≥ I(X0; Xn).

Therefore H(X0) − H(X0|X_{n−1}) ≥ H(X0) − H(X0|Xn), or H(X0|Xn) increases with n.

35. Relative entropy is not symmetric. Let the random variable X have three possible outcomes {a, b, c}. Consider two distributions on this random variable:

  Symbol   p(x)   q(x)
    a      1/2    1/3
    b      1/4    1/3
    c      1/4    1/3

Calculate H(p), H(q), D(p||q) and D(q||p). Verify that in this case D(p||q) ≠ D(q||p).

Solution:

H(p) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits.

H(q) = (1/3) log 3 + (1/3) log 3 + (1/3) log 3 = log 3 = 1.58496 bits.

D(p||q) = (1/2) log(3/2) + (1/4) log(3/4) + (1/4) log(3/4) = log 3 − 1.5 = 1.58496 − 1.5 = 0.08496 bits.

D(q||p) = (1/3) log(2/3) + (1/3) log(4/3) + (1/3) log(4/3) = 5/3 − log 3 = 1.66666 − 1.58496 = 0.08170 bits.

Thus D(p||q) ≠ D(q||p).

36. Symmetric relative entropy. Though, as the previous example shows, D(p||q) ≠ D(q||p) in general, there could be distributions for which equality holds. Give an example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p) (other than the trivial case p = q).

Solution: A simple case for D((p, 1−p)||(q, 1−q)) = D((q, 1−q)||(p, 1−p)), i.e., for

p log(p/q) + (1−p) log((1−p)/(1−q)) = q log(q/p) + (1−q) log((1−q)/(1−p)),

is when q = 1 − p.

37. Relative entropy. Let X, Y, Z be three random variables with a joint probability mass function p(x, y, z). The relative entropy between the joint distribution and the product of the marginals is

D(p(x, y, z)||p(x)p(y)p(z)) = E log [p(x, y, z)/(p(x)p(y)p(z))].

Expand this in terms of entropies. When is this quantity zero?

Solution:

D(p(x, y, z)||p(x)p(y)p(z)) = E[log p(x, y, z)] − E[log p(x)] − E[log p(y)] − E[log p(z)]
                            = −H(X, Y, Z) + H(X) + H(Y) + H(Z).

We have D(p(x, y, z)||p(x)p(y)p(z)) = 0 if and only if p(x, y, z) = p(x)p(y)p(z) for all (x, y, z), i.e., if X and Y and Z are independent.

38. The value of a question. Let X ∼ p(x), x = 1, 2, ..., m. We are given a set S ⊆ {1, 2, ..., m}. We ask whether X ∈ S and receive the answer

Y = 1 if X ∈ S,  Y = 0 if X ∉ S.

Suppose Pr{X ∈ S} = α. Find the decrease in uncertainty H(X) − H(X|Y).
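The asymmetry of Problem 35 and the symmetric example of Problem 36 can both be checked in a few lines. A sketch (mine; the 0.2/0.8 pair is an arbitrary instance of q = 1 − p):

```python
from math import log2

def D(p, q):
    # relative entropy in bits
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
d_pq, d_qp = D(p, q), D(q, p)

# Problem 36's symmetric case on a binary alphabet: q = 1 - p
d_sym1 = D([0.2, 0.8], [0.8, 0.2])
d_sym2 = D([0.8, 0.2], [0.2, 0.8])
print(d_pq, d_qp, d_sym1, d_sym2)
```

The first two values differ (≈ 0.0850 vs ≈ 0.0817 bits), matching the computation above, while the last two coincide.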

Solution: The value of a question.

H(X) − H(X|Y) = I(X; Y) = H(Y) − H(Y|X) = H(α) − H(Y|X) = H(α),

since H(Y|X) = 0. Apparently any set S with a given α is as good as any other.

39. Entropy and pairwise independence. Let X, Y, Z be three binary Bernoulli(1/2) random variables that are pairwise independent, that is, I(X; Y) = I(X; Z) = I(Y; Z) = 0.
(a) Under this constraint, what is the minimum value for H(X, Y, Z)?
(b) Give an example achieving this minimum.

Solution:
(a) H(X, Y, Z) = H(X, Y) + H(Z|X, Y) ≥ H(X, Y) = 2. So the minimum value for H(X, Y, Z) is at least 2. To show that it is actually equal to 2, we show in part (b) that this bound is attainable.
(b) Let X and Y be iid Bernoulli(1/2) and let Z = X ⊕ Y, where ⊕ denotes addition mod 2 (xor). (See solution to Problem 2.1.) Any two of X, Y, Z are independent, yet any two of them determine the third, so H(X, Y, Z) = H(X, Y) = 2.

40. Discrete entropies. Let X and Y be two independent integer-valued random variables. Let X be uniformly distributed over {1, 2, ..., 8}, and let Pr{Y = k} = 2^{−k}, k = 1, 2, 3, ...
(a) Find H(X).
(b) Find H(Y).
(c) Find H(X + Y, X − Y).

Solution:
(a) For a uniform distribution, H(X) = log m = log 8 = 3.
(b) For a geometric distribution, H(Y) = Σ_k k 2^{−k} = 2.
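The xor construction in Problem 39(b) can be verified exhaustively — four equally likely triples, pairwise mutual informations all zero, joint entropy exactly 2 bits. A sketch (mine, not from the manual):

```python
from math import log2
from collections import Counter

p = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}   # Z = X xor Y

def H(vals):
    return -sum(v * log2(v) for v in vals if v > 0)

def marg(axes):
    m = Counter()
    for t, pr in p.items():
        m[tuple(t[a] for a in axes)] += pr
    return m.values()

HXYZ = H(p.values())
pair_I = [H(marg((i,))) + H(marg((j,))) - H(marg((i, j)))
          for i, j in ((0, 1), (0, 2), (1, 2))]
print(HXYZ, pair_I)
```

This prints a joint entropy of 2.0 bits and three pairwise mutual informations equal to 0, confirming that the lower bound of part (a) is attained.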

(c) Since (X, Y) → (X + Y, X − Y) is a one-to-one transformation,

H(X + Y, X − Y) = H(X, Y) = H(X) + H(Y) = 3 + 2 = 5 bits.

41. Random questions. One wishes to identify a random object X ∼ p(x). A question Q ∼ r(q) is asked at random according to r(q). This results in a deterministic answer A = A(x, q) ∈ {a1, a2, ...}. Suppose X and Q are independent. Then I(X; Q, A) is the uncertainty in X removed by the question-answer (Q, A).
(a) Show I(X; Q, A) = H(A|Q). Interpret.
(b) Now suppose that two i.i.d. questions Q1, Q2 ∼ r(q) are asked, eliciting answers A1 and A2. Show that two questions are less valuable than twice a single question in the sense that I(X; Q1, A1, Q2, A2) ≤ 2 I(X; Q1, A1).

Solution: Random questions.
(a)

I(X; Q, A) = H(Q, A) − H(Q, A|X)
           = H(Q) + H(A|Q) − H(Q|X) − H(A|Q, X)
           = H(Q) + H(A|Q) − H(Q)
           = H(A|Q).

The interpretation is as follows: the uncertainty removed in X by (Q, A) is the same as the uncertainty in the answer given the question.
(b) Using the result from part (a) and the fact that the questions are independent, we can easily obtain the desired relationship:

I(X; Q1, A1, Q2, A2)
(a)= I(X; Q1) + I(X; A1|Q1) + I(X; Q2|A1, Q1) + I(X; A2|A1, Q1, Q2)
(b)= I(X; A1|Q1) + I(X; Q2|A1, Q1) + I(X; A2|A1, Q1, Q2)
(c)= I(X; A1|Q1) + I(X; A2|A1, Q1, Q2)
(d)= I(X; A1|Q1) + H(A2|A1, Q1, Q2) − H(A2|X, A1, Q1, Q2)
(e)= I(X; A1|Q1) + H(A2|A1, Q1, Q2)
(f)≤ I(X; A1|Q1) + H(A2|Q2)
   = 2 I(X; Q1, A1),

where (a) is the chain rule, (b) holds because X and Q1 are independent, (c) because Q2 is independent of X, Q1 and A1, (d) is the definition of mutual information, (e) holds because A2 is completely determined given Q2 and X, and (f) because conditioning decreases entropy; the final step uses the result from part (a).

42. Inequalities. Which of the following inequalities are generally ≥, =, ≤? Label each with ≥, =, or ≤.
(a) H(5X) vs. H(X)
(b) I(g(X); Y) vs. I(X; Y)
(c) H(X0|X−1) vs. H(X0|X−1, X1)
(d) H(X1, X2, ..., Xn) vs. H(c(X1, X2, ..., Xn)), where c(x1, x2, ..., xn) is the Huffman codeword assigned to (x1, x2, ..., xn)
(e) H(X, Y)/(H(X) + H(Y)) vs. 1

Solution:
(a) X → 5X is a one-to-one mapping, and hence H(X) = H(5X). [=]
(b) By the data processing inequality, I(g(X); Y) ≤ I(X; Y). [≤]
(c) Because conditioning reduces entropy, H(X0|X−1) ≥ H(X0|X−1, X1). [≥]
(d) The Huffman codeword c(X1, ..., Xn) is a one-to-one function of (X1, ..., Xn), so H(X1, X2, ..., Xn) = H(c(X1, ..., Xn)). [=]
(e) H(X, Y) ≤ H(X) + H(Y), so H(X, Y)/(H(X) + H(Y)) ≤ 1. [≤]

43. Mutual information of heads and tails.
(a) Consider a fair coin flip. What is the mutual information between the top side and the bottom side of the coin?
(b) A 6-sided fair die is rolled. What is the mutual information between the top side and the front face (the side most facing you)?

Solution: Mutual information of heads and tails.
To prove (a), observe that

I(T; B) = H(B) − H(B|T) = log 2 = 1,

since B ∼ Ber(1/2) and B = f(T). Here B and T stand for Bottom and Top respectively.
To prove (b), note that having observed a side of the cube facing us F, there are four possibilities for the top T, which are equally probable. Thus,

I(T; F) = H(T) − H(T|F) = log 6 − log 4 = log 3 − 1,

since T has uniform distribution on {1, 2, ..., 6}.
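Part (b) of Problem 43 can be checked by enumerating the 24 (top, front) pairs. The sketch below (mine) assumes the natural model in which, given the top face, the front face is uniform over the four side faces:

```python
from math import log2
from collections import Counter

opp = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}   # opposite faces of a standard die
# top T uniform on 1..6; given T, front F uniform over the 4 side faces
p = {(t, f): 1 / 24 for t in range(1, 7) for f in range(1, 7)
     if f not in (t, opp[t])}

def H(vals):
    return -sum(v * log2(v) for v in vals if v > 0)

pT, pF = Counter(), Counter()
for (t, f), pr in p.items():
    pT[t] += pr
    pF[f] += pr
I_TF = H(pT.values()) + H(pF.values()) - H(p.values())
print(I_TF, log2(3) - 1)
```

The enumeration gives I(T; F) = 2 log 6 − log 24 = log 3 − 1 ≈ 0.585 bits, agreeing with the argument above.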

44. Finite entropy. Show that, for a discrete random variable X ∈ {1, 2, ...}, if E log X < ∞, then H(X) < ∞.

Solution: Let the distribution on the integers be p1, p2, .... Then H(p) = −Σ pi log pi and E log X = Σ pi log i = c < ∞. We will now find the maximum entropy distribution subject to the constraint on the expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have the following functional to optimize:

J(p) = −Σ_i pi log pi − λ1 Σ_i pi − λ2 Σ_i pi log i.

Differentiating with respect to pi and setting to zero, we find that the pi that maximizes the entropy is of the form

pi = a i^{−λ},

where a = 1/(Σ_i i^{−λ}) and λ is chosen to meet the expected log constraint, i.e., a Σ_i i^{−λ} log i = c (with λ > 1 so that the normalizing sum converges). Using this value of pi, we can see that the entropy of the maximizing distribution is finite. Hence any distribution with E log X < ∞ satisfies H(X) < ∞.

45. Pure randomness. We wish to use a 3-sided coin to generate a fair coin toss. Let the coin X have probability mass function

X = A with probability pA,  B with probability pB,  C with probability pC,

where pA, pB, pC are unknown.
(a) How would you use 2 independent flips X1, X2 to generate (if possible) a Bernoulli(1/2) random variable Z?
(b) What is the resulting maximum expected number of fair bits generated?

Solution:
(a) The trick here is to notice that, for any two letters Y and Z produced by two independent tosses of our bent three-sided coin, YZ has the same probability as ZY. So we can produce B ∼ Bernoulli(1/2) coin flips by letting B = 0 when we get AB, BC or CA, and B = 1 when we get BA, CB or AC (if we get AA, BB or CC we don't assign a value to B).
(b) We get one bit, except when the two flips of the 3-sided coin produce the same symbol. So the expected number of fair bits generated is

0 · [P(AA) + P(BB) + P(CC)] + 1 · [1 − P(AA) − P(BB) − P(CC)] = 1 − p²_A − p²_B − p²_C.

46. Axiomatic definition of entropy. If we assume certain axioms for our measure of information, then we will be forced to use a logarithmic measure like entropy. Shannon used this to justify his initial definition of entropy. In this book, we will rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section.

If a sequence of symmetric functions Hm(p1, p2, ..., pm) satisfies the following properties:
• Normalization: H2(1/2, 1/2) = 1,
• Continuity: H2(p, 1−p) is a continuous function of p,
• Grouping: Hm(p1, p2, ..., pm) = H_{m−1}(p1 + p2, p3, ..., pm) + (p1 + p2) H2(p1/(p1+p2), p2/(p1+p2)),
prove that Hm must be of the form

Hm(p1, p2, ..., pm) = −Σ_{i=1}^m pi log pi,   m = 2, 3, ....

There are various other axiomatic formulations which also result in the same definition of entropy. See, for example, the book by Csiszár and Körner [1].

Solution: Axiomatic definition of entropy. This is a long solution, so we will first outline what we plan to do. First we will extend the grouping axiom by induction and prove that

Hm(p1, ..., pm) = H_{m−k}(p1 + p2 + ··· + pk, p_{k+1}, ..., pm) + (p1 + ··· + pk) Hk( p1/(p1+···+pk), ..., pk/(p1+···+pk) ).

Let f(m) be the entropy of a uniform distribution on m symbols, i.e., f(m) = Hm(1/m, 1/m, ..., 1/m). We will then show that for any two integers r and s, f(rs) = f(r) + f(s). We use this to show that f(m) = log m. We then show for rational p = r/s that H2(p, 1−p) = −p log p − (1−p) log(1−p). By continuity, we will extend it to irrational p, and finally, by induction and grouping, we will extend the result to Hm for m ≥ 2.

For convenience in notation, we will let

Sk = Σ_{i=1}^k pi,

and we will denote H2(q, 1−q) as h(q). Then we can write the grouping axiom as

Hm(p1, ..., pm) = H_{m−1}(S2, p3, ..., pm) + S2 h(p2/S2).

To begin, we extend the grouping axiom.
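The extractor of Problem 45 can be checked exactly by summing over all nine ordered pairs: the two output symbols always have equal probability, and the output rate is 1 − p²_A − p²_B − p²_C. A sketch (mine; the particular bias values are an arbitrary example, unknown to the extractor itself):

```python
from itertools import product

pr = {'A': 0.5, 'B': 0.3, 'C': 0.2}   # example bias of the 3-sided coin

def extract(x, y):
    # map unequal ordered pairs to bits; AA, BB, CC yield no output
    if x == y:
        return None
    return 0 if (x, y) in (('A', 'B'), ('B', 'C'), ('C', 'A')) else 1

p0 = sum(pr[x] * pr[y] for x, y in product(pr, repeat=2) if extract(x, y) == 0)
p1 = sum(pr[x] * pr[y] for x, y in product(pr, repeat=2) if extract(x, y) == 1)
print(p0, p1, p0 + p1)
```

Here p0 = p1 (the output bit is exactly fair, whatever the bias), and p0 + p1 equals the rate 1 − Σ p², which is 0.62 fair bits per pair of flips for this example.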

Applying the grouping axiom again,

Hm(p1, ..., pm) = H_{m−1}(S2, p3, ..., pm) + S2 h(p2/S2)
               = H_{m−2}(S3, p4, ..., pm) + S3 h(p3/S3) + S2 h(p2/S2)
               = ···
               = H_{m−(k−1)}(Sk, p_{k+1}, ..., pm) + Σ_{i=2}^k Si h(pi/Si).

Now we apply the same grouping axiom repeatedly to Hk(p1/Sk, ..., pk/Sk), to obtain

Hk(p1/Sk, ..., pk/Sk) = H2(S_{k−1}/Sk, pk/Sk) + Σ_{i=2}^{k−1} (Si/Sk) h( (pi/Sk)/(Si/Sk) )
                      = (1/Sk) Σ_{i=2}^k Si h(pi/Si).

From the last two equations, it follows that

Hm(p1, ..., pm) = H_{m−(k−1)}(Sk, p_{k+1}, ..., pm) + Sk Hk(p1/Sk, ..., pk/Sk),

which is the extended grouping axiom. Now we need to use an axiom that is not explicitly stated in the text, namely that the function Hm is symmetric with respect to its arguments. Using this, we can combine any set of arguments of Hm using the extended grouping axiom.

Let f(m) denote Hm(1/m, 1/m, ..., 1/m). Consider

f(mn) = H_{mn}(1/mn, 1/mn, ..., 1/mn).

By repeatedly applying the extended grouping axiom (each application groups n of the arguments of size 1/mn into a single argument of size 1/m, and contributes (1/m) Hn(1/n, ..., 1/n)), we have

f(mn) = H_{mn}(1/mn, ..., 1/mn)
      = H_{mn−n+1}(1/m, 1/mn, ..., 1/mn) + (1/m) Hn(1/n, ..., 1/n)
      = ···
      = Hm(1/m, ..., 1/m) + Hn(1/n, ..., 1/n)
      = f(m) + f(n).

We can immediately use this to conclude that f(m^k) = k f(m).

Now we will argue that H2(1, 0) = h(1) = 0. We do this by expanding H3(p1, p2, 0) (with p1 + p2 = 1) in two different ways using the grouping axiom. Grouping the last two arguments,

H3(p1, p2, 0) = H2(p1, p2) + p2 H2(1, 0),

while grouping the first two arguments,

H3(p1, p2, 0) = H2(p1 + p2, 0) + (p1 + p2) H2(p1, p2) = H2(1, 0) + H2(p1, p2).

Thus p2 H2(1, 0) = H2(1, 0) for all p2, and therefore H2(1, 0) = h(1) = 0.

We will also need to show that f(m+1) − f(m) → 0 as m → ∞. To prove this, we use the extended grouping axiom and write

f(m+1) = H_{m+1}(1/(m+1), ..., 1/(m+1))
       = h(1/(m+1)) + (m/(m+1)) Hm(1/m, ..., 1/m)
       = h(1/(m+1)) + (m/(m+1)) f(m),

and therefore

f(m+1) − (m/(m+1)) f(m) = h(1/(m+1)).

Thus lim [f(m+1) − (m/(m+1)) f(m)] = lim h(1/(m+1)). But by the continuity of H2, it follows that the limit on the right is h(0) = 0.

Let us define a_{n+1} = f(n+1) − f(n) and bn = h(1/n), so that f(n) = Σ_{i=2}^n ai. Then

a_{n+1} = −(1/(n+1)) f(n) + b_{n+1} = −(1/(n+1)) Σ_{i=2}^n ai + b_{n+1},

and therefore

(n+1) b_{n+1} = (n+1) a_{n+1} + Σ_{i=2}^n ai.

Therefore, summing over n,

Σ_{n=2}^N n bn = Σ_{n=2}^N (n an + a_{n−1} + ··· + a2) = N Σ_{n=2}^N an.

Dividing both sides by Σ_{n=1}^N n = N(N+1)/2, we obtain

(2/(N+1)) Σ_{n=2}^N an = (Σ_{n=2}^N n bn)/(Σ_{n=2}^N n).

Now by continuity of H2 and the definition of bn, it follows that bn → 0 as n → ∞. Since the right hand side is essentially an average of the bn's, it also goes to 0 (this can be proved more precisely using ε's and δ's). Thus the left hand side goes to 0. We can then see that

a_{N+1} = b_{N+1} − (1/(N+1)) Σ_{n=2}^N an

also goes to 0 as N → ∞. Thus f(n+1) − f(n) → 0 as n → ∞.

We will now prove the following lemma.

Lemma 2.0.1. Let the function f(m) satisfy the following assumptions:
• f(mn) = f(m) + f(n) for all integers m, n;
• lim_{n→∞} (f(n+1) − f(n)) = 0;
• f(2) = 1.
Then f(m) = log₂ m.

Proof of the lemma: Let P be an arbitrary prime number and let

g(n) = f(n) − f(P) log₂ n / log₂ P.

Then g(n) satisfies the first assumption of the lemma, and g(P) = 0. Also, if we let

αn = g(n+1) − g(n) = f(n+1) − f(n) + (f(P)/log₂ P) log₂(n/(n+1)),

then the second assumption in the lemma implies that αn → 0 as n → ∞.

For an integer n, define n^(1) = ⌊n/P⌋, so that n = n^(1) P + l, where 0 ≤ l < P. Then it follows that n^(1) < n/P, and, since g(P n^(1)) = g(P) + g(n^(1)) = g(n^(1)),

g(n) = g(n^(1)) + g(n) − g(P n^(1)) = g(n^(1)) + Σ_{i=P n^(1)}^{n−1} αi.

Just as we have defined n^(1) from n, we can define n^(2) from n^(1), and so on. Continuing this process, we can then write

g(n) = g(n^(k)) + Σ_{j=1}^k Σ_{i=P n^(j)}^{n^(j−1)−1} αi.

Since n^(k) ≤ n/P^k, after k = ⌊log n / log P⌋ + 1 steps we have n^(k) = 0, and g(0) = 0 (this follows directly from the additive property of g). Thus we can write

g(n) = Σ_{i=1}^{t_n} αi,

the sum of t_n terms, where t_n ≤ P(⌊log n / log P⌋ + 1). Since αn → 0 and g(n) has at most O(log₂ n) terms αi, it follows that

g(n)/log₂ n → 0,

and therefore

lim_{n→∞} f(n)/log₂ n = f(P)/log₂ P.

Since P was arbitrary, it follows that f(P)/log₂ P = c for every prime number P. Applying the third axiom of the lemma, it follows that the constant is 1, and f(P) = log₂ P. For composite numbers N = P1 P2 ··· Pl, we can apply the first property of f and the prime number factorization of N to show that

f(N) = Σ f(Pi) = Σ log₂ Pi = log₂ N.

Thus the lemma is proved.

The lemma can be simplified considerably if, instead of the second assumption, we replace it by the assumption that f(n) is monotone in n. We will now argue that the only monotone function f(m) such that f(mn) = f(m) + f(n) for all integers m, n is of the form f(m) = log_a m for some base a. Let c = f(2). Now f(4) = f(2 × 2) = f(2) + f(2) = 2c. Similarly, it is easy to see that f(2^k) = kc = c log₂ 2^k. We will extend this to integers that are not powers of 2. For any integer m, let r > 0 be another integer and let 2^k ≤ m^r < 2^{k+1}. Then, by the monotonicity assumption on f, we have

kc ≤ r f(m) < (k+1)c, i.e., (k/r) c ≤ f(m) < ((k+1)/r) c.

Now by the monotonicity of log, we have

k/r ≤ log₂ m < (k+1)/r.

Combining these two equations, we obtain

| f(m)/c − log₂ m | < 1/r.

Since r was arbitrary, we must have f(m) = c log₂ m, and we can identify c = 1 from the last assumption of the lemma.

Now we are almost done. We have shown that for any uniform distribution on m outcomes,

f(m) = Hm(1/m, ..., 1/m) = log₂ m.

We will now show that

H2(p, 1−p) = −p log p − (1−p) log(1−p).

To begin, let p be a rational number, say r/s. Consider the extended grouping axiom for Hs:

f(s) = Hs(1/s, ..., 1/s)
     = H2(r/s, (s−r)/s) + (r/s) Hr(1/r, ..., 1/r) + ((s−r)/s) H_{s−r}(1/(s−r), ..., 1/(s−r))
     = H2(r/s, (s−r)/s) + (r/s) f(r) + ((s−r)/s) f(s−r).

Substituting f(s) = log₂ s, etc., we obtain

H2(r/s, (s−r)/s) = −(r/s) log₂(r/s) − ((s−r)/s) log₂((s−r)/s).

Thus the formula is true for rational p. By the continuity assumption, it is also true at irrational p.

We have just shown that this is true for m = 2. Now assume that it is true for m = n − 1. By the grouping axiom,

H_n(p_1, ..., p_n) = H_{n−1}(p_1 + p_2, p_3, ..., p_n) + (p_1 + p_2) H_2( p_1/(p_1+p_2), p_2/(p_1+p_2) )
= −(p_1 + p_2) log(p_1 + p_2) − Σ_{i=3}^n p_i log p_i − (p_1+p_2)[ (p_1/(p_1+p_2)) log(p_1/(p_1+p_2)) + (p_2/(p_1+p_2)) log(p_2/(p_1+p_2)) ]
= − Σ_{i=1}^n p_i log p_i.

Thus the statement is true for m = n, and by induction, it is true for all m. Thus we have finally proved that the only symmetric function that satisfies the axioms is

H_m(p_1, ..., p_m) = − Σ_{i=1}^m p_i log p_i.   (2.146)

The proof above is due to Rényi[4].

47. The entropy of a missorted file. A deck of n cards in order 1, 2, ..., n is provided. One card is removed at random, then replaced at random. What is the entropy of the resulting deck?

Solution: The entropy of a missorted file. The heart of this problem is simply carefully counting the possible outcome states. There are n ways to choose which card gets mis-sorted and, once the card is chosen, there are again n ways to choose where the card is replaced in the deck. Each of these n² shuffling actions has probability 1/n². Unfortunately, not all of these n² actions result in a unique mis-sorted file, so we need to carefully count the number of distinguishable outcome states. The resulting deck can only take on one of the following three cases:

• The selected card is at its original location after the replacement.
• The selected card is at most one location away from its original location after the replacement.
• The selected card is at least two locations away from its original location after the replacement.

To compute the entropy of the resulting deck, we need to know the probability of each case.

Case 1 (resulting deck is the same as the original): There are n ways to achieve this outcome state, one for each of the n cards in the deck. Thus the probability associated with case 1 is n/n² = 1/n.

Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, and for each pair there are two ways to achieve the swap: either select the left-hand card and move it one to the right, or select the right-hand card and move it one to the left. Each of these outcome states therefore has probability 2/n².

Case 3 (typical situation): None of the remaining actions "collapses": they all result in unique outcome states, each with probability 1/n². Of the n² possible shuffling actions, n² − n − 2(n − 1) = n² − 3n + 2 of them result in this third case (we've simply subtracted the case 1 and case 2 situations above).

The entropy of the resulting deck can be computed as follows:

H(X) = n·(1/n²) log(n) + (n − 1)·(2/n²) log(n²/2) + (n² − 3n + 2)·(1/n²) log(n²)
     = ((2n − 1)/n) log(n) − 2(n − 1)/n².

48. Sequence length. How much information does the length of a sequence give about the content of a sequence? Suppose we consider a Bernoulli(1/2) process {X_i}. Stop the process when the first 1 appears. Let N designate this stopping time. Thus X^N is an element of the set of all finite length binary sequences {0,1}* = {0, 1, 00, 01, 10, 11, 000, ...}.

(a) Find I(N; X^N).
(b) Find H(X^N | N).
(c) Find H(X^N).

Let's now consider a different stopping time. For this part, again assume X_i ∼ Bernoulli(1/2) but stop at time N = 6 with probability 1/3 and stop at time N = 12 with probability 2/3. Let this stopping time be independent of the sequence X_1 X_2 ... X_12.

(d) Find I(N; X^N).
(e) Find H(X^N | N).
(f) Find H(X^N).

Solution:

(a) I(X^N; N) = H(N) − H(N | X^N) = H(N) − 0
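As a sanity check on the case counting in the missorted-file problem, the short sketch below (ours, not part of the original solution; function names are our own) enumerates all n² remove-and-reinsert actions by brute force and compares the resulting entropy with the closed form ((2n − 1)/n) log₂ n − 2(n − 1)/n²:

```python
import math
from collections import Counter

def missorted_deck_entropy(n):
    """Entropy (bits) of a deck of n cards after removing one card at
    random and reinserting it at a uniformly random position."""
    counts = Counter()
    deck = tuple(range(n))
    for i in range(n):          # which card is removed
        rest = deck[:i] + deck[i + 1:]
        for j in range(n):      # where it is reinserted
            counts[rest[:j] + (deck[i],) + rest[j:]] += 1
    total = n * n
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def closed_form(n):
    # H(X) = ((2n-1)/n) log2 n - 2(n-1)/n^2
    return (2 * n - 1) / n * math.log2(n) - 2 * (n - 1) / n ** 2

h5 = missorted_deck_entropy(5)
```

The enumeration automatically merges the "collapsing" actions of cases 1 and 2, so agreement with the closed form confirms the counting argument.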

= H(N) = E(N) = 2 bits,

where the last equality comes from the fact that the entropy of a geometric random variable with p = 1/2 is just its mean, and H(N | X^N) = 0 since N is determined by X^N.

(b) Since given N we know that X_i = 0 for all i < N and X_N = 1,

H(X^N | N) = 0.

(c) H(X^N) = I(X^N; N) + H(X^N | N) = I(X^N; N) + 0, so

H(X^N) = 2 bits.

(d) I(X^N; N) = H(N) − H(N | X^N) = H(N) − 0 = H_B(1/3),

the entropy of a Bernoulli(1/3) random variable, since N takes the value 6 with probability 1/3 and 12 with probability 2/3, and again N is determined by X^N.

(e) H(X^N | N) = (1/3) H(X^6 | N = 6) + (2/3) H(X^12 | N = 12)
= (1/3) H(X^6) + (2/3) H(X^12)
= (1/3)·6 + (2/3)·12
= 10 bits.

(f) H(X^N) = I(X^N; N) + H(X^N | N) = H_B(1/3) + 10 bits.
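The numerical claims in the sequence-length problem — H(N) = E(N) = 2 bits for the geometric stopping time, and H(X^N) = H_B(1/3) + 10 bits for the two-valued stopping time — can be checked directly (a sketch; variable names are ours):

```python
import math

def H_bern(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# (a)/(c): for N ~ Geometric(1/2), p_n = 2^{-n}, so -log2 p_n = n and
# H(N) = sum_n n 2^{-n} = 2 = E(N)
HN = sum(n * 2 ** -n for n in range(1, 200))

# (e): H(X^N | N) = (1/3)*6 + (2/3)*12
H_cond = (1 / 3) * 6 + (2 / 3) * 12

# (f): H(X^N) = H_B(1/3) + 10
H_total = H_bern(1 / 3) + H_cond
```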

Chapter 3

The Asymptotic Equipartition Property

1. Markov's inequality and Chebyshev's inequality.

(a) (Markov's inequality.) For any non-negative random variable X and any t > 0, show that

Pr{X ≥ t} ≤ EX/t.   (3.1)

Exhibit a random variable that achieves this inequality with equality.

(b) (Chebyshev's inequality.) Let Y be a random variable with mean µ and variance σ². By letting X = (Y − µ)², show that for any ε > 0,

Pr{|Y − µ| > ε} ≤ σ²/ε².   (3.2)

(c) (The weak law of large numbers.) Let Z_1, Z_2, ..., Z_n be a sequence of i.i.d. random variables with mean µ and variance σ². Let Z̄_n = (1/n) Σ_{i=1}^n Z_i be the sample mean. Show that

Pr{|Z̄_n − µ| > ε} ≤ σ²/(nε²).   (3.3)

Thus Pr{|Z̄_n − µ| > ε} → 0 as n → ∞. This is known as the weak law of large numbers.

Solution: Markov's inequality and Chebyshev's inequality.

(a) If X has distribution F(x),

EX = ∫_0^∞ x dF = ∫_0^δ x dF + ∫_δ^∞ x dF

≥ ∫_δ^∞ x dF ≥ ∫_δ^∞ δ dF = δ Pr{X ≥ δ}.

Rearranging sides and dividing by δ we get

Pr{X ≥ δ} ≤ EX/δ.   (3.4)

One student gave a proof based on conditional expectations. It goes like

EX = E(X | X ≥ δ) Pr{X ≥ δ} + E(X | X < δ) Pr{X < δ}
   ≥ E(X | X ≥ δ) Pr{X ≥ δ}
   ≥ δ Pr{X ≥ δ},

which leads to (3.4) as well. Given δ, the distribution achieving Pr{X ≥ δ} = EX/δ is

X = δ with probability µ/δ; 0 with probability 1 − µ/δ,

where µ ≤ δ.

(b) Letting X = (Y − µ)² in Markov's inequality,

Pr{(Y − µ)² > ε²} ≤ Pr{(Y − µ)² ≥ ε²} ≤ E(Y − µ)²/ε² = σ²/ε²,

and noticing that Pr{(Y − µ)² > ε²} = Pr{|Y − µ| > ε}, we get

Pr{|Y − µ| > ε} ≤ σ²/ε².

(c) Letting Y in Chebyshev's inequality from part (b) equal Z̄_n, and noticing that E Z̄_n = µ and Var(Z̄_n) = σ²/n (i.e., Z̄_n is the average of n i.i.d. r.v.'s, each with variance σ²), we have

Pr{|Z̄_n − µ| > ε} ≤ σ²/(nε²).
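The two-point distribution that achieves equality in Markov's inequality can be verified in a few lines (a sketch; the particular values of mu and delta are arbitrary choices of ours):

```python
# Distribution achieving equality in part (a):
# X = delta w.p. mu/delta, and 0 otherwise (requires mu <= delta).
mu, delta = 2.0, 5.0
p_hit = mu / delta                      # Pr{X >= delta}
EX = delta * p_hit + 0.0 * (1 - p_hit)  # = mu
markov_bound = EX / delta               # equals Pr{X >= delta} exactly
```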

2. AEP and mutual information. Let (X_i, Y_i) be i.i.d. ∼ p(x, y). We form the log likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis that X and Y are dependent. What is the limit of

(1/n) log [ p(X^n) p(Y^n) / p(X^n, Y^n) ] ?

Solution:

(1/n) log [ p(X^n) p(Y^n) / p(X^n, Y^n) ] = (1/n) log Π_{i=1}^n [ p(X_i) p(Y_i) / p(X_i, Y_i) ]
= (1/n) Σ_{i=1}^n log [ p(X_i) p(Y_i) / p(X_i, Y_i) ]
→ E log [ p(X) p(Y) / p(X, Y) ]
= −I(X; Y).

Thus p(X^n)p(Y^n)/p(X^n, Y^n) → 2^{−nI(X;Y)} to first order in the exponent, which will converge to 1 if X and Y are indeed independent.

3. Piece of cake. A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions (as in Question 2 of Homework Set #3):

P = (2/3, 1/3) w.p. 3/4; (2/5, 3/5) w.p. 1/4.

Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size (3/5)(2/3) at time 2, and so on. How large, to first order in the exponent, is the piece of cake after n cuts?

Solution: Let C_i be the fraction of the piece of cake that is cut at the i-th cut, and let T_n be the fraction of cake left after n cuts. Then we have T_n = C_1 C_2 ··· C_n = Π_{i=1}^n C_i, where the C_i are i.i.d. with C = 2/3 w.p. 3/4 and C = 3/5 w.p. 1/4 (the largest piece being chosen each time). Hence,

lim (1/n) log T_n = lim (1/n) Σ_{i=1}^n log C_i = E[log C_1] = (3/4) log(2/3) + (1/4) log(3/5),

so T_n ≈ 2^{n E log C_1}, to first order in the exponent.
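The piece-of-cake limit is easy to confirm by simulation. The sketch below (ours, not part of the original solution) draws the largest-piece fraction per cut and compares the empirical exponent (1/n) log₂ T_n against E log₂ C₁; T_n itself underflows for large n, so the logs are accumulated directly:

```python
import math
import random

random.seed(0)

# Largest-piece fraction per cut: 2/3 w.p. 3/4, 3/5 w.p. 1/4
ElogC = 0.75 * math.log2(2 / 3) + 0.25 * math.log2(3 / 5)

def log2_piece_after(n):
    """Accumulate log2(T_n) to avoid floating-point underflow of T_n."""
    s = 0.0
    for _ in range(n):
        s += math.log2(2 / 3) if random.random() < 0.75 else math.log2(3 / 5)
    return s

n = 20000
emp = log2_piece_after(n) / n   # empirical (1/n) log2 T_n
```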

So for all n > max(N1 . Let X1 . (c) Show |An ∩ B n | ≤ 2n(H + ) . for x n ∈ An .2 in the text. .52 The Asymptotic Equipartition Property 4. . by the AEP for discrete random variables the probability X n is typical goes to 1. there exists N such that P r {X n ∈ 1 An ∩ B n } ≥ 2 for all n > N . 1 n ≤ p(xn ) ≤ 2−n(H − ) . Sets deﬁned by probabilities. . X2 . therefore P r (X n ∈ An ∩ B n ) → 1 . sequence of discrete random variables with entropy H (X ). x ∈ {1. From Theorem 3. by the Strong Law of Large Numbers P r (X n ∈ B n ) → 1 . (b) Does Pr{X n ∈ An ∩ B n } −→ 1 ? (a) Does Pr{X n ∈ An } −→ 1 ? 1 n(H − ) (d) Show |An ∩ B n | ≥ ( 2 )2 . Multiplying through by 2 n(H − ) gives xn ∈An ∩B n 2 1 n(H − ) n the result |A ∩ B n | ≥ ( 2 )2 for n suﬃciently large. N2 ) : P r (X n ∈ An ∩ B n ) = P r (X n ∈ An ) + P r (X n ∈ B n ) − P r (X n ∈ An ∪ B n ) = 1− > 1− 2 +1− 2 −1 (d) Since from (b) P r {X n ∈ An ∩ B n } → 1 . 2. Let B n = {xn ∈ X n : | n An = {xn ∈ X n : | − n i=1 Xi − µ| ≤ } . m} .i. Let n 1 1 log p(xn ) − H | ≤ } . So there exists > 0 and N1 such that P r (X n ∈ An ) > 1 − 2 for all n > N1 .2 in the text. p(xn ) ≥ 2−n(H + ) . So for any > 0 there exists N = max(N1 . AEP Let Xi be iid ∼ p(x). 5.1. . N2 ) such that P r (X n ∈ An ∩ B n ) > 1 − for all n > N .d. (b) For what values of t does P ({X n ∈ Cn (t)}) → 1? Solution: (a) Show |Cn (t)| ≤ 2nt . (b) Yes. Solution: (a) Yes. for all n . and there exists N2 such that P r (X n ∈ B n ) > 1 − 2 for all n > N2 . . . Combining these two equations gives 1 ≥ xn ∈An ∩B n p(xn ) ≥ xn ∈An ∩B n 2−n(H + ) = |An ∩ B n |2−n(H + ) . for n suﬃciently large. and H = − p(x) log p(x). Let µ = EX. Multiplying through by 2n(H + ) gives the result |An ∩ B n | ≤ 2n(H + ) . . . So combining these two gives 2 xn ∈An ∩B n p(x ) ≤ −n(H − ) = |An ∩ B n |2−n(H − ) . from Theorem 3. for x ∈ A . Let Cn (t) = {xn ∈ X n : p(xn ) ≥ 2−nt } denote the subset of n -sequences with probabilities ≥ 2 −nt . Also.1. be an i. 
(c) By the law of total probability, Σ_{x^n ∈ A^n ∩ B^n} p(x^n) ≤ 1.

6. The AEP and source coding. A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995. The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three or fewer ones.

(a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer ones.
(b) Calculate the probability of observing a source sequence for which no codeword has been assigned.
(c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).

Solution: The AEP and source coding.

(a) The number of 100-bit binary sequences with three or fewer ones is

C(100,0) + C(100,1) + C(100,2) + C(100,3) = 1 + 100 + 4950 + 161700 = 166751.

The required codeword length is ⌈log₂ 166751⌉ = 18. (Note that H(0.005) = 0.0454, so 18 is quite a bit larger than the 4.5 bits of entropy.)

(b) The probability that a 100-bit sequence has three or fewer ones is

Σ_{i=0}^{3} C(100,i) (0.005)^i (0.995)^{100−i} = 0.60577 + 0.30441 + 0.07572 + 0.01243 = 0.99833.

Thus the probability that the sequence that is generated cannot be encoded is 1 − 0.99833 = 0.00167.

7. An AEP-like limit. Let X_1, X_2, ... be i.i.d. drawn according to probability mass function p(x). Find

lim_{n→∞} [ p(X_1, X_2, ..., X_n) ]^{1/n}.

Solution: An AEP-like limit. X_1, X_2, ... are i.i.d. ∼ p(x). Hence the log p(X_i) are also i.i.d., and

lim [p(X_1, ..., X_n)]^{1/n} = lim 2^{(1/n) log p(X_1, ..., X_n)} = 2^{lim (1/n) Σ log p(X_i)} = 2^{E log p(X)} = 2^{−H(X)} a.e.,

by the strong law of large numbers (assuming of course that H(X) exists).
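The binomial arithmetic in parts (a) and (b), and the Chebyshev bound of part (c), can be reproduced directly (a sketch; variable names are ours):

```python
import math
from math import comb

# (a) number of sequences with <= 3 ones, and the fixed codeword length
count = sum(comb(100, i) for i in range(4))          # 166751
length = math.ceil(math.log2(count))                 # 18 bits

# (b) probability that a sequence has more than 3 ones (no codeword)
p = 0.005
p_covered = sum(comb(100, i) * p ** i * (1 - p) ** (100 - i) for i in range(4))
p_uncoded = 1 - p_covered                            # ~0.00167

# (c) Chebyshev bound with eps = 3.5 (S_100 >= 4 iff |S_100 - 0.5| >= 3.5)
bound = 100 * p * (1 - p) / 3.5 ** 2                 # ~0.0406
```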

(c) In the case of a random variable S_n that is the sum of n i.i.d. random variables X_1, X_2, ..., X_n, Chebyshev's inequality states that

Pr(|S_n − nµ| ≥ ε) ≤ nσ²/ε²,

where µ and σ² are the mean and variance of X_i. (Therefore nµ and nσ² are the mean and variance of S_n.) In this problem, n = 100, µ = 0.005, and σ² = (0.005)(0.995). Note that S_100 ≥ 4 if and only if |S_100 − 100(0.005)| ≥ 3.5, so we should choose ε = 3.5. Then

Pr(S_100 ≥ 4) ≤ 100(0.005)(0.995)/(3.5)² ≈ 0.04061.

This bound is much larger than the actual probability 0.00167.

8. Products. Let

X = 1 w.p. 1/2; 2 w.p. 1/4; 3 w.p. 1/4.

Let X_1, X_2, ... be drawn i.i.d. according to this distribution. Find the limiting behavior of the product

(X_1 X_2 ··· X_n)^{1/n}.

Solution: Products. Let P_n = (X_1 X_2 ... X_n)^{1/n}. Then

log P_n = (1/n) Σ_{i=1}^n log X_i → E log X with probability 1,

by the strong law of large numbers. Thus P_n → 2^{E log X} with prob. 1. We can easily calculate

E log X = (1/2) log 1 + (1/4) log 2 + (1/4) log 3 = (1/4) log 6,

and therefore P_n → 2^{(1/4) log 6} = 1.565.

9. AEP. Let X_1, X_2, ... be independent identically distributed random variables drawn according to the probability mass function p(x), x ∈ {1, 2, ..., m}. Thus p(x_1, x_2, ..., x_n) = Π_{i=1}^n p(x_i). We know that −(1/n) log p(X_1, X_2, ..., X_n) → H(X) in probability. Let q(x_1, x_2, ..., x_n) = Π_{i=1}^n q(x_i), where q is another probability mass function on {1, 2, ..., m}.

(a) Evaluate lim −(1/n) log q(X_1, X_2, ..., X_n), where X_1, X_2, ... are i.i.d. ∼ p(x).
(b) Now evaluate the limit of the log likelihood ratio (1/n) log [ q(X_1, ..., X_n) / p(X_1, ..., X_n) ] when X_1, X_2, ... are i.i.d. ∼ p(x). Thus the odds favoring q are exponentially small when p is true.
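The Products limit 2^{(1/4) log 6} = 6^{1/4} ≈ 1.565 can be confirmed by simulation (a sketch, ours; logs are accumulated so the product never overflows):

```python
import math
import random

random.seed(1)

# E log2 X for X = 1, 2, 3 w.p. 1/2, 1/4, 1/4
ElogX = 0.5 * math.log2(1) + 0.25 * math.log2(2) + 0.25 * math.log2(3)
limit = 2 ** ElogX                      # = 6**0.25, about 1.565

n = 50000
log_prod = sum(math.log2(random.choices([1, 2, 3], [0.5, 0.25, 0.25])[0])
               for _ in range(n))
geo_mean = 2 ** (log_prod / n)          # empirical (X1...Xn)^{1/n}
```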

Solution: (AEP).

(a) Since the X_1, X_2, ..., X_n are i.i.d., so are q(X_1), q(X_2), ..., q(X_n), and hence we can apply the strong law of large numbers to obtain

lim −(1/n) log q(X_1, X_2, ..., X_n) = lim −(1/n) Σ log q(X_i)
= −E(log q(X)) w.p. 1
= −Σ_x p(x) log q(x)
= Σ_x p(x) log [p(x)/q(x)] − Σ_x p(x) log p(x)
= D(p||q) + H(p).

(b) Again, by the strong law of large numbers,

lim (1/n) log [ q(X_1, ..., X_n) / p(X_1, ..., X_n) ] = lim (1/n) Σ log [ q(X_i)/p(X_i) ]
= E log [ q(X)/p(X) ] w.p. 1
= Σ_x p(x) log [ q(x)/p(x) ]
= −D(p||q).

10. Random box size. An n-dimensional rectangular box with sides X_1, X_2, ..., X_n is to be constructed. The volume is V_n = Π_{i=1}^n X_i. The edge length l of an n-cube with the same volume as the random box is l = V_n^{1/n}. Let X_1, X_2, ... be i.i.d. uniform random variables over the unit interval [0, 1]. Find lim_{n→∞} V_n^{1/n}, and compare to (E V_n)^{1/n}. Clearly the expected edge length does not capture the idea of the volume of the box. The geometric mean, rather than the arithmetic mean, characterizes the behavior of products.

Solution: Random box size. The volume V_n = Π_{i=1}^n X_i is a random variable, and V_n tends to 0 as n → ∞, since the X_i are uniformly distributed on [0, 1]. However,

log_e V_n^{1/n} = (1/n) log_e V_n = (1/n) Σ_{i=1}^n log_e X_i → E(log_e X) a.e.,

by the Strong Law of Large Numbers, since the X_i (and hence the log_e X_i) are i.i.d. and E|log_e X| < ∞. Now

E(log_e X_i) = ∫_0^1 log_e x dx = −1.

Hence, since e^x is a continuous function,

lim_{n→∞} V_n^{1/n} = e^{lim (1/n) log_e V_n} = e^{−1} < 1/2.
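The contrast between the geometric-mean limit 1/e and the arithmetic-mean edge length (E V_n)^{1/n} = 1/2 shows up clearly in a Monte Carlo run (a sketch, ours; the sum of logs avoids the underflow of V_n itself):

```python
import math
import random

random.seed(2)

n = 100000
log_V = sum(math.log(random.random()) for _ in range(n))
edge = math.exp(log_V / n)   # V_n^{1/n}, which should approach 1/e

# For comparison: E(V_n) = (1/2)^n, so (E V_n)^{1/n} = 1/2 exactly.
arith_edge = 0.5
```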

1. (a) Given any two sets A .17) (3. and 1 is the geometric mean. . This problem shows that the size of the smallest “probable” (n) set is about 2nH . .29) (3. we have the following chain of inequalities 1− −δ (a) (b) ≤ Pr(A(n) ∩ Bδ ) p(xn ) A (n) (n) ∩Bδ (n) (3.3.20) (3. (n) (n) (3. (n) (3. Since P (A) ≥ 1 − 1. Let Bδ ⊂ X n such that (n) 1 Pr(Bδ ) > 1 − δ .1.23) (3.27) = (c) ≤ 2−n(H − A (n) (n) ∩Bδ ) (3.d. e 11. Xn be i. Fix < 2 .56 The Asymptotic Equipartition Property Thus the “eﬀective” edge length of this solid is e −1 .28) ) (d) = ≤ (e) |A(n) ∩ Bδ |2−n(H − |Bδ |2−n(H − ) . Also 1 independent.18) ) ≤ 2−n(H − A (n) (n) ∩Bδ (3.25) ≥ 1 − P (A ) − P (B ) ≥ 1− 1 c − 2. .22) 2. ∼ p(x) . (b) To complete the proof.26) (3.19) ) = |A(n) ∩ Bδ |2−n(H − ≤ (c) Complete the proof of the theorem. . P (B c ) ≤ Hence (3. Note that since the Xi ’s are 1 n ) . Let X1 . P (A ∩ B ) = 1 − P (Ac ∪ B c ) c Similarly. show (n) (3. E (Vn ) = E (Xi ) = ( 2 2 is the arithmetic mean of the random variable.30) . X2 . (b) Justify the steps in the chain of inequalities 1 − − δ ≤ Pr(A(n) ∩ Bδ ) = p(x ) n A (n) (n) ∩Bδ 2. Solution: Proof of Theorem 3.i. (a) Let Ac denote the complement of A .24) (3. Hence Pr(A ∩ Bδ ) ≥ 1 − − δ.21) (3. Then P (Ac ) ≤ P (Ac ∪ B c ) ≤ P (Ac ) + P (B c ). B such that Pr(A) > 1 − 1 and Pr(B ) > 1 − (n) (n) that Pr(A ∩ B ) > 1 − 1 − 2 .3. (n) |Bδ |2−n(H − ) . 1. Proof of Theorem 3.

2 2 Taking expectations and using the fact the X i ’s are identically distributed we get. 1 1 1 1 D (ˆ p2n ||p) = D ( p ˆn + p ˆn || p + p) 2 2 2 2 1 1 ≤ D (ˆ pn ||p) + D (ˆ pn ||p). and (e) from the fact that A ∩ Bδ ⊆ Bδ . 12. 1 n p ˆn (x) = I (Xi = x) n i=1 is the proportion of times that Xi = x in the ﬁrst n samples. (b) follows by deﬁnition of probability of a set. (d) from the deﬁnition of |A ∩ Bδ | as the cardinality (n) (n) (n) (n) (n) of the set A ∩ Bδ .The Asymptotic Equipartition Property 57 where (a) follows from the previous part. . (a) Show for X binary that ED (ˆ p2n p) ≤ ED (ˆ pn p). X2 . 1 1 Hint: Write p ˆ2n = 2 p ˆn + 2 p ˆn and use the convexity of D . Speciﬁcally. . ED (ˆ p2n ||p) ≤ ED (ˆ pn ||p). Monotonic convergence of the empirical distribution. Xn i. x ∈ X . Solution: Monotonic convergence of the empirical distribution. (a) Note that. ∼ p(x). .d. (b) Show for an arbitrary discrete X that ED (ˆ pn p) ≤ ED (ˆ pn−1 p).i. where I is the indicator function. Let p ˆn denote the empirical probability mass function corresponding to X 1 . 2 2 n Using convexity of D (p||q ) we have that. p ˆ2n (x) = = = 1 2n 2n I (Xi = x) i=1 n 11 1 1 2n I (Xi = x) + I (Xi = x) 2 n i=1 2 n i=n+1 1 1 p ˆn (x) + p ˆ (x). . (c) follows from the fact that the probability of elements of the typical set are (n) (n) bounded by 2−n(H − ) . Thus the expected relative entropy “distance” from the empirical distribution to the true distribution decreases with sample size. Hint: Write p ˆn as the average of n empirical mass functions with each of the n samples deleted in turn. .

where the probability that Xi = 1 is 0.6 (and therefore the probability that X i = 0 is 0. Calculation of typical set To clarify the notion of a typical set A (n) set of high probability Bδ . we obtain the ﬁnal result. X 1 .i. n where j ranges from 1 to n . binary random variables. . . . we will calculate the set for a simple example.58 The Asymptotic Equipartition Property (b) The trick to this part is similar to part a) and involves rewriting p ˆn in terms of p ˆn−1 . 1 n j p ˆ p ˆn = n j =1 n−1 where. which sequences fall in the typical set A ? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with k 1’s. X2 . n−1 n j np ˆn = p ˆ +p ˆn . n j =1 p ˆj n−1 = 1 n−1 I (Xi = x). We see that. i=j Again using the convexity of D (p||q ) and the fact that the D (ˆ pj n−1 ||p) are identically distributed for all j and hence have the same expected value. p ˆn = 1 n I (Xi = x) + i=j I (Xn = x) 1 n−1 I (Xi = x) + n i=0 n I (Xj = x) . (a) Calculate H (X ) .1 . Summing over j we get. Consider a sequence of i.4). n j =1 n−1 or. Xn . and the smallest 13. (b) With n = 25 and = 0. p ˆn = or in general.d.) (n) (n) . and ﬁnding those sequences that are in the typical set. 0 ≤ k ≤ 25 . .

[Table: for each k = 0, 1, ..., 25, the columns give the binomial coefficient C(25,k), the probability C(25,k) p^k (1−p)^{25−k} that a sequence has k ones, −(1/n) log p(x^n) for such a sequence (ranging from 1.321928 at k = 0 down to 0.736966 at k = 25), and the cumulative probability F(k).]

(c) How many elements are there in the smallest set that has probability 0.9?
(d) How many elements are there in the intersection of the sets in parts (b) and (c)? What is the probability of this intersection?

Solution:

(a) H(X) = −0.6 log 0.6 − 0.4 log 0.4 = 0.97095 bits.

(b) By definition, A_ε^(n) for ε = 0.1 is the set of sequences such that −(1/n) log p(x^n) lies in the range (H(X) − ε, H(X) + ε), i.e., in the range (0.87095, 1.07095). Examining the last column of the table, it is easy to see that the typical set is the set of all sequences with k, the number of ones, lying between 11 and 19. The probability of the typical set can be calculated from the cumulative probability column. The probability that the number of 1's lies between 11 and 19 is equal to F(19) − F(10) = 0.970638 − 0.034392 = 0.936246. Note that this is greater than 1 − ε, i.e., n is large enough for the probability of the typical set to be greater than 1 − ε.

The number of elements in the typical set can be found using the third column:

|A_ε^(n)| = Σ_{k=11}^{19} C(25,k) = Σ_{k=0}^{19} C(25,k) − Σ_{k=0}^{10} C(25,k) = 33486026 − 7119516 = 26366510.   (3.31)

Note that the upper and lower bounds for the size of A_ε^(n) can be calculated as 2^{n(H+ε)} = 2^{25(0.97095+0.1)} = 2^{26.77} = 1.147365 × 10^8, and (1 − ε) 2^{n(H−ε)} = 0.9 × 2^{25(0.97095−0.1)} = 0.9 × 2^{21.77} ≈ 3.2 × 10^6. Both bounds are very loose!

(c) To find the smallest set B_δ^(n) of probability 0.9, we can imagine that we are filling a bag with pieces such that we want to reach a certain weight with the minimum number of pieces. To minimize the number of pieces that we use, we should use the largest possible pieces. Looking at the fourth column of the table, it is clear that the probability of a sequence increases monotonically with k. Thus we keep putting the high probability sequences into this set until we reach a total probability of 0.9, i.e., the set consists of sequences with k = 25, 24, ..., until we have a total probability 0.9.

Using the cumulative probability column, it follows that the set B_δ^(n) consists of sequences with k ≥ 13 and some sequences with k = 12. The sequences with k ≥ 13 provide a total probability of 1 − 0.153768 = 0.846232 to the set B_δ^(n). The remaining probability of 0.9 − 0.846232 = 0.053768 should come from sequences with k = 12. The number of such sequences needed to fill this probability is at least 0.053768/p(x^n) = 0.053768/1.460813 × 10^{−8} = 3680690.1, which we round up to 3680691. Thus the smallest set with probability 0.9 has 33554432 − 16777216 + 3680691 = 20457907 sequences. Note that the set B_δ^(n) is not uniquely defined — it could include any 3680691 sequences with k = 12. However, the size of the smallest set is a well defined number.

(d) The intersection of the sets A_ε^(n) and B_δ^(n) in parts (b) and (c) consists of all sequences with k between 13 and 19, and 3680691 sequences with k = 12. The probability of this intersection is (0.970638 − 0.153768) + 0.053768 = 0.870638, and the size of this intersection is 33486026 − 16777216 + 3680691 = 20389501.

Chapter 4

Entropy Rates of a Stochastic Process

1. Doubly stochastic matrices. An n × n matrix P = [P_ij] is said to be doubly stochastic if P_ij ≥ 0, Σ_j P_ij = 1 for all i, and Σ_i P_ij = 1 for all j. An n × n matrix P is said to be a permutation matrix if it is doubly stochastic and there is precisely one P_ij = 1 in each row and each column. It can be shown that every doubly stochastic matrix can be written as the convex combination of permutation matrices.

(a) Let a^t = (a_1, a_2, ..., a_n), a_i ≥ 0, Σ a_i = 1, be a probability vector. Let b = aP, where P is doubly stochastic. Show that b is a probability vector and that H(b_1, b_2, ..., b_n) ≥ H(a_1, a_2, ..., a_n). Thus stochastic mixing increases entropy.
(b) Show that a stationary distribution µ for a doubly stochastic matrix P is the uniform distribution.
(c) Conversely, prove that if the uniform distribution is a stationary distribution for a Markov transition matrix P, then P is doubly stochastic.

Solution: Doubly stochastic matrices.

(a) H(b) − H(a) = −Σ_j b_j log b_j + Σ_i a_i log a_i
= −Σ_j (Σ_i a_i P_ij) log(Σ_k a_k P_kj) + Σ_i a_i log a_i
= Σ_{i,j} a_i P_ij log( a_i / Σ_k a_k P_kj )
= Σ_{i,j} a_i P_ij log( a_i P_ij / b_j P_ij )
≥ ( Σ_{i,j} a_i P_ij ) log( Σ_{i,j} a_i P_ij / Σ_{i,j} b_j P_ij )

. Argue that for any distribution on shuﬄes T and any distribution on card positions X that H (T X ) ≥ H (T X |T ) = H (T −1 (4. .7) Pji = 1 or that the matrix is doubly stochastic. .5) (4. one can determine the direction of time by looking at a sample function of the process. . . 3. .13) (4. H (X0 |X−1 . X−1 . X2 .9) follows from stationarity. . Nonetheless. the conditional uncertainty of the next symbol in the future is equal to the conditional uncertainty of the previous symbol in the past. Xn ) − H (X1 . . X2 . then 1 = µi = m or j 1 m . Xn ). Time’s arrow. = H (X |T ) T X |T ) .9) (4.6) where the inequality follows from the log sum inequality. . By the chain rule for entropy. X−n ) − H (X−1 . . the substituting µ i = that it satisﬁes µ = µP . . . . . Solution: Time’s arrow. if X and T are independent. . . m m Entropy Rates of a Stochastic Process (4. . Let {Xi }∞ i=−∞ be a stationary stochastic process. X−n ) = H (X0 . Prove that H (X0 |X−1 . X−n ) = H (X0 |X1 . .11) (4. . j (4. . . the present has a conditional entropy given the past equal to the conditional entropy given the future.12) (4. (c) If the uniform is a stationary distribution. Xn ) where (4. 2. we can easily check µj Pji = j 1 m Pji . X2 . X1 . X−2 .10) = H (X0 .8) (4.62 = 1 log = 0. That is to say. X2 . . Shuﬄes increase entropy.14) = H (X ). . . . Xn ). . . This is true even though it is quite easy to concoct stationary random processes for which the ﬂow into the future looks quite diﬀerent from the ﬂow into the past. In other words. . X−n ) = H (X0 |X1 . . (b) If the matrix is doubly stochastic. . . (4. . given the present state.

Solution: Shuffles increase entropy.

H(TX) ≥ H(TX | T) = H(T^{−1}TX | T) = H(X | T) = H(X).

The inequality follows from the fact that conditioning reduces entropy, the first equality follows from the fact that given T we can reverse the shuffle, and the last equality holds because X and T are independent.

4. Second law of thermodynamics. Let X_1, X_2, X_3, ... be a stationary first-order Markov chain. In Section 4.4, it was shown that H(X_n | X_1) ≥ H(X_{n−1} | X_1) for n = 2, 3, ... Thus conditional uncertainty about the future grows with time. This is true although the unconditional uncertainty H(X_n) remains constant. However, show by example that H(X_n | X_1 = x_1) does not necessarily grow with n for every x_1.

Solution: Second law of thermodynamics.

H(X_n | X_1) ≥ H(X_n | X_1, X_2)   (conditioning reduces entropy)
= H(X_n | X_2)   (by Markovity)
= H(X_{n−1} | X_1)   (by stationarity).

Alternatively, by an application of the data processing inequality to the Markov chain X_1 → X_{n−1} → X_n, we have

I(X_1; X_{n−1}) ≥ I(X_1; X_n).

Expanding the mutual informations in terms of entropies, we have

H(X_{n−1}) − H(X_{n−1} | X_1) ≥ H(X_n) − H(X_n | X_1).

By stationarity, H(X_{n−1}) = H(X_n), and hence we have

H(X_{n−1} | X_1) ≤ H(X_n | X_1).

5. Entropy of a random tree. Consider the following method of generating a random tree with n nodes. First expand the root node:

[tree diagram]

Then expand one of the two terminal nodes at random:

[tree diagram]

choose one of the k − 1 terminal nodes according to a uniform distribution and expand it. but we can ﬁnd the entropy of this distribution in recursive form. from Polya’s urn model. . (n − N1 ) − 1} . . and independently choose another integer N3 uniformly over {1. For n = 2 .64 Entropy Rates of a Stochastic Process At time k . For n = 3 .) Now let Tn denote a random n -node tree generated as described. . . (The equivalence of these two tree generation schemes follows. . The picture is now: ¨¨ ¨ ¨ @ @ @ ¨H HH HH @ @ @ N2 N1 − N 2 N3 n − N 1 − N3 Continue the process until no further subdivision can be made. 2. We then have the picture. . N 1 − 1} . n − 1} . Thus H (T 2 ) = 0 . First some examples. the following method of generating random trees yields the same probability distribution on trees with n terminal nodes. Thus a sequence leading to a ﬁve node tree might look like this: @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ Surprisingly. . 2. @ @ @ N1 n − N1 Then choose an integer N2 uniformly distributed over {1. we have only one tree. Continue until n terminal nodes have been generated. . . The probability distribution on such trees seems diﬃcult to describe. . . for example. 2. we have two equally probable trees: @ @ @ @ @ @ @ @ @ @ @ @ . First choose an integer N 1 uniformly distributed on {1. .

or Hn Hn−1 = + cn . 1/6. Justify each of the steps in the following: H (Tn ) (a) = H (N1 .Entropy Rates of a Stochastic Process 65 Thus H (T3 ) = log 2 . .29) (4. the left subtree and the right subtree are chosen independently.31) (4. n − 1} .26) (4. Thus the expected number of bits necessary to describe the random tree Tn grows linearly with n .34) 1 n−1 H (Tn |N1 = k ) n − 1 k=1 by the deﬁnition of conditional entropy. (b) H (Tn . with probabilities 1/3. Since cn = c < ∞ .30) (b) = (c) = (d) = (e) = = (f) Use this to show that (n − 1)Hn = nHn−1 + (n − 1) log(n − 1) − (n − 2) log(n − 2). Hk . Let N 1 (Tn ) denote the number of terminal nodes of Tn in the right half of the tree. . Tn−k |N1 = . (d) (c) H (N1 ) = log(n − 1) since N1 is uniform on {1.27) [H (Tk ) + H (Tn−k )] H (Tk ). Tn ) H (N1 ) + H (Tn |N1 ) log(n − 1) + H (Tn |N1 ) log(n − 1) + log(n − 1) + log(n − 1) + 1 n−1 2 n−1 2 n−1 n−1 k =1 n−1 k =1 n−1 k =1 (4. 1/6. N1 ) = H (Tn ) + H (N1 |Tn ) = H (Tn ) + 0 by the chain rule for entropies and since N1 is a function of Tn . 1/6. (a) H (Tn . 1/6. N1 ) = H (N1 ) + H (Tn |N1 ) by the chain rule for entropies. we have ﬁve possible trees. For n = 4 . 2. . Solution: Entropy of a random tree. Now for the recurrence relation. you have proved that n H (Tn ) converges to a constant.32) 1 for appropriately deﬁned cn . n−1 k =1 H (Tn |N1 ) = = P (N1 = k )H (Tn |N1 = k ) (4. n n−1 (4. . (4. Since conditional on N 1 . H (T n |N1 = k ) = H (Tk .25) (4.28) (4.33) (4.

43) (4.45) (4. j (j − 1) n j =3 .37) (4.41) (4. so H (Tn |N1 ) = (e) By a simple change of variables.38) (4.42) Substituting the equation for Hn−1 in the equation for Hn and proceeding recursively. we get (n − 1)Hn − (n − 2)Hn−1 = (n − 1) log(n − 1) − (n − 2) log(n − 2) + 2H n−1 (4. n − 1 k=1 (4. (n − 1)Hn = (n − 1) log(n − 1) + 2 (n − 2)Hn−1 = (n − 2) log(n − 2) + 2 n−1 k =1 n−2 k =1 Hk Hk (4.39) Subtracting the second equation from the ﬁrst. we obtain a telescoping sum Hn n n = j =3 n Cj + H2 2 (4. n−1 k =1 Entropy Rates of a Stochastic Process 1 n−1 (H (Tk ) + H (Tn−k )) .44) Hn−1 log(n − 1) (n − 2) log(n − 2) + − n−1 n n(n − 1) Hn−1 + Cn n−1 (4.35) H (Tn−k ) = n−1 k =1 H (Tk ).40) or Hn n = = where Cn = = log(n − 1) (n − 2) log(n − 2) − n n(n − 1) log(n − 1) log(n − 2) 2 log(n − 2) − + n (n − 1) n(n − 1) (4. (4.36) (f) Hence if we let Hn = H (Tn ) .46) = 1 2 log(j − 2) + log(n − 1).66 k ) = H (Tk ) + H (Tn−k ) .

49) is dominated by the −3 sum j j 2 which converges. X2 . 6. . X2 . . . show that (a) H (X1 . n n−1 (b) H (X1 .Entropy Rates of a Stochastic Process Since limn→∞ 1 n 67 log(n − 1) = 0 Hn n→∞ n lim = ≤ = 2 log j − 2 j (j − 1) j =3 2 log(j − 1) (j − 1)2 j =3 2 log j j2 j =2 ∞ ∞ ∞ (4. H (X1 . Xn−1 ) H (X1 . n Solution: Monotonicity of entropy per element.53) (4. For a stationary stochastic process X 1 . . . (a) By the chain rule for entropy. . .55) n H (Xn |X n−1 ) + H (X1 . . Xn ) n = = = n i−1 ) i=1 H (Xi |X n−1 i−1 ) i=1 H (Xi |X (4. . . Xn ) ≥ H (Xn |Xn−1 . . . .54) (4. X2 . .50) Thus the number of bits required to describe a random n -node tree grows linearly with n. . In fact. . n j (j − 1) j =3 (4.48) (4. Xn . . . H (Xn |X n−1 ) ≤ H (Xi |X i−1 ). Xn−1 ) . X2 . . X1 ). . n From stationarity it follows that for all 1 ≤ i ≤ n . Xn ) ≤ . . . . X2 .736 bits. . X2 . . . . .49) √ For suﬃciently large j . .52) n H (Xn |X n−1 ) + (4. computer evaluation shows that lim ∞ Hn 2 = log(j − 2) = 1.51) (4. Hence the above sum converges. Monotonicity of entropy per element. log j ≤ j and hence the sum in (4.47) (4.
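The recurrence from part (f), (n − 1)H_n = nH_{n−1} + (n − 1) log(n − 1) − (n − 2) log(n − 2) with H_2 = 0, can be iterated numerically to watch H_n/n approach its limit of about 1.736 bits. A sketch (ours, not part of the original solution), with logs in base 2:

```python
import math

def tree_entropy(n_target):
    """H(T_n) in bits via the recurrence
    (n-1) H_n = n H_{n-1} + (n-1) log2(n-1) - (n-2) log2(n-2), H_2 = 0."""
    h = 0.0  # H_2
    for n in range(3, n_target + 1):
        h = (n * h + (n - 1) * math.log2(n - 1)
             - (n - 2) * math.log2(n - 2)) / (n - 1)
    return h

rate = tree_entropy(100000) / 100000   # H_n / n, slowly approaching ~1.736
```

As a cross-check, H_3 = log 2 = 1 bit, and H_4 matches the direct entropy of the five 4-node trees with probabilities (1/3, 1/6, 1/6, 1/6, 1/6).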

which implies that. X2 . We expect that the maximizing value of p should be less than 1/2 . . .60) (4.59) n−1 H (Xn |X n−1 ) ≤ H (Xi |X i−1 ). Find N (t) and calculate 1 log N (t) . X2 . n−1 n−1 i=1 (4. H (Xn |X n−1 ) = ≤ = 7. Xn−1 ) + H (X1 . X2 . Xn ) . Xn ) n ≤ = H (Xi |X i−1 ) n−1 H (X1 . . . . .62) (a) Find the entropy rate of the two-state Markov chain with transition matrix P = 1 − p01 p01 p10 1 − p10 . (e) Let N (t) be the number of allowable state sequences of length t for the Markov chain of part (c). H (Xn |X n−1 ) ≤ = Combining (4. . . X2 . (4. t→∞ t Hint: Find a linear recurrence that expresses N (t) in terms of N (t − 1) and N (t − 2) . H (Xn |X n−1 ) n n i−1 ) i=1 H (Xi |X n H (X1 .57) (b) By stationarity we have for all 1 ≤ i ≤ n . .57) yields. . . that. Entropy rates of Markov chains. . . . (d) Find the maximum value of the entropy rate of the Markov chain of part (c).56) (4. since the 0 state permits more information to be generated than the 1 state.55) and (4. .68 Entropy Rates of a Stochastic Process which further implies. Why is H0 an upper bound on the entropy rate of the Markov chain? Compare H0 with the maximum entropy found in part (d). H0 = lim .58) ) n n−1 H (X1 . . (b) What values of p01 . by averaging both sides. . 1 H (X1 . n n i=1 (4. . . . . X2 . p10 maximize the rate of part (a)? (c) Find the entropy rate of the two-state Markov chain with transition matrix P = 1−p p 1 0 . X2 . Xn−1 (4.61) (4. . . Xn−1 ) . Xn−1 ) . . H (X1 .

Solution: Entropy rates of Markov chains.

(a) The stationary distribution is easily calculated:

    µ0 = p10/(p01 + p10),    µ1 = p01/(p01 + p10).

Therefore the entropy rate is

    H(X2|X1) = µ0 H(p01) + µ1 H(p10) = [ p10 H(p01) + p01 H(p10) ] / (p01 + p10).

(b) The entropy rate is at most 1 bit because the process has only two states. This rate can be achieved if (and only if) p01 = p10 = 1/2, in which case the process is actually i.i.d. with Pr(Xi = 0) = Pr(Xi = 1) = 1/2.

(c) As a special case of the general two-state Markov chain, the entropy rate is

    H(X2|X1) = µ0 H(p) + µ1 H(1) = H(p)/(p + 1).

(d) By straightforward calculus, we find that the maximum value of H(X) of part (c) occurs for p = (3 - √5)/2 = 0.382. The maximum value is

    H(p)/(p + 1) = 0.694 bits.

Note that H(p) = H(1 - p) = H((√5 - 1)/2), and that (√5 - 1)/2 = 0.618 is (the reciprocal of) the Golden Ratio.

(e) The Markov chain of part (c) forbids consecutive ones. Consider any allowable sequence of symbols of length t. If the first symbol is 1, then the next symbol must be 0; the remaining N(t - 2) symbols can form any allowable sequence. If the first symbol is 0, then the remaining N(t - 1) symbols can be any allowable sequence. So the number of allowable sequences of length t satisfies the recurrence

    N(t) = N(t - 1) + N(t - 2),    N(1) = 2,  N(2) = 3.

(The initial conditions are obtained by observing that for t = 2 only the sequence 11 is not allowed. We could also choose N(0) = 1 as an initial condition, since there is exactly one allowable sequence of length 0, namely the empty sequence.) The sequence {N(t)} is the sequence of Fibonacci numbers, and it grows exponentially:

    N(t) ≈ c λ^t,

where λ is the maximum magnitude solution of the characteristic equation

    1 = z^{-1} + z^{-2}.

Solving the characteristic equation yields λ = (1 + √5)/2, the Golden Ratio. (See EIT pp. 62–63.) Therefore

    H0 = lim_{t→∞} (1/t) log N(t) = log (1 + √5)/2 = 0.694 bits.

Since there are only N(t) possible outcomes for X1, ..., Xt, log N(t) is an upper bound on H(X1, ..., Xt), and so the entropy rate of the Markov chain of part (c) is at most H0. In fact, we saw in part (d) that this upper bound can be achieved.
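Parts (c)–(e) lend themselves to a quick numerical check. The sketch below (pure Python; my own illustration, not part of the original solution) iterates the recurrence N(t) = N(t-1) + N(t-2) and grid-searches the rate H(p)/(p+1) of part (c); both quantities come out to log2((1+√5)/2) ≈ 0.694 bits.

```python
import math

def H(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Part (e): N(t) = N(t-1) + N(t-2) with N(1) = 2, N(2) = 3 (Fibonacci numbers).
a, b = 2, 3
for t in range(3, 201):
    a, b = b, a + b                      # after the loop, b = N(200)
H0 = math.log2(b) / 200                  # (1/t) log2 N(t) at t = 200

# Part (d): maximize the entropy rate H(p)/(p+1) of part (c) over a fine grid.
best_p, best_rate = max(
    ((p / 10**4, H(p / 10**4) / (p / 10**4 + 1)) for p in range(1, 10**4)),
    key=lambda x: x[1])

golden = (1 + math.sqrt(5)) / 2
print(H0, best_p, best_rate, math.log2(golden))
```

The grid maximizer lands at p ≈ 0.382 = (3 − √5)/2, and both H0 and the maximum rate agree with log2 of the Golden Ratio, as the solution asserts.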

8. Initial conditions. Show, for a Markov chain, that

    H(X0|Xn) ≥ H(X0|X_{n-1}).

Thus initial conditions X0 become more difficult to recover as the future Xn unfolds.

Solution: Initial conditions. For a Markov chain, by the data processing theorem, we have

    I(X0; X_{n-1}) ≥ I(X0; Xn).                                           (4.63)

Therefore

    H(X0) - H(X0|X_{n-1}) ≥ H(X0) - H(X0|Xn),                             (4.64)

or H(X0|Xn) increases with n.

9. Maximum entropy process. A discrete memoryless source has alphabet {1, 2}, where the symbol 1 has duration 1 and the symbol 2 has duration 2. The probabilities of 1 and 2 are p1 and p2, respectively. Find the value of p1 that maximizes the source entropy per unit time H(X)/E[l(X)]. What is the maximum value H?

Solution: Maximum entropy process. The entropy per symbol of the source is

    H(p1) = -p1 log p1 - (1 - p1) log(1 - p1),

and the average symbol duration (or time per symbol) is

    T(p1) = 1·p1 + 2·p2 = p1 + 2(1 - p1) = 2 - p1 = 1 + p2.

Therefore the source entropy per unit time is

    f(p1) = H(p1)/T(p1) = [ -p1 log p1 - (1 - p1) log(1 - p1) ] / (2 - p1).

Since f(0) = f(1) = 0, the maximum value of f(p1) must occur for some point p1 such that 0 < p1 < 1 and ∂f/∂p1 = 0:

    ∂/∂p1 [ H(p1)/T(p1) ] = [ T (∂H/∂p1) - H (∂T/∂p1) ] / T².

After some calculus, we find that the numerator of the above expression (assuming natural logarithms) is

    T (∂H/∂p1) - H (∂T/∂p1) = ln(1 - p1) - 2 ln p1,

which is zero when 1 - p1 = p1² = p2, that is,

    p1 = (√5 - 1)/2 = 0.61803,

the reciprocal of the golden ratio, (√5 + 1)/2 = 1.61803. The corresponding entropy per unit time is

    H(p1)/T(p1) = [ -p1 log p1 - p1² log p1² ] / (1 + p1²)
                = [ -(1 + p1²) log p1 ] / (1 + p1²) = -log p1 = 0.69424 bits,

using p1 + 2p1² = 1 + p1², which holds because p1² + p1 = 1. Note that this result is the same as the maximum entropy rate for the Markov chain in problem #4(d) of homework #4. This is because a source in which every 1 must be followed by a 0 is equivalent to a source in which the symbol 1 has duration 2 and the symbol 0 has duration 1.
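The stationary point found above can be sanity-checked numerically (a sketch of my own, not from the original text): p1 = (√5 − 1)/2 solves 1 − p1 = p1², and the resulting rate −log2 p1 ≈ 0.69424 bits beats every other p1 on a fine grid.

```python
import math

def rate(p1):
    """Source entropy per unit time H(p1)/T(p1), in bits."""
    H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
    return H / (2 - p1)                  # T(p1) = p1 + 2*(1 - p1) = 2 - p1

p_star = (math.sqrt(5) - 1) / 2          # root of 1 - p = p^2
max_rate = rate(p_star)
print(p_star, max_rate, -math.log2(p_star))

# p_star is (numerically) the global maximizer of rate(.) on (0, 1).
assert all(rate(k / 10**5) <= max_rate + 1e-12 for k in range(1, 10**5))
```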

72) 11 22 = P (X1 = 1)P (Xn = 1) and similarly for other possible values of the pair X 1 . . . 1 (4.69) (4.71) (4. .76) = n − 1. . . Clearly this is true for k = 1 . . Assume that it is true for k − 1 . We will ﬁrst prove that for any k ≤ n − 1 . for i = j . the probability that S k is odd is equal to the probability that it is even. . . Let Xn = 1 if otherwise. the probability that i=1 Xi is odd is 1/2 . . . . X1 )(4. We will prove this by induction.65) 11 11 + (4. n−1 = P (X1 = 1)P ( = Xi i=2 n−1 i=2 even) Xi even) (4. . . Xn . Hence. Pairwise independence. . . H (Xi . Taking i = 1 without loss of generality. X2 . 1} .67) = 2 Hence for all k ≤ n − 1 . n} . Hence X1 and Xn are independent. Xn−1 are i. . . random variables taking values n−1 in {0. Xn ) . . Xn1 .68) P (Xn = 1) = P (Xn = 0) = .75) (4. . Xj ) = H (Xi ) + H (Xj ) = 1 + 1 = 2 bits. .i. (c) By the chain rule and the independence of X 1 .66) = 22 22 1 . Xj ) .Entropy Rates of a Stochastic Process 71 10. 2 (a) It is clear that when i and j are both less than n . . P (X1 = 1. Xn = 1) = P (X1 = 1. . X2 . Let n ≥ 3 . (a) Show that Xi and Xj are independent. . Let Sk = k i=1 Xi . for i = j .70) (4. .73) H (Xi ) + 0 (4. j ∈ {1. with Pr{Xi = 1} = 1 i=1 Xi is odd and Xn = 0 2 . X2 . (4.i. i.d. (b) Since Xi and Xj are independent and uniformly distributed on {0.d. . 1} . . . X2 . X i and Xj are independent. (b) Find H (Xi .74) = n−1 i=1 (4. Is this equal to nH (X1 ) ? Solution: (Pairwise Independence) X 1 . Xn−1 be i. X2 . X2 . (c) Find H (X1 . . Xn ) = H (X1 . . Bernoulli(1/2) random k variables. Let X1 . . we have H (X1 . 2. . . Then P (Sk odd) = P (Sk−1 odd)P (Xk = 0) + P (Sk−1 even)P (Xk = 1) (4. . The only possible problem is when j = n . Xn−1 ) + H (Xn |Xn−1 .
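These claims are easy to verify exhaustively for small n. In the sketch below (my own illustration, n = 4), X1, X2, X3 are fair independent bits and X4 is set so that the sum is even, i.e. X4 = X1 ⊕ X2 ⊕ X3; every pair (Xi, Xj) is uniform on {0,1}², yet the joint entropy is n − 1 = 3 bits rather than nH(X1) = 4.

```python
from itertools import product
from math import log2

# The 8 equally likely outcomes: three free fair bits plus their parity bit.
outcomes = [(x1, x2, x3, x1 ^ x2 ^ x3)
            for x1, x2, x3 in product((0, 1), repeat=3)]

# (a) Pairwise independence: every pair (Xi, Xj), i != j, is uniform on
# {0,1}^2, so each of the 4 value combinations occurs in exactly 8/4 = 2 tuples.
pairwise = all(
    sum(1 for o in outcomes if (o[i], o[j]) == (a, b)) == 2
    for i in range(4) for j in range(4) if i != j
    for a, b in product((0, 1), repeat=2))

# (c) Joint entropy: 8 equally likely tuples -> log2(8) = 3 bits = n - 1 < n.
joint_entropy = log2(len(outcomes))
print(pairwise, joint_entropy)
```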

(b) H (Xn |X0 ) ≥ H (Xn−1 |X0 ) . 1. .78) (c) H (Xn |X1 . possibly reversing direction at each step with probability p = . Xn−1 . since H (Xn |X0 ) = H (Xn . X0 ) by stationarity. .i. Which of the following statements are true? Prove or provide a counterexample. X1 . n−1 (c) H (Xn |X1 . . 0. The entropy rate of a dog looking for a bone. . X This statement is true. Solution: Stationary processes. X2 . Let X 0 = 0 . X0 ) = H (X−n . X1 . (a) H (Xn |X0 ) = H (X−n |X0 ) . This statement is true. 12. Let . since by stationarity H (X n |X1 . X2 . A dog walks on the integers. This example illustrates that pairwise independence does not imply complete independence. X2 .1. Let X 0 . Xn+1 ) is nonincreasing in n . . . −4. X0 ) − H (X0 ) (4. . . H (Xn |X0 ) = 0 and H (Xn−1 |X0 ) = 1 . The ﬁrst step is equally likely to be positive or negative. In this case. . .77) (4. . Xn+1 ) is non-increasing in n . . . (c) What is the expected number of steps the dog takes before reversing direction? Solution: The entropy rate of a dog looking for a bone. (a) Find H (X1 . . −1. −3. . . . n−1 n. Xn ). The total entropy is not n . . X0 . Xn+1 . X1 . though it is true for ﬁrst order Markov chains. 11. . uniformly distributed binary random variables and let X k = Xk−n for k ≥ n . A simple counterexample is a periodic process with period n . −2. −3. H (X−n |X0 ) = H (X−n . (b) Find the entropy rate of this browsing dog. . . . which is what would be obtained if the Xi ’s were all independent. . . −2. −1. Stationary processes. Xn−1 be i. . A typical walk might look like this: (X0 . X −1 . ). .72 Entropy Rates of a Stochastic Process since Xn is a function of the previous Xi ’s. Xn−1 . . X H (Xn+1 |X1 ) where the inequality follows from the fact that conditioning n+2 reduces entropy. X0 ) − H (X0 ) and H (Xn . . .) = (0. (d) H (Xn |X1 . . contradicting the statement H (Xn |X0 ) ≥ H (Xn−1 |X0 ) . Xn+1 ) = H (Xn+1 |X2 n+2 ) ≥ n.d. (b) H (Xn |X0 ) ≥ H (Xn−1 |X0 ) . (a) H (Xn |X0 ) = H (X−n |X0 ) . 
X2n ) is non-increasing in n . This statement is not true in general. . . . be a stationary (not necessarily Markov) stochastic process.

. X2 . Xn . H (X0 . . . Letting S be the number of steps taken between reversals. . H (X0 ) = 0 and since the ﬁrst step is equally likely to be positive or negative. . . (b) From a).79) n→∞ 2n Thus the dependence between adjacent n -blocks of a stationary process does not grow linearly with n . . . X2 .9)s−1 (. . . . . Since X 0 = 0 deterministically. X1 . The past has little to say about the future. Xn+1 . . For a stationary stochastic process X 1 . Solution: I (X1 . .9) n+1 → H (. . Xn+2 . . . Xn+1 . n 73 H (X0 . Xn . Xi−2 ). .. 13. . . . Xn . the next position depends only on the previous two (i. . . Starting at time 0. . . . . if the dog’s position is the state). . Xn+1 . . . = (c) The dog must take at least one step to establish the direction of travel from which it ultimately reverses. . Xn ) = i=0 H (Xi |X i−1 ) n = H (X0 ) + H (X1 |X0 ) + i=2 H (Xi |Xi−1 . . Xn ) = H (Xn+1 . . since. the expected number of steps to the ﬁrst reversal is 11. X2 . Xn ) n+1 1 + (n − 1)H (. . . . . . X2n ) − H (X1 . . . .9). . . Xn+2 . Xn+1 . (4. show that 1 lim I (X1 . X2 .80) since H (X1 . . X2 .9). X2n ) (4. . . . X2n ) = 2H (X1 . X2 . Therefore. Xn ) + H (Xn+1 . Xn . .9). Xn+2 . .1. . . . . H (X1 |X0 ) = 1 . .1. . X1 . . .e. . X1 . . . the dog’s walk is 2nd order Markov. X2 . X2 . . . . X2n ) by stationarity.1.1) = 10. X2n ) = 0. . . . . we have E (S ) = ∞ s=1 s(. H (Xi |Xi−1 . .Entropy Rates of a Stochastic Process (a) By the chain rule. Xn . Xn+2 . . for i > 1 . . X2n ) = H (X1 . Xn ) − H (X1 .1. . Xn ) = 1 + (n − 1)H (. . Xn+2 . H (X0 . Xn+2 . Furthermore for i > 1 . . Xi−2 ) = H (. . . .

Since (Y1 . .81) X 2n ) n→∞ 2n n→∞ 2n 1 1 = lim H (X1 . Xn+2 . X2 . . 2. . .74 Thus n→∞ Entropy Rates of a Stochastic Process lim 1 I (X1 . Yn ) ≤ lim n→∞ n→∞ n n lim or H(Y ) ≤ H(X ) (4. . . . . Solution: The key point is that functions of a random variable have lower entropy. . . X2 . . . . X2 . Xn . . . . . . X2 . Xn . Xn+1 . . X−1 . . . X2 . . . 2. . . . . Xn ) = limn→∞ 21 n H (X1 . (4. and taking the limit as n → ∞ . . (4. Xn+2 . . Xn+1 . Xn+1 . . . Yn ) ≤ H (X1 . Xn+1 . Entropy rate. . . . (a) Consider a stationary stochastic process X 1 . . Y2 . . . X(4. . .83) 14. . . . . . .(4. Yn be deﬁned by Yi = φ(Xi ). . . . . Xn+1 . Xn . for some function ψ . X−k ) → H (X ). . . 2n (4. Xn ) − lim H (X1 . . . Show 1 H (Xn . 2. we have H (X1 . X2 . Xn . . X2 .84) for some function φ . Xi+1 ).90) n for k = 1. . . X2 .89) 15. Xn ) H (Y1 . (4. . X2n ) 2n 1 1 = lim 2H (X1 . .87) i = 1. . Xn ) Dividing by n . . . . X2 . . . . X2n ) = 0. Xn . . . . . Functions of a stochastic process. . . .88) (4. Xn ) − lim H (X1 . and therefore n→∞ lim 1 I (X1 . . we have (from Problem 2. Y2 . . X2 . . . Xn+2 . . Prove that H (Y ) ≤ H (X ) (b) What is the relationship between the entropy rates H (Z ) and H (X ) if Zi = ψ (Xi . . .85) . Xn . X2 . (4. .86) (4. . i = 1. . . X1 | X0 .82) 2n ) n→∞ 2n n→∞ n 1 Now limn→∞ n H (X1 . . Let {Xi } be a discrete stationary stochastic process with entropy rate H (X ). Yn ) is a function of (X1 . . . . .4) H (Y1 . . . . Y2 . . . Xn+2 . X2n ) since both converge to the entropy rate of the process. . X2 . . . and let Y1 . . . . . Xn+2 . . . Y2 . . Xn ) (each Yi is a function of the corresponding Xi ).

. it may be necessary to require at least one 0 between any two 1’s. . (4. .Entropy Rates of a Stochastic Process 75 ¢£ ¢£¤¢¤¢ ¦¥¦¥ ¡ ¡ Figure 4. to ensure proper sychronization. it is often necessary to limit the length of runs of 0’s between two 1’s. . Entropy rate of constrained sequences. .91) ) n i=1 H. 16. Xi−2 .1: Entropy rate of constrained sequence Solution: Entropy rate of a stationary process. . In magnetic recording. Also to reduce intersymbol interference. but 0110010 and 0000101 are not. Suppose that we are required to have at least one 0 and at most two 0’s between any pair of 1’s in a sequences. Thus. Hence 1 H (X1 . We will consider a simple example of such a constraint. . X−k ) n = 1 n H (Xi |Xi−1 . . the mechanism of recording and reading the bits imposes constraints on the sequences of bits that can be recorded. . (a) Show that the set of constrained sequences is the same as the set of allowed paths on the following state diagram: (b) Let Xi (n) be the number of valid paths of length n ending at state i . . X−k (4. By the Ces´ aro mean theorem. sequences like 101001 and 0101001 are valid sequences. We wish to calculate the number of valid sequences of length n .92) ) = the entropy rate of the process. the running average of the terms tends to the same limit as the limit of the terms. X−1 . . . . .93) → lim H (Xn |Xn−1 . Xn−2 . Xn |X0 . . For example. X2 . Argue that . X−k(4. .

76

**Entropy Rates of a Stochastic Process
**

X(n) = [X1 (n) X2 (n) X3 (n)]t satisﬁes the following recursion:

**with initial conditions X(1) = [1 1 0] t . (c) Let
**

X1 (n) 0 1 1 X1 (n − 1) X ( n ) = 1 0 0 2 X2 (n − 1) , X3 (n) 0 1 0 X3 (n − 1)

(4.94)

Then we have by induction

0 1 1 A = 1 0 0 . 0 1 0

(4.95)

X(n) = AX(n − 1) = A2 X(n − 2) = · · · = An−1 X(1).

(4.96)

Using the eigenvalue decomposition of A for the case of distinct eigenvalues, we can write A = U −1 ΛU , where Λ is the diagonal matrix of eigenvalues. Then An−1 = U −1 Λn−1 U . Show that we can write

n−1 n−1 n−1 X(n) = λ1 Y1 + λ 2 Y2 + λ 3 Y3 ,

(4.97)

where Y1 , Y2 , Y3 do not depend on n . For large n , this sum is dominated by the largest term. Therefore argue that for i = 1, 2, 3 , we have 1 log Xi (n) → log λ, n (4.98)

where λ is the largest (positive) eigenvalue. Thus the number of sequences of length n grows as λn for large n . Calculate λ for the matrix A above. (The case when the eigenvalues are not distinct can be handled in a similar manner.) (d) We will now take a diﬀerent approach. Consider a Markov chain whose state diagram is the one given in part (a), but with arbitrary transition probabilities. Therefore the probability transition matrix of this Markov chain is 0 1 0 P = α 0 1 − α . 1 0 0

(4.99)

Show that the stationary distribution of this Markov chain is µ= 1 1−α 1 . , , 3−α 3−α 3−α (4.100)

(e) Maximize the entropy rate of the Markov chain over choices of α . What is the maximum entropy rate of the chain? (f) Compare the maximum entropy rate in part (e) with log λ in part (c). Why are the two answers the same?

**Entropy Rates of a Stochastic Process
**

Solution: Entropy rate of constrained sequences.

77

(a) The sequences are constrained to have at least one 0 and at most two 0’s between two 1’s. Let the state of the system be the number of 0’s that has been seen since the last 1. Then a sequence that ends in a 1 is in state 1, a sequence that ends in 10 in is state 2, and a sequence that ends in 100 is in state 3. From state 1, it is only possible to go to state 2, since there has to be at least one 0 before the next 1. From state 2, we can go to either state 1 or state 3. From state 3, we have to go to state 1, since there cannot be more than two 0’s in a row. Thus we can the state diagram in the problem. (b) Any valid sequence of length n that ends in a 1 must be formed by taking a valid sequence of length n − 1 that ends in a 0 and adding a 1 at the end. The number of valid sequences of length n − 1 that end in a 0 is equal to X 2 (n − 1) + X3 (n − 1) and therefore, X1 (n) = X2 (n − 1) + X3 (n − 1). (4.101) By similar arguments, we get the other two equations, and we have

The initial conditions are obvious, since both sequences of length 1 are valid and therefore X(1) = [1 1 0]T . (c) The induction step is obvious. Now using the eigenvalue decomposition of A = U −1 ΛU , it follows that A2 = U −1 ΛU U −1 ΛU = U −1 Λ2 U , etc. and therefore X(n) = An−1 X(1) = U −1 Λn−1 U X(1) = U

−1

X1 (n − 1) 0 1 1 X1 (n) 1 0 0 = X ( n ) X2 (n − 1) . 2 X3 (n − 1) 0 1 0 X3 (n)

(4.102)

(4.103)

n−1 λ1

0

n−1 λ2

0 0

0 0

n−1 λ3

0

1 0

**1 0 0 1 0 0 0 1 n−1 −1 n−1 −1 = λ1 U 0 0 0 U 1 + λ2 U 0 1 0 U 1 0 0 0 0 0 0 0 0 1 0 0 0 n−1 −1 +λ3 U 0 0 0 U 1 0 0 0 1
**

U 1

(4.104)

(4.105) (4.106)

n−1 n−1 n−1 = λ1 Y1 + λ 2 Y2 + λ 3 Y3 ,

**where Y1 , Y2 , Y3 do not depend on n . Without loss of generality, we can assume that λ1 > λ2 > λ3 . Thus
**

n−1 n−1 n−1 X1 (n) = λ1 Y11 + λ2 Y21 + λ3 Y31

(4.107) (4.108) (4.109)

X2 (n) = X3 (n) =

n−1 λ1 Y12 n−1 λ1 Y13

+ +

n−1 λ2 Y22 n−1 λ2 Y23

+ +

n−1 λ3 Y32 n−1 λ3 Y33

78

**Entropy Rates of a Stochastic Process
**

For large n , this sum is dominated by the largest term. Thus if Y 1i > 0 , we have 1 log Xi (n) → log λ1 . n (4.110)

To be rigorous, we must also show that Y 1i > 0 for i = 1, 2, 3 . It is not diﬃcult to prove that if one of the Y1i is positive, then the other two terms must also be positive, and therefore either 1 log Xi (n) → log λ1 . n (4.111)

for all i = 1, 2, 3 or they all tend to some other value. The general argument is diﬃcult since it is possible that the initial conditions of the recursion do not have a component along the eigenvector that corresponds to the maximum eigenvalue and thus Y1i = 0 and the above argument will fail. In our example, we can simply compute the various quantities, and thus 0 1 1 A = 1 0 0 = U −1 ΛU, 0 1 0 1.3247 0 0 0 −0.6624 + 0.5623i 0 Λ= , 0 0 −0.6624 − 0.5623i

(4.112)

where

(4.113)

and

and therefore

−0.5664 −0.7503 −0.4276 U = 0.6508 − 0.0867i −0.3823 + 0.4234i −0.6536 − 0.4087i , 0.6508 + 0.0867i −0.3823i0.4234i −0.6536 + 0.4087i 0.9566 Y1 = 0.7221 , 0.5451

(4.114)

(4.115)

which has all positive components. Therefore,

**1 log Xi (n) → log λi = log 1.3247 = 0.4057 bits. n (d) To verify the that µ= 1 1 1−α , , 3−α 3−α 3−α
**

T

(4.116)

.

(4.117)

is the stationary distribution, we have to verify that P µ = µ . But this is straightforward.

One is to derive the Markov transition matrix from the eigenvalues. we ﬁnd that 1 1 dH = (−α ln α − (1 − α) ln(1 − α))+ (−1 − ln α + 1 + ln(1 − α)) = 0. (f) The answers in parts (c) and (f) are the same.118) and diﬀerentiating with respect to α to ﬁnd the maximum. 3−α (4. In Chapter 12. In part (c) we used a direct argument to calculate the number of sequences of length n and found that asymptotically.5698 and the maximum entropy rate as 0. we show that there are at most a polynomial number of types. we can see that there are approximately 2 nH typical sequences of length n for a Markov chain of entropy rate H .4057 bits. the number of sequences in the typeclass is 2nH . we will use an argument from the method of types. but the essential idea is that both answers give the asymptotics of the number of sequences of length n for the state diagram in part (a).121) . (4. we see that 2nH ≤ λn 1 or H ≤ log λ1 . the largest type class has the same number of sequences (to the ﬁrst order in the exponent) as the entire set. H = log λ 1 .122) which can be solved (numerically) to give α = 0. of parts (a)–(c). Instead. we need to show that there exists an Markov transition matrix that achieves the upper bound. This result is a very curious one that connects two apparently unrelated objects the maximum eigenvalue of a state transition matrix. The same arguments can be applied to Markov types.120) which reduces to 3 ln α = 2 ln(1 − α). α3 = α2 − 2α + 1. and the maximum entropy (4. Thus. This can be done by two diﬀerent methods. i..e. If we extend the ideas of Chapter 3 (typical sequences) to the case of Markov chains. Why? A rigorous argument is quite involved. If we consider all Markov chains with state diagram given in part (a). 2 dα (3 − α) 3−α (4. and therefore for this type class. etc.119) or (3 − α) (ln a − ln(1 − α)) = (−α ln α − (1 − α) ln(1 − α)) (4. and that therefore. 
There are only a polynomial number of Markov types and therefore of all the Markov type classes that satisfy the state constraints of part (a).2812 nats = 0. For this Markov type.Entropy Rates of a Stochastic Process (e) The entropy rate of the Markov chain (in nats) is H= µi i j 79 Pij ln Pij = 1 (−α ln α − (1 − α) ln(1 − α)) . at least one of them has the same exponent as the total number of sequences that satisfy the state constraint. To complete the argument. X (n) ≈ λ n 1. the number of typical sequences should be less than the total number of sequences of length n that satisfy the state constraints.

X2 .d. Since X 0 . . so that X1 follows Xn . . X1 . etc) and we can deﬁne the waiting time N k at time . To simplify matters. we have E log N = i p(X0 = i)E (log N |X0 = i) p(X0 = i) log E (N |X0 = i) p(i) log 1 p(i) (4. .i. Waiting times are insensitive to distributions.124) E (N |X0 = i) = p(i) Then using Jensen’s inequality. Then we can get rid of the edge eﬀects (namely that the waiting time is not deﬁned for Xn . ∼ p(x) . x ∈ X = {1. In essence. the expected time until we see it again is 1/p(i) . since given X0 = i . 17.80 Entropy Rates of a Stochastic Process rate for a probability transition matrix with the same state diagram. where N = minn {Xn = X0 } . X2 . (c) (Optional) Prove part (a) for {X i } stationary and ergodic. . . Xn arranged in a circle.i.d. m} and let N be the waiting time to the next occurrence of X0 . . . (c) The property that EN = m is essentially a combinatorial property rather than a statement about expectations. (4. be drawn i.126) (4. We prove this for stationary ergodic sources. EN = E [E (N |X0 )] = p(X0 = i) 1 p(i) = m. . and thus the expected value must be m . (a) Given X0 = i . (b) Show that E log N ≤ H (X ) . Since the process is ergodic. By the same argument. 2. we will calculate the empirical average of the waiting time. (a) Show that EN = m . Let X 0 . (4. the waiting time for the next occurrence of X 0 has a geometric distribution with probability of success p(x 0 ) . . We don’t know a reference for a formal proof of this result. X2 . . . X1 . the empirical average converges to the expected value. N has a geometric distribution with mean p(i) and 1 . Solution: Waiting times are insensitive to distributions. we will consider X 1 . . .127) (4. ∼ p(x). Therefore. and show that this converges to m .128) ≤ = i i = H (X ). Xn are drawn i. . The problem should read E log N ≤ H (X ) .125) (4. .123) (b) There is a typographical error in the problem.
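Both waiting-time facts can be checked directly for a concrete pmf. The sketch below (an illustration with an arbitrarily chosen dyadic pmf, not from the original text) computes E[N] = Σ_i p(i)·(1/p(i)) = m exactly, and evaluates E[log N] by truncating the geometric series for each starting symbol, confirming E[log N] ≤ H(X).

```python
import math

p = [0.5, 0.25, 0.125, 0.125]            # an arbitrary pmf on m = 4 symbols
m = len(p)

# (a) E[N] = sum_i p(i) * E[N | X0 = i] = sum_i p(i) * (1 / p(i)) = m.
EN = sum(pi * (1 / pi) for pi in p)

# (b) E[log N | X0 = i] for a geometric(p_i) waiting time, truncated series.
def Elog2(pi, nmax=10**5):
    return sum((1 - pi) ** (n - 1) * pi * math.log2(n) for n in range(2, nmax))

ElogN = sum(pi * Elog2(pi) for pi in p)
HX = -sum(pi * math.log2(pi) for pi in p)    # 1.75 bits for this pmf
print(EN, ElogN, HX)
```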

one with probability of heads p and the other with probability of heads 1 − p . Solution: (a) SInce the coin tosses are indpendent conditional on the coin chosen. . then Y 1 and Y2 are not independent. One of these coins is chosen at random (i. Y2 ) . Stationary but not ergodic process. (a) Calculate I (Y1 .130) Now we can rewrite the outer sum by grouping together all the terms which correspond to xi = l .131) But the inner two sums correspond to summing 1 over all the n terms. Calculate H(Y ) . (b) Calculate I (X . min{n>i:xn =xi } j =i+1 (4. I (Y 1 . . 18. (b) The key point is that if we did not know the coin being used. . A bin has two biased coins. and is then tossed n times. (Hint: Relate this to lim n→∞ n You can check the answer by considering the behavior as p → 1/2 . Yn ) ). The joint distribution of Y 1 and Y2 can be easily calculated from the following table . Y2 . Thus we obtain N= 1 n m l=1 i:xi =l min{n>i:xn =l} j =i+1 1 (4. Let X denote the identity of the coin that is picked. (c) Let H(Y ) be the entropy rate of the Y process (the sequence of coin tosses).e.. Y1 .Entropy Rates of a Stochastic Process 81 k as min{n > k : Xn = Xk } .132) Thus the empirical average of N over any sample sequence is m and thus the expected value of N must also be m . 1 H (X. . Y2 |X ) = 0. with probability 1/2). With this deﬁnition. Y1 . we can write the empirical average of Nk for a particular sample sequence N = = 1 n Ni n i=1 1 n i=1 n (4. Y2 |X ) .129) 1 . and thus N= 1 n m n=m l=1 (4. and let Y 1 and Y2 denote the results of the ﬁrst two tosses.

140) 19. and we can now calculate I (X . Yn |X ) − H (X |Y1 . . . (c) H (Y1 . p(1 − p). Y2 ) − H (Y1 |X ) − H (Y2 |X ) H(Y ) = lim 1 H (X ) = 0 and simSince 0 ≤ H (X |Y1 . Also. . Yn ) − H (X |Y1 . we have lim n 1 ilarly n H (X |Y1 . Random walk on graph. . Combining these terms. . Y1 . p(1 − p). since the Yi ’s are i. Y2 .133) (4. Y2 . . . . . we get H(Y ) = lim nH (p) = H (p) n (4.d. . . Yn ) = lim (4. . Y2 |X ) (4. Y2 . H (Y1 . . p(1 − p).82 X 1 1 1 1 2 2 2 2 Y1 H H T T H H T T Y2 H T H T H T H T Probability p2 p(1 − p) p(1 − p) (1 − p)2 (1 − p)2 p(1 − p) p(1 − p) p2 Entropy Rates of a Stochastic Process 1 2 2 2 Thus the joint distribution of (Y1 . Yn ) = 0 . Yn ) ≤ H (X ) ≤ 1 .137) n H (X. . . . p(1 − p). Yn ) = lim (4. Y2 . Yn ) (4. .138) n H (X ) + H (Y1 . Y2 . given X .i.136) where the last step follows from using the grouping rule for entropy. (p2 + (1 − p)2 ) − 2H (p) = H 2 2 = H (p(1 − p)) + 1 − 2H (p) (4. Y2 ) − 2H (p) (4. Y2 . . Yn |X ) = nH (p) . .134) = H (Y1 . . Y2 ) = H (Y1 . Y2 . . . . . . Y2 ) is ( 1 2 (p + (1 − p) ). . Y2 ) − H (Y1 . .139) n = H (Y1 . Y2 . Consider a random walk on the graph . Y1 . . . . . 2 (p + (1 − p)2 )) . .135) 1 1 2 (p + (1 − p)2 ).

[Figure: the 5-node graph of the problem — four exterior nodes 1–4 with three edges each, and a center node 5 with four edges.]

(a) Calculate the stationary distribution.

(b) What is the entropy rate?

(c) Find the mutual information I(Xn+1; Xn), assuming the process is stationary.

Solution: Random walk on graph.

(a) The stationary distribution for a connected graph of undirected edges with equal weight is given as µi = Ei/2E, where Ei denotes the number of edges emanating from node i and E is the total number of edges in the graph. Hence the stationary distribution is

    µ = [ 3/16, 3/16, 3/16, 3/16, 1/4 ];

i.e., the four exterior nodes have steady-state probability 3/16, while center node 5 has steady-state probability 1/4.

(b) Thus the entropy rate of the random walk on this graph is

    (3/4) log2 3 + (1/4) log2 4 = log 16 - H(3/16, 3/16, 3/16, 3/16, 1/4) = 1.69 bits.

(c) The mutual information is

    I(Xn+1; Xn) = H(Xn+1) - H(Xn+1|Xn)                                    (4.141)
                = H(3/16, 3/16, 3/16, 3/16, 1/4)
                  - ( log 16 - H(3/16, 3/16, 3/16, 3/16, 1/4) )           (4.142)
                = 2 H(3/16, 3/16, 3/16, 3/16, 1/4) - log 16.              (4.143)

20. Random walk on chessboard. Find the entropy rate of the Markov chain associated with a random walk of a king on the 3 × 3 chessboard

    1 2 3
    4 5 6
    7 8 9
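The computation above is an instance of a general fact: for a random walk on a connected graph with equal edge weights, µi = Ei/2E and the entropy rate is Σ_i µi log2 Ei. A minimal check for the 5-node graph (my own sketch, using only the degree sequence stated in the solution — four exterior nodes of degree 3 and a center node of degree 4):

```python
from math import log2

deg = [3, 3, 3, 3, 4]                    # degrees read off the solution
E2 = sum(deg)                            # 2E = 16, i.e. E = 8 edges
mu = [d / E2 for d in deg]               # stationary distribution E_i / 2E

# Entropy rate H(X2|X1): from each node the walk picks a neighbor uniformly.
rate = sum(m * log2(d) for m, d in zip(mu, deg))
print(mu, rate)
```

The printed rate, (3/4) log2 3 + 1/2 ≈ 1.69 bits, matches the value derived above.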

What about the entropy rate of rooks, bishops and queens? There are two types of bishops.

Solution: Random walk on the chessboard. Notice that the king cannot remain where it is: it has to move from one state to the next. The stationary distribution is given by µi = Ei/E, where Ei = number of edges emanating from node i and E = Σ_{i=1}^{9} Ei. By inspection, E1 = E3 = E7 = E9 = 3, E2 = E4 = E6 = E8 = 5, E5 = 8 and E = 40, so µ1 = µ3 = µ7 = µ9 = 3/40, µ2 = µ4 = µ6 = µ8 = 5/40 and µ5 = 8/40. In a random walk the next state is chosen with equal probability among possible choices, so H(X2|X1 = i) = log 3 bits for i = 1, 3, 7, 9; H(X2|X1 = i) = log 5 bits for i = 2, 4, 6, 8; and H(X2|X1 = i) = log 8 bits for i = 5. Therefore, we can calculate the entropy rate of the king as

    H = Σ_{i=1}^{9} µi H(X2|X1 = i)                                       (4.146)
      = 0.3 log 3 + 0.5 log 5 + 0.2 log 8                                 (4.147)
      = 2.24 bits.                                                        (4.148)

21. Maximal entropy graphs. Consider a random walk on a connected graph with 4 edges.

(a) Which graph has the highest entropy rate?

(b) Which graph has the lowest?

Solution: Graph entropy. There are five graphs with four edges.

75.844. (a) From the above we see that the ﬁrst graph maximizes entropy rate with and entropy rate of 1. (b) From the above we see that the third graph minimizes entropy rate with and entropy rate of .75. 1 and 1/4+3/8 log(3) ≈ . The bird ﬂies from room to room going to adjoining rooms with equal probability through each of the walls.094.094. 22.Entropy Rates of a Stochastic Process 85 Where the entropy rates are 1/2+3/8 log(3) ≈ 1. To be speciﬁc. the corner rooms have 3 exits. A bird is lost in a 3 × 3 × 3 cubical maze. 1. 3-D Maze. . .

(4. the total number of edges E = 54 .149) . {Zi } . (d) H (Z ) ≤ H (X ) . edges have 4 edges.e. (b) H (X ) ≤ H (Z ) .d. X2i+1 ) . . . 6 faces and 1 center. 2E 2E There are 8 corners. faces have 5 edges and centers have 6 edges.. Entropy rates Let {Xi } be a stationary process. process. or ≥ between each of the pairs listed below: (a) H (X ) ≤ H (Y ) .) ≤ H (X1 ) since conditioning reduces entropy (b) We have equality only if X1 is independent of the past X0 . So. =. Let Yi = (Xi . and H (V ) of the processes {Xi } .. (b) What are the conditions for equality? Solution: Entropy Rate (a) From Theorem 4.. H (Z ) . H (X ) = log(108) + 8 = 2. X−1 . 24.86 (a) What is the stationary distribution? Entropy Rates of a Stochastic Process (b) What is the entropy rate of this random walk? Solution: 3D Maze. Consider the entropy rates H (X ) .03 bits 23. if and only if Xi is an i. X−1 . What is the inequality relationship ≤ . The entropy rate of a random walk on a graph with equal weights is given by equation 4. Therefore. . H (Y ) . i. Let Zi = (X2i .1 H (X ) = H (X1 |X0 . . {Yi } . Let Vi = X2i . 12 edges.2. (c) H (X ) ≤ H (V ) . and {Vi } .i.. ≥ ≥ ≥ ≥ 3 3 log 108 108 + 12 4 4 log 108 108 +6 5 5 log 108 108 +1 6 6 log 108 108 (a) Argue that H (X ) ≤ H (X1 ) . Entropy rate Let {Xi } be a stationary stochastic process with entropy rate H (X ) . Xi+1 ) ..41 in the text: H (X ) = log(2E ) − H E1 Em . Corners have 3 edges. . . .

.152) (4. . . Yn ) = I (X . Xn . X2n ) = H (Z1 . Monotonicity. and Yi = (Xi−1 . . . . Yn ) ≥ H (X |Y1 . .153) (4. 8). . we get equality. Yn . . ≤ H (X ) − H (X |Y1 . Yn ) is non-decreasing in n . (2.) . 2. . Xn . we get equality. . . . . . . we get 2H (X ) = H(Z ) . . and dividing by n and taking the limit. Let Vi = X2i . and dividing by n and taking the limit. Yn+1 ) (4. Yn ) . (d) H (Z ) = 2H (X ) since H (X1 . H (X |Y1 . Y2 . . (a) H (X ) = H (Y ) . 7. . . Form the associated “edge-process” {Yi } by keeping track only of the transitions. X2 . . . . . Xi+1 ) . . . X2 . and dividing by n and taking the limit. . i. . . . since H (V1 |V0 . . Y2 .150) (c) H (X ) > H (V ) . Yn . (b) Under what conditions is the mutual information constant for all n ? Solution: Monotonicity (a) Since conditioning reduces entropy. (5. H (Z ) . . Y2 . (3. . X−2 . . H (Y ) . . Zn ) . . . . . (a) Show that I (X . . . Y1 . Zn ) . . 7). Y2 . . . . Yn . X0 . Suppose {X i } forms an irreducible Markov chain with transition matrix P and stationary distribution µ . For example becomes Y = (∅. and {Vi } . . . (b) H (X ) < H (Z ) . given Y1 . .n+1 ) (b) We have equality if and only if H (X |Y 1 . . we get 2H (X ) = H(Z ) . . 3). . Xn+1 ) = H (Y1 . since H (X1 . . Solution: Edge Process H (X ) = H (Y ) . 26. . Let Zi = (X2i . Y2 . Find the entropy rate of the edge process {Y i } . . . 8. 5. X−1 . . . . Y2 . Xi ) . . Y2 . Transitions in Markov chains. Yi = (Xi . since H (X 1 . X2i+1 ) . . . . 25. since H (X1 . .e. {Zi } . if X is conditionally independent of Y 2 . . (8. Thus the new process {Y i } takes values in X × X .151) (4. . Y2 . . Y2 . Yn+1 ) and hence I (X . Y2 .) = H (X2 |X0 . . Consider the entropy rates H (X ) . . . .Entropy Rates of a Stochastic Process Solution: Entropy rates 87 {Xi } is a stationary process. Yn ) = H (X ) − H (X |Y1 . Y1 . Yn ) . . . . . . Xn+1 ) = H (Y1 . Y1 . . . and dividing by n and taking the limit. 2). X = 3. . . 5). .) ≤ H (X2 |X1 . . 
and H (V ) of the processes {X i } .. . X2n ) = H (Z1 . Yn ) = H (X |Y1 ) for all n . .
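The degree formula for the maze entropy rate is easy to check numerically. A minimal sketch (the degree counts come from the solution above; log(2E) - H(d1/2E, ..., dm/2E) equals the sum of (d/2E) log d used here):

```python
import math

# Room degrees in the 3x3x3 maze: 8 corners (3 doors), 12 edge rooms (4),
# 6 face rooms (5), 1 center room (6).
degrees = [3] * 8 + [4] * 12 + [5] * 6 + [6] * 1

two_E = sum(degrees)            # 2E = 108, so E = 54 edges
assert two_E == 108

# Stationary probability of a room is proportional to its degree, and the
# entropy rate of the walk is sum over rooms of (d / 2E) * log2(d).
H = sum((d / two_E) * math.log2(d) for d in degrees)
print(round(H, 2))              # 2.03 bits
```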

27. Mixture of processes. Suppose we observe one of two stochastic processes but don't know which. What is the entropy rate? Specifically, let X11, X12, X13, ... be a Bernoulli process with parameter p1 and let X21, X22, X23, ... be Bernoulli(p2). Let

    theta = 1 with probability 1/2, theta = 2 with probability 1/2,

and let Yi = X(theta, i), i = 1, 2, ..., be the observed stochastic process. Thus Y observes the process {X1i} or {X2i}. Eventually Y will know which.

(a) Is {Yi} stationary?
(b) Is {Yi} an i.i.d. process?
(c) What is the entropy rate H of {Yi}?
(d) Does -(1/n) log p(Y1, Y2, ..., Yn) converge to H?
(e) Is there a code that achieves an expected per-symbol description length (1/n) E Ln converging to H?

Now let theta_i be Bern(1/2), chosen i.i.d. each time, and observe Zi = X(theta_i, i). Thus theta is not fixed for all time, as it was in the first part. Answer (a), (b), (c), (d), (e) for the process {Zi}, labeling the answers (a'), (b'), (c'), (d'), (e').

Solution: Mixture of processes.

(a) YES. {Yi} is stationary, since the scheme that we use to generate the Yi's doesn't change with time.
(b) NO. It is not IID, since there is dependence: all Yi's have been generated according to the same parameter theta. Alternatively, if the process were IID, then I(Yn+1; Y^n) would have to be 0. But if we are given Y^n, then we can estimate what theta is, which in turn allows us to predict Yn+1. Thus I(Yn+1; Y^n) is nonzero.
(c) The process {Yi} is the mixture of two Bernoulli processes with different parameters, and its entropy rate is the mixture of the two entropy rates:

    H = lim (1/n) H(Y^n) = lim (1/n) (H(theta) + H(Y^n | theta) - H(theta | Y^n)) = (H(p1) + H(p2)) / 2.

Note that only H(Y^n | theta) grows with n; the remaining terms are finite and vanish after dividing by n as n goes to infinity.
(d) NO. The process {Yi} is not ergodic, so the AEP does not apply and -(1/n) log p(Y1, Y2, ..., Yn) does NOT converge to the entropy rate. (But it does converge to a random variable that equals H(p1) w.p. 1/2 and H(p2) w.p. 1/2.)
(e) YES. Since the process is stationary, we can do Huffman coding on longer and longer blocks of the process. These codes will have an expected per-symbol length bounded above by (H(X1, ..., Xn) + 1)/n, and this converges to H(X).
(a') YES. {Zi} is stationary, since the scheme that we use to generate the Zi's doesn't change with time.
(b') YES. It is IID, since there is no dependence now: each Zi is generated according to an independent parameter theta_i, and Zi ~ Bernoulli((p1 + p2)/2).
(c') Since the process is now IID, its entropy rate is H((p1 + p2)/2).
(d') YES, the limit exists by the AEP.
(e') YES, as in (e) above.

28. Entropy rate. Let {Xi} be a stationary {0, 1}-valued stochastic process obeying

    Xk+1 = Xk XOR Xk-1 XOR Zk+1,

where {Zi} is Bernoulli(p) and XOR denotes mod 2 addition. What is the entropy rate H(X)?

Solution: Entropy rate.

    H(X) = H(Xk+1 | Xk, Xk-1, ...) = H(Xk+1 | Xk, Xk-1) = H(Zk+1) = H(p),

since, given (Xk, Xk-1), the next symbol Xk+1 is determined by Zk+1, which is independent of the past.
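The rates in parts (c) and (c') differ exactly by the concavity of the binary entropy function. A small numeric sketch (the parameters p1 = 0.1 and p2 = 0.4 are illustrative, not from the problem):

```python
import math

def binary_entropy(p):
    """H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p1, p2 = 0.1, 0.4   # illustrative parameters

# Fixed-but-unknown theta (process {Y_i}): rate is the average of the rates.
H_Y = (binary_entropy(p1) + binary_entropy(p2)) / 2

# Fresh theta_i each time (process {Z_i}): i.i.d. Bernoulli((p1+p2)/2).
H_Z = binary_entropy((p1 + p2) / 2)

# By concavity of entropy, mixing the parameters beats mixing the rates.
print(H_Y, H_Z)
assert H_Y <= H_Z
```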

29. Waiting times. Let X be the waiting time for the first heads to appear in successive flips of a fair coin; thus, for example, Pr{X = 3} = (1/2)^3. Let Sn be the waiting time for the nth head to appear, i.e.,

    S0 = 0,  Sn+1 = Sn + Xn+1,

where X1, X2, X3, ... are i.i.d. according to the distribution above.

(a) Is the process {Sn} stationary?
(b) Calculate H(S1, S2, ..., Sn).
(c) Does the process {Sn} have an entropy rate? If so, what is it? If not, why not?
(d) What is the expected number of fair coin flips required to generate a random variable having the same distribution as Sn?

Solution: Waiting time process.

(a) For the process to be stationary, the distribution must be time invariant. It turns out that the process {Sn} is not stationary. There are several ways to show this.
  - S0 is always 0 while Si, i not 0, can take on several values. Since the marginals for S0 and S1, for example, are not the same, the process can't be stationary.
  - It's clear that the variance of Sn grows with n, since var(Sn) = var(Xn) + var(Sn-1) > var(Sn-1), which again implies that the marginals are not time-invariant.
  - {Sn} is an independent increment process. An independent increment process is not stationary (not even wide sense stationary).
(b) We can use the chain rule and the Markov property:

    H(S1, ..., Sn) = H(S1) + sum_{i=2..n} H(Si | S^{i-1}) = H(S1) + sum_{i=2..n} H(Si | Si-1) = H(X1) + sum_{i=2..n} H(Xi) = sum_{i=1..n} H(Xi) = 2n,

since the geometric(1/2) waiting time has H(X) = sum_k k 2^{-k} = 2 bits.
(c) It follows trivially from (b) that

    H(S) = lim H(S1, ..., Sn)/n = lim 2n/n = 2.

Note that the entropy rate can still exist even when the process is not stationary; the entropy rate (for this problem) is the same as the entropy of X.
(d) The expected number of flips required can be lower-bounded by H(Sn) and upper-bounded by H(Sn) + 2 (Theorem 5.12.3, page 115). Sn has a negative binomial distribution,

    Pr(Sn = k) = C(k-1, n-1) (1/2)^k  for k >= n.

(We have the nth success at the kth trial if and only if we have exactly n - 1 successes in k - 1 trials and a success at the kth trial.) Since computing the exact value of H(Sn) is difficult (and fruitless in the exam setting), it suffices to observe that the expected number of flips lies between H(Sn) and H(Sn) + 2. Note, however, that for large n the distribution of Sn tends to a Gaussian with mean n/p = 2n and variance n(1 - p)/p^2 = 2n.

Let pk = Pr(Sn = k + E Sn) = Pr(Sn = k + 2n), and let phi(x) = exp(-x^2 / 2 sigma^2) / sqrt(2 pi sigma^2) be the normal density with mean zero and variance sigma^2 = 2n. Then, since the entropy is invariant under any constant shift of a random variable and phi(x) log phi(x) is Riemann integrable,

    H(Sn) = H(Sn - E(Sn))
          = - sum_k pk log pk
          ~ - sum_k phi(k) log phi(k)
          ~ - integral phi(x) log phi(x) dx
          = (-log e) integral phi(x) (-x^2 / 2 sigma^2 - ln sqrt(2 pi sigma^2)) dx
          = (log e) (1/2 + ln sqrt(2 pi sigma^2))
          = (1/2) log(2 pi e sigma^2)
          = (1/2) log(n pi e) + 1,

where sigma^2 = 2n. (Refer to Chapter 9 for a more general discussion of the entropy of a continuous random variable and its relation to discrete entropy.) Here is a specific example for n = 100: the Gaussian approximation gives H(S100) = 5.8690 bits.
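The n = 100 numbers quoted here can be reproduced directly from the negative binomial pmf. A sketch (the truncation point 1000 is an assumption; it leaves only negligible tail mass, since the mean is 200 with standard deviation about 14):

```python
import math

n = 100  # number of heads we wait for; S_n is negative binomial

# Exact entropy: Pr(S_n = k) = C(k-1, n-1) (1/2)^k for k >= n.
H_exact = 0.0
for k in range(n, 1000):
    p = math.comb(k - 1, n - 1) * 0.5 ** k
    if p > 0:
        H_exact -= p * math.log2(p)

# Gaussian approximation: (1/2) log2(2 pi e sigma^2) with sigma^2 = 2n.
H_gauss = 0.5 * math.log2(2 * math.pi * math.e * 2 * n)

print(round(H_exact, 4), round(H_gauss, 4))
```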

The exact value of H(S100), computed directly from the negative binomial distribution, is 5.8636 bits. The expected number of flips required is therefore somewhere between 5.8636 and 7.8636.

30. Markov chain transitions. Let {Xi}, i >= 1, be a Markov chain with transition matrix P, i.e., P(Xn+1 = j | Xn = i) = Pij, where

    P = [ 1/2 1/4 1/4
          1/4 1/2 1/4
          1/4 1/4 1/2 ].

Let X1 be uniformly distributed over the states {0, 1, 2}.

(a) Is {Xn} stationary?
(b) Find lim (1/n) H(X1, ..., Xn).

Now consider the derived process Z1, Z2, ..., Zn, where Z1 = X1 and Zi = Xi - Xi-1 (mod 3) for i = 2, ..., n. Thus Z^n encodes the transitions, not the states.

(c) Find H(Z1, Z2, ..., Zn).
(d) Find H(Zn) and H(Xn) for n >= 2.
(e) Find H(Zn | Zn-1) for n >= 2.
(f) Are Zn-1 and Zn independent for n >= 2?

Solution:

(a) Let mu_n denote the probability mass function at time n. Since mu_1 = (1/3, 1/3, 1/3) and mu_2 = mu_1 P = mu_1, we have mu_n = mu_1 = (1/3, 1/3, 1/3) for all n, and {Xn} is stationary. Alternatively, the observation that P is doubly stochastic leads to the same conclusion.
(b) Since {Xn} is stationary Markov,

    lim (1/n) H(X1, ..., Xn) = H(X2 | X1) = sum_k P(X1 = k) H(X2 | X1 = k) = 3 x (1/3) x H(1/2, 1/4, 1/4) = 3/2 bits.

(c) Since (X1, ..., Xn) and (Z1, ..., Zn) are one-to-one, by the chain rule of entropy and the Markovity,

    H(Z1, ..., Zn) = H(X1, ..., Xn) = H(X1) + sum_{k=2..n} H(Xk | Xk-1) = H(X1) + (n - 1) H(X2 | X1) = log 3 + (3/2)(n - 1).

(d) Since {Xn} is stationary with mu_n = (1/3, 1/3, 1/3), H(Xn) = H(X1) = H(1/3, 1/3, 1/3) = log 3. For n >= 2, Zn = Xn - Xn-1 has distribution (1/2, 1/4, 1/4), so H(Zn) = H(1/2, 1/4, 1/4) = 3/2.
(e) Due to the symmetry of P, P(Zn | Zn-1) = P(Zn) for n >= 2, so H(Zn | Zn-1) = H(Zn) = 3/2. Alternatively, using the result of part (f), we can trivially reach the same conclusion.
(f) YES. Let k >= 2. First observe that by the symmetry of P, Zk+1 = Xk+1 - Xk is independent of Xk. Now

    P(Zk+1 | Xk, Xk-1, ..., X1) = P(Xk+1 - Xk | Xk, Xk-1, ..., X1) = P(Xk+1 - Xk | Xk) = P(Xk+1 - Xk) = P(Zk+1).

Hence Zk+1 is independent of (Xk, ..., X1) and hence of (Z1, ..., Zk), since each Zi with i <= k is a function of (X1, ..., Xk). For k = 1, Z2 = X2 - X1 is independent of Z1 = X1 by the same symmetry. Hence Z1, Z2, ..., Zn are independent, and Z2, ..., Zn are identically distributed with probability distribution (1/2, 1/4, 1/4). This gives an alternative derivation of (c):

    H(Z1, ..., Zn) = H(Z1) + H(Z2) + ... + H(Zn) = H(Z1) + (n - 1) H(Z2) = log 3 + (3/2)(n - 1).
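Parts (a), (b) and (f) can be verified by brute-force enumeration over the three time steps; a quick sketch:

```python
import math
from itertools import product

P = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]
mu = [1 / 3, 1 / 3, 1 / 3]

# (a) mu P = mu: the uniform distribution is stationary.
muP = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
assert all(abs(a - b) < 1e-12 for a, b in zip(muP, mu))

# (b) entropy rate H(X2|X1) = sum_i mu_i H(row i).
def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

rate = sum(mu[i] * H(P[i]) for i in range(3))
print(rate)  # 1.5 bits

# (f) joint distribution of (Z2, Z3): check independence by enumeration.
pz = {}
for x1, x2, x3 in product(range(3), repeat=3):
    pr = mu[x1] * P[x1][x2] * P[x2][x3]
    z2, z3 = (x2 - x1) % 3, (x3 - x2) % 3
    pz[z2, z3] = pz.get((z2, z3), 0.0) + pr
for (z2, z3), pr in pz.items():
    m2 = sum(p for (a, _), p in pz.items() if a == z2)
    m3 = sum(p for (_, b), p in pz.items() if b == z3)
    assert abs(pr - m2 * m3) < 1e-12   # product of marginals
```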

31. Markov. Let {Xi} ~ Bernoulli(p). Consider the associated Markov chain {Yi}, i = 1, ..., n, where Yi = (the number of 1's in the current run of 1's). For example, if X^n = 101110..., we have Y^n = 101230....

(a) Find the entropy rate of X^n.
(b) Find the entropy rate of Y^n.

Solution: Markov solution.

(a) For an i.i.d. source, H(X) = H(p).
(b) Observe that X^n and Y^n have a one-to-one mapping. Thus, H(Y) = H(X) = H(p).

32. Time symmetry. Let {Xn} be a stationary Markov process. We condition on (X0, X1) and look into the past and future. For what index k is

    H(X-n | X0, X1) = H(Xk | X0, X1)?

Give the argument.

Solution: Time symmetry. The trivial solution is k = -n. To find other possible values of k we expand

    H(X-n | X0, X1)
      (a)= H(X-n | X0)
         = H(X-n, X0) - H(X0)
         = H(X-n) + H(X0 | X-n) - H(X0)
      (b)= H(X0) + H(X0 | X-n) - H(X0)
         = H(X0 | X-n)
      (c)= H(Xn | X0)
      (d)= H(Xn | X0, X-1)
      (e)= H(Xn+1 | X1, X0),

where (a) and (d) come from Markovity and (b), (c) and (e) come from stationarity. Hence k = n + 1 is also a solution. There are no other solutions, since for any other k we can construct a periodic Markov process as a counterexample. Therefore k is in {-n, n + 1}.
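For n = 1 the claim says H(X-1 | X0, X1) = H(X2 | X0, X1), which can be checked numerically for any stationary chain. A sketch with an arbitrary asymmetric 2-state chain (the matrix is illustrative):

```python
import math
from itertools import product

# An arbitrary 2-state chain used purely as a numeric check.
P = [[0.9, 0.1],
     [0.3, 0.7]]
p01, p10 = P[0][1], P[1][0]
mu = [p10 / (p01 + p10), p01 / (p01 + p10)]   # stationary distribution

def cond_entropy(joint3):
    """H(first | last two) from {(a, b, c): prob} = H(A,B,C) - H(B,C)."""
    def H(d):
        return -sum(p * math.log2(p) for p in d.values() if p > 0)
    pair = {}
    for (a, b, c), pr in joint3.items():
        pair[b, c] = pair.get((b, c), 0.0) + pr
    return H(joint3) - H(pair)

# Joint of (X_{-1}, X_0, X_1): mu(a) P(a,b) P(b,c).
past = {(a, b, c): mu[a] * P[a][b] * P[b][c]
        for a, b, c in product(range(2), repeat=3)}
# Joint of (X_2, X_0, X_1): mu(x0) P(x0,x1) P(x1,x2), keyed as (x2, x0, x1).
future = {(c, a, b): mu[a] * P[a][b] * P[b][c]
          for a, b, c in product(range(2), repeat=3)}

h_past, h_future = cond_entropy(past), cond_entropy(future)
print(h_past, h_future)
assert abs(h_past - h_future) < 1e-9
```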

33. Chain inequality: Let X1 -> X2 -> X3 -> X4 form a Markov chain. Show that

    I(X1; X3) + I(X2; X4) <= I(X1; X4) + I(X2; X3).

Solution: Chain inequality. X1 -> X2 -> X3 -> X4.

    I(X1; X4) + I(X2; X3) - I(X1; X3) - I(X2; X4)
      = H(X1) - H(X1|X4) + H(X2) - H(X2|X3) - (H(X1) - H(X1|X3)) - (H(X2) - H(X2|X4))
      = H(X1|X3) - H(X1|X4) + H(X2|X4) - H(X2|X3)
      = [H(X1, X2|X3) - H(X2|X1, X3)] - [H(X1, X2|X4) - H(X2|X1, X4)]
        + [H(X1, X2|X4) - H(X1|X2, X4)] - [H(X1, X2|X3) - H(X1|X2, X3)]
      = -H(X2|X1, X3) + H(X2|X1, X4) - H(X1|X2, X4) + H(X1|X2, X3)
      = H(X2|X1, X4) - H(X2|X1, X3),

where the last step uses H(X1|X2, X3) = H(X1|X2) = H(X1|X2, X4) by the Markovity of the random variables. Again by Markovity, H(X2|X1, X3) = H(X2|X1, X3, X4), so

    I(X1; X4) + I(X2; X3) - I(X1; X3) - I(X2; X4) = H(X2|X1, X4) - H(X2|X1, X3, X4) = I(X2; X3 | X1, X4) >= 0.

34. Broadcast channel. Let X -> Y -> (Z, W) form a Markov chain, i.e., p(x, y, z, w) = p(x) p(y|x) p(z, w|y) for all x, y, z, w. Show that

    I(X; Z) + I(X; W) <= I(X; Y) + I(Z; W).

Solution: Broadcast channel. Since X -> Y -> (Z, W), the data processing inequality gives I(X; Y) >= I(X; Z, W). Hence

    I(X; Y) + I(Z; W) - I(X; Z) - I(X; W)
      >= I(X; Z, W) + I(Z; W) - I(X; Z) - I(X; W)
      = [H(Z, W) - H(Z, W|X)] + [H(Z) + H(W) - H(Z, W)] - [H(Z) - H(Z|X)] - [H(W) - H(W|X)]
      = H(Z|X) + H(W|X) - H(Z, W|X)
      = I(Z; W|X)
      >= 0.

35. Concavity of second law. Let {Xn}, for n from minus infinity to infinity, be a stationary Markov process. Show that H(Xn|X0) is concave in n. Specifically, show that

    H(Xn|X0) - H(Xn-1|X0) - (H(Xn-1|X0) - H(Xn-2|X0)) = -I(X1; Xn-1 | X0, Xn) <= 0.

Solution: Concavity of second law of thermodynamics. By stationarity, H(Xn-1|X0) = H(Xn|X1), and since X0 -> X1 -> Xn is a Markov chain, H(Xn|X1) = H(Xn|X1, X0). Therefore

    Delta_n = H(Xn|X0) - H(Xn-1|X0) = H(Xn|X0) - H(Xn|X0, X1) = I(X1; Xn|X0).

Expanding I(X1; Xn-1, Xn|X0) by the chain rule in both orders,

    I(X1; Xn-1, Xn|X0) = I(X1; Xn-1|X0) + I(X1; Xn|X0, Xn-1) = I(X1; Xn|X0) + I(X1; Xn-1|X0, Xn).

Since X1 -> (X0, Xn-1) -> Xn forms a Markov chain, I(X1; Xn|X0, Xn-1) = 0, and hence

    Delta_n - Delta_n-1 = I(X1; Xn|X0) - I(X1; Xn-1|X0) = -I(X1; Xn-1|X0, Xn) <= 0,

which implies that H(Xn|X0) is a concave function of n.
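The concavity can be observed numerically: compute H(Xn|X0) from the n-step transition matrix and check that the second differences are nonpositive. A sketch for an illustrative 2-state chain:

```python
import math

# A 2-state chain; its stationary distribution (mu P = mu) is (0.75, 0.25).
P = [[0.9, 0.1],
     [0.3, 0.7]]
mu = [0.75, 0.25]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def H_row(row):
    return -sum(p * math.log2(p) for p in row if p > 0)

Pn = [[1.0, 0.0], [0.0, 1.0]]
h = []                     # h[n-1] = H(X_n | X_0)
for n in range(12):
    Pn = matmul(Pn, P)
    h.append(sum(mu[i] * H_row(Pn[i]) for i in range(2)))

# Second differences of H(X_n | X_0) should all be <= 0 (concavity).
second = [h[i + 1] - 2 * h[i] + h[i - 1] for i in range(1, len(h) - 1)]
print([round(s, 6) for s in second])
assert all(s <= 1e-9 for s in second)
```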

Chapter 5

Data Compression

1. Uniquely decodable and instantaneous codes. Let L = sum_{i=1..m} pi li^100 be the expected value of the 100th power of the word lengths associated with an encoding of the random variable X. Let L1 = min L over all instantaneous codes, and let L2 = min L over all uniquely decodable codes. What inequality relationship exists between L1 and L2?

Solution: Uniquely decodable and instantaneous codes.

Since all instantaneous codes are uniquely decodable, we must have L2 <= L1. Any set of codeword lengths which achieves the minimum of L2 will satisfy the Kraft inequality, and hence we can construct an instantaneous code with the same codeword lengths, and hence the same L. Hence we have L1 <= L2. From both these conditions, we must have L1 = L2.

2. How many fingers has a Martian? Let

    S = (S1, ..., Sm) with probabilities (p1, ..., pm).

The Si's are encoded into strings from a D-symbol output alphabet in a uniquely decodable manner. If m = 6 and the codeword lengths are (l1, l2, ..., l6) = (1, 1, 2, 3, 2, 3), find a good lower bound on D. You may wish to explain the title of the problem.

Solution: How many fingers has a Martian? Uniquely decodable codes satisfy Kraft's inequality. Therefore

    f(D) = D^-1 + D^-1 + D^-2 + D^-3 + D^-2 + D^-3 <= 1.
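The bound on D can be found by evaluating this Kraft sum mechanically; a quick sketch:

```python
# Kraft sum f(D) = sum_i D**(-l_i) for the Martian codeword lengths.
lengths = [1, 1, 2, 3, 2, 3]

def kraft(D, lengths):
    return sum(D ** -l for l in lengths)

print(kraft(2, lengths))   # 1.75 > 1: a binary alphabet is impossible
print(kraft(3, lengths))   # 26/27 < 1: D = 3 is feasible
assert kraft(2, lengths) > 1
assert kraft(3, lengths) <= 1
```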

We have f(2) = 7/4 > 1, hence D > 2. We have f(3) = 26/27 < 1, so a possible value of D is 3. Our counting system is base 10, probably because we have 10 fingers. Perhaps the Martians were using a base 3 representation because they have 3 fingers. (Maybe they are like Maine lobsters?)

3. Slackness in the Kraft inequality. An instantaneous code has word lengths l1, l2, ..., lm which satisfy the strict inequality

    sum_{i=1..m} D^-li < 1.

The code alphabet is D = {0, 1, 2, ..., D - 1}. Show that there exist arbitrarily long sequences of code symbols in D* which cannot be decoded into sequences of codewords.

Solution: Slackness in the Kraft inequality. Instantaneous codes are prefix-free codes, i.e., no codeword is a prefix of any other codeword. Let nmax = max{n1, n2, ..., nq}. There are D^nmax sequences of length nmax. Of these sequences, D^(nmax - ni) start with the i-th codeword. Because of the prefix condition no two sequences can start with the same codeword. Hence the total number of sequences which start with some codeword is

    sum_{i=1..q} D^(nmax - ni) = D^nmax sum_i D^-ni < D^nmax.

Hence there are sequences which do not start with any codeword. These, and all longer sequences with these length-nmax sequences as prefixes, cannot be decoded. (This situation can be visualized with the aid of a tree.)

Alternatively, we can map codewords onto intervals on the real line corresponding to real numbers whose D-ary expansions start with that codeword. Since the length of the interval for a codeword of length ni is D^-ni, and sum_i D^-ni < 1, there exists some interval(s) not used by any codeword. The sequences in these intervals do not begin with any codeword and hence cannot be decoded.

4. Huffman coding. Consider the random variable

    X = ( x1   x2   x3   x4   x5   x6   x7
          0.49 0.26 0.12 0.04 0.04 0.03 0.02 ).

(a) Find a binary Huffman code for X.
(b) Find the expected codelength for this encoding.
(c) Find a ternary Huffman code for X.

Solution: Examples of Huffman codes.

(a) The Huffman tree for this distribution is

    Codeword
    1      x1  0.49
    00     x2  0.26
    011    x3  0.12
    01000  x4  0.04
    01001  x5  0.04
    01010  x6  0.03
    01011  x7  0.02

(b) The expected length of the codewords for the binary Huffman code is 2.02 bits. (H(X) = 2.01 bits.)

(c) The ternary Huffman tree is

    Codeword
    0    x1  0.49
    1    x2  0.26
    20   x3  0.12
    22   x4  0.04
    210  x5  0.04
    211  x6  0.03
    212  x7  0.02

This code has an expected length of 1.34 ternary symbols. (H3(X) = 1.27 ternary symbols.)

5. More Huffman codes. Find the binary Huffman code for the source with probabilities (1/3, 1/5, 1/5, 2/15, 2/15). Argue that this code is also optimal for the source with probabilities (1/5, 1/5, 1/5, 1/5, 1/5).

Solution: More Huffman codes. The Huffman code for the source with probabilities (1/3, 1/5, 1/5, 2/15, 2/15) has codewords {00, 10, 11, 010, 011}.

To show that this code (*) is also optimal for (1/5, 1/5, 1/5, 1/5, 1/5), we have to show that it has minimum expected length, that is, no shorter code can be constructed without violating H(X) <= EL. For the uniform source, H(X) = log 5 = 2.32 bits, and

    E(L(*)) = 2 x (3/5) + 3 x (2/5) = 12/5 bits.

Since E(L(any code)) = sum_i li / 5 = k/5 for some integer k, the next lowest possible value of E(L) is 11/5 = 2.2 bits, which is less than H(X) = 2.32 bits. Hence (*) is optimal. (Since each Huffman code produced by the Huffman encoding algorithm is optimal, they all have the same average length.) Note that one could also prove the optimality of (*) by showing that the Huffman code for the (1/5, 1/5, 1/5, 1/5, 1/5) source has average length 12/5 bits.
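Both expected lengths above can be reproduced with a small Huffman routine. A sketch, using the standard fact that the expected codeword length equals the sum of the weights of the merged (internal) nodes:

```python
import heapq
from itertools import count

def huffman_expected_length(probs):
    """Expected codeword length of a binary Huffman code, computed as the
    sum of the weights of all merged nodes."""
    tie = count()                       # tie-breaker for equal weights
    heap = [(p, next(tie)) for p in probs]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2
        heapq.heappush(heap, (p1 + p2, next(tie)))
    return total

L4 = huffman_expected_length([0.49, 0.26, 0.12, 0.04, 0.04, 0.03, 0.02])
L5 = huffman_expected_length([1 / 5] * 5)
print(round(L4, 2), L5)    # 2.02 bits and 2.4 bits (= 12/5)
```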

6. Bad codes. Which of these codes cannot be Huffman codes for any probability assignment?

(a) {0, 10, 11}.
(b) {00, 01, 10, 110}.
(c) {01, 10}.

Solution: Bad codes.

(a) {0, 10, 11} is a Huffman code for the distribution (1/2, 1/4, 1/4).
(b) The code {00, 01, 10, 110} can be shortened to {00, 01, 10, 11} without losing its instantaneous property, and therefore is not optimal, so it cannot be a Huffman code. Alternatively, it is not a Huffman code because there is a unique longest codeword.
(c) The code {01, 10} can be shortened to {0, 1} without losing its instantaneous property, and therefore is not optimal and not a Huffman code.

7. Huffman 20 questions. Consider a set of n objects. Let Xi = 1 or 0 accordingly as the i-th object is good or defective. Let X1, X2, ..., Xn be independent with Pr{Xi = 1} = pi, and p1 > p2 > ... > pn > 1/2. We are asked to determine the set of all defective objects. Any yes-no question you can think of is admissible.

(a) Give a good lower bound on the minimum average number of questions required.
(b) If the longest sequence of questions is required by nature's answers to our questions, what (in words) is the last question we should ask? And what two sets are we distinguishing with this question? Assume a compact (minimum average length) sequence of questions.
(c) Give an upper bound (within 1 question) on the minimum average number of questions required.

Solution: Huffman 20 Questions.

(a) We will be using the questions to determine the sequence X1, X2, ..., Xn, where Xi is 1 or 0 according to whether the i-th object is good or defective. Thus a good lower bound on the average number of questions is the entropy of the sequence X1, X2, ..., Xn. But since the Xi's are independent Bernoulli random variables, we have

    EQ >= H(X1, X2, ..., Xn) = sum_i H(Xi) = sum_i H(pi).

(b) The optimal set of questions corresponds to a Huffman code for the source, and the last bit in a Huffman code distinguishes between the two least likely source symbols. (By the conditions of the problem, all the sequence probabilities are different, and thus the two least likely sequences are uniquely defined.) The most likely sequence is all 1's, with probability prod_i pi, and the two least likely sequences are 000...00 and 000...01, which have probabilities (1 - p1)(1 - p2)...(1 - pn) and (1 - p1)(1 - p2)...(1 - pn-1) pn respectively. Thus the last question will ask "Is Xn = 1?", i.e., "Is the last item good?", and it distinguishes the set in which all objects are defective from the set in which all objects except the last are defective.
(c) By the same arguments as in Part (a), an upper bound on the minimum average number of questions is an upper bound on the average length of a Huffman code, namely H(X1, X2, ..., Xn) + 1 = sum_i H(pi) + 1.

8. Simple optimum compression of a Markov source. Consider the 3-state Markov process U1, U2, ... having transition matrix

    Un-1 \ Un   S1    S2    S3
    S1          1/2   1/4   1/4
    S2          1/4   1/2   1/4
    S3          0     1/2   1/2

Thus the probability that S1 follows S3 is equal to zero. Design 3 codes C1, C2, C3 (one for each state 1, 2 and 3), each code mapping elements of the set of Si's into sequences of 0's and 1's, such that this Markov process can be sent with maximal compression by the following scheme:

(a) Note the present symbol Xn = i.
(b) Select code Ci.
(c) Note the next symbol Xn+1 = j and send the codeword in Ci corresponding to j.
(d) Repeat for the next symbol.

What is the average message length of the next symbol conditioned on the previous state Xn = i using this coding scheme? What is the unconditional average number of bits per source symbol? Relate this to the entropy rate H(U) of the Markov chain.

Solution: Simple optimum compression of a Markov source. It is easy to design an optimal code for each state. A possible solution is

    Next state   code C1   code C2   code C3
    S1           0         10        (none)
    S2           10        0         0
    S3           11        11        1

    E(L|C1) = 1.5 bits/symbol, E(L|C2) = 1.5 bits/symbol, E(L|C3) = 1 bit/symbol.

The average message lengths of the next symbol conditioned on the previous state being Si are just the expected lengths of the codes Ci. Note that this code assignment achieves the conditional entropy lower bound.
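The three state codes can be checked mechanically: each must be prefix-free and must meet the conditional entropy of its transition row, and the overall rate then equals the entropy rate. A sketch (the stationary distribution worked out in the solution is (2/9, 4/9, 1/3)):

```python
import math

P = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
codes = [["0", "10", "11"],      # C1
         ["10", "0", "11"],      # C2
         [None, "0", "1"]]       # C3 (S1 is unreachable from S3)

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def is_prefix_free(words):
    return not any(a != b and b.startswith(a) for a in words for b in words)

for row, code in zip(P, codes):
    assert is_prefix_free([c for c in code if c is not None])
    EL = sum(p * len(c) for p, c in zip(row, code) if p > 0)
    assert abs(EL - H(row)) < 1e-12   # meets H(next | state) exactly

# Stationary distribution by power iteration, then overall rate vs. E L.
mu = [1 / 3] * 3
for _ in range(200):
    mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
EL_bar = sum(mu[i] * sum(p * len(c) for p, c in zip(P[i], codes[i]) if p > 0)
             for i in range(3))
rate = sum(mu[i] * H(P[i]) for i in range(3))
print([round(m, 4) for m in mu], EL_bar, rate)   # mu -> (2/9, 4/9, 1/3)
assert abs(EL_bar - rate) < 1e-9
```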

To find the unconditional average, we have to find the stationary distribution on the states. Let mu be the stationary distribution. Then

    mu = mu [ 1/2 1/4 1/4
              1/4 1/2 1/4
              0   1/2 1/2 ].

We can solve this to find that mu = (2/9, 4/9, 1/3). Thus the unconditional average number of bits per source symbol is

    EL = sum_i mu_i E(L|Ci) = (2/9) x 1.5 + (4/9) x 1.5 + (1/3) x 1 = 4/3 bits/symbol.

The entropy rate H of the Markov chain is

    H = H(X2|X1) = sum_i mu_i H(X2|X1 = Si) = 4/3 bits/symbol.

Thus the unconditional average number of bits per source symbol and the entropy rate H of the Markov chain are equal, because the expected length of each code Ci equals the entropy of the state after state i, and thus maximal compression is obtained.

9. Optimal code lengths that require one bit above entropy. The source coding theorem shows that the optimal code for a random variable X has an expected length less than H(X) + 1. Give an example of a random variable for which the expected length of the optimal code is close to H(X) + 1, i.e., for any epsilon > 0, construct a distribution for which the optimal code has L > H(X) + 1 - epsilon.

Solution: Optimal code lengths that require one bit above entropy. There is a trivial example that requires almost 1 bit above its entropy. Let X be a binary random variable with probability of X = 1 close to 1. Then the entropy of X is close to 0, but the length of its optimal code is 1 bit, which is almost 1 bit above its entropy.

10. Ternary codes that achieve the entropy bound. A random variable X takes on m values and has entropy H(X). An instantaneous ternary code is found for this source, with average length

    L = H(X) / log2(3) = H3(X).

(a) Show that each symbol of X has a probability of the form 3^-i for some i.
(b) Show that m is odd.

Solution: Ternary codes that achieve the entropy bound.

(a) We will argue that an optimal ternary code that meets the entropy bound corresponds to a complete ternary tree, with the probability of each leaf of the form 3^-i. To do this, we essentially repeat the arguments of Theorem 5.3.1. We achieve the ternary entropy bound only if D(p||r) = 0 and c = 1, i.e., only if the code lengths exactly match the ideal lengths. Thus we achieve the entropy bound if and only if pi = 3^-j for all i.
(b) We will show that any distribution that has pi = 3^-li for all i must have an odd number of symbols. Since such a code meets the Kraft inequality with equality, sum_i 3^-li = 1, we can construct a complete ternary tree with leaves at the depths li. A complete ternary tree has an odd number of leaves (this can be proved by induction on the number of internal nodes). Thus the number of source symbols is odd.

Another simple argument uses basic number theory. We know that for this distribution sum_i 3^-li = 1. We can write this as 3^-lmax sum_i 3^(lmax - li) = 1, or sum_i 3^(lmax - li) = 3^lmax. Each of the terms in the sum is odd, and since their sum is odd, the number of terms in the sum has to be odd (the sum of an even number of odd terms is even). Thus there are an odd number of source symbols for any code that meets the ternary entropy bound.

11. Suffix condition. Consider codes that satisfy the suffix condition, which says that no codeword is a suffix of any other codeword. Show that a suffix condition code is uniquely decodable, and show that the minimum average length over all codes satisfying the suffix condition is the same as the average length of the Huffman code for that random variable.

Solution: Suffix condition. The fact that the codes are uniquely decodable can be seen easily by reversing the order of the code. For any received sequence, we work backwards from the end, and look for the reversed codewords. Since the codewords satisfy the suffix condition, the reversed codewords satisfy the prefix condition, and we can uniquely decode the reversed code.

The fact that we achieve the same minimum expected length then follows directly from the results of Section 5.5. But we can use the same reversal argument to argue that corresponding to every suffix code there is a prefix code of the same length, and vice versa, and therefore we cannot achieve any lower codeword lengths with a suffix code than we can with a prefix code.

12. Shannon codes and Huffman codes. Consider a random variable X which takes on four values with probabilities (1/3, 1/3, 1/4, 1/12).

(a) Construct a Huffman code for this random variable.

(b) Show that there exist two different sets of optimal lengths for the codewords; namely, show that codeword length assignments (1, 2, 3, 3) and (2, 2, 2, 2) are both optimal.
(c) Conclude that there are optimal codes with codeword lengths for some symbols that exceed the Shannon code length ceil(log 1/p(x)).

Solution: Shannon codes and Huffman codes.

(a) Applying the Huffman algorithm gives us the following table:

    Code   Symbol   Probability
    0      1        1/3    1/3   2/3
    11     2        1/3    1/3   1/3
    101    3        1/4    1/3
    100    4        1/12

which gives codeword lengths of 1, 2, 3, 3 for the different codewords.
(b) Both set of lengths, (1, 2, 3, 3) and (2, 2, 2, 2), satisfy the Kraft inequality with equality, and they both achieve the same expected length (2 bits) for the above distribution. Therefore they are both optimal.
(c) The symbol with probability 1/4 has a Huffman code of length 3, which is greater than ceil(log 1/p) = 2. Thus the Huffman code for a particular symbol may be longer than the Shannon code for that symbol. But on the average, the Huffman code cannot be longer than the Shannon code.

13. Twenty questions. Player A chooses some object in the universe, and player B attempts to identify the object with a series of yes-no questions. Suppose that player B is clever enough to use the code achieving the minimal expected length with respect to player A's distribution. We observe that player B requires an average of 38.5 questions to determine the object. Find a rough lower bound to the number of objects in the universe.

Solution: Twenty questions.

    37.5 = L* - 1 < H(X) <= log |X|,

and hence the number of objects in the universe is greater than 2^37.5 = 1.94 x 10^11.

14. Huffman code. Find the (a) binary and (b) ternary Huffman codes for the random variable X with probabilities

    p = ( 1/21, 2/21, 3/21, 4/21, 5/21, 6/21 ).

(c) Calculate L = sum_i pi li in each case.

Solution: Huffman code.

(a) The binary Huffman tree for this distribution is

    Codeword
    00    x1  6/21
    10    x2  5/21
    11    x3  4/21
    010   x4  3/21
    0110  x5  2/21
    0111  x6  1/21

(b) The ternary Huffman tree is

    Codeword
    1     x1  6/21
    2     x2  5/21
    00    x3  4/21
    01    x4  3/21
    020   x5  2/21
    021   x6  1/21
    022   x7  0/21  (dummy symbol)

(c) The expected length of the codewords for the binary Huffman code is 51/21 = 2.43 bits. The ternary code has an expected length of 34/21 = 1.62 ternary symbols.

15. Huffman codes.

(a) Construct a binary Huffman code for the following distribution on 5 symbols: p = (0.3, 0.3, 0.2, 0.1, 0.1). What is the average length of this code?
(b) Construct a probability distribution p' on 5 symbols for which the code that you constructed in part (a) has an average length (under p') equal to its entropy H(p').

Solution: Huffman codes.

(a) The code constructed by the standard Huffman procedure:

    Codeword   X   Probability
    10         1   0.3
    11         2   0.3
    00         3   0.2
    010        4   0.1
    011        5   0.1

The average length = 2 x 0.8 + 3 x 0.2 = 2.2 bits/symbol.
(b) The code would have a rate equal to the entropy if each of the codewords were of length log(1/p(X)). In this case, the code constructed above would be efficient for the distribution (0.25, 0.25, 0.25, 0.125, 0.125).

16. Huffman codes: Consider a random variable X which takes 6 values {A, B, C, D, E, F} with probabilities (0.5, 0.25, 0.1, 0.05, 0.05, 0.05) respectively.

16. Huffman codes. Consider a random variable X which takes 6 values {A, B, C, D, E, F} with probabilities (0.5, 0.25, 0.1, 0.05, 0.05, 0.05) respectively.
(a) Construct a binary Huffman code for this random variable. What is its average length?
(b) Construct a quaternary Huffman code for this random variable, i.e., a code over an alphabet of four symbols (call them a, b, c and d). What is its average length?
(c) One way to construct a binary code for the random variable is to start with a quaternary code, and convert the symbols into binary using the mapping a → 00, b → 01, c → 10 and d → 11. What is the average length of the binary code for the above random variable constructed by this process?
(d) For any random variable X, let LH be the average length of the binary Huffman code for the random variable, and let LQB be the average length of the code constructed by first building a quaternary Huffman code and converting it to binary. Show that

LH ≤ LQB < LH + 2.   (5.18)

(e) The lower bound in the previous example is tight. Give an example where the code constructed by converting an optimal quaternary code is also the optimal binary code.
(f) The upper bound, LQB < LH + 2, is not tight. In fact, a better bound is LQB ≤ LH + 1. Prove this bound, and provide an example where this bound is tight.

Solution: Huffman codes.
(a) The binary Huffman code is

Code   Source symbol   Prob.
0      A               0.5
10     B               0.25
1100   C               0.1
1101   D               0.05
1110   E               0.05
1111   F               0.05

The average length of this code is 1 × 0.5 + 2 × 0.25 + 4 × (0.1 + 0.05 + 0.05 + 0.05) = 2 bits. The entropy H(X) in this case is 1.98 bits.
(b) Since the number of symbols, i.e., 6, is not of the form 1 + k(D − 1) for D = 4, we need to add a dummy symbol of probability 0 to bring it to this form. With the dummy symbol added, drawing up the Huffman tree is straightforward.

The quaternary Huffman code is

Code   Symbol   Prob.
a      A        0.5
b      B        0.25
d      C        0.1
ca     D        0.05
cb     E        0.05
cc     F        0.05
cd     G        0.0

The average length of this code is 1 × 0.85 + 2 × 0.15 = 1.15 quaternary symbols.

(c) The code constructed by the above process is A → 00, B → 01, C → 11, D → 1000, E → 1001, and F → 1010, and the average length is 2 × 0.85 + 4 × 0.15 = 2.3 bits.

(d) We need to show that

LH ≤ LQB < LH + 2.   (5.19)

Since the binary code constructed from the quaternary code is also instantaneous, its average length cannot be better than the average length of the best instantaneous code, i.e., the Huffman code. That gives the lower bound of the inequality above. To prove the upper bound, let LQ be the average length of the optimal quaternary code. Then from the results proved in the book, we have

H4(X) ≤ LQ < H4(X) + 1.   (5.20)

Also, since each symbol in the quaternary code is converted into two bits, it is easy to see that LQB = 2LQ, and from the properties of entropy, it follows that H4(X) = H2(X)/2. Substituting these in the previous equation, we get

H2(X) ≤ LQB < H2(X) + 2.   (5.21)

Combining this with the bound H2(X) ≤ LH, we obtain LQB < LH + 2.

(e) Consider a random variable that takes on four equiprobable values. Then the quaternary Huffman code for this is 1 quaternary symbol for each source symbol, with average length 1 quaternary symbol. The average length LQB for this code is then 2 bits. The binary Huffman code for this case is also easily seen to assign 2-bit codewords to each symbol, and therefore for this case, LH = LQB.
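Parts (a) and (b) both follow from the same merging procedure, so the whole comparison can be checked with one sketch of a D-ary Huffman length computation (function name is ours; dummy symbols are padded in exactly as described above):

```python
import heapq
from itertools import count

def dary_huffman_lengths(probs, D):
    """Codeword lengths of an optimal D-ary Huffman code. Pads with dummy
    symbols of probability 0 until the count has the form 1 + k(D - 1)."""
    n = len(probs)
    probs = list(probs)
    while (len(probs) - 1) % (D - 1) != 0:
        probs.append(0.0)
    lengths = [0] * len(probs)
    tie = count()
    heap = [[p, next(tie), [i]] for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = [], 0.0
        for _ in range(D):              # merge the D least probable nodes
            pr, _, idxs = heapq.heappop(heap)
            total += pr
            for i in idxs:
                lengths[i] += 1         # every leaf below gains one symbol
            merged += idxs
        heapq.heappush(heap, [total, next(tie), merged])
    return lengths[:n]

p = [0.5, 0.25, 0.1, 0.05, 0.05, 0.05]
print(dary_huffman_lengths(p, 2))  # binary:     [1, 2, 4, 4, 4, 4] -> 2.0 bits
print(dary_huffman_lengths(p, 4))  # quaternary: [1, 1, 1, 2, 2, 2] -> 1.15 symbols
```

Doubling the quaternary lengths gives LQB = 2 × 1.15 = 2.3 bits against LH = 2 bits, consistent with LH ≤ LQB < LH + 1.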

(f) (Optional, no credit.) The upper bound, LQB < LH + 2, is not tight. In fact, a better bound is LQB ≤ LH + 1. Prove this bound, and provide an example where this bound is tight.

Solution: Consider a binary Huffman code for the random variable X and consider all codewords of odd length. Append a 0 to each of these codewords, and we will obtain an instantaneous code where all the codewords have even length. Then we can use the inverse of the mapping mentioned in part (c) to construct a quaternary code for the random variable; it is easy to see that the quaternary code is also instantaneous. Let LBQ be the average length of this quaternary code. Since the lengths of the quaternary codewords of BQ are half the lengths of the corresponding binary codewords, we have

LBQ = (1/2)(LH + Σ_{i: li odd} pi) < (1/2)(LH + 1),   (5.22)

and since the BQ code is at best as good as the quaternary Huffman code, we have

LBQ ≥ LQ.   (5.23)

Therefore LQB = 2LQ ≤ 2LBQ < LH + 1. An example where this upper bound is tight is the case when we have only two possible symbols. Then LH = 1, and LQB = 2.

17. Data compression. Find an optimal set of binary codeword lengths l1, l2, . . . (minimizing Σ pi li) for an instantaneous code for each of the following probability mass functions:
(a) p = (10/41, 9/41, 8/41, 7/41, 7/41)
(b) p = (9/10, (9/10)(1/10), (9/10)(1/10)^2, (9/10)(1/10)^3, . . .)

Solution: Data compression.
(a)

Code   Source symbol   Prob.
10     A               10/41   14/41   17/41   24/41   41/41
00     B                9/41   10/41   14/41   17/41
01     C                8/41    9/41   10/41
110    D                7/41    8/41
111    E                7/41

(b) This is a case of a Huffman code on an infinite alphabet. If we consider an initial subset of the symbols, we can see that the cumulative probability of all symbols {x : x > i} is Σ_{j>i} 0.9(0.1)^(j−1) = 0.9(0.1)^i (1/(1 − 0.1)) = (0.1)^i, which is less than p(i) = 0.9(0.1)^(i−1). Thus Huffman coding will always merge the last two terms, since the cumulative sum of all the remaining terms is less than the last term used. This in turn implies that the Huffman code in this case is of the form 1, 01, 001, 0001, . . . , i.e., li = i.
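The tail-sum argument in part (b) can be checked numerically on a truncation of the alphabet (the cut-off at 29 symbols is an arbitrary choice of ours):

```python
# p_i = 0.9 * (0.1)^(i-1), i = 1, 2, ...; claim: optimal lengths are l_i = i
p = [0.9 * 0.1 ** (i - 1) for i in range(1, 30)]

# Huffman always merges the two least likely symbols, because the total
# probability of every symbol beyond p_i is smaller than p_i itself.
assert all(sum(p[i + 1:]) < p[i] for i in range(len(p) - 1))

# expected length of the code 1, 01, 001, ... is sum_i i * p_i = 1/0.9
EL = sum(i * pi for i, pi in enumerate(p, start=1))
print(EL)  # ~ 1.111... = 1/0.9
```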

18. Classes of codes. Consider the code {0, 01}.
(a) Is it instantaneous?
(b) Is it uniquely decodable?
(c) Is it nonsingular?

Solution: Codes.
(a) No, the code is not instantaneous, since the first codeword, 0, is a prefix of the second codeword, 01.
(b) Yes, the code is uniquely decodable. Given a sequence of codewords, first isolate occurrences of 01 (i.e., find all the ones) and then parse the rest into 0's.
(c) Yes, all uniquely decodable codes are non-singular.

19. The game of Hi-Lo.
(a) A computer generates a number X according to a known probability mass function p(x), x ∈ {1, 2, . . . , 100}. The player asks a question, "Is X = i?" and is told "Yes," "You're too high," or "You're too low." He continues for a total of six questions. If he is right (i.e., he receives the answer "Yes") during this sequence, he receives a prize of value v(X). How should the player proceed to maximize his expected winnings?
(b) The above doesn't have much to do with information theory. Consider the following variation: X ∼ p(x), prize = v(x), p(x) known, as before. But arbitrary Yes-No questions are asked sequentially until X is determined. ("Determined" doesn't mean that a "Yes" answer is received.) Questions cost one unit each. How should the player proceed? What is the expected payoff?
(c) Continuing (b), what if v(x) is fixed, but p(x) can be chosen by the computer (and then announced to the player)? The computer wishes to minimize the player's expected return. What should p(x) be? What is the expected return to the player?

Solution: The game of Hi-Lo.
(a) The first thing to recognize in this problem is that the player cannot cover more than 63 values of X with 6 questions. This can be easily seen by induction: with one question, there is only one value of X that can be covered; with two questions, there is one value of X that can be covered with the first question, and depending on the answer to the first question, there are two possible values of X that can be asked about in the next question. By extending this argument, we see that we can ask at most 63 different questions of the form "Is X = i?" with 6 questions. (The fact that we have narrowed the range at the end is irrelevant to this procedure.) Thus if the player seeks to maximize his return, he should choose the 63 most valuable outcomes for X, and play to isolate these values. His first question will be "Is X = i?" where i is the median of these 63 numbers. If we have not isolated the value of X, then after isolating to either half, his next question will be "Is X = j?", where j is the median of that half. Proceeding this way, he will win if X is one of the 63 most valuable outcomes, and lose otherwise. This strategy maximizes his expected winnings.

(b) Now if arbitrary questions are allowed, the game reduces to a game of 20 questions to determine the object. Maximizing the return is equivalent to minimizing the expected number of questions, and as argued in the text, the optimal strategy is to construct a Huffman code for the source and use that to construct a question strategy. The return in this case to the player is Σx p(x)(v(x) − l(x)), where l(x) is the number of questions required to determine the object. His expected return is therefore between Σ p(x)v(x) − H and Σ p(x)v(x) − H − 1.

(c) A computer wishing to minimize the return to the player will want to minimize Σ p(x)v(x) − H(X) over choices of p(x). We can write this as a standard minimization problem with constraints. Let

J(p) = Σi pi vi + Σi pi log pi + λ Σi pi.   (5.24)

Differentiating and setting to 0, we obtain

vi + log pi + 1 + λ = 0,   (5.25)

or, after normalizing to ensure that the pi's form a probability distribution,

pi = 2^(-vi) / Σj 2^(-vj).   (5.26)

To complete the proof, we let ri = 2^(-vi) / Σj 2^(-vj) and rewrite the objective as

Σ pi vi + Σ pi log pi = −Σ pi log 2^(-vi) + Σ pi log pi   (5.27)
                      = Σ pi log (pi/ri) − log (Σj 2^(-vj))   (5.28)
                      = D(p||r) − log (Σj 2^(-vj)),   (5.29)

and thus it is minimized by choosing pi = ri. This is the distribution that the computer must choose to minimize the return to the player.
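The stationarity condition above can be verified numerically. A sketch (function names and the prize values are our own choices) comparing the objective Σ pi vi − H(p) at pi ∝ 2^(-vi) against randomly perturbed distributions:

```python
import random
from math import log2

def objective(p, v):
    # expected prize minus entropy: the quantity the computer minimizes
    return sum(pi * vi + pi * log2(pi) for pi, vi in zip(p, v) if pi > 0)

v = [1.0, 2.0, 3.0, 5.0]                 # assumed prize values
Z = sum(2.0 ** -vi for vi in v)
p_star = [2.0 ** -vi / Z for vi in v]    # the claimed minimizer p_i = r_i

# at the minimizer the objective equals -log2(sum_j 2^(-v_j))
assert abs(objective(p_star, v) + log2(Z)) < 1e-12

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in v]
    q = [wi / sum(w) for wi in w]
    assert objective(q, v) >= objective(p_star, v) - 1e-12
```

Every perturbed distribution scores at least D(q||r) higher, matching (5.29).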
20. Huffman codes with costs. Words like Run! Help! and Fire! are short, not because they are frequently used, but perhaps because time is precious in the situations in which these words are required. Suppose that X = i with probability pi, i = 1, 2, . . . , m. Let li be the number of binary symbols in the codeword associated with X = i, and let ci denote the cost per letter of the codeword when X = i. Thus the average cost C of the description of X is C = Σ_{i=1}^m pi ci li.

(a) Minimize C over all l1, l2, . . . , lm such that Σ 2^(-li) ≤ 1. Ignore any implied integer constraints on li. Exhibit the minimizing l1*, . . . , lm* and the associated minimum value C*.
(b) How would you use the Huffman code procedure to minimize C over all uniquely decodable codes? Let CHuffman denote this minimum.
(c) Can you show that

C* ≤ CHuffman ≤ C* + Σ_{i=1}^m pi ci ?

Solution: Huffman codes with costs.
(a) We wish to minimize C = Σ pi ci ni subject to Σ 2^(-ni) ≤ 1. We will assume equality in the constraint and let ri = 2^(-ni) and let Q = Σi pi ci. Let qi = (pi ci)/Q. Then q also forms a probability distribution and we can write C as

C = Q Σ qi ni   (5.30)
  = Q Σ qi log (1/ri)   (5.31)
  = Q ( Σ qi log (qi/ri) + Σ qi log (1/qi) )   (5.32)
  = Q ( D(q||r) + H(q) ).   (5.33)

Since the only freedom is in the choice of ri, we can minimize C by choosing r = q, or

ni* = − log ( pi ci / Σj pj cj ),   (5.34)

where we have ignored any integer constraints on ni. The minimum cost C* for this assignment of codewords is

C* = Q H(q).   (5.35)

(b) If we use q instead of p for the Huffman procedure, we obtain a code minimizing expected cost.
(c) Now we can account for the integer constraints. Let

ni = ⌈− log qi⌉.   (5.36)

Then

− log qi ≤ ni < − log qi + 1.   (5.37)

Multiplying by pi ci and summing over i, we get the relationship

C* ≤ CHuffman < C* + Q.   (5.38)
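The reduction in part (b) — run ordinary Huffman coding on the tilted distribution qi = pi ci / Q — can be exercised directly. A sketch with made-up probabilities and costs (all names and numbers here are our own illustration):

```python
import heapq
from itertools import count
from math import log2

def huffman_lengths(probs):
    """Binary Huffman codeword lengths for the given probabilities."""
    tie = count()
    lengths = [0] * len(probs)
    heap = [[pr, next(tie), [i]] for i, pr in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        for i in a[2] + b[2]:
            lengths[i] += 1
        heapq.heappush(heap, [a[0] + b[0], next(tie), a[2] + b[2]])
    return lengths

p = [0.5, 0.25, 0.15, 0.1]      # symbol probabilities (assumed)
c = [1.0, 4.0, 2.0, 8.0]        # per-letter costs (assumed)
Q = sum(pi * ci for pi, ci in zip(p, c))
q = [pi * ci / Q for pi, ci in zip(p, c)]

cost = lambda L: sum(pi * ci * li for pi, ci, li in zip(p, c, L))
L_q = huffman_lengths(q)        # Huffman run on the tilted distribution
H_q = -sum(x * log2(x) for x in q)

assert Q * H_q <= cost(L_q) < Q * (H_q + 1)   # C* <= C_Huffman < C* + Q
assert cost(L_q) <= cost(huffman_lengths(p))  # beats coding for p alone
```

Expensive-but-rare symbols get short codewords because their qi is inflated by the cost weight.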

21. Conditions for unique decodability. Prove that a code C is uniquely decodable if (and only if) the extension

C^k(x1, x2, . . . , xk) = C(x1)C(x2) · · · C(xk)

is a one-to-one mapping from X^k to D* for every k ≥ 1. (The only if part is obvious.)

Solution: Conditions for unique decodability. If C^k is not one-to-one for some k, then C is not UD, since there exist two distinct sequences, (x1, . . . , xk) and (x1', . . . , xk'), such that

C^k(x1, . . . , xk) = C(x1) · · · C(xk) = C(x1') · · · C(xk') = C^k(x1', . . . , xk').

Conversely, if C is not UD then by definition there exist distinct sequences of source symbols, (x1, . . . , xi) and (y1, . . . , yj), such that

C(x1)C(x2) · · · C(xi) = C(y1)C(y2) · · · C(yj).

Concatenating the input sequences (x1, . . . , xi) and (y1, . . . , yj), we obtain

C(x1) · · · C(xi)C(y1) · · · C(yj) = C(y1) · · · C(yj)C(x1) · · · C(xi),

which shows that C^k is not one-to-one for k = i + j.

22. Average length of an optimal code. Prove that L(p1, . . . , pm), the average codeword length for an optimal D-ary prefix code for probabilities {p1, . . . , pm}, is a continuous function of p1, . . . , pm. This is true even though the optimal code changes discontinuously as the probabilities vary.

Solution: Average length of an optimal code. For each candidate code C, the average codeword length is determined by the probability distribution p1, . . . , pm:

L(C) = Σ_{i=1}^m pi li.

This is a linear, and therefore continuous, function of p1, . . . , pm. The longest possible codeword in an optimal code has m − 1 binary digits; this corresponds to a completely unbalanced tree in which each codeword has a different length. Since we know the maximum possible codeword length, there are only a finite number of possible codes to consider. The optimal code is the candidate code with the minimum L, and its length is the minimum of a finite number of continuous functions and is therefore itself a continuous function of p1, . . . , pm.

23. Unused code sequences. Let C be a variable length code that satisfies the Kraft inequality with equality but does not satisfy the prefix condition.
(a) Prove that some finite sequence of code alphabet symbols is not the prefix of any sequence of codewords.
(b) (Optional) Prove or disprove: C has infinite decoding delay.

Solution: Unused code sequences. Let C be a variable length code that satisfies the Kraft inequality with equality but does not satisfy the prefix condition.
(a) When a prefix code satisfies the Kraft inequality with equality, every (infinite) sequence of code alphabet symbols corresponds to a sequence of codewords, since the probability that a randomly generated sequence begins with a codeword is Σ_{i=1}^m D^(-li) = 1. If the code does not satisfy the prefix condition, then at least one codeword, say C(x1), is a prefix of another, say C(xm). Then the probability that a randomly generated sequence begins with a codeword is at most

Σ_{i=1}^{m−1} D^(-li) ≤ 1 − D^(-lm) < 1,

which shows that not every sequence of code alphabet symbols is the beginning of a sequence of codewords.
(b) (Optional) A reference to a paper proving that C has infinite decoding delay will be supplied later. It is easy to see by example that the decoding delay cannot be finite. A simple example of a code that satisfies the Kraft inequality but not the prefix condition is a suffix code (see problem 11). The simplest non-trivial suffix code is one for three symbols, {0, 01, 11}. For such a code, consider decoding a string 011111 . . . 1110. If the number of 1's is even, then the string must be parsed 0, 11, 11, . . . , 11, 0, whereas if the number of 1's is odd, the string must be parsed 01, 11, . . . , 11, 10. Thus the string cannot be decoded until the string of 1's has ended, and therefore the decoding delay could be infinite.

24. Optimal codes for uniform distributions. Consider a random variable with m equiprobable outcomes. The entropy of this information source is obviously log2 m bits.
(a) Describe the optimal instantaneous binary code for this source and compute the average codeword length Lm.
(b) For what values of m does the average codeword length Lm equal the entropy H = log2 m?
(c) We know that L < H + 1 for any probability distribution. The redundancy of a variable length code is defined to be ρ = L − H. For what value(s) of m is the redundancy of the code maximized? What is the limiting value of this worst case redundancy as m → ∞?

Solution: Optimal codes for uniform distributions.
(a) For uniformly probable codewords, there exists an optimal binary variable length prefix code such that the longest and shortest codewords differ by at most one bit. If two codes differ by 2 bits or more, call ms the message with the shorter codeword Cs and ml the message with the longer codeword Cl.

Change the codewords for these two messages so that the new codeword Cs' is the old Cs with a zero appended (Cs' = Cs0) and Cl' is the old Cs with a one appended (Cl' = Cs1). Cs' and Cl' are legitimate codewords since no other codeword contained Cs as a prefix (by definition of a prefix code), so obviously no other codeword could contain Cs' or Cl' as a prefix. The length of the codeword for ms increases by 1 and the length of the codeword for ml decreases by at least 1. Since these messages are equally likely, L' ≤ L. By this method we can transform any optimal code into a code in which the length of the shortest and longest codewords differ by at most one bit. (In fact, it is easy to see that every optimal code has this property.)

For a source with n messages, l(ms) = ⌊log2 n⌋ and l(ml) = ⌈log2 n⌉. Let d be the difference between n and the next smaller power of 2:

d = n − 2^⌊log2 n⌋.

Then the optimal code has 2d codewords of length ⌈log2 n⌉ and n − 2d codewords of length ⌊log2 n⌋. This gives

L = (1/n)(2d ⌈log2 n⌉ + (n − 2d) ⌊log2 n⌋)
  = (1/n)(n ⌊log2 n⌋ + 2d)
  = ⌊log2 n⌋ + 2d/n.

Note that d = 0 is a special case in the above equation.

(b) The average codeword length equals entropy if and only if n is a power of 2. To see this, consider the following calculation of L:

L = Σi pi li = −Σi pi log2 2^(-li) = H + D(p||q),

where qi = 2^(-li). Therefore L = H only if pi = qi, that is, when all codewords have equal length, or when d = 0.

(c) For n = 2^m + d, the redundancy r = L − H is given by

r = L − log2 n
  = ⌊log2 n⌋ + 2d/n − log2 n
  = m + 2d/(2^m + d) − ln(2^m + d)/ln 2.

Therefore

∂r/∂d = [2(2^m + d) − 2d]/(2^m + d)^2 − (1/ln 2) · 1/(2^m + d).



Setting this equal to zero implies d* = 2^m (2 ln 2 − 1). Since there is only one maximum, and since the function is concave (∩), the maximizing d is one of the two integers nearest (0.3863)(2^m). The corresponding maximum redundancy is

r* ≈ m + 2d*/(2^m + d*) − ln(2^m + d*)/ln 2
   = m + 2(0.3863)(2^m)/(2^m + (0.3863)(2^m)) − ln(2^m + (0.3863)(2^m))/ln 2
   ≈ 0.0861.
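The worst-case redundancy can be confirmed numerically by scanning the exact formula L = ⌊log2 n⌋ + 2d/n over a finite range of n (the cut-off 2^16 is an arbitrary choice of ours):

```python
from math import floor, log2

def redundancy(n):
    """L - log2(n) for n equiprobable messages, with
    L = floor(log2 n) + 2d/n and d = n - 2^floor(log2 n)."""
    m = floor(log2(n))
    d = n - 2 ** m
    return m + 2 * d / n - log2(n)

r_star, n_star = max((redundancy(n), n) for n in range(2, 1 << 16))
print(round(r_star, 4), n_star)  # ~ 0.0861, at n near 1.3863 * 2^m
```

The maximum sits where n ≈ (2 ln 2) 2^m, i.e., d ≈ 0.3863 · 2^m, in agreement with the derivative calculation above.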

This is achieved with arbitrary accuracy as n → ∞. (The quantity σ = 0.0861 is one of the lesser fundamental constants of the universe. See Robert Gallager [3].)

25. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code are complicated functions of the message probabilities {p1, p2, . . . , pm}, it can be said that less probable symbols are encoded into longer codewords. Suppose that the message probabilities are given in decreasing order p1 > p2 ≥ · · · ≥ pm.
(a) Prove that for any binary Huffman code, if the most probable message symbol has probability p1 > 2/5, then that symbol must be assigned a codeword of length 1.
(b) Prove that for any binary Huffman code, if the most probable message symbol has probability p1 < 1/3, then that symbol must be assigned a codeword of length ≥ 2.

Solution: Optimal codeword lengths. Let {c1, c2, . . . , cm} be codewords of respective lengths {l1, l2, . . . , lm} corresponding to probabilities {p1, p2, . . . , pm}.
(a) We prove that if p1 > p2 and p1 > 2/5 then l1 = 1. Suppose, for the sake of contradiction, that l1 ≥ 2. Then there are no codewords of length 1; otherwise c1 would not be the shortest codeword. Without loss of generality, we can assume that c1 begins with 00. For x, y ∈ {0, 1} let Cxy denote the set of codewords beginning with xy. Then the sets C01, C10, and C11 have total probability 1 − p1 < 3/5, so some two of these sets (without loss of generality, C10 and C11) have total probability less than 2/5. We can now obtain a better code by interchanging the subtree of the decoding tree beginning with 1 with the subtree beginning with 00; that is, we replace codewords of the form 1x . . . by 00x . . . and codewords of the form 00y . . . by 1y . . . . This improvement contradicts the assumption that l1 ≥ 2, and so l1 = 1.
(Note that p1 > p2 was a hidden assumption for this problem; otherwise, for example, the probabilities {.49, .49, .02} have the optimal code {00, 1, 01} .)

(b) The argument is similar to that of part (a). Suppose, for the sake of contradiction, that l1 = 1. Without loss of generality, assume that c1 = 0. The total probability of C10 and C11 is 1 − p1 > 2/3, so at least one of these two sets (without loss of generality, C10) has probability greater than 1/3. We can now obtain a better code by interchanging the subtree of the decoding tree beginning with 0 with the



subtree beginning with 10; that is, we replace codewords of the form 10x . . . by 0x . . . and we let c1 = 10. This improvement contradicts the assumption that l1 = 1, and so l1 ≥ 2.

26. Merges. Companies with values W1, W2, . . . , Wm are merged as follows. The two least valuable companies are merged, thus forming a list of m − 1 companies. The value of the merge is the sum of the values of the two merged companies. This continues until one supercompany remains. Let V equal the sum of the values of the merges. Thus V represents the total reported dollar volume of the merges. For example, if W = (3, 3, 2, 2), the merges yield (3, 3, 2, 2) → (4, 3, 3) → (6, 4) → (10), and V = 4 + 6 + 10 = 20.
(a) Argue that V is the minimum volume achievable by sequences of pair-wise merges terminating in one supercompany. (Hint: Compare to Huffman coding.)
(b) Let W = Σ Wi and W̃i = Wi/W, and show that the minimum merge volume V satisfies

W H(W̃) ≤ V ≤ W H(W̃) + W.   (5.39)

Solution: Merges.
(a) We first normalize the values of the companies to add to one. The total volume of the merges is equal to the sum of the value of each company times the number of times it takes part in a merge. This is identical to the average length of a Huffman code, with a tree which corresponds to the merges. Since Huffman coding minimizes average length, this scheme of merges minimizes total merge volume.
(b) Just as in the case of Huffman coding, where

H ≤ EL < H + 1,   (5.40)

we have in this case for the corresponding merge scheme

W H(W̃) ≤ V ≤ W H(W̃) + W.   (5.41)
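The greedy merge rule is exactly the Huffman merge, so the example W = (3, 3, 2, 2) and the entropy sandwich can be checked with a few lines (function name is ours):

```python
import heapq
from math import log2

def merge_volume(values):
    """Total reported volume when the two least valuable companies are
    merged repeatedly -- the same greedy rule as Huffman coding."""
    heap = list(values)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

W = [3, 3, 2, 2]
V = merge_volume(W)
print(V)  # 20, matching (3,3,2,2) -> (4,3,3) -> (6,4) -> (10)

# sanity check of the bound W*H(W~) <= V <= W*H(W~) + W
tot = sum(W)
H = -sum(w / tot * log2(w / tot) for w in W)
assert tot * H <= V <= tot * H + tot
```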

27. The Sardinas-Patterson test for unique decodability. A code is not uniquely decodable if and only if there exists a finite sequence of code symbols which can be resolved in two different ways into sequences of codewords. That is, a situation such as

| A1     | A2      | A3    |  · · ·  | Am |
| B1 | B2 |   B3  |      · · ·      | Bn |

must occur where each Ai and each Bi is a codeword. Note that B1 must be a preﬁx of A1 with some resulting “dangling suﬃx.” Each dangling suﬃx must in turn be either a preﬁx of a codeword or have another codeword as its preﬁx, resulting in another dangling suﬃx. Finally, the last dangling suﬃx in the sequence must also be a codeword. Thus one can set up a test for unique decodability (which is essentially the Sardinas-Patterson test[5]) in the following way: Construct a set S of all possible dangling suﬃxes. The code is uniquely decodable if and only if S contains no codeword.


(a) State the precise rules for building the set S .


(b) Suppose the codeword lengths are l i , i = 1, 2, . . . , m . Find a good upper bound on the number of elements in the set S . (c) Determine which of the following codes is uniquely decodable: i. ii. iii. iv. v. vi. vii. {0, 10, 11} . {0, 01, 11} . {0, 01, 10} . {0, 01} . {00, 01, 10, 11} . {110, 11, 10} . {110, 11, 100, 00, 10} .

(d) For each uniquely decodable code in part (c), construct, if possible, an infinite encoded sequence with a known starting point, such that it can be resolved into codewords in two different ways. (This illustrates that unique decodability does not imply finite decodability.) Prove that such a sequence cannot arise in a prefix code.

Solution: Test for unique decodability. The proof of the Sardinas-Patterson test has two parts. In the first part, we will show that if there is a code string that has two different interpretations, then the code will fail the test. The simplest case is when the concatenation of two codewords yields another codeword. In this case, S2 will contain a codeword, and hence the test will fail. In general, the code is not uniquely decodable iff there exists a string that admits two different parsings into codewords, e.g.

x1 x2 x3 x4 x5 x6 x7 x8 = x1 x2, x3 x4 x5, x6 x7 x8 = x1 x2 x3 x4, x5 x6 x7 x8.   (5.42)

In this case, S2 will contain the string x3 x4, S3 will contain x5, and S4 will contain x6 x7 x8, which is a codeword. It is easy to see that this procedure will work for any string that has two different parsings into codewords; a formal proof is slightly more difficult and uses induction. In the second part, we will show that if there is a codeword in one of the sets Si, i ≥ 2, then there exists a string with two different possible interpretations, thus showing that the code is not uniquely decodable. To do this, we essentially reverse the construction of the sets. We will not go into the details: the reader is referred to the original paper.
(a) Let S1 be the original set of codewords. We construct Si+1 from Si as follows: a string y is in Si+1 iff there is a codeword x in S1 such that xy is in Si, or if there exists a z ∈ Si such that zy is in S1 (i.e., is a codeword). Then the code is uniquely decodable iff none of the Si, i ≥ 2, contains a codeword. Thus the set S = ∪_{i≥2} Si.
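The rules in (a) translate almost line for line into code. A sketch of the test (function name is ours) that can be run against all seven codes from part (c):

```python
def uniquely_decodable(code):
    """Sardinas-Patterson test: grow the dangling-suffix sets S2, S3, ...
    The code is uniquely decodable iff no set ever contains a codeword."""
    C = set(code)
    # S2: dangling suffixes left when one codeword is a prefix of another
    S = {b[len(a):] for a in C for b in C if a != b and b.startswith(a)}
    seen = set()
    while S:
        if S & C:
            return False            # a dangling suffix is itself a codeword
        seen |= S
        nxt = set()
        for s in S:
            for c in C:
                if c != s and c.startswith(s):
                    nxt.add(c[len(s):])   # s is a prefix of a codeword
                if s != c and s.startswith(c):
                    nxt.add(s[len(c):])   # a codeword is a prefix of s
        S = nxt - seen              # stop once the sets start repeating
    return True

for code in (['0', '10', '11'], ['0', '01', '11'], ['0', '01', '10'],
             ['0', '01'], ['00', '01', '10', '11'], ['110', '11', '10'],
             ['110', '11', '100', '00', '10']):
    print(code, uniquely_decodable(code))  # only {0, 01, 10} fails
```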

(b) A simple upper bound can be obtained from the fact that all strings in the sets Si have length less than lmax, and therefore the maximum number of elements in S is less than 2^lmax.

(c) i. {0, 10, 11}. This code is instantaneous and therefore uniquely decodable.
ii. {0, 01, 11}. This code is a suffix code (see problem 11) and is therefore uniquely decodable. The sets in the Sardinas-Patterson test are S1 = {0, 01, 11} and S2 = {1} = S3 = S4 = · · ·; since no Si, i ≥ 2, contains a codeword, the code is UD.
iii. {0, 01, 10}. This code is not uniquely decodable. The test produces sets S1 = {0, 01, 10}, S2 = {1}, S3 = {0}; since 0 is a codeword, this code fails the test. It is easy to see otherwise that the code is not UD: the string 010 has two valid parsings, 0, 10 and 01, 0.
iv. {0, 01}. This code is uniquely decodable; the test gives S2 = {1}, S3 = φ.
v. {00, 01, 10, 11}. This code is instantaneous and hence uniquely decodable.
vi. {110, 11, 10}. This code is a suffix code and is therefore UD; the test gives S1 = {110, 11, 10}, S2 = {0}, S3 = φ.
vii. {110, 11, 100, 00, 10}. This code is UD, because by the Sardinas-Patterson test, S2 = {0} = S3 = S4 = · · ·, and no Si, i ≥ 2, contains a codeword.

(d) For the instantaneous codes, it is not possible to construct such a string, since we can decode as soon as we see a codeword string, and there is no way that we would need to wait to decode. We can produce infinite strings which can be decoded in two ways only for examples where the Sardinas-Patterson test produces a repeating set. For example, in part (ii), the string 011111 . . . could be parsed either as 0, 11, 11, . . . or as 01, 11, 11, . . . . Similarly for (vii), the string 10000 . . . could be parsed as 100, 00, 00, . . . or as 10, 00, 00, . . . .

28. Shannon code. Consider the following method for generating a code for a random variable X which takes on m values {1, 2, . . . , m} with probabilities p1, p2, . . . , pm. Assume that the probabilities are ordered so that p1 ≥ p2 ≥ · · · ≥ pm. Define

Fi = Σ_{k=1}^{i−1} pk,   (5.43)

the sum of the probabilities of all symbols less than i. Then the codeword for i is the number Fi ∈ [0, 1] rounded off to li bits, where li = ⌈log(1/pi)⌉.
(a) Show that the code constructed by this process is prefix-free and the average length satisfies

H(X) ≤ L < H(X) + 1.   (5.44)

(b) Construct the code for the probability distribution (0.5, 0.25, 0.125, 0.125).

Solution: Shannon code.
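Before the analytic proof, the construction itself can be sketched in code (a hypothetical helper of ours; probabilities are assumed sorted in decreasing order, as in the problem statement):

```python
from math import ceil, log2

def shannon_code(probs):
    """Shannon code: codeword i is the cumulative probability F_i written
    in binary, truncated to l_i = ceil(log2(1/p_i)) bits."""
    codes, F = [], 0.0
    for p in probs:
        l = max(1, ceil(log2(1 / p)))
        word, frac = '', F
        for _ in range(l):          # binary expansion of F to l bits
            frac *= 2
            bit = int(frac)
            word += str(bit)
            frac -= bit
        codes.append(word)
        F += p
    return codes

print(shannon_code([0.5, 0.25, 0.125, 0.125]))  # ['0', '10', '110', '111']
```

This reproduces the table worked out in part (b) of the solution.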

(a) Since li = ⌈log(1/pi)⌉, we have

log(1/pi) ≤ li < log(1/pi) + 1,   (5.45)

which implies that

H(X) ≤ L = Σ pi li < H(X) + 1.   (5.46)

The difficult part is to prove that the code is a prefix code. By the choice of li, we have

2^(-li) ≤ pi < 2^(-(li−1)).   (5.47)

Thus Fj, j > i, differs from Fi by at least 2^(-li), and will therefore differ from Fi in at least one place in the first li bits of the binary expansion of Fi. Thus the codeword for Fj, j > i, which has length lj ≥ li, differs from the codeword for Fi at least once in the first li places. Thus no codeword is a prefix of any other codeword.

(b) We build the following table:

Symbol   Probability   Fi in decimal   Fi in binary   li   Codeword
1        0.5           0.0             0.0            1    0
2        0.25          0.5             0.10           2    10
3        0.125         0.75            0.110          3    110
4        0.125         0.875           0.111          3    111

The Shannon code in this case achieves the entropy bound (1.75 bits) and is optimal.

29. Optimal codes for dyadic distributions. For a Huffman code tree, define the probability of a node as the sum of the probabilities of all the leaves under that node. Let the random variable X be drawn from a dyadic distribution, i.e., p(x) = 2^(-i), for some i, for all x ∈ X. Now consider a binary Huffman code for this distribution.
(a) Argue that for any node in the tree, the probability of the left child is equal to the probability of the right child.
(b) Let X1, X2, . . . , Xn be drawn i.i.d. ∼ p(x). Using the Huffman code for p(x), we map X1, X2, . . . , Xn to a sequence of bits Y1, Y2, . . . , Yk(X1,X2,...,Xn). (The length of this sequence will depend on the outcome X1, X2, . . . , Xn.) Use part (a) to argue that the sequence Y1, Y2, . . . forms a sequence of fair coin flips, i.e., that Pr{Yi = 0} = Pr{Yi = 1} = 1/2, independent of Y1, Y2, . . . , Yi−1. Thus the entropy rate of the coded sequence is 1 bit per symbol.
(c) Give a heuristic argument why the encoded sequence of bits for any code that achieves the entropy bound cannot be compressible and therefore should have an entropy rate of 1 bit per symbol.

Solution: Optimal codes for dyadic distributions.

(a) For a dyadic distribution, the Huffman code achieves the entropy bound. The code tree constructed by the Huffman algorithm is a complete tree with leaves at depth li with probability pi = 2^(-li). For such a complete binary tree, we can prove the following properties:
• The probability of any internal node at depth k is 2^(-k). We can prove this by induction. Clearly, it is true for a tree with 2 leaves. Assume that it is true for all trees with n leaves. For any tree with n + 1 leaves, at least two of the leaves have to be siblings on the tree (else the tree would not be complete). Let the level of these siblings be j. The parent of these two siblings (at level j − 1) has probability 2^(-j) + 2^(-j) = 2^(-(j−1)). We can now replace the two siblings with their parent, without changing the probability of any other internal node. But now we have a tree with n leaves which satisfies the required property. Thus, by induction, the property is true for all complete binary trees.
• From the above property, it follows immediately that the probability of the left child is equal to the probability of the right child.
(b) For a sequence X1, X2, we can construct a code tree by first constructing the optimal tree for X1, and then attaching the optimal tree for X2 to each leaf of the optimal tree for X1. Proceeding this way, we can construct the code tree for X1, X2, . . . , Xn. When Xi are drawn i.i.d. according to a dyadic distribution, it is easy to see that the code tree constructed will also be a complete tree with the properties in part (a). Thus the probability of the first bit being 1 is 1/2, and at any internal node, the probability of the next bit produced by the code being 1 is equal to the probability of the next bit being 0. Thus the bits produced by the code are i.i.d. Bernoulli(1/2), and the entropy rate of the coded sequence is 1 bit per symbol.
(c) Assume that we have a coded sequence of bits from a code that met the entropy bound with equality. If the coded sequence were compressible, then we could use the compressed version of the coded sequence as our code, and achieve an average length less than the entropy bound, which would contradict the bound. Thus the coded sequence cannot be compressible, and thus must have an entropy rate of 1 bit/symbol.

30. Relative entropy is cost of miscoding. Let the random variable X have five possible outcomes {1, 2, 3, 4, 5}. Consider two distributions p(x) and q(x) on this random variable.

Symbol   p(x)    q(x)   C1(x)   C2(x)
1        1/2     1/2    0       0
2        1/4     1/8    10      100
3        1/8     1/8    110     101
4        1/16    1/8    1110    110
5        1/16    1/8    1111    111

(a) Calculate H(p), H(q), D(p||q) and D(q||p).
. D (p||q ) and D (q ||p) . Clearly. X2 . 3. Thus the bits produced by the code are i.i. and thus must have an entropy rate of 1 bit/symbol. If the coded sequence were compressible. according to a dyadic distribution. we can construct a code tree by ﬁrst constructing the optimal tree for X1 . and achieve an average length less than the entropy bound. But now we have a tree with n leaves which satisﬁes the required property. Bernoulli(1/2). and then attaching the optimal tree for X 2 to each leaf of the optimal tree for X1 . without changing the probability of any other internal node. X2 .d. at least two of the leaves have to be siblings on the tree (else the tree would not be complete).

(b) The last two columns above represent codes for the random variable. Verify that the average length of $C_1$ under $p$ is equal to the entropy $H(p)$. Thus $C_1$ is an efficient code for $p(x)$. Verify that $C_2$ is optimal for $q$.

(c) Now assume that we use code $C_2$ when the distribution is $p$. What is the average length of the codewords? By how much does it exceed the entropy $H(p)$?

(d) What is the loss if we use code $C_1$ when the distribution is $q$?

Solution: Cost of miscoding.

(a) $H(p) = \frac12 \log 2 + \frac14 \log 4 + \frac18 \log 8 + \frac{1}{16} \log 16 + \frac{1}{16} \log 16 = 1.875$ bits.
$H(q) = \frac12 \log 2 + \frac18 \log 8 + \frac18 \log 8 + \frac18 \log 8 + \frac18 \log 8 = 2$ bits.
$D(p\|q) = \frac12 \log\frac{1/2}{1/2} + \frac14 \log\frac{1/4}{1/8} + \frac18 \log\frac{1/8}{1/8} + \frac{1}{16} \log\frac{1/16}{1/8} + \frac{1}{16} \log\frac{1/16}{1/8} = 0.125$ bits.
$D(q\|p) = \frac12 \log\frac{1/2}{1/2} + \frac18 \log\frac{1/8}{1/4} + \frac18 \log\frac{1/8}{1/8} + \frac18 \log\frac{1/8}{1/16} + \frac18 \log\frac{1/8}{1/16} = 0.125$ bits.

(b) The average length of $C_1$ for $p(x)$ is 1.875 bits, which is the entropy of $p$. Thus $C_1$ is an efficient (indeed optimal) code for $p$. Similarly, the average length of code $C_2$ under $q(x)$ is 2 bits, which is the entropy of $q$. Thus $C_2$ is an efficient code for $q$.

(c) If we use code $C_2$ for $p(x)$, then the average length is $\frac12 \cdot 1 + \frac14 \cdot 3 + \frac18 \cdot 3 + \frac{1}{16} \cdot 3 + \frac{1}{16} \cdot 3 = 2$ bits. It exceeds the entropy of $p$ by 0.125 bits, which is the same as $D(p\|q)$.

(d) Similarly, using code $C_1$ for $q$ has an average length of 2.125 bits, which exceeds the entropy of $q$ by 0.125 bits, which is $D(q\|p)$.

32. Non-singular codes: The discussion in the text focused on instantaneous codes, with extensions to uniquely decodable codes. Both these are required in cases when the code is to be used repeatedly to encode a sequence of outcomes of a random variable. But if we need to encode only one outcome and we know when we have reached the end of a codeword, we do not need unique decodability; only the fact that the code is non-singular would suffice. For example, if a random variable $X$ takes on 3 values $a$, $b$ and $c$, we could encode them by 0, 1 and 00. Such a code is non-singular but not uniquely decodable.

In the following, assume that we have a random variable $X$ which takes on $m$ values with probabilities $p_1, p_2, \ldots, p_m$, and that the probabilities are ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$.

(a) By viewing the non-singular binary code as a ternary code with three symbols, show that the expected length of a non-singular code $L_{1:1}$ for a random variable $X$ satisfies the following inequality:
$$L_{1:1} \ge \frac{H_2(X)}{\log 3} - 1, \qquad (5.48)$$
where $H_2(X)$ is the entropy of $X$ in bits. Thus the average length of a non-singular code is at least a constant fraction of the average length of an instantaneous code.

(b) Let $L^*_{INST}$ be the expected length of the best instantaneous code and $L^*_{1:1}$ be the expected length of the best non-singular code for $X$. Argue that $L^*_{1:1} \le L^*_{INST} \le H(X) + 1$.

(c) Give a simple example where the average length of the non-singular code is less than the entropy.

(d) The set of codewords available for a non-singular code is $\{0, 1, 00, 01, 10, 11, 000, \ldots\}$. Since $L_{1:1} = \sum_{i=1}^m p_i l_i$, show that this is minimized if we allot the shortest codewords to the most probable symbols. Thus $l_1 = l_2 = 1$, $l_3 = l_4 = l_5 = l_6 = 2$, etc. Show that in general $l_i = \lceil \log(\frac{i}{2} + 1) \rceil$, and therefore $L^*_{1:1} = \sum_{i=1}^m p_i \lceil \log(\frac{i}{2} + 1) \rceil$.

(e) The previous part shows that it is easy to find the optimal non-singular code for a distribution. However, it is a little more tricky to deal with the average length of this code. It follows from the previous part that $L^*_{1:1} \ge \tilde{L} = \sum_{i=1}^m p_i \log(\frac{i}{2} + 1)$. Consider the difference
$$F(p) = H(X) - \tilde{L} = -\sum_{i=1}^m p_i \log p_i - \sum_{i=1}^m p_i \log\left(\frac{i}{2}+1\right). \qquad (5.49)$$
Prove by the method of Lagrange multipliers that the maximum of $F(p)$ occurs when $p_i = c/(i+2)$, where $c = 1/(H_{m+2} - H_2)$ and $H_k$ is the sum of the harmonic series, i.e.,
$$H_k = \sum_{j=1}^k \frac{1}{j}. \qquad (5.50)$$
(This can also be done using the non-negativity of relative entropy.)

(f) Complete the arguments for
$$H(X) - L^*_{1:1} \le H(X) - \tilde{L} \qquad (5.51)$$
$$\le \log 2(H_{m+2} - H_2). \qquad (5.52)$$
Now it is well known (see, e.g., Knuth, "Art of Computer Programming", Vol. 1) that $H_k \approx \ln k$ (more precisely, $H_k = \ln k + \gamma + \frac{1}{2k} - \frac{1}{12k^2} + \frac{1}{120k^4} - \epsilon$, where $0 < \epsilon < 1/252k^6$, and $\gamma$ = Euler's constant = 0.577...). Either using this or the simple approximation $H_k \le \ln k + 1$, which can be proved by integration of $\frac{1}{x}$, it can be shown that $H(X) - L_{1:1} < \log\log m + 2$. Thus we have
$$H(X) - \log\log|\mathcal{X}| - 2 \le L^*_{1:1} \le H(X) + 1. \qquad (5.53)$$
A non-singular code cannot do much better than an instantaneous code!

Solution:

(a) In the text, it is proved that the average length of any prefix-free code in a $D$-ary alphabet is greater than $H_D(X)$, the $D$-ary entropy. Now if we start with any binary non-singular code and add the additional symbol "STOP" at the end, the new code is prefix-free in the alphabet of 0, 1, and "STOP" (since "STOP" occurs only at the end of codewords, and every codeword has a "STOP" symbol, the only way a codeword can be a prefix of another is if they were equal). Each codeword in the new alphabet is one symbol longer than the binary codeword, so the average length increases by 1 symbol. Thus we have $L_{1:1} + 1 \ge H_3(X)$, or $L_{1:1} \ge \frac{H_2(X)}{\log 3} - 1 = 0.63\,H(X) - 1$.

(b) Since an instantaneous code is also a non-singular code, the best non-singular code is at least as good as the best instantaneous code. Since the best instantaneous code has average length $\le H(X) + 1$, we have $L^*_{1:1} \le L^*_{INST} \le H(X) + 1$.

(c) For a 2-symbol alphabet, the best non-singular code and the best instantaneous code are the same. So the simplest example where they differ is when $|\mathcal{X}| = 3$. In this case, the simplest (and it turns out, optimal) non-singular code has the three codewords 0, 1, 00. Assume that each of the symbols is equally likely. Then $H(X) = \log 3 = 1.58$ bits, whereas the average length of the non-singular code is $\frac13 \cdot 1 + \frac13 \cdot 1 + \frac13 \cdot 2 = 4/3 \approx 1.33 < H(X)$. Thus a non-singular code can do better than entropy.
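A two-line check of this example (uniform distribution on three symbols, codewords 0, 1, 00):

```python
from math import log2

lengths = [1, 1, 2]                  # codeword lengths of 0, 1, 00
avg_len = sum(l / 3 for l in lengths)
entropy = log2(3)
print(avg_len, entropy)              # 4/3 < log 3: the code beats the entropy
```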

(d) For a given set of codeword lengths, the fact that allotting the shortest codewords to the most probable symbols minimizes the average length is proved in Lemma 5.8.1, part 1 of EIT. This result is an instance of the Hardy-Littlewood-Polya (rearrangement) inequality, which in its simplest form says that if $a < b$ and $c < d$, then $ad + bc < ac + bd$. The general version states that if we are given two sets of numbers $A = \{a_j\}$ and $B = \{b_j\}$, each of size $m$, and we let $a_{[i]}$ be the $i$-th largest element of $A$ and $b_{[i]}$ the $i$-th largest element of $B$, then

$$\sum_{i=1}^m a_{[i]}\,b_{[m+1-i]} \;\le\; \sum_{i=1}^m a_i b_i \;\le\; \sum_{i=1}^m a_{[i]}\,b_{[i]}. \qquad (5.54)$$

An intuitive explanation of this inequality is that you can consider the $a_i$'s to be the positions of hooks along a rod, and the $b_i$'s to be weights to be attached to the hooks. To maximize the moment about one end, you should attach the largest weights to the furthest hooks.

The set of available codewords is the set of all possible sequences. Since the only restriction is that the code be non-singular, each source symbol could be allotted to any codeword in the set $\{0, 1, 00, \ldots\}$. Thus we should allot the codewords 0 and 1 to the two most probable source symbols, i.e., to probabilities $p_1$ and $p_2$. Thus $l_1 = l_2 = 1$. Similarly, $l_3 = l_4 = l_5 = l_6 = 2$ (corresponding to the codewords 00, 01, 10 and 11). The next 8 symbols will use codewords of length 3, etc. We will now find the general form for $l_i$. We could prove it by induction, but we will derive the result from first principles. Let $c_k = \sum_{j=1}^{k-1} 2^j$. Then by the arguments of the previous paragraph, all source symbols of index $c_k + 1, c_k + 2, \ldots, c_k + 2^k = c_{k+1}$


use codewords of length $k$. Now by using the formula for the sum of the geometric series, it is easy to see that
$$c_k = \sum_{j=1}^{k-1} 2^j = 2\sum_{j=0}^{k-2} 2^j = 2\,\frac{2^{k-1}-1}{2-1} = 2^k - 2. \qquad (5.55)$$

Thus all sources with index $i$, where $2^k - 1 \le i \le 2^k - 2 + 2^k = 2^{k+1} - 2$, use codewords of length $k$. This corresponds to $2^k < i + 2 \le 2^{k+1}$, or $k < \log(i+2) \le k+1$, or $k - 1 < \log\frac{i+2}{2} \le k$. Thus the length of the codeword for the $i$-th symbol is $k = \lceil \log\frac{i+2}{2} \rceil$. Thus the best non-singular code assigns codeword length $l^*_i = \lceil \log(\frac{i}{2}+1) \rceil$ to symbol $i$, and therefore $L^*_{1:1} = \sum_{i=1}^m p_i \lceil \log(\frac{i}{2}+1) \rceil$.

(e) Since $\lceil \log(\frac{i}{2}+1) \rceil \ge \log(\frac{i}{2}+1)$, it follows that $L^*_{1:1} \ge \tilde{L} = \sum_{i=1}^m p_i \log(\frac{i}{2}+1)$. Consider the difference
$$F(p) = H(X) - \tilde{L} = -\sum_{i=1}^m p_i \log p_i - \sum_{i=1}^m p_i \log\left(\frac{i}{2}+1\right). \qquad (5.56)$$

We want to maximize this function over all probability distributions, and therefore we use the method of Lagrange multipliers with the constraint $\sum_i p_i = 1$. Therefore let
$$J(p) = -\sum_{i=1}^m p_i \log p_i - \sum_{i=1}^m p_i \log\left(\frac{i}{2}+1\right) + \lambda\left(\sum_{i=1}^m p_i - 1\right). \qquad (5.57)$$

Then differentiating with respect to $p_i$ and setting the derivative to 0, we get
$$\frac{\partial J}{\partial p_i} = -1 - \log p_i - \log\left(\frac{i}{2}+1\right) + \lambda = 0 \qquad (5.58)$$
$$\log p_i = \lambda - 1 - \log\frac{i+2}{2} \qquad (5.59)$$
$$p_i = 2^{\lambda-1}\,\frac{2}{i+2}. \qquad (5.60)$$
Now substituting this in the constraint that $\sum_i p_i = 1$, we get
$$2^{\lambda}\sum_{i=1}^m \frac{1}{i+2} = 1, \qquad (5.61)$$
or $2^{\lambda} = 1/\left(\sum_{i=1}^m \frac{1}{i+2}\right)$. Now using the definition $H_k = \sum_{j=1}^k \frac{1}{j}$, it is obvious that
$$\sum_{i=1}^m \frac{1}{i+2} = \sum_{i=1}^{m+2}\frac{1}{i} - 1 - \frac{1}{2} = H_{m+2} - H_2. \qquad (5.62)$$
Thus $2^{\lambda} = \frac{1}{H_{m+2}-H_2}$, and
$$p_i = \frac{1}{H_{m+2}-H_2}\,\frac{1}{i+2}. \qquad (5.63)$$


Substituting this value of $p_i$ in the expression for $F(p)$, we obtain
$$F(p) = -\sum_{i=1}^m p_i \log p_i - \sum_{i=1}^m p_i \log\left(\frac{i}{2}+1\right) \qquad (5.64)$$
$$= -\sum_{i=1}^m p_i \log\left(p_i\,\frac{i+2}{2}\right) \qquad (5.65)$$
$$= -\sum_{i=1}^m p_i \log\frac{1}{2(H_{m+2}-H_2)} \qquad (5.66)$$
$$= \log 2(H_{m+2}-H_2). \qquad (5.67)$$

Thus the extremal value of $F(p)$ is $\log 2(H_{m+2} - H_2)$. We have not shown that it is a maximum; that can be shown by taking the second derivative. But as usual, it is easier to see it using relative entropy. Looking at the expressions above, we can see that if we define
$$q_i = \frac{1}{H_{m+2}-H_2}\,\frac{1}{i+2},$$
then $q$ is a probability distribution (i.e., $q_i \ge 0$, $\sum_i q_i = 1$). Also, $\frac{i+2}{2} = \frac{1}{2(H_{m+2}-H_2)\,q_i}$, and substituting this in the expression for $F(p)$, we obtain

$$F(p) = -\sum_{i=1}^m p_i \log p_i - \sum_{i=1}^m p_i \log\left(\frac{i}{2}+1\right) \qquad (5.68)$$
$$= -\sum_{i=1}^m p_i \log\left(p_i\,\frac{i+2}{2}\right) \qquad (5.69)$$
$$= -\sum_{i=1}^m p_i \log\left(\frac{p_i}{2(H_{m+2}-H_2)\,q_i}\right) \qquad (5.70)$$
$$= -\sum_{i=1}^m p_i \log\frac{p_i}{q_i} + \log 2(H_{m+2}-H_2) \qquad (5.71)$$
$$= \log 2(H_{m+2}-H_2) - D(p\|q) \qquad (5.72)$$
$$\le \log 2(H_{m+2}-H_2), \qquad (5.73)$$
with equality iff $p = q$. Thus the maximum value of $F(p)$ is $\log 2(H_{m+2} - H_2)$.

(f)
$$H(X) - L^*_{1:1} \le H(X) - \tilde{L} \qquad (5.74)$$
$$\le \log 2(H_{m+2} - H_2). \qquad (5.75)$$
The first inequality follows from the definition of $\tilde{L}$, and the second from the result of the previous part. To complete the proof, we will use the simple inequality $H_k \le \ln k + 1$, which can be shown by integrating $\frac{1}{x}$ between 1 and $k$. Thus $H_{m+2} \le \ln(m+2) + 1$, and
$$2(H_{m+2}-H_2) = 2\left(H_{m+2} - \tfrac32\right) \le 2\left(\ln(m+2) + 1 - \tfrac32\right) \le 2\ln(m+2) = \frac{2\log(m+2)}{\log e} \le 2\log(m+2) \le 2\log m^2 = 4\log m,$$
where the last inequality is true for $m \ge 2$. Therefore
$$H(X) - L_{1:1} \le \log 2(H_{m+2}-H_2) \le \log(4\log m) = \log\log m + 2. \qquad (5.76)$$
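The length assignment $l_i = \lceil \log(i/2+1) \rceil$ and the gap bound above can be checked numerically. The sketch below uses an assumed example distribution $p_i \propto 2^{-i}$ (sorted decreasing, as the problem requires); the helper name is ours, not from the text:

```python
from math import log2, ceil

def nonsingular_lengths(m):
    # optimal non-singular code lengths l_i = ceil(log2(i/2 + 1)), i = 1..m:
    # two symbols of length 1, four of length 2, eight of length 3, ...
    return [ceil(log2(i / 2 + 1)) for i in range(1, m + 1)]

m = 64
w = [2.0 ** -i for i in range(1, m + 1)]   # assumed example distribution
p = [x / sum(w) for x in w]                # normalized, sorted decreasing

H = -sum(x * log2(x) for x in p)
L = sum(x * l for x, l in zip(p, nonsingular_lengths(m)))
print(H - L, log2(log2(m)) + 2)            # the gap stays below log log m + 2
```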

We therefore have the following bounds on the average length of a non-singular code:
$$H(X) - \log\log|\mathcal{X}| - 2 \le L^*_{1:1} \le H(X) + 1. \qquad (5.77)$$
A non-singular code cannot do much better than an instantaneous code!

32. Bad wine. One is given 6 bottles of wine. It is known that precisely one bottle has gone bad (tastes terrible). From inspection of the bottles it is determined that the probability $p_i$ that the $i$th bottle is bad is given by $(p_1, p_2, \ldots, p_6) = (\frac{8}{23}, \frac{6}{23}, \frac{4}{23}, \frac{2}{23}, \frac{2}{23}, \frac{1}{23})$. Tasting will determine the bad wine. Suppose you taste the wines one at a time. Choose the order of tasting to minimize the expected number of tastings required to determine the bad bottle. Remember, if the first 5 wines pass the test you don't have to taste the last.

(a) What is the expected number of tastings required?
(b) Which bottle should be tasted first?

Now you get smart. For the first sample, you mix some of the wines in a fresh glass and sample the mixture. You proceed, mixing and tasting, stopping when the bad bottle has been determined.

(c) What is the minimum expected number of tastings required to determine the bad wine?
(d) What mixture should be tasted first?

Solution: Bad Wine.

(a) If we taste one bottle at a time, to minimize the expected number of tastings the order of tasting should be from the most likely wine to be bad to the least. The expected number of tastings required is
$$\sum_{i=1}^6 p_i l_i = 1\times\frac{8}{23} + 2\times\frac{6}{23} + 3\times\frac{4}{23} + 4\times\frac{2}{23} + 5\times\frac{2}{23} + 5\times\frac{1}{23} = \frac{55}{23} = 2.39.$$

(b) The first bottle to be tasted should be the one with probability $\frac{8}{23}$.

(c) The idea is to use Huffman coding. With Huffman coding, we get codeword lengths $(2, 2, 2, 3, 4, 4)$. The expected number of tastings required is
$$\sum_{i=1}^6 p_i l_i = 2\times\frac{8}{23} + 2\times\frac{6}{23} + 2\times\frac{4}{23} + 3\times\frac{2}{23} + 4\times\frac{2}{23} + 4\times\frac{1}{23} = \frac{54}{23} = 2.35.$$
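Both tasting strategies can be verified in a few lines; the Huffman expected depth is computed as the sum of the merged (internal-node) weights, a standard identity:

```python
import heapq
from fractions import Fraction

p = [Fraction(n, 23) for n in (8, 6, 4, 2, 2, 1)]

# (a) taste in decreasing order of probability; if the first five pass,
# the last bottle is identified without a sixth tasting
seq = sum(pi * min(i + 1, len(p) - 1)
          for i, pi in enumerate(sorted(p, reverse=True)))

# (c) Huffman merging: expected depth = sum of all internal-node weights
heap = list(p)
heapq.heapify(heap)
huff = Fraction(0)
while len(heap) > 1:
    a, b = heapq.heappop(heap), heapq.heappop(heap)
    huff += a + b
    heapq.heappush(heap, a + b)

print(seq, huff)   # 55/23 and 54/23
```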

.1.80) .1 = 1 .0. (a) Argue that the above procedure is optimal. We can now sort this list of T’s.Data Compression (d) The mixture of the ﬁrst and second bottles should be tasted ﬁrst. (b) For any D > 2 . . T2 ) + 1. . (d) Show that there exists a tree such that C (T ) ≤ log 2 2 Ti i 2 Ti i (5. A random variable X takes on three values with probabilities 0. . and we would like to ﬁnd their sum S1 ⊕ S2 ⊕ · · · ⊕ Sm using 2-input gates. 127 33.2. forming the partial result at time max(T1 .00). Hence for D ≥ 10 . i. We now have a new problem with S1 ⊕ S2 .0. the Huﬀman code for the three symbols are all one character.e..3.6. each gate with 1 time unit delay. Huﬀman algorithm for tree construction. if D = 10 .79) +1 (5. Sm . 1 with codeword lengths (1. Huﬀman vs. (a) What are the lengths of the binary Huﬀman codewords for X ? What are the lengths of the binary Shannon codewords (l(x) = log( p(1 x) ) ) for X ? (b) What is the smallest integer D such that the expected Shannon codeword length with a D -ary alphabet equals the expected Huﬀman codeword length with a D ary alphabet? Solution: Huﬀman vs. .4) for the three symbols. . . (b) Show that this procedure ﬁnds the tree that minimizes C (T ) = max(Ti + li ) i (5. so that the ﬁnal result is available as quickly as possible. .3. The Shannon code would use lengths log p . which gives lengths (1. Tm .78) where Ti is the time at which the result alloted to the i -th leaf is available. T3 . . A simple greedy algorithm is to combine the earliest two results.6. The 1 would be equal to 1 for all symbols if log D 01 Shannon code length log D p . in that it constructs a circuit for which the ﬁnal result is available as quickly as possible. and 0. S2 . . Shannon. repeating this until we have the ﬁnal result. and li is the length of the path from the i -th leaf to the root. . (c) Show that C (T ) ≥ log 2 for any tree T . . 0. T2 ) + 1 .1) is (1. Sm are available at times T1 ≤ T2 ≤ .2. 
(d) The mixture of the first and second bottles should be tasted first.

33. Huffman vs. Shannon. A random variable $X$ takes on three values with probabilities 0.6, 0.3, and 0.1.

(a) What are the lengths of the binary Huffman codewords for $X$? What are the lengths of the binary Shannon codewords ($l(x) = \lceil \log\frac{1}{p(x)} \rceil$) for $X$?
(b) What is the smallest integer $D$ such that the expected Shannon codeword length with a $D$-ary alphabet equals the expected Huffman codeword length with a $D$-ary alphabet?

Solution: Huffman vs. Shannon.

(a) It is obvious that a Huffman code for the distribution $(0.6, 0.3, 0.1)$ has codeword lengths $(1, 2, 2)$. The Shannon code would use lengths $\lceil \log\frac{1}{p} \rceil$, which gives lengths $(1, 2, 4)$ for the three symbols.

(b) For $D \ge 3$, the Huffman codewords for the three symbols are all one character. The Shannon code length $\lceil \log_D\frac{1}{p} \rceil$ is equal to 1 for all symbols iff $\log_D\frac{1}{0.1} \le 1$, i.e., iff $D \ge 10$. Hence for $D \ge 10$, the Shannon code is also optimal, and the smallest such integer is $D = 10$.

34. Huffman algorithm for tree construction. Consider the following problem: $m$ binary signals $S_1, S_2, \ldots, S_m$ are available at times $T_1 \le T_2 \le \cdots \le T_m$, and we would like to find their sum $S_1 \oplus S_2 \oplus \cdots \oplus S_m$ using 2-input gates, each gate with 1 time unit delay, so that the final result is available as quickly as possible. A simple greedy algorithm is to combine the earliest two results, forming the partial result at time $\max(T_1, T_2) + 1$. We now have a new problem with $S_1 \oplus S_2, S_3, \ldots, S_m$, available at times $\max(T_1, T_2) + 1, T_3, \ldots, T_m$. We can now sort this list of T's, and apply the same merging step again, repeating this until we have the final result.

(a) Argue that the above procedure is optimal, in that it constructs a circuit for which the final result is available as quickly as possible.
(b) Show that this procedure finds the tree that minimizes
$$C(T) = \max_i (T_i + l_i), \qquad (5.78)$$
where $T_i$ is the time at which the result allotted to the $i$-th leaf is available, and $l_i$ is the length of the path from the $i$-th leaf to the root.
(c) Show that
$$C(T) \ge \log_2 \sum_i 2^{T_i} \qquad (5.79)$$
for any tree $T$.
(d) Show that there exists a tree such that
$$C(T) \le \log_2 \sum_i 2^{T_i} + 1. \qquad (5.80)$$

. proving the optimality of the above algorithm. By induction.e.85) (5. (b) Let c1 = i 2Ti and c2 = T pi = 2 i Tj . (a) The proof is identical to the proof of optimality of Huﬀman coding. For any node. This result extneds to the complete tree. and therefore C (T ) ≥ log c1 − log c2 + pi log i pi ri (5. For any internal node of the tree. Then we can reduce the problem to constructing the optimal tree for a smaller problem. we extend the optimality to the larger problem. that if T i < Tj and li < lj . and let ri = j i 2 2−li . Thus maxi (Ti + li ) is a lower bound on the time at which the ﬁnal answer is available. we have Ti = log(pi c1 ) and li = − log(ri c2 ) . Given any tree of gates. by contradiction. Also. the time at which the ﬁnal result is available is max i (Ti + li ) . The fact that the tree achieves this bound can be shown by induction. Then C (T ) = max(Ti + li ) i (5. Assume otherwise. and that they could be taken as siblings (inputs to the same gate).128 Thus log2 Solution: Tree construction: i Data Compression 2 Ti is the analog of entropy for this problem. By the Kraft inequality. Tj + li } (5. the earliest that the output corresponding to a particular signal would be available is Ti + li . the output is available at time equal to the maximum of the input times plus 1. the output is available at time equal to maximum of the times at the leaves plus the gate delays to get from the leaf to the node. The above algorithm minimizes this cost. We show that the longest branches correspond to the two earliest times. Tj + lj } ≥ max{Ti + lj . the output is available at time max(Ti . we obtain a tree with a lower total cost.81) Thus the longest branches are associated with the earliest times. Tj ) + 1 . j 2 functions. i. then li ≥ lj . Now let 2−li Clearly. since max{Ti + li . pi and ri are probability mass −lj . 
Thus $\log_2 \sum_i 2^{T_i}$ is the analog of entropy for this problem.

Solution: Tree construction.

(a) The proof is identical to the proof of optimality of Huffman coding. Given any tree of gates, the output of any internal node is available at time equal to the maximum of the input times plus 1; for the root, the output is available at time equal to the maximum of the times at the leaves plus the gate delays from leaf to root. Thus the time at which the final result is available is $\max_i(T_i + l_i)$, since the signal from the $i$-th leaf undergoes $l_i$ gate delays, and $\max_i(T_i + l_i)$ is a lower bound on the time at which the final answer is available. We first show that for the optimal tree, if $T_i < T_j$, then $l_i \ge l_j$. The proof of this is, as in the case of Huffman coding, by contradiction. Assume otherwise, i.e., that $T_i < T_j$ and $l_i < l_j$; then by exchanging the inputs, we obtain a tree with a lower total cost, since
$$\max\{T_i + l_i,\; T_j + l_j\} \ge \max\{T_i + l_j,\; T_j + l_i\}. \qquad (5.81)$$
Thus the longest branches are associated with the earliest times. We then show that the two longest branches correspond to the two earliest times, and that they can be taken as siblings (inputs to the same gate); for the gate connected to the inputs $T_i$ and $T_j$, the output is available at time $\max(T_i, T_j) + 1$. Then we can reduce the problem to constructing the optimal tree for a smaller problem, and by induction, extend the optimality to the larger problem, proving the optimality of the above algorithm. The rest of the proof is identical to the Huffman proof.

(b) Let $c_1 = \sum_j 2^{T_j}$ and $c_2 = \sum_j 2^{-l_j}$, and let $p_i = 2^{T_i}/c_1$ and $r_i = 2^{-l_i}/c_2$. Clearly, $p_i$ and $r_i$ are probability mass functions, and $T_i = \log(p_i c_1)$ and $l_i = -\log(r_i c_2)$. Then
$$C(T) = \max_i (T_i + l_i) \qquad (5.82)$$
$$= \max_i \left(\log(p_i c_1) - \log(r_i c_2)\right) \qquad (5.83)$$
$$= \log c_1 - \log c_2 + \max_i \log\frac{p_i}{r_i}. \qquad (5.84)$$
Now the maximum of any random variable is greater than its average under any distribution, and therefore
$$C(T) \ge \log c_1 - \log c_2 + \sum_i p_i \log\frac{p_i}{r_i} \qquad (5.85)$$
$$= \log c_1 - \log c_2 + D(p\|r). \qquad (5.86)$$
By the Kraft inequality, $c_2 \le 1$, so $-\log c_2 \ge 0$; since $D(p\|r) \ge 0$ as well, we have $C(T) \ge \log c_1$, which is the desired result.
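The greedy merge and the entropy-like bound of parts (c) and (d) can be illustrated directly; the arrival times below are assumed for the sketch:

```python
import heapq
from math import log2

def completion_time(times):
    # greedily combine the two earliest results; each 2-input gate adds 1 delay
    h = list(times)
    heapq.heapify(h)
    while len(h) > 1:
        a, b = heapq.heappop(h), heapq.heappop(h)
        heapq.heappush(h, max(a, b) + 1)
    return h[0]

T = [0, 1, 1, 3, 5, 5]                  # assumed example times
c = completion_time(T)
lower = log2(sum(2.0 ** t for t in T))
print(lower, c)                         # lower <= c <= lower + 1
```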

Instead. . Show that EN ≤ 2 . . One wishes to generate a random variable X X= 1.90) You are given fair coin ﬂips Z1 . . Optimal word lengths. and compare with p1 . which occurs with probability 1/4. with probability 1/2. p2 . we achieve the lower bound if p i = ri and c2 = 1 . 2) be the word lengths of a binary Huﬀman code. to generate X .3)? . Find a good way to use Z 1 . It is well known that U is uniformly distributed on [0. .. Similarly. since the li ’s are constrained to be integers. . . 0. and generate a 1 at the ﬁrst time one of the Z i ’s is less than the corresponding pi and generate a 0 the ﬁrst time one of the Z i ’s is greater than the corresponding pi ’s.p1 p2 . X is generated after 2 bits of Z if Z1 = p1 and Z2 = p2 . Generating random variables. 1) . . . and that thus we can construct a tree 2 Tj ) + 1 j Ti + li ≤ log( (5. Z2 . Z2 . .89) for all i .Z1 Z2 .91) (5. (a) Can l = (1. What about (2. Let N be the (random) number of ﬂips needed to generate X .Data Compression Since −logc2 ≥ 0 and D (p||r ) ≥ 0 . . we generate X = 1 if U < p and 0 otherwise. Clearly. Thus. . if we let Tj 1 j2 = log . . log( j 2Tj ) is the equivalent of entropy for this problem! 35. . + 2 + 3 + . 2. . . as a binary number. . we cannot achieve equality in all cases.e. Thus EN 1 1 1 = 1.3. the sequence Z treated as a binary number.87) (c) From the previous part. . . + 2 4 8 = 2 (5. Z2 . . Thus the probability that X is generated after seeing the ﬁrst bit of Z is the probability that Z1 = p1 . Let U = 0. However. Thus this tree achieves within 1 unit of the lower bound. with probability p with probability 1 − p (5.88) li = log pi 2 Ti it is easy to verify that that achieves 2−li ≤ pi = 1 . Solution: We expand p = 0.92) 36. . i.2. . 129 (5. The procedure for generated X would therefore examine Z 1 . we have C (T ) ≥ log c1 which is the desired result. (5.

i.) can arise from binary Huﬀman codes? Solution: Optimal Word Lengths We ﬁrst answer (b) and apply the result to (a). Huﬀman. } {0. . 0} is uniquely decodable (suﬃx free) but not instantaneous. 0000} is neither uniquely decodable or instantaneous. p5 . Solution: Huﬀman Codes. Thus. while (2. ‘collapsing’ the tree back. 110. Which of the following codes are (a) uniquely decodable? (b) instantaneous? C1 C2 C3 C4 Solution: Codes.130 Data Compression (b) What word lengths l = (l1 . Codes. (a) (b) (c) (d) C1 C2 C3 C4 = {00. 3) cannot. . p3 . . (a) D=2 6 6 4 4 2 2 1 6 6 4 4 3 2 6 6 5 4 4 8 6 6 5 11 8 6 14 11 25 . 01. 11} is preﬁx free (instantaneous). 0000} 6 6 4 4 3 2 . } is instantaneous = {0. and if we assign each node a ‘weight’. (a) Clearly.. 01. p2 . 0} {00. namely 2−li . Find the Huﬀman D -ary code for (p 1 . 25 ) 38. = = = = {00. 000. we have that i 2−li = 1 . (b) Word lengths of a binary Huﬀman code must satisfy the Kraft inequality with −li = 1 . (1. l2 . . 37. . 1110. 00. 2. 11} {0. 25 . 2) can arise from Huﬀman code. 101. 10. 3) does not. . Thus. p6 ) = ( 25 and the expected word length (a) for D = 2 . An easy way to see this is the following: every node in equality. 2. i2 the tree has a sibling (property of optimal binary code). 01. . (1. 2) satisﬁes Kraft with equality. 25 . p4 . 10. while (2. . 3. 000. 100. 25 . . 3. then 2 × 2−li is the weight of the father (mother) node. 101. 110.e. 2. (b) for D = 4 . 2. 1110. 00. 01. = {00. 100. = {0. 25 .

1} ∗ be a nonsingular but nonuniquely decodable code.36 25 = = 39. Let C : X −→ {0. Let X have entropy H (X ). Entropy of encoded bits. Solution: Entropy of encoded bits .66 25 = = (b) D=4 6 6 4 4 2 2 1 9 6 6 4 25 6 pi 25 li 1 6 25 4 25 4 25 2 25 2 25 1 25 1 1 2 2 2 2 7 E(l) = i=1 pi li 1 (6 × 1 + 6 × 1 + 4 × 1 + 4 × 2 + 2 × 2 + 2 × 2 + 1 × 2) 25 34 = 1. (a) Compare H (C (X )) to H (X ) .Data Compression 131 6 pi 25 li 2 2 6 25 3 4 25 3 4 25 3 2 25 4 2 25 4 1 25 7 E(l) = i=1 pi li 1 (6 × 2 + 6 × 2 + 4 × 3 + 4 × 3 + 2 × 3 + 2 × 4 + 1 × 4) 25 66 = 2. (b) Compare H (C (X n )) to H (X n ) .

(Problem 2. There’s no straightforward rigorous way to calculate the entropy rates. 40. Solution: Code rate. . . so you need to do some guessing. since the Xi ’s are independent. and hence H (X n ) ≥ H (C (X n )) . 00. 3} and distribution X= 1. Let X1 . For example. with probability 1/2 with probability 1/4 with probability 1/4. and since the probabilities are dyadic there is no gain in coding in blocks. 10. 01. . be independent identically distributed according to this distribution and let Z1 Z2 Z3 . (a) Find the entropy rate H (X ) and the entropy rate H (Z ) in bits per symbol. 01. (b) Now let the code be C (x) = and ﬁnd the entropy rate H (Z ). let the code be C (x) = and ﬁnd the entropy rate H (Z ). the function X n → C (X n ) is many to one. if x = 1 if x = 2 if x = 3. Now we observe that this is an optimal code for the given distribution on X . . = C (X1 )C (X2 ) . 122 becomes 01010 . be the string of binary symbols resulting from concatenating the corresponding codewords. if x = 1 if x = 2 if x = 3. Code rate. . (c) Finally.4) (b) Since the code is not uniquely decodable. The data compression code for X assigns codewords C (x) = 0. This is a slightly tricky question.132 Data Compression (a) Since the code is non-singular. X2 . So the . 2. . 00. Let X be a random variable with alphabet {1. 10. 2. 3. if x = 1 if x = 2 if x = 3. . 1. Note that Z is not compressible further. the function X → C (X ) is one to one. and hence H (X ) = H (C (X )) . (a) First. 11. H (X ) = H (X1 ) = 1/2 log 2+2(1/4) log(4) = 3/2.

Therefore H (Z ) = H (Bern(1/2)) = 1 . . and L is the (binary) length function.Data Compression 133 (b) Here it’s easy. Suppose we encode the ﬁrst n symbols X 1 X2 · · · Xn into (We’re being a little sloppy and ignoring the fact that n above may not be a even. Z1 Z2 · · · Zm = C (X1 )C (X2 ) · · · C (Xn ). Z2 . . Since the concatenated codeword sequence is an invertible function of (X 1 . resulting process has to be i. Xn ) . .94) yields H (Z ) = = = where E [L(C (X1 ))] = H (X ) E [L(C (X1 ))] 3/2 7/4 6 . .93) The ﬁrst equality above is trivial since the X i ’s are independent.) Combining the left-hand-side of (5. . Zn ) n H (X1 . may guess that the right-hand-side above can be written as n H (Z1 Z2 · · · Z n 1 L(C (Xi )) ) = E [ i=1 L(C (Xi ))]H (Z ) (5. . . Similarly. 7 = 7/4. Here m = L(C (X1 ))+L(C (X2 ))+· · · +L(C (Xn )) is the total length of the encoded sequence (in bits). . it follows that nH (X ) = H (X1 X2 · · · Xn ) = H (Z1 Z2 · · · Z n 1 L(C (Xi )) ) (5. but it is true. . n→∞ H (Z ) = lim (c) This is the tricky part. Xn/2 ) = lim n→∞ n H (X ) n 2 = lim n→∞ n = 3/4. . X2 . Bern(1/2).94) = nE [L(C (X1 ))]H (Z ) (This is not trivial to prove.d. 3 x=1 p(x)L(C (x)) .93) with the right-hand-side of (5. H (Z1 .i. (for otherwise we could get further compression from it). but in the limit as n → ∞ this doesn’t make a diﬀerence). . .

The result of the sum of these two probabilities is p10 . 2. Again drawing the tree will verify this. Since the Huﬀman tree produces the minimum expected length tree. . we ﬁrst combine the two smallest probabilities. . 2. 43. In this case. . 3. the lengths of the ﬁrst 9 codewords remain unchanged. 3. . In eﬀect. Thus we have a ternary code for the ﬁrst symbol and binary thereafter. 2. p9 . l11 for the probabilities p1 . the ﬁrst 9 codewords will be left unchanged. . 2. Solution: Optimal codes. In short. 2. (1 − α)p10 . and seeing that one of the codewords of length 2 can be reduced to a codeword of length 1 which is shorter. we would combine αp10 and (1 − α)p10 . . . followed by binary digits {0. C } . 2) (b) (2. (b) The word lengths (2. p10 yields an optimal code (when expanded) for p 1 . p9 . 2. 2. . . 3) ARE the word lengths for a 3-ary Huﬀman code. 1} . 2. . these codeword lengths cannot be the word lengths for a Huﬀman tree. l2 . The key point is that an optimal code for p1 . . and are the word lengths for a 3-ary Huﬀman code. . Suppose we get a new distribution by splitting the last probability mass. 69 69 69 69 69 69 (5. . . 2. B. αp10 . 2. αp10 . . 2. i 3−li = 8 × 3−2 + 3 × 3−3 = 1 . 2. while the 2 new codewords will be XXX 0 and XXX 1 where XXX represents the last codeword of the original distribution. p2 . 2. . Ternary codes.134 Data Compression 41. . where 0 ≤ α ≤ 1 . . . . Optimal codes. Also. 2) CANNOT be the word lengths for a 3-ary Huﬀman code. . l2 . 2. .95) . What can you say about the optimal binary codeword lengths ˜ ˜ l˜ 1 . 3) Solution: Ternary codes. . . Let l1 . 3. so these word lengths satisfy the Kraft inequality with equality. Piecewise Huﬀman. 2. 2. Therefore the word lengths are optimal for some distribution. Note that the resulting probability distribution is now exactly the same as the original probability distribution. p2 . l10 be the binary Huﬀman codeword lengths for the probabilities p1 ≥ p2 ≥ . 
This can be seen by drawing the tree implied by these lengths. while the lengths of the last 2 codewords (new codewords) are equal to l 10 + 1 . ≥ p10 . To construct a Huﬀman code. Suppose the codeword that we use to describe a random variable X ∼ p(x) always starts with a symbol chosen from the set {A. 3. (a) The word lengths (1. . 2. 2. p2 . 42. . . (1 − α)p10 . 2. 2. Give the optimal uniquely decodeable code (minimum expected number of symbols) for the probability distribution p= 16 15 12 10 8 8 . Which of the following codeword lengths can be the word lengths of a 3-ary Huﬀman code and which cannot? (a) (1.

36 = 28 codewords of word length 6. 2. . suppose that X = 1 is the random object. Find the word lengths of the optimal binary encoding of p = Solution: Huﬀman. m} . . . .Data Compression Solution: Piecewise Codeword a x1 16 b1 x2 15 c1 x3 12 c0 x4 10 b01 x5 8 b00 x6 8 Huﬀman. . Let X be uniformly distributed over {1. we have (64 − k ) + 2k = 100 ⇒ k = 36. we can construct an instantaneous code with the same codeword lengths. . . m} are equally likely. . All 2m subsets of {1. . We ask random questions: Is X ∈ S1 ? Is X ∈ S2 ?. 3. and there are k nodes of depth 6 which will form 2k leaf nodes of depth 7. What is the probability that object 2 yields the same answers for k questions as object 1? (c) What is the expected number of objects in {2. Generally given a uniquely decodable code. . What is the expected number of wrong objects agreeing with the answers? 1 1 1 100 . There are 64 nodes of depth 6. . (a) How many deterministic questions are needed to determine X ? (b) Without loss of generality. Codeword a b c a0 b0 c0 44. There exists a code with smaller expected lengths that is uniquely decodable. . Huﬀman. and 2 × 36 = 72 codewords of word length 7. So there are 64 . This is not the case with the piecewise Huﬀman construction.. 45. but it is also instantaneously decodable. Random “20” questions.. Since the distribution is uniform the Huﬀman tree will consist of word lengths of log(100) = 7 and log(100) = 6 . 100 . . 2. of which (64k ) will be leaf nodes. Assume m = 2n . . . but not instantaneous. . 16 16 15 12 10 22 16 16 15 31 22 16 69 135 Note that the above code is not only uniquely decodable. 100 . m} that have the same answers to the questions as does the correct object 1? √ (d) Suppose we ask n + n random questions. Since the total number of leaf nodes is 100. .until only one integer remains.

Hence. 0 . to show that the probability of error (one or more wrong object remaining) goes to zero as n −→ ∞ . m. n into (c) we have the expected number of (2 n − 1)2−n− √ n. More information theoretically. (c) Let 1j = 1. Now we observe a noiseless output Y k of X k and ﬁgure out which object was sent. we can identify an object out of 2 n candidates. . . Since all subsets are equally likely. m E (# of objects in {2. Then. object j yields the same answers for k questions as object 1 . . where codewords X k are Bern( 1/2 ) random k -sequences. 0. i. . Hence. . . it is obvious the probability that object 2 has the same codeword as object 1 is 2−k . (b) Observe that the total number of subsets which include both object 1 and object 2 or neither of them is 2m−1 . . we can view this problem as a channel coding problem through a noiseless channel. the probability that object 2 yields the same answers for k questions as object 1 is (2 m−1 /2m )k = 2−k . . 3. the probability the object 1 is in a speciﬁc random subset is 1/2 . . (a) Obviously. Solution: Random “20” questions. . Huﬀman codewords for X are all of length n . with n deterministic questions. otherwise for j = 2. . the question whether object 1 belongs to the k th subset or not corresponds to the k th bit of the random codeword for object 1.136 Data Compression (e) Use Markov’s inequality Pr{X ≥ tµ} ≤ 1 t . . . joint typicality. Hence. . .e. m} with the same answers) = E ( m 1j ) j =2 = j =2 m E (1j ) 2 −k j =2 = = (m − 1)2−k (d) Plugging k = n + √ = (2n − 1)2−k . 1 2 0010 . From the same line of reasoning as in the achievability proof of the channel coding theorem. Object Codeword 1 0110 .

Then, by Markov's inequality,
$$P(N \ge 1) \le EN = (2^n - 1)\,2^{-n-\sqrt{n}} \le 2^{-\sqrt{n}} \to 0,$$
where the first equality follows from part (d). Thus the probability of error (one or more wrong objects remaining) goes to zero as $n \to \infty$.
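The count of agreeing wrong objects can be checked by simulation. In the sketch below, $n$ and the trial count are assumed; each question is answered the same way as for object 1 with probability 1/2 independently, as argued in part (b):

```python
import random
from math import sqrt

random.seed(1)
n = 10
m = 2 ** n
k = n + int(sqrt(n))       # number of random subset questions

trials = 2000
agree = 0
for _ in range(trials):
    for _obj in range(m - 1):          # the m - 1 wrong objects
        # a wrong object matches object 1's answer w.p. 1/2 per question
        if all(random.random() < 0.5 for _ in range(k)):
            agree += 1

expected = (m - 1) * 2.0 ** (-k)
print(agree / trials, expected)        # both close to (2^n - 1) 2^(-n - sqrt(n))
```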


Chapter 6

Gambling and Data Compression

1. Horse race. Three horses run a race. A gambler offers 3-for-1 odds on each of the horses. These are fair odds under the assumption that all horses are equally likely to win the race. The true win probabilities are known to be
$$p = (p_1, p_2, p_3) = \left(\frac12, \frac14, \frac14\right). \qquad (6.1)$$
Let $b = (b_1, b_2, b_3)$, $b_i \ge 0$, $\sum_i b_i = 1$, be the amounts invested on each of the horses. The expected log wealth is thus
$$W(b) = \sum_{i=1}^3 p_i \log 3b_i. \qquad (6.2)$$

(a) Maximize this over $b$ to find $b^*$ and $W^*$. Thus the wealth achieved in repeated horse races should grow to infinity like $2^{nW^*}$ with probability one.

(b) Show that if instead we put all of our money on horse 1, the most likely winner, we will eventually go broke with probability one.

Solution: Horse race.

(a) The doubling rate
$$W(b) = \sum_i p_i \log b_i o_i \qquad (6.3)$$
$$= \sum_i p_i \log 3b_i \qquad (6.4)$$
$$= \sum_i p_i \log 3 + \sum_i p_i \log p_i - \sum_i p_i \log\frac{p_i}{b_i} \qquad (6.5)$$
$$= \log 3 - H(p) - D(p\|b) \qquad (6.6)$$
$$\le \log 3 - H(p). \qquad (6.7)$$
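The optimal bet and the growth rate can be verified numerically; a minimal sketch:

```python
from math import log2

p = [1/2, 1/4, 1/4]
odds = 3                       # 3-for-1 on every horse

def W(b):
    # doubling rate: expected log wealth per race, in bits
    return sum(pi * log2(odds * bi) for pi, bi in zip(p, b))

w_star = W(p)                  # proportional (Kelly) betting b* = p
print(w_star)                  # log 3 - H(p) = (1/2) log(9/8) ≈ 0.085
print(2 ** w_star)             # wealth grows like (1.06)^n
print(W([0.5, 0.3, 0.2]))      # any other bet has a smaller doubling rate
```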

14) . the mathematics becomes more complicated. . 4 ) and W ∗ = log 3 − H ( 2 . then W (b) = −∞ and Sn → 2nW = 0 by the strong law of large numbers. 1 n( n j (6. If the odds are bad (due to a track take) the gambler may wish to keep money in his pocket. we used two diﬀerent approaches to prove the optimality of proportional betting when the gambler is not allowed keep any of the money as cash. By the strong law of large numbers. ∗ (b) If we put all the money on the ﬁrst horse. i. . 4.11) (6. w.085n = (1. . 2. and we will go broke with probability 1. 0) . but a “water-ﬁlling” solution results from the application of the Kuhn-Tucker conditions. (There isn’t an easy closed form solution in this case. Alternatively. ..140 Gambling and Data Compression 1 1 1 1 1 1 with equality iﬀ p = b . . . then the probability that we do not 1 n ) . . the gambler is allowed to keep some of the money as cash. .e. p) = E log S (X ) = i=1 pi log(b0 + bi oi ) (6. m . with probability p(x). .9) (6. . . and we have to use the calculus approach. x = 1. if b = (1. . . . 0. 4 .13) (b) Discuss b∗ if 1/o(i) > 1.1 (6.8) (6. b(m) be the amount bet on horses 1. Horse race with subfair odds.p. 4) = 9 1 2 log 8 = 0. (a) Find b∗ maximizing E log S if 1/o(i) < 1. 2. But in the case of subfair odds. Thus the resulting wealth is S (x) = b(0) + b(x)o(x). . . . o(m) .) Solution: (Horse race with a cash option). o(2). m. Let b(0) be the amount in his pocket and let b(1). with odds o(1).085 . . . In class.12) = = log 3b(Xj )) → 2 nE log 3b(X ) nW (b) When b = b∗ . 2. We want to maximize the expected log return. Since in this case. We will use both approaches for this problem.10) (6. m W (b. and win probabilities p(1). Sn = j 3b(Xj ) 2 2 . . the go broke in n races is ( 2 probability of the set of outcomes where we do not ever go broke is zero. Since this probability goes to zero with n . p(2). the relative entropy approach breaks down. Hence b∗ = p = ( 2 .06)n . p(m) . b(2). W (b) = W ∗ and Sn =2nW = 20. 
The setup of the problem is straightforward. We want to maximize the expected log return, i.e.,

W(b, p) = E log S(X) = Σ_{i=1}^{m} pi log(b0 + bi oi),

over all choices b with bi ≥ 0 and Σ_{i=0}^{m} bi = 1.

Approach 1: Relative entropy. We try to express W(b, p) as a sum of relative entropies:

W(b, p) = Σ pi log(b0 + bi oi) = Σ pi log( oi (b0/oi + bi) ) = Σ pi log( pi oi · K · ri / pi ) = Σ pi log(pi oi) + log K − D(p||r),

where

ri = (b0/oi + bi)/K and K = Σ_i (b0/oi + bi) = b0 (Σ_i 1/oi − 1) + 1.

K is a kind of normalized portfolio. To maximize W(b, p), we must maximize log K and at the same time minimize D(p||r). Now both K and r depend on the choice of b. Let us consider the two cases:

(a) Σ 1/oi ≤ 1. This is the case of superfair or fair odds. In these cases, it seems intuitively clear that we should put all of our money in the race. For example, in the case of a superfair gamble, one could invest any cash using a "Dutch book" (investing inversely proportional to the odds) and do strictly better with probability 1. Examining the expression for K, we see that it is maximized for b0 = 0. In this case, setting bi = pi would imply that ri = pi and hence D(p||r) = 0. We have succeeded in simultaneously maximizing the two variable terms in the expression for W(b, p), and this must be the optimal solution. Hence, for fair or superfair games, the gambler should invest all his money in the race using proportional gambling, and not leave anything aside as cash.

(b) Σ 1/oi > 1. This is the case of subfair odds, and here the argument breaks down. Looking at the expression for K, we see that it is maximized for b0 = 1, but then we cannot simultaneously minimize D(p||r). With b0 = 0, the best we can achieve is proportional betting, which sets the last term to 0; but if pi oi ≤ 1 for all horses, the first term, Σ pi log(pi oi), is negative, so we can then only achieve a negative expected log return. This is strictly worse than the 0 log return achieved by setting b0 = 1. This would indicate, but not prove, that in this case one should leave all one's money as cash. A more rigorous approach using calculus will prove this.

We can however give a simple argument to show that in the case of subfair odds, the gambler should leave at least some of his money as cash, and that there is at least one horse on which he does not bet any money. We prove this by contradiction: starting with a portfolio that does not satisfy these criteria, we generate one which does better with probability one. Let the amount bet on each of the horses be (b1, b2, ..., bm) with Σ_{i=1}^{m} bi = 1, so that there is no money left aside. Arrange the horses in order of decreasing bi oi, so that the m-th horse is the one with the minimum product. Consider a new portfolio with

bi' = bi − (bm om)/oi

for all i. Since bi oi ≥ bm om for all i, we have bi' ≥ 0. We keep the remaining money,

1 − Σ_i bi' = Σ_i (bm om)/oi,

as cash. The return on the new portfolio if horse i wins is

bi' oi + Σ_j (bm om)/oj = bi oi − bm om + bm om Σ_j 1/oj = bi oi + bm om (Σ_j 1/oj − 1) > bi oi,

since Σ_j 1/oj > 1. Hence irrespective of which horse wins, the new portfolio does better than the old one, and hence the old portfolio could not be optimal.
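The relative entropy decomposition used in Approach 1 can be checked numerically for an arbitrary cash/bet split. A sketch (the probabilities and subfair odds below are made-up illustration values, not from the text):

```python
from math import log2

p = (0.5, 0.3, 0.2)     # win probabilities (example values)
o = (1.8, 2.5, 3.0)     # subfair odds: sum(1/oi) > 1
b0 = 0.4                # cash
b = (0.3, 0.2, 0.1)     # bets; b0 + sum(b) = 1

# Direct computation of the expected log return.
W_direct = sum(pi * log2(b0 + bi * oi) for pi, bi, oi in zip(p, b, o))

# Decomposition: W = sum_i pi log2(pi oi) + log2 K - D(p || r).
K = sum(b0 / oi + bi for bi, oi in zip(b, o))   # equals b0 (sum 1/oi - 1) + 1
r = [(b0 / oi + bi) / K for bi, oi in zip(b, o)]
D = sum(pi * log2(pi / ri) for pi, ri in zip(p, r))
W_decomp = sum(pi * log2(pi * oi) for pi, oi in zip(p, o)) + log2(K) - D

assert abs(W_direct - W_decomp) < 1e-12
```

The identity holds term by term because b0 + bi oi = oi (b0/oi + bi) = oi K ri.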

Approach 2: Calculus. We set up the functional using Lagrange multipliers:

J(b) = Σ_{i=1}^{m} pi log(b0 + bi oi) + λ Σ_{i=0}^{m} bi.

Differentiating with respect to bi, we obtain

∂J/∂bi = pi oi/(b0 + bi oi) + λ = 0.

Differentiating with respect to b0, we obtain

∂J/∂b0 = Σ_{i=1}^{m} pi/(b0 + bi oi) + λ = 0.

Differentiating with respect to λ, we get the constraint Σ bi = 1. The solution to these three equations, if it exists, would give the optimal portfolio b. But substituting the first equation in the second, we obtain

λ Σ_i 1/oi = λ.

Clearly, when Σ 1/oi ≠ 1, the only solution to this equation is λ = 0, which indicates that the solution is on the boundary of the region over which the maximization is being carried out. Actually, we have been quite cavalier with the setup of the problem: in addition to the constraint Σ bi = 1, we have the inequality constraints bi ≥ 0, and we should have allotted a Lagrange multiplier to each of these. Rewriting the functional with these multipliers,

J(b) = Σ_{i=1}^{m} pi log(b0 + bi oi) + λ Σ_{i=0}^{m} bi + Σ_i γi bi.

Differentiating with respect to bi, we obtain

∂J/∂bi = pi oi/(b0 + bi oi) + λ + γi = 0,

and differentiating with respect to b0,

∂J/∂b0 = Σ_i pi/(b0 + bi oi) + λ + γ0 = 0.

Differentiating with respect to λ, we again get the constraint Σ bi = 1. Carrying out the same substitution, we get

λ + γ0 = λ Σ_i 1/oi + Σ_i γi/oi,

which indicates that if Σ 1/oi ≠ 1, at least one of the γ's is non-zero, i.e., the corresponding constraint has become active and the solution is on the boundary of the region.

In the case of solutions on the boundary, we have to use the Kuhn-Tucker conditions to find the maximum. These conditions are described in Gallager [2], pg. 87. They describe the behavior of the derivative at the maximum of a concave function over a convex region: for any coordinate in the interior of the region, the derivative should be 0; for any coordinate on the boundary, the derivative should be negative in the direction towards the interior of the region. More formally, for a concave function F(x1, x2, ..., xn) over the region xi ≥ 0,

∂F/∂xi ≤ 0 if xi = 0, and ∂F/∂xi = 0 if xi > 0.

Theorem 4.4.1 in Gallager [2] proves that if we can find a solution to the Kuhn-Tucker conditions, then that solution is the maximum of the function in the region. Applying the Kuhn-Tucker conditions to the present maximization, we obtain

pi oi/(b0 + bi oi) + λ ≤ 0 if bi = 0, and = 0 if bi > 0,

and

Σ_i pi/(b0 + bi oi) + λ ≤ 0 if b0 = 0, and = 0 if b0 > 0.

Let us consider the two cases:

(a) Σ 1/oi ≤ 1. In this case, we try the expected solution, b0 = 0 and bi = pi. Setting λ = −1, we find that all the Kuhn-Tucker conditions are satisfied. Hence this is the optimal portfolio for superfair or fair odds.

(b) Σ 1/oi > 1. In this case, we try the expected solution, b0 = 1 and bi = 0. We find that all the Kuhn-Tucker conditions are satisfied if all pi oi ≤ 1. Hence under this condition, the optimum solution is to not invest anything in the race but to keep everything as cash.

In the case when some pi oi > 1, the Kuhn-Tucker conditions are no longer satisfied by b0 = 1. We should then invest some money in the race; however, since the denominator of the expressions in the Kuhn-Tucker conditions also changes, more than one horse may now violate the conditions, and the optimum solution may involve investing in some horses with pi oi ≤ 1. There is no explicit form for the solution in this case. Instead, we can formulate a procedure for finding the optimum distribution of capital. Order the horses according to pi oi, so that

p1 o1 ≥ p2 o2 ≥ ... ≥ pm om.

Define

Ck = (1 − Σ_{i=1}^{k} pi) / (1 − Σ_{i=1}^{k} 1/oi) for k ≥ 1, and C0 = 1.

Define

t = min{ n : p_{n+1} o_{n+1} ≤ Cn }.

Clearly t ≥ 1, since p1 o1 > 1 = C0.
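The ordering procedure above can be sketched in code. The allocation it computes (b0 = Ct and bi = pi − Ct/oi for the first t horses, as stated in the claim that follows) is tried on a small two-horse example of our own; the function name is ours:

```python
def waterfill(p, o):
    """Water-filling bets for subfair odds (assumes sum(1/oi) > 1 and a valid t exists)."""
    m = len(p)
    idx = sorted(range(m), key=lambda i: p[i] * o[i], reverse=True)
    ps = [p[i] for i in idx]         # sorted so ps[0]*os_[0] >= ps[1]*os_[1] >= ...
    os_ = [o[i] for i in idx]

    def C(k):
        # C_0 = 1; C_k = (1 - sum_{i<=k} p_i) / (1 - sum_{i<=k} 1/o_i)
        return 1.0 if k == 0 else (1 - sum(ps[:k])) / (1 - sum(1 / x for x in os_[:k]))

    # t = min{ n : p_{n+1} o_{n+1} <= C_n }  (horses 1-indexed in the text)
    t = next(n for n in range(m) if ps[n] * os_[n] <= C(n))
    b0 = C(t)
    bets = [0.0] * m
    for j in range(t):               # first t horses get b_i = p_i - C_t / o_i
        bets[idx[j]] = ps[j] - b0 / os_[j]
    return b0, bets                  # (cash, bets in original horse order)

# Two-horse example (our numbers): p1 o1 = 1.2 > 1 = C_0, and p2 o2 = 0.48 <= C_1 = 0.8.
b0, bets = waterfill([0.6, 0.4], [2.0, 1.2])
assert abs(b0 - 0.8) < 1e-12
assert abs(bets[0] - 0.2) < 1e-12 and bets[1] == 0.0
assert abs(b0 + sum(bets) - 1.0) < 1e-12
```

With these numbers the attractive horse (p1 o1 = 1.2 > 1) gets a bet of 0.2, the other gets nothing, and 0.8 stays as cash, as the Kuhn-Tucker verification below requires.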

Claim: The optimal strategy for the horse race when the odds are subfair and some of the pi oi are greater than 1 is: set b0 = Ct, set bi = pi − Ct/oi for i = 1, 2, ..., t, and set bi = 0 for i = t+1, ..., m.

For 1 ≤ i ≤ t, we have b0 + bi oi = Ct + pi oi − Ct = pi oi, so the Kuhn-Tucker conditions reduce to

pi oi/(b0 + bi oi) = pi oi/(pi oi) = 1.

For t+1 ≤ i ≤ m, the Kuhn-Tucker conditions reduce to

pi oi/(b0 + bi oi) = pi oi/Ct ≤ 1,

by the definition of t. For b0, the Kuhn-Tucker condition is

Σ_i pi/(b0 + bi oi) = Σ_{i=1}^{t} pi/(pi oi) + Σ_{i=t+1}^{m} pi/Ct = Σ_{i=1}^{t} 1/oi + (1 − Σ_{i=1}^{t} pi)/Ct = Σ_{i=1}^{t} 1/oi + 1 − Σ_{i=1}^{t} 1/oi = 1.

The above choice of b therefore satisfies the Kuhn-Tucker conditions with λ = −1, and this is the optimal solution.

3. Cards. An ordinary deck of cards containing 26 red cards and 26 black cards is shuffled and dealt out one card at a time without replacement. Let Xi be the color of the ith card.

(a) Determine H(X1).
(b) Determine H(X2).
(c) Does H(Xk | X1, X2, ..., X_{k−1}) increase or decrease?
(d) Determine H(X1, X2, ..., X52).

Solution: Cards.

(a) P(first card red) = P(first card black) = 1/2. Hence H(X1) = (1/2) log 2 + (1/2) log 2 = log 2 = 1 bit.

(b) P(second card red) = P(second card black) = 1/2 by symmetry. There is no change in the probability from X1 to X2 (or to Xi, 1 ≤ i ≤ 52), since all the permutations of red and black cards are equally likely. Hence H(X2) = 1 bit.

(c) Since all permutations are equally likely, the joint distribution of Xk and (X1, ..., X_{k−1}) is the same as the joint distribution of X_{k+1} and (X1, ..., X_{k−1}). Therefore

H(Xk | X1, ..., X_{k−1}) = H(X_{k+1} | X1, ..., X_{k−1}) ≥ H(X_{k+1} | X1, ..., Xk),

and so the conditional entropy decreases as we proceed along the sequence. Knowledge of the past reduces uncertainty, and thus the conditional entropy of the k-th card's color given all the previous cards decreases as k increases.

(d) All (52 choose 26) possible sequences of 26 red cards and 26 black cards are equally likely. Thus

H(X1, X2, ..., X52) = log (52 choose 26) = 48.8 bits (3.2 bits less than 52).

4. Gambling. Suppose one gambles sequentially on the card outcomes in Problem 3. Even odds of 2-for-1 are paid. Thus the wealth Sn at time n is Sn = 2^n b(x1, x2, ..., xn), where b(x1, x2, ..., xn) is the proportion of wealth bet on x1, x2, ..., xn. Find max_{b(·)} E log S52.

Solution: Gambling on red and black cards.

E[log Sn] = E[log(2^n b(X1, X2, ..., Xn))] = n log 2 + E[log b(X)] = n + Σ_{x} p(x) log b(x) = n + Σ_{x} p(x) [log(b(x)/p(x)) + log p(x)] = n − D(p(x)||b(x)) − H(X).

Taking b(x) = p(x) makes D(p(x)||b(x)) = 0 and maximizes E log S52:

max_{b(x)} E log S52 = 52 − H(X) = 52 − log (52 choose 26) = 52 − 48.8 = 3.2 bits.

Alternatively, the wealth is

S52 = 2^52 / (52 choose 26) = 9.08,

and hence log S52 = log 9.08 = 3.2 bits. Thus, as in the horse race, proportional betting is log-optimal.
b3 ) such that W (b. p) > 0 where W (b. 4. 4 4 2 Thus the odds are (o1 . . (a) What is the entropy of the race? 147 (b) Find the set of bets (b1 . 1. Solution: Beating the public odds. the bet b ∗ that maximizes the exponent. 4 6. ) . Horse race: A 3 horse race has win probabilities p = (p 1 . and odds o = (1.e. p2 . bi = 1 . . i. Consider a 3-horse race with win probabilities 1 1 1 (p1 . The gambler places bets b = (b1 . (a) Find the exponent. b2 . ri Calculating D (p r) . o3 ) = (4. b2 . The gambler gets his money back on the winning horse and loses the other bets.. p) = D (p r) − D (p b) 3 = i=1 pi log bi . ) 2 4 4 and fair odds with respect to the (false) distribution 1 1 1 (r1 . (b) Find the optimal gambling scheme b . bi ≥ 0. b3 ) . 2 (b) Compounded wealth will grow to inﬁnity for the set of bets (b 1 . b2 . p3 ) . b3 ) such that the compounded wealth in repeated plays will grow to inﬁnity.Gambling and Data Compression 5. Beating the public odds. These odds are very bad. . (a) The entropy of the race is given by H (p) = = 1 1 1 log 2 + log 4 + log 4 2 4 4 3 . p2 . p3 ) = ( . o2 . 2) . 1) . where bi denotes the proportion on wealth bet on horse i . this criterion becomes D (p b) < 1 . r3 ) = ( . Thus the wealth S n at time n resulting from independent gambles goes expnentially to zero. r2 .
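The counting in Problems 3(d) and 4 can be confirmed with exact integer arithmetic (a quick sketch; log base 2):

```python
from math import comb, log2

n_cards = 52
H = log2(comb(52, 26))          # entropy of the red/black sequence, about 48.8 bits
growth = n_cards - H            # max E log2 S_52 = 52 - H(X), about 3.2 bits
S52 = 2 ** 52 / comb(52, 26)    # final wealth factor under proportional betting

assert abs(S52 - 9.08) < 0.01
assert abs(growth - log2(S52)) < 1e-9
print(round(H, 1), round(growth, 2), round(S52, 2))
```

The two routes agree: 52 minus the sequence entropy equals the log of the guaranteed wealth factor 2^52 / (52 choose 26).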

After by this strategy is equal to log 4 − H (p) = log 4 − H ( 2 20 races with this strategy. 2.12. 4 . Assume that each of the horses pays 1 1 1 4-for-1 if it wins. and the growth rate achieved 1 1 1 1 . Assume that the player of the game is required pay $1 to play and is asked to choose 1 number from a range 1 to 8. 8 } . . 8. and the others will receive nothing. each of persons who picked 2 will receive $10. f8 ) . all the money collected that day. and the exponent in this case is W∗ = i pi log pi = −H (p). Thus the optimal bets are b = p .5%. 8 is (f 1 . the wealth is approximately 2 nW = 25 = 32 . At the end of every day. i. Let the probabilities of winning of the horses be { 1 2 .61) (b) The optimal gambling strategy is still proportional betting. The general population does not choose numbers uniformly . (6. then the $100 collected is split among the 10 people. (a) What is the optimal strategy to divide your money among the various possible tickets so as to maximize your long term growth rate? (Ignore the fact that you cannot buy fractional tickets. if 100 people played today.. . dividing the investment in proportion to the probabilities of each horse winning.12. f2 . If you started with $100 and bet optimally to maximize your long term growth rate. Thus the worst W ∗ is − log 3 .. Consider a horse race with 4 horses. (a) Despite the bad odds. E. the optimal strategy is still proportional gambling.e. . so that any single person’s choice choice does not change the proportion of people betting on any number. . and the drawing at the end of the day picked 2. . . the state lottery commission picks a number uniformly over the same range.g. 25%. Lotto. .numbers like 3 and 7 are supposedly lucky and are more popular than 4 or 8.) . and assume that n people play every day.e. The following analysis is a crude approximation to the games of Lotto conducted by various states. what are your optimal bets on each horse? 
Approximately how much money would you have after 20 races with this strategy ? Solution: Horse race. what distribution p causes S n to go to zero at the fastest rate? Solution: Minimizing losses. Also assume that n is very large. 7... 8 . Horse race. 8 . 8 ) = 2 − 1. and the gambler’s money goes to zero as 3 −n . 4 . Assume that the fraction of people choosing the various numbers 1. i. and 10 of them chose the number 2. Thus the bets on each horse should be (50%.75 = 0.25 . The optimal betting strategy is proportional betting. The jackpot.e. (c) The worst distribution (the one that causes the doubling rate to be as negative as possible) is that distribution that maximizes the entropy.148 Gambling and Data Compression (c) Assuming b is chosen as in (b). i. and hence the wealth would grow approximately 32 fold over 20 races. .5%). is split among all the people who chose the same number as the one chosen by the state.
(b) What is the optimal growth rate that you can achieve in this game?

(c) If (f1, f2, ..., f8) = (1/8, 1/8, 1/4, 1/16, 1/16, 1/16, 1/4, 1/16), and you start with $1, how long will it be before you become a millionaire?

Solution: Lotto.

(a) The probability of winning does not depend on the number you choose, and therefore, irrespective of the proportions chosen by the other players, the log optimal strategy is to divide your money uniformly over all the tickets.

(b) If n people play and fi of them choose number i, then the number of people sharing the jackpot of n dollars is n fi, and therefore each winner receives n/(n fi) = 1/fi dollars if i is picked at the end of the day. Thus the odds for number i are 1/fi, and do not depend on the number of people playing. Using the results of Section 6.1, the optimal growth rate is

W*(p) = Σ pi log oi − H(p) = Σ_i (1/8) log(1/fi) − log 8.

(c) Substituting these fractions into the previous equation, we get

W*(p) = (1/8)(3 + 3 + 2 + 4 + 4 + 4 + 2 + 4) − 3 = 3.25 − 3 = 0.25,

and therefore after N days, the amount of money you would have is approximately 2^{0.25N}. This crosses a million after log2(1,000,000)/0.25 = 79.7 days, i.e., in 80 days you should have a million dollars.

There are many problems with this analysis, not the least of which is that the state governments take out about half the money collected, so that the jackpot is only half of the total collections. Under normal circumstances, this 50% cut makes the odds in the lottery very unfair, and it is not a worthwhile investment. However, the fact that people's choices are not uniform does leave a loophole that can be exploited: under certain conditions, e.g., if the accumulated jackpot has reached a certain size, the expected return can be greater than 1 despite the 50% cut, and it is worthwhile to play. Also, in a real lottery there are about 14 million different possible tickets, and it is therefore possible to use a uniform distribution using $1 tickets only if we use capital of the order of 14 million dollars. With such large investments, the proportions of money bet on the different possibilities will change, which would further complicate the analysis.

9. Horse race. Suppose one is interested in maximizing the doubling rate for a horse race. Let p1, p2, ..., pm denote the win probabilities of the m horses. When do the odds (o1, o2, ..., om) yield a higher doubling rate than the odds (o1', o2', ..., om')?
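The arithmetic in parts (b) and (c) of the Lotto problem can be checked directly (a sketch; log base 2):

```python
from math import log2, ceil

f = (1/8, 1/8, 1/4, 1/16, 1/16, 1/16, 1/4, 1/16)  # fractions choosing each number
p = 1 / 8                                          # the state's draw is uniform

# Betting uniformly, the payoff odds for number i are 1/f_i, so
# W* = sum_i p log2(1/f_i) - log2(8).
W = sum(p * log2(1 / fi) for fi in f) - 3
assert abs(W - 0.25) < 1e-12

# Days needed to turn $1 into $1,000,000 at doubling rate W.
days = log2(1_000_000) / W
assert ceil(days) == 80
print(round(days, 1))  # about 79.7
```

The log-odds terms are exactly (3, 3, 2, 4, 4, 4, 2, 4), matching the sum in the text.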

Solution: Horse race. Let W and W' denote the optimal doubling rates for the odds (o1, ..., om) and (o1', ..., om') respectively. By Theorem 6.1.2 in the book,

W = Σ pi log oi − H(p) and W' = Σ pi log oi' − H(p),

where p is the probability vector (p1, p2, ..., pm). Then W > W' exactly when Σ pi log oi > Σ pi log oi', that is, when E log oi > E log oi'.

10. Horse race with probability estimates.

(a) Three horses race. Their probabilities of winning are (1/2, 1/4, 1/4). The odds are (4-for-1, 3-for-1 and 3-for-1). Let W* be the optimal doubling rate. Suppose you believe the probabilities are (1/4, 1/2, 1/4). If you try to maximize the doubling rate, what doubling rate W will you achieve? By how much has your doubling rate decreased due to your poor estimate of the probabilities, i.e., what is ΔW = W* − W?

(b) Now let the horse race be among m horses, with probabilities p = (p1, p2, ..., pm) and odds o = (o1, o2, ..., om). If you believe the true probabilities to be q = (q1, q2, ..., qm), and try to maximize the doubling rate W, what is W* − W?

Solution: Horse race with probability estimates.

(a) If you believe that the probabilities of winning are (1/4, 1/2, 1/4), you would bet proportionally to this and achieve a growth rate

Σ pi log bi oi = (1/2) log(4 · 1/4) + (1/4) log(3 · 1/2) + (1/4) log(3 · 1/4) = (1/4) log(9/8).

If you bet according to the true probabilities, you would bet (1/2, 1/4, 1/4) on the three horses, achieving a growth rate

W* = (1/2) log(4 · 1/2) + (1/4) log(3 · 1/4) + (1/4) log(3 · 1/4) = (1/2) log(3/2).

The difference between the two growth rates is ΔW = (1/2) log(3/2) − (1/4) log(9/8) = (1/4) log 2 = 0.25 bit.

(b) In general, if you estimate the probabilities as q, you bet bi = qi, and the growth rate under the true distribution is Σ pi log qi oi, whereas the optimum is Σ pi log pi oi. The loss in growth rate is the difference

W* − W = Σ pi log pi oi − Σ pi log qi oi = Σ pi log(pi/qi) = D(p||q).

11. The two envelope problem. One envelope contains b dollars, the other 2b dollars. The amount b is unknown. An envelope is selected at random. Let X be the amount observed in this envelope, and let Y be the amount in the other envelope. Adopt the strategy of switching to the other envelope with probability p(x), where p(x) = e^{−x}/(e^{−x} + e^{x}). Let Z be the amount that the player receives. Thus

(X, Y) = (b, 2b) with probability 1/2, and (2b, b) with probability 1/2.
Z = X with probability 1 − p(x), and Z = Y with probability p(x).

(a) Show that E(X) = E(Y) = 3b/2.
(b) Show that E(Y/X) = 5/4.
(c) Let J be the index of the envelope containing the maximum amount of money, and let J' be the index of the envelope chosen by the algorithm. Show that for any b, I(J; J') > 0. Thus the amount in the first envelope always contains some information about which envelope to choose.
(d) Show that E(Z) > E(X). Thus you can do better than always staying or always switching.

Solution: Two envelope problem.

(a) X = b or 2b, each with probability 1/2, and therefore E(X) = 3b/2. Y has the same unconditional distribution, so E(Y) = 3b/2 as well.

(b) Given X = x, the other envelope contains 2x with probability 1/2 and x/2 with probability 1/2. Thus E(Y/X) = (1/2)(2) + (1/2)(1/2) = 5/4. Since the expected ratio of the amount in the other envelope to the one in hand is 5/4, it seems that one should always switch. (This is the origin of the switching paradox.) However, observe that E(Y) ≠ E(X) E(Y/X). Thus, although E(Y/X) > 1, it does not follow that E(Y) > E(X).

(c) Without any conditioning, J' = 1 or 2 with probability (1/2, 1/2), and by symmetry it is not difficult to see that the unconditional distribution of J is the same. We now show that the two random variables are not independent by calculating the conditional probability P(J' = 1 | J = 1). Conditioned on J = 1 (envelope 1 contains 2b), the opened envelope is envelope 1 or envelope 2 with probability 1/2 each. If envelope 1 was opened (X = 2b), then J' = 1 iff we stay, which has probability 1 − p(2b); if envelope 2 was opened (X = b), then J' = 1 iff we switch, which has probability p(b). Hence

P(J' = 1 | J = 1) = (1/2)(1 − p(2b)) + (1/2) p(b) = 1/2 + (1/2)(p(b) − p(2b)) > 1/2,

since p(x) is monotonically decreasing. Thus the conditional distribution of J' given J is not equal to its unconditional distribution, J and J' are not independent, and I(J; J') > 0. The amount in the first envelope always contains some information about which envelope to choose.
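Parts (a) and (b) are simple enough to verify by direct computation (a sketch, taking b = 1 so the envelopes hold 1 and 2 dollars):

```python
# Exact expectations for the two-envelope setup with b = 1.
b = 1.0
E_X = 0.5 * b + 0.5 * (2 * b)      # 3b/2: the opened envelope
E_Y = 0.5 * (2 * b) + 0.5 * b      # same unconditional distribution
E_ratio = 0.5 * 2 + 0.5 * 0.5      # E(Y/X) = 5/4

assert E_X == E_Y == 1.5 * b
assert E_ratio == 1.25
assert E_X * E_ratio != E_Y        # E(Y) != E(X) E(Y/X): the "paradox" dissolves
```

The last line is the point of part (b): the ratio being 5/4 on average does not make the other envelope worth more on average.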
(d) We use the same conditional calculation to compute E(Z). Conditioning on the amount X observed in the opened envelope,

E(Z) = (1/2) E(Z | X = b) + (1/2) E(Z | X = 2b)
     = (1/2) [ (1 − p(b)) b + p(b) 2b ] + (1/2) [ (1 − p(2b)) 2b + p(2b) b ]
     = 3b/2 + (b/2)(p(b) − p(2b))
     > 3b/2,

as long as p(b) − p(2b) > 0, which holds for the given p(x). In fact, this is true for any monotonically decreasing switching function p(x): by randomly switching according to p(x), you are more likely to trade up than to trade down. Since E(X) = 3b/2, we have E(Z) > E(X), so you can do better than always staying or always switching.

12. Gambling. Find the horse win probabilities p1, p2, ..., pm

(a) maximizing the doubling rate W* for given fixed known odds o1, o2, ..., om;
(b) minimizing the doubling rate for given fixed odds o1, o2, ..., om.

Solution: Gambling.

(a) From Theorem 6.1.2, W* = Σ pi log oi − H(p). We can rewrite this as

W* = Σ_i pi log(pi oi) = Σ_i pi log(pi/qi) − log( Σ_j 1/oj ) = D(p||q) − log( Σ_j 1/oj ),

where qi = (1/oi) / (Σ_j 1/oj).

The doubling rate is therefore maximized by maximizing D(p||q), which occurs when all the probability is concentrated on the horse with the maximum odds, i.e., pi = 1 for the horse that provides the maximum oi, giving W* = log o_max.

(b) The minimum value of the doubling rate occurs when D(p||q) = 0, i.e., when pi = qi, so that the win probabilities are inversely proportional to the odds. This is the distribution that minimizes the growth rate, and the minimum value is −log( Σ_j 1/oj ).

13. Dutch book. Consider a horse race with m = 2 horses,

X = 1, 2 with probabilities p = 1/2, 1/2 and odds (for one) 10, 30.

The odds are superfair.

(a) There is a bet b which guarantees the same payoff regardless of which horse wins. Such a bet is called a Dutch book. Find this b and the associated wealth factor S(X).

(b) What is the maximum growth rate of the wealth for this gamble? Compare it to the growth rate for the Dutch book.

Solution: Dutch book. Let the bets be (b, 1 − b).

(a) The Dutch book requires 10 b_D = 30 (1 − b_D), i.e., 40 b_D = 30, so b_D = 3/4. The payoff is S(X) = 10 · (3/4) = 30 · (1/4) = 7.5 regardless of which horse wins, with growth rate

W(b_D, p) = (1/2) log(10 · 3/4) + (1/2) log(30 · 1/4) = log 7.5 = 2.91,

and S_D = 2^{W(b_D, p)} = 7.5.
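The Dutch book of part (a) and the growth-optimal bet derived in part (b) can both be checked numerically (a sketch; log base 2):

```python
from math import log2

odds = (10.0, 30.0)
p = (0.5, 0.5)

# (a) Dutch book: equal payoff regardless of winner, 10 b = 30 (1 - b).
b_dutch = 30 / 40
payoffs = (odds[0] * b_dutch, odds[1] * (1 - b_dutch))
assert payoffs[0] == payoffs[1] == 7.5
W_dutch = sum(pi * log2(s) for pi, s in zip(p, payoffs))   # log2(7.5), about 2.91

# (b) Growth-optimal bet: maximize (1/2)log(10b) + (1/2)log(30(1-b)), giving b* = 1/2.
b_star = 0.5
W_star = 0.5 * log2(odds[0] * b_star) + 0.5 * log2(odds[1] * (1 - b_star))
assert W_star > W_dutch                                    # about 3.11 vs 2.91
assert abs(2 ** W_star - 75 ** 0.5) < 1e-9                 # S* = sqrt(75), about 8.66
```

The Dutch book buys certainty (7.5 no matter what) at the cost of a lower long-run growth rate than proportional betting.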
(b) In general,

W(b, p) = (1/2) log(10b) + (1/2) log(30(1 − b)).

Setting ∂W/∂b = 0 gives

1/(2b*) − 1/(2(1 − b*)) = 0, i.e., 2b* − 1 = 0, so b* = 1/2.

Hence

W* = (1/2) log 5 + (1/2) log 15 = 3.11 and S* = 2^{W*} = 8.66,

compared with W(b_D, p) = 2.91 and S_D = 7.5 for the Dutch book: proportional betting grows faster, at the cost of a variable payoff.

14. Horse race. (Repeat of Problem 9.) Suppose one is interested in maximizing the doubling rate for a horse race. Let p1, p2, ..., pm denote the win probabilities of the m horses. When do the odds (o1, o2, ..., om) yield a higher doubling rate than the odds (o1', o2', ..., om')?

Solution: Horse race (repeat of Problem 9). Let W and W' denote the optimal doubling rates for the odds (o1, ..., om) and (o1', ..., om') respectively. By Theorem 6.1.2 in the book,

W = Σ pi log oi − H(p) and W' = Σ pi log oi' − H(p),

where p is the probability vector (p1, p2, ..., pm). Then W > W' exactly when Σ pi log oi > Σ pi log oi', that is, when E log oi > E log oi'.

15. Entropy of a fair horse race. Let X ~ p(x), x = 1, 2, ..., m, denote the winner of a horse race. Suppose the odds o(x) are fair with respect to p(x), i.e., o(x) = 1/p(x). Let b(x) be the amount bet on horse x, b(x) ≥ 0, Σ_x b(x) = 1. Then the resulting wealth factor is S(x) = b(x) o(x), with probability p(x).

(a) Find the expected wealth ES(X).
(b) Find W*, the optimal growth rate of wealth.
(c) Suppose Y = 1 if X = 1 or 2, and Y = 0 otherwise. If this side information is available before the bet, by how much does it increase the growth rate W*?
(d) Find I(X; Y).

Solution: Entropy of a fair horse race.

(a) The expected wealth is

ES(X) = Σ_{x=1}^{m} S(x) p(x) = Σ_x b(x) o(x) p(x) = Σ_x b(x) (since o(x) = 1/p(x)) = 1.

(b) The optimal growth rate of wealth, W*, is achieved when b(x) = p(x) for all x, in which case

W* = E log S(X) = Σ_x p(x) log(b(x) o(x)) = Σ_x p(x) log(p(x)/p(x)) = Σ_x p(x) log 1 = 0,

so we can only maintain our current wealth in the long run.

(c) The increase in the growth rate due to the side information is I(X; Y). Let q = Pr(Y = 1) = p(1) + p(2). Since Y is a deterministic function of X,

I(X; Y) = H(Y) − H(Y | X) = H(Y) = H(q).

(d) Already computed in (c): I(X; Y) = H(q).
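A quick numerical check of parts (a) and (b), using example win probabilities of our own choosing:

```python
from math import log2

p = (0.5, 0.25, 0.125, 0.125)     # example win probabilities (ours)
o = [1 / pi for pi in p]          # fair odds o(x) = 1/p(x)
b = p                             # proportional (log-optimal) bets

E_S = sum(pi * bi * oi for pi, bi, oi in zip(p, b, o))      # expected wealth
W = sum(pi * log2(bi * oi) for pi, bi, oi in zip(p, b, o))  # growth rate

assert abs(E_S - 1.0) < 1e-12   # ES(X) = sum_x b(x) = 1
assert abs(W) < 1e-12           # W* = 0: a fair race only preserves wealth
```

Any choice of p gives the same conclusion, since b(x) o(x) = 1 at the optimum.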

16. Negative horse race. Consider a horse race with m horses with win probabilities p1, p2, ..., pm. Here the gambler hopes a given horse will lose. He places bets (b1, b2, ..., bm), Σ_i bi = 1, on the horses, loses his bet bi if horse i wins, and retains the rest of his bets. (No odds.) Thus S = Σ_{j ≠ i} bj = 1 − bi, with probability pi, and one wishes to maximize Σ pi ln(1 − bi) subject to the constraint Σ bi = 1.

(a) Find the growth rate optimal investment strategy b*. Do not constrain the bets to be positive, but do constrain the bets to sum to 1. (This effectively allows short selling and margin.)

(b) What is the optimal growth rate?

Solution: Negative horse race.

(a) Let bi' = 1 − bi ≥ 0, and note that Σ_i bi' = m − 1. Let qi = bi'/Σ_j bj' = (1 − bi)/(m − 1), so that {qi} is a probability distribution on {1, 2, ..., m}. Then

W = Σ_i pi log(1 − bi) = Σ_i pi log( qi (m − 1) ) = log(m − 1) − H(p) − D(p||q).

Thus W* is obtained upon setting D(p||q) = 0, i.e., making pi = qi = (1 − bi)/(m − 1), or bi = 1 − (m − 1) pi. Alternatively, one can use Lagrange multipliers to solve the problem.

(b) From (a), setting D(p||q) = 0 gives W* = log(m − 1) − H(p).

17. The St. Petersburg paradox. Many years ago in ancient St. Petersburg the following gambling proposition caused great consternation. For an entry fee of c units, a gambler receives a payoff of 2^k units with probability 2^{−k}, k = 1, 2, ....

(a) Show that the expected payoff for this game is infinite. For this reason, it was argued that c = ∞ was a "fair" price to pay to play this game. Most people find this answer absurd.

(b) Suppose that the gambler can buy a share of the game. For example, if he invests c/2 units in the game, he receives 1/2 a share and a return X/2, where Pr(X = 2^k) = 2^{−k}, k = 1, 2, .... Suppose X1, X2, ... are i.i.d. according to this distribution and the gambler reinvests all his wealth each time. Thus his wealth Sn at time n is given by

Sn = Π_{i=1}^{n} (Xi / c).

Show that this limit is ∞ or 0, with probability one, accordingly as c < c* or c > c*. Identify the "fair" entry fee c*.

More realistically, the gambler should be allowed to keep a proportion b̄ = 1 − b of his money in his pocket and invest the rest in the St. Petersburg game. His wealth at time n is then

Sn = Π_{i=1}^{n} (1 − b + b Xi / c).

Let

W(b, c) = Σ_{k=1}^{∞} 2^{−k} log(1 − b + b 2^k / c).

Then Sn ≐ 2^{n W(b, c)}. Let

W*(c) = max_{0 ≤ b ≤ 1} W(b, c).

Here are some questions about W*(c):

(c) For what value of the entry fee c does the optimizing value b* drop below 1?
(d) How does b* vary with c?
(e) How does W*(c) fall off with c?

Note that since W*(c) > 0 for all c, we can conclude that any entry fee c is fair.

Solution: The St. Petersburg paradox.

(a) The expected payoff is

EX = Σ_{k=1}^{∞} P(X = 2^k) 2^k = Σ_{k=1}^{∞} 2^{−k} 2^k = Σ_{k=1}^{∞} 1 = ∞.

Thus the expected return on the game is infinite.

(b) By the strong law of large numbers,

(1/n) log Sn = (1/n) Σ_{i=1}^{n} log Xi − log c → E log X − log c, w.p. 1,

and therefore Sn goes to infinity or to 0 according to whether E log X is greater or less than log c. The fair entry fee is thus determined by

log c* = E log X = Σ_{k=1}^{∞} k 2^{−k} = 2,

i.e., c* = 4 units, if the gambler is forced to invest all his money.
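The fair fee c* = 4 follows from E log2 X = 2; a truncated-series check (a sketch):

```python
# E log2 X = sum_k k 2^{-k} = 2, so the fee at which W(1, c) = 0 is c* = 2^2 = 4.
E_log_X = sum(k * 2 ** -k for k in range(1, 200))
assert abs(E_log_X - 2.0) < 1e-12
c_star = 2 ** E_log_X
assert abs(c_star - 4.0) < 1e-9

# The expected payoff itself diverges: the partial sums grow without bound.
partial = [sum(2 ** -k * 2 ** k for k in range(1, n + 1)) for n in (10, 100)]
assert partial[0] == 10 and partial[1] == 100
```

Truncating at k = 200 loses only an exponentially small tail, far below the tolerance used.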

(c) If the gambler is not required to invest all his money, the growth rate is

W(b, c) = Σ_{k=1}^{∞} 2^{−k} log(1 − b + b 2^k / c).

For b = 0, W = 0, and for b = 1, W = E log X − log c = 2 − log c. Differentiating to find the optimum value of b, we obtain

∂W(b, c)/∂b = Σ_{k=1}^{∞} 2^{−k} (−1 + 2^k/c) / (1 − b + b 2^k/c).

Unfortunately, there is no explicit solution for the b that maximizes W for a given value of c, and we have to solve this numerically on the computer. We have illustrated the results with three plots. The first (Figure 6.1) shows W(b, c) as a function of b and c; the second (Figure 6.2) shows b* as a function of c; and the third (Figure 6.3) shows W* as a function of c.

Figure 6.1: St. Petersburg: W(b, c) as a function of b and c. (Surface plot omitted.)

From Figure 6.2 it is clear that b* is less than 1 for c > 3. We can also see this analytically by calculating the slope of W(b, c) at b = 1:

∂W/∂b |_{b=1} = Σ_k 2^{−k} (−1 + 2^k/c) (c/2^k) = Σ_k (2^{−k} − c 2^{−2k}) = 1 − c/3,

which is positive for c < 3. Thus for c < 3 the optimal value of b lies on the boundary b = 1, and for c > 3 it lies in the interior.
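The numerical optimization described above can be sketched with a simple grid search (truncating the series at k = 50; the grid resolution and function names are ours):

```python
from math import log2

def W(b, c, kmax=50):
    """Growth rate W(b, c) = sum_k 2^{-k} log2(1 - b + b 2^k / c), truncated at kmax."""
    return sum(2 ** -k * log2(1 - b + b * 2 ** k / c) for k in range(1, kmax + 1))

def b_opt(c):
    """Grid search for the maximizing investment fraction b*."""
    grid = [i / 1000 for i in range(1001)]
    return max(grid, key=lambda b: W(b, c))

# The slope at b = 1 is (1 - c/3)/ln 2: positive for c < 3 (so b* = 1),
# negative for c > 3 (so b* moves into the interior).
assert b_opt(2.0) == 1.0
b5 = b_opt(5.0)
assert 0.0 < b5 < 1.0
assert W(b5, 5.0) > 0.0   # mixing with cash keeps the growth rate positive
```

At c = 5 full investment has W(1, 5) = 2 − log2(5) < 0, yet a partial investment still achieves a positive growth rate, which is why every entry fee is "fair" in this relaxed sense.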

Figure 6.2: St. Petersburg: b* as a function of c. (Plot omitted.)

Figure 6.3: St. Petersburg: W*(b*, c) as a function of c. (Plot omitted.)

The derivative of W(b, c) with respect to b at b = 1 is

∂W/∂b |_{b=1} = ∑_{k=1}^∞ 2^{-k} (2^k/c − 1)/(2^k/c)   (6.106)
             = ∑_{k=1}^∞ (2^{-k} − c 2^{-2k})   (6.107)
             = 1 − c/3,   (6.108)

which is positive for c < 3. Thus for c < 3, the optimal value of b lies on the boundary of the region of b's, and for c > 3, the optimal value of b lies in the interior.

(d) The variation of b* with c is shown in Figure 6.2.

(e) The variation of W* with c is shown in Figure 6.3.

18. Super St. Petersburg. Finally, we have the super St. Petersburg paradox, where Pr(X = 2^{2^k}) = 2^{-k}, k = 1, 2, .... Here the expected log wealth is infinite for all b > 0, for all c, and the gambler's wealth grows to infinity faster than exponentially for any b > 0. But that doesn't mean all investment ratios b are equally good. To see this, we wish to maximize the relative growth rate with respect to some other portfolio, say, b = (1/2, 1/2). Show that there exists a unique b maximizing

E ln [ ((1 − b) + bX/c) / (1/2 + (1/2)X/c) ]

and interpret the answer.

Solution: Super St. Petersburg. With Pr(X = 2^{2^k}) = 2^{-k}, k = 1, 2, ..., we have

E log X = ∑_{k=1}^∞ 2^{-k} log 2^{2^k} = ∑_{k=1}^∞ 2^{-k} 2^k = ∞,   (6.109)

and thus with any constant entry fee, the gambler's money grows to infinity faster than exponentially, since for any b > 0,

W(b, c) = ∑_{k=1}^∞ 2^{-k} log( (1 − b) + b 2^{2^k}/c ) > ∑_{k=1}^∞ 2^{-k} log( b 2^{2^k}/c ) = ∞.   (6.110)

But if we wish to maximize the wealth relative to the (1/2, 1/2) portfolio, we need to maximize

J(b, c) = ∑_{k=1}^∞ 2^{-k} log [ ((1 − b) + b 2^{2^k}/c) / (1/2 + (1/2) 2^{2^k}/c) ].   (6.111)

As in the case of the St. Petersburg problem, we cannot solve this problem explicitly. In this case, a computer solution is fairly straightforward, although there are some complications. For example, for k ≥ 6, 2^{2^k} is outside the normal range of numbers representable on a standard computer. However, for k ≥ 6, we can approximate the ratio within the log by b/0.5 = 2b without any loss of accuracy. Using this, we can do a simple numerical computation as in the previous problem.

As before, we have illustrated the results with three plots. The first (Figure 6.4) shows J(b, c) as a function of b and c, the second (Figure 6.5) shows b* as a function of c, and the third (Figure 6.6) shows J* as a function of c.

[Figure 6.4: Super St. Petersburg: expected log ratio J(b, c) as a function of b and c.]

These plots indicate that for large values of c, the optimum strategy is not to put all the money into the game, even though the money grows at an infinite rate. There exists a unique b* which maximizes the expected ratio, and this b* causes the wealth to grow to infinity at the fastest possible rate. Thus there exists an optimal b* even when the log optimal portfolio is undefined. As c → ∞, b* → 0. We have a conjecture (based on numerical results) that b* → (1/√2) c 2^{−c} as c → ∞, but we do not have a proof.
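The "simple numerical computation" described above can be sketched as follows. This is my own code, not the authors': the first five terms of (6.111) are computed exactly (2^{2^k} still fits in a double there), the k ≥ 6 tail uses the 2b approximation and sums analytically to 2^{-5} log₂(2b), and b* is found by grid search with an arbitrary resolution.

```python
# Sketch of maximizing J(b, c) from (6.111) for the super St. Petersburg game.
# Assumption: k >= 6 terms replaced by 2^{-k} log2(2b), per the text, so the
# huge payoff 2^{2^k} is never formed for large k.
import math

def J(b, c):
    """Expected log wealth ratio vs. the (1/2, 1/2) portfolio, for 0 < b <= 1."""
    total = 0.0
    for k in range(1, 6):               # exact terms: x = 2^(2^k) up to 2^32
        x = 2.0 ** (2 ** k)
        ratio = ((1.0 - b) + b * x / c) / (0.5 + 0.5 * x / c)
        total += 2.0**-k * math.log2(ratio)
    return total + 2.0**-5 * math.log2(2.0 * b)   # tail k >= 6: ratio ~= 2b

def bstar(c, grid=2000):
    """Grid search for the maximizing fraction b* on (0, 1]."""
    bs = [i / grid for i in range(1, grid + 1)]
    return max(bs, key=lambda b: J(b, c))
```

The tail replacement is lossless in double precision: already at k = 6 the ratio ((1−b) + b·2^{64}/c)/(1/2 + 2^{64}/(2c)) agrees with 2b to within rounding error.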

[Figure 6.5: Super St. Petersburg: b* as a function of c.]

[Figure 6.6: Super St. Petersburg: J*(b*, c) as a function of c.]

