
Information theory—homework exercises

Edited by: Gábor Lugosi

1 Entropy, source coding


Problem 1 (Alternative definition of unique decodability) A code f : X → Y∗ is called uniquely
decodable if for any messages u = u1 · · · uk and v = v1 · · · vk (where u1, v1, . . . , uk, vk ∈ X ) with

f(u1)f(u2) · · · f(uk) = f(v1)f(v2) · · · f(vk),

we have ui = vi for all i. That is, as opposed to the definition given in class, we only require that any two
messages of the same length whose codes are equal must themselves be equal. Prove that the two definitions
are equivalent.

Problem 2 (Average length of the optimal code) Show that the expected length of the codewords
of the optimal binary code may be arbitrarily close to H(X) + 1. More precisely, for any small ε > 0,
construct a distribution on the source alphabet X such that the average codeword length of the optimal
binary code satisfies

E|f(X)| > H(X) + 1 − ε.

Problem 3 (Equality in Kraft’s inequality) A prefix code f is called full if it loses its prefix property
by adding any new codeword to it. A string x is called undecodable if it is impossible to construct a sequence
of codewords such that x is a prefix of their concatenation. Show that the following three statements are
equivalent.
(a) f is full,

(b) there is no undecodable string with respect to f ,


(c) \sum_{i=1}^{n} s^{-l_i} = 1, where s is the cardinality of the code alphabet, li is the codeword length of the ith
codeword, and n is the number of codewords.

Problem 4 (Shannon-Fano code) Consider the following code construction. Order the elements of the
source alphabet X according to their decreasing probabilities: p(x1 ) ≥ p(x2 ) ≥ · · · ≥ p(xn ) > 0. Introduce
the numbers wi as follows:
w1 = 0,   wi = \sum_{j=1}^{i-1} p(x_j)   (i = 2, . . . , n).

Consider the binary expansion of the numbers wi. Write down the binary expansion of wi up to the first bit
at which it differs from the expansions of all wj (j ≠ i). In this way we obtain n finite strings.
Define the binary codeword f(xi) as the obtained binary expansion of wi following the binary point. Prove
that the lengths of the codewords of the obtained code satisfy

|f (xi )| < − log p(xi ) + 1.

Therefore, the expected codeword length is smaller than the entropy plus one.
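
For concreteness, the construction can be sketched in a few lines of Python (an illustrative sketch only; the
probability values below are arbitrary example numbers, and exact rational arithmetic is used to avoid rounding
issues in the binary expansions):

from fractions import Fraction
from itertools import accumulate
import math

p = [Fraction(4, 10), Fraction(35, 100), Fraction(1, 10), Fraction(1, 10), Fraction(5, 100)]
w = [Fraction(0)] + list(accumulate(p))[:-1]          # w_i = p(x_1) + ... + p(x_{i-1})

def first_bits(x, n):
    """First n digits of the binary expansion of x in [0, 1)."""
    digits = []
    for _ in range(n):
        x *= 2
        d = int(x)                                    # next binary digit (floor)
        digits.append(str(d))
        x -= d
    return "".join(digits)

codewords = []
for i, wi in enumerate(w):
    L = 1                                             # shortest prefix of w_i that differs from all other w_j
    while any(first_bits(wi, L) == first_bits(wj, L) for j, wj in enumerate(w) if j != i):
        L += 1
    codewords.append(first_bits(wi, L))

for pi, c in zip(p, codewords):
    print(float(pi), c, len(c), "<=", math.ceil(-math.log2(pi)))   # check |f(x_i)| <= ceil(-log p(x_i))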

Problem 5 (Bad codes) Which of the following binary codes cannot be a Huffman code for any distribu-
tion? Verify your answer.

(a) 0, 10, 111, 101

(b) 00, 010, 011, 10, 110

(c) 1, 000, 001, 010, 011

Problem 6 Assume that the probability of each element of the source alphabet X = {x1 , . . . , xn } is of the
form 2^{−i}, where i is a positive integer. Prove that the Shannon-Fano code is optimal. Show that the average
codeword length of a binary Huffman code is equal to the entropy if and only if the distribution is of the
described form.

Problem 7 Assume that the source alphabet X has five elements with the following corresponding prob-
abilities: 0.4; 0.35; 0.1; 0.1; 0.05. Determine the entropy of the source. Construct a Shannon-Fano code the
way it is described in Problem 4, and construct a binary prefix code with codeword lengths li = ⌈− log p(xi)⌉
using the binary tree representation shown in class. What is the average codeword length?
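
A quick numeric self-check of the entropy and of the lengths ⌈− log p(xi)⌉ for this distribution (a sketch only;
the hand construction of the two codes is of course still the point of the exercise):

import math

p = [0.4, 0.35, 0.1, 0.1, 0.05]
H = -sum(q * math.log2(q) for q in p)                 # source entropy in bits
lengths = [math.ceil(-math.log2(q)) for q in p]       # ceil(-log p) codeword lengths
avg = sum(q * l for q, l in zip(p, lengths))          # average length of that prefix code
kraft = sum(2.0 ** -l for l in lengths)               # must be <= 1 for a binary prefix code
print(f"H(X) = {H:.4f}, lengths = {lengths}, average length = {avg:.4f}, Kraft sum = {kraft}")
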
Problem 8 (The Shannon-Fano code is almost optimal) Let l(x) = ⌈log(1/p(x))⌉ be the length of the
codeword corresponding to x ∈ X in a binary Shannon-Fano code, where p(x) = P{X = x}. Let l′(x) be
the codeword length of x in an arbitrary uniquely decodable code. Prove that for any c > 1,

P{l(X) ≥ l′(X) + c} ≤ 1/2^{c−1}.
Problem 9 We toss a coin until we get a tail. Let X denote the number of tosses. Determine the entropy
of X.

Problem 10 (Log-Sum inequality) Assume that the numbers ai, bi ≥ 0, i = 1, . . . , n satisfy

\sum_{i=1}^{n} ai = a   and   \sum_{i=1}^{n} bi = b.

Prove that

\sum_{i=1}^{n} ai log(bi/ai) ≤ a log(b/a),

with equality if and only if bi/ai is the same constant for all i.

Problem 11 (The entropy of a more uniform distribution is larger) Show that the entropy of
the distribution
(p1, . . . , pi, . . . , pj, . . . , pn)
cannot be larger than the entropy of the distribution
(p1, . . . , (pi + pj)/2, . . . , (pi + pj)/2, . . . , pn).
Problem 12 (The optimal code of a more uniform distribution is worse) Consider the distribu-
tions
p = (p1, . . . , pi, . . . , pj, . . . , pn)   and   q = (p1, . . . , (pi + pj)/2, . . . , (pi + pj)/2, . . . , pn).
Show that the expected codeword length of the optimal code of q (i.e., that of minimal expected code-
word length) cannot be smaller than that of p. (Where the expected values are taken with respect to the
corresponding distribution.)

Problem 13 (Information divergence) Let p = (p1, . . . , pn) and q = (q1, . . . , qn) be probability distribu-
tions and define the “information distance” between them by the expression

D(p|q) = \sum_{i=1}^{n} pi log(pi/qi).

(This quantity is sometimes called relative entropy or Kullback-Leibler distance.) Verify the following prop-
erties.

• D(p|q) ≥ 0 with equality if and only if p = q.

• H(p) = log n − D(p|u), where H(p) denotes the entropy of p, and u denotes the uniform distribution
on the set {1, . . . , n}.

Problem 14 (Unknown distribution) Assume that the random variable X has distribution p = (p1, . . . , pn),
but this distribution is not known exactly. Instead, we are given the distribution q = (q1, . . . , qn), and we
design a Shannon-Fano code according to these probabilities, i.e., with codeword lengths li = ⌈log(1/qi)⌉,
i = 1, . . . , n. Show that the expected codeword length of the obtained code satisfies

H(p) + D(p|q) ≤ \sum_{i=1}^{n} pi li < H(p) + D(p|q) + 1.

This means that the price we pay for not knowing the distribution exactly is about the information divergence
(which is always nonnegative).

Problem 15 (Horse racing) n horses run at a horse race, and the i-th horse wins with probability p(i).
If the i-th horse wins, the payoff is o(i) > 0, i.e., if you bet x Forints on horse i, you get xo(i) Forints if horse
i wins, and zero Forints if any other horse wins. The race is run k times without changing the probabilities
and the odds. Let the random variables X1 , . . . , Xk denote the numbers of the winning horses in each round.
Assume that you start with one unit of money, and in each round you invest all the money you have such
that in each round the fraction of your money that you bet on horse i is b(i) (with \sum_{i=1}^{n} b(i) = 1). Clearly, after
k rounds your wealth is

Sk = \prod_{i=1}^{k} b(Xi)o(Xi).

For a given distribution p = (p(1), . . . , p(n)) and betting strategy b = (b(1), . . . , b(n)), define

W(b, p) = E log(b(X1)o(X1)) = \sum_{i=1}^{n} p(i) log(b(i)o(i)).

Show that

Sk ≈ 2^{kW(b,p)},

where ak ≈ bk means that lim_{k→∞} (1/k) log(ak/bk) = 0, and the above convergence is meant in probability.
Therefore, in the long run, maximizing W(b, p) results in the maximal wealth. Prove that the optimal betting
strategy is b = p, i.e., you should distribute your money proportionally to the winning probabilities, inde-
pendently of the odds (!!!).
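
A small simulation helps build intuition for the claim Sk ≈ 2^{kW(b,p)} (an illustrative sketch; the
probabilities and odds below are made-up example values, not part of the problem):

import math, random

def empirical_growth_rate(p, odds, b, k, seed=0):
    """Return (1/k) log2(S_k) for one simulated run of k races with betting fractions b."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for _ in range(k):
        i = rng.choices(range(len(p)), weights=p)[0]   # index of the winning horse in this race
        log_wealth += math.log2(b[i] * odds[i])        # wealth is multiplied by b(i)o(i)
    return log_wealth / k

p    = [0.5, 0.3, 0.2]                                 # winning probabilities (example)
odds = [1.5, 4.0, 6.0]                                 # payoffs o(i) (example)

def W(b):
    return sum(pi * math.log2(bi * oi) for pi, bi, oi in zip(p, b, odds))

for b in ([0.5, 0.3, 0.2], [1/3, 1/3, 1/3], [0.7, 0.2, 0.1]):
    print(b, "W(b,p) =", round(W(b), 4),
          "empirical (1/k)log2 Sk =", round(empirical_growth_rate(p, odds, b, 200_000), 4))

Proportional betting (the first strategy, b = p) should show the largest growth rate among the three.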

Problem 16 (Shannon-Fano and Huffman codes) Let the random variable X be distributed as

(1/3; 1/3; 1/4; 1/12).

Construct a Huffman code. Show that there are two different optimal codes, with codeword lengths (1; 2; 3; 3)
and (2; 2; 2; 2). Conclude that there exists an optimal code such that some of its codewords are longer than
those of the corresponding Shannon-Fano code.
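
A standard heap-based Huffman construction can be used to check the optimal codeword lengths (a sketch;
note that ties in the merging step are exactly what produce the two different optimal length profiles, so a
program returns only one of them):

import heapq
from fractions import Fraction

def huffman_lengths(probs):
    """Codeword lengths of one binary Huffman code for the given probabilities."""
    # heap items: (subtree probability, tie-break counter, symbols in the subtree)
    heap = [(q, i, [i]) for i, q in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        q1, _, s1 = heapq.heappop(heap)
        q2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                  # every symbol of the merged subtree moves one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (q1 + q2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 4), Fraction(1, 12)]
lengths = huffman_lengths(probs)
print("lengths:", lengths, "expected length:", float(sum(q * l for q, l in zip(probs, lengths))))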

Problem 17 A sequence of six symbols is independently drawn according to the distribution of the random
variable X. The sequence is encoded symbol-by-symbol using an optimal (Huffman) code. The resulting
binary string is 10110000101. We know that the source alphabet has five elements, but we only know
that the distribution is one of the two distributions {0.4; 0.3; 0.2; 0.05; 0.05} and {0.3; 0.25; 0.2; 0.2; 0.05}.
Determine the distribution of X.

Problem 18 We are asked to determine an object by asking yes-no questions. The object is drawn randomly
from a finite set according to a certain distribution. Playing optimally, we need 38.5 questions on the average
to find the object. At least how many elements does the finite set have?

Problem 19 (Shortest codeword of Huffman codes) Suppose that we have an optimal binary
prefix code for the distribution (p1, . . . , pn), where p1 > p2 > · · · > pn > 0. Show that

• If p1 > 2/5 then the corresponding codeword has length 1.

• If p1 < 1/3 then the corresponding codeword has length at least 2.

Problem 20 (Basketball play-offs) The NBA play-offs are played between team A and team B in a
seven-game series that terminates as soon as one of the teams wins four games. Let the random variable X
represent the outcome of the series of games (possible values of X are AAAA or BABABAB or AAABBBB).
Let Y denote the number of games played. Assuming that the two teams are equally strong, determine the
values of H(X), H(Y ), H(Y |X) and H(X|Y ).
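
A brute-force enumeration is a convenient way to check a hand computation of these entropies (a sketch; it
lists every series that ends as soon as one team reaches four wins, with all games independent and fair):

import math
from itertools import product
from collections import defaultdict

def entropy(dist):
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

px = {}                                   # P{X = series outcome}
for n in range(4, 8):                     # a series lasts 4 to 7 games
    for s in product("AB", repeat=n):
        s = "".join(s)
        winner_wins = max(s.count("A"), s.count("B"))
        wins_before_last = max(s[:-1].count("A"), s[:-1].count("B"))
        if winner_wins == 4 and wins_before_last == 3:   # the series ends exactly at game n
            px[s] = 0.5 ** n

py = defaultdict(float)                   # P{Y = number of games}
for s, q in px.items():
    py[len(s)] += q

print("H(X)   =", entropy(px))
print("H(Y)   =", entropy(py))
print("H(Y|X) =", 0.0)                    # Y is a function of X
print("H(X|Y) =", entropy(px) - entropy(py))   # chain rule: H(X) = H(Y) + H(X|Y)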

Problem 21 (Cryptography) Let X and Z be binary 0-1 valued independent random variables, whose
distribution is given by P{X = 1} = p and P{Z = 1} = 1/2. Define Y = X ⊕ Z, where ⊕ denotes mod
2 summation. X may be thought of as the message, Z is the secret key, and Y is the encrypted message.
Determine the following quantities: H(X), H(X|Z), H(X|Y ), H(X|Y, Z), I(X; Y ), I(X; Z), I(X; (Y, Z)). In-
terpret the results. What does the value of I(X; Y ) tell you?

Problem 22 (Inequalities) Let X, Y and Z be arbitrary (finite) random variables. Prove the following
inequalities.

• H(X, Y |Z) ≥ H(X|Z),

• I((X, Y ); Z) ≥ I(X; Z),

• H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X).

Problem 23 (Card shuffling) Let X be an arbitrary random variable taking its values from the set
{1, 2, . . . , 52}. Let T be a random permutation of the numbers 1, 2, . . . , 52, i.e., T (1), T (2), . . . , T (52) is a
random re-ordering of the set {1, 2, . . . , 52}. We assume that T is independent of X. Show that

H(T (X)) ≥ H(X).

Problem 24 Let X = X1 , X2 , . . . be a binary memoryless stationary source with P{X1 = 1} = 10^{−6}.


Determine a variable-length code of X whose per letter expected codeword length is smaller than 1/10.

Problem 25 (Run length coding) Let X1 , . . . , Xn be binary random variables. Let R = (R1 , R2 , . . .)
denote the run lengths of the symbols in X1 , . . . , Xn . That is, for example, the run lengths of the sequence
1110010001111 are R = (3, 2, 1, 3, 4). What is the relation between H(X1 , . . . , Xn ), H(R) and H(R, Xn )?

Problem 26 (Entropy of a Markov chain) Let X = X1 , X2 , . . . be a binary stationary Markov chain


with state transition probabilities
P{X2 = 0|X1 = 0} = p,   P{X2 = 1|X1 = 0} = 1 − p,   P{X2 = 0|X1 = 1} = (1 − p)/2,   P{X2 = 1|X1 = 1} = (1 + p)/2.
Determine P{X1 = 0}. What is the entropy of the source?
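
For a sanity check one can fix a value of p and compute the stationary distribution and the entropy rate
numerically (a sketch; p = 0.3 is an arbitrary example value):

import numpy as np

p = 0.3                                              # example value
P = np.array([[p, 1 - p],
              [(1 - p) / 2, (1 + p) / 2]])           # transition matrix of the problem

# stationary distribution: left eigenvector of P belonging to eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

def row_entropy(row):
    return -sum(q * np.log2(q) for q in row if q > 0)

entropy_rate = sum(pi[i] * row_entropy(P[i]) for i in range(2))
print("P{X1 = 0} =", pi[0])
print("entropy rate H(X2|X1) =", entropy_rate, "bits per symbol")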

Problem 27 (The second law of thermodynamics) Let X = X1 , X2 , . . . be a stationary Markov chain.


Show that H(Xn |X1 ) is monotone increasing (in spite of the fact that H(Xn ) does not change with n by
stationarity).

Problem 28 (Properties of the binary entropy function) Define the function h(x) = −x log x −
(1 − x) log(1 − x), for x ∈ (0, 1), and h(0) = h(1) = 0. Show that h satisfies the following properties:

• symmetric around 1/2;

• continuous at every point of [0, 1];

• strictly monotone increasing on [0, 1/2];

• strictly concave.

Problem 29 (Data processing lemma) Assume that the random variables X, Y, Z form a Markov chain
(in this order). (In other words, X and Z are conditionally independent given Y.) Prove the following
inequalities:
I(X; Z) ≤ I(X; Y ), and I(X; Z) ≤ I(Y ; Z).

Interpret the title of this exercise.

Problem 30 (Back to the future) Let . . . , X−2 , X−1 , X0 , X1 , X2 , . . . be a stationary sequence of ran-
dom variables. Show that

H(X0 |X−1 , X−2 , . . . , X−n ) = H(X0 |X1 , X2 , . . . , Xn ),

that is, the conditional entropies of the present given the past and given the future are equal.

Problem 31 (Non-stationary Markov chain) Let Z = Z1 , Z2 , . . . be a homogeneous Markov chain


with state transition probabilities P{Z2 = 0|Z1 = 0} = P{Z2 = 1|Z1 = 1} = p, P{Z2 = 0|Z1 = 1} =
P{Z2 = 1|Z1 = 0} = 1 − p. Determine the entropy of the Markov chain for an arbitrary initial distribution.
What is limn→∞ H(Zn )? What is the stationary distribution?

Problem 32 Let the states of a stationary Markov chain be the integers 0, 1, . . . , 255. Assume that the
state transition probabilities are defined by the following 256 × 256 matrix (where the jth entry of the ith
row is P{Z2 = j|Z1 = i}).
 
1/2  1/4  1/4   0    0    0   ···   0
 0   1/2  1/4  1/4   0    0   ···   0
 0    0   1/2  1/4  1/4   0   ···   0
  .          .          .          .
  .          .          .          .
 0    0   ···   0   1/2  1/4  1/4
1/4   0    0   ···   0   1/2  1/4
1/4  1/4   0   ···   0    0   1/2

What is the stationary distribution of the Markov chain? Determine the entropy of the Markov chain.
Construct a good uniquely decodable variable-length block code.
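
Because every column of this matrix also sums to one, the answer to the first question can be guessed and
then verified numerically (a sketch that builds the 256-state chain explicitly):

import numpy as np

n = 256
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5                      # stay in the same state
    P[i, (i + 1) % n] = 0.25           # move one state forward (mod 256)
    P[i, (i + 2) % n] = 0.25           # move two states forward (mod 256)

pi = np.full(n, 1 / n)                 # candidate: uniform distribution
print("matrix is doubly stochastic, uniform is stationary:", np.allclose(pi @ P, pi))

# for a stationary Markov chain the entropy rate is sum_i pi_i * H(row_i);
# here every row has the same entropy
row_entropy = -sum(q * np.log2(q) for q in P[0] if q > 0)
print("entropy rate =", row_entropy, "bits per state")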

Problem 33 The figure below describes a Markov chain Z = Z1 , Z2 , . . .. Assume that Z1 has the stationary
distribution.

Also define a sequence Y1 , Y2 , . . . of independent, identically distributed binary random variables with P (Yi =
0) = 1/3. Define the source X = X1 , X2 , . . . by the expression Xi = 2Zi + Yi . Determine the entropy of X
provided that Z1 , Z2 , . . . is independent of Y1 , Y2 , . . ..

Problem 34 (Variable-length coding using typical sequences) Let X be a stationary memoryless
source. A prefix code f : X^k → {0, 1}∗ is used to encode k-long blocks of the source as follows: let ε > 0
be an arbitrary fixed number. The codeword of each ε-typical sequence starts with a 0, all other codewords
start with a 1. Each ε-typical sequence (whose total number does not exceed 2^{kH(X1)(1+ε)}) is encoded by a
codeword of length ⌈kH(X1)(1 + ε)⌉ + 1. The non-ε-typical sequences are encoded by codewords of length
⌈k log |X|⌉ + 1 (the plus 1 in both cases comes from the 0 and 1 prefixes). Show that for each ε′ > ε, if k is
large enough, the per-letter expected codeword length satisfies

(1/k) E|f(X1, . . . , Xk)| ≤ H(X1) + ε′.
(This is an alternative proof for the source coding theorem that was proved using Shannon-Fano codes in
class.)

Problem 35 (Fixed length code) Let X be a stationary memoryless binary source with P{X1 = 1} =
0.005, P{X1 = 0} = 0.995. We assign codewords to those blocks of length 100 that contain at most 3 ones.
If all codewords are of the same length, then what is the minimal possible codeword length? What exactly is
the total probability of the blocks that are not encoded? How can you estimate this probability by Chebyshev’s
inequality?
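
The counting involved can be checked with a few lines of exact arithmetic (a sketch for verifying the numbers;
the Chebyshev bound shown is one standard way of applying the inequality here):

from math import comb, ceil, log2

n, p, kmax = 100, 0.005, 3

n_blocks = sum(comb(n, k) for k in range(kmax + 1))          # blocks with at most 3 ones
codeword_length = ceil(log2(n_blocks))                       # fixed length needed to index them
p_not_encoded = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(kmax + 1))

mean, var = n * p, n * p * (1 - p)                           # S = number of ones in the block
chebyshev = var / (kmax + 1 - mean) ** 2                     # P{S >= 4} <= P{|S - np| >= 4 - np}

print("blocks encoded  :", n_blocks)
print("codeword length :", codeword_length, "bits")
print("P{not encoded}  :", p_not_encoded)
print("Chebyshev bound :", chebyshev)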

Problem 36 (Converse of the source coding theorem) We have seen in class that by encoding only
the typical sequences by fixed-length codewords, the per-letter codeword length can get arbitrarily close to
H(X)/ log s if the encoded blocks are sufficiently long. Show that there is no essentially better code than
this, that is, if k-long blocks of the stationary memoryless source X are encoded with error probability ε,
and the length of the codewords is mk, then

lim inf_{k→∞} mk/k ≥ H(X)/log s.

2 Universal coding, Kolmogorov complexity


Problem 37 (Number of types) Prove that k-long blocks of an alphabet of n elements can have at most
(k + 1)^n different types. Show that the exact number of the different types is

\binom{k + n − 1}{n − 1}.

Problem 38 (Multinomial coefficients and entropy) Let k = N1 + . . . + Nn. Prove (by using the
previous exercise and the method of proof of the analogous statement for binomial coefficients given in class)
that

(1/(k + 1)^n) 2^{kH} ≤ k!/(N1! N2! · · · Nn!) ≤ 2^{kH},

where H = −\sum_{i=1}^{n} (Ni/k) log(Ni/k) is the entropy of the distribution (N1/k, . . . , Nn/k).

Problem 39 (Adaptive Shannon-Fano code) Let X1, . . . , XN be a sequence of independent and
identically distributed random variables, where the Xi take their values from the finite set {a1, . . . , an}.
Suppose that we estimate the unknown source distribution P{X1 = ai} = pi, i = 1, . . . , n by the relative
frequencies qN,i = (1/N) \sum_{j=1}^{N} I{Xj = ai}. Let Y1, . . . , Yk be an i.i.d. sequence which is independent of the Xi but
has the same distribution. We want to encode this k-block using a Shannon-Fano code designed from the
estimated probabilities. Using the law of large numbers and Problem 14, prove that the expected per letter
codelength of this code satisfies

(1/k) E|f(Y1, . . . , Yk)| ≤ H(Y1) + 1/k + εN,

where εN → 0 as N → ∞.
Problem 40 (Lempel-Ziv) Give the Lempel-Ziv parsing and encoding of the binary string consisting of
36 zeros. What is the codelength of the Lempel-Ziv encoding of a string of c(c + 1)/2 zeros? How does this
relate to the fact that the Lempel-Ziv algorithm is asymptotically optimal?
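
An incremental (LZ78-style) parser is easy to write and convenient for experimenting with the all-zero string
(a sketch; the exact codeword format used in class may differ in details, so the printed codelength is only the
usual c(log c + 1)-type estimate):

import math

def lz_parse(s):
    """Parse s into phrases, each being a previously seen phrase extended by one new symbol."""
    phrases, seen, current = [], set(), ""
    for ch in s:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                        # possibly incomplete last phrase
        phrases.append(current)
    return phrases

phrases = lz_parse("0" * 36)
c = len(phrases)
print("phrases:", phrases)             # 0, 00, 000, ..., 00000000  (1 + 2 + ... + 8 = 36 zeros)
print("number of phrases c =", c)
print("estimated codelength =", c * (math.ceil(math.log2(c)) + 1), "bits")   # pointer + one new bit per phrase
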
Problem 41 (Totally random sequences cannot be compressed) Let X1 , . . . , Xn be an i.i.d.
sequence of binary random variables with distribution P{X1 = 0} = 1/2 . Show that the Kolmogorov
complexity of X1 , . . . , Xn satisfies
P{K(X1, . . . , Xn | n) < n − k} < 2^{−k}
for all nonnegative k. (For example, the probability of the event that a random string of length n has
Kolmogorov complexity less than n − 7 is less than 1/100.)
Problem 42 (Kolmogorov complexity) Give an estimate of the Kolmogorov complexity of the fol-
lowing binary strings:
• A string of n zeros: 000...0,

• A string consisting of n copies of 01: 0101...01,



• The first n bits of the binary expansion of √2 − 1.

• n bits drawn independently from the distribution p(0) = 1/3, p(1) = 2/3.
Problem 43 (Chaitin’s mystical, magical number) Let U be a universal Turing machine and let the
real number Ω be defined as

Ω = \sum_{p : U(p) halts} 2^{−l(p)},

where l(p) is the length of the program p. That is, the sum is over the halting programs. We have seen in
class that Ω ≤ 1. Show that if we knew an Ωn such that Ω − 2^{−n} < Ωn < Ω, then we could decide for any
program of length at most n whether it is a halting program.
Problem 44 (Concatenated strings) Let x, y ∈ {0, 1}∗ , and let xy denote the concatenation of x and
y. Prove that
K(xy) ≤ K(x) + K(y) + c,
but there exist x and y such that
K(xy) < K(x) + K(y).
Problem 45 (Kolmogorov complexity is not monotone) Give an example in which x is a prefix
of y but K(x) > K(y).

3 Lossy source coding and quantization
Problem 46 (The minimum mutual information is achieved) Show that in the definition of the
rate-distortion function it is legitimate to write “min” instead of “inf”. You have to show that for a given
distribution of X, there exists a conditional probability assignment p(y|x), x ∈ X , y ∈ Y determining a
random variable Y such that the joint distribution of (X, Y ) satisfies Ed(X, Y ) ≤ δ, and I(X; Y ) = R(δ).

Problem 47 (The mutual information is convex in p(y|x) for fixed p(x)) Let the distribution of
the pair of r.v.’s (X, Y1) be determined by p(x) = P{X = x} and p1(y|x) = P{Y1 = y|X = x}. Similarly,
let the distribution of (X, Y2) be defined by p(x) and the conditional probabilities p2(y|x) = P{Y2 = y|X = x}.
Let Y be a random variable whose joint distribution with X is defined by the conditional probabilities

P{Y = y|X = x} = λp1 (y|x) + (1 − λ)p2 (y|x),

where 0 < λ < 1. Prove that


I(X; Y ) ≤ λI(X; Y1 ) + (1 − λ)I(X; Y2 ),

that is, I(X; Y ) is convex in the conditional distribution of Y .

Problem 48 (The rate-distortion function of the binary memoryless source for Hamming
distortion. Part 1) Let X be a binary random variable with distribution P{X = 0} = p < 1/2, and
let 0 ≤ δ ≤ p. Show that for any random variable Y such that P{X ≠ Y} ≤ δ, we have

I(X; Y ) ≥ h(p) − h(δ),

where h is the binary entropy function.

Problem 49 (The rate-distortion function of the binary memoryless source for Hamming
distortion. Part 2) Let X be a binary random variable with distribution P{X = 0} = p < 1/2, and
let 0 ≤ δ ≤ p. Define the joint distribution of the pair (X, Y ) as

P{Y = 0} = (1 − p − δ)/(1 − 2δ),   P{Y = 1} = (p − δ)/(1 − 2δ),
and

P{X = 0|Y = 0} = 1 − δ, P{X = 1|Y = 0} = δ, P{X = 0|Y = 1} = δ, and P{X = 1|Y = 1} = 1 − δ.

Prove that
I(X; Y ) = h(p) − h(δ).

Problem 50 (The rate-distortion function of the binary memoryless source for Hamming
distortion. Part 3) Let X = X1 , X2 , . . . be a binary stationary and memoryless source with distribution
P{X1 = 0} = p < 1/2. Using the previous problems show that the rate-distortion function of the source for
Hamming distortion is given by

R(δ) = h(p) − h(δ)   if 0 ≤ δ ≤ p,
R(δ) = 0             if δ > p.

Problem 51 (The Shannon lower bound to the rate-distortion function) Let X be a random
variable taking values from the source alphabet X = {x1 , . . . , xn }. Assume that the distortion measure
d : X × Y → [0, ∞) has the property that for all y ∈ Y, the n-vector

(d(x1 , y), . . . , d(xn , y))

is a permutation of the numbers d1, . . . , dn. Define the function φ as

φ(δ) = max H(p),

where the maximum is taken over the probability distributions p = {p1, . . . , pn} which satisfy \sum_{i=1}^{n} pi di ≤ δ
(here H(p) denotes the entropy of the distribution p). Prove that φ(δ) is a concave function.
Prove that the following chain of inequalities holds for all r.v. Y such that Ed(X, Y ) ≤ δ:

I(X; Y ) = H(X) − H(X|Y )
         = H(X) − \sum_y p(y) H(X|Y = y)
         ≥ H(X) − \sum_y p(y) φ(δ_y)
         ≥ H(X) − φ(\sum_y p(y) δ_y)
         ≥ H(X) − φ(δ),

where δ_y = \sum_x p(x|y) d(x, y).
Conclude that
R(δ) ≥ H(X) − φ(δ).
H(X) − φ(δ) is called the Shannon lower bound to R(δ). What is this lower bound for a binary memoryless
source and Hamming distortion?

Problem 52 (Graph entropy) Given a graph G and a probability distribution P on the vertex set
of G, the graph entropy assigned to G and P is the quantity

H(G, P ) = min I(X; Y ),

where the minimum is taken over all pairs of r.v.’s (X, Y ) which satisfy the following conditions:
(a) X takes its values from the vertex set of G.
(b) Y takes its values from the collection of independent subsets of the vertex set of G.
(c) For any vertex xi and independent subset yj, if xi ∉ yj, then Pr{X = xi, Y = yj} = 0.
(d) X has distribution P , that is, Pr{X = xi} = P(xi) for all xi.
Prove that for all G and P ,

H(G, P ) ≤ log χ(G),

where χ(G) is the chromatic number of G. Give an example (i.e., G and P ) for which equality holds.
Hint: Notice that a coloring of G is equivalent to the covering of the vertices of G by independent sets.

Problem 53 (1-bit quantization of the normal distribution) Let X be a normal r.v. with zero
mean and variance σ^2. Determine the best 1-bit (2-level) quantizer for X. What is the MSE of this
quantizer?

Problem 54 (Quantizing an exponential distribution) Let X be a real valued random variable
with density

f(x) = c e^{−x/2}   if x ∈ [0, 2],     f(x) = 0   if x ∉ [0, 2],

where c makes f integrate to one.
(a) Quantize X with a 4-bit uniform quantizer Q matched to the interval [0, 2]. Using the high resolution
approximation, calculate the MSE and the entropy H(Q(X)) of Q.
(b) Let us redefine the density of X as

f(x) = (1/2) e^{−x/2}   if x ≥ 0,     f(x) = 0   if x < 0.

Quantize X with the quantizer Q̂ defined as

Q̂(x) = Q(x)   if x ∈ [0, 2],     Q̂(x) = 2   if x > 2.

Using the computations in part (a), compute the entropy H(Q̂(X)).
(Hint: Let Z = 0 if X ∈ [0, 2], and Z = 1 if X > 2. Then H(Q̂(X)) = H(Q̂(X), Z) = H(Z) + H(Q̂(X)|Z).)

Problem 55 A r.v. X has density f (x) given in the figure below. Suppose X is quantized with a 2-bit
uniform quantizer matched to the interval [−1, 1].
(a) Calculate exactly the squared distortion and the entropy of the quantizer.
(b) Calculate both quantities again, now using high-resolution approximations. Compare the result with the
exact values.
(Hint: ∫ x ln x dx = (x^2/2) ln x − x^2/4 + C.)

Problem 56 Sunny and rainy days follow one another according to the stationary Markov chain in the
figure. On a rainy day, the amount of rain (measured in centimeters) is an exponential random variable with
density f(x) = e^{−x} (x > 0). The weather service reports the amount of rain quantized uniformly with a step-
size of 1 millimeter. What is the (approximate) entropy of the source consisting of the sequence of rain
reports?

Problem 57 (The Lloyd-Max algorithm is not optimal) Show in an example that the Lloyd-Max
algorithm does not always converge to the quantizer with minimum distortion. You have to give a density
f and an initial quantizer such that the algorithm does not converge to the optimum.
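
For experimenting with initial quantizers, the Lloyd-Max iteration is easy to implement on a discretized
density (a sketch under the two alternating conditions, nearest-neighbour cells and centroid reproduction
points; the bimodal density below is only a placeholder, chosen because multimodal densities tend to have
suboptimal fixed points):

import numpy as np

def lloyd_max(x, fx, init_points, iters=200):
    """Lloyd-Max iteration on a density sampled at points x with (unnormalized) values fx."""
    w = fx / fx.sum()                                        # discretized probability weights
    y = np.array(init_points, dtype=float)                   # reproduction points
    for _ in range(iters):
        cells = np.argmin(np.abs(x[:, None] - y[None, :]), axis=1)   # nearest-neighbour partition
        for j in range(len(y)):
            mask = cells == j
            if w[mask].sum() > 0:                            # centroid condition (skip empty cells)
                y[j] = np.average(x[mask], weights=w[mask])
    cells = np.argmin(np.abs(x[:, None] - y[None, :]), axis=1)
    mse = np.sum(w * (x - y[cells]) ** 2)
    return np.sort(y), mse

x = np.linspace(-4, 4, 4001)
fx = np.exp(-((x + 2) ** 2) / 0.1) + np.exp(-((x - 2) ** 2) / 0.1)   # placeholder bimodal density

for init in ([-3.0, -1.0, 1.0, 3.0], [-2.2, -2.0, -1.8, 2.0]):
    points, mse = lloyd_max(x, fx, init)
    print("initial", init, "-> points", np.round(points, 3), " MSE", round(float(mse), 5))

Two different initializations can end up at fixed points with different distortion, which is the phenomenon the
problem asks you to exhibit.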

Problem 58 (Maximal differential entropy) We have seen that, of all densities with a given variance,
the normal density has maximal differential entropy. Show that

• Among all nonnegative r.v.’s with a given mean, the one with exponential distribution has maximal
differential entropy.

• Among all densities which are zero outside the interval [a, b], the uniform density over [a, b] has maximal
differential entropy.

Problem 59 (Maximal entropy) Consider the family of all discrete probability distributions with a
given mean which are concentrated on the set of nonnegative integers. Using the method of the previous
problem show that the geometric distribution maximizes the entropy in this family.

Problem 60 (Differential entropy) Let X have a density and let H(X) denote its differential entropy.
Show that for any a > 0 we have
H(aX) = H(X) + log a.

4 Channel Coding
Problem 61 (Repetition Code) The symbols 0 or 1 are to be transmitted through a binary symmetric
channel with crossover probability p < 1/2. The channel code has two codewords (000) and (111), the code
(000) is transmitted if the message is 0, and (111) is transmitted if the message is 1. The decoder uses
majority decision: if the received word has more than one zero, then the decoder output is 0, otherwise it
is 1. What are the probabilities of incorrect decoding Pe,1 and Pe,2?

Problem 62 (Using parity bits on channels with feedback) A simple but widely used method of
coding for binary channels is to attach the mod 2 sum of the bits to the end of the string (parity bit). If
there are an odd number of bit errors during transmission, this can be detected, and the decoder can ask
for a repetition from the encoder through a feedback loop. The simplest example for this scheme is the
following: consider a BSC with crossover probability p < 1/2 using the encoding rule that sends (00) if the
message is 0 and (11) if the message is 1. At decoding, if the two received bits are different, a repetition is
requested. Prove that the probability of incorrect decoding is

Pe = Pe,1 = Pe,2 = p^2/(1 − 2p + 2p^2),

and that the number of bits used for transmitting one message bit, on the average, is

2/(1 − 2p + 2p^2).

If you compare this with the result of the previous problem, you will notice that for small p (p < (1 − 1/√3)/2),
a smaller error probability using fewer channel bits can be achieved with the feedback code. On the
other hand, it can be proved that feedback does not increase capacity for discrete memoryless channels. This
means that if a scheme using feedback achieves a certain error probability at a certain rate, the same error
probability and rate can be achieved using a code without feedback (but the second code possibly has larger
blocklength).

Problem 63 Find the capacity of the channels in the figure:

Assume that the two channels above are used simultaneously, i.e., the same input symbol (0, 1 or 2) is
transmitted through both channels at the same time. In this way we obtain a channel with 3 possible inputs
and 4 possible outputs. What is the capacity of this channel? Show that in all three cases there are channel
codes of rate R = C that can be decoded with zero probability of error.

Problem 64 (Mod 11 channel) Let U ∈ {0, 1, . . . , 10} be the input of a DMC. The output is determined
by the formula V = U + Z (mod 11), where Z is independent of U , and
P{Z = 1} = P{Z = 2} = P{Z = 3} = 1/3.

(a) Find the channel capacity.
(b) Find a code of length one with the largest possible rate which can be decoded without error. What is the
rate of the code?

Problem 65 In the figure below, U, Z and W are independent binary random variables, and ⊕ means
modulo 2 addition. The block diagram can be viewed as a binary channel with input U and output V .
(a) Find the channel capacity if P{Z = 1} = p and P{W = 1} = q.
(b) Find the channel capacity if the first ⊕ is replaced by a binary “or”, the second ⊕ is replaced by a binary
“and”, and p = 1/3, q = 3/4.

Problem 66 (Cascade of BSC’s) Take the cascade of k binary symmetric channels, each having
crossover probability p. Show that the binary channel obtained is equivalent to a BSC with crossover
probability (1/2)(1 − (1 − 2p)^k). This shows that the capacity of the cascade channel goes to zero as k increases
to infinity.
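
The claimed crossover probability can be checked by multiplying channel transition matrices (a sketch
comparing the k-fold product with the closed-form expression):

import numpy as np

def bsc(p):
    return np.array([[1 - p, p],
                     [p, 1 - p]])

p = 0.1                                              # example crossover probability
P = np.eye(2)
for k in range(1, 9):
    P = P @ bsc(p)                                   # cascade one more BSC
    predicted = 0.5 * (1 - (1 - 2 * p) ** k)         # claimed crossover probability of the cascade
    print(k, P[0, 1], predicted, np.isclose(P[0, 1], predicted))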

Problem 67 (Capacity of the erasure channel) Prove that the capacity of the erasure channel with
erasure probability p is C = 1 − p.

Problem 68 (Capacity of weakly symmetric channels) The channel transition matrix has entries
p(v|u), (u ∈ U, v ∈ V), arranged so that the jth element of the ith row is p(vj |ui ). A DMC is called weakly
symmetric if the rows of the transition matrix are the permutations of the same probability vector p, and
the sums of the probabilities in all columns are the same. Show that the capacity of such a channel is

C = log |V| − H(p).

Problem 69 (a) Find the capacity of the channel in the figure.

(b) Let p = 1/4, and assume that the channel is used 30 times. Approximately how many bits of information
can be transmitted if the decoding error probability is to be “small”?

Problem 70 (MAP and ML decoding) Consider a BSC with crossover probability p. Assume that a
binary message is to be transmitted using a repetition code of length n, i.e., the first codeword is a string
of n zeros, the second is a string of n ones. We have seen that the maximum likelihood decoding method
will choose the codeword which is closer in Hamming distance to the output sequence. We also know that
the decoding rule giving the smallest error probability is the maximum a posteriori decoding. Suppose that the
message is 0 with probability q and it is 1 with probability 1 − q. Determine the maximum a posteriori
decoding rule in this case. What is the decoding rule for n = 3, 10 and 100 if p = 1/3 and q = 0.1?
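
The MAP rule reduces to comparing the number of received ones with a threshold; the threshold for the
requested parameters can be found by directly comparing the two posteriors (a sketch):

def map_threshold(n, p, q):
    """Smallest number of received ones for which MAP decides that the message was 1
    (returns None if the decision is always 0)."""
    for t in range(n + 1):
        post0 = q * p**t * (1 - p)**(n - t)          # message 0, codeword 00...0, t bits flipped
        post1 = (1 - q) * p**(n - t) * (1 - p)**t    # message 1, codeword 11...1, n-t bits flipped
        if post1 > post0:
            return t
    return None

p, q = 1 / 3, 0.1
for n in (3, 10, 100):
    t = map_threshold(n, p, q)
    print(f"n = {n:3d}: decide 1 iff at least {t} ones are received (ML would use the majority rule, threshold {n / 2})")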

Problem 71 (ML decoding on the erasure channel) Find the maximum likelihood decoding rule
for the binary erasure channel.

5 Decision and Estimation
Problem 72 (Weather forecasting) Assume that the sunny and rainy days follow each other according
to a stationary Markov chain with distribution

P{rainy today|rainy yesterday} = 0.5; P{rainy today|sunny yesterday} = 0.3.

Knowing yesterday’s weather, in the morning I am trying to decide if I will need an umbrella today. The
cost function is as follows: If I take an umbrella with me, it will cost 100 Forints a day (independently of
the weather), since I lose my umbrellas quite often. If I don’t take an umbrella and there is no rain today,
then I have no expenses. On the other hand, if I don’t take an umbrella and it rains, then it will cost me 150
Forints to have my hair done at my hair stylist. What is the optimal decision if yesterday was rainy, and
what is the optimal decision if yesterday was sunny? What changes if the hair stylist asks for 200 Forints?
What is the expected cost of the optimal decision in each case?
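
The Bayes-decision arithmetic can be laid out in a few lines (a sketch comparing the conditional expected
costs of the two actions; the numbers are the ones given in the problem):

def expected_costs(p_rain_today, umbrella_cost=100, wet_cost=150):
    """Expected cost of each action given the conditional probability of rain today."""
    return {"take umbrella": umbrella_cost,
            "no umbrella": p_rain_today * wet_cost}

for wet_cost in (150, 200):
    print(f"--- hair stylist charges {wet_cost} Forints ---")
    for yesterday, p_rain in (("rainy", 0.5), ("sunny", 0.3)):
        costs = expected_costs(p_rain, wet_cost=wet_cost)
        best = min(costs, key=costs.get)
        print(f"yesterday was {yesterday}: {costs}  -> optimal action: {best}")

Ties (equal expected costs) can occur, in which case either action is optimal.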

Problem 73 (Detection in additive Gaussian noise) Consider the detection problem we have dealt
with in class. What is the Bayes (maximum a posteriori) decision if we change the probabilities of the
zeros and ones to q and 1 − q? Calculate the error probability for n = 1 (only one sample is taken). For
n > 1 give a simple form of the Bayes decision for the white noise case, that is, when the noise samples are
independent and identically distributed normal random variables.

Problem 74 (Bayes versus Neyman-Pearson) Consider the two-class decision problem where the
observation X is real valued and its conditional densities given A = 1 and A = 2 are given in the figure. Let
P(A = 1) = q1 and P(A = 2) = q2 be the prior probabilities. What is the Maximum Likelihood decision,
the Bayes decision, and the α-level Neyman-Pearson decision?

Problem 75 (Variance estimation) The observations X = (X1 , . . . , Xn ) are i.i.d. random variables
with unknown mean and variance. We want to estimate their common variance σ^2(X1) = E(X1 − EX1)^2.
It seems natural to use the estimate

Gn(X) = (1/n) \sum_{i=1}^{n} (Xi − (1/n) \sum_{j=1}^{n} Xj)^2.

Show that Gn is biased but asymptotically unbiased. Prove also that the modification

G′n(X) = (1/(n − 1)) \sum_{i=1}^{n} (Xi − (1/n) \sum_{j=1}^{n} Xj)^2

is unbiased. (It can be proved that the expected squared difference of G′n and σ^2(X1) is greater than that
of Gn and σ^2(X1). In this sense, Gn is “better” than G′n. It is interesting to note that there is no estimate
of σ(X1) = √(σ^2(X1)) which is unbiased for all distributions of X. Think about what this means.)

Problem 76 (Estimating the mean of uniformly distributed observations) Let X1 , . . . , Xn be


i.i.d. r.v.’s which are uniformly distributed over the interval [0, a]. We want to estimate the (common) mean
of the Xi. An obvious choice is the estimate

Gn(X) = (1/n) \sum_{i=1}^{n} Xi.

But we have seen in class that an unbiased estimate of a is ((n + 1)/n) max_i Xi. Since the expectation is a/2, we
can also use the estimate

G′n(X) = ((n + 1)/(2n)) max_i Xi.

Show that the expected squared difference of Gn(X) and a/2 is a^2/(12n), while the same quantity for the estimate
G′n(X) is a^2/(4n(n + 2)). Thus the second estimate is much better than the plain average Gn.
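
A quick Monte Carlo comparison of the two estimators against the stated variances (a sketch; a = 1 and
n = 20 are arbitrary example values):

import numpy as np

rng = np.random.default_rng(0)
a, n, trials = 1.0, 20, 200_000
X = rng.uniform(0, a, size=(trials, n))

G  = X.mean(axis=1)                         # plain average
Gp = (n + 1) / (2 * n) * X.max(axis=1)      # scaled maximum

print("MSE of G :", np.mean((G - a / 2) ** 2),  " theory:", a**2 / (12 * n))
print("MSE of G':", np.mean((Gp - a / 2) ** 2), " theory:", a**2 / (4 * n * (n + 2)))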

Problem 77 (Exponential distributions) Let X1 , . . . , Xn be i.i.d. exponential r.v.’s with parameter


λ > 0. What is the maximum likelihood estimate of λ? Is this an unbiased estimate? Is it consistent?

Problem 78 (Hard limiter) Let X have the two-sided exponential distribution with density

f(x) = (1/2) e^{−|x|}.
We want to measure the value of X, but the meter we have is accurate only in the region [−a, a]; if the
absolute value of the measured quantity is greater than a, then it shows a or −a. That is, the meter shows

Y = X    if −a ≤ X ≤ a,
Y = −a   if X < −a,
Y = a    if X > a.

What is the estimate of X based on Y that minimizes the squared error?

Problem 79 (Minimizing the absolute error) Suppose we want to estimate the value of a real r.v.
A based on the observation of a real r.v. X. The estimate G(x) should be such that the absolute error

R(G) = E (|A − G(X)|)

is minimal. Show that the optimal estimate is the conditional median of A given X = x, i.e., G(x) must
satisfy

P{A ≤ G(x)|X = x} = 1/2.
(You can assume that G(x) exists and is unique.)

Problem 80 (Prediction of an autoregressive process) Let X1 , X2 , . . . be an i.i.d. sequence such


that E(Xi) = 0, E(Xi^2) = 1. Consider the process Y0, Y1, . . . which is given for all n ≥ 1 by

Yn = aYn−1 + Xn ,

where 0 < a < 1. Y0 is chosen in such a way that Y0, Y1, . . . is stationary. (This Markov process is a special
case of the so-called autoregressive processes.) What is the optimal zeroth, first, and second order linear
prediction of Yn? You have to determine the coefficients c, c1, c̃1, c̃2 which minimize

E(Yn − c)^2,   E(Yn − c1 Yn−1)^2,   and   E(Yn − \sum_{i=1}^{2} c̃i Yn−i)^2.

What is the pattern here? Try to guess the form of the sixth order predictor of the stationary process

Y′n = \sum_{i=1}^{3} ai Y′n−i + Xn.

Problem 81 (Gauss-Markov process) Let {Xt } (t ∈ R) be a time-continuous zero mean, stationary


Gaussian process with covariance function

K(τ) = E(Xt Xt+τ) = e^{−|τ|}.

Based on the observations XT and X2T, the value X3T of the process at time 3T is to be estimated.
What is the estimate minimizing the expected squared estimation error?

