
E2 201: Information Theory (2019)

Solutions to Homework 3

1. Random variables X, Y and Z take values in {0, 1} with X and Z being independent and
uniform on {0, 1}, and Y = X ⊕ Z. Let PX , PY and PZ denote the pmfs of X, Y and Z,
respectively.
It is easy to see that H(X) = 1. Also, PY (0) = PXZ (0, 0) + PXZ (1, 1) = 1/2 and PY (1) =
PXZ (0, 1) + PXZ (1, 0) = 1/2, which gives H(Y ) = 1.
To compute H(X, Y ), we first determine the joint pmf PXY . Using PXY (x, y) = PY |X (y|x)PX (x),
we get PXY (x, y) = 1/4 for all (x, y) ∈ {0, 1}^2, which gives H(X, Y ) = 2. Also, since
H(X, Y ) = H(X) + H(Y ), we have that X and Y are independent.
Next, we compute H(X, Y |Z = 0). Given Z = 0, we have Y = X with X uniform, so PXY |Z (0, 0|0) = PXY |Z (1, 1|0) = 1/2, and

H(X, Y |Z = 0) = PXY |Z (0, 0|0) log (1/PXY |Z (0, 0|0)) + PXY |Z (1, 1|0) log (1/PXY |Z (1, 1|0)) = 1.

Similarly, given Z = 1, we have Y = X ⊕ 1, so PXY |Z (0, 1|1) = PXY |Z (1, 0|1) = 1/2, and

H(X, Y |Z = 1) = PXY |Z (0, 1|1) log (1/PXY |Z (0, 1|1)) + PXY |Z (1, 0|1) log (1/PXY |Z (1, 0|1)) = 1.
Thus,
H(X, Y |Z) = PZ (0)H(X, Y |Z = 0) + PZ (1)H(X, Y |Z = 1)
= 1 < H(X|Z) + H(Y |Z) = 1 + 1,
which shows that X and Y are not independent given Z.
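As a numerical sanity check, the following short Python sketch (with ad hoc helper functions) recomputes H(X), H(Y ), H(X, Y ) and H(X, Y |Z) from the joint pmf of (X, Y, Z):

```python
import itertools
from math import log2

# Joint pmf of (X, Y, Z): X, Z independent and uniform on {0, 1}, Y = X xor Z.
p_xyz = {(x, x ^ z, z): 0.25 for x, z in itertools.product([0, 1], repeat=2)}

def H(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(q * log2(q) for q in pmf.values() if q > 0)

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, q in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + q
    return out

print(H(marginal(p_xyz, (0,))), H(marginal(p_xyz, (1,))))  # H(X) = H(Y) = 1
print(H(marginal(p_xyz, (0, 1))))                          # H(X, Y) = 2
print(H(p_xyz) - H(marginal(p_xyz, (2,))))                 # H(X, Y |Z) = 1 < 2
```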
2. (i) Using the chain rule for entropy, we have
2H(X1, X2, X3) = H(X1, X2) + H(X3|X1, X2) + H(X1|X2, X3) + H(X2, X3)
(a) ≤ H(X1, X2) + H(X3) + H(X1|X3) + H(X2, X3)
= H(X1, X2) + H(X2, X3) + H(X1, X3),

where (a) above follows from the fact that conditioning reduces entropy (so that H(X3|X1, X2) ≤ H(X3) and H(X1|X2, X3) ≤ H(X1|X3)), and the last line uses H(X3) + H(X1|X3) = H(X1, X3).
(ii) We have

2H(X1, X2, X3) = H(X1) + H(X2, X3|X1) + H(X2) + H(X1, X3|X2)
(b) ≥ H(X1, X2) + H(X2, X3|X1) + H(X1, X3|X2)
≥ H(X1, X2|X3) + H(X2, X3|X1) + H(X1, X3|X2),

where (b) above follows from the subadditivity of entropy (H(X1) + H(X2) ≥ H(X1, X2)), and the last line above follows from the fact that conditioning reduces entropy.
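Both bounds can be sanity-checked on a randomly drawn joint pmf, using H(Xi, Xj |Xk) = H(X1, X2, X3) − H(Xk); a minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Entropy in bits of a pmf stored as a numpy array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Random joint pmf of (X1, X2, X3) on {0, 1}^3.
p = rng.random((2, 2, 2))
p /= p.sum()

H123 = H(p)
H12, H13, H23 = H(p.sum(2)), H(p.sum(1)), H(p.sum(0))
H1, H2, H3 = H(p.sum((1, 2))), H(p.sum((0, 2))), H(p.sum((0, 1)))

# (i): 2 H(X1,X2,X3) <= H(X1,X2) + H(X2,X3) + H(X1,X3)
print(2 * H123 <= H12 + H23 + H13 + 1e-12)
# (ii): 2 H(X1,X2,X3) >= H(X1,X2|X3) + H(X2,X3|X1) + H(X1,X3|X2)
print(2 * H123 >= (H123 - H3) + (H123 - H1) + (H123 - H2) - 1e-12)
```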

3. We first note that, by the chain rule,

I(X ∧ Y, Z) = I(X ∧ Y ) + I(X ∧ Z|Y )
= I(X ∧ Z) + I(X ∧ Y |Z).

Since each of the four quantities on the right is non-negative, I(X ∧ Y, Z) is greater than or equal to each of them, implying I(X ∧ Y, Z) ≥ max{I(X ∧ Y ), I(X ∧ Z|Y ), I(X ∧ Z), I(X ∧ Y |Z)}.
When X −◦− Y −◦− Z and X −◦− Z −◦− Y , we have I(X ∧ Z|Y ) = I(X ∧ Y |Z) = 0, and hence I(X ∧ Y, Z) = I(X ∧ Y ) = I(X ∧ Z).

4. (i) We note that for all x, y in the domain of f,

min_x f(x, y) ≤ f(x, y)   (1)   and   f(x, y) ≤ max_y f(x, y).   (2)

Combining inequalities (1) and (2) above, we have

min_x f(x, y) ≤ max_y f(x, y)

for all x, y in the domain of f. Taking the max over y on both sides, and noting that the right hand side of the above inequality does not depend on y, we get

max_y min_x f(x, y) ≤ max_y f(x, y) for all x.

Now, applying the min over x on both sides of the above inequality, and recognising that the left hand side does not depend on x, yields

max_y min_x f(x, y) ≤ min_x max_y f(x, y).
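On a finite grid the two sides of the max-min inequality become simple array reductions; a small illustrative Python sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.random((5, 7))          # f(x, y) tabulated: rows index x, columns index y

max_min = f.min(axis=0).max()   # max over y of (min over x of f(x, y))
min_max = f.max(axis=1).min()   # min over x of (max over y of f(x, y))
print(max_min <= min_max)       # always True
```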

(ii) Let f and g be two concave functions on R. Define a new function h as h = f + g. We


know that for all x, y ∈ R and for all θ ∈ [0, 1], we have

f (θx + (1 − θ)y) ≥ θf (x) + (1 − θ)f (y),


g(θx + (1 − θ)y) ≥ θg(x) + (1 − θ)g(y).

Fix arbitrary x, y ∈ R and θ ∈ [0, 1]. Then,

h(θx + (1 − θ)y) = f (θx + (1 − θ)y) + g(θx + (1 − θ)y)


≥ θf (x) + (1 − θ)f (y) + θg(x) + (1 − θ)g(y)
= θ(f (x) + g(x)) + (1 − θ)(f (y) + g(y))
= θh(x) + (1 − θ)h(y),

thus proving that h is a concave function.


(iii) Let f (x) = ax + b for some a, b ∈ R. Then, for all x, y ∈ R and θ ∈ [0, 1], we have

f (θx + (1 − θ)y) = a(θx + (1 − θ)y) + b


= a(θx + (1 − θ)y) + b(θ + 1 − θ)
= θ(ax + b) + (1 − θ)(ay + b)
= θf (x) + (1 − θ)f (y),

thus proving that a linear function is both convex and concave.

(iv) Consider the functions f(x) = x^2 and g(x) = loge x on the domain (0, ∞). Clearly, f is convex while g is concave. Also,

f'(x) + g'(x) = 2x + 1/x,   f''(x) + g''(x) = 2 − 1/x^2.

From this, it follows that f + g is concave on the interval (0, 1/√2] and convex outside this interval. Thus, f + g is neither convex nor concave on the entire domain (0, ∞).
Now, consider the functions f(x) = x^2 and g(x) = −e^{−x} on the domain (0, ∞). Clearly, f is convex while g is concave. Furthermore,

f''(x) + g''(x) = 2 − e^{−x} > 0

for all x > 0. Thus, f + g is a (strictly) convex function on the domain (0, ∞).
Finally, consider the functions f(x) = x^2 and g(x) = loge x − x^2 on the domain (0, ∞). Clearly, f is convex while g is concave, and f + g = loge x is concave.
(v) Let f(x) = loge x − x + 1 for x > 0. Since the desired inequality is trivially true if x = 0, it suffices to prove the inequality for x > 0. For any x > 0, we have

f'(x) = 1/x − 1,

which upon setting equal to zero yields x = 1. Furthermore,

f''(x) = −1/x^2 < 0 for all x > 0,

thus implying that f is concave and attains its global maximum at the point x = 1. Therefore, it follows that f(x) ≤ f(1) = 0, thus yielding loge x ≤ x − 1 for all x ≥ 0.
Relabelling x − 1 as −y (which implies that x ≥ 0 translates to y ≤ 1), we get that

loge(1 − y) ≤ −y.

This yields

1 − y ≤ e^{−y}

for all y ≤ 1. Finally, we note that the above inequality is trivially true if y > 1, since the left hand side then becomes negative while the right hand side is strictly positive. Thus, we have

1 − y ≤ e^{−y} for all y ∈ R.

5. Let X = S52 denote the set of all permutations of numbers {1, 2, ..., 52}, namely each element
x ∈ X represents an order of shuffled cards. A random shuffle of cards can be represented
by a channel W : X → X . We say that a shuffle is good if it increases the “randomness” in
ordering of cards. Formally, a shuffle W is said to be good if for every pmf P on X ,

H(P) ≤ H(PW ).

Claim: A shuffle is good if and only if it does not decrease the entropy of randomly ordered cards (here "randomly ordered" stands for X distributed uniformly over X ).

Proof. Clearly, if a shuffle is good then for a uniform distribution Punif(X ) on X

log |X | = H(Punif(X ) ) ≤ H(Punif(X ) W ) ≤ log |X |,

which can only happen if H(Punif(X ) W ) = log |X |. Thus, the shuffle does not decrease the
entropy of randomly ordered cards, and Punif(X ) W is uniform as well.
Conversely, suppose a shuffle does not decrease the entropy of randomly ordered cards. Then,
H(Punif(X ) W ) = H(Punif(X ) ) = log |X |, and so,

Punif(X ) W = Punif(X ) .

Note that for any pmf P on X

D(PkPunif(X ) ) = log |X | − H(P).

Also, since Punif(X ) W = Punif(X ) ,

D(PW kPunif(X ) W ) = log |X | − H(PW ).

Thus, by the data processing inequality we get

log |X | − H(PW ) = D(PW kPunif(X ) W )


≤ D(PkPunif(X ) )
= log |X | − H(P),

which gives

H(P) ≤ H(PW ).

Therefore, the shuffle is good if it does not reduce the entropy of randomly ordered cards.

Now, to answer the question asked in the homework, we show that the shuffle mentioned in the homework is good. To that end, by the claim above, it suffices to show that it does not reduce the entropy of randomly ordered cards. Specifically, the shuffle in the homework can be described by the channel

W(y|x) = 1/52 if y = (xi, x1, . . . , xi−1, xi+1, . . . , x52) for some 1 ≤ i ≤ 52, and W(y|x) = 0 otherwise.

For W above, it is easy to see that every ordering y arises from exactly 52 orderings x, each with W(y|x) = 1/52 (that is, W is doubly stochastic), so that

Punif(X ) W = Punif(X ) ,

which by the claim above shows that shuffle W is good, i.e.,

H(Y ) = H(PW ) ≥ H(P) = H(X).
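The argument can be checked numerically on a small deck; the sketch below uses n = 4 cards in place of 52, builds the channel W of the shuffle described above, and verifies that the uniform pmf is preserved and that entropy does not decrease:

```python
import itertools
import numpy as np

n = 4  # a small deck standing in for 52 cards
perms = list(itertools.permutations(range(n)))
index = {perm: k for k, perm in enumerate(perms)}
N = len(perms)

# W[x, y] = W(y|x): pick a position i uniformly and move the card at position i to the top.
W = np.zeros((N, N))
for perm, x in index.items():
    for i in range(n):
        y = (perm[i],) + perm[:i] + perm[i + 1:]
        W[x, index[y]] += 1.0 / n

def H(dist):
    d = dist[dist > 0]
    return float(-(d * np.log2(d)).sum())

unif = np.full(N, 1.0 / N)
print(np.allclose(unif @ W, unif))   # the uniform ordering is preserved

rng = np.random.default_rng(0)
P = rng.random(N)
P /= P.sum()
print(H(P @ W) >= H(P) - 1e-12)      # H(PW) >= H(P) for a random P
```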

6. Suppose X = {x1 , . . . , xm }. Suppose further, without loss of generality, that x1 is the symbol
with the largest probability. Clearly, the probability of guessing X correctly is maximised
when the guess is X̂ = x1 . From the data in the question, it follows that P (X̂ = X) =
P(X = x1) = p, from which we have P(X̂ ≠ X) = 1 − p. By Fano's inequality, we have

H(X|X̂) ≤ (1 − p) log(|X | − 1) + h(1 − p)
= (1 − p) log(m − 1) + h(p),

where in writing the last line above, we use the fact that h(1 − p) = h(p). Since the guess
X̂ is a constant (=x1 ), it follows that X and X̂ are independent of each other, which implies
that H(X|X̂) = H(X). This yields the desired relation.
The above inequality is met with equality when the distribution of X is P, where P is given by

P(x) = p for x = x1, and P(x) = (1 − p)/(m − 1) for x ≠ x1.
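For this equality case, H(X) indeed matches the Fano bound; a small Python check with illustrative values m = 6 and p = 1/2 (any m ≥ 2 and p ≥ 1/m behave the same way):

```python
from math import log2

def h(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

m, p = 6, 0.5                               # illustrative values with p >= 1/m
P = [p] + [(1 - p) / (m - 1)] * (m - 1)     # the equality-achieving distribution
H_X = -sum(q * log2(q) for q in P if q > 0)
print(H_X, (1 - p) * log2(m - 1) + h(p))    # the two numbers coincide
```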

7. By the chain rule for mutual information, we have

I(X ∧ Y, Z) = I(X ∧ Y ) + I(X ∧ Z|Y )
= I(X ∧ Z) + I(X ∧ Y |Z).

Since X −◦− Y −◦− Z, we have I(X ∧ Z|Y ) = 0, from which it follows that I(X ∧ Y |Z) = I(X ∧ Y ) − I(X ∧ Z) ≤ I(X ∧ Y ), with equality iff X and Z are independent.
For an example where I(X ∧ Y |Z) > I(X ∧ Y ), consider X and Y independent and uniform
on {0, 1}, and let Z = X ⊕ Y . Then, I(X ∧ Y |Z) = H(X|Z) − H(X|Y, Z) = 1, whereas
I(X ∧ Y ) = 0.
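The first claim, that conditioning on Z cannot increase I(X ∧ Y ) when X −◦− Y −◦− Z, can also be checked by sampling a random Markov chain; a minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(t):
    t = t[t > 0]
    return float(-(t * np.log2(t)).sum())

# Random Markov chain X -o- Y -o- Z on a ternary alphabet: P(x,y,z) = P(y) P(x|y) P(z|y).
py = rng.random(3); py /= py.sum()
px_y = rng.random((3, 3)); px_y /= px_y.sum(axis=1, keepdims=True)   # rows: P(.|y)
pz_y = rng.random((3, 3)); pz_y /= pz_y.sum(axis=1, keepdims=True)
p = np.einsum('y,yx,yz->xyz', py, px_y, pz_y)                        # joint pmf of (X, Y, Z)

I_xy = H(p.sum((1, 2))) + H(p.sum((0, 2))) - H(p.sum(2))             # I(X ∧ Y)
I_xy_given_z = H(p.sum(1)) + H(p.sum(0)) - H(p.sum((0, 1))) - H(p)   # I(X ∧ Y | Z)
print(I_xy_given_z <= I_xy + 1e-12)
```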
8. We shall write Wx (y) as W (y|x) for notational convenience. Also, we shall introduce the
notation D(W (·|x)||Q(·)) to denote the KL divergence
D(W(·|x)||Q(·)) := Σ_y W(y|x) log ( W(y|x) / Q(y) ).

Then, we have

D(W||Q|P) = Σ_x P(x) D(W(·|x)||Q(·)),

which is clearly linear in P for every fixed Q, since it is of the form Σ_x a_x P(x) = a^T P, where for each x, a_x := D(W(·|x)||Q(·)), and the symbol ^T denotes the transpose of a vector.
Now, let Q1 and Q2 be any two distributions on Y. For any θ ∈ [0, 1], let Qθ be the distribution

Qθ := θQ1 + (1 − θ)Q2.
Then,

D(W||Qθ|P) = Σ_x P(x) D(W(·|x)||Qθ(·))
= Σ_x P(x) Σ_y W(y|x) log ( W(y|x) / Qθ(y) )
= Σ_x P(x) Σ_y W(y|x) log ( W(y|x) / (θQ1(y) + (1 − θ)Q2(y)) )
(a) ≤ θ Σ_x P(x) Σ_y W(y|x) log ( W(y|x) / Q1(y) ) + (1 − θ) Σ_x P(x) Σ_y W(y|x) log ( W(y|x) / Q2(y) )
= θ D(W||Q1|P) + (1 − θ) D(W||Q2|P),

where (a) above follows from the fact that the function t ↦ log(1/t), t > 0, is convex. Thus, for every fixed P, D(W||Q|P) is convex in Q.
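Both properties, linearity in P and convexity in Q, are easy to confirm numerically for a randomly drawn channel; a short Python sketch (the helper D_cond is ad hoc):

```python
import numpy as np

rng = np.random.default_rng(0)

def D_cond(W, Q, P):
    """Conditional divergence D(W||Q|P) = sum_x P(x) D(W(.|x)||Q), in bits."""
    return float((P[:, None] * W * np.log2(W / Q[None, :])).sum())

nx, ny = 4, 5
W = rng.random((nx, ny)); W /= W.sum(axis=1, keepdims=True)   # channel rows W(.|x)
P1 = rng.random(nx); P1 /= P1.sum()
P2 = rng.random(nx); P2 /= P2.sum()
Q1 = rng.random(ny); Q1 /= Q1.sum()
Q2 = rng.random(ny); Q2 /= Q2.sum()
t = 0.3

# Linearity in P for fixed Q:
print(np.isclose(D_cond(W, Q1, t * P1 + (1 - t) * P2),
                 t * D_cond(W, Q1, P1) + (1 - t) * D_cond(W, Q1, P2)))
# Convexity in Q for fixed P:
print(D_cond(W, t * Q1 + (1 - t) * Q2, P1)
      <= t * D_cond(W, Q1, P1) + (1 - t) * D_cond(W, Q2, P1) + 1e-12)
```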

9. Assume that there exist functions f : Y → U and g : Z → U such that (a) X −◦− f(Y ) −◦− (Y, Z), (b) X −◦− g(Z) −◦− (Y, Z) and (c) P(f(Y ) = g(Z)) = 1. We need to show that this implies X −◦− Y −◦− Z and X −◦− Z −◦− Y.
Since X −◦− f (Y ) −◦− (Y, Z), we have I(X ∧ Y, Z|f (Y )) = 0, that is,

H(X|f (Y )) − H(X|Y, Z, f (Y )) = H(X|f (Y )) − H(X|Y, Z) = 0.

Also, since X −◦− Y −◦− f(Y ) always holds, we have H(X|Y ) ≤ H(X|f(Y )), which together with the previous observation gives H(X|Y ) ≤ H(X|Y, Z). But since I(X ∧ Z|Y ) = H(X|Y ) − H(X|Y, Z) must be non-negative, we have that H(X|Y ) = H(X|Y, Z), implying that X −◦− Y −◦− Z. Similar arguments can be used to show X −◦− Z −◦− Y.
Now, assume that X −◦− Y −◦− Z and X −◦− Z −◦− Y hold. For simplicity, let us assume that the sets X, Y and Z are finite (the arguments extend easily to the case when the sets are countably infinite). Then,

P(X = x|Y = y, Z = z) = P(X = x|Y = y) = P(X = x|Z = z)    (1)

for all x ∈ X and for all y ∈ Y, z ∈ Z such that P(Y = y, Z = z) > 0. In the above equation, the first equality is due to the Markov relation X −◦− Y −◦− Z and the second equality is due to the Markov relation X −◦− Z −◦− Y.
For notational simplicity, let us denote by A the set

A = {(y, z) : P (Y = y, Z = z) > 0}.

We note that P ((Y, Z) ∈ A) = 1.


In what follows, we shall create a partition of the set A by defining an equivalence relation
on pairs (y, z) ∈ A. We shall then use this partition to construct appropriate functions f on
Y and g on Z.
Consider the following relation R on the set A: for all (y, z), (y′, z′) ∈ A, let (y, z) ∼R (y′, z′) if

P(X = x|Y = y, Z = z) = P(X = x|Y = y′, Z = z′) for all x ∈ X.
Clearly, R is an equivalence relation. Therefore, R gives rise to equivalence classes which constitute a partition of A. Let us denote by A1, . . . , Am the elements of this partition. That is,

A = A1 ∪ · · · ∪ Am,   with Ai ∩ Aj = ∅ for all i ≠ j.

Furthermore, for each i ∈ {1, 2, . . . , m}, we may identify a unique representative element
(yi , zi ) ∈ Ai such that Ai may be expressed as the equivalence class generated by the pair
(yi , zi ).
Fix i ∈ {1, 2, . . . , m}, and consider an arbitrary (y, z) ∈ Ai. We now claim that Ai is the only set in which y can appear as the first coordinate of a pair. Indeed, suppose that there exists z′ such that (y, z′) ∈ Aj for some j ≠ i. Then, for all x ∈ X, we have

P(X = x|Y = yi, Z = zi) = P(X = x|Y = y, Z = z)
(a) = P(X = x|Y = y, Z = z′)
= P(X = x|Y = yj, Z = zj),

hence implying that the equivalence class generated by the pair (yi, zi) is identical to the equivalence class generated by the pair (yj, zj), which is clearly a contradiction since j ≠ i.


In the above set of equalities, (a) follows from (1). This proves the claim. Similar arguments
may be used to argue that if (y, z) ∈ Aj for some j ∈ {1, 2, . . . , m}, then Aj is the only set in
which z can appear as the second coordinate of a pair.
Therefore, it follows that for every y ∈ Y such that P (Y = y) > 0, there exists a unique index
i = i(y) such that Ai(y) ⊂ A is the only set in which y can appear as the first coordinate of a
pair. Similarly, for every z ∈ Z such that P (Z = z) > 0, there exists a unique index j = j(z)
such that Aj(z) ⊂ A is the only set in which z can appear as the second coordinate of a pair.
Furthermore, if (y, z) ∈ Ai for some i, it follows that i(y) = j(z) = i.
Finally, let the functions f and g be defined as follows:

f(y) = P(X = · |Y = y_{i(y)}, Z = z_{i(y)}) if P(Y = y) > 0, and f(y) = (0, . . . , 0), the all-zero vector of length |X |, if P(Y = y) = 0;

g(z) = P(X = · |Y = y_{j(z)}, Z = z_{j(z)}) if P(Z = z) > 0, and g(z) = (0, . . . , 0), the all-zero vector of length |X |, if P(Z = z) = 0.

Then, the Markov relations X −◦− f(Y ) −◦− (Y, Z) and X −◦− g(Z) −◦− (Y, Z) follow since specifying f(Y ) (or g(Z)) immediately specifies the conditional distribution of X given (Y, Z). Also,

P(f(Y ) = g(Z)) = P(i(Y ) = j(Z))
= Σ_{i=1}^m P(i(Y ) = i, j(Z) = i)
= Σ_{i=1}^m P((Y, Z) ∈ Ai)
= P((Y, Z) ∈ A)
= 1.

10. (i) This is the sub-additivity relation for Shannon entropy. It may be derived using the
chain rule for entropy as follows.
H(X1, . . . , Xn) = H(X1) + Σ_{i=2}^n H(Xi|X1, . . . , Xi−1)
≤ Σ_{i=1}^n H(Xi),

where the last line follows from the fact that conditioning reduces entropy. Equality above holds if and only if X1, . . . , Xn are mutually independent.
(ii) For any integer i ≤ np, note that there are (n choose i) subsets of {1, . . . , n} of size equal to i. Thus, the summation on the left hand side of the desired inequality is the total number of subsets of {1, . . . , n} of size at most np.
Let X be a random variable that is uniformly distributed over all subsets of {1, . . . , n}
of size less than or equal to np. That is, for any A ⊆ {1, . . . , n},
 1
 P n , |A| ≤ np,
( )
P (X = A) = i≤np i
0, otherwise.

Then, the Shannon entropy of X is given by

H(X) = log Σ_{i≤np} (n choose i).

For each k ∈ {1, . . . , n}, let us define the random variable Xk as

Xk = 1{k∈X} .

That is, if the index k ∈ X (recall that X is a random subset of size at most np), then
Xk = 1. Else, Xk = 0. Then, we may write X = (X1 , . . . , Xn ) with the interpretation
that Xk = 1 if and only if k ∈ X. Now, we note that for each k ∈ {1, . . . , n},

P(Xk = 1) = P(k ∈ X)
= Σ_{A:|A|≤np} P(X = A) 1{k ∈ A}
= Σ_{i≤np} Σ_{A:|A|=i, k∈A} 1 / ( Σ_{j≤np} (n choose j) )
(a) = Σ_{i≤np} (n−1 choose i−1) / ( Σ_{j≤np} (n choose j) )
= Σ_{i≤np} ( (n choose i) / ( Σ_{j≤np} (n choose j) ) ) · (i/n)
≤ p,

where in writing (a) above, we use the fact that for each k ∈ {1, . . . , n}, there are (n−1 choose i−1) ways of forming a subset of {1, . . . , n} having i elements and containing k; the next line uses the identity (n−1 choose i−1) = (n choose i) · (i/n); and the last line uses the fact that i ≤ np for each term inside the summation, together with the fact that the weights (n choose i) / ( Σ_{j≤np} (n choose j) ), i ≤ np, sum to 1. Thus, it follows that for all k ∈ {1, . . . , n},

H(Xk ) = h(P (Xk = 1)) ≤ h(p),

where the last inequality uses the fact that for p ∈ [0, 0.5], h(·) is increasing.
Combining all of the above facts and using the sub-additivity of Shannon entropy, we have

log Σ_{i≤np} (n choose i) = H(X) = H(X1, . . . , Xn)
≤ Σ_{k=1}^n H(Xk)
≤ nh(p),

from which the desired inequality follows.
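The resulting bound Σ_{i≤np} (n choose i) ≤ 2^{nh(p)} can be checked directly for a few illustrative values of n and p ≤ 1/2:

```python
from math import comb, floor, log2

def h(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

for n, p in [(10, 0.34), (50, 0.21), (200, 0.47)]:     # illustrative values, p <= 1/2
    lhs = log2(sum(comb(n, i) for i in range(floor(n * p) + 1)))
    print(lhs <= n * h(p), round(lhs, 3), round(n * h(p), 3))
```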

11. (i) We first note that, by the chain rule,

I(X1, . . . , Xn ∧ Y ) = Σ_{i=1}^n I(Xi ∧ Y |X1, . . . , Xi−1)
= Σ_{i=1}^n [H(Xi|X1, . . . , Xi−1) − H(Xi|Y, X1, . . . , Xi−1)]
= Σ_{i=1}^n [H(Xi) − H(Xi|Y, X1, . . . , Xi−1)]
≥ Σ_{i=1}^n [H(Xi) − H(Xi|Y )]
= Σ_{i=1}^n I(Xi ∧ Y ),

where the third step uses the fact that X1, . . . , Xn are mutually independent, and the inequality holds since conditioning reduces entropy. Also, since I(X1, . . . , Xn ∧ Y ) = H(Y ) − H(Y |X1, . . . , Xn) ≤ H(Y ) ≤ k, we have that Σ_{i=1}^n I(Xi ∧ Y ) ≤ k, which implies that I(Xi ∧ Y ) ≤ k/n for at least one coordinate i ∈ [n].
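The superadditivity step I(X1, . . . , Xn ∧ Y ) ≥ Σ_i I(Xi ∧ Y ) for independent Xi can be checked numerically; a minimal Python sketch with two independent bits and Y produced by a randomly drawn channel:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(t):
    t = t[t > 0]
    return float(-(t * np.log2(t)).sum())

# X1, X2 independent bits; Y generated from (X1, X2) through a random channel P(y|x1, x2).
px1 = np.array([0.4, 0.6])
px2 = np.array([0.7, 0.3])
W = rng.random((2, 2, 3)); W /= W.sum(axis=2, keepdims=True)
p = px1[:, None, None] * px2[None, :, None] * W          # joint pmf of (X1, X2, Y)

I_joint = H(p.sum(2)) + H(p.sum((0, 1))) - H(p)          # I(X1, X2 ∧ Y)
I_1 = H(p.sum((1, 2))) + H(p.sum((0, 1))) - H(p.sum(1))  # I(X1 ∧ Y)
I_2 = H(p.sum((0, 2))) + H(p.sum((0, 1))) - H(p.sum(0))  # I(X2 ∧ Y)
print(I_joint >= I_1 + I_2 - 1e-12)
```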
(ii) Let X1 = X2 = · · · = Xn = Y with Y ∼ Unif({0, 1}^k). Then, I(Xi ∧ Y ) = k for every i ∈ [n].

12. Let the quantised version of X be denoted by g(X). It is mentioned in the question that
g(X) is observed, and an estimate Û of U is made based on this observation. Furthermore,
it is also given that P(U ≠ Û) ≤ 1/3. Using Fano's inequality, we have

H(U|g(X)) ≤ P(U ≠ Û) log(|U| − 1) + h(P(U ≠ Û))
≤ (1/3) log(2^m − 1) + 1
≤ m/3 + 1.
Also, we have

m = H(U )
= H(U |g(X)) + I(U ∧ g(X))
= H(U |g(X)) + H(g(X)) − H(g(X)|U )
≤ H(U |g(X)) + H(g(X))
≤ H(U |g(X)) + b,

where the last line follows since g(X) takes at most 2^b values, and thus H(g(X)) ≤ log 2^b = b.
From the above results, we get

m − b ≤ H(U|g(X)) ≤ m/3 + 1.

Combining the first and last inequalities above yields

b ≥ 2m/3 − 1.

13. It is clear that if Pinsker’s inequality holds for all distributions, then it holds for binary
distributions as well. We show the other direction, namely, that if Pinsker’s inequality holds
for binary distributions, then it holds for all distributions.

Let P and Q be arbitrary distributions on X, and for a set A ⊂ X, let P′ ≡ Ber(P(A)) and Q′ ≡ Ber(Q(A)) be the corresponding binary distributions. Since Pinsker's inequality holds for binary distributions, we have

sqrt( (ln 2 / 2) D(P′||Q′) ) ≥ d(P′, Q′) = |P(A) − Q(A)|.

Choosing A ⊂ X to maximize |P(A) − Q(A)|, so that |P(A) − Q(A)| = d(P, Q), and squaring, we get D(P′||Q′) ≥ (2/ln 2) d(P, Q)^2. Further, from the data processing inequality, we have D(P||Q) ≥ D(P′||Q′), which together with the previous inequality implies Pinsker's inequality for arbitrary distributions.
We now show Pinsker's inequality for binary distributions. Writing P ≡ Ber(p) and Q ≡ Ber(q), consider D(P||Q) in nats (i.e., with logarithms taken to base e):

D(P||Q) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)).

For a fixed q ∈ (0, 1), let

g(p) = D(P||Q) − 2(p − q)^2
= p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)) − 2p^2 + 4pq − 2q^2.

Then,

dg(p)/dp = ln(p/q) − ln((1 − p)/(1 − q)) − 4(p − q).
Setting dg(p)/dp equal to 0, we get p* = q. Observe that d^2 g(p)/dp^2 = 1/p + 1/(1 − p) − 4 ≥ 0, since p(1 − p) ≤ 1/4. Therefore, p* = q is indeed the minimizer of g(p), and g(p*) = 0. Thus, it follows that

D(P||Q) ≥ 2(p − q)^2.

Equivalently, in bits,

D(P||Q) ≥ (2/ln 2) (p − q)^2.
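Pinsker's inequality in this form can be spot-checked on random (not necessarily binary) distribution pairs; a short Python sketch working in bits:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(P, Q):
    """KL divergence in bits (both pmfs assumed to have full support)."""
    return float((P * np.log2(P / Q)).sum())

def d(P, Q):
    """Variational distance d(P, Q) = (1/2) sum_x |P(x) - Q(x)|."""
    return 0.5 * float(np.abs(P - Q).sum())

for _ in range(5):
    P = rng.random(8); P /= P.sum()
    Q = rng.random(8); Q /= Q.sum()
    print(D(P, Q) >= (2 / np.log(2)) * d(P, Q) ** 2)
```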

14. (i) Note that the term E_P[X] on the right hand side does not depend on Q. Therefore, the right hand side of the desired relation may be expressed as

max_{Q: supp(Q) ⊆ supp(P)} { λ E_Q[X] − D(Q||P) } − λ E_P[X].

Thus, it suffices to show that

log E_P[2^{λX}] = max_{Q: supp(Q) ⊆ supp(P)} { λ E_Q[X] − D(Q||P) }.

Consider an arbitrary distribution Q such that supp(Q) ⊆ supp(P). From the variational formula for KL divergence, we have

D(Q||P) = max_f { E_Q[f(X)] − log E_P[2^{f(X)}] }
≥ E_Q[λX] − log E_P[2^{λX}],

which implies that

log E_P[2^{λX}] ≥ λ E_Q[X] − D(Q||P).

Since the above is true for all distributions Q such that supp(Q) ⊆ supp(P), we have the inequality

log E_P[2^{λX}] ≥ max_{Q: supp(Q) ⊆ supp(P)} { λ E_Q[X] − D(Q||P) }.    (2)

Now, consider the distribution Q∗ given by

Q∗(x) = P(x) 2^{λx} / E_P[2^{λX}],   x ∈ R.

Clearly, Q∗ is a discrete probability distribution on R, and supp(Q∗) ⊆ supp(P). Also, we have

D(Q∗||P) = Σ_x Q∗(x) log ( Q∗(x) / P(x) )
= Σ_x Q∗(x) log ( 2^{λx} / E_P[2^{λX}] )
= λ E_{Q∗}[X] − log E_P[2^{λX}],

from which it follows that

log E_P[2^{λX}] = λ E_{Q∗}[X] − D(Q∗||P)
≤ max_{Q: supp(Q) ⊆ supp(P)} { λ E_Q[X] − D(Q||P) }.    (3)

The desired relation now follows from (2) and (3).
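The identity in part (i), and the fact that the tilted distribution Q∗ attains the maximum, can be verified numerically on a finite support; a Python sketch with illustrative support points and λ:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(Q, P):
    """KL divergence in bits (supports assumed to coincide)."""
    return float((Q * np.log2(Q / P)).sum())

x = np.array([-1.0, -0.2, 0.3, 0.7, 1.0])   # illustrative support of X
P = rng.random(x.size); P /= P.sum()
lam = 1.7                                    # illustrative value of lambda

lhs = np.log2((P * 2.0 ** (lam * x)).sum())  # log E_P[2^{lam X}]

# The tilted distribution Q*(x) = P(x) 2^{lam x} / E_P[2^{lam X}] attains the maximum.
Qstar = P * 2.0 ** (lam * x); Qstar /= Qstar.sum()
print(np.isclose(lhs, lam * (Qstar * x).sum() - D(Qstar, P)))

# Any other Q with the same support gives a value that is no larger.
Q = rng.random(x.size); Q /= Q.sum()
print(lhs >= lam * (Q * x).sum() - D(Q, P) - 1e-12)
```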


(ii) We have

d(P, Q) = (1/2) Σ_x |P(x) − Q(x)|
(a) ≥ (1/2) Σ_x |x| |P(x) − Q(x)|
(b) ≥ (1/2) | Σ_x x P(x) − Σ_x x Q(x) |
= (1/2) | E_P[X] − E_Q[X] |
≥ (1/2) ( E_Q[X] − E_P[X] ),

where (a) above is due to the fact that |X| ≤ 1 with probability 1 under both distributions P and Q, and (b) is due to the triangle inequality. The desired relation thus follows.
(iii) This is a direct application of Hoeffding's lemma (using the fact that |X| ≤ 1 with probability 1), which gives

loge E_P[e^{λ(X − E_P[X])}] ≤ λ^2 / 2.

This in turn implies that

log2 E_P[2^{λ(X − E_P[X])}] = (1/loge 2) loge E_P[e^{λ(X − E_P[X]) loge 2}] ≤ (λ^2 loge 2) / 2.

