
Harvard SEAS

ES250 Information Theory

Entropy, relative entropy, and mutual information


1 Entropy

1.1 Entropy of a random variable

Definition The entropy of a discrete random variable $X$ with pmf $p_X(x)$ is
$$H(X) = -\sum_{x} p(x) \log p(x)$$

The entropy measures the expected uncertainty in $X$. It has the following properties:
$H(X) \geq 0$: entropy is always non-negative, with $H(X) = 0$ iff $X$ is deterministic.
Since $H_b(X) = (\log_b a) H_a(X)$, we don't need to specify the base of the logarithm.
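As a quick illustration, here is a minimal Python sketch (the function name `entropy` and the example pmfs are assumptions made for illustration, not part of the notes) that evaluates $H(X)$ for a finite pmf:

```python
import numpy as np

def entropy(p, base=2):
    """H(X) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # zero-probability outcomes contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))    # 0.0: a deterministic variable
print(entropy([0.9, 0.1]))    # ~0.47 bits: less uncertain than a fair coin
```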

1.2 Joint entropy and conditional entropy

Definition Joint entropy between two random variables $X$ and $Y$ is
$$H(X, Y) \triangleq -E_{p(x,y)}[\log p(X, Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$$

Definition Given a random variable $X$, the conditional entropy of $Y$ (averaged over $X$) is
$$H(Y|X) \triangleq E_{p(x)}[H(Y|X = x)] = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)$$
$$= -E_{p(x)} E_{p(y|x)}[\log p(Y|X)] = -E_{p(x,y)}[\log p(Y|X)]$$

Note: $H(X|Y) \neq H(Y|X)$.
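The following sketch (the joint table and the helper name `H` are illustrative choices, not from the notes) computes $H(X,Y)$ and both conditional entropies directly from the definitions, showing that they differ:

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A small joint pmf p(x, y); rows index x, columns index y.
pxy = np.array([[1/8,  1/16, 1/32, 1/32],
                [1/16, 1/8,  1/32, 1/32],
                [1/16, 1/16, 1/16, 1/16],
                [1/4,  0.0,  0.0,  0.0 ]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_XY = H(pxy)   # joint entropy H(X, Y)
# Conditional entropies directly from the definition H(Y|X) = sum_x p(x) H(Y|X=x)
H_Y_given_X = sum(px[i] * H(pxy[i, :] / px[i]) for i in range(len(px)))
H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
print(H_XY, H_Y_given_X, H_X_given_Y)   # 3.375, 1.375, 1.625 -- H(Y|X) != H(X|Y)
```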

1.3 Chain rule

Joint and conditional entropy provide a natural calculus:


Theorem (Chain rule)
$$H(X, Y) = H(X) + H(Y|X)$$
Corollary
$$H(X, Y|Z) = H(X|Z) + H(Y|X, Z)$$
Based on Cover & Thomas, Chapter 2.
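A quick numerical sanity check of the chain rule (a sketch only; the random joint pmf and the helper `H` are assumptions made for illustration):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
pxy = rng.random((4, 5)); pxy /= pxy.sum()   # an arbitrary joint pmf p(x, y)
px = pxy.sum(axis=1)

H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
# Chain rule: H(X, Y) = H(X) + H(Y|X)
print(np.isclose(H(pxy), H(px) + H_Y_given_X))   # True
```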


2 Relative Entropy and Mutual Information

2.1 Entropy and Mutual Information

Entropy $H(X)$ is the uncertainty (self-information) of a single random variable.

Conditional entropy $H(X|Y)$ is the entropy of one random variable conditional upon knowledge of another.
We call the reduction in uncertainty mutual information:
$$I(X; Y) = H(X) - H(X|Y)$$
Eventually we will show that the maximum rate of transmission over a given channel $p(Y|X)$, such that the
error probability goes to zero, is given by the channel capacity:
$$C = \max_{p(x)} I(X; Y)$$

Theorem Relationship between mutual information and entropy:
$$I(X; Y) = H(X) - H(X|Y)$$
$$I(X; Y) = H(Y) - H(Y|X)$$
$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
$$I(X; Y) = I(Y; X) \quad \text{(symmetry)}$$
$$I(X; X) = H(X) \quad \text{(self-information)}$$

2.2 Relative Entropy and Mutual Information

Definition Relative entropy (information- or Kullback-Leibler divergence)
$$D(p \| q) \triangleq E_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$
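A minimal sketch of the divergence computation (the helper name `kl_divergence` and the example pmfs are illustrative assumptions, not from the notes):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p(x) = 0 contribute 0 (q must be > 0 where p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # ~0.085 bits
print(kl_divergence(q, p))   # ~0.082 bits -- the divergence is not symmetric
```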

Definition Mutual information (in terms of divergence)
$$I(X; Y) \triangleq D(p(x, y) \| p(x)p(y)) = E_{p(x,y)}\left[\log \frac{p(X, Y)}{p(X)p(Y)}\right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$$
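To confirm that the divergence form agrees with the entropy identities of the previous theorem, here is a short sketch (the random pmf and the helper `H` are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
pxy = rng.random((3, 4)); pxy /= pxy.sum()   # a strictly positive joint pmf
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_div = np.sum(pxy * np.log2(pxy / np.outer(px, py)))   # D(p(x,y) || p(x)p(y))
I_ent = H(px) + H(py) - H(pxy)                          # H(X) + H(Y) - H(X,Y)
print(np.isclose(I_div, I_ent))                         # True
```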

3 Chain Rules

3.1 Chain Rule for Entropy

The entropy of a collection of random variables is the sum of the conditional entropies:

Theorem (Chain rule for entropy) For $(X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n)$,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)$$
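A numerical check of the chain rule for three variables, computing each conditional entropy from its definition (the random joint pmf and the helper `H` are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf array of any shape; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
p123 = rng.random((2, 3, 4)); p123 /= p123.sum()   # joint pmf of (X1, X2, X3)
p12 = p123.sum(axis=2)
p1 = p12.sum(axis=1)

# Conditional entropies computed from their definitions
H2_given_1 = sum(p1[i] * H(p12[i] / p1[i]) for i in range(p1.size))
H3_given_12 = sum(p12[i, j] * H(p123[i, j] / p12[i, j])
                  for i in range(p12.shape[0]) for j in range(p12.shape[1]))

# Chain rule: H(X1, X2, X3) = H(X1) + H(X2|X1) + H(X3|X1, X2)
print(np.isclose(H(p123), H(p1) + H2_given_1 + H3_given_12))   # True
```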

3.2 Chain Rule for Mutual Information

Definition Conditional mutual information
$$I(X; Y|Z) \triangleq H(X|Z) - H(X|Y, Z) = E_{p(x,y,z)}\left[\log \frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right]$$

Theorem (Chain rule for mutual information)
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1)$$

3.3 Chain Rule for Relative Entropy

Definition Conditional relative entropy
$$D(p(y|x) \| q(y|x)) \triangleq E_{p(x,y)}\left[\log \frac{p(Y|X)}{q(Y|X)}\right] = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}$$

Theorem (Chain rule for relative entropy)
$$D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x))$$

4 Jensen's Inequality

Recall that a convex function on an interval is one for which every chord lies (on or) above the function on
that interval.
A function $f$ is concave if $-f$ is convex.

Theorem (Jensen's inequality) If $f$ is convex, then
$$E[f(X)] \geq f(E[X]).$$
If $f$ is strictly convex, then equality implies $X = E[X]$ with probability 1.
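A tiny numerical illustration of Jensen's inequality for the convex function $f(x) = x^2$ (the choice of $f$ and the random pmf are assumptions made only for illustration):

```python
import numpy as np

# Jensen's inequality: E[f(X)] >= f(E[X]) for convex f, here f(x) = x**2.
rng = np.random.default_rng(3)
p = rng.random(6); p /= p.sum()        # random pmf over 6 support points
x = np.arange(6, dtype=float)          # support of X

lhs = np.sum(p * x**2)                 # E[f(X)]
rhs = np.sum(p * x)**2                 # f(E[X])
print(lhs >= rhs)                      # True; the gap equals Var(X) > 0 here
```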

4.1 Consequences

Theorem (Information inequality)
$$D(p \| q) \geq 0$$
with equality iff $p = q$.
Corollary (Nonnegativity of mutual information)
$$I(X; Y) \geq 0$$
with equality iff $X$ and $Y$ are independent.
Corollary (Conditional information inequality)
$$D(p(y|x) \| q(y|x)) \geq 0$$
with equality iff $p(y|x) = q(y|x)$ for all $x, y$ s.t. $p(x) > 0$.
Corollary (Nonnegativity of conditional mutual information)
$$I(X; Y|Z) \geq 0$$
with equality iff $X$ and $Y$ are conditionally independent given $Z$.
4.2 Some Inequalities

Theorem
$$H(X) \leq \log |\mathcal{X}|$$
with equality iff $X$ has a uniform distribution over $\mathcal{X}$.
Theorem (Conditioning reduces entropy)
$$H(X|Y) \leq H(X)$$
with equality iff $X$ and $Y$ are independent.
Theorem (Independence bound on entropy)
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i)$$
with equality iff the $X_i$ are independent.
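These bounds are easy to check numerically; in the sketch below (the random pmf and the helper `H` are illustrative assumptions), both hold for an arbitrary joint distribution:

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(4)
pxy = rng.random((5, 3)); pxy /= pxy.sum()      # random joint pmf p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# H(X) <= log |X|, with equality only for the uniform distribution
print(H(px) <= np.log2(len(px)))                # True

# Conditioning reduces entropy: H(X|Y) = H(X,Y) - H(Y) <= H(X)
print(H(pxy) - H(py) <= H(px))                  # True
```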

5 Log Sum Inequality and its Application

Theorem (Log sum inequality) For nonnegative $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$$
with equality iff $a_i / b_i = \text{const}$.
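A quick numerical check of the log sum inequality, including the equality case $a_i = 3 b_i$ (the random vectors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.random(10)        # nonnegative a_i (strictly positive here)
b = rng.random(10)        # nonnegative b_i

lhs = np.sum(a * np.log2(a / b))
rhs = np.sum(a) * np.log2(np.sum(a) / np.sum(b))
print(lhs >= rhs)         # True, as the log sum inequality guarantees

# Equality when a_i / b_i is constant, e.g. a = 3 * b:
a_eq = 3 * b
lhs_eq = np.sum(a_eq * np.log2(a_eq / b))
rhs_eq = np.sum(a_eq) * np.log2(np.sum(a_eq) / np.sum(b))
print(np.isclose(lhs_eq, rhs_eq))   # True
```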

Theorem (Convexity of relative entropy) $D(p \| q)$ is convex in the pair $(p, q)$, so that for pmfs $(p_1, q_1)$ and $(p_2, q_2)$
we have, for all $0 \leq \lambda \leq 1$:
$$D(\lambda p_1 + (1 - \lambda) p_2 \| \lambda q_1 + (1 - \lambda) q_2) \leq \lambda D(p_1 \| q_1) + (1 - \lambda) D(p_2 \| q_2)$$
Theorem (Concavity of entropy) For $X \sim p(x)$, we have that
$H(p) := H_p(X)$ is a concave function of $p(x)$.
Theorem Let $(X, Y) \sim p(x, y) = p(x)p(y|x)$.
Then $I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$, and a convex function of $p(y|x)$ for fixed $p(x)$.

6 Data-Processing Inequality

6.1 Markov Chain

Definition $X, Y, Z$ form a Markov chain in that order ($X \to Y \to Z$) iff
$$p(x, y, z) = p(x)p(y|x)p(z|y).$$
Some consequences:
$X \to Y \to Z$ iff $X$ and $Z$ are conditionally independent given $Y$.
$X \to Y \to Z$ implies $Z \to Y \to X$. Thus, we can write $X \leftrightarrow Y \leftrightarrow Z$.
If $Z = f(Y)$, then $X \to Y \to Z$.

6.2 Data-Processing Inequality

Theorem (Data-processing inequality) If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Z)$.

Corollary If $Z = g(Y)$, then $I(X; Y) \geq I(X; g(Y))$.
Corollary If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Y|Z)$.
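The data-processing inequality can be checked by simulating a Markov chain $X \to Y \to Z$ built from a source pmf and two stochastic matrices (all chosen at random here; names such as `mutual_info` are assumptions for illustration):

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in bits from a joint pmf array (rows: x, cols: y)."""
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(6)
px = rng.random(3); px /= px.sum()                          # p(x)
A = rng.random((3, 4)); A /= A.sum(axis=1, keepdims=True)   # p(y|x)
B = rng.random((4, 3)); B /= B.sum(axis=1, keepdims=True)   # p(z|y)

pxy = px[:, None] * A                                       # p(x, y)
pxz = pxy @ B                                               # p(x, z) under X -> Y -> Z
print(mutual_info(pxy) >= mutual_info(pxz))                 # True
```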

7 Sufficient Statistics

7.1 Statistics and Mutual Information

Consider a family of probability distributions $\{f_\theta(x)\}$ indexed by $\theta$.

If $X \sim f(x \mid \theta)$ for fixed $\theta$ and $T(X)$ is any statistic (i.e., a function of the sample $X$), then we have
$\theta \to X \to T(X)$.
The data-processing inequality in turn implies
$$I(\theta; X) \geq I(\theta; T(X))$$
for any distribution on $\theta$.
Is it possible to choose a statistic that preserves all of the information in $X$ about $\theta$?

7.2 Sufficient Statistics and Compression

Definition (Sufficient Statistic) A function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_\theta(x)\}$ if the
conditional distribution of $X$, given $T(X) = t$, is independent of $\theta$ for any distribution on $\theta$ (Fisher-Neyman):
$$f_\theta(x) = f(x \mid t) f_\theta(t)$$
Equivalently, $\theta \to T(X) \to X$, so that
$$I(\theta; T(X)) \geq I(\theta; X)$$
Hence, $I(\theta; X) = I(\theta; T(X))$ for a sufficient statistic.


Definition (Minimal Sufficient Statistic) A function $T(X)$ is a minimal sufficient statistic relative to $\{f_\theta(x)\}$ if it
is a function of every other sufficient statistic $U$, in which case
$$\theta \to T(X) \to U(X) \to X$$
and information about $\theta$ in the sample is maximally compressed.
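As a concrete illustration (a sketch, not part of the notes: the two-point prior on $\theta$, the sample size, and the helper names are assumptions), for i.i.d. Bernoulli($\theta$) observations the number of ones $T(X^n) = \sum_i X_i$ is sufficient, and numerically $I(\theta; X^n) = I(\theta; T(X^n))$:

```python
import numpy as np
from itertools import product

def mutual_info(pj):
    """I(A;B) in bits from a joint pmf matrix (rows: a, cols: b)."""
    pa, pb = pj.sum(axis=1, keepdims=True), pj.sum(axis=0, keepdims=True)
    mask = pj > 0
    return np.sum(pj[mask] * np.log2(pj[mask] / (pa @ pb)[mask]))

# Two-point prior on theta and Bernoulli(theta) samples of length n.
thetas, prior, n = [0.3, 0.8], [0.5, 0.5], 4
seqs = list(product([0, 1], repeat=n))

# Joint pmf of (theta, X^n) and of (theta, T), where T(X^n) = sum of the bits.
p_theta_x = np.array([[prior[i] * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                      for i, th in enumerate(thetas)])
p_theta_t = np.zeros((2, n + 1))
for j, s in enumerate(seqs):
    p_theta_t[:, sum(s)] += p_theta_x[:, j]

# T is sufficient for theta, so no information is lost.
print(np.isclose(mutual_info(p_theta_x), mutual_info(p_theta_t)))   # True
```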

8 Fano's Inequality

8.1 Fano's Inequality and Estimation Error

Fano's inequality relates the probability of estimation error to conditional entropy:

Theorem (Fano's inequality) For any estimator $\hat{X}: X \to Y \to \hat{X}$, with $P_e = \Pr\{\hat{X} \neq X\}$, we have
$$H(P_e) + P_e \log |\mathcal{X}| \geq H(X|\hat{X}) \geq H(X|Y).$$
This implies
$$1 + P_e \log |\mathcal{X}| \geq H(X|Y)$$
or
$$P_e \geq \frac{H(X|Y) - 1}{\log |\mathcal{X}|}.$$
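A numerical check of Fano's inequality for the MAP estimator $\hat{X}(y) = \arg\max_x p(x|y)$ on a random joint pmf (the pmf and helper names are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(7)
pxy = rng.random((4, 5)); pxy /= pxy.sum()           # joint pmf p(x, y); rows: x
py = pxy.sum(axis=0)

# MAP estimator x_hat(y) = argmax_x p(x|y) and its error probability P_e
x_hat = pxy.argmax(axis=0)
Pe = 1.0 - sum(pxy[x_hat[j], j] for j in range(pxy.shape[1]))

H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
lhs = H([Pe, 1 - Pe]) + Pe * np.log2(pxy.shape[0])   # H(P_e) + P_e log|X|
print(lhs >= H_X_given_Y)                            # True, as Fano guarantees
```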

8.2 Implications of Fano's Inequality

Some corollaries follow:

Corollary Let $p = \Pr\{X \neq Y\}$. Then
$$H(p) + p \log |\mathcal{X}| \geq H(X|Y).$$
Corollary Let $P_e = \Pr\{\hat{X} \neq X\}$, and constrain $\hat{X}: \mathcal{Y} \to \mathcal{X}$; then
$$H(P_e) + P_e \log(|\mathcal{X}| - 1) \geq H(X|Y).$$

8.3 Sharpness of Fano's inequality

Suppose there is no observation $Y$, so that $X$ must simply be guessed, and order $X \in \{1, 2, \ldots, m\}$ such that $p_1 \geq p_2 \geq
\cdots \geq p_m$. Then $\hat{X} = 1$ is the optimal estimate of $X$, with $P_e = 1 - p_1$, and Fano's inequality becomes
$$H(P_e) + P_e \log(m - 1) \geq H(X).$$
The pmf
$$(p_1, p_2, \ldots, p_m) = \left(1 - P_e, \frac{P_e}{m - 1}, \ldots, \frac{P_e}{m - 1}\right)$$
achieves this bound with equality.

8.4 Applications of Fano's inequality

Lemma If $X$ and $X'$ are i.i.d. with entropy $H(X)$, then
$$\Pr\{X = X'\} \geq 2^{-H(X)},$$
with equality iff $X$ has a uniform distribution.
Corollary Let $X, X'$ be independent with $X \sim p(x)$, $X' \sim r(x)$; $x, x' \in \mathcal{X}$. Then
$$\Pr\{X = X'\} \geq 2^{-H(p) - D(p \| r)},$$
and
$$\Pr\{X = X'\} \geq 2^{-H(r) - D(r \| p)}.$$
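Both bounds are easy to verify numerically, since $\Pr\{X = X'\} = \sum_x p(x)^2$ in the i.i.d. case and $\sum_x p(x) r(x)$ in the independent case (the random pmfs and helper names below are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):
    """Relative entropy D(p || q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(8)
p = rng.random(6); p /= p.sum()
r = rng.random(6); r /= r.sum()

# Collision probability of two i.i.d. copies of X ~ p is sum_x p(x)^2.
print(np.sum(p**2) >= 2.0**(-H(p)))                   # True
# For independent X ~ p and X' ~ r, the collision probability is sum_x p(x) r(x).
print(np.sum(p * r) >= 2.0**(-H(p) - D(p, r)))        # True
print(np.sum(p * r) >= 2.0**(-H(r) - D(r, p)))        # True
```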
