
Harvard SEAS

ES250 Information Theory

Entropy, relative entropy, and mutual information


1 Entropy

1.1 Entropy of a random variable

Definition The entropy of a discrete random variable $X$ with pmf $p_X(x)$ is
$$H(X) = -\sum_{x} p(x) \log p(x)$$

The entropy measures the expected uncertainty in $X$. It has the following properties:
$H(X) \geq 0$: entropy is always non-negative, with $H(X) = 0$ iff $X$ is deterministic.
Since $H_b(X) = (\log_b a) H_a(X)$, we don't need to specify the base of the logarithm.
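As a quick illustration, here is a minimal Python sketch (the function name `entropy` and the example pmfs are assumptions made for illustration, not part of the notes) that evaluates $H(X)$ for a finite pmf:

```python
import numpy as np

def entropy(p, base=2):
    """H(X) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # zero-probability outcomes contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))    # 0.0: a deterministic variable
print(entropy([0.9, 0.1]))    # ~0.47 bits: less uncertain than a fair coin
```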

1.2 Joint entropy and conditional entropy

Definition Joint entropy between two random variables $X$ and $Y$ is
$$H(X, Y) \triangleq -E_{p(x,y)}[\log p(X, Y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$$

Definition Given a random variable $X$, the conditional entropy of $Y$ (averaged over $X$) is
$$H(Y|X) \triangleq E_{p(x)}[H(Y|X = x)] = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)$$
$$= -E_{p(x)} E_{p(y|x)}[\log p(Y|X)] = -E_{p(x,y)}[\log p(Y|X)]$$

Note: $H(X|Y) \neq H(Y|X)$.
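The following sketch (the joint table and the helper name `H` are illustrative choices, not from the notes) computes $H(X,Y)$ and both conditional entropies directly from the definitions, showing that they differ:

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A small joint pmf p(x, y); rows index x, columns index y.
pxy = np.array([[1/8,  1/16, 1/32, 1/32],
                [1/16, 1/8,  1/32, 1/32],
                [1/16, 1/16, 1/16, 1/16],
                [1/4,  0.0,  0.0,  0.0 ]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_XY = H(pxy)   # joint entropy H(X, Y)
# Conditional entropies directly from the definition H(Y|X) = sum_x p(x) H(Y|X=x)
H_Y_given_X = sum(px[i] * H(pxy[i, :] / px[i]) for i in range(len(px)))
H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
print(H_XY, H_Y_given_X, H_X_given_Y)   # 3.375, 1.375, 1.625 -- H(Y|X) != H(X|Y)
```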

1.3 Chain rule

Joint and conditional entropy provide a natural calculus:


Theorem (Chain rule)
$$H(X, Y) = H(X) + H(Y|X)$$
Corollary
$$H(X, Y|Z) = H(X|Z) + H(Y|X, Z)$$
Based on Cover & Thomas, Chapter 2.
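A quick numerical sanity check of the chain rule (a sketch only; the random joint pmf and the helper `H` are assumptions made for illustration):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
pxy = rng.random((4, 5)); pxy /= pxy.sum()   # an arbitrary joint pmf p(x, y)
px = pxy.sum(axis=1)

H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
# Chain rule: H(X, Y) = H(X) + H(Y|X)
print(np.isclose(H(pxy), H(px) + H_Y_given_X))   # True
```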


2 Relative Entropy and Mutual Information

2.1 Entropy and Mutual Information

Entropy $H(X)$ is the uncertainty (self-information) of a single random variable.

Conditional entropy $H(X|Y)$ is the entropy of one random variable conditional upon knowledge of another.
We call the reduction in uncertainty mutual information:
$$I(X; Y) = H(X) - H(X|Y)$$
Eventually we will show that the maximum rate of transmission over a given channel $p(Y|X)$, such that the
error probability goes to zero, is given by the channel capacity:
$$C = \max_{p(x)} I(X; Y)$$

Theorem Relationship between mutual information and entropy:
$$I(X; Y) = H(X) - H(X|Y)$$
$$I(X; Y) = H(Y) - H(Y|X)$$
$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
$$I(X; Y) = I(Y; X) \quad \text{(symmetry)}$$
$$I(X; X) = H(X) \quad \text{(self-information)}$$

2.2 Relative Entropy and Mutual Information

Definition Relative entropy (information- or Kullback-Leibler divergence)
$$D(p \| q) \triangleq E_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$
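A minimal sketch of the divergence computation (the helper name `kl_divergence` and the example pmfs are illustrative assumptions, not from the notes):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p(x) = 0 contribute 0 (q must be > 0 where p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # ~0.085 bits
print(kl_divergence(q, p))   # ~0.082 bits -- the divergence is not symmetric
```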

Definition Mutual information (in terms of divergence)
$$I(X; Y) \triangleq D(p(x, y) \| p(x)p(y)) = E_{p(x,y)}\left[\log \frac{p(X, Y)}{p(X)p(Y)}\right] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$$
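To confirm that the divergence form agrees with the entropy identities of the previous theorem, here is a short sketch (the random pmf and the helper `H` are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
pxy = rng.random((3, 4)); pxy /= pxy.sum()   # a strictly positive joint pmf
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_div = np.sum(pxy * np.log2(pxy / np.outer(px, py)))   # D(p(x,y) || p(x)p(y))
I_ent = H(px) + H(py) - H(pxy)                          # H(X) + H(Y) - H(X,Y)
print(np.isclose(I_div, I_ent))                         # True
```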

3 Chain Rules

3.1 Chain Rule for Entropy

The entropy of a collection of random variables is the sum of the conditional entropies:

Theorem (Chain rule for entropy) For $(X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n)$,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1)$$
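A numerical check of the chain rule for three variables, computing each conditional entropy from its definition (the random joint pmf and the helper `H` are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf array of any shape; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
p123 = rng.random((2, 3, 4)); p123 /= p123.sum()   # joint pmf of (X1, X2, X3)
p12 = p123.sum(axis=2)
p1 = p12.sum(axis=1)

# Conditional entropies computed from their definitions
H2_given_1 = sum(p1[i] * H(p12[i] / p1[i]) for i in range(p1.size))
H3_given_12 = sum(p12[i, j] * H(p123[i, j] / p12[i, j])
                  for i in range(p12.shape[0]) for j in range(p12.shape[1]))

# Chain rule: H(X1, X2, X3) = H(X1) + H(X2|X1) + H(X3|X1, X2)
print(np.isclose(H(p123), H(p1) + H2_given_1 + H3_given_12))   # True
```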

3.2 Chain Rule for Mutual Information

Definition Conditional mutual information
$$I(X; Y|Z) \triangleq H(X|Z) - H(X|Y, Z) = E_{p(x,y,z)}\left[\log \frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right]$$

Theorem (Chain rule for mutual information)
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1)$$

3.3 Chain Rule for Relative Entropy

Definition Conditional relative entropy
$$D(p(y|x) \| q(y|x)) \triangleq E_{p(x,y)}\left[\log \frac{p(Y|X)}{q(Y|X)}\right] = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}$$

Theorem (Chain rule for relative entropy)
$$D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x))$$

4 Jensen's Inequality

Recall that a convex function on an interval is one for which every chord lies (on or) above the function on
that interval.
A function $f$ is concave if $-f$ is convex.

Theorem (Jensen's inequality) If $f$ is convex, then
$$E[f(X)] \geq f(E[X]).$$
If $f$ is strictly convex, then equality implies $X = E[X]$ with probability 1.
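A tiny numerical illustration of Jensen's inequality for the convex function $f(x) = x^2$ (the choice of $f$ and the random pmf are assumptions made only for illustration):

```python
import numpy as np

# Jensen's inequality: E[f(X)] >= f(E[X]) for convex f, here f(x) = x**2.
rng = np.random.default_rng(3)
p = rng.random(6); p /= p.sum()        # random pmf over 6 support points
x = np.arange(6, dtype=float)          # support of X

lhs = np.sum(p * x**2)                 # E[f(X)]
rhs = np.sum(p * x)**2                 # f(E[X])
print(lhs >= rhs)                      # True; the gap equals Var(X) > 0 here
```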

4.1 Consequences

Theorem (Information inequality)
$$D(p \| q) \geq 0$$
with equality iff $p = q$.
Corollary (Nonnegativity of mutual information)
$$I(X; Y) \geq 0$$
with equality iff $X$ and $Y$ are independent.
Corollary (Conditional information inequality)
$$D(p(y|x) \| q(y|x)) \geq 0$$
with equality iff $p(y|x) = q(y|x)$ for all $x, y$ s.t. $p(x) > 0$.
Corollary (Nonnegativity of conditional mutual information)
$$I(X; Y|Z) \geq 0$$
with equality iff $X$ and $Y$ are conditionally independent given $Z$.
4.2 Some Inequalities

Theorem
$$H(X) \leq \log |\mathcal{X}|$$
with equality iff $X$ has a uniform distribution over $\mathcal{X}$.
Theorem (Conditioning reduces entropy)
$$H(X|Y) \leq H(X)$$
with equality iff $X$ and $Y$ are independent.
Theorem (Independence bound on entropy)
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i)$$
with equality iff the $X_i$ are independent.
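These bounds are easy to check numerically; in the sketch below (the random pmf and the helper `H` are illustrative assumptions), both hold for an arbitrary joint distribution:

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(4)
pxy = rng.random((5, 3)); pxy /= pxy.sum()      # random joint pmf p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# H(X) <= log |X|, with equality only for the uniform distribution
print(H(px) <= np.log2(len(px)))                # True

# Conditioning reduces entropy: H(X|Y) = H(X,Y) - H(Y) <= H(X)
print(H(pxy) - H(py) <= H(px))                  # True
```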

5 Log Sum Inequality and its Application

Theorem (Log sum inequality) For nonnegative $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^{n} a_i \right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$$
with equality iff $a_i / b_i = \text{const}$.
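A quick numerical check of the log sum inequality, including the equality case $a_i = 3 b_i$ (the random vectors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.random(10)        # nonnegative a_i (strictly positive here)
b = rng.random(10)        # nonnegative b_i

lhs = np.sum(a * np.log2(a / b))
rhs = np.sum(a) * np.log2(np.sum(a) / np.sum(b))
print(lhs >= rhs)         # True, as the log sum inequality guarantees

# Equality when a_i / b_i is constant, e.g. a = 3 * b:
a_eq = 3 * b
lhs_eq = np.sum(a_eq * np.log2(a_eq / b))
rhs_eq = np.sum(a_eq) * np.log2(np.sum(a_eq) / np.sum(b))
print(np.isclose(lhs_eq, rhs_eq))   # True
```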

Theorem (Convexity of relative entropy) $D(p \| q)$ is convex in the pair $(p, q)$, so that for pmfs $(p_1, q_1)$ and $(p_2, q_2)$
we have, for all $0 \leq \lambda \leq 1$:
$$D(\lambda p_1 + (1 - \lambda) p_2 \| \lambda q_1 + (1 - \lambda) q_2) \leq \lambda D(p_1 \| q_1) + (1 - \lambda) D(p_2 \| q_2)$$
Theorem (Concavity of entropy) For $X \sim p(x)$, we have that
$H(p) := H_p(X)$ is a concave function of $p(x)$.
Theorem Let $(X, Y) \sim p(x, y) = p(x)p(y|x)$.
Then $I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$, and a convex function of $p(y|x)$ for fixed $p(x)$.

6 Data-Processing Inequality

6.1 Markov Chain

Definition $X, Y, Z$ form a Markov chain in that order ($X \to Y \to Z$) iff
$$p(x, y, z) = p(x)p(y|x)p(z|y).$$
Some consequences:
$X \to Y \to Z$ iff $X$ and $Z$ are conditionally independent given $Y$.
$X \to Y \to Z$ implies $Z \to Y \to X$. Thus, we can write $X \leftrightarrow Y \leftrightarrow Z$.
If $Z = f(Y)$, then $X \to Y \to Z$.

6.2 Data-Processing Inequality

Theorem (Data-processing inequality) If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Z)$.

Corollary If $Z = g(Y)$, then $I(X; Y) \geq I(X; g(Y))$.
Corollary If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Y|Z)$.
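The data-processing inequality can be checked by simulating a Markov chain $X \to Y \to Z$ built from a source pmf and two stochastic matrices (all chosen at random here; names such as `mutual_info` are assumptions for illustration):

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in bits from a joint pmf array (rows: x, cols: y)."""
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(6)
px = rng.random(3); px /= px.sum()                          # p(x)
A = rng.random((3, 4)); A /= A.sum(axis=1, keepdims=True)   # p(y|x)
B = rng.random((4, 3)); B /= B.sum(axis=1, keepdims=True)   # p(z|y)

pxy = px[:, None] * A                                       # p(x, y)
pxz = pxy @ B                                               # p(x, z) under X -> Y -> Z
print(mutual_info(pxy) >= mutual_info(pxz))                 # True
```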

7 Sufficient Statistics

7.1 Statistics and Mutual Information

Consider a family of probability distributions $\{f_\theta(x)\}$ indexed by $\theta$.

If $X \sim f(x \mid \theta)$ for fixed $\theta$ and $T(X)$ is any statistic (i.e., a function of the sample $X$), then we have
$\theta \to X \to T(X)$.
The data-processing inequality in turn implies
$$I(\theta; X) \geq I(\theta; T(X))$$
for any distribution on $\theta$.
Is it possible to choose a statistic that preserves all of the information in $X$ about $\theta$?

7.2 Sufficient Statistics and Compression

Definition (Sufficient Statistic) A function $T(X)$ is said to be a sufficient statistic relative to the family $\{f_\theta(x)\}$ if the
conditional distribution of $X$, given $T(X) = t$, is independent of $\theta$ for any distribution on $\theta$ (Fisher-Neyman):
$$f_\theta(x) = f(x \mid t) f_\theta(t)$$
Equivalently, $\theta \to T(X) \to X$, so that
$$I(\theta; T(X)) \geq I(\theta; X)$$
Hence, $I(\theta; X) = I(\theta; T(X))$ for a sufficient statistic.


Definition (Minimal Sufficient Statistic) A function $T(X)$ is a minimal sufficient statistic relative to $\{f_\theta(x)\}$ if it
is a function of every other sufficient statistic $U$, in which case
$$\theta \to T(X) \to U(X) \to X$$
and information about $\theta$ in the sample is maximally compressed.
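As a concrete illustration (a sketch, not part of the notes: the two-point prior on $\theta$, the sample size, and the helper names are assumptions), for i.i.d. Bernoulli($\theta$) observations the number of ones $T(X^n) = \sum_i X_i$ is sufficient, and numerically $I(\theta; X^n) = I(\theta; T(X^n))$:

```python
import numpy as np
from itertools import product

def mutual_info(pj):
    """I(A;B) in bits from a joint pmf matrix (rows: a, cols: b)."""
    pa, pb = pj.sum(axis=1, keepdims=True), pj.sum(axis=0, keepdims=True)
    mask = pj > 0
    return np.sum(pj[mask] * np.log2(pj[mask] / (pa @ pb)[mask]))

# Two-point prior on theta and Bernoulli(theta) samples of length n.
thetas, prior, n = [0.3, 0.8], [0.5, 0.5], 4
seqs = list(product([0, 1], repeat=n))

# Joint pmf of (theta, X^n) and of (theta, T), where T(X^n) = sum of the bits.
p_theta_x = np.array([[prior[i] * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                      for i, th in enumerate(thetas)])
p_theta_t = np.zeros((2, n + 1))
for j, s in enumerate(seqs):
    p_theta_t[:, sum(s)] += p_theta_x[:, j]

# T is sufficient for theta, so no information is lost.
print(np.isclose(mutual_info(p_theta_x), mutual_info(p_theta_t)))   # True
```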

8 Fano's Inequality

8.1 Fano's Inequality and Estimation Error

Fano's inequality relates the probability of estimation error to conditional entropy:

Theorem (Fano's inequality) For any estimator $\hat{X}: X \to Y \to \hat{X}$, with $P_e = \Pr\{\hat{X} \neq X\}$, we have
$$H(P_e) + P_e \log |\mathcal{X}| \geq H(X|\hat{X}) \geq H(X|Y).$$
This implies
$$1 + P_e \log |\mathcal{X}| \geq H(X|Y)$$
or
$$P_e \geq \frac{H(X|Y) - 1}{\log |\mathcal{X}|}.$$
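A numerical check of Fano's inequality for the MAP estimator $\hat{X}(y) = \arg\max_x p(x|y)$ on a random joint pmf (the pmf and helper names are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(7)
pxy = rng.random((4, 5)); pxy /= pxy.sum()           # joint pmf p(x, y); rows: x
py = pxy.sum(axis=0)

# MAP estimator x_hat(y) = argmax_x p(x|y) and its error probability P_e
x_hat = pxy.argmax(axis=0)
Pe = 1.0 - sum(pxy[x_hat[j], j] for j in range(pxy.shape[1]))

H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
lhs = H([Pe, 1 - Pe]) + Pe * np.log2(pxy.shape[0])   # H(P_e) + P_e log|X|
print(lhs >= H_X_given_Y)                            # True, as Fano guarantees
```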

8.2 Implications of Fano's Inequality

Some corollaries follow:

Corollary Let $p = \Pr\{X \neq Y\}$. Then
$$H(p) + p \log |\mathcal{X}| \geq H(X|Y).$$
Corollary Let $P_e = \Pr\{\hat{X} \neq X\}$, and constrain $\hat{X}: \mathcal{Y} \to \mathcal{X}$; then
$$H(P_e) + P_e \log(|\mathcal{X}| - 1) \geq H(X|Y).$$

8.3 Sharpness of Fano's inequality

Suppose there is no observation $Y$, so that $X$ must simply be guessed, and order $X \in \{1, 2, \ldots, m\}$ such that $p_1 \geq p_2 \geq
\cdots \geq p_m$. Then $\hat{X} = 1$ is the optimal estimate of $X$, with $P_e = 1 - p_1$, and Fano's inequality becomes
$$H(P_e) + P_e \log(m - 1) \geq H(X).$$
The pmf
$$(p_1, p_2, \ldots, p_m) = \left(1 - P_e, \frac{P_e}{m - 1}, \ldots, \frac{P_e}{m - 1}\right)$$
achieves this bound with equality.

8.4 Applications of Fano's inequality

Lemma If $X$ and $X'$ are i.i.d. with entropy $H(X)$, then
$$\Pr\{X = X'\} \geq 2^{-H(X)},$$
with equality iff $X$ has a uniform distribution.
Corollary Let $X, X'$ be independent with $X \sim p(x)$, $X' \sim r(x)$; $x, x' \in \mathcal{X}$. Then
$$\Pr\{X = X'\} \geq 2^{-H(p) - D(p \| r)},$$
and
$$\Pr\{X = X'\} \geq 2^{-H(r) - D(r \| p)}.$$
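Both bounds are easy to verify numerically, since $\Pr\{X = X'\} = \sum_x p(x)^2$ in the i.i.d. case and $\sum_x p(x) r(x)$ in the independent case (the random pmfs and helper names below are illustrative assumptions):

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):
    """Relative entropy D(p || q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(8)
p = rng.random(6); p /= p.sum()
r = rng.random(6); r /= r.sum()

# Collision probability of two i.i.d. copies of X ~ p is sum_x p(x)^2.
print(np.sum(p**2) >= 2.0**(-H(p)))                   # True
# For independent X ~ p and X' ~ r, the collision probability is sum_x p(x) r(x).
print(np.sum(p * r) >= 2.0**(-H(p) - D(p, r)))        # True
print(np.sum(p * r) >= 2.0**(-H(r) - D(r, p)))        # True
```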
