
Information Theory

Engineering Part IIA: 3F1 - Signals and Systems, Michaelmas 2005

Handout 2

Andrea Lecchini Visintini

30 November 2005


The definition of Entropy


Let X be a random variable which takes on values {x_1, x_2, ..., x_M} with probabilities {p_1, p_2, ..., p_M}. The Entropy of X, denoted H(X), is a measure of how uncertain we are of the outcome of X.

The Entropy is in the form of an average: it is the average uncertainty of the events {X = x_i} and is a function of the probabilities p_i,

H(X) = H({p_1, p_2, ..., p_M}).

The Entropy can also be seen as a measure of information, because the loss of uncertainty when the realization of X is revealed equals an increase of information from the point of view of the observer:

Entropy = measure of the average loss of uncertainty (or of the average surprise) when the realization of X is revealed.

The formula of the Entropy can be derived from first principles.

The axioms for H


(1) If the distribution is uniform, i.e. p_i = 1/M, then H({p_1, p_2, ..., p_M}) should increase with M,

e.g. winning the toss of a coin is less of a surprise than winning the National Lottery (even if the prize is the same).

(2) If the observation of the outcome of X is broken down into successive observations, then H(X) should be a weighted sum of the uncertainties of the individual observations. (This says that the Entropy is an average.) For example, for M = 5 with the outcomes grouped into {x_1, x_2} and {x_3, x_4, x_5}:

H({p_1, p_2, p_3, p_4, p_5}) = H({p_1 + p_2, p_3 + p_4 + p_5})
    + (p_1 + p_2) H({p_1/(p_1 + p_2), p_2/(p_1 + p_2)})
    + (p_3 + p_4 + p_5) H({p_3/(p_3 + p_4 + p_5), p_4/(p_3 + p_4 + p_5), p_5/(p_3 + p_4 + p_5)})

(This grouping property is checked numerically in the sketch after the axioms.)

(3) H({p_1, p_2, ..., p_M}) should be a continuous function of the p_i.
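Anticipating the formula derived below (with K = 1 and base-2 logarithms), axiom (2) can be checked numerically. The following is a minimal sketch in Python; the distribution and the helper name `entropy` are arbitrary choices for illustration, not part of the handout.

```python
import math

def entropy(p):
    """Entropy in bits of a probability distribution given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# An arbitrary distribution on M = 5 outcomes (probabilities sum to 1).
p = [0.1, 0.2, 0.3, 0.25, 0.15]

# Left-hand side of axiom (2): entropy of the full distribution.
lhs = entropy(p)

# Right-hand side: entropy of the grouped distribution plus the weighted
# entropies of the renormalized distributions within each group.
g1, g2 = p[0] + p[1], p[2] + p[3] + p[4]
rhs = (entropy([g1, g2])
       + g1 * entropy([p[0] / g1, p[1] / g1])
       + g2 * entropy([p[2] / g2, p[3] / g2, p[4] / g2]))

print(lhs, rhs)  # the two values agree (up to floating-point rounding)
```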


Theorem (Shannon 1948)
The only H({p_1, p_2, ..., p_M}) satisfying the three axioms is of the form

H({p_1, p_2, ..., p_M}) = -K \sum_{i=1}^{M} p_i \log_2(p_i)

where K is a positive constant. The choice of K is arbitrary and allows one to change the base of the logarithm, since log_a(x) = log_a(b) log_b(x). The choice of K actually amounts to a choice of a unit of measure. The convention is to choose K = 1 and base 2; the unit of measure is then 1 bit.

Theorem
H({p_1, p_2, ..., p_M}) is maximized when the p_i are equal.

Proof
Step 1: show that H({p_1, p_2, ..., p_M}) ≤ \log_2(M) (examples paper).
Step 2: notice that the value \log_2(M) is attained when the p_i are equal:

H({1/M, ..., 1/M}) = -\sum_{i=1}^{M} (1/M) \log_2(1/M) = \log_2(M).
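As a quick numerical illustration of the last Theorem (not part of the handout), the sketch below compares the entropies of a few distributions on M = 4 outcomes with \log_2(4) = 2 bits; the distributions themselves are arbitrary examples.

```python
import math

def entropy(p):
    """Entropy in bits, H = -sum p_i log2 p_i, with the convention 0 log 0 = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

M = 4
distributions = {
    "uniform":       [0.25, 0.25, 0.25, 0.25],
    "mildly skewed": [0.40, 0.30, 0.20, 0.10],
    "very skewed":   [0.97, 0.01, 0.01, 0.01],
}

for name, p in distributions.items():
    print(f"{name:14s} H = {entropy(p):.4f} bits  (log2(M) = {math.log2(M):.4f})")
# Only the uniform distribution attains the maximum log2(M) = 2 bits.
```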

Example: the toss of a biased coin, H({p, 1 - p})


[Figure: H({p, 1 - p}) plotted against p for 0 ≤ p ≤ 1; the maximum of 1 bit is attained at p = 0.5.]
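The curve in the figure can be reproduced with a few lines of Python; this is an illustrative sketch, and the use of matplotlib is an assumption rather than something prescribed by the handout.

```python
import math
import matplotlib.pyplot as plt

def binary_entropy(p):
    """H({p, 1-p}) in bits; the boundary cases p = 0 and p = 1 give 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ps = [i / 100 for i in range(101)]
plt.plot(ps, [binary_entropy(p) for p in ps])
plt.xlabel("p")
plt.ylabel("H({p, 1-p})  [bits]")
plt.show()  # the curve peaks at 1 bit for p = 0.5, the fair coin
```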


Joint and conditional Entropy


Let X and Y be random variables which take values in {x_1, ..., x_M} and {y_1, ..., y_N} and have joint probability distribution {p(x_i, y_j)}, i = 1, ..., M; j = 1, ..., N. The joint Entropy is the Entropy of the joint distribution:

H(X, Y) = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 p(x_i, y_j).

Theorem
The following inequality holds:

H(X, Y) ≤ H(X) + H(Y)

with equality if and only if X and Y are independent.

Proof
Firstly we calculate an expression for the right-hand side:

H(X) + H(Y) = -\sum_{i=1}^{M} p(x_i) \log_2 p(x_i) - \sum_{j=1}^{N} p(y_j) \log_2 p(y_j)

            = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 p(x_i) - \sum_{j=1}^{N} \sum_{i=1}^{M} p(x_i, y_j) \log_2 p(y_j)

            = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) [\log_2 p(x_i) + \log_2 p(y_j)]

            = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 ( p(x_i) p(y_j) ).

Notice that if p(x_i, y_j) = p(x_i) p(y_j) then H(X) + H(Y) = H(X, Y). At this point we have proved the "if" implication of the equality. As for the inequality, we have

H(X, Y) - [H(X) + H(Y)] = \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 ( p(x_i) p(y_j) / p(x_i, y_j) )

                        = \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2(e) \ln ( p(x_i) p(y_j) / p(x_i, y_j) )

                        ≤ \log_2(e) \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) ( p(x_i) p(y_j) / p(x_i, y_j) - 1 )      since ln(x) ≤ x - 1

                        = \log_2(e) ( \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i) p(y_j) - \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) ) = 0.

Now we have proved the inequality. The "only if" implication of the equality is proved by noticing that the equality in ln(x) ≤ x - 1 holds only if x = 1.

[Figure: plot of ln(x) and x - 1 for x > 0, illustrating ln(x) ≤ x - 1 with equality only at x = 1.]
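As a numerical aside (not from the handout), the inequality can be checked on an arbitrary joint distribution, and the equality case on a product distribution; the helper below assumes the joint probabilities are given as a nested list.

```python
import math

def H(probs):
    """Entropy in bits of a flat list of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary 2x3 joint distribution p(x_i, y_j), chosen for illustration.
joint = [[0.10, 0.20, 0.05],
         [0.25, 0.10, 0.30]]

px = [sum(row) for row in joint]         # marginal distribution of X
py = [sum(col) for col in zip(*joint)]   # marginal distribution of Y
H_joint = H([p for row in joint for p in row])

print(H_joint, H(px) + H(py))            # H(X,Y) <= H(X) + H(Y)

# For a product distribution p(x_i, y_j) = p(x_i) p(y_j) the two sides coincide.
product = [[a * b for b in py] for a in px]
print(H([p for row in product for p in row]), H(px) + H(py))
```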


The conditional Entropy is the average of the Entropies of the conditional distributions:

H(Y|X) = -\sum_{i=1}^{M} p(x_i) \sum_{j=1}^{N} p(y_j|x_i) \log_2 p(y_j|x_i).

Theorem
The following identity holds:

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

Proof

H(X, Y) = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 ( p(y_j|x_i) p(x_i) )

        = -\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 p(x_i) - \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_i, y_j) \log_2 p(y_j|x_i)

        = H(X) + H(Y|X).

The second identity follows in the same way by writing p(x_i, y_j) = p(x_i|y_j) p(y_j).

Theorem
The following inequality holds:

H(Y|X) ≤ H(Y)

with equality if and only if X and Y are independent.

Proof
The claim follows from the previous Theorems.

The Theorems say that the conditional Entropy is the uncertainty which remains after the revelation of the outcome of one random variable. Moreover, the revelation of the outcome of one random variable cannot increase the uncertainty about the other one.

Example

p(x_i, y_j) |  x_1    x_2    x_3  | p(y_j)
y_1         |  0.1    0.2    0.05 | 0.35
y_2         |  0.15   0.1    0.05 | 0.3
y_3         |  0.05   0.15   0.15 | 0.35
p(x_i)      |  0.3    0.45   0.25 |

H(Y) = -0.35 \log_2(0.35) - 0.3 \log_2(0.3) - 0.35 \log_2(0.35) = 1.5813

H(Y|x_1) = -(0.1/0.3) \log_2(0.1/0.3) - (0.15/0.3) \log_2(0.15/0.3) - (0.05/0.3) \log_2(0.05/0.3) = 1.4591

H(Y|x_2) = 1.5305

H(Y|x_3) = 1.371

H(Y|X) = 0.3 H(Y|x_1) + 0.45 H(Y|x_2) + 0.25 H(Y|x_3) = 1.4692

(in agreement with H(Y|X) ≤ H(Y) = 1.5813)
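The numbers in the example can be reproduced with a short script (an illustrative sketch, not part of the handout); it also confirms the identity H(X,Y) = H(X) + H(Y|X) and the inequality H(Y|X) ≤ H(Y).

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities (zeros skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution from the example: rows are y_1..y_3, columns are x_1..x_3.
joint = [[0.10, 0.20, 0.05],
         [0.15, 0.10, 0.05],
         [0.05, 0.15, 0.15]]

px = [sum(col) for col in zip(*joint)]   # p(x_i) = (0.3, 0.45, 0.25)
py = [sum(row) for row in joint]         # p(y_j) = (0.35, 0.3, 0.35)

# H(Y|X) = sum_i p(x_i) H(Y | X = x_i), using the renormalized columns.
H_Y_given_X = sum(pxi * H([joint[j][i] / pxi for j in range(3)])
                  for i, pxi in enumerate(px))

H_XY = H([p for row in joint for p in row])

print(round(H(py), 4), round(H_Y_given_X, 4))         # H(Y) = 1.5813 >= H(Y|X) = 1.4692
print(round(H_XY, 4), round(H(px) + H_Y_given_X, 4))  # H(X,Y) = H(X) + H(Y|X)
```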


Mutual Information
The decrease in uncertainty due to the revelation of the outcome of one of two jointly distributed random variables is called Mutual Information and is given by

I(X|Y) = H(X) - H(X|Y).

The quantity I(X|Y) is the average information conveyed about X by Y. Quite surprisingly, I(X|Y) is symmetric, i.e. I(X|Y) = I(Y|X) (proof in the examples paper).

Example
Two coins are available, one unbiased and the other two-headed. One coin is selected at random and tossed. How much information is conveyed about the identity of the coin by the outcome of the toss?

X: selection, Y: toss.

p(x_i, y_j)  | Head  Tail | p(x_i)
Unbiased     | 1/4   1/4  | 1/2
Two-headed   | 1/2   0    | 1/2
p(y_j)       | 3/4   1/4  |

I(X|Y) = H(X) - H(X|Y)

H(X) = -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) = 1

H(X|Y) = (3/4) H(X|Head) + (1/4) H(X|Tail)
       = (3/4) [ -(1/3) \log_2(1/3) - (2/3) \log_2(2/3) ] + (1/4) · 0 = 0.6887

I(Y|X) = H(Y) - H(Y|X)

H(Y) = -(3/4) \log_2(3/4) - (1/4) \log_2(1/4) = 0.8113

H(Y|X) = (1/2) H(Y|Unbiased) + (1/2) H(Y|Two-headed)
       = (1/2) [ -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) ] + (1/2) · 0 = 0.5

I(X|Y) = 1 - 0.6887 = 0.3113 and I(Y|X) = 0.8113 - 0.5 = 0.3113, confirming the symmetry.
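A short script (an illustrative sketch, not part of the handout) reproduces the coin example and shows that the two ways of computing the Mutual Information agree; it uses the identity H(X|Y) = H(X,Y) - H(Y) proved earlier.

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities (zeros skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x_i, y_j): rows = coin (Unbiased, Two-headed), columns = (Head, Tail).
joint = [[0.25, 0.25],
         [0.50, 0.00]]

px = [sum(row) for row in joint]         # p(coin) = (1/2, 1/2)
py = [sum(col) for col in zip(*joint)]   # p(toss) = (3/4, 1/4)

H_XY = H([p for row in joint for p in row])

# I(X|Y) = H(X) - H(X|Y) and I(Y|X) = H(Y) - H(Y|X), both via H(X,Y).
I_X_given_Y = H(px) - (H_XY - H(py))
I_Y_given_X = H(py) - (H_XY - H(px))

print(round(I_X_given_Y, 4), round(I_Y_given_X, 4))   # both 0.3113 bits
```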
