Information Theory
e.g. winning the toss of a coin is less of a surprise than winning the National Lottery (even if the prize is the same). (2) If the observation of the outcome of $X$ is broken down into successive observations, then $H(X)$ should be a weighted sum of the uncertainties of the individual observations. (This says that the Entropy is an average.)
$$H(\{p_1,p_2,p_3,p_4,p_5\}) = H(\{p_1+p_2,\; p_3+p_4+p_5\}) + (p_1+p_2)\,H\!\left(\left\{\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right\}\right) + (p_3+p_4+p_5)\,H\!\left(\left\{\frac{p_3}{\sum_{i=3}^{5}p_i}, \frac{p_4}{\sum_{i=3}^{5}p_i}, \frac{p_5}{\sum_{i=3}^{5}p_i}\right\}\right)$$

(3) $H(\{p_1, p_2, \dots, p_M\})$ should be a continuous function of the $p_i$.
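The grouping axiom can be checked numerically. Below is a minimal sketch, assuming the formula $H = -\sum_i p_i \log_2 p_i$ that the axioms lead to, with an arbitrarily chosen example distribution:

```python
import math

def H(ps):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing
    return -sum(p * math.log2(p) for p in ps if p > 0)

p = [0.1, 0.2, 0.3, 0.25, 0.15]   # example distribution (hypothetical values)
a = p[0] + p[1]                    # p1 + p2
b = p[2] + p[3] + p[4]             # p3 + p4 + p5

lhs = H(p)
rhs = (H([a, b])
       + a * H([p[0] / a, p[1] / a])
       + b * H([p[2] / b, p[3] / b, p[4] / b]))
print(abs(lhs - rhs) < 1e-12)  # → True
```

The two sides agree to floating-point precision, as the axiom requires.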
Theorem (Shannon 1948)
The only $H(\{p_1, p_2, \dots, p_M\})$ satisfying the three axioms is of the form

$$H(\{p_1, p_2, \dots, p_M\}) = -K \sum_{i=1}^{M} p_i \log_2(p_i)$$

where $K$ is a positive constant. The choice of $K$ is arbitrary and allows one to change the base of the logarithm, since $\log_a(x) = \log_a(b)\,\log_b(x)$. The choice of $K$ actually amounts to a choice of a unit of measure. The convention is to choose $K = 1$ and base 2; the unit of measure is then 1 bit.

Theorem
$H(\{p_1, p_2, \dots, p_M\})$ is maximized when the $p_i$ are equal.
Proof
Step 1: show that $H(\{p_1, p_2, \dots, p_M\}) \le \log_2(M)$ (examples paper).
Step 2: notice that the value $\log_2(M)$ is attained when the $p_i$ are equal:

$$H\left(\left\{\tfrac{1}{M}, \dots, \tfrac{1}{M}\right\}\right) = -\sum_{i=1}^{M} \tfrac{1}{M}\log_2\!\left(\tfrac{1}{M}\right) = \log_2(M).$$
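The two theorems above can be illustrated with a short computation. This is a sketch with $K = 1$ and base 2, comparing the uniform distribution against an arbitrary non-uniform one:

```python
import math
import random

def entropy(ps):
    # H({p1,...,pM}) = -sum_i p_i log2(p_i), in bits (K = 1, base 2)
    return -sum(p * math.log2(p) for p in ps if p > 0)

M = 8
uniform = [1.0 / M] * M
# the uniform distribution attains the maximum log2(M)
print(abs(entropy(uniform) - math.log2(M)) < 1e-12)  # → True

# any other distribution on M outcomes has entropy at most log2(M)
random.seed(0)
weights = [random.random() for _ in range(M)]
q = [w / sum(weights) for w in weights]
print(entropy(q) <= math.log2(M))  # → True
```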
Theorem
The following inequality holds:
$$H(X,Y) \le H(X) + H(Y),$$
with equality if and only if $X$ and $Y$ are independent.
Proof
Firstly we calculate an expression for the right-hand side:
$$H(X) + H(Y) = -\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\log_2\!\big(p(x_i)\,p(y_j)\big),$$
while
$$H(X,Y) = -\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\log_2\!\big(p(x_i,y_j)\big).$$
Hence
$$H(X,Y) - [H(X)+H(Y)] = \sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\log_2\frac{p(x_i)p(y_j)}{p(x_i,y_j)}$$
$$= \log_2(e)\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\ln\frac{p(x_i)p(y_j)}{p(x_i,y_j)}$$
$$\le \log_2(e)\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\left(\frac{p(x_i)p(y_j)}{p(x_i,y_j)} - 1\right) \quad\text{since } \ln(x) \le x-1$$
$$= \log_2(e)\left(\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i)p(y_j) - \sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\right) = \log_2(e)\,(1-1) = 0.$$
Now we have proved the inequality. Notice that if $p(x_i,y_j) = p(x_i)p(y_j)$ then $H(X)+H(Y) = H(X,Y)$; this proves the "if" implication of the equality. The "only if" implication is proved by noticing that equality in $\ln(x) \le x-1$ holds only if $x = 1$, i.e. only if $p(x_i,y_j) = p(x_i)p(y_j)$ for all $i,j$.
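The inequality can also be spot-checked numerically. A minimal sketch, using a small hypothetical joint distribution in which $X$ and $Y$ are dependent:

```python
import math

def H(ps):
    # Shannon entropy in bits
    return -sum(p * math.log2(p) for p in ps if p > 0)

# hypothetical joint distribution p(x_i, y_j); rows index x, columns index y
joint = [[0.3, 0.1],
         [0.1, 0.5]]
px = [sum(row) for row in joint]            # marginal of X
py = [sum(col) for col in zip(*joint)]      # marginal of Y

H_joint = H([p for row in joint for p in row])
print(H_joint <= H(px) + H(py))  # → True  (subadditivity)
```

Because this joint distribution is not a product of its marginals, the inequality here is in fact strict.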
[Figure: the curves $\ln(x)$ and $x-1$, showing $\ln(x) \le x-1$ for all $x > 0$, with equality only at $x = 1$.]
The conditional Entropy is the average of the Entropies of the conditional distributions:
$$H(Y|X) = \sum_{i=1}^{M} p(x_i)\,H(Y|x_i) = -\sum_{i=1}^{M} p(x_i)\sum_{j=1}^{N} p(y_j|x_i)\log_2 p(y_j|x_i).$$
It satisfies the chain rule:
$$H(X,Y) = -\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\log_2 p(x_i,y_j) = -\sum_{i=1}^{M}\sum_{j=1}^{N} p(x_i,y_j)\log_2\!\big(p(x_i)\,p(y_j|x_i)\big) = H(X) + H(Y|X).$$

Theorem
The following inequality holds:
$$H(Y|X) \le H(Y),$$
with equality if and only if $X$ and $Y$ are independent.
Proof
The claim follows from the previous Theorems: combining the chain rule $H(X,Y) = H(X) + H(Y|X)$ with $H(X,Y) \le H(X) + H(Y)$ gives $H(X) + H(Y|X) \le H(X) + H(Y)$.
The Theorems say that the conditional Entropy is the uncertainty which remains after the revelation of the outcome of one random variable. Moreover, the revelation of the outcome of one random variable cannot increase the uncertainty about the other one.

Example:

p(xi,yj)   x1     x2     x3     p(yj)
y1         0.1    0.2    0.05   0.35
y2         0.15   0.1    0.05   0.3
y3         0.05   0.15   0.15   0.35
p(xi)      0.3    0.45   0.25

$$H(Y) = -0.35\log_2(0.35) - 0.3\log_2(0.3) - 0.35\log_2(0.35) = 1.5813$$
$$H(Y|x_1) = -\tfrac{0.1}{0.3}\log_2\tfrac{0.1}{0.3} - \tfrac{0.15}{0.3}\log_2\tfrac{0.15}{0.3} - \tfrac{0.05}{0.3}\log_2\tfrac{0.05}{0.3} = 1.4591$$
$$H(Y|x_2) = 1.5305 \qquad H(Y|x_3) = 1.371$$
$$H(Y|X) = 0.3\,H(Y|x_1) + 0.45\,H(Y|x_2) + 0.25\,H(Y|x_3) = 1.4692$$
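The worked example can be reproduced in a few lines of code; a sketch assuming the joint table of the example:

```python
import math

def H(ps):
    # Shannon entropy in bits
    return -sum(p * math.log2(p) for p in ps if p > 0)

# joint distribution p(x_i, y_j) from the example table
joint = {
    ('x1', 'y1'): 0.10, ('x2', 'y1'): 0.20, ('x3', 'y1'): 0.05,
    ('x1', 'y2'): 0.15, ('x2', 'y2'): 0.10, ('x3', 'y2'): 0.05,
    ('x1', 'y3'): 0.05, ('x2', 'y3'): 0.15, ('x3', 'y3'): 0.15,
}
xs = ['x1', 'x2', 'x3']
ys = ['y1', 'y2', 'y3']
px = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # marginal of X
py = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # marginal of Y

# H(Y|X) = sum_i p(x_i) H(Y|x_i), with p(y_j|x_i) = p(x_i,y_j) / p(x_i)
HY_given_X = sum(px[x] * H([joint[(x, y)] / px[x] for y in ys]) for x in xs)
print(round(H(py.values()), 4), round(HY_given_X, 4))  # → 1.5813 1.4692
```

Note that $H(Y|X) \le H(Y)$ holds for this table, as the Theorem requires.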
Mutual Information
The decrease in uncertainty due to the revelation of the outcome of one of two joint random variables is called the mutual Information and is given by
$$I(X|Y) = H(X) - H(X|Y).$$
The quantity $I(X|Y)$ is the average information conveyed about $X$ by $Y$. Quite surprisingly, $I(X|Y)$ is symmetric (proof in the examples paper).

Example
Two coins are available, one unbiased and the other two-headed. One coin is selected at random and tossed. How much information is conveyed about the identity of the coin by the outcome of the toss?
$X$: selection, $Y$: toss.

p(xi,yj)     Head   Tail   p(xi)
Unbiased     1/4    1/4    1/2
Two-Headed   1/2    0      1/2
p(yj)        3/4    1/4

$$I(X|Y) = H(X) - H(X|Y)$$
$$H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$
$$H(X|Y) = \tfrac{3}{4}H(X|\text{Head}) + \tfrac{1}{4}H(X|\text{Tail}) = \tfrac{3}{4}\left(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\right) + \tfrac{1}{4}\cdot 0 = 0.6887$$
$$I(Y|X) = H(Y) - H(Y|X)$$
$$H(Y) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.8113$$
$$H(Y|X) = \tfrac{1}{2}H(Y|\text{Unbiased}) + \tfrac{1}{2}H(Y|\text{Two-Headed}) = \tfrac{1}{2}\left(-\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2}\right) + \tfrac{1}{2}\cdot 0 = 0.5$$
$$I(X|Y) = 1 - 0.6887 = 0.3113 = 0.8113 - 0.5 = I(Y|X).$$
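The coin example, including the symmetry of the mutual Information, can likewise be checked in code. A sketch using exact fractions for the joint table and the chain rule to obtain the conditional Entropies:

```python
import math
from fractions import Fraction as F

def H(ps):
    # Shannon entropy in bits
    return -sum(float(p) * math.log2(float(p)) for p in ps if p > 0)

# joint p(selection, toss): rows = Unbiased, Two-Headed; columns = Head, Tail
joint = [[F(1, 4), F(1, 4)],
         [F(1, 2), F(0)]]
px = [sum(row) for row in joint]          # marginal of the selection X
py = [sum(col) for col in zip(*joint)]    # marginal of the toss Y

H_joint = H([p for row in joint for p in row])
HX_given_Y = H_joint - H(py)              # chain rule: H(X,Y) = H(Y) + H(X|Y)
HY_given_X = H_joint - H(px)

I_XY = H(px) - HX_given_Y                 # information about the coin from the toss
I_YX = H(py) - HY_given_X
print(round(I_XY, 4), round(I_YX, 4))     # → 0.3113 0.3113
```

The two directions agree, illustrating the symmetry claimed above.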