
Mutual Information, Joint Entropy & Conditional Entropy


Contents

Entropy
Joint entropy & conditional entropy
Mutual information

Entropy (1/2)

Entropy (self-information)
H(p) = H(X) = -\sum_x p(x) \log_2 p(x)

the amount of information in a random variable
the average uncertainty of a random variable
the average length of the message needed to transmit an outcome of that variable
the size of the search space consisting of the possible values of a random variable and its associated probabilities
Properties
H(X) \ge 0  (H(X) = 0: the variable provides no new information)
entropy increases with message length

Entropy (2/2) - Example
Simplified Polynesian
letter frequencies

letter   p     t     k     a     i     u
P(i)     1/8   1/4   1/8   1/4   1/8   1/8

per-letter entropy
H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} P(i) \log_2 P(i) = 2.5 bits

coding
letter   p     t    k     a    i     u
code     100   00   101   01   110   111
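Below is a minimal Python sketch (not part of the slides) that recomputes this entropy and checks that the code shown above has the same expected length, assuming the frequencies in the table:

import math

# Letter frequencies of Simplified Polynesian (from the table above)
freq = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}

# Per-letter entropy: H(P) = -sum_x P(x) log2 P(x)
H = -sum(p * math.log2(p) for p in freq.values())
print(H)                                            # 2.5 bits

# Expected length of the code from the slide; it equals the entropy,
# so this code is optimal for this distribution.
code = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}
print(sum(freq[x] * len(code[x]) for x in freq))    # 2.5 bits per letter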

Joint Entropy & Conditional Entropy (1/4)

Joint Entropy
H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
the amount of information needed on average to specify both of their values
Conditional Entropy
H(Y|X) = \sum_x p(x) H(Y | X = x)
       = \sum_x p(x) \left[ -\sum_y p(y|x) \log p(y|x) \right]
       = -\sum_x \sum_y p(x, y) \log p(y|x)
how much extra information you still need to supply on average to communicate Y, given that the other party already knows X

Joint Entropy & Conditional Entropy (2/4)

Chain Rules for Entropy


H(X, Y) = H(X) + H(Y|X)
H(X_1, \ldots, X_n) = H(X_1) + H(X_2 | X_1) + \ldots + H(X_n | X_1, \ldots, X_{n-1})
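The two-variable rule follows directly from p(x, y) = p(x) p(y|x), a step not spelled out on the slide:

H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)
        = -\sum_{x,y} p(x, y) [\log p(x) + \log p(y|x)]
        = -\sum_x p(x) \log p(x) - \sum_{x,y} p(x, y) \log p(y|x)
        = H(X) + H(Y|X)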

Joint Entropy & Conditional Entropy (3/4) - Example
Simplified Polynesian Revisited
syllable structure
all words consist of sequences of CV syllables.
C: consonant, V: vowel

Joint distribution p(C, V), with marginals in the last row and column:

         p      t      k      p(V)
   a     1/16   3/8    1/16   1/2
   i     1/16   3/16   0      1/4
   u     0      3/16   1/16   1/4
   p(C)  1/8    3/4    1/8

H(C) = 1.061 bits
H(V|C) = \sum_{c \in \{p,t,k\}} p(C = c) H(V | C = c)
       = \frac{1}{8} H(\frac{1}{2}, \frac{1}{2}, 0) + \frac{3}{4} H(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}) + \frac{1}{8} H(\frac{1}{2}, 0, \frac{1}{2})
       = 1.375 bits
H(C, V) = H(C) + H(V|C) = 2.44 bits
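A short Python check of these figures (not part of the slides), built directly from the joint table above:

import math

def H(probs):
    # entropy in bits; zero cells are skipped
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(C, V): keys are (consonant, vowel); zero cells omitted
pcv = {('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
       ('p', 'i'): 1/16, ('t', 'i'): 3/16,
       ('t', 'u'): 3/16, ('k', 'u'): 1/16}

pc = {c: sum(p for (cc, _v), p in pcv.items() if cc == c) for c in 'ptk'}

H_C  = H(pc.values())        # ~1.061 bits
H_CV = H(pcv.values())       # ~2.44 bits
H_V_given_C = H_CV - H_C     # ~1.375 bits, by the chain rule
print(round(H_C, 3), round(H_V_given_C, 3), round(H_CV, 3))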

Joint Entropy & Conditional Entropy (4/4)

Entropy Rate (per-word / per-letter entropy)
H_{rate} = \frac{1}{n} H(X_{1n}) = -\frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})

Entropy of a Language
H_{rate}(L) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)
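A worked instance, assuming (as the earlier example does) that text is an i.i.d. sequence of CV syllables: each syllable carries H(C, V) \approx 2.44 bits and covers two letters, so the per-letter entropy rate is

H_{rate} = \frac{1}{2} H(C, V) \approx 1.22 bits per letter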

Mutual Information (1/2)

Mutual Information
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
       = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
[Diagram: I(X;Y) as the overlap of H(X) and H(Y) inside H(X,Y), alongside H(X|Y) and H(Y|X)]
the reduction in uncertainty of one random variable due to knowing about another
the amount of information one random variable contains about another
a measure of independence:
I(X;Y) = 0 iff the two variables are independent
I(X;Y) grows both with the degree of dependence and with the entropy of the variables
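Continuing the Simplified Polynesian example in Python (this calculation is not on the slides): the consonant tells us only a little about the vowel.

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(C, V) from the earlier table; zero cells omitted
pcv = {('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
       ('p', 'i'): 1/16, ('t', 'i'): 3/16,
       ('t', 'u'): 3/16, ('k', 'u'): 1/16}

pc = {c: sum(p for (cc, _), p in pcv.items() if cc == c) for c in 'ptk'}
pv = {v: sum(p for (_, vv), p in pcv.items() if vv == v) for v in 'aiu'}

# I(C;V) = H(C) + H(V) - H(C,V)
I_CV = H(pc.values()) + H(pv.values()) - H(pcv.values())
print(round(I_CV, 3))        # 0.125 bits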

Mutual Information (2/2)

Conditional Mutual Information


I(X; Y | Z) = I((X; Y) | Z) = H(X | Z) - H(X | Y, Z)

Chain Rule
I(X_{1n}; Y) = I(X_1; Y) + \ldots + I(X_n; Y | X_1, \ldots, X_{n-1})
             = \sum_{i=1}^{n} I(X_i; Y | X_1, \ldots, X_{i-1})

Pointwise Mutual Information


I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}   (between two particular points x and y)
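A one-line Python version (the co-occurrence probabilities below are hypothetical, purely for illustration):

import math

def pmi(p_xy, p_x, p_y):
    # pointwise mutual information, in bits, of one particular pair (x, y)
    return math.log2(p_xy / (p_x * p_y))

print(pmi(p_xy=0.001, p_x=0.01, p_y=0.02))   # ~2.32 bits: the pair co-occurs more often than chance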

The Noisy Channel Model (1/2)

The Noisy Channel Model


W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ
W: message from a finite alphabet
X: input to the channel
Y: output from the channel
Ŵ: attempt to reconstruct the message based on the output

Assumption
the output of the channel depends probabilistically on the input
Channel capacity
C  max I ( X ; Y )
p( X )
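As an illustration of this maximisation (the binary symmetric channel is not covered in these slides), the sketch below estimates the capacity of a channel that flips each bit with probability f by searching over input distributions p(X):

import math

def H2(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_capacity(f, steps=1000):
    # C = max over p(X=1) of I(X;Y) = H(Y) - H(Y|X), where H(Y|X) = H2(f)
    best = 0.0
    for i in range(steps + 1):
        p1 = i / steps                        # candidate input distribution p(X=1)
        py1 = p1 * (1 - f) + (1 - p1) * f     # resulting output distribution p(Y=1)
        best = max(best, H2(py1) - H2(f))
    return best

print(round(bsc_capacity(0.1), 3))   # ~0.531, i.e. 1 - H2(0.1); the maximum is at p(X=1) = 1/2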

The Noisy Channel Model (2/2)

The Noisy Channel Model in Linguistics


I → Noisy Channel p(o|i) → O → Decoder → Î

decode the output to give the most likely input


\hat{I} = \arg\max_i p(i|o) = \arg\max_i \frac{p(i)\, p(o|i)}{p(o)} = \arg\max_i p(i)\, p(o|i)

p (i ): language model, p (o | i ): channel probability
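A minimal Python sketch of this decoding rule; the candidate words and all probabilities below are hypothetical, chosen only to suggest a spelling-correction style use.

# Noisy-channel decoding: choose the input i that maximises p(i) * p(o|i).
# Every name and number here is made up for illustration.

def decode(candidates, language_model, channel_model, observed):
    return max(candidates,
               key=lambda i: language_model[i] * channel_model.get((observed, i), 0.0))

candidates = ['their', 'there']
language_model = {'their': 0.006, 'there': 0.004}    # p(i), hypothetical
channel_model = {('thier', 'their'): 0.10,           # p(o|i), hypothetical
                 ('thier', 'there'): 0.01}

print(decode(candidates, language_model, channel_model, 'thier'))   # -> 'their'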

applications
MT, POS tagging, OCR, Speech recognition, ...

Derivation of Mutual Information

Mutual Information

I(X; Y) = H(X) - H(X|Y)
        = H(X) + H(Y) - H(X, Y)
        = \sum_x p(x) \log \frac{1}{p(x)} + \sum_y p(y) \log \frac{1}{p(y)} + \sum_{x,y} p(x, y) \log p(x, y)
        = \sum_{x,y} p(x, y) \log \frac{1}{p(x)} + \sum_{x,y} p(x, y) \log \frac{1}{p(y)} + \sum_{x,y} p(x, y) \log p(x, y)
        = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

