
Mutual Information, Joint Entropy & Conditional Entropy


Contents

Entropy
Joint entropy & conditional entropy
Mutual information

Entropy (1/2)

Entropy (self-information)
H(p) = H(X) = -\sum_x p(x) \log_2 p(x)

the amount of information in a random variable
the average uncertainty of a random variable
the average length of the message needed to transmit an outcome of that variable
the size of the search space consisting of the possible values of a random variable and its associated probabilities
Properties
H(X) \ge 0  (H(X) = 0: the variable provides no new information)
entropy increases with message length

Entropy (2/2) - Example
Simplified Polynesian
letter frequencies

letter   p     t     k     a     i     u
P(i)     1/8   1/4   1/8   1/4   1/8   1/8

per-letter entropy
H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} P(i) \log_2 P(i) = 2.5 bits

coding
letter   p     t    k     a    i     u
code     100   00   101   01   110   111
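Below is a minimal Python sketch (not part of the slides) that recomputes this entropy and checks that the code shown above has the same expected length, assuming the frequencies in the table:

import math

# Letter frequencies of Simplified Polynesian (from the table above)
freq = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}

# Per-letter entropy: H(P) = -sum_x P(x) log2 P(x)
H = -sum(p * math.log2(p) for p in freq.values())
print(H)                                            # 2.5 bits

# Expected length of the code from the slide; it equals the entropy,
# so this code is optimal for this distribution.
code = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}
print(sum(freq[x] * len(code[x]) for x in freq))    # 2.5 bits per letter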

Joint Entropy & Conditional Entropy (1/4)

Joint Entropy
H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
the amount of information needed on average to specify both of their values
Conditional Entropy
H(Y|X) = \sum_x p(x) H(Y | X = x)
       = \sum_x p(x) \left[ -\sum_y p(y|x) \log p(y|x) \right]
       = -\sum_x \sum_y p(x, y) \log p(y|x)
how much extra information you still need to supply on average to communicate Y, given that the other party already knows X

Joint Entropy & Conditional Entropy (2/4)

Chain Rules for Entropy


H(X, Y) = H(X) + H(Y|X)
H(X_1, \ldots, X_n) = H(X_1) + H(X_2 | X_1) + \ldots + H(X_n | X_1, \ldots, X_{n-1})
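The two-variable rule follows directly from p(x, y) = p(x) p(y|x), a step not spelled out on the slide:

H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)
        = -\sum_{x,y} p(x, y) [\log p(x) + \log p(y|x)]
        = -\sum_x p(x) \log p(x) - \sum_{x,y} p(x, y) \log p(y|x)
        = H(X) + H(Y|X)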

Joint Entropy & Conditional Entropy (3/4) - Example
Simplified Polynesian Revisited
syllable structure
all words consist of sequences of CV syllables.
C: consonant, V: vowel

Joint distribution p(C, V), with marginals in the last row and column:

         p      t      k      p(V)
   a     1/16   3/8    1/16   1/2
   i     1/16   3/16   0      1/4
   u     0      3/16   1/16   1/4
   p(C)  1/8    3/4    1/8

H(C) = 1.061 bits
H(V|C) = \sum_{c \in \{p,t,k\}} p(C = c) H(V | C = c)
       = \frac{1}{8} H(\frac{1}{2}, \frac{1}{2}, 0) + \frac{3}{4} H(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}) + \frac{1}{8} H(\frac{1}{2}, 0, \frac{1}{2})
       = 1.375 bits
H(C, V) = H(C) + H(V|C) = 2.44 bits
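A short Python check of these figures (not part of the slides), built directly from the joint table above:

import math

def H(probs):
    # entropy in bits; zero cells are skipped
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(C, V): keys are (consonant, vowel); zero cells omitted
pcv = {('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
       ('p', 'i'): 1/16, ('t', 'i'): 3/16,
       ('t', 'u'): 3/16, ('k', 'u'): 1/16}

pc = {c: sum(p for (cc, _v), p in pcv.items() if cc == c) for c in 'ptk'}

H_C  = H(pc.values())        # ~1.061 bits
H_CV = H(pcv.values())       # ~2.44 bits
H_V_given_C = H_CV - H_C     # ~1.375 bits, by the chain rule
print(round(H_C, 3), round(H_V_given_C, 3), round(H_CV, 3))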

Joint Entropy & Conditional Entropy (4/4)

Entropy Rate (per-word / per-letter entropy)
H_{rate} = \frac{1}{n} H(X_{1n}) = -\frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})

Entropy of a Language
H_{rate}(L) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)
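A worked instance, assuming (as the earlier example does) that text is an i.i.d. sequence of CV syllables: each syllable carries H(C, V) \approx 2.44 bits and covers two letters, so the per-letter entropy rate is

H_{rate} = \frac{1}{2} H(C, V) \approx 1.22 bits per letter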

Mutual Information (1/2)

Mutual Information
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
       = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
[Diagram: I(X;Y) as the overlap of H(X) and H(Y) inside H(X,Y), alongside H(X|Y) and H(Y|X)]
the reduction in uncertainty of one random variable due to knowing about another
the amount of information one random variable contains about another
a measure of independence:
I(X;Y) = 0 iff the two variables are independent
I(X;Y) grows both with the degree of dependence and with the entropy of the variables
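Continuing the Simplified Polynesian example in Python (this calculation is not on the slides): the consonant tells us only a little about the vowel.

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(C, V) from the earlier table; zero cells omitted
pcv = {('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
       ('p', 'i'): 1/16, ('t', 'i'): 3/16,
       ('t', 'u'): 3/16, ('k', 'u'): 1/16}

pc = {c: sum(p for (cc, _), p in pcv.items() if cc == c) for c in 'ptk'}
pv = {v: sum(p for (_, vv), p in pcv.items() if vv == v) for v in 'aiu'}

# I(C;V) = H(C) + H(V) - H(C,V)
I_CV = H(pc.values()) + H(pv.values()) - H(pcv.values())
print(round(I_CV, 3))        # 0.125 bits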

Mutual Information (2/2)

Conditional Mutual Information


I(X; Y | Z) = I((X; Y) | Z) = H(X | Z) - H(X | Y, Z)

Chain Rule
I(X_{1n}; Y) = I(X_1; Y) + \ldots + I(X_n; Y | X_1, \ldots, X_{n-1})
             = \sum_{i=1}^{n} I(X_i; Y | X_1, \ldots, X_{i-1})

Pointwise Mutual Information


I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}   (between two particular points x and y)
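A one-line Python version (the co-occurrence probabilities below are hypothetical, purely for illustration):

import math

def pmi(p_xy, p_x, p_y):
    # pointwise mutual information, in bits, of one particular pair (x, y)
    return math.log2(p_xy / (p_x * p_y))

print(pmi(p_xy=0.001, p_x=0.01, p_y=0.02))   # ~2.32 bits: the pair co-occurs more often than chance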

The Noisy Channel Model (1/2)

The Noisy Channel Model


W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ
W: message from a finite alphabet
X: input to the channel
Y: output from the channel
Ŵ: attempt to reconstruct the message based on the output

Assumption
the output of the channel depends probabilistically on the input
Channel capacity
C  max I ( X ; Y )
p( X )
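As an illustration of this maximisation (the binary symmetric channel is not covered in these slides), the sketch below estimates the capacity of a channel that flips each bit with probability f by searching over input distributions p(X):

import math

def H2(p):
    # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_capacity(f, steps=1000):
    # C = max over p(X=1) of I(X;Y) = H(Y) - H(Y|X), where H(Y|X) = H2(f)
    best = 0.0
    for i in range(steps + 1):
        p1 = i / steps                        # candidate input distribution p(X=1)
        py1 = p1 * (1 - f) + (1 - p1) * f     # resulting output distribution p(Y=1)
        best = max(best, H2(py1) - H2(f))
    return best

print(round(bsc_capacity(0.1), 3))   # ~0.531, i.e. 1 - H2(0.1); the maximum is at p(X=1) = 1/2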

The Noisy Channel Model (2/2)

The Noisy Channel Model in Linguistics


I → Noisy Channel p(o|i) → O → Decoder → Î

decode the output to give the most likely input


\hat{I} = \arg\max_i p(i|o) = \arg\max_i \frac{p(i)\, p(o|i)}{p(o)} = \arg\max_i p(i)\, p(o|i)

p (i ): language model, p (o | i ): channel probability
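A minimal Python sketch of this decoding rule; the candidate words and all probabilities below are hypothetical, chosen only to suggest a spelling-correction style use.

# Noisy-channel decoding: choose the input i that maximises p(i) * p(o|i).
# Every name and number here is made up for illustration.

def decode(candidates, language_model, channel_model, observed):
    return max(candidates,
               key=lambda i: language_model[i] * channel_model.get((observed, i), 0.0))

candidates = ['their', 'there']
language_model = {'their': 0.006, 'there': 0.004}    # p(i), hypothetical
channel_model = {('thier', 'their'): 0.10,           # p(o|i), hypothetical
                 ('thier', 'there'): 0.01}

print(decode(candidates, language_model, channel_model, 'thier'))   # -> 'their'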

applications
MT, POS tagging, OCR, Speech recognition, ...

Derivation of Mutual Information

Mutual Information

I(X; Y) = H(X) - H(X|Y)
        = H(X) + H(Y) - H(X, Y)
        = \sum_x p(x) \log \frac{1}{p(x)} + \sum_y p(y) \log \frac{1}{p(y)} + \sum_{x,y} p(x, y) \log p(x, y)
        = \sum_{x,y} p(x, y) \log \frac{1}{p(x)} + \sum_{x,y} p(x, y) \log \frac{1}{p(y)} + \sum_{x,y} p(x, y) \log p(x, y)
        = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}

