Ying Nian Wu
UCLA Department of Statistics
July 9, 2007
IPAM Summer School
Goal: A gentle introduction to the basic concepts
in information theory
A discrete random variable X and its distribution p:

  values   a_1  a_2  ...  a_k  ...  a_K
  p        p_1  p_2  ...  p_k  ...  p_K

  \Pr(X = a_k) = p_k,  written  p(x) = \Pr(X = x)

Example

  x     a    b    c    d
  Pr   1/2  1/4  1/8  1/8
Entropy

Definition: for X with \Pr(X = a_k) = p_k, i.e. p(x) = \Pr(X = x),

  H(X) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)

(Throughout, \log is base 2, so entropy is measured in bits.)
Entropy

Example

  x     a    b    c    d
  Pr   1/2  1/4  1/8  1/8

  H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}
Entropy

  H(X) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)

  H(p) = E[-\log p(X)]

This definition applies to both discrete and continuous X.
Recall  E[f(X)] = \sum_x f(x) p(x).
Entropy

Example

  x         a    b    c    d
  Pr       1/2  1/4  1/8  1/8
  -log p     1    2    3    3

  H(p) = E[-\log p(X)] = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = \frac{7}{4}
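As a quick check, a minimal Python sketch (mine, not from the slides) computing the entropy of this example distribution:

import math

def entropy(probs):
    # H(p) = -sum_x p(x) log2 p(x), the expected value of -log2 p(X)
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
print(entropy(p.values()))   # 1.75 = 7/4 bits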
Interpretation 1: cardinality

Uniform distribution X ~ Unif[A]: there are |A| elements in A, all equally
likely, and H(X) = \log |A|.

Now let X_1, X_2, ..., X_n ~ p(x) independently, so

  p(X_1, X_2, ..., X_n) = p(X_1) p(X_2) \cdots p(X_n)

Random?  But in some sense, it is essentially a constant!
Law of large numbers

X_1, X_2, ..., X_n ~ p(x) independently:

  \frac{1}{n} \sum_{i=1}^{n} f(X_i) \to E[f(X)] = \sum_x f(x) p(x)

  p(X_1, X_2, ..., X_n) = p(X_1) p(X_2) \cdots p(X_n)

  -\frac{1}{n} \log p(X_1, ..., X_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i) \to E[-\log p(X)] = H(p)

Intuitively, in the long run,  p(X_1, ..., X_n) \approx 2^{-nH(p)}.
Asymptotic equipartition property

  p(X_1, ..., X_n) \approx 2^{-nH(p)}

More precisely, by the weak law of large numbers: for any \epsilon, \delta > 0 there is an n_0 such that for n \ge n_0,

  \Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} f(X_i) - E[f(X)] \right| < \epsilon \right) > 1 - \delta
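A small simulation (example values assumed, not from the slides) showing that -(1/n) log2 p(X_1, ..., X_n) concentrates near H(p):

import math, random

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
H = -sum(v * math.log2(v) for v in p.values())        # 1.75 bits

symbols, weights = zip(*p.items())
n = 10000
sample = random.choices(symbols, weights=weights, k=n)

# Log probability of the observed sequence, per symbol
log_p = sum(math.log2(p[x]) for x in sample)
print(-log_p / n, H)   # close to each other, so p(X_1, ..., X_n) is about 2^(-n H(p))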
Typical set

For X_1, ..., X_n ~ p(x) independently,

  p(X_1, ..., X_n) \approx 2^{-nH(p)}

The sequences satisfying this form the typical set A_{n,\epsilon}, with

  |A_{n,\epsilon}| \approx 2^{nH(p)}
Typical set

Example: uniform distribution

  x       a    b    c    d
  Pr     1/4  1/4  1/4  1/4
  flips   HH   HT   TH   TT

Here H(p) = 2 bits, and every sequence satisfies p(X_1, ..., X_n) = 2^{-nH(p)} exactly.

In general, over the typical set,

  (X_1, ..., X_n) ~ Unif[A]  approximately, with  |A| \approx 2^{nH(p)}

so drawing X ~ p(x) amounts to H(p) fair coin flips.
Interpretation 2: coin flipping

  x       a    b    c    d
  Pr     1/2  1/4  1/8  1/8
  flips    H   TT  THH  THT

Generate X with a fair coin: if the first flip is H, output a; otherwise flip
again, and on T output b; otherwise flip a third time, outputting c on H and d on T.

  x         a    b    c    d
  Pr       1/2  1/4  1/8  1/8
  -log p     1    2    3    3

  H(X) = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = \frac{7}{4}  flips on average
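A small sketch (mine, not from the slides) that draws X by literal coin flips following the table above:

import random

# Prefix-free flip sequences from the slide: a = H, b = TT, c = THH, d = THT
FLIPS = {'H': 'a', 'TT': 'b', 'THH': 'c', 'THT': 'd'}

def draw_symbol():
    # Flip a fair coin until the flips so far match one of the entries
    flips = ''
    while flips not in FLIPS:
        flips += random.choice('HT')
    return FLIPS[flips], len(flips)

draws = [draw_symbol() for _ in range(100000)]
print(sum(1 for s, _ in draws if s == 'a') / len(draws))   # close to 1/2
print(sum(n for _, n in draws) / len(draws))               # close to 7/4 flips on average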
Interpretation 3: coding

Example: uniform distribution

  x      a    b    c    d
  Pr    1/4  1/4  1/4  1/4
  code   11   10   01   00

Each symbol takes \log |A| = 2 bits. More generally, for a typical sequence,

  p(X_1, ..., X_n) \approx 2^{-nH(p)},  (X_1, ..., X_n) ~ Unif[A_{n,\epsilon}]  with  |A_{n,\epsilon}| \approx 2^{nH(p)}

so roughly nH(p) bits are enough to index the typical sequences.
Optimal code

  x      a    b    c    d
  Pr    1/2  1/4  1/8  1/8
  code    1   00  011  010

The code mirrors the coin-flipping tree (H = 1, T = 0), so an encoded message
reads like a sequence of fair coin flips: 100101100010 decodes to abacbd.

  E[l(X)] = \sum_x l(x) p(x) = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = \frac{7}{4}  bits

A sequence of fair coin flips is completely random and cannot be further
compressed. Here  l(x) = -\log p(x)  and  E[l(X)] = H(p).
(Compare natural language: the frequent word "I" is short, the rare word "probability" is long.)
Optimal code

  values    a_1  a_2  ...  a_k  ...  a_K
  p         p_1  p_2  ...  p_k  ...  p_K
  lengths   l_1  l_2  ...  l_k  ...  l_K

Minimize  L = E[l(X)] = \sum_k l_k p_k
subject to the Kraft inequality for prefix codes,  \sum_k 2^{-l_k} \le 1.

Optimal lengths:  l_k^* = -\log p_k,  so that  L^* = H(p).
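The optimal lengths -\log p_k can be realized, when they are integers as in the example, by a Huffman code; this is my own illustration, the slides only state the optimization result. A minimal sketch:

import heapq

def huffman_lengths(probs):
    # probs: dict symbol -> probability; returns dict symbol -> code length in bits
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
lengths = huffman_lengths(p)
print(lengths)                              # lengths 1, 2, 3, 3, matching -log2 p
print(sum(p[s] * lengths[s] for s in p))    # 1.75 = H(p) bits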
Wrong model

  values   a_1  a_2  ...  a_k  ...  a_K
  True     p_1  p_2  ...  p_k  ...  p_K
  Wrong    q_1  q_2  ...  q_k  ...  q_K

Code with lengths  l_k = -\log q_k,  i.e. as if the data followed q rather than p.

Box: "All models are wrong, but some are useful."
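To make the cost of the wrong model concrete (the numbers below are my own example): the expected code length is E_p[-\log q(X)] = H(p) + D(p\|q), so the penalty for coding with q is exactly the relative entropy defined on the next slide.

import math

p = [1/2, 1/4, 1/8, 1/8]   # true distribution
q = [1/4, 1/4, 1/4, 1/4]   # wrong (uniform) model

cross_entropy = -sum(pk * math.log2(qk) for pk, qk in zip(p, q))
entropy       = -sum(pk * math.log2(pk) for pk in p)
kl            =  sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q))

print(cross_entropy)      # 2.0 bits per symbol with the wrong code
print(entropy, kl)        # 1.75 + 0.25: cross-entropy splits into H(p) + D(p||q)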
Relative entropy

  values   a_1  a_2  ...  a_k  ...  a_K
  p        p_1  p_2  ...  p_k  ...  p_K
  q        q_1  q_2  ...  q_k  ...  q_K

  D(p\|q) = E_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)}

Also known as the Kullback-Leibler divergence.
Relative entropy

By Jensen's inequality (\log is concave),

  D(p\|q) = E_p\left[\log \frac{p(X)}{q(X)}\right] = -E_p\left[\log \frac{q(X)}{p(X)}\right] \ge -\log E_p\left[\frac{q(X)}{p(X)}\right] = -\log 1 = 0

since  E_p[q(X)/p(X)] = \sum_x p(x) \frac{q(x)}{p(x)} = \sum_x q(x) = 1.
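A quick numerical check (example distributions assumed, not from the slides) that D(p\|q) \ge 0 and that it is not symmetric:

import math

def kl(p, q):
    # D(p||q) = sum_x p(x) log2(p(x)/q(x)), with the convention 0 log 0 = 0
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [1/2, 1/4, 1/8, 1/8]
q = [0.4, 0.3, 0.2, 0.1]

print(kl(p, q), kl(q, p))   # two different nonnegative numbers: D(p||q) != D(q||p) in general
print(kl(p, p))             # 0.0: D(p||p) = 0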
X_1, X_2, ..., X_n ~ q(x) independently. Let n_k be the number of times
X_i = a_k, and f_k = n_k / n the normalized frequency.

  values   a_1  a_2  ...  a_k  ...  a_K
  q        q_1  q_2  ...  q_k  ...  q_K
  freq     n_1  n_2  ...  n_k  ...  n_K

  \Pr(f) \propto \prod_k q_k^{n_k} = \prod_k q_k^{n f_k}
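Carrying the calculation one step further (my addition, connecting it to the relative entropy just defined):

  \prod_k q_k^{n f_k}
    = 2^{\, n \sum_k f_k \log q_k}
    = 2^{\, n \sum_k f_k \log f_k \;-\; n \sum_k f_k \log \frac{f_k}{q_k}}
    = 2^{-n \left( H(f) + D(f\|q) \right)}

Each sequence with empirical frequencies f has probability about 2^{-n(H(f) + D(f\|q))}; since there are roughly 2^{nH(f)} such sequences, the chance of observing frequencies f is about 2^{-n D(f\|q)}.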
Chain rule

  p(x, y) = p_X(x) p_{Y|X}(y|x) = p_Y(y) p_{X|Y}(x|y)

Joint and conditional entropy

  H(X, Y) = -\sum_{k,j} p_{kj} \log p_{kj} = -\sum_{x,y} p(x, y) \log p(x, y) = E[-\log p(X, Y)]

Joint distribution table: rows a_1, a_2, ..., a_k, ..., columns b_1, b_2, ..., b_j, ...,
entries p_{kj}, row sums p_{k\cdot}, column sums p_{\cdot j}, grand total 1.

  H(Y \mid X = x) = -\sum_y p(y|x) \log p(y|x)

  H(Y \mid X) = \sum_x p(x) H(Y \mid X = x) = -\sum_x \sum_y p(x) p(y|x) \log p(y|x)

  H(X, Y) = H(X) + H(Y \mid X)

  H(X_1, ..., X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, ..., X_1)
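A small numerical check (the joint table is my own example, not from the slides) of H(X, Y) = H(X) + H(Y | X):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) for X in {a, b}, Y in {0, 1}
joint = {('a', 0): 0.4, ('a', 1): 0.1,
         ('b', 0): 0.2, ('b', 1): 0.3}
px = {'a': 0.5, 'b': 0.5}   # marginal of X

H_joint = H(joint.values())
# H(Y|X) = sum_x p(x) H(Y | X = x), with p(y|x) = p(x, y) / p(x)
H_y_given_x = sum(px[x] * H([joint[(x, y)] / px[x] for y in (0, 1)]) for x in ('a', 'b'))

print(H_joint, H(px.values()) + H_y_given_x)   # equal: the chain rule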
Mutual information

  I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

  I(X; Y) = D\big(p(x, y) \,\|\, p(x) p(y)\big) = E\left[\log \frac{p(X, Y)}{p(X) p(Y)}\right]

  I(X; Y) = H(X) - H(X \mid Y)

  I(X; Y) = H(X) + H(Y) - H(X, Y)
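Continuing with the same assumed joint table, mutual information computed two ways:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = {('a', 0): 0.4, ('a', 1): 0.1,
         ('b', 0): 0.2, ('b', 1): 0.3}
px = {'a': 0.5, 'b': 0.5}
py = {0: 0.6, 1: 0.4}

# Definition: KL divergence between p(x, y) and p(x) p(y)
I_kl = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
# Identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_id = H(px.values()) + H(py.values()) - H(joint.values())

print(I_kl, I_id)   # the two agree (about 0.12 bits)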
Entropy rate

Stochastic process (X_1, X_2, ..., X_n, ...), not independent.

Markov chain approximations to English text (a small entropy-rate sketch follows the examples):
1. Zero-order approximation
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
2. First-order approximation
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
3. Second-order approximation (digram structure as in English).
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE
AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
4. Third-order approximation (trigram structure as in English).
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF
THE REPTAGIN IS REGOACTIONA OF CRE.
5. First-order word approximation.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL
HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE.
6. Second-order word approximation.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF
THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF
WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
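The entropy rate itself is H = \lim_{n \to \infty} \frac{1}{n} H(X_1, ..., X_n); for a stationary Markov chain it reduces to an average of the row entropies of the transition matrix. A small sketch with an assumed two-state chain (not from the slides):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed transition matrix P[i][j] = Pr(next = j | current = i)
P = [[0.9, 0.1],
     [0.5, 0.5]]
# Stationary distribution pi solves pi P = pi; here pi = (5/6, 1/6)
pi = [5/6, 1/6]

# Entropy rate of a stationary Markov chain: sum_i pi_i * H(P[i])
rate = sum(pi_i * H(row) for pi_i, row in zip(pi, P))
print(rate)   # about 0.56 bits per symbol, versus 1 bit for an i.i.d. fair coin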
Summary
Entropy of a distribution
measures randomness or uncertainty
log of the number of equally likely choices
average number of coin flips
average length of prefix code
(Kolmogorov: the length of the shortest machine code that reproduces the sequence, a measure of its randomness)