
Information Theory

Ying Nian Wu
UCLA Department of Statistics

July 9, 2007
IPAM Summer School
Goal: A gentle introduction to the basic concepts in information theory

Emphasis: understanding and interpretations of these concepts

Reference: Elements of Information Theory, by Cover and Thomas
Topics
• Entropy and relative entropy
• Asymptotic equipartition property
• Data compression
• Large deviation
• Kolmogorov complexity
• Entropy rate of a process
Entropy
Randomness or uncertainty of a probability distribution

x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K

Pr(X = a_k) = p_k,   p(x) = Pr(X = x)

Example
x     a     b     c     d
Pr    1/2   1/4   1/8   1/8
Entropy
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K

Pr(X = a_k) = p_k,   p(x) = Pr(X = x)

Definition
H(X) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)
Entropy
Example
x     a     b     c     d
Pr    1/2   1/4   1/8   1/8

H(X) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)

H(X) = -(1/2)\log(1/2) - (1/4)\log(1/4) - (1/8)\log(1/8) - (1/8)\log(1/8) = 7/4
Entropy
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K

Pr(X = a_k) = p_k,   p(x) = Pr(X = x)

H(X) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)

H(p) = E[-\log p(X)]
Definition for both discrete and continuous distributions
Recall E[f(X)] = \sum_x f(x) p(x)
Entropy
Example
x        a     b     c     d
Pr       1/2   1/4   1/8   1/8
-\log p  1     2     3     3

H(p) = E[-\log p(X)]

H(X) = 1 \cdot (1/2) + 2 \cdot (1/4) + 3 \cdot (1/8) + 3 \cdot (1/8) = 7/4
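A minimal Python sketch (not part of the original slides), assuming base-2 logarithms as in the example:

    # Entropy of the example distribution, computed directly from
    # H(p) = E[-log2 p(X)] = sum_x p(x) * (-log2 p(x)).
    import math

    p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
    H = sum(-px * math.log2(px) for px in p.values())
    print(H)  # 1.75 bits, i.e. 7/4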
Interpretation 1: cardinality
Uniform distribution X ~ Unif[A]
There are |A| elements in A
All these choices are equally likely

H(X) = E[-\log p(X)] = E[-\log(1/|A|)] = \log |A|

Entropy can be interpreted as the log of a volume or size
An n-dimensional cube has 2^n vertices,
so entropy can also be interpreted as dimensionality

What if the distribution is not uniform?


Asymptotic equipartition property
Any distribution is essentially a uniform distribution under long-run repetition
Recall that if X ~ Unif[A], then p(X) = 1/|A|, a constant

X_1, X_2, ..., X_n ~ p(x) independently
p(X_1, X_2, ..., X_n) = p(X_1) p(X_2) ... p(X_n)   Random?
But in some sense, it is essentially a constant!
Law of large numbers

X_1, X_2, ..., X_n ~ p(x) independently

(1/n) \sum_{i=1}^n f(X_i) \to E[f(X)] = \sum_x f(x) p(x)

The long-run average converges to the expectation


Asymptotic equipartition property

p(X_1, X_2, ..., X_n) = p(X_1) p(X_2) ... p(X_n)

-(1/n) \log p(X_1, ..., X_n) = (1/n) \sum_{i=1}^n [-\log p(X_i)] \to E[-\log p(X)] = H(p)

Intuitively, in the long run, p(X_1, ..., X_n) \approx 2^{-nH(p)}
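A minimal simulation sketch (not part of the original slides) of this convergence, using the example distribution with H(p) = 7/4 bits:

    # Draw n i.i.d. samples and check that -(1/n) log2 p(X_1,...,X_n),
    # i.e. the average of -log2 p(X_i), settles near H(p).
    import math
    import random

    outcomes = ["a", "b", "c", "d"]
    probs = [1/2, 1/4, 1/8, 1/8]
    p = dict(zip(outcomes, probs))
    H = sum(-q * math.log2(q) for q in probs)

    n = 100_000
    sample = random.choices(outcomes, weights=probs, k=n)
    avg_neg_log_p = sum(-math.log2(p[x]) for x in sample) / n
    print(H, avg_neg_log_p)  # both close to 1.75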
Asymptotic equipartition property

p(X_1, ..., X_n) \approx 2^{-nH(p)}

Recall that if X ~ Unif[A], then p(X) = 1/|A|, a constant

Therefore, it is as if (X_1, ..., X_n) ~ Unif[A], with |A| \approx 2^{nH(p)}

So the dimensionality per observation is H(p)

We can make this more rigorous


Weak law of large numbers
X_1, X_2, ..., X_n ~ p(x) independently

(1/n) \sum_{i=1}^n f(X_i) \to E[f(X)] = \sum_x f(x) p(x)

More precisely: for any \epsilon, \delta > 0, there is an n_0 such that for n \geq n_0,

Pr( | (1/n) \sum_{i=1}^n f(X_i) - E[f(X)] | \leq \epsilon ) \geq 1 - \delta
Typical set

x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K

p(X_1, ..., X_n) \approx 2^{-nH(p)}

(X_1, ..., X_n) ~ Unif[A], with |A| \approx 2^{nH(p)}

Typical set: A_{n,\epsilon} \subset \Omega^n
[Figure: the typical set A_{n,\epsilon} drawn as a small subset of the sample space \Omega^n]
Typical set

p(X_1, ..., X_n) \approx 2^{-nH(p)}

(X_1, ..., X_n) ~ Unif[A], with |A| \approx 2^{nH(p)}

A_{n,\epsilon}: the set of sequences (x_1, x_2, ..., x_n) \in \Omega^n such that

2^{-n(H(p)+\epsilon)} \leq p(x_1, x_2, ..., x_n) \leq 2^{-n(H(p)-\epsilon)}

For n sufficiently large:
p(A_{n,\epsilon}) \geq 1 - \epsilon
(1-\epsilon) 2^{n(H(p)-\epsilon)} \leq |A_{n,\epsilon}| \leq 2^{n(H(p)+\epsilon)}
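A minimal sketch (not part of the original slides) that checks these bounds by brute force for the example distribution; the choices n = 8 and eps = 0.2 are hypothetical:

    # Enumerate all |Omega|^n sequences, keep those whose probability lies in
    # [2^{-n(H+eps)}, 2^{-n(H-eps)}], and report the mass and size of A_{n,eps}.
    import math
    from itertools import product

    p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
    H = sum(-q * math.log2(q) for q in p.values())

    n, eps = 8, 0.2
    lo, hi = 2 ** (-n * (H + eps)), 2 ** (-n * (H - eps))

    mass, size = 0.0, 0
    for seq in product(p, repeat=n):          # all 4^8 = 65536 sequences
        prob = math.prod(p[x] for x in seq)
        if lo <= prob <= hi:
            mass += prob
            size += 1

    print(mass)                         # p(A_{n,eps}); approaches 1 as n grows
    print(size, 2 ** (n * (H + eps)))   # |A_{n,eps}| and its upper bound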
Interpretation 2: coin flipping
Flip a fair coin → {Head, Tail}
Flip a fair coin twice independently → {HH, HT, TH, TT}
......
Flip a fair coin n times independently → 2^n equally likely sequences

We may interpret entropy as the number of coin flips


Interpretation 2: coin flipping
Example

x      a     b     c     d
Pr     1/4   1/4   1/4   1/4
flips  HH    HT    TH    TT

The above uniform distribution amounts to 2 coin flips


Interpretation 2: coin flipping
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K

p(X_1, ..., X_n) \approx 2^{-nH(p)}
(X_1, ..., X_n) ~ Unif[A], with |A| \approx 2^{nH(p)}

(X_1, ..., X_n) amounts to nH(p) flips

X ~ p(x) amounts to H(p) flips
Interpretation 2: coin flipping
x      a     b     c     d
Pr     1/2   1/4   1/8   1/8
flips  H     TT    THH   THT

[Figure: a binary tree of coin flips; H at the root gives a, TT gives b, THH gives c, and THT gives d]
Interpretation 2: coin flipping
x        a     b     c     d
Pr       1/2   1/4   1/8   1/8
-\log p  1     2     3     3

H(X) = 1 \cdot (1/2) + 2 \cdot (1/4) + 3 \cdot (1/8) + 3 \cdot (1/8) = 7/4 flips

x      a     b     c     d
Pr     1/2   1/4   1/8   1/8
flips  H     TT    THH   THT
Interpretation 3: coding
Example

x     a     b     c     d
Pr    1/4   1/4   1/4   1/4
code  11    10    01    00

length = 2 = \log 4 = -\log(1/4)


Interpretation 3: coding
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K

p(X_1, ..., X_n) \approx 2^{-nH(p)}
(X_1, ..., X_n) ~ Unif[A], with |A| \approx 2^{nH(p)}

How many bits to code the elements in A?

nH(p) bits
This can be made more formal using the typical set
Prefix code
x     a     b     c     d
Pr    1/2   1/4   1/8   1/8
code  1     00    011   010

100101100010 → abacbd

E[l(X)] = \sum_x l(x) p(x) = 1 \cdot (1/2) + 2 \cdot (1/4) + 3 \cdot (1/8) + 3 \cdot (1/8) = 7/4 bits
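A minimal sketch (not part of the original slides) of decoding the bit string above with this prefix code; the helper function name is mine:

    # Greedy decoding works because of the prefix property:
    # no codeword is a prefix of another codeword.
    code = {"a": "1", "b": "00", "c": "011", "d": "010"}
    decode_map = {bits: symbol for symbol, bits in code.items()}

    def decode(bitstring):
        out, buf = [], ""
        for bit in bitstring:
            buf += bit
            if buf in decode_map:     # a complete codeword has been read
                out.append(decode_map[buf])
                buf = ""
        return "".join(out)

    print(decode("100101100010"))     # abacbd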
Optimal code
x     a     b     c     d
Pr    1/2   1/4   1/8   1/8
code  1     00    011   010

100101100010 → abacbd
[Figure: the same coin-flipping tree; the code for each symbol traces the flips leading to it]

The coded bits form a sequence of coin flips:
a completely random sequence, which cannot be further compressed

l(x) = -\log p(x)
E[l(X)] = H(p)

e.g., the two English words "I" and "probability": the more frequent word has the shorter code
Optimal code
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K
l     l_1   l_2   ...   l_k   ...   l_K

Kraft inequality for a prefix code:

\sum_k 2^{-l_k} \leq 1

Minimize L = E[l(X)] = \sum_k l_k p_k

Optimal length: l_k^* = -\log p_k

L^* = H(p)
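A minimal sketch (not part of the original slides) checking this for the example code with lengths (1, 2, 3, 3):

    # The lengths satisfy the Kraft inequality, equal -log2 p_k exactly,
    # and give an expected length equal to H(p) = 7/4 bits.
    import math

    p = [1/2, 1/4, 1/8, 1/8]
    lengths = [1, 2, 3, 3]

    kraft = sum(2 ** -l for l in lengths)
    expected_length = sum(l * q for l, q in zip(lengths, p))
    H = sum(-q * math.log2(q) for q in p)

    print(kraft)                 # 1.0 <= 1: a prefix code with these lengths exists
    print(expected_length, H)    # both 1.75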
Wrong model
x      a_1   a_2   ...   a_k   ...   a_K
True   p_1   p_2   ...   p_k   ...   p_K
Wrong  q_1   q_2   ...   q_k   ...   q_K

Optimal code: L^* = E[l^*(X)] = \sum_k l_k^* p_k = -\sum_k p_k \log p_k

Wrong code: L = E[l(X)] = \sum_k l_k p_k = -\sum_k p_k \log q_k

Redundancy: L - L^* = \sum_k p_k \log(p_k / q_k)

Box: All models are wrong, but some are useful
Relative entropy
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K
q     q_1   q_2   ...   q_k   ...   q_K

D(p | q) = E_p[\log (p(X)/q(X))] = \sum_x p(x) \log (p(x)/q(x))

Kullback-Leibler divergence
Relative entropy
x     a_1   a_2   ...   a_k   ...   a_K
p     p_1   p_2   ...   p_k   ...   p_K
q     q_1   q_2   ...   q_k   ...   q_K

Jensen's inequality:

D(p | q) = E_p[\log (p(X)/q(X))] = -E_p[\log (q(X)/p(X))] \geq -\log( E_p[q(X)/p(X)] ) = 0

D(p | U) = E_p[\log p(X)] + \log |\Omega| = \log |\Omega| - H(p)
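A minimal sketch (not part of the original slides), taking q to be the uniform distribution U so that D(p | q) is also the coding redundancy of the previous slide and equals \log|\Omega| - H(p):

    # KL divergence D(p | q) = sum_k p_k log2(p_k / q_k), in bits.
    import math

    p = [1/2, 1/4, 1/8, 1/8]
    q = [1/4, 1/4, 1/4, 1/4]          # the uniform "wrong" model U

    def kl(p, q):
        return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

    H = sum(-pk * math.log2(pk) for pk in p)
    print(kl(p, q))                   # 0.25 bits of redundancy
    print(math.log2(len(p)) - H)      # log2|Omega| - H(p) = 2 - 1.75 = 0.25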


Types
x     a_1   a_2   ...   a_k   ...   a_K
q     q_1   q_2   ...   q_k   ...   q_K
freq  n_1   n_2   ...   n_k   ...   n_K

X_1, X_2, ..., X_n ~ q(x) independently

n_k = number of times X_i = a_k
f_k = n_k / n, the normalized frequency

Law of large numbers:

f = (f_1, ..., f_K) \to q = (q_1, ..., q_K)
Refinement
p(x_1, ..., x_n) = \prod_i p(x_i)

Pr(f) = \binom{n}{n_1, ..., n_K} \prod_k q_k^{n_k} = \binom{n}{n f_1, ..., n f_K} \prod_k q_k^{n f_k}


Large deviation
x     a_1   a_2   ...   a_k   ...   a_K
q     q_1   q_2   ...   q_k   ...   q_K
freq  n_1   n_2   ...   n_k   ...   n_K

f_k = n_k / n

Law of large numbers:
f = (f_1, ..., f_K) \to q = (q_1, ..., q_K)

Refinement:
-(1/n) \log Pr(f) \to D(f | q)

Pr(f) ~ 2^{-n D(f | q)}
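A minimal numerical sketch (not part of the original slides), using hypothetical choices of q, f, and n; the exact multinomial probability of the type f is compared with 2^{-nD(f|q)} on the exponent-per-observation scale, where they agree up to a correction of order (log n)/n:

    # Exact log2-probability of observing type f in n draws from q,
    # versus the large-deviation exponent -n * D(f | q).
    import math

    q = [1/2, 1/4, 1/8, 1/8]     # true sampling distribution
    f = [1/4, 1/4, 1/4, 1/4]     # an atypical empirical frequency (type)
    n = 400                      # chosen so that n * f_k is an integer

    counts = [round(n * fk) for fk in f]
    log2_prob = (math.log2(math.factorial(n))
                 - sum(math.log2(math.factorial(c)) for c in counts)
                 + sum(c * math.log2(qk) for c, qk in zip(counts, q)))

    D = sum(fk * math.log2(fk / qk) for fk, qk in zip(f, q) if fk > 0)
    print(log2_prob / n, -D)     # per-observation exponents; gap shrinks like log(n)/n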
Kolmogorov complexity
Example: a string 011011011011...011 of length n

Program:
    for i in range(n // 3):
        print("011", end="")

This can be translated into binary machine code

Kolmogorov complexity = the length of the shortest machine code that reproduces the string
No probability distribution is involved
If a long sequence is not compressible, then it has all the statistical properties of a sequence of coin flips:
string = f(coin flips)
Joint and conditional entropy
Joint distribution

       b_1    b_2    ...   b_j    ...
a_1    p_11   p_12   ...   p_1j   ...   p_1.
a_2    p_21   p_22   ...   p_2j   ...   p_2.
...
a_k    p_k1   p_k2   ...   p_kj   ...   p_k.
...
       p_.1   p_.2   ...   p_.j   ...   1

Pr(X = a_k & Y = b_j) = p_{kj}
p(x, y) = Pr(X = x & Y = y)

Marginal distributions:
Pr(X = a_k) = p_{k.} = \sum_j p_{kj}
Pr(Y = b_j) = p_{.j} = \sum_k p_{kj}
p_X(x) = Pr(X = x),   p_Y(y) = Pr(Y = y)

e.g., eye color & hair color
Joint and conditional entropy
Conditional distributions:

Pr(X = a_k | Y = b_j) = p_{kj} / p_{.j}
Pr(Y = b_j | X = a_k) = p_{kj} / p_{k.}

p_{X|Y}(x | y) = Pr(X = x | Y = y)
p_{Y|X}(y | x) = Pr(Y = y | X = x)

Chain rule:
p(x, y) = p_X(x) p_{Y|X}(y | x) = p_Y(y) p_{X|Y}(x | y)
Joint and conditional entropy

H(X, Y) = -\sum_{k,j} p_{kj} \log p_{kj} = -\sum_{x,y} p(x, y) \log p(x, y) = E[-\log p(X, Y)]

H(Y | X = x) = -\sum_y p(y | x) \log p(y | x)

H(Y | X) = \sum_x p(x) H(Y | X = x) = -\sum_x \sum_y p(x) p(y | x) \log p(y | x)

H(Y | X) = -\sum_{x,y} p(x, y) \log p(y | x) = E[-\log p(Y | X)]
Chain rule
p(x, y) = p_X(x) p_{Y|X}(y | x) = p_Y(y) p_{X|Y}(x | y)

H(X, Y) = H(X) + H(Y | X)

H(X_1, ..., X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1)
Mutual information

I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

I(X; Y) = D( p(x, y) | p(x) p(y) ) = E[ \log \frac{p(X, Y)}{p(X) p(Y)} ]

I(X; Y) = H(X) - H(X | Y)

I(X; Y) = H(X) + H(Y) - H(X, Y)
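A minimal sketch (not part of the original slides), using a hypothetical 2 x 2 joint table to verify the chain rule and the mutual-information identities numerically:

    # Compute H(X), H(Y), H(X,Y), H(Y|X) and I(X;Y) from a small joint distribution.
    import math

    joint = {("a1", "b1"): 1/2, ("a1", "b2"): 1/4,
             ("a2", "b1"): 1/8, ("a2", "b2"): 1/8}

    def H(dist):
        return sum(-p * math.log2(p) for p in dist.values() if p > 0)

    pX, pY = {}, {}
    for (x, y), pxy in joint.items():
        pX[x] = pX.get(x, 0.0) + pxy
        pY[y] = pY.get(y, 0.0) + pxy

    HXY, HX, HY = H(joint), H(pX), H(pY)
    # H(Y|X) computed directly from the conditionals p(y|x) = p(x,y) / p(x).
    HY_given_X = sum(-pxy * math.log2(pxy / pX[x]) for (x, y), pxy in joint.items())

    print(HXY, HX + HY_given_X)      # chain rule: H(X,Y) = H(X) + H(Y|X)
    print(HX + HY - HXY)             # I(X;Y) = H(X) + H(Y) - H(X,Y)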
Entropy rate
Stochastic process (X_1, X_2, ..., X_n, ...), not independent

Entropy rate: H = \lim_{n \to \infty} (1/n) H(X_1, ..., X_n)   (compression)

Stationary process: H = \lim_{n \to \infty} H(X_n | X_{n-1}, ..., X_1)

Markov chain:
Pr(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = Pr(X_{n+1} = x_{n+1} | X_n = x_n)

Stationary Markov chain: H = H(X_{n+1} | X_n)
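A minimal sketch (not part of the original slides): for a stationary Markov chain, H(X_{n+1} | X_n) = \sum_i \pi_i H(P_i), where \pi is the stationary distribution and P_i is row i of the transition matrix; the two-state matrix below is a hypothetical example:

    # Entropy rate (bits per step) of a stationary two-state Markov chain.
    import math

    P = [[0.9, 0.1],    # transition probabilities out of state 0
         [0.4, 0.6]]    # transition probabilities out of state 1

    # For a two-state chain, the stationary distribution satisfies
    # pi_0 = P[1][0] / (P[0][1] + P[1][0]).
    pi0 = P[1][0] / (P[0][1] + P[1][0])
    pi = [pi0, 1 - pi0]

    def H_row(row):
        return sum(-p * math.log2(p) for p in row if p > 0)

    entropy_rate = sum(pi_i * H_row(row) for pi_i, row in zip(pi, P))
    print(pi, entropy_rate)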
Shannon, 1948

1. Zero-order approximation
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
 
2. First-order approximation
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
 
3. Second-order approximation (digram structure as in English).
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE
AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
 
4. Third-order approximation (trigram structure as in English).
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF
THE REPTAGIN IS REGOACTIONA OF CRE.
 
5. First-order word approximation.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL
HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE.
 
6. Second-order word approximation.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF
THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF
WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
 
Summary
Entropy of a distribution
  measures randomness or uncertainty
  log of the number of equally likely choices
  average number of coin flips
  average length of a prefix code
  (Kolmogorov: shortest machine code ↔ randomness)

Relative entropy from one distribution to another
  measures the departure of the first from the second
  coding redundancy
  large deviation

Conditional entropy, mutual information, entropy rate
