
Lecture 5: Asymptotic Equipartition Property

Law of large numbers for products of random variables


AEP and consequences



Stock market
Initial investment $Y_0$, daily return ratio $r_i$; after $t$ days, your money is

\[ Y_t = Y_0 \, r_1 \cdots r_t. \]

Now suppose the return ratios $r_i$ are i.i.d., with

\[ r_i = \begin{cases} 4, & \text{w.p. } 1/2 \\ 0, & \text{w.p. } 1/2. \end{cases} \]

So you think the expected return ratio is $E r_i = 2$, and then

\[ E Y_t = E(Y_0 \, r_1 \cdots r_t) = Y_0 (E r_i)^t = Y_0 \, 2^t? \]
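To see what a typical path actually does, here is a minimal simulation of the slide's setup ($r_i = 4$ or $0$, w.p. $1/2$ each, $Y_0 = 1$); the parameters come from the slide, but the code itself is only an illustration:

```python
import random

random.seed(0)
t, trials = 10, 100_000

paths = []
for _ in range(trials):
    y = 1.0                              # Y_0 = 1
    for _ in range(t):
        y *= random.choice([4.0, 0.0])   # one day's return
    paths.append(y)

print("theory  E[Y_t] = 2^t =", 2.0 ** t)
print("sample  E[Y_t]       =", sum(paths) / trials)
print("sample  P(Y_t = 0)   =", sum(p == 0.0 for p in paths) / trials)
# The mean is near 2^t, but it is carried entirely by the ~2^-t fraction
# of paths that never hit a zero return; a typical path is wiped out.
```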



Is optimized really optimal?
With $Y_0 = 1$, the actual return $Y_t$ goes like

1, 4, 16, 0, 0, 0, ...

Optimizing the expected return is not optimal?

Fundamental reason: products do not behave the same way as sums.



(Weak) Law of large numbers
Theorem. For independent, identically distributed (i.i.d.) random variables $X_i$,

\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to EX, \quad \text{in probability.} \]

Convergence in probability: $\bar{X}_n \to EX$ in probability if for every $\epsilon > 0$,

\[ P\{|\bar{X}_n - EX| > \epsilon\} \to 0. \]

Proof: by Markov's inequality.

So this means

\[ P\{|\bar{X}_n - EX| \le \epsilon\} \to 1, \quad n \to \infty. \]
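A quick numerical illustration (an assumed example with Uniform(0,1) draws, not from the lecture):

```python
import random

random.seed(0)
# Sample means of i.i.d. Uniform(0,1) draws concentrate around EX = 0.5.
for n in (10, 1_000, 100_000):
    xbar = sum(random.random() for _ in range(n)) / n
    print(f"n = {n:>6}:  sample mean = {xbar:.4f}   (EX = 0.5)")
```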



Other types of convergence
In mean square: if, as $n \to \infty$,

\[ E(X_n - X)^2 \to 0. \]

With probability 1 (almost surely): if

\[ P\left\{ \lim_{n \to \infty} X_n = X \right\} = 1. \]

In distribution: if

\[ \lim_{n \to \infty} F_n = F \]

at every continuity point of $F$, where $F_n$ and $F$ are the cumulative distribution functions of $X_n$ and $X$.



Product of random variables

How does this behave?

\[ \left( \prod_{i=1}^n X_i \right)^{1/n} \]

Geometric mean $\left( \prod_{i=1}^n X_i \right)^{1/n}$ vs. arithmetic mean $\frac{1}{n} \sum_{i=1}^n X_i$

Examples:
Volume $V$ of a random box with dimensions $X_i$: $V = X_1 \cdots X_n$
Stock return $Y_t = Y_0 \, r_1 \cdots r_t$
Joint distribution of i.i.d. RVs: $p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i)$
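A small numerical check of the random-box example (the Uniform(0.5, 2) dimensions are an assumed choice): the per-dimension scale $V^{1/n}$ is exactly the geometric mean of the $X_i$, and it sits below the arithmetic mean:

```python
import math
import random

random.seed(0)
n = 100_000
xs = [random.uniform(0.5, 2.0) for _ in range(n)]  # box dimensions X_i

geo = math.exp(sum(math.log(x) for x in xs) / n)   # V^(1/n), via log-space
ari = sum(xs) / n
print(f"geometric mean V^(1/n) ~ {geo:.4f}")       # -> e^(E log X) ~ 1.168
print(f"arithmetic mean        ~ {ari:.4f}")       # -> EX = 1.25
```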



Law of large numbers for products of random variables

We can write
\[ X_i = e^{\log X_i}. \]

Hence
\[ \left( \prod_{i=1}^n X_i \right)^{1/n} = e^{\frac{1}{n} \sum_{i=1}^n \log X_i}. \]

So from the LLN,
\[ \left( \prod_{i=1}^n X_i \right)^{1/n} \to e^{E(\log X)} \le e^{\log EX} = EX, \]
where the inequality is Jensen's.
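A sketch of this convergence for an assumed two-point distribution ($X$ equal to 1 or 4, w.p. $1/2$ each, so $EX = 2.5$ while $e^{E \log X} = 2$):

```python
import math
import random

random.seed(0)
# (X_1 ... X_n)^(1/n) -> e^(E log X) = 2, strictly below EX = 2.5.
for n in (10, 1_000, 100_000):
    log_mean = sum(math.log(random.choice([1.0, 4.0])) for _ in range(n)) / n
    print(f"n = {n:>6}:  (prod X_i)^(1/n) ~ {math.exp(log_mean):.4f}")
print("e^(E log X) =", math.exp((math.log(1.0) + math.log(4.0)) / 2))  # 2.0
print("EX          =", (1.0 + 4.0) / 2)                                # 2.5
```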



Stock example:

\[ E \log r_i = \frac{1}{2} \log 4 + \frac{1}{2} \log 0 = -\infty \]

\[ Y_t \approx Y_0 \, e^{t \, E \log r_i} \to 0, \quad t \to \infty. \]

Example:
\[ X = \begin{cases} a, & \text{w.p. } 1/2 \\ b, & \text{w.p. } 1/2. \end{cases} \]

\[ \left( \prod_{i=1}^n X_i \right)^{1/n} \to e^{E \log X} = \sqrt{ab} \le \frac{a+b}{2} = EX \]

(the last step is the AM-GM inequality, which follows from $(\sqrt{a} - \sqrt{b})^2 \ge 0$).



Asymptotic equipartition property (AEP)

The LLN states that

\[ \frac{1}{n} \sum_{i=1}^n X_i \to EX. \]

The AEP states that, for most sequences,

\[ \frac{1}{n} \log \frac{1}{p(X_1, X_2, \ldots, X_n)} \to H(X), \]

i.e.,

\[ p(X_1, X_2, \ldots, X_n) \approx 2^{-nH(X)}. \]

Analyze using the LLN for products of random variables.
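A numerical sketch (assumed Bernoulli(0.8) source): the sample entropy $-\frac{1}{n} \log_2 p(X^n)$ concentrates at $H(X) \approx 0.7219$ bits:

```python
import math
import random

random.seed(0)
p = 0.8
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy

for n in (10, 100, 10_000):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    log2_p = sum(math.log2(p if x else 1 - p) for x in xs)
    print(f"n = {n:>5}:  -(1/n) log2 p(X^n) = {-log2_p / n:.4f}   (H = {H:.4f})")
```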



AEP lies at the heart of information theory:

Proof of lossless source coding

Proof of channel capacity

and more...



AEP
Theorem. If $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$, then

\[ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X), \quad \text{in probability.} \]

Proof:

\[ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i) \to -E \log p(X) = H(X). \]

There are several consequences.



Typical set
The typical set $A_\epsilon^{(n)}$ contains all sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the property

\[ 2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}. \]
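A minimal membership test for $A_\epsilon^{(n)}$, assuming a known i.i.d. pmf over a finite alphabet (the function name and example values are illustrative):

```python
import math

def in_typical_set(seq, pmf, H, eps):
    """True iff 2^{-n(H+eps)} <= p(seq) <= 2^{-n(H-eps)}."""
    n = len(seq)
    log2_p = sum(math.log2(pmf[x]) for x in seq)  # log2 p(x_1, ..., x_n)
    return H - eps <= -log2_p / n <= H + eps

pmf = {1: 0.8, 0: 0.2}
H = -sum(q * math.log2(q) for q in pmf.values())
print(in_typical_set((1, 1, 1, 0, 1, 1, 0, 1, 1, 1), pmf, H, eps=0.1))  # True
```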



Not all sequences are created equal

Coin tossing example: $X \in \{0, 1\}$, $p(1) = 0.8$

\[ p(1, 0, 1, 1, 0, 1) = p^{\sum x_i} (1 - p)^{6 - \sum x_i} = p^4 (1 - p)^2 = 0.0164 \]

\[ p(0, 0, 0, 0, 0, 0) = p^0 (1 - p)^6 = 0.000064 \]

In this example, if $(x_1, \ldots, x_n) \in A_\epsilon^{(n)}$, then

\[ H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, \ldots, x_n) \le H(X) + \epsilon. \]

This means a binary sequence is in the typical set if its frequency of heads $k/n$ is approximately $p = 0.8$.
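A two-line check of the probabilities above:

```python
p = 0.8
print(p ** 4 * (1 - p) ** 2)  # p(1,0,1,1,0,1) = 0.016384
print((1 - p) ** 6)           # p(0,0,0,0,0,0) = 0.000064
```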






$p = 0.6$, $n = 25$, $k$ = number of 1s

 k    C(n,k)   C(n,k) p^k (1-p)^(n-k)   P(K <= k)   -(1/n) log2 p(x^n)
 0          1          0.000000          0.000000       1.321928
 1         25          0.000000          0.000000       1.298530
 2        300          0.000000          0.000000       1.275131
 3       2300          0.000001          0.000001       1.251733
 4      12650          0.000007          0.000008       1.228334
 5      53130          0.000045          0.000054       1.204936
 6     177100          0.000227          0.000281       1.181537
 7     480700          0.000925          0.001205       1.158139
 8    1081575          0.003121          0.004326       1.134740
 9    2042975          0.008843          0.013169       1.111342
10    3268760          0.021222          0.034392       1.087943
11    4457400          0.043410          0.077801       1.064545
12    5200300          0.075967          0.153768       1.041146
13    5200300          0.113950          0.267718       1.017748
14    4457400          0.146507          0.414225       0.994349
15    3268760          0.161158          0.575383       0.970951
16    2042975          0.151086          0.726469       0.947552
17    1081575          0.119980          0.846448       0.924154
18     480700          0.079986          0.926434       0.900755
19     177100          0.044203          0.970638       0.877357
20      53130          0.019891          0.990529       0.853958
21      12650          0.007104          0.997633       0.830560
22       2300          0.001937          0.999570       0.807161
23        300          0.000379          0.999950       0.783763
24         25          0.000047          0.999997       0.760364
25          1          0.000003          1.000000       0.736966
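The table can be regenerated in a few lines (a sketch; values match up to rounding in the last digit):

```python
import math

p, n = 0.6, 25
cum = 0.0
print(" k    C(n,k)      pmf       P(K<=k)   -(1/n)log2 p")
for k in range(n + 1):
    pmf = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    cum += pmf
    neg_log = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
    print(f"{k:>2} {math.comb(n, k):>9} {pmf:>10.6f} {cum:>9.6f} {neg_log:>12.6f}")
```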



Consequences of AEP
Theorem.

1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then for $n$ sufficiently large:

\[ H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon \]

2. $P\{A_\epsilon^{(n)}\} > 1 - \epsilon$.

3. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.

4. $|A_\epsilon^{(n)}| \ge (1 - \epsilon) \, 2^{n(H(X)-\epsilon)}$.



Property 1
If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then

\[ H(X) - \epsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon. \]

Proof from the definition: $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$ if

\[ 2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}; \]

take $\log_2$ and divide by $-n$.

The number of bits used to describe sequences in the typical set is approximately $nH(X)$.



Property 2
$P\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for $n$ sufficiently large.

Proof: From the AEP: because

\[ -\frac{1}{n} \log p(X_1, \ldots, X_n) \to H(X) \]

in probability, for a given $\epsilon > 0$, when $n$ is sufficiently large,

\[ P\underbrace{\left\{ \left| -\frac{1}{n} \log p(X_1, \ldots, X_n) - H(X) \right| \le \epsilon \right\}}_{A_\epsilon^{(n)}} > 1 - \epsilon. \]

High probability: sequences in the typical set are the most typical.

These sequences almost all have the same probability: "equipartition."



Property 3 and 4: size of typical set

\[ (1 - \epsilon) \, 2^{n(H(X)-\epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)} \]

Proof:

\[
1 = \sum_{(x_1, \ldots, x_n) \in \mathcal{X}^n} p(x_1, \ldots, x_n)
\ge \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n)
\ge \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)}
= |A_\epsilon^{(n)}| \, 2^{-n(H(X)+\epsilon)}.
\]



On the other hand, $P\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for $n$ sufficiently large, so

\[
1 - \epsilon < \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n)
\le |A_\epsilon^{(n)}| \, 2^{-n(H(X)-\epsilon)}.
\]

The size of the typical set depends on $H(X)$.

When $p = 1/2$ in the coin tossing example, $H(X) = 1$ and $2^{nH(X)} = 2^n$: all sequences are typical sequences.
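For small $n$ the two size bounds can be checked by brute force (assumed example: Bernoulli(0.8), $n = 12$, $\epsilon = 0.1$); note that at such small $n$ the probability of the typical set is still far from 1:

```python
import itertools
import math

p, n, eps = 0.8, 12, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

size, prob = 0, 0.0
for seq in itertools.product((0, 1), repeat=n):   # all 2^n sequences
    k = sum(seq)
    p_seq = p ** k * (1 - p) ** (n - k)
    if H - eps <= -math.log2(p_seq) / n <= H + eps:
        size += 1
        prob += p_seq

print(f"|A| = {size}   (upper bound 2^(n(H+eps)) ~ {2 ** (n * (H + eps)):.0f})")
print(f"P(A) = {prob:.3f}   (approaches 1 only as n grows)")
```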



Typical set diagram
This enables us to divide all sequences into two sets:

Typical set: high probability to occur, sample entropy close to the true entropy; so we will focus on analyzing sequences in the typical set.

Non-typical set: small probability; can ignore in general.

[Diagram: the set of all sequences $\mathcal{X}^n$, with $|\mathcal{X}|^n$ elements, containing the typical set $A_\epsilon^{(n)}$ with about $2^{n(H+\epsilon)}$ elements; the remainder is the non-typical set.]



Data compression scheme from AEP

Let $X_1, X_2, \ldots, X_n$ be i.i.d. RVs drawn from $p(x)$.

We wish to find short descriptions for such sequences of RVs.



Divide all sequences in $\mathcal{X}^n$ into two sets:

Non-typical set. Description: $n \log |\mathcal{X}| + 2$ bits

Typical set. Description: $n(H + \epsilon) + 2$ bits



Use one bit to indicate which set:

Typical set $A_\epsilon^{(n)}$: use prefix 1.
Since there are no more than $2^{n(H(X)+\epsilon)}$ sequences, indexing requires no more than $n(H(X) + \epsilon) + 1$ bits (the extra bit is needed because $n(H(X)+\epsilon)$ may not be an integer).

Non-typical set: use prefix 0.
Since there are at most $|\mathcal{X}|^n$ sequences, indexing requires no more than $n \log |\mathcal{X}| + 1$ bits.

Notation: $x^n = (x_1, \ldots, x_n)$; $l(x^n)$ = length of the codeword for $x^n$.

We can prove that

\[ E\left[\frac{1}{n} \, l(X^n)\right] \le H(X) + \epsilon. \]
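A toy version of this two-class code (an illustrative sketch, not the lecture's exact construction; building the index by enumeration is feasible only for small $n$):

```python
import itertools
import math

def build_code(p, n, eps):
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def sample_entropy(xn):
        return -sum(math.log2(p if x else 1 - p) for x in xn) / n

    # Enumerate and index the typical set.
    typical = [xn for xn in itertools.product((0, 1), repeat=n)
               if H - eps <= sample_entropy(xn) <= H + eps]
    index = {xn: i for i, xn in enumerate(typical)}
    idx_bits = max(1, math.ceil(math.log2(len(typical))))

    def encode(xn):
        if xn in index:                        # prefix 1: ~n(H+eps)+1 bits
            return "1" + format(index[xn], f"0{idx_bits}b")
        return "0" + "".join(map(str, xn))     # prefix 0: n log|X| + 1 bits

    return encode

encode = build_code(p=0.8, n=12, eps=0.1)
print(encode((1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1)))  # typical: 10 bits
print(encode((0,) * 12))                             # non-typical: 13 bits
```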



Summary of AEP

Almost everything is almost equally probable.

Reasons that $H(X)$ appears in the AEP:

$-\frac{1}{n} \log p(X^n) \to H(X)$, in probability
$nH(X)$ bits suffice to describe the random sequence, on average
$2^{H(X)}$ is the effective alphabet size
The typical set is the smallest set with probability near 1
Size of the typical set $\approx 2^{nH(X)}$
The distribution over elements of the typical set is nearly uniform



Next Time
The AEP is about a property of i.i.d. sequences.

What about dependent processes?

The answer is the entropy rate - next time.

