
Lecture 5 - AEP

Nguyễn Phương Thái


Introduction
- In information theory, the analog of the law of large numbers is the asymptotic equipartition
property (AEP).
- It is a direct consequence of the weak law of large numbers.
- The law of large numbers states that for independent, identically distributed (i.i.d.) random
variables, the sample average $\frac{1}{n}\sum_{i=1}^{n} X_i$ is close to its expected value $EX$ for large values of n.
- The AEP states that $\frac{1}{n}\log\frac{1}{p(X_1, X_2, \ldots, X_n)}$ is close to the entropy H, where $p(X_1, X_2, \ldots, X_n)$ is the probability of observing the sequence $X_1, X_2, \ldots, X_n$.
Thus, the probability assigned to an observed sequence will be close to $2^{-nH}$.
- This enables us to divide the set of all sequences into two sets:
o the typical set, where the sample entropy is close to the true entropy
o and the nontypical set, which contains the other sequences
- Most of our attention will be on the typical sequences
o any property that is proved for the typical sequences will then be true with high probability
and will determine the average behavior of a large sample.
- First, an example. Let the random variable X ∈ {0,1} have a probability mass function defined by
P(X=1) = p and P(X=0) = q. If $X_1, X_2, \ldots, X_n$ are i.i.d. according to p(x), the probability of a sequence
$x_1, x_2, \ldots, x_n$ is $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i)$. For example, the probability of the sequence (1, 0, 1, 1, 0, 1) is $p^4 q^2$. Clearly, it is
not true that all $2^n$ sequences of length n have the same probability.
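
A minimal sketch of this computation (the values p = 2/3, q = 1/3 are illustrative, not from the lecture):

    from math import prod

    p, q = 2/3, 1/3            # illustrative values: P(X=1) = p, P(X=0) = q
    seq = [1, 0, 1, 1, 0, 1]   # the example sequence from the text

    # p(x1, ..., xn) = product of the per-symbol probabilities
    prob = prod(p if x == 1 else q for x in seq)
    print(prob)                # equals p**4 * q**2, about 0.0219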
- However, we might be able to predict the probability of the sequence that we actually observe.
We ask for the probability $p(X_1, X_2, \ldots, X_n)$ of the outcomes $X_1, X_2, \ldots, X_n$, where the $X_i$ are i.i.d. ~ p(x). This is insidiously self-
referential, but well defined nonetheless. Apparently, we are asking for the probability of an event
drawn according to the same probability distribution. Here it turns out that $p(X_1, X_2, \ldots, X_n)$ is close to $2^{-nH}$ with
high probability.
We summarize this by saying, “Almost all events are almost equally surprising.” This is a way of
saying that

$\Pr\{(X_1, X_2, \ldots, X_n) : p(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \varepsilon)}\} \approx 1$

if $X_1, X_2, \ldots, X_n$ are i.i.d. ~ p(x).
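
A quick numerical sketch of this claim (the Bernoulli parameter and sequence length are assumptions chosen for illustration):

    import math, random

    random.seed(0)
    p, n = 0.6, 10000            # illustrative Bernoulli parameter and length
    H = -p*math.log2(p) - (1-p)*math.log2(1-p)   # entropy in bits

    # draw one i.i.d. sequence and compute its log-probability
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    log_prob = sum(math.log2(p) if x else math.log2(1-p) for x in xs)

    # the sample entropy -log p(x^n)/n is close to H, i.e. p(x^n) ~ 2^{-nH}
    print(-log_prob/n, H)        # both are about 0.971 for p = 0.6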


Convergence of random variables
Definition: Given a sequence of random variables $X_1, X_2, \ldots$, we say that the sequence converges to a random
variable X:

(1) In probability if for every $\varepsilon > 0$, $\Pr\{|X_n - X| > \varepsilon\} \to 0$

(2) In mean square if $E(X_n - X)^2 \to 0$

(3) With probability 1 (also called almost surely) if $\Pr\{\lim_{n \to \infty} X_n = X\} = 1$
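
To make definition (1) concrete, a small simulation (all parameters are illustrative) estimates $\Pr\{|\bar{X}_n - EX| > \varepsilon\}$ for the sample mean of uniform draws and shows it shrinking as n grows:

    import random

    random.seed(1)
    eps, trials = 0.05, 2000
    for n in [10, 100, 1000]:
        # estimate Pr(|sample mean - 0.5| > eps) for n Uniform(0,1) draws
        bad = sum(abs(sum(random.random() for _ in range(n))/n - 0.5) > eps
                  for _ in range(trials))
        print(n, bad/trials)     # the fraction shrinks toward 0 as n grows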


AEP

Theorem: If $X_1, X_2, \ldots$ are i.i.d. ~ p(x), then

$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability.

Proof: Functions of independent random variables are also independent random variables. Thus, since
the $X_i$ are i.i.d., so are the $\log p(X_i)$. Hence, by the weak law of large numbers,

$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i) \to -E[\log p(X)]$ in probability

$= H(X)$,

which proves the theorem.
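
A Monte Carlo check of the theorem (a sketch; the Bernoulli parameter, $\varepsilon$, and trial counts are assumptions): for each n we estimate the probability that the sample entropy deviates from H(X) by more than $\varepsilon$, and watch it go to 0.

    import math, random

    random.seed(2)
    p = 0.3                      # illustrative Bernoulli parameter
    H = -p*math.log2(p) - (1-p)*math.log2(1-p)
    eps, trials = 0.05, 1000

    for n in [100, 1000, 10000]:
        outside = 0
        for _ in range(trials):
            ones = sum(random.random() < p for _ in range(n))
            # sample entropy -log p(X^n)/n for a sequence with 'ones' ones
            sample_entropy = -(ones*math.log2(p) + (n-ones)*math.log2(1-p))/n
            outside += abs(sample_entropy - H) > eps
        print(n, outside/trials)  # tends to 0, as the theorem asserts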


Typical set

Definition: The typical set $A_\varepsilon^{(n)}$ with respect to p(x) is the set of sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with
the property

$2^{-n(H(X)+\varepsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\varepsilon)}$.

As an example, consider again the binary source with P(X=1) = p and P(X=0) = q. The probability of a string $x^n$ that contains r ‘1’s and n − r ‘0’s is

$p(x^n) = p^r q^{n-r}$.

The number of strings that contain r ‘1’s is $\binom{n}{r}$, so r has a binomial distribution:

$P(r) = \binom{n}{r} p^r q^{n-r}$.

These functions are plotted in the accompanying figure (not reproduced here). The mean of r is $np$, and its standard
deviation is $\sqrt{npq}$. If n = 100, the standard deviation is $10\sqrt{pq}$.
If n = 1000, it is $\sqrt{1000\,pq} \approx 31.6\sqrt{pq}$. Notice that as n gets bigger, the probability distribution of r becomes more
concentrated, in the sense that while the range of possible values of r grows as n, the standard
deviation of r grows only as $\sqrt{n}$. That r is most likely to fall in a small range of values implies that the
outcome x is also most likely to fall in a corresponding small subset of outcomes that we will call
the typical set.
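
The concentration claim is easy to check numerically (p = 0.1 is an illustrative value):

    import math

    p = 0.1                      # illustrative value of P(X=1)
    for n in [100, 1000]:
        mean, std = n*p, math.sqrt(n*p*(1-p))
        # n=100: r ~ 10 +- 3; n=1000: r ~ 100 +- 9.5
        print(n, mean, round(std, 1), round(std/n, 4))
    # std grows as sqrt(n), so the relative spread std/n shrinks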
Properties of typical set

(1) If $(x_1, x_2, \ldots, x_n) \in A_\varepsilon^{(n)}$, then $H(X) - \varepsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \varepsilon$

(2) $\Pr\{A_\varepsilon^{(n)}\} > 1 - \varepsilon$ for n sufficiently large

(3) $|A_\varepsilon^{(n)}| \le 2^{n(H(X)+\varepsilon)}$, where |A| denotes the number of elements in the set A

(4) $|A_\varepsilon^{(n)}| \ge (1 - \varepsilon)\, 2^{n(H(X)-\varepsilon)}$ for n sufficiently large

Thus, the typical set has probability nearly 1, all elements of the typical set are nearly equiprobable, and the
number of elements in the typical set is nearly $2^{nH}$.
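
For a source small enough to enumerate exhaustively, we can construct $A_\varepsilon^{(n)}$ directly and check properties (2) and (3) (a sketch; p, n, and $\varepsilon$ are illustrative, and n = 12 is far too small for the probability in (2) to be near 1 yet):

    import math
    from itertools import product

    p, n, eps = 0.3, 12, 0.1     # small illustrative values so X^n can be enumerated
    H = -p*math.log2(p) - (1-p)*math.log2(1-p)

    def prob(xs):
        ones = sum(xs)
        return p**ones * (1-p)**(n-ones)

    # typical set: 2^{-n(H+eps)} <= p(x^n) <= 2^{-n(H-eps)}
    typical = [xs for xs in product([0, 1], repeat=n)
               if 2**(-n*(H+eps)) <= prob(xs) <= 2**(-n*(H-eps))]

    print(sum(prob(xs) for xs in typical))   # property (2): about 0.47 here; -> 1 as n grows
    print(len(typical) <= 2**(n*(H+eps)))    # property (3): True (715 <= ~3506)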
Data compression

• Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables drawn from the probability mass function p(x).
• We wish to find short descriptions for such sequences of random variables. We divide all sequences in $\mathcal{X}^n$ into two sets: the typical set $A_\varepsilon^{(n)}$ and its complement, as shown in the figure (not reproduced here).
- We order all elements in each set according to some order (e.g., lexicographic order).
- Then we can represent each sequence of $A_\varepsilon^{(n)}$ by giving the index of the sequence in the set.
- Since there are $\le 2^{n(H+\varepsilon)}$ sequences in $A_\varepsilon^{(n)}$, the indexing requires no more than $n(H+\varepsilon) + 1$ bits. [The extra bit may
be necessary because $n(H+\varepsilon)$ may not be an integer.]
- We prefix all these sequences by a 0, giving a total length of $\le n(H+\varepsilon) + 2$ bits to represent each sequence
in $A_\varepsilon^{(n)}$ (see the figure, not reproduced here).
Similarly, we can index each sequence not in $A_\varepsilon^{(n)}$ by using not more than $n \log |\mathcal{X}| + 1$ bits. Prefixing these
indices by 1, we have a code for all the sequences in $\mathcal{X}^n$.
Note the following features of the above coding scheme:

• The code is one-to-one and easily decodable. The initial bit acts as a flag bit to indicate the length of
the codeword that follows.

• We have used a brute-force enumeration of the atypical set $(A_\varepsilon^{(n)})^c$ without taking into account the fact
that the number of elements in $(A_\varepsilon^{(n)})^c$ is less than the number of elements in $\mathcal{X}^n$. Surprisingly, this is good
enough to yield an efficient description.

• The typical sequences have short descriptions of length ≈ nH.
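
A toy version of the whole scheme for a binary source (a sketch continuing the enumeration idea above; all parameters are illustrative, and since $|\mathcal{X}| = 2$ the brute-force atypical codewords cost n + 1 bits, so the rate advantage only appears for much larger n):

    import math
    from itertools import product

    p, n, eps = 0.3, 12, 0.1
    H = -p*math.log2(p) - (1-p)*math.log2(1-p)

    def prob(xs):
        ones = sum(xs)
        return p**ones * (1-p)**(n-ones)

    typical = [xs for xs in product([0, 1], repeat=n)
               if 2**(-n*(H+eps)) <= prob(xs) <= 2**(-n*(H-eps))]
    index = {xs: i for i, xs in enumerate(typical)}   # lexicographic order
    bits = math.ceil(math.log2(len(typical)))         # <= n(H+eps) + 1 index bits

    def encode(xs):
        if xs in index:                               # flag 0 + index in the typical set
            return '0' + format(index[xs], f'0{bits}b')
        return '1' + ''.join(map(str, xs))            # flag 1 + brute-force n bits

    sample = (0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)     # a typical sequence (3 ones)
    print(encode(sample), len(encode(sample)))        # 11 bits instead of 13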


We use the notation $x^n$ to denote a sequence $x_1, x_2, \ldots, x_n$. Let $l(x^n)$ be the length of the codeword corresponding
to $x^n$. If n is sufficiently large so that $\Pr\{A_\varepsilon^{(n)}\} \ge 1 - \varepsilon$, the expected length of the codeword is

$E[l(X^n)] = \sum_{x^n} p(x^n)\, l(x^n)$
$= \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n)\, l(x^n) + \sum_{x^n \notin A_\varepsilon^{(n)}} p(x^n)\, l(x^n)$
$\le \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n)\, (n(H+\varepsilon) + 2) + \sum_{x^n \notin A_\varepsilon^{(n)}} p(x^n)\, (n \log |\mathcal{X}| + 2)$
$\le n(H+\varepsilon) + \varepsilon n \log |\mathcal{X}| + 2 = n(H + \varepsilon')$,

where $\varepsilon' = \varepsilon + \varepsilon \log |\mathcal{X}| + \frac{2}{n}$ can be made arbitrarily small by an appropriate choice of $\varepsilon$ followed by an appropriate choice of n.

Theorem: Let $X^n$ be i.i.d. ∼ p(x). Let $\varepsilon > 0$. Then there exists a code that maps sequences $x^n$ of length n
into binary strings such that the mapping is one-to-one (and therefore invertible) and

$E\left[\frac{1}{n}\, l(X^n)\right] \le H(X) + \varepsilon$

for n sufficiently large.


Proof: Immediate from the expected-length calculation above: dividing $E[l(X^n)] \le n(H + \varepsilon')$ by n, and choosing $\varepsilon$ and n so that $\varepsilon'$ plays the role of $\varepsilon$ in the theorem statement, gives the result.
