
ENTROPY OF X

X has an alphabet of size M, with Pr(X = j) = p_j.

H(X) = −∑_j p_j log p_j = E[−log p_X(X)]

−log p_X(X) is a rv, called the log pmf.

H(X) ≥ 0; equality if X is deterministic.

H(X) ≤ log M; equality if X is equiprobable.

If X and Y are independent random symbols, then the random symbol XY takes on sample value xy with probability p_{XY}(xy) = p_X(x) p_Y(y).

H(XY) = E[−log p_{XY}(XY)] = E[−log p_X(X) p_Y(Y)]
      = E[−log p_X(X) − log p_Y(Y)] = H(X) + H(Y)
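A quick numeric sanity check of these identities, as a minimal sketch (the pmfs below are arbitrary illustrative choices, not from the notes):

    import numpy as np

    def entropy(pmf):
        """Entropy in bits of a pmf given as a sequence of probabilities."""
        p = np.asarray(pmf, dtype=float)
        p = p[p > 0]                     # use the convention 0 log 0 = 0
        return float(-np.sum(p * np.log2(p)))

    p_X = [0.5, 0.25, 0.25]
    p_Y = [0.9, 0.1]
    p_XY = np.outer(p_X, p_Y).ravel()    # joint pmf of independent X and Y

    print(entropy(p_X))                  # 1.5 bits
    print(entropy(p_Y))                  # about 0.469 bits
    print(entropy(p_XY))                 # about 1.969 bits = H(X) + H(Y)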

For a discrete memoryless source, a block of n random symbols, X_1, . . . , X_n, can be viewed as a single random symbol X^n taking on the sample value x^n = x_1 x_2 · · · x_n with probability

p_{X^n}(x^n) = ∏_{j=1}^n p_X(x_j)

The random symbol X^n has the entropy

H(X^n) = E[−log p_{X^n}(X^n)] = E[−∑_{j=1}^n log p_X(X_j)]
       = ∑_{j=1}^n E[−log p_X(X_j)] = nH(X)
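A brute-force check of H(X^n) = nH(X) for a small n, enumerating every n-tuple (a sketch; the pmf is an arbitrary assumption):

    import itertools
    import math

    p_X = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    n = 3

    H1 = -sum(p * math.log2(p) for p in p_X.values())

    # Entropy of the block X^n under the product (memoryless) distribution.
    Hn = 0.0
    for block in itertools.product(p_X, repeat=n):
        p = math.prod(p_X[x] for x in block)
        Hn -= p * math.log2(p)

    print(Hn, n * H1)   # both equal 4.5 bits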
Fixed-to-variable prefix-free codes

Segment the input into n-blocks X^n = X_1 X_2 · · · X_n.

Form a minimum expected length prefix-free code for X^n. This is called an n-to-variable-length code.

H(X^n) = nH(X)

H(X^n) ≤ E[L(X^n)]_min < H(X^n) + 1

L̄_min,n = E[L(X^n)]_min / n   bpss

H(X) ≤ L̄_min,n < H(X) + 1/n
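A sketch that builds Huffman codes on n-blocks and checks H(X) ≤ L̄_min,n < H(X) + 1/n (the pmf and block lengths are illustrative assumptions; the helper computes only codeword lengths):

    import heapq
    import itertools
    import math

    def huffman_lengths(pmf):
        """Codeword lengths of a Huffman code for pmf, a dict symbol -> probability."""
        heap = [(prob, i, {sym: 0}) for i, (sym, prob) in enumerate(pmf.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            p1, _, d1 = heapq.heappop(heap)
            p2, _, d2 = heapq.heappop(heap)
            tie += 1
            merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
            heapq.heappush(heap, (p1 + p2, tie, merged))
        return heap[0][2]

    p_X = {'a': 0.6, 'b': 0.3, 'c': 0.1}
    H = -sum(p * math.log2(p) for p in p_X.values())

    for n in (1, 2, 4):
        blocks = {blk: math.prod(p_X[x] for x in blk)
                  for blk in itertools.product(p_X, repeat=n)}
        length = huffman_lengths(blocks)
        L_bar = sum(blocks[b] * length[b] for b in blocks) / n
        print(n, H, L_bar, H + 1 / n)    # H(X) <= L_bar < H(X) + 1/n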
WEAK LAW OF LARGE NUMBERS (WLLN)

Let Y_1, Y_2, . . . be a sequence of rvs with mean Ȳ and variance σ_Y².

The sum S = Y_1 + · · · + Y_n has mean nȲ and variance nσ_Y².

The sample average of Y_1, . . . , Y_n is

A_Y^n = S/n = (Y_1 + · · · + Y_n)/n

It has mean and variance

E[A_Y^n] = Ȳ ;   VAR[A_Y^n] = σ_Y²/n

Note: lim_{n→∞} VAR[S] = ∞;  lim_{n→∞} VAR[A_Y^n] = 0.
[Figure: distribution functions F_{A_Y^n}(y) and F_{A_Y^{2n}}(y), showing Pr[|A_Y^n − Ȳ| < ε] and Pr[|A_Y^{2n} − Ȳ| < ε] concentrating on the interval (Ȳ − ε, Ȳ + ε).]

The distribution of A_Y^n clusters around Ȳ, clustering more closely as n → ∞.

Chebyshev: for ε > 0,  Pr[|A_Y^n − Ȳ| ≥ ε] ≤ σ_Y²/(nε²)

For any ε, δ > 0 and large enough n,  Pr[|A_Y^n − Ȳ| ≥ ε] ≤ δ
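A quick simulation of this clustering and of the Chebyshev bound (a sketch; the Bernoulli source, ε, and sample counts are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    p, eps = 0.3, 0.05                 # Bernoulli(p) source, deviation threshold eps
    var = p * (1 - p)

    for n in (100, 1000, 10000):
        averages = rng.binomial(n, p, size=20000) / n        # 20000 sample averages A_Y^n
        empirical = np.mean(np.abs(averages - p) >= eps)      # Pr[|A_Y^n - Ybar| >= eps]
        chebyshev = var / (n * eps ** 2)                       # Chebyshev upper bound
        print(n, empirical, min(chebyshev, 1.0))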

ASYMPTOTIC EQUIPARTITION PROPERTY (AEP)

Let X_1, X_2, . . . be the output from a DMS.

Define the log pmf as w(x) = −log p_X(x).

w(x) maps source symbols into real numbers.

For each j, W(X_j) is a rv; it takes the value w(x) for X_j = x. Note that

E[W(X_j)] = ∑_x p_X(x)[−log p_X(x)] = H(X)

W(X_1), W(X_2), . . . is a sequence of iid rvs.
For X_1 = x_1, X_2 = x_2, the outcome for W(X_1) + W(X_2) is

w(x_1) + w(x_2) = −log p_X(x_1) − log p_X(x_2)
                = −log [p_X(x_1) p_X(x_2)]
                = −log p_{X_1 X_2}(x_1 x_2) = w(x_1 x_2)

where w(x_1 x_2) is the −log pmf of the event X_1 X_2 = x_1 x_2.

W(X_1 X_2) = W(X_1) + W(X_2)

X_1 X_2 is a random symbol in its own right (it takes values x_1 x_2). W(X_1 X_2) is the −log pmf of X_1 X_2.

Probabilities multiply, log pmfs add.



For X^n = x^n, x^n = (x_1, . . . , x_n), the outcome for W(X_1) + · · · + W(X_n) is

∑_{j=1}^n w(x_j) = −∑_{j=1}^n log p_X(x_j) = −log p_{X^n}(x^n)

The sample average of the log pmfs is

A_W^n = [W(X_1) + · · · + W(X_n)] / n = −log p_{X^n}(X^n) / n

The WLLN applies and gives

Pr( |A_W^n − E[W(X)]| ≥ ε ) ≤ σ_W²/(nε²),   i.e.,

Pr( |−log p_{X^n}(X^n)/n − H(X)| ≥ ε ) ≤ σ_W²/(nε²).



Define the typical set as

T_ε^n = { x^n :  |−log p_{X^n}(x^n)/n − H(X)| < ε }

[Figure: distribution functions F_{A_W^n}(w) and F_{A_W^{2n}}(w) of the sample-average log pmf, with Pr(T_ε^n) and Pr(T_ε^{2n}) concentrating on the interval (H − ε, H + ε).]

As n → ∞, the typical set approaches probability 1:

Pr(X^n ∈ T_ε^n) ≥ 1 − σ_W²/(nε²)
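A Monte Carlo estimate of this concentration, i.e., of Pr(X^n ∈ T_ε^n) (a sketch; the pmf, block lengths, and ε are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.6, 0.3, 0.1])            # pmf of the DMS over 3 symbols
    H = -np.sum(p * np.log2(p))
    eps = 0.1

    for n in (10, 100, 1000):
        blocks = rng.choice(len(p), size=(5000, n), p=p)      # 5000 independent n-blocks
        avg_w = -np.log2(p[blocks]).mean(axis=1)              # -log p_{X^n}(X^n) / n
        print(n, H, np.mean(np.abs(avg_w - H) < eps))         # Pr(X^n in T_eps^n) -> 1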


We can also express T_ε^n as

T_ε^n = { x^n :  n(H(X) − ε) < −log p_{X^n}(x^n) < n(H(X) + ε) }

T_ε^n = { x^n :  2^{−n(H(X)+ε)} < p_{X^n}(x^n) < 2^{−n(H(X)−ε)} }.

Typical elements are approximately equiprobable in the strange sense above.

The complementary, atypical set of strings satisfies

Pr[(T_ε^n)^c] ≤ σ_W²/(nε²)

For any ε, δ > 0 and large enough n, Pr[(T_ε^n)^c] < δ.

For all x^n ∈ T_ε^n,  p_{X^n}(x^n) > 2^{−n[H(X)+ε]}.

1 ≥ ∑_{x^n ∈ T_ε^n} p_{X^n}(x^n) > |T_ε^n| 2^{−n[H(X)+ε]}

|T_ε^n| < 2^{n[H(X)+ε]}

1 − δ ≤ ∑_{x^n ∈ T_ε^n} p_{X^n}(x^n) < |T_ε^n| 2^{−n[H(X)−ε]}

|T_ε^n| > (1 − δ) 2^{n[H(X)−ε]}

Summary:  Pr[(T_ε^n)^c] → 0,   |T_ε^n| ≈ 2^{nH(X)},   p_{X^n}(x^n) ≈ 2^{−nH(X)} for x^n ∈ T_ε^n.
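A small exhaustive check of these bounds for a binary DMS (a sketch; p_1, n, and ε are arbitrary assumptions):

    import itertools
    import math

    p1, n, eps = 0.25, 16, 0.1
    p = {1: p1, 0: 1 - p1}
    H = -sum(q * math.log2(q) for q in p.values())

    size_T, prob_T = 0, 0.0
    for xn in itertools.product((0, 1), repeat=n):
        pxn = math.prod(p[x] for x in xn)
        if abs(-math.log2(pxn) / n - H) < eps:               # x^n is typical
            size_T += 1
            prob_T += pxn

    print("Pr(T_eps^n):", round(prob_T, 4))
    print("|T_eps^n|  :", size_T)
    print("lower bound:", prob_T * 2 ** (n * (H - eps)))      # Pr(T) 2^{n(H-eps)} < |T|
    print("upper bound:", 2 ** (n * (H + eps)))               # |T| < 2^{n(H+eps)}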
EXAMPLE

Consider a binary DMS with Pr[X=1] = p_1 < 1/2.

H(X) = −p_1 log p_1 − p_0 log p_0

Consider a string x^n with n_1 ones and n_0 zeros.

p_{X^n}(x^n) = p_1^{n_1} p_0^{n_0}

−log p_{X^n}(x^n) / n = −(n_1/n) log p_1 − (n_0/n) log p_0

The typical set T_ε^n is the set of strings for which

H(X) − ε  <  −(n_1/n) log p_1 − (n_0/n) log p_0  <  H(X) + ε

In the typical set, n_1 ≈ p_1 n. For this binary case, a string is typical if it has about the right relative frequencies.
The probability of a typical n-tuple is about

p_1^{p_1 n} p_0^{p_0 n} = 2^{−nH(X)}.

The number of n-tuples with p_1 n ones is

n! / [(p_1 n)! (p_0 n)!]  ≈  2^{nH(X)}

Note that there are 2^n binary strings in all. Most of them are collectively very improbable.

The most probable strings have almost all zeros, but there aren't enough of them to matter.
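A numeric check of this counting argument (a sketch; p_1 and n are arbitrary assumptions; the binomial count matches 2^{nH(X)} only up to a polynomial Stirling factor):

    import math

    p1, n = 0.2, 100
    p0 = 1 - p1
    H = -p1 * math.log2(p1) - p0 * math.log2(p0)

    n1 = round(p1 * n)                          # number of ones in a "typical" string
    count = math.comb(n, n1)                    # number of n-tuples with n1 ones
    prob_each = p1 ** n1 * p0 ** (n - n1)       # probability of each such string

    print("log2(count)    :", math.log2(count))       # close to nH(X), up to O(log n)
    print("nH(X)          :", n * H)
    print("log2(prob_each):", math.log2(prob_each))   # close to -nH(X)
    print("log2(2^n)      :", n)                      # total number of binary strings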
Fixed-to-fixed-length source codes

For any ε, δ > 0, and any large enough n, assign a fixed-length codeword to each x^n ∈ T_ε^n.

Since |T_ε^n| < 2^{n[H(X)+ε]},  L̄ ≤ H(X) + ε + 1/n.

Pr{failure} ≤ δ.

Conversely, take L̄ ≤ H(X) − 2ε and n large. The probability of failure will then be almost 1.

For any ε > 0, the probability of failure will be almost 1 if L̄ ≤ H(X) − 2ε and n is large enough:

We can provide codewords for at most 2^{n[H(X)−2ε]} source n-tuples. Typical n-tuples have probability at most 2^{−n[H(X)−ε]}.

The aggregate probability of typical n-tuples assigned codewords is at most 2^{−nε}.

The aggregate probability of typical n-tuples not assigned codewords is at least 1 − δ − 2^{−nε}.

Pr{failure} ≥ 1 − δ − 2^{−nε}
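A numeric sketch of the two regimes: give fixed-length codewords to the 2^{nR} most probable n-tuples of a Bernoulli source and compute Pr{failure} exactly (p_1, the rates, and the block lengths are illustrative assumptions):

    import math

    def failure_prob(p1, n, R):
        """Pr{failure} when the 2^{nR} most probable n-tuples (those with the fewest
        ones, since p1 < 1/2) get codewords and every other n-tuple causes failure."""
        budget = 2.0 ** (n * R)                  # number of available codewords
        fail = 0.0
        for n1 in range(n + 1):                  # weights in order of decreasing probability
            count = math.comb(n, n1)
            prob_each = p1 ** n1 * (1 - p1) ** (n - n1)
            covered = min(count, budget)
            budget -= covered
            fail += (count - covered) * prob_each
        return fail

    p1 = 0.11
    H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)     # about 0.5 bits
    for n in (50, 200, 800):
        print(n, failure_prob(p1, n, H + 0.1), failure_prob(p1, n, H - 0.1))
        # rate above H(X): failure probability -> 0;  rate below H(X): -> 1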
General model: Visualize any kind of mapping from the sequence of source symbols X_1, X_2, . . . into a binary sequence Y_1, Y_2, . . . .

Visualize a decoder that observes the encoded bits one by one. For each n, let D_n be the number of observed bits required to decode X^n (decisions are final).

The rate rv, as a function of n, is D_n/n.

In order for the rate in bpss to be less than H(X) in any meaningful sense, we require that D_n/n be smaller than H(X) with high probability as n → ∞.
Theorem: For a DMS and any coding/decoding technique, let ε, δ > 0 be arbitrary. Then for large enough n,

Pr{ D_n ≤ n[H(X) − 2ε] } < δ + 2^{−nε}.

Proof: For the given n, let m = ⌊n[H(X) − 2ε]⌋. Suppose that x^n is decoded upon observation of y_1, . . . , y_j for some j ≤ m. Since decisions are final, only one x^n can be decoded from each y^m. There are only 2^m binary m-tuples, so there are only 2^m source n-tuples (and thus at most 2^m typical n-tuples) that can be decoded by time m. The previous result applies.
Questions about the relevance of the AEP and fixed-to-fixed-length source codes:

1) Are there important real DMS sources? No, but the DMS model provides a framework for sources with memory.

2) Are fixed-to-fixed codes at very long lengths practical? No, but view the length as the product lifetime to interpret bpss.

3) Do fixed-to-fixed codes with rare failures solve queueing issues? No; queueing issues arise only with real-time sources, and discrete sources are rarely real time.
MARKOV SOURCES

A finite-state Markov chain is a sequence S_0, S_1, . . . of discrete rvs from a finite alphabet 𝒮 where q_0(s) is a pmf on S_0 and, for n ≥ 1,

Q(s|s′) = Pr(S_n = s | S_{n−1} = s′)
        = Pr(S_n = s | S_{n−1} = s′, S_{n−2} = s_{n−2}, . . . , S_0 = s_0)

for all choices of s_{n−2}, . . . , s_0. We use the states to represent the memory in a discrete source with memory.


Example: Binary source X_1, X_2, . . . ;  S_n = (X_{n−1}, X_n)

[State diagram: four states 00, 01, 10, 11, each transition labeled "letter; probability".
From 00: 0; 0.9 back to 00 and 1; 0.1 to 01.  From 01: 0; 0.5 to 10 and 1; 0.5 to 11.
From 10: 0; 0.5 to 00 and 1; 0.5 to 01.  From 11: 0; 0.1 to 10 and 1; 0.9 back to 11.]

Each transition from a state has a single and distinct source letter.

The letter specifies the new state, and the new state specifies the letter.
Transitions in the graph imply positive probability.

A state s is accessible from state s′ if the graph has a path from s′ to s.

The period of s is the gcd of the lengths of paths from s back to s.

A finite-state Markov chain is ergodic if all states are aperiodic and accessible from all other states.

A Markov source X_1, X_2, . . . is the sequence of labeled transitions on an ergodic Markov chain.

Ergodic Markov chains have steady-state probabilities given by

q(s) = ∑_{s′∈𝒮} q(s′) Q(s|s′);   s ∈ 𝒮        (1)

∑_{s∈𝒮} q(s) = 1

Steady-state probabilities are approached asymptotically from any starting state, i.e., for all s, s′ ∈ 𝒮,

lim_{n→∞} Pr(S_n = s | S_0 = s′) = q(s)        (2)
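A sketch that computes the steady-state pmf q(s) of (1) for the example chain above by power iteration; the transition matrix encodes the figure's labels and should be treated as an assumption:

    import numpy as np

    states = ['00', '01', '10', '11']
    # Q[i, j] = Pr(next state = states[j] | current state = states[i])
    Q = np.array([
        [0.9, 0.1, 0.0, 0.0],   # from 00: emit 0 -> 00, emit 1 -> 01
        [0.0, 0.0, 0.5, 0.5],   # from 01: emit 0 -> 10, emit 1 -> 11
        [0.5, 0.5, 0.0, 0.0],   # from 10: emit 0 -> 00, emit 1 -> 01
        [0.0, 0.0, 0.1, 0.9],   # from 11: emit 0 -> 10, emit 1 -> 11
    ])

    q = np.full(4, 0.25)         # any starting pmf works for an ergodic chain
    for _ in range(1000):
        q = q @ Q                # converges to the solution of (1), as in (2)
    print(dict(zip(states, q.round(4))))     # {'00': 0.4167, '01': 0.0833, ...}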

Coding for Markov sources

Simplest approach: use a separate prefix-free code for each prior state.

If S_{n−1} = s, then encode X_n with the prefix-free code for s. The codeword lengths l(x, s) are chosen for the pmf p(x|s).

∑_x 2^{−l(x,s)} ≤ 1   for each s

The code can be chosen by the Huffman algorithm and satisfies

H[X|s] ≤ L̄_min(s) < H[X|s] + 1

where

H[X|s] = −∑_x P(x|s) log P(x|s)

If the pmf on S_0 is the steady-state pmf q(s), then the chain remains in steady state.

H[X|S] ≤ L̄_min < H[X|S] + 1,        (3)

where

L̄_min = ∑_{s∈𝒮} q(s) L̄_min(s)   and   H[X|S] = ∑_{s∈𝒮} q(s) H[X|s]

The encoder transmits s_0 followed by the codeword for x_1 using the code for s_0.

This specifies s_1, and x_2 is encoded with the code for s_1, etc.

This is prefix-free and can be decoded instantaneously.
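A self-contained sketch of per-state Huffman coding that checks bound (3); the three-letter Markov source below (next state = emitted letter) is a hypothetical example, not the one from the notes:

    import heapq
    import math

    # Hypothetical source: p[s][x] = Pr(X_n = x | S_{n-1} = s), next state = emitted letter.
    p = {'a': {'a': 0.7, 'b': 0.2, 'c': 0.1},
         'b': {'a': 0.1, 'b': 0.8, 'c': 0.1},
         'c': {'a': 0.3, 'b': 0.3, 'c': 0.4}}

    def huffman_lengths(pmf):
        """Codeword lengths of a Huffman code for pmf, a dict symbol -> probability."""
        heap = [(prob, i, {x: 0}) for i, (x, prob) in enumerate(pmf.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            p1, _, d1 = heapq.heappop(heap)
            p2, _, d2 = heapq.heappop(heap)
            tie += 1
            heapq.heappush(heap, (p1 + p2, tie,
                                  {x: l + 1 for x, l in {**d1, **d2}.items()}))
        return heap[0][2]

    # Steady-state pmf q(s), found by iterating q <- q Q.
    q = {s: 1 / 3 for s in p}
    for _ in range(500):
        q = {s: sum(q[t] * p[t][s] for t in p) for s in p}

    H_XS, L_min = 0.0, 0.0
    for s in p:
        length = huffman_lengths(p[s])
        H_XS += q[s] * -sum(w * math.log2(w) for w in p[s].values())
        L_min += q[s] * sum(p[s][x] * length[x] for x in p[s])

    print(H_XS, L_min, H_XS + 1)     # H[X|S] <= L_min < H[X|S] + 1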
Conditional Entropy

H[X|S] for a Markov source is like H[X] for a DMS.

H[X|S] = ∑_{s∈𝒮} ∑_x q(s) P(x|s) log (1 / P(x|s))

Note that

H[XS] = ∑_{s,x} q(s) P(x|s) log (1 / (q(s) P(x|s))) = H[S] + H[X|S]

Recall that

H[XS] ≤ H[S] + H[X]

Thus,

H[X|S] ≤ H[X]
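A numeric check of H[X|S] ≤ H[X] for the binary example above (a sketch; the conditional pmfs mirror the figure's labels and the steady-state pmf comes from (1), both taken as assumptions):

    import math

    # P(x|s) for states 00, 01, 10, 11 and the steady-state pmf q(s).
    P = {'00': {0: 0.9, 1: 0.1}, '01': {0: 0.5, 1: 0.5},
         '10': {0: 0.5, 1: 0.5}, '11': {0: 0.1, 1: 0.9}}
    q = {'00': 5 / 12, '01': 1 / 12, '10': 1 / 12, '11': 5 / 12}

    H_XS = -sum(q[s] * P[s][x] * math.log2(P[s][x]) for s in P for x in P[s])

    # Marginal pmf of X in steady state, and its entropy H[X].
    pX = {x: sum(q[s] * P[s][x] for s in P) for x in (0, 1)}
    H_X = -sum(w * math.log2(w) for w in pX.values())

    print(round(H_XS, 4), "<=", round(H_X, 4))       # 0.5575 <= 1.0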
Suppose we use n-to-variable-length codes for each state.

H[S_1, S_2, . . . , S_n | S_0] = nH[X|S]

H[X_1, X_2, . . . , X_n | S_0] = nH[X|S]

By using n-to-variable-length codes,

H[X|S] ≤ L̄_min,n < H[X|S] + 1/n

Thus, for Markov sources, H[X|S] is asymptotically achievable.

The AEP also holds for Markov sources.

L̄ < H[X|S] cannot be achieved, either in expected length or in fixed length with low probability of failure.
MIT OpenCourseWare
http://ocw.mit.edu
6.450 Principles of Digital Communication I
Fall 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
