
Unit 1

Information Theory and Coding


Information & Coding Theory
[Block diagram: an Information Source emits symbols s1, …, sq; an Encoder (source/channel encoding) turns them into a signal; the Channel adds noise, introducing errors; a Decoder (channel/source decoding) recovers symbols s1, …, sq for the Destination.]
Example: Morse Code
[Diagram: letters A, …, Z are encoded by a keyer (transmitter) into dots, dashes and spaces, sent over a telegraph wire or shortwave radio, and decoded by a recognizer (receiver) back into A, …, Z.]
Example: ASCII Code

[Diagram: characters typed at a keyboard are encoded as seven-bit blocks, sent by a modem over a telephone wire, decoded by a modem back into seven-bit character blocks, and displayed on a terminal screen.]
Information Source – the symbols are undefined, and the “meaning” of the information being sent is not dealt with – only an abstract measure of the “amount” or “quantity” of information.
Examples
text of various forms – reports, papers, memos, books, scientific data
(numbers)
pictures of various forms – diagrams, art, photographic images, scientific data
(e.g. from satellites)
sound of various forms – music, speech, noises, recorded sound, radio
animation of various forms – moving pictures, film, video tape, video camera,
television
equations representing mathematical ideas or algorithms – two textual representation systems with graphical output: TeX & Mathematica

analog – continuous waveforms and shapes; digital – discrete, sampled and quantized
Analog vs. Digital
Examples of (apparently) analog sources of information:

sound (amplitude versus time): pressure ℝ⁺ → ℝ⁺; granularity – molecular; chain: microphone → tape recorder → speaker

picture (amplitude versus space): intensity [0, 1] × [0, 1] → ℝ⁺; granularity – crystalline; chain: lenses → film (negative) → print (positive)

Examples of (apparently) digital sources of information:


Text – sequences of characters; different languages have different characters (often the typeface or writing style conveys additional meaning or information)
Examples of composites (some digital and analog)

color printing (as opposed to photography) – spatially discrete (dots), spectrum discrete (color separation), intensity discrete (dot or no dot)

television – intensity unquantized, temporally discrete, color spatially discrete
A random signal contains the largest amount of information.


Unique Decodability
We must always be able to determine where one code
word ends and the next one begins. Counterexample:
Suppose: s1 = 0; s2 = 1; s3 = 11; s4 = 00

0011 = s4s3 or s1s1s3

Unique decodability means that any two distinct sequences of source symbols (of possibly differing lengths) result in distinct encoded strings.
Let S = {s1, …, sq} be the set of source symbols, and let
Sⁿ = S × ⋯ × S = {si1 ⋯ sin | sij ∈ S}
be the nth direct product of S. These are the sequences from S of length n.
4.1, 2
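Unique decodability can be tested mechanically with the Sardinas–Patterson algorithm. The sketch below (Python; my own code and names, not part of the notes) rejects the counterexample above and accepts the prefix codes used later.

def residuals(xs, ys):
    """All suffixes w such that x + w = y for some x in xs, y in ys (w may be '')."""
    return {y[len(x):] for x in xs for y in ys if y.startswith(x)}

def is_uniquely_decodable(code):
    """Sardinas-Patterson test: the code is uniquely decodable iff no set of
    dangling suffixes ever contains the empty string."""
    c = set(code)
    u = residuals(c, c) - {""}        # U_1: suffixes left when one word prefixes another
    seen = set()
    while u and not u <= seen:        # stop when no new dangling suffixes appear
        seen |= u
        u = residuals(c, u) | residuals(u, c)   # next set of dangling suffixes
        if "" in u:                   # a full parse ambiguity has been found
            return False
    return True

print(is_uniquely_decodable(["0", "1", "11", "00"]))     # False: 0011 has two parses
print(is_uniquely_decodable(["0", "10", "110", "111"]))  # True (instantaneous)
print(is_uniquely_decodable(["0", "01", "011", "111"]))  # True, but not instantaneous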
Instantaneous Codes
s1 = 0, s2 = 10, s3 = 110, s4 = 111
[Decoding tree: from the root, a 0 branch ends at s1; otherwise read a second bit, where 0 ends at s2; otherwise a third bit, where 0 ends at s3 and 1 ends at s4.]
No code word is the prefix of another. By reading a continuous sequence of code words,
one can instantaneously determine the end of each code word.

Consider the reverse: s1 = 0; s2 = 01; s3 = 011; s4 = 111

0111……111 is uniquely decodable, but the first symbol cannot be decoded without
reading all the way to the end.

4.3
Constructing Instantaneous Codes

comma code: s1 = 0 s2 = 10 s3 = 110 s4 = 1110 s5 = 1111


modification: s1 = 00 s2 = 01 s3 =10 s4 = 110 s5 = 111

[Decoding tree for the modified code: the first bit selects a subtree; s1 = 00, s2 = 01 and s3 = 10 end after two bits, while s4 = 110 and s5 = 111 end after three.]

Notice that every code word is located at a leaf.

4.4
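Decoding an instantaneous code is just a walk down the decoding tree; the small sketch below (my own Python, not from the notes) does the equivalent table lookup on the modified comma code above.

def decode_prefix(code, bits):
    """Decode a bit string with an instantaneous (prefix-free) code by walking the
    implicit decoding tree: emit a symbol as soon as a complete code word is read."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:          # reached a leaf of the decoding tree
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return out

code = {"s1": "00", "s2": "01", "s3": "10", "s4": "110", "s5": "111"}
print(decode_prefix(code, "000110110111"))   # ['s1', 's2', 's3', 's4', 's5']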
Kraft Inequality
Theorem: There exists an instantaneous code for S in which each symbol s ∈ S is encoded in radix r with length |s| if and only if
Σ_{s ∈ S} 1 / r^|s| ≤ 1.

Proof: (⇒) By induction on the height (maximal length path) of the decoding tree, n = max{|s| : s ∈ S}. For simplicity, pick r = 2 (the binary case).

Basis: n = 1. The tree has one or two leaves at depth 1, so the Kraft sum is ½ or ½ + ½ = 1, which is ≤ 1. (One could also use n = 0 as the basis.)

Induction: n > 1. The root splits the tree into subtrees T0 and T1, each of height < n. By the induction hypothesis, the leaves of T0 and T1 each satisfy the Kraft inequality:
Σ_{s ∈ T0} 1 / 2^(|s|−1) ≤ 1 and Σ_{s ∈ T1} 1 / 2^(|s|−1) ≤ 1.
Prefixing one symbol at the top of the tree increases all the lengths by one, so
Σ_{s ∈ T0} 1 / 2^|s| + Σ_{s ∈ T1} 1 / 2^|s| ≤ ½ + ½ = 1.
4.5
Same argument for radix r:

Basis: n = 1. There are at most r leaves at depth 1, so the Kraft sum is at most r · (1/r) = 1.

Induction: n > 1. The root has at most r subtrees T0, …, T(r−1), each of height < n. By the induction hypothesis, Σ_{s ∈ Ti} 1 / r^(|s|−1) ≤ 1, hence Σ_{s ∈ Ti} 1 / r^|s| ≤ 1/r; adding at most r of these together gives ≤ 1.

Strict inequality (< 1) in the binary case implies that not all internal nodes have degree 2; but if a node has degree 1, that edge can clearly be removed by contraction, shortening the code.
4.5
Kraft Inequality ()
Construct a code via decoding trees. Number the symbols s1, …, sq
so that l1 ≤ … ≤ lq and assume K ≤ 1.
Greedy method: proceed left-to-right, systematically assigning
leaves to code words, so that you never pass through or land on a
previous one. The only way this method could fail is if it runs out of
nodes (tree is over-full), but that would mean K > 1.

Examples (r = 2):
lengths 1, 3, 3, 3: ½ + ⅛ + ⅛ + ⅛ < 1 (some leaves not used)
lengths 1, 2, 3, 3: ½ + ¼ + ⅛ + ⅛ = 1 (complete tree)
lengths 1, 2, 2, 3: ½ + ¼ + ¼ + ⅛ > 1 (tree would be over-full; no such code)
4.5
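The greedy construction in the converse can be written out directly: sort the lengths and assign, at each length, the next free node from the left. A minimal sketch for the binary case (my own code, not part of the notes):

from fractions import Fraction

def kraft_sum(lengths, r=2):
    """Kraft sum K = sum of r^(-l) as an exact fraction."""
    return sum(Fraction(1, r ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Greedy (canonical) construction of a binary prefix code with the given
    lengths; possible exactly when the Kraft sum is <= 1."""
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft inequality violated: tree would be over-full")
    words, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)        # move down to depth l: next free node
        words.append(format(value, "0{}b".format(l)))
        value += 1                      # skip past this leaf and its subtree
        prev_len = l
    return words

print(kraft_sum([1, 2, 3, 3]))                 # 1: complete tree
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(kraft_sum([1, 2, 2, 3]))                 # 9/8 > 1: no such code exists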
Shortened Block Codes
With exactly 2^m symbols, we can form a set of code words each of length m: b1 … bm, bi ∈ {0, 1}. This is a complete binary decoding tree of depth m. With fewer than 2^m symbols, we can chop off branches to get modified (shortened) block codes.

[Figure: two example shortened block codes (Ex 1 and Ex 2) for five symbols s1, …, s5, each obtained by pruning branches of the complete depth-3 binary decoding tree.]
4.6
McMillan Inequality
Idea: Uniquely decodable codes satisfy the same bounds as
instantaneous codes.

Theorem: Suppose we have a uniquely decodable code in radix r with lengths l1 ≤ … ≤ lq. Then their Kraft sum is ≤ 1.
Proof: Let K = Σ_{i=1..q} 1 / r^li, and consider
K^n = Σ_{k=n..n·lq} Nk / r^k.
A multinomial expansion shows that Nk = the number of ways n of the l's can add up to k, which is the same as the number of different ways n symbols can form a coded message of length k. Because of unique decodability, this must be ≤ r^k, the number of distinct strings of length k. Hence
K^n ≤ Σ_{k=n..n·lq} r^k / r^k = n·lq − n + 1 ≤ n·lq.
But if K > 1, then K^n grows exponentially in n and eventually exceeds n·lq, a contradiction. Hence K ≤ 1.


Conclusion: WLOG we can use only instantaneous codes. 4.7
Average code length

Lavg = Σ_{i=1..q} pi·li, where pi = probability of symbol si and li = |si| = the length of the code word for si.

Our goal is to minimize the average coded length.


If pn > pm then ln ≤ lm. For if pm < pn with lm < ln, then interchanging the encodings for sm and sn changes the average length by
(pm·ln + pn·lm) − (pm·lm + pn·ln) = (pm − pn)(ln − lm) < 0,
i.e. old > new, so the original code was not of minimal average length.

So we can assume that if p1 ≥ … ≥ pq then l1 ≤ … ≤ lq,


because if pi = pi+1 with li > li+1, we can just switch si and si+1.

4.8
Start with S = {s1, …, sq} as the source alphabet, and consider B = {0, 1} as our code alphabet (binary). First, observe that we may take lq−1 = lq: since the code is instantaneous, no si with i < q can be a prefix of sq, so dropping the last symbol from sq (if lq > lq−1) won't hurt.
Huffman algorithm: combine sq−1 and sq into a "combo-symbol" (sq−1 + sq) with probability (pq−1 + pq) and get a code for the reduced alphabet.

For q = 1, assign s1 = ε. For q > 1, encode sq−1 as the code of (sq−1 + sq) followed by 0, and sq as the code of (sq−1 + sq) followed by 1.


Example (probabilities at each reduction stage, with the codes assigned by working back up):
0.4 0.2 0.2 0.1 0.1 → 1, 01, 000, 0010, 0011
0.4 0.2 0.2 0.2 → 1, 01, 000, 001
0.4 0.4 0.2 → 1, 00, 01
0.6 0.4 → 0, 1
1.0 → ε
N. B. the case for q = 1 does not produce a valid code. 4.8
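For reference, a compact heap-based sketch of the binary Huffman procedure (my own Python, not the repeated-reduction layout of the slides); ties may be merged in a different order than in the example, but the average length comes out the same.

import heapq

def huffman(probs):
    """Binary Huffman code: repeatedly merge the two least probable symbols.
    Returns {symbol: code word}."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code word}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)          # two least probable groups
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

probs = {"s1": 0.4, "s2": 0.2, "s3": 0.2, "s4": 0.1, "s5": 0.1}
code = huffman(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))  # ≈ 2.2, same as 1, 01, 000, 0010, 0011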
Huffman is always of shortest average length
Claim: the Huffman code minimizes Lavg. Assume p1 ≥ … ≥ pq; we know the Huffman lengths satisfy l1 ≤ … ≤ lq, and we are trying to show that its average length is ≤ L, the average length of any alternative instantaneous code.

Example: p1 = 0.7, p2 = p3 = p4 = 0.1 gives Huffman Lavg = 1.5; compare to log2 q = 2 for a plain block code.

Base case: for q = 2 the code {0, 1} is used, and no shorter code exists.

Induction step: for q > 2, take any instantaneous code for s1, …, sq with minimal average length.
4.8
Claim that lq1 = lq = lq1, q + 1 because

combined symbol
reduced code
sq1 + sq total height = lq
0 1
s1, ………
So its reduced code will always sq
sq1 satisfy:

q 2
By IH, L′L  i  ( pimportantly
pi lmore
avg ≤ L′. But q 1  pq )(l q the
 1)reduced
 L  pHuffman
q 1  pq code
i 1 properties so it also satisfies the same equation
shares the same
L′avg + (pq1 + pq) = Lavg, hence Lavg ≤ L.

4.8
Code Extensions
Take p1 = ⅔ and p2 = ⅓. The Huffman code gives s1 = 0, s2 = 1, with Lavg = 1.
Square the symbol alphabet to get S²: s11 = s1s1, s12 = s1s2, s21 = s2s1, s22 = s2s2, with probabilities p11 = 4/9, p12 = p21 = 2/9, p22 = 1/9.
Apply Huffman to S²: s11 = 1, s12 = 01, s21 = 000, s22 = 001.
Lavg = (4/9)·1 + (2/9)·2 + (2/9)·3 + (1/9)·3 = 17/9 < 2.
But we are sending two symbols at a time, so the cost is 17/18 < 1 bit per original symbol.
4.10
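This can be checked numerically: the optimal (Huffman) average length equals the sum of the probabilities of the merged internal nodes, since each merge adds one bit to every symbol beneath it. A self-contained sketch (my own code):

import heapq
from itertools import product

p1, p2 = 2/3, 1/3
probs = sorted(a * b for a, b in product([p1, p2], repeat=2))   # S^2: 1/9, 2/9, 2/9, 4/9

heap = probs[:]
heapq.heapify(heap)
avg = 0.0
while len(heap) > 1:
    x, y = heapq.heappop(heap), heapq.heappop(heap)
    avg += x + y                 # each merge adds one bit to every symbol below it
    heapq.heappush(heap, x + y)

print(avg, avg / 2)              # 17/9 ≈ 1.889 per pair, ≈ 0.944 per source symbol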
Huffman Codes in radix r
At each stage down, we merge the last (least probable) r states into one, reducing the number of states by r − 1. Since we end with one state, we must begin with a number of states ≡ 1 (mod r − 1). We pad out the source with states of probability 0 to achieve this. Example: r = 4, k = 3 (three merges):

probabilities: 0.22 0.2 0.18 0.15 0.1 0.08 0.05 0.02 0.0 0.0 (two pads)
codes: 1 2 3 00 01 02 030 031 (pads unused)

after the first merge: 0.22 0.2 0.18 0.15 0.1 0.08 0.07
codes: 1 2 3 00 01 02 03

after the second merge: 0.4 0.22 0.2 0.18
codes: 0 1 2 3

after the third merge: 1.0
4.11
Information
A quantitative measure of the amount of information any
event represents. I(p) = the amount of information in the
occurrence of an event of probability p.

Axioms (for the information in a single source symbol):
A. I(p) ≥ 0 for any event of probability p
B. I(p1·p2) = I(p1) + I(p2) when p1 and p2 are independent events
C. I(p) is a continuous function of p

Axiom B is the Cauchy functional equation. Existence: I(p) = log(1/p). Units of information: in base 2, a bit; in base e, a nat; in base 10, a Hartley.
6.2
Uniqueness:
Suppose I′(p) satisfies the axioms. Since I′(p) ≥ 0, take any 0 < p0 < 1 and define the base k = (1/p0)^(1/I′(p0)). Then k^I′(p0) = 1/p0, and hence logk(1/p0) = I′(p0). Now, any z ∈ (0, 1) can be written as p0^r for a real number r ∈ ℝ⁺ (r = log_p0 z). The Cauchy functional equation implies that I′(p0^n) = n·I′(p0) and, for m ∈ ℤ⁺, I′(p0^(1/m)) = (1/m)·I′(p0), which gives I′(p0^(n/m)) = (n/m)·I′(p0); hence by continuity I′(p0^r) = r·I′(p0).
Therefore I′(z) = r·logk(1/p0) = logk(1/p0^r) = logk(1/z).

Note: In this proof, we introduce an arbitrary p0, show


how any z relates to it, and then eliminate the
dependency on that particular p0.

6.2
Entropy
The average amount of information received on a per-symbol basis from a source S = {s1, …, sq} of symbols, where si has probability pi; it measures the information rate. In radix r, when all the probabilities are independent:
Hr(S) = Σ_{i=1..q} pi·logr(1/pi) = Σ_{i=1..q} logr((1/pi)^pi) = logr Π_{i=1..q} (1/pi)^pi
i.e. the weighted arithmetic mean of the information equals the information (log) of the weighted geometric mean.

• Entropy is the amount of information in the probability distribution.


Alternative approach: consider a long message of N symbols from S = {s1, …, sq} with probabilities p1, …, pq. You expect si to appear N·pi times, and the probability of this typical message is
P = Π_{i=1..q} pi^(N·pi), whose information is log(1/P) = N·Σ_{i=1..q} pi·log(1/pi) = N·H(S).
6.3
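The definition translates directly into code; a small sketch (the function name is mine):

from math import log

def entropy(probs, r=2):
    """H_r(S) = sum of p * log_r(1/p); terms with p = 0 contribute nothing."""
    return sum(p * log(1 / p, r) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # 1.0 bit per symbol
print(entropy([1.0]))            # 0.0: a certain event carries no information
print(entropy([0.25] * 4))       # 2.0 = log2(4), the maximum for q = 4
print(entropy([2/3, 1/3]))       # ≈ 0.9183, used in the extension example later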
Consider f(p) = p ln(1/p) (the analysis works for any base, not just e):
f′(p) = (−p ln p)′ = −p·(1/p) − ln p = −1 + ln(1/p)
f″(p) = −1/p < 0 for p ∈ (0, 1), so f is concave down.
f(1) = 0 and f′(1) = −1; f′(p) → ∞ as p → 0⁺; the maximum is at p = 1/e, where f′(1/e) = 0 and f(1/e) = 1/e.
Also, lim_{p→0⁺} f(p) = lim_{p→0⁺} ln(1/p) / (1/p) = lim_{p→0⁺} (−ln p)′ / (p⁻¹)′ = lim_{p→0⁺} (−1/p) / (−p⁻²) = lim_{p→0⁺} p = 0.
6.3
Basic information about the logarithm function

The tangent line to y = ln x at x = 1 is (y − ln 1) = (ln)′|_{x=1}·(x − 1), i.e. y = x − 1.
(ln x)″ = (1/x)′ = −1/x² < 0, so ln x is concave down and lies below this tangent line.

Conclusion: ln x ≤ x − 1, with equality only at x = 1.
6.4
Fundamental Gibbs inequality
Let Σ_{i=1..q} xi = 1 and Σ_{i=1..q} yi = 1 be two probability distributions. Then, since log t ≤ t − 1 (with equality only when t = 1, i.e. only when xi = yi),
Σ_{i=1..q} xi·log(yi/xi) ≤ Σ_{i=1..q} xi·(yi/xi − 1) = Σ_{i=1..q} (yi − xi) = Σ yi − Σ xi = 1 − 1 = 0.

• Minimum entropy occurs when one pi = 1 and all the others are 0.

• Maximum entropy occurs when? Consider Gibbs with the distribution yi = 1/q:
H(S) − log q = Σ_{i=1..q} pi·log(1/pi) − log q·Σ_{i=1..q} pi = Σ_{i=1..q} pi·log(1/(q·pi)) ≤ 0.

• Hence H(S) ≤ log q, and equality occurs only when every pi = 1/q.
6.4
Entropy Examples

S = {s1} p1 = 1 H(S) = 0 (no information)


S = {s1,s2} p1 = p2 = ½ H2(S) = 1 (1 bit per symbol)
S = {s1, …, sr}, p1 = … = pr = 1/r: Hr(S) = 1 but H2(S) = log2 r.

• Run length coding (for instance, in binary predictive coding):

p = 1  q is probability of a 0. H2(S) = p log2(1/p) + q log2(1/q)


As q  0 the term q log2(1/q) dominates (compare slopes). C.f.
average run length = 1/q and average # of bits needed = log2(1/q).
So q log2(1/q) = avg. amount of information per bit of original code.
Entropy as a Lower Bound for Average Code Length

Given an instantaneous code with lengths li in radix r, let
K = Σ_{i=1..q} 1/r^li ≤ 1 and Qi = (1/r^li) / K, so that Σ_{i=1..q} Qi = 1.
By Gibbs, Σ_{i=1..q} pi·logr(Qi/pi) ≤ 0; applying log(Qi/pi) = log(1/pi) − log(1/Qi),
Hr(S) = Σ_{i=1..q} pi·logr(1/pi) ≤ Σ_{i=1..q} pi·logr(1/Qi) = Σ_{i=1..q} pi·(logr K + li·logr r) = logr K + Σ_{i=1..q} pi·li.
Since K ≤ 1, logr K ≤ 0, and hence Hr(S) ≤ L.

By the McMillan inequality, this holds for all uniquely decodable codes. Equality occurs when K = 1 (the decoding tree is complete) and pi = 1/r^li.
6.5
Shannon-Fano Coding
Simplest variable length method. Less efficient than Huffman, but
allows one to code symbol si with length li directly from
probability pi.
li = logr(1/pi)

 1  1 1 r pi
  log r   li   log r   1   li
 r   pi  r  .
li

 pi   pi  pi pi r
 K
q q q
pi 1
Summing this inequality over i: p
i 1
i 1 r
i 1
 li

i 1

r r
Kraft inequality is satisfied, therefore there is an instantaneous
code with these lengths.
6.6
Also, multiplying logr(1/pi) ≤ li < logr(1/pi) + 1 by pi and summing over i,
Hr(S) ≤ Σ_{i=1..q} pi·li = L < Hr(S) + 1.

Example: p's: ¼, ¼, ⅛, ⅛, ⅛, ⅛ give l's: 2, 2, 3, 3, 3, 3, with K = 1, H2(S) = 2.5 and L = 5/2.
[Decoding tree: a complete binary tree with two leaves at depth 2 and four leaves at depth 3.]
6.6
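A sketch of the Shannon-Fano length rule (my own code), reproducing the example above and the bounds H ≤ L < H + 1:

from math import ceil, log2
from fractions import Fraction

def shannon_fano_lengths(probs):
    """l_i = ceil(log2(1/p_i)); the Kraft sum of these lengths is always <= 1."""
    return [ceil(log2(1 / p)) for p in probs]

probs = [Fraction(1, 4), Fraction(1, 4)] + [Fraction(1, 8)] * 4
ls = shannon_fano_lengths(probs)
L = sum(p * l for p, l in zip(probs, ls))
H = sum(p * log2(1 / p) for p in probs)
K = sum(Fraction(1, 2 ** l) for l in ls)
print(ls, float(L), H, K)    # [2, 2, 3, 3, 3, 3]  2.5  2.5  1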
The Entropy of Code Extensions

Recall: the nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols
T = Sⁿ = {si1 ⋯ sin : sij ∈ S, 1 ≤ j ≤ n},
where ti = si1 ⋯ sin (concatenation) has probability Qi = pi1 ⋯ pin (multiplication), assuming independent probabilities. Here i = (i1 − 1, …, in − 1)q + 1, an n-digit number base q.

The entropy is:
H(Sⁿ) = H(T) = Σ_{i=1..qⁿ} Qi·log(1/Qi) = Σ_{i=1..qⁿ} Qi·log(1/(pi1 ⋯ pin))
= Σ_{i=1..qⁿ} Qi·(log(1/pi1) + ⋯ + log(1/pin))
= Σ_{i=1..qⁿ} Qi·log(1/pi1) + ⋯ + Σ_{i=1..qⁿ} Qi·log(1/pin).
6.8
Consider the kth term:
Σ_{i=1..qⁿ} Qi·log(1/pik) = Σ_{i1=1..q} ⋯ Σ_{in=1..q} pi1 ⋯ pin·log(1/pik)
= (Σ_{ik=1..q} pik·log(1/pik)) · (Σ over the other n−1 indices of pi1 ⋯ p̂ik ⋯ pin)
= H(S) · 1 = H(S),
since pi1 ⋯ p̂ik ⋯ pin (with the kth factor omitted) is just a probability in the (n−1)st extension, and adding them all up gives 1.
⇒ H(Sⁿ) = n·H(S)

Hence the average S-F code length Ln for T satisfies:
H(T) ≤ Ln < H(T) + 1 ⇒ n·H(S) ≤ Ln < n·H(S) + 1 ⇒
H(S) ≤ Ln/n < H(S) + 1/n [now let n go to infinity].
6.8
Extension Example
S = {s1, s2}, p1 = 2/3, p2 = 1/3: H2(S) = (2/3)·log2(3/2) + (1/3)·log2(3/1) ≈ 0.9182958…
Huffman: s1 = 0, s2 = 1; average coded length = (2/3)·1 + (1/3)·1 = 1.
Shannon-Fano: l1 = 1, l2 = 2; average length = (2/3)·1 + (1/3)·2 = 4/3.
2nd extension: p11 = 4/9, p12 = 2/9 = p21, p22 = 1/9. S-F lengths:
l11 = ⌈log2(9/4)⌉ = 2, l12 = l21 = ⌈log2(9/2)⌉ = 3, l22 = ⌈log2(9/1)⌉ = 4.
L_SF(2) = average coded length = (4/9)·2 + (2/9)·3·2 + (1/9)·4 = 24/9 = 2.666…

In general the probabilities in Sⁿ are the corresponding terms in the expansion of (p1 + p2)ⁿ: there are C(n, i) symbols with probability (2/3)^i·(1/3)^(n−i) = 2^i/3ⁿ each, and the corresponding S-F length is ⌈log2(3ⁿ/2^i)⌉ = ⌈n·log2 3 − i⌉ = ⌈n·log2 3⌉ − i.
6.9
Extension cont.
L_SF(n) = Σ_{i=0..n} C(n, i)·(2^i/3ⁿ)·(⌈n·log2 3⌉ − i)
= (1/3ⁿ)·(⌈n·log2 3⌉·Σ_{i=0..n} C(n, i)·2^i − Σ_{i=0..n} C(n, i)·2^i·i)
= (1/3ⁿ)·(⌈n·log2 3⌉·3ⁿ − 2n·3^(n−1)),
using (2 + 1)ⁿ = 3ⁿ and Σ_{i=0..n} C(n, i)·2^i·i = 2n·3^(n−1) (*).

Hence L_SF(n)/n = ⌈n·log2 3⌉/n − 2/3 → log2 3 − 2/3 = H2(S) as n → ∞.

(*) To see this, start from (2 + x)ⁿ = Σ_{i=0..n} C(n, i)·2^i·x^(n−i) and differentiate with respect to x:
n·(2 + x)^(n−1) = Σ_{i=0..n} C(n, i)·2^i·(n − i)·x^(n−i−1).
At x = 1: n·3^(n−1) = Σ_{i=0..n} C(n, i)·2^i·(n − i) = n·3ⁿ − Σ_{i=0..n} C(n, i)·2^i·i,
so Σ_{i=0..n} C(n, i)·2^i·i = n·3ⁿ − n·3^(n−1) = 2n·3^(n−1).
6.9
Markov Process Entropy
p(si | si1 ⋯ sim) = the conditional probability that si follows si1 ⋯ sim.
For an mth-order process, think of the state as s̄ = (si1, …, sim). Hence
I(si | s̄) = log(1/p(si | s̄)), and so
H(S | s̄) = Σ_{si ∈ S} p(si | s̄)·I(si | s̄).
Now let p(s̄) = the probability of being in state s̄. Then
H(S) = Σ_{s̄ ∈ S^m} p(s̄)·H(S | s̄) = Σ_{s̄ ∈ S^m} Σ_{si ∈ S} p(s̄)·p(si | s̄)·I(si | s̄)
= Σ_{(s̄, si) ∈ S^(m+1)} p(s̄, si)·I(si | s̄) = Σ_{(s̄, si) ∈ S^(m+1)} p(s̄, si)·log(1/p(si | s̄)).
6.10
Example (m = 2, binary). The previous two symbols form the state; the transition probabilities and the resulting joint probabilities are:

si1 si2 si   p(si | si1, si2)   p(si1, si2)   p(si1, si2, si)
0   0   0        0.8              5/14           4/14
0   0   1        0.2              5/14           1/14
0   1   0        0.5              2/14           1/14
0   1   1        0.5              2/14           1/14
1   0   0        0.5              2/14           1/14
1   0   1        0.5              2/14           1/14
1   1   0        0.2              5/14           1/14
1   1   1        0.8              5/14           4/14

Equilibrium probabilities: p(0,0) = 5/14 = p(1,1), p(0,1) = 2/14 = p(1,0).

H2(S) = Σ_{{0,1}³} p(si1, si2, si)·log2(1/p(si | si1, si2))
= 2·(4/14)·log2(1/0.8) + 2·(1/14)·log2(1/0.2) + 4·(1/14)·log2(1/0.5) ≈ 0.801377
6.11
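The figure can be checked numerically; a sketch in which the dictionaries simply transcribe the table above:

from math import log2

# p(next | state) for the 2-memory binary source in the example
cond = {("0", "0"): {"0": 0.8, "1": 0.2},
        ("0", "1"): {"0": 0.5, "1": 0.5},
        ("1", "0"): {"0": 0.5, "1": 0.5},
        ("1", "1"): {"0": 0.2, "1": 0.8}}
# equilibrium state probabilities
state = {("0", "0"): 5/14, ("0", "1"): 2/14, ("1", "0"): 2/14, ("1", "1"): 5/14}

H = sum(state[s] * p * log2(1 / p)
        for s in cond for p in cond[s].values())
print(H)    # ≈ 0.801377 bits per symbol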
The Fibonacci numbers
• Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, … be defined by fn+1 = fn + fn−1. The ratio of successive terms tends to φ, the golden ratio, a root of the equation x² = x + 1. Use the fn as the weights for a system of number representation with digits 0 and 1, without adjacent 1's (because (100)ϕ = (11)ϕ).
Base Fibonacci
Representation Theorem: every number from 0 to fn − 1 can be uniquely written as an n-bit number with no adjacent ones.

Existence: Basis: n = 0, so 0 ≤ i ≤ 0 and 0 = ( )ϕ = ε, the empty representation.

Induction: Let 0 ≤ i < fn+1. If i < fn, we are done by the induction hypothesis. Otherwise fn ≤ i < fn+1 = fn−1 + fn, so 0 ≤ i − fn < fn−1, and i − fn is uniquely representable as i − fn = (bn−2 … b0)ϕ with bi ∈ {0, 1} and no bi = bi+1 = 1. Hence i = (1 0 bn−2 … b0)ϕ, which also has no adjacent ones.

Uniqueness: Let i be the smallest number ≥ 0 with two distinct representations (no leading zeros), i = (bn−1 … b0)ϕ = (b′n−1 … b′0)ϕ. By the minimality of i, bn−1 ≠ b′n−1, so without loss of generality bn−1 = 1 and b′n−1 = 0. But then (b′n−2 … b′0)ϕ ≥ fn−1, which can't be true.
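The existence argument is already an algorithm: greedily take the largest Fibonacci weight that still fits. A sketch (function names mine):

def fib_weights(n):
    """Weights f_0 = 1, f_1 = 2, f_2 = 3, f_3 = 5, ... (n of them)."""
    w = [1, 2]
    while len(w) < n:
        w.append(w[-1] + w[-2])
    return w[:n]

def to_fibonacci(i, n):
    """n-bit base-Fibonacci representation of 0 <= i < f_n, chosen greedily
    from the top weight down; the result never has two adjacent ones."""
    bits = []
    for f in reversed(fib_weights(n)):
        if f <= i:
            bits.append("1")
            i -= f
        else:
            bits.append("0")
    return "".join(bits)

print([to_fibonacci(i, 4) for i in range(8)])
# ['0000', '0001', '0010', '0100', '0101', '1000', '1001', '1010']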
Base Fibonacci
The golden ratio φ = (1 + √5)/2 is a solution to x² − x − 1 = 0 and is the limit of the ratio of adjacent Fibonacci numbers.

[Figure: a 1st-order Markov process on the states 0 and 1 that generates base-Fibonacci strings. Think of the source as emitting the variable-length symbols 0 (probability 1/φ) and 10 (probability 1/φ²), with 1/φ + 1/φ² = 1.]

Entropy = (1/φ)·log φ + ½·(1/φ²)·log φ² = log φ, which is maximal once the variable-length symbols are taken into account.
Markov Processes
Let S = {s1, …, sq} be a set of symbols. A jth-order Markov process has probabilities p(si | si1 ⋯ sij) associated with it: the conditional probability of seeing si after having seen si1 ⋯ sij. This is said to be a j-memory source, and there are q^j states in the Markov process.
Weather example (j = 1): a means "fair", b means "rain", c means "snow".

Transition matrix M, with p(si | sj) in row sj (current state), column si (next symbol); the entries of each row — the outgoing edges of that state in the transition graph — sum to 1:

      a    b    c
a    ⅓    ⅓    ⅓
b    ¼    ½    ¼
c    ¼    ¼    ½
5.2
Ergodic Equilibriums

Definition: A first-order Markov process M is said to be ergodic if


1. From any state we can eventually get to any other state.
2. The system reaches a limiting distribution.

( pa , pb , pc ) M  ( pa , pb , pc ) repeating this ... pM n  p.

Fact : lim M n  M  exists.


n 

pM   pe is called the equilibriu m solution , and satisfies pe M  pe .

3 4 4
In above example, pe   , ,  which is the overall average weather.
 11 11 11 

5.2
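The equilibrium for the weather example can be found by simply iterating p → pM from any starting distribution; a sketch using plain lists (my own code):

def step(p, M):
    """One step of p -> p M for a row vector p and transition matrix M."""
    return [sum(p[j] * M[j][i] for j in range(len(p))) for i in range(len(p))]

# Weather example: rows/columns in the order a ("fair"), b ("rain"), c ("snow").
M = [[1/3, 1/3, 1/3],
     [1/4, 1/2, 1/4],
     [1/4, 1/4, 1/2]]

p = [1.0, 0.0, 0.0]          # any starting distribution works (ergodic process)
for _ in range(100):
    p = step(p, M)
print(p)                      # ≈ [3/11, 4/11, 4/11] ≈ [0.2727, 0.3636, 0.3636]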
Predictive Coding
Assume a prediction algorithm for a binary source which given all prior bits
predicts the next.

From the input stream s1 … sn−1 the predictor produces the prediction pn, and the error is en = pn ⊕ sn.

What is transmitted is the error, en. By knowing just the error, the receiving predictor also knows the original symbols.

[Diagram: source sn → ⊕ → en → channel → en → ⊕ → sn → destination, with an identical predictor at each end supplying pn to the ⊕.]

We must assume that both predictors are identical, and start in the same state.
5.7
Accuracy: The probability of the predictor being correct is p = 1  q;
constant (over time) and independent of other prediction errors.
Let the probability of a run of exactly n 0’s, (0n1), be p(n) = pn ∙ q.
The probability of runs of any length n = 0, 1, 2, … is:
 
1 q q
 p q  q p  q 
n n
  1
1 p 1 p q
n0 n0   f( p ) 
 
Expected length of a run   (n  1) p q  q  (n  1) p n
n

n 0 n 0
 
 1 
 f ( p)dp    (n  1) p dp   p
n 1
n
 c  p   c. So,
n 0 n0 1 p 
(1  p ) p  p (1  p) 1  p  p 1 q 1
f ( p)    . So, q  f ( p )  2 
(1  p) 2
(1  p ) 2
(1  p) 2
q q
2
  
Note: alternate method for calculating f(p), look at  p n 
 n 0 
5.8
Coding of Run Lengths
Send a k-digit binary number to represent a run of zeroes whose length is between 0 and 2^k − 2 (small runs are sent in binary).
For run lengths larger than 2^k − 2, send 2^k − 1 (k ones) followed by another k-digit binary number, etc. (large runs are sent in unary blocks of ones).

Let n = run length and fix k = block length. Use division to get
n = i·m + j, 0 ≤ j < m = 2^k − 1
(like "reading" a "matrix" with m cells per row and infinitely many rows). The code for n is i blocks of k ones followed by j as a k-digit binary number Bj:
code(n) = 11…1 … 11…1 Bj (i blocks of ones), with length ln = (i + 1)·k.
5.9
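A sketch of this encoding rule (the function name is mine), using k-bit blocks:

def encode_run(n, k):
    """Code for a run of n zeros: i blocks of k ones followed by j in k-bit
    binary, where n = i*m + j and m = 2**k - 1."""
    m = 2 ** k - 1
    i, j = divmod(n, m)
    return "1" * (k * i) + format(j, "0{}b".format(k))

for n in (0, 3, 6, 7, 20):
    print(n, encode_run(n, k=3))
# 0 -> 000 | 3 -> 011 | 6 -> 110 | 7 -> 111000 | 20 -> 111111110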
Expected length of run length code

Let p(n) = the probability of a run of exactly n 0’s: 0n1.


The expected code length is:

 p n   l
But every n can be written uniquely as
n 0
n i∙m + j where i ≥ 0, 0 ≤ j < m = 2k  1.

 m 1  m 1
  p (i  m  j )  lim  j   p im j  q  (i  1)  k
i 0 j 0 i 0 j 0
 m 1 
 1  p m

 qk  (i  1)  p  p  qk  (i  1) p 
im j im
 
i 0 j 0 i 0  q 

1 k
k (1  p ) (i  1) p  k (1  p )
m im m

i 0 (1  p )
m 2
1 p m

5.9
Gray Code
Consider an analog-to-digital “flash” converter consisting of a rotating wheel:

[Figure: a wheel with three concentric coded tracks dividing one rotation into eight sectors, read by “brushes” contacting the wheel in each of the three circles; the sectors are labeled with the 3-bit Gray code.]

The maximum error in the scheme is ± ⅛ of a rotation, because the Hamming distance between adjacent positions is 1.

In ordinary binary, the maximum distance between adjacent positions is 3 (the maximum possible).
5.15-17
Inductively: G(1) = (0; 1), and G(n + 1) is formed by prefixing 0 to the words of G(n) in order, followed by prefixing 1 to the words of G(n) in reverse order:
G(n + 1) = (0·G0, …, 0·G_(2^n − 1), 1·G_(2^n − 1), …, 1·G0).

Computationally, with B = bn−1 … b0 and G = gn−1 … g0, bi, gi ∈ {0, 1}:
Encoding: gn−1 = bn−1 and gi = bi ⊕ bi+1 for 0 ≤ i < n − 1
(each bi feeds gi and gi−1: b3 → g3, b3 ⊕ b2 → g2, b2 ⊕ b1 → g1, b1 ⊕ b0 → g0).
Decoding: bi = gn−1 ⊕ gn−2 ⊕ ⋯ ⊕ gi (keep a running total: b3 → b2 → b1 → b0).
5.15-17
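Both conversions are short bitwise operations on integers; a sketch of the XOR rules above (my own function names):

def binary_to_gray(b):
    """g_i = b_i XOR b_{i+1}; the top bit is unchanged."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """b_i = XOR of all Gray bits at or above position i (running total)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

for v in range(8):
    print(v, format(binary_to_gray(v), "03b"), gray_to_binary(binary_to_gray(v)) == v)
# successive Gray codes 000, 001, 011, 010, 110, 111, 101, 100 differ in one bit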
Shannon’s Theorems
First theorem: H(S) ≤ Ln(Sⁿ)/n < H(S) + 1/n,
where Ln(Sⁿ) is the average length of a suitable code (e.g. Shannon-Fano) for the nth extension Sⁿ.

Second theorem: extends this idea to a channel with errors, allowing one to reach arbitrarily close to the channel capacity while simultaneously correcting almost all the errors.
Proof idea: it does so without constructing a specific code, relying instead on a random code.
Random Codes
Send an n-bit block code through a binary symmetric channel with crossover probability Q < ½ (P = 1 − Q):
A = {ai : i = 1, …, M}, M distinct equiprobable n-bit blocks; B = {bj : |bj| = n, j = 1, …, 2ⁿ}, the possible received blocks.
Intuitively, each block comes through with n·C bits of information, where C = 1 − H2(Q) is the capacity; I2(ai) = log2 M.
To signal close to capacity, we want I2(ai) = n·(C − ε) for a small number ε > 0, i.e.
M = 2^(n(C−ε)) = 2^(nC) / 2^(nε),
where 2^(nC) is intuitively the number of messages that can get through the channel; by increasing n, the ratio 2^(nε) can be made arbitrarily large.
⇒ we can choose M so that we use only a small fraction of the number of messages that could get through – redundancy. Excess redundancy gives us the room required to bring the error rate down. For a large n, pick M random codewords from {0, 1}ⁿ.
10.4
With high probability, almost all ai will be a certain distance apart (provided M ≪ 2ⁿ). Picture the ai in n-dimensional Hamming space. As each ai goes through the channel, we expect nQ errors on average.

Consider a sphere of radius n(Q + ε′) about each sent ai, and similarly about each received bj. What is the probability that an uncorrectable error occurs? Either there is too much noise (bj falls outside the sphere around ai), or another codeword a′ also lies inside the sphere around bj:
PE ≤ P(bj ∉ S_n(Q+ε′)(ai)) + Σ_{a′ ≠ ai} P(a′ ∈ S_n(Q+ε′)(bj)).
By the law of large numbers, lim_{n→∞} P(bj ∉ S_n(Q+ε′)(ai)) = 0, so the first term can be made ≪ δ.
N.b. P(bj ∈ S(ai)) = P(ai ∈ S(bj)).
10.4
Idea
Pick the number of code words M to be 2^(n(C−ε)), where C is the channel capacity (the block size n is as yet undetermined and depends on how close ε we wish to approach the channel capacity). The number of possible random codes is (2ⁿ)^M = 2^(nM), each equally likely. Let PE = the probability of error averaged over all random codes. The idea is to show that PE → 0, i.e. given any code, most of the time it will probably work!
Proof
Suppose a is what’s sent, and b what’s received.
PE ≤ P(d(a, b) > n(Q + ε′)) + Σ_{a′ ≠ a} P(d(a′, b) ≤ n(Q + ε′))
(too many errors)                    (another codeword is too close)

Let X = 0/1 be a random variable representing errors in the channel, with probabilities P/Q. If the error vector is a ⊕ b = (X1, …, Xn), then d(a, b) = X1 + … + Xn, so
P(d(a, b) > n(Q + ε′)) = P(X1 + ⋯ + Xn > n(Q + ε′)) = P((X1 + ⋯ + Xn)/n − Q > ε′) ≤ V{X}/(ε′²·n) → 0 as n → ∞ (by the law of large numbers).
N.B. Q = E{X} and Q < ½, so pick ε′ with Q + ε′ < ½.
10.5
Since the a′ are randomly (uniformly) distributed throughout the space, the chance that some particular code word lands too close to b is, by the binomial bound,
P(d(a′, b) ≤ n(Q + ε′)) ≤ 2^(n·H2(Q+ε′)) / 2ⁿ
(the numerator bounds the volume of the Hamming sphere, the denominator is the volume of the whole space).

H is concave down and Q + ε′ < ½, so
H(Q + ε′) ≤ H(Q) + ε′·H′(Q), with H′(Q) = log2(1/Q) − log2(1/(1 − Q)) = log2(1/Q − 1).

The chance that any one of the other codewords is too close is therefore at most
M·P(d(a′, b) ≤ n(Q + ε′)) ≤ 2^(n(C−ε))·2^(n·H2(Q+ε′)) / 2ⁿ = 2^(n(1−H2(Q)−ε))·2^(n·H2(Q+ε′)) / 2ⁿ
≤ 2^(−n(ε − ε′·log2(1/Q − 1))) → 0 as n → ∞.
N.b. e = log2(1/Q − 1) > 0, so we can choose ε′ with ε′·e < ε.
10.5
