
Unit-1

INFORMATION THEORY
Historical Notes
• Claude E. Shannon (1916-2001) established almost everything regarding source coding and channel coding in his single 1948 paper.
• He approached the problem from a communications perspective.
• His paper contains the first published use of the term “bit.”
Motivation: What is information?
• By definition, information is certain knowledge about certain things, which may or may not be conceived by an observer.
• Examples: music, a story, news, etc.
• Information has meaning.
• Information propagates.
• Information can be corrupted…
Information Theory tells us…

• What exactly information is
• How it is measured and represented
• What its implications, limitations, and applications are
Information

• Information ≠ knowledge
• Information: a reduction in uncertainty

• Examples:
  1) Flip a coin
  2) Roll a die

• The information content of a source is the uncertainty of the source.
Example
• Every time we roll a die, we get a number from 1 through 6. The information we get is larger than when we toss a coin or roll a biased die.
• The less we know in advance, the more information the outcome carries!
• A source (random process) known to generate the sequence 01010101…, with 0 right after 1 and 1 right after 0, has an average chance of 50% of emitting either 0 or 1, yet it conveys essentially no information: each symbol is fully determined by the one before it, unlike a fair coin.
• If the outcome is known for sure, no information is gained.
Definition of Information
• Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received
  I(E) = log2(1/P(E)) bits of information.
• Example:
  – Result of a fair coin flip: log2 2 = 1 bit
  – Result of a fair die roll: log2 6 ≈ 2.585 bits
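The definition is easy to check numerically. Below is a minimal Python sketch (the helper name self_information is an illustrative choice, not from the slides):

```python
# Self-information I(E) = log2(1 / P(E)) in bits.
import math

def self_information(p: float) -> float:
    """Information received when an event of probability p occurs."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / p)

print(self_information(1 / 2))   # fair coin flip -> 1.0 bit
print(self_information(1 / 6))   # fair die roll  -> ~2.585 bits
```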
Information is Additive

• I(k fair coin tosses) = log2 2^k = k bits

• Example: information conveyed by words
  – A random word from a 100,000-word vocabulary:
    • I(word) = log2(100,000) ≈ 16.6 bits
  – A 1000-word document from the same source:
    • I(document) ≈ 16,600 bits
  – A 480 × 640 pixel, 16-greyscale video picture:
    • I(picture) = 307,200 × log2 16 = 1,228,800 bits
  ⇒ A picture is worth more than a 1000 words!
Entropy
• A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, …, sK} with probabilities {p1, p2, …, pK}, respectively, where the symbols emitted are statistically independent.

• What is the average amount of information obtained in observing the output of the source S?

• We call this quantity the entropy:

  H(S) = Σ (k = 1 to K) pk log2(1/pk)  bits/symbol
Explanation of Entropy

1. The average amount of information provided per symbol
2. The average number of bits needed to communicate each symbol
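A short sketch of this definition (the function name is illustrative; the last distribution is the one used in the Shannon-Fano and Huffman examples later in this unit):

```python
# Entropy of a zero-memory source: H(S) = sum_k p_k * log2(1 / p_k), in bits/symbol.
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)

print(entropy([0.5, 0.5]))                       # fair coin -> 1.0
print(entropy([1 / 6] * 6))                      # fair die  -> ~2.585
print(entropy([0.35, 0.17, 0.17, 0.16, 0.15]))   # -> ~2.233
```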
Information Measure: behavioral definitions
• H should be maximized when the object is most unpredictable.
• H(X) = 0 if X is deterministic.
• The information measure H should be additive for independent objects; i.e., for two information sources that have no relation to each other, H = H1 + H2.
• H is the information entropy!
Properties of Entropy

1. Non-negativity: H(P) ≥ 0.
2. H(P) = 0 only if the pi are all 0 or 1 (i.e., one symbol has probability 1).
3. For any other probability distribution {q1, …, qk}:
   Σ (i) pi log2(1/pi) ≤ Σ (i) pi log2(1/qi)
4. H(P) ≤ log2 k, with equality iff pi = 1/k for all i (equiprobable symbols).
5. The further P is from the uniform distribution, the lower the entropy.
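Properties 4 and 5 can be checked numerically for a two-symbol source (a minimal sketch; the binary entropy function reappears on the next slide):

```python
# Entropy of a two-symbol source {p, 1-p}: maximal (log2 2 = 1 bit) at the uniform
# distribution p = 0.5 and decreasing as p moves away from it (properties 4 and 5).
import math

def H2(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.5, 0.3, 0.1, 0.01):
    print(p, round(H2(p), 4))   # entropy drops as the distribution gets less uniform
```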
Discrete Memoryless Source (example)
• Entropy of a Binary Memoryless Source
• A binary memoryless source has symbols 0 and 1 with probabilities p0 = p and p1 = (1 − p0) = (1 − p).
• Entropy as a function of p:
  H(p) = p log2(1/p) + (1 − p) log2(1/(1 − p))
  – zero information at the edges (p = 0 or 1)
  – maximum information at p = 0.5 (1 bit)
  – drops off more quickly close to the edges than in the middle
Extension of a Discrete Memoryless Source

• The entropy of the n-th extension of a source satisfies H(P^n) = n H(P).
  – A random word from a 100,000-word vocabulary:
    • I(word) = log2(100,000) ≈ 16.6 bits
  – A 1000-word document from the same source:
    • I(document) ≈ 16,600 bits
The Entropy of English and Tamil
• ENGLISH:
  – 27 characters (A-Z, space)
  – 100,000 words (average 6.5 characters each)
  – Assuming independence between successive characters and a uniform character distribution: log2(27) ≈ 4.75 bits/character
• TAMIL:
  – 248 characters (247 + 1)
  – Assuming independence between successive characters and a uniform character distribution: log2(248) ≈ 7.95 bits/character
• The entropy of English is much lower!
Source Coding
• Source coding means an efficient representation of the data generated by a discrete source.
• It is performed by a source encoder.
• The statistics of the source must be known (e.g., if coding priorities exist).
• Morse code: E → . , Q → - - . -
• Functional requirements:
  – Codewords are in binary form
  – The code is uniquely decodable
Source Coding

• A discrete memoryless source emits symbol sk; the source encoder maps it to a binary codeword bk (a sequence of 0's and 1's).
• K symbols, P(sk) = pk, length of the codeword bk assigned to sk = lk.
• Average codeword length:
  L = Σ (k = 0 to K-1) pk lk
• Coding efficiency:
  η = Lmin / L, where Lmin is the minimum possible value of L
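A small numeric sketch of these two quantities, using the probabilities and codeword lengths of the Shannon-Fano example that appears later in this unit:

```python
# Average codeword length L = sum_k p_k * l_k and efficiency = H(P) / L.
import math

probs   = [0.35, 0.17, 0.17, 0.16, 0.15]
lengths = [2, 2, 2, 3, 3]               # codeword lengths l_k of the code shown later

L = sum(p * l for p, l in zip(probs, lengths))              # -> 2.31 bits/symbol
H = sum(p * math.log2(1 / p) for p in probs)                # -> ~2.233 bits/symbol
print(round(L, 2), round(H / L, 3))                         # efficiency ~0.967
```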


Source Coding Theorem
The Source Coding Theorem
• Given a discrete memoryless source of entropy H(P), the average codeword length L for any distortionless (lossless) source encoding is bounded as
  L ≥ H(P)
• The source entropy represents a fundamental limit on the average number of bits per symbol necessary for lossless compression.
• Hence the coding efficiency can be written as η = H(P) / L.
Fixed length code
Variable length code 1
A BAD CAB
Variable length code 2

A EAB AAD
A BAAB AAAB
Kraft Inequality
Example
Huffman Coding
Example
Huffman coding is not unique
Example 2
Alternate Table
100% Efficient
Example
Arithmetic Coding
Limitations of Huffman Code
• Huffman coding is optimal only if the symbol probabilities are powers of two, i.e. of the form 2^(-n).
• A prefix code can match the codeword length to the self-information only when the probabilities have this 2^(-n) form.
• A prefix code built on a binary tree spends one whole bit per decision, whether the split is 0.5/0.5 or 0.9/0.1.
Arithmetic coding does not have these limitations
• Arithmetic coding represents the file to be encoded by an interval of real numbers between 0 and 1.
• Each successive symbol in the message narrows this interval in proportion to the probability of that symbol.
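A minimal encoder-side sketch of this interval narrowing (the three-symbol model is an assumption for illustration, not from the slides):

```python
# Arithmetic coding: successive symbols narrow the interval [low, high) in
# proportion to their probabilities.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

def encode_interval(message, probs):
    """Return the final [low, high) interval that represents the message."""
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        cum = 0.0
        for s, p in probs.items():          # walk the cumulative distribution
            if s == sym:
                high = low + width * (cum + p)
                low = low + width * cum
                break
            cum += p
    return low, high

low, high = encode_interval("abc", probs)
print(low, high)   # any number inside [low, high) identifies "abc" under this model
```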
Example
Limitation
• The receiver must have prior knowledge of the symbol probabilities.
• The decoder does not know when to stop (an end-of-message symbol or the message length must be conveyed).

Lempel-Ziv Algorithm
Limitations of other codes
• Huffman coding requires the symbol probabilities.
• In real life the statistics of the source are unknown.
• Huffman coding is optimal only for a DMS.
• Practical sources exhibit statistical interdependence between symbols (‘q-u’, ‘t-h’, ‘i-n-g’).
LZA
• Variable-to-fixed-length coding.
• Universal source coding: no prior knowledge of the source statistics is needed.
Example
• 101011011010101011
• 1, 0, 1011011010101011
• 1, 0, 10, 11011010101011
• 1, 0, 10, 11, 011010101011
• 1, 0, 10, 11, 01, 1010101011
• 1, 0, 10, 11, 01, 101, 0101011
• 1, 0, 10, 11, 01, 101, 010, 1011
Example
• 01001111100101000001010101100110000
• Parsing
• 0,1,00,11,111,001,01,000,0010,10,101,100,110
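The parsing rule illustrated above (each new phrase is the shortest string not yet in the phrase list) can be sketched as follows:

```python
# Incremental (LZ78-style) parsing: emit the shortest substring not seen as a phrase before.
def lz_parse(bits: str):
    phrases, current = [], ""
    for b in bits:
        current += b
        if current not in phrases:      # shortest new phrase found
            phrases.append(current)
            current = ""
    if current:                         # leftover suffix (already a known phrase)
        phrases.append(current)
    return phrases

print(lz_parse("101011011010101011"))
# -> ['1', '0', '10', '11', '01', '101', '010', '1011']
```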
Discrete Memoryless Channels
• A discrete memoryless channel is a statistical model with an input X and an output Y, which is a noisy version of X (both are random variables).
• In each time slot the channel accepts an input symbol X selected from a given alphabet X and emits an output symbol Y belonging to a given alphabet Y.
  – The alphabet sizes are finite ⇒ discrete
  – The current output depends only on the current input ⇒ memoryless
• Channel description ⇒ X, Y and the transition probabilities:

  X = { x0, x1, x2, …, xJ-1 }
  Y = { y0, y1, y2, …, yK-1 }

  p(yk/xj) = P(Y=yk / X=xj) ; for all j and k
  0 ≤ p(yk/xj) ≤ 1 ; for all j and k
Discrete Memoryless Channels

• The channel from X to Y is fully described by the J × K channel matrix of transition probabilities p(yk/xj):

      | p(y0/x0)    p(y1/x0)    …  p(yK-1/x0)   |
  P = | p(y0/x1)    p(y1/x1)    …  p(yK-1/x1)   |
      | …                                       |
      | p(y0/xJ-1)  p(y1/xJ-1)  …  p(yK-1/xJ-1) |

• Each row sums to one: Σ (k = 0 to K-1) p(yk/xj) = 1, for all j
Discrete Memoryless Channels
• p(xj) ⇒ a priori probability of the input symbols

• Joint probability:
  p(xj, yk) = P(X=xj, Y=yk) = P(Y=yk / X=xj) P(X=xj) = p(yk/xj) p(xj)

• Output (marginal) probability:
  p(yk) = P(Y=yk) = Σ (j = 0 to J-1) P(Y=yk / X=xj) P(X=xj)
        = Σ (j = 0 to J-1) p(yk/xj) p(xj) ; for k = 0, 1, …, K-1

• The p(yk/xj) are the transition probabilities.
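A quick sketch of the output-probability formula, using an illustrative binary symmetric channel with crossover probability 0.1 (not a matrix from the slides):

```python
# p(y_k) = sum_j p(y_k / x_j) * p(x_j) for a discrete memoryless channel.
p_x = [0.5, 0.5]                        # a priori input probabilities p(x_j)
P = [[0.9, 0.1],                        # row j holds p(y_k / x_j); each row sums to 1
     [0.1, 0.9]]

p_y = [sum(P[j][k] * p_x[j] for j in range(len(p_x))) for k in range(len(P[0]))]
print(p_y)   # -> [0.5, 0.5] for this BSC with equiprobable inputs
```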
Discrete Memoryless Channels
• Example: the binary symmetric channel (BSC) with crossover probability p has the channel matrix

  P = | 1-p   p  |
      |  p   1-p |
Conditional Entropy

• Source alphabet X, source entropy H(X).
• Uncertainty remaining about X after the observation of Y ⇒ H(X/Y)
• Uncertainty before transmission ⇒ H(X)
• Uncertainty after transmission ⇒ H(X/Y)
• Uncertainty resolved ⇒ H(X) − H(X/Y) ⇒ MUTUAL INFORMATION
• MUTUAL INFORMATION ⇒ I(X; Y) = H(X) − H(X/Y)
  Also ⇒ I(Y; X) = H(Y) − H(Y/X)

Mutual Information
H(X) = Σ (j = 0 to J-1) p(xj) log2{ 1 / p(xj) }

     = Σ (j = 0 to J-1) p(xj) log2{ 1 / p(xj) } Σ (k = 0 to K-1) p(yk/xj)

     = Σ (j = 0 to J-1) Σ (k = 0 to K-1) p(yk/xj) p(xj) log2{ 1 / p(xj) }

     = Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj, yk) log2{ 1 / p(xj) }
Conditional Entropy
• Source alphabet X, source entropy H(X).
• Uncertainty remaining about X after the observation of Y = yk:

  H(X / Y=yk) = Σ (j = 0 to J-1) p(xj/yk) log2{ 1 / p(xj/yk) }

• Averaging over all received symbols yk:

  H(X/Y) = Σ (k = 0 to K-1) H(X / Y=yk) p(yk)

         = Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj/yk) p(yk) log2{ 1 / p(xj/yk) }

         = Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj, yk) log2{ 1 / p(xj/yk) }
Mutual Information
I(X; Y) = H(X) − H(X/Y)

        = Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj, yk) log2{ 1 / p(xj) }
          − Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj, yk) log2{ 1 / p(xj/yk) }

I(X; Y) = Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj, yk) log2{ p(xj/yk) / p(xj) }

Applying Bayes' rule: I(X; Y) = I(Y; X) = H(Y) − H(Y/X)
Properties of Mutual Information
• The mutual information of a channel is symmetric: I(X; Y) = I(Y; X)
• Mutual information is always non-negative: I(X; Y) ≥ 0
  – Information cannot be lost on average
  – No gain in information, I(X; Y) = 0, happens when X and Y are statistically independent
• The mutual information of a channel is related to the joint entropy of the channel input and channel output by
  I(X; Y) = H(X) + H(Y) − H(X, Y)
  where the joint entropy is defined by
  H(X, Y) = Σ (k = 0 to K-1) Σ (j = 0 to J-1) p(xj, yk) log2{ 1 / p(xj, yk) }
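A small sketch of the identity I(X; Y) = H(X) + H(Y) − H(X, Y); the joint distribution below (a BSC with crossover 0.1 and equiprobable inputs) is an illustrative assumption:

```python
# Mutual information from a joint distribution via I(X;Y) = H(X) + H(Y) - H(X,Y).
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

joint = [[0.45, 0.05],                                         # p(x_j, y_k) = p(y_k/x_j) p(x_j)
         [0.05, 0.45]]

p_x = [sum(row) for row in joint]                              # marginal of X
p_y = [sum(joint[j][k] for j in range(2)) for k in range(2)]   # marginal of Y
p_xy = [p for row in joint for p in row]                       # flattened joint

print(round(H(p_x) + H(p_y) - H(p_xy), 4))   # -> ~0.531 bits, i.e. 1 - H(0.1) for this BSC
```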
Relationship
[Venn-style diagram: the total area is H(X, Y); the two overlapping circles are H(X) and H(Y); their overlap is I(X; Y), and the non-overlapping parts are H(X|Y) and H(Y|X).]
Mutual Information
A binary symmetric channel
A binary channel with asymmetric probabilities
Channel Capacity
• The capacity of a channel is defined as the intrinsic ability of the channel to convey information.
• In terms of mutual information, the channel capacity of a discrete memoryless channel is the maximum average mutual information I(X; Y) in any single use of the channel, where the maximization is over all possible input probability distributions p(xj):

  C = max over {p(xj)} of I(X; Y)

  subject to p(xj) ≥ 0 for all j and Σ (j = 0 to J-1) p(xj) = 1

• C is measured in bits per channel use.
Discrete Memoryless Channels
Example: binary symmetric channel
• The capacity of a binary symmetric channel follows from the input probabilities and the transition probabilities, and varies with the error probability p:

  C = max I(X; Y) = I(X; Y) evaluated at p(x0) = p(x1) = 0.5
    = 1 + p log2 p + (1 − p) log2(1 − p)
    = 1 − H(p)

• H(p) is maximum for p = 0.5, so C = 0 at p = 0.5.
• C is maximum when p = 0 ⇒ noise-free channel.
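A quick sketch of C = 1 − H(p):

```python
# Capacity of a binary symmetric channel with crossover probability p.
import math

def bsc_capacity(p: float) -> float:
    if p in (0.0, 1.0):
        return 1.0                      # deterministic channel: 1 bit per use
    Hp = p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))
    return 1.0 - Hp

for p in (0.0, 0.1, 0.5):
    print(p, round(bsc_capacity(p), 4))   # a 0.5 crossover gives zero capacity
```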
Binary Erasure Channel (BEC)
Symmetric
Example
Properties
Channel Coding
ECC
Rate(Optimum)
Noisy Channel Coding Theorem (or simply the Channel Coding Theorem)
Example
Average probability of error for repetition codes
Information Capacity Theorem
Capacity
Example
Shannon's Information Capacity Theorem
The information capacity of a continuous channel of bandwidth B hertz, perturbed by additive white Gaussian noise of power spectral density N0/2 and limited in bandwidth to B, is given by

  C = B log2(1 + P/(N0 B)) bits per second,

where P is the average transmitted power.

This defines the fundamental limit on the rate of error-free transmission for a power-limited, band-limited Gaussian channel.
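A numeric sketch of the formula; the bandwidth, noise density, and SNR below are illustrative values, not from the slides:

```python
# Information capacity C = B * log2(1 + P / (N0 * B)) of an AWGN channel.
import math

def awgn_capacity(B_hz: float, P_watts: float, N0: float) -> float:
    """Capacity in bits per second."""
    return B_hz * math.log2(1 + P_watts / (N0 * B_hz))

B, N0 = 3000.0, 1e-9                    # 3 kHz bandwidth, assumed noise density
P = 1000 * N0 * B                       # power chosen so that P/(N0*B) = 1000 (30 dB SNR)
print(round(awgn_capacity(B, P, N0)))   # -> ~29,902 bits per second
```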
Shannon limit …
• The Shannon theorem puts a limit on the transmission data rate, not on the error probability:
  – It is theoretically possible to transmit information at any rate Rb ≤ C with an arbitrarily small error probability by using a sufficiently complicated coding scheme.
  – For an information rate Rb > C, it is not possible to find a code that can achieve an arbitrarily small error probability.
Shannon limit …

C = B log2(1 + P/(N0 B))

Setting Rb = C and P = Eb Rb = Eb C:

  C = B log2(1 + Eb C / (N0 B))
  C/B = log2[ 1 + (C/B)(Eb/N0) ]

Shannon limit:

  Eb/N0 = [ 2^(C/B) − 1 ] / (C/B)

  As B → ∞, Eb/N0 → ln 2 = 0.693 ≈ −1.6 dB

  – There exists a limiting value of Eb/N0 below which there can be no error-free communication at any information rate.
  – By increasing the bandwidth alone, the capacity cannot be increased to any desired value.
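A quick numeric check of the limit (a sketch, not slide code):

```python
# Eb/N0 = (2^(C/B) - 1) / (C/B) tends to ln 2 (about -1.6 dB) as C/B -> 0 (B -> infinity).
import math

def ebno_required(r: float) -> float:
    """Minimum Eb/N0 (linear) at spectral efficiency r = C/B in bits/s/Hz."""
    return (2 ** r - 1) / r

for r in (4.0, 1.0, 0.1, 0.001):
    ebno = ebno_required(r)
    print(r, round(ebno, 4), round(10 * math.log10(ebno), 2), "dB")
# the last value approaches ln 2 = 0.6931, i.e. about -1.59 dB
```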
Shannon limit …
[Plots: spectral efficiency Rb/B (bits/s/Hz) versus Eb/N0 (dB), and its reciprocal B/Rb (Hz/bits/s) versus Eb/N0. The Shannon-limit curve separates the unattainable region from the practical region, with the asymptote at −1.6 dB.]
Bandwidth efficiency plane
[Plot: Rb/B (bits/s/Hz) versus Eb/N0 (dB). The R = C curve divides the unattainable region (R > C) from the practical region (R < C); the Shannon limit is marked at −1.6 dB. MPSK and MQAM points (M = 2, 4, 8, 16, 64, 256) lie in the bandwidth-limited region, while MFSK points lie in the power-limited region.]
Shannon-Fano Code
• Algorithm
  – Line up the symbols by decreasing probability of occurrence
  – Divide the symbols into 2 groups so that both have (nearly) equal combined probability
  – Assign 0 to the 1st group and 1 to the 2nd
  – Repeat step 2 on each group
• Example (a code sketch follows the table):

  Symbol   Prob.   Codeword
  A        0.35    00
  B        0.17    01
  C        0.17    10
  D        0.16    110
  E        0.15    111

  Average codeword length = 0.35 × 2 + 0.17 × 2 + 0.17 × 2 + 0.16 × 3 + 0.15 × 3 = 2.31 bits per symbol
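A compact sketch of the procedure above; the choice of split point when the halves cannot be exactly equal is a design decision, but for this example it reproduces the table:

```python
# Shannon-Fano coding: sort by probability, split into two nearly equal-probability
# groups, prefix 0/1, and recurse on each group.
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs, sorted by decreasing probability."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    best_split, best_diff, running = 1, float("inf"), 0.0
    for i, (_, p) in enumerate(symbols[:-1], start=1):
        running += p
        diff = abs(total - 2 * running)        # imbalance if we split after position i
        if diff < best_diff:
            best_split, best_diff = i, diff
    codes = {}
    for sym, code in shannon_fano(symbols[:best_split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(symbols[best_split:]).items():
        codes[sym] = "1" + code
    return codes

probs = [("A", 0.35), ("B", 0.17), ("C", 0.17), ("D", 0.16), ("E", 0.15)]
codes = shannon_fano(probs)
print(codes)                                               # A=00, B=01, C=10, D=110, E=111
print(round(sum(p * len(codes[s]) for s, p in probs), 2))  # -> 2.31 bits per symbol
```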
Huffman Code
• Shannon-Fano code [1949]
  – Top-down algorithm: assigns codes from the most frequent symbol to the least frequent
  – VLC, uniquely & instantaneously decodable (no codeword is a prefix of another)
  – Unfortunately not optimal in terms of minimum redundancy
• Huffman code [1952]
  – Quite similar to Shannon-Fano in the VLC concept
  – Bottom-up algorithm: assigns codes from the least frequent symbol to the most frequent
  – Minimum-redundancy code; it reaches the source entropy exactly when the probabilities of occurrence are powers of two
  – Used in JPEG images, DVD movies, MP3 music
Huffman Coding Algorithm

• Encoding algorithm
  – Order the symbols by decreasing probabilities
  – Starting from the bottom, assign 0 to the least probable symbol and 1 to the next least probable
  – Combine the two least probable symbols into one composite symbol
  – Reorder the list with the composite symbol
  – Repeat step 2 until only two symbols remain in the list
• Huffman tree
  – Nodes: symbols or composite symbols; the original symbols are the leaves and the final composite symbol is the root
  – Branches: from each node, 0 defines one branch while 1 defines the other
• Decoding algorithm
  – Start at the root and follow the branches according to the bits received
  – When a leaf is reached, a symbol has just been decoded
(A code sketch of this procedure follows the worked example below.)
Huffman Coding Example
Successive merge steps (the list is reordered after each merge):

  Step 1: A 0.35, B 0.17, C 0.17, D 0.16, E 0.15
  Step 2 (merge D + E → DE 0.31): A 0.35, DE 0.31, B 0.17, C 0.17
  Step 3 (merge B + C → BC 0.34): A 0.35, BC 0.34, DE 0.31
  Step 4 (merge BC + DE → BCDE 0.65): BCDE 0.65, A 0.35

Huffman tree: the root branches into BCDE (bit 1) and A (bit 0); BCDE branches into BC (1) and DE (0); BC into B (1) and C (0); DE into D (1) and E (0).

Huffman codes:
  A  0
  B  111
  C  110
  D  101
  E  100

Average codeword length = 0.35 × 1 + 0.65 × 3 = 2.30 bits per symbol
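A sketch of the same procedure using a priority queue; the tie-breaking between equal probabilities is a design choice, so individual codewords may differ from the table while the average length stays 2.30 bits/symbol:

```python
# Huffman coding: repeatedly merge the two least probable nodes, prepending a bit
# to every symbol inside each merged node.
import heapq

def huffman_codes(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)                    # least probable node
        p1, _, c1 = heapq.heappop(heap)                    # next least probable node
        merged = {s: "0" + c for s, c in c0.items()}       # 0 branch
        merged.update({s: "1" + c for s, c in c1.items()}) # 1 branch
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.35, "B": 0.17, "C": 0.17, "D": 0.16, "E": 0.15}
codes = huffman_codes(probs)
print(codes)
print(round(sum(p * len(codes[s]) for s, p in probs.items()), 2))   # -> 2.3
```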
Huffman coding
• In Huffman coding, each symbol of a given alphabet is assigned a sequence of bits according to the symbol probability.
• For the example considered: L = 2.2, H(P) = 2.12193
Huffman coding
• Huffman coding ⇒ not unique
• L remains unchanged between the alternative codes, but the variance of the codeword lengths differs:

  σ² = Σ (k = 0 to K-1) pk (lk − L)²

• Verify what happens to the variance (a small numeric sketch follows).
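A quick check of the variance formula. The probabilities below ({0.4, 0.2, 0.2, 0.1, 0.1}) are chosen to be consistent with the L = 2.2 and H(P) ≈ 2.122 figures quoted above, and the two length sets correspond to two alternative Huffman codes for such a source (an assumption for illustration):

```python
# Variance of codeword lengths: sigma^2 = sum_k p_k * (l_k - L)^2.
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths_a = [2, 2, 2, 3, 3]     # one placement of the merged symbols
lengths_b = [1, 2, 3, 4, 4]     # another placement: same average length, larger variance

for lengths in (lengths_a, lengths_b):
    L = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - L) ** 2 for p, l in zip(probs, lengths))
    print(round(L, 2), round(var, 2))    # -> 2.2 0.16  and  2.2 1.36
```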
Run-length encoding

Run-length encoding is probably the simplest method of


compression. It can be used to compress data made of any
combination of symbols. It does not need to know the
frequency of occurrence of symbols and can be very efficient
if data is represented as 0s and 1s.
The general idea behind this method is to replace
consecutive repeating occurrences of a symbol by one
occurrence of the symbol followed by the number of
occurrences.
The method can be even more efficient if the data uses
only two symbols (for example 0 and 1) in its bit pattern and
one symbol is more frequent than the other.
Figure 15.2 Run-length encoding example
Figure 15.3 Run-length encoding for two symbols
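A minimal sketch of the basic idea (the output format, symbol followed by its run count, is an illustrative choice):

```python
# Run-length encoding: replace each run of a repeated symbol by the symbol and its count.
def rle_encode(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # extend the current run
        out.append(f"{data[i]}{j - i}")
        i = j
    return "".join(out)

print(rle_encode("AAAABBBCCDAA"))   # -> A4B3C2D1A2
```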
Lempel Ziv (LZ) encoding is an example of a category of
algorithms called dictionary-based encoding. The idea is to
create a dictionary (a table) of strings used during the
communication session. If both the sender and the receiver
have a copy of the dictionary, then previously-encountered
strings can be substituted by their index in the dictionary to
reduce the amount of information transmitted.
LZW
LZW is an LZ78-based scheme designed by T. Welch in 1984.
– LZ78 schemes work by putting phrases into a dictionary; when a repeat occurrence of a particular phrase is found, the dictionary index is output instead of the phrase.
– LZW starts with a 4K dictionary:
  • entries 0-255 refer to individual bytes
  • entries 256-4095 refer to substrings
– Each time a new code is generated, it means a new string has been parsed.
  • New strings are generated by adding the current character K to the end of an existing string w (until the dictionary is full).
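A compact encoder sketch following that description (byte-oriented, 4096-entry dictionary):

```python
# LZW encoding: the dictionary starts with all single bytes (codes 0-255) and grows
# by one entry (string w + next character K) for every code that is emitted.
def lzw_encode(data: bytes, max_entries: int = 4096):
    dictionary = {bytes([i]): i for i in range(256)}
    w, codes = b"", []
    for byte in data:
        wk = w + bytes([byte])                      # current string w plus character K
        if wk in dictionary:
            w = wk                                  # keep extending the match
        else:
            codes.append(dictionary[w])             # emit the index of the known string
            if len(dictionary) < max_entries:
                dictionary[wk] = len(dictionary)    # add the new string to the dictionary
            w = bytes([byte])
    if w:
        codes.append(dictionary[w])
    return codes

print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))
```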
Block Transform Coding

• original image
  – decompose: a sequence of 8 × 8 blocks; different planes are treated separately (RGB, YUV, etc.)
• transform
  – transformed blocks reduce redundancy and concentrate the signal energy into a few coefficients
  – discrete cosine transform (DCT)
• quantise
  – blocks with discarded information: the goal is to smooth the picture and discard information that will not be missed, e.g. high frequencies
• entropy code
Block Transform Encoding

[Pipeline: image block → DCT → quantise → zig-zag scan → run-length code → entropy code → bitstream 010111000111…]
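A sketch of the 8 × 8 forward DCT-II used in this pipeline (orthonormal form; JPEG additionally level-shifts the samples by −128 before transforming, which is omitted here):

```python
# 2-D DCT-II of an 8x8 block: a flat block concentrates all its energy in the DC coefficient.
import math

def dct_2d(block):
    """block: 8x8 list of lists of sample values. Returns the 8x8 DCT coefficients."""
    N = 8
    c = lambda k: math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)   # normalisation
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

flat = [[128] * 8 for _ in range(8)]
print(round(dct_2d(flat)[0][0]))   # -> 1024: all energy in the DC term for a flat block
```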
Block Encoding
Original image block:
  139 144 149 153
  144 151 153 156
  150 155 160 163
  159 161 162 160

DCT:
  1260   -1  -12   -5
   -23  -17   -6   -3
   -11   -9   -2    2
    -7   -2    0    1

Quantise:
   79    0   -1    0
   -2   -1    0    0
   -1   -1    0    0
    0    0    0    0

Zig-zag scan: 79 0 -2 -1 -1 -1 0 0 -1 0 0 0 0 0 0 0

Run-length code (zero-run, value): (0, 79) (1, -2) (0, -1) (0, -1) (0, -1) (2, -1) (0, 0)

Huffman code: 10011011100011…
Block Transform Decoding

[Pipeline: bitstream 010111000111… → entropy decode → run-length decode → inverse zig-zag → dequantise → inverse DCT → reconstructed block]
Result of Coding and Decoding

Original block:          Reconstructed block:
  139 144 149 153          144 146 149 152
  144 151 153 156          148 150 152 154
  150 155 160 163          155 156 157 158
  159 161 162 160          160 161 161 162

Errors (original − reconstructed):
   -5  -2   0   1
   -4   1   1   2
   -5  -1   3   5
   -1   0   1  -2
Linear Prediction (Introduction):
• The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:

  ŷ(n) = Σ (i) a(i) y(n − i) + Σ (j) b(j) x(n − j)

  – The factors a(i) and b(j) are called the predictor coefficients.
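A minimal sketch of such a predictor; the coefficient values and sample histories below are purely illustrative:

```python
# Linear prediction: y_hat(n) = sum_i a[i]*y(n-1-i) + sum_j b[j]*x(n-j).
def predict(x, y, a, b):
    """Predict the next output sample from past outputs y and inputs x (most recent last)."""
    y_hat = sum(a_i * y[-1 - i] for i, a_i in enumerate(a))      # past output samples
    y_hat += sum(b_j * x[-1 - j] for j, b_j in enumerate(b))     # input samples
    return y_hat

x = [1.0, 0.5, 0.25, 0.125]     # input samples
y = [0.9, 0.45, 0.22]           # past output samples
a = [0.5, 0.25]                 # predictor coefficients on past outputs
b = [0.3]                       # predictor coefficient on the current input
print(predict(x, y, a, b))      # 0.5*0.22 + 0.25*0.45 + 0.3*0.125 = 0.26
```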
