Information Theory Handout BW
Module IV Part I
INFORMATION THEORY
MODULE III
PART I
Information theory and coding: Discrete messages, amount of information, entropy, information rate. Coding: Shannon's theorem, channel capacity, capacity of a Gaussian channel, bandwidth vs. S/N trade-off, use of orthogonal signals to attain Shannon's limit, efficiency of orthogonal signal transmission.
AMOUNT OF INFORMATION
Consider a communication system in which the allowable messages are m1, m2, ..., with probabilities of occurrence p1, p2, ..., where p1 + p2 + ... = 1. Let the transmitter transmit a message mk with probability pk, and let the receiver correctly identify the message. Then the amount of information conveyed by the system is defined as Ik = logb(1/pk) = -logb pk, where b is the base of the logarithm. The base may be 2, 10 or e. When the base is 2 the unit of Ik is the bit (binary unit); when it is 10 the unit is the Hartley or decit; when the natural logarithmic base is used the unit is the nat. Base 2 is commonly used to represent Ik.
AMOUNT OF INFORMATION
The above units are related as:

log2 a = ln a / ln 2 = log10 a / log10 2
The base 2 is preferred because in binary PCM the possible messages 0 and 1 occur with equal likelihood, and the amount of information conveyed by each bit is log2 2 = 1 bit.
IMPORTANT PROPERTIES OF IK
Ik approaches 0 as pk approaches 1. pk = 1 means the receiver already knows the message and there is no need for transmission, so Ik = 0. E.g., the statement "the sun rises in the east" conveys no information. Ik must be a non-negative quantity, since each message contains some information; in the worst case Ik = 0. The information content of a message having a higher probability of occurrence is less than that of a message having a lower probability. As pk approaches 0, Ik approaches infinity: the information content of a highly improbable event is unbounded.
NOTES
When the symbols 0 and 1 of PCM data occur with equal likelihood (probabilities 1/2), the amount of information conveyed by each bit is Ik(0) = Ik(1) = log2 2 = 1 bit. When the probabilities are different, the less probable symbol conveys more information. Let p(0) = 1/4 and p(1) = 3/4; then Ik(0) = log2 4 = 2 bits and Ik(1) = log2 (4/3) = 0.42 bit. When there are M equally likely and independent messages such that M = 2^N with N an integer, the information in each message is Ik = log2 M = log2 2^N = N bits.
3/23/2009
NOTES
In this case, if we use a binary PCM code to represent the M messages, the number of binary digits required to represent all the 2^N messages is also N; i.e. when there are M (= 2^N) equally likely messages, the amount of information conveyed by each message equals the number of binary digits needed to represent all the messages. When two independent messages mk and ml are correctly identified, the amount of information conveyed is the sum of the information associated with each of the messages individually: Ik = log2 (1/pk) and Il = log2 (1/pl). When the messages are independent, the probability of the composite message is pk pl.
EXAMPLE
EXAMPLE 1: A source produces one of four possible symbols during each interval with probabilities p(x1) = 1/2, p(x2) = 1/4, p(x3) = p(x4) = 1/8. Obtain the information content of each of these symbols.
ANS: I(x1) = log2 2 = 1 bit, I(x2) = log2 4 = 2 bits, I(x3) = I(x4) = log2 8 = 3 bits.
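The definition above can be checked numerically; the following sketch (the helper name `info_content` is ours, not from the handout) reproduces Example 1:

```python
import math

def info_content(p: float) -> float:
    """Amount of information I = log2(1/p), in bits, for a symbol of probability p."""
    return math.log2(1.0 / p)

# Example 1: p(x1) = 1/2, p(x2) = 1/4, p(x3) = p(x4) = 1/8
probs = [1/2, 1/4, 1/8, 1/8]
print([info_content(p) for p in probs])  # [1.0, 2.0, 3.0, 3.0]
```

Note that the less probable symbols carry more bits, as the properties of Ik require.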
I(k,l) = log2 (1/(pk pl)) = log2 (1/pk) + log2 (1/pl) = Ik + Il
AVERAGE INFORMATION,ENTROPY
Suppose we have M different and independent messages m1, m2, ..., with probabilities of occurrence p1, p2, .... Suppose further that during a long period of transmission a sequence of L messages has been generated. If L is very large, we may expect that the L-message sequence contains p1L messages of m1, p2L messages of m2, etc. The total information in such a sequence will be:
AVERAGE INFORMATION,ENTROPY
I_total = p1 L log2 (1/p1) + p2 L log2 (1/p2) + ...

The average information per message interval, represented by the symbol H, is given by

H = I_total / L = Σ_{k=1}^{M} pk log2 (1/pk)

Average information is also referred to as entropy. Its unit is information bits/symbol or bits/message.
AVERAGE INFORMATION,ENTROPY
When pk = 1, there is only a single possible message, and the receipt of that message conveys no information: H = log2 1 = 0. When pk → 0 the amount of information Ik → ∞, but the average information in this case is

lim_{p→0} p log2 (1/p) = 0

AVERAGE INFORMATION,ENTROPY

[Figure: plot of H as a function of p for a two-message source; H = 0 at p = 0 and p = 1, rising to HMAX at p = 1/2]
The average information associated with an extremely unlikely message, as well as with an extremely likely message, is zero. Consider a source that generates two messages with probabilities p and (1 - p). The average information per message is

H = p log2 (1/p) + (1 - p) log2 (1/(1 - p))

When p = 1, H = 0; when p = 0, H = 0.
AVERAGE INFORMATION,ENTROPY

The maximum value of H may be located by setting dH/dp = 0:

H = p log2 (1/p) + (1 - p) log2 (1/(1 - p)) = -p log2 p - (1 - p) log2 (1 - p)

dH/dp = -(log p + 1) + (log (1 - p) + 1) = log (1 - p) - log p = log ((1 - p)/p)

Setting dH/dp = 0 gives log ((1 - p)/p) = 0, so (1 - p)/p = 1, i.e. p = 1/2.

Similarly, when there are 3 messages the average information H becomes maximum when the probability of each of these messages is p = 1/3.

Extending this, when there are M messages H becomes a maximum when all the messages are equally likely with p = 1/M. In this case each message has a probability 1/M and

Hmax = Σ_{k=1}^{M} (1/M) log2 M = log2 M
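The entropy definition and its maximum can be verified with a short script; this is a sketch with our own naming, not part of the handout:

```python
import math

def entropy(probs):
    """H = sum of p*log2(1/p) over all symbols, in bits/symbol (p = 0 terms contribute 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

M = 8
uniform = [1.0 / M] * M
print(entropy(uniform))  # 3.0 = log2(8): equally likely messages maximise H

skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(skewed) < math.log2(4))  # True: any non-uniform source has H < log2 M
```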
INFORMATION RATE R
Let a source emit symbols at the rate r symbols/second. Then the information rate of the source is R = rH information bits/second, where R is the information rate, H is the entropy of the source and r is the rate at which symbols are generated:

R = r (symbols/second) x H (information bits/symbol) = rH (information bits/second)
EXAMPLE 1
A discrete source emits one of five symbols once every millisecond with probabilities 1/2, 1/4, 1/8, 1/16 and 1/16 respectively. Determine the source entropy and information rate.

H = Σ_{i=1}^{5} Pi log2 (1/Pi) = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/16)(4) + (1/16)(4) = 1.875 bits/symbol

Symbol rate r = 1/(1 ms) = 1000 symbols/sec, so R = rH = 1000 x 1.875 = 1875 bits/sec.
EXAMPLE 2
The probabilities of five possible outcomes of an experiment are given as P(x1) = 1/2, P(x2) = 1/4, P(x3) = 1/8, P(x4) = P(x5) = 1/16. Determine the entropy and information rate if there are 16 outcomes per second.

H = 1.875 bits/outcome, as in Example 1, so R = 16 x 1.875 = 30 bits/sec.
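Both examples use the same distribution, so a single sketch (our own helper names) checks them together:

```python
import math

def entropy(probs):
    """Source entropy H in bits/symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/16, 1/16]
H = entropy(probs)   # 1.875 bits/symbol
R1 = 1000 * H        # Example 1: one symbol per millisecond -> 1875 bits/sec
R2 = 16 * H          # Example 2: 16 outcomes per second -> 30 bits/sec
print(H, R1, R2)
```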
EXAMPLE 3
An analog signal band-limited to 10 kHz is quantized into 8 levels of a PCM system with probabilities 1/4, 1/5, 1/5, 1/10, 1/10, 1/20, 1/20 and 1/20 respectively. Find the entropy and the rate of information.

fm = 10 kHz, so fs = 2 x 10 kHz = 20 kHz, and the rate at which messages are produced is r = fs = 20 x 10^3 messages/sec.

H(X) = Σ_{i=1}^{8} P(xi) log2 (1/P(xi)) = (1/4) log2 4 + 2 x (1/5) log2 5 + 2 x (1/10) log2 10 + 3 x (1/20) log2 20 = 2.74 bits/message

R = rH(X) = 20 x 10^3 x 2.74 ≈ 54,800 bits/sec
EXAMPLE 4
Consider a telegraph source having two symbols dot and dash. The dot duration is 0.2s. The dash duration is 3 times the dot duration. The probability of the dots occurring is twice that of the dash and the time between symbols is 0.2s. Calculate the information rate of the telegraph source.
EXAMPLE 4 (Contd..)
p(dot) = 2 p(dash) and p(dot) + p(dash) = 1, so p(dash) = 1/3 and p(dot) = 2/3.

H(X) = p(dot) log2 (1/p(dot)) + p(dash) log2 (1/p(dash)) = (2/3)(0.585) + (1/3)(1.585) = 0.92 bit/symbol

Average time per symbol: Ts = p(dot) t(dot) + p(dash) t(dash) + t(space) = (2/3)(0.2) + (1/3)(0.6) + 0.2 = 0.533 s

So r = 1/Ts = 1.875 symbols/sec, and the information rate R = rH(X) = 1.875 x 0.92 = 1.72 bits/sec.
SOURCE CODING
Let there be M equally likely messages such that M = 2^N, with N an integer. If the messages are equally likely, the entropy H becomes maximum and is given by Hmax = log2 M = log2 2^N = N bits.
SOURCE CODING
The more likely a message is, the fewer the number of bits that should be used in its code word. Let X be a DMS with finite entropy H(X) and an alphabet {x1, x2, ..., xm} with corresponding probabilities of occurrence p(xi), where i = 1, 2, ..., m. Let the binary code word assigned to symbol xi by the encoder have length ni, measured in bits; the length of a code word is the number of bits in the code word. The average code word length L per source symbol is given by

L = Σ_{i=1}^{m} p(xi) ni
SOURCE CODING

[Figure: a DMS emitting symbols x1, ..., xm with probabilities p(x1), ..., p(xm) feeds a source encoder, which assigns binary code words of lengths n1, n2, ..., nm and outputs a binary sequence]
The parameter L represents the average number of bits per source symbol used in the source coding process. Code efficiency is defined as η = Lmin / L, where Lmin is the minimum possible value of L. When η approaches unity the code is said to be efficient. Code redundancy is defined as γ = 1 - η.
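These definitions translate directly into code; the example source and its prefix code below are hypothetical, chosen so that L = H exactly:

```python
import math

def avg_length(probs, lengths):
    """Average code word length L = sum p(xi) * ni, in bits/symbol."""
    return sum(p * n for p, n in zip(probs, lengths))

def efficiency(probs, lengths):
    """Code efficiency eta = H(X) / L, taking Lmin = H(X) (source coding theorem)."""
    H = sum(p * math.log2(1.0 / p) for p in probs if p > 0)
    return H / avg_length(probs, lengths)

probs = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]             # e.g. code words 0, 10, 110, 111
print(avg_length(probs, lengths))  # 1.75 bits/symbol
print(efficiency(probs, lengths))  # 1.0, so redundancy gamma = 1 - eta = 0
```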
SOURCE CODING
The conversion of the output of a DMS into a sequence of binary symbols (binary codes) is called source coding, and the device that performs it is called a source encoder. If some symbols are known to be more probable than others, then we may assign short code words to frequent source symbols and long code words to rare source symbols. Such a code is called a variable length code. As an example, in Morse code the letter E is encoded into a single dot, whereas the letter Q is encoded as "dash dash dot dash". This is because in the English language the letter E occurs more frequently than the letter Q.
L ≥ H(X), and L can be made as close to H(X) as desired for some suitably chosen code. Thus, with Lmin = H(X), the code efficiency is η = H(X)/L. No code can achieve efficiency greater than 1, but for any source there are codes with efficiency as close to 1 as desired. The proof does not give a method to find the best codes; it just sets a limit on how good they can be.
Consider any two probability distributions {p0, p1, ..., p_{M-1}} and {q0, q1, ..., q_{M-1}} on the alphabet {x0, x1, ..., x_{M-1}} of a discrete memoryless channel. Then

Σ_{i=0}^{M-1} pi log2 (qi/pi) = (1/ln 2) Σ_{i=0}^{M-1} pi ln (qi/pi) ..........(1)

By a special property of the natural logarithm (ln), we have

ln x ≤ x - 1,  x ≥ 0

Applying this property to eq (1),

Σ_{i=0}^{M-1} pi log2 (qi/pi) ≤ (1/ln 2) Σ_{i=0}^{M-1} pi (qi/pi - 1) = (1/ln 2) (Σ_{i} qi - Σ_{i} pi) = 0

Σ_{i=0}^{M-1} pi log2 (qi/pi) ≤ 0 ..........(2)

Now suppose the M messages x0, x1, ..., x_{M-1} are assigned the equal probabilities qi = 1/M for all i. Then eq (2) gives

Σ_{i=0}^{M-1} pi log2 (1/(M pi)) ≤ 0

Σ_{i=0}^{M-1} pi log2 (1/pi) ≤ Σ_{i=0}^{M-1} pi log2 M = log2 M, since Σ_{i} pi = 1 ..........(3)

Hence

H(X) ≤ log2 M

H(X) = log2 M if and only if pi = 1/M for all i, i.e. all the symbols in the alphabet are equiprobable. This upper bound on entropy corresponds to maximum uncertainty. Proof of the lower bound: since each probability pi ≤ 1, each term pi log2 (1/pi) is always non-negative, so H(X) ≥ 0; each term pi log2 (1/pi) is zero if and only if pi = 0 or 1, so H(X) = 0 if and only if pi = 1 for some i and all the others are zero. If M = 2^N with N an integer, the upper bound becomes H(X) ≤ N.
CLASSIFICATION OF CODES
Fixed Length Code: A fixed length code is one whose code word length is fixed. Codes 1 and 2 of Table 1 are fixed length codes. Variable Length Code: A variable length code is one whose code word length is not fixed. Shannon-Fano and Huffman codes are examples of variable length codes; codes 3, 4, 5 and 6 in Table 1 are variable length codes. Distinct Code: A code is distinct if each code word is distinguishable from the other code words. Codes 2, 3, 4, 5 and 6 are distinct codes. Prefix Code: A code in which no code word can be formed by adding code symbols to another code word is called a prefix code; no code word should be a prefix of another, e.g. codes 2, 4 and 6.
CLASSIFICATION OF CODES
Uniquely Decodable Code: A code is uniquely decodable if the original source sequence can be reconstructed perfectly from the encoded binary sequence. Code 3 of the table is not uniquely decodable, since the binary sequence 1001 may correspond to the source sequences x2x3x2 or x2x1x1x2. A sufficient condition to ensure that a code is uniquely decodable is that no code word is a prefix of another; thus codes 2, 4 and 6 are uniquely decodable codes. The prefix-free condition is not a necessary condition for unique decodability, e.g. code 5. Instantaneous Codes: A code is called instantaneous if the end of any code word is recognizable without examining subsequent code symbols. Prefix-free codes are instantaneous codes, e.g. code 6.
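The prefix test is mechanical and can be applied to the codes of Table 1; the helper below is our own sketch:

```python
def is_prefix_free(codewords):
    """True if no code word is a prefix of another (sufficient for unique decodability)."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

code3 = ["0", "1", "00", "11"]       # not uniquely decodable
code4 = ["0", "10", "110", "111"]    # prefix code
code5 = ["0", "01", "011", "0111"]   # uniquely decodable but NOT prefix-free
print(is_prefix_free(code3), is_prefix_free(code4), is_prefix_free(code5))
# False True False
```

Code 5 failing the test while remaining uniquely decodable illustrates that the prefix condition is sufficient but not necessary.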
CLASSIFICATION OF CODES

Table 1:

xi   Code 1   Code 2   Code 3   Code 4   Code 5   Code 6
x1   00       00       0        0        0        1
x2   01       01       1        10       01       01
x3   00       10       00       110      011      001
x4   11       11       11       111      0111     0001

Fixed Length Codes: 1, 2. Variable Length Codes: 3, 4, 5, 6. Distinct Codes: 2, 3, 4, 5, 6. Prefix Codes: 2, 4, 6. Uniquely Decodable Codes: 2, 4, 5, 6. Instantaneous Codes: 2, 4, 6.

PREFIX CODING (INSTANTANEOUS CODING)

Consider a discrete memoryless source with alphabet {x0, x1, ..., x_{m-1}} and statistics {p0, p1, ..., p_{m-1}}. Let the code word assigned to source symbol xk be denoted {mk1, mk2, ..., mkn}, where the individual elements are 0s and 1s and n is the code word length. The initial part of the code word is represented by mk1, ..., mki for some i ≤ n, and any sequence made up of the initial part of the code word is called a prefix of the code word. A prefix code is defined as a code in which no code word is a prefix of any other code word. It has the important property that it is always uniquely decodable; but the converse is not always true, i.e. a code that does not satisfy the prefix condition may still be uniquely decodable.
EXAMPLE
An analog signal is band-limited to fm Hz and sampled at Nyquist rate. The samples are quantized into 4 levels. Each level represents one symbol. The probabilities of occurrence of these 4 levels (symbols) are p(x1) = p(x4) = 1/8 and p(x2) = p(x3) = 3/8. Obtain the information rate of the source. Answer:
p( x 2 ) = p( x3 ) =
H(X ) =
3 8
p( x1) = p( x 4 ) =
1 8
= 1.8 bits/symbol.
A prefix code always satisfies Kraft inequality. But the converse is not always true.
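The Kraft inequality Σ 2^(-ni) ≤ 1 is easy to check numerically; the lengths below are taken from the codes of Table 1:

```python
def kraft_sum(lengths):
    """Kraft sum: sum of 2^(-ni); <= 1 is necessary for a uniquely decodable binary code."""
    return sum(2.0 ** -n for n in lengths)

print(kraft_sum([1, 2, 3, 3]))  # code 4: 1.0    (prefix code, meets Kraft with equality)
print(kraft_sum([1, 2, 3, 4]))  # code 5: 0.9375 (satisfies Kraft yet is not prefix-free)
print(kraft_sum([1, 1, 2, 2]))  # code 3: 1.5    (violates Kraft: not uniquely decodable)
```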
EXAMPLE
We are transmitting 3.6fm bits/second. There are four levels; these four levels may be coded using binary PCM as shown below.

Symbol   Probability   Binary digits
Q1       1/8           00
Q2       3/8           01
Q3       3/8           10
Q4       1/8           11

Two binary digits are needed to send each symbol. Since symbols are sent at the rate 2fm symbols/sec, the transmission rate of binary digits will be:
EXAMPLE
Binary digit rate = 2 binary digits/symbol x 2fm symbols/second = 4fm binary digits/second
Since one binary digit is capable of conveying 1 bit of information, the above coding scheme is capable of conveying 4fm information bits/sec. But we have seen earlier that we are transmitting 3.6fm bits of information per second. This means that the information-carrying ability of binary PCM is not completely utilized by this transmission scheme.
EXAMPLE
In the above example, if all the symbols were equally likely, i.e. p(x1) = p(x2) = p(x3) = p(x4) = 1/4, binary PCM coding would achieve the maximum information rate. Often this is difficult to achieve, so we go for alternative coding schemes to increase the average information per bit.
SHANNON-FANO CODING
Find the Shannon-Fano codes corresponding to the eight messages m1, m2, m3, ..., m8 with probabilities 1/2, 1/8, 1/8, 1/16, 1/16, 1/16, 1/32 and 1/32.
Message   Probability   Partition       Code     Bits/message
m1        1/2           0               0        1
m2        1/8           1 0 0           100      3
m3        1/8           1 0 1           101      3
m4        1/16          1 1 0 0         1100     4
m5        1/16          1 1 0 1         1101     4
m6        1/16          1 1 1 0         1110     4
m7        1/32          1 1 1 1 0       11110    5
m8        1/32          1 1 1 1 1       11111    5
SHANNON-FANO CODING

L = Σ_{i} p(xi) ni = (1/2)(1) + 2 x (1/8)(3) + 3 x (1/16)(4) + 2 x (1/32)(5) = 2.3125 bits/message, which equals H for this source (all the probabilities are negative powers of 2), so the code efficiency is 100%.

SHANNON-FANO CODING

There are 6 possible messages m1, m2, ..., m6 with probabilities 0.3, 0.25, 0.2, 0.12, 0.08, 0.05. Obtain the Shannon-Fano codes.

Message   Probability   Partition     Code    Length
m1        0.3           0 0           00      2
m2        0.25          0 1           01      2
m3        0.2           1 0           10      2
m4        0.12          1 1 0         110     3
m5        0.08          1 1 1 0       1110    4
m6        0.05          1 1 1 1       1111    4
SHANNON-FANO CODING

L = Σ_{i} p(xi) ni = 0.3(2) + 0.25(2) + 0.2(2) + 0.12(3) + 0.08(4) + 0.05(4) = 2.38 b/symbol

H = Σ_{i} p(xi) log2 (1/p(xi)) = 0.3 log2 (1/0.3) + 0.25 log2 (1/0.25) + 0.2 log2 (1/0.2) + 0.12 log2 (1/0.12) + 0.08 log2 (1/0.08) + 0.05 log2 (1/0.05) = 2.36 b/symbol

Efficiency η = H/L = 2.36/2.38 = 0.99 = 99%. Redundancy γ = 1 - η = 0.01 = 1%.

SHANNON-FANO CODING

A DMS has five equally likely symbols. Construct the Shannon-Fano code.

xi   P(xi)   Partition   Code   Length
x1   0.2     0 0         00     2
x2   0.2     0 1         01     2
x3   0.2     1 0         10     2
x4   0.2     1 1 0       110    3
x5   0.2     1 1 1       111    3

L = Σ_{i} p(xi) ni = 0.2(2 + 2 + 2 + 3 + 3) = 2.4 b/symbol

H = Σ_{i} p(xi) log2 (1/p(xi)) = log2 5 = 2.32 b/symbol, so η = 2.32/2.4 = 0.97.

SHANNON-FANO CODING

A DMS has five symbols x1, x2, x3, x4, x5 with probabilities 0.4, 0.19, 0.16, 0.15, 0.1. Construct the Shannon-Fano code.

xi   P(xi)   Partition   Code   Length
x1   0.4     0 0         00     2
x2   0.19    0 1         01     2
x3   0.16    1 0         10     2
x4   0.15    1 1 0       110    3
x5   0.1     1 1 1       111    3

L = Σ_{i} p(xi) ni = 0.4(2) + 0.19(2) + 0.16(2) + 0.15(3) + 0.1(3) = 2.25 b/symbol

H = Σ_{i} p(xi) log2 (1/p(xi)) = 2.15 b/symbol, so η = 2.15/2.25 = 0.956.
HUFFMAN CODING

(i) List the source symbols in order of decreasing probability. (ii) Combine (add) the probabilities of the two symbols having the lowest probabilities and reorder the resultant probabilities; this process is called bubbling. (iii) During the bubbling process, if the new weight is equal to an existing probability, the new branch is bubbled to the top of the group having the same probability. (iv) Complete the tree structure and assign a 1 to each branch rising up and a 0 to each branch coming down. (v) From the final point, trace the path to the required symbol and order the 0s and 1s encountered along the path to form the code.
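The steps above can be sketched with a priority queue. Note this uses the standard merge-the-two-smallest algorithm; its tie-breaking differs from the "bubbling to the top" rule, so individual code words may differ from the slides while the code lengths remain optimal:

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code: returns one code string per symbol, in input order."""
    # heap entries: (probability, tie-break counter, list of symbol indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    count = len(probs)
    while len(heap) > 1:
        p0, _, s0 = heapq.heappop(heap)   # least probable subtree gets bit 0
        p1, _, s1 = heapq.heappop(heap)   # next least probable gets bit 1
        for s in s0:
            codes[s] = "0" + codes[s]
        for s in s1:
            codes[s] = "1" + codes[s]
        count += 1
        heapq.heappush(heap, (p0 + p1, count, s0 + s1))
    return codes

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
codes = huffman_code(probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(codes, avg_len)  # code lengths 2,2,2,3,3; average length 2.2 bits/symbol
```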
[Figure: Huffman coding tree example; at each stage the two lowest probabilities are combined (0.6/0.4, 0.55/0.45, etc.), branches are labelled 1 (up) and 0 (down), and code words such as (1 1 1), (1 1 0), (1 0 1), (1 0 0) are read off along the paths to the symbols x2, x3, x4, x5]
CHANNEL REPRESENTATION
A communication channel may be defined as the path or medium through which symbols flow to the receiver. A Discrete Memoryless Channel (DMC) is a statistical model with an input X and an output Y, as shown below. During each signalling interval, the channel accepts an input symbol from X, and in response it generates an output symbol from Y. The channel is discrete when the alphabets of X and Y are both finite. It is memoryless when the current output depends only on the current input and not on any of the previous inputs.
CHANNEL REPRESENTATION

[Figure: DMC with input symbols x1, x2, ..., xm, output symbols y1, y2, ..., yn, and a transition probability p(yj|xi) on each input-output path]
CHANNEL REPRESENTATION
A diagram of a DMC with m inputs and n outputs is shown above. The input X consists of input symbols x1, x2, ..., xm; the output Y consists of output symbols y1, y2, ..., yn. Each possible input-to-output path is indicated along with a conditional probability p(yj|xi), which is the conditional probability of obtaining output yj given that the input is xi, called the channel transition probability. A channel is completely specified by the complete set of transition probabilities, so a DMC is often specified by the matrix of transition probabilities [P(Y|X)].
CHANNEL MATRIX
[P(Y|X)] =

P(y1|x1)   P(y2|x1)   ...   P(yn|x1)
P(y1|x2)   P(y2|x2)   ...   P(yn|x2)
...
P(y1|xm)   P(y2|xm)   ...   P(yn|xm)
CHANNEL MATRIX
Matrix [P(Y|X)] is called the channel matrix. Each row of the matrix specifies the probabilities of obtaining y1, y2, ..., yn given the corresponding input xi, so the sum of the elements in any row must be unity:

Σ_{j=1}^{n} p(yj|xi) = 1 for all i

If the input probabilities P(X) are represented by the row matrix [P(X)] = [p(x1) p(x2) ... p(xm)] and the output probabilities P(Y) by the row matrix [P(Y)] = [p(y1) p(y2) ... p(yn)], then the output probabilities may be expressed in terms of the input probabilities as

[P(Y)] = [P(X)] [P(Y|X)]
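The matrix relation can be exercised with plain lists (a minimal sketch, no external libraries; the numbers are those of the binary-channel example below):

```python
def output_probs(px, channel):
    """[P(Y)] = [P(X)][P(Y|X)]: row vector times channel matrix."""
    cols = len(channel[0])
    return [sum(px[i] * channel[i][j] for i in range(len(px))) for j in range(cols)]

P = [[0.9, 0.1],
     [0.2, 0.8]]
py = output_probs([0.5, 0.5], P)
print(py)  # approximately [0.55, 0.45]

# Cascading two identical channels is just applying the relation twice:
pz = output_probs(py, P)
print(pz)  # approximately [0.585, 0.415]
```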
CHANNEL MATRIX
If [P(X)] is represented as a diagonal matrix [P(X)]d, then

[P(X,Y)] = [P(X)]d [P(Y|X)]

where the (i, j) element of [P(X,Y)] is the joint probability p(xi, yj); [P(X,Y)] is known as the joint probability matrix.
LOSSLESS CHANNEL
A channel described by a channel matrix with only one nonzero element in each column is called a lossless channel. In a lossless channel no source information is lost in transmission.
[Figure: lossless channel; x1 → y1 (3/4) and y2 (1/4); x2 → y3 (1/3) and y4 (2/3); x3 → y5 (1)]

[P(Y|X)] =
3/4  1/4  0    0    0
0    0    1/3  2/3  0
0    0    0    0    1
DETERMINISTIC CHANNEL
A channel described by a channel matrix with only one nonzero element in each row is called a deterministic channel
DETERMINISTIC CHANNEL
[P(Y|X)] =
1  0  0
1  0  0
0  1  0
0  1  0
0  0  1

[Figure: deterministic channel; x1, x2 → y1; x3, x4 → y2; x5 → y3, each with probability 1]

Since each row has only one non-zero element, this element must be unity. When a given source symbol is sent over a deterministic channel, it is clear which output symbol will be received.
NOISELESS CHANNEL
A channel is called noiseless if it is both lossless and deterministic. The channel matrix has only one element in each row and each column, and this element is unity. The input and output alphabets are of the same size.
For a noiseless channel, [P(Y|X)] is the m x m identity matrix: xi → yi with probability 1 for i = 1, ..., m.

[Figure: noiseless channel; each input xi connects only to the corresponding output yi, with probability 1]

BINARY SYMMETRIC CHANNEL

A binary symmetric channel (BSC) has two inputs (x1 = 0, x2 = 1) and two outputs (y1 = 0, y2 = 1); each input is received correctly with probability 1 - p and crossed over with probability p. Its channel matrix is

[P(Y|X)] =
1-p  p
p    1-p

[Figure: BSC diagram; x1 = 0 → y1 = 0 and x2 = 1 → y2 = 1 with probability 1 - p; crossover paths with probability p]
EXAMPLE 1

[Figure: binary channel; x1 → y1 with probability 0.9 and y2 with probability 0.1; x2 → y2 with probability 0.8 and y1 with probability 0.2]

(i) Find the channel matrix of the binary channel. (ii) Find p(y1) and p(y2) when p(x1) = p(x2) = 0.5. (iii) Find the joint probabilities p(x1, y2) and p(x2, y1) when p(x1) = p(x2) = 0.5.
SOLUTION

(i) The channel matrix is

[P(Y|X)] = [p(y1|x1) p(y2|x1); p(y1|x2) p(y2|x2)] = [0.9 0.1; 0.2 0.8]

(ii) [P(Y)] = [P(X)][P(Y|X)] = [0.5 0.5][0.9 0.1; 0.2 0.8] = [0.55 0.45], so p(y1) = 0.55 and p(y2) = 0.45.

(iii) [P(X,Y)] = [P(X)]d [P(Y|X)] = [0.5 0; 0 0.5][0.9 0.1; 0.2 0.8] = [0.45 0.05; 0.1 0.4], so p(x1, y2) = 0.05 and p(x2, y1) = 0.1.

EXAMPLE 2

Two binary channels of the above example are connected in cascade. Find the overall channel matrix and draw the equivalent channel diagram. Find p(z1) and p(z2) when p(x1) = p(x2) = 0.5.

[Figure: two identical binary channels in cascade, x → y → z, each stage with transition probabilities 0.9/0.1 and 0.2/0.8]

SOLUTION

[P(Z|X)] = [P(Y|X)][P(Z|Y)] = [0.9 0.1; 0.2 0.8][0.9 0.1; 0.2 0.8] = [0.83 0.17; 0.34 0.66]

[P(Z)] = [P(X)][P(Z|X)] = [0.5 0.5][0.83 0.17; 0.34 0.66] = [0.585 0.415], so p(z1) = 0.585 and p(z2) = 0.415.

[Figure: equivalent single channel x → z with transition probabilities 0.83/0.17 and 0.34/0.66]
EXAMPLE 3

A channel has the channel matrix

[P(Y|X)] = [1-p  p  0; 0  p  1-p]

(i) Draw the channel diagram. (ii) If the source has equally likely outputs, compute the probabilities associated with the channel outputs for p = 0.2.

[Figure: channel diagram; x1 = 0 → y1 with probability 1-p and y2 with probability p; x2 = 1 → y3 with probability 1-p and y2 with probability p]
SOLUTION

This channel is known as the binary erasure channel (BEC). It has two inputs, x1 = 0 and x2 = 1, and three outputs, y1 = 0, y2 = e and y3 = 1, where e denotes an erasure: the output is in doubt, and so it should be erased. For p = 0.2,

[P(Y)] = [P(X)][P(Y|X)] = [0.5 0.5][0.8 0.2 0; 0 0.2 0.8] = [0.4 0.2 0.4]
EXAMPLE 5

[Figure: channel diagram; x1 → y1, y2, y3 each with probability 1/3; x2 → y1 and y3 with probability 1/4 each and y2 with probability 1/2; x3 → y1 and y2 with probability 1/4 each and y3 with probability 1/2]

(i) Find the channel matrix. (ii) Find the output probabilities if p(x1) = 1/2 and p(x2) = p(x3) = 1/4. (iii) Find the output entropy H(Y).
SOLUTION

[P(Y|X)] =
1/3  1/3  1/3
1/4  1/2  1/4
1/4  1/4  1/2

[P(Y)] = [P(X)][P(Y|X)] = [1/2 1/4 1/4][P(Y|X)] = [7/24 17/48 17/48]

H(Y) = Σ_{i=1}^{3} p(yi) log2 (1/p(yi)) = 1.58 b/symbol

[Figure: a general channel with input symbols x1, x2, ..., xn and output symbols y1, y2, ..., yn]
i. H(X) is the entropy of the transmitter. ii. H(Y) is the entropy of the receiver iii. H(X,Y) is the joint entropy of the transmitted and received symbols iv. H(X|Y) is the entropy of the transmitter with a knowledge of the received symbols. v. H(Y|X) is the entropy of the receiver with a knowledge of the transmitted symbols.
H(X) = Σ_{i} p(xi) log2 (1/p(xi))

H(Y) = Σ_{i} p(yi) log2 (1/p(yi))

H(X|Y) = Σ_{j=0}^{n-1} Σ_{i=0}^{m-1} p(xi, yj) log2 (1/p(xi|yj))

H(Y|X) = Σ_{j=0}^{n-1} Σ_{i=0}^{m-1} p(xi, yj) log2 (1/p(yj|xi))

H(X,Y) = Σ_{j=0}^{n-1} Σ_{i=0}^{m-1} p(xi, yj) log2 (1/p(xi, yj))

Since p(xi, yj) = p(xi|yj) p(yj),

H(X,Y) = Σ_{j} Σ_{i} p(xi, yj) [log2 (1/p(xi|yj)) + log2 (1/p(yj))] = H(X|Y) + Σ_{j} Σ_{i} p(xi, yj) log2 (1/p(yj))
MUTUAL INFORMATION
If the channel is noiseless then the reception of some symbol yj uniquely determines the message transmitted. Because of noise there is a certain amount of uncertainty regarding the transmitted symbol when yj is received. p(xi|yj) represents the conditional probability that the transmitted symbol was xi given that yj is received. The average uncertainty about x when yj is received is represented as
H(X | Y = yj) = Σ_{i=0}^{m-1} p(xi|yj) log2 (1/p(xi|yj))
The second term above, Σ_{j} Σ_{i} p(xi, yj) log2 (1/p(yj)) = Σ_{j} p(yj) log2 (1/p(yj)) = H(Y), so

H(X,Y) = H(X|Y) + H(Y)

Similarly,

H(X,Y) = H(Y|X) + H(X)
The quantity H(X|Y=yj) is itself a random variable that takes on values H(X|Y=y0), H(X|Y=y1),, H(X|Y=yn) with probabilities p(y0), p(y1),, p(yn).
MUTUAL INFORMATION
Now the average uncertainty about X when Y is received is

H(X|Y) = Σ_{j=0}^{n-1} Σ_{i=0}^{m-1} p(xi|yj) p(yj) log2 (1/p(xi|yj)) = Σ_{j=0}^{n-1} Σ_{i=0}^{m-1} p(xi, yj) log2 (1/p(xi|yj))
MUTUAL INFORMATION
If the channel were noiseless, the average amount of information received would be H(X) bits per received symbol; H(X) is the average amount of information transmitted per symbol. Because of channel noise we lose an average of H(X|Y) of information per symbol, so the receiver receives on the average H(X) - H(X|Y) bits per symbol. The quantity H(X) - H(X|Y) is denoted by I(X;Y) and is called the mutual information.
I(X;Y) = H(X) - H(X|Y) = Σ_{i=0}^{m-1} p(xi) log2 (1/p(xi)) - Σ_{i=0}^{m-1} Σ_{j=0}^{n-1} p(xi, yj) log2 (1/p(xi|yj))
H(X|Y) represents the average loss of information about a transmitted symbol when a symbol is received. It is called equivocation of X w. r. t. Y.
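Mutual information can be computed directly from a joint probability matrix; this sketch (helper name ours) uses the [P(X,Y)] found for the binary channel example earlier:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum p(x,y) * log2[ p(x,y) / (p(x)p(y)) ] over the joint matrix."""
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(len(joint))) for j in range(len(joint[0]))]
    I = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                I += pxy * math.log2(pxy / (px[i] * py[j]))
    return I

joint = [[0.45, 0.05],
         [0.10, 0.40]]
print(mutual_information(joint))  # about 0.397 bits/symbol
```

Since the formula is symmetric in x and y, transposing the joint matrix leaves the result unchanged, matching I(X;Y) = I(Y;X).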
MUTUAL INFORMATION

But Σ_{j=0}^{n-1} p(xi, yj) = p(xi), so

I(X;Y) = Σ_{i} Σ_{j} p(xi, yj) log2 (1/p(xi)) - Σ_{i} Σ_{j} p(xi, yj) log2 (1/p(xi|yj))
= Σ_{i} Σ_{j} p(xi, yj) log2 [p(xi|yj) / p(xi)]
= Σ_{i} Σ_{j} p(xi, yj) log2 [p(xi, yj) / (p(yj) p(xi))] ..........(1)

If we interchange the symbols xi and yj, the value of eq (1) is not altered, so I(X;Y) = I(Y;X), i.e. H(X) - H(X|Y) = H(Y) - H(Y|X).

CHANNEL CAPACITY

A particular communication channel has fixed source and destination alphabets and a fixed channel matrix, so the only variable quantity in the expression for the mutual information I(X;Y) is the source probability p(xi). Consequently, maximum information transfer requires specific source statistics, obtained through source coding. A suitable measure of the efficiency of information transfer through a DMC is obtained by comparing the actual information transfer to the upper bound of such transinformation for a given channel. The information transfer in a channel is characterised by the mutual information, and Shannon named the maximum mutual information the channel capacity.
CHANNEL CAPACITY
Channel capacity C is the maximum possible information transmitted when one symbol is transmitted from the transmitter. Channel capacity depends on the transmission medium, kind of signals, kind of receiver, etc. and it is a property of the system as a whole.
[Figure: binary symmetric channel; inputs x1 = 0 and x2 = 1 pass straight through with probability 1 - α and cross over with probability α]

The source alphabet consists of two symbols x1 and x2 with probabilities p(x1) = p and p(x2) = 1 - p. The destination alphabet is {y1, y2}. The average error probability per symbol is

pe = p(x1) p(y2|x1) + p(x2) p(y1|x2) = pα + (1 - p)α = α
Define the binary entropy function

Ω(x) = x log2 (1/x) + (1 - x) log2 (1/(1 - x))

The output probability of y1 is

p(y1) = p(y1|x1) p(x1) + p(y1|x2) p(x2) = (1 - α)p + α(1 - p) = α + p - 2αp

so that

H(Y) = Ω(α + p - 2αp)

The conditional entropy is

H(Y|X) = Σ_{i} Σ_{j} p(xi) p(yj|xi) log2 (1/p(yj|xi))
= p(x1) p(y1|x1) log2 (1/p(y1|x1)) + p(x1) p(y2|x1) log2 (1/p(y2|x1)) + p(x2) p(y1|x2) log2 (1/p(y1|x2)) + p(x2) p(y2|x2) log2 (1/p(y2|x2))
= (1 - α) log2 (1/(1 - α)) + α log2 (1/α) = Ω(α)

Hence I(X;Y) = H(Y) - H(Y|X) = Ω(α + p - 2αp) - Ω(α).
For a fixed α, Ω(α) is a constant, but the term Ω(α + p - 2αp) varies with the source probability p. It reaches its maximum value of 1 when α + p - 2αp = 1/2, which (for α ≠ 1/2) is satisfied by p = 1/2. Hence the channel capacity of the BSC is

C = 1 - Ω(α)

If the channel is noiseless, α = 0 and I(X;Y) = Ω(p) = H(X). On the other hand, if the channel is very much noisy, α = 1/2 and I(X;Y) = 0.
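The capacity formula is easy to evaluate numerically; `alpha` below stands for the crossover probability (the Greek symbol dropped in the original):

```python
import math

def omega(x):
    """Binary entropy function: x log2(1/x) + (1-x) log2(1/(1-x))."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return x * math.log2(1.0 / x) + (1 - x) * math.log2(1.0 / (1 - x))

def bsc_capacity(alpha):
    """C = 1 - omega(alpha) bits/symbol for a BSC with crossover probability alpha."""
    return 1.0 - omega(alpha)

print(bsc_capacity(0.0))  # 1.0 : noiseless channel
print(bsc_capacity(0.5))  # 0.0 : completely noisy channel
print(bsc_capacity(0.1))  # about 0.531 bits/symbol
```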
DIFFERENTIAL ENTROPY

For a continuous random variable X with PDF fX(x), define

h(X) = ∫ fX(x) log2 (1/fX(x)) dx

h(X) is called the differential entropy of X, to distinguish it from the ordinary or absolute entropy H(X). The difference between h(X) and H(X) can be explained as below.
EXAMPLE
A signal amplitude X is a random variable uniformly distributed in the range (-1,1). The signal is passed through an amplifier of gain 2. The output Y is also a random variable uniformly distributed in the range (-2,+2). Determine the differential entropies of X and Y
EXAMPLE

fX(x) = 1/2 for |x| < 1, and 0 otherwise

h(X) = ∫ from -1 to 1 of (1/2) log2 2 dx = (1/2)[x] from -1 to 1 = 1 bit

fY(y) = 1/4 for |y| < 2, and 0 otherwise

h(Y) = ∫ from -2 to 2 of (1/4) log2 4 dy = 2 bits
The entropy of the random variable Y is twice that of X. Here Y=2X and a knowledge of X uniquely determines Y. Hence the average uncertainty about X and Y should be identical. Amplification can neither add nor subtract information. But here h(Y) is twice as large as h(X). This is because h(X) and h(Y) are differential entropies and they will be equal only if their reference entropies are equal.
EXAMPLE

The reference entropy R1 for X is -log Δx and the reference entropy R2 for Y is -log Δy:

R1 = lim_{Δx→0} log (1/Δx),  R2 = lim_{Δy→0} log (1/Δy)

Since Y = 2X, corresponding intervals satisfy Δy = 2Δx, so in the limit as Δx, Δy → 0

R1 - R2 = lim log (Δy/Δx) = log2 2 = 1 bit

R1, the reference entropy of X, is higher than the reference entropy R2 for Y. Hence, if X and Y have equal absolute entropies, their differential entropies must differ by 1 bit.
MUTUAL INFORMATION
Then fX(x|y) Δx is the probability that X will lie in the interval (x, x + Δx) when Y = y, as Δx → 0. There is an uncertainty about the event that X lies in the interval (x, x + Δx); this uncertainty, log2 [1/(fX(x|y) Δx)], arises because of channel noise and therefore represents a loss of information. Because log2 [1/(fX(x) Δx)] is the information transmitted and log2 [1/(fX(x|y) Δx)] is the information lost over the channel, the net information received is the difference between the two:

log2 [1/(fX(x) Δx)] - log2 [1/(fX(x|y) Δx)] = log2 [fX(x|y) / fX(x)]
MUTUAL INFORMATION

Comparing with the discrete case, we can write the mutual information between the random variables X and Y as

I(X;Y) = ∫∫ fXY(x, y) log2 [fX(x|y) / fX(x)] dx dy

Equivalently, since fXY(x, y) = fX(x) fY(y|x),

I(X;Y) = ∫∫ fXY(x, y) log2 [fXY(x, y) / (fX(x) fY(y))] dx dy
MUTUAL INFORMATION

Now ∫ fX(x) log2 (1/fX(x)) dx = h(X) and ∫ fY(y|x) dy = 1, so

I(X;Y) = h(X) - ∫∫ fXY(x, y) log2 (1/fX(x|y)) dx dy

The second term on the RHS is the average over x and y of log2 [1/fX(x|y)]. But log2 [1/fX(x|y)] represents the uncertainty about x when y is received: it is the loss of information over the channel. The average of log2 [1/fX(x|y)] is therefore the average loss of information over the channel when some x is transmitted and y is received. By definition this quantity is represented by h(X|Y) and is called the equivocation of X with respect to Y:

h(X|Y) = ∫∫ fXY(x, y) log2 (1/fX(x|y)) dx dy

I(X;Y) = h(X) - h(X|Y)
CHANNEL CAPACITY
That is when some value of X is transmitted and when some value of Y is received the average information transmitted over the channel is I(X;Y). Channel capacity C is defined as the maximum amount of information that can be transmitted on the average.
C = max [I(X;Y)]

where the maximum is taken over all possible input distributions fX(x).
Let X be a Gaussian random variable with PDF

fX(x) = (1/(σ√(2π))) e^(-(x - μ)²/(2σ²)) ..........(2)

and let Y be any other random variable with the same mean μ and the same variance σ².

From the inequality Σ_{k} pk log2 (qk/pk) ≤ 0, whose continuous counterpart is

∫ fY(x) log2 [fX(x)/fY(x)] dx ≤ 0

we may write

∫ fY(x) log2 (1/fY(x)) dx ≤ ∫ fY(x) log2 (1/fX(x)) dx

h(Y) ≤ -∫ fY(x) log2 fX(x) dx ..........(1)

Substituting (2) in (1),

h(Y) ≤ -∫ fY(x) log2 [(1/(σ√(2π))) e^(-(x - μ)²/(2σ²))] dx
Expanding the logarithm and using ∫ fY(x) dx = 1 and ∫ (x - μ)² fY(x) dx = σ²,

h(Y) ≤ (1/2) log2 (2πeσ²)

with equality when Y itself is Gaussian, so for the Gaussian random variable

h(X) = (1/2) log2 (2πeσ²)

For a finite variance σ², the Gaussian random variable has the largest differential entropy attainable by any random variable, and that entropy is uniquely determined by its variance.
Next consider the conditional entropy for a channel with additive noise n, where Y = X + n:

h(Y|X) = ∫∫ fXY(x, y) log2 (1/fY(y|x)) dx dy = ∫ fX(x) [∫ fY(y|x) log2 (1/fY(y|x)) dy] dx

For a given x, y is equal to a constant x + n. Hence the distribution of Y when X has a given value is identical to that of n, except for a translation by x. If fN represents the PDF of the noise sample n,

fY(y|x) = fN(y - x)

Putting y - x = z,

∫ fY(y|x) log2 (1/fY(y|x)) dy = ∫ fN(z) log2 (1/fN(z)) dz

so h(Y|X) = h(n), the differential entropy of the noise.
h(n) = (1/2) log2 (2πeN)

where N = ηB is the noise power in the bandwidth B. The received signal Y = X + n has variance σ² = S + N, so the maximum output entropy is

hmax(Y) = (1/2) log2 (2πe(S + N))

The capacity per sample is

Cs = hmax(Y) - h(n) = (1/2) log2 ((S + N)/N) = (1/2) log2 (1 + S/N) bits/sample

Since the signal is transmitted as 2B samples/second,

C = 2B x (1/2) log2 (1 + S/N) = B log2 (1 + S/N) bits/second
When the bandwidth B increases, the channel capacity does not become infinite as expected, because with an increase in bandwidth the noise power also increases. Thus, for a fixed signal power and in the presence of white Gaussian noise, the channel capacity approaches an upper limit with increasing bandwidth.
C = B log2 (1 + S/N)

Putting N = ηB,

C = B log2 (1 + S/(ηB))

Putting x = S/(ηB), this may be written

C = (S/η)(1/x) log2 (1 + x) = (S/η) log2 (1 + x)^(1/x)

As B → ∞, x → 0, and since lim(x→0) (1 + x)^(1/x) = e,

C∞ = lim(B→∞) C = (S/η) log2 e = 1.44 S/η

This equation indicates that we may trade off bandwidth for signal-to-noise ratio and vice versa.
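The approach of C to the limit 1.44 S/η can be seen numerically. The signal power and noise density below are arbitrary assumed values chosen only to make the trend visible:

```python
import math

S, eta = 1.0, 1e-3                       # signal power (W) and noise PSD (W/Hz), assumed
limit = (S / eta) * math.log2(math.e)    # the Shannon limit 1.44*S/eta

caps = []
for B in (1e3, 1e4, 1e5, 1e6):           # increasing bandwidth, Hz
    C = B * math.log2(1 + S / (eta * B))
    caps.append(C)
    print(f"B = {B:>9.0f} Hz -> C = {C:7.1f} b/s")
print(f"limit 1.44*S/eta = {limit:.1f} b/s")
```

The capacities increase with B but saturate just below the limit, rather than growing without bound.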
For a given capacity C we can trade off S/N against B: if S/N is reduced we must increase the bandwidth, and if the bandwidth is to be reduced we must increase S/N.
∫[x1,x2] gi(x) gj(x) dx = 0,  i ≠ j

If we multiply and integrate the functions over the interval x1 to x2, the result is zero except when the signals are the same. A set of functions which has this property is described as being orthogonal over the interval from x1 to x2. The functions can be compared to vectors vi and vj, whose dot product is given by

vi · vj = |vi| |vj| cos θ

and is zero when the vectors are orthogonal.

A function f(x) may be expanded in terms of the orthogonal set:

f(x) = c1 g1(x) + c2 g2(x) + ... + cn gn(x) ..........(2)

where the c's are numerical constants. The orthogonality of the g's makes it easy to compute the coefficients cn. To evaluate cn we multiply both sides of eq (2) by gn(x) and integrate over the interval of orthogonality.
Because of orthogonality, all of the terms on the right-hand side become zero with a single exception:

∫[x1,x2] f(x) gn(x) dx = cn ∫[x1,x2] gn²(x) dx

so that

cn = ∫[x1,x2] f(x) gn(x) dx / ∫[x1,x2] gn²(x) dx ..........(3)
When the orthogonal functions are selected such that

∫[x1,x2] gn²(x) dx = 1

they are said to be normalised. The use of normalised functions has the advantage that the cn's can be calculated from eq (3) as

cn = ∫[x1,x2] f(x) gn(x) dx

without having to evaluate the integral ∫[x1,x2] gn²(x) dx.
A set of functions which are both orthogonal and normalised is called an orthonormal set.
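A short numerical sketch of eq (2) and eq (3): the basis gn(x) = sin(nx)/√π, orthonormal on (0, 2π), and the test coefficients are assumed examples, not from the handout.

```python
import numpy as np

x1, x2 = 0.0, 2 * np.pi
x = np.linspace(x1, x2, 200_001)
dx = x[1] - x[0]

def g(n, xx):
    # g_n(x) = sin(nx)/sqrt(pi): an orthonormal set on (0, 2*pi)
    return np.sin(n * xx) / np.sqrt(np.pi)

# Build f(x) with known coefficients c1 = 3, c2 = 0.5 as in eq (2)
f = 3.0 * g(1, x) + 0.5 * g(2, x)

# eq (3) with normalised g_n: c_n = integral of f*g_n over (x1, x2)
c = [float(np.sum(f * g(n, x)) * dx) for n in (1, 2, 3)]
print(c)  # approximately [3.0, 0.5, 0.0]
```

Because the gn are normalised, no division by ∫ gn² dx is needed; the coefficients fall straight out of the single integral.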
[Block diagram: a source of M messages selects one of the orthogonal signals s1(t), s2(t), ..., sM(t); the signal passes through an AWGN channel, and the receiver correlates the received waveform with each si(t) over (0, T) to produce the outputs e1, e2, ..., eM.]
The correlator corresponding to the transmitted message si(t) produces the output

ei = ∫[0,T] (si(t) + n(t)) si(t) dt = ∫[0,T] si²(t) dt + ∫[0,T] n(t) si(t) dt = Es + ni

It is adjusted to produce an output of Es, the symbol energy. In the presence of the AWGN waveform n(t), the output of the l-th correlator (l ≠ i) is

el = ∫[0,T] (si(t) + n(t)) sl(t) dt = ∫[0,T] sl(t) n(t) dt + ∫[0,T] si(t) sl(t) dt = ∫[0,T] n(t) sl(t) dt ≡ nl

since ∫[0,T] si(t) sl(t) dt = 0 for orthogonal signals. The quantity nl is a Gaussian random variable with zero mean and mean square value σ² = ηEs/2.

To determine which message has been transmitted we compare the matched-filter (correlator) outputs e1, e2, ..., eM. We decide that si(t) has been transmitted if the corresponding output ei is larger than the output of every other filter. The probability that some arbitrarily selected output el is less than the output ei is given by
p(el < ei) = (1/√(2πσ²)) ∫[-∞, ei] exp[-el²/(2σ²)] del ..........(1)
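The correlator statistics used above — output Es + ni for the transmitted signal, and zero-mean Gaussian nl with mean square ηEs/2 for the others — can be checked with a small simulation. The two sinusoidal signals, sampling rate and noise density below are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, fs = 1.0, 1000                 # symbol duration (s) and sample rate (Hz), assumed
n_samp = int(T * fs)
dt = 1.0 / fs
eta = 0.01                        # noise PSD: two-sided level eta/2, assumed
Es = 1.0                          # symbol energy

t = np.arange(n_samp) * dt
# Two orthogonal tones with an integer number of cycles in T, each of energy Es
s1 = np.sqrt(2 * Es / T) * np.sin(2 * np.pi * 5 * t / T)
s2 = np.sqrt(2 * Es / T) * np.sin(2 * np.pi * 6 * t / T)

trials = 5000
# White Gaussian noise: sample variance (eta/2)/dt reproduces PSD eta/2
noise = rng.normal(0.0, np.sqrt(eta / (2 * dt)), (trials, n_samp))
r = s1 + noise                    # s1(t) is transmitted every time

e1 = r @ s1 * dt                  # correlator matched to the sent signal
e2 = r @ s2 * dt                  # correlator matched to the other signal

print(e1.mean())   # close to Es = 1.0
print(e2.mean())   # close to 0
print(e2.var())    # close to eta*Es/2 = 0.005
```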
Since the M - 1 noise outputs are independent, the probability pL that ei = Es + ni exceeds all of the other M - 1 outputs is

pL = [p(el < ei)]^(M-1) = { (1/√(2πσ²)) ∫[-∞, Es+ni] exp[-el²/(2σ²)] del }^(M-1)

Let el/(√2 σ) = x. Then

pL = { (1/√π) ∫[-∞, (Es+ni)/(√2σ)] exp(-x²) dx }^(M-1)

Since σ² = ηEs/2, we have √2 σ = √(ηEs), and the upper limit becomes

(Es + ni)/(√2 σ) = √(Es/η) + ni/√(ηEs)

Hence pL is a function of Es/η, M and ni/(√2σ):

pL = pL(Es/η, M, ni/(√2σ))
To find the probability that ei is the largest output without reference to the noise output ni of the i-th correlator, we average pL over all possible values of ni. This average is the probability pC that we are correct in deciding that the transmitted signal corresponds to the correlator yielding the largest output. The probability of an error is then pE = 1 - pC. Since ni is a Gaussian random variable with zero mean and variance σ², the average value of pL over all possible values of ni is given by
pC = ∫[-∞,∞] pL(Es/η, M, ni/(√2σ)) (1/√(2πσ²)) exp[-ni²/(2σ²)] dni ..........(4)
Substituting y = ni/(√2σ) in (4),

pC = (1/√π) ∫[-∞,∞] exp(-y²) { (1/√π) ∫[-∞, √(Es/η)+y] exp(-x²) dx }^(M-1) dy

and the probability of error is

pe = 1 - pC
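Rather than evaluating the double integral for pC directly, pe can be estimated by Monte Carlo from the correlator model itself (ei = Es + ni, el = nl, σ² = ηEs/2). The parameter values below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 16
Es = 1.0

def pe_monte_carlo(eta, trials=200_000):
    sigma = np.sqrt(eta * Es / 2)                # correlator noise std
    e = rng.normal(0.0, sigma, (trials, M))     # noise outputs n1..nM
    e[:, 0] += Es                                # message 1 sent: e1 = Es + n1
    # An error occurs when some other correlator beats the correct one
    return float(np.mean(np.argmax(e, axis=1) != 0))

pe_noisy = pe_monte_carlo(eta=0.2)
pe_clean = pe_monte_carlo(eta=0.05)
print(pe_noisy, pe_clean)   # pe falls sharply as Es/eta grows
```

This reproduces the qualitative behaviour derived below: for fixed M, the error probability drops rapidly as Es/η increases.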
[Plot: error probability pe versus Es/(η log2 M) for several values of M, including M = 1024 and M = 2048; as M → ∞ the curves approach a step that falls abruptly at Es/(η log2 M) = ln 2 ≈ 0.69.]
The error probability pe depends on Es/η and M. Put

Es = Si T

where Si is the received signal power. Then

Es/η = Si T/η

Put (log2 M)/T = R, the rate at which information is transmitted. Then

Es/η = (Si/(ηR)) log2 M

so that, for a fixed Si/(ηR), Es/η increases as log2 M increases. As (Si/(ηR)) log2 M → ∞, pe → 0.
For fixed M and R, pe decreases as the noise power spectral density η decreases.
For fixed M and η, pe decreases as the signal power Si goes up.
For fixed Si, η and M, pe decreases as we allow more time T for the transmission of a single message, i.e., as the rate R is decreased.
For fixed Si, η and T, pe decreases as M decreases.
As M, the number of messages, increases (with the rate R held fixed), the error probability can be made arbitrarily small.
However, pe → 0 as M → ∞ only provided that

Si/(ηR) > ln 2

The limiting rate Rmax is therefore given by

Si/(ηRmax) = ln 2

Rmax = Si/(η ln 2) = 1.44 Si/η

The maximum rate obtained for this M-ary orthogonal (FSK) signalling is the same as that obtained from Shannon's theorem, C∞ = 1.44 S/η.