
Information Theory

Part 2
Prof H Xu
FEB 28, 2013
Useful Maths
What is the definition of a random variable?
A variable whose values are random but whose statistical distribution is known.
What is a probability density function (pdf)?
A function $f(x)$ describing how probability is distributed over the values the random variable can take: the probability that $X$ falls in an interval is the integral of $f$ over that interval.
Useful Maths
What is the cumulative distribution function (cdf)?
If the pdf is $f(x)$, then the cdf $F(x)$ is given as
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
The relationship between cdf and pdf:
$$f(x) = \frac{dF(x)}{dx}$$
Joint probability distribution and density function:
$$F_{xy}(x,y) = P(X \le x,\, Y \le y), \qquad f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\,\partial y}$$
Useful Maths
Let X be a random variable. Assume the probability density function (pdf) of X is $f(x)$ ($-\infty < x < \infty$).
The mean of X is defined as
$$E[X] = \mu = \int_{-\infty}^{+\infty} x\,f(x)\,dx$$
The variance of X is defined as
$$D[X] = \sigma^2 = \int_{-\infty}^{+\infty} (x-\mu)^2 f(x)\,dx$$
Reminder:
$$\int_{-\infty}^{+\infty} f(x)\,dx = 1$$
Useful Maths
Conditional probability density function:
$$f_{x|y}(x|y) = \frac{f(x,y)}{f_y(y)}$$
Statistical independence: two random variables X, Y are called statistically independent if and only if
$$f_{xy}(x,y) = f_x(x)\,f_y(y) \quad \text{or} \quad F_{xy}(x,y) = F_x(x)\,F_y(y)$$
Useful Maths
Let X be a random variable. Assume the probability density function (pdf) of X is $f(x)$ ($-\infty < x < \infty$).
Let $Y = g(X)$ also be a random variable. The pdf of Y is
$$f_y(y) = \begin{cases} f[h(y)]\,\lvert h'(y)\rvert & \alpha < y < \beta \\ 0 & \text{otherwise} \end{cases}$$
where $h(y)$ is the inverse function of $g(x)$, and
$$\alpha = \min\{g(-\infty),\, g(+\infty)\}, \qquad \beta = \max\{g(-\infty),\, g(+\infty)\}$$
Useful Maths
Example 1
Let $Y = aX$ ($a > 0$) and let the pdf of X be $f_x(x) = e^{-x}$, $x > 0$. Find $f_y(y)$.
$$F_Y(y) = P(aX \le y) = P(X \le y/a) = \int_0^{y/a} f_x(x)\,dx$$
$$f_y(y) = \frac{dF_Y(y)}{dy} = \frac{1}{a}\,f_x(y/a) = \frac{1}{a}\,e^{-y/a}$$
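This result can be checked numerically; a minimal sketch (the value a = 2 and the integration grid are illustrative choices, not from the slides):

```python
import numpy as np

def f_x(x):
    # pdf of X: f(x) = e^{-x} for x > 0 (exponential with mean 1)
    return np.exp(-x)

def f_y(y, a):
    # transformation formula: f_Y(y) = f_X(h(y)) * |h'(y)|, with h(y) = y/a
    return f_x(y / a) / a

a = 2.0                              # illustrative value, a > 0
dy = 1e-4
y = np.arange(dy / 2, 60.0, dy)      # midpoint grid on (0, 60)
pdf = f_y(y, a)
total = np.sum(pdf) * dy             # should be close to 1 (valid pdf)
mean = np.sum(y * pdf) * dy          # E[Y] = a * E[X] = a
print(round(total, 3), round(mean, 3))
```

The pdf integrates to one and the mean comes out as a, consistent with Y being exponential with mean a.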
Summary of Useful Maths
What is a random variable?
The meaning of the pdf
Know how to find the mean and variance
Know how to find the pdf of a function of a random variable
Discrete Memoryless Channels
A discrete memoryless channel is a statistical model with an input X and output Y, where Y is a noisy version of X.
(Figure: view of a discrete memoryless channel.)
The channel is described by the set of transition probabilities
$$p(y_k|x_j) = P(Y = y_k \mid X = x_j) \quad \text{for all } j \text{ and } k$$
Naturally, we have
$$0 \le p(y_k|x_j) \le 1 \quad \text{for all } j \text{ and } k$$
Discrete Memoryless Channels
For example:
$$p(y_0|x_0) = 1 - q, \qquad p(y_1|x_0) = q$$
$$p(y_0|x_1) = q, \qquad p(y_1|x_1) = 1 - q$$
Discrete Memoryless Channels
A convenient way of describing a discrete memoryless
channel is to arrange the various transition probabilities of the
channel in the form of a matrix as follows

$$\mathbf{P} = \begin{bmatrix} p(y_0|x_0) & p(y_1|x_0) & \cdots & p(y_{K-1}|x_0) \\ p(y_0|x_1) & p(y_1|x_1) & \cdots & p(y_{K-1}|x_1) \\ \vdots & \vdots & & \vdots \\ p(y_0|x_{J-1}) & p(y_1|x_{J-1}) & \cdots & p(y_{K-1}|x_{J-1}) \end{bmatrix}$$
Each row must sum to one:
$$\sum_{k=0}^{K-1} p(y_k|x_j) = 1 \quad \text{for all } j$$
For the binary channel above:
$$\mathbf{P} = \begin{bmatrix} p(y_0|x_0) & p(y_1|x_0) \\ p(y_0|x_1) & p(y_1|x_1) \end{bmatrix}$$
$$p(y_0|x_0) + p(y_1|x_0) = 1, \qquad p(y_0|x_1) + p(y_1|x_1) = 1$$
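The row-sum property is easy to verify in code; a small sketch using the binary symmetric channel above (q = 0.1 is an assumed illustrative value):

```python
import numpy as np

q = 0.1  # crossover probability (assumed value for illustration)
# rows indexed by input x_j, columns by output y_k
P = np.array([[1 - q, q],
              [q, 1 - q]])
# every row of a valid channel matrix must sum to 1
row_sums = P.sum(axis=1)
print(row_sums)
```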
Discrete Memoryless Channels
The joint probability distribution of the random variables X and Y is given by
$$p(x_j, y_k) = P(X = x_j,\, Y = y_k) = P(Y = y_k \mid X = x_j)\,P(X = x_j) = p(y_k|x_j)\,p(x_j)$$
where
$$p(x_j) = P(X = x_j) \quad \text{for } j = 0, 1, \ldots, J-1$$
For example, with
$$p(y_0|x_0) = 1 - q, \quad p(y_1|x_0) = q, \quad p(y_0|x_1) = q, \quad p(y_1|x_1) = 1 - q$$
and assuming $p(x_0) = p(x_1) = 1/2$:
$$p(x_0, y_0) = p(y_0|x_0)\,p(x_0) = \frac{1}{2}(1 - q)$$
Discrete Memoryless Channels
The marginal probability distribution of the output random variable Y is obtained by averaging out the dependence of $p(x_j, y_k)$ on $x_j$, as shown by
$$p(y_k) = P(Y = y_k) = \sum_{j=0}^{J-1} P(Y = y_k \mid X = x_j)\,P(X = x_j) = \sum_{j=0}^{J-1} p(y_k|x_j)\,p(x_j) \quad \text{for } k = 0, 1, \ldots, K-1$$
The probabilities $p(x_j)$ for $j = 0, 1, \ldots, J-1$ are known as the a priori probabilities of the various input symbols.
Discrete Memoryless Channels
$$p(y_k) = \sum_{j=0}^{J-1} p(y_k|x_j)\,p(x_j) \quad \text{for } k = 0, 1, \ldots, K-1$$
For example, with
$$p(y_0|x_0) = 1 - q, \quad p(y_1|x_0) = q, \quad p(y_0|x_1) = q, \quad p(y_1|x_1) = 1 - q$$
and assuming $p(x_0) = p(x_1) = 1/2$:
$$p(y_0) = p(y_0|x_0)\,p(x_0) + p(y_0|x_1)\,p(x_1) = \frac{1}{2}(1-q) + \frac{1}{2}q = \frac{1}{2}$$
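In matrix form the output marginal is just a vector-matrix product; a sketch under the same assumptions (q = 0.1 and uniform inputs are illustrative values):

```python
import numpy as np

q = 0.1
P = np.array([[1 - q, q],
              [q, 1 - q]])       # P[j, k] = p(y_k | x_j)
p_x = np.array([0.5, 0.5])       # a priori input probabilities
# p(y_k) = sum_j p(y_k | x_j) p(x_j)  ->  matrix form: p_y = p_x @ P
p_y = p_x @ P
print(p_y)
```

For the symmetric channel with uniform inputs, the output is uniform as well.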
Discrete Memoryless Channels
For $J = K$, the average probability of symbol error, $P_e$, is defined as the probability that the output random variable $y_k$ is different from the input random variable $x_j$, averaged over all $j \ne k$; we thus write
$$P_e = \sum_{k=0}^{K-1} \sum_{\substack{j=0 \\ j \ne k}}^{J-1} P(X = x_j,\, Y = y_k) = \sum_{k=0}^{K-1} \sum_{\substack{j=0 \\ j \ne k}}^{J-1} p(y_k|x_j)\,p(x_j)$$
The difference $1 - P_e$ is the average probability of correct reception.
Discrete Memoryless Channels
$$P_e = \sum_{k=0}^{K-1} \sum_{\substack{j=0 \\ j \ne k}}^{J-1} p(y_k|x_j)\,p(x_j)$$
If we are given the input a priori probabilities $p(x_j)$ and the channel matrix (i.e., the matrix of transition probabilities $p(y_k|x_j)$), then we can calculate the probabilities of the various output symbols, the $p(y_k)$.
Discrete Memoryless Channels
Example: assume
$$p(x=0) = p(x=1) = \frac{1}{2}$$
Then
$$p_e = p(y=1|x=0)\,p(x=0) + p(y=0|x=1)\,p(x=1)$$
For a symmetric channel, $p(y=1|x=0) = p(y=0|x=1)$, so
$$p_e = p(y=1|x=0) = p$$
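The average error probability is the off-diagonal mass of the joint distribution; a sketch for the symmetric channel (q = 0.1 is an assumed illustrative value):

```python
import numpy as np

q = 0.1
P = np.array([[1 - q, q],
              [q, 1 - q]])       # transition probabilities p(y_k | x_j)
p_x = np.array([0.5, 0.5])       # uniform a priori input probabilities
# P_e = sum over k != j of p(y_k | x_j) p(x_j): off-diagonal mass of the joint
joint = p_x[:, None] * P         # joint[j, k] = p(x_j, y_k)
P_e = joint.sum() - np.trace(joint)
print(P_e)
```

As the slide states, for the symmetric channel with uniform inputs this reduces to the crossover probability q.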
Review of entropy
$$H(S) = E\{I(p_k)\} = \sum_{k=1}^{K} p_k I(p_k) = \sum_{k=1}^{K} p_k \log\!\left(\frac{1}{p_k}\right)$$
Average amount of information per symbol

Average amount of surprise when observing the symbol

Uncertainty the observer has before seeing the symbol

Average number of bits needed to communicate the
symbol
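The entropy formula above translates directly into code; a minimal sketch:

```python
import numpy as np

def entropy(p):
    # H = sum_k p_k log2(1 / p_k); terms with p_k = 0 contribute 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))   # fair binary source: 1 bit/symbol
print(entropy([1.0, 0.0]))   # deterministic source: 0 bits
```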
Conditional entropy
Suppose we receive $y_k$; the conditional entropy of (uncertainty in) the transmitted symbol X is
$$H(X|y_k) = \sum_{j=0}^{J-1} p(x_j|y_k) \log_2\!\left(\frac{1}{p(x_j|y_k)}\right)$$
Alternative interpretation: the expected number of bits needed to transmit X if both the transmitter and the receiver know the value of $y_k$.
Recall the entropy of a source:
$$H(S) = E\{I(p_k)\} = \sum_{k=1}^{K} p_k I(p_k) = \sum_{k=1}^{K} p_k \log\!\left(\frac{1}{p_k}\right)$$
In the following discussion, we use X in place of S.
Conditional entropy
$$H(X|y_k) = \sum_{j=0}^{J-1} p(x_j|y_k) \log_2\!\left(\frac{1}{p(x_j|y_k)}\right)$$
Averaging over all received symbols gives what is called the equivocation:
$$H(X|Y) = \sum_{k=0}^{K-1} H(X|y_k)\,p(y_k) = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(y_k)\,p(x_j|y_k) \log_2\!\left(\frac{1}{p(x_j|y_k)}\right)$$
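The equivocation can be evaluated from the joint distribution; a sketch for the binary symmetric channel (q = 0.1 and uniform inputs are assumed illustrative values):

```python
import numpy as np

def equivocation(joint):
    # H(X|Y) = sum_{j,k} p(x_j, y_k) log2( 1 / p(x_j | y_k) )
    # using p(x_j | y_k) = p(x_j, y_k) / p(y_k)
    p_y = joint.sum(axis=0)                 # marginal of Y (column sums)
    h = 0.0
    for j in range(joint.shape[0]):
        for k in range(joint.shape[1]):
            p_jk = joint[j, k]
            if p_jk > 0:
                h += p_jk * np.log2(p_y[k] / p_jk)
    return float(h)

q = 0.1
p_x = np.array([0.5, 0.5])
P = np.array([[1 - q, q], [q, 1 - q]])
joint = p_x[:, None] * P                    # joint[j, k] = p(x_j, y_k)
print(round(equivocation(joint), 6))
```

For the symmetric channel with uniform inputs, H(X|Y) equals the binary entropy of q.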

Conditional entropy
$$H(X|Y) = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(y_k)\,p(x_j|y_k) \log_2\!\left(\frac{1}{p(x_j|y_k)}\right) = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(x_j, y_k) \log_2\!\left(\frac{1}{p(x_j|y_k)}\right)$$
The conditional entropy (or equivocation) quantifies the remaining entropy of a random variable X given that the value of a second random variable Y is known.
Information Gain
$H(X)$: average information. $H(X|Y)$: remaining average information.
The difference $H(X) - H(X|Y)$ is called the information gain contained in the channel.
Two extreme cases:
Perfect channel: $H(X) - H(X|Y) = H(X)$
No communication: $H(X) - H(X|Y) = 0$
How to prove these?
The objective of designing a channel is to make $H(X) - H(X|Y)$ as large as possible.
Other entropy functions
Now we can define several other entropy functions:
$$H(Y|X) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} p(x_j, y_k) \log_2\!\left(\frac{1}{p(y_k|x_j)}\right)$$
$$H(X,Y) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} p(x_j, y_k) \log_2\!\left(\frac{1}{p(x_j, y_k)}\right) = -E[\log_2 p(X,Y)]$$
$$H(X|Y) = \sum_{k=0}^{K-1} \sum_{j=0}^{J-1} p(x_j, y_k) \log_2\!\left(\frac{1}{p(x_j|y_k)}\right)$$
These lead to two useful relationships:
$$H(X,Y) = H(X|Y) + H(Y), \qquad H(X,Y) = H(Y|X) + H(X)$$
which follow from
$$p(x,y) = p(x|y)\,p(y), \qquad p(x,y) = p(y|x)\,p(x)$$
How to prove them?
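The first relationship can be checked numerically; a sketch for a binary symmetric channel (q = 0.2 and uniform inputs are assumed illustrative values):

```python
import numpy as np

def H(p):
    # Shannon entropy in bits of a probability array (zeros ignored)
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

q = 0.2
p_x = np.array([0.5, 0.5])
P = np.array([[1 - q, q], [q, 1 - q]])   # p(y_k | x_j)
joint = p_x[:, None] * P                 # p(x_j, y_k)
p_y = joint.sum(axis=0)

H_XY = H(joint)                          # joint entropy H(X,Y)
# H(X|Y) from its definition: sum p(x,y) log2( p(y) / p(x,y) )
H_X_given_Y = float(np.sum(joint * np.log2(p_y[None, :] / joint)))
# chain rule: H(X,Y) = H(X|Y) + H(Y)
print(abs(H_XY - (H_X_given_Y + H(p_y))) < 1e-12)
```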

Mutual Information
Definition of mutual information
Mutual information is the relative entropy (Kullback-Leibler distance) between the joint distribution and the product distribution of two random variables:
$$I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = E\!\left\{\log \frac{p(X,Y)}{p(X)\,p(Y)}\right\}$$
Mutual Information

$$I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = E\!\left\{\log \frac{p(X,Y)}{p(X)\,p(Y)}\right\}$$
Expanding the logarithm:
$$I(X;Y) = E\!\left\{\log \frac{p(X,Y)}{p(Y)\,p(X)}\right\} = E\!\left\{\log \frac{1}{p(X)}\right\} + E\{\log p(X|Y)\} = H(X) - H(X|Y)$$
Similarly,
$$I(X;Y) = E\!\left\{\log \frac{p(X,Y)}{p(X)\,p(Y)}\right\} = E\!\left\{\log \frac{1}{p(Y)}\right\} + E\{\log p(Y|X)\} = H(Y) - H(Y|X)$$
Hence
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$
since $H(X,Y) = H(Y,X)$.
For $I(X;Y) = H(Y) - H(Y|X)$, the concept of mutual information is essentially a measure of how much information about the random variable Y is contained in the random variable X.
Mutual Information
$$I(X;Y) = H(X) - H(X|Y)$$
Reduction in the uncertainty of X due to the knowledge of Y.
$$I(X;Y) = H(Y) - H(Y|X)$$
Reduction in the uncertainty of Y due to the knowledge of X.
Interpretation: on average, $H(X) - H(X|Y)$ bits will be saved if both transmitter and receiver know Y.
Mutual Information
If there is no communication (X and Y are independent):
$$I(X;Y) = H(X) - H(X|Y) = 0, \qquad H(X|Y) = H(X)$$
If the communication channel is perfect (X and Y are identical):
$$I(X;Y) = H(X) = H(Y), \qquad H(X|Y) = 0$$
If the communication channel is not perfect:
$$0 < I(X;Y) < H(X)$$
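Both extreme cases can be verified numerically from the definition of I(X;Y); a minimal sketch:

```python
import numpy as np

def mutual_information(joint):
    # I(X;Y) = sum p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_x * p_y)[mask])))

# independent X and Y (no communication): I = 0
indep = np.outer([0.5, 0.5], [0.5, 0.5])
# identical X and Y (perfect channel): I = H(X) = 1 bit here
perfect = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(indep), mutual_information(perfect))
```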

Channel Capacity
The channel capacity is the maximum value of $I(X;Y)$:
$$C = \max_{p(x)} I(X;Y) \quad \text{bits/symbol}$$
If the symbol rate is $r$ symbols/sec, then
$$C = r \max_{p(x)} I(X;Y) \quad \text{bits/sec}$$
Examples of Channel Capacity
Example 1: noiseless binary channel. Assume $p(x=0) = p(x=1) = 1/2$.
$$C = \max I(X;Y) = 1 \text{ bit}$$
Example 2: noisy channel with nonoverlapping outputs. Assume $p(x=0) = p(x=1) = 1/2$.
$$C = \max I(X;Y) = 1 \text{ bit}$$
Example 3: binary symmetric channel. Assume $p(x=0) = p(x=1) = 1/2$.
$$C = \max I(X;Y) = 1 - H(p) \text{ bits}$$
Example 4: binary erasure channel. Assume $p(x=0) = p(x=1) = 1/2$.
$$C = \max I(X;Y) = 1 - \alpha \text{ bits}$$
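The BSC capacity formula of Example 3 translates directly into code; a minimal sketch:

```python
import numpy as np

def H2(p):
    # binary entropy function H(p) in bits
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def bsc_capacity(p):
    # C = 1 - H(p), achieved by the uniform input distribution
    return 1.0 - H2(p)

print(bsc_capacity(0.0))   # noiseless: 1 bit per channel use
print(bsc_capacity(0.5))   # useless channel: 0 bits
```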
Example 5: Gaussian channel
$$Y = X + Z$$
where X is the input signal, Z is additive Gaussian noise, and Y is the output, with
$$E[X^2] = \sigma_S^2, \qquad E[Z^2] = \sigma_N^2$$
$$H(Z) = \frac{1}{2}\log(2\pi e\,\sigma_N^2)$$
$$H(Y) \le \frac{1}{2}\log[2\pi e(\sigma_N^2 + \sigma_S^2)]$$
(Both results will be proved later.)
Capacity of Gaussian Channel
Since $H(Y|X) = H(Z)$,
$$I(X;Y) = H(Y) - H(Z) \le \frac{1}{2}\log[2\pi e(\sigma_S^2 + \sigma_N^2)] - \frac{1}{2}\log(2\pi e\,\sigma_N^2)$$
$$= \frac{1}{2}\log\!\left[\frac{2\pi e(\sigma_S^2 + \sigma_N^2)}{2\pi e\,\sigma_N^2}\right] = \frac{1}{2}\log\!\left(1 + \frac{\sigma_S^2}{\sigma_N^2}\right)$$
where $\sigma_S^2/\sigma_N^2$ is the signal-to-noise ratio per symbol.
Channel capacity:
$$C = \max_{p(x):\; E[X^2] \le \sigma_S^2} I(X;Y)$$
Recall the Gaussian distribution:
$$p(z) = \frac{1}{\sqrt{2\pi\sigma_N^2}}\,\exp\!\left(-\frac{z^2}{2\sigma_N^2}\right)$$
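The per-symbol Gaussian capacity formula can be evaluated directly; a minimal sketch:

```python
import numpy as np

def gaussian_capacity(snr):
    # C = (1/2) log2(1 + sigma_S^2 / sigma_N^2) bits per symbol
    return 0.5 * np.log2(1.0 + snr)

print(gaussian_capacity(1.0))   # SNR = 1  -> 0.5 bit/symbol
print(gaussian_capacity(3.0))   # SNR = 3  -> 1 bit/symbol
```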

Capacity of Gaussian Channel
Prove: $H(Z) = \frac{1}{2}\log(2\pi e\,\sigma_N^2)$
$$H(Z) = -E[\log_2 p(z)] = E\!\left[\frac{z^2}{2\sigma_N^2}\log_2 e + \frac{1}{2}\log_2(2\pi\sigma_N^2)\right]$$
$$= \frac{\log_2 e}{2\sigma_N^2}\,E[z^2] + \frac{1}{2}\log_2(2\pi\sigma_N^2) = \frac{1}{2}\log_2 e + \frac{1}{2}\log_2(2\pi\sigma_N^2) = \frac{1}{2}\log_2(2\pi e\,\sigma_N^2)$$
Also, since X and Z are independent and zero-mean,
$$E[Y^2] = E[(X+Z)^2] = E[X^2] + E[Z^2] \le \sigma_S^2 + \sigma_N^2$$
Capacity of Gaussian Channel
Prove: $H(Y) \le \frac{1}{2}\log[2\pi e(\sigma_N^2 + \sigma_S^2)]$
For a fixed variance, entropy is maximized when the distribution is Gaussian. Since $E[Y^2] \le \sigma_N^2 + \sigma_S^2$, we have
$$H(Y) \le \frac{1}{2}\log[2\pi e(\sigma_N^2 + \sigma_S^2)]$$
while
$$H(Z) = \frac{1}{2}\log(2\pi e\,\sigma_N^2)$$

Capacity of a Band-Limited Continuous Gaussian Channel
The capacity of a band-limited continuous Gaussian channel is given by the Shannon-Hartley law (also called Shannon's third theorem or the channel capacity theorem):
$$C = B \log_2\!\left(1 + \frac{S}{N}\right)$$
where B is the bandwidth, S/N is the signal-to-noise ratio, and $N = N_0 B$, where $N_0$ is the single-sided power spectral density of the Gaussian noise. Hence
$$C = B \log_2\!\left(1 + \frac{S}{N_0 B}\right)$$
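The Shannon-Hartley law is straightforward to evaluate; a sketch with illustrative telephone-channel-like numbers (B = 3 kHz and S/N = 1000 are assumptions, not from the slides):

```python
import numpy as np

def shannon_hartley(B, snr):
    # C = B log2(1 + S/N) bits per second
    return B * np.log2(1.0 + snr)

B = 3000.0      # bandwidth in Hz (illustrative)
snr = 1000.0    # signal-to-noise ratio (illustrative, ~30 dB)
C = shannon_hartley(B, snr)
print(C)        # roughly 30 kbit/s
```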
Noisy channel coding theorem
Sometimes called
Shannon's second theorem
Fundamental theorem of information theory

Given a discrete memoryless source with rate R and a discrete memoryless channel with capacity C, if $R \le C$ then there exists a coding scheme for which the source output can be transmitted over the channel with an arbitrarily small probability of error.

The channel capacity specifies the fundamental limit on the rate at which information can be transmitted reliably over a channel.

Assignment:
Review of information theory: part two
Due date: March 15, 2012