
Information Theory and Coding 314

Dr. Roberto Togneri

Dept. of E&E Engineering


The University of Western Australia

Section 1
Concepts in Information Theory
(Information Sources)

DISCLAIMER
Whilst every attempt is made to maintain the accuracy and correctness of these notes,
the author, Dr. Roberto Togneri, makes no warranty or guarantee or promise express
or implied concerning the content

1. CONCEPTS IN INFORMATION THEORY

1.1 What is Information?

The idea is to describe 'information' in such a way so that it can be quantified. Hence we can then
do things like:
• examine the information characteristics of sources, channels, receivers, etc.
• optimise transmission of information (source codes for compression)
• improve reliability of transmission (channel codes for error detection and correction)

Our analysis will be restricted to the case of discrete information sources.

A simple example which highlights a fundamental relation of information theory follows. Consider
a system which needs to transmit the weather in City A to a receiving station in City B. Let us
suppose that we describe the weather as being: sunny, cloudy, rainy, foggy. Each weather indication
represents a source symbol, and a sequence of, say, daily weather indications is a message. We
propose that the information can be defined as the amount of uncertainty that the receiver has about
what is being sent. For instance, suppose that the receiver has the following probability relations for
the weather:

Weather Symbol Probability


sunny 0.65
cloudy 0.20
rainy 0.10
foggy 0.05

Since City A is nearly always sunny, being told that it is sunny conveys little information. On the
other hand, if it is foggy, this is very surprising and conveys the most information.

We note that in information theory we are only concerned with the transmission aspect, not the
semantic aspect of information. In other words, we make no statement on the truth or meaning of
what is being sent.

This definition of information can be examined by considering the coding and transmission of the
weather. Given the above probabilities a typical message of daily weather transmissions could be:

'sunny sunny sunny sunny sunny cloudy cloudy rainy sunny'


The message has to be transmitted digitally. Thus, each source symbol has to be mapped to a binary
codeword. One such coding scheme is as follows:

Weather Symbol Code α


sunny 00
cloudy 01
rainy 10
foggy 11

Thus the above message is coded as:


00 00 00 00 00 01 01 10 00
and the total message length is 18 bits

Can we transmit the same weather information with a shorter message? Try:

Weather Symbols Code β


sunny 0
cloudy 10
rainy 110
foggy 111

Then:
0 0 0 0 0 10 10 110 0
and the total length of the message has now been reduced to 13 bits
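As a quick check of the two codes, the comparison can be scripted. The following is a minimal Python sketch (the notes do not prescribe any language, and the names below are illustrative only):

```python
# Compare the total message length under the fixed-length code (alpha)
# and the variable-length code (beta) for the nine-symbol weather message.
code_alpha = {'sunny': '00', 'cloudy': '01', 'rainy': '10', 'foggy': '11'}
code_beta  = {'sunny': '0',  'cloudy': '10', 'rainy': '110', 'foggy': '111'}

message = ['sunny', 'sunny', 'sunny', 'sunny', 'sunny',
           'cloudy', 'cloudy', 'rainy', 'sunny']

for name, code in (('alpha', code_alpha), ('beta', code_beta)):
    encoded = ''.join(code[symbol] for symbol in message)
    print(name, encoded, len(encoded), 'bits')   # 18 bits for alpha, 13 bits for beta
```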

Thus we may say that to transmit a shorter message we require:


1
Length (in bits) of Codeword α
Probability of Source Symbol

Indeed, the length of codeword for optimum transmission (i.e. minimum message length) is a direct,
but not perfect, measurement of the information needed to transmit the source symbol. Thus the
more likely a symbol is the smaller the codeword that is required and the less information is needed
to transmit the source symbol. However the assignment of probability is based on the frequency of
occurrence of a symbol in relation to all other symbols, so we now need to properly define our
discrete information source and quantify both the information of each symbol and the average
information of the source. The latter is termed the entropy and is a fundamental quantity of
information theory.


1.2 Self-information and Average information (entropy)

We define an information source, S, as a collection of discrete symbols, si, representing an


indication, event or an indivisible and definable quantity or operation that the source transmits:

source → s1 s3 s4 s2 …

The collection of discrete symbols forms a fixed finite source alphabet of size q:
S = {s1, s2, ..., sq}
Thus the information source can transmit any one of q different symbols.
Example 1.1
Consider an information source of dialled telephone numbers. The information source alphabet
consists of the digit symbols: S = {0,1,2,3,4,5,6,7,8,9}
An information source of E-mail text would obviously consist of the letters of the alphabet, both
lower and upper case, space, and the various punctuation symbols:
S = {a, b, ..., z, A, B, ..., Z, space, . , ; , : , ! , ? , ,}

An information source representing the {on, off} status of a switch or actuator is a binary
information source: S = { off, on}

Or a more sophisticated motor or robot control information source could have the following source
alphabet: S = {off , on ,forward , backward , stop }
Now that we have defined an information source a mathematical model is needed to describe the
source for the purposes of quantifying the information content. Why do we want to do this? One
reason is that by quantifying the information content we can transmit and store messages from the
source more efficiently. From the previous section it is evident that the probability of occurrence of
each source symbol is an important parameter for the design of efficient codes for transmission.
Indeed this provides us with the simplest model of an information source:
Zero-memory model

We assume that successive symbols emitted from the source are statistically independent. Such a
source is described completely by the source alphabet, S, and the source symbol probabilities:
{P( s1 ), P( s2 ), ... , P( sq )}
Since a source symbol is always transmitted with probability 1, then the following must be true:
∑_{i=1}^{q} P(si) = ∑_S P(si) = 1

In the previous section it was found that the length of a binary codeword is proportional to the
inverse of the probability of occurrence. And indeed we define the self-information in a similar
way:


Self-information

If the symbol si occurs it is deemed to have provided:


I(si) = log2 (1/P(si)) bits
Equation 1.1
of information or self-information.

Why do we use logarithms?

This can be easily seen in the special case of equi-probable events. Say that there is a total of q symbols in the alphabet; then the number of bits, N, to represent all q symbols is:

q = 2^N

that is, each source symbol will need to be represented by:

N = log2 q

bits of information. Now with equi-probable symbols we have P(si) = 1/q, so N = log2(1/P(si)) bits are needed to represent si and this is exactly the expression we have defined for self-information.
The case for general information sources then follows.

Example 1.2
Consider the binary source S = {0,1} with probabilities P(0) = 0.5 and P(1) = 0.5 then we get the
obvious result that I(0) = I(1) = log 2 (1/0.5) = log 2 (2) = 1 bit! Thus for equi-probable binary
sources each symbol requires 1 bit of information.

Now consider the binary source with probabilities P(0) = 0.7 and P(1) = 0.3. We have I(0) =
log 2 (1/0.7) = 0.51 bits and I(1) = log 2 (1/0.3) = 1.74 bits which indicates that 0 can be represented
in less than 1 bit but 1 will need to be represented by more than 1 bit. It doesn't tell us how to do this
(indeed how can 0 be represented by half a bit?) but it tells us that this is what we should do to
transmit or store the information efficiently (i.e. without redundancy). How this can be achieved
will be discussed in Chapter 3.

The choice of logarithms to the base 2 is evident when considering digital storage and transmission
(in bits). However other logarithm bases can also be defined leading to different unit names:
I(si) = ln(1/P(si)) nats (base e)
I(si) = log10(1/P(si)) Hartleys (base 10)

Our definition of self-information can now be extended to quantifying the average information of
the source itself. This is a more important quantity since storage and transmission of information
sources must take into account the average or typical source message rather than the per symbol
information. Indeed this measure, the entropy, defines the most important quantity of information
theory.


The average information, or entropy, of the source alphabet can be calculated as follows. Each
symbol si occurs with probability P(si) and provides I(si) bits of information. The average amount
of information obtained per symbol from the source is then:

∑_S P(si) I(si) bits

Entropy

The average information or entropy, H(S), of the source S is:

H(S) = ∑_S P(si) I(si) = ∑_S P(si) log(1/P(si)) bits
Equation 1.2

The following should be noted:


1. Since we are mainly concerned with digital (binary) sources and channels logarithms will be
assumed to be base 2 and we can omit the subscript.
2. The form of Equation 1.2 is related to the entropy of statistical thermodynamics, hence the use
of the term entropy for Equation 1.2.
3. H(.) denotes entropy; I(.) denotes information.

Example 1.3
Consider the source S = {s1, s2, s3} with P(s1) = 0.5, P(s2) = P(s3) = 0.25. Then:

H(S) = (0.5) log 2 + (0.25) log 4 + (0.25) log 4 = 1.5 bits

Thus a typical message from the source will require on average 1.5 bits of information per source
symbol.

Calculation of entropy can be facilitated on most modern calculators with a built-in ln( ) or log10( )
function by using the change of base formula, based on ln( ), as follows:
Neat Entropy Calculation Trick!

Calculate He(S) = -∑_{i=1}^{q} P(si) ln P(si) and then:

Hr(S) = He(S) / ln r for the entropy calculated to any base r
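As an illustration of Equation 1.2 and of the change-of-base trick above, a minimal Python sketch (the function name is chosen here for illustration only):

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a zero-memory source: computed in nats via ln(), then
    converted to the requested base r using Hr(S) = He(S) / ln(r)."""
    h_e = -sum(p * math.log(p) for p in probs if p > 0)   # He(S) in nats
    return h_e / math.log(base)

# Example 1.3: P(s1) = 0.5, P(s2) = P(s3) = 0.25
print(round(entropy([0.5, 0.25, 0.25]), 4))   # -> 1.5 bits per symbol
```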

1.3 Minimum and Maximum Entropy

Consider an information source with q symbols. Each symbol, si, is assigned a probability, P(si),
and the entropy of the source, H(S), is then calculated. Since H(S) represents the average
information of the source it would be interesting to determine what assignment of probabilities, for
the same size alphabet, will yield a source with minimum H(S), what this minimum H(S) is, and
hence how to identify these special sources with minimum H(S). More importantly, we would like
to know what special sources have maximum H(S) and what this maximum is. Indeed we will
attempt to guess what the answer is based on our intuitive knowledge and understanding of entropy
definition and then verify this by mathematical derivation.


1.3.1 Minimum Entropy

In Section 1.1 we proposed that information represents a measure of uncertainty of the occurrence
of the source symbol. With this definition it is obvious that the minimum average information will
occur when we are certain with probability 1 as to what the source will do (i.e. no uncertainty). This
can only happen if one of the q symbols always occurs with probability 1 while all the other
symbols occur with probability 0 (i.e. they never occur). To see this is indeed the case:

From Equation 1.2 if a symbol, si, has P(si) = 1, then we must also have that P(sj) = 0 for j ≠ i and:
H(S) = 1 · log(1/1) + ∑_{sj≠si} 0 · log(1/0)

Now we have that: lim_{P(sj)→0} P(sj) log(1/P(sj)) = 0
Thus we get the following result:
Result 1.1
H(S) ≥ 0 with equality when P(si) = 1 for one of the symbols, si.

This result makes sense since if the receiver is certain that it is always going to get si (since P(si) =
1) there is no need for the source to transmit anything!

1.3.2 Maximum Entropy

Since maximum entropy represents the opposite extreme to minimum entropy and can be thought of
as the case when we have the maximum amount of uncertainty as to what the source will do next, it
is intuitively obvious that this should occur when all source symbols are equally likely. Furthermore
with q equally likely symbols we expect to require no less than log q bits to represent each symbol.
We now proceed to prove that the condition for maximum entropy occurs for the special case of
equi-probable symbols and that H(S) = log q bits is the maximum entropy.

To determine the case of maximum entropy we start by making use of the following inequality:
ln x ≤ x - 1 with equality at x = 1
⇒ -ln x ≥ 1 - x
⇒ ln(1/x) ≥ 1 - x
Equation 1.3

We also define an important fundamental quantity by considering the sources:


S = {si}, i = 1, 2, 3, ..., q
U = {ui}, i = 1, 2, 3, ..., q

Note that there are q symbols in both sources S and U. We define the following quantity:

∑_{i=1}^{q} pi log(ri/pi) = (1/ln 2) ∑_{i=1}^{q} pi ln(ri/pi)

where pi = P(si) and ri = P(ui)


Now we know from the change of base formula that: log2 x = ln x / ln 2

Hence using the inequality ln x ≤ (x - 1) we get:

∑_{i=1}^{q} pi log(ri/pi) ≤ (1/ln 2) ∑_{i=1}^{q} pi (ri/pi - 1)   with equality if ri/pi = 1 for all i
                         ≤ (1/ln 2) ∑_{i=1}^{q} (ri - pi) = (1/ln 2) (∑_{i=1}^{q} ri - ∑_{i=1}^{q} pi)
                         ≤ 0

Fundamental Inequality

A fundamental relation which plays an important part in information theory is:


∑_{i=1}^{q} pi log(ri/pi) ≤ 0
Equation 1.4
with equality only if pi = ri for all i.

To derive the condition for maximum entropy we consider what happens when ri = 1/q. Making the appropriate substitution in Equation 1.4 and with further derivation:

∑_{i=1}^{q} pi log(1/(q pi)) ≤ 0

∑_{i=1}^{q} pi log(1/q) + ∑_{i=1}^{q} pi log(1/pi) ≤ 0

1 · log(1/q) + ∑_{i=1}^{q} pi log(1/pi) ≤ 0

∑_{i=1}^{q} pi log(1/pi) ≤ log q

H(S) ≤ log q

and from Equation 1.4 the condition for maximum entropy (i.e. when H(S) = log q) is pi = ri = 1/q, that is, when all source symbols are equi-probable.


Result 1.2

H(S) ≤ log q, for a source alphabet with q symbols, with the maximum H(S) = log q when P(si) = 1/q, that is: maximum entropy, H(S) = log q, occurs when all symbols are equiprobable.

With digital information sources, S = {0,1}, q = 2 and we can consider this special case analytically. The source symbol probabilities are designated as:
P(0) = p and P(1) = 1 - P(0) = 1 - p = p̄
The entropy for such a source can be expressed in the following functional form:
H(S) = p log(1/p) + p̄ log(1/p̄) = F(x = p) where F(x) = x log(1/x) + (1 - x) log(1/(1 - x))
The function, F(x = p), can be plotted as follows:

[Figure 1.1 Plot of H(S) = F(x = p): H(S) (vertical axis, 0.0 to 1.0) versus p (horizontal axis, 0.0 to 1.0)]
From Figure 1.1 we see that:
1. If we are certain about the output (either p = 0 or p = 1) then no information is provided (H(S) =
0).
2. Maximum entropy is H(S) = 1.0 which occurs when p = 0.5, that is, equi-probable source
symbols.
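The behaviour summarised in the two points above can be checked numerically; the sketch below (illustrative only) evaluates F(x = p) over a few values of p:

```python
import math

def binary_entropy(p):
    """F(x = p) = p log2(1/p) + (1 - p) log2(1/(1 - p)), with F(0) = F(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}   H(S) = {binary_entropy(p):.3f} bits")
# The values rise from 0 to a maximum of 1.000 bit at p = 0.5 and fall back to 0,
# consistent with Result 1.2 (log2(2) = 1 bit for equiprobable binary symbols).
```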


1.4 Joint and Conditional entropy

1.4.1 Joint entropy

The joint entropy is used when considering extensions to an information source (see Section 1.7).
We derive the joint entropy for the case of two different source alphabets, and then provide the
obvious generality to n sources. The two source alphabets usually describe different information
sources (which operate jointly or concurrently). However we can re-use the same source alphabet
when considering source symbol blocks (extension) from the same information source. For the
general case we define two different sources, S1 and S2, with different size source alphabets and
probabilities:
S1 = {si1}, i1 = 1, 2, 3, ..., q1
S2 = {si2}, i2 = 1, 2, 3, ..., q2
The entropy when considering the two symbols si1, si2 jointly is calculated as a natural extension to Equation 1.2:

Joint Entropy for S1 and S2


H(S1, S2) = ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} P(si1 si2) log(1/P(si1 si2))
Equation 1.5

Example 1.4
Consider the following sources:
S1 = {0,1} with P(si1=1 = 0) = 1/2 and P(si1=2 = 1) = 1/2
S2 = {0,1} with P(si2=1 = 0) = 1/3 and P(si2=2 = 1) = 2/3

We assume independent events: P(si1 si2) = P(si1) P(si2), hence

H(S1, S2) = ∑_{i1=1}^{2} ∑_{i2=1}^{2} P(si1) P(si2) log(1/(P(si1) P(si2)))
= (1/6) log 6 + (2/6) log 3 + (1/6) log 6 + (2/6) log 3
= 1.9183 bits per pair
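A short sketch (illustrative) reproducing the 1.9183 bits per pair by forming the four joint probabilities P(si1)P(si2) and applying Equation 1.5:

```python
import math

p_s1 = {'0': 1/2, '1': 1/2}    # P(si1) for source S1
p_s2 = {'0': 1/3, '1': 2/3}    # P(si2) for source S2

# Independent events: P(si1, si2) = P(si1) P(si2)
p_joint = {(a, b): p_s1[a] * p_s2[b] for a in p_s1 for b in p_s2}

h_joint = sum(p * math.log2(1 / p) for p in p_joint.values())
print(round(h_joint, 4))       # -> 1.9183 bits per pair of symbols
```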

The entropy H(S1, S2) is defined as the average bits of information per joint pair of symbols, si1, si2. This should be contrasted with H(S), which is defined as the average bits of information per symbol. The definition of joint entropy for the general case of n source alphabets is the natural extension of Equation 1.2:


Joint Entropy for the General Case


H(S1, S2, ..., Sn) = ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(si1, si2, ..., sin))
Equation 1.6

1.4.2 Conditional entropy

The conditional entropy is an important quantity when considering Markov models (Section 1.5)
and communications channels (Chapter 2). We derive the conditional entropy for two source
alphabets, S1 and S2:
S1 = {si1}, i1 = 1, 2, 3, ..., q1
S2 = {si2}, i2 = 1, 2, 3, ..., q2
The first or given symbol, si1, comes from S1 and this is followed by the next symbol, si2, which comes from S2. Thus S2 depends on S1. We describe what is happening by the conditional probability, P(si2/si1), the probability of symbol si2 given si1. The physical context of the dependency will become apparent when discussing Markov models and communication channels.

Before proceeding we restate some results regarding Bayes’ Theorem from probability theory:

Bayes’ Theorem

P(si2/si1) = P(si1, si2) / P(si1)
Equation 1.7
or:
P(si2/si1) P(si1) = P(si1, si2)
Equation 1.8
For independent events:

P(si2/si1) = P(si2) P(si1) / P(si1) = P(si2)
Equation 1.9
We now proceed to develop the conditional entropy, H(S2/S1). It should be noted that:

∑_{i2=1}^{q2} P(si2/si1) = 1

and hence the self-information of the source symbol si2 given si1 can be defined as:

I(si2/si1) = log(1/P(si2/si1))

The average information over S2 given si1 is then given by the average of I(si2/si1) over all si2 given si1:

H(S2/si1) = ∑_{i2=1}^{q2} P(si2/si1) log(1/P(si2/si1))
Equation 1.10
To obtain H(S2 / S1) all that remains is to average H(S2 / si1) over all si1:


H(S2/S1) = ∑_{i1=1}^{q1} P(si1) H(S2/si1)
= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} P(si1) P(si2/si1) log(1/P(si2/si1))

Hence using Equation 1.8 we get the final form:

Conditional Entropy for S1 and S2


H(S2/S1) = ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} P(si1 si2) log(1/P(si2/si1))

Equation 1.11

The definition of conditional entropy for the general case of n source alphabets is an extension of Equation 1.11.

Conditional Entropy for the General Case


H(Sn/S1, S2, ..., Sn-1) = ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(sin/si1, si2, ..., si(n-1)))
Equation 1.12

Example 1.5
Consider S1 = {0,1} and S2 = {0,1}.
Symbols from S1 occur first with:
P(si1=1 = 0) = 0.5 and P(si1=2 = 1) = 0.5
Symbols from S2 then occur dependent on which symbol occurred from S1.
Hence we need to describe S2 in terms of these dependent probabilities:
P(si2=1 = 0 / si1=1 = 0) = 0.25
P(si2=2 = 1 / si1=1 = 0) = 0.75
P(si2=1 = 0 / si1=2 = 1) = 0.6
P(si2=2 = 1 / si1=2 = 1) = 0.4
The conditional entropy for the information provided by S2 dependent on S1 is:

H(S2/S1) = ∑_{i1=1}^{2} ∑_{i2=1}^{2} P(si1 si2) log(1/P(si2/si1))
= ∑_{i1=1}^{2} P(si1) ∑_{i2=1}^{2} P(si2/si1) log(1/P(si2/si1))
= 0.5 [0.25 log(1/0.25) + 0.75 log(1/0.75)] + 0.5 [0.6 log(1/0.6) + 0.4 log(1/0.4)]
= 0.89 bits per symbol from S2

The entropy H(S2/S1) is defined as the average bits of information per symbol si2 (conditioned on si1).


1.4.3 Recursive expression of Joint Entropy

An important relation which will be used in later sections when discussing source extensions is the
restatement of the joint entropy for the general case of n source alphabets in terms of the joint
entropy for the general case of n-1 source alphabets. This permits a useful recursive relationship for
deriving expressions for the entropy of source extensions (Section 1.7) in terms of the base (no
extension) entropy.

Using Equation 1.11 and Equation 1.12 the derivation proceeds as follows:
H(S1, S2, ..., Sn) = ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(si1, si2, ..., sin))

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/(P(sin/si1, si2, ..., si(n-1)) P(si1, si2, ..., si(n-1))))

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) [log(1/P(sin/si1, si2, ..., si(n-1))) + log(1/P(si1, si2, ..., si(n-1)))]

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(sin/si1, si2, ..., si(n-1))) +
  ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(si1, si2, ..., si(n-1)))

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(sin/si1, si2, ..., si(n-1))) +
  ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{i(n-1)=1}^{q(n-1)} P(si1, si2, ..., si(n-1)) log(1/P(si1, si2, ..., si(n-1)))

= H(Sn/S1, S2, ..., Sn-1) + H(S1, S2, ..., Sn-1)
Hence we have an important result for the general case of joint entropy:

Joint Entropy for the General Case (recursive form)


H(S1, S2, ..., Sn) = H(Sn/S1, S2, ..., Sn-1) + H(S1, S2, ..., Sn-1)
= H(Sn/S1, S2, ..., Sn-1) + H(Sn-1/S1, S2, ..., Sn-2) + ... + H(S1)
= ∑_{i=0}^{n-1} H(Sn-i/S1, S2, ..., Sn-i-1)
Equation 1.13

Equation 1.13 states quite simply that the joint entropy of n source alphabets which are transmitted in sequence is the sum of the entropy of transmitting the first symbol, the entropy of transmitting the second symbol given the first symbol, the entropy of transmitting the third symbol given the first and second symbols, and so on.


Example 1.6
Consider sources S1 = {0,1} and S2 = {0,1} from Example 1.5.
We had that symbols from S1 occur first with:
P(si1=1 = 0) = 0.5 and P(si1=2 = 1) = 0.5
and symbols from S2 then occur dependent on which symbol occurred from S1 with dependent probabilities:
P(si2=1 = 0 / si1=1 = 0) = 0.25
P(si2=2 = 1 / si1=1 = 0) = 0.75
P(si2=1 = 0 / si1=2 = 1) = 0.6
P(si2=2 = 1 / si1=2 = 1) = 0.4
Furthermore we can derive the joint probabilities as:
P(si2=1 = 0, si1=1 = 0) = P(si2=1 = 0 / si1=1 = 0) P(si1=1 = 0) = (0.25)(0.5) = 0.125
P(si2=2 = 1, si1=1 = 0) = P(si2=2 = 1 / si1=1 = 0) P(si1=1 = 0) = (0.75)(0.5) = 0.375
P(si2=1 = 0, si1=2 = 1) = P(si2=1 = 0 / si1=2 = 1) P(si1=2 = 1) = (0.6)(0.5) = 0.3
P(si2=2 = 1, si1=2 = 1) = P(si2=2 = 1 / si1=2 = 1) P(si1=2 = 1) = (0.4)(0.5) = 0.2

Now H(S1, S2) = 1.89 and H(S1) = 1.0. Also from Example 1.5, H(S2/S1) = 0.89. We note that:
H(S1, S2) = H(S2/S1) + H(S1) = 0.89 + 1.0 = 1.89
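The numbers of Examples 1.5 and 1.6 can be reproduced with the sketch below (illustrative names), which builds the joint probabilities from P(si1) and P(si2/si1) and then checks the recursive relation H(S1, S2) = H(S2/S1) + H(S1):

```python
import math

p_s1 = {'0': 0.5, '1': 0.5}                       # P(si1)
p_cond = {('0', '0'): 0.25, ('0', '1'): 0.75,     # P(si2 / si1) for si1 = '0'
          ('1', '0'): 0.60, ('1', '1'): 0.40}     # P(si2 / si1) for si1 = '1'

# Joint probabilities P(si1, si2) = P(si2 / si1) P(si1)  (Equation 1.8)
p_joint = {key: p_cond[key] * p_s1[key[0]] for key in p_cond}

h_s1 = sum(p * math.log2(1 / p) for p in p_s1.values())
h_joint = sum(p * math.log2(1 / p) for p in p_joint.values())
h_cond = sum(p_joint[key] * math.log2(1 / p_cond[key]) for key in p_cond)

print(round(h_cond, 2), round(h_joint, 2))    # -> 0.89 and 1.89
print(math.isclose(h_joint, h_cond + h_s1))   # chain rule holds: True
```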

1.4.4 Conditioning reduces uncertainty

An important relationship is that which exists between H(Sn) and H(Sn/S1, S2, ..., Sn-1), that is, the unconditional entropy of Sn, H(Sn), compared to the conditional entropy of Sn, H(Sn/S1, S2, ..., Sn-1). To do this we consider the expression H(Sn/S1, S2, ..., Sn-1) - H(Sn):
H(Sn/S1, S2, ..., Sn-1) - H(Sn)

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(1/P(sin/si1, si2, ..., si(n-1))) - ∑_{in=1}^{qn} P(sin) log(1/P(sin))

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(P(sin)/P(sin/si1, si2, ..., si(n-1)))

= ∑_{i1=1}^{q1} ∑_{i2=1}^{q2} ... ∑_{in=1}^{qn} P(si1, si2, ..., sin) log(P(sin) P(si1, si2, ..., si(n-1))/P(si1, si2, ..., sin))
Let pi = P(si1, si2, ..., sin) and ri = P(sin) P(si1, si2, ..., si(n-1)) and using Equation 1.4 we get the important result that H(Sn/S1, S2, ..., Sn-1) ≤ H(Sn), that is, conditioning (or "knowledge") reduces the uncertainty or entropy.

Conditioning reduces Uncertainty


The entropy of source Sn, H(Sn), is related to the conditional entropy, H(Sn/S1, S2, ..., Sn-1), by:
H(Sn/S1, S2, ..., Sn-1) ≤ H(Sn)
Equation 1.14


1.5 Markov Sources and Models

1.5.1 Modelling sources with memory

The zero-memory model and ensuing definition of entropy in Equation 1.2 assume that symbols are
emitted from the information source independently of one another. But is this a reasonable
assumption to make for real-world information sources? Consider the most important of information
sources, the English language.

Example 1.7
Let us assume that a message in English received thus far is ‘Th’. What is the most likely letter to
follow next? ‘e’, ‘o’, ‘r’, ‘a’, ‘u’, ‘i’ are the most likely letters that usually follow ‘Th’ in the
English language.

How about if the message received thus far was ‘ac’. What is the most likely letter to follow next?
‘c’, ‘t’, ‘q’, ‘h’ are the most likely letters that usually follow ‘ac’

We see that the most likely letter to follow a message from the English language is highly dependent
on the previous letters. Indeed for almost all real-world information sources message symbols are
either highly correlated (e.g. raw PCM data) or highly dependent on the previous context (e.g.
English language) so it is important that a model with some form of memory be used to describe the
source more accurately.

A direct extension of the zero-memory model parameters to incorporate memory is to replace the
symbol probabilities {P ( s1 ), P ( s2 ), ... , P ( sq )} with conditional probabilities which model the
previous context dependency. Such a model is known as a Markov model.

1.5.2 mth order Markov model

An mth order Markov model of a source defines conditional probabilities which model the
dependency of the next (or current) symbol on the previous m symbols which have been
transmitted:

Definition of an mth order Markov model

An mth order Markov model with a source alphabet of size q is described by the following conditional probabilities:
P(si/sj1, sj2, ..., sjm) = P(si/S^m) for all i and j1, j2, ..., jm = 1, 2, ..., q
where S^m represents the current state of the system.

This also implies that:

P(si/sj1, sj2, ..., sjn) = P(si/sj1, sj2, ..., sjm) = P(si/S^m) for any n > m

And if m = 0 a zero-memory model of the source follows:

P(si/sj1, sj2, ..., sjn) = P(si) for any n > m


With q possible symbols an mth order Markov model will have q^m possible states with q different probability transitions from each state. This behaviour is diagrammatically illustrated by using a state diagram.

Example 1.8
Consider a binary, S = {0,1}, 2nd order Markov source (i.e. a source which is fully described by the
2nd order Markov model). We have q = 2 and the source is described by conditional probabilities
dependent on the previous m = 2 symbols, yielding q m = 4 possible states:
00, 01, 10 and 11;
and qm+1 conditional probabilities:
P(si = 0 / S2 = 00) = P(0/00) = 0.8
P(si = 1 / S2 = 11) = P(1/11) = 0.8
P(1/00) = P(0/11) = 0.2
P(0/01) = P(0/10) = P(1/01) = P(1/10) = 0.5

State Diagram

The state diagram has the four states 00, 01, 10 and 11, with the transitions labelled by the conditional probabilities:
from state 00: P(0/00) = 0.8 (to state 00), P(1/00) = 0.2 (to state 01)
from state 01: P(0/01) = 0.5 (to state 10), P(1/01) = 0.5 (to state 11)
from state 10: P(0/10) = 0.5 (to state 00), P(1/10) = 0.5 (to state 01)
from state 11: P(0/11) = 0.2 (to state 10), P(1/11) = 0.8 (to state 11)

How it works

Consider the following sequence of observed source transmissions:
00100110...
Assuming an initial state 00, then the state sequence for the above is:
00 00 00 01 10 00 01 11 10
The state sequence can be obtained either by using a window of size 2 which is shifted to the right by 1 for each transmission, or by following the corresponding state transition for the transmitted symbol:
initial state: 00, transmit 0 → 00
next state: 00, transmit 0 → 00
next state: 00, transmit 1 → 01
next state: 01, transmit 0 → 10
etc. ...
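The sliding-window method is easy to mechanise; a minimal sketch (illustrative) recovers the state sequence from the transmitted symbols:

```python
symbols = '00100110'              # observed source transmissions
state = '00'                      # assumed initial state
states = [state]
for s in symbols:
    state = state[1] + s          # drop the oldest symbol, append the new one
    states.append(state)
print(' '.join(states))           # -> 00 00 00 01 10 00 01 11 10
```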

We define the transition probability matrix [P] from the state diagram of a Markov source as follows:
1. The states are enumerated as S1, S2, ..., Sn where n = q^m is the number of states for an mth order Markov source. [P] is an n x n matrix which gives the transition probability of emitting symbol si from state Si = sj1 sj2 ... sjm and ending up in state Sj = sj2 sj3 ... sjm si. Thus the ith row and jth column entry of [P], Pij = P(Sj/Si), gives the transition probability of going from state Si to state Sj.
2. The rows of [P] must add to 1 (i.e. the state must change, including changing back to the same state).


3. If we are going from state Si to state Sj in k transitions (and emit k symbols) the transition probability matrix can be shown to be [P^k] = [P]^k.

Example 1.9
The [P] matrix for the 2nd order Markov source of Example 1.8 is:

0.8 0.2 0 0
0 0 0.5 0.5
[P ] =  
0.5 0.5 0 0
0
 0 0.2 0.8

where S1 = 00, S2 = 01, S3 = 10, S4 = 11 and Pij = P ( S j / S i )
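The [P] matrix can be built mechanically from the conditional probabilities of Example 1.8. A sketch (assuming numpy is available; names illustrative):

```python
import numpy as np

# Conditional probabilities P(symbol / state) of the 2nd order Markov source (Example 1.8)
p_cond = {('00', '0'): 0.8, ('00', '1'): 0.2,
          ('01', '0'): 0.5, ('01', '1'): 0.5,
          ('10', '0'): 0.5, ('10', '1'): 0.5,
          ('11', '0'): 0.2, ('11', '1'): 0.8}

states = ['00', '01', '10', '11']          # S1, S2, S3, S4

# Pij = P(Sj / Si): emitting symbol s from state (a, b) leads to state (b, s)
P = np.zeros((4, 4))
for i, state in enumerate(states):
    for symbol in '01':
        next_state = state[1] + symbol
        P[i, states.index(next_state)] = p_cond[(state, symbol)]

print(P)                   # matches the [P] matrix above
print(P.sum(axis=1))       # each row sums to 1
```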

1.5.3 Ergodic Markov Sources

An important class of Markov sources are ergodic Markov sources. An ergodic Markov source is
one which, if observed for a very long time, will emit a sequence of source symbols which is
‘typical’. Most Markov information sources are ergodic. An example of a non-ergodic source is the
following:
[State diagram of a non-ergodic Markov source with states 00, 01, 10, 11: P(0/00) = 1.0 and P(1/11) = 1.0 (self-loops), P(1/00) = 0 and P(0/11) = 0, and each transition out of states 01 and 10 has probability 0.5]

After a very long time the sequence will be either '00000..' or '11111..'. These are not typical
sequences. Furthermore once a transition to state 00 or 11 is made these are effectively ‘dead end’
states, and it is the presence of these states that identifies a non-ergodic Markov source.

Definition of an Ergodic Markov Source

A Markov source is ergodic if, after a certain finite number of steps, it is possible to go from any
state to any other state with a nonzero probability.

1.5.4 Stationary State Distribution for Ergodic Markov Sources

An important property of ergodic Markov sources is that the probability distribution over the set of
states is independent of the initial state distribution and approaches a unique distribution (termed the
stationary state distribution) with long sequences.

The stationary state distribution for an ergodic source, [T] = [P^k], occurs when k approaches infinity, that is, the transition probability from state Si to state Sj in k transitions approaches a
steady-state value as k approaches infinity. Furthermore, these steady-state transition probabilities converge to an identical value for the probability of ending up in state Sj, independent of the starting state Si. That is, [T] has identical rows:
lim_{k→∞} [P^k] = lim_{k→∞} [P]^k = [T] = [ t1  t2  ...  tn
                                            t1  t2  ...  tn
                                            ...
                                            t1  t2  ...  tn ]

where tj = P(Sj) defines the probability of being in state Sj for a long message run (i.e. k → ∞), that is, the steady-state or stationary state probability of Sj, and [T] is the stationary state distribution matrix (or more uniquely just [t1, t2, ..., tn]).

Example 1.10
Consider [P^2] for the 2nd order Markov source of Example 1.8:

[P^2] = [P]^2 = [P][P] = [ 0.64  0.16  0.1   0.1
                           0.25  0.25  0.1   0.4
                           0.4   0.1   0.25  0.25
                           0.1   0.1   0.16  0.64 ]

Now consider how P(Sj = 00 / Si = 00) = 0.64 is interpreted. To start in Si = 00 and end up in Sj = 00 in k = 2 transitions requires 00 → 00 (1st transition) with probability 0.8 and then 00 → 00 (2nd transition) also with probability 0.8, thus 0.8 x 0.8 = 0.64. Furthermore in k = 2 transitions we can go from any state to any other state. For example, consider P(Sj = 10 / Si = 00) = 0.1, where 00 → 01 (1st transition) occurs with probability 0.2, and 01 → 10 (2nd transition) occurs with probability 0.5, thus 0.2 x 0.5 = 0.1.

How about [P^3]?

[P^3] = [P]^3 = [P]^2 [P] = [ 0.562  0.178  0.1    0.16
                              0.25   0.1    0.205  0.445
                              0.445  0.205  0.1    0.25
                              0.16   0.1    0.178  0.562 ]

It should be noted that the rows appear to converge from [P] to [P^3], with an initial observation that t1 ≈ t4 and t2 ≈ t3, but [P^k] for larger k needs to be calculated to verify this. Consider P(Sj = 00 / Si = 00) = 0.562. In k = 3 transitions there are, in fact, two paths that can be followed. One path is to simply remain in state 00 for all 3 transitions giving a probability of 0.8 x 0.8 x 0.8 = 0.5120. The other path is 00 → 01 (probability 0.2), 01 → 10 (probability 0.5) and 10 → 00 (probability 0.5), thus a total probability of 0.2 x 0.5 x 0.5 = 0.05. Hence, since either path is possible: 0.5120 + 0.05 = 0.5620.

Obtaining [T] by calculating [P]^k for large k is neither precise nor efficient. Fortunately, there is an exact solution to this. By definition we have [T] = [P]^k for infinite k. However, it is also true then that [T] = [P]^(k+1) (after all if k = ∞ then so is k + 1!). Hence we have:
[T] = [P]^(k+1) = [P]^k [P] = [T][P]
To obtain a convenient set of equations to solve for [T], we take the transpose of the above:
([T][P])^T = [P]^T [T]^T = [T]^T


where:

[T]^T = [ t1  t1  ...  t1
          t2  t2  ...  t2
          ...
          tn  tn  ...  tn ]

that is, [T]^T has identical columns, which means all we need to do is solve for one column:

Solving for the stationary state distribution, ti:

[P]^T [t1 t2 ... tn]^T = [t1 t2 ... tn]^T

which is solved subject to the constraint: t1 + t2 + ... + tn = 1.

Example 1.11
What is [T] for the 2nd order Markov source of Example 1.8 ?

Now:

[P]^T [t1 t2 t3 t4]^T = [t1 t2 t3 t4]^T

⇒

[ 0.8  0    0.5  0   ] [t1]   [t1]
[ 0.2  0    0.5  0   ] [t2] = [t2]
[ 0    0.5  0    0.2 ] [t3]   [t3]
[ 0    0.5  0    0.8 ] [t4]   [t4]

⇒

0.8t1 + 0.5t3 = t1
0.2t1 + 0.5t3 = t2
0.5t2 + 0.2t4 = t3
0.5t2 + 0.8t4 = t4

subject to the additional constraint:
t1 + t2 + t3 + t4 = 1

Solving the above 5 equations simultaneously yields:
t1 = t4 = 5/14 = 0.357 and t2 = t3 = 2/14 = 0.143

Hence:
P(00) = P(11) = 0.357 and P(01) = P(10) = 0.143
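The same stationary distribution can be obtained numerically. The sketch below (assuming numpy) solves [P]^T t = t together with the normalisation constraint by replacing one row of the singular system:

```python
import numpy as np

P = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.8]])

# Solve ([P]^T - I) t = 0 subject to t1 + t2 + t3 + t4 = 1 by replacing
# the last (redundant) equation with the normalisation constraint.
A = P.T - np.eye(4)
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 0.0, 1.0])
t = np.linalg.solve(A, b)

print(t)                                   # approximately [0.357 0.143 0.143 0.357]
print(np.linalg.matrix_power(P, 50)[0])    # any row of [P]^k converges to the same values
```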


1.5.5 Entropy of a Markov model

Given an information source, either a zero-memory model based on the symbol probabilities, P(si), or an mth order Markov model based on the symbol probabilities conditioned on the previous m symbols, P(si/S^m), can be used to describe the source. In Equation 1.2 the definition of entropy was given based on a zero-memory model. We now define the entropy of an mth order Markov model and then make an important observation on the entropy as a measure of the modelling accuracy for the same (physical) information source.

If we are in state S^m = (sj1, sj2, ..., sjm) of an mth order Markov source then the conditional probability of emitting symbol si is P(si/S^m) = P(si/sj1, sj2, ..., sjm). Thus, the self-information of si occurring while in state S^m = (sj1, sj2, ..., sjm) is:

I(si/S^m) = log(1/P(si/S^m))

The average of I(si/S^m) over all si is the entropy (average information) conditioned on S^m:

H(S/S^m) = ∑_{i=1}^{q} P(si/S^m) I(si/S^m)

If we then average H(S/S^m) over all q^m states we get the final expression for the entropy assuming an mth order Markov model:

H(S) = ∑_{S^m} P(S^m) H(S/S^m)
= ∑_{S^m} P(S^m) ∑_{i=1}^{q} P(si/S^m) log(1/P(si/S^m))
= ∑_{S^m} ∑_{i=1}^{q} P(S^m) P(si/S^m) log(1/P(si/S^m))

where: ∑_{S^m} ∑_{i=1}^{q} ≡ ∑_{S^{m+1}}

and: P(S^m) P(si/S^m) = P(S^m, si) = P(sj1 sj2 ... sjm, si)

Entropy of an mth order Markov source


H(S) = ∑_{S^{m+1}} P(sj1 sj2 ... sjm, si) log(1/P(si/sj1 sj2 ... sjm))

Equation 1.15

The following should be noted:


1. We adopt the same expression, H(S), to define the base entropy of the source S, with the expression given either by Equation 1.2 or Equation 1.15 depending on the type of source (or indeed the type of model used to describe the source) that is used.
2. Equation 1.15 should be compared with Equation 1.12, where it is evident that H(S) ≡ H(S/S1, S2, ..., Sm); the entropy of an mth order Markov source, H(S), is identical to the expression for conditional entropy with m+1 source alphabets.


Example 1.12
What is the entropy of the 2nd order Markov source of Example 1.8 ?

Since m = 2 we need to calculate H(S) as follows:

H(S) = ∑_{S^3} P(S^2, si) log(1/P(si/S^2)) = ∑_{S^2} P(S^2) ∑_{si} P(si/S^2) log(1/P(si/S^2))

= P(S^2 = 00) [P(si = 0/S^2 = 00) log(1/P(si = 0/S^2 = 00)) + P(si = 1/S^2 = 00) log(1/P(si = 1/S^2 = 00))] +
  P(S^2 = 01) [P(si = 0/S^2 = 01) log(1/P(si = 0/S^2 = 01)) + P(si = 1/S^2 = 01) log(1/P(si = 1/S^2 = 01))] +
  P(S^2 = 10) [P(si = 0/S^2 = 10) log(1/P(si = 0/S^2 = 10)) + P(si = 1/S^2 = 10) log(1/P(si = 1/S^2 = 10))] +
  P(S^2 = 11) [P(si = 0/S^2 = 11) log(1/P(si = 0/S^2 = 11)) + P(si = 1/S^2 = 11) log(1/P(si = 1/S^2 = 11))]

Calculating H(S) is facilitated by using a table as follows:

si   S^2   P(si/S^2)   P(S^2)   P(S^2, si)
0    00    0.8         5/14     4/14
1    00    0.2         5/14     1/14
0    01    0.5         2/14     1/14
1    01    0.5         2/14     1/14
0    10    0.5         2/14     1/14
1    10    0.5         2/14     1/14
0    11    0.2         5/14     1/14
1    11    0.8         5/14     4/14

where P(S^2, si) = P(si/S^2) P(S^2).

Thus:

H(S) = ∑_{S^3} P(S^2, si) log(1/P(si/S^2))
= (4/14) log(1/0.8) + (1/14) log(1/0.2) + ... + (4/14) log(1/0.8) = 0.81 bits per symbol
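The table-based calculation translates directly into the sketch below (illustrative), using the stationary state probabilities of Example 1.11 and the conditional probabilities of Example 1.8:

```python
import math

p_state = {'00': 5/14, '01': 2/14, '10': 2/14, '11': 5/14}      # P(S^2)

p_cond = {('00', '0'): 0.8, ('00', '1'): 0.2,                   # P(si / S^2)
          ('01', '0'): 0.5, ('01', '1'): 0.5,
          ('10', '0'): 0.5, ('10', '1'): 0.5,
          ('11', '0'): 0.2, ('11', '1'): 0.8}

# Equation 1.15: H(S) = sum over (state, symbol) of P(S^2, si) log2(1 / P(si / S^2))
h = sum(p_state[s] * p_cond[(s, x)] * math.log2(1 / p_cond[(s, x)])
        for (s, x) in p_cond)
print(round(h, 3))    # -> 0.801, i.e. approximately the 0.81 bits per symbol quoted above
```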


1.6 Different Models for the Same Source

A source, S, has been defined as producing a sequence of symbols, si, from a fixed finite source alphabet {s1, s2, ..., sq} of q distinct symbols. An mth order Markov model assumes the next symbol produced is dependent on the previous m symbols already produced, and is thus specified by all conditional probabilities of the form P(si/sj1, sj2, ..., sjm). The corresponding entropy, conditioned on the state, S^m, of the previous m symbols, H(S) ≡ H(S/S^m) = H(S/S1, S2, ..., Sm), is defined by Equation 1.15.

Adjoint Source

A zero-memory model for the same source assumes symbols are produced independently of one another, and is thus fully specified by {P(s1), P(s2), ..., P(sq)}. In relation to a general information source, S, the adjoint source, S̄, is defined as the zero-memory source equivalent of S. Thus the corresponding entropy is H(S̄), which is defined by Equation 1.2.

Given the mth order Markov model of a source, S, the adjoint source, S̄, can be defined by deriving the zero-memory model probabilities as P(si) = ∑_{S^m} P(si/S^m) P(S^m) or P(si) = ∑_{S^{m-1}} P(S^m). Given that H(S) is conditioned on the previous m symbols (the state) and H(S̄) is independent of the state, a direct consequence of Equation 1.14 is the following important relationship.
Entropy of an mth order Markov source compared to the adjoint source

H(S) ≤ H(S̄)
Equation 1.16
with equality when the symbols are statistically independent of the current state, S^m.

Thus, with Markov sources knowledge of the preceding symbols reduces our uncertainty of the next
symbol when compared with the equivalent zero-memory case. Once again we see that “knowledge”
reduces uncertainty.

Example 1.13
What is the entropy of the adjoint source for the 2nd order Markov source of Example 1.8 ?

We can derive P(0) either by:

P(0) = ∑_{S^2} P(0/S^2) P(S^2)
= P(0/00) P(00) + P(0/01) P(01) + P(0/10) P(10) + P(0/11) P(11)
= 0.5

or

P(0) = ∑_{S^1} P(S^2)
= P(00) + P(01)
= 0.5

and P(1) = 1 - P(0) = 0.5.
Hence H(S̄) = 1.0 bits and H(S) = 0.81 ≤ H(S̄) = 1.0
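A sketch (illustrative) deriving the adjoint-source probabilities and entropy from the state probabilities of Example 1.11:

```python
import math

p_state = {'00': 5/14, '01': 2/14, '10': 2/14, '11': 5/14}   # P(S^2)
p_zero = {'00': 0.8, '01': 0.5, '10': 0.5, '11': 0.2}        # P(0 / S^2)

# P(0) = sum over states of P(0 / S^2) P(S^2)
p0 = sum(p_zero[s] * p_state[s] for s in p_state)
p1 = 1.0 - p0
print(round(p0, 6), round(p1, 6))                            # -> 0.5 0.5

# Entropy of the adjoint (zero-memory) source
h_adjoint = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(round(h_adjoint, 6))                                   # -> 1.0 bit >= H(S)
```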


1.6.1 Building better models

From Equation 1.16 the entropy of a Markov model is always no greater, and usually less, than the entropy of the zero-memory model for the same source. Define H(Ŝ^m) = H(S/S̃^m) = H(S/S1, S2, ..., Sm) as the entropy of an mth order Markov model for the source S. The special case H(Ŝ^0) defines the entropy of the zero-memory model for the source S. The following results from recursive application of Equation 1.14 (i.e. H(S/S̃^m) ≤ H(S) also implies that H(S/S̃^m) ≤ H(S/S̃^{m-1})):
Entropy relationships for different models of the same source

For a general information source an mth order Markov model will yield an entropy no greater than that for an (m-1)th order Markov model:
H(Ŝ^m) ≤ H(Ŝ^{m-1}) ≤ ... ≤ H(Ŝ^1) ≤ H(Ŝ^0)
Equation 1.17
such that:
H(Ŝ^∞) = lim_{m→∞} H(Ŝ^m)

implies H(Ŝ^m) converges to an asymptotic value, H(Ŝ^∞), which defines the "true" source entropy. In the special case that the general information source is fully specified by a kth order Markov model, then:
H(Ŝ^k) = H(Ŝ^{k+1}) = H(Ŝ^{k+2}) = ... = H(Ŝ^∞)

1.7 Extensions of Sources

In some cases it may be more convenient to deal with a block of symbols rather than with individual
symbols.

The nth extension of a source, S^n

Let S be an information source with source alphabet {s1, s2, ..., sq}. The nth extension of S, S^n, is a source with q^n symbols {σ1, σ2, ..., σ_{q^n}}. Each σi corresponds to a block of length n of the si. P(σi), the probability of σi, is just the probability of the corresponding sequence of si's. That is, if σi = (si1, si2, ..., sin) then P(σi) = P(si1) P(si2) ... P(sin) in the case of a zero-memory source and

P(σi) = P(sin/si1, si2, ..., si(n-1)) P(si(n-1)/si1, si2, ..., si(n-2)) ... P(si1) = ∏_{j=0}^{n-1} P(si(n-j)/si1, si2, ..., si(n-j-1))

in the case of a general source.

If S is a zero-memory source, then S^n is also a zero-memory source.

If S is an mth order Markov source, then S^n is a µth order Markov source, where µ = the smallest integer greater than or equal to m/n.


Example 1.14
Consider a binary source S = {0,1} with q = 2. Then we have:
2nd extension: S^2 = {σ1 = 00, σ2 = 01, σ3 = 10, σ4 = 11} → q^n = 4
3rd extension: S^3 = {σ1 = 000, σ2 = 001, σ3 = 010, σ4 = 011, σ5 = 100, σ6 = 101, σ7 = 110, σ8 = 111} → q^n = 8

If S is a zero-memory source, then so is S^3.

If S is an mth order Markov source with m = 2, then S^3 is a 1st order Markov source in terms of σi.

How is P(011), say, calculated? If S is a zero-memory source then P(011) = P(0)P(1)P(1), but for a general source P(011) = P(1/01)P(01) = P(1/01)P(1/0)P(0), where the ordering implies the right-most symbol is the most recent (i.e. P(011) = P(0/11)P(11) is mathematically correct but is physically intractable since it implies that the past symbol 0 depends on the future outcome 11!)

1.7.1 H(S^n) for a zero-memory source

For a zero-memory source the entropy of S^n, H(S^n), is equivalent to the general case n-variable joint entropy of Equation 1.6, that is H(S^n) = H(S1, S2, ..., Sn). Using Equation 1.13 an expression for H(S^n) in terms of H(S) can be derived as follows:

H(S^n) = H(S1, S2, ..., Sn) = H(Sn/S1, S2, ..., Sn-1) + H(Sn-1/S1, S2, ..., Sn-2) + ... + H(S1)
= H(Sn/S̃^{n-1}) + H(Sn-1/S̃^{n-2}) + ... + H(S1)
= H(Sn) + H(Sn-1) + ... + H(S1) = nH(S)

where we note that for a zero-memory source H(Si/S̃^{i-1}) = H(Si) = H(S).

For the nth extension of a zero-memory source:

H(S^n) = nH(S)
Equation 1.18

Example 1.15
Consider the zero-memory source S = {s1, s2, s3} with P(s1) = 0.5, P(s2) = P(s3) = 0.25.
To calculate H(S^2) directly we derive the q^n = 9 terms P(σi) = P(si1) P(si2):
P(s1, s1) = 0.25; P(s1, s2) = 0.125; P(s1, s3) = 0.125; ... ; P(s3, s2) = 0.0625; P(s3, s3) = 0.0625.
From which we get:

H(S^2) = ∑_{i=1}^{9} P(σi) log(1/P(σi)) = 0.25 log(1/0.25) + 0.125 log(1/0.125) + ... + 0.0625 log(1/0.0625)
= 3 bits per symbol σi / pair of symbols si

Using Equation 1.18 all that is needed is to calculate:

H(S) = ∑_{i=1}^{3} P(si) log(1/P(si)) = 1.5 bits per symbol si

and then H(S^2) = 2H(S) = 2(1.5) = 3 bits per symbol σi.
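Both the direct and the shortcut calculations of Example 1.15 fit in a few lines; a sketch using only the standard library (names illustrative):

```python
import math
from itertools import product

p = {'s1': 0.5, 's2': 0.25, 's3': 0.25}

# Direct route: the q^n = 9 block probabilities P(sigma_i) = P(si1) P(si2)
p_blocks = {block: p[block[0]] * p[block[1]] for block in product(p, repeat=2)}
h_s2 = sum(q * math.log2(1 / q) for q in p_blocks.values())

# Shortcut (Equation 1.18): H(S^2) = 2 H(S)
h_s = sum(q * math.log2(1 / q) for q in p.values())
print(h_s2, 2 * h_s)    # -> 3.0 and 3.0 bits per pair of symbols
```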


1.7.2 H(S^n) for an mth order Markov Information Source

For an mth order Markov source, the entropy of S^n is H(S^n) ≡ H(S^n/S̃^m), and we can use Equation 1.13 conditioned on S̃^m to derive an expression for H(S^n/S̃^m) in terms of H(S) ≡ H(S/S̃^m):

We have that H(S^n) = H(Sn/S̃^{n-1}) + H(Sn-1/S̃^{n-2}) + ... + H(S1), which if conditioned on S̃^m becomes:
H(S^n/S̃^m) = H(Sn/S̃^m S̃^{n-1}) + H(Sn-1/S̃^m S̃^{n-2}) + ... + H(S1/S̃^m)
The expression H(Si/S̃^m S̃^{i-1}) effectively defines the entropy of an (m+i-1)th order Markov source, but since the source is known to be an mth order Markov source only the most recent m symbols are significant, thus H(Si/S̃^m S̃^{i-1}) = H(Si/Si-m, Si-m+1, ..., Si-1) = H(Si/S̃^m) = H(S), and hence:
H(S^n) ≡ H(S^n/S̃^m) = H(Sn/S̃^m) + H(Sn-1/S̃^m) + ... + H(S1/S̃^m)
= nH(S/S̃^m) ≡ nH(S)

For the nth extension of an mth order Markov source:

H(S^n) = nH(S)
Equation 1.19

Example 1.16
From Example 1.12, H(S) = 0.81 for the 2nd order Markov source of Example 1.8.
How do we interpret S^2 and what is H(S^2)?

Since m = 2 and n = 2, S^2 is a 1st order Markov source in terms of σi = (si1, si2), and the transition probabilities are P(σi/σj) = P(si1 si2/sj1 sj2) = P(si1/sj1 sj2) P(si2/sj2 si1), and it should be evident that the transition probability matrix is the same as [P]^2 (see Example 1.10).
And H(S^2) = 2H(S) = 2(0.81) = 1.62 bits per σi.

1.7.3 Relationship between H(S^n) and H(S/S̃^{n-1})

Consider a general source, S, for which we are interested in estimating the highest order Markov model that is applicable in order to achieve the best estimate of the "true" source entropy, H(Ŝ^∞). To do this the mth order Markov model needs to be defined and H(S/S̃^m) = H(Ŝ^m) calculated for increasing values of m until an asymptotic value is reached. Since H(Ŝ^m) is a decreasing function of m (Equation 1.17) what we are after is an estimate of how low H(Ŝ^m) can go and hence an estimate for H(Ŝ^∞). This is a very expensive exercise since the Markov model complexity grows exponentially with m.

An alternative approach is to calculate H(S^n), the joint entropy for a block of n symbols, and use this to estimate H(Ŝ^∞). This approach makes sense intuitively since with larger values of n the time dependency is implicitly modelled in the calculation of:

P(si1, si2, ..., sin) = P(sin/si1, si2, ..., si(n-1)) P(si(n-1)/si1, si2, ..., si(n-2)) ... P(si1)
= ∏_{j=0}^{n-1} P(si(n-j)/si1, si2, ..., si(n-j-1))


To prove this result and establish how H(S^n) relates to H(Ŝ^m) and H(Ŝ^∞) we use Equation 1.13 as follows:

H(S^n) = H(Sn/S̃^{n-1}) + H(Sn-1/S̃^{n-2}) + ... + H(S1)
= H(Ŝ^{n-1}) + H(Ŝ^{n-2}) + ... + H(Ŝ^0)
∴ H(S^n) ≥ nH(Ŝ^{n-1})

where the last result follows from Equation 1.17. Defining Ĥ(S^n) = H(S^n)/n as the joint entropy of S^n per symbol we get the following important relationship:

Joint entropy per symbol is lower bounded by the (n-1)th order Markov model entropy
Ĥ(S^n) ≥ H(Ŝ^{n-1})
Equation 1.20

Note that it has just been shown that H(Ŝ^{n-1}) ≤ H(S^n)/n and together with Equation 1.13:

H(S^n) = H(Ŝ^{n-1}) + H(S^{n-1})
H(S^n) ≤ H(S^n)/n + H(S^{n-1})
nH(S^n) ≤ H(S^n) + nH(S^{n-1})
H(S^n)/n ≤ H(S^{n-1})/(n-1)

This gives the next result:
Joint entropy per symbol is a decreasing function of n
Ĥ(S^n) ≤ Ĥ(S^{n-1})
Equation 1.21

Given that H(Ŝ^∞) = lim_{n→∞} H(Ŝ^n), together with Equation 1.20 and Equation 1.21 it can be shown that:

Joint entropy per symbol approximates H(Ŝ^∞) for large n
lim_{n→∞} Ĥ(S^n) = H(Ŝ^∞)
Equation 1.22

Thus by calculating Ĥ(S^n) for increasing values of n, better approximations to H(Ŝ^∞), the "true" source entropy, can be achieved. Most importantly, it should be noted that Ĥ(S^n) is much easier to calculate than H(Ŝ^m). However an even simpler approximation to H(Ŝ^∞) can be derived. For increasing n it can be shown that the probabilities for the different message strings of length n, P(si1, si2, ..., sin), are such that the messages tend to group themselves into two distinct sets. One is a set of most probable messages of length n, where the probability of each message in the set tends to the same value, P, for increasing n. The other is a set of messages of length n which become increasingly unlikely to occur, with the probability of each such message tending to 0 with increasing n.


Example 1.17
Consider a zero-memory binary source, S = {0,1}, with P(0) = 3/4, P(1) = 1/4. Consider as an extreme case the message of length n which contains all 0 symbols and the message of length n which contains all 1 symbols. The all 0's message occurs with probability (0.75)^n, while the all 1's message occurs with probability (0.25)^n. For n = 4, this yields 3.2x10^-1 and 3.9x10^-3 respectively and for n = 16, this yields 1.0x10^-2 and 2.3x10^-10. It is evident that for increasing values of n the all 1's message becomes much more unlikely than the all 0's message. Indeed, consider an arbitrary message of length n with i of its symbols being a 0; this message occurs with probability 3^i/4^n. Thus the probability of the message varies exponentially dependent on the number of symbols in the message which are 0. This leads to some messages rapidly approaching zero probability of occurrence and the remaining messages forming the set of most probable messages.

One implication of this is that the main contributions to Ĥ(S^n) are the P log(1/P) terms arising from the most probable messages. Let M be the number of most probable messages; then we would expect P ≈ 1/M:

Ĥ(S^n) = H(S^n)/n ≈ (1/n) ∑_{i=1}^{M} P log(1/P) = (1/n) ∑_{i=1}^{M} (1/M) log M = (log M)/n = (1/n) log(1/P)

Long messages observed from source S will belong to the set of most probable messages, and hence:

The probabilities of long messages, P(si1, si2, ..., sin), approximate H(Ŝ^∞) for large n
H(Ŝ^∞) = lim_{n→∞} Ĥ(S^n) = lim_{n→∞} [-(1/n) log P(si1, si2, ..., sin)]
Equation 1.23
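Equation 1.23 suggests a simple empirical estimator: generate a long message from the source and compute -(1/n) log2 of its probability. A sketch (illustrative) for the zero-memory source of Example 1.17, whose entropy is H(S) ≈ 0.811 bits:

```python
import math
import random

random.seed(0)
p0 = 0.75                                   # P(0) = 3/4, P(1) = 1/4 (Example 1.17)

def estimate_entropy(n):
    """Draw a message of length n and return -(1/n) log2 P(message)."""
    log_p = 0.0
    for _ in range(n):
        symbol = '0' if random.random() < p0 else '1'
        log_p += math.log2(p0 if symbol == '0' else 1 - p0)
    return -log_p / n

true_h = -(p0 * math.log2(p0) + (1 - p0) * math.log2(1 - p0))     # 0.811 bits
for n in (100, 10_000, 1_000_000):
    print(n, round(estimate_entropy(n), 3), 'vs', round(true_h, 3))
# The estimate approaches the source entropy as n grows.
```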

1.8 Structure of Language

[Handout 1]


1.9 Differential Entropy

Our treatment of entropy has been restricted to the discrete case. This is alright for digital systems.
However, for analogue communications and for general sources and problems the continuous case
must be used.

The mathematics for the continuous case are more complicated but the results are the same (with a
few provisos). For continuous sources we use the probability density function (pdf), fX(x), with the property:

∫_{-∞}^{+∞} fX(x) dx = 1
Equation 1.24

Our definitions for entropy and joint/conditional entropy now become:

H(X) = ∫_{-∞}^{+∞} fX(x) log(1/fX(x)) dx = -∫_{-∞}^{+∞} fX(x) log fX(x) dx
Equation 1.25

H(Y) = -∫_{-∞}^{+∞} fY(y) log fY(y) dy

H(X,Y) = -∫_{-∞}^{+∞} ∫_{-∞}^{+∞} fXY(x,y) log fXY(x,y) dx dy

H(X/Y) = -∫_{-∞}^{+∞} ∫_{-∞}^{+∞} fXY(x,y) log (fXY(x,y)/fY(y)) dx dy = -∫_{-∞}^{+∞} ∫_{-∞}^{+∞} fXY(x,y) log fX(x|y) dx dy

H(Y/X) = -∫_{-∞}^{+∞} ∫_{-∞}^{+∞} fXY(x,y) log (fXY(x,y)/fX(x)) dx dy = -∫_{-∞}^{+∞} ∫_{-∞}^{+∞} fXY(x,y) log fY(y|x) dx dy

Properties
1. The differential entropy may be negative.
2. The differential entropy may become infinitely large.

NOTE

1. Self-information, and the idea that H(X) represents the average self-information, no longer makes sense in the continuous case. H(X) is simply called the entropy function.
2. The continuous entropy functions have different properties from the discrete entropy functions.


Example 1.18
We adapt the fundamental inequality of Equation 1.4 to the continuous case:

∫_{-∞}^{+∞} fY(x) log2 (fX(x)/fY(x)) dx ≤ 0

NOTE
The negative of this quantity, ∫_{-∞}^{+∞} fY(x) log2 (fY(x)/fX(x)) dx, is called the relative entropy or the Kullback-Leibler measure.

Equivalently we may write:

-∫_{-∞}^{+∞} fY(x) log2 fY(x) dx ≤ -∫_{-∞}^{+∞} fY(x) log2 fX(x) dx

and we note that the left-hand side is simply the expression for the differential entropy, H(Y):

H(Y) ≤ -∫_{-∞}^{+∞} fY(x) log2 fX(x) dx
Equation 1.26

Suppose we now describe the random variables X and Y as follows:
• Both X and Y have the same mean µ and the same variance σ^2
• The random variable X is Gaussian distributed
Hence:

fX(x) = (1/(√(2π)σ)) exp(-(x - µ)^2/(2σ^2))

Substituting the above in Equation 1.26 and changing the logarithm base (i.e. log2(x) = log2(e) ln(x)):

H(Y) ≤ -log2(e) ∫_{-∞}^{+∞} fY(x) [-(x - µ)^2/(2σ^2) - ln(√(2π)σ)] dx

∴ H(Y) ≤ log2(e) [ (∫_{-∞}^{+∞} (x - µ)^2 fY(x) dx)/(2σ^2) + ln(√(2π)σ) ∫_{-∞}^{+∞} fY(x) dx ]

Now: ∫_{-∞}^{+∞} fY(x) dx = 1 and ∫_{-∞}^{+∞} (x - µ)^2 fY(x) dx = σ^2, hence:

H(Y) ≤ log2(e) [1/2 + ln(√(2π)σ)] = (1/2) log2(e) + log2(e) ln(√(2π)σ)
∴ H(Y) ≤ (1/2) log2(e) + log2(√(2π)σ) = log2(√(2πe)σ) = (1/2) log2(2πeσ^2)

⇒ H(Y) ≤ (1/2) log2(2πeσ^2)

We have that H(X) = (1/2) log2(2πeσ^2) for the Gaussian random variable X, and H(Y) ≤ H(X) for a random variable Y with any distribution (with the same mean and variance as X). Two important results are:
1. For a finite variance σ^2, the Gaussian random variable has the largest differential entropy attainable by any random variable.
2. The entropy of a Gaussian random variable X is uniquely determined by the variance of X (i.e. it is independent of the mean of X).
Yet again we see the importance of all things Gaussian!
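The closed-form value H(X) = (1/2) log2(2πeσ^2) can be checked by numerical integration; a sketch (assuming numpy) for σ = 2:

```python
import math
import numpy as np

mu, sigma = 0.0, 2.0

# Gaussian pdf sampled on a fine grid covering +/- 10 standard deviations
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
f = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Differential entropy H(X) = -integral of f(x) log2 f(x) dx, via a Riemann sum
dx = x[1] - x[0]
h_numeric = -np.sum(f * np.log2(f)) * dx
h_closed = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
print(h_numeric, h_closed)     # both approximately 3.047 bits for sigma = 2
```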
