Alexandria University


Electrical Engineering Department

Spring 2016
MP 219 Mathematics 9
(Probability and Random Processes)
Dr. Sherif Rabia
Eng. Sara Kamel
Channel Coding Theorem

Team Members



Entropy definition
Source coding
Mutual information
Channel capacity
Channel coding theorem
Matlab
Sources

Information theory answers two fundamental questions in communication theory:
What is the ultimate data compression (answer: the entropy H), and what is the
ultimate transmission rate of communication (answer: the channel capacity C).
For this reason some consider information theory to be a subset of
communication theory. We argue that it is much more.

Claude Elwood Shannon (1916-2001) was an American electrical engineer who has
been called the father of information theory and was the founder of practical
digital circuit design theory.

Channel Coding Theorem: It is possible to achieve near perfect communication of
information over a noisy channel.


Entropy definition
------------------------
Shannon Information Content
The Shannon Information Content of an outcome x with probability p(x) is

$$ I(x) = \log_2 \frac{1}{p(x)} $$

Definition: The entropy is a measure of the average uncertainty in a random variable.
It is the average number of bits of Shannon Information Content required to describe the
random variable.
The entropy H(X) of a discrete random variable X is defined by

$$ H(X) = \sum_{x} p(x) \log_2 \frac{1}{p(x)} $$
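
As a minimal sketch of the two definitions above, the following Python snippet computes the information content of a single outcome and the entropy of a distribution given as a list of probabilities. The function names and the two example distributions (a fair coin and a fair die) are illustrative assumptions, not part of the original report.

```python
import math

def information_content(p):
    """Shannon information content, in bits, of an outcome with probability p."""
    return math.log2(1.0 / p)

def entropy(pmf):
    """Entropy H(X), in bits, of a distribution given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(p * information_content(p) for p in pmf if p > 0)

fair_coin = [0.5, 0.5]      # illustrative distribution
fair_die = [1 / 6] * 6      # illustrative distribution

print(entropy(fair_coin))            # 1 bit per toss
print(round(entropy(fair_die), 3))   # log2(6), about 2.585 bits per roll
```

The fair coin needs exactly one bit per outcome on average, while the fair die needs log2(6) bits.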

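Some Properties of entropy

1- Always non-negative: H(X) ≥ 0.
2- H(X) = 0 if X is a deterministic variable (certainty).
3- H(X) is maximum for equiprobable outcomes (maximum uncertainty).

To show these properties, consider a binary random variable X that takes one value with
probability p and the other with probability 1 - p. Then

$$ H(X) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p} \overset{\text{def}}{=} H(p). $$

H(X) depends only on the probability mass function p(x), not on the values taken by the
random variable X, so we can write H(p). H(p) is zero at p = 0 and p = 1 and reaches its
maximum of 1 bit at p = 1/2.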
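
The three properties can be checked numerically for the binary case. The sketch below evaluates H(p) at a few probabilities; the function name and the sample values of p are assumptions chosen for illustration.

```python
import math

def binary_entropy(p):
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0  # deterministic variable: no uncertainty
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Entropy is never negative, vanishes at certainty and peaks at p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```

The printed values are symmetric around p = 0.5, where the entropy equals exactly 1 bit.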

Joint entropy
We now extend the definition to a pair of random variables X and Y.

Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a
joint distribution p(x, y) is defined as

$$ H(X, Y) = \sum_{x, y} p(x, y) \log_2 \frac{1}{p(x, y)} $$

Conditional entropy

Definition: The conditional entropy H(Y|X) is defined as

$$ H(Y|X) = \sum_{x, y} p(x, y) \log_2 \frac{1}{p(y|x)} $$
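
A short sketch of the joint and conditional entropy definitions, assuming a small hypothetical joint distribution stored as a dictionary keyed by (x, y). The numbers are invented for illustration. As a sanity check the snippet also verifies the standard chain rule H(X, Y) = H(X) + H(Y|X), which is not stated in the text above but follows from the two definitions.

```python
import math

# Hypothetical joint distribution p(x, y); the values are illustrative only.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}

def entropy(pmf):
    """H of a distribution given as an iterable of probabilities."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

def joint_entropy(pxy):
    """H(X, Y) = sum over (x, y) of p(x, y) * log2(1 / p(x, y))."""
    return entropy(pxy.values())

def marginal_x(pxy):
    """p(x) obtained by summing p(x, y) over y."""
    px = {}
    for (x, _), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
    return px

def conditional_entropy_y_given_x(pxy):
    """H(Y|X) = sum over (x, y) of p(x, y) * log2(1 / p(y|x))."""
    px = marginal_x(pxy)
    # p(y|x) = p(x, y) / p(x), hence 1 / p(y|x) = p(x) / p(x, y).
    return sum(p * math.log2(px[x] / p) for (x, _), p in pxy.items() if p > 0)

h_xy = joint_entropy(joint)
h_x = entropy(marginal_x(joint).values())
h_y_given_x = conditional_entropy_y_given_x(joint)
print(h_xy, h_x + h_y_given_x)  # the two numbers agree up to rounding
```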

Source coding
----------------------
Introduction
The information coming from the source can be characters, if the information
source is a text. It would be pixels if the information source is an image. So
if we want to transmit pixels or characters, how could we do that? This is done
using a source code.
Source coding is a mapping from (a sequence of) symbols from an
information source to a sequence of bits.
This is the concept behind data compression.

Types of source coding

1. Lossless source coding: The source symbols can be exactly recovered
from the binary bits, e.g. WinZip software on Windows.

2. Lossy source coding: The source symbols can be recovered within
some distortion, e.g. JPEG.

Why source coding?

Source coding tries to minimize the code length and helps to get rid of
undesired or unimportant extra information.
The channel that will receive the code may not have the capacity to
communicate at the source information rate. So, we use source coding to
represent the source at a lower rate, with some loss of information in the
lossy case.

Code length

There are two types of codes: fixed-length codes and variable-length codes.

1. Fixed-length code, as its name clarifies, has symbols that have the
same number of bits.

2. Variable-length code has symbols that have different numbers of bits
depending on the probability of each symbol.

Variable-length code is the better solution as it allows the minimal code
length.

Variable code length

The source symbols may have a uniform or non-uniform distribution.
A non-uniform distribution of the source symbols may allow efficient
representation of the signal at a lower rate. If a symbol S1 appears
frequently (has a large probability), it is good to encode this symbol
into a short code word so that the total length of the code is short.
The length of the code word is determined by the following formula:

$$ I = \log_r \frac{1}{p} $$

where I is the theoretical length of the code word, r is the radix of
the code (2 in case of binary code) and p is the probability of the symbol.
Let's take an example to make things clearer.
There's a source that generates three symbols: S1, S2 and S3. The
probabilities of S1, S2 and S3 are 0.3, 0.5 and 0.2 respectively.
By applying the formula, we'll obtain the following results:

I1 = 1.7, I2 = 1 and I3 = 2.3.

But, as you may notice, these results are theoretical, as there's no such
thing as 1.7 bits. So to make it practical we round them up to whole bits.
Thus, the results will be as follows:

I1 = 2, I2 = 1 and I3 = 3.

As we've mentioned before, notice that the symbol with the largest
probability (S2) has the shortest length (only one bit).
So if the information source generates the following sequence:

S1 S2 S1 S3 S2 S2 S1 S2 S3 S2
The source coding will generate 17 bits (2 bits + 1 bit + 2 bits + 3 bits + 1 bit
+ 1 bit + 2 bits + 1 bit + 3 bits + 1 bit).
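
The worked example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above: the function names are assumptions, and rounding up to the next integer is the "practical" step described in the text.

```python
import math

# Symbol probabilities from the example above.
probabilities = {"S1": 0.3, "S2": 0.5, "S3": 0.2}

def theoretical_length(p, radix=2):
    """Theoretical code word length I = log_r(1/p)."""
    return math.log(1.0 / p) / math.log(radix)

def practical_length(p, radix=2):
    """Round the theoretical length up to a whole number of code symbols."""
    # The small tolerance guards against floating-point noise on exact integers.
    return math.ceil(theoretical_length(p, radix) - 1e-9)

lengths = {s: practical_length(p) for s, p in probabilities.items()}
message = "S1 S2 S1 S3 S2 S2 S1 S2 S3 S2".split()
total_bits = sum(lengths[s] for s in message)

for s, p in probabilities.items():
    print(s, round(theoretical_length(p), 1), lengths[s])
print("total bits:", total_bits)  # 17, matching the count above
```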

Desired properties of source codes

One of the most fundamental properties of a source code is that it must be
uniquely decodable.
{0,010,01,10} is an example of a non-uniquely decodable source code.
If we have the following stream of bits: 001010, it can be read as 0 01 010,
0 010 10 or 0 01 0 10, among other ways. So there will be confusion when we
receive this stream.
But if the code was {10,00,11,110} and we receive the same stream 001010,
it can only be read as 00 10 10.
So, {10,00,11,110} is an example of a uniquely decodable source code.
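
The ambiguity can be made concrete by listing every way a stream splits into code words. The brute-force helper below is a hypothetical illustration, not a general decodability test: one parsing for one stream does not prove that a code is uniquely decodable, whereas several parsings do prove that it is not.

```python
def parsings(stream, code):
    """Return every way to split `stream` into code words taken from `code`."""
    if not stream:
        return [[]]  # exactly one way to parse the empty stream
    results = []
    for word in code:
        if stream.startswith(word):
            for rest in parsings(stream[len(word):], code):
                results.append([word] + rest)
    return results

ambiguous = ["0", "010", "01", "10"]
unambiguous = ["10", "00", "11", "110"]

for parsing in parsings("001010", ambiguous):
    print(" ".join(parsing))            # several different readings
print(parsings("001010", unambiguous))  # [['00', '10', '10']]
```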

Mutual information
----------------------
Definition

Mutual information is one of many quantities that measure how much one
random variable tells us about another. It can be thought of as the reduction
in uncertainty about one random variable given knowledge of another.
Intuitively, mutual information measures the information that X and Y share:
it measures how much knowing one of these variables reduces uncertainty
about the other. For example, if X and Y are independent, then
knowing X does not give any information about Y and vice versa, so their
mutual information is zero. At the other extreme, if X is a deterministic
function of Y and Y is a deterministic function of X, then all information
conveyed by X is shared with Y: knowing X determines the value of Y and vice
versa. High mutual information indicates a large reduction in uncertainty and
low mutual information indicates a small reduction. An important theorem from
information theory says that the mutual information between two variables is 0
if and only if the two variables are statistically independent.
For example, suppose X represents the roll of a fair 6-sided die, and Y
represents whether the roll is even (0 if even, 1 if odd). Clearly, the value of Y
tells us something about the value of X and vice versa. That is, these
variables share mutual information.
On the other hand, if X represents the roll of one fair die, and Z represents
the roll of another fair die, then X and Z share no mutual information. The roll
of one die does not contain any information about the outcome of the other
die.

Mathematical representation

For two discrete variables X and Y whose joint probability distribution
is PXY(x,y), the mutual information between them, denoted I(X;Y), is given by

$$ I(X;Y) = \sum_{x, y} P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x) P_Y(y)} $$

where PX(x) and PY(y) are the marginals:

$$ P_X(x) = \sum_{y} P_{XY}(x, y), \qquad P_Y(y) = \sum_{x} P_{XY}(x, y) $$

To understand what I(X;Y) actually means, let's modify the equation first.
I(X;Y) = H(X) - H(X|Y), where

$$ H(X) = -\sum_{x} P_X(x) \log P_X(x) $$

$$ H(X|Y) = -\sum_{y} P_Y(y) \sum_{x} P_{X|Y}(x|y) \log P_{X|Y}(x|y) $$

Mutual information is therefore the reduction in uncertainty about variable X
after observing Y.
The focus here is on discrete variables, but most results derived for discrete
variables extend very naturally to continuous ones: one simply replaces
sums by integrals.
The units of information depend on the base of the logarithm. If base 2 is
used (the most common, and the one used here), information is measured in
bits.
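
The two dice examples from the definition can be evaluated directly from the formula for I(X;Y). The sketch below assumes the joint distribution is stored as a dictionary keyed by (x, y); the function and variable names are illustrative.

```python
import math
from itertools import product

def mutual_information(pxy):
    """I(X;Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p   # marginal of X
        py[y] = py.get(y, 0.0) + p   # marginal of Y
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# X is a fair die roll, Y is 0 if the roll is even and 1 if it is odd.
die_and_parity = {(x, x % 2): 1 / 6 for x in range(1, 7)}
# X and Z are two independent fair dice.
two_dice = {(x, z): 1 / 36 for x, z in product(range(1, 7), repeat=2)}

print(round(mutual_information(die_and_parity), 6))  # 1 bit: Y is fully determined by X
print(round(abs(mutual_information(two_dice)), 6))   # 0 bits: the dice are independent
```

Knowing the roll removes all of the one bit of uncertainty about its parity, while one die says nothing about the other.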

[Figure: graphical representation of the conditional entropy and the mutual information]

Channel capacity (C)
-----------------------------
It's the highest rate of reliable (error-free) information that can be
transmitted through a communication channel.

Channel capacity is affected by:

- The attenuation of a channel which varies with frequency as well as channel
- The noise induced into the channel which increases with distance.
- Non-linear effects such as clipping on the signal.

Shannon's Channel Coding Theorem

Shannon's Channel Coding Theorem states that if the information rate R
(bits/s) [the information rate is the average entropy per symbol multiplied by
the symbol rate] is less than the channel capacity C (i.e. R < C), then there
is, in principle, a coding technique which enables transmission over the noisy
channel with no error (more precisely, with an arbitrarily small probability
of error).
The converse of this is that if R > C, then the probability of error is close
to 1 for every symbol.

Shannon's Channel Capacity Theorem

It states that:

$$ C = B \log_2 \left( 1 + \frac{S}{N} \right) \text{ bits/s} $$

C: Channel capacity
B: Channel bandwidth
S: Signal power
N: Noise power
S/N: Signal to noise ratio (SNR)

We conclude that the channel capacity (C) increases when the bandwidth
increases and also when the signal to noise ratio increases.

This expression applies to information in any format and to both analogue
and data communications, but its application is most common in data
communications.
The channel capacity theorem relates three system parameters:
1- Channel bandwidth B
2- Average transmitted signal power S
3- Noise power at the channel N

Hence for a given average transmitted power [S] and channel bandwidth [B]
we can transmit information at rate [C bits/s] without any error.
It's not possible to transmit information at any rate higher than [C
bits/s] without having a definite probability of error. Hence the channel
capacity theorem defines the fundamental limit on the rate of error-free
transmission for a power-limited, band-limited channel.
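
As a quick numerical illustration of the capacity formula, the sketch below evaluates C = B log2(1 + S/N) for a hypothetical channel. The 3 kHz bandwidth and the signal-to-noise ratio of 1000 (30 dB) are assumed example values, not figures from the report.

```python
import math

def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Shannon capacity C = B * log2(1 + S/N) in bits per second."""
    return bandwidth_hz * math.log2(1 + signal_power / noise_power)

# Hypothetical channel: B = 3 kHz, S/N = 1000 (30 dB).
c = channel_capacity(3000, 1000.0, 1.0)
print(round(c))  # roughly 29.9 kbit/s

# Capacity grows with bandwidth and with the signal-to-noise ratio.
print(channel_capacity(6000, 1000.0, 1.0) > c)
print(channel_capacity(3000, 2000.0, 1.0) > c)
```

Doubling the bandwidth doubles the capacity, whereas doubling the signal power only adds about B bits/s once the signal-to-noise ratio is large.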

Channel coding theorem

------------------------------------
The purpose of channel coding theory is to find codes which transmit quickly,
contain many valid code words and can correct or at least detect many
errors. While not mutually exclusive, performance in these areas is a tradeoff.
So, different codes are optimal for different applications. The needed
properties of this code mainly depend on the probability of errors happening
during transmission.
Although not a very good code, a simple repeat code can serve as an
understandable example. Suppose we take a block of data bits (representing
sound) and send it three times. At the receiver we will examine the three
repetitions bit by bit and take a majority vote. The twist on this is that we
don't merely send the bits in order. We interleave them. The block of data bits
is first divided into 4 smaller blocks. Then we cycle through the blocks and
send one bit from the first, then the second, etc. This is done three times to
spread the data out over the surface of the storage disk. In the context of the
simple repeat code, this may not appear effective. However, there are more
powerful codes known which are very effective at correcting the "burst" error
of a scratch or a dust spot when this interleaving technique is used.
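
The repeat-and-interleave scheme described above can be sketched as follows. This is one possible reading of the description, assuming the block length is a multiple of 4 and that the three interleaved copies are sent one after the other; the function names are hypothetical.

```python
def interleave(bits, n_blocks=4):
    """Split `bits` into n_blocks sub-blocks and emit one bit from each in turn."""
    size = len(bits) // n_blocks  # assumes len(bits) is a multiple of n_blocks
    blocks = [bits[i * size:(i + 1) * size] for i in range(n_blocks)]
    return [blocks[b][i] for i in range(size) for b in range(n_blocks)]

def deinterleave(bits, n_blocks=4):
    """Inverse of interleave: regroup the bits into their original order."""
    return [bit for b in range(n_blocks) for bit in bits[b::n_blocks]]

def encode(bits):
    """Send the interleaved block three times."""
    return interleave(bits) * 3

def decode(received, n_bits):
    """De-interleave each copy, then take a bit-by-bit majority vote."""
    copies = [deinterleave(received[k * n_bits:(k + 1) * n_bits]) for k in range(3)]
    return [1 if a + b + c >= 2 else 0 for a, b, c in zip(*copies)]

data = [1, 0, 1, 1, 0, 0, 1, 0]
sent = encode(data)

# Simulate a burst error (a "scratch") that flips four consecutive bits.
corrupted = sent[:]
for i in range(2, 6):
    corrupted[i] ^= 1

print(decode(corrupted, len(data)) == data)  # True: the majority vote recovers the data
```

Here the burst damages only one of the three copies, so the majority vote alone is enough to repair it, which is why the text notes that interleaving pays off mainly with more powerful codes.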
A number of algorithms are used for channel coding. We will discuss some of
the linear ones. First, let's explain some definitions.

Systematic code: any error-correcting code in which the input data is
embedded in the encoded output. Conversely, in a non-systematic
code the output does not contain the input symbols.
Systematic codes have the advantage that the parity data can simply be
appended to the source block, and receivers do not need to recover the
original source symbols if they are received correctly. For engineering
purposes such as synchronization and monitoring, it is desirable to get
reasonably good estimates of the received source symbols without going through
the lengthy decoding process, which may be carried out at a remote site at a
later time.
The codes we are going to discuss are systematic codes.

Block codes: In coding theory, a block code is any member of the large and
important family of error-correcting codes that encode data in blocks. There is
a vast number of examples for block codes, many of which have a wide range
of practical applications. Block codes are conceptually useful because they
allow coding theorists, mathematicians, and computer scientists to study the
limitations of all block codes in a unified way. Such limitations often take the
form of bounds that relate different parameters of the block code to each
other, such as its rate and its ability to detect and correct errors.

Linear block codes

Cyclic code
A cyclic code is a block code, where the circular shift of each code word
gives another word that belongs to the code. They are error-correcting
codes that have algebraic properties that are convenient for efficient error
detection and correction.

"If 00010111 is a valid code word, applying a right circular shift gives the
string 10001011. If the code is cyclic, then 10001011 is again a valid code
word. In general, applying a right circular shift moves the least significant bit
(LSB) to the leftmost position, so that it becomes the most significant bit
(MSB); the other positions are shifted by 1 to the right"
General definition:
Let C be a linear code over a finite field GF(q) of block length n. C is called
a cyclic code if, for every code word c = (c1, ..., cn) from C, the word
(cn, c1, ..., cn-1) in GF(q)^n obtained by a cyclic right shift of components is
again a code word. Because one cyclic right shift is equal to n - 1 cyclic left
shifts, a cyclic code may also be defined via cyclic left shifts. Therefore the
linear code C is cyclic precisely when it is invariant under all cyclic shifts.
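
The shift in the quoted example and the closure condition in the definition can be sketched in a few lines. The helper names and the two small example codes are assumptions for illustration; the check only tests closure under cyclic shifts and does not verify that the code is linear.

```python
def cyclic_right_shift(word):
    """Move the last symbol of `word` to the front."""
    return word[-1] + word[:-1]

def is_closed_under_shifts(code):
    """True if the cyclic right shift of every code word is again a code word."""
    code = set(code)
    return all(cyclic_right_shift(word) in code for word in code)

print(cyclic_right_shift("00010111"))  # 10001011, as in the quoted example

# Illustrative codes: the even-weight words of length 3, and a non-cyclic subset.
even_weight = {"000", "011", "101", "110"}
not_cyclic = {"000", "011"}
print(is_closed_under_shifts(even_weight))  # True
print(is_closed_under_shifts(not_cyclic))   # False: 011 shifts to 101
```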

A parity bit, or check bit is a bit added to the end of a string of binary code
that indicates whether the number of bits in the string with the value one is
even or odd. Parity bits are used as the simplest form of error detecting code.

Parity types:
In the case of even parity, for a given set of bits, the occurrence of bits whose
value is 1 is counted. If that count is odd, the parity bit value is set to 1,
making the total count of occurrences of 1's in the whole set (including the
parity bit) an even number. If the count of 1's in a given set of bits is already
even, the parity bit's value remains 0.
In the case of odd parity, the situation is reversed. For a given set of bits, if
the count of bits with a value of 1 is even, the parity bit value is set to 1
making the total count of 1's in the whole set (including the parity bit) an odd
number. If the count of bits with a value of 1 is odd, the count is already odd
so the parity bit's value remains 0.
If the parity bit is present but not used, it may be referred to as mark
parity (when the parity bit is always 1) or space parity (the bit is always 0).

Parity in Mathematics:
In mathematics, parity refers to the evenness or oddness of an integer, which
for a binary number is determined only by the least significant bit. In
telecommunications and computing, parity refers to the evenness or oddness
of the number of bits with value one within a given set of bits, and is thus
determined by the value of all the bits. It can be calculated via an XOR sum of
the bits, yielding 0 for even parity and 1 for odd parity. This property of being
dependent upon all the bits and changing value if any one bit changes
allows for its use in error detection schemes.
Error detection:
If an odd number of bits (including the parity bit) are transmitted incorrectly,
the parity bit will be incorrect, thus indicating that a parity error occurred in
the transmission. The parity bit is only suitable for detecting errors; it cannot
correct any errors, as there is no way to determine which particular bit is
corrupted. The data must be discarded entirely, and re-transmitted from
scratch. On a noisy transmission medium, successful transmission can
therefore take a long time, or even never occur. However, parity has the
advantage that it uses only a single bit and requires only a number of XOR
gates to generate. Hamming code is an example of an error-correcting code.
Parity bit checking is used occasionally for transmitting ASCII characters,
which have 7 bits, leaving the 8th bit as a parity bit.
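
A minimal sketch of parity generation and checking with an XOR sum, assuming bits are stored as a list of 0/1 integers. The function names and the 7-bit pattern (the ASCII code of 'A') are chosen for illustration.

```python
from functools import reduce
from operator import xor

def parity_bit(bits, odd=False):
    """Parity bit for `bits`: even parity by default, odd parity if odd=True."""
    p = reduce(xor, bits, 0)  # 1 if the number of ones is odd, else 0
    return p ^ 1 if odd else p

def add_parity(bits, odd=False):
    """Append the parity bit to the data bits."""
    return bits + [parity_bit(bits, odd)]

def parity_ok(word, odd=False):
    """Check a received word (data bits plus parity bit)."""
    return reduce(xor, word, 0) == (1 if odd else 0)

ascii_a = [1, 0, 0, 0, 0, 0, 1]   # 7-bit ASCII 'A' contains two ones
word = add_parity(ascii_a)        # even parity appends 0 as the 8th bit
print(word, parity_ok(word))

one_error = word[:]
one_error[3] ^= 1                 # a single flipped bit is detected
two_errors = one_error[:]
two_errors[5] ^= 1                # a second flipped bit hides the error
print(parity_ok(one_error), parity_ok(two_errors))  # False True
```

The last line shows the limitation stated above: an even number of bit errors leaves the parity unchanged and goes undetected.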

Hamming code

Hamming code is a linear error-correcting code that encodes four bits of data into
seven bits by adding three parity bits. It is a member of a larger family
of Hamming codes.
They can detect up to two-bit errors or correct one-bit errors without detection of
uncorrected errors. By contrast, the simple parity code cannot correct errors, and
can detect only an odd number of bits in error. Hamming codes are perfect
codes, that is, they achieve the highest possible rate for codes with their block
length and minimum distance of three.

The goal of the hamming code

The goal of Hamming codes is to create a set of parity bits that overlap such
that a single-bit error (the bit is logically flipped in value) in a data bit or a
parity bit can be detected and corrected. While multiple overlaps can be
created, the general method is presented in Hamming codes.

[Figure: graphical depiction of the 4 data bits d1 to d4 and 3 parity bits p1 to p3, and which parity bits apply to which data bits]

This table describes which parity bits cover which transmitted bits in the
encoded word. For example, p2 provides an even parity for bits 2, 3, 6, and
7. It also details which transmitted bit is covered by which parity bit by
reading the column. For example, d1 is covered by p1 and p2 but not p3. This
table will have a striking resemblance to the parity-check matrix (H) in the
next section.
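
The coverage rules can be turned into a small encoder and single-error corrector. The sketch below assumes the common bit ordering p1 p2 d1 p3 d2 d3 d4 (positions 1 to 7), which agrees with the two examples in the text (p2 covers bits 2, 3, 6 and 7; d1 is covered by p1 and p2 but not p3). The ordering and the function names are assumptions, since the original table did not survive.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4] using even parity."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(received):
    """Correct at most one flipped bit and return the 4 data bits."""
    r = list(received)
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 0 means all parity checks pass
    if error_position:
        r[error_position - 1] ^= 1         # flip the bit the syndrome points to
    return [r[2], r[4], r[5], r[6]]

codeword = hamming74_encode([1, 0, 1, 1])
print(codeword)                     # [0, 1, 1, 0, 0, 1, 1]

damaged = codeword[:]
damaged[4] ^= 1                     # flip the bit at position 5 (d2)
print(hamming74_decode(damaged))    # [1, 0, 1, 1]
```

Because each of the seven positions is covered by a different combination of parity bits, the three parity checks together spell out the position of a single flipped bit.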

Matlab implementation
------------
Hamming Code

Huffman Code