
HARAMAYA UNIVERSITY

HARAMAYA INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL AND COMPUTER
ENGINEERING

Introduction to Information Theory and Coding

Course Coordinator:
Dr. Mulugeta Atlabachew (Ass. Professor), Guest Lecturer
INFORMATION
 Information can be described as one or more statements or facts that are
received by a human and that have some form of worth to the recipient.



 The working definition for information therefore must:
1. be something, although the exact nature (substance, energy, or
abstract concept) isn't clear;
2. provide "new" information: a repetition of previously received
messages isn't informative;
3. be "true": a lie or false or counterfactual statement is
misinformation, not information itself;
4. be "about" something.



INFORMATION

 Information can be thought of as the resolution of uncertainty;
 It answers the question of "What an entity is" and thus
defines both its essence and the nature of its characteristics.
 The concept of information has different meanings in different
contexts.
 Thus the concept becomes related to notions of constraint,
communication, control, data, form, education, knowledge,
meaning, understanding, mental stimuli, pattern, perception,
representation, and entropy.



INFORMATION

 Information is a stimulus that has meaning in some context for its receiver.
 When information is entered into and stored in a
computer, it is generally referred to as data.
 After processing (such as formatting and printing), output
data can again be perceived as information.
 When information is packaged or used for understanding
or doing something, it is known as knowledge.



Information

How to measure Information?



 The uncertainty of an event is measured by its probability of
occurrence and is inversely related to it: the less probable the event,
the more uncertain it is.
 More uncertain events require more information to resolve their
uncertainty.
 The mathematical theory of information is based on
probability theory and statistics, and measures information
with several quantities of information.



Information

How to measure Information?



 The choice of logarithmic base in the following formulae determines
the unit of information entropy that is used.
 The most common unit of information is the bit, based on the binary
logarithm.
 Other units include the nat, based on the natural logarithm or base e,
and the hartley, based on the base 10 or common logarithm.
 The bit is a typical unit of information. It is 'that which reduces
uncertainty by half'.



Why logarithm to measure Information?

 We want our information measure I(p) to have several properties:
1) Information is a non-negative quantity: I(p) ≥ 0.
2) If an event has probability 1, we get no information from the occurrence of
the event: I(1) = 0.
3) If two independent events occur, then the information we get from observing
the events is the sum of the two informations: I(p1 · p2) = I(p1) + I(p2).
4) We want our information measure to be a continuous (and, in fact,
monotonic) function of the probability (slight changes in probability should
result in slight changes in information).



Why logarithm to measure Information?

We can therefore derive the following: the only measure satisfying all of these
properties is of the form

$I(p) = \log_b(1/p) = -\log_b(p)$

for some logarithmic base $b$.
Why logarithm to measure Information?

 From this, we can derive the nice property between the logarithm and the
self-information content:

$I(x) = \log\frac{1}{p(x)} = -\log p(x)$

 This formula is called the surprise function; it represents the amount of
surprise, or the amount of self-information, that a particular symbol x of a
distribution holds.
 Intuitively, the definition can be understood as follows: we would be
surprised if a rare symbol (p(x) is small) is observed. Thus, the lower the
p(x), the higher the surprise, which is exactly what the definition above gives.
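As a minimal illustration (not from the slides; the symbol probabilities below are made-up values), the surprise function can be evaluated directly and checked against the properties listed earlier:

import math

def surprise(p, base=2):
    """Self-information I(p) = log(1/p) = -log(p), in units set by the base."""
    return math.log(1 / p, base)

# Hypothetical symbol probabilities, chosen only for illustration.
for p in (0.5, 0.25, 0.01):
    print(p, surprise(p))        # the rarer the symbol, the larger the surprise

print(surprise(1.0))             # a certain event carries no information: 0.0

# Independent events: I(p1 * p2) = I(p1) + I(p2)
p1, p2 = 0.5, 0.25
print(math.isclose(surprise(p1 * p2), surprise(p1) + surprise(p2)))   # True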



Why logarithm to measure Information?

 Thus, using different bases for the logarithm results in information
measures which are just constant multiples of each other, corresponding with
measurements in different units (see the conversion sketch after this list):
1. log2 units are bits (from 'binary'), the typical unit in information theory
2. log3 units are trits (from 'trinary')
3. loge units are nats (from 'natural logarithm') (we'll use ln(x) for loge(x))
4. log10 units are hartleys, after Ralph Hartley, an early worker in the field.
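As an assumed example (the probability 1/8 is an arbitrary choice, not from the slides), the sketch below shows that the same self-information expressed in different bases differs only by a constant factor:

import math

p = 1 / 8                             # arbitrary example probability

bits     = math.log2(1 / p)           # base 2  -> bits
trits    = math.log(1 / p, 3)         # base 3  -> trits
nats     = math.log(1 / p)            # base e  -> nats
hartleys = math.log10(1 / p)          # base 10 -> hartleys

print(bits, trits, nats, hartleys)    # 3.0, 1.892..., 2.079..., 0.903...

# Change of base: log_b(x) = log_2(x) / log_2(b), so every unit is a constant
# multiple of every other unit.
print(math.isclose(nats, bits * math.log(2)))         # True: 1 bit = ln 2 nats
print(math.isclose(hartleys, bits * math.log10(2)))   # True: 1 bit = log10(2) hartleys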



Why logarithm to measure Information?

In general,

 The Shannon information content (self-information content) of an outcome x
is defined to be

$h(x) = \log_2\frac{1}{p(x)}$

and it is measured in bits.

 The entropy of an ensemble X is defined to be the average Shannon
information content of an outcome:

$H(X) = \sum_x p(x)\log_2\frac{1}{p(x)}$

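A minimal Python sketch of this definition (the distribution used is a made-up example, not one from the slides):

import math

def entropy(probs, base=2):
    """Average Shannon information content: H = sum of p(x) * log(1/p(x))."""
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

# Hypothetical ensemble with outcome probabilities summing to 1.
P = [0.5, 0.25, 0.125, 0.125]
print(entropy(P))    # 1.75 bits per outcome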


Entropy

How to measure Information?



 In information theory, the entropy of a random variable is the average level
of "information", "surprise", or "uncertainty" inherent in the variable's
possible outcomes.
 The concept of information entropy was introduced by Claude Shannon in his
work published in 1948.
 Entropy should be a measure of how "surprising" the average outcome of a
variable is.
 Entropy is the average self-information content of the random variable.
 For a continuous random variable, differential entropy is analogous to
entropy.



Entropy

How to measure Information?



 Suppose now that we have n symbols {a1, a2, ..., an}, and some source is
providing us with a stream of these symbols.
 Suppose further that the source emits the symbols with probabilities
{p1, p2, ..., pn}, respectively.
 Assume that the symbols are emitted independently.

 What is the average amount of information we get from each symbol we see in
the stream from the source?



Entropy

How to measure Information?



 What we really want here is a weighted average.

 If we observe the symbol ai, we will be getting log(1/pi) information from
that particular observation.
 In a long run of (say) N observations, we will see (approximately) N*pi
occurrences of the symbol ai.
 Thus, in the N (independent) observations, we will get total information I of

$I = \sum_{i=1}^{n} N\,p_i \log\frac{1}{p_i}$



Entropy

How to measure Information?



 Then, the average information we get per symbol observed for the N
observations will be

$\frac{I}{N} = \sum_{i=1}^{n} p_i \log\frac{1}{p_i}$

 We define $p_i \log\frac{1}{p_i}$ to be zero when $p_i = 0$.

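To make the weighted-average argument concrete, here is an illustrative simulation (not part of the slides; the alphabet and probabilities are hypothetical) showing that the empirical average information per observed symbol approaches the sum above:

import math
import random
from collections import Counter

symbols = ["a1", "a2", "a3"]      # hypothetical source alphabet
probs   = [0.5, 0.3, 0.2]         # hypothetical emission probabilities
N       = 100_000

stream = random.choices(symbols, weights=probs, k=N)   # N independent symbols
counts = Counter(stream)

# Total information, counting log2(1/p_i) bits for every occurrence of a_i.
total_info = sum(counts[s] * math.log2(1 / p) for s, p in zip(symbols, probs))

print(total_info / N)                            # empirical average per symbol
print(sum(p * math.log2(1 / p) for p in probs))  # theoretical value, about 1.485 bits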


Entropy

 To generalize, let us suppose that we have a set of probabilities (a
probability distribution) P = {p1, p2, ..., pn}.


 We define the entropy of the distribution P by:

$H(P) = \sum_{i=1}^{n} p_i \log\frac{1}{p_i}$

 If we have a continuous rather than discrete probability distribution P(x):

$H(P) = \int P(x)\log\frac{1}{P(x)}\,dx$


Entropy

 Another way to think about entropy is in terms of expected value.

 Given a discrete probability distribution P = {p1, p2, ..., pn} with
$0 \le p_i \le 1$ and $\sum_i p_i = 1$, or a continuous distribution P(x) with
$P(x) \ge 0$ and $\int P(x)\,dx = 1$,
 we can define the expected value of an associated discrete set
F = {f1, f2, ..., fn} or function F(x) by:

$E(F) = \sum_{i=1}^{n} f_i\,p_i$

 or

$E(F) = \int F(x)\,P(x)\,dx$

 From this we can conclude that the entropy is the expected value of the
self-information:

$H(P) = E\left(\log\frac{1}{p}\right)$

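As a short numerical check (with a made-up distribution) that the entropy is just the expected value of the self-information, the two computations below agree:

import math

P = [0.5, 0.25, 0.25]                    # hypothetical distribution

# Direct definition: H(P) = sum of p_i * log2(1/p_i)
H_direct = sum(p * math.log2(1 / p) for p in P)

# Expected value of the set F = {I(p_1), ..., I(p_n)}: E(F) = sum of f_i * p_i
F = [math.log2(1 / p) for p in P]        # self-information of each outcome
H_expected = sum(f * p for f, p in zip(F, P))

print(H_direct, H_expected)              # both 1.5 bits
assert math.isclose(H_direct, H_expected)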


Entropy

 The entropy of a probability distribution is just the expected value of the
information of the distribution.

 Intuitively, the greater the expected surprise (the entropy) of a
distribution, the harder that distribution is to represent.

 Therefore, entropy represents the expected value of the surprise a
distribution holds.



Information Theory

 Information theory is the scientific study of the quantification, storage,
and communication of information.
 The field was fundamentally established by the works of
Harry Nyquist and Ralph Hartley, in the 1920s, and Claude
Shannon in the 1940s.
 The field is at the intersection of probability theory,
statistics, computer science, statistical mechanics,
information engineering, and electrical engineering.



Shannon Definition of Communication

 “The fundamental problem of communication is that of reproducing at one
point either exactly or approximately a message selected at another point.”

“Frequently the messages have meaning”

“... [which is] irrelevant to the engineering problem.”

Shannon Definition of Communication

 Shannon wants to find a way of “reliably” transmitting data through the
channel at the “maximal” possible rate.

[Block diagram: Information Source → Coding → Communication Channel → Decoding → Destination]

Model of a Digital Communication System
[Block diagram: Information Source (message, e.g. English symbols) → Encoder / Coding (e.g. English to 0,1 sequence) → Communication Channel (can have noise or distortion) → Decoder / Decoding (e.g. 0,1 sequence to English) → Destination]



Information

 Given a discrete random variable X, with possible outcomes x1, x2, ..., xn,
which occur with probabilities p(x1), p(x2), ..., p(xn), the entropy of X is
formally defined as:

$H(X) = -\sum_{x} p(x)\log_2 p(x)$

 This function is called the entropy function.



Information
Example
 Tossing a die:

 Outcomes are 1, 2, 3, 4, 5, 6
 Each occurs with probability 1/6
 The information provided by tossing the die is

$H = -\sum_{i=1}^{6} p(i)\log_2 p(i) = -\sum_{i=1}^{6} \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$
ENTROPY-Properties
NOTATIONS
 We use capital letters X and Y to name random variables, and lower-case
letters x and y to refer to their respective outcomes.
 These are drawn from particular sets A and B: $x \in A$ and $y \in B$.

 The probability of a particular outcome is denoted $p(x = a_i)$, or simply
$p(a_i)$.


ENTROPY-Properties
JOINT ENSEMBLE
 A joint ensemble 'XY' is an ensemble whose outcomes are ordered pairs
(x, y) with $x \in A$ and $y \in B$.
 The joint ensemble XY defines a joint probability distribution $p(x, y)$
over all possible joint outcomes.
MARGINAL PROBABILITY
 From the Sum Rule, we can see that the probability of X taking on a
particular value $x = a_i$ is the sum of the joint probabilities of this
outcome for X and all possible outcomes for Y:

$p(x = a_i) = \sum_{y} p(x = a_i, y)$



ENTROPY-Properties
 In a simplified way this can be rewritten as

$p(x) = \sum_{y} p(x, y)$

 And similarly for Y:

$p(y) = \sum_{x} p(x, y)$

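A small sketch of the Sum Rule above, using a hypothetical 2 x 2 joint distribution (the numbers are made up for illustration):

# Hypothetical joint distribution p(x, y) over A = {a1, a2} and B = {b1, b2}.
joint = {
    ("a1", "b1"): 0.25, ("a1", "b2"): 0.25,
    ("a2", "b1"): 0.125, ("a2", "b2"): 0.375,
}

A = sorted({x for x, _ in joint})
B = sorted({y for _, y in joint})

# Marginals via the Sum Rule: p(x) = sum over y of p(x, y), and similarly for y.
p_x = {x: sum(joint[(x, y)] for y in B) for x in A}
p_y = {y: sum(joint[(x, y)] for x in A) for y in B}

print(p_x)   # {'a1': 0.5, 'a2': 0.5}
print(p_y)   # {'b1': 0.375, 'b2': 0.625}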


ENTROPY-Properties
Conditional probability
 From the Product Rule, we can easily see that the conditional probability
that $x = a_i$, given that $y = b_j$, is:

$p(x = a_i \mid y = b_j) = \frac{p(x = a_i, y = b_j)}{p(y = b_j)}$

 In a simplified way:

$p(x \mid y) = \frac{p(x, y)}{p(y)}$

 Similarly:

$p(y \mid x) = \frac{p(x, y)}{p(x)}$


ENTROPY-Properties
JOINT ENTROPY OF ENSEMBLES
 The joint entropy of the ensemble XY is

$H(X, Y) = \sum_{x, y} p(x, y)\log_2\frac{1}{p(x, y)}$

CONDITIONAL ENTROPY
 Conditional entropy of an ensemble X, given that y = bj, measures the
uncertainty remaining about random variable X after specifying that random
variable Y has taken on the particular value y = bj.
 It is defined naturally as the entropy of the probability distribution
$p(x \mid y = b_j)$:

$H(X \mid y = b_j) = \sum_{x} p(x \mid y = b_j)\log_2\frac{1}{p(x \mid y = b_j)}$


ENTROPY-Properties
CONDITIONAL ENTROPY
 Conditional entropy of an ensemble X, given an ensemble Y, is given by

$H(X \mid Y) = \sum_{y} p(y)\,H(X \mid y) = \sum_{x, y} p(x, y)\log_2\frac{1}{p(x \mid y)}$

CHAIN RULE FOR ENTROPY
 The joint entropy, conditional entropy, and marginal entropy for two
ensembles X and Y are related by:

$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$

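A numerical sketch (using the same kind of made-up joint distribution as before) that verifies the Chain Rule H(X, Y) = H(X) + H(Y | X):

import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y).
joint = {("a1", "b1"): 0.25, ("a1", "b2"): 0.25,
         ("a2", "b1"): 0.125, ("a2", "b2"): 0.375}
A = sorted({x for x, _ in joint})
B = sorted({y for _, y in joint})

p_x = {x: sum(joint[(x, y)] for y in B) for x in A}

H_XY = H(joint.values())
H_X  = H(p_x.values())
# H(Y|X) = sum over x of p(x) * H(p(y|x)), with p(y|x) = p(x, y) / p(x).
H_Y_given_X = sum(p_x[x] * H([joint[(x, y)] / p_x[x] for y in B]) for x in A)

print(math.isclose(H_XY, H_X + H_Y_given_X))   # True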


ENTROPY-Properties
 It should seem natural and intuitive that the joint entropy of a pair of
random variables is the entropy of one plus the conditional entropy of the
other (the uncertainty that it adds once its dependence on the first one has
been discounted by conditionalizing on it).

 If we have three random variables X, Y, Z, the conditionalizing of the
joint distribution of any two of them, upon the third, is also expressed by a
Chain Rule:

$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z)$



ENTROPY-Properties
INDEPENDENCE BOUND ON ENTROPY
 A consequence of the Chain Rule for Entropy is that if we have many
different random variables X1, X2, ..., Xn, then the sum of all their
individual entropies is an upper bound on their joint entropy:

$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$

 Their joint entropy only reaches this upper bound if all of the random
variables are independent.



ENTROPY-Properties
MUTUAL INFORMATION
 Mutual information is a quantity that measures the relationship between two
random variables that are sampled simultaneously.
 In particular, it measures how much information is communicated, on
average, in one random variable about another.
 Intuitively, one might ask: how much does one random variable tell me about
another?

$I(X; Y) = \sum_{x, y} p(x, y)\log_2\frac{p(x, y)}{p(x)\,p(y)}$

 In this definition, p(x) and p(y) are the marginal distributions of X and Y
obtained through the marginalization process.



ENTROPY-Properties
MUTUAL INFORMATION
 Note that in case X and Y are independent random variables, the numerator
inside the logarithm equals the denominator.
 Then the log term vanishes, and the mutual information equals zero, as one
should expect.
 In the event that the two random variables are perfectly correlated, their
mutual information is the entropy of either one alone.

 The mutual information of a random variable with itself is just its entropy
or self-information.
ENTROPY-Properties
MUTUAL INFORMATION
 The mutual information I(X; Y) is the intersection between H(X) and H(Y).

 These properties are reflected in three equivalent definitions for the
mutual information between X and Y:

$I(X; Y) = H(X) - H(X \mid Y)$
$I(X; Y) = H(Y) - H(Y \mid X)$
$I(X; Y) = H(X) + H(Y) - H(X, Y)$

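A sketch (again with a made-up joint distribution) that checks numerically that the three definitions agree:

import math

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) and its marginals.
joint = {("a1", "b1"): 0.25, ("a1", "b2"): 0.25,
         ("a2", "b1"): 0.125, ("a2", "b2"): 0.375}
A = sorted({x for x, _ in joint})
B = sorted({y for _, y in joint})
p_x = {x: sum(joint[(x, y)] for y in B) for x in A}
p_y = {y: sum(joint[(x, y)] for x in A) for y in B}

# Definition in terms of the joint and marginal distributions.
I_def = sum(pxy * math.log2(pxy / (p_x[x] * p_y[y]))
            for (x, y), pxy in joint.items() if pxy > 0)

H_X, H_Y, H_XY = H(p_x.values()), H(p_y.values()), H(joint.values())
H_X_given_Y = H_XY - H_Y     # from the Chain Rule
H_Y_given_X = H_XY - H_X

for value in (I_def, H_X - H_X_given_Y, H_Y - H_Y_given_X, H_X + H_Y - H_XY):
    print(round(value, 6))   # all four print the same value (about 0.0488 bits)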


Information



Assignment-1

 Using probability theory, show that there is only one way to measure
information, which is in terms of the number of bits.



Shannon’s 1st Source Coding Theorem
 Shannon showed that:
“To reliably store the information generated by some random source X, you
need no more/less than, on the average, H(X) bits for each outcome.”



Shannon’s 1st Source Coding Theorem
 If I toss a die 1,000,000 times and record the value from each trial
1, 3, 4, 6, 2, 5, 2, 4, 5, 2, 4, 5, 6, 1, ...
 In principle, I need 3 bits for storing each outcome, since 3 bits can
represent the values 1 to 8. So I need 3,000,000 bits for storing the
information.
 Using an ASCII representation, a computer needs 8 bits = 1 byte for storing
each outcome.
 The resulting file has size 8,000,000 bits.



Shannon’s 1st Source Coding Theorem
 You only need 2.585 bits, on average, for storing each outcome.
So the file can be compressed to a size of
2.585 x 1,000,000 = 2,585,000 bits
 The optimal compression ratio is:

$\frac{2{,}585{,}000}{8{,}000{,}000} \approx 0.3231 = 32.31\%$
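A quick arithmetic check of the figures above:

import math

N = 1_000_000
bits_per_outcome = math.log2(6)       # about 2.585 bits, the entropy of a fair die

compressed = bits_per_outcome * N     # about 2,585,000 bits
ascii_size = 8 * N                    # 8,000,000 bits

print(compressed / ascii_size)        # about 0.3231, i.e. 32.31 %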


