
Lecture Notes on

Information Theory (incomplete)

Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

28 January 2017

Contents
1 Entropy
1.1 Definition
1.2 Properties

These lecture notes are largely based on Cover and Thomas (1991).

I assume you are familiar with the basics of Bayesian theory, see Wiskott (2013).

1 Entropy
1.1 Definition
Even though Information Theory is about information, it is based on a clear definition of the lack of information, i.e. uncertainty. The idea is that the amount of information a message carries depends not so much on the message itself but on how much it reduces the uncertainty of the receiver about something.

Assume we talk about a person named Mary. If I tell you "Mary is 13 years of age", how much information did I give you? Well, it depends on how uncertain you were about the age of Mary in the first place. If you already knew that she is a 7th-grader, then you had already guessed that she is around 13. If you only knew that she is my cousin, then my message has reduced your uncertainty about Mary's age significantly and it has given you much more information. Thus, before defining information we have to define uncertainty.
Given a random variable A that can assume the values a ∈ A with probability distribution P(A), the uncertainty about the value of A is given by its entropy (D: Entropie)

H(A) := -\sum_{a \in A} P(a) \log P(a) = \langle -\log P(A) \rangle_A ,    (1)
where ⟨f(a)⟩_A := ∑_{a∈A} P(a) f(a) indicates averaging over A. The logarithm is taken to base 2 here (usually no parentheses are used around its argument, for convenience), and the unit in which entropy is measured is bits (D: bits). Other base values can be used as well, leading to different units (base e leads to the unit nats), but they are less common. Entropies measured with different base values differ only by a constant factor, e.g. H_2(A) = (log_2 e) H_e(A), if the subscript indicates the base. Note also that entropy is non-negative, because 0 ≤ P(a) ≤ 1 and therefore −log P(a) ≥ 0.

© 2006 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights (here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not have the rights to publish are grayed out, but the word Figure, Image, or the like in the reference is often linked to a pdf. More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.
We are least uncertain about A if it can assume only one value a_0, i.e. if P(A = a_0) = 1 and P(A ≠ a_0) = 0, in which case H(A) = 0 (if we use the reasonable definition 0 log 0 := 0). If A contains 16 elements that are all equally likely, i.e. if P(A = a) = 1/16 ∀ a ∈ A, then uncertainty is high with H(A) = −16 · (1/16) log(1/16) = log 16 = 4. If A had more than 16 elements, entropy could be even higher. Thus, we see that entropy is zero if there is no uncertainty and increases with uncertainty.
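These two cases are easy to verify numerically. The following is a minimal sketch in Python (not part of the original notes); the helper name entropy and the use of NumPy are my own choices.

    import numpy as np

    def entropy(p):
        """Entropy in bits of a discrete distribution given as an array of probabilities."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                         # convention 0 log 0 := 0
        return -np.sum(p * np.log2(p))

    print(entropy([1.0]))                    # only one possible value   -> 0.0
    print(entropy(np.full(16, 1 / 16)))      # 16 equally likely values  -> 4.0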
Note that A could also be vectorial or a combination of random variables, so that the definition of H(A)
can be readily generalized to a definition of joint entropy (D: Verbundentropie, Blockentropie)
H(A,B) := -\sum_{a \in A} \sum_{b \in B} P(a,b) \log P(a,b) = \langle -\log P(a,b) \rangle_{A,B} ,    (2)

which would quantify how uncertain we are about A and B.


It is also straightforward to compute the average entropy of A if B is known, which leads to the conditional
entropy (D: bedingte Entropie)
H(A|B) := \sum_{b \in B} P(b) H(A|b) := \sum_{b \in B} P(b) \left( -\sum_{a \in A} P(a|b) \log P(a|b) \right)    (3)
= -\sum_{a \in A} \sum_{b \in B} P(a,b) \log P(a|b) = \langle -\log P(a|b) \rangle_{A,B} .    (4)
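For a concrete joint distribution, definitions (2) and (4) can be evaluated directly. Here is a small sketch; the joint probability table P_ab is made up purely for illustration.

    import numpy as np

    # Made-up joint distribution P(a, b); rows index values of A, columns values of B.
    P_ab = np.array([[0.25, 0.25],
                     [0.40, 0.10]])

    P_b = P_ab.sum(axis=0)                              # marginal P(b)
    P_a_given_b = P_ab / P_b                            # P(a|b) = P(a, b) / P(b), column-wise

    H_AB = -np.sum(P_ab * np.log2(P_ab))                # joint entropy (2)
    H_A_given_B = -np.sum(P_ab * np.log2(P_a_given_b))  # conditional entropy (4)

    print(H_AB, H_A_given_B)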

1.2 Properties
We repeat the non-negativity property, which follows directly from the definition of the entropy and the fact that probabilities are at most 1:

H(A) \ge 0 .    (5)

Some properties of the entropy can be easily derived from properties of probabilities.
Symmetry:

P(A,B) = P(B,A)    (6)
\Longrightarrow \langle -\log P(a,b) \rangle_{A,B} = \langle -\log P(b,a) \rangle_{A,B}    (7)
\overset{(2)}{\Longrightarrow} H(A,B) = H(B,A) .    (8)

Chain rule (D: Kettenregel):

P(A,B) = P(A|B) \, P(B)    (9)
\Longrightarrow \langle -\log P(a,b) \rangle_{A,B} = \langle -\log P(a|b) \rangle_{A,B} + \langle -\log P(b) \rangle_{B}    (10)
\overset{(2,4,1)}{\Longrightarrow} H(A,B) = H(A|B) + H(B) ,    (11)

i.e. the joint entropy of A and B is the entropy of B plus the conditional entropy of A given B.
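The chain rule is easy to check numerically; a brief sketch with the same kind of made-up joint probability table as before:

    import numpy as np

    P_ab = np.array([[0.25, 0.25],                      # made-up joint distribution P(a, b)
                     [0.40, 0.10]])
    P_b = P_ab.sum(axis=0)                              # marginal P(b)

    H = lambda p: -np.sum(p * np.log2(p))               # entropy in bits (all entries > 0 here)

    H_AB = H(P_ab)                                      # joint entropy (2)
    H_A_given_B = -np.sum(P_ab * np.log2(P_ab / P_b))   # conditional entropy (4)

    print(np.isclose(H_AB, H_A_given_B + H(P_b)))       # chain rule (11) -> True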
If two variables are statistically independent, their entropies simply add up:

A and B are statistically independent    (12)
\Longrightarrow P(A,B) = P(A) \, P(B)    (13)
\Longrightarrow \langle -\log P(a,b) \rangle_{A,B} = \langle -\log P(a) \rangle_{A} + \langle -\log P(b) \rangle_{B}    (14)
\overset{(2,1)}{\Longrightarrow} H(A,B) = H(A) + H(B) .    (15)
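For statistically independent variables, the joint distribution is the outer product of the marginals, so (15) can be verified in the same way. A short sketch; the two marginals are arbitrary examples.

    import numpy as np

    P_a = np.array([0.5, 0.3, 0.2])                     # arbitrary marginal P(a)
    P_b = np.array([0.6, 0.4])                          # arbitrary marginal P(b)
    P_ab = np.outer(P_a, P_b)                           # independence: P(a, b) = P(a) P(b)

    H = lambda p: -np.sum(p * np.log2(p))               # entropy in bits

    print(np.isclose(H(P_ab), H(P_a) + H(P_b)))         # (15) -> True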

It is intuitively clear that if A and B are not statistically independent, the uncertainty about A and B together is less than the uncertainty about A alone plus the uncertainty about B alone. Thus we would expect

A and B are statistically dependent    (16)
\Longrightarrow H(A,B) < H(A) + H(B) .    (17)

It therefore seems plausible to quantify the degree of statistical dependence between A and B by the difference between H(A) + H(B) and H(A,B), which is called mutual information (D: Transinformation)

I(A;B) := H(A) + H(B) - H(A,B)    (18)
\overset{(11)}{=} H(A) - H(A|B)    (19)
\overset{(8,11)}{=} H(B) - H(B|A) .    (20)

Mutual information is obviously symmetric,


I(A;B) \overset{(18,8)}{=} I(B;A) .    (21)

As one would expect, mutual information is non-negative


I(A;B) \overset{(18,1,2)}{=} \langle -\log P(a) \rangle_{A} + \langle -\log P(b) \rangle_{B} - \langle -\log P(a,b) \rangle_{A,B}    (22)
= \langle -\log P(a) \rangle_{A,B} + \langle -\log P(b) \rangle_{A,B} - \langle -\log P(a,b) \rangle_{A,B}    (23)
= -\sum_{a \in A} \sum_{b \in B} P(a,b) \log \frac{P(a) P(b)}{P(a,b)}    (24)
\ge -\log \sum_{a \in A} \sum_{b \in B} P(a,b) \frac{P(a) P(b)}{P(a,b)} \qquad \text{(by Jensen's inequality)}    (25)
= -\log \Bigl( \underbrace{\sum_{a \in A} P(a)}_{=1} \underbrace{\sum_{b \in B} P(b)}_{=1} \Bigr)    (26)
= 0 .    (27)

The proof has to be slightly modified if P(a,b) = 0 for some a, b. Jensen's inequality states that for a strictly concave function f, like the logarithm, ⟨f(a)⟩_A ≤ f(⟨a⟩_A), with equality iff A assumes only one value. Thus, equality in (25) holds only if P(a)P(b)/P(a,b) is a constant (and this constant must be 1, since its average under P(a,b) is 1), i.e. if A and B are statistically independent, which results in the statement

A and B are statistically independent    (28)
\Longleftrightarrow I(A;B) = 0 ,    (29)

as we would expect from a reasonable definition of mutual information.


Another consequence of the non-negativity of mutual information is that conditioning can only reduce
uncertainty,
H(A|B) \overset{(19,27)}{\le} H(A) .    (30)
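The three equivalent expressions (18)-(20) for the mutual information, its non-negativity, and inequality (30) can all be checked on a small made-up joint distribution; a sketch:

    import numpy as np

    P_ab = np.array([[0.25, 0.25],                      # made-up joint distribution P(a, b)
                     [0.40, 0.10]])
    P_a = P_ab.sum(axis=1)                              # marginal P(a)
    P_b = P_ab.sum(axis=0)                              # marginal P(b)

    H = lambda p: -np.sum(p * np.log2(p))               # entropy in bits (all entries > 0 here)

    H_A, H_B, H_AB = H(P_a), H(P_b), H(P_ab)
    H_A_given_B = -np.sum(P_ab * np.log2(P_ab / P_b))              # (4)
    H_B_given_A = -np.sum(P_ab * np.log2(P_ab / P_a[:, None]))     # (4) with roles swapped

    I1 = H_A + H_B - H_AB                               # (18)
    I2 = H_A - H_A_given_B                              # (19)
    I3 = H_B - H_B_given_A                              # (20)

    print(np.allclose([I1, I1], [I2, I3]), I1 >= 0)     # all equal and non-negative -> True True
    print(H_A_given_B <= H_A)                           # conditioning reduces uncertainty (30) -> True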

References
Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley & Sons, Inc., New York.

Wiskott, L. (2013). Lecture notes on Bayesian theory and graphical models. Available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/index.html.
