
Information Theory

The Small Plates

Contents

1 Information theory
   1.1 Overview
   1.2 Historical background
   1.3 Quantities of information
       1.3.1 Entropy of an information source
       1.3.2 Joint entropy
       1.3.3 Conditional entropy (equivocation)
       1.3.4 Mutual information (transinformation)
       1.3.5 Kullback–Leibler divergence (information gain)
       1.3.6 Other quantities
   1.4 Coding theory
       1.4.1 Source theory
       1.4.2 Channel capacity
   1.5 Applications to other fields
       1.5.1 Intelligence uses and secrecy applications
       1.5.2 Pseudorandom number generation
       1.5.3 Seismic exploration
       1.5.4 Semiotics
       1.5.5 Miscellaneous applications
   1.6 See also
       1.6.1 Applications
       1.6.2 History
       1.6.3 Theory
       1.6.4 Concepts
   1.7 References
       1.7.1 The classic work
       1.7.2 Other journal articles
       1.7.3 Textbooks on information theory
       1.7.4 Other books
       1.7.5 MOOC on information theory
   1.8 External links

2 Self-information
   2.1 Definition
   2.2 Examples
   2.3 Self-information of a partitioning
   2.4 Relationship to entropy
   2.5 References
   2.6 External links

3 Entropy (information theory)
   3.1 Introduction
   3.2 Definition
   3.3 Example
   3.4 Rationale
   3.5 Aspects
       3.5.1 Relationship to thermodynamic entropy
       3.5.2 Entropy as information content
       3.5.3 Entropy as a measure of diversity
       3.5.4 Data compression
       3.5.5 World's technological capacity to store and communicate information
       3.5.6 Limitations of entropy as information content
       3.5.7 Limitations of entropy in cryptography
       3.5.8 Data as a Markov process
       3.5.9 b-ary entropy
   3.6 Efficiency
   3.7 Characterization
       3.7.1 Continuity
       3.7.2 Symmetry
       3.7.3 Maximum
       3.7.4 Additivity
   3.8 Further properties
   3.9 Extending discrete entropy to the continuous case
       3.9.1 Differential entropy
       3.9.2 Limiting Density of Discrete Points
       3.9.3 Relative entropy
   3.10 Use in combinatorics
       3.10.1 Loomis–Whitney inequality
       3.10.2 Approximation to binomial coefficient
   3.11 See also
   3.12 References
   3.13 Further reading
       3.13.1 Textbooks on information theory
   3.14 External links

4 Binary entropy function
   4.1 Explanation
   4.2 Derivative
   4.3 Taylor series
   4.4 See also
   4.5 References

5 Differential entropy
   5.1 Definition
   5.2 Properties of differential entropy
   5.3 Maximization in the normal distribution
   5.4 Example: Exponential distribution
   5.5 Differential entropies for various distributions
   5.6 Variants
   5.7 See also
   5.8 References
   5.9 External links

6 Diversity index
   6.1 True diversity
   6.2 Richness
   6.3 Shannon index
       6.3.1 Rényi entropy
   6.4 Simpson index
       6.4.1 Inverse Simpson index
       6.4.2 Gini–Simpson index
   6.5 Berger–Parker index
   6.6 See also
   6.7 References
   6.8 Further reading
   6.9 External links

7 Conditional entropy
   7.1 Definition
   7.2 Chain rule
   7.3 Bayes' rule
   7.4 Generalization to quantum theory
   7.5 Other properties
   7.6 See also
   7.7 References

8 Joint entropy
   8.1 Definition
   8.2 Properties
       8.2.1 Greater than individual entropies
       8.2.2 Less than or equal to the sum of individual entropies
   8.3 Relations to other entropy measures
   8.4 References

9 Mutual information
   9.1 Definition
   9.2 Relation to other quantities
   9.3 Variations
       9.3.1 Metric
       9.3.2 Conditional mutual information
       9.3.3 Multivariate mutual information
       9.3.4 Directed information
       9.3.5 Normalized variants
       9.3.6 Weighted variants
       9.3.7 Adjusted mutual information
       9.3.8 Absolute mutual information
       9.3.9 Linear correlation
       9.3.10 For discrete data
   9.4 Applications
   9.5 See also
   9.6 Notes
   9.7 References

10 Cross entropy
   10.1 Motivation
   10.2 Estimation
   10.3 Cross-entropy minimization
   10.4 Cross-entropy error function and logistic regression
   10.5 See also
   10.6 References
   10.7 External links
   10.8 Text and image sources, contributors, and licenses
       10.8.1 Text
       10.8.2 Images
       10.8.3 Content license

Chapter 1

Information theory
Not to be confused with information science.
Information theory studies the quantification, storage, and communication of information. It was originally proposed by Claude E. Shannon in 1948 to find fundamental limits on signal processing and communication operations such as data compression, in a landmark paper entitled "A Mathematical Theory of Communication". Now this theory has found applications in many other areas, including statistical inference, natural language processing, cryptography, neurobiology,[1] the evolution[2] and function[3] of molecular codes, model selection in ecology,[4] thermal physics,[5] quantum computing, linguistics, plagiarism detection,[6] pattern recognition, and anomaly detection.[7]

A key measure in information theory is "entropy". Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (with two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a roll of a die (with six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy.

Applications of fundamental topics of information theory include lossless data compression (e.g. ZIP files), lossy data compression (e.g. MP3s and JPEGs), and channel coding (e.g. for Digital Subscriber Line (DSL)).

The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering. Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the compact disc, the feasibility of mobile phones, the development of the Internet, the study of linguistics and of human perception, the understanding of black holes, and numerous other fields. Important sub-fields of information theory include source coding, channel coding, algorithmic complexity theory, algorithmic information theory, information-theoretic security, and measures of information.

1.1 Overview
Information theory studies the transmission, processing, utilization, and extraction of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was made concrete in 1948 by Claude Shannon in his paper "A Mathematical Theory of Communication", in which information is thought of as a set of possible messages, where the goal is to send these messages over a noisy channel, and then to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.[1]

Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half-century or more: adaptive systems, anticipatory systems, artificial intelligence, complex systems, complexity science, cybernetics, informatics, machine learning, along with systems sciences of many descriptions. Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of coding theory.

Coding theory is concerned with finding explicit methods, called codes, for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible. A third class of information theory codes are cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis. See the article ban (unit) for a historical application.

Information theory is also used in information retrieval, intelligence gathering, gambling, statistics, and even in musical composition.

1.2 Historical background


Main article: History of information theory
The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948.

Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. Harry Nyquist's 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K log m (recalling Boltzmann's constant), where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = log S^n = n log S, where S was the number of possible symbols, and n the number of symbols in a transmission. The unit of information was therefore the decimal digit, much later renamed the hartley in his honour as a unit or scale or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers.

Much of the mathematics behind information theory with events of different probabilities were developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory.

In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that

"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."

With it came the ideas of

- the information entropy and redundancy of a source, and its relevance through the source coding theorem;
- the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;
- the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; as well as
- the bit, a new way of seeing the most fundamental unit of information.

1.3 Quantities of information


Main article: Quantities of information


Information theory is based on probability theory and statistics. Information theory often concerns itself with measures of information of the distributions associated with random variables. Important quantities of information are entropy, a measure of information in a single random variable, and mutual information, a measure of information in common between two random variables. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.

The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the hartley, which is based on the common logarithm.

In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0. This is justified because lim_{p→0+} p log p = 0 for any logarithmic base.

1.3.1 Entropy of an information source

Based on the probability mass function of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by

H = −Σi pi log2(pi)

where pi is the probability of occurrence of the i-th possible value of the source symbol. This equation gives the entropy in units of bits (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the "shannon" in his honor. Entropy is also commonly computed using the natural logarithm (base e, where e is Euler's number), which produces a measurement of entropy in "nats" per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2^8 = 256 will produce a measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys) per symbol.

Intuitively, the entropy HX of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known.

The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid) is N·H bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N·H.
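As a quick illustration of the entropy formula and its units, the following Python sketch (illustrative only, not part of the original article) measures the same fair-coin distribution in bits, nats, and hartleys, and computes the N·H entropy of an iid message:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution; p*log(p) is taken as 0 when p = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
print(entropy(fair_coin))                # 1.0 shannon (bit) per symbol
print(entropy(fair_coin, base=math.e))   # ~0.693 nats per symbol
print(entropy(fair_coin, base=10))       # ~0.301 hartleys per symbol

# An iid source emitting N such symbols carries N*H bits per message.
N = 1000
print(N * entropy(fair_coin))            # 1000.0 bits
```

The `if p > 0` guard implements the p log p = 0 convention for zero-probability symbols noted above.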
Suppose one transmits 1000 bits (0s and 1s). If the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If 𝒳 is the set of all messages {x1, ..., xn} that X could be, and p(x) is the probability of some x ∈ 𝒳, then the entropy, H, of X is defined:[8]

H(X) = EX[I(x)] = −Σx∈𝒳 p(x) log p(x).

(Here, I(x) is the self-information, which is the entropy contribution of an individual message, and EX is the expected value.) A property of entropy is that it is maximized when all the messages in the message space are equiprobable, p(x) = 1/n; i.e., most unpredictable, in which case H(X) = log n.
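The maximization property can be checked numerically. In this hypothetical sketch, a uniform distribution over six outcomes (a fair die) attains H(X) = log2 6, while a skewed distribution over the same six outcomes stays strictly below it:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 6
uniform = [1 / n] * n
skewed = [0.5, 0.25, 0.1, 0.05, 0.05, 0.05]   # a hypothetical loaded die

print(entropy_bits(uniform))   # log2(6) ≈ 2.585 bits, the maximum
print(entropy_bits(skewed))    # ≈ 1.980 bits, strictly less
```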
The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit:

Hb(p) = −p log2 p − (1 − p) log2(1 − p).

[Figure: The entropy of a Bernoulli trial as a function of success probability Pr(X = 1), often called the binary entropy function, Hb(p). The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.]
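The binary entropy function plotted above is straightforward to evaluate; a minimal sketch:

```python
import math

def binary_entropy(p):
    """Hb(p) = -p log2 p - (1-p) log2 (1-p), with Hb(0) = Hb(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: an unbiased coin toss is maximally uncertain
print(binary_entropy(0.9))   # ~0.469 bits: a biased coin is more predictable
print(binary_entropy(0.1))   # same as Hb(0.9): the function is symmetric about 0.5
```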

1.3.2 Joint entropy

The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.

For example, if (X, Y) represents the position of a chess piece (X the row and Y the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.

H(X, Y) = EX,Y[−log p(x, y)] = −Σx,y p(x, y) log p(x, y)

Despite similar notation, joint entropy should not be confused with cross entropy.
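Additivity under independence is easy to verify numerically. In this sketch (two hypothetical independent fair coins), the joint entropy of the pair equals the sum of the marginal entropies:

```python
import math

def H(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf of two independent fair coins: p(x, y) = p(x) p(y) = 1/4 for each pair.
joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

H_joint = H(joint.values())   # 2.0 bits
H_x = H([0.5, 0.5])           # 1.0 bit
H_y = H([0.5, 0.5])           # 1.0 bit
print(H_joint, H_x + H_y)     # equal, because X and Y are independent
```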

1.3.3 Conditional entropy (equivocation)

The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation
of X about Y) is the average conditional entropy over Y:[9]


H(X|Y) = EY[H(X|y)] = −Σy∈Y p(y) Σx∈X p(x|y) log p(x|y) = −Σx,y p(x, y) log (p(x, y)/p(y)).

Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:

H(X|Y) = H(X, Y) − H(Y).
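This identity can be checked on a small joint distribution. The numbers below are hypothetical; the direct definition and the chain-rule form give the same answer:

```python
import math

# A small hypothetical joint pmf p(x, y).
p = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

# Marginal distribution of Y.
py = {}
for (x, y), pxy in p.items():
    py[y] = py.get(y, 0.0) + pxy

# Direct definition: H(X|Y) = -sum p(x,y) log2 [p(x,y) / p(y)]
H_x_given_y = -sum(pxy * math.log2(pxy / py[y]) for (x, y), pxy in p.items())

# Chain-rule form: H(X|Y) = H(X, Y) - H(Y)
H_xy = -sum(pxy * math.log2(pxy) for pxy in p.values())
H_y = -sum(q * math.log2(q) for q in py.values())
print(H_x_given_y, H_xy - H_y)   # both 0.5 bits
```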

1.3.4 Mutual information (transinformation)

Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information
shared between sent and received signals. The mutual information of X relative to Y is given by:

I(X; Y) = EX,Y[SI(x, y)] = Σx,y p(x, y) log [p(x, y) / (p(x) p(y))]

where SI (Specific mutual Information) is the pointwise mutual information.


A basic property of the mutual information is that

I(X; Y) = H(X) − H(X|Y).


That is, knowing Y, we can save an average of I(X; Y) bits in encoding X compared to not knowing Y.
Mutual information is symmetric:

I(X; Y) = I(Y; X) = H(X) + H(Y) − H(X, Y).
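These identities can be verified numerically. The joint distribution below is hypothetical; the defining sum for I(X; Y) and the entropy form H(X) + H(Y) − H(X, Y) agree:

```python
import math

def H(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# Hypothetical joint pmf of two correlated binary variables.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

# Definition: I(X;Y) = sum over x,y of p(x,y) log2 [p(x,y) / (p(x) p(y))]
I_def = sum(pxy * math.log2(pxy / (px[x] * py[y])) for (x, y), pxy in p.items())

# Entropy form: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_ent = H(px.values()) + H(py.values()) - H(p.values())
print(I_def, I_ent)   # both ≈ 0.278 bits
```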


Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the
posterior probability distribution of X given the value of Y and the prior distribution on X:

I(X; Y) = Ep(y)[DKL(p(X|Y = y) ∥ p(X))].


In other words, this is a measure of how much, on the average, the probability distribution on X will change if we
are given the value of Y. This is often recalculated as the divergence from the product of the marginal distributions
to the actual joint distribution:

I(X; Y) = DKL(p(X, Y) ∥ p(X) p(Y)).


Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
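To illustrate mutual information as an independence statistic, the following sketch computes the plug-in estimate from a contingency table of counts (the counts are hypothetical) together with the corresponding log-likelihood-ratio statistic, G = 2N·I with I in nats:

```python
import math

# Hypothetical 2x2 contingency table of observed counts (rows: X, columns: Y).
table = [[30, 10],
         [10, 50]]
N = sum(sum(row) for row in table)
px = [sum(row) / N for row in table]
py = [sum(table[i][j] for i in range(2)) / N for j in range(2)]

# Plug-in mutual information estimate, in nats.
I_nats = sum(
    (table[i][j] / N) * math.log((table[i][j] / N) / (px[i] * py[j]))
    for i in range(2) for j in range(2) if table[i][j] > 0
)
G = 2 * N * I_nats   # the G-test statistic for assessing independence
print(I_nats, G)     # ≈ 0.178 nats, G ≈ 35.5
```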

1.3.5 Kullback–Leibler divergence (information gain)

The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined

DKL(p(X) ∥ q(X)) = Σx∈X −p(x) log q(x) − Σx∈X −p(x) log p(x) = Σx∈X p(x) log (p(x)/q(x)).

Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).

Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
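A small numerical sketch (the distributions are chosen purely for illustration) makes both points: the divergence is the expected extra surprisal Bob pays for his wrong prior, and it is not symmetric:

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # the true distribution: a fair coin
q = [0.9, 0.1]   # Bob's mistaken prior

print(kl_bits(p, q))   # ~0.737 extra bits per outcome when coding with q
print(kl_bits(q, p))   # ~0.531 bits: a different value, so KL is not symmetric
print(kl_bits(p, p))   # 0.0: no penalty when the prior is correct
```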

1.3.6 Other quantities

Other important information-theoretic quantities include Rényi entropy (a generalization of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information.

1.4 Coding theory


Main article: Coding theory
Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.

[Figure: A picture showing scratches on the readable surface of a CD-R. Music and data CDs are coded using error-correcting codes and thus can still be read even if they have minor scratches, using error detection and correction.]
Data compression (source coding): There are two formulations for the compression problem:
1. lossless data compression: the data must be reconstructed exactly;
2. lossy data compression: allocates bits needed to reconstruct the data, within a specied delity level measured
by a distortion function. This subset of Information theory is called ratedistortion theory.


Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems, that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary helpers (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. Network information theory refers to these multi-agent communication models.

1.4.1 Source theory

Any process that generates successive messages can be considered a source of information. A memoryless source
is one in which each message is an independent identically distributed random variable, whereas the properties of
ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well
studied in their own right outside information theory.
Rate
Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each
symbol, while, in the case of a stationary stochastic process, it is

r = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, X_{n−3}, …);

that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a
process that is not necessarily stationary, the average rate is

r = lim_{n→∞} (1/n) H(X_1, X_2, …, X_n);

that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.[10]
It is common in information theory to speak of the rate or entropy of a language. This is appropriate, for example,
when the source of information is English prose. The rate of a source of information is related to its redundancy and
how well it can be compressed, the subject of source coding.
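For a stationary Markov source, the conditional-entropy limit above reduces to the entropy of the next symbol given the current state, averaged over the stationary distribution. A small sketch (the two-state transition matrix is illustrative):

```python
import math

# Hypothetical two-state Markov source: P[i][j] = Pr(next = j | current = i).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution pi solves pi = pi P; closed-form for two states.
pi0 = P[1][0] / (P[0][1] + P[1][0])   # 0.4 / 0.5 = 0.8
pi = [pi0, 1 - pi0]

def H(row):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in row if p > 0)

# Entropy rate: conditional entropy of the next symbol given the state.
rate = sum(pi[i] * H(P[i]) for i in range(2))

# An i.i.d. source with the same symbol frequencies has higher entropy,
# because ignoring the memory discards predictive structure.
assert rate <= H(pi) + 1e-12
```

This illustrates why the rate of a source with memory (such as English prose) is lower than the per-symbol entropy of its letter frequencies alone.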

1.4.2 Channel capacity

Main article: Channel capacity


Communications over a channel, such as an Ethernet cable, is the primary motivation of information theory. As anyone who's ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality. How much information can one hope to communicate over a noisy (or otherwise imperfect) channel?
Consider the communications process over a discrete channel. A simple model of the process is shown below:
Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over
our channel. Let p(y|x) be the conditional probability distribution function of Y given X. We will consider p(y|x) to be
an inherent xed property of our communications channel (representing the nature of the noise of our channel). Then
the joint distribution of X and Y is completely determined by our channel and by our choice of f(x), the marginal
distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize
the rate of information, or the signal, we can communicate over the channel. The appropriate measure for this is the
mutual information, and this maximum mutual information is called the channel capacity and is given by:

Transmitter → (noisy) Channel → Receiver
C = max_f I(X; Y).

This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol). For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error.
Channel coding is concerned with nding such nearly optimal codes that can be used to transmit data over a noisy
channel with a small coding error at a rate near the channel capacity.

Capacity of particular channel models


A continuous-time analog communications channel subject to Gaussian noise: see Shannon–Hartley theorem.
A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p. The BSC has a capacity of 1 − H_b(p) bits per channel use, where H_b is the binary entropy function to the base-2 logarithm:

[Figure: BSC transition diagram. Each input bit is received unchanged with probability 1−p and flipped with probability p.]

A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is 1 − p bits per channel use.

[Figure: BEC transition diagram. Each input bit is received correctly with probability 1−p and erased to 'e' with probability p.]
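Both capacity formulas above are easy to evaluate directly (a minimal sketch; the probabilities used are illustrative):

```python
import math

def binary_entropy(p):
    """H_b(p) in bits, with the endpoint convention H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of the binary symmetric channel with crossover probability p."""
    return 1 - binary_entropy(p)

def bec_capacity(p):
    """Capacity of the binary erasure channel with erasure probability p."""
    return 1 - p

# A noiseless channel carries a full bit; a coin-flip BSC carries none.
assert bsc_capacity(0.0) == 1.0
assert abs(bsc_capacity(0.5)) < 1e-12

# For the same p < 1/2, an erasure (which announces itself) is less
# damaging than a silent bit flip.
assert bec_capacity(0.11) > bsc_capacity(0.11)
```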


1.5 Applications to other fields


1.5.1 Intelligence uses and secrecy applications

Information theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used
in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe.
Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.
Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force
attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key
algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods currently
comes from the assumption that no known attack can break them in a practical amount of time.
Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force
attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned
on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and
ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be
able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However,
as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure
methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of
key material.

1.5.2 Pseudorandom number generation

Pseudorandom number generators are widely available in computer language libraries and application programs.
They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern
computer equipment and software. A class of improved random number generators is termed cryptographically
secure pseudorandom number generators, but even they require random seeds external to the software to work as
intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptographic uses.

1.5.3 Seismic exploration

One early commercial application of information theory was in the field of seismic oil exploration. Work in this field made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.[11]

1.5.4 Semiotics

Concepts from information theory such as redundancy and code control have been used by semioticians such as
Umberto Eco and Ferruccio Rossi-Landi to explain ideology as a form of message transmission whereby a dominant
social class emits its message by using signs that exhibit a high degree of redundancy such that only one message is
decoded among a selection of competing ones.[12]

1.5.5 Miscellaneous applications

Information theory also has applications in gambling and investing, black holes, and bioinformatics.

1.6 See also


Algorithmic probability



Algorithmic information theory
Bayesian inference
Communication theory
Constructor theory - a generalization of information theory that includes quantum information
Inductive probability
Minimum message length
Minimum description length
List of important publications
Philosophy of information

1.6.1 Applications

Active networking
Cryptanalysis
Cryptography
Cybernetics
Entropy in thermodynamics and information theory
Gambling
Intelligence (information gathering)
Seismic exploration

1.6.2 History

Hartley, R.V.L.
History of information theory
Shannon, C.E.
Timeline of information theory
Yockey, H.P.

1.6.3 Theory

Coding theory
Detection theory
Estimation theory
Fisher information
Information algebra
Information asymmetry
Information eld theory
Information geometry



Information theory and measure theory
Kolmogorov complexity
Logic of information
Network coding
Philosophy of Information
Quantum information science
Semiotic information theory
Source coding
Unsolved Problems

1.6.4 Concepts

Ban (unit)
Channel capacity
Channel (communications)
Communication source
Conditional entropy
Covert channel
Decoder
Differential entropy
Encoder
Information entropy
Joint entropy
KullbackLeibler divergence
Mutual information
Pointwise mutual information (PMI)
Receiver (information theory)
Redundancy
Rényi entropy
Self-information
Unicity distance
Variety


1.7 References
[1] F. Rieke; D. Warland; R Ruyter van Steveninck; W Bialek (1997). Spikes: Exploring the Neural Code. The MIT press.
ISBN 978-0262681087.
[2] cf. Huelsenbeck, J. P., F. Ronquist, R. Nielsen and J. P. Bollback (2001) Bayesian inference of phylogeny and its impact
on evolutionary biology, Science 294:2310-2314
[3] Rando Allikmets, Wyeth W. Wasserman, Amy Hutchinson, Philip Smallwood, Jeremy Nathans, Peter K. Rogan, Thomas
D. Schneider, Michael Dean (1998) Organization of the ABCR gene: analysis of promoter and splice junction sequences,
Gene 215:1, 111-122
[4] Burnham, K. P. and Anderson D. R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic
Approach, Second Edition (Springer Science, New York) ISBN 978-0-387-95364-9.
[5] Jaynes, E. T. (1957) Information Theory and Statistical Mechanics, Phys. Rev. 106:620
[6] Charles H. Bennett, Ming Li, and Bin Ma (2003) Chain Letters and Evolutionary Histories, Scientific American 288:6,
76-81
[7] David R. Anderson (November 1, 2003). Some background on why people in the empirical sciences may want to better
understand the information-theoretic methods (pdf). Retrieved 2010-06-23.
[8] Fazlollah M. Reza (1994) [1961]. An Introduction to Information Theory. Dover Publications, Inc., New York. ISBN
0-486-68210-2.
[9] Robert B. Ash (1990) [1965]. Information Theory. Dover Publications, Inc. ISBN 0-486-66521-6.
[10] Jerry D. Gibson (1998). Digital Compression for Multimedia: Principles and Standards. Morgan Kaufmann. ISBN 1-55860-369-7.
[11] The Corporation and Innovation, Haggerty, Patrick, Strategic Management Journal, Vol. 2, 97-118 (1981)
[12] Semiotics of Ideology, Nöth, Winfried, Semiotica, Issue 148, (1981)

1.7.1 The classic work

Shannon, C.E. (1948), "A Mathematical Theory of Communication", Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October, 1948. PDF.
Notes and other formats.
R.V.L. Hartley, Transmission of Information, Bell System Technical Journal, July 1928
Andrey Kolmogorov (1968), Three approaches to the quantitative definition of information in International Journal of Computer Mathematics.

1.7.2 Other journal articles

J. L. Kelly, Jr., Saratoga.ny.us, A New Interpretation of Information Rate, Bell System Technical Journal, Vol. 35, July 1956, pp. 917–26.
R. Landauer, IEEE.org, Information is Physical, Proc. Workshop on Physics and Computation PhysComp'92 (IEEE Comp. Sci. Press, Los Alamitos, 1993) pp. 1–4.
R. Landauer, IBM.com, Irreversibility and Heat Generation in the Computing Process IBM J. Res. Develop.
Vol. 5, No. 3, 1961
Timme, Nicholas; Alford, Wesley; Flecker, Benjamin; Beggs, John M. (2012). Multivariate information measures: an experimentalist's perspective. arXiv:1111.6857 [cs.IT].


1.7.3 Textbooks on information theory

Arndt, C. Information Measures, Information and its Description in Science and Engineering (Springer Series:
Signals and Communication Technology), 2004, ISBN 978-3-540-40855-0
Ash, RB. Information Theory. New York: Interscience, 1965. ISBN 0-470-03445-9. New York: Dover 1990.
ISBN 0-486-66521-6
Gallager, R. Information Theory and Reliable Communication. New York: John Wiley and Sons, 1968. ISBN
0-471-29048-3
Goldman, S. Information Theory. New York: Prentice Hall, 1953. New York: Dover 1968 ISBN 0-486-62209-6, 2005 ISBN 0-486-44271-3
Cover, Thomas; Thomas, Joy A. (2006). Elements of information theory (2nd ed.). New York: Wiley-Interscience. ISBN 0-471-24195-4.
Csiszár, I., Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Akadémiai Kiadó: 2nd edition, 1997. ISBN 963-05-7440-3
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1
Mansuripur, M. Introduction to Information Theory. New York: Prentice Hall, 1987. ISBN 0-13-484668-0
McEliece, R. The Theory of Information and Coding. Cambridge, 2002. ISBN 978-0521831857
Pierce, JR. An introduction to information theory: symbols, signals and noise. Dover (2nd Edition). 1961
(reprinted by Dover 1980).
Reza, F. An Introduction to Information Theory. New York: McGraw-Hill 1961. New York: Dover 1994.
ISBN 0-486-68210-2
Shannon, Claude; Weaver, Warren (1949). The Mathematical Theory of Communication (PDF). Urbana, Illinois: University of Illinois Press. ISBN 0-252-72548-4. LCCN 49-11922.
Stone, JV. Chapter 1 of book Information Theory: A Tutorial Introduction, University of Sheffield, England, 2014. ISBN 978-0956372857.
Yeung, RW. A First Course in Information Theory. Kluwer Academic/Plenum Publishers, 2002. ISBN 0-306-46791-7.
Yeung, RW. Information Theory and Network Coding Springer 2008, 2002. ISBN 978-0-387-79233-0

1.7.4 Other books

Leon Brillouin, Science and Information Theory, Mineola, N.Y.: Dover, [1956, 1962] 2004. ISBN 0-486-43918-6
James Gleick, The Information: A History, a Theory, a Flood, New York: Pantheon, 2011. ISBN 978-0-375-42372-7
A. I. Khinchin, Mathematical Foundations of Information Theory, New York: Dover, 1957. ISBN 0-486-60434-9
H. S. Leff and A. F. Rex, Editors, Maxwell's Demon: Entropy, Information, Computing, Princeton University Press, Princeton, New Jersey (1990). ISBN 0-691-08727-X
Robert K. Logan. What is Information? - Propagating Organization in the Biosphere, the Symbolosphere, the
Technosphere and the Econosphere,
Toronto: DEMO Publishing.
Tom Siegfried, The Bit and the Pendulum, Wiley, 2000. ISBN 0-471-32174-5



Charles Seife, Decoding the Universe, Viking, 2006. ISBN 0-670-03441-X
Jeremy Campbell, Grammatical Man, Touchstone/Simon & Schuster, 1982, ISBN 0-671-44062-4
Henri Theil, Economics and Information Theory, Rand McNally & Company - Chicago, 1967.
Escolano, Suau, Bonev, Information Theory in Computer Vision and Pattern Recognition, Springer, 2009. ISBN
978-1-84882-296-2

1.7.5 MOOC on information theory

Raymond W. Yeung, "Information Theory" (The Chinese University of Hong Kong)

1.8 External links


Erill I. (2012), "A gentle introduction to information content in transcription factor binding sites" (University
of Maryland, Baltimore County)
Hazewinkel, Michiel, ed. (2001), Information, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608010-4
Lambert F. L. (1999), "Shuffled Cards, Messy Desks, and Disorderly Dorm Rooms - Examples of Entropy Increase? Nonsense!", Journal of Chemical Education
Schneider T. D. (2014), "Information Theory Primer"
Srinivasa, S., "A Review on Multivariate Mutual Information"
IEEE Information Theory Society and ITSoc review articles

Chapter 2

Self-information
In information theory, self-information or surprisal is a measure of the information content associated with an
event in a probability space or with the value of a discrete random variable. It is expressed in a unit of information,
for example bits, nats, or hartleys, depending on the base of the logarithm used in its calculation.
The term self-information is also sometimes used as a synonym of the related information-theoretic concept of entropy. These two meanings are not equivalent, and this article covers the first sense only.

2.1 Definition
By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero.
For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin: "Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning." Assuming one does not reside near the Earth's poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.
When the content of a message is known a priori with certainty, with probability of 1, there is no actual information
conveyed in the message. Only when the advanced knowledge of the content of the message by the receiver is less
certain than 100% does the message actually convey information.
Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event, ω_n, depends only on the probability of that event.

I(ω_n) = f(P(ω_n))

for some function f(·) to be determined below. If P(ω_n) = 1, then I(ω_n) = 0. If P(ω_n) < 1, then I(ω_n) > 0.
Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event C is the intersection of two independent events A and B, then the information of event C occurring is the compound message of both independent events A and B occurring. The quantity of information of compound message C would be expected to equal the sum of the amounts of information of the individual component messages A and B respectively:

I(C) = I(A ∩ B) = I(A) + I(B)

Because of the independence of events A and B, the probability of event C is

P(C) = P(A ∩ B) = P(A) · P(B)



However, applying the function f(·) results in

I(C) = I(A) + I(B)
f(P(C)) = f(P(A)) + f(P(B)) = f(P(A) · P(B))

The class of functions f(·) having the property such that

f(x · y) = f(x) + f(y)

is the logarithm function of any base. The only operational difference between logarithms of different bases is that of different scaling constants:

f(x) = K log(x)
Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, this requires that K < 0.
Taking into account these properties, the self-information I(ω_n) associated with outcome ω_n with probability P(ω_n) is defined as:

I(ω_n) = −log(P(ω_n)) = log(1 / P(ω_n))
The smaller the probability of event ω_n, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of I(ω_n) is bits. This is the most common practice. When using the natural logarithm of base e, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.
As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 bits (probability 1/16), and the information content associated with getting a result other than the one specified would be 0.09 bits (probability 15/16). See below for detailed examples.
This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics.
The information entropy of a random event is the expected value of its self-information.
Self-information is an example of a proper scoring rule.

2.2 Examples
On tossing a coin, the chance of 'tail' is 0.5. When it is proclaimed that indeed 'tail' occurred, this amounts to
I('tail') = log2 (1/0.5) = log2 2 = 1 bit of information.
When throwing a fair die, the probability of 'four' is 1/6. When it is proclaimed that 'four' has been thrown,
the amount of self-information is
I('four') = log2 (1/(1/6)) = log2 (6) = 2.585 bits.
When, independently, two dice are thrown, the amount of information associated with {throw 1 = 'two' &
throw 2 = 'four'} equals


I('throw 1 is two & throw 2 is four') = log2 (1/P(throw 1 = 'two' & throw 2 = 'four')) = log2 (1/(1/36))
= log2 (36) = 5.170 bits.
This outcome equals the sum of the individual amounts of self-information associated with {throw 1 =
'two'} and {throw 2 = 'four'}; namely 2.585 + 2.585 = 5.170 bits.
In the same two dice situation we can also consider the information present in the statement "The sum of the two dice is five":
I('The sum of throws 1 and 2 is five') = log2(1/P('throws 1 and 2 sum to five')) = log2(1/(4/36)) = 3.17 bits. The (4/36) is because there are four ways out of 36 possible to sum two dice to 5. This shows how more complex or ambiguous events can still carry information.
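The worked examples above can be verified mechanically (a minimal sketch):

```python
import math

def surprisal(p):
    """Self-information in bits of an event with probability p."""
    return -math.log2(p)

# Coin: Pr('tail') = 1/2 gives 1 bit.
assert surprisal(1/2) == 1.0

# Die: Pr('four') = 1/6 gives about 2.585 bits.
assert abs(surprisal(1/6) - 2.585) < 1e-3

# Two independent dice: surprisals add, as additivity requires.
assert abs(surprisal(1/36) - 2 * surprisal(1/6)) < 1e-9

# Sum of two dice equals five: 4 outcomes of 36, about 3.17 bits.
assert abs(surprisal(4/36) - math.log2(9)) < 1e-9
```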

2.3 Self-information of a partitioning


The self-information of a partitioning of elements within a set (or clustering) is the expectation of the information
of a test object; if we select an element at random and observe in which partition/cluster it exists, what quantity of
information do we expect to obtain? The information of a partitioning C with P(k) denoting the fraction of elements
within partition k is [1]

I(C) = E(−log(P(C))) = −Σ_k P(k) log(P(k))
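The partition information above is simply the entropy of the cluster-size distribution; for example (a minimal sketch with an illustrative clustering):

```python
import math

# Hypothetical clustering of 10 elements into three partitions.
clusters = {'a': 5, 'b': 3, 'c': 2}
total = sum(clusters.values())

# P(k): the fraction of elements in partition k.
fractions = [n / total for n in clusters.values()]

# I(C) = -sum_k P(k) log2 P(k): expected bits needed to learn which
# partition a randomly chosen element belongs to.
info = -sum(p * math.log2(p) for p in fractions)

# Bounded above by log2(number of partitions), with equality exactly
# when all partitions have the same size.
assert info <= math.log2(len(clusters)) + 1e-12
```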

2.4 Relationship to entropy


The entropy is the expected value of the self-information of the values of a discrete random variable. Sometimes, the entropy itself is called the self-information of the random variable, possibly because the entropy satisfies H(X) = I(X; X), where I(X; X) is the mutual information of X with itself.[2]

2.5 References
[1] Marina Meilă; Comparing clusterings: an information based distance; Journal of Multivariate Analysis, Volume 98, Issue 5, May 2007
[2] Thomas M. Cover, Joy A. Thomas; Elements of Information Theory; p. 20; 1991.

C.E. Shannon, A Mathematical Theory of Communication, Bell Syst. Techn. J., Vol. 27, pp. 379–423 (Part I), 1948.

2.6 External links


Examples of surprisal measures
Surprisal entry in a glossary of molecular information theory
Bayesian Theory of Surprise

Chapter 3

Entropy (information theory)

2 shannons of entropy: In the case of two fair coin tosses, the information entropy is the log-base-2 of the number of possible outcomes;
with two coins there are four outcomes, and the entropy is two bits. Generally, information entropy is the average information of all
possible outcomes.

In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modies the message in some way. The receiver attempts to
infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. Messages can be modeled by any flow of information.
In a more technical sense, there are reasons (explained below) to define information as the negative of the logarithm of the probability distribution. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose expected value is the average amount of information, or entropy, generated by this distribution. Units of entropy are the shannon, nat, or hartley, depending on the base of the logarithm used to define it, though the shannon is commonly referred to as a bit.
The logarithm of the probability distribution is useful as a measure of entropy because it is additive for independent
sources. For instance, the entropy of a coin toss is 1 shannon, whereas of m tosses it is m shannons. Generally, you
need log2 (n) bits to represent a variable that can take one of n values if n is a power of 2. If these values are equally
probable, the entropy (in shannons) is equal to the number of bits. Equality between number of bits and shannons
holds only while all outcomes are equally probable. If one of the events is more probable than others, observation of

that event is less informative. Conversely, rarer events provide more information when observed. Since observation of less probable events occurs more rarely, the net effect is that the entropy (thought of as average information) received from non-uniformly distributed data is less than log2(n). Entropy is zero when one outcome is certain.
Shannon entropy quantifies all these considerations exactly when a probability distribution of the source is known. The meaning of the events observed (the meaning of messages) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.
Generally, entropy refers to disorder or uncertainty. Shannon entropy was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".[1] Shannon entropy provides an absolute limit on the best possible average length of lossless encoding or compression of an information source. Rényi entropy generalizes Shannon entropy.

3.1 Introduction
Entropy is a measure of unpredictability of information content. To get an intuitive understanding of these three
terms, consider the example of a political poll. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the entropy of the second poll result is small relative to the first.
Now consider the example of a coin toss. Assuming the probability of heads is the same as the probability of tails,
then the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome of
the coin toss ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be
correct with probability 1/2. Such a coin toss has one bit of entropy since there are two possible outcomes that occur
with equal probability, and learning the actual outcome contains one bit of information. Contrarily, a coin toss with a
coin that has two heads and no tails has zero entropy since the coin will always come up heads, and the outcome can
be predicted perfectly. Analogously, one binary bit has a log2 2 = 1 Shannon or bit entropy because it can have one
of two values (1 and 0). Similarly, one trit contains log2 3 (about 1.58496) bits of information because it can have
one of three values.
English text, treated as a string of characters, has fairly low entropy, i.e., is fairly predictable. Even if we do not know exactly what is going to come next, we can be fairly certain that, for example, there will be many more e's than z's, that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy for each character of message.[2][3]
If a compression scheme is lossless (that is, you can always recover the entire original message by decompressing), then a compressed message has the same quantity of information as the original, but communicated in fewer characters. That is, it has more information, or a higher entropy, per character. This means a compressed message has less redundancy. Roughly speaking, Shannon's source coding theorem says that a lossless compression scheme cannot compress messages, on average, to have more than one bit of information per bit of message, but that any value less than one bit of information per bit of message can be attained by employing a suitable coding scheme. The entropy of a message per bit multiplied by the length of that message is a measure of how much total information the message contains.
Intuitively, imagine that we wish to transmit sequences of the 4 characters A, B, C, and D. Thus, a message to be transmitted might be 'ABADDCAB'. Information theory gives a way of calculating the smallest possible amount of information that will convey this. If all 4 letters are equally likely (25%), we can do no better (over a binary channel) than to have 2 bits encode (in binary) each letter: A might code as '00', B as '01', C as '10', and D as '11'. Now suppose A occurs with 70% probability, B with 26%, and C and D with 2% each. We could assign variable length codes, so that receiving a '1' tells us to look at another bit unless we have already received 2 bits of sequential 1s. In this case, A would be coded as '0' (one bit), B as '10', and C and D as '110' and '111'. It is easy to see that 70% of the time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, then, fewer than 2 bits are required, since the entropy is lower (owing to the high prevalence of A followed by B, together 96% of characters). The calculation of the sum of probability-weighted log probabilities measures and captures this effect.
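This coding example can be checked numerically; the following sketch computes the expected code length and the source entropy for the stated probabilities:

```python
import math

probs = {'A': 0.70, 'B': 0.26, 'C': 0.02, 'D': 0.02}
code  = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

# Expected code length in bits per symbol under this prefix code.
avg_len = sum(p * len(code[s]) for s, p in probs.items())

# Shannon entropy of the source: the lower bound for any lossless code.
entropy = -sum(p * math.log2(p) for p in probs.values())

assert avg_len < 2.0        # beats the fixed 2-bit encoding
assert entropy <= avg_len   # but no code can beat the entropy
```

Here the expected length works out to 1.34 bits per symbol, against an entropy of about 1.09 bits, showing both the gain over the fixed-length code and the remaining gap to the theoretical limit.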


Shannons theorem also implies that no lossless compression scheme can shorten all messages. If some messages
come out shorter, at least one must come out longer due to the pigeonhole principle. In practical use, this is generally
not a problem, because we are usually only interested in compressing certain types of messages, for example English
documents as opposed to gibberish text, or digital photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or uninteresting sequences larger. However, the problem can still arise even in
everyday use when applying a compression algorithm to already compressed data: for example, making a ZIP file of
music, pictures or videos that are already in a compressed format such as FLAC, MP3, WebM, AAC, PNG or JPEG
will generally result in a ZIP file that is slightly larger than the source file(s).

3.2 Definition
Named after Boltzmann's Η-theorem, Shannon defined the entropy Η (Greek capital letter eta) of a discrete random variable
X with possible values {x1, …, xn} and probability mass function P(X) as:

H(X) = E[I(X)] = E[−ln(P(X))].


Here E is the expected value operator, and I is the information content of X.[4][5] I(X) is itself a random variable.
The entropy can explicitly be written as

H(X) = ∑_{i=1}^{n} P(xi) I(xi) = −∑_{i=1}^{n} P(xi) log_b P(xi),

where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of
entropy is shannon for b = 2, nat for b = e, and hartley for b = 10.[6] When b = 2, the units of entropy are also
commonly referred to as bits.
In the case of P(xi) = 0 for some i, the value of the corresponding summand 0 logb(0) is taken to be 0, which is
consistent with the limit:

lim_{p→0+} p log(p) = 0.

One may also define the conditional entropy of two events X and Y taking values xi and yj respectively, as

H(X|Y) = ∑_{i,j} p(xi, yj) log( p(yj) / p(xi, yj) ),

where p(xi, yj) is the probability that X = xi and Y = yj. This quantity should be understood as the amount of
randomness in the random variable X given the event Y.
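A minimal Python sketch of this formula follows (illustrative code, not part of the original text; the joint distribution is represented as a dict from (x, y) pairs to probabilities):

```python
import math

def conditional_entropy(joint):
    """H(X|Y) = sum_{i,j} p(x_i, y_j) * log2( p(y_j) / p(x_i, y_j) ).

    `joint[(x, y)]` is the joint probability P(X = x, Y = y)."""
    # Marginal distribution of Y.
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p_y[y] / p)
               for (x, y), p in joint.items() if p > 0)

# If X = Y (perfectly correlated), knowing Y leaves no randomness in X.
same = {(0, 0): 0.5, (1, 1): 0.5}

# If X and Y are independent fair bits, H(X|Y) = H(X) = 1 bit.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
```

`conditional_entropy(same)` is 0 bits, while `conditional_entropy(indep)` is 1 bit.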

3.3 Example
Main article: Binary entropy function
Main article: Bernoulli process
Consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails; this can be modelled
as a Bernoulli process.
The entropy of the unknown result of the next toss of the coin is maximized if the coin is fair (that is, if heads and
tails both have equal probability 1/2). This is the situation of maximum uncertainty as it is most difficult to predict
the outcome of the next toss; the result of each toss of the coin delivers one full bit of information.


[Figure: Entropy Η(X) (i.e. the expected surprisal) of a coin flip, measured in shannons, graphed versus the fairness of the coin Pr(X = 1), where X = 1 represents a result of heads. Note that the maximum of the graph depends on the distribution. Here, the entropy is at most 1 shannon, and communicating the outcome of a fair coin flip (2 possible values) requires an average of at most 1 bit. The result of a fair die (6 possible values) would require on average log2 6 bits.]

However, if we know the coin is not fair, but comes up heads or tails with probabilities p and q, where p ≠ q, then there
is less uncertainty. Every time it is tossed, one side is more likely to come up than the other. The reduced uncertainty
is quantified in a lower entropy: on average each toss of the coin delivers less than one full bit of information.
The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin that never results
in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no new information as the
outcome of each coin toss is always certain. In this respect, entropy can be normalized by dividing it by information
length. This ratio is called metric entropy and is a measure of the randomness of the information.
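The binary entropy function behind this discussion can be sketched in Python (an illustrative helper, not from the source):

```python
import math

def binary_entropy(p):
    """Entropy, in shannons (bits), of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty: the outcome is certain
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

`binary_entropy(0.5)` is exactly 1 (the fair coin), `binary_entropy(0.0)` and `binary_entropy(1.0)` are 0 (the double-headed and double-tailed coins), and any bias in between gives a value strictly between 0 and 1.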

3.4 Rationale
To understand the meaning of pi log(1/pi), first try to define an information function, I, in terms of an event i
with probability pi. How much information is acquired due to the observation of event i? Shannon's solution follows
from the fundamental properties of information:[7]


1. I(p) is anti-monotonic: increases and decreases in the probability of an event produce decreases and increases
in information, respectively
2. I(0) is undefined
3. I(p) ≥ 0: information is a non-negative quantity
4. I(1) = 0: events that always occur do not communicate information
5. I(p1 · p2) = I(p1) + I(p2): information due to independent events is additive

The last is a crucial property. It states that the joint probability of independent events communicates as much information as the two
events do separately. Particularly, if the first event can yield one of n equiprobable outcomes and another has one of m
equiprobable outcomes then there are mn possible outcomes of the joint event. This means that if log2 (n) bits are
needed to encode the rst value and log2 (m) to encode the second, one needs log2 (mn) = log2 (m) + log2 (n) to encode
both. Shannon discovered that the proper choice of function to quantify information, preserving this additivity, is
logarithmic, i.e.,

I(p) = log(1/p)
The base of the logarithm can be any fixed real number greater than 1. The different units of information (bits for
log2, trits for log3, nats for the natural logarithm ln and so on) are just constant multiples of each other. (In contrast,
the entropy would be negative if the base of the logarithm were less than 1.) For instance, in case of a fair coin toss,
heads provides log2(2) = 1 bit of information, which is approximately 0.693 nats or 0.631 trits. Because of additivity,
n tosses provide n bits of information, which is approximately 0.693n nats or 0.631n trits.
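The unit conversions quoted above follow directly from the change-of-base rule for logarithms, as this small sketch shows (illustrative code, not from the source):

```python
import math

bits = 1.0                     # information from one fair coin toss
nats = bits * math.log(2)      # 1 bit = ln(2) nats, about 0.693
trits = bits * math.log(2, 3)  # 1 bit = log_3(2) trits, about 0.631
```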
Now, suppose we have a distribution where event i can happen with probability pi. Suppose we have sampled it N
times and outcome i was, accordingly, seen ni = N pi times. The total amount of information we have received is

∑_i ni I(pi) = ∑_i N pi log(1/pi).

The average amount of information that we receive with every event is therefore

∑_i pi log(1/pi).

3.5 Aspects
3.5.1

Relationship to thermodynamic entropy

Main article: Entropy in thermodynamics and information theory


The inspiration for adopting the word entropy in information theory came from the close resemblance between Shannon's formula and very similar known formulae from statistical mechanics.
In statistical thermodynamics the most general formula for the thermodynamic entropy S of a thermodynamic system
is the Gibbs entropy,

S = −kB ∑_i pi ln pi,

where kB is the Boltzmann constant, and pi is the probability of a microstate. The Gibbs entropy was defined by J.
Willard Gibbs in 1878 after earlier work by Boltzmann (1872).[8]
The Gibbs entropy translates over almost unchanged into the world of quantum physics to give the von Neumann
entropy, introduced by John von Neumann in 1927,


S = −kB Tr(ρ ln ρ),
where ρ is the density matrix of the quantum mechanical system and Tr is the trace.
At an everyday practical level the links between information entropy and thermodynamic entropy are not evident.
Physicists and chemists are apt to be more interested in changes in entropy as a system spontaneously evolves away
from its initial conditions, in accordance with the second law of thermodynamics, rather than an unchanging probability distribution. And, as the minuteness of Boltzmann's constant kB indicates, the changes in S / kB for even tiny
amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in data compression or signal processing. Furthermore, in classical thermodynamics the entropy
is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which is
central to the definition of information entropy.
The connection between thermodynamics and what is now known as information theory was first made by Ludwig
Boltzmann and expressed by his famous equation:

S = kB ln(W )
where S is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as
temperature, volume, energy, etc.), W is the number of microstates (various combinations of particles in various
energy states) that can yield the given macrostate, and kB is Boltzmann's constant. It is assumed that each microstate
is equally likely, so that the probability of a given microstate is pi = 1/W. When these probabilities are substituted
into the above expression for the Gibbs entropy (or equivalently kB times the Shannon entropy), Boltzmann's equation
results. In information theoretic terms, the information entropy of a system is the amount of missing information
needed to determine a microstate, given the macrostate.
In the view of Jaynes (1957), thermodynamic entropy, as explained by statistical mechanics, should be seen as an
application of Shannon's information theory: the thermodynamic entropy is interpreted as being proportional to the
amount of further Shannon information needed to define the detailed microscopic state of the system, that remains
uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with
the constant of proportionality being just the Boltzmann constant. For example, adding heat to a system increases
its thermodynamic entropy because it increases the number of possible microscopic states of the system that are
consistent with the measurable values of its macroscopic variables, thus making any complete state description longer.
(See article: maximum entropy thermodynamics.) Maxwell's demon can (hypothetically) reduce the thermodynamic
entropy of a system by using information about the states of individual molecules; but, as Landauer (from 1961)
and co-workers have shown, to function the demon himself must increase thermodynamic entropy in the process, by
at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic
entropy does not decrease (which resolves the paradox). Landauer's principle imposes a lower bound on the amount
of heat a computer must generate to process a given amount of information, though modern computers are far less
efficient.

3.5.2 Entropy as information content

Main article: Shannon's source coding theorem


Entropy is defined in the context of a probabilistic model. Independent fair coin flips have an entropy of 1 bit per flip.
A source that always generates a long string of Bs has an entropy of 0, since the next character will always be a 'B'.
The entropy rate of a data source means the average number of bits per symbol needed to encode it. Shannon's
experiments with human predictors show an information rate between 0.6 and 1.3 bits per character in English;[9] the
PPM compression algorithm can achieve a compression ratio of 1.5 bits per character in English text.
From the preceding example, note the following points:
1. The amount of entropy is not always an integer number of bits.
2. Many data bits may not convey information. For example, data structures often store information redundantly,
or have identical sections regardless of the information in the data structure.


Shannon's definition of entropy, when applied to an information source, can determine the minimum channel capacity required to reliably transmit the source as encoded binary digits (see caveat below in italics). The formula can
be derived by calculating the mathematical expectation of the amount of information contained in a digit from the
information source. See also Shannon–Hartley theorem.
Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is
determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties
relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain.

3.5.3 Entropy as a measure of diversity

Main article: Diversity index


Entropy is one of several ways to measure diversity. Specifically, Shannon entropy is the logarithm of ¹D, the true
diversity index with parameter equal to 1.

3.5.4 Data compression

Main article: Data compression


Entropy effectively bounds the performance of the strongest lossless compression possible, which can be realized in
theory by using the typical set or in practice using Huffman, Lempel–Ziv or arithmetic coding. See also Kolmogorov
complexity. In practice, compression algorithms deliberately include some judicious redundancy in the form of
checksums to protect against errors.

3.5.5 World's technological capacity to store and communicate information

A 2011 study in Science estimates the world's technological capacity to store and communicate optimally compressed
information, normalized on the most effective compression algorithms available in the year 2007, therefore estimating
the entropy of the technologically available sources.[10]
The authors estimate humankind's technological capacity to store information (fully entropically compressed) in 1986
and again in 2007. They break the information into three categories: to store information on a medium, to receive
information through one-way broadcast networks, and to exchange information through two-way telecommunication
networks.[10]

3.5.6 Limitations of entropy as information content

There are a number of entropy-related concepts that mathematically quantify information content in some way:
the self-information of an individual message or symbol taken from a given probability distribution,
the entropy of a given probability distribution of messages or symbols, and
the entropy rate of a stochastic process.
(The rate of self-information can also be defined for a particular sequence of messages or symbols generated by
a given stochastic process: this will always be equal to the entropy rate in the case of a stationary process.) Other
quantities of information are also used to compare or relate different sources of information.
It is important not to confuse the above concepts. Often it is only clear from context which one is meant. For example,
when someone says that the entropy of the English language is about 1 bit per character, they are actually modeling
the English language as a stochastic process and talking about its entropy rate. Shannon himself used the term in this
way.[3]
Although entropy is often used as a characterization of the information content of a data source, this information
content is not absolute: it depends crucially on the probabilistic model. A source that always generates the same


symbol has an entropy rate of 0, but the definition of what a symbol is depends on the alphabet. Consider a source
that produces the string ABABABABAB in which A is always followed by B and vice versa. If the probabilistic
model considers individual letters as independent, the entropy rate of the sequence is 1 bit per character. But if the
sequence is considered as AB AB AB AB AB, with symbols as two-character blocks, then the entropy rate is 0
bits per character.
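The dependence on the choice of symbol can be demonstrated with a short Python sketch (illustrative code, not from the source; `empirical_entropy` is a hypothetical helper computing the entropy of the observed symbol frequencies):

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """Per-symbol entropy of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = "AB" * 5  # "ABABABABAB"

# Model 1: individual letters as symbols -> A and B each appear half the time.
per_letter = empirical_entropy(text)

# Model 2: two-character blocks as symbols -> only the symbol "AB" ever occurs.
per_block = empirical_entropy([text[i:i + 2] for i in range(0, len(text), 2)])
```

Here `per_letter` comes out as 1 bit per symbol and `per_block` as 0 bits per symbol, matching the two modeling choices described above.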
However, if we use very large blocks, then the estimate of per-character entropy rate may become artificially low.
This is because in reality, the probability distribution of the sequence is not knowable exactly; it is only an estimate.
For example, suppose one considers the text of every book ever published as a sequence, with each symbol being
the text of a complete book. If there are N published books, and each book is only published once, the estimate
of the probability of each book is 1/N, and the entropy (in bits) is −log2(1/N) = log2(N). As a practical code, this
corresponds to assigning each book a unique identier and using it in place of the text of the book whenever one
wants to refer to the book. This is enormously useful for talking about books, but it is not so useful for characterizing
the information content of an individual book, or of language in general: it is not possible to reconstruct the book
from its identier without knowing the probability distribution, that is, the complete text of all the books. The key
idea is that the complexity of the probabilistic model must be considered. Kolmogorov complexity is a theoretical
generalization of this idea that allows the consideration of the information content of a sequence independent of any
particular probability model; it considers the shortest program for a universal computer that outputs the sequence. A
code that achieves the entropy rate of a sequence for a given model, plus the codebook (i.e. the probabilistic model),
is one such program, but it may not be the shortest.
For example, the Fibonacci sequence is 1, 1, 2, 3, 5, 8, 13, …. Treating the sequence as a message and each number as
a symbol, there are almost as many symbols as there are characters in the message, giving an entropy of approximately
log2(n). So the first 128 symbols of the Fibonacci sequence have an entropy of approximately 7 bits/symbol. However,
the sequence can be expressed using a formula [F(n) = F(n−1) + F(n−2) for n = 3, 4, 5, …, F(1) = 1, F(2) = 1], and
this formula has a much lower entropy and applies to any length of the Fibonacci sequence.

3.5.7 Limitations of entropy in cryptography

In cryptanalysis, entropy is often roughly used as a measure of the unpredictability of a cryptographic key. For
example, a 128-bit key that is uniformly randomly generated has 128 bits of entropy. It also takes (on average)
2^(128−1) guesses to break by brute force. However, entropy fails to capture the number of guesses required if the
possible keys are not chosen uniformly.[11][12] Instead, a measure called guesswork can be used to measure the effort
required for a brute force attack.[13]
Other problems may arise from non-uniform distributions used in cryptography. For example, consider a 1,000,000-digit
binary one-time pad using exclusive or. If the pad has 1,000,000 bits of entropy, it is perfect. If the pad has
999,999 bits of entropy, evenly distributed (each individual bit of the pad having 0.999999 bits of entropy), it may
provide good security. But if the pad has 999,999 bits of entropy, where the first bit is fixed and the remaining 999,999
bits are perfectly random, then the first bit of the ciphertext will not be encrypted at all.
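The flawed-pad case can be demonstrated on a toy scale (illustrative code, not from the source; the 16-bit length is a stand-in for the 1,000,000-bit example):

```python
import secrets

def xor_encrypt(plaintext_bits, pad_bits):
    """One-time pad: each ciphertext bit is plaintext bit XOR pad bit."""
    return [p ^ k for p, k in zip(plaintext_bits, pad_bits)]

n = 16  # toy stand-in for the 1,000,000-bit pad in the text
plaintext = [secrets.randbelow(2) for _ in range(n)]

# Flawed pad: first bit fixed to 0, the rest uniformly random.
pad = [0] + [secrets.randbelow(2) for _ in range(n - 1)]
ciphertext = xor_encrypt(plaintext, pad)

# Because pad[0] is known to be 0, ciphertext[0] equals plaintext[0]:
# the first bit is effectively transmitted in the clear.
```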

3.5.8 Data as a Markov process

A common way to define entropy for text is based on the Markov model of text. For an order-0 source (each character
is selected independently of the preceding characters), the binary entropy is:

H(S) = −∑_i pi log2 pi,

where pi is the probability of i. For a first-order Markov source (one in which the probability of selecting a character
is dependent only on the immediately preceding character), the entropy rate is:
H(S) = −∑_i pi ∑_j pi(j) log2 pi(j),

where i is a state (certain preceding characters) and pi (j) is the probability of j given i as the previous character.
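The order-0 and first-order formulas can be sketched in Python (illustrative code, not from the source; the "sticky" two-state source is an invented example):

```python
import math

def entropy_order0(p):
    """H(S) = -sum_i p_i log2(p_i) for an order-0 (memoryless) source."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def entropy_order1(p, trans):
    """Entropy rate of a first-order Markov source:
    H(S) = -sum_i p_i sum_j p_i(j) log2 p_i(j),
    where p is the stationary distribution and trans[i][j] = p_i(j)."""
    return sum(
        -p[i] * sum(t * math.log2(t) for t in trans[i].values() if t > 0)
        for i in p
    )

# A sticky two-state source: each character usually repeats the previous one.
stationary = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
```

Ignoring the dependence between characters, `entropy_order0(stationary)` gives 1 bit per character, but the first-order model `entropy_order1(stationary, trans)` gives only about 0.47 bits per character, since each character is mostly predictable from the previous one.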
For a second order Markov source, the entropy rate is


H(S) = −∑_i pi ∑_j pi(j) ∑_k pi,j(k) log2 pi,j(k).

3.5.9 b-ary entropy

In general the b-ary entropy of a source S = (S, P) with source alphabet S = {a1, …, an} and discrete probability
distribution P = {p1, …, pn} where pi is the probability of ai (say pi = p(ai)) is defined by:

Hb(S) = −∑_{i=1}^{n} pi logb pi,

Note: the b in "b-ary entropy" is the number of different symbols of the ideal alphabet used as a standard yardstick to
measure source alphabets. In information theory, two symbols are necessary and sufficient for an alphabet to encode
information. Therefore, the default is to let b = 2 (binary entropy). Thus, the entropy of the source alphabet, with its
given empiric probability distribution, is a number equal to the number (possibly fractional) of symbols of the ideal
alphabet, with an optimal probability distribution, necessary to encode for each symbol of the source alphabet. Also
note that optimal probability distribution here means a uniform distribution: a source alphabet with n symbols has
the highest possible entropy (for an alphabet with n symbols) when the probability distribution of the alphabet is
uniform. This optimal entropy turns out to be logb(n).

3.6 Efficiency
A source alphabet with non-uniform distribution will have less entropy than if those symbols had uniform distribution
(i.e. the optimized alphabet). This deficiency in entropy can be expressed as a ratio called efficiency:
η(X) = ( −∑_{i=1}^{n} p(xi) logb(p(xi)) ) / logb(n).

Efficiency has utility in quantifying the effective use of a communications channel. This formulation is also referred
to as the normalized entropy, as the entropy is divided by the maximum entropy logb(n).
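A minimal sketch of normalized entropy in Python follows (illustrative code, not from the source):

```python
import math

def efficiency(probs, base=2):
    """Normalized entropy H_b(X) / log_b(n), a value in [0, 1]."""
    n = len(probs)
    h = -sum(p * math.log(p, base) for p in probs if p > 0)
    return h / math.log(n, base)
```

A uniform distribution gives efficiency 1, a degenerate distribution gives 0, and the skewed A/B/C/D distribution from the earlier example falls strictly between.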

3.7 Characterization
Shannon entropy is characterized by a small number of criteria, listed below. Any definition of entropy satisfying
these assumptions has the form

−K ∑_{i=1}^{n} pi log(pi),

where K is a constant corresponding to a choice of measurement units.


In the following, pi = Pr(X = xi) and Hn(p1, …, pn) = H(X).

3.7.1 Continuity

The measure should be continuous, so that changing the values of the probabilities by a very small amount should
only change the entropy by a small amount.

3.7.2 Symmetry

The measure should be unchanged if the outcomes xi are re-ordered.

Hn (p1 , p2 , . . .) = Hn (p2 , p1 , . . .)

3.7.3 Maximum

The measure should be maximal if all the outcomes are equally likely (uncertainty is highest when all possible events
are equiprobable).
Hn(p1, …, pn) ≤ Hn(1/n, …, 1/n) = logb(n).

For equiprobable events the entropy should increase with the number of outcomes.
Hn(1/n, …, 1/n) = logb(n) < logb(n+1) = Hn+1(1/(n+1), …, 1/(n+1)).

3.7.4 Additivity

The amount of entropy should be independent of how the process is regarded as being divided into parts.
This last functional relationship characterizes the entropy of a system with sub-systems. It demands that the entropy
of a system can be calculated from the entropies of its sub-systems if the interactions between the sub-systems are
known.
Given an ensemble of n uniformly distributed elements that are divided into k boxes (sub-systems) with b1 , , bk
elements each, the entropy of the whole ensemble should be equal to the sum of the entropy of the system of boxes
and the individual entropies of the boxes, each weighted with the probability of being in that particular box.
For positive integers bi where b1 + + bk = n,
Hn(1/n, …, 1/n) = Hk(b1/n, …, bk/n) + ∑_{i=1}^{k} (bi/n) Hbi(1/bi, …, 1/bi).

Choosing k = n, b1 = … = bn = 1, this implies that the entropy of a certain outcome is zero: H1(1) = 0. This implies
that the efficiency of a source alphabet with n symbols can be defined simply as being equal to its n-ary entropy. See
also Redundancy (information theory).

3.8 Further properties


The Shannon entropy satisfies the following properties, for some of which it is useful to interpret entropy as the
amount of information learned (or uncertainty eliminated) by revealing the value of a random variable X:
Adding or removing an event with probability zero does not contribute to the entropy:

Hn+1 (p1 , . . . , pn , 0) = Hn (p1 , . . . , pn )


It can be confirmed using the Jensen inequality that


H(X) = E[ logb(1/p(X)) ] ≤ logb( E[1/p(X)] ) = logb(n).

This maximal entropy of logb(n) is effectively attained by a source alphabet having a uniform probability
distribution: uncertainty is maximal when all possible events are equiprobable.
The entropy or the amount of information revealed by evaluating (X, Y) (that is, evaluating X and Y simultaneously) is equal to the information revealed by conducting two consecutive experiments: first evaluating the
value of Y, then revealing the value of X given that you know the value of Y. This may be written as

H(X, Y ) = H(X|Y ) + H(Y ) = H(Y |X) + H(X).


If Y = f(X) where f is a function, then (f(X)|X) = 0. Applying the previous formula to (X, f(X)) yields

H(X) + H(f (X)|X) = H(f (X)) + H(X|f (X)),


so (f(X)) (X), thus the entropy of a variable can only decrease when the latter is passed through a
function.
If X and Y are two independent random variables, then knowing the value of Y doesn't influence our knowledge
of the value of X (since the two don't influence each other by independence):

H(X|Y ) = H(X).
The entropy of two simultaneous events is no more than the sum of the entropies of each individual event, with
equality if and only if the two events are independent. More specifically, if X and Y are two random variables on the
same probability space, and (X, Y) denotes their Cartesian product, then

H(X, Y) ≤ H(X) + H(Y).


Proving this mathematically follows easily from the previous two properties of entropy.
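The subadditivity inequality can be checked numerically on a small example (illustrative code, not from the source; the correlated joint distribution is invented for the demonstration):

```python
import math

def joint_entropy(joint):
    """H(X, Y) = -sum p(x, y) log2 p(x, y)."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def marginal(joint, axis):
    """Marginal distribution along one coordinate of the pair keys."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Correlated pair: X is a fair bit; Y copies X with probability 3/4.
joint = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
hxy = joint_entropy(joint)
hx = entropy(marginal(joint, 0))
hy = entropy(marginal(joint, 1))
```

Here H(X) = H(Y) = 1 bit while H(X, Y) is about 1.81 bits, strictly below the sum H(X) + H(Y) = 2 bits because X and Y are correlated.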

3.9 Extending discrete entropy to the continuous case


3.9.1 Differential entropy

Main article: Differential entropy


The Shannon entropy is restricted to random variables taking discrete values. The corresponding formula for a
continuous random variable with probability density function f(x) with finite or infinite support X on the real line is
defined by analogy, using the above form of the entropy as an expectation:

h[f] = E[−ln(f(x))] = −∫_X f(x) ln(f(x)) dx.


This formula is usually referred to as the continuous entropy, or differential entropy. A precursor of the continuous
entropy h[f] is the expression for the functional Η in the Η-theorem of Boltzmann.
Although the analogy between the two functions is suggestive, the following question must be asked: is the differential
entropy a valid extension of the Shannon discrete entropy? Differential entropy lacks a number of properties that the
Shannon discrete entropy has (it can even be negative), and thus corrections have been suggested, notably the limiting
density of discrete points.
To answer this question, we must establish a connection between the two functions:
We wish to obtain a generally finite measure as the bin size goes to zero. In the discrete case, the bin size is the
(implicit) width of each of the n (finite or infinite) bins whose probabilities are denoted by pn. As we generalize to
the continuous domain, we must make this width explicit.
To do this, start with a continuous function f discretized into bins of size Δ. By the mean-value theorem there exists
a value xi in each bin such that

f(xi) Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx,

and thus the integral of the function f can be approximated (in the Riemannian sense) by

∫_{−∞}^{∞} f(x) dx = lim_{Δ→0} ∑_{i=−∞}^{∞} f(xi) Δ,

where this limit and "bin size goes to zero" are equivalent.
We will denote

H^Δ := −∑_{i=−∞}^{∞} f(xi) Δ log( f(xi) Δ ),

and expanding the logarithm, we have

H^Δ = −∑_{i=−∞}^{∞} f(xi) Δ log(f(xi)) − ∑_{i=−∞}^{∞} f(xi) Δ log(Δ).

As Δ → 0, we have

∑_{i=−∞}^{∞} f(xi) Δ → ∫ f(x) dx = 1 and

∑_{i=−∞}^{∞} f(xi) Δ log(f(xi)) → ∫ f(x) log f(x) dx.
But note that log(Δ) → −∞ as Δ → 0, therefore we need a special definition of the differential or continuous entropy:

h[f] = lim_{Δ→0} ( H^Δ + log Δ ) = −∫ f(x) log f(x) dx,

which is, as said before, referred to as the differential entropy. This means that the differential entropy is not a limit
of the Shannon entropy for n → ∞. Rather, it differs from the limit of the Shannon entropy by an infinite offset (see
also the article on information dimension).
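The definition can be sketched numerically in Python (illustrative code, not from the source; the Riemann-sum approximation and the uniform density are invented for the example):

```python
import math

def riemann_differential_entropy(f, lo, hi, bins=10000):
    """Numerically approximate h[f] = -integral of f(x) ln f(x) dx over [lo, hi]."""
    dx = (hi - lo) / bins
    total = 0.0
    for i in range(bins):
        x = lo + (i + 0.5) * dx  # midpoint of the i-th bin
        fx = f(x)
        if fx > 0:
            total -= fx * math.log(fx) * dx
    return total

# Uniform density on [0, a] has h = ln(a), which is negative whenever a < 1,
# illustrating that differential entropy, unlike discrete entropy, can be negative.
a = 0.5
h = riemann_differential_entropy(lambda x: 1.0 / a, 0.0, a)
```

For `a = 0.5` the numerical value agrees with the closed form ln(0.5) ≈ −0.693 nats.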


3.9.2 Limiting Density of Discrete Points

Main article: Limiting density of discrete points


It turns out as a result that, unlike the Shannon entropy, the differential entropy is not in general a good measure
of uncertainty or information. For example, the differential entropy can be negative; also it is not invariant under
continuous co-ordinate transformations. This problem may be illustrated by a change of units when x is a dimensioned
variable. f(x) will then have the units of 1/x. The argument of the logarithm must be dimensionless, otherwise it is
improper, so that the differential entropy as given above will be improper. If Δ is some standard value of x (i.e.
bin size) and therefore has the same units, then a modified differential entropy may be written in proper form as:

H = ∫ f(x) log( f(x) Δ ) dx,

and the result will be the same for any choice of units for x. In fact, the limit of discrete entropy as N → ∞ would
also include a term of log(N), which would in general be infinite. This is expected: continuous variables would
typically have infinite entropy when discretized. The limiting density of discrete points is really a measure of how
much easier a distribution is to describe than a distribution that is uniform over its quantization scheme.

3.9.3 Relative entropy

Main article: Generalized relative entropy


Another useful measure of entropy that works equally well in the discrete and the continuous case is the relative
entropy of a distribution. It is defined as the Kullback–Leibler divergence from the distribution to a reference measure
m as follows. Assume that a probability distribution p is absolutely continuous with respect to a measure m, i.e. is of
the form p(dx) = f(x)m(dx) for some non-negative m-integrable function f with m-integral 1; then the relative entropy
can be defined as

DKL(p‖m) = ∫ log(f(x)) p(dx) = ∫ f(x) log(f(x)) m(dx).

In this form the relative entropy generalises (up to change in sign) both the discrete entropy, where the measure m is
the counting measure, and the differential entropy, where the measure m is the Lebesgue measure. If the measure m
is itself a probability distribution, the relative entropy is non-negative, and zero if p = m as measures. It is defined for
any measure space, hence coordinate independent and invariant under co-ordinate reparameterizations if one properly
takes into account the transformation of the measure m. The relative entropy, and implicitly entropy and differential
entropy, do depend on the reference measure m.
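For the discrete case, the Kullback–Leibler divergence can be sketched in Python (an illustrative helper, not from the source):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log2(p_i / q_i).

    Requires q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Against the uniform reference, D_KL(p || uniform) = log2(n) - H(p),
# so it is 0 exactly when p is itself uniform.
d_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
d_skew = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

`d_same` is 0, and `d_skew` is strictly positive, consistent with the non-negativity noted above.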

3.10 Use in combinatorics


Entropy has become a useful quantity in combinatorics.

3.10.1 Loomis–Whitney inequality

A simple example of this is an alternate proof of the Loomis–Whitney inequality: for every subset A ⊆ Z^d, we have

|A|^{d−1} ≤ ∏_{i=1}^{d} |Pi(A)|,

where Pi is the orthogonal projection in the i-th coordinate:


Pi(A) = {(x1, …, x_{i−1}, x_{i+1}, …, xd) : (x1, …, xd) ∈ A}.


The proof follows as a simple corollary of Shearer's inequality: if X1, …, Xd are random variables and S1, …, Sn
are subsets of {1, …, d} such that every integer between 1 and d lies in exactly r of these subsets, then

H[(X1, …, Xd)] ≤ (1/r) ∑_{i=1}^{n} H[(Xj)_{j∈Si}],

where (Xj)_{j∈Si} is the Cartesian product of random variables Xj with indexes j in Si (so the dimension of this vector
is equal to the size of Si).
We sketch how Loomis–Whitney follows from this: let X be a uniformly distributed random variable with values
in A, so that each point in A occurs with equal probability. Then (by the further properties of entropy mentioned
above) H(X) = log|A|, where |A| denotes the cardinality of A. Let Si = {1, 2, …, i−1, i+1, …, d}. The range of
(Xj)_{j∈Si} is contained in Pi(A) and hence H[(Xj)_{j∈Si}] ≤ log|Pi(A)|. Now use this to bound the right side of
Shearer's inequality and exponentiate the opposite sides of the resulting inequality.

3.10.2 Approximation to binomial coefficient

For integers 0 < k < n let q = k/n. Then

2^{nH(q)} / (n+1) ≤ C(n, k) ≤ 2^{nH(q)},

where C(n, k) is the binomial coefficient and

H(q) = −q log2(q) − (1−q) log2(1−q).[14]
Here is a sketch proof. Note that C(n, k) q^{qn} (1−q)^{n−nq} is one term of the expression

∑_{i=0}^{n} C(n, i) q^i (1−q)^{n−i} = (q + (1−q))^n = 1.

Rearranging gives the upper bound. For the lower bound one first shows, using some algebra, that it is the largest
term in the summation. But then,

C(n, k) q^{qn} (1−q)^{n−nq} ≥ 1/(n+1),

since there are n + 1 terms in the summation. Rearranging gives the lower bound.
A nice interpretation of this is that the number of binary strings of length n with exactly k many 1s is approximately
2^{nH(k/n)}.[15]
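The two-sided bound can be checked numerically for small n (illustrative code, not from the source; the tested range is arbitrary):

```python
import math

def binary_entropy(q):
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def check_bounds(n, k):
    """Verify 2^{nH(q)}/(n+1) <= C(n, k) <= 2^{nH(q)} for q = k/n."""
    q = k / n
    c = math.comb(n, k)
    upper = 2 ** (n * binary_entropy(q))
    return upper / (n + 1) <= c <= upper

# Exhaustively check all 0 < k < n for a range of small n.
ok = all(check_bounds(n, k) for n in range(2, 40) for k in range(1, n))
```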

3.11 See also


Conditional entropy
Cross entropy: a measure of the average number of bits needed to identify an event from a set of possibilities
between two probability distributions
Diversity index: alternative approaches to quantifying diversity in a probability distribution



Entropy (arrow of time)
Entropy encoding – a coding scheme that assigns codes to symbols so as to match code lengths with the probabilities of the symbols.
Entropy estimation
Entropy power inequality
Entropy rate
Fisher information
Graph entropy
Hamming distance
History of entropy
History of information theory
Information geometry
Joint entropy – the amount of entropy contained in a joint system of two random variables.
Kolmogorov-Sinai entropy in dynamical systems
Levenshtein distance
Mutual information
Negentropy
Perplexity
Qualitative variation – other measures of statistical dispersion for nominal distributions
Quantum relative entropy – a measure of distinguishability between two quantum states.
Rényi entropy – a generalisation of Shannon entropy; it is one of a family of functionals for quantifying the diversity, uncertainty or randomness of a system.
Randomness
Shannon index
Theil index
Typoglycemia

3.12 References
[1] Shannon, Claude E. (July–October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (3): 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.
[2] Schneier, B.: Applied Cryptography, Second edition, page 234. John Wiley and Sons.
[3] Shannon, C. E. (January 1951). "Prediction and Entropy of Printed English" (PDF). Bell System Technical Journal. 30 (1): 50–64. doi:10.1002/j.1538-7305.1951.tb01366.x. Retrieved 30 March 2014.
[4] Borda, Monica (2011). Fundamentals in Information Theory and Coding. Springer. p. 11. ISBN 978-3-642-20346-6.
[5] Han, Te Sun & Kobayashi, Kingo (2002). Mathematics of Information and Coding. American Mathematical Society. pp. 19–20. ISBN 978-0-8218-4256-0.
[6] Schneider, T.D., Information theory primer with an appendix on logarithms, National Cancer Institute, 14 April 2007.


[7] Carter, Tom (March 2014). An introduction to information theory and entropy (PDF). Santa Fe. Retrieved August 2014.
[8] Compare: Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie: 2 Volumes, Leipzig 1895/98. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover. ISBN 0-486-68455-5
[9] Mark Nelson (24 August 2006). The Hutter Prize. Retrieved 2008-11-27.
[10] The World's Technological Capacity to Store, Communicate, and Compute Information, Martin Hilbert and Priscila López (2011), Science (journal), 332 (6025), 60–65; free access to the article through martinhilbert.net/WorldInfoCapacity.html
[11] Massey, James (1994). Guessing and Entropy (PDF). Proc. IEEE International Symposium on Information Theory.
Retrieved December 31, 2013.
[12] Malone, David; Sullivan, Wayne (2005). Guesswork is not a Substitute for Entropy (PDF). Proceedings of the Information
Technology & Telecommunications Conference. Retrieved December 31, 2013.
[13] Pliam, John (1999). Guesswork and variation distance as measures of cipher security. International Workshop on Selected
Areas in Cryptography. Retrieved October 23, 2016.
[14] Aoki, New Approaches to Macroeconomic Modeling. page 43.
[15] Probability and Computing, M. Mitzenmacher and E. Upfal, Cambridge University Press

This article incorporates material from Shannon's entropy on PlanetMath, which is licensed under the Creative Commons Attribution/Share-Alike License.

3.13 Further reading


3.13.1

Textbooks on information theory

Arndt, C. (2004), Information Measures: Information and its Description in Science and Engineering, Springer, ISBN 978-3-540-40855-0
Cover, T. M., Thomas, J. A. (2006), Elements of Information Theory, 2nd Edition. Wiley-Interscience. ISBN 0-471-24195-4.
Gray, R. M. (2011), Entropy and Information Theory, Springer.
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1
Martin, Nathaniel F.G. & England, James W. (2011). Mathematical Theory of Entropy. Cambridge University Press. ISBN 978-0-521-17738-2.
Shannon, C.E., Weaver, W. (1949) The Mathematical Theory of Communication, Univ of Illinois Press. ISBN 0-252-72548-4
Stone, J. V. (2014), Chapter 1 of Information Theory: A Tutorial Introduction, University of Sheffield, England. ISBN 978-0956372857.

3.14 External links


Hazewinkel, Michiel, ed. (2001), Entropy, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
Introduction to entropy and information on Principia Cybernetica Web
Entropy – an interdisciplinary journal on all aspects of the entropy concept. Open access.



Description of information entropy from Tools for Thought by Howard Rheingold
A java applet representing Shannon's Experiment to Calculate the Entropy of English
Slides on information gain and entropy
An Intuitive Guide to the Concept of Entropy Arising in Various Sectors of Science – a wikibook on the interpretation of the concept of entropy.
A Light Discussion and Derivation of Entropy
Network Event Detection With Entropy Measures, Dr. Raimund Eimann, University of Auckland, PDF; 5993 kB – a PhD thesis demonstrating how entropy measures may be used in network anomaly detection.
Rosetta Code – repository of implementations of Shannon entropy in different programming languages.
Information Theory for Intelligent People. Short introduction to the axioms of information theory, entropy, mutual information, Kullback–Leibler divergence, and Jensen–Shannon distance.
Online tool for calculating entropy

Chapter 4

Binary entropy function

[Figure: Entropy of a Bernoulli trial as a function of success probability, called the binary entropy function.]

In information theory, the binary entropy function, denoted H(p) or Hb(p), is defined as the entropy of a Bernoulli
process with probability of success p . Mathematically, the Bernoulli trial is modelled as a random variable X that
can take on only two values: 0 and 1. The event X = 1 is considered a success and the event X = 0 is considered a
failure. (These two events are mutually exclusive and exhaustive.)


If Pr(X = 1) = p, then Pr(X = 0) = 1 − p and the entropy of X (in shannons) is given by

H(X) = Hb(p) = −p log₂(p) − (1 − p) log₂(1 − p)


where 0 log2 0 is taken to be 0. The logarithms in this formula are usually taken (as shown in the graph) to the base
2. See binary logarithm.
When p = 1/2, the binary entropy function attains its maximum value. This is the case of the unbiased bit, the most
common unit of information entropy.
H(p) is distinguished from the entropy function H(X) in that the former takes a single real number as a parameter
whereas the latter takes a distribution or random variables as a parameter. Sometimes the binary entropy function
is also written as H2(p). However, it is different from and should not be confused with the Rényi entropy, which is denoted as H2(X).

4.1 Explanation
In terms of information theory, entropy is considered to be a measure of the uncertainty in a message. To put it
intuitively, suppose p = 0 . At this probability, the event is certain never to occur, and so there is no uncertainty at
all, leading to an entropy of 0. If p = 1 , the result is again certain, so the entropy is 0 here as well. When p = 1/2
, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage
to be gained with prior knowledge of the probabilities. In this case, the entropy is maximum at a value of 1 bit.
Intermediate values fall between these cases; for instance, if p = 1/4 , there is still a measure of uncertainty on the
outcome, but one can still predict the outcome correctly more often than not, so the uncertainty measure, or entropy,
is less than 1 full bit.
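The three cases discussed above (certainty at p = 0 and p = 1, maximal uncertainty at p = 1/2) can be checked with a few lines of code (a sketch; the function name is ours):

```python
import math

def binary_entropy(p):
    """H_b(p) = -p log2(p) - (1-p) log2(1-p), with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

assert binary_entropy(0.0) == 0.0   # certain failure: no uncertainty
assert binary_entropy(1.0) == 0.0   # certain success: no uncertainty
assert binary_entropy(0.5) == 1.0   # fair coin: exactly 1 bit
assert 0 < binary_entropy(0.25) < 1 # intermediate case
```

Note the symmetry H_b(p) = H_b(1 − p): swapping the labels of the two outcomes does not change the uncertainty.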

4.2 Derivative
The derivative of the binary entropy function may be expressed as the negative of the logit function:

d/dp Hb(p) = −logit₂(p) = −log₂( p / (1 − p) ).

4.3 Taylor series


The Taylor series of the binary entropy function in a neighborhood of 1/2 is

Hb(p) = 1 − (1 / (2 ln 2)) ∑_{n=1}^{∞} (1 − 2p)^{2n} / (n(2n − 1))

for 0 ≤ p ≤ 1.
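A truncated version of this series converges quickly away from the endpoints, which a short numerical check makes concrete (a sketch; function names are ours):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def taylor_hb(p, terms=50):
    """Partial sum of H_b's Taylor series around p = 1/2."""
    s = sum((1 - 2 * p) ** (2 * n) / (n * (2 * n - 1))
            for n in range(1, terms + 1))
    return 1 - s / (2 * math.log(2))

# The partial sum matches the closed form to high accuracy on (0, 1).
for p in (0.1, 0.3, 0.5, 0.8):
    assert abs(taylor_hb(p) - binary_entropy(p)) < 1e-6
```

At p = 1/2 every series term vanishes and the partial sum is exactly 1, the maximum of the function.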

4.4 See also


Metric entropy
Information theory
Information entropy

4.5. REFERENCES

37

4.5 References
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1

Chapter 5

Differential entropy

Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Shannon to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy (described here) is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.

5.1 Definition

Let X be a random variable with a probability density function f whose support is a set 𝒳. The differential entropy h(X) or h(f) is defined as

h(X) = −∫_𝒳 f(x) log f(x) dx.

For probability distributions that do not have an explicit density function expression but do have an explicit quantile function expression Q(p), h(Q) can be defined in terms of the derivative of Q(p), i.e. the quantile density function Q′(p), as[1]

h(Q) = ∫₀¹ log Q′(p) dp.

As with its discrete analog, the units of differential entropy depend on the base of the logarithm, which is usually 2 (i.e., the units are bits). See logarithmic units for logarithms taken in different bases. Related concepts such as joint, conditional differential entropy, and relative entropy are defined in a similar fashion. Unlike the discrete analog, the differential entropy has an offset that depends on the units used to measure X.[2] For example, the differential entropy of a quantity measured in millimeters will be log(1000) more than the same quantity measured in meters; a dimensionless quantity will have differential entropy of log(1000) more than the same quantity divided by 1000.
One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density functions can be greater than 1. For example, Uniform(0, 1/2) has negative differential entropy

−∫₀^{1/2} 2 log(2) dx = −log(2).

Thus, differential entropy does not share all properties of discrete entropy.
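For a uniform density on (a, b) the integral collapses to log(b − a), so the negative value above is easy to reproduce (a sketch; the helper name is ours):

```python
import math

def diff_entropy_uniform(a, b):
    """Differential entropy (in nats) of Uniform(a, b): -int f log f = log(b - a)."""
    return math.log(b - a)

# Uniform(0, 1/2): density f(x) = 2 on the interval, so
# h = -int_0^{1/2} 2 log(2) dx = -log(2) < 0.
h = diff_entropy_uniform(0, 0.5)
assert abs(h - (-math.log(2))) < 1e-12
assert h < 0  # a discrete entropy could never be negative
```

A support narrower than 1 always gives negative differential entropy, precisely because the density exceeds 1 there.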
Note that the continuous mutual information I(X;Y) has the distinction of retaining its fundamental significance as a measure of discrete information since it is actually the limit of the discrete mutual information of partitions of X
and Y as these partitions become finer and finer. Thus it is invariant under non-linear homeomorphisms (continuous and uniquely invertible maps),[3] including linear[4] transformations of X and Y, and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.
For the direct analogue of discrete entropy extended to the continuous space, see limiting density of discrete points.

5.2 Properties of differential entropy


For densities f and g, the Kullback–Leibler divergence D(f||g) is nonnegative with equality if f = g almost everywhere. Similarly, for two random variables X and Y, I(X;Y) ≥ 0 and h(X|Y) ≤ h(X) with equality if and only if X and Y are independent.

The chain rule for differential entropy holds as in the discrete case:

h(X₁, …, Xₙ) = ∑_{i=1}^{n} h(Xᵢ | X₁, …, Xᵢ₋₁) ≤ ∑_{i=1}^{n} h(Xᵢ).

Differential entropy is translation invariant, i.e., h(X + c) = h(X) for a constant c.


Differential entropy is in general not invariant under arbitrary invertible maps. In particular, for a constant a, h(aX) = h(X) + log|a|. For a vector-valued random variable X and a matrix A, h(AX) = h(X) + log|det(A)|. In general, for a transformation from a random vector to another random vector with the same dimension, Y = m(X), the corresponding entropies are related via

h(Y) ≤ h(X) + ∫ f(x) log |∂m/∂x| dx

where ∂m/∂x is the Jacobian of the transformation m. The above inequality becomes an equality if the transform is a bijection. Furthermore, when m is a rigid rotation, translation, or combination thereof, the Jacobian determinant is always 1, and h(Y) = h(X).
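The scaling rule h(aX) = h(X) + log|a| can be illustrated with closed forms for uniform variables, where h(Uniform of width w) = log w (a sketch; the helper name is ours):

```python
import math

def h_uniform(width):
    """Differential entropy (nats) of a uniform density with the given support width."""
    return math.log(width)

a = 3.0
hX = h_uniform(1.0)        # X ~ Uniform(0, 1): h(X) = log(1) = 0
haX = h_uniform(a * 1.0)   # aX ~ Uniform(0, a): support width scales by a

# h(aX) = h(X) + log|a|
assert abs(haX - (hX + math.log(abs(a)))) < 1e-12
```

The same check works for any |a| > 0, including |a| < 1, where the entropy decreases by |log a|.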
If a random vector X in Rⁿ has mean zero and covariance matrix K, then

h(X) ≤ (1/2) log(det(2πeK)) = (1/2) log[(2πe)ⁿ det K],

with equality if and only if X is jointly Gaussian (see below).
However, differential entropy does not have other desirable properties:

It is not invariant under change of variables, and is therefore most useful with dimensionless variables.
It can be negative.

A modification of differential entropy that addresses these drawbacks is the relative information entropy, also known as the Kullback–Leibler divergence, which includes an invariant measure factor (see limiting density of discrete points).

5.3 Maximization in the normal distribution

With a normal distribution, differential entropy is maximized for a given variance. The following is a proof that a Gaussian variable has the largest entropy amongst all random variables of equal variance, or, alternatively, that the maximum entropy distribution under constraints of mean and variance is the Gaussian.
Let g(x) be a Gaussian PDF with mean μ and variance σ² and f(x) an arbitrary PDF with the same variance. Since differential entropy is translation invariant we can assume that f(x) has the same mean μ as g(x).


Consider the Kullback–Leibler divergence between the two distributions:

0 ≤ D_KL(f || g) = ∫ f(x) log( f(x) / g(x) ) dx = −h(f) − ∫ f(x) log(g(x)) dx.

Now note that

∫ f(x) log(g(x)) dx = ∫ f(x) log( (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} ) dx
                    = ∫ f(x) log(1/√(2πσ²)) dx + log(e) ∫ f(x) ( −(x − μ)²/(2σ²) ) dx
                    = −(1/2) log(2πσ²) − log(e) σ²/(2σ²)
                    = −(1/2) ( log(2πσ²) + log(e) )
                    = −(1/2) log(2πeσ²)
                    = −h(g)
because the result does not depend on f(x) other than through the variance. Combining the two results yields

h(g) − h(f) ≥ 0,

with equality when g(x) = f(x), following from the properties of Kullback–Leibler divergence.
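The conclusion can be sanity-checked against closed-form entropies of other unit-variance distributions, all of which must fall below the Gaussian value ½ log(2πeσ²) (a sketch; the closed forms for the Laplace and uniform entropies are standard results, not derived in the text above):

```python
import math

sigma2 = 1.0
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# Laplace(b) has variance 2b^2 and differential entropy 1 + ln(2b).
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

# Uniform(-w/2, w/2) has variance w^2/12 and differential entropy ln(w).
w = math.sqrt(12 * sigma2)
h_uniform = math.log(w)

# The Gaussian dominates both at equal variance.
assert h_gauss > h_laplace
assert h_gauss > h_uniform
```

Numerically, h_gauss ≈ 1.419 nats versus ≈ 1.347 for the Laplace and ≈ 1.242 for the uniform.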
This result may also be demonstrated using the variational calculus. A Lagrangian function with two Lagrange multipliers may be defined as:

L = ∫ g(x) ln(g(x)) dx − λ₀ ( 1 − ∫ g(x) dx ) − λ ( σ² − ∫ g(x)(x − μ)² dx )

where g(x) is some function with mean μ. When the entropy of g(x) is at a maximum and the constraint equations, which consist of the normalization condition 1 = ∫ g(x) dx and the requirement of fixed variance σ² = ∫ g(x)(x − μ)² dx, are both satisfied, then a small variation δg(x) about g(x) will produce a variation δL about L which is equal to zero:

0 = δL = ∫ δg(x) ( ln(g(x)) + 1 + λ₀ + λ(x − μ)² ) dx

Since this must hold for any small δg(x), the term in brackets must be zero, and solving for g(x) yields:

g(x) = e^{−λ₀ − 1 − λ(x − μ)²}

Using the constraint equations to solve for λ₀ and λ yields the normal distribution:

g(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}

5.4 Example: Exponential distribution


Let X be an exponentially distributed random variable with parameter λ, that is, with probability density function


f(x) = λe^{−λx} for x ≥ 0.
Its differential entropy is then

h_e(X) = −∫₀^∞ λe^{−λx} ln(λe^{−λx}) dx = −ln(λ) + 1 = 1 − ln(λ).

Here, h_e(X) was used rather than h(X) to make it explicit that the logarithm was taken to base e, to simplify the calculation.

5.5 Differential entropies for various distributions


In the table below, Γ(x) = ∫₀^∞ e^{−t} t^{x−1} dt is the gamma function, ψ(x) = d/dx ln Γ(x) = Γ′(x)/Γ(x) is the digamma function, B(p, q) = Γ(p)Γ(q)/Γ(p + q) is the beta function, and γ_E is Euler's constant.[5]

(Many of the differential entropies are from [6].)

5.6 Variants

As described above, differential entropy does not share all properties of discrete entropy. For example, the differential entropy can be negative; also it is not invariant under continuous coordinate transformations. Edwin Thompson Jaynes showed in fact that the expression above is not the correct limit of the expression for a finite set of probabilities.[7]
A modification of differential entropy adds an invariant measure factor to correct this (see limiting density of discrete points). If m(x) is further constrained to be a probability density, the resulting notion is called relative entropy in information theory:

D(p || m) = ∫ p(x) log( p(x) / m(x) ) dx.

The definition of differential entropy above can be obtained by partitioning the range of X into bins of length h with associated sample points ih within the bins, for X Riemann integrable. This gives a quantized version of X, defined by X_h = ih if ih ≤ X < (i + 1)h. Then the entropy of X_h is

H_h = −∑_i h f(ih) log(f(ih)) − ∑_i h f(ih) log(h).

The first term on the right approximates the differential entropy, while the second term is approximately −log(h). Note that this procedure suggests that the entropy in the discrete sense of a continuous random variable should be ∞.

5.7 See also


Information entropy
Information theory
Limiting density of discrete points
Self-information
Kullback–Leibler divergence
Entropy estimation

42

CHAPTER 5. DIFFERENTIAL ENTROPY

5.8 References
[1] Vasicek, Oldrich (1976), A Test for Normality Based on Sample Entropy, Journal of the Royal Statistical Society, Series B, 38 (1): 54–59, JSTOR 2984828.
[2] Pages 183–184, Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics, developed with especial reference to the rational foundation of thermodynamics. New York: Charles Scribner's Sons.
[3] Kraskov, Alexander; Stögbauer, Harald; Grassberger, Peter (2004). Estimating mutual information. Physical Review E. 69: 066138. arXiv:cond-mat/0305641. Bibcode:2004PhRvE..69f6138K. doi:10.1103/PhysRevE.69.066138.
[4] Fazlollah M. Reza (1994) [1961]. An Introduction to Information Theory. Dover Publications, Inc., New York. ISBN 0-486-68210-2.
[5] Park, Sung Y.; Bera, Anil K. (2009). Maximum entropy autoregressive conditional heteroskedasticity model (PDF). Journal of Econometrics. Elsevier: 219–230. Retrieved 2011-06-02.
[6] Lazo, A. and P. Rathie (1978). On the entropy of continuous probability distributions. IEEE Transactions on Information Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832.
[7] Jaynes, E.T. (1963). Information Theory And Statistical Mechanics (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b): 181–218.

Thomas M. Cover, Joy A. Thomas. Elements of Information Theory. New York: Wiley, 1991. ISBN 0-471-06259-6

5.9 External links


Hazewinkel, Michiel, ed. (2001), Differential entropy, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
Differential entropy. PlanetMath.

Chapter 6

Diversity index
A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of types increases and when evenness increases. For a given number of types, the value of a diversity index is maximized when all types are equally abundant.
When diversity indices are used in ecology, the types of interest are usually species, but they can also be other categories, such as genera, families, functional types or haplotypes. The entities of interest are usually individual plants or animals, and the measure of abundance can be, for example, number of individuals, biomass or coverage. In demography, the entities of interest can be people, and the types of interest various demographic groups. In information science, the entities can be characters and the types the different letters of the alphabet. The most commonly used diversity indices are simple transformations of the effective number of types (also known as 'true diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real phenomenon (but a different one for each diversity index).[1][2][3][4]

6.1 True diversity

True diversity, or the effective number of types, refers to the number of equally abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest (where all types may not be equally abundant). The true diversity in a dataset is calculated by first taking the weighted generalized mean M_{q−1} of the proportional abundances of the types in the dataset, and then taking the reciprocal of this. The equation is:[3][4]

qD = 1 / M_{q−1} = 1 / ( ∑_{i=1}^{R} p_i p_i^{q−1} )^{1/(q−1)} = ( ∑_{i=1}^{R} p_i^q )^{1/(1−q)}

The denominator M_{q−1} equals the average proportional abundance of the types in the dataset as calculated with the weighted generalized mean with exponent q−1. In the equation, R is richness (the total number of types in the dataset), and the proportional abundance of the i-th type is p_i. The proportional abundances themselves are used as the nominal weights. When q = 1, the above equation is undefined. However, the mathematical limit as q approaches 1 is well defined and the corresponding diversity is calculated with the following equation:

1D = 1 / ∏_{i=1}^{R} p_i^{p_i} = exp( −∑_{i=1}^{R} p_i ln(p_i) )

which is the exponential of the Shannon entropy calculated with natural logarithms (see below).
The value of q is often referred to as the order of the diversity. It defines the sensitivity of the diversity value to rare vs. abundant species by modifying how the weighted mean of the species' proportional abundances is calculated.
With some values of the parameter q, the value of M_{q−1} assumes familiar kinds of weighted mean as special cases. In particular, q = 0 corresponds to the weighted harmonic mean, q = 1 to the weighted geometric mean and q = 2 to the weighted arithmetic mean. As q approaches infinity, the weighted generalized mean with exponent q−1 approaches the maximum p_i value, which is the proportional abundance of the most abundant species in the dataset. Generally, increasing the value of q increases the effective weight given to the most abundant species. This leads to obtaining a larger M_{q−1} value and a smaller true diversity (qD) value with increasing q.
When q = 1, the weighted geometric mean of the p_i values is used, and each species is exactly weighted by its proportional abundance (in the weighted geometric mean, the weights are the exponents). When q > 1, the weight given to abundant species is exaggerated, and when q < 1, the weight given to rare species is. At q = 0, the species weights exactly cancel out the species' proportional abundances, such that the weighted mean of the p_i values equals 1/R even when all species are not equally abundant. At q = 0, the effective number of species, 0D, hence equals the actual number of species R. In the context of diversity, q is generally limited to non-negative values. This is because negative values of q would give rare species so much more weight than abundant ones that qD would exceed R.[3][4]
The general equation of diversity is often written in the form[1][2]

qD = ( ∑_{i=1}^{R} p_i^q )^{1/(1−q)}

and the term inside the parentheses is called the basic sum. Some popular diversity indices correspond to the basic sum as calculated with different values of q.[2]
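The general equation, together with the q → 1 limit, fits in a few lines of code (a sketch; the function name and the example abundances are ours):

```python
import math

def true_diversity(p, q):
    """Effective number of types qD = (sum p_i^q)^(1/(1-q)).

    The q = 1 case uses the limit exp(-sum p_i ln p_i), i.e. the
    exponential of the Shannon entropy.
    """
    if abs(q - 1.0) < 1e-12:
        return math.exp(-sum(pi * math.log(pi) for pi in p if pi > 0))
    return sum(pi ** q for pi in p if pi > 0) ** (1 / (1 - q))

p = [0.5, 0.3, 0.2]
assert abs(true_diversity(p, 0) - 3) < 1e-9            # q = 0: richness R
assert true_diversity(p, 2) < true_diversity(p, 1) < 3  # qD decreases with q
assert abs(true_diversity([0.25] * 4, 2) - 4) < 1e-9   # equal abundances: qD = R
```

For the example abundances, 1D ≈ 2.80 and 2D ≈ 2.63, both below the richness of 3, reflecting the unevenness of the distribution.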

6.2 Richness
Main article: Species richness
Richness R simply quantifies how many different types the dataset of interest contains. For example, species richness (usually noted S) of a dataset is the number of different species in the corresponding species list. Richness is a simple measure, so it has been a popular diversity index in ecology, where abundance data are often not available for the datasets of interest. Because richness does not take the abundances of the types into account, it is not the same thing as diversity, which does take abundances into account. However, if true diversity is calculated with q = 0, the effective number of types (0D) equals the actual number of types (R).[2][4]

6.3 Shannon index

The Shannon index has been a popular diversity index in the ecological literature, where it is also known as Shannon's diversity index, the Shannon–Wiener index, the Shannon–Weaver index and the Shannon entropy. The measure was originally proposed by Claude Shannon to quantify the entropy (uncertainty or information content) in strings of text.[5] The idea is that the more different letters there are, and the more equal their proportional abundances in the string of interest, the more difficult it is to correctly predict which letter will be the next one in the string. The Shannon entropy quantifies the uncertainty (entropy or degree of surprise) associated with this prediction. It is most often calculated as follows:

H′ = −∑_{i=1}^{R} p_i ln(p_i)

where p_i is the proportion of characters belonging to the i-th type of letter in the string of interest. In ecology, p_i is often the proportion of individuals belonging to the i-th species in the dataset of interest. Then the Shannon entropy quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
Although the equation is here written with natural logarithms, the base of the logarithm used when calculating the Shannon entropy can be chosen freely. Shannon himself discussed logarithm bases 2, 10 and e, and these have since become the most popular bases in applications that use the Shannon entropy. Each log base corresponds to a different

6.4. SIMPSON INDEX

45

measurement unit, which have been called binary digits (bits), decimal digits (decits) and natural digits (nats) for the bases 2, 10 and e, respectively. Comparing Shannon entropy values that were originally calculated with different log bases requires converting them to the same log base: change from base a to base b is obtained by multiplication by log_b(a).[5]
It has been shown that the Shannon index is based on the weighted geometric mean of the proportional abundances of the types, and that it equals the logarithm of true diversity as calculated with q = 1:[3]

H′ = −∑_{i=1}^{R} p_i ln(p_i) = −∑_{i=1}^{R} ln(p_i^{p_i})

This can also be written

H′ = −( ln(p_1^{p_1}) + ln(p_2^{p_2}) + ln(p_3^{p_3}) + ⋯ + ln(p_R^{p_R}) )

which equals

H′ = −ln( p_1^{p_1} p_2^{p_2} p_3^{p_3} ⋯ p_R^{p_R} ) = ln( 1 / (p_1^{p_1} p_2^{p_2} p_3^{p_3} ⋯ p_R^{p_R}) ) = ln( 1 / ∏_{i=1}^{R} p_i^{p_i} )

Since the sum of the p_i values equals unity by definition, the denominator equals the weighted geometric mean of the p_i values, with the p_i values themselves being used as the weights (exponents in the equation). The term within the parentheses hence equals true diversity 1D, and H′ equals ln(1D).[1][3][4]
When all types in the dataset of interest are equally common, all p_i values equal 1/R, and the Shannon index hence takes the value ln(R). The more unequal the abundances of the types, the larger the weighted geometric mean of the p_i values, and the smaller the corresponding Shannon entropy. If practically all abundance is concentrated in one type, and the other types are very rare (even if there are many of them), Shannon entropy approaches zero. When there is only one type in the dataset, Shannon entropy exactly equals zero (there is no uncertainty in predicting the type of the next randomly chosen entity).
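These boundary cases translate directly into code (a sketch; the function name and example abundances are ours):

```python
import math

def shannon_index(p):
    """H' = -sum p_i ln(p_i), summed over types with nonzero abundance."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

R = 5
equal = [1 / R] * R
# Equal abundances give the maximum value ln(R).
assert abs(shannon_index(equal) - math.log(R)) < 1e-12

# Concentrating abundance in one type lowers the index.
skewed = [0.96, 0.01, 0.01, 0.01, 0.01]
assert shannon_index(skewed) < shannon_index(equal)

# A single type means no uncertainty at all.
assert shannon_index([1.0]) == 0.0
```

The guard `if pi > 0` implements the usual 0 ln 0 = 0 convention for absent types.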

6.3.1 Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

qH = (1 / (1 − q)) ln( ∑_{i=1}^{R} p_i^q )

which equals

qH = −ln( ( ∑_{i=1}^{R} p_i p_i^{q−1} )^{1/(q−1)} ) = ln(qD)

This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q.

6.4 Simpson index

The Simpson index was introduced in 1949 by Edward H. Simpson to measure the degree of concentration when individuals are classified into types.[6] The same index was rediscovered by Orris C. Herfindahl in 1950.[7] The square root of the index had already been introduced in 1945 by the economist Albert O. Hirschman.[8] As a result, the same measure is usually known as the Simpson index in ecology, and as the Herfindahl index or the Herfindahl–Hirschman index (HHI) in economics.
The measure equals the probability that two entities taken at random from the dataset of interest represent the same type.[6] It equals:

λ = ∑_{i=1}^{R} p_i²

This also equals the weighted arithmetic mean of the proportional abundances p_i of the types of interest, with the proportional abundances themselves being used as the weights.[1] Proportional abundances are by definition constrained to values between zero and unity, and so is their weighted arithmetic mean λ; λ can never be smaller than 1/R, which is reached when all types are equally abundant.
By comparing the equation used to calculate λ with the equations used to calculate true diversity, it can be seen that 1/λ equals 2D, i.e. true diversity as calculated with q = 2. The original Simpson's index hence equals the corresponding basic sum.[2]
The interpretation of λ as the probability that two entities taken at random from the dataset of interest represent the same type assumes that the first entity is replaced to the dataset before taking the second entity. If the dataset is very large, sampling without replacement gives approximately the same result, but in small datasets the difference can be substantial. If the dataset is small, and sampling without replacement is assumed, the probability of obtaining the same type with both random draws is:

l = ∑_{i=1}^{R} n_i(n_i − 1) / (N(N − 1))

where n_i is the number of entities belonging to the i-th type and N is the total number of entities in the dataset.[6] This form of the Simpson index is also known as the Hunter–Gaston index in microbiology.[9]
Since mean proportional abundance of the types increases with decreasing number of types and increasing abundance of the most abundant type, λ obtains small values in datasets of high diversity and large values in datasets of low diversity. This is counterintuitive behavior for a diversity index, so transformations of λ that increase with increasing diversity have often been used instead. The most popular of such indices have been the inverse Simpson index (1/λ) and the Gini–Simpson index (1 − λ).[1][2] Both of these have also been called the Simpson index in the ecological literature, so care is needed to avoid accidentally comparing the different indices as if they were the same.
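The three related quantities are one sum apart from each other (a sketch; the function name and example abundances are ours):

```python
def simpson_family(p):
    """Return (Simpson's lambda, inverse Simpson 1/lambda, Gini-Simpson 1 - lambda)."""
    lam = sum(pi ** 2 for pi in p)
    return lam, 1 / lam, 1 - lam

p = [0.6, 0.2, 0.1, 0.1]
lam, inv, gini = simpson_family(p)
assert abs(lam - 0.42) < 1e-9    # 0.36 + 0.04 + 0.01 + 0.01
assert abs(inv - 1 / 0.42) < 1e-9  # true diversity of order 2
assert abs(gini - 0.58) < 1e-9

# lambda attains its minimum 1/R when all R types are equally abundant.
lam_eq, _, _ = simpson_family([0.25] * 4)
assert abs(lam_eq - 0.25) < 1e-9
```

Note how λ rises (and 1/λ falls) as abundance concentrates in fewer types, matching the "degree of concentration" reading of the index.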

6.4.1 Inverse Simpson index

The inverse Simpson index equals:

1/λ = 1 / ∑_{i=1}^{R} p_i² = 2D

This simply equals true diversity of order 2, i.e. the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.

6.4.2 Gini–Simpson index

The original Simpson index λ equals the probability that two entities taken at random from the dataset of interest (with replacement) represent the same type. Its transformation 1 − λ therefore equals the probability that the two entities represent different types. This measure is also known in ecology as the probability of interspecific encounter (PIE)[10] and the Gini–Simpson index.[2] It can be expressed as a transformation of true diversity of order 2:

1 − λ = 1 − ∑_{i=1}^{R} p_i² = 1 − 1/2D

The Gibbs–Martin index of sociology, psychology and management studies,[11] which is also known as the Blau index, is the same measure as the Gini–Simpson index.

6.5 Berger–Parker index

The Berger–Parker[12] index equals the maximum p_i value in the dataset, i.e. the proportional abundance of the most abundant type. This corresponds to the weighted generalized mean of the p_i values when q approaches infinity, and hence equals the inverse of true diversity of order infinity (1/∞D).
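The limit can be seen numerically: for large q, 1/qD approaches max(p_i) (a sketch; the function name, the example abundances and the choice q = 100 are ours):

```python
def true_diversity(p, q):
    """qD = (sum p_i^q)^(1/(1-q)) for q != 1."""
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

p = [0.5, 0.3, 0.2]
berger_parker = max(p)  # proportional abundance of the most abundant type

# As q grows, the inverse of true diversity converges to the Berger-Parker index.
approx = 1 / true_diversity(p, 100)
assert abs(approx - berger_parker) < 1e-2
```

Already at q = 100 the approximation agrees with max(p_i) to within about half a percent for these abundances.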

6.6 See also


Species diversity
Species richness
Alpha diversity
Beta diversity
Cultural diversity
Gamma diversity
Qualitative variation
Isolation index
Relative abundance
Effective number of parties – a diversity index applied to political parties

6.7 References
[1] Hill, M. O. (1973). Diversity and evenness: a unifying notation and its consequences. Ecology. 54: 427–432. doi:10.2307/1934352.
[2] Jost, L (2006). Entropy and diversity. Oikos. 113: 363–375. doi:10.1111/j.2006.0030-1299.14714.x.
[3] Tuomisto, H (2010). A diversity of beta diversities: straightening up a concept gone awry. Part 1. Defining beta diversity as a function of alpha and gamma diversity. Ecography. 33: 2–22. doi:10.1111/j.1600-0587.2009.05880.x.
[4] Tuomisto, H (2010). A consistent terminology for quantifying species diversity? Yes, it does exist. Oecologia. 164: 853–860. doi:10.1007/s00442-010-1812-0.
[5] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423 and 623–656.
[6] Simpson, E. H. (1949). Measurement of diversity. Nature. 163: 688. doi:10.1038/163688a0.
[7] Herfindahl, O. C. (1950). Concentration in the U.S. Steel Industry. Unpublished doctoral dissertation, Columbia University.
[8] Hirschman, A. O. (1945). National power and the structure of foreign trade. Berkeley.
[9] Hunter, PR; Gaston, MA (1988). Numerical index of the discriminatory ability of typing systems: an application of Simpson's index of diversity. J Clin Microbiol. 26 (11): 2465–2466. PMID 3069867.
[10] Hurlbert, S.H. (1971). The nonconcept of species diversity: A critique and alternative parameters. Ecology. 52: 577–586. doi:10.2307/1934145.
[11] Gibbs, Jack P.; William T. Martin (1962). Urbanization, technology and the division of labor. American Sociological Review. 27: 667–677. doi:10.2307/2089624. JSTOR 2089624.
[12] Berger, Wolfgang H.; Parker, Frances L. (June 1970). Diversity of Planktonic Foraminifera in Deep-Sea Sediments. Science. 168 (3937): 1345–1347. doi:10.1126/science.168.3937.1345.


6.8 Further reading


Colinvaux, Paul A. (1973). Introduction to Ecology. Wiley. ISBN 0-471-16498-4.
Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. Wiley. ISBN 0-471-06259-6. See chapter 5 for an elaboration of coding procedures described informally above.
Chao, A.; Shen, T-J. (2003). Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample (PDF). Environmental and Ecological Statistics. 10 (4): 429–443. doi:10.1023/A:1026096204727.

6.9 External links


Simpson's Diversity index
Diversity indices: gives some examples of estimates of Simpson's index for real ecosystems.

Chapter 7

Conditional entropy

Venn diagram for various information measures associated with correlated variables X and Y. The area contained by both circles
is the joint entropy H(X,Y). The circle on the left (red and violet) is the individual entropy H(X), with the red being the conditional
entropy H(X|Y). The circle on the right (blue and violet) is H(Y), with the blue being H(Y|X). The violet is the mutual information
I(X;Y).

In information theory, the conditional entropy (or equivocation) quantifies the amount of information needed to
describe the outcome of a random variable Y given that the value of another random variable X is known. Here,
information is measured in shannons, nats, or hartleys. The entropy of Y conditioned on X is written as H(Y |X) .

7.1 Definition

If H(Y |X = x) is the entropy of the variable Y conditioned on the variable X taking a certain value x, then
H(Y |X) is the result of averaging H(Y |X = x) over all possible values x that X may take.
Given discrete random variables X with image 𝒳 and Y with image 𝒴, the conditional entropy of Y given X is
defined as: (Intuitively, the following can be thought of as the weighted sum of H(Y |X = x) for each possible value
of x, using p(x) as the weights)[1]

H(Y |X) ≡ Σ_{x∈𝒳} p(x) H(Y |X = x)
        = −Σ_{x∈𝒳} p(x) Σ_{y∈𝒴} p(y|x) log p(y|x)
        = −Σ_{x∈𝒳, y∈𝒴} p(x, y) log p(y|x)
        = −Σ_{x∈𝒳, y∈𝒴} p(x, y) log ( p(x, y) / p(x) )
        = Σ_{x∈𝒳, y∈𝒴} p(x, y) log ( p(x) / p(x, y) ).

Note: It is understood that the expressions 0 log 0 and 0 log(c/0) for fixed c > 0 should be treated as being equal to
zero.
H(Y |X) = 0 if and only if the value of Y is completely determined by the value of X. Conversely, H(Y |X) =
H(Y ) if and only if Y and X are independent random variables.
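The definition above translates directly into code. A minimal sketch, assuming the joint distribution is given as a Python dict mapping (x, y) pairs to probabilities; the function name is illustrative, and terms with zero probability are skipped, matching the 0 log 0 convention above:

```python
from math import log2

# Sketch: H(Y|X) computed from the definition, for a joint distribution
# given as {(x, y): p(x, y)}.
def conditional_entropy(joint):
    # marginal p(x)
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    # H(Y|X) = -sum over (x,y) of p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)
    return -sum(p * log2(p / px[x]) for (x, _), p in joint.items() if p > 0)

# Y is a fair coin independent of X: H(Y|X) = H(Y) = 1 bit.
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(conditional_entropy(joint))
```

With the deterministic table {(0, 0): 0.5, (1, 1): 0.5}, the same function returns 0, illustrating that H(Y|X) = 0 when Y is completely determined by X.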

7.2 Chain rule

Assume that the combined system determined by two random variables X and Y has joint entropy H(X, Y ), that
is, we need H(X, Y ) bits of information to describe its exact state. Now if we first learn the value of X, we have
gained H(X) bits of information. Once X is known, we only need H(X, Y ) − H(X) bits to describe the state of
the whole system. This quantity is exactly H(Y |X), which gives the chain rule of conditional entropy:

H(Y |X) = H(X, Y ) − H(X).


The chain rule follows from the above definition of conditional entropy:

H(Y |X) = Σ_{x∈𝒳, y∈𝒴} p(x, y) log ( p(x) / p(x, y) )
        = −Σ_{x∈𝒳, y∈𝒴} p(x, y) log p(x, y) + Σ_{x∈𝒳, y∈𝒴} p(x, y) log p(x)
        = H(X, Y ) + Σ_{x∈𝒳} p(x) log p(x)
        = H(X, Y ) − H(X).

In general, a chain rule for multiple random variables holds:

H(X₁, X₂, …, Xₙ) = Σ_{i=1}^n H(Xᵢ | X₁, …, Xᵢ₋₁)

It has a similar form to Chain rule (probability) in probability theory, except that addition instead of multiplication
is used.
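The chain rule is easy to check numerically. A small sketch with a made-up joint distribution (the numbers are illustrative only), comparing the direct definition of H(Y|X) with H(X,Y) − H(X):

```python
from math import log2

# Numeric check of the chain rule H(Y|X) = H(X,Y) - H(X).
def H(dist):
    # Shannon entropy in bits of a distribution given as {outcome: prob}.
    return -sum(p * log2(p) for p in dist.values() if p > 0)

joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}   # p(x, y), illustrative
px = {0: 0.75, 1: 0.25}                             # marginal of X

# H(Y|X) computed directly from the definition: -sum p(x,y) log2 p(y|x)
h_y_given_x = -sum(p * log2(p / px[x]) for (x, _), p in joint.items())

print(h_y_given_x, H(joint) - H(px))  # the two sides agree
```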


7.3 Bayes' rule

Bayes' rule for conditional entropy states

H(Y |X) = H(X|Y ) − H(X) + H(Y ).


Proof. H(Y |X) = H(X, Y ) − H(X) and H(X|Y ) = H(Y, X) − H(Y ). Symmetry implies H(X, Y ) =
H(Y, X). Subtracting the two equations implies Bayes' rule.
If Y is conditionally independent of Z given X, we have:

H(Y |X, Z) = H(Y |X).

7.4 Generalization to quantum theory

In quantum information theory, the conditional entropy is generalized to the conditional quantum entropy. The latter
can take negative values, unlike its classical counterpart. Bayes' rule does not hold for conditional quantum entropy,
since H(X, Y ) ≠ H(Y, X).

7.5 Other properties

For any X and Y :

H(Y |X) ≤ H(Y )

H(X, Y ) = H(X|Y ) + H(Y |X) + I(X; Y ),
H(X, Y ) = H(X) + H(Y ) − I(X; Y ),
I(X; Y ) ≤ H(X),

where I(X; Y ) is the mutual information between X and Y .
For independent X and Y :

H(Y |X) = H(Y ) and H(X|Y ) = H(X)

Although the specific conditional entropy H(X|Y = y) can be either less than or greater than H(X), H(X|Y ) can
never exceed H(X).

7.6 See also


Entropy (information theory)
Mutual information
Conditional quantum entropy
Variation of information
Entropy power inequality
Likelihood function


7.7 References
[1] Cover, Thomas M.; Thomas, Joy A. (1991). Elements of information theory (1st ed.). New York: Wiley. ISBN 0-471-06259-6.

Chapter 8

Joint entropy

Venn diagram for various information measures associated with correlated variables X and Y. The area contained by both circles
is the joint entropy H(X,Y). The circle on the left (red and violet) is the individual entropy H(X), with the red being the conditional
entropy H(X|Y). The circle on the right (blue and violet) is H(Y), with the blue being H(Y|X). The violet is the mutual information
I(X;Y).

In information theory, joint entropy is a measure of the uncertainty associated with a set of variables.

8.1 Definition

The joint Shannon entropy (in bits) of two variables X and Y is defined as

H(X, Y ) = −Σₓ Σᵧ P(x, y) log₂[P(x, y)]


where x and y are particular values of X and Y , respectively, P(x, y) is the joint probability of these values occurring
together, and P(x, y) log₂[P(x, y)] is defined to be 0 if P(x, y) = 0.
For more than two variables X₁, ..., Xₙ this expands to

H(X₁, ..., Xₙ) = −Σ_{x₁} ... Σ_{xₙ} P(x₁, ..., xₙ) log₂[P(x₁, ..., xₙ)]

where x₁, ..., xₙ are particular values of X₁, ..., Xₙ, respectively, P(x₁, ..., xₙ) is the probability of these values
occurring together, and P(x₁, ..., xₙ) log₂[P(x₁, ..., xₙ)] is defined to be 0 if P(x₁, ..., xₙ) = 0.

8.2 Properties

8.2.1 Greater than individual entropies

The joint entropy of a set of variables is greater than or equal to all of the individual entropies of the variables in the
set.

H(X, Y ) ≥ max[H(X), H(Y )]
H(X₁, ..., Xₙ) ≥ max[H(X₁), ..., H(Xₙ)]

8.2.2 Less than or equal to the sum of individual entropies

The joint entropy of a set of variables is less than or equal to the sum of the individual entropies of the variables in
the set. This is an example of subadditivity. This inequality is an equality if and only if X and Y are statistically
independent.

H(X, Y ) ≤ H(X) + H(Y )
H(X₁, ..., Xₙ) ≤ H(X₁) + ... + H(Xₙ)
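Both bounds can be verified on a small example. A sketch with an illustrative correlated pair of binary variables (the probabilities are made up for the demonstration):

```python
from math import log2

# Check max[H(X), H(Y)] <= H(X,Y) <= H(X) + H(Y) on a dependent pair.
def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # p(x, y)
h_xy = entropy(joint.values())
h_x = entropy([0.5, 0.5])   # marginal of X
h_y = entropy([0.5, 0.5])   # marginal of Y

assert max(h_x, h_y) <= h_xy <= h_x + h_y
print(h_xy, h_x + h_y)      # strict inequality here, since X and Y are dependent
```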

8.3 Relations to other entropy measures

Joint entropy is used in the definition of conditional entropy

H(X|Y ) = H(Y, X) − H(Y )

and

H(X₁, …, Xₙ) = Σ_{k=1}^n H(Xₖ | Xₖ₋₁, …, X₁)

It is also used in the definition of mutual information

I(X; Y ) = H(X) + H(Y ) − H(X, Y )

In quantum information theory, the joint entropy is generalized into the joint quantum entropy.

8.4 References
Theresa M. Korn; Korn, Granino Arthur. Mathematical Handbook for Scientists and Engineers: Definitions,
Theorems, and Formulas for Reference and Review. New York: Dover Publications. pp. 613–614. ISBN
0-486-41147-8.

Chapter 9

Mutual information

Venn diagram for various information measures associated with correlated variables X and Y. The area contained by both circles
is the joint entropy H(X,Y). The circle on the left (red and violet) is the individual entropy H(X), with the red being the conditional
entropy H(X|Y). The circle on the right (blue and violet) is H(Y), with the blue being H(Y|X). The violet is the mutual information
I(X;Y).

In probability theory and information theory, the mutual information (MI) of two random variables is a measure
of the mutual dependence between the two variables. More specifically, it quantifies the amount of information (in
units such as bits) obtained about one random variable through the other random variable. The concept of mutual
information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory
that defines the amount of information held in a random variable.
Not limited to real-valued random variables like the correlation coefficient, MI is more general and determines how
similar the joint distribution p(X,Y) is to the product of the factored marginal distributions p(X)p(Y). MI is the expected
value of the pointwise mutual information (PMI). The most common unit of measurement of mutual information is
the bit.

9.1 Definition

Formally, the mutual information of two discrete random variables X and Y can be defined as:

I(X; Y ) = Σ_{y∈Y} Σ_{x∈X} p(x, y) log ( p(x, y) / (p(x) p(y)) ),

where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability
distribution functions of X and Y respectively.
In the case of continuous random variables, the summation is replaced by a definite double integral:

I(X; Y ) = ∫_Y ∫_X p(x, y) log ( p(x, y) / (p(x) p(y)) ) dx dy,

where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability
density functions of X and Y respectively.
If the log base 2 is used, the units of mutual information are the bit.
Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of
these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does
not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a
deterministic function of Y and Y is a deterministic function of X then all information conveyed by X is shared with
Y: knowing X determines the value of Y and vice versa. As a result, in this case the mutual information is the same
as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). Moreover, this mutual information
is the same as the entropy of X and as the entropy of Y. (A very special case of this is when X and Y are the same
random variable.)
Mutual information is a measure of the inherent dependence expressed in the joint distribution of X and Y relative
to the joint distribution of X and Y under the assumption of independence. Mutual information therefore measures
dependence in the following sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy
to see in one direction: if X and Y are independent, then p(x,y) = p(x) p(y), and therefore:

log ( p(x, y) / (p(x) p(y)) ) = log 1 = 0.

Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)).
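The discrete definition can be sketched directly in Python. The function name and the dict layout for the joint distribution are illustrative assumptions; the two test distributions show the extreme cases discussed above (independence gives 0; identical fair coins give 1 bit):

```python
from math import log2

# Sketch: I(X;Y) from the discrete definition, for a joint distribution
# given as {(x, y): p(x, y)}.
def mutual_information(joint):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}  # independent coins
same = {(0, 0): 0.5, (1, 1): 0.5}                                 # Y = X
print(mutual_information(indep), mutual_information(same))
```

Swapping the roles of x and y in the table leaves the result unchanged, matching the symmetry I(X;Y) = I(Y;X).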

9.2 Relation to other quantities

Mutual information can be equivalently expressed as

I(X; Y ) = H(X) − H(X|Y )
         = H(Y ) − H(Y |X)
         = H(X) + H(Y ) − H(X, Y )
         = H(X, Y ) − H(X|Y ) − H(Y |X)

where H(X) and H(Y ) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y)
is the joint entropy of X and Y. Note the analogy to the union, difference, and intersection of two sets, as illustrated
in the Venn diagram.
Using Jensen's inequality on the definition of mutual information we can show that I(X;Y) is non-negative; consequently, H(X) ≥ H(X|Y ). Here we give the detailed deduction of I(X;Y) = H(Y) − H(Y|X):

I(X; Y ) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )
         = Σ_{x,y} p(x, y) log ( p(x, y) / p(x) ) − Σ_{x,y} p(x, y) log p(y)
         = Σ_{x,y} p(x) p(y|x) log p(y|x) − Σ_{x,y} p(x, y) log p(y)
         = Σ_x p(x) ( Σ_y p(y|x) log p(y|x) ) − Σ_y log p(y) ( Σ_x p(x, y) )
         = −Σ_x p(x) H(Y |X = x) − Σ_y p(y) log p(y)
         = −H(Y |X) + H(Y )
         = H(Y ) − H(Y |X).
The proofs of the other identities above are similar.
Intuitively, if entropy H(Y) is regarded as a measure of uncertainty about a random variable, then H(Y|X) is a measure
of what X does not say about Y. This is the amount of uncertainty remaining about Y after X is known, and thus
the right side of the first of these equalities can be read as "the amount of uncertainty in Y, minus the amount of
uncertainty in Y which remains after X is known", which is equivalent to "the amount of uncertainty in Y which is
removed by knowing X". This corroborates the intuitive meaning of mutual information as the amount of information
(that is, reduction in uncertainty) that knowing either variable provides about the other.
Note that in the discrete case H(X|X) = 0 and therefore H(X) = I(X;X). Thus I(X;X) ≥ I(X;Y), and one can formulate
the basic principle that a variable contains at least as much information about itself as any other variable can provide.
Mutual information can also be expressed as a Kullback–Leibler divergence of the product p(x) p(y) of the marginal
distributions of the two random variables X and Y, from p(x,y), the random variables' joint distribution:

I(X; Y ) = D_KL( p(x, y) ‖ p(x) p(y) ).


Furthermore, let p(x|y) = p(x, y) / p(y). Then

I(X; Y ) = Σ_y p(y) Σ_x p(x|y) log₂ ( p(x|y) / p(x) )
         = Σ_y p(y) D_KL( p(x|y) ‖ p(x) )
         = E_Y { D_KL( p(x|y) ‖ p(x) ) }.
Note that here the Kullback–Leibler divergence involves integration with respect to the random variable X only, and
the expression D_KL( p(x|y) ‖ p(x) ) is now a random variable in Y. Thus mutual information can also be understood
as the expectation of the Kullback–Leibler divergence of the univariate distribution p(x) of X from the conditional
distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x) are on average, the greater the
information gain.
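This expected-KL identity can be checked numerically. A small sketch on a made-up joint distribution (the probabilities are illustrative, with the marginals written out by hand):

```python
from math import log2

# Check I(X;Y) = E_Y[ D_KL( p(x|y) || p(x) ) ] on a small joint table.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # p(x, y)
px = {0: 0.5, 1: 0.5}   # marginal of X
py = {0: 0.4, 1: 0.6}   # marginal of Y

# Mutual information from the definition
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# Expectation over Y of the KL divergence between p(x|y) and p(x)
expected_kl = sum(
    py[y] * sum((joint[(x, y)] / py[y]) * log2((joint[(x, y)] / py[y]) / px[x])
                for x in px)
    for y in py)

print(mi, expected_kl)  # equal, by the identity above
```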

9.3 Variations
Several variations on mutual information have been proposed to suit various needs. Among these are normalized
variants and generalizations to more than two variables.

9.3.1 Metric

Many applications require a metric, that is, a distance measure between pairs of points. The quantity

d(X, Y ) = H(X, Y ) − I(X; Y )
         = H(X) + H(Y ) − 2I(X; Y )
         = H(X|Y ) + H(Y |X)

satisfies the properties of a metric (triangle inequality, non-negativity, indiscernibility and symmetry). This distance
metric is also known as the variation of information.
If X, Y are discrete random variables then all the entropy terms are non-negative, so 0 ≤ d(X, Y ) ≤ H(X, Y ) and
one can define a normalized distance

D(X, Y ) = d(X, Y ) / H(X, Y ) ≤ 1.

The metric D is a universal metric, in that if any other distance measure places X and Y close by, then D will also
judge them close.[1]
Plugging in the definitions shows that

D(X, Y ) = 1 − I(X; Y ) / H(X, Y ).

In a set-theoretic interpretation of information (see the figure for conditional entropy), this is effectively the Jaccard
distance between X and Y.
Finally,

D′(X, Y ) = 1 − I(X; Y ) / max(H(X), H(Y ))

is also a metric.
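The variation-of-information distance and its normalized form are simple to compute from a joint table. A sketch with illustrative probabilities; the helper name is an assumption, not a library function:

```python
from math import log2

# Sketch: d(X,Y) = H(X,Y) - I(X;Y) and D = d / H(X,Y), from a joint table.
def entropies(joint):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    h = lambda d: -sum(p * log2(p) for p in d.values() if p > 0)
    h_xy = -sum(p * log2(p) for p in joint.values() if p > 0)
    return h(px), h(py), h_xy

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # illustrative
h_x, h_y, h_xy = entropies(joint)
i_xy = h_x + h_y - h_xy        # mutual information
d = h_xy - i_xy                # = H(X|Y) + H(Y|X), the variation of information
D = d / h_xy                   # normalized distance, always <= 1
print(d, D)
```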

9.3.2 Conditional mutual information

Main article: Conditional mutual information


Sometimes it is useful to express the mutual information of two random variables conditioned on a third:

I(X; Y |Z) = E_Z( I(X; Y )|Z ) = Σ_{z∈Z} Σ_{y∈Y} Σ_{x∈X} p_Z(z) p_{X,Y|Z}(x, y|z) log ( p_{X,Y|Z}(x, y|z) / (p_{X|Z}(x|z) p_{Y|Z}(y|z)) ),

which can be simplified as

I(X; Y |Z) = Σ_{z∈Z} Σ_{y∈Y} Σ_{x∈X} p_{X,Y,Z}(x, y, z) log ( p_{X,Y,Z}(x, y, z) p_Z(z) / (p_{X,Z}(x, z) p_{Y,Z}(y, z)) ).

Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true
that

I(X; Y |Z) ≥ 0

for discrete, jointly distributed random variables X, Y, Z. This result has been used as a basic building block for
proving other inequalities in information theory.

9.3.3 Multivariate mutual information

Main article: Multivariate mutual information


Several generalizations of mutual information to more than two random variables have been proposed, such as
total correlation and interaction information. If Shannon entropy is viewed as a signed measure in the context of
information diagrams, as explained in the article Information theory and measure theory, then the only definition of
multivariate mutual information that makes sense is as follows:

I(X₁; X₁) = H(X₁)

and for n > 1,

I(X₁; ...; Xₙ) = I(X₁; ...; Xₙ₋₁) − I(X₁; ...; Xₙ₋₁ | Xₙ),

where (as above) we define

I(X₁; ...; Xₙ₋₁ | Xₙ) = E_{Xₙ}( I(X₁; ...; Xₙ₋₁) | Xₙ ).

(This definition of multivariate mutual information is identical to that of interaction information except for a change
in sign when the number of random variables is odd.)
Applications

Applying information diagrams blindly to derive the above definition has been criticised, and indeed it has found
rather limited practical application, since it is difficult to visualize or grasp the significance of this quantity for a large
number of random variables. It can be zero, positive, or negative for any odd number of variables n ≥ 3.
One high-dimensional generalization scheme which maximizes the mutual information between the joint distribution
and other target variables is found to be useful in feature selection.[2]
Mutual information is also used in the area of signal processing as a measure of similarity between two signals. For
example, the FMI metric[3] is an image fusion performance measure that makes use of mutual information in order to
measure the amount of information that the fused image contains about the source images. The Matlab code for this
metric can be found at [4].

9.3.4 Directed information

Directed information, I(Xⁿ → Yⁿ), measures the amount of information that flows from the process Xⁿ to Yⁿ,
where Xⁿ denotes the vector X₁, X₂, ..., Xₙ and Yⁿ denotes Y₁, Y₂, ..., Yₙ. The term directed information was
coined by James Massey and is defined as

I(Xⁿ → Yⁿ) = Σ_{i=1}^n I(Xⁱ; Yᵢ | Yⁱ⁻¹)

Note that if n = 1 the directed information becomes the mutual information. Directed information has many
applications in problems where causality plays an important role, such as capacity of channel with feedback.[5][6]

9.3.5 Normalized variants

Normalized variants of the mutual information are provided by the coefficients of constraint,[7] uncertainty coefficient[8]
or proficiency:[9]

C_XY = I(X; Y ) / H(Y )  and  C_YX = I(X; Y ) / H(X).

The two coefficients are not necessarily equal. In some cases a symmetric measure may be desired, such as the
following redundancy measure:

R = I(X; Y ) / (H(X) + H(Y ))

which attains a minimum of zero when the variables are independent and a maximum value of

R_max = min(H(X), H(Y )) / (H(X) + H(Y ))

when one variable becomes completely redundant with the knowledge of the other. See also Redundancy (information
theory). Another symmetrical measure is the symmetric uncertainty (Witten & Frank 2005), given by

U(X, Y ) = 2R = 2 I(X; Y ) / (H(X) + H(Y ))

which represents a weighted average of the two uncertainty coefficients.[8]


If we consider mutual information as a special case of the total correlation or dual total correlation, the normalized
versions are, respectively,

I(X; Y ) / min[H(X), H(Y )]  and  I(X; Y ) / H(X, Y ).

Finally there is a normalization[10] which derives from first thinking of mutual information as an analogue to covariance
(thus Shannon entropy is analogous to variance). Then the normalized mutual information is calculated akin to the
Pearson correlation coefficient,

I(X; Y ) / √(H(X) H(Y )).
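The normalized variants above are all simple ratios of the same three quantities. A sketch that takes precomputed entropies as input; the function name and dictionary keys are illustrative:

```python
from math import sqrt

# Sketch: the normalized variants above, given H(X), H(Y) and H(X,Y) in bits.
def normalized_mi(h_x, h_y, h_xy):
    i = h_x + h_y - h_xy   # mutual information
    return {
        "C_XY": i / h_y,                          # coefficient of constraint
        "C_YX": i / h_x,
        "redundancy": i / (h_x + h_y),            # R
        "symmetric_uncertainty": 2 * i / (h_x + h_y),  # U = 2R
        "pearson_like": i / sqrt(h_x * h_y),      # covariance-style normalization
    }

# Two fair coins sharing 0.5 bit of information: H(X) = H(Y) = 1, H(X,Y) = 1.5.
print(normalized_mi(1.0, 1.0, 1.5))
```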

9.3.6 Weighted variants

In the traditional formulation of the mutual information,

I(X; Y ) = Σ_{y∈Y} Σ_{x∈X} p(x, y) log ( p(x, y) / (p(x) p(y)) ),

each event or object specified by (x, y) is weighted by the corresponding probability p(x, y). This assumes that all
objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be
the case that certain objects or events are more significant than others, or that certain patterns of association are more
semantically important than others.
For example, the deterministic mapping {(1, 1), (2, 2), (3, 3)} may be viewed as stronger than the deterministic
mapping {(1, 3), (2, 1), (3, 2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954,
Coombs, Dawes & Tversky 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational
mapping between the associated variables. If it is desired that the former relation (showing agreement on all variable
values) be judged stronger than the latter relation, then it is possible to use the following weighted mutual information
(Guiasu 1977):

I(X; Y ) = Σ_{y∈Y} Σ_{x∈X} w(x, y) p(x, y) log ( p(x, y) / (p(x) p(y)) ),

which places a weight w(x, y) on the probability of each variable value co-occurrence, p(x, y). This allows that
certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant
holistic or prägnanz factors. In the above example, using larger relative weights for w(1, 1), w(2, 2), and w(3, 3)
would have the effect of assessing greater informativeness for the relation {(1, 1), (2, 2), (3, 3)} than for the relation
{(1, 3), (2, 1), (3, 2)}, which may be desirable in some cases of pattern recognition, and the like. This weighted
mutual information is a form of weighted KL-divergence, which is known to take negative values for some inputs,[11]
and there are examples where the weighted mutual information also takes negative values.[12]

9.3.7 Adjusted mutual information

Main article: Adjusted mutual information


A probability distribution can be viewed as a partition of a set. One may then ask: if a set were partitioned randomly,
what would the distribution of probabilities be? What would the expectation value of the mutual information be?
The adjusted mutual information or AMI subtracts the expectation value of the MI, so that the AMI is zero when two
different distributions are random, and one when two distributions are identical. The AMI is defined in analogy to
the adjusted Rand index of two different partitions of a set.

9.3.8 Absolute mutual information

Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent
of any probability distribution:

I_K(X; Y ) = K(X) − K(X|Y ).

To establish that this quantity is symmetric up to a logarithmic factor ( I_K(X; Y ) ≈ I_K(Y ; X) ) requires the chain
rule for Kolmogorov complexity (Li & Vitányi 1997). Approximations of this quantity via compression can be used
to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge
of the sequences (Cilibrasi & Vitányi 2005).

9.3.9 Linear correlation

Unlike correlation coefficients, such as the product moment correlation coefficient, mutual information contains information about all dependence, linear and nonlinear, and not just linear dependence as the correlation coefficient
measures. However, in the narrow case that both marginal distributions for X and Y are normally distributed and
their joint distribution is a bivariate normal distribution, there is an exact relationship between I and the correlation
coefficient ρ (Gel'fand & Yaglom 1957):

I = −(1/2) log(1 − ρ²)
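The bivariate-normal relation above is a closed form, so its basic behavior is easy to inspect. A sketch assuming the natural logarithm (i.e. the result is in nats); the function name is illustrative:

```python
from math import log

# Sketch: mutual information of a bivariate normal pair as a function of
# the correlation coefficient rho, per the Gel'fand-Yaglom relation above.
def gaussian_mi(rho):
    return -0.5 * log(1.0 - rho * rho)   # in nats (natural log)

# Zero when uncorrelated; grows without bound as |rho| approaches 1.
print(gaussian_mi(0.0), gaussian_mi(0.5), gaussian_mi(0.9))
```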

9.3.10 For discrete data

When X and Y are limited to be in a discrete number of states, observation data is summarized in a contingency table,
with row variable X (or i) and column variable Y (or j). Mutual information is one of the measures of association or
correlation between the row and column variables. Other measures of association include Pearson's chi-squared test
statistic, the G-test statistic, etc. In fact, mutual information is equal to the G-test statistic divided by 2N, where N is the
sample size.


In the special case where the number of states for both row and column variables is 2 (i, j = 1, 2), the degrees of freedom
of Pearson's chi-squared test is 1. Out of the four terms in the summation

Σ_{i,j} p_ij log ( p_ij / (p_i p_j) ),

only one is independent. This is the reason that the mutual information function has an exact relationship with the correlation
function p_{X=1,Y=1} − p_{X=1} p_{Y=1} for binary sequences.[13]
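The relation to the G-test can be checked on a contingency table. A sketch with made-up counts and hand-computed marginals; note the identity holds when mutual information is measured in nats (natural log), matching the natural log inside G:

```python
from math import log

# Check that empirical mutual information (in nats) equals G / (2N).
counts = {(0, 0): 30, (0, 1): 10, (1, 0): 20, (1, 1): 40}  # observed cell counts
N = sum(counts.values())
row = {0: 40, 1: 60}   # row totals (X)
col = {0: 50, 1: 50}   # column totals (Y)

# G-test statistic: 2 * sum O * ln(O / E), with expected E = row * col / N
G = 2 * sum(o * log(o / (row[x] * col[y] / N)) for (x, y), o in counts.items())

# Empirical mutual information in nats, from the empirical joint and marginals
mi = sum((o / N) * log((o / N) / ((row[x] / N) * (col[y] / N)))
         for (x, y), o in counts.items())

print(G / (2 * N), mi)  # equal
```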

9.4 Applications
In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often
equivalent to minimizing conditional entropy. Examples include:
In search engine technology, mutual information between phrases and contexts is used as a feature for k-means
clustering to discover semantic clusters (concepts).[14]
In telecommunications, the channel capacity is equal to the mutual information, maximized over all input
distributions.
Discriminative training procedures for hidden Markov models have been proposed based on the maximum
mutual information (MMI) criterion.
RNA secondary structure prediction from a multiple sequence alignment.
Phylogenetic profiling prediction from pairwise presence and absence of functionally linked genes.
Mutual information has been used as a criterion for feature selection and feature transformations in machine
learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum
redundancy feature selection.
Mutual information is used in determining the similarity of two different clusterings of a dataset. As such, it
provides some advantages over the traditional Rand index.
Mutual information of words is often used as a significance function for the computation of collocations in
corpus linguistics. This has the added complexity that no word-instance is an instance of two different words;
rather, one counts instances where 2 words occur adjacent or in close proximity; this slightly complicates the
calculation, since the expected probability of one word occurring within N words of another goes up with N.
Mutual information is used in medical imaging for image registration. Given a reference image (for example, a
brain scan), and a second image which needs to be put into the same coordinate system as the reference image,
this image is deformed until the mutual information between it and the reference image is maximized.
Detection of phase synchronization in time series analysis
In the infomax method for neural-net and other machine learning, including the infomax-based Independent
component analysis algorithm
Average mutual information in delay embedding theorem is used for determining the embedding delay parameter.
Mutual information between genes in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks.
In statistical mechanics, Loschmidt's paradox may be expressed in terms of mutual information.[15][16] Loschmidt
noted that it must be impossible to determine a physical law which lacks time reversal symmetry (e.g. the
second law of thermodynamics) only from physical laws which have this symmetry. He pointed out that the
H-theorem of Boltzmann made the assumption that the velocities of particles in a gas were permanently uncorrelated, which removed the time symmetry inherent in the H-theorem. It can be shown that if a system is
described by a probability density in phase space, then Liouville's theorem implies that the joint information


(negative of the joint entropy) of the distribution remains constant in time. The joint information is equal to the
mutual information plus the sum of all the marginal information (negative of the marginal entropies) for each
particle coordinate. Boltzmann's assumption amounts to ignoring the mutual information in the calculation of
entropy, which yields the thermodynamic entropy (divided by Boltzmann's constant).
The mutual information is used to learn the structure of Bayesian networks/dynamic Bayesian networks, which
explain the causal relationship between random variables, as exemplified by the GlobalMIT toolkit: learning
the globally optimal dynamic Bayesian network with the Mutual Information Test criterion.
Popular cost function in decision tree learning.

9.5 See also


Pointwise mutual information
Quantum mutual information

9.6 Notes
[1] Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G.; Grassberger, Peter (2003). Hierarchical Clustering Based on Mutual Information. arXiv:q-bio/0311039.
[2] Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze (2008). An Introduction to Information Retrieval. Cambridge University Press. ISBN 0-521-86571-9.
[3] Haghighat, M. B. A.; Aghagolzadeh, A.; Seyedarabi, H. (2011). A non-reference image fusion metric based on mutual information of image features. Computers & Electrical Engineering. 37 (5): 744–756. doi:10.1016/j.compeleceng.2011.07.012.
[4] http://www.mathworks.com/matlabcentral/fileexchange/45926-feature-mutual-information-fmi-image-fusion-metric
[5] Massey, James (1990). Causality, Feedback And Directed Information (ISITA).
[6] Permuter, Haim Henry; Weissman, Tsachy; Goldsmith, Andrea J. (February 2009). Finite State Channels With Time-Invariant Deterministic Feedback. IEEE Transactions on Information Theory. 55 (2): 644–662. doi:10.1109/TIT.2008.2009849.
[7] Coombs, Dawes & Tversky 1970.
[8] Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). Section 14.7.3. Conditional Entropy and Mutual Information. Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
[9] White, Jim; Steingold, Sam; Fournelle, Connie. Performance Metrics for Group-Detection Algorithms (PDF).
[10] Strehl, Alexander; Ghosh, Joydeep (2002). Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions (PDF). The Journal of Machine Learning Research. 3 (Dec): 583–617.
[11] Kvålseth, T. O. (1991). The relative useful information measure: some comments. Information Sciences. 56 (1): 35–38. doi:10.1016/0020-0255(91)90022-m.
[12] Pocock, A. (2012). Feature Selection Via Joint Likelihood (PDF) (Thesis).
[13] Wentian Li (1990). Mutual information functions versus correlation functions. J. Stat. Phys. 60 (5–6): 823–837. doi:10.1007/BF01025996.
[14] Parsing a Natural Language Using Mutual Information Statistics by David M. Magerman and Mitchell P. Marcus
[15] Hugh Everett, Theory of the Universal Wavefunction, Thesis, Princeton University (1956, 1973), pp. 1–140 (page 30).
[16] Everett, Hugh (1957). Relative State Formulation of Quantum Mechanics. Reviews of Modern Physics. 29: 454–462. doi:10.1103/revmodphys.29.454.


9.7 References
Cilibrasi, R.; Vitányi, Paul (2005). "Clustering by compression" (PDF). IEEE Transactions on Information Theory. 51 (4): 1523–1545. doi:10.1109/TIT.2005.844059.
Cronbach, L. J. (1954). "On the non-rational application of information measures in psychology". In Quastler, Henry. Information Theory in Psychology: Problems and Methods. Glencoe, Illinois: Free Press. pp. 14–30.
Coombs, C. H.; Dawes, R. M.; Tversky, A. (1970). Mathematical Psychology: An Elementary Introduction. Englewood Cliffs, New Jersey: Prentice-Hall.
Church, Kenneth Ward; Hanks, Patrick (1989). "Word association norms, mutual information, and lexicography". Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. doi:10.1145/90000/89095 (inactive 2016-01-30).
Gel'fand, I.M.; Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations: Series 2. 12: 199–246. English translation of original in Uspekhi Matematicheskikh Nauk 2 (1): 3–52.
Guiasu, Silviu (1977). Information Theory with Applications. McGraw-Hill, New York. ISBN 978-0-07-025109-0.
Li, Ming; Vitányi, Paul (February 1997). An Introduction to Kolmogorov Complexity and Its Applications. New York: Springer-Verlag. ISBN 0-387-94868-6.
Lockhead, G. R. (1970). "Identification and the form of multidimensional discrimination space". Journal of Experimental Psychology. 85 (1): 1–10. doi:10.1037/h0029508. PMID 5458322.
David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1 (available free online).
Haghighat, M. B. A.; Aghagolzadeh, A.; Seyedarabi, H. (2011). "A non-reference image fusion metric based on mutual information of image features". Computers & Electrical Engineering. 37 (5): 744–756. doi:10.1016/j.compeleceng.2011.0
Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)
Witten, Ian H. & Frank, Eibe (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Amsterdam. ISBN 978-0-12-374856-0.
Peng, H.C., Long, F., and Ding, C. (2005). "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy". IEEE Transactions on Pattern Analysis and Machine Intelligence. 27 (8): 1226–1238. doi:10.1109/tpami.2005.159. PMID 16119262.
Andre S. Ribeiro; Stuart A. Kauffman; Jason Lloyd-Price; Bjorn Samuelsson & Joshua Socolar (2008). "Mutual Information in Random Boolean models of regulatory networks". Physical Review E. 77 (1). arXiv:0707.3642. doi:10.1103/physreve.77.011901.
Wells, W.M. III; Viola, P.; Atsumi, H.; Nakajima, S.; Kikinis, R. (1996). "Multi-modal volume registration by maximization of mutual information" (PDF). Medical Image Analysis. 1 (1): 35–51. doi:10.1016/S1361-8415(01)80004-9. PMID 9873920.

Chapter 10

Cross entropy
In information theory, the cross entropy between two probability distributions p and q over the same underlying set
of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is
used that is optimized for an unnatural probability distribution q , rather than the true distribution p .
The cross entropy for the distributions p and q over a given set is defined as follows:

H(p, q) = \mathrm{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \,\|\, q),


where H(p) is the entropy of p, and D_{KL}(p‖q) is the Kullback–Leibler divergence of q from p (also known as the relative entropy of p with respect to q; note the reversal of emphasis).
For discrete p and q this means
H(p, q) = -\sum_x p(x) \log q(x).

The situation for continuous distributions is analogous. We have to assume that p and q are absolutely continuous
with respect to some reference measure r (usually r is a Lebesgue measure on a Borel σ-algebra). Let P and Q be
probability density functions of p and q with respect to r . Then

H(p, q) = -\int_X P(x) \log Q(x)\, \mathrm{d}r(x) = \mathrm{E}_p[-\log Q].
NB: The notation H(p, q) is also used for a different concept, the joint entropy of p and q.
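The discrete identity H(p, q) = H(p) + DKL(p‖q) is easy to check numerically. The following sketch (plain Python; the two three-outcome distributions are chosen arbitrarily for illustration) computes each quantity in bits:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log2 (p(x)/q(x))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # true distribution
q = [0.25, 0.25, 0.5]  # assumed (coding) distribution

# The decomposition H(p, q) = H(p) + D_KL(p || q) holds exactly.
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```

Here cross_entropy(p, q) is 1.75 bits against entropy(p) = 1.5 bits: coding for the wrong distribution costs the DKL(p‖q) = 0.25 extra bits per event.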

10.1 Motivation
In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding
a message to identify one value x_i out of a set of possibilities X can be seen as representing an implicit probability
distribution q(x_i) = 2^{-l_i} over X, where l_i is the length of the code for x_i in bits. Therefore, cross entropy can be
interpreted as the expected message-length per datum when a wrong distribution Q is assumed while the data actually
follows a distribution P . That is why the expectation is taken over the probability distribution P and not Q .
H(p, q) = \mathrm{E}_p[l_i] = \mathrm{E}_p\left[\log \frac{1}{q(x_i)}\right]

H(p, q) = \sum_{x_i} p(x_i) \log \frac{1}{q(x_i)}

H(p, q) = -\sum_x p(x) \log q(x).
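This interpretation can be made concrete with a toy prefix code (the symbols and code lengths below are invented for illustration). The code lengths l_i induce the implicit distribution q(x_i) = 2^(-l_i), and the expected code length under the true distribution p is exactly the cross entropy H(p, q):

```python
import math

# Code lengths (bits) of a prefix code: a -> "0", b -> "10", c -> "11".
lengths = {"a": 1, "b": 2, "c": 2}

# Kraft equality holds here, so q(x_i) = 2^(-l_i) is a probability distribution.
q = {x: 2.0 ** -l for x, l in lengths.items()}
assert abs(sum(q.values()) - 1.0) < 1e-12

# True symbol frequencies differ from the implicit distribution q.
p = {"a": 0.25, "b": 0.25, "c": 0.5}

expected_length = sum(p[x] * lengths[x] for x in p)   # E_p[l_i]
cross_ent = -sum(p[x] * math.log2(q[x]) for x in p)   # H(p, q)
assert abs(expected_length - cross_ent) < 1e-12       # both equal 1.75 bits
```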


10.2 Estimation
There are many situations where cross-entropy needs to be measured but the true distribution p is unknown. An example
is language modeling, where a model is created based on a training set T , and then its cross-entropy is measured on
a test set to assess how accurate the model is in predicting the test data. In this example, p is the true distribution
of words in any corpus, and q is the distribution of words as predicted by the model. Since the true distribution is
unknown, cross-entropy cannot be directly calculated. In these cases, an estimate of cross-entropy is calculated using
the following formula:

H(T, q) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i),
where N is the size of the test set, and q(x) is the probability of event x estimated from the training set. The sum
is calculated over N . This is a Monte Carlo estimate of the true cross entropy, where the training set is treated as
samples from p(x) .
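A minimal sketch of this estimate (the toy corpus and test set are invented; a real language model would also smooth the counts so that unseen test words do not give q(x) = 0):

```python
import math
from collections import Counter

# "Train" a unigram model q by counting words in a toy corpus.
train = ["the", "cat", "the", "dog", "the", "cat", "sat", "sat"]
counts = Counter(train)
q = {w: c / len(train) for w, c in counts.items()}

# Monte Carlo estimate on a test set: H(T, q) = -(1/N) sum_i log2 q(x_i).
test = ["the", "cat", "sat"]
estimate = -sum(math.log2(q[w]) for w in test) / len(test)
print(round(estimate, 3))  # about 1.805 bits per word
```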

10.3 Cross-entropy minimization


Cross-entropy minimization is frequently used in optimization and rare-event probability estimation; see the cross-entropy method.
When comparing a distribution q against a fixed reference distribution p, cross entropy and KL divergence are
identical up to an additive constant (since p is fixed): both take on their minimal values when p = q, which is 0 for
KL divergence, and H(p) for cross entropy.[1] In the engineering literature, the principle of minimising KL divergence
(Kullback's "Principle of Minimum Discrimination Information") is often called the Principle of Minimum Cross-Entropy (MCE), or Minxent.
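A quick numerical check of this statement, with an arbitrary fixed p and a few hand-picked candidates q (Gibbs' inequality guarantees the general case):

```python
import math

def cross_entropy(p, q):
    """H(p, q) in bits for discrete distributions given as lists."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]        # fixed reference distribution
h_p = cross_entropy(p, p)  # H(p, p) = H(p), the attainable minimum

# Every other candidate q gives a strictly larger cross entropy.
for q in ([0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.8, 0.1, 0.1]):
    assert cross_entropy(p, q) > h_p
```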
However, as discussed in the article Kullback–Leibler divergence, sometimes the distribution q is the fixed prior
reference distribution, and the distribution p is optimised to be as close to q as possible, subject to some constraint.
In this case the two minimisations are not equivalent. This has led to some ambiguity in the literature, with some
authors attempting to resolve the inconsistency by redefining cross-entropy to be D_{KL}(p‖q), rather than H(p, q).

10.4 Cross-entropy error function and logistic regression


Cross entropy can be used to define the loss function in machine learning and optimization. The true probability p_i
is the true label, and the given distribution q_i is the predicted value of the current model.
More specifically, let us consider logistic regression, which (in its most basic form) deals with classifying a given set of
data points into two possible classes generically labelled 0 and 1. The logistic regression model thus predicts an output
y ∈ {0, 1}, given an input vector x. The probability is modeled using the logistic function g(z) = 1/(1 + e^{-z}).
Namely, the probability of finding the output y = 1 is given by

q_{y=1} = \hat{y} \equiv g(w \cdot x),

where the vector of weights w is optimized through some appropriate algorithm such as gradient descent. Similarly,
the complementary probability of finding the output y = 0 is simply given by

q_{y=0} = 1 - \hat{y}.

The true (observed) probabilities can be expressed similarly as p_{y=1} = y and p_{y=0} = 1 - y.
Having set up our notation, p ∈ {y, 1 − y} and q ∈ {ŷ, 1 − ŷ}, we can use cross entropy to get a measure for
similarity between p and q:


H(p, q) = -\sum_i p_i \log q_i = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})

The typical loss function that one uses in logistic regression is computed by taking the average of all cross-entropies
in the sample. For example, suppose we have N samples with each sample labeled by n = 1, . . . , N . The loss
function is then given by:
L(w) = \frac{1}{N} \sum_{n=1}^{N} H(p_n, q_n) = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n) \right],
where yn g(w xn ) , with g(z) the logistic function as before.
The logistic loss is sometimes called cross-entropy loss. It's also known as log loss (in this case, the binary label is
often denoted by {−1, +1}).[2]
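A minimal sketch of this loss in plain Python (the weights, feature vectors, and 0/1 labels below are invented for illustration). Weights that separate the two samples correctly drive the average cross-entropy toward zero:

```python
import math

def logistic(z):
    """g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, xs, ys):
    """Average binary cross-entropy over the samples."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = logistic(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return total / len(xs)

xs = [(1.0, 0.0), (0.0, 1.0)]  # two feature vectors
ys = [1, 0]                    # their true labels

# Uninformative weights predict 0.5 everywhere: loss = ln 2 per sample.
# Weights aligned with the labels bring the loss close to zero.
assert log_loss((4.0, -4.0), xs, ys) < log_loss((0.0, 0.0), xs, ys)
```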

10.5 See also


Cross-entropy method
Logistic regression
Conditional entropy
Maximum likelihood estimation

10.6 References
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press. Online
[2] Murphy, Kevin (2012). Machine Learning: A Probabilistic Perspective. MIT. ISBN 978-0262018029.

de Boer, Pieter-Tjerk; Kroese, Dirk P.; Mannor, Shie; Rubinstein, Reuven Y. (February 2005). "A Tutorial on the Cross-Entropy Method" (PDF). Annals of Operations Research. 134 (1). pp. 19–67. doi:10.1007/s10479-005-5724-z. ISSN 1572-9338.

10.7 External links


What is cross-entropy, and why use it?

