Contents

1 Information theory
    1.1 Overview
    1.2 Historical background
    1.3 Quantities of information
        1.3.2 Joint entropy
        1.3.6 Other quantities
    1.4 Coding theory
        1.4.1 Source theory
        1.4.2 Channel capacity
    1.5
        1.5.3 Seismic exploration
        1.5.4 Semiotics
        1.5.5 Miscellaneous applications
    1.6 See also
        1.6.1 Applications
        1.6.2 History
        1.6.3 Theory
        1.6.4 Concepts
    1.7 References
        1.7.4 Other books
    1.8 External links

2 Self-information
    2.1 Definition
    2.2 Examples
    2.3 Self-information of a partitioning
    2.4 Relationship to entropy
    2.5 References
    2.6 External links

3
    3.1 Introduction
    3.2 Definition
    3.3 Example
    3.4 Rationale
    3.5 Aspects
        3.5.4 Data compression
        3.5.9 b-ary entropy
    3.6 Efficiency
    3.7 Characterization
        3.7.1 Continuity
        3.7.2 Symmetry
        3.7.3 Maximum
        3.7.4 Additivity
    3.8 Further properties
    3.9
        3.9.1 Differential entropy
        3.9.3 Relative entropy
    3.12 References

4
    4.1 Explanation
    4.2 Derivative
    4.3 Taylor series
    4.4 See also
    4.5 References

5 Differential entropy
    5.1 Definition
    5.6 Variants
    5.7 See also
    5.8 References
    5.9 External links

6 Diversity index
    6.1 True diversity
    6.2 Richness
    6.3 Shannon index
        6.3.1 Rényi entropy
    6.4 Simpson index
        6.4.2 Gini–Simpson index
    6.5 Berger–Parker index
    6.6 See also
    6.7 References
    6.8 Further reading
    6.9 External links

7 Conditional entropy
    7.1 Definition
    7.2 Chain rule
    7.3 Bayes' rule
    7.5 Other properties
    7.6 See also
    7.7 References

8 Joint entropy
    8.1 Definition
    8.2 Properties
    8.4 References

9 Mutual information
    9.1 Definition
    9.3 Variations
        9.3.1 Metric
        9.3.4 Directed information
        9.3.5 Normalized variants
        9.3.6 Weighted variants
        9.3.9 Linear correlation
    9.4 Applications
    9.5 See also
    9.6 Notes
    9.7 References

10 Cross entropy
    10.1 Motivation
    10.2 Estimation
    10.6 References
    10.8
        10.8.1 Text
        10.8.2 Images
Chapter 1
Information theory
Not to be confused with information science.
Information theory studies the quantification, storage, and communication of information. It was originally proposed by Claude E. Shannon in 1948 to find fundamental limits on signal processing and communication operations
such as data compression, in a landmark paper entitled "A Mathematical Theory of Communication". Now this theory
has found applications in many other areas, including statistical inference, natural language processing, cryptography,
neurobiology,[1] the evolution[2] and function[3] of molecular codes, model selection in ecology,[4] thermal physics,[5]
quantum computing, linguistics, plagiarism detection,[6] pattern recognition, and anomaly detection.[7]
A key measure in information theory is "entropy". Entropy quantifies the amount of uncertainty involved in the value
of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip
(with two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a
roll of a die (with six equally likely outcomes). Some other important measures in information theory are mutual
information, channel capacity, error exponents, and relative entropy.
Applications of fundamental topics of information theory include lossless data compression (e.g. ZIP files), lossy
data compression (e.g. MP3s and JPEGs), and channel coding (e.g. for Digital Subscriber Line (DSL)).
The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical
engineering. Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the
compact disc, the feasibility of mobile phones, the development of the Internet, the study of linguistics and of human
perception, the understanding of black holes, and numerous other fields. Important subfields of information theory
include source coding, channel coding, algorithmic complexity theory, algorithmic information theory, information-theoretic security, and measures of information.
1.1 Overview
Information theory studies the transmission, processing, utilization, and extraction of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a
noisy channel, this abstract concept was made concrete in 1948 by Claude Shannon in his paper "A Mathematical
Theory of Communication", in which information is thought of as a set of possible messages, where the goal is to
send these messages over a noisy channel, and then to have the receiver reconstruct the message with low probability
of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem, showed that, in the
limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity,
a quantity dependent merely on the statistics of the channel over which the messages are sent.[1]
Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half century or more: adaptive systems, anticipatory systems, artificial intelligence, complex systems, complexity science,
cybernetics, informatics, machine learning, along with systems sciences of many descriptions. Information theory is
a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of
coding theory.
Coding theory is concerned with finding explicit methods, called codes, for increasing the efficiency and reducing
the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly
subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case,
it took many years to find the methods Shannon's work proved were possible. A third class of information theory
codes are cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory
and information theory are widely used in cryptography and cryptanalysis. See the article ban (unit) for a historical
application.
Information theory is also used in information retrieval, intelligence gathering, gambling, statistics, and even in musical
composition.
1.3 Quantities of information

Information theory is based on probability theory and statistics. Information theory often concerns itself with measures of information of the distributions associated with random variables. Important quantities of information are
entropy, a measure of information in a single random variable, and mutual information, a measure of information in
common between two random variables. The former quantity is a property of the probability distribution of a random
variable and gives a limit on the rate at which data generated by independent samples with the given distribution can
be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum
rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics
are determined by the joint distribution.
The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A
common unit of information is the bit, based on the binary logarithm. Other units include the nat, which is based on
the natural logarithm, and the hartley, which is based on the common logarithm.
In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0.
This is justified because lim_{p→0+} p log p = 0 for any logarithmic base.
1.3.1 Entropy

Based on the probability mass function of each source symbol to be communicated, the Shannon entropy H, in units
of bits (per symbol), is given by

H = − ∑i pi log2(pi)
where pi is the probability of occurrence of the ith possible value of the source symbol. This equation gives the
entropy in the units of bits (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has
sometimes been called the "shannon" in his honor. Entropy is also commonly computed using the natural logarithm
(base e, where e is Euler's number), which produces a measurement of entropy in "nats" per symbol and sometimes
simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible,
but less commonly used. For example, a logarithm of base 2^8 = 256 will produce a measurement in bytes per symbol,
and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys) per symbol.
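Since only the base of the logarithm changes between these units, the entropies differ by constant factors. A minimal numerical check (the example distribution is mine, chosen for illustration):

```python
import math

# An arbitrary example distribution.
p = [0.5, 0.25, 0.25]

# The same entropy measured in three units: only the log base differs.
bits = -sum(x * math.log2(x) for x in p)       # shannons (base 2)
nats = -sum(x * math.log(x) for x in p)        # nats (base e)
hartleys = -sum(x * math.log10(x) for x in p)  # hartleys (base 10)

# Unit conversions are constant rescalings: H_nats = H_bits * ln 2, etc.
assert abs(nats - bits * math.log(2)) < 1e-12
assert abs(hartleys - bits * math.log10(2)) < 1e-12
print(bits)  # 1.5
```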
Intuitively, the entropy HX of a discrete random variable X is a measure of the amount of uncertainty associated with
the value of X when only its distribution is known.
The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid) is
NH bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the
entropy of a message of length N will be less than NH.
Suppose one transmits 1000 bits (0s and 1s). If the value of each of these bits is known to the receiver (has a
specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each
bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been
transmitted. Between these two extremes, information can be quantified as follows. If X is the set of all messages
{x1, …, xn} that X could be, and p(x) is the probability of some x ∈ X, then the entropy, H, of X is defined:[8]

H(X) = EX[I(x)] = − ∑x∈X p(x) log p(x)

(Here, I(x) is the self-information, which is the entropy contribution of an individual message, and EX is the expected
value.) A property of entropy is that it is maximized when all the messages in the message space are equiprobable,
p(x) = 1/n; i.e., most unpredictable, in which case H(X) = log n.
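The definition translates directly into a few lines of code. A sketch for illustration (the function name `entropy` is mine, not from the text), reproducing the coin and die examples from the overview:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x), with 0 log 0 = 0 by convention."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries 1 shannon (bit) per flip ...
print(entropy([0.5, 0.5]))   # 1.0
# ... while a fair six-sided die carries log2(6), about 2.585 bits per roll,
print(entropy([1 / 6] * 6))
# ... and a certain outcome carries no information at all.
assert entropy([1.0]) == 0
```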
The special case of information entropy for a random variable with two outcomes is the binary entropy function,
usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit:

Hb(p) = − p log2 p − (1 − p) log2(1 − p)

The entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function, Hb(p). The entropy
is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.
1.3.2 Joint entropy
The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This
implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.
For example, if (X, Y) represents the position of a chess piece, with X the row and Y the column, then the joint entropy
of the row of the piece and the column of the piece will be the entropy of the position of the piece:

H(X, Y) = − ∑x,y p(x, y) log2 p(x, y)
Despite similar notation, joint entropy should not be confused with cross entropy.
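The chess example can be made concrete: for a piece placed uniformly at random on an 8×8 board, row and column are independent, so the joint entropy is the sum of the marginal entropies. A small check (helper names are mine):

```python
import math
from itertools import product

def entropy(probs):
    # Shannon entropy in bits, using the 0 log 0 = 0 convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A piece placed uniformly at random on an 8x8 board: p(x, y) = 1/64.
joint = {(x, y): 1 / 64 for x, y in product(range(8), repeat=2)}

h_joint = entropy(joint.values())  # H(X, Y) = log2(64) = 6 bits
h_row = entropy([1 / 8] * 8)       # H(X) = log2(8) = 3 bits

# Row and column are independent, so H(X, Y) = H(X) + H(Y).
print(h_joint, h_row + h_row)  # 6.0 6.0
```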
1.3.3 Conditional entropy (equivocation)
The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation
of X about Y) is the average conditional entropy over Y:[9]
H(X|Y) = EY[H(X|y)] = − ∑y∈Y p(y) ∑x∈X p(x|y) log p(x|y) = − ∑x,y p(x, y) log ( p(x, y) / p(y) )
Because entropy can be conditioned on a random variable or on that random variable being a certain value, care
should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common
use. A basic property of this form of conditional entropy is that:

H(X|Y) = H(X, Y) − H(Y)
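The identity H(X|Y) = H(X, Y) − H(Y) can be checked numerically on any small joint distribution; a sketch under an arbitrary example distribution of my choosing:

```python
import math

def H(probs):
    # Shannon entropy in bits; terms with p = 0 are dropped (0 log 0 = 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary small joint distribution p(x, y), for illustration.
p = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

# Marginal distribution p(y).
p_y = {}
for (x, y), pr in p.items():
    p_y[y] = p_y.get(y, 0) + pr

# H(X|Y) computed directly from the definition ...
h_cond = -sum(pr * math.log2(pr / p_y[y]) for (x, y), pr in p.items())
# ... agrees with H(X, Y) - H(Y).
assert abs(h_cond - (H(p.values()) - H(p_y.values()))) < 1e-12
```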
1.3.4 Mutual information (transinformation)
Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information
shared between sent and received signals. The mutual information of X relative to Y is given by:

I(X; Y) = ∑x,y p(x, y) log ( p(x, y) / (p(x) p(y)) )
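Two limiting cases make the definition concrete: perfectly correlated variables share all of their entropy, while independent variables share none. A small sketch (the function name is mine):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over p(x,y) > 0 of p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly correlated bits share one full bit of information ...
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # 1.0
# ... while independent bits share none.
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))  # 0.0
```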
1.3.5 Kullback–Leibler divergence (information gain)
The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of
comparing two distributions: a true probability distribution p(X), and an arbitrary probability distribution q(X). If
we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the
correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary
for compression. It is thus defined

DKL(p(X) ‖ q(X)) = ∑x∈X p(x) log ( p(x) / q(x) )
Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and
does not satisfy the triangle inequality (making it a semi-quasimetric).

Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth:
suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice
knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more
surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of
Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to
which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
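The Alice-and-Bob picture is easy to check numerically: coding against a wrong prior always costs extra bits, and the cost is not symmetric. A sketch (the distributions p and q are my own illustrative choices):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) log2( p(x) / q(x) ), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # Alice's true distribution
q = [0.9, 0.1]  # Bob's mistaken prior

# Extra bits per datum paid for assuming q when the truth is p:
print(kl_divergence(p, q))
# The divergence is not symmetric, so it is not a true distance:
print(kl_divergence(q, p))
# And it vanishes only when the prior matches the truth:
assert kl_divergence(p, p) == 0
```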
1.3.6 Other quantities

Other important information theoretic quantities include Rényi entropy (a generalization of entropy), differential
entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information.
A picture showing scratches on the readable surface of a CD-R. Music and data CDs are coded using error correcting codes and
thus can still be read even if they have minor scratches, using error detection and correction.

1.4 Coding theory

Coding theory can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies
the number of bits needed to describe the data, which is the information entropy of the source.
Data compression (source coding): There are two formulations for the compression problem:

1. lossless data compression: the data must be reconstructed exactly;

2. lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured
by a distortion function. This subset of information theory is called rate-distortion theory.

Error-correcting codes (channel coding): While data compression removes as much redundancy as possible,
an error correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the
data efficiently and faithfully across a noisy channel.
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source-channel separation theorems, that justify the use of bits as the universal currency for information in
many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than
one receiver (the broadcast channel) or intermediary helpers (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. Network information theory refers to these multi-agent
communication models.
1.4.1 Source theory
Any process that generates successive messages can be considered a source of information. A memoryless source
is one in which each message is an independent identically distributed random variable, whereas the properties of
ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well
studied in their own right outside information theory.
Rate
Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each
symbol, while, in the case of a stationary stochastic process, it is

r = lim_{n→∞} H(Xn | Xn−1, Xn−2, …, X1);

that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a
process that is not necessarily stationary, the average rate is

r = lim_{n→∞} (1/n) H(X1, X2, …, Xn);

that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.[10]
It is common in information theory to speak of the rate or entropy of a language. This is appropriate, for example,
when the source of information is English prose. The rate of a source of information is related to its redundancy and
how well it can be compressed, the subject of source coding.
1.4.2 Channel capacity
Transmitter → Channel (noisy) → Receiver

The channel capacity is given by

C = max_f I(X; Y),

where the maximum is taken over all possible distributions f of the channel input.
This capacity has the following property related to communicating at information rate R (where R is usually bits per
symbol). For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N
and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always
possible to transmit with arbitrarily small block error. In addition, for any rate R > C, it is impossible to transmit with
arbitrarily small block error.

Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy
channel with a small coding error at a rate near the channel capacity.
A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. The
possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. The erasure represents complete
loss of information about an input bit. The capacity of the BEC is 1 − p bits per channel use.
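The BEC capacity claim can be checked numerically: compute I(X; Y) for a Bernoulli(q) input through the channel and maximize over q by grid search. A sketch (function and variable names are mine, for illustration):

```python
import math

def bec_mutual_information(q, e):
    """I(X;Y) for input X ~ Bernoulli(q) through a BEC with erasure probability e."""
    # Joint distribution over (input, output), with outputs 0, 1, and 'e'.
    joint = {(0, 0): (1 - q) * (1 - e), (0, 'e'): (1 - q) * e,
             (1, 1): q * (1 - e),       (1, 'e'): q * e}
    px = {0: 1 - q, 1: q}
    py = {}
    for (x, y), p in joint.items():
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

e = 0.25
# A grid search over input distributions recovers the capacity C = 1 - e,
# achieved by the uniform input q = 0.5.
best = max(bec_mutual_information(q / 1000, e) for q in range(1, 1000))
print(best)  # 0.75
```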
Information theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used
in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe.
Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the
plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.

Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force
attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key
algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods currently
comes from the assumption that no known attack can break them in a practical amount of time.

Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force
attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned
on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and
ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be
able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However,
as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure
methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of
key material.
1.5.2 Pseudorandom number generation
Pseudorandom number generators are widely available in computer language libraries and application programs.
They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern
computer equipment and software. A class of improved random number generators is termed cryptographically
secure pseudorandom number generators, but even they require random seeds external to the software to work as
intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors
is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating
randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random
variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptography uses.
1.5.3 Seismic exploration
One early commercial application of information theory was in the field of seismic oil exploration. Work in this field
made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and
digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.[11]
1.5.4 Semiotics
Concepts from information theory such as redundancy and code control have been used by semioticians such as
Umberto Eco and Ferruccio Rossi-Landi to explain ideology as a form of message transmission whereby a dominant
social class emits its message by using signs that exhibit a high degree of redundancy such that only one message is
decoded among a selection of competing ones.[12]
1.5.5 Miscellaneous applications
Information theory also has applications in gambling and investing, black holes, and bioinformatics.
1.6 See also
1.6.1 Applications
Active networking
Cryptanalysis
Cryptography
Cybernetics
Entropy in thermodynamics and information theory
Gambling
Intelligence (information gathering)
Seismic exploration
1.6.2 History
Hartley, R.V.L.
History of information theory
Shannon, C.E.
Timeline of information theory
Yockey, H.P.
1.6.3 Theory
Coding theory
Detection theory
Estimation theory
Fisher information
Information algebra
Information asymmetry
Information eld theory
Information geometry
1.6.4 Concepts
Ban (unit)
Channel capacity
Channel (communications)
Communication source
Conditional entropy
Covert channel
Decoder
Differential entropy
Encoder
Information entropy
Joint entropy
Kullback–Leibler divergence
Mutual information
Pointwise mutual information (PMI)
Receiver (information theory)
Redundancy
Rényi entropy
Self-information
Unicity distance
Variety
1.7 References
[1] F. Rieke; D. Warland; R Ruyter van Steveninck; W Bialek (1997). Spikes: Exploring the Neural Code. The MIT press.
ISBN 9780262681087.
[2] cf. Huelsenbeck, J. P., F. Ronquist, R. Nielsen and J. P. Bollback (2001) Bayesian inference of phylogeny and its impact
on evolutionary biology, Science 294:2310–2314
[3] Rando Allikmets, Wyeth W. Wasserman, Amy Hutchinson, Philip Smallwood, Jeremy Nathans, Peter K. Rogan, Thomas
D. Schneider, Michael Dean (1998) Organization of the ABCR gene: analysis of promoter and splice junction sequences,
Gene 215:1, 111–122
[4] Burnham, K. P. and Anderson D. R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic
Approach, Second Edition (Springer Science, New York) ISBN 9780387953649.
[5] Jaynes, E. T. (1957) Information Theory and Statistical Mechanics, Phys. Rev. 106:620
[6] Charles H. Bennett, Ming Li, and Bin Ma (2003) Chain Letters and Evolutionary Histories, Scientific American 288:6,
76–81
[7] David R. Anderson (November 1, 2003). "Some background on why people in the empirical sciences may want to better
understand the information-theoretic methods" (PDF). Retrieved 2010-06-23.
[8] Fazlollah M. Reza (1994) [1961]. An Introduction to Information Theory. Dover Publications, Inc., New York. ISBN
0486682102.
[9] Robert B. Ash (1990) [1965]. Information Theory. Dover Publications, Inc. ISBN 0486665216.
[10] Jerry D. Gibson (1998). Digital Compression for Multimedia: Principles and Standards. Morgan Kaufmann. ISBN 1558603697.
[11] The Corporation and Innovation, Haggerty, Patrick, Strategic Management Journal, Vol. 2, 97–118 (1981)
[12] Semiotics of Ideology, Noth, Winfried, Semiotica, Issue 148,(1981)
1.7.1 The classic work
Shannon, C.E. (1948), "A Mathematical Theory of Communication", Bell System Technical Journal, 27, pp. 379–423 & 623–656, July & October, 1948.
R.V.L. Hartley, "Transmission of Information", Bell System Technical Journal, July 1928
Andrey Kolmogorov (1968), "Three approaches to the quantitative definition of information", International Journal of Computer Mathematics.
1.7.2 Other journal articles
J. L. Kelly, Jr., "A New Interpretation of Information Rate", Bell System Technical Journal, Vol. 35, July 1956, pp. 917–26.
R. Landauer, "Information is Physical", Proc. Workshop on Physics and Computation PhysComp '92 (IEEE Comp. Sci. Press, Los Alamitos, 1993) pp. 1–4.
R. Landauer, "Irreversibility and Heat Generation in the Computing Process", IBM J. Res. Develop. Vol. 5, No. 3, 1961
Timme, Nicholas; Alford, Wesley; Flecker, Benjamin; Beggs, John M. (2012). "Multivariate information measures: an experimentalist's perspective". arXiv:1111.6857 [cs.IT].
1.7.3 Textbooks about information theory
Arndt, C. Information Measures, Information and its Description in Science and Engineering (Springer Series: Signals and Communication Technology), 2004. ISBN 9783540408550
Ash, RB. Information Theory. New York: Interscience, 1965. ISBN 0470034459. New York: Dover, 1990. ISBN 0486665216
Gallager, R. Information Theory and Reliable Communication. New York: John Wiley and Sons, 1968. ISBN 0471290483
Goldman, S. Information Theory. New York: Prentice Hall, 1953. New York: Dover, 1968, ISBN 0486622096; 2005, ISBN 0486442713
Cover, Thomas; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed.). New York: Wiley-Interscience. ISBN 0471241954.
Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Akadémiai Kiadó: 2nd edition, 1997. ISBN 9630574403
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0521642981
Mansuripur, M. Introduction to Information Theory. New York: Prentice Hall, 1987. ISBN 0134846680
McEliece, R. The Theory of Information and Coding. Cambridge, 2002. ISBN 9780521831857
Pierce, JR. An Introduction to Information Theory: Symbols, Signals and Noise. Dover (2nd edition), 1961 (reprinted by Dover 1980).
Reza, F. An Introduction to Information Theory. New York: McGraw-Hill, 1961. New York: Dover, 1994. ISBN 0486682102
Shannon, Claude; Weaver, Warren (1949). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press. ISBN 0252725484. LCCN 49-11922.
Stone, JV. Chapter 1 of Information Theory: A Tutorial Introduction, University of Sheffield, England, 2014. ISBN 9780956372857.
Yeung, RW. A First Course in Information Theory. Kluwer Academic/Plenum Publishers, 2002. ISBN 0306467917.
Yeung, RW. Information Theory and Network Coding. Springer, 2008. ISBN 9780387792330
1.7.4
Other books
Leon Brillouin, Science and Information Theory, Mineola, N.Y.: Dover, [1956, 1962] 2004. ISBN 0486439186
James Gleick, The Information: A History, a Theory, a Flood, New York: Pantheon, 2011. ISBN 9780375423727
A. I. Khinchin, Mathematical Foundations of Information Theory, New York: Dover, 1957. ISBN 0486604349
H. S. Leff and A. F. Rex, Editors, Maxwell's Demon: Entropy, Information, Computing, Princeton University Press, Princeton, New Jersey (1990). ISBN 069108727X
Robert K. Logan. What is Information? Propagating Organization in the Biosphere, the Symbolosphere, the Technosphere and the Econosphere, Toronto: DEMO Publishing.
Tom Siegfried, The Bit and the Pendulum, Wiley, 2000. ISBN 0471321745
Chapter 2
Self-information
In information theory, self-information or surprisal is a measure of the information content associated with an event in a probability space or with the value of a discrete random variable. It is expressed in a unit of information, for example bits, nats, or hartleys, depending on the base of the logarithm used in its calculation.
The term self-information is also sometimes used as a synonym of the related information-theoretic concept of entropy. These two meanings are not equivalent, and this article covers the first sense only.
2.1 Definition
By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero.
For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin: "Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning." For anyone not residing near the Earth's poles or polar circles, the amount of information conveyed in that forecast is zero, because it is known, in advance of receiving the forecast, that darkness always comes with the night.
When the content of a message is known a priori with certainty, i.e. with probability 1, there is no actual information conveyed in the message. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.
Accordingly, the amount of self-information contained in a message conveying the occurrence of an event ωn depends only on the probability of that event:

I(ωn) = f(P(ωn))

for some function f(·) to be determined below. If P(ωn) = 1, then I(ωn) = 0. If P(ωn) < 1, then I(ωn) > 0.
Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event C is the intersection of two independent events A and B, then the information of event C occurring is that of the compound message of both independent events A and B occurring. The quantity of information of the compound message C would be expected to equal the sum of the amounts of information of the individual component messages A and B respectively:
I(C) = I(A ∩ B) = I(A) + I(B)

Because of this additivity, the function f must satisfy

f(x · y) = f(x) + f(y)

The only class of continuous functions with this property is the logarithm of any base; the only operational difference between logarithms of different bases is a constant scaling factor:

f(x) = K log(x)
Since the probabilities of events are always between 0 and 1, and the information associated with these events must be nonnegative, this requires that K < 0.
Taking into account these properties, the self-information I(ωn) associated with outcome ωn of probability P(ωn) is defined as:

I(ωn) = −log(P(ωn)) = log(1 / P(ωn))
The smaller the probability of event ωn, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of I(ωn) is the bit. This is the most common practice. When using the natural logarithm (base e), the unit is the nat. For the base-10 logarithm, the unit of information is the hartley.
As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 bits (probability 1/16), and the information content associated with getting a result other than the one specified would be 0.09 bits (probability 15/16). See below for detailed examples.
This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term was coined by Myron Tribus in his 1961 book Thermostatics and
Thermodynamics.
The information entropy of a random event is the expected value of its self-information.
Self-information is an example of a proper scoring rule.
2.2 Examples
On tossing a coin, the chance of 'tail' is 0.5. When it is proclaimed that indeed 'tail' occurred, this amounts to
I('tail') = log2 (1/0.5) = log2 2 = 1 bit of information.
When throwing a fair die, the probability of 'four' is 1/6. When it is proclaimed that 'four' has been thrown,
the amount of selfinformation is
I('four') = log2 (1/(1/6)) = log2 (6) = 2.585 bits.
When, independently, two dice are thrown, the amount of information associated with {throw 1 = 'two' &
throw 2 = 'four'} equals
I('throw 1 is two & throw 2 is four') = log2 (1/P(throw 1 = 'two' & throw 2 = 'four')) = log2 (1/(1/36))
= log2 (36) = 5.170 bits.
This outcome equals the sum of the individual amounts of selfinformation associated with {throw 1 =
'two'} and {throw 2 = 'four'}; namely 2.585 + 2.585 = 5.170 bits.
In the same two-dice situation, we can also consider the information present in the statement "the sum of the two dice is five":

I('sum of throws 1 and 2 is five') = log2 (1/P('throws 1 and 2 sum to five')) = log2 (1/(4/36)) = log2 (9) = 3.17 bits.

The 4/36 arises because four of the 36 possible outcomes sum two dice to 5. This shows how more complex or ambiguous events can still carry information.
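The worked values in these examples can be reproduced with a short script. This is a minimal sketch; the helper name self_information is ours, not from the text:

```python
import math

def self_information(p, base=2):
    # I = -log_b(p): the information, in base-b units, of an event of probability p
    return -math.log(p, base)

# Fair coin: P('tail') = 1/2 -> 1 bit
print(round(self_information(1/2), 3))
# Fair die: P('four') = 1/6 -> 2.585 bits
print(round(self_information(1/6), 3))
# Two independent dice: P = 1/36; additivity gives 2.585 + 2.585
print(round(self_information(1/36), 3))
# Sum of two dice equals five: 4 favourable outcomes out of 36
print(round(self_information(4/36), 3))
```

Note that the two-dice value is exactly the sum of the two single-die values, as the additivity property requires.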
The expected value of the self-information over the N possible outcomes k of a random variable C is the information entropy:

I(C) = E(−log(P(C))) = −Σk=1..N P(k) log(P(k))
2.5 References
[1] Marina Meilă, "Comparing clusterings – an information based distance", Journal of Multivariate Analysis, Volume 98, Issue 5, May 2007
[2] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, p. 20, 1991.

C.E. Shannon, "A Mathematical Theory of Communication", Bell Syst. Techn. J., Vol. 27, pp. 379–423 (Part I), 1948.
Chapter 3

Entropy (information theory)

2 shannons of entropy: in the case of two fair coin tosses, the information entropy is the log-base-2 of the number of possible outcomes; with two coins there are four possible outcomes, and the entropy is two bits. Generally, information entropy is the average information of all possible outcomes.
In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. Messages can be modeled by any flow of information.
In a more technical sense, there are reasons (explained below) to define information as the negative of the logarithm of the probability distribution. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose expected value is the average amount of information, or entropy, generated by this distribution. Units of entropy are the shannon, nat, or hartley, depending on the base of the logarithm used to define it, though the shannon is commonly referred to as a bit.
The logarithm of the probability distribution is useful as a measure of entropy because it is additive for independent
sources. For instance, the entropy of a coin toss is 1 shannon, whereas of m tosses it is m shannons. Generally, you
need log2 (n) bits to represent a variable that can take one of n values if n is a power of 2. If these values are equally
probable, the entropy (in shannons) is equal to the number of bits. Equality between number of bits and shannons
holds only while all outcomes are equally probable. If one of the events is more probable than others, observation of that event is less informative. Conversely, rarer events provide more information when observed. Since observation of less probable events occurs more rarely, the net effect is that the entropy (thought of as average information) received from non-uniformly distributed data is less than log2 (n). Entropy is zero when one outcome is certain.
Shannon entropy quantifies all these considerations exactly when a probability distribution of the source is known.
The meaning of the events observed (the meaning of messages) does not matter in the definition of entropy. Entropy takes into account only the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.
Generally, entropy refers to disorder or uncertainty. Shannon entropy was introduced by Claude E. Shannon in his
1948 paper "A Mathematical Theory of Communication".[1] Shannon entropy provides an absolute limit on the best
possible average length of lossless encoding or compression of an information source. Rényi entropy generalizes
Shannon entropy.
3.1 Introduction
Entropy is a measure of the unpredictability of information content. To get an intuitive understanding of these terms, consider the example of a political poll. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well, and the results should not contain much new information; in this case the entropy of the second poll result is small relative to the first.
Now consider the example of a coin toss. Assuming the probability of heads is the same as the probability of tails,
then the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome of
the coin toss ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be
correct with probability 1/2. Such a coin toss has one bit of entropy since there are two possible outcomes that occur
with equal probability, and learning the actual outcome contains one bit of information. Contrarily, a coin toss with a coin that has two heads and no tails has zero entropy, since the coin will always come up heads and the outcome can be predicted perfectly. Analogously, one binary digit with equiprobable values has log2 2 = 1 shannon (one bit) of entropy because it can take one of two values (1 and 0). Similarly, one trit with equiprobable values contains log2 3 (about 1.58496) bits of information because it can take one of three values.
English text, treated as a string of characters, has fairly low entropy, i.e., is fairly predictable. Even if we do not
know exactly what is going to come next, we can be fairly certain that, for example, there will be many more e's than z's, that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of
the word. English text has between 0.6 and 1.3 bits of entropy for each character of message.[2][3]
If a compression scheme is lossless (that is, you can always recover the entire original message by decompressing), then a compressed message has the same quantity of information as the original but is communicated in fewer characters. That is, it has more information, or a higher entropy, per character. This means a compressed message has less redundancy. Roughly speaking, Shannon's source coding theorem says that a lossless compression scheme cannot compress messages, on average, to have more than one bit of information per bit of message, but that any value less than one bit of information per bit of message can be attained by employing a suitable coding scheme. The entropy of a message per bit multiplied by the length of that message is a measure of how much total information the message contains.
Intuitively, imagine that we wish to transmit sequences composed of the 4 characters A, B, C, and D. Thus, a message to be transmitted might be 'ABADDCAB'. Information theory gives a way of calculating the smallest possible amount of
information that will convey this. If all 4 letters are equally likely (25%), we can do no better (over a binary channel)
than to have 2 bits encode (in binary) each letter: A might code as '00', B as '01', C as '10', and D as '11'. Now
suppose A occurs with 70% probability, B with 26%, and C and D with 2% each. We could assign variable-length codes, so that receiving a '1' tells us to look at another bit unless we have already received two sequential 1s. In this case, A would be coded as '0' (one bit), B as '10', and C and D as '110' and '111'. It is easy to see that 70% of
the time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, then, fewer than 2 bits are required, since the entropy is lower (owing to the high prevalence of A and B, which together account for 96% of characters). The sum of the probability-weighted log probabilities captures this effect.
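The arithmetic above can be checked directly. A minimal sketch, with the probabilities and code words taken from the example (variable names are ours):

```python
from math import log2

# Symbol probabilities and the variable-length code from the example above
p = {'A': 0.70, 'B': 0.26, 'C': 0.02, 'D': 0.02}
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

# Average code length in bits per symbol: 0.7*1 + 0.26*2 + 0.04*3 = 1.34
avg_len = sum(p[s] * len(code[s]) for s in p)

# Shannon entropy: the lower bound on the average length of any lossless code
entropy = -sum(q * log2(q) for q in p.values())

print(round(avg_len, 2))   # 1.34 bits/symbol, below the 2 bits of the fixed code
print(round(entropy, 2))   # 1.09 bits/symbol
```

The average length of the variable-length code (1.34 bits) sits between the entropy bound and the 2 bits per symbol of the fixed-length code, as the text describes.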
Shannon's theorem also implies that no lossless compression scheme can shorten all messages. If some messages come out shorter, at least one must come out longer, due to the pigeonhole principle. In practical use, this is generally not a problem, because we are usually only interested in compressing certain types of messages, for example English documents as opposed to gibberish text, or digital photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or uninteresting sequences larger. However, the problem can still arise even in everyday use when applying a compression algorithm to already compressed data: for example, making a ZIP file of music, pictures or videos that are already in a compressed format such as FLAC, MP3, WebM, AAC, PNG or JPEG will generally result in a ZIP file that is slightly larger than the source file(s).
3.2 Definition

Named after Boltzmann's Η-theorem, Shannon defined the entropy H (Greek capital letter eta) of a discrete random variable X with possible values {x1 , …, xn} and probability mass function P(X) as:

H(X) = Σi=1..n P(xi) I(xi) = −Σi=1..n P(xi) logb P(xi)
where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of entropy is the shannon for b = 2, the nat for b = e, and the hartley for b = 10.[6] When b = 2, the units of entropy are also commonly referred to as bits.
In the case of P(xi) = 0 for some i, the value of the corresponding summand 0 logb (0) is taken to be 0, which is consistent with the limit:

lim(p→0+) p log(p) = 0.
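The definition, including the 0 logb (0) = 0 convention, can be sketched in code (the function name entropy is ours):

```python
from math import log

def entropy(probs, base=2):
    # H(X) = -sum_i P(x_i) log_b P(x_i); zero-probability terms are skipped,
    # implementing the convention 0 * log_b(0) = 0
    return sum(-p * log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 shannon for a fair coin
print(entropy([1.0, 0.0]))           # 0.0: a certain outcome carries no information
print(round(entropy([1/6] * 6), 3))  # 2.585 for a fair die
```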
One may also define the conditional entropy of two random variables X and Y, taking values xi and yj respectively, as

H(X|Y) = Σi,j p(xi , yj) log ( p(yj) / p(xi , yj) )

where p(xi , yj) is the probability that X = xi and Y = yj. This quantity should be understood as the amount of randomness remaining in the random variable X given the random variable Y.
3.3 Example
Main article: Binary entropy function
Main article: Bernoulli process
Consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails; this can be modelled
as a Bernoulli process.
The entropy of the unknown result of the next toss of the coin is maximized if the coin is fair (that is, if heads and tails both have equal probability 1/2). This is the situation of maximum uncertainty, as it is most difficult to predict the outcome of the next toss; the result of each toss of the coin delivers one full bit of information.
Entropy H(X) (i.e. the expected surprisal) of a coin flip, measured in shannons, graphed versus the fairness of the coin Pr(X = 1), where X = 1 represents a result of heads.
Note that the maximum of the graph depends on the distribution. Here, the entropy is at most 1 shannon, and communicating the outcome of a fair coin flip (2 possible values) requires an average of at most 1 bit. The result of a fair die (6 possible values) would require on average log2 6 bits.
However, if we know the coin is not fair, but comes up heads or tails with probabilities p and q, where p ≠ q, then there is less uncertainty. Every time it is tossed, one side is more likely to come up than the other. The reduced uncertainty is quantified in a lower entropy: on average each toss of the coin delivers less than one full bit of information.
The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin that never results in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no new information, as the outcome of each coin toss is always certain. In this respect, entropy can be normalized by dividing it by information length. This ratio is called metric entropy and is a measure of the randomness of the information.
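The fair, biased, and double-sided cases can be checked with the binary entropy function (a sketch; the function name is ours):

```python
from math import log2

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p): entropy of a coin with P(heads) = p
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty: the same side always comes up
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))            # 1.0: a fair coin maximizes entropy
print(round(binary_entropy(0.9), 3))  # 0.469: a biased coin is more predictable
print(binary_entropy(1.0))            # 0.0: a double-headed coin
```

This reproduces the graph described above: a maximum of 1 shannon at p = 1/2, falling to zero at either extreme.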
3.4 Rationale
To understand the meaning of pi log(1/pi), first try to define an information function I in terms of an event i with probability pi. How much information is acquired due to the observation of event i? Shannon's solution follows from the fundamental properties of information:[7]

I(p) ≥ 0: information is a non-negative quantity.
I(1) = 0: events that always occur do not communicate information.
I(p1 p2) = I(p1) + I(p2): information due to independent events is additive.

The last is a crucial property. It states that joint probability communicates as much information as two individual
events separately. Particularly, if the first event can yield one of n equiprobable outcomes and another has one of m equiprobable outcomes, then there are mn possible outcomes of the joint event. This means that if log2 (n) bits are needed to encode the first value and log2 (m) to encode the second, one needs log2 (mn) = log2 (m) + log2 (n) to encode both. Shannon discovered that the proper choice of function to quantify information, preserving this additivity, is logarithmic, i.e.,
I(p) = log(1/p)
The base of the logarithm can be any fixed real number greater than 1. The different units of information (bits for log2 , trits for log3 , nats for the natural logarithm ln, and so on) are just constant multiples of each other. (In contrast, the entropy would be negative if the base of the logarithm were less than 1.) For instance, in the case of a fair coin toss, heads provides log2 (2) = 1 bit of information, which is approximately 0.693 nats or 0.631 trits. Because of additivity, n tosses provide n bits of information, which is approximately 0.693n nats or 0.631n trits.
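The unit conversions quoted here are easy to verify (a minimal sketch; variable names are ours):

```python
from math import log

# Information of a fair coin toss, log(1/p) with p = 1/2, in three units
bits  = log(2, 2)   # base-2 logarithm  -> 1.0 bit
nats  = log(2)      # natural logarithm -> about 0.693 nats
trits = log(2, 3)   # base-3 logarithm  -> about 0.631 trits

print(bits, round(nats, 3), round(trits, 3))
```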
Now, suppose we have a distribution where event i can happen with probability pi. Suppose we have sampled it N
times and outcome i was, accordingly, seen ni = N pi times. The total amount of information we have received is
Σi ni I(pi) = Σi N pi log(1/pi)

The average amount of information that we receive with every event is therefore

Σi pi log(1/pi)
3.5 Aspects
3.5.1 Relationship to thermodynamic entropy

In statistical thermodynamics, the most general formula for the thermodynamic entropy S of a system is the Gibbs entropy,

S = −kB Σi pi ln pi

where kB is the Boltzmann constant, and pi is the probability of a microstate. The Gibbs entropy was defined by J. Willard Gibbs in 1878 after earlier work by Boltzmann (1872).[8]
The Gibbs entropy translates over almost unchanged into the world of quantum physics to give the von Neumann
entropy, introduced by John von Neumann in 1927,
S = −kB Tr(ρ ln ρ)

where ρ is the density matrix of the quantum mechanical system and Tr is the trace.
At an everyday practical level the links between information entropy and thermodynamic entropy are not evident.
Physicists and chemists are apt to be more interested in changes in entropy as a system spontaneously evolves away
from its initial conditions, in accordance with the second law of thermodynamics, rather than in an unchanging probability distribution. And, as the minuteness of Boltzmann's constant kB indicates, the changes in S / kB for even tiny amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in data compression or signal processing. Furthermore, in classical thermodynamics the entropy is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which is central to the definition of information entropy.
The connection between thermodynamics and what is now known as information theory was first made by Ludwig Boltzmann and expressed by his famous equation:
S = kB ln(W )
where S is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as temperature, volume, energy, etc.), W is the number of microstates (various combinations of particles in various
energy states) that can yield the given macrostate, and kB is Boltzmann's constant. It is assumed that each microstate is equally likely, so that the probability of a given microstate is pi = 1/W. When these probabilities are substituted into the above expression for the Gibbs entropy (or equivalently kB times the Shannon entropy), Boltzmann's equation
results. In information theoretic terms, the information entropy of a system is the amount of missing information
needed to determine a microstate, given the macrostate.
In the view of Jaynes (1957), thermodynamic entropy, as explained by statistical mechanics, should be seen as an
application of Shannon's information theory: the thermodynamic entropy is interpreted as being proportional to the
amount of further Shannon information needed to dene the detailed microscopic state of the system, that remains
uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with
the constant of proportionality being just the Boltzmann constant. For example, adding heat to a system increases
its thermodynamic entropy because it increases the number of possible microscopic states of the system that are
consistent with the measurable values of its macroscopic variables, thus making any complete state description longer.
(See article: maximum entropy thermodynamics.) Maxwell's demon can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as Landauer (from 1961) and co-workers have shown, to function the demon himself must increase thermodynamic entropy in the process, by at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic entropy does not decrease (which resolves the paradox). Landauer's principle imposes a lower bound on the amount of heat a computer must generate to process a given amount of information, though modern computers are far less efficient.
3.5.2 Entropy as information content
Shannon's definition of entropy, when applied to an information source, can determine the minimum channel capacity required to reliably transmit the source as encoded binary digits (but see the caveat on limitations below). The formula can be derived by calculating the mathematical expectation of the amount of information contained in a digit from the information source. See also the Shannon–Hartley theorem.
Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is
determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties
relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain.
3.5.3
3.5.4
Data compression
3.5.5 World's technological capacity to store and communicate information
A 2011 study in Science estimates the world's technological capacity to store and communicate optimally compressed information, normalized on the most effective compression algorithms available in the year 2007, therefore estimating the entropy of the technologically available sources.[10]
The authors estimate humankind's technological capacity to store information (fully entropically compressed) in 1986 and again in 2007. They break the information into three categories: information stored on a medium, information received through one-way broadcast networks, and information exchanged through two-way telecommunication networks.[10]
3.5.6 Limitations of entropy as information content
There are a number of entropy-related concepts that mathematically quantify information content in some way:
the self-information of an individual message or symbol taken from a given probability distribution,
the entropy of a given probability distribution of messages or symbols, and
the entropy rate of a stochastic process.
(The rate of self-information can also be defined for a particular sequence of messages or symbols generated by a given stochastic process: this will always be equal to the entropy rate in the case of a stationary process.) Other quantities of information are also used to compare or relate different sources of information.
It is important not to confuse the above concepts. Often it is only clear from context which one is meant. For example,
when someone says that the entropy of the English language is about 1 bit per character, they are actually modeling
the English language as a stochastic process and talking about its entropy rate. Shannon himself used the term in this
way.[3]
Although entropy is often used as a characterization of the information content of a data source, this information
content is not absolute: it depends crucially on the probabilistic model. A source that always generates the same symbol has an entropy rate of 0, but the definition of what a symbol is depends on the alphabet. Consider a source that produces the string ABABABABAB, in which A is always followed by B and vice versa. If the probabilistic
model considers individual letters as independent, the entropy rate of the sequence is 1 bit per character. But if the sequence is considered as "AB AB AB AB AB", with symbols as two-character blocks, then the entropy rate is 0 bits per character.
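The two models can be compared numerically (a sketch; the helper empirical_entropy is ours):

```python
from collections import Counter
from math import log2

def empirical_entropy(symbols):
    # Per-symbol entropy of the empirical distribution of the given symbols
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * log2(n / c) for c in counts.values())

text = "ABABABABAB"
# Model 1: independent single letters, A and B each with frequency 1/2
print(empirical_entropy(list(text)))                                     # 1.0
# Model 2: two-character blocks, a single repeating symbol "AB"
print(empirical_entropy([text[i:i+2] for i in range(0, len(text), 2)]))  # 0.0
```

The same string yields 1 bit per character under one model and 0 bits per block under the other, illustrating that entropy is a property of the model, not of the string alone.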
However, if we use very large blocks, then the estimate of per-character entropy rate may become artificially low.
This is because in reality, the probability distribution of the sequence is not knowable exactly; it is only an estimate.
For example, suppose one considers the text of every book ever published as a sequence, with each symbol being
the text of a complete book. If there are N published books, and each book is only published once, the estimate of the probability of each book is 1/N, and the entropy (in bits) is −log2 (1/N) = log2 (N). As a practical code, this
corresponds to assigning each book a unique identifier and using it in place of the text of the book whenever one wants to refer to the book. This is enormously useful for talking about books, but it is not so useful for characterizing the information content of an individual book, or of language in general: it is not possible to reconstruct the book from its identifier without knowing the probability distribution, that is, the complete text of all the books. The key
idea is that the complexity of the probabilistic model must be considered. Kolmogorov complexity is a theoretical
generalization of this idea that allows the consideration of the information content of a sequence independent of any
particular probability model; it considers the shortest program for a universal computer that outputs the sequence. A
code that achieves the entropy rate of a sequence for a given model, plus the codebook (i.e. the probabilistic model),
is one such program, but it may not be the shortest.
For example, the Fibonacci sequence is 1, 1, 2, 3, 5, 8, 13, …. Treating the sequence as a message and each number as a symbol, there are almost as many symbols as there are characters in the message, giving an entropy of approximately log2 (n). So the first 128 symbols of the Fibonacci sequence have an entropy of approximately 7 bits/symbol. However, the sequence can be expressed using a formula [F(n) = F(n−1) + F(n−2) for n = 3, 4, 5, …, with F(1) = 1, F(2) = 1], and this formula has a much lower entropy and applies to any length of the Fibonacci sequence.
3.5.7 Limitations of entropy in cryptography
In cryptanalysis, entropy is often used as a rough measure of the unpredictability of a cryptographic key. For example, a 128-bit key that is uniformly randomly generated has 128 bits of entropy. It also takes (on average) 2^127 guesses to break by brute force. However, entropy fails to capture the number of guesses required if the possible keys are not chosen uniformly.[11][12] Instead, a measure called guesswork can be used to measure the effort required for a brute-force attack.[13]
Other problems may arise from non-uniform distributions used in cryptography. For example, consider a
1,000,000-digit binary one-time pad using exclusive or. If the pad has 1,000,000 bits of entropy, it is perfect. If the pad has
999,999 bits of entropy, evenly distributed (each individual bit of the pad having 0.999999 bits of entropy), it may
provide good security. But if the pad has 999,999 bits of entropy, where the first bit is fixed and the remaining 999,999
bits are perfectly random, then the first bit of the ciphertext will not be encrypted at all.
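A toy sketch of that leak, with 1,000 bits standing in for the 1,000,000 in the text (the helper names are illustrative, not standard):

```python
import secrets

def xor_pad(message_bits, pad_bits):
    """One-time pad: ciphertext = message XOR pad, bit by bit."""
    return [m ^ p for m, p in zip(message_bits, pad_bits)]

n = 1000  # stand-in for the 1,000,000 bits discussed in the text
message = [secrets.randbelow(2) for _ in range(n)]

# Flawed pad: first bit fixed to 0, remaining n - 1 bits perfectly random.
pad = [0] + [secrets.randbelow(2) for _ in range(n - 1)]

cipher = xor_pad(message, pad)
# Because pad[0] is fixed, cipher[0] reveals message[0] exactly.
assert cipher[0] == message[0]
```

The pad is one bit short of full entropy, yet the damage is concentrated: one plaintext bit is transmitted in the clear rather than the whole message being slightly weakened.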
3.5.8 Data as a Markov process
A common way to define entropy for text is based on the Markov model of text. For an order-0 source (each character
is selected independent of the last characters), the binary entropy is:

H(S) = −Σ_i pi log2 pi ,

where pi is the probability of i. For a first-order Markov source (one in which the probability of selecting a character
is dependent only on the immediately preceding character), the entropy rate is:

H(S) = −Σ_i pi Σ_j pi(j) log2 pi(j) ,

where i is a state (certain preceding characters) and pi(j) is the probability of j given i as the previous character.
For a second-order Markov source, the entropy rate is

H(S) = −Σ_i pi Σ_j pi(j) Σ_k pi,j(k) log2 pi,j(k) .

3.5.9 b-ary entropy
In general the b-ary entropy of a source S = (S, P) with source alphabet S = {a1, …, an} and discrete probability
distribution P = {p1, …, pn}, where pi is the probability of ai (say pi = p(ai)), is defined by:

Hb(S) = −Σ_{i=1}^{n} pi logb pi ,

Note: the b in "b-ary entropy" is the number of different symbols of the ideal alphabet used as a standard yardstick to
measure source alphabets. In information theory, two symbols are necessary and sufficient for an alphabet to encode
information. Therefore, the default is to let b = 2 (binary entropy). Thus, the entropy of the source alphabet, with its
given empiric probability distribution, is a number equal to the number (possibly fractional) of symbols of the ideal
alphabet, with an optimal probability distribution, necessary to encode for each symbol of the source alphabet. Also
note that "optimal probability distribution" here means a uniform distribution: a source alphabet with n symbols has
the highest possible entropy (for an alphabet with n symbols) when the probability distribution of the alphabet is
uniform. This optimal entropy turns out to be logb(n).
3.6 Efficiency

A source alphabet with non-uniform distribution will have less entropy than if those symbols had uniform distribution
(i.e. the "optimized alphabet"). This deficiency in entropy can be expressed as a ratio called efficiency:

η(X) = H(X) / logb(n) = ( −Σ_{i=1}^{n} pi logb pi ) / logb(n)

Efficiency has utility in quantifying the effective use of a communications channel. This formulation is also referred
to as the normalized entropy, as the entropy is divided by the maximum entropy logb(n).
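The b-ary entropy and the efficiency ratio can be sketched in a few lines of Python (the example distributions are illustrative):

```python
import math

def b_ary_entropy(probs, b=2):
    """H_b(S) = -sum p_i log_b p_i for a discrete source distribution."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

def efficiency(probs, b=2):
    """Normalized entropy: H_b divided by its maximum, log_b(n)."""
    n = len(probs)
    return b_ary_entropy(probs, b) / math.log(n, b)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(efficiency(uniform))  # ≈ 1.0: a uniform alphabet achieves maximal entropy
print(efficiency(skewed))   # < 1.0: a non-uniform distribution is deficient in entropy
```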
3.7 Characterization

Shannon entropy is characterized by a small number of criteria, listed below. Any definition of entropy satisfying
these assumptions has the form

−K Σ_{i=1}^{n} pi log(pi) ,

where K is a constant corresponding to a choice of measurement units.
3.7.1
Continuity
The measure should be continuous, so that changing the values of the probabilities by a very small amount should
only change the entropy by a small amount.
3.7.2 Symmetry

The measure should be unchanged if the outcomes are reordered:

Hn(p1, p2, …) = Hn(p2, p1, …)
3.7.3
Maximum
The measure should be maximal if all the outcomes are equally likely (uncertainty is highest when all possible events
are equiprobable).
Hn(p1, …, pn) ≤ Hn(1/n, …, 1/n) = logb(n).
For equiprobable events the entropy should increase with the number of outcomes.
Hn(1/n, …, 1/n) = logb(n) < logb(n + 1) = Hn+1(1/(n+1), …, 1/(n+1)).

3.7.4 Additivity
The amount of entropy should be independent of how the process is regarded as being divided into parts.
This last functional relationship characterizes the entropy of a system with subsystems. It demands that the entropy
of a system can be calculated from the entropies of its subsystems if the interactions between the subsystems are
known.
Given an ensemble of n uniformly distributed elements that are divided into k boxes (subsystems) with b1, …, bk
elements each, the entropy of the whole ensemble should be equal to the sum of the entropy of the system of boxes
and the individual entropies of the boxes, each weighted with the probability of being in that particular box.
For positive integers bi where b1 + ⋯ + bk = n,
Hn(1/n, …, 1/n) = Hk(b1/n, …, bk/n) + Σ_{i=1}^{k} (bi/n) Hbi(1/bi, …, 1/bi).
Choosing k = n, b1 = ⋯ = bn = 1, this implies that the entropy of a certain outcome is zero: H1(1) = 0. This implies
that the efficiency of a source alphabet with n symbols can be defined simply as being equal to its n-ary entropy. See
also Redundancy (information theory).
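The additivity property can be verified numerically; here is a minimal sketch with an assumed partition of n = 12 uniformly distributed elements into boxes of 3, 4 and 5:

```python
import math

def H(probs, b=2):
    """Shannon entropy in base b of a discrete distribution."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

n = 12
boxes = [3, 4, 5]  # b1 + b2 + b3 = n

# Entropy of the whole uniform ensemble: log2(n).
lhs = H([1 / n] * n)
# Entropy of the system of boxes plus the weighted within-box entropies.
rhs = H([b / n for b in boxes]) + sum((b / n) * H([1 / b] * b) for b in boxes)
assert math.isclose(lhs, rhs)
```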
It can be confirmed using the Jensen inequality that

H(X) = E[logb(1/p(X))] ≤ logb(E[1/p(X)]) = logb(n).
This maximal entropy of logb(n) is effectively attained by a source alphabet having a uniform probability
distribution: uncertainty is maximal when all possible events are equiprobable.
The entropy or the amount of information revealed by evaluating (X, Y) (that is, evaluating X and Y
simultaneously) is equal to the information revealed by conducting two consecutive experiments: first evaluating the
value of Y, then revealing the value of X given that you know the value of Y. This may be written as

H(X, Y) = H(Y) + H(X|Y).
The entropy of two simultaneous events is no more than the sum of the entropies of each individual event, and
they are equal if the two events are independent. More specifically, if X and Y are two random variables on the
same probability space, and (X, Y) denotes their Cartesian product, then

H(X, Y) ≤ H(X) + H(Y).
Differential entropy

h[f] = −∫ f(x) log f(x) dx

This formula is usually referred to as the continuous entropy, or differential entropy. A precursor of the continuous
entropy h[f] is the expression for the functional H in the H-theorem of Boltzmann.
Although the analogy between both functions is suggestive, the following question must be set: is the differential
entropy a valid extension of the Shannon discrete entropy? Differential entropy lacks a number of properties that the
Shannon discrete entropy has (it can even be negative) and thus corrections have been suggested, notably the limiting
density of discrete points.
To answer this question, we must establish a connection between the two functions:
We wish to obtain a generally finite measure as the bin size goes to zero. In the discrete case, the bin size is the
(implicit) width of each of the n (finite or infinite) bins whose probabilities are denoted by pn. As we generalize to
the continuous domain, we must make this width explicit.
To do this, start with a continuous function f discretized into bins of size Δ. By the mean-value theorem there exists
a value xi in each bin such that

f(xi) Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx

and thus the integral of the function f can be approximated (in the Riemannian sense) by

∫_{−∞}^{∞} f(x) dx = lim_{Δ→0} Σ_{i=−∞}^{∞} f(xi) Δ ,

where this limit and "bin size goes to zero" are equivalent.

We will denote

H^Δ := −Σ_{i=−∞}^{∞} f(xi) Δ log( f(xi) Δ )

and, expanding the logarithm,

H^Δ = −Σ_{i=−∞}^{∞} f(xi) Δ log( f(xi) ) − Σ_{i=−∞}^{∞} f(xi) Δ log(Δ).

As Δ → 0, we have

Σ_{i=−∞}^{∞} f(xi) Δ → ∫ f(x) dx = 1
Σ_{i=−∞}^{∞} f(xi) Δ log( f(xi) ) → ∫ f(x) log f(x) dx.

But note that log(Δ) → −∞ as Δ → 0, therefore we need a special definition of the differential or continuous entropy:

h[f] = lim_{Δ→0} ( H^Δ + log Δ ) = −∫ f(x) log f(x) dx ,
which is, as said before, referred to as the differential entropy. This means that the differential entropy is not a limit
of the Shannon entropy for n → ∞. Rather, it differs from the limit of the Shannon entropy by an infinite offset (see
also the article on information dimension).
3.9.2 Limiting density of discrete points

A modification of differential entropy that adds an invariant measure m(x) gives the expression

H = −∫ p(x) log( p(x) / m(x) ) dx ,
and the result will be the same for any choice of units for x. In fact, the limit of discrete entropy as N → ∞ would
also include a term of log(N), which would in general be infinite. This is expected: continuous variables would
typically have infinite entropy when discretized. The limiting density of discrete points is really a measure of how
much easier a distribution is to describe than a distribution that is uniform over its quantization scheme.
3.9.3
Relative entropy

DKL(p‖m) = ∫ log( dp/dm ) dp ,

where dp/dm is the Radon–Nikodym derivative of p with respect to the reference measure m.
In this form the relative entropy generalises (up to change in sign) both the discrete entropy, where the measure m is
the counting measure, and the differential entropy, where the measure m is the Lebesgue measure. If the measure m
is itself a probability distribution, the relative entropy is non-negative, and zero if p = m as measures. It is defined for
any measure space, hence coordinate-independent and invariant under coordinate reparameterizations if one properly
takes into account the transformation of the measure m. The relative entropy, and implicitly entropy and differential
entropy, do depend on the reference measure m.
3.10.1 Loomis–Whitney inequality
A simple example of this is an alternative proof of the Loomis–Whitney inequality: for every subset A ⊆ Z^d, we have

|A|^{d−1} ≤ Π_{i=1}^{d} |Pi(A)| ,

where Pi is the orthogonal projection in the ith coordinate. The proof follows as a corollary of Shearer's inequality:
if X1, …, Xd are random variables and S1, …, Sn are subsets of {1, …, d} such that every integer between 1 and d
lies in exactly r of these subsets, then

H[(X1, …, Xd)] ≤ (1/r) Σ_{i=1}^{n} H[(Xj)_{j∈Si}] ,

where (Xj)_{j∈Si} is the Cartesian product of random variables Xj with indexes j in Si (so the dimension of this vector
is equal to the size of Si).
We sketch how Loomis–Whitney follows from this: Indeed, let X be a uniformly distributed random variable with
values in A, so that each point in A occurs with equal probability. Then (by the further properties of entropy
mentioned above) H(X) = log|A|, where |A| denotes the cardinality of A. Let Si = {1, 2, …, i−1, i+1, …, d}. The
range of (Xj)_{j∈Si} is contained in Pi(A) and hence H[(Xj)_{j∈Si}] ≤ log|Pi(A)|. Now use this to bound the right
side of Shearer's inequality and exponentiate the opposite sides of the resulting inequality you obtain.
3.10.2 Approximation to binomial coefficient

For integers 0 < k < n let q = k/n. Then

2^{nH(q)} / (n + 1) ≤ C(n, k) ≤ 2^{nH(q)} ,

where H(q) = −q log2(q) − (1 − q) log2(1 − q) is the binary entropy. Here is a sketch proof: note that
C(n, k) q^{qn} (1 − q)^{n−qn} is one term of the expression

Σ_{i=0}^{n} C(n, i) q^{i} (1 − q)^{n−i} = (q + (1 − q))^{n} = 1.

Rearranging gives the upper bound. For the lower bound one first shows, using some algebra, that it is the largest
term in the summation. But then

C(n, k) q^{qn} (1 − q)^{n−qn} ≥ 1/(n + 1) ,

since there are n + 1 terms in the summation. Rearranging gives the lower bound.
A nice interpretation of this is that the number of binary strings of length n with exactly k many 1s is approximately
2^{nH(k/n)}.[15]
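A quick numerical check of these bounds (n = 100 and k = 30 are arbitrary choices):

```python
import math

def H2(q):
    """Binary entropy in bits: H(q) = -q log2 q - (1 - q) log2 (1 - q)."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

n, k = 100, 30
q = k / n
upper = 2 ** (n * H2(q))   # 2^{n H(k/n)}
lower = upper / (n + 1)    # 2^{n H(k/n)} / (n + 1)
c = math.comb(n, k)
assert lower <= c <= upper  # C(n, k) lies within a factor of n + 1 of 2^{n H(k/n)}
```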
3.12 References
[1] Shannon, Claude E. (July–October 1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (3): 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.
[2] Schneier, B: Applied Cryptography, Second edition, page 234. John Wiley and Sons.
[3] Shannon, C. E. (January 1951). "Prediction and Entropy of Printed English" (PDF). Bell System Technical Journal. 30 (1): 50–64. doi:10.1002/j.1538-7305.1951.tb01366.x. Retrieved 30 March 2014.
[4] Borda, Monica (2011). Fundamentals in Information Theory and Coding. Springer. p. 11. ISBN 978-3-642-20346-6.
[5] Han, Te Sun & Kobayashi, Kingo (2002). Mathematics of Information and Coding. American Mathematical Society. pp. 19–20. ISBN 978-0-8218-4256-0.
[6] Schneider, T.D., "Information theory primer with an appendix on logarithms", National Cancer Institute, 14 April 2007.
[7] Carter, Tom (March 2014). An introduction to information theory and entropy (PDF). Santa Fe. Retrieved August 2014.
[8] Compare: Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie: 2 Volumes. Leipzig 1895/98. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover. ISBN 0-486-68455-5.
[9] Mark Nelson (24 August 2006). "The Hutter Prize". Retrieved 2008-11-27.
[10] "The World's Technological Capacity to Store, Communicate, and Compute Information", Martin Hilbert and Priscila López (2011), Science, 332(6025), 60–65; free access to the article through martinhilbert.net/WorldInfoCapacity.html
[11] Massey, James (1994). "Guessing and Entropy" (PDF). Proc. IEEE International Symposium on Information Theory. Retrieved December 31, 2013.
[12] Malone, David; Sullivan, Wayne (2005). "Guesswork is not a Substitute for Entropy" (PDF). Proceedings of the Information Technology & Telecommunications Conference. Retrieved December 31, 2013.
[13] Pliam, John (1999). "Guesswork and variation distance as measures of cipher security". International Workshop on Selected Areas in Cryptography. Retrieved October 23, 2016.
[14] Aoki, New Approaches to Macroeconomic Modeling. p. 43.
[15] Mitzenmacher, M. and Upfal, E., Probability and Computing, Cambridge University Press.

This article incorporates material from Shannon's entropy on PlanetMath, which is licensed under the Creative Commons Attribution/Share-Alike License.
Arndt, C. (2004), Information Measures: Information and its Description in Science and Engineering, Springer, ISBN 978-3-540-40855-0
Cover, T. M., Thomas, J. A. (2006), Elements of Information Theory, 2nd Edition. Wiley-Interscience. ISBN 0-471-24195-4.
Gray, R. M. (2011), Entropy and Information Theory, Springer.
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1
Martin, Nathaniel F.G. & England, James W. (2011). Mathematical Theory of Entropy. Cambridge University Press. ISBN 978-0-521-17738-2.
Shannon, C.E., Weaver, W. (1949) The Mathematical Theory of Communication. Univ of Illinois Press. ISBN 0-252-72548-4
Stone, J. V. (2014), Chapter 1 of Information Theory: A Tutorial Introduction, University of Sheffield, England. ISBN 978-0-9563728-5-7.
Chapter 4

Binary entropy function

[Figure: Entropy H(X) of a Bernoulli trial as a function of the success probability Pr(X = 1), called the binary entropy function.]
In information theory, the binary entropy function, denoted H(p) or Hb(p), is defined as the entropy of a Bernoulli
process with probability of success p. Mathematically, the Bernoulli trial is modelled as a random variable X that
can take on only two values: 0 and 1. The event X = 1 is considered a success and the event X = 0 is considered a
failure. If Pr(X = 1) = p, the entropy of X (in bits) is given by

Hb(p) = −p log2(p) − (1 − p) log2(1 − p) ,

where 0 log2(0) is taken to be 0.
4.1 Explanation
In terms of information theory, entropy is considered to be a measure of the uncertainty in a message. To put it
intuitively, suppose p = 0 . At this probability, the event is certain never to occur, and so there is no uncertainty at
all, leading to an entropy of 0. If p = 1 , the result is again certain, so the entropy is 0 here as well. When p = 1/2
, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage
to be gained with prior knowledge of the probabilities. In this case, the entropy is maximum at a value of 1 bit.
Intermediate values fall between these cases; for instance, if p = 1/4 , there is still a measure of uncertainty on the
outcome, but one can still predict the outcome correctly more often than not, so the uncertainty measure, or entropy,
is less than 1 full bit.
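The values discussed above can be checked with a small Python sketch of the standard closed form −p log2 p − (1 − p) log2(1 − p):

```python
import math

def binary_entropy(p):
    """Entropy of a Bernoulli(p) trial, in bits."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: maximal uncertainty at a fair coin
print(binary_entropy(0.25))  # ≈ 0.811: predictable more often than not, so below 1 full bit
print(binary_entropy(0.0))   # 0.0: the outcome is certain
```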
4.2 Derivative

The derivative of the binary entropy function may be expressed as the negative of the logit function:

d/dp Hb(p) = −logit2(p) = −log2( p / (1 − p) ).
4.3 Taylor series

The Taylor series of the binary entropy function in a neighborhood of 1/2 is

Hb(p) = 1 − (1 / (2 ln 2)) Σ_{n=1}^{∞} (1 − 2p)^{2n} / (n(2n − 1))

for 0 ≤ p ≤ 1.
4.5 References

MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1
Chapter 5
Differential entropy
Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an
attempt by Shannon to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to
continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it
was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy
is the limiting density of discrete points (LDDP). Differential entropy (described here) is commonly encountered in
the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete
entropy.
5.1 Definition

Let X be a random variable with a probability density function f whose support is a set 𝒳. The differential entropy
h(X) or h(f) is defined as

h(X) = −∫_𝒳 f(x) log f(x) dx.

For probability distributions which don't have an explicit density function expression, but have an explicit quantile
function expression Q(p), h(Q) can be defined in terms of the derivative of Q(p), i.e. the quantile density function
Q′(p), as[1]

h(Q) = ∫_0^1 log Q′(p) dp.
As with its discrete analog, the units of differential entropy depend on the base of the logarithm, which is usually 2
(i.e., the units are bits). See logarithmic units for logarithms taken in different bases. Related concepts such as joint,
conditional differential entropy, and relative entropy are defined in a similar fashion. Unlike the discrete analog,
the differential entropy has an offset that depends on the units used to measure X.[2] For example, the differential
entropy of a quantity measured in millimeters will be log(1000) more than the same quantity measured in meters; a
dimensionless quantity will have differential entropy of log(1000) more than the same quantity divided by 1000.

One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density
functions can be greater than 1. For example, Uniform(0, 1/2) has negative differential entropy:

h(X) = −∫_0^{1/2} 2 log(2) dx = −log(2).

Thus, differential entropy does not share all properties of discrete entropy.
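The Uniform(0, 1/2) example is easy to check numerically, using the standard closed form that a Uniform(a, b) density has differential entropy log(b − a) in nats:

```python
import math

def uniform_diff_entropy(a, b):
    """Differential entropy (nats) of Uniform(a, b): log(b - a)."""
    return math.log(b - a)

# f(x) = 2 on [0, 1/2], so h = -∫ 2 log(2) dx over [0, 1/2] = -log(2).
h = uniform_diff_entropy(0, 0.5)
print(h)  # ≈ -0.693 = -ln 2
assert h < 0  # a density exceeding 1 can drive differential entropy negative
```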
Note that the continuous mutual information I(X;Y) has the distinction of retaining its fundamental significance as
a measure of discrete information, since it is actually the limit of the discrete mutual information of partitions of X
and Y as these partitions become finer and finer. Thus it is invariant under non-linear homeomorphisms (continuous
and uniquely invertible maps),[3] including linear[4] transformations of X and Y, and still represents the amount of
discrete information that can be transmitted over a channel that admits a continuous space of values.

For the direct analogue of discrete entropy extended to the continuous space, see limiting density of discrete points.
For a collection of random variables X1, …, Xn,

h(X1, …, Xn) ≤ Σ_{i=1}^{n} h(Xi) ,

with equality if and only if X1, …, Xn are independent.

For a transformation Y = m(X),

h(Y) ≤ h(X) + ∫ f(x) log |∂m/∂x| dx ,

where |∂m/∂x| is the Jacobian of the transformation m. The above inequality becomes an equality if the
transform is a bijection. Furthermore, when m is a rigid rotation, translation, or combination thereof,
the Jacobian determinant is always 1, and h(Y) = h(X).
If a random vector X in R^n has mean zero and covariance matrix K, then

h(X) ≤ (1/2) log( det(2πeK) ) = (1/2) log[ (2πe)^n det K ] ,

with equality if and only if X is jointly Gaussian (see below).
Let g(x) be a Gaussian PDF with mean μ and variance σ², and f(x) an arbitrary PDF with the same variance; since
differential entropy is translation-invariant, we may assume f has the same mean μ as g. Consider the Kullback–Leibler
divergence between the two distributions:

0 ≤ DKL(f‖g) = ∫ f(x) log( f(x)/g(x) ) dx = −h(f) − ∫ f(x) log(g(x)) dx.

Now note that

∫ f(x) log(g(x)) dx = ∫ f(x) log( (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} ) dx
= ∫ f(x) log( 1/√(2πσ²) ) dx + log(e) ∫ f(x) ( −(x−μ)²/(2σ²) ) dx
= −(1/2) log(2πσ²) − log(e) σ²/(2σ²)
= −(1/2) ( log(2πσ²) + log(e) )
= −(1/2) log(2πeσ²)
= −h(g) ,

because the result does not depend on f(x) other than through the variance. Combining the two results yields

h(g) − h(f) ≥ 0 ,

with equality when g(x) = f(x), following from the properties of Kullback–Leibler divergence.
This result may also be demonstrated using the variational calculus. A Lagrangian function with two Lagrange
multipliers may be defined as:

L = ∫ g(x) ln(g(x)) dx − λ0 ( 1 − ∫ g(x) dx ) − λ ( σ² − ∫ g(x)(x − μ)² dx ) ,

where g(x) is some function with mean μ. When the entropy of g(x) is at a maximum and the constraint equations,
which consist of the normalization condition ( 1 = ∫ g(x) dx ) and the requirement of fixed variance
( σ² = ∫ g(x)(x − μ)² dx ), are both satisfied, then a small variation δg(x) about g(x) will produce a variation
δL about L which is equal to zero:

0 = δL = ∫ δg(x) ( ln(g(x)) + 1 + λ0 + λ(x − μ)² ) dx.

Since this must hold for any small δg(x), the term in brackets must be zero, and solving for g(x) yields:

g(x) = e^{−λ0 − 1 − λ(x − μ)²}.

Using the constraint equations to solve for λ0 and λ yields the normal distribution:

g(x) = (1/√(2πσ²)) e^{−(x − μ)²/(2σ²)}.
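As a numerical sanity check of this maximization result, one can compare closed-form differential entropies (in nats) of several unit-variance distributions; the closed forms below are standard results, not derived in this article:

```python
import math

# Differential entropies (nats) of three unit-variance densities.
h_gauss = 0.5 * math.log(2 * math.pi * math.e)    # N(0, 1): (1/2) ln(2*pi*e*sigma^2)
h_uniform = math.log(2 * math.sqrt(3))            # Uniform(-sqrt(3), sqrt(3)): variance 1
h_laplace = 1 + math.log(2 / math.sqrt(2))        # Laplace(b = 1/sqrt(2)): variance 2b^2 = 1

# The Gaussian attains the maximum among same-variance distributions.
assert h_uniform < h_gauss and h_laplace < h_gauss
```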
Example: exponential distribution

Let X be an exponentially distributed random variable with parameter λ, that is, with probability density function

f(x) = λ e^{−λx} for x ≥ 0.

Its differential entropy is then

h_e(X) = 1 − ln λ.

Here, h_e(X) was used rather than h(X) to make it explicit that the logarithm was taken to base e, to simplify the
calculation.

[Table of differential entropies for standard distributions omitted; in its notation, B(p, q) = Γ(p)Γ(q)/Γ(p + q) is the
beta function and ψ(x) = d/dx ln Γ(x) = Γ′(x)/Γ(x) is the digamma function.][5]
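The exponential result h_e(X) = 1 − ln λ can be checked by crude numerical integration (λ = 2 is an arbitrary choice):

```python
import math

lam = 2.0  # rate parameter of the exponential distribution

def integrand(x):
    """Integrand of h_e(X) = -∫ f(x) ln f(x) dx for f(x) = lam * exp(-lam * x)."""
    f = lam * math.exp(-lam * x)
    return -f * math.log(f)

# Crude Riemann sum over [0, 20]; the tail beyond x = 20 is negligible for lam = 2.
dx = 1e-4
h_numeric = sum(integrand(i * dx) * dx for i in range(1, int(20 / dx)))
h_exact = 1 - math.log(lam)
print(h_numeric, h_exact)  # both ≈ 0.307 nats
```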
5.6 Variants

As described above, differential entropy does not share all properties of discrete entropy. For example, the differential
entropy can be negative; also it is not invariant under continuous coordinate transformations. Edwin Thompson Jaynes
showed in fact that the expression above is not the correct limit of the expression for a finite set of probabilities.[7]

A modification of differential entropy adds an invariant measure factor to correct this (see limiting density of discrete
points). If m(x) is further constrained to be a probability density, the resulting notion is called relative entropy in
information theory:

D(p‖m) = ∫ p(x) log( p(x)/m(x) ) dx.
The definition of differential entropy above can be obtained by partitioning the range of X into bins of length h with
associated sample points ih within the bins, for X Riemann integrable. This gives a quantized version of X, defined
by Xh = ih if ih ≤ X < (i + 1)h. Then the entropy of Xh is

Hh = −Σ_i h f(ih) log( f(ih) ) − Σ_i h f(ih) log(h).

The first term on the right approximates the differential entropy, while the second term is approximately −log(h).
Note that this procedure suggests that the entropy in the discrete sense of a continuous random variable should be +∞.
5.8 References

[1] Vasicek, Oldrich (1976), "A Test for Normality Based on Sample Entropy", Journal of the Royal Statistical Society, Series B, 38 (1): 54–59, JSTOR 2984828.
[2] Pages 183–184, Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics, developed with especial reference to the rational foundation of thermodynamics. New York: Charles Scribner's Sons.
[3] Kraskov, Alexander; Stögbauer; Grassberger (2004). "Estimating mutual information". Physical Review E. 69: 066138. arXiv:cond-mat/0305641. Bibcode:2004PhRvE..69f6138K. doi:10.1103/PhysRevE.69.066138.
[4] Fazlollah M. Reza (1994) [1961]. An Introduction to Information Theory. Dover Publications, Inc., New York. ISBN 0-486-68210-2.
[5] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. Elsevier: 219–230. Retrieved 2011-06-02.
[6] Lazo, A. and P. Rathie (1978). "On the entropy of continuous probability distributions". IEEE Transactions on Information Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832.
[7] Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b): 181–218.
Thomas M. Cover, Joy A. Thomas. Elements of Information Theory. New York: Wiley, 1991. ISBN 0-471-06259-6
Chapter 6
Diversity index
A diversity index is a quantitative measure that reflects how many different types (such as species) there are in
a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed
among those types. The value of a diversity index increases both when the number of types increases and when
evenness increases. For a given number of types, the value of a diversity index is maximized when all types are
equally abundant.

When diversity indices are used in ecology, the types of interest are usually species, but they can also be other
categories, such as genera, families, functional types or haplotypes. The entities of interest are usually individual
plants or animals, and the measure of abundance can be, for example, number of individuals, biomass or coverage.
In demography, the entities of interest can be people, and the types of interest various demographic groups. In
information science, the entities can be characters and the types the different letters of the alphabet. The most
commonly used diversity indices are simple transformations of the effective number of types (also known as 'true
diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real
phenomenon (but a different one for each diversity index).[1][2][3][4]
6.1 True diversity

True diversity, or the effective number of types, refers to the number of equally abundant types needed for the
average proportional abundance of the types to equal that observed in the dataset of interest. The true diversity of
order q is:

qD = 1/M_{q−1} = 1 / ( Σ_{i=1}^{R} pi pi^{q−1} )^{1/(q−1)} = ( Σ_{i=1}^{R} pi^q )^{1/(1−q)}
The denominator M_{q−1} equals the average proportional abundance of the types in the dataset as calculated with the
weighted generalized mean with exponent q−1. In the equation, R is richness (the total number of types in the
dataset), and the proportional abundance of the ith type is pi. The proportional abundances themselves are used as
the nominal weights. When q = 1, the above equation is undefined. However, the mathematical limit as q approaches
1 is well defined and the corresponding diversity is calculated with the following equation:
1D = 1 / Π_{i=1}^{R} pi^{pi} = exp( −Σ_{i=1}^{R} pi ln(pi) )
which is the exponential of the Shannon entropy calculated with natural logarithms (see below).
The value of q is often referred to as the order of the diversity. It defines the sensitivity of the diversity value to
rare vs. abundant species by modifying how the weighted mean of the species' proportional abundances is calculated.
With some values of the parameter q, the value of M_{q−1} assumes familiar kinds of weighted mean as special cases. In
particular, q = 0 corresponds to the weighted harmonic mean, q = 1 to the weighted geometric mean and q = 2 to the
weighted arithmetic mean. As q approaches infinity, the weighted generalized mean with exponent q−1 approaches
the maximum pi value, which is the proportional abundance of the most abundant species in the dataset. Generally,
increasing the value of q increases the effective weight given to the most abundant species. This leads to obtaining a
larger M_{q−1} value and a smaller true diversity (qD) value with increasing q.
When q = 1, the weighted geometric mean of the pi values is used, and each species is exactly weighted by its
proportional abundance (in the weighted geometric mean, the weights are the exponents). When q > 1, the weight
given to abundant species is exaggerated, and when q < 1, the weight given to rare species is. At q = 0, the species
weights exactly cancel out the species' proportional abundances, such that the weighted mean of the pi values equals
1/R even when all species are not equally abundant. At q = 0, the effective number of species, 0D, hence equals the
actual number of species R. In the context of diversity, q is generally limited to non-negative values. This is because
negative values of q would give rare species so much more weight than abundant ones that qD would exceed R.[3][4]
The general equation of diversity is often written in the form[1][2]

qD = ( Σ_{i=1}^{R} pi^q )^{1/(1−q)}

and the term inside the parentheses is called the basic sum. Some popular diversity indices correspond to the basic
sum as calculated with different values of q.[2]
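The general equation above, including the q = 1 limit, can be sketched directly in Python (the distribution used is illustrative):

```python
import math

def true_diversity(p, q):
    """Effective number of types qD = (sum p_i^q)^(1/(1-q)); q = 1 taken as the limit exp(H)."""
    if q == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p if pi > 0))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

p = [0.5, 0.3, 0.15, 0.05]
print(true_diversity(p, 0))  # 4.0: order 0 just counts the types (richness R)
print(true_diversity(p, 1))  # exponential of the Shannon entropy
print(true_diversity(p, 2))  # 1 / sum(p_i^2): the inverse Simpson index
```

Note that the three values decrease as q grows, reflecting the increasing weight given to the abundant types.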
6.2 Richness

Main article: Species richness

Richness R simply quantifies how many different types the dataset of interest contains. For example, species richness
(usually noted S) of a dataset is the number of different species in the corresponding species list. Richness is a simple
measure, so it has been a popular diversity index in ecology, where abundance data are often not available for the
datasets of interest. Because richness does not take the abundances of the types into account, it is not the same thing
as diversity, which does take abundances into account. However, if true diversity is calculated with q = 0, the effective
number of types (0D) equals the actual number of types (R).[2][4]
6.3 Shannon index

The Shannon index has been a popular diversity index in the ecological literature. It was originally proposed by
Claude Shannon to quantify the entropy (uncertainty or information content) in strings of text:[5]

H′ = −Σ_{i=1}^{R} pi ln pi
where pi is the proportion of characters belonging to the ith type of letter in the string of interest. In ecology, pi is
often the proportion of individuals belonging to the ith species in the dataset of interest. Then the Shannon entropy
quanties the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
Although the equation is here written with natural logarithms, the base of the logarithm used when calculating the
Shannon entropy can be chosen freely. Shannon himself discussed logarithm bases 2, 10 and e, and these have since
become the most popular bases in applications that use the Shannon entropy. Each log base corresponds to a dierent
45
measurement unit, which have been called binary digits (bits), decimal digits (decits) and natural digits (nats) for the
bases 2, 10 and e, respectively. Comparing Shannon entropy values that were originally calculated with dierent log
bases requires converting them to the same log base: change from the base a to base b is obtained with multiplication
by logba.[5]
It has been shown that the Shannon index is based on the weighted geometric mean of the proportional abundances
of the types, and that it equals the logarithm of true diversity as calculated with q = 1:[3]

H′ = −Σ_{i=1}^{R} pi ln pi = −Σ_{i=1}^{R} ln pi^{pi} ,

which can also be written

H′ = −ln( p1^{p1} p2^{p2} p3^{p3} ⋯ pR^{pR} ) = ln( 1 / (p1^{p1} p2^{p2} p3^{p3} ⋯ pR^{pR}) ) = ln( 1 / Π_{i=1}^{R} pi^{pi} ).

Since the sum of the pi values equals unity by definition, the denominator equals the weighted geometric mean of the
pi values, with the pi values themselves being used as the weights (exponents in the equation). The term within the
parentheses hence equals true diversity 1D, and H′ equals ln(1D).[1][3][4]
When all types in the dataset of interest are equally common, all pi values equal 1/R, and the Shannon index hence
takes the value ln(R). The more unequal the abundances of the types, the larger the weighted geometric mean of the
pi values, and the smaller the corresponding Shannon entropy. If practically all abundance is concentrated in one
type, and the other types are very rare (even if there are many of them), Shannon entropy approaches zero. When
there is only one type in the dataset, Shannon entropy exactly equals zero (there is no uncertainty in predicting the
type of the next randomly chosen entity).
6.3.1 Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

qH = (1 / (1 − q)) ln( Σ_{i=1}^{R} pi^q ) ,

which equals

qH = ln( 1 / ( Σ_{i=1}^{R} pi pi^{q−1} )^{1/(q−1)} ) = ln(qD).

This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding
to the same value of q.
6.4 Simpson index

The Simpson index was introduced in 1949 by Edward H. Simpson to measure the degree of concentration when
individuals are classified into types.[6] The same index was rediscovered by Orris C. Herfindahl in 1950.[7] The square
root of the index had already been introduced in 1945 by the economist Albert O. Hirschman.[8] As a result, the same
measure is usually known as the Simpson index in ecology, and as the Herfindahl index or the Herfindahl–Hirschman
index (HHI) in economics.
The measure equals the probability that two entities taken at random from the dataset of interest represent the same
type.[6] It equals:

λ = Σ_{i=1}^{R} pi² ,

This also equals the weighted arithmetic mean of the proportional abundances pi of the types of interest, with the
proportional abundances themselves being used as the weights.[1] Proportional abundances are by definition constrained
to values between zero and unity, but so is their weighted arithmetic mean λ, and hence λ ≥ 1/R, with the minimum
reached when all types are equally abundant.
By comparing the equation used to calculate λ with the equations used to calculate true diversity, it can be seen that 1/λ
equals 2D, i.e. true diversity as calculated with q = 2. The original Simpson's index hence equals the corresponding
basic sum.[2]
The interpretation of λ as the probability that two entities taken at random from the dataset of interest represent the
same type assumes that the first entity is replaced to the dataset before taking the second entity. If the dataset is very
large, sampling without replacement gives approximately the same result, but in small datasets the difference can be
substantial. If the dataset is small, and sampling without replacement is assumed, the probability of obtaining the
same type with both random draws is:

ℓ = Σ_{i=1}^{R} ni(ni − 1) / (N(N − 1)) ,

where ni is the number of entities belonging to the ith type and N is the total number of entities in the dataset.[6] This
form of the Simpson index is also known as the Hunter–Gaston index in microbiology.[9]
Since the mean proportional abundance of the types increases with decreasing number of types and increasing abundance
of the most abundant type, λ obtains small values in datasets of high diversity and large values in datasets of low
diversity. This is counterintuitive behavior for a diversity index, so often such transformations of λ that increase with
increasing diversity have been used instead. The most popular of such indices have been the inverse Simpson index
(1/λ) and the Gini–Simpson index (1 − λ).[1][2] Both of these have also been called the Simpson index in the ecological
literature, so care is needed to avoid accidentally comparing the different indices as if they were the same.
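The Simpson family of indices can be computed together from raw type counts; a small sketch with illustrative counts (the function name is not standard):

```python
def simpson_indices(counts):
    """Simpson's lambda and its common transformations, from raw type counts."""
    N = sum(counts)
    p = [n / N for n in counts]
    lam = sum(pi ** 2 for pi in p)  # sampling with replacement
    # Sampling without replacement (Hunter-Gaston form): sum n_i(n_i - 1) / (N(N - 1)).
    hunter_gaston = sum(n * (n - 1) for n in counts) / (N * (N - 1))
    return {
        "simpson": lam,
        "inverse_simpson": 1 / lam,  # true diversity of order 2
        "gini_simpson": 1 - lam,     # probability of interspecific encounter
        "hunter_gaston": hunter_gaston,
    }

idx = simpson_indices([50, 30, 15, 5])
print(idx)
```

With these counts the with-replacement and without-replacement forms already differ in the second decimal place, illustrating the small-dataset caveat above.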
6.4.1 Inverse Simpson index

The inverse Simpson index equals:

1/λ = 1 / Σ_{i=1}^{R} pi² = 2D

This simply equals true diversity of order 2, i.e. the effective number of types that is obtained when the weighted
arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.
6.4.2
GiniSimpson index
The original Simpson index λ equals the probability that two entities taken at random from the dataset of interest (with replacement) represent the same type. Its transformation 1 − λ therefore equals the probability that the two entities represent different types. This measure is also known in ecology as the probability of interspecific encounter (PIE)[10] and the Gini–Simpson index.[2] It can be expressed as a transformation of true diversity of order 2:
$$1 - \lambda = 1 - \sum_{i=1}^{R} p_i^2 = 1 - 1/{}^2D$$
The Gibbs–Martin index of sociology, psychology and management studies,[11] which is also known as the Blau index, is the same measure as the Gini–Simpson index.
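The indices above can be illustrated numerically. The following sketch (not from the original text; the counts are hypothetical and the function name is ours) computes λ, the inverse Simpson index, the Gini–Simpson index, and the Hunter–Gaston (without-replacement) form from a vector of type counts:

```python
def simpson_indices(counts):
    """Compute Simpson-family indices from a list of type counts.

    Returns (lam, inv, gini, hg), where lam is the original Simpson index
    (sampling with replacement) and hg is the without-replacement form.
    """
    N = sum(counts)
    p = [n / N for n in counts]
    lam = sum(pi ** 2 for pi in p)        # probability both draws are the same type
    inv = 1.0 / lam                       # inverse Simpson index, equals 2D
    gini = 1.0 - lam                      # Gini-Simpson index (PIE)
    hg = sum(n * (n - 1) for n in counts) / (N * (N - 1))  # without replacement
    return lam, inv, gini, hg

# Hypothetical dataset with three types.
counts = [10, 5, 1]
lam, inv, gini, hg = simpson_indices(counts)
```

For small datasets such as this one, the with- and without-replacement values visibly differ, as the text describes.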
6.7 References
[1] Hill, M. O. (1973). "Diversity and evenness: a unifying notation and its consequences". Ecology. 54: 427–432. doi:10.2307/1934352.
[2] Jost, L (2006). "Entropy and diversity". Oikos. 113: 363–375. doi:10.1111/j.2006.0030-1299.14714.x.
[3] Tuomisto, H (2010). "A diversity of beta diversities: straightening up a concept gone awry. Part 1. Defining beta diversity as a function of alpha and gamma diversity". Ecography. 33: 2–22. doi:10.1111/j.1600-0587.2009.05880.x.
[4] Tuomisto, H (2010). "A consistent terminology for quantifying species diversity? Yes, it does exist". Oecologia. 164: 853–860. doi:10.1007/s00442-010-1812-0.
[5] Shannon, C. E. (1948). "A mathematical theory of communication". The Bell System Technical Journal. 27: 379–423 and 623–656.
[6] Simpson, E. H. (1949). "Measurement of diversity". Nature. 163: 688. doi:10.1038/163688a0.
[7] Herfindahl, O. C. (1950). Concentration in the U.S. Steel Industry. Unpublished doctoral dissertation, Columbia University.
[8] Hirschman, A. O. (1945). National Power and the Structure of Foreign Trade. Berkeley.
[9] Hunter, PR; Gaston, MA (1988). "Numerical index of the discriminatory ability of typing systems: an application of Simpson's index of diversity". J Clin Microbiol. 26 (11): 2465–2466. PMID 3069867.
[10] Hurlbert, S.H. (1971). "The nonconcept of species diversity: A critique and alternative parameters". Ecology. 52: 577–586. doi:10.2307/1934145.
[11] Gibbs, Jack P.; William T. Martin (1962). "Urbanization, technology and the division of labor". American Sociological Review. 27: 667–677. doi:10.2307/2089624. JSTOR 2089624.
[12] Berger, Wolfgang H.; Parker, Frances L. (June 1970). "Diversity of Planktonic Foraminifera in Deep-Sea Sediments". Science. 168 (3937): 1345–1347. doi:10.1126/science.168.3937.1345.
Chapter 7
Conditional entropy
Venn diagram for various information measures associated with correlated variables X and Y. The area contained by both circles is the joint entropy H(X,Y). The circle on the left (red and violet) is the individual entropy H(X), with the red being the conditional entropy H(X|Y). The circle on the right (blue and violet) is H(Y), with the blue being H(Y|X). The violet is the mutual information I(X;Y).
In information theory, the conditional entropy (or equivocation) quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. Here, information is measured in shannons, nats, or hartleys. The entropy of Y conditioned on X is written as H(Y|X).
7.1 Definition
If H(Y|X = x) is the entropy of the variable Y conditioned on the variable X taking a certain value x, then H(Y|X) is the result of averaging H(Y|X = x) over all possible values x that X may take.
Given discrete random variables X with image $\mathcal{X}$ and Y with image $\mathcal{Y}$, the conditional entropy of Y given X is defined as follows. (Intuitively, it can be thought of as the weighted sum of H(Y|X = x) for each possible value of x, using p(x) as the weights.)[1]
$$\begin{aligned}
H(Y|X) &\equiv \sum_{x\in\mathcal{X}} p(x)\, H(Y|X=x) \\
&= -\sum_{x\in\mathcal{X}} p(x) \sum_{y\in\mathcal{Y}} p(y|x) \log p(y|x) \\
&= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y) \log p(y|x) \\
&= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)} \\
&= \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y) \log \frac{p(x)}{p(x,y)}.
\end{aligned}$$
Note: It is understood that the expressions 0 log 0 and 0 log(c/0) for fixed c > 0 should be treated as being equal to zero.
H(Y|X) = 0 if and only if the value of Y is completely determined by the value of X. Conversely, H(Y|X) = H(Y) if and only if Y and X are independent random variables.
The chain rule follows from the above definition of conditional entropy:
$$\begin{aligned}
H(Y|X) &= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)} \\
&= -\sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y) \log p(x,y) + \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} p(x,y) \log p(x) \\
&= H(X,Y) + \sum_{x\in\mathcal{X}} p(x) \log p(x) \\
&= H(X,Y) - H(X).
\end{aligned}$$
In general, a chain rule for multiple random variables holds:
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}).$$
It has a similar form to the chain rule in probability theory, except that addition is used instead of multiplication.
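The chain rule can be checked numerically. A minimal sketch, assuming a small hypothetical joint distribution p(x, y) stored as a dict:

```python
import math

# Hypothetical joint distribution p(x, y), stored as {(x, y): probability}.
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5}

def H(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distribution p(x).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x).
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

# Chain rule: H(Y|X) = H(X,Y) - H(X).
assert abs(H_Y_given_X - (H(p_xy) - H(p_x))) < 1e-12
```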
7.7 References
[1] Cover, Thomas M.; Thomas, Joy A. (1991). Elements of information theory (1st ed.). New York: Wiley. ISBN 0471062596.
Chapter 8
Joint entropy
In information theory, joint entropy is a measure of the uncertainty associated with a set of variables.
8.1 Definition
The joint Shannon entropy (in bits) of two variables X and Y is defined as
$$H(X, Y) = -\sum_{x}\sum_{y} P(x,y) \log_2 [P(x,y)]$$
where x and y are particular values of X and Y, respectively, P(x,y) is the joint probability of these values occurring together, and $P(x,y)\log_2[P(x,y)]$ is defined to be 0 if P(x,y) = 0.
For more than two variables X1 , ..., Xn this expands to
$$H(X_1, \ldots, X_n) = -\sum_{x_1} \cdots \sum_{x_n} P(x_1,\ldots,x_n) \log_2 [P(x_1,\ldots,x_n)]$$
where $x_1, \ldots, x_n$ are particular values of $X_1, \ldots, X_n$, respectively, $P(x_1,\ldots,x_n)$ is the probability of these values occurring together, and $P(x_1,\ldots,x_n)\log_2[P(x_1,\ldots,x_n)]$ is defined to be 0 if $P(x_1,\ldots,x_n) = 0$.
8.2 Properties
8.2.1 Greater than individual entropies
The joint entropy of a set of variables is greater than or equal to each of the individual entropies of the variables in the set:
$$H(X,Y) \geq \max[H(X), H(Y)]$$
$$H(X_1,\ldots,X_n) \geq \max[H(X_1),\ldots,H(X_n)]$$
8.2.2 Less than or equal to the sum of individual entropies
The joint entropy of a set of variables is less than or equal to the sum of the individual entropies of the variables in the set. This is an example of subadditivity. This inequality is an equality if and only if X and Y are statistically independent:
$$H(X,Y) \leq H(X) + H(Y)$$
$$H(X_1,\ldots,X_n) \leq H(X_1) + \ldots + H(X_n)$$
Joint entropy is related to conditional entropy through the chain rule:
$$H(X_1,\ldots,X_N) = \sum_{k=1}^{N} H(X_k \mid X_{k-1}, \ldots, X_1)$$
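Both bounds can be verified on a small example. The joint distribution below is hypothetical and the helper names are ours:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution of two correlated binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}   # marginal of X
p_y = {0: 0.5, 1: 0.5}   # marginal of Y

H_xy = entropy(p_xy)

# Greater than or equal to each individual entropy.
assert H_xy >= max(entropy(p_x), entropy(p_y))

# Subadditivity: at most the sum of the individual entropies
# (equality would hold only if X and Y were independent).
assert H_xy <= entropy(p_x) + entropy(p_y)
```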
8.4 References
Theresa M. Korn; Korn, Granino Arthur. Mathematical Handbook for Scientists and Engineers: Definitions, Theorems, and Formulas for Reference and Review. New York: Dover Publications. pp. 613–614. ISBN 0-486-41147-8.
Chapter 9
Mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the amount of information (in units such as bits) obtained about one random variable through the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that defines the amount of information held in a random variable.
Not limited to real-valued random variables like the correlation coefficient, MI is more general and determines how similar the joint distribution p(X,Y) is to the product of the factored marginal distributions p(X)p(Y). MI is the expected value of the pointwise mutual information (PMI). The most common unit of measurement of mutual information is the bit.
9.1 Definition
Formally, the mutual information of two discrete random variables X and Y can be defined as:
$$I(X;Y) = \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right),$$
where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability
distribution functions of X and Y respectively.
In the case of continuous random variables, the summation is replaced by a definite double integral:
$$I(X;Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right) dx\, dy,$$
where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability
density functions of X and Y respectively.
If the log base 2 is used, the units of mutual information are the bit.
Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of
these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does
not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a
deterministic function of Y and Y is a deterministic function of X then all information conveyed by X is shared with
Y: knowing X determines the value of Y and vice versa. As a result, in this case the mutual information is the same
as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). Moreover, this mutual information
is the same as the entropy of X and as the entropy of Y. (A very special case of this is when X and Y are the same
random variable.)
Mutual information is a measure of the inherent dependence expressed in the joint distribution of X and Y relative
to the joint distribution of X and Y under the assumption of independence. Mutual information therefore measures
dependence in the following sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy
to see in one direction: if X and Y are independent, then p(x,y) = p(x) p(y), and therefore:
$$\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right) = \log 1 = 0.$$
Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)).
Mutual information can be equivalently expressed in terms of entropies and conditional entropies:
$$\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \\
&= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)} - \sum_{x,y} p(x,y) \log p(y) \\
&= \sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x) - \sum_{y} \log p(y) \sum_{x} p(x,y) \\
&= -\sum_{x} p(x)\, H(Y|X=x) - \sum_{y} p(y) \log p(y) \\
&= -H(Y|X) + H(Y) \\
&= H(Y) - H(Y|X).
\end{aligned}$$
$$I(X;Y) = \sum_{y} p(y) \sum_{x} p(x|y) \log_2 \frac{p(x|y)}{p(x)} = \mathrm{E}_Y\{D_{\mathrm{KL}}(p(x|y) \,\|\, p(x))\}.$$
Note that here the Kullback–Leibler divergence involves integration with respect to the random variable X only, and the expression $D_{\mathrm{KL}}(p(x|y) \| p(x))$ is now a random variable in Y. Thus mutual information can also be understood as the expectation of the Kullback–Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x) are on average, the greater the information gain.
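As a sketch (with a hypothetical joint distribution), the definition can be evaluated directly and checked against the identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
import math

# Hypothetical joint distribution of two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    """Marginalize a joint distribution {(x, y): p} onto one coordinate."""
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)

# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# Equivalent identity: I(X;Y) = H(X) + H(Y) - H(X,Y).
assert abs(I - (H(p_x) + H(p_y) - H(p_xy))) < 1e-12
```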
9.3 Variations
Several variations on mutual information have been proposed to suit various needs. Among these are normalized
variants and generalizations to more than two variables.
9.3.1
Metric
Many applications require a metric, that is, a distance measure between pairs of points. The quantity
$$\begin{aligned}
d(X,Y) &= H(X,Y) - I(X;Y) \\
&= H(X) + H(Y) - 2\,I(X;Y) \\
&= H(X|Y) + H(Y|X)
\end{aligned}$$
satisfies the properties of a metric (triangle inequality, non-negativity, indiscernibility and symmetry). This distance metric is also known as the variation of information.
If X, Y are discrete random variables then all the entropy terms are nonnegative, so $0 \leq d(X,Y) \leq H(X,Y)$ and one can define a normalized distance
$$D(X,Y) = d(X,Y)/H(X,Y) \leq 1.$$
The metric D is a universal metric, in that if any other distance measure places X and Y close by, then D will also judge them close.[1]
Plugging in the definitions shows that
$$D(X,Y) = 1 - I(X;Y)/H(X,Y).$$
In a set-theoretic interpretation of information (see the figure for conditional entropy), this is effectively the Jaccard distance between X and Y.
Finally,
$$D'(X,Y) = 1 - \frac{I(X;Y)}{\max(H(X), H(Y))}$$
is also a metric.
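A short numerical sketch of the variation of information and its normalized form, using a hypothetical joint distribution (the helper names are ours):

```python
import math

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

def variation_of_information(joint):
    """d(X,Y) = H(X,Y) - I(X;Y) = 2 H(X,Y) - H(X) - H(Y), in bits."""
    return 2 * H(joint) - H(marginal(joint, 0)) - H(marginal(joint, 1))

# Hypothetical joint distribution.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
d = variation_of_information(p_xy)
D = d / H(p_xy)   # normalized distance D(X,Y), always between 0 and 1
```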
9.3.2 Conditional mutual information
Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true
that
$$I(X;Y|Z) \geq 0$$
for discrete, jointly distributed random variables X, Y, Z. This result has been used as a basic building block for
proving other inequalities in information theory.
9.3.3 Multivariate mutual information
Several generalizations of mutual information to more than two random variables have been proposed. One approach defines it recursively by
$$I(X_1; X_1) = H(X_1)$$
and for n > 1,
$$I(X_1; \ldots; X_n) = I(X_1; \ldots; X_{n-1}) - I(X_1; \ldots; X_{n-1} \mid X_n).$$
9.3.4
Directed information
Directed information, $I(X^n \to Y^n)$, measures the amount of information that flows from the process $X^n$ to $Y^n$, where $X^n$ denotes the vector $X_1, X_2, \ldots, X_n$ and $Y^n$ denotes $Y_1, Y_2, \ldots, Y_n$. The term directed information was coined by James Massey and is defined as
$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}).$$
Note that if n = 1, the directed information becomes the mutual information. Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback.[5][6]
9.3.5
Normalized variants
Normalized variants of the mutual information are provided by the coefficients of constraint,[7] uncertainty coefficient[8] or proficiency:[9]
$$C_{XY} = \frac{I(X;Y)}{H(Y)} \quad \text{and} \quad C_{YX} = \frac{I(X;Y)}{H(X)}.$$
The two coefficients are not necessarily equal. In some cases a symmetric measure may be desired, such as the following redundancy measure:
$$R = \frac{I(X;Y)}{H(X) + H(Y)}$$
which attains a minimum of zero when the variables are independent and a maximum value of
$$R_{\max} = \frac{\min(H(X), H(Y))}{H(X) + H(Y)}$$
when one variable becomes completely redundant with the knowledge of the other. See also Redundancy (information
theory). Another symmetrical measure is the symmetric uncertainty (Witten & Frank 2005), given by
$$U(X,Y) = 2R = \frac{2\,I(X;Y)}{H(X) + H(Y)}$$
and
$$\frac{I(X;Y)}{H(X,Y)}.$$
Finally, there is a normalization[10] which derives from first thinking of mutual information as an analogue to covariance (thus Shannon entropy is analogous to variance). Then the normalized mutual information is calculated akin to the Pearson correlation coefficient,
$$\frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}}.$$
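The normalized variants can be computed side by side. A sketch with hypothetical numbers, reusing the same toy joint distribution as above:

```python
import math

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution and its marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

C_XY = I / H(p_y)                      # coefficient of constraint C_XY
C_YX = I / H(p_x)                      # coefficient of constraint C_YX
R = I / (H(p_x) + H(p_y))              # redundancy measure, zero when independent
U = 2 * R                              # symmetric uncertainty
nmi = I / math.sqrt(H(p_x) * H(p_y))   # covariance-style normalization
```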
9.3.6
Weighted variants
In the traditional formulation of the mutual information,
$$I(X;Y) = \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},$$
each event or object specified by (x, y) is weighted by the corresponding probability p(x, y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.
For example, the deterministic mapping {(1, 1), (2, 2), (3, 3)} may be viewed as stronger than the deterministic mapping {(1, 3), (2, 1), (3, 2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs, Dawes & Tversky 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation, showing agreement on all variable values, be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977).
$$I(X;Y) = \sum_{y\in\mathcal{Y}} \sum_{x\in\mathcal{X}} w(x,y)\, p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},$$
which places a weight w(x, y) on the probability of each variable value co-occurrence, p(x, y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1, 1), w(2, 2), and w(3, 3) would have the effect of assessing greater informativeness for the relation {(1, 1), (2, 2), (3, 3)} than for the relation {(1, 3), (2, 1), (3, 2)}, which may be desirable in some cases of pattern recognition, and the like. This weighted mutual information is a form of weighted KL-divergence, which is known to take negative values for some inputs,[11] and there are examples where the weighted mutual information also takes negative values.[12]
9.3.7 Adjusted mutual information
9.3.8 Absolute mutual information
Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution:
$$I_K(X;Y) = K(X) - K(X|Y).$$
9.3.9
Linear correlation
Unlike correlation coefficients, such as the product moment correlation coefficient, mutual information contains information about all dependence, linear and nonlinear, and not just linear dependence as the correlation coefficient measures. However, in the narrow case that both marginal distributions for X and Y are normally distributed and their joint distribution is a bivariate normal distribution, there is an exact relationship between I and the correlation coefficient ρ (Gel'fand & Yaglom 1957):
$$I = -\frac{1}{2} \log\left(1 - \rho^2\right)$$
9.3.10 For discrete data
When X and Y are limited to be in a discrete number of states, observation data is summarized in a contingency table, with row variable X (or i) and column variable Y (or j). Mutual information is one of the measures of association or correlation between the row and column variables. Other measures of association include Pearson's chi-squared test statistics, G-test statistics, etc. In fact, mutual information is equal to the G-test statistic divided by 2N, where N is the sample size.
In the special case where the number of states for both row and column variables is 2 (i, j = 1, 2), the degrees of freedom of Pearson's chi-squared test is 1. Out of the four terms in the summation
$$\sum_{i,j} p_{ij} \log \frac{p_{ij}}{p_i p_j},$$
only one is independent. This is the reason that the mutual information function has an exact relationship with the correlation function $p_{X=1,Y=1} - p_{X=1}\,p_{Y=1}$ for binary sequences.[13]
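The relationship I = G/(2N) can be verified directly on a small (hypothetical) 2×2 contingency table, with mutual information measured in nats so that it matches the natural logarithm used in the G statistic:

```python
import math

# Hypothetical 2x2 contingency table of observed counts.
table = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}
N = sum(table.values())

row = {i: sum(n for (a, b), n in table.items() if a == i) for i in (0, 1)}
col = {j: sum(n for (a, b), n in table.items() if b == j) for j in (0, 1)}

# Mutual information (in nats) of the empirical joint distribution.
I = sum((n / N) * math.log((n / N) / ((row[i] / N) * (col[j] / N)))
        for (i, j), n in table.items() if n > 0)

# G-test statistic: G = 2 * sum over cells of O * ln(O / E), with E = row * col / N.
G = 2 * sum(n * math.log(n / (row[i] * col[j] / N))
            for (i, j), n in table.items() if n > 0)

# Mutual information equals the G statistic divided by 2N.
assert abs(I - G / (2 * N)) < 1e-12
```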
9.4 Applications
In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often
equivalent to minimizing conditional entropy. Examples include:
In search engine technology, mutual information between phrases and contexts is used as a feature for k-means clustering to discover semantic clusters (concepts).[14]
In telecommunications, the channel capacity is equal to the mutual information, maximized over all input
distributions.
Discriminative training procedures for hidden Markov models have been proposed based on the maximum
mutual information (MMI) criterion.
RNA secondary structure prediction from a multiple sequence alignment.
Phylogenetic profiling: prediction from the pairwise presence and absence of functionally linked genes.
Mutual information has been used as a criterion for feature selection and feature transformations in machine
learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum
redundancy feature selection.
Mutual information is used in determining the similarity of two different clusterings of a dataset. As such, it provides some advantages over the traditional Rand index.
Mutual information of words is often used as a significance function for the computation of collocations in corpus linguistics. This has the added complexity that no word instance is an instance of two different words; rather, one counts instances where two words occur adjacent or in close proximity; this slightly complicates the calculation, since the expected probability of one word occurring within N words of another goes up with N.
Mutual information is used in medical imaging for image registration. Given a reference image (for example, a
brain scan), and a second image which needs to be put into the same coordinate system as the reference image,
this image is deformed until the mutual information between it and the reference image is maximized.
Detection of phase synchronization in time series analysis
In the infomax method for neural-net and other machine learning, including the infomax-based independent component analysis algorithm
Average mutual information in delay embedding theorem is used for determining the embedding delay parameter.
Mutual information between genes in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks.
In statistical mechanics, Loschmidt's paradox may be expressed in terms of mutual information.[15][16] Loschmidt noted that it must be impossible to determine a physical law which lacks time reversal symmetry (e.g. the second law of thermodynamics) only from physical laws which have this symmetry. He pointed out that the H-theorem of Boltzmann made the assumption that the velocities of particles in a gas were permanently uncorrelated, which removed the time symmetry inherent in the H-theorem. It can be shown that if a system is described by a probability density in phase space, then Liouville's theorem implies that the joint information (negative of the joint entropy) of the distribution remains constant in time. The joint information is equal to the mutual information plus the sum of all the marginal information (negative of the marginal entropies) for each particle coordinate. Boltzmann's assumption amounts to ignoring the mutual information in the calculation of entropy, which yields the thermodynamic entropy (divided by Boltzmann's constant).
The mutual information is used to learn the structure of Bayesian networks/dynamic Bayesian networks, which explain the causal relationship between random variables, as exemplified by the GlobalMIT toolkit: learning the globally optimal dynamic Bayesian network with the Mutual Information Test criterion.
Popular cost function in decision tree learning.
9.6 Notes
[1] Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G.; Grassberger, Peter (2003). "Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039.
[2] Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze (2008). An Introduction to Information Retrieval. Cambridge University Press. ISBN 0-521-86571-9.
[3] Haghighat, M. B. A.; Aghagolzadeh, A.; Seyedarabi, H. (2011). "A non-reference image fusion metric based on mutual information of image features". Computers & Electrical Engineering. 37 (5): 744–756. doi:10.1016/j.compeleceng.2011.07.012.
[4] http://www.mathworks.com/matlabcentral/fileexchange/45926-feature-mutual-information-fmi-image-fusion-metric
[5] Massey, James (1990). "Causality, Feedback And Directed Information" (ISITA).
[6] Permuter, Haim Henry; Weissman, Tsachy; Goldsmith, Andrea J. (February 2009). "Finite State Channels With Time-Invariant Deterministic Feedback". IEEE Transactions on Information Theory. 55 (2): 644–662. doi:10.1109/TIT.2008.2009849.
[7] Coombs, Dawes & Tversky 1970.
[8] Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 14.7.3. Conditional Entropy and Mutual Information". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
[9] White, Jim; Steingold, Sam; Fournelle, Connie. "Performance Metrics for Group-Detection Algorithms" (PDF).
[10] Strehl, Alexander; Ghosh, Joydeep (2002). "Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions" (PDF). The Journal of Machine Learning Research. 3 (Dec): 583–617.
[11] Kvålseth, T. O. (1991). "The relative useful information measure: some comments". Information Sciences. 56 (1): 35–38. doi:10.1016/0020-0255(91)90022-m.
[12] Pocock, A. (2012). Feature Selection Via Joint Likelihood (PDF) (Thesis).
[13] Wentian Li (1990). "Mutual information functions versus correlation functions". J. Stat. Phys. 60 (5–6): 823–837. doi:10.1007/BF01025996.
[14] "Parsing a Natural Language Using Mutual Information Statistics" by David M. Magerman and Mitchell P. Marcus
[15] Hugh Everett, Theory of the Universal Wavefunction, Thesis, Princeton University, (1956, 1973), pp. 1–140 (page 30)
[16] Everett, Hugh (1957). "Relative State Formulation of Quantum Mechanics". Reviews of Modern Physics. 29: 454–462. doi:10.1103/revmodphys.29.454.
9.7 References
Cilibrasi, R.; Vitányi, Paul (2005). "Clustering by compression" (PDF). IEEE Transactions on Information Theory. 51 (4): 1523–1545. doi:10.1109/TIT.2005.844059.
Cronbach, L. J. (1954). "On the non-rational application of information measures in psychology". In Quastler, Henry. Information Theory in Psychology: Problems and Methods. Glencoe, Illinois: Free Press. pp. 14–30.
Coombs, C. H.; Dawes, R. M.; Tversky, A. (1970). Mathematical Psychology: An Elementary Introduction. Englewood Cliffs, New Jersey: Prentice-Hall.
Church, Kenneth Ward; Hanks, Patrick (1989). "Word association norms, mutual information, and lexicography". Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. doi:10.1145/90000/89095 (inactive 2016-01-30).
Gel'fand, I.M.; Yaglom, A.M. (1957). "Calculation of amount of information about a random function contained in another such function". American Mathematical Society Translations: Series 2. 12: 199–246. English translation of original in Uspekhi Matematicheskikh Nauk 2 (1): 3–52.
Guiasu, Silviu (1977). Information Theory with Applications. McGraw-Hill, New York. ISBN 978-0-07-025109-0.
Li, Ming; Vitányi, Paul (February 1997). An Introduction to Kolmogorov Complexity and Its Applications. New York: Springer-Verlag. ISBN 0-387-94868-6.
Lockhead, G. R. (1970). "Identification and the form of multidimensional discrimination space". Journal of Experimental Psychology. 85 (1): 1–10. doi:10.1037/h0029508. PMID 5458322.
David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0-521-64298-1 (available free online)
Haghighat, M. B. A.; Aghagolzadeh, A.; Seyedarabi, H. (2011). "A non-reference image fusion metric based on mutual information of image features". Computers & Electrical Engineering. 37 (5): 744–756. doi:10.1016/j.compeleceng.2011.07.012.
Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)
Witten, Ian H. & Frank, Eibe (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Amsterdam. ISBN 978-0-12-374856-0.
Peng, H.C.; Long, F.; Ding, C. (2005). "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy". IEEE Transactions on Pattern Analysis and Machine Intelligence. 27 (8): 1226–1238. doi:10.1109/tpami.2005.159. PMID 16119262.
Andre S. Ribeiro; Stuart A. Kauffman; Jason Lloyd-Price; Bjorn Samuelsson & Joshua Socolar (2008). "Mutual Information in Random Boolean models of regulatory networks". Physical Review E. 77 (1). arXiv:0707.3642. doi:10.1103/physreve.77.011901.
Wells, W.M. III; Viola, P.; Atsumi, H.; Nakajima, S.; Kikinis, R. (1996). "Multi-modal volume registration by maximization of mutual information" (PDF). Medical Image Analysis. 1 (1): 35–51. doi:10.1016/S1361-8415(01)80004-9. PMID 9873920.
Chapter 10
Cross entropy
In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the true distribution p.
The cross entropy for the distributions p and q over a given set is defined as follows:
$$H(p, q) = \mathrm{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \,\|\, q),$$
where H(p) is the entropy of p, and $D_{\mathrm{KL}}(p \| q)$ is the Kullback–Leibler divergence of q from p. For discrete p and q this means
$$H(p, q) = -\sum_{x} p(x)\, \log q(x).$$
The situation for continuous distributions is analogous. We have to assume that p and q are absolutely continuous with respect to some reference measure r (usually r is a Lebesgue measure on a Borel σ-algebra). Let P and Q be probability density functions of p and q with respect to r. Then
$$-\int_X P(x)\, \log Q(x)\, dr(x) = H(p, q).$$
NB: The notation H(p, q) is also used for a different concept, the joint entropy of p and q.
10.1 Motivation
In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities X can be seen as representing an implicit probability distribution $q(x_i) = 2^{-l_i}$ over X, where $l_i$ is the length of the code for $x_i$ in bits. Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution Q is assumed while the data actually follows a distribution P. That is why the expectation is taken over the probability distribution P and not Q:
$$H(p, q) = \mathrm{E}_p[l_i] = \mathrm{E}_p\left[\log \frac{1}{q(x_i)}\right]$$
$$H(p, q) = \sum_{x_i} p(x_i)\, \log \frac{1}{q(x_i)}$$
$$H(p, q) = -\sum_{x} p(x)\, \log q(x).$$
10.2 Estimation
There are many situations where cross-entropy needs to be measured but the distribution of p is unknown. An example is language modeling, where a model is created based on a training set T, and then its cross-entropy is measured on a test set to assess how accurate the model is in predicting the test data. In this example, p is the true distribution of words in any corpus, and q is the distribution of words as predicted by the model. Since the true distribution is unknown, cross-entropy cannot be directly calculated. In these cases, an estimate of cross-entropy is calculated using the following formula:
$$H(T, q) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)$$
where N is the size of the test set, and q(x) is the probability of event x estimated from the training set. The sum
is calculated over N . This is a Monte Carlo estimate of the true cross entropy, where the training set is treated as
samples from p(x) .
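A minimal sketch of this estimate, with a hypothetical model distribution q and a toy test set standing in for samples from the unknown p:

```python
import math

# Hypothetical model distribution q over a small vocabulary,
# standing in for probabilities estimated from a training set.
q = {"the": 0.5, "cat": 0.2, "sat": 0.2, "mat": 0.1}

# Toy test set, treated as N samples from the unknown true distribution p.
test_set = ["the", "cat", "the", "sat", "the", "mat", "the", "cat"]
N = len(test_set)

# Monte Carlo estimate: H(T, q) = -(1/N) * sum_i log2 q(x_i), in bits.
H_estimate = -sum(math.log2(q[x]) for x in test_set) / N
```

A real language model would need smoothing so that q assigns nonzero probability to every word in the test set; otherwise the estimate is infinite.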
In classification problems based on logistic regression, the predicted probability of the output y = 1 is
$$q_{y=1} = \hat{y} \equiv g(w \cdot x),$$
where $g(z) = 1/(1 + e^{-z})$ is the logistic function and the vector of weights w is optimized through some appropriate algorithm such as gradient descent. Similarly, the complementary probability of finding the output y = 0 is simply given by
$$q_{y=0} = 1 - \hat{y}.$$
The true (observed) probabilities can be expressed similarly as $p_{y=1} = y$ and $p_{y=0} = 1 - y$.
Having set up our notation, $p \in \{y, 1-y\}$ and $q \in \{\hat{y}, 1-\hat{y}\}$, we can use cross entropy to get a measure for similarity between p and q:
$$H(p, q) = -\sum_i p_i \log q_i = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}).$$
The typical loss function that one uses in logistic regression is computed by taking the average of all cross-entropies in the sample. For example, suppose we have N samples with each sample labeled by n = 1, ..., N. The loss function is then given by:
$$L(w) = \frac{1}{N} \sum_{n=1}^{N} H(p_n, q_n) = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log(1 - \hat{y}_n) \right],$$
where $\hat{y}_n \equiv g(w \cdot x_n)$, with g(z) the logistic function as before.
The logistic loss is sometimes called cross-entropy loss. It is also known as log loss (in this case, the binary label is often denoted by {−1, +1}).[2]
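The loss function can be evaluated directly from the formula; the weights and samples below are hypothetical:

```python
import math

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(w, X, y):
    """Average binary cross-entropy over N samples:
    L(w) = -(1/N) * sum_n [ y_n log yhat_n + (1 - y_n) log(1 - yhat_n) ],
    with yhat_n = g(w . x_n), using the natural logarithm.
    """
    total = 0.0
    for x_n, y_n in zip(X, y):
        yhat = g(sum(wi * xi for wi, xi in zip(w, x_n)))
        total += y_n * math.log(yhat) + (1 - y_n) * math.log(1 - yhat)
    return -total / len(y)

# Hypothetical weights and data: two features, three labeled samples.
w = [1.0, -1.0]
X = [[2.0, 1.0], [0.5, 2.0], [1.0, 1.0]]
y = [1, 0, 1]
loss = logistic_loss(w, X, y)
```

In practice one would add clipping or the numerically stable log-sum-exp form before minimizing this with gradient descent, but the plain formula suffices to illustrate the definition.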
10.6 References
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press. Online
[2] Murphy, Kevin (2012). Machine Learning: A Probabilistic Perspective. MIT. ISBN 9780262018029.
de Boer, Pieter-Tjerk; Kroese, Dirk P.; Mannor, Shie; Rubinstein, Reuven Y. (February 2005). "A Tutorial on the Cross-Entropy Method" (PDF). Annals of Operations Research. 134 (1): 19–67. doi:10.1007/s10479-005-5724-z. ISSN 1572-9338.
Text
Information theory Source: https://en.wikipedia.org/wiki/Information_theory?oldid=752055055 Contributors: AxelBoldt, Brion VIBBER, Timo Honkasalo, Ap, Graham Chapman, XJaM, Toby Bartels, Hannes Hirzel, Edward, D, PhilipMW, Michael Hardy, Isomorphic,
Kku, Bobby D. Bryant, Varunrebel, Vinodmp, AlexanderMalmberg, Karada, (, Iluvcapra, Minesweeper, StephanWehner, Ahoerstemeier, Angela, LittleDan, Kevin Baas, Poor Yorick, Andres, Novum, Charles Matthews, Guaka, Bemoeial, Ww, Dysprosia, The Anomebot, Munford, Hyacinth, Ann O'nyme, Shizhao, AnonMoos, MH~enwiki, Robbot, Fredrik, Rvollmert, Seglea, Chancemill, Securiger,
SC, Lupo, Wile E. Heresiarch, ManuelGR, Ancheta Wis, Giftlite, Lee J Haywood, COMPATT, KuniShiro~enwiki, SoWhy, Andycjp,
Ynh~enwiki, Pcarbonn, OverlordQ, L353a1, Rdsmith4, APH, Cihan, Elektron, Creidieki, Neuromancien, Rakesh kumar, Picapica, D6,
Jwdietrich2, Imroy, CALR, Noisy, Rich Farmbrough, Guanabot, NeuronExMachina, Bishonen, Ericamick, Xezbeth, Bender235, MisterSheik, Crunchy Frog, El C, Spoon!, Simon South, Bobo192, Smalljim, Rbj, Maurreen, Nothingmuch, Photonique, Andrewbadr, Haham
hanuka, Pearle, Mpeisenbr, Mdd, Msh210, Uncle Bill, Pouya, BryanD, PAR, Cburnett, Jheald, Geraldshields11, Kusma, DV8 2XL,
Oleg Alexandrov, FrancisTyers, Velho, Woohookitty, Linas, Mindmatrix, Ruud Koot, Eatsaq, Eyreland, SeventyThree, Kanenas, Graham87, Josh Parris, Mayumashu, SudoMonas, Arunkumar, HappyCamper, Bubba73, Alejo2083, Chris Pressey, Mathbot, Annacoder,
Nabarry, Srleer, Chobot, DVdm, Commander Nemet, FrankTobia, Siddhant, YurikBot, Wavelength, RobotE, RussBot, Michael Slone,
Loom91, Grubber, ML, Yahya AbdalAziz, Raven4x4x, Moe Epsilon, DanBri, BMAH07, Allchopin, Light current, Mceliece, Arthur Rubin, Lyrl, GrinBot~enwiki, Sardanaphalus, Lordspaz, SmackBot, Imz, Henri de Solages, Incnis Mrsi, Reedy, InverseHypercube, Cazort,
Gilliam, Metacomet, Octahedron80, Spellchecker, Colonies Chris, Jahiegel, Unnikrishnan.am, LouScheer, Calbaer, EPM, Djcmackay,
Michael Ross, Tyrson, Jon Awbrey, Het, Bidabadi~enwiki, Chungc, SashatoBot, Nick Green, Harryboyles, Sina2, Lachico, Almkglor,
Bushsf, Sir Nicholas de Mimsy-Porpington, FreezBee, Dicklyon, EKartoel, Wizard191, Matthew Verey, Isvish, ScottHolden, CapitalR,
Gnome (Bot), Tawkerbot2, Marty39, Daggerstab, CRGreathouse, Thermochap, Ale jrb, Thomas Keyes, Mct mht, Pulkitgrover, Grgarza, Maria Vargas, Roman Cheplyaka, Hpalaiya, Vanished User jdksfajlasd, Nearfar, Heidijane, Thijs!bot, N5iln, WikiIT, Headbomb,
James086, PoulyM, Edchi, D.H, Jvstone, HSRT, JAnDbot, BenjaminGittins, RainbowCrane, Jthomp4338, Dricherby, Buettcher, MetsBot, David Eppstein, Pax:Vobiscum, Logan1939, MartinBot, Tamer ih~enwiki, Sigmundg, Jargon777, Policron, Useight, VolkovBot,
Joeoettinger, JohnBlackburne, Jimmaths, Constant314, Starrymessenger, Kjells, Magmi, AllGloryToTheHypnotoad, Bemba, Lamro,
Radagast3, Newbyguesses, SieBot, Ivan tambuk, Robert Loring, Masgatotkaca, Junling, Pcontrop, Algorithms, Anchor Link Bot, Melcombe, ClueBot, Fleem, Ammarsakaji, Estirabot, 7&6=thirteen, Oldrubbie, Vegetator, Singularity42, Dziewa, Lambtron, Johnuniq,
SoxBot III, HumphreyW, Vanished user uih38riiw4hjlsd, Mitch Ames, Addbot, Deepmath, Peerc, Eweinber, Sun Ladder, C9900, Blaylockjam10, L.exsteens, Xlasne, LuK3, Egoistorms, Luckas-bot, Quadrescence, Yobot, TaBOT-zerem, Taxisfolder, Carleas, Twohoos,
Cassandra Cathcart, Dbln, Materialscientist, Informationtheory, Citation bot, Jingluolaodao, Expooz, Raysonik, Xqbot, Ywaz, Isheden,
Informationtricks, Dani.gomezdp, PHansen, Masrudin, FrescoBot, Nageh, Tiramisoo, Sanpitch, Gnomehacker, Pinethicket, Momergil,
Jonesey95, SkyMachine, SchreyP, Lotje, Miracle Pen, Vanadiumho, Kastchei, Djjr, EmausBot, WikitanvirBot, Primefac, Jmencisom,
Wikipelli, Dcirovic, Bethnim, Quondum, Henriqueroscoe, Terra Novus, ClueBot NG, Wcherowi, MelbourneStar, BarrelProof, TimeOfDei, Frietjes, Thepigdog, Pzrq, MrJosiahT, Lawsonstu, Helpful Pixie Bot, Leopd, BG19bot, Vaulttech, Wiki13, Trevayne08, CitationCleanerBot, Brad7777, Schafer510, BattyBot, David.moreno72, Bankmichael1, FoCuSandLeArN, SFK2, Jochen Burghardt, Limittheorem, Dschslava, Phamnhatkhanh, Szzoli, 314Username, Roastliras, Eigentensor, Comp.arch, SakeUPenn, Logan.dunbar, Leegrc,
Prof. Michael Bank, DanBalance, JellydPuppy, KasparBot, Kk, Lr0^^k, Mingujizaixin, DerGuteSamariter, Sisu55, Capriciousknees,
Bhannel, Fmadd, Dello1234 and Anonymous: 282
Self-information Source: https://en.wikipedia.org/wiki/Self-information?oldid=744791353 Contributors: Kku, Charles Matthews, Maximus Rex, Khym Chanur, Jeq, Brona, MisterSheik, Flammifer, Melaen, Jheald, Recury, Linas, The Rambling Man, RussBot, InverseHypercube, Mbset, Mcld, Spirituelle, Talgalili, Honing, BrotherE, Coffee2theorems, Catslash, Kjells, PaulTanenbaum, Mundhenk, Jiuren,
UKoch, Vql, PixelBot, Addbot, Okurtsev, Mrocklin, Lightbot, Fryedpeach, Yobot,
, Ashpilkin, AnomieBOT, Br77rino, FrescoBot,
Igor Yalovecky, EmausBot, Lueling, Quondum, BG19bot, Rblazek, Kodiologist, JYBot, Enterprisey, Sminthopsis84, Dierkam and
Anonymous: 35
Entropy (information theory) Source: https://en.wikipedia.org/wiki/Entropy_(information_theory)?oldid=752813360 Contributors: Tobias Hoevekamp, Derek Ross, Bryan Derksen, The Anome, Ap, PierreAbbat, Rade Kutil, Waveguy, B4hand, Youandme, Olivier, Stevertigo, Michael Hardy, Kku, Mkweise, Ahoerstemeier, Snoyes, AugPi, Rick.G, Ww, Sbwoodside, Dysprosia, Jitse Niesen, Fibonacci,
PaulL~enwiki, Omegatron, Jeq, Noeckel, Robbot, Tomchiukc, Benwing, Netpilot43556, Rursus, Bkell, Tea2min, Stirling Newberry,
Giftlite, Donvinzk, Boaz~enwiki, Peruvianllama, Brona, Romanpoet, Udo.bellack, Jabowery, Christopherlin, Neilc, Gubbubu, Beland,
OverlordQ, MarkSweep, Karol Langner, Wiml, Bumm13, Sctfn, Zeman, Abdull, TheObtuseAngleOfDoom, Rich Farmbrough, ArnoldReinhold, Bender235, ESkog, MisterSheik, Jough, Guettarda, Cretog8, Army1987, Foobaz, Franl, Flammifer, Sjschen, Sligocki, PAR,
Cburnett, Jheald, Tomash, Oleg Alexandrov, Linas, Shreevatsa, LOL, Bkwillwm, Male1979, Ryan Reich, Btyner, Marudubshinki, Graham87, BD2412, Jetekus, Grammarbot, Nanite, Sj, Rjwilmsi, Thomas Arelatensis, Nneonneo, Erkcan, Alejo2083, Mfeadler, Srleffler,
Vonkje, Chobot, DVdm, Flashmorbid, Wavelength, Alpt, Kymacpherson, Ziddy, Kimchi.sg, Afelton, Buster79, Brandon, Hakeem.gadi,
Vertigre, DmitriyV, GrinBot~enwiki, SmackBot, InverseHypercube, Fulldecent, IstvanWolf, Diegotorquemada, Mcld, Gilliam, Ohnoitsjamie, Dauto, Kurykh, Gutworth, Nbarth, DHNbot~enwiki, Colonies Chris, Jdthood, Rkinch, Javalenok, CorbinSimpson, Wen D
House, Radagast83, Cybercobra, Mrander, DMacks, FilippoSidoti, Daniel.Cardenas, Michael Rogers, Andrei Stroe, Ohconfucius, Snowgrouse, Dmh~enwiki, Ninjagecko, Michael L. Hall, JoseREMY, Severoon, Nonsuch, Phancy Physicist, KeithWinstein, Seanmadsen,
Dicklyon, Shockem, Ryan256, Dan Gluck, Kencf0618, Dwmalone, AlainD, Ylloh, CmdrObot, Hanspi, CBM, Mcstrother, Citrus538,
Neonleonb, FilipeS, Tkircher, Farzaneh, Blaisorblade, Ignoramibus, Michael C Price, Alexnye, SteveMcCluskey, Nearfar, Thijs!bot,
WikiC~enwiki, Edchi, EdJohnston, D.H, Phy1729, Jvstone, Seaphoto, Heysan, Zylorian, Dougher, Husond, OhanaUnited, Time3000,
Shaul1, Coffee2theorems, Magioladitis, RogierBrussee, VoABot II, Albmont, Swpb, First Harmonic, JaGa, Kestasjk, Tommy Herbert,
Pax:Vobiscum, R'n'B, CommonsDelinker, Coppertwig, Policron, Jobonki, Jvpwiki, Ale2006, Idiomabot, Cuzkatzimhut, Trevorgoodchild, Aelkiss, Trachten, Saigyo, Kjells, Go2slash, BwDraco, Mermanj, Spinningspark, PhysPhD, Bowsmand, Michel.machado, TimProof, Maxlittle2007, Hirstormandy, Neil Smithline, Dailyknowledge, Flyer22 Reborn, Mdsam2~enwiki, EnOreg, Algorithms, Svick,
AlanUS, Melcombe, Rinconsoleao, Alksentrs, Schuermann~enwiki, Vql, Djr32, Blueyeru, TedDunning, Musides, Ra2007, Qwfp, Johnuniq, Kace7, Porphyro, Addbot, Deepmath, Landon1980, Olli Niemitalo, Fgnievinski, Hans de Vries, Mv240, MrVanBot, JillJnn,
Favonian, ChenzwBot, Wikomidia, Numbo3-bot, Ehrenkater, Tide rolls, Lightbot, Fryedpeach, Eastereaster, Luckas-bot, Yobot, Sobec,
Cassandra Cathcart, AnomieBOT, Jim1138, Zandr4, Mintrick, Informationtheory, Belkovich, ArthurBot, Xqbot, Ywaz, Gusshoekey,
Br77rino, Almabot, GrouchoBot, Omnipaedista, RibotBOT, Ortvolute, Entropeter, Constructive editor, FrescoBot, Hobsonlane, GEB
Stgo, Olexa Riznyk, Mhadi.afrasiabi, Orubt, Rc3002, HRoestBot, Cesarth73, RedBot, Cfpcompte, Pmagrass, Mduteil, Lotje, BlackAce48, LilyKitty, Angelorf, 777sms, CobraBot, Duoduoduo, Aoidh, Spakin, Hoelzro, Mean as custard, Jann.poppinga, Mitch.mcquoid,
Born2bgratis, Lalaithion, Gopsy, Racerx11, Mo ainm, Hhhippo, Purplie, Martboy722, Quondum, SporkBot, Music Sorter, Erianna,
Elsehow, Fjoelskaldr, Alexander Misel, ChuispastonBot, Sigma0 1, DASHBotAV, ClueBot NG, Tschijnmotschau, Matthiaspaul, Raymond Ellis, Mesoderm, SheriKLap, Helpful Pixie Bot, Bibcode Bot, BG19bot, Guy vandegrift, Eli of Athens, Marcocapelle, Hushaohan, Trombonechamp, Manoguru, Muhammad Shuaib Nadwi, BattyBot, ChrisGualtieri, Marek marek, VLReeder77, Jrajniak89, Cerabot~enwiki, Jiejie9988, Fourshade, Frosty, SFK2, Szzoli, Chrislgarry, I am One of Many, Phabius99, Jamesmcmahon0, Altroware,
OhGodItsSoAmazing, Tchanders, Suderpie, Ynaamad, AkselA, Orehet, Monkbot, Yikkayaya, Leegrc, Visme, Donen1937, WikiRambala, Magriteappleface, Oisguad, Boky90, Auerbachkeller, Isambard Kingdom, Secvz, Tinysciencecow, KasparBot, Radegast, BourkeM,
UMD Xuechi, Anareth, Jackbirda, Edlihtam6su, Handers Illian, LathiaRutvik, Bender the Bot, Sethx1138, CitySlicker2016 and Anonymous: 338
Binary entropy function Source: https://en.wikipedia.org/wiki/Binary_entropy_function?oldid=715332226 Contributors: Jheald, Linas,
Alejo2083, Jaraalbe, SmackBot, Calbaer, Ylloh, WikiC~enwiki, Hermel, JaGa, Trachten, Aaron Rotenberg, Venny85, UKoch, FrescoBot,
Quondum, Neighbornou, Fourshade and Anonymous: 9
Differential entropy Source: https://en.wikipedia.org/wiki/Differential_entropy?oldid=748499150 Contributors: Michael Hardy, Kku,
Karada, Den fjättrade ankan~enwiki, Giftlite, TheObtuseAngleOfDoom, PAR, Jheald, Count Iblis, Oleg Alexandrov, Woohookitty,
Nanite, The Rambling Man, Vertigre, Teply, Bo Jacoby, SmackBot, Diegotorquemada, Mcld, Nbarth, Colonies Chris, Memming,
Radagast83, Yuide, Pulkitgrover, Blaisorblade, Zickzack, Headbomb, Ioeth, Every Creek Counts, Coffee2theorems, Jorgenumata, Policron, Epistemenical, Kyle the bot, Nathan B. Kitchen, Vlsergey, Rinconsoleao, Drazick, Daniel Hershcovich, Guozj02, Kaba3, Qwfp,
Webtier~enwiki, Addbot, Deepmath, Olli Niemitalo, Yobot, Informationtheory, Kwiki, Stefanhost, Slxu.public, RjwilmsiBot, Cogiati,
Quondum, Fjoelskaldr, Jrnold, WJVaughn3, Vladimir Iok, Bibcode Bot, Solomon7968, BattyBot, ChrisGualtieri, SFK2, JGTZ, Limittheorem, Szzoli, Ianweiner, Monkbot, Leegrc, Bender the Bot and Anonymous: 32
Diversity index Source: https://en.wikipedia.org/wiki/Diversity_index?oldid=726822838 Contributors: The Anome, Michael Hardy,
Kku, Ronz, Den fjättrade ankan~enwiki, Duncharris, Andycjp, Forbsey, Wtmitchell, Jheald, Jackhynes, Stemonitis, Tabletop, Rjwilmsi,
Mathbot, Wavelength, Hillman, Sasuke Sarutobi, Carabinieri, Ilmari Karonen, SmackBot, Malkinann, Gilliam, Bluebot, Nbarth, Eliezg,
Richard001, Ligulembot, Dogears, Amdurbin, BeenAroundAWhile, AshLin, Myasuda, TimVickers, R'n'B, DrMicro, Classical geographer, Timios, Flyer22 Reborn, Melcombe, Jfdarmo, Rumping, Niceguyedc, Laboratory, Addbot, NjardarBot, West.andrew.g, Denicho,
Luckas-bot, Yobot, AnomieBOT, Citation bot, Sylwia Ufnalska, Asif Qasmov, NSH002, Binjiangwiki, HRoestBot, Xnus, TuHan-Bot,
Dcirovic, Bamyers99, ClueBot NG, , Widr, David Blundon, Pacerier, Frze, Fodon, Lileiting, Cricetus, Ilyapon, Zz9296,
Carl Lehto, Jochen Burghardt, Kenadra, Monkbot, Leegrc, Darw15ish, Tracedragon762 and Anonymous: 40
Conditional entropy Source: https://en.wikipedia.org/wiki/Conditional_entropy?oldid=740301777 Contributors: The Anome, SebastianHelm, AugPi, Romanpoet, Stern~enwiki, MarkSweep, Creidieki, MisterSheik, Macl, PAR, Jheald, Linas, GregorB, Qwertyus, Thomas
Arelatensis, YurikBot, Zvika, SmackBot, Mcld, Betacommand, Tplayford, Njerseyguy, Tsca.bot, Solarapex, Sadi Carnot, A. Pichler, Ylloh, Thermochap, MaxEnt, Meznaric, Giromante, MagusMind, Sterrys, Magioladitis, Mozaher, DavidCBryant, AleaIntrica, Celique,
Alexbot, Addbot, Yobot, AnomieBOT, Lynxoid84, Citation bot, Oashi, RedBot, Peter.prettenhofer, KonradVoelkel, Quondum, ClueBot
NG, Apalmigiano, JeanM, Stretchhhog, David.ryan.snyder, Luis Goslin, Latexyow and Anonymous: 42
Joint entropy Source: https://en.wikipedia.org/wiki/Joint_entropy?oldid=727418967 Contributors: Edward, Sander123, Stern~enwiki,
MarkSweep, Sam Hocevar, Creidieki, Alperen, AndersKaseorg, PAR, Jheald, Linas, Male1979, Mysid, SmackBot, Mcld, Bluebot, Memming, Solarapex, Rebooted, Edchi, Don Quixote de la Mancha, Phebot, Winsteps, Xiaop c, Addbot, Olli Niemitalo, Ptbotgourou,
AnomieBOT, Citation bot, John737, Olexa Riznyk, Enthdegree, KonradVoelkel, ZroBot, Smartdust, Mark viking and Anonymous: 19
Mutual information Source: https://en.wikipedia.org/wiki/Mutual_information?oldid=750615501 Contributors: Michael Hardy, Paul
Barlow, Kku, AugPi, Cherkash, Willem, Charles Matthews, Dcoetzee, Jitse Niesen, VeryVerily, Secretlondon, Canjo, Kahn~enwiki,
Wile E. Heresiarch, Ancheta Wis, Giftlite, Sepreece, Romanpoet, Stern~enwiki, Eequor, Macrakis, Fangz, MarkSweep, Elroch, Creidieki, MisterSheik, Art LaPella, Bobo192, A1kmm, Photonique, Anthony Appleyard, Apoc2400, PAR, Jheald, Oleg Alexandrov, Linas,
Jörg Knappen~enwiki, Rjwilmsi, Thomas Arelatensis, Gwiki~enwiki, Sderose, Benja, Welsh, Balizarde, Saharpirmoradian, Ses~enwiki,
Naught101, SmackBot, Took, Mcld, Nervexmachina, Njerseyguy, Miguel Andrade, Colonies Chris, Memming, Solarapex, Tmg1165,
Moala, Dfass, Ben Moore, Hu12, Freelance Intellectual, Thermochap, Shorespirit, Ppgardne, Thijs!bot, Headbomb, Afriza, Jddriessen,
Kirrages, Bgrot, Fritz.obermeyer, Daisystanton, Baccyak4H, Originalname37, Agentesegreto, Mozaher, Pdcook, Ged.R, Aelkiss, Jamelan, VanishedUserABC, Rl1rl1, Astrehl, Lord Phat, Anchor Link Bot, ClueBot, Kolyma, MystBot, Addbot, Deepmath, Wli625, DOI
bot, Glutar, ChenzwBot, Luckas-bot, Yobot, AnomieBOT, Okisan, Lynxoid84, Kavas, Materialscientist, Citation bot, LilHelpa, Xqbot,
Af1523, Vthesniper, Olexa Riznyk, , Citation bot 1, RjwilmsiBot, Ghostofkendo, Jowa fan, Fly by Night, KonradVoelkel,
Dcirovic, Sgoder, AManWithNoPlan, ClueBot NG, Amircrypto, Helpful Pixie Bot, BG19bot, Craigacp, SciCompTeacher, Fsman,
Manoguru, Shashazhu1989, ChrisGualtieri, Sds57, Mogism, SFK2, Limittheorem, Keith David Smeltz, Me, Myself, and I are Here,
Szzoli, Phleg1, Monkbot, Velvel2, Sisu55 and Anonymous: 98
Cross entropy Source: https://en.wikipedia.org/wiki/Cross_entropy?oldid=744442721 Contributors: Kevin Baas, Samw, Jitse Niesen,
Pgan002, MarkSweep, MisterSheik, Jheald, Linas, Jörg Knappen~enwiki, Eclecticos, Kri, SmackBot, Keegan, Colonies Chris, J. Finkelstein, Ojan, Thijs!bot, Jrennie, Life of Riley, Addbot, Materialscientist, Nippashish, Erik9bot, Mydimle, WikitanvirBot, BG19bot,
Ahmahran, David.moreno72, ChrisGualtieri, Densonsmith, Velvel2, LordBm, KristianHolsheimer, Grzegorz Swirszcz, Chrishaack and
Anonymous: 20
10.8.2 Images
File:Binary_entropy_plot.svg Source: https://upload.wikimedia.org/wikipedia/commons/2/22/Binary_entropy_plot.svg License: CC-BY-SA-3.0 Contributors: original work by Brona, published on Commons at Image:Binary entropy plot.png. Converted to SVG by Alessio
Damato Original artist: Brona and Alessio Damato
File:Binary_erasure_channel.svg Source: https://upload.wikimedia.org/wikipedia/commons/b/b6/Binary_erasure_channel.svg License:
Public domain Contributors: Own work Original artist: David Eppstein
File:Binary_symmetric_channel.svg Source: https://upload.wikimedia.org/wikipedia/commons/b/b5/Binary_symmetric_channel.svg
License: Public domain Contributors: Own work Original artist: David Eppstein
File:CDSCRATCHES.jpg Source: https://upload.wikimedia.org/wikipedia/commons/5/5a/CDSCRATCHES.jpg License: Public domain Contributors: English Wikipedia, w:en:Image:CDSCRATCHES.JPG Original artist: en:user:Jam01
File:Comm_Channel.svg Source: https://upload.wikimedia.org/wikipedia/commons/4/48/Comm_Channel.svg License: Public domain
Contributors: en wikipedia Original artist: Dicklyon
File:Crypto_key.svg Source: https://upload.wikimedia.org/wikipedia/commons/6/65/Crypto_key.svg License: CC-BY-SA-3.0 Contributors: Own work based on Image:Key-crypto-sideways.png by MisterMatt originally from English Wikipedia Original artist: MesserWoland
File:Edit-clear.svg Source: https://upload.wikimedia.org/wikipedia/en/f/f2/Edit-clear.svg License: Public domain Contributors: The
Tango! Desktop Project. Original artist:
The people from the Tango! project. And according to the metadata in the file, specifically: Andreas Nilsson, and Jakub Steiner (although
minimally).
File:Entropy-mutual-information-relative-entropy-relation-diagram.svg Source: https://upload.wikimedia.org/wikipedia/commons/
d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg License: Public domain Contributors: Own work Original artist:
KonradVoelkel
File:Entropy_flip_2_coins.jpg Source: https://upload.wikimedia.org/wikipedia/commons/d/d4/Entropy_flip_2_coins.jpg License: CC
BY-SA 3.0 Contributors: File:Ephesos_620-600_BC.jpg Original artist: http://www.cngcoins.com/
File:Fisher_iris_versicolor_sepalwidth.svg Source: https://upload.wikimedia.org/wikipedia/commons/4/40/Fisher_iris_versicolor_sepalwidth.
svg License: CC BY-SA 3.0 Contributors: en:Image:Fisher iris versicolor sepalwidth.png Original artist: en:User:Qwfp (original); Pbroks13
(talk) (redraw)
File:Folder_Hexagonal_Icon.svg Source: https://upload.wikimedia.org/wikipedia/en/4/48/Folder_Hexagonal_Icon.svg License: Cc-by-sa-3.0 Contributors: ? Original artist: ?
File:Internet_map_1024.jpg Source: https://upload.wikimedia.org/wikipedia/commons/d/d2/Internet_map_1024.jpg License: CC BY
2.5 Contributors: Originally from the English Wikipedia; description page is/was here. Original artist: The Opte Project
File:Lock-green.svg Source: https://upload.wikimedia.org/wikipedia/commons/6/65/Lock-green.svg License: CC0 Contributors: en:
File:Free-to-read_lock_75.svg Original artist: User:Trappist the monk
File:Nuvola_apps_edu_mathematics_blue-p.svg Source: https://upload.wikimedia.org/wikipedia/commons/3/3e/Nuvola_apps_edu_
mathematics_blue-p.svg License: GPL Contributors: Derivative work from Image:Nuvola apps edu mathematics.png and Image:Nuvola
apps edu mathematics-p.svg Original artist: David Vignoni (original icon); Flamurai (SVG conversion); bayo (color)
File:Portal-puzzle.svg Source: https://upload.wikimedia.org/wikipedia/en/f/fd/Portal-puzzle.svg License: Public domain Contributors:
? Original artist: ?
File:Question_book-new.svg Source: https://upload.wikimedia.org/wikipedia/en/9/99/Question_book-new.svg License: Cc-by-sa-3.0
Contributors:
Created from scratch in Adobe Illustrator. Based on Image:Question book.png created by User:Equazcion Original artist:
Tkgd2007
File:Symbol_template_class.svg Source: https://upload.wikimedia.org/wikipedia/en/5/5c/Symbol_template_class.svg License: Public
domain Contributors: ? Original artist: ?
File:Wikiquote-logo.svg Source: https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg License: Public domain
Contributors: Own work Original artist: Rei-artur
10.8.3 Content license