CHAPTER 6

Shannon entropy
This chapter is a digression into information theory, a fascinating subject that
arose once the notion of information was made precise and quantifiable.
Strictly speaking, information theory has nothing to do with physics.
However, the concept of Shannon entropy shares some intuition with Boltzmann's,
and some of the mathematics developed in information theory turns out to be
relevant in statistical mechanics. This chapter contains only very basic material on
information theory.
1. Quantifying information
Computers have popularized the notion of the bit, a unit of information that takes
one of two values, 0 or 1. We introduce the information size H0(A) of a set A as the
number of bits necessary to encode each element of A separately, i.e.

H0(A) = log2 |A|.   (6.1)

This quantity has a unit, the bit. If we have two sets A and B, then

H0(A × B) = H0(A) + H0(B).   (6.2)

This additivity justifies the logarithm. The information size of a set is not necessarily an integer.
If we actually need to encode the elements of A, the number of necessary bits is ⌈H0(A)⌉
rather than H0(A); but this is irrelevant for the theory.
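As a quick illustration, the following Python snippet (the alphabets are arbitrary examples) computes the information size and checks the additivity property (6.2):

    import math

    def information_size(A):
        """Information size H0(A) = log2 |A|, in bits."""
        return math.log2(len(A))

    latin = set("abcdefghijklmnopqrstuvwxyz")     # 26 letters
    digits = set("0123456789")                    # 10 symbols
    pairs = {(a, d) for a in latin for d in digits}

    # Additivity (6.2): H0(A x B) = H0(A) + H0(B)
    assert abs(information_size(pairs)
               - (information_size(latin) + information_size(digits))) < 1e-9

    # An actual encoding needs ceil(H0(A)) bits per element
    print(information_size(latin), math.ceil(information_size(latin)))   # 4.70... and 5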
The ideas above are natural and anybody might have invented the concept of
information size. The next notion, the information gain, is intuitive but needed a
genius to define and quantify it.
Suppose you need to uncover a certain English word of five letters, and you manage
to obtain one of its letters, namely an e. This is useful, but the letter e is
common in English, so it provides little information. If, on the other hand, the
letter that you discover is a j (the least common letter in English), the search has been
narrowed much more and you have obtained more information. The information gain
quantifies this heuristic.
We need to introduce a relevant formalism. Let A = (A, p) be a discrete
probability space. That is, A = {a1, ..., an} is a finite set, and each element ai has
probability pi. (The σ-algebra is the set of all subsets of A.) The information gain
G(B|A) measures the gain obtained by the knowledge that the outcome belongs to
the set B ⊂ A. We denote p(B) = Σ_{i∈B} pi.
Definition 6.1. The information gain is
G(B|A) = log2 (1/p(B)) = − log2 p(B).
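As a small numerical illustration, here is a Python sketch of the information gain for an arbitrary toy distribution; it also checks, on one example, the additivity property derived just below:

    import math

    def gain(p_B):
        """Information gain G(B|A) = -log2 p(B), in bits."""
        return -math.log2(p_B)

    # Arbitrary toy distribution on A = {a1, ..., a4}
    p = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}

    def p_of(B):
        return sum(p[a] for a in B)

    print(gain(p_of({"a1"})))   # 1 bit:  a likely outcome gives little information
    print(gain(p_of({"a4"})))   # 3 bits: an unlikely outcome gives more

    # Additivity for B subset of C subset of A: G(B|A) = G(C|A) + G(B|C)
    B, C = {"a4"}, {"a3", "a4"}
    assert abs(gain(p_of(B)) - (gain(p_of(C)) - math.log2(p_of(B) / p_of(C)))) < 1e-12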

The unit for the information gain is the bit; we gain 1 bit if p(B) = 1/2. The information gain is positive, as it should be, and it satisfies the following additivity property. Let B ⊂ C ⊂ A. The gain for knowing that the outcome is in C is G(C|A) = − log2 p(C). The gain for knowing that it is in B, after knowing that it is in C, is

G(B|C) = − log2 p(B|C) = − log2 [p(B)/p(C)].   (6.3)

It follows that G(B|A) = G(C|A) + G(B|C).

2. Shannon entropy

It is named after Shannon¹, although its origin goes back to Pauli and von Neumann.

Definition 6.2. The Shannon entropy of A is

H(A) = − Σ_{i=1}^n pi log2 pi.

The extension to continuum probability spaces is not straightforward and we do not discuss it here.

Proposition 6.1. H(A) ≤ log2 n, with equality iff p(ai) = 1/n for all i.

Proof. Since − log2 is convex, we have from Jensen's inequality

− log2 n = − log2 ( Σ_{i=1}^n pi (1/pi) ) ≤ Σ_{i=1}^n pi ( − log2 (1/pi) ) = −H(A).   (6.4)

Since − log2 is strictly convex, the inequality above is strict unless p1 = · · · = pn. □

The space of probabilities on A is the convex set

P = { (p1, ..., pn) : 0 ≤ pi ≤ 1, Σ_i pi = 1 }.   (6.5)

(It is actually a simplex.)

Proposition 6.2. H is strictly concave with respect to p. That is, for distinct p^(1), p^(2) ∈ P and 0 < α < 1, we have

H(α p^(1) + (1 − α) p^(2)) > α H(p^(1)) + (1 − α) H(p^(2)).

Proof. Given p in the interior of P, let q = (q1, ..., qn) be such that Σ_i qi = 0, and such that p + λq ∈ P provided the number λ is small enough. Then, writing H(p) instead of H(A = (A, p)),

H(p + λq) = − Σ_{i=1}^n (pi + λqi) log2 (pi + λqi).   (6.6)

The derivatives with respect to λ are

dH/dλ = − Σ_{i=1}^n qi log2 (pi + λqi),
d²H/dλ² = − (1/log 2) Σ_{i=1}^n qi² / (pi + λqi).   (6.7)

(The terms qi/log 2 coming from the derivative of the logarithm cancel, because Σ_i qi = 0.) The latter is strictly negative, so H is strictly concave. □

¹ The American Claude Shannon (1916–2001) wrote A mathematical theory of communication in 1948, an article that created information theory.
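Propositions 6.1 and 6.2 are easy to check numerically; the following sketch uses arbitrary example distributions:

    import math

    def shannon_entropy(p):
        """H(p) = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0)."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    n = 4
    uniform = [1.0 / n] * n
    skewed = [0.5, 0.25, 0.125, 0.125]

    # Proposition 6.1: H(A) <= log2 n, with equality for the uniform distribution
    assert abs(shannon_entropy(uniform) - math.log2(n)) < 1e-12
    assert shannon_entropy(skewed) < math.log2(n)

    # Proposition 6.2: strict concavity, checked at one interior point
    alpha = 0.3
    mix = [alpha * u + (1 - alpha) * s for u, s in zip(uniform, skewed)]
    assert shannon_entropy(mix) > (alpha * shannon_entropy(uniform)
                                   + (1 - alpha) * shannon_entropy(skewed))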

Definition 6.3. The relative entropy of the probability p with respect to the probability q is

H(p|q) = Σ_{i=1}^n pi log2 (pi/qi).   (6.8)

The definition is not symmetric under exchanges of p and q: H(p|q) ≠ H(q|p) unless p = q.

Proposition 6.3. H(p|q) ≥ 0, with equality iff p = q.

Proof. By Jensen,

H(p|q) = − Σ_i pi log2 (qi/pi) ≥ − log2 ( Σ_i pi (qi/pi) ) = − log2 ( Σ_i qi ) = 0.   □

3. Relation with Boltzmann entropy

Statistical mechanics provides three probability measures on the phase space: the microcanonical, canonical, and grand-canonical measures. We now study the Shannon entropy of these measures. For simplicity we consider the Ising model in the lattice gas interpretation, but the present discussion is clearly more general. Recall that the domain D ⊂ Z^d is discrete, that the state space is Ω = {0, 1}^D, and that the Hamiltonian Ω → R involves a sum over nearest neighbours.

Microcanonical ensemble. The probability is uniform over all states with energy U and number of particles N:

p_micro(ω) = 1/X(U, D, N)   if H(ω) = U and N(ω) = N,
           = 0              otherwise.

Then

H(p_micro) = log2 X(U, D, N) = (1/(kB log 2)) S(U, D, N).

That is, up to the factor kB log 2, it coincides with the thermodynamic entropy.

Canonical ensemble. We now have the Gibbs factor,

p_can(ω) = e^{−βH(ω)} / Y(β, D, N)   if N(ω) = N,
         = 0                         otherwise.

One easily obtains

H(p_can) = (1/log 2) [ β⟨H(ω)⟩ + log Y(β, D, N) ].

The average of the Hamiltonian is equal to the thermodynamic energy U(β, D, N), and the logarithm of Y is equal to −βF(β, D, N). Recall that the free energy is related to the energy and the entropy by F = U − TS. With β = 1/kB T, we see that

H(p_can) = (1/(kB log 2)) S(β, D, N).
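The identity for H(p_can) can be verified numerically on a very small system. The sketch below uses an illustrative one-dimensional lattice gas (the chain of 6 sites, the nearest-neighbour attraction −n_i n_j, and the values of β and N are arbitrary choices made for the demonstration):

    import math
    from itertools import combinations

    L, N, beta = 6, 3, 1.3   # illustrative choices

    def energy(occupied):
        """H(omega) = -(number of occupied nearest-neighbour pairs on the chain)."""
        return -sum(1 for i in occupied if i + 1 in occupied)

    states = [set(c) for c in combinations(range(L), N)]   # all states with N(omega) = N
    weights = [math.exp(-beta * energy(s)) for s in states]
    Y = sum(weights)                                       # canonical partition function
    p_can = [w / Y for w in weights]

    H_shannon = -sum(p * math.log2(p) for p in p_can)
    avg_energy = sum(p * energy(s) for p, s in zip(p_can, states))

    # H(p_can) = (beta <H> + log Y) / log 2
    assert abs(H_shannon - (beta * avg_energy + math.log(Y)) / math.log(2)) < 1e-9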

Grand-canonical ensemble. The probability p_gd-can now involves the chemical potential µ. As is checked in the exercises, we have

H(p_gd-can) = (1/(kB log 2)) S(β, D, µ).

A thorough rigorous discussion of the relations between Shannon and thermodynamic entropies, and of the theory of large deviations, can be found in [Pfister, 2002]².

4. Shannon theorem

A basic problem in information theory deals with encoding large quantities of information. We start with a finite set A; it can denote the 26 letters of the Latin alphabet, or the 128 ASCII symbols, or a larger set of words. A string of N characters is selected at random using some probability (the probability of (a1, ..., aN) ∈ A^N is ∏_i p(ai), i.e. we assume independence). One wants to encode this string, to send it (or to store it), and to decode it. We assume that no error is committed during these operations, except that the string may lie outside the set of codable strings; in that case we lose the information.

The occurrence of probabilities may be confusing, so a discussion is needed. In the real world, information is selected for a purpose and its content is well-chosen. But we modelize the problem of information transmission using probabilities. Strange as it may seem, an MP3 file of Manon Lescaut has been compressed as if the music was the result of random composing by Puccini, random performing by the orchestra, and random singing by the soli! This modelization is behind all compression algorithms.

We consider a file that contains N ≫ |A| symbols, with N large. How many bits are required so that the file can be encoded without loss of information? The answer is given by the information size, H0(A^N) = N H0(A). The question becomes more interesting, and the answer more surprising, if we allow an error δ: we now seek to encode only files that fall in a set B ⊂ A^N such that p(B) > 1 − δ. If a file turns out to be in A^N \ B, then we lose the information. The information size is now given by Hδ(A^N), where

Hδ(A) = inf_{B ⊂ A, p(B) > 1−δ} log2 |B|.   (6.9)

Notice that lim_{δ→0} Hδ(A) = H0(A), and that Hδ(A) ≤ H0(A). Notice also that H(A) ≤ H0(A), with equality iff the elements of A come with equal probability. We have H0(A^N) = N H0(A), but Hδ(A^N) is smaller than N Hδ(A) in general. So we want to say something about Hδ(A^N).

Theorem I (Shannon source coding theorem). For any δ > 0,

lim_{N→∞} (1/N) Hδ(A^N) = H(A).

Notice that the limit in the theorem is a true limit, not a lim inf or a lim sup. The theorem says that if we allow for a tiny error, and if our message is large (depending on the error), the number of required bits is roughly N H(A). Thus Shannon entropy gives the optimal compression rate, a rate that can be approached but not improved.

² Ch.-É. Pfister, Thermodynamical aspects of classical lattice systems, in: In and out of equilibrium: Physics with a probability flavor, Progr. Probab. 51, Birkhäuser (2002).
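Before turning to the proof, here is a brute-force illustration of the theorem for a small biased alphabet (the probabilities and δ are arbitrary choices, and math.prod needs Python 3.8 or later; the convergence is slow, but the ratio Hδ(A^N)/N does drift towards H(A) as N grows):

    import math
    from itertools import product

    p = {"a": 0.9, "b": 0.1}                                 # illustrative two-letter alphabet
    H = -sum(q * math.log2(q) for q in p.values())           # about 0.469 bits
    delta = 0.01

    def H_delta(N):
        """Brute-force H_delta(A^N): keep the most probable strings until p(B) > 1 - delta."""
        probs = sorted((math.prod(p[x] for x in s) for s in product(p, repeat=N)),
                       reverse=True)
        total, count = 0.0, 0
        for q in probs:
            total += q
            count += 1
            if total > 1 - delta:
                break
        return math.log2(count)

    for N in (4, 8, 12, 16):
        print(N, round(H_delta(N) / N, 3))   # compare with H(A), about 0.469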

Proof. It is based on the (weak) law of large numbers. Consider the random variable − log2 p(a). The law of large numbers states that, for any ε > 0,

lim_{N→∞} Prob{ (a1, ..., aN) : | (1/N) Σ_{i=1}^N (− log2 p(ai)) − E(− log2 p(a)) | > ε } = 0.   (6.10)

Note that E(− log2 p(a)) = H(A), and that (1/N) Σ_{i=1}^N (− log2 p(ai)) = −(1/N) log2 p(a1, ..., aN). There exists therefore a set A_{N,ε} ⊂ A^N such that lim_N p(A_{N,ε}) = 1, and such that any (a1, ..., aN) ∈ A_{N,ε} satisfies

2^{−N(H(A)+ε)} ≤ p(a1, ..., aN) ≤ 2^{−N(H(A)−ε)}.   (6.11)

The number of elements of A_{N,ε} is easily estimated:

1 ≥ p(A_{N,ε}) ≥ |A_{N,ε}| 2^{−N(H(A)+ε)},   (6.12)

so that |A_{N,ε}| ≤ 2^{N(H(A)+ε)}. For any δ > 0, we can choose N large enough so that p(A_{N,ε}) > 1 − δ. Then

Hδ(A^N) ≤ log2 |A_{N,ε}| ≤ N(H(A) + ε).   (6.13)

It follows that

lim sup_{N→∞} (1/N) Hδ(A^N) ≤ H(A).   (6.14)

For the lower bound, let B_{N,δ} be the minimizer for Hδ, that is, p(B_{N,δ}) > 1 − δ and

Hδ(A^N) = log2 |B_{N,δ}| ≥ log2 |B_{N,δ} ∩ A_{N,ε}|.   (6.15)

We need a lower bound for the latter term. We have

1 − δ ≤ p(B_{N,δ}) = p(B_{N,δ} ∩ A_{N,ε}) + p(B_{N,δ} ∩ A_{N,ε}^c)
      ≤ |B_{N,δ} ∩ A_{N,ε}| 2^{−N(H(A)−ε)} + δ   (for N large).   (6.16)

Then

|B_{N,δ} ∩ A_{N,ε}| ≥ (1 − 2δ) 2^{N(H(A)−ε)}.   (6.17)

We obtain

(1/N) Hδ(A^N) ≥ (1/N) log2 (1 − 2δ) + H(A) − ε.   (6.18)

This gives the desired bound for the lim inf, and Shannon's theorem follows. □

It is instructive to quantify the ε's and δ's of the proof. Invoking the central limit theorem instead of the law of large numbers, we get that

p(A_{N,ε}^c) ≈ (1/√(2π)) ∫_{√N ε/σ}^∞ e^{−t²/2} dt ≈ e^{−N ε²/2σ²} ≈ δ

(σ² is the variance of the random variable − log2 p(a)). This shows that ε ≈ N^{−1/2} and δ ≈ e^{−N}. It is surprising that δ can be so tiny, and yet makes such a difference!
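The concentration behind the proof, namely that −(1/N) log2 p(a1, ..., aN) is close to H(A) for most strings, is easy to observe by sampling. The following sketch (alphabet, probabilities, ε, and sample sizes are arbitrary choices) estimates the probability of the typical set A_{N,ε}:

    import math
    import random

    random.seed(0)
    letters, probs = ["a", "b", "c"], [0.7, 0.2, 0.1]        # illustrative distribution
    H = -sum(q * math.log2(q) for q in probs)                # about 1.157 bits
    eps = 0.05

    def empirical_rate(N):
        """-(1/N) log2 p(a_1, ..., a_N) for one random string of length N."""
        s = random.choices(letters, weights=probs, k=N)
        return -sum(math.log2(probs[letters.index(x)]) for x in s) / N

    for N in (100, 10_000):
        samples = [empirical_rate(N) for _ in range(1000)]
        inside = sum(abs(r - H) <= eps for r in samples) / len(samples)
        print(N, inside)   # estimate of p(A_{N,eps}); it approaches 1 as N grows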

5. Codes and compression algorithms

A code is a mapping from a "string" (a finite sequence of letters) to a finite sequence of binary numbers. We want to encode a string (a1, ..., an) of elements of an "alphabet" A.

We need some notation. A binary sequence, or sequence of bits, of arbitrary length, is written b = b1 b2 ... bm, where each bi = 0, 1. Let B be the set of all finite binary sequences. The concatenation of two binary sequences b and b′, of respective lengths m and n, is the sequence b b′ = b1 ... bm b′1 ... b′n. Of course, the length of b b′ is m + n.

Definition 6.4. A code is a map c from ∪_{N≥1} A^N to B which satisfies the following sequential property:

c(a1, ..., an) = c(a1) ... c(an).

That is, the coding of a string is sequential: letters are coded one by one, independently of one another. A code is uniquely decodable if the mapping is injective (one-to-one).

Some codes map any letter a ∈ A to a binary sequence of fixed length; ASCII characters, for instance, use 7 bits for any letter. But compression algorithms use variable length codes. A special class of variable length codes are the prefix codes. These are uniquely decodable codes where a sequence of binary numbers can be decoded sequentially: one reads the binary numbers from the left, one by one, until one recognizes the code of a letter; one extracts the letter, and reads the next binary numbers until the next letter is identified.

These notions are best understood with the help of examples. Here are four codes on the alphabet A = {a1, a2, a3, a4}:

c(1): a1 ↦ 0,  a2 ↦ 1,  a3 ↦ 00,  a4 ↦ 11;
c(2): a1 ↦ 0,  a2 ↦ 01, a3 ↦ 011, a4 ↦ 0111;
c(3): a1 ↦ 00, a2 ↦ 01, a3 ↦ 10,  a4 ↦ 11;
c(4): a1 ↦ 0,  a2 ↦ 10, a3 ↦ 110, a4 ↦ 111.

The code c(1) is undecodable, c(2) is uniquely decodable but is not a prefix code, and c(3) and c(4) are prefix codes. Any fixed-length, uniquely decodable code is also a prefix code. Prefix codes can be represented by trees, see Fig. 6.1.

Figure 6.1. Tree representations for the prefix codes c(3) and c(4).
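The sequential decoding of a prefix code is straightforward to implement. Here is a Python sketch for the example codes (the message is arbitrary); it also exhibits why c(1) fails to be decodable:

    c1 = {"a1": "0", "a2": "1", "a3": "00", "a4": "11"}
    c2 = {"a1": "0", "a2": "01", "a3": "011", "a4": "0111"}
    c3 = {"a1": "00", "a2": "01", "a3": "10", "a4": "11"}
    c4 = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}

    def encode(code, string):
        return "".join(code[a] for a in string)

    def decode_prefix(code, bits):
        """Sequential decoding: read bits from the left until a codeword is recognized."""
        inverse, out, buf = {w: a for a, w in code.items()}, [], ""
        for b in bits:
            buf += b
            if buf in inverse:      # safe because no codeword is a prefix of another
                out.append(inverse[buf])
                buf = ""
        assert buf == "", "leftover bits: not a valid sequence of codewords"
        return out

    msg = ["a2", "a1", "a4", "a3"]
    assert decode_prefix(c4, encode(c4, msg)) == msg
    assert decode_prefix(c3, encode(c3, msg)) == msg

    # c(1) is not uniquely decodable: two different strings share the same encoding.
    assert encode(c1, ["a1", "a1"]) == encode(c1, ["a3"])    # both give "00"

    # c(2) is uniquely decodable but not prefix: "0" is a prefix of "01", so the
    # sequential decoder above would output a letter too early; one must look ahead.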

The goal of compression algorithms³ is to encode strings with the smallest sequence of binary numbers. For a ∈ A, let ℓ(a) be the length of the sequence c(a). If each letter a occurs with (independent) probability p(a), the expectation for the length of one letter is

L(A, c) = Σ_{a∈A} p(a) ℓ(a).   (6.19)

Thus the goal is to find the code that minimizes E(ℓ). The idea is that frequent letters should be coded with smaller length. This clearly comes at the expense of other letters, that will need longer strings in order for the code to be decodable.

It is instructive to consider the following two situations with A = {a1, a2, a3, a4}, and the two prefix codes above.

(1) If p1 = p2 = p3 = p4 = 1/4: notice that H(A) = 2 bits. The expected length for c(3) is 2 bits, and for c(4) it is 2.25 bits. The first code is better.

(2) If p1 = 1/2, p2 = 1/4, p3 = p4 = 1/8: notice that H(A) = 1.75 bits. The expected length for c(3) is still 2 bits, but it is now 1.75 bits for c(4). The second code is better.

As we will see, the codes are optimal for these two situations.

There is a bound on minimally achievable lengths, that does not involve probabilities. It is known as Kraft inequality.

Proposition 6.4 (Kraft inequality).

• Any one-to-one code on A satisfies

Σ_{a∈A} 2^{−ℓ(a)} ≤ 1.

• Given {ℓ(a) : a ∈ A} satisfying the inequality above, there corresponds a prefix code.

Proof. We need to prove the first statement for any decodable code, not necessarily a prefix code. We start with

( Σ_{a∈A} 2^{−ℓ(a)} )^N = Σ_{a1,...,aN} 2^{−ℓ(a1) − ... − ℓ(aN)}
                        = Σ_{L=N ℓmin}^{N ℓmax} 2^{−L} #{ (a1, ..., aN) : ℓ(c(a1, ..., aN)) = L }.   (6.20)

We set ℓmin = min_{a∈A} ℓ(a), and similarly for ℓmax. Since the code is one-to-one, the number #{·} above is no more than 2^L. Then

( Σ_{a∈A} 2^{−ℓ(a)} )^N ≤ Σ_{L=N ℓmin}^{N ℓmax} 2^{−L} 2^L = N(ℓmax − ℓmin + 1).   (6.21)

Since the right side grows like N, the left side cannot grow exponentially with N; it must be less than or equal to 1.

The second claim can be proved e.g. by suggesting an explicit construction for the prefix code. It is left as an exercise. □

³ The word "algorithm" derives from Abu Ja'far Muhammad ibn Musa Al-Khwarizmi, who was born in Baghdad around 780, and who died around 850.
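These numbers are quickly verified; the following sketch computes the Kraft sums of c(3) and c(4) and the expected lengths for the two distributions above:

    import math

    codes = {
        "c3": {"a1": "00", "a2": "01", "a3": "10", "a4": "11"},
        "c4": {"a1": "0", "a2": "10", "a3": "110", "a4": "111"},
    }
    distributions = {
        "(1) uniform": [0.25, 0.25, 0.25, 0.25],
        "(2) skewed": [0.5, 0.25, 0.125, 0.125],
    }

    for name, code in codes.items():
        kraft = sum(2 ** -len(w) for w in code.values())
        print(name, "Kraft sum =", kraft)                    # both sums equal 1
        lengths = [len(code[a]) for a in ("a1", "a2", "a3", "a4")]
        for dname, p in distributions.items():
            L = sum(pi * li for pi, li in zip(p, lengths))
            H = -sum(pi * math.log2(pi) for pi in p)
            print("  ", dname, "L =", L, "H =", H)
    # c(3) achieves L = H = 2 bits in case (1); c(4) achieves L = H = 1.75 bits in case (2).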

Shannon entropy again appears as a limit for data compression.

Theorem II (Limit to compression). For any alphabet A, and any probability p on A, the optimal prefix code c satisfies

H(A) ≤ L(A, c) ≤ H(A) + 1.

Proof. For the lower bound, consider a code c with lengths ℓi = ℓ(c(ai)). Define qi = 2^{−ℓi}/z, with z = Σ_j 2^{−ℓj}. Then

L(A, c) = Σ_{i=1}^n pi ℓi = − Σ_{i=1}^n pi log2 qi − log2 z ≥ − Σ_{i=1}^n pi log2 pi = H(A).   (6.22)

The inequality holds because of the positivity of the relative entropy, and because z ≤ 1 (Kraft inequality).

For the upper bound, define ℓi = ⌈− log2 pi⌉ (the integer immediately bigger than − log2 pi). Then

Σ_{i=1}^n 2^{−ℓi} ≤ Σ_{i=1}^n pi = 1.   (6.23)

This shows that the Kraft inequality is verified, so there exists a prefix code c with these lengths. The expected length is easily estimated:

L(A, c) = Σ_{i=1}^n pi ⌈− log2 pi⌉ ≤ Σ_{i=1}^n pi (− log2 pi + 1) = H(A) + 1.   (6.24)

□

Exercise 6.1. State and prove Jensen's inequality.

Exercise 6.2. Consider a discrete statistical mechanics model such as the lattice gas with nearest-neighbour interactions (Ising model). Check that the Shannon entropy of the grand-canonical probability is equal to the corresponding thermodynamic entropy, up to the factor kB log 2.

Exercise 6.3. Recall the definitions for the codes c(1), c(2), c(3), and c(4). Explain why c(1) is undecodable, c(2) is uniquely decodable but not prefix, and c(3) and c(4) are prefix codes.

Exercise 6.4. Given lengths {ℓ(a)} satisfying the Kraft inequality, show the existence of a corresponding prefix code.
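The upper bound in the proof is constructive: choose ℓi = ⌈− log2 pi⌉ and build a prefix code with these lengths. The sketch below does this with one possible explicit construction (a canonical assignment of codewords, in the spirit of Exercise 6.4; the probabilities are an arbitrary example) and checks the two bounds of Theorem II:

    import math

    def shannon_code(p):
        """Prefix code with lengths ceil(-log2 p_i), assigned canonically.

        The Kraft inequality (6.23) guarantees that the construction never
        runs out of codewords.
        """
        lengths = {a: math.ceil(-math.log2(q)) for a, q in p.items()}
        code, value, prev_len = {}, 0, 0
        for a in sorted(lengths, key=lengths.get):       # shortest codewords first
            value <<= lengths[a] - prev_len              # move down to the next length
            code[a] = format(value, "0{}b".format(lengths[a]))
            value += 1
            prev_len = lengths[a]
        return code

    p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}         # arbitrary example
    code = shannon_code(p)

    # no codeword is a prefix of another
    words = list(code.values())
    assert not any(w != v and v.startswith(w) for w in words for v in words)

    H = -sum(q * math.log2(q) for q in p.values())
    L = sum(q * len(code[a]) for a, q in p.items())
    assert H <= L <= H + 1
    print(code, round(H, 3), round(L, 3))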