
INFORMATION THEORY AND STATISTICAL MECHANICS

UNIVERSITY OF ENERGY AND NATURAL RESOURCES


SCHOOL OF SCIENCES
DEPARTMENT OF MATHEMATICS AND STATISTICS

A THESIS SUBMITTED

BY

ANDOH FREDERICK - UE20037214


KWAO DAVID ATSEYOE - UE20038714
ISAAC OKYERE - UE20039614

TO THE MATHEMATICS AND STATISTICS DEPARTMENT, UNIVERSITY OF


ENERGY AND NATURAL RESOURCES, SUNYANI, IN PARTIAL FULFILMENT
OF THE REQUIREMENT FOR THE DEGREE OF BACHELOR OF SCIENCE
(BSC.) MATHEMATICS
MAY, 2018
Declaration
We hereby declare that we prepared this thesis, on the topic “Information Theory and
Statistical Mechanics”, ourselves under the guidance of Dr. Alex Akwasi Opoku. All the
literature and other sources of information have been acknowledged and are mentioned
in the list of references.

ANDOH FREDERICK ..................... ..................


Student (UE20037214) Signature Date

KWAO DAVID ATSEYOE ..................... ..................


Student (UE20038714) Signature Date

ISAAC OKYERE ..................... ..................


Student (UE20039614) Signature Date

Certified by:
DR. ALEX AKWASI OPOKU ..................... ..................
Supervisor Signature Date
Certified by:
DR. ALEX AKWASI OPOKU ..................... ..................
Head of Department Signature Date

Dedication

We affectionately dedicate this project to our GOD, and to our parents, relatives,
friends and everyone who motivated and supported us with prayers and finances. “I the
LORD thy GOD will hold thy right hand, saying unto thee, Fear not; I will help thee.”
[Isaiah 41:13]

Acknowledgement

Our foremost gratitude goes to GOD Almighty, our Father, who guided us through
these years of education. We are affectionately devoted to HIM and we believe that HE
alone can tell the magnitude of the gratitude we want to express, which cannot all be
reduced to writing. We would also like to mention our supervisor, Dr. Alex Opoku,
to whom we owe an inexpressible gratitude for his guidance. We cannot forget your
love, Dr. Alex Opoku; we are deeply grateful. Further, thanks to the Mathematics
Department, University of Energy and Natural Resources, for providing a great working
environment.

Abstract

Entropy refers to the disorder or uncertainty of a system. The maximization of entropy is
not an application of a law of physics, but merely a method of reasoning which ensures
that no unconscious arbitrary assumptions have been introduced.
In this thesis, we show that the maximum entropy principle serves as a connection
between information theory and statistical mechanics through the notion of entropy.

Contents

Declaration

Dedication

Acknowledgement

Abstract

Table of Contents

1 INTRODUCTION

2 FREQUENTIST VS BAYESIAN
2.1 Frequentist view
2.2 Bayesian view
2.3 Difference between Bayesian and Frequentist

3 INFORMATION THEORY AND STATISTICAL MECHANICS
3.1 Information Theory
3.2 Statistical Mechanics
3.3 Maximum Entropy Principle
3.4 Properties of entropy
3.5 Applications of Statistical Mechanics
3.6 Conclusion

Bibliography
List of Figures

2.1 Representation of Bayesian theory
3.1 The relation between information and entropy
3.2 The communication system

Chapter 1

INTRODUCTION

Information theory is a branch of mathematics that describes how uncertainty should


be quantified, manipulated and represented. Ever since the foundations of information
theory were laid down by Claude Shannon in 1948, the theory has been applied in
almost every facet of science and technology. Statistical mechanics is a branch of
theoretical physics that uses probability theory to study the average behaviour of a sys-
tem whose exact state is uncertain. The relationships between information theory and
statistical mechanics are by no means new and many researchers have been exploiting
them for many years. Perhaps the first relation, or analogy, that crosses one’s mind is
that in both fields there is a fundamental notion of entropy (disorder or uncertainty of
a system).
Generally, entropy refers to disorder or uncertainty, and the definition of entropy
used in information theory is directly analogous to the definition used in statistical
thermodynamics. The defining expression for entropy in the theory of information,
established by Claude E. Shannon in 1948, is mathematically expressed as

$$H(P) = -\sum_{i=1}^{n} p_i \log p_i,$$

where $P = (p_1, \ldots, p_n)$ and $p_i$ is the probability of observing message i. For statistical
mechanics, the defining expression for entropy, established by Ludwig Boltzmann and
J. Willard Gibbs in the 1870s, is of the form

$$-k \sum_{i=1}^{n} p_i \log p_i,$$

where $p_i$ is the probability of the microstate i taken from an equilibrium ensemble and
k is the Boltzmann constant.
Further, we presumably have some information about the state of a natural system that
exhibits a phenomenon we are trying to model, and we expect our probability
distribution to reflect our state of knowledge. The principle of insufficient reason (variously
attributed to Jacob Bernoulli, Laplace, Thomas Bayes, etc.) provides one model
selection criterion: in the absence of any reason to believe that one outcome is more
likely than another, we must assume that all outcomes are equally likely [1]. But if we
are not disciples of that school of thought, we have to allow that not all outcomes are
equally likely. An alternative to the principle of insufficient reason is the maximum
entropy principle: choose the least biased probability distribution, namely the one that
maximizes the entropy H subject to the known constraints. By choosing the distribution
which maximizes the entropy, we are choosing the distribution with the least bias. In
other words, our probability estimates should reflect what we know and what we do not know [9].
At the beginning of every problem in probability theory, there arises a need to
assign a probability distribution (set up an ensemble). The assignment of probabilities
must agree with the information we have. A reasonable assignment of probabilities
must not only agree with the data; it must neither ignore any possibility nor give undue
emphasis to any possibility [9]. This has led to the subjective and the objective schools
of thought of probability. The subjective viewpoint plays a crucial
role in this work. The rest of the thesis is organised as follows:
Sections 2.1 and 2.2 discuss the Frequentist and Bayesian views of probability, Section
2.3 outlines the differences between the Frequentist and Bayesian approaches, Section 3.1
discusses information theory, Section 3.2 discusses statistical mechanics, Section 3.3 reviews
the maximum entropy principle, Section 3.4 details the properties of entropy and finally
Section 3.5 looks at applications to statistical mechanics.

Chapter 2

FREQUENTIST VS BAYESIAN

Statisticians are interested in the outcomes of random experiments, which makes
probability a useful tool in their analysis. Statistical inference is the process of deducing
properties of underlying probability distributions via analysis of data. The foundation
of statistics is concerned with the debate on how one should conduct inference from data.
Two common methods of computing statistical inferences are the Frequentist and the
Bayesian inferences. The differences between the two come from the way the concept
of probability itself is interpreted [18].

2.1 Frequentist view

The frequentist view defines the probability of an event as the limit of its relative
frequency in a large number of trials. In the Frequentist interpretation, probabilities are
discussed in the context of well-defined random experiments (or random samples). The
set of all possible outcomes of a random experiment is called the sample space of the
experiment. An event is defined as a subset of the sample space. The relative frequency
of occurrence of an event, observed in a number of repetitions of the experiment, is a
measure of the probability of that event. This is core to the Frequentist interpretation
of probability.

Frequentists do not attach probabilities to hypotheses. Again, Frequentist statistics
make predictions on underlying truths of the experiment using only data from the
experiment. The more data they collect, the better they can pinpoint the truth.
Thus, if $n_t$ is the total number of trials and $n_x$ is the number of trials where the
event x occurred, the probability P(x) of the event occurring will be approximated by
the relative frequency as follows:

$$P(x) \approx \frac{n_x}{n_t}.$$

A claim of the frequentist approach is that, in the long run, as the number of trials
approaches infinity, the relative frequency will converge exactly to the true probability
[18]:

$$P(x) = \lim_{n_t \to \infty} \frac{n_x}{n_t}.$$

For example, let us consider a bag containing 10 red and 20 blue marbles of the same
size, from which one marble is drawn at random. What is the probability that the marble
is red? Imagine we can repeat this experiment as often as we like with replacement; it
is then clear that we can draw any of the marbles, irrespective of colour, and so in $\frac{1}{3}$
of the cases we would draw a red marble. So the long-run frequency of drawing a red
marble would be $\frac{1}{3}$, and this then is our desired probability [19].
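To make the long-run relative frequency idea concrete, the following is a minimal simulation sketch in Python (the bag composition is taken from the example above; the function name estimate_red_probability and the trial counts are our own illustrative choices):

    import random

    def estimate_red_probability(n_trials, n_red=10, n_blue=20):
        """Estimate P(red) by repeatedly drawing, with replacement, from the bag."""
        bag = ["red"] * n_red + ["blue"] * n_blue
        n_red_drawn = sum(1 for _ in range(n_trials) if random.choice(bag) == "red")
        return n_red_drawn / n_trials

    # The relative frequency approaches the true probability 1/3 as n_trials grows.
    for n_trials in (100, 10_000, 1_000_000):
        print(n_trials, estimate_red_probability(n_trials))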

2.2 Bayesian view

The Bayesian approach defines probability as the plausibility of a hypothesis given
incomplete knowledge. Bayesian probability measures the degree of belief that you have
in a random event. By this definition, probability is highly subjective. Bayesian probability
is an interpretation of the concept of probability in which, instead of the frequency or
propensity of some phenomenon, probability is interpreted as a reasonable expectation
representing a state of knowledge or as quantification of a personal belief [21, 22, 23].
The Bayesian interpretation of probability can be seen as an extension of propositional
logic that enables reasoning with hypotheses, i.e., propositions whose truth or falsity is
uncertain. In the Bayesian view, a probability is assigned to a hypothesis, whereas under
frequentist inference, a hypothesis is typically tested without being assigned a probability.
Bayesian probability belongs to the category of evidential probabilities: to evaluate the
probability of a hypothesis, the Bayesian probabilist specifies some prior probability before
observation, which is then updated to a posterior probability (after observation) in the
light of new, relevant data (evidence) [23]. The Bayesian interpretation provides a
standard set of procedures and formulas to evaluate probabilities.
For example, what is the probability that a coin is fair, given that there were 13
heads and 7 tails in 20 flips of the coin? Using Bayes' theorem, this is written in terms
of conditional probabilities as follows:

$$P(\text{fair} \mid 13h, 7t) = \frac{P(13h, 7t \mid \text{fair})\, P(\text{fair})}{P(13h, 7t)}.$$

Figure 2.1: Representation of Bayesian theory [20]

Let us take a look at each of the terms in the figure above

1. P (P arameter|Data): The posterior, or the probability of the model parameters
given the data: this is the result we want to compute.

2. P (Data|P arameter): The likelihood of the data to have come from the model
with the given parameter.

3. P(Parameter): The model prior, which encodes what we knew about the model
prior to the collection of the data.

4. P(Data): The data probability, which in practice amounts to simply a normalization term.
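To make the terms above concrete, here is a minimal numerical sketch for the coin example of Section 2.2. The hypothesis space and prior are our own illustrative assumptions (they are not specified in the text): the coin is either fair with P(head) = 0.5 or biased with P(head) = 0.7, each with prior probability 0.5.

    from math import comb

    def binomial_likelihood(p_head, heads, tails):
        """P(Data | Parameter): probability of the observed heads/tails given P(head)."""
        n = heads + tails
        return comb(n, heads) * p_head**heads * (1 - p_head)**tails

    # Assumed hypothesis space and prior (illustrative, not from the text).
    hypotheses = {"fair": 0.5, "biased": 0.7}
    prior = {"fair": 0.5, "biased": 0.5}

    heads, tails = 13, 7
    # P(Data) is the normalization term: sum of likelihood * prior over all hypotheses.
    evidence = sum(binomial_likelihood(p, heads, tails) * prior[h]
                   for h, p in hypotheses.items())
    posterior = {h: binomial_likelihood(p, heads, tails) * prior[h] / evidence
                 for h, p in hypotheses.items()}
    print(posterior)  # roughly {'fair': 0.31, 'biased': 0.69} under these assumptions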

2.3 Difference between Bayesian and Frequentist

Frequentist

1. Data are a repeatable random sample.

2. Probabilities are defined as long-run frequencies.

3. Underlying parameters remain constant during this repeatable process.

4. Parameters are fixed.

Bayesian

1. Data are observed from the realized sample.

2. Parameters are unknown and described probabilistically.

3. Data are fixed.

Chapter 3

INFORMATION THEORY AND


STATISTICAL MECHANICS

3.1 Information Theory

Information theory is a branch of mathematics that describes how uncertainty should


be quantified, manipulated and represented. The most fundamental quantity in infor-
mation theory is entropy [3]. Shannon borrowed the concept of entropy from thermo-
dynamics where it describes the amount of disorder in a system [10]. In information
theory, entropy measures the amount of uncertainty of an unknown or random quantity.
When trying to work out the information content of an event, there is a difficulty
linked with the subjectivity of the information encountered when such an event
occurs. To deal with this problem, Shannon proposed defining the information
content G(E) of an event E as a function that depends solely on the probability
P(E) and satisfies the following basic axioms:

1. G(E) is a decreasing function of P (E), the more an event is likely (high proba-
bility), the less information its occurrence brings to us.

2. G(E) = 0 if P (E) = 1, since if we are certain (there is no doubt) that E will
occur, we get no information from its outcome.

3. G(E ∩ F) = G(E) + G(F) if E and F are independent events, i.e. P(E ∩ F) = P(E) × P(F).

Figure 3.1: The relation between information and entropy

The information content is given by

$$G(E) = \log \frac{1}{P(E)} = -\log(P(E)).$$

The basic model of a data communication system is composed of three elements: a
source of data, a channel, and a receiver. As expressed by Shannon, who essentially
single-handedly created the field of information theory, the fundamental problem of
communication is for the receiver to be able to identify what data was generated by
the source, based on the signal it receives through the channel [3]. The meaning of the
events observed does not matter in the definition of entropy. Entropy only takes into
account the probability of observing a specific event, so the information it encapsulates
is information about the underlying probability distribution, but carries no information
about the event itself.
Information is typically measured in bits. The unit of the measurement depends on
the base of the logarithm that is used to define the entropy. Let S stand for some event
or source which emits symbols in some alphabet A which consists of n symbols. For
instance, S could be an event when
1. A coin is tossed. Here A consists of two symbols, ‘head’ and ‘tail’.
2. A die is rolled. A consists of six symbols, the numbers 1, 2, 3, 4, 5, 6.
For an experiment where all symbols have an equal probability of occurring, i.e. using
the idea of the principle of insufficient reason, the probability of any symbol occurring is $\frac{1}{n}$.
The information measure or entropy (measured in bits) of the source is given as

$$H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i,$$

where $p_i$ is the probability of event i and $P = (p_1, p_2, \ldots, p_n)$.


Let us consider a fair coin, where $i \in \{\text{head}, \text{tail}\}$ and P(head) = P(tail) = 0.5. Using
this information, the information measure or entropy (measured in bits) is

$$H(P) = -\left[\tfrac{1}{2} \log_2\left(\tfrac{1}{2}\right) + \tfrac{1}{2} \log_2\left(\tfrac{1}{2}\right)\right] \qquad (3.1)$$
$$= -\left[\tfrac{1}{2} \times (-1) + \tfrac{1}{2} \times (-1)\right] = 1 \text{ bit}. \qquad (3.2)$$

But on the other hand, if a coin is biased with P(head) = 0.7 and P(tail) = 0.3, then
P = (0.7, 0.3) and

$$H(P) = -\left[0.3 \log_2(0.3) + 0.7 \log_2(0.7)\right]$$
$$= -\left[0.3 \times (-1.737) + 0.7 \times (-0.515)\right] = 0.8816 \text{ bits}.$$

Therefore we can say that the biased coin generates less information than the fair
coin, because the uncertainty in the biased-coin case is less than that in the fair-coin
case. A coin that has zero probability of showing heads has zero entropy, since the coin
will always come up tails and the outcome can be predicted without any doubt.
The entropy of the unknown result of the next toss of the coin is maximized
if the coin is fair (that is, if heads and tails both have probability $\frac{1}{2}$).
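As a quick check of the two coin calculations above, here is a minimal Python sketch (the helper name shannon_entropy is ours; the exact value for the biased coin is about 0.881 bits, the 0.8816 quoted above coming from rounded logarithms):

    from math import log2

    def shannon_entropy(probabilities):
        """H(P) = -sum p_i log2(p_i), in bits; terms with p_i = 0 contribute nothing."""
        return -sum(p * log2(p) for p in probabilities if p > 0)

    print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(shannon_entropy([0.7, 0.3]))   # biased coin: about 0.881 bits
    print(shannon_entropy([1.0, 0.0]))   # certain outcome: 0.0 bits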
The Shannon/Weaver communication model

Figure 3.2: The communication system [12]
The communication system consists of:

1. An information source which produces a message or sequence of messages to be


communicated to the receiving terminal

2. A transmitter which operates on the message in some way to produce a signal


suitable for transmission over the channel

3. The channel is merely the medium used to transmit the signal from transmitter
to receiver. It may be a pair of wires, a coaxial cable, a band of radio frequencies,
a beam of light, etc.

4. The receiver ordinarily performs the inverse operation of that done by the trans-
mitter, reconstructing the message from the signal.

5. The destination is the person (or thing) for whom the message is intended.

A good example of this model in action is Internet telephony. John says ’Hello Sally’
in starting a conversation with Sally over Skype. John is the informer or information
source and the words he utters constitute the message. John’s computer receives this
message via its microphone and digitally encodes it in preparation for transmission. The
encoding is done in a binary alphabet, consisting conceptually of 0s and 1s. The signal
for this encoded message is sent over the Internet, which is the communication channel.
Along the way some noise is added to the message, which interferes with the data
corresponding to ‘Sally’. The received signal is decoded by Sally’s computer, converted
into audio and played through the speakers. Sally, the informee at the information
destination, hears ‘Hello Sal**’, where * stands for unintelligible crackles due to the
noise in the decoded signal [4].
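The following toy sketch mimics the source-channel-receiver picture described above. The 8-bit ASCII encoding and the independent bit-flip noise model are our own illustrative choices, not part of Shannon's formal model:

    import random

    def encode(message):
        """Transmitter: encode each character as 8 bits (ASCII)."""
        return [int(b) for ch in message for b in format(ord(ch), "08b")]

    def noisy_channel(bits, flip_prob):
        """Channel: flip each bit independently with probability flip_prob."""
        return [b ^ 1 if random.random() < flip_prob else b for b in bits]

    def decode(bits):
        """Receiver: regroup the bits into bytes and decode them back to characters."""
        chars = []
        for i in range(0, len(bits), 8):
            code = int("".join(map(str, bits[i:i + 8])), 2)
            chars.append(chr(code) if 32 <= code < 127 else "*")
        return "".join(chars)

    received = decode(noisy_channel(encode("Hello Sally"), flip_prob=0.02))
    print(received)  # flipped bits show up as wrong or unprintable characters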

3.2 Statistical Mechanics

Statistical mechanics is a branch of theoretical physics that uses probability theory to


study the average behaviour of a system whose exact state is uncertain. The primary
goal of statistical thermodynamics is to describe the thermodynamics of materials in
terms of the properties of their constituent particles and the interactions between them.
An example of such a system is a gas of many interacting particles. In other
words, statistical thermodynamics provides a connection between the
macroscopic properties of materials in thermodynamic equilibrium, and the microscopic
behaviours of motions occurring inside the material.
In a statistical mechanical analysis, one aims at providing a probabilistic description
of the behaviour of the elementary components. But a detailed deterministic description
with respect to each component is in most cases not feasible. However, the probabilistic
description at the microscopic level of the system’s components is sufficient to obtain a
deterministic description at the macroscopic level, i.e to characterize the macroscopic
behaviour of the system [5].
One usually starts from the partition function and then defines the entropy. It is,
however, possible to start from information theory and derive the entropy directly: it
has been recognized that the canonical ensemble, or Gibbs measure, is precisely the
distribution that maximizes the entropy of a system subject to a set of constraints;
this is the principle of maximum entropy.
The benefit of using statistical mechanics is that it provides exact methods to con-
nect thermodynamic quantities (such as heat capacity) to microscopic behaviour. In its
applications thermodynamics has two major objectives. One of these is to describe the
properties of matter when it exists in what is called an equilibrium state, a condition
in which its properties show no tendency to change. The other objective is to describe
processes in which the properties of matter undergo changes and to relate these changes

to the energy transfers in the form of heat and work which accompany them. Entropy
has often been associated with the amount of disorder of a system. The traditional
qualitative description of entropy is that it refers to changes in the states of the system
and is a measure of molecular disorder and the amount of energy loss in a dynamical
energy transformation from one state to another.
In particular, the entropy of a system is described by the second law of thermodynamics,
which states that in an isolated (closed) system entropy never decreases
spontaneously. There are two related entropy definitions, the thermodynamic defini-
tion and the statistical mechanics definition. In the classical thermodynamics view-
point, the system is made of very large numbers of constituents (atoms, molecules) and
the state of the system is described by the average thermodynamic properties of those
constituents. The interpretation of entropy in statistical mechanics is the measure of
uncertainty which remains about a system after its observable macroscopic properties,
such as temperature, pressure and volume, have been taken into account. For a given
set of macroscopic variables, the entropy measures the degree to which the probability
of the system is spread out over different possible microstates. This definition describes
the entropy as being proportional to the natural logarithm of the number of possible
microscopic configurations of the individual atoms and molecules of the system (mi-
crostates) which could give rise to the observed macroscopic state (macrostate) of the
system [15, 16, 17]. Suppose $P = (p_i)_{i=1}^{n}$, with $i \in A$, is the probability distribution
describing the state of the system. Then the entropy of the system is given by

$$S(P) = -k \sum_{i=1}^{n} p_i \log p_i,$$

where $k = 1.38065 \times 10^{-23}$ J/K is the Boltzmann constant. The summation is over all
the possible microstates of the system, and pi is the probability that the system is in the

ith microstate. Both classical and statistical thermodynamics are only valid for systems
in equilibrium. If the system is not in equilibrium (irreversible thermodynamics) then
the problem becomes considerably more difficult.
An isolated system is one in which there is no exchange of energy between the system
and its neighbouring environment. Its microstates are described by the arrangements of
its constituents, and the Boltzmann measure of disorder is given by $S = k \log W$, where
W is the number of microstates of the system.
A system approaches equilibrium because it evolves from states of lower probability
toward states of higher probability, and the equilibrium state is the state of highest
probability. In most cases, the initial state will be a very unlikely state. From this state
the system will steadily evolve towards more likely states until it has finally reached
the most likely state, i.e., the state of thermal equilibrium.
Consider a system consisting of $\ell$ different particle types. Suppose there are $n_1$ particles
of type 1, $n_2$ of type 2, and so on. Then the total number of particles in the system is
$N = \sum_{i=1}^{\ell} n_i$. Assume each particle of type i has an energy $\epsilon_i$, $i = 1, \ldots, \ell$;
then the total energy U of the system is given by

$$U = \sum_{i=1}^{\ell} \epsilon_i n_i.$$

The total number of different microstates of the system is given by

$$W = \frac{N!}{\prod_{i=1}^{\ell} n_i!}. \qquad (3.3)$$

Using Stirling's approximation,

$$\log n! \approx n \log n - n, \qquad (3.4)$$

this implies that

$$\log \frac{N!}{n_1!\, n_2! \cdots n_\ell!} \approx \log N! - \log(n_1!\, n_2! \cdots n_\ell!) \qquad (3.5)$$
$$= N \log N - N - (n_1 \log n_1 - n_1) - \ldots - (n_\ell \log n_\ell - n_\ell) \qquad (3.6)$$
$$= \left(\sum_i n_i\right) \log \left(\sum_i n_i\right) - \sum_i n_i \log n_i - N + \sum_{i=1}^{\ell} n_i \qquad (3.7)$$
$$= \left(\sum_i n_i\right) \log \left(\sum_i n_i\right) - \sum_i n_i \log n_i - N + N \qquad (3.8)$$
$$= N \log N - \sum_i n_i \log n_i. \qquad (3.9)$$

Introducing Lagrange multipliers, we form

$$L(n_i, \alpha, \beta) = \left(\sum_i n_i\right) \log \left(\sum_i n_i\right) - \sum_i n_i \log n_i - \alpha \left(\sum_i n_i - N\right) - \beta \left(\sum_i n_i \epsilon_i - U\right),$$

$$\frac{\partial}{\partial n_i} L(n_i, \alpha, \beta) = \log \sum_i n_i - \log n_i - \alpha - \beta \epsilon_i.$$

Therefore if $\frac{\partial L}{\partial n_i} = 0$, we have that

$$\log \sum_i n_i - \log n_i - \alpha - \beta \epsilon_i = 0.$$

Thus

$$\log n_i = \log N - \alpha - \beta \epsilon_i,$$
$$n_i = e^{\log N} e^{-(\alpha + \beta \epsilon_i)}.$$

Hence

$$n_i = N e^{-(\alpha + \beta \epsilon_i)}. \qquad (3.10)$$

Note that

$$N = \sum_{i=1}^{\ell} n_i = \sum_{i=1}^{\ell} N e^{-(\alpha + \beta \epsilon_i)} = N e^{-\alpha} \sum_{i=1}^{\ell} e^{-\beta \epsilon_i}.$$

Therefore the probability distribution of selecting a particle of type i from the system
is given by

$$p_i = \frac{n_i}{N} \qquad (3.11)$$
$$= \frac{N e^{-\alpha} e^{-\beta \epsilon_i}}{N e^{-\alpha} \sum_{i=1}^{\ell} e^{-\beta \epsilon_i}} \qquad (3.12)$$
$$= \frac{e^{-\beta \epsilon_i}}{\sum_{i=1}^{\ell} e^{-\beta \epsilon_i}}. \qquad (3.13)$$

The Boltzmann entropy now becomes

$$S = k \log W \qquad (3.14)$$
$$= k \log \left(\frac{N!}{\prod_{i=1}^{\ell} n_i!}\right) \qquad (3.15)$$
$$= k \left[\log N! - \log n_1! - \log n_2! - \ldots - \log n_\ell!\right] \qquad (3.16)$$
$$= k \left[N \log N - N - (n_1 \log n_1 - n_1) - (n_2 \log n_2 - n_2) - \ldots - (n_\ell \log n_\ell - n_\ell)\right] \qquad (3.17)$$
$$= k \left(\sum_i n_i\right) \log N - k \sum_i n_i \log n_i \qquad (3.18)$$
$$= -Nk \sum_{i=1}^{\ell} p_i \log p_i. \qquad (3.19)$$

In the second equality we used (3.3), and in the fourth we used Stirling's approximation (3.4).
Therefore, dividing both sides by N, the expression for S leads to the specific entropy

$$s = -k \sum_{i=1}^{\ell} p_i \log p_i,$$

where $s = \frac{S}{N}$.
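As a small numerical illustration of the result just derived, the sketch below computes the distribution $p_i = e^{-\beta \epsilon_i} / \sum_i e^{-\beta \epsilon_i}$ and the specific entropy $s = -k \sum_i p_i \log p_i$ for a hypothetical three-level system (the energy values and the temperature are our own illustrative choices):

    import math

    K_B = 1.38065e-23  # Boltzmann constant in J/K (the value used in the text)

    def boltzmann_distribution(energies, beta):
        """p_i = exp(-beta * e_i) / Z for the given energy levels."""
        weights = [math.exp(-beta * e) for e in energies]
        z = sum(weights)
        return [w / z for w in weights]

    def specific_entropy(probabilities, k=K_B):
        """s = -k * sum p_i log p_i (entropy per particle)."""
        return -k * sum(p * math.log(p) for p in probabilities if p > 0)

    # Hypothetical three-level system, energies in joules, at T = 300 K.
    energies = [0.0, 1.0e-21, 2.0e-21]
    beta = 1.0 / (K_B * 300.0)
    p = boltzmann_distribution(energies, beta)
    print(p, specific_entropy(p))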

3.3 Maximum Entropy Principle

The maximum entropy principle is used to estimate probabilities when we have insufficient
data to determine them accurately. Suppose the quantity x is capable of assuming discrete
values $x_i$, $(i = 1, 2, \ldots, n)$. We are not given the corresponding probabilities $p_i$; all we
know is the expectation value of a function f. The problem is to find the probability
assignment $p_i = P(x_i)$ which satisfies the constraints

$$\langle f(x) \rangle = \sum_{i=1}^{n} p_i f(x_i), \qquad (3.20)$$

and

$$\sum_{i=1}^{n} p_i = 1. \qquad (3.21)$$

The problem of specification of probabilities in cases where little or no information


is available, is as old as the theory of probability [6]. The Principle of Insufficient
Reason was an attempt to supply a criterion of choice, in which one assigns events
with equal probabilities if there is no reason to think otherwise. However, except in
cases where there is an evident element of symmetry that clearly renders the events
equally possible [6], the assumption may appear just as arbitrary as any other that might
be made. In the case of continuous random variables, the Laplace idea seemed to
lead to inconsistencies. Since the time of Laplace, this way of formulating problems has
been largely abandoned, owing to the lack of any constructive principle which would
give us a reason for preferring one probability distribution over another in cases where
both agree equally well with the available information [6].
For an assignment of probabilities, one must recognize the fact that probability
theory has developed along two different philosophical approaches, i.e., the Bayesian
(subjectivist) and the Frequentist (objectivist). The objective school of thought regards the probability

of an event as an objective property of that event. The probability of the event is
obtained from the relative frequency of its occurrence in a large number of trials of a
random experiment.
On the other hand, the subjective school of thought regards probabilities as expres-
sions of human ignorance. The probability of an event is a formal expression of our
expectation that the event will or did occur, based on whatever information is avail-
able [6]. To the subjectivist, the purpose of probability theory is to help us in forming
plausible conclusions in cases where there is not enough information available to lead
to certain conclusions; thus detailed verification is not expected.
In the various statistical problems presented to us by physics, both viewpoints are
needed. The subjective view is evidently the broader one, since it is always possible to
interpret frequency ratios, furthermore, the subjectivist will admit as legitimate objects
of inquiry many questions which the objectivist considers meaningless. The problem
posed at the beginning of this section is of this type, and therefore in considering it we
are necessarily adopting the subjective point of view.
Just as in applied statistics the crux of a problem is often the devising of some
method of sampling that avoids bias, our problem is that of finding a probability as-
signment which avoids bias, while agreeing with whatever information is given. The
great advance provided by information theory lies in the discovery that there is a unique,
unambiguous criterion for the amount of uncertainty represented by a discrete prob-
ability distribution, which agrees with our intuitive notions that a broad distribution
represents more uncertainty than does a sharply peaked one, and satisfies all other con-
ditions which make it reasonable [6]. We know from Section 3.1 that entropy which is
positive, increases with increasing uncertainty, and is additive for independent sources
of uncertainty, is given by
$$S(P) = -k \sum_{i=1}^{n} p_i \log p_i, \qquad (3.22)$$

where $P = (p_1, \ldots, p_n)$ and k is the Boltzmann constant. This is the expression for the
specific entropy from statistical mechanics; it will be called the entropy of the probability
distribution P, and henceforth we will consider the terms entropy and uncertainty as
synonymous. It is now evident how to solve our problem: in making inferences on the
basis of partial information we must use that probability distribution which has maximum
entropy subject to whatever is known. This is the least biased assignment we can make.
To maximize equation (3.22) subject to the constraints (3.20) and (3.21), which we write as

$$\sum_{i=1}^{n} p_i f(x_i) - \langle f(x) \rangle = 0$$

and

$$\sum_{i=1}^{n} p_i - 1 = 0,$$

we introduce the Lagrangian

$$L(p, \lambda, \mu) = -k \sum_{i=1}^{n} p_i \log p_i - \lambda \left(\sum_{i=1}^{n} p_i - 1\right) - \mu \left(\sum_{i=1}^{n} p_i f(x_i) - \langle f(x) \rangle\right). \qquad (3.23)$$

Differentiating the Lagrangian partially with respect to the $p_i$'s leads to

$$\frac{\partial}{\partial p_i} L = -k (\log p_i + 1) - \lambda - \mu f(x_i). \qquad (3.24)$$

Putting $\frac{\partial}{\partial p_i} L = 0$,

$$-k[\log p_i + 1] - \lambda - \mu f(x_i) = 0. \qquad (3.25)$$

Therefore

$$-k \log p_i = \lambda + \mu f(x_i) + k. \qquad (3.26)$$

Dividing both sides by k (and absorbing the factor k into $\lambda$ and $\mu$, i.e. replacing $\lambda/k$ by $\lambda$ and $\mu/k$ by $\mu$) leads to

$$\log p_i = -\lambda - \mu f(x_i) - 1, \qquad (3.27)$$

so that

$$p_i = e^{-[\lambda + \mu f(x_i) + 1]} \qquad (3.28)$$
$$= e^{-\lambda_k - \mu f(x_i)}, \qquad (3.29)$$

where $\lambda_k = 1 + \lambda$. Note that

$$1 = \sum_{i=1}^{n} p_i = e^{-\lambda_k} \sum_{i=1}^{n} e^{-\mu f(x_i)}. \qquad (3.30)$$

This implies that

$$e^{\lambda_k} = \sum_{i=1}^{n} e^{-\mu f(x_i)}. \qquad (3.31)$$

Hence

$$\lambda_k = \log \sum_{i} e^{-\mu f(x_i)} \qquad (3.32)$$
$$= \log Z(\mu). \qquad (3.33)$$

Here

$$Z(\mu) = \sum_{i} e^{-\mu f(x_i)} \qquad (3.34)$$

is called the partition function. Therefore

$$p_i = e^{-\lambda_k} e^{-\mu f(x_i)} = \frac{e^{-\mu f(x_i)}}{Z(\mu)}.$$
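In practice the multiplier $\mu$ is fixed by requiring that the constraint (3.20) holds. The sketch below does this numerically for the single-constraint case, taking f(x_i) = x_i on the faces of a die and an illustrative observed mean of 4.5 (the constraint value and the function names are our own assumptions):

    import math

    def maxent_distribution(values, mu):
        """p_i = exp(-mu * f(x_i)) / Z(mu), here with f(x_i) = x_i."""
        weights = [math.exp(-mu * v) for v in values]
        z = sum(weights)
        return [w / z for w in weights]

    def mean_under(values, mu):
        return sum(v * p for v, p in zip(values, maxent_distribution(values, mu)))

    def solve_mu(values, target_mean, lo=-50.0, hi=50.0, tol=1e-10):
        """Find mu such that <f> matches target_mean; the mean is monotone
        decreasing in mu, so simple bisection is enough."""
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if mean_under(values, mid) > target_mean:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        return 0.5 * (lo + hi)

    faces = [1, 2, 3, 4, 5, 6]
    mu = solve_mu(faces, 4.5)
    print(mu, maxent_distribution(faces, mu))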

General Case

This can be generalized to any number of functions $f(x_i) = (f_1(x_i), f_2(x_i), \ldots, f_m(x_i)) \in \mathbb{R}^m$, with

$$\langle f \rangle = \sum_{i=1}^{n} p_i f(x_i) = \sum_{i=1}^{n} p_i \,(f_1(x_i), \ldots, f_m(x_i)) = (\langle f_1 \rangle, \ldots, \langle f_m \rangle). \qquad (3.35)$$

Taking (3.22) as our entropy, we ask: what is the least biased estimate of the $p_i$ such
that (3.35) is satisfied? We maximize H subject to (3.21) and (3.35). Since the problem
is nonlinear, we introduce the Lagrange multipliers $\mu = (\mu_1, \ldots, \mu_m)$ and $\lambda$ and form

$$L(p, \mu, \lambda) = -k \sum_{i=1}^{n} p_i \log p_i - \sum_{v=1}^{m} \mu_v \left(\langle f_v \rangle - \sum_{i=1}^{n} p_i f_v(x_i)\right) - \lambda \left(\sum_{i=1}^{n} p_i - 1\right). \qquad (3.36)$$

Differentiating with respect to $p_i$, we get

$$\frac{\partial}{\partial p_i} L(p, \mu, \lambda) = -k \log p_i - k + \sum_{v=1}^{m} \mu_v f_v(x_i) - \lambda. \qquad (3.37)$$

For $p_i$ to be a maximum, $\frac{\partial}{\partial p_i} L(p, \mu, \lambda) = 0$. Therefore

$$-k \log p_i + \sum_{v=1}^{m} \mu_v f_v(x_i) - \lambda - k = 0, \qquad (3.38)$$

$$c - k \log p_i + \sum_{v=1}^{m} \mu_v f_v(x_i) = 0, \qquad (3.39)$$

where $c = -\lambda - k$, so that

$$c + \sum_{v=1}^{m} \mu_v f_v(x_i) = k \log p_i. \qquad (3.40)$$

Dividing both sides by k (and, as before, absorbing the factor k into c and the $\mu_v$), we obtain

$$p_i = e^{c}\, e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}. \qquad (3.41)$$

Now, since $\sum_{i=1}^{n} p_i = 1$,

$$1 = e^{c} \sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}, \qquad (3.42)$$

$$e^{c} = \frac{1}{\sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}}, \qquad (3.43)$$

$$c = -\log \sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}. \qquad (3.44)$$

Putting the expression for c into (3.41), then

$$p_i = \frac{e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}}{\sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}}. \qquad (3.45)$$

Relabelling $\mu_v \to -\mu_v$, the maximum entropy probability distribution is given by

$$p_i = \frac{e^{-[\mu_1 f_1(x_i) + \ldots + \mu_m f_m(x_i)]}}{Z(\mu_1, \ldots, \mu_m)}, \qquad (3.46)$$

where

$$Z(\mu_1, \ldots, \mu_m) = \sum_{i} \exp(-[\mu_1 f_1(x_i) + \ldots + \mu_m f_m(x_i)]) \qquad (3.47)$$

forms the partition function.

3.4 Properties of entropy

In what follows we will collect some properties of the entropy functions.

Lemma 1
Suppose $P = (p_1, \ldots, p_n)$ and $U = (u_1, \ldots, u_n)$ are any two probability distributions over
the $x_i$, $i = 1, 2, \ldots, n$; then

$$-\sum_{i=1}^{n} p_i \log u_i \geq -\sum_{i=1}^{n} p_i \log p_i = H(P).$$

Proof 1
Note that

$$\sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log u_i = \sum_{i=1}^{n} p_i \log \left(\frac{p_i}{u_i}\right) \geq \sum_{i=1}^{n} p_i \left(1 - \frac{u_i}{p_i}\right) = \sum_{i=1}^{n} p_i - \sum_{i=1}^{n} u_i = 1 - 1 = 0.$$

In the inequality above we used the fact that $\log x \geq 1 - \frac{1}{x}$.

Remark 2 The quantity

$$H(P \mid U) = \sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log u_i = \sum_{i=1}^{n} p_i \log \left(\frac{p_i}{u_i}\right) \geq 0$$

is called the relative entropy of P with respect to U, and

$$H(P \mid U) = 0$$

if and only if P = U.
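A short numerical sketch of the remark above (the distributions are our own illustrative choices; natural logarithms are used, as in the text):

    from math import log

    def relative_entropy(p, u):
        """H(P | U) = sum p_i log(p_i / u_i); requires u_i > 0 wherever p_i > 0."""
        return sum(pi * log(pi / ui) for pi, ui in zip(p, u) if pi > 0)

    p = [0.5, 0.3, 0.2]
    u = [0.2, 0.5, 0.3]
    print(relative_entropy(p, u))  # strictly positive, since P differs from U
    print(relative_entropy(p, p))  # 0.0, attained only when P = U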

Proposition 3 (Gibbs variational formula)

Suppose $U = (u_1, \ldots, u_n)$ is such that

$$u_i = \frac{e^{-\sum_{v=1}^{m} \mu_v f_v(x_i)}}{Z(\mu_1, \ldots, \mu_m)}. \qquad (3.48)$$

Then for any probability distribution P on the $x_i$, $i = 1, \ldots, n$,

$$H(P) \leq \log Z(\mu_1, \ldots, \mu_m) + \sum_{v=1}^{m} \sum_{i=1}^{n} p_i \mu_v f_v(x_i).$$

In particular equality is attained provided P = U.

Proof 4 Note that

$$0 \leq H(P \mid U) = \sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log u_i \qquad (3.49)$$
$$= \sum_{i=1}^{n} p_i \log p_i + \sum_{i=1}^{n} \sum_{v=1}^{m} p_i \mu_v f_v(x_i) + \log Z(\mu_1, \ldots, \mu_m). \qquad (3.50)$$

Therefore subtracting $\sum_{i=1}^{n} p_i \log p_i$ from both sides of the inequality gives rise to

$$-\sum_{i=1}^{n} p_i \log p_i = H(P) \leq \sum_{i=1}^{n} \sum_{v=1}^{m} p_i \mu_v f_v(x_i) + \log Z(\mu_1, \ldots, \mu_m). \qquad (3.51)$$

Note that $H(P \mid U) = 0$ if and only if $P = U$. Hence the inequality

$$H(P) \leq \sum_{i=1}^{n} \sum_{v=1}^{m} p_i \mu_v f_v(x_i) + \log Z(\mu_1, \ldots, \mu_m)$$

becomes an equality provided P = U. Therefore

$$H_{\max} = \log Z + \sum_{v=1}^{m} \mu_v \langle f_v(x_i) \rangle. \qquad (3.52)$$
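The inequality in Proposition 3 can be checked numerically. The sketch below does this for a single constraint (m = 1), with illustrative values of f(x_i), of $\mu$, and of the comparison distribution P (all three are our own assumptions; k is set to 1):

    import math

    def entropy(p):
        """H(P) = -sum p_i log p_i (natural logarithm, k = 1)."""
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    def gibbs_bound(p, mu, f_values, z):
        """Right-hand side of the variational inequality for m = 1."""
        return math.log(z) + mu * sum(pi * fi for pi, fi in zip(p, f_values))

    f_values = [0.0, 1.0, 2.0, 3.0]
    mu = 0.8
    weights = [math.exp(-mu * f) for f in f_values]
    z = sum(weights)
    u = [w / z for w in weights]     # the maximum entropy distribution U
    p = [0.4, 0.3, 0.2, 0.1]         # an arbitrary other distribution P

    print(entropy(p), gibbs_bound(p, mu, f_values, z))  # H(P) is below the bound
    print(entropy(u), gibbs_bound(u, mu, f_values, z))  # equality when P = U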

It remains to choose the unspecified constants $\mu_v$ so that an analogue of (3.20) is
satisfied. This is the case, as one readily verifies, if the $\mu_v$ are determined in terms of the
given data [9], $F_v = \langle f_v \rangle$, via

$$\langle f_v(x) \rangle = -\frac{\partial}{\partial \mu_v} \log Z \qquad (3.53)$$
$$= -\frac{\partial}{\partial \mu_v} \log \sum_{i=1}^{n} e^{-[\mu_1 f_1(x_i) + \ldots + \mu_m f_m(x_i)]} \qquad (3.54)$$
$$= \sum_{i=1}^{n} f_v(x_i) \frac{e^{-[\mu_1 f_1(x_i) + \ldots + \mu_m f_m(x_i)]}}{\sum_{i=1}^{n} e^{-[\mu_1 f_1(x_i) + \ldots + \mu_m f_m(x_i)]}} \qquad (3.55)$$
$$= \sum_{i=1}^{n} p_i f_v(x_i). \qquad (3.56)$$

Then, using (3.52) and (3.53),

$$\frac{\partial H_{\max}}{\partial \langle f_v(x) \rangle} = \mu_v. \qquad (3.57)$$

Suppose m = 1; then

$$\langle f \rangle = -\frac{\partial \log Z(\mu)}{\partial \mu} \qquad (3.58)$$
$$= -\frac{\partial}{\partial \mu} \log \sum_{i} e^{-\mu f(x_i)} \qquad (3.59)$$
$$= \frac{\sum_{i} f(x_i) e^{-\mu f(x_i)}}{Z(\mu)} \qquad (3.60)$$
$$= \sum_{i} p_i f(x_i). \qquad (3.61)$$

The variance of the function f is also obtained from the partition function:

$$\frac{\partial^2}{\partial \mu^2} \log Z(\mu) = \frac{\partial}{\partial \mu} \left(\frac{\partial \log Z(\mu)}{\partial \mu}\right)$$
$$= \frac{\partial}{\partial \mu} \left(\frac{-\sum_i f(x_i)\, e^{-\mu f(x_i)}}{Z(\mu)}\right)$$
$$= \frac{\sum_i f(x_i)^2 e^{-\mu f(x_i)}}{\sum_i e^{-\mu f(x_i)}} - \left(\frac{\sum_i f(x_i)\, e^{-\mu f(x_i)}}{\sum_i e^{-\mu f(x_i)}}\right)^2$$
$$= \sum_i f(x_i)^2 p_i - \left(\sum_i f(x_i)\, p_i\right)^2$$
$$= \langle f(x_i)^2 \rangle - \langle f(x_i) \rangle^2.$$
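The identities $\langle f \rangle = -\partial \log Z / \partial \mu$ and $\mathrm{Var}(f) = \partial^2 \log Z / \partial \mu^2$ can be verified numerically, for instance by comparing direct averages with finite differences of log Z (the values of f(x_i), of $\mu$ and the step size are our own illustrative choices):

    import math

    def log_z(mu, values):
        """log Z(mu) with Z(mu) = sum_i exp(-mu * f(x_i))."""
        return math.log(sum(math.exp(-mu * v) for v in values))

    def moments(mu, values):
        """Direct computation of <f> and Var(f) under p_i = exp(-mu f(x_i)) / Z(mu)."""
        weights = [math.exp(-mu * v) for v in values]
        z = sum(weights)
        p = [w / z for w in weights]
        mean = sum(v * q for v, q in zip(values, p))
        var = sum(v * v * q for v, q in zip(values, p)) - mean ** 2
        return mean, var

    values = [0.0, 1.0, 2.0, 3.0]   # illustrative values of f(x_i)
    mu, h = 0.7, 1e-4

    mean, var = moments(mu, values)
    d1 = (log_z(mu + h, values) - log_z(mu - h, values)) / (2 * h)
    d2 = (log_z(mu + h, values) - 2 * log_z(mu, values) + log_z(mu - h, values)) / h**2
    print(mean, -d1)   # these two agree: <f> = -d log Z / d mu
    print(var, d2)     # and so do these: Var(f) = d^2 log Z / d mu^2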



In addition to its dependence on x, the function f may also contain other parameters
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$, and one finds that

$$\langle f(x_i, \alpha) \rangle = \sum_i p_i f(x_i, \alpha) \qquad (3.62)$$

and the probability

$$p_i = \frac{e^{-\mu f(x_i, \alpha)}}{\sum_i e^{-\mu f(x_i, \alpha)}}, \qquad (3.63)$$

for which the partition function becomes

$$Z(\mu) = \sum_{i=1}^{n} e^{-\mu f(x_i, \alpha)}. \qquad (3.64)$$

Taking the logarithm of both sides of (3.64) we get

$$\log Z = \log \sum_{i=1}^{n} e^{-\mu f(x_i, \alpha)}, \qquad (3.65)$$

from which

$$\frac{\partial}{\partial \alpha_k} \log Z = -\mu \sum_{i=1}^{n} \frac{\partial f(x_i, \alpha)}{\partial \alpha_k} \frac{e^{-\mu f(x_i, \alpha)}}{Z} \qquad (3.66)$$
$$= -\mu \sum_{i=1}^{n} p_i \frac{\partial f(x_i, \alpha)}{\partial \alpha_k} \qquad (3.67)$$
$$= -\mu \left\langle \frac{\partial f}{\partial \alpha_k} \right\rangle; \qquad (3.68)$$

therefore the maximum entropy estimate for the derivative is given by

$$\frac{-1}{\mu} \frac{\partial \log Z}{\partial \alpha_k} = \left\langle \frac{\partial f}{\partial \alpha_k} \right\rangle. \qquad (3.69)$$

In the general case this becomes

$$-\frac{\partial}{\partial \alpha} \log Z = -\frac{\partial}{\partial \alpha} \log \sum_{i=1}^{n} e^{-[\mu_1 f_1(x_i, \alpha) + \ldots + \mu_m f_m(x_i, \alpha)]} \qquad (3.70)$$
$$= \frac{\sum_{i=1}^{n} \sum_{v=1}^{m} \mu_v \frac{\partial f_v}{\partial \alpha}\, e^{-[\mu_1 f_1(x_i, \alpha) + \ldots + \mu_m f_m(x_i, \alpha)]}}{Z} \qquad (3.71)$$
$$= \sum_{v=1}^{m} \mu_v \sum_{i=1}^{n} p_i \frac{\partial f_v}{\partial \alpha} \qquad (3.72)$$
$$= \sum_{v=1}^{m} \mu_v \left\langle \frac{\partial f_v}{\partial \alpha} \right\rangle. \qquad (3.73)$$

3.5 Applications of Statistical Mechanics

Consider the energy levels of a system $E_i(\alpha_1, \alpha_2, \ldots)$, where the external parameters
$\alpha_i$ may be the volume, strain tensor, gravitational potential, magnetic field, etc. If
we know only the average energy $\langle E_i \rangle$, then the maximum entropy probability of the
levels $E_i$ is calculated as in Section 3.3, using the conventional rules of statistical
mechanics discussed in Section 3.2. Here the parameters of interest are the temperature (T),
the free energy (F), etc. [24]. In what follows we shall use U and E interchangeably.
From equation (3.10), we get that

$$\frac{n_i}{N} = \frac{e^{-\beta \epsilon_i}}{\sum_i e^{-\beta \epsilon_i}} = \frac{e^{-\beta \epsilon_i}}{Z}, \qquad (3.74)$$

where

$$Z = \sum_i e^{-\beta \epsilon_i}.$$

It can also be established from (3.9) that

$$S = k\left[N \log N - \sum_i n_i \log n_i\right] \qquad (3.75)$$
$$= kN \log N - k \sum_i n_i \log\left(N \frac{e^{-\beta \epsilon_i}}{Z}\right) \qquad (3.76)$$
$$= kN \log N - k \sum_i n_i \log N + k \sum_i n_i \beta \epsilon_i + k \sum_i n_i \log Z \qquad (3.77)$$
$$= k \sum_i n_i \log Z + k \sum_i n_i \beta \epsilon_i \qquad (3.78)$$
$$= kN \log Z + k\beta U = k \log W. \qquad (3.79)$$

Note that the thermodynamic inverse temperature $\beta$ is defined as $\frac{1}{kT}$. Therefore

$$\frac{\partial S}{\partial U} = k\beta = k \cdot \frac{1}{kT} = \frac{1}{T}.$$

Substituting $\beta = \frac{1}{kT}$ into the expression for S in (3.79) we get

$$S = kN \log Z + \frac{U}{T}. \qquad (3.80)$$

Multiplying the above expression by the temperature T and rearranging, we get

$$U - TS = -NkT \log Z. \qquad (3.82)$$

Here $F(T, \alpha) = -NkT \log Z$ is the free energy, hence

$$U - TS = F(T, \alpha_1, \alpha_2, \ldots). \qquad (3.83)$$

Therefore, differentiating F partially with respect to T and using equation (3.79), we get

$$S = -\frac{\partial F}{\partial T} = -k \sum_i p_i \log p_i. \qquad (3.84)$$

The thermodynamic entropy is identical with the information theory entropy of the
probability distribution except for the presence of Boltzmann’s constant.
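The relations (3.80), (3.83) and (3.84) can be checked numerically on a toy example. In the sketch below the two energy levels, the particle number N, the temperature and the function names are our own illustrative assumptions; S is computed as $-Nk\sum_i p_i \log p_i$ and F as $-NkT\log Z$:

    import math

    K_B = 1.38065e-23  # Boltzmann constant in J/K (the value used in the text)

    def thermo(T, energies, N=1.0, k=K_B):
        """Partition function, internal energy, entropy and free energy for N
        particles with the given single-particle energy levels."""
        beta = 1.0 / (k * T)
        weights = [math.exp(-beta * e) for e in energies]
        z = sum(weights)
        p = [w / z for w in weights]
        u = N * sum(pi * ei for pi, ei in zip(p, energies))   # U = N <E>
        s = -N * k * sum(pi * math.log(pi) for pi in p)       # S = -Nk sum p log p
        f = -N * k * T * math.log(z)                          # F = -NkT log Z
        return z, u, s, f

    energies = [0.0, 2.0e-21]   # hypothetical two-level system, energies in joules
    T = 300.0
    z, u, s, f = thermo(T, energies)

    print(u - T * s, f)          # U - TS equals F = -NkT log Z, cf. (3.82)-(3.83)
    dT = 1e-3
    f_plus = thermo(T + dT, energies)[3]
    f_minus = thermo(T - dT, energies)[3]
    print(s, -(f_plus - f_minus) / (2 * dT))   # S = -dF/dT, cf. (3.84)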

3.6 Conclusion

The essential point of the statements presented above is that we recognize the Shannon
expression for entropy, as a measure of the amount of uncertainty represented by a
probability distribution, as a concept more fundamental even than energy. In addition,
prediction problems in statistical mechanics are treated in the subjective sense, and we
can derive the usual relations in a very elementary way, without any consideration of
ensembles or appeal to the usual arguments concerning equal a priori probabilities.
The principles and mathematical methods of statistical mechanics are seen to be of
much more general applicability than conventional arguments would lead one to suppose.
In the problem of prediction, the maximization of entropy is not an application of a
law of physics, but merely a method of reasoning which ensures that no unconscious
arbitrary assumptions have been introduced.

Bibliography

[1] A. Clark, C. Fox and S. Lappin (eds.), (2010), The Handbook of Computational Linguistics and Natural Language Processing, page 159.

[2] Bell System Technical Journal, (July and October), pp. 379-423.

[3] C. Shannon, (1948), A mathematical theory of communication, Bell System Technical Journal, pages 377-379.

[4] M. Ramage and D. Chapman, (2011), Perspectives on Information, Routledge, pages 36-50.

[5] M. Luksza, (2011), Cluster statistics and gene expression analysis, Dissertation, page 18.

[6] E. T. Jaynes, (1957), Information theory and statistical mechanics, Department of Physics, Stanford University, page 622.

[7] E. P. Northrop, (1944), Riddles in Mathematics, D. Van Nostrand Company, Inc., New York.

[8] H. Jeffreys, (1992), Theory of Probability, Oxford University Press, London.

[9] K. W. Ford, (1962), Brandeis University Summer Institute Lectures in Theoretical Physics, Brandeis Summer Institute.

[10] Z. Ghahramani, (2000), Encyclopedia of Cognitive Science, Macmillan Reference Ltd; University College London, United Kingdom, page 1.

[11] T. M. Cover and J. A. Thomas, (2006), Elements of Information Theory, Wiley, second edition, pages 1-8.

[12] L. Floridi, (2009), Semantic conceptions of information, in Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, Stanford University, Summer 2009 edition.

[13] S. D'Alfonso, An Overview of the Mathematical Theory of Communication, Particularly for Philosophers Interested in Information.

[14] W. Thomas Leland, (1980), Basic Principles of Classical and Statistical Thermodynamics, Department of Chemical Engineering, University of Illinois at Chicago, Clinton Street, Chicago.

[15] M. D. Licker, (2004), McGraw-Hill Concise Encyclopedia of Chemistry, New York: McGraw-Hill Professional. ISBN 978-0-07-143953-4.

[16] J. P. Sethna, (2006), Statistical Mechanics: Entropy, Order Parameters, and Complexity, Oxford University Press, page 78.

[17] J. O. E. Clark, (2004), The Essential Dictionary of Science, New York: Barnes and Noble. ISBN 978-0-7607-4616-5.

[18] L. Pekelis, (2015), Frequentist and Bayesian Statistics.

[19] M. H. P. Ambaum, (2012), Frequentist vs Bayesian Statistics: A Non-statistician's View, Department of Meteorology, University of Reading, UK, pages 1-4.

[20] T. CTHAEH, (2016), The Frequentist and Bayesian Approaches in Statistics.

[21] R. T. Cox, (1946), Probability, Frequency and Reasonable Expectation, American Journal of Physics, page 14.

[22] E. T. Jaynes, (1986), Bayesian Methods: General Background, in Maximum-Entropy and Bayesian Methods in Applied Statistics, edited by J. H. Justice, Cambridge: Cambridge University Press.

[23] B. de Finetti, (1974), Theory of Probability (2 vols), J. Wiley & Sons, Inc., New York.

[24] C. L. Tien, Statistical Thermodynamics, University of California, Berkeley.
