A THESIS SUBMITTED
BY
Certified by:
DR. ALEX AKWASI OPOKU (Supervisor)
Signature: .....................    Date: ..................

Certified by:
DR. ALEX AKWASI OPOKU (Head of Department)
Signature: .....................    Date: ..................
Dedication
We affectionately dedicate this project to our GOD, and to our parents, relatives,
friends and everyone who motivated and supported us in prayers and finances. "I the
LORD thy GOD will hold thy right hand, saying unto thee, Fear not; I will help thee."
[Isaiah 41:13]
Acknowledgement
Our foremost gratitude goes to our GOD Almighty, our Father, who guided us through
these years of education. We are affectionately related to HIM and we believe that HE
alone can tell the magnitude of the gratitude we wish to express, which cannot all be
reduced to writing. We would also like to mention our Supervisor, Dr. Alex Opoku,
to whom we owe an inexpressible gratitude for his guidance. We cannot forget your
love, Dr. Alex Opoku. We are deeply grateful. Further, thanks to the Mathematics
Department, University of Energy and Natural Resources, for providing a great working
environment.
Abstract
Contents
Declaration
Dedication
Acknowledgements
Abstract
Table of Contents
1 INTRODUCTION
3.5 Applications of Statistical Mechanics
3.6 Conclusion
Bibliography
Chapter 1
INTRODUCTION
H(P) = -\sum_{i=1}^{n} p_i \log p_i ,
where P = (p1 , ..., pn ) and pi is the probability for observing message i and for statistical
mechanics, the defining expression for entropy established by Ludwig Boltzmann and
J. Willard Gibbs in the 1870s, is of the form
-k \sum_{i=1}^{n} p_i \log p_i ,
where pi is the probability of the microstate i taken from an equilibrium ensemble and
k is the Boltzmann constant.
Further, we presumably have some information about the state of a natural system that
exhibits a phenomenon that we are trying to model, and we expect our probability
distribution to reflect our state of knowledge. The principle of insufficient reason (vari-
ously attributed to Jacob Bernoulli, Laplace, Thomas Bayes, etc.) provides one model
selection criterion: in the absence of any reason to believe that one outcome is more
likely than another, we must assume that all outcomes are equally likely [1]. But if we
are not disciples of that school of thought, we must allow that not all outcomes are
equally likely. An alternative to the principle of insufficient reason is the
maximum entropy principle: choose the least biased probability distribution, namely the
one that maximizes the entropy H subject to the known constraints. By choosing the
distribution which maximizes the entropy, we are choosing the distribution with the least bias. In other words,
our probability estimates should reflect what we know and what we do not know [9].
At the beginning of every problem in probability theory, there arises a need to
assign a probability distribution (to set up an ensemble). The assignment of probabilities
must agree with the information we have. A reasonable assignment of probabilities
must not only agree with the data and ignore no possibility, but it must also
give no undue emphasis to any possibility [9]. This has led to the subjective and the
objective schools of thought of probability. The subjective viewpoint plays a crucial
role in this work. The rest of the thesis is organised as follows:
Sections 2.1 and 2.2 discuss the Frequentist and Bayesian views of probability, Section
2.3 establishes the relation between the Frequentist and Bayesian views, Section 3.1
discusses information theory, Section 3.2 discusses statistical mechanics, Section 3.3
reviews the maximum entropy principle, Section 3.4 details the properties of entropy and,
finally, Section 3.5 looks at applications to statistical mechanics.
Chapter 2
Statisticians are interested in the outcomes of random experiments, which makes
probability a useful tool in their analysis. Statistical inference is the process of deducing
properties of underlying probability distributions via the analysis of data. The foundations
of statistics are concerned with the debate on how one should conduct inferences from data.
Two common methods of computing statistical inferences are the Frequentist and the
Bayesian inferences. The differences between the two come from the way the concept
of probability itself is interpreted [18].
Frequentists do not attach probabilities to hypotheses. Again, Frequentist statistics
make predictions on underlying truths of the experiment using only data from the
experiment. The more data they collect, the better they can pinpoint the truth.
Thus, if nt is the total number of trials and nx is the number of trials where the
event x occurred, the probability P (x) of the event occurring will be approximated by
the relative frequency as follows
P(x) \approx \frac{n_x}{n_t} .
A claim of the frequentist approach is that in the long run, as the number of trials
approaches infinity, the relative frequency will converge exactly to the true probability
[18]
P(x) = \lim_{n_t \to \infty} \frac{n_x}{n_t} .
For example, let us consider a bag containing 10 red and 20 blue marbles of the same
size, from which one marble is drawn at random. What is the probability that the marble
is red? Imagine we can repeat this experiment as often as we like with replacement; it
is then clear that we can draw any of the marbles, irrespective of colour, and so in 1/3
of the cases we would draw a red marble. So the long-run frequency of drawing a red
marble would be 1/3, and this then is our desired probability [19].
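The convergence of the relative frequency to 1/3 can be checked with a short simulation (an illustrative sketch, not part of the thesis; the function name and the fixed seed are our choices):

```python
import random

def relative_frequency(n_red, n_blue, n_trials, seed=0):
    """Estimate P(red) by repeated draws with replacement."""
    rng = random.Random(seed)
    marbles = ["red"] * n_red + ["blue"] * n_blue
    hits = sum(rng.choice(marbles) == "red" for _ in range(n_trials))
    return hits / n_trials

# The relative frequency approaches the true probability 10/30 = 1/3
# as the number of trials grows.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(10, 20, n))
```

With more trials the estimates cluster ever more tightly around 1/3, which is exactly the frequentist limit statement above.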
representing a state of knowledge or as a quantification of a personal belief [21, 22, 23].
The Bayesian interpretation of probability can be seen as an extension of propositional
logic that enables reasoning with hypotheses, i.e., propositions whose truth
or falsity is uncertain. In the Bayesian view, a probability is assigned to a hypothesis,
whereas under frequentist inference, a hypothesis is typically tested without being
assigned a probability. Bayesian probability belongs to the category of evidential
probabilities: to evaluate the probability of a hypothesis, the Bayesian probabilist specifies
some prior probability before observation, which is then updated to a posterior
probability (after observation) in the light of new, relevant data (evidence) [23]. The Bayesian
interpretation provides a standard set of procedures and formulas to evaluate probabilities.
For example, what is the probability that a coin is fair, given that there were 13
heads and 7 tails in 20 flips of the coin? Using Bayes' theorem, this is written in terms
of conditional probabilities as

P(Parameter \mid Data) = \frac{P(Data \mid Parameter)\, P(Parameter)}{P(Data)},

where
1. P(Parameter | Data): the posterior, or the probability of the model parameters
given the data; this is the result we want to compute.

2. P(Data | Parameter): the likelihood of observing the data under the model
with the given parameter.

3. P(Parameter): the model prior, which encodes what we knew about the model
before the data were collected.
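As a sketch of this update (not from the thesis: the alternative "biased" hypothesis with p = 0.65 and the 50/50 prior are our illustrative assumptions), the posterior probability that the coin is fair can be computed directly:

```python
from math import comb

def posterior_fair(heads, flips, p_biased=0.65, prior_fair=0.5):
    """P(fair | data) for two hypotheses: fair (p = 0.5) vs biased (p = p_biased)."""
    tails = flips - heads
    # Binomial likelihoods P(Data | Parameter) for each hypothesis
    like_fair = comb(flips, heads) * 0.5**heads * 0.5**tails
    like_biased = comb(flips, heads) * p_biased**heads * (1 - p_biased)**tails
    # Evidence P(Data) = sum of likelihood * prior over the hypotheses
    evidence = like_fair * prior_fair + like_biased * (1 - prior_fair)
    return like_fair * prior_fair / evidence

print(posterior_fair(13, 20))  # 13 heads in 20 flips lowers the belief in fairness
print(posterior_fair(10, 20))  # 10 heads in 20 flips supports fairness
```

The prior of 0.5 is updated downward by the 13/20 data and upward by balanced data, illustrating the prior-to-posterior mechanics described above.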
Frequentist
2. There is a frequency.
Bayesian
Chapter 3
1. G(E) is a decreasing function of P(E): the more likely an event is (high probability),
the less information its occurrence brings to us.
2. G(E) = 0 if P (E) = 1, since if we are certain (there is no doubt) that E will
occur, we get no information from its outcome.
G(E) = \log \frac{1}{P(E)} = -\log(P(E)).
account the probability of observing a specific event, so the information it encapsulates
is information about the underlying probability distribution, but has no information
about the event itself.
Information is typically measured in bits. The unit of the measurement depends on
the base of the logarithm that is used to define the entropy. Let S stand for some event
or source which emits symbols in some alphabet A which consists of n symbols. For
instance, S could be an event when
1. A coin is tossed. Here A consists of two symbols, ‘head’ and ‘tail’.
2. A die is rolled. A consists of six symbols, the numbers 1, 2, 3, 4, 5, 6.
For an experiment where all symbols have an equal probability of occurring, i.e., using
the principle of insufficient reason, the probability of any symbol occurring is 1/n.
The information measure or entropy (measured in bits) of the source is given as

H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i ,

so that for a fair coin, with P = (\tfrac{1}{2}, \tfrac{1}{2}),

H(P) = -\left[\tfrac{1}{2} \log_2(\tfrac{1}{2}) + \tfrac{1}{2} \log_2(\tfrac{1}{2})\right]  (3.1)
= -\left[\tfrac{1}{2} \times (-1) + \tfrac{1}{2} \times (-1)\right] = 1 \text{ bit}.  (3.2)
But on the other hand, if a coin is biased with P(head) = 0.7 and P(tail) = 0.3, then
P = (0.7, 0.3) and

H(P) = -\left[0.3 \log_2(0.3) + 0.7 \log_2(0.7)\right] \approx 0.881 \text{ bits}.

Therefore we can say that the biased coin generates less information than the fair
coin. This is because the uncertainty in the biased coin case is less than that in the fair
coin case. A coin that has zero probability of showing heads has zero entropy, since
the coin will always come up tails and the outcome can be predicted without any
doubt. The entropy of the unknown result of the next toss of the coin is maximized
if the coin is fair (that is, if heads and tails both have equal probability of 1/2).
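These entropy values can be reproduced with a few lines of Python (an illustrative sketch; the function name is ours, and terms with zero probability are conventionally taken to contribute zero):

```python
from math import log2

def shannon_entropy(probs):
    """H(P) = -sum p_i log2 p_i, in bits; terms with p = 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1 bit
print(shannon_entropy([0.7, 0.3]))   # biased coin: about 0.881 bits
print(shannon_entropy([1.0, 0.0]))   # certain outcome: 0 bits
```

The fair coin attains the maximum of 1 bit, the biased coin falls below it, and the deterministic coin carries no information at all, exactly as argued above.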
The Shannon/Weaver communication model
The communication system consists of:

3. The channel: merely the medium used to transmit the signal from transmitter
to receiver. It may be a pair of wires, a coaxial cable, a band of radio frequencies,
a beam of light, etc.

4. The receiver: ordinarily performs the inverse of the operation done by the
transmitter, reconstructing the message from the signal.

5. The destination: the person (or thing) for whom the message is intended.
A good example of this model in action is Internet telephony. John says ’Hello Sally’
in starting a conversation with Sally over Skype. John is the informer or information
source and the words he utters constitute the message. John’s computer receives this
message via its microphone and digitally encodes it in preparation for transmission. The
encoding is done in a binary alphabet, consisting conceptually of 0s and 1s. The signal
for this encoded message is sent over the Internet, which is the communication channel.
Along the way some noise is added to the message, which interferes with the data
corresponding to ‘Sally’. The received signal is decoded by Sally’s computer, converted
into audio and played through the speakers. Sally, the informee at the information
destination, hears ‘Hello Sal**’, where * stands for unintelligible crackles due to the
noise in the decoded signal [4].
3.2 Statistical Mechanics
to the energy transfers in the form of heat and work which accompany them. Entropy
has often been associated with the amount of disorder of a system. The traditional
qualitative description of entropy is that it refers to changes in the states of the system
and is a measure of molecular disorder and the amount of energy loss in a dynamical
energy transformation from one state to another.
In particular, the entropy of a system is described by the second law of
thermodynamics, which states that in an isolated (closed) system entropy never decreases
spontaneously. There are two related entropy definitions, the thermodynamic
definition and the statistical mechanics definition. In the classical thermodynamics
viewpoint, the system is made of very large numbers of constituents (atoms, molecules) and
the state of the system is described by the average thermodynamic properties of those
constituents. The interpretation of entropy in statistical mechanics is the measure of
uncertainty which remains about a system after its observable macroscopic properties,
such as temperature, pressure and volume, have been taken into account. For a given
set of macroscopic variables, the entropy measures the degree to which the probability
of the system is spread out over different possible microstates. This definition describes
the entropy as being proportional to the natural logarithm of the number of possible
microscopic configurations of the individual atoms and molecules of the system (mi-
crostates) which could give rise to the observed macroscopic state (macrostate) of the
system [15, 16, 17]. Suppose P = (p_i)_{i=1}^{n}, i \in A, is the probability distribution
describing the state of the system. Then the entropy of the system is given by

S(P) = -k \sum_{i=1}^{n} p_i \log p_i ,
where k = 1.38065×10−23 J/K is the Boltzmann constant. The summation is over all
the possible microstates of the system, and pi is the probability that the system is in the
ith microstate. Both classical and statistical thermodynamics are only valid for systems
in equilibrium. If the system is not in equilibrium (irreversible thermodynamics) then
the problem becomes considerably more difficult.
In an isolated system there is no exchange of energy between the system and
its neighbouring environment. The microstates are described by the arrangement of its
constituents, and Boltzmann's measure of disorder is given by S = k \log W, where W
is the number of microstates of the system.
A system approaches equilibrium because it evolves from states of lower probability
toward states of higher probability, and the equilibrium state is the state of highest
probability. In most cases, the initial state will be a very unlikely state. From this state
the system will steadily evolve towards more likely states until it has finally reached
the most likely state, i.e., the state of thermal equilibrium.
Consider a system consisting of \ell different particle types. Suppose there are n_1
particles of type 1, n_2 of type 2, and so on. Then the total number of particles in the
system is N = \sum_{i=1}^{\ell} n_i. Assume each particle of type i has an energy
\epsilon_i, i = 1, ..., \ell, so that the total energy is

U = \sum_{i=1}^{\ell} \epsilon_i n_i ,

and the number of microstates compatible with the occupation numbers (n_1, ..., n_\ell) is

W = \frac{N!}{\prod_{i=1}^{\ell} n_i!} .  (3.3)
Using Stirling's approximation \log n! \approx n \log n - n, this implies that

\log \frac{N!}{n_1! n_2! \cdots n_\ell!} \approx \log N! - \log(n_1!\, n_2! \cdots n_\ell!)  (3.5)
= N \log N - N - (n_1 \log n_1 - n_1) - \cdots - (n_\ell \log n_\ell - n_\ell)  (3.6)
= N \log N - \sum_{i=1}^{\ell} n_i \log n_i - N + \sum_{i=1}^{\ell} n_i  (3.7)
= N \log N - \sum_{i=1}^{\ell} n_i \log n_i - N + N  (3.8)
= N \log N - \sum_{i=1}^{\ell} n_i \log n_i .  (3.9)

To maximize \log W subject to the particle-number and energy constraints, we introduce the Lagrangian

L(n_i, \alpha, \beta) = N \log N - \sum_i n_i \log n_i - \alpha\left(\sum_i n_i - N\right) - \beta\left(\sum_i n_i \epsilon_i - U\right),

so that, remembering N = \sum_i n_i,

\frac{\partial}{\partial n_i} L(n_i, \alpha, \beta) = \log N - \log n_i - \alpha - \beta \epsilon_i .

Therefore if \frac{\partial L}{\partial n_i} = 0, we have that

\log N - \log n_i - \alpha - \beta \epsilon_i = 0.

Thus

\log n_i = \log N - \alpha - \beta \epsilon_i ,
n_i = e^{\log N} e^{-(\alpha + \beta \epsilon_i)} .

Hence

n_i = N e^{-(\alpha + \beta \epsilon_i)} .  (3.10)
Note that

N = \sum_{i=1}^{\ell} n_i = \sum_{i=1}^{\ell} N e^{-(\alpha + \beta \epsilon_i)} = N e^{-\alpha} \sum_{i=1}^{\ell} e^{-\beta \epsilon_i} .

Therefore the probability of selecting a particle of type i from the system is given by

p_i = \frac{n_i}{N}  (3.11)
= \frac{N e^{-\alpha} e^{-\beta \epsilon_i}}{N e^{-\alpha} \sum_{i=1}^{\ell} e^{-\beta \epsilon_i}}  (3.12)
= \frac{e^{-\beta \epsilon_i}}{\sum_{i=1}^{\ell} e^{-\beta \epsilon_i}} .  (3.13)
The entropy is then

S = k \log W  (3.14)
= k \log\left(\frac{N!}{\prod_{i=1}^{\ell} n_i!}\right)  (3.15)
= k \left[\log N! - \log n_1! - \log n_2! - \cdots - \log n_\ell!\right].  (3.16)

In the second equality we used (3.3). Therefore, dividing both sides by N, the expression
for S leads to the specific entropy

s = -k \sum_{i=1}^{\ell} p_i \log p_i ,

where s = \frac{S}{N}.
3.3 Maximum Entropy Principle
The maximum entropy principle is used to estimate probabilities when we have insufficient
data to determine them accurately. Suppose the quantity x is capable of assuming the
discrete values x_i, i = 1, 2, ..., n. We are not given the corresponding probabilities p_i;
all we know is the expectation value of a function f. The problem is to find the probability
assignment p_i = P(x_i) which satisfies the constraints

\langle f(x) \rangle = \sum_{i=1}^{n} p_i f(x_i),  (3.20)

and

\sum_{i=1}^{n} p_i = 1.  (3.21)
of an event as an objective property of that event. The probability of the event is
obtained from the relative frequency of its occurrence in a large number of trials of a
random experiment.
On the other hand, the subjective school of thought, regards probabilities as expres-
sions of human ignorance. The probability of an event is a formal expression of our
expectation that the event will or did occur, based on whatever information is avail-
able [6]. To the subjectivist, the purpose of probability theory is to help us in forming
plausible conclusions in cases where there is not enough information available to lead
to certain conclusions; thus detailed verification is not expected.
In the various statistical problems presented to us by physics, both viewpoints are
needed. The subjective view is evidently the broader one, since it is always possible to
interpret frequency ratios in this way; furthermore, the subjectivist will admit as legitimate objects
of inquiry many questions which the objectivist considers meaningless. The problem
posed at the beginning of this section is of this type, and therefore in considering it we
are necessarily adopting the subjective point of view.
Just as in applied statistics the crux of a problem is often the devising of some
method of sampling that avoids bias, our problem is that of finding a probability as-
signment which avoids bias, while agreeing with whatever information is given. The
great advance provided by information theory lies in the discovery that there is a unique,
unambiguous criterion for the amount of uncertainty represented by a discrete prob-
ability distribution, which agrees with our intuitive notions that a broad distribution
represents more uncertainty than does a sharply peaked one, and satisfies all other
conditions which make it reasonable [6]. We know from Section 3.1 that the entropy, which
is positive, increases with increasing uncertainty, and is additive for independent sources
of uncertainty, is given by

S(P) = -k \sum_{i=1}^{n} p_i \log p_i ,  (3.22)
where P = (p_1, ..., p_n) and k is the Boltzmann constant. This is the expression for
specific entropy from statistical mechanics; it will be called the entropy of the probability
distribution P, and henceforth we will consider the terms entropy and uncertainty
as synonymous. It is now evident how to solve our problem: in making inferences on
the basis of partial information we must use that probability distribution which has
maximum entropy subject to whatever is known. This is the least biased assignment
we can make. To maximize (3.22) subject to the constraints (3.20) and (3.21), we write
the constraints as

\sum_{i=1}^{n} p_i f(x_i) - \langle f(x) \rangle = 0

and

\sum_{i=1}^{n} p_i - 1 = 0,

and introduce the Lagrangian

L(p, \lambda, \mu) = -k \sum_{i=1}^{n} p_i \log p_i - \lambda\left(\sum_{i=1}^{n} p_i - 1\right) - \mu\left(\sum_{i=1}^{n} p_i f(x_i) - \langle f(x) \rangle\right),  (3.23)

so that

\frac{\partial}{\partial p_i} L = -k(\log p_i + 1) - \lambda - \mu f(x_i).  (3.24)

Putting \frac{\partial}{\partial p_i} L = 0.
Therefore
Dividing both sides by k leads us to

1 = \sum_{i=1}^{n} p_i = e^{(-1-\lambda)} \sum_{i=1}^{n} e^{-\mu f(x_i)} .  (3.30)

Hence

\lambda k = \log \sum_i e^{-\mu f(x_i)} .  (3.32)

Here, writing

Z(\mu) = \sum_i e^{-\mu f(x_i)} ,  (3.34)

we obtain

p_i = e^{-\lambda k} e^{-\mu f(x_i)} = \frac{e^{-\mu f(x_i)}}{Z(\mu)} .
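As a worked illustration of this single-constraint solution (not from the thesis: we take Jaynes's well-known dice example, with f(x) = x on a six-sided die and the constraint \langle x \rangle = 4.5, and solve for \mu by bisection):

```python
from math import exp

faces = [1, 2, 3, 4, 5, 6]

def maxent_dist(mu):
    """p_i = exp(-mu * f(x_i)) / Z(mu) for f(x) = x on a die."""
    weights = [exp(-mu * x) for x in faces]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(mu):
    return sum(x * p for x, p in zip(faces, maxent_dist(mu)))

def solve_mu(target, lo=-5.0, hi=5.0, tol=1e-12):
    """Bisection: mean(mu) is strictly decreasing in mu."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu = solve_mu(4.5)
p = maxent_dist(mu)
print(p)                                      # probabilities increase with the face value
print(sum(x * q for x, q in zip(faces, p)))   # reproduces the constraint 4.5
```

The resulting distribution rises from roughly 0.05 on face 1 to roughly 0.35 on face 6: the least biased assignment compatible with a mean of 4.5 tilts smoothly toward the higher faces rather than, say, putting all mass on 4 and 5.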
General Case

This can be generalized to any number of functions f(x_i) = (f_1(x_i), f_2(x_i), ..., f_m(x_i)),
f : X \to \mathbb{R}^m, with

\langle f \rangle = \sum_{i=1}^{n} p_i f(x_i) = \sum_{i=1}^{n} p_i (f_1(x_i), ..., f_m(x_i)) = (\langle f_1 \rangle, ..., \langle f_m \rangle).  (3.35)

With the entropy given by (3.22), what is the least biased estimate such that (3.35) is
satisfied? We maximize H subject to (3.21) and (3.35). Since our problem is nonlinear,
we introduce the Lagrange multipliers \mu = (\mu_1, ..., \mu_m) and \lambda:

L(p, \mu, \lambda) = -k \sum_{i=1}^{n} p_i \log p_i - \sum_{v=1}^{m} \mu_v\left(\langle f_v \rangle - \sum_{i=1}^{n} p_i f_v(x_i)\right) - \lambda\left(\sum_{i=1}^{n} p_i - 1\right)  (3.36)

\frac{\partial}{\partial p_i} L(p, \mu, \lambda) = -k \log p_i - k + \sum_{v=1}^{m} \mu_v f_v(x_i) - \lambda.  (3.37)

For p_i to be a maximum, \frac{\partial}{\partial p_i} L(p, \mu, \lambda) = 0. Therefore

-k \log p_i + \sum_{v=1}^{m} \mu_v f_v(x_i) - \lambda - k = 0  (3.38)

c - k \log p_i + \sum_{v=1}^{m} \mu_v f_v(x_i) = 0  (3.39)

where c = -\lambda - k, so that

c + \sum_{v=1}^{m} \mu_v f_v(x_i) = k \log p_i ,  (3.40)

p_i = e^{c}\, e^{\sum_{v=1}^{m} \mu_v f_v(x_i)} .  (3.41)
Now, since \sum_{i=1}^{n} p_i = 1, then

1 = e^{c} \sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}  (3.42)

e^{c} = \frac{1}{\sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}}  (3.43)

c = -\log \sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)} .  (3.44)

Hence

p_i = \frac{e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}}{\sum_{i=1}^{n} e^{\sum_{v=1}^{m} \mu_v f_v(x_i)}} ,  (3.45)

where

Z(\mu_1, ..., \mu_m) = \sum_i \exp(-[\mu_1 f_1(x_i) + \cdots + \mu_m f_m(x_i)]) .  (3.47)
Lemma 1
Suppose P = (p_1, ..., p_n) and U = (u_1, ..., u_n) are any two probability distributions over
the x_i, i = 1, 2, ..., n. Then

-\sum_{i=1}^{n} p_i \log u_i \geq -\sum_{i=1}^{n} p_i \log p_i = H(P).
Proof 1

Note that

\sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log u_i = \sum_{i=1}^{n} p_i \log \frac{p_i}{u_i} \geq \sum_{i=1}^{n} p_i \left(1 - \frac{u_i}{p_i}\right) = \sum_{i=1}^{n} (p_i - u_i) = 1 - 1 = 0.

In the inequality above we used the fact that \log x \geq 1 - \frac{1}{x}. Hence the relative entropy

H(P \mid U) = \sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log u_i = \sum_{i=1}^{n} p_i \log \frac{p_i}{u_i} \geq 0,

with H(P \mid U) = 0 if and only if P = U.
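The nonnegativity of H(P | U) is easy to verify numerically (an illustrative sketch; the two example distributions are our choice):

```python
from math import log

def relative_entropy(P, U):
    """H(P|U) = sum p_i log(p_i / u_i); nonnegative, and zero iff P == U."""
    return sum(p * log(p / u) for p, u in zip(P, U) if p > 0)

P = [0.5, 0.3, 0.2]
U = [0.2, 0.5, 0.3]
print(relative_entropy(P, U))   # strictly positive since P != U
print(relative_entropy(P, P))   # zero when the distributions coincide
```

Note that H(P | U) is not symmetric in its arguments, which is why the lemma fixes the roles of P and U.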
Now choose

u_i = \frac{e^{-\sum_{v=1}^{m} \mu_v f_v(x_i)}}{Z(\mu_1, ..., \mu_m)} .  (3.48)

Then

H(P) \leq \log Z(\mu_1, ..., \mu_m) + \sum_{v=1}^{m} \sum_{i=1}^{n} p_i \mu_v f_v(x_i).
In particular, equality is attained provided P = U. Indeed,

0 \leq H(P \mid U) = \sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log u_i  (3.49)
= \sum_{i=1}^{n} p_i \log p_i + \sum_{i=1}^{n} p_i \sum_{v=1}^{m} \mu_v f_v(x_i) + \log Z(\mu_1, ..., \mu_m).  (3.50)

Therefore subtracting \sum_{i=1}^{n} p_i \log p_i from both sides of the inequality gives rise to

-\sum_{i=1}^{n} p_i \log p_i = H(P) \leq \sum_{i=1}^{n} \sum_{v=1}^{m} p_i \mu_v f_v(x_i) + \log Z(\mu_1, ..., \mu_m),  (3.51)

and the maximum entropy is therefore

H_{max} = \log Z + \sum_{v=1}^{m} \mu_v \langle f_v(x) \rangle .  (3.52)
given data [9], F_v = \langle f_v \rangle, and is given by

\langle f_v(x) \rangle = -\frac{\partial}{\partial \mu_v} \log Z  (3.53)
= -\frac{\partial}{\partial \mu_v} \log \sum_{i=1}^{n} e^{-[\mu_1 f_1(x_i) + \cdots + \mu_m f_m(x_i)]}  (3.54)
= \sum_{i=1}^{n} f_v(x_i) \frac{e^{-[\mu_1 f_1(x_i) + \cdots + \mu_m f_m(x_i)]}}{\sum_{i=1}^{n} e^{-[\mu_1 f_1(x_i) + \cdots + \mu_m f_m(x_i)]}}  (3.55)
= \sum_{i=1}^{n} p_i f_v(x_i).  (3.56)

Suppose m = 1; then

\langle f \rangle = -\frac{\partial \log Z(\mu)}{\partial \mu}  (3.58)
= -\frac{\partial}{\partial \mu} \log \sum_i e^{-\mu f(x_i)}  (3.59)
= \frac{\sum_i f(x_i) e^{-\mu f(x_i)}}{Z(\mu)}  (3.60)
= \sum_i p_i f(x_i).  (3.61)
The variance of the function f is also given as

\frac{\partial^2}{\partial \mu^2} \log Z(\mu) = \frac{\partial}{\partial \mu}\left(\frac{\partial \log Z(\mu)}{\partial \mu}\right)
= \frac{\partial}{\partial \mu}\left(\frac{-\sum_i f(x_i) e^{-\mu f(x_i)}}{Z(\mu)}\right)
= \frac{\sum_i f(x_i)^2 e^{-\mu f(x_i)} \sum_i e^{-\mu f(x_i)} - \left(\sum_i f(x_i) e^{-\mu f(x_i)}\right)^2}{\left[\sum_i e^{-\mu f(x_i)}\right]^2}
= \frac{\sum_i f(x_i)^2 e^{-\mu f(x_i)}}{\sum_i e^{-\mu f(x_i)}} - \left(\frac{\sum_i f(x_i) e^{-\mu f(x_i)}}{\sum_i e^{-\mu f(x_i)}}\right)^2
= \sum_i f(x_i)^2 p_i - \left(\sum_i p_i f(x_i)\right)^2 .
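Both identities — the mean from the first derivative of \log Z and the variance from the second — can be checked numerically (an illustrative sketch; the values f(x_i) and \mu are our choice, and the derivatives are approximated by central finite differences):

```python
from math import exp, log

xs = [0.0, 1.0, 2.0, 3.0]          # the values f(x_i)
mu = 0.7

def log_Z(m):
    return log(sum(exp(-m * f) for f in xs))

def probs(m):
    Z = sum(exp(-m * f) for f in xs)
    return [exp(-m * f) / Z for f in xs]

# Direct mean and variance under p_i = exp(-mu f_i) / Z
p = probs(mu)
mean = sum(f * q for f, q in zip(xs, p))
var = sum(f * f * q for f, q in zip(xs, p)) - mean**2

# The same quantities from derivatives of log Z (finite differences)
h = 1e-5
d1 = (log_Z(mu + h) - log_Z(mu - h)) / (2 * h)
d2 = (log_Z(mu + h) - 2 * log_Z(mu) + log_Z(mu - h)) / h**2

print(mean, -d1)   # <f>  = -d log Z / d mu
print(var, d2)     # Var(f) = d^2 log Z / d mu^2
```

The partition function thus acts as a generating function: all the moments we constrained or derived fall out of derivatives of \log Z.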
In addition to its dependence on x, the function f may also contain other parameters
\alpha = (\alpha_1, \alpha_2, ..., \alpha_n), and it turns out that

\langle f(x_i, \alpha) \rangle = \sum_i p_i f(x_i, \alpha),  (3.62)

with

Z(\mu) = \sum_{i=1}^{n} e^{-\mu f(x_i, \alpha)}  (3.64)

and

\log Z = \log \sum_{i=1}^{n} e^{-\mu f(x_i, \alpha)} ,  (3.65)
from which

\frac{\partial}{\partial \alpha_k} \log Z = -\mu \sum_{i=1}^{n} \frac{\partial f(x_i, \alpha)}{\partial \alpha_k} \frac{e^{-\mu f(x_i, \alpha)}}{Z}  (3.66)
= -\mu \sum_{i=1}^{n} p_i \frac{\partial f(x_i, \alpha)}{\partial \alpha_k}  (3.67)
= -\mu \left\langle \frac{\partial f}{\partial \alpha_k} \right\rangle ,  (3.68)

so that

\left\langle \frac{\partial f}{\partial \alpha_k} \right\rangle = \frac{-1}{\mu} \frac{\partial \log Z}{\partial \alpha_k} .  (3.69)
Similarly, in the general case,

-\frac{\partial}{\partial \alpha} \log Z = -\frac{\partial}{\partial \alpha} \log \sum_{i=1}^{n} e^{-[\mu_1 f_1(x_i, \alpha) + \cdots + \mu_m f_m(x_i, \alpha)]}  (3.70)
= \frac{\sum_{i=1}^{n} \sum_{v=1}^{m} \mu_v \frac{\partial f_v}{\partial \alpha} e^{-[\mu_1 f_1(x_i, \alpha) + \cdots + \mu_m f_m(x_i, \alpha)]}}{Z}  (3.71)
= \sum_{v=1}^{m} \mu_v \sum_i p_i \frac{\partial f_v}{\partial \alpha}  (3.72)
= \sum_{v=1}^{m} \mu_v \left\langle \frac{\partial f_v}{\partial \alpha} \right\rangle .  (3.73)
Consider the energy levels of a system E_i(\alpha_1, \alpha_2, ...), where the \alpha_i are external
parameters such as the volume, strain tensor, gravitational potential, magnetic field, etc. If
we know only the average energy \langle E_i \rangle, then the maximum entropy probability of the
levels E_i is calculated as in Section 3.3, using the conventional rules of statistical mechanics
discussed in Section 3.2. Here the parameters of interest are the temperature (T), free
energy (F), etc. [24]. In what follows we shall use U and E interchangeably.
From equation (3.10), we get that

p_i = \frac{n_i}{N} = \frac{e^{-\beta \epsilon_i}}{\sum e^{-\beta \epsilon_i}} = \frac{e^{-\beta \epsilon_i}}{Z} ,  (3.74)

where

Z = \sum_i e^{-\beta \epsilon_i} .

Then, using (3.9),

S = k\left[N \log N - \sum_i n_i \log n_i\right]  (3.75)
= k N \log N - k \sum_i n_i \log\left(N \frac{e^{-\beta \epsilon_i}}{Z}\right)  (3.76)
= k N \log N - k \sum_i n_i \log N + k \sum_i n_i \beta \epsilon_i + k \sum_i n_i \log Z  (3.77)
= k \sum_i n_i \log Z + k \sum_i n_i \beta \epsilon_i ,  (3.78)

that is,

S = k N \log Z + k \beta U .  (3.79)

Note that the thermodynamic inverse temperature \beta is defined as \frac{1}{kT}. Therefore

\frac{\partial S}{\partial U} = k \beta = k \frac{1}{kT} = \frac{1}{T} .

Substituting \beta = \frac{1}{kT} into the expression for S in (3.79), we get

S = k N \log Z + \frac{U}{T} .  (3.80)
Multiplying the above expression by the temperature T gives TS = N k T \log Z + U, so
that the Helmholtz free energy F = U - TS satisfies

F = U - TS = -N k T \log Z .  (3.82)

Therefore, differentiating F partially with respect to T and using equation (3.79), we get

S = -\frac{\partial F}{\partial T} = -k \sum_i p_i \log p_i .  (3.84)

The thermodynamic entropy is identical to the information-theory entropy of the
probability distribution, except for the presence of Boltzmann's constant.
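This identity between the thermodynamic and information-theoretic expressions can be verified for a small system (an illustrative sketch with k = 1, written per particle, where S = kN \log Z + U/T becomes -\sum_i p_i \log p_i = \log Z + \beta \langle \epsilon \rangle; the energy levels and \beta are our choice):

```python
from math import exp, log

energies = [0.0, 1.0, 2.0]   # energy levels epsilon_i (arbitrary units)
beta = 1.3                   # inverse temperature, with k = 1

Z = sum(exp(-beta * e) for e in energies)
p = [exp(-beta * e) / Z for e in energies]   # Boltzmann distribution (3.74)

u = sum(pi * e for pi, e in zip(p, energies))   # mean energy per particle
s_gibbs = -sum(pi * log(pi) for pi in p)        # -sum p_i log p_i
s_thermo = log(Z) + beta * u                    # per-particle form of S = kN log Z + U/T

print(s_gibbs, s_thermo)   # the two expressions agree exactly
```

The agreement is exact, not approximate: substituting \log p_i = -\beta \epsilon_i - \log Z into -\sum_i p_i \log p_i reproduces \log Z + \beta \langle \epsilon \rangle term by term.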
3.6 Conclusion
The statements presented above is that we recognized the Shannon expression for en-
tropy as a measure of the amount of uncertainty represented by a probability distri-
bution is more fundamental even than energy. In addition, the prediction of problems
in statistical mechanics is in the subjective sense, we can derive the usual relations in
a very elementary way without any consideration of ensembles or appeal to the usual
arguments concerning equal a priori probabilities. The principles and mathematical
methods of statistical mechanics are seen to be of much more general applicability than
conventional arguments which would lead one to suppose. In the problem of predic-
tion, the maximization of entropy is not an application of a law of physics, but merely
a method of reasoning which ensures that no unconscious arbitrary assumptions have
been introduced.
Bibliography
[1] A. Clark, C. Fox, and S. Lappin (eds.), (2010), The Handbook of Computational
Linguistics and Natural Language Processing, page 159.
[2] C. E. Shannon, (1948), A Mathematical Theory of Communication, Bell System
Technical Journal, 27 (July and October), pp. 379-423.
[5] M. Luksza, (2011), Cluster Statistics and Gene Expression Analysis, Dissertation,
page 18.

[7] E. P. Northrop, (1944), Riddles in Mathematics, D. Van Nostrand Company, Inc.,
New York.
[10] Z. Ghahramani, (2000), Encyclopedia of Cognitive Science, Macmillan Reference
Ltd, University College London, United Kingdom, page 1.

[11] T. M. Cover and J. A. Thomas, (2006), Elements of Information Theory, Wiley,
second edition, pages 1-8.

[14] T. W. Leland, (1980), Basic Principles of Classical and Statistical Thermodynamics,
Department of Chemical Engineering, University of Illinois at Chicago, Clinton
Street, Chicago.

[15] M. D. Licker, (2004), McGraw-Hill Concise Encyclopedia of Chemistry, New York:
McGraw-Hill Professional. ISBN 978-0-07-143953-4.

[16] J. P. Sethna, (2006), Statistical Mechanics: Entropy, Order Parameters and
Complexity, Oxford University Press, page 78.

[17] J. O. E. Clark, (2004), The Essential Dictionary of Science, New York: Barnes
and Noble. ISBN 978-0-7607-4616-5.
[21] R. T. Cox, (1946), Probability, Frequency and Reasonable Expectation, American
Journal of Physics, 14.

[23] B. de Finetti, (1974), Theory of Probability (2 vols), J. Wiley & Sons, Inc., New
York.