
ENTROPY OPTIMIZATION PRINCIPLES AND THEIR APPLICATIONS

J.N. KAPUR AND H.K. KESAVAN


Systems Design Department
University of Waterloo
Waterloo, Ontario, Canada N2L 3Gl

ABSTRACT. A state-of-the-art description of the theory and applications of the various
entropy optimization principles is given. These principles include Jaynes' maximum en-
tropy principle (MaxEnt), Kullback's minimum cross-entropy principle (MinxEnt), gener-
alised maximum entropy and minimum cross-entropy principles, inverse entropy optimiza-
tion principles, the minimum interdependence principle, the minimax entropy principle and,
finally, the dual entropy optimization principles. The relation between information-theoretic
entropy and thermodynamic entropy is specially recalled in the context of the more general
relationship that exists between what are designated as primary and secondary entropies.

1. INFORMATION-THEORETIC ENTROPY

In the present paper, we shall be concerned primarily with information-theoretic entropy,
which, for all practical purposes, we shall identify with uncertainty; we shall, however, also
discuss the relationship of this information-theoretic entropy with the classical concept of
thermodynamic entropy.
The concept of information-theoretic entropy was given in 1948 by Shannon [42], while
the concept of thermodynamic entropy dates back to at least one hundred years earlier.
The latter concept has held a great fascination for scientists and engineers all along, since
entropy in general always increases, unlike all the other variables they discuss, which can
either increase or decrease with time. The 'arrow of time', as it has been called, has been
some sort of mystery to most persons. The concept has nevertheless been basic to the
development of modern science and technology. The second law of thermodynamics, which
is based on the concept of thermodynamic entropy, is essentially a physical law. It was
applied by analogy by the economist Georgescu-Roegen [7] to economic problems in his
book 'The Entropy Law and the Economic Process' and by the science journalist Rifkin [40]
to current social and political problems in his book 'Entropy: A New World View'.
Shannon was a communication engineer whose primary aim was to develop a measure
for information lost in communication across a noisy channel. He was not concerned about
thermodynamic entropy. Since information supplied is equal to the uncertainty removed,
he began searching for a measure of uncertainty of a probability distribution
P = (p_1, p_2, \ldots, p_n), \quad p_i \ge 0 \ (i = 1, 2, \ldots, n), \quad \sum_{i=1}^{n} p_i = 1    (1)

He laid down the following postulates for a measure of uncertainty of the distribution:

(i) The measure of uncertainty should be a continuous function H(p_1, p_2, \ldots, p_n) of the
probabilities, i.e., the uncertainty should change only by a small amount if p_1, p_2, \ldots, p_n
change by small amounts.

(ii) H(p_1, p_2, \ldots, p_n) should be a permutationally symmetric function of p_1, p_2, \ldots, p_n, i.e., it
should not change when p_1, p_2, \ldots, p_n are permuted among themselves or when the
outcomes are labelled differently.

(iii) H(p_1, p_2, \ldots, p_n) should be maximum when p_1 = p_2 = \cdots = p_n = 1/n and this maximum
value should be an increasing function of n.

(iv) H(p_1, p_2, \ldots, p_n) should follow the branching or recursivity principle, i.e.,

H(p_1, p_2, \ldots, p_{n-1}, p_n q_1, p_n q_2, \ldots, p_n q_m) = H(p_1, p_2, \ldots, p_n) + p_n H(q_1, q_2, \ldots, q_m)    (2)

where q_j \ge 0, \sum_{j=1}^{m} q_j = 1. In his epoch-making paper, Shannon [42] proved that the only
function which satisfies all these postulates is
H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln p_i,    (3)

where K is an arbitrary positive constant. Shannon did not want to call this function a
measure of information or a measure of uncertainty, and so he approached his friend, the
famous mathematician-physicist John von Neumann, who allegedly advised him to call it
entropy for two reasons: "Firstly, you have got the same expression as is used for entropy
in thermodynamics; secondly, and even more importantly, since even after one hundred
years nobody understands what entropy is, if you use the word entropy you will always
win in an argument!" (Tribus [46]). Shannon took the suggestion, and thus a measure of
uncertainty came to be known as a measure of entropy, solely because this measure had
the same mathematical expression as the thermodynamic entropy. At that time, no
relationship with thermodynamic entropy had been established. Such a relationship was
discovered later and we shall derive it in section 3. This relationship is established by
making use of Jaynes' [10] principle of maximum entropy, which we proceed to discuss in
the next section.

2. JAYNES' PRINCIPLE OF MAXIMUM ENTROPY

Let the only information available about the probability distribution P be given by
p_1 \ge 0, \ p_2 \ge 0, \ \ldots, \ p_n \ge 0, \quad \sum_{i=1}^{n} p_i = 1    (4)

and
\sum_{i=1}^{n} p_i g_r(i) = a_r, \quad r = 1, 2, \ldots, m    (5)

where m + 1 \ll n. Thus (4) and (5) are not sufficient to determine p_1, p_2, \ldots, p_n uniquely.
In other words, there may be many, even an infinity of probability distributions, satisfying
the given constraints (4) and (5). Which of these distributions should we choose?

We have a system with partial information and there is some missing information.
Consequently, there is some uncertainty due to this missing information.
In order to make a choice among the distributions satisfying (4) and (5), Jaynes made
use of the following principles of ancient wisdom:

• Speak the truth and nothing but the truth

• Make use of all the information you are given and scrupulously avoid using the
information not given to you, and

• Make use of all the given information and be maximally uncommitted to the missing
information or be maximally uncertain about it.

He therefore proposed that to determine p_1, p_2, \ldots, p_n, we should maximize Shannon's
measure of uncertainty (3), subject to (4) and (5) being satisfied.
If we use Lagrange's method of undetermined multipliers, we get

p_i = \exp(-\lambda_0 - \lambda_1 g_1(i) - \cdots - \lambda_m g_m(i)), \quad i = 1, 2, \ldots, n    (6)

where \lambda_0, \lambda_1, \ldots, \lambda_m are determined by using (4) and (5).
Now by a fortunate circumstance, the function (3) happens to be a concave function
and (4) and (5) are linear constraints and so the local maximum of expression (3) will also
be its global maximum.
We call it a fortunate circumstance because Shannon did not design his measure to be a
concave function and, accordingly, this was not one of his postulates. However, his entropy
function turned out to be a concave function.
Another fortunate circumstance was that the probabilities given by (6) are always \ge 0,
so that there is no necessity for imposing the non-negativity constraints in (4), since these
are automatically satisfied.
These two fortunate circumstances favoured the great success of the maximum entropy
principle, since in all optimization problems, great difficulties arise when we have to decide
whether

• the extreme value found is a maximum or minimum

• the maximum obtained is local or global, and

• the non-negativity constraints are satisfied.

By using Shannon's measure, all the three problems are automatically taken care of, at
least for linear constraints.
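
As a purely illustrative aside (not part of the original development), the multipliers \lambda_0, \lambda_1, \ldots, \lambda_m in (6) can be computed numerically by minimizing the convex dual function \ln \sum_{i} \exp(-\sum_{r} \lambda_r g_r(i)) + \sum_{r} \lambda_r a_r over \lambda_1, \ldots, \lambda_m. The Python sketch below is our own; the function name, the constraint functions and the moment value are arbitrary choices made only for the example, and numpy/scipy are assumed to be available:

    import numpy as np
    from scipy.optimize import minimize

    def maxent_probabilities(g, a):
        """Maximize Shannon entropy subject to sum(p) = 1 and g @ p = a.
        g : (m, n) array with g[r, i] = g_r(i); a : (m,) array of prescribed moments.
        Returns a maximum-entropy distribution of the exponential form (6)."""
        m, n = g.shape

        def dual(lam):
            # Convex dual of the MaxEnt problem: ln Z(lambda) + lambda . a
            z = np.exp(-lam @ g)               # unnormalized probabilities
            return np.log(z.sum()) + lam @ a

        res = minimize(dual, x0=np.zeros(m), method="BFGS")
        z = np.exp(-res.x @ g)
        return z / z.sum()

    # Illustrative example: n = 6 outcomes, one constraint fixing the mean at 3.5
    g = np.arange(1, 7, dtype=float).reshape(1, 6)
    p = maxent_probabilities(g, np.array([3.5]))
    print(p, p @ g[0])                         # uniform distribution with mean 3.5

The exponential form in (6) also shows why the computed probabilities come out positive without any explicit non-negativity constraints, as noted above.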

3. RELATION BETWEEN INFORMATION-THEORETIC AND THERMODYNAMIC ENTROPIES

Consider a system of particles in which each particle can be in one of n energy levels
with energies \epsilon_1, \epsilon_2, \ldots, \epsilon_n. Let p_i be the probability of a particle being in the i-th energy
level, so that the expected energy of the system is given by

\sum_{i=1}^{n} p_i \epsilon_i = \bar{\epsilon}    (7)

Now suppose the only information we have about the system is the value of \bar{\epsilon}, i.e., we believe
that the system is characterised by its expected energy \bar{\epsilon}, so that the only information we
have about p_1, p_2, \ldots, p_n is given by (4) and (7). We have only two equations and n (> 2)
quantities to be determined. There may be an infinity of probability distributions consistent
with (4) and (7), and according to the maximum entropy principle we should choose that
distribution which maximizes (3) subject to (4) and (7). This gives
p_i = \frac{\exp(-\mu\epsilon_i)}{\sum_{i=1}^{n} \exp(-\mu\epsilon_i)}, \quad i = 1, 2, \ldots, n    (8)

where \mu is determined by using

\bar{\epsilon} = \frac{\sum_{i=1}^{n} \epsilon_i \exp(-\mu\epsilon_i)}{\sum_{i=1}^{n} \exp(-\mu\epsilon_i)}    (9)
The probability distribution given by (8) and (9) is called the Maxwell-Boltzmann dis-
tribution.
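
As an illustration added here (not in the original text), \mu can be obtained from (9) by one-dimensional root-finding, exploiting the monotonicity established in (10)-(13) below; the energy levels, the mean energy and the bracketing interval in this sketch are arbitrary, and numpy/scipy are assumed:

    import numpy as np
    from scipy.optimize import brentq

    def maxwell_boltzmann(energies, mean_energy):
        """Return the distribution (8), with mu chosen so that (9) holds."""
        eps = np.asarray(energies, dtype=float)

        def excess_mean(mu):
            w = np.exp(-mu * (eps - eps.min()))    # shifted for numerical stability
            return (eps * w).sum() / w.sum() - mean_energy

        # excess_mean is monotone decreasing in mu, so one sign change suffices
        mu = brentq(excess_mean, -50.0, 50.0)
        w = np.exp(-mu * (eps - eps.min()))
        return w / w.sum(), mu

    p, mu = maxwell_boltzmann([1.0, 2.0, 3.0, 4.0], mean_energy=2.0)
    print(p, (p * np.array([1.0, 2.0, 3.0, 4.0])).sum())   # the mean comes out 2.0

For a root to exist, mean_energy must of course lie strictly between the smallest and largest energy levels.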
So far \mu is just a mathematical concept, a Lagrange multiplier, but from (9) \mu is a
function of \bar{\epsilon}. This gives

\frac{d\bar{\epsilon}}{d\mu} = \frac{\left[\sum_{i=1}^{n} \exp(-\mu\epsilon_i)\right]\left[-\sum_{i=1}^{n} \epsilon_i^2 \exp(-\mu\epsilon_i)\right] + \left[\sum_{i=1}^{n} \epsilon_i \exp(-\mu\epsilon_i)\right]^2}{\left[\sum_{i=1}^{n} \exp(-\mu\epsilon_i)\right]^2}    (10)
By using the Cauchy-Schwarz inequality, we find

\frac{d\bar{\epsilon}}{d\mu} \le 0    (11)

and the equality sign holds only when

\epsilon_1 = \epsilon_2 = \cdots = \epsilon_n    (12)

We shall assume that (12) is not satisfied, so that

\frac{d\bar{\epsilon}}{d\mu} < 0, \quad \frac{d\mu}{d\bar{\epsilon}} < 0,    (13)

so that \mu is a decreasing function of \bar{\epsilon} and if we define T by


\mu = \frac{1}{kT},    (14)

where k is a suitable constant, this T is an increasing function of \bar{\epsilon}. Thus T can now be
interpreted as a physical entity characteristic of the system which increases with expected
energy \bar{\epsilon}. We define it as the temperature of the system. Now from (7)
d\bar{\epsilon} = \sum_{i=1}^{n} p_i\, d\epsilon_i + \sum_{i=1}^{n} \epsilon_i\, dp_i    (15)

This means that a change in \bar{\epsilon} can be due either to changes in the individual energy
levels or to changes in the proportions of particles in the different energy levels. Now the
changes d\epsilon_1, d\epsilon_2, \ldots, d\epsilon_n can be brought about by doing work on the system. As such we write

\sum_{i=1}^{n} p_i\, d\epsilon_i = -\Delta W    (16)
and call \Delta W the work effect. We also write

\sum_{i=1}^{n} \epsilon_i\, dp_i = \Delta H    (17)

and call it the heat effect.
From (15), (16) and (17),

d\bar{\epsilon} = -\Delta W + \Delta H    (18)
and, for a fixed expected energy \bar{\epsilon},

-\Delta W + \Delta H = 0    (19)
Now there may be an infinity of possible probability distributions with the same \bar{\epsilon}. All of
them will have different information-theoretic entropies and the maximum entropy will be
given by

S_{max} = -\sum_{i=1}^{n} p_i \ln p_i = -\sum_{i=1}^{n} p_i\left[-\mu\epsilon_i - \ln\sum_{i=1}^{n} \exp(-\mu\epsilon_i)\right] = \mu\bar{\epsilon} + \ln\sum_{i=1}^{n} \exp(-\mu\epsilon_i),    (20)
so that if we do not change \epsilon_1, \epsilon_2, \ldots, \epsilon_n,

dS_{max} = \bar{\epsilon}\, d\mu + \mu\, d\bar{\epsilon} + \frac{\sum_{i=1}^{n} (-\epsilon_i) \exp(-\mu\epsilon_i)}{\sum_{i=1}^{n} \exp(-\mu\epsilon_i)}\, d\mu = \mu\, d\bar{\epsilon} = \frac{\Delta H}{kT}    (21)
Comparing this with the corresponding expression for thermodynamic entropy, we conclude
that S_{max} corresponds to the thermodynamic entropy.

Thus thermodynamic entropy is the maximum value of information-theoretic entropy
when the system has a fixed expected energy \bar{\epsilon} and \epsilon_1, \epsilon_2, \ldots, \epsilon_n are not changed.

4. PRIMARY AND SECONDARY ENTROPIES

If there is no additional information about p_1, p_2, \ldots, p_n beyond the natural constraint, the
maximum value of entropy is \ln n, which will be called the primary entropy. However, if
there is some information in the form of our knowing the values of one or more moments,
the maximum value of entropy will be accordingly reduced and this entropy will be called
the secondary entropy corresponding to the given constraints. This can be given a name
according to the context of the application. Thus, the entropy corresponding to the energy
constraint (7) is called the thermodynamic entropy. Similarly, if there is a constraint on
average cost in an economic system, the corresponding maximum entropy may be called
economodynamic entropy [32, 33].
Likewise, if both the expected energy and the expected number of particles are pre-
scribed and the number of particles in any energy level can be any number from 0 to
\infty, then the maximum entropy is called Bose-Einstein entropy. If instead the number of
particles in any energy level can be only 0 or 1, it will be called Fermi-Dirac entropy.
Again, in the same way, if in an urban transportation problem, the number of persons
living in m residential colonies and the number of persons working in n offices are known
and if the expected travel cost is prescribed, then the maximum value of entropy may be
called transportation entropy or simply the interactivity of the transportation system [4].

5. CHARACTERIZATION OF PROBABILITY DISTRIBUTIONS

Let f(x) be the probability density function of a continuous random variate varying over
a finite or infinite interval; then, by analogy with (3), its entropy is defined as

-\int f(x) \ln f(x)\, dx    (22)

This, however, cannot represent uncertainty in a strict sense, since it can be negative
and is not invariant under coordinate transformations.
However the definition in (22) does not cause any serious problem when it is used
for implementing Jaynes' maximum entropy principle, since our object is not to find the
maximum value of (22), but to find the density function which makes (22) a maximum.
This can always be done.
Thus, let the range be (-\infty, \infty) and let the only information available about f(x) be
given by

\int_{-\infty}^{\infty} f(x)\, dx = 1, \quad \int_{-\infty}^{\infty} x f(x)\, dx = m, \quad \int_{-\infty}^{\infty} (x - m)^2 f(x)\, dx = \sigma^2    (23)

i.e., we know only the mean and variance of the distribution. There can be an infinity of
probability distributions with the same mean m and the same variance \sigma^2, and the distribution
which has the maximum entropy out of these is obtained by maximizing (22) subject to
(23), so that

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - m)^2}{2\sigma^2}\right],    (24)

which is the normal distribution.


Thus out of all continuous variate distributions over (-\infty, \infty) which have the same
mean m and same variance \sigma^2, the normal distribution has the maximum entropy. Thus the
normal distribution is characterised by the two simplest moments, viz. mean and variance,
and this accounts for the importance of the normal distribution.
Even a layman can understand the characterisation of the normal distribution in terms
of mean and variance; it may be difficult for him to understand its characterisation by the
expression (24) involving the transcendental numbers \pi and e.
In the same way, almost all important probability distributions, univariate or mul-
tivariate, discrete or continuous, Lagrangian or non-Lagrangian, can be characterised as
maximum entropy probability distributions (MEPD) in terms of some simple moments.
Thus we have the table:

Range              Characterising moments                       MEPD

(0, \infty)            E(x)                                         exponential
(0, \infty)            E(x), E(\ln x)                               Gamma
(0, \infty)            E(\ln x), E[(\ln x)^2]                       log normal
(0, 1)             E(\ln x), E(\ln(1 - x))                      Beta
(0, \infty)            E(\ln x), E(\ln(1 + x))                      Beta
(-\infty, \infty)          E(x_i), E(x_i x_j); i, j = 1, 2, \ldots, n   Multivariate normal
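
As a small numerical aside (ours, not the authors'): the maximum value attained by (22) under the constraints (23) is \frac{1}{2}\ln(2\pi e\sigma^2), the entropy of the normal density (24). The check below evaluates (22) for a normal density by quadrature and compares it with this closed form; the mean and standard deviation are arbitrary, and numpy/scipy are assumed:

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    m, sigma = 1.0, 2.0                        # illustrative mean and standard deviation

    def neg_f_ln_f(x):
        f = norm.pdf(x, loc=m, scale=sigma)
        return 0.0 if f == 0.0 else -f * np.log(f)   # integrand of (22), guarded in the far tails

    entropy_numeric, _ = quad(neg_f_ln_f, -np.inf, np.inf)
    entropy_exact = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)
    print(entropy_numeric, entropy_exact)      # both approximately 2.112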

6. PRINCIPLE OF MINIMUM CROSS-ENTROPY

Suppose on the basis of intuition, experience or theory, we have a feeling that the probability
distribution should be q_1, q_2, \ldots, q_n. To confirm our guess, we take some observations or
otherwise determine some characterizing moments of the distribution.
There may be many probability distributions which may be consistent with the given
constraints. Out of those, we choose the one which is 'nearest', in some sense, to the given
'a priori' distribution.
The principle we use here is similar to the principle of maximum entropy. The first part
is the same viz., use all the information you are given. In the second part instead of saying
that we should be as uncertain about the missing information as possible, we say that we
should be as near to our intuition and experience as possible.
To implement this principle, we need a measure of 'distance' or 'discrepancy' or 'di-
vergence' of a distribution P from a given distribution Q. One such measure was given
by Kullback and Leibler [37], even before the principle of maximum entropy was explicitly
stated. The measure was
D(P : Q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}    (25)

It can easily be verified that


(i) D(P : Q) \ge 0    (26)

(ii) D(P : Q) = 0 if and only if p_i = q_i for each i    (27)

(iii) D(P : Q) is a convex function of p_1, p_2, \ldots, p_n so that when it is minimized subject to
linear constraints, its local minimum is its global minimum.

(iv) If no a priori probability distribution is available and if, according to Laplace's principle
of insufficient reason, we choose Q = U, the uniform distribution, we get

D(P : U) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{1/n} = \ln n - \left(-\sum_{i=1}^{n} p_i \ln p_i\right),    (28)

so that minimizing D(P : U) would be equivalent to maximizing Shannon's measure of
entropy (3).
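
A short numerical check of (25)-(28), added here for illustration only (the distribution P is arbitrary and all probabilities are assumed strictly positive):

    import numpy as np

    def cross_entropy(p, q):
        """Kullback-Leibler directed divergence D(P : Q) of (25)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(p * np.log(p / q)))

    p = np.array([0.5, 0.3, 0.2])
    u = np.full(3, 1.0 / 3.0)                  # uniform distribution U
    H = -np.sum(p * np.log(p))                 # Shannon entropy (3) with K = 1

    print(cross_entropy(p, u))                 # D(P : U)
    print(np.log(3) - H)                       # ln n - H(P); identical, as (28) asserts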
According to Kullback's [38] principle of minimum cross-entropy, given an a priori prob-
ability distribution Q and given constraints (4) and (5), we should choose that probability
distribution P which minimizes (25) subject to (4) and (5), so that we get

p_i = q_i \exp(-\lambda_0 - \lambda_1 g_1(i) - \cdots - \lambda_m g_m(i)), \quad i = 1, 2, \ldots, n    (29)

Again the global minimum always exists and each p_i \ge 0. This principle includes Jaynes'
principle of maximum entropy as a special case when Q = U. This is obvious since the
uniform distribution is the most uncertain distribution.

7. APPLICATIONS OF JAYNES' MaxEnt AND KULLBACK'S MinxEnt PRINCIPLES

These two principles have had tremendous applications in a large variety of fields. Kapur
[16] has surveyed developments during 1957-1982. Some thoughts on the scientific and
philosophical foundations of maximum-entropy principle were given in [24]. Other early
surveys were given in [17, 19, 21].
In [30], these principles were used to solve the following fourteen problems:

Problem 1: Given a system of particles with possible energy levels \epsilon_1, \epsilon_2, \ldots, \epsilon_n and with
average energy \bar{\epsilon}, obtain estimates for p_1, p_2, \ldots, p_n, where p_i is the probability of a
particle being in the i-th energy level.

Problem 2: If in Problem 1, the expected total number of particles in the system is also
known to be N, obtain estimates for the expected number of particles in each energy
level.
Problem 3: Given 'n' residential colonies, with costs c_1, c_2, \ldots, c_n of travel from these to
the central business district, and being also given the expected average cost

C = \sum_{i=1}^{n} p_i c_i,

estimate the proportions p_1, p_2, \ldots, p_n of the total population living in these colonies.

Problem 4: Given (i) populations a_1, a_2, \ldots, a_m of 'm' residential colonies, (ii) the number of
jobs b_1, b_2, \ldots, b_n in 'n' offices, (iii) the cost c_{ij} of travel from the i-th colony to the j-th office
(i = 1, 2, \ldots, m; j = 1, 2, \ldots, n), and (iv) the travel budget

C = \sum_{j=1}^{n} \sum_{i=1}^{m} c_{ij} T_{ij};

obtain estimates for T_{ij}, the number of trips from the i-th colony to the j-th office.

Problem 5: Given a network of 'm' queues and the average sizes of the queues a_1, a_2, \ldots, a_m,
estimate p(n_1, n_2, \ldots, n_m), the probability of there being n_1, n_2, \ldots, n_m persons
in the 'm' queues.

Problem 6: Given the average absorption \int f(x, y)\, dy per unit length in a slice of tissue,
of a photon beam of length l sent through the slice, for a large number of such beams,
estimate the coefficient f(x, y) at every point of the slice.

Problem 7: Given market shares of a number of brands of a product in a market, estimate


(i) the proportion of consumers loyal to each brand, (ii) the proportion of consumers
loyal to each pair or each triplet, etc. of brands, and (iii) probabilities of switching
from one brand to another.

Problem 8: Given voting shares of a number of political parties, estimate (i) the number
of voters loyal to each party, each pair of parties, each triplet of parties and so on,
and (ii) the probability of switching of a voter from a given party to another.

Problem 9: Given (i) the number of beds in each ward of a hospital, (ii) the cost of each
bed in each ward, (iii) the total occupancy of the hospital, and (iv) the total revenue,
estimate the number of beds occupied in each ward.

Problem 10: Given that a continuous random variate varies from -\infty to \infty and given
that its mean is 'm' and variance is \sigma^2, estimate its probability density function.

Problem 11: Given (i) the ranges of values for each component of a multidimensional
random variate, (ii) whether variate is continuous or discrete, and (iii) some moments
of the distribution of the random variate, estimate the probability density function
for the random variate.

Problem 12: Given \sum_{j=1}^{n} a_{ij} x_j = c_i, i = 1, 2, \ldots, m, m \le n; \sum_{j=1}^{n} x_j = 1; x_1, x_2, \ldots, x_n \ge 0;
estimate x_1, x_2, \ldots, x_n.

Problem 13: Given a contingency table of any dimension, estimate dependence in it.

Problem 14: Given n random variables x_1, x_2, \ldots, x_n with x_1 + x_2 + \cdots + x_n = 1, each
x_i \ge 0, estimate the density functions for x_1, x_2, \ldots, x_n.

The first comprehensive book on Maximum Entropy Models in Science and Engineering
is [26]. It devotes four chapters to describing discrete univariate, continuous univariate, dis-
crete multivariate and continuous multivariate maximum-entropy probability distributions.
In one chapter it obtains Maxwell-Boltzmann, Fermi-Dirac, Bose-Einstein and Intermedi-
ate Statistics distributions of statistical mechanics from MaxEnt. In another chapter, it
gives MaxEnt discussion of thermodynamics of closed and open or diffusive systems. One
chapter is devoted to maximum entropy models in regional and urban planning and this
discusses population distribution and transportation models and Fermi-Dirac and Bose-
Einstein distributions for residential location and trip distributions. Another chapter is
devoted to maximum-entropy models for brand-switching in marketing and vote-switching
in elections.
One chapter is devoted to obtaining information in contingency tables and another is
devoted to applications to statistical inference, non-parametric density estimation and other
applications in statistics.
One chapter is devoted to economics, finance, insurance and accountancy. It discusses
economic inequalities, optimum taxation policies, international trade models, stock market
models, loss of information on account of aggregation in accountancy, etc.
Another chapter deals with maximum-entropy principles in operations research. It
discusses MaxEnt in search theory, reliability theory, queueing theory, theory of games and
optimal portfolio analysis.
Three chapters deal with recent engineering applications of MaxEnt to spectral analysis,
image reconstruction and pattern recognition. These discuss comparison of MESA with
other methods of spectral analysis and multi-dimensional MESA, grey-level thresholding,
computerised tomography, Karhunen-Loeve expansion and pattern recognition as a quest
for minimum entropy [47].
The final chapter deals with MaxEnt in pharmacokinetics, epidemic models, ecology,
design of experiments, and logistic law of population growth.
The list of topics given above and the 640 references given in [26] give an idea of the
all-pervading applications of MaxEnt and MinxEnt principles.
Another set of about three hundred papers on MaxEnt and Bayesian methods has ap-
peared in the proceedings of the ten MaxEnt conferences [3, 11, 43, 44, 45]. Besides a
discussion of theory of MaxEnt, MinxEnt and Bayesian statistics, these discuss applica-
tions to fields like magnetohydrodynamics, plasma physics, turbulence, condensed matter
physics, energy dissipation, random cellular structures, drug absorption, nuclear magnetic
resonance, image reconstruction, cyclotron resonance, mass spectroscopy, underwater stud-
ies, magnetic resonance imaging, crystallography, chemical spectroscopy, time series, neural
networks, structural molecular biology, expert systems and information retrieval.

8. GENERALISED MAXIMUM ENTROPY AND MINIMUM CROSS-ENTROPY PRINCIPLES

The generalised minimum cross-entropy principle requires that out of all probability distri-
butions satisfying given constraints, we should choose that probability distribution P which
is 'nearest' to a given a priori probability distribution Q.

Similarly, the generalised maximum-entropy principle requires that, in the absence of
any knowledge about Q, out of all probability distributions satisfying given constraints, we
would choose that probability distribution which is 'nearest' to the most random distribu-
tion, i.e., the uniform distribution.
In both these generalised principles, the measure of 'nearness' or 'closeness' or 'distance'
is not specified; it is left at our disposal. If we choose Kullback-Leibler measure, we get
Kullback's principle of minimum cross-entropy and Jaynes' principle of maximum entropy.
However, a large number of other measures of 'distance' of a probability distribution
P from another probability distribution Q have been proposed [9, 18, 32, 33, 39] and can
be used. Some of these are
\left[\sum_{i=1}^{n} (p_i - q_i)^2\right]^{1/2}, \quad \sum_{i=1}^{n} |p_i - q_i|, \quad \left[\sum_{i=1}^{n} |p_i - q_i|^k\right]^{1/k},    (30)

\frac{1}{\alpha - 1}\left[\sum_{i=1}^{n} p_i^{\alpha} q_i^{1-\alpha} - 1\right], \quad (Havrda and Charvat [9]),    (31)

\frac{1}{\alpha - 1} \ln \sum_{i=1}^{n} p_i^{\alpha} q_i^{1-\alpha}, \quad (Renyi [39])    (32)

\sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i} - \frac{1}{a} \sum_{i=1}^{n} (1 + a p_i) \ln \frac{1 + a p_i}{1 + a q_i}, \quad (Kapur [12, 18, 24])    (33)
Which of these or other proposed measures we choose depends on our requirements. For
Euclidean distances, we have the requirements

(i) D(P : Q) \ge 0 (Non-negativity)    (34)

(ii) D(P : Q) = 0 iff P = Q (Identity)    (35)

(iii) D(P : Q) = D(Q : P) (Symmetry)    (36)

(iv) D(P : Q) + D(Q : R) \ge D(P : R) (Triangle Inequality)    (37)

In our case, we do require (i) and (ii) always. We do not require (iii), since we have
always to measure distances of various distributions from a given a priori distribution Q or
from the uniform distribution. We require only one-way or directed distances or directed
divergences.
We do not require (iv) either, since the triangle inequality arises from considerations of
geometrical distance in a Euclidean plane, whereas we are considering distances between
probability distributions.
We do not mind if (iii) and (iv) are satisfied, but we do not insist on these, since such
an insistence will restrict the choice of discrepancy measures.
We can now impose some additional requirements

(v) D(P: Q) should be a convex function of P so that when D(P : Q) is minimized subject
to linear constraints, the local minimum will be the global minimum.

(vi) When D(P : Q) is minimized subject to linear constraints, the minimizing probabil-
ities should automatically come out to be non-negative, since otherwise we have to
explicitly impose non-negativity constraints and this causes computational problems.

Requirements (v) and (vi) arise out of the need to simplify mathematical computations,
but they are important requirements for the practical implementation of the maximum
entropy and minimum cross-entropy principles.
The Kullback-Leibler measure satisfies conditions (i), (ii), (v) and (vi).
There are other measures which also satisfy these, and there should be no hesitation in
using them.
The advantage of using these measures over the Kullback-Leibler measure arises because
each of them involves one or more parameters, and these parameters can be chosen to give
a better fit to the given data.
Thus, while use of the Kullback-Leibler measure, or the corresponding Shannon measure,
leads to only one model of population growth, viz. the exponential law of population growth,
use of Kapur's measure leads to the logistic model, with one parameter corresponding to the
carrying capacity of the environment [13, 27].
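
For concreteness, the parametric measures (31)-(33) can be computed directly. The sketch below is only an illustration (the distributions and parameter values are arbitrary, all probabilities are assumed positive, and \alpha \ne 1):

    import numpy as np

    def havrda_charvat(p, q, alpha):
        """Havrda-Charvat directed divergence, eq. (31)."""
        return (np.sum(p**alpha * q**(1.0 - alpha)) - 1.0) / (alpha - 1.0)

    def renyi(p, q, alpha):
        """Renyi directed divergence, eq. (32)."""
        return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

    def kapur(p, q, a):
        """Kapur's measure, eq. (33)."""
        kl = np.sum(p * np.log(p / q))
        correction = np.sum((1.0 + a * p) * np.log((1.0 + a * p) / (1.0 + a * q))) / a
        return kl - correction

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    for alpha in (0.5, 2.0):
        print(alpha, havrda_charvat(p, q, alpha), renyi(p, q, alpha))
    print(kapur(p, q, a=1.0))

As \alpha \to 1, both (31) and (32) reduce to the Kullback-Leibler measure (25); the extra parameters \alpha and a are what give these families the flexibility to fit given data better.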

9. INVERSE ENTROPY OPTIMIZATION PRINCIPLES

If we are given the moments, the problem of finding the MEPD will be called the direct
problem. The inverse problem is concerned with finding the characterising moments when
a given probability distribution is regarded as a MEPD.
Thus suppose we find that the observed probability distribution is the normal distribu-
tion and we want to know which moments will characterise it as a MEPD. We find that
if we are given two independent moments E(a_1 x + b_1 x^2), E(a_2 x + b_2 x^2), we will get the
normal distribution.
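
A brief sketch of the reasoning, in our own words: writing the logarithm of the normal density (24) as

\ln f(x) = -\ln(\sqrt{2\pi}\,\sigma) - \frac{(x - m)^2}{2\sigma^2} = c_0 + c_1 x + c_2 x^2,

we see that f(x) = \exp(c_0 + c_1 x + c_2 x^2) has exactly the maximum-entropy form (6) with constraint functions x and x^2, or equivalently any two independent linear combinations a_1 x + b_1 x^2 and a_2 x + b_2 x^2; the characterising moments are therefore E(x) and E(x^2), i.e., the mean and the variance.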
The problem of existence and uniqueness of solutions of the inverse maximum entropy and
minimum cross-entropy principles has been studied by us in [35].
Thus, if we know that the income distribution in a society is the Pareto distribution, we can
show, by using the inverse maximum entropy principle, that the characterising moment is
E(\ln x), which determines the geometric mean of income. Thus we know that in this society
it is not the amount of income which matters; rather, it is the logarithm of the amount of
income that matters. In other words, the utility function in this society is logarithmic [29].
Similarly, if we find that the probability distribution for the intensity of earthquakes has
three parameters, we can say that the earthquakes in that region are determined by three
seismological characteristics of the region, and we can try to find a physical interpretation
for these parameters.
Another very important result we get by using the inverse principles is that if, in a
closed queueing network (for flexible manufacturing systems or computer systems), the
probability distribution is of the product form, then the characterising information is about
the mean lengths of the queues [29, 32, 33].
Thus the inverse principles can enable us to find the probabilistic causes of given prob-
abilistic systems.

10. PRINCIPLE OF MINIMUM INTERDEPENDENCE

There may be an infinity of multivariate probability distributions of x_1, x_2, \ldots, x_m which may
have some given joint moments or marginal distributions. Out of these we want, according to
this principle, that distribution which gives minimum interdependence among the variates.
For this we need a measure of dependence among the variates. The correlation coefficient is
not useful because there are m(m - 1)/2 correlation coefficients measuring linear dependence
and we need only one measure measuring all types of dependence. Such a measure is given
by (Watanabe [48])

D = S_1 + S_2 + \cdots + S_m - S,    (38)

where S_i is the entropy of the probability distribution of the i-th variate and S is the entropy
of the joint distribution. The principle of minimum interdependence was first stated by
Guiasu et al. [8].
This principle was further discussed in [20] and was used in Kapur [24] to solve an im-
portant problem in pattern recognition. The problem is to find the matrix A so that the
components of Y = AX are as independent as possible, where X is a given normal variate.
It was shown that

A = [U_1, U_2, \ldots, U_m]^T,    (39)

where U_1, U_2, \ldots, U_m are the eigenvectors corresponding to the m largest eigenvalues of the
correlation matrix of X.
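
As an illustration added here (not from the original), Watanabe's measure (38) can be computed directly for a discrete joint distribution stored as an array; the 2 x 2 examples below are arbitrary, and for two variates D reduces to the ordinary mutual information (numpy assumed):

    import numpy as np

    def shannon(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def interdependence(joint):
        """Watanabe's measure D = S_1 + ... + S_m - S of eq. (38)."""
        joint = np.asarray(joint, dtype=float)
        S_joint = shannon(joint.ravel())
        S_marginals = sum(
            shannon(joint.sum(axis=tuple(j for j in range(joint.ndim) if j != i)))
            for i in range(joint.ndim)
        )
        return S_marginals - S_joint

    independent = np.outer([0.6, 0.4], [0.7, 0.3])       # D = 0 (product distribution)
    dependent = np.array([[0.45, 0.05], [0.05, 0.45]])   # D > 0
    print(interdependence(independent), interdependence(dependent))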

11. DUAL OPTIMIZATION PRINCIPLES

The maximum entropy probability distribution gives the most unbiased distribution satisfying
given constraints. In the same way, the minimum entropy probability distribution gives the
most biased distribution satisfying the same constraints, and the true distribution will have
an entropy between the maximum and the minimum entropies.
As more and more information becomes available, the maximum entropy will decrease
and the minimum entropy will increase till the two coincide. Any further increase in infor-
mation will not change the entropy and we will have got the maximum information about
the system.
This is the problem of pattern recognition, since, knowing all the information, we can con-
struct the pattern. In practice we can even stop when the maximum and minimum entropies
are very close to each other.
Watanabe [47] described pattern recognition as a quest for minimum entropy.
The principle of minimum entropy is the dual of the principle of maximum entropy, but
it is more difficult to implement since it involves minimizing a concave function.
In the same way, principles of maximum cross-entropy and of maximum interdependence
will be duals of principles of minimum cross-entropy and minimum interdependence.

We have similar dual principles for generalised and inverse principles.

12. MINIMAX ENTROPY AND MAXIMUM CROSS-ENTROPY PRINCIPLES

If we have a number of classes, entropy can be decomposed into entropy within classes and
entropy between classes [32]. In general, we want to choose the classes in such a manner
that the entropy within classes is maximum and the entropy between classes is minimum,
so that each class is as homogeneous as possible and different classes are as distinct as
possible. This is an important requirement in cluster analysis, group technology, flexible
manufacturing systems, etc., and the principle of minimax entropy can enable us to achieve
these goals. This requires maximization of one type of entropy and minimization of entropy
of another type.
The dual of this principle will be the maximin entropy principle.
We will get two other principles by replacing entropy by cross-entropy.
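
A minimal sketch of the decomposition referred to above (our own illustration; the distribution and its grouping into classes are arbitrary): if the outcomes are partitioned into classes with class probabilities P_k and normalised within-class distributions, Shannon's entropy splits exactly into a between-class term plus the expected within-class term.

    import numpy as np

    def shannon(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    # Arbitrary distribution over 5 outcomes, partitioned into two classes
    p = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
    classes = [[0, 1], [2, 3, 4]]

    P_class = np.array([p[idx].sum() for idx in classes])      # class probabilities
    H_between = shannon(P_class)                               # entropy between classes
    H_within = sum(Pk * shannon(p[idx] / Pk)                   # expected entropy within classes
                   for Pk, idx in zip(P_class, classes))

    print(shannon(p), H_between + H_within)    # the two numbers agree

The minimax entropy principle then amounts to searching over partitions for one that makes the within-class term large and the between-class term small.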

13. GENERATION OF MEASURES OF ENTROPY FROM MATHEMATICAL MODELS

For given constraints and for a given measure of entropy or cross-entropy, we can get a
corresponding maximum-entropy or minimum cross entropy probability distribution model.
For different measures of entropy, we should get different models. Conversely, given a
mathematical model and given constraints, we can find a corresponding measure of entropy
which will lead to the given model.
This method has been used to generate measures of entropy from mathematical models
of population growth, innovation diffusion, epidemic models and chemical kinetics. Most of
the measures so obtained are the same as measures obtained from axiomatic considerations;
however, we also obtain a few new measures.
This shows that in every situation, a suitable measure of entropy is being maximized
subject to suitable constraints.
Our object in scientific research is to find these appropriate measures of entropy and
the corresponding appropriate constraints. In many cases, Shannon's measure is the appro-
priate measure of entropy and in this case, the problem reduces to simply that of finding
appropriate constraints.

14. PROLIFERATION OF ENTROPY OPTIMIZATION PRINCIPLES

This proliferation arises because we can deal with



(i) entropy / cross-entropy / dependence (3)

(ii) direct / inverse principles (2)

(iii) maximize / minimize / maximize and minimize (3)

(iv) primal / dual principles (2)

(v) usual / generalised measures (3)

Even with three measures, we get 3 x 2 x 3 x 2 x 3 = 108 principles, and if we use other
generalised measures, the number of principles increases rapidly.
Each principle has a variety of applications in a large number of fields.
Apart from the applications, each principle leads to a number of challenging open math-
ematical problems, the solution of which may require new mathematical (analytical, nu-
merical, simulation, computer) techniques.
The discussion of these principles was initiated in [26] and will be continued in [33].

15. CONCLUDING REMARKS

There are many persons who are satisfied merely with MaxEnt and its rich applications.
They want to avoid the philosophical, mathematical and computational complications that
may arise out of the other principles. But these new principles have tremendous possibilities
of providing great insights and of exploring much wider classes of phenomena. It is hoped
that some of the readers of this paper will come forward to explore the new possibilities
that have been revealed by these entropy optimization principles.

REFERENCES

1 R. Christensen (1981) Entropy Minimax Source Book, 1-4, Entropy Ltd., Lincoln, Mass.

2 I. Csiszar (1972) "A class of measures of informativity of observation channels", Periodica
Math. Hungarica, Vol. 2, pp. 191-213.

3 J.G. Erickson and C.R. Smith (editors) (1988) Maximum Entropy and Bayesian Methods
in Science and Engineering, Vol. 1 (Foundations), Vol. 2 (Applications). Kluwer
Academic Publishers, New York.

4 S. Erlander (1978) Optimal Interaction and the Gravity Models, Springer-Verlag, New
York.

5 B. Forte and C. Sempi (1976) "Maximizing conditional entropy: A derivation of quantal
statistics", Rendiconti di Matematica, Vol. 9, pp. 551-566.

6 P.F. Fougere (ed) (1990) Maximum Entropy and Bayesian Methods, Kluwer Academic
Press, New York.

7 N. Georgescu-Roegen (1971) The Entropy Law and the Economic Process, Harvard Uni-
versity Press, Cambridge, Mass.

8 S. Guiasu, R. Leblanc and L. Reischer (1982) "On the principle of minimum interdepen-
dence", J. Inf. Opt. Sci., Vol. 3, pp. 149-172.

9 J.H. Havrda and F. Charvat (1967) "Quantification methods of classification processes:
Concepts of structural entropy", Kybernetika, Vol. 3, pp. 30-35.

10 E.T. Jaynes (1957) "Information theory and statistical mechanics", Physical Review,
Vol. 106, pp. 620-630.

11 J.H. Justice (editor) (1986) Maximum Entropy and Bayesian Methods in Applied Statis-
tics, Cambridge University Press, Boston.

12 J.N. Kapur (1972) "Measures of uncertainty, mathematical programming and physics",
Jour. Ind. Soc. Agri. Stat., Vol. 24, pp. 46-66.

13 J.N. Kapur (1983) "Derivation of logistic law of population growth from maximum
entropy principle", Nat. Acad. Sci. Letters, Vol. 6, No. 12, pp. 429-433.

14 J.N. Kapur (1983) "Comparative assessment of various measures of entropy", Jour. Inf.
and Opt. Sci. Vol. 4, No.1, pp. 207-232.

15 J.N. Kapur (1983) "Non-additive measures of entropy and distributions of statistical
mechanics", Ind. Jour. Pure and App. Math., Vol. 1, No. 11, pp. 1372-1384.

16 J.N. Kapur (1983) "Twenty-five years of maximum entropy", Jour. Math. Phy. Sci.
Vol. 17, No.2, pp. 103-156.

17 J.N. Kapur (1984) "The role of maximum entropy and minimum discrimination infor-
mation principles in statistics", Jour. Ind. Soc. Agri. Stat., Vol. 36, No.3, pp.
12-55.

18 J.N. Kapur (1984) "A comparative assessment of various measures of directed diver-
gences", Advances in Management studies, Vol. 3, pp. 1-16.

19 J.N. Kapur (1984) "On maximum entropy principle and its applications to science and
engineering", Proc. Nat. Symposium on Mathematical Modelling, MRI, Allahabad,
India, pp. 75-78.

20 J.N. Kapur (1984) "On minimum interdependence principle", Ind. Jour. Pure and App.
Math. 15(9), 968-977.

21 J.N. Kapur (1984) "Maximum entropy models in science and engineering", Proc. Nat.
Acad. Sciences, (Presidential Address, Physical Sciences Section), Annual number,
pp. 35-57.

22 J.N. Kapur, P.K. Sahoo and A.K.C. Wong (1985) "A new method of grey level picture
thresholding using the entropy of the histogram", Computer Vision, Graphics and Image
Processing, Vol. 29, pp. 273-285.

23 J.N. Kapur (1986) "Application of entropic measures of stochastic dependence", Pattern
Recognition, Vol. 19, pp. 473-476.

24 J.N. Kapur (1985) "Some thoughts on scientific and philosophical foundation of the
maximum entropy principle", Bull. Math. Ass. Ind., 17, pp. 15-40.

25 J.N. Kapur (1986) "Four families of measures of entropy", Ind. Jour. Pure and App.
Maths., Vol. 17, No.4, pp. 429-449.

26 J.N. Kapur (1990) Maximum-Entropy Models in Science and Engineering, John Wiley,
New York.

27 J.N. Kapur "Applications of generalised maximum entropy principle to population dy-
namics, innovation diffusion models and chemical reactions". To appear in Journ.
Math. Phy. Sciences.

28 J.N. Kapur "Entropy optimization principles and their applications". To appear in
Mathematics Student.

29 J.N. Kapur and H.K. Kesavan (1990) "Inverse MaxEnt or MinxEnt principles and their
applications", In Maximum Entropy and Bayesian Methods, edited by P. Fougere,
pp. 433-450.

30 J.N. Kapur (1989) Maximum entropy principle, large-scale systems and cybernetics, In
Artificial Intelligence, Ed. by A. Ghoshal et al., South Asia Publishers, New Delhi.

31 H.K. Kesavan and J.N. Kapur (1989) "The generalised maximum entropy principle",
IEEE Trans. Syst. Man. Cyb., 19, pp. 1042-1052.

32 J.N. Kapur and H.K. Kesavan (1988) Generalised Maximum Entropy Principle (with
Applications), PP. 225, Sandford Educational Press, University of Waterloo.

33 J.N. Kapur and H.K. Kesavan (1991) Entropy Optimization Principles and their Appli-
cations. (book under publication).

34 J.N. Kapur and H.K. Kesavan (1989) "Generalised maximum entropy principle", Proc.
Int. Conf. Math. Mod., Vol. 2, IIT Madras, pp. 1-11.

35 H.K. Kesavan and J.N. Kapur (1990) "On the families of solutions of generalised maxi-
mum and minimum cross-entropy models", Int. Jour. Systems, Vol. 16, pp. 199-219.

36 J.N. Kapur and H.K. Kesavan (1990) "Maximum entropy and minimum cross entropy
principles: need for new perspectives", In Maximum Entropy and Bayesian Methods,
edited by P. Fougere, Kluwer Press, pp. 419-432.

37 S. Kullback and R.A. Leibler (1951) "On information and sufficiency", Ann. Math.
Stat., Vol. 22, pp. 79-86.

38 S. Kullback (1958) On Information Theory and Statistics, John Wiley, New York.

39 A. Renyi (1961) "On measures of entropy and information", Proc. 4th Berkeley Sym-
posium, Maths. Stat. Prob., Vol. 1, pp. 547-561.

40 J. Rifkin (1980) Entropy: A New World View, Vikas Press.



41 A.K. Seth (1989) "Prof. J.N. Kapur's Views on Entropy Optimization Principles", Bull.
Math. Ass. Ind., Vol. 21, pp. 1-38, Vol. 22, 1-42.

42 C.E. Shannon (1948) "A mathematical theory of communication", Bell System Tech.
J., Vol. 27, pp. 379-423, 623-659.

43 J. Skilling (Editor) (1989) Maximum Entropy and Bayesian Methods, Kluwer Academic
Publishers, New York.

44 C.R. Smith and W.T. Grandy, Jr. (1985) (eds) Maximum Entropy and Bayesian Meth-
ods in Inverse Problems, D. Reidel, Dordrecht, Holland.

45 C.R. Smith and J.G. Erickson (eds) (1989) Maximum-Entropy and Bayesian Spectral
Analysis and Estimation Problems, D. Reidel and Kluwer Academic Publishers, New
York.

46 M. Tribus (1979) "Thirty years of information theory", In The Maximum Entropy Formal-
ism, (eds) R.D. Levine and M. Tribus, MIT Press.

47 S. Watanabe (1981) "Pattern recognition as a quest for minimum entropy", Pattern
Recognition, Vol. 13, pp. 381-387.

48 S. Watanabe (1969) Knowing and Guessing, John Wiley, New York.
