In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of
all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is
known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures),
then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first,
maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to
move towards maximal entropy configurations over time.
If $X$ is a continuous random variable with probability density $p(x)$, then the differential entropy of $X$ is defined as[1][2][3]

$$H(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx.$$
This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and Differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing $H(X)$ will also maximize the more general forms.
The base of the logarithm is not important as long as the same one is used consistently: change of base merely results in a
rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians
and physicists will often prefer the natural logarithm, resulting in a unit of nats for the entropy.
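As a quick illustration of the base-change remark, here is a small Python sketch computing the entropy of the same discrete distribution in nats and in bits; the distribution `[0.5, 0.25, 0.25]` is an arbitrary example.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy of a discrete distribution p (a list of probabilities)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
h_nats = entropy(p)            # natural logarithm -> nats
h_bits = entropy(p, base=2)    # logarithm base 2  -> bits

# Changing the base only rescales the entropy by a constant factor (ln 2):
assert abs(h_bits - h_nats / math.log(2)) < 1e-12
print(h_bits)  # 1.5 bits for this distribution
```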
The choice of the measure $dx$ is, however, crucial in determining the entropy and the resulting maximum entropy distribution, even though the usual recourse to the Lebesgue measure is often defended as "natural".
Continuous case
Suppose $S$ is a closed subset of the real numbers $\mathbb{R}$ and we choose to specify $n$ measurable functions $f_1, \dots, f_n$ and $n$ numbers $a_1, \dots, a_n$. We consider the class $C$ of all real-valued random variables which are supported on $S$ (i.e. whose density function is zero outside of $S$) and which satisfy the $n$ moment conditions:

$$\operatorname{E}[f_j(X)] \geq a_j \quad \text{for } j = 1, \dots, n.$$

If there is a member in $C$ whose density function is positive everywhere in $S$, and if there exists a maximal entropy distribution for $C$, then its probability density $p(x)$ has the following form:

$$p(x) = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x) \right) \quad \text{for all } x \in S,$$

where we assume that $f_0(x) = 1$. The constant $\lambda_0$ and the $n$ Lagrange multipliers $\boldsymbol\lambda = (\lambda_1, \dots, \lambda_n)$ solve the constrained optimization problem with $a_0 = 1$ (this condition ensures that $p$ integrates to unity): the multipliers are chosen so that $p$ satisfies the moment conditions, subject to the constraint condition $\boldsymbol\lambda \geq \mathbf{0}$.[4]
Using the Karush–Kuhn–Tucker conditions, it can be shown that the optimization problem has a unique solution, because the objective function in the optimization is concave in $\boldsymbol\lambda$.
Note that if the moment conditions are equalities (instead of inequalities), that is,

$$\operatorname{E}[f_j(X)] = a_j \quad \text{for } j = 1, \dots, n,$$

then the constraint condition $\boldsymbol\lambda \geq \mathbf{0}$ is dropped, making the optimization over the Lagrange multipliers unconstrained.
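To make the theorem concrete, here is a small Python sketch for the simplest continuous case: support $S = [0, \infty)$ with the single moment constraint $\operatorname{E}[X] = m$. The theorem's exponential-family form $p(x) = c\,e^{\lambda x}$, once normalization and the moment condition are imposed, forces $\lambda = -1/m$ and $c = 1/m$, i.e. the exponential distribution; the grid and cutoff below are arbitrary numerical choices for the check.

```python
import math

# Maximum entropy on S = [0, inf) with one moment constraint E[X] = m:
# the theorem gives p(x) = c * exp(lam * x); normalization and the moment
# condition force lam = -1/m and c = 1/m, i.e. an exponential distribution.
m = 2.0
lam, c = -1.0 / m, 1.0 / m

# Numerical check by midpoint-rule integration on a truncated domain [0, 40].
dx = 0.001
xs = [(i + 0.5) * dx for i in range(40000)]
p = [c * math.exp(lam * x) for x in xs]

total = sum(p) * dx                               # should be ~1
mean = sum(x * pi for x, pi in zip(xs, p)) * dx   # should be ~m
print(round(total, 4), round(mean, 4))
```

The truncation at $x = 40$ is harmless here because the tail mass beyond it is of order $e^{-20}$.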
Discrete case
Suppose $S = \{x_1, x_2, \dots\}$ is a (finite or countably infinite) discrete subset of the reals, and we choose to specify $n$ functions $f_1, \dots, f_n$ and $n$ numbers $a_1, \dots, a_n$. We consider the class $C$ of all discrete random variables $X$ which are supported on $S$ and which satisfy the $n$ moment conditions

$$\operatorname{E}[f_j(X)] \geq a_j \quad \text{for } j = 1, \dots, n.$$

If there exists a member of $C$ which assigns positive probability to all members of $S$, and if there exists a maximum entropy distribution for $C$, then this distribution has the following shape:

$$\Pr(X = x_k) = \exp\left( \sum_{j=0}^{n} \lambda_j f_j(x_k) \right) \quad \text{for } k = 1, 2, \dots,$$

where we assume that $f_0(x) = 1$ and the constants $\lambda_0, \dots, \lambda_n$ solve the constrained optimization problem with $a_0 = 1$: the multipliers are chosen so that the probabilities sum to one and the moment conditions hold, subject to the constraint condition $\boldsymbol\lambda \geq \mathbf{0}$.[5]
Again, if the moment conditions are equalities (instead of inequalities), then the constraint condition is not present in the
optimization.
Proof in the case of equality constraints
In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

$$\int_{-\infty}^{\infty} f_j(x) p(x) \, dx = a_j, \quad j = 0, 1, \dots, n,$$

where $f_0(x) = 1$, $a_0 = 1$, and $\lambda_0, \dots, \lambda_n$ are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability. The other constraints are that the measurements of the function are given constants up to order $n$. The entropy attains an extremum when the functional derivative is equal to zero:

$$\frac{\delta}{\delta p(x)} \left[ -\int_{-\infty}^{\infty} p \log p \, dx + \sum_{j=0}^{n} \lambda_j \left( \int_{-\infty}^{\infty} f_j p \, dx - a_j \right) \right] = -\log p(x) - 1 + \sum_{j=0}^{n} \lambda_j f_j(x) = 0.$$

Therefore, the extremal entropy probability distribution in this case must be of the form ($c = e^{\lambda_0 - 1}$, using $f_0 \equiv 1$)

$$p(x) = c \exp\left( \sum_{j=1}^{n} \lambda_j f_j(x) \right),$$

remembering that $\int_{-\infty}^{\infty} p(x) \, dx = 1$. It can be verified that this is the maximal solution by checking that the variation around this solution is always negative.
Uniqueness of the maximum
Suppose $p$, $q$ are distributions satisfying the expectation-constraints. Letting $\alpha \in (0, 1)$ and considering the mixture distribution $r_\alpha = \alpha p + (1 - \alpha) q$, it is clear that this distribution satisfies the expectation-constraints and furthermore has as support $\operatorname{supp}(r_\alpha) = \operatorname{supp}(p) \cup \operatorname{supp}(q)$. From basic facts about entropy (concavity), it holds that $H(r_\alpha) \geq \alpha H(p) + (1 - \alpha) H(q)$. Taking the limits $\alpha \to 1$ and $\alpha \to 0$ respectively yields $\lim_{\alpha \to 1} H(r_\alpha) \geq H(p)$ and $\lim_{\alpha \to 0} H(r_\alpha) \geq H(q)$.
It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support; i.e. the distribution is almost everywhere positive. It follows that the maximising distribution must be an interior point in the space of distributions satisfying the expectation-constraints, that is, it must be a local extremum. Thus it suffices to show that the local extremum is unique, in order to show both that the entropy-maximising distribution is unique (and this also shows that the local extremum is the global maximum).
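The concavity fact used above, $H(\alpha p + (1-\alpha)q) \geq \alpha H(p) + (1-\alpha) H(q)$, is easy to spot-check numerically. A minimal Python sketch with two arbitrary example distributions:

```python
import math

def H(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

for a in (0.1, 0.5, 0.9):
    mix = [a * pi + (1 - a) * qi for pi, qi in zip(p, q)]
    # Entropy of the mixture dominates the mixture of the entropies:
    assert H(mix) >= a * H(p) + (1 - a) * H(q)
print("concavity holds for all tested mixtures")
```

This is of course only a numeric illustration for particular $p$, $q$, not a proof; the general statement follows from the concavity of $-x \log x$.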
Suppose $p_1$, $p_2$ are local extrema. Reformulating the above computations, these are characterised by parameters $\vec\lambda_1, \vec\lambda_2 \in \mathbb{R}^n$ via

$$p_i(x) = \frac{e^{\langle \vec\lambda_i, \vec f(x) \rangle}}{C(\vec\lambda_i)}, \qquad C(\vec\lambda) = \int_S e^{\langle \vec\lambda, \vec f(x) \rangle} \, dx,$$

where $\vec f(x) = (f_1(x), \dots, f_n(x))$ and both distributions satisfy the expectation-constraints $\operatorname{E}_{p_i}[\vec f(X)] = \vec a$. We now note a series of identities: via the satisfaction of the expectation-constraints and by taking gradients, one has

$$\nabla_{\vec\lambda} \log C(\vec\lambda_i) = \operatorname{E}_{p_i}[\vec f(X)] = \vec a \quad \text{for } i = 1, 2,$$

while the second derivative of $\log C$ along the direction $\vec u = \vec\lambda_1 - \vec\lambda_2$ equals the variance $\operatorname{Var}_{p_{\vec\lambda}}[\langle \vec u, \vec f(X) \rangle]$, where $p_{\vec\lambda}$ is similar to the distribution above, only parameterised by $\vec\lambda$. Since the directional derivative of $\log C$ along $\vec u$ takes the same value at $\vec\lambda_1$ and $\vec\lambda_2$, this variance must vanish at some intermediate parameter. Assuming that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant (which e.g. holds if the observables are independent and not a.e. constant), it holds that $\langle \vec u, \vec f(X) \rangle$ has non-zero variance, unless $\vec\lambda_1 = \vec\lambda_2$. By the above it is thus clear that the latter must be the case. Hence $\vec\lambda_1 = \vec\lambda_2$, so the parameters characterising the local extrema are identical, which means that the distributions themselves are identical. Thus, the local extremum is unique and, by the above discussion, the maximum is unique, provided a local extremum actually exists.
Caveats
Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on $\mathbb{R}$ with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy.[a] It is also possible that the expected value restrictions for the class $C$ force the probability distribution to be zero in certain subsets of $S$. In that case our theorem does not apply, but one can work around this by shrinking the set $S$.
Examples
Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as $p(x) = e^{\ln p(x)}$ and compare to the expression of the theorem above. By choosing $f(x) = \ln p(x)$ to be the measurable function and $a = \operatorname{E}[\ln p(X)] = -H(p)$ to be the constant, $p(x)$ is the maximum entropy probability distribution under the constraint $\operatorname{E}[\ln p(X)] = -H(p)$.

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure $p(x) = e^{\ln p(x)}$ and finding that $\ln p(x)$ can be separated into parts.
A table of examples of maximum entropy distributions is given in Lisman (1972)[6] and Park & Bera (2009).[7]
The uniform distribution on the interval $[a, b]$ is the maximum entropy distribution among all continuous distributions which are supported in the interval $[a, b]$, and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision $a = a_0 < a_1 < \dots < a_k = b$ of the interval $[a, b]$ and probabilities $p_1, \dots, p_k$ that add up to one, then we can consider the class of all continuous distributions such that

$$\Pr(a_{j-1} \leq X < a_j) = p_j \quad \text{for } j = 1, \dots, k.$$

The density of the maximum entropy distribution for this class is constant on each of the intervals $[a_{j-1}, a_j)$. The uniform distribution on the finite set $\{x_1, \dots, x_n\}$ (which assigns a probability of $1/n$ to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
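The claim that the uniform distribution maximizes entropy on a finite set can be spot-checked numerically. The Python sketch below compares the uniform distribution on five points against randomly drawn distributions on the same points; the point count, sample count, and seed are arbitrary choices.

```python
import math, random

def H(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

n = 5
uniform = [1.0 / n] * n

# No randomly drawn distribution on n points beats the uniform one:
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    p = [wi / s for wi in w]
    assert H(p) <= H(uniform) + 1e-12

print(round(H(uniform), 4), round(math.log(n), 4))  # both log(5), about 1.6094
```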
The exponential distribution, with density $p(x) = \lambda e^{-\lambda x}$, is the maximum entropy distribution among all continuous distributions supported in $[0, \infty)$ that have a specified mean of $1/\lambda$.
In the case of distributions supported on [0,∞), the maximum entropy distribution depends on relationships between the first and
second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be
undefinable.[8]
Among all the discrete distributions supported on the set $\{x_1, \dots, x_n\}$ with a specified mean $\mu$, the maximum entropy distribution has the following shape:

$$\Pr(X = x_k) = C r^{x_k} \quad \text{for } k = 1, \dots, n,$$

where the positive constants $C$ and $r$ can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be $\mu$.
For example, suppose a large number $N$ of dice are thrown, and you are told that the sum of all the shown numbers is $S$. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with $\{x_1, \dots, x_6\} = \{1, \dots, 6\}$ and $\mu = S/N$.
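For the dice example, the constants $C$ and $r$ can be found numerically: the mean of the family $C r^x$ on $\{1, \dots, 6\}$ is monotone in $r$, so a simple bisection suffices. A Python sketch (the search interval for $\ln r$ is an arbitrary numerical choice):

```python
import math

XS = tuple(range(1, 7))  # the six faces of a die

def mean_for(lam, xs=XS):
    """Mean of the distribution proportional to exp(lam * x) on xs."""
    w = [math.exp(lam * x) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

def maxent_dice(mu, xs=XS):
    """Maximum entropy distribution C * r**x on xs with mean mu (r = e^lam)."""
    lo, hi = -20.0, 20.0  # bisect over lam = ln(r); the mean increases with lam
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_for(mid) < mu:
            lo = mid
        else:
            hi = mid
    w = [math.exp(lo * x) for x in xs]
    Z = sum(w)
    return [wi / Z for wi in w]

p = maxent_dice(4.5)  # e.g. an observed average of 4.5 pips per die
print([round(pi, 4) for pi in p])
assert abs(sum(i * pi for i, pi in zip(XS, p)) - 4.5) < 1e-9
```

For $\mu = 3.5$ (a fair average) this recovers the uniform distribution, as expected; for $\mu > 3.5$ the probabilities tilt towards the higher faces.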
Finally, among all the discrete distributions supported on the infinite set $\{x_1, x_2, \dots\}$ with mean $\mu$, the maximum entropy distribution has the shape:

$$\Pr(X = x_k) = C r^{x_k} \quad \text{for } k = 1, 2, \dots,$$

where again the constants $C$ and $r$ are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be $\mu$. For example, in the case that $x_k = k$, this gives the geometric distribution

$$\Pr(X = k) = \frac{1}{\mu} \left( \frac{\mu - 1}{\mu} \right)^{k-1} \quad \text{for } k = 1, 2, \dots$$
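For the case $x_k = k$, the resulting geometric distribution is easy to verify numerically. This Python sketch checks normalization and the mean for an arbitrary example value of $\mu$, truncating the infinite sums where the tail is negligible:

```python
# Maximum entropy on {1, 2, 3, ...} with E[X] = mu: the geometric distribution
# P(X = k) = (1/mu) * ((mu - 1)/mu)**(k - 1).
mu = 4.0
p = lambda k: (1 / mu) * ((mu - 1) / mu) ** (k - 1)

# Truncate the infinite sums; the tail beyond k = 2000 is negligible here.
total = sum(p(k) for k in range(1, 2001))
mean = sum(k * p(k) for k in range(1, 2001))
print(round(total, 6), round(mean, 6))  # approximately 1.0 and 4.0
```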
For a continuous random variable distributed about the unit circle, the von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified[9] or, equivalently, the circular mean and circular variance are specified.

When the mean and variance of the angles modulo $2\pi$ are specified, the wrapped normal distribution maximizes the entropy.[9]
There exists an upper bound on the entropy of continuous random variables on $\mathbb{R}$ with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because the candidate density $c \exp(\lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3)$ is unbounded when $\lambda_3 \neq 0$ (see Cover & Thomas (2006: chapter 12)).
However, the maximum entropy is ε-achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a
normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by
a small amount at a value many σ larger than the mean. The skewness, being proportional to the third moment, will be affected
more than the lower order moments.
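The perturbation argument can be sketched numerically: discretize a standard normal, move a tiny amount of mass $\varepsilon$ to a point several $\sigma$ above the mean, and compare how the first three (central) moments respond. The grid, $\varepsilon$, and spike location below are arbitrary choices for illustration.

```python
import math

def moments(xs, p):
    """First moment, plus second and third central moments."""
    m1 = sum(x * pi for x, pi in zip(xs, p))
    m2 = sum((x - m1) ** 2 * pi for x, pi in zip(xs, p))
    m3 = sum((x - m1) ** 3 * pi for x, pi in zip(xs, p))
    return m1, m2, m3

# Discretized standard normal on [-6, 6].
xs = [i * 0.01 for i in range(-600, 601)]
w = [math.exp(-x * x / 2) for x in xs]
Z = sum(w)
p = [wi / Z for wi in w]

# Move a tiny mass eps to a point five sigma above the mean.
eps, spike = 1e-4, 5.0
xs2 = xs + [spike]
p2 = [pi * (1 - eps) for pi in p] + [eps]

m1a, m2a, m3a = moments(xs, p)
m1b, m2b, m3b = moments(xs2, p2)
# The change grows with the moment order: roughly eps*5 vs eps*25 vs eps*125.
assert (m3b - m3a) > (m2b - m2a) > (m1b - m1a) > 0
print(m1b - m1a, m2b - m2a, m3b - m3a)
```

The final assertion confirms the point of the argument: the third moment moves far more than the mean or variance, so the skew can be pushed up while keeping the lower moments nearly fixed.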
This is a special case of the general case in which the exponential of any odd-order polynomial in $x$ will be unbounded on $\mathbb{R}$. For example, $c e^{\lambda x}$ will likewise be unbounded on $\mathbb{R}$, but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if $x$ lies in the interval $[0, \infty)$ and $\lambda < 0$, the exponential distribution will result).
In particular, the maximal entropy distribution with a specified mean and a specified general deviation measure is derived in Grechuk, Molyboha and Zabarankin (2009).[10]
Other examples
In the table below, each listed distribution maximizes the entropy for the set of functional constraints listed in the third column, and the constraint that $x$ be included in the support of the probability density, which is listed in the second column.[6][7]

Several examples (Bernoulli, geometric, exponential, Laplace, Pareto) listed are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity.

| Distribution | Support | Maximum entropy constraint |
| --- | --- | --- |
| Uniform (discrete) | {x_1, ..., x_n} | None |
| Uniform (continuous) | [a, b] | None |
| Bernoulli | {0, 1} | E[X] = p |
| Geometric | {1, 2, 3, ...} | E[X] = 1/p |
| Exponential | [0, ∞) | E[X] = 1/λ |
| Laplace | (-∞, ∞) | E[\|X - μ\|] = b |
| Asymmetric Laplace | (-∞, ∞) | (see article) |
| Pareto | [x_m, ∞) | E[ln X] specified |
| Normal | (-∞, ∞) | E[X] and E[X²] specified |
| Truncated normal | [a, b] | E[X] and E[X²] specified (see article) |
| von Mises | [0, 2π) | E[cos θ] and E[sin θ] specified |
| Rayleigh | [0, ∞) | E[X²] and E[ln X] specified |
| Beta | [0, 1] | E[ln X] and E[ln(1 - X)] specified |
| Cauchy | (-∞, ∞) | E[ln(1 + X²)] specified |
| Chi | [0, ∞) | E[X²] and E[ln X] specified |
| Chi-squared | [0, ∞) | E[X] and E[ln X] specified |
| Erlang | [0, ∞) | E[X] and E[ln X] specified |
| Gamma | [0, ∞) | E[X] and E[ln X] specified |
| Lognormal | (0, ∞) | E[ln X] and E[(ln X)²] specified |
| Maxwell–Boltzmann | [0, ∞) | E[X²] and E[ln X] specified |
| Weibull | [0, ∞) | E[X^k] and E[ln X] specified |
| Multivariate normal | R^n | mean vector and covariance matrix specified |
| Binomial | {0, ..., n} | E[X] specified, within the class of n-generalized binomial distributions[11] |
| Poisson | {0, 1, 2, ...} | E[X] specified, within the class of generalized binomial distributions[11] |
| Logistic | (-∞, ∞) | (see article) |
The maximum entropy principle can be used to upper bound the entropy of statistical mixtures.[12]
See also
Exponential family
Gibbs measure
Partition function (mathematics)
Maximal entropy random walk - maximizing entropy rate for a graph
Notes
a. For example, the class of all continuous distributions $X$ on $\mathbb{R}$ with $\operatorname{E}(X) = 0$ and $\operatorname{E}(X^2) = \operatorname{E}(X^3) = 1$ (see Cover, Ch 12).
Citations
1. Williams, D. (2001). Weighing the Odds. Cambridge University Press. ISBN 0-521-00618-X (pages 197–199).
2. Bernardo, J. M.; Smith, A. F. M. (2000). Bayesian Theory. Wiley. ISBN 0-471-49464-X (pages 209, 366).
3. O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Vol 2B, Bayesian Inference. Edward Arnold. ISBN 0-340-52922-9 (Section 5.40).
4. Botev, Z. I.; Kroese, D. P. (2011). "The Generalized Cross Entropy Method, with Applications to Probability Density Estimation". Methodology and Computing in Applied Probability. 13 (1): 1–27. doi:10.1007/s11009-009-9133-7.
5. Botev, Z. I.; Kroese, D. P. (2008). "Non-asymptotic Bandwidth Selection for Density Estimation of Discrete Data". Methodology and Computing in Applied Probability. 10 (3): 435. doi:10.1007/s11009-007-9057-z.
6. Lisman, J. H. C.; van Zuylen, M. C. A. (1972). "Note on the generation of most probable frequency distributions". Statistica Neerlandica. 26 (1): 19–23. doi:10.1111/j.1467-9574.1972.tb00152.x.
7. Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model". Journal of Econometrics. 150 (2): 219–230. doi:10.1016/j.jeconom.2008.12.014.
8. Dowson, D.; Wragg, A. (September 1973). "Maximum-entropy distributions having prescribed first and second moments". IEEE Transactions on Information Theory (correspondence). 19 (5): 689–693. doi:10.1109/tit.1973.1055060. ISSN 0018-9448.
9. Jammalamadaka, S. Rao; SenGupta, A. (2001). Topics in Circular Statistics. New Jersey: World Scientific. ISBN 978-981-02-3778-3.
10. Grechuk, B.; Molyboha, A.; Zabarankin, M. (2009). "Maximum Entropy Principle with General Deviation Measures". Mathematics of Operations Research. 34 (2): 445–467.
11. Harremoës, Peter (2001). "Binomial and Poisson distributions as maximum entropy distributions". IEEE Transactions on Information Theory. 47 (5): 2039–2041. doi:10.1109/18.930936.
12. Nielsen, Frank; Nock, Richard (2017). "MaxEnt upper bounds for the differential entropy of univariate continuous distributions". IEEE Signal Processing Letters. 24 (4): 402–406. doi:10.1109/LSP.2017.2666792.
References
Cover, T. M.; Thomas, J. A. (2006). "Chapter 12, Maximum Entropy". Elements of Information Theory (2nd ed.). Wiley. ISBN 978-0471241959.
Nielsen, F.; Nock, R. (2017). "MaxEnt upper bounds for the differential entropy of univariate continuous distributions". IEEE Signal Processing Letters. 24 (4): 402–406.
Taneja, I. J. (2001). Generalized Information Measures and Their Applications (http://www.mtm.ufsc.br/~taneja/book/book.html). Chapter 1 (http://www.mtm.ufsc.br/~taneja/book/node14.html).
Ebrahimi, Nader; Soofi, Ehsan S.; Soyer, Refik (2008). "Multivariate maximum entropy identification, transformation, and dependence". Journal of Multivariate Analysis. 99: 1217–1231. doi:10.1016/j.jmva.2007.08.004.