
Minimum description length

Minimum Description Length (MDL) is a model selection principle where the shortest description of the
data is the best model. MDL methods learn through a data compression perspective and are sometimes
described as mathematical applications of Occam's razor. The MDL principle can be extended to other
forms of inductive inference and learning, for example to estimation and sequential prediction, without
explicitly identifying a single model of the data.

MDL has its origins mostly in information theory and has been further developed within the general fields
of statistics, theoretical computer science and machine learning, and more narrowly computational learning
theory.

Historically, there are different, yet interrelated, usages of the definite noun phrase "the minimum
description length principle" that vary in what is meant by description:

Within Jorma Rissanen's theory of learning, a central concept of information theory, models
are statistical hypotheses and descriptions are defined as universal codes.
Rissanen's 1978[1] pragmatic first attempt to automatically derive short descriptions relates
to the Bayesian Information Criterion (BIC).
Within Algorithmic Information Theory, the description length of a data sequence is
the length of the smallest program that outputs that data set. In this context, it is also known
as the 'idealized' MDL principle, and it is closely related to Solomonoff's theory of inductive
inference, which holds that the best model of a data set is represented by its shortest self-
extracting archive.

Overview
Selecting the minimum length description of the available data as the best model observes the principle
identified as Occam's razor. Prior to the advent of computer programming, generating such descriptions was
the intellectual labor of scientific theorists. It was far less formal than it has become in the computer age. If
two scientists had a theoretic disagreement, they rarely could formally apply Occam's razor to choose
between their theories. They would have different data sets and possibly different descriptive languages.
Nevertheless, science advanced as Occam's razor was an informal guide in deciding which model was best.

With the advent of formal languages and computer programming Occam's razor was mathematically
defined. Models of a given set of observations, encoded as bits of data, could be created in the form of
computer programs that output that data. Occam's razor could then formally select the shortest program,
measured in bits of this algorithmic information, as the best model.

To avoid confusion, note that there is nothing in the MDL principle that implies a machine produced the
program embodying the model. It can be entirely the product of humans. The MDL principle applies
regardless of whether the description to be run on a computer is the product of humans, machines or any
combination thereof. The MDL principle requires only that the shortest description, when executed,
produce the original data set without error.

Two-Part codes
The distinction in computer programs between the program proper and its literal data applies to all formal
descriptions and is sometimes referred to as the "two parts" of a description. In statistical MDL learning,
such a description is frequently called a two-part code.

MDL in machine learning


MDL applies in machine learning when algorithms (machines) generate descriptions. Learning occurs
when an algorithm generates a shorter description of the same data set.

The theoretic minimum description length of a data set, called its Kolmogorov complexity, cannot,
however, be computed. That is to say, even if by random chance an algorithm generates the shortest
program of all that outputs the data set, an automated theorem prover cannot prove there is no shorter such
program. Nevertheless, given two programs that output the dataset, the MDL principle selects the shorter of
the two as embodying the best model.

Recent work on algorithmic MDL learning


Recent machine MDL learning of algorithmic, as opposed to statistical, data models has received
increasing attention with the increasing availability of data, computational resources and theoretical advances.[2][3]
Approaches are informed by the burgeoning field of artificial general intelligence. Shortly before his death,
Marvin Minsky came out strongly in favor of this line of research, saying:[4]

It seems to me that the most important discovery since Gödel was the discovery by Chaitin,
Solomonoff and Kolmogorov of the concept called Algorithmic Probability which is a
fundamental new theory of how to make predictions given a collection of experiences and this
is a beautiful theory, everybody should learn it, but it’s got one problem, that is, that you
cannot actually calculate what this theory predicts because it is too hard, it requires an infinite
amount of work. However, it should be possible to make practical approximations to the
Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we
have today. Everybody should learn all about that and spend the rest of their lives working on
it.

— Panel discussion on The Limits of Understanding, World Science Festival, NYC, Dec
14, 2014

Statistical MDL learning


Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet.

[The MDL Principle] is based on the following insight: any regularity in a given set of data can
be used to compress the data, i.e. to describe it using fewer symbols than needed to describe
the data literally. (Grünwald, 2004)[5]

Based on this, in 1978, Jorma Rissanen published an MDL learning algorithm using the statistical notion of
information rather than algorithmic information. Over the past 40 years this has developed into a rich theory
of statistical and machine learning procedures with connections to Bayesian model selection and averaging,
penalization methods such as Lasso and Ridge, and so on; Grünwald and Roos (2020)[6] give an
introduction including modern developments. Rissanen started out with this idea: all statistical learning is
about finding regularities in data, and the best hypothesis to describe the regularities in data is also the one
that is able to statistically compress the data most. Like other statistical methods, it can be used for learning
the parameters of a model using some data. Usually though, standard statistical methods assume that the
general form of a model is fixed. MDL's main strength is that it can also be used for selecting the general
form of a model and its parameters. The quantity of interest (sometimes just a model, sometimes just
parameters, sometimes both at the same time) is called a hypothesis. The basic idea is then to consider the
(lossless) two-stage code that encodes data $D$ with length $L(D)$ by first encoding a hypothesis $H$ in the set
of considered hypotheses $\mathcal{H}$ and then coding $D$ "with the help of" $H$; in the simplest context this just
means "encoding the deviations of the data from the predictions made by $H$":

L(D) = \min_{H \in \mathcal{H}} \left( L(H) + L(D \mid H) \right)

The $H$ achieving this minimum is then viewed as the best explanation of data $D$. As a simple example,
take a regression problem: the data $D$ could consist of a sequence of points $D = ((x_1, y_1), \ldots, (x_n, y_n))$,
and the set $\mathcal{H}$ could be the set of all polynomials from $X$ to $Y$. To describe a polynomial of degree (say) $k$,
one would first have to discretize the parameters to some precision; one would then have to describe this
precision (a natural number); next, one would have to describe the degree $k$ (another natural number), and
in the final step, one would have to describe the $k + 1$ parameters; the total length would be $L(H)$. One
would then describe the points in $D$ using some fixed code for the x-values and then using a code for the
deviations $y_i - H(x_i)$.
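
To make the two-part idea concrete, the following is a minimal Python sketch of degree selection for the polynomial example above. It assumes a deliberately crude two-part code: the degree plus a fixed number of bits per quantized coefficient for $L(H)$, and a Gaussian code for the deviations for $L(D \mid H)$; the function and parameter names are illustrative rather than standard.

import numpy as np

def two_part_codelength(x, y, degree, bits_per_param=32):
    """Crude two-part MDL score for a polynomial hypothesis H of a given degree.

    L(H): cost of stating the degree plus bits_per_param bits per coefficient.
    L(D|H): deviations y_i - H(x_i) encoded under a Gaussian model (in bits).
    For continuous y this is defined only up to the recording precision,
    a constant that cancels when comparing degrees.
    """
    coeffs = np.polyfit(x, y, degree)            # fit H by least squares
    residuals = y - np.polyval(coeffs, x)        # deviations y_i - H(x_i)
    sigma2 = max(float(np.mean(residuals ** 2)), 1e-12)
    n = len(y)
    l_hypothesis = 2 * np.log2(degree + 2) + (degree + 1) * bits_per_param
    l_data_given_h = 0.5 * n * np.log2(2 * np.pi * np.e * sigma2)
    return l_hypothesis + l_data_given_h

# Pick the degree whose two-part description of the data is shortest.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 3 * x**2 - x + rng.normal(scale=0.5, size=x.size)   # truly quadratic data
best = min(range(8), key=lambda k: two_part_codelength(x, y, k))
print("degree chosen by two-part MDL:", best)            # typically 2 for this synthetic data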

In practice, one often (but not always) uses a probabilistic model. For example, one associates each
polynomial $H$ with the corresponding conditional distribution expressing that given $x$, $y$ is normally
distributed with mean $H(x)$ and some variance $\sigma^2$, which could either be fixed or added as a free
parameter. Then the set of hypotheses $\mathcal{H}$ reduces to the assumption of a linear model, $y = H(x) + \epsilon$,
with $H$ a polynomial.
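
Under this Gaussian model, the code for the deviations can be taken to have length equal to the negative log-likelihood, so for a fixed variance $\sigma^2$ minimizing $L(D \mid H)$ amounts to least-squares fitting; a sketch of the identity for $n$ data points:

L(D \mid H) = -\log_2 \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - H(x_i))^2}{2\sigma^2}\right) = \frac{1}{2\sigma^2 \ln 2} \sum_{i=1}^{n} (y_i - H(x_i))^2 + \frac{n}{2}\log_2(2\pi\sigma^2).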

Furthermore, one is often not directly interested in specific parameter values, but just, for example, in the
degree of the polynomial. In that case, one sets $\mathcal{H}$ to be $\{\mathcal{H}_1, \mathcal{H}_2, \ldots\}$, where each $\mathcal{H}_j$ represents the
hypothesis that the data are best described as a $j$-th degree polynomial. One then codes data $D$ given
hypothesis $\mathcal{H}_j$ using a one-part code designed such that, whenever some hypothesis $H \in \mathcal{H}_j$ fits the data
well, the codelength $L(D \mid \mathcal{H}_j)$ is short. The design of such codes is called universal coding. There are
various types of universal codes one could use, often giving similar lengths for long data sequences but
differing for short ones. The 'best' codes (in the sense of having a minimax optimality property) are the
normalized maximum likelihood (NML) or Shtarkov codes. A quite useful class of codes are the Bayesian
marginal likelihood codes. For exponential families of distributions, when Jeffreys prior is used and the
parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL
theory in close contact with objective Bayes model selection, in which one also sometimes adopts Jeffreys'
prior, albeit for different reasons. The MDL approach to model selection "gives a selection criterion
formally identical to the BIC approach"[7] for a large number of samples.
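
For reference, the NML (Shtarkov) distribution for a model class $\mathcal{H}_j$ can be written explicitly. The sketch below assumes discrete data sequences $D$ of length $n$, a maximum-likelihood estimator $\hat{\theta}(\cdot)$, and $k_j$ free parameters; the second identity is the standard asymptotic expansion that produces the BIC-style penalty:

P_{\mathrm{NML}}(D \mid \mathcal{H}_j) = \frac{P(D \mid \hat{\theta}(D))}{\sum_{D'} P(D' \mid \hat{\theta}(D'))}, \qquad -\log_2 P_{\mathrm{NML}}(D \mid \mathcal{H}_j) = -\log_2 P(D \mid \hat{\theta}(D)) + \frac{k_j}{2}\log_2 n + O(1).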

Example of Statistical MDL Learning

A coin is flipped 1000 times, and the numbers of heads and tails are recorded. Consider two model classes:
The first is a code that represents outcomes with a 0 for heads or a 1 for tails. This code
represents the hypothesis that the coin is fair. The code length according to this code is
always exactly 1000 bits.
The second consists of all codes that are efficient for a coin with some specific bias,
representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490
tails. Then the code length according to the best code in the second model class is shorter
than 1000 bits.

For this reason, a naive statistical method might choose the second model as a better explanation for the
data. However, an MDL approach would construct a single code for the whole second model class, rather
than just using its best element. This code could be the normalized maximum likelihood code or a Bayesian code. If
such a code is used, then the total codelength based on the second model class would be larger than 1000
bits. Therefore, the conclusion when following an MDL approach is inevitably that there is not enough
evidence to support the hypothesis of the biased coin, even though the best element of the second model
class provides better fit to the data.
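
A minimal numeric sketch of this comparison in Python, assuming the second model class is the full Bernoulli family and using the exact finite-sample NML normalizer (the function names here are illustrative):

import math

def nml_codelength_bernoulli(heads, tails):
    # Codelength (bits) of the observed counts under the NML code for the
    # whole Bernoulli model class: -log2 P(data | theta_hat) + log2 COMP(n),
    # where COMP(n) sums the maximized likelihoods over all possible outcomes.
    n = heads + tails
    def neg_log2_ml(k):
        if k in (0, n):
            return 0.0
        p = k / n
        return -(k * math.log2(p) + (n - k) * math.log2(1 - p))
    comp = sum(math.comb(n, k) * 2.0 ** (-neg_log2_ml(k)) for k in range(n + 1))
    return neg_log2_ml(heads) + math.log2(comp)

heads, tails = 510, 490
fair_bits = heads + tails                                                # 1 bit per flip: 1000
best_biased_bits = -(heads * math.log2(0.51) + tails * math.log2(0.49))  # approx. 999.7
nml_bits = nml_codelength_bernoulli(heads, tails)                        # approx. 1005, i.e. > 1000
print(fair_bits, round(best_biased_bits, 1), round(nml_bits, 1))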

Statistical MDL Notation

Central to MDL theory is the one-to-one correspondence between code length functions and probability
distributions (this follows from the Kraft–McMillan inequality). For any probability distribution $P$, it is
possible to construct a code $C$ such that the length (in bits) of $C(x)$ is equal to $-\log_2 P(x)$; this code
minimizes the expected code length. Conversely, given a code $C$, one can construct a probability
distribution $P$ such that the same holds. (Rounding issues are ignored here.) In other words, searching for
an efficient code is equivalent to searching for a good probability distribution.
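
As a small illustration of this correspondence, consider an arbitrary toy distribution over four symbols (chosen here purely for illustration): the ideal codelengths are $-\log_2 P(x)$, and conversely any set of prefix-code lengths induces a (sub-)probability distribution via $2^{-L(x)}$.

import math

# A toy distribution over four symbols (illustrative assumption).
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Distribution -> ideal codelengths: L(x) = -log2 P(x) bits.
L = {x: -math.log2(p) for x, p in P.items()}
print(L)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

# Codelengths -> distribution: Q(x) = 2 ** -L(x); the Kraft inequality
# guarantees these weights sum to at most 1 for any prefix code.
Q = {x: 2.0 ** -l for x, l in L.items()}
print(sum(Q.values()))  # 1.0, so this particular code wastes no codewords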

Limitations of Statistical MDL Learning

The description language of statistical MDL is not computationally universal. Therefore it cannot, even in
principle, learn models of recursive natural processes.

Related concepts

Statistical MDL learning is very strongly connected to probability theory and statistics through the
correspondence between codes and probability distributions mentioned above. This has led some
researchers to view MDL as equivalent to Bayesian inference: the code lengths of the model and of the data
given the model in MDL correspond respectively to the prior probability and the marginal likelihood in the Bayesian framework.[8]

While Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also
accommodates other codes that are not Bayesian. An example is the Shtarkov normalized maximum
likelihood code, which plays a central role in current MDL theory, but has no equivalent in Bayesian
inference. Furthermore, Rissanen stresses that we should make no assumptions about the true data-
generating process: in practice, a model class is typically a simplification of reality and thus does not
contain any code or probability distribution that is true in any objective sense.[9][10] In the last mentioned
reference Rissanen bases the mathematical underpinning of MDL on the Kolmogorov structure function.
According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe
priors that would lead to poor results. The priors that are acceptable from an MDL point of view also tend
to be favored in so-called objective Bayesian analysis; there, however, the motivation is usually
different.[11]

Other systems
Rissanen's was not the first information-theoretic approach to learning; as early as 1968 Wallace and
Boulton pioneered a related concept called minimum message length (MML). The difference between
MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but
there are some significant differences, especially in interpretation:

MML is a fully subjective Bayesian approach: it starts from the idea that one represents one's
beliefs about the data-generating process in the form of a prior distribution. MDL avoids
assumptions about the data-generating process.
Both methods make use of two-part codes: the first part always represents the information
that one is trying to learn, such as the index of a model class (model selection) or parameter
values (parameter estimation); the second part is an encoding of the data given the
information in the first part. The difference between the methods is that, in the MDL literature,
it is advocated that unwanted parameters should be moved to the second part of the code,
where they can be represented with the data by using a so-called one-part code, which is
often more efficient than a two-part code. In the original description of MML, all parameters
are encoded in the first part, so all parameters are learned.
Within the MML framework, each parameter is stated to exactly the precision which results in
the optimal overall message length: the preceding example might arise if some parameter
was originally considered "possibly useful" to a model but was subsequently found to be
unable to help to explain the data (such a parameter will be assigned a code length
corresponding to the (Bayesian) prior probability that the parameter would be found to be
unhelpful). In the MDL framework, the focus is more on comparing model classes than
models, and it is more natural to approach the same question by comparing the class of
models that explicitly include such a parameter against some other class that doesn't. The
difference lies in the machinery applied to reach the same conclusion.

See also
Algorithmic probability
Algorithmic information theory
Inductive inference
Inductive probability
Lempel–Ziv complexity

References
1. Rissanen, J. (September 1978). "Modeling by shortest data description". Automatica. 14 (5):
465–471. doi:10.1016/0005-1098(78)90005-5 (https://doi.org/10.1016%2F0005-1098%287
8%2990005-5).
2. Zenil, Hector; Kiani, Narsis A.; Zea, Allan A.; Tegnér, Jesper (January 2019). "Causal
deconvolution by algorithmic generative models". Nature Machine Intelligence. 1 (1): 58–66.
doi:10.1038/s42256-018-0005-0 (https://doi.org/10.1038%2Fs42256-018-0005-0).
hdl:10754/630919 (https://hdl.handle.net/10754%2F630919). S2CID 86562557 (https://api.s
emanticscholar.org/CorpusID:86562557).
3. "Remodelling machine learning: An AI that thinks like a scientist". Nature Machine
Intelligence: 1. 28 January 2019. doi:10.1038/s42256-019-0026-3 (https://doi.org/10.1038%2
Fs42256-019-0026-3). S2CID 189929110 (https://api.semanticscholar.org/CorpusID:189929
110).
4. Archived at Ghostarchive (https://ghostarchive.org/varchive/youtube/20211205/DfY-DRsE86
s) and the Wayback Machine (https://web.archive.org/web/20151226130036/https://www.you
tube.com/watch?v=DfY-DRsE86s): "The Limits of Understanding" (https://www.youtube.com/
watch?v=DfY-DRsE86s&feature=youtu.be&t=5402). YouTube.
5. Grunwald, Peter (June 2004). "A tutorial introduction to the minimum description length
principle". arXiv:math/0406077 (https://arxiv.org/abs/math/0406077).
Bibcode:2004math......6077G (https://ui.adsabs.harvard.edu/abs/2004math......6077G).
6. Grünwald, Peter; Roos, Teemu (2020). "Minimum Description Length Revisited" (https://doi.o
rg/10.1142%2FS2661335219300018). International Journal of Mathematics for Industry. 11
(1). doi:10.1142/S2661335219300018 (https://doi.org/10.1142%2FS2661335219300018).
S2CID 201314867 (https://api.semanticscholar.org/CorpusID:201314867).
7. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "Model Assessment and
Selection". The Elements of Statistical Learning. Springer Series in Statistics. pp. 219–259.
doi:10.1007/978-0-387-84858-7_7 (https://doi.org/10.1007%2F978-0-387-84858-7_7).
ISBN 978-0-387-84857-0.
8. MacKay, David J. C.; Kay, David J. C. Mac (2003). Information Theory, Inference and
Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.
9. Rissanen, Jorma. "Homepage of Jorma Rissanen" (https://web.archive.org/web/201512100
54325/http://www.mdl-research.net/jorma.rissanen/). Archived from the original (http://www.m
dl-research.net/jorma.rissanen/) on 2015-12-10. Retrieved 2010-07-03.
10. Rissanen, J. (2007). Information and Complexity in Statistical Modeling (https://www.springe
r.com/computer/foundations/book/978-0-387-36610-4). Springer. Retrieved 2010-07-03.
11. Nannen, Volker (May 2010). "A Short Introduction to Model Selection, Kolmogorov
Complexity and Minimum Description Length (MDL)". arXiv:1005.2364 (https://arxiv.org/abs/
1005.2364). Bibcode:2010arXiv1005.2364N (https://ui.adsabs.harvard.edu/abs/2010arXiv10
05.2364N).

Further reading
Minimum Description Length on the Web (https://web.archive.org/web/20190109161219/htt
p://www.mdl-research.net/), by the University of Helsinki. Features readings, demonstrations,
events and links to MDL researchers.
Homepage of Jorma Rissanen (https://web.archive.org/web/20151210054325/http://www.md
l-research.net/jorma.rissanen/), containing lecture notes and other recent material on MDL.
Advances in Minimum Description Length, MIT Press, ISBN 0-262-07262-9.
