
CS898, Winter, 2023

Kolmogorov complexity and


its applications

Ming Li
School of Computer Science
University of Waterloo
http://www.cs.uwaterloo.ca/~mli/cs898-2023.html
Why take this course?

Marvin Minsky, 2010:


It seems to me that the most important discovery since
Gödel was the discovery by Chaitin, Solomonoff, and
Kolmogorov of a concept called Algorithmic Probability. ...
This is a beautiful theory ... Everybody should learn all
about that and spend the rest of their lives working on it.
Information Avalanche

Doubling the information:


1750-1900: 150 years to double
1900-1950: 50 years to double
1950-1960: 10 years to double
1960-1965: 5 years to double
….
2020: 73 days to double.
Information Explosion

We live in an information society. Information


science is our profession. But do you know what
"information" is, and how to use it?
Modern Information Theory

Not Shannon Information Theory:


Shannon theory is about how to encode an
ensemble of (indivisible) objects (known to both
sides) and send sequences of such objects
efficiently on average.
Kolmogorov Complexity:
Encode information in one object most efficiently.
Example: Transferring the book War and Peace requires 1 bit in Shannon's setting, but
the whole book (compressed) under Kolmogorov complexity.
Industrial Revolutions and Mathematics

First Industrial Revolution (1760's-): Calculus (methodology governing mechanics)
Second Industrial Revolution (1860's-): PDEs and Maxwell's equations
Information Tech Revolution (1960's-): Need a theory of information?

Let us explore this


question together
Pioneers whose work made this course possible
Pioneers

Alan Turing, William Tutte
The Enigma and Lorenz cipher machines
Course information
Textbook: Li-Vitanyi: An introduction
to Kolmogorov complexity and its
applications. You may use any edition
(1st, 2nd, 3rd, 4th).

Prerequisites: Basics in Turing


machines, machine learning,
algorithms, probability.

Students do 3 assignments at 20 marks


each, and a term project with final
report and zoom presentation (40
marks)
Course outline

Theory of Kolmogorov complexity, including:


Universality, and non-computability
Symmetry of information
Information distance, one-shot learning
Statistical testing
Typical applications. For example:
Average case analysis of Shellsort
Lovász Local Lemma
Few-shot learning
Why take this course?
Amazing: on a flight from Russia to China,
the antibody tests of all 190 passengers
show identical numbers!

Could this be coincidental?


After all, the probability of all tests being identical is the same as that of any
other particular outcome.
Even the founding fathers of classical probability theory, like Laplace,
were puzzled by such phenomena.
Another example:

If your grades are:


83
91
64
91
75
89
82
….
(the last digits, 3 1 4 1 5 9 2 …, are the digits of π)

Do you think this is a conspiracy by your professors? Take this course, and
you will know how to "sue" the university or the Russian airline for this.
Why take this course?

Have you ever tried to analyze the average-case


complexity of an algorithm?
Did you have to average over all inputs (or use
probability to compute the expectations)?
Not any more: this course will show you how to do it in a
simple way, using Kolmogorov complexity, and using
only one input.
Why take this course?
Everything carries information:
Our DNA carries 3 billion base pairs
Our textbook has 800 pages
The Russia-China flight antibody tests show 190 identical numbers
BERT was trained on Wikipedia (2.5B words) and Wikibooks
(1B words)
A question-answer pair: only a few words
A Python program
The temperature of moving molecules in a room
Can we unify all such information and define a universal metric
that governs all information? That is, information in one object, or
between two objects?
Few shot learning

Deep learning is data-heavy, parameter-laden, and energy-
costly.
Human learning is the opposite.
We will show you how to model few-shot human
learning using Kolmogorov complexity.
Why take this course?
Few-shot learning: we do not need many examples or prior knowledge to learn.
But this is what they do currently: transfer features …

Is the above true one-shot learning? What about one-shot learning in unknown
domains? I wish to explore a general theory of one-shot learning in this course,
using Kolmogorov complexity.
Lecture 1. History and Definitions

History
Intuition and ideas
Inventors
Basic definitions and mathematical theory
Intuition
What is in common in the following individual strings?

111 … 1 (n 1's)
3.1415926 … (π)
1.267650600228…e+30 (= 2^100)
Champernowne's number 0.1234567891011121314 …
(keep concatenating i+1)
Note: Champernowne's number is normal (every block of size k has the
same frequency).
All these numbers share one commonality: there are “small”
programs to generate them.
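To make "small program" concrete, here is a minimal Python sketch (my illustration, not from the slides) that generates three of the strings above from tiny descriptions:

    # Tiny programs that generate long-looking strings.
    def ones(n):                     # 111...1: the program only needs ~log n bits for n
        return "1" * n

    def champernowne(k):             # 0.123456789101112...: keep concatenating i+1
        return "0." + "".join(str(i) for i in range(1, k + 1))

    print(ones(100))                 # 100 ones from a few-byte description
    print(champernowne(15))          # 0.123456789101112131415
    print(2 ** 100)                  # 1267650600228229401496703205376 ~ 1.27e+30
    # pi also has a short program (e.g., a spigot algorithm), just not a one-liner.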
What about this string: 10101000110110101111100
The easiest description of it is probably itself.
So there are two kinds of sequences
First kind: Those that have short descriptions
Second kind: Those that do not.

You may wonder: perhaps every sequence can be
compressed in some way? Does the second kind exist?
Especially for finite sequences, compressibility seems to depend
heavily on the compressor. Thus we do not even have a precise way to
define compressibility and these two kinds of sequences.
Before we answer these questions, let’s go into the history,
and see how famous people in the past thought about such
questions.
1903: An interesting year

This and the next two pages were


taken from Lance Fortnow
1903: An interesting year

Kolmogorov, Church, von Neumann
(all born in 1903)
Andrey Nikolaevich Kolmogorov
(1903-1987, Tambov, Russia)

Measure Theory
Probability
Analysis
Intuitionistic Logic
Cohomology
Dynamical Systems
Hydrodynamics
Kolmogorov complexity
Ray Solomonoff (1926-2009): the first
person to publish Kolmogorov complexity
Ray Solomonoff (1926-2009): also a
pioneer of Artificial Intelligence

Dartmouth Conference where they defined AI


(5 of 10 original attendees at AI@50)

Ray & Grace, me & my wife


in front of their home
Independently discovered "Kolmogorov complexity"
When there were no digital cameras (1987).
A case of Dr. Samuel Johnson
(1709-1784)

… Dr. Beattie observed, as something


remarkable which had happened to him, that he
chanced to see both No.1 and No.1000
hackney-coaches. “Why sir,” said Johnson
“there is an equal chance for one’s seeing those
two numbers as any other two.”

Boswell’s Life of Johnson


The case of the cheating casino

Bob proposes to flip a coin with Alice:


Alice wins a dollar if Heads;
Bob wins a dollar if Tails
Result: TTTTTT …. 100 Tails in a row.

Alice lost $100. She feels cheated.


Alice goes to the court

Alice complains: T^100 is not random.


Bob asks Alice to produce a random coin flip
sequence.
Alice flipped her coin 100 times and got
THTTHHTHTHHHTTTTH …
But Bob claims Alice's sequence has probability 2^-100,
and so does his.
How do we define randomness?
Pierre Laplace (1749-1827)

Father of classical probability theory


A sequence is extraordinary (nonrandom) because
it contains a rare regularity.
But what does this mean?
All 1's is a regularity
π is a regularity
What else? We can't enumerate them all.
How can we formalize Laplace's observation in modern
terms? I encourage you to make a guess here!
Von Mises
In 1919, von Mises proposed the following 2 conditions for a
random (infinite) sequence S:

1. limn→∞ { #(1) in S1:n }/n = p, 0<p<1, and
2. The above holds for any subsequence of S selected by an
"admissible" function.

But if you allow any partial function (as admissible), then
there is no sequence that can satisfy the above two
conditions. (Because you can always choose all the 0 positions
in S as a subsequence S' in 2, such that S' does not satisfy
condition 1.)
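To see the failure concretely, here is a small Python sketch (my illustration): a selection function that may peek at the bit it selects can pick exactly the 0 positions, and the selected subsequence violates condition 1:

    import random

    S = [random.randint(0, 1) for _ in range(100000)]  # a sequence with p = 1/2

    # Inadmissible selection: it depends on the selected bit itself.
    subseq = [b for b in S if b == 0]

    print(sum(S) / len(S))              # about 0.5: S satisfies condition 1
    print(sum(subseq) / len(subseq))    # exactly 0.0: the subsequence violates it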
Recursion theory
The 2 conditions of von Mises in 1919:

Alonzo Church (1903-1995)

1. limn→∞ { #(1) in S1:n }/n = p, 0<p<1, and
2. The above holds for any subsequence of S selected by an
"admissible" function.

If you allow any partial function to be admissible, then there
is no random sequence.
A. Wald proposed: if you allow only countably many admissible
functions, then von Mises sequences exist!
A. Church: let's use recursive functions!
J. Ville: von Mises-Wald-Church random sequences do not
satisfy all laws of randomness. Thus, this line of effort in
defining randomness ended here.
Shannon Information Theory

Claude Shannon (1916-2001)

Shannon-Weaver classical information theory is
about an ensemble. It does not deal with
randomness of an individual object. But as we
will see, these two different lines of thinking
lead to similar formulations, although with
different meanings. For example:
Entropy
Mutual information
Symmetry of information
Bayesian Induction

The great Bayes formula:

P(A|B) = P(B|A) P(A) / P(B)

Thomas Bayes
1702-1761

So simple, so powerful.

Here, P(A) is the prior probability of A. If you wish to make
P(A) maximal & dominating "all other" distributions, it leads
to Kolmogorov complexity and randomness!
Preliminaries and Notations
Strings: x, y, z. Usually binary.
x=x1x2 ... an infinite or finite binary sequence
xi:j =xi xi+1 … xj
|x| is number of bits in x. Textbook uses l(x).
Sets, A, B, C …
|A|, number of elements in set A. Textbook uses
d(A).
We will have two versions of Kolmogorov
complexity: C(x), and K(x).
I assume you know Turing machines, universal
TM’s, and basic facts from CS341.
Kolmogorov complexity (Solomonoff (1960) -
Kolmogorov (1963) - Chaitin (1965))

Definition: Kolmogorov complexity of a string x, CU(x), is the


size of the smallest program, w.r.t. universal TM U, generating
that string:

CU(x) = minp { |p| : U(p) = x }

Remember what Laplace said 200 years ago: "A sequence is extraordinary
because it contains rare regularities". If you assume Laplace's regularities to be
"computable", then this means there is a short program to describe them!
Examples
C(0^n):
Program 1: Print("00 … 0"), length is n+O(1)
Program 2: for i:=1 to n Print("0"), length is log n + O(1)
If n=2^m, then there is a log log n + O(1) length program.
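The three programs can be written as literal Python program texts so their lengths can be compared (a sketch; the constants differ but the growth rates are the point):

    n = 2 ** 20

    p1 = 'print("' + "0" * n + '")'     # Program 1: literal, length n + O(1)
    p2 = 'print("0" * ' + str(n) + ')'  # Program 2: encodes n in ~log n digits
    p3 = 'print("0" * 2**20)'           # n = 2^m: encodes only m, ~log log n digits

    print(len(p1), len(p2), len(p3))    # roughly n, ~20, ~18
    # Each pi is itself a runnable program: exec(p2) prints 0^n.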

But what happens if we change U? Say, instead of Java, we use
Python as the universal TM. Aren't CJava(x) and CPython(x)
different? Then how do we get a well-defined measure?
The invariance theorem
Informal Statement: It does not matter which universal
Turing machine U we choose. I.e. all “encoding methods”
are ok.

Formal statement: There exists a computable function S


such that for all computable functions S’, there is a constant
cS' such that for all strings x ∈ {0,1}*

CS(x) ≤ CS’(x) + cS’


Proof of the Invariance theorem
Proof. Fix an effective enumeration of all TM's: T1, T2, …
Let U be a universal TM such that, for any program p,
U(0^n 1 p) = Tn(p)

Thus, for any other universal TM U', U' is Tn for some n. If
U'(p)=x, then U(0^n 1 p) = Tn(p) = U'(p) = x, so

CU(x) ≤ CU'(x) + O(1)

where the O(1) constant depends on n, but not on x. QED

Thus, fixing a U, we will write C(x) instead of CU(x).
That is, CJava(x) = CPython(x), modulo compiler size.
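A toy Python sketch of this construction (my own simplification: three hand-picked "machines" stand in for the effective enumeration T1, T2, …):

    # Toy enumeration of machines: each maps a program string p to an output string.
    machines = [
        lambda p: p,                   # T0: print the program itself
        lambda p: "0" * int(p, 2),     # T1: output int(p) zeros
        lambda p: p[::-1],             # T2: output p reversed
    ]

    def U(q):                          # the universal machine: q = 0^n 1 p
        n = q.index("1")               # count the leading zeros
        return machines[n](q[n + 1:])  # run the n-th machine on p

    print(U("01" + "1010"))            # T1 on p="1010": prints "0" * 10
    # If U' = Tn and U'(p) = x, then U("0"*n + "1" + p) = x,
    # so C_U(x) <= C_U'(x) + (n+1): the constant of the invariance theorem.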
Applications

Mathematics --- probability theory, logic.


Physics --- chaos, thermodynamics.
Computer Science – average case analysis, inductive inference and learning,
shared information between documents, data mining and clustering,
incompressibility method -- examples:
Shellsort average case
Heapsort average case
Circuit complexity
Lower bounds on Turing machines, formal languages
Combinatorics: Lovász Local Lemma and related proofs.
Philosophy, biology etc – randomness, inference, complex systems, sequence
similarity
Information theory – information in individual objects, information distance
Machine Learning: zero-shot learning, consciousness
Conditional Kolmogorov complexity

Definition: The conditional Kolmogorov complexity of x,


given another string y, C(x|y), is the length of the shortest
program that outputs x given input y. Formally, w.r.t. a
universal TM U, the conditional Kolmogorov complexity of x
given y, CU(x|y), is

CU(x|y)=minp{|p|: U(p|y) = x}
One can prove a similar invariance theorem, hence we can again
drop U.
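A minimal Python illustration (mine, not the textbook's): a conditional description of x given y is just a program mapping y to x, so x can be very complex on its own yet trivial to describe given y:

    y = "10101000110110101111100"    # think of y as incompressible

    flip = lambda s: "".join("1" if c == "0" else "0" for c in s)
    x = flip(y)                      # x is as complex as y on its own ...
    print(x)                         # ... but C(x|y) = O(1): 'flip' has constant size
    # Likewise C(x|x) = O(1) via the identity program, and C(x|ε) = C(x):
    # an empty condition gives no help.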
Some facts

C(xx) = C(x) + O(1)


C(xy) ≤ C(x) + C(y) + O(log(min{C(x),C(y)}))
C(1^n) ≤ O(log n)
C(π1:n) ≤ O(log n)
For all x, C(x) ≤ |x|+O(1), via the program Print("x")
C(x|x) = O(1)
C(x|ε) = C(x)
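C(x) itself is incomputable (shown later in this lecture), but any real compressor yields a computable upper bound that already reflects these facts. A small Python sketch with zlib:

    import os, zlib

    def C_upper(x: bytes) -> int:       # computable UPPER bound on complexity, in bytes
        return len(zlib.compress(x, 9))

    n = 100000
    print(C_upper(b"1" * n))            # 1^n: a few hundred bytes, far below n
    print(C_upper(os.urandom(n)))       # random bytes: about n (incompressible)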
Incompressibility and randomness
Incompressibility: For a constant c>0, a string x ∈ {0,1}* is c-
incompressible if C(x) ≥ |x|-c. For a small constant c, we often simply say that
x is incompressible.

Lemma. There are at least 2^n – 2^(n-c) + 1 c-incompressible strings of
length n.
Proof. There are only
∑k=0,…,n-c-1 2^k = 2^(n-c) - 1
programs with length less than n-c.
Hence only that many strings (out of the 2^n strings of length n) can
have programs (descriptions) shorter than n-c.
QED.
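The counting in the proof is easy to sanity-check in Python (a sketch of the count, not a computation of C):

    n = 20
    for c in range(1, 6):
        programs = sum(2 ** k for k in range(n - c))   # lengths 0 .. n-c-1
        assert programs == 2 ** (n - c) - 1            # the geometric sum in the proof
        # so at most a 2^-c fraction of length-n strings is c-compressible:
        print(c, programs / 2 ** n)                    # < 2^-c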

In Lecture 2, we show incompressible strings pass all randomness


tests, hence we call them random strings.

More Facts

If x=uvw is incompressible, then


C(v) ≥ |v| - O(log |x|).
If p is the shortest program for x, then
C(p) ≥ |p| - O(1)
C(x|p) = O(1)
If a subset A of {0,1}* is recursively enumerable (r.e.) (the
elements of A can be listed by a Turing machine), and A is
sparse (at most p(n) elements of length n, for some polynomial p),
then for all x in A, |x|=n,
C(x) ≤ O(log p(n)) + O(C(n)) = O(log n).
Proof idea: enumerate the length-n elements of A as x1, x2, x3, …;
if x = xk, then k ≤ p(n), so x is determined by n and the index k.
In the textbook: r.e. is called c.e., and recursive is called computable.
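A Python sketch of the proof idea, with a made-up sparse set A = {0^i 1^j} (at most n+1 members per length, so p(n) = n+1): any member is recovered from (n, k), two numbers of O(log n) bits.

    import re
    from itertools import product

    def in_A(s):                        # toy sparse decidable set: strings 0^i 1^j
        return re.fullmatch("0*1*", s) is not None

    def members(n):                     # enumerate length-n members of A in order
        return [s for s in ("".join(b) for b in product("01", repeat=n)) if in_A(s)]

    def encode(x):                      # describe x in A by (|x|, rank): O(log n) bits
        return len(x), members(len(x)).index(x)

    def decode(n, k):
        return members(n)[k]

    x = "0001111"
    n, k = encode(x)
    print((n, k), decode(n, k) == x)    # (7, 4) True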
Asymptotics
Enumeration of binary strings: 0, 1, 00, 01, 10, …,
mapping to the natural numbers 0, 1, 2, 3, …
C(x) →∞ as x →∞
Define m(x) to be the monotonic lower bound of
the C(x) curve (as natural number x →∞). Then
m(x) →∞ as x →∞, yet
m(x) < Q(x) for all large x, for any unbounded monotone computable Q.
Nonmonotonicity: x=yz does not imply
C(y) ≤ C(x)+O(1). --- left for you to think about
m(x) graph
Kolmogorov complexity is not computable

Theorem (Kolmogorov). (1) C(x) is not partially recursive. That
is, there is no TM M s.t. M accepts (x,k) if C(x)≥k and is
undefined otherwise.
(2) However, there is a function H(t,x) such that
limt→∞ H(t,x) = C(x)
where H(t,x), for each fixed t, is total recursive.

Proof. (1) If such M exists, then design M' as follows. Choose
k >> |M'|. M' simulates M on input (x,k), for all |x|=k in
"parallel" (one step each), and M' outputs the first x such that
M says yes (such x exists, since some x of length k has C(x) ≥ k).
This is a contradiction: M's acceptance guarantees C(x) ≥ k, yet
M' itself is a description of x, so C(x) ≤ |M'| + O(1) << k.

(2) H(t,x) = mini=1..t {|pi| : pi outputs x in at most t steps}
(and H(t,x) = |x| + O(1) if no such pi exists).

QED
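The M' in part (1) can be written as a hypothetical Python sketch; M is a machine we merely postulated, and the point is that no such M can exist:

    from itertools import product

    # Hypothetical: M(x, k) accepts iff C(x) >= k (undefined/hangs otherwise).
    def M_prime(k, M):
        # Run M on (x, k) for all 2^k strings x of length k "in parallel"
        # (a real M' would dovetail, since M may hang on compressible x).
        for bits in product("01", repeat=k):
            x = "".join(bits)
            if M(x, k):                # M certifies C(x) >= k
                return x               # such x exists: most strings are incompressible

    # M_prime plus the number k is a description of its output x, of size
    # |M'| + O(log k). Choosing k >> |M'| gives C(x) <= |M'| + O(log k) << k,
    # contradicting C(x) >= k. Hence M does not exist.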
Gödel's Theorem
In 1900, David Hilbert listed 23 open problems in
mathematics.
Answering Hilbert's problem 2, Kurt Gödel showed in 1931
that (with the Peano arithmetic axioms) there are things
that are true but not provable.
Gödel's proof was complicated (he essentially encoded in a
number a message claiming "I am not provable").
Chaitin gave a simple version of it. Chaitin's proof does not
involve "self-referential statements".
Gödel's Theorem by Chaitin
Theorem. For all but finitely many x, the statement "x is
incompressible" is not provable.
Proof (G. Chaitin). Let F be an axiomatic theory, encoded
in C bits.
If the theorem were false, i.e. the statement "x is
incompressible" were provable in F for arbitrarily long x,
then we could enumerate all proofs in F to find
a proof of "x is incompressible"
with |x| >> C,
and output the (first) such x. Then C(x) ≤ C + O(1). But the
proof of "x is incompressible" implies that C(x) ≥ |x| >>
C. Contradiction. QED
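Chaitin's argument is itself a short program. A hypothetical Python sketch (proofs(F) and claimed_incompressible(proof) are placeholders for a proof enumerator of the theory F and a proof checker; neither is real code):

    def first_provably_incompressible(F, C):
        # F: an axiomatic theory encoded in C bits. Enumerate all its proofs.
        for proof in proofs(F):                      # placeholder enumerator
            x = claimed_incompressible(proof)        # x if proof shows C(x) >= |x|
            if x is not None and len(x) > 10 * C:
                return x                             # first long "incompressible" x

    # This whole program needs about C + O(1) bits (it contains F), so its output
    # satisfies C(x) <= C + O(1) -- contradicting the proved C(x) >= |x| >> C.
    # Hence F can prove "x is incompressible" for at most finitely many x.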
Summary
Kolmogorov complexity is not just mathematics,
but it is also about information science
applications, history, philosophy, physics, and
many other things.
I hope you have seen the colorful history that has
led to this beautiful concept of Kolmogorov
complexity.
In the following lectures, you will see many more
elegant theorems and applications. We start with
Martin-Löf's tests, which are the key to identifying
incompressibility with randomness.
