You are on page 1of 62

LECTURE NOTES FOR THE COURSE TOPICS IN

ERGODIC THEORY, MICHAELMAS 2016

PETER VARJU

These are brief lecture notes for my course. Please send comments
to: pv270@dpmms.cam.ac.uk.

1. Measure preserving systems


Ergodic Theory is the study of measure preserving systems.
Definition 1. A measure preserving system (MPS) is a quadruple
(X, B, , T ), where
X is a set
B is a -algebra
is a probability measure, that is, (X) = 1 and (A) 0 for
all A B
T : X X is a measure preserving transformation, that is a
measurable transformation such that (T 1 (A)) = (A) for all
A B.
Example 2 (Circle rotation). Fix a number [0, 1). Let X =
[0, 1] = R/Z, let B be the collection of Borel sets in [0, 1], the
Lebesgue measure, and T = R , where R is the map
Ra : x 7 x + mod 1.
Example 3 (Times 2 map). Let X, B, be as in the previous ex-
ample and let T = T2 , where
T2 (x) = 2x mod 1.
These examples can be generalized to more general compact groups.
In the first case the map is addition (or multiplication) by a fixed
element of the group, in the second case it is an endomorphism of the
group.
Proof that T2 is measure preserving. Easy to see for intervals:
(a, b) = b a = ((a/2, b/2) (a/2 + 1/2, b/2 + 1/2)) = (T21 (a, b)).
An open set U T is the disjoint union of intervals I1 I2 . . ., hence
X X
(U ) = (Ij ) = (T21 (Ij )) = (T21 (U )).
A compact set K is the complement of an open set U , hence
(K) = 1 (U ) = 1 (T21 (U )) = (X\T21 (U )) = (T21 (K)).
1
2 PETER VARJU

Finally a general Borel set B B is approximated by an open set U


and a compact set K such that K B U and (U ) (K) . On
the other hand
(T21 (B)) (T21 (U )) = (U )
and similarly (T21 (B)) (K), so we must have |(T21 (B))
(B)| , which proves the claim letting 0. 
Ergodic theory studies the long term behaviour of orbits in MPSs.
The orbit of a point x X is the sequence x, T x, T 2 x, . . .. In particular
the following questions are asked:
Let A B and x A. Does the orbit of x visit A infinitely
often?
What is the proportion of the ns such that T n x A?
What is the measure of {x A : T n x A} for some large n?
Example 4. Let A = [0, 1/4). Then T2n x A if and only if the nth
and n + 1th binary digits of x are both 0. Hence the orbit of
1
x = = 0.00101010101010 . . .(2)
6
never visits A again. However, the opposite is true for most points.
In addition:
1
({x A : T n x A}) = for all n 2.
16
We mention one further example of a MPS.
Example 5 (Markov shifts). Let n 2 be an integer, let p1 , . . . , pn be
nn
a probability vector and A = (ai,j ) R0 a matrix, called the matrix
of transition probabilities. We assume
1 1

A .
. = .. , (p , . . . , p )A = (p , . . . , p ).
. . 1 n 1 n
1 1
The following MPS is called a Markov shift.
X = {1, 2, . . . , n}Z ,
B is the Borel -algebra generated by the product topology,
T = is the shift map (x)m = xm+1 ,
the measure is defined by the following formula
(1) ({x X : xm = i0 , xm+1 =i1 , . . . , xm+k = ik })
=pi0 ai0 i1 ai1 i2 aik1 ik ,
which is required to hold for all m, k, i0 , . . . , ik .
It requires a proof that the set-function defined by (1) is
additive and that it can be extended to a probability measure
on B that is -invariant. This will appear in the first example
sheet.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 3

An important special case of Markov shifts is when ai,j = pj for all i, j.


Inspection of (1) shows that is the infinite product of the measure on
{1, . . . , k} given by the probability vector (p1 , . . . , pn ). In this special
case the MPS is called a Bernoulli shift.

2. Furstenbergs correspondence principle


Theorem 6 (Szemeredi). Let S Z be a set of positive upper Banach
density, that is:
1
d(S) := lim sup |{x S : M x < N }|.
N M N M

Then S contains an arbitrarily long arithmetic progression, that is: for


each l Z>0 , there is a Z and d Z>0 such that
{a + jd : j = 0, . . . , l 1} S.
Fix a set S with positive upper Banach density. We construct below
an MPS and show that Szemeredis theorem follows from the so-called
multiple recurrence property of this system.
We define X = {0, 1}Z and take B to be the Borel -algebra gener-
ated by the product topology. We consider the shift transformation
: X X, ((x))n = xn+1 .
The invariant measure will be constructed below, and it will encode
the set S.
Define the element xS X by
(
1 if n S
xSn :=
0 if n
/ S.

and consider the set


A := {x X : x0 = 1}.
The following observation will be used to relate the set S to the dy-
namics: For each n Z we have
nS if and only if n xS A.
Indeed,
n S 1 = xSn = ( n xS )0 n xS A.
The invariant measure in the MPS will be constructed as the limit of
normalized counting measures on longer and longer segments of the or-
bit of xS . Before we give the details, we recall the notion and properties
of weak limits of probability measures.
4 PETER VARJU

2.1. Weak convergence of measures. In this section X is a compact


metric space.
Definition 7. Let n be a sequence of Borel probability measures on
X and let be another Borel probability measure. We say that n
weakly converges (1) to if
Z Z
lim f dn = f d for all f C(X).
n

Here C(X) stands for the space of continuous functions on X. In our


notation, we will write
lim-wn n = .
This concept is useful thanks to the following result, which is an
analogue of the Bolzano-Weierstrass theorem.
Theorem 8 (Banach-Alaoglu/Helly). The space of Borel probability
measures M (X) on a compact metric space X endowed with the topol-
ogy of weak convergence is compact and metrizable. In other words, any
sequence of probability measures has a weakly convergent subsequence.
Remark: If we drop the condition that X is metric (but it should
be Hausdorff, at least), then the compactness of M (X) will still hold,
but we may lose metrizability and the sequential version needs to be
reformulated in terms of nets.
2.2. The invariant measure. Let S, X, B, and xS be the same as
above. We construct now the invariant measure on the MPS that will
be used to prove Szemeredis theorem.
We write x for the probability measure defined by
(
1 if x B
x (B) =
0 if x
/ B.
We choose two sequences of integers {Mm }, {Nm } Z such that
1
d(S) = lim |{x S : Mm x < Nm }|.
m Nm Mm

These sequences exist by the definition of d(S).


We define the measures m M (X) by
NXm 1
1
m = i S.
Nm Mm i=M x
m

Interpretation: m (B) is the proportion of the points in the orbit seg-


ment Mm xS , . . . , Nm 1 xS in the set B.
We define to be the weak limit of a subsequence of m .
Lemma 9. The above defined (X, B, , ) is a MPS.
(1)This is also called weak- convergence.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 5

Proof sketch. Later in the course we prove a general result, which con-
tains this lemma. For this reason, we do not give a rigorous argument
now.
Without loss of generality, we assume that lim-w m = . (Otherwise
we pass to the suitable subsequence.) We need to prove that preserves
, but we only show that preserves m alomst.
We can write for a set B B:
1
m (B) = |{i : Mm i < Nm , i xS B}|
Nm Mm
and similarly
1
m ( 1 (B)) = |{i : Mm i < Nm , i xS 1 (B)}|
Nm Mm
1
= |{i : Mm + 1 i < Nm + 1, i xS B}|.
Nm Mm
Hence
1
|m (B) m ( 1 B)| < .
Nm Mm

Remark 10. If B is a cylinder set(2), i.e. a set of the form
B = {x X : (xN , . . . , xN ) B}
for some N Z0 and B {0, 1}2N +1 , then B is both open and closed,
hence B is continuous.
By the definition of weak convergence, we have then that
Z
1
( (B)) = 1 B d = lim 1 B dm
m
Z
= lim B dm = B d = (B)
m

holds for any cylinder set B.


Using this observation, together with the argument in the Proof
sketch, we can conclude that (B) = (B) for cylinder sets. It is
possible to make the proof of Lemma 9 rigorous by approximating ar-
bitrary Borel sets with cylinder sets in a suitable way.
Proposition 11. Let S Z be a set of positive upper Banach density
and let (X, B, , ) be the measure preserving system constructed above.
Let A X be the set A = {x X : x0 = 1}. Let l 1 be an integer.
Suppose that
(A n (A) 2n (A) . . . (l1)n (A)) > 0
for an integer n 1. Then S contains an arithmetic progression of
length l.
(2)Some sources use a more restrictive terminology and call only sets of the form
{x X : xN = aN , . . . , xN = aN } for some aN , . . . , aN {0, 1} cylinder sets.
6 PETER VARJU

Proof. The set


B := A n (A) 2n (A) . . . (l1)n (A)
is a cylinder set, hence (B) = lim m (B), as we have seen in the
preceding proof. This means that for some m large enough m (B) > 0.
By the definition of m , we must have j xS B for some Mm j
Nm . Then j xS in (A), hence j+in xS A for all 0 i l 1. As
we observed in the beginning of the lecture, this translates as
{j + in : 0 i l 1} S,
which was to be shown. 
We note that (A) = d(S) as can be seen easily from the construction
of . We can now conclude that the following theorem of Furstenberg
implies Szemeredis theorem.
Theorem 12 (Multiple Recurrence, Furstenberg). Let (X, B, , T ) be
a MPS and A B with (A) > 0. Then for any integer l 1 we have
N
1 X
lim inf (A T n A T 2n A . . . T (l1)n A) > 0.
N n=1

3. Poincare recurrence, Ergodicity


The following lemma can be considered the pigeon hole principle of
Ergodic Theory. Putting together with the Furstenberg correspondence
principle it implies that every set of integers of positive upper Banach
density contains an arithmetic progression of length 2, which is not
very impressive yet.
Lemma 13. Let (X, B, , T ) be an MPS and let A B with (A) > 0.
Then there is some n 1 such that (T n (A) A) > 0.
Proof. Assume to the contrary that (A T n (A)) = 0 for all n Z>0 .
Let i < j Z0 . Then
(T i (A) T j (A)) = (T i (A T (ji)(A) )) = 0.
We fix a number N and write
(AT 1 (A) . . . T N (A))
=(A) + (T 1 A) (T 1 A A)
+ (T 2 A) (T 2 A (A T 1 A)) + . . .
+ (T N A) (T N (A) (A . . . T N A))
=N (A),
which is a contradiciton if N > (A)1 . 
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 7

Theorem 14 (Poincare recurrence). Let (X, B, , T ) be an MPS and


let A B with (A) > 0. Then almost every point of A goes back to A
infinitely often. That is to say
 [
\ 
A\ T n (A) = 0.
N =1 n=N
n
S Note that T (A) consists of the set of points x that satisfy T n x A,
n
n=N T (A) is the set T
of points
S thatn
visit A at least once after the
N th iteration of T and N =1 n=N T (A) is the set of points that
visit A infinitely often.
Proof. We denote by A0 the set of points in A that never come back to
A. Then A0 T n A0 A T n A0 = for all n 1, hence (A0 ) = 0
by the lemma.
Let x A be a point that does not come back to A infinitely often.
Let n be the largest integer such that T n x A. Then T n x A0 . This
shows that [
 \  [
A\ T n (A) T n (A0 ).
N =1 n=N n=0
This latter set is of measure 0, because it is the countable union of
measure 0 sets. 
Definition 15 (Ergodicity). An MPS (X, B, , T ) is called ergodic if
A = T 1 (A) implies (A) = 0 or (A) = 1 for any A B.
Ergodicity is a notion of indecomposibility for MPSs. If A B is
invariant (that is T 1 (A) = A), then we can restrict B, and T to A
and after renormalizing the measure, we obtain a new MPS.
The following lemma gives a few alternative characterization of er-
godicity.
Lemma 16. The following are equivalent
(1) The MPS (X, B, , T ) is ergodic.
(2) For any A B with (A) > 0, we have (
T S n
N =1 n=N T (A)) =
1.
(3) For any A B, (A4T 1 (A)) = 0 implies (A) = 0 or (A) =
1.
(4) For any bounded measurable function f : X R, f T =
f holds almost everywhere implies that f is constant almost
everywhere.
(5) For any measurable function f : X C, f T = f holds almost
everywhere implies that f is constant almost everywhere.
The second item shows that for ergodic systems the Poincare recur-
rence holds in a stronger form: not only almost every point in A but
also almost every point in X visit A infinitely often. The third, fourth
and fifth items give characterizations that are useful in practice. In
8 PETER VARJU

particular, we will use the fourth item to show that the circle rotation
R is ergodic if and only if is irrational.
Proof. (1) (2) : The set B :=
T S n
N =1 n=N T (A) is invariant. In-
deed, x B if and only if its orbit visits A infinitely often. This is true
for x if and only if it is true for T x. This shows that B = T 1 (B), as
claimed. By Poincare recurrence (B) (A) > 0, hence (B) = 1 by
ergodicity.
(2) (3) : Let A satisfy (A4T 1 (A)) = 0 and let B be as in the
previous paragraph. We show that (B) (A). For n 1, we define
Dn as the set of points x B\A such that n is the smallest integer
with T n x A. The sets Dn clearly partition B\A. We can write
(Dn ) (T n (A)\T n+1 (A)) = (T 1 A\A) (T 1 A4A) = 0.
Hence (
S
n=1 Dn ) = 0 and (B) (A), as required. If (A) 6= 0,
then (A) (B) = 1, which proves (3).
(3) (4) : For t R write
At := {x X : f (x) t}.
Clearly (At 4T 1 At ) = 0, hence (At ) = 0 or (At ) = 1 for each
t. Observe that the sets At are increasing with t, hence the function
t 7 (At ) is a monotone increasing function.
Since f is bounded, we have (At ) = 0 if t is sufficiently small, and
(At ) = 1 if t is sufficiently large. Therefore, there is a finite real
number c such that (At ) = 0 for t < c and (At ) = 1 for t > c. By
the previous observations,

\
{x : f (x) = c} = (Ac+1/n \Ac1/n )
n=1

is of measure 1.
(4) (1) : Let A B be such that T 1 (A) = A. Then A T = A ,
hence A is constant almost everywhere. If this constant is 0, then
(A) = 0, if it is 1, then (A) = 1. Hence the system is ergodic,
indeed. 
Example 17. The circle rotation R is ergodic if and only if is
irrational.
Indeed: Let f : R/Z R be a bounded measurable function. We
can expand f in Fourier series
X
f= an exp(2inx).
nZ

Similarly for f R :
X X
f R = an exp(2in(x + )) = (an exp(2in)) exp(2inx).
nZ nZ
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 9

We see that f is R -invariant if and only if an = an exp(2in) holds


for all n by the uniqueness of Fourier series.
If is irrational, then exp(2in) 6= 1 for all n 6= 0. Hence an = 0
for all n 6= 0, if f is invariant. This proves that f (x) = a0 almost
everywhere.
If is rational, then n1 is an integer for some n1 6= 0, hence the
n1 th condition holds irrespective of the value of an1 . Thus f (x) =
cos(2in1 x) is R invariant and not constant almost everywhere.
4. Ergodic theorems
In the previous section, we discussed recurrence, which is concerened
with orbits visiting a given set infinitely often. We have not yet ad-
dressed the question how frequent these visits are, which is what we
pursue now.
Theorem 18 (Mean Ergodic Theorem, von Neumann). Let (X, B, , T )
be an MPS and denote by
I := {f L2 (X) : f T = f a.e.}
the closed subspace of T -invariant functions. Denote by PT : L2 (X)
I the orthogonal projection. Then for every f L2 (X):
N 1
1 X
f T n PT f in L2 (X).
N n=0
Theorem 19 (Pointwise Ergodic Theorem, Birkhoff). Let (X, B, , T )
be an MPS. Then for every f L1 (X), there is a T -invariant function
f L1 (X) such that
N 1
1 X
f (T n x) f (x) pointwise -almost everywhere.
N n=0
When f L2 (X), then of course the function f in the Pointwise
Ergodic Theorem is PT f . There are also Lp variants of the Mean
ErgodicPTheorem for 1 p < , see the example sheet. The quantity
N 1
(1/N ) n=0 f T n is called an ergodic average.
To understand the meaning of these results, we look at the case,
when the system is ergodic. In that case, the limit function f (or PT f
in case of the Mean Ergodic Theorem) is constant almost everywhere
being T invariant. To figure out the value of this constant we look at
the integral of the ergodic averages. We see that
Z N 1 Z
1 X n
f T d = f d
N n=0
for all N . Using the L1 version
R of the Mean Ergodic Theorem
R (or

2
R
assuming f L (X)) we get f d = f d. Thus f = f d
almost everywhere.
10 PETER VARJU

Hence we can write the conclusion of the ergodic theorems as


N 1 Z
1 X n
lim f (T x) = f d
N N X
n=0

in the appropriate sense (i.e. almost everywhere, or in L2 .) The quan-


tities on the left are called time averages, because in a physical system
they correspond to averages of the results of measurements at time
0, 1, . . . , N 1. (The measurement is the evaluation of the function f
at a given point, which represent a state of the physical system.) The
quantity on the right is called the space average, as it corresponds to
the average of the result of the measurement taken at all possible states
of the system with respect to the invariant measure. In this language
the ergodic theorems are interpreted as the convergence of the time
averages to the space average.
We begin with some background material concerning the operator
f 7 f T .
Lemma 20. Let (X, B, ) be a probability space and let T : X X
be a measurable transformation. Then T is measure preserving if and
only if
Z Z
(2) f T d = f d
X X
1
for all f L (X).
Proof. The If part: Let A B and note that x T 1 (A) is equivalent
to T x A. Hence we can write:
Z Z Z
1
(T (A)) = T 1 A d = A T d = A d = (A),

which proves that T is measure preserving.


The Only if part: Similarly to the argument of the If part we
can show that (2) holds for characteristic functions. Then by linearity
of integration, it follows also for simple functions, that is functions that
take only finitely many different values.
Now let f L1 (X) and assume that it is non-negative. Let f1 , f2 , . . .
be a monotone increasing sequence of simple functions with f = lim fn
almost everywhere. Then f T = lim fn T almost everywhere. (Here
we used the measure preserving property.) Hence using the definition
of integration and (2) for simple functions:
Z Z Z Z
f T d = lim fn T d = lim fn d = f d.

If f L1 (X) is not non-negative, then we can write it as the dif-


ference of two non-negative functions and conclude (2) by linearity of
integration. 
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 11

Definition 21. Let (X, B, , T ) be an MPS. If f : X C is a mea-


surable function, we write
UT f = f T
and call UT the Koopman operator.
Lemma 22. The Koopman operator UT is an isometry on the Hilbert
space L2 (X). That is to say
hf, gi = hUT f, UT gi
for all f, g L2 (X).
Proof. Apply the previous lemma to the function f g L1 (X). 
Definition 23. An MPS (X, B, , T ) is said to be invertible, if there is
a measure preserving map S : X X such that T S = S T = IdX
almost everywhere. If such a map exists, it is customary to denote it
by T 1 .
Example 24. The circle rotation is invertible, but the times 2 map is
not.
Lemma 25. If the system (X, B, , T ) is invertible, then UT is unitary
on L2 (X) and UT = UT 1 .
Proof. We use the first lemma of the lecture to the function f (g T 1 ):
Z Z Z
1 1
hf, UT 1 gi = f (gT )d = (f (gT ))T d = f T gd = hUT f, gi.

This shows that UT = UT 1 , indeed. Clearly UT UT 1 = UT 1 UT =


IdL2 (X) , which together with the first part proves that UT is unitary. 
The proof of both ergodic theorems (at least the proofs that we will
give) rely on the observation that the convergence can be proved easily
for certain special functions. First, the ergodic averages of a T -invariant
function f I are equal to f , so they converge to f .
If f = UT g g for some g, then the ergodic averages become tele-
scopic sums and only the boundary terms remain, that is:
N 1
1 X n 1
UT (UT g g) = (UTN g g)
N n=0 N
and the right hand side is easily seen to converge to 0.
The following lemma shows that these two type of functions are
enough to look at.
Lemma 26. Write
B := {UT g g : g L2 (X)}.
Then I = B .
12 PETER VARJU

It is important to note that the space B is not closed. So we have


2
L (X) = I B, but not every function in L2 (X) is the sum of a
T -invariant function and a cocycle. (The elements of B are called
cocycles.)
Proof. We can write
f B hf, UT g gi = 0 for all g L2 (X)
hf, gi = hf, UT gi = hUT f, gi for all g L2 (X) f = UT f.
If the system was invertible, then we could finish the proof by applying
UT = (UT )1 to both sides of the last equation.
We prove that f = UT f f = UT f holds in the general case, too.
We can write
f =UT f hf UT f, f UT f i = 0
kf k22 + kUT f k22 hf, UT f i hUT f, f i = 0
kf k22 + kUT f k22 hUT f, f i hf, UT f i + (kUT f k22 kUT f k22 ) = 0
kf UT f k22 + (kUT f k22 kUT f k22 ) = 0
We note that both terms in the left hand side of the last equation are
non-negative. This is clear about the first term. To show it for the
second term, we observe that kUT f k2 = kf k2 , since UT is an isometry,
and kUT f k2 kf k2 , as kUT k = kUT k = 1.
It follows then that
f = UT f f = UT f and kf k2 = kUT f k2 .
The second statement on the right follows from the first one, hence
f = UT f f = UT f , as required. 
Proof of the Mean Ergodic Theorem. Let f L2 (X) and fix an > 0.
By the lemma, we can write f = PT f +UT gg+e, for some e, g L2 (X)
such that kek2 < . Thus
1 N 1 1 N 1
X n

N 1 X n
lim sup UT f PT f lim sup (UT gg)+ UT e < .

N N n=0
2 N N N n=0
2

We can conclude the proof by taking 0. 

5. Proof of the pointwise ergodic theorem


Theorem 27 (Wiener, Maximal Ergodic Theorem). Let (X, B, , T )
be a MPS and let f L1 (X) be a function. Write
( N 1
)
1 X n
E = x X : sup UT f (x) > .
N Z>0 N n=0

Then (E ) 1 kf k1 .
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 13

Proposition 28. Let (X, B, , T ) be a MPS and let f L1 (X) be a


function. Write
f0 = 0, f1 = f, f2 = f + UT f, . . . ,
fn = f + UT f + . . . + UTn1 f, . . .
and
FN (x) = max {fi (x)}.
0iN 1
Then Z
f d 0.
{xX:FN (x)>0}

Proof. Fix an x X is such that FN (x) > 0. Then FN (x) = fj (x) for
some 1 j N 1. Hence
FN (x) = UT fj1 (x) + f (x) UT FN (x) + f (x).
Thus FN (x) > 0 implies that
f (x) FN (x) UT FN (x).
We can write
Z Z
f d (FN (x) UT FN (x))d
{xX:FN (x)>0} {xX:FN (x)>0}
Z
(FN (x) UT FN (x))d = 0.
X
The second inequality holds because UT Fn is always non-negative and
hence FN (x) = 0 implies that the integrand FN (x) UT FN (x) is non-
positive. 
Proof of the Maximal Ergodic Theorem. Put
( N 1
)
1 X n
E,M = x X : max UT f (x) >
1N M N
n=0
( N 1
)
X
= x X : max (UTn f (x) ) > 0 .
1N M
n=0

By the proposition applied to the function f , we have


Z Z
0 (f )d f d (E,M ) kf k1 (E,M ).
E,M E,M

This implies
S (E,M ) 1 kf k1 . We can conclude the proof by noting
E = E,M . (The sets E,M are increasing with M .) 
Proof of the Pointwise Ergodic Theorem. We fix a number > 0. We
can write f = f + e1, such that f L2 and ke1, k1 < .
Recall that I = {f L2 (X) : UT f = f } denotes the space of
invariant functions and B = {UT g g : g L2 (X)}. As we have seen
14 PETER VARJU

in the proof of the Mean Ergodic Theorem, we have L2 (X) = I B.


Hence we can write

f = PT f + UT g g + e2,

such that ke2, k1 ke2, k2 < . (Recall that PT denotes the orthogonal
projection to the space I.)
Furthermore, we can write g = h + e3, such that h L (X)
andke3, k1 < . Putting everything together, we obtain

f = PT f + UT h h + e ,

where ke k1 4. (Here we used that the L2 -norm always dominates


the L1 -norm on a probability space.)
We note that for every x X, we have
1 N 1 1 N 1
X n
2 X n

(3) UT f (x) PT f (x) kh k + UT e (x) .

N n=0 N N n=0

We estimate the measure of the set of points x, where the right hand
side is large using the Maximal Ergodic Theorem. We consider the set
N 1
n 1 X n 1o
Em, := x X : lim sup | UT f (x) PT f (x)| > .
N N n=0 m

Using (3) and the Maximal Ergodic theorem for e and e with =
1/m we obtain:
(Em, ) 2 m 4.
At this point, we intend to take the limit 0, however, the fact
that PT f depends on presents a difficulty. One way to overcome
this problem is to first focus on the convergence of the ergodic averages
only without identifying the limit.
We denote
PN 1 by F the set of points x such that the ergodic averages
(1/N ) n=0 UTn f (x) does not converge at x. We aim to show that
(F ) = 0. Write for all m Z>0
1 1
1 NX N2 1
n
n 1 X n
2o
Fm := x X : lim sup UT f (x) UT f (x) > .

N1 ,N2 N1 n=0 N2 n=0 m
S
It is clear that F Fm , so it is enough to show that (Fm ) = 0 for
all m.
We observe that Fm Em, for all m, , hence (Fm ) 8m. We
take 0 and conclude that (Fm ) = 0, as required.
We proved that the ergodic averages convere to a limit function f
almost everywhere. By Fatous Lemma, f L1 .
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 15

To prove T -invariance, we observe that for almost all x, we have


N 1
1 X n
f (x) = lim U f (x),
N n=0 T
N 1
1 X n
f (T x) = lim U f (T x).
N n=0 T
Subtracting these two equations, we find
1 1 N
f (T x) f (x) = lim (f (T N x) f (x)) = lim N (f (T 2 x) f (x)).
N 2
We observe that the series

1

N
X
N (f (T 2 x) f (x))

N =1
2
converges in L1 , hence it must be bounded almost everywhere. This
implies that the terms converge to 0 almost everywhere, hence f (T x)
f (x) = 0, as required. 
Definition 29. Let x [0, 1) be a number and K 2 an integer.
Write x = 0.a1 a2 . . .(K) for the base K expansion of x. We say that the
number x is normal in base K, if for all integer M 1 and b1 , . . . , bM
{0, . . . , K 1} we have
|{1 n N : an+j = bj for all 1 j M }| 1
M.
N K
Theorem 30. Almost every number x [0, 1) is normal in every base
K 2.
Proof. We apply the pointwise ergodic theorem for the system (R/Z, B, m, TK ),
where m is the Lebesgue measure and TK is the map x 7 K x mod 1.
This system P is ergodic; see
Pthe first example sheet.
M M
Put A = [ j=1 bj K , j=1 bj K j + K M ). Observe that T n x A
j

if and only if an+j = bj for all 1 j M . Then the pointwise ergodic


theorem applied for f = A , for every set A of this form, implies that
almost every x [0, 1) is normal in base K. Here we use that there
is a countable collection of sets A that we need to consider, hence the
union of the countably many sets of exceptional points is of measure 0.
The union for K of the set of non-normal numbers in base K is of
measure 0, since there are only countably many K. 
6. Unique ergodicity
In the previous lecture, we applied the pointwise ergodic theorem to
show that a property holds for almost every numbers. This result gives
no insight whether the property holds for specific numbers, e.g. e,
or 21/3 . In this lecture, we study the question: In which situation does
the pointwise ergodic theorem hold for all points?
16 PETER VARJU

To make sense of this question, we work in the continuous category


(measurable functions are not even defined everywhere).
If there are more than one T -invariant measures for a transforma-
tion T , then we can apply the Pointwise Ergodic Theorem for both
measures, which may predict different limits for the ergodic averages
in each case. This is not a contradiction, because the set, where con-
vergence holds in one case may be a null-set with respect to the other
measure. However, this suggests, that the everywhere convergence of
ergodic averages prohibits the presence of multiple invariant measures.
The next result confirms this intuition.
Definition 31. Let T : X X be a continuous map on a compact
metric space. We say that the topological dynamical system (X, T ) is
uniquely ergodic if there is exactly one T -invariant Borel probability
measure.
Theorem 32. Let X be a compact metric space and let T : X X be
a continuous map. The following are equivalent
(1) The system (X, T ) is uniquely ergodic.
(2) For every f C(X) there is a number cf C such that
N 1
1 X n
U f (x) cf
N n=0 T

uniformly in x X.
(3) There is a dense subset A C(X) such that for every f A
there is a number cf C such that
N 1
1 X n
U f (x) cf
N n=0 T

for every x X not necessarily uniformly.


We begin by recalling the following result.
Theorem 33 (Riesz Representation Theorem). Let X be a compact
metric space and denote by C(X) the space of continuous functions.
To a complex Borel measure , we associate a bounded functional L
on C(X) as follows:
Z
L f := f d.

Then the operation L is a bijection (in fact a Banach space


isomorphism) between the space of complex Borel measures on X and
the space of bounded linear functionals on C(X).
Most of the time we will only need the following corollary, which is
much simpler than the theorem itself.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 17

Corollary 34. If two probability measures 1 , 2 M(X) satisfy


Z Z
f d1 = f d2

for all f C(X), then 1 = 2 .


For the rest of this lecture, X denotes a compact metric space. We
denote by M(X) the space of Borel probability measures on X. If
T : X X is a continuous map, then we define the push-forward of a
measure M(X) as
T (A) := (T 1 (A))
for all Borel set A B. We note that T preserves the measure (or in
other words is T -invariant) if and only if T = .
Lemma 35. We have
Z Z
f dT = f T d

for every bounded measurable functions f .


Proof Sketch. First we consider a characteristic function f = A for a
Borel set A B. Then
Z
A dT = T (A) = (T 1 A)
Z Z
A T d = T 1 A d = (T 1 A),

and we see that the claim holds. The claim can be verified for simple
functions and then for general Borel functions in the same way as we
did before. 
The following is a useful characterization of the measure preserving
property of continuous maps.
Lemma 36. Let T : X X be a continuous transformation. A
measure M(X) is T -invariant if and only if
Z Z
(4) f d = f T d

for all f C(X).


Proof. We have already seen that (4) holds even for all functions in L1
if the measure is T -invariant.
For the other direction, we note that by the previous lemma
Z Z Z
f dT = f T d = f d

for all f C(X). We have then T = , by the Riesz representation


theorem. Thus is T -invariant, indeed. 
18 PETER VARJU

The next result is a powerful tool to construct invariant measures. In


particular it shows that there is at least one always if T is a continuous
map on a compact metric space.
Theorem 37. Let T : X X be a continuous map on a compact
metric space. Let j M(X) be a sequence of probability measures on
X and let {Nj } Z0 be a sequence such that Nj . Then any
weak limit point of the sequence of measures
Nj 1
1 X n
T j
Nj n=0

is invariant for T .
Proof. Assume that
Nj 1
1 X n
= lim-w T j
Nj n=0

for otherwise we can pass to a subsequence. Then for every f C(X)


we have
Z Z h 1 NX j 1 Z Nj 1 Z
n 1 X i
f T d f d = lim f T dT j f dTn j
Nj n=0 Nj n=0
h 1 NX j 1 Z Nj 1 Z
n+1 1 X i
= lim f T dj f T n dj
Nj n=0 Nj n=0
Z Z
1 h i
= lim f T Nj dj f dj = 0.
Nj


Proof of Theorem 32. (1) (2) : Denote by M(X) the unique


invariant
R measure. Suppose to the contrary that (2) fails for cf =
f d. Then there is a sequence of points {xj } X and a sequence of
positive integers {Nj } Z0 with Nj such that
1 NXj 1 Z
n
UT f (xj ) f d

Nj n=0

for some > 0 and all j. Moreover, we can assume that


Nj 1
1 X n
lim U f (xj ) = a
Nj n=0 T

for some number


R a C, otherwise
R we pass to a subsequence. Neces-
sarily a 6= f d, indeed a f d .
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 19

We let to be any weak limit point of the sequence of measures


Nj 1
1 X n
T x .
Nj n=0 j
R R
We know that it is T invariant, but f d = a 6= f d. Hence 6= ,
which is a contradiction.
(3) (1) : Let and be two T -invariant probability measures. By
dominated convergence,
Z N 1
1 X
lim f T n d = cf
N N n=0

for all f A. Since


Z N 1 Z
1 X n
f T d = f d
N n=0
R
for all N , this
R implies f d = cf . R R
Similarly, f d = cf for all f A. Then f d = f d holds for
all f A. It also holds for all f C(X), because A is dense in C(X).
Hence = , as required. 

This theorem illustrates the importance of understanding the struc-


ture of measures invariant for a given continuous map. In many cases,
the invariant measures are hopeless to classify, but sometimes they have
nice structure. The following is a very important question raised by
Furstenberg.
Open Problem 38. What are the Borel probability measures on R/Z
that are invariant under both T2 : x 7 2x and T3 : x 7 3x. The
Lebesgue measure is one example, and it is also easy to give exam-
ples that are supported on finitely many rational points. Any known
examples are convex combinations of these.
Later in the course we will discuss Rudolphs theorem, which claims
that a measure that is invariant and ergodic under the semigroup gen-
erated by T2 and T3 and that has positive entropy with respect to T2
is necessarily the Lebesgue measure.
Example 39. If is irrational, then the circle rotation (R/Z, B, m, R )
is uniquely ergodic. Indeed, let be an invariant measure. Then for
all n Z
Z Z
exp(2inx)d = exp(2in(x + ))d
Z
= exp(2in) exp(2inx)d.
20 PETER VARJU

Since n is not an integer for any n Z except for n = 0, we have


exp(2in) 6= 1. Thus
Z

b(n) = exp(2inx)d = 0

for all n 6= 0. This implies that is the Lebesgue measure. (If you have
not seen this before, look up the Stone Weierstrass theorem and show
that the set of trigonometric polynomials (i.e. finite linear combina-
tions of the functions exp(2inx)) form a dense subset of C(R/Z).)
Definition 40. A sequence {xn } [0, 1] is equidistributed if
N Z
1 X
lim f (xn ) = f d
N N
n=1

holds for all f C(R/Z).


Remark 41. If {xn } [0, 1] is equidistributed, then
|{1 n N : a xn < b}|
ba
N
for every numbers 0 a < b 1.
The following is an immediate corollary of the unique ergodicity of
circle rotations.
Theorem 42. The sequence {n mod 1} is equidistributed if is ir-
rational.

7. Equidistribution of polynomials
The goal of this section is to generalize the equidistribution result
for {n mod 1} to polynomial sequences. We begin with a theorem of
Furstenberg, which we will use to construct uniquely ergodic systems
that are suited to that application.
Theorem 43 (Furstenberg). Let X be a compact metric space and let
T : X X be a continuous transformation. Suppose (X, T ) is uniquely
ergodic and denote by the invariant measure. Let c : X R/Z be a
continuous function. Define the map S on X R/Z by
S(x, y) = (T x, c(x) + y mod 1).
Then m is S-invariant. If the measure m is ergodic, then
the system (X R/Z, S) is uniquely ergodic.
Remark 44. This result holds with an arbitrary compact metrizable
topological group in place of R/Z, which need not be Abelian. The
name of this construction is skew-product.
One of the key notions used in the proof of Furstenbergs theorem is
genericity that we define now.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 21

Definition 45. Let X be a compact metric space, let T : X X be a


continuous transformation and let be a T -invariant Borel probability
measure. A point x X is called generic with respect to , if
N 1
1 X
lim-wN T n x = .
N n=0

Proposition 46. Let (X, B, , T ) be as above, and suppose that it is


an ergodic MPS. Then -almost every x X is generic with respect to
.
Proof. Let F C(X) be a dense countable set. By the pointwise
ergodic theorem, for every f F , there is Xf B with (Xf ) = 1
such that
N 1 Z
1 X n
lim f (T x) = f d.
N N
n=0
T
Fix x f F Xf . Let g C(X). Fix > 0 and choose f F such
that kg f k < . Then
1 N 1 N 1
X n 1 X n

f (T x) g(T x) <,
N n=0 N n=0

Z Z
f d gd <.

Thus
1 N 1 Z
X n
lim sup g(T x) gd < 2.

N N n=0
We take 0 and conclude that x is generic with respect to , since
g was arbitrary. 
Proof of Furstenbergs theorem on skew products. For showing that S
preserves m, we choose a testfunction f C(X R/Z) and write
Z Z Z Z 1c(x)
f (S(x, y))d(x)dy = f (T x, c(x) + y)dyd(x)
R/Z X X c(x)
Z Z 1
= f (T x, y)dyd(x)
X 0
Z Z
= f (x, y)d(x)dy.
R/Z X

Assume that m is ergodic, and we prove unique ergodicity. De-


note by E X R/Z the set of generic points with respect to m.
Our strategy is to show that E is very large.
We first observe m(E) = 1. Then we prove the following.
Claim. If (x, y) E, then (x, y + t) E, also for every t R/Z.
22 PETER VARJU

Proof. For each t R/Z, we consider the transformation Ut : X


R/Z X R/Z defined as Ut (x, y) = (x, y + t). The proof relies on
the observation that Ut S = S Ut . Indeed,
Ut S(x, y) = Ut (x, c(x)+y) = (x, c(x)+y+t) = S(x, y+t) = SUt (x, y).
To show that (x, y + t) is generic, we fix a testfunction f C(X
R/Z) and write
N 1 N 1
1 X n 1 X
f (S (x, y + t)) = f (S n Ut (x, y))
N n=0 N n=0
N 1 ZZ
1 X
= f Ut (S n (x, y)) f Ut d(x)dy
N n=0
ZZ
= f (x, y)d(x)dy.

We used that (x, y) E and the definition of genericity applied to the


testfunction f Ut and then that Ut preserves m, which is easy to
check. 
By the above claim, there is a set A B(X) such that E = AR/Z.
We also have (A) = m(E) = 1.
Let be an arbitrary S invariant Borel probability measure. Denote
by P : X R/Z X the coordinate projection. Then P is T -
invariant, hence it is equal to . Indeed,
P (T 1 B) =((T 1 B) R/Z) = (S 1 (B R/Z))
=(B R/Z) = P (B)
for any B B(X).
Then P (A) = (A) = 1, hence (E) = (A R/Z) = 1. Fix
f C(X R/Z). For all x E, hence for -almost every x, we have
N 1 Z
1 X n
f (S (x, y)) f d m.
N n=0
By dominated convergence, we have
Z N 1 Z
1 X n
f (S (x, y))d f d m.
N n=0
R
The left hand side is equal to f d for all N , hence
Z Z
f d = f d m.

since f was arbitrary, this proves that = m, hence (X R/Z, S)


is indeed uniquely ergodic. 
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 23

Corollary 47. Let R/Z be irrational and define the map S on


X = (R/Z)d by
S(x1 , . . . , xd ) = (x1 + , x2 + x1 , . . . , xd + xd1 ).
Then the system (X, S) is uniquely ergodic.
Proof. We argue by induction. If d = 1, then the system is an irrational
circle rotation, hence it is uniquely ergodic.
We assume that d > 1 and the claim holds for d 1. We can apply
Furstenbergs above theorem, and we only need to show that the system
is ergodic.
To that end, we fix an invariant bounded measurable function f on
X and aim to show it is constant. We expand f in Fourier series and
write:
X
f (x) = an exp(2in x),
nZd
X
f (Sx) = an exp(2in Sx)
nZd
X
= an exp(2i(n1 (x1 + ) + n2 (x2 + x1 ) + . . . + nd (xd + xd1 )))
nZd
X
= an exp(2in1 ) exp(2i((n1 + n2 )x1 + . . .
nZd
+ (nd1 + nd )xd1 + nd xd )).

Writing
S(n)
b = (n1 + n2 , . . . , nd1 + nd , nd )
d
for n Z , we obtain
an exp(2in1 ) = aS(n)
b .
This means that |an | = |aS(n)
b |, in particular.
d
Let m Z be such that am 6= 0. By Parsevals formula,
X
|an |2 = kf k22 .
nZd

Hence the number of n Zd such that |an | = |am | must be finite. This
means that the sequence Sbk (m) must be periodic. Since (Sbk m)d1 =
md1 + kmd , we have md = 0. By a similar argument, we can prove
m2 = . . . = md = 0.
Finally, we consider the case m2 = . . . = md = 0 and am 6= 0. Then
S(m) = m, hence am exp(2im1 ) = am , which implies exp(2im1 ) =
b
1. Since is irrational, this implies m1 = 0.
We showed that an = 0 unless n = (0, . . . , 0), which proves that f is
constant, which is precisely what we wanted to show. 
24 PETER VARJU

Theorem 48 (Weyl). Let p(x) = ad xd + . . . + a1 x + a0 be a polynomial


such that at least one of the coefficients ai is irrational. Then the
sequence (p(n) mod 1) is equidistributed.
Proof. We first consider the special case when ad is irrational. We
consider the system (X, S) defined in the corollary given above. It is
easily proved by induction that
n n

x1 x +

0 1 1
x 2
nx2 + nx1 + n
0 1 2

n

S .. = .. .

. . 
n n n n
  
xd 0
xd + 1 xd1 + . . . + d1 x1 + d
Here we wrote the vector vertically just for notational convenience.
Write
t(t 1) (t i + 1)
qi (t) = .
i!
Since qi (t) for i = 0, . . . , d is a basis in the vectorspace of polynomials
of degree at most d, there are some real numbers , x1 , x2 , . . . , xd such
that
p(t) = q0 (t)xd + q1 (t)xd1 + . . . + qd1 (t)x1 + qd .
Moreover, = ad d! is irrational. Hence there are , x1 , . . . , xd R/Z
such that
       
n n n n
(5) p(n) = xd + xd1 + . . . + x1 + mod 1
0 1 d1 d
for each n Z>0 and is irrational.
Fix f C(R/Z) and define g C((R/Z)d ) by g(x) = f (xd ). Then
N 1 N 1
1 X 1 X
(6) f (p(n)) = g(S n x),
N n=0 N n=0
where and x = (x1 , . . . , xd ) are chosen in such a way that (5) holds.
By unique ergodicity of the system (X, S), we conclude that (6) con-
verges to Z Z
gdmd = f dm,
as required.
Finally, we consider the general case and reduce it to the special
case considered above. Let k be maximal such that ak is irrational.
Let q Z>0 be such that qaj Z for all j = d, . . . , k + 1. Then
p(qn+b) = ad bd +. . .+ak+1 bk+1 +ak (qn+b)k +. . . a1 (qn+b)+a0 mod 1,
where the right hand side is a polynomial in n, whose leading coefficient
is irrational. By the special case, we know then that p(qn + b) is
equidistributed for each b = 0, . . . , q 1. Then the original sequence is
also equidistributed. 
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 25

8. Mixing properties
Definition 49. A MPS (X, B, , T ) is said to be mixing if for every
measurable sets A, B B
lim (T n (A) B) = (A)(B).
We note that (T n (A) B) = (x X : x B, T n x A), hence
mixing can be interpreted as the property that the state of the system
at time n becomes independent of the state at times 0 as n grows. This
notion can be generalized to multiple sets and times as follows.
Definition 50. A MPS (X, B, , T ) is said to be mixing on k sets if
the following holds. Let A0 , . . . , Ak1 B and let > 0. Then there is
a number N Z>0 such that for all n1 , n2 , . . . , nk1 Z>0 that satisfy
n1 N, n2 n1 N, . . . , nk1 nk2 N
we have
|(A0 T n1 A1 . . . T nk1 Ak1 ) (A0 ) (Ak1 )| .
It is clear that a MPS is mixing if and only if it is mixing on 2 sets.
The following is a long standing open problem in ergodic theory.
Open Problem 51. Is there a MPS that is mixing on 2 sets but not
mixing on 3 sets?
The fact that this basic problem is still open underlines the fact that
mixing is not an easy notion to work with. We will introduce now
the notion of weak mixing, which may look less natural at first sight
but turns out to be more useful for the theory. This notion replaces
the limit in the definition of mixing with another kind of convergence,
which we define now.
Definition 52. We say that a set S Z>0 has full density if
1
|S [1, N ]| = 1.
lim
N
We say that a sequence of complex numbers {an } C converges
in density to a complex number a C if the set
{n Z>0 : |a an | < }
has full density for all > 0. We denote this by D-lim an = a.
We say that a sequence of complex numbers {an } C Cesaro
converges to a complex number a C if
N
1 X
lim an = a.
N N
n=1

We denote this by C-lim an = a.


26 PETER VARJU

Definition 53. A MPS (X, B, , T ) is said to be weak mixing if


D-limn (T n (A) B) = (A) (B)
for all A, B B.
Lemma 54. Let {an } R be a bounded sequence and a R. The
following are equivalent.
(1) D-lim an = a,
(2) C-lim |an a| = 0,
(3) C-lim(a an )2 = 0,
(4) C-lim an = a and C-lim a2n = a2 .
Proof. (1) (2): Fix > 0 and denote by M the supremum of |an |.
Let N be sufficiently large so that
1
|{n {1, . . . , N } : |a an | < }| > 1 .
N
Then
XN
|a an | 2M N + N = (2M N + N ).
n=1
We finish the proof by taking 0.
(2) (1): Fix > 0. Then
N
X
|{n {1, . . . , N } : |a an | > }| |a an |.
n=1
Thus
1
lim |{n {1, . . . , N } : |a an | > }| = 0.
N N
This proves the claim.
(1) (3) is very similar to (1) (2).
(1), (2) (4):
1 X N 1 X N N
1 X
lim an a = lim (an a) lim |an a| = 0

N N N N N N
n=1 n=1 n=1
showing the first part of the claim. The second part can be proved
similarly by noting that D-lim an = a implies D-lim a2n = a2 , which can
be seen directly from the definition.
(4) (3):
N
1 X 2
C-lim(a an )2 = lim (a 2aan + a2n )
N N
n=1
=a2 2a C-lim an + C-lim a2n = a2 2a2 + a2 = 0.

Theorem 55. Let (X, B, , T ) be an MPS. Then the following are
equivalent.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 27

(1) (X, B, , T ) is weak mixing,


(2) (X Y, B C, , T S) is ergodic for all ergodic MPS
(Y, C, , S),
(3) (X X, B B, , T T ) is ergodic,
(4) (X X, B B, , T T ) is weak mixing,
(5) the operator UT has no non-constant eigenfunctions.

The following lemma, which appears in the second example sheet,


will be used in the proof of the theorem.

Lemma 56. Let (X, B, , T ) be a MPS. Let S B be a semi-algebra


(or -system) generating B.
The system (X, B, , T ) is weak mixing if and only if

D-limn (T n (A) B) = (A) (B)

for all A, B S.
The system (X, B, , T ) is ergodic if and only if

C-limn (T n (A) B) = (A) (B)

for all A, B S.

Proof of the theorem. (1) (2): We consider the semi-algebra of mea-


surable rectangles

S := {B C : B B, C C},

which generates the -algebra B C. We check that the property


characterizing ergodicity in the previous lemma holds. Let B1 C1 , B2
28 PETER VARJU

C2 S. Then
1 N 1
X
((T S)n (B1 C1 ) B2 C2 )
N n=0


(B1 C1 ) (B2 C2 )

1 N 1
X
= [(T n (B1 ) B2 )(S n (C1 ) C2 )
N n=0

(B1 )(C1 )(B2 )(C2 )]

1 N 1
X
= [(T n (B1 ) B2 )(S n (C1 ) C2 )
N n=0

(B1 )(B2 )(S n (C1 ) C2 )]

1 N 1
X n
+ [(B1 )(B2 )(S (C1 ) C2 ) (B1 )(C1 )(B2 )(C2 )]

N n=0
N 1
1 X
|(T n (B1 ) B2 ) (B1 )(B2 )|
N n=0
1 N 1
X
+ (B1 )(B2 ) [(S n (C1 ) C2 ) (C1 )(C2 )]

N n=0

In the last inequality, we used the triangle inequality to estimate the


first sum and then estimated (S n (C1 ) C2 ) above by 1 in each term.
Thus
1 N 1
X
lim sup ((T S)n (B1 C1 ) B2 C2 )
N N n=0

(B1 C1 ) (B2 C2 )

N 1
1 X
lim |(T n (B1 ) B2 ) (B1 )(B2 )|
N N
n=0
1 N 1
X
+ lim [(S n (C1 ) C2 ) (C1 )(C2 )] = 0.

N N
n=0

The first sum converges to 0 since (X, B, , T ) is weak mixing. The


second sum converges to 0 since (Y, C, , S) is ergodic. This proves
that the product system is indeed ergodic.
(2) (3): First, we apply (2) with Y being the one point space to
see that (X, B, , T ) is ergodic. Then (3) follows as a special case of
(2).
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 29

(3) (1): Fix B, C B. We first apply the characterization of


ergodicity in the lemma for the sets B X, C X X X:
C-lim ((T T )n (B X) C X) = (B X) (C X).
Unwinding the definitions, we write
C-lim (T n (B) C)(T n (X) X) = (B)(X)(C)(X)
and hence
C-lim (T n (B) C) = (B)(C).
We do the same with the sets B B, C C X X:
C-lim ((T T )n (B B) C C) = (B B) (C C).
Thus
C-lim (T n (B) C)2 = (B)2 (C)2 .
This together with the previous paragraph implies
D-lim (T n (B) C) = (B)(C),
which was to be proved.
(1) (4): This is very similar to the above arguments.
(3) (5): Let f be a non-constant eigenfunction of the operator
UT . We show that the function fe(x1 , x2 ) = f (x1 ) f (x2 ) : X X C
is T T invariant. By ergodicity of (X X, B B, , T T ),
fe is constant almost everywhere, hence f must be constant almost
everywhere, also.
To show the claim, we write
fe(T x1 , T x2 ) = f (T x1 )f (T x2 ) = f (x1 ) f (x2 ) = f (x1 )f (x2 ) = fe(x1 , x2 ),
where is the eigenvalue of UT corresponding to f . Since UT is an
isometry, || = 1 necessarily. This completes the proof. 

The proof of the implication that (5) implies weak mixing requires
some knowledge of operator theory and it is not examinable. Two
possible approaches are outlined in the second example sheet.

9. Multiple recurrence for weak mixing systems


Our next goal is to prove the following result.
Theorem 57. Let (X, B, , T ) be a weak mixing MPS. Then for any
k Z>0 , and f1 , . . . , fk L (X), we have
N 1 Z Z Z
1 X n 2n kn
(7) U f1 UT f2 UT fk 2 f1 d f2 d fk d.
N n=0 T in L
30 PETER VARJU

Corollary 58. In the setting of the theorem, let f0 L be another


function. Then
N 1 Z Z Z
1 X n 2n kn
(8) lim f0 UT f1 UT f2 UT fk d = f0 d fk d.
N N
n=0
In particular, taking f0 = f1 = . . . = fk = A for a set A B we
obtain
N 1
1 X
lim (A T n A . . . T kn A) = (A)k+1 .
N N
n=0

This proves Furstenbergs multiple recurrence theorem for weak mix-


ing systems.
Observe that (8) follows immediately from (7). Indeed, (7) claims
that a certain sequence of functions converge in norm, while (8) claims
that the inner products with any fixed function f0 converge.
The following lemma is very useful in showing that the averages of
a sequence of vectors converge to 0 in a Hilbert space.
R In the proof of
Theorem 57 we will use it in the special case when fk d = 0.
Lemma 59 (van der Corput). Let u1 , u2 , . . . be a bounded sequence of
vectors in a Hilbert space. Write
N
1 X
sh = lim sup hun , un+h i .

N N n=1

Suppose that D-lim sh = 0. Then


N
1
X

lim un = 0.
N N

n=1

Note that D-lim sh = 0 is equivalent C-lim sh = 0, since sh 0 for


all h. The following would be a natural approach to prove the lemma.
We can write
1 X N N N
2 1 X X
un = 2 hun , un2 i

N n=1 N n =1 n =1 1

1 2

N N 1 N h
1 X
2
X X 
(9) = 2 kun k + 2 Rehun , un+h i .
N n=1 h=1 n=1
PN h
Unfortunately, the expression n=1 Rehun , un+h i can be estimated by
sh only for h fixed and N . Observe how we overcome this problem
in the following proof.
Proof. We fix > 0 and let H be sufficiently large so that
H1
1 X
sh < .
H h=0
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 31

If N is sufficiently large (depending on H, and sup kun k) then


1 X N N H H N +H
1 XX 1 X 1 X
un un+h kun k + kun k < .

N n=1 N H n=1 h=1 N n=1 N n=N +1

To see this, note that the contribution of un for H < n N is the


same to both averages.
The above inequality allows us to turn our attention to the double
average. We first use the triangle inequality:
1 X N X H N H
1 X 1 X

un+h un+h .

N H n=1 h=1 N n=1 H h=1

We use now the Cauchy-Schwartz inequality to turn the above into an


average of norm squares that can be expanded similarly to (9). Note
that this allows us to avoid inner products of vectors whose indices
differ by more than H.
1 X N X H 2 1 X N H
1 X
2
un+h un+h

N H n=1 h=1 N n=1 H h=1

N H H
1 X 1 XX
= hun+h1 , un+h2 i
N n=1 H 2 h =1 h =1
1 2

H H N
1 X X 1 X
hun+h1 , un+h2 i .

H2 N n=1

h1 =1 h2 =1

Keeping H fixed, we let N :


1 X N X H H H
2 1 XX
lim sup un+h 2 s|h h |

N N H n=1 h=1 H h =1 h =1 1 2
1 2

H
1 X
2Hsh < 2.
H2 h=1

We combine this with our first inequality and obtain


N
1 X
lim sup un + 2.

N N n=1
Letting 0, we can complete the proof. 
We recall tho auxiliary results from the second example sheet that
we need for the proof of the theorem.
Lemma 60. Let (X, B, , T ) be a weak mixing MPS. Then for any
f, g L2 (X) we have
Z Z
n
D-limhUT f, gi = f d gd.
32 PETER VARJU

Note that when f and g are characteristic functions of sets, then the
displayed equation is precisely the definition of weak mixing.

Lemma 61. If the MPS (X, B, , T ) is weak mixing than so is (X, B, , T k )


for all k Z>0 .

Proof of Theorem 57. The proof is by induction on k. The claim for


k = 1 is the mean ergodic theorem. We assume that k > 1 and that
the theorem and the corollary holds for k
R 1.
We first consider the special case when fk d = 0. We set

un = UTn f1 UT2n f2 UTkn fk

and prove k(1/N ) N


P
n=1 un k2 0 using the van der Corput lemma.
We compute
Z
hun , un+h i = UTn f1 UT2n f2 UTkn fk
2(n+h) k(n+h)
UTn+h f1 UT f2 UT fk d
Z
= UTn [(f1 UTh f1 ) (UTn f2 UTn+2h f2 )
(k1)n (k2)n+kh
(UT fk U T fk )]d
Z
(k1)n
= (f1 UTh f1 ) UTn (f2 UT2h f2 ) UT (fk UTkh fk )d.

The last equation used the measure preserving property.


We use now the induction hypothesis, more precisely (8) for k 1
and fi UTih fi in place of fi1 :

N Z Z Z
1 X
lim hun , un+h i = f1 UT f1 d f2 UT f2 d fk UTkh fk d.
h 2h
n N
n=1

We note that
Z
fi UT fi d kfi k2
ih

for all 1 i k. We apply this for 1 i k 1 and write


Z
2 2 kh
sh kf1 k kfk1 k fk UT fk d .

We apply now Lemma 60 for the system (X, B, , T k ), which is weak


mixing by Lemma 61. We conclude that D-lim sh = 0. Hence R van der
Corputs lemma applies and implies (7) in the special case fk d = 0.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 33

fk d and fk0 := fk a. We can


R
In the general case we write a :=
write thus
N 1
1 X n
UT f1 UT2n f2 UTkn fk
N n=0
N 1
1 X n
= UT f1 UT2n f2 UTkn (fk0 + a)
N n=0
N 1
1 X n
= U f1 UT2n f2 UTkn fk0
N n=0 T
N 1
1 X n (k1)n
+a U f1 UT2n f2 UT fk1 .
N n=0 T

The first sum converges to 0 by the special case, the second converges
to Z Z Z Z
a f1 d fk1 d = f1 d fk d

by the induction hypothesis. This proves the theorem in the general


case. 

10. Cutting and stacking


Our next goal is to exhibit an example of a measure preserving sys-
tem that is weak mixing but not mixing. We do this using a construc-
tion called cutting and stacking. We begin with a simple example;
our only purpose is to explain how the construction works.
We build a system on the space X = [0, 1], using the Borel -algebra
B, and the Lebesgue measure = m. The transformation T is defined
using the following procedure. For each n Z0 , we specify a suitable
collection of intervals
(n) (n) (n) (n) (n) (n)
I1 = [a1 , b1 ), . . . , I2n = [a2n , b2n )
such that each of them are of length 2n and
(n)
2(n)
[0, 1) = I1 . . . I n .

Once these intervals are given, we define T such that the following
holds for each n. For each j = 1, . . . , 2n 1, the map T restricted to
(n) (n)
the interval Ij is a translation onto the interval Ij+1 . More formally,
we put
(n) (n) (n)
T x = x + aj+1 aj for all x Ij for all 1 j < 2n .
This rule gives a partial definition of T for each n, and the intervals
has to be constructed in a suitable way so that these partial definitions
34 PETER VARJU

are compatible with each other. Furthermore, for the points

\ (n)
x I2n {1},
n

the rule does not define T x. However, this is a set of measure 0, hence
T can be extended arbitrarily.
(n)
Before explaining the construction of the intervals Ij , we visualize
the construction by the following figure. In this picture, each interval
is represented by a vertical line, and the map T moves each point to
the one directly above on the next level of the stack.

(n)
The intervals Ij are constructed by iterating the following proce-
(0)
dure. We start by taking I1 = [0, 1). Then we construct the intervals
(n+1) (n) (n)
Ij from the intervals Ij as follows. We cut each interval Ij into
(n+1)
two equal subinterval, and the one on the left will be defined Ij and
(n+1)
the one on the right will be defined Ij+2n . More formally, we put

(n+1)  (n) a(n) (n)


j + bj  (n+1) a(n)
j + bj
(n)
(n) 
Ij = aj , , and Ij+2n = , bj .
2 2

This construction is depicted in the next picture. We take the


stack of the intervals at the nth iteration, cut the whole stack ver-
tically into two and put the right half of the stack on top of the left
half.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 35

It is intuitively clear that the partial definition of T given by the


nth stack of intervals is compatible with the n + 1th stack. Indeed,
if a point is directly above another one, then it will be directly above
the same point in the next iteration, also.
We prove this now more formally.
(n)
Lemma 62. If x Ij for some 1 j < 2n for some n, then there is
(n+1)
1 j 0 < 2n+1 such that x Ij 0 and
(n+1) (n+1) (n) (n)
aj 0 +1 aj 0 = aj+1 aj .
In particular, the map T is well-defined.
(n) (n) (n)
Proof. Let x Ij and suppose first that x < (aj + bj )/2. Then
(n+1)
x Ij , so j 0 = j works. Furthermore,

(n+1) (n+1)  (n) a(n) (n)


j + bj  (n+1) (n+1)  (n) a(n) (n)
j+1 + bj+1 
[aj , bj ) = aj , , [aj+1 , bj+1 ) = aj+1 , .
2 2
Thus
(n+1) (n+1) (n) (n)
aj+1 aj = aj+1 aj ,
as required.
(n) (n)
The other case, when x (aj + bj )/2, is very similar. 
Lemma 63. The system (X, B, m, T ) is measure preserving.
(n)
This is also expected to hold, because T restricted to any Ij with
j < 2n is a translation and translations preserve the Lebesgue measure.
Proof. Let A B and fix some n Z0 . We can write
2 n
X (n)
1
m(T A) = m(T 1 (A) Ij ).
j=1
(n) (n)
Since T : Ij Ij+1 is a measure preserving bijection for 1 j < 2n ,
we have
36 PETER VARJU

n 1
2X
(n) (n)
m(T 1 A) = m(T 1 (A Ij+1 )) + m(T 1 (A) I2n )
j=1
n 1
2X
(n) (n)
= m(A Ij+1 ) + m(T 1 (A) I2n )
j=1
(n) (n)
=m(A) m(A I1 ) + m(T 1 (A) I2n ).

This proves that

|m(T 1 A) m(A)| 2 2n .

Taking n , we conclude m(T 1 A) = m(A), which proves that the


system is indeed measure preserving. 

11. The Chacon map


We discuss a similar cutting and stacking construction, called the
Chacon map. This is an example of a weak mixing but not mixing
system.
Similarly as before, we take X = [0, 1], let B be the Borel -algebra
and let = m be the Lebesgue measure. We will define a collection of
disjoint intervals

(n) (n) (n)


Ij = [aj , bj ), j = 1, . . . , N (n)

for each n. However, this time they will not cover the whole interval
[0, 1). The common length of the intervals is 2/3n+1 .
(0)
The intervals are constructed as follows. We take I1 = [0, 2/3).
(n+1) (n)
Then the intervals Ij are constructed using the intervals Ij as
(n)
follows. We let N (n + 1) = 3N (n) + 1. We cut up the interval Ij into
three equal subintervals, and
(n+1)
Ij will be defined the left-most subinterval,
(n+1)
Ij+N (n) will be defined the middle subinterval,
(n+1)
Ij+2N (n)+1 will be defined the right-most subinterval.
(n)
This does not define I2N (n)+1 , yet. We cut off an interval of length
S (n) (n)
2/3n+2 from [0, 1]\ j Ij and I2N (n)+1 is defined to be this interval.
We visualize this construction by the following figure.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 37

(n)
The map T is again defined in such a way that it maps Ij for any
(n)
1 j < 2n onto Ij+1 via a translation. The facts that this construction
yields a well-defined map, and it is measure preserving can be proved
very similarly to the corresponding results in the previous example and
we omit the details.
We will need the following lemma in both the proof that the Chacon
map is not mixing and in the proof that it is weak mixing.

Lemma 64. For all n and 1 j N (n), we have

(n) (n) 1 (n)


m(T N (n) (Ij ) Ij ) m(Ij ) and
3
(n) (n) 1 (n)
m(T N (n)+1 (Ij ) Ij ) m(Ij ).
3
Proof. The proof is illustrated by the figure below. We first note that
(n) (n+1) (n+1) (n+1) (n+1)
Ij = Ij Ij+N (n) Ij+2N (n)+1 . We also note that T N (n) (Ij )=
(n+1) (n+1) (n+1)
Ij+N (n) and T N (n)+1 (Ij+N (n) ) = Ij+2N (n)+1 . This shows that

(n) (n) (n+1) 1 (n)


m(T N (n) (Ij ) Ij ) m(Ij+N (n) ) = m(Ij ) and
3
(n) (n) (n+1) 1 (n)
m(T N (n)+1 (Ij ) Ij ) m(Ij+2N (n)+1 ) = m(Ij ),
3
as required. 
38 PETER VARJU

Theorem 65. The Chacon map is not mixing.


Proof. We put A = [0, 2/9) and note that for each n 1, there is a set
S (n)
Jn {1, . . . , N (n)} such that A = jJn Ij . This can be proved by
(n)
induction. Applying the lemma for each Ij with j Jn we get
X (n) (n)
m(A T N (n) (A)) =m(T N (n) A A) m(T N (n) Ij Ij )
jJn
1X (n) 1
m(Ij ) = m(A).
3 jJ 3
n

This gives lim inf (A T N (n) (A)) m(A)/3 > m(A)2 showing that
the system is not mixing. (Note that m(A) = 2/9.) 
Theorem 66. The Chacon map is weak mixing.
Proof. Let f L2 (X) be an eigenfunction of UT , that is UT f = f
for some complex with || = 1. We show below that f is constant
almost everywhere.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 39

We fix 0 < < 1/12, which is also sufficiently small so that


m({x X : |f (x)| > 2}) > 10.
If we replace f by a sufficiently large constant multiple, then such an
can be found.
By Luzins theorem, there is a continuous function h such that
m({x X : h(x) 6= f (x)}) < .
Then h is uniformly continuous, hence there is n Z0 such that for
any x1 , x2 X with |x1 x2 | < 2/3n+1 we have |h(x1 ) h(x2 )| < /10.
(n)
Thus |h(x1 ) h(x2 )| < /10 holds in particular when x1 , x2 Ij for
some 1 j N (n).
(n)
Since the intervals Ij cover more than half of X, there is 1 j
N (n) such that
(n) (n)
m({x Ij : f (x) 6= h(x)}) < 2m(Ij ).
(n)
We select an arbitrary x0 Ij and put z = h(x0 ). Then
(n) (n)
m({x Ij : |f (x) z| > }) < 2m(Ij ).
(n) (n)
We note that T ij (Ij ) = Ii for every 1 i N (n). Using
UTij f = ij f and the measure preserving property we can write
(n)
m({x Ii : |f (x) ij z| > })
(n) (n)
= m({y Ij : |f (T ij y) ij z| > }) < 2m(Ij ).
Since < 1, this shows in particular that
N (n)  N (n)
X (n)
[ (n) 
m({x X : |f (x)| > |z| + 1}) < 2m(Ij ) + m X\ Ij
j=1 j=1
1
< 2 + .
2 3n
If n is sufficiently large (which we may assume) then 1/(2 3n ) < 8
and hence 2 < |z| + 1, hence |z| > 1.
Now we use Lemma 64 and obtain that
(n)
m({x Ij : |f (x) N (n) z| < })
(n) (n)
m({y Ij : T N (n) (y) Ij and |f (T N (n) y) N (n) z| < })
(n) (n)
=m(Ij T N (n) Ij )
(n) (n)
m(y Ij T N (n) Ij : |N (n) f (y) N (n) z| > )
(n) (n) (n)
m(T N (n) Ij Ij ) m(y Ij : |f (y) z| > )
1 (n) (n)
m(Ij ) 2m(Ij ).
3
40 PETER VARJU

(n) (n) (n)


Since < 1/12, (1/3)m(Ij ) 2m(Ij ) > 2m(Ij ), hence there is
(n)
x Ij such that both |f (x) z| < and |f (x) N (n) z| < hold.
Thus |z N (n) z| < 2 and |1 N (n) | < 2 as |z| > 1.
By a similar argument using the second part of the lemma we can
get |1 N (n)+1 | < 2, hence |N (n) N (n)+1 | < 4 and |1 | < 4.
Taking 0 we conclude that = 1.
Summarizing the above argument we can prove the following. For
every > 0 there is n0 such that for all integer n > n0 there is a
complex number z,n such that for all 1 i N (n) we have
(n) (n)
m({x Ii : |f (x) z,n | < }) > (1 2)m(Ii ).
Summing this up for 1 i N (n), we obtain
N (n)
X (n)
 1 
m({x X : |f (x)z,n | < }) > (12)m(Ii ) = (12) 1 n
.
i=1
2 3
Taking 0 and n we conclude that f must be constant almost
everywhere. 

12. Entropy
Entropy is a number attached to measure preserving systems that
quantifies the following related intuitive properties.
What is the information content of a typical orbit x, T x, T 2 x, . . .?
How chaotic is a typical orbit x, T x, T 2 x, . . .?
Knowing x, T x, . . . , T n x, to what extent can we predict T n+1 x?
The original motivation for the introduction of entropy to ergodic
theory was the following problem.
Definition 67. Let (p1 , . . . , pn ) be a probability vector, i.e. pi 0 and
p1 + . . . + pn = 1. The (p1 , . . . , pn )-Bernoulli shift is the measure pre-
serving system ({1, . . . , n}Z , B, , ), where B is the Borel -algebra,
is the infinite-fold product of the measure p1 1 +. . .+pn n on {1, . . . , n}
and is the shift (x)n = xn+1 .
Problem 68. Are the (1/2, 1/2) and (1/3, 1/3, 1/3) Bernoulli shifts
isomorphic?
This problem may not seem very deep at first sight, but it was open
for a decade or so until it was solved by Kolmogorov, and the following
facts suggest that it is subtle.
The two systems have the same mixing properties. (They are
both mixing on k sets for all k Z>0 .)
They are also spectrally isomorphic. That is to say, there is a
unitary transformation U : L2 ({1, 2}Z ) L2 ({1, 2, 3}Z ) such
that
U U2 = U3 U,
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 41

where i is the shift map on {1, . . . , i}Z .


Meshalkin proved that the (1/4, 1/4, 1/4, 1/4) and the
(1/2, 1/8, 1/8, 1/8, 1/8) Bernoulli shifts are isomorphic.
For more details on these facts, see the third example sheet.
In the next lectures we develop entropy theory and prove that the
(1/2, 1/2) and (1/3, 1/3, 1/3) Bernoulli shifts are not isomorphic.
12.1. Jensens inequality. We recall an inequality that will be used
frequently to derive elementary properties of entropy.
Definition 69. We say that a continuous function f : [a, b] R{}
is convex if there is a number x for each a < x < b such that
f (y) f (x) + x (y x)
for all y (a, b). The function is called strictly convex, if the inequality
is strict for all y 6= x.
We note that if f C 2 and f 00 (x) > 0 for all x (a, b), then f is
strictly convex, as can be seen easily from the Taylor expansion of f .
Theorem 70 (Jensens inequality). Let f : [a, b] R {} be a
convex function. Let x1 , x2 , . . . , xn [a, b] and let p1 , . . . , pn be a prob-
ability vector. Then
(10) f (p1 x1 + . . . + pn xn ) p1 f (x1 ) + . . . + pn f (xn ).
If the function is strictly convex, then equality occurs in (10) if and
only if the numbers xi coincide for all i such that pi > 0.
12.2. Entropy of partitions. Let (X, B, ) be a probability space. A
finite measurable partition is a finite collection of pairwise disjoint
measurable sets, whose union cover the whole space X. We call the
sets in this collection the atoms of the partition. If , B are two
finite partitions, then their join or coarsest common refinement is
the finite partition
= {A B : A , B }.
We define the function H on the set of probability vectors (of varying
length) by
H(p1 , . . . , pk ) = p1 log p1 . . . pk log pk ,
where we use the convention p log p = 0 if p = 0.
We define the entropy of a partition = (A1 , . . . , Ak ) as
H () = H((A1 ), . . . , (Ak )).
We define the conditional entropy of relative to another partition
= {B1 , . . . , Bl } as
l  (A B )
X 1 j (Ak Bj ) 
H (|) = (Bj )H ,..., .
j=1
Bj (Bj )
42 PETER VARJU

One should think about conditional entropy as follows. The parti-


tion subdivides the whole space as the union of probability spaces.
Indeed, we can endow each atom Bj with a probability measure given
by (C)/(Bj ) for each measurable set C Bj . The partition in-
duces a partition on each of these spaces: {A1 Bj , . . . , Ak Bj }. Now
the conditional entropy is the average of the entropies of these parti-
tions on each of the probability spaces weighted by the measure of the
corresponding atom of Bj of .
The following intuition for the meaning of entropy is useful. One
may think about the space X as the possible states of a physical system
and about the partition as an experiment; the atoms are the possible
outcomes. Then we may think of entropy as the amount of uncertainty
in the experiment, i.e. how difficult it is to predict its outcome. Or
equivalently, we may think of entropy as the amount of information we
gain when we learn the outcome of the experiment.
With this interpretation, the two partitions are two experiments, and
the conditional entropy measures the amount of new information that
we gain when we learn the outcome of the second experiment if we
already new the outcome of the first one.
The choice of the function
(11) H(p1 , . . . , pk ) = p1 log p1 . . . pk log pk
may seem arbitrary at first, but the following properties are reassuring.
Lemma 71. The following holds.
(1) H () 0 for any partition .
(2) The maximum of H () over all partitions with k atoms is ob-
tained when each atoms have the same probability 1/k.
Interpretation: Uncertainity is maximal, when each outcome is
equally likely.
(3) H (A1 , . . . , Ak ) = H (A1 , . . . , Ak ) for any permutation .
(4) H ( ) = H () + H (|). Interpretation: The amount of
information contained together in the two experiments and
is the same as the amount of information contained in plus
the amount of new information contained in when we learn
the outcome with the prior knowledge of the outcome of .
We note that Khinchin proved that the function H given in (11) is
unique upto scalar multiples such that the resulting notion of entropy
enjoys the properties listed in the lemma.
Proof. (1): This holds because p log p 0 for all p [0, 1].
(2): This follows from Jensens inequality applied to the strictly
convex function x 7 x log x with pi = 1/k and xi = (Ai ). Indeed, we
note
X X1 1
p i xi = (Ai ) = ,
k k
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 43

hence the inequality yields


k
1 X 1
log k (Ai ) log((Ai )),
k i=1
k

which in turn gives


log k H().
Analysing the equality case in Jensens inequality, we find that it occurs
if and only if (A1 ) = . . . = (Ak ) = 1/k.
(3): Immediate from the definition. 

We will prove item (4) below in greater generality

12.3. The information function. We introduce a notion, which will


be a very useful gadget in our calculations with entropy. Let (X, B, )
be a probability space and let B be a finite partition. The infor-
mation function of is the function X 7 R {}
I ()(x) = log ([x] ),
where [x] denotes the atom of that contains x. If B is another
finite partition, then the conditional information function of
relative to is defined as
([x] )
I (|)(x) = log .
([x] )
The link with entropy is given in the next lemma.
Lemma 72. In the above setting, we have
Z
H () = I ()d,
Z
H (|) = I (|)d.

Proof. The first claim is a special case of the second one when we
consider = {X}, therefore, we only consider the second claim.
If A and B , then I (|) is constant on A B and its value
is log((A B)/(B)), which is immediate from the definition. Thus
(A B)
Z X
I (|)d = (A B) log
A,B
(B)
X X (A B) (A B)
= (B) log ,
B A
(B) (B)

as required. 
44 PETER VARJU

Lemma 73 (Chain rule). Let , , B be finite partitions in a prob-


ability space (X, B, ). Then
I ( |) =I (|) + I (| )
H ( |) =H (|) + H (| )
Note that this lemma generalizes item (4) from the first lemma of
the section.
Proof. The second line follows from the first one upon integration
thanks to the previous lemma.
Unwinding the definitions, we can write
([x] )
I (|)(x) = log
([x] )
([x] )
I (| )(x) = log
([x] )
([x] )
I ( |)(x) = log .
([x] )
It is clear that the sum of the first two lines yields the third line, as
required. 
Lemma 74. Let , , B be finite partitions in a probability space
(X, B, ). Then
H (| ) H (|).
Interpretation: The new information content in the experiment
after we know the outcomes of and is at most as much as its new
information content if we know the outcome of alone.
Proof. We note the definitions
P (ABC)
(12) H (| ) = A,B,C (A B C) log (BC)
P (AC)
(13) H (|) = A,C (A C) log (C) .

Therefore, it is enough to show that


X (A B C) (A C)
(A B C) log (A C) log
B
(B C) (C)

for each A and C .


In order to show this, we fix A , C and apply Jensens
inequality for the function x 7 x log x at points (AB C)/(B C)
with weights (B C)/(C) as B runs through the atoms of . We
note that
X (B C) (A B C) (A C)
= .
B
(C) (B C) (C)
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 45

Hence Jensens inequality yields


X (B C) (A B C) (A B C) (A C) (A C)
log log ,
B
(C) (B C) (B C) (C) (C)
which in turn yields
X (A B C) (A C)
(A B C) log (A C) log ,
B
(B C) (C)
as required. 
The above inequality together with the chain rule can be used to
prove a range of useful inequalities for entropies of partitions. For
instance, we note the following.
Corollary 75. Let , B be finite partitions in a probability space
(X, B, ). Then
H () H ( ) H () + H ().
Proof. By the chain rule, we have
H ( ) = H () + H (|).
Since p log p 0 for all p [0, 1], H (|) 0 similarly as we saw in
item (1) of the first lemma of the section. This proves the inequality
on the left.
By the previous lemma (applied with = {X}), we have H (|)
H (), which proves the inequality on the right. 
12.4. Entropy of measure preserving systems. In what follows
(X, B, , T ) is a measure preserving system and , B are finite
partitions.
Notation. We write T 1 for the partition, whose atoms are T 1 ([x] ).
Lemma 76 (Invariance). In the above setting, we have
I (T 1 |T 1 )(x) =I (|)(T x),
H (T 1 |T 1 ) =H (|).
Proof. By the definition, we have
([x]T 1 T 1 )
I (T 1 |T 1 )(x) = log .
([x]T 1 )
We note that
T 1 T 1 = T 1 ( ),
and
[x]T 1 () = T 1 ([T x] ).
Indeed, the set on the right hand side is an atom of T 1 ( ) and it
contains x.
46 PETER VARJU

By the measure preserving property, we have


([x]T 1 () ) = ([T x] ),
and similarly ([x]T 1 () ) = ([T x] ), which proves the first claim.
The second claim follows from the first one by integration and the
measure preserving property. 
Before we can give the definition of entropy of a measure preserving
system, we need two more auxiliary results.
Lemma 77. Let a1 , a2 , . . . be a subadditive sequence, i.e. an + am
an+m for all n, m. Then
an an
lim = inf ,
n n n
in particular the limit exists.
Proof. Fix n0 Z>0 . For each n Z>0 , write n = i(n)n0 + j(n), where
0 j(n) < n0 . By (repeated use of) subadditivity,
an i(n)an0 + aj(n) ,
hence
an i(n)an0 + aj(n)
lim sup lim sup
n n n n
i(n) aj(n)
= lim sup an0 + lim sup
n n n n
n j(n) an0 an
= lim sup = 0.
n n n0 n0
Since n0 is arbitrary, we can write
an an an an
inf lim sup lim inf inf ,
n n n n n n
which proves the claim. 
Notation. We write
n
m = T m T (m+1) . . . T n
for any integers 0 m n. When the system is invertible, we allow
m or n to be negative.
Lemma 78. In the above setting, we have
H (0n1 ) + H (0m1 ) H (0 )n+m1 ,
i.e. the sequence n 7 H (0n1 ) is subadditive.
Proof. We note that 0n+m1 = 0n1 nn+m1 and T n 0m1 = nn+m1 .
By the above corollary and invariance, we can write
H (0n+m1 ) H (0n1 ) + H (nn+m1 ) = H (0n1 ) + H (0m1 ),
as required. 
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 47

Definition 79. Let (X, B, , T ) be a measure preserving system, and


let B be a finite partition. The entropy of the system with
respect to the partition is
H (0n1 ) H (0n1 )
h (T, ) = lim = inf .
n n n
The entropy of the system is
h (T ) = sup h (T, ).
B is a finite partition

The following intuition is useful to bear in mind. If we think about


the measure preserving system as a physical system, the transformation
T is the progression of time and is in experiment, then the partition
0n1 corresponds to the experiment performed repeatedly at time
0, 1, . . . , n 1. Therefore, the entropy of the system with respect to
is the amount of information gained per one experiment when we
perform it repeatedly at regular time intervals. The entropy of the
system is the maximal amount of information we can gain per one
experiment optimized over all possible experiments we can perform.
12.5. The Kolmogorov-Sinai theorem and the entropy of Bernoulli
shifts. One of the difficulties in calculating the entropy of a measure
preserving system is that the definition involves a supremum over all
finite partitions. The Kolmogorov-Sinai theorem addresses this issue.
Definition 80. Let (X, B, , T ) be an invertible measure preserving
system. For a finite partition B, we denote by B() the -algebra
generated by the atoms of . A finite partition B is called a two-
sided generator, if for every A B and for every > 0, there are a
number n Z>0 and a set A0 B(n n
) such that (A4A0 ) < .
Example 81. Let (X = {1, . . . , k}Z , B, , ) be the (p1 , . . . , pk )- Bernoulli
shift. Then the partition
= {{x X : x0 = i} : i = 1, . . . , k}
is a two-sided generator.
Indeed, it is easy to check that the collection of sets
{A B : for every > 0, there are n and A0 B(n
n
) with (A4A0 ) < }
is a -algebra and it contains all cylinder sets, hence it equals B.
Theorem 82 (Kolmogorov-Sinai). Let (X, B, , T ) be an invertible
measure preserving system and let B be a finite partition that
is a two-sided generator. Then
h (T ) = h (T, ).
There is a version of this theorem with a similar proof for non-
invertible measure preserving systems, where a two-sided generator is
replaced by the analogous notion of a one-sided generator. However,
we will not need that result, therefore we omit the details.
48 PETER VARJU

Example 83. Let (X = {1, . . . , k}Z , B, , ) be the (p1 , . . . , pk )- Bernoulli


shift, and let be the partition we considered in the previous example.
We prove that h (T, ) = H(p1 , . . . , pk ), which implies h (T ) =
H(p1 , . . . , pk ) by the Kolmogorov-Sinai theorem.
We first show that
(14) H (|1n ) = H(p1 , . . . , pk )
for all n Z>0 . To this end we calculate the information function
([x]0n )
I (|1n )(x) = log .
([x]1n )
We note that
[x]0n = {y : y0 = x0 , . . . , yn = xn },
which has measure px0 pxn by the definition of the product measure.
Similarly,
([x]1n ) = px1 pxn .
Thus
I (|1n )(x) = log px0 ,
and (14) follows by integration.
Using the chain rule, invariance and then (14), we obtain
H (0n1 ) =H (n1
n1 n2 n1
) + H (n2 |n1 ) + . . . + H (00 |1n1 )
=H () + H (|11 ) + . . . + H (|1n1 )
=nH(p1 , . . . , pk ).
We divide both sides by n and take the limit to conclude
h (T, ) = H(p1 , . . . , pk ),
as required.
Since isomorphic systems have the same entropy, this shows that the
(1/2, 1/2) and (1/3, 1/3, 1/3) Bernoulli shifts are not isomorphic. We
add that a deep theorem of Ornstein shows that two Bernoulli shifts
are isomorphic if and only if they have the same entropy.
The proof of the Kolmogorov-Sinai theorem relies on the following
three Lemmata.
Lemma 84. Let (X, B, , T ) be a measure preserving system and let
B be a finite measurable partition. Then
n
h (T, ) = h (T, n )
for all n Z>0 .
Lemma 85. Let (X, B, , T ) be a measure preserving system and let
, B be finite measurable partitions. Then
h (T, ) h (T, ) + H (|).
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 49

Lemma 86. For every > 0 and for every k Z>0 , there is > 0
such that the following holds. Let (X, B, ) be a probability space, and
let , B be two finite partitions. Suppose that has k atoms and
for each A there is A0 B() such that (A4B) .
Then H (|) .
Proof of Lemma 84. We can write
H (0m1 )
h (T, ) = lim
m m
and
n
n
H (n ) H (T 1 n
n
) . . . H (T (m1) n
n
)
h (T, n ) = lim
m m
m+n1
H (n )
= lim
m m
H (0m+2n1 )
= lim
m m
H (0m+2n1 )
= lim .
m m + 2n 1

Hence the entropies with respect to the two partitions equal, indeed.

Proof of Lemma 85. We can write
H (0m1 ) H (0m1 0m1 )
=H (0m1 ) + H (0m1 |0m1 )
m1
X
=H (0m1 ) + H (jj |0m1 j+1
m1
)
j=0
m1
X
H (0m1 ) + H (jj |jj )
j=0

=H (0m1 ) + m H (|).
Here, we used the Corollary of Lemma 74, then the chain rule twice,
then Lemma 74, and finally invariance.
We divide this inequality by m and take the limit to obtain the
claim. 
Proof of Lemma 86. Let = {A1 , . . . , Ak }, and for each 1 i k, let
n
Bi B(n ) be such that (Ai 4Bi ) < . We consider the partition
that has the following k + 1 atoms:
[ [ 
C0 := Ai Bi \ Bj
j6=i

and Ci := Ai \C0 .
50 PETER VARJU

The set C0 has the following two key properties. First, it has large
measure, i.e.
k
X
(C0 ) ((Ai ) k) = 1 k 2 .
i=1

Second, for x C0 , we have x Bi if and only if x Ai .


We note that
k
X
H () = (C0 ) log (C0 ) (Ci ) log (Ci )
i=1
k Pk
X
i=1 (Ci )
(C0 ) log (C0 ) (Ci ) log .
i=1
k

Indeed, this follows from Jensens inequality applied for the function
x 7 x log x at the points (Ci ) for i = 1, . . . , k with weights 1/k
similarly to the proof of (1) in Lemma 71. If is sufficiently small,
then (C0 ) is arbitrarily close to 1 and ki=1 (Ci ) is arbitrarily close
P
to 0. Hence H () < , provided is small enough.
We prove that H (| ) = 0, and hence
H (|) H ( |) = H (|) + H (| ) < ,
as required.
To that end, we prove I (| )(x) = 0 and consider two cases.
If x C0 , then [x] C0 , because C0 B( ). Moreover, there
is a unique 1 i k such that x Ai , hence x Bi , because
C0 Bi = C0 Ai . Then [x] Bi , because Bi B( ). Therefore
[x] Ai = [x]
and I (| )(x) = 0 in this case.
The second case is when x Ci for some i > 1. Then [x] Ci ,
because Ci B( ). Since Ci Ai , we have
[x] Ai = [x] .
Again, we find that I (| )(x) = 0 in this case. This completes the
proof. 

Proof of the Kolmogorov Sinai theorem. We need to show that h (T |)


h (T |) for any finite partition B.
n
Fix > 0. Since is a two-sided generator, we have H (|n )<
for n large enough by Lemma 86. By Lemma 85, we have
n n
h (T, ) h (T, n ) + H (|) h (T, n ) + .
n
By Lemma 84, h (T, n ) = h (T, ), hence h (T |) h (T |) + .
This is sufficient, because is arbitrarily small. 
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 51

13. The Shannon McMillan Breiman theorem


Theorem 87. Let (X, B, , T ) be an ergodic measure preserving system
and let B be a finite partition. Then
1
I (0N 1 ) h (T, ) as N
N
pointwise -almost everywhere and in L1 (X, ).
Recall the definition of the information function
I (0N 1 )(x) = log([x]N 1 ).
0

Therefore, the theorem states that the atom of a typical point in the
partition 0N 1 is approximately exp(N h (T, )).
The Shannon McMillan Breiman theorem is also useful in calculating
the entropy of measure preserving systems. See the last example sheet
for some examples.
By the definition of entropy, we know that H (0N 1 ) is approxi-
mately N h (T, ) for N sufficiently large. In principle, it could hap-
pen, that in the partition 0N 1 the atoms have all kind of different sizes,
with a weighted logarithmic average of N h (T, ). However, the next
corollary of the Shannon McMillan Breiman theorem shows that this
does not happen in an ergodic measure preserving system; most atoms
in the sense of measure have size approximately exp(N h (T, )).
Corollary 88 (Asymptotic equipartition property). Let (X, B, , T ) be
an ergodic measure preserving system and let B be a finite partition.
Then for any > 0, there is N Z>0 such that the following holds for
each n N . There is a subset A B such that (A) 1 and
(15) exp(n(h (T, ) + )) ([x]0n1 ) exp(n(h (T, ) ))
for each x A.
In our usual interpretation, the corollary has the following intuitive
meaning. For n sufficiently large, there is a set of typical sequences of
outcomes for the experiment performed at times 0, . . . , n 1. The
number of these typical sequences is approximately exp(nh (T, )) and
each is approximately equally likely. Here approximation is understood
in the loose sense of (15).
Proof of the corollary. Fix > 0 and define A(n, ) as the set of points
x such that (15) holds. By Shannon-McMillan-Breiman, for almost
every x X, there N Z>0 such that x A(n, ) for every n N .
That is to say
 [ \ 
A(n, ) = 1.
N Z n=N

Thus (
T
n=N A(n, )) 1 for N sufficiently large, as claimed. 
52 PETER VARJU

13.1. Conditional expectation. The proof of the Shannon McMillan


Breiman theorem is based on the following idea. Using the chain rule
and invariance, we can write
N
X 1 N
X 1
I (0N 1 )(x) = N 1
I (nn |n+1 )(x) = I (|1N 1n )(T n x).
n=0 n=0
The sum on the right hand side looks like an ergodic sum and it would
be tempting to apply the pointwise ergodic theorem to conclude the
proof. However, in this sum, at each point a different function is eval-
uated and the ergodic theorem applies only if we evaluate the same
function. The idea to overcome this issue is to generalize the notion of
conditional information functions to -algebras in place of partitions
and show that the functions I (|1N 1n ) converges to such a general
conditional information function. Then we can replace the functions in
the sum by this limit and obtain a genuine ergodic sum.
This generalization requires some preparation. We begin by recalling
the definition of conditional expectation and its basic properties.
Theorem 89. Let (X, B, ) be a probability space and let A B be a
-algebra. Then for every f L1 (X, ), there is f L1 (X, ) such
that the following holds.
(1) Rf is A-measurable,
(2) A f d = A f d for all A A.
R

If f1 , f2 L1 (X, ) are two functions that satisfy both items (1) and
(2) in the role of f , then f1 = f2 holds -almost everywhere.
Definition 90. The function f in the above theorem, which is defined
up to a set of -measure 0 is called the conditional expectation of f
with respect to A and is denoted by E(f |A).
We note that
VA = {f L2 (X, ) : f is A measurable}
is a closed subspace of L2 (X, ) and for f L2 (X, ), E(f |A) is the
orthogonal projection of f to VA .
Example 91. Suppose that A is generated by a finite partition .
Then a function is A-measurable if and only if it is constant on the
atoms of , in particular, E(f |A) is constant on the atoms of . Let A
be an atom of . Then for -almost every x A, we have
Z Z
1 1
E(f |A)(x) = (A) E(f |A)d = (A) f d.
A A

The following theorem summarizes the basic properties of conditional


expectation.
Theorem 92. Let (X, B, ) be a probability space, and let A, A1 , A2
B be -algebras, and let f, f1 , f2 L1 (X, ). Then
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 53

(1) E(f1 + f2 |A) = E(f1 |A) + E(f2 |A),


(2) E(f1 f2 |A) = f1 E(f2 |A) if f1 is A-measurable,
(3) E(f |A2 ) = E(E(f |A1 )|A2 ) if A2 A1 .
We will also need the following result.
Theorem 93 (Increasing and decreasing martingale theorems). Let
(X, B, ) be a probability space and let A1 , A2 , . . . B and A B be
-algebras. Suppose that either
S
A1 A2 . . . and A is Tthe -algebra generated by Ai or
A1 A2 . . . and A = Ai .
Let f L1 (X, ).
Then
lim E(f |Ai ) = E(f |A)
n

pointwise -almost everywhere and in L1 (X, ).


We omit the proof, which can be found in standard textbooks. It is
a useful exercise to prove this theorem for L2 -convergence (in the case
f L2 ), however, understanding the proof of this result is not required
in what follows.
As we have seen above, conditional expectation is particularly easy
to compute and work with when the -algebra is generated by a finite
partition. In many situations, a more complicated -algebra A can
be approximated by ones generated by finite partitions in the sense
of the martingale theorem, i.e. there is a sequence of finer and finer
finite partitions whose union generate A. In that case the conditional
expectation with respect to A can be approximated by conditional
expectations with respect to finite partitions thanks to the martingale
theorem.

13.2. Conditional entropy with respect to -algebras.


Definition 94. Let (X, B, ) be a probability space, let B be a
finite partition and let A B be a -algebra. We define
X
I (|A) = A log E(A |A),
A

where A is the characteristic function of the set A. We also set


Z
H (|A) = I (|A)d.

This definition may look arbitrary. To illustrate that this is the right
one, we first verify that it coincide with the original definition, when
A = B() for a finite partition B.
54 PETER VARJU

Example 95. Let B be a finite partition and let A be the -algebra


it generates. Then
X
I (|A)(x) = A (x) log E(A |A)(x)
A
R
[x]
[x] d
= log E([x] |A)(x) = log
([x] )
([x] [x] )
= log = I (|).
([x] )
As a second reassurance, we prove that the conditional information
function with respect to finer and finer finite partitions converge to the
conditional information function with respect to the -algebra gener-
ated by the partitions. This will also be important in the proof of the
Shannon-McMillan-Breiman theorem.
Proposition 96. Let (X, B, ) be a probability space and let 1 , 2 , . . .
B be a sequence of finite partitions such that B(i ) B(i+1 ) for all i.
Let A = B(1 2 ). Let B be another finite partition. Then
I (|A) = lim I (|n )
n
1
in L and pointwise -almost everywhere.
Moreover,
(16) I (x) := sup I (|n )(x) L1 (X, ).
nZ>0

Proof. First we prove the pointwise convergence. This follows easily


form the martingale theorem and the definition of the conditional in-
formation function. Indeed,
I (|A)(x) = log E([x] |A)(x)
= lim log E([x] |B(n ))(x)
n
= lim I (|n )(x)
n
holds almost everywhere.
Next, we prove (16). This together with the pointwise convergence
imply the L1 convergence by the dominated convergence theorem.
Fix R>0 and A . For each x X, we define n(x) as the
smallest n such that
log E(A |B(n ))(x) .
We put n(x) = if there is no such n.
For each n, the set
Bn :={x X : n(x) = n}
={x X : E(A |B(n ))(x) exp() and
E(A |B(j ))(x) > exp() for every j < n}
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 55

is B(n ) measurable. Therefore,


Z Z
(Bn ) exp() E(A |B(n ))(x) = A d = (A Bn ).
Bn Bn

Note that the sets Bn are pairwise disjoint and they cover the set
A := {x A : I (x) } = {x A : inf E(A |B(n )) exp()}.
Thus
X X
(A ) = (A Bn ) exp()(Bn ) exp().
nZ>0 nZ>0

Now
X
(x X : I (x) ) (A ) || exp().
A

Thus
Z X X
I d n(x : n1 I (x) n) || nexp((n1)) < ,
nZ>0 nZ>0

as claimed. 

13.3. Proof of the theorem.


Lemma 97. Let (X, B, , T ) be a measure preserving system and let
B be a finite partition. Then
lim H (|1n ) = h (T, ).
n

Proof. Similarly to the calculations we did when we computed the en-


tropy of Bernoulli shifts, we can write using the chain rule and invari-
ance
H (0n1 ) = H (|1n1 ) + . . . + H (|11 ) + H ().
Therefore
h (T, ) = C-limn H (|1n1 ).
Since n H (|1n1 ) is a monotone non-increasing sequence, it con-
verges and its limit must equal to its Cesaro limit, which was to be
proved. 

Proof of the Shannon-McMillan-Breiman theorem. Recall from the be-


ginning of the section that
N 1
1 1 X
I (0N 1 )(x) = I (|1N 1n )(T n x).
N N n=0
56 PETER VARJU

We can thus write


(17)
N 1
1 N 1 1 X
I (0 )(x) = I (|B(1 ))(T n x)
N N n=0
N 1
1 X
+ (I (|1N n1 )(T n x) I (|B(1 ))(T n x)),
N n=0
where B(1 ) denotes the smallest -algebra that contains 1N for all N .
By the pointwise ergodic theorem, we have
N 1 Z
1 X
I (|B(1 ))(T x) I (|B(1 ))d.
n
N n=0
We prove that I (|B(1 ))d = h (T, ). We have
R
Z Z

I (|B(1 ))d = lim I (|1N )d = lim H (|1N ) = h (T, ).
N N

Therefore, it is enough to show that the second term in (17) converges


to 0. To that end, we write

IK (x) = sup |I (|B(1 ))(x) I (|1k )(x)|.
kK

By the previous proposition, we have IK L1 (X, ) for all K, and
IK 0 in L1 (X, ) and pointwise -almost everywhere. We fix an
integer K and write
1 N 1
X N n1 n n
(I (|1 )(T x)I (|B(1 ))(T x))

N n=0

N K1 N 1
1 X
1 X
IK (T n x) + I0 (T n x).
N n=0
N n=N K

We estimate these two sums individually. First we write


N K1 N Z
1 X n 1 X n
I (T x) I (T x) IK d
N n=0 K N n=0 K
as N . For the second sum, we write
N 1 N 1
1 X 1 X n
I0 (T n x) = I (T x)
N n=N K
N n=0 0
N K1
N K 1 1 X
I (T n x)
N N K 1 n=0 0
Z Z
I0 d I0 d = 0,

as N .
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 57

We obtained that
1 N 1 Z
X N n1 n n
lim sup (I (|1 )(T x) I (|B(1 ))(T x)) IK d.

N N n=0
Since K was arbitrary, and
Z

lim IK d = 0,
K

this proves the theorem. 


14. Mixing and entropy
Earlier in the course we discussed weak mixing as a more convenient
alternative for mixing. Now we discuss a similarly convenient notion
that is stronger than mixing.
Definition 98. A measure preserving system (X, B, , T ) is called K-
mixing (K for Kolmogorov) if the following holds. Let A B and let
B be a finite partition. Then for every > 0, there is N Z>0

such that for every B B(N ), we have
|(A B) (A)(B)| < .

Recall that B(N ) denotes the -algebra generated by the partitions
M
N for M ZN .
We note that we can take = {C, X\C} in the definition for an
arbitrary fixed set C B. This allows us to take B = T n (C) for
n N and obtain
|(A T n C) (A)(C)| < .
This shows that K-mixing indeed implies mixing. Moreover, it also
implies mixing on k-sets for any k Z2 , which is the content of an
exercise on the last example sheet.
Some authors refer to invertible K-mixing systems as K-automorphisms.
The notion of K-mixing can be characterized using entropy or tail
-algebras. We first define the latter.
Definition 99. Let (X, B, , T ) be a measure preserving system and
let B be a finite partition. The tail -algebra of is

\
T () = B(n ).
n=0

In our usual interpretation, we think about the -algebra B(n ) as


the knowledge of the result of an experiment for time n, n + 1, . . .. In
this manner we think about T () as the knowledge of the result of the
experiment in the distant future.
Theorem 100. Let (X, B, , T ) be a measure preserving system. The
following are equivalent.
58 PETER VARJU

(1) The system is K-mixing.


(2) For each finite partition B, T () is trivial, that is it contains
only sets of measure 0 and 1.
(3) The system is of totally positive entropy, that is, we have h (T, ) >
0 for any finite partition B with H () > 0.
We note that the Kolmogorov 0 1 law states that T () is trivial if
the system is a Bernoulli shift and = {{x : x0 = i} : i = 1, . . . , k}. In
fact, this is true for any finite partitions B, hence Bernoulli shifts
are examples of K-mixing systems.
We begin with two easier implications in the theorem.
Proof of (1) (2). Suppose the system is K-mixing. Fix a finite par-

tition , and let A T (). Then A B(N ) for all N , hence
|(A A) (A)2 | <
for any > 0. Then (A) = (A)2 , hence (A) = 0 or 1, as required.

Proof of (2) (1). Fix A B and fix a finite partition B.
Suppose T () is trivial. We first show that E(A |T ()) = (A)
almost everywhere. If this fails, then
B := {x : E(A |T ()) (A) + }
is of positive measure for some > 0. Since B T (), (B) = 1, and
we have
Z Z
(A) = A d = E(A |T ())d (A) + ,
B B

a contradiction.
Fix > 0. By the martingale convergence theorem, we have

kE(A |T ()) E(A |B(N ))k1 <
provided N is sufficiently large, which we may and will assume.

Let now B B(N ). Then
Z Z

(A B) = A d = E(A |B(N ))d.
B B

Therefore
Z

|(A B) (A)(B)| = (E(A |B(N )) E(A |T ()))d < ,

B

as required. 
The equivalence with item (3) requires some preparation. We also
need the following generalization of the basic properties of entropy that
we have proved for finite partitions. This lemma is included in the last
example sheet.
TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 59

Lemma 101. Let (X, B, , T ) be a MPS. Let , B be finite parti-


tions and let A B be a -algebra. Suppose that there is a sequence
of finite partitions 1 , 2 , . . . B such that A is the smallest -algebra
that contains the atoms of i for all i. We write A for the smallest
-algebra that contains A and the atoms of . Then
I (T 1 |T 1 A)(x) = I (|A)(T x),
H (T 1 |T 1 A) = H (|A),
I ( |A)(x) = I (|A)(x) + I (| A)(x),
H ( |A) = H (|A) + H (| A),
H (| A) H (|A).
We also recall the following result from the last example sheet, which
is a variant of Proposition 96.
Proposition 102. Let (X, B, ) be a probability space and let A1 , A2 , . . .
B be a sequence of -algebras such that Ai Ai+1 for all i. Let
A = A1 A2 . Let B be a finite partition. Then
I (|A) = lim I (|An )
n
1
in L and pointwise -almost everywhere.
The next two lemmata will be used in the proof of the implication
(2) (3).
Lemma 103. Let (X, B, , T ) be a measure preserving system and let
B be a finite partition. Then
h (T, ) = H (|B(1 )).
Proof. In the previous section, we showed that
h (T, ) = lim H (|1n ).
n
Now the claim follows from the convergence of entropy conditioned on
partitions to the entropy conditioned on the -algebra generated by
the partitions, which was proved in Proposition 96. 
Lemma 104. Let (X, B, , T ) be a measure preserving system and let
B be a finite partition. If h (T |) = 0, then H (|T ()) = 0.
Proof. Suppose h (T |) = 0. By the previous lemma, we have
H (|B(1 )) = 0.
Using the chain rule and invariance, we can write
n1
X
H (|B(n )) H (0n1 |B(n )) = H (jj |B(j+1

))
j=0
n1
X
= H (|B(1 )) = 0.
j=0

Now the lemma follows from Proposition 102. 


60 PETER VARJU

Proof of (2) (3). Suppose (2) holds and h (T, ) = 0 for some finite
partition B. We show that H () = 0, which completes the proof.
By the previous lemma, we have H (|T ()) = 0.
By (2), we know that T () is trivial, hence
I (|T ()) = log(E([x] , T ())) = log ([x] ),
and H (|T ()) = H () = 0, as required. 
The proof of the implication (3) (2) requires the following propo-
sition.
Proposition 105. Let (X, B, , T ) be a measure preserving system and
let , B be two finite partitions. Then
h (T, ) = H (|B(1 ) T ()).
We begin with two lemmata.
Lemma 106. Let (X, B, , T ) be a measure preserving system and let
B be a finite partition. Then
1
h (T, ) = H (0n1 |B(n ))
n
for each n.
Proof. By the chain rule and invariance, we have
n1
X
H (0n1 |B(n )) = H (jj |B(j+1

)) = nH (|B(1 )).
j=0

The claim now follows from Lemma 103. 


Lemma 107. Let (X, B, , T ) be a measure preserving system and let
, B be two finite partitions. Then
1
h (, T ) = lim H (0n1 |B(n ) B(n )).
n n

Proof. We note that


H (0n1 |B(n ) B(n )) H (0n1 )
for each n. Therefore,
1
h (, T ) lim sup H (0n1 |B(n ) B(n )).
n n

In order to prove
1
(18) H (0n1 |B(n ) B(n )),
h (, T ) lim inf
n n

we suppose to the contrary that it does not hold.


TOPICS IN ERGODIC THEORY, MICHAELMAS 2016 61

We can write
H (0n1 0n1 |B(n ) B(n )) =H (0n1 |B(n ) B(n ))
+ H (0n1 |B(0 ) B(n ))
H (0n1 0n1 |B(n )) =H (0n1 |B(n )) + H (0n1 |B(0 )).
We note that
H (0n1 |B(0 )) H (0n1 |B(0 ) B(n ))
for each n, and the failure of (18) yields (using the previous lemma)
1 1
lim H (0n1 |B(n )) > lim inf H (0n1 |B(n ) B(n )).
n n n n

Therefore
1
lim inf H (0n1 0n1 |B(n ) B(n ))
n n
1
< lim sup H (0n1 0n1 |B(n ))
n n
1
lim sup H (0n1 0n1 )
n n
=h (T, ).
This contradicts, however, the previous lemma, hence (18) and the
lemma hold. 
Proof of the proposition. Using the chain rule and invariance, we can
write
n1
X
H (0n1 |B(n ) B(n )) = H (jj |B(j+1

) B(n ))
j=0
n1
X
= H (|B(1 ) B(nj

)).
j=0

Therefore, it follows from the previous lemma that


h (, T ) = C-limn H (|B(1 ) B(n )).
By Proposition 102, we have
lim H (|B(1 ) B(n )) = H (|B(1 ) T ()),
n

and the claim follows. 


Proof of (3) (2). Suppose that (3) holds. Suppose to the contrary
that there is a finite partition B such that T () is not trivial, that
is, it contains a set A T () with 0 < (A) < 1.
Let = {A, Ac }, so H () > 0. We prove that
(19) H (|B(1 ) T ()) = 0,
62 PETER VARJU

which implies h (T, ) = 0 by the proposition. This contradicts (3),


finishing the proof of the theorem.
We calculate
I (|B(1 ) T ())(x) = log E([x] |B(1 ) T ())(x).
Since [x] is either A or Ac , which are both contained in T (), we see
that [x] is B(1 ) T ()-measurable, hence
E([x] |B(1 ) T ()) = [x] .
This proves that
I (|B(1 ) T ())(x) = 0
for -almost all x, hence (19) holds.